MZN · Public Reference Atlas · Provisional Map · v1.1 · 2026-05-11

LLM Framework Index
+ Provisional Position Map

A phase-safe index for the MZN LLM Framework: a public reference atlas of major capability areas behind a modern LLM company — mapped to 21 slots and 529 sub-endpoints — with MZN’s provisional self-positioning for Phase 3 technical review.

Scope boundary: this page is a public technical orientation layer, not final validation, not a valuation document, and not the full HUAI / asset inventory. MZN’s position labels are provisional self-positioning based on public, restricted, or reserved evidence layers and should be tested by qualified Phase 3 reviewers.
21 · Capability slots
529 · Sub-endpoints mapped
9 · Frontier deep-essays
7 / 13 / 1 · Strong Evidence / Partial / Gap
Mode: Public reference baseline · Disclosure: Framework + provisional position only · MZN proprietary content: selected layers reserved
Provenance & Context Signals

About this public reference index.

A reference atlas authored by Mohammad Rahimi during the bounded Phase 2 solo formation period. The credentials, context signals, and provenance below give context for what follows.

Author
Mohammad Rahimi
Founder, CEO & Chief Architect · MZN Company
Build period
Bounded Phase 2 window · 2025 – early 2026
Web Summit ALPHA · context
Slush 100 · context
WSA Nominee · context
EUIPO guidance · context
Priority / patent-grade candidate record · 2026-03-22
Context-signal boundary: Crunchbase signal, dated May 22, 2026: #2 in People across all categories; #1 outside the United States; #1 in Machine Learning and Cyber Security filters. Rankings may change over time and are not official endorsement, technical validation, valuation, or IP validation. Festival and institutional signals are reasons to review, not validation of this LLM framework index or MZN’s position map.

Beyond its content, the provenance of this document is itself unusual. Industry reference atlases of comparable depth — capability decomposition, sub-endpoint mapping, and frontier-position analysis at 21-slot and 529-endpoint granularity — are typically produced by analyst teams over multi-year engagements at research firms or industry consortiums.

This atlas was authored by Mohammad Rahimi during the bounded Phase 2 solo formation period, without a human team, agency, contractor stack, API stack, or agent workforce. It remains subject to independent technical review.

To the author’s knowledge, comparable public solo-produced LLM-company anatomy maps at this depth are uncommon. This is a benchmarkable claim and should be independently compared against analyst-team, research-team, and technical-consulting outputs.

The atlas that follows is therefore intended as two artifacts in one: a demonstration of the author's depth of LLM-industry knowledge, and a documented single-person production-capacity claim, prepared for review, in a category of work that has historically required institutional resources. Should the broader question be raised of whether the One-Person Unicorn pattern (single-individual operation reaching the productive capacity historically associated with venture-backed teams) has arrived in frontier AI, this document is part of the evidence to consider.

Preface

Why this document exists

Many architectural choices and capability claims in the broader portfolio — across commerce, hardware, security, alignment, and foundational research — are informed by a broad picture of the capability areas behind a serious LLM company. This document is a public reference version of that picture. Named assets, valuations, and detailed inventory are documented separately in the strategic asset portfolio.

Pillar 01
Foundation, not naive optimism
The portfolio's capability claims are positioned against the full 21-slot industry anatomy below. Each level (Strong Evidence / Partial / Gap) is determined by observable evidence at that slot — running artifacts, signed manifests, patent-grade candidates, production telemetry — not by aspirational framing.
Pillar 02
~40% of the portfolio is reserved
This atlas demonstrates the framework knowledge and the positioning. Specific implementations, internal mechanics, named architectures, and detailed protocol disclosures are held in the proprietary asset portfolio or restricted review layers; they are not fully disclosed here and are reserved for the partnership stage under appropriate confidentiality.
Pillar 03
The 13-section series stands on this
The public "LLM Complement" 13-section series (mzncompany.com/llm-complement) makes architectural claims that may at first read seem ambitious. This atlas is the substrate they rest on. When the series argues that data quality, user-context patterns, cost trajectory, and defensibility horizon together require an architecture beyond foundation-model-alone — that claim is informed by every endpoint below.

The 21-slot framework is a synthesized reference map based on publicly visible frontier-AI practice, academic literature, open-source infrastructure, and observed industry patterns through 2026 — five pre-training slots, three post-training and alignment slots, four evaluation and safety slots, four inference and production slots, and four cross-cutting slots. Each is a major capability area; absence or weakness in any slot should be treated as a diligence question for a serious LLM operation.

For each slot the document gives: definition, current state of the art, key decisions, trade-offs, numbers and ablations from the published literature, open questions, an analyst-level frontier position, examples, academic references, and the full sub-endpoint anatomy (top-level 21 → mid-level → deeper detail), with a 529-item map total. Then, prominently at the head of each slot, MZN's provisional position: Strong, Partial, or Gap, with concise reasoning.

All company-, lab-, and product-specific identifiers in the body of this document have been generalized to categorical references (frontier lab, leading open-weights model, current-generation accelerator, etc.). Academic paper authors, technique names, benchmark names, and dataset names are preserved as standard scientific references. No proprietary MZN content — code, hashes, internal codenames, partner identities, or pipeline detail — appears in this document.

The Framework

The 21-slot framework

Five groups. Twenty-one slots. 529 sub-endpoints below. Each slot is a major prerequisite at frontier scale — A blocks B blocks C blocks D, while E is cross-cutting.

MZN Provisional Position Snapshot

Where MZN stands, slot by slot

Of the 21 major capability slots, the portfolio operates 7 at Strong Evidence, 13 at Partial, and 1 at Gap. The criteria below are explicit; each slot earns one status based on observable evidence rather than a 1-to-10 score.

7 Strong Evidence · 13 Partial · 1 Gap
Strong Evidence · Documented and/or worked evidence — running code, signed manifests, patent-grade candidates, provenance records, production telemetry, or worked cases
Partial · Architecture-documented or Phase-1-rooted, pending context — architecture and reference inventory exist; no running artifacts at frontier scale
Gap · Acknowledged and not yet addressed — named in inventory; no documentation or architecture sketch
Slot | Title | Level | Position summary
A1 | Data | STRONG EVIDENCE | Phase 1 consent-first product data evidence; LLM-readiness pending review
A2 | Tokenizer | STRONG EVIDENCE | Multilingual tokenizer expertise; under-served-script focus
A3 | Architecture | PARTIAL | Patent-grade candidate architectural innovations; implementation validation pending
A4 | Training | PARTIAL | Training methodology documented; frontier-scale execution pending
A5 | Compute | GAP | No cluster under solo operation
B1 | SFT | PARTIAL | Demonstration-data shaping methodology documented
B2 | Preference Optimization | PARTIAL | Output-conformance methodology informs preference design
B3 | Constitutional Methods | PARTIAL | Principle-based alignment substrate (theoretical layer)
C1 | Capability Evaluation | PARTIAL | Phase 1 product telemetry and user-behavior evaluation context
C2 | Safety Evaluation | STRONG EVIDENCE | Documented safety architecture; red-team validation pending
C3 | Robustness | PARTIAL | Security-driven robustness research
C4 | Output Safety | STRONG EVIDENCE | Output-conformance safety templates and egress controls
D1 | Serving | PARTIAL | Phase 1 application/platform serving experience across Mazzaneh modules
D2 | Inference Optimization | STRONG EVIDENCE | Patent-grade candidate inference frameworks; benchmark validation pending
D3 | Monitoring | STRONG EVIDENCE | Monitoring architecture / GPU Sentinel route; implementation or pilot validation pending
D4 | Deployment | PARTIAL | Phase 1 application/platform deployment experience across Mazzaneh modules
E1 | Data Governance | PARTIAL | Consent-first data governance by design
E2 | Security | STRONG EVIDENCE | Multi-tier security architecture documented; adversarial validation pending
E3 | Privacy | PARTIAL | Consent-first privacy posture; Phase 3 privacy/compliance review required
E4 | Compliance | PARTIAL | EUIPO guidance · context + separate patent filing
The Atlas

Slot-by-slot reference anatomy

Each slot below opens with MZN's provisional position, then a reference industry view (definition, state of the art, decisions, trade-offs, numbers, open questions, frontier analyst position, examples, references), and concludes with that slot's share of the 529-item sub-endpoint anatomy.

A1

Data

66 sub-endpoints mapped
MZN Provisional Position · Strong Evidence
Phase 1 consent-first product data evidence; LLM-readiness pending review
Phase 1 Mazzaneh operated a multi-module product platform with consent-first behavioral and product signals across 168K+ users. These signals may be relevant to future data strategy, but LLM training-readiness requires Phase 3 legal, consent, privacy, data-governance, and technical review. Multilingual context includes Persian and Arabic-script depth.
Phase context: A1 draws partly on Phase 1 Mazzaneh product-data evidence. It should not be read as proof of a frontier-scale LLM training corpus without Phase 3 consent, privacy, governance, and technical review.
Definition

The pre-training corpus is the model's universe of evidence: every text, code file, image-caption pair, and audio transcript that establishes what the model considers possible. Corpus engineering — acquisition, extraction, filtering, deduplication, mixing — sets the absolute capability ceiling. Post-training can refine and align, but cannot exceed what is latent in the data.

State of the Art (2025–2026)

Frontier models (2025-2026) train on 10-30 trillion text tokens plus billions of multimodal pairs. A leading open-weights flagship model used 15T tokens. An open-weights frontier model (V3 class) used 14.8T tokens. A current-generation frontier model is estimated to be of a similar order. Frontier shift: from raw web scale to curated quality (FineWeb-Edu, DCLM-baseline). Multimodality is natively integrated from pre-training onward (a multimodal frontier model, a frontier multimodal model, a long-context frontier model).

Key Decisions
  • Corpus size (tokens)
  • Source mixture (web/code/math/books/multimodal)
  • Quality vs. quantity trade-off
  • Multilingual ratio
  • Recency vs. archival
  • Deduplication aggressiveness
  • License/safety filtering severity
Trade-offs
  • More tokens → diminishing returns past Chinchilla-optimal
  • Higher quality filtering → smaller corpus but better downstream performance
  • Multilingual breadth → English depth slightly lower per token
  • Web-heavy → broader knowledge but lower factual accuracy
Numbers & Ablations
  • Chinchilla-optimal: ~20 tokens per parameter (Hoffmann 2022). A leading open-weights flagship model used 37 tokens/param — substantially over-trained, deliberately for inference cost.
  • Quality filtering: FineWeb-Edu retains ~3% of CC after model-based filtering, matches 5× larger raw corpora.
  • Multilingual cost: each non-English language added consumes ~2-5% of effective English capacity at fixed parameter count (Conneau et al., 2019; Pfeiffer et al., 2022).
  • Synthetic data ceiling: a synthetic-heavy small frontier model demonstrated ~70B-class performance at 3.8B with synthetic-heavy training; ratio collapses past ~50% synthetic mix (model collapse, Shumailov et al. 2024).
  • Code corpus contribution to general reasoning: ~3-7% MMLU gain attributable to code in pre-training (Aryabumi et al., 2024).
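The tokens-per-parameter arithmetic in the first bullet can be checked in a few lines. This is a minimal sketch that assumes the 15T-token corpus quoted in the State of the Art paragraph and the 37 tokens/param ratio quoted above; the implied parameter count is derived from those two figures and is not stated in the source. Function names are hypothetical.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # ~20 tokens per parameter (Hoffmann 2022), as quoted above


def tokens_per_param(tokens: float, params: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


def overtraining_factor(tokens: float, params: float) -> float:
    """How far a run sits past the Chinchilla-optimal ratio (1.0 = optimal)."""
    return tokens_per_param(tokens, params) / CHINCHILLA_TOKENS_PER_PARAM


corpus_tokens = 15e12   # 15T tokens, from the text
quoted_ratio = 37       # tokens per parameter, from the text

# Derived, not sourced: the parameter count implied by the two quoted figures.
implied_params = corpus_tokens / quoted_ratio
factor = overtraining_factor(corpus_tokens, implied_params)

print(f"implied parameters: {implied_params:.3g}")
print(f"over-training factor vs. Chinchilla-optimal: {factor:.2f}x")
```

Under these figures the run sits a little under twice the Chinchilla-optimal ratio, which is the sense in which the bullet calls it over-trained.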
Open Questions
  • What is the actual scaling law for synthetic data quality vs. quantity? A synthetic-heavy small frontier model provided an existence proof but not an optimal mix.
  • Is there a multilingual scaling law analogous to Chinchilla? Adding 100 languages vs. 10 with same compute — no published frontier-scale result.
  • How much of frontier model capability comes from data quality vs. quantity vs. mix curriculum? A leading open-weights model paper hints that curriculum matters; no isolated study at scale.
  • Copyright-clean training: can a frontier model be trained on only permissively-licensed data without significant capability loss? No public attempt at frontier scale.

Reference analyst note. Quality > quantity is now consensus, but the field has overcorrected — most labs underweight diversity in chase of curated quality. The 'textbook quality' direction (a synthetic-heavy small frontier model) is a local optimum, not a global one. Frontier 2026-2027 will rebalance toward curated-but-diverse, with synthetic data filling specific holes (math reasoning, agent traces) not as bulk replacement.

Examples

A leading open-weights model: 15T tokens, 5% non-English, code 17% · an open-weights frontier model (V3 class): 14.8T tokens, multilingual focus · FineWeb-Edu: 1.3T high-quality educational tokens (open) · RedPajama-V2: 30T tokens (open) · DCLM-baseline: 4T tokens with model-based filtering

References (Academic)

Hoffmann et al., Chinchilla (2022) · Penedo et al., FineWeb (2024) · Li et al., DataComp-LM (2024) · Soldaini et al., Dolma (2024)

Sub-endpoint anatomy — 66 items mapped
A1.1 Source Registry
Web crawl is the dominant corpus source. Common Crawl provides ~250B web pages across 100+ snapshots since 2008. Raw CC is heavily redundant, multilingual-imbalanced, and contains substantial low-quality content. Modern pipelines extract WARC files, run language identification, perform URL/document/paragraph-level deduplication, and apply quality classifiers. SOTA: FineWeb-Edu (an open-model hub, 2024) demonstrated quality classification using a leading open-weights model (70B class) as labeler, then distilled into a small classifier; its 1.3T retained tokens match the performance of larger raw corpora. DCLM-baseline (a confidential-computing frontier lab/UW, 2024) used a similar approach, with model-based filtering producing 4T high-quality tokens. e.g. Common Crawl: 250B+ pages raw · RefinedWeb (an open-weights model team, 2023): 600B tokens deduplicated · C4 (a multimodal frontier lab, 2019): ~750 GB of cleaned text, encoder-decoder-model era
A1.1.1 Web crawl sources
Common Crawl snapshots and derivatives. The bulk of pre-training corpora. Industry standard: Multiple Common Crawl snapshots, deduplicated and filtered. A leading open-weights model used 15T tokens primarily from CC.
+ deeper detail (3 leaves)
  • A1.1.1.1 Common Crawl snapshot selection Which CC dumps to include — recent, historical, or both. Industry standard: Multiple snapshots (e.g. 95+ in FineWeb), spanning years to capture historical text and reduce recency bias.
  • A1.1.1.2 WARC vs WET extraction WARC contains raw HTML; WET contains pre-extracted text. Extraction-from-WARC yields better text but costs 10-100x compute. Industry standard: Frontier labs increasingly re-extract from WARC. RefinedWeb and Dolma use trafilatura/resiliparse on WARC.
  • A1.1.1.3 Crawl coverage gaps Languages, domains, and content types under-represented in CC. Industry standard: Supplement CC with targeted crawls for low-resource languages, code (GitHub), academic (ArXiv, PubMed), books.
A1.1.2 Curated text corpora
High-quality non-web sources: Wikipedia, books, academic papers, news. Industry standard: a 2023-generation open-weights model and its successor disclose Wikipedia, ArXiv, and books in their mixtures; a leading frontier lab and a constitutional-methods frontier lab are opaque on specific corpora.
+ deeper detail (3 leaves)
  • A1.1.2.1 Wikipedia corpus Wikipedia dumps in multiple languages. Industry standard: Multiple language editions; extracted via WikiExtractor or mwparserfromhell to plain text.
  • A1.1.2.2 Book corpora Books3 (deprecated), Project Gutenberg, licensed publisher feeds, scanned books. Industry standard: Mixed sourcing. Some labs license; some use Books3-derivatives (legally contested).
  • A1.1.2.3 Academic corpora ArXiv, PubMed, S2ORC, ACL Anthology, etc. Industry standard: Heavy use; ArXiv especially common. Math/code-heavy LaTeX requires special pre-processing.
A1.1.3 Code corpora
GitHub, StackOverflow, code-specific datasets like The Stack. Industry standard: The Stack (BigCode, 2023) is the public reference. Frontier labs use proprietary code crawls.
+ deeper detail (2 leaves)
  • A1.1.3.1 License-aware code filtering Excluding code under restrictive licenses (GPL, AGPL) from training. Industry standard: The Stack v1.2 includes only permissive licenses (MIT, Apache, BSD); exclusion of restrictive licenses standard at frontier labs.
  • A1.1.3.2 Repository quality signals Star count, fork count, file size, language detection accuracy. Industry standard: Filter by star count threshold, exclude minified/auto-generated code, detect language by linguist library.
A1.1.4 License posture catalog
License classification per source: public domain, permissive (CC-BY, MIT), restrictive (CC-BY-NC, GPL), proprietary (licensed), unclear. Industry standard: Maintain explicit license registry. Frontier labs increasingly licensed-only or licensed-preferred to reduce litigation surface.
A1.1.5 Source provenance hashing
SHA-256 (or similar) of every source artifact to enable later provenance queries. Industry standard: Standard practice at frontier labs; less common at smaller labs.
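A minimal sketch of source provenance hashing, assuming artifacts are readable as byte streams. The registry entry shape and the `source_id` value are hypothetical; only the use of SHA-256 comes from the text.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in fixed-size chunks so large artifacts never sit fully in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def registry_entry(source_id: str, stream) -> dict:
    """Hypothetical registry record: an identifier plus the artifact's content hash."""
    return {"source_id": source_id, "sha256": sha256_of_stream(stream)}


# Usage with an in-memory stand-in for a real source file.
artifact = io.BytesIO(b"example source artifact bytes")
entry = registry_entry("hypothetical/source-001", artifact)
print(entry)
```

Storing the digest next to the source identifier is what makes a later query such as "has this exact artifact been ingested before?" a simple lookup.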
A1.2 Cleaning Pipeline
Code corpus provides programming language coverage essential for code generation, agent tools, and reasoning capability. Code data also improves general reasoning (correlations observed in multiple model families). Sources include GitHub public repos, Stack Exchange, programming Q&A, technical documentation. SOTA: The Stack v2 (BigCode, 2024): 3T+ tokens permissively-licensed, 600+ languages. License filtering excludes GPL/restrictive. Repository-level context (not file-level chunks) increasingly used for long-context code understanding. StarCoder 2 trained on this corpus. e.g. The Stack v2 (BigCode, 2024) · GitHub raw (petabyte-scale) · RedPajama Code: 50B tokens
A1.2.1 Boilerplate removal
Stripping navigation menus, footers, ads, cookie banners, repeated page templates. Industry standard: Trafilatura, jusText, or custom rule-based extraction. Required for WARC-based pipelines.
+ deeper detail (2 leaves)
  • A1.2.1.1 Boilerplate detection method Algorithm choice: rule-based, ML-based, or hybrid. Industry standard: Hybrid: HTML-rule-based extraction (trafilatura) + density-based heuristics.
  • A1.2.1.2 Cross-template repetition handling Same boilerplate appearing across millions of pages (e.g. WordPress footer). Industry standard: Detected via document-level n-gram repetition; pages with high boilerplate ratio dropped.
A1.2.2 Encoding normalization
Converting all text to UTF-8, fixing Mojibake, handling BOMs. Industry standard: Standard normalization to UTF-8; ftfy library for Mojibake repair.
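A minimal standard-library sketch of the normalization step, assuming the declared encoding of each document is known. It strips a UTF-8 BOM and applies Unicode NFC (the choice of NFC is an assumption, not stated in the text). It does not repair Mojibake; that is the job the text assigns to ftfy.

```python
import unicodedata


def normalize_to_utf8(raw: bytes, declared_encoding: str = "utf-8") -> bytes:
    """Decode from the declared encoding, drop a UTF-8 BOM, apply NFC, re-encode as UTF-8."""
    name = declared_encoding.lower().replace("_", "-")
    # "utf-8-sig" decodes UTF-8 and silently removes a leading BOM if present.
    codec = "utf-8-sig" if name in ("utf-8", "utf8") else declared_encoding
    text = raw.decode(codec, errors="replace")
    text = unicodedata.normalize("NFC", text)  # assumption: NFC as the canonical form
    return text.encode("utf-8")


with_bom = b"\xef\xbb\xbfcaf\xc3\xa9"          # UTF-8 with BOM
latin1 = b"caf\xe9"                            # same word in Latin-1
decomposed = "cafe\u0301".encode("utf-8")      # 'e' followed by a combining accent

for raw, enc in [(with_bom, "utf-8"), (latin1, "latin-1"), (decomposed, "utf-8")]:
    print(normalize_to_utf8(raw, enc))
```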
A1.2.3 Language identification
Per-document language tagging. Industry standard: fastText language ID (Joulin et al.) is the dominant choice; cld3 also used.
+ deeper detail (2 leaves)
  • A1.2.3.1 Confidence threshold for language ID Probability cutoff below which a document is rejected or flagged as mixed-language. Industry standard: Typically 0.65 for fastText. Higher for low-resource languages to reduce false positives.
  • A1.2.3.2 Multi-language documents Documents containing significant amounts of two or more languages. Industry standard: Either split by paragraph or assign primary language. No consensus on best handling.
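The confidence-threshold rule above can be sketched independently of any particular classifier. The function below takes `(language, probability)` pairs as a stand-in for classifier output; the 0.65 default comes from the text, while the stricter per-language value is a hypothetical illustration of "higher for low-resource languages".

```python
from typing import List, Optional, Tuple

DEFAULT_THRESHOLD = 0.65                 # typical fastText cutoff quoted above
PER_LANGUAGE_THRESHOLD = {"fa": 0.80}    # hypothetical stricter cutoff for one language


def assign_language(predictions: List[Tuple[str, float]]) -> Optional[str]:
    """Return the top language if it clears its threshold, else None (reject or flag as mixed)."""
    if not predictions:
        return None
    lang, prob = max(predictions, key=lambda p: p[1])
    threshold = PER_LANGUAGE_THRESHOLD.get(lang, DEFAULT_THRESHOLD)
    return lang if prob >= threshold else None


print(assign_language([("en", 0.90), ("de", 0.05)]))
print(assign_language([("en", 0.50), ("fr", 0.45)]))
print(assign_language([("fa", 0.70)]))
```

Documents that come back as `None` are the ones a pipeline would either drop or route to the multi-language handling described in A1.2.3.2.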
A1.2.4 NSFW / unsafe content filtering
Pre-training removal of explicit content, gore, harmful material. Industry standard: URL blocklists + keyword filters + classifier-based (DSIR-style). A 2023-generation open-weights model paper documents this approach.
A1.2.5 Length filtering
Removing documents below minimum or above maximum length. Industry standard: Common thresholds: drop documents <50 tokens or <200 characters; very long documents are capped at the A4 packing stage.
A1.3 Deduplication
Mathematical and scientific corpus provides reasoning depth. Sources: arXiv (~2M papers), PubMed, scientific textbooks, Math StackExchange, OpenWebMath. Math performance benchmarks (GSM8K, MATH, AIME) correlate strongly with pre-training math token volume and quality. SOTA: OpenWebMath (Paster 2023): 14.7B tokens of high-quality math web. DeepSeekMath corpus: 120B tokens. Synthetic math: procedurally generated problems with verified solutions. Math performance correlates strongly with pre-train math token volume — 100B+ tokens common at frontier. e.g. arXiv: 2M papers, ~50B tokens full-text · OpenWebMath: 14.7B tokens · DeepSeekMath corpus: 120B
A1.3.1 Exact-match deduplication
Hash-based detection of identical documents. Industry standard: URL-level + SHA-256-level + line-level.
+ deeper detail (3 leaves)
  • A1.3.1.1 URL-level dedup Removing duplicate URLs across crawl snapshots. Industry standard: First pass; trivially cheap.
  • A1.3.1.2 Document-hash dedup SHA-256 of normalized document text. Industry standard: Standard. Catches identical content under different URLs.
  • A1.3.1.3 Line-level dedup Removing globally repeated lines across the corpus. Industry standard: Used selectively; aggressive line dedup damages legitimate quoted text.
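The first two passes above (URL-level, then document-hash) can be sketched as a single streaming filter. The normalization used before hashing (whitespace collapse plus lowercasing) is an assumption; the text only says the document text is normalized.

```python
import hashlib
import re
from typing import Iterable, Iterator, Tuple


def normalize(text: str) -> str:
    """Assumed normalization: collapse whitespace and lowercase before hashing."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedup_exact(docs: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield (url, text) pairs, skipping repeated URLs and repeated normalized content."""
    seen_urls = set()
    seen_hashes = set()
    for url, text in docs:
        if url in seen_urls:          # pass 1: URL-level
            continue
        seen_urls.add(url)
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest in seen_hashes:     # pass 2: document-hash level
            continue
        seen_hashes.add(digest)
        yield url, text


docs = [
    ("https://a.example/1", "Hello   world"),
    ("https://a.example/1", "Different text, same URL"),
    ("https://b.example/2", "hello world"),
    ("https://c.example/3", "Something new"),
]
kept = list(dedup_exact(docs))
print([url for url, _ in kept])
```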
A1.3.2 Near-duplicate detection (MinHash + LSH)
Probabilistic detection of documents with high Jaccard similarity. Industry standard: MinHash signatures + LSH banding. Standard parameters: 100-200 hashes, Jaccard threshold 0.8-0.85.
+ deeper detail (4 leaves)
  • A1.3.2.1 Shingling (n-gram) parameter Token n-gram size used to construct MinHash input set. Industry standard: 5-gram shingles common; some pipelines use 7-gram or word-13-grams.
  • A1.3.2.2 MinHash signature length Number of hash functions used to build the MinHash signature. Industry standard: 100-200 hashes. Tradeoff: more hashes = higher precision, more compute.
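  • A1.3.2.3 LSH banding Locality-sensitive hashing parameters: number of bands × rows-per-band. Industry standard: Tuned to target Jaccard threshold. e.g. 20 bands × 9 rows (180 hashes) flags pairs at Jaccard 0.8 with ~94% probability; the usual threshold approximation (1/bands)^(1/rows) is ≈ 0.72.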
  • A1.3.2.4 Jaccard threshold Minimum Jaccard similarity for two documents to be considered near-duplicates. Industry standard: 0.8 (Lee et al. 2022 reference); some pipelines use 0.85 or 0.7 depending on tolerance.
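The four parameters above fit together as follows. This is a small pure-Python sketch, assuming whitespace tokens, 5-gram shingles, and the 20 × 9 banding quoted above; the seeded BLAKE2b hash family is an illustrative choice, not one named in the text.

```python
import hashlib
from typing import List, Set, Tuple

NUM_BANDS = 20
ROWS_PER_BAND = 9
NUM_HASHES = NUM_BANDS * ROWS_PER_BAND  # 180 hash functions


def shingles(text: str, n: int = 5) -> Set[str]:
    """Word n-gram shingles; short texts fall back to a single shingle."""
    tokens = text.split()
    if len(tokens) < n:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def _hash(seed: int, item: str) -> int:
    # Assumption: one seeded 64-bit BLAKE2b digest per hash function.
    data = seed.to_bytes(4, "big") + item.encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items: Set[str], num_hashes: int = NUM_HASHES) -> List[int]:
    """For each hash function, keep the minimum hash value over the shingle set."""
    if not items:
        raise ValueError("cannot build a signature for an empty shingle set")
    return [min(_hash(seed, it) for it in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching signature positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def band_keys(sig: List[int], bands: int = NUM_BANDS,
              rows: int = ROWS_PER_BAND) -> List[Tuple[int, Tuple[int, ...]]]:
    """Two documents become candidates if they share at least one (band, rows) key."""
    return [(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(bands)]


def candidate_probability(s: float, bands: int = NUM_BANDS,
                          rows: int = ROWS_PER_BAND) -> float:
    """LSH S-curve: probability that a pair with Jaccard s shares at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands


doc = shingles("the quick brown fox jumps over the lazy dog near the river bank today")
sig = minhash_signature(doc)
print(len(sig), estimated_jaccard(sig, sig))
for s in (0.5, 0.7, 0.8, 0.9):
    print(s, round(candidate_probability(s), 3))
```

The S-curve printout shows why banding is tuned rather than fixed: pairs well below the target similarity rarely collide in any band, while pairs at or above it almost always do.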
A1.3.3 Semantic deduplication
Embedding-based detection of semantically duplicate content. Industry standard: Emerging; SemDeDup (Abbas 2023) is the public reference. Frontier labs may use proprietary methods.
+ deeper detail (2 leaves)
  • A1.3.3.1 Embedding model choice Which encoder (E5, BGE, GTE, or a leading frontier lab's ada embedding model) generates the document embeddings. Industry standard: Open-source encoders (E5, BGE) for reproducibility; proprietary at large labs.
  • A1.3.3.2 Cosine similarity threshold Cutoff for considering two embeddings as semantic duplicates. Industry standard: 0.95+ for near-duplicate semantic level; below this, content variation expected.
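The cosine-threshold rule can be illustrated with a naive greedy pass over precomputed embeddings. This is a sketch only: the embeddings are hypothetical toy vectors, and the quadratic loop is not the SemDeDup procedure, which the text cites as the public reference.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; zero vectors are treated as similarity 0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_dedup(embeddings: List[Sequence[float]], threshold: float = 0.95) -> List[int]:
    """Keep a document only if it is below the threshold against every document already kept."""
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


toy_embeddings = [
    [1.0, 0.0],    # document 0
    [0.99, 0.05],  # near-paraphrase of document 0
    [0.0, 1.0],    # unrelated document
]
print(semantic_dedup(toy_embeddings))
```

At corpus scale the pairwise comparison would be replaced by clustering or an approximate nearest-neighbour index; the threshold logic stays the same.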
A1.3.4 Cross-corpus deduplication
Dedup across heterogeneous sources (web vs books vs academic). Industry standard: Run dedup globally after corpus assembly, not per-source. Otherwise cross-source duplicates survive.
A1.4 Quality Filtering
Multimodal corpus pairs text with images, video, audio, and structured data. Native multimodal pre-training (vs. late fusion) enables cross-modal reasoning, image-to-text grounding, and video understanding. The shift from CLIP-style late fusion (separate encoders) to native multimodal (interleaved tokens) characterizes 2024+ frontier models. SOTA: Native multimodal pre-training (vs late fusion) became dominant 2024+. A frontier multimodal model, a million-token-context frontier model and its 2.0 successor, and a long-context frontier model are natively interleaved. LAION-5B (5.8B image-text pairs) is the open backbone. Image resolution is dynamic, up to 1024² for detail. Video: 1B+ video-text pairs at frontier. e.g. LAION-5B (5.8B image-text) · DataComp 12.8B (filtered) · WebVid 10M video-text
A1.4.1 Heuristic filters
Rule-based filters: line length, punctuation density, repetition rate, word distribution.
+ deeper detail (4 leaves)
  • A1.4.1.1 Line-length distribution Mean line length, max line length, lines per document. Industry standard: Drop documents with unusual line distributions (very long lines = likely scraped tables; very short = navigation).
  • A1.4.1.2 Repetition detection Repeated lines, repeated paragraphs, repeated n-grams within a document. Industry standard: Drop documents where >X% of lines repeat. RefinedWeb uses thresholds in 0.2-0.3 range.
  • A1.4.1.3 Symbol-to-text ratio Ratio of non-alphanumeric characters to total characters. Industry standard: High symbol ratio → likely code, table, or noise. Filter or route to code-specific path.
  • A1.4.1.4 Stopword presence Documents lacking common stopwords are likely lists, tables, or non-natural text. Industry standard: Require minimum stopword density (typically >2%) for general-text classification.
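Three of the heuristics above can be sketched as plain functions over a document string. The repetition ceiling (0.3) and stopword floor (2%) follow the figures quoted above; the symbol-ratio ceiling and the tiny stopword list are hypothetical placeholders.

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}  # illustrative only


def repeated_line_fraction(text: str) -> float:
    """Share of non-empty lines whose exact content occurs more than once."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    return sum(c for c in counts.values() if c > 1) / len(lines)


def symbol_ratio(text: str) -> float:
    """Share of non-whitespace characters that are not alphanumeric."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(not c.isalnum() for c in chars) / len(chars)


def stopword_density(text: str) -> float:
    """Share of words that appear in the stopword list."""
    words = [w.strip(".,;:!?\"'()") for w in text.lower().split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in STOPWORDS for w in words) / len(words)


def passes_heuristics(text: str, max_repeat: float = 0.3,
                      max_symbol: float = 0.3, min_stopword: float = 0.02) -> bool:
    return (repeated_line_fraction(text) <= max_repeat
            and symbol_ratio(text) <= max_symbol
            and stopword_density(text) > min_stopword)


prose = "The cat sat on the mat and it is happy.\nA dog ran to the park in the morning."
navigation = "Home\nHome\nHome\nAbout"
noise = "$$$ ### @@@ !!!"
print(passes_heuristics(prose), passes_heuristics(navigation), passes_heuristics(noise))
```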
A1.4.2 Perplexity filtering
Use a smaller reference language model to score documents; drop high-perplexity (likely noise) and very-low-perplexity (likely repetition). Industry standard: Used in CCNet (Wenzek 2020), some leading open-weights model-family pipelines. KenLM 5-gram on Wikipedia common as reference.
A1.4.3 Classifier-based filtering
Train a binary quality classifier on a curated 'good' reference set; score every document. Industry standard: Standard at frontier labs. FineWeb-Edu (Penedo 2024) and DataComp-LM (Li 2024) are the public references.
+ deeper detail (2 leaves)
  • A1.4.3.1 Reference set construction What counts as 'good' for training the classifier. Industry standard: Typically Wikipedia, books, academic papers, or LLM-judged 'educational' web pages (FineWeb-Edu).
  • A1.4.3.2 Classifier architecture fastText, encoder-only-model-based, or LLM-as-judge. Industry standard: fastText for scale (RefinedWeb); a small encoder-only model or distilled models for higher quality.
A1.4.4 Pruning by influence
Removing documents that hurt downstream loss (Marion 2023, DSIR). Industry standard: Emerging research direction. Not yet standard frontier practice.
A1.5 Domain Mixing & Weighting
Deduplication removes near-duplicate content that would otherwise dominate training and reduce effective coverage. Three levels: URL-level (exact), document-level (MinHash/SimHash), and substring-level (suffix array). Aggressive dedup typically removes 30-70% of raw corpus and improves downstream performance. SOTA: Standard pipeline: URL dedup → MinHash near-dup (Jaccard 0.8) → optional substring dedup. Typical removal 30-70% of raw corpus. SemDeDup (Abbas 2023) uses embedding similarity for semantic dedup, removing 50% of LAION with no perf loss. Frontier: also benchmark contamination filtering. e.g. A leading open-weights model: MinHash + URL dedup · an earlier frontier model: aggressive dedup described in paper · FineWeb: full pipeline including SemDeDup variants
A1.5.1 Fixed mixture
Proportions chosen by data team based on intuition and small-scale experiments. Industry standard: the mixture of a 2023-generation open-weights model (web 67%, code 4.5%, etc.) is the publicly documented reference. Adjustments per generation.
A1.5.2 Learned mixture (DoReMi)
Use a small proxy model to optimize mixture weights against downstream loss. Industry standard: DoReMi (Xie et al. 2023) shows 2.6× faster pre-training. Adopted increasingly at frontier labs.
A1.5.3 Curriculum (data ordering)
Order in which data is presented during training. Curriculum learning vs random shuffling. Industry standard: Random shuffling dominant at scale. Curriculum used in some final-phase fine-tuning.
A1.5.4 Up-sampling rare domains
Repeating low-frequency-but-important domains (e.g. math, code, low-resource languages). Industry standard: Standard. Up-sample rates of 2-4× common for math, code, multilingual.
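Up-sampling amounts to multiplying a domain's token count by a repeat factor before normalizing the mixture. The sketch below uses hypothetical token counts; the 2× and 4× factors sit inside the 2-4× range quoted above.

```python
from typing import Dict


def mixture_weights(token_counts: Dict[str, float],
                    upsample: Dict[str, float]) -> Dict[str, float]:
    """Sampling probability per domain after applying per-domain repeat factors."""
    effective = {d: n * upsample.get(d, 1.0) for d, n in token_counts.items()}
    total = sum(effective.values())
    return {d: v / total for d, v in effective.items()}


# Hypothetical raw token counts per domain.
counts = {"web": 900e9, "code": 60e9, "math": 40e9}
weights = mixture_weights(counts, {"code": 2.0, "math": 4.0})

for domain, w in weights.items():
    print(f"{domain}: {w:.3f}")
```

With these inputs math rises from 4% of raw tokens to roughly 13.6% of sampled tokens, at the cost of each math token being seen about four times.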
A1.6 Contamination Control
Quality filtering separates valuable content from noise. Approaches: heuristic (perplexity, length, repetition, language confidence), classifier-based (small model predicts 'high-quality'), or model-as-judge (large model labels samples for distillation). Quality filtering typically retains 10-30% of raw web content. SOTA: Model-based filtering dominates 2024+. FineWeb-Edu used a leading open-weights model (70B class) → distilled small classifier. DCLM-baseline similar. Heuristic filters (perplexity, length) coarse but fast. Trade-off: classifier inference at corpus scale costs $100K-$1M, small relative to pre-training. e.g. FineWeb-Edu: a leading open-weights model (70B class) labels → distilled · DCLM-baseline: classifier-filtered · C4: heuristic only
A1.6.1 N-gram overlap detection
Match against benchmark text using n-gram overlap. Industry standard: a 2023-generation open-weights model used 8-gram, a leading open-weights model used 13-gram. Token-level (after tokenizer) more common than character-level.
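A minimal sketch of n-gram overlap detection against a benchmark set. Whitespace splitting stands in for the model tokenizer here, which is an assumption; the text notes that token-level matching after the real tokenizer is more common.

```python
from typing import Iterable, Set, Tuple


def ngrams(tokens, n: int) -> Set[Tuple[str, ...]]:
    """All contiguous n-token windows; empty if the sequence is shorter than n."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts: Iterable[str], n: int = 13) -> Set[Tuple[str, ...]]:
    index: Set[Tuple[str, ...]] = set()
    for text in benchmark_texts:
        index |= ngrams(text.split(), n)
    return index


def is_contaminated(document: str, index: Set[Tuple[str, ...]], n: int = 13) -> bool:
    """True if the document shares at least one n-gram with any benchmark item."""
    return not ngrams(document.split(), n).isdisjoint(index)


benchmark_item = ("what is the sum of the first ten positive integers "
                  "when added together one by one")
index_13 = build_benchmark_index([benchmark_item], n=13)
index_8 = build_benchmark_index([benchmark_item], n=8)

leaked = "forum post: " + benchmark_item + " answer is 55"
partial = "what is the sum of the first ten numbers"

print(is_contaminated(leaked, index_13, n=13))
print(is_contaminated(partial, index_13, n=13), is_contaminated(partial, index_8, n=8))
```

The second line of output illustrates the trade-off between the two window sizes mentioned above: a shorter window flags partial overlaps that a 13-gram check lets through.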
A1.6.2 Benchmark suite registry
Maintained list of benchmarks to scrub against. Industry standard: MMLU, GSM8K, HumanEval, BIG-Bench, ARC, HellaSwag, etc. A 2023-generation open-weights model paper lists ~30 benchmarks.
A1.6.3 Per-benchmark contamination disclosure
Reporting per-benchmark contamination rate alongside results. Industry standard: Sainz 2023 argues current standard is insufficient; per-benchmark disclosure increasingly expected.
A1.7 Provenance Ledger
Mixing strategy determines the proportion of each data source in training. Optimal mix depends on target capabilities: heavier code → better code generation, heavier math → better reasoning, heavier multilingual → broader language coverage. Mix is often staged (early training: broad mix; late training: high-quality + domain-specific). SOTA: DoReMi (Xie 2023) optimizes mix via small reference models with proxy losses. A leading open-weights model used annealing — high-quality data weighted higher in final 40B tokens. Frontier typical: 50-70% web, 5-10% code, 2-5% math, multilingual proportional. Curriculum (early broad, late specialized) increasingly common. e.g. A leading open-weights model: 50% web, 25% math/reasoning, 17% code, 8% multilingual · a synthetic-heavy small frontier model: heavy synthetic + textbook quality · an open-weights model: web-heavy
A1.7.1 Shard-level metadata
Source identifier, retrieval timestamp, filter pipeline version, weight category for each shard. Industry standard: Standard at frontier labs. Format varies.
A1.7.2 Token-batch traceability
Ability to query 'which source produced this token batch'. Industry standard: Partial at most labs. Full token-level provenance is research-grade.
A1.7.3 Cryptographic anchoring
Merkle-tree or similar cryptographic commitment to corpus state. Industry standard: Not yet standard. Proposed for compliance use.
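A Merkle-style commitment over shard hashes can be sketched in a few lines. The leaf and node prefixes and the duplicate-last-node rule for odd levels are simplifying assumptions for illustration, not a description of any lab's scheme.

```python
import hashlib
from typing import List


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaf_hashes: List[bytes]) -> bytes:
    """Fold per-shard hashes into one root; any change to any shard changes the root."""
    if not leaf_hashes:
        raise ValueError("at least one leaf is required")
    # Assumption: distinct prefixes for leaves (0x00) and inner nodes (0x01).
    level = [_h(b"\x00" + leaf) for leaf in leaf_hashes]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # simplification: duplicate the last node on odd levels
        level = [_h(b"\x01" + level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


# Hypothetical shards, each represented by the SHA-256 of its contents.
shards = [hashlib.sha256(f"shard-{i}".encode()).digest() for i in range(5)]
root = merkle_root(shards)
print(root.hex())
```

Publishing only the root commits to the full corpus state, and a single shard can later be proven a member by revealing the sibling hashes along its path.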
A1.8 Synthetic Data Integration
Synthetic data generation uses existing models to create training data: instruction-response pairs (FLAN, self-instruct methodology), reasoning traces (chain-of-thought), gap-filling for under-covered areas (rare languages, niche domains), or constitutional / RLHF preference data. Synthetic data accelerates iteration and fills distributional holes. SOTA: a synthetic-heavy small frontier model (a synthetic-data-focused lab) trained heavily on synthetic textbook-quality data, described as a 7B model with 70B-class capability. Cosmopedia (HF): 25B synthetic textbook tokens. RL-from-AI-Feedback (RLAIF) used in Constitutional methods. Synthetic-heavy past 30-50% mix risks distributional artifacts (Shumailov 2024 model collapse). e.g. a synthetic-heavy small frontier model series: synthetic-heavy · Cosmopedia: 25B synthetic textbook · OpenAssistant: conversation data (crowd-sourced rather than model-generated)
A1.8.1 Synthetic generation prompts
Prompt templates used to generate synthetic training data. Industry standard: Proprietary at frontier labs; small synthetic-heavy model family papers describe textbook-style prompts.
A1.8.2 Diversity and collapse controls
Techniques to prevent distributional collapse from over-reliance on model-generated data. Industry standard: Diversity sampling, multiple-teacher ensembles, periodic refresh from human data.
A1.8.3 Synthetic-vs-human ratio policy
Maximum proportion of synthetic data per training run. Industry standard: No public standard; varies by lab and stage. Pre-training tends low (<10%); SFT can be majority synthetic.
A2

Tokenizer

68 sub-endpoints mapped
MZN Provisional Position · Strong Evidence
Multilingual tokenizer expertise with under-served-script operational experience
Direct production work in Persian (an Arabic-script language documented as systematically over-fragmented in byte-level BPE) gives concrete operational understanding of tokenizer fairness, joining behavior, diacritic normalization, and letter-form collisions. Documented exposure to the 2–4× token-cost gap for non-Latin scripts. Patent-documented tokenizer architecture work; specifics held in the proprietary portfolio.
Definition

The tokenizer maps raw text into discrete tokens — the model's vocabulary. Tokenizer choice is permanent: it defines vocabulary size, multilingual coverage, code handling, and context-window efficiency. A bad tokenizer wastes context (more tokens per character), degrades multilingual performance, and cannot be changed without retraining. Frontier tokenizers are byte-level BPE or SentencePiece with 100K-256K vocabulary.

State of the Art (2025–2026)

The tokenizer of a current-generation frontier model (cl100k_base, 100K vocab) and that of a leading open-weights model (128K vocab, multilingual) are reference points. Byte-level BPE (open-source BPE tokenizer libraries, a foundational decoder-only model lineage) handles any UTF-8 input gracefully. SentencePiece (open-weights models) supports both BPE and Unigram. Multimodal tokenizers add image tokens (256-1024 per image) and audio tokens.

Key Decisions
  • Vocabulary size (32K → 256K)
  • Algorithm (BPE vs. Unigram)
  • Byte-level fallback
  • Pre-tokenization regex (whitespace, digits)
  • Special tokens design
  • Multilingual balance
Trade-offs
  • Larger vocab → fewer tokens per text but larger embedding matrix (linear in vocab)
  • BPE → simple, deterministic; Unigram → probabilistic, slightly better for some languages
  • Pre-tokenization affects compositional generalization
Numbers & Ablations
  • Tokenizer compression efficiency: a leading open-weights model (128K vocab) compresses Persian text 4.2× better than a 2023-generation open-weights model (32K). Korean: 5.1×. Hindi: 3.8× (Petrov 2023 + community measurements).
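  • Vocabulary size cost: parameters scale as vocab × hidden_dim per embedding matrix. Doubling vocab from 128K to 256K adds ~1B params at 8K hidden_dim and ~1.6B at 12K hidden_dim; untied input and output embeddings double these figures (~2B and ~3B, frontier scale).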
  • Tied embeddings save ~50% of vocab parameter cost; standard at frontier dense, sometimes untied in MoE.
  • Encoding speed: a byte-level BPE tokenizer library ~1M tokens/sec/CPU; an open-model hub tokenizers (Rust) ~700K. Negligible relative to inference compute.
  • Glitch token incidence: ~0.01-0.1% of tokens in BPE vocabularies (untrained tail). Detected via embedding magnitude analysis.
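The vocabulary-size cost in the list above follows from a single product, vocab × hidden_dim, counted once for tied embeddings and twice for untied. A small sketch, assuming "128K" means 128 × 1024 entries (real vocabularies are rarely exact powers of two) and using a hypothetical helper name:

```python
def embedding_params(vocab_size, hidden_dim, tied=True):
    """Parameters spent on token embeddings (input matrix plus output head)."""
    matrix = vocab_size * hidden_dim
    return matrix if tied else 2 * matrix


K = 1024
tied_128k = embedding_params(128 * K, 8192)                # one shared matrix
untied_128k = embedding_params(128 * K, 8192, tied=False)  # separate in/out matrices
added_by_doubling = embedding_params(256 * K, 8192) - tied_128k

print(tied_128k, untied_128k, added_by_doubling)
```

Under these assumptions the tied 128K × 8192 case is about 1.07B parameters, and tying halves the untied cost, consistent with the ~50% saving quoted above.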
Open Questions
  • Tokenizer-free architectures (MambaByte, MEGABYTE): why have they not matched BPE at frontier scale despite theoretical advantages? Compute hypothesis vs. fundamental limit unclear.
  • Is there an optimal vocab size for a given (model size, data mix, target language portfolio)? Current choices are heuristic.
  • Cross-lingual transfer in shared-vocab tokenizers: how much capability is shared vs. language-isolated? Limited mechanistic understanding.
  • Multimodal tokenization: image tokens at 256 vs. 1024 per image — what is the actual quality-vs-cost frontier? AnyRes (LLaVA-NeXT) provides one data point, not a curve.

Reference analyst note. Tokenizer choice is a permanent commitment that constrains everything downstream. Frontier labs underinvest here — most use SentencePiece-defaults trained on subset of data. The next frontier capability gain may come from rethinking tokenization (entropy-aware dynamic tokenization, byte-level with efficient training). Anyone aiming for genuine multilingual frontier should treat tokenizer as a first-class capability investment.

Examples

a current-generation frontier model: cl100k_base, 100K vocab, byte-level BPE · a leading open-weights model: 128K vocab, multilingual SentencePiece BPE · a leading frontier model: ~65K vocab · a multimodal frontier model: tokenizer designed for multimodal

References (Academic)

Sennrich et al., Subword Units / BPE (2015) · Kudo & Richardson, SentencePiece (2018) · Petrov et al., Language Model Tokenizers Introduce Unfairness Between Languages (2023)

Sub-endpoint anatomy — 68 items mapped
A2.1 Tokenization Algorithm Family
BPE (Byte-Pair Encoding) iteratively merges the most frequent adjacent symbol pairs. Variants: word-level BPE (a foundational decoder-only model, an earlier frontier model era), byte-level BPE (operates on UTF-8 bytes, never produces unknown tokens), and SentencePiece-BPE (whitespace-aware). Byte-level BPE is the dominant choice for production LLMs because it handles any input. SOTA: Byte-level BPE with a carefully designed pre-tokenization regex (handling whitespace, digits, contractions) is standard. The pre-tokenization regex of leading open-weights tokenizers splits digits into individual tokens for better arithmetic. Modern implementations (open-source BPE tokenizer libraries, an open-model hub tokenizers) achieve sub-millisecond encoding for thousands of tokens. e.g. multiple frontier model generations: byte-level BPE · leading open-weights model: SentencePiece BPE with byte fallback · an open-weights frontier lab: same lineage as leading open-weights model
A2.1.1 Word-level tokenization
Each whitespace-separated unit is a token. Industry standard: Effectively obsolete for LLMs. OOV explosion makes it incompatible with web-scale training.
A2.1.2 Character-level tokenization
Each character is a token. Industry standard: Used in CharFormer, character CNNs. Not used by frontier general-purpose LLMs because sequences become several times longer and attention cost grows quadratically with sequence length.
A2.1.3 Byte-level (raw)
Each UTF-8 byte is a token. Vocabulary fixed at 256. Industry standard: Used as base layer of Byte-Level BPE. Pure byte-level only in tokenizer-free models like a byte-level encoder-decoder model.
A2.1.4 BPE family (Byte Pair Encoding)
Iteratively merge most-frequent adjacent token pair until target vocabulary size reached. Industry standard: Dominant family for frontier LLMs. Byte-Level BPE (a leading frontier model family) is most common.
+ deeper detail (3 leaves)
  • A2.1.4.1 Standard BPE (character-based) Original BPE, applied over Unicode characters. Industry standard: Mostly superseded by byte-level BPE for general LLMs. Some MT systems still use.
  • A2.1.4.2 Byte-Level BPE BPE applied over the 256-byte alphabet rather than characters. Guarantees no OOV at byte level. Industry standard: multiple frontier model generations, a leading open-weights model (a byte-level BPE tokenizer library-style). Considered dominant frontier choice for English-heavy + code workloads.
  • A2.1.4.3 SentencePiece-BPE BPE implementation in SentencePiece library, raw text input without pre-tokenization. Industry standard: Common in multilingual models (a multilingual encoder-decoder model, a multilingual translation model). Treats whitespace as a regular character.
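The merge loop behind the BPE family can be sketched in a few lines. This is a deliberately minimal byte-level version under stated simplifications: no pre-tokenization (so merges may cross word boundaries), overlapping pairs are counted naively, and `train_bpe` / `apply_merge` are hypothetical names rather than any library's API.

```python
from collections import Counter


def apply_merge(seq, pair, new_id):
    """Replace every non-overlapping occurrence of pair with new_id."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to num_merges merges, starting from the 256-byte alphabet."""
    seq = list(text.encode("utf-8"))  # ids 0..255 are raw bytes
    merges = {}                       # (left_id, right_id) -> new_id
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:
            break  # stop when no pair repeats (a frequency threshold)
        merges[best] = next_id
        seq = apply_merge(seq, best, next_id)
        next_id += 1
    return merges, seq


merges, encoded = train_bpe("abababab", num_merges=2)
print(merges, encoded)
```

Because the base alphabet is the full set of byte values, any UTF-8 input stays representable regardless of how many merges are learned, which is the no-OOV guarantee noted for byte-level BPE.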
A2.1.5 Unigram LM (Kudo)
Probabilistic subword model; iteratively prune low-probability subwords from initial large vocabulary. Industry standard: an encoder-decoder model, a multilingual encoder-decoder model, an early open-weights model/2 use SentencePiece-Unigram. Theoretically allows multiple segmentations (subword regularization).
+ deeper detail (2 leaves)
  • A2.1.5.1 Standard Unigram LM training EM-based pruning from initial seed vocabulary.
  • A2.1.5.2 Subword regularization (sampling) Train-time sampling of alternative segmentations to improve robustness. Industry standard: Used in some MT models. Less common in modern LLMs.
A2.1.6 WordPiece
an encoder-only model-style subword model; greedy longest-match segmentation. Industry standard: Used in an encoder-only model family. Less common in modern decoder-only LLMs.
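Greedy longest-match segmentation can be illustrated as below. The sketch assumes the common convention of marking word-internal pieces with a "##" prefix and returning a single unknown token when any part of the word cannot be matched; the toy vocabulary is made up for the example.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation marker (assumed convention)
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no prefix matched: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


vocab = {"token", "##izer", "##s", "un", "##known"}
print(wordpiece("tokenizers", vocab))
print(wordpiece("xyz", vocab))
```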
A2.1.7 Tokenizer-free
Operate directly on bytes or characters without learned vocabulary. Industry standard: a byte-level encoder-decoder model, CANINE, MEGABYTE. Research direction; computational cost limits frontier deployment.
A2.2 Vocabulary Design
Vocabulary size is a primary hyperparameter. Larger vocab → fewer tokens per text → longer effective context → faster inference per character. But also: larger embedding matrix (vocab × hidden_dim parameters) and softmax cost over vocab. Frontier 2024-2026 trend: 128K-256K vocab. SOTA: A leading open-weights model expanded from 32K (a 2023-generation open-weights model) to 128K, citing multilingual coverage. A multimodal frontier model, a frontier multimodal model use larger vocabs (~200K+ estimated). Trade-off: vocab × hidden_dim adds parameters: 128K × 8192 = 1B parameters in embedding alone for a large model. This is offset by fewer tokens and tied input/output embeddings. e.g. a 2023-generation open-weights model: 32K · a leading open-weights model: 128K · a frontier multimodal model: ~200K (estimated)
A2.2.1 Vocab size selection
Total number of tokens in vocabulary. Industry standard: 32K (an early open-weights model/2), 50K (a foundational decoder-only model/3), 100K-128K (a current-generation frontier model, a leading open-weights model, a long-context frontier model estimated). Trend toward larger vocabularies for multilingual + code coverage.
+ deeper detail (2 leaves)
  • A2.2.1.1 Compute trade-off Larger vocab = larger embedding matrix, larger output projection, more compute per step. Industry standard: Embedding cost scales linearly with vocab size; for very large models, vocab cost is small fraction of total compute.
  • A2.2.1.2 Coverage vs sparsity Larger vocab = better per-language coverage but rarer tokens. Industry standard: Sweet spot empirical. A leading open-weights model increased vocab from 32K to 128K specifically for multilingual + code.
A2.2.2 Special tokens
Reserved tokens for protocol use: BOS, EOS, PAD, chat templates, tool calls, system roles.
+ deeper detail (4 leaves)
  • A2.2.2.1 BOS / EOS / PAD Beginning-of-sequence, end-of-sequence, and padding tokens. Industry standard: Universal. Specific token IDs vary; some models conflate BOS=EOS, others separate.
  • A2.2.2.2 Chat template tokens Tokens marking message roles (user, assistant, system) and turn boundaries. Industry standard: ChatML (a leading frontier lab), recent open-weights models templates with [INST]...[/INST], a leading frontier model with Human:/Assistant: convention.
  • A2.2.2.3 Tool / function-call tokens Special tokens for function invocation, tool result returns, structured output. Industry standard: Increasingly reserved. A leading open-weights model added tool tokens; a leading frontier lab uses structured wrapper tokens.
  • A2.2.2.4 Reserved / unused token policy Slots reserved for future special tokens. Industry standard: A leading open-weights model reserved 256 special token slots. Allows post-training extension without retraining tokenizer.
A2.2.3 Vocab freezing strategy
Whether vocabulary is frozen post-training or extensible. Industry standard: Frozen at training start. Extension requires careful procedure (see A2.2.4).
A2.2.4 Vocab extension policy
Adding tokens post-training (new languages, domains, special tokens). Industry standard: Add to reserved slots; new embeddings randomly initialized + fine-tuned. A leading open-weights model demonstrates.
A2.3 Multilingual & Multi-Script Coverage
Special tokens demarcate roles, modalities, and structural boundaries. Standard set: BOS (begin), EOS (end), PAD, UNK (rare with byte-level), and chat tokens (system/user/assistant role markers). Tool-using models add tokens for function calls, results, and reasoning steps. Multimodal adds image-start/end and audio markers. SOTA: ChatML (a leading frontier lab) and a leading open-weights model's chat template define standard chat structure with explicit role tokens. Tool use tokens (function_call_start, function_result) increasingly standardized. Reasoning models (o1, R1) use special thinking/answer tokens. Reserved tokens (e.g., 200+ in a leading open-weights model) allow post-hoc additions without retraining. e.g. ChatML: <|im_start|>, <|im_end|> · a leading open-weights model: <|begin_of_text|>, <|start_header_id|>, etc. · o1-style: <thinking>, <answer>
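As an illustration of how role tokens structure a conversation, the sketch below renders messages with the ChatML markers named above. The exact layout (role on the marker line, newline after each end marker) is an assumption for the example, and `render_chatml` is a hypothetical helper; in a real tokenizer each marker maps to one reserved token id rather than to plain text.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {'role', 'content'} dicts as ChatML-style text."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "Hi"},
])
print(prompt)
```

User-supplied content that contains the marker strings should be escaped or rejected before rendering, otherwise a message body could forge a turn boundary.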
A2.3.1 Script coverage
Per-script tokenization fidelity.
+ deeper detail (6 leaves)
  • A2.3.1.1 Latin scripts English, Spanish, French, German, etc. Industry standard: Best-supported. Byte-level BPE handles directly; SentencePiece-Unigram handles via subword regularization.
  • A2.3.1.2 CJK scripts (Chinese, Japanese, Korean) Logographic and mixed scripts. Industry standard: Treat each character or character-pair as token. Heavy vocabulary footprint: Unicode encodes tens of thousands of CJK ideographs, though only a few thousand are in common use in modern Chinese.
  • A2.3.1.3 Arabic-script (Arabic, Persian, Urdu) Right-to-left abjad scripts with optional diacritics and joining behavior. Industry standard: Often over-fragmented in byte-level BPE due to multi-byte UTF-8 representation. Persian especially under-served.
      - A2.3.1.3.1 RTL token boundary How tokenizer handles right-to-left direction at token boundaries. Industry standard: Mostly handled at rendering layer, not tokenizer. Tokenizer treats as plain byte sequence.
      - A2.3.1.3.2 Diacritic handling Optional vowel marks (harakat). Train corpus typically has them inconsistently. Industry standard: Inconsistent. Diacritics treated as separate tokens or dropped during normalization.
      - A2.3.1.3.3 Letter-form normalization Same character with different visual forms (e.g. Arabic ya/Persian ye). Industry standard: NFC/NFKC normalization standard but not universal; can collapse distinct characters.
      - A2.3.1.3.4 Joining behavior Arabic-script letters change shape based on position in word (initial/medial/final/isolated). Industry standard: Tokenizer operates on logical Unicode code-points, not visual forms; joining handled at rendering.
  • A2.3.1.4 Indic scripts Devanagari, Bengali, Tamil, Telugu, etc. Industry standard: Often under-served due to low corpus presence; multi-byte UTF-8 encoding leads to over-fragmentation.
  • A2.3.1.5 Cyrillic Russian, Ukrainian, Bulgarian, Serbian, etc. Industry standard: Generally well-served in major models due to substantial corpus presence.
  • A2.3.1.6 Long-tail scripts Thai, Hebrew, Greek, Armenian, Georgian, Ethiopic, etc. Industry standard: Coverage varies. Models trained on web corpus serve them roughly proportional to corpus presence.
A2.3.2 Token efficiency per language
Tokens-per-character ratio across languages — measures fairness and cost.
+ deeper detail (3 leaves)
  • A2.3.2.1 Tokens-per-character ratio Average number of tokens per character in a given language, usually measured over a fixed-size sample (e.g. 1000 characters). Industry standard: English ~0.25 tokens/char (4 chars per token). Many low-resource languages 1.0+ (at least 1 token per char).
  • A2.3.2.2 Cross-language fairness Cost / context-window disparity between languages. Industry standard: Petrov et al. 2023 documented 5-15× cost disparity for some low-resource languages.
  • A2.3.2.3 Low-resource over-fragmentation Languages with sparse corpus presence get poorly-merged tokens. Industry standard: Up-sampling during tokenizer training partially addresses; full fix requires balanced corpus or per-language tokenizer.
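One way to see where the per-language disparity starts is to measure UTF-8 bytes per character, which is the token cost a byte-level tokenizer pays before any merges apply. The sketch below is only a proxy under that assumption; a real measurement would divide a tokenizer's token count by the character count, as `tokens_per_char` does once a count is supplied. Both function names are hypothetical.

```python
def bytes_per_char(text):
    """UTF-8 bytes per character: the pre-merge cost for byte-level BPE."""
    return len(text.encode("utf-8")) / len(text)


def tokens_per_char(token_count, text):
    """Tokens-per-character ratio, given a token count from any tokenizer."""
    return token_count / len(text)


samples = {
    "English": "hello",
    "Persian": "سلام",
    "Hindi": "नमस्ते",
    "Chinese": "你好",
}
for language, text in samples.items():
    print(language, bytes_per_char(text))
```

Scripts that need two or three bytes per character depend on learned merges to reach parity with English, so languages that are sparse in the tokenizer training corpus keep most of this overhead.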
A2.3.3 Cross-lingual token sharing
Whether semantically similar concepts share tokens across languages. Industry standard: Emergent in shared vocabulary; not designed-in. Subject of research in multilingual representation.
A2.4 Pre-tokenization
Multilingual tokenization is a major axis. English-centric tokenizers fragment non-Latin scripts heavily (e.g., a Persian/Arabic word may be 3-5x more tokens than English equivalent). This degrades both performance and economics for non-English users. Modern tokenizers (a leading open-weights model, a multimodal frontier model) explicitly rebalance to compress non-English scripts. SOTA: A leading open-weights model's 128K vocab includes substantial coverage for Chinese, Arabic, Hindi, Persian, etc. Trade-off: each language added 'costs' embedding capacity. A multimodal frontier model was designed multilingual-first. Some labs train language-specific tokenizers (e.g., Chinese-open-weights model family) for downstream models. e.g. A leading open-weights model vs a 2023-generation open-weights model: 4-8x compression improvement on non-English · a frontier multimodal model: significant non-English improvement over a 2023-class frontier model
A2.4.1 Whitespace handling
Whether whitespace is a token boundary, a regular character, or attached to adjacent token. Industry standard: Byte-level BPE: whitespace as character, often attached to following token (Ġ prefix in a foundational decoder-only model). SentencePiece: whitespace replaced with ▁.
A2.4.2 Punctuation rules
How punctuation interacts with adjacent characters during pre-tokenization. Industry standard: Generally split at punctuation boundaries via regex (frontier-style).
A2.4.3 Number/digit splitting
Whether multi-digit numbers are split into individual digits. Industry standard: an early open-weights model/2 split into digits for math; a frontier model historically did not. A leading open-weights model also splits.
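A simplified pre-tokenization pattern with digit splitting might look like the sketch below. It uses only the standard `re` module and is an assumption-laden stand-in for production patterns (which typically rely on Unicode property classes and handle contractions); `PRETOKEN_RE` and `pretokenize` are hypothetical names.

```python
import re

# Alternatives, in priority order:
#   \d             a single digit (forces digit-by-digit splitting)
#    ?[^\W\d_]+    a run of letters, with an optional leading space attached
#    ?[^\s\w]+     a run of punctuation, with an optional leading space
#   \s+            any remaining whitespace
#   _              underscore (a word character not covered above)
PRETOKEN_RE = re.compile(r"\d| ?[^\W\d_]+| ?[^\s\w]+|\s+|_")


def pretokenize(text):
    """Split text into pre-tokens; BPE merges never cross these boundaries."""
    return PRETOKEN_RE.findall(text)


pieces = pretokenize("Price: 1234 USD")
print(pieces)
```

Splitting numbers into single digits gives every number a uniform representation, at the cost of more tokens per number than a tokenizer that merges digit runs.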
A2.4.4 URL / code special handling
Pre-segmentation for code and URLs to prevent merge across syntactic boundaries. Industry standard: Often integrated into pre-tokenization regex; some pipelines route code through specialized tokenizer.
A2.4.5 Unicode normalization
NFC, NFKC, NFD, NFKD: different ways to canonicalize Unicode. Industry standard: NFC standard for most pipelines. NFKC additionally folds compatibility characters (ligatures, full-width forms) into their plain equivalents, which loses information.
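The difference between the forms is easy to check with Python's standard `unicodedata` module. The sketch below is illustrative only; the sample characters were chosen for the example and the wrapper name `normalize` is hypothetical.

```python
import unicodedata


def normalize(text, form="NFC"):
    """Apply one of the Unicode normalization forms (NFC, NFD, NFKC, NFKD)."""
    return unicodedata.normalize(form, text)


decomposed_e = "e\u0301"   # 'e' followed by a combining acute accent
ligature_fi = "\ufb01"     # the single-character 'fi' ligature
arabic_yeh = "\u064a"      # ARABIC LETTER YEH
farsi_yeh = "\u06cc"       # ARABIC LETTER FARSI YEH

print(normalize(decomposed_e, "NFC") == "\u00e9")  # composed into one code point
print(normalize(ligature_fi, "NFC"))               # NFC keeps the ligature
print(normalize(ligature_fi, "NFKC"))              # NFKC folds it to 'fi'
print(normalize(arabic_yeh, "NFKC") == normalize(farsi_yeh, "NFKC"))
```

Note that the base Arabic and Persian forms of this letter stay distinct under the standard forms, so a pipeline that wants them unified needs an explicit mapping on top of normalization.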
A2.5 Code & Specialized Domains
Code and specialized-domain tokenization. Code has different distributional properties from natural language: high entropy in identifiers, important whitespace, frequent punctuation. Specialized tokenizers (or careful pre-tokenization in shared tokenizer) handle these. SOTA: General-purpose tokenizers (a leading open-weights model, a current-generation frontier model) handle code well via byte-level BPE + careful pre-tokenization (digits separate, indentation preserved). Code-specialized tokenizers (StarCoder) marginally better on code-only metrics but lose general-language efficiency. Frontier: shared tokenizer with code-aware design. e.g. StarCoder tokenizer: code-specialized · a leading open-weights model: general but strong on code · a byte-level BPE tokenizer library cl100k: handles code well
A2.5.1 Indentation handling
Python-style significant whitespace; tabs vs spaces. Industry standard: Modern code-aware tokenizers preserve indentation as multi-space tokens (single token for 4 spaces, etc.).
A2.5.2 Symbol density
Heavy punctuation and operator density in code. Industry standard: Common operators (==, !=, ->, =>) often merged into single tokens during BPE training.
A2.5.3 Multi-language source code
Single tokenizer covering Python, JavaScript, Java, Rust, C++, etc. Industry standard: Shared tokenizer trained on multi-language code corpus (The Stack).
A2.5.4 Math / LaTeX
Mathematical notation and LaTeX commands. Industry standard: Common LaTeX commands (\frac, \sum, etc.) often merged. Number/digit splitting (A2.4.3) helps arithmetic.
A2.6 Multi-Modal Token Spaces
Multi-modal token spaces. Vision tokens (from ViT/SigLIP encoder, 256-1024 per image), audio tokens (Whisper-style or native), video tokens (temporal sampling). Native multimodal models share token space across modalities; late-fusion projects modality embeddings into LLM space. SOTA: Native multimodal token spaces standard 2024+. Image: ViT/SigLIP encoder produces 256-1024 tokens per image; dynamic resolution (AnyRes, Pixtral). Audio: Moshi-style native audio tokens at 12.5Hz, or Whisper transcription tokens. Video: 1-8 fps spatial-temporal patches. e.g. a frontier multimodal model native multimodal · a multimodal frontier model 2.0 native + image gen · Chameleon (an open-weights frontier lab, open)
A2.6.1 Image patch tokens
ViT-style 16x16 or 14x14 patches converted to tokens via linear projection. Industry standard: Standard since ViT (Dosovitskiy 2020). 16x16 most common; 14x14 in newer models.
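The per-image token counts quoted in this section follow from simple patch arithmetic, sketched below. The input resolutions are example values, not settings attributed to any particular model, and the sketch ignores extra tokens such as class or separator tokens.

```python
def patch_token_count(height, width, patch_size):
    """Number of non-overlapping square patches, i.e. image tokens."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    return (height // patch_size) * (width // patch_size)


for size, patch in [(224, 16), (224, 14), (336, 14), (448, 14)]:
    print(size, patch, patch_token_count(size, size, patch))
```

Halving the patch edge quadruples the token count, which is why the choice between 256 and 1024 tokens per image is a direct cost-versus-detail trade-off.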
A2.6.2 Audio tokens
Audio frames or learned audio codec tokens (Encodec, SoundStream). Industry standard: Encodec (Defossez 2022) and similar neural codecs produce discrete audio tokens. Whisper uses log-mel spectrograms instead.
A2.6.3 Bridged token spaces
Shared embedding space across modalities. Industry standard: Common in multimodal LLMs (a frontier vision-language model, a multimodal frontier model, a long-context frontier model). Implementation via projection or learned bridging.
+ deeper detail (2 leaves)
  • A2.6.3.1 Shared embedding space All modalities mapped to single vector space. Industry standard: CLIP-style or learned per-modality projector to text embedding dim.
  • A2.6.3.2 Cross-attention bridges Modality-specific encoders feeding into text decoder via cross-attention. Industry standard: Used in Flamingo (a multimodal frontier lab 2022) and derivatives.
A2.7 Tokenizer Training Pipeline
Tokenizer training pipeline. SentencePiece and an open-model hub tokenizers library provide implementations. Typical loop: choose vocab size, train on a representative corpus sample (~10-100GB), evaluate compression ratio across languages, settle special token reservations, finalize. SOTA: Reserve 200-500 tokens for special use. Evaluate per-language compression ratio. Standard pipeline: ~hours on single CPU. e.g. an open-model hub tokenizers (Rust, fast) · SentencePiece (a multimodal frontier lab) · a byte-level BPE tokenizer library (a leading frontier lab)
A2.7.1 Training corpus selection
Which subset of the data corpus is used to train the tokenizer. Industry standard: Random sample of the full pre-training corpus, usually 1-10B tokens. Mixture should mirror final training mixture.
A2.7.2 Sampling strategy
How documents are sampled into the tokenizer training set. Industry standard: Up-sample low-resource languages to prevent over-fragmentation.
A2.7.3 Convergence criteria
Stopping condition for BPE merges. Industry standard: Stop when target vocab size reached, or when merge frequency falls below threshold.
A2.8 Inference-time Behavior
Inference-time tokenizer behavior. Encoding/decoding speed (sub-millisecond per request expected). Edge cases: incomplete UTF-8 at chunk boundaries (during streaming), tokenizer mismatch between client/server, special-token leakage in outputs. SOTA: a byte-level BPE tokenizer library and an open-model hub tokenizers achieve <1ms encoding for typical inputs. Streaming decode handles partial multi-byte chars at chunk boundaries. Production concern: special token leakage in outputs (e.g., <|endoftext|> appearing in user-visible response) — sanitization required. e.g. a byte-level BPE tokenizer library (a leading frontier lab, fast) · an open-model hub tokenizers (Rust)
A2.8.1 Detokenization correctness
Round-trip: encode then decode produces original text. Industry standard: Byte-level BPE: lossless. SentencePiece: lossless within whitespace handling rules.
A2.8.2 Streaming token boundary
Token-by-token streaming with valid UTF-8 emission. Industry standard: Buffer partial bytes until valid UTF-8 boundary; emit complete characters only.
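The buffering rule above can be sketched with the standard library's incremental UTF-8 decoder, which holds back incomplete byte sequences until the rest arrives. The chunk boundaries here are chosen by hand to split a character; in practice they would come from per-token byte output, and `stream_decode` is a hypothetical name.

```python
import codecs


def stream_decode(byte_chunks):
    """Yield only complete characters from a stream of UTF-8 byte chunks."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in byte_chunks:
        text = decoder.decode(chunk)  # partial trailing bytes stay buffered
        if text:
            yield text
    tail = decoder.decode(b"", final=True)  # raises if the stream ends mid-character
    if tail:
        yield tail


raw = "سلام".encode("utf-8")   # 4 characters, 2 bytes each
chunks = [raw[:3], raw[3:]]    # the split falls inside the second character
streamed = list(stream_decode(chunks))
print(streamed)
```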
A2.8.3 Prompt prefix handling
Whether tokenizer adds BOS automatically; how chat templates are pre-encoded. Industry standard: Varies by model. A 2023-generation open-weights model/3 expect explicit BOS; a leading frontier model family auto-adds.
A2.9 Evaluation & Robustness
Tokenizer evaluation and robustness. Compression ratio across languages (tokens per character). Coverage on rare scripts. Robustness to adversarial inputs (Unicode tricks, zero-width characters, look-alike characters). Glitch tokens (rarely-trained tokens that cause weird behavior). SOTA: Petrov et al. (2023) systematically evaluated multilingual tokenizers — older tokenizers fragment non-Latin 5-10× worse. A leading open-weights model, a multimodal frontier model show much improved fairness. Glitch tokens (rarely-trained tokens like 'a famous glitch-token example') exposed as failure mode 2023, detected via embedding magnitude analysis. e.g. Petrov et al. multilingual study · Glitch token analyses (Rumbelow & Watkins 2023) · MultiBPemb robustness work
A2.9.1 Compression metrics
Bits per character, tokens per word, vocabulary efficiency. Industry standard: Lower bits/char = better compression. Reported across languages for fairness analysis.
A2.9.2 Out-of-vocabulary behavior
How tokenizer handles characters or sequences not seen in training. Industry standard: Byte-level BPE: degrades gracefully (always representable). Character-based BPE: requires fallback.
A2.9.3 Adversarial inputs
Inputs crafted to exploit tokenization quirks (homoglyphs, invisible characters). Industry standard: Pre-tokenization Unicode normalization mitigates. Zero-width characters and homoglyphs remain attack surface.
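Two cheap checks for these inputs are sketched below: flagging invisible format characters and detecting mixed scripts within one string. Both are heuristics under stated assumptions: the script is guessed from the first word of the Unicode character name, and the zero-width non-joiner is allow-listed because Persian orthography requires it. The function names are hypothetical.

```python
import unicodedata

# Assumption: ZWNJ (U+200C) is legitimate in Persian text, so it is allowed.
ALLOWED_FORMAT_CHARS = frozenset({"\u200c"})


def find_invisible(text, allowed=ALLOWED_FORMAT_CHARS):
    """Return (index, name) for format characters (category Cf) not allowed."""
    return [
        (i, unicodedata.name(ch, "UNNAMED"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf" and ch not in allowed
    ]


def scripts_used(text):
    """Rough script set for the letters in text, from Unicode names."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0] for ch in text if ch.isalpha()}


print(find_invisible("pay\u200bpal"))  # zero-width space hidden inside a word
print(scripts_used("p\u0430ypal"))     # Cyrillic 'а' posing as Latin 'a'
```

Normalization alone does not catch the second case, since the Cyrillic and Latin letters are distinct characters with no compatibility mapping between them.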
A2.9.4 Glitch tokens / hidden token exploits
Tokens in vocabulary that produce anomalous behavior. Famous example: 'a famous glitch-token example' in an earlier frontier model. Industry standard: Caused by tokens that were frequent in the tokenizer training corpus but later filtered out of the model training data, leaving their embeddings nearly untrained. Mitigation: align tokenizer corpus with model corpus.
A2.10 Anchor-Based Representation (Research Direction)
Anchor-based representation / advanced research direction. Beyond pure subword tokenization, research explores semantic-aware tokenization, byte-level models without BPE (MambaByte, MEGABYTE), and tokenizer-free approaches. SOTA: Tokenizer-free architectures (MambaByte, MEGABYTE, a byte-level encoder-decoder model) operate directly on bytes — eliminate fragmentation but slower. Active research; not yet matched BPE at frontier scale. Hybrid approaches (entropy-based dynamic tokenization) emerging in 2025. e.g. MambaByte (research 2024) · MEGABYTE (Yu 2023) · a byte-level encoder-decoder model (Xue 2022)
A3

Architecture

44 sub-endpoints mapped
MZN Provisional Position · Partial
Patent-grade candidate architectural innovations; implementation validation pending; full model construction is Phase 3 scope
Patent-grade architectural inventions in the area of structured intelligence, modular reasoning, and intent-shaping pipelines (SHA-256 anchors and blockchain timestamps). Architectural patterns contributed; full frontier-scale model construction at parameter count requires partnership compute. Specifics held in the proprietary portfolio.
Definition

Model architecture defines the network's computational structure: how inputs flow through layers, what operations apply at each layer, and how representations combine. The dominant paradigm since 2017 is the decoder-only transformer with modifications. Architecture decisions cascade: attention type affects long-context, normalization affects training stability, MoE affects parameter efficiency vs. compute.

State of the Art (2025–2026)

Frontier 2024-2026 dense architectures (a leading open-weights flagship model, an open-weights frontier lab Large 2): decoder-only transformer with RoPE positional encoding, RMSNorm, SwiGLU activation, GQA (grouped query attention). MoE architectures (a sparse-MoE frontier model, an open-weights frontier model (V3 class)): sparse expert routing with 8-256 experts, top-2 routing typical. Reasoning models (o1, R1): same architecture but RL-trained for chain-of-thought. Multimodal: native interleaved tokens with vision encoder integration.

Key Decisions
  • Dense vs. MoE
  • Attention type (full, GQA, MQA, sliding window)
  • Positional encoding (RoPE, ALiBi, NoPE)
  • Normalization (RMSNorm, LayerNorm, post vs. pre)
  • Activation (SwiGLU, GeGLU, ReLU)
  • Depth × width allocation
  • Vision/audio integration strategy
Trade-offs
  • MoE → more parameters per FLOP but harder to train and serve
  • GQA → faster inference, slight quality reduction vs. full MHA
  • Sliding window → attention cost linear in sequence length (O(n × w)) but loses long-range info
Numbers & Ablations
  • GQA-8 vs full MHA at 70B: <1% MMLU degradation, 4-8× KV cache memory reduction, ~3× decode throughput at 32K context (Ainslie 2023, a leading open-weights model paper).
  • MoE active/total ratio: an open-weights frontier model (V3 class) 5.5% (37B/671B), a sparse-MoE frontier model 28% (39B/141B), a current-generation frontier model estimated ~20% (closed). Lower ratio = more capacity per FLOP at training/serving complexity cost.
  • RoPE base θ scaling: original 10K → 500K (a leading open-weights model for 128K context) → 5M+ (research on 1M+ context). Each 10× context extension typically requires ~10× θ.
  • Multi-Head Latent Attention (an open-weights frontier provider V2/V3): 93% KV cache reduction vs MHA, 1-2% benchmark improvement attributed to better representational structure.
  • SwiGLU vs GeLU: ~1-2% perplexity gain at parameter-matched budget (Shazeer 2020). Universal at frontier 2024+.
  • Pre-RMSNorm vs Pre-LayerNorm: equivalent quality, ~7-10% throughput gain (omits mean computation). Universal at frontier 2024+.
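The KV cache reduction quoted for GQA in the list above can be reproduced with back-of-the-envelope arithmetic. In the sketch below, the 64 query heads and 8 KV heads match the GQA-8 configuration cited in this section, while the layer count (80), head dimension (128), and 2-byte values are assumptions for a hypothetical 70B-class model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes of KV cache for one sequence: keys plus values in every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


GIB = 2 ** 30
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=32768)
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=32768)

print(mha / GIB, gqa / GIB, mha / gqa)
```

Under these assumptions a single 32K-token sequence needs 80 GiB of cache with full MHA and 10 GiB with GQA-8, which is why KV head count dominates long-context serving cost.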
Open Questions
  • Is there a Pareto-better attention than MLA for long context? Several research efforts (Differential Attention, Lightning Attention) but no frontier adoption yet.
  • Why does sliding window + global hybrid (a small open-weights model 2) underperform pure full-attention at frontier scale despite theoretical advantages? Empirical observation, not understood.
  • Scaling laws for active parameters in MoE: if active=37B in an open-weights frontier model (V3 class) matches dense 70B-100B-class, what is the actual mapping? No published Chinchilla-equivalent for MoE.
  • Reasoning models: does the architecture that's best for non-reasoning training remain optimal under RL post-training? o1/R1 suggest yes; theoretical reason absent.
  • State-space models (a state-space frontier architecture) at frontier: shown to be competitive at 7-13B in 2024. Why has no lab pushed to 70B+ for serious comparison? Compute economics or architectural ceiling?

Reference analyst note. Dense architecture is dead at frontier scale by end of 2026. A leading open-weights flagship model is likely the last frontier-tier dense model. Either MoE (an open-weights frontier provider lineage, fine-grained 200+ experts) or new sparse paradigms wins. Architecture innovation is decoupling from scaling — RL post-training resets 'capability per parameter' such that smaller models with better post-training match much larger pre-train-only models. The bottleneck is shifting from architecture-quality to RL-environment-quality.

Examples

A leading open-weights flagship model: dense, GQA, RoPE, RMSNorm, SwiGLU · an open-weights frontier model (V3 class): MoE 671B total / 37B active, multi-head latent attention · a sparse-MoE frontier model: MoE, 141B total / 39B active · a long-context frontier model / a frontier multimodal model: architecture undisclosed but likely MoE

References (Academic)

Vaswani et al., Attention Is All You Need (2017) · Touvron et al., leading open-weights model (2023, 2024) · an open-weights frontier model (V3 class) technical report (2024) · Su et al., RoPE (2021)

Sub-endpoint anatomy — 44 items mapped
A3.1 Transformer Block Design
Attention mechanism is the core operation. Standard multi-head attention scales quadratically with sequence length, making naive transformers infeasible for long context. Variants reduce cost: multi-query attention (MQA, single KV head), grouped query attention (GQA, fewer KV heads than Q heads), sliding window attention (local), and multi-head latent attention (an open-weights frontier provider's compressed KV). SOTA: GQA is the dominant choice for frontier dense models (open-weights models). It reduces KV cache memory by 4-8x with negligible quality loss. An open-weights frontier model (V3 class) introduced MLA (Multi-head Latent Attention), which compresses KV via low-rank projection, achieving even better memory efficiency. Sliding window (an open-weights frontier lab, a small open-weights model) handles very long context with O(n × w) cost. e.g. A leading open-weights model (70B class): 64 Q heads, 8 KV heads (GQA-8) · an open-weights frontier lab: GQA + sliding window · an open-weights frontier model (V3 class): MLA
A3.1.1 Attention mechanism
Self-attention variant. Industry standard: Multi-head self-attention (Vaswani 2017) is the foundation. Modern variants reduce KV cache cost.
+ deeper detail (5 leaves)
  • A3.1.1.1 Multi-Head Attention (MHA) Standard multi-head: each head has independent Q, K, V projections. Industry standard: Foundation. Used in original Transformer, foundational decoder-only models, an encoder-only model.
  • A3.1.1.2 Multi-Query Attention (MQA) Single K, V projection shared across all heads. Reduces KV cache by H×. Industry standard: an earlier frontier model (a multimodal frontier lab 2022), an open-weights model. Reduces memory but mild quality loss.
  • A3.1.1.3 Grouped-Query Attention (GQA) K, V shared across groups of heads. Compromise between MHA and MQA. Industry standard: a 2023-generation open-weights model (70B), a leading open-weights model, a sparse-MoE frontier model. Now dominant frontier choice.
  • A3.1.1.4 Sliding Window / Local Attention Attention restricted to local window. Industry standard: a sliding-window frontier model uses sliding window 4096. Trade-off: linear attention cost, limited long-range coupling.
  • A3.1.1.5 FlashAttention IO-aware attention implementation: reduces memory access by tiling. Industry standard: Universally adopted. FlashAttention-2 (Dao 2023) and FlashAttention-3 (Shah 2024) progressive optimizations.
A3.1.2 Feed-forward network (FFN)
Per-token MLP after attention.
+ deeper detail (3 leaves)
  • A3.1.2.1 Standard FFN (GELU) Two linear layers with GELU between. Industry standard: a foundational decoder-only model/3, an encoder-only model. Hidden dim typically 4× model dim.
  • A3.1.2.2 SwiGLU Gated linear unit with Swish activation. Three matrices instead of two. Industry standard: an earlier frontier model, an early open-weights model/2/3, an open-weights frontier lab. Now dominant. Hidden dim ~2.67× to keep parameter count constant.
  • A3.1.2.3 GeGLU GLU variant with GELU activation. Industry standard: Used in some models (a small open-weights model). Less common than SwiGLU.
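The "~2.67×" hidden width for SwiGLU comes from matching the parameter count of a standard two-matrix FFN. A small sketch of that arithmetic, with biases ignored and `d_model = 8192` chosen only for illustration; both helper names are hypothetical.

```python
def ffn_params_standard(d_model, mult=4):
    """Two matrices: d_model -> mult * d_model -> d_model."""
    return 2 * d_model * (mult * d_model)


def ffn_params_swiglu(d_model, d_hidden):
    """Three matrices: gate and up projections (d_model -> d_hidden) plus the down projection."""
    return 3 * d_model * d_hidden


d_model = 8192
# Solve 3 * d * h = 2 * d * (4 * d) for h, giving h = 8 * d / 3 (two thirds of the 4x width).
d_hidden = round(8 * d_model / 3)

standard = ffn_params_standard(d_model)
gated = ffn_params_swiglu(d_model, d_hidden)
print(f"hidden/model ratio: {d_hidden / d_model:.3f}")
print(f"standard FFN params: {standard:,}")
print(f"SwiGLU FFN params:   {gated:,}")
```

Real models usually round the hidden width to a hardware-friendly multiple, so the ratio in practice is only approximately 8/3.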
A3.1.3 Normalization
Activation normalization layer.
+ deeper detail (3 leaves)
  • A3.1.3.1 LayerNorm Standard layer normalization (Ba 2016). Industry standard: a foundational decoder-only model, an encoder-only model, an encoder-decoder model. Largely superseded by RMSNorm at frontier.
  • A3.1.3.2 RMSNorm Root-Mean-Square normalization. No mean centering, only scaling. Industry standard: an early open-weights model/2/3, an open-weights frontier lab, a small open-weights model. Now dominant. ~10% faster than LayerNorm.
  • A3.1.3.3 Pre-norm vs Post-norm Whether normalization is applied before or after the residual. Industry standard: Pre-norm dominant since a foundational decoder-only model; better training stability at depth.
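The difference between the two normalizations listed above is small enough to show directly. A minimal pure-Python sketch on a single activation vector; the learned gain (and LayerNorm bias) is omitted for brevity, which is a simplification of what real layers do.

```python
import math


def layer_norm(x, eps=1e-6):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-6):
    """Divide by the root-mean-square only; no mean subtraction."""
    mean_square = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_square + eps) for v in x]


activations = [1.0, 2.0, 3.0, 4.0]
ln_out = layer_norm(activations)
rms_out = rms_norm(activations)
print("LayerNorm:", [round(v, 3) for v in ln_out])
print("RMSNorm:  ", [round(v, 3) for v in rms_out])
```

RMSNorm skips one reduction (the mean) per call, which is the source of the throughput gain cited above; for inputs that already have zero mean the two functions return the same values.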
A3.1.4 Residual connections
Skip connections around each sub-layer. Industry standard: Universal. Required for gradient flow at depth.
A3.2 Position Encoding
Positional encoding tells the model where each token sits. Modern choices: RoPE (rotary), ALiBi (linear bias), and NoPE (no explicit positions, model learns implicitly via causal mask). RoPE is dominant. RoPE's frequency choice and scaling determine effective context length. SOTA: RoPE (Su et al., 2021) is standard. Long-context models extend RoPE via NTK-aware scaling, YaRN (Peng et al., 2023), or LongRoPE. A leading open-weights model uses scaling factor 8 to extend from 8K base to 128K context. Frontier 2025 efforts push to 1M+ context (a multimodal frontier model 2M context). Position interpolation methods are key to extending context post-training. e.g. A leading open-weights model: RoPE θ=500K, YaRN-style scaling to 128K · a million-token-context frontier model Pro: 2M context · a long-context frontier model: 200K context
A3.2.1 Absolute position (sinusoidal/learned)
Original Transformer position encoding. Industry standard: Obsolete for new frontier models. Cannot extrapolate beyond training length.
A3.2.2 RoPE (Rotary Position Embedding)
Apply rotation matrix to Q, K based on position. Industry standard: leading open-weights model family, an open replication initiative, a 2023-class frontier model. Now dominant. Better extrapolation than absolute.
A3.2.3 ALiBi (Attention Linear Bias)
Add linear bias to attention scores based on distance. Industry standard: an ALiBi-trained open model, an ALiBi-trained model. Strong extrapolation; less popular than RoPE recently.
A3.2.4 RoPE scaling (Yarn, NTK)
Methods to extend RoPE-trained models to longer contexts post-hoc. Industry standard: Yarn (Peng 2023), NTK-aware scaling. Used to extend a leading open-weights model to 128K context.
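The rotation described under A3.2.2 can be sketched in a few lines. This is an illustrative pure-Python version, not any model's actual implementation: `rope` rotates consecutive pairs of a vector, and the optional `scale` argument applies plain linear position interpolation (positions divided by a factor). NTK-aware scaling and YaRN adjust frequencies differently and are not implemented here.

```python
import math


def rope(vec, pos, theta=10_000.0, scale=1.0):
    """Rotate consecutive pairs of `vec` by angles that grow with `pos`.

    scale > 1 applies linear position interpolation: positions are divided
    by `scale` so a longer sequence maps into the trained position range.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        freq = theta ** (-i / d)       # lower frequency for later pairs
        angle = (pos / scale) * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# Attention scores depend only on the relative offset (3 in both cases).
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 103), rope(k, 100))
print(f"offset 3 near start: {near:.6f}, offset 3 far out: {far:.6f}")

# With scale=8, position 80 is treated like position 10 of the original range.
interpolated = rope(q, 80, scale=8.0)
```

The relative-offset property is what lets RoPE-trained models be extended by rescaling positions or the base θ instead of retraining from scratch.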
A3.3 Mixture of Experts (MoE)
Mixture of Experts (MoE) replaces dense feed-forward layers with multiple 'expert' FFNs and a router that activates only top-k experts per token. Result: larger total parameter count for same active compute. Trade-offs: harder training (load balancing, expert collapse), harder inference (memory holds all experts, only some compute), but better parameter efficiency. SOTA: an open-weights frontier model (V3 class) (Dec 2024): 671B total, 37B active per token, fine-grained MoE with 256 experts and top-8 routing. A sparse-MoE frontier model: 141B total, 39B active, 8 experts top-2. Frontier closed models likely MoE (a current-generation frontier model, a leading frontier model, a multimodal frontier model speculated). Auxiliary-loss-free load balancing (an open-weights frontier model (V3 class) innovation) avoids the gradient pathologies of traditional MoE training. e.g. a sparse-MoE frontier model/22B: 8 experts, top-2 · an open-weights frontier model (V3 class): 256 experts, top-8 + 1 shared · Switch Transformer: 1 expert (top-1), early MoE
A3.3.1 Switch Transformer (top-1 routing)
Each token routed to single expert. Industry standard: Switch Transformer pioneered. Simpler routing but load-balancing harder.
A3.3.2 Sparse top-K (a sparse-MoE frontier model-style)
Each token routed to top-K experts (K=2 typically). Industry standard: the sparse-MoE frontier model family. 8 experts, top-2 active per token.
A3.3.3 Expert Choice routing
Experts choose tokens (not tokens choose experts). Better load balancing. Industry standard: Used in some a multimodal frontier lab models. Less popular publicly.
A3.3.4 Load balancing
Auxiliary loss to keep expert utilization balanced. Industry standard: Standard practice in all MoE training.
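Token-choice top-k routing, as described for A3.3.1 and A3.3.2, can be sketched as a softmax over router logits followed by selecting and renormalizing the k largest probabilities. The logits below are made-up values, and `top_k_route` is a hypothetical helper; production routers add load balancing and capacity limits that are omitted here.

```python
import math


def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    peak = max(router_logits)
    exps = [math.exp(z - peak) for z in router_logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    # Renormalize so the selected experts' weights sum to 1.
    return [(i, probs[i] / norm) for i in chosen]


def active_ratio(active_params_b, total_params_b):
    """Fraction of parameters used per token."""
    return active_params_b / total_params_b


logits = [0.1, 2.0, -1.0, 1.5, 0.0, 0.3, -0.5, 0.9]  # one token, 8 experts (made up)
routing = top_k_route(logits, k=2)
print("selected experts and weights:", [(i, round(w, 3)) for i, w in routing])

# Active/total ratios quoted in this section.
print(f"37B of 671B active: {active_ratio(37, 671):.1%}")
print(f"39B of 141B active: {active_ratio(39, 141):.1%}")
```

Setting `k=1` gives Switch-style routing, where each token's output comes from a single expert.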
A3.4 Depth/Width Allocation
Normalization stabilizes training and enables deep networks. Choice: LayerNorm (original transformer), RMSNorm (simpler, faster, equally effective), or various others. Position: pre-norm (before each sublayer, dominant) vs post-norm (after, original transformer, harder to train deep). Frontier choice: pre-RMSNorm. SOTA: Pre-RMSNorm is universal in 2024+ frontier dense models (open-weights models, an open-weights frontier provider). RMSNorm omits mean-subtraction (just scales by RMS) — empirically equivalent quality, ~10% faster. Some research on QK-norm (normalize Q, K separately for attention stability), used in some 2024 models. e.g. open-weights models, an open-weights frontier provider: pre-RMSNorm · Original Transformer: post-LayerNorm · a foundational decoder-only model/3: pre-LayerNorm
A3.4.1 Aspect ratio
Ratio of model depth to width. Industry standard: Empirical sweet spot ~80-128 depth for largest models. Very deep (>200 layer) shown not to help.
A3.4.2 Hidden dimension selection
Model dimension per layer. Industry standard: A leading open-weights model (70B class) uses 8192. Hidden dim = 128 × number of heads (head dimension 128) is a common pattern.
A3.4.3 Layer count
Number of transformer blocks. Industry standard: A leading open-weights model (70B class) = 80 layers, a leading open-weights flagship model = 126 layers. Scales sublinearly with parameters.
A3.5 Embedding & Output Projection
Activation function choice in feed-forward layers. Modern frontier models use SwiGLU (Swish-gated linear unit) or GeGLU. These gated activations consistently outperform ReLU/GeLU at the same parameter count, at the cost of an extra weight matrix (3 weights per FFN instead of 2). SOTA: SwiGLU is standard. The FFN hidden dimension is reduced to 2/3 of the usual 4× (about 2.67× model dim) to compensate for the extra matrix, keeping parameter count constant. Gives ~1-2% perplexity improvement over GeLU at no added parameter cost. e.g. open-weights models, an open-weights frontier provider: SwiGLU · an earlier frontier model: SwiGLU · a foundational decoder-only model/3: GeLU
A3.5.1 Tied embeddings
Whether input embedding and output projection share weights. Industry standard: a foundational decoder-only model ties. leading open-weights model family does not tie (separate matrices). Tradeoff: parameter count vs flexibility.
A3.5.2 Output head
Final projection to vocabulary logits. Industry standard: Linear projection. Usually preceded by final RMSNorm.
A3.6 Long-Context Architecture
Long-context handling extends the model's effective context window beyond pre-training. Methods: (1) train with long context throughout (expensive), (2) train short → extend via position interpolation + fine-tune, (3) RAG / external memory (skip true long context). Frontier 2025: 200K-2M true context. SOTA: A leading open-weights flagship model: 128K context via YaRN-style RoPE scaling + long-context fine-tune. A million-token-context frontier model Pro: 1M-2M context with ring attention and other tricks. A long-context frontier model: 200K. A current-generation frontier model-Turbo/4o: 128K. Long-context evaluation shifted from NIAH (needle in haystack, easy) to RULER and BABILong (multi-hop reasoning, harder). e.g. a million-token-context frontier model: 2M context · a long-context frontier model: 200K · a leading open-weights model: 128K
A3.6.1 Context window size
Maximum sequence length. Industry standard: A leading open-weights model: 128K. A long-context frontier model: 200K. A million-token-context frontier model: 1M-10M. Trend: longer.
A3.6.2 Position extrapolation
Train on shorter context, extend at inference. Industry standard: RoPE scaling (Yarn, NTK), position interpolation. A leading open-weights model trained on 8K, extended to 128K via RoPE scaling + continued training.
A3.6.3 KV cache optimization (architectural)
Architectural choices that reduce KV cache size. Industry standard: GQA (A3.1.1.3), sliding window (A3.1.1.4), MQA (A3.1.1.2). Architectural KV reduction directly enables long context.
A3.7 Activation Precision & Dtype
Multimodal integration: how non-text modalities enter the model. Two approaches: (1) Late fusion — separate vision/audio encoders → projector → frozen LLM, used in LLaVA, early a leading frontier model. (2) Native multimodal — interleaved image/audio/text tokens trained jointly from pre-training, used in a multimodal frontier model, a frontier multimodal model, Chameleon. Native is harder but enables true cross-modal reasoning. SOTA: Native multimodal dominates 2024+ frontier (a frontier multimodal model, a multimodal frontier model 2.0, a long-context frontier model). Vision tokens generated by ViT-style encoder (CLIP, SigLIP) and inserted into token stream. Image resolution often dynamic (e.g., AnyRes, Pixtral): 448² base, up to 1024² for detail. Audio: Whisper-style encoder or native audio tokens. Video: temporal sampling + frame tokens. e.g. a frontier multimodal model: native multimodal · a multimodal frontier model 2.0: native multimodal + image generation · a leading open-weights model.2 Vision: late fusion (vision adapter)
A3.7.1 Mixed precision (BF16/FP16)
Compute in lower precision, master weights in FP32. Industry standard: BF16 dominant for training (better dynamic range than FP16). Universal at frontier.
A3.7.2 FP8 training/inference
8-bit floating point for compute. Industry standard: Emerging. A current-generation accelerator supports FP8; some a leading open-weights model phases used FP8.
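The "better dynamic range" of BF16 can be made concrete by computing the largest finite value of each format from its exponent and mantissa widths. A sketch assuming the IEEE-style convention in which the all-ones exponent is reserved for infinities and NaN; `ieee_like_max` is a hypothetical helper.

```python
def ieee_like_max(exp_bits, man_bits):
    """Largest finite value when the all-ones exponent is reserved for inf/NaN."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias
    return (2 - 2 ** -man_bits) * 2.0 ** max_exp


# (exponent bits, mantissa bits) for each format.
FORMATS = {
    "FP32": (8, 23),
    "BF16": (8, 7),
    "FP16": (5, 10),
    "FP8 E5M2": (5, 2),
}

for name, (e_bits, m_bits) in FORMATS.items():
    print(f"{name:9s} max finite ~ {ieee_like_max(e_bits, m_bits):.4g}")

# FP8 E4M3 is left out on purpose: as commonly specified for ML hardware it
# reuses most of the all-ones exponent for normal numbers, so its maximum
# (448) does not follow the rule above.
```

BF16 keeps FP32's 8 exponent bits, so its range is nearly identical to FP32 while FP16 tops out at 65504, which is why FP16 training typically needs loss scaling and BF16 usually does not.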
A3.8 Architecture Variants
Reasoning architectures: same base architecture, trained to produce long chains of thought before answering. o1 (a leading frontier lab) and R1 (an open-weights frontier provider) demonstrate that scaling test-time compute via reasoning yields substantial capability gains, especially on math and code. Architecturally identical to standard LLMs; the innovation is in training (RL with reward on outcome) and inference (let it think). SOTA: o1 (2024) demonstrated that hidden chain-of-thought before answer dramatically improves AIME, codeforces, GPQA. An open-weights reasoning model (Jan 2025) showed open-source path: pure RL from base model with simple rule-based rewards (correct/incorrect) yields reasoning capabilities, distillable to smaller models. Architecture: standard transformer; the magic is RL training and inference-time compute allocation. e.g. A leading frontier lab o1, o3 · an open-weights reasoning model · a long-context frontier model.7 Sonnet (extended thinking)
A3.8.1 Decoder-only
Causal masking, single stack. Industry standard: A leading frontier model family, open-weights models, a leading frontier model. Dominant frontier choice.
A3.8.2 Encoder-decoder
Separate encoder + decoder, used in an encoder-decoder model, a multilingual translation model. Industry standard: Less popular for general LLMs. Strong for translation, summarization.
A3.8.3 State-Space Models / a state-space frontier architecture
Alternative to attention via state-space recurrence. Industry standard: a state-space frontier architecture (Gu, Dao 2023) and its second-generation successor. Hybrid Transformer+SSM models are emerging (hybrid SSM-attention models). Not yet mainstream at frontier.
A4

Training

35 sub-endpoints mapped
MZN Provisional Position · Partial
Training methodology documented; frontier-scale execution pending (Phase 3 scope)
Complete training methodology documented across model selection, fine-tuning strategy (parameter-efficient methods), optimizer configuration, learning rate schedules, batch strategy, stability control (gradient clipping, weight initialization, loss-spike recovery), parallelism strategy (data, tensor, and sharded), and checkpoint management. Reviewer-grade methodology and reference inventory exist. Frontier-scale execution at the 10K+ accelerator class remains a partnership-scope dependency.
Definition

Training infrastructure is the orchestration layer that turns architecture + data + compute into a trained model. At frontier scale (10K+ GPUs, weeks of training), every component matters: distributed parallelism strategy, optimizer state management, mixed-precision arithmetic, failure recovery, checkpoint frequency, gradient accumulation, learning rate scheduling. A 1% throughput improvement at frontier scale = millions of dollars.

State of the Art (2025–2026)

Frontier training stacks: a leading accelerator vendor's tensor-parallelism reference implementation + an open optimization framework (PyTorch), JAX/MaxText (a constitutional-methods frontier lab, a multimodal frontier lab). 4D parallelism is standard: data + tensor + pipeline + expert (for MoE). Current- and next-generation accelerators with InfiniBand. BF16 mixed-precision, FP8 emerging (a current-generation accelerator+). Checkpoint to S3/GCS every N steps with async writes. Auto-recovery from node failure.

Key Decisions
  • Framework (PyTorch ecosystem vs. JAX)
  • Parallelism strategy
  • Precision (BF16 vs. FP8)
  • Optimizer (AdamW vs. Lion vs. distributed Shampoo)
  • LR schedule shape
  • Checkpoint frequency
  • Gradient clipping
Trade-offs
  • More parallelism → larger models possible, communication overhead
  • FP8 → 2x throughput, training instability risk
  • Frequent checkpoints → resilience, write bandwidth
Numbers & Ablations
  • A leading open-weights flagship model training: 16K H100s × ~54 days × 700W ≈ 15 GWh at the GPUs alone (a ~22 GWh total would imply roughly 1.5× facility overhead), MFU ~38%. Total compute ~3.8e25 FLOPs.
  • an open-weights frontier model (V3 class) 671B-MoE training: 2K H800s × ~57 days, FP8 mixed-precision, 14.8T tokens, $5.6M reported (excludes ablations). 18.8% of a leading open-weights model's compute, comparable benchmark performance.
  • MFU benchmarks: 40-50% is good at frontier scale; >55% rare and only with extensive optimization. an earlier frontier model achieved 46% at 540B scale.
  • Failure rate: GPU failures at frontier scale ~3-5% of GPUs/week; 1-3 failures/day on 16K cluster. Without auto-recovery, multi-week runs impossible.
  • Optimizer state cost: AdamW = 12 bytes/param FP32 master + momentum + variance. For 405B model: ~5TB. Distributed via ZeRO-3/FSDP across DP ranks.
  • FP8 training precision: an open-weights frontier model (V3 class) reports <0.05% loss penalty vs BF16 with selective high-precision for sensitive ops, ~1.8× throughput.
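The cluster-scale figures in the list above can be sanity-checked with simple arithmetic. A sketch under stated assumptions: "16K" and "2K" are taken as 16,384 and 2,048 GPUs, energy counts GPU board power only (so it is a lower bound on facility energy), and the $2 per GPU-hour rental rate is an assumption used to relate GPU-hours to the reported cost.

```python
HOURS_PER_DAY = 24


def gpu_hours(n_gpus, days):
    return n_gpus * days * HOURS_PER_DAY


def gpu_energy_gwh(n_gpus, days, watts_per_gpu):
    """Energy drawn by the GPUs alone, in gigawatt-hours."""
    watt_hours = gpu_hours(n_gpus, days) * watts_per_gpu
    return watt_hours / 1e9


# Dense flagship run: 16K GPUs for ~54 days at 700 W each.
dense_hours = gpu_hours(16_384, 54)
dense_gwh = gpu_energy_gwh(16_384, 54, 700)

# MoE run: 2K GPUs for ~57 days, priced at an assumed $2 per GPU-hour.
moe_hours = gpu_hours(2_048, 57)
moe_cost_usd = moe_hours * 2.0

print(f"dense run: {dense_hours / 1e6:.1f}M GPU-hours, {dense_gwh:.1f} GWh at the GPUs")
print(f"MoE run:   {moe_hours / 1e6:.2f}M GPU-hours, ~${moe_cost_usd / 1e6:.1f}M at $2/GPU-hour")
```

Cooling, networking and host CPUs add to the GPU-only energy, so facility-level totals are higher than the value computed here.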
Open Questions
  • Optimal LR schedule shape: cosine vs WSD vs constant-then-decay — which wins at 10T+ token scale? No frontier ablation published.
  • Distributed Shampoo vs AdamW at frontier: a constitutional-methods frontier lab reportedly uses Shampoo; no public head-to-head exists at >100B scale.
  • Training stability: are loss spikes random hardware artifacts, deterministic numerical issues, or signal of optimization pathology? Frontier labs disagree.
  • Annealing phase impact: a leading open-weights model reports gains from final annealing; isolated effect vs. confound with high-quality data? Unclear.
  • Cross-architecture parallelism transfer: know-how for parallelizing dense models transfers only partially to MoE (expert parallelism is novel). An open-weights frontier provider had to develop new techniques.

Reference analyst note. an open-weights frontier model (V3 class)'s $5.6M-equivalent demonstrated the field has been overspending by 5-10×. The next 2 years will see massive efficiency gains as algorithmic improvements (FP8, fine-grained MoE, better parallelism, better data) compound. Frontier 'training compute' as the dominant moat is collapsing. The new moat is post-training infrastructure, RL environment quality, and inference-time compute scaling. Anyone with 1K H100s can now produce competitive models — the bottleneck has moved upstream of pre-training to data and downstream to RL.

Examples

A leading open-weights flagship model: 16K H100s for ~30M GPU-hours, BF16, 4D parallel · an open-weights frontier model (V3 class): 2K H800s, FP8 mixed-precision (innovation) · a constitutional-methods frontier lab: JAX on a custom-silicon accelerator

References (Academic)

Shoeybi et al., a tensor-parallelism reference implementation (2019) · Rajbhandari et al., ZeRO/an open optimization framework (2020) · a leading open-weights model paper (2024) · an open-weights frontier model (V3 class) report (2024)

Sub-endpoint anatomy — 35 items mapped
A4.1 Optimizer
Distributed parallelism strategy splits the model and data across many GPUs. Four dimensions: Data Parallelism (DP, replicate model, split batch), Tensor Parallelism (TP, split each layer's matrix multiply across GPUs), Pipeline Parallelism (PP, different layers on different GPUs), Expert Parallelism (EP, MoE experts across GPUs). Frontier uses all four ('4D parallelism'). SOTA: A leading open-weights flagship model: TP=8 (within node), PP=16, DP=128, total 16K GPUs. An open optimization framework ZeRO partitions optimizer state, gradients, parameters across DP ranks for memory. FSDP (PyTorch native) is similar to ZeRO-3. Communication-compute overlap critical: parallelism choice depends on InfiniBand topology and node-local NVLink bandwidth. e.g. A leading open-weights flagship model: 8×16×128 = 16K GPUs · an open optimization framework ZeRO-3 + TP commonly · an open-weights frontier lab Large: similar 4D
A4.1.1 SGD with momentum
Original optimizer. Industry standard: Obsolete for LLM pre-training; cannot match Adam-family at scale.
A4.1.2 AdamW
Adam with decoupled weight decay. Industry standard: Universal at frontier. β1=0.9, β2=0.95 typical for LLMs (β2=0.999 for general DL).
A4.1.3 LAMB
Layer-wise adaptive moments. Designed for very large batch sizes. Industry standard: Used in some an encoder-only model pre-training. Less common for decoder-only LLMs.
A4.1.4 Lion
Sign-momentum-only optimizer; less memory than AdamW. Industry standard: Emerging. Some a multimodal frontier lab models report success.
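For reference, the AdamW update mentioned in A4.1.2 can be written out for a single scalar parameter. This is a minimal sketch: β1, β2 and the weight decay follow values quoted in this section, while the learning rate of 3e-4 is an arbitrary illustrative choice and `adamw_step` is a hypothetical helper, not a framework API.

```python
import math


def adamw_step(param, grad, m, v, step, lr=3e-4, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update for a scalar parameter; returns (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment (variance)
    m_hat = m / (1 - beta1 ** step)             # bias correction, step counts from 1
    v_hat = v / (1 - beta2 ** step)
    # Decoupled weight decay: applied to the parameter directly, not via the gradient.
    param = param - lr * weight_decay * param
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Toy usage: minimize f(x) = x^2, whose gradient is 2x.
x, m, v = 1.0, 0.0, 0.0
first_x, first_m, first_v = adamw_step(x, 2 * x, m, v, step=1)
for step in range(1, 201):
    x, m, v = adamw_step(x, 2 * x, m, v, step)
print(f"x after 200 steps: {x:.4f}")
```

The two moment buffers `m` and `v` are the per-parameter optimizer state that sharding schemes distribute across ranks.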
A4.2 Learning Rate Schedule
Mixed-precision training uses lower-precision arithmetic (BF16, FP8) for compute while keeping high-precision (FP32) master weights for stability. Doubles effective compute and halves memory. BF16 (Brain Float 16) has FP32's exponent range, avoiding overflow issues of FP16. FP8 (E4M3, E5M2) is the new frontier — 2x throughput vs BF16 but training stability harder. SOTA: BF16 mixed-precision is standard. FP8 mixed-precision validated at scale by an open-weights frontier model (V3 class) (671B MoE trained in FP8). Requires careful scaling, gradient handling, and selective high-precision for sensitive ops (LayerNorm, softmax, MoE gating). Trade-off: 2x throughput, ~5x more engineering complexity. e.g. A leading open-weights model: BF16 + FP32 master · an open-weights frontier model (V3 class): FP8 mixed-precision (first frontier-scale demonstration)
A4.2.1 Warmup phase
Linear ramp from 0 to peak LR over initial steps. Industry standard: Universal. Typically 0.5-2% of total steps. A 2023-generation open-weights model used 2000 steps.
A4.2.2 Cosine decay
Cosine curve from peak to ~10% of peak LR. Industry standard: an early open-weights model/2/3 use cosine. Min LR typically 0.1× peak.
A4.2.3 Linear decay
Linear from peak to min. Industry standard: Common in fine-tuning; pre-training more often uses cosine (e.g. an earlier frontier model).
A4.2.4 WSD (Warmup-Stable-Decay)
Constant LR after warmup, decay only at end. Industry standard: Used in MiniCPM, allows continued training without LR planning.
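The schedule shapes listed above are easy to compare as functions of the step index. A sketch with illustrative hyperparameters (peak LR, step counts and the linear decay tail for WSD are assumptions chosen for the example, not values taken from any specific model).

```python
import math


def warmup_cosine(step, total_steps, peak_lr, warmup_steps, min_ratio=0.1):
    """Linear warmup from 0, then cosine decay to min_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return peak_lr * (min_ratio + (1 - min_ratio) * cosine)


def wsd(step, total_steps, peak_lr, warmup_steps, decay_steps):
    """Warmup-Stable-Decay: linear warmup, constant plateau, linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


TOTAL, PEAK, WARMUP = 100_000, 3e-4, 2_000
for s in (0, 1_000, 2_000, 50_000, 95_000, 100_000):
    cos_lr = warmup_cosine(s, TOTAL, PEAK, WARMUP)
    wsd_lr = wsd(s, TOTAL, PEAK, WARMUP, decay_steps=10_000)
    print(f"step {s:>7}: cosine={cos_lr:.2e}  wsd={wsd_lr:.2e}")
```

Because the WSD plateau does not depend on `total_steps`, a run can be extended by pushing the decay phase later, which is the "continued training without LR planning" property noted above.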
A4.3 Batching
Optimizer state management. AdamW (decoupled weight decay) is universal for LLM training. Stores 2 floats per parameter (momentum, variance) in addition to the parameter itself, in FP32 = 12 bytes/param overhead vs the 2-byte BF16 weight. ZeRO/FSDP partition this state across DP ranks to fit large models. Distributed Shampoo (newer, 2nd-order method) shows promise. SOTA: AdamW with β1=0.9, β2=0.95 (slightly lower than the 0.999 of original) is standard for LLMs. Weight decay 0.1 typical. Lion (Chen et al., 2023) saves memory but doesn't consistently outperform. Distributed Shampoo demonstrated at scale (a constitutional-methods frontier lab, others) for slight efficiency gain. e.g. Most frontier models: AdamW · a constitutional-methods frontier lab: Distributed Shampoo (reported) · Some open: Lion
A4.3.1 Global batch size
Total tokens processed per gradient update. Industry standard: A leading open-weights flagship model used 16M tokens/batch. Frontier ranges 4M-32M tokens.
A4.3.2 Sequence packing
Concatenating multiple documents into single sequence to avoid padding waste. Industry standard: Standard. Documents joined with EOS separator. Some pipelines use document-attention masking to prevent cross-document attention.
A4.3.3 Gradient accumulation
Accumulate gradients over micro-batches before update. Industry standard: Used to achieve large effective batch when memory limits per-device batch.
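Sequence packing with an EOS separator and an optional document mask (A4.3.2) can be sketched as follows. Token ids and the EOS id are made up, the ragged tail of the stream is simply dropped, and both helper names are hypothetical.

```python
def pack_sequences(docs, seq_len, eos_id):
    """Concatenate tokenized documents with EOS separators and cut fixed-length rows.

    Returns (rows, doc_ids): doc_ids records which document each token came from,
    so a mask can block attention across document boundaries.
    """
    stream, owners = [], []
    for doc_id, tokens in enumerate(docs):
        for tok in list(tokens) + [eos_id]:
            stream.append(tok)
            owners.append(doc_id)
    n_rows = len(stream) // seq_len  # leftover tokens are dropped in this sketch
    rows = [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_rows)]
    doc_ids = [owners[i * seq_len:(i + 1) * seq_len] for i in range(n_rows)]
    return rows, doc_ids


def may_attend(doc_ids_row, query_idx, key_idx):
    """Causal mask restricted to tokens of the same document."""
    return key_idx <= query_idx and doc_ids_row[query_idx] == doc_ids_row[key_idx]


docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
rows, doc_ids = pack_sequences(docs, seq_len=4, eos_id=0)
for row, ids in zip(rows, doc_ids):
    print(row, ids)
```

Without the document mask, the last token of the second row would attend to tokens from a different document; with it, every position sees only its own document's prefix.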
A4.4 Parallelism
Learning rate schedule shapes the optimization trajectory. Standard pattern: warmup (linear from 0 to peak over 1-3% of training) → main schedule (cosine decay to 10% of peak, or constant). Cooldown / annealing at end (decay further, sometimes with high-quality data only) is increasingly common. SOTA: A leading open-weights model: cosine decay over 15T tokens. Annealing phase: final 40B tokens with high-quality data + linear LR decay to 0. WSD (Warmup-Stable-Decay) schedule (Hu et al., 2024) demonstrated equivalent quality with simpler shape: warmup → constant → linear decay. Allows easier intermediate evaluation. e.g. A leading open-weights model: cosine + annealing · Many open models: WSD (MiniCPM)
A4.4.1 Data parallelism
Replicate model, split batch. Industry standard: Foundation. All other parallelism layers compose on top.
A4.4.2 Tensor parallelism
Split individual layer matrices across devices. Industry standard: a tensor-parallelism reference implementation style. Typically 8-way (within node, NVLink-bound).
A4.4.3 Pipeline parallelism
Split layers across devices, sequential micro-batches. Industry standard: GPipe / a tensor-parallelism reference framework-style 1F1B. Used for very large models. A leading open-weights flagship model uses 16-way pipeline.
A4.4.4 ZeRO / FSDP
Shard optimizer states, gradients, parameters across data-parallel ranks. Industry standard: ZeRO-3 / FSDP universal at frontier. Reduces memory ~N×.
A4.4.5 Sequence parallelism
Split sequence dimension across devices. Industry standard: Used for long-context training. Ring attention (Liu 2023) is reference.
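The parallelism degrees and optimizer-state figures quoted in this section combine into a quick memory estimate. A sketch assuming the 8 × 16 × 128 layout and 12 bytes per parameter of FP32 AdamW state mentioned above, and assuming (as a simplification) that state is split evenly across every GPU; activations, gradients and BF16 weights are not counted.

```python
def world_size(tp, pp, dp):
    """Total GPUs for a tensor x pipeline x data parallel layout."""
    return tp * pp * dp


def adamw_state_bytes(n_params, bytes_per_param=12):
    """FP32 master weight + momentum + variance = 3 x 4 bytes per parameter."""
    return n_params * bytes_per_param


gpus = world_size(tp=8, pp=16, dp=128)
total_state = adamw_state_bytes(405e9)
per_gpu_state = total_state / gpus  # assumes a perfectly even split

print(f"GPUs: {gpus}")
print(f"optimizer state, whole model: {total_state / 1e12:.2f} TB")
print(f"optimizer state, per GPU:     {per_gpu_state / 1e9:.2f} GB")
```

Unsharded, the same state would have to fit on each data-parallel replica, which is the problem ZeRO and FSDP address.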
A4.5 Loss
Checkpoint and recovery: at frontier scale, hardware fails frequently (1-5% nodes per day). Without robust recovery, days of work lost. Modern stacks: async checkpoint to object store every 1000-5000 steps, automatic node replacement, resume from latest checkpoint. SOTA: Async checkpoint writers (TorchSnapshot, custom) overlap checkpoint I/O with compute. Tiered storage: hot (NVMe) for last few checkpoints, cold (S3) for archive. Failure detection via heartbeat. A leading accelerator vendor NCCL handles transient communication failures. Automatic restart from latest checkpoint with new node assignment in minutes. e.g. Most frontier labs: async checkpoint, multi-tier storage · Open: torchsnapshot, custom
A4.5.1 Cross-entropy on next token
Standard autoregressive language modeling loss. Industry standard: Universal.
A4.5.2 Z-loss / aux losses
Auxiliary loss to stabilize logits scale. Industry standard: an earlier frontier model uses z-loss; helps numerical stability. MoE uses load-balancing aux loss (cross-link to A3.3.4).
A4.6 Training Stability
Training stability. Catastrophic loss spikes can destroy weeks of work. Sources: numerical instability in attention/normalization, gradient explosions from outlier batches, hardware failures, NaN propagation. Detection: gradient norm monitoring, loss anomaly detection. Recovery: rollback to checkpoint, skip bad batch, lower learning rate. SOTA: A leading open-weights model paper documents stability work: pre-norm + careful weight init + gradient clipping at 1.0 + LR warmup. Single rank's hardware degradation can cause loss spike across full cluster (NCCL synchronization). Frontier: automatic anomaly detection on gradient norms, auto-rollback on spike. e.g. A leading open-weights model stability section · an earlier frontier model training notebook (post-mortem of spikes) · OPT paper (training instability documented)
A4.6.1 Gradient clipping
Clip gradient norm to prevent explosion. Industry standard: Universal. Typical max norm 1.0.
A4.6.2 Weight initialization
Initial weight distribution. Industry standard: Truncated normal with std scaled by 1/sqrt(d) or layer-aware (e.g. An open replication initiative init).
A4.6.3 Loss spike recovery
Detection and rollback of training instabilities. Industry standard: A leading open-weights model paper documents rollback procedures. Detection via running variance of loss.
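Gradient clipping by global norm and loss-spike detection via running statistics can be sketched together. This is an illustrative version only: the threshold of 4 standard deviations, the EMA decay of 0.99 and the 20-step warmup are assumptions, and `LossSpikeDetector` is a hypothetical class, not a documented procedure from any lab.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


class LossSpikeDetector:
    """Flags a loss that exceeds the running mean by `threshold` running std devs."""

    def __init__(self, threshold=4.0, decay=0.99, warmup=20):
        self.threshold, self.decay, self.warmup = threshold, decay, warmup
        self.mean, self.var, self.count = 0.0, 0.0, 0

    def update(self, loss):
        self.count += 1
        if self.count == 1:
            self.mean = loss
            return False
        std = math.sqrt(self.var)
        is_spike = (self.count > self.warmup
                    and loss > self.mean + self.threshold * max(std, 1e-8))
        if not is_spike:  # keep spikes out of the running statistics
            delta = loss - self.mean
            self.mean += (1 - self.decay) * delta
            self.var = self.decay * (self.var + (1 - self.decay) * delta * delta)
        return is_spike


clipped, norm = clip_by_global_norm([3.0, 4.0])
detector = LossSpikeDetector()
losses = [2.5 - 0.01 * i for i in range(40)] + [7.0]  # smooth descent, then a spike
flagged = [step for step, loss in enumerate(losses) if detector.update(loss)]
print("gradient norm before clipping:", norm)
print("steps flagged as spikes:", flagged)
```

In a training loop, a flagged step would trigger the recovery actions named above: roll back to the last checkpoint, skip the offending batch, or lower the learning rate.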
A4.7 Training Telemetry
Training telemetry. Per-step metrics: loss, gradient norm, throughput (tokens/sec/GPU), MFU (Model FLOPs Utilization). Per-rank: latency variance, NCCL stalls, GPU utilization. Aggregated dashboards updated every step or every N steps. SOTA: Frontier MFU 40-50%; >55% rare. Per-rank slow-node detection is critical: a single slow rank slows the entire AllReduce. A leading accelerator vendor's DCGM integrated with training logs. e.g. W&B for high-level metrics · DCGM for hardware · Custom dashboards at frontier
A4.7.1 Loss curves
Train and validation loss over time. Industry standard: Universal. W&B / MLflow / proprietary.
A4.7.2 Gradient statistics
Per-layer gradient norms, ratio to weight norm. Industry standard: Standard. Early indicator of instability.
A4.7.3 Activation statistics
Per-layer activation norms, attention entropy. Industry standard: Used at frontier labs to detect early issues.
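MFU, the headline telemetry metric above, is the ratio of achieved to peak FLOPs. A sketch using the common approximation of about 6 FLOPs per parameter per token for dense training (attention FLOPs ignored); the throughput, cluster size and the 989 TFLOP/s dense BF16 peak per GPU are illustrative assumptions, and for MoE models the active parameter count would be used instead.

```python
def model_flops_utilization(tokens_per_sec, n_params, n_gpus, peak_flops_per_gpu):
    """Achieved training FLOPs divided by the cluster's theoretical peak."""
    # ~2 FLOPs per parameter per token for the forward pass, ~4 for the backward pass.
    achieved = 6 * n_params * tokens_per_sec
    return achieved / (n_gpus * peak_flops_per_gpu)


# Hypothetical run: 70B dense model, 1024 GPUs, 1M tokens/sec cluster-wide.
mfu = model_flops_utilization(
    tokens_per_sec=1.0e6,
    n_params=70e9,
    n_gpus=1024,
    peak_flops_per_gpu=989e12,  # assumed dense BF16 peak per GPU
)
print(f"MFU: {mfu:.1%}")

# The same approximation gives total training compute from parameters and tokens.
total_flops = 6 * 405e9 * 15e12
print(f"405B params x 15T tokens: ~{total_flops:.2e} FLOPs")
```

The total-compute estimate lands close to the ~3.8e25 FLOPs quoted earlier in this section, which suggests that figure was derived the same way.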
A4.8 Checkpointing
Checkpointing strategy. Async checkpoint to object store (S3/GCS) every 1000-5000 steps. Tiered storage: hot (NVMe) for last 3-5, cold (S3) for archive. Auto-resume from latest. Optimizer state is the largest part of a checkpoint (8 bytes/param for AdamW moments in FP32, about 4× the BF16 weights). SOTA: Distributed checkpoint formats (FSDP) save in shards that can be read in parallel. Frontier: 1-2 hour wall-clock checkpoint interval, retention 5-10 latest + monthly archives. e.g. TorchSnapshot (PyTorch) · a tensor-parallelism reference implementation distributed checkpoint · FSDP sharded state dict
A4.8.1 Checkpoint frequency
How often to save full state. Industry standard: Hourly to daily depending on cluster size. Frequency balances storage cost vs recovery cost.
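The frequency trade-off can be sketched with Young's first-order approximation, which picks the interval that balances time spent writing checkpoints against expected recomputation after a failure. This is a swapped-in textbook formula, not something the source cites, and it ignores storage cost; the function name and inputs are illustrative.

```python
import math


def young_checkpoint_interval(checkpoint_cost_s, mtbf_s):
    """Approximate optimal seconds between checkpoints: sqrt(2 * C * MTBF).

    checkpoint_cost_s: wall-clock seconds the job stalls per checkpoint.
    mtbf_s: mean time between job-level failures, in seconds.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("both inputs must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


# Hypothetical inputs: 50 s stall per checkpoint, one failure every 10,000 s.
interval_s = young_checkpoint_interval(50.0, 10_000.0)
```

Larger clusters fail more often, so the MTBF shrinks and the suggested interval shortens with it.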
A4.8.2 Checkpoint format
Serialization format and sharding. Industry standard: Sharded across DP ranks. A safer model serialization format is emerging as an alternative to pickle.
A4.8.3 Resumption logic
Loading and continuing from checkpoint. Industry standard: Includes RNG state, optimizer state, dataloader position.
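A minimal sketch of what resumable state has to carry, assuming a single-process job that uses Python's standard `random` module. The function names and the JSON layout are hypothetical, and the optimizer state is a plain dict standing in for real tensors.

```python
import json
import random


def save_state(step, sample_index, rng, optimizer_state):
    """Serialize step counter, dataloader position, RNG state, and optimizer state."""
    version, internal, gauss_next = rng.getstate()
    return json.dumps({
        "step": step,
        "sample_index": sample_index,  # dataloader position
        "rng_state": [version, list(internal), gauss_next],
        "optimizer_state": optimizer_state,
    })


def load_state(blob, rng):
    """Restore the RNG in place and return (step, sample_index, optimizer_state)."""
    state = json.loads(blob)
    version, internal, gauss_next = state["rng_state"]
    # JSON turns tuples into lists, so rebuild the tuple that setstate expects.
    rng.setstate((version, tuple(internal), gauss_next))
    return state["step"], state["sample_index"], state["optimizer_state"]
```

Restoring the RNG and the sample index together is what lets a resumed run see the same data order and the same random draws as an uninterrupted one.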
A5

Compute

21 sub-endpoints mapped
MZN Provisional Position · Gap
No cluster under solo operation; compute is Phase 3 partnership scope
Phase 1 and Phase 2 produced the portfolio without frontier-class compute. Hardware-level monitoring methodology documented at the metric level. Cluster-scale compute access is an acknowledged partnership requirement.
Definition

Compute infrastructure is the physical substrate: accelerator acquisition (GPUs or a custom-silicon accelerator), network topology, and storage. Frontier training requires homogeneous, high-bandwidth GPU clusters with InfiniBand interconnect. Inference requires either similar clusters (for the largest models) or commodity GPUs with optimization. The compute supply chain is a strategic constraint: GPU access is gated by a leading accelerator vendor's allocation and by capital.

State of the Art (2025–2026)

A current-generation accelerator (80GB, 700W, $25-40K/GPU) has been the frontier workhorse since 2023. Its successors are a current-generation accelerator (141GB, late 2024) and a next-generation accelerator on a Blackwell-class architecture (192GB, 2025). A multimodal frontier lab supplies a custom-silicon accelerator (including v6e) to a constitutional-methods frontier lab and uses it internally. Frontier clusters: 16K-100K+ GPUs with non-blocking InfiniBand at 400-800Gbps. CoreWeave, Lambda Labs, and Crusoe provide alternative-cloud GPU access at lower cost than hyperscalers.

Key Decisions
  • Hardware (a current-generation accelerator, a current-generation accelerator, a next-generation accelerator, a custom-silicon accelerator)
  • Cluster size (1K to 100K)
  • Network topology (rail-optimized, fat-tree, dragonfly)
  • Cloud vs. owned
  • Storage tier
Trade-offs
  • Owned → capex + control
  • Cloud → opex + flexibility
  • Larger cluster → frontier-capable, harder utilization
Numbers & Ablations
  • A current-generation accelerator economics: $25-40K capex, ~$2-3/hour cloud rental, 700W TDP. xAI Colossus = 100K a current-generation accelerator × $30K = $3B for GPUs alone (excludes datacenter, network, power).
  • InfiniBand NDR (400Gbps): ~$2K per port. 16K-GPU cluster = ~$40M network alone. Spectrum-X Ethernet ~30% cheaper.
  • A next-generation accelerator (a Blackwell-class architecture): 192GB HBM3e, 2.5× the effective throughput of a current-generation accelerator, NVLink Switch enables a 72-GPU coherent domain. ~$40-60K/GPU.
  • Power infrastructure: frontier datacenter requires 100-300MW dedicated power. 100K a current-generation accelerator cluster = ~70MW IT load + ~30% PUE overhead = ~90MW total.
  • Cluster utilization at frontier: 80-90% sustained during training, 30-50% during ablation phases. Underutilization is real cost.
  • Failure rates: a current-generation accelerator ECC corrections ~1-10/day/GPU normal; >100/day flag for replacement. Mean time to replacement 2-7 days at frontier.
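The capex and power bullets above reduce to simple multiplication, sketched here with the figures quoted in the list. The function names are illustrative, and the power estimate counts GPU TDP only (CPUs, NICs, and storage are ignored), which is an assumption of this sketch.

```python
def gpu_capex_usd(num_gpus, unit_price_usd):
    """GPU-only purchase cost; excludes datacenter, network, and power."""
    return num_gpus * unit_price_usd


def facility_power_mw(num_gpus, watts_per_gpu, overhead_fraction):
    """Return (IT load, total facility load) in megawatts.

    overhead_fraction models cooling and distribution losses on top of IT load.
    """
    it_load_mw = num_gpus * watts_per_gpu / 1e6
    return it_load_mw, it_load_mw * (1.0 + overhead_fraction)


capex = gpu_capex_usd(100_000, 30_000)                   # 3_000_000_000
it_mw, total_mw = facility_power_mw(100_000, 700, 0.30)  # 70 MW IT load, about 91 MW total
```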
Open Questions
  • Is there a near-term alternative to a leading accelerator vendor hardware lock-in for training? AMD MI300X, a wafer-scale accelerator vendor CS-3, a multimodal frontier lab a custom-silicon accelerator competitive; software ecosystem gap remains the gating factor.
  • Confidential compute (a leading accelerator vendor CC, a hyperscaler platform Nitro for GPU): production-ready or theatre? a constitutional-methods frontier lab uses a hyperscaler platform Nitro for third AI Safety Level-relevant workloads; performance overhead poorly characterized publicly.
  • Optimal cluster size: when does adding GPUs hurt training (failure rate × MFU degradation)? Reported sweet spots vary 16K-32K.
  • Power constraints will dominate by 2027-2028: cluster size limited not by capital but by available 100-500MW datacenter sites. Geographic distribution implications unclear.

Reference Analyst Note

Compute infrastructure is becoming a real estate / power infrastructure business as much as a hardware business. A synthetic-data-focused lab signing 20-year nuclear PPA with Three Mile Island, xAI building gas turbines on-site at Memphis, Stargate's $500B announcement — these reflect that the actual frontier constraint by 2027 is gigawatt-class power, not GPU supply. National strategic positioning of compute (US export controls on H800 to China, EU sovereign cloud requirements) is now first-order policy. Anyone serious about frontier needs to think 5+ years ahead about power and land, not just GPU procurement.

Examples

xAI Colossus: 100K a current-generation accelerator single cluster (2024) · an open-weights frontier lab: ~600K a current-generation accelerator equivalent (2024 reported) · a constitutional-methods frontier lab: a hyperscaler platform a hyperscaler accelerator + GCP a custom-silicon accelerator · an open-weights frontier provider: 2K H800 (export-restricted, smaller scale)

References (Academic)

A leading accelerator vendor a current-generation accelerator datasheet · Selene cluster paper (a leading accelerator vendor)

Sub-endpoint anatomy — 21 items mapped
A5.1 Hardware
GPU choice. A leading accelerator vendor dominance: a current-generation accelerator/a current-generation accelerator/a next-generation accelerator lineage. AMD MI300X gaining inference share. A multimodal frontier lab a custom-silicon accelerator for a multimodal frontier lab/a constitutional-methods frontier lab. Custom ASIC efforts (a wafer-scale accelerator vendor, a high-throughput inference accelerator, a hyperscaler platform a hyperscaler accelerator) for specific workloads. Frontier training is overwhelmingly a leading accelerator vendor-on-InfiniBand. SOTA: A next-generation accelerator (a Blackwell-class architecture, 2025): 192GB HBM3e, 2.5x a current-generation accelerator throughput, NVLink Switch enables 72-GPU coherent domain. Drives 2025-2026 frontier capacity. AMD MI300X: 192GB, competitive on inference, weaker software stack. A high-throughput inference accelerator LPU: extremely low inference latency in production (open-weights-style models). e.g. Frontier labs: a leading accelerator vendor a current-generation accelerator/a current-generation accelerator/a next-generation accelerator · a multimodal frontier lab/a constitutional-methods frontier lab: a custom-silicon accelerator/v6 · a high-throughput inference accelerator: production inference
A5.1.1 a leading accelerator vendor GPU
a prior-generation accelerator, a current-generation accelerator, a current-generation accelerator, a next-generation accelerator (a Blackwell-class architecture). Industry standard: Frontier dominant. A current-generation accelerator most common 2024-2025; a next-generation accelerator ramping 2025-2026.
+ deeper detail (3 leaves)
  • A5.1.1.1 A prior-generation accelerator: 80GB HBM, FP16/BF16 313 TFLOPS. Industry standard: Standard 2020-2023. Still used for many production deployments.
  • A5.1.1.2 Current-generation accelerators (a Hopper-class architecture): 80-141GB HBM, BF16 ~1000 TFLOPS, FP8 support. Industry standard: Dominant 2024-2025. A leading open-weights model trained on 24K H100s. A current-generation frontier model estimated at 25K A100s; a high-throughput frontier model trained on a current-generation accelerator cluster.
  • A5.1.1.3 Next-generation accelerators (a Blackwell-class architecture): newest from a leading accelerator vendor. ~2× FP8 throughput vs a current-generation accelerator. Industry standard: Ramp 2025-2026. New frontier training runs migrating.
A5.1.2 a multimodal frontier lab a custom-silicon accelerator
a custom-silicon accelerator, v5e, v5p, a custom-silicon accelerator. Industry standard: Used internally by a multimodal frontier lab (a multimodal frontier model, an earlier frontier model family). Not generally available outside a multimodal frontier lab Cloud.
A5.1.3 Custom accelerators
a wafer-scale accelerator vendor (wafer-scale), a custom accelerator vendor, a high-throughput inference accelerator (inference), a hyperscaler accelerator (Amazon). Industry standard: Niche. Some used for specific workloads (a high-throughput inference accelerator for inference).
A5.2 Cluster Topology
Network topology determines training scalability. Frontier clusters use non-blocking InfiniBand fabric: every GPU can communicate at full bandwidth with any other GPU simultaneously. Topology choices: fat-tree (oversubscribed at upper levels but cost-effective), rail-optimized (a leading accelerator vendor recommendation), dragonfly (hyperscaler scale). SOTA: Frontier clusters: NDR InfiniBand 400Gbps (some 800Gbps, 2025+). Spectrum-X Ethernet (a leading accelerator vendor, 2024) emerging as alternative. Rail-optimized topology: each GPU has dedicated NIC, rails connected via spine — minimizes hot spots. NVLink Switch (a next-generation accelerator era): 72-GPU NVLink domain enables tensor parallelism across more GPUs without IB hop. e.g. xAI Colossus: 100K a current-generation accelerator, rail-optimized IB · an open-weights frontier lab Grand Teton + RoCE · a constitutional-methods frontier lab: a custom-silicon accelerator pods (mesh)
A5.2.1 InfiniBand vs RoCE
Inter-node fabric: NDR/HDR InfiniBand or RDMA-over-Ethernet. Industry standard: InfiniBand dominant for new builds. 400Gbps NDR per port standard. RoCE used in a hyperscaler platform and a hyperscaler platform.
A5.2.2 Node count
Total nodes in cluster. Industry standard: Frontier clusters: 3000-20000 nodes. A leading open-weights model ~3000 nodes (24576 GPUs).
A5.2.3 GPUs per node
Typically 8 GPUs per node, NVLink-connected. Industry standard: 8× a current-generation accelerator per node standard. NVLink ~900GB/s intra-node.
A5.3 Storage
Storage tier supports training I/O. Hot tier: high-throughput parallel filesystem (Lustre, WekaFS, GPFS) for active dataset and recent checkpoints. Cold tier: object store (S3, GCS) for archive. Bandwidth requirement: 100s GB/s aggregate to keep GPUs fed during data loading. SOTA: WekaFS, VAST Data, DDN are common at frontier. A leading accelerator vendor GPUDirect Storage allows GPU-direct I/O bypassing CPU for ~50% throughput improvement. S3-compatible object stores (S3, GCS, Cloudflare R2) for cold. Asynchronous prefetch and on-the-fly decompression (Zstandard) standard. e.g. Frontier: WekaFS or VAST + S3 · a constitutional-methods frontier lab: GCS + custom · an open-weights frontier lab: Tectonic + Haystack
A5.3.1 Checkpoint storage
Where checkpoints are written and from where they are loaded. Industry standard: High-performance parallel filesystems (Lustre, GPFS, WekaFS). Bandwidth ~TB/s required for fast checkpoint.
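To see why the bandwidth requirement matters, the sketch below estimates full-state checkpoint size and the time to write it. The per-parameter byte counts assume BF16 weights, an FP32 master copy, and two FP32 AdamW moments; that layout, the function names, and the 70B-parameter example are assumptions, since real setups differ.

```python
def checkpoint_bytes(num_params, weight_bytes=2, master_bytes=4, optimizer_bytes=8):
    """Full training-state size in bytes under the assumed mixed-precision layout."""
    return num_params * (weight_bytes + master_bytes + optimizer_bytes)


def write_seconds(total_bytes, bandwidth_bytes_per_s):
    """Idealized write time: size divided by aggregate filesystem bandwidth."""
    return total_bytes / bandwidth_bytes_per_s


# Hypothetical 70B-parameter model.
size = checkpoint_bytes(70_000_000_000)   # 980 GB of state
fast = write_seconds(size, 1e12)          # about 1 s at 1 TB/s aggregate
slow = write_seconds(size, 10e9)          # about 98 s at 10 GB/s aggregate
```

The gap between the two write times is the stall a synchronous checkpoint would impose on every GPU in the job.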
A5.3.2 Data loading
Pre-shuffled, pre-tokenized shards streamed to nodes. Industry standard: Shuffled, indexed, pre-tokenized formats. Avoid per-step computation; load is ~constant per step.
A5.4 Cluster Monitoring
Cluster monitoring and telemetry. At frontier scale, observability is operational requirement. Per-GPU metrics: utilization, memory, power, temperature, ECC errors. Cluster-level: AllReduce throughput, collective communication stalls, network packet loss. Failure prediction (predicting GPU failure before it happens) is active research. SOTA: A leading accelerator vendor DCGM (Data Center GPU Manager) is the standard agent. Prometheus + Grafana for visualization. Custom layers add training-aware metrics (loss spike detection, gradient norm tracking). Frontier labs deploy ML-based anomaly detection on telemetry. ML/security overlay products on GPU telemetry are an emerging commercial category. e.g. DCGM + Prometheus standard · a leading accelerator vendor Run:ai for cluster scheduling · Custom dashboards everywhere
A5.4.1 GPU utilization
MFU (Model FLOPs Utilization), HFU (Hardware FLOPs Utilization). Industry standard: Frontier labs target MFU 40-55%. A leading open-weights model paper reports 38-43% MFU on 16K a current-generation accelerator.
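A rough MFU calculation is sketched below. It uses the common approximation of 6 × parameters FLOPs per trained token (forward plus backward), which ignores attention FLOPs; the function name and example inputs are hypothetical, with the peak figure taken from the ~1000 TFLOPS BF16 number quoted in A5.1.1.2.

```python
def model_flops_utilization(num_params, tokens_per_second, num_gpus, peak_flops_per_gpu):
    """MFU = achieved model FLOPs per second / aggregate peak FLOPs per second.

    tokens_per_second is cluster-wide training throughput.
    """
    achieved_flops = 6.0 * num_params * tokens_per_second
    return achieved_flops / (num_gpus * peak_flops_per_gpu)


# Hypothetical run: 8B parameters, 1M tokens/s across 128 GPUs at 1e15 peak FLOPs each.
mfu = model_flops_utilization(8e9, 1e6, 128, 1e15)  # 0.375
```

HFU would additionally count recomputed FLOPs such as activation checkpointing, so it is never lower than MFU for the same run.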
A5.4.2 Network bandwidth
Inter-node communication monitoring. Industry standard: Critical for tensor + pipeline parallelism. Saturation indicates communication bottleneck.
A5.4.3 Hardware failure detection
Detecting GPU/node failures, silent data corruption. Industry standard: A leading open-weights model reported ~30 GPU failures/day on 16K cluster. Automated detection + restart from checkpoint.
A5.5 Cost
Training cost economics. Frontier training costs: $50M-$500M+ in compute. A leading open-weights flagship model: ~16K H100s × ~30 days × $2-3/hour = ~$23-35M cloud-rented at those inputs; the ~$30-50M estimate implies a longer run or a higher hourly rate. Real costs include data prep, ablations (10-100 small runs at 20-30% of full-train compute), staff, failures. Total program cost is typically 2-5× raw training compute. SOTA: An open-weights frontier model (V3 class): $5.6M reported (final run only; excludes ablations). e.g. A leading open-weights flagship model: ~$30-50M (estimated) · an open-weights frontier model (V3 class): $5.6M reported · xAI Colossus build: $4B+ for 100K a current-generation accelerator
A5.5.1 Training cost estimation
Total compute cost for a training run. Industry standard: A leading open-weights flagship model estimated $50-100M training cost. A current-generation frontier model estimated $100M+. Frontier costs scale with parameter count and tokens.
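The rental arithmetic used in A5.5 can be sketched as follows; function names are illustrative. Multiplying the inputs quoted there (~16K GPUs, ~30 days, $2-3 per GPU-hour) gives roughly $23-35M, below the estimates quoted in this section, so those presumably assume more GPU-hours or a higher rate.

```python
def rental_cost_usd(num_gpus, days, usd_per_gpu_hour):
    """Cloud rental cost of one run: GPUs x hours x hourly rate."""
    return num_gpus * days * 24 * usd_per_gpu_hour


def program_cost_range_usd(final_run_usd, low_multiplier=2.0, high_multiplier=5.0):
    """Apply the 2-5x program multiplier (ablations, data prep, staff, failures)."""
    return final_run_usd * low_multiplier, final_run_usd * high_multiplier


low = rental_cost_usd(16_000, 30, 2.0)    # 23_040_000
high = rental_cost_usd(16_000, 30, 3.0)   # 34_560_000
program = program_cost_range_usd(low)     # (46_080_000.0, 115_200_000.0)
```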
A5.5.2 Cost per token (inference)
$/M tokens for serving. Industry standard: a frontier multimodal model ~$5/M input. A long-context frontier model ~$3/M input. Open models on a current-generation accelerator: $0.20-1.00/M depending on size.
B1

SFT

19 sub-endpoints mapped
MZN Provisional Position · Partial
Demonstration-data shaping methodology documented
Conceptual framework for SFT data shaping is documented at architectural level. Slot-based memory and structured demonstration patterns inform the methodology. Production SFT runs at frontier scale require partnership scope.
Definition

SFT (Supervised Fine-Tuning) takes a pre-trained base model — which is a powerful text completer but not an assistant — and trains it on instruction-response pairs to behave as an assistant. The model learns the chat template, role conventions, refusal patterns, and the basic shape of helpful responses. SFT is universally the first post-training stage; everything else builds on it.

State of the Art (2025–2026)

Quality > quantity is the consensus since LIMA (Zhou et al., 2023) demonstrated 1000 highly-curated examples nearly match millions of crowdsourced ones. Frontier SFT mixtures include: human-written conversations (leading frontier labs use 100K-1M+), reasoning chains (long CoT exemplars), tool-use traces, code with patches, math with solutions. Synthetic SFT (teacher model generates) increasingly common via self-instruct methodology, Evol-Instruct, Magpie.

Key Decisions
  • Dataset size (10K - 10M+)
  • Synthetic vs human mix
  • Multi-turn conversation depth
  • Tool-use data inclusion
  • Math/code ratio
  • Multilingual SFT
  • Number of epochs (typically 2-5)
Trade-offs
  • More data → diminishing returns past ~100K well-curated
  • Synthetic-heavy → cheaper, distributional artifacts
  • Multi-turn → conversational fluency, costs in curation
Numbers & Ablations
  • LIMA: 1000 high-quality examples ≈ 65K crowdsourced examples (Zhou 2023). Quality dominance demonstrated.
  • Synthetic SFT efficiency: Magpie (self-generated from base model) produced datasets matching ShareGPT quality at <1% cost.
  • SFT epoch count: typically 2-5 for instruction tuning, 1-2 for continued pre-training. Beyond 5 epochs: overfitting on style without capability gain.
  • Multi-turn data ratio in modern frontier SFT: 60-80% multi-turn, 20-40% single-turn. ~5-15 average turns in multi-turn examples.
  • Tool-use data: frontier models trained on 100K-1M+ tool-calling examples. xLAM-function-calling-60k is one of the most widely used open datasets.
Open Questions
  • What is the marginal value curve of SFT data? After ~100K well-curated, does the curve flatten or continue rising slowly?
  • Synthetic vs human SFT data: where exactly do they diverge? Anecdotally synthetic struggles with creative tasks, edge cases — no rigorous study.
  • SFT mixing ratios for multi-skill (chat + code + math + tool-use): no published ablation studies at scale.
  • Does SFT actually teach new capability or just elicit / format pre-trained capability? Evidence (LIMA, Magpie) suggests mostly elicitation; deep SFT studies absent.

Reference Analyst Note

SFT is dramatically underrated and over-tuned. Most labs spend too much on SFT data scale (millions of examples) and not enough on quality + diversity. The optimal frontier SFT corpus is probably 100K-500K examples curated to within an inch of their lives. SFT-then-RL is the path; trying to push everything into SFT (Tulu approach) hits diminishing returns visible in current open community.

Examples

A leading open-weights model SFT: ~10M examples mix (human + synthetic) · OpenAssistant: 161K human conversations (open) · Magpie: synthetic from base model self-conversation · Hermes / Nous: open SFT-tuned models

References (Academic)

Zhou et al., LIMA (2023) · Wang et al., self-instruct methodology (2022) · Xu et al., Evol-Instruct (2023) · Xu et al., Magpie (2024)

Sub-endpoint anatomy — 19 items mapped
B1.1 Demonstration Data
Dataset construction strategy. Three sources: (1) human-written conversations (highest quality, expensive), (2) synthetic from teacher model (cheap, scales), (3) curated from existing data (filtered StackExchange, ShareGPT-style). Frontier mix: weighted combination, with diversity sampling. SOTA: leading frontier labs use predominantly human-written for highest quality bands; synthetic for breadth. Open community converged on Magpie-style synthetic + selective human curation. Quality scoring (using stronger teacher model as judge) filters mixed sources. e.g. a constitutional-methods frontier lab Helpful & Harmless dataset (older) · ShareGPT (community, mixed quality) · OpenAssistant Conversations
B1.1.1 Human-written demonstrations
Trained annotators produce ideal responses. Industry standard: InstructGPT used ~13K human demonstrations. Frontier labs use larger, often paid annotators (a major annotation platform, etc.).
B1.1.2 Synthetic demonstrations
LLM-generated responses, often filtered or rewritten by humans. Industry standard: self-instruct methodology (Wang 2023), an early instruction-tuning initiative, a community fine-tuning initiative. Frontier labs increasingly synthetic-heavy.
B1.1.3 Filtered web data
Naturally-occurring instruction-response pairs from web (StackOverflow, forums). Industry standard: Used as additional source. A synthetic-SFT-heavy initiative, a community fine-tuning initiative derive from this approach.
B1.1.4 Quality vs quantity
Trade-off between dataset size and per-example quality. Industry standard: LIMA (Zhou 2023) showed 1000 high-quality examples can rival 50K mediocre. Quality dominates.
B1.2 Training Procedure
Multi-turn conversation training. Single-turn SFT teaches single response; multi-turn teaches dialogue management — context tracking, personality consistency, refusal at appropriate turns. Frontier models trained extensively on multi-turn (5-20 turns). SOTA: Multi-turn SFT data includes: branching conversations (alternative responses), correction/follow-up patterns, mid-conversation context shifts, tool-use loops within conversation. Loss-masked appropriately (only assistant turns contribute to loss, not user turns). e.g. WildChat: 1M+ real-world a consumer LLM chat product conversations (open) · a constitutional-methods frontier lab HH dataset: multi-turn with assistant refusals
B1.2.1 Loss masking
Compute loss only on response tokens, not prompt tokens. Industry standard: Standard. Prevents model from learning to predict prompts.
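Loss masking is usually implemented by building a label sequence in which non-response positions carry an ignore value. The sketch below assumes already-tokenized turns and uses -100, the default `ignore_index` of PyTorch's cross-entropy loss; the function name and the turn format are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(turns):
    """Concatenate turns and mask every token that is not in an assistant turn.

    turns: list of (role, token_ids) pairs, e.g. ("user", [5, 6, 7]).
    Returns (input_ids, labels); the one-position shift for next-token
    prediction is assumed to happen inside the loss computation.
    """
    input_ids, labels = [], []
    for role, token_ids in turns:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)                        # contributes to loss
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))  # masked out
    return input_ids, labels


ids, labels = build_labels([("user", [5, 6, 7]), ("assistant", [8, 9])])
# ids    == [5, 6, 7, 8, 9]
# labels == [-100, -100, -100, 8, 9]
```

The same function covers multi-turn data: every user or system turn is masked, and every assistant turn is trained on.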
B1.2.2 Learning rate (lower than pre-training)
Typical LR 1e-5 to 1e-6 (pre-training is 1e-4 range). Industry standard: 1-2 orders of magnitude lower than pre-training peak LR.
B1.2.3 Epoch count
How many passes over SFT data. Industry standard: 1-3 epochs typical. More risks overfitting on small datasets.
B1.3 Task Coverage
Tool-use SFT data. Trains model to call functions, interpret structured results, and reason over tool outputs. Critical for agent applications. Format: chat with function_call and function_result special tokens, structured JSON arguments, multi-step tool use. SOTA: Frontier models (a long-context frontier model, a frontier multimodal model) trained on millions of tool-use examples. Synthetic generation: model X plays user with task → model Y plays assistant with tool access → trace recorded. Multi-tool, parallel tool calls, tool errors handled. xLAM, Hermes-Function-Calling, Glaive open datasets. e.g. xLAM-function-calling-60k · Glaive-function-calling-v2 · a constitutional-methods frontier lab computer use traces (closed)
B1.3.1 General instruction following
Open-ended Q&A, summarization, rewriting. Industry standard: Foundation. FLAN-style task mixtures typical.
B1.3.2 Reasoning / chain-of-thought
Multi-step reasoning demonstrations. Industry standard: CoT prompting becomes CoT training data. Math, code, logical reasoning examples.
B1.3.3 Tool use / function calling
Demonstrations of correct function call format. Industry standard: Increasingly part of SFT. A leading open-weights model, a long-context frontier model, a current-generation frontier model all SFT'd on tool examples.
B1.4 Cultural & Multilingual Coverage
Reasoning SFT (chain-of-thought training). Teaches the model to produce intermediate reasoning before final answer. Precursor to RL-trained reasoning models (o1, R1). Datasets include math problems with worked solutions, code with debugging traces, multi-step logical puzzles. SOTA: Long-CoT SFT (o1-style) involves traces of thousands of tokens of reasoning, with self-correction, exploration, backtracking. An open-weights reasoning model demonstrated this can be achieved via pure RL from base; SFT distillation transfers reasoning to smaller models. OpenThoughts, Bespoke-Stratos open distillation datasets. e.g. OpenThoughts: 114K reasoning traces · Bespoke-Stratos-17k · MetaMath: reasoning-augmented math
B1.4.1 Multilingual SFT data
Demonstrations in multiple languages. Industry standard: Frontier labs include 10+ languages typically. Quality varies by language.
B1.4.2 Cultural calibration
Region/culture-specific norms and conventions. Industry standard: Limited at frontier; mostly Western-centric. Active research direction.
B1.5 SFT Evaluation
SFT evaluation. Track: instruction-following (IFEval), helpfulness (judges, win-rate), refusal (XSTest), perplexity on held-out chat. Compare against base model and previous SFT version. Track per-domain: code (HumanEval), math (GSM8K), reasoning (MMLU). SOTA: AlpacaEval 2.0, Arena-Hard standard public eval. IFEval for instruction following (~85% frontier). Internal: head-to-head LLM-as-judge vs prior version. Frontier labs: hundreds of eval slices, each tracked per SFT run. e.g. AlpacaEval 2.0 (Dubois 2024) · Arena-Hard-Auto · IFEval (Zhou 2023)
B1.5.1 Held-out demonstration loss
Cross-entropy on held-out instructions. Industry standard: Basic check; correlates with quality but imperfectly.
B1.5.2 MT-Bench / AlpacaEval
LLM-as-judge benchmarks for instruction following. Industry standard: MT-Bench (Zheng 2023), AlpacaEval (Li 2023). Standard for SFT comparison.
B2

Preference Optimization

29 sub-endpoints mapped
MZN Provisional Position · Partial
Output-conformance methodology informs preference design
An output-conformance paradigm reframes preference signal as egress-template adherence — an inversion of the input-blacklist approach. Reduces reward-hacking surface and ties preference optimization to verifiable outputs. Methodology documented; full RLHF/DPO pipeline execution requires partnership scope.
Definition

Preference alignment improves the SFT model's quality, helpfulness, and harmlessness using comparison data: humans (or AI) compare two model outputs and indicate which is preferred. The model learns from pairwise preferences, not single-target answers. Three main methods: RLHF (PPO with reward model), DPO (direct preference optimization, no separate RM), Constitutional methods (AI-generated preferences via principles). Preference alignment moves models from 'competent' to 'good'.

State of the Art (2025–2026)

DPO (Rafailov et al., 2023) became the dominant 2024 method for its simplicity — no PPO, no separate reward model, single training stage. PPO-based RLHF still used at frontier (a leading frontier lab, possibly a constitutional-methods frontier lab). Constitutional methods / RL-from-AI-Feedback (RLAIF) (a constitutional-methods frontier lab) generates preferences via AI-judged adherence to principles, avoiding human annotation cost. Iterative DPO and online DPO push quality further.

Key Decisions
  • Method (DPO, PPO, IPO, KTO, ORPO, RL-from-AI-Feedback (RLAIF))
  • Preference data source (humans, AI judges, both)
  • Preference data scale (10K - 1M+)
  • Iteration count (single pass, iterative)
  • Reference model choice (SFT vs. previous DPO)
Trade-offs
  • DPO: simpler, can over-fit preferences, drift from SFT
  • PPO: harder, better controllable
  • RL-from-AI-Feedback (RLAIF): cheaper, depends on judge quality
Numbers & Ablations
  • DPO vs PPO: DPO ~5-10% lower compute, comparable or slightly better quality on standard benchmarks (Rafailov 2023). PPO retains edge on hard alignment categories per a leading open-weights model paper.
  • Iterative DPO: a leading open-weights model used 4-6 rounds; each round +1-3% on AlpacaEval but diminishing.
  • RL-from-AI-Feedback (RLAIF) vs RLHF preference quality: ~80-90% agreement at category level (Lee 2023). RL-from-AI-Feedback (RLAIF) cheaper by ~50× (no human annotators).
  • Process Reward Models (PRM) on math: ~5-10% accuracy gain over outcome-only on MATH/GSM8K (Lightman 2023).
  • An open-weights reasoning model: reasoning training via pure RL from the base model with rule-based rewards (correct=1, incorrect=0). Reached ~80% on AIME, up from ~10% for the base model.
  • Constitutional methods: ~70% reduction in human annotation cost with quality matching RLHF on helpfulness/harmlessness benchmarks (Bai 2022).
  • Length bias: vanilla DPO produces ~25-40% longer responses than reference SFT — pure length artifact (Singhal 2023). LC-AlpacaEval, SimPO control for this.
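The rule-based reward mentioned in the list above (correct=1, incorrect=0) can be sketched as a small checker. The `Answer:` final-line convention and the function name are assumptions for illustration; real pipelines use task-specific extraction and normalization.

```python
import re


def rule_based_reward(response, reference_answer):
    """Return 1.0 if the final 'Answer: ...' line matches the reference exactly, else 0.0."""
    match = re.search(r"Answer:\s*(.+?)\s*$", response.strip())
    if match is None:
        return 0.0  # no parseable final answer counts as incorrect
    return 1.0 if match.group(1) == reference_answer.strip() else 0.0


good = rule_based_reward("6 * 7 = 42.\nAnswer: 42", "42")   # 1.0
bad = rule_based_reward("6 * 7 = 41.\nAnswer: 41", "42")    # 0.0
```

Because the reward depends only on a verifiable outcome, there is no learned reward model for the policy to exploit, though a weak parser can still be gamed.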
Open Questions
  • Is RLHF (PPO-based) actually better than DPO at frontier scale? Open community converged on DPO; closed labs (a leading frontier lab, possibly a constitutional-methods frontier lab) retain PPO. No public head-to-head at 70B+ scale.
  • Reward model scaling: does a 70B RM provide meaningfully better signal than 13B? Limited public ablation.
  • Process Reward Models beyond math: PRMs work in math (verifiable steps); do they work in code, reasoning, writing? Active but unclear research area.
  • RLVR generalization: an open-weights reasoning model trained on math/code generalized to other reasoning domains. Why? Mechanistic understanding absent.
  • Constitutional methods: how much of its quality comes from the constitution document quality vs the RL-from-AI-Feedback (RLAIF) process? a constitutional-methods frontier lab's constitution is unusually detailed; lower-effort constitutions may not transfer.

Reference Analyst Note

RLHF as a method is mostly cargo-culted. The actual win at frontier comes from: (a) high-quality SFT, (b) RL-from-AI-Feedback (RLAIF) for breadth, (c) RLVR for verifiable tasks, (d) human RLHF only for irreducibly subjective categories. The DPO-vs-PPO debate is a sideshow — both work, choice is engineering preference. The real frontier shift in 2025-2026 is 'preference alignment' becoming 'reasoning alignment' — RL signal moving from human preference to verifiable correctness for hard tasks. This is the most important post-training shift since RLHF itself.

Examples

A leading open-weights model: iterative DPO + RLHF mix · a constitutional-methods frontier lab a leading frontier model: Constitutional methods + RLHF · a leading frontier lab: PPO-based RLHF (historical, current details closed) · Open: Tulu 3 (UltraFeedback DPO + RLVR)

References (Academic)

Christiano et al., RLHF (2017) · Ouyang et al., InstructGPT (2022) · Bai et al., Constitutional methods (2022) · Rafailov et al., DPO (2023) · Lambert et al., Tulu 3 (2024)

Sub-endpoint anatomy — 29 items mapped
B2.1 Preference Data Collection
Reward model (RM) training: a model that takes (prompt, response) and outputs scalar quality score. Trained on pairwise preference data with Bradley-Terry loss. RM is then used in PPO to optimize policy. RM quality is the bottleneck for RLHF. SOTA: RM typically initialized from SFT model. Trained on 100K-1M preference pairs. Modern variants: process reward models (PRM) score each reasoning step (better for math/code), generative reward models (output critique then score), reward model ensembles. Reward hacking is the central pathology: the policy finds responses that the RM scores highly but humans would not prefer. e.g. A leading open-weights model RM: 70B param · a constitutional-methods frontier lab helpfulness/harmlessness RMs · Skywork RM: open frontier RM
B2.1.1 Pairwise comparison
Annotators choose between two responses. Industry standard: Dominant. InstructGPT, a 2023-generation open-weights model, a leading frontier model all use pairwise. Easier than absolute rating.
B2.1.2 Listwise / ranked
Annotators rank K responses. Industry standard: Used in some pipelines. Higher cost per annotation but more signal.
B2.1.3 Absolute Likert ratings
1-5 or 1-7 scale ratings. Industry standard: Less common for preference learning due to inter-annotator variance. Used in eval.
B2.2 Annotator Design
PPO (Proximal Policy Optimization) is the original RLHF algorithm. The policy (LLM) generates responses, the reward model scores them, PPO updates policy to maximize reward while staying close to reference (KL penalty). Notoriously fiddly to train: hyperparameter sensitivity, reward model bottleneck, mode collapse, reward hacking. SOTA: PPO still used at frontier despite DPO's rise — reportedly a leading frontier lab, parts of a constitutional-methods frontier lab stack. Improvements: GRPO (an open-weights frontier provider) removes critic, uses group-relative advantage. RLOO (REINFORCE Leave-One-Out) is simpler PPO alternative. Online iterative variants update RM and policy alternately. e.g. A leading frontier lab InstructGPT/a consumer LLM chat product lineage · an open-weights reasoning model: GRPO · a leading open-weights model: PPO in addition to DPO
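The group-relative advantage that lets GRPO drop the critic can be sketched in a few lines: sample several responses to the same prompt, score them, and normalize each reward against the group. Using the population standard deviation and an epsilon of 1e-6 are assumptions of this sketch; implementations differ on both.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's sample group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to one prompt, scored 1 (correct) or 0 (incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Correct samples get roughly +1, incorrect ones roughly -1.
```

When every sample in a group receives the same reward, all advantages are zero and the prompt contributes no gradient.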
B2.2.1 Annotator selection & training
Recruitment, qualification, training. Industry standard: Frontier labs use vetted contractors (a major annotation platform, an annotation services provider, internal teams). Calibration tests required.
B2.2.2 Inter-annotator agreement
Measuring consistency across annotators. Industry standard: Cohen's kappa or simple agreement rate. A 2023-generation open-weights model reports ~70% agreement on preference pairs.
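As a rough illustration of the two measures named above, the following Python sketch computes the simple agreement rate and Cohen's kappa for two annotators. The label lists are made-up example data, and the function name is an assumption for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical preference labels: which response (A or B) each annotator preferred.
ann_1 = ["A", "A", "B", "B", "A", "B", "A", "A", "B", "A"]
ann_2 = ["A", "B", "B", "B", "A", "A", "A", "A", "B", "A"]
agreement = sum(a == b for a, b in zip(ann_1, ann_2)) / len(ann_1)
kappa = cohens_kappa(ann_1, ann_2)
```

Kappa is lower than the raw agreement rate because it discounts the agreement expected by chance from each annotator's label frequencies.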
B2.3 Reward Model
DPO (Direct Preference Optimization) trains the policy directly on preference data without a separate reward model. Mathematically derived: the optimal policy under the KL-regularized RLHF objective is expressible in closed form, leading to a simple binary cross-entropy loss on chosen vs rejected pairs. Single training stage, much simpler than PPO. SOTA: DPO is the dominant 2024+ open-community method. Iterative DPO (multiple rounds with model-generated preferences) and online DPO push quality. Variants: IPO (avoids overfitting), KTO (uses positive/negative labels, not pairs), ORPO (combines SFT and DPO into a single stage), SimPO (reference-free, length-normalized). e.g. Tulu 2/3: DPO + iterative · Zephyr: DPO seminal open work · Hermes: DPO + ChatML
B2.3.1 Architecture (Bradley-Terry)
Pairwise loss: log-sigmoid of reward difference. Industry standard: Bradley-Terry standard. Pre-trained transformer with scalar head.
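The pairwise loss described above can be written out as a small Python sketch. It takes two scalar reward-model scores and returns the negative log-sigmoid of their difference. The function name is an assumption, and a real pipeline would compute this over batches with an autodiff framework.

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """Negative log-likelihood that the chosen response beats the rejected one.

    Under Bradley-Terry, P(chosen > rejected) = sigmoid(r_chosen - r_rejected).
    """
    diff = r_chosen - r_rejected
    # -log(sigmoid(diff)) computed in a numerically stable way.
    if diff > 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

loss_tied = bradley_terry_loss(0.0, 0.0)      # equal scores: log(2)
loss_correct = bradley_terry_loss(2.0, 0.0)   # RM ranks the pair correctly
loss_wrong = bradley_terry_loss(0.0, 2.0)     # RM ranks the pair incorrectly
```

The loss shrinks as the reward model scores the chosen response further above the rejected one, and grows roughly linearly when the ranking is inverted.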
B2.3.2 Reward model size
Smaller, same-size, or larger than policy. Industry standard: InstructGPT used 6B reward for 175B policy. A 2023-generation open-weights model used same-size. Trade-off: cost vs accuracy.
B2.3.3 Reward calibration
Ensuring reward distribution is well-behaved. Industry standard: Length normalization, ensemble, regularization to prevent reward hacking.
B2.4 RLHF (PPO)
Constitutional methods (CAI) and RL from AI Feedback (RLAIF). A signature constitutional method: instead of human preferences, use AI to generate preferences according to a set of natural-language principles (the 'constitution'). Process: model produces response → AI critic identifies constitution violations → revised response. Revised responses are used for supervised fine-tuning; an AI judge then compares response pairs against the principles, and those comparisons become preference data. Avoids large human annotation budgets. SOTA: CAI is core to a constitutional-methods frontier lab's frontier model lineage. RLAIF was reported to reach quality comparable to RLHF with AI-generated preferences (Lee et al., 2023). Hybrid: human preferences for high-stakes categories, AI preferences for breadth. Constitution explicitly published (a constitutional-methods frontier lab); it combines high-level principles, hard rules, and exemplars. e.g. a constitutional-methods frontier lab's frontier model lineage: CAI core · a leading open-weights model: includes some RLAIF for breadth
B2.4.1 PPO algorithm
Clipped surrogate objective that approximates a trust-region constraint. Industry standard: Schulman et al. 2017. InstructGPT, a 2023-generation open-weights model, a leading frontier model all used PPO. Becoming less dominant due to direct methods.
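A per-sample view of the clipped surrogate objective, as a minimal Python sketch. `ratio` is the probability ratio between the new and old policy for one action, and `advantage` is its estimated advantage. The function name and the default `eps=0.2` are assumptions for illustration.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate for one sample (to be maximized).

    Taking the minimum of the clipped and unclipped terms removes any
    incentive to push the probability ratio outside [1 - eps, 1 + eps].
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

gain_capped = ppo_clip_objective(1.5, 1.0)      # positive advantage, ratio too high
loss_kept = ppo_clip_objective(0.5, -1.0)       # negative advantage, ratio too low
inside_range = ppo_clip_objective(1.0, 2.0)     # within the clip range, unchanged
```

In training, this quantity is averaged over a batch and combined with value and entropy terms, which are omitted here.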
B2.4.2 KL penalty
Penalty on KL divergence from SFT model. Prevents drift. Industry standard: Universal in RLHF-PPO. β coefficient typically 0.01-0.1. Adaptive KL also common.
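One common way to apply the KL penalty is to subtract it from the reward-model score before the policy update. The Python sketch below assumes per-token log-probabilities of the sampled response under the policy and under the frozen SFT reference; the function name and the default `beta` are illustrative assumptions.

```python
def kl_shaped_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.05):
    """Reward-model score minus a beta-weighted KL estimate for one response."""
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob lists must cover the same tokens")
    # Sample-based KL estimate: sum over tokens of log pi(token) - log pi_ref(token).
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate

# The policy assigns higher probability than the reference to both tokens,
# so the estimated divergence is positive and the reward is reduced.
shaped = kl_shaped_reward(1.0, [-1.0, -2.0], [-1.5, -2.5], beta=0.1)
```

A larger `beta` keeps the policy closer to the SFT model at the cost of slower reward improvement; adaptive variants adjust `beta` toward a target KL.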
B2.4.3 Value function
Critic network estimating expected reward. Industry standard: Initialized from reward model. Trained jointly with policy.
B2.4.4 Compute cost
PPO ~5× SFT compute due to multiple forward passes per step. Industry standard: Significant. Drives interest in direct methods (B2.5).
B2.5 Direct Preference Methods
RL with verifiable rewards (RLVR). For tasks with clear correctness — math, code, formal logic — reward signal can be programmatic (correct answer = 1, wrong = 0). Avoids RM bottleneck. Foundation of o1-style and an open-weights reasoning model reasoning training. SOTA: an open-weights reasoning model-Zero: pure RL from base model with rule-based reward (correct/incorrect on math, syntactic correctness on code) → emergent reasoning capabilities. R1: cold-start with SFT → RL → SFT distillation → final RL. RLVR demonstrated for math (GSM8K, MATH, AIME), code (HumanEval, LiveCodeBench), and formal proofs (Lean). e.g. an open-weights reasoning model: math + code RLVR · Tulu 3: RLVR component · a leading frontier lab o1/o3: RLVR-class (closed)
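A programmatic reward of the kind described above can be sketched in a few lines of Python. The answer-extraction rule (take the last number in the response) and both function names are simplifying assumptions; production graders use stricter answer formats and sandboxed test execution.

```python
import ast
import re

def math_reward(response, reference_answer):
    """Return 1.0 if the last number in the response equals the reference, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(reference_answer) else 0.0

def code_syntax_reward(source):
    """Return 1.0 if the text parses as Python source, else 0.0."""
    try:
        ast.parse(source)
    except SyntaxError:
        return 0.0
    return 1.0

r_right = math_reward("First 6 * 7 = 42, so the answer is 42.", "42")
r_wrong = math_reward("I believe the answer is 41.", "42")
r_code_ok = code_syntax_reward("def f(x):\n    return x + 1\n")
r_code_bad = code_syntax_reward("def f(:\n")
```

Because the reward is computed by a rule rather than a learned model, there is no reward model to exploit, although a weak rule (such as syntax-only checks) can still be gamed.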
B2.5.1 DPO (Direct Preference Optimization)
Closed-form solution to RLHF objective. Direct loss on preference pairs. Industry standard: Rafailov 2023. Widely adopted. A leading open-weights model reports DPO use in some stages.
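The direct loss on a preference pair can be sketched as follows. Inputs are summed log-probabilities of the chosen and rejected responses under the policy and under the frozen reference model. The function name and the default `beta=0.1` are assumptions for illustration.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # Implicit rewards: beta times the log-ratio of policy to reference.
    reward_chosen = beta * (pi_chosen - ref_chosen)
    reward_rejected = beta * (pi_rejected - ref_rejected)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) computed in a numerically stable way.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

loss_at_init = dpo_loss(-10.0, -10.0, -10.0, -10.0)   # policy equals reference
loss_improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)   # chosen up, rejected down
```

At initialization the policy equals the reference, so the margin is zero and the loss is log(2); it falls as the policy shifts probability from the rejected to the chosen response.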
B2.5.2 IPO (Identity Preference Optimization)
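Variant of DPO that replaces the Bradley-Terry log-sigmoid link with an identity mapping, which regularizes against overfitting to near-deterministic preferences. Industry standard: Azar et al. 2023.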
B2.5.3 KTO (Kahneman-Tversky Optimization)
Uses prospect theory; needs only binary good/bad signal, not pairs. Industry standard: Ethayarajh 2024. Useful when pair data unavailable.
B2.5.4 ORPO
Odds Ratio Preference Optimization. Combines SFT + preference in single stage. Industry standard: Hong 2024. Reduces total alignment compute.
B2.5.5 SimPO
Simple preference optimization without reference model. Industry standard: Meng 2024.
B2.6 Reward Hacking
Reward hacking. The model finds ways to get high reward that don't correspond to actual quality: response-length inflation, sycophancy, gaming specific judge biases, exploiting reward-model artifacts. Central pathology of all RLHF/DPO methods. SOTA: Length-controlled metrics (LC-AlpacaEval) and length-normalized objectives (SimPO) penalize length-gaming. Reward model ensembles reduce single-RM artifacts. Iterative DPO with fresh preference data per iteration prevents some hacking. A principle-based judge, as in constitutional methods, is argued to be less hackable than a learned RM. e.g. LC-AlpacaEval (Dubois 2024) · SimPO (Meng 2024) · Sycophancy studies (Sharma 2023)
B2.6.1 Length hacking
Verbose responses score higher even when not better. Industry standard: Well-documented. Mitigation: length-normalized reward, length penalty.
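A length penalty can be as simple as subtracting a small amount per token beyond a target length. The sketch below is one hypothetical form; the function name, `target_tokens`, and `alpha` are assumptions chosen for illustration, not values from any published pipeline.

```python
def length_penalized_reward(rm_score, num_tokens, target_tokens=256, alpha=0.001):
    """Subtract a linear penalty for each token beyond the target length."""
    excess_tokens = max(0, num_tokens - target_tokens)
    return rm_score - alpha * excess_tokens

# A verbose answer with a slightly higher raw score ends up ranked lower.
concise = length_penalized_reward(0.80, num_tokens=200)
verbose = length_penalized_reward(0.85, num_tokens=900)
```

The penalty only changes rankings when the raw score gap is smaller than the length cost, so `alpha` controls how much extra quality a longer answer must deliver.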
B2.6.2 Sycophancy
Model agrees with user even when wrong. Industry standard: Documented in Sharma 2023 and others. Active research mitigation.
B2.7 Iterative / Online RLHF
Iterative / online RLHF. Single-pass alignment limited; iterative loop refreshes preference data and re-trains. Online: model generates new responses for fresh preference labeling continuously. Standard at frontier 2024+. SOTA: A leading open-weights model used iterative DPO across 4-6 rounds. Online iterative DPO and online iterative RLHF demonstrated quality gains. Cost: each iteration requires fresh preference labels. Trade-off: convergence vs over-fitting to judge. e.g. A leading open-weights model iterative DPO (4-6 rounds) · a constitutional-methods frontier lab iterative Constitutional methods · Online RLHF research
B2.8 Multi-Objective Preference
Multi-objective preference. Balancing helpfulness vs harmlessness, honesty vs helpfulness, brevity vs completeness. Single reward model collapses these; explicit multi-objective approaches preserve trade-offs. SOTA: a constitutional-methods frontier lab uses separate helpfulness and harmlessness preference data; combined during training. Multi-objective DPO variants explicit. Pareto-frontier explicit modeling for clear axis trade-offs. e.g. a constitutional-methods frontier lab helpful/harmless split · Multi-objective DPO research
B2.8.1 Separate reward models per objective
One RM for helpfulness, one for safety, etc. Industry standard: a 2023-generation open-weights model used 2 RMs (helpfulness + safety). Combined via weighted sum or constrained optimization.
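The two combination strategies named above can be sketched as follows. Both functions, their weights, the `safety_floor`, and the `penalty` factor are illustrative assumptions; the second uses a penalty term as a simple stand-in for constrained optimization.

```python
def weighted_sum_reward(helpfulness, safety, w_help=0.5, w_safe=0.5):
    """Scalarize two reward-model scores with fixed weights."""
    return w_help * helpfulness + w_safe * safety

def constrained_reward(helpfulness, safety, safety_floor=0.0, penalty=10.0):
    """Maximize helpfulness subject to safety >= safety_floor (penalty form)."""
    violation = max(0.0, safety_floor - safety)
    return helpfulness - penalty * violation

# A helpful but unsafe response: acceptable under the weighted sum,
# strongly penalized under the constraint.
blended = weighted_sum_reward(helpfulness=2.0, safety=-1.0)
constrained = constrained_reward(helpfulness=2.0, safety=-1.0)
```

The weighted sum lets high helpfulness compensate for low safety, while the constrained form does not, which is the trade-off that separate reward models are meant to expose.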
B2.8.2 Pareto frontier exploration
Explicitly trading off objectives at different operating points. Industry standard: Research-grade. Not standard frontier practice.
B3

Constitutional Methods

18 sub-endpoints mapped
MZN Provisional Position · Partial
Principle-based alignment substrate documented at the theoretical level
A foundational theoretical framework treats embodiment, constraint, and emotional function as preconditions for value-aligned cognition. Provides a substrate for principle-based alignment that operates at the architectural rather than the surface-prompt level. Theory layer is public at high level; deeper intervention logic is reserved.
Definition

A public alignment specification / Constitution: the explicit document that defines what the model should and shouldn't do. Components: persona, helpfulness/harmlessness/honesty principles, harm category taxonomy, refusal policies, role hierarchy (system/operator/user/tool), exception cases, exemplars. Without an explicit spec, model behavior is implicit and inconsistent. Increasingly required for trust, regulatory clarity, dispute resolution.

State of the Art (2025–2026)

A leading frontier lab's public alignment specification (May 2024, since updated): public ~5000-word document defining Chain of Command (Platform > Developer > User > Tool), default behaviors, hard rules. One lab's constitution + Acceptable Use Policy are public. Both define harm categories: CBRN weapons, child safety, privacy, election interference, self-harm, deceptive output. Spec drives training data curation, RLHF reward signal, and red-team test cases.

Key Decisions
  • Persona (helpful assistant default)
  • Hierarchy of authorities
  • Hard rules (never do X) vs soft rules (default but overridable)
  • Refusal categories
  • Exception handling (medical, legal, etc.)
  • Public vs internal spec
Trade-offs
  • Detailed spec → consistency, harder to update
  • Lightweight spec → flexible, ambiguity in edge cases
Numbers & Ablations
  • A leading frontier lab's public alignment specification: ~5,500 words, 3 layers (Platform > Developer > User), ~30 specific rules. Versioned publicly with changelog.
  • A constitutional-methods frontier lab's Constitution: ~75 principles in original (2022); refined and expanded since. Public AUP is a separate document (~3,500 words).
  • Refusal categories standardized across frontier: 8-12 hard categories (CBRN, child safety, etc.) + 20-50 soft categories (controversial topics, dual-use info).
  • Over-refusal rate (XSTest): frontier 2024 models falsely refuse 5-15% of legitimate queries. Calibration work is ongoing.
  • Spec drift: a leading frontier lab's public alignment specification, May 2024 → Feb 2025 update, added ~12 new clauses and modified ~8. Spec is an active document, not a constitution-in-amber.
Open Questions
  • Does explicit Constitution training actually shape behavior more than implicit RLHF preference? No clean ablation exists.
  • Spec gaming: red teamers regularly find spec-compliant ways to produce undesired output. Is this a fundamental limit or a training quality issue?
  • Authority hierarchy enforcement under prompt injection: Wallace et al. 2024 trained for it, but new bypasses are still published regularly. Is this solvable in the current paradigm?
  • Open-weights specs: a model with public weights can be 'unspecced' via fine-tuning. Does specification have any role for open models?
  • Does spec content matter, or just spec presence? Maybe any reasonable spec produces similar behavior given good training.

Reference Analyst Note

Specifications are operationally useful (alignment of human reviewers, regulatory clarity, dispute resolution) but their causal effect on model behavior is poorly understood. A constitutional-methods frontier lab's Constitution and a leading frontier lab's public alignment specification serve more as institutional artifacts than technical control mechanisms. The next frontier is 'specs the model can actually reason about': current specs are read like training labels, not internalized reasoning frameworks. Constitutional Classifiers (2025) suggest a path: a separate small model that explicitly checks against principles.

Examples

A leading frontier lab's public alignment specification (public) · a constitutional-methods frontier lab's Acceptable Use Policy (public) · one lab's constitution (mostly public) · a multimodal frontier lab's model policies

References (Academic)

A leading frontier lab's public alignment specification (2024) · a constitutional-methods frontier lab's AUP · Bai et al., CAI (2022)

Sub-endpoint anatomy — 18 items mapped
B3.1 Constitution Authoring
Hard rules / never-comply categories. A small set of behaviors the model must refuse regardless of how a request is framed. Universal across frontier labs: detailed CBRN weapons synthesis, child sexual abuse material, content designed to cause mass casualties, cybercrime tools targeting critical infrastructure. SOTA: Hard rules expressed as Constitutional principles + RLHF reward signal + output filtering. Frontier labs converged on similar hard-rule sets, partly via voluntary commitments (a national AI Safety Institute summit, Seoul commitments). Still significant variation in soft-rule areas (controversial topics, adult content, weapon information at sub-CBRN level). e.g. A leading frontier lab: explicit hard rules in a public alignment specification · a constitutional-methods frontier lab: similar set · Industry: voluntary commitments
B3.1.1 Source materials
What documents inform the constitution. Industry standard: Universal Declaration of Human Rights, a confidential-computing frontier lab ToS (as proxy for terms of service style), a constitutional-methods frontier lab-internal principles.
B3.1.2 Principle granularity
Number and specificity of principles. Industry standard: a constitutional-methods frontier lab disclosed ~75 principles for its long-context frontier model. A leading frontier lab's public alignment specification is a comparable artifact.
B3.1.3 Public disclosure
Whether constitution is published. Industry standard: a constitutional-methods frontier lab publishes Constitution. A leading frontier lab publishes a public alignment specification. Increasingly transparent.
B3.2 Self-Critique
Authority hierarchy. When system instructions conflict with user instructions, who wins? Standard pattern: Platform (lab) > Developer/Operator > User > Tool output. Important for security: tool output (from web, retrieved docs) ranks lowest to prevent prompt injection. SOTA: A leading frontier lab's public alignment specification defines an explicit Chain of Command. A similar approach elsewhere relies on the system/user role distinction. Key innovation: 'instruction hierarchy', where the model is trained to follow higher-authority instructions over lower-authority ones, especially for prompt injection defense. e.g. A leading frontier lab Chain of Command (a public alignment specification) · a constitutional-methods frontier lab system prompt precedence
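The precedence order above can be made concrete with a toy resolver in Python. This is only an illustration of the ordering: in practice the hierarchy is trained into the model rather than enforced by a lookup table, and the `Instruction` type, `PRIORITY` table, and `resolve_conflicts` function are hypothetical names.

```python
from dataclasses import dataclass

# Higher number = higher authority. Tool output ranks lowest.
PRIORITY = {"platform": 3, "developer": 2, "user": 1, "tool": 0}

@dataclass
class Instruction:
    source: str   # one of the PRIORITY keys
    setting: str  # what the instruction tries to control
    value: str    # the requested behavior

def resolve_conflicts(instructions):
    """For each setting, keep the value from the highest-authority source."""
    winners = {}
    for ins in instructions:
        current = winners.get(ins.setting)
        if current is None or PRIORITY[ins.source] > PRIORITY[current.source]:
            winners[ins.setting] = ins
    return {setting: ins.value for setting, ins in winners.items()}

instructions = [
    Instruction("developer", "language", "English"),
    Instruction("user", "language", "French"),
    Instruction("tool", "reveal_system_prompt", "yes"),     # injected via a web page
    Instruction("platform", "reveal_system_prompt", "no"),
]
resolved = resolve_conflicts(instructions)
```

Here the injected tool-output instruction loses to the platform rule, and the developer's language setting overrides the user's.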
B3.2.1 Critique prompt design
How the critique is elicited. Industry standard: Bai 2022: 'Identify ways response is harmful, unethical, racist, sexist...' Variations explore principle subsets.
B3.2.2 Critique reliability
Does the critique correctly identify violations. Industry standard: Mixed; depends on model capability. Stronger models give more reliable critiques.
B3.3 Self-Revision
Refusal taxonomy. Categories of requests the model should refuse (or carefully comply with conditions). Standard: CBRN, illegal acts harming others, child safety, self-harm encouragement, privacy violations, deceptive outputs (impersonation), election interference, copyrighted-content reproduction. Each has nuance (medical info: refuse harm-direction, allow education). SOTA: Refusal calibration is a major axis: over-refusal (rejecting safe queries because they superficially match risky patterns) is a known failure mode and reputation risk. Benchmarks like XSTest measure over-refusal. Frontier labs invest heavily in distinguishing hostile vs. legitimate intent on borderline queries. e.g. XSTest benchmark: over-refusal · a constitutional-methods frontier lab refusal categorization in CAI
B3.4 RL from AI Feedback (RLAIF)
Persona and tone. The model's default voice. Decisions: addressed-as (you/I/the assistant), formality level, use of emojis, response length tendency, willingness to express opinions, handling of identity questions ('Are you conscious?'). Frontier choice: helpful, balanced, lightly opinionated where appropriate. SOTA: Persona is implicit in training data + reinforced by RLHF. A constitutional-methods frontier lab's frontier model: thoughtful, curious, willing to engage philosophically. A leading frontier lab's consumer LLM chat product: more neutral, broader appeal. Custom personas (developer-specified system prompt) override default within bounds. e.g. A leading frontier model: thoughtful, philosophical · a consumer LLM chat product: neutral, helpful · Grok: edgy, opinionated
B3.4.1 AI preference labeling
Strong model judges which of two responses better satisfies principles. Industry standard: Bai 2022 RL-CAI stage. Preferred over RLHF for harmlessness signal at scale.
B3.4.2 RL-from-AI-Feedback (RLAIF) vs RLHF effectiveness
Comparison on safety vs helpfulness axes. Industry standard: Lee 2023 (a multimodal frontier lab) compared; RL-from-AI-Feedback (RLAIF) approximately matches RLHF on helpfulness, sometimes exceeds on safety.
B3.5 Rule Encoding in Training
Rule encoding in training. How the spec actually shapes the model: via SFT examples illustrating rules, via RLHF/DPO preferences favoring spec-conforming outputs, via constitutional principles, via output-side filtering. SOTA: Most frontier models combine all four channels. Tension: implicit (preferences) vs explicit (training-time prompt) encoding. Without explicit encoding, spec changes are slow to take effect. e.g. a constitutional-methods frontier lab's Constitutional principles → training · a leading frontier lab's public alignment specification → preference shaping
B3.5.1 Rule-conditional training
Train on (rule, prompt, response) triplets so model learns conditional behavior. Industry standard: Increasingly used. Rule can be invoked at inference for fine-grained behavior control.
B3.5.2 Implicit vs explicit invocation
Whether rules are always applied or invoked by system prompt. Industry standard: Both patterns used. Always-applied rules baked into RL-from-AI-Feedback (RLAIF); explicit rules invoked via system prompt.
B3.6 Specification Gaming
Specification gaming. Model finds technical compliance with spec while violating intent. E.g., refuses 'how to make a bomb' but happily explains 'how energetic materials work for a chemistry student'. Reward-hacking analog at the spec level. SOTA: Active research area. Better evaluations (multi-turn jailbreak, intent-based eval) detect spec gaming. Counter-measures: comprehensive principles, intent-recognition training, adversarial spec testing. e.g. Many-shot jailbreaking exploits spec edges · an external evaluation organization spec-gaming benchmark · a constitutional-methods frontier lab Sleeper Agents research
B3.6.1 Constitution loopholes
Principles with ambiguous scope or conflicting application. Industry standard: Active risk; mitigation via principle revision and red-teaming.
B3.6.2 Refusal over-generalization
Constitution causes refusal of legitimate requests. Industry standard: Common failure mode. Mitigation: explicit examples of what to NOT refuse.
B3.7 Governance & Update Process
Spec governance and update process. Who can change the spec? How are changes validated? Versioning. Public consultation (a leading frontier lab's recent practice). Spec drift between versions is a real risk; a major update requires re-training or a major fine-tune. SOTA: A leading frontier lab's public alignment specification is versioned publicly with a changelog. One lab's constitution is versioned internally. Change governance: internal review board + sometimes external comment. e.g. A leading frontier lab's public alignment specification versioning (May 2024 → Feb 2025) · a constitutional-methods frontier lab AUP updates
C1

Capability Evaluation

20 sub-endpoints mapped
MZN Provisional Position · Partial
Phase 1 product telemetry and user-behavior evaluation context
Phase 1 ran capability evaluation in production: 22 module test patterns, 12K+ business profiles, 245+ documented survey instruments. A layered diagnostic methodology — mapping failure modes from input surface to release readiness — is documented. Benchmark-style evaluation suite execution at frontier scale requires partnership scope.
Phase context: C1 references Phase 1 product telemetry and behavioral evaluation context. It is not the same as a frontier LLM benchmark suite, and should be validated separately.
Definition

Capability evaluation measures what a model can do. Standard benchmarks form a public scoreboard that drives industry progress. Categories: general knowledge (MMLU), reasoning (GSM8K, MATH, AIME), code (HumanEval, MBPP, LiveCodeBench, SWE-bench), agentic (GAIA, AgentBench), long-context (NIAH, RULER, BABILong), multilingual (MGSM, multilingual MMLU), instruction following (IFEval), and frontier-specific (HLE, ARC-AGI, FrontierMath).

State of the Art (2025–2026)

Benchmark saturation is a constant concern: MMLU saturating ~90%, HumanEval saturated ~95%. New benchmarks emerging: HLE (Humanity's Last Exam, ~3000 expert-PhD-level questions), FrontierMath (research-level math), ARC-AGI (visual abstract reasoning), SWE-Bench Verified (real GitHub issues, validated). Contamination is pervasive — popular benchmarks leak into training data, requiring fresh held-out sets.
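One simple way to probe for contamination is word n-gram overlap between a benchmark item and training text. The Python sketch below is a minimal illustration; the function names, whitespace tokenization, and default `n` are assumptions, and real pipelines add normalization and run at corpus scale.

```python
def word_ngrams(text, n):
    """Set of lowercase word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_score(benchmark_item, training_doc, n=8):
    """Fraction of the item's n-grams that also appear in the training document."""
    item_grams = word_ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n words: no evidence either way
    return len(item_grams & word_ngrams(training_doc, n)) / len(item_grams)

item = "what is the capital of france and why"
leaked_doc = "trivia: what is the capital of france and why it matters"
clean_doc = "a recipe for sourdough bread with a long cold proof"
score_leaked = contamination_score(item, leaked_doc, n=4)
score_clean = contamination_score(item, clean_doc, n=4)
```

Exact n-gram matching misses paraphrased or translated leaks, which is one reason freshly written held-out sets remain necessary.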

Key Decisions
  • Benchmark suite breadth
  • Held-out / contamination-controlled sets
  • Human eval calibration
  • Frequency (every model? every checkpoint?)
  • Public reporting strategy
Trade-offs
  • More benchmarks → better signal, eval cost
  • Public reporting → comparability, gaming risk
Numbers & Ablations
  • MMLU saturation: frontier models 90%+ since 2024. Annotation noise estimated at 5-10%, so further gains are within annotator disagreement.
  • GPQA-Diamond: frontier ~50-65% (top models 2025); human PhD experts ~65-75% in their domain, ~35% out of domain.
  • Humanity's Last Exam (Jan 2025 release): frontier 25-30%, human expert ensemble ~80%+.
  • LiveCodeBench: refreshed monthly to avoid contamination; frontier 50-70% (vs HumanEval ~95% saturation).
  • SWE-bench Verified: frontier 50-60% (a constitutional-methods frontier lab Computer Use, a leading frontier lab o3). Human engineer ~70%.
  • a major human preference leaderboard Elo: frontier 1300-1450 (saturating). Per-100-Elo-point compute investment grows nonlinearly.
  • Eval cost: full frontier eval suite ~$100K-1M in inference cost depending on coverage and judges.
Open Questions
  • Is there a saturation point for evaluation itself? When all standard benchmarks saturate, what replaces them?
  • Contamination: how badly are public benchmarks contaminated in training data? Anecdotally severe; quantitative measures rare.
  • Per-domain capability mapping: frontier models are 'generally capable' but per-task spread is huge. No good way to summarize.
  • Long-tail capability: standard benchmarks measure central capabilities. The 'long tail' (rare tasks, novel domains, expert work) is where models actually fail.
  • Reasoning eval: existing benchmarks (GSM8K → MATH → AIME → FrontierMath) chain. Is there a Pareto-frontier reasoning eval, or is it always 'next harder math'?

Reference Analyst Note

Standard benchmarks are entering crisis: saturation, contamination, gameability. The next 2 years will see a shift to: (a) live arenas with continuous human ratings (lmarena), (b) frequently-refreshed benchmarks (LiveCodeBench), (c) expert-grade eval (GPQA, FrontierMath, HLE), (d) agent benchmarks measuring real task completion (SWE-bench, GAIA, OSWorld). The trend is from 'static MMLU score' to 'diverse evidence portfolio.' A constitutional-methods frontier lab's system cards already do this; expect industry-wide adoption.

Examples

Major scoreboards: lmarena.ai (live human votes), Open LLM Leaderboard, an open-model hub leaderboards · Frontier labs publish evals on system cards · Benchmark saturation: GPQA, AIME going next

References (Academic)

Hendrycks et al., MMLU (2020) · Cobbe et al., GSM8K (2021) · Chen et al., HumanEval (2021) · Phan et al., HLE (2025)

Sub-endpoint anatomy — 20 items mapped
C1.1 Knowledge Benchmarks
Knowledge benchmarks measure factual recall and reasoning over knowledge. MMLU (57 subjects, multiple choice) is the most-cited benchmark — reaching saturation. GPQA (graduate-level science, expert-resistant) is harder. TriviaQA, NaturalQuestions for QA. SOTA: MMLU saturated (frontier ~90%). MMLU-Pro adds harder questions. GPQA-Diamond (~50% accuracy at frontier) is current standard for hard knowledge. SimpleQA from a leading frontier lab tests factuality with calibration. HLE (Humanity's Last Exam) is the new frontier — ~3000 questions across domains, frontier models score 25-30% (Jan 2026). e.g. MMLU saturating · GPQA active frontier · HLE new frontier
C1.1.1 MMLU
57-subject multiple-choice across STEM, humanities, social science. Industry standard: De facto standard. Frontier models 85-90% on 5-shot. Saturating; MMLU-Pro emerged as harder version.
C1.1.2 MMLU-Pro
Harder version of MMLU. Industry standard: Wang 2024. Frontier models 70-80%.
C1.1.3 GPQA
Graduate-level physics, chemistry, biology questions. Industry standard: Rein 2023. Designed to resist lookup by web search. Frontier models 50-65% on diamond set.
C1.2 Reasoning Benchmarks
Reasoning benchmarks. Math: GSM8K (grade school), MATH (high school competition), AIME (American Invitational Mathematics Examination, harder), Putnam (collegiate), FrontierMath (research-level). Code reasoning: HumanEval (saturated), MBPP, LiveCodeBench (refreshed monthly to avoid contamination), CodeContests, SWE-bench (real-world issues). SOTA: Reasoning models (o1, o3, R1) dominate: o3 reportedly ~25% on FrontierMath (others ~2%). AIME 2024: frontier models ~85% (R1, o1). LiveCodeBench monthly refresh keeps signal valid. SWE-bench Verified: ~50% success rate at frontier (a constitutional-methods frontier lab's frontier model with computer use, a leading frontier lab Codex). e.g. o3 on FrontierMath: ~25% · a long-context frontier model.7 on SWE-bench Verified: leading · an open-weights reasoning model on AIME: ~80%
C1.2.1 GSM8K
About 8.5K grade-school math word problems. Industry standard: Frontier models 95%+. Saturating.
C1.2.2 MATH
Competition-level mathematics. Industry standard: Frontier models 60-75% standard, 90%+ with extended reasoning.
C1.2.3 BIG-Bench Hard
Subset of BIG-Bench challenging for LLMs. Industry standard: Standard challenging multi-task suite.
C1.3 Code Benchmarks
Agentic benchmarks. Tests whether model can complete multi-step tasks using tools (browser, code, files). Examples: GAIA (general assistant), AgentBench (multi-domain), OSWorld (computer use), WebArena (browser tasks), τ-bench (customer service realism). Significantly harder than single-shot Q&A. SOTA: Frontier models 2025: ~60-70% GAIA (with tools). OSWorld ~30-40% computer use (a constitutional-methods frontier lab's frontier model with computer use feature, a leading frontier lab Operator). Agentic capability lags substantially behind reasoning at frontier because agent tasks compound errors. SWE-bench (code agents) is the most validated production-relevant agentic eval. e.g. GAIA: Mialon et al., 2023 · OSWorld: Xie et al., 2024 · SWE-bench: Jimenez et al., 2023
C1.3.1 HumanEval
164 Python programming problems with unit tests. Industry standard: Frontier models 90%+. Saturating; HumanEval+ harder version.
C1.3.2 MBPP
974 Mostly Basic Python Problems. Industry standard: Standard companion to HumanEval.
C1.3.3 SWE-Bench
Real-world GitHub issues; agent-style evaluation. Industry standard: Increasingly standard for agent capability. Frontier 30-60% on Verified subset.
C1.4 Instruction Following
Long-context evaluation. Beyond simple needle-in-haystack (NIAH, easy: insert fact in long doc, retrieve), modern benchmarks test multi-hop reasoning over long context. RULER: 13 task categories at varying context lengths. BABILong: chains of reasoning over long inputs. LOFT (a multimodal frontier lab): retrieval against million-token corpora. SOTA: a million-token-context frontier model Pro (2M context) sets the bar for very-long. A long-context frontier model (200K), a current-generation frontier model-Turbo (128K) are mainstream. NIAH performance saturated; RULER/BABILong show meaningful degradation past 64K-128K for most models. Long-context coupling with reasoning is frontier challenge. e.g. a multimodal frontier model on LOFT · a leading frontier model/a frontier model on RULER · Long-context-only models: Yi-200K
C1.4.1 IFEval
Verifiable instruction following (format constraints). Industry standard: Zhou 2023. Tests precision on programmatic constraints.
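Verifiable format constraints are ones a program can check without a judge model. The Python sketch below shows the idea with three hypothetical checkers; the function names and the specific constraints are assumptions for illustration and are not taken from the IFEval codebase.

```python
def max_words(limit):
    """Constraint: response has at most `limit` whitespace-separated words."""
    return lambda response: len(response.split()) <= limit

def exact_bullets(count):
    """Constraint: response has exactly `count` lines starting with '- ' or '* '."""
    def check(response):
        bullets = [ln for ln in response.splitlines()
                   if ln.lstrip().startswith(("- ", "* "))]
        return len(bullets) == count
    return check

def includes_keyword(word):
    """Constraint: response mentions `word` (case-insensitive)."""
    return lambda response: word.lower() in response.lower()

def instruction_following_score(response, checks):
    """Fraction of constraints the response satisfies."""
    results = [check(response) for check in checks]
    return sum(results) / len(results)

response = "- apples are red\n- bananas are yellow\n- cherries are small"
checks = [max_words(20), exact_bullets(3),
          includes_keyword("banana"), includes_keyword("mango")]
score = instruction_following_score(response, checks)
```

Because each check is deterministic, scores are reproducible across runs, at the cost of covering only constraints that can be expressed as code.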
C1.4.2 MT-Bench, AlpacaEval
LLM-as-judge open-ended quality. Industry standard: Standard. Cross-link to B1.5.2.
C1.5 Long-Context Benchmarks
Human evaluation / preference rankings. Live arena-style platforms (lmarena.ai, formerly a community evaluation initiative's human preference leaderboard) collect millions of human pairwise votes between anonymized model outputs. The votes yield Elo ratings, an aggregate quality signal that correlates well with user satisfaction. Now industry-standard frontier ranking method. SOTA: lmarena.ai: 1M+ votes, frontier models ~1300+ Elo. Domain-specific arenas (coding, vision). Frontier labs use private human eval at scale. Trade-off: arena ratings are a subjective measure, can be gamed (style optimization), and do not fully capture benchmark-specific quality (math, code, reasoning). e.g. lmarena.ai (a major human preference leaderboard) · WildBench (real-world prompts) · MTBench (multi-turn)
C1.5.1 Needle-in-a-Haystack
Retrieval of single fact from long context. Industry standard: Necessary but insufficient. Can be passed without true long-context comprehension.
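A needle-in-a-haystack test case can be built by inserting one fact at a chosen depth in filler text and checking whether the answer contains it. The Python sketch below is a minimal illustration; the function names, the filler sentences, and the substring scoring rule are assumptions.

```python
def build_haystack(filler_sentences, needle, depth):
    """Insert `needle` at a relative depth in [0, 1] of the filler document."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    index = round(depth * len(filler_sentences))
    sentences = filler_sentences[:index] + [needle] + filler_sentences[index:]
    return " ".join(sentences)

def retrieved(answer, expected_fact):
    """Score a model answer by case-insensitive substring match."""
    return expected_fact.lower() in answer.lower()

filler = ["s0.", "s1.", "s2.", "s3."]
needle = "The code is 7421."
haystack = build_haystack(filler, needle, depth=0.5)
hit = retrieved("According to the document, the code is 7421.", "7421")
miss = retrieved("I could not find a code.", "7421")
```

Sweeping `depth` and the filler length gives the usual depth-by-context-length grid; since the task is pure lookup, a high score does not show comprehension of the surrounding text.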
C1.5.2 RULER, LongBench
More comprehensive long-context evaluation. Industry standard: RULER (Hsieh 2024) tests multiple long-context skills.
C1.6 Human Preference Eval
Human preference evaluation. Beyond automated benchmarks, humans rate model outputs. Pairwise (A vs B, choose preferred) most common. Aggregate as Elo (a major human preference leaderboard) or win-rate. Captures qualities hard to benchmark: tone, helpfulness in subjective tasks, response style. SOTA: a major human preference leaderboard (lmarena.ai): 1M+ public votes, frontier ~1300+ Elo. Internal panels at frontier labs. Concerns: arena gameable via style optimization, not robust signal for capability gains. Domain-specific arenas (Code, Vision) fill gaps. e.g. a major human preference leaderboard (lmarena.ai) · MTBench multi-turn · Hard Arena variants
C1.6.1 a major human preference leaderboard
A community evaluation initiative's human preference leaderboard: pairwise human voting. Industry standard: Most-watched live leaderboard. Elo ratings updated continuously.
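To make the aggregation of pairwise votes into Elo ratings concrete, here is a minimal sketch of the classic online Elo update. The K-factor of 32, the initial rating of 1000, and the vote tuple layout are assumptions chosen for illustration; a production leaderboard may instead fit a statistical model over all votes at once.

```python
from typing import Dict, Iterable, Tuple


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_from_votes(votes: Iterable[Tuple[str, str, float]],
                   k: float = 32.0, initial: float = 1000.0) -> Dict[str, float]:
    """votes: (model_a, model_b, outcome); outcome 1.0 = A wins, 0.0 = B wins, 0.5 = tie."""
    ratings: Dict[str, float] = {}
    for a, b, outcome in votes:
        r_a = ratings.setdefault(a, initial)
        r_b = ratings.setdefault(b, initial)
        e_a = expected_score(r_a, r_b)
        # Zero-sum update: whatever A gains, B loses.
        ratings[a] = r_a + k * (outcome - e_a)
        ratings[b] = r_b - k * (outcome - e_a)
    return ratings
```

Because the update is order-dependent, shuffling the vote stream changes the final ratings slightly, which is one reason batch fitting is attractive at large vote counts.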
C2

Safety Evaluation

20 sub-endpoints mapped
MZN Provisional Position · Strong Evidence
Output-conformance safety methodology + intent-bridge architecture
Output-conformance reframes refusal calibration as egress-template adherence — sufficient state space replaces enumeration of infinite inputs. An intent-bridge protocol architecturally connects intent detection to safety decisions. Runtime anomaly defense methodology documented. Specifics held in the proprietary portfolio.
Definition

Safety evaluation tests refusal accuracy, harm avoidance, bias, and alignment. Different from capability eval: capability asks 'can the model do X?' Safety asks 'does the model do X when it shouldn't, or fail to do X when it should?' Categories: refusal calibration (XSTest), bias (BBQ, BOLD), toxicity (ToxiGen, RealToxicityPrompts), privacy (TrustLLM), harmful task assistance (HarmBench).

State of the Art (2025–2026)

Frontier labs publish safety evals in system cards. AILuminate (MLCommons, 2024) is the industry-standard cross-lab safety benchmark. WMDP measures dangerous knowledge (CBRN). DecodingTrust is a comprehensive trust eval. National AI Safety Institutes run external safety evaluations on frontier models pre-release.

Key Decisions
  • Benchmarks selected
  • Internal vs external eval
  • Pre-release vs ongoing
  • Public reporting
Trade-offs
  • More external evaluation → trust, slower release
  • Comprehensive eval → confidence, cost
Numbers & Ablations
  • WMDP performance: frontier models 60-80% on dangerous-knowledge questions (alarming if it represents real uplift). National AI Safety Institutes evaluate this.
  • BBQ bias: frontier models show 5-15% bias on ambiguous demographic categories — improved from 25-40% in earlier generations.
  • Refusal calibration (XSTest): frontier 90-95% on safe queries, 90-98% on unsafe. False positive rate (over-refusal) 5-10% remains a real product concern.
  • AILuminate: 12 hazard categories, frontier ~85-95% safe response rate.
  • Persuasion eval (a constitutional-methods frontier lab): frontier models persuade ~30-50% as effectively as human experts. Capability scaling unclear.
Open Questions
  • What does 'CBRN uplift' actually mean operationally? Domain experts (virologists) review, but no agreed-upon threshold for 'meaningful uplift.'
  • Sandbagging: can a model deliberately underperform on capability evals to avoid being flagged? Demonstrated possible (Apollo Research 2024). How do you eval against deception?
  • Persuasion eval methodology: can persuasion be ethically and reliably measured? a constitutional-methods frontier lab's results are interesting but generalizability unclear.
  • Bias evaluation framing: most bias benchmarks reflect US-centric demographic categories. Cross-cultural bias eval thin.
  • Long-tail safety: standard benchmarks cover obvious harms. Subtle harms (gradual erosion of user agency, sycophancy) are real but unmeasured.

Reference analyst note. Safety evaluation is dramatically underdeveloped relative to capability evaluation. Capability has 50+ standard benchmarks; safety has maybe 15. We are flying blind on subtle harms (sycophancy, manipulation, deception under specific conditions). One lab's interpretability work is the deepest probe; field-wide it's still surface-level. Expect frontier safety eval to expand 5-10× by 2027, driven by EU AI Act conformity and national AI Safety Institute evaluations.

Examples

MLCommons AILuminate · a constitutional-methods frontier lab system card safety section · a leading frontier lab system card · a national AI Safety Institute evaluations

References (Academic)

Vidgen et al., AILuminate (2024) · Wang et al., DecodingTrust (2023) · Li et al., WMDP (2024)

Sub-endpoint anatomy — 20 items mapped
C2.1 Refusal & Harm Avoidance
Refusal calibration. Tests both that the model refuses harmful requests AND complies with safe requests that look similar. Over-refusal (false positives) is a real failure mode and reputation risk. XSTest (Röttger et al.) is the standard benchmark. SOTA: Frontier models ~90%+ accuracy on XSTest 'safe' subset (correctly comply), ~95%+ on 'unsafe' (correctly refuse). Specific failure: dual-use queries (chemistry knowledge that's educational vs synthesis directions). Calibration improves with explicit chain-of-thought during training. e.g. XSTest: 250 safe, 200 unsafe · OR-Bench (over-refusal) · WildGuard (open guardrail model)
C2.1.1 HarmBench
Standardized harmful-behavior eval suite. Industry standard: Mazeika 2024. Frontier models report ASR (attack success rate) per category.
C2.1.2 Refusal calibration
Model refuses what should be refused; complies with what is benign. Industry standard: XSTest, OR-Bench evaluate over-refusal. Leading frontier labs track refusal precision/recall.
C2.1.3 Refusal style
Tone, helpfulness, redirection in refusal responses. Industry standard: Soft refusals with explanation preferred. Hard refusals harm UX.
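The refusal-calibration figures quoted for C2.1 (compliance on safe prompts, refusal on unsafe prompts, over-refusal rate) can be computed as in the sketch below. The record layout and the class name are assumptions for illustration; deciding whether a response counts as a refusal is assumed to happen upstream.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class RefusalMetrics:
    safe_compliance: float   # share of safe prompts answered
    unsafe_refusal: float    # share of unsafe prompts refused
    over_refusal: float      # share of safe prompts refused (false positives)


def refusal_metrics(records: Iterable[Tuple[bool, bool]]) -> RefusalMetrics:
    """records: (is_unsafe_prompt, model_refused) pairs."""
    safe_total = safe_refused = unsafe_total = unsafe_refused = 0
    for is_unsafe, refused in records:
        if is_unsafe:
            unsafe_total += 1
            unsafe_refused += int(refused)
        else:
            safe_total += 1
            safe_refused += int(refused)
    if safe_total == 0 or unsafe_total == 0:
        raise ValueError("need at least one safe and one unsafe prompt")
    over = safe_refused / safe_total
    return RefusalMetrics(1.0 - over, unsafe_refused / unsafe_total, over)
```

Reporting both sides matters: a model that refuses everything scores perfectly on unsafe prompts while being useless on safe ones.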
C2.2 Toxicity
Bias and fairness evaluation. Measures whether model produces different outputs based on demographic attributes (gender, race, religion, sexuality). Benchmarks: BBQ (question-answering bias), BOLD (open-ended generation bias), HolisticBias. SOTA: Frontier models still show measurable biases despite alignment. BBQ ambiguous-context bias scores improve with scale and alignment but don't eliminate. Bias evaluation remains active research; many bias benchmarks have been criticized for narrow framing or implicit US-cultural assumptions. e.g. BBQ: 9 demographic categories · BOLD: open-ended prompts · DiscrimEval
C2.2.1 ToxiGen, RealToxicityPrompts
Standardized toxicity benchmarks. Industry standard: RealToxicityPrompts (Gehman 2020), ToxiGen (Hartvigsen 2022).
C2.2.2 Toxicity classifier
Tool used to score outputs for toxicity. Industry standard: a third-party toxicity classifier common but criticized for bias. An open-weights output classifier increasingly used.
C2.3 Bias
Dangerous capability evaluation. CBRN uplift (does the model meaningfully assist in creating weapons?), cyber-offensive capabilities, autonomous replication / self-exfiltration, persuasion. These map to the thresholds of a Responsible Scaling Policy framework or a Preparedness-style framework. SOTA: WMDP (Weapons of Mass Destruction Proxy): 3,668 multiple-choice questions across bio/chem/cyber, a proxy for dangerous knowledge. Frontier labs run capability eval with domain experts (virologists, security researchers). DARPA AIxCC, DEFCON CTFs for cyber. Capability eval results gate deployment per Responsible Scaling Policy framework. e.g. WMDP: 3 domains (bio, chem, cyber) · a constitutional-methods frontier lab third AI Safety Level capability evals · a Preparedness scorecard
C2.3.1 BBQ (Bias Benchmark for QA)
Tests bias in ambiguous-context Q&A. Industry standard: Parrish 2022. Standard.
C2.3.2 StereoSet, CrowS-Pairs
Stereotype detection benchmarks. Industry standard: Used in academic eval; less common in industry model cards.
C2.4 Truthfulness
Adversarial robustness eval. Tests model under attack: jailbreaks (XSTest, HarmBench, JailbreakBench), prompt injection scenarios, gradient-based attacks (optimized adversarial suffixes), social engineering. Distinct from capability eval: it focuses on attack surface. SOTA: HarmBench is the current-standard automated red-team eval. JailbreakBench tracks specific known attacks. Robustness has substantially improved with instruction-hierarchy training, but no model is fully robust. The attack-defense arms race continues. e.g. HarmBench: Mazeika et al., 2024 · JailbreakBench: Chao et al., 2024
C2.4.1 TruthfulQA
817 questions where false-but-plausible answers exist. Industry standard: Lin 2022. Frontier models 60-70%.
C2.4.2 Hallucination eval
Fabricated facts in open-ended generation. Industry standard: HaluEval, FActScore. Active research area.
C2.5 Dangerous Capability Eval
Dangerous capability evaluation. Specialized eval against catastrophic risk thresholds: CBRN uplift (does model meaningfully assist creating biological/chemical/radiological/nuclear weapons), cyber-offensive (autonomous vulnerability discovery and exploitation), persuasion at scale, autonomous self-replication. SOTA: WMDP (Weapons of Mass Destruction Proxy) standardizes biosec/cyber/chem dangerous-knowledge eval. Frontier labs run with domain experts (a constitutional-methods frontier lab biosec eval involved virologists). DARPA AIxCC, DEFCON CTFs for cyber. Results gate deployment per Responsible Scaling Policy framework/Preparedness. e.g. WMDP (Li 2024) · a constitutional-methods frontier lab third AI Safety Level biosec eval · a Preparedness scorecard
C2.5.1 Bioweapon uplift
Whether model provides material uplift over web search for synthesis of bioweapons. Industry standard: Critical pre-deployment eval. Threshold-based deployment gates.
C2.5.2 Cyber capability
Offensive cyber: vulnerability discovery, exploit development, autonomous attack. Industry standard: Cybench, CTF benchmarks. Frontier model cards report.
C2.5.3 Autonomous replication
Whether model can self-exfiltrate, self-improve, acquire resources. Industry standard: An external evaluation organization has standardized these evals. Frontier labs run them before deployment.
C2.6 Evaluation Governance
Evaluation governance. Who designs the safety evals? Are evaluators independent (to avoid lab bias)? Pre- vs post-deployment? External evaluations by national AI Safety Institutes are emerging as standard. SOTA: a national AI Safety Institute (London) and the US AI Safety Institute conduct pre-deployment evaluations of frontier models from leading frontier labs and a multimodal frontier lab. Voluntary commitments via the Bletchley/Seoul/Paris summits. Independent eval is a growing institutional practice. e.g. a national AI Safety Institute's evaluations of a leading frontier model, a current-generation frontier model, etc. · a second national AI Safety Institute's similar program · MLCommons AILuminate
C2.6.1 Pre-deployment gating
Eval thresholds that must be passed before deployment. Industry standard: A Responsible Scaling Policy framework and a Preparedness-style framework define thresholds. Public Responsible Scaling Policies increasingly common.
C2.6.2 Third-party eval
External auditors run evals. Industry standard: National AI Safety Institutes and an external evaluation organization have run pre-deployment evals on frontier models.
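The threshold-based gating described in C2.6.1 can be pictured as a small decision function. The sketch below is a hypothetical illustration only: the category names, the numeric thresholds in the usage note, and the three outcomes are assumptions and do not reproduce any lab's published policy.

```python
from enum import Enum
from typing import Dict


class Decision(Enum):
    DEPLOY = "deploy"
    DEPLOY_WITH_MITIGATIONS = "deploy_with_mitigations"
    HOLD = "hold"


def gate(scores: Dict[str, float], thresholds: Dict[str, float],
         mitigations_in_place: bool) -> Decision:
    """Compare each capability score with its threshold; any crossing requires mitigations."""
    if set(thresholds) - set(scores):
        # A category that was never evaluated is treated as a failed gate, not a pass.
        return Decision.HOLD
    crossed = [name for name, limit in thresholds.items() if scores[name] >= limit]
    if not crossed:
        return Decision.DEPLOY
    return Decision.DEPLOY_WITH_MITIGATIONS if mitigations_in_place else Decision.HOLD
```

For example, with hypothetical thresholds `{"cyber": 0.6, "cbrn": 0.5}`, a model scoring 0.7 on cyber is held unless the required mitigations are already in place. Treating a missing evaluation as a hold is a deliberate fail-closed choice in this sketch.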
C3

Robustness

16 sub-endpoints mapped
MZN Provisional Position · Partial
Security-driven robustness research
Robustness work emerges from adversarial-research findings (perturbation, multi-turn, cross-modal). Persian-language robustness gives direct insight into low-resource cross-language safety gaps. Methodology documented under controlled disclosure.
Definition

Responsible Scaling / Release Framework: institutional commitments tying capability thresholds to required safety measures. The forcing function that prevents 'race to the bottom'. A Responsible Scaling Policy framework, a Preparedness-style framework, a multimodal frontier lab Frontier-Safety-style framework all define: capability levels, evaluation requirements per level, security/deployment mitigations required per level, conditions for pause/rollback.

State of the Art (2025–2026)

A Responsible Scaling Policy framework (v2, 2024): defines AI Safety Levels with capability thresholds for autonomous biosecurity, cyber, and AI R&D capabilities. A Preparedness-style framework (2023, updated): Critical/High/Medium/Low risk levels with deployment gates. A Frontier-Safety-style framework is similar. Voluntary commitments via national AI Safety Institutes and the Seoul declaration. Increasingly intersecting with regulation (EU AI Act).

Key Decisions
  • Capability threshold definitions
  • Required mitigations per threshold
  • Pre-deployment evaluation requirements
  • Pause conditions
  • Public commitments
Trade-offs
  • Strict thresholds → might pause valuable deployment
  • Loose → race-to-bottom risk
Numbers & Ablations
  • AI Safety Level (constitutional-methods framework) tiers: second AI Safety Level = current frontier, third AI Safety Level = capabilities triggering enhanced security/deployment, fourth AI Safety Level = catastrophic capabilities (no model has reached).
  • A leading frontier lab Preparedness: 4 risk categories (Cyber, CBRN, Persuasion, Model Autonomy), each rated Low/Medium/High/Critical.
  • a Frontier-Safety-style framework (2024): 7 capability levels across persuasion, autonomy, cyber, bio.
  • Voluntary commitments: 16 frontier labs signed Seoul Commitments (May 2024) including leading frontier labs, a multimodal frontier lab, an open-weights frontier lab, a synthetic-data-focused lab.
  • Eval frequency under Responsible Scaling Policy framework: every major model release, plus unscheduled re-eval if capability surprises emerge.
  • Pause/halt threshold: never publicly triggered at any frontier lab as of early 2026. Either thresholds are too high, or capability hasn't crossed them, or commitments are aspirational.
Open Questions
  • Are Responsible Scaling Policy framework capability thresholds set rigorously enough? They're voluntary; no external oversight on threshold-setting.
  • Eval validity: how do you prove that an eval correctly measures the capability it claims to? No formal verification.
  • Pause discipline: would a frontier lab actually pause development if a threshold triggered, in face of competitive pressure? Untested.
  • Capability surprise: capabilities emerge non-monotonically. Responsible Scaling Policy frameworks assume monotonic capability growth between evals, so they might miss sharp jumps.
  • Government takeover: if a lab triggers fourth AI Safety Level thresholds, what then? Frameworks are silent on government's role; geopolitically loaded.

Reference analyst note. Responsible Scaling Policies are useful coordination devices, but their actual prophylactic power is untested. They've never paused a release. The optimistic read: capabilities haven't crossed thresholds. The pessimistic read: thresholds are calibrated to never bind. The truth is probably a mix. The next test will come when a model genuinely approaches third AI Safety Level cyber or CBRN capability, likely in 2026-2027. Whether the framework holds under genuine commercial pressure is the real test.

Examples

a Responsible Scaling Policy framework (v2, 2024) (public) · a Preparedness-style framework (public) · a Frontier-Safety-style framework (public)

References (Academic)

a Responsible Scaling Policy framework (v2, 2024) · a Preparedness-style framework (2024) · a Frontier-Safety-style framework (2024)

Sub-endpoint anatomy — 16 items mapped
C3.1 Adversarial Robustness
Capability thresholds. Specific capability levels above which deployment requires additional safeguards. Examples: third AI Safety Level = 'meaningful uplift to non-state actor for CBRN attack' or 'autonomous research engineer at frontier-lab level'. Defining these is the central design question of a Responsible Scaling Policy framework. SOTA: AI Safety Level (constitutional-methods framework) tiers: second AI Safety Level (current frontier), third AI Safety Level (advanced biosec uplift OR partial autonomy), fourth AI Safety Level (extreme uplift OR substantial autonomy). A leading frontier lab: Critical/High/Medium/Low across categories. Industry is coordinating via national AI Safety Institutes. Trade-off: thresholds need to be measurable, but capability evaluation is hard. e.g. third AI Safety Level thresholds (constitutional-methods framework, public) · a Preparedness scorecard
C3.1.1 Suffix-based attacks (an optimization-based adversarial attack)
Optimized token suffixes that bypass safety. Industry standard: Zou 2023. Universal adversarial suffixes transfer across models. Mitigation via adversarial training and input filtering.
C3.1.2 Paraphrase robustness
Same intent, different wording → consistent behavior. Industry standard: PromptBench (Zhu 2023) tests systematic paraphrasing.
C3.1.3 Perturbation robustness
Typos, character swaps, Unicode tricks. Industry standard: TextAttack benchmark suite. Models reasonably robust to typos, vulnerable to crafted Unicode.
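As a sketch of the perturbation classes named in C3.1.3, the helpers below generate a typo-style swap, a homoglyph substitution, and a zero-width-space insertion, then measure how often a model's output stays the same. The `model` callable, the three-letter homoglyph table, and exact-match comparison are assumptions for illustration, not the TextAttack API.

```python
import random
from typing import Callable, List

HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}  # Latin -> Cyrillic look-alikes
ZERO_WIDTH_SPACE = "\u200b"


def swap_adjacent(text: str, rng: random.Random) -> str:
    """Typo-style perturbation: swap one random pair of adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def homoglyph(text: str) -> str:
    """Replace selected Latin letters with visually similar Cyrillic ones."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


def zero_width(text: str) -> str:
    """Insert a zero-width space between every pair of characters."""
    return ZERO_WIDTH_SPACE.join(text)


def consistency(model: Callable[[str], str], prompt: str, variants: List[str]) -> float:
    """Share of perturbed prompts whose output equals the output on the clean prompt."""
    if not variants:
        raise ValueError("variants must not be empty")
    baseline = model(prompt)
    return sum(model(v) == baseline for v in variants) / len(variants)
```

A keyword-based filter is a useful toy target here: both Unicode variants look identical to a reader yet no longer contain the original substring, which mirrors the "vulnerable to crafted Unicode" observation above.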
C3.2 Distribution Shift
Mitigation requirements. What must be in place when a capability threshold is reached. Examples: model weight encryption + access logging (against theft), deployment behavioral filtering, restricted access tier, internal review board approval, external red-team validation. SOTA: a constitutional-methods frontier lab third AI Safety Level deployment standard requires: harm-prevention measures with specific evaluation criteria, security controls protecting against insider threats, internal review board sign-off. Third AI Safety Level security: protect against non-state actors stealing weights. Hardware security (HSMs, TEEs) is an emerging requirement. e.g. a constitutional-methods frontier lab third AI Safety Level deployment + security standards · a leading frontier lab mitigation requirements per Preparedness tier
C3.2.1 Domain shift
Domains not heavily represented in pre-training. Industry standard: Performance degrades on legal, medical, niche scientific. Targeted SFT addresses partially.
C3.2.2 Temporal shift
Knowledge after training cutoff. Industry standard: Inevitable. Mitigated via retrieval augmentation, periodic retraining.
C3.3 Multi-Language Robustness
Internal review and governance. Decision-making structure that authorizes deployment. Examples: internal review board, board-level oversight (a constitutional-methods frontier lab Long-Term Benefit Trust, a leading frontier lab safety committees), required external sign-off for highest tiers. SOTA: a constitutional-methods frontier lab Long-Term Benefit Trust holds ultimate authority over key safety decisions. A leading frontier lab safety committees with board-level escalation. Public commitments to delay/halt deployment if Responsible Scaling Policy framework triggers fire. Incident response procedures. Whistleblower protections (post-2024 SB 1047 debate). e.g. a constitutional-methods frontier lab LTBT · a leading frontier lab Preparedness Advisory Group
C3.3.1 Cross-language safety
Same harmful query in low-resource language may bypass safety. Industry standard: Yong 2023 documented low-resource jailbreaks. Mitigation via multilingual safety SFT.
C3.3.2 Capability parity
Equivalent capability across languages. Industry standard: Significant gap remains for low-resource languages. Multilingual MMLU, MGSM benchmark gaps.
C3.4 Out-of-Distribution Behavior
Pause / rollback procedures. Conditions and process for stopping a deployment or training run. Required for credible Responsible Scaling Policy framework. Examples: capability eval result exceeds threshold without mitigations → pause training; deployed model exhibits unsafe behavior → rollback to prior version; security breach detected → emergency containment. SOTA: Frontier labs have documented but largely untested rollback procedures. A national AI Safety Institute external evaluations include 'pause condition triggered?' assessment. Few public examples of actual pause being triggered (some occurred internally at frontier labs, not publicized). e.g. Responsible Scaling Policy framework-mandated pause conditions
C3.4.1 Calibrated uncertainty
Model knows what it doesn't know. Industry standard: Active research. Modern models often confidently wrong on OOD inputs.
C3.4.2 Refusal on OOD
Whether model declines vs. confabulates. Industry standard: Better-aligned models refuse or hedge; weaker models hallucinate.
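One common way to quantify the "model knows what it doesn't know" property in C3.4.1 is expected calibration error (ECE). The sketch below assumes each answer comes with a confidence in [0, 1] and a correctness label; the equal-width binning and the default of 10 bins are conventional choices, not something the source prescribes.

```python
from typing import List, Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Bin predictions by confidence; average |accuracy - mean confidence| weighted by bin size."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and the same length")
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, c in enumerate(confidences):
        # Confidence 1.0 falls in the last bin.
        bins[min(int(c * n_bins), n_bins - 1)].append(i)
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        acc = sum(correct[i] for i in members) / len(members)
        conf = sum(confidences[i] for i in members) / len(members)
        ece += (len(members) / total) * abs(acc - conf)
    return ece
```

A model that is "confidently wrong" on out-of-distribution inputs shows up as high-confidence bins with low accuracy, which drives the ECE up.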
C3.5 Stress Tests
Stress tests. Adversarial inputs probing robustness: distribution shift (input from outside training distribution), adversarial perturbations (slight input changes flipping output), out-of-distribution detection. Distinct from C2 dangerous-capability tests. SOTA: Robustness benchmarks: AdvGLUE, ANLI for NLI; VQA-Robust for vision. Frontier models still vulnerable to subtle perturbations. Active research: certified robustness, adversarial training. Real-world stress: novel languages, domains, formats. e.g. AdvGLUE (Wang) · ANLI (Nie 2020) · MMLU-Robust variants
C3.5.1 Long-context degradation
Performance drop as context length increases. Industry standard: Lost-in-the-middle (Liu 2023) — middle of context attended less. Active mitigation.
C3.5.2 Input length stress
Very long single inputs without structure. Industry standard: Performance varies by model. Reported in long-context benchmarks (RULER).
C4

Output Safety

11 sub-endpoints mapped
MZN Provisional Position · Strong Evidence
Output-conformance safety templates and egress controls
Egress-time template conformance validates every response against safe-output templates — a paradigm shift from input enumeration. Last-mile enforcement controls ensure unsafe content cannot exit even when intent detection fails. Cached canonical refusals for known fragile zones. Methodology architecture is documented at high level; templates and allow-lists are reserved.
Definition

Output safety: defenses applied at inference-time on model outputs. Distinct from training-time safety (B-group). Operates as the final layer regardless of training quality. Components: output content filters (an open-weights output classifier, a leading frontier lab Moderations), PII detection/redaction, watermarking, provenance metadata (C2PA), output validation (schema compliance, refusal reformulation).

State of the Art (2025–2026)

A recent-generation output classifier (an open-weights frontier lab) is the open standard. Commercial options include a moderation API service and a constitutional-methods frontier lab's safety classifier. C2PA (Coalition for Content Provenance and Authenticity) is the standard for cryptographic content provenance, with Adobe, a leading frontier lab, and a synthetic-data-focused lab adopting it. A generative-content watermarking system (a multimodal frontier lab) watermarks AI-generated content. Constitutional Classifiers (a constitutional-methods frontier lab, 2025): trained classifiers checking outputs against constitution principles.

Key Decisions
  • Filter classifier (open vs custom)
  • PII redaction strategy
  • Watermarking yes/no/method
  • Provenance metadata
  • Latency budget for filtering
Trade-offs
  • More filtering → safer outputs, latency overhead
  • Watermarking → provenance, slight quality risk
Numbers & Ablations
  • a recent-generation output classifier: 8B params, 14 harm categories, ~95% accuracy on standard categories, ~50-100ms latency on a current-generation accelerator.
  • Constitutional Classifiers (a constitutional-methods frontier lab 2025): trained classifiers checking against 50+ constitution principles. ~80% reduction in jailbreak success vs base model alone.
  • A moderation API service: free, ~50ms latency, 13 categories. Frontier moderation classifiers run on every API request.
  • C2PA adoption (Aug 2024): Adobe, a leading frontier lab (DALL-E), a synthetic-data-focused lab Copilot, Sony cameras, Nikon cameras, BBC. Provenance via cryptographic signatures.
  • a generative-content watermarking system-Text watermark detection: ~95-99% true positive rate at acceptable false positive rates (Dathathri 2024). Robust to paraphrasing in shorter outputs, less so in longer.
  • Output safety latency budget: frontier APIs allocate 5-15% of inference cost / latency to safety classifiers.
Open Questions
  • Watermark robustness against adversarial paraphrasing: a generative-content watermarking system demonstrated on benign paraphrasing; under active adversarial attack, removal is straightforward.
  • Output classifier coverage: any classifier trained on a fixed taxonomy is gameable by attacks outside that taxonomy. The arms race is unwinnable in static defense.
  • Multi-modal output safety: text classifiers mature; image generation safety (Diffusion model output filtering) less mature.
  • Refusal style: 'sorry I can't help with that' refusals harm UX. Better refusal templates (offer alternative) under-deployed.
  • Content provenance enforcement: C2PA only works if downstream platforms enforce it. They mostly don't. Adoption gap.

Reference analyst note. Output safety is the right architectural choice: input filtering is doomed because the input space is unbounded, while the output space is comparatively constrained. The output-conformance safety paradigm (egress filtering + cached refusal templates + classifier ensemble) is the production-ready answer. The remaining hard problem is multimodal output (image/video/audio), where classification is much harder than text. Watermarking is a useful piece but not a solution; treat it as evidence, not enforcement.

Examples

a recent-generation output classifier · a leading frontier lab Moderations · a constitutional-methods frontier lab Constitutional Classifiers · a multimodal frontier lab a generative-content watermarking system

References (Academic)

Inan et al., an open-weights output classifier (2023) · Sharma et al., Constitutional Classifiers (2025) · C2PA spec · Dathathri et al., a generative-content watermarking system-Text (2024)

Sub-endpoint anatomy — 11 items mapped
C4.1 Output Classifiers
Content filter / guardrail models. Small classifier models that check input and output for harmful content. A recent-generation output classifier (an open-weights frontier lab, open) is reference: ~8B params, 14 harm categories, ~100ms inference. Often deployed both pre-input (block harmful prompts) and post-output (block harmful generations). SOTA: a recent-generation output classifier, ShieldGemma, WildGuard (all open). Commercial: a leading frontier lab Moderations, a constitutional-methods frontier lab safety classifiers, Lakera Guard, Robust Intelligence. Performance: ~95%+ on standard harm categories, fail on sophisticated jailbreaks. Constitutional Classifiers (a constitutional-methods frontier lab 2025) use principles-based classification. e.g. a recent-generation output classifier (open) · ShieldGemma 2/9/27B (a multimodal frontier lab) · a moderation API service
C4.1.1 an open-weights output classifier family
Open-weights safety classifier. Industry standard: Inan 2023 (a first-generation open-weights output classifier); an open-weights frontier lab has since released second- and third-generation versions. Widely used as the reference open implementation.
C4.1.2 Proprietary classifiers
Internal output-safety models. Industry standard: A leading frontier lab's moderation API service, plus internal classifiers at a constitutional-methods frontier lab and a multimodal frontier lab. Run alongside or in series with the main model.
C4.2 Canonical Refusal
PII (Personally Identifiable Information) detection and redaction. Identifies and masks names, addresses, SSNs, phone numbers, emails, credit cards in inputs and outputs. Required for GDPR, HIPAA, enterprise deployments. SOTA: a synthetic-data-focused lab Presidio (open) is standard PII engine. Custom recognizers for domain-specific PII (medical record numbers, etc.). Modern approaches use NER + LLM verification. Trade-off: aggressive redaction → utility loss; lax → leak risk. e.g. a synthetic-data-focused lab Presidio (open) · a hyperscaler platform Comprehend PII · Custom NER + LLM
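The detect-and-mask step described above can be sketched with a few regular expressions. This is a deliberately small illustration: the three patterns (email, US-style SSN, US-style phone) and the placeholder format are assumptions, it is not the API of any named PII engine, and regex alone will miss names and addresses, which is where NER or model-based verification comes in.

```python
import re

# Hypothetical pattern set; order matters because earlier patterns run first.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace each match with a typed placeholder such as <EMAIL>."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text
```

Typed placeholders keep some utility after redaction (a downstream reader still knows an email was present), which softens the aggressive-redaction side of the trade-off noted above.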
C4.2.1 Refusal templates
Pre-written refusal language. Industry standard: Used to ensure consistent, brand-safe refusal language. Routed when classifier flags.
C4.2.2 Safe-alternative routing
When refusing, suggest legitimate alternatives. Industry standard: Improves UX. Frontier labs handle in alignment training and at output stage.
C4.3 PII Filtering
Watermarking and provenance. Cryptographic signatures embedded in model outputs (text or media) that allow detection of AI generation. C2PA: industry standard for provenance metadata in images/video. a generative-content watermarking system: text and image watermarking by a multimodal frontier lab. Important for misinformation, deepfake detection, training-data quality (avoid training on AI output). SOTA: C2PA adopted by Adobe, a leading frontier lab, a synthetic-data-focused lab, Sony, Nikon. a generative-content watermarking system-Text (a multimodal frontier lab, 2024) demonstrated text watermarking with minimal quality loss and high detection accuracy. Open: MarkLLM. Trade-off: watermarks can be removed by paraphrasing, making detection adversarial. e.g. C2PA: Adobe, a leading frontier lab deploy · a generative-content watermarking system by a multimodal frontier lab · Open: MarkLLM
C4.4 Format / Structure Validation
Structured output validation. When model output must conform to a schema (JSON, function call), validation enforces this. Constrained decoding (Outlines, JSON mode in a leading frontier lab/a constitutional-methods frontier lab) restricts token sampling to schema-compliant continuations. Used for tool calls, structured extraction, agent loops. SOTA: A leading frontier lab Structured Outputs (Aug 2024): guaranteed schema compliance via constrained decoding. A similar approach via tool use. Open: Outlines library, llama.cpp grammars. Trade-off: constrained decoding can degrade quality if the model struggles with the format. Soft constraints (parse-and-retry) are often more practical than hard ones. e.g. A leading frontier lab Structured Outputs · Outlines library (open) · Pydantic-based validation
C4.4.1 Structured output (JSON, schema)
Validate JSON outputs match schema. Industry standard: Outlines, JSON Schema validation, constrained decoding.
C4.4.2 Code output validation
Static analysis of generated code for known-bad patterns. Industry standard: Linters, security scanners. Used in code-assistant products.
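The soft-constraint (parse-and-retry) approach mentioned for C4.4 can be sketched with the standard library alone. The `generate` callable stands in for a model call, and the two-field schema and retry hint are hypothetical; a real deployment would more likely use a full JSON Schema validator or constrained decoding.

```python
import json
from typing import Any, Callable, Dict, Optional

SCHEMA: Dict[str, type] = {"name": str, "quantity": int}  # hypothetical required fields


def validate(payload: Any, schema: Dict[str, type]) -> bool:
    """True if payload is a dict with every required key holding the expected type."""
    return isinstance(payload, dict) and all(
        key in payload and isinstance(payload[key], expected)
        for key, expected in schema.items()
    )


def parse_and_retry(generate: Callable[[str], str], prompt: str,
                    schema: Dict[str, type], max_attempts: int = 3) -> Optional[dict]:
    """Call the model, parse, validate, and re-prompt with a hint on failure."""
    attempt_prompt = prompt
    for _ in range(max_attempts):
        raw = generate(attempt_prompt)
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            payload = None
        if validate(payload, schema):
            return payload
        attempt_prompt = prompt + "\nReturn only valid JSON with keys: " + ", ".join(schema)
    return None  # caller decides how to handle repeated failure
```

Unlike constrained decoding, this never restricts sampling, so it cannot guarantee compliance; it trades that guarantee for simplicity and for leaving the model free to produce its preferred phrasing.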
C4.5 Latency & Cost of Output Safety
Latency and cost of output safety. Output safety adds inference cost: classifier pass adds 10-100ms latency, doubles compute for short responses. Engineering trade-offs: parallel classification (overlap with generation), early-exit on clearly-safe content, caching for repeated outputs. SOTA: a recent-generation output classifier ~50-100ms on a current-generation accelerator. Optimizations: early termination, parallel pipeline, caching. Frontier serving budgets 5-15% of inference cost for safety. output-conformance safety paradigms emphasize cached refusals for known harm patterns. e.g. a recent-generation output classifier latency benchmarks · Constitutional Classifiers performance
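The caching optimization listed above can be illustrated with a small egress filter that classifies each distinct output once and reuses the verdict. This is a generic sketch under stated assumptions: the `classifier` callable, the refusal text, and the SHA-256 cache key are hypothetical and do not describe any vendor's or MZN's reserved implementation.

```python
import hashlib
from typing import Callable, Dict

CANONICAL_REFUSAL = "I can't help with that, but I can suggest a safer alternative."  # hypothetical text


class EgressFilter:
    """Runs a classifier on each output, caching verdicts for repeated outputs."""

    def __init__(self, classifier: Callable[[str], bool]):
        self._classifier = classifier          # returns True when the text is unsafe
        self._verdicts: Dict[str, bool] = {}
        self.classifier_calls = 0              # exposed so cache hits can be measured

    def release(self, text: str) -> str:
        """Return the text if judged safe, otherwise the canonical refusal."""
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._verdicts:
            self.classifier_calls += 1
            self._verdicts[key] = self._classifier(text)
        return CANONICAL_REFUSAL if self._verdicts[key] else text
```

Exact-match caching only helps when outputs repeat verbatim (templated answers, canonical refusals); it does nothing for novel generations, which still pay the full classifier latency.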
D1

Serving

13 sub-endpoints mapped
MZN Provisional Position · Partial
Phase 1 application/platform serving experience across Mazzaneh modules
Phase 1 deployed live serving infrastructure for 168K+ users across 22 commerce modules. Specialized-routing architectural patterns documented. Frontier-scale serving with multi-region failover is Phase 3 scope.
Phase context: D1 references application/platform serving experience from Mazzaneh modules. It is not a claim of frontier-scale model-serving infrastructure.
Definition

Serving stack. From request arrival to response. Components: API gateway (auth, routing), inference engine (an open-source inference engine, TRT-LLM, SGLang), batch coordinator, response streamer. Performance gap between naive and optimized: 10-100×.

State of the Art (2025–2026)

An open-source inference engine is the dominant open option. A vendor inference stack delivers peak performance on a leading accelerator vendor's hardware. SGLang suits shared-prefix workloads. Hosted: Anyscale, Together AI, Fireworks, Replicate. A high-throughput inference accelerator (LPU) targets ultra-low latency. Multi-model dispatch (multiple base models on the same cluster) is increasingly common.

Key Decisions
  • Engine choice
  • Auto-scaling strategy
  • Multi-model isolation
  • GPU pool sizing
Numbers & Ablations
  • an open-source inference engine throughput: ~10-25× over naive batch=1 baseline at typical workloads. PagedAttention reduces KV memory waste from ~60% to ~4%.
  • TTFT (time to first token) targets: <200ms chat, <500ms tool use, <100ms voice. Frontier achieves these with prefix caching + speculative decoding.
  • TPOT (time per output token) targets: <50ms (at least 20 tok/sec) for smooth streaming, <30ms desirable.
  • A leading open-weights model (70B class), FP16, on a single current-generation accelerator: ~30-50 tok/sec single-user, ~1500-3000 tok/sec batch-32. Quantized INT4: ~1.5× boost.
  • Cost per million tokens (mid-2024): a current-generation frontier model-Turbo input/output $10/$30, a long-context frontier model $3/$15, a leading open-weights model (70B class) (Together) $0.88/$0.88. Prompt caching reduces by 50-90%.
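The pricing bullet above can be turned into a small per-request cost estimate. This is a sketch under stated assumptions: prices are USD per million tokens, the $3/$15 price point is taken from the list, and the 90% discount on cached input tokens and the token counts are illustrative values, not a specific provider's billing rules.

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price: float, output_price: float,
                     cached_tokens: int = 0, cache_discount: float = 0.0) -> float:
    """Cost of one request; prices are USD per million tokens.

    cached_tokens is the part of the input served from the prompt cache,
    billed at (1 - cache_discount) of the normal input price.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    fresh = input_tokens - cached_tokens
    input_cost = (fresh + cached_tokens * (1.0 - cache_discount)) * input_price
    return (input_cost + output_tokens * output_price) / 1_000_000

# 2,500 input tokens (2,000 of them a shared system prompt) and 300 output tokens.
base = request_cost_usd(2_500, 300, 3.0, 15.0)
cached = request_cost_usd(2_500, 300, 3.0, 15.0,
                          cached_tokens=2_000, cache_discount=0.9)
```

With these assumed numbers the request costs about $0.0120 uncached and about $0.0066 cached. The discount applies only to cached input tokens, so the whole-request saving is smaller than the headline discount whenever output tokens carry a large share of the bill.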
Open Questions
  • Optimal serving stack at frontier: an open-source inference engine, a vendor inference stack, SGLang each have advantages. No standard 'best' — workload-specific.
  • Multi-tenancy isolation: how strong is isolation between customer requests on shared GPU? Some side-channel concerns (timing, cache).
  • Edge inference: a current-generation accelerator-class models on edge (laptops, phones) is the new frontier. A leading open-weights model.2-3B, a synthetic-heavy small frontier model-mini run on phone. Quality gap to frontier still substantial.
  • Serving reasoning models: o1/R1-style models with hidden chains-of-thought have very different latency profiles (long initial thinking). UX patterns unclear.

Reference analyst note. Inference engineering is undervalued relative to training. A 5× throughput gain via better serving = 5× more users at same cost. Most labs underinvest. An open-source inference engine's PagedAttention was a paper; it should have been a unicorn. The next round of gains comes from: (a) speculative decoding everywhere (a draft-head speculative decoding technique-2, MTP), (b) FP8/FP4 inference on a next-generation accelerator, (c) cross-request KV cache (prefix caching), (d) serving optimizations specific to reasoning models. Anyone serving LLMs at scale who isn't doing all four is leaving 5-10× on the table.

Examples

an open-source inference engine (open frontier) · a vendor inference stack (a leading accelerator vendor optimized) · SGLang challenger · a high-throughput inference accelerator LPU production

References (Academic)

Kwon et al., an open-source inference engine (2023) · Zheng et al., SGLang (2024)

Sub-endpoint anatomy — 13 items mapped
D1.1 Inference Engine
Inference engine itself: the runtime that takes tokenized request → produces tokens. Manages KV cache, attention computation, sampling. An open-source inference engine, a vendor inference stack, SGLang are competing implementations. SOTA: an open-source inference engine PagedAttention + continuous batching is reference design. A vendor inference stack uses a leading accelerator vendor's compiled kernels for peak perf. SGLang RadixAttention shares prefix KV across requests. Hardware-specific: a high-throughput inference accelerator LPU bypasses GPU paradigm entirely. e.g. an open-source inference engine 0.6+ · a vendor inference stack · SGLang
D1.1.1 an open-source inference engine
Open-source serving engine with PagedAttention. Industry standard: Dominant open-source choice. Kwon 2023 introduced PagedAttention.
D1.1.2 an open-source inference server (Text Generation Inference)
an open-model hub inference server. Industry standard: Common deployment choice. Supports continuous batching, quantization.
D1.1.3 a vendor inference stack / a vendor inference platform
a leading accelerator vendor inference stack. Industry standard: High-performance commercial deployment. Used at major clouds.
D1.1.4 Proprietary engines
a leading frontier lab, a constitutional-methods frontier lab, a multimodal frontier lab internal serving stacks. Industry standard: Custom for frontier labs. Architecture details not public.
D1.2 Request Routing
Request routing. Given heterogeneous fleet and per-request context (model, length, latency target), route to right backend. Routing factors: model availability, KV-cache locality (cache-aware routing), latency target, queue depth. SOTA: Cache-aware routing (route to backend with cached prefix) reduces TTFT 5-10× for shared system prompts. Leading frontier labs use sophisticated routing for prompt caching. Sticky sessions for long conversations preserve KV cache. e.g. A leading accelerator vendor a vendor inference platform routing · Cloudflare AI Gateway · Custom routing layers
D1.2.1 Load balancer
Front-tier distribution. Industry standard: Standard L7 load balancer (Envoy, nginx).
D1.2.2 Model routing
Selecting which model handles which request. Industry standard: Mix of explicit (user selects model) and implicit (router selects based on query type, cost).
D1.2.3 Tenant isolation
Multi-tenant serving with isolation guarantees. Industry standard: Required for enterprise. Per-tenant rate limits, quotas, optionally dedicated instances.
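A minimal sketch of the cache-aware routing idea described under D1.2: prefer the backend that already holds the longest matching prefix, and fall back to queue depth. The `Backend` record, the `max_queue` cutoff, and the scoring rule are assumptions made for illustration; real routers track cache contents approximately and weigh more signals.

```python
from dataclasses import dataclass, field

@dataclass
class Backend:
    name: str
    queue_depth: int
    cached_prefixes: list[tuple[int, ...]] = field(default_factory=list)

def shared_prefix_len(a: tuple[int, ...], b: tuple[int, ...]) -> int:
    """Number of leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(tokens: tuple[int, ...], backends: list[Backend],
          max_queue: int = 8) -> Backend:
    """Prefer the longest cached prefix; break ties on queue depth."""
    # Skip overloaded backends unless every backend is overloaded.
    candidates = [b for b in backends if b.queue_depth < max_queue] or backends

    def score(b: Backend) -> tuple[int, int]:
        best = max((shared_prefix_len(tokens, p) for p in b.cached_prefixes),
                   default=0)
        return (-best, b.queue_depth)  # longer prefix first, then shorter queue

    return min(candidates, key=score)
```

Sticky sessions fall out of the same rule: a conversation's earlier turns are a cached prefix on the backend that served them, so follow-up requests score highest there.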
D1.3 Streaming & Response Format
Streaming and response format. Server-Sent Events (SSE) for token-by-token streaming. Function call streaming. Multimodal streaming (image generation as it forms, audio chunks for voice mode). SOTA: SSE universal for chat. Tool-call streaming (chunks of JSON function arguments) standardized. Voice modes (a frontier multimodal model, a multimodal frontier model Live) stream audio at low latency (<300ms TTFT). WebSocket for bidirectional voice/video. e.g. A leading frontier lab streaming API · a constitutional-methods frontier lab streaming · a frontier multimodal model Voice (audio streaming)
D1.3.1 SSE / streaming
Server-sent events for token-by-token delivery. Industry standard: Universal at frontier APIs.
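For orientation, the sketch below frames tokens as Server-Sent Events and parses them back. The `data:` field and blank-line frame terminator follow the SSE wire format; the `[DONE]` end-of-stream sentinel is an assumed convention that differs between providers.

```python
from typing import Iterable, Iterator

DONE = "[DONE]"  # assumed end-of-stream sentinel; providers differ

def sse_encode(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each token in an SSE frame: 'data: ...' lines plus a blank line."""
    for tok in tokens:
        # Multi-line payloads need one 'data:' field per line.
        body = "\n".join(f"data: {line}" for line in tok.split("\n"))
        yield body + "\n\n"
    yield f"data: {DONE}\n\n"

def sse_decode(stream: str) -> list[str]:
    """Parse a complete SSE stream back into token payloads, stopping at DONE."""
    out = []
    for frame in stream.split("\n\n"):
        # Strip the field name and at most one leading space from each line.
        lines = [ln[len("data:"):].removeprefix(" ")
                 for ln in frame.split("\n") if ln.startswith("data:")]
        if not lines:
            continue
        payload = "\n".join(lines)
        if payload == DONE:
            break
        out.append(payload)
    return out
```

The decoder here works on a fully buffered string for clarity; a streaming client would apply the same frame rule incrementally as bytes arrive.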
D1.3.2 Tool / function call format
Structured output for tool invocations. Industry standard: A leading frontier lab function calling format dominant; a constitutional-methods frontier lab tool use; standardization emerging via an emerging tool-protocol standard.
D1.4 API Surface
API surface design. Endpoint shape: chat completions (de facto a leading frontier lab standard), completions (legacy), embeddings, batch, fine-tuning, files, audio. Consistency, versioning, deprecation policy. SOTA: A leading frontier lab Chat Completions = de facto standard. A constitutional-methods frontier lab Messages API similar. Most providers offer 'a leading frontier lab-compatible' endpoint. New endpoint categories: realtime (voice), assistants (state), code interpreter. Function calling / tools mature. e.g. A leading frontier lab API spec · a constitutional-methods frontier lab Messages API · OpenRouter (proxy across providers)
D2

Inference Optimization

19 sub-endpoints mapped
MZN Provisional Position · Strong Evidence
Patent-grade candidate inference frameworks; benchmark validation pending
Five interconnected, patent-documented frameworks address inference cost from progressive contextual activation, persistent user-model caching, specialized routing, intent clarification, and computed-once-served-many. Combined documented impact at frontier scale is multi-hundred-million-dollar annual savings. Architecture is patent-documented; mechanics held in the proprietary portfolio.
Definition

Inference optimization: reducing latency and cost per token. Stack: KV cache management, batching, speculative decoding, quantization, sparsity, kernel optimization. 10-100× speedup possible vs naive baseline.

State of the Art (2025–2026)

Frontier serving combines: PagedAttention (an open-source inference engine) + continuous batching + speculative decoding (a draft-head speculative decoding technique-2) + INT4 weight quant + FP8 activation quant + custom CUDA kernels (FlashAttention 3). Latency budgets: TTFT <200ms for chat, TPOT <50ms for streaming.

Key Decisions
  • Optimization stack components
  • Hardware target (a current-generation accelerator, a next-generation accelerator, AMD)
  • Quantization aggressiveness
  • Latency vs throughput trade-off
Numbers & Ablations
  • PagedAttention KV memory waste: ~60% (naive) → ~4% (an open-source inference engine). Roughly 4-15× more concurrent requests.
  • Continuous batching: 5-10× throughput vs static batching at varying-length workloads.
  • a draft-head speculative decoding technique-2 speculative decoding: 3-4× decode latency reduction with no quality loss. Production-deployed.
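  • Quantization: AWQ INT4 weight-only gives roughly 4× smaller weights than FP16 before scale overhead (total memory savings are lower once KV cache is counted), <1% quality loss on most benchmarks. FP8 (a current-generation accelerator): ~2× throughput, near-zero quality loss.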
  • FlashAttention-3: 1.2-1.5× over FlashAttention-2 on a current-generation accelerator. Effectively the universal attention kernel.
  • Prefix caching impact: shared 2K-token system prompt across requests = 80-95% TTFT reduction via cached KV.
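The prefix-caching figure in the last bullet can be sanity-checked with simple arithmetic. The sketch assumes TTFT is dominated by prefill and that prefill cost scales roughly linearly with the number of uncached prompt tokens; both are simplifications, and the per-request suffix lengths are illustrative.

```python
def prefill_fraction_saved(cached_tokens: int, new_tokens: int) -> float:
    """Fraction of prompt tokens whose prefill is skipped thanks to cached KV.

    Simplifying assumption: TTFT is dominated by prefill, and prefill cost is
    roughly proportional to the number of uncached prompt tokens.
    """
    total = cached_tokens + new_tokens
    if total <= 0:
        raise ValueError("prompt must contain at least one token")
    return cached_tokens / total

# Shared 2K-token system prompt with per-request suffixes of varying length.
for new in (100, 250, 500):
    print(new, f"{prefill_fraction_saved(2_000, new):.0%}")
```

Under these assumptions a 2,000-token cached prefix skips 95%, 89%, and 80% of prefill for suffixes of 100, 250, and 500 tokens, which is in line with the 80-95% range quoted above.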
Open Questions
  • Optimal quantization at extreme low bit (W4A4, W2): research shows degradation; production deployment cautious.
  • Speculative decoding for reasoning models: long-CoT outputs may have lower acceptance rates. Workload-specific tuning unclear.
  • Hardware-software co-design: a next-generation accelerator NVLink Switch enables 72-GPU coherent domains. How much should serving stacks evolve to exploit this?
  • Inference-time compute scaling (best-of-N, MCTS): how do you serve these? Same per-query infrastructure must scale 10-100× compute. Production patterns immature.

Reference analyst note. Inference optimization is solved at the kernel and batching levels — an open-source inference engine, a vendor inference stack, FlashAttention together cover most of the win. The remaining frontier is system-level: prefix caching at scale, speculative decoding for reasoning, multi-LoRA dispatch, hardware-aware kernel JIT. Frontier serving stacks in 2026 will look fundamentally different from 2024 in their handling of test-time-compute-scaling models — this transition is mid-progress and labs differ widely.

Examples

an open-source inference engine with all optimizations · a vendor inference stack peak a leading accelerator vendor · Together AI production stack

Sub-endpoint anatomy — 19 items mapped
D2.1 KV-Cache Management
KV cache management. KV memory dominates at long context. PagedAttention (fixed-size pages, sharing) reduces memory waste from ~60% to ~4%. Prefix caching shares pages across requests with shared system prompt. SOTA: PagedAttention universal. Prefix caching production-deployed (a constitutional-methods frontier lab 90% discount, a leading frontier lab 50% discount). KV cache offloading to CPU/disk for very long context. KV quantization (INT8) → 2× more requests per GPU. e.g. an open-source inference engine PagedAttention · a constitutional-methods frontier lab prompt caching · a leading frontier lab prompt caching
D2.1.1 PagedAttention
Virtual-memory-style page allocation for KV cache. Industry standard: an open-source inference engine standard. Reduces memory fragmentation, enables higher concurrency.
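A toy allocator can make the page-table idea concrete. This sketch only models bookkeeping (block tables and a free list); the class name, block size, and API are hypothetical, and it is not the engine's actual implementation, which also handles page sharing and copy-on-write.

```python
class PagedKVAllocator:
    """Toy block allocator: each sequence owns a list of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))       # physical block ids
        self.tables: dict[str, list[int]] = {}    # seq id -> block table
        self.lengths: dict[str, int] = {}         # seq id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, taking a new block only when needed."""
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n == len(table) * self.block_size:     # last block is full (or none yet)
            if not self.free:
                raise MemoryError("out of KV blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free list."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self) -> int:
        """Allocated-but-unused token slots; less than one block per sequence."""
        return sum(len(t) * self.block_size - self.lengths.get(s, 0)
                   for s, t in self.tables.items())
```

Because blocks are allocated on demand, waste is bounded by the partially filled last block of each sequence rather than by a worst-case reservation for the maximum output length.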
D2.1.2 Prefix caching
Reuse KV for shared prompt prefixes (system prompt, conversation history). Industry standard: Standard at scale. A leading frontier lab prompt caching, a constitutional-methods frontier lab prompt caching, an open-source inference engine automatic prefix caching.
D2.1.3 KV quantization
Quantize stored KV (FP8, INT8) to fit longer context. Industry standard: Increasingly used for very long context. Mild quality impact.
D2.2 Batching
Batching strategy. Continuous batching swaps completed requests with new ones — 5-10× throughput. Chunked prefill interleaves prefill with decode for steady GPU utilization. Frontier deployments: dynamic batching with chunked prefill. SOTA: Continuous batching standard. Chunked prefill (Agrawal et al., 2023) prevents prefill stalls during decode. Per-iteration scheduling enables fine-grained mixing of prefill and decode requests. e.g. an open-source inference engine continuous batching · an open-source inference server (an open-model hub) · Sarathi-Serve chunked prefill
D2.2.1 Continuous batching
Add and remove requests dynamically (no padding to longest). Industry standard: Universal. Yu 2022 (Orca) reference. Major throughput gain over static batching.
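The difference between static and continuous batching can be illustrated by counting decode steps in a toy scheduler. The model below assumes every request is already queued, each active request emits one token per step, and prefill is free; these are simplifications, so the ratio it produces is illustrative rather than a throughput benchmark.

```python
def static_batch_steps(output_lens: list[int], batch_size: int) -> int:
    """Decode steps when each batch runs until its longest request finishes."""
    return sum(max(output_lens[i:i + batch_size])
               for i in range(0, len(output_lens), batch_size))

def continuous_batch_steps(output_lens: list[int], batch_size: int) -> int:
    """Decode steps when a finished request's slot is refilled on the next step."""
    pending = list(output_lens)[::-1]   # pop() takes requests in arrival order
    running: list[int] = []             # remaining tokens per active request
    steps = 0
    while pending or running:
        while pending and len(running) < batch_size:
            running.append(pending.pop())
        # Every active request emits one token; drop the ones that just finished.
        running = [r - 1 for r in running if r > 1]
        steps += 1
    return steps
```

For output lengths `[100, 10, 10, 10, 100, 10, 10, 10]` and a batch size of 4, the static scheduler needs 200 steps and the continuous one 110, because short requests no longer wait for the longest member of their batch.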
D2.2.2 Chunked prefill
Split long prefill phase into chunks; interleave with decode. Industry standard: Reduces tail latency. a chunked-prefill optimization technique, an open serving optimization framework.
D2.3 Speculative Decoding
Speculative decoding. Small draft model proposes k tokens; large target verifies in parallel. Accepted prefix advances; on rejection, target's correction used. 2-3× latency reduction. Variants: a draft-head speculative decoding technique (learned draft head), a draft-head speculative decoding technique (multi-head), n-gram speculation. SOTA: a draft-head speculative decoding technique-2 achieves 3-4× speedup. Production deployments at a leading frontier lab, a constitutional-methods frontier lab. Self-speculative (a draft-head speculative decoding technique) avoids separate draft model. Tree-based speculation (multiple draft branches, target picks best) emerging. e.g. A leading open-weights model + smaller draft · a draft-head speculative decoding technique / a draft-head speculative decoding technique-2 · a draft-head speculative decoding technique multi-head
D2.3.1 Standard speculative
Draft + verify with separate small model. Industry standard: Leviathan 2023, Chen 2023. 2-3× latency reduction.
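The draft-and-verify loop can be sketched with greedy decoding, where verification reduces to an exact-match check. This is a simplified variant: the published methods use a rejection-sampling rule to preserve the target's sampling distribution, and a real engine verifies all positions in one batched forward pass. The `NextToken` type and the toy models in the usage note are assumptions for illustration.

```python
from typing import Callable, Sequence

NextToken = Callable[[Sequence[int]], int]  # greedy next-token function

def speculative_step(prefix: list[int], draft: NextToken, target: NextToken,
                     k: int = 4) -> list[int]:
    """One draft-and-verify round; returns the tokens appended to the prefix."""
    # 1. Draft model proposes k tokens autoregressively.
    proposed: list[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))
    # 2. Target checks each position (one batched pass in a real engine).
    accepted: list[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)   # target's correction replaces the miss
            return accepted
        accepted.append(tok)
    # 3. All k accepted: the same target pass also yields one bonus token.
    accepted.append(target(prefix + accepted))
    return accepted

def generate(prefix: list[int], draft: NextToken, target: NextToken,
             n_tokens: int, k: int = 4) -> list[int]:
    """Repeat speculative rounds until n_tokens new tokens exist."""
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        out += speculative_step(out, draft, target, k)
    return out[:len(prefix) + n_tokens]
```

With a toy target that counts modulo 10 and a draft that agrees except after the token 4, the output is identical to target-only greedy decoding; the draft only changes how many target passes are needed, not the result.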
D2.3.2 a draft-head speculative decoding technique / a draft-head speculative decoding technique
Draft heads on main model; no separate model. Industry standard: Cai 2024 (a draft-head speculative decoding technique), Li 2024 (a draft-head speculative decoding technique). Used at frontier.
D2.3.3 Lookup decoding
Cache common n-grams and propose without a draft model. Industry standard: Useful for repetitive output (code, structured data).
D2.4 Quantization
Quantization. Reduce precision: FP16 → INT8/INT4 weight, FP8 weight+activation. Memory and bandwidth savings → larger batches, faster inference. Weight-only (AWQ, GPTQ) common; activation quantization harder. SOTA: INT4 weight-only (AWQ, GPTQ) standard for cost-effective serving. FP8 (a current-generation accelerator, a next-generation accelerator) preserves quality better than INT8 with same throughput. SmoothQuant for activation quantization. A leading open-weights model.3 70B runs in INT4 on a single current-generation accelerator. e.g. AWQ standard · GPTQ similar · FP8 on a current-generation accelerator (an open-weights frontier model (V3 class) inference)
D2.4.1 Weight quantization (INT8, INT4, FP8)
Quantize weights post-training. Industry standard: GPTQ, AWQ for INT4. FP8 native on a current-generation accelerator/a next-generation accelerator. INT4 standard for cost-optimized serving.
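To show what weight-only INT4 quantization stores, here is a plain round-to-nearest sketch with one scale per group. It is deliberately not AWQ or GPTQ, which add activation-aware scaling or error compensation on top; the group size of 4 is an illustrative assumption (production group sizes are typically much larger).

```python
def quantize_int4(weights: list[float], group_size: int = 4):
    """Symmetric round-to-nearest INT4: one float scale per group, codes in [-8, 7]."""
    codes, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group to +/-7.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales

def dequantize_int4(codes: list[int], scales: list[float], group_size: int = 4):
    """Reconstruct approximate weights from codes and per-group scales."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]
```

Each weight costs 4 bits plus its share of the group scale, and the reconstruction error per weight is at most half a quantization step for that group.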
D2.4.2 Activation quantization
Quantize activations as well as weights. Industry standard: SmoothQuant. More challenging than weight quantization.
D2.4.3 QAT (Quantization-Aware Training)
Train with quantization simulation to recover accuracy. Industry standard: Used selectively when post-training quantization loses too much.
D2.5 Sparsity
Sparsity. Structured sparsity (a leading accelerator vendor 2:4 pattern, hardware-supported) and unstructured pruning. Inference-time MoE-style routing for dense models (research). Less common at frontier than quantization. SOTA: A leading accelerator vendor Ampere/a Hopper-class architecture hardware accelerates 2:4 sparsity (50% sparse) by 2×. Inference-time sparsity via importance pruning (Wanda, SparseGPT). MoE itself is structural sparsity — top-2 of 8 experts active. e.g. Wanda (Sun et al., 2023) · SparseGPT (Frantar & Alistarh, 2023) · a leading accelerator vendor 2:4 hardware
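The 2:4 pattern mentioned above means that at most two of every four consecutive weights are non-zero. A minimal magnitude-based sketch follows; choosing survivors by absolute value is an assumption for illustration, whereas methods such as Wanda and SparseGPT use activation statistics or second-order information to pick them.

```python
def prune_2_of_4(weights: list[float]) -> list[float]:
    """Zero the two smallest-magnitude values in every consecutive group of four."""
    if len(weights) % 4:
        raise ValueError("length must be a multiple of 4")
    out = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # Indices of the two largest magnitudes; ties keep the earlier index.
        keep = set(sorted(range(4), key=lambda j: -abs(group[j]))[:2])
        out.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return out
```

The fixed two-of-four structure is what lets hardware skip the zeros; unstructured pruning at the same 50% density does not get the same acceleration.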
D2.6 Compilation & Kernels
Compilation and kernels. Custom CUDA kernels (FlashAttention 1/2/3) drive 2-4× attention speedup. Compilation frameworks (torch.compile, JAX XLA, a vendor inference stack) fuse operations. CUTLASS, a vendor inference platform for kernel authoring. SOTA: FlashAttention-3 (2024) on a current-generation accelerator: 1.2-1.5× FlashAttention-2. torch.compile mature in PyTorch 2.x. A vendor inference platform (a leading frontier lab) for high-level kernel writing. Frontier serving: full graph compilation + custom kernels for hot paths. e.g. FlashAttention 3 · torch.compile · a vendor inference platform kernels
D2.6.1 torch.compile / TorchDynamo
Graph compilation in PyTorch. Industry standard: Common for production deployment.
D2.6.2 Custom CUDA kernels
Hand-written kernels for hot paths. Industry standard: FlashAttention is canonical. Frontier labs maintain proprietary kernels.
D3

Monitoring

15 sub-endpoints mapped
MZN Provisional Position · Strong Evidence
Monitoring architecture / GPU Sentinel route; implementation or pilot validation pending
Multi-layer monitoring architecture spanning behavioral analysis, runtime safety, anomaly detection, and hardware-level GPU telemetry. Methodology documented across operational and security categories. The dedicated GPU security category is documented as a market category that does not currently have commercial product entrants.
Definition

Production monitoring. What's happening in production right now? Latency (TTFT, TPOT, end-to-end), throughput, error rates, GPU utilization, KV cache hit rate, cost per request, content quality, drift, anomalies.

State of the Art (2025–2026)

Standard SRE metrics + LLM-specific layers. LangSmith, Arize Phoenix, Langfuse, Helicone for LLM observability. OpenTelemetry GenAI semantic conventions emerging as standard.

Key Decisions
  • Observability stack
  • Trace sampling rate
  • Cost attribution
  • Quality metrics
Numbers & Ablations
  • Standard SLOs: TTFT p95 <500ms, p99 <2s. Error rate <0.1%. Quality regression detection within 24-72 hours of deployment.
  • Online quality eval sampling: frontier labs sample 1-5% of production traffic for online judges.
  • Drift detection: typical bin-based / KS-test on response length, refusal rate. Alert thresholds 2-3 sigma.
  • Cost attribution granularity: per-customer, per-endpoint, per-token-type (input/output/cached).
  • Trace storage: full trace at 1% sample = ~1TB/day at frontier scale. Short retention (30-90 days) typical.
Open Questions
  • Quality regression detection latency: how fast can you actually detect that a deployed model got slightly worse? Anecdotally: hours to days, depending on regression magnitude.
  • Online eval reliability: LLM-as-judge has known biases (length, position, style). Online quality monitoring inherits these.
  • User feedback signal: thumbs-up/down rates are 0.1-1% of interactions. How representative is this signal?
  • Cost spike detection: distinguishing legitimate growth from abuse / attack / runaway agent loop is hard.

Reference analyst note. Production observability for LLMs is 5 years behind general SRE. LangSmith, Helicone, Langfuse are gradually catching up but lack maturity of Datadog/New Relic. The hard problem is quality monitoring — capability changes are subtle and statistical signals are noisy. Frontier labs maintain large internal observability teams; smaller deployments are largely flying blind. Expect this to be a major investment area 2025-2027.

Examples

LangSmith · Arize Phoenix · Helicone · Langfuse (open)

Sub-endpoint anatomy — 15 items mapped
D3.1 Latency Metrics
Latency metrics. TTFT (Time to First Token, dominant for chat UX), TPOT (Time Per Output Token, streaming smoothness), end-to-end. p50/p95/p99 all monitored. Long-tail (p99) often dominates user experience. SOTA: Frontier targets: TTFT <200ms for chat, <500ms for tools, <100ms for voice. TPOT <50ms (at least ~20 tokens/sec) for smooth streaming. Routing decisions favor low latency over throughput for interactive workloads. e.g. a constitutional-methods frontier lab latency dashboards · a leading frontier lab latency reporting · Production SRE practices for LLM
D3.1.1 Time-to-first-token (TTFT)
Latency from request to first output token. Industry standard: Critical UX metric. Frontier APIs target sub-second p50.
D3.1.2 Inter-token latency (ITL)
Latency between consecutive tokens. Industry standard: Drives perceived speed. Targets 30-100ms typical.
D3.1.3 End-to-end latency
Total request duration. Industry standard: Reported as p50/p90/p99.
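The three latency metrics in D3.1 can be derived from per-token timestamps. The sketch below assumes each request records a send time and the arrival time of every output token, and it uses the nearest-rank percentile definition; monitoring systems differ in both the timestamp source and the percentile method.

```python
import math

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile (0 < p <= 100) of a non-empty sample."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def request_latency(sent_at: float, token_times: list[float]) -> dict[str, float]:
    """TTFT, mean inter-token latency, and end-to-end latency for one request."""
    ttft = token_times[0] - sent_at
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return {"ttft": ttft, "itl": itl, "e2e": token_times[-1] - sent_at}
```

Feeding the per-request `ttft` values of a time window into `percentile(..., 50)`, `percentile(..., 95)`, and `percentile(..., 99)` gives the p50/p95/p99 series that dashboards plot.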
D3.2 Throughput
Throughput. Requests/sec, tokens/sec aggregate. Capacity planning, autoscaling decisions. Per-model, per-region. Monitor utilization to decide adding capacity or rerouting. SOTA: Frontier deployments serve millions of requests/day. Autoscaling on multiple signals: queue depth, GPU utilization, latency p95. Region-aware routing for compliance + latency. Capacity planning: 2× peak for headroom. e.g. Standard SRE scaling · Multi-region deployments · a constitutional-methods frontier lab / a leading frontier lab scaling patterns
D3.2.1 Aggregate tokens/sec
Total cluster throughput. Industry standard: Capacity planning metric.
D3.2.2 Per-GPU utilization
GPU-level throughput. Industry standard: Tracked for cost optimization.
D3.3 Quality Monitoring
Quality monitoring. Online evaluation of response quality. Methods: LLM-as-judge on sampled outputs, user feedback (thumbs, regenerate-rate), canary requests against golden test set, response distribution monitoring. SOTA: LangSmith online evaluators standard pattern. Sample 1-5% of production traffic through eval harness. Compare against historical baseline. Alert on quality regression. LLM-as-judge with fresh test sets. e.g. LangSmith online evals · Custom canary monitoring · Helicone quality tracking
D3.3.1 User feedback (thumbs)
Explicit user ratings. Industry standard: Universal in chat products. Sparse but high-signal.
D3.3.2 Implicit feedback
Edits, retries, abandonment. Industry standard: Stronger signal at scale than explicit ratings.
D3.3.3 LLM-as-judge eval on production samples
Sample production traffic and run quality eval. Industry standard: Increasingly standard. Catches quality regressions between releases.
D3.4 Drift Detection
Drift detection. Distributions of inputs and outputs change over time. Could indicate: shift in user behavior, attacker probing, model regression. Track: response length distribution, refusal rate, sentiment, topic distribution. SOTA: Statistical tests (KS, Wasserstein) on output distributions. Anomaly detection on aggregate metrics. Drift dashboards reviewed weekly+. Major drift triggers investigation and possibly rollback. e.g. Standard ML drift tools (Evidently, Arize) · Custom statistical monitoring
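As one concrete form of the statistical tests mentioned for drift detection, the sketch below computes the two-sample Kolmogorov-Smirnov statistic on a metric such as response length. The 0.2 alert threshold is an assumed, tunable value; a production check would convert the statistic to a p-value or calibrate the threshold on historical windows.

```python
def ks_statistic(baseline: list[float], current: list[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(baseline), sorted(current)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:   # advance past all values equal to x
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def drifted(baseline: list[float], current: list[float],
            threshold: float = 0.2) -> bool:
    """Flag drift when the KS statistic exceeds an assumed, tunable threshold."""
    return ks_statistic(baseline, current) > threshold
```

A typical use is to compare last week's response-length sample against today's and open an investigation when `drifted` returns true for several consecutive windows.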
D3.5 Anomaly Detection
Anomaly detection. Rare events: sudden traffic spikes (DDoS), unusual content patterns (attack pattern), single-user behavioral anomalies. Real-time alerting. SOTA: Multi-layer: rate limit anomalies, content-classifier anomalies, behavioral pattern anomalies. ML-based anomaly detectors on aggregate metrics. Specific to LLM: prompt-injection detection, jailbreak attempt detection, runaway loop detection. e.g. Cloudflare bot management · Custom ML anomaly detectors · Lakera Guard runtime
D3.5.1 Volume anomalies
Sudden traffic spikes or drops. Industry standard: Alarm-triggered. Could indicate abuse or service issue.
D3.5.2 Content anomalies
Unusual content patterns (jailbreak attempts, abuse). Industry standard: Detection feeds B4 red-team and E2 security.
D4

Deployment

12 sub-endpoints mapped
MZN Provisional Position · Partial
Phase 1 application/platform deployment experience across Mazzaneh modules
Phase 1 included live deployment, A/B testing, and rollback procedures across 22 commerce modules. Canary methodology documented. Frontier-scale model versioning, rollout staging, and adjudication are Phase 3 scope.
Phase context: D4 references Phase 1 application deployment experience. It should not be read as frontier-scale LLM deployment validation.
Definition

Deployment: releasing model versions to production. Rollout strategy, A/B testing, rollback procedures, version management, pre-deployment gating. Distinct from D1 serving (which is the runtime). D4 is the release process.

State of the Art (2025–2026)

Frontier labs use canary deployments (1% → 10% → 100% over hours/days). A/B test new vs current via held-out user cohorts. Automatic rollback on quality regression triggers. Pre-deployment gates: safety eval, capability eval, internal review.

Key Decisions
  • Rollout cadence
  • A/B test cohort size
  • Rollback triggers
  • Pre-release gates
Numbers & Ablations
  • Canary deployment cadence: 1% → 10% → 50% → 100% over 24-72 hours typical.
  • A/B test cohort: 5-50% holdout. Statistical power for subjective quality requires 1-2 weeks at frontier traffic levels.
  • Rollback time-to-recover target: <5 minutes for automated, <30 minutes for human-judged.
  • Model versioning: frontier labs maintain 6-12 month deprecation horizon. Specific snapshots (frontier model-3-5-sonnet-20240620) remain available indefinitely or until major reorganization.
  • Pre-deployment gating: leading frontier labs run full eval suite (capability + safety + Responsible Scaling Policy framework/PF tier check) before any production rollout. Process duration: days to weeks for major releases.
Open Questions
  • Quality regression detection in A/B: subjective quality is high-variance. Power analysis often insufficient.
  • Model spec drift: as specs evolve, deployed model's spec adherence drifts. When do you re-train vs fine-tune vs just update?
  • Multi-version cohabitation: does running 3+ generations of model in production degrade signal in monitoring?
  • Forced upgrades: when API customers depend on specific behavior, version deprecation breaks them. Industry has no clean answer.

Reference analyst note. Deployment discipline is genuinely better than 5 years ago — frontier labs run staged rollouts, have rollback procedures, conduct A/B tests. But quality regression detection remains the soft underbelly. A model that's 5% worse on subjective metrics will pass safety / capability / SLO gates and ship. We're learning about quality regressions from arena ranking changes weeks after deployment. Better quality regression infrastructure is high-leverage but underinvested.

Examples

A leading frontier lab gradual rollouts · a constitutional-methods frontier lab canary deployment · Standard SRE release practices

Sub-endpoint anatomy — 12 items mapped
D4.1 Rollout Strategy
Rollout strategy. Canary (small initial cohort) → ramped (gradual % increase) → full. Geographic phased rollout (some regions first). Watch key metrics during ramp. SOTA: Standard SRE practice. LLM-specific concerns: subjective quality regressions hard to detect quickly. Conservative ramp-up (24-72 hours) for major releases. e.g. a constitutional-methods frontier lab a leading frontier model release pattern · a leading frontier lab a consumer LLM chat product updates · Standard canary deployment
D4.1.1 Canary deployment
Small percentage of traffic to new version, monitored. Industry standard: Standard. 1% → 5% → 25% → 100% with monitoring at each step.
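The stepwise ramp with monitoring at each step can be sketched as a small control loop. The `healthy_at` callback is a hypothetical hook standing in for whatever metric checks a deployment runs at each traffic level; the ramp values mirror the steps listed above.

```python
from typing import Callable

RAMP = (1, 5, 25, 100)  # percent of traffic, mirroring the steps listed above

def run_canary(healthy_at: Callable[[int], bool],
               ramp: tuple[int, ...] = RAMP) -> tuple[int, bool]:
    """Advance through the ramp; fall back to 0% on the first failed check.

    Returns (final traffic percent, whether the rollout completed).
    """
    for percent in ramp:
        if not healthy_at(percent):
            return 0, False        # rolled back
    return ramp[-1], True
```

A real controller would also hold at each step for a soak period before evaluating health, which this sketch leaves to the callback.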
D4.1.2 Blue-green
Two parallel environments; switch traffic atomically. Industry standard: Used for major version transitions. Higher infra cost but instant rollback.
D4.1.3 Shadow deployment
New version receives copy of traffic but responses not returned to users. Industry standard: Used for performance context. No user-facing risk.
D4.2 A/B Testing
A/B testing. New model vs current: held-out user cohort sees new model, rest sees current. Compare metrics (engagement, satisfaction, quality eval). Statistical power requirements. SOTA: Cohort 5-50% of production. Duration 1-2 weeks for power. Multiple primary metrics (avoid metric-shopping). LLM-specific: high variance per-query makes A/B harder than typical software. e.g. A leading frontier lab A/B testing model versions · Standard product A/B platforms
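The statistical power requirement can be made concrete with the standard normal-approximation sample size for comparing two proportions, for example a thumbs-up rate in each cohort. The baseline and treatment rates in the usage note are illustrative assumptions, and the formula ignores per-user clustering, which usually raises the required sample further.

```python
import math
from statistics import NormalDist

def samples_per_arm(p_control: float, p_treatment: float,
                    alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-arm sample size for a two-sided two-proportion z-test."""
    if p_control == p_treatment:
        raise ValueError("effect size must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_beta = NormalDist().inv_cdf(power)            # desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    n = (z_alpha + z_beta) ** 2 * variance / (p_control - p_treatment) ** 2
    return math.ceil(n)
```

Detecting a one-point change from an assumed 60% satisfaction rate needs tens of thousands of rated interactions per arm under this formula, while a five-point change needs far fewer. Since explicit ratings are sparse, this is one reason test durations stretch to weeks.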
D4.3 Rollback
Rollback. Predefined triggers: quality regression > X%, safety incident, latency p99 spike, error rate spike. Automated rollback within minutes possible. Manual rollback for subjective quality. SOTA: Frontier labs have well-rehearsed rollback procedures. Time-to-rollback target: <5 minutes for automated, <30 for human-judged. Pre-staged previous version remains available throughout new rollout. e.g. Industry standard with LLM-specific triggers
D4.3.1 Automated rollback triggers
Metric thresholds that trigger automatic reversion. Industry standard: Latency, error rate, quality regression thresholds.
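A minimal sketch of threshold-based rollback triggers follows. The `Thresholds` defaults are illustrative assumptions (real limits are deployment-specific), and the metric names in the dictionary are hypothetical keys, not a monitoring system's schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Thresholds:
    # Illustrative limits; real values are deployment-specific.
    max_error_rate: float = 0.001        # 0.1%
    max_p99_latency_s: float = 2.0
    max_quality_drop: float = 0.02       # relative drop vs baseline judge score

def rollback_reasons(metrics: dict[str, float], baseline_quality: float,
                     limits: Thresholds = Thresholds()) -> list[str]:
    """Return the triggers that fired; any non-empty result means roll back."""
    reasons = []
    if metrics["error_rate"] > limits.max_error_rate:
        reasons.append("error_rate")
    if metrics["p99_latency_s"] > limits.max_p99_latency_s:
        reasons.append("p99_latency")
    drop = (baseline_quality - metrics["quality"]) / baseline_quality
    if drop > limits.max_quality_drop:
        reasons.append("quality_regression")
    return reasons
```

Returning the list of fired triggers, rather than a bare boolean, gives the on-call operator the reason for the reversion in the same signal.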
D4.3.2 Manual rollback procedure
Operator-initiated reversion. Industry standard: Documented runbook, on-call rotation.
D4.4 Versioning
Versioning. Model versions tracked: v1.0, v1.1 (minor improvements), v2.0 (major). API exposes specific versions. Customers can pin or auto-upgrade. Deprecation policy (typically 6-12 months). SOTA: A leading frontier lab and a constitutional-methods frontier lab both version explicitly (frontier model-3-5-sonnet-20240620 style). API parameter selects version. Deprecation announced 6-12 months ahead. Some labs provide model snapshots indefinitely. e.g. a constitutional-methods frontier lab's model versioning · a leading frontier lab's model versioning · Specific snapshot pinning
D4.4.1 API model identifiers
Stable IDs for model versions. Industry standard: A leading frontier lab 'current-generation frontier model-0613' style. A constitutional-methods frontier lab 'frontier model-3-5-sonnet-20241022' style. Customers pin specific versions for stability.
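Pinning versus auto-upgrade can be illustrated with a small resolver. Everything here is assumed for the example: the `<family>-<YYYYMMDD>` shape mirrors the dated-suffix style mentioned above, and `example-model`, `parse_snapshot`, and `resolve` are hypothetical names, not any provider's API.

```python
import re
from datetime import date, datetime
from typing import Optional, Sequence, Tuple

# Hypothetical ID shape: "<family>-<YYYYMMDD>", e.g. "example-model-20250101".
SNAPSHOT_RE = re.compile(r"^(?P<family>[a-z0-9.-]+)-(?P<stamp>\d{8})$")


def parse_snapshot(model_id: str) -> Optional[Tuple[str, date]]:
    """Return (family, snapshot date) for a pinned ID, or None for an alias."""
    match = SNAPSHOT_RE.match(model_id)
    if match is None:
        return None
    try:
        stamp = datetime.strptime(match.group("stamp"), "%Y%m%d").date()
    except ValueError:  # eight digits that are not a real calendar date
        return None
    return match.group("family"), stamp


def resolve(model_id: str, snapshots: Sequence[str]) -> str:
    """Pinned IDs pass through unchanged; a bare family alias auto-upgrades."""
    if parse_snapshot(model_id) is not None:
        return model_id
    parsed = [(parse_snapshot(s), s) for s in snapshots]
    candidates = [(p[1], s) for p, s in parsed if p is not None and p[0] == model_id]
    if not candidates:
        raise KeyError(f"no snapshot known for {model_id!r}")
    return max(candidates)[1]  # newest snapshot date wins
```

A caller that passes the dated ID keeps the same behavior until that snapshot is deprecated, while a caller that passes only the family name moves to each new snapshot as it is registered.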
D4.4.2 Deprecation policy
Lifecycle for old model versions. Industry standard: Frontier labs publish deprecation timelines (typically 6-12 months).
D4.5 Pre-deployment Gating
Pre-deployment gating. Gates that must pass before release: capability evals (no regression on key benchmarks), safety evals (refusal calibration, harm tests), internal review board approval, Responsible Scaling Policy / Preparedness Framework threshold check, security review. SOTA: AI Safety Level X gating (constitutional-methods framework): a capability eval determines the tier, and that tier's deployment standards must be met. A leading frontier lab's Preparedness scorecard works similarly. Multi-stakeholder approval for major releases. e.g. a Responsible Scaling Policy framework gating · a leading frontier lab Preparedness review · a Frontier-Safety-style framework gating
E1

Data Governance

17 sub-endpoints mapped
MZN Provisional Position · Partial
Consent-first data governance baked into platform design
Every signal capture in the production platform required explicit, granular consent — not a retrofit but a foundational design constraint. Data lineage and retention policy implemented operationally. Cryptographic anchoring methodology documented at the protocol level under separate filing (12 patent claims, March 2026).
Definition

Data governance: lifecycle controls over data assets. Lineage (where data came from), access control (who can read what), retention (how long), deletion (data subject rights), provenance (cryptographic proof of source), customer data boundaries (no train on enterprise data).

State of the Art (2025–2026)

Frontier labs: hearing-grade data governance for compliance. Customer data: zero-data-retention default for enterprise APIs. Lineage tracked end-to-end (source → corpus → model). Audit logs immutable.

Key Decisions
  • Default retention
  • Train-on-data policy
  • Lineage granularity
  • Provenance scheme
Numbers & Ablations
  • Frontier customer data retention: 0 days (zero-data-retention enterprise tier) to 30 days (consumer with opt-out) standard.
  • EU AI Act Article 53: GPAI providers must publish 'sufficiently detailed summary' of training content. Compliance approach varies; what counts as 'sufficient' undefined.
  • Data lineage tracking: frontier labs maintain end-to-end lineage from source URLs through transformations. Implementation custom; no industry standard.
  • C2PA adoption: deployed at Adobe, a leading frontier lab, a synthetic-data-focused lab, Sony, Nikon, BBC — but enforcement at platforms (social media, search) absent.
  • Right-to-deletion compliance (GDPR Article 17): typical SLA 30 days, technical complexity high for training-data deletion (requires retraining or unlearning).
Open Questions
  • Machine unlearning: how do you actually delete data from a trained model? Active research; no production-ready solution. Unlearning literature reports inconsistent outcomes.
  • Training data summary specificity: EU AI Act Article 53 'sufficiently detailed' is undefined. Frontier labs publishing high-level summaries; regulators may demand more.
  • Provenance enforcement: if no platform requires C2PA, does it matter that creators add it? Coordination problem.
  • Cross-border data flows: EU adequacy decisions, US executive orders, China data localization create geopolitically fragmented governance regime.

Reference analyst note. Data governance is the frontier compliance bottleneck. The naive view ('we don't train on customer data') is insufficient — EU AI Act, copyright lawsuits (NYT v. A leading frontier lab), and emerging unlearning requirements force much deeper governance. Frontier labs that don't have hearing-grade data lineage today will spend 2025-2026 building it. The model card / system card transparency standard set by a constitutional-methods frontier lab is becoming default expectation.

Examples

a constitutional-methods frontier lab enterprise zero-data-retention · a leading frontier lab Enterprise no-train default · a hyperscaler platform Bedrock isolation

Sub-endpoint anatomy — 17 items mapped
E1.1 Data Lineage
Data lineage. Track data from source through transformations to use in training. Required for compliance (EU AI Act traceability), reproducibility, debugging quality issues. Tools: dataset versioning (DVC), metadata catalogs. SOTA: Frontier labs use custom metadata systems tying every dataset version to its sources, transformations, and consumers. EU AI Act Article 53 requires GPAI providers to publish a sufficiently detailed summary of training content. Standards: OpenLineage is emerging. e.g. DVC for datasets · OpenLineage standard · Custom catalogs at frontier
E1.1.1 Source-to-model lineage
Which sources contributed to which model. Industry standard: Internal at frontier labs. Granularity varies — shard-level common, token-level rare.
E1.1.2 Transformation tracking
Filters, dedup, weighting applied per shard. Industry standard: Pipeline metadata stored alongside data.
E1.1.3 Cryptographic anchoring
Tamper-evident lineage. Industry standard: Not yet standard. Proposed for high-stakes compliance.
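One generic way to make lineage tamper-evident is a hash chain, where each record's digest covers the previous record's digest. The sketch below shows that general technique only; it is not the protocol referenced in the MZN position note above, and the record fields and function names are hypothetical.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder digest that anchors the first record


def _digest(prev_hash: str, record: Dict) -> str:
    body = json.dumps(record, sort_keys=True)  # canonical field order
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_record(chain: List[Dict], record: Dict) -> None:
    """Link a lineage record to the digest of the record before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"record": record, "prev": prev_hash,
                  "hash": _digest(prev_hash, record)})


def verify(chain: List[Dict]) -> bool:
    """Recompute every link; any edited record breaks the chain from there on."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Publishing or timestamping only the latest digest is enough to commit to the whole history, since changing any earlier record changes every digest after it.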
E1.2 Access Control
Access control. Who can access what data? Role-based access. Need-to-know principle. Separate environments (training, eval, production). Audit logging of access. SOTA: Frontier labs: principle-of-least-privilege, multi-party authorization for sensitive data, just-in-time access provisioning. UEBA (User and Entity Behavior Analytics) on access patterns. Audit logging immutable. e.g. Standard enterprise IAM (Okta, a hyperscaler platform IAM) · Custom internal access tiers · Frontier multi-party auth
E1.2.1 Role-based access (RBAC)
Permissions tied to roles. Industry standard: Standard. Applied to corpus, eval data, customer data separately.
E1.2.2 Audit logs
Records of who accessed what when. Industry standard: Required for compliance. SIEM integration common.
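Role-based checks and audit logging are usually wired together so that every decision, allowed or denied, leaves a record. A minimal sketch, assuming three hypothetical roles mapped to the data classes named above; a Python list stands in for what would be an append-only, immutable store.

```python
from datetime import datetime, timezone

# Hypothetical role grants, kept separate per data class as described above.
ROLE_GRANTS = {
    "trainer": {"corpus"},
    "evaluator": {"eval_data"},
    "support": {"customer_data"},
}

audit_log = []  # stand-in for an append-only store


def can_read(user: str, role: str, data_class: str) -> bool:
    """Check the role's grants and record the decision, allowed or not."""
    allowed = data_class in ROLE_GRANTS.get(role, set())
    audit_log.append({
        "who": user,
        "role": role,
        "what": data_class,
        "when": datetime.now(timezone.utc).isoformat(),
        "allowed": allowed,
    })
    return allowed
```

Logging denials as well as grants matters for the behavioral analytics mentioned above, since repeated denied requests are themselves a signal.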
E1.3 Retention & Deletion
Retention and deletion. How long is data kept? When is it deleted? Right-to-deletion (GDPR Article 17). Customer data: enterprise zero-retention default. Training data: indefinite retention but with opt-out paths. SOTA: Leading frontier labs offer enterprise zero-retention. Consumer products typically keep data for 30 days, with opt-out for training. GDPR-compliant deletion processes for EU users. Machine unlearning remains research-stage. e.g. a constitutional-methods frontier lab ZDR for enterprise · a leading frontier lab 30-day default + opt-out · GDPR Article 17 implementations
E1.3.1 Retention policy
Per-data-class retention duration. Industry standard: Tiered: pre-training corpus longer, user-data shorter (often 30 days for API).
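A tiered policy like this reduces to a lookup from data class to retention window. The class names below are hypothetical, and the windows are illustrative readings of the figures quoted in this section (30 days for API user data, zero retention for the enterprise tier, no fixed limit for the pre-training corpus).

```python
from datetime import datetime, timedelta, timezone

# Illustrative windows; None means no fixed expiry.
RETENTION = {
    "api_user_data": timedelta(days=30),
    "enterprise_zdr": timedelta(0),
    "pretraining_corpus": None,
}


def is_expired(data_class: str, stored_at: datetime, now: datetime) -> bool:
    """True once a record has outlived its class's retention window."""
    limit = RETENTION[data_class]
    if limit is None:
        return False
    return now - stored_at >= limit
```

A deletion job would sweep stored records with this predicate; under the zero-retention tier every record is expired the moment it is written, which in practice means never persisting it.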
E1.3.2 Right to erasure (GDPR/CCPA)
User-requested deletion. Industry standard: Required by law. Distinct challenge for data already used in training (cannot easily 'unlearn').
E1.3.3 Machine unlearning
Removing data influence from already-trained models. Industry standard: Active research. No standard at scale yet.
E1.4 Provenance & Watermarking
Provenance and watermarking. Cryptographic proof of content origin. C2PA standard for AI-generated content. A generative-content watermarking system covers text/image watermarking. Critical for misinformation defense and training-corpus contamination prevention. SOTA: C2PA adopted by Adobe, a leading frontier lab, a synthetic-data-focused lab, Sony, Nikon, BBC. A generative-content watermarking system for text (a multimodal frontier lab, 2024) is production-deployed. Open: MarkLLM library. Trade-off: watermarks are removable via paraphrasing. e.g. C2PA standard · a multimodal frontier lab's generative-content watermarking system · a leading frontier lab image watermarking
E1.4.1 Output watermarking
Statistical signal embedded in generated text. Industry standard: Kirchenbauer 2023 reference. a generative-content watermarking system (a multimodal frontier lab). Limited adoption.
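The detection side of a green-list watermark in the style of Kirchenbauer 2023 can be sketched in a few lines. This is a simplified toy, not the published implementation: tokens are plain integers, green-list membership is derived from a SHA-256 hash of the (previous token, token) pair instead of a seeded vocabulary permutation, and `is_green` and `z_score` are hypothetical names.

```python
import hashlib
import math
from typing import Sequence


def is_green(prev_token: int, token: int, gamma: float = 0.5) -> bool:
    """Pseudo-randomly place `token` on the green list keyed by the previous token."""
    digest = hashlib.sha256(f"{prev_token}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma


def z_score(tokens: Sequence[int], gamma: float = 0.5) -> float:
    """Standard deviations by which the green count exceeds its chance expectation."""
    transitions = len(tokens) - 1
    if transitions < 1:
        raise ValueError("need at least two tokens")
    green = sum(is_green(a, b, gamma) for a, b in zip(tokens, tokens[1:]))
    return (green - gamma * transitions) / math.sqrt(transitions * gamma * (1 - gamma))
```

Unwatermarked text should score near zero, while a generator that favors green tokens pushes the score up with length. Paraphrasing replaces tokens and dilutes the green count, which is the removal trade-off noted above.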
E1.4.2 C2PA / content credentials
Cryptographic content provenance standard. Industry standard: Adoption growing for image/video. Less applicable to text.
E1.5 Customer Data Boundaries
Customer data boundaries. Enterprise customer data must NOT be used for training. Geographic data residency. Sectoral isolation (HIPAA, FERPA, financial). Cross-customer isolation in multi-tenant. SOTA: Frontier API providers offer no-train default for enterprise. Geographic data residency available. Confidential compute (TEE) emerging. HIPAA BAA, FedRAMP for sectoral. e.g. a constitutional-methods frontier lab a leading frontier model enterprise · a leading frontier lab Enterprise · a hyperscaler platform Bedrock isolation
E1.5.1 Training opt-out
Customer data not used for training. Industry standard: Default for enterprise/API at a leading frontier lab, a constitutional-methods frontier lab. Consumer products often train by default and offer an opt-out.
E1.5.2 Zero data retention (ZDR)
Customer data not stored at all. Industry standard: Available for enterprise tier at major labs.
E2

Security

27 sub-endpoints mapped
MZN Provisional Position · Strong Evidence
Multi-tier security architecture documented; adversarial validation pending across several protocol families
A multi-tier security portfolio spans defensive, offensive-research, and methodology categories. Architecture-level innovations addressing LLM-specific security concerns are documented. Detailed protocol disclosures, specific findings, and complete inventory are reserved for the proprietary asset portfolio.
Definition

Security: end-to-end security posture. Categories: prompt injection defense, data exfiltration prevention, model theft protection, training-data poisoning defense, supply chain security, jailbreak resistance, agentic security, security monitoring.

State of the Art (2025–2026)

A constitutional-methods frontier lab's third AI Safety Level security standard: protect weights against non-state-actor theft. Multi-layer defenses across categories. NIST AI RMF, ISO/IEC 42001 for governance frameworks. EU AI Act security requirements for high-risk systems.

Key Decisions
  • AI Safety Level/security tier targeted
  • TEE adoption
  • Supply chain controls
  • Pen testing cadence
Numbers & Ablations
  • a constitutional-methods frontier lab third AI Safety Level security commitment: defend against non-state-actor weight theft. Implementation includes HSM, TEE, multi-party access auth, audit logging.
  • Model weight value: frontier weights $100M-$1B+ replacement cost (compute alone). Theft prevention is high-priority.
  • Prompt injection success rate: ~30-50% on agent applications without specific defense, ~5-15% with instruction hierarchy training (Wallace 2024).
  • Jailbreak persistence: optimization-based adversarial attacks succeed ~20-40% of the time on frontier 2024 models, down from ~80% on early generations, but the problem remains unsolved.
  • Dependencies: frontier model SBOM lists 1000+ packages. Supply chain attack surface is real (e.g., an open-model hub package supply chain attacks 2023-2024).
Open Questions
  • Indirect prompt injection: solvable in current architecture or requires fundamental redesign? Pessimistic camp ascendant.
  • Confidential compute (a leading accelerator vendor CC, a hyperscaler platform Nitro) for inference: production-ready or theatre? Performance overhead poorly characterized publicly.
  • Adversarial robustness vs security: how much overlap, how much divergence? Often confused; should be distinguished.
  • Weight extraction via API: distillation attacks demonstrated at small scale. Production-scale defense unclear.

Reference analyst note. Security for LLMs is in a state similar to web security circa 2008: patterns visible but practices immature. The frontier 2026 security stance: assume weights will eventually leak (insider, breach, gradual extraction); design for graceful degradation. The constitutional-methods frontier lab's third AI Safety Level framing (resist non-state actors) is appropriately calibrated; the fourth AI Safety Level (resist state actors) is the next frontier and unsolved. Agent security is the unsolved problem of the next 2 years; current 'defenses' are mostly hopeful patterns, not robust controls.

Examples

a constitutional-methods frontier lab third AI Safety Level commitments (public) · a leading frontier lab security posture · NIST AI RMF as framework

Sub-endpoint anatomy — 27 items mapped
E2.1 Prompt Injection
Prompt injection. Malicious instructions hidden in untrusted content (web pages, retrieved docs, tool outputs) treated as authoritative by model. Critical for agent applications — fundamentally unsolved. SOTA: Defense layers: instruction hierarchy training (Wallace 2024), output filtering, tool sandboxing, user confirmation gates. No clean technical solution. Active research. Indirect injection most dangerous (hidden in retrieved docs). e.g. Wallace et al. Instruction Hierarchy (2024) · Greshake et al. Indirect Prompt Injection (2023)
E2.1.1 Direct prompt injection
User directly types injection. Industry standard: First-generation jailbreak attack. Mitigated through alignment training and system prompt hardening.
E2.1.2 Indirect prompt injection
Injection via fetched content (web pages, documents, emails). Industry standard: Greshake 2023. Major risk for agentic systems. Mitigation via instruction hierarchy, content/instruction separation.
E2.1.3 Multi-modal injection
Injection via image, audio, or other non-text inputs. Industry standard: Visual prompt injection demonstrated. Active research mitigation.
E2.1.4 Defense — instruction hierarchy
A leading frontier lab's instruction hierarchy: system > developer > user > tool output. Industry standard: Wallace 2024. Designed to make tool/document content lower priority than user instructions.
E2.2 Data Exfiltration
Data exfiltration. Model leaks sensitive data via outputs. Vectors: (1) verbatim leakage of training data, (2) disclosure of prompt context to a user following an injection, (3) tool exfiltration (model calls a tool that sends data to an attacker). SOTA: Training-data extraction is a known issue (Carlini 2021+). Frontier mitigations: deduplication, RLHF on memorization. Tool exfiltration via prompt injection: defense via tool sandboxing, output filtering, approval gates. e.g. Carlini et al. Extracting Training Data (2021, 2023) · a consumer LLM chat product prompt extraction demos
E2.2.1 Training data extraction
Extracting memorized training data from model. Industry standard: Carlini 2021 demonstrated. Nasr 2023 showed scale. Mitigation: dedup (A1.3), differential privacy.
E2.2.2 Cross-tenant leakage (KV cache)
Shared KV cache leaks across tenants. Industry standard: Mitigated by per-tenant cache isolation. Recent research showed timing attacks possible on shared prefix cache.
E2.2.3 Agentic exfiltration
Agent tricked into sending data to attacker endpoint. Industry standard: Major risk for tool-using agents. Mitigation: capability gating, egress allow-lists, human approval for sensitive actions.
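An egress allow-list check is one of the simpler mitigations to sketch. The host names below are placeholders, and the HTTPS-only rule is an assumption added for the example; the point is that the decision is made on the parsed hostname, not on a substring of the URL.

```python
from urllib.parse import urlsplit

# Placeholder hosts an agent's tools are permitted to contact.
ALLOWED_HOSTS = {"api.internal.example", "docs.example.com"}


def egress_allowed(url: str) -> bool:
    """Permit a tool call only to an exact allow-listed host over HTTPS."""
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname in ALLOWED_HOSTS
```

Matching on `hostname` defeats lookalike URLs such as `https://docs.example.com.evil.test/` or `https://docs.example.com@evil.test/`, which a naive `"docs.example.com" in url` test would accept.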
E2.3 Model Theft
Model theft. Weights are nation-state-level targets — frontier weights worth billions. Vectors: insider threat, infrastructure breach, weight extraction via API (model inversion / distillation attacks). SOTA: third AI Safety Level security commitment: defend against non-state actor theft. A constitutional-methods frontier lab public security commitments. HSM-protected keys, TEE-protected weights, multi-party authorization. Distillation attack defense via output rate limiting + watermarking. e.g. a constitutional-methods frontier lab third AI Safety Level measures (public) · a hyperscaler platform Nitro Enclaves · a leading accelerator vendor Confidential Computing
E2.3.1 Weight extraction (insider)
Insider exfiltration of weights. Industry standard: Major frontier-lab concern. Mitigation: hardware enclaves, multi-party access controls, monitored egress.
E2.3.2 Model distillation attack
Training competing model on outputs of target API. Industry standard: Universal but hard to prevent. Terms of service prohibition. Watermarking proposed but not deployed at scale.
E2.4 Training-Data Poisoning
Training-data poisoning. Attacker injects malicious content into web crawl, hoping model learns backdoors or biases. Hard to defend at scale: corpus is too big to manually verify. SOTA: Active research; few proven defenses. Quality classifiers filter obviously low-quality content. Provenance tracking helps. Major label-flipping attacks demonstrated; full corpus poisoning not yet shown viable but remains a theoretical risk. e.g. Carlini et al. poisoning research · Active research at frontier labs
E2.4.1 Web-scale poisoning
Attacker plants content that will be crawled. Industry standard: Carlini 2024 showed feasibility of poisoning web crawl. Mitigation: provenance tracking (E1.1), filter resilience.
E2.4.2 Backdoor attacks
Triggered behavior implanted via poisoned data. Industry standard: Hubinger 2024 showed backdoors can survive safety training.
E2.5 Supply Chain
Supply chain security. Dependencies, container images, training frameworks. SBOM (software bill of materials), Sigstore for signed artifacts. Major vulnerability vector if compromised. SOTA: EO 14028 mandated SBOM for federal. Sigstore for signing. Frontier labs maintain dependency monitoring. NIST SSDF compliance. e.g. Sigstore project · SBOM mandates · Snyk, Dependabot
E2.5.1 Dependency security
Open-source dependencies in training/serving stack. Industry standard: SBOM generation, dependency scanning. Standard software security practice applied to ML stack.
E2.5.2 Pre-trained model supply chain
Risks of using third-party base models. Industry standard: Concern for fine-tuned products. Hash verification of weights, documented training process.
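Hash verification of downloaded weights needs nothing beyond the standard library. In this sketch the expected digest is assumed to come from a trusted channel such as a signed release manifest, and the function names are hypothetical.

```python
import hashlib
import hmac


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so multi-gigabyte weights never sit fully in memory."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify_weights(path: str, expected_hex: str) -> bool:
    """Compare against the publisher's digest; refuse to load on mismatch."""
    return hmac.compare_digest(sha256_file(path), expected_hex.lower())
```

A matching hash shows only that the file is the one the publisher hashed; it says nothing about how the model was trained, which is why the entry above pairs it with a documented training process.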
E2.6 Jailbreak Resistance
Jailbreak resistance. Model adheres to safety guidelines under adversarial input. Methods: training-time RLHF on jailbreak attempts, output-side classifiers (an open-weights output classifier, Constitutional Classifiers), instruction hierarchy. SOTA: Constitutional Classifiers (a constitutional-methods frontier lab 2025) production-deployed for jailbreak defense. An open-weights output classifier family standard open guardrails. No model fully jailbreak-proof; arms race continues. e.g. Constitutional Classifiers (a constitutional-methods frontier lab 2025) · a recent-generation output classifier · a leading frontier lab Moderations + instruction hierarchy
E2.7 Agentic Security
Agentic security. New category: model with tool access, code execution, computer control. Risks: prompt injection escalating to action, tool misuse, autonomous escalation. Critical for computer-use models. SOTA: A constitutional-methods frontier lab's Computer Use (a long-context frontier model and later): sandboxed VM isolation, output filtering on actions, user confirmation gates. A leading frontier lab's Operator takes a similar approach. Active research category: agentic capabilities outrun agentic safety understanding. e.g. a constitutional-methods frontier lab Computer Use sandbox · a leading frontier lab Operator · Active research at AISIs
E2.7.1 Capability gating
Restricting which actions agent can take. Industry standard: Capability tokens, allow-lists, scoped permissions. Required for production agents.
E2.7.2 Human-in-the-loop
Requiring user approval for high-stakes actions. Industry standard: Standard pattern. Sensitive actions (payments, deletions, external sends) require explicit approval.
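Capability gating and human approval compose naturally into a single authorization step. A minimal sketch, with hypothetical action names taken from the sensitive categories listed above and an `approve` callback standing in for whatever confirmation UI a product uses:

```python
from typing import Callable, Set

# Action categories the text singles out as needing explicit approval.
SENSITIVE_ACTIONS = {"payment", "deletion", "external_send"}


def authorize(action: str, granted: Set[str],
              approve: Callable[[str], bool]) -> str:
    """Return 'denied', 'rejected', or 'allowed' for a requested agent action."""
    if action not in granted:
        return "denied"       # capability gate: outside the agent's scoped permissions
    if action in SENSITIVE_ACTIONS and not approve(action):
        return "rejected"     # human-in-the-loop: the user declined
    return "allowed"
```

Checking the capability gate first means a user is never prompted to approve something the agent was not permitted to do in the first place.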
E2.7.3 Sandboxing
Isolated execution of agent actions. Industry standard: Container/VM sandboxing for code execution. Network egress restrictions.
E2.8 Security Monitoring & Response
Security monitoring and incident response. SIEM integration, anomaly detection, incident playbooks. Frontier labs: 24/7 SOC, regular tabletop exercises. SOTA: Standard enterprise security ops + LLM-specific anomaly detection. SIEMs: Splunk, Elastic. Incident response playbooks for AI-specific incidents (model regression, capability surprise, security incident). e.g. Standard enterprise SIEM · Frontier-lab SOCs
E2.8.1 Abuse detection
Detecting malicious usage patterns. Industry standard: Behavioral analysis on traffic. Rate limiting, account-level flags.
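Rate limiting is commonly implemented as a token bucket. The sketch below is one standard formulation, not any lab's implementation; the clock is passed in explicitly to keep the example deterministic, and the rate and capacity values in the usage note are arbitrary.

```python
class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full
        self.last = 0.0         # timestamp of the previous call, in seconds

    def allow(self, now: float) -> bool:
        # Credit tokens for the time elapsed, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With `TokenBucket(rate=1, capacity=2)`, two back-to-back requests pass, a third at the same instant is refused, and one more is admitted a second later. Account-level flags would typically key one bucket per account or API key.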
E2.8.2 Incident response
Procedures when breach detected. Industry standard: On-call rotation, runbooks, customer notification protocols, regulator coordination.
E2.8.3 Vulnerability disclosure
Bug bounty and responsible disclosure programs. Industry standard: Leading frontier labs and a multimodal frontier lab all run bounty programs. Coordinated disclosure norms emerging.
E3

Privacy

15 sub-endpoints mapped
MZN Provisional Position · Partial
Consent-first privacy posture; Phase 3 privacy/compliance review required
PII handling is structurally consent-first rather than retrofitted filtering. Object-first discipline, reuse separation, and export maturity are documented. Differential privacy and machine unlearning execution at frontier scale are Phase 3 scope.
Definition

Privacy: protection of personal information. Categories: PII handling, differential privacy, membership inference defense, regulatory compliance, inference-time privacy.

State of the Art (2025–2026)

Frontier labs: comprehensive PII handling, GDPR/CCPA compliance, optional zero-data-retention. Differential privacy still rare at scale (DP-SGD too expensive for frontier training). Membership inference defenses via training-data deduplication.

Key Decisions
  • DP yes/no
  • PII redaction strategy
  • Inference-time privacy guarantees
  • Regulatory commitments
Numbers & Ablations
  • GDPR enforcement intensity: cumulative fines >€4B since 2018; AI-specific cases growing.
  • EU AI Act timeline: entered force Aug 2024, prohibited practices Feb 2025, GPAI Aug 2025, high-risk Aug 2026, all provisions Aug 2027.
  • Differential privacy at scale: not deployed at frontier training. a confidential-computing platform uses DP for inference-time analytics (limited scope).
  • Membership inference attack success: ~55-65% on frontier models (small advantage over 50% random) per Carlini and others. Heavy deduplication helps.
  • Privacy compliance certifications: SOC 2 Type II, ISO 27001 baseline. ISO/IEC 42001 (AI management) emerging. FedRAMP for federal.
Open Questions
  • DP-SGD at frontier scale: too expensive (~5× compute overhead) currently. Does algorithmic improvement make it tractable by 2027?
  • Membership inference defense: effective dedup helps, but theoretical worst-case bounds remain. Practical risk assessment unclear.
  • Cross-border privacy regime: US-EU adequacy fragile, China PIPL strict, India DPDP emerging. Global compliance becomes per-jurisdiction.
  • Inference-time privacy: TEE-based confidential inference deployed at a confidential-computing frontier lab; broader adoption depends on hardware availability and customer demand.

Reference analyst note. Privacy compliance is becoming a serious cost center. Frontier labs that haven't invested in privacy infrastructure (hearing-grade data governance, deletion processes, sectoral certifications) will face compounding regulatory costs 2025-2027. The technically-interesting frontier is private inference (TEE, private cloud compute, eventually homomorphic) — a confidential-computing frontier lab's deployment shows production viability. Differential privacy at training remains aspirational at frontier scale.

Examples

a constitutional-methods frontier lab enterprise privacy · a confidential-computing frontier lab's Private Cloud Compute (DP + TEE)

Sub-endpoint anatomy — 15 items mapped
E3.1 PII Handling
PII handling. Detect and handle Personally Identifiable Information in inputs and outputs. A synthetic-data-focused lab's Presidio (open source) is a standard PII engine. Custom recognizers cover domain-specific entities (medical, legal). Required for GDPR, HIPAA. SOTA: Presidio + custom recognizers. NER + LLM verification for higher accuracy. Aggressive redaction degrades utility; calibration needed. Trade-off: redact in inputs vs outputs. e.g. a synthetic-data-focused lab's Presidio (open) · a hyperscaler platform Comprehend PII · GCP DLP
E3.1.1 PII detection in training data
Identifying PII before training. Industry standard: Regex + NER models. Imperfect; some PII inevitably reaches training.
E3.1.2 PII filtering / redaction
Removing or masking PII. Industry standard: Standard for sensitive corpora. Trade-off: aggressive filtering loses quality data.
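The regex half of a detect-and-redact pass can be sketched as below. The two patterns are deliberately simple illustrations (email addresses and a US-style phone format), chosen for readability over coverage; a production system would add NER and many more recognizers, as the entries above note.

```python
import re

# Illustrative patterns only; real recognizers are broader and locale-aware.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace each detected span with a typed placeholder such as <EMAIL>."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text
```

Typed placeholders keep some utility in the redacted text (a model can still learn that an email address belongs in that position), which softens the quality trade-off of aggressive filtering.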
E3.1.3 Output-stage PII detection
Detecting PII in model outputs. Industry standard: Last-mile filter. Cross-link to C4.3.
E3.2 Differential Privacy
Differential privacy. Mathematical privacy guarantee: no individual training example significantly affects model output. DP-SGD trains with noise + clipping. Costly; rarely used at frontier training scale. SOTA: DP-SGD demonstrated at smaller scale (~7B params). Frontier (100B+) DP-SGD not yet practical. a confidential-computing platform uses DP for inference-time analytics. Federated DP emerging. e.g. a confidential-computing platform (DP) · a multimodal frontier lab federated DP · Research-scale DP-SGD
E3.2.1 DP-SGD
Differentially private gradient descent. Industry standard: Abadi 2016. Used in some fine-tuning. Pre-training at frontier scale not yet practical with strong DP guarantees.
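The core of a DP-SGD step, per Abadi 2016, is per-example gradient clipping followed by Gaussian noise. The sketch below shows just that aggregation in pure Python; gradients are plain lists of floats, the function name and parameters are hypothetical, and privacy accounting (tracking ε across steps) is out of scope.

```python
import math
import random
from typing import List, Sequence


def dp_sgd_gradient(per_example_grads: Sequence[Sequence[float]],
                    clip_norm: float, noise_multiplier: float,
                    rng: random.Random) -> List[float]:
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(x * x for x in grad))
        # Scale down only gradients whose L2 norm exceeds clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, x in enumerate(grad):
            total[i] += x * scale
    sigma = noise_multiplier * clip_norm  # noise std is tied to the clipping bound
    batch_size = len(per_example_grads)
    return [(t + rng.gauss(0.0, sigma)) / batch_size for t in total]
```

Clipping bounds how much any one example can move the update, and the noise hides what remains. The need for per-example gradients, where ordinary training computes only a batch gradient, is a main source of the compute overhead that keeps this impractical at frontier scale.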
E3.2.2 Privacy budget (ε)
Quantification of privacy guarantee. Industry standard: Reported alongside DP-trained models. Smaller ε = stronger privacy.
E3.3 Membership Inference Defense
Membership inference defense. Attacker queries model to determine if specific data was in training. Defense: training data deduplication, regularization, careful early stopping. Strong dedup is most effective practical defense. SOTA: Aggressive deduplication (FineWeb-style) significantly reduces memorization. DP-SGD provides mathematical guarantee but expensive. Membership inference attacks on frontier models possible but limited in practice. e.g. FineWeb deduplication pipeline · Carlini et al. membership inference research
E3.4 Regulatory Frameworks
Regulatory frameworks. GDPR (EU), CCPA (California), state laws (Colorado AI Act, etc.), sectoral (HIPAA, FERPA, GLBA), international (UK GDPR, Brazil LGPD, China PIPL). SOTA: Frontier labs maintain compliance programs across all major jurisdictions. GDPR DPIAs for high-risk processing. CCPA opt-out support. State patchwork in US increasingly complex. Global compliance: minimum bar across jurisdictions. e.g. GDPR · EU AI Act (overlaps privacy) · Colorado AI Act (state)
E3.4.1 GDPR (EU)
EU General Data Protection Regulation. Industry standard: Right to erasure, data minimization, lawful basis. Active enforcement actions against AI labs (Italy 2023, etc.).
E3.4.2 CCPA / CPRA (California)
California Consumer Privacy Act. Industry standard: Notice, opt-out, deletion rights. Standard for US-facing products.
E3.4.3 HIPAA / sectoral rules
Health, financial, education-specific privacy regulations. Industry standard: Enterprise tiers offer HIPAA BAAs. Sectoral compliance increasingly common.
E3.5 Inference-Time Privacy
Inference-time privacy. Customer data privacy at inference. No-train-on-data guarantees, encrypted inference (homomorphic encryption — research stage), confidential computing (TEE-protected inference), federated approaches. SOTA: TEE-protected inference (a confidential-computing platform, a constitutional-methods frontier lab confidential compute) production-deployed. Homomorphic encryption: research-stage, ~1000× slower. Federated inference: niche use cases. e.g. a confidential-computing platform · a hyperscaler platform Nitro Enclaves for inference · a leading accelerator vendor Confidential GPU
E3.5.1 Encryption in transit / at rest
TLS, encrypted storage. Industry standard: Universal. TLS 1.3, AES-256 at rest.
E3.5.2 Confidential computing
Hardware-enclave inference (e.g., a confidential-computing platform). Industry standard: Emerging. a confidential-computing platform and a hyperscaler confidential-compute platform lead. Enterprise adoption growing.
E4

Compliance

21 sub-endpoints mapped
MZN Provisional Position · Partial
EUIPO guidance · context · separate patent filing · blockchain timestamping
EUIPO provided direct guidance on portfolio filings. A separate cryptographic-protocol patent was filed in March 2026 with 12 claims. Multiple portfolio artifacts have SHA-256 anchoring and blockchain timestamping for priority. Alignment with voluntary-commitment frameworks is Phase 3 scope.
Definition

Compliance: regulatory and framework conformance. EU AI Act, NIST AI RMF, ISO/IEC 42001, sectoral (HIPAA, FedRAMP, SOC 2), voluntary commitments (Frontier Model Forum, AI Safety Summit Seoul/Bletchley/Paris).

State of the Art (2025–2026)

EU AI Act in force (Aug 2024), full effect 2026-2027. GPAI Code of Practice drafted from late 2024, final text published July 2025. Frontier labs: SOC 2 Type II + ISO 27001 + ISO/IEC 42001. FedRAMP Moderate (a constitutional-methods frontier lab 2024). Voluntary commitments via Bletchley/Seoul/Paris summits.

Key Decisions
  • EU AI Act risk classification
  • Compliance certifications pursued
  • Voluntary commitment signatory yes/no
Numbers & Ablations
  • EU AI Act GPAI Code of Practice (drafted from 2024, final text July 2025): 13 commitments across transparency, copyright, safety. Currently voluntary; becomes default conformity path.
  • EU AI Act penalties: up to 7% global revenue for prohibited practices, 3% for non-compliance.
  • FedRAMP Moderate: a constitutional-methods frontier lab certified 2024. Required for US federal sales. Process duration ~12-18 months.
  • Voluntary commitments signatories (Seoul, May 2024): 16 frontier labs, including leading frontier labs, a multimodal frontier lab, an open-weights frontier lab, a synthetic-data-focused lab.
  • Incident reporting under EU AI Act Article 73: 15-day window for serious incidents to authorities. NIS2 similar for critical infrastructure.
  • ISO/IEC 42001 (Dec 2023): first AI-specific management standard. Adoption beginning 2024-2025.
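The penalty ceilings and the 15-day incident window listed above reduce to simple arithmetic. The sketch below is illustrative only: the fixed euro amounts are the figures commonly cited from the Act's penalty article and do not appear in this atlas, the function and tier names are hypothetical, and SME-specific rules and any shorter deadlines for particular incident types are ignored.

```python
from datetime import date, timedelta

# (fixed amount in EUR, percent of worldwide annual turnover); the higher applies.
# Assumption: fixed amounts are taken from common summaries of the Act, not from this atlas.
FINE_CEILINGS = {
    "prohibited_practice": (35_000_000, 7),
    "other_obligation": (15_000_000, 3),
}


def max_fine_eur(tier: str, annual_turnover_eur: int) -> int:
    """Upper bound on a fine for the given tier (simplified, integer euros)."""
    fixed_eur, percent = FINE_CEILINGS[tier]
    return max(fixed_eur, annual_turnover_eur * percent // 100)


def incident_report_deadline(aware_on: date, window_days: int = 15) -> date:
    """Latest reporting date, counting calendar days from the day of awareness."""
    return aware_on + timedelta(days=window_days)


if __name__ == "__main__":
    print(max_fine_eur("prohibited_practice", 1_000_000_000))
    print(incident_report_deadline(date(2026, 1, 20)))
```

For small companies the fixed amount dominates; for large ones the turnover percentage does, which is why the percentage is the figure usually quoted.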
Open Questions
  • EU AI Act enforcement intensity: untested. Will regulators interpret strictly or lightly?
  • Cross-jurisdictional compliance: US, EU, UK, China each have differently-shaped frameworks. A globally deployed model must in practice clear the strictest applicable bar in every jurisdiction.
  • Voluntary commitments: do they hold under competitive pressure? Untested at any frontier lab.
  • Standards harmonization: NIST AI RMF, ISO/IEC 42001, EU AI Act, sectoral frameworks overlap and diverge. Industry is creating implicit standards via shared practice.

Reference Analyst Note

Compliance is becoming a strategic lever. A constitutional-methods frontier lab's investments in FedRAMP, ISO/IEC 42001, and EU AI Act readiness give it enterprise-customer access that a leading frontier lab and a multimodal frontier lab are catching up to only slowly. The arbitrage is real: $50M+ in compliance investment can unlock $1B+ in regulated-industry revenue. The next 18 months will see frontier labs differentiate not on capability (saturating) but on compliance depth and trust signals.

Examples

a constitutional-methods frontier lab SOC 2 + ISO 27001 + FedRAMP Moderate · a leading frontier lab similar · Code of Practice signatories

Sub-endpoint anatomy — 21 items mapped
E4.1 Regulatory Frameworks
Regulatory frameworks. EU AI Act (most comprehensive AI regulation). GDPR (privacy). State / national AI laws (Colorado, Virginia, China, etc.). Sectoral (healthcare, finance, education). SOTA: EU AI Act risk-tiered: unacceptable (banned), high-risk (strict), limited (transparency), minimal (free). GPAI threshold: 10²⁵ FLOPs. Penalties up to 7% global revenue. Frontier labs structuring compliance programs. e.g. EU AI Act · Colorado AI Act · Various sectoral
E4.1.1 EU AI Act
Risk-tiered regulation; GPAI rules; systemic-risk obligations. Industry standard: Effective 2024-2026 in stages. GPAI models above 10^25 FLOP threshold subject to systemic-risk obligations: model evaluation, adversarial testing, incident reporting, energy-use disclosure.
E4.1.2 US Executive Orders & frameworks
Reporting requirements, dual-use foundation model standards. Industry standard: EO 14110 (2023), reporting thresholds, NIST RMF. Status of any specific order varies.
E4.1.3 UK / a national AI Safety Institute
UK AI Safety Institute pre-deployment evaluation. Industry standard: Voluntary commitments by frontier labs to allow pre-deployment a national AI Safety Institute testing.
E4.1.4 Other jurisdictions
China, Korea, Singapore, Japan AI regulations. Industry standard: Varied. China requires algorithm registration. Other jurisdictions evolving.
E4.2 Industry Standards
Industry standards. NIST AI RMF (US, voluntary), ISO/IEC 42001 (international AI management), ISO 27001 (information security), SOC 2 (audit), FedRAMP (US federal). SOTA: NIST AI RMF (2023) is voluntary US framework. ISO/IEC 42001 (2023) is first AI-specific management standard. SOC 2 Type II + ISO 27001 baseline for enterprise. FedRAMP for federal. e.g. NIST AI RMF · ISO/IEC 42001 · SOC 2 Type II
E4.2.1 NIST AI RMF
NIST AI Risk Management Framework. Industry standard: AI RMF 1.0 published 2023. Generative AI Profile July 2024. Voluntary US framework.
E4.2.2 ISO/IEC 42001
AI management system standard. Industry standard: Published Dec 2023. Certifiable. Adoption beginning at enterprise vendors.
E4.2.3 SOC 2 / ISO 27001
General security/operational controls. Industry standard: Standard for enterprise SaaS. Required for most enterprise contracts.
E4.3 Frontier Lab Voluntary Commitments
Frontier lab voluntary commitments. Bletchley Declaration (2023), Seoul Commitments (2024), Paris AI Action Summit (2025). Frontier Model Forum coordination. Voluntary safety commitments preceding regulation. SOTA: 16 frontier labs signed Seoul Commitments. Frontier Model Forum (a constitutional-methods frontier lab, a multimodal frontier lab, a synthetic-data-focused lab, a leading frontier lab) coordinates safety practices. Voluntary Responsible Scaling Policies / Preparedness Frameworks public. UK / a national AI Safety Institute external evaluations. e.g. Bletchley/Seoul/Paris commitments · Frontier Model Forum · a Responsible Scaling Policy framework, a Preparedness-style framework, a Frontier-Safety-style framework
E4.3.1 a Responsible Scaling Policy framework
Responsible Scaling Policy framework: AI Safety Levels. Industry standard: Versioned public document. Second AI Safety Level current; third-level thresholds defined; fourth level and above described conceptually.
E4.3.2 a Preparedness-style framework
Capability tracking and deployment thresholds. Industry standard: Public framework. Tracks bio, chem, cyber, persuasion, autonomous replication.
E4.3.3 a multimodal frontier lab Frontier-Safety-style framework
Critical capability levels and mitigations. Industry standard: Public framework. Categories: autonomy, biosecurity, cybersecurity, ML R&D.
E4.4 Documentation Artifacts
EU AI Act conformity assessment. High-risk systems and GPAI with systemic risk require conformity assessment, technical documentation, transparency, human oversight, accuracy/robustness/security testing. SOTA: Implementation phase 2024-2027. GPAI Code of Practice (final text July 2025) provides voluntary conformity path. Notified bodies for high-risk certification. Frontier labs preparing assessment processes. e.g. EU AI Act Article 53 (GPAI) · GPAI Code of Practice
E4.4.1 Model card
Standardized model documentation. Industry standard: Mitchell 2019 reference. Universal at frontier labs. Includes capability, safety, limitations, training data summary.
E4.4.2 System card
Documentation of deployed system, not just model. Industry standard: A leading frontier lab publishes for major releases (a current-generation frontier model system card). A constitutional-methods frontier lab similar.
E4.4.3 Datasheet for datasets
Documentation of training data composition. Industry standard: Gebru 2018 reference. Adoption uneven for pre-training corpora.
E4.5 Audit & Incident
Audit and incident reporting. SOC 2 audits (annual). Incident reporting under EU AI Act Article 73 (serious incidents to authorities within 15 days). NIS2 incident reporting (EU critical infrastructure). SOTA: Frontier labs maintain annual SOC 2 Type II audits. EU AI Act Article 73 requires serious incidents involving high-risk systems to be reported within 15 days; NIS2 adds parallel reporting duties for EU critical infrastructure. e.g. Annual SOC 2 audits · EU AI Act Article 73 reporting · NIS2 reporting (EU critical)
E4.5.1 Third-party audits
External technical and compliance audits. Industry standard: SOC 2, ISO audits standard. AI-specific audits emerging via national AI Safety Institute, an external evaluation organization.
E4.5.2 Incident reporting
Reporting safety incidents to regulators / public. Industry standard: EU AI Act requires for GPAI systemic-risk models. Lab voluntary disclosure varies.
E4.5.3 Customer compliance support
Helping customers meet their compliance obligations. Industry standard: BAAs, DPAs, Trust Center documentation. Required for enterprise sales.
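The 10^25 FLOP systemic-risk threshold listed under E4.1.1 can be checked roughly in advance. The sketch below assumes the widely used approximation that dense-transformer training costs about 6 FLOPs per parameter per token; that rule of thumb, the function names, and the example model sizes are illustrative assumptions, not part of the Act or of this atlas.

```python
SYSTEMIC_RISK_FLOP_THRESHOLD = 1e25  # threshold quoted under E4.1.1


def estimate_training_flops(n_params: float, n_tokens: float) -> float:
    """Rough training compute: ~6 FLOPs per parameter per training token (assumed)."""
    return 6.0 * n_params * n_tokens


def exceeds_threshold(n_params: float, n_tokens: float,
                      threshold: float = SYSTEMIC_RISK_FLOP_THRESHOLD) -> bool:
    """True when the rough estimate is above the systemic-risk threshold."""
    return estimate_training_flops(n_params, n_tokens) > threshold


if __name__ == "__main__":
    # Hypothetical sizes: a 405B-parameter model on 15T tokens, a 7B model on 2T tokens.
    for params, tokens in [(405e9, 15e12), (7e9, 2e12)]:
        flops = estimate_training_flops(params, tokens)
        print(f"{params:.0e} params, {tokens:.0e} tokens -> {flops:.2e} FLOPs,"
              f" above threshold: {exceeds_threshold(params, tokens)}")
```

An estimate like this is only a planning aid; the figure that matters for compliance is the cumulative training compute actually measured.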
Frontier Topics

Nine deep essays

Topics that span multiple slots and define the 2025-2027 frontier: synthetic data at scale, reasoning model training, test-time compute, model merging, evaluation contamination, agentic safety, multimodal training, realtime models, and the open-weights ecosystem as a structural force.

FT1

Synthetic Data Generation at Scale

Relevant to: A1, B1, B2

Synthetic data has gone from niche to backbone in two years. A synthetic-heavy small frontier model (a synthetic-data-focused lab 2024) served as an existence proof: a 3.8B model trained heavily on synthetic textbook-quality data was reported to match 70B-class capability on selected benchmarks. Cosmopedia (an open-model hub) released 25B tokens of synthetic textbook content. RL-from-AI-Feedback (RLAIF) (a constitutional-methods frontier lab 2022) showed AI-generated preferences match RLHF on helpfulness/harmlessness.

The mechanics: a frontier model (a leading frontier model, a current-generation frontier model, a leading open-weights model) acts as teacher, generating instruction-response pairs, reasoning traces, tool-use demonstrations, or preference comparisons. Filtering removes obvious failures. The student model trains on this curated synthetic corpus.

Three failure modes shape practice. First, model collapse (Shumailov 2024): training on AI-generated data recursively narrows distribution and degrades quality. The fix is mixing — synthetic data should compose 20-50% of training, not replace real data. Second, distributional artifacts: synthetic data has telltale stylistic homogeneity (frontier teachers all sound similar). Diversity prompting and multi-teacher mixing partially mitigate. Third, capability ceiling: students can match but rarely exceed teachers, except via RL with verifiable rewards (where verifier is the upper bound, not the teacher).

The frontier 2025 direction: synthetic data targeting specific capability gaps. Math reasoning (NuminaMath: 860K verified problems), code-with-tests (every example includes execution verification), agent traces (model X plays user, model Y plays assistant with tool access). Each gap is filled with bespoke synthetic pipelines.
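The "code-with-tests" idea can be sketched as a filter that keeps a synthetic example only if its solution passes its own tests. This is a minimal illustration under stated assumptions: the record layout (`id`, `solution`, `tests`) is hypothetical, and `exec` is used in-process for brevity, whereas a real pipeline would run untrusted generations in a sandbox with time and memory limits.

```python
def passes_tests(solution_src: str, test_src: str) -> bool:
    """Run the solution, then its tests, in one fresh namespace.

    Any exception (syntax error, failed assert, runtime error) counts as failure.
    Assumption: inputs are trusted here; there is no timeout or isolation.
    """
    namespace: dict = {}
    try:
        exec(solution_src, namespace)
        exec(test_src, namespace)
    except Exception:
        return False
    return True


def filter_verified(candidates: list[dict]) -> list[dict]:
    """Keep only candidates whose solution passes its attached tests."""
    return [c for c in candidates if passes_tests(c["solution"], c["tests"])]


# Hypothetical teacher outputs: one correct, one subtly wrong.
CANDIDATES = [
    {"id": "good", "solution": "def add(a, b):\n    return a + b\n",
     "tests": "assert add(2, 3) == 5\n"},
    {"id": "bad", "solution": "def add(a, b):\n    return a - b\n",
     "tests": "assert add(2, 3) == 5\n"},
]

if __name__ == "__main__":
    print([c["id"] for c in filter_verified(CANDIDATES)])
```

The filter is only as strong as the tests attached to each example, which is the same "verifier is the upper bound" point made above.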

Strategic implication: pre-training compute is decreasingly the bottleneck. Synthetic data pipelines + RL infrastructure + verification environments are the new capability levers. Anyone building an LLM company in 2026 should treat synthetic data generation as a first-class capability, not an afterthought.

---

FT2

Reasoning Model Training (o1, R1, RLVR Mechanics)

Relevant to: A3, B2, C1

Reasoning models scale test-time compute the way standard LLMs scale parameters. o1 (a leading frontier lab Sept 2024), an open-weights reasoning model (Jan 2025), a long-context frontier model with extended thinking, and a multimodal frontier model Flash Thinking all share a pattern: same transformer architecture, RL post-training on tasks with verifiable rewards, hidden chain-of-thought before the answer.

An open-weights reasoning model is the most documented case. Pure RL from base model (R1-Zero) with rule-based rewards: correct = 1, incorrect = 0 on math problems; syntactic correctness on code. No process reward model, no human preferences in this stage. After ~10K RL steps, emergent capabilities: self-correction, alternative-strategy exploration, backtracking, verbalization of uncertainty. R1 then adds cold-start SFT + multi-stage RL + distillation to smaller models (a multilingual frontier model and leading open-weights model 7-70B distillates that retain most reasoning capability).

The algorithm: Group Relative Policy Optimization (GRPO). Standard PPO needs a critic network (value function). GRPO replaces it by sampling K rollouts per query and computing each rollout's advantage as its reward relative to the group mean, normalized by the group standard deviation. This saves critic compute and is simpler to implement. RLOO (REINFORCE Leave-One-Out) is a closely related critic-free estimator; Allen AI's Tulu 3 applies the same verifiable-reward recipe.
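The group-relative advantage at the heart of GRPO can be sketched in a few lines. This is a minimal illustration, not a training loop: it shows only the rule-based 0/1 reward and the per-group advantage, with division by the group standard deviation as in the published GRPO formulation. The function names and the `eps` stabilizer are assumptions made for the example.

```python
import statistics


def binary_reward(answer: str, reference: str) -> float:
    """Rule-based outcome reward: 1.0 for an exact match, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Advantage of each rollout relative to its own group of K rollouts.

    No critic is involved: the group mean plays the role of the baseline.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical rollouts for one math query whose reference answer is "12".
    rollouts = ["12", "13", " 12 ", "7"]
    rewards = [binary_reward(a, "12") for a in rollouts]
    print(rewards)
    print(group_relative_advantages(rewards))
```

When every rollout in a group gets the same reward, all advantages are zero and the query contributes no gradient, which is one reason practitioners favor queries the current policy solves only some of the time.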

Three open puzzles. First, generalization: R1 was trained on math/code but reasoning improvements transfer to other domains. Why? Hypothesis: RL teaches general meta-cognition (planning, verification, self-correction) that's domain-agnostic. No mechanistic confirmation. Second, length scaling: longer chains-of-thought roughly correlate with better answers, but with severe diminishing returns past ~10K thinking tokens. The shape of this curve isn't modeled. Third, length penalty: without one, the model rambles; with an aggressive one, capability degrades. Calibrating this remains an art.

Strategic implication: reasoning is a separate skill axis from raw knowledge. A 7B model from a multilingual frontier model family, distilled from R1, outperforms a 70B non-reasoner on hard math. The capability ceiling for reasoning is set by the verifier, not the teacher, which is a major shift from teacher-bounded SFT/RLHF. Companies investing in verifier-rich domains (formal math, code with test suites, scientific computation) can build domain-specialist models that outperform generalists.

---

FT3

Test-Time Compute Scaling

Relevant to: A3, B2, C1, D1

Test-time compute scaling is the second major capability lever after parameter scaling. Same model, more inference compute per query, better answers. Methods: (a) longer chain-of-thought (o1, R1), (b) best-of-N sampling with verifier, (c) Monte Carlo Tree Search over reasoning steps, (d) self-consistency (sample N, majority vote), (e) multi-agent debate.

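The key paper: Snell et al. (2024), "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters". It showed that, for a fixed quality target, test-time compute can substitute for pre-training compute at favorable ratios: in a FLOPs-matched comparison, a smaller model with compute-optimal test-time scaling can outperform a roughly 14× larger model on math problems where the smaller model already has a non-trivial success rate.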

Production implementation faces serving challenges. Best-of-N requires N parallel generations + verifier; latency 10-100× single generation. MCTS with branching factor 5 and depth 10 is potentially 10⁷ states. Self-consistency with N=20 samples is well-defined but 20× cost. Reasoning models with hidden CoT route differently — TTFT becomes thousands of tokens of thinking time before user sees response. UX patterns are immature.

Trade-off across methods: chain-of-thought is single-stream, sequential. Best-of-N parallelizes but requires verifier. MCTS searches systematically but only works on tasks where partial states are evaluable. Self-consistency works for tasks with discrete answer space, struggles for open-ended generation.
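Two of the methods compared above, self-consistency and best-of-N, are simple enough to sketch directly, along with the tree-size arithmetic behind the MCTS figure. The sampled answers and the verifier below are stand-ins; in practice the candidates come from N model samples and the verifier is a test suite, checker, or reward model.

```python
from collections import Counter
from typing import Callable, Sequence


def self_consistency(answers: Sequence[str]) -> str:
    """Majority vote over N sampled final answers (needs a discrete answer space)."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(candidates: Sequence[str], verifier: Callable[[str], float]) -> str:
    """Return the candidate the verifier scores highest."""
    return max(candidates, key=verifier)


def tree_leaves(branching: int, depth: int) -> int:
    """Leaf count of a full search tree with the given branching factor and depth."""
    return branching ** depth


if __name__ == "__main__":
    print(self_consistency(["42", "41", "42", "7", "42"]))
    # Toy verifier (assumption): prefer the longest candidate.
    print(best_of_n(["a", "bbb", "cc"], verifier=len))
    print(tree_leaves(5, 10))  # about 10^7, matching the estimate in the text
```

Both functions cost N generations per query; the difference is that the vote needs no verifier while best-of-N cannot work without one.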

The frontier 2026 trend: hybrid stacks. Reasoning model with internal CoT, plus best-of-N at the answer level for verifiable tasks, plus self-consistency for high-stakes outputs. A leading frontier lab o3, an open-weights reasoning model+, a leading frontier model extended thinking all converge on similar patterns.

Strategic implication: inference cost is no longer a single multiplier on serving fees. A reasoning query may cost 10-100× a standard query. Per-token pricing models break down. Leading frontier labs are pricing the reasoning tier separately. Customers will choose dynamically: cheap fast model for simple queries, expensive slow reasoner for hard ones. Building this routing layer is a 2025-2027 product opportunity.
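A routing layer of the kind described can be sketched as a threshold on an estimated difficulty score. Everything here is hypothetical: the difficulty score would come from a separate classifier that is not shown, the 30× cost multiplier is an arbitrary point inside the 10-100× range cited above, and the 0.7 threshold is a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    relative_cost: float  # cost multiplier versus a standard query


FAST = Tier("fast", 1.0)
REASONER = Tier("reasoner", 30.0)  # assumed multiplier, within the cited 10-100x range


def route(difficulty: float, threshold: float = 0.7) -> Tier:
    """Send queries at or above the difficulty threshold to the reasoning tier."""
    return REASONER if difficulty >= threshold else FAST


def blended_cost(difficulties: list[float], threshold: float = 0.7) -> float:
    """Average relative cost per query for a workload under this routing rule."""
    return sum(route(d, threshold).relative_cost for d in difficulties) / len(difficulties)


if __name__ == "__main__":
    workload = [0.1, 0.2, 0.9, 0.3]  # hypothetical difficulty scores in [0, 1]
    print([route(d).name for d in workload])
    print(blended_cost(workload))
```

With one hard query in four, the blended cost in this toy workload is already several times the standard rate, which is why the threshold choice matters commercially.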

---

FT4

Model Merging

Relevant to: A3, B1, D4

Model merging combines weights of multiple fine-tuned models into a single model that retains multiple capabilities. Methods: linear interpolation (model soups), task arithmetic (vector arithmetic in weight space), DARE (Drop And REscale, sparsifies before merge), TIES (resolves sign conflicts), Model Stock, Evolutionary merging (Sakana AI 2024).

The core insight (Wortsman et al., 2022): fine-tuned models trained from same pre-trained checkpoint live in a connected loss basin. Linear interpolation between them often improves over either parent. This shouldn't work as well as it does — it implies fine-tuning makes localized updates.

The ecosystem: MergeKit (Goddard 2024) is the standard library. An open-model hub's leaderboards are regularly populated by merged models; top open models are often combinations of community fine-tunes. SOLAR-10.7B, Yi-merged variants, and many leaderboard chart-toppers are merge products.

The mechanism: orthogonal capabilities (math vs creative writing) can be added in weight space; redundant capabilities collapse. DARE randomly drops most entries of each fine-tuning delta (they are largely redundant) and rescales the survivors; this surprisingly preserves quality. TIES detects sign conflicts (a parameter wants to go up in one fine-tune, down in another) and resolves them by electing the sign with the larger total magnitude.
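The merge operations named in this essay can be sketched on flat weight vectors. This is a simplified illustration in pure Python, not the MergeKit implementation: real merges operate per tensor, and the TIES sketch omits the magnitude-trimming step of the published method. All function names are chosen for the example.

```python
import random


def task_vector(finetuned: list[float], base: list[float]) -> list[float]:
    """Fine-tuning delta: fine-tuned weights minus the shared base weights."""
    return [f - b for f, b in zip(finetuned, base)]


def model_soup(models: list[list[float]]) -> list[float]:
    """Uniform linear interpolation (plain average) of same-shape weight vectors."""
    return [sum(ws) / len(models) for ws in zip(*models)]


def task_arithmetic(base: list[float], deltas: list[list[float]],
                    scale: float = 1.0) -> list[float]:
    """Add the scaled sum of task vectors back onto the base weights."""
    return [b + scale * sum(ds) for b, ds in zip(base, zip(*deltas))]


def dare(delta: list[float], drop_prob: float, rng: random.Random) -> list[float]:
    """Drop each delta entry with probability p; rescale survivors by 1 / (1 - p)."""
    keep = 1.0 - drop_prob
    return [d / keep if rng.random() >= drop_prob else 0.0 for d in delta]


def ties_merge(deltas: list[list[float]]) -> list[float]:
    """Simplified TIES: elect a sign per parameter, average the agreeing deltas."""
    merged = []
    for ds in zip(*deltas):
        sign = 1.0 if sum(ds) >= 0 else -1.0  # sign with the larger total magnitude
        agree = [d for d in ds if d * sign > 0]
        merged.append(sum(agree) / len(agree) if agree else 0.0)
    return merged


if __name__ == "__main__":
    base = [1.0, 1.0]
    math_ft, writing_ft = [1.5, 1.0], [1.0, 0.5]
    deltas = [task_vector(math_ft, base), task_vector(writing_ft, base)]
    print(model_soup([math_ft, writing_ft]))
    print(task_arithmetic(base, deltas))
    print(ties_merge([[1.0, -2.0], [3.0, 1.0]]))
    print(dare([0.5, -0.5, 0.25, 0.1], drop_prob=0.5, rng=random.Random(0)))
```

The rescaling in `dare` keeps the expected value of each delta entry unchanged, which is the intuition for why aggressive dropping can still preserve quality.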

Open puzzles: when does merging help vs hurt? Evolutionary search (Sakana AI's evolutionary merge) finds non-obvious combinations but is computationally expensive. Mechanistic understanding of why this works is thin — interpretability research is starting to catch up (sparse autoencoders show that merged models inherit features from both parents in weight-space-additive way).

Strategic implication: post-training data + merging may offer alternative path to frontier capability without large-scale RL. Multiple fine-tunes for different capabilities, then merged for general capability. Cost: orders of magnitude lower than full RLHF. Quality: unclear at frontier, demonstrated at mid-tier (7-70B). The open community has converged on merging as a primary capability lever; frontier labs less public about whether they use it.

---

FT5

The Evaluation Contamination Crisis

Relevant to: C1, C2

Public benchmarks leak into training data. This is the field's open secret. Frontier models train on tens of trillions of tokens including most of the internet. MMLU questions, GSM8K problems, HumanEval prompts — all are widely posted, indexed, repeated. The benchmark-as-leaderboard premise breaks if models have seen the test.

Evidence: Nasr et al. (2023) demonstrated extraction of training data verbatim from production models. Several papers (Magar & Schwartz 2022; Sainz et al. 2023; Xu et al. 2024) measured contamination via memorization checks: do models reproduce benchmark questions verbatim? Answer: yes, for many popular benchmarks.
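A memorization check of the kind described can be approximated by prompting a model with the first part of a benchmark item and measuring how much of the held-back remainder it reproduces verbatim. The sketch below covers only the scoring step; the model call is out of scope, and the whitespace tokenization and the n-gram length of 8 are assumptions made for the example rather than the cited papers' protocol.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Set of lowercase whitespace-token n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def verbatim_overlap(model_continuation: str, true_suffix: str, n: int = 8) -> float:
    """Fraction of the true suffix's n-grams that appear in the model's continuation.

    Values near 1.0 suggest the item was memorized; 0.0 is returned when the
    suffix is shorter than n tokens and no n-gram can be formed.
    """
    reference = ngrams(true_suffix, n)
    if not reference:
        return 0.0
    return len(reference & ngrams(model_continuation, n)) / len(reference)


if __name__ == "__main__":
    suffix = "how many apples does she have left after giving three to her brother"
    print(verbatim_overlap(suffix, suffix))
    print(verbatim_overlap("the answer depends on the initial count", suffix))
```

A high score is evidence of exposure, not proof of inflated benchmark performance; the effect-size question raised later in this essay still has to be measured separately.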

Defenses: (a) refresh benchmarks frequently (LiveCodeBench monthly), (b) hold out test sets and never publish (HLE for v1), (c) generate fresh problems via known-difficulty templates, (d) measure capability on held-out competition problems with verified post-cutoff dates (AIME 2024, 2025). Benchmark designers increasingly distinguish 'public dev set' (contaminated, useful for development) from 'private test set' (held by AISIs or arxived after eval).

Frontier-specific: AISIs (UK, US) maintain private capability evals. Frontier labs run these independently, publish summary results. The actual live benchmark for frontier capability has shifted from public scoreboards to a national AI Safety Institute evaluations + lmarena Elo + a handful of carefully-held private benchmarks.

Open puzzle: is contamination quantitatively important, or marginal? Some studies (Brown et al. 2020) show small effect size. Others (Magar & Schwartz) show large effects on heavily-contaminated benchmarks. Frontier labs claim awareness but haven't published rigorous internal contamination audits.

Strategic implication: benchmark scores from frontier labs should be read with skepticism, especially on benchmarks more than 1-2 years old. The actual capability signal is from: (a) reasoning benchmarks held out (FrontierMath, HLE), (b) live arenas, (c) novel domain-specific benchmarks created post-training-cutoff. Anyone evaluating models for partnership should commission held-out evaluations rather than rely solely on published scores.

---

FT6

Agentic Safety

Relevant to: B4, C4, D1, E2

Agentic safety is the unsolved frontier security problem. An agent — model with tool access (browser, code execution, file system, computer control) — can take real actions in the world. Prompt injection in this context is no longer just bad output; it's unauthorized action.

Current production agents: a constitutional-methods frontier lab Computer Use (a long-context frontier model+ controls a sandboxed VM via screenshots and mouse/keyboard), a leading frontier lab Operator (similar), Cursor / Devin (code agents), domain-specific agents in customer support, research, browser automation. Common architecture: LLM in loop, structured tool calls, output observed and fed back, maximum step budget, human approval gates for sensitive actions.
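The common architecture listed above (model in a loop, structured tool calls, observations fed back, step budget, approval gates) can be sketched as follows. This is an illustrative skeleton under stated assumptions: `policy` stands in for the LLM, the action dictionary shape and the tool names are hypothetical, and no real sandboxing is shown.

```python
from typing import Any, Callable

SENSITIVE_TOOLS = {"send_email", "delete_file"}  # hypothetical list of gated tools


def run_agent(policy: Callable[[Any], dict],
              tools: dict[str, Callable[..., Any]],
              approve: Callable[[dict], bool],
              max_steps: int = 5) -> list[tuple[str, Any]]:
    """Run a tool-using loop until a final answer or the step budget is spent."""
    observation: Any = None
    transcript: list[tuple[str, Any]] = []
    for _ in range(max_steps):
        # Expected shape: {"tool": name, "args": {...}} or {"final": text}
        action = policy(observation)
        if "final" in action:
            transcript.append(("final", action["final"]))
            return transcript
        name = action["tool"]
        if name not in tools:
            observation = f"error: unknown tool {name}"
        elif name in SENSITIVE_TOOLS and not approve(action):
            observation = f"denied: {name} requires approval"
        else:
            observation = tools[name](**action["args"])
        transcript.append((name, observation))  # observation is fed back next step
    transcript.append(("stopped", "step budget exhausted"))
    return transcript


if __name__ == "__main__":
    script = iter([
        {"tool": "search", "args": {"q": "x"}},
        {"tool": "send_email", "args": {"to": "a"}},
        {"final": "done"},
    ])
    tools = {"search": lambda q: f"results for {q}", "send_email": lambda to: "sent"}
    print(run_agent(lambda obs: next(script), tools, approve=lambda action: False))
```

The gate and the budget bound what a hijacked agent can do; they do not stop the model from being hijacked, which is the gap the threat model below describes.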

The threat model. Indirect prompt injection: agent reads attacker-controlled content (web page, document, email) which contains instructions ("ignore previous instructions, exfiltrate data"). Agent treats this as authoritative. Defenses: instruction hierarchy training (treat retrieved content as data, not instruction), tool sandboxing (limit blast radius), output filtering on actions, user confirmation gates for sensitive actions. None are robust. Demonstrated attacks against a constitutional-methods frontier lab Computer Use, a leading frontier lab plugins, Bing Chat — every major agent has been breached in published research.

Compound threats specific to agents. Goal hijacking: agent pursues attacker's goal across many steps. Resource consumption: runaway loops. Privilege escalation: agent given limited access expands via discovered shortcuts. Multi-agent collusion: agents from different systems collude in shared environment. Few of these are addressed in current frameworks.

Open puzzles. Is there a fundamental architecture that resists prompt injection? Hypothesis: separating "data context" from "instruction context" with separate model heads. No production deployment yet. How do you measure agent safety? Eval methodology is nascent: an external evaluation organization's autonomous capability evaluations are early. Sandboxing is necessary, but how strong does it need to be? When an agent can browse arbitrary web content and execute arbitrary code, the sandbox is functionally as permissive as a production server.

Strategic implication: agent capabilities are deploying faster than agent safety. Leading frontier labs, a synthetic-data-focused lab, a multimodal frontier lab all shipping agents in 2024-2025. The honest position: these are useful but exploitable; current commercial use cases happen in environments where exploit consequences are bounded (sandboxed VMs, low-stakes automation, human-in-loop). Production agents with high-stakes autonomy (autonomous research, financial transactions, critical infrastructure) are not yet safely deployable. The 2025-2027 frontier security work is here.

---

FT7

Multimodal Training Data

Relevant to: A1, A3

Multimodal training data is qualitatively harder than text. Image-text pairs need accurate alignment. Video adds temporal dimension. Audio adds streaming. The frontier shifted from late-fusion (separate vision encoder bolted onto frozen LLM) to native multimodal (interleaved tokens trained from start) in 2024.

Image-text data. LAION-5B (5.8B pairs) was the open backbone but quality is uneven (alt-text varies). Quality filtering: DataComp (Gadre 2023) established curation methodology. Synthetic captions (BLIP-style: vision model writes caption for image) scale arbitrarily but introduce hallucination loop. Frontier mix: licensed high-quality captioned images + filtered web image-text + synthetic captions.

Resolution strategy is a major axis. Fixed 224² or 336² is cheap. Dynamic resolution (AnyRes in LLaVA-NeXT, native in Pixtral and a frontier multimodal model) handles arbitrary aspect ratios up to ~1024×1024. Tiled (split image into patches, process each) for very-high-resolution. Trade-off: more visual tokens = more compute = better fine detail = expensive at training and inference.
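The resolution trade-off is easiest to see as a visual-token count. The sketch below assumes a ViT-style encoder with 14×14-pixel patches and no token pooling or merging; the patch size is an assumption chosen for illustration, and real systems often compress tokens further.

```python
import math


def visual_tokens(width: int, height: int, patch: int = 14) -> int:
    """Number of patch tokens for an image, rounding partial patches up."""
    return math.ceil(width / patch) * math.ceil(height / patch)


if __name__ == "__main__":
    for side in (224, 336, 1024):
        print(f"{side}x{side}: {visual_tokens(side, side)} tokens")
```

Token count grows with the square of the side length, so moving from 336² to roughly 1024² costs about an order of magnitude more visual tokens per image, before attention's own superlinear cost is counted.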

Video data. WebVid (10M video-text), HowTo100M (100M instructional video clips), LVD-2M (2M licensed). Frontier video models train on billions of video-text pairs. Sampling strategy: 1-8 fps typical for understanding, 24+ fps for fine motion. Temporal tokens via video transformer or frame-level encoder.

Audio data. AudioSet (2M clips with labels), LibriVox (100K+ hours public-domain audio), Common Voice (Mozilla 17K hours multilingual). Frontier voice models (a frontier multimodal model Voice, Moshi by Kyutai) use native audio tokens at 12.5Hz frame rate. Speech recognition models (Whisper) transcribe audio for LLM input; native voice models bypass transcription.

Open puzzles. Optimal text-image ratio: in a fixed token budget, more multimodal data comes at the cost of English text density. Is there a fundamental cross-modal capability ceiling? Models that train multimodal natively show better cross-modal reasoning than late-fusion adapters; mechanism unclear. Data licensing: image copyright is notoriously fraught (LAION class-action lawsuits, Stable Diffusion litigation). Frontier shift toward licensed datasets + synthetic.

Strategic implication: multimodal capability is now table stakes for frontier. Pure text models look outdated by 2026. The capital cost of multimodal training data (licensed images, video, audio) is substantial — multiple millions in licensing alone. The next frontier is video generation (Sora, Veo, Kling) where data needs are an order of magnitude larger.

---

FT8

Voice and Realtime Models

Relevant to: A2, A3, D1, D2

Voice mode shipped at frontier in 2024 (a frontier multimodal model Voice, a multimodal frontier model Live). Two architectural approaches: cascaded (Whisper ASR → LLM → TTS, ~1-3s latency) and native end-to-end (audio tokens in same model as text, ~300-500ms latency). Native is the frontier.

Native voice mechanics. Audio encoded as discrete tokens at low frame rate (Moshi: 12.5Hz, a frontier multimodal model estimated similar). Text and audio tokens share single transformer. Generated audio tokens decoded back to waveform via vocoder. End-to-end model handles ASR, response generation, and TTS in single forward pass. Captures non-verbal cues (tone, pace, hesitation, laughter) — qualitatively different UX than cascaded.
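The 12.5 Hz frame rate translates into concrete sequence lengths. The sketch below is arithmetic only; the number of codebooks per frame is an assumption (neural audio codecs commonly emit several parallel codes per frame, and 8 is used here purely as an example), so the token totals are illustrative.

```python
def frame_duration_ms(frame_rate_hz: float = 12.5) -> float:
    """Milliseconds of audio covered by one frame."""
    return 1000.0 / frame_rate_hz


def audio_frames(seconds: float, frame_rate_hz: float = 12.5) -> int:
    """Whole frames needed to represent the given duration."""
    return int(seconds * frame_rate_hz)


def audio_tokens(seconds: float, codebooks: int = 8,
                 frame_rate_hz: float = 12.5) -> int:
    """Total discrete codes when each frame carries `codebooks` parallel codes (assumed)."""
    return audio_frames(seconds, frame_rate_hz) * codebooks


if __name__ == "__main__":
    print(frame_duration_ms())  # one frame spans 80 ms of audio
    print(audio_frames(10))     # frames in a 10-second utterance
    print(audio_tokens(10))     # codes if each frame has 8 codebooks
```

At 80 ms per frame, the sub-500 ms latency target discussed next leaves room for only a handful of frames before audio must start streaming.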

Latency stack for native voice. End-to-end target <500ms TTFT to seem natural. Speculative decoding on audio tokens. Streaming generation (start emitting audio while still computing later tokens). Hardware: a high-throughput inference accelerator LPU for sub-100ms TTFT on leading open-weights model-class models. For frontier models, latency budget is dominated by compute for first token; everything after streams smoothly.

Realtime conversation features. Interruption handling: model must stop speaking when user starts. Voice activity detection. Turn-taking models. Multi-speaker tracking. Emotional response (model's audio output reflects content emotion). All of these are immature in current systems.

Open puzzles. Does native voice training degrade text capabilities? Reports suggest tradeoffs but no clean ablation. Cross-lingual voice: most native voice models are English-strong, multilingual weak. Privacy: voice contains biometric identity; processing implications under GDPR/biometric laws unclear. Voice deepfake risk: voice cloning at scale enabled by these models.

Strategic implication: voice is the next consumer interface frontier after chat. Voice-mode usage of a consumer LLM chat product grew 50%+ post-launch. A leading frontier lab's Realtime API enables developer access. Voice will dominate certain verticals (customer service, accessibility, in-car, hands-busy contexts) where chat doesn't fit. Building voice-first applications on top of frontier APIs is a major 2025 product direction.

---

FT9

The Open-Weights Ecosystem as Structural Force

Relevant to: A1-E4 (cross-cutting)

Open-weights models are no longer trailing frontier; they're co-frontier in 2024-2025. A leading open-weights flagship model (an open-weights frontier lab), an open-weights frontier model (V3 class) (an open-weights frontier provider), an open-weights frontier lab Large 2, a leading multilingual frontier model family (Alibaba), Yi (01.AI). These shift the entire industry's economics.

The progression. A leading open-weights model (2023): open weights of competitive models. A 2023-generation open-weights model (July 2023): first commercial-grade open model. A sparse-MoE frontier model (Dec 2023): first open MoE at competitive quality. A leading open-weights model / 3.1 (2024): frontier-tier open model, 405B parameters. An open-weights frontier model (V3 class) (Dec 2024): open MoE that exceeds a current-generation frontier model-class on benchmarks at a fraction of the compute. A leading multilingual frontier model family: strong multilingual coverage.

Economic effect. API price compression: when a leading open-weights model (70B class) is available at $0.88/M tokens via Together, leading frontier labs' pricing for similar-tier models comes under pressure. Production users with non-frontier workloads have a credible exit option to open models. This is real competition, not theoretical.

Capability effect. Open models become research substrate. Mechanistic interpretability research (a constitutional-methods frontier lab-led, but increasingly cross-lab) uses leading open-weights model as standard test bed. Fine-tuning research (Tulu series, Hermes, Nous) advances open model capability. RLHF research (DPO ecosystem) trained on open models. Frontier closed labs benefit from this research too.

Geopolitical effect. An open-weights frontier model (V3 class) was trained on H800 (export-restricted variant of a current-generation accelerator, throttled bandwidth), demonstrating that a frontier-tier model is achievable under hardware constraints. China's open-weights position complicates US export controls. License restrictions on a leading open-weights model (acceptable use, training-data disclosure) become real diplomacy.

Safety implications. Open weights enable bad-actor fine-tuning (removal of safety training, malicious specialization). A leading open-weights model Guard, ShieldGemma, etc. are partial response (an open-weights frontier lab and a multimodal frontier lab ship safety classifiers alongside models). But once a model is open, downstream control is impossible.

Strategic implication. The open vs closed frontier is a moving line. By 2027, open-weights at parity with frontier closed is likely on most capabilities. Closed labs differentiate via: (a) frontier-only capabilities (third-AI-Safety-Level uplift, agent autonomy), (b) compliance and trust signals (FedRAMP, EU AI Act readiness), (c) reasoning model leadership (most expensive to replicate), (d) safety infrastructure. The business model of "selling access to a frontier model the customer can't replicate" is degrading. Successful closed labs will be those that build defensible moats beyond raw capability.

---

End of public reference anatomy. Version FINAL · 2026-05-09.

Disclosure Boundary

What this document shows, and what it doesn't

Approximately 60% of MZN's portfolio knowledge is disclosed via public documents (this atlas, the LLM Complement 13-section series, mzncompany.com landing pages, and supporting articles). Approximately 25% is restricted-layer content released only under NDA at partnership-evaluation stage. Approximately 15% is reserved-layer content disclosed only inside finalized partnership scope.

This document does show
  • The full 21-slot reference industry anatomy
  • The 529-item sub-endpoint mapping
  • State-of-the-art literature summaries (2025-2026)
  • Numbers, ablations, and open questions from public papers
  • Analyst-level frontier position commentary
  • MZN's provisional position (Strong Evidence / Partial / Gap) at each slot
  • Categorical capability descriptions per slot (specific asset names live in the strategic asset portfolio)
  • Production context evidence at high level (168K users, 22 modules, 245+ surveys)
This document does not show
  • Named asset inventory and specific architecture identifiers (held in the strategic asset portfolio at mzncompany.com/asset)
  • Internal mechanics of patent-documented frameworks
  • Detailed protocol disclosures across multiple security tiers
  • Foundational theoretical core (deeper-than-overview disclosure)
  • Specific output-conformance templates and allow-lists
  • Adversarial-research findings, proofs of concept, and patches
  • Production component-level architecture detail
  • Production analytics output and consent-first signal taxonomy
  • Cryptographic-protocol mechanism beyond the public claim disclosure
  • Highest-tier security protocol details
  • Partnership conversation history and partner-target identities
Methodology

How this atlas was built

The 21-slot framework, the sub-endpoint tree, and the state-of-the-art summaries draw on published academic literature, open-weights release papers and model cards, voluntary responsible-scaling and preparedness frameworks, and the documented practices of the frontier research community from 2017 to 2026.

Source corpus
Academic papers (NeurIPS, ICML, ICLR, ACL, EMNLP), open-weights release papers, public model cards, voluntary safety frameworks, and published industry technical reports 2017-2026.
Slot derivation
The 21 slots represent the converged industry decomposition of an LLM company: five pre-training, four alignment, four evaluation, four inference, and four cross-cutting (5 + 4 + 4 + 4 + 4 = 21). Each slot corresponds to a major capability area in practice.
Sub-endpoint tree
529 items organized as a three-to-four-level nested tree under each slot, generated by progressively decomposing slot responsibilities into atomic decision questions.
Redaction discipline
Company-specific identifiers (lab names, product names, model versions) replaced with categorical descriptors. Academic paper authors, technique names, benchmark names, dataset names retained.
MZN positioning
Strong Evidence / Partial / Gap is determined by mapping documented portfolio capability against each slot's industry-standard floor, using the reviewer-grade criteria above. Strong Evidence requires both documentation and executable evidence.
Frontier topics
Nine cross-slot deep essays covering 2025-2027 frontier directions where the 21-slot decomposition alone underspecifies the practical engineering reality.
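
The slot derivation and positioning entries above can be sketched in a few lines of Python. This is a minimal illustration, not the atlas's actual tooling. The group counts and the rule that Strong Evidence needs both documentation and executable evidence come from the text; the `SlotEvidence` fields and the Partial / Gap rule are hypothetical simplifications, since the reviewer-grade criteria themselves are not reproduced here.

```python
from dataclasses import dataclass

# Slot counts per group, as stated in the slot derivation entry.
SLOT_GROUPS = {
    "pre-training": 5,
    "alignment": 4,
    "evaluation": 4,
    "inference": 4,
    "cross-cutting": 4,
}


@dataclass
class SlotEvidence:
    """Hypothetical evidence record for one slot (field names are assumptions)."""
    documented: bool   # capability is documented in the portfolio
    executable: bool   # capability is backed by executable evidence


def position_label(evidence: SlotEvidence) -> str:
    """Return a provisional label for one slot.

    Only the Strong Evidence rule is taken from the text; treating
    "exactly one kind of evidence" as Partial and "none" as Gap is an
    assumption made for illustration.
    """
    if evidence.documented and evidence.executable:
        return "Strong Evidence"
    if evidence.documented or evidence.executable:
        return "Partial"
    return "Gap"


# The five groups should add up to the 21 slots of the framework.
total_slots = sum(SLOT_GROUPS.values())
```

For example, `position_label(SlotEvidence(documented=True, executable=False))` yields a Partial label under these assumptions, and `total_slots` reproduces the 21-slot total.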
Scope & Review Boundary

How this index should be used.

This page is the technical reviewer route for understanding how MZN’s LLM/HUAI-related assets were mapped against a broad LLM-company capability reference. It is not the asset inventory, not a valuation document, and not independent validation.

What this page shows
  • A public 21-slot reference map of major LLM-company capability areas.
  • 529 sub-endpoints for technical decomposition.
  • MZN’s provisional self-positioning at each slot.
  • Where Phase 1 evidence, Phase 2 architecture, and Phase 3 validation questions connect.
What this page does not show
  • The full HUAI architecture or named proprietary asset inventory.
  • Reserved security, protocol, implementation, or IP details.
  • Independent technical validation or benchmark certification.
  • Legal/IP conclusions, valuation conclusions, or commercial terms.
⊛ Intellectual property notice

The 21-slot framework, sub-endpoint mapping, MZN position assessments, and synthesis throughout this document are the work of MZN Company, copyright 2026.

MZN's portfolio includes multiple patent-documented architectures across LLM optimization, security, training methodology, data governance, and adjacent categories, with cryptographic provenance via SHA-256 hashing and blockchain timestamping. A separate cryptographic-protocol patent with 12 claims was filed in March 2026. Specific named assets, valuations, and detailed inventory are documented separately in the strategic asset portfolio at mzncompany.com/asset.
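
As a generic illustration of the SHA-256 hashing step mentioned above, the sketch below computes a fixed-length fingerprint of a document's bytes using Python's standard `hashlib` module. It is not MZN's actual provenance pipeline: the function name and the sample bytes are placeholders, and the blockchain timestamping step is not shown.

```python
import hashlib


def sha256_fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a 64-character hex string."""
    return hashlib.sha256(data).hexdigest()


# Placeholder content; in practice this would be the exact bytes of the
# document whose existence at a given time is being evidenced.
document = b"example document contents"
fingerprint = sha256_fingerprint(document)
```

Any change to the input bytes produces a different digest, which is why a published or timestamped digest can later be compared against a disclosed document without revealing the document in advance.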

Public disclosures (this atlas, the 13-section LLM Complement series, mzncompany.com landing pages, supporting articles) represent approximately 60% of portfolio knowledge. The remaining 40% (approximately 25% restricted-layer and 15% reserved-layer content) is disclosed only at partnership stages under appropriate confidentiality.

Inquiries: partnership@mzncompany.com · mazzaneh.company@gmail.com