MZN — LLM Framework Index · Public Reference Atlas + Provisional Position Map

Slot	Title	Level	Position summary
A1	Data	STRONG EVIDENCE	Phase 1 consent-first product data evidence; LLM-readiness pending review
A2	Tokenizer	STRONG EVIDENCE	Multilingual tokenizer expertise; under-served-script focus
A3	Architecture	PARTIAL	Patent-grade candidate architectural innovations; implementation validation pending
A4	Training	PARTIAL	Training methodology documented; frontier-scale execution pending
A5	Compute	GAP	No cluster under solo operation
B1	SFT	PARTIAL	Demonstration-data shaping methodology documented
B2	Preference Optimization	PARTIAL	Output-conformance methodology informs preference design
B3	Constitutional Methods	PARTIAL	Principle-based alignment substrate (theoretical layer)
C1	Capability Evaluation	PARTIAL	Phase 1 product telemetry and user-behavior evaluation context
C2	Safety Evaluation	STRONG EVIDENCE	Documented safety architecture; red-team validation pending
C3	Robustness	PARTIAL	Security-driven robustness research
C4	Output Safety	STRONG EVIDENCE	Output-conformance safety templates and egress controls
D1	Serving	PARTIAL	Phase 1 application/platform serving experience across Mazzaneh modules
D2	Inference Optimization	STRONG EVIDENCE	Patent-grade candidate inference frameworks; benchmark validation pending
D3	Monitoring	STRONG EVIDENCE	Monitoring architecture / GPU Sentinel route; implementation or pilot validation pending
D4	Deployment	PARTIAL	Phase 1 application/platform deployment experience across Mazzaneh modules
E1	Data Governance	PARTIAL	Consent-first data governance by design
E2	Security	STRONG EVIDENCE	Multi-tier security architecture documented; adversarial validation pending
E3	Privacy	PARTIAL	Consent-first privacy posture; Phase 3 privacy/compliance review required
E4	Compliance	PARTIAL	EUIPO guidance · context + separate patent-grade candidate record

The Atlas

Slot-by-slot reference anatomy

Each slot below opens with MZN's provisional position, then a reference industry view (definition, state of the art, decisions, trade-offs, numbers, open questions, frontier analyst position, examples, references), and concludes with an expandable 529-item sub-endpoint anatomy.

Data

66 sub-endpoints mapped

MZN Provisional Position · Strong Evidence

Phase 1 consent-first product data evidence; LLM-readiness pending review

Phase 1 Mazzaneh operated a multi-module product platform with consent-first behavioral and product signals across 168K+ users. These signals may be relevant to future data strategy, but LLM training-readiness requires Phase 3 legal, consent, privacy, data-governance, and technical review. Multilingual context includes Persian and Arabic-script depth.

Phase context: A1 draws partly on Phase 1 Mazzaneh product-data evidence. It should not be read as proof of a frontier-scale LLM training corpus without Phase 3 consent, privacy, governance, and technical review.

Definition

The pre-training corpus is the model's universe of evidence: every text, code file, image-caption pair, and audio transcript that establishes what the model considers possible. Corpus engineering — acquisition, extraction, filtering, deduplication, mixing — sets the absolute capability ceiling. Post-training can refine and align, but cannot exceed what is latent in the data.

State of the Art (2025–2026)

Frontier models (2025-2026) train on 10-30 trillion text tokens plus billions of multimodal pairs. A leading open-weights flagship model used 15T tokens. An open-weights frontier model (V3 class) used 14.8T tokens. A current-generation frontier model is estimated to be of a similar order. Frontier shift: from raw web scale to curated quality (FineWeb-Edu, DCLM-baseline). Multimodal data is natively integrated from pre-training (a multimodal frontier model, a frontier multimodal model, a long-context frontier model).

Key Decisions

Corpus size (tokens)
Source mixture (web/code/math/books/multimodal)
Quality vs. quantity trade-off
Multilingual ratio
Recency vs. archival
Deduplication aggressiveness
License/safety filtering severity

Trade-offs

More tokens → diminishing returns past Chinchilla-optimal
Higher quality filtering → smaller corpus but better downstream performance
Multilingual breadth → English depth slightly lower per token
Web-heavy → broader knowledge but lower factual accuracy

Numbers & Ablations

Chinchilla-optimal: ~20 tokens per parameter (Hoffmann 2022). A leading open-weights flagship model used 37 tokens/param, substantially over-trained relative to that optimum; the choice is deliberate and trades extra training compute for lower inference cost.
Quality filtering: FineWeb-Edu retains ~3% of CC after model-based filtering and matches 5× larger raw corpora.
Multilingual cost: each non-English language added consumes ~2-5% of effective English capacity at fixed parameter count (Conneau et al., 2019; Pfeiffer et al., 2022).
Synthetic data ceiling: a synthetic-heavy small frontier model demonstrated ~70B-class performance at 3.8B parameters with synthetic-heavy training; the benefit collapses past a ~50% synthetic mix (model collapse, Shumailov et al. 2024).
Code corpus contribution to general reasoning: ~3-7% MMLU gain attributable to code in pre-training (Aryabumi et al., 2024).
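
The tokens-per-parameter figures above are plain ratios, so a worked calculation may help. This is a minimal sketch that assumes the 15T-token and 37 tokens/param figures quoted in this section; the implied parameter count is derived from those two numbers rather than stated in the text, and the function names are illustrative.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # ~20 tokens/param (Hoffmann 2022), as quoted above


def chinchilla_optimal_tokens(n_params: float) -> float:
    """Compute-optimal token budget under the ~20 tokens/param rule of thumb."""
    return CHINCHILLA_TOKENS_PER_PARAM * n_params


def overtraining_factor(tokens: float, n_params: float) -> float:
    """How many times the Chinchilla-optimal budget a training run consumed."""
    return tokens / chinchilla_optimal_tokens(n_params)


tokens = 15e12                 # 15T tokens, from the text
implied_params = tokens / 37   # derived from the quoted 37 tokens/param (assumption)
print(f"implied parameters: {implied_params / 1e9:.0f}B")
print(f"over-training factor: {overtraining_factor(tokens, implied_params):.2f}x")
```

Under these assumptions the run consumes a little under twice the compute-optimal token budget for its size.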

Open Questions

What is the actual scaling law for synthetic data quality vs. quantity? A synthetic-heavy small frontier model provided an existence proof but not the optimal mix.
Is there a multilingual scaling law analogous to Chinchilla? Adding 100 languages vs. 10 with same compute — no published frontier-scale result.
How much of frontier model capability comes from data quality vs. quantity vs. mix curriculum? A leading open-weights model paper hints that curriculum matters; there is no isolated study at scale.
Copyright-clean training: can a frontier model be trained on only permissively-licensed data without significant capability loss? No public attempt at frontier scale.

Reference Analyst Note

Quality > quantity is now consensus, but the field has overcorrected — most labs underweight diversity in chase of curated quality. The 'textbook quality' direction (a synthetic-heavy small frontier model) is a local optimum, not a global one. Frontier 2026-2027 will rebalance toward curated-but-diverse, with synthetic data filling specific holes (math reasoning, agent traces) not as bulk replacement.

Examples

A leading open-weights model: 15T tokens, 5% non-English, code 17% · an open-weights frontier model (V3 class): 14.8T tokens, multilingual focus · FineWeb-Edu: 1.3T high-quality educational tokens (open) · RedPajama-V2: 30T tokens (open) · DCLM-baseline: 4T tokens with model-based filtering

References (Academic)

Hoffmann et al., Chinchilla (2022) · Penedo et al., FineWeb (2024) · Li et al., DataComp-LM (2024) · Soldaini et al., Dolma (2024)

Sub-endpoint anatomy — 66 items mapped

A1.1 Source Registry

Web crawl is the dominant corpus source. Common Crawl provides ~250B web pages across 100+ snapshots since 2008. Raw CC is heavily redundant, multilingual-imbalanced, and contains substantial low-quality content. Modern pipelines extract WARC files, run language identification, perform URL/document/paragraph-level deduplication, and apply quality classifiers. SOTA: FineWeb-Edu (an open-model hub, 2024) demonstrated quality classification using a leading open-weights model (70B class) as labeler, with the labels distilled into a small classifier. Its 1.3T retained tokens match the performance of larger raw corpora. DCLM-baseline (a confidential-computing frontier lab/UW, 2024) used a similar approach, with model-based filtering producing 4T high-quality tokens. e.g. Common Crawl: 250B+ pages raw · RefinedWeb (an open-weights model team, 2023): 600B tokens deduplicated · C4 (a multimodal frontier lab, 2019): ~750 GB of cleaned text, from the era of an encoder-decoder model

A1.1.1 Web crawl sources

Common Crawl snapshots and derivatives. The bulk of pre-training corpora. Industry standard: Multiple Common Crawl snapshots, deduplicated and filtered. A leading open-weights model used 15T tokens primarily from CC.

A1.1.1.1 Common Crawl snapshot selection Which CC dumps to include — recent, historical, or both. Industry standard: Multiple snapshots (e.g. 95+ in FineWeb), spanning years to capture historical text and reduce recency bias.
A1.1.1.2 WARC vs WET extraction WARC contains raw HTML; WET contains pre-extracted text. Extraction-from-WARC yields better text but costs 10-100x compute. Industry standard: Frontier labs increasingly re-extract from WARC. RefinedWeb and Dolma use trafilatura/resiliparse on WARC.
A1.1.1.3 Crawl coverage gaps Languages, domains, and content types under-represented in CC. Industry standard: Supplement CC with targeted crawls for low-resource languages, code (GitHub), academic (ArXiv, PubMed), books.

A1.1.2 Curated text corpora

High-quality non-web sources: Wikipedia, books, academic papers, news. Industry standard: A 2023-generation open-weights model/3 disclose Wikipedia, ArXiv, and books in the mixture; a leading frontier lab and a constitutional-methods frontier lab are opaque on specific corpora.

A1.1.2.1 Wikipedia corpus Wikipedia dumps in multiple languages. Industry standard: Multiple language editions; extracted via WikiExtractor or mwparserfromhell to plain text.
A1.1.2.2 Book corpora Books3 (deprecated), Project Gutenberg, licensed publisher feeds, scanned books. Industry standard: Mixed sourcing. Some labs license; some use Books3-derivatives (legally contested).
A1.1.2.3 Academic corpora ArXiv, PubMed, S2ORC, ACL Anthology, etc. Industry standard: Heavy use; ArXiv especially common. Math/code-heavy LaTeX requires special pre-processing.

A1.1.3 Code corpora

GitHub, StackOverflow, code-specific datasets like The Stack. Industry standard: The Stack (BigCode, 2023) is the public reference. Frontier labs use proprietary code crawls.

A1.1.3.1 License-aware code filtering Excluding code under restrictive licenses (GPL, AGPL) from training. Industry standard: The Stack v1.2 includes only permissive licenses (MIT, Apache, BSD); exclusion of restrictive licenses standard at frontier labs.
A1.1.3.2 Repository quality signals Star count, fork count, file size, language detection accuracy. Industry standard: Filter by star count threshold, exclude minified/auto-generated code, detect language by linguist library.

A1.1.4 License posture catalog

License classification per source: public domain, permissive (CC-BY, MIT), restrictive (CC-BY-NC, GPL), proprietary (licensed), unclear. Industry standard: Maintain explicit license registry. Frontier labs increasingly licensed-only or licensed-preferred to reduce litigation surface.

A1.1.5 Source provenance hashing

SHA-256 (or similar) of every source artifact to enable later provenance queries. Industry standard: Standard practice at frontier labs; less common at smaller labs.
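
As a rough illustration of source provenance hashing, the sketch below streams an artifact through SHA-256 using only the Python standard library. The registry record layout and the source identifier are hypothetical and are not taken from any lab's schema.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in fixed-size chunks so large artifacts never sit fully in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def registry_entry(source_id: str, stream) -> dict:
    """Hypothetical registry record; the field names are illustrative only."""
    return {"source_id": source_id, "sha256": sha256_of_stream(stream)}


# In practice the stream would be an opened crawl segment or dump file.
entry = registry_entry("example-snapshot/segment-0001", io.BytesIO(b"hello corpus"))
print(entry)
```

Storing the digest next to the source identifier lets a later provenance query confirm that an artifact on disk is byte-identical to the one that was ingested.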

A1.2 Cleaning Pipeline

Code corpus provides programming language coverage essential for code generation, agent tools, and reasoning capability. Code data also improves general reasoning (correlations observed in multiple model families). Sources include GitHub public repos, Stack Exchange, programming Q&A, technical documentation. SOTA: The Stack v2 (BigCode, 2024): 3T+ tokens permissively-licensed, 600+ languages. License filtering excludes GPL/restrictive. Repository-level context (not file-level chunks) increasingly used for long-context code understanding. StarCoder 2 trained on this corpus. e.g. The Stack v2 (BigCode, 2024) · GitHub raw (petabyte-scale) · RedPajama Code: 50B tokens

A1.2.1 Boilerplate removal

Stripping navigation menus, footers, ads, cookie banners, repeated page templates. Industry standard: Trafilatura, jusText, or custom rule-based extraction. Required for WARC-based pipelines.

A1.2.1.1 Boilerplate detection method Algorithm choice: rule-based, ML-based, or hybrid. Industry standard: Hybrid: HTML-rule-based extraction (trafilatura) + density-based heuristics.
A1.2.1.2 Cross-template repetition handling Same boilerplate appearing across millions of pages (e.g. WordPress footer). Industry standard: Detected via document-level n-gram repetition; pages with high boilerplate ratio dropped.

A1.2.2 Encoding normalization

Converting all text to UTF-8, fixing Mojibake, handling BOMs. Industry standard: Standard normalization to UTF-8; ftfy library for Mojibake repair.
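
The normalization step can be sketched with the standard library alone. This example stands in for a dedicated repair library such as ftfy: it only handles the single most common Mojibake pattern (UTF-8 bytes mis-decoded as Latin-1), and the function names are illustrative assumptions.

```python
import unicodedata


def repair_mojibake(text: str) -> str:
    """Undo one round of UTF-8 bytes mis-decoded as Latin-1; return the input unchanged if that fails."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def normalize_text(raw: bytes) -> str:
    """Decode as UTF-8 (dropping a BOM if present), repair simple Mojibake, apply NFC."""
    text = raw.decode("utf-8-sig", errors="replace")
    text = repair_mojibake(text)
    return unicodedata.normalize("NFC", text)


garbled = b"\xef\xbb\xbf" + "caf\u00c3\u00a9".encode("utf-8")  # BOM + mis-decoded "café"
print(normalize_text(garbled))
```

Text that is already correct falls through the repair step untouched, because re-encoding it as Latin-1 either fails or produces invalid UTF-8.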

A1.2.3 Language identification

Per-document language tagging. Industry standard: fastText language ID (Joulin et al.) is the dominant choice; cld3 also used.

A1.2.3.1 Confidence threshold for language ID Probability cutoff below which a document is rejected or flagged as mixed-language. Industry standard: Typically 0.65 for fastText. Higher for low-resource languages to reduce false positives.
A1.2.3.2 Multi-language documents Documents containing significant amounts of two or more languages. Industry standard: Either split by paragraph or assign primary language. No consensus on best handling.

A1.2.4 NSFW / unsafe content filtering

Pre-training removal of explicit content, gore, harmful material. Industry standard: URL blocklists + keyword filters + classifier-based (DSIR-style). A 2023-generation open-weights model paper documents this approach.

A1.2.5 Length filtering

Removing documents below minimum or above maximum length. Industry standard: Common thresholds: drop documents <50 tokens or <200 characters; very long documents are capped or split during A4 packing.

A1.3 Deduplication

Mathematical and scientific corpus provides reasoning depth. Sources: arXiv (~2M papers), PubMed, scientific textbooks, Math StackExchange, OpenWebMath. Math performance benchmarks (GSM8K, MATH, AIME) correlate strongly with pre-training math token volume and quality. SOTA: OpenWebMath (Paster 2023): 14.7B tokens of high-quality math web. DeepSeekMath corpus: 120B tokens. Synthetic math: procedurally generated problems with verified solutions. Math performance correlates strongly with pre-train math token volume — 100B+ tokens common at frontier. e.g. arXiv: 2M papers, ~50B tokens full-text · OpenWebMath: 14.7B tokens · DeepSeekMath corpus: 120B

A1.3.1 Exact-match deduplication

Hash-based detection of identical documents. Industry standard: URL-level + SHA-256-level + line-level.

A1.3.1.1 URL-level dedup Removing duplicate URLs across crawl snapshots. Industry standard: First pass; trivially cheap.
A1.3.1.2 Document-hash dedup SHA-256 of normalized document text. Industry standard: Standard. Catches identical content under different URLs.
A1.3.1.3 Line-level dedup Removing globally repeated lines across the corpus. Industry standard: Used selectively; aggressive line dedup damages legitimate quoted text.
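
The URL-level and document-hash passes above can be sketched as a single streaming filter. This is an illustration under simple assumptions: documents arrive as (url, text) pairs, and the normalization (lowercasing and whitespace collapsing) is one plausible choice, not a documented standard.

```python
import hashlib
import re


def normalize_for_hash(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()


def exact_dedup(docs):
    """Keep the first document seen for each URL and for each normalized-content hash."""
    seen_urls, seen_hashes, kept = set(), set(), []
    for url, text in docs:
        if url in seen_urls:          # URL-level pass
            continue
        seen_urls.add(url)
        digest = hashlib.sha256(normalize_for_hash(text).encode("utf-8")).hexdigest()
        if digest in seen_hashes:     # document-hash pass
            continue
        seen_hashes.add(digest)
        kept.append((url, text))
    return kept


docs = [
    ("a.example/1", "Hello  World"),
    ("a.example/1", "Different text, same URL"),
    ("b.example/2", "hello world\n"),
    ("c.example/3", "Something else"),
]
print([url for url, _ in exact_dedup(docs)])
```

Line-level deduplication is left out on purpose, since the text notes that it is applied selectively.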

A1.3.2 Near-duplicate detection (MinHash + LSH)

Probabilistic detection of documents with high Jaccard similarity. Industry standard: MinHash signatures + LSH banding. Standard parameters: 100-200 hashes, Jaccard threshold 0.8-0.85.
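
To make the MinHash idea concrete, here is a small self-contained sketch. It assumes word 5-gram shingles and uses seeded BLAKE2b hashes in place of random permutations; production pipelines typically shingle tokenizer output and use faster hash families, so treat the helper names and parameters as illustrative.

```python
import hashlib


def shingles(text: str, n: int = 5) -> set:
    """Word n-gram shingles; documents shorter than n words yield a single shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _seeded_hash(seed: int, shingle: str) -> int:
    """64-bit hash keyed by `seed`, standing in for one random permutation."""
    h = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8,
                        salt=seed.to_bytes(8, "little"))
    return int.from_bytes(h.digest(), "little")


def minhash_signature(text: str, num_hashes: int = 128, n: int = 5) -> list:
    """For each seeded hash, keep the minimum value over the document's shingles."""
    sh = shingles(text, n)
    return [min(_seeded_hash(seed, s) for s in sh) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank today"
doc_b = "completely unrelated words appear in this other sentence about compilers and parsers"
print(estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_a)))
print(estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b)))
```

More hash functions tighten the estimate at the cost of more compute per document, which is the trade-off the 100-200 range reflects.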

A1.3.2.1 Shingling (n-gram) parameter Token n-gram size used to construct MinHash input set. Industry standard: 5-gram shingles common; some pipelines use 7-gram or word-13-grams.
A1.3.2.2 MinHash signature length Number of hash functions used to build the MinHash signature. Industry standard: 100-200 hashes. Tradeoff: more hashes = higher precision, more compute.
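A1.3.2.3 LSH banding Locality-sensitive hashing parameters: number of bands × rows-per-band. Industry standard: Tuned to the target Jaccard threshold. e.g. 20 bands × 9 rows (180 hashes) flags a pair at Jaccard 0.8 with probability ~0.94; the S-curve midpoint estimate (1/b)^(1/r) for this setting is ~0.72.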
A1.3.2.4 Jaccard threshold Minimum Jaccard similarity for two documents to be considered near-duplicates. Industry standard: 0.8 (Lee et al. 2022 reference); some pipelines use 0.85 or 0.7 depending on tolerance.
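
The relationship between banding parameters and the effective threshold follows the standard LSH S-curve: with b bands of r rows, a pair with Jaccard similarity s becomes a candidate with probability 1 - (1 - s^r)^b. The sketch below evaluates that curve for the 20 × 9 configuration mentioned above; the function names are illustrative.

```python
def lsh_candidate_probability(jaccard: float, bands: int, rows: int) -> float:
    """Probability that two documents share at least one identical band."""
    return 1.0 - (1.0 - jaccard ** rows) ** bands


def lsh_threshold(bands: int, rows: int) -> float:
    """Common approximation of where the S-curve rises most steeply: (1/b)^(1/r)."""
    return (1.0 / bands) ** (1.0 / rows)


bands, rows = 20, 9  # example configuration from A1.3.2.3 (180 hashes in total)
print(f"approximate threshold: {lsh_threshold(bands, rows):.3f}")
for s in (0.6, 0.7, 0.8, 0.9):
    p = lsh_candidate_probability(s, bands, rows)
    print(f"Jaccard {s:.1f} -> candidate probability {p:.3f}")
```

Raising the number of rows per band pushes the curve to the right (fewer false positives), while adding bands pushes it to the left (fewer missed duplicates).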

A1.3.3 Semantic deduplication

Embedding-based detection of semantically duplicate content. Industry standard: Emerging; SemDeDup (Abbas 2023) is the public reference. Frontier labs may use proprietary methods.

A1.3.3.1 Embedding model choice Which encoder (E5, BGE, GTE, a leading frontier lab ada) generates the document embeddings. Industry standard: Open-source encoders (E5, BGE) for reproducibility; proprietary at large labs.
A1.3.3.2 Cosine similarity threshold Cutoff for considering two embeddings as semantic duplicates. Industry standard: 0.95+ for near-duplicate semantic level; below this, content variation expected.
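
A minimal sketch of threshold-based semantic deduplication follows. It assumes embeddings have already been computed by some encoder and uses a greedy all-pairs comparison in pure Python; published methods such as SemDeDup cluster the embeddings first to avoid the quadratic cost, which this illustration omits.

```python
import math


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_dedup(embeddings, threshold: float = 0.95):
    """Greedy pass: keep an item only if it is below the threshold against every kept item.

    Returns the indices of kept items. Quadratic, so suitable for illustration only.
    """
    kept = []
    for idx, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[k]) < threshold for k in kept):
            kept.append(idx)
    return kept


toy_embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]  # items 0 and 1 are near-identical
print(semantic_dedup(toy_embeddings))
```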

A1.3.4 Cross-corpus deduplication

Dedup across heterogeneous sources (web vs books vs academic). Industry standard: Run dedup globally after corpus assembly, not per-source. Otherwise cross-source duplicates survive.

A1.4 Quality Filtering

Multimodal corpus pairs text with images, video, audio, and structured data. Native multimodal pre-training (vs. late fusion) enables cross-modal reasoning, image-to-text grounding, and video understanding. The shift from CLIP-style late fusion (separate encoders) to native multimodal (interleaved tokens) characterizes 2024+ frontier models. SOTA: Native multimodal pre-training (vs. late fusion) became dominant 2024+. A frontier multimodal model, a million-token-context frontier model/2.0, and a long-context frontier model are natively interleaved. LAION-5B (5.8B image-text pairs) is the open backbone. Image resolution is dynamic, up to 1024² for detail. Video: 1B+ video-text pairs at frontier. e.g. LAION-5B (5.8B image-text) · DataComp 12.8B (filtered) · WebVid 10M video-text

A1.4.1 Heuristic filters

Rule-based filters: line length, punctuation density, repetition rate, word distribution.

A1.4.1.1 Line-length distribution Mean line length, max line length, lines per document. Industry standard: Drop documents with unusual line distributions (very long lines = likely scraped tables; very short = navigation).
A1.4.1.2 Repetition detection Repeated lines, repeated paragraphs, repeated n-grams within a document. Industry standard: Drop documents where >X% of lines repeat. RefinedWeb uses thresholds in 0.2-0.3 range.
A1.4.1.3 Symbol-to-text ratio Ratio of non-alphanumeric characters to total characters. Industry standard: High symbol ratio → likely code, table, or noise. Filter or route to code-specific path.
A1.4.1.4 Stopword presence Documents lacking common stopwords are likely lists, tables, or non-natural text. Industry standard: Require minimum stopword density (typically >2%) for general-text classification.
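
Three of the heuristic filters above can be sketched in a few lines. The repetition and stopword cutoffs follow the ranges quoted in the leaves (0.3 and 2%); the symbol-ratio cutoff of 0.3 and the tiny English stopword list are assumptions made for illustration, and whitespace is not counted as a symbol here.

```python
from collections import Counter

STOPWORDS = {"the", "of", "and", "to", "a", "in", "is", "that", "it", "for"}  # illustrative subset


def repeated_line_fraction(text: str) -> float:
    """Share of non-empty lines whose exact content occurs more than once."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    return sum(c for c in counts.values() if c > 1) / len(lines)


def symbol_ratio(text: str) -> float:
    """Non-alphanumeric, non-whitespace characters divided by total characters."""
    if not text:
        return 0.0
    return sum(1 for c in text if not c.isalnum() and not c.isspace()) / len(text)


def stopword_density(text: str) -> float:
    """Share of whitespace-separated words that are in the stopword list."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in STOPWORDS for w in words) / len(words)


def passes_heuristics(text, max_repeat=0.3, max_symbol=0.3, min_stopword=0.02) -> bool:
    return (repeated_line_fraction(text) <= max_repeat
            and symbol_ratio(text) <= max_symbol
            and stopword_density(text) >= min_stopword)


prose = "The cat sat on the mat.\nIt is a fine day for a walk in the park."
navigation = "Home\nHome\nHome\nAbout"
print(passes_heuristics(prose), passes_heuristics(navigation))
```

Documents that fail the symbol-ratio check could be routed to a code-specific path instead of being dropped, as the leaf above suggests.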

A1.4.2 Perplexity filtering

Use a smaller reference language model to score documents; drop high-perplexity (likely noise) and very-low-perplexity (likely repetition). Industry standard: Used in CCNet (Wenzek 2020), some leading open-weights model-family pipelines. KenLM 5-gram on Wikipedia common as reference.
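
The two-sided perplexity cut can be illustrated with a toy model. The sketch below swaps the KenLM 5-gram reference model for an add-one-smoothed unigram model so that it stays self-contained; the class, the reference text, and the band limits are all illustrative assumptions.

```python
import math
from collections import Counter


class UnigramLM:
    """Add-one-smoothed unigram model, standing in for a KenLM 5-gram reference model."""

    def __init__(self, reference_text: str):
        self.counts = Counter(reference_text.lower().split())
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # one extra slot for unseen words

    def perplexity(self, text: str) -> float:
        words = text.lower().split()
        if not words:
            return float("inf")
        log_prob = sum(
            math.log((self.counts[w] + 1) / (self.total + self.vocab)) for w in words
        )
        return math.exp(-log_prob / len(words))


def keep_by_perplexity(lm: UnigramLM, text: str, low: float, high: float) -> bool:
    """Keep documents whose perplexity lies inside the [low, high] band."""
    return low <= lm.perplexity(text) <= high


lm = UnigramLM("the cat sat on the mat the dog sat on the rug")
in_domain = lm.perplexity("the cat sat")
noise = lm.perplexity("zxqv blorp quux")
print(in_domain < noise)
```

The lower bound of the band is what removes highly repetitive text, which scores as suspiciously predictable under the reference model.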

A1.4.3 Classifier-based filtering

Train a binary quality classifier on a curated 'good' reference set; score every document. Industry standard: Standard at frontier labs. FineWeb-Edu (Penedo 2024) and DataComp-LM (Li 2024) are the public references.

A1.4.3.1 Reference set construction What counts as 'good' for training the classifier. Industry standard: Typically Wikipedia, books, academic papers, or LLM-judged 'educational' web pages (FineWeb-Edu).
A1.4.3.2 Classifier architecture fastText, an encoder-only model-based classifier, or LLM-as-judge. Industry standard: fastText for scale (RefinedWeb), a small encoder-only model or distilled models for higher quality.

A1.4.4 Pruning by influence

Removing documents that hurt downstream loss (Marion 2023, DSIR). Industry standard: Emerging research direction. Not yet standard frontier practice.

A1.5 Domain Mixing & Weighting

Deduplication removes near-duplicate content that would otherwise dominate training and reduce effective coverage. Three levels: URL-level (exact), document-level (MinHash/SimHash), and substring-level (suffix array). Aggressive dedup typically removes 30-70% of raw corpus and improves downstream performance. SOTA: Standard pipeline: URL dedup → MinHash near-dup (Jaccard 0.8) → optional substring dedup. Typical removal 30-70% of raw corpus. SemDeDup (Abbas 2023) uses embedding similarity for semantic dedup, removing 50% of LAION with no perf loss. Frontier: also benchmark contamination filtering. e.g. A leading open-weights model: MinHash + URL dedup · an earlier frontier model: aggressive dedup described in paper · FineWeb: full pipeline including SemDeDup variants

A1.5.1 Fixed mixture

Proportions chosen by the data team based on intuition and small-scale experiments. Industry standard: A 2023-generation open-weights model mixture (web 67%, code 4.5%, etc.) is the publicly documented reference. Adjustments per generation.

A1.5.2 Learned mixture (DoReMi)

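Use a small proxy model to optimize mixture weights before training the full model. Industry standard: DoReMi (Xie et al. 2023) tunes domain weights by minimizing worst-case excess loss across domains, without downstream-task labels, and reports reaching baseline accuracy in 2.6× fewer training steps. Adopted increasingly at frontier labs.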

A1.5.3 Curriculum (data ordering)

Order in which data is presented during training. Curriculum learning vs random shuffling. Industry standard: Random shuffling dominant at scale. Curriculum used in some final-phase fine-tuning.

A1.5.4 Up-sampling rare domains

Repeating low-frequency-but-important domains (e.g. math, code, low-resource languages). Industry standard: Standard. Up-sample rates of 2-4× common for math, code, multilingual.
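
The effect of up-sampling on the final mixture is easy to compute. The sketch below takes raw token counts per domain and repetition factors, then returns each domain's share of the training stream; the domain names and counts are made up for illustration.

```python
def effective_mixture(token_counts: dict, upsample: dict) -> dict:
    """Return each domain's share of the training stream after per-domain repetition."""
    weighted = {d: n * upsample.get(d, 1.0) for d, n in token_counts.items()}
    total = sum(weighted.values())
    return {d: w / total for d, w in weighted.items()}


raw = {"web": 900, "math": 50, "code": 50}   # hypothetical token counts (billions)
rates = {"math": 4.0, "code": 2.0}           # repetition factors within the 2-4x range above
mix = effective_mixture(raw, rates)
for domain, share in mix.items():
    print(f"{domain}: {share:.1%}")
```

Here math rises from 5% of raw tokens to about one sixth of the training stream, at the cost of each math token being seen four times.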

A1.6 Contamination Control

Quality filtering separates valuable content from noise. Approaches: heuristic (perplexity, length, repetition, language confidence), classifier-based (small model predicts 'high-quality'), or model-as-judge (large model labels samples for distillation). Quality filtering typically retains 10-30% of raw web content. SOTA: Model-based filtering dominates 2024+. FineWeb-Edu used a leading open-weights model (70B class) → distilled small classifier. DCLM-baseline similar. Heuristic filters (perplexity, length) coarse but fast. Trade-off: classifier inference at corpus scale costs $100K-$1M, small relative to pre-training. e.g. FineWeb-Edu: a leading open-weights model (70B class) labels → distilled · DCLM-baseline: classifier-filtered · C4: heuristic only

A1.6.1 N-gram overlap detection

Match against benchmark text using n-gram overlap. Industry standard: A 2023-generation open-weights model used 8-gram overlap; a leading open-weights model used 13-gram. Token-level matching (after the tokenizer) is more common than character-level.
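
N-gram overlap detection reduces to set intersection. This sketch assumes whitespace tokens in place of real tokenizer output and flags a document as soon as a single 13-gram matches the benchmark index; actual pipelines often apply a ratio or count threshold instead, and the function names are illustrative.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts, n=13):
    """Union of the n-grams of every benchmark item."""
    index = set()
    for text in benchmark_texts:
        index |= ngrams(text.lower().split(), n)
    return index


def is_contaminated(document: str, index: set, n: int = 13) -> bool:
    """True if the document shares at least one n-gram with the benchmark index."""
    return not ngrams(document.lower().split(), n).isdisjoint(index)


benchmark_item = " ".join(f"w{i}" for i in range(20))
index = build_benchmark_index([benchmark_item])
leaked = "intro text " + " ".join(f"w{i}" for i in range(3, 17)) + " outro"
clean = " ".join(f"x{i}" for i in range(30))
print(is_contaminated(leaked, index), is_contaminated(clean, index))
```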

A1.6.2 Benchmark suite registry

Maintained list of benchmarks to scrub against. Industry standard: MMLU, GSM8K, HumanEval, BIG-Bench, ARC, HellaSwag, etc. A 2023-generation open-weights model paper lists ~30 benchmarks.

A1.6.3 Per-benchmark contamination disclosure

Reporting per-benchmark contamination rate alongside results. Industry standard: Sainz 2023 argues current standard is insufficient; per-benchmark disclosure increasingly expected.

A1.7 Provenance Ledger

Mixing strategy determines the proportion of each data source in training. Optimal mix depends on target capabilities: heavier code → better code generation, heavier math → better reasoning, heavier multilingual → broader language coverage. Mix is often staged (early training: broad mix; late training: high-quality + domain-specific). SOTA: DoReMi (Xie 2023) optimizes mix via small reference models with proxy losses. A leading open-weights model used annealing — high-quality data weighted higher in final 40B tokens. Frontier typical: 50-70% web, 5-10% code, 2-5% math, multilingual proportional. Curriculum (early broad, late specialized) increasingly common. e.g. A leading open-weights model: 50% web, 25% math/reasoning, 17% code, 8% multilingual · a synthetic-heavy small frontier model: heavy synthetic + textbook quality · an open-weights model: web-heavy

A1.7.1 Shard-level metadata

Source identifier, retrieval timestamp, filter pipeline version, weight category for each shard. Industry standard: Standard at frontier labs. Format varies.

A1.7.2 Token-batch traceability

Ability to query 'which source produced this token batch'. Industry standard: Partial at most labs. Full token-level provenance is research-grade.

A1.7.3 Cryptographic anchoring

Merkle-tree or similar cryptographic commitment to corpus state. Industry standard: Not yet standard. Proposed for compliance use.

A1.8 Synthetic Data Integration

Synthetic data generation uses existing models to create training data: instruction-response pairs (FLAN, self-instruct methodology), reasoning traces (chain-of-thought), gap-filling for thin coverage (rare languages, niche domains), or constitutional / RLHF preference data. Synthetic data accelerates iteration and fills distributional holes. SOTA: A synthetic-heavy small frontier model (a synthetic-data-focused lab) was trained heavily on synthetic textbook-quality data: a 3.8B model with 70B-class capability. Cosmopedia (HF): 25B synthetic textbook tokens. RL-from-AI-Feedback (RLAIF) is used in Constitutional methods. A synthetic share past 30-50% of the mix risks distributional artifacts (Shumailov 2024, model collapse). e.g. a synthetic-heavy small frontier model series: synthetic-heavy · Cosmopedia: 25B synthetic textbook · OpenAssistant: crowd-sourced human-written conversation, a non-synthetic reference point

A1.8.1 Synthetic generation prompts

Prompt templates used to generate synthetic training data. Industry standard: Proprietary at frontier labs; small synthetic-heavy model family papers describe textbook-style prompts.

A1.8.2 Diversity and collapse controls

Techniques to prevent distributional collapse from over-reliance on model-generated data. Industry standard: Diversity sampling, multiple-teacher ensembles, periodic refresh from human data.

A1.8.3 Synthetic-vs-human ratio policy

Maximum proportion of synthetic data per training run. Industry standard: No public standard; varies by lab and stage. Pre-training tends low (<10%); SFT can be majority synthetic.

Tokenizer

68 sub-endpoints mapped

MZN Provisional Position · Strong Evidence

Multilingual tokenizer expertise with under-served-script operational experience

Direct production work in Persian (an Arabic-script language documented as systematically over-fragmented in byte-level BPE) gives concrete operational understanding of tokenizer fairness, joining behavior, diacritic normalization, and letter-form collisions. Documented exposure to the 2–4× token-cost gap for non-Latin scripts. Patent-documented tokenizer architecture work; specifics held in the proprietary portfolio.

Definition

The tokenizer maps raw text into discrete tokens — the model's vocabulary. Tokenizer choice is permanent: it defines vocabulary size, multilingual coverage, code handling, and context-window efficiency. A bad tokenizer wastes context (more tokens per character), degrades multilingual performance, and cannot be changed without retraining. Frontier tokenizers are byte-level BPE or SentencePiece with 100K-256K vocabulary.

State of the Art (2025–2026)

A current-generation frontier model tokenizer (cl100k_base, 100K vocab) and a leading open-weights model tokenizer (128K vocab, multilingual) are reference points. Byte-level BPE (open-source BPE tokenizer libraries, a foundational decoder-only model lineage) handles any UTF-8 input without unknown tokens. SentencePiece (open-weights models) supports both BPE and Unigram. Multimodal tokenizers add image tokens (256-1024 per image) and audio tokens.

Key Decisions

Vocabulary size (32K → 256K)
Algorithm (BPE vs. Unigram)
Byte-level fallback
Pre-tokenization regex (whitespace, digits)
Special tokens design
Multilingual balance

Trade-offs

Larger vocab → fewer tokens per text but larger embedding matrix (linear in vocab)
BPE → simple, deterministic; Unigram → probabilistic, slightly better for some languages
Pre-tokenization affects compositional generalization

Numbers & Ablations

Tokenizer compression efficiency: a leading open-weights model (128K vocab) compresses Persian text 4.2× better than a 2023-generation open-weights model (32K). Korean: 5.1×. Hindi: 3.8× (Petrov 2023 + community measurements).
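Vocabulary size cost: doubling the vocabulary from 128K to 256K adds 128K × hidden_dim parameters per embedding matrix: ~1B at 8K hidden_dim; at 12K hidden_dim, ~1.6B with tied embeddings or ~3B with untied input and output embeddings (frontier scale).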
Tied embeddings save ~50% of vocab parameter cost; standard at frontier dense, sometimes untied in MoE.
Encoding speed: a byte-level BPE tokenizer library ~1M tokens/sec/CPU; an open-model hub tokenizers (Rust) ~700K. Negligible relative to inference compute.
Glitch token incidence: ~0.01-0.1% of tokens in BPE vocabularies (untrained tail). Detected via embedding magnitude analysis.
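
The vocabulary-size cost above is a product of two dimensions, which a short calculation makes explicit. The sketch assumes a 128,000-entry base vocabulary (in line with the 128K figure quoted in this section) and hidden sizes of 8192 and 12288 as concrete stand-ins for "8K" and "12K"; the function names are illustrative.

```python
def embedding_params(vocab_size: int, hidden_dim: int, tied: bool = True) -> int:
    """Parameters in the token embedding, plus a separate output projection if untied."""
    matrices = 1 if tied else 2
    return matrices * vocab_size * hidden_dim


def doubling_cost(vocab_size: int, hidden_dim: int, tied: bool = True) -> int:
    """Extra parameters incurred by doubling the vocabulary at a fixed hidden size."""
    return (embedding_params(2 * vocab_size, hidden_dim, tied)
            - embedding_params(vocab_size, hidden_dim, tied))


base_vocab = 128_000  # assumption: the 128K vocabulary referenced in the text
for hidden, tied in ((8192, True), (12288, True), (12288, False)):
    extra = doubling_cost(base_vocab, hidden, tied)
    label = "tied" if tied else "untied"
    print(f"hidden={hidden} ({label}): +{extra / 1e9:.2f}B params")
```

The cost scales with the base vocabulary, so a doubling from 32K to 64K is four times cheaper than one from 128K to 256K at the same hidden size.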

Open Questions

Tokenizer-free architectures (MambaByte, MEGABYTE): why have they not matched BPE at frontier scale despite theoretical advantages? Compute hypothesis vs. fundamental limit unclear.
Is there an optimal vocab size for a given (model size, data mix, target language portfolio)? Current choices are heuristic.
Cross-lingual transfer in shared-vocab tokenizers: how much capability is shared vs. language-isolated? Limited mechanistic understanding.
Multimodal tokenization: image tokens at 256 vs. 1024 per image — what is the actual quality-vs-cost frontier? AnyRes (LLaVA-NeXT) provides one data point, not a curve.

Reference Analyst Note

Tokenizer choice is a permanent commitment that constrains everything downstream. Frontier labs underinvest here — most use SentencePiece-defaults trained on subset of data. The next frontier capability gain may come from rethinking tokenization (entropy-aware dynamic tokenization, byte-level with efficient training). Anyone aiming for genuine multilingual frontier should treat tokenizer as a first-class capability investment.

Examples

a current-generation frontier model: cl100k_base, 100K vocab, byte-level BPE · a leading open-weights model: 128K vocab, multilingual SentencePiece BPE · a leading frontier model: ~65K vocab · a multimodal frontier model: tokenizer designed for multimodal

References (Academic)

Sennrich et al., Subword Units / BPE (2015) · Kudo & Richardson, SentencePiece (2018) · Petrov et al., Language Model Tokenizers Introduce Unfairness Between Languages (2023)

Sub-endpoint anatomy — 68 items mapped

A2.1 Tokenization Algorithm Family

BPE (Byte-Pair Encoding) iteratively merges the most frequent adjacent symbol pairs. Variants: word-level BPE (a foundational decoder-only model, an earlier frontier model era), byte-level BPE (operates on UTF-8 bytes, never produces unknown tokens), and SentencePiece-BPE (whitespace-aware). Byte-level BPE is the dominant choice for production LLMs because it handles any input. SOTA: Byte-level BPE with a carefully designed pre-tokenization regex (handling whitespace, digits, contractions) is standard. The regex in leading open-weights tokenizers splits digits into individual tokens for better arithmetic. Modern implementations (open-source BPE tokenizer libraries, an open-model hub tokenizers) achieve sub-millisecond encoding for thousands of tokens. e.g. multiple frontier model generations: byte-level BPE · leading open-weights model: SentencePiece BPE with byte fallback · an open-weights frontier lab: same lineage as leading open-weights model

A2.1.1 Word-level tokenization

Each whitespace-separated unit is a token. Industry standard: Effectively obsolete for LLMs. OOV explosion makes it incompatible with web-scale training.

A2.1.2 Character-level tokenization

Each character is a token. Industry standard: Used in CharFormer, character CNNs. Not used by frontier general-purpose LLMs because attention cost dominates.

A2.1.3 Byte-level (raw)

Each UTF-8 byte is a token. Vocabulary fixed at 256. Industry standard: Used as base layer of Byte-Level BPE. Pure byte-level only in tokenizer-free models like a byte-level encoder-decoder model.
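
A minimal sketch of raw byte-level tokenization, using only Python's built-in UTF-8 codec. The function names are illustrative, not taken from any tokenizer library. It shows that token ids never leave the 0..255 range, that decoding is lossless, and that a Persian word costs two byte-tokens per letter.

```python
def byte_encode(text: str) -> list[int]:
    """Map text to its UTF-8 byte values; every id is in 0..255."""
    return list(text.encode("utf-8"))


def byte_decode(ids: list[int]) -> str:
    """Inverse of byte_encode; lossless for any valid id sequence."""
    return bytes(ids).decode("utf-8")


english = "hello"
persian = "\u0633\u0644\u0627\u0645"  # "salam": 4 letters, 2 UTF-8 bytes each

en_ids = byte_encode(english)
fa_ids = byte_encode(persian)

print(len(english), len(en_ids))  # 5 5
print(len(persian), len(fa_ids))  # 4 8
```

The doubled sequence length for the Persian word is the raw starting point that byte-level BPE merges then have to win back.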

A2.1.4 BPE family (Byte Pair Encoding)

Iteratively merge most-frequent adjacent token pair until target vocabulary size reached. Industry standard: Dominant family for frontier LLMs. Byte-Level BPE (a leading frontier model family) is most common.
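
The merge loop described above can be sketched in a few lines. This is a toy character-level trainer over a word-frequency table, not the implementation of any production library. The tie-breaking rule (highest count, then lexicographically largest pair) and the function name are assumptions made for determinism.

```python
from collections import Counter


def train_bpe(word_counts: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to num_merges BPE merges from a word -> frequency table."""
    # Each word starts as a tuple of single-character symbols.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, count in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        merged_corpus = {}
        for symbols, count in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            merged_corpus[key] = merged_corpus.get(key, 0) + count
        corpus = merged_corpus
    return merges


toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
learned = train_bpe(toy_counts, num_merges=2)
print(learned)
```

In a real pipeline the loop stops when the target vocabulary size is reached; here num_merges plays that role.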

+ deeper detail (3 leaves)

A2.1.4.1 Standard BPE (character-based) Original BPE, applied over Unicode characters. Industry standard: Mostly superseded by byte-level BPE for general LLMs. Some MT systems still use.
A2.1.4.2 Byte-Level BPE BPE applied over the 256-byte alphabet rather than characters. Guarantees no OOV at byte level. Industry standard: multiple frontier model generations, a leading open-weights model (a byte-level BPE tokenizer library-style). Considered dominant frontier choice for English-heavy + code workloads.
A2.1.4.3 SentencePiece-BPE BPE implementation in SentencePiece library, raw text input without pre-tokenization. Industry standard: Common in multilingual models (a multilingual encoder-decoder model, a multilingual translation model). Treats whitespace as a regular character.

A2.1.5 Unigram LM (Kudo)

Probabilistic subword model; iteratively prune low-probability subwords from initial large vocabulary. Industry standard: an encoder-decoder model and a multilingual encoder-decoder model use SentencePiece-Unigram; an early open-weights model/2 uses SentencePiece in BPE mode instead. Theoretically allows multiple segmentations (subword regularization).

+ deeper detail (2 leaves)

A2.1.5.1 Standard Unigram LM training EM-based pruning from initial seed vocabulary.
A2.1.5.2 Subword regularization (sampling) Train-time sampling of alternative segmentations to improve robustness. Industry standard: Used in some MT models. Less common in modern LLMs.
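
At inference time a Unigram model picks the segmentation with the highest total log-probability. A minimal Viterbi sketch of that step follows. The vocabulary and its log-probabilities are made up for illustration, and EM training and pruning are not shown.

```python
import math


def viterbi_segment(text: str, logp: dict[str, float]) -> list[str]:
    """Return the highest-scoring split of text into pieces found in logp."""
    # best[i] = (score, start index of the last piece) for the prefix text[:i]
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logp and best[start][0] > -math.inf:
                score = best[start][0] + logp[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[-1][0] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the string.
    pieces, end = [], len(text)
    while end > 0:
        start = best[end][1]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


# Hypothetical piece log-probabilities (not from any trained model).
toy_logp = {"a": -3.0, "b": -3.0, "c": -3.0, "ab": -4.0, "bc": -3.5, "abc": -5.0}
print(viterbi_segment("abcab", toy_logp))
```

Subword regularization replaces this argmax with sampling from the same scored segmentations during training.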

A2.1.6 WordPiece

Subword model in the style of an encoder-only model; greedy longest-match segmentation. Industry standard: Used in an encoder-only model family. Less common in modern decoder-only LLMs.
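
Greedy longest-match segmentation can be sketched as below. The "##" continuation prefix and the "[UNK]" fallback follow the convention common in WordPiece implementations and are assumptions here. The vocabulary is a toy example.

```python
def wordpiece_segment(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Split one word by repeatedly taking the longest prefix found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"un", "##aff", "##able", "##a", "##ff"}
print(wordpiece_segment("unaffable", toy_vocab))  # ['un', '##aff', '##able']
```

Unlike BPE, segmentation here needs only the final vocabulary, not an ordered merge list.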

A2.1.7 Tokenizer-free

Operate directly on bytes or characters without learned vocabulary. Industry standard: a byte-level encoder-decoder model, CANINE, MEGABYTE. Research direction; computational cost limits frontier deployment.

A2.2 Vocabulary Design

Vocabulary size is a primary hyperparameter. Larger vocab → fewer tokens per text → longer effective context → faster inference per character. But also: larger embedding matrix (vocab × hidden_dim parameters) and softmax cost over vocab. Frontier 2024-2026 trend: 128K-256K vocab. SOTA: A leading open-weights model expanded from 32K (a 2023-generation open-weights model) to 128K, citing multilingual coverage. A multimodal frontier model, a frontier multimodal model use larger vocabs (~200K+ estimated). Trade-off: vocab × hidden_dim adds parameters: 128K × 8192 = 1B parameters in embedding alone for a large model. This is offset by fewer tokens and tied input/output embeddings. e.g. a 2023-generation open-weights model: 32K · a leading open-weights model: 128K · a frontier multimodal model: ~200K (estimated)

A2.2.1 Vocab size selection

Total number of tokens in vocabulary. Industry standard: 32K (an early open-weights model/2), 50K (a foundational decoder-only model/3), 100K-128K (a current-generation frontier model, a leading open-weights model, a long-context frontier model estimated). Trend toward larger vocabularies for multilingual + code coverage.

+ deeper detail (2 leaves)

A2.2.1.1 Compute trade-off Larger vocab = larger embedding matrix, larger output projection, more compute per step. Industry standard: Embedding cost scales linearly with vocab size; for very large models, vocab cost is small fraction of total compute.
A2.2.1.2 Coverage vs sparsity Larger vocab = better per-language coverage but rarer tokens. Industry standard: Sweet spot empirical. A leading open-weights model increased vocab from 32K to 128K specifically for multilingual + code.
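
The embedding arithmetic behind the compute trade-off can be checked directly. This sketch assumes "32K" and "128K" mean 32,000 and 128,000 entries and a hidden size of 8192. The tied flag models sharing one matrix between the input embedding and the output projection.

```python
def embedding_params(vocab_size: int, hidden_dim: int, tied: bool = True) -> int:
    """Parameters in the embedding (plus output projection when untied)."""
    matrices = 1 if tied else 2
    return matrices * vocab_size * hidden_dim


small_vocab = embedding_params(32_000, 8192)
large_vocab = embedding_params(128_000, 8192)
large_untied = embedding_params(128_000, 8192, tied=False)

print(f"{small_vocab:,}")   # 262,144,000
print(f"{large_vocab:,}")   # 1,048,576,000
print(f"{large_untied:,}")  # 2,097,152,000
```

Moving from 32K to 128K adds roughly 0.8B parameters when tied, which is a small fraction of a 70B-class model but a large one for a small model.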

A2.2.2 Special tokens

Reserved tokens for protocol use: BOS, EOS, PAD, chat templates, tool calls, system roles.

+ deeper detail (4 leaves)

A2.2.2.1 BOS / EOS / PAD Beginning-of-sequence, end-of-sequence, and padding tokens. Industry standard: Universal. Specific token IDs vary; some models conflate BOS=EOS, others separate.
A2.2.2.2 Chat template tokens Tokens marking message roles (user, assistant, system) and turn boundaries. Industry standard: ChatML (a leading frontier lab), recent open-weights model templates with [INST]...[/INST], and a leading frontier model with the Human:/Assistant: convention.
A2.2.2.3 Tool / function-call tokens Special tokens for function invocation, tool result returns, structured output. Industry standard: Increasingly reserved. A leading open-weights model added tool tokens; a leading frontier lab uses structured wrapper tokens.
A2.2.2.4 Reserved / unused token policy Slots reserved for future special tokens. Industry standard: A leading open-weights model reserved 256 special token slots. Allows post-training extension without retraining tokenizer.
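
As an illustration of chat template tokens, the sketch below renders a message list in ChatML style with the <|im_start|> and <|im_end|> markers. The function name and the trailing open assistant turn are assumptions, and real templates differ between models.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(messages: list[dict[str, str]]) -> str:
    """Render role/content messages as a ChatML-style prompt string."""
    parts = [
        f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n"
        for message in messages
    ]
    # Leave an open assistant turn for the model to complete.
    parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


prompt = render_chatml(
    [
        {"role": "system", "content": "You are concise."},
        {"role": "user", "content": "Hi"},
    ]
)
print(prompt)
```

In production the markers are encoded as reserved special-token ids, and user-supplied text containing the same strings must be encoded as plain text so it cannot forge a turn boundary.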

A2.2.3 Vocab freezing strategy

Whether vocabulary is frozen post-training or extensible. Industry standard: Frozen at training start. Extension requires careful procedure (see A2.2.4).

A2.2.4 Vocab extension policy

Adding tokens post-training (new languages, domains, special tokens). Industry standard: Add to reserved slots; new embeddings randomly initialized + fine-tuned. A leading open-weights model demonstrates.

A2.3 Multilingual & Multi-Script Coverage

Special tokens demarcate roles, modalities, and structural boundaries. Standard set: BOS (begin), EOS (end), PAD, UNK (rare with byte-level), and chat tokens (system/user/assistant role markers). Tool-using models add tokens for function calls, results, and reasoning steps. Multimodal adds image-start/end and audio markers. SOTA: ChatML (a leading frontier lab) and a leading open-weights model's chat template define standard chat structure with explicit role tokens. Tool use tokens (function_call_start, function_result) increasingly standardized. Reasoning models (o1, R1) use special thinking/answer tokens. Reserved tokens (e.g., 200+ in a leading open-weights model) allow post-hoc additions without retraining. e.g. ChatML: <|im_start|>, <|im_end|> · a leading open-weights model: <|begin_of_text|>, <|start_header_id|>, etc. · o1-style: <thinking>, <answer>

A2.3.1 Script coverage

Per-script tokenization fidelity.

+ deeper detail (6 leaves)

A2.3.1.1 Latin scripts English, Spanish, French, German, etc. Industry standard: Best-supported. Byte-level BPE handles directly; SentencePiece-Unigram handles via subword regularization.
A2.3.1.2 CJK scripts (Chinese, Japanese, Korean) Logographic and mixed scripts. Industry standard: Treat each character or character-pair as token. Heavy vocabulary footprint because Unicode encodes tens of thousands of CJK ideographs, of which several thousand are in common use.
A2.3.1.3 Arabic-script (Arabic, Persian, Urdu) Right-to-left abjad scripts with optional diacritics and joining behavior. Industry standard: Often over-fragmented in byte-level BPE due to multi-byte UTF-8 representation. Persian especially under-served.
A2.3.1.3.1 RTL token boundary How tokenizer handles right-to-left direction at token boundaries. Industry standard: Mostly handled at rendering layer, not tokenizer. Tokenizer treats as plain byte sequence.
A2.3.1.3.2 Diacritic handling Optional vowel marks (harakat). Train corpus typically has them inconsistently. Industry standard: Inconsistent. Diacritics treated as separate tokens or dropped during normalization.
A2.3.1.3.3 Letter-form normalization Same character with different visual forms (e.g. Arabic ya/Persian ye). Industry standard: NFC/NFKC normalization standard but not universal; can collapse distinct characters.
A2.3.1.3.4 Joining behavior Arabic-script letters change shape based on position in word (initial/medial/final/isolated). Industry standard: Tokenizer operates on logical Unicode code-points, not visual forms; joining handled at rendering.
A2.3.1.4 Indic scripts Devanagari, Bengali, Tamil, Telugu, etc. Industry standard: Often under-served due to low corpus presence; multi-byte UTF-8 encoding leads to over-fragmentation.
A2.3.1.5 Cyrillic Russian, Ukrainian, Bulgarian, Serbian, etc. Industry standard: Generally well-served in major models due to substantial corpus presence.
A2.3.1.6 Long-tail scripts Thai, Hebrew, Greek, Armenian, Georgian, Ethiopic, etc. Industry standard: Coverage varies. Models trained on web corpus serve them roughly proportional to corpus presence.

A2.3.2 Token efficiency per language

Tokens-per-character ratio across languages — measures fairness and cost.

+ deeper detail (3 leaves)

A2.3.2.1 Tokens-per-character ratio Average number of tokens per character in a given language, typically measured over a fixed sample (e.g. 1,000 characters). Industry standard: English ~0.25 tokens/char (4 chars per token). Many low-resource languages 1.0+ (at least 1 token per char).
A2.3.2.2 Cross-language fairness Cost / context-window disparity between languages. Industry standard: Petrov et al. 2023 documented 5-15× cost disparity for some low-resource languages.
A2.3.2.3 Low-resource over-fragmentation Languages with sparse corpus presence get poorly-merged tokens. Industry standard: Up-sampling during tokenizer training partially addresses; full fix requires balanced corpus or per-language tokenizer.
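
The metrics above can be measured with a small harness. The tokenizer below is a deliberately English-centric toy (greedy longest match over a two-entry vocabulary with UTF-8 byte fallback), so the numbers illustrate the mechanism rather than any real model. All names are hypothetical.

```python
from typing import Callable


def make_toy_tokenizer(vocab: set[str]) -> Callable[[str], list[str]]:
    """Greedy longest-match tokenizer that falls back to one token per byte."""
    max_len = max(len(entry) for entry in vocab)

    def tokenize(text: str) -> list[str]:
        tokens, i = [], 0
        while i < len(text):
            for n in range(min(max_len, len(text) - i), 0, -1):
                if text[i:i + n] in vocab:
                    tokens.append(text[i:i + n])
                    i += n
                    break
            else:
                # No vocabulary entry matched: emit the character's bytes.
                tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
                i += 1
        return tokens

    return tokenize


def tokens_per_char(tokenize: Callable[[str], list[str]], text: str) -> float:
    return len(tokenize(text)) / len(text)


tokenize = make_toy_tokenizer({"hello", " world"})
english = "hello world"
persian = "\u0633\u0644\u0627\u0645 \u062f\u0646\u06cc\u0627"  # "salam donya"

en_tokens = tokenize(english)
fa_tokens = tokenize(persian)
print(round(tokens_per_char(tokenize, english), 2))  # 0.18
print(round(tokens_per_char(tokenize, persian), 2))  # 1.89
print(len(fa_tokens) / len(en_tokens))               # 8.5
```

The last figure is the cost premium for the same greeting, the quantity that cross-language fairness studies report across parallel text.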

A2.3.3 Cross-lingual token sharing

Whether semantically similar concepts share tokens across languages. Industry standard: Emergent in shared vocabulary; not designed-in. Subject of research in multilingual representation.

A2.4 Pre-tokenization

Multilingual tokenization is a major axis. English-centric tokenizers fragment non-Latin scripts heavily (e.g., a Persian/Arabic word may be 3-5x more tokens than English equivalent). This degrades both performance and economics for non-English users. Modern tokenizers (a leading open-weights model, a multimodal frontier model) explicitly rebalance to compress non-English scripts. SOTA: A leading open-weights model's 128K vocab includes substantial coverage for Chinese, Arabic, Hindi, Persian, etc. Trade-off: each language added 'costs' embedding capacity. A multimodal frontier model was designed multilingual-first. Some labs train language-specific tokenizers (e.g., Chinese-open-weights model family) for downstream models. e.g. A leading open-weights model vs a 2023-generation open-weights model: 4-8x compression improvement on non-English · a frontier multimodal model: significant non-English improvement over a 2023-class frontier model

A2.4.1 Whitespace handling

Whether whitespace is a token boundary, a regular character, or attached to adjacent token. Industry standard: Byte-level BPE: whitespace as character, often attached to following token (Ġ prefix in a foundational decoder-only model). SentencePiece: whitespace replaced with ▁.

A2.4.2 Punctuation rules

How punctuation interacts with adjacent characters during pre-tokenization. Industry standard: Generally split at punctuation boundaries via regex (frontier-style).

A2.4.3 Number/digit splitting

Whether multi-digit numbers are split into individual digits. Industry standard: an early open-weights model/2 split into digits for math; a frontier model historically did not. A leading open-weights model also splits.
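
A simplified pre-tokenization pattern shows how whitespace attachment, punctuation splitting, and digit splitting combine. This is a sketch using Python's standard re module, not the regex of any named tokenizer. Production patterns rely on Unicode property classes and handle contractions, which are omitted here.

```python
import re

# Alternatives, tried left to right:
#   " ?[^\W\d_]+"        letters, with an optional leading space attached
#   "\d"                 exactly one digit per piece (digit splitting)
#   " ?(?:[^\s\w]|_)+"   punctuation runs, with an optional leading space
#   "\s+"                any remaining whitespace
PRETOKENIZE = re.compile(r" ?[^\W\d_]+|\d| ?(?:[^\s\w]|_)+|\s+")


def pretokenize(text: str) -> list[str]:
    """Split text into pieces that BPE merges may not cross."""
    return PRETOKENIZE.findall(text)


pieces = pretokenize("Pay 1234 now!")
print(pieces)  # ['Pay', ' ', '1', '2', '3', '4', ' now', '!']
```

Because every character falls into exactly one alternative, joining the pieces reproduces the input, which keeps pre-tokenization lossless.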

A2.4.4 URL / code special handling

Pre-segmentation for code and URLs to prevent merge across syntactic boundaries. Industry standard: Often integrated into pre-tokenization regex; some pipelines route code through specialized tokenizer.

A2.4.5 Unicode normalization

NFC, NFKC, NFD, NFKD: different ways to canonicalize Unicode. Industry standard: NFC standard for most pipelines. NFKC additionally folds compatibility characters (ligatures, superscripts, presentation forms) into their plain equivalents, which loses information.
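
The difference between the forms is easy to see with Python's standard unicodedata module. The sample characters are chosen for illustration: a decomposed accent, a Latin ligature, and the Arabic and Persian letters for "y", which are separate code points.

```python
import unicodedata

decomposed_e = "e\u0301"       # "e" followed by a combining acute accent
ligature_fi = "\ufb01"         # single-character "fi" ligature
arabic_yeh = "\u064a"          # ARABIC LETTER YEH
farsi_yeh = "\u06cc"           # ARABIC LETTER FARSI YEH
yeh_isolated_form = "\ufef1"   # presentation form of ARABIC LETTER YEH

# NFC composes canonically equivalent sequences into one code point.
composed = unicodedata.normalize("NFC", decomposed_e)
print(len(decomposed_e), len(composed))  # 2 1

# NFC leaves compatibility characters alone; NFKC folds them.
print(unicodedata.normalize("NFC", ligature_fi) == ligature_fi)  # True
print(unicodedata.normalize("NFKC", ligature_fi))                # fi

# NFKC maps presentation forms back to the base letter, but it does
# not merge Arabic yeh with Persian yeh.
print(unicodedata.normalize("NFKC", yeh_isolated_form) == arabic_yeh)  # True
print(unicodedata.normalize("NFKC", arabic_yeh) == farsi_yeh)          # False
```

Merging Arabic and Persian letter variants therefore needs an explicit, language-aware mapping on top of Unicode normalization.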

A2.5 Code & Specialized Domains

Code and specialized-domain tokenization. Code has different distributional properties from natural language: high entropy in identifiers, important whitespace, frequent punctuation. Specialized tokenizers (or careful pre-tokenization in shared tokenizer) handle these. SOTA: General-purpose tokenizers (a leading open-weights model, a current-generation frontier model) handle code well via byte-level BPE + careful pre-tokenization (digits separate, indentation preserved). Code-specialized tokenizers (StarCoder) marginally better on code-only metrics but lose general-language efficiency. Frontier: shared tokenizer with code-aware design. e.g. StarCoder tokenizer: code-specialized · a leading open-weights model: general but strong on code · a byte-level BPE tokenizer library cl100k: handles code well

A2.5.1 Indentation handling

Python-style significant whitespace; tabs vs spaces. Industry standard: Modern code-aware tokenizers preserve indentation as multi-space tokens (single token for 4 spaces, etc.).

A2.5.2 Symbol density

Heavy punctuation and operator density in code. Industry standard: Common operators (==, !=, ->, =>) often merged into single tokens during BPE training.

A2.5.3 Multi-language source code

Single tokenizer covering Python, JavaScript, Java, Rust, C++, etc. Industry standard: Shared tokenizer trained on multi-language code corpus (The Stack).

A2.5.4 Math / LaTeX

Mathematical notation and LaTeX commands. Industry standard: Common LaTeX commands (\frac, \sum, etc.) often merged. Number/digit splitting (A2.4.3) helps arithmetic.

A2.6 Multi-Modal Token Spaces

Multi-modal token spaces. Vision tokens (from ViT/SigLIP encoder, 256-1024 per image), audio tokens (Whisper-style or native), video tokens (temporal sampling). Native multimodal models share token space across modalities; late-fusion projects modality embeddings into LLM space. SOTA: Native multimodal token spaces standard 2024+. Image: ViT/SigLIP encoder produces 256-1024 tokens per image; dynamic resolution (AnyRes, Pixtral). Audio: Moshi-style native audio tokens at 12.5Hz, or Whisper transcription tokens. Video: 1-8 fps spatial-temporal patches. e.g. a frontier multimodal model native multimodal · a multimodal frontier model 2.0 native + image gen · Chameleon (an open-weights frontier lab, open)

A2.6.1 Image patch tokens

ViT-style 16x16 or 14x14 patches converted to tokens via linear projection. Industry standard: Standard since ViT (Dosovitskiy 2020). 16x16 most common; 14x14 in newer models.

A2.6.2 Audio tokens

Audio frames or learned audio codec tokens (Encodec, SoundStream). Industry standard: Encodec (Defossez 2022) and similar neural codecs produce discrete audio tokens. Whisper uses log-mel spectrograms instead.

A2.6.3 Bridged token spaces

Shared embedding space across modalities. Industry standard: Common in multimodal LLMs (a frontier vision-language model, a multimodal frontier model, a long-context frontier model). Implementation via projection or learned bridging.

+ deeper detail (2 leaves)

A2.6.3.1 Shared embedding space All modalities mapped to single vector space. Industry standard: CLIP-style or learned per-modality projector to text embedding dim.
A2.6.3.2 Cross-attention bridges Modality-specific encoders feeding into text decoder via cross-attention. Industry standard: Used in Flamingo (a multimodal frontier lab 2022) and derivatives.

A2.7 Tokenizer Training Pipeline

Tokenizer training pipeline. Iterations: choose vocab size, train, evaluate compression ratio across languages, adjust special token reservations, finalize. SOTA: SentencePiece and an open-model hub's tokenizers library provide implementations. Train on a representative 10-100GB corpus sample. Reserve 200-500 tokens for special use. Evaluate per-language compression ratio. Standard pipeline: roughly hours on a single CPU. e.g. an open-model hub tokenizers (Rust, fast) · SentencePiece (a multimodal frontier lab) · a byte-level BPE tokenizer library (a leading frontier lab)

A2.7.1 Training corpus selection

Which subset of the data corpus is used to train the tokenizer. Industry standard: Random sample of the full pre-training corpus, usually 1-10B tokens. Mixture should mirror final training mixture.

A2.7.2 Sampling strategy

How documents are sampled into the tokenizer training set. Industry standard: Up-sample low-resource languages to prevent over-fragmentation.
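
One common way to up-sample low-resource languages is exponent (temperature) smoothing of corpus shares, p_i proportional to n_i ** alpha with alpha below 1. The source does not name a specific method, so the formula, the alpha value, and the corpus sizes below are assumptions for illustration.

```python
def sampling_weights(sizes: dict[str, int], alpha: float = 0.5) -> dict[str, float]:
    """Per-language sampling probability, proportional to size ** alpha."""
    scaled = {lang: n ** alpha for lang, n in sizes.items()}
    total = sum(scaled.values())
    return {lang: value / total for lang, value in scaled.items()}


corpus_sizes = {"en": 1_000_000, "fa": 10_000}  # hypothetical document counts

natural = sampling_weights(corpus_sizes, alpha=1.0)
smoothed = sampling_weights(corpus_sizes, alpha=0.5)

print(round(natural["fa"], 4))   # 0.0099
print(round(smoothed["fa"], 4))  # 0.0909
```

alpha = 1 reproduces the natural mixture and alpha = 0 samples every language equally. Values in between trade English compression for better merges in the smaller language.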

A2.7.3 Convergence criteria

Stopping condition for BPE merges. Industry standard: Stop when target vocab size reached, or when merge frequency falls below threshold.

A2.8 Inference-time Behavior

Inference-time tokenizer behavior. Encoding/decoding speed (sub-millisecond per request expected). Edge cases: incomplete UTF-8 at chunk boundaries (during streaming), tokenizer mismatch between client/server, special-token leakage in outputs. SOTA: a byte-level BPE tokenizer library and an open-model hub tokenizers achieve <1ms encoding for typical inputs. Streaming decode handles partial multi-byte chars at chunk boundaries. Production concern: special token leakage in outputs (e.g., <|endoftext|> appearing in user-visible response) — sanitization required. e.g. a byte-level BPE tokenizer library (a leading frontier lab, fast) · an open-model hub tokenizers (Rust)

A2.8.1 Detokenization correctness

Round-trip: encode then decode produces original text. Industry standard: Byte-level BPE: lossless. SentencePiece: lossless within whitespace handling rules.

A2.8.2 Streaming token boundary

Token-by-token streaming with valid UTF-8 emission. Industry standard: Buffer partial bytes until valid UTF-8 boundary; emit complete characters only.
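
The buffering rule can be sketched with Python's standard incremental UTF-8 decoder, which holds back incomplete byte sequences until the remaining bytes arrive. The chunking below is arbitrary and stands in for the byte content of successive streamed tokens.

```python
import codecs
from typing import Iterable, Iterator


def stream_decode(chunks: Iterable[bytes]) -> Iterator[str]:
    """Yield only complete characters from a stream of UTF-8 byte chunks."""
    # errors="replace" turns a truncated tail into U+FFFD instead of raising.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# "é" is 2 bytes and "€" is 3 bytes; both are split across chunk boundaries.
chunks = [b"\xc3", b"\xa9\xe2", b"\x82\xac"]
emitted = list(stream_decode(chunks))
print(emitted)  # ['é', '€']
```

Nothing is emitted for the first chunk because it ends mid-character. The client never sees a partial code point.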

A2.8.3 Prompt prefix handling

Whether tokenizer adds BOS automatically; how chat templates are pre-encoded. Industry standard: Varies by model. A 2023-generation open-weights model/3 expect explicit BOS; a leading frontier model family auto-adds.

A2.9 Evaluation & Robustness

Tokenizer evaluation and robustness. Compression ratio across languages (tokens per character). Coverage on rare scripts. Robustness to adversarial inputs (Unicode tricks, zero-width characters, look-alike characters). Glitch tokens (rarely-trained tokens that cause anomalous behavior). SOTA: Petrov et al. (2023) systematically evaluated multilingual tokenizers: older tokenizers fragment non-Latin 5-10× worse. A leading open-weights model and a multimodal frontier model show much improved fairness. Glitch tokens (such as 'a famous glitch-token example') were exposed as a failure mode in 2023 and detected via embedding magnitude analysis. e.g. Petrov et al. multilingual study · Glitch token analyses (Rumbelow & Watkins 2023) · MultiBPemb robustness work

A2.9.1 Compression metrics

Bits per character, tokens per word, vocabulary efficiency. Industry standard: Lower bits/char = better compression. Reported across languages for fairness analysis.

A2.9.2 Out-of-vocabulary behavior

How tokenizer handles characters or sequences not seen in training. Industry standard: Byte-level BPE: degrades gracefully (always representable). Character-based BPE: requires fallback.

A2.9.3 Adversarial inputs

Inputs crafted to exploit tokenization quirks (homoglyphs, invisible characters). Industry standard: Pre-tokenization Unicode normalization mitigates. Zero-width characters and homoglyphs remain attack surface.
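
A minimal sketch of one mitigation: flagging and stripping invisible format characters (Unicode category Cf) before tokenization. The allow-list is an assumption. It keeps U+200C (zero-width non-joiner) because Persian orthography uses it, so blanket stripping would damage legitimate text. Homoglyph detection is not covered.

```python
import unicodedata

ALLOWED_FORMAT_CHARS = frozenset({"\u200c"})  # ZWNJ, meaningful in Persian


def find_format_chars(text: str) -> list[tuple[int, str]]:
    """Return (index, code point) for every invisible format character."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


def strip_format_chars(text: str, keep: frozenset = ALLOWED_FORMAT_CHARS) -> str:
    """Drop format characters except those explicitly allowed."""
    return "".join(
        ch for ch in text if unicodedata.category(ch) != "Cf" or ch in keep
    )


spoofed = "pay\u200bpal"  # zero-width space hidden inside the word
persian_word = "\u0645\u06cc\u200c\u062e\u0648\u0627\u0647\u0645"  # contains ZWNJ

print(find_format_chars(spoofed))   # [(3, 'U+200B')]
print(strip_format_chars(spoofed))  # paypal
print(strip_format_chars(persian_word) == persian_word)  # True
```

Whether to strip, reject, or merely log such characters is a policy decision; the sketch only shows detection and one conservative cleanup.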

A2.9.4 Glitch tokens / hidden token exploits

Tokens in vocabulary that produce anomalous behavior. Famous example: 'a famous glitch-token example' in an earlier frontier model. Industry standard: Caused by tokenizer training on data later removed. Mitigation: align tokenizer corpus with model corpus.

A2.10 Anchor-based representation (research direction)

Anchor-based representation / advanced research direction. Beyond pure subword tokenization, research explores semantic-aware tokenization, byte-level models without BPE (MambaByte, MEGABYTE), and tokenizer-free approaches. SOTA: Tokenizer-free architectures (MambaByte, MEGABYTE, a byte-level encoder-decoder model) operate directly on bytes, which eliminates fragmentation but runs slower. Active research; these approaches have not yet matched BPE at frontier scale. Hybrid approaches (entropy-based dynamic tokenization) emerging in 2025. e.g. MambaByte (research 2024) · MEGABYTE (Yu 2023) · a byte-level encoder-decoder model (Xue 2022)

Architecture

44 sub-endpoints mapped

MZN Provisional Position · Partial

Patent-grade candidate architectural innovations; implementation validation pending; full model construction is Phase 3 scope

Patent-grade architectural inventions in the areas of structured intelligence, modular reasoning, and intent-shaping pipelines (SHA-256 anchors and blockchain timestamps). Architectural patterns contributed; full frontier-scale model construction requires partnership compute. Specifics held in the proprietary portfolio.

Definition

Model architecture defines the network's computational structure: how inputs flow through layers, what operations apply at each layer, and how representations combine. The dominant paradigm since 2017 is the decoder-only transformer with modifications. Architecture decisions cascade: attention type affects long-context behavior, normalization affects training stability, MoE affects parameter efficiency vs. compute.

State of the Art (2025–2026)

Frontier 2024-2026 dense architectures (a leading open-weights flagship model, an open-weights frontier lab Large 2): decoder-only transformer with RoPE positional encoding, RMSNorm, SwiGLU activation, GQA (grouped query attention). MoE architectures (a sparse-MoE frontier model, an open-weights frontier model (V3 class)): sparse expert routing with 8-256 experts, top-2 routing typical. Reasoning models (o1, R1): same architecture but RL-trained for chain-of-thought. Multimodal: native interleaved tokens with vision encoder integration.

Key Decisions

Dense vs. MoE
Attention type (full, GQA, MQA, sliding window)
Positional encoding (RoPE, ALiBi, NoPE)
Normalization (RMSNorm, LayerNorm, post vs. pre)
Activation (SwiGLU, GeGLU, ReLU)
Depth × width allocation
Vision/audio integration strategy

Trade-offs

MoE → more parameters per FLOP but harder to train and serve
GQA → faster inference, slight quality reduction vs. full MHA
Sliding window → attention cost linear in sequence length, but loses long-range info

Numbers & Ablations

GQA-8 vs full MHA at 70B: <1% MMLU degradation, 4-8× KV cache memory reduction, ~3× decode throughput at 32K context (Ainslie 2023, a leading open-weights model paper).
MoE active/total ratio: an open-weights frontier model (V3 class) 5.5% (37B/671B), a sparse-MoE frontier model 28% (39B/141B), a current-generation frontier model estimated ~20% (closed). Lower ratio = more capacity per FLOP at training/serving complexity cost.
RoPE base θ scaling: original 10K → 500K (a leading open-weights model for 128K context) → 5M+ (research on 1M+ context). Each 10× context extension typically requires ~10× θ.
Multi-Head Latent Attention (an open-weights frontier provider V2/V3): 93% KV cache reduction vs MHA, 1-2% benchmark improvement attributed to better representational structure.
SwiGLU vs GELU: ~1-2% perplexity gain at parameter-matched budget (Shazeer 2020). Universal at frontier 2024+.
Pre-RMSNorm vs Pre-LayerNorm: equivalent quality, ~7-10% throughput gain (omits mean computation). Universal at frontier 2024+.
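
Two of the figures above can be reproduced with back-of-envelope arithmetic. The KV cache calculation assumes a 70B-class shape of 80 layers, head dimension 128, 64 query heads, and 16-bit cache values; the layer count and head dimension are assumptions not stated in this section. The MoE ratios use the parameter counts quoted above.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Bytes for one sequence's K and V tensors across all layers."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


GIB = 2 ** 30
mha_cache = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=32_768)
gqa_cache = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=32_768)

print(mha_cache / GIB)        # 80.0
print(gqa_cache / GIB)        # 10.0
print(mha_cache / gqa_cache)  # 8.0


def active_ratio(active_billion: float, total_billion: float) -> float:
    """Share of parameters used per token in an MoE model."""
    return active_billion / total_billion


print(round(100 * active_ratio(37, 671), 1))  # 5.5
print(round(100 * active_ratio(39, 141), 1))  # 27.7
```

Under these assumptions GQA-8 sits at the upper end of the quoted 4-8× reduction, and the second MoE ratio rounds to the 28% given above.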

Open Questions

Is there a Pareto-better attention than MLA for long context? Several research efforts exist (Differential Attention, Lightning Attention), but none has been validated at frontier scale yet.
Why does sliding window + global hybrid (a small open-weights model 2) underperform pure full-attention at frontier scale despite theoretical advantages? Empirical observation, not understood.
Scaling laws for active parameters in MoE: if active=37B in an open-weights frontier model (V3 class) matches dense 70B-100B-class, what is the actual mapping? No published Chinchilla-equivalent for MoE.
Reasoning models: does the architecture that's best for non-reasoning training remain optimal under RL post-training? o1/R1 suggest yes; theoretical reason absent.
A state-space frontier architecture / State-Space Models at frontier: 2024 work demonstrated competitive results at 7-13B. Why has no lab pushed to 70B+ for serious comparison? Compute economics or architectural ceiling?

Reference Analyst Note

Dense architecture is dead at frontier scale by end of 2026. A leading open-weights flagship model is likely the last frontier-tier dense model. Either MoE (an open-weights frontier provider lineage, fine-grained 200+ experts) or new sparse paradigms wins. Architecture innovation is decoupling from scaling — RL post-training resets 'capability per parameter' such that smaller models with better post-training match much larger pre-train-only models. The bottleneck is shifting from architecture-quality to RL-environment-quality.

Examples

A leading open-weights flagship model: dense, GQA, RoPE, RMSNorm, SwiGLU · an open-weights frontier model (V3 class): MoE 671B total / 37B active, multi-head latent attention · a sparse-MoE frontier model: MoE, 141B total / 39B active · a long-context frontier model / a frontier multimodal model: architecture undisclosed but likely MoE

References (Academic)

Vaswani et al., Attention Is All You Need (2017) · Touvron et al., leading open-weights model (2023, 2024) · an open-weights frontier model (V3 class) technical report (2024) · Su et al., RoPE (2021)

Sub-endpoint anatomy — 44 items mapped

A3.1 Transformer Block Design

Attention mechanism is the core operation. Standard multi-head attention scales quadratically with sequence length, making naive transformers infeasible for long context. Variants reduce cost: multi-query attention (MQA, single KV head), grouped query attention (GQA, fewer KV heads than Q heads), sliding window attention (local), and multi-head latent attention (an open-weights frontier provider's compressed KV). SOTA: GQA is the dominant choice for frontier dense models (open-weights models). Reduces KV cache memory by 4-8x with negligible quality loss. An open-weights frontier model (V3 class) introduced MLA (Multi-head Latent Attention), which compresses KV via low-rank projection, achieving even better memory efficiency. Sliding window (an open-weights frontier lab, a small open-weights model) handles very long context with O(n × w) cost. e.g. A leading open-weights model (70B class): 64 Q heads, 8 KV heads (GQA-8) · an open-weights frontier lab: GQA + sliding window · an open-weights frontier model (V3 class): MLA

A3.1.1 Attention mechanism

Self-attention variant. Industry standard: Multi-head self-attention (Vaswani 2017) is the foundation. Modern variants reduce KV cache cost.

+ deeper detail (5 leaves)

A3.1.1.1 Multi-Head Attention (MHA) Standard multi-head: each head has independent Q, K, V projections. Industry standard: Foundation. Used in original Transformer, foundational decoder-only models, an encoder-only model.
A3.1.1.2 Multi-Query Attention (MQA) Single K, V projection shared across all heads. Reduces KV cache by H×. Industry standard: an earlier frontier model (a multimodal frontier lab 2022), an open-weights model. Reduces memory but mild quality loss.
A3.1.1.3 Grouped-Query Attention (GQA) K, V shared across groups of heads. Compromise between MHA and MQA. Industry standard: a 2023-generation open-weights model (70B), a leading open-weights model, a sparse-MoE frontier model. Now dominant frontier choice.
A3.1.1.4 Sliding Window / Local Attention Attention restricted to local window. Industry standard: a sliding-window frontier model uses sliding window 4096. Trade-off: linear attention cost, limited long-range coupling.
A3.1.1.5 FlashAttention IO-aware attention implementation: reduces memory access by tiling. Industry standard: Universally adopted. FlashAttention-2 (Dao 2023) and FlashAttention-3 (Shah 2024) progressive optimizations.

A3.1.2 Feed-forward network (FFN)

Per-token MLP after attention.

+ deeper detail (3 leaves)

A3.1.2.1 Standard FFN (GELU) Two linear layers with GELU between. Industry standard: a foundational decoder-only model/3, an encoder-only model. Hidden dim typically 4× model dim.
A3.1.2.2 SwiGLU Gated linear unit with Swish activation. Three matrices instead of two. Industry standard: an earlier frontier model, an early open-weights model/2/3, an open-weights frontier lab. Now dominant. Hidden dim ~2.67× to keep parameter count constant.
A3.1.2.3 GeGLU GLU variant with GELU activation. Industry standard: Used in some models (a small open-weights model). Less common than SwiGLU.
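
A pure-Python sketch of a SwiGLU feed-forward block, plus the arithmetic behind the ~2.67× hidden size. Weight shapes and names are illustrative, biases are omitted, and the width 6144 is chosen only because it divides cleanly.

```python
import math


def silu(x: float) -> float:
    """Swish / SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(weights: list[list[float]], x: list[float]) -> list[float]:
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """down( silu(gate(x)) * up(x) ): three matrices instead of two."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


def ffn_params(model_dim: int, hidden_dim: int, matrices: int) -> int:
    return matrices * model_dim * hidden_dim


d = 6144
standard = ffn_params(d, 4 * d, matrices=2)     # two d x 4d matrices
gated = ffn_params(d, 8 * d // 3, matrices=3)   # three d x (8/3)d matrices
print(standard == gated)  # True

# Tiny forward pass: model dim 2, hidden dim 1.
print(swiglu_ffn([1.0, 2.0], [[0.5, 0.5]], [[1.0, 1.0]], [[1.0], [1.0]]))
```

Setting the hidden size to 8/3 of the model dimension is what keeps the three-matrix block at the same parameter count as the classic 4× two-matrix block.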

A3.1.3 Normalization

Activation normalization layer.

+ deeper detail (3 leaves)

A3.1.3.1 LayerNorm Standard layer normalization (Ba 2016). Industry standard: a foundational decoder-only model, an encoder-only model, an encoder-decoder model. Largely superseded by RMSNorm at frontier.
A3.1.3.2 RMSNorm Root-Mean-Square normalization. No mean centering, only scaling. Industry standard: an early open-weights model/2/3, an open-weights frontier lab, a small open-weights model. Now dominant. ~10% faster than LayerNorm.
A3.1.3.3 Pre-norm vs Post-norm Whether normalization is applied before or after the residual. Industry standard: Pre-norm dominant since a foundational decoder-only model; better training stability at depth.
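
The difference between the two normalizations fits in a few lines. This sketch omits the learned gain and bias that real layers carry, and the epsilon value is an assumption.

```python
import math


def layer_norm(x: list[float], eps: float = 1e-6) -> list[float]:
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(variance + eps) for v in x]


def rms_norm(x: list[float], eps: float = 1e-6) -> list[float]:
    """Divide by the root mean square only; no mean is computed."""
    mean_square = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_square + eps) for v in x]


activations = [3.0, 4.0]
print([round(v, 4) for v in layer_norm(activations)])  # [-1.0, 1.0]
print([round(v, 4) for v in rms_norm(activations)])    # [0.8485, 1.1314]
```

RMSNorm keeps the sign and relative offset of the input while LayerNorm recenters it. Skipping the mean pass is where the throughput gain comes from.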

A3.1.4 Residual connections

Skip connections around each sub-layer. Industry standard: Universal. Required for gradient flow at depth.

A3.2 Position Encoding

Positional encoding tells the model where each token sits. Modern choices: RoPE (rotary), ALiBi (linear bias), and NoPE (no explicit positions, model learns implicitly via causal mask). RoPE is dominant. RoPE's frequency choice and scaling determine effective context length. SOTA: RoPE (Su et al., 2021) is standard. Long-context models extend RoPE via NTK-aware scaling, YaRN (Peng et al., 2023), or LongRoPE. A leading open-weights model uses scaling factor 8 to extend from 8K base to 128K context. Frontier 2025 efforts push to 1M+ context (a multimodal frontier model 2M context). Position interpolation methods are key to extending context post-training. e.g. A leading open-weights model: RoPE θ=500K, YaRN-style scaling to 128K · a million-token-context frontier model Pro: 2M context · a long-context frontier model: 200K context

A3.2.1 Absolute position (sinusoidal/learned)

Original Transformer position encoding. Industry standard: Obsolete for new frontier models. Cannot extrapolate beyond training length.

A3.2.2 RoPE (Rotary Position Embedding)

Apply rotation matrix to Q, K based on position. Industry standard: leading open-weights model family, an open replication initiative, a 2023-class frontier model. Now dominant. Better extrapolation than absolute.
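
A minimal sketch of the rotation, using the interleaved pairing of the original formulation (some implementations pair the first half of the vector with the second half instead). The base of 10,000 is the original default; the vectors are arbitrary.

```python
import math


def rope(vec: list[float], pos: int, base: float = 10_000.0) -> list[float]:
    """Rotate each consecutive pair (vec[i], vec[i+1]) by a position-dependent angle."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        angle = pos * base ** (-i / dim)  # lower pairs rotate faster
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same offset (2) at different absolute positions gives the same score.
near = dot(rope(q, 5), rope(k, 3))
far = dot(rope(q, 12), rope(k, 10))
print(abs(near - far) < 1e-9)  # True
```

The attention score depends only on the relative offset between query and key positions, which is the property that context-extension methods exploit when they rescale the base or the positions.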

A3.2.3 ALiBi (Attention Linear Bias)

Add linear bias to attention scores based on distance. Industry standard: an ALiBi-trained open model, an ALiBi-trained model. Strong extrapolation; less popular than RoPE recently.

A3.2.4 RoPE scaling (YaRN, NTK)

Methods to extend RoPE-trained models to longer contexts post-hoc. Industry standard: YaRN (Peng 2023), NTK-aware scaling. Used to extend a leading open-weights model to 128K context.

A3.3 Mixture of Experts (MoE)

Mixture of Experts (MoE) replaces dense feed-forward layers with multiple 'expert' FFNs and a router that activates only top-k experts per token. Result: larger total parameter count for same active compute. Trade-offs: harder training (load balancing, expert collapse), harder inference (memory holds all experts, only some compute), but better parameter efficiency. SOTA: an open-weights frontier model (V3 class) (Dec 2024): 671B total, 37B active per token, fine-grained MoE with 256 experts and top-8 routing. A sparse-MoE frontier model: 141B total, 39B active, 8 experts top-2. Frontier closed models likely MoE (a current-generation frontier model, a leading frontier model, a multimodal frontier model speculated). Auxiliary-loss-free load balancing (an open-weights frontier model (V3 class) innovation) avoids the gradient pathologies of traditional MoE training. e.g. a sparse-MoE frontier model/22B: 8 experts, top-2 · an open-weights frontier model (V3 class): 256 experts, top-8 + 1 shared · Switch Transformer: 1 expert (top-1), early MoE

A3.3.1 Switch Transformer (top-1 routing)

Each token routed to single expert. Industry standard: Switch Transformer pioneered. Simpler routing but load-balancing harder.

A3.3.2 Sparse top-K (a sparse-MoE frontier model-style)

Each token routed to top-K experts (K=2 typically). Industry standard: a sparse-MoE frontier model family. 8 experts, top-2 active per token.
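
A small sketch of top-K routing for a single token, in plain Python. It assumes the gate weights are a softmax over the selected experts' logits only, which is one common convention; the toy experts and function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Return [(expert_index, gate_weight)] for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    gates = softmax([router_logits[i] for i in top])  # renormalize over the selected experts
    return list(zip(top, gates))

def moe_forward(x, router_logits, experts, k=2):
    """Gate-weighted sum of the selected experts' outputs for one token vector."""
    out = [0.0] * len(x)
    for idx, gate in route_top_k(router_logits, k):
        y = experts[idx](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out

# Toy experts: expert i scales its input by (i + 1).
experts = [lambda v, s=i + 1: [s * t for t in v] for i in range(8)]
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.3, 0.2]
selected = route_top_k(logits, k=2)
y = moe_forward([1.0, -2.0], logits, experts, k=2)
```

Only two of the eight toy experts run for this token, which is why total parameters can grow while per-token compute stays fixed.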

A3.3.3 Expert Choice routing

Experts choose tokens (not tokens choose experts). Better load balancing. Industry standard: Used in some models from a multimodal frontier lab. Less popular publicly.

A3.3.4 Load balancing

Auxiliary loss to keep expert utilization balanced. Industry standard: Standard practice in most MoE training; the auxiliary-loss-free balancing noted under A3.3 is a recent alternative.
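
One way to write such an auxiliary loss, following the Switch Transformer formulation (number of experts times the sum over experts of dispatch fraction times mean router probability). This is a sketch with hypothetical names; in training the value would be multiplied by a small coefficient and added to the language-modeling loss.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """router_probs[t][e]: router probability of expert e for token t.
    assignments[t]: index of the expert that token t was dispatched to."""
    n_tokens = len(assignments)
    dispatch_fraction = [assignments.count(e) / n_tokens for e in range(num_experts)]
    mean_prob = [sum(p[e] for p in router_probs) / n_tokens for e in range(num_experts)]
    return num_experts * sum(f * p for f, p in zip(dispatch_fraction, mean_prob))

# Perfectly balanced routing over 4 experts.
uniform_probs = [[0.25] * 4 for _ in range(8)]
balanced = load_balance_loss(uniform_probs, [0, 1, 2, 3, 0, 1, 2, 3], 4)

# Collapsed routing: every token goes to expert 0 with full confidence.
collapsed_probs = [[1.0, 0.0, 0.0, 0.0] for _ in range(8)]
collapsed = load_balance_loss(collapsed_probs, [0] * 8, 4)
```

The loss reaches its minimum of 1 under uniform routing and grows toward the number of experts as routing collapses onto a single expert.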

A3.4 Depth/Width Allocation

Normalization stabilizes training and enables deep networks. Choice: LayerNorm (original transformer), RMSNorm (simpler, faster, equally effective), or various others. Position: pre-norm (before each sublayer, dominant) vs post-norm (after, original transformer, harder to train deep). Frontier choice: pre-RMSNorm. SOTA: Pre-RMSNorm is universal in 2024+ frontier dense models (open-weights models, an open-weights frontier provider). RMSNorm omits mean-subtraction (just scales by RMS) — empirically equivalent quality, ~10% faster. Some research on QK-norm (normalize Q, K separately for attention stability), used in some 2024 models. e.g. open-weights models, an open-weights frontier provider: pre-RMSNorm · Original Transformer: post-LayerNorm · a foundational decoder-only model/3: pre-LayerNorm
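
The difference between the two normalizations, and the pre-norm placement, can be sketched in a few lines of plain Python. Learned gains and biases are omitted for LayerNorm, and all names are hypothetical.

```python
import math

def layer_norm(x, eps=1e-6):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, gain=None, eps=1e-6):
    """Divide by the root mean square only; no mean subtraction."""
    gain = gain or [1.0] * len(x)
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def pre_norm_block(x, sublayer):
    """Pre-norm residual: normalize the input, apply the sublayer, add the skip path."""
    return [a + b for a, b in zip(x, sublayer(rms_norm(x)))]

x = [2.0, -1.0, 0.5, -1.5]   # zero-mean example vector
ln_out = layer_norm(x)
rms_out = rms_norm(x)
```

For a zero-mean input the two outputs coincide, which illustrates why dropping the mean subtraction costs little in practice; in the pre-norm block the skip path carries the unnormalized input straight through.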

A3.4.1 Aspect ratio

Ratio of model depth to width. Industry standard: Empirical sweet spot ~80-128 depth for largest models. Very deep (>200 layer) shown not to help.

A3.4.2 Hidden dimension selection

Model dimension per layer. Industry standard: A leading open-weights model (70B class) uses 8192. Hidden dim = number of heads × head dim, with head dim 128 a common pattern.

A3.4.3 Layer count

Number of transformer blocks. Industry standard: A leading open-weights model (70B class) = 80 layers, a leading open-weights flagship model = 126 layers. Scales sublinearly with parameters.

A3.5 Embedding & Output Projection

Activation function choice in feed-forward layers. Modern frontier models use SwiGLU (Swish-gated linear unit) or GeGLU. These gated activations consistently outperform ReLU/GeLU at the same parameter count, at the cost of an extra weight matrix (3 weights per FFN instead of 2). SOTA: SwiGLU is standard. The FFN hidden dimension is scaled to 2/3 of the usual 4× width to compensate for the extra matrix, keeping parameter count constant. Gives ~1-2% perplexity improvement over GeLU at no extra parameter cost. e.g. open-weights models, an open-weights frontier provider: SwiGLU · an earlier frontier model: SwiGLU · a foundational decoder-only model/3: GeLU
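
A sketch of the gated FFN and of the parameter-parity arithmetic, in plain Python. The model width of 3072 is a hypothetical value picked so that 8/3 of it is an integer; biases are omitted and the helper names are illustrative.

```python
import math

def silu(v):
    return v / (1.0 + math.exp(-v))

def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """down( silu(gate(x)) * up(x) ): three weight matrices instead of two."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

def ffn_params(d_model, d_hidden, n_matrices):
    return n_matrices * d_model * d_hidden

d_model = 3072                                              # hypothetical width
gelu_params = ffn_params(d_model, 4 * d_model, 2)           # classic 4x FFN, 2 matrices
swiglu_params = ffn_params(d_model, (8 * d_model) // 3, 3)  # 2/3 of 4x, 3 matrices
```

Both counts come out to 8 × d_model², so the comparison between the two activations is made at equal parameter budget.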

A3.5.1 Tied embeddings

Whether input embedding and output projection share weights. Industry standard: a foundational decoder-only model ties. leading open-weights model family does not tie (separate matrices). Tradeoff: parameter count vs flexibility.

A3.5.2 Output head

Final projection to vocabulary logits. Industry standard: Linear projection. Usually preceded by final RMSNorm.

A3.6 Long-Context Architecture

Long-context handling extends the model's effective context window beyond pre-training. Methods: (1) train with long context throughout (expensive), (2) train short → extend via position interpolation + fine-tune, (3) RAG / external memory (skip true long context). Frontier 2025: 200K-2M true context. SOTA: A leading open-weights flagship model: 128K context via YaRN-style RoPE scaling + long-context fine-tune. A million-token-context frontier model Pro: 1M-2M context with ring attention and other tricks. A long-context frontier model: 200K. A current-generation frontier model-Turbo/4o: 128K. Long-context evaluation shifted from NIAH (needle in haystack, easy) to RULER and BABILong (multi-hop reasoning, harder). e.g. a million-token-context frontier model: 2M context · a long-context frontier model: 200K · a leading open-weights model: 128K

A3.6.1 Context window size

Maximum sequence length. Industry standard: A leading open-weights model: 128K. A long-context frontier model: 200K. A million-token-context frontier model: 1M-10M. Trend: longer.

A3.6.2 Position extrapolation

Train on shorter context, extend at inference. Industry standard: RoPE scaling (YaRN, NTK), position interpolation. A leading open-weights model trained on 8K, extended to 128K via RoPE scaling + continued training.

A3.6.3 KV cache optimization (architectural)

Architectural choices that reduce KV cache size. Industry standard: GQA (A3.1.1.3), sliding window (A3.1.1.4), MQA (A3.1.1.2). Architectural KV reduction directly enables long context.
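
To make the size of the effect concrete, a back-of-envelope KV cache calculator. The layer count (80) matches the 70B-class figure given earlier in this atlas; the head counts (64 query heads, 8 KV heads) and head dimension (128) are assumptions chosen for illustration, with BF16 storage and a single 128K-token sequence.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    # The factor 2 covers the separate key and value tensors.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

GIB = 1024 ** 3
layers, head_dim, seq_len = 80, 128, 128 * 1024

mha = kv_cache_bytes(layers, 64, head_dim, seq_len)  # one KV head per query head
gqa = kv_cache_bytes(layers, 8, head_dim, seq_len)   # 8 shared KV heads (assumed)
mqa = kv_cache_bytes(layers, 1, head_dim, seq_len)   # a single shared KV head
```

Under these assumptions the cache shrinks from 320 GiB (MHA) to 40 GiB (GQA) to 5 GiB (MQA) per sequence, in direct proportion to the number of KV heads.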

A3.7 Activation Precision & Dtype

Multimodal integration: how non-text modalities enter the model. Two approaches: (1) Late fusion — separate vision/audio encoders → projector → frozen LLM, used in LLaVA, early a leading frontier model. (2) Native multimodal — interleaved image/audio/text tokens trained jointly from pre-training, used in a multimodal frontier model, a frontier multimodal model, Chameleon. Native is harder but enables true cross-modal reasoning. SOTA: Native multimodal dominates 2024+ frontier (a frontier multimodal model, a multimodal frontier model 2.0, a long-context frontier model). Vision tokens generated by ViT-style encoder (CLIP, SigLIP) and inserted into token stream. Image resolution often dynamic (e.g., AnyRes, Pixtral): 448² base, up to 1024² for detail. Audio: Whisper-style encoder or native audio tokens. Video: temporal sampling + frame tokens. e.g. a frontier multimodal model: native multimodal · a multimodal frontier model 2.0: native multimodal + image generation · a leading open-weights model.2 Vision: late fusion (vision adapter)

A3.7.1 Mixed precision (BF16/FP16)

Compute in lower precision, master weights in FP32. Industry standard: BF16 dominant for training (better dynamic range than FP16). Universal at frontier.
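
The dynamic-range and precision points can be checked with the standard library alone. The sketch below emulates BF16 rounding by keeping the top 16 bits of an FP32 value (round-to-nearest-even) and uses the struct module's half-precision format for FP16; NaN and infinity handling is omitted, and the helper names are hypothetical.

```python
import struct

def to_bf16(x):
    """Round a float to the nearest BF16 value and return it as a Python float."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round to nearest, ties to even
    bits = (bits >> 16) << 16             # keep sign, 8 exponent bits, 7 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFFFFFF))[0]

def fits_in_fp16(x):
    """True if x can be packed as IEEE half precision without overflow."""
    try:
        struct.pack(">e", x)
        return True
    except (OverflowError, struct.error):
        return False

big = 70000.0                 # above the FP16 maximum of 65504
big_in_bf16 = to_bf16(big)
small_update = to_bf16(1.0 + 1e-3)
```

The large value survives in BF16 (coarsely rounded) but overflows FP16, while the small increment on 1.0 is lost entirely in BF16, which is the reason master weights are kept in FP32.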

A3.7.2 FP8 training/inference

8-bit floating point for compute. Industry standard: Emerging. A current-generation accelerator supports FP8; some training phases of a leading open-weights model used FP8.

A3.8 Architecture Variants

Reasoning architectures: same base architecture, trained to produce long chains of thought before answering. o1 (a leading frontier lab) and R1 (an open-weights frontier provider) demonstrate that scaling test-time compute via reasoning yields substantial capability gains, especially on math and code. Architecturally identical to standard LLMs; the innovation is in training (RL with reward on outcome) and inference (let it think). SOTA: o1 (2024) demonstrated that hidden chain-of-thought before answer dramatically improves AIME, codeforces, GPQA. An open-weights reasoning model (Jan 2025) showed open-source path: pure RL from base model with simple rule-based rewards (correct/incorrect) yields reasoning capabilities, distillable to smaller models. Architecture: standard transformer; the magic is RL training and inference-time compute allocation. e.g. A leading frontier lab o1, o3 · an open-weights reasoning model · a long-context frontier model.7 Sonnet (extended thinking)

A3.8.1 Decoder-only

Causal masking, single stack. Industry standard: A leading frontier model family, open-weights models, a leading frontier model. Dominant frontier choice.

A3.8.2 Encoder-decoder

Separate encoder + decoder, used in an encoder-decoder model, a multilingual translation model. Industry standard: Less popular for general LLMs. Strong for translation, summarization.

A3.8.3 State-Space Models / a state-space frontier architecture

Alternative to attention via state-space recurrence. Industry standard: a state-space frontier architecture (Gu, Dao 2023), a state-space frontier architecture-2. Hybrid Transformer+SSM emerging (a hybrid SSM-attention model, a hybrid SSM-attention model). Not yet mainstream frontier.

Training

35 sub-endpoints mapped

MZN Provisional Position · Partial

Training methodology documented; frontier-scale execution pending (Phase 3 scope)

Complete training methodology documented across model selection, fine-tuning strategy (parameter-efficient methods), optimizer configuration, learning rate schedules, batch strategy, stability control (gradient clipping, weight initialization, loss-spike recovery), parallelism strategy (data, tensor, and sharded), and checkpoint management. Reviewer-grade methodology and reference inventory exist. Frontier-scale execution at the 10K+ accelerator class remains a partnership-scope dependency.

Definition

Training infrastructure is the orchestration layer that turns architecture + data + compute into a trained model. At frontier scale (10K+ GPUs, weeks of training), every component matters: distributed parallelism strategy, optimizer state management, mixed-precision arithmetic, failure recovery, checkpoint frequency, gradient accumulation, learning rate scheduling. A 1% throughput improvement at frontier scale = millions of dollars.

State of the Art (2025–2026)

Frontier training stacks: a leading accelerator vendor a tensor-parallelism reference implementation + an open optimization framework (PyTorch), JAX/MaxText (a constitutional-methods frontier lab, a multimodal frontier lab). 4D parallelism standard: data + tensor + pipeline + expert (for MoE). A current-generation accelerator/a current-generation accelerator/a next-generation accelerator with InfiniBand. BF16 mixed-precision, FP8 emerging (a current-generation accelerator+). Checkpoint to S3/GCS every N steps with async writes. Auto-recovery from node failure.

Key Decisions

Framework (PyTorch ecosystem vs. JAX)
Parallelism strategy
Precision (BF16 vs. FP8)
Optimizer (AdamW vs. Lion vs. distributed Shampoo)
LR schedule shape
Checkpoint frequency
Gradient clipping

Trade-offs

More parallelism → larger models possible, communication overhead
FP8 → 2x throughput, training instability risk
Frequent checkpoints → resilience, write bandwidth

Numbers & Ablations

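A leading open-weights flagship model training: 16K H100s, ~30M GPU-hours × 700W = ~22 GWh (GPU power only), MFU ~38%. Total compute ~3.8e25 FLOPs. Note that 16K GPUs × ~54 days is only ~21M GPU-hours (~15 GWh), so the ~22 GWh figure follows from the GPU-hour total rather than from a 54-day window.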
an open-weights frontier model (V3 class) 671B-MoE training: 2K H800s × ~57 days, FP8 mixed-precision, 14.8T tokens, $5.6M reported (excludes ablations). 18.8% of a leading open-weights model's compute, comparable benchmark performance.
MFU benchmarks: 40-50% is good at frontier scale; >55% rare and only with extensive optimization. an earlier frontier model achieved 46% at 540B scale.
Failure rate: GPU failures at frontier scale ~3-5% of GPUs/week; 1-3 failures/day on 16K cluster. Without auto-recovery, multi-week runs impossible.
Optimizer state cost: AdamW = 12 bytes/param FP32 master + momentum + variance. For 405B model: ~5TB. Distributed via ZeRO-3/FSDP across DP ranks.
FP8 training precision: an open-weights frontier model (V3 class) reports <0.05% loss penalty vs BF16 with selective high-precision for sensitive ops, ~1.8× throughput.
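
Several of the figures above can be reproduced with simple arithmetic. The sketch below uses the 12 bytes/param and 405B figures from this list, the ~30M GPU-hours and 700W figures from this section, and a 15T-token count taken from the schedule discussion later in this section. The 6 × parameters × tokens rule for training FLOPs is a standard approximation assumed here, not something this section states.

```python
def adamw_state_bytes(n_params, bytes_per_param=12):
    """FP32 master weights + momentum + variance."""
    return n_params * bytes_per_param

def training_flops(n_params, n_tokens):
    """Common approximation: 6 FLOPs per parameter per token (forward + backward)."""
    return 6 * n_params * n_tokens

def gpu_energy_gwh(gpu_hours, watts_per_gpu):
    """GPU-only energy; excludes cooling, networking, and host power."""
    return gpu_hours * watts_per_gpu / 1e9   # Wh -> GWh

state_tb = adamw_state_bytes(405e9) / 1e12   # about 4.86 TB
flops = training_flops(405e9, 15e12)         # about 3.6e25
energy = gpu_energy_gwh(30e6, 700)           # about 21 GWh
```

The results land close to the ~5TB, ~3.8e25 FLOPs, and ~22 GWh quoted above; the small gaps come from rounding of the token count and GPU-hour inputs.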

Open Questions

Optimal LR schedule shape: cosine vs WSD vs constant-then-decay — which wins at 10T+ token scale? No frontier ablation published.
Distributed Shampoo vs AdamW at frontier: a constitutional-methods frontier lab reportedly uses Shampoo; no public head-to-head exists at >100B scale.
Training stability: are loss spikes random hardware artifacts, deterministic numerical issues, or signal of optimization pathology? Frontier labs disagree.
Annealing phase impact: a leading open-weights model reports gains from final annealing; isolated effect vs. confound with high-quality data? Unclear.
Cross-architecture parallelism transfer: know-how for parallelizing dense models transfers only partially to MoE, because expert parallelism is a new dimension. An open-weights frontier provider had to develop new techniques.

Reference analyst note. an open-weights frontier model (V3 class)'s $5.6M-equivalent demonstrated the field has been overspending by 5-10×. The next 2 years will see massive efficiency gains as algorithmic improvements (FP8, fine-grained MoE, better parallelism, better data) compound. Frontier 'training compute' as the dominant moat is collapsing. The new moat is post-training infrastructure, RL environment quality, and inference-time compute scaling. Anyone with 1K H100s can now produce competitive models — the bottleneck has moved upstream of pre-training to data and downstream to RL.

Examples

A leading open-weights flagship model: 16K H100s for ~30M GPU-hours, BF16, 4D parallel · an open-weights frontier model (V3 class): 2K H800s, FP8 mixed-precision (innovation) · a constitutional-methods frontier lab: JAX on a custom-silicon accelerator

References (Academic)

Shoeybi et al., a tensor-parallelism reference implementation (2019) · Rajbhandari et al., ZeRO/an open optimization framework (2020) · a leading open-weights model paper (2024) · an open-weights frontier model (V3 class) report (2024)

Sub-endpoint anatomy — 35 items mapped

A4.1 Optimizer

Distributed parallelism strategy splits the model and data across many GPUs. Four dimensions: Data Parallelism (DP, replicate model, split batch), Tensor Parallelism (TP, split each layer's matrix multiply across GPUs), Pipeline Parallelism (PP, different layers on different GPUs), Expert Parallelism (EP, MoE experts across GPUs). Frontier uses all four ('4D parallelism'). SOTA: A leading open-weights flagship model: TP=8 (within node), PP=16, DP=128, total 16K GPUs. An open optimization framework ZeRO partitions optimizer state, gradients, parameters across DP ranks for memory. FSDP (PyTorch native) is similar to ZeRO-3. Communication-compute overlap critical: parallelism choice depends on InfiniBand topology and node-local NVLink bandwidth. e.g. A leading open-weights flagship model: 8×16×128 = 16K GPUs · an open optimization framework ZeRO-3 + TP commonly · an open-weights frontier lab Large: similar 4D
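
A sketch of how a flat GPU rank might map onto the TP × PP × DP grid quoted above. The ordering (tensor-parallel index varies fastest, so a TP group sits inside one 8-GPU node) is an assumption for illustration; real frameworks make the ordering configurable, and expert parallelism is left out.

```python
def world_size(tp, pp, dp):
    return tp * pp * dp

def rank_to_coords(rank, tp, pp, dp):
    """Map a flat rank to (tp_rank, pp_rank, dp_rank), TP fastest-varying."""
    assert 0 <= rank < tp * pp * dp
    tp_rank = rank % tp
    pp_rank = (rank // tp) % pp
    dp_rank = rank // (tp * pp)
    return tp_rank, pp_rank, dp_rank

TP, PP, DP = 8, 16, 128
total = world_size(TP, PP, DP)

# Ranks 0..7 differ only in their TP index, so they form one tensor-parallel group.
first_node_groups = {rank_to_coords(r, TP, PP, DP)[1:] for r in range(8)}
last = rank_to_coords(total - 1, TP, PP, DP)
```

Keeping the TP group within a node matters because tensor parallelism communicates on every layer and benefits most from NVLink bandwidth.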

A4.1.1 SGD with momentum

Original optimizer. Industry standard: Obsolete for LLM pre-training; cannot match Adam-family at scale.

A4.1.2 AdamW

Adam with decoupled weight decay. Industry standard: Universal at frontier. β1=0.9, β2=0.95 typical for LLMs (β2=0.999 for general DL).
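
A single-parameter sketch of the AdamW update with the β values quoted above. The learning rate (3e-4), weight decay (0.1), and epsilon are illustrative defaults assumed here; the key point is that weight decay is applied directly to the weight rather than folded into the gradient.

```python
import math

def adamw_step(w, g, m, v, step, lr=3e-4, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update for a scalar weight; `step` counts from 1."""
    m = beta1 * m + (1.0 - beta1) * g          # first moment (momentum)
    v = beta2 * v + (1.0 - beta2) * g * g      # second moment (variance)
    m_hat = m / (1.0 - beta1 ** step)          # bias correction
    v_hat = v / (1.0 - beta2 ** step)
    w = w - lr * weight_decay * w              # decoupled weight decay
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w_new, m_new, v_new = adamw_step(w=1.0, g=1.0, m=0.0, v=0.0, step=1)
```

On the first step the bias-corrected moments equal the gradient itself, so the weight moves by roughly lr from the Adam term plus lr × weight_decay from the decay term.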

A4.1.3 LAMB

Layer-wise adaptive moments. Designed for very large batch sizes. Industry standard: Used in some encoder-only model pre-training. Less common for decoder-only LLMs.

A4.1.4 Lion

Sign-momentum-only optimizer; less memory than AdamW. Industry standard: Emerging. Some models from a multimodal frontier lab report success.

A4.2 Learning Rate Schedule

Mixed-precision training uses lower-precision arithmetic (BF16, FP8) for compute while keeping high-precision (FP32) master weights for stability. Doubles effective compute and halves memory. BF16 (Brain Float 16) has FP32's exponent range, avoiding overflow issues of FP16. FP8 (E4M3, E5M2) is the new frontier — 2x throughput vs BF16 but training stability harder. SOTA: BF16 mixed-precision is standard. FP8 mixed-precision validated at scale by an open-weights frontier model (V3 class) (671B MoE trained in FP8). Requires careful scaling, gradient handling, and selective high-precision for sensitive ops (LayerNorm, softmax, MoE gating). Trade-off: 2x throughput, ~5x more engineering complexity. e.g. A leading open-weights model: BF16 + FP32 master · an open-weights frontier model (V3 class): FP8 mixed-precision (first frontier-scale demonstration)

A4.2.1 Warmup phase

Linear ramp from 0 to peak LR over initial steps. Industry standard: Universal. Typically 0.5-2% of total steps. A 2023-generation open-weights model used 2000 steps.

A4.2.2 Cosine decay

Cosine curve from peak to ~10% of peak LR. Industry standard: an early open-weights model/2/3 use cosine. Min LR typically 0.1× peak.

A4.2.3 Linear decay

Linear from peak to min. Industry standard: Used in some models and common in fine-tuning (an earlier frontier model, by contrast, used cosine).

A4.2.4 WSD (Warmup-Stable-Decay)

Constant LR after warmup, decay only at end. Industry standard: Used in MiniCPM, allows continued training without LR planning.
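
The schedule shapes in A4.2.1 to A4.2.4 can be written as small functions of the step index. This is a sketch with hypothetical parameter names; the peak learning rate and step counts in the example are placeholders, and the WSD decay is taken to be linear to zero.

```python
import math

def warmup_cosine(step, total_steps, warmup_steps, peak_lr, min_ratio=0.1):
    """Linear warmup, then cosine decay from peak_lr to min_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    min_lr = peak_lr * min_ratio
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def wsd(step, total_steps, warmup_steps, decay_steps, peak_lr):
    """Warmup-Stable-Decay: linear warmup, constant plateau, linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr
    return peak_lr * (total_steps - step) / decay_steps

peak = 3e-4
cosine_end = warmup_cosine(1000, 1000, 20, peak)
wsd_plateau = wsd(500, 1000, 20, 100, peak)
wsd_end = wsd(1000, 1000, 20, 100, peak)
```

Because the WSD plateau does not depend on the total step count, a run can be extended by simply postponing the decay phase, whereas the cosine shape has to be fixed in advance.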

A4.3 Batching

Optimizer state management. AdamW (decoupled weight decay) is universal for LLM training. Stores 2 floats per parameter (momentum, variance) in addition to the parameter itself, in FP32 = 12 bytes/param overhead vs the 2-byte BF16 weight. ZeRO/FSDP partition this state across DP ranks to fit large models. Distributed Shampoo (newer, 2nd-order method) shows promise. SOTA: AdamW with β1=0.9, β2=0.95 (slightly lower than the 0.999 of original) is standard for LLMs. Weight decay 0.1 typical. Lion (Chen et al., 2023) saves memory but doesn't consistently outperform. Distributed Shampoo demonstrated at scale (a constitutional-methods frontier lab, others) for slight efficiency gain. e.g. Most frontier models: AdamW · a constitutional-methods frontier lab: Distributed Shampoo (reported) · Some open: Lion

A4.3.1 Global batch size

Total tokens processed per gradient update. Industry standard: A leading open-weights flagship model used 16M tokens/batch. Frontier ranges 4M-32M tokens.

A4.3.2 Sequence packing

Concatenating multiple documents into single sequence to avoid padding waste. Industry standard: Standard. Documents joined with EOS separator. Some pipelines use document-attention masking to prevent cross-document attention.
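
A toy sketch of packing with an EOS separator and a document-aware causal mask. Token ids, the EOS id of 0, and the choice to drop the trailing partial sequence are all assumptions made for the example.

```python
def pack_documents(docs, seq_len, eos_id):
    """Concatenate docs (each followed by EOS) and cut into full sequences.
    Returns a list of (tokens, doc_ids) pairs; a trailing partial sequence is dropped."""
    stream, doc_ids = [], []
    for doc_index, tokens in enumerate(docs):
        for t in tokens + [eos_id]:
            stream.append(t)
            doc_ids.append(doc_index)
    n_full = len(stream) // seq_len
    return [(stream[i * seq_len:(i + 1) * seq_len],
             doc_ids[i * seq_len:(i + 1) * seq_len]) for i in range(n_full)]

def document_causal_mask(doc_ids):
    """mask[i][j] is True when position i may attend to position j."""
    n = len(doc_ids)
    return [[j <= i and doc_ids[i] == doc_ids[j] for j in range(n)] for i in range(n)]

docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = pack_documents(docs, seq_len=4, eos_id=0)
tokens, ids = packed[1]              # this sequence straddles two documents
mask = document_causal_mask(ids)
```

In the second packed sequence the last token belongs to a new document, so its mask row lets it attend only to itself instead of to the tail of the previous document.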

A4.3.3 Gradient accumulation

Accumulate gradients over micro-batches before update. Industry standard: Used to achieve large effective batch when memory limits per-device batch.

A4.4 Parallelism

Learning rate schedule shapes the optimization trajectory. Standard pattern: warmup (linear from 0 to peak over 1-3% of training) → main schedule (cosine decay to 10% of peak, or constant). Cooldown / annealing at end (decay further, sometimes with high-quality data only) is increasingly common. SOTA: A leading open-weights model: cosine decay over 15T tokens. Annealing phase: final 40B tokens with high-quality data + linear LR decay to 0. WSD (Warmup-Stable-Decay) schedule (Hu et al., 2024) demonstrated equivalent quality with simpler shape: warmup → constant → linear decay. Allows easier intermediate evaluation. e.g. A leading open-weights model: cosine + annealing · Many open models: WSD (MiniCPM)

A4.4.1 Data parallelism

Replicate model, split batch. Industry standard: Foundation. All other parallelism layers compose on top.

A4.4.2 Tensor parallelism

Split individual layer matrices across devices. Industry standard: a tensor-parallelism reference implementation style. Typically 8-way (within node, NVLink-bound).

A4.4.3 Pipeline parallelism

Split layers across devices, sequential micro-batches. Industry standard: GPipe / a tensor-parallelism reference framework-style 1F1B. Used for very large models. A leading open-weights flagship model uses 16-way pipeline.

A4.4.4 ZeRO / FSDP

Shard optimizer states, gradients, parameters across data-parallel ranks. Industry standard: ZeRO-3 / FSDP universal at frontier. Reduces per-device memory for the sharded state by ~N× across N data-parallel ranks.

A4.4.5 Sequence parallelism

Split sequence dimension across devices. Industry standard: Used for long-context training. Ring attention (Liu 2023) is reference.

A4.5 Loss

Checkpoint and recovery: at frontier scale, hardware fails frequently (1-5% nodes per day). Without robust recovery, days of work lost. Modern stacks: async checkpoint to object store every 1000-5000 steps, automatic node replacement, resume from latest checkpoint. SOTA: Async checkpoint writers (TorchSnapshot, custom) overlap checkpoint I/O with compute. Tiered storage: hot (NVMe) for last few checkpoints, cold (S3) for archive. Failure detection via heartbeat. A leading accelerator vendor NCCL handles transient communication failures. Automatic restart from latest checkpoint with new node assignment in minutes. e.g. Most frontier labs: async checkpoint, multi-tier storage · Open: torchsnapshot, custom

A4.5.1 Cross-entropy on next token

Standard autoregressive language modeling loss. Industry standard: Universal.

A4.5.2 Z-loss / aux losses

Auxiliary loss to stabilize logits scale. Industry standard: an earlier frontier model uses z-loss; helps numerical stability. MoE uses load-balancing aux loss (cross-link to A3.3.4).
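
A sketch of next-token cross-entropy with a z-loss term for a single position. The z-loss penalizes the squared log of the softmax normalizer; the coefficient of 1e-4 is an assumed illustrative value, and the function name is hypothetical.

```python
import math

def cross_entropy_with_z_loss(logits, target, z_coeff=1e-4):
    """Return (total, cross_entropy, z_loss) for one token position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))  # stable log-sum-exp
    cross_entropy = log_z - logits[target]
    z_loss = z_coeff * log_z ** 2      # pulls the normalizer log Z toward 0
    return cross_entropy + z_loss, cross_entropy, z_loss

total, ce, z = cross_entropy_with_z_loss([0.0, 0.0, 0.0, 0.0], target=0)
shifted_total, shifted_ce, shifted_z = cross_entropy_with_z_loss(
    [10.0, 10.0, 10.0, 10.0], target=0)
```

Adding a constant to every logit leaves the cross-entropy unchanged but raises the z-loss, which is how the term discourages logit scale from drifting.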

A4.6 Training Stability

Training stability. Catastrophic loss spikes can destroy weeks of work. Sources: numerical instability in attention/normalization, gradient explosions from outlier batches, hardware failures, NaN propagation. Detection: gradient norm monitoring, loss anomaly detection. Recovery: rollback to checkpoint, skip bad batch, lower learning rate. SOTA: A leading open-weights model paper documents stability work: pre-norm + careful weight init + gradient clipping at 1.0 + LR warmup. Single rank's hardware degradation can cause loss spike across full cluster (NCCL synchronization). Frontier: automatic anomaly detection on gradient norms, auto-rollback on spike. e.g. A leading open-weights model stability section · an earlier frontier model training notebook (post-mortem of spikes) · OPT paper (training instability documented)

A4.6.1 Gradient clipping

Clip gradient norm to prevent explosion. Industry standard: Universal. Typical max norm 1.0.
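
Global-norm clipping in plain Python, as a sketch. Gradients are represented as a list of flat lists (one per tensor); the small epsilon guarding the division is an assumption borrowed from common practice.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in tensor] for tensor in grads], total_norm

clipped, norm_before = clip_by_global_norm([[3.0], [0.0, 4.0]])
norm_after = math.sqrt(sum(g * g for tensor in clipped for g in tensor))

# Gradients already inside the limit pass through unchanged.
untouched, _ = clip_by_global_norm([[0.3], [0.4]])
```

Because one scale factor is shared by every tensor, the direction of the update is preserved and only its length is capped; the pre-clip norm is also the quantity usually logged for spike detection.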

A4.6.2 Weight initialization

Initial weight distribution. Industry standard: Truncated normal with std scaled by 1/sqrt(d) or layer-aware (e.g. An open replication initiative init).

A4.6.3 Loss spike recovery

Detection and rollback of training instabilities. Industry standard: A leading open-weights model paper documents rollback procedures. Detection via running variance of loss.
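
One possible shape for a running-variance detector, as a sketch. It keeps Welford running statistics of the loss and flags values far above the mean; the 6-sigma threshold, the 20-step minimum, and the class name are hypothetical, and a production detector would more likely use a sliding window because the loss trends downward over training.

```python
import math

class LossSpikeDetector:
    """Flags a loss value that sits far above the running mean."""

    def __init__(self, threshold_sigmas=6.0, min_steps=20):
        self.threshold_sigmas = threshold_sigmas
        self.min_steps = min_steps
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations (Welford)

    def update(self, loss):
        is_spike = False
        if self.count >= self.min_steps:
            std = math.sqrt(self.m2 / (self.count - 1))
            is_spike = loss > self.mean + self.threshold_sigmas * max(std, 1e-8)
        if not is_spike:   # keep spikes out of the baseline statistics
            self.count += 1
            delta = loss - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (loss - self.mean)
        return is_spike

detector = LossSpikeDetector()
history = [2.0 + 0.01 * ((i % 5) - 2) for i in range(50)]   # synthetic, stable losses
flags = [detector.update(value) for value in history]
spike_flag = detector.update(5.0)                            # synthetic spike
```

A flag from update would be the trigger for the rollback path: reload the latest checkpoint and skip or reorder the offending batches.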

A4.7 Training Telemetry

Training telemetry. Per-step metrics: loss, gradient norm, throughput (tokens/sec/GPU), MFU (Model FLOPs Utilization). Per-rank: latency variance, NCCL stalls, GPU utilization. Aggregated dashboards updated every step or every N steps. SOTA: Frontier MFU 40-50%; >55% rare. Per-rank slow-node detection is critical because a single slow rank slows the entire AllReduce. A leading accelerator vendor DCGM integrated with training logs. e.g. W&B for high-level metrics · DCGM for hardware · Custom dashboards at frontier
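
MFU can be estimated from throughput alone. The sketch assumes the common 6 × parameters FLOPs-per-token approximation (attention FLOPs ignored) and a peak of 1e15 BF16 FLOP/s per GPU, in line with the ~1000 TFLOPS figure quoted for current-generation accelerators in the Compute section; the throughput of 160 tokens/sec/GPU is a made-up input.

```python
def mfu(n_params, tokens_per_sec_per_gpu, peak_flops_per_gpu):
    """Model FLOPs Utilization: achieved model FLOP/s over hardware peak FLOP/s."""
    achieved = 6 * n_params * tokens_per_sec_per_gpu
    return achieved / peak_flops_per_gpu

example = mfu(n_params=405e9, tokens_per_sec_per_gpu=160, peak_flops_per_gpu=1e15)
```

With these inputs the estimate is about 0.39, just below the 40-50% band described above.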

A4.7.1 Loss curves

Training and validation loss over time. Industry standard: Universal. W&B / MLflow / proprietary.

A4.7.2 Gradient statistics

Per-layer gradient norms, ratio to weight norm. Industry standard: Standard. Early indicator of instability.

A4.7.3 Activation statistics

Per-layer activation norms, attention entropy. Industry standard: Used at frontier labs to detect early issues.

A4.8 Checkpointing

Checkpointing strategy. Async checkpoint to object store (S3/GCS) every 1000-5000 steps. Tiered storage: hot (NVMe) for last 3-5, cold (S3) for archive. Auto-resume from latest. Optimizer state is the largest part of a checkpoint: for AdamW, FP32 master weights, momentum, and variance total 12 bytes/param, about 6x the BF16 weights. SOTA: Distributed checkpoint formats (FSDP) save in shards that can be read in parallel. Frontier: a checkpoint every 1-2 hours of wall-clock time, retention 5-10 latest + monthly archives. e.g. TorchSnapshot (PyTorch) · a tensor-parallelism reference implementation distributed checkpoint · FSDP sharded state dict

A4.8.1 Checkpoint frequency

How often to save full state. Industry standard: Hourly to daily depending on cluster size. Frequency balances storage cost vs recovery cost.

A4.8.2 Checkpoint format

Serialization format and sharding. Industry standard: Sharded across DP ranks. a safer model serialization format emerging as safer alternative to pickle.

A4.8.3 Resumption logic

Loading and continuing from checkpoint. Industry standard: Includes RNG state, optimizer state, dataloader position.
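
A toy illustration of why RNG state and dataloader position belong in the checkpoint. It uses Python's random module and an in-memory dict in place of a real serialization format; the field names are hypothetical.

```python
import random

def make_checkpoint(step, weights, optimizer_state, dataloader_position, rng):
    """Capture everything needed to continue as if the run was never interrupted."""
    return {
        "step": step,
        "weights": list(weights),
        "optimizer_state": dict(optimizer_state),
        "dataloader_position": dataloader_position,
        "rng_state": rng.getstate(),
    }

def resume(checkpoint):
    rng = random.Random()
    rng.setstate(checkpoint["rng_state"])
    return checkpoint["step"], checkpoint["dataloader_position"], rng

rng = random.Random(1234)
warmup_draws = [rng.random() for _ in range(10)]     # randomness consumed before the save
ckpt = make_checkpoint(step=10, weights=[0.1, 0.2], optimizer_state={"m": [0.0, 0.0]},
                       dataloader_position=640, rng=rng)
expected_next = [rng.random() for _ in range(3)]     # what the uninterrupted run draws next

step, position, restored_rng = resume(ckpt)
resumed_next = [restored_rng.random() for _ in range(3)]
```

The resumed run draws exactly the values the uninterrupted run would have drawn, so dropout masks and data shuffling stay reproducible across a restart.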

Compute

21 sub-endpoints mapped

MZN Provisional Position · Gap

No cluster under solo operation; compute is Phase 3 partnership scope

Phase 1 and Phase 2 produced the portfolio without frontier-class compute. Hardware-level monitoring methodology documented at the metric level. Cluster-scale compute access is an acknowledged partnership requirement.

Definition

Compute infrastructure is the physical substrate. GPU/a custom-silicon accelerator acquisition, network topology, storage. Frontier training requires homogeneous, high-bandwidth GPU clusters with InfiniBand interconnect. Inference requires either similar clusters (for largest models) or commodity GPU with optimization. The compute supply chain is a strategic constraint: GPU access is gated by a leading accelerator vendor allocation and capital.

State of the Art (2025–2026)

a current-generation accelerator (80GB, 700W, $25-40K/GPU) is the frontier workhorse since 2023. A current-generation accelerator (141GB, late 2024) and a next-generation accelerator/a Blackwell-class architecture (192GB, 2025) succession. A multimodal frontier lab a custom-silicon accelerator / v6e for a constitutional-methods frontier lab, a multimodal frontier lab. Frontier clusters: 16K-100K+ GPUs with non-blocking InfiniBand 400-800Gbps. CoreWeave, Lambda Labs, Crusoe provide alternative-cloud GPU access at lower cost than hyperscalers.

Key Decisions

Hardware (a current-generation accelerator, a current-generation accelerator, a next-generation accelerator, a custom-silicon accelerator)
Cluster size (1K to 100K)
Network topology (rail-optimized, fat-tree, dragonfly)
Cloud vs. owned
Storage tier

Trade-offs

Owned → capex + control
Cloud → opex + flexibility
Larger cluster → frontier-capable, harder utilization

Numbers & Ablations

a current-generation accelerator economics: $25-40K capex, ~$2-3/hour cloud rental, 700W TDP. xAI Colossus = 100K a current-generation accelerator × $30K = $3B GPU alone (excludes datacenter, network, power).
InfiniBand NDR (400Gbps): ~$2K per port. 16K-GPU cluster = ~$40M network alone. Spectrum-X Ethernet ~30% cheaper.
a next-generation accelerator (a Blackwell-class architecture): 192GB HBM3e, 2.5× a current-generation accelerator effective throughput, NVLink Switch enables 72-GPU coherent domain. ~$40-60K/GPU.
Power infrastructure: frontier datacenter requires 100-300MW dedicated power. 100K a current-generation accelerator cluster = ~70MW IT load + ~30% PUE overhead = ~90MW total.
Cluster utilization at frontier: 80-90% sustained during training, 30-50% during ablation phases. Underutilization is real cost.
Failure rates: a current-generation accelerator ECC corrections ~1-10/day/GPU normal; >100/day flag for replacement. Mean time to replacement 2-7 days at frontier.
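
The capex and power figures above follow from simple multiplication, sketched below with the per-GPU price ($30K), TDP (700W), and ~30% overhead taken from this list. The function names are hypothetical, and the overhead is treated as a flat fraction of IT load.

```python
def gpu_capex_usd(n_gpus, price_per_gpu_usd):
    """GPU purchase cost only; excludes datacenter, network, and power."""
    return n_gpus * price_per_gpu_usd

def facility_power_mw(n_gpus, watts_per_gpu, overhead_fraction):
    """Return (GPU IT load, total facility load) in megawatts."""
    it_load_mw = n_gpus * watts_per_gpu / 1e6
    return it_load_mw, it_load_mw * (1.0 + overhead_fraction)

capex = gpu_capex_usd(100_000, 30_000)
it_load, total_load = facility_power_mw(100_000, 700, 0.30)
```

This gives $3B of GPU capex and roughly 70MW of GPU load rising to about 91MW with overhead, consistent with the rounded figures quoted above.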

Open Questions

Is there a near-term alternative to a leading accelerator vendor hardware lock-in for training? AMD MI300X, a wafer-scale accelerator vendor CS-3, a multimodal frontier lab a custom-silicon accelerator competitive; software ecosystem gap remains the gating factor.
Confidential compute (a leading accelerator vendor CC, a hyperscaler platform Nitro for GPU): production-ready or theatre? a constitutional-methods frontier lab uses a hyperscaler platform Nitro for third AI Safety Level-relevant workloads; performance overhead poorly characterized publicly.
Optimal cluster size: when does adding GPUs hurt training (failure rate × MFU degradation)? Reported sweet spots vary 16K-32K.
Power constraints will dominate by 2027-2028: cluster size limited not by capital but by available 100-500MW datacenter sites. Geographic distribution implications unclear.

Reference analyst note. Compute infrastructure is becoming a real estate / power infrastructure business as much as a hardware business. A synthetic-data-focused lab signing 20-year nuclear PPA with Three Mile Island, xAI building gas turbines on-site at Memphis, Stargate's $500B announcement — these reflect that the actual frontier constraint by 2027 is gigawatt-class power, not GPU supply. National strategic positioning of compute (US export controls on H800 to China, EU sovereign cloud requirements) is now first-order policy. Anyone serious about frontier needs to think 5+ years ahead about power and land, not just GPU procurement.

Examples

xAI Colossus: 100K a current-generation accelerator single cluster (2024) · an open-weights frontier lab: ~600K a current-generation accelerator equivalent (2024 reported) · a constitutional-methods frontier lab: a hyperscaler platform a hyperscaler accelerator + GCP a custom-silicon accelerator · an open-weights frontier provider: 2K H800 (export-restricted, smaller scale)

References (Academic)

A leading accelerator vendor a current-generation accelerator datasheet · Selene cluster paper (a leading accelerator vendor)

Sub-endpoint anatomy — 21 items mapped

A5.1 Hardware

GPU choice. A leading accelerator vendor dominance: a current-generation accelerator/a current-generation accelerator/a next-generation accelerator lineage. AMD MI300X gaining inference share. A multimodal frontier lab a custom-silicon accelerator for a multimodal frontier lab/a constitutional-methods frontier lab. Custom ASIC efforts (a wafer-scale accelerator vendor, a high-throughput inference accelerator, a hyperscaler platform a hyperscaler accelerator) for specific workloads. Frontier training is overwhelmingly a leading accelerator vendor-on-InfiniBand. SOTA: a next-generation accelerator (a Blackwell-class architecture, 2025): 192GB HBM3e, 2.5x a current-generation accelerator throughput, NVLink Switch enables 72-GPU coherent domain. Drives 2025-2026 frontier capacity. AMD MI300X: 192GB, competitive on inference, weaker software stack. a high-throughput inference accelerator LPU: extreme inference latency for production (open-weights-style models). e.g. Frontier labs: a leading accelerator vendor a current-generation accelerator/a current-generation accelerator/a next-generation accelerator · a multimodal frontier lab/a constitutional-methods frontier lab: a custom-silicon accelerator/v6 · a high-throughput inference accelerator: production inference

A5.1.1 a leading accelerator vendor GPU

Prior-generation, current-generation, and next-generation (a Blackwell-class architecture) accelerators. Industry standard: Dominant at the frontier. Current-generation accelerators most common 2024-2025; next-generation accelerators ramping 2025-2026.

+ deeper detail (3 leaves)

A5.1.1.1 a prior-generation accelerator 80GB HBM, FP16/BF16 313 TFLOPS. Industry standard: Standard 2020-2023. Still used for many production deployments.
A5.1.1.2 current-generation accelerators (a Hopper-class architecture) 80-141GB HBM, BF16 ~1000 TFLOPS, FP8 support. Industry standard: Dominant 2024-2025. A leading open-weights model was trained on 24K H100s. A current-generation frontier model is estimated at 25K A100s; a high-throughput frontier model was trained on a current-generation accelerator cluster.
A5.1.1.3 next-generation accelerators (a Blackwell-class architecture) Newest a leading accelerator vendor. ~2× FP8 throughput vs a current-generation accelerator. Industry standard: Ramp 2025-2026. New frontier training runs migrating.

A5.1.2 a multimodal frontier lab a custom-silicon accelerator

Successive generations of a custom-silicon accelerator, including v5e and v5p. Industry standard: Used internally by a multimodal frontier lab (a multimodal frontier model, an earlier frontier model family). Available externally only through that lab's cloud platform.

A5.1.3 Custom accelerators

a wafer-scale accelerator vendor (wafer-scale), a custom accelerator vendor, a high-throughput inference accelerator (inference), a hyperscaler accelerator (Amazon). Industry standard: Niche. Some used for specific workloads (a high-throughput inference accelerator for inference).

A5.2 Cluster Topology

Network topology determines training scalability. Frontier clusters use non-blocking InfiniBand fabric: every GPU can communicate at full bandwidth with any other GPU simultaneously. Topology choices: fat-tree (oversubscribed at upper levels but cost-effective), rail-optimized (a leading accelerator vendor recommendation), dragonfly (hyperscaler scale). SOTA: Frontier clusters: NDR InfiniBand 400Gbps (some 800Gbps, 2025+). Spectrum-X Ethernet (a leading accelerator vendor, 2024) emerging as alternative. Rail-optimized topology: each GPU has a dedicated NIC and rails are connected via a spine, which minimizes hot spots. NVLink Switch (a next-generation accelerator era): 72-GPU NVLink domain enables tensor parallelism across more GPUs without an IB hop. e.g. xAI Colossus: 100K current-generation accelerators, rail-optimized IB · an open-weights frontier lab Grand Teton + RoCE · a constitutional-methods frontier lab: a custom-silicon accelerator pods (mesh)

A5.2.1 InfiniBand vs RoCE

Inter-node fabric: NDR/HDR InfiniBand or RDMA over Converged Ethernet (RoCE). Industry standard: InfiniBand dominant for new builds. 400Gbps NDR per port standard. RoCE is used on some hyperscaler platforms.

A5.2.2 Node count

Total nodes in cluster. Industry standard: Frontier clusters: 3000-20000 nodes. A leading open-weights model ~3000 nodes (24576 GPUs).

A5.2.3 GPUs per node

Typically 8 GPUs per node, NVLink-connected. Industry standard: 8× a current-generation accelerator per node standard. NVLink ~900GB/s intra-node.

A5.3 Storage

Storage tier supports training I/O. Hot tier: high-throughput parallel filesystem (Lustre, WekaFS, GPFS) for active dataset and recent checkpoints. Cold tier: object store (S3, GCS) for archive. Bandwidth requirement: 100s GB/s aggregate to keep GPUs fed during data loading. SOTA: WekaFS, VAST Data, DDN are common at frontier. A leading accelerator vendor GPUDirect Storage allows GPU-direct I/O bypassing CPU for ~50% throughput improvement. S3-compatible object stores (S3, GCS, Cloudflare R2) for cold. Asynchronous prefetch and on-the-fly decompression (Zstandard) standard. e.g. Frontier: WekaFS or VAST + S3 · a constitutional-methods frontier lab: GCS + custom · an open-weights frontier lab: Tectonic + Haystack

A5.3.1 Checkpoint storage

Where checkpoints are written and from where they are loaded. Industry standard: High-performance parallel filesystems (Lustre, GPFS, WekaFS). Bandwidth ~TB/s required for fast checkpoint.
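
A rough sizing sketch shows why checkpoint bandwidth in the TB/s range matters. It assumes the checkpoint holds weights plus optimizer state at a fixed number of bytes per parameter; the 14 bytes/parameter figure, the 405e9 parameter count, and the two bandwidths are illustrative assumptions, not figures from this atlas.

```python
def checkpoint_write_seconds(n_params, bytes_per_param, bandwidth_bytes_per_s):
    """Time to write one full checkpoint at a sustained aggregate bandwidth."""
    checkpoint_bytes = n_params * bytes_per_param
    return checkpoint_bytes / bandwidth_bytes_per_s

# Hypothetical model: 405e9 parameters, 14 bytes/param
# (assumed: bf16 weights + fp32 master copy + two fp32 Adam moments).
for label, bandwidth in [("100 GB/s", 100e9), ("1 TB/s", 1e12)]:
    seconds = checkpoint_write_seconds(405e9, 14, bandwidth)
    print(f"{label}: {seconds:.1f} s per checkpoint")
```

At these assumed sizes, a tenfold bandwidth difference separates a stall of about a minute from one of a few seconds, which compounds when checkpoints are taken frequently.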

A5.3.2 Data loading

Pre-shuffled, pre-tokenized shards streamed to nodes. Industry standard: Shuffled, indexed, pre-tokenized formats. Avoid per-step computation; load is ~constant per step.

A5.4 Cluster Monitoring

Cluster monitoring and telemetry. At frontier scale, observability is an operational requirement. Per-GPU metrics: utilization, memory, power, temperature, ECC errors. Cluster-level: AllReduce throughput, collective communication stalls, network packet loss. Failure prediction (predicting GPU failure before it happens) is active research. SOTA: A leading accelerator vendor's DCGM (Data Center GPU Manager) is the standard agent. Prometheus + Grafana for visualization. Custom layers add training-aware metrics (loss spike detection, gradient norm tracking). Frontier labs deploy ML-based anomaly detection on telemetry. ML/security overlay products on GPU telemetry are an emerging commercial category. e.g. DCGM + Prometheus standard · a leading accelerator vendor's Run:ai for cluster scheduling · Custom dashboards everywhere

A5.4.1 GPU utilization

MFU (Model FLOPs Utilization), HFU (Hardware FLOPs Utilization). Industry standard: Frontier labs target MFU 40-55%. A leading open-weights model paper reports 38-43% MFU on 16K a current-generation accelerator.
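
MFU is commonly estimated as achieved model FLOPs per second divided by aggregate peak hardware FLOPs per second. The sketch below assumes the widely used approximation of 6 FLOPs per parameter per token for a dense transformer (forward plus backward pass); the run figures are hypothetical.

```python
def model_flops_utilization(n_params, tokens_per_second, n_gpus, peak_flops_per_gpu):
    """Estimate MFU using the 6 * N FLOPs-per-token approximation (an assumption)."""
    achieved = 6.0 * n_params * tokens_per_second  # model FLOPs/s actually performed
    available = n_gpus * peak_flops_per_gpu        # aggregate peak FLOPs/s
    return achieved / available

# Hypothetical run: 70e9 parameters, 1.0e6 tokens/s, 1024 GPUs at 1e15 peak FLOPs/s each.
mfu = model_flops_utilization(70e9, 1.0e6, 1024, 1.0e15)
print(f"MFU = {mfu:.1%}")  # MFU = 41.0%
```

HFU differs in that it also counts recomputed FLOPs (for example from activation checkpointing), so it is at least as high as MFU for the same run.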

A5.4.2 Network bandwidth

Inter-node communication monitoring. Industry standard: Critical for tensor + pipeline parallelism. Saturation indicates communication bottleneck.

A5.4.3 Hardware failure detection

Detecting GPU/node failures, silent data corruption. Industry standard: A leading open-weights model reported 419 unexpected interruptions over a 54-day window on a 16K-GPU cluster (roughly 8 per day), most of them hardware-related. Automated detection + restart from checkpoint.

A5.5 Cost

Training cost economics. Frontier training costs: $50M-$500M+ in compute. Real costs include data prep, ablations (10-100 small runs at 20-30% of full-train compute), staff, failures. Total program cost typically 2-5× raw training compute. SOTA: A leading open-weights flagship model: ~16K H100s × ~30 days × 24 h/day × $2-3/GPU-hour ≈ $23-35M cloud-rented for the final run. An open-weights frontier model (V3 class): $5.6M reported (final run only; excludes ablations). e.g. A leading open-weights flagship model: ~$23-35M (estimated from the rental arithmetic above) · an open-weights frontier model (V3 class): $5.6M reported · xAI Colossus build: $4B+ for 100K current-generation accelerators
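
The rental arithmetic can be checked with a few lines. This sketch assumes exactly 16,000 GPUs running 24 hours a day for 30 days and applies the 2-5× program multiplier quoted in this section; the function name is illustrative.

```python
def rental_cost_usd(n_gpus, days, usd_per_gpu_hour):
    """Raw cloud-rental cost of one run: GPU-hours times hourly rate."""
    gpu_hours = n_gpus * days * 24
    return gpu_hours * usd_per_gpu_hour

low = rental_cost_usd(16_000, 30, 2.0)   # 23,040,000
high = rental_cost_usd(16_000, 30, 3.0)  # 34,560,000
print(f"final run:     ${low / 1e6:.1f}M to ${high / 1e6:.1f}M")
print(f"program total: ${2 * low / 1e6:.1f}M to ${5 * high / 1e6:.1f}M")
```

Under these inputs the final run alone comes to roughly $23-35M, and the program-level estimate spans roughly $46-173M.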

A5.5.1 Training cost estimation

Total compute cost for a training run. Industry standard: A leading open-weights flagship model estimated $50-100M training cost. A current-generation frontier model estimated $100M+. Frontier costs scale with parameter count and tokens.

A5.5.2 Cost per token (inference)

$/M tokens for serving. Industry standard: a frontier multimodal model ~$5/M input. A long-context frontier model ~$3/M input. Open models on a current-generation accelerator: $0.20-1.00/M depending on size.

SFT

19 sub-endpoints mapped

MZN Provisional Position · Partial

Demonstration-data shaping methodology documented

Conceptual framework for SFT data shaping is documented at architectural level. Slot-based memory and structured demonstration patterns inform the methodology. Production SFT runs at frontier scale require partnership scope.

Definition

SFT (Supervised Fine-Tuning) takes a pre-trained base model — which is a powerful text completer but not an assistant — and trains it on instruction-response pairs to behave as an assistant. The model learns the chat template, role conventions, refusal patterns, and the basic shape of helpful responses. SFT is universally the first post-training stage; everything else builds on it.

State of the Art (2025–2026)

Quality > quantity is the consensus since LIMA (Zhou et al., 2023) demonstrated that 1000 highly curated examples can rival instruction datasets tens of times larger. Frontier SFT mixtures include: human-written conversations (leading frontier labs use 100K-1M+), reasoning chains (long CoT exemplars), tool-use traces, code with patches, math with solutions. Synthetic SFT (teacher model generates) is increasingly common via self-instruct methodology, Evol-Instruct, Magpie.

Key Decisions

Dataset size (10K - 10M+)
Synthetic vs human mix
Multi-turn conversation depth
Tool-use data inclusion
Math/code ratio
Multilingual SFT
Number of epochs (typically 2-5)

Trade-offs

More data → diminishing returns past ~100K well-curated
Synthetic-heavy → cheaper, distributional artifacts
Multi-turn → conversational fluency, costs in curation

Numbers & Ablations

LIMA: 1000 high-quality examples rival ~50K lower-quality instruction examples (Zhou 2023). Quality dominance demonstrated.
Synthetic SFT efficiency: Magpie (self-generated from base model) produced datasets matching ShareGPT quality at <1% cost.
SFT epoch count: typically 2-5 for instruction tuning, 1-2 for continued pre-training. Beyond 5 epochs: overfitting on style without capability gain.
Multi-turn data ratio in modern frontier SFT: 60-80% multi-turn, 20-40% single-turn. ~5-15 average turns in multi-turn examples.
Tool-use data: frontier models trained on 100K-1M+ tool-calling examples. xLAM-function-calling-60k is among the larger open datasets.

Open Questions

What is the marginal value curve of SFT data? After ~100K well-curated, does the curve flatten or continue rising slowly?
Synthetic vs human SFT data: where exactly do they diverge? Anecdotally synthetic struggles with creative tasks, edge cases — no rigorous study.
SFT mixing ratios for multi-skill (chat + code + math + tool-use): no published ablation studies at scale.
Does SFT actually teach new capability or just elicit / format pre-trained capability? Evidence (LIMA, Magpie) suggests mostly elicitation; deep SFT studies absent.

Reference Analyst Note

SFT is dramatically underrated and over-tuned. Most labs spend too much on SFT data scale (millions of examples) and not enough on quality + diversity. The optimal frontier SFT corpus is probably 100K-500K examples curated to within an inch of their lives. SFT-then-RL is the path; trying to push everything into SFT (Tulu approach) hits diminishing returns visible in current open community.

Examples

A leading open-weights model SFT: ~10M examples mix (human + synthetic) · OpenAssistant: 161K human conversations (open) · Magpie: synthetic from base model self-conversation · Hermes / Nous: open SFT-tuned models

References (Academic)

Zhou et al., LIMA (2023) · Wang et al., self-instruct methodology (2022) · Xu et al., Evol-Instruct (2023) · Xu et al., Magpie (2024)

Sub-endpoint anatomy — 19 items mapped

B1.1 Demonstration Data

Dataset construction strategy. Three sources: (1) human-written conversations (highest quality, expensive), (2) synthetic from teacher model (cheap, scales), (3) curated from existing data (filtered StackExchange, ShareGPT-style). Frontier mix: weighted combination, with diversity sampling. SOTA: leading frontier labs use predominantly human-written for highest quality bands; synthetic for breadth. Open community converged on Magpie-style synthetic + selective human curation. Quality scoring (using stronger teacher model as judge) filters mixed sources. e.g. a constitutional-methods frontier lab Helpful & Harmless dataset (older) · ShareGPT (community, mixed quality) · OpenAssistant Conversations

B1.1.1 Human-written demonstrations

Trained annotators produce ideal responses. Industry standard: InstructGPT used ~13K human demonstrations. Frontier labs use larger, often paid annotators (a major annotation platform, etc.).

B1.1.2 Synthetic demonstrations

LLM-generated responses, often filtered or rewritten by humans. Industry standard: self-instruct methodology (Wang 2022), an early instruction-tuning initiative, a community fine-tuning initiative. Frontier labs increasingly synthetic-heavy.

B1.1.3 Filtered web data

Naturally-occurring instruction-response pairs from web (StackOverflow, forums). Industry standard: Used as additional source. A synthetic-SFT-heavy initiative, a community fine-tuning initiative derive from this approach.

B1.1.4 Quality vs quantity

Trade-off between dataset size and per-example quality. Industry standard: LIMA (Zhou 2023) showed 1000 high-quality examples can rival 50K mediocre. Quality dominates.

B1.2 Training Procedure

Multi-turn conversation training. Single-turn SFT teaches a single response; multi-turn teaches dialogue management: context tracking, personality consistency, refusal at appropriate turns. Frontier models are trained extensively on multi-turn data (5-20 turns). SOTA: Multi-turn SFT data includes: branching conversations (alternative responses), correction/follow-up patterns, mid-conversation context shifts, tool-use loops within conversation. Loss is masked so that only assistant turns contribute, not user turns. e.g. WildChat: 1M+ real-world conversations with a consumer LLM chat product (open) · a constitutional-methods frontier lab HH dataset: multi-turn with assistant refusals

B1.2.1 Loss masking

Compute loss only on response tokens, not prompt tokens. Industry standard: Standard. Prevents model from learning to predict prompts.
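
A minimal sketch of loss masking in plain Python, assuming per-token log-probabilities are already available. The sentinel value -100 mirrors a common framework convention for ignored label positions; the helper names and the toy numbers are illustrative.

```python
import math

IGNORE = -100  # sentinel label for positions excluded from the loss (assumed convention)

def build_labels(token_ids, is_response):
    """Keep the target id on response tokens; mark prompt tokens as ignored."""
    return [tok if keep else IGNORE for tok, keep in zip(token_ids, is_response)]

def masked_nll(token_logprobs, labels):
    """Mean negative log-likelihood over non-ignored positions only."""
    kept = [-lp for lp, lab in zip(token_logprobs, labels) if lab != IGNORE]
    return sum(kept) / len(kept) if kept else 0.0

# Toy sequence: three prompt tokens followed by two response tokens.
token_ids = [11, 12, 13, 21, 22]
is_response = [False, False, False, True, True]
# Hypothetical log-probabilities the model assigns to each correct token.
token_logprobs = [math.log(p) for p in (0.1, 0.2, 0.3, 0.5, 0.25)]

labels = build_labels(token_ids, is_response)
loss = masked_nll(token_logprobs, labels)
print(labels, round(loss, 4))
```

Only the last two positions enter the average, so poorly predicted prompt tokens have no effect on the gradient signal.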

B1.2.2 Learning rate (lower than pre-training)

Typical LR 1e-5 to 1e-6 (pre-training is 1e-4 range). Industry standard: 1-2 orders of magnitude lower than pre-training peak LR.

B1.2.3 Epoch count

How many passes over SFT data. Industry standard: 1-3 epochs typical. More risks overfitting on small datasets.

B1.3 Task Coverage

Tool-use SFT data. Trains model to call functions, interpret structured results, and reason over tool outputs. Critical for agent applications. Format: chat with function_call and function_result special tokens, structured JSON arguments, multi-step tool use. SOTA: Frontier models (a long-context frontier model, a frontier multimodal model) trained on millions of tool-use examples. Synthetic generation: model X plays user with task → model Y plays assistant with tool access → trace recorded. Multi-tool, parallel tool calls, tool errors handled. xLAM, Hermes-Function-Calling, Glaive open datasets. e.g. xLAM-function-calling-60k · Glaive-function-calling-v2 · a constitutional-methods frontier lab computer use traces (closed)
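
One way to picture a recorded tool-use trace is as a list of role-tagged messages with JSON-encoded arguments. The field names (`function_call`, `function_result`, `arguments`) and the `get_weather` tool below are hypothetical; real datasets and chat templates each define their own schema.

```python
import json

# Hypothetical single-tool trace: user task, assistant call, tool result, final answer.
trace = [
    {"role": "user", "content": "What is the weather in Paris?"},
    {"role": "assistant",
     "function_call": {"name": "get_weather", "arguments": '{"city": "Paris"}'}},
    {"role": "function_result", "name": "get_weather", "content": '{"temp_c": 18}'},
    {"role": "assistant", "content": "It is 18 degrees Celsius in Paris."},
]

def validate_trace(messages):
    """Check that call arguments parse as JSON and each call gets a matching result."""
    for i, msg in enumerate(messages):
        call = msg.get("function_call")
        if call is None:
            continue
        json.loads(call["arguments"])  # raises ValueError on malformed JSON
        nxt = messages[i + 1] if i + 1 < len(messages) else None
        if nxt is None or nxt.get("role") != "function_result":
            return False
        if nxt.get("name") != call["name"]:
            return False
    return True

print(validate_trace(trace))
```

A filter of this kind is a cheap first pass when generating synthetic traces, before any judge model scores the content.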

B1.3.1 General instruction following

Open-ended Q&A, summarization, rewriting. Industry standard: Foundation. FLAN-style task mixtures typical.

B1.3.2 Reasoning / chain-of-thought

Multi-step reasoning demonstrations. Industry standard: CoT prompting becomes CoT training data. Math, code, logical reasoning examples.

B1.3.3 Tool use / function calling

Demonstrations of correct function call format. Industry standard: Increasingly part of SFT. A leading open-weights model, a long-context frontier model, a current-generation frontier model all SFT'd on tool examples.

B1.4 Cultural & Multilingual Coverage

Reasoning SFT (chain-of-thought training). Teaches the model to produce intermediate reasoning before the final answer. Precursor to RL-trained reasoning models (o1, R1). Datasets include math problems with worked solutions, code with debugging traces, multi-step logical puzzles. SOTA: Long-CoT SFT (o1-style) involves traces of thousands of tokens of reasoning, with self-correction, exploration, backtracking. An open-weights reasoning model demonstrated this can be achieved via pure RL from base; SFT distillation transfers reasoning to smaller models. OpenThoughts and Bespoke-Stratos are open distillation datasets. e.g. OpenThoughts: 114K reasoning traces · Bespoke-Stratos-17k · MetaMath: reasoning-augmented math

B1.4.1 Multilingual SFT data

Demonstrations in multiple languages. Industry standard: Frontier labs include 10+ languages typically. Quality varies by language.

B1.4.2 Cultural calibration

Region/culture-specific norms and conventions. Industry standard: Limited at frontier; mostly Western-centric. Active research direction.

B1.5 SFT Evaluation

SFT evaluation. Track: instruction-following (IFEval), helpfulness (judges, win-rate), refusal (XSTest), perplexity on held-out chat. Compare against base model and previous SFT version. Track per-domain: code (HumanEval), math (GSM8K), reasoning (MMLU). SOTA: AlpacaEval 2.0, Arena-Hard standard public eval. IFEval for instruction following (~85% frontier). Internal: head-to-head LLM-as-judge vs prior version. Frontier labs: hundreds of eval slices, each tracked per SFT run. e.g. AlpacaEval 2.0 (Dubois 2024) · Arena-Hard-Auto · IFEval (Zhou 2023)

B1.5.1 Held-out demonstration loss

Cross-entropy on held-out instructions. Industry standard: Basic check; correlates with quality but imperfectly.

B1.5.2 MT-Bench / AlpacaEval

LLM-as-judge benchmarks for instruction following. Industry standard: MT-Bench (Zheng 2023), AlpacaEval (Li 2023). Standard for SFT comparison.

Preference Optimization

29 sub-endpoints mapped

MZN Provisional Position · Partial

Output-conformance methodology informs preference design

An output-conformance paradigm reframes preference signal as egress-template adherence — an inversion of the input-blacklist approach. Reduces reward-hacking surface and ties preference optimization to verifiable outputs. Methodology documented; full RLHF/DPO pipeline execution requires partnership scope.

Definition

Preference alignment improves the SFT model's quality, helpfulness, and harmlessness using comparison data: humans (or AI) compare two model outputs and indicate which is preferred. The model learns from pairwise preferences, not single-target answers. Three main methods: RLHF (PPO with reward model), DPO (direct preference optimization, no separate RM), Constitutional methods (AI-generated preferences via principles). Preference alignment moves models from 'competent' to 'good'.

State of the Art (2025–2026)

DPO (Rafailov et al., 2023) became the dominant 2024 method for its simplicity — no PPO, no separate reward model, single training stage. PPO-based RLHF still used at frontier (a leading frontier lab, possibly a constitutional-methods frontier lab). Constitutional methods / RL-from-AI-Feedback (RLAIF) (a constitutional-methods frontier lab) generates preferences via AI-judged adherence to principles, avoiding human annotation cost. Iterative DPO and online DPO push quality further.

Key Decisions

Method (DPO, PPO, IPO, KTO, ORPO, RL-from-AI-Feedback (RLAIF))
Preference data source (humans, AI judges, both)
Preference data scale (10K - 1M+)
Iteration count (single pass, iterative)
Reference model choice (SFT vs. previous DPO)

Trade-offs

DPO: simpler, can over-fit preferences, drift from SFT
PPO: harder to train, more controllable
RL-from-AI-Feedback (RLAIF): cheaper, depends on judge quality

Numbers & Ablations

DPO vs PPO: DPO ~5-10% lower compute, comparable or slightly better quality on standard benchmarks (Rafailov 2023). PPO retains edge on hard alignment categories per a leading open-weights model paper.
Iterative DPO: a leading open-weights model used 4-6 rounds; each round +1-3% on AlpacaEval but diminishing.
RL-from-AI-Feedback (RLAIF) vs RLHF preference quality: ~80-90% agreement at category level (Lee 2023). RL-from-AI-Feedback (RLAIF) cheaper by ~50× (no human annotators).
Process Reward Models (PRM) on math: ~5-10% accuracy gain over outcome-only on MATH/GSM8K (Lightman 2023).
An open-weights reasoning model's reasoning training: pure RL from base model with rule-based rewards (correct=1, incorrect=0). The RL-only variant raised AIME 2024 pass@1 from ~16% to ~71%; the full pipeline reports ~80%.
Constitutional methods: ~70% reduction in human annotation cost with quality matching RLHF on helpfulness/harmlessness benchmarks (Bai 2022).
Length bias: vanilla DPO produces ~25-40% longer responses than reference SFT — pure length artifact (Singhal 2023). LC-AlpacaEval, SimPO control for this.

Open Questions

Is RLHF (PPO-based) actually better than DPO at frontier scale? Open community converged on DPO; closed labs (a leading frontier lab, possibly a constitutional-methods frontier lab) retain PPO. No public head-to-head at 70B+ scale.
Reward model scaling: does a 70B RM provide meaningfully better signal than 13B? Limited public ablation.
Process Reward Models beyond math: PRMs work in math (verifiable steps); do they work in code, reasoning, writing? Active but unclear research area.
RLVR generalization: an open-weights reasoning model trained on math/code generalized to other reasoning domains. Why? Mechanistic understanding absent.
Constitutional methods: how much of its quality comes from the constitution document quality vs the RL-from-AI-Feedback (RLAIF) process? a constitutional-methods frontier lab's constitution is unusually detailed; lower-effort constitutions may not transfer.

Reference Analyst Note

RLHF as a method is mostly cargo-culted. The actual win at frontier comes from: (a) high-quality SFT, (b) RL-from-AI-Feedback (RLAIF) for breadth, (c) RLVR for verifiable tasks, (d) human RLHF only for irreducibly subjective categories. The DPO-vs-PPO debate is a sideshow — both work, choice is engineering preference. The real frontier shift in 2025-2026 is 'preference alignment' becoming 'reasoning alignment' — RL signal moving from human preference to verifiable correctness for hard tasks. This is the most important post-training shift since RLHF itself.

Examples

A leading open-weights model: iterative DPO + RLHF mix · a constitutional-methods frontier lab's leading frontier model: Constitutional methods + RLHF · a leading frontier lab: PPO-based RLHF (historical, current details closed) · Open: Tulu 3 (UltraFeedback DPO + RLVR)

References (Academic)

Christiano et al., RLHF (2017) · Ouyang et al., InstructGPT (2022) · Bai et al., Constitutional methods (2022) · Rafailov et al., DPO (2023) · Lambert et al., Tulu 3 (2024)

Sub-endpoint anatomy — 29 items mapped

B2.1 Preference Data Collection

Reward model (RM) training: a model that takes (prompt, response) and outputs scalar quality score. Trained on pairwise preference data with Bradley-Terry loss. RM is then used in PPO to optimize policy. RM quality is the bottleneck for RLHF. SOTA: RM typically initialized from SFT model. Trained on 100K-1M preference pairs. Modern variants: process reward models (PRM) score each reasoning step (better for math/code), generative reward models (output critique then score), reward model ensembles. Reward hacking is the central pathology — model finds responses RM scores high but humans wouldn't. e.g. A leading open-weights model RM: 70B param · a constitutional-methods frontier lab helpfulness/harmlessness RMs · Skywork RM: open frontier RM

B2.1.1 Pairwise comparison

Annotators choose between two responses. Industry standard: Dominant. InstructGPT, a 2023-generation open-weights model, a leading frontier model all use pairwise. Easier than absolute rating.

B2.1.2 Listwise / ranked

Annotators rank K responses. Industry standard: Used in some pipelines. Higher cost per annotation but more signal.
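
The extra signal from a ranking comes from the pairs it implies: a ranking of K responses yields K(K-1)/2 pairwise comparisons. A small sketch, assuming the list is ordered best to worst and contains no ties:

```python
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Expand a best-to-worst ranking into (chosen, rejected) pairs."""
    # combinations() preserves input order, so the first element always ranks higher.
    return [(better, worse) for better, worse in combinations(ranked_responses, 2)]

pairs = ranking_to_pairs(["A", "B", "C", "D"])
print(len(pairs), pairs[0])
```

Pairs derived from the same ranking are correlated, so pipelines that use this expansion often keep them together in one batch rather than shuffling them across the dataset.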

B2.1.3 Absolute Likert ratings

1-5 or 1-7 scale ratings. Industry standard: Less common for preference learning due to inter-annotator variance. Used in eval.

B2.2 Annotator Design

PPO (Proximal Policy Optimization) is the original RLHF algorithm. The policy (LLM) generates responses, the reward model scores them, and PPO updates the policy to maximize reward while staying close to a reference model (KL penalty). Notoriously fiddly to train: hyperparameter sensitivity, reward model bottleneck, mode collapse, reward hacking. SOTA: PPO is still used at frontier despite DPO's rise, reportedly at a leading frontier lab and in parts of a constitutional-methods frontier lab's stack. Improvements: GRPO (an open-weights frontier provider) removes the critic and uses a group-relative advantage. RLOO (REINFORCE Leave-One-Out) is a simpler PPO alternative. Online iterative variants update RM and policy alternately. e.g. A leading frontier lab's InstructGPT/consumer LLM chat product lineage · an open-weights reasoning model: GRPO · a leading open-weights model: PPO in addition to DPO
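
The group-relative advantage that lets GRPO drop the critic can be sketched as a per-prompt normalization: sample several responses to the same prompt, then score each against the group mean and spread. The use of population standard deviation and the epsilon value here are assumptions for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses to one prompt, scored 1.0 (correct) or 0.0 (incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])
```

When every response in a group receives the same reward, all advantages are zero and that prompt contributes no gradient, which is why prompt difficulty matters for this style of training.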

B2.2.1 Annotator selection & training

Recruitment, qualification, training. Industry standard: Frontier labs use vetted contractors (a major annotation platform, an annotation services provider, internal teams). Calibration tests required.

B2.2.2 Inter-annotator agreement

Measuring consistency across annotators. Industry standard: Cohen's kappa or simple agreement rate. A 2023-generation open-weights model reports ~70% agreement on preference pairs.
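
Raw agreement and Cohen's kappa can diverge noticeably, because kappa subtracts the agreement expected by chance. A minimal two-annotator sketch with made-up labels:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's marginal label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical preference labels ("A" or "B" preferred) on ten pairs.
annotator_1 = list("AAAAAABBBB")
annotator_2 = list("AAAABBBBBA")
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 2))  # 0.4
```

In this toy case the annotators agree on 70% of items, yet kappa is only 0.4 because half of that agreement would be expected by chance. The function divides by zero when expected agreement is exactly 1 (both annotators always give the same single label), which a production version would need to guard against.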

B2.3 Reward Model

DPO (Direct Preference Optimization) trains the policy directly on preference data without separate reward model. Mathematically derived: the optimal policy under RLHF is expressible in closed form, leading to a simple cross-entropy loss on chosen vs rejected pairs. Single training stage, much simpler than PPO. SOTA: DPO is the dominant 2024+ open-community method. Iterative DPO (multiple rounds with model-generated preferences) and online DPO push quality. Variants: IPO (avoids overfitting), KTO (uses positive/negative labels not pairs), ORPO (combines SFT and DPO into single stage), SimPO (length-controlled). e.g. Tulu 2/3: DPO + iterative · Zephyr: DPO seminal open work · Hermes: DPO + ChatML
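
The DPO loss for one preference pair can be written directly from sequence log-probabilities under the policy and the frozen reference model. This is a scalar sketch of the standard formulation; the numeric inputs are hypothetical, and a real implementation would operate on batched tensors with a numerically stable log-sigmoid.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)) for one pair."""
    chosen_ratio = policy_chosen_lp - ref_chosen_lp        # implicit reward of chosen / beta
    rejected_ratio = policy_rejected_lp - ref_rejected_lp  # implicit reward of rejected / beta
    margin = beta * (chosen_ratio - rejected_ratio)
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

# Policy identical to the reference: margin is zero and the loss is log(2).
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))
# Policy has shifted probability toward the chosen response: loss falls below log(2).
print(round(dpo_loss(-8.0, -14.0, -10.0, -12.0), 4))
```

The parameter beta plays the role of the KL coefficient in RLHF: larger values penalize drift from the reference more strongly.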

B2.3.1 Architecture (Bradley-Terry)

Pairwise loss: log-sigmoid of reward difference. Industry standard: Bradley-Terry standard. Pre-trained transformer with scalar head.
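
Under the Bradley-Terry model, the probability that the chosen response beats the rejected one is the sigmoid of the reward difference, and the training loss is the negative log of that probability. A scalar sketch with hypothetical reward values:

```python
import math

def bt_preference_prob(reward_chosen, reward_rejected):
    """P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected)."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

def bt_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood of the observed preference."""
    return -math.log(bt_preference_prob(reward_chosen, reward_rejected))

print(bt_preference_prob(1.0, 1.0))   # 0.5: equal rewards, no preference
print(round(bt_loss(2.0, 0.0), 4))    # small loss: RM already ranks the pair correctly
print(round(bt_loss(0.0, 2.0), 4))    # large loss: RM ranks the pair the wrong way
```

Only the difference between the two rewards matters, so the absolute scale of RM outputs is unconstrained by this loss; that is one reason calibration is handled separately.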

B2.3.2 Reward model size

Smaller, same-size, or larger than policy. Industry standard: InstructGPT used 6B reward for 175B policy. A 2023-generation open-weights model used same-size. Trade-off: cost vs accuracy.

B2.3.3 Reward calibration

Ensuring reward distribution is well-behaved. Industry standard: Length normalization, ensemble, regularization to prevent reward hacking.

B2.4 RLHF (PPO)

Constitutional methods (CAI) and RL-from-AI-Feedback (RLAIF). A signature constitutional method: instead of human preferences, use AI to generate preferences according to a set of natural-language principles (the 'constitution'). Process: model produces response → AI critic identifies constitution violations → revised response. Pairs (original, revised) become preference data. Avoids large human annotation budgets. SOTA: CAI is core to a constitutional-methods frontier lab's leading frontier model lineage. RL-from-AI-Feedback (RLAIF) demonstrated quality comparable to RLHF with AI-generated preferences (Lee et al., 2023). Hybrid: human preferences for high-stakes categories, AI preferences for breadth. The constitution is explicitly published (a constitutional-methods frontier lab) and combines high-level principles, hard rules, and exemplars. e.g. a constitutional-methods frontier lab's leading frontier model lineage: CAI core · a leading open-weights model: includes some RL-from-AI-Feedback (RLAIF) for breadth

B2.4.1 PPO algorithm

Clipped surrogate objective with trust region. Industry standard: Schulman 2017. InstructGPT, a 2023-generation open-weights model, a leading frontier model all used PPO. Becoming less dominant due to direct methods.
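
The clipped surrogate objective for a single action can be sketched as below, where `ratio` is the probability ratio between the new and old policy and `advantage` is the estimated advantage. The clip range of 0.2 is a commonly used default, taken here as an assumption.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO per-sample objective: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: gains from pushing the ratio above 1 + eps are capped.
print(clipped_surrogate(1.5, 1.0))
# Negative advantage: moving the ratio below 1 - eps earns no further credit.
print(clipped_surrogate(0.5, -1.0))
# Negative advantage with a ratio that moved the wrong way: the penalty is not capped.
print(clipped_surrogate(1.5, -1.0))
```

Taking the minimum makes the objective a pessimistic bound: clipping removes the incentive to move far from the old policy but never hides a change that made things worse.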

B2.4.2 KL penalty

Penalty on KL divergence from SFT model. Prevents drift. Industry standard: Universal in RLHF-PPO. β coefficient typically 0.01-0.1. Adaptive KL also common.
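
A common way to apply the penalty is to subtract a scaled KL estimate from the reward-model score before the policy update. The sketch below uses the summed per-token log-probability difference on the sampled response as the KL estimate, which is one of several estimators in use; the numbers are hypothetical.

```python
def kl_shaped_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.05):
    """Reward-model score minus beta times a sampled KL estimate."""
    # Sum over tokens of log pi(token) - log pi_ref(token) on the sampled response.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate

# Policy assigns higher probability than the reference to both sampled tokens.
shaped = kl_shaped_reward(1.0, [-1.0, -0.5], [-1.5, -1.0], beta=0.05)
print(round(shaped, 3))  # 0.95
```

Adaptive schemes adjust beta during training to hold the measured KL near a target instead of fixing the coefficient in advance.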

B2.4.3 Value function

Critic network estimating expected reward. Industry standard: Initialized from reward model. Trained jointly with policy.

B2.4.4 Compute cost

PPO ~5× SFT compute due to multiple forward passes per step. Industry standard: Significant. Drives interest in direct methods (B2.5).

B2.5 Direct Preference Methods

RL with verifiable rewards (RLVR). For tasks with clear correctness — math, code, formal logic — reward signal can be programmatic (correct answer = 1, wrong = 0). Avoids RM bottleneck. Foundation of o1-style and an open-weights reasoning model reasoning training. SOTA: an open-weights reasoning model-Zero: pure RL from base model with rule-based reward (correct/incorrect on math, syntactic correctness on code) → emergent reasoning capabilities. R1: cold-start with SFT → RL → SFT distillation → final RL. RLVR demonstrated for math (GSM8K, MATH, AIME), code (HumanEval, LiveCodeBench), and formal proofs (Lean). e.g. an open-weights reasoning model: math + code RLVR · Tulu 3: RLVR component · a leading frontier lab o1/o3: RLVR-class (closed)
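
A rule-based reward for math-style tasks can be as simple as extracting a final answer and comparing it to the reference. The extraction heuristic below (take the last number in the response) is deliberately crude and is an assumption for illustration; real pipelines typically require a fixed answer format and stricter parsing.

```python
import re

def extract_final_answer(text):
    """Return the last number in the response, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return matches[-1] if matches else None

def verifiable_reward(response, reference_answer):
    """1.0 if the extracted answer equals the reference numerically, else 0.0."""
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0
    return 1.0 if float(answer) == float(reference_answer) else 0.0

print(verifiable_reward("3 + 4 = 7, so the answer is 7.", "7"))  # 1.0
print(verifiable_reward("I think the answer is 8", "7"))         # 0.0
print(verifiable_reward("I am not sure", "7"))                   # 0.0
```

Because the check is programmatic, there is no learned reward model to exploit, although weak parsers create their own loopholes (for example, a response that ends by listing many candidate numbers).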

B2.5.1 DPO (Direct Preference Optimization)

Closed-form solution to RLHF objective. Direct loss on preference pairs. Industry standard: Rafailov 2023. Widely adopted. A leading open-weights model reports DPO use in some stages.

B2.5.2 IPO (Identity Preference Optimization)

Variant of DPO without reward parameterization. Industry standard: Azar et al. 2023.

B2.5.3 KTO (Kahneman-Tversky Optimization)

Uses prospect theory; needs only binary good/bad signal, not pairs. Industry standard: Ethayarajh 2024. Useful when pair data unavailable.

B2.5.4 ORPO

Odds Ratio Preference Optimization. Combines SFT + preference in single stage. Industry standard: Hong 2024. Reduces total alignment compute.

B2.5.5 SimPO

Simple preference optimization without reference model. Industry standard: Meng 2024.

B2.6 Reward Hacking

Reward hacking. The model finds ways to get high reward that don't correspond to actual quality: response-length inflation, sycophancy, gaming specific judge biases, exploiting reward-model artifacts. Central pathology of all RLHF/DPO methods. SOTA: Length-controlled metrics (LC-AlpacaEval, SimPO) penalize length-gaming. Reward model ensembles reduce single-RM artifacts. Iterative DPO with fresh preference data per iteration prevents some hacking. The principle-based judge in constitutional methods is less hackable than a learned RM. e.g. LC-AlpacaEval (Dubois 2024) · SimPO (Meng 2024) · Sycophancy studies (Sharma 2023)

B2.6.1 Length hacking

Verbose responses score higher even when not better. Industry standard: Well-documented. Mitigation: length-normalized reward, length penalty.
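
One simple form of length penalty subtracts a per-token cost for every token beyond a target length. The target and the penalty rate below are hypothetical tuning knobs, not values from any published recipe.

```python
def length_penalized_reward(rm_score, n_tokens, target_tokens=300,
                            penalty_per_token=0.001):
    """Subtract a linear penalty for tokens beyond the target length."""
    excess = max(0, n_tokens - target_tokens)
    return rm_score - penalty_per_token * excess

print(length_penalized_reward(1.0, 200))  # 1.0: under target, no penalty
print(length_penalized_reward(1.0, 800))  # 0.5: 500 excess tokens at 0.001 each
```

A penalty that is too steep trades one artifact for another by rewarding truncated answers, so the rate is usually checked against a length-controlled evaluation.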

B2.6.2 Sycophancy

Model agrees with user even when wrong. Industry standard: Documented in Sharma 2023 and others. Active research mitigation.

B2.7 Iterative / Online RLHF

Iterative / online RLHF. Single-pass alignment limited; iterative loop refreshes preference data and re-trains. Online: model generates new responses for fresh preference labeling continuously. Standard at frontier 2024+. SOTA: A leading open-weights model used iterative DPO across 4-6 rounds. Online iterative DPO and online iterative RLHF demonstrated quality gains. Cost: each iteration requires fresh preference labels. Trade-off: convergence vs over-fitting to judge. e.g. A leading open-weights model iterative DPO (4-6 rounds) · a constitutional-methods frontier lab iterative Constitutional methods · Online RLHF research

B2.8 Multi-Objective Preference

Multi-objective preference. Balancing helpfulness vs harmlessness, honesty vs helpfulness, brevity vs completeness. Single reward model collapses these; explicit multi-objective approaches preserve trade-offs. SOTA: a constitutional-methods frontier lab uses separate helpfulness and harmlessness preference data; combined during training. Multi-objective DPO variants explicit. Pareto-frontier explicit modeling for clear axis trade-offs. e.g. a constitutional-methods frontier lab helpful/harmless split · Multi-objective DPO research

B2.8.1 Separate reward models per objective

One RM for helpfulness, one for safety, etc. Industry standard: a 2023-generation open-weights model used 2 RMs (helpfulness + safety). Combined via weighted sum or constrained optimization.
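
Both combination strategies can be sketched in a few lines. The function names, weights, safety floor, and penalty coefficient below are assumptions for illustration; the constrained variant uses a soft penalty, which is only one way to approximate constrained optimization.

```python
def combine_weighted(helpfulness, safety, w_help=0.5, w_safe=0.5):
    """Weighted sum of two reward-model scores (weights are illustrative)."""
    return w_help * helpfulness + w_safe * safety

def combine_constrained(helpfulness, safety, safety_floor=0.0, penalty=10.0):
    """Maximize helpfulness subject to safety >= safety_floor.

    The constraint is enforced softly: any shortfall below the floor is
    multiplied by a large penalty and subtracted from the helpfulness score.
    """
    violation = max(0.0, safety_floor - safety)
    return helpfulness - penalty * violation
```

The weighted sum lets a very helpful response buy back a safety deficit; the constrained form does not, which is the usual argument for it on safety-critical objectives.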

B2.8.2 Pareto frontier exploration

Explicitly trading off objectives at different operating points. Industry standard: Research-grade. Not standard frontier practice.
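
As a sketch of what exploring operating points means in practice, the snippet below filters candidate checkpoints down to the non-dominated set on two objectives. The scores are hypothetical and higher is assumed to be better on both axes.

```python
def pareto_front(points):
    """Return the points not dominated by any other point.

    points: list of (helpfulness, safety) tuples, higher is better on both.
    A point is dominated if another point is at least as good on both axes
    and strictly better on at least one.
    """
    front = []
    for i, (h, s) in enumerate(points):
        dominated = any(
            (h2 >= h and s2 >= s) and (h2 > h or s2 > s)
            for j, (h2, s2) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((h, s))
    return front
```

Any checkpoint off the front can be discarded without loss; choosing among the remaining points is a product decision rather than an optimization result.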

Constitutional Methods

18 sub-endpoints mapped

MZN Provisional Position · Partial

Principle-based alignment substrate documented at the theoretical level

A foundational theoretical framework treats embodiment, constraint, and emotional function as preconditions for value-aligned cognition. Provides a substrate for principle-based alignment that operates at the architectural rather than the surface-prompt level. Theory layer is public at high level; deeper intervention logic is reserved.

Definition

A public alignment specification, or Constitution, is the explicit document that defines what the model should and shouldn't do. Components: persona, helpfulness/harmlessness/honesty principles, harm category taxonomy, refusal policies, role hierarchy (system/operator/user/tool), exception cases, exemplars. Without an explicit spec, model behavior is implicit and inconsistent. Explicit specs are increasingly required for trust, regulatory clarity, and dispute resolution.

State of the Art (2025–2026)

A leading frontier lab's public alignment specification (May 2024, since updated) is a public ~5000-word document defining a Chain of Command (Platform > Developer > User > Tool), default behaviors, and hard rules. One lab's constitution and Acceptable Use Policy are also public. Both define harm categories: CBRN weapons, child safety, privacy, election interference, self-harm, deceptive output. The spec drives training data curation, RLHF reward signal, and red-team test cases.

Key Decisions

Persona (helpful assistant default)
Hierarchy of authorities
Hard rules (never do X) vs soft rules (default but overridable)
Refusal categories
Exception handling (medical, legal, etc.)
Public vs internal spec

Trade-offs

Detailed spec → consistency, harder to update
Lightweight spec → flexible, ambiguity in edge cases

Numbers & Ablations

A leading frontier lab's public alignment specification: ~5,500 words, 3 layers (Platform > Developer > User), ~30 specific rules. Versioned publicly with changelog.
A constitutional-methods frontier lab's Constitution: ~75 principles in original (2022); refined and expanded since. The public AUP is a separate document (~3,500 words).
Refusal categories standardized across frontier: 8-12 hard categories (CBRN, child safety, etc.) + 20-50 soft categories (controversial topics, dual-use info).
Over-refusal rate (XSTest): frontier 2024 models 5-15% of legitimate queries falsely refused. Better calibration is ongoing.
Spec drift: a leading frontier lab a public alignment specification May 2024 → Feb 2025 update added ~12 new clauses, modified ~8. Spec is an active document, not a constitution-in-amber.

Open Questions

Does explicit Constitution training actually shape behavior more than implicit RLHF preference? No clean ablation exists.
Spec gaming: red teamers regularly find spec-compliant ways to produce undesired output. Is this a fundamental limit or a training quality issue?
Authority hierarchy enforcement under prompt injection: Wallace 2024 trained for it, but new bypasses are still published monthly. Is this solvable in the current paradigm?
Open-weights specs: a model with public weights can be 'unspecced' via fine-tuning. Does specification have any role for open models?
Does spec content matter, or just spec presence? Maybe any reasonable spec produces similar behavior given good training.

Reference Analyst Note

Specifications are operationally useful (alignment of human reviewers, regulatory clarity, dispute resolution), but their causal effect on model behavior is poorly understood. The constitution of a constitutional-methods frontier lab and the public alignment specification of a leading frontier lab serve more as institutional artifacts than technical control mechanisms. The next frontier is 'specs the model can actually reason about': current specs are read like training labels, not internalized reasoning frameworks. Constitutional Classifiers (2025) suggest a path: a separate small model that explicitly checks outputs against principles.

Examples

A leading frontier lab a public alignment specification (public) · a constitutional-methods frontier lab Acceptable Use Policy (public) · one lab's constitution (mostly public) · a multimodal frontier lab a multimodal frontier model policies

References (Academic)

A leading frontier lab a public alignment specification (2024) · a constitutional-methods frontier lab AUP · Bai et al., CAI (2022)

Sub-endpoint anatomy — 18 items mapped

B3.1 Constitution Authoring

Hard rules / never-comply categories. A small set of behaviors the model must refuse regardless of how a request is framed. Universal across frontier labs: detailed CBRN weapons synthesis, child sexual abuse material, content designed to cause mass casualties, cybercrime tools targeting critical infrastructure. SOTA: Hard rules expressed as Constitutional principles + RLHF reward signal + output filtering. Frontier labs converged on similar hard-rule sets, partly via voluntary commitments (a national AI Safety Institute summit, Seoul commitments). Still significant variation in soft-rule areas (controversial topics, adult content, weapon information at sub-CBRN level). e.g. A leading frontier lab: explicit hard rules in a public alignment specification · a constitutional-methods frontier lab: similar set · Industry: voluntary commitments

B3.1.1 Source materials

What documents inform the constitution. Industry standard: Universal Declaration of Human Rights, a confidential-computing frontier lab ToS (as proxy for terms of service style), a constitutional-methods frontier lab-internal principles.

B3.1.2 Principle granularity

Number and specificity of principles. Industry standard: a constitutional-methods frontier lab disclosed ~75 principles for a long-context frontier model. A leading frontier lab's public alignment specification is a comparable artifact.

B3.1.3 Public disclosure

Whether constitution is published. Industry standard: a constitutional-methods frontier lab publishes Constitution. A leading frontier lab publishes a public alignment specification. Increasingly transparent.

B3.2 Self-Critique

Authority hierarchy. When system instructions conflict with user instructions, who wins? Standard pattern: Platform (lab) > Developer/Operator > User > Tool output. Important for security: tool output (from web, retrieved docs) ranks lowest to prevent prompt injection. SOTA: A leading frontier lab's public alignment specification defines an explicit Chain of Command. A constitutional-methods frontier lab takes a similar approach via the system/user role distinction. Key innovation: 'instruction hierarchy', in which the model is trained to follow higher-authority instructions over lower-authority ones, especially for prompt injection defense. e.g. A leading frontier lab Chain of Command (a public alignment specification) · a constitutional-methods frontier lab system prompt precedence
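
The precedence order can be illustrated with a small resolver. This is a sketch only: frontier models learn the hierarchy through training, not through a lookup like this, and the class, function, and setting names are hypothetical.

```python
from enum import IntEnum

class Authority(IntEnum):
    # Higher value means higher authority; tool output ranks lowest.
    TOOL = 0
    USER = 1
    DEVELOPER = 2
    PLATFORM = 3

def resolve(instructions):
    """Resolve conflicting settings so the highest authority wins per key.

    instructions: list of (Authority, key, value) tuples.
    On a tie in authority, the earlier instruction is kept.
    """
    resolved = {}
    for authority, key, value in instructions:
        if key not in resolved or authority > resolved[key][0]:
            resolved[key] = (authority, value)
    return {key: value for key, (_, value) in resolved.items()}
```

For example, a retrieved web page (TOOL) asking to reveal the system prompt loses to a PLATFORM rule forbidding it, which is the prompt-injection case the hierarchy is meant to cover.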

B3.2.1 Critique prompt design

How the critique is elicited. Industry standard: Bai 2022: 'Identify ways response is harmful, unethical, racist, sexist...' Variations explore principle subsets.
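
A minimal sketch of a critique-then-revise loop in the style of Bai 2022. The prompt templates are paraphrased assumptions, not the paper's exact wording, and `generate` is a hypothetical callable standing in for a model call.

```python
CRITIQUE_TEMPLATE = (
    "Prompt: {prompt}\nResponse: {response}\n"
    "Critique request: identify ways the response conflicts with "
    "this principle: {principle}"
)
REVISION_TEMPLATE = (
    "Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
    "Revision request: rewrite the response to address the critique."
)

def critique_and_revise(prompt, principles, generate):
    """Run one critique and one revision per principle.

    generate: hypothetical callable (str -> str) wrapping a language model.
    Returns the final revised response.
    """
    response = generate(prompt)
    for principle in principles:
        critique = generate(CRITIQUE_TEMPLATE.format(
            prompt=prompt, response=response, principle=principle))
        response = generate(REVISION_TEMPLATE.format(
            prompt=prompt, response=response, critique=critique))
    return response
```

Iterating over a subset of principles per sample, rather than all of them, is the kind of variation the entry above refers to.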

B3.2.2 Critique reliability

Does the critique correctly identify violations. Industry standard: Mixed; depends on model capability. Stronger models give more reliable critiques.

B3.3 Self-Revision

Refusal taxonomy. Categories of requests the model should refuse (or carefully comply with conditions). Standard: CBRN, illegal acts harming others, child safety, self-harm encouragement, privacy violations, deceptive outputs (impersonation), election interference, copyrighted-content reproduction. Each has nuance (medical info: refuse harm-direction, allow education). SOTA: Refusal calibration is a major axis: over-refusal (rejecting safe queries because they superficially match risky patterns) is a known failure mode and reputation risk. Benchmarks like XSTest measure over-refusal. Frontier labs invest heavily in distinguishing hostile vs. legitimate intent on borderline queries. e.g. XSTest benchmark: over-refusal · a constitutional-methods frontier lab refusal categorization in CAI

B3.4 RL from AI Feedback (RLAIF)

Persona and tone. The model's default voice. Decisions: addressed-as (you/I/the assistant), formality level, use of emojis, response length tendency, willingness to express opinions, handling of identity questions ('Are you conscious?'). Frontier choice: helpful, balanced, lightly opinionated where appropriate. SOTA: Persona is implicit in training data + reinforced by RLHF. A constitutional-methods frontier lab a leading frontier model: thoughtful, curious, willing to engage philosophically. A leading frontier lab a consumer LLM chat product: more neutral, broader appeal. Custom personas (developer-specified system prompt) override default within bounds. e.g. A leading frontier model: thoughtful, philosophical · a consumer LLM chat product: neutral, helpful · Grok: edgy, opinionated

B3.4.1 AI preference labeling

Strong model judges which of two responses better satisfies principles. Industry standard: Bai 2022 RL-CAI stage. Preferred over RLHF for harmlessness signal at scale.

B3.4.2 RL-from-AI-Feedback (RLAIF) vs RLHF effectiveness

Comparison on safety vs helpfulness axes. Industry standard: Lee 2023 (a multimodal frontier lab) compared; RL-from-AI-Feedback (RLAIF) approximately matches RLHF on helpfulness, sometimes exceeds on safety.

B3.5 Rule Encoding in Training

Rule encoding in training. How the spec actually shapes the model: via SFT examples illustrating rules, via RLHF/DPO preferences favoring spec-conforming outputs, via Constitutional methods principles, via output-side filtering. Most frontier models combine all. SOTA: Spec encoded via SFT examples illustrating rules + RLHF preferences favoring spec-conformity + Constitutional principles + output filtering. Frontier combines all. Tension: implicit (preferences) vs explicit (training-time prompt) encoding. Spec changes slow without explicit encoding. e.g. a constitutional-methods frontier lab Constitutional principles → training · a leading frontier lab a public alignment specification → preference shaping

B3.5.1 Rule-conditional training

Train on (rule, prompt, response) triplets so model learns conditional behavior. Industry standard: Increasingly used. Rule can be invoked at inference for fine-grained behavior control.
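
One way to picture the triplet format is shown below. The record type and the bracketed tags are hypothetical; real pipelines use their own chat templates.

```python
from dataclasses import dataclass

@dataclass
class RuleExample:
    rule: str      # the behavior rule the response must satisfy
    prompt: str    # the user request
    response: str  # a response that conforms to the rule

def to_training_text(example):
    """Serialize a triplet so the rule conditions the response.

    The [RULE]/[USER]/[ASSISTANT] tags are illustrative placeholders.
    """
    return (
        f"[RULE] {example.rule}\n"
        f"[USER] {example.prompt}\n"
        f"[ASSISTANT] {example.response}"
    )
```

At inference, placing a different rule in the same slot is what allows the fine-grained behavior control mentioned above.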

B3.5.2 Implicit vs explicit invocation

Whether rules are always applied or invoked by system prompt. Industry standard: Both patterns used. Always-applied rules baked into RL-from-AI-Feedback (RLAIF); explicit rules invoked via system prompt.

B3.6 Specification Gaming

Specification gaming. Model finds technical compliance with spec while violating intent. E.g., refuses 'how to make a bomb' but happily explains 'how energetic materials work for a chemistry student'. Reward-hacking analog at the spec level. SOTA: Active research area. Better evaluations (multi-turn jailbreak, intent-based eval) detect spec gaming. Counter-measures: comprehensive principles, intent-recognition training, adversarial spec testing. e.g. Many-shot jailbreaking exploits spec edges · an external evaluation organization spec-gaming benchmark · a constitutional-methods frontier lab Sleeper Agents research

B3.6.1 Constitution loopholes

Principles with ambiguous scope or conflicting application. Industry standard: Active risk; mitigation via principle revision and red-teaming.

B3.6.2 Refusal over-generalization

Constitution causes refusal of legitimate requests. Industry standard: Common failure mode. Mitigation: explicit examples of what to NOT refuse.

B3.7 Governance & Update Process

Spec governance and update process. Who can change the spec? How are changes validated? Versioning. Public consultation (a leading frontier lab's recent practice). Spec drift between versions is real risk; major update requires re-training or major fine-tune. SOTA: A leading frontier lab a public alignment specification versioned publicly with changelog. One lab's constitution versioned internally. Change governance: internal review board + sometimes external comment. Major updates: full retraining or extensive fine-tune required. e.g. A leading frontier lab a public alignment specification versioning (May 2024 → Feb 2025) · a constitutional-methods frontier lab AUP updates

Capability Evaluation

20 sub-endpoints mapped

MZN Provisional Position · Partial

Phase 1 product telemetry and user-behavior evaluation context

Phase 1 ran capability evaluation in production: 22 module test patterns, 12K+ business profiles, 245+ documented survey instruments. A layered diagnostic methodology — mapping failure modes from input surface to release readiness — is documented. Benchmark-style evaluation suite execution at frontier scale requires partnership scope.

Phase context: C1 references Phase 1 product telemetry and behavioral evaluation context. It is not the same as a frontier LLM benchmark suite, and should be validated separately.

Definition

Capability evaluation measures what a model can do. Standard benchmarks form a public scoreboard that drives industry progress. Categories: general knowledge (MMLU), reasoning (GSM8K, MATH, AIME), code (HumanEval, MBPP, LiveCodeBench, SWE-bench), agentic (GAIA, AgentBench), long-context (NIAH, RULER, BABILong), multilingual (MGSM, multilingual MMLU), instruction following (IFEval), and frontier-specific (HLE, ARC-AGI, FrontierMath).

State of the Art (2025–2026)

Benchmark saturation is a constant concern: MMLU saturating ~90%, HumanEval saturated ~95%. New benchmarks emerging: HLE (Humanity's Last Exam, ~3000 expert-PhD-level questions), FrontierMath (research-level math), ARC-AGI (visual abstract reasoning), SWE-Bench Verified (real GitHub issues, validated). Contamination is pervasive — popular benchmarks leak into training data, requiring fresh held-out sets.

Key Decisions

Benchmark suite breadth
Held-out / contamination-controlled sets
Human eval calibration
Frequency (every model? every checkpoint?)
Public reporting strategy

Trade-offs

More benchmarks → better signal, eval cost
Public reporting → comparability, gaming risk

Numbers & Ablations

MMLU saturation: frontier models 90%+ since 2024. Annotation noise estimated at 5-10%, so further gains are within annotator disagreement.
GPQA-Diamond: frontier ~50-65% (top models 2025); human PhD experts ~65-75% in their domain, ~35% out of domain.
Humanity's Last Exam (Jan 2025 release): frontier 25-30%, human expert ensemble ~80%+.
LiveCodeBench: refreshed monthly to avoid contamination; frontier 50-70% (vs HumanEval ~95% saturation).
SWE-bench Verified: frontier 50-60% (a constitutional-methods frontier lab Computer Use, a leading frontier lab o3). Human engineer ~70%.
a major human preference leaderboard Elo: frontier 1300-1450 (saturating). Per-100-Elo-point compute investment grows nonlinearly.
Eval cost: full frontier eval suite ~$100K-1M in inference cost depending on coverage and judges.

Open Questions

Is there a saturation point for evaluation itself? When all standard benchmarks saturate, what replaces them?
Contamination: how badly are public benchmarks contaminated in training data? Anecdotally severe; quantitative measures rare.
Per-domain capability mapping: frontier models are 'generally capable' but per-task spread is huge. No good way to summarize.
Long-tail capability: standard benchmarks measure central capabilities. The 'long tail' (rare tasks, novel domains, expert work) is where models actually fail.
Reasoning eval: existing benchmarks (GSM8K → MATH → AIME → FrontierMath) chain. Is there a Pareto-frontier reasoning eval, or is it always 'next harder math'?

Reference Analyst Note

Standard benchmarks are entering crisis — saturation, contamination, gameability. The next 2 years will see shift to: (a) live arenas with continuous human ratings (lmarena), (b) frequently-refreshed benchmarks (LiveCodeBench), (c) expert-grade eval (GPQA, FrontierMath, HLE), (d) agent benchmarks measuring real task completion (SWE-bench, GAIA, OSWorld). The trend is from 'static MMLU score' to 'diverse evidence portfolio.' a constitutional-methods frontier lab system cards already do this; expect industry-wide adoption.

Examples

Major scoreboards: lmarena.ai (live human votes), Open LLM Leaderboard, an open-model hub leaderboards · Frontier labs publish evals on system cards · Benchmark saturation: GPQA, AIME going next

References (Academic)

Hendrycks et al., MMLU (2020) · Cobbe et al., GSM8K (2021) · Chen et al., HumanEval (2021) · Phan et al., HLE (2025)

Sub-endpoint anatomy — 20 items mapped

C1.1 Knowledge Benchmarks

Knowledge benchmarks measure factual recall and reasoning over knowledge. MMLU (57 subjects, multiple choice) is the most-cited benchmark — reaching saturation. GPQA (graduate-level science, expert-resistant) is harder. TriviaQA, NaturalQuestions for QA. SOTA: MMLU saturated (frontier ~90%). MMLU-Pro adds harder questions. GPQA-Diamond (~50% accuracy at frontier) is current standard for hard knowledge. SimpleQA from a leading frontier lab tests factuality with calibration. HLE (Humanity's Last Exam) is the new frontier — ~3000 questions across domains, frontier models score 25-30% (Jan 2026). e.g. MMLU saturating · GPQA active frontier · HLE new frontier

C1.1.1 MMLU

57-subject multiple-choice across STEM, humanities, social science. Industry standard: De facto standard. Frontier models 85-90% on 5-shot. Saturating; MMLU-Pro emerged as harder version.

C1.1.2 MMLU-Pro

Harder version of MMLU. Industry standard: Wang 2024. Frontier models 70-80%.

C1.1.3 GPQA

Graduate-level physics, chemistry, biology questions. Industry standard: Rein 2023. Designed to be 'Google-proof' (hard to answer by web search). Frontier models 50-65% on diamond set.

C1.2 Reasoning Benchmarks

Reasoning benchmarks. Math: GSM8K (grade school), MATH (high school competition), AIME (American Invitational Mathematics Examination, harder), Putnam (collegiate), FrontierMath (research-level). Code reasoning: HumanEval (saturated), MBPP, LiveCodeBench (refreshed monthly to avoid contamination), CodeContests, SWE-bench (real-world issues). SOTA: Reasoning models (o1, o3, R1) dominate: o3 reportedly ~25% on FrontierMath (others ~2%). AIME 2024: frontier models ~85% (R1, o1). LiveCodeBench monthly refresh keeps signal valid. SWE-bench Verified: ~50% success rate at frontier (a constitutional-methods frontier lab a leading frontier model with computer use, a leading frontier lab Codex). e.g. o3 on FrontierMath: ~25% · a long-context frontier model on SWE-bench Verified: leading · an open-weights reasoning model on AIME: ~80%

C1.2.1 GSM8K

8K grade-school math word problems. Industry standard: Frontier models 95%+. Saturating.

C1.2.2 MATH

Competition-level mathematics. Industry standard: Frontier models 60-75% standard, 90%+ with extended reasoning.

C1.2.3 BIG-Bench Hard

Subset of BIG-Bench challenging for LLMs. Industry standard: Standard challenging multi-task suite.

C1.3 Code Benchmarks

Agentic benchmarks. Tests whether model can complete multi-step tasks using tools (browser, code, files). Examples: GAIA (general assistant), AgentBench (multi-domain), OSWorld (computer use), WebArena (browser tasks), τ-bench (customer service realism). Significantly harder than single-shot Q&A. SOTA: Frontier models 2025: ~60-70% GAIA (with tools). OSWorld ~30-40% computer use (a constitutional-methods frontier lab a leading frontier model with computer use feature, a leading frontier lab Operator). Agentic capability lags substantially behind reasoning at the frontier because agent tasks compound errors. SWE-bench (code agents) is most reviewed production-relevant agentic eval. e.g. GAIA: Mialon et al., 2023 · OSWorld: Xie et al., 2024 · SWE-bench: Jimenez et al., 2023

C1.3.1 HumanEval

164 Python programming problems with unit tests. Industry standard: Frontier models 90%+. Saturating; HumanEval+ harder version.
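
HumanEval scores are conventionally reported as pass@k. The sketch below implements the unbiased estimator from Chen et al. (2021), where n samples are generated per problem and c of them pass the unit tests; the function name is an assumption.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for a single problem.

    n: number of samples generated, c: number that pass all unit tests,
    k: sample budget being evaluated (k <= n).
    Computes 1 - C(n - c, k) / C(n, k), the probability that at least
    one of k samples drawn without replacement passes.
    """
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark score is the mean of this value over all 164 problems.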

C1.3.2 MBPP

974 Mostly Basic Python Problems. Industry standard: Standard companion to HumanEval.

C1.3.3 SWE-Bench

Real-world GitHub issues; agent-style evaluation. Industry standard: Increasingly standard for agent capability. Frontier 30-60% on Verified subset.

C1.4 Instruction Following

Long-context evaluation. Beyond simple needle-in-haystack (NIAH, easy: insert fact in long doc, retrieve), modern benchmarks test multi-hop reasoning over long context. RULER: 13 task categories at varying context lengths. BABILong: chains of reasoning over long inputs. LOFT (a multimodal frontier lab): retrieval against million-token corpora. SOTA: a million-token-context frontier model Pro (2M context) sets the bar for very-long. A long-context frontier model (200K), a current-generation frontier model-Turbo (128K) are mainstream. NIAH performance saturated; RULER/BABILong show meaningful degradation past 64K-128K for most models. Long-context coupling with reasoning is frontier challenge. e.g. a multimodal frontier model on LOFT · a leading frontier model/a frontier model on RULER · Long-context-only models: Yi-200K

C1.4.1 IFEval

Verifiable instruction following (format constraints). Industry standard: Zhou 2023. Tests precision on programmatic constraints.
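
The idea of programmatic constraints can be sketched with a few checkers. These are illustrative stand-ins written for this example, not the checkers shipped with IFEval.

```python
import json

def max_words(limit):
    """Constraint: the response has at most `limit` whitespace-separated words."""
    return lambda response: len(response.split()) <= limit

def is_valid_json(response):
    """Constraint: the whole response parses as JSON."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def mentions(keyword, min_count=1):
    """Constraint: the keyword appears at least `min_count` times (case-insensitive)."""
    return lambda response: response.lower().count(keyword.lower()) >= min_count

def instruction_accuracy(response, checks):
    """Fraction of constraints the response satisfies."""
    passed = [check(response) for check in checks]
    return sum(passed) / len(passed)
```

Because every check is deterministic code, no judge model is needed, which is the main appeal of this style of evaluation.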

C1.4.2 MT-Bench, AlpacaEval

LLM-as-judge open-ended quality. Industry standard: Standard. Cross-link to B1.5.2.

C1.5 Long-Context Benchmarks

Human evaluation / preference rankings. Live arena-style platforms (lmarena.ai, formerly a community evaluation initiative a major human preference leaderboard) collect millions of human pairwise votes between anonymized model outputs. Generates Elo ratings — an aggregate quality signal that correlates well with user satisfaction. Now industry-standard frontier ranking method. SOTA: lmarena.ai: 1M+ votes, frontier models ~1300+ Elo. Domain-specific arenas (coding, vision). Frontier labs use private human eval at scale. Trade-off: arena quality is a vibes-y measure, can be gamed (style optimization), and benchmark-specific quality (math, code, reasoning) isn't fully captured. e.g. lmarena.ai (a major human preference leaderboard) · WildBench (real-world prompts) · MTBench (multi-turn)

C1.5.1 Needle-in-a-Haystack

Retrieval of single fact from long context. Industry standard: Necessary but insufficient. Can be passed without true long-context comprehension.
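
A minimal sketch of how a needle-in-a-haystack case might be constructed and scored. The function names, the sentence-level insertion, and the substring-match scoring are simplifying assumptions.

```python
def build_haystack(filler_sentences, needle, depth):
    """Insert the needle at a relative depth in [0, 1] of the filler text."""
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)

def score_retrieval(answer, expected):
    """Pass if the expected fact appears in the model's answer (case-insensitive)."""
    return expected.lower() in answer.lower()
```

Sweeping depth and context length produces the familiar heatmap, but a pass here only shows lookup, which is why the entry calls the test necessary but insufficient.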

C1.5.2 RULER, LongBench

More comprehensive long-context evaluation. Industry standard: RULER (Hsieh 2024) tests multiple long-context skills.

C1.6 Human Preference Eval

Human preference evaluation. Beyond automated benchmarks, humans rate model outputs. Pairwise (A vs B, choose preferred) most common. Aggregate as Elo (a major human preference leaderboard) or win-rate. Captures qualities hard to benchmark: tone, helpfulness in subjective tasks, response style. SOTA: a major human preference leaderboard (lmarena.ai): 1M+ public votes, frontier ~1300+ Elo. Internal panels at frontier labs. Concerns: arena gameable via style optimization, not robust signal for capability gains. Domain-specific arenas (Code, Vision) fill gaps. e.g. a major human preference leaderboard (lmarena.ai) · MTBench multi-turn · Hard Arena variants

C1.6.1 a major human preference leaderboard

a community evaluation initiative a major human preference leaderboard: pairwise human voting. Industry standard: Most-watched live leaderboard. Elo ratings updated continuously.
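
How pairwise votes turn into ratings can be sketched with the classic online Elo update. This is an assumption about the general mechanism only: production leaderboards may instead fit a statistical model over all votes at once, and the K-factor of 32 is illustrative.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(ratings, model_a, model_b, winner, k=32):
    """Apply one pairwise vote in place.

    ratings: dict of model name -> rating.
    winner: 'a', 'b', or 'tie'.
    """
    expected_a = expected_score(ratings[model_a], ratings[model_b])
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    delta = k * (score_a - expected_a)
    ratings[model_a] += delta
    ratings[model_b] -= delta
    return ratings
```

An upset (a low-rated model beating a high-rated one) moves ratings more than an expected result, so the table converges as votes accumulate.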

Safety Evaluation

20 sub-endpoints mapped

MZN Provisional Position · Strong Evidence

Output-conformance safety methodology + intent-bridge architecture

Output-conformance reframes refusal calibration as egress-template adherence — sufficient state space replaces enumeration of infinite inputs. An intent-bridge protocol architecturally connects intent detection to safety decisions. Runtime anomaly defense methodology documented. Specifics held in the proprietary portfolio.

Definition

Safety evaluation tests refusal accuracy, harm avoidance, bias, and alignment. Different from capability eval: capability asks 'can the model do X?' Safety asks 'does the model do X when it shouldn't, or fail to do X when it should?' Categories: refusal calibration (XSTest), bias (BBQ, BOLD), toxicity (ToxiGen, RealToxicityPrompts), privacy (TrustLLM), harmful task assistance (HarmBench).

State of the Art (2025–2026)

Frontier labs publish safety evals on system cards. AILuminate (MLCommons, 2024) is the industry-standard cross-lab safety benchmark. WMDP measures dangerous knowledge (CBRN). DecodingTrust is a comprehensive trust eval. National AI Safety Institutes run external safety evaluations on frontier models pre-release.

Key Decisions

Benchmarks selected
Internal vs external eval
Pre-release vs ongoing
Public reporting

Trade-offs

More external evaluation → trust, slower release
Comprehensive eval → confidence, cost

Numbers & Ablations

WMDP performance: frontier models 60-80% on dangerous-knowledge questions (alarming if it represents real uplift). National AI Safety Institutes evaluate this.
BBQ bias: frontier models show 5-15% bias on ambiguous demographic categories — improved from 25-40% in earlier generations.
Refusal calibration (XSTest): frontier 90-95% on safe queries, 90-98% on unsafe. False positive rate (over-refusal) 5-10% remains a real product concern.
AILuminate: 12 hazard categories, frontier ~85-95% safe response rate.
Persuasion eval (a constitutional-methods frontier lab): frontier models persuade ~30-50% as effectively as human experts. Capability scaling unclear.

Open Questions

What does 'CBRN uplift' actually mean operationally? Domain experts (virologists) review, but no agreed-upon threshold for 'meaningful uplift.'
Sandbagging: can a model deliberately underperform on capability evals to avoid being flagged? Demonstrated possible (Apollo Research 2024). How do you eval against deception?
Persuasion eval methodology: can persuasion be ethically and reliably measured? a constitutional-methods frontier lab's results are interesting but generalizability unclear.
Bias evaluation framing: most bias benchmarks reflect US-centric demographic categories. Cross-cultural bias eval thin.
Long-tail safety: standard benchmarks cover obvious harms. Subtle harms (gradual erosion of user agency, sycophancy) are real but unmeasured.

Reference Analyst Note

Safety evaluation is dramatically underdeveloped relative to capability evaluation. Capability has 50+ standard benchmarks; safety has maybe 15. We are flying blind on subtle harms (sycophancy, manipulation, deception under specific conditions). One lab's interpretability work is the deepest probe; field-wide it's still surface-level. Expect frontier safety eval to expand 5-10× by 2027, driven by EU AI Act conformity and national AI Safety Institute evaluations.

Examples

MLCommons AILuminate · a constitutional-methods frontier lab system card safety section · a leading frontier lab system card · a national AI Safety Institute evaluations

References (Academic)

Vidgen et al., AILuminate (2024) · Wang et al., DecodingTrust (2023) · Li et al., WMDP (2024)

Sub-endpoint anatomy — 20 items mapped

C2.1 Refusal & Harm Avoidance

Refusal calibration. Tests both that model refuses harmful requests AND complies with safe requests that look similar. Over-refusal (false positives) is a real failure mode and reputation risk. XSTest (Röttger et al.) is standard benchmark. SOTA: Frontier models ~90%+ accuracy on XSTest 'safe' subset (correctly comply), ~95%+ on 'unsafe' (correctly refuse). Specific failure: dual-use queries (chemistry knowledge that's educational vs synthesis directions). Calibration improves with explicit chain-of-thought during training. e.g. XSTest: 250 safe, 200 unsafe · OR-Bench (over-refusal) · WildGuard (open guardrail model)

C2.1.1 HarmBench

Standardized harmful-behavior eval suite. Industry standard: Mazeika 2024. Frontier models report ASR (attack success rate) per category.
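
Per-category attack success rate is a simple ratio. The sketch below assumes each red-team attempt has already been labeled as succeeded or not by some judge; the function name and the record layout are hypothetical.

```python
from collections import defaultdict

def asr_by_category(results):
    """Attack success rate per harm category.

    results: iterable of (category, attack_succeeded) pairs, where
    attack_succeeded is True if the model produced the harmful behavior.
    """
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for category, succeeded in results:
        attempts[category] += 1
        successes[category] += int(succeeded)
    return {category: successes[category] / attempts[category]
            for category in attempts}
```

Lower is better, and reporting by category keeps a weak area from being hidden inside a good aggregate.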

C2.1.2 Refusal calibration

Model refuses what should be refused; complies with what is benign. Industry standard: XSTest, OR-Bench evaluate over-refusal. Leading frontier labs track refusal precision/recall.
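
Refusal precision, recall, and over-refusal rate can be computed from labeled outcomes as sketched below. The function name and record layout are assumptions; "positive" here means the model refused.

```python
def refusal_metrics(records):
    """Compute refusal calibration metrics.

    records: list of (is_unsafe_prompt, model_refused) boolean pairs.
    precision: share of refusals that were warranted.
    recall: share of unsafe prompts that were refused.
    over_refusal_rate: share of safe prompts that were wrongly refused.
    """
    tp = sum(1 for unsafe, refused in records if unsafe and refused)
    fp = sum(1 for unsafe, refused in records if not unsafe and refused)
    fn = sum(1 for unsafe, refused in records if unsafe and not refused)
    tn = sum(1 for unsafe, refused in records if not unsafe and not refused)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "over_refusal_rate": fp / (fp + tn) if fp + tn else 0.0,
    }
```

Tracking over-refusal separately matters because a model can reach high recall simply by refusing everything that looks risky.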

C2.1.3 Refusal style

Tone, helpfulness, redirection in refusal responses. Industry standard: Soft refusals with explanation preferred. Hard refusals harm UX.

C2.2 Toxicity

Bias and fairness evaluation. Measures whether model produces different outputs based on demographic attributes (gender, race, religion, sexuality). Benchmarks: BBQ (question-answering bias), BOLD (open-ended generation bias), HolisticBias. SOTA: Frontier models still show measurable biases despite alignment. BBQ ambiguous-context bias scores improve with scale and alignment but don't eliminate. Bias evaluation remains active research; many bias benchmarks have been criticized for narrow framing or implicit US-cultural assumptions. e.g. BBQ: 9 demographic categories · BOLD: open-ended prompts · DiscrimEval

C2.2.1 ToxiGen, RealToxicityPrompts

Standardized toxicity benchmarks. Industry standard: RealToxicityPrompts (Gehman 2020), ToxiGen (Hartvigsen 2022).

C2.2.2 Toxicity classifier

Tool used to score outputs for toxicity. Industry standard: a third-party toxicity classifier common but criticized for bias. An open-weights output classifier increasingly used.

C2.3 Bias

Dangerous capability evaluation. CBRN uplift (does model meaningfully assist creating weapons?), cyber-offensive capabilities, autonomous replication / self-exfiltration, persuasion. These map to the thresholds of a Responsible Scaling Policy framework or a Preparedness-style framework. SOTA: WMDP (Weapons of Mass Destruction Proxy): ~3,700 multiple-choice questions across bio/chem/cyber, proxy for dangerous knowledge. Frontier labs run capability eval with domain experts (virologists, security researchers). DARPA AIxCC, DEFCON CTFs for cyber. Capability eval results gate deployment per Responsible Scaling Policy framework. e.g. WMDP: 3 domains (bio, chem, cyber) · a constitutional-methods frontier lab third AI Safety Level capability evals · a Preparedness scorecard

C2.3.1 BBQ (Bias Benchmark for QA)

Tests bias in ambiguous-context Q&A. Industry standard: Parrish 2022. Standard.

C2.3.2 StereoSet, CrowS-Pairs

Stereotype detection benchmarks. Industry standard: Used in academic eval; less common in industry model cards.

C2.4 Truthfulness

Adversarial robustness eval. Tests model under attack: jailbreaks (HarmBench, JailbreakBench), prompt injection scenarios, gradient attacks (optimization-based adversarial suffixes), social engineering. Distinct from capability eval: focuses on attack surface. SOTA: HarmBench is the current-standard automated red-team eval. JailbreakBench tracks specific known attacks. Robustness has substantially improved with instruction-hierarchy training, but no model is fully robust. Attack-defense arms race continues. e.g. HarmBench: Mazeika et al., 2024 · JailbreakBench: Chao et al., 2024

C2.4.1 TruthfulQA

817 questions where false-but-plausible answers exist. Industry standard: Lin 2022. Frontier models 60-70%.

C2.4.2 Hallucination eval

Fabricated facts in open-ended generation. Industry standard: HaluEval, FActScore. Active research area.

C2.5 Dangerous Capability Eval

Dangerous capability evaluation. Specialized eval against catastrophic risk thresholds: CBRN uplift (does model meaningfully assist creating biological/chemical/radiological/nuclear weapons), cyber-offensive (autonomous vulnerability discovery and exploitation), persuasion at scale, autonomous self-replication. SOTA: WMDP (Weapons of Mass Destruction Proxy) standardizes biosec/cyber/chem dangerous-knowledge eval. Frontier labs run with domain experts (a constitutional-methods frontier lab biosec eval involved virologists). DARPA AIxCC, DEFCON CTFs for cyber. Results gate deployment per Responsible Scaling Policy framework/Preparedness. e.g. WMDP (Li 2024) · a constitutional-methods frontier lab third AI Safety Level biosec eval · a Preparedness scorecard

C2.5.1 Bioweapon uplift

Whether model provides material uplift over web search for synthesis of bioweapons. Industry standard: Critical pre-deployment eval. Threshold-based deployment gates.

C2.5.2 Cyber capability

Offensive cyber: vulnerability discovery, exploit development, autonomous attack. Industry standard: Cybench, CTF benchmarks. Frontier model cards report.

C2.5.3 Autonomous replication

Whether model can self-exfiltrate, self-improve, acquire resources. Industry standard: An external evaluation organization has standardized these evals. Frontier labs run them before deployment.

C2.6 Evaluation Governance

Evaluation governance. Who designs the safety evals? Are evaluators independent (to avoid lab bias)? Pre- vs post-deployment? External evaluations by national AI Safety Institutes are emerging as standard. SOTA: National AI Safety Institutes (one in London and the US AI Safety Institute) conduct pre-deployment evaluations of frontier models from leading frontier labs, including a multimodal frontier lab. Voluntary commitments via the Bletchley, Seoul, and Paris summits. Independent eval is a growing institutional practice. e.g. national AI Safety Institute evaluations of leading current-generation frontier models · MLCommons AILuminate

C2.6.1 Pre-deployment gating

Eval thresholds that must be passed before deployment. Industry standard: A Responsible Scaling Policy framework and a Preparedness-style framework define thresholds. Public Responsible Scaling Policies increasingly common.
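A hedged sketch of a threshold gate, using a Low/Medium/High/Critical scale as the example rating. The rule "deploy only if every category is rated Medium or below" and the names Risk and deployment_gate are assumptions for illustration, not a quotation of any published framework.

```python
from enum import IntEnum


class Risk(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3


def deployment_gate(
    scorecard: dict[str, Risk],
    max_deployable: Risk = Risk.MEDIUM,  # assumed policy, not a published rule
) -> tuple[bool, list[str]]:
    """Return (may_deploy, categories that block deployment)."""
    blocking = sorted(cat for cat, rating in scorecard.items() if rating > max_deployable)
    return (not blocking, blocking)
```

A gate of this shape makes the blocking categories explicit, which is what a pause or mitigation decision would need to act on.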

C2.6.2 Third-party eval

External auditors run evals. Industry standard: National AI Safety Institutes and an external evaluation organization have run pre-deployment evals on frontier models.

Robustness

16 sub-endpoints mapped

MZN Provisional Position · Partial

Security-driven robustness research

Robustness work emerges from adversarial-research findings (perturbation, multi-turn, cross-modal). Persian-language robustness gives direct insight into low-resource cross-language safety gaps. Methodology documented under controlled disclosure.

Definition

Responsible Scaling / Release Framework: institutional commitments tying capability thresholds to required safety measures. The forcing function that prevents a 'race to the bottom'. A Responsible Scaling Policy framework, a Preparedness-style framework, and a multimodal frontier lab's Frontier-Safety-style framework all define: capability levels, evaluation requirements per level, security/deployment mitigations required per level, conditions for pause/rollback.

State of the Art (2025–2026)

A Responsible Scaling Policy framework (v2, 2024): defines AI Safety Levels with capability thresholds for autonomous biosecurity, cyber, and AI R&D capabilities. A Preparedness-style framework (2023, updated): Critical/High/Medium/Low risk levels with deployment gates. A Frontier-Safety-style framework is similar. Voluntary commitments via national AI Safety Institutes and the Seoul declaration. Increasingly intersecting with regulation (EU AI Act).

Key Decisions

Capability threshold definitions
Required mitigations per threshold
Pre-deployment evaluation requirements
Pause conditions
Public commitments

Trade-offs

Strict thresholds → might pause valuable deployment
Loose → race-to-bottom risk

Numbers & Ablations

AI Safety Level (constitutional-methods framework) tiers: second AI Safety Level = current frontier, third AI Safety Level = capabilities triggering enhanced security/deployment, fourth AI Safety Level = catastrophic capabilities (no model has reached).
A leading frontier lab Preparedness: 4 risk categories (Cyber, CBRN, Persuasion, Model Autonomy), each rated Low/Medium/High/Critical.
a Frontier-Safety-style framework (2024): 7 capability levels across persuasion, autonomy, cyber, bio.
Voluntary commitments: 16 frontier labs signed Seoul Commitments (May 2024) including leading frontier labs, a multimodal frontier lab, an open-weights frontier lab, a synthetic-data-focused lab.
Eval frequency under Responsible Scaling Policy framework: every major model release, plus unscheduled re-eval if capability surprises emerge.
Pause/halt threshold: never publicly triggered at any frontier lab as of early 2026. Either thresholds are too high, or capability hasn't crossed them, or commitments are aspirational.

Open Questions

Are Responsible Scaling Policy framework capability thresholds set rigorously enough? They're voluntary; no external oversight on threshold-setting.
Eval validity: how do you prove that an eval correctly measures the capability it claims to? No formal verification.
Pause discipline: would a frontier lab actually pause development if a threshold triggered, in face of competitive pressure? Untested.
Capability surprise: capabilities emerge non-monotonically. Responsible Scaling Policy frameworks assume monotonic capability growth between evals. They might miss sharp jumps.
Government takeover: if a lab triggers fourth AI Safety Level thresholds, what then? Frameworks are silent on government's role; geopolitically loaded.

Reference analyst note. Responsible Scaling Policies are useful coordination devices but their actual prophylactic power is untested. They've never paused a release. The optimistic read: capabilities haven't crossed thresholds. The pessimistic read: thresholds are calibrated to never bind. The truth is probably a mix. The next test will come when a model genuinely approaches third AI Safety Level cyber or CBRN, likely in 2026-2027. Whether the framework holds under genuine commercial pressure is the real test.

Examples

a Responsible Scaling Policy framework (v2, 2024) (public) · a Preparedness-style framework (public) · a Frontier-Safety-style framework (public)

References (Academic)

a Responsible Scaling Policy framework (v2, 2024) · a Preparedness-style framework (2024) · a Frontier-Safety-style framework (2024)

Sub-endpoint anatomy — 16 items mapped

C3.1 Adversarial Robustness

Capability thresholds. Specific capability levels above which deployment requires additional safeguards. Examples: third AI Safety Level = 'meaningful uplift to non-state actor for CBRN attack' or 'autonomous research engineer at frontier-lab level'. Defining these is the central design question of a Responsible Scaling Policy framework. SOTA: AI Safety Level (constitutional-methods framework) tiers: second AI Safety Level (current frontier), third AI Safety Level (advanced biosec uplift OR partial autonomy), fourth AI Safety Level (extreme uplift OR substantial autonomy). A leading frontier lab: Critical/High/Medium/Low across categories. Industry coordinating via national AI Safety Institutes. Trade-off: thresholds need to be measurable, but capability evaluation is hard. e.g. third AI Safety Level thresholds (constitutional-methods framework, public) · a Preparedness scorecard

C3.1.1 Suffix-based attacks (an optimization-based adversarial attack)

Optimized token suffixes that bypass safety. Industry standard: Zou 2023. Universal adversarial suffixes transfer across models. Mitigation via adversarial training and input filtering.

C3.1.2 Paraphrase robustness

Same intent, different wording → consistent behavior. Industry standard: PromptBench (Zhu 2023) tests systematic paraphrasing.

C3.1.3 Perturbation robustness

Typos, character swaps, Unicode tricks. Industry standard: TextAttack benchmark suite. Models reasonably robust to typos, vulnerable to crafted Unicode.
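A minimal sketch of the three perturbation types named above (typo-style swap, Unicode homoglyphs, zero-width characters) and a consistency score. The helper names, the small homoglyph table, and the `model` callable are hypothetical; this is not the TextAttack API.

```python
from typing import Callable

# Latin letters mapped to visually similar Cyrillic letters (illustrative subset).
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}


def swap_adjacent(text: str, i: int) -> str:
    """Typo-style perturbation: swap characters at positions i and i+1."""
    if i < 0 or i + 1 >= len(text):
        return text
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def homoglyph(text: str) -> str:
    """Replace selected Latin letters with Cyrillic lookalikes."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


def insert_zero_width(text: str) -> str:
    """Insert a zero-width space between every pair of characters."""
    return "\u200b".join(text)


def consistency(model: Callable[[str], object], text: str) -> float:
    """Fraction of perturbed variants on which the model output is unchanged."""
    variants = [swap_adjacent(text, 0), homoglyph(text), insert_zero_width(text)]
    baseline = model(text)
    return sum(model(v) == baseline for v in variants) / len(variants)
```

A keyword-matching filter scores 0.0 on its trigger word under this measure, which illustrates why crafted Unicode is harder than ordinary typos.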

C3.2 Distribution Shift

Mitigation requirements. What must be in place when capability threshold is reached. Examples: model weight encryption + access logging (against theft), deployment behavioral filtering, restricted access tier, internal review board approval, external red team context. SOTA: a constitutional-methods frontier lab third AI Safety Level deployment standard requires: harm-prevention measures with specific evaluation criteria, security controls protecting against insider threats, internal review board sign-off. third AI Safety Level security: protect against non-state actors stealing weights. Hardware security (HSMs, TEEs) emerging requirement. e.g. a constitutional-methods frontier lab third AI Safety Level deployment + security standards · a leading frontier lab mitigation requirements per Preparedness tier

C3.2.1 Domain shift

Domains not heavily represented in pre-training. Industry standard: Performance degrades on legal, medical, niche scientific. Targeted SFT addresses partially.

C3.2.2 Temporal shift

Knowledge after training cutoff. Industry standard: Inevitable. Mitigated via retrieval augmentation, periodic retraining.

C3.3 Multi-Language Robustness

Internal review and governance. Decision-making structure that authorizes deployment. Examples: internal review board, board-level oversight (a constitutional-methods frontier lab Long-Term Benefit Trust, a leading frontier lab safety committees), required external sign-off for highest tiers. SOTA: a constitutional-methods frontier lab Long-Term Benefit Trust holds ultimate authority over key safety decisions. A leading frontier lab safety committees with board-level escalation. Public commitments to delay/halt deployment if Responsible Scaling Policy framework triggers fire. Incident response procedures. Whistleblower protections (post-2024 SB 1047 debate). e.g. a constitutional-methods frontier lab LTBT · a leading frontier lab Preparedness Advisory Group

C3.3.1 Cross-language safety

Same harmful query in low-resource language may bypass safety. Industry standard: Yong 2023 documented low-resource jailbreaks. Mitigation via multilingual safety SFT.

C3.3.2 Capability parity

Equivalent capability across languages. Industry standard: Significant gap remains for low-resource languages. Multilingual MMLU, MGSM benchmark gaps.

C3.4 Out-of-Distribution Behavior

Pause / rollback procedures. Conditions and process for stopping a deployment or training run. Required for credible Responsible Scaling Policy framework. Examples: capability eval result exceeds threshold without mitigations → pause training; deployed model exhibits unsafe behavior → rollback to prior version; security breach detected → emergency containment. SOTA: Frontier labs have documented but largely untested rollback procedures. A national AI Safety Institute external evaluations include 'pause condition triggered?' assessment. Few public examples of actual pause being triggered (some occurred internally at frontier labs, not publicized). e.g. Responsible Scaling Policy framework-mandated pause conditions

C3.4.1 Calibrated uncertainty

Model knows what it doesn't know. Industry standard: Active research. Modern models often confidently wrong on OOD inputs.

C3.4.2 Refusal on OOD

Whether model declines vs. confabulates. Industry standard: Better-aligned models refuse or hedge; weaker models hallucinate.

C3.5 Stress Tests

Stress tests. Adversarial inputs probing robustness: distribution shift (input from outside training distribution), adversarial perturbations (slight input changes flipping output), out-of-distribution detection. Distinct from C2 dangerous-capability tests. SOTA: Robustness benchmarks: AdvGLUE, ANLI for NLI; VQA-Robust for vision. Frontier models still vulnerable to subtle perturbations. Active research: certified robustness, adversarial training. Real-world stress: novel languages, domains, formats. e.g. AdvGLUE (Wang) · ANLI (Nie 2020) · MMLU-Robust variants

C3.5.1 Long-context degradation

Performance drop as context length increases. Industry standard: Lost-in-the-middle (Liu 2023) — middle of context attended less. Active mitigation.
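A hedged sketch of a needle-at-depth probe for this effect: place one key fact at varying relative depths in filler text and record whether the answer is recovered. The `ask` callable stands in for a model call and is an assumption, as are the function names.

```python
from typing import Callable


def build_context(filler: list[str], needle: str, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    idx = round(depth * len(filler))
    return " ".join(filler[:idx] + [needle] + filler[idx:])


def depth_sweep(
    ask: Callable[[str], str],
    filler: list[str],
    needle: str,
    answer: str,
    depths: list[float],
) -> dict[float, bool]:
    """Map each depth to whether the model's reply contains the answer."""
    return {d: answer in ask(build_context(filler, needle, d)) for d in depths}
```

A lost-in-the-middle pattern would show up as True at depths near 0.0 and 1.0 and False around 0.5.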

C3.5.2 Input length stress

Very long single inputs without structure. Industry standard: Performance varies by model. Reported in long-context benchmarks (RULER).

Output Safety

11 sub-endpoints mapped

MZN Provisional Position · Strong Evidence

Output-conformance safety templates and egress controls

Egress-time template conformance validates every response against safe-output templates — a paradigm shift from input enumeration. Last-mile enforcement controls ensure unsafe content cannot exit even when intent detection fails. Cached canonical refusals for known fragile zones. Methodology architecture is documented at high level; templates and allow-lists are reserved.

Definition

Output safety: defenses applied at inference-time on model outputs. Distinct from training-time safety (B-group). Operates as final layer regardless of training quality. Components: output content filters (an open-weights output classifier, a leading frontier lab Moderations), PII detection/redaction, watermarking, provenance metadata (C2PA), output context (schema compliance, refusal reformulation).

State of the Art (2025–2026)

A recent-generation output classifier (an open-weights frontier lab) is the open standard. Commercial options include a moderation API service and a constitutional-methods frontier lab safety classifier. C2PA (Coalition for Content Provenance and Authenticity) is the standard for cryptographic content provenance, with Adobe, a leading frontier lab, and a synthetic-data-focused lab adopting. A generative-content watermarking system (a multimodal frontier lab) watermarks AI-generated content. Constitutional Classifiers (a constitutional-methods frontier lab, 2025): trained classifiers checking outputs against constitution principles.

Key Decisions

Filter classifier (open vs custom)
PII redaction strategy
Watermarking yes/no/method
Provenance metadata
Latency budget for filtering

Trade-offs

More filtering → safer outputs, latency overhead
Watermarking → provenance, slight quality risk

Numbers & Ablations

a recent-generation output classifier: 8B params, 14 harm categories, ~95% accuracy on standard categories, ~50-100ms latency on a current-generation accelerator.
Constitutional Classifiers (a constitutional-methods frontier lab 2025): trained classifiers checking against 50+ constitution principles. ~80% reduction in jailbreak success vs base model alone.
A moderation API service: free, ~50ms latency, 13 categories. Frontier moderation classifiers run on every API request.
C2PA adoption (Aug 2024): Adobe, a leading frontier lab (DALL-E), a synthetic-data-focused lab Copilot, Sony cameras, Nikon cameras, BBC. Provenance via cryptographic signatures.
a generative-content watermarking system-Text watermark detection: ~95-99% true positive rate at acceptable false positive rates (Dathathri 2024). Robust to paraphrasing in shorter outputs, less so in longer.
Output safety latency budget: frontier APIs allocate 5-15% of inference cost / latency to safety classifiers.

Open Questions

Watermark robustness against adversarial paraphrasing: a generative-content watermarking system demonstrated on benign paraphrasing; under active adversarial attack, removal is straightforward.
Output classifier coverage: any classifier trained on a fixed taxonomy is gameable by attacks outside that taxonomy. The arms race is unwinnable in static defense.
Multi-modal output safety: text classifiers mature; image generation safety (Diffusion model output filtering) less mature.
Refusal style: 'sorry I can't help with that' refusals harm UX. Better refusal templates (offer alternative) under-deployed.
Content provenance enforcement: C2PA only works if downstream platforms enforce it. They mostly don't. Adoption gap.

Reference analyst note. Output safety is the right architectural choice: input filtering is doomed because the input space is unbounded, while the output space is comparatively constrained. The output-conformance safety paradigm (egress filtering + cached refusal templates + classifier ensemble) is the production-ready answer. The remaining hard problem is multimodal output (image/video/audio), where classification is much harder than for text. Watermarking is a useful piece but not a solution; treat it as evidence, not enforcement.

Examples

a recent-generation output classifier · a leading frontier lab Moderations · a constitutional-methods frontier lab Constitutional Classifiers · a multimodal frontier lab a generative-content watermarking system

References (Academic)

Inan et al., an open-weights output classifier (2023) · Sharma et al., Constitutional Classifiers (2025) · C2PA spec · Dathathri et al., a generative-content watermarking system-Text (2024)

Sub-endpoint anatomy — 11 items mapped

C4.1 Output Classifiers

Content filter / guardrail models. Small classifier models that check input and output for harmful content. A recent-generation output classifier (an open-weights frontier lab, open) is reference: ~8B params, 14 harm categories, ~100ms inference. Often deployed both pre-input (block harmful prompts) and post-output (block harmful generations). SOTA: a recent-generation output classifier, ShieldGemma, WildGuard (all open). Commercial: a leading frontier lab Moderations, a constitutional-methods frontier lab safety classifiers, Lakera Guard, Robust Intelligence. Performance: ~95%+ on standard harm categories, fail on sophisticated jailbreaks. Constitutional Classifiers (a constitutional-methods frontier lab 2025) use principles-based classification. e.g. a recent-generation output classifier (open) · ShieldGemma 2/9/27B (a multimodal frontier lab) · a moderation API service

C4.1.1 an open-weights output classifier family

Open-weights safety classifier. Industry standard: Inan 2023 introduced a first-generation open-weights output classifier; an open-weights frontier lab has since released second- and third-generation versions. Widely used as the reference open implementation.

C4.1.2 Proprietary classifiers

Internal output-safety models. Industry standard: A leading frontier lab's moderation API service; internal classifiers at a constitutional-methods frontier lab and a multimodal frontier lab. Run alongside or in series with main model.

C4.2 Canonical Refusal

PII (Personally Identifiable Information) detection and redaction. Identifies and masks names, addresses, SSNs, phone numbers, emails, credit cards in inputs and outputs. Required for GDPR, HIPAA, enterprise deployments. SOTA: a synthetic-data-focused lab Presidio (open) is standard PII engine. Custom recognizers for domain-specific PII (medical record numbers, etc.). Modern approaches use NER + LLM verification. Trade-off: aggressive redaction → utility loss; lax → leak risk. e.g. a synthetic-data-focused lab Presidio (open) · a hyperscaler platform Comprehend PII · Custom NER + LLM
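A minimal regex-only sketch of detection and masking for three of the PII types listed above. The patterns are simplified assumptions (US-style SSN and phone formats only) and are far weaker than a production engine with NER and LLM verification; this is not the Presidio API.

```python
import re

# Simplified, illustrative patterns; real recognizers need many more formats.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace each detected PII span with a typed placeholder such as <EMAIL>."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text
```

Typed placeholders keep some utility (the reader still knows an email was present), which is one way to soften the redaction versus utility trade-off.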

C4.2.1 Refusal templates

Pre-written refusal language. Industry standard: Used to ensure consistent, brand-safe refusal language. Routed when classifier flags.

C4.2.2 Safe-alternative routing

When refusing, suggest legitimate alternatives. Industry standard: Improves UX. Frontier labs handle in alignment training and at output stage.
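A hedged sketch of template routing with a safe alternative attached. The category names, the template wording, and the route_output function are all hypothetical; real templates would come from policy and brand review.

```python
from typing import Optional

# Hypothetical categories and wording, for illustration only.
REFUSAL_TEMPLATES = {
    "weapons": (
        "I can't help with that.",
        "I can share general safety information instead.",
    ),
    "medical_dosage": (
        "I can't give dosing instructions.",
        "A pharmacist or physician can advise on this.",
    ),
}
DEFAULT_TEMPLATE = ("I can't help with that request.", "")


def route_output(draft: str, flagged_category: Optional[str]) -> str:
    """Return the draft unchanged, or a canned refusal if the classifier flagged it."""
    if flagged_category is None:
        return draft
    refusal, alternative = REFUSAL_TEMPLATES.get(flagged_category, DEFAULT_TEMPLATE)
    return f"{refusal} {alternative}".strip()
```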

C4.3 PII Filtering

Watermarking and provenance. Cryptographic signatures embedded in model outputs (text or media) that allow detection of AI generation. C2PA: industry standard for provenance metadata in images/video. a generative-content watermarking system: text and image watermarking by a multimodal frontier lab. Important for misinformation, deepfake detection, training-data quality (avoid training on AI output). SOTA: C2PA adopted by Adobe, a leading frontier lab, a synthetic-data-focused lab, Sony, Nikon. a generative-content watermarking system-Text (a multimodal frontier lab, 2024) demonstrated text watermarking with minimal quality loss and high detection accuracy. Open: MarkLLM. Trade-off: watermarks can be removed by paraphrasing, making detection adversarial. e.g. C2PA: Adobe, a leading frontier lab deploy · a generative-content watermarking system by a multimodal frontier lab · Open: MarkLLM

C4.4 Format / Structure Context

Structured output context. When model output must conform to a schema (JSON, function call), the output layer enforces this. Constrained decoding (Outlines, JSON mode in a leading frontier lab/a constitutional-methods frontier lab) restricts token sampling to schema-compliant continuations. Used for tool calls, structured extraction, agent loops. SOTA: A leading frontier lab Structured Outputs (Aug 2024): guaranteed schema compliance via constrained decoding. A similar approach via tool use. Open: Outlines library, llama.cpp grammars. Trade-off: constrained decoding can degrade quality if model struggles with natural format. Soft constraints (parse-and-retry) are often more practical than hard constraints. e.g. A leading frontier lab Structured Outputs · Outlines library (open) · Pydantic-based validation
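A minimal sketch of the soft-constraint (parse-and-retry) approach mentioned above, using only the standard library. The `generate` callable stands in for a model call, and the two-field schema and feedback wording are assumptions for illustration.

```python
import json
from typing import Callable

# Hypothetical schema: required field name -> expected Python type.
REQUIRED_FIELDS = {"name": str, "age": int}


def validate(raw: str) -> dict:
    """Parse JSON and check required fields; raise ValueError on any problem."""
    obj = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(obj, dict):
        raise ValueError("top-level value must be an object")
    for key, expected in REQUIRED_FIELDS.items():
        if not isinstance(obj.get(key), expected):
            raise ValueError(f"field {key!r} missing or not {expected.__name__}")
    return obj


def parse_and_retry(generate: Callable[[str], str], prompt: str, max_attempts: int = 3) -> dict:
    """Call the model, validate, and re-prompt with the error until valid or exhausted."""
    feedback = ""
    for _ in range(max_attempts):
        raw = generate(prompt + feedback)
        try:
            return validate(raw)
        except ValueError as err:
            feedback = f"\nPrevious output was invalid ({err}). Return only valid JSON."
    raise RuntimeError("no schema-compliant output after retries")
```

Unlike constrained decoding, this leaves sampling untouched and pays for failures with extra model calls, so the retry budget is the main tuning knob.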

C4.4.1 Structured output (JSON, schema)

Validate JSON outputs match schema. Industry standard: Outlines, JSON Schema context, constrained decoding.

C4.4.2 Code output context

Static analysis of generated code for known-bad patterns. Industry standard: Linters, security scanners. Used in code-assistant products.

C4.5 Latency & Cost of Output Safety

Latency and cost of output safety. Output safety adds inference cost: classifier pass adds 10-100ms latency, doubles compute for short responses. Engineering trade-offs: parallel classification (overlap with generation), early-exit on clearly-safe content, caching for repeated outputs. SOTA: a recent-generation output classifier ~50-100ms on a current-generation accelerator. Optimizations: early termination, parallel pipeline, caching. Frontier serving budgets 5-15% of inference cost for safety. output-conformance safety paradigms emphasize cached refusals for known harm patterns. e.g. a recent-generation output classifier latency benchmarks · Constitutional Classifiers performance
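A hedged sketch of the caching optimization named above: classifier verdicts are memoized by a hash of the output text, so repeated outputs skip the classifier pass. The class name and the `classifier` callable are assumptions; eviction and expiry are omitted for brevity.

```python
import hashlib
from typing import Callable


class CachedSafetyCheck:
    """Memoize safety verdicts so identical outputs are classified only once."""

    def __init__(self, classifier: Callable[[str], bool]):
        self._classifier = classifier       # returns True when the text is safe
        self._verdicts: dict[str, bool] = {}

    def is_safe(self, text: str) -> bool:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._verdicts:
            self._verdicts[key] = self._classifier(text)  # slow path: run classifier
        return self._verdicts[key]
```

Exact-match caching only helps for verbatim repeats such as canned refusals or boilerplate; it does nothing for paraphrased outputs.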

Serving

13 sub-endpoints mapped

MZN Provisional Position · Partial

Phase 1 application/platform serving experience across Mazzaneh modules

Phase 1 deployed live serving infrastructure for 168K+ users across 22 commerce modules. Specialized-routing architectural patterns documented. Frontier-scale serving with multi-region failover is a Phase 3 scope.

Phase context: D1 references application/platform serving experience from Mazzaneh modules. It is not a claim of frontier-scale model-serving infrastructure.

Definition

Serving stack. From request arrival to response. Components: API gateway (auth, routing), inference engine (an open-source inference engine, TRT-LLM, SGLang), batch coordinator, response streamer. Performance gap between naive and optimized: 10-100×.

State of the Art (2025–2026)

An open-source inference engine is the dominant open option. A vendor inference stack delivers peak performance on a leading accelerator vendor's hardware. SGLang for shared-prefix workloads. Hosted: Anyscale, Together AI, Fireworks, Replicate. A high-throughput inference accelerator (LPU) for ultra-low latency. Multi-model dispatch (multiple base models on same cluster) increasingly common.

Key Decisions

Engine choice
Auto-scaling strategy
Multi-model isolation
GPU pool sizing

Numbers & Ablations

an open-source inference engine throughput: ~10-25× over naive batch=1 baseline at typical workloads. PagedAttention reduces KV memory waste from ~60% to ~4%.
TTFT (time to first token) targets: <200ms chat, <500ms tool use, <100ms voice. Frontier achieves these with prefix caching + speculative decoding.
TPOT (time per output token) targets: <50ms (at least 20 tok/sec) for smooth streaming; <30ms desirable.
A leading open-weights model (70B class) FP16 single a current-generation accelerator: ~30-50 tok/sec single-user, ~1500-3000 tok/sec batch-32. Quantized INT4: ~1.5× boost.
Cost per million tokens (mid-2024): a current-generation frontier model-Turbo input/output $10/$30, a long-context frontier model $3/$15, a leading open-weights model (70B class) (Together) $0.88/$0.88. Prompt caching reduces by 50-90%.
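The latency and cost figures above relate through simple arithmetic, sketched here. The function names are illustrative, and the latency model (first token at TTFT, each later token one TPOT apart) is a simplifying assumption that ignores queueing and network time.

```python
def tokens_per_second(tpot_ms: float) -> float:
    """Streaming rate implied by a time-per-output-token figure."""
    return 1000.0 / tpot_ms


def response_latency_ms(ttft_ms: float, tpot_ms: float, n_tokens: int) -> float:
    """Time until the last token: first token at TTFT, then one TPOT per further token."""
    return ttft_ms + max(n_tokens - 1, 0) * tpot_ms


def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Cost of one request given per-million-token prices."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000
```

For example, a 50ms TPOT gives 20 tok/sec, and a 100-token reply with 200ms TTFT completes in about 5.15 seconds under this model.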

Open Questions

Optimal serving stack at frontier: an open-source inference engine, a vendor inference stack, SGLang each have advantages. No standard 'best' — workload-specific.
Multi-tenancy isolation: how strong is isolation between customer requests on shared GPU? Some side-channel concerns (timing, cache).
Edge inference: running capable models on edge devices (laptops, phones) is the new frontier. Small open-weights models (around 3B parameters) and a synthetic-heavy small frontier model run on phones. The quality gap to frontier is still substantial.
Serving reasoning models: o1/R1-style models with hidden chains-of-thought have very different latency profiles (long initial thinking). UX patterns unclear.

Reference analyst note. Inference engineering is undervalued relative to training. A 5× throughput gain via better serving = 5× more users at same cost. Most labs underinvest. An open-source inference engine's PagedAttention was a paper; it should have been a unicorn. The next round of gains comes from: (a) speculative decoding everywhere (a draft-head speculative decoding technique-2, MTP), (b) FP8/FP4 inference on a next-generation accelerator, (c) cross-request KV cache (prefix caching), (d) serving optimizations specific to reasoning models. Anyone serving LLMs at scale who isn't doing all four is leaving 5-10× on the table.

Examples

an open-source inference engine (open frontier) · a vendor inference stack (a leading accelerator vendor optimized) · SGLang challenger · a high-throughput inference accelerator LPU production

References (Academic)

Kwon et al., an open-source inference engine (2023) · Zheng et al., SGLang (2024)

Sub-endpoint anatomy — 13 items mapped

D1.1 Inference Engine

Inference engine itself: the runtime that takes tokenized request → produces tokens. Manages KV cache, attention computation, sampling. An open-source inference engine, a vendor inference stack, SGLang are competing implementations. SOTA: an open-source inference engine PagedAttention + continuous batching is reference design. A vendor inference stack uses a leading accelerator vendor's compiled kernels for peak perf. SGLang RadixAttention shares prefix KV across requests. Hardware-specific: a high-throughput inference accelerator LPU bypasses GPU paradigm entirely. e.g. an open-source inference engine 0.6+ · a vendor inference stack · SGLang

D1.1.1 an open-source inference engine

Open-source serving engine with PagedAttention. Industry standard: Dominant open-source choice. Kwon 2023 introduced PagedAttention.

D1.1.2 an open-source inference server (Text Generation Inference)

an open-model hub inference server. Industry standard: Common deployment choice. Supports continuous batching, quantization.

D1.1.3 a vendor inference stack / a vendor inference platform

a leading accelerator vendor inference stack. Industry standard: High-performance commercial deployment. Used at major clouds.

D1.1.4 Proprietary engines

a leading frontier lab, a constitutional-methods frontier lab, a multimodal frontier lab internal serving stacks. Industry standard: Custom for frontier labs. Architecture details not public.

D1.2 Request Routing

Request routing. Given heterogeneous fleet and per-request context (model, length, latency target), route to right backend. Routing factors: model availability, KV-cache locality (cache-aware routing), latency target, queue depth. SOTA: Cache-aware routing (route to backend with cached prefix) reduces TTFT 5-10× for shared system prompts. Leading frontier labs use sophisticated routing for prompt caching. Sticky sessions for long conversations preserve KV cache. e.g. A leading accelerator vendor a vendor inference platform routing · Cloudflare AI Gateway · Custom routing layers
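A minimal sketch of cache-aware routing: prefer the backend holding the longest cached prefix of the request's tokens, and break ties by queue depth. The Backend record and the scoring rule are assumptions for illustration; production routers also weigh model availability and latency targets.

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    name: str
    queue_depth: int
    cached_prefixes: list[tuple[int, ...]] = field(default_factory=list)


def shared_prefix_len(a: tuple[int, ...], b: tuple[int, ...]) -> int:
    """Number of leading tokens the two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def route(tokens: tuple[int, ...], backends: list[Backend]) -> Backend:
    """Pick the backend with the longest cached prefix, then the shortest queue."""
    def score(backend: Backend) -> tuple[int, int]:
        best = max((shared_prefix_len(tokens, p) for p in backend.cached_prefixes), default=0)
        return (-best, backend.queue_depth)
    return min(backends, key=score)
```

Ranking cache locality strictly above queue depth is the simplest policy; a real router would likely cap how much queueing it accepts for a cache hit.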

D1.2.1 Load balancer

Front-tier distribution. Industry standard: Standard L7 load balancer (Envoy, nginx).

D1.2.2 Model routing

Selecting which model handles which request. Industry standard: Mix of explicit (user selects model) and implicit (router selects based on query type, cost).

D1.2.3 Tenant isolation

Multi-tenant serving with isolation guarantees. Industry standard: Required for enterprise. Per-tenant rate limits, quotas, optionally dedicated instances.
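A hedged sketch of per-tenant rate limiting with a token bucket, one common way to implement the limits mentioned above. The class names are illustrative, and the clock is passed in explicitly (an assumption made so the example is deterministic).

```python
class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate_per_s` tokens per second."""

    def __init__(self, rate_per_s: float, capacity: float, now: float = 0.0):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = capacity
        self.last = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantLimiter:
    """One independent bucket per tenant, so one tenant cannot starve another."""

    def __init__(self, rate_per_s: float, capacity: float):
        self._rate = rate_per_s
        self._capacity = capacity
        self._buckets = {}

    def allow(self, tenant: str, now: float) -> bool:
        if tenant not in self._buckets:
            self._buckets[tenant] = TokenBucket(self._rate, self._capacity, now)
        return self._buckets[tenant].allow(now)
```

Rate limits bound a tenant's share of capacity but do not address the side-channel concerns that dedicated instances are meant to cover.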

D1.3 Streaming & Response Format

Streaming and response format. Server-Sent Events (SSE) for token-by-token streaming. Function call streaming. Multimodal streaming (image generation as it forms, audio chunks for voice mode). SOTA: SSE universal for chat. Tool-call streaming (chunks of JSON function arguments) standardized. Voice modes (a frontier multimodal model, a multimodal frontier model Live) stream audio at low latency (<300ms TTFT). WebSocket for bidirectional voice/video. e.g. A leading frontier lab streaming API · a constitutional-methods frontier lab streaming · a frontier multimodal model Voice (audio streaming)

D1.3.1 SSE / streaming

Server-sent events for token-by-token delivery. Industry standard: Universal at frontier APIs.
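
Illustrative sketch of SSE framing for token streaming. Each event is a "data:" line terminated by a blank line. The JSON field names (index, delta) and the "[DONE]" terminal sentinel are assumptions for illustration; real APIs define their own payload schema.

```python
import json

def sse_events(tokens):
    """Yield one SSE frame per token, then a terminal frame."""
    for i, tok in enumerate(tokens):
        # json.dumps escapes newlines, so a token can never break the framing.
        yield f"data: {json.dumps({'index': i, 'delta': tok})}\n\n"
    yield "data: [DONE]\n\n"

def parse_sse(stream):
    """Reassemble the text from a complete, already-decoded SSE stream."""
    out = []
    for frame in stream.split("\n\n"):
        if not frame.startswith("data: "):
            continue
        payload = frame[len("data: "):]
        if payload == "[DONE]":
            break
        out.append(json.loads(payload)["delta"])
    return "".join(out)
```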

D1.3.2 Tool / function call format

Structured output for tool invocations. Industry standard: A leading frontier lab function calling format dominant; a constitutional-methods frontier lab tool use; standardization emerging via an emerging tool-protocol standard.

D1.4 API Surface

API surface design. Endpoint shape: chat completions (de facto a leading frontier lab standard), completions (legacy), embeddings, batch, fine-tuning, files, audio. Consistency, versioning, deprecation policy. SOTA: A leading frontier lab Chat Completions = de facto standard. A constitutional-methods frontier lab Messages API similar. Most providers offer 'a leading frontier lab-compatible' endpoint. New endpoint categories: realtime (voice), assistants (state), code interpreter. Function calling / tools mature. e.g. A leading frontier lab API spec · a constitutional-methods frontier lab Messages API · OpenRouter (proxy across providers)

Inference Optimization

19 sub-endpoints mapped

MZN Provisional Position · Strong Evidence

Patent-grade candidate inference frameworks; benchmark validation pending

Five interconnected, patent-documented frameworks address inference cost through progressive contextual activation, persistent user-model caching, specialized routing, intent clarification, and computed-once-served-many reuse. The combined impact documented for frontier scale is multi-hundred-million-dollar annual savings; benchmark validation is pending. Architecture is patent-documented; mechanics held in the proprietary portfolio.

Definition

Inference optimization: reducing latency and cost per token. Stack: KV cache management, batching, speculative decoding, quantization, sparsity, kernel optimization. 10-100× speedup possible vs naive baseline.

State of the Art (2025–2026)

Frontier serving combines: PagedAttention (an open-source inference engine) + continuous batching + speculative decoding (a draft-head speculative decoding technique-2) + INT4 weight quant + FP8 activation quant + custom CUDA kernels (FlashAttention 3). Latency budgets: TTFT <200ms for chat, TPOT <50ms for streaming.

Key Decisions

Optimization stack components
Hardware target (a current-generation accelerator, a next-generation accelerator, AMD)
Quantization aggressiveness
Latency vs throughput trade-off

Numbers & Ablations

PagedAttention KV memory waste: ~60% (naive) → ~4% (an open-source inference engine). Roughly 4-15× more concurrent requests.
Continuous batching: 5-10× throughput vs static batching at varying-length workloads.
a draft-head speculative decoding technique-2 speculative decoding: 3-4× decode latency reduction with no quality loss. Production-deployed.
Quantization: AWQ INT4 weight-only gives ~4× smaller weights than FP16 (~2× smaller than INT8), <1% quality loss on most benchmarks. FP8 (a current-generation accelerator): ~2× throughput vs FP16, near-zero quality loss.
FlashAttention-3: 1.2-1.5× over FlashAttention-2 on a current-generation accelerator. Effectively the universal attention kernel.
Prefix caching impact: shared 2K-token system prompt across requests = 80-95% TTFT reduction via cached KV.

Open Questions

Optimal quantization at extreme low bit (W4A4, W2): research shows degradation; production deployment cautious.
Speculative decoding for reasoning models: long-CoT outputs may have lower acceptance rates. Workload-specific tuning unclear.
Hardware-software co-design: a next-generation accelerator NVLink Switch enables 72-GPU coherent domains. How much should serving stacks evolve to exploit this?
Inference-time compute scaling (best-of-N, MCTS): how do you serve these? Same per-query infrastructure must scale 10-100× compute. Production patterns immature.

Reference analyst note. Inference optimization is solved at the kernel and batching levels — an open-source inference engine, a vendor inference stack, FlashAttention together cover most of the win. The remaining frontier is system-level: prefix caching at scale, speculative decoding for reasoning, multi-LoRA dispatch, hardware-aware kernel JIT. Frontier serving stacks in 2026 will look fundamentally different from 2024 in their handling of test-time-compute-scaling models — this transition is mid-progress and labs differ widely.

Examples

An open-source inference engine with all optimizations enabled · a vendor inference stack at peak performance on a leading accelerator vendor's hardware · Together AI production stack

Sub-endpoint anatomy — 19 items mapped

D2.1 KV-Cache Management

KV cache management. KV memory dominates at long context. PagedAttention (fixed-size pages, sharing) reduces memory waste from ~60% to ~4%. Prefix caching shares pages across requests with shared system prompt. SOTA: PagedAttention universal. Prefix caching production-deployed (a constitutional-methods frontier lab 90% discount, a leading frontier lab 50% discount). KV cache offloading to CPU/disk for very long context. KV quantization (INT8) → 2× more requests per GPU. e.g. an open-source inference engine PagedAttention · a constitutional-methods frontier lab prompt caching · a leading frontier lab prompt caching

D2.1.1 PagedAttention

Virtual-memory-style page allocation for KV cache. Industry standard: an open-source inference engine standard. Reduces memory fragmentation, enables higher concurrency.
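
Illustrative arithmetic for why paging cuts KV memory waste, comparing a scheme that reserves max_len slots per request with one that allocates fixed-size blocks on demand. The sequence lengths, max_len, and block size in the usage note are assumed values, not measurements.

```python
import math

def reserved_waste(seq_lens, max_len):
    """Fraction of KV slots wasted when every request reserves max_len slots."""
    reserved = max_len * len(seq_lens)
    return 1 - sum(seq_lens) / reserved

def paged_waste(seq_lens, block_size=16):
    """Fraction wasted when KV is allocated in fixed-size blocks on demand."""
    allocated = sum(math.ceil(n / block_size) * block_size for n in seq_lens)
    return 1 - sum(seq_lens) / allocated
```

For three requests of 100, 500, and 900 tokens with max_len 2048, the reserved scheme wastes about three quarters of its slots, while 16-token paging wastes under 3%; with paging, only the last block of each sequence can be partly empty.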

D2.1.2 Prefix caching

Reuse KV for shared prompt prefixes (system prompt, conversation history). Industry standard: Standard at scale. A leading frontier lab prompt caching, a constitutional-methods frontier lab prompt caching, an open-source inference engine automatic prefix caching.
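
Illustrative sketch of prefix caching with chain-hashed block keys: each key covers the whole prefix up to and including its block, so a hit on block i implies a hit on all earlier blocks. The 4-token block size and the PrefixCache class are hypothetical; this is one common way to key blocks, not a statement of how any particular engine implements it.

```python
import hashlib

BLOCK = 4  # tokens per KV block (tiny, for illustration only)

def block_keys(tokens, block=BLOCK):
    """Chain-hash each full block so a key identifies the entire prefix up to it."""
    keys, h = [], hashlib.sha256()
    full = len(tokens) - len(tokens) % block  # ignore a trailing partial block
    for i in range(0, full, block):
        h.update(repr(tuple(tokens[i:i + block])).encode())
        keys.append(h.copy().hexdigest())
    return keys

class PrefixCache:
    def __init__(self):
        self.store = set()

    def lookup_and_insert(self, tokens):
        """Return how many prompt tokens could reuse cached KV, then cache this prompt."""
        keys = block_keys(tokens)
        hits = 0
        for k in keys:
            if k not in self.store:
                break
            hits += 1
        self.store.update(keys)
        return hits * BLOCK
```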

D2.1.3 KV quantization

Quantize stored KV (FP8, INT8) to fit longer context. Industry standard: Increasingly used for very long context. Mild quality impact.

D2.2 Batching

Batching strategy. Continuous batching swaps completed requests with new ones — 5-10× throughput. Chunked prefill interleaves prefill with decode for steady GPU utilization. Frontier deployments: dynamic batching with chunked prefill. SOTA: Continuous batching standard. Chunked prefill (Agrawal et al., 2023) prevents prefill stalls during decode. Per-iteration scheduling enables fine-grained mixing of prefill and decode requests. e.g. an open-source inference engine continuous batching · an open-source inference server (an open-model hub) · Sarathi-Serve chunked prefill

D2.2.1 Continuous batching

Add and remove requests dynamically (no padding to longest). Industry standard: Universal. Yu 2022 (Orca) reference. Major throughput gain over static batching.
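
Illustrative simulation of the scheduling difference, counting decode iterations only. It assumes every request needs a fixed number of decode steps and ignores prefill cost; the functions are a toy model, not a scheduler implementation.

```python
def static_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size]) for i in range(0, len(lengths), batch_size))

def continuous_steps(lengths, batch_size):
    """Continuous batching: a freed slot is refilled from the queue on the next iteration."""
    queue = list(lengths)
    running = []
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.pop(0))
        steps += 1
        # Requests with one token left finish in this iteration and leave the batch.
        running = [r - 1 for r in running if r > 1]
    return steps
```

With output lengths [10, 1, 1, 1, 10, 1, 1, 1] and four slots, static batching takes 20 iterations and continuous batching takes 11; the gap grows with the variance of output lengths.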

D2.2.2 Chunked prefill

Split long prefill phase into chunks; interleave with decode. Industry standard: Reduces tail latency. a chunked-prefill optimization technique, an open serving optimization framework.

D2.3 Speculative Decoding

Speculative decoding. Small draft model proposes k tokens; large target verifies in parallel. Accepted prefix advances; on rejection, target's correction used. 2-3× latency reduction. Variants: a draft-head speculative decoding technique (learned draft head), a draft-head speculative decoding technique (multi-head), n-gram speculation. SOTA: a draft-head speculative decoding technique-2 achieves 3-4× speedup. Production deployments at a leading frontier lab, a constitutional-methods frontier lab. Self-speculative (a draft-head speculative decoding technique) avoids separate draft model. Tree-based speculation (multiple draft branches, target picks best) emerging. e.g. A leading open-weights model + smaller draft · a draft-head speculative decoding technique / a draft-head speculative decoding technique-2 · a draft-head speculative decoding technique multi-head

D2.3.1 Standard speculative

Draft + verify with separate small model. Industry standard: Leviathan 2023, Chen 2023. 2-3× latency reduction.
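
Illustrative sketch of one draft-then-verify round in the greedy special case: a drafted token is accepted only if it equals the target's own greedy choice. The published methods use a probabilistic accept/reject rule that preserves the target's sampling distribution; that rule is not shown here. The toy draft_next and target_next functions are hypothetical stand-ins for real models.

```python
def target_next(ctx):
    """Toy 'large' model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10

def draft_next(ctx):
    """Toy 'small' model: agrees with the target except after token 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

def speculative_step(prefix, draft, target, k=4):
    """Run one round and return the tokens appended to prefix."""
    # Draft proposes k tokens autoregressively.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposal.append(t)
        ctx.append(t)
    # Target checks every position (a real system scores all k in one batched pass).
    accepted, ctx = [], list(prefix)
    for t in proposal:
        expected = target(ctx)
        if t != expected:
            accepted.append(expected)  # target's correction replaces the miss
            return accepted
        accepted.append(t)
        ctx.append(t)
    accepted.append(target(ctx))  # bonus token when all k drafts are accepted
    return accepted
```

Every round yields at least one target-quality token, so the output matches plain greedy decoding with the target; the speedup comes from rounds that accept several tokens per target pass.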

D2.3.2 a draft-head speculative decoding technique / a draft-head speculative decoding technique

Draft heads on main model; no separate model. Industry standard: Cai 2024 (a draft-head speculative decoding technique), Li 2024 (a draft-head speculative decoding technique). Used at frontier.

D2.3.3 Lookup decoding

Cache common n-grams and propose without a draft model. Industry standard: Useful for repetitive output (code, structured data).
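
Illustrative sketch of n-gram lookup: find the most recent earlier occurrence of the last n tokens in the context and propose the tokens that followed it. The function name and defaults are hypothetical; proposals would still be verified by the target model as in draft-and-verify decoding.

```python
def ngram_propose(tokens, n=2, k=3):
    """Propose up to k tokens by matching the last n tokens earlier in the context."""
    if len(tokens) <= n:
        return []
    tail = tokens[-n:]
    # Search right-to-left so the most recent earlier match wins;
    # starting at len - n - 1 excludes the tail matching itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == tail:
            return tokens[start + n:start + n + k]
    return []
```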

D2.4 Quantization

Quantization. Reduce precision: FP16 → INT8/INT4 weight, FP8 weight+activation. Memory and bandwidth savings → larger batches, faster inference. Weight-only (AWQ, GPTQ) common; activation quantization harder. SOTA: INT4 weight-only (AWQ, GPTQ) standard for cost-effective serving. FP8 (a current-generation accelerator, a next-generation accelerator) preserves quality better than INT8 with same throughput. SmoothQuant for activation quantization. A leading open-weights model.3 70B runs INT4 on a single current-generation accelerator. e.g. AWQ standard · GPTQ similar · FP8 on a current-generation accelerator (an open-weights frontier model (V3 class) inference)

D2.4.1 Weight quantization (INT8, INT4, FP8)

Quantize weights post-training. Industry standard: GPTQ, AWQ for INT4. FP8 native on a current-generation accelerator/a next-generation accelerator. INT4 standard for cost-optimized serving.
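
Illustrative sketch of group-wise symmetric INT4 quantization using plain round-to-nearest. This is deliberately simpler than GPTQ or AWQ, which choose scales and rounding with calibration data; the group sizes and the 16-bit scale in weight_bytes are assumptions.

```python
def quantize_int4(weights, group_size=4):
    """Symmetric round-to-nearest INT4 with one scale per group."""
    codes, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        # Symmetric INT4 range is -7..7; fall back to 1.0 for an all-zero group.
        scale = max(abs(w) for w in group) / 7 or 1.0
        scales.append(scale)
        codes.extend(max(-7, min(7, round(w / scale))) for w in group)
    return codes, scales

def dequantize_int4(codes, scales, group_size=4):
    """Map integer codes back to floats using each group's scale."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]

def weight_bytes(n_params, bits, group_size=128, scale_bits=16):
    """Storage for the weights plus per-group scales."""
    return n_params * bits / 8 + (n_params / group_size) * scale_bits / 8
```

Under these assumptions a 70B-parameter model needs about 36 GB at INT4 with per-128 scales versus 140 GB at FP16, a ratio just under 4×, and the reconstruction error per weight is bounded by half of its group's scale.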

D2.4.2 Activation quantization

Quantize activations as well as weights. Industry standard: SmoothQuant. More challenging than weight quantization.

D2.4.3 QAT (Quantization-Aware Training)

Train with quantization simulation to recover accuracy. Industry standard: Used selectively when post-training quantization loses too much.

D2.5 Sparsity

Sparsity. Structured sparsity (a leading accelerator vendor 2:4 pattern, hardware-supported) and unstructured pruning. Inference-time MoE-style routing for dense models (research). Less common at frontier than quantization. SOTA: A leading accelerator vendor Ampere/a Hopper-class architecture hardware accelerates 2:4 sparsity (50% sparse) by 2×. Inference-time sparsity via importance pruning (Wanda, SparseGPT). MoE itself is structural sparsity — top-2 of 8 experts active. e.g. Wanda (Sun et al., 2023) · SparseGPT (Frantar & Alistarh, 2023) · a leading accelerator vendor 2:4 hardware
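
Illustrative sketch of the 2:4 pattern: in every group of four weights, keep the two with the largest magnitude and zero the rest. Magnitude is used here as the simplest possible importance score; Wanda and SparseGPT use activation-aware or second-order criteria instead.

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude values in every group of four."""
    out = []
    for i in range(0, len(row), 4):
        group = row[i:i + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(len(group)), key=lambda j: abs(group[j]), reverse=True)[:2]
        out.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return out
```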

D2.6 Compilation & Kernels

Compilation and kernels. Custom CUDA kernels (FlashAttention 1/2/3) drive 2-4× attention speedup. Compilation frameworks (torch.compile, JAX XLA, a vendor inference stack) fuse operations. CUTLASS, a vendor inference platform for kernel authoring. SOTA: FlashAttention-3 (2024) on a current-generation accelerator: 1.2-1.5× FlashAttention-2. torch.compile mature in PyTorch 2.x. A vendor inference platform (a leading frontier lab) for high-level kernel writing. Frontier serving: full graph compilation + custom kernels for hot paths. e.g. FlashAttention 3 · torch.compile · a vendor inference platform kernels

D2.6.1 torch.compile / TorchDynamo

Graph compilation in PyTorch. Industry standard: Common for production deployment.

D2.6.2 Custom CUDA kernels

Hand-written kernels for hot paths. Industry standard: FlashAttention is canonical. Frontier labs maintain proprietary kernels.

Monitoring

15 sub-endpoints mapped

MZN Provisional Position · Strong Evidence

Monitoring architecture / GPU Sentinel route; implementation or pilot validation pending

Multi-layer monitoring architecture spanning behavioral analysis, runtime safety, anomaly detection, and hardware-level GPU telemetry. Methodology documented across operational and security categories. The dedicated GPU security category is documented as a market category that does not currently have commercial product entrants.

Definition

Production monitoring. What's happening in production right now? Latency (TTFT, TPOT, end-to-end), throughput, error rates, GPU utilization, KV cache hit rate, cost per request, content quality, drift, anomalies.

State of the Art (2025–2026)

Standard SRE metrics + LLM-specific layers. LangSmith, Arize Phoenix, Langfuse, Helicone for LLM observability. OpenTelemetry GenAI semantic conventions emerging as standard.

Key Decisions

Observability stack
Trace sampling rate
Cost attribution
Quality metrics

Numbers & Ablations

Standard SLOs: TTFT p95 <500ms, p99 <2s. Error rate <0.1%. Quality regression detection within 24-72 hours of deployment.
Online quality eval sampling: frontier labs sample 1-5% of production traffic for online judges.
Drift detection: typically a bin-based comparison or a KS test on response length and refusal rate. Alert thresholds 2-3 sigma.
Cost attribution granularity: per-customer, per-endpoint, per-token-type (input/output/cached).
Trace storage: full trace at 1% sample = ~1TB/day at frontier scale. Short retention (30-90 days) typical.
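
Illustrative sketch of cost attribution at the granularity listed above (per customer, per endpoint, per token type). The prices are hypothetical placeholders, with cached input priced at 10% of fresh input; the record fields are assumptions.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens.
PRICE_PER_MTOK = {"input": 3.00, "cached": 0.30, "output": 15.00}

def request_cost(usage):
    """Cost of one request from its per-token-type counts."""
    return sum(usage.get(kind, 0) * price / 1_000_000 for kind, price in PRICE_PER_MTOK.items())

def attribute(requests):
    """Sum cost per (customer, endpoint) pair."""
    totals = defaultdict(float)
    for r in requests:
        totals[(r["customer"], r["endpoint"])] += request_cost(r["usage"])
    return dict(totals)
```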

Open Questions

Quality regression detection latency: how fast can you actually detect that a deployed model got slightly worse? Anecdotally: hours to days, depending on regression magnitude.
Online eval reliability: LLM-as-judge has known biases (length, position, style). Online quality monitoring inherits these.
User feedback signal: thumbs-up/down rates are 0.1-1% of interactions. How representative is this signal?
Cost spike detection: distinguishing legitimate growth from abuse / attack / runaway agent loop is hard.

Reference analyst note. Production observability for LLMs is 5 years behind general SRE. LangSmith, Helicone, Langfuse are gradually catching up but lack maturity of Datadog/New Relic. The hard problem is quality monitoring — capability changes are subtle and statistical signals are noisy. Frontier labs maintain large internal observability teams; smaller deployments are largely flying blind. Expect this to be a major investment area 2025-2027.

Examples

LangSmith · Arize Phoenix · Helicone · Langfuse (open)

Sub-endpoint anatomy — 15 items mapped

D3.1 Latency Metrics

Latency metrics. TTFT (Time to First Token, dominant for chat UX), TPOT (Time Per Output Token, streaming smoothness), end-to-end. p50/p95/p99 all monitored. Long-tail (p99) often dominates user experience. SOTA: Frontier targets: TTFT <200ms for chat, <500ms for tools, <100ms for voice. TPOT <50ms (at least ~20 tokens/sec) for smooth streaming. Routing decisions favor low latency over throughput for interactive workloads. e.g. a constitutional-methods frontier lab latency dashboards · a leading frontier lab latency reporting · Production SRE practices for LLM

D3.1.1 Time-to-first-token (TTFT)

Latency from request to first output token. Industry standard: Critical UX metric. Frontier APIs target sub-second p50.

D3.1.2 Inter-token latency (ITL)

Latency between consecutive tokens. Industry standard: Drives perceived speed. Targets 30-100ms typical.

D3.1.3 End-to-end latency

Total request duration. Industry standard: Reported as p50/p90/p99.
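
Illustrative sketch of deriving the three latency metrics from per-token timestamps and summarizing them with nearest-rank percentiles. The function names and the choice of the nearest-rank method are assumptions; monitoring stacks differ in how they interpolate percentiles.

```python
import math

def request_latency(t_request, token_times):
    """TTFT, mean TPOT, and end-to-end latency (seconds) for one request."""
    ttft = token_times[0] - t_request
    e2e = token_times[-1] - t_request
    n = len(token_times)
    # Mean gap between consecutive tokens; undefined for a single token, so report 0.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return {"ttft": ttft, "tpot": tpot, "e2e": e2e}

def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]
```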

D3.2 Throughput

Throughput. Requests/sec, tokens/sec aggregate. Capacity planning, autoscaling decisions. Per-model, per-region. Monitor utilization to decide adding capacity or rerouting. SOTA: Frontier deployments serve millions of requests/day. Autoscaling on multiple signals: queue depth, GPU utilization, latency p95. Region-aware routing for compliance + latency. Capacity planning: 2× peak for headroom. e.g. Standard SRE scaling · Multi-region deployments · a constitutional-methods frontier lab / a leading frontier lab scaling patterns

D3.2.1 Aggregate tokens/sec

Total cluster throughput. Industry standard: Capacity planning metric.

D3.2.2 Per-GPU utilization

GPU-level throughput. Industry standard: Tracked for cost optimization.

D3.3 Quality Monitoring

Quality monitoring. Online evaluation of response quality. Methods: LLM-as-judge on sampled outputs, user feedback (thumbs, regenerate-rate), canary requests against golden test set, response distribution monitoring. SOTA: LangSmith online evaluators standard pattern. Sample 1-5% of production traffic through eval harness. Compare against historical baseline. Alert on quality regression. LLM-as-judge with fresh test sets. e.g. LangSmith online evals · Custom canary monitoring · Helicone quality tracking

D3.3.1 User feedback (thumbs)

Explicit user ratings. Industry standard: Universal in chat products. Sparse but high-signal.

D3.3.2 Implicit feedback

Edits, retries, abandonment. Industry standard: Stronger signal at scale than explicit ratings.

D3.3.3 LLM-as-judge eval on production samples

Sample production traffic and run quality eval. Industry standard: Increasingly standard. Catches quality regressions between releases.

D3.4 Drift Detection

Drift detection. The distributions of inputs and outputs change over time. This could indicate a shift in user behavior, attacker probing, or model regression. Track: response length distribution, refusal rate, sentiment, topic distribution. SOTA: Statistical tests (KS, Wasserstein) on output distributions. Anomaly detection on aggregate metrics. Drift dashboards reviewed weekly+. Major drift triggers investigation and possibly rollback. e.g. Standard ML drift tools (Evidently, Arize) · Custom statistical monitoring
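
Illustrative sketch of a two-sample Kolmogorov-Smirnov check on a scalar signal such as response length. The coefficient 1.36 is the usual large-sample critical value for a 5% significance level; applying it to small or heavily tied samples is an approximation, and the function names are hypothetical.

```python
def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past x so ties are handled together.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def drifted(baseline, current, coeff=1.36):
    """Compare D to the large-sample critical value coeff * sqrt((n + m) / (n * m))."""
    n, m = len(baseline), len(current)
    return ks_statistic(baseline, current) > coeff * ((n + m) / (n * m)) ** 0.5
```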

D3.5 Anomaly Detection

Anomaly detection. Rare events: sudden traffic spikes (DDoS), unusual content patterns (attack pattern), single-user behavioral anomalies. Real-time alerting. SOTA: Multi-layer: rate limit anomalies, content-classifier anomalies, behavioral pattern anomalies. ML-based anomaly detectors on aggregate metrics. Specific to LLM: prompt-injection detection, jailbreak attempt detection, runaway loop detection. e.g. Cloudflare bot management · Custom ML anomaly detectors · Lakera Guard runtime

D3.5.1 Volume anomalies

Sudden traffic spikes or drops. Industry standard: Alarm-triggered. Could indicate abuse or service issue.
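
Illustrative sketch of a sigma-based volume alarm over a trailing window of request counts. The window contents and the 3-sigma default are assumed values; production systems would also account for seasonality, which this sketch ignores.

```python
from statistics import mean, stdev

def volume_alert(history, current, sigmas=3.0):
    """Flag a count that sits more than `sigmas` standard deviations from the trailing mean."""
    mu, sd = mean(history), stdev(history)
    if sd == 0:
        return current != mu
    # Using the absolute value catches both spikes and drops.
    return abs(current - mu) / sd > sigmas
```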

D3.5.2 Content anomalies

Unusual content patterns (jailbreak attempts, abuse). Industry standard: Detection feeds B4 red-team and E2 security.

Deployment

12 sub-endpoints mapped

MZN Provisional Position · Partial

Phase 1 application/platform deployment experience across Mazzaneh modules

Phase 1 included live deployment, A/B testing, and rollback procedures across 22 commerce modules. Canary methodology documented. Frontier-scale model versioning, rollout staging, and adjudication are Phase 3 scope.

Phase context: D4 references Phase 1 application deployment experience. It should not be read as frontier-scale LLM deployment validation.

Definition

Deployment: releasing model versions to production. Rollout strategy, A/B testing, rollback procedures, version management, pre-deployment gating. Distinct from D1 serving (which is the runtime). D4 is the release process.

State of the Art (2025–2026)

Frontier labs use canary deployments (1% → 10% → 100% over hours/days). A/B test new vs current via held-out user cohorts. Automatic rollback on quality regression triggers. Pre-deployment gates: safety eval, capability eval, internal review.

Key Decisions

Rollout cadence
A/B test cohort size
Rollback triggers
Pre-release gates

Numbers & Ablations

Canary deployment cadence: 1% → 10% → 50% → 100% over 24-72 hours typical.
A/B test cohort: 5-50% holdout. Statistical power for subjective quality requires 1-2 weeks at frontier traffic levels.
Rollback time-to-recover target: <5 minutes for automated, <30 minutes for human-judged.
Model versioning: frontier labs maintain 6-12 month deprecation horizon. Specific snapshots (frontier model-3-5-sonnet-20240620) remain available indefinitely or until major reorganization.
Pre-deployment gating: leading frontier labs run full eval suite (capability + safety + Responsible Scaling Policy framework/PF tier check) before any production rollout. Process duration: days to weeks for major releases.

Open Questions

Quality regression detection in A/B: subjective quality is high-variance. Power analysis often insufficient.
Model spec drift: as specs evolve, deployed model's spec adherence drifts. When do you re-train vs fine-tune vs just update?
Multi-version cohabitation: does running 3+ generations of model in production degrade signal in monitoring?
Forced upgrades: when API customers depend on specific behavior, version deprecation breaks them. Industry has no clean answer.

Reference analyst note. Deployment discipline is genuinely better than 5 years ago — frontier labs run staged rollouts, have rollback procedures, conduct A/B tests. But quality regression detection remains the soft underbelly. A model that's 5% worse on subjective metrics will pass safety / capability / SLO gates and ship. We're learning about quality regressions from arena ranking changes weeks after deployment. Better quality regression infrastructure is high-leverage but underinvested.

Examples

A leading frontier lab gradual rollouts · a constitutional-methods frontier lab canary deployment · Standard SRE release practices

Sub-endpoint anatomy — 12 items mapped

D4.1 Rollout Strategy

Rollout strategy. Canary (small initial cohort) → ramped (gradual % increase) → full. Geographic phased rollout (some regions first). Key metrics are watched at each step of the ramp. SOTA: Standard SRE practice. LLM-specific concerns: subjective quality regressions hard to detect quickly. Conservative ramp-up (24-72 hours) for major releases. e.g. a constitutional-methods frontier lab a leading frontier model release pattern · a leading frontier lab a consumer LLM chat product updates · Standard canary deployment

D4.1.1 Canary deployment

Small percentage of traffic to new version, monitored. Industry standard: Standard. 1% → 5% → 25% → 100% with monitoring at each step.
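
Illustrative sketch of a deterministic canary split: hashing the user id into a stable bucket means a user who is in the canary at one stage stays in it at every later stage. The STAGES list mirrors the percentages above; the bucket function and id format are hypothetical.

```python
import hashlib

STAGES = [1, 5, 25, 100]  # percent of traffic on the new version at each step

def bucket(user_id):
    """Stable bucket in [0, 100) derived from the user id."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100

def serves_canary(user_id, stage_percent):
    """True if this user is routed to the new version at the given stage."""
    return bucket(user_id) < stage_percent
```

Because the cohorts are nested, rolling back a stage only moves users who joined most recently, which keeps per-user behavior consistent during the ramp.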

D4.1.2 Blue-green

Two parallel environments; switch traffic atomically. Industry standard: Used for major version transitions. Higher infra cost but instant rollback.

D4.1.3 Shadow deployment

New version receives copy of traffic but responses not returned to users. Industry standard: Used to check performance and output behavior on real traffic before exposure. No user-facing risk.

D4.2 A/B Testing

A/B testing. New model vs current: held-out user cohort sees new model, rest sees current. Compare metrics (engagement, satisfaction, quality eval). Statistical power requirements. SOTA: Cohort 5-50% of production. Duration 1-2 weeks for power. Multiple primary metrics (avoid metric-shopping). LLM-specific: high variance per-query makes A/B harder than typical software. e.g. A leading frontier lab A/B testing model versions · Standard product A/B platforms
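
Illustrative power calculation behind the duration requirement, using the standard normal-approximation formula for a two-sided two-proportion z-test. The metric (a binary satisfaction rate), the rates, alpha, and power are assumed values.

```python
import math
from statistics import NormalDist

def samples_per_arm(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate per-arm sample size for a two-sided two-proportion z-test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return math.ceil((z_a + z_b) ** 2 * variance / (p_control - p_treatment) ** 2)
```

Detecting a one-point change in an 80% satisfaction rate needs about 25,000 rated samples per arm under these assumptions, and halving the effect size roughly quadruples that, which is why sparse explicit feedback stretches tests to weeks.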

D4.3 Rollback

Rollback. Predefined triggers: quality regression > X%, safety incident, latency p99 spike, error rate spike. Automated rollback within minutes possible. Manual rollback for subjective quality. SOTA: Frontier labs have well-rehearsed rollback procedures. Time-to-rollback target: <5 minutes for automated, <30 for human-judged. Pre-staged previous version remains available throughout new rollout. e.g. Industry standard with LLM-specific triggers

D4.3.1 Automated rollback triggers

Metric thresholds that trigger automatic reversion. Industry standard: Latency, error rate, quality regression thresholds.
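
Illustrative sketch of threshold-based rollback triggers comparing canary metrics against the current version. The 0.1% error-rate limit echoes the SLO cited for monitoring; the p99 ratio and quality-drop limits, along with the metric field names, are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Thresholds:
    max_error_rate: float = 0.001    # 0.1% absolute
    max_p99_ratio: float = 1.5       # hypothetical: canary p99 vs baseline p99
    max_quality_drop: float = 0.02   # hypothetical: absolute drop in judge score

def rollback_reasons(baseline, canary, t=Thresholds()):
    """Return the list of triggers that fired; an empty list means keep ramping."""
    reasons = []
    if canary["error_rate"] > t.max_error_rate:
        reasons.append("error_rate")
    if canary["p99_ms"] > baseline["p99_ms"] * t.max_p99_ratio:
        reasons.append("p99_latency")
    if baseline["quality"] - canary["quality"] > t.max_quality_drop:
        reasons.append("quality")
    return reasons
```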

D4.3.2 Manual rollback procedure

Operator-initiated reversion. Industry standard: Documented runbook, on-call rotation.

D4.4 Versioning

Versioning. Model versions tracked: v1.0, v1.1 (minor improvements), v2.0 (major). API exposes specific versions. Customers can pin or auto-upgrade. Deprecation policy (typically 6-12 months). SOTA: A leading frontier lab, a constitutional-methods frontier lab version explicitly (frontier model-3-5-sonnet-20240620 style). API parameter selects version. Deprecation announced 6-12 months ahead. Some labs provide model snapshots indefinitely. e.g. a constitutional-methods frontier lab a leading frontier model versioning · a leading frontier lab model versioning · Specific snapshot pinning

D4.4.1 API model identifiers

Stable IDs for model versions. Industry standard: A leading frontier lab 'current-generation frontier model-0613' style. A constitutional-methods frontier lab 'frontier model-3-5-sonnet-20241022' style. Customers pin specific versions for stability.

D4.4.2 Deprecation policy

Lifecycle for old model versions. Industry standard: Frontier labs publish deprecation timelines (typically 6-12 months).

D4.5 Pre-deployment Gating

Pre-deployment gating. Gates that must pass before release: capability evals (no regression on key benchmarks), safety evals (refusal calibration, harm tests), internal review board approval, Responsible Scaling Policy framework/PF threshold check, security review. SOTA: AI Safety Level (constitutional-methods framework)-X gating: capability eval determines tier, deployment standards must be met. A leading frontier lab Preparedness scorecard similarly. Multi-stakeholder approval for major releases. e.g. a Responsible Scaling Policy framework gating · a leading frontier lab Preparedness review · a Frontier-Safety-style framework gating

Data Governance

17 sub-endpoints mapped

MZN Provisional Position · Partial

Consent-first data governance baked into platform design

Every signal capture in the production platform required explicit, granular consent — not a retrofit but a foundational design constraint. Data lineage and retention policy implemented operationally. Cryptographic anchoring methodology documented at the protocol level under separate filing (12 patent-grade candidates, March 2026).

Definition

Data governance: lifecycle controls over data assets. Lineage (where data came from), access control (who can read what), retention (how long), deletion (data subject rights), provenance (cryptographic proof of source), customer data boundaries (no train on enterprise data).

State of the Art (2025–2026)

Frontier labs: hearing-grade data governance for compliance. Customer data: zero-data-retention default for enterprise APIs. Lineage tracked end-to-end (source → corpus → model). Audit logs immutable.

Key Decisions

Default retention
Train-on-data policy
Lineage granularity
Provenance scheme

Numbers & Ablations

Frontier customer data retention: 0 days (zero-data-retention enterprise tier) to 30 days (consumer with opt-out) standard.
EU AI Act Article 53: GPAI providers must publish 'sufficiently detailed summary' of training content. Compliance approach varies; what counts as 'sufficient' undefined.
Data lineage tracking: frontier labs maintain end-to-end lineage from source URLs through transformations. Implementation custom; no industry standard.
C2PA adoption: deployed at Adobe, a leading frontier lab, a synthetic-data-focused lab, Sony, Nikon, BBC — but enforcement at platforms (social media, search) absent.
Right-to-deletion compliance (GDPR Article 17): typical SLA 30 days, technical complexity high for training-data deletion (requires retraining or unlearning).

Open Questions

Machine unlearning: how do you actually delete data from a trained model? Active research; no production-ready solution. Unlearning literature reports inconsistent outcomes.
Training data summary specificity: EU AI Act Article 53 'sufficiently detailed' is undefined. Frontier labs publishing high-level summaries; regulators may demand more.
Provenance enforcement: if no platform requires C2PA, does it matter that creators add it? Coordination problem.
Cross-border data flows: EU adequacy decisions, US executive orders, China data localization create geopolitically fragmented governance regime.

Reference analyst note. Data governance is the frontier compliance bottleneck. The naive view ('we don't train on customer data') is insufficient — EU AI Act, copyright lawsuits (NYT v. A leading frontier lab), and emerging unlearning requirements force much deeper governance. Frontier labs that don't have hearing-grade data lineage today will spend 2025-2026 building it. The model card / system card transparency standard set by a constitutional-methods frontier lab is becoming default expectation.

Examples

a constitutional-methods frontier lab enterprise zero-data-retention · a leading frontier lab Enterprise no-train default · a hyperscaler platform Bedrock isolation

Sub-endpoint anatomy — 17 items mapped

E1.1 Data Lineage

Data lineage. Track data from source through transformations to use in training. Required for compliance (EU AI Act traceability), reproducibility, debugging quality issues. Tools: dataset versioning (DVC), metadata catalogs. SOTA: Frontier labs use custom metadata systems tying every dataset version to its sources, transformations, and consumers. EU AI Act Article 53 requires GPAI to publish detailed summary of training content. Standards: OpenLineage emerging. e.g. DVC for datasets · OpenLineage standard · Custom catalogs at frontier

E1.1.1 Source-to-model lineage

Which sources contributed to which model. Industry standard: Internal at frontier labs. Granularity varies — shard-level common, token-level rare.

E1.1.2 Transformation tracking

Filters, dedup, weighting applied per shard. Industry standard: Pipeline metadata stored alongside data.

E1.1.3 Cryptographic anchoring

Tamper-evident lineage. Industry standard: Not yet standard. Proposed for high-stakes compliance.
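
Illustrative sketch of tamper-evident lineage as a generic SHA-256 hash chain, where each entry commits to the previous entry's hash. This is a textbook construction shown for orientation only; it does not describe the MZN methodology referenced above, and the record fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64

def _digest(prev_hash, record):
    """Hash of the previous entry's hash plus a canonical JSON form of the record."""
    body = json.dumps(record, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode()).hexdigest()

def append_record(chain, record):
    """Append a lineage record that commits to everything before it."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"record": record, "prev": prev, "hash": _digest(prev, record)})
    return chain

def verify(chain):
    """Recompute every link; any edit to an earlier record breaks the chain."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or _digest(prev, entry["record"]) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True
```

Publishing only the latest hash to an external, append-only location would let a third party check the full history without seeing the records themselves.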

E1.2 Access Control

Access control. Who can access what data? Role-based access. Need-to-know principle. Separate environments (training, eval, production). Audit logging of access. SOTA: Frontier labs: principle-of-least-privilege, multi-party authorization for sensitive data, just-in-time access provisioning. UEBA (User and Entity Behavior Analytics) on access patterns. Audit logging immutable. e.g. Standard enterprise IAM (Okta, a hyperscaler platform IAM) · Custom internal access tiers · Frontier multi-party auth

E1.2.1 Role-based access (RBAC)

Permissions tied to roles. Industry standard: Standard. Applied to corpus, eval data, customer data separately.

E1.2.2 Audit logs

Records of who accessed what when. Industry standard: Required for compliance. SIEM integration common.

E1.3 Retention & Deletion

Retention and deletion. How long is data kept? When deleted? Right-to-deletion (GDPR Article 17). Customer data: enterprise zero-retention default. Training data: indefinite retention but with opt-out paths. SOTA: leading frontier labs offer enterprise zero-retention. Consumer products typically 30-day retention with opt-out for training. GDPR-compliant deletion processes for EU users. Machine unlearning research-stage. e.g. a constitutional-methods frontier lab ZDR for enterprise · a leading frontier lab 30-day default + opt-out · GDPR Article 17 implementations

E1.3.1 Retention policy

Per-data-class retention duration. Industry standard: Tiered: pre-training corpus longer, user-data shorter (often 30 days for API).
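
A tiered policy of this kind can be sketched as a lookup table plus an expiry check. Only the 30-day API figure comes from the text above; the class names and the other tiers are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention tiers. None means "no fixed expiry".
RETENTION = {
    "api_user_data": timedelta(days=30),
    "zero_retention": timedelta(0),
    "pretraining_corpus": None,
}


def is_expired(data_class: str, created_at: datetime, now: datetime) -> bool:
    """True once a record has reached the retention limit for its class."""
    ttl = RETENTION[data_class]
    if ttl is None:
        return False
    return now - created_at >= ttl
```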

E1.3.2 Right to erasure (GDPR/CCPA)

User-requested deletion. Industry standard: Required by law. Distinct challenge for data already used in training (cannot easily 'unlearn').

E1.3.3 Machine unlearning

Removing data influence from already-trained models. Industry standard: Active research. No standard at scale yet.

E1.4 Provenance & Watermarking

Provenance and watermarking. Cryptographic proof of content origin. C2PA standard for AI-generated content. a generative-content watermarking system for text/image watermarking. Critical for misinformation defense and training-corpus contamination prevention. SOTA: C2PA adopted by Adobe, a leading frontier lab, a synthetic-data-focused lab, Sony, Nikon, BBC. a generative-content watermarking system-Text (a multimodal frontier lab 2024) production-deployed. Open: MarkLLM library. Trade-off: watermarks removable via paraphrasing. e.g. C2PA standard · a multimodal frontier lab a generative-content watermarking system · a leading frontier lab image watermarking

E1.4.1 Output watermarking

Statistical signal embedded in generated text. Industry standard: Kirchenbauer 2023 reference. a generative-content watermarking system (a multimodal frontier lab). Limited adoption.
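
Detection in the Kirchenbauer 2023 scheme is a one-proportion z-test on the number of "green" tokens: with green-list fraction γ and T scored tokens, z = (green − γT) / sqrt(Tγ(1 − γ)). The sketch below implements that test. The hash-based green assignment is a simplified stand-in (an assumption here), not the paper's vocabulary partition and not any deployed system.

```python
import hashlib
import math


def is_green(prev_token: str, token: str, gamma: float = 0.5) -> bool:
    """Pseudo-randomly place `token` on the green list, seeded by `prev_token`."""
    digest = hashlib.sha256(f"{prev_token}\x00{token}".encode("utf-8")).digest()
    # Map the first 8 bytes to [0, 1); the token is green if below gamma.
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma


def watermark_z_score(tokens: list, gamma: float = 0.5) -> float:
    """z-score of the green-token count against the no-watermark expectation."""
    t = len(tokens) - 1  # the first token has no predecessor to seed from
    if t < 1:
        raise ValueError("need at least two tokens")
    green = sum(is_green(a, b, gamma) for a, b in zip(tokens, tokens[1:]))
    return (green - gamma * t) / math.sqrt(t * gamma * (1 - gamma))
```

Unwatermarked text should score near zero, while text whose tokens were steered toward the green list scores several standard deviations above it. Paraphrasing re-samples tokens and pulls the score back toward zero, which is the removability trade-off noted under E1.4.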

E1.4.2 C2PA / content credentials

Cryptographic content provenance standard. Industry standard: Adoption growing for image/video. Less applicable to text.

E1.5 Customer Data Boundaries

Customer data boundaries. Enterprise customer data must NOT be used for training. Geographic data residency. Sectoral isolation (HIPAA, FERPA, financial). Cross-customer isolation in multi-tenant. SOTA: Frontier API providers offer no-train default for enterprise. Geographic data residency available. Confidential compute (TEE) emerging. HIPAA BAA, FedRAMP for sectoral. e.g. a constitutional-methods frontier lab a leading frontier model enterprise · a leading frontier lab Enterprise · a hyperscaler platform Bedrock isolation

E1.5.1 Training opt-out

Customer data not used for training. Industry standard: Default for enterprise/API at a leading frontier lab, a constitutional-methods frontier lab. Consumer products often train on user data by default, with an opt-out.

E1.5.2 Zero data retention (ZDR)

Customer data not stored at all. Industry standard: Available for enterprise tier at major labs.

Security

27 sub-endpoints mapped

MZN Provisional Position · Strong Evidence

Multi-tier security architecture documented; adversarial validation pending across several protocol families

A multi-tier security portfolio spans defensive, offensive-research, and methodology categories. Architecture-level innovations addressing LLM-specific security concerns are documented. Detailed protocol disclosures, specific findings, and complete inventory are reserved for the proprietary asset portfolio.

Definition

Security: end-to-end security posture. Categories: prompt injection defense, data exfiltration prevention, model theft protection, training-data poisoning defense, supply chain security, jailbreak resistance, agentic security, security monitoring.

State of the Art (2025–2026)

a constitutional-methods frontier lab third AI Safety Level security: protect weights against non-state-actor theft. Multi-layer defenses across categories. NIST AI RMF, ISO/IEC 42001 for governance frameworks. EU AI Act security requirements for high-risk systems.

Key Decisions

AI Safety Level/security tier targeted
TEE adoption
Supply chain controls
Pen testing cadence

Numbers & Ablations

a constitutional-methods frontier lab third AI Safety Level security commitment: defend against non-state-actor weight theft. Implementation includes HSM, TEE, multi-party access auth, audit logging.
Model weight value: frontier weights $100M-$1B+ replacement cost (compute alone). Theft prevention is high-priority.
Prompt injection success rate: ~30-50% on agent applications without specific defense, ~5-15% with instruction hierarchy training (Wallace 2024).
Jailbreak persistence: an optimization-based adversarial attack-class attacks succeed ~20-40% on frontier 2024 models — down from ~80% on early generations but unsolved.
Dependencies: frontier model SBOM lists 1000+ packages. Supply chain attack surface is real (e.g., an open-model hub package supply chain attacks 2023-2024).

Open Questions

Indirect prompt injection: solvable in current architecture or requires fundamental redesign? Pessimistic camp ascendant.
Confidential compute (a leading accelerator vendor CC, a hyperscaler platform Nitro) for inference: production-ready or theatre? Performance overhead poorly characterized publicly.
Adversarial robustness vs security: how much overlap, how much divergence? Often confused; should be distinguished.
Weight extraction via API: distillation attacks demonstrated at small scale. Production-scale defense unclear.

Reference analyst note. Security for LLMs is in a state similar to web security circa 2008 — patterns visible but practices immature. The frontier 2026 security stance: assume weights will eventually leak (insider, breach, gradual extraction); design for graceful degradation. The a constitutional-methods frontier lab third AI Safety Level framing (resist non-state actor) is appropriately calibrated; fourth AI Safety Level (resist state actor) is the next frontier and unsolved. Agent security is the unsolved problem of the next 2 years; current 'defenses' are mostly hopeful patterns, not robust controls.

Examples

a constitutional-methods frontier lab third AI Safety Level commitments (public) · a leading frontier lab security posture · NIST AI RMF as framework

Sub-endpoint anatomy — 27 items mapped

E2.1 Prompt Injection

Prompt injection. Malicious instructions hidden in untrusted content (web pages, retrieved docs, tool outputs) treated as authoritative by model. Critical for agent applications — fundamentally unsolved. SOTA: Defense layers: instruction hierarchy training (Wallace 2024), output filtering, tool sandboxing, user confirmation gates. No clean technical solution. Active research. Indirect injection most dangerous (hidden in retrieved docs). e.g. Wallace et al. Instruction Hierarchy (2024) · Greshake et al. Indirect Prompt Injection (2023)

E2.1.1 Direct prompt injection

User directly types injection. Industry standard: First-generation jailbreak attack. Mitigated through alignment training and system prompt hardening.

E2.1.2 Indirect prompt injection

Injection via fetched content (web pages, documents, emails). Industry standard: Greshake 2023. Major risk for agentic systems. Mitigation via instruction hierarchy, content/instruction separation.

E2.1.3 Multi-modal injection

Injection via image, audio, or other non-text inputs. Industry standard: Visual prompt injection demonstrated. Active research mitigation.

E2.1.4 Defense — instruction hierarchy

a leading frontier lab instruction hierarchy: system > developer > user > tool output. Industry standard: Wallace 2024. Designed to make tool/document content lower priority than user instructions.

E2.2 Data Exfiltration

Data exfiltration. Model leaks sensitive data via outputs. Vectors: (1) leak training data verbatim, (2) inadvertently disclose prompt context to user-after-injection, (3) tool exfiltration (model calls tool sending data to attacker). SOTA: Training-data extraction known issue (Carlini 2021+). Frontier mitigations: deduplication, RLHF on memorization. Tool exfiltration via prompt injection: defense via tool sandboxing, output filtering, approval gates. e.g. Carlini et al. Extracting Training Data (2021, 2023) · a consumer LLM chat product prompt extraction demos

E2.2.1 Training data extraction

Extracting memorized training data from model. Industry standard: Carlini 2021 demonstrated. Nasr 2023 showed scale. Mitigation: dedup (A1.3), differential privacy.

E2.2.2 Cross-tenant leakage (KV cache)

Shared KV cache leaks across tenants. Industry standard: Mitigated by per-tenant cache isolation. Recent research showed timing attacks possible on shared prefix cache.
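
Per-tenant cache isolation can be sketched as a cache key that always includes the tenant identity, so identical prompt prefixes from different tenants never share an entry. The function name and key layout are hypothetical.

```python
import hashlib


def prefix_cache_key(tenant_id: str, token_ids: list) -> str:
    """Derive a prefix-cache key scoped to one tenant."""
    h = hashlib.sha256()
    tenant = tenant_id.encode("utf-8")
    # Length-prefix the tenant id so it cannot run into the token bytes.
    h.update(len(tenant).to_bytes(4, "big"))
    h.update(tenant)
    for t in token_ids:
        h.update(int(t).to_bytes(4, "big"))
    return h.hexdigest()
```

Scoping keys this way removes cross-tenant cache hits, and with them the shared-prefix timing signal, at the cost of a lower overall hit rate.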

E2.2.3 Agentic exfiltration

Agent tricked into sending data to attacker endpoint. Industry standard: Major risk for tool-using agents. Mitigation: capability gating, egress allow-lists, human approval for sensitive actions.
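
An egress allow-list for tool calls can be sketched as an exact-match host check performed before any outbound request. The host names are placeholders, and a real deployment would enforce the same rule at the network layer as well.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list of hosts the agent may contact.
ALLOWED_HOSTS = {"api.internal.example", "docs.example.com"}


def egress_allowed(url: str) -> bool:
    """Allow only HTTPS requests to hosts that are exactly on the allow-list."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    host = (parts.hostname or "").lower()
    return host in ALLOWED_HOSTS
```

Matching the parsed hostname exactly, rather than searching the URL string, rejects look-alikes such as `docs.example.com.evil.test` and userinfo tricks such as `docs.example.com@evil.test`.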

E2.3 Model Theft

Model theft. Weights are nation-state-level targets — frontier weights worth billions. Vectors: insider threat, infrastructure breach, weight extraction via API (model inversion / distillation attacks). SOTA: third AI Safety Level security commitment: defend against non-state actor theft. A constitutional-methods frontier lab public security commitments. HSM-protected keys, TEE-protected weights, multi-party authorization. Distillation attack defense via output rate limiting + watermarking. e.g. a constitutional-methods frontier lab third AI Safety Level measures (public) · a hyperscaler platform Nitro Enclaves · a leading accelerator vendor Confidential Computing

E2.3.1 Weight extraction (insider)

Insider exfiltration of weights. Industry standard: Major frontier-lab concern. Mitigation: hardware enclaves, multi-party access controls, monitored egress.

E2.3.2 Model distillation attack

Training competing model on outputs of target API. Industry standard: Universal but hard to prevent. Terms of service prohibition. Watermarking proposed but not deployed at scale.

E2.4 Training-Data Poisoning

Training-data poisoning. Attacker injects malicious content into web crawl, hoping model learns backdoors or biases. Hard to defend at scale: corpus is too big to manually verify. SOTA: Active research; few proven defenses. Quality classifiers filter obvious low-quality. Provenance tracking helps. Major label-flipping attacks demonstrated; full corpus poisoning not yet shown viable but theoretical risk. e.g. Carlini et al. poisoning research · Active research at frontier labs

E2.4.1 Web-scale poisoning

Attacker plants content that will be crawled. Industry standard: Carlini 2024 showed feasibility of poisoning web crawl. Mitigation: provenance tracking (E1.1), filter resilience.

E2.4.2 Backdoor attacks

Triggered behavior implanted via poisoned data. Industry standard: Hubinger 2024 showed backdoors can survive safety training.

E2.5 Supply Chain

Supply chain security. Dependencies, container images, training frameworks. SBOM (software bill of materials), Sigstore for signed artifacts. Major vulnerability vector if compromised. SOTA: EO 14028 mandated SBOM for federal. Sigstore for signing. Frontier labs maintain dependency monitoring. NIST SSDF compliance. e.g. Sigstore project · SBOM mandates · Snyk, Dependabot

E2.5.1 Dependency security

Open-source dependencies in training/serving stack. Industry standard: SBOM generation, dependency scanning. Standard software security practice applied to ML stack.

E2.5.2 Pre-trained model supply chain

Risks of using third-party base models. Industry standard: Concern for fine-tuned products. Hash verification of weights, documented training process.
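
Hash verification of a downloaded weights file can be sketched as a streamed SHA-256 compared against a digest published by the model's provider. The function names are illustrative; the expected digest must come from a channel separate from the download itself.

```python
import hashlib
import hmac


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large checkpoints need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_weights(path: str, expected_hex: str) -> bool:
    """Compare the computed digest with the published one."""
    return hmac.compare_digest(sha256_file(path), expected_hex.lower())
```

A matching hash shows only that the file is the one the publisher hashed. It says nothing about how the model was trained, which is why the entry above pairs it with a documented training process.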

E2.6 Jailbreak Resistance

Jailbreak resistance. Model adheres to safety guidelines under adversarial input. Methods: training-time RLHF on jailbreak attempts, output-side classifiers (an open-weights output classifier, Constitutional Classifiers), instruction hierarchy. SOTA: Constitutional Classifiers (a constitutional-methods frontier lab 2025) production-deployed for jailbreak defense. An open-weights output classifier family standard open guardrails. No model fully jailbreak-proof; arms race continues. e.g. Constitutional Classifiers (a constitutional-methods frontier lab 2025) · a recent-generation output classifier · a leading frontier lab Moderations + instruction hierarchy

E2.7 Agentic Security

Agentic security. New category: model with tool access, code execution, computer control. Risks: prompt injection escalating to action, tool misuse, autonomous escalation. Critical for computer-use models. SOTA: a constitutional-methods frontier lab Computer Use (a long-context frontier model+): sandboxed VM isolation, output filtering on actions, user confirmation gates. A leading frontier lab Operator similar. Active research category — agentic capabilities outrun agentic safety understanding. e.g. a constitutional-methods frontier lab Computer Use sandbox · a leading frontier lab Operator · Active research at AISIs

E2.7.1 Capability gating

Restricting which actions agent can take. Industry standard: Capability tokens, allow-lists, scoped permissions. Required for production agents.

E2.7.2 Human-in-the-loop

Requiring user approval for high-stakes actions. Industry standard: Standard pattern. Sensitive actions (payments, deletions, external sends) require explicit approval.
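
Capability gating and human approval compose naturally: the allow-list decides whether an action is possible at all, and the approval step decides whether a sensitive action proceeds this time. A minimal sketch, with hypothetical action names and an `approve` callback standing in for the user prompt:

```python
from typing import Callable

# Hypothetical action sets for one agent deployment.
ALLOWED_ACTIONS = {"read_file", "search", "send_email", "delete_file"}
NEEDS_APPROVAL = {"send_email", "delete_file", "make_payment"}


def authorize(action: str, approve: Callable[[str], bool]) -> bool:
    """Gate first, then ask a human for sensitive actions."""
    if action not in ALLOWED_ACTIONS:
        return False  # capability gate: the agent was never granted this action
    if action in NEEDS_APPROVAL:
        return approve(action)  # human-in-the-loop confirmation
    return True
```

Checking the allow-list first means an action outside the agent's grant is refused even if a user would have approved it.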

E2.7.3 Sandboxing

Isolated execution of agent actions. Industry standard: Container/VM sandboxing for code execution. Network egress restrictions.

E2.8 Security Monitoring & Response

Security monitoring and incident response. SIEM integration, anomaly detection, incident playbooks. Frontier labs: 24/7 SOC, regular tabletop exercises. SOTA: Standard enterprise security ops + LLM-specific anomaly detection. SIEMs: Splunk, Elastic. Incident response playbooks for AI-specific incidents (model regression, capability surprise, security incident). e.g. Standard enterprise SIEM · Frontier-lab SOCs

E2.8.1 Abuse detection

Detecting malicious usage patterns. Industry standard: Behavioral analysis on traffic. Rate limiting, account-level flags.
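
Account-level rate limiting is often implemented as a sliding window over recent request times. The sketch below is one such limiter, with the caller supplying the clock; the class name and parameters are illustrative.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per account within `window_seconds`."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window = window_seconds
        self._hits = defaultdict(deque)  # account -> timestamps of allowed requests

    def allow(self, account: str, now: float) -> bool:
        hits = self._hits[account]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```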

E2.8.2 Incident response

Procedures when breach detected. Industry standard: On-call rotation, runbooks, customer notification protocols, regulator coordination.

E2.8.3 Vulnerability disclosure

Bug bounty and responsible disclosure programs. Industry standard: leading frontier labs, a multimodal frontier lab all run bounty programs. Coordinated disclosure norms emerging.

Privacy

15 sub-endpoints mapped

MZN Provisional Position · Partial

Consent-first privacy posture; Phase 3 privacy/compliance review required

PII handling is structurally consent-first rather than retrofitted filtering. Object-first discipline, reuse separation, and export maturity are documented. Differential privacy and machine unlearning execution at frontier scale are Phase 3 scope.

Definition

Privacy: protection of personal information. Categories: PII handling, differential privacy, membership inference defense, regulatory compliance, inference-time privacy.

State of the Art (2025–2026)

Frontier labs: comprehensive PII handling, GDPR/CCPA compliance, optional zero-data-retention. Differential privacy still rare at scale (DP-SGD too expensive for frontier training). Membership inference defenses via training-data deduplication.

Key Decisions

DP yes/no
PII redaction strategy
Inference-time privacy guarantees
Regulatory commitments

Numbers & Ablations

GDPR enforcement intensity: cumulative fines >€4B since 2018; AI-specific cases growing.
EU AI Act timeline: entered into force Aug 2024, prohibited practices Feb 2025, GPAI Aug 2025, high-risk Aug 2026, all provisions Aug 2027.
Differential privacy at scale: not deployed at frontier training. a confidential-computing platform uses DP for inference-time analytics (limited scope).
Membership inference attack success: ~55-65% on frontier models (small advantage over 50% random) per Carlini and others. Heavy deduplication helps.
Privacy compliance certifications: SOC 2 Type II, ISO 27001 baseline. ISO/IEC 42001 (AI management) emerging. FedRAMP for federal.

Open Questions

DP-SGD at frontier scale: too expensive (~5× compute overhead) currently. Does algorithmic improvement make it tractable by 2027?
Membership inference defense: effective dedup helps, but theoretical worst-case bounds remain. Practical risk assessment unclear.
Cross-border privacy regime: US-EU adequacy fragile, China PIPL strict, India DPDP emerging. Global compliance becomes per-jurisdiction.
Inference-time privacy: TEE-based confidential inference deployed at a confidential-computing frontier lab; broader adoption depends on hardware availability and customer demand.

Reference analyst note. Privacy compliance is becoming a serious cost center. Frontier labs that haven't invested in privacy infrastructure (hearing-grade data governance, deletion processes, sectoral certifications) will face compounding regulatory costs 2025-2027. The technically-interesting frontier is private inference (TEE, private cloud compute, eventually homomorphic) — a confidential-computing frontier lab's deployment shows production viability. Differential privacy at training remains aspirational at frontier scale.

Examples

a constitutional-methods frontier lab enterprise privacy · a confidential-computing frontier lab's Private Cloud Compute (DP + TEE)

Sub-endpoint anatomy — 15 items mapped

E3.1 PII Handling

PII handling. Detect and handle Personally Identifiable Information in inputs and outputs. A synthetic-data-focused lab Presidio (open) is standard PII engine. Custom recognizers for domain-specific (medical, legal). Required for GDPR, HIPAA. SOTA: a synthetic-data-focused lab Presidio + custom recognizers. NER + LLM verification for higher accuracy. Aggressive redaction degrades utility; calibration needed. Trade-off: redact in inputs vs outputs. e.g. a synthetic-data-focused lab Presidio (open) · a hyperscaler platform Comprehend PII · GCP DLP

E3.1.1 PII detection in training data

Identifying PII before training. Industry standard: Regex + NER models. Imperfect; some PII inevitably trains.

E3.1.2 PII filtering / redaction

Removing or masking PII. Industry standard: Standard for sensitive corpora. Trade-off: aggressive filtering loses quality data.
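
The regex layer of such a pipeline can be sketched with two patterns. This is a deliberately small illustration using only the standard library, not the API of any PII engine named above; the patterns are simplified and will miss many real-world formats, which is why NER models are layered on top.

```python
import re

# Simplified patterns for illustration only.
EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    """Replace matched spans with typed placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = US_SSN.sub("[SSN]", text)
    return text
```

Typed placeholders such as `[EMAIL]` keep the sentence structure intact, which limits some of the quality loss that blanket deletion causes.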

E3.1.3 Output-stage PII detection

Detecting PII in model outputs. Industry standard: Last-mile filter. Cross-link to C4.3.

E3.2 Differential Privacy

Differential privacy. Mathematical privacy guarantee: no individual training example significantly affects model output. DP-SGD trains with noise + clipping. Costly; rarely used at frontier training scale. SOTA: DP-SGD demonstrated at smaller scale (~7B params). Frontier (100B+) DP-SGD not yet practical. a confidential-computing platform uses DP for inference-time analytics. Federated DP emerging. e.g. a confidential-computing platform (DP) · a multimodal frontier lab federated DP · Research-scale DP-SGD

E3.2.1 DP-SGD

Differentially private gradient descent. Industry standard: Abadi 2016. Used in some fine-tuning. Pre-training at frontier scale not yet practical with strong DP guarantees.
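
One DP-SGD update, following the structure in Abadi 2016: clip each example's gradient to norm C, sum, add Gaussian noise with standard deviation σC, then average. The sketch uses plain Python lists to stay dependency-free; the parameter names are assumptions, and privacy accounting (tracking ε) is out of scope here.

```python
import math
import random


def dp_sgd_step(weights, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """Return updated weights after one clipped, noised gradient step."""
    dim = len(weights)
    summed = [0.0] * dim
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        # Scale down any gradient whose L2 norm exceeds clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i in range(dim):
            summed[i] += g[i] * scale
    sigma = noise_multiplier * clip_norm
    batch = len(per_example_grads)
    # Noise is added once to the summed gradient, then the result is averaged.
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / batch for s in summed]
    return [w - lr * g for w, g in zip(weights, noisy_mean)]
```

The cost noted above comes largely from the per-example gradients: clipping needs each example's gradient separately, which defeats the usual batched backward pass.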

E3.2.2 Privacy budget (ε)

Quantification of privacy guarantee. Industry standard: Reported alongside DP-trained models. Smaller ε = stronger privacy.

E3.3 Membership Inference Defense

Membership inference defense. Attacker queries model to determine if specific data was in training. Defense: training data deduplication, regularization, careful early stopping. Strong dedup is most effective practical defense. SOTA: Aggressive deduplication (FineWeb-style) significantly reduces memorization. DP-SGD provides mathematical guarantee but expensive. Membership inference attacks on frontier models possible but limited in practice. e.g. FineWeb deduplication pipeline · Carlini et al. membership inference research

E3.4 Regulatory Frameworks

Regulatory frameworks. GDPR (EU), CCPA (California), state laws (Colorado AI Act, etc.), sectoral (HIPAA, FERPA, GLBA), international (UK GDPR, Brazil LGPD, China PIPL). SOTA: Frontier labs maintain compliance programs across all major jurisdictions. GDPR DPIAs for high-risk processing. CCPA opt-out support. State patchwork in US increasingly complex. Global compliance: minimum bar across jurisdictions. e.g. GDPR · EU AI Act (overlaps privacy) · Colorado AI Act (state)

E3.4.1 GDPR (EU)

EU General Data Protection Regulation. Industry standard: Right to erasure, data minimization, lawful basis. Active enforcement actions against AI labs (Italy 2023, etc.).

E3.4.2 CCPA / CPRA (California)

California Consumer Privacy Act. Industry standard: Notice, opt-out, deletion rights. Standard for US-facing products.

E3.4.3 HIPAA / sectoral rules

Health, financial, education-specific privacy regulations. Industry standard: Enterprise tiers offer HIPAA BAAs. Sectoral compliance increasingly common.

E3.5 Inference-Time Privacy

Inference-time privacy. Customer data privacy at inference. No-train-on-data guarantees, encrypted inference (homomorphic encryption — research stage), confidential computing (TEE-protected inference), federated approaches. SOTA: TEE-protected inference (a confidential-computing platform, a constitutional-methods frontier lab confidential compute) production-deployed. Homomorphic encryption: research-stage, ~1000× slower. Federated inference: niche use cases. e.g. a confidential-computing platform · a hyperscaler platform Nitro Enclaves for inference · a leading accelerator vendor Confidential GPU

E3.5.1 Encryption in transit / at rest

TLS, encrypted storage. Industry standard: Universal. TLS 1.3, AES-256 at rest.

E3.5.2 Confidential computing

Hardware-enclave inference (e.g., a confidential-computing platform). Industry standard: Emerging. a confidential-computing platform and a hyperscaler confidential-compute platform lead. Enterprise adoption growing.

Compliance

21 sub-endpoints mapped

MZN Provisional Position · Partial

EUIPO guidance · context · separate patent-grade candidate record · blockchain timestamping

EUIPO provided direct guidance on portfolio filings. A separate cryptographic-protocol candidate is reserved for controlled technical/IP review; public pages do not rely on filing details. Multiple portfolio artifacts have SHA-256 anchoring and blockchain timestamping for priority. Voluntary commitments framework alignment is a Phase 3 scope.

Definition

Compliance: regulatory and framework conformance. EU AI Act, NIST AI RMF, ISO/IEC 42001, sectoral (HIPAA, FedRAMP, SOC 2), voluntary commitments (Frontier Model Forum, AI Safety Summit Seoul/Bletchley/Paris).

State of the Art (2025–2026)

EU AI Act in force (Aug 2024), full effect 2026-2027. GPAI Code of Practice drafted from late 2024, final version published July 2025. Frontier labs: SOC 2 Type II + ISO 27001 + ISO/IEC 42001. FedRAMP Moderate (a constitutional-methods frontier lab 2024). Voluntary commitments via Bletchley/Seoul/Paris summits.

Key Decisions

EU AI Act risk classification
Compliance certifications pursued
Voluntary commitment signatory yes/no

Numbers & Ablations

EU AI Act GPAI Code of Practice (drafting began 2024, final version July 2025): 13 commitments across transparency, copyright, safety. Currently voluntary; becomes default conformity path.
EU AI Act penalties: up to 7% global revenue for prohibited practices, 3% for non-compliance.
FedRAMP Moderate: a constitutional-methods frontier lab certified 2024. Required for US federal sales. Process duration ~12-18 months.
Voluntary commitments signatories (Seoul, May 2024): 16 frontier labs, including leading frontier labs, a multimodal frontier lab, an open-weights frontier lab, a synthetic-data-focused lab.
Incident reporting under EU AI Act Article 73: 15-day window for serious incidents to authorities. NIS2 similar for critical infrastructure.
ISO/IEC 42001 (Dec 2023): first AI-specific management standard. Adoption beginning 2024-2025.
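
The penalty ceilings listed above can be worked through as arithmetic. The sketch assumes the Act's "whichever is higher" rule with fixed amounts of EUR 35M (prohibited practices) and EUR 15M (other non-compliance) alongside the 7% and 3% figures; it ignores the reduced ceilings for SMEs and is not legal advice.

```python
def max_fine_eur(global_turnover_eur: float, prohibited_practice: bool) -> float:
    """Upper bound of the fine: the higher of a turnover share and a fixed amount."""
    if prohibited_practice:
        pct, fixed = 0.07, 35_000_000
    else:
        pct, fixed = 0.03, 15_000_000
    return max(pct * global_turnover_eur, fixed)
```

For a provider with EUR 1B global turnover, the prohibited-practice ceiling is about EUR 70M; for one with EUR 100M turnover, the fixed amounts dominate instead.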

Open Questions

EU AI Act enforcement intensity: untested. Will regulators interpret strictly or lightly?
Cross-jurisdictional compliance: US, EU, UK, China each have differently-shaped frameworks. Global compliance becomes minimum-bar across all jurisdictions.
Voluntary commitments: do they hold under competitive pressure? Untested at any frontier lab.
Standards harmonization: NIST AI RMF, ISO/IEC 42001, EU AI Act, sectoral frameworks overlap and diverge. Industry is creating implicit standards via shared practice.

Reference analyst note. Compliance is becoming a strategic lever. A constitutional-methods frontier lab's investments in FedRAMP, ISO/IEC 42001, EU AI Act readiness give it enterprise customer access a leading frontier lab / a multimodal frontier lab catch up to slowly. The arbitrage is real: $50M+ in compliance investment can unlock $1B+ in regulated-industry revenue. The next 18 months will see frontier labs differentiate not on capability (saturating) but on compliance depth and trust signals.

Examples

a constitutional-methods frontier lab SOC 2 + ISO 27001 + FedRAMP Moderate · a leading frontier lab similar · Code of Practice signatories

Sub-endpoint anatomy — 21 items mapped

E4.1 Regulatory Frameworks

Regulatory frameworks. EU AI Act (most comprehensive AI regulation). GDPR (privacy). State / national AI laws (Colorado, Virginia, China, etc.). Sectoral (healthcare, finance, education). SOTA: EU AI Act risk-tiered: unacceptable (banned), high-risk (strict), limited (transparency), minimal (free). GPAI threshold: 10^25 FLOPs. Penalties up to 7% global revenue. Frontier labs structuring compliance programs. e.g. EU AI Act · Colorado AI Act · Various sectoral

E4.1.1 EU AI Act

Risk-tiered regulation; GPAI rules; systemic-risk obligations. Industry standard: Effective 2024-2027 in stages. GPAI models above the 10^25 FLOP threshold are subject to systemic-risk obligations: model evaluation, adversarial testing, incident reporting, energy-use disclosure.
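
Whether a planned run is likely to cross the 10^25 FLOP threshold can be estimated with the common approximation that dense-transformer training costs about 6 FLOPs per parameter per token. That approximation is an assumption introduced here, not part of the Act, and the parameter and token counts below are hypothetical.

```python
SYSTEMIC_RISK_FLOP_THRESHOLD = 1e25


def estimated_training_flops(n_params: float, n_tokens: float) -> float:
    """Rough training compute: about 6 FLOPs per parameter per training token."""
    return 6.0 * n_params * n_tokens


def exceeds_threshold(n_params: float, n_tokens: float) -> bool:
    """True if the estimate is above the systemic-risk threshold."""
    return estimated_training_flops(n_params, n_tokens) > SYSTEMIC_RISK_FLOP_THRESHOLD
```

Under this estimate a 70B-parameter model trained on 15T tokens lands at roughly 6.3 × 10^24 FLOPs, below the threshold, while a 400B-parameter model on the same data lands at roughly 3.6 × 10^25, above it.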

E4.1.2 US Executive Orders & frameworks

Reporting requirements, dual-use foundation model standards. Industry standard: EO 14110 (2023), reporting thresholds, NIST RMF. Status of any specific order varies.

E4.1.3 UK / a national AI Safety Institute

UK AI Safety Institute pre-deployment evaluation. Industry standard: Voluntary commitments by frontier labs to allow pre-deployment a national AI Safety Institute testing.

E4.1.4 Other jurisdictions

China, Korea, Singapore, Japan AI regulations. Industry standard: Varied. China requires algorithm registration. Other jurisdictions evolving.

E4.2 Industry Standards

Industry standards. NIST AI RMF (US, voluntary), ISO/IEC 42001 (international AI management), ISO 27001 (information security), SOC 2 (audit), FedRAMP (US federal). SOTA: NIST AI RMF (2023) is voluntary US framework. ISO/IEC 42001 (2023) is first AI-specific management standard. SOC 2 Type II + ISO 27001 baseline for enterprise. FedRAMP for federal. e.g. NIST AI RMF · ISO/IEC 42001 · SOC 2 Type II

E4.2.1 NIST AI RMF

NIST AI Risk Management Framework. Industry standard: AI RMF 1.0 published 2023. Generative AI Profile July 2024. Voluntary US framework.

E4.2.2 ISO/IEC 42001

AI management system standard. Industry standard: Published Dec 2023. Certifiable. Adoption beginning at enterprise vendors.

E4.2.3 SOC 2 / ISO 27001

General security/operational controls. Industry standard: Standard for enterprise SaaS. Required for most enterprise contracts.

E4.3 Frontier Lab Voluntary Commitments

Frontier lab voluntary commitments. Bletchley Declaration (2023), Seoul Commitments (2024), Paris AI Action Summit (2025). Frontier Model Forum coordination. Voluntary safety commitments preceding regulation. SOTA: 16 frontier labs signed Seoul Commitments. Frontier Model Forum (a constitutional-methods frontier lab, a multimodal frontier lab, a synthetic-data-focused lab, a leading frontier lab) coordinates safety practices. Voluntary Responsible Scaling Policies / Preparedness Frameworks public. UK / a national AI Safety Institute external evaluations. e.g. Bletchley/Seoul/Paris commitments · Frontier Model Forum · a Responsible Scaling Policy framework, a leading frontier lab PF, a Frontier-Safety-style framework

E4.3.1 a Responsible Scaling Policy framework

Responsible Scaling Policy framework: AI Safety Levels (AI Safety Level). Industry standard: Versioned public document. second AI Safety Level current; third AI Safety Level thresholds defined; fourth AI Safety Level+ described conceptually.

E4.3.2 a Preparedness-style framework

Capability tracking and deployment thresholds. Industry standard: Public framework. Tracks bio, chem, cyber, persuasion, autonomous replication.

E4.3.3 a multimodal frontier lab Frontier-Safety-style framework

Critical capability levels and mitigations. Industry standard: Public framework. Categories: autonomy, biosecurity, cybersecurity, ML R&D.

E4.4 Documentation Artifacts

EU AI Act conformity assessment. High-risk systems and GPAI with systemic risk require conformity assessment, technical documentation, transparency, human oversight, accuracy/robustness/security testing. SOTA: Implementation phase 2024-2027. GPAI Code of Practice (final version July 2025) provides voluntary conformity path. Notified bodies for high-risk certification. Frontier labs preparing assessment processes. e.g. EU AI Act Article 53 (GPAI) · GPAI Code of Practice

E4.4.1 Model card

Standardized model documentation. Industry standard: Mitchell 2019 reference. Universal at frontier labs. Includes capability, safety, limitations, training data summary.

E4.4.2 System card

Documentation of deployed system, not just model. Industry standard: A leading frontier lab publishes for major releases (a current-generation frontier model system card). A constitutional-methods frontier lab similar.

E4.4.3 Datasheet for datasets

Documentation of training data composition. Industry standard: Gebru 2018 reference. Adoption uneven for pre-training corpora.

E4.5 Audit & Incident

Audit and incident reporting. SOC 2 audits (annual). Incident reporting under EU AI Act Article 73 (serious incidents to authorities within 15 days). NIS2 incident reporting (EU critical infrastructure). SOTA: Frontier labs maintain SOC 2 Type II annual. EU AI Act Article 73 requires high-risk system incidents reported within 15 days. NIS2 incident reporting (EU critical infrastructure). e.g. Annual SOC 2 audits · EU AI Act Article 73 reporting · NIS2 reporting (EU critical)

E4.5.1 Third-party audits

External technical and compliance audits. Industry standard: SOC 2, ISO audits standard. AI-specific audits emerging via national AI Safety Institute, an external evaluation organization.

E4.5.2 Incident reporting

Reporting safety incidents to regulators / public. Industry standard: EU AI Act requires for GPAI systemic-risk models. Lab voluntary disclosure varies.

E4.5.3 Customer compliance support

Helping customers meet their compliance obligations. Industry standard: BAAs, DPAs, Trust Center documentation. Required for enterprise sales.

