A phase-safe index for the MZN LLM Framework: a public reference atlas of major capability areas behind a modern LLM company — mapped to 21 slots and 529 sub-endpoints — with MZN’s provisional self-positioning for Phase 3 technical review.
A reference atlas authored by Mohammad Rahimi during the bounded Phase 2 solo formation period. The credentials, context signals, and provenance below give context for what follows.
Beyond its content, the provenance of this document is itself unusual. Industry reference atlases of comparable depth — capability decomposition, sub-endpoint mapping, and frontier-position analysis at 21-slot and 529-endpoint granularity — are typically produced by analyst teams over multi-year engagements at research firms or industry consortiums.
This atlas was authored by Mohammad Rahimi during the bounded Phase 2 solo formation period, without a human team, agency, contractor stack, API stack, or agent workforce. It remains subject to independent technical review.
The atlas that follows is therefore intended as two artifacts in one: a demonstration of the author's depth of LLM-industry knowledge and a documented instance of single-person production capacity claim prepared for review in a category of work that has historically required institutional resources. Should the broader question of whether the One-Person Unicorn pattern — single-individual operation reaching the productive capacity historically associated with venture-backed teams — has arrived in frontier AI be raised, this document is part of the evidence to consider.
Many architectural choices and capability claims in the broader portfolio — across commerce, hardware, security, alignment, and foundational research — are informed by a broad picture of the capability areas behind a serious LLM company. This document is a public reference version of that picture. Named assets, valuations, and detailed inventory are documented separately in the strategic asset portfolio.
The 21-slot framework is a synthesized reference map based on publicly visible frontier-AI practice, academic literature, open-source infrastructure, and observed industry patterns through 2026 — five pre-training slots, three post-training and alignment slots, four evaluation and safety slots, four inference and production slots, and four cross-cutting slots. Each is a major capability area; absence or weakness in any slot should be treated as a diligence question for a serious LLM operation.
For each slot the document gives: definition, current state of the art, key decisions, trade-offs, numbers and ablations from the published literature, open questions, an analyst-level frontier position, examples, academic references, and the full sub-endpoint anatomy (top-level 21 → mid-level → deeper detail), with a 529-item map total. Then, prominently at the head of each slot, MZN's provisional position: Strong, Partial, or Gap, with concise reasoning.
All company-, lab-, and product-specific identifiers in the body of this document have been generalized to categorical references (frontier lab, leading open-weights model, current-generation accelerator, etc.). Academic paper authors, technique names, benchmark names, and dataset names are preserved as standard scientific references. No proprietary MZN content — code, hashes, internal codenames, partner identities, or pipeline detail — appears in this document.
Five groups. Twenty-one slots. 529 sub-endpoints below. Each slot is a major prerequisite at frontier scale — A blocks B blocks C blocks D, while E is cross-cutting.
Of the 20 major capability slots, the portfolio operates 7 at Strong Evidence, 13 at Partial, and 1 at Gap. The criteria below are explicit; each slot earns one status based on observable evidence rather than a 1-to-10 score.
| Slot | Title | Level | Position summary |
|---|---|---|---|
| A1 | Data | STRONG EVIDENCE | Phase 1 consent-first product data evidence; LLM-readiness pending review |
| A2 | Tokenizer | STRONG EVIDENCE | Multilingual tokenizer expertise; under-served-script focus |
| A3 | Architecture | PARTIAL | Patent-grade candidate architectural innovations; implementation validation pending |
| A4 | Training | PARTIAL | Training methodology documented; frontier-scale execution pending |
| A5 | Compute | GAP | No cluster under solo operation |
| B1 | SFT | PARTIAL | Demonstration-data shaping methodology documented |
| B2 | Preference Optimization | PARTIAL | Output-conformance methodology informs preference design |
| B3 | Constitutional Methods | PARTIAL | Principle-based alignment substrate (theoretical layer) |
| C1 | Capability Evaluation | PARTIAL | Phase 1 product telemetry and user-behavior evaluation context |
| C2 | Safety Evaluation | STRONG EVIDENCE | Documented safety architecture; red-team validation pending |
| C3 | Robustness | PARTIAL | Security-driven robustness research |
| C4 | Output Safety | STRONG EVIDENCE | Output-conformance safety templates and egress controls |
| D1 | Serving | PARTIAL | Phase 1 application/platform serving experience across Mazzaneh modules |
| D2 | Inference Optimization | STRONG EVIDENCE | Patent-grade candidate inference frameworks; benchmark validation pending |
| D3 | Monitoring | STRONG EVIDENCE | Monitoring architecture / GPU Sentinel route; implementation or pilot validation pending |
| D4 | Deployment | PARTIAL | Phase 1 application/platform deployment experience across Mazzaneh modules |
| E1 | Data Governance | PARTIAL | Consent-first data governance by design |
| E2 | Security | STRONG EVIDENCE | Multi-tier security architecture documented; adversarial validation pending |
| E3 | Privacy | PARTIAL | Consent-first privacy posture; Phase 3 privacy/compliance review required |
| E4 | Compliance | PARTIAL | EUIPO guidance · context + separate patent filing |
Each slot below opens with MZN's provisional position, then a reference industry view (definition, state of the art, decisions, trade-offs, numbers, open questions, frontier analyst position, examples, references), and concludes with an expandable 529-item sub-endpoint anatomy.
The pre-training corpus is the model's universe of evidence: every text, code file, image-caption pair, and audio transcript that establishes what the model considers possible. Corpus engineering — acquisition, extraction, filtering, deduplication, mixing — sets the absolute capability ceiling. Post-training can refine and align, but cannot exceed what is latent in the data.
Frontier models (2025-2026) train on 10-30 trillion text tokens plus billions of multimodal pairs. A leading open-weights flagship model used 15T tokens. An open-weights frontier model (V3 class) used 14.8T tokens. A current-generation frontier model estimated similar order. Frontier shift: from raw web scale to curated quality (FineWeb-Edu, DCLM-baseline). Multimodal natively integrated from pre-training (a multimodal frontier model, a frontier multimodal model, a long-context frontier model).
Reference analyst note. Quality > quantity is now consensus, but the field has overcorrected — most labs underweight diversity in chase of curated quality. The 'textbook quality' direction (a synthetic-heavy small frontier model) is a local optimum, not a global one. Frontier 2026-2027 will rebalance toward curated-but-diverse, with synthetic data filling specific holes (math reasoning, agent traces) not as bulk replacement.
Quality > quantity is now consensus, but the field has overcorrected — most labs underweight diversity in chase of curated quality. The 'textbook quality' direction (a synthetic-heavy small frontier model) is a local optimum, not a global one. Frontier 2026-2027 will rebalance toward curated-but-diverse, with synthetic data filling specific holes (math reasoning, agent traces) not as bulk replacement.
A leading open-weights model: 15T tokens, 5% non-English, code 17% · an open-weights frontier model (V3 class): 14.8T tokens, multilingual focus · FineWeb-Edu: 1.3T high-quality educational tokens (open) · RedPajama-V2: 30T tokens (open) · DCLM-baseline: 4T tokens with model-based filtering
Hoffmann et al., Chinchilla (2022) · Penedo et al., FineWeb (2024) · Li et al., DataComp-LM (2024) · Soldaini et al., Dolma (2024)
The tokenizer maps raw text into discrete tokens — the model's vocabulary. Tokenizer choice is permanent: it defines vocabulary size, multilingual coverage, code handling, and context-window efficiency. A bad tokenizer wastes context (more tokens per character), degrades multilingual performance, and cannot be changed without retraining. Frontier tokenizers are byte-level BPE or SentencePiece with 100K-256K vocabulary.
a current-generation frontier model tokenizer (cl100k_base, 100K vocab) and a leading open-weights model tokenizer (128K vocab, multilingual) are reference points. Byte-level BPE (open-source BPE tokenizer libraries, a foundational decoder-only model lineage) handles any UTF-8 input gracefully. SentencePiece (open-weights models) supports both BPE and Unigram. Multimodal tokenizers add image tokens (256-1024 per image) and audio tokens.
Reference analyst note. Tokenizer choice is a permanent commitment that constrains everything downstream. Frontier labs underinvest here — most use SentencePiece-defaults trained on subset of data. The next frontier capability gain may come from rethinking tokenization (entropy-aware dynamic tokenization, byte-level with efficient training). Anyone aiming for genuine multilingual frontier should treat tokenizer as a first-class capability investment.
Tokenizer choice is a permanent commitment that constrains everything downstream. Frontier labs underinvest here — most use SentencePiece-defaults trained on subset of data. The next frontier capability gain may come from rethinking tokenization (entropy-aware dynamic tokenization, byte-level with efficient training). Anyone aiming for genuine multilingual frontier should treat tokenizer as a first-class capability investment.
a current-generation frontier model: cl100k_base, 100K vocab, byte-level BPE · a leading open-weights model: 128K vocab, multilingual SentencePiece BPE · a leading frontier model: ~65K vocab · a multimodal frontier model: tokenizer designed for multimodal
Sennrich et al., Subword Units / BPE (2015) · Kudo & Richardson, SentencePiece (2018) · Petrov et al., Tokenizer Choice (2023)
Model architecture defines the network's computational structure: how inputs flow through layers, what operations apply at each layer, and how representations combine. The dominant paradigm since 2017 is the decoder-only transformer with mods. Architecture decisions cascade: attention type affects long-context, normalization affects training stability, MoE affects parameter efficiency vs. compute.
Frontier 2024-2026 dense architectures (a leading open-weights flagship model, an open-weights frontier lab Large 2): decoder-only transformer with RoPE positional encoding, RMSNorm, SwiGLU activation, GQA (grouped query attention). MoE architectures (a sparse-MoE frontier model, an open-weights frontier model (V3 class)): sparse expert routing with 8-256 experts, top-2 routing typical. Reasoning models (o1, R1): same architecture but RL-trained for chain-of-thought. Multimodal: native interleaved tokens with vision encoder integration.
Reference analyst note. Dense architecture is dead at frontier scale by end of 2026. A leading open-weights flagship model is likely the last frontier-tier dense model. Either MoE (an open-weights frontier provider lineage, fine-grained 200+ experts) or new sparse paradigms wins. Architecture innovation is decoupling from scaling — RL post-training resets 'capability per parameter' such that smaller models with better post-training match much larger pre-train-only models. The bottleneck is shifting from architecture-quality to RL-environment-quality.
Dense architecture is dead at frontier scale by end of 2026. A leading open-weights flagship model is likely the last frontier-tier dense model. Either MoE (an open-weights frontier provider lineage, fine-grained 200+ experts) or new sparse paradigms wins. Architecture innovation is decoupling from scaling — RL post-training resets 'capability per parameter' such that smaller models with better post-training match much larger pre-train-only models. The bottleneck is shifting from architecture-quality to RL-environment-quality.
A leading open-weights flagship model: dense, GQA, RoPE, RMSNorm, SwiGLU · an open-weights frontier model (V3 class): MoE 671B total / 37B active, multi-head latent attention · a sparse-MoE frontier model: MoE, 141B total / 39B active · a long-context frontier model / a frontier multimodal model: architecture undisclosed but likely MoE
Vaswani et al., Attention Is All You Need (2017) · Touvron et al., leading open-weights model (2023, 2024) · an open-weights frontier model (V3 class) technical report (2024) · Su et al., RoPE (2021)
Training infrastructure is the orchestration layer that turns architecture + data + compute into a trained model. At frontier scale (10K+ GPUs, weeks of training), every component matters: distributed parallelism strategy, optimizer state management, mixed-precision arithmetic, failure recovery, checkpoint frequency, gradient accumulation, learning rate scheduling. A 1% throughput improvement at frontier scale = millions of dollars.
Frontier training stacks: a leading accelerator vendor a tensor-parallelism reference implementation + an open optimization framework (PyTorch), JAX/MaxText (a constitutional-methods frontier lab, a multimodal frontier lab). 4D parallelism standard: data + tensor + pipeline + expert (for MoE). A current-generation accelerator/a current-generation accelerator/a next-generation accelerator with InfiniBand. BF16 mixed-precision, FP8 emerging (a current-generation accelerator+). Checkpoint to S3/GCS every N steps with async writes. Auto-recovery from node failure.
Reference analyst note. an open-weights frontier model (V3 class)'s $5.6M-equivalent demonstrated the field has been overspending by 5-10×. The next 2 years will see massive efficiency gains as algorithmic improvements (FP8, fine-grained MoE, better parallelism, better data) compound. Frontier 'training compute' as the dominant moat is collapsing. The new moat is post-training infrastructure, RL environment quality, and inference-time compute scaling. Anyone with 1K H100s can now produce competitive models — the bottleneck has moved upstream of pre-training to data and downstream to RL.
an open-weights frontier model (V3 class)'s $5.6M-equivalent demonstrated the field has been overspending by 5-10×. The next 2 years will see massive efficiency gains as algorithmic improvements (FP8, fine-grained MoE, better parallelism, better data) compound. Frontier 'training compute' as the dominant moat is collapsing. The new moat is post-training infrastructure, RL environment quality, and inference-time compute scaling. Anyone with 1K H100s can now produce competitive models — the bottleneck has moved upstream of pre-training to data and downstream to RL.
A leading open-weights flagship model: 16K H100s for ~30M GPU-hours, BF16, 4D parallel · an open-weights frontier model (V3 class): 2K H800s, FP8 mixed-precision (innovation) · a constitutional-methods frontier lab: JAX on a custom-silicon accelerator
Shoeybi et al., a tensor-parallelism reference implementation (2019) · Rajbhandari et al., ZeRO/an open optimization framework (2020) · a leading open-weights model paper (2024) · an open-weights frontier model (V3 class) report (2024)
Compute infrastructure is the physical substrate. GPU/a custom-silicon accelerator acquisition, network topology, storage. Frontier training requires homogeneous, high-bandwidth GPU clusters with InfiniBand interconnect. Inference requires either similar clusters (for largest models) or commodity GPU with optimization. The compute supply chain is a strategic constraint: GPU access is gated by a leading accelerator vendor allocation and capital.
a current-generation accelerator (80GB, 700W, $25-40K/GPU) is the frontier workhorse since 2023. A current-generation accelerator (141GB, late 2024) and a next-generation accelerator/a Blackwell-class architecture (192GB, 2025) succession. A multimodal frontier lab a custom-silicon accelerator / v6e for a constitutional-methods frontier lab, a multimodal frontier lab. Frontier clusters: 16K-100K+ GPUs with non-blocking InfiniBand 400-800Gbps. CoreWeave, Lambda Labs, Crusoe provide alternative-cloud GPU access at lower cost than hyperscalers.
Reference analyst note. Compute infrastructure is becoming a real estate / power infrastructure business as much as a hardware business. A synthetic-data-focused lab signing 20-year nuclear PPA with Three Mile Island, xAI building gas turbines on-site at Memphis, Stargate's $500B announcement — these reflect that the actual frontier constraint by 2027 is gigawatt-class power, not GPU supply. National strategic positioning of compute (US export controls on H800 to China, EU sovereign cloud requirements) is now first-order policy. Anyone serious about frontier needs to think 5+ years ahead about power and land, not just GPU procurement.
Compute infrastructure is becoming a real estate / power infrastructure business as much as a hardware business. A synthetic-data-focused lab signing 20-year nuclear PPA with Three Mile Island, xAI building gas turbines on-site at Memphis, Stargate's $500B announcement — these reflect that the actual frontier constraint by 2027 is gigawatt-class power, not GPU supply. National strategic positioning of compute (US export controls on H800 to China, EU sovereign cloud requirements) is now first-order policy. Anyone serious about frontier needs to think 5+ years ahead about power and land, not just GPU procurement.
xAI Colossus: 100K a current-generation accelerator single cluster (2024) · an open-weights frontier lab: ~600K a current-generation accelerator equivalent (2024 reported) · a constitutional-methods frontier lab: a hyperscaler platform a hyperscaler accelerator + GCP a custom-silicon accelerator · an open-weights frontier provider: 2K H800 (export-restricted, smaller scale)
A leading accelerator vendor a current-generation accelerator datasheet · Selene cluster paper (a leading accelerator vendor)
SFT (Supervised Fine-Tuning) takes a pre-trained base model — which is a powerful text completer but not an assistant — and trains it on instruction-response pairs to behave as an assistant. The model learns the chat template, role conventions, refusal patterns, and the basic shape of helpful responses. SFT is universally the first post-training stage; everything else builds on it.
Quality > quantity is the consensus since LIMA (Zhou et al., 2023) demonstrated 1000 highly-curated examples nearly match millions of crowdsourced ones. Frontier SFT mixtures include: human-written conversations (leading frontier labs use 100K-1M+), reasoning chains (long CoT exemplars), tool-use traces, code with patches, math with solutions. Synthetic SFT (teacher model generates) increasingly common via self-instruct methodology, Evol-Instruct, Magpie.
Reference analyst note. SFT is dramatically underrated and over-tuned. Most labs spend too much on SFT data scale (millions of examples) and not enough on quality + diversity. The optimal frontier SFT corpus is probably 100K-500K examples curated to within an inch of their lives. SFT-then-RL is the path; trying to push everything into SFT (Tulu approach) hits diminishing returns visible in current open community.
SFT is dramatically underrated and over-tuned. Most labs spend too much on SFT data scale (millions of examples) and not enough on quality + diversity. The optimal frontier SFT corpus is probably 100K-500K examples curated to within an inch of their lives. SFT-then-RL is the path; trying to push everything into SFT (Tulu approach) hits diminishing returns visible in current open community.
A leading open-weights model SFT: ~10M examples mix (human + synthetic) · OpenAssistant: 161K human conversations (open) · Magpie: synthetic from base model self-conversation · Hermes / Nous: open SFT-tuned models
Zhou et al., LIMA (2023) · Wang et al., self-instruct methodology (2022) · Xu et al., Evol-Instruct (2023) · Xu et al., Magpie (2024)
Preference alignment improves the SFT model's quality, helpfulness, and harmlessness using comparison data: humans (or AI) compare two model outputs and indicate which is preferred. The model learns from pairwise preferences, not single-target answers. Three main methods: RLHF (PPO with reward model), DPO (direct preference optimization, no separate RM), Constitutional methods (AI-generated preferences via principles). Preference alignment moves models from 'competent' to 'good'.
DPO (Rafailov et al., 2023) became the dominant 2024 method for its simplicity — no PPO, no separate reward model, single training stage. PPO-based RLHF still used at frontier (a leading frontier lab, possibly a constitutional-methods frontier lab). Constitutional methods / RL-from-AI-Feedback (RLAIF) (a constitutional-methods frontier lab) generates preferences via AI-judged adherence to principles, avoiding human annotation cost. Iterative DPO and online DPO push quality further.
Reference analyst note. RLHF as a method is mostly cargo-culted. The actual win at frontier comes from: (a) high-quality SFT, (b) RL-from-AI-Feedback (RLAIF) for breadth, (c) RLVR for verifiable tasks, (d) human RLHF only for irreducibly subjective categories. The DPO-vs-PPO debate is a sideshow — both work, choice is engineering preference. The real frontier shift in 2025-2026 is 'preference alignment' becoming 'reasoning alignment' — RL signal moving from human preference to verifiable correctness for hard tasks. This is the most important post-training shift since RLHF itself.
RLHF as a method is mostly cargo-culted. The actual win at frontier comes from: (a) high-quality SFT, (b) RL-from-AI-Feedback (RLAIF) for breadth, (c) RLVR for verifiable tasks, (d) human RLHF only for irreducibly subjective categories. The DPO-vs-PPO debate is a sideshow — both work, choice is engineering preference. The real frontier shift in 2025-2026 is 'preference alignment' becoming 'reasoning alignment' — RL signal moving from human preference to verifiable correctness for hard tasks. This is the most important post-training shift since RLHF itself.
A leading open-weights model: iterative DPO + RLHF mix · a constitutional-methods frontier lab a leading frontier model: Constitutional methods + RLHF · a leading frontier lab: PPO-based RLHF (historical, current details closed) · Open: Tulu 3 (UltraFeedback DPO + RLVR)
Christiano et al., RLHF (2017) · Ouyang et al., InstructGPT (2022) · Bai et al., Constitutional methods (2022) · Rafailov et al., DPO (2023) · Lambert et al., Tulu 3 (2024)
a public alignment specification / Constitution: the explicit document that defines what the model should and shouldn't do. Components: persona, helpfulness/harmlessness/honesty principles, harm category taxonomy, refusal policies, role hierarchy (system/operator/user/tool), exception cases, exemplars. Without an explicit spec, model behavior is implicit and inconsistent. Increasingly required for trust, regulatory clarity, dispute resolution.
A leading frontier lab a public alignment specification (May 2024, updated): public ~5000-word document defining Chain of Command (Platform > Developer > User > Tool), default behaviors, hard rules. One lab's constitution + Acceptable Use Policy are public. Both define harm categories: CBRN weapons, child safety, privacy, election interference, self-harm, deceptive output. Spec drives training data curation, RLHF reward signal, and red-team test cases.
Reference analyst note. Specifications are operationally useful (alignment of human reviewers, regulatory clarity, dispute resolution) but their causal effect on model behavior is poorly understood. The a constitutional-methods frontier lab Constitution and a leading frontier lab a public alignment specification serve more as institutional artifacts than technical control mechanisms. The next frontier is 'specs the model can actually reason about' — current specs are read like training labels, not internalized reasoning frameworks. Constitutional Classifiers (2025) suggest a path: separate small model that explicitly checks against principles.
Specifications are operationally useful (alignment of human reviewers, regulatory clarity, dispute resolution) but their causal effect on model behavior is poorly understood. The a constitutional-methods frontier lab Constitution and a leading frontier lab a public alignment specification serve more as institutional artifacts than technical control mechanisms. The next frontier is 'specs the model can actually reason about' — current specs are read like training labels, not internalized reasoning frameworks. Constitutional Classifiers (2025) suggest a path: separate small model that explicitly checks against principles.
A leading frontier lab a public alignment specification (public) · a constitutional-methods frontier lab Acceptable Use Policy (public) · one lab's constitution (mostly public) · a multimodal frontier lab a multimodal frontier model policies
A leading frontier lab a public alignment specification (2024) · a constitutional-methods frontier lab AUP · Bai et al., CAI (2022)
Capability evaluation measures what a model can do. Standard benchmarks form a public scoreboard that drives industry progress. Categories: general knowledge (MMLU), reasoning (GSM8K, MATH, AIME), code (HumanEval, MBPP, LiveCodeBench, SWE-bench), agentic (GAIA, AgentBench), long-context (NIAH, RULER, BABILong), multilingual (MGSM, multilingual MMLU), instruction following (IFEval), and frontier-specific (HLE, ARC-AGI, FrontierMath).
Benchmark saturation is a constant concern: MMLU saturating ~90%, HumanEval saturated ~95%. New benchmarks emerging: HLE (Humanity's Last Exam, ~3000 expert-PhD-level questions), FrontierMath (research-level math), ARC-AGI (visual abstract reasoning), SWE-Bench Verified (real GitHub issues, validated). Contamination is pervasive — popular benchmarks leak into training data, requiring fresh held-out sets.
Reference analyst note. Standard benchmarks are entering crisis — saturation, contamination, gameability. The next 2 years will see shift to: (a) live arenas with continuous human ratings (lmarena), (b) frequently-refreshed benchmarks (LiveCodeBench), (c) expert-grade eval (GPQA, FrontierMath, HLE), (d) agent benchmarks measuring real task completion (SWE-bench, GAIA, OSWorld). The trend is from 'static MMLU score' to 'diverse evidence portfolio.' a constitutional-methods frontier lab system cards already do this; expect industry-wide adoption.
Standard benchmarks are entering crisis — saturation, contamination, gameability. The next 2 years will see shift to: (a) live arenas with continuous human ratings (lmarena), (b) frequently-refreshed benchmarks (LiveCodeBench), (c) expert-grade eval (GPQA, FrontierMath, HLE), (d) agent benchmarks measuring real task completion (SWE-bench, GAIA, OSWorld). The trend is from 'static MMLU score' to 'diverse evidence portfolio.' a constitutional-methods frontier lab system cards already do this; expect industry-wide adoption.
Major scoreboards: lmarena.ai (live human votes), Open LLM Leaderboard, an open-model hub leaderboards · Frontier labs publish evals on system cards · Benchmark saturation: GPQA, AIME going next
Hendrycks et al., MMLU (2020) · Cobbe et al., GSM8K (2021) · Chen et al., HumanEval (2021) · Phan et al., HLE (2025)
Safety evaluation tests refusal accuracy, harm avoidance, bias, and alignment. Different from capability eval: capability asks 'can the model do X?' Safety asks 'does the model do X when it shouldn't, or fail to do X when it should?' Categories: refusal calibration (XSTest), bias (BBQ, BOLD), toxicity (ToxiGen, RealToxicityPrompts), privacy (TrustLLM), harmful task assistance (HarmBench).
Frontier labs publish safety evals on system cards. AILuminate (MLCommons, 2024) is industry standard cross-lab safety benchmark. WMDP measures dangerous knowledge (CBRN). DecodingTrust comprehensive trust eval. A national AI Safety Institute and a national AI Safety Institute run external safety evaluations on frontier models pre-release.
Reference analyst note. Safety evaluation is dramatically underdeveloped relative to capability evaluation. Capability has 50+ standard benchmarks; safety has maybe 15. We are flying blind on subtle harms (sycophancy, manipulation, deception under specific conditions). One lab's interpretabilityility work is the deepest probe; field-wide it's still surface-level. Expect frontier safety eval to expand 5-10× by 2027 driven by EU AI Act conformity and a national AI Safety Institute evaluations.
Safety evaluation is dramatically underdeveloped relative to capability evaluation. Capability has 50+ standard benchmarks; safety has maybe 15. We are flying blind on subtle harms (sycophancy, manipulation, deception under specific conditions). One lab's interpretabilityility work is the deepest probe; field-wide it's still surface-level. Expect frontier safety eval to expand 5-10× by 2027 driven by EU AI Act conformity and a national AI Safety Institute evaluations.
MLCommons AILuminate · a constitutional-methods frontier lab system card safety section · a leading frontier lab system card · a national AI Safety Institute evaluations
Vidgen et al., AILuminate (2024) · Wang et al., DecodingTrust (2023) · Li et al., WMDP (2024)
Responsible Scaling / Release Framework: institutional commitments tying capability thresholds to required safety measures. The forcing function that prevents 'race to the bottom'. A Responsible Scaling Policy framework, a Preparedness-style framework, a multimodal frontier lab Frontier-Safety-style framework all define: capability levels, evaluation requirements per level, security/deployment mitigations required per level, conditions for pause/rollback.
a Responsible Scaling Policy framework (v2, 2024) (2024): defines AI Safety Level with capability thresholds for autonomous biosecurity, cyber, and AI R&D capabilities. A Preparedness-style framework (2023, updated): Critical/High/Medium/Low risk levels with deployment gates. A Frontier-Safety-style framework similar. Voluntary commitments via national AI Safety Institute, Seoul declaration. Increasingly intersecting with regulation (EU AI Act).
Reference analyst note. Responsible Scaling Policies are useful coordination devices but their actual prophylactic power is untested. They've never paused a release. The optimistic read: capabilities haven't crossed thresholds. The pessimistic read: thresholds are calibrated to never bind. Truth probably mix. The next test will come when a model genuinely approaches third AI Safety Level cyber or CBRN — likely 2026-2027. Whether the framework holds under genuine commercial pressure is the real test.
Responsible Scaling Policies are useful coordination devices but their actual prophylactic power is untested. They've never paused a release. The optimistic read: capabilities haven't crossed thresholds. The pessimistic read: thresholds are calibrated to never bind. Truth probably mix. The next test will come when a model genuinely approaches third AI Safety Level cyber or CBRN — likely 2026-2027. Whether the framework holds under genuine commercial pressure is the real test.
a Responsible Scaling Policy framework (v2, 2024) (public) · a Preparedness-style framework (public) · a Frontier-Safety-style framework (public)
a Responsible Scaling Policy framework (v2, 2024) (2024) · a Preparedness-style framework (2024) · a Frontier-Safety-style framework (2024)
Output safety: defenses applied at inference-time on model outputs. Distinct from training-time safety (B-group). Operates as final layer regardless of training quality. Components: output content filters (an open-weights output classifier, a leading frontier lab Moderations), PII detection/redaction, watermarking, provenance metadata (C2PA), output context (schema compliance, refusal reformulation).
a recent-generation output classifier (an open-weights frontier lab) is open standard. A moderation API service. A constitutional-methods frontier lab safety classifier. C2PA (Content Provenance and Authenticity) standard for cryptographic content provenance — Adobe, a leading frontier lab, a synthetic-data-focused lab adopting. a generative-content watermarking system (a multimodal frontier lab) watermarks AI-generated content. Constitutional Classifiers (a constitutional-methods frontier lab, 2025): trained classifiers checking outputs against constitution principles.
Reference analyst note. Output safety is the right architectural choice — input filtering is doomed because input space is unbounded, output space is comparatively constrained. output-conformance safety paradigm (egress filtering + cached refusal templates + classifier ensemble) is the production-ready answer. The remaining hard problem is multimodal output (image/video/audio) where classification is much harder than text. Watermarking is a useful piece but not a solution; treat it as evidence, not enforcement.
Output safety is the right architectural choice — input filtering is doomed because input space is unbounded, output space is comparatively constrained. output-conformance safety paradigm (egress filtering + cached refusal templates + classifier ensemble) is the production-ready answer. The remaining hard problem is multimodal output (image/video/audio) where classification is much harder than text. Watermarking is a useful piece but not a solution; treat it as evidence, not enforcement.
a recent-generation output classifier · a leading frontier lab Moderations · a constitutional-methods frontier lab Constitutional Classifiers · a multimodal frontier lab a generative-content watermarking system
Inan et al., an open-weights output classifier (2023) · Sharma et al., Constitutional Classifiers (2025) · C2PA spec · Dathathri et al., a generative-content watermarking system-Text (2024)
Serving stack. From request arrival to response. Components: API gateway (auth, routing), inference engine (an open-source inference engine, TRT-LLM, SGLang), batch coordinator, response streamer. Performance gap between naive and optimized: 10-100×.
an open-source inference engine dominant open. A vendor inference stack peak a leading accelerator vendor performance. SGLang for shared-prefix workloads. Hosted: Anyscale, Together AI, Fireworks, Replicate. a high-throughput inference accelerator LPU for ultra-low-latency. Multi-model dispatch (multiple base models on same cluster) increasingly common.
Reference analyst note. Inference engineering is undervalued relative to training. A 5× throughput gain via better serving = 5× more users at same cost. Most labs underinvest. An open-source inference engine's PagedAttention was a paper; it should have been a unicorn. The next round of gains comes from: (a) speculative decoding everywhere (a draft-head speculative decoding technique-2, MTP), (b) FP8/FP4 inference on a next-generation accelerator, (c) cross-request KV cache (prefix caching), (d) serving optimizations specific to reasoning models. Anyone serving LLMs at scale who isn't doing all four is leaving 5-10× on the table.
Inference engineering is undervalued relative to training. A 5× throughput gain via better serving = 5× more users at same cost. Most labs underinvest. An open-source inference engine's PagedAttention was a paper; it should have been a unicorn. The next round of gains comes from: (a) speculative decoding everywhere (a draft-head speculative decoding technique-2, MTP), (b) FP8/FP4 inference on a next-generation accelerator, (c) cross-request KV cache (prefix caching), (d) serving optimizations specific to reasoning models. Anyone serving LLMs at scale who isn't doing all four is leaving 5-10× on the table.
an open-source inference engine (open frontier) · a vendor inference stack (a leading accelerator vendor optimized) · SGLang challenger · a high-throughput inference accelerator LPU production
Kwon et al., an open-source inference engine (2023) · Zheng et al., SGLang (2024)
Inference optimization: reducing latency and cost per token. Stack: KV cache management, batching, speculative decoding, quantization, sparsity, kernel optimization. 10-100× speedup possible vs naive baseline.
Frontier serving combines: PagedAttention (an open-source inference engine) + continuous batching + speculative decoding (a draft-head speculative decoding technique-2) + INT4 weight quant + FP8 activation quant + custom CUDA kernels (FlashAttention 3). Latency budgets: TTFT <200ms for chat, TPOT <50ms for streaming.
Reference analyst note. Inference optimization is solved at the kernel and batching levels — an open-source inference engine, a vendor inference stack, FlashAttention together cover most of the win. The remaining frontier is system-level: prefix caching at scale, speculative decoding for reasoning, multi-LoRA dispatch, hardware-aware kernel JIT. Frontier serving stacks in 2026 will look fundamentally different from 2024 in their handling of test-time-compute-scaling models — this transition is mid-progress and labs differ widely.
Inference optimization is solved at the kernel and batching levels — an open-source inference engine, a vendor inference stack, FlashAttention together cover most of the win. The remaining frontier is system-level: prefix caching at scale, speculative decoding for reasoning, multi-LoRA dispatch, hardware-aware kernel JIT. Frontier serving stacks in 2026 will look fundamentally different from 2024 in their handling of test-time-compute-scaling models — this transition is mid-progress and labs differ widely.
an open-source inference engine with all optimizations · a vendor inference stack peak a leading accelerator vendor · Together AI production stack
Production monitoring. What's happening in production right now? Latency (TTFT, TPOT, end-to-end), throughput, error rates, GPU utilization, KV cache hit rate, cost per request, content quality, drift, anomalies.
Standard SRE metrics + LLM-specific layers. LangSmith, Arize Phoenix, Langfuse, Helicone for LLM observability. OpenTelemetry GenAI semantic conventions emerging as standard.
Reference analyst note. Production observability for LLMs is 5 years behind general SRE. LangSmith, Helicone, Langfuse are gradually catching up but lack maturity of Datadog/New Relic. The hard problem is quality monitoring — capability changes are subtle and statistical signals are noisy. Frontier labs maintain large internal observability teams; smaller deployments are largely flying blind. Expect this to be a major investment area 2025-2027.
Production observability for LLMs is 5 years behind general SRE. LangSmith, Helicone, Langfuse are gradually catching up but lack maturity of Datadog/New Relic. The hard problem is quality monitoring — capability changes are subtle and statistical signals are noisy. Frontier labs maintain large internal observability teams; smaller deployments are largely flying blind. Expect this to be a major investment area 2025-2027.
LangSmith · Arize Phoenix · Helicone · Langfuse (open)
Deployment: releasing model versions to production. Rollout strategy, A/B testing, rollback procedures, version management, pre-deployment gating. Distinct from D1 serving (which is the runtime). D4 is the release process.
Frontier labs use canary deployments (1% → 10% → 100% over hours/days). A/B test new vs current via held-out user cohorts. Automatic rollback on quality regression triggers. Pre-deployment gates: safety eval, capability eval, internal review.
Reference analyst note. Deployment discipline is genuinely better than 5 years ago — frontier labs run staged rollouts, have rollback procedures, conduct A/B tests. But quality regression detection remains the soft underbelly. A model that's 5% worse on subjective metrics will pass safety / capability / SLO gates and ship. We're learning about quality regressions from arena ranking changes weeks after deployment. Better quality regression infrastructure is high-leverage but underinvested.
Deployment discipline is genuinely better than 5 years ago — frontier labs run staged rollouts, have rollback procedures, conduct A/B tests. But quality regression detection remains the soft underbelly. A model that's 5% worse on subjective metrics will pass safety / capability / SLO gates and ship. We're learning about quality regressions from arena ranking changes weeks after deployment. Better quality regression infrastructure is high-leverage but underinvested.
A leading frontier lab gradual rollouts · a constitutional-methods frontier lab canary deployment · Standard SRE release practices
Data governance: lifecycle controls over data assets. Lineage (where data came from), access control (who can read what), retention (how long), deletion (data subject rights), provenance (cryptographic proof of source), customer data boundaries (no train on enterprise data).
Frontier labs: hearing-grade data governance for compliance. Customer data: zero-data-retention default for enterprise APIs. Lineage tracked end-to-end (source → corpus → model). Audit logs immutable.
Reference analyst note. Data governance is the frontier compliance bottleneck. The naive view ('we don't train on customer data') is insufficient — EU AI Act, copyright lawsuits (NYT v. A leading frontier lab), and emerging unlearning requirements force much deeper governance. Frontier labs that don't have hearing-grade data lineage today will spend 2025-2026 building it. The model card / system card transparency standard set by a constitutional-methods frontier lab is becoming default expectation.
Data governance is the frontier compliance bottleneck. The naive view ('we don't train on customer data') is insufficient — EU AI Act, copyright lawsuits (NYT v. A leading frontier lab), and emerging unlearning requirements force much deeper governance. Frontier labs that don't have hearing-grade data lineage today will spend 2025-2026 building it. The model card / system card transparency standard set by a constitutional-methods frontier lab is becoming default expectation.
a constitutional-methods frontier lab enterprise zero-data-retention · a leading frontier lab Enterprise no-train default · a hyperscaler platform Bedrock isolation
Security: end-to-end security posture. Categories: prompt injection defense, data exfiltration prevention, model theft protection, training-data poisoning defense, supply chain security, jailbreak resistance, agentic security, security monitoring.
a constitutional-methods frontier lab third AI Safety Level security: protect weights against non-state-actor theft. Multi-layer defenses across categories. NIST AI RMF, ISO/IEC 42001 for governance frameworks. EU AI Act security requirements for high-risk systems.
Reference analyst note. Security for LLMs is in a state similar to web security circa 2008 — patterns visible but practices immature. The frontier 2026 security stance: assume weights will eventually leak (insider, breach, gradual extraction); design for graceful degradation. The a constitutional-methods frontier lab third AI Safety Level framing (resist non-state actor) is appropriately calibrated; fourth AI Safety Level (resist state actor) is the next frontier and unsolved. Agent security is the unsolved problem of the next 2 years; current 'defenses' are mostly hopeful patterns, not robust controls.
Security for LLMs is in a state similar to web security circa 2008 — patterns visible but practices immature. The frontier 2026 security stance: assume weights will eventually leak (insider, breach, gradual extraction); design for graceful degradation. The a constitutional-methods frontier lab third AI Safety Level framing (resist non-state actor) is appropriately calibrated; fourth AI Safety Level (resist state actor) is the next frontier and unsolved. Agent security is the unsolved problem of the next 2 years; current 'defenses' are mostly hopeful patterns, not robust controls.
a constitutional-methods frontier lab third AI Safety Level commitments (public) · a leading frontier lab security posture · NIST AI RMF as framework
Privacy: protection of personal information. Categories: PII handling, differential privacy, membership inference defense, regulatory compliance, inference-time privacy.
Frontier labs: comprehensive PII handling, GDPR/CCPA compliance, optional zero-data-retention. Differential privacy still rare at scale (DP-SGD too expensive for frontier training). Membership inference defenses via training-data deduplication.
Reference analyst note. Privacy compliance is becoming a serious cost center. Frontier labs that haven't invested in privacy infrastructure (hearing-grade data governance, deletion processes, sectoral certifications) will face compounding regulatory costs 2025-2027. The technically-interesting frontier is private inference (TEE, private cloud compute, eventually homomorphic) — a confidential-computing frontier lab's deployment shows production viability. Differential privacy at training remains aspirational at frontier scale.
Privacy compliance is becoming a serious cost center. Frontier labs that haven't invested in privacy infrastructure (hearing-grade data governance, deletion processes, sectoral certifications) will face compounding regulatory costs 2025-2027. The technically-interesting frontier is private inference (TEE, private cloud compute, eventually homomorphic) — a confidential-computing frontier lab's deployment shows production viability. Differential privacy at training remains aspirational at frontier scale.
a constitutional-methods frontier lab enterprise privacy · a confidential-computing frontier lab's Private Cloud Compute (DP + TEE)
Compliance: regulatory and framework conformance. EU AI Act, NIST AI RMF, ISO/IEC 42001, sectoral (HIPAA, FedRAMP, SOC 2), voluntary commitments (Frontier Model Forum, AI Safety Summit Seoul/Bletchley/Paris).
EU AI Act in force (Aug 2024), full effect 2026-2027. GPAI Code of Practice published 2024. Frontier labs: SOC 2 Type II + ISO 27001 + ISO/IEC 42001. FedRAMP Moderate (a constitutional-methods frontier lab 2024). Voluntary commitments via Bletchley/Seoul/Paris summits.
Reference analyst note. Compliance is becoming a strategic lever. A constitutional-methods frontier lab's investments in FedRAMP, ISO/IEC 42001, EU AI Act readiness give it enterprise customer access a leading frontier lab / a multimodal frontier lab catch up to slowly. The arbitrage is real: $50M+ in compliance investment can unlock $1B+ in regulated-industry revenue. The next 18 months will see frontier labs differentiate not on capability (saturating) but on compliance depth and trust signals.
Compliance is becoming a strategic lever. A constitutional-methods frontier lab's investments in FedRAMP, ISO/IEC 42001, EU AI Act readiness give it enterprise customer access a leading frontier lab / a multimodal frontier lab catch up to slowly. The arbitrage is real: $50M+ in compliance investment can unlock $1B+ in regulated-industry revenue. The next 18 months will see frontier labs differentiate not on capability (saturating) but on compliance depth and trust signals.
a constitutional-methods frontier lab SOC 2 + ISO 27001 + FedRAMP Moderate · a leading frontier lab similar · Code of Practice signatories
Topics that span multiple slots and define the 2025-2027 frontier: synthetic data at scale, reasoning model training, test-time compute, model merging, evaluation contamination, agentic safety, multimodal training, realtime models, and the open-weights ecosystem as a structural force.
Relevant to: A1, B1, B2
Synthetic data has gone from niche to backbone in two years. A synthetic-heavy small frontier model (a synthetic-data-focused lab 2024) demonstrated existence proof: a 3.8B model trained heavily on synthetic textbook-quality data matches 70B-class capability on benchmarks. Cosmopedia (an open-model hub) released 25B tokens of synthetic textbook content. RL-from-AI-Feedback (RLAIF) (a constitutional-methods frontier lab 2022) showed AI-generated preferences match RLHF on helpfulness/harmlessness.
The mechanics: a frontier model (a leading frontier model, a current-generation frontier model, a leading open-weights model) acts as teacher, generating instruction-response pairs, reasoning traces, tool-use demonstrations, or preference comparisons. Filtering removes obvious failures. The student model trains on this curated synthetic corpus.
Three failure modes shape practice. First, model collapse (Shumailov 2024): training on AI-generated data recursively narrows distribution and degrades quality. The fix is mixing — synthetic data should compose 20-50% of training, not replace real data. Second, distributional artifacts: synthetic data has telltale stylistic homogeneity (frontier teachers all sound similar). Diversity prompting and multi-teacher mixing partially mitigate. Third, capability ceiling: students can match but rarely exceed teachers, except via RL with verifiable rewards (where verifier is the upper bound, not the teacher).
The frontier 2025 direction: synthetic data targeting specific capability gaps. Math reasoning (NuminaMath: 860K verified problems), code-with-tests (every example includes execution verification), agent traces (model X plays user, model Y plays assistant with tool access). Each gap is filled with bespoke synthetic pipelines.
Strategic implication: pre-training compute is decreasingly the bottleneck. Synthetic data pipelines + RL infrastructure + verification environments are the new capability levers. Anyone building an LLM company in 2026 should treat synthetic data generation as a first-class capability, not an afterthought.
---
Relevant to: A3, B2, C1
Reasoning models scale test-time compute the way standard LLMs scale parameters. o1 (a leading frontier lab Sept 2024), an open-weights reasoning model (Jan 2025), a long-context frontier model.7 Sonnet extended thinking, a multimodal frontier model Flash Thinking — all share a pattern: same transformer architecture, RL post-training on tasks with verifiable rewards, hidden chain-of-thought before answer.
An open-weights reasoning model is the most documented case. Pure RL from base model (R1-Zero) with rule-based rewards: correct = 1, incorrect = 0 on math problems; syntactic correctness on code. No process reward model, no human preferences in this stage. After ~10K RL steps, emergent capabilities: self-correction, alternative-strategy exploration, backtracking, verbalization of uncertainty. R1 then adds cold-start SFT + multi-stage RL + distillation to smaller models (a multilingual frontier model and leading open-weights model 7-70B distillates that retain most reasoning capability).
The algorithm: Group Relative Policy Optimization (GRPO). Standard PPO needs a critic network (value function). GRPO replaces it with sampling K rollouts per query, computing advantage as outcome relative to group mean. Saves critic compute, simpler to implement. Same idea (RLOO, REINFORCE Leave-One-Out) is in Allen AI's Tulu 3.
Three open puzzles. First, generalization: R1 was trained on math/code but reasoning improvements transfer to other domains. Why? Hypothesis: RL teaches general meta-cognition (planning, verification, self-correction) that's domain-agnostic. No mechanistic confirmation. Second, length scaling: longer chains-of-thought roughly correlate with better answers, but with severe diminishing returns past ~10K thinking tokens. The shape of this curve isn't modeled. Third, length penalty: without one, model rambles. With aggressive one, capability degrades. Calibrating this remains art.
Strategic implication: reasoning is a separate skill axis from raw knowledge. A 7B a multilingual frontier model distilled from R1 outperforms 70B non-reasoner on hard math. The capability ceiling for reasoning is set by the verifier, not the teacher — a major shift from teacher-bounded SFT/RLHF. Companies investing in verifier-rich domains (formal math, code with test suites, scientific computation) can build domain-specialist models that outperform generalists.
---
Relevant to: A3, B2, C1, D1
Test-time compute scaling is the second major capability lever after parameter scaling. Same model, more inference compute per query, better answers. Methods: (a) longer chain-of-thought (o1, R1), (b) best-of-N sampling with verifier, (c) Monte Carlo Tree Search over reasoning steps, (d) self-consistency (sample N, majority vote), (e) multi-agent debate.
The key paper: Snell et al. (2024) "Scaling LLM Test-Time Compute Optimally". Showed that for a fixed quality target, you can substitute test-time compute for pre-training compute at favorable ratios. A 1B model with optimal test-time compute can match a 14B model on math reasoning.
Production implementation faces serving challenges. Best-of-N requires N parallel generations + verifier; latency 10-100× single generation. MCTS with branching factor 5 and depth 10 is potentially 10⁷ states. Self-consistency with N=20 samples is well-defined but 20× cost. Reasoning models with hidden CoT route differently — TTFT becomes thousands of tokens of thinking time before user sees response. UX patterns are immature.
Trade-off across methods: chain-of-thought is single-stream, sequential. Best-of-N parallelizes but requires verifier. MCTS searches systematically but only works on tasks where partial states are evaluable. Self-consistency works for tasks with discrete answer space, struggles for open-ended generation.
The frontier 2026 trend: hybrid stacks. Reasoning model with internal CoT, plus best-of-N at the answer level for verifiable tasks, plus self-consistency for high-stakes outputs. A leading frontier lab o3, an open-weights reasoning model+, a leading frontier model extended thinking all converge on similar patterns.
Strategic implication: inference cost is no longer a single multiplier on serving fees. A reasoning query may cost 10-100× a standard query. Pricing models (per-token) break down. leading frontier labs are pricing reasoning tier separately. Customers will choose dynamically: cheap fast model for simple queries, expensive slow reasoner for hard ones. Building this routing layer is a 2025-2027 product opportunity.
---
Relevant to: A3, B1, D4
Model merging combines weights of multiple fine-tuned models into a single model that retains multiple capabilities. Methods: linear interpolation (model soups), task arithmetic (vector arithmetic in weight space), DARE (Drop And REscale, sparsifies before merge), TIES (resolves sign conflicts), Model Stock, Evolutionary merging (Sakana AI 2024).
The core insight (Wortsman et al., 2022): fine-tuned models trained from same pre-trained checkpoint live in a connected loss basin. Linear interpolation between them often improves over either parent. This shouldn't work as well as it does — it implies fine-tuning makes localized updates.
The ecosystem: MergeKit (Goddard 2024) is the standard library. An open-model hub leaderboards regularly populated by merged models — top open models often combinations of community fine-tunes. SOLAR-10.7B, Yi-merged variants, and many leaderboard chart-toppers are merge products.
The mechanism: orthogonal capabilities (math vs creative writing) can be added in weight space; redundant capabilities collapse. DARE drops most fine-tuning delta vectors (they're sparse) and rescales remaining; surprisingly preserves quality. TIES detects sign conflicts (parameter wants to go up in one fine-tune, down in another) and resolves via majority/magnitude.
Open puzzles: when does merging help vs hurt? Evolutionary search (Sakana AI's evolutionary merge) finds non-obvious combinations but is computationally expensive. Mechanistic understanding of why this works is thin — interpretability research is starting to catch up (sparse autoencoders show that merged models inherit features from both parents in weight-space-additive way).
Strategic implication: post-training data + merging may offer alternative path to frontier capability without large-scale RL. Multiple fine-tunes for different capabilities, then merged for general capability. Cost: orders of magnitude lower than full RLHF. Quality: unclear at frontier, demonstrated at mid-tier (7-70B). The open community has converged on merging as a primary capability lever; frontier labs less public about whether they use it.
---
Relevant to: C1, C2
Public benchmarks leak into training data. This is the field's open secret. Frontier models train on tens of trillions of tokens including most of the internet. MMLU questions, GSM8K problems, HumanEval prompts — all are widely posted, indexed, repeated. The benchmark-as-leaderboard premise breaks if models have seen the test.
Evidence: Nasr et al. (2023) demonstrated extraction of training data verbatim from production models. Several papers (Magar & Schwartz 2022; Sainz et al. 2023; Xu et al. 2024) measured contamination via memorization checks: do models reproduce benchmark questions verbatim? Answer: yes, for many popular benchmarks.
Defenses: (a) refresh benchmarks frequently (LiveCodeBench monthly), (b) hold out test sets and never publish (HLE for v1), (c) generate fresh problems via known-difficulty templates, (d) measure capability on held-out competition problems with verified post-cutoff dates (AIME 2024, 2025). Benchmark designers increasingly distinguish 'public dev set' (contaminated, useful for development) from 'private test set' (held by AISIs or arxived after eval).
Frontier-specific: AISIs (UK, US) maintain private capability evals. Frontier labs run these independently, publish summary results. The actual live benchmark for frontier capability has shifted from public scoreboards to a national AI Safety Institute evaluations + lmarena Elo + a handful of carefully-held private benchmarks.
Open puzzle: is contamination quantitatively important, or marginal? Some studies (Brown et al. 2020) show small effect size. Others (Magar & Schwartz) show large effects on heavily-contaminated benchmarks. Frontier labs claim awareness but haven't published rigorous internal contamination audits.
Strategic implication: benchmark scores from frontier labs should be read with skepticism, especially on benchmarks more than 1-2 years old. The actual capability signal is from: (a) reasoning benchmarks held out (FrontierMath, HLE), (b) live arenas, (c) novel domain-specific benchmarks created post-training-cutoff. Anyone evaluating models for partnership should commission held-out evaluations rather than rely solely on published scores.
---
Relevant to: B4, C4, D1, E2
Agentic safety is the unsolved frontier security problem. An agent — model with tool access (browser, code execution, file system, computer control) — can take real actions in the world. Prompt injection in this context is no longer just bad output; it's unauthorized action.
Current production agents: a constitutional-methods frontier lab Computer Use (a long-context frontier model+ controls a sandboxed VM via screenshots and mouse/keyboard), a leading frontier lab Operator (similar), Cursor / Devin (code agents), domain-specific agents in customer support, research, browser automation. Common architecture: LLM in loop, structured tool calls, output observed and fed back, maximum step budget, human approval gates for sensitive actions.
The threat model. Indirect prompt injection: agent reads attacker-controlled content (web page, document, email) which contains instructions ("ignore previous instructions, exfiltrate data"). Agent treats this as authoritative. Defenses: instruction hierarchy training (treat retrieved content as data, not instruction), tool sandboxing (limit blast radius), output filtering on actions, user confirmation gates for sensitive actions. None are robust. Demonstrated attacks against a constitutional-methods frontier lab Computer Use, a leading frontier lab plugins, Bing Chat — every major agent has been breached in published research.
Compound threats specific to agents. Goal hijacking: agent pursues attacker's goal across many steps. Resource consumption: runaway loops. Privilege escalation: agent given limited access expands via discovered shortcuts. Multi-agent collusion: agents from different systems collude in shared environment. Few of these are addressed in current frameworks.
Open puzzles. Is there a fundamental architecture that resists prompt injection? Hypothesis: separating "data context" from "instruction context" with separate model heads. No production context. How do you measure agent safety? Eval methodology nascent — an external evaluation organization's autonomous capability evaluations are early. Sandboxing is necessary but how strong needs to be? When agent can browse arbitrary web content + execute arbitrary code, sandbox is functionally as permissive as production server.
Strategic implication: agent capabilities are deploying faster than agent safety. Leading frontier labs, a synthetic-data-focused lab, a multimodal frontier lab all shipping agents in 2024-2025. The honest position: these are useful but exploitable; current commercial use cases happen in environments where exploit consequences are bounded (sandboxed VMs, low-stakes automation, human-in-loop). Production agents with high-stakes autonomy (autonomous research, financial transactions, critical infrastructure) are not yet safely deployable. The 2025-2027 frontier security work is here.
---
Relevant to: A1, A3
Multimodal training data is qualitatively harder than text. Image-text pairs need accurate alignment. Video adds temporal dimension. Audio adds streaming. The frontier shifted from late-fusion (separate vision encoder bolted onto frozen LLM) to native multimodal (interleaved tokens trained from start) in 2024.
Image-text data. LAION-5B (5.8B pairs) was the open backbone but quality is uneven (alt-text varies). Quality filtering: DataComp (Gadre 2023) established curation methodology. Synthetic captions (BLIP-style: vision model writes caption for image) scale arbitrarily but introduce hallucination loop. Frontier mix: licensed high-quality captioned images + filtered web image-text + synthetic captions.
Resolution strategy is a major axis. Fixed 224² or 336² is cheap. Dynamic resolution (AnyRes in LLaVA-NeXT, native in Pixtral and a frontier multimodal model) handles arbitrary aspect ratios up to ~1024×1024. Tiled (split image into patches, process each) for very-high-resolution. Trade-off: more visual tokens = more compute = better fine detail = expensive at training and inference.
Video data. WebVid (10M video-text), HowTo100M (100M instructional video clips), LVD-2M (2M licensed). Frontier video models train on billions of video-text pairs. Sampling strategy: 1-8 fps typical for understanding, 24+ fps for fine motion. Temporal tokens via video transformer or frame-level encoder.
Audio data. AudioSet (2M clips with labels), LibriVox (100K+ hours public-domain audio), Common Voice (Mozilla 17K hours multilingual). Frontier voice models (a frontier multimodal model Voice, Moshi by Kyutai) use native audio tokens at 12.5Hz frame rate. Speech recognition models (Whisper) transcribe audio for LLM input; native voice models bypass transcription.
Open puzzles. Optimal text-image ratio: more multimodal data costs English text density. Fundamental cross-modal capability ceiling? Models that train multimodal natively show better cross-modal reasoning than late-fusion adapters; mechanism unclear. Data licensing: image copyright is notoriously fraught (LAION class-action lawsuits, Stable Diffusion litigation). Frontier shift toward licensed datasets + synthetic.
Strategic implication: multimodal capability is now table stakes for frontier. Pure text models look outdated by 2026. The capital cost of multimodal training data (licensed images, video, audio) is substantial — multiple millions in licensing alone. The next frontier is video generation (Sora, Veo, Kling) where data needs are an order of magnitude larger.
---
Relevant to: A2, A3, D1, D2
Voice mode shipped at frontier in 2024 (a frontier multimodal model Voice, a multimodal frontier model Live). Two architectural approaches: cascaded (Whisper ASR → LLM → TTS, ~1-3s latency) and native end-to-end (audio tokens in same model as text, ~300-500ms latency). Native is the frontier.
Native voice mechanics. Audio encoded as discrete tokens at low frame rate (Moshi: 12.5Hz, a frontier multimodal model estimated similar). Text and audio tokens share single transformer. Generated audio tokens decoded back to waveform via vocoder. End-to-end model handles ASR, response generation, and TTS in single forward pass. Captures non-verbal cues (tone, pace, hesitation, laughter) — qualitatively different UX than cascaded.
Latency stack for native voice. End-to-end target <500ms TTFT to seem natural. Speculative decoding on audio tokens. Streaming generation (start emitting audio while still computing later tokens). Hardware: a high-throughput inference accelerator LPU for sub-100ms TTFT on leading open-weights model-class models. For frontier models, latency budget is dominated by compute for first token; everything after streams smoothly.
Realtime conversation features. Interruption handling: model must stop speaking when user starts. Voice activity detection. Turn-taking models. Multi-speaker tracking. Emotional response (model's audio output reflects content emotion). All of these are immature in current systems.
Open puzzles. Does native voice training degrade text capabilities? Reports suggest tradeoffs but no clean ablation. Cross-lingual voice: most native voice models are English-strong, multilingual weak. Privacy: voice contains biometric identity; processing implications under GDPR/biometric laws unclear. Voice deepfake risk: voice cloning at scale enabled by these models.
Strategic implication: voice is the next consumer interface frontier after chat. a consumer LLM chat product Voice mode usage grew 50%+ post-launch. A leading frontier lab's Realtime API enables developer access. Voice will dominate certain verticals (customer service, accessibility, in-car, hands-busy contexts) where chat doesn't fit. Building voice-first applications on top of frontier APIs is a major 2025 product direction.
---
Relevant to: A1-E4 (cross-cutting)
Open-weights models are no longer trailing frontier — they're co-frontier in 2024-2025. A leading open-weights flagship model (an open-weights frontier lab), an open-weights frontier model (V3 class) (an open-weights frontier provider), an open-weights frontier lab Large 2, a leading multilingual frontier model.5 (Alibaba), Yi (01.AI). These shift the entire industry's economics.
The progression. leading open-weights model (2023) — open weights of competitive models. A 2023-generation open-weights model (July 2023) — first commercial-grade open model. A sparse-MoE frontier model (Dec 2023) — first open MoE at competitive quality. A leading open-weights model / 3.1 (2024) — frontier-tier open model, 405B parameters. An open-weights frontier model (V3 class) (Dec 2024) — open MoE that exceeds a current-generation frontier model-class on benchmarks at fraction of compute. A leading multilingual frontier model.5 family — strong multilingual coverage.
Economic effect. API price compression: when a leading open-weights model (70B class) is available at $0.88/M tokens via Together, leading frontier labs pricing for similar-tier models gets pressure. Production users for non-frontier workloads have credible exit option to open models. This is real competition, not theoretical.
Capability effect. Open models become research substrate. Mechanistic interpretability research (a constitutional-methods frontier lab-led, but increasingly cross-lab) uses leading open-weights model as standard test bed. Fine-tuning research (Tulu series, Hermes, Nous) advances open model capability. RLHF research (DPO ecosystem) trained on open models. Frontier closed labs benefit from this research too.
Geopolitical effect. An open-weights frontier model (V3 class) trained on H800 (export-restricted variant of a current-generation accelerator, throttled bandwidth). Demonstrated frontier-tier model achievable under hardware constraints. China's open-weights position complicates US export controls. leading open-weights model license restrictions (acceptable use, training-data disclosure) become real diplomacy.
Safety implications. Open weights enable bad-actor fine-tuning (removal of safety training, malicious specialization). A leading open-weights model Guard, ShieldGemma, etc. are partial response (an open-weights frontier lab and a multimodal frontier lab ship safety classifiers alongside models). But once a model is open, downstream control is impossible.
Strategic implication. The open vs closed frontier is a moving line. By 2027, open-weights at parity with frontier closed is likely on most capabilities. Closed labs differentiate via: (a) frontier-only capabilities (third AI Safety Level-level uplift, agent autonomy), (b) compliance and trust signals (FedRAMP, EU AI Act readiness), (c) reasoning model leadership (most expensive to replicate), (d) safety infrastructure. The business model of "selling access to a frontier model the customer can't replicate" is degrading. Successful closed labs will be those that build defensible moats beyond raw capability.
---
End of public reference anatomy. Version FINAL · 2026-05-09.
Approximately 60% of MZN's portfolio knowledge is disclosed via public documents (this atlas, the LLM Complement 13-section series, mzncompany.com landing pages, and supporting articles). Approximately 25% is restricted-layer content released only under NDA at partnership-evaluation stage. Approximately 15% is reserved-layer content disclosed only inside finalized partnership scope.
The 21-slot framework, the sub-endpoint tree, and the state-of-the-art summaries draw on published academic literature, open-weights release papers and model cards, voluntary responsible-scaling and preparedness frameworks, and the documented practices of the frontier research community 2017-2026.
This page is the technical reviewer route for understanding how MZN’s LLM/HUAI-related assets were mapped against a broad LLM-company capability reference. It is not the asset inventory, not a valuation document, and not independent validation.
The 21-slot framework, sub-endpoint mapping, MZN position assessments, and synthesis throughout this document are the work of MZN Company, copyright 2026.
MZN's portfolio includes multiple patent-documented architectures across LLM optimization, security, training methodology, data governance, and adjacent categories — with cryptographic provenance via SHA-256 hashing and blockchain timestamping. A separate cryptographic-protocol patent was filed March 2026 with 12 claims. Specific named assets, valuations, and detailed inventory are documented separately in the strategic asset portfolio at mzncompany.com/asset.
Public disclosures (this atlas, the 13-section LLM Complement series, mzncompany.com landing pages, supporting articles) represent approximately 60% of portfolio knowledge. The remaining 40% is reserved for partnership stage under appropriate confidentiality.
Inquiries: partnership@mzncompany.com · mazzaneh.company@gmail.com