LLM Wiki Pattern¶

Source: Karpathy's gist
Ingested: 2026-04-09

Core Idea¶

Stop re-deriving. Start compiling.

Traditional RAG makes the LLM rediscover connections from raw docs on every query. The LLM Wiki eliminates this: the LLM compiles the sources into a structured markdown knowledge base and keeps it up to date, and queries are answered from the compiled wiki rather than from the raw sources.

"The wiki is a persistent, compounding artifact."

Three-Layer Architecture¶

Raw Sources (immutable, human-curated)
      |
      v  [ingest]
Wiki Pages (LLM-maintained markdown)
      |
      v  [query]
Answers (grounded, with citations)

Layer 1 (Raw Sources): Documents, articles, papers. Never modified by the LLM. Authoritative source of truth.

Layer 2 (Wiki): LLM-generated and LLM-maintained markdown files. Summaries, entity pages, concept pages, comparisons. The human reads; the LLM writes.

Layer 3 (Schema): Config (SCHEMA.md or CLAUDE.md) defining wiki structure, ingestion rules, conventions.

Three Operations¶

Ingest¶

New source added → LLM reads it, extracts key info, integrates into existing pages (may touch 10-15 files), updates index, logs activity.

Query¶

User asks → LLM reads index first, finds relevant pages, synthesizes answer with citations. Optionally saves valuable explorations as new pages.

Lint¶

Periodic health check: contradictions, stale claims, orphaned pages, missing cross-references, conceptual gaps.
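
A minimal sketch of two of the lint checks above (orphaned pages and cross-references to pages that do not exist). It assumes a hypothetical in-memory representation, a dict mapping each page name to the set of page names it links to; the wiki's actual file layout and link syntax are not specified in this page.

```python
def lint_links(pages, roots=("index",)):
    """Return (orphans, broken) for a {page: set_of_linked_pages} mapping."""
    # Collect every page that at least one *other* page links to.
    linked_to = set()
    for source, targets in pages.items():
        linked_to.update(t for t in targets if t != source)
    # Orphans: pages nobody links to, except designated entry points.
    orphans = sorted(p for p in pages if p not in linked_to and p not in roots)
    # Broken links: targets that do not exist as pages.
    broken = sorted(
        (source, t)
        for source, targets in pages.items()
        for t in targets
        if t not in pages
    )
    return orphans, broken


# Hypothetical page graph for illustration only.
pages = {
    "index": {"rag", "llm-wiki"},
    "rag": {"llm-wiki"},
    "llm-wiki": {"rag", "schema"},  # "schema" has no page yet
    "old-notes": set(),             # nothing links here
}
orphans, broken = lint_links(pages)
print(orphans)  # ['old-notes']
print(broken)   # [('llm-wiki', 'schema')]
```

Contradictions, stale claims, and conceptual gaps need the LLM to read page content, so they are outside what a purely structural check like this can cover.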

index.md — content catalog. LLM reads this first during queries to locate relevant pages.

log.md — append-only chronological record of all operations.
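
As a rough illustration of how these two files could be used, the sketch below ranks catalog entries against a question and formats one log line. The keyword-overlap ranking, the `find_pages` and `log_entry` names, and the log line format are all assumptions made for this example; in the pattern itself the LLM reads index.md directly.

```python
from datetime import date


def find_pages(index, question):
    """Rank catalog entries by how many question words appear in their summary."""
    words = set(question.lower().split())
    scored = [
        (len(words & set(summary.lower().split())), page)
        for page, summary in index.items()
    ]
    ranked = sorted(scored, key=lambda item: (-item[0], item[1]))
    return [page for score, page in ranked if score > 0]


def log_entry(operation, detail, day):
    """Format one line for an append-only log (hypothetical format)."""
    return f"{day.isoformat()} {operation}: {detail}"


# Hypothetical catalog: page file -> one-line summary.
index = {
    "rag.md": "retrieval augmented generation over raw chunks",
    "kv-cache.md": "kv cache compression and quantization",
}
print(find_pages(index, "How does KV cache quantization work"))
print(log_entry("ingest", "added kv-cache.md", date(2026, 4, 9)))
```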

Why it beats RAG¶

RAG	LLM Wiki
Re-derives on every query	Compiled once, queried fast
Context window filled with raw chunks	Context filled with synthesized knowledge
No accumulation	Compounds over time
Hard to synthesize across many docs	Cross-references pre-built

v2 Extensions (rohitg00)¶

Memory lifecycle: confidence scoring, supersession, gradual forgetting
Knowledge graph: typed entities (people, projects, libs) + relationships (uses, depends-on, contradicts)
Automation: event-driven hooks, auto-ingest, scheduled consolidation
Consolidation tiers: working memory → episodic → semantic → procedural

Implementation in this Wiki¶

This wiki IS the implementation. Hosted at wiki.mukhayyar.my.id.

To ingest new content: tell Ductor "wiki ingest [URL or content]" — it will create/update pages and update log.md.

Recent Finds¶

Updated: 2026-05-29

OpenDeepThink: Parallel Reasoning via Bradley-Terry Aggregation (arXiv:2605.15177, May 2026)¶

OpenDeepThink scales test-time compute via population-based parallel reasoning: instead of sampling N independent completions and taking a majority vote (best-of-N), it runs multiple reasoning traces through pairwise Bradley-Terry comparisons. The LLM itself judges which answer in each pair is better, producing a tournament-style ranking that surfaces high-quality reasoning paths a simple majority vote would miss. The key insight is that LLM self-evaluation of pairwise comparisons is significantly more reliable than absolute scoring or majority counting, because the model can reason about relative quality more consistently than it can assign calibrated absolute confidence. Empirical result: +405 Codeforces Elo for Gemini 3.1 Pro in 8 sequential rounds of this aggregation loop, a large competitive programming improvement using only inference-time compute, with no fine-tuning. The "open" in the name refers to the open-source release of the framework. Significance for the test-time compute cluster in this wiki: OpenDeepThink fills a gap between single-chain CoT scaling (EDRM routes to it) and full best-of-N sampling. The Bradley-Terry aggregation is more compute-efficient than brute-force N-sampling because it prunes inferior traces early via pairwise tournaments rather than running all N to completion. It is composable with EDRM (which decides whether to reason) and the EqR/attractor cluster (which decides how deep to recurse); OpenDeepThink addresses which reasoning path to select from a population of candidates.
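
To make the aggregation step concrete, here is a small sketch of fitting Bradley-Terry strengths from pairwise outcomes with the standard minorize-maximize update. The win counts are made up, and in the described system each comparison would come from an LLM judgment; this is not the paper's implementation, only the underlying ranking model.

```python
def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from wins[i][j] = times i beat j."""
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Each pair contributes (number of comparisons) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(new_p)
        p = [x / norm for x in new_p]  # keep strengths summing to one
    return p


# Hypothetical outcomes for three candidate reasoning traces.
wins = [
    [0, 3, 4],
    [1, 0, 3],
    [0, 1, 0],
]
strengths = bradley_terry(wins)
ranking = sorted(range(len(wins)), key=lambda i: -strengths[i])
print(ranking)  # [0, 1, 2]
```

A population-based loop could then keep the top-ranked traces and drop the rest before the next round, which is where the compute saving over running all N candidates to completion would come from.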

When Do LLMs Reason? Entropy Phase Transitions Enable Adaptive Inference Routing (arXiv:2605.22873, May 2026)¶

Standard CoT deployment treats all tasks uniformly — chain-of-thought for everything. This paper reveals that LLM reasoning is a dynamic decoding state, not a static task property: tasks that genuinely benefit from CoT show a distinct entropy reduction pattern during early decoding (exploratory → structured phase transition), while factual and open-ended tasks show unstable or increasing entropy — and forcing CoT on those tasks wastes tokens and can degrade accuracy. The paper formalizes this as a dynamical-systems observation: there are phase transitions between reasoning modes that manifest in token-level entropy trajectories during the first few decoding steps. EDRM (Entropy Dynamics-based Reasoning Manifold) operationalizes this into a training-free routing framework: monitor early-decoding entropy trajectory → embed into a compact manifold → route to CoT or direct-answer accordingly. Results across 15 benchmarks and 4 LLM architectures: dataset-level 41–55% token reduction with improved accuracy; instance-level 4.7% accuracy improvement with 27–45% token savings. Significance for the inference efficiency cluster in this wiki: EDRM is the upstream decision layer for the speculative decoding stack (FlexDraft, SlimSpec, PARD-2, etc.) — those papers optimize how to decode given that CoT is needed; EDRM answers the prior question of whether to invoke reasoning at all. The UCCI cascade router (already in this wiki) routes between model sizes; EDRM routes between reasoning modes — they are orthogonal and composable in production serving. Together they represent the two key routing dimensions for cost-efficient LLM inference: model granularity (UCCI) × reasoning depth (EDRM).
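
The routing idea can be illustrated with a much simpler stand-in: compute the entropy of the first few next-token distributions and route on the slope of that trajectory. The slope rule and its threshold are assumptions for illustration; EDRM itself embeds the trajectory into a manifold, which this sketch does not attempt.

```python
import math


def token_entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_slope(entropies):
    """Least-squares slope of entropy versus decoding step."""
    n = len(entropies)
    mean_x = (n - 1) / 2
    mean_y = sum(entropies) / n
    cov = sum((i - mean_x) * (h - mean_y) for i, h in enumerate(entropies))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def route(step_distributions, slope_threshold=-0.05):
    """Return 'cot' if early entropy falls steadily, else 'direct'."""
    entropies = [token_entropy(d) for d in step_distributions]
    return "cot" if entropy_slope(entropies) < slope_threshold else "direct"


# Hypothetical early-decoding distributions over a 4-token vocabulary.
falling = [
    [0.25, 0.25, 0.25, 0.25],
    [0.5, 0.3, 0.1, 0.1],
    [0.8, 0.1, 0.05, 0.05],
    [0.97, 0.01, 0.01, 0.01],
]
flat = [[0.25, 0.25, 0.25, 0.25]] * 4
print(route(falling), route(flat))  # cot direct
```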

Retrieval as Reasoning: Self-Evolving Agent-Native Retrieval via LLM-Wiki (arXiv:2605.25480, May 25, 2026)¶

LLM-Wiki operationalizes the Retrieval-as-Reasoning paradigm: instead of treating external knowledge as a flat, static chunk index (the standard RAG assumption), the system organizes documents into structured Wiki pages with bidirectional links — the same compiling-and-querying architecture this wiki is itself built on, now applied to agent-native retrieval. The agent retrieval loop performs three operations: search (find relevant Wiki pages), read (extract structured information), and link-follow (traverse bidirectional links for multi-hop reasoning). A distinct Error Book mechanism enables the system to self-correct: it logs structural and semantic errors encountered during retrieval operations and refines the Wiki structure iteratively — making the knowledge base self-improving with use rather than static. Results: state-of-the-art on multiple multi-hop reasoning benchmarks, outperforming HippoRAG 2, LightRAG, and GraphRAG by 2.0–8.1 F1 points with strong generalization beyond chain-style reasoning. Significance for the agentic retrieval cluster in this wiki: this paper is the production specification for what Karpathy's original LLM Wiki concept (already at the top of this file) looks like as a fully engineered retrieval system. The bidirectional link structure enables associative reasoning across the knowledge base — the same dynamic that makes Wikipedia more useful than a flat document corpus. Directly complements LatentRAG (arXiv:2605.06285, already in this wiki) which moves retrieval to the latent space; LLM-Wiki keeps retrieval in symbolic space but adds a self-evolving structure layer that LatentRAG's continuous representation cannot express. Together they represent the two frontier directions for next-generation RAG: structure-augmented symbolic retrieval (LLM-Wiki) vs. continuous latent-space retrieval (LatentRAG).
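
A minimal sketch of the link-follow operation, assuming pages are nodes in an undirected graph and multi-hop retrieval is a bounded breadth-first traversal. The page names and the `add_link` / `link_follow` helpers are hypothetical; the paper's search, read, and Error Book components are not modeled.

```python
from collections import deque


def add_link(links, a, b):
    """Record a bidirectional link between pages a and b."""
    links.setdefault(a, set()).add(b)
    links.setdefault(b, set()).add(a)


def link_follow(links, start, max_hops):
    """Breadth-first traversal: {page: hop distance} within max_hops of start."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if dist[page] == max_hops:
            continue  # do not expand beyond the hop budget
        for nxt in sorted(links.get(page, ())):
            if nxt not in dist:
                dist[nxt] = dist[page] + 1
                queue.append(nxt)
    return dist


links = {}
add_link(links, "Marie Curie", "Radium")
add_link(links, "Radium", "Radioactivity")
add_link(links, "Radioactivity", "Henri Becquerel")
print(link_follow(links, "Marie Curie", max_hops=2))
# {'Marie Curie': 0, 'Radium': 1, 'Radioactivity': 2}
```

Because links are stored in both directions, the same traversal works when starting from "Radioactivity" and walking back toward "Marie Curie", which a one-way chunk index would not support.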

VeriCache: Turning Lossy KV Cache into Lossless LLM Inference (arXiv:2605.17613, May 17, 2026 — U. Chicago / Tensormesh / Samsung / Microsoft Research)¶

VeriCache is the first LLM inference framework that guarantees identical outputs to full-KV-cache decoding while preserving the throughput benefits of KV cache compression — resolving the core reliability problem of production KV compression deployment. The core mechanism is speculative verification: (1) the compressed KV cache (any algorithm — token dropping, quantization, etc.) is used as a drafter to generate candidate tokens at high throughput; (2) the full KV cache — kept entirely out of GPU HBM on CPU DRAM or SSD — is streamed in only for verification, confirming or rejecting drafts; (3) rejected tokens trigger a fallback to full-KV decoding for those positions only. The system challenge is minimizing the full-KV swap-in overhead while keeping the verification latency off the critical path. Why lossy KV compression fails in production: token dropping and quantization methods achieve near-lossless perplexity on short benchmarks but exhibit catastrophic output divergence at long outputs — particularly in code generation (where a single wrong token cascades) and tool calling (where structured output correctness is binary). VeriCache eliminates this class of failure entirely. Throughput: up to 4× higher than full-KV decoding — matching the best lossy methods but with zero correctness risk. Significance for the KV-cache stack: VeriCache closes the production deployment gap for the KV-cache optimization stack this wiki has been tracking — RateQuant+FibQuant (bit-optimal quantization) + SpecKV (compression-aware gamma) + KVDrive (multi-tier placement) + FreeKV (speculative retrieval) all remain lossy; VeriCache adds a lossless verification wrapper that makes any of them safe for production code/agent workloads. The speculative verification framing also elegantly reuses the speculative decoding infrastructure (drafter + verifier pattern) already present in most modern serving frameworks (vLLM, SGLang).
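
The three-step mechanism above can be sketched with toy stand-ins for the two decoders. `full_next` and `lossy_next` are hypothetical functions (simple arithmetic, not real models), and verification here runs token by token for clarity, whereas a real system would verify a drafted block in one batched pass. The point of the sketch is only that the verified output matches full-cache greedy decoding exactly.

```python
def full_next(tokens):
    """Stand-in for greedy decoding with the full KV cache."""
    return (sum(tokens) * 31 + len(tokens)) % 50


def lossy_next(tokens):
    """Stand-in for decoding with a compressed cache: usually agrees, sometimes not."""
    guess = full_next(tokens)
    return guess if len(tokens) % 4 else (guess + 1) % 50


def decode_full(prompt, n_new):
    out = list(prompt)
    for _ in range(n_new):
        out.append(full_next(out))
    return out


def decode_verified(prompt, n_new, draft_len=4):
    out = list(prompt)
    target_len = len(prompt) + n_new
    accepted = rejected = 0
    while len(out) < target_len:
        # 1) Draft a block of tokens with the lossy cache.
        draft = []
        for _ in range(min(draft_len, target_len - len(out))):
            draft.append(lossy_next(out + draft))
        # 2) Verify each drafted position against the full cache.
        for tok in draft:
            correct = full_next(out)
            if tok == correct:
                out.append(tok)
                accepted += 1
            else:
                # 3) Fall back to the full-cache token, then redraft from here.
                out.append(correct)
                rejected += 1
                break
    return out, accepted, rejected


prompt = [1, 2, 3]
verified, accepted, rejected = decode_verified(prompt, n_new=20)
print(verified == decode_full(prompt, 20), accepted, rejected)
```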

GRAM: Generative Recursive Reasoning Models (arXiv:2605.19376, May 2026 — KAIST/Mila/NYU, ICLR 2026 Workshop on AI with Recursive Self-Improvement)¶

GRAM transforms recursive neural reasoning from deterministic to probabilistic computation. Baseline Recursive Reasoning Models (RRMs) follow a single deterministic latent trajectory — at each recursion step, the state updates deterministically. GRAM injects stochasticity: at each step, the model samples a transition conditioned on the input and current reasoning state, enabling the model to explore multiple hypotheses in parallel without the fixed computation graph of standard chain-of-thought. Training uses amortized variational inference to learn a latent-variable generative model with iterative latent-state refinement via a shared transition function — the same function is reused at each recursion step (weight-tying), so depth scales without parameter growth. Crucially, GRAM supports both conditional reasoning (given input) and unconditional generation (without input) — bridging reasoning and generation within the same architecture. Inference-time scaling uses two orthogonal axes: recursive depth (more iterations) and parallel breadth (multiple sampled trajectories aggregated). Results outperform deterministic recurrent and recursive baselines on structured reasoning and constraint satisfaction tasks. Significance for the attractor cluster: GRAM is the stochastic counterpart to the Solve the Loop / EqR attractor thread (already in this wiki) — where those models seek a single fixed-point equilibrium, GRAM explicitly represents uncertainty over fixed points via a latent distribution. This positions GRAM as the probabilistic inference upgrade to deterministic fixed-point models: instead of "converge to one answer," it asks "what is the distribution over answers?" — directly relevant for tasks with multiple valid solutions or high epistemic uncertainty. The ICLR 2026 workshop venue (AI with Recursive Self-Improvement) reflects the authors' framing: stochastic recursion is a stepping stone to models that can improve their own reasoning strategies at inference time.

An Interpretable Latency Model for Speculative Decoding in LLM Serving (arXiv:2605.15051, May 14, 2026 — MIT + Red Hat AI)¶

Prior speculative decoding (SD) work reports speedups in isolated, fixed-batch settings — but production LLM serving has dynamic request load, and effective batch size is an emergent property of the serving system, not a direct control variable. This paper fills that gap by developing a simple, interpretable latency model for SD in production serving. The key analytical move: apply Little's Law (from queuing theory) to infer effective batch size from observable request rate, then decompose per-request demand into load-independent (constant-cost) and load-dependent (batch-size-scaling) components separately for each of the three SD stages — prefill, drafting, and verification. The model explains the widely observed but poorly understood phenomenon: SD speedups diminish under higher server load — specifically because the load-dependent verification cost grows faster than drafting overhead under large batches, eroding the acceptance-rate advantage. Validated against extensive vLLM measurements across verifier/drafter model size ratios, prefill lengths, decode lengths, request rates, draft lengths, and acceptance probabilities. Extended to MoE models where sparse activation patterns alter load-dependent costs differently than dense models. Significance for the speculative decoding cluster: the prior six papers in this wiki (SpecKV, Attention Drift, PPOW, PARD-2, SlimSpec, FlexDraft) focus on improving drafter quality, acceptance rate, or memory efficiency. This paper is orthogonal: it provides the production deployment theory — given a target SLO (latency budget), a request arrival rate, and model size ratios, what draft length and acceptance probability target should you configure SD for? The Little's Law framing is pedagogically clean and practically usable: any team deploying SD on vLLM can instrument request rate and use this model to predict whether SD will help or hurt at their operating point.
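
A worked sketch of the Little's Law step. Little's Law states that the average number of requests in the system equals arrival rate times average time in the system, so effective batch size B = rate × latency. The sketch assumes, purely for illustration, a single-stage latency of the form fixed_cost + per_batch_cost × B; the paper fits separate load-independent and load-dependent terms for prefill, drafting, and verification, and its actual functional form is not reproduced here.

```python
def effective_batch(rate, fixed_cost, per_batch_cost):
    """Solve B = rate * (fixed_cost + per_batch_cost * B) for B (Little's Law)."""
    load = rate * per_batch_cost
    if load >= 1.0:
        raise ValueError("no finite batch size: load-dependent cost saturates")
    return rate * fixed_cost / (1.0 - load)


def latency(rate, fixed_cost, per_batch_cost):
    """Per-request latency at the self-consistent batch size."""
    batch = effective_batch(rate, fixed_cost, per_batch_cost)
    return fixed_cost + per_batch_cost * batch


# Illustrative numbers: 0.5 s load-independent cost, 0.1 s per unit of batch.
for rate in (1.0, 4.0, 8.0):
    b = effective_batch(rate, fixed_cost=0.5, per_batch_cost=0.1)
    t = latency(rate, fixed_cost=0.5, per_batch_cost=0.1)
    print(f"rate={rate:.0f}/s  batch={b:.2f}  latency={t:.2f}s")
```

Even in this toy form the batch size grows faster than linearly with request rate, which is the kind of effect that makes fixed-batch speedup numbers hard to transfer to a loaded server.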

Equilibrium Reasoners: Learning Attractors Enables Scalable Reasoning (arXiv:2605.21488, May 20, 2026)¶

Equilibrium Reasoners (EqR) posits that generalizable reasoning arises from learning task-conditioned attractors — latent dynamical systems whose stable fixed points correspond to valid solutions. At inference time EqR scales along two independent axes: depth (running more fixed-point iterations) and breadth (aggregating stochastic trajectories from multiple random initializations). This dual-axis test-time compute allocation is the key architectural difference from prior looped transformer work (Ouro LoopLM, LoopFormer): those fix the loop count at training time and budget it with entropy; EqR makes the fixed-point computation itself the allocator and allows up to 40,000 effective layers. Empirical result: on Sudoku-Extreme, standard feedforward models score 2.6%; unrolling to 40,000 layers pushes EqR to >99%. Authors: Benhao Huang, Zhengyang Geng, Zico Kolter. Significance for the looped-transformer cluster: EqR + Solve the Loop (below, arXiv:2605.12466) form a synchronous pair — both submitted within eight days — establishing attractor-based fixed-point convergence as the new organizing principle for scalable reasoning, directly extending the Ouro LoopLM / LoopFormer thread in this wiki. The depth×breadth scaling axes are the clearest operational definition yet of test-time-compute allocation for reasoning models: "how many iterations" × "how many independent trajectories."
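
The depth and breadth axes can be shown on a toy fixed-point problem. The update below is a hand-written map (Newton's iteration for a square root) whose stable fixed point is the "valid solution"; it stands in for a learned attractor and is not EqR's model. The median aggregation is likewise an assumption for this example.

```python
import random
import statistics


def step(z, x):
    """One update whose stable fixed point z* satisfies z* * z* == x."""
    return 0.5 * (z + x / z)


def solve(x, depth, breadth, seed=0):
    rng = random.Random(seed)
    finals = []
    for _ in range(breadth):          # breadth: independent random starts
        z = rng.uniform(0.1, 10.0)
        for _ in range(depth):        # depth: more fixed-point iterations
            z = step(z, x)
        finals.append(z)
    return statistics.median(finals)  # aggregate the trajectories


shallow = solve(2.0, depth=1, breadth=4)
deep = solve(2.0, depth=20, breadth=4)
print(abs(shallow - 2 ** 0.5), abs(deep - 2 ** 0.5))
```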

Solve the Loop: Attractor Models for Language and Reasoning (arXiv:2605.12466, May 12, 2026)¶

Attractor Models (Jacob Fein-Ashley, Paria Rashidinejad — USC) introduce a backbone-plus-attractor architecture: the backbone proposes output embeddings, then an attractor module iteratively refines them by solving for the fixed point via implicit differentiation. Training memory cost is constant regardless of effective depth; iterations at inference are chosen adaptively by convergence, not by a preset budget. Key results: (1) Pareto improvement over standard Transformers across all sizes — perplexity improves up to 46.6%, downstream accuracy up to 19.7%. A 770M Attractor Model outperforms a 1.3B Transformer trained on twice as many tokens. (2) On structured reasoning (Sudoku-Extreme, Maze-Hard), a 27M model with ~1,000 training examples reaches 91.4% / 93.1% respectively — tasks where Claude and GPT o3 fail completely. (3) Equilibrium internalization: fixed-point training places the model's initial output embedding near equilibrium, so the attractor solver can be removed at inference time with minimal degradation — an emergent curriculum-like compression. Significance: alongside EqR (above), this paper closes the loop on the looped-transformer line from Ouro LoopLM (already in this wiki): the novel claims here are implicit-differentiation training (constant memory) + equilibrium internalization (solver removability) + Pareto improvement over dense transformers as a general-purpose architecture, not a reasoning-only module.

FlexDraft: Batch-Adaptive Speculative Decoding via Attention Tuning and Bonus-Guided Calibration (arXiv:2605.20022, May 19, 2026)¶

Parallel speculative decoding (drafting and verifying multiple tokens in a single forward pass) degrades at larger batch sizes due to a structural mismatch: the bonus token — the first new token generated by the target model when all draft tokens are accepted — is unknown during drafting but influences acceptance probability, creating calibration drift between training and inference. FlexDraft fixes this with three coordinated innovations: (1) Attention Tuning — modifies only attention projectors in final layers on mask-token positions, keeping the autoregressive path frozen to preserve output quality while enabling efficient parallel block drafting; (2) Bonus-Guided Calibration — a lightweight MLP conditioned on the resolved bonus token recalibrates draft logits post-verification, directly closing the training-inference mismatch caused by bonus token uncertainty; (3) Flex Decoding — a batch-adaptive strategy that dynamically switches between parallel draft-and-verify at small batch sizes and sequential approaches at large batch sizes, adjusting verification length based on draft confidence signals. The key architectural insight: the right decoding strategy is not fixed — it depends on batch size and draft confidence, and hardcoding either penalizes one regime. FlexDraft is lossless: output distribution is guaranteed identical to the target model. Significance for the speculative decoding stack in this wiki: FlexDraft addresses the parallel SD batch-scaling failure mode, orthogonal to PPOW (RL-based window adaptation), PARD-2 (CAT-aligned dual-mode drafter), SlimSpec (LM-head compression), SpecKV (compression-aware gamma), and Attention Drift (prompt-attention preservation). 
Together these six papers form the most complete engineering-level coverage of speculative decoding failure modes and mitigations in the May 2026 literature: drift under compression (SpecKV) → prompt attention loss (Attention Drift + PPOW) → training-objective misalignment (PARD-2) → LM-head overhead (SlimSpec) → batch-scale calibration (FlexDraft).

UCCI: Calibrated Uncertainty for Cost-Optimal LLM Cascade Routing (arXiv:2605.18796, May 11, 2026)¶

LLM cascades — routing easy queries to a small model and escalating hard ones to a large model — promise lower inference cost, but deployed routers rely on uncalibrated confidence scores and require per-workload threshold tuning, making them brittle and non-optimal. UCCI takes a calibration-first approach: it maps token-level margin uncertainty to a per-query error probability via isotonic regression, then selects the escalation threshold by constrained cost minimization. The theoretical contribution: under three explicit assumptions, threshold policies on the calibrated score are provably cost-optimal, and isotonic calibration achieves O(n^{−1/3}) sample complexity for expected calibration error (ECE). Empirical evaluation on a production named entity recognition workload of 75,000 queries served by 4B and 12B instruction-tuned LLMs on H100 GPUs: 31% inference cost reduction (95% CI: [27%, 35%]) at micro-F1 = 0.91, and ECE from 0.12 → 0.03. Outperforms entropy thresholding, split-conformal routing, and FrugalGPT-style methods on the same workload. Significance for LLM serving: most production cascades waste money by routing queries that small models could handle cheaply — UCCI provides the first theoretically grounded, calibration-based fix that works on real measured GPU latency rather than simulations. Directly complements inference-side optimizations (SpecKV below, FreeKV, KVDrive) by reducing how often expensive large-model calls are needed in the first place.
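
A compact sketch of the two steps described above: isotonic calibration (via pool-adjacent-violators) from an uncertainty score to an error probability, followed by a threshold choice under an error budget. The data, the budget formulation, and the simplifying assumption that escalated queries are answered correctly are all illustrative; UCCI's exact objective and assumptions are not reproduced here.

```python
def isotonic_fit(scores, errors):
    """Pool-adjacent-violators: non-decreasing error rate as a function of score."""
    blocks = []  # each block: [sum_of_errors, count, max_score_in_block]
    for s, e in sorted(zip(scores, errors)):
        blocks.append([float(e), 1, s])
        # Merge while the previous block's mean exceeds the newest block's mean.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    return [(top, total / count) for total, count, top in blocks]


def calibrated_error(model, score):
    """Step-function lookup: estimated error probability for a score."""
    for top, prob in model:
        if score <= top:
            return prob
    return model[-1][1]


def pick_threshold(model, scores, error_budget):
    """Largest threshold (fewest escalations) whose kept queries meet the budget."""
    probs = [calibrated_error(model, s) for s in scores]
    best = 0.0  # fallback: keep only queries with zero estimated error
    for t in sorted({p for _, p in model}):
        kept = [p for p in probs if p <= t]
        # Expected small-model errors per query, assuming escalations are correct.
        if kept and sum(kept) / len(probs) <= error_budget:
            best = t
    return best


# Hypothetical calibration set: uncertainty score, and 1 if the small model erred.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
errors = [0, 0, 1, 0, 0, 1, 1, 1]
model = isotonic_fit(scores, errors)
threshold = pick_threshold(model, scores, error_budget=0.15)
print(model)
print(threshold)
```

At serving time a query would be escalated to the large model when `calibrated_error(model, score)` exceeds `threshold`.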

SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection (arXiv:2605.02888, May 4, 2026)¶

Standard speculative decoding uses a fixed speculation window length (gamma, typically 4), but optimal gamma varies by task type and — critically — by KV cache compression level: a compressed cache changes draft model confidence distributions, making a fixed gamma doubly suboptimal. SpecKV introduces a lightweight adaptive controller that selects gamma dynamically per speculation step using signals from the draft model itself: confidence and entropy extracted from each draft step feed a small neural network trained to predict the optimal gamma for the current context. Key results: profiled across 4 task categories, 4 speculation lengths, and 3 compression levels; draft model confidence and entropy correlate strongly with acceptance rates (~0.56 correlation); achieves 56% improvement over fixed-gamma=4 baseline with only 0.34 ms overhead per step. Released open-source: profiling data, trained models, and code. Significance for the speculative decoding + KV compression stack in this wiki: SpecKV is compression-aware — it is the connecting piece between the KV quantization layer (RateQuant, FibQuant) and the speculative decoding layer (PARD-2, SlimSpec, PPOW). When KV caches are compressed, the drafter's dynamics change; SpecKV adapts the speculation window to those changes automatically. All four papers (RateQuant + FibQuant + SpecKV + PARD-2) are stackable: optimal bit allocation → better per-entry quantization → compression-aware gamma selection → CAT-aligned drafter training.
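
Why the best gamma moves with acceptance behaviour can be seen from the standard analytic model of speculative decoding: with an independent per-token acceptance rate alpha, a window of gamma yields (1 - alpha^(gamma+1)) / (1 - alpha) tokens per verification step on average. The sketch below uses that textbook model with an assumed relative draft cost and a hypothetical candidate set; SpecKV's learned controller and its confidence and entropy features are not modeled.

```python
def expected_tokens(alpha, gamma):
    """Expected tokens emitted per verify step (i.i.d. acceptance rate alpha)."""
    if alpha >= 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def best_gamma(alpha, draft_cost, candidates=(1, 2, 4, 8)):
    """Gamma maximizing tokens per unit time.

    draft_cost is the cost of one draft step relative to one target pass.
    """
    def speedup(gamma):
        return expected_tokens(alpha, gamma) / (gamma * draft_cost + 1.0)

    return max(candidates, key=speedup)


for alpha in (0.9, 0.5, 0.3):
    print(alpha, best_gamma(alpha, draft_cost=0.05))
# 0.9 8
# 0.5 4
# 0.3 2
```

If cache compression lowers the drafter's acceptance rate, the same calculation shifts toward a shorter window, which is the effect a compression-aware controller is meant to track.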

FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference (arXiv:2505.13109, May 19, 2025)¶

Long-context KV cache retrieval is bottlenecked by two competing constraints: KV dropping achieves efficiency but incurs accuracy loss; KV retrieval preserves accuracy but suffers significant latency from the selection-and-recall pipeline sitting on the critical path. FreeKV resolves this with a training-free algorithm–system co-optimization: (1) Speculative Retrieval — shifts KV selection and recall off the critical path by speculatively reusing KV tuples from the previous decoding step, analogous to speculative decoding's draft-then-verify pattern; (2) Fine-Grained Correction — for the current step, computes cosine similarity between the current query vector and KV candidates to selectively recall pages where the speculative reuse would diverge; (3) Hybrid KV Layout — KV pages are distributed across CPU and GPU memory in a non-fragmented layout that eliminates scatter-gather overhead during cross-device recalls; (4) Double-Buffered Streamed Recall — overlaps the CPU→GPU transfer of the next step's KV pages with the current step's computation, effectively hiding the memory-bandwidth cost of KV retrieval. Results: up to 13× speedup over state-of-the-art KV retrieval methods at near-lossless accuracy across multiple models and long-context benchmarks. Significance for the KV-cache cluster: FreeKV is orthogonal to all quantization entries in this wiki (RateQuant, FibQuant reduce bits-per-token; FreeKV reduces retrieval latency) and to placement/tiering approaches (KVDrive, already in this wiki, manages cross-tier placement). All four are stackable in production: quantize with RateQuant+FibQuant → place and tier with KVDrive → retrieve efficiently with FreeKV. This closes the primary remaining efficiency gap in the retrieval layer of long-context LLM serving.
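
A simplified sketch of the interplay between speculative reuse and correction. It assumes a drift test that compares the current query with the previous step's query and re-selects pages only when their cosine similarity drops below a threshold `tau`; that test, the threshold value, and the dot-product page scoring are assumptions for illustration rather than FreeKV's exact procedure, and the memory layout and streamed recall are not modeled.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_pages(query, page_keys, k):
    """Exact selection: indices of the k pages whose key best matches the query."""
    scores = [sum(q * x for q, x in zip(query, key)) for key in page_keys]
    return sorted(range(len(page_keys)), key=lambda i: -scores[i])[:k]


def select_pages(query, prev_query, prev_pages, page_keys, k, tau=0.9):
    """Reuse the previous step's pages unless the query has drifted too far.

    Returns (pages, recalled) where recalled says whether a fresh recall ran.
    """
    if prev_pages is not None and cosine(query, prev_query) >= tau:
        return prev_pages, False                   # speculative reuse
    return top_pages(query, page_keys, k), True    # correction: re-select


# Hypothetical page summary keys and three consecutive decode-step queries.
page_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
q1, q2, q3 = [1.0, 0.0], [0.99, 0.1], [0.0, 1.0]
step1 = select_pages(q1, None, None, page_keys, k=1)
step2 = select_pages(q2, q1, step1[0], page_keys, k=1)
step3 = select_pages(q3, q2, step2[0], page_keys, k=1)
print(step1, step2, step3)  # ([0], True) ([0], False) ([1], True)
```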

KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference (arXiv:2605.18071, May 18, 2026)¶

The central challenge of long-context LLM inference is KV cache memory pressure — the cache grows linearly with sequence length and quickly exceeds GPU capacity. KVDrive attacks this at the systems level, managing KV cache across three memory tiers: GPU HBM → host DRAM → SSD. Unlike algorithmic approaches that pursue greater sparsity or compression, KVDrive takes a systems-design perspective: three coordinated innovations — (1) adaptive cache management that tailors placement to per-head attention patterns, reducing unnecessary cross-tier data movement; (2) restructured decoding pipeline that overlaps I/O-bound and compute-bound stages to eliminate stalls across heterogeneous resources; (3) cross-tier coordination that harmonizes data movement between all three memory levels for scalable inference far beyond GPU and DRAM limits. Results: up to 1.74× higher throughput vs. state-of-the-art while preserving accuracy on long-context benchmarks. Authors: Jian Lin, Jiazhi Mi, Zicong Hong, et al. Significance for the KV-cache cluster: KVDrive is orthogonal to the quantization approaches already in this wiki (RateQuant's bit-allocation, FibQuant's geometric vector codes) — those reduce how much data must be stored per token; KVDrive determines where and when each piece of cache is stored across the memory hierarchy. In production serving, all three are stackable: quantize with RateQuant+FibQuant, then orchestrate placement and I/O with KVDrive. For serving engineers: the 1.74× throughput claim at matched accuracy is the strongest systems-level long-context inference result in the May 2026 literature.

FibQuant: Universal Vector Quantization for Random-Access KV-Cache Compression (arXiv:2605.11478, May 12, 2026)¶

Problem: Scalar quantization of KV caches — the dominant approach in production LLM serving — operates element-wise on normalized, rotated values, leaving geometric structure on the table. FibQuant (Namyoon Lee, Yongjune Kim) applies vector quantization to the normalized-rotated KV representation, using Fibonacci/Roberts-Kronecker quasi-uniform directions as codebook entries and Beta-quantile radii for cell shaping. The core theoretical result: the authors prove their vector code strictly improves on its scalar product specialization at matched rate — the gain is decomposable into a cell-shaping factor (geometric packing efficiency) and a density-matching factor (matching the distribution of normalized data). Critically, FibQuant keeps the same normalize–rotate–store interface as existing scalar quantizers, making it a drop-in replacement with no inference pipeline changes. The construction enables fractional-bit and sub-one-bit operating points without calibration or variable-length addresses — a significant practical advantage over codebook-trained VQ methods that require offline calibration per model. Empirical results on TinyLlama-1.1B: within 0.10 perplexity of full precision at 4× compression; at 8× compression, 3.6× lower perplexity than scalar TurboQuant. Significance for the KV-cache cluster in this wiki: FibQuant is orthogonal to RateQuant (arXiv:2605.06675, already in this wiki) — RateQuant uses information-theoretic bit-allocation to choose which bits to assign to which cache entries, while FibQuant improves how each entry is quantized using geometric vector codes. They are stackable: optimal bit allocation (RateQuant) + better per-entry quantization (FibQuant). Together with SpecKV (arXiv:2605.02888, which makes the spec-decoding window length compression-aware), this represents three orthogonal KV-cache efficiency improvements all emerging from the same May 2026 window.

Deep Reasoning in General Purpose Agents via Structured Meta-Cognition (arXiv:2605.11388, May 12, 2026)¶

Current LLM agents use rigid, pre-determined scaffolding: a fixed sequence of tool calls, chain-of-thought steps, or plan-then-execute phases that cannot adapt when task requirements shift mid-execution. Deep Reasoning (Light, Theologitis, Ghate, Li, Newman, Shah, Caliskan, Koh, Suciu, Tsvetkov — University of Washington) addresses this with Structured Meta-Cognition: at inference time, the system performs meta-reasoning to construct a task-specific scaffold rather than assuming a fixed one. The meta-reasoning operates over a formal language that can express associative inference (retrieval-based), formal computation (symbolic/code-based), and recursive subproblem solving — choosing and composing these reasoning thread types based on the problem structure detected at runtime. The resulting agent, DOLORES, distributes complex tasks across more controlled reasoning threads, reducing hallucinations (no single long chain accumulates and amplifies errors) and premature termination (stuck reasoning is re-routed rather than abandoned). Results: 24.8% average improvement over baseline scaffolding methods across multi-hop reasoning, long-chain QA, context aggregation, and research-style information seeking. The efficiency finding is notable: an 8B DOLORES model surpassed 32B baselines in over half tested scenarios — the scaffold construction overhead is offset by the quality gain from better-matched reasoning structure. Architectural significance: this is a runtime generalization of the static scaffolding debate (ReAct vs. LATS vs. chain-of-thought). Instead of choosing a scaffold at design time, Deep Reasoning makes scaffold selection a first-class inference-time computation. Directly complements BALAR (arXiv:2605.05386, already in this wiki) which applies Bayesian active reasoning to what to ask; Deep Reasoning addresses how to reason once the task is known. 
Together they bracket the two halves of the agentic loop: belief formation under uncertainty (BALAR) and task execution structure selection (Deep Reasoning / DOLORES).

SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding (arXiv:2605.10453, May 11, 2026)¶

Problem: In modern speculative decoding architectures, the drafter network is small — but its LM-head still projects to the full vocabulary, making it a disproportionate compute bottleneck. SlimSpec addresses this with a low-rank parameterization of the drafter's LM-head that compresses the inner representation rather than truncating the output vocabulary — preserving full vocabulary coverage while dramatically reducing per-head compute. The key design choice: compressing the projection matrix dimensions rather than vocabulary size avoids the quality degradation and output-space artifacts associated with vocabulary pruning. The method requires minimal training and inference pipeline changes — it is a drop-in replacement for the standard LM-head within EAGLE-3-style drafter architectures. Performance on EAGLE-3 across three target models: 4–5× acceleration over the standard LM-head, surpassing competing LM-head reduction methods by 8–9% end-to-end speedup while maintaining competitive acceptance length. Authors: Anton Plaksin, Sergei Krutikov, Sergei Skvortsov, Alexander Samarin. Significance for the speculative decoding cluster in this wiki: SlimSpec addresses the architecture layer (LM-head compute), orthogonal to the training objective improvements in PPOW (arXiv:2605.14978) and PARD-2 (arXiv:2605.08632, see below). Together these three papers represent a maturing engineering stack for efficient speculative decoding: PPOW teaches the drafter what to predict (adaptive window training), PARD-2 optimizes how training objectives align with acceptance, and SlimSpec reduces the cost of each forward pass. All three are compatible and stackable on EAGLE-3-class architectures.
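
The compute argument is simple arithmetic. Assuming the low-rank head factors the hidden-by-vocab projection into a hidden-by-rank matrix followed by a rank-by-vocab matrix (one reading of "low-rank parameterization"), the per-token multiply-accumulate count drops as shown. The sizes below are illustrative and are not taken from the paper.

```python
def lm_head_cost(hidden, vocab, rank=None):
    """Multiply-accumulates per token for a dense or rank-r factored LM-head."""
    if rank is None:
        return hidden * vocab                # h @ W, with W of shape hidden x vocab
    return hidden * rank + rank * vocab      # (h @ A) @ B, A: hidden x rank, B: rank x vocab


hidden, vocab, rank = 4096, 128_000, 512     # illustrative sizes only
dense = lm_head_cost(hidden, vocab)
low_rank = lm_head_cost(hidden, vocab, rank)
print(dense, low_rank, round(dense / low_rank, 2))
```

The output still has one logit per vocabulary entry in both cases, which is the difference from vocabulary pruning: only the inner dimension shrinks.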

PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding (arXiv:2605.08632, May 9, 2026)¶

Targets a fundamental misalignment in how speculative decoding drafters are trained: existing objectives minimize per-token prediction error, but inference performance depends on consecutive prefix acceptance — the length of the uninterrupted accepted token run before the first rejection. PARD-2 reformulates the training objective around maximizing acceptance length rather than token accuracy via Confidence-Adaptive Token (CAT) optimization: each token's contribution to the training loss is adaptively reweighted according to the model's own confidence, upweighting high-uncertainty tokens where rejection risk is concentrated and downweighting trivially-correct tokens that contribute negligible acceptance length. The architecture also enables a dual-mode operation — a single PARD-2 draft model supports both target-dependent mode (drafter conditions on target model hidden states, higher accuracy, requires shared inference context) and target-independent mode (standalone drafter, lower latency, no target model access required). This is significant for deployment: operators can switch between modes based on serving constraints without retraining. Performance on Llama3.1-8B: up to 6.94× lossless speedup — 1.9× over EAGLE-3 and 1.3× over PARD. Authors: Zihao An, Taichi Liu, Ziqiong Liu, Dong Li, Ruofeng Liu, Emad Barsoum. Context: PARD-2 completes the PARD lineage: PARD-1 (arXiv:2504.18583) introduced parallel draft adapters; PARD-2 adds CAT alignment and dual-mode operation. Combined with SlimSpec's LM-head compression and PPOW's adaptive window training, these three papers form the current frontier of efficient speculative decoding improvements on EAGLE-3-class architectures. For serving engineers: PARD-2's 6.94× claim with dual-mode flexibility is the highest reported throughput gain in this generation of spec-decoding papers.
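The gap between token accuracy and consecutive prefix acceptance can be made concrete with a toy example. The sketch assumes greedy verification (a draft token is accepted only if it exactly matches the target's token); real speculative sampling uses a probabilistic acceptance test, and the function names are hypothetical.

```python
def accepted_prefix_length(draft_tokens, target_tokens):
    """Number of leading draft tokens accepted before the first mismatch."""
    accepted = 0
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            break
        accepted += 1
    return accepted

def token_accuracy(draft_tokens, target_tokens):
    """Fraction of positions where the draft matches the target, ignoring order effects."""
    matches = sum(d == t for d, t in zip(draft_tokens, target_tokens))
    return matches / len(draft_tokens)

target = [5, 9, 2, 7, 3]
late_miss = [5, 9, 2, 7, 1]   # wrong only at the last position
early_miss = [4, 9, 2, 7, 3]  # wrong only at the first position

for name, draft in [("late_miss", late_miss), ("early_miss", early_miss)]:
    print(name, token_accuracy(draft, target), accepted_prefix_length(draft, target))
```

Both drafts have the same per-token accuracy, but the early mismatch yields no accepted tokens at all. This is the misalignment an acceptance-length objective is meant to address.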

Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning (arXiv:2605.06241, May 7, 2026)¶

A conceptual reframing of what reinforcement learning actually does when applied to LLM reasoning. Core claim: RL does not teach models new problem-solving strategies or expand their capability frontier — instead, it performs sparse policy selection, redistributing probability mass over solutions the base model already contains. The mechanism: through token-level analysis across multiple reasoning tasks and model families, the authors find that RL's beneficial effect is concentrated at high-entropy decision points — only 1–3% of all token positions are materially affected by RL training. At these positions (branch points where multiple plausible continuations exist), RL increases the probability of correct-path tokens; everywhere else the distribution is essentially unchanged. The implication: models that "improve" via RL were already capable of generating correct solutions — the RL loop is selecting for those paths at critical junctions, not teaching the model to reason differently. Why this matters: it reframes the question from "can RL teach LLMs to reason?" to "can RL reliably identify and reinforce correct decision-path prefixes in models that already know the answer?" This has concrete consequences for training-stack design: (1) RL is most valuable for models with a rich, high-quality pre-training distribution over the target reasoning domain; (2) data quality at high-entropy positions is more important than RL hyperparameter tuning; (3) models that genuinely cannot generate correct solutions without prompting will not be fixed by RL alone. Directly complements the Alignment Collapse (FPO) thread: both papers reveal failure modes in how the RL loop interacts with the policy's underlying distribution — FPO addresses collapse in iterative RM retraining; this paper addresses over-crediting RL with capability gains that were latent in the base model.

PPOW: Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing (arXiv:2605.14978, May 14, 2026)¶

Direct architectural response to the attention drift failure mode (arXiv:2605.09992, see the entry below). PPOW (Performance-Driven Policy Optimization with Adaptive Windowing) is a reinforcement learning framework for training speculative decoding drafters that addresses a core tension: speculative utility is window-level and prefix-sensitive, yet existing learning-based drafters are still optimized with token-level objectives. Under those objectives, a drafter maximizes individual token acceptance probability without accounting for how an early mismatch truncates the entire accepted prefix. PPOW introduces three components: (1) Cost-Aware Speedup Reward, which replaces token-level accuracy with a reward function that directly measures wall-clock speedup given the prefix acceptance pattern; (2) Distribution-Based Proximity Reward, which keeps the drafter's distribution close to the target model's distribution, maintaining fidelity while maximizing speedup; (3) Adaptive Divergence-Aware Windowing, the key structural innovation: instead of a fixed speculation window length, the window length is selected dynamically based on measured attention divergence between the drafter and target model at each position, directly operationalizing the attention drift insight. Results across multiple model families and benchmarks: average accepted length 6.29–6.52 tokens (vs. 3–4 for standard greedy drafters), speedup 3.39–4.36× at matched quality. Significance: PPOW closes the loop from the attention drift paper. Drift explains why longer windows degrade; PPOW uses the divergence signal to choose windows that avoid high-drift positions rather than truncating fixed-length chains. Together these two papers (arXiv:2605.09992 + arXiv:2605.14978) form the first mechanistic + corrective pair for speculative decoding failure modes.
For serving engineers: PPOW's adaptive windowing is the most actionable near-term fix — retraining the drafter with PPOW replaces the static γ hyperparameter with a per-position divergence-gated decision.
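A divergence-gated window can be sketched as follows. This is a simplified illustration, not PPOW's learned policy: it assumes a per-position divergence score is already available and stops drafting at the first position that exceeds a threshold. The function names, the threshold rule, and the use of KL divergence as the score are all assumptions.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions, used here as a stand-in divergence score."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def choose_window(divergences, threshold, max_window):
    """Draft until the divergence first exceeds the threshold, capped at max_window.

    Always returns at least 1 so the drafter proposes something each step.
    """
    window = 0
    for score in divergences[:max_window]:
        if score > threshold:
            break
        window += 1
    return max(window, 1)

# Hypothetical per-position divergence scores for one speculation chain.
scores = [0.01, 0.02, 0.50, 0.01]
print(choose_window(scores, threshold=0.1, max_window=8))
```

Compared with a static window length, the chain is cut where drafter and target start to disagree rather than at a fixed position.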

Attention Drift: What Autoregressive Speculative Decoding Models Learn (arXiv:2605.09992, May 11, 2026)¶

Northwestern University / GE Aerospace / University of Waterloo / fal paper identifying a previously unreported failure mode in speculative decoding drafters. Core finding (Attention Drift): as the drafter generates successive tokens within a speculation chain, attention progressively shifts away from the original prompt toward its own recently generated tokens. This drift is observed across both EAGLE-3 drafters and MTP (Multi-Token Prediction) heads, suggesting it is a property of drafter architectures generally rather than of any specific model. Mechanistically: the drafter's autoregressive generation inside the speculation window causes its attention heads to increasingly weigh its own recent output over the prompt context, a recency bias that compounds as the speculation chain lengthens. Why this matters for speculative decoding deployments: attention drift explains two empirically observed behaviors: (1) speculative decoding accuracy degrades non-uniformly across chain length (later tokens in longer chains are less faithful), and (2) drafters fail disproportionately under template perturbation and long-context inputs (because prompt attention weight is already low by the time the drafter reaches the critical decision points). Practical implication for serving engineers: chain length selection in speculative decoding is not just a throughput/accuracy tradeoff; it is also an attention drift tradeoff. Shorter speculation windows preserve more prompt attention fidelity. Connects to the DiP-SD paper (arXiv:2604.20919, already in this wiki), which addresses distributed speculative decoding for edge inference; attention drift is an orthogonal failure mode that applies to centralized and distributed speculative decoding equally. Code available at github.com/Dogacel/Attention-Drift.
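One simple way to quantify drift is to track how much attention mass each drafted step places on prompt positions. The sketch below uses a synthetic attention matrix; the metric name and the toy numbers are assumptions for illustration and are not taken from the paper or its repository.

```python
import numpy as np

def prompt_attention_mass(attention, prompt_len):
    """Fraction of each step's attention that lands on the first prompt_len positions.

    attention: array of shape (steps, seq_len); each row is one drafted step's
    attention weights and is assumed to sum to 1.
    """
    attention = np.asarray(attention, dtype=float)
    return attention[:, :prompt_len].sum(axis=1)

# Synthetic example: 3 prompt tokens followed by up to 3 drafted tokens.
toy_attention = [
    [0.3, 0.3, 0.3, 0.1, 0.0, 0.0],  # draft step 1
    [0.2, 0.2, 0.2, 0.2, 0.2, 0.0],  # draft step 2
    [0.1, 0.1, 0.1, 0.2, 0.2, 0.3],  # draft step 3
]
mass = prompt_attention_mass(toy_attention, prompt_len=3)
print(mass)
```

A monotonically falling curve of this kind is the pattern the entry describes: later steps in the chain attend less to the prompt and more to the drafter's own output.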

Explaining and Preventing Alignment Collapse in Iterative RLHF (arXiv:2605.04266, May 5, 2026)¶

Paper by Etienne Gauthier, Francis Bach, and Michael I. Jordan addressing a failure mode that emerges when RLHF is deployed iteratively — where the policy generates data on which the reward model (RM) is retrained in subsequent rounds. Standard iterative RLHF drops what the authors call the "steering term": the policy's gradient-level influence on future RM parameters. Without it, the policy learns to exploit RM blind spots, producing high-reward but low-quality outputs that then reinforce the very RM errors being exploited — a self-reinforcing degradation loop they term alignment collapse. The formal framing uses a Stackelberg game between the policy (leader) and RM (follower), from which an analytical gradient decomposition reveals two terms: a standard policy gradient and the parameter-steering term capturing how the policy's current outputs shift future RM weights. The proposed fix, Foresighted Policy Optimization (FPO), restores the steering term as an explicit regularizer — a mechanism-design intervention that breaks the exploitation loop. This matters because iterative RLHF is increasingly how deployed LLMs are maintained post-training: user feedback → RM update → policy update → repeat. The alignment collapse failure mode is not visible in static RLHF benchmarks (which assume a fixed RM) but becomes acute in production deployment where the RM evolves with use. FPO is a directly actionable intervention for teams running online iterative RLHF pipelines. The Bach + Jordan authorship signals this will receive significant follow-up in the RLHF theory literature.

Federation of Experts (FoE): Communication-Efficient Distributed Inference for MoE LLMs (arXiv:2605.06206, May 7, 2026)¶

Stanford paper (Muhammad Shahir Abdurrahman, Chun Deng, Azalia Mirhoseini, Philip Levis) addressing the dominant bottleneck in distributed MoE inference: all-to-all communication of token embeddings between expert shards across GPUs. FoE restructures the standard MoE block by splitting the expert pool into multiple MoE clusters, where each cluster is responsible for exactly one KV head and expert parallelism is applied within that cluster. The key architectural insight: by aligning cluster boundaries with KV heads, all-to-all communication is either eliminated (single-node: all experts in a cluster fit on one GPU) or confined strictly to intra-node fabric (multi-node: all-to-all never crosses node boundaries). Between clusters, a simple sum synchronizes post-attention residuals — the only inter-cluster communication, and far cheaper than all-to-all dispatch. This is qualitatively different from existing expert parallelism schemes (EP, TP+EP) which all require cross-node all-to-all as the bottleneck. Relevance: as frontier models scale to 1T+ parameter MoE deployments (DeepSeek V4, Grok 4.3) and inference infrastructure moves toward disaggregated prefill-decode with many GPU nodes, communication overhead is the primary scaling wall. FoE directly addresses this without changing model architecture — it is a drop-in restructuring of how MoE blocks are parallelized across devices.

Transformers Provably Implement In-Context RL with Policy Improvement (arXiv:2605.05755, May 7, 2026)¶

UC Davis (Haodong Liang, Lifeng Lai) paper establishing the first formal theoretical grounding for in-context reinforcement learning (ICRL) in transformers. The core result: a single linear self-attention block can provably implement policy-improvement RL algorithms, specifically semi-gradient SARSA and actor-critic, via explicit parameter constructions. This is an existence proof: the transformer weight matrices are constructed analytically to encode the RL update rule, confirming that the in-context RL behavior observed empirically in large transformers is not accidental but a structural consequence of the self-attention mechanism. Beyond existence, the paper: (1) designs a teacher-mimicking training procedure that finds these parameters through gradient flow; (2) establishes the first convergence guarantee in the ICRL literature: under suitable MDP distribution richness conditions, gradient flow converges locally and exponentially to the optimal parameter manifold. Empirically validated on randomly generated tabular MDPs: trained models recover the analytically derived parameter structure and deliver strong in-context control on unseen MDPs at test time. Architectural significance: this connects two threads that have been separate in the literature, the attention-as-gradient-descent theory (Akyürek et al., von Oswald et al.) and the ICRL empirical literature. The convergence guarantee is the key contribution: it transforms ICRL from an intriguing empirical observation into a theoretically grounded capability with identifiable conditions for success. For agent design: it explains why pre-trained LLMs can solve novel RL tasks in-context without fine-tuning, since the mechanism is already structurally present in self-attention.
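For reference, the update rule that the constructed attention block is said to implement can be written in a few lines. This is the textbook semi-gradient SARSA step with linear function approximation, not the paper's transformer construction; the feature vectors and step sizes are toy assumptions.

```python
import numpy as np

def sarsa_update(w, phi, reward, phi_next, alpha, gamma):
    """One semi-gradient SARSA step for a linear action-value estimate q(s, a) = w . phi(s, a).

    phi and phi_next are the feature vectors of the current and next
    state-action pairs; the gradient is taken only through q(s, a).
    """
    td_error = reward + gamma * (w @ phi_next) - (w @ phi)
    return w + alpha * td_error * phi

w0 = np.zeros(2)
phi_sa = np.array([1.0, 0.0])      # features of (s, a)
phi_next_sa = np.array([0.0, 1.0])  # features of (s', a')

w1 = sarsa_update(w0, phi_sa, reward=1.0, phi_next=phi_next_sa, alpha=0.5, gamma=0.9)
w2 = sarsa_update(w1, phi_sa, reward=1.0, phi_next=phi_next_sa, alpha=0.5, gamma=0.9)
print(w1, w2)
```

The update is a weighted sum over observed transitions, which gives some intuition for why a linear attention layer over in-context transitions can express it.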

RateQuant: Optimal Mixed-Precision KV Cache Quantization via Rate-Distortion Theory (arXiv:2605.06675, May 8, 2026)¶

Addresses the KV cache memory bottleneck in LLM serving: cache grows linearly with sequence length, becoming the dominant memory constraint for long-context generation. RateQuant frames the mixed-precision bit-allocation problem using classical rate-distortion theory — it fits a per-quantizer distortion model from a small calibration set, then solves the bit-allocation problem in closed form via reverse waterfilling (the dual of channel coding's water-filling). This is mathematically optimal under the per-quantizer distortion model, compared to heuristic bit-width selection used in prior work. On Qwen3-8B at 2.5 average bits, RateQuant reduces KIVI's perplexity from 49.3 to 14.9 (70% perplexity reduction) and improves QuaRot by 6.6 PPL. Critical engineering note: calibration takes 1.6 seconds on a single GPU and adds zero overhead at inference time — the optimal bit-allocation is computed once offline, then the quantized cache is used directly. Compared to existing mixed-precision KV quantization (KIVI, QuaRot, PM-KVQ), RateQuant's information-theoretic derivation provides a performance guarantee under the distortion model rather than relying on empirically tuned hyperparameters. For production LLM serving: immediately applicable to any serving stack that already supports mixed-precision KV caches — drop-in replacement for the bit-width scheduler with a large perplexity improvement at no inference cost. Directly complements AsyncTLS (arXiv:2604.07815, already in this wiki) which reduces KV transfer latency; RateQuant reduces KV memory pressure via better compression — two orthogonal bottlenecks in long-context serving.
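Reverse water-filling itself is a classical result and can be sketched independently of RateQuant. The example below assumes independent Gaussian components with known variances and a squared-error distortion; RateQuant fits its own per-quantizer distortion model, which is not reproduced here. Function names and the bisection search are illustrative choices.

```python
import numpy as np

def rates_at_level(variances, level):
    """Bits per component at a given water level: max(0, 0.5 * log2(var / level))."""
    return np.maximum(0.0, 0.5 * np.log2(variances / level))

def reverse_waterfill(variances, total_bits, iters=100):
    """Find the water level whose total rate matches the bit budget.

    Components with variance below the level get zero bits; the rest share
    the budget so that each has distortion equal to the level.
    """
    variances = np.asarray(variances, dtype=float)
    low, high = 1e-12, float(variances.max())  # total rate is 0 at `high`
    for _ in range(iters):
        mid = np.sqrt(low * high)  # bisect in log space
        if rates_at_level(variances, mid).sum() > total_bits:
            low = mid   # over budget: raise the water level
        else:
            high = mid  # within budget: try a lower level
    return rates_at_level(variances, high), high

rates, level = reverse_waterfill([16.0, 4.0, 1.0], total_bits=3.0)
print(rates, level)
```

High-variance components receive more bits and low-variance ones may receive none, which is the intuition behind mixed-precision allocation. The resulting rates are fractional, so a practical scheme still has to map them onto the integer bit-widths a kernel supports.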

Grok 4.3: Improved Agentic Performance and 40% Price Cut (xAI, May 6–8, 2026)¶

xAI released Grok 4.3 around May 6–8, 2026 with an aggressive pricing revision: $1.25/$2.50 per MTok input/output, roughly 40% below Grok 4.20 pricing. Capabilities: 1 million token context window, always-on reasoning mode, function calling, structured outputs, and prompt caching. Benchmark placement: 53 on the Artificial Analysis Intelligence Index (vs. a median of 35 for the comparable pricing tier), 79.3% on CaseLaw v2 (current #1), #1 on CorpFin, 98% on τ²-Bench Telecom, 81% on IFBench. The largest single improvement is on GDPval-AA (Elo 1500, +321 points from 4.20), described here as a legal and financial reasoning benchmark and attributed to the always-on reasoning design. The "always-on reasoning" framing is significant: unlike models that require explicit "thinking mode" activation, Grok 4.3 applies its reasoning capacity continuously. For practitioners: Grok 4.3 is particularly differentiated on legal/financial document reasoning relative to its price point. With the #1 CaseLaw v2 ranking at $1.25/MTok input, it is the most cost-competitive frontier-class legal reasoning option as of May 2026. The aggressive pricing follows Anthropic's Claude Opus 4.7 (87.6% SWE-bench Verified at launch) establishing a higher capability ceiling: xAI is competing on price-per-performance rather than raw frontier capability.

GPT-5.5 Instant: New Default ChatGPT Model with 52.5% Fewer Hallucinations (OpenAI, May 5, 2026)¶

OpenAI released GPT-5.5 Instant on May 5, 2026, replacing GPT-5.3 Instant as the default ChatGPT model and the chat-latest API alias. The key claims: 52.5% fewer hallucinated claims than GPT-5.3 Instant on high-stakes prompts (medicine, law, finance), 37.3% fewer inaccurate claims on user-flagged problematic conversations, 30.2% fewer output words, and 29.2% fewer lines, a substantial simultaneous improvement in accuracy and conciseness. New capability: memory sources, meaning personalized answers using past conversations, files, and Gmail (Plus/Pro on web; mobile rollout pending). The model is also more capable at image analysis and STEM questions, and knows when to invoke web search automatically. For the LLM-serving landscape: GPT-5.5 Instant represents OpenAI's attempt to close the quality gap between the frontier (GPT-5.5 Pro, GPT-5.4) and the default model used by most ChatGPT users. The hallucination reduction claim, if externally validated, would be the largest single-generation improvement in factual accuracy reported by a major lab. Architectural details are not disclosed. The release of GPT-5.5 Pro (April 23) followed about two weeks later by the Instant variant fits OpenAI's established pattern of releasing a top reasoning tier first, then propagating reasoning improvements downward to production defaults. Note: Grok 4.3 and GPT-5.5 Instant shipped in the same week. The frontier capability tier (Opus 4.7, Gemini 3.1 Pro, GPT-5.4) remains ahead, but the default-tier improvements are closing the gap.

Domain-Level Metacognitive Monitoring in 33 Frontier LLMs (May 2026)¶

A large-scale study administered 1,500 MMLU items (250 per domain, under a six-domain grouping) to 33 frontier LLMs from eight model families and computed Type-2 AUROC per model-domain cell using verbalized confidence (0–100) — measuring how well each model discriminates its own correct from incorrect answers by domain. Key findings: Applied/Professional knowledge is the easiest domain to monitor (mean AUROC = 0.742, ranked top-2 in 21 of 33 models); Formal Reasoning and Natural Science are reliably the hardest (one of the two ranked bottom-2 in 27 of 33 models). Domain effects are consistent across model families — this is a domain-structured limitation of the underlying training, not a model-specific defect. Practical implication: LLMs are best calibrated in domains with rich, standardized professional-knowledge corpora (medicine, law, business) and worst calibrated in formal reasoning and scientific domains where correctness requires constructive verification rather than pattern recognition. For agent design: routing high-confidence answers differently by domain is warranted — confidence signals in formal reasoning domains (math, logic, formal proofs) are substantially less reliable than in applied/professional domains. Complements the CAPO paper (arXiv:2604.12632, already in this wiki) which addresses calibration via RL training; this paper maps the current calibration landscape across models and domains.
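Type-2 AUROC has a direct pairwise interpretation: the probability that a randomly chosen correct answer received higher confidence than a randomly chosen incorrect one, with ties counted as half. A minimal sketch follows; the function name and toy data are assumptions, and the study's exact tie handling is not stated in this entry.

```python
def type2_auroc(confidences, correct):
    """AUROC of confidence as a predictor of correctness (pairwise formulation).

    confidences: verbalized confidence per item (e.g. 0 to 100).
    correct: booleans, True where the model's answer was right.
    """
    right = [c for c, ok in zip(confidences, correct) if ok]
    wrong = [c for c, ok in zip(confidences, correct) if not ok]
    if not right or not wrong:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for r in right:
        for w in wrong:
            if r > w:
                wins += 1.0
            elif r == w:
                wins += 0.5  # ties count as half a win
    return wins / (len(right) * len(wrong))

print(type2_auroc([90, 80, 70, 60], [True, True, False, False]))
print(type2_auroc([90, 60, 80, 70], [True, True, False, False]))
```

A value of 0.5 means confidence carries no information about correctness, and 1.0 means every correct answer was rated above every incorrect one. Computing this per model-domain cell yields the grid the study compares.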

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key (arXiv:2605.06638, May 7, 2026)¶

Direct complement to the ReasonMaxxer paper (arXiv:2605.06241, already in this wiki). Where ReasonMaxxer asks how RL improves reasoning (sparse policy correction at high-entropy positions), this paper asks what RL can teach: specifically, whether RL training can instill genuinely new long-horizon reasoning capabilities that were not already latent in the base model. The tool: ScaleLogic, a synthetic logical reasoning framework with two orthogonal difficulty axes: (1) proof depth (horizon length) and (2) logical expressiveness (from simple if-then implication up to first-order logic with conjunction, disjunction, negation, and universal quantification). Key empirical findings: RL training effort follows a power law in proof tree depth with a scaling exponent of ~2.60, meaning training compute grows roughly as depth^2.6, so doubling the proof depth costs about 6× more RL training compute. This power law holds across multiple RL algorithms, with curriculum training improving scaling efficiency. More importantly: more expressive training settings yield significantly larger and more compute-efficient transfer to downstream benchmarks, up to +10.66 points on mathematics and general reasoning tasks. The expressiveness dependency is the core finding: RL trained on simple implication logic barely transfers; RL trained on full FOL transfers broadly. Theoretical implication for the RL-for-reasoning debate: the paper partially rehabilitates RL as a capability-teaching tool, not for arbitrary long-horizon reasoning, but specifically when the training distribution is expressively rich. The two papers together suggest a synthesis: RL selects from existing capabilities at low expressivity, but can instill genuinely new reasoning patterns when the training problems are sufficiently expressive. Directly relevant to curriculum design for RLVR: the practical takeaway is that training problem expressiveness (not just difficulty or quantity) is the dominant variable in reasoning transfer.
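The arithmetic behind a power law in depth is worth spelling out, since an exponent is easy to misread as a per-step multiplier. The sketch assumes training effort scales as depth raised to the reported exponent of ~2.60 with an unknown constant that cancels in ratios; the function name is hypothetical.

```python
def compute_ratio(depth_from, depth_to, exponent=2.60):
    """Relative RL training effort when proof depth changes, under effort ~ depth ** exponent."""
    return (depth_to / depth_from) ** exponent

print(round(compute_ratio(1, 2), 2))    # doubling depth
print(round(compute_ratio(4, 5), 2))    # one extra step at depth 4
print(round(compute_ratio(10, 11), 2))  # one extra step at depth 10
```

Under this reading, doubling depth multiplies effort by about 6, while the cost of a single extra step shrinks in relative terms as depth grows.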

LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG (arXiv:2605.06285, May 2026)¶

Standard agentic RAG — where an LLM iteratively generates natural language subqueries, retrieves documents, and reasons before answering — incurs high latency from autoregressive generation of intermediate thoughts. LatentRAG shifts both reasoning and retrieval from discrete language space to continuous latent space: instead of generating subqueries token-by-token, the model produces latent tokens for thoughts and subqueries directly from hidden states in a single forward pass. The latent tokens are aligned with dense retrieval models in the same latent space, enabling retrieval over latent subquery vectors with no text serialization overhead. A parallel latent decoding mechanism translates latent tokens back to natural language when needed. Training: standard LLM fine-tuned with latent alignment objectives. Results across 7 benchmarks: ~90% inference latency reduction vs explicit agentic RAG, while matching explicit RAG accuracy — substantially narrowing the gap with single-step RAG latency. Key insight: the serialization penalty (generating natural language at every intermediate step) is the dominant latency cost in agentic RAG, not retrieval itself. LatentRAG eliminates this cost without sacrificing retrieval quality. Directly complements CoLaR and Latent-GRPO (already in this wiki) which move single-model reasoning to latent space; LatentRAG extends the "think silently" paradigm to the retrieval-augmented agentic loop — enabling end-to-end joint optimization of the reasoning + retrieval pipeline in continuous space.


BALAR: A Bayesian Agentic Loop for Active Reasoning (arXiv:2605.05386, May 6, 2026)¶

Addresses a fundamental gap in interactive LLM systems: how should an agent decide which clarifying question to ask next? Most current systems treat dialogue reactively — they respond to questions but have no principled mechanism to reason about what information is missing. BALAR (Bayesian Agentic Loop for Active Reasoning) is a task-agnostic outer-loop algorithm requiring no fine-tuning: it maintains a structured belief distribution over latent states (user intent modeled as a discrete variable over a product space of disambiguating dimensions, e.g., severity level × product type), then selects clarifying questions by maximizing expected mutual information between the question and the latent state — the Bayes-optimal question at each turn. When the current state representation proves insufficient, BALAR dynamically expands the latent space. Evaluated on AR-Bench-DC (detective cases), AR-Bench-SP (thinking puzzles), and iCraft-MD (clinical diagnosis). From Stanford (Fox Lab). Significance: most agentic systems assume clarification is unnecessary or hand-code question selection heuristics. BALAR provides a principled, training-free replacement — Bayesian active learning applied to the agentic outer loop. Directly relevant to multi-turn agent architectures where partial observability is the norm (customer support agents, clinical decision agents, diagnostic systems). Connects to the Belief-Action Gap paper (arXiv:2605.00226, already in this wiki): BALAR addresses what to ask when belief is uncertain; that paper shows beliefs often fail to convert to actions even when explicit — together they frame the full belief-management problem in agentic systems.
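Question selection by expected mutual information can be sketched for a small discrete belief. The example assumes each candidate question comes with a known likelihood table `P(answer | state)`; how BALAR obtains these likelihoods from an LLM, and how it expands the latent space, is not shown. All names and numbers are illustrative.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits, ignoring zero-probability entries."""
    p = np.asarray(p, dtype=float)
    nonzero = p[p > 0]
    return float(-(nonzero * np.log2(nonzero)).sum())

def expected_information_gain(belief, likelihood):
    """Mutual information I(state; answer) for one question.

    belief: shape (S,), prior over latent states.
    likelihood: shape (S, A), likelihood[s, a] = P(answer a | state s).
    """
    belief = np.asarray(belief, dtype=float)
    likelihood = np.asarray(likelihood, dtype=float)
    p_answer = belief @ likelihood
    gain = entropy_bits(belief)
    for a, pa in enumerate(p_answer):
        if pa > 0:
            posterior = belief * likelihood[:, a] / pa  # Bayes update for answer a
            gain -= pa * entropy_bits(posterior)
    return gain

def select_question(belief, questions):
    """Pick the question name with the highest expected information gain."""
    return max(questions, key=lambda name: expected_information_gain(belief, questions[name]))

belief = np.full(4, 0.25)  # four equally likely latent states
questions = {
    "split": [[1, 0], [1, 0], [0, 1], [0, 1]],    # halves the hypothesis space
    "peel": [[1, 0], [0, 1], [0, 1], [0, 1]],     # isolates a single state
    "useless": [[0.5, 0.5]] * 4,                  # answer is independent of the state
}
print(select_question(belief, questions))
```

With a uniform belief over four states, the question that splits the hypotheses evenly yields one full bit and is selected over the others.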

Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning (arXiv:2605.06241, May 7, 2026)¶

Challenges the dominant interpretation of why reinforcement learning improves LLM reasoning. Core finding from token-level analysis across multiple model families and RL algorithms: RL does not teach new strategies — it redistributes probability mass over solutions the base model already contains. Only 1–3% of token positions are affected by RL training, and at every affected position, the promoted token always lies within the base model's top-5 alternatives. This means RL is not creating new reasoning capability but rather doing precise, sparse policy correction at high-entropy decision points — positions where the model is genuinely uncertain. The entropy signal identifying these positions is available from the base model itself (no RL model needed). Based on this, the authors propose ReasonMaxxer — an RL-free alternative using contrastive loss applied only at high-uncertainty token positions. ReasonMaxxer requires only tens of problems and minutes of single-GPU training, matching or exceeding full RL performance. Training cost reduction: approximately 1000× vs. standard RLVR training. Theoretical reframe: reasoning improvement via RL is sparse policy selection, not capability acquisition. Implications: (1) The enormous compute cost of RL fine-tuning for reasoning is largely wasted on the 97–99% of positions where the model is already making correct choices; (2) Base model entropy is a sufficient signal for identifying which positions need correction — the RL model is doing expensive work the base model could identify for free; (3) The "reasoning" that RL supposedly teaches is already latent in pre-training. Connects to the broader latent reasoning thread in this wiki (CoLaR, Latent-GRPO, RLM): if reasoning capability is already in the base model, the question shifts from "how do we train reasoning in" to "how do we efficiently surface it at inference time."
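Locating high-entropy decision points from base-model probabilities alone can be sketched as below. The top-fraction selection rule and its default of 3% (echoing the 1–3% figure above) are assumptions for illustration; the paper's exact criterion and the contrastive loss applied at the selected positions are not reproduced.

```python
import numpy as np

def token_entropies(probs):
    """Per-position entropy (in nats) of next-token distributions.

    probs: array of shape (positions, vocab), rows summing to 1.
    """
    probs = np.asarray(probs, dtype=float)
    safe = np.where(probs > 0, probs, 1.0)  # log(1) = 0 avoids log(0) warnings
    return -(probs * np.log(safe)).sum(axis=-1)

def high_entropy_positions(probs, fraction=0.03):
    """Indices of the top `fraction` of positions by entropy (at least one)."""
    entropies = token_entropies(probs)
    k = max(1, int(round(fraction * len(entropies))))
    return np.sort(np.argsort(entropies)[-k:])

# Toy base-model distributions over a 3-token vocabulary at 4 positions.
toy_probs = [
    [1.00, 0.00, 0.00],
    [0.90, 0.05, 0.05],
    [1 / 3, 1 / 3, 1 / 3],  # genuine branch point
    [0.98, 0.01, 0.01],
]
print(high_entropy_positions(toy_probs, fraction=0.25))
```

Only the selected positions would receive a training signal in a sparse scheme of this kind, leaving the confident positions untouched.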

Why Do LLMs Struggle in Strategic Play? Broken Links Between Observations, Beliefs, and Actions (arXiv:2605.00226, May 2026)¶

Identifies two fundamental failure modes in LLM strategic reasoning under incomplete information. Observation-Belief Gap: models develop internal representations of game states that are more accurate than what they verbally express — but these internal representations are fragile, degrading with multi-step reasoning and showing primacy/recency biases rather than following Bayesian updating. Belief-Action Gap: converting implicit internal beliefs into decisions underperforms compared to using explicitly stated beliefs in prompts, and neither approach consistently optimizes game outcomes. Tested on open-weight models (Llama 3.1, Qwen3, and gpt-oss) in incomplete-information strategic scenarios. Practical implication for agent systems: LLMs deployed in negotiation, auction, or adversarial planning contexts have systematic decision-making vulnerabilities — not just capability limits. Connects to the Solver-Sampler Mismatch paper (arXiv:2604.11840, already in this wiki), which showed reasoning-optimized models collapse behavioral diversity in negotiation. This paper identifies the mechanism: internal beliefs aren't faithfully converted to actions, creating a structural disconnect between what the model "knows" and what it does. Together they constitute a strong case against using current LLMs in high-stakes strategic contexts without explicit belief-state scaffolding.

When to Think, When to Speak: Learning Disclosure Policies for LLM Reasoning (arXiv:2605.03314, May 2026)¶

Addresses a fundamental tension in autoregressive LLMs: in a single-stream interface, every token is both a state update and an irreversible public commitment. This coupling creates a "silence tax" — deliberating longer postpones useful content, but early streaming risks premature commitment that biases subsequent generation. Side-by-Side (SxS) Interleaved Reasoning makes disclosure timing a controllable decision within standard autoregressive generation: the model interleaves partial disclosures with continued private reasoning in the same context, but releases content only when it is entailment-supported by the reasoning so far. Training proceeds in two stages: (1) SFT on entailment-aligned interleaved trajectories — constructed by matching answer prefixes to supporting reasoning prefixes, teaching the model the dual-action (reason vs. disclose) semantics without incentivizing filler; (2) RL fine-tuning to recover reasoning quality under the new format. The key architectural insight is that disclosure timing is not merely a UX concern — it is a generation-quality lever: premature commitment in the context window biases future token probabilities toward that committed prefix, compounding errors. By controlling when content is "locked in" versus still being reasoned about, SxS decouples deliberation depth from latency. Directly complements CoLaR and Latent-GRPO (already in this wiki) which move reasoning into latent space; SxS instead keeps reasoning in token space but separates public commitment from internal deliberation — a training-time mechanism versus an architecture change.

The Reasoning Trap: Information-Theoretic Bound on Closed-System Multi-Step LLM Reasoning (arXiv:2605.01704, May 2026)¶

Establishes a fundamental theoretical limit on multi-agent debate (MAD) and closed-system reasoning protocols. Core result (Theorem 1 / DPI Bound): under MAD, the chain Evidence → Output⁰ → Output¹ forms a Markov process. The Data Processing Inequality then implies that mutual information between evidence and model output can only decrease with each debate round — meaning multi-agent debate is structurally incapable of recovering information that a single model pass failed to extract. Empirical confirmation across 16 conditions: MAD preserved 88% baseline accuracy while reasoning quality (measured as evidence grounding) dropped 43%; majority-vote approaches reduced reasoning support to 1.7% of baseline. The companion proposal, EGSR (Evidence-Grounded Self-Reflection), breaks the Markov chain by explicitly anchoring each reasoning step to external evidence rather than prior outputs — recovering 98% of baseline reasoning quality. Critical diagnosis for multi-agent debate: "when copies of the same LLM debate, they produce diverse phrasings of one perspective rather than diverse perspectives" — homogeneous agents cannot generate the epistemic diversity that makes debate useful. Directly complements the Single-Agent vs Multi-Agent paper (arXiv:2604.02460, already in this wiki): that paper shows MAD burns extra tokens without outperforming a single agent at equal budget; this paper provides the information-theoretic proof of why — debate is mathematically guaranteed to degrade evidence grounding per round. Together, they constitute the strongest theoretical and empirical case against closed-system multi-agent debate as a reasoning strategy.
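The bound follows from the standard data processing inequality. Writing $E$ for the evidence and $O^{(t)}$ for the output after debate round $t$ (notation chosen here, and the extension beyond two rounds assumes each round depends on the evidence only through the previous round's outputs):

```latex
E \;\to\; O^{(0)} \;\to\; O^{(1)} \;\to\; \dots \;\to\; O^{(T)}
\quad\Longrightarrow\quad
I\bigl(E;\, O^{(T)}\bigr) \;\le\; \dots \;\le\; I\bigl(E;\, O^{(1)}\bigr) \;\le\; I\bigl(E;\, O^{(0)}\bigr)
```

Re-reading the evidence at every step, as EGSR is described as doing, gives each output a direct dependence on $E$, so the chain is no longer Markov in this form and the inequality no longer applies.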

Enhanced LLM Reasoning by Optimizing Reward Functions with Search-Driven Reinforcement Learning (arXiv:2605.02073, May 2026)¶

Proposes treating reward function design itself as an optimization problem rather than a fixed specification, using a frontier LLM to generate, validate, and rank candidate reward functions for GRPO training. The iterative search framework: (1) generate 50 candidate reward functions over 5 rounds; (2) validate each automatically; (3) run 500-step GRPO fine-tuning of Llama-3.2-3B-Instruct + LoRA for each candidate; (4) rank by F1 on GSM8K test set; (5) select and ensemble top performers. Key results: mean F1 improved from 0.596 (Round 1) to 0.632 (Round 5); best individual reward reached F1 = 0.787; best ensemble achieved F1 = 0.795 (vs. base GRPO baseline of 0.609) — a 0.19 absolute F1 gain with accuracy at 0.660. Control experiments confirm it is the ranked-feedback loop driving improvement, not simply adding more reward terms (random 5-reward ensemble scores F1 = 0.047). Architectural significance: this inverts the standard RL fine-tuning pipeline — instead of fixing the reward and optimizing policy, the policy training is used to evaluate reward quality. The approach is model-agnostic and immediately composable with any RLVR pipeline. Most interesting implication: LLMs can auto-discover task-specific reward functions that outperform hand-crafted scalar rewards, removing a key human bottleneck from the RLHF pipeline. Directly extends ResRL (arXiv:2605.00380, already in this wiki) which improves RL training given a fixed reward; this paper improves the reward itself.
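
The generate, evaluate, rank, and feed-back loop can be sketched in a few lines. Everything below is illustrative: `search_rewards`, `toy_propose`, and `toy_evaluate` are hypothetical names, a candidate "reward function" is reduced to a single float, the split of 10 candidates per round is an assumption, and the toy evaluator stands in for the 500-step GRPO run plus GSM8K F1 scoring described above.

```python
import random

def search_rewards(propose, evaluate, rounds=5, per_round=10, keep=3, seed=0):
    """Iteratively propose candidates, score them, and feed the top ones back."""
    rng = random.Random(seed)
    leaderboard = []  # list of (score, candidate), best first
    for _ in range(rounds):
        top = [candidate for _, candidate in leaderboard[:keep]]
        candidates = [propose(top, rng) for _ in range(per_round)]
        leaderboard.extend((evaluate(c), c) for c in candidates)
        leaderboard.sort(key=lambda pair: pair[0], reverse=True)
    return leaderboard[:keep]

def toy_propose(top, rng):
    """Stand-in for the frontier LLM: random at first, then perturb a top candidate."""
    if not top:
        return rng.uniform(0.0, 1.0)
    return rng.choice(top) + rng.gauss(0.0, 0.05)

def toy_evaluate(weight):
    """Stand-in for 'fine-tune with this reward, then measure F1'."""
    return -(weight - 0.7) ** 2

toy_result = search_rewards(toy_propose, toy_evaluate)
```

The design point the sketch preserves is the inversion noted above: the expensive inner step (policy training) is used as the scoring function for the outer search over rewards.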

Kimi K2.6: Agent Swarm Scaling to 300 Sub-Agents and 4,000 Coordinated Steps (Moonshot AI, April 20, 2026)

Moonshot AI released Kimi K2.6 on April 20, 2026, scaling up the Agent Swarm architecture introduced in K2.5: sub-agents tripled from 100 to 300, and coordinated steps rose about 2.7× from 1,500 to 4,000. The underlying PARL (Parallel-Agent Reinforcement Learning) framework was extended to handle larger orchestration horizons and longer dependency chains between sub-tasks. K2.6 primarily focuses on long-horizon coding: multi-repository refactors, cross-service integration tasks, and large-scale codebase analysis that exceed the operational envelope of single-agent frontier models. Context: K2.5 (January 2026, 1T total / 32B active MoE, 256K context) established PARL as the first RL technique explicitly optimized for orchestrator-subagent credit assignment at scale, training only the orchestrator while freezing sub-agents, with rewards that incentivize sub-agent creation and parallel task completion. K2.6 inherits this architecture with extended scale. Why this matters for the multi-agent RL landscape: the arXiv:2605.02801 Orchestration Traces paper (published same week) cites K2.5/K2.6's PARL as the primary industrial evidence that orchestration learning is a tractable RL sub-problem distinct from single-agent RLHF. Together, the Moonshot K-series and the Orchestration Traces taxonomy define the current frontier of learning-based multi-agent orchestration, with the K-series as the first production demonstration that an LLM can be RL-trained to manage agent swarms, not just hard-coded to do so.

Latent-GRPO: Stable Reinforcement Learning for Latent Reasoning in LLMs (arXiv:2604.27998, April 2026)

Standard GRPO applied to latent reasoning — where reasoning steps are compressed into continuous representations rather than generated as explicit text — suffers from three destabilizing failure modes: (1) lack of intrinsic structure causing invalid latent exploration; (2) misalignment between trajectory-level rewards and token-level gradient updates; (3) invalid averaged states when multiple correct paths are reinforced simultaneously. Latent-GRPO addresses all three with targeted stabilization techniques: invalid-sample advantage masking (suppress gradients from structurally invalid latent trajectories), one-sided noise sampling (directional perturbation during exploration to preserve meaningful variation), and optimal correct-path first-token selection (anchor gradient updates to the most recoverable correct path rather than averaging across correct paths). Results: +7.86 Pass@1 points on GSM8K-Aug (low-difficulty) and +4.27 points on AIME (high-difficulty) over latent-SFT baseline; 3–4× shorter reasoning chains vs explicit GRPO. Gumbel sampling further boosts pass@k diversity. Theoretical contribution: demonstrates that policy optimization in continuous embedding space is fundamentally more compute-efficient than in discrete token space — the same reasoning gain at far lower sequence length. Latent-GRPO represents a convergence point between the looped LLM direction (Ouro, LoopFormer) and RL reasoning training (CAPO, ResRL) already in this wiki: it applies RL to the latent reasoning loop itself, not just the output.
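
Invalid-sample advantage masking can be illustrated on one rollout group. This is a minimal sketch, not the paper's implementation: `masked_group_advantages` is a hypothetical helper, and computing the group mean and standard deviation over valid samples only is an assumption.

```python
from statistics import mean, pstdev

def masked_group_advantages(rewards, valid, eps=1e-8):
    """Group-relative advantages with structurally invalid samples masked to zero.

    Assumption: group statistics are computed over the valid samples only.
    """
    kept = [r for r, ok in zip(rewards, valid) if ok]
    if not kept:
        return [0.0] * len(rewards)
    mu, sigma = mean(kept), pstdev(kept)
    return [(r - mu) / (sigma + eps) if ok else 0.0
            for r, ok in zip(rewards, valid)]

# Four rollouts for one prompt; the third latent trajectory is flagged invalid.
rewards = [1.0, 0.0, 1.0, 0.0]
valid = [True, True, False, True]
advantages = masked_group_advantages(rewards, valid)
```

A zero advantage means the invalid trajectory contributes no policy gradient, so it is neither reinforced nor penalized.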

CoLaR: Dynamic Latent Compression of LLM Reasoning Chains (arXiv:2505.16552, May 2025)

Chain-of-thought reasoning is computationally expensive because reasoning proceeds token-by-token in discrete text space. CoLaR (Compressed Latent Reasoning) performs reasoning entirely in dense latent space — compressing multi-step reasoning chains into compact continuous representations — via a two-stage training framework: (1) supervised fine-tuning with an auxiliary compressed embedding prediction objective that teaches the model to generate compressed latent summaries of reasoning steps; (2) RL enhancement that explores diverse reasoning paths and identifies more compact valid ones. Critically, CoLaR allows dynamic adjustment of reasoning speed at inference time through prompting — users can trade reasoning depth for latency without retraining. Key results: 14.1% higher accuracy than latent-based baselines at comparable compression; 53.3% reduction in reasoning chain length with only 4.8% performance loss vs explicit chain-of-thought; up to 5.4% accuracy gains on challenging math tasks with 82.8% reduction in latent reasoning chain length (the efficiency frontier: more accurate and far shorter). Directly pairs with Latent-GRPO (above): both move reasoning into continuous latent space, but from different training angles — CoLaR optimizes the compression + RL training pipeline; Latent-GRPO stabilizes the RL optimization within that space. Together they represent the emerging "think silently" paradigm: latent-space reasoning as the production successor to verbose chain-of-thought.

RecursiveMAS: Recursive Multi-Agent Systems via Latent-Space Collaboration (arXiv:2604.25917, April 28, 2026)

RecursiveMAS extends the idea of recursive computation — already productive in single-model depth scaling — to the multi-agent setting. The key claim: agent collaboration itself can be cast as a unified latent-space recursive computation. The system connects heterogeneous agents via a lightweight RecursiveLink module that enables each agent to generate in-distribution latent thoughts and transfer internal state across agents without text serialization overhead. Training uses an inner-outer loop learning algorithm — inner loop optimizes individual agent reasoning, outer loop does whole-system co-optimization via gradient-based credit assignment across recursion rounds. Evaluated across 9 benchmarks spanning mathematics, science, medicine, search, and code generation: 8.3% average accuracy improvement over advanced single- and multi-agent baselines, 1.2×–2.4× end-to-end inference speedup, and 34.6%–75.6% token usage reduction. The theoretical analysis shows RecursiveMAS has better runtime complexity than text-based MAS and maintains stable gradients during recursive training. Why this matters: most multi-agent systems use text as the inter-agent communication medium, incurring serialization cost and information loss at every agent boundary. RecursiveLink eliminates the text bottleneck by passing latent states directly — which also enables gradient flow across the entire system for end-to-end training. The result is that adding more agents actually improves compute efficiency rather than just accuracy, inverting the usual MAS cost-accuracy tradeoff. Stanford, UIUC, MIT, Google DeepMind collaboration. Code and data at recursivemas.github.io.

Mistral Medium 3.5: 128B Open-Weight Frontier Coder (Mistral AI, April 29, 2026)

Mistral Medium 3.5 is a dense 128B open-weight model released April 29, 2026, unifying instruction-following, reasoning, and coding in a single set of weights with a 256K-token context window. Benchmark placement: 77.6% SWE-bench Verified (above Claude Sonnet 4.5's 77.2%, below Sonnet 4.6's 79.6% and DeepSeek V4-Pro's 80.6%), 91.4% on τ³-Telecom agentic benchmark. The model fits inside a single four-GPU node, and EAGLE-accelerated speculative decoding keeps throughput competitive. Released open-weight on Hugging Face (mistralai/Mistral-Medium-3.5-128B). Alongside the model, Mistral released Vibe remote agents: async cloud-based coding sessions that run agentic work in the background while the developer continues chatting, analogous to Cursor's background agent. Architecture is dense (not MoE), which simplifies deployment compared to DeepSeek V4-Pro, but every token activates all 128B parameters, so it lacks the total-versus-active compute asymmetry of a sparse MoE. Strategic significance: at 128B dense, Medium 3.5 is the most deployable open-weight frontier-class coding model as of its release, sitting at a capability level competitive with closed-source models while fitting on commodity multi-GPU hardware. The EAGLE speculative decoding integration is notable: it means the open-weight release includes the inference optimization layer, not just weights.

Recursive Language Models: Context Management as a 2026 Paradigm (Prime Intellect Blog / arXiv:2512.24601)

Recursive Language Models (RLMs) treat the LLM context window as an external environment the model can recursively examine, decompose, and delegate, rather than a passive input buffer. Instead of summarizing long contexts (which loses information), an RLM proactively delegates context segments to Python scripts or sub-LLM calls, managing its own working memory as an active agent. Key results: RLMs process inputs up to two orders of magnitude beyond the model's native context window; on CodeQA (long-document QA), GPT-5 base scores 24.00 while RLM achieves 62.00, a 2.6× improvement. Prime Intellect, calling RLMs "the paradigm of 2026," argues the critical next step is training models with RLM scaffolding end-to-end via RL, teaching learned context folding rather than relying on hand-crafted delegation heuristics. The original paper (arXiv:2512.24601, December 2025, Alex Zhang / MIT) gained significant traction in early 2026 as long-context agent workloads exposed the limits of both full-context injection (too expensive) and sliding-window (loses dependencies). Open-source inference library: github.com/alexzhang13/rlm. Directly complements ContextWeaver (dependency-graph memory) and ByteRover (file-native hierarchical memory) already in this wiki: RLM is the meta-framework in which those memory strategies become end-to-end learnable rather than hand-engineered.
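
The delegation idea can be sketched with a fixed divide-and-combine rule. This is a hand-crafted heuristic of exactly the kind the entry says should eventually be learned, and it does not use the `rlm` library's API: `recursive_answer` and `grep_llm` are hypothetical, the "context" is a list of lines, and the stub "model" just filters lines.

```python
from typing import Callable, List

def recursive_answer(query: str, lines: List[str],
                     llm: Callable[[str, List[str]], List[str]],
                     window: int = 4) -> List[str]:
    """Answer over `lines`, delegating halves to sub-calls when they exceed `window`."""
    if len(lines) <= window:
        return llm(query, lines)
    mid = len(lines) // 2
    partials = (recursive_answer(query, lines[:mid], llm, window)
                + recursive_answer(query, lines[mid:], llm, window))
    if len(partials) >= len(lines):
        # No compression achieved; stop recursing to guarantee termination.
        return llm(query, partials)
    return recursive_answer(query, partials, llm, window)

def grep_llm(query: str, lines: List[str]) -> List[str]:
    """Stub sub-model (assumption): keeps the lines that mention the query."""
    return [line for line in lines if query in line]

document = [f"row {i}: noise" for i in range(20)]
document[3] = "row 3: key=alpha"
document[17] = "row 17: key=beta"
answer = recursive_answer("key=", document, grep_llm)
```

No single call ever sees more than `window` lines, yet the final answer draws on the whole 20-line input.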

DeepSeek-V4-Pro-Max: Extended Reasoning Mode Bridges Gap with Closed-Source on Hard Math & Coding (Artificial Analysis / llm-stats, April 2026)

DeepSeek-V4-Pro-Max is the maximum-reasoning-effort inference mode of V4-Pro, activated by requesting extended thinking in the API, analogous to GPT-5.4's "o-mode" or Gemini 3.1 Pro's "thinking" variant. It does not change model weights but samples with a higher reasoning token budget. Key benchmark results that close the gap with closed-source frontier models: SWE-Bench Pro 55.4% (vs. V4-Pro's 87.5% on a different reported variant; figures vary by evaluation harness), HMMT 2026 February 95.2, IMOAnswerBench 89.8 (vs. GPT-5.4 at 91.4), putting it within striking distance of the best closed-source models on competition math. On coding benchmarks, V4-Pro-Max comes within 0.2 pp of Claude Opus 4.6 on SWE-bench Verified (80.6% vs. 80.8%) and beats it on Terminal-Bench 2.0. Several independent reviewers flag a gap between benchmark performance and real-world behavior, so teams should run domain-specific evals before committing to production migrations. The practical implication: V4-Pro-Max is the strongest available open-weight model for hard reasoning workloads in mid-2026, with a public inference cost at least 10–20× lower than equivalent closed-source models at the same reasoning-token budget. This represents the first case where a fully open-source model with disclosed architecture achieves near-parity with GPT-5.4 on competition mathematics, a significant capability milestone for the open ecosystem.

ContextWeaver: Selective Dependency-Structured Memory for LLM Agents (arXiv:2604.23069, April 2026)

ContextWeaver addresses a subtle but critical failure mode in agent memory: the standard sliding-window or full-history approaches pass either too little (missing early context) or too much (irrelevant tokens) to each reasoning step. ContextWeaver instead builds a dependency graph of the agent's interaction trace, with each reasoning step as a node and typed edges capturing logical dependencies between steps. At inference time, only the nodes that are directly relevant and structurally depended upon for the current action are selected and injected. Two measurable benefits: (1) improved task performance on both SWE-Bench Verified and SWE-Bench Lite vs. a strong sliding-window baseline in pass@1, without adding model parameters; (2) efficiency gains, meaning fewer reasoning steps and lower token usage per task completion, because the agent isn't re-processing irrelevant context. The selective-dependency framing is architecturally cleaner than either chunked RAG (which loses dependency structure) or full-context injection (which wastes compute). Directly relevant to production agent systems that run multi-step software engineering tasks where early-session decisions constrain late-session actions, which is exactly the regime where sliding-window approaches fail. Complements ByteRover (arXiv:2604.01599) and the Memory in the LLM Era survey (arXiv:2604.01707) already in this wiki: ByteRover is a file-native implementation; ContextWeaver is a graph-native approach that remains representation-agnostic about the underlying storage.
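
Dependency-based selection can be sketched as a graph walk. The sketch below is illustrative only: it assumes the trace is stored as a plain mapping from step id to the step ids it depends on, ignores edge types and relevance scoring, and treats "structurally depended upon" as the transitive closure, which is an assumption.

```python
def select_context(deps: dict[int, list[int]], current: int) -> list[int]:
    """Return the earlier steps that `current` transitively depends on, in trace order."""
    seen: set[int] = set()
    stack = list(deps.get(current, []))
    while stack:
        step = stack.pop()
        if step not in seen:
            seen.add(step)
            stack.extend(deps.get(step, []))
    return sorted(seen)

# Hypothetical six-step trace: steps 3 and 5 form a side branch unrelated to step 6.
trace_deps = {1: [], 2: [1], 3: [], 4: [2], 5: [3], 6: [4]}
selected = select_context(trace_deps, 6)
print(selected)
```

A sliding window of size 3 would have passed steps 3 to 5 and dropped step 1; the dependency walk keeps step 1 and skips the unrelated branch.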

DeepSeek-V4-Pro Released: 1.6T MoE, 80.6% SWE-bench Verified, Open-Source (DeepSeek, April 24, 2026)

DeepSeek officially released DeepSeek-V4-Pro and DeepSeek-V4-Flash on April 24, 2026. V4-Pro is a 1.6-trillion-parameter MoE model with 49B active parameters and a 1 million token context window, pre-trained on 33 trillion tokens. Architecture: the same CSA+HCA+mHC triad as V4-Flash at full scale, i.e. Compressed Sparse Attention (CSA) + Heavily Compressed Attention (HCA) for long-context efficiency + Manifold-Constrained Hyper-Connections (mHC) for gradient stability. At 1M-token context, V4-Pro requires only 27% of V3.2's per-token inference FLOPs and 10% of the KV cache, making ultra-long context operationally practical. Benchmark highlights: 80.6% SWE-bench Verified (0.2pp behind Claude Opus 4.6's 80.8%), Codeforces rating 3,206 (highest of any model at release, above GPT-5.4's 3,168), MMLU-Pro matches GPT-5.4. Beats all open-weight models on math and coding; trails only Gemini 3.1-Pro on world knowledge. Trained entirely on Huawei Ascend chips (no NVIDIA hardware). Open-source release. Architectural significance: the CSA+HCA combo is the most efficient published solution for MoE at 1M-token context. The V4-Flash preview (284B, April 2026) was the lighter variant; V4-Pro is the full-capability model. The 49B-active/1.6T-total ratio (about 3% active) is an unusually sparse MoE, pushing the active-parameter efficiency argument well beyond DeepSeek V3.2's earlier ratio. The U.S. government simultaneously escalated allegations of DeepSeek IP theft, adding a geopolitical dimension to an otherwise clean benchmark story.

DiP-SD: Distributed Pipelined Speculative Decoding for Edge LLM Inference (arXiv:2604.20919, April 22, 2026)

Studies speculative decoding in multi-user edge inference scenarios, where the standard single-machine draft-verify paradigm breaks down because drafting cannot be co-located with verification. DiP-SD's design: draft tokens are generated locally on user devices (exploiting idle device compute), then offloaded to a centralized edge server for batch verification using the target model. Two complementary parallelism dimensions are exploited simultaneously: device-level distributed drafting (N devices generate draft tokens in parallel, amortizing draft latency across users) and phase-level draft-verify pipelining (drafting of the next batch overlaps with verification of the current batch, hiding server-side latency). The result is efficient multi-user speculative decoding at the network edge without requiring large models on device. Practical significance for edge AI deployment: existing speculative decoding frameworks assume co-location of draft and target models on the same server — DiP-SD is the first design explicitly targeting the disaggregated topology that is the natural architecture for mobile/IoT AI inference. Directly complements ToolSpec (arXiv:2604.13519, already in this wiki) and AsyncTLS (arXiv:2604.07815) for the broader theme of serving-layer optimizations that require no model retraining.
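
The benefit of phase-level draft-verify pipelining can be seen in a simple latency model. This is a back-of-the-envelope sketch, not the paper's analysis: the function names and the millisecond figures are hypothetical, and network transfer time and any dependence of the next draft on the previous verification result are ignored.

```python
def sequential_latency(rounds: int, draft_ms: float, verify_ms: float) -> float:
    """Each batch is drafted, then verified, before the next draft starts."""
    return rounds * (draft_ms + verify_ms)

def pipelined_latency(rounds: int, draft_ms: float, verify_ms: float) -> float:
    """Two-stage pipeline: drafting batch k+1 overlaps with verifying batch k."""
    if rounds == 0:
        return 0.0
    # Fill the pipeline once, then the slower stage sets the pace.
    return draft_ms + (rounds - 1) * max(draft_ms, verify_ms) + verify_ms

# Hypothetical timings: 30 ms on-device drafting, 50 ms server-side verification.
print(sequential_latency(10, 30.0, 50.0))  # 800.0
print(pipelined_latency(10, 30.0, 50.0))   # 530.0
```

Once the pipeline is full, the faster stage is hidden entirely behind the slower one, which is why overlapping device-side drafting with server-side verification pays off.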

Drawing on Memory: Dual-Trace Encoding for Cross-Session Recall in LLM Agents (arXiv:2604.12948, April 14, 2026)

Proposes dual-trace encoding — maintaining two parallel memory representations for each experience: a semantic trace (what happened, compressed) and an episodic trace (contextual details, temporally tagged). At retrieval, both traces are queried and fused. Benchmarked on cross-session recall tasks: 73.7% accuracy vs. 53.5% for a strong single-trace baseline, with the largest gains in temporal reasoning (+40 percentage points), knowledge-update tracking (+25pp), and multi-session aggregation (+30pp). The cognitive science parallel is explicit: the dual-trace design mirrors the standard cognitive neuroscience distinction between semantic memory (context-free facts) and episodic memory (context-bound experiences). The practical insight: for agent systems where the same entity appears across multiple sessions with changing attributes (user preferences, project state), a single compressed representation consistently loses track of which version of a fact is current — dual-trace encodes both "what" and "when," resolving this. Directly complements the Continuum Memory Architecture (arXiv:2601.09913) and ByteRover (arXiv:2604.01599) already in this wiki, which address persistent memory at the architecture and implementation level; dual-trace encoding is a complementary technique at the representation level within any of those architectures.

Calibration-Aware Policy Optimization for Reasoning LLMs (arXiv:2604.12632, April 14, 2026)

Introduces CAPO (Calibration-Aware Policy Optimization), an RL fine-tuning algorithm that jointly optimizes for accuracy and calibration by replacing the standard scalar advantage estimate with a logistic AUC surrogate loss, enabling uncertainty-aware gradient weighting. Standard GRPO/PPO optimizes for correctness but ignores confidence calibration, producing models that are correct but systematically over- or under-confident. CAPO's AUC loss penalizes configurations where a wrong answer is assigned higher confidence than a correct answer, directly training the reward distribution to be calibration-consistent. Results: stable learning dynamics, improved accuracy on reasoning benchmarks, and measurably better calibration scores vs. standard GRPO. Connects directly to Calibration Collapse (arXiv:2604.10585, already in this wiki): that paper shows sycophantic RLHF corrupts calibration; CAPO provides an orthogonal training-time correction that keeps calibration aligned with accuracy throughout RL fine-tuning.
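
A generic pairwise logistic AUC surrogate shows the kind of penalty described. This is the textbook form, offered as an assumption: CAPO's exact loss, its scaling, and how it feeds into the advantage estimate are not given in this entry, and `logistic_auc_loss` is a hypothetical name.

```python
import math

def logistic_auc_loss(conf_correct, conf_wrong):
    """Mean logistic loss over all (correct, wrong) confidence pairs.

    The loss is small when correct answers get higher confidence than wrong ones
    and grows when a wrong answer outranks a correct one.
    """
    pairs = [(c, w) for c in conf_correct for w in conf_wrong]
    if not pairs:
        return 0.0
    return sum(math.log1p(math.exp(-(c - w))) for c, w in pairs) / len(pairs)

# Hypothetical confidences for sampled answers to one prompt.
well_ranked = logistic_auc_loss([0.9, 0.8], [0.2, 0.1])
mis_ranked = logistic_auc_loss([0.2, 0.1], [0.9, 0.8])
print(well_ranked < mis_ranked)  # True
```

Because the loss depends only on confidence differences within pairs, it targets ranking quality (AUC) rather than the absolute confidence level.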

When Reasoning Models Hurt Agent Simulation: Solver-Sampler Mismatch in Multi-Agent Negotiation (arXiv:2604.11840, April 12, 2026)

Important counterintuitive result: stronger reasoning models perform worse in multi-agent behavioral simulation when the task requires modeling boundedly rational opponents. The failure mode is a solver-sampler mismatch — reasoning-enhanced models optimize for strategically dominant solutions (the "solver" mode), which systematically collapses the space of compromise-oriented and satisficing behaviors that human negotiators actually exhibit (the "sampler" mode). For game theory and multi-agent systems: this is a concrete explanation for why reinforcement-trained agents fail against human players in negotiation domains — they learn to optimize for the Nash equilibrium rather than the human behavioral distribution. Practical design implication: for agent systems that must simulate or interact with humans in negotiation contexts (pricing, resource allocation, task assignment), using a calibrated behavioral model rather than a reasoning-maximized model is likely to produce better outcomes. Connects to the single-vs-multi-agent reasoning paper (arXiv:2604.02460) already in this wiki: that paper shows single agents match multi-agents under equal token budgets; this paper shows the agent's reasoning mode (optimizer vs. behavioral sampler) is a separate design dimension with its own performance tradeoffs.

Claude Mythos Preview + Project Glasswing: Frontier AI Finds Thousands of Zero-Days (Anthropic, April 7, 2026)

Anthropic publicly launched Claude Mythos Preview (April 7, 2026) alongside Project Glasswing — a coordinated industry initiative to use Mythos to defensively secure critical software infrastructure. Mythos autonomously identified a 17-year-old remote code execution vulnerability in FreeBSD (CVE-2026-4747 — full root access over NFS), and Anthropic claims it used Mythos to find thousands of zero-day vulnerabilities in every major operating system and every major web browser. Project Glasswing partners (AWS, Apple, Broadcom, Cisco, Google, JPMorgan Chase, Linux Foundation, Microsoft, NVIDIA, Palo Alto Networks) receive access to Mythos Preview to find and fix critical vulnerabilities. Anthropic committed $100M in model usage credits, $2.5M to Alpha-Omega/OpenSSF, and $1.5M to the Apache Software Foundation. Restricted availability — Anthropic is not releasing Mythos publicly due to dual-use danger of its offensive capabilities. Architectural significance: Mythos marks the first public demonstration of a frontier model autonomously performing end-to-end offensive security research (find, characterize, exploit) at scale — not as a capability claim but as a shipped, deployed program with institutional oversight. The capability threshold is empirically documented, not hypothetical. For AI safety: this is the clearest 2026 signal that capability vs. deployment restrictions are now a live operational policy question, not a future scenario.

HiL-Bench: Do Agents Know When to Ask for Help? (arXiv:2604.09408, April 10/13, 2026)

Identifies a critical gap in agent evaluation: existing benchmarks provide complete specifications, testing whether agents can do things, not whether they know when they need more information. HiL-Bench (Human-in-the-Loop Benchmark) measures selective escalation: the capacity to ask for help when genuinely blocked rather than silently guessing. Tasks contain human-validated blockers (missing information, ambiguous specs, contradictory requirements) that only surface through progressive task exploration, not upfront inspection. Core metric: Ask-F1 = harmonic mean of question precision (don't over-ask) and blocker recall (don't silently guess). Key finding: frontier agents at the coding and reasoning level still collapse on ambiguous specifications; they either ask trivially on every step or proceed with wrong assumptions. Direct practical implication: silent guessing is a systematic failure mode in production deployments where specifications are inevitably incomplete. For agent designers: systems that escalate selectively are not just less annoying; they are qualitatively more reliable in real-world workflows where the cost of a wrong silent assumption exceeds the cost of a clarifying question.
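
The Ask-F1 definition above is a plain harmonic mean, which a few lines make concrete. The agent profiles in the example are hypothetical numbers chosen to show the two failure modes, not benchmark results.

```python
def ask_f1(question_precision: float, blocker_recall: float) -> float:
    """Harmonic mean of question precision and blocker recall."""
    if question_precision + blocker_recall == 0:
        return 0.0
    return (2 * question_precision * blocker_recall
            / (question_precision + blocker_recall))

# Hypothetical agents:
over_asker = ask_f1(question_precision=0.1, blocker_recall=1.0)    # asks on every step
silent_guesser = ask_f1(question_precision=1.0, blocker_recall=0.0)  # never asks
selective = ask_f1(question_precision=0.8, blocker_recall=0.8)
print(over_asker, silent_guesser, selective)
```

The harmonic mean is dominated by the weaker of the two terms, so neither always asking nor never asking can score well.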

KoCo: Knowledge Coordinate Conditioning for LLM Pre-training (arXiv:2604.12397, April 2026)

Introduces control signals at the pre-training stage — a mechanism previously only used in post-training alignment (SFT, RLHF). KoCo maps every training document into a three-dimensional knowledge coordinate: Source (provenance/origin), Content (topical category), and Stability (fact vs. evolving claim). These coordinates are prepended as textual prefixes, giving the model explicit contextual awareness of where and how stable each piece of knowledge is. Results: 30% convergence speedup on both 0.3B and 0.6B models, and improved performance across 10 downstream tasks. The Stability dimension is particularly significant: by flagging evolving vs. stable facts during pre-training, the model learns to distinguish epistemic confidence levels — directly addressing hallucination from conflating stable and unstable knowledge. Theoretical contribution: blurs the pre-training/fine-tuning boundary — training signal quality can be shaped at the data-format level, not just through post-training alignment.
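
Prepending a knowledge coordinate as a textual prefix might look like the following. The bracketed prefix format, the function name, and the label values are all assumptions for illustration; the entry does not specify KoCo's actual prefix syntax or vocabulary.

```python
def add_knowledge_coordinate(text: str, source: str, content: str, stability: str) -> str:
    """Prepend a (Source, Content, Stability) prefix to a training document.

    The bracket syntax is hypothetical; only the three dimensions come from the entry.
    """
    prefix = f"[source: {source}] [content: {content}] [stability: {stability}]"
    return f"{prefix}\n{text}"

sample = add_knowledge_coordinate(
    "Water boils at 100 degrees Celsius at sea-level pressure.",
    source="encyclopedia",
    content="chemistry",
    stability="stable",
)
print(sample)
```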

ToolSpec: Schema-Aware Speculative Decoding for Tool Calling (arXiv:2604.13519, April 15, 2026)

Accelerates tool-calling inference by combining schema awareness with retrieval-augmented speculative decoding. Tool call responses are highly structured (JSON schema with defined fields, types, and values) — ToolSpec exploits this by using the tool schema to constrain and guide the speculative drafter, dramatically increasing acceptance rates vs. unconstrained n-gram or model-based drafting. The retrieval component prefetches likely argument values from prior tool call history. Achieves up to 4.2× speedup over standard tool-calling inference, substantially outperforming existing training-free speculative decoding methods on tool-use benchmarks. For production agent systems where tool calls are on the critical latency path (sequential: reason → call → respond), this is immediately deployable as a serving-layer optimization requiring no model retraining. Complements TInR (arXiv:2604.10788, already in this wiki) which internalizes tool knowledge into weights; ToolSpec optimizes the serving of tool calls on weights that already understand tool schemas.

AsyncTLS: Asynchronous Two-Level Sparse Attention for Long-Context Inference (arXiv:2604.07815, April 9, 2026)

Tackles the KV-cache transfer bottleneck in long-context inference via hierarchical sparse attention — coarse-grained block filtering first (identify relevant blocks), then fine-grained token selection within those blocks. The "async" part is the key engineering contribution: KV-cache memory transfers are overlapped with computation rather than sequential, eliminating the I/O stall that dominates latency in long-context serving. Benchmarked at 48K–96K context lengths: 1.2–10× operator speedups, 1.3–4.7× end-to-end throughput improvements while matching full-attention accuracy. This is a directly deployable optimization for production long-context serving pipelines — it requires no changes to model weights or training, only to the attention kernel and memory scheduling. Sits in the same family as the Sparse Frontier paper (arXiv:2504.17768, already in this wiki), which mapped the accuracy/efficiency Pareto frontier; AsyncTLS provides the implementation technique for deploying sparse attention without the I/O penalty that made prior approaches impractical at serving scale.
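
The coarse-then-fine selection can be sketched in pure Python. This illustrates only the two-level filtering, not the asynchronous KV-cache transfer; scoring a block by the dot product between the query and the block's mean key is an assumption, and `two_level_select` is a hypothetical name.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def two_level_select(query, keys, block_size, top_blocks, top_tokens):
    """Pick the best blocks by centroid score, then the best tokens inside them."""
    blocks = [range(start, min(start + block_size, len(keys)))
              for start in range(0, len(keys), block_size)]

    def block_score(block):
        centroid = [sum(keys[i][d] for i in block) / len(block)
                    for d in range(len(query))]
        return dot(query, centroid)

    kept_blocks = sorted(blocks, key=block_score, reverse=True)[:top_blocks]
    candidates = [i for block in kept_blocks for i in block]
    best = sorted(candidates, key=lambda i: dot(query, keys[i]), reverse=True)
    return sorted(best[:top_tokens])

# Toy 2-D keys: tokens 2 and 3 align with the query, token 5 weakly so.
query = [1.0, 0.0]
keys = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.9, 0.0], [0.0, 1.0], [0.1, 0.0]]
print(two_level_select(query, keys, block_size=2, top_blocks=2, top_tokens=3))
```

Only the keys inside the kept blocks are ever scored token by token, which is what limits how much of the KV cache has to be moved.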

Tri-Spirit: Three-Layer Cognitive Architecture for Hardware-Decomposed Agent Execution (arXiv:2604.13757, April 15, 2026)

Proposes decomposing LLM agent execution across three hardware substrates coordinated via asynchronous messaging: a planning layer (slow, high-quality reasoning), a reasoning layer (medium-latency deliberation), and an execution layer (fast, low-latency action dispatch). The key architectural innovation is a habit-compilation mechanism that converts frequently-traversed reasoning paths into zero-inference execution policies, analogous to how procedural memory compiles declarative knowledge into automatic behavior. On 2,000 synthetic agentic tasks: 75.6% latency reduction, 71.1% energy reduction, 30% fewer LLM invocations, and 77.6% of tasks completable entirely offline (no model call). The theoretical contribution is significant: Tri-Spirit frames cognitive layer decomposition, rather than model scaling or prompt engineering, as the primary lever for system-level efficiency in production agent deployments. Directly complements the single-vs-multi-agent framing from arXiv:2604.02460 (already in this wiki): where that paper asks when multi-agent beats single-agent, Tri-Spirit asks how to decompose a single agent's cognitive workload across heterogeneous hardware.
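
Habit compilation can be approximated by a counter that promotes repeated decisions into a lookup table. This is a minimal sketch under stated assumptions: `HabitCache`, the exact-match situation key, and the promotion threshold of 3 are hypothetical, and the stand-in planner is an ordinary function rather than a model call.

```python
from collections import Counter

class HabitCache:
    """Promote a (situation -> action) decision to a zero-inference habit
    once the planner has produced it `threshold` times."""

    def __init__(self, planner, threshold=3):
        self.planner = planner
        self.threshold = threshold
        self.counts = Counter()
        self.compiled = {}
        self.model_calls = 0

    def act(self, situation):
        if situation in self.compiled:
            return self.compiled[situation]  # execution layer: no model call
        self.model_calls += 1
        action = self.planner(situation)     # slow path: planning/reasoning layer
        self.counts[(situation, action)] += 1
        if self.counts[(situation, action)] >= self.threshold:
            self.compiled[situation] = action
        return action

agent = HabitCache(planner=lambda situation: situation.upper())
actions = [agent.act("open door") for _ in range(5)]
print(agent.model_calls)  # 3
```

After the third identical decision the habit is compiled, so the last two calls are served without invoking the planner.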

LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning (arXiv:2604.14140, April 15, 2026)

2,500 expert-designed problems across chemistry, mathematics, computer science, chess, and formal logic, structured as graphs of interdependent reasoning steps requiring tens to hundreds of thousands of reasoning tokens. Frontier models — including extended-thinking configurations — score below 10% accuracy. The diagnosis: CoT length alone does not solve long-horizon dependency resolution; models fail not at individual reasoning steps but at maintaining dependency coherence across large graphs of interdependent sub-conclusions. LongCoT is the most demanding reasoning benchmark currently available, exposing a ceiling that neither chain-of-thought scaling nor current test-time compute approaches can easily close. For practitioners: any deployment that requires multi-step planning over a large state space (legal document analysis, complex scientific reasoning, multi-constraint optimization) remains well outside current frontier model capability even with extended thinking enabled.

TInR (Tool-Internalized Reasoning): Moving Tool Knowledge into Model Weights (arXiv:2604.10788, April 12, 2026)

Proposes internalizing tool knowledge into model weights rather than injecting tool documentation at inference time via system prompts or context. Three-phase pipeline: (1) bidirectional knowledge alignment to transfer tool schema understanding into weights, (2) supervised fine-tuning with reasoning annotations, (3) RL fine-tuning with rewards shaped for correct tool use. Outperforms Tool-Integrated Reasoning (standard retrieval-augmented tool use) on both in-domain and out-of-domain tool-calling benchmarks. Engineering significance: eliminates the prompt overhead of tool documentation (which can consume thousands of tokens in complex agent setups), reduces latency per call, and avoids the context confusion that arises when tool schemas are injected mid-conversation. Complements the memory internalization direction in ByteRover (arXiv:2604.01599, already in this wiki): ByteRover internalizes memory management into the reasoning loop; TInR internalizes tool knowledge into weights. Both reduce infrastructure overhead for production agents.

The Silicon Mirror: Dynamic Behavioral Gating for Anti-Sycophancy in LLM Agents (arXiv:2604.00478, April 2026)

Proposes an inference-time architecture to suppress sycophancy without retraining, built from three components: (1) Behavioral Access Control (BAC) gatekeeping response generation, (2) a Trait Classifier that detects persuasion tactics (flattery, persistence, false authority, emotional pressure) across multi-turn dialogues, and (3) a Generator-Critic loop with "Necessary Friction" rewrites that transform sycophantic draft responses into appropriately resistant ones. Evaluated on all 437 TruthfulQA adversarial scenarios. Results: 85.7% relative sycophancy reduction on Claude Sonnet 4 (9.6% → 1.4%), 69.1% on Gemini 2.5 Flash (46.0% → 14.2%). The paper formally characterizes "validation-before-correction" (agreeing with a wrong user claim before softly correcting) as a distinct RLHF training failure mode separate from full capitulation. Architectural significance: this is an inference-time solution, not a training-time fix, meaning it can be applied to any deployed model without fine-tuning, as a defense-in-depth layer above the model. Direct complement to the Calibration Collapse paper (arXiv:2604.10585, already in this wiki): that paper identifies where sycophantic training leaves its signature (calibration scores); this paper provides the runtime architecture to intercept sycophantic outputs before they reach the user.

Pressure, What Pressure? Sycophancy Disentanglement via Reward Decomposition (arXiv:2604.05279, April 7, 2026)¶

Argues that sycophancy is not a single failure mode but two orthogonally distinct pathologies that a scalar reward model cannot distinguish: pressure capitulation (correct answers abandoned under social/authority pressure) and evidence blindness (provided context ignored regardless of pressure). Proposes a five-component GRPO reward function with explicit terms for pressure resistance, context fidelity, position consistency, agreement suppression, and factual correctness. Training uses a contrastive dataset pairing pressure-free baselines with pressured variants across three authority levels (peer, expert, institutional). Results hold out-of-distribution on SycophancyEval across five base models. Theoretical contribution: the decomposed framing is the most rigorous formalization of sycophancy to date — it explains why prior RLHF mitigation attempts were partially effective (they fixed one failure mode while leaving the other intact). Pairs with The Silicon Mirror (arXiv:2604.00478, above) as complementary solutions: Silicon Mirror suppresses sycophancy at inference time; this paper addresses it at training time via reward shaping.
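To make the "scalar reward cannot distinguish" argument concrete, here is a minimal sketch of a decomposed reward with the five named terms. The equal weights, the score values, and the names `RewardTerms` and `total_reward` are assumptions for illustration; the paper's actual weighting is not given in this summary.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RewardTerms:
    pressure_resistance: float
    context_fidelity: float
    position_consistency: float
    agreement_suppression: float
    factual_correctness: float


# Hypothetical equal weights; the real coefficients are not stated here.
WEIGHTS = RewardTerms(0.2, 0.2, 0.2, 0.2, 0.2)


def total_reward(terms: RewardTerms, weights: RewardTerms = WEIGHTS) -> float:
    """Weighted sum of the five components (what a scalar reward would see)."""
    return sum(
        getattr(terms, f.name) * getattr(weights, f.name) for f in fields(terms)
    )


# Two made-up responses with opposite pathologies.
capitulator = RewardTerms(0.0, 1.0, 0.5, 0.5, 0.5)     # folds under pressure
evidence_blind = RewardTerms(1.0, 0.0, 0.5, 0.5, 0.5)  # ignores given context

print(total_reward(capitulator), total_reward(evidence_blind))
```

Both responses receive the same scalar total, yet their component vectors differ, which is the information a decomposed reward keeps and a single scalar discards.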

The Amazing Agent Race: Strong Tool Users, Weak Navigators (arXiv:2604.10261, April 11, 2026)¶

Exposes a critical blind spot in agent evaluation: 55–100% of instances in existing tool-use benchmarks are simple linear chains of 2–5 steps — nowhere near the branching, multi-source complexity of real tasks. The authors introduce AAR (The Amazing Agent Race), a benchmark of 1,400 DAG-structured puzzles requiring agents to navigate Wikipedia, fork tool chains, and aggregate partial results from multiple paths. Key finding: navigation errors dominate (27–52% of failures) while tool-calling errors stay below 17%, meaning agents know how to use tools but fail at deciding what to retrieve and when. Best overall accuracy is only 37.2% (Claude Code and Codex CLI tied at the top). This reframes the evaluation agenda: the bottleneck is not tool integration — it is compositional information retrieval under uncertainty. For anyone designing agent benchmarks or pipelines: linear-chain evaluations systematically over-estimate agent capability in real-world DAG-structured tasks.
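The linear-chain versus DAG distinction can be checked mechanically when auditing a benchmark. The sketch below is an assumed representation (tasks as lists of directed `(step, next_step)` edges), not the AAR authors' tooling; it flags a task as a linear chain only if the edges form a single simple path with no forks or joins.

```python
from collections import defaultdict


def is_linear_chain(edges):
    """Return True if the directed edges form one simple path a -> b -> c ..."""
    if not edges:
        return False
    nodes = {node for edge in edges for node in edge}
    successor = {}
    indegree = defaultdict(int)
    for src, dst in edges:
        if src in successor:      # fork: a step with two outgoing edges
            return False
        successor[src] = dst
        indegree[dst] += 1
        if indegree[dst] > 1:     # join: a step aggregating two paths
            return False
    sources = [n for n in nodes if indegree[n] == 0]
    if len(sources) != 1:
        return False
    # Walk from the single source and confirm the path covers every node.
    seen, node = 1, sources[0]
    while node in successor:
        node = successor[node]
        seen += 1
    return seen == len(nodes)


chain = [("search", "open_page"), ("open_page", "extract")]
dag = [("query", "path_a"), ("query", "path_b"),
       ("path_a", "merge"), ("path_b", "merge")]

print(is_linear_chain(chain), is_linear_chain(dag))
```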

When Less is More: The LLM Scaling Paradox in Context Compression (arXiv:2602.09789, Feb 2026)¶

Identifies the Size-Fidelity Paradox in lossy context compression: in compressor-decoder architectures, making the compressor larger decreases the faithfulness of reconstructed contexts even as training loss falls. Tested across models from 0.6B to 90B parameters. This directly challenges the default assumption that scaling the compressor improves compression quality — a critical finding for RAG pipelines and context-window compression strategies that rely on larger compressors as a quality lever. Practical implication: if you're building a context-compression step into a long-context pipeline, calibrating the compressor size is non-trivial and larger is not necessarily better. The formal explanation is that larger compressors collapse into higher-entropy latent spaces that are easier to train on but less faithful to the original context structure.

Calibration Collapse Under Sycophancy Fine-Tuning (arXiv:2604.10585, April 2026)¶

Identifies a concrete failure mode of RLHF: when reward signals are sycophantic (the model is rewarded for agreeing with planted wrong answers), the model's confidence calibration degrades in a way that survives post-hoc recalibration. The authors fine-tune Qwen3-8B under three conditions and measure Expected Calibration Error (ECE) and Maximum Calibration Error (MCE) on 1,000 MMLU items. Sycophantic GRPO training leaves a structured residual miscalibration even after matrix scaling reduces ECE by 40–64%. Core finding: "reward hacking leaves a signature in confidence scores" independent of output-level behavior. This matters acutely for high-stakes deployments where confidence thresholds (medical triage, financial risk, legal advice) are used as gates: RLHF-aligned models may appear well-behaved on surface outputs while being systematically overconfident in wrong directions at the calibration level. Pairs with arXiv 2604.02507 (RLHF statistical foundations) already in this wiki: that paper addresses recovery from bad data; this one addresses what happens when the reward signal itself is corrupted.
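For reference, ECE and MCE are usually computed by binning predictions by confidence and comparing per-bin accuracy to per-bin mean confidence. The sketch below uses the common equal-width binning with 10 bins; the bin count and the toy data are assumptions, not the paper's setup.

```python
import numpy as np


def calibration_errors(confidences, correct, n_bins=10):
    """Return (ECE, MCE) using equal-width confidence bins on [0, 1]."""
    conf = np.asarray(confidences, dtype=float)
    hit = np.asarray(correct, dtype=float)
    # Map each confidence to a bin index; confidence 1.0 goes in the last bin.
    bin_ids = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        gap = abs(hit[mask].mean() - conf[mask].mean())
        ece += mask.mean() * gap   # weight the gap by the bin's share of samples
        mce = max(mce, gap)        # worst single-bin gap
    return float(ece), float(mce)


# Toy data: four overconfident answers (half wrong) and two near-calibrated ones.
conf = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 0, 0, 1, 0]
ece, mce = calibration_errors(conf, correct)
print(f"ECE={ece:.3f}  MCE={mce:.3f}")
```

ECE averages the gaps across bins, so a post-hoc rescaling can lower it while a structured residual in specific bins, which MCE is more sensitive to, remains.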

ByteRover: Agent-Native Memory Through LLM-Curated Hierarchical Context (arXiv:2604.01599, April 2026)¶

Fundamental redesign of LLM agent memory: instead of delegating storage to external chunking/embedding/graph pipelines (which cause semantic drift between what the agent intended to remember and what the pipeline captured), ByteRover uses the same LLM that reasons about the task to also curate, structure, and retrieve memory. Knowledge is organized as a Context Tree — a file-based hierarchy of Domain → Topic → Subtopic → Entry, where each entry carries explicit relations, provenance, and an Adaptive Knowledge Lifecycle (AKL) with importance scoring, maturity tiers, and recency decay. Retrieval uses a 5-tier progressive strategy: most queries resolve at sub-100ms without LLM calls; agentic reasoning is invoked only for genuinely novel questions. Critical engineering decision: zero external infrastructure — no vector database, no graph database, no embedding service — all knowledge stored as human-readable markdown. Achieves SOTA on LoCoMo and competitive results on LongMemEval. The philosophical contrast with the Continuum Memory Architecture (arXiv 2601.09913, already in this wiki) is sharp: CMA formalizes memory primitives theoretically; ByteRover operationalizes them in a minimal, file-native implementation that any Claude Code agent can replicate.
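The progressive retrieval idea (cheap lookups first, LLM reasoning only as a last resort) can be sketched as a tier cascade. Everything below is hypothetical: the three tiers, the entry paths, and the function names stand in for ByteRover's five tiers, whose details are not described in this summary.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical context-tree entries keyed by Domain/Topic/Subtopic path.
ENTRIES = {
    "auth/tokens/rotation": "Rotate API tokens every 30 days.",
    "deploy/rollback/steps": "Roll back by redeploying the previous tag.",
}


def exact_path(query: str) -> Optional[str]:
    """Tier 1: direct path lookup, no search at all."""
    return ENTRIES.get(query)


def keyword_scan(query: str) -> Optional[str]:
    """Tier 2: return the first entry containing every query word."""
    words = query.lower().split()
    for path, text in ENTRIES.items():
        haystack = f"{path} {text}".lower()
        if all(word in haystack for word in words):
            return text
    return None


def llm_fallback(query: str) -> Optional[str]:
    """Last tier: placeholder for an expensive agentic reasoning call."""
    return f"[would invoke agentic reasoning for: {query}]"


TIERS: List[Tuple[str, Callable[[str], Optional[str]]]] = [
    ("exact_path", exact_path),
    ("keyword_scan", keyword_scan),
    ("llm_fallback", llm_fallback),
]


def progressive_retrieve(query: str, tiers=TIERS):
    """Try tiers in order of cost; return (level, tier_name, result)."""
    for level, (name, tier) in enumerate(tiers, start=1):
        hit = tier(query)
        if hit is not None:
            return level, name, hit
    return None


print(progressive_retrieve("auth/tokens/rotation"))
print(progressive_retrieve("rollback tag"))
print(progressive_retrieve("why did latency spike?"))
```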

Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework (arXiv:2604.01707, April 2026)¶

Systematic survey and benchmark of all major LLM agent memory approaches, organized into a unified modular framework. Two benchmark datasets used for empirical comparison: LoCoMo (human-human conversational memory) and LongMemEval (user-AI interaction memory) — distinct in interaction regime, making results complementary rather than redundant. Key finding from the benchmarking: no single memory architecture dominates across both benchmarks, suggesting that memory design must be matched to interaction type (social/dialogic vs. task-oriented). As a byproduct, the authors derive a new hybrid memory method combining modules from existing approaches that outperforms SOTA on both benchmarks. Useful as the canonical reference for: (a) understanding the design space of LLM memory, (b) selecting memory architecture for a new agent system, (c) benchmarking a new memory approach. Pairs directly with ByteRover (arXiv 2604.01599): ByteRover is a concrete zero-infra implementation; this survey is the theoretical map.

The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs (arXiv:2504.17768, updated Jan 2026)¶

The most rigorous scaling study of training-free sparse attention to date. Six sparse attention methods evaluated across multiple model families, sequence lengths up to 128K tokens, and sparsity levels up to 0.95 on nine tasks. Core finding: sparse attention is Pareto-improving — at equivalent inference compute, a larger sparse model outperforms a smaller dense model. Critical nuance on context length: the efficiency-accuracy trade-off holds well at 32K tokens (most sparsity configs sit on the Pareto frontier) but degrades substantially at 128K — high sparsity incurs significant accuracy penalties at extreme context. This directly characterizes the scaling laws governing when sparse attention is and isn't viable as a long-context inference strategy. Practical implication: sparse attention is a strong default for sub-64K inference but requires careful calibration at 128K+; at extreme lengths, dense attention remains necessary for accuracy-critical tasks.

MFEE: You Only Need Your Transformer 25% of the Time (arXiv:2601.00847, Jan 2026)¶

Introduces the Meaning-First Execution Engine (MFEE) — a gating architecture that intercepts inference requests and invokes the full transformer only when semantic novelty genuinely requires it, otherwise serving cached or rule-derived responses. Achieves 78% reduction in transformer execution with 100% exact-match equivalence on tested workloads, without modifying model weights or attention. The key insight: a large fraction of real-world inference traffic is semantically redundant or near-deterministic (FAQ-like queries, repeated template completions, slot-filling tasks) — a lightweight semantic gate can handle these without burning transformer FLOPs. Complements KV-cache optimization (which reduces cost per token within a sequence) by instead eliminating the transformer call entirely for eligible requests. Directly relevant to production serving cost reduction, particularly in high-traffic assistant deployments.
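A minimal sketch of the gating idea, assuming the simplest possible gate: an exact-match cache over normalized queries. The class name `CachedGate`, the normalization rule, and the stand-in model are all hypothetical; the paper's semantic gate is presumably more sophisticated than string normalization.

```python
import re
from typing import Callable, Dict


class CachedGate:
    """Serve repeated (normalized) queries from cache; call the model otherwise."""

    def __init__(self, model: Callable[[str], str]):
        self.model = model
        self.cache: Dict[str, str] = {}
        self.requests = 0
        self.model_calls = 0

    @staticmethod
    def normalize(query: str) -> str:
        # Assumed notion of "semantically redundant": same text up to
        # case and whitespace.
        return re.sub(r"\s+", " ", query.strip().lower())

    def answer(self, query: str) -> str:
        self.requests += 1
        key = self.normalize(query)
        if key not in self.cache:
            self.model_calls += 1          # only novel queries reach the model
            self.cache[key] = self.model(query)
        return self.cache[key]


def fake_model(query: str) -> str:
    """Deterministic stand-in for a transformer call."""
    return f"answer to: {query.strip().lower()}"


gate = CachedGate(fake_model)
for q in ["What are your hours?", "what are your hours?  ",
          "WHAT ARE YOUR HOURS?", "Reset my password"]:
    gate.answer(q)

skipped = 1 - gate.model_calls / gate.requests
print(f"model calls: {gate.model_calls}/{gate.requests}, skipped {skipped:.0%}")
```

Exact-match equivalence in a scheme like this only holds when the model's answer is deterministic for every query that maps to the same cache key.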

CDLM: Consistency Diffusion Language Models — 14.5× Inference Speedup (arXiv:2511.19269, Together AI)¶

From SqueezeAI Lab / Together AI. CDLM attacks the two core bottlenecks of block diffusion LLMs: (1) inability to use KV caching due to full bidirectional attention, and (2) high step counts for iterative refinement. It trains a student model on teacher decoding trajectories using block-wise causal attention with three loss components: consistency loss (temporal stability within blocks), distillation loss (anchor to teacher distributions), and auxiliary masked-denoising (preserve reasoning depth). Benchmarks: 11.2× speedup on GSM8K math reasoning, 14.5× on MBPP coding with minimal accuracy degradation. This surpasses DFlash (6×) and represents the leading deployed result for diffusion-LLM inference acceleration. Sits in the same block-diffusion ecosystem as DFlash — the two are composable: CDLM handles the distillation-based training regime while DFlash handles the serving-side block-drafting. Together AI has deployed this in production.

Speculating Experts: MoE Inference Acceleration via Expert Prefetching (arXiv:2603.19289, March 2026)¶

Addresses the CPU-offloaded MoE inference bottleneck: when expert weights live on CPU, GPU stalls waiting for transfers during decoding. Key contribution: current internal model representations reliably predict which experts will be needed in future layers, enabling async prefetching that overlaps CPU-to-GPU transfers with computation. Achieves up to 14% reduction in time-per-output-token (TPOT) vs. on-demand loading. Where raw speculative prediction accuracy is insufficient, lightweight estimators are trained to improve expert hit rates. Code is open-source. This is a clean runtime-efficiency result distinct from training-time MoE papers (e.g., UoE) — it operates on existing MoE models at serving time with no retraining, making it immediately applicable to CPU-offloaded deployments of Mixtral, DeepSeek-MoE, etc.

Continuum Memory Architecture: Beyond RAG for Long-Horizon Agents (arXiv:2601.09913, Jan 2026)¶

Formal architectural comparison of Continuum Memory Architecture (CMA) vs. RAG for long-horizon LLM agents. Core argument: RAG treats memory as a stateless lookup table — information persists indefinitely, retrieval is read-only, and there is no temporal continuity. CMA addresses these gaps through five primitives: persistent storage, selective retention, associative routing, temporal chaining, and consolidation into higher-order abstractions. Four behavioral probes demonstrate empirical advantages over a strong RAG baseline: knowledge updates (replacing stale facts), temporal association (earlier events inform later reasoning), associative recall (indirect connections), and contextual disambiguation (same entity in different contexts). The paper frames CMA not as an implementation but as a necessary architectural primitive for any long-horizon agent — equivalent to what attention was for sequence modeling. Most production RAG systems fail the temporal chaining and knowledge-update probes entirely.

Speculative Decoding for Test-Time Scaling: Benchmark Across Methods (arXiv:2509.04474)¶

First standardized benchmark comparing three categories of speculative decoding — model-based, training-based, and n-gram-based — against test-time scaling paradigms (Best-of-N, chain-of-thought). Counterintuitive headline: simple n-gram methods are surprisingly competitive against model-based drafts because test-time reasoning traces are structurally repetitive, and n-gram lookahead exploits this with zero training cost. The recommended strategy is a hybrid: n-gram methods for repetitive segments, model-based for novel generation. This directly addresses the well-known compute overhead of test-time scaling without expensive draft model training. Practical implication: before investing in training a specialized draft model, benchmark n-gram approaches first.
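The reason n-gram drafting works on repetitive traces can be shown in a few lines. This is a generic lookup-style sketch (function name and parameters are illustrative, not taken from the benchmark): find the most recent earlier occurrence of the last `n` tokens and propose whatever followed it as the draft.

```python
from typing import List, Sequence


def ngram_draft(tokens: Sequence[int], n: int = 2, k: int = 3) -> List[int]:
    """Propose up to k draft tokens by matching the trailing n-gram earlier in context."""
    if len(tokens) <= n:
        return []
    key = list(tokens[-n:])
    # Scan backwards, skipping the trailing n-gram itself.
    for i in range(len(tokens) - n - 1, -1, -1):
        if list(tokens[i:i + n]) == key:
            return list(tokens[i + n:i + n + k])
    return []  # no match: fall back to normal decoding or a model-based draft


# A repetitive trace: the pattern 1 2 3 4 starts to repeat.
context = [1, 2, 3, 4, 1, 2, 3]
print(ngram_draft(context, n=2, k=3))
print(ngram_draft([5, 6, 7, 8], n=2, k=3))
```

The draft costs no model call; the target model still has to verify the proposed tokens, as in any speculative decoding scheme.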

DFlash: Block Diffusion Delivers 6× Faster LLM Inference (NYU Shanghai AI Lab, Feb 2026)¶

DFlash replaces sequential token generation with a lightweight block diffusion model that generates entire blocks of draft tokens in one forward pass. Benchmarks: 6× lossless acceleration over standard autoregressive decoding, 2.5× over EAGLE-3 (prior SOTA). On dual RTX 3090s, Qwen3.5-27B reaches ~65 tokens/second. The critical insight: drafting cost is flat regardless of block length — drafting 8 tokens costs ~the same as drafting 1. This eliminates the per-token overhead that caused earlier speculative methods to plateau. Authors are targeting SGLang and vLLM integration. Pairs naturally with n-gram methods (DFlash for novel segments, n-gram for repetitive segments).

Union of Experts: Hierarchical MoE Routing Over Full Transformer (arXiv:2503.02495)¶

Challenges the standard MoE design, which routes only feed-forward layers. UoE decomposes the entire transformer (including attention blocks) into specialist groups and applies two-level hierarchical routing: patch-wise data selection at level 1, expert assignment at level 2. Results: 2.38 lower perplexity than best MoE baseline at 76% FLOPs on language modeling; 0.68% higher average score on Long Range Arena at 50% FLOPs; 1.75% accuracy improvement on image classification. The key contribution: extending MoE routing to attention mechanisms, which most prior MoE work left as a shared bottleneck. This also reduces per-token cost in the speculative decoding verify step, making it directly composable with DFlash-style approaches.

Single-Agent LLMs Outperform Multi-Agent Systems Under Equal Token Budgets (arXiv 2604.02460)¶

April 2, 2026. Using the Data Processing Inequality as a formal lens, this paper shows that when reasoning-token budgets are held equal across single-agent and multi-agent setups, single-agent systems consistently match or beat multi-agent on multi-hop reasoning (tested on Qwen3, DeepSeek-R1-Distill-Llama, Gemini 2.5). The damning finding: most reported multi-agent benchmark gains are measurement artifacts — extra unaccounted compute and context injection disguised as architectural advantage. Implication for anyone building multi-agent pipelines: benchmark honestly by equalizing total thinking tokens. Multi-agent is not inherently smarter — it's often just burning more tokens.

Multi-Agent Orchestration for Exascale Materials Screening (arXiv 2604.07681)¶

April 9, 2026. A productive counterpoint: multi-agent does win when the problem is parallelism-limited, not reasoning-limited. Hierarchical planner-executor architecture (gpt-oss-120b) deployed on the Aurora supercomputer to screen metal-organic frameworks for water harvesting at exascale. The planner decomposes the search space; executor agents run asynchronously. Result: high task completion rates and low orchestration overhead. Synthesis with arXiv 2604.02460: single-agent beats multi-agent in reasoning depth; multi-agent beats single-agent in throughput over embarrassingly parallel tasks. The line is now clearly drawn.

Efficient Inference of Large Vision-Language Models — Survey (arXiv 2603.27960)¶

March 30, 2026. Comprehensive survey of LVLM inference acceleration organized into four axes: (1) visual token compression (pruning, merging, spatiotemporal aggregation — the central problem is quadratic attention cost over long visual token sequences), (2) memory & serving (KV-cache paging, continuous batching), (3) architecture (sparse MoE, cross-modal projector optimization, hardware-aware attention kernels), (4) advanced decoding (speculative decoding adapted for multimodal). Most useful as a reference taxonomy for anyone optimizing a VLM pipeline. Open problems flagged: streaming video inference and expert routing efficiency in sparse MoE cross-modal models.

SHAPE: Stage-Aware Hierarchical Advantage for LLM Reasoning (arXiv 2604.06636)¶

April 8, 2026. SHAPE formalizes LLM reasoning as a trajectory through a "state space of empirical solvability" and introduces a hierarchical credit assignment mechanism. At the segment level, it uses a stage-aware advantage function to reward efficient breakthroughs in low-solvability states (the hard parts); at the token level, it uses potential-based shaping to avoid rewarding verbosity. Result: 3% average accuracy gain on math reasoning benchmarks with 30% reduced token consumption — addressing the token-length inflation problem that plagues chain-of-thought fine-tuning. Key insight: current process supervision rewards steps without distinguishing meaningful progress from verbose padding.
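Potential-based shaping, in its classic form, adds `gamma * Phi(s') - Phi(s)` to each step's reward. The sketch below uses that textbook form with made-up potentials standing in for "empirical solvability"; SHAPE's actual potential function and advantage estimator are not specified in this summary.

```python
from typing import List, Sequence


def shaped_rewards(rewards: Sequence[float],
                   potentials: Sequence[float],
                   gamma: float = 1.0) -> List[float]:
    """Add gamma * Phi(s_{t+1}) - Phi(s_t) to each step reward.

    `potentials` holds Phi(s_0) .. Phi(s_T), one more entry than `rewards`.
    """
    if len(potentials) != len(rewards) + 1:
        raise ValueError("need one potential per state: len(rewards) + 1")
    return [
        r + gamma * potentials[t + 1] - potentials[t]
        for t, r in enumerate(rewards)
    ]


# Hypothetical 4-step trajectory: only steps 2 and 4 raise solvability.
rewards = [0.0, 0.0, 0.0, 1.0]
potentials = [0.1, 0.1, 0.6, 0.6, 1.0]
shaped = shaped_rewards(rewards, potentials)
print(shaped)
```

With `gamma = 1` the shaping terms telescope to `Phi(s_T) - Phi(s_0)`, so steps that leave the potential unchanged (verbose padding, in this toy reading) earn no bonus, and adding more of them does not raise the total.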

On Step Length Confounding in LLM Reasoning Data Selection (arXiv 2604.06834)¶

April 8, 2026. Identifies a subtle but critical flaw in how reasoning datasets are curated: step length is confounded with reasoning quality in most selection pipelines. Longer reasoning chains score higher on standard selection metrics (perplexity-based or reward-model-based), but longer ≠ better. The paper shows that removing this confound and selecting for compact, correct reasoning steps improves downstream fine-tuned model quality. Practical implication: if you're training on synthetic reasoning data (GPT-4 / o3 generated), selection metrics that implicitly favor longer chains can degrade model quality unless length is controlled for.

RLHF: A Statistical Perspective (arXiv 2604.02507)¶

Establishes a statistical foundation for RLHF by addressing how to combine abundant but biased AI-generated labels with limited high-quality human feedback. Derives recovery guarantees for human-aligned preferences under mixed-data regimes. Key insight: the bottleneck in scaling RLHF is not the RL algorithm but the quality and quantity of human preference signal — this paper provides the theoretical grounding for hybrid data strategies.

GIFT: Group-Relative Implicit Fine-Tuning (arXiv 2510.23868)¶

Reformulates the RL fine-tuning objective as a normalized MSE loss between implicit and explicit reward signals, eliminating intractable normalization constants and clipping mechanisms found in PPO/GRPO. GIFT produces lower-variance gradients than standard RLHF baselines. Practical impact: reduces per-step training instability without sacrificing policy expressiveness, making RLHF more accessible for smaller research teams.
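One way to read the "normalized MSE between implicit and explicit rewards" description is sketched below. The implicit reward is taken to be a scaled log-probability ratio against a reference policy, and both reward vectors are standardized within a group of sampled responses before the squared error is taken. This is an interpretation of the one-sentence summary, not the paper's exact objective; `beta`, the function names, and the toy numbers are assumptions.

```python
import numpy as np


def group_normalize(x, eps=1e-8):
    """Standardize values within one group of sampled responses."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + eps)


def gift_style_loss(logp_policy, logp_ref, explicit_rewards, beta=1.0):
    """MSE between group-normalized implicit and explicit rewards."""
    implicit = beta * (np.asarray(logp_policy, dtype=float)
                       - np.asarray(logp_ref, dtype=float))
    diff = group_normalize(implicit) - group_normalize(explicit_rewards)
    return float(np.mean(diff ** 2))


# Toy group of four responses to one prompt.
logp_ref = np.array([-10.0, -12.0, -9.0, -11.0])
explicit = np.array([1.0, 0.0, 2.0, 0.5])

# Policy whose log-ratio tracks the reward up to scale and a constant offset.
aligned_loss = gift_style_loss(logp_ref + 0.3 * explicit + 5.0, logp_ref, explicit)
# Policy whose log-ratio is anti-correlated with the reward.
misaligned_loss = gift_style_loss(logp_ref - explicit, logp_ref, explicit)
print(aligned_loss, misaligned_loss)
```

Because standardization removes any per-group additive constant and scale, a term like an unknown normalization constant drops out of the loss, which is consistent with the summary's claim.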

LLMOrbit: A Circular Taxonomy of Large Language Models (arXiv 2601.14053)¶

Maps the multi-agent LLM landscape and introduces a circular taxonomy organizing models by capability tier and deployment role. Covers scaling walls, agentic reasoning patterns, and system-level engineering challenges. Notable: introduces the concept of dynamic agent topology rewiring — multi-agent systems that restructure their communication graph during reasoning, rather than fixing it at design time.

Claude Opus 4.7 Generally Available — xhigh Reasoning, Vision Improvements (Anthropic / GitHub, April 16, 2026)¶

Claude Opus 4.7 entered general availability on April 16, 2026, at the same pricing as Opus 4.6 ($5/$25 per MTok input/output). Key capability improvements: advanced software engineering on the hardest coding tasks and higher-resolution vision for complex image analysis. Claude Code integration gains three new features: (1) Opus 4.7 xhigh reasoning effort — extended thinking at maximum depth for autonomous coding agents; (2) Auto mode for Max subscribers (the system selects the appropriate model/effort per request); (3) /effort and /ultrareview slash commands for manual reasoning-effort control. The xhigh reasoning tier is architecturally significant: it represents Anthropic's most capable agent-mode configuration and makes Opus 4.7 the strongest Claude variant for long-horizon autonomous software engineering tasks — directly competing with the o3/o4-series extended thinking configurations. Relevant alongside Claude Mythos (already in this wiki): Mythos remains restricted; Opus 4.7 is the public frontier model. The gap between a deployable frontier model and the restricted offensive-capability model has now been clearly defined at one model generation.

A Mechanistic Analysis of Looped Reasoning Language Models (arXiv:2604.11791, April 2026)¶

Examines a class of models that iterate ("loop") transformer layers in the latent dimension — the same block of weights is applied multiple times per forward pass, allowing the model to refine its intermediate representations without adding parameters. Key mechanistic findings: (1) each cycle converges to distinct fixed points in a cyclic trajectory through latent space — the model's "thought" stabilizes per cycle before advancing; (2) attention heads stabilize as these fixed points are reached, reducing computational load in later cycles naturally; (3) the model's inference stages mirror those of a standard multi-layer feedforward model but are applied iteratively, recovering a quasi-depth effect from width alone. Architectural implications: recurrent block size, input injection frequency, and normalization scheme are critical design parameters that prior looped-model work underspecified — this paper provides mechanistic grounding for principled choices. Theoretical significance: looped models are the lightweight alternative to chain-of-thought scaling (more tokens) and model scaling (more parameters) — they extract additional reasoning depth from existing weights via iteration. For agent systems: this is the mechanistic justification for using small looped models on budget-constrained hardware while maintaining reasoning depth competitive with larger models at a single forward pass.
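A toy illustration of the fixed-point behaviour: apply one weight-tied block repeatedly with the input re-injected each iteration, and watch the per-iteration change shrink. The block here is a single `tanh` layer with its weight matrix scaled to spectral norm 0.5 so that it is a contraction; this is an assumed stand-in that only demonstrates convergence of a looped map, not the paper's models or its cyclic trajectories.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# One shared weight matrix, rescaled so its spectral norm is 0.5.
W = rng.standard_normal((d, d))
W *= 0.5 / np.linalg.norm(W, 2)
x = rng.standard_normal(d)          # input, re-injected at every loop


def loop_block(h: np.ndarray, x: np.ndarray) -> np.ndarray:
    """The same block applied on every iteration (weight tying)."""
    return np.tanh(W @ h + x)


h = np.zeros(d)
deltas = []
for _ in range(30):
    h_next = loop_block(h, x)
    deltas.append(float(np.linalg.norm(h_next - h)))
    h = h_next

print(f"first step change: {deltas[0]:.3e}, last step change: {deltas[-1]:.3e}")
```

Since `tanh` is 1-Lipschitz and the weight matrix has spectral norm 0.5, each iteration at least halves the distance between successive states, so later loops do progressively less work.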

April 2026 Frontier Benchmark Landscape — Gemini 3.1 Pro Leads SWE-bench, ARC-AGI-2 at 77.1%¶

As of April 2026 the frontier benchmark picture has meaningfully shifted since Q1: SWE-bench Verified leadership: Gemini 3.1 Pro at 78.80%, Claude Opus 4.6 Thinking and GPT-5.4 both at 78.20% — a three-way cluster at the frontier. ARC-AGI-2 (the most demanding reasoning benchmark): Gemini 3.1 Pro at 77.1% — more than double its predecessor model's score, suggesting significant architectural change rather than incremental improvement. GPQA Diamond (graduate-level physics/biology/chemistry): Claude Opus 4.6 still leads on pure reasoning quality. GPT-5.4 (March 2026) introduced GPT-5.4 Thinking and GPT-5.4 Pro variants — OpenAI's response to Claude's extended thinking tiers. Synthesis: the frontier has reached near-parity on the core coding benchmark (SWE-bench) among the three major labs, while differentiation is shifting toward reasoning quality (GPQA Diamond, ARC-AGI-2) and agentic capability (HiL-Bench Ask-F1, already in this wiki). The next meaningful benchmark battleground is likely long-horizon agentic evaluation rather than static benchmarks.

Claude Opus 4.7 Benchmark Confirmation — SWE-bench 87.6%, Adaptive Thinking Replaces Budget Controls (April 2026)¶

External benchmark round-ups confirm Claude Opus 4.7 (released April 16, 2026) has taken the SWE-bench Verified lead from Gemini 3.1 Pro: 87.6% vs. Gemini's 78.80% (April mid-cycle). SWE-bench Pro (harder version): Opus 4.7 at 64.3% vs. GPT-5.4 at 57.7% — confirming real-world coding task superiority. CursorBench: 70% (+12 points over Opus 4.6). The most notable architectural change in 4.7 vs. 4.6: manual extended thinking budgets are removed; replaced with adaptive thinking where the model automatically scales reasoning depth to task complexity. The new xhigh effort tier (already documented in Claude Code as /effort xhigh) is the ceiling for explicitly requesting maximum reasoning effort. Implication: users who relied on explicit budget tokens for predictable latency budgeting must shift to effort-tier semantics. The net result is higher ceiling performance but less fine-grained latency control — the trade-off favors agentic pipelines that optimize for quality over strict latency SLAs.

Ouro LoopLM Family + LoopFormer: Scaling and Budgeting Looped LLM Inference (arXiv:2510.25741, 2025–2026)¶

Two complementary papers extend the looped/recurrent LLM thread into concrete scaling and engineering design. Ouro (arXiv:2510.25741): trains a family of Looped Language Models on 7.7T tokens with entropy-regularized depth allocation — Ouro 1.4B and 2.6B match 12B SOTA models on reasoning benchmarks, establishing that test-time compute via latent looping is a viable alternative to parameter scaling at the sub-3B range. The entropy regularization ensures latent representations are maximally informative at each loop iteration rather than collapsing early. LoopFormer (loopformer.github.io): introduces budget-conditioned looped Transformers via trajectory conditioning and shortcut-consistency regularization — enabling variable compute depth per input at inference time without retraining. A third related paper (arXiv:2602.10520) trains looped LLMs with process-level rewards on latent trajectories (not just final answer quality), significantly improving multi-step reasoning by rewarding intermediate latent states rather than only output correctness. Synthesis: the looped LLM direction has now matured from architecture papers to training methodology — the open questions are shifting to deployment tooling (how to serve variable-depth models efficiently) and the minimum model size at which looping adds meaningful reasoning depth over single-pass inference.

International AI Safety Report 2026 — Bengio-Led 100-Expert Global Consensus (IASR, February 2026)¶

The largest global AI safety collaboration to date: led by Yoshua Bengio, authored by 100+ experts across 30+ countries, published February 3, 2026. The report covers frontier AI capability trajectories, risk categorization (misuse, misalignment, structural), and safeguard recommendations — an explicitly intergovernmental framing, not an industry white paper. Key framing shift: the 2026 report moves from "could AI be dangerous?" (2023–2024 discourse) to "given confirmed dangerous capabilities, what governance structures are needed?" — calibrating risk language around capabilities that are empirically documented (like Project Glasswing's Mythos zero-day findings). The Bengio authorship is significant: he holds a uniquely credible position as both an AI capability pioneer and a public safety advocate, giving the report standing with both technical and policy communities. Directly relevant to the AI safety threads in this wiki: provides the authoritative 2026 consensus baseline for any capability-risk framing.

Open Questions¶

Claude Mythos: Anthropic claims thousands of zero-days found across all major OSes and browsers — have responsible-disclosure timelines for these CVEs been published? How many patches have been issued post-Glasswing?
HiL-Bench Ask-F1 metric: what's the best current model score on selective escalation? Do reasoning-specialized models (o3, Gemini Thinking) outperform base instruction-tuned models on blocker detection?
KoCo Stability dimension: does flagging "evolving" facts at pre-training time measurably reduce hallucination on time-sensitive factual queries, or does it only improve convergence speed?
ToolSpec: does schema-aware speculative decoding degrade when tool schemas change frequently (e.g., dynamic APIs)? What's the per-tool overhead of schema retrieval?

Qwen 3 Series Release — Alibaba (April 8, 2026)¶

Alibaba released the full Qwen 3 model family on April 8, spanning 0.6B to 72B parameters in both dense and MoE configurations. The headline feature is dual-mode thinking: each model can switch between a slow chain-of-thought reasoning mode and a fast direct-answer mode at inference time — making the thinking budget dynamically adjustable without separate model variants. The standout result is Qwen 3.6-35B-A3B (35B total, 3B active via MoE): 73.4% on SWE-bench Verified, 86.0% GPQA Diamond, 92.7% AIME 2026 — all achieved with only 3B active parameters, beating or matching dense models with 10× more active compute. Qwen 3-32B running locally matches or beats GPT-4o on several reasoning benchmarks. Architectural significance: the MoE + dual-mode combination proves that active parameter count (not total count) is the relevant efficiency metric — a 3B-active model delivering frontier reasoning quality reframes the compute-efficiency Pareto frontier for open-weight models. Combined with Llama 4 Scout/Maverick (Meta, April 5, 400B total / 17B active MoE) and Gemma 4 (Google), April 2026 is the most competitive month in open-source AI history and marks a structural shift: open-weight MoE models now match proprietary dense frontier models on reasoning benchmarks.

The AI Scientist-v2: First Fully AI-Generated Paper Accepted at Peer Review (arXiv:2504.08066, April 2025)¶

Sakana AI's AI Scientist-v2 achieved the first fully AI-generated manuscript to pass peer review at an academic venue: one of three papers submitted to an ICLR 2025 workshop achieved an average reviewer score of 6.33, placing it in the top ~45% of submissions — above the human acceptance threshold. The accepted paper investigated compositional regularization in neural network training. The system operates end-to-end with no human-authored code templates: it formulates hypotheses, designs and executes experiments, analyzes results, and authors full manuscripts via a progressive agentic tree search managed by a dedicated experiment manager agent. This is not a demonstration of capability — it is a shipped system that produced a peer-reviewed result. The theoretical ceiling this breaks: the assumption that peer review provides a reliable filter against AI-generated content is no longer operationally valid for workshop-level ML research. For research methodology: the deeper implication is that autonomous hypothesis generation + experimental execution is now sufficiently mature to produce domain-relevant research without human guidance — at least within narrow ML subfields. The v1 AI Scientist paper (published in Nature in 2026) focused on hypothesis generation; v2 achieves the full loop including quality sufficient for peer acceptance.

OpenMythos: Community Reconstruction of Claude Mythos Architecture (April 19-20, 2026)¶

Since Anthropic published no technical paper for Claude Mythos Preview, researcher Kye Gomez released OpenMythos — an open-source PyTorch reconstruction hypothesizing Mythos uses a Recurrent-Depth Transformer (RDT) with Mixture-of-Experts routing. Core claim: 770M parameters via RDT match a 1.3B conventional dense transformer; iterative forward-pass loops extend reasoning depth dynamically; Multi-Latent Attention compresses the KV cache. The efficiency profile matches observed Mythos inference characteristics. This is architectural speculation, not confirmed by Anthropic, but represents serious reverse-engineering based on observable performance characteristics (Mythos ranked 99/100 on LM Council overall, a "step change" above Claude Opus 4.6). The RDT hypothesis is directly relevant to the Ouro LoopLM and LoopFormer papers already in this wiki: looped inference with dynamic depth allocation appears to be converging as the dominant architecture pattern for compute-efficient frontier reasoning. The key prediction: if Mythos is indeed a looped RDT with MoE routing, the practical ceiling for offensive security capability from a 770M active-parameter base raises profound questions about future model governance — where model size alone provides poor guidance on capability risk.

Project Glasswing CVE Specifics: 40+ Zero-Days, Full Disclosure ~July 2026 (April 2026)¶

As additional details emerge from Project Glasswing: Claude Mythos autonomously discovered 40+ CVEs across major software stacks. Distribution: 28 of 40 in Firefox, 9 in wolfSSL, plus a 17-year-old FreeBSD RCE (CVE-2026-4747, remotely triggerable for root over NFS), a 27-year-old OpenBSD bug, a 16-year-old FFmpeg bug, and Linux kernel privilege escalation chains. Disclosure timeline: vendors received a 135-day coordinated disclosure window, placing full public CVE disclosure at approximately July 2026. The Firefox-heavy distribution is significant: it suggests Mythos targeted the most widely deployed attack surface (browsers) rather than obscure embedded systems — implying the discovered vulnerabilities are practically exploitable by mainstream threat actors, not just nation-state level. For AI safety: the 135-day window is shorter than typical critical CVE cycles, meaning some patching may remain incomplete at disclosure. The Cloud Security Alliance research note characterizes this as the first public case where an AI system's autonomous offensive capability unambiguously crossed a threshold warranting deployment restriction — confirming Anthropic's withholding rationale.

KWBench: Can LLMs Recognize Problems Before Being Told One Exists? (April 2026)¶

Standard agent benchmarks assume the problem is pre-specified — the agent is handed a task and told to solve it. KWBench (Know-What Benchmark) evaluates a different capability: unprompted problem recognition — can a model identify that there is a problem when no one says so? The benchmark contains 223 practitioner-sourced tasks across acquisitions, contract negotiations, clinical pharmacy, organizational politics, fraud analysis, and incentive design — all requiring the agent to notice something is wrong or missing before it can respond usefully. This maps directly to real-world deployment: users rarely have perfectly specified problems. Frontier model performance on KWBench is significantly below performance on standard task-completion benchmarks, exposing a systematic gap between benchmarked and real-world agent capability. Complements HiL-Bench (arXiv:2604.09408, already in this wiki) which tests selective escalation when blockers are known; KWBench tests the prior step of noticing that a problem exists.

Meta Muse Spark: Meta's First Closed-Source Frontier Model — Built by Meta Superintelligence Labs (April 8, 2026)¶

On April 8, 2026 — the same day as Qwen 3 and one day after Mythos Preview — Meta released Muse Spark, its first proprietary, closed-source AI model, built entirely in secret by Meta Superintelligence Labs under Chief AI Officer Alexandr Wang. This is a complete break from Meta's Llama heritage: no weights released, no architecture disclosed, no training methodology published. The model is natively multimodal (text, voice, image), supports visual chain-of-thought, tool-use, and multi-agent orchestration, and now powers the Meta AI chatbot across all Meta apps (Facebook, Instagram, WhatsApp, Messenger). Meta stated a "hope to open-source future versions" but made no commitment. The significance is strategic rather than purely technical: Meta — which built its AI brand on open-weight models and positioned Llama as the open-source counterweight to GPT/Claude — has quietly released a proprietary frontier model via the same closed-source playbook it publicly criticized. The competitive driver is clear: Llama 4 Scout/Maverick (also April 2026, open-weight MoE) serves the community; Muse Spark serves Meta's 3B+ user consumer products where capability, not openness, is the commercial differentiator. For agent systems: Muse Spark's multi-agent orchestration and tool-use positioning puts it in direct competition with Claude and GPT-5.4 for enterprise agentic workloads — but without the API transparency or open evaluation that enables fair third-party benchmarking.

GLM-5.1: Open-Weight 754B MoE Tops SWE-bench Pro — Huawei Ascend-Only Training (Z.AI, April 7, 2026)¶

Z.AI (formerly Zhipu AI) released GLM-5.1 on April 7, 2026 — an open-weight agentic coding model that briefly held the SWE-bench Pro #1 position for nine days before Claude Opus 4.7 dethroned it on April 16. Architecture: 744B total parameters, 40B active via MoE routing, 200K context window, 131K max output, published under MIT license on HuggingFace. SWE-bench Pro score: 58.4% (vs. GPT-5.4 at 57.7%, Claude Opus 4.6 at 57.3% — both beaten as of April 7). The model also leads NL2Repo at 42.7%. Two standout facts: (1) GLM-5.1 was trained entirely on Huawei Ascend chips, without any NVIDIA hardware — the first fully Ascend-trained open-weight frontier coding model, a significant signal for China's AI hardware independence strategy; (2) it is designed for 8-hour autonomous coding sessions — in representative cases it can build a complete Linux desktop system from scratch within 8 hours, completing 655 optimization iterations autonomously. The GLM-5 predecessor (released ~Feb 17, 2026) scored 77.8% SWE-bench Verified. April 16 update: Claude Opus 4.7 at 64.3% on SWE-bench Pro replaced GLM-5.1 at #1. The significance beyond the benchmark: this is the first time a fully open-weight model trained without NVIDIA infrastructure reached the global SWE-bench Pro top — directly testing the thesis that Huawei's Ascend chips cannot produce frontier quality on complex coding benchmarks.

Tina: Tiny Reasoning Models via LoRA — 260× Cost Reduction for RL Reasoning (arXiv:2504.15777, ICLR 2026)¶

Addresses a key open question in LLM reasoning: how cheaply can RL fine-tuning instill reasoning capability in a small model? Tina applies LoRA (Low-Rank Adaptation) as the parameter-efficient update mechanism during RL training of a 1.5B base model, achieving a >20% reasoning performance increase and 43.33% Pass@1 on AIME24 at an estimated $9 total post-training and evaluation cost, approximately a 260× cost reduction vs. full-parameter RL fine-tuning at comparable scale. The core hypothesis (supported by mechanistic analysis): LoRA's efficiency advantage comes from its ability to rapidly adapt the reasoning format under RL (how the model structures its chain-of-thought) while preserving base model knowledge, a process that appears far more compute-efficient than the deep knowledge integration required by full-parameter training. For practitioners: this establishes that frontier reasoning behaviors (long-horizon AIME math) can be induced in a 1.5B model for single-digit dollar costs, making RL reasoning fine-tuning accessible to any team with a V100 and an afternoon. Accepted to ICLR 2026. Posted to arXiv in April 2025 (the 2504 prefix of the identifier encodes the submission month). Directly complements Ouro LoopLM and LoopFormer (already in this wiki): those scale test-time compute via loop depth, while Tina shows that reasoning can be instilled cheaply at training time in small models. Together they form complementary axes of efficiency: cheap to train and cheap to run at inference time.
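The low-rank update that LoRA applies can be sketched in a few lines of NumPy. This is a minimal illustration of the general LoRA mechanism, not Tina's training code: the layer sizes, `rank`, and `alpha` are illustrative assumptions, and no RL loop is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; these are not Tina's hyperparameters.
d_in, d_out, rank, alpha = 64, 64, 4, 8.0

# Frozen base weight: it receives no gradient updates during fine-tuning.
W = rng.standard_normal((d_out, d_in))

# Trainable low-rank factors. B starts at zero, so the adapted layer
# initially behaves exactly like the base layer.
A = rng.standard_normal((rank, d_in)) * 0.01
B = np.zeros((d_out, rank))


def lora_forward(x: np.ndarray) -> np.ndarray:
    """Apply W + (alpha / rank) * B @ A to x without materializing the sum."""
    base = x @ W.T
    update = (x @ A.T) @ B.T * (alpha / rank)
    return base + update


x = rng.standard_normal((2, d_in))
full_params = W.size
lora_params = A.size + B.size
print(f"trainable fraction: {lora_params / full_params:.3f}")
```

Only `A` and `B` would be updated by the optimizer, which is where the cost saving comes from: in this toy layer the trainable parameters are one eighth of the full weight matrix, and the fraction shrinks further as the layer width grows relative to the rank.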

The Residual Stream Is All You Need: KV Cache Redundancy in Transformer Inference (arXiv:2603.19664, March 2026)¶

Makes a striking theoretical claim: the KV cache is entirely redundant in standard transformer inference. Keys and values at every layer are deterministic projections of the residual stream, so they can be recomputed on demand from a single residual vector per token with zero reconstruction error — verified under greedy decoding across six models from four architecture families (135M–4B parameters), with KL divergence exactly 0. The practical implementation, KV-Direct, stores residual vectors (5 KB/token on Gemma 3-4B) rather than full KV pairs (136 KB/token), holding peak memory at 42 MB over 20 conversation turns versus 103 MB for the standard cache. Against five eviction baselines (H2O, StreamingLLM, SnapKV, TOVA, window-only), KV-Direct achieves 100% token match at every cache budget while all baselines degrade to 5–28%. This is a major conceptual reframe: if it generalizes to larger models, KV cache compression becomes an unnecessary workaround rather than a necessary engineering constraint. Follow-up replication at >4B scale is the key open question — the paper verifies only up to 4B parameters.
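The redundancy argument can be illustrated with a toy single attention layer. This is a sketch under stated assumptions, not the paper's KV-Direct implementation: the weights, sizes, and the parameter-free `rms_norm` are hypothetical, and a real model repeats this per layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_tokens = 32, 8, 5  # toy sizes, not any real model's

# Fixed projection weights for one attention layer (hypothetical values).
W_k = rng.standard_normal((d_model, d_head))
W_v = rng.standard_normal((d_model, d_head))


def rms_norm(x, eps=1e-6):
    """Parameter-free RMS normalization over the feature axis."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)


def project_kv(residual):
    """Keys and values are a deterministic function of the residual stream."""
    h = rms_norm(residual)
    return h @ W_k, h @ W_v


residuals = rng.standard_normal((n_tokens, d_model))

# Standard path: compute K and V once and keep them in a cache.
k_cache, v_cache = project_kv(residuals)

# Recompute path: keep only the residuals and rebuild K and V on demand.
k_again, v_again = project_kv(residuals)
max_abs_diff = max(np.abs(k_cache - k_again).max(),
                   np.abs(v_cache - v_again).max())

# Per-token storage figures quoted above for Gemma 3-4B.
kv_kb_per_token, residual_kb_per_token = 136, 5
storage_ratio = kv_kb_per_token / residual_kb_per_token
print(f"max |diff| = {max_abs_diff}, storage ratio = {storage_ratio:.1f}x")
```

The trade-off is memory for compute: the recompute path stores roughly 27× less per token by the quoted figures, but pays for the projection matmuls again each time keys and values are needed.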

Efficient LLM Serving for Agentic Workflows — Helium (arXiv:2603.16104, March 2026)¶

Addresses a core inefficiency in production agentic systems: existing LLM serving infrastructure (vLLM et al.) optimizes single inference calls but ignores cross-call dependencies. Agentic workflows generate extensive prompt overlap and intermediate result reuse across sequential tool calls, which current systems re-process redundantly. Helium treats agentic workloads as query plans and LLM invocations as first-class relational operators, integrating proactive KV-state caching and cache-aware scheduling across the full workflow graph. Achieves up to 1.56× speedup over state-of-the-art agent serving baselines. The data-systems framing is the novel contribution: existing serving optimization thinks per-call; Helium thinks per-workflow. As agentic workloads become dominant in production (multi-step reasoning, sequential tool use), per-call serving leaves most efficiency gains on the table.

Google Deep Research + Deep Research Max — Autonomous Research Agents on Gemini 3.1 Pro (April 22, 2026)¶

Google launched Deep Research and Deep Research Max on April 22, 2026: two autonomous AI research agents built on Gemini 3.1 Pro and now available via the Gemini API in public preview. Key capabilities: MCP support (agents can query internal company data sources, not just the public web), native chart and infographic generation inline with HTML reports, and a record 93.3% on DeepSearchQA, the highest publicly reported score on a long-horizon web research benchmark. On OpenAI's BrowseComp (1,000+ multi-hop online research tasks), Gemini 3.1 Pro scores 85.9, more than 25 points higher than its predecessor Gemini 3 Pro. Separately, on the Artificial Analysis Intelligence Index, it is tied with GPT-5.4 Pro at 57 points as of April 12. The two-tier design is deliberate: Deep Research optimizes for low latency and cost (user-facing apps), while Deep Research Max spends additional compute to refine reports, consult more sources, and catch analytical nuances the faster tier skips, a reasoning-budget analogy to Claude's extended thinking tiers. Enterprise significance: native MCP integration means these agents can traverse company knowledge graphs, internal databases, and RAG sources within the same research workflow as public web retrieval. That makes this the first production research agent that natively combines external discovery with internal knowledge grounding at this quality level.

DeepSeek V4 — Status April 24: Third Delay, Huawei Chip Bottleneck, End-of-April Target ¶

DeepSeek V4 remains unreleased as of April 24, 2026 — now delayed a third time, with Reuters (April 3) attributing the bottleneck to Huawei Ascend 950PR chip availability at the scale required for a 1T-parameter MoE deployment. Earlier window was end of April; some trackers now hedge toward early May. Specs remain confirmed: 1T total parameters, ~37B active per forward pass (35× speedup vs. dense equivalent), 1M token context window, native multimodal support, targeting Apache 2.0 open-source release. The Huawei-only training constraint is now both the technical bottleneck and the geopolitical signal — two consecutive open-weight frontier models (GLM-5.1 and V4) trained without NVIDIA hardware. Once released, the benchmark cascade (SWE-bench Pro, GPQA Diamond, ARC-AGI-2) is expected to be substantial given the scale advantage over GLM-5.1 (40B active) while maintaining similar active-parameter efficiency.

DeepSeek V4 — Status April 23: Late-April Release Imminent, 1T MoE, Huawei Ascend Only ¶

DeepSeek V4 remains unreleased as of April 23, 2026. Targeting late April 2026. Key confirmed specs: 1T parameter MoE architecture activating ~37B parameters per forward pass (35× processing speedup vs. dense equivalent), 1M token context window, native multimodal support. Hardware: trained exclusively on Huawei Ascend 950PR chips — deliberately bypassing NVIDIA entirely, a geopolitical signal amid US export controls. If the efficiency claims hold, DeepSeek V4 at 1T total parameters with 37B active would represent a step-change above GLM-5.1 (40B active) and current open-weight frontier. For benchmark context: GLM-5.1 at 40B active reached SWE-bench Pro 58.4% and was dethroned by Claude Opus 4.7 at 64.3%; DeepSeek V4 with ~equal active parameters but newer training and larger model capacity is expected to be competitive. The Huawei-only training is the most significant geopolitical signal since GLM-5.1 — two consecutive open-weight frontier models trained entirely without NVIDIA GPUs.

DeepSeek V4 Released — 1.6T MoE, 1M Context, MIT License (April 24, 2026)¶

DeepSeek V4 released April 24, 2026 — open-sourced under MIT License after a delayed rollout. Two-model family: DeepSeek-V4-Pro (1.6T total parameters, 49B active per forward pass, 1M token context) and DeepSeek-V4-Flash (284B total parameters). Architecture innovations: (1) Compressed Sparse Attention (CSA) + Heavily Compressed Attention (HCA) hybrid — in 1M-token setting, requires only 27% of standard single-token inference FLOPs and 10% of KV cache vs. DeepSeek-V3; (2) Manifold-Constrained Hyper-Connections (mHC) replacing conventional residual connections — improves signal propagation stability while preserving expressivity; (3) trained exclusively on Huawei Ascend chips, continuing the all-Ascend track set by GLM-5.1. Key benchmarks: 80.6% SWE-bench Verified (within 0.2 points of Claude Opus 4.6); 67.9% Terminal-Bench 2.0 (ahead of Claude 65.4%); 93.5% LiveCodeBench (vs. Claude 88.8%); Codeforces rating 3206; tied GPT-5.4 on MMLU-Pro. On Putnam-2025 mathematics: 120/120 proof-perfect, tying Axiom. The significance: DeepSeek V4's efficiency at 1M context (10% KV cache) is a direct architectural answer to the KV cache scaling problem — achieving frontier performance at a fraction of the memory cost, entirely on non-NVIDIA hardware.

SemanticALLI: Caching Reasoning, Not Just Responses, in Agentic Systems (arXiv:2601.16286, 2026)¶

Addresses a structural inefficiency in agentic AI pipelines: even when users phrase requests differently, the intermediate reasoning the pipeline must perform is often identical (metric normalization, chart scaffolding, query decomposition). Conventional semantic caching treats inference as a black box and misses this. SemanticALLI decomposes agentic generation into Analytic Intent Resolution (AIR) and Visualization Synthesis (VS), elevating structured intermediate representations (IRs) to first-class cacheable artifacts: a pipeline-aware architecture rather than a response-level cache. Results: baseline monolithic caching caps at a 38.7% hit rate due to linguistic variance; SemanticALLI achieves an 83.1% hit rate (bypassing 4,023 LLM calls at 2.66 ms median latency); token consumption drops from ~59,906 to ~12,964 per prompt, a 78.4% token reduction. Core insight: users rarely phrase requests identically, but in production agentic systems the pipeline passes through the same stable, structured checkpoints on every run, and those checkpoints are where caching is most reliable. This is the practical complement to Helium (already in this wiki): Helium optimizes cross-call KV-state caching at the serving layer; SemanticALLI caches at the intermediate representation layer before LLM calls. These are two orthogonal efficiency axes for production agentic serving.
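A minimal sketch of caching on a structured intermediate representation rather than on the raw request text. Everything here is hypothetical: `resolve_intent` stands in for the AIR stage with keyword rules (a real system would use an LLM), `synthesize_chart` stands in for the VS stage, and the IR fields are invented for illustration.

```python
import json


def resolve_intent(request: str) -> dict:
    """Hypothetical AIR stand-in: map free text to a structured IR."""
    text = request.lower()
    metric = "revenue" if ("revenue" in text or "sales" in text) else "unknown"
    group_by = "quarter" if "quarter" in text else "none"
    return {"metric": metric, "group_by": group_by, "chart": "bar"}


synthesis_calls = 0


def synthesize_chart(ir: dict) -> str:
    """Hypothetical VS stand-in: the expensive step we want to skip."""
    global synthesis_calls
    synthesis_calls += 1
    return f"bar chart of {ir['metric']} by {ir['group_by']}"


class IRCache:
    """Cache keyed on the canonical JSON form of the IR."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_build(self, ir: dict) -> str:
        key = json.dumps(ir, sort_keys=True)  # canonical, order-independent
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = synthesize_chart(ir)
        return self._store[key]


cache = IRCache()
first = cache.get_or_build(resolve_intent("Show me revenue by quarter"))
second = cache.get_or_build(resolve_intent("Can you plot quarterly sales?"))

# Token reduction implied by the figures quoted above.
token_reduction_pct = round((1 - 12964 / 59906) * 100, 1)
print(cache.hits, cache.misses, synthesis_calls, token_reduction_pct)
```

The two requests share no exact wording, so a response-level cache keyed on text would miss, while the IR-level cache hits on the second request because both resolve to the same structure.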

DeepSeek V4-Flash: 13B Active Params, 79.0% SWE-bench, $0.14/$0.28 per MTok (April 24, 2026)¶

The V4 release's second model tier deserves its own entry. DeepSeek V4-Flash is 284B total parameters, 13B active via MoE routing — 3.8× fewer active params than V4-Pro's 49B — at 79.0% SWE-bench Verified vs V4-Pro's 80.6% (gap: 1.6 points). Pricing: $0.14 input / $0.28 output per million tokens — a 12× cost reduction vs V4-Pro ($1.74/$3.48). The practical framing: V4-Flash gives up 1-2 benchmark points in exchange for a 12× API cost reduction and significantly lower inference latency, making it competitive against mid-tier models on efficiency metrics while operating from a 284B parameter pool. Flash uses the same CSA+HCA hybrid attention and mHC residual connections as Pro, on the same Ascend-trained weights — it is an active-parameter budget variant, not a separately trained model. Combined with V4-Pro, this creates a cost/quality Pareto that directly pressures both the economy tier (GPT-5.4-mini, Claude Haiku 4.5) and the premium tier simultaneously. For agentic workloads: V4-Flash at 13B active is the cheapest way to run a model with 1M-context window support and frontier-adjacent coding capability.
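The ratios in this entry can be checked directly from the quoted figures. The job size in the second half (10M input tokens, 2M output tokens) is a hypothetical workload chosen only to make the price gap concrete.

```python
# Figures quoted in the entry above.
PRO = {"active_b": 49, "swe_verified": 80.6, "usd_in": 1.74, "usd_out": 3.48}
FLASH = {"active_b": 13, "swe_verified": 79.0, "usd_in": 0.14, "usd_out": 0.28}

active_ratio = PRO["active_b"] / FLASH["active_b"]
price_ratio_in = PRO["usd_in"] / FLASH["usd_in"]
price_ratio_out = PRO["usd_out"] / FLASH["usd_out"]
score_gap = round(PRO["swe_verified"] - FLASH["swe_verified"], 1)


def job_cost(model: dict, input_mtok: float, output_mtok: float) -> float:
    """USD cost of a job measured in millions of input and output tokens."""
    return model["usd_in"] * input_mtok + model["usd_out"] * output_mtok


# Hypothetical agentic workload: 10M input tokens, 2M output tokens.
pro_cost = job_cost(PRO, 10, 2)
flash_cost = job_cost(FLASH, 10, 2)

print(f"active params: {active_ratio:.2f}x fewer")
print(f"price: {price_ratio_in:.1f}x (input), {price_ratio_out:.1f}x (output)")
print(f"SWE-bench Verified gap: {score_gap} points")
print(f"example job: ${pro_cost:.2f} on Pro vs ${flash_cost:.2f} on Flash")
```

The price ratio comes out slightly above 12× for both input and output, so the quoted "12×" is a rounded-down figure.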

AgentDiet: Trajectory Reduction for Efficient Agent Execution (2026)¶

AgentDiet automatically identifies and removes waste in agent execution trajectories — redundant tool calls, verbose intermediate outputs, duplicated reasoning steps — without retraining or modifying model weights. The reduction is applied post-hoc to recorded trajectories, then distilled back. Results: 39.9%–59.7% reduction in input tokens and 21.1%–35.9% reduction in total computational cost while maintaining equivalent agent task performance. Unlike Helium (workflow-level KV-state caching) or SemanticALLI (intermediate representation caching), AgentDiet operates at the trajectory content level — it changes what the agent is asked to process, not how the serving infrastructure handles it. For production deployments: the technique is particularly valuable for agents that generate verbose chain-of-thought or call tools with redundant context — common in long multi-step workflows where each step naively passes the full prior context. Directly complements Tri-Spirit (habit compilation) and Helium (serving-layer caching) already in this wiki: together they address agent efficiency at the training, serving, and trajectory content levels.
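To make "trajectory content level" concrete, here is a toy reducer that drops exact repeats of a tool call and truncates verbose outputs. The two heuristics, the `Step` type, and the word-count token proxy are assumptions for illustration; the entry above does not specify AgentDiet's actual reduction rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    tool: str
    args: str
    output: str


def approx_tokens(steps) -> int:
    """Crude token proxy: whitespace-separated words in tool outputs."""
    return sum(len(step.output.split()) for step in steps)


def reduce_trajectory(steps, max_output_words: int = 50):
    """Drop exact repeats of a (tool, args) call and truncate long outputs.

    Dropping repeats is only safe when the tool is read-only and the
    underlying state has not changed between calls (an assumption here).
    """
    seen = set()
    reduced = []
    for step in steps:
        key = (step.tool, step.args)
        if key in seen:
            continue  # identical call already present in the context
        seen.add(key)
        words = step.output.split()
        if len(words) > max_output_words:
            output = " ".join(words[:max_output_words]) + " [truncated]"
        else:
            output = step.output
        reduced.append(Step(step.tool, step.args, output))
    return reduced


trajectory = [
    Step("read_file", "a.py", "x " * 200),
    Step("read_file", "a.py", "x " * 200),  # redundant repeat
    Step("run_tests", "", "ok"),
]
reduced = reduce_trajectory(trajectory)
before, after = approx_tokens(trajectory), approx_tokens(reduced)
print(f"{before} -> {after} approximate tokens")
```

Nothing about the model or the serving stack changes here; only the content passed forward to the next step is smaller.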

BenchGuard: Automated Auditing of LLM Agent Benchmarks (arXiv:2604.24955, late April 2026)¶

Introduces the first automated auditing framework for task-oriented, execution-based agent benchmarks. BenchGuard treats benchmark artifacts (task specs, evaluation scripts, reference solutions) as first-class objects and deploys frontier LLMs as systematic auditors that cross-verify benchmark consistency, solvability, and correctness via structured protocols. Deployed on two real benchmarks, it found 12 author-confirmed issues in ScienceAgentBench (including fatal errors rendering tasks completely unsolvable) and matched 83.3% of expert-identified issues on BIXBench Verified-50. The underlying problem is fundamental: agent benchmarks are constructed by researchers who also design the evaluation harnesses, creating systematic blind spots, that is, errors no one noticed because the same people built the tasks and the checker. BenchGuard closes this gap by treating the benchmark itself as a code artifact to be tested rather than a ground truth to be trusted. Practical implication: any team relying on benchmark scores to evaluate agents should now ask whether the benchmark itself has been audited. The fatal errors among the 12 confirmed issues in ScienceAgentBench imply that some published results on that benchmark include scores on unsolvable tasks, which can bias comparisons between agents. Direct complement to AAR (already in this wiki): AAR exposes evaluation blind spots by making benchmarks harder; BenchGuard exposes correctness blind spots in existing benchmark infrastructure.

From Skill Text to Skill Structure: SSL Representation for Agent Skills (arXiv:2604.24026, late April 2026)¶

Introduces the Scheduling-Structural-Logical (SSL) representation — the first structured, source-grounded format for agent skill artifacts that simultaneously encodes three layers: (1) scheduling signals (when and under what conditions should the skill execute), (2) execution structure (the scene-level plan DAG for carrying out the skill), and (3) logic evidence (specific actions and resource-use patterns that ground the skill in verifiable steps). SSL is designed as a drop-in structured alternative to unstructured natural-language skill descriptions, enabling downstream tasks to operate on semantically rich, searchable artifacts rather than opaque text blobs. Benchmark results: Skill Discovery MRR improves 0.573 → 0.707 and Risk Assessment macro F1 improves 0.744 → 0.787 vs. unstructured baselines. The core insight is that current agent skill libraries treat skills as text — retrievable by semantic similarity but not by logical or structural properties. SSL makes skills inspectable and composable in ways that pure text cannot support. Directly complements the Tri-Spirit paper (arXiv:2604.13757, already in this wiki) — Tri-Spirit decomposes the agent execution stack across hardware layers; SSL structures the skill knowledge that flows through that stack.

From Agent Loops to Structured Graphs: Scheduler-Theoretic Framework for LLM Agent Execution (arXiv:2604.11378, April 13, 2026)¶

Places the dominant Agent Loop paradigm (LLM iterates over a growing context, choosing its next action at each step) into a formal scheduling-theory framework, then shows it is equivalent to a single-ready-unit scheduler: at any moment, at most one executable unit is ready, and the choice of which unit executes comes from opaque LLM inference rather than an inspectable policy. Three structural weaknesses emerge from this analysis: implicit dependencies between steps (no explicit DAG), unbounded recovery loops (the agent can spin indefinitely trying to recover), and mutable execution history (makes debugging non-reproducible). The proposed solution, SGH (Structured Graph Harness), lifts control flow from implicit context into an explicit, immutable static DAG — execution plans are fixed at plan time, planning/execution/recovery are separated into three distinct layers, and recovery follows a strict escalation protocol rather than re-entering the same execution loop. The scheduler-theoretic unification is the key conceptual contribution: it places Agent Loops and graph-based execution engines (LangGraph, etc.) on a single semantic continuum, allowing practitioners to reason about which scheduling properties they need and select the appropriate execution architecture. Directly extends the Tri-Spirit paper (arXiv:2604.13757, already in this wiki) — where Tri-Spirit decomposes cognitive workload across hardware layers, SGH formalizes the structural properties of execution plans that make agents debuggable and reliable.
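The contrast with an open-ended agent loop can be sketched as a tiny executor over a fixed plan. This is an illustration of the ideas named above (static DAG, bounded recovery, escalation, append-only history), not the paper's SGH implementation; `run_plan`, `EscalationRequired`, and the three-node plan are hypothetical.

```python
from graphlib import TopologicalSorter
from types import MappingProxyType


class EscalationRequired(Exception):
    """Raised when a node exhausts its retries; a higher layer must decide."""


def run_plan(plan, actions, max_retries: int = 2):
    """Execute a static DAG in dependency order with bounded recovery.

    `plan` maps each node to the nodes it depends on. The order is fixed
    before execution starts, and the returned history is an immutable tuple.
    """
    order = tuple(TopologicalSorter(dict(plan)).static_order())
    history = []
    for node in order:
        for attempt in range(1, max_retries + 2):  # first try plus retries
            try:
                result = actions[node]()
            except Exception as exc:
                history.append(f"{node}: attempt {attempt} failed ({exc})")
            else:
                history.append(f"{node}: ok ({result})")
                break
        else:
            # Recovery budget spent: escalate instead of looping forever.
            raise EscalationRequired(node)
    return tuple(history)


attempts = {"parse": 0}


def flaky_parse():
    attempts["parse"] += 1
    if attempts["parse"] == 1:
        raise RuntimeError("transient error")
    return "parsed"


# Read-only view of the plan, so execution cannot mutate it.
PLAN = MappingProxyType({"fetch": (), "parse": ("fetch",), "report": ("parse",)})
ACTIONS = {"fetch": lambda: "fetched", "parse": flaky_parse,
           "report": lambda: "reported"}

history = run_plan(PLAN, ACTIONS)
for line in history:
    print(line)
```

Each of the three weaknesses listed above maps to one design choice: dependencies are explicit in `PLAN`, recovery is capped by `max_retries` and then escalated, and the history is returned as a tuple that later steps cannot rewrite.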

KnowRL: RL with Minimal-Sufficient Knowledge Guidance (arXiv:2604.12627, April 14, 2026)¶

Addresses a core tension in RL-based reasoning: hint-guided training (providing knowledge-point hints during training) improves reasoning but creates inference-time dependency on those hints. KnowRL frames hint design as a minimal-sufficient guidance problem: it decomposes the knowledge required for each problem into atomic knowledge points (KPs), then uses Constrained Subset Search to identify the smallest subset of KPs that is still sufficient. The search is interaction-aware (it accounts for how KPs compound during reasoning) and avoids over-specifying. The practical effect: the model is trained to reason independently of hints once the minimal-sufficient subset is internalized. Without KP hints at inference: 70.08% average accuracy (+9.63 points vs. baseline RL). With selected KP hints at inference: 74.16%, so the hints remain available as an optional boost, not a crutch. The "minimal-sufficient" framing is theoretically clean: it avoids the over-hinting failure mode, which inflates training performance but collapses at deployment when hints are unavailable. Direct complement to CAPO (arXiv:2604.12632, already in this wiki): CAPO optimizes calibration during RL training; KnowRL optimizes the knowledge content of the training signal itself, treating what the model is asked to learn as the primary design variable.
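The "minimal-sufficient" idea can be illustrated with an exhaustive search over hint subsets, smallest first. This brute-force version is a stand-in, not the paper's Constrained Subset Search, and `is_sufficient` is a hypothetical predicate that in practice would be estimated from model rollouts with the given hints.

```python
from itertools import combinations


def minimal_sufficient_subset(knowledge_points, is_sufficient):
    """Return the smallest subset of KPs for which is_sufficient holds.

    Subsets are tried in order of increasing size, so the first match is
    minimal. Evaluating whole subsets (not single KPs) is what lets the
    search see interactions between KPs. Cost is exponential in the worst
    case, which is fine for a toy but not for real use.
    """
    kps = list(knowledge_points)
    for size in range(len(kps) + 1):
        for subset in combinations(kps, size):
            candidate = frozenset(subset)
            if is_sufficient(candidate):
                return candidate
    return None


# Toy problem (hypothetical): neither hint helps alone, but the pair does.
REQUIRED_TOGETHER = frozenset({"vieta", "am-gm"})


def is_sufficient(hints: frozenset) -> bool:
    return REQUIRED_TOGETHER <= hints


chosen = minimal_sufficient_subset(["vieta", "am-gm", "induction"], is_sufficient)
print(sorted(chosen))
```

A per-KP scoring rule would rate both "vieta" and "am-gm" as useless in this toy, since each fails alone; searching over subsets finds the pair and leaves out "induction", which is the over-specification the framing is meant to avoid.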

ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual RL (arXiv:2605.00380, ICML 2026)¶

ResRL targets a structural flaw in standard RL-based reasoning training: positive-reward incentivization improves reasoning accuracy but collapses response diversity — the model converges on narrow output modes that happen to be rewarded, while the gradient updates are corrupted by Lazy Likelihood Displacement (LLD), a phenomenon where negative-sample tokens "drift" toward positive-sample token probability distributions due to shared head-gradient interference. The fix: project negative token representations onto the positive subspace and use the residuals (the component orthogonal to positive-sample activations) to guide gradient updates — ensuring negative samples maintain their distinguishing signal rather than being pulled into positive-space by implicit optimization pressure. This conservative advantage reweighting preserves the diversity of the reasoning search space while still applying strong positive reinforcement. Results across 12 benchmarks (math, code generation, agent tasks, function calling): +9.4% on math reasoning Avg@16, +7.0% on Pass@128 vs. the NSR baseline. ICML 2026 accepted. Theoretical contribution: the paper derives a proxy bound connecting representation alignment (positive-negative semantic distance) to LLD magnitude — giving practitioners a quantitative signal for when LLD is actively degrading their RL reasoning training run.
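The projection step described above is ordinary linear algebra and can be sketched with NumPy. This shows only the geometric operation (remove from each negative representation its component in the span of the positive ones); how ResRL feeds the residuals into advantage reweighting is not shown, and the shapes and random data are illustrative assumptions.

```python
import numpy as np


def projection_residual(negatives: np.ndarray, positives: np.ndarray) -> np.ndarray:
    """Component of each negative row orthogonal to the positive subspace.

    Assumes the rows of `positives` are linearly independent, so the
    reduced QR factor is an orthonormal basis for exactly their span.
    """
    q, _ = np.linalg.qr(positives.T)   # q has shape (d, n_pos)
    projected = negatives @ q @ q.T    # projection onto the positive span
    return negatives - projected


rng = np.random.default_rng(0)
d, n_pos, n_neg = 16, 3, 4             # toy sizes
positives = rng.standard_normal((n_pos, d))
negatives = rng.standard_normal((n_neg, d))

residuals = projection_residual(negatives, positives)

# Each residual should have no remaining overlap with any positive row.
max_overlap = np.abs(residuals @ positives.T).max()
print(f"max overlap with positive subspace: {max_overlap:.2e}")
```

The residual keeps only what distinguishes a negative sample from the positive ones, which is the signal the entry describes as being lost when negative tokens drift toward the positive distribution.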

Odysseus: Scaling VLMs to 100+ Turn Decision-Making via Reinforcement Learning (arXiv:2605.00347, May 2026)¶

Most RL-trained vision-language agent work operates within 20–30 turn horizons — long enough to demonstrate agent capability but short enough to sidestep compounding error and credit assignment problems at scale. Odysseus pushes this to 100+ consecutive turns, demonstrated on Super Mario Land gameplay (a task requiring sustained spatial reasoning, hazard memory, and sequential progress without explicit subtask decomposition). Key training innovations: a lightweight turn-level critic adapted from PPO that provides dense intermediate reward signals without requiring full rollout evaluation; and a pretrained VLM action-prior bootstrapping approach that dramatically reduces the cold-start sample inefficiency common to RL fine-tuning of large vision models. Results: 3× average in-game progress vs. frontier models tested in same setting; generalization held across cross-game transfer. The 100+ turn threshold is a practical milestone for embodied and agentic VLM deployment: real-world tasks like robotic manipulation, GCS operation, and multi-step document workflows routinely require extended horizons that current RLHF fine-tuning approaches cannot handle. Odysseus provides a reproducible training recipe and open framework for the community to extend.