LLM Wiki Pattern¶
Source: Karpathy's gist · Ingested: 2026-04-09
Core Idea¶
Stop re-deriving. Start compiling.
Traditional RAG makes the LLM rediscover connections from raw docs on every query. The LLM Wiki eliminates this: the LLM builds a structured markdown knowledge base once, then queries are answered from the compiled wiki — not the raw sources.
"The wiki is a persistent, compounding artifact."
Three-Layer Architecture¶
Raw Sources (immutable, human-curated)
|
v [ingest]
Wiki Pages (LLM-maintained markdown)
|
v [query]
Answers (grounded, with citations)
Layer 1 — Raw Sources: documents, articles, papers. Never modified by the LLM; the authoritative source of truth.
Layer 2 — Wiki: LLM-generated and maintained markdown files (summaries, entity pages, concept pages, comparisons). The human reads; the LLM writes.
Layer 3 — Schema: a config file (SCHEMA.md or CLAUDE.md) defining wiki structure, ingestion rules, and conventions.
Three Operations¶
Ingest¶
New source added → LLM reads it, extracts key info, integrates into existing pages (may touch 10-15 files), updates index, logs activity.
Query¶
User asks → LLM reads index first, finds relevant pages, synthesizes answer with citations. Optionally saves valuable explorations as new pages.
Lint¶
Periodic health check: contradictions, stale claims, orphaned pages, missing cross-references, conceptual gaps.
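The lint pass can be sketched as a pure function over the page set. A minimal illustration, assuming standard markdown links and a simple filename-to-content mapping (the `lint_wiki` helper is hypothetical, not part of the pattern spec):

```python
import re

def lint_wiki(pages: dict[str, str], index_page: str = "index.md") -> dict[str, list[str]]:
    """Report two lint findings: orphaned pages (nothing links to them)
    and dangling links (targets that do not exist).

    `pages` maps filename -> markdown content; links are assumed to be
    standard markdown links like [title](page.md).
    """
    link_re = re.compile(r"\]\(([^)]+\.md)\)")
    linked = {target for body in pages.values() for target in link_re.findall(body)}
    orphans = [name for name in pages
               if name != index_page and name not in linked]
    dangling = sorted(t for t in linked if t not in pages)
    return {"orphans": orphans, "dangling_links": dangling}
```

Contradiction and staleness checks need the LLM itself; orphans and broken cross-references are cheap enough to catch mechanically.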
Navigation Files¶
index.md — content catalog. LLM reads this first during queries to locate relevant pages.
log.md — append-only chronological record of all operations.
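A sketch of how these two files might be used, assuming a line-per-entry log format and an index that maps each page to a one-line summary (both formats are illustrative, not prescribed by the pattern):

```python
from datetime import date

def log_entry(operation: str, detail: str, when: date) -> str:
    """Format one log.md line. The log is append-only: entries are
    added, never edited, giving a faithful chronological record."""
    return f"- {when.isoformat()} [{operation}] {detail}"

def find_pages(index: dict[str, str], query: str) -> list[str]:
    """Index-first lookup: match query terms against the one-line
    summaries in index.md before opening any full page."""
    terms = query.lower().split()
    return [page for page, summary in index.items()
            if any(t in summary.lower() for t in terms)]
```

Reading the index first keeps the query step cheap: only the pages it nominates are loaded into context.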
Why it beats RAG¶
| RAG | LLM Wiki |
|---|---|
| Re-derives on every query | Compiled once, queried fast |
| Context window filled with raw chunks | Context filled with synthesized knowledge |
| No accumulation | Compounds over time |
| Hard to synthesize across many docs | Cross-references pre-built |
v2 Extensions (rohitg00)¶
- Memory lifecycle: confidence scoring, supersession, gradual forgetting
- Knowledge graph: typed entities (people, projects, libs) + relationships (uses, depends-on, contradicts)
- Automation: event-driven hooks, auto-ingest, scheduled consolidation
- Consolidation tiers: working memory → episodic → semantic → procedural
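The memory-lifecycle bullet can be illustrated with a toy decay model; the half-life parameterization and the `MemoryEntry` shape are hypothetical, not taken from the v2 spec:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MemoryEntry:
    claim: str
    confidence: float              # 0.0-1.0 at ingest time
    age_days: int = 0
    superseded_by: Optional[str] = None  # page that replaced this claim

def effective_confidence(entry: MemoryEntry, half_life_days: float = 90.0) -> float:
    """Gradual forgetting: confidence halves every `half_life_days`;
    a superseded entry is treated as fully retired."""
    if entry.superseded_by is not None:
        return 0.0
    return entry.confidence * 0.5 ** (entry.age_days / half_life_days)
```

A lint pass could then flag entries whose effective confidence drops below a threshold as candidates for consolidation or removal.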
Implementation in this Wiki¶
This wiki IS the implementation. Hosted at wiki.mukhayyar.my.id.
To ingest new content: tell Ductor "wiki ingest [URL or content]" — it will create/update pages and update log.md.
Recent Finds¶
Updated: 2026-04-11
Single-Agent LLMs Outperform Multi-Agent Systems Under Equal Token Budgets (arXiv 2604.02460)¶
April 2, 2026. Using the Data Processing Inequality as a formal lens, this paper shows that when reasoning-token budgets are held equal across single-agent and multi-agent setups, single-agent systems consistently match or beat multi-agent on multi-hop reasoning (tested on Qwen3, DeepSeek-R1-Distill-Llama, Gemini 2.5). The damning finding: most reported multi-agent benchmark gains are measurement artifacts — extra unaccounted compute and context injection disguised as architectural advantage. Implication for anyone building multi-agent pipelines: benchmark honestly by equalizing total thinking tokens. Multi-agent is not inherently smarter — it's often just burning more tokens.
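The paper's prescription, equalizing total thinking tokens, reduces to a simple accounting rule: charge a multi-agent run for every agent turn, not just the final trace. A sketch (function names are illustrative):

```python
def total_thinking_tokens(agent_traces: list[list[int]]) -> int:
    """Sum reasoning tokens over every turn of every agent, so a
    multi-agent run is charged for all compute it consumed."""
    return sum(sum(trace) for trace in agent_traces)

def token_fair(single: list[list[int]], multi: list[list[int]]) -> bool:
    """A comparison is honest only if both setups spent the same budget."""
    return total_thinking_tokens(single) == total_thinking_tokens(multi)
```

Most reported multi-agent wins fail this check: the extra agents quietly inflate the budget.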
Multi-Agent Orchestration for Exascale Materials Screening (arXiv 2604.07681)¶
April 9, 2026. A productive counterpoint: multi-agent does win when the problem is parallelism-limited, not reasoning-limited. Hierarchical planner-executor architecture (gpt-oss-120b) deployed on the Aurora supercomputer to screen metal-organic frameworks for water harvesting at exascale. The planner decomposes the search space; executor agents run asynchronously. Result: high task completion rates and low orchestration overhead. Synthesis with arXiv 2604.02460: single-agent beats multi-agent in reasoning depth; multi-agent beats single-agent in throughput over embarrassingly parallel tasks. The line is now clearly drawn.
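The planner-executor split can be sketched with a thread pool standing in for the Aurora deployment; the round-robin chunking and the scoring stub are placeholders for the paper's actual decomposition and simulation calls:

```python
from concurrent.futures import ThreadPoolExecutor

def plan(search_space: list[str], n_workers: int) -> list[list[str]]:
    """Planner: decompose the candidate list into one chunk per executor."""
    return [search_space[i::n_workers] for i in range(n_workers)]

def screen(candidates: list[str]) -> list[tuple[str, float]]:
    """Executor: score each candidate (stand-in for a real simulation)."""
    return [(c, float(len(c))) for c in candidates]

def orchestrate(search_space: list[str], n_workers: int = 4) -> list[tuple[str, float]]:
    """Hierarchical orchestration: plan once, execute asynchronously, merge."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(screen, plan(search_space, n_workers)))
    return sorted(r for chunk in results for r in chunk)
```

The shape matters more than the details: the planner runs once, the executors never coordinate with each other, and merging is trivial because the task is embarrassingly parallel.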
Efficient Inference of Large Vision-Language Models — Survey (arXiv 2603.27960)¶
March 30, 2026. Comprehensive survey of LVLM inference acceleration organized into four axes: (1) visual token compression (pruning, merging, spatiotemporal aggregation — the central problem is quadratic attention cost over long visual token sequences), (2) memory & serving (KV-cache paging, continuous batching), (3) architecture (sparse MoE, cross-modal projector optimization, hardware-aware attention kernels), (4) advanced decoding (speculative decoding adapted for multimodal). Most useful as a reference taxonomy for anyone optimizing a VLM pipeline. Open problems flagged: streaming video inference and expert routing efficiency in sparse MoE cross-modal models.
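Token pruning, the simplest strategy under axis (1), can be sketched as top-k selection over importance scores; the scoring source (e.g. cross-attention received from the text query) is an assumption here:

```python
def prune_visual_tokens(tokens: list[list[float]],
                        scores: list[float],
                        keep: int) -> list[list[float]]:
    """Visual token compression by pruning: keep only the `keep` tokens
    with the highest importance scores, cutting the quadratic attention
    cost over long visual token sequences."""
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])   # preserve original spatial order
    return [tokens[i] for i in kept]
```

Merging and spatiotemporal aggregation are refinements of the same idea: reduce the visual sequence length before attention, not after.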
SHAPE: Stage-Aware Hierarchical Advantage for LLM Reasoning (arXiv 2604.06636)¶
April 8, 2026. SHAPE formalizes LLM reasoning as a trajectory through a "state space of empirical solvability" and introduces a hierarchical credit assignment mechanism. At the segment level, it uses a stage-aware advantage function to reward efficient breakthroughs in low-solvability states (the hard parts); at the token level, it uses potential-based shaping to avoid rewarding verbosity. Result: 3% average accuracy gain on math reasoning benchmarks with 30% reduced token consumption — addressing the token-length inflation problem that plagues chain-of-thought fine-tuning. Key insight: current process supervision rewards steps without distinguishing meaningful progress from verbose padding.
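The token-level component builds on classic potential-based reward shaping. A sketch, assuming the potential Phi encodes state solvability (SHAPE's exact potential is the paper's own, not this one):

```python
def shaped_rewards(base: list[float],
                   potentials: list[float],
                   gamma: float = 1.0) -> list[float]:
    """Potential-based shaping: r'_t = r_t + gamma*Phi(s_{t+1}) - Phi(s_t).
    Tokens that do not move the potential (verbose padding) earn nothing,
    and the shaped return telescopes to the base return plus endpoint
    terms, so the optimal policy is provably unchanged."""
    assert len(potentials) == len(base) + 1
    return [r + gamma * potentials[t + 1] - potentials[t]
            for t, r in enumerate(base)]
```

This is exactly the mechanism that lets SHAPE cut tokens without cutting accuracy: padding steps are reward-neutral, breakthrough steps are not.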
On Step Length Confounding in LLM Reasoning Data Selection (arXiv 2604.06834)¶
April 8, 2026. Identifies a subtle but critical flaw in how reasoning datasets are curated: step length is confounded with reasoning quality in most selection pipelines. Longer reasoning chains score higher on standard selection metrics (perplexity-based or reward-model-based), but longer ≠ better. The paper shows that removing this confound and selecting for compact, correct reasoning steps improves downstream fine-tuned model quality. Practical implication: if you're training on synthetic reasoning data (GPT-4 / o3 generated), blind length-based filtering actively degrades model quality.
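One way to remove the confound, selecting the shortest correct chains rather than the highest-scoring ones, can be sketched as follows (the chain schema is illustrative; the paper's actual selection criterion may differ):

```python
def select_reasoning_chains(chains: list[dict], k: int) -> list[dict]:
    """Length-deconfounded selection: restrict to correct chains, then
    pick the k shortest instead of the k highest-scoring, since raw
    selection scores correlate with length rather than quality."""
    correct = [c for c in chains if c["correct"]]
    return sorted(correct, key=lambda c: c["n_steps"])[:k]
```

Contrast with the status quo: a perplexity- or reward-model-ranked top-k over the same pool would systematically favor the longest chains.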
RLHF: A Statistical Perspective (arXiv 2604.02507)¶
Establishes a statistical foundation for RLHF by addressing how to combine abundant but biased AI-generated labels with limited high-quality human feedback. Derives recovery guarantees for human-aligned preferences under mixed-data regimes. Key insight: the bottleneck in scaling RLHF is not the RL algorithm but the quality and quantity of human preference signal — this paper provides the theoretical grounding for hybrid data strategies.
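The hybrid-data idea can be illustrated with a toy estimator: correct the abundant AI labels for a bias estimated on an overlap set, then pool with the scarce human labels by sample size. This is a simplification of the paper's estimator, which comes with formal recovery guarantees:

```python
def combine_preference_estimates(ai_mean: float, ai_n: int, ai_bias: float,
                                 human_mean: float, human_n: int) -> float:
    """Hybrid-data estimate of a preference rate: debias the AI-labeled
    mean using a bias estimated where human and AI labels overlap, then
    take a sample-size-weighted average with the human-labeled mean."""
    debiased_ai = ai_mean - ai_bias
    total = ai_n + human_n
    return (ai_n * debiased_ai + human_n * human_mean) / total
```

The point of the construction: the 900 cheap labels contribute most of the precision, while the 100 human labels anchor the bias correction.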
GIFT: Group-Relative Implicit Fine-Tuning (arXiv 2510.23868)¶
Reformulates the RL fine-tuning objective as a normalized MSE loss between implicit and explicit reward signals, eliminating intractable normalization constants and clipping mechanisms found in PPO/GRPO. GIFT produces lower-variance gradients than standard RLHF baselines. Practical impact: reduces per-step training instability without sacrificing policy expressiveness, making RLHF more accessible for smaller research teams.
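The core objective can be sketched as a group-normalized MSE. Treating the implicit reward as a policy-to-reference log-prob ratio is the standard DPO-style identification, and this sketch simplifies the paper's exact loss:

```python
import math

def group_normalize(xs: list[float]) -> list[float]:
    """Normalize within a group of responses to the same prompt
    (zero mean, unit variance); the shared normalization constant
    cancels, which is what removes the intractable partition term."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    std = math.sqrt(var) or 1.0   # guard against a zero-variance group
    return [(x - mean) / std for x in xs]

def gift_style_loss(implicit: list[float], explicit: list[float]) -> float:
    """MSE between group-normalized implicit rewards (e.g. beta * log
    pi/pi_ref) and group-normalized explicit reward-model scores."""
    zi, ze = group_normalize(implicit), group_normalize(explicit)
    return sum((a - b) ** 2 for a, b in zip(zi, ze)) / len(zi)
```

Note there is no clipping and no ratio term: the regression form is where the lower gradient variance comes from.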
LLMOrbit: A Circular Taxonomy of Large Language Models (arXiv 2601.14053)¶
Maps the multi-agent LLM landscape and introduces a circular taxonomy organizing models by capability tier and deployment role. Covers scaling walls, agentic reasoning patterns, and system-level engineering challenges. Notable: introduces the concept of dynamic agent topology rewiring — multi-agent systems that restructure their communication graph during reasoning, rather than fixing it at design time.