4 Autopsies in Progress
AutoGPT & Long-Horizon Autonomy
Claimed: Fully autonomous GPT-4 agents that could complete any multi-step task
Reality: Token loops, context collapse, irreversible action failures. Long-horizon autonomy is still an open research problem.
Key lesson: The gap between "can call tools" and "can reliably complete a 20-step task" is enormous. Stateful memory architecture is the missing piece.
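To make the token-loop failure concrete, here is a minimal sketch of an agent loop with two guards: a step budget and a repeated-action detector. The names run_agent, Policy, max_steps, and max_repeats are hypothetical and for illustration only. This is not AutoGPT's actual control loop, and it does not address memory or irreversible actions.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical interface: a policy maps the action history to the next action.
Policy = Callable[[List[str]], str]


def run_agent(policy: Policy, max_steps: int = 20, max_repeats: int = 2) -> Tuple[str, List[str]]:
    """Run a toy agent loop and report how it ended."""
    history: List[str] = []
    seen: Counter = Counter()
    for _ in range(max_steps):
        action = policy(history)
        if action == "DONE":
            return "completed", history
        seen[action] += 1
        # Same action proposed too often: treat it as a loop and stop.
        if seen[action] > max_repeats:
            return "loop_detected", history
        history.append(action)
    # Ran out of steps without the policy declaring completion.
    return "budget_exhausted", history


def stuck_policy(history: List[str]) -> str:
    """Always proposes the same tool call, like an agent caught in a loop."""
    return "search('flights')"


def scripted_policy(history: List[str]) -> str:
    """Follows a fixed three-step plan, then stops."""
    plan = ["search", "read", "summarize"]
    return plan[len(history)] if len(history) < len(plan) else "DONE"


print(run_agent(stuck_policy))
print(run_agent(scripted_policy))
```

Guards like these only turn a silent loop into a visible failure. They do not make a 20-step task succeed.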
Sparse Mixture-of-Experts
Claimed: 7x parameter efficiency, same quality as a dense model at a fraction of the compute
Reality: Training instability, expert collapse, communication overhead, and the gap between on-paper FLOPs and wall-clock time.
Key lesson: MoE works, but the engineering complexity is significant. Mixtral was among the first widely adopted open-weight models to show that the training challenges could be managed in practice.
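Two of the ideas above can be shown with a small sketch: the share of parameters that is active per token under top-k routing, and what expert collapse looks like in the routing counts. All sizes here (8 experts, top-2 routing, unit-sized parameter groups) are made-up illustrative numbers, not the configuration of Mixtral or any other real model, and the function names are hypothetical.

```python
import random
from collections import Counter
from typing import List


def active_fraction(shared: float, per_expert: float, n_experts: int, top_k: int) -> float:
    """Share of all parameters that a single token actually uses."""
    total = shared + per_expert * n_experts
    active = shared + per_expert * top_k
    return active / total


def route_top_k(logits: List[float], k: int) -> List[int]:
    """Indices of the k experts with the highest router logits."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def expert_load(token_logits: List[List[float]], k: int) -> Counter:
    """How many tokens each expert receives."""
    load: Counter = Counter()
    for logits in token_logits:
        load.update(route_top_k(logits, k))
    return load


rng = random.Random(0)
N_EXPERTS, TOP_K, N_TOKENS = 8, 2, 1000

# Router logits drawn uniformly from [0, 1): load spreads across experts.
balanced = [[rng.random() for _ in range(N_EXPERTS)] for _ in range(N_TOKENS)]
# Same logits with a large bias on expert 0: every token routes to it.
collapsed = [[x + (2.0 if i == 0 else 0.0) for i, x in enumerate(row)] for row in balanced]

print(active_fraction(shared=1.0, per_expert=1.0, n_experts=N_EXPERTS, top_k=TOP_K))
print(expert_load(balanced, TOP_K))
print(expert_load(collapsed, TOP_K))
```

The active fraction is the on-paper saving. The routing counts hint at why wall-clock time can disagree: when load is skewed, the device holding the busy expert becomes the bottleneck while the others wait.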
Early RAG: The Original Paper
Claimed: Retrieval solves hallucination. Ground your LLM in documents and it answers accurately
Reality: Single-stage dense retrieval (DPR), no reranking, single-hop only. The original paper's benchmarks hid the failure modes that real deployments encountered.
Key lesson: RAG v1 benchmarks used clean corpora that matched the questions. Production RAG requires hybrid retrieval, reranking, a chunking strategy, and query decomposition.
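The shape of a chunk, retrieve, then rerank pipeline can be sketched with the standard library alone. The scoring functions below are deliberately toy stand-ins: set overlap plays the role of a cheap first-stage retriever and bag-of-words cosine plays the role of a reranker. A real system would use learned models for both. All function names and parameters are hypothetical, and query decomposition is left out.

```python
import math
from collections import Counter
from typing import List


def chunk(text: str, size: int = 8, overlap: int = 2) -> List[str]:
    """Split text into word windows of `size` words sharing `overlap` words."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]


def first_stage(query: str, chunks: List[str], k: int) -> List[str]:
    """Cheap recall step: keep the k chunks sharing the most query terms."""
    q = set(query.lower().split())
    scored = [(len(q & set(c.lower().split())), c) for c in chunks]
    hits = [(s, c) for s, c in scored if s > 0]
    hits.sort(key=lambda sc: sc[0], reverse=True)
    return [c for _, c in hits[:k]]


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (toy stand-in for a reranker score)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str], recall_k: int = 3, final_k: int = 2) -> List[str]:
    """Two-stage retrieval: broad candidate set first, then rerank and trim."""
    candidates = first_stage(query, chunks, recall_k)
    return sorted(candidates, key=lambda c: cosine(query, c), reverse=True)[:final_k]


corpus = [
    "the capital of france is paris",
    "paris hilton is a celebrity",
    "france borders spain and italy",
    "the eiffel tower is in paris france",
]
print(retrieve("capital of france", corpus))
```

The point of the two stages is that the first can be fast and broad while the second can be slower and more precise, since it only sees a handful of candidates.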
MMLU: The Benchmark That Outlived Its Usefulness
Claimed: A comprehensive test of language model knowledge across 57 subjects
Reality: Training data contamination, pattern-matching shortcuts, and the divergence between MMLU scores and real-world task performance.
Key lesson: Benchmark saturation happens fast. Once top models score 90%+, the benchmark stops measuring what it claimed to measure.
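One common way to probe for training data contamination is to measure n-gram overlap between a benchmark item and a training document. The sketch below is a simplified illustration under stated assumptions: whitespace tokenization, n = 5, and no normalization beyond lowercasing. The function names are hypothetical, and this is not the procedure used by any particular model report.

```python
from typing import Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """All word n-grams in the text, lowercased."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(item: str, corpus_doc: str, n: int = 5) -> float:
    """Fraction of the item's n-grams that also appear in the corpus document."""
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)


question = "what is the boiling point of water at sea level"
leaked_doc = "quiz dump: What is the boiling point of water at sea level answer 100 C"
clean_doc = "water expands when it freezes which is unusual among common liquids"

print(overlap_ratio(question, leaked_doc))
print(overlap_ratio(question, clean_doc))
```

A high ratio suggests the item may have been seen verbatim during training. A low ratio does not prove the item is clean, since paraphrased or translated copies would slip through an exact-match check like this one.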
