Honest Analysis · Benchmark Decoder · Production Failures · Unique Content

Paper Autopsies

Most research coverage focuses on what worked. This section focuses on what didn't, and why that matters more for engineers making decisions. Honest analysis of hyped papers that underdelivered.

Unique Content Type
0 Autopsies Planned
Brutal Honesty Policy
Free to Read
Framework

Anatomy of an Autopsy

Every autopsy follows a structured format. Here's what each one covers.

01
The Claim

What the paper claimed in its abstract and results tables, including the benchmark numbers that led to the hype.

02
The Benchmark Decoder

What the benchmark actually measures. What assumptions were made. What conditions made the numbers achievable.

03
The Production Gap

What failed when engineers tried to deploy this in real systems. The failure modes the paper didn't test for.

04
What the Community Learned

The follow-up papers, the blog posts, the quiet acknowledgements. How the community moved on and what it built instead.

05
The Engineering Verdict

Should you use this technique? Under what conditions? With what caveats? An honest answer for practitioners.

Coming Soon

4 Autopsies in Progress

Subscribe to AI Letters to be notified when new autopsies are published.

Coming Soon

AutoGPT & Long-Horizon Autonomy

Claimed: Fully autonomous GPT-4 agents that could complete any multi-step task

Reality: Agents stuck repeating the same actions while burning tokens, context collapse, and failures from irreversible actions. Long-horizon autonomy is still an open research problem.

Key lesson: The gap between "can call tools" and "can reliably complete a 20-step task" is enormous. Stateful memory architecture is a major missing piece.
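A rough way to see the size of that gap is plain arithmetic. The sketch below assumes every step succeeds independently with the same probability; that is a simplifying assumption made for illustration, not a result from any paper, and the function name is hypothetical.

```python
def task_success_probability(per_step_success: float, steps: int) -> float:
    """Probability that all steps succeed, ASSUMING independent, identical steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    # With independent steps, the per-step probabilities multiply.
    return per_step_success ** steps


if __name__ == "__main__":
    for p in (0.90, 0.95, 0.99):
        total = task_success_probability(p, 20)
        print(f"per-step {p:.2f} -> 20-step task {total:.2f}")
```

Under this assumption, an agent that gets each step right 95% of the time finishes a 20-step task only about 36% of the time. Real steps are not independent, so treat the number as intuition rather than a measurement.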

Coming Soon

Sparse Mixture-of-Experts

Claimed: 7x parameter efficiency, same quality as a dense model at a fraction of the compute

Reality: Training instability, expert collapse, communication overhead, and the gap between FLOPs-on-paper vs wall-clock time.

Key lesson: MoE works, but the engineering complexity is significant. Mixtral was among the first openly released models to show that the training challenges could be handled well in practice.
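Expert collapse means the router sends most tokens to a handful of experts while the rest sit idle. One commonly used mitigation is an auxiliary load-balancing loss in the style of the Switch Transformer, computed as N * sum(f_i * P_i), where f_i is the fraction of tokens routed to expert i and P_i is the mean router probability for expert i. The sketch below is a pure-Python illustration with top-1 routing; the function name and toy inputs are assumptions for this example and are not taken from any specific codebase.

```python
from collections import Counter


def load_balance_loss(router_probs: list[list[float]]) -> float:
    """Auxiliary loss N * sum_i(f_i * P_i) for top-1 routing.

    router_probs[t][i] is the router probability of token t for expert i.
    The value is near 1.0 when routing is uniform and grows toward N
    as routing collapses onto a single expert.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    # Top-1 routing: each token goes to its highest-probability expert.
    assignments = Counter(
        max(range(num_experts), key=row.__getitem__) for row in router_probs
    )
    total = 0.0
    for i in range(num_experts):
        f_i = assignments[i] / num_tokens                      # fraction of tokens sent to expert i
        p_i = sum(row[i] for row in router_probs) / num_tokens  # mean router probability for expert i
        total += f_i * p_i
    return num_experts * total


if __name__ == "__main__":
    balanced = [
        [0.7, 0.1, 0.1, 0.1],
        [0.1, 0.7, 0.1, 0.1],
        [0.1, 0.1, 0.7, 0.1],
        [0.1, 0.1, 0.1, 0.7],
    ]
    collapsed = [[0.97, 0.01, 0.01, 0.01]] * 4
    print(f"balanced:  {load_balance_loss(balanced):.2f}")
    print(f"collapsed: {load_balance_loss(collapsed):.2f}")
```

In training, a term like this is typically scaled by a small coefficient and added to the main loss, so the router is nudged toward spreading tokens across experts. It addresses collapse only; communication overhead and wall-clock cost are separate problems.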

Coming Soon

Early RAG: The Original Paper

Claimed: Retrieval solves hallucination. Ground your LLM in documents and it answers accurately

Reality: A single dense retriever (DPR) over a clean Wikipedia index, no reranking, single-hop only. The original paper benchmarks hid the failure modes that real deployments encountered.

Key lesson: RAG v1 benchmarks used clean, matching corpora. Production RAG requires retrieval tuned to messy corpora, reranking, chunking strategy, and query decomposition.
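The reranking step in that list can be sketched as a two-stage pipeline: a cheap first-stage scorer narrows the corpus to a candidate set, and a more expensive scorer reorders only those candidates. Both scorers below are toy stand-ins chosen so the example runs without dependencies; in a real system they would be a retriever and a reranking model. All names here are hypothetical.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def overlap_score(query: str, doc: str) -> float:
    """Toy first-stage scorer: number of distinct shared words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def bigram_score(query: str, doc: str) -> float:
    """Toy reranker: number of distinct shared word bigrams (order-aware)."""
    def bigrams(text: str) -> set[tuple[str, str]]:
        tokens = text.lower().split()
        return set(zip(tokens, tokens[1:]))
    return float(len(bigrams(query) & bigrams(doc)))


def retrieve_then_rerank(
    query: str,
    docs: list[str],
    first_stage: Scorer,
    reranker: Scorer,
    k_retrieve: int = 20,
    k_final: int = 3,
) -> list[str]:
    """Keep the top k_retrieve docs by first_stage, then the top k_final by reranker."""
    candidates = sorted(docs, key=lambda d: first_stage(query, d), reverse=True)[:k_retrieve]
    return sorted(candidates, key=lambda d: reranker(query, d), reverse=True)[:k_final]


if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "dogs chase the cat",
        "stock prices fell today",
    ]
    top = retrieve_then_rerank(
        "cat on the mat", corpus, overlap_score, bigram_score, k_retrieve=2, k_final=1
    )
    print(top)
```

The design point is cost: the reranker only ever sees k_retrieve documents, so it can afford to be far slower per document than the first stage.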

Coming Soon

MMLU: The Benchmark That Outlived Its Usefulness

Claimed: A comprehensive test of language model knowledge across 57 subjects

Reality: Training data contamination, pattern-matching shortcuts, and the divergence between MMLU scores and real-world task performance.

Key lesson: Benchmark saturation happens fast. Once top models score 90%+, the benchmark stops measuring what it claimed to measure.
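Training data contamination is often screened for with a simple heuristic: flag a benchmark item if it shares a long word n-gram with any training document. The sketch below assumes whitespace tokenization and a window of 8 words; both are illustrative choices, not a standard, and the function names are hypothetical.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All word n-grams in text, lowercased and split on whitespace."""
    tokens = text.lower().split()
    # Texts shorter than n words yield an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(benchmark_item: str, corpus_docs: list[str], n: int = 8) -> bool:
    """True if the item shares at least one n-gram with any corpus document."""
    item_grams = ngrams(benchmark_item, n)
    return any(item_grams & ngrams(doc, n) for doc in corpus_docs)


if __name__ == "__main__":
    question = "what is the capital of france and which river runs through it"
    leaked = ["quiz dump: What is the capital of France and which river runs through it answer paris"]
    clean = ["a travel guide to the rivers and capitals of western europe"]
    print(is_contaminated(question, leaked))
    print(is_contaminated(question, clean))
```

An exact-match check like this catches verbatim leaks only. Paraphrased or translated copies of a benchmark item pass through, so a clean result is weak evidence at best.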

Stay Ahead

Get Notified When Autopsies Drop

Paper autopsies take time to get right. Subscribe to AI Letters, the weekly digest, and you'll be the first to know when new ones go live.