Open · Practitioner-Led · Production-Grounded

EEAI Research Lab

Research the world can use.

[ Research ]

LLM Framework Showdown

A 30-notebook reproducible benchmark comparing SynapseKit, LangChain, and LlamaIndex head-to-head across 18 production dimensions. We measure cold start time (SynapseKit: 12ms vs LangChain: 360ms), dependency count (2 vs 67), memory footprint under load, streaming latency at P99, API abstraction depth, RAG pipeline throughput, agent loop reliability, and error handling patterns. Every benchmark runs on Kaggle. Clone the notebook, hit run, and verify the numbers yourself. No cherry-picked results, no synthetic loads. We test under the conditions you actually deploy in: concurrent requests, memory-constrained containers, and real API endpoints with network jitter.

SynapseKit · LangChain · LlamaIndex
Framework Benchmark, 18 Dimensions
Cold start · Memory · Latency · DX
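
Two of the dimensions above, cold start time and P99 latency, are easy to measure in a misleading way, so a rough sketch of one possible approach may help. It is not taken from the notebooks: `cold_start_ms` and `percentile` are hypothetical helper names, cold start is assumed to mean "time to launch a fresh interpreter and import the module", and P99 uses the nearest-rank definition.

```python
import math
import subprocess
import sys
import time


def cold_start_ms(module: str, runs: int = 5) -> float:
    """Median wall-clock time (ms) to start a fresh interpreter and import `module`."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # A new process per run avoids measuring an already-cached import.
        subprocess.run([sys.executable, "-c", f"import {module}"], check=True)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return samples[len(samples) // 2]


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of the data at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Interpreter startup is included in every sample, so subtract a trivial import as a baseline.
    baseline = cold_start_ms("sys")
    print(f"json import cost over baseline: {cold_start_ms('json') - baseline:.1f} ms")
    latencies = [float(i) for i in range(1, 101)]  # synthetic latencies, not measurements
    print("P99:", percentile(latencies, 99))
```

Swapping `"json"` for a framework's package name would give a comparable cold start figure on your own machine, with the caveat that disk cache state and container limits move the number considerably.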

Traceprop: End-to-End Provenance-Guided Data Attribution for Auditable Machine Learning

We built Traceprop because every ML pipeline we audited had the same gap: raw source files on one side, model predictions on the other, and nothing connecting them. Traceprop closes it with three co-designed layers: ProvenanceTensor lineage tracking with sub-1% runtime overhead at 1M elements; a JL-projected GradientStore that delivers LDS 0.622 on tabular data in 0.22s on CPU and runs 266x faster than TRAK on CIFAR-2; and provenance-guided unlearning that closes over 100% of the gap, which means it exceeds the retrain-from-scratch baseline. Production-ready for EU AI Act Article 26 and GDPR Article 17 compliance. Apache 2.0. Preprint on Zenodo DOI 10.5281/zenodo.20036000.

credit_scores.csv row 4821 → ProvenanceTensor (normalize, row_filter) → GradientStore (k=4096, JL projection) → Audit Answer (row 4821, score 0.921)

Benchmark Results: Lineage overhead 1.007x · LDS tabular (CPU) 0.622 · vs TRAK speed 266x · Unlearn gap closed >100% · Multi-src query 2.36ms

Apache 2.0 - pip install traceprop
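
To make the "JL projection" step more concrete, here is a minimal sketch of the general Johnson-Lindenstrauss idea: store a short random projection of each training gradient, then score training examples against a test gradient by dot product in the projected space. This is not Traceprop's GradientStore API, which this page does not show. Every function name is hypothetical, random vectors stand in for real per-example gradients, the dot-product score is a simplified similarity measure rather than the estimator behind the LDS figures, and the demo uses k=128 instead of k=4096 to stay fast in pure Python.

```python
import math
import random


def jl_matrix(d: int, k: int, seed: int = 0) -> list[list[float]]:
    """k x d Gaussian matrix scaled by 1/sqrt(k) so projected dot products stay unbiased."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def project(matrix: list[list[float]], vector: list[float]) -> list[float]:
    """Map a d-dimensional vector down to k dimensions."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def attribution_scores(stored, test_grad, matrix):
    """Score each stored (already projected) training gradient against a test gradient."""
    projected_test = project(matrix, test_grad)
    return [dot(p, projected_test) for p in stored]


# Synthetic stand-ins for per-example gradients (random vectors, not real model gradients).
rng = random.Random(1)
d, k, n_train = 256, 128, 20
train_grads = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_train)]
# Build a test gradient that closely resembles training example 3.
test_grad = [g + 0.1 * rng.gauss(0.0, 1.0) for g in train_grads[3]]

matrix = jl_matrix(d, k)
store = [project(matrix, g) for g in train_grads]  # only k floats kept per example
scores = attribution_scores(store, test_grad, matrix)
top = max(range(n_train), key=lambda i: scores[i])
print("most influential training example:", top)
```

The design point is that the store grows with k rather than with the number of model parameters, which is what makes per-example attribution affordable on CPU.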

[ Projects ]

Traceprop: End-to-End Provenance-Guided Data Attribution for Auditable Machine Learning

We built Traceprop because every ML pipeline we audited had the same gap: source files on one side, predictions on the other, nothing connecting them. Traceprop closes it with three layers - ProvenanceTensor lineage (sub-1% overhead at 1M elements), JL-projected GradientStore attribution (LDS 0.622 on tabular, 266x faster than TRAK), and provenance-guided unlearning that exceeds retrain-from-scratch. EU AI Act Article 26 enforcement starts August 2026. Apache 2.0. Two lines of code change. Everything else stays the same.
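
As an illustration of what lineage tracking means in practice, the sketch below carries source row numbers through a normalize step and a row filter, so any surviving value can be traced back to its original row. It is a toy stand-in, not ProvenanceTensor: the `TrackedColumn` class, its methods, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TrackedColumn:
    """A column of values that remembers which source row each value came from."""
    values: tuple[float, ...]
    source_rows: tuple[int, ...]
    history: tuple[str, ...] = ()

    def normalize(self) -> "TrackedColumn":
        # Min-max scaling changes the values but not their origin.
        lo, hi = min(self.values), max(self.values)
        span = (hi - lo) or 1.0
        scaled = tuple((v - lo) / span for v in self.values)
        return TrackedColumn(scaled, self.source_rows, self.history + ("normalize",))

    def row_filter(self, keep: Callable[[float], bool]) -> "TrackedColumn":
        # Dropping rows must drop the matching row ids in lockstep.
        kept = [(v, r) for v, r in zip(self.values, self.source_rows) if keep(v)]
        return TrackedColumn(
            tuple(v for v, _ in kept),
            tuple(r for _, r in kept),
            self.history + ("row_filter",),
        )


# Made-up scores for three made-up source rows.
column = TrackedColumn(values=(300.0, 850.0, 575.0), source_rows=(4819, 4820, 4821))
result = column.normalize().row_filter(lambda v: v > 0.25)
print(result.values, result.source_rows, result.history)
```

After both transforms, `result.source_rows` still names the rows that produced each remaining value, which is the property an audit query relies on.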

SynapseKit v1.7.0

A production-grade LLM framework we built because LangChain frustrated us and we wanted to measure whether a simpler approach could actually work better. SynapseKit has 2 dependencies (vs LangChain's 67), a 30x faster cold start (12ms vs 360ms), and built-in cost guardrails that prevent a single agent run from blowing your API budget. Chains, agents, RAG pipelines, and tool use, all with zero magic and full debuggability. When something breaks at 3am in production, you can read the source in 20 minutes. MIT-licensed, fully documented, and battle-tested against 18 objective benchmarks published on Kaggle.

SynapseKit (httpx, pydantic) vs LangChain (67 dependencies) · 30× faster · 2 vs 67 · MIT

ChunkRank

A Python library for ranking and filtering document chunks by relevance before sending them to an LLM. In any RAG pipeline, the retriever returns chunks - but not all chunks are equally useful. ChunkRank scores and re-ranks them so only the most informative content enters the context window. The result: lower token cost, less noise, and measurably better answer quality. Install with pip, plug into any retrieval pipeline, and start filtering chunks in three lines of code. Open-source, lightweight, and designed to sit between your vector store and your LLM without adding complexity.
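
The score, re-rank, and filter flow described above can be sketched in a few lines. This is not ChunkRank's API or scoring method, neither of which appears on this page: the function names are hypothetical, relevance is approximated by plain query-term overlap, and word count stands in for a real tokenizer.

```python
import re


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query: str, chunk: str) -> float:
    """Fraction of distinct query terms that appear in the chunk."""
    query_terms = set(tokenize(query))
    if not query_terms:
        return 0.0
    return len(query_terms & set(tokenize(chunk))) / len(query_terms)


def rerank(query: str, chunks: list[str], token_budget: int, min_score: float = 0.0) -> list[str]:
    """Return the best-scoring chunks that fit the budget, highest score first."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        cost = len(tokenize(chunk))  # crude stand-in for a real tokenizer
        if score(query, chunk) <= min_score or used + cost > token_budget:
            continue  # skip irrelevant chunks and chunks that would overflow the budget
        selected.append(chunk)
        used += cost
    return selected


chunks = [
    "The license is MIT.",
    "Latency at P99 under load.",
    "Cold start latency was measured in a fresh process.",
]
print(rerank("cold start latency", chunks, token_budget=20))
```

A production re-ranker would likely replace the overlap score with an embedding or cross-encoder score; the selection loop under a token budget stays the same.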

92.4% Overall Accuracy · 47ms Avg Latency · $0.003 Cost/Task · 99.1% Reliability
Open Evals Leaderboard: Provider-agnostic · Standardized metrics

Framework Showdown Notebooks

30 Jupyter notebooks on Kaggle, each one a fully reproducible benchmark you can clone and run in under 5 minutes. We compare developer experience (how many lines of code to build a chain?), RAG pipeline quality (retrieval accuracy, generation faithfulness, latency), agent capabilities (multi-step reasoning, tool use, error recovery), and production concerns (memory usage, cold start, dependency conflicts). Every notebook includes the methodology, raw results, statistical significance tests, and our interpretation. No hidden preprocessing, no curated datasets. If you disagree with a result, fork the notebook and prove us wrong. That's the point.
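
The page does not say which significance tests the notebooks use. As one plausible example, a two-sided permutation test on the difference of mean latencies needs no distributional assumptions and fits in a short function. The `permutation_test` name and the sample values below are assumptions for illustration; the samples are synthetic, not benchmark results.

```python
import random
from statistics import mean


def permutation_test(a: list[float], b: list[float], rounds: int = 2000, seed: int = 0) -> float:
    """Two-sided permutation test on the difference of means; returns a p-value."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(rounds):
        # Under the null hypothesis the group labels are exchangeable, so shuffle them.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (hits + 1) / (rounds + 1)


# Synthetic latency samples in milliseconds for two hypothetical frameworks.
fast = [10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 11.0, 9.0]
slow = [50.0, 48.0, 52.0, 49.0, 51.0, 50.0, 47.0, 53.0]
print("p-value:", permutation_test(fast, slow))
```

A small p-value here only says the gap is unlikely to come from run-to-run noise; it says nothing about whether the workload is representative.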

30 notebooks · Reproducible on Kaggle

[ Models & Tools ]

SynapseKit, Open Source

The framework itself, MIT-licensed on GitHub. Chains, agents, RAG, and tool use with zero bloat, designed for engineers who read source code, not documentation. Two dependencies means two things that can break. No monkey-patching, no global state, no implicit retry logic you discover in production. Every function does exactly what its signature says. The agent loop is 47 lines of code you can understand in one sitting. Cost guardrails are built into the execution engine. Set a budget, and the agent stops cleanly instead of burning through your API credits. Fork it, extend it, or just read it to understand how LLM frameworks actually work under the hood.

$ pip install synapsekit
✓ 2 dependencies installed
$ synapse chain run --model gpt-4
→ chain executed in 47ms
→ cost: $0.003
$ synapse agent --tools search,calc
✓ agent ready (cold start: 12ms)
→ guardrails: cost=$0.50 max

MIT License · Zero Magic · Full Control
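
The "set a budget, and the agent stops cleanly" behavior can be sketched as a guard that is charged before each step runs. This is an illustration of the idea only, not SynapseKit's source or API: `CostGuard`, `BudgetExceeded`, and `run_agent` are hypothetical names, and the per-step costs are assumed to be estimates known up front.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a charge would push spending past the configured cap."""


@dataclass
class CostGuard:
    max_usd: float
    spent_usd: float = 0.0

    def charge(self, usd: float) -> None:
        # Reject the charge before spending, so the cap is never overshot.
        if self.spent_usd + usd > self.max_usd:
            raise BudgetExceeded(f"{self.spent_usd + usd:.2f} > {self.max_usd:.2f}")
        self.spent_usd += usd


def run_agent(steps, guard: CostGuard) -> dict:
    """Run (name, estimated_cost_usd) steps until done or until the budget would be exceeded."""
    completed = []
    for name, estimated_cost in steps:
        try:
            guard.charge(estimated_cost)
        except BudgetExceeded:
            # Stop cleanly and report what finished instead of crashing mid-run.
            return {"status": "stopped_on_budget", "completed": completed}
        completed.append(name)  # a real loop would call the model or tool here
    return {"status": "finished", "completed": completed}


steps = [("plan", 0.20), ("search", 0.20), ("answer", 0.20)]
result = run_agent(steps, CostGuard(max_usd=0.50))
print(result)
```

With a $0.50 cap and three $0.20 steps, the third step is refused and the run returns the two completed steps. Charging before the call rather than after is the design choice that keeps actual spend under the cap.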

[ Blog ]

I Built a Lightweight LLM Framework

The SynapseKit origin story: a detailed engineering deep dive into why we built a new LLM framework, what architectural decisions we made, and what 18 objective benchmarks against LangChain and LlamaIndex actually revealed. Cold start: 30× faster. Dependencies: 2 vs 67. But we also document where SynapseKit loses, because honest benchmarking means publishing the uncomfortable numbers too. This post walks through the chain abstraction, the agent loop design, the cost guardrail implementation, and the RAG pipeline architecture. If you're choosing an LLM framework for production, the numbers in this post will change how you think about the decision.

30× faster · 18 Benchmarks Inside
AI Letters Newsletter · Weekly deep dives for production engineers

AI Letters Newsletter

A weekly newsletter for senior engineers building with AI, not executives skimming headlines. Every issue makes a specific argument, backs it with evidence from papers and production data, and tells you exactly what to do about it. Topics span scaling laws (what the math actually predicts), agent reliability (why your ReAct loop fails at step 7), inference optimization (speculative decoding, KV cache strategies), and framework comparisons (with real benchmarks, not vibes). 29+ issues published. No fluff, no buzzwords, no "AI will change everything" without explaining how. Written by engineers, for engineers who ship.

AI Letters #29 · 29+ issues
Weekly · Scaling Laws · Agent Failures
Production AI insights for senior engineers

150+ Paper Breakdowns

Every major AI paper decoded through an engineering lens: not a summary, but a critical analysis of what each paper claims, what the benchmarks actually measure, and whether the results hold up when you try to reproduce them in production. Sourced from arXiv, Hugging Face Daily Papers, OpenReview, ACL Anthology, and Papers With Code. Each breakdown is tagged by topic, searchable, and filterable by source. We ask the questions practitioners care about: what's the real compute cost? Does this scale? What breaks when you go from 1,000 to 1,000,000 users? If a paper matters for production AI, we've probably broken it down.

150+ papers · 5 Sources · Tagged · Filterable
arXiv · HF Daily · OpenReview · ACL · PwC

Let's Connect

We're building this in the open. Join us. Contribute to research, collaborate on benchmarks, or just follow along.