[ Research ]
LLM Framework Showdown
A 30-notebook reproducible benchmark comparing SynapseKit, LangChain, and LlamaIndex head-to-head across 18 production dimensions. We measure cold start time (SynapseKit: 12ms vs LangChain: 360ms), dependency count (2 vs 67), memory footprint under load, streaming latency at P99, API abstraction depth, RAG pipeline throughput, agent loop reliability, and error handling patterns. Every benchmark runs on Kaggle. Clone the notebook, hit run, and verify the numbers yourself. No cherry-picked results, no synthetic loads. We test under the conditions you actually deploy in: concurrent requests, memory-constrained containers, and real API endpoints with network jitter.
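One way to sanity-check a cold start figure is to time a fresh interpreter importing the package. The sketch below is an assumption about how such a measurement could be made, not the harness used in the notebooks: `cold_start_ms` is a hypothetical helper, and the stdlib `json` module stands in for whichever framework you want to measure.

```python
import statistics
import subprocess
import sys
import time


def cold_start_ms(module: str, runs: int = 5) -> float:
    """Median wall-clock time (ms) for a fresh interpreter to import `module`.

    Each run spawns a new Python process, so nothing is cached in
    sys.modules between runs. The figure includes interpreter startup;
    subtract a baseline (for example, importing `sys`) to isolate import cost.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            [sys.executable, "-c", f"import {module}"],
            check=True,           # raise if the import fails
            capture_output=True,  # keep child output off the console
        )
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


if __name__ == "__main__":
    # `json` is a placeholder; swap in the package under test.
    baseline = cold_start_ms("sys")
    total = cold_start_ms("json")
    print(f"baseline={baseline:.1f} ms, json={total:.1f} ms")
```

The median is used rather than the mean so that a single slow run (disk cache miss, noisy neighbor) does not dominate the result.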
Traceprop: End-to-End Provenance-Guided Data Attribution for Auditable Machine Learning
We built Traceprop because every ML pipeline we audited had the same gap: raw source files on one side, model predictions on the other, and nothing connecting them. Traceprop closes it with three co-designed layers: ProvenanceTensor lineage tracking with sub-1% runtime overhead at 1M elements; a JL-projected GradientStore delivering LDS 0.622 on tabular data at 0.22s CPU (266x faster than TRAK on CIFAR-2); and provenance-guided unlearning that closes more than 100% of the gap to retrain-from-scratch, that is, it exceeds the retrained baseline. Production-ready for EU AI Act Article 26 and GDPR Article 17 compliance. Apache 2.0. Preprint on Zenodo, DOI 10.5281/zenodo.20036000.
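For readers unfamiliar with JL (Johnson-Lindenstrauss) projection, the idea is that a random linear map to far fewer dimensions approximately preserves inner products, so gradient similarity can be computed on small stored vectors. The following is a minimal pure-Python sketch of that general technique only. It is not Traceprop's GradientStore API; `make_projection`, `project`, and the synthetic "gradients" are all hypothetical illustrations.

```python
import math
import random


def make_projection(dim: int, k: int, seed: int = 0) -> list[list[float]]:
    """Return a k x dim Gaussian matrix scaled by 1/sqrt(k).

    With this scaling, squared norms and inner products are preserved
    in expectation, which is the property JL projections rely on.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(dim)] for _ in range(k)]


def project(matrix: list[list[float]], vec: list[float]) -> list[float]:
    """Multiply the projection matrix by a single (gradient) vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def dot(a: list[float], b: list[float]) -> float:
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    dim, k = 1000, 128
    rng = random.Random(1)
    # Stand-ins for per-example gradients; real ones come from a model.
    g_train = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    g_test = [x + rng.gauss(0.0, 0.5) for x in g_train]  # correlated with g_train

    P = make_projection(dim, k)
    exact = dot(g_train, g_test)
    approx = dot(project(P, g_train), project(P, g_test))
    print(f"exact={exact:.1f} approx={approx:.1f} (stored {k} floats, not {dim})")
```

The approximation error shrinks roughly as 1/sqrt(k), so the projected dimension trades storage and speed against attribution fidelity.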
[ Projects ]
Traceprop: End-to-End Provenance-Guided Data Attribution for Auditable Machine Learning
We built Traceprop because every ML pipeline we audited had the same gap: source files on one side, predictions on the other, nothing connecting them. Traceprop closes it with three layers: ProvenanceTensor lineage (sub-1% overhead at 1M elements), JL-projected GradientStore attribution (LDS 0.622 on tabular, 266x faster than TRAK), and provenance-guided unlearning that exceeds retrain-from-scratch. EU AI Act Article 26 enforcement starts August 2026. Apache 2.0. Two lines of code change; everything else stays the same.
SynapseKit v1.7.0
A production-grade LLM framework we built because LangChain frustrated us and we wanted to measure whether a simpler approach could actually work better. SynapseKit has 2 dependencies (vs LangChain's 67), a 30x faster cold start (12ms vs 360ms), and built-in cost guardrails that prevent a single agent run from blowing your API budget. Chains, agents, RAG pipelines, and tool use, all with zero magic and full debuggability. When something breaks at 3am in production, you can read the source in 20 minutes. MIT-licensed, fully documented, and battle-tested against 18 objective benchmarks published on Kaggle.
ChunkRank
A Python library for ranking and filtering document chunks by relevance before sending them to an LLM. In any RAG pipeline, the retriever returns chunks, but not all chunks are equally useful. ChunkRank scores and re-ranks them so only the most informative content enters the context window. The result: lower token cost, less noise, and measurably better answer quality. Install with pip, plug into any retrieval pipeline, and start filtering chunks in three lines of code. Open-source, lightweight, and designed to sit between your vector store and your LLM without adding complexity.
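ChunkRank's own API is not shown on this page, so the following is only a sketch of the general score-then-filter idea. It uses a simple term-overlap score as a stand-in for whatever scoring ChunkRank actually applies; `tokenize`, `score`, `rank_chunks`, and their parameters are hypothetical names.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query: str, chunk: str) -> float:
    """Fraction of distinct query terms that appear in the chunk (0.0 to 1.0)."""
    q_terms = set(tokenize(query))
    if not q_terms:
        return 0.0
    c_terms = set(tokenize(chunk))
    return len(q_terms & c_terms) / len(q_terms)


def rank_chunks(
    query: str, chunks: list[str], top_k: int = 3, min_score: float = 0.0
) -> list[str]:
    """Drop chunks scoring at or below min_score, then return the top_k by score.

    The sort is stable, so ties keep the retriever's original order.
    """
    scored = [(score(query, c), c) for c in chunks]
    kept = [(s, c) for s, c in scored if s > min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in kept[:top_k]]


if __name__ == "__main__":
    retrieved = [
        "The cat sat on the mat.",
        "Streaming latency at P99 matters under load.",
        "Cold start time measures import latency.",
    ]
    for chunk in rank_chunks("cold start latency", retrieved, top_k=2):
        print(chunk)
```

A production re-ranker would typically replace `score` with an embedding or cross-encoder model; the filtering and ordering step around it stays the same.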
Framework Showdown Notebooks
30 Jupyter notebooks on Kaggle, each one a fully reproducible benchmark you can clone and run in under 5 minutes. We compare developer experience (how many lines of code to build a chain?), RAG pipeline quality (retrieval accuracy, generation faithfulness, latency), agent capabilities (multi-step reasoning, tool use, error recovery), and production concerns (memory usage, cold start, dependency conflicts). Every notebook includes the methodology, raw results, statistical significance tests, and our interpretation. No hidden preprocessing, no curated datasets. If you disagree with a result, fork the notebook and prove us wrong. That's the point.
[ Models & Tools ]
SynapseKit, Open Source
The framework itself, MIT-licensed on GitHub. Chains, agents, RAG, and tool use with zero bloat, designed for engineers who read source code, not documentation. Two dependencies means two things that can break. No monkey-patching, no global state, no implicit retry logic you discover in production. Every function does exactly what its signature says. The agent loop is 47 lines of code you can understand in one sitting. Cost guardrails are built into the execution engine. Set a budget, and the agent stops cleanly instead of burning through your API credits. Fork it, extend it, or just read it to understand how LLM frameworks actually work under the hood.
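To make the budget idea concrete, here is a minimal sketch of an agent loop that stops cleanly once spending reaches a limit. It is an illustration under stated assumptions, not SynapseKit's execution engine or its 47-line agent loop: `RunResult`, `run_agent`, and the `step` callable (which returns an output, its cost in USD, and a done flag) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    outputs: list[str]
    spent_usd: float
    stop_reason: str  # "done", "budget_exhausted", or "max_steps"


def run_agent(
    step: Callable[[int], tuple[str, float, bool]],
    limit_usd: float,
    max_steps: int = 20,
) -> RunResult:
    """Run `step` until it reports done, the budget is spent, or steps run out.

    The budget is checked before each step, so the run returns a normal
    result instead of raising mid-call. Because a step's cost is only known
    after it runs, the final step may overshoot the limit slightly.
    """
    spent = 0.0
    outputs: list[str] = []
    for i in range(max_steps):
        if spent >= limit_usd:
            return RunResult(outputs, spent, "budget_exhausted")
        output, cost_usd, done = step(i)
        spent += cost_usd
        outputs.append(output)
        if done:
            return RunResult(outputs, spent, "done")
    return RunResult(outputs, spent, "max_steps")


if __name__ == "__main__":
    # Fake step: every call costs $0.25 and never finishes on its own.
    result = run_agent(lambda i: (f"step {i}", 0.25, False), limit_usd=1.0)
    print(result.stop_reason, len(result.outputs), result.spent_usd)
```

Returning a `stop_reason` rather than raising keeps partial outputs available to the caller, which is one reasonable reading of "stops cleanly".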
[ Blog ]
I Built a Lightweight LLM Framework
The SynapseKit origin story: a detailed engineering deep dive into why we built a new LLM framework, what architectural decisions we made, and what 18 objective benchmarks against LangChain and LlamaIndex actually revealed. Cold start: 30× faster. Dependencies: 2 vs 67. But we also document where SynapseKit loses, because honest benchmarking means publishing the uncomfortable numbers too. This post walks through the chain abstraction, the agent loop design, the cost guardrail implementation, and the RAG pipeline architecture. If you're choosing an LLM framework for production, the numbers in this post will change how you think about the decision.
AI Letters Newsletter
A weekly newsletter for senior engineers building with AI, not executives skimming headlines. Every issue makes a specific argument, backs it with evidence from papers and production data, and tells you exactly what to do about it. Topics span scaling laws (what the math actually predicts), agent reliability (why your ReAct loop fails at step 7), inference optimization (speculative decoding, KV cache strategies), and framework comparisons (with real benchmarks, not vibes). 29+ issues published. No fluff, no buzzwords, no "AI will change everything" without explaining how. Written by engineers, for engineers who ship.
150+ Paper Breakdowns
Every major AI paper decoded through an engineering lens: not a summary, but a critical analysis of what each paper claims, what the benchmarks actually measure, and whether the results hold up when you try to reproduce them in production. Sourced from arXiv, Hugging Face Daily Papers, OpenReview, ACL Anthology, and Papers With Code. Each breakdown is tagged by topic, searchable, and filterable by source. We ask the questions practitioners care about: what's the real compute cost? Does this scale? What breaks when you go from 1,000 to 1,000,000 users? If a paper matters for production AI, we've probably broken it down.
Let's Connect
We're building this in the open. Join us. Contribute to research, collaborate on benchmarks, or just follow along.
