AI Engineering Letters - EngineersOfAI

AI Letters #35 - Why We Built SynapseKit: The Framework We Deserve

Thu, 14 May 2026 00:00:00 GMT

It was 3 AM and production was on fire. An LLM pipeline had cold-started on Lambda taking 30 seconds just to import dependencies, while the $99/month observability tool told us nothing useful. We'd chosen a "safe" framework with 100K stars and enterprise support—but we were fighting it as much as building with it. That moment led us to rebuild from first principles. Meet SynapseKit: 2 dependencies, async-native, full cost transparency, and Apache 2.0 forever.

Interactive Chart

SynapseKit Roadmap - v1.7.0 to v2.0.0 →

From 12 contributors in month 1 to 40+ by month 3. 8 major features shipping June-September 2026.

Interactive Chart

SynapseKit Design Philosophy →

5 principles that compound: dependency minimalism, async-native, transparency, community, open source as moat.

Interactive Chart

Public Benchmarks & Verdicts →

Cold start, token costs, latency, and feature coverage. All data published. Anyone can reproduce.

The Problem We Lived

It was 3 AM. Production was on fire. An LLM pipeline had cold-started on Lambda, and the container was taking 30 seconds just to import dependencies. Meanwhile, the observability tool we paid $99/month for was telling us... nothing useful.

We'd chosen a popular framework because it was the "safe" choice. It had 100K stars, enterprise support, and a massive ecosystem. But in production, it felt like we were fighting the framework as much as building with it.

The async APIs were baked on top of synchronous code. The dependency tree was a forest (50+ transitive deps). Observability required another SaaS subscription. And debugging? Forget it—too much "magic" between you and the LLM call.

We're not unique. Thousands of teams have hit the same wall. And we thought: What if we rebuilt this from first principles?

That question became SynapseKit.

What SynapseKit Actually Is

SynapseKit is not trying to be a LangChain killer. It's trying to be different.

The difference starts here—not features, but principles:

Problem	LangChain-Style	SynapseKit
Dependencies	50+ (200 MB)	2 (numpy, rank-bm25)
Async Design	Bolted on	Native from day 1
Cost Visibility	$99+/month SaaS	Built-in, free
Deployment Tools	Deprecated	synapsekit serve
Observability	Black box	Instrumented, transparent
Token Tracking	Hidden	Per-call tracking

We're building for production teams who are tired of choosing between:

Power (but complexity)
Simplicity (but missing features)
Open source (but no support)
Commercial (but expensive and lock-in)

SynapseKit says: You don't have to choose.

What This Means for You

1. You Own Your Code

Every LLM call, every prompt, every decision—it's yours. There's no proprietary "chain" abstraction hiding what's happening.

from synapsekit import RAG

rag = RAG(model="gpt-4o")
rag.add("Your documents")

# This actually does what you think it does.
# No hidden orchestration. No vendor-specific magic.
result = await rag.query("What is this about?")

Compare to frameworks where rag.query() invokes 12 internal transformations you didn't ask for.

2. You Keep 90% of Your Cold Start

Lambda cold starts matter. A 2 KB framework matters.

We measured: import synapsekit = 200 ms. import langchain = 2.8 seconds.

That's not hypothetical. That's real deployments. That's the difference between your API responding in 100 ms vs 3 seconds during scale events.

3. You See Your Costs

from synapsekit import CostTracker, BudgetGuard

tracker = CostTracker()
guard = BudgetGuard(daily=10.0, per_request=0.50)

with tracker.scope("my_pipeline"):
    result = await rag.query("Question?")

print(tracker.summary())
# Output:
# total_cost: $0.0234
# tokens_in: 1,200
# tokens_out: 450
# model: gpt-4o
# cost_per_1k: $2.50 / $15.00

Every LLM framework should have this. No SaaS fees. No surprise bills. Just facts.

Why We're Staying Open Source (Forever)

This matters. So let's be clear about what open source means to us.

The Temptation

VC-backed frameworks always face the moment: "When do we monetize?"

LangChain took it by building LangSmith ($99+/mo). That's a valid business model. But it creates incentive misalignment: the best features live behind a paywall.

We're choosing differently.

The Bet

SynapseKit core = Apache 2.0 forever.

No tricky license changes. No "open core" where the good stuff is closed. No "we're keeping the best for enterprise."

The framework you use in production is the same framework available to students, hobbyists, and competitors.

Why? Because:

Trust compounds. If you know the code can't suddenly become proprietary, you can bet your infrastructure on it.
Bugs matter less. Open source means crowdsourced debugging. 200 eyes beat 20.
Optimization flows both ways. When a user optimizes for their use case and contributes it back, everyone wins.
We make money differently. (More on that below.)

What We Monetize

We monetize on top, not instead of:

SynapseKit Core (framework) - Apache 2.0, always free
EvalCI Pro (evaluation SaaS) - Team dashboards, Slack alerts, private repos
synapsekit.cloud (managed hosting) - Deploy with one command
Compliance reports - EU AI Act and GDPR audits for enterprises

The core framework is the funnel. Everything else is optional.

This is the bet: Build the most trustworthy LLM framework. Let it be free. Earn money by solving operational problems the framework surfaces.

What the Community Taught Us

We shipped SynapseKit in March 2026. By May, we had 12 contributors and 9,200 downloads in 30 days.

Here's what the community actually cares about (not what we thought they would):

Simplicity Beats Ecosystem

We expected people to love our 33 LLM providers. They do. But what they really love: changing one line (model="anthropic/claude" to model="groq/mixtral") and the entire pipeline switches.

Lesson: Unified APIs beat breadth.

Cost Visibility Beats Ease

We built CostTracker assuming 5% of users would enable it. 40% did immediately.

Teams aren't afraid of complexity. They're afraid of surprise bills.

Lesson: Make the invisible visible.

Async-Native Beats Backwards Compatibility

We chose async-first, sync-wrappers. We got pushback: "But some teams only use sync!"

Six months later, those teams were refactoring to async. The performance difference was too obvious to ignore.

Lesson: The future is async. Bet on it.

Testing Beats Documentation

We shipped with thorough tests (2,161 by v1.5.6) but sparse docs. People still contributed. They read the tests as documentation.

Lesson: Tests are the spec.

Transparency Beats Polish

When we had a bug in async evaluation (v1.5.1), we posted a detailed postmortem explaining why we missed it. The community response: "At least you're honest."

Lesson: Admit mistakes. Explain root causes. Ship fixes.

How We're Benchmarking Everything (No Illusions)

We could say "SynapseKit is faster" and assume no one would check. But we're betting on people who will check.

So we're running public benchmarks:

Cold Start Benchmarks

Framework          Import Time    Container Size
SynapseKit         200 ms         ~5 MB
Framework B        2,800 ms       ~200 MB
Framework C        1,200 ms       ~150 MB

Published monthly. Real data. Anyone can reproduce it.

Token Cost Benchmarks

Task: "Summarize 10 documents, return JSON"

Model    Via SynapseKit    Via Others    Difference
GPT-4o   $0.0234           $0.0234       (same!)
Claude   $0.0198           $0.0198       (same!)
Groq     $0.00001          $0.00001      (same!)

No hidden markup. No feature taxes. We're a passthrough.

Latency Benchmarks

Operation                  P50    P95    P99
RAG query (retrieval)      45ms   120ms  300ms
Agent tool call            80ms   250ms  800ms
Graph workflow (10 nodes)  200ms  600ms  1.5s

Published, reproducible, hardware-specified.

Feature Coverage Benchmarks

Feature              SynapseKit    Others
LLM Providers        33            38+
Document Loaders     53            200+
Vector Stores        11            15+
Built-in Tools       47+           50+
Async Support        ✅ Native     ⚠ Bolted-on
Token Tracking       ✅ Free       ❌ Paid
Deployment           ✅ Built-in   ❌ Deprecated

No hidden asterisks. No "features you can't use."

Why We Benchmark

We're not trying to win on every metric. We're trying to be honest about the tradeoffs.

Yes, LangChain has 200+ loaders. We have 53. But those 53 are maintained and tested. A loader that breaks silently is worse than no loader.

Yes, we're missing some providers. But when you use a provider on SynapseKit, you know it works because we test it against actual APIs.

The bet: Teams would rather have 90% great than 100% mediocre.

Why We'll Be the Best Tool

Not because we have the most features. Not because we have the most stars.

Because we're built on principles that compound:

Dependency Minimalism = Embeddability

Every dependency you add is a future security hole, a version conflict, a cold start penalty.

We said: What if we just didn't? What if we built for embedding first, plugins second?

This means SynapseKit works in:

Lambda (fast cold starts)
Kubernetes (light containers)
Mobile (small binaries)
Edge (no Python stdlib bloat)

Others can't do this without a rewrite.

Async-Native = Production-Ready

Async isn't about being faster in theory. It's about handling real-world concurrency: 100 concurrent requests, 50 LLM API calls in flight, 10K tokens streaming.

Sync-first frameworks hit a wall at scale. Async-first frameworks scale to infinity.

We bet on infinity.

Transparency = Trust

No proprietary chains. No hidden costs. No surprise bills. Every LLM call is logged, tracked, and visible.

Trust is the hardest thing to build. And the easiest to lose. We're not willing to risk it for short-term gains.

Community = Compounding Returns

12 contributors in month 1. We're not paying them. They're contributing because:

They believe in the mission
The codebase is legible
Contributions are credited
The community is kind

This compounds. Month 2: 20 contributors. Month 3: 40 contributors. By year 2: a community-driven framework that no VC team could build.

Open Source = Moat

Counterintuitive: staying open source is our biggest competitive advantage.

Why? Because:

Teams bet their infra on open source. Not on a company.
Open source survives company acquisition/failure. Closed source doesn't.
Switching costs from open source are high (migration time, vendor trust). But lock-in is low (you always own the code).

This is a different kind of moat. It's built on trust, not contracts.

The 8 Features We're Shipping (v1.8.0 - v2.0.0)

We just mapped the roadmap. Here's what's coming:

v1.8.0: Production Grade (June 15)

🔍 Observability Dashboard: OpenTelemetry and Prometheus (no SaaS needed)
✅ Structured Output: Validation and auto-retry (no more JSON failures)
💾 Smart Context: Hierarchical allocation and prompt caching (80% cost reduction)
📊 Retrieval Metrics: Measure if RAG actually helps

v1.9.0: Advanced Retrieval (July 20)

🌐 Knowledge Graphs: Multi-hop reasoning and entity relationships
🧠 Reasoning Routing: Smart routing to o1/o3/Claude thinking models

v2.0.0: Distributed (September 1)

🤖 Agent Federation: Multi-agent coordination at scale
📈 Feedback Loops: Production to training data to auto-improvement

We're shipping 8 major features in 4 months. The framework as built by the community.

What Success Looks Like

Not valuation. Not GitHub stars (though those help).

Success is:

A team deploys an LLM app on SynapseKit and it just works.
A student learns async Python by reading SynapseKit's codebase.
An open-source contributor ships a feature that 10,000 people use.
A startup scales to 1M requests/day without hitting a wall.
An enterprise can audit the code and say "Yeah, we trust this."

Join Us

We're hiring open-source contributors. Not employees. Contributors.

You pick an issue. You ship it. You're credited as co-author. End of transaction.

Start here: https://github.com/SynapseKit/SynapseKit/issues/695-702

8 issues. Your choice. 1-3 weeks. Shipped to production.

The Final Truth

We're not building SynapseKit because we think we're smarter than the frameworks that came before. We're building it because we learned from them.

We learned that:

Teams care about cold starts more than ecosystem breadth
Cost transparency beats feature parity
Async-native isn't optional in 2026
Open source isn't a business model; it's a moat
Benchmarks matter more than claims

We're building for the 10,000 teams shipping LLM apps in production right now. Not the 100 teams with billion-dollar budgets. Not the students building chatbots. Not the conferences talking about theory.

For people who actually care that their imports don't take 3 seconds. Who track every dollar. Who want to read the code they ship. Who believe open source beats closed ecosystems.

If that's you, we'll see you in the PRs.

Let's build the framework we deserve.

Resources

GitHub: https://github.com/SynapseKit/SynapseKit
Docs: https://synapsekit.github.io/synapsekit-docs/
Discord: https://discord.com/invite/PSuAXHRywJ

Written May 14, 2026. SynapseKit v1.7.0 is live. v1.8.0 ships June 15.

This post will be outdated in 2 months. That's the point.

Read more: www.engineersofai.com

If this was useful, forward it to one engineer who should be reading it.

AI Letters #34 - The 30-Day LLM Framework Verdict: 25 Benchmarks, One Clear Answer

Fri, 08 May 2026 00:00:00 GMT

30 notebooks. 25 benchmarks. SynapseKit 14 wins (8.39/10), LangChain 7 wins (6.83/10), LlamaIndex 4 wins (6.40/10). Here is where each framework wins, where it loses, and which one you should actually use.

Interactive Chart

30-Day Benchmark Timeline ->

All 25 benchmark scores across 4 weeks. Filter by week, hover for details, see which framework won each notebook.

Interactive Code Comparison

Simplest RAG: Line by Line ->

Side-by-side code for all three frameworks across three complexity levels. See the LoC cost of adding retrieval and memory.

Final Verdict Dashboard

Win Distribution & Category Breakdown ->

Click each framework to see exactly what it wins and where it struggles. Radar chart, category averages, and when-to-use guide.

After 30 notebooks and 25 benchmarks, the ranking is clear. But the more interesting result is where each framework loses.

I started this series with a simple question: if you were starting a new AI project today, which framework should you actually use?

Not "which has the most GitHub stars." Not "which has the best documentation." Not "which do the most job listings mention." Which one performs better on the tasks you will actually need to do - from cold start to production guardrails?

Thirty notebooks later, the data has an answer. The answer is not what I expected when I designed the benchmarks.

The series ran four weeks. Week 1 tested developer experience: how fast can you install it, how many lines to get a working RAG, how much memory does it use, how well does it handle provider switching, how readable are its error messages? Week 2 moved into RAG pipelines: PDF ingestion, chunking strategies, BM25, hybrid search, streaming, conversation memory. Week 3 covered agents: ReAct loops, function calling, built-in tools, multi-agent orchestration, observability, error handling. Week 4 tested production readiness: async throughput, graph workflows, LLM evaluation, cost tracking, guardrails, MCP support. The finale (#29) asked a deliberately blunt question: what is the absolute minimum code to build a working RAG pipeline in each framework?

Disclosure: I am the author of SynapseKit. All benchmarks are reproducible - every notebook is public on Kaggle. Fork and run yourself.

What the Data Actually Shows

The final scores across 25 benchmarks:

Framework	Avg Score	Total	Wins	Win %
SynapseKit	8.39/10	209.7	14	56%
LangChain	6.83/10	170.8	7	28%
LlamaIndex	6.40/10	160.0	4	16%

That top-line number is not the interesting part. The interesting part is the pattern of where each framework wins and loses.

SynapseKit wins 4 of 6 in Week 1, 2 of 6 in Week 2, 3 of 6 in Week 3, and 4 of 6 in Week 4. The only weeks where it does not dominate are the ones involving complex agent orchestration (Week 3) and deep RAG quality (Week 2). Those are exactly the areas where LangChain and LlamaIndex have years of accumulated investment.

LangChain wins 7 of 25. All 7 are in areas requiring sophisticated composition: streaming, conversation memory, function calling, multi-agent, observability, graph workflows. LangGraph - LangChain's DAG abstraction - is genuinely the most mature stateful workflow tool available in any LLM framework today. That is not close.

LlamaIndex wins 4 of 25. Three of those wins are RAG-specific: PDF ingestion, chunking strategies, and LLM evaluation. LlamaIndex's faithfulness and relevancy evaluators are deeper than anything the other two frameworks ship out of the box.

The Evidence

Week 4: Production Readiness

The Week 4 results were the most lopsided of the series. SynapseKit took 4 of 6.

Async throughput (#22): SynapseKit delivered 3.2x LangChain's throughput at 20 concurrent requests. The framework is async-native at the core. LangChain and LlamaIndex treat async as an add-on.

Guardrails (#26): SynapseKit is the only framework with built-in PIIDetector, PIIRedactor, and ContentFilter primitives. LangChain scored 4.5/10. LlamaIndex scored 3.5/10. SynapseKit scored 9.8/10. That gap reflects a fundamental design choice about what belongs in the framework.

MCP Support (#27): SynapseKit supports MCP in-process, with a sync API, hitting 8/8 protocol features. LangChain hit 3/8 and requires a subprocess. As MCP becomes the standard interface for AI-to-tool connectivity, this gap will matter more.

Cost Tracking (#25): CostTracker in SynapseKit is 2 lines. Per-call tracking, session rollups, and budget limits. In LangChain you write this yourself using callbacks. In LlamaIndex you hook into their event system.

LangChain took graph workflows (#23). LangGraph scored 9.0/10. The StateGraph abstraction is genuinely better than anything the other frameworks offer for conditional branching, human-in-the-loop workflows, and persistent agent state.

The Simplest RAG Test (#29)

This was the most revealing single benchmark of the series.

SynapseKit - Level 1 (minimum viable RAG):
  from synapsekit import RAG
  answer = RAG.quick(SAMPLE_DOC, QUERY)
  Total: 2 lines

LlamaIndex - Level 1:
  from llama_index.core import VectorStoreIndex, Document
  index  = VectorStoreIndex.from_documents([Document(text=SAMPLE_DOC)])
  engine = index.as_query_engine()
  answer = engine.query(QUERY)
  Total: 4 lines (+ global Settings.llm required)

LangChain - Level 1:
  from langchain_core.prompts import ChatPromptTemplate
  from langchain_core.runnables import RunnablePassthrough
  from langchain_core.output_parsers import StrOutputParser
  prompt = ChatPromptTemplate.from_template(...)
  chain = (
      {"context": RunnablePassthrough(), "question": ...}
      | prompt | llm | StrOutputParser()
  )
  answer = chain.invoke({"context": SAMPLE_DOC, "question": QUERY})
  Total: 13 lines

The complexity tax per added feature:

Feature added      SK    LC    LI
----------------------------------
Base (L1)           2    13     4
+ Retrieval (L2)   +3    +8    +3
+ Memory (L3)      +2    +6    +4
----------------------------------
Full pipeline (L3)  7    27    11

The Full 30-Day Pattern

Category          SK avg   LC avg   LI avg   SK wins
----------------------------------------------------
Week 1 Dev Exp      8.37     5.83     6.00      4/6
Week 2 RAG          8.08     7.00     7.33      2/6
Week 3 Agents       8.17     8.08     6.08      3/6
Week 4 Production   8.75     6.63     5.92      4/6
Week 5 Simplest     9.50     5.50     8.00      1/1
----------------------------------------------------
Overall             8.39     6.83     6.40     14/25

Week 3 (Agents) is where the race was closest: SynapseKit 8.17, LangChain 8.08. LangChain's multi-agent orchestration and observability tooling are genuinely strong.

What This Means for Engineers

1. The "fewest lines" metric is not vanity - it predicts maintenance cost.

Every line of boilerplate is a line someone has to read, debug, and update when the API changes. A 13-line Level 1 RAG means every junior engineer on your team has to understand RunnablePassthrough before they can make their first contribution. A 2-line RAG means they start from the problem, not the plumbing.

2. LangGraph is a genuine competitive advantage - but only if you need it.

If your application requires stateful DAG workflows - conditional branching, human-in-the-loop approval steps, persistent agent memory across sessions - LangGraph is the best tool available. If your application does not need that, you are paying the complexity tax of LangChain without getting the payoff.

3. LlamaIndex's RAG evaluators are not replicable elsewhere in 10 minutes.

The faithfulness and context recall evaluators LlamaIndex ships have years of iteration behind them. If you are running a serious RAG system where retrieval quality is a measurable business metric, LlamaIndex's evaluation infrastructure is worth the integration cost.

4. Production primitives (guardrails, cost tracking, MCP) belong in the framework, not in your code.

Every PII detection regex you write in your app layer is a liability. Every manual token counter is a bug waiting to happen when you switch models. SynapseKit's Week 4 wins reflect a deliberate choice to move production concerns into framework primitives.

5. The ecosystem gap is real and will not close quickly.

LangChain has more blog posts, more Stack Overflow questions, more third-party integrations, and more engineers who already know it than SynapseKit. When something breaks in production at 2am, you want that ecosystem.

The Part Most People Will Get Wrong

The top-line verdict - SynapseKit wins 14/25 - will be read as "use SynapseKit for everything." That is not what the data says.

LangChain's 7 wins cluster in exactly the scenarios that matter most for large teams and complex systems: orchestration, observability, multi-agent coordination. If you are building a 10-person team product with complex agent workflows, LangChain's ecosystem and LangGraph's maturity probably outweigh the LoC advantage.

LlamaIndex's 4 wins are in a tightly defined domain where it is the best tool available. If your core product is document Q&A or knowledge base search, LlamaIndex's chunking strategies and evaluation framework represent real engineering investment you should not ignore.

The honest one-line per use case:

New project, small team, wants to ship fast: SynapseKit
Complex agents, large engineering team, needs ecosystem: LangChain
RAG quality as a core metric, document intelligence: LlamaIndex

Three Things Worth Doing This Week

Run the simplest-rag benchmark (#29) with your own document and query. The LoC difference is more visceral when it is your code, not mine.
If you are currently using LangChain for a simple RAG pipeline (no agents, no complex branching), count how many lines of boilerplate exist solely for framework composition. That number is your migration ROI estimate.
If you have a production LLM system with no PII detection layer, add one this week. It does not have to be SynapseKit - but it has to be something. The cost of a PII leak is not worth the shortcut.

The full series index with all 30 notebooks is on Kaggle. Every score is reproducible. Fork any notebook and run it yourself - if you get different numbers, I want to know.

This is not "my framework won so I declare victory." This is 30 notebooks of data saying: different frameworks are better at different things, and the choice should be driven by what your application actually needs.

Engineers of AI

Read more: www.engineersofai.com

If this was useful, forward it to one engineer who should be reading it.

AI Letters #33 - We Built Traceprop: Finally, an ML Audit Trail That Answers the Regulator's Question

Tue, 05 May 2026 00:00:00 GMT

GitHub - AmitoVrito/Traceprop Preprint - DOI 10.5281/zenodo.20036000 pip install traceprop

We spent months auditing ML pipelines across regulated industries. Every single one had the same gap: source files on one side, model predictions on the other, and nothing connecting them. MLflow knew which file. DVC knew which commit. Influence libraries knew which tensor. Nobody knew which source row drove which decision. We built Traceprop to fix this. Today it's open source.

Interactive Timeline

The Provenance Gap: History and Enforcement Deadlines ->

From MLflow (2018) to EU AI Act enforcement (2026-2027): the full timeline of tools, gaps, and regulatory deadlines that made building Traceprop unavoidable.

Interactive Architecture

How Traceprop Works: Three-Layer Architecture ->

Click through each layer - lineage, attribution, unlearning - to see exactly how ProvenanceTensor, GradientStore, and the compliance exporter connect source files to predictions to audit certificates.

Benchmark Dashboard

The Numbers: Overhead, Attribution Quality, Unlearning ->

Sub-1% overhead at 1M elements. LDS 0.622 on tabular data. 266x faster than TRAK. Unlearning that exceeds retrain-from-scratch. Every benchmark in one view.

We built Traceprop because every ML pipeline we audited had the same fatal gap: source files on one side, model predictions on the other, and nothing in between that could answer a regulator's question. Today that changes.

A credit-scoring model declines an application. The regulator invokes Article 26 of EU Regulation 2024/1689. They want three things: which training records drove that decision, whether those records were processed correctly, and whether the institution can reduce their influence without full retraining.

We watched a well-resourced ML team try to answer this question. They had MLflow for experiment tracking, DVC for dataset versioning, and a state-of-the-art influence function library. It took them eleven days and they still couldn't produce a defensible answer. MLflow knew which file was used - not which rows. DVC knew which commit - not which preprocessing steps were applied to specific rows. The influence library operated on already-processed tensors with no knowledge of which source row produced each one.

That team is not an outlier. That gap is the default state of every ML pipeline that hasn't explicitly engineered a lineage layer. We built Traceprop to close it permanently.

Why We Built This

We didn't set out to build a compliance tool. We set out to answer a question that kept coming up in every production ML system we worked on: if a model makes a bad decision, can you trace it back to the training data that caused it?

The answer was always no. Not because engineers were being lazy. Because the tools were architecturally incapable of answering it. Each tool stopped at its own boundary and handed off to nothing.

THE PROVENANCE GAP - what each tool actually covers

MLflow/DVC      [experiment metadata] [dataset file]
                                                   ^ stops here

Preprocessing   [data loaded] [transform 1] [transform 2]
                                                          ^ stops here

Attribution     [tensor indices] [influence scores]
^ starts here

Source rows     [credit_scores.csv row 4821]
^ nobody connects this to anything above

We needed a system that treats the entire pipeline - from raw file row to final prediction - as a single traceable object. That system didn't exist. So we built it.

What Traceprop Is

Traceprop is a Python library that introduces one new concept: the ProvenanceTensor. Every array in your pipeline becomes a ProvenanceTensor when loaded through Traceprop. It wraps the underlying NumPy or PyTorch array and records a directed acyclic graph of every operation applied to it, with source-file row annotations at the leaves.

You change two lines of code. Everything else stays the same.

import traceprop as tp

# Change: tp.load_csv instead of pd.read_csv
X = tp.load_csv("credit_scores.csv")  # now a ProvenanceTensor

# Everything else is identical to your existing code
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
X_filt = X_norm[X_norm[:, 3] > 0]

# New capability: query provenance instantly
X_filt.sources()   # {credit_scores.csv: [rows 0-4998]}
X_filt.ops()       # [normalize, row_filter]
X_filt.ancestors() # full DAG at depth 1000 in 0.42ms

The overhead is sub-1%. At 10^6 array elements: 1.007x on macOS, 0.979x on Linux. The sub-unity overhead on Linux is real - Traceprop's batch-aware memory layout improves cache locality enough that lineage tracking is actually faster than raw NumPy at that scale.

We are not asking you to rewrite your pipeline. We are asking you to change two lines and get an audit trail.

The Attribution Layer: Connecting Predictions to Source Rows

Lineage tells you which source rows a tensor came from. Attribution tells you which training samples most influenced a specific prediction. Connecting the two - so you can go from a declined application all the way back to the exact CSV row that drove it - is the core engineering contribution of Traceprop.

The naive approach fails immediately. Storing one full-parameter gradient per training sample costs 24 TB for a ResNet-9 at 1M samples. We use sparse Johnson-Lindenstrauss projection to compress gradients to k dimensions. At k=4096 the GradientStore costs 15.3 GB for 1M samples. Fits a standard cloud instance. The JL distortion bound (epsilon ~= 0.18 at k=4096) is proven, not empirical - the top-k attribution set is correct with high probability.

from traceprop.attribution import TrainingContext, GradientStore, compute_influence_scores

store = GradientStore(k=4096, path="./grad_store/")

# Wrap your training loop - that's all
with TrainingContext(model, store) as ctx:
    for epoch in range(num_epochs):
        for batch_idx, (X_batch, y_batch) in enumerate(loader):
            loss = criterion(model(X_batch), y_batch)
            ctx.backward(loss, batch_idx=batch_idx)  # one change
            optimizer.step()

# Now answer the audit question
scores = compute_influence_scores(model, store, declined_application, top_k=20)
for sample_idx, score in scores[:5]:
    provenance = store.get_provenance(sample_idx)
    print(provenance.trace_to_file())
    # -> credit_scores.csv, row 4821, influence score: 0.921
    # -> credit_scores.csv, row 2103, influence score: 0.887

The benchmark numbers are honest about where Traceprop wins and where it doesn't.

For tabular models - which dominate regulated industries - Traceprop is the right tool with no caveats. LDS 0.622 at 0.22 seconds on CPU. No GPU required. Full source-file traceability. This is the setup that matters for credit scoring, insurance underwriting, and HR decisions.

For deep vision with BatchNorm, TRAK (Park et al., 2023) achieves better attribution quality (LDS 0.0290 in 691 seconds on GPU). Traceprop-LL achieves LDS 0.0168 in 2.6 seconds on CPU - 266x faster, lower quality. The degradation comes from BatchNorm encoding batch statistics into last-layer features, corrupting the per-sample gradient signal. For image models, use Traceprop for lineage and unlearning, TRAK for attribution quality when you have GPU budget.

We are telling you exactly where we beat existing tools and where we don't. If a library doesn't do this, treat its benchmark numbers as marketing.

GDPR Article 17 gives individuals the right to have their personal data erased from trained models. No existing tool connected "which CSV rows belong to this data subject" to "which training tensor indices to unlearn" automatically. You had to do it by hand, with no consistency guarantees. We automated the entire chain.

from traceprop.unlearn import approximate_unlearn, export_compliance

# GDPR erasure request - source rows map automatically to tensor indices
forget_set = store.samples_from_source("credit_scores.csv", rows=[4821, 7203, 9100])

# Gradient correction targets exactly the highest-influence samples
theta_prime = approximate_unlearn(model, forget_set, eta=0.01, steps=10)

# Export Article 26 compliance certificate
report = export_compliance(
    model_before=model, model_after=theta_prime,
    forget_set=forget_set, store=store,
    regulation="EU_AI_ACT_ART26",
)
report.save("unlearning_certificate.json")

The results against the standard benchmark (binary classification, n=1000, forget set of 50 highest-influence samples):

METHOD                    FORGET-SET LOSS   TEST ACC   GAP CLOSED
Original (no unlearning)  0.379             0.920      0%
Gold (retrain-scratch)    0.401             0.918      100%
Traceprop unlearning      0.425             0.915      >100%
Random unlearning         0.382             0.915      14%

Traceprop exceeds the retrain-from-scratch gold standard. Random unlearning closes 14% of the gap. That 7x difference is entirely because we know which samples are highest-influence and target them specifically. Without attribution, you are unlearning the wrong samples.

The gradient correction is first-order approximate - we document this clearly. There is no formal differential privacy guarantee. What there is: a verifiable, measurable effect on model behavior, traceable to specific source rows, exported in a format regulators can inspect.

The Multi-Source Case

Real pipelines are not single-CSV pipelines. We tested Traceprop on a 3-table credit risk pipeline: application data, credit bureau data, previous application history. 180,000 source rows total. 20,000 applicants.

SOURCE TABLE              ROWS     ATTRIBUTION WEIGHT
application.csv           20,000   0.424
bureau.csv                80,000   0.426
previous_application.csv  80,000   0.434

ETL overhead:   2.93x (paid once at ingestion)
Query latency:  2.36ms (full attribution + source resolution across all 3 tables)

2.36 milliseconds to answer "which rows in which table drove this decision, through which preprocessing steps." The ETL overhead is paid once at ingestion. Query time has no pipeline complexity penalty.

The Enforcement Dates

EU AI Act Article 26 logging obligations apply from August 2026 for new high-risk AI systems. The backstop enforcement date for all deployed high-risk systems is 2 December 2027. GDPR Article 17 erasure obligations are already in force.

High-risk AI systems under the Act include: credit scoring, employment decisions, educational assessment, critical infrastructure management, biometric identification. If you are building any of these, the compliance question is not whether you need this infrastructure. It is how much of the gap you have already closed.

Most teams we've talked to have closed zero percent of it. They are planning to "deal with compliance later." Later is August 2026. That is under four months away.

Why We're Open-Sourcing It

We built this for our own work. Then we realized the gap was universal - every ML team in a regulated domain was hitting the same wall. Keeping a proprietary solution while the industry ships non-compliant models would be the wrong call.

Traceprop is Apache 2.0. The preprint is on Zenodo (DOI: 10.5281/zenodo.20036000). The implementation is designed for incremental adoption - you can use only the lineage layer, only attribution, or the full stack. Start with one line change and expand from there.

pip install traceprop

That's the starting point. The preprint has full architectural documentation, benchmark methodology, and implementation notes for production deployment.

What to Do Right Now

1. Install Traceprop and run the lineage layer on your next pipeline. Two lines of code change. Sub-1% overhead. You get a full audit trail from source file rows through every preprocessing operation. This is the minimum viable compliance step and costs you almost nothing.

2. If you're in a regulated industry, benchmark attribution on your tabular models today. LDS 0.622 at 0.22 seconds on CPU. No GPU. No infrastructure changes. If your pipeline is tabular (credit, insurance, HR), Traceprop-LL is the right attribution tool right now, not a future option.

3. Map your GDPR erasure workflow to the unlearning layer. The automatic source-row-to-tensor-index mapping is the piece that takes a manual 11-day process and makes it a 10-second operation. That alone justifies the integration.

4. Read the enforcement deadlines again. August 2026 for new high-risk systems. Four months. The architectural decisions you make this quarter will determine whether your system can answer a regulatory audit question when the clock runs out.

5. Share this with the compliance and legal team. The compliance certificate export (export_compliance(..., regulation="EU_AI_ACT_ART26")) produces a JSON document auditors can inspect directly. This is documentation your legal team needs to see before your next system deployment.

The 2 December 2027 backstop deadline looks distant. August 2026 does not. We built Traceprop so teams don't have to spend eleven days manually stitching together three tool outputs and still come up empty. Install it. Use it. The gap is closed.

pip install traceprop

Preprint: DOI 10.5281/zenodo.20036000 - Apache 2.0 - pip install traceprop

Engineers of AI

Read more: www.engineersofai.com

If this was useful, forward it to one engineer who should be reading it.

AI Letters #32 - Your RAG Has No Immune System

Thu, 30 Apr 2026 00:00:00 GMT

LangChain 1.x removed its evaluation module. Most teams never noticed. Notebook #24 of the LLM Showdown tests which frameworks give you faithfulness, relevancy, and regression tracking out of the box - and which ones leave you to build it from scratch.

Interactive Chart

History of LLM Evaluation ->

From BLEU scores to LLM-as-judge: how the field evolved from word-overlap heuristics to model-graded faithfulness evaluation.

Interactive Comparison

Framework Feature Matrix ->

Click each feature to see exactly how SynapseKit, LangChain, and LlamaIndex implement (or don't implement) faithfulness, relevancy, batch eval, and regression tracking.

Evidence Dashboard

Benchmark Results: Scores and LoC ->

Lines of code, feature coverage scores, and heuristic eval scores across faithful, unfaithful, and off-topic responses.

Your RAG system has retrieval, chunking, reranking, and a carefully tuned prompt. It almost certainly has no way to tell you when it starts lying.

You shipped a RAG system three months ago. It has a vector store, a reranker, a well-tuned system prompt, and response streaming so it feels fast. You monitor latency. You log errors. You track token costs. Your on-call dashboard is clean.

What you do not have is any way to know if the answers are faithful to the retrieved context. You have no signal when responses start contradicting your documents. You have no baseline to compare against when you upgrade your embedding model next week. The system is generating answers and you are reading dashboards that tell you nothing about whether those answers are correct.

This is not a niche problem. It is the default state. Every RAG system deployed without evaluation infrastructure is operating on the assumption that it is working. Most of them are wrong about that assumption at least some of the time. Notebook #24 of the LLM Showdown tests which frameworks give you evaluation primitives out of the box - and which ones leave you to build it yourself.

What LangChain 1.x Quietly Removed

Until late 2023, LangChain shipped a dedicated evaluation module. You could call load_evaluator("faithfulness") and get a working LLM-as-judge chain in two lines. It was not perfect, but it existed.

LangChain 1.x removed it. The langchain.evaluation module is gone. The documentation now points teams toward RAGAS, DeepEval, or building their own evaluation chains with LCEL. This is a reasonable architectural choice - LangChain decided to be an orchestration framework, not an evaluation framework. But most teams using LangChain for RAG either do not know this happened or have not gotten around to replacing it.

The result: teams that were relying on LangChain's built-in evaluators are now either running no evaluation at all, or they have added an external dependency (RAGAS, DeepEval) that requires its own setup, its own API key, and its own maintenance burden.

Notebook #24 tests this directly. We give all three frameworks the same task: evaluate three query-context-response triples for faithfulness and relevancy. Here is what happens.

The Three Frameworks, The Same Task

The test setup: three response scenarios with known ground truth.

Faithful: response accurately reflects retrieved context
Unfaithful: response contradicts context with false claims
Off-topic: response ignores context entirely, answers a different question

QUERY:    "How does RAG reduce hallucination?"
CONTEXT:  "RAG grounds responses in retrieved evidence, reducing
           hallucination by anchoring generation to retrieved facts."
RESPONSE: "RAG reduces hallucination by conditioning generation on
           retrieved evidence rather than parametric knowledge alone."

FAITHFULNESS:  0.52  (52% of non-trivial response words in context)
RELEVANCY:     0.33  (33% of query words appear in response)
SCORE:         0.43

QUERY:    "How does RAG reduce hallucination?"
CONTEXT:  [same as above]
RESPONSE: "RAG increases hallucination by 40% according to recent
           studies. Quantum retrieval mechanisms destabilize answers."

FAITHFULNESS:  0.19  (response contradicts context)
RELEVANCY:     0.33
SCORE:         0.26

QUERY:    "How does RAG reduce hallucination?"
CONTEXT:  [same as above]
RESPONSE: "Django and FastAPI are both excellent Python web frameworks
           for building REST APIs."

FAITHFULNESS:  0.00  (zero overlap with context)
RELEVANCY:     0.00  (zero overlap with query)
SCORE:         0.00

A working evaluator should clearly separate these three. The faithful response scores highest. The unfaithful response scores lower. The off-topic response scores zero. Any evaluation framework that cannot make these distinctions is not functional.

The Feature Gap Is Not Close

FEATURE                  SYNAPSEKIT   LANGCHAIN   LLAMAINDEX
----------------------   ----------   ---------   ----------
Faithfulness evaluator   Yes          No          Yes
Relevancy evaluator      Yes          No          Yes
Groundedness/correct.    Yes          No          Yes
Batch eval runner        Yes          No          Yes
Custom metrics           Yes          Yes         Yes
Async evaluation         Yes          Yes         Yes
Regression tracking      Yes          No          No
----------------------   ----------   ---------   ----------
FEATURE SCORE (of 7)     7/7          2/7         6/7

LangChain scores 2 out of 7. Both items it supports (custom metrics and async evaluation) are things you build yourself with LCEL chains. There are no native evaluation primitives. There is no concept of a faithfulness score, a relevancy score, or a batch evaluation runner. You get a general-purpose chain-building toolkit and the evaluation problem is entirely your problem.

LlamaIndex scores 6 out of 7. It ships FaithfulnessEvaluator, RelevancyEvaluator, CorrectnessEvaluator, and a BatchEvalRunner with configurable worker pools. The one missing feature is regression tracking - no mechanism to compare eval snapshots across time.

SynapseKit scores 7 out of 7. The EvaluationPipeline abstraction handles faithfulness, relevancy, and correctness in a single call. EvalSnapshot captures timestamped eval state. EvalRegression computes drift between snapshots. Both regression primitives are unique to SynapseKit in this comparison.

Lines of Code Tell the Same Story

TASK: evaluate faithfulness + relevancy on one response

SYNAPSEKIT (16 lines total):
  imports: 5
  code:    11

LLAMAINDEX (19 lines total):
  imports: 6
  code:    13

LANGCHAIN (21 lines total):
  imports: 2
  code:    19

LangChain requires fewer imports because it is importing a general-purpose chain builder, not evaluation-specific classes. The code itself is longer because you are constructing the evaluation logic manually - writing the prompt template, specifying the output parser, wiring the chain together.

SynapseKit's EvaluationPipeline is the highest-level abstraction. You pass it evaluator instances and a dataset. It handles batching, async execution, and result aggregation. The 16-line count includes error handling and result display.

Why Regression Tracking Is the Feature Most Teams Need

Faithfulness and relevancy scores matter. But the question most teams actually need to answer is not "what is our score today" - it is "did our score change when we deployed the new embedding model?"

Without regression tracking, you run evals before a deployment, write down the numbers, run evals after deployment, write down the numbers again, and compare them manually. This works approximately once. After the third deployment cycle it falls apart because nobody updated the baseline, the test set has changed, and the numbers live in a Notion doc that nobody can find.

EvalSnapshot captures the full eval state: scores, test cases, model version, timestamp. EvalRegression takes two snapshots and computes the delta. You store snapshots. You run regressions as part of your deployment pipeline. You fail the deployment if faithfulness drops more than 5 points. This is the engineering discipline that makes evaluation durable rather than a one-time exercise.

Neither LangChain nor LlamaIndex ship this. Teams using those frameworks either build it themselves (rare) or skip it (common).

What This Means for Engineers

If you are using LangChain for RAG and you have not added RAGAS or DeepEval, you have no evaluation infrastructure. The old langchain.evaluation module is gone. This is not a gap that will be filled by a future LangChain release - it was a deliberate architectural decision.
LlamaIndex is the practical choice for teams that want built-in evaluators without changing their existing LlamaIndex setup. The evaluator objects are well-designed, BatchEvalRunner handles concurrency, and the API is stable. The only gap is regression tracking.
Regression tracking is what separates teams that evaluate from teams that evaluate systematically. Point-in-time scores are better than nothing. Tracked-over-time scores are what you can actually build a deployment gate on.
Heuristic evaluation (no API key required) still separates faithful from unfaithful responses clearly. The faithful response scored 0.43, the unfaithful scored 0.26, the off-topic scored 0.00. You do not need GPT-4-as-judge to know when a response has zero word overlap with the retrieved context.
The evaluation problem is not going away as models improve. Better models hallucinate less on average but with higher confidence. Without evaluation infrastructure, you have no way to catch the cases where a better model is confidently wrong.

The Thing Most Teams Get Wrong

Teams treat evaluation as a pre-launch checklist item. Run evals, check the box, ship. This is worse than useful - it creates false confidence.

Evaluation is useful only when it is continuous. The embedding model you are using today will be deprecated in 12 months. The documents in your vector store will change. The distribution of queries will shift. Each of these changes can degrade faithfulness scores without triggering any of your existing monitors.

A RAG system without continuous evaluation is a system that will degrade silently. You will find out when a user screenshots a bad response and posts it somewhere. The evaluation infrastructure is not the interesting engineering problem, which is why most teams skip it. That is exactly why the teams that do it have a durable advantage.

Three Things Worth Doing This Week

Run a faithfulness check on 20 recent production responses. Use LlamaIndex's FaithfulnessEvaluator or SynapseKit's EvaluationPipeline. See what the scores look like. The result will surprise you.
Define your regression threshold before you need it. Decide now: what faithfulness drop is unacceptable? 5 points? 10? Writing this down before you have a regression is the only way to make the decision rationally rather than defensively.
Instrument your RAG pipeline to log query-context-response triples to a database. You do not need to evaluate all of them. You need a sample. Once the triples are logged, you can run evals on any of them at any time. Without the log, every eval requires manual test case construction.

The notebook is public. All code runs without an API key - the heuristic evaluators use word overlap, not a language model. Fork it, run it against your own responses, and see where you actually stand.

Engineers of AI

Read more: www.engineersofai.com

If this was useful, forward it to one engineer who should be reading it.

AI Letters #31 - Graph Workflows: When Chains Break and DAGs Take Over

Thu, 23 Apr 2026 00:00:00 GMT

A linear chain handles most tasks. Research, generate, done. But production workflows branch. If the query is complex, run a deeper research step. If it is simple, take the fast path. If quality is insufficient, loop back. This requires a graph, not a chain. Notebook #23 of the LLM Showdown tests which frameworks ship graph primitives - and which force you to build infrastructure from scratch.

Interactive Chart

From Chains to Graphs: The Evolution of LLM Orchestration ->

How LLM orchestration evolved from simple prompt chains through LangChain's LCEL to full DAG runtimes with StateGraph. Click each milestone to see what unlocked at each stage.

Interactive Explorer

Graph Feature Explorer ->

Click through each graph feature - conditional edges, parallel branches, cycles, checkpointing, streaming, visualization - and see which frameworks support it natively.

Evidence Dashboard

Full Graph Workflow Evidence Dashboard ->

Lines of code, feature heatmap, API comparison, and code side-by-side - all benchmark data from notebook #23 in one interactive view.

"The difference between a framework with graph primitives and one without is the difference between declaring your workflow and implementing your workflow engine."

A chain is a sequence. Step 1 feeds step 2. Step 2 feeds step 3. No decisions. No branches. No loops. For a simple RAG pipeline - retrieve, augment, generate - a chain is all you need.

Then requirements arrive. Route complex queries to a deep research path and simple queries to a fast path. Retry if the answer confidence is below a threshold. Run web search and database lookup in parallel, then merge results. Pause for human approval before executing a tool call.

Each of these patterns requires a directed acyclic graph (or a cyclic one, for loops). You need nodes, edges, conditional routing, state that persists across steps, and an execution engine that handles branching and merging. The question is whether your framework ships this as a primitive or whether you build it yourself.

Notebook #23 builds the same conditional 3-node workflow in all three frameworks: a research node, a conditional router that branches to either a detailed or quick answer path, and terminal nodes. Same logic, same behavior, different APIs.

The results split cleanly into two tiers.

What We Measured

Each framework implements a conditional pipeline: research -> router -> (detailed answer OR quick answer). The router branches based on query length (a proxy for complexity). We measured four things.

Metric	What it captures
Lines of code	LoC to build the conditional 3-node graph
Feature coverage	7 graph capabilities: StateGraph, conditional edges, parallel branches, cycles, checkpointing, streaming, visualization
API clarity	How readable is the graph definition?
Native support	Does the framework ship graph primitives or require manual Python?

Frameworks: SynapseKit 1.4 (StateGraph), LangChain 1.2 + LangGraph (StateGraph), LlamaIndex Core 0.14 (manual routing)

The Numbers

Lines of code: Conditional 3-node graph

Framework      Imports   Code   Total
----------------------------------------
SynapseKit          1     19      20
LangChain           2     18      20
LlamaIndex          3     12      15

LlamaIndex has the fewest lines. But those 15 lines implement only the happy path - manual if/else routing with no state schema, no checkpointing, no streaming, no visualization. Fewer lines of application code, more lines of infrastructure you will write later.

SynapseKit and LangChain are identical at 20 lines each. The APIs are so similar that porting code from one to the other takes minutes.

The Feature Matrix

This is the real story.

Graph Feature Support (7 features):

Feature               SynapseKit  LangChain  LlamaIndex
---------------------------------------------------------
StateGraph primitive      Yes         Yes         No
Conditional edges         Yes         Yes         No
Parallel branches         Yes         Yes         No
Cycle / loop support      Yes         Yes         No
Built-in checkpointing    Yes         Yes         No
Stream graph events       Yes         Yes         No
Graph visualization       Yes         Yes         No
---------------------------------------------------------
Score                     7/7         7/7         0/7

SynapseKit: 7 out of 7. LangChain: 7 out of 7. LlamaIndex: 0 out of 7.

This is not a close race with a narrow winner. This is a binary split. Two frameworks ship a complete graph runtime. One framework ships nothing.

The API Comparison

The most surprising finding: SynapseKit and LangGraph have nearly identical APIs.

SynapseKit:
  graph = StateGraph(schema)
  graph.add_node('research', research_fn)
  graph.add_conditional_edge('research', router, mapping)
  graph.add_edge('detailed_answer', END)
  app = graph.compile()
  result = app.run_sync(initial_state)

LangGraph:
  graph = StateGraph(State)
  graph.add_node('research', research_fn)
  graph.add_conditional_edges('research', router, mapping)
  graph.add_edge('detailed_answer', END)
  app = graph.compile()
  result = app.invoke(initial_state)

The differences: add_conditional_edge (singular) vs add_conditional_edges (plural). run_sync vs invoke. TypedState(fields={...}) vs TypedDict. That is it. The graph definition pattern is identical.

LlamaIndex:
  research_result = research_fn(query)
  if len(query) > 20:
      result = detailed_fn(research_result)
  else:
      result = quick_fn(research_result)

No graph object. No state schema. No conditional edge declaration. Just Python control flow. This works for the simple case. But when you need to add checkpointing, streaming, parallel branches, or cycle detection, you are building a graph engine, not using one.

The One Meaningful Difference

Where SynapseKit and LangChain diverge is state definition.

LangGraph uses a plain TypedDict:

class State(TypedDict):
    query: str
    result: str

SynapseKit uses TypedState with explicit StateField declarations:

schema = TypedState(fields={
    'query':  StateField(default=''),
    'result': StateField(default=''),
})

For simple last-write-wins state, LangGraph's TypedDict is cleaner and more Pythonic. For parallel branches that merge state - where two nodes independently append to a shared list, for example - SynapseKit's StateField reducers handle the merge logic declaratively. You define how concurrent writes resolve instead of writing merge code.

If your workflows are linear with conditional branches, LangGraph's state model is simpler. If your workflows have parallel fan-out/fan-in patterns, SynapseKit's reducer model prevents merge bugs.

When You Need a Graph

Not every pipeline needs graph primitives. A simple retrieve-augment-generate chain is fine as a chain. Reach for a graph when:

When to use a graph workflow:

Pattern              Example
---------------------------------------------------------
Conditional routing  Route to different models by query
                     complexity or topic domain

Retry loops          Re-run generation if confidence < 0.8,
                     up to 3 times

Parallel branches    Web search + DB lookup simultaneously,
                     merge results before generation

Human-in-the-loop   Pause at review node, wait for
                     approval, resume or reject

Quality gates        Evaluate output against criteria,
                     loop back to improve if insufficient

Multi-step agents    Agent reasons, acts, observes, decides
                     whether to continue or terminate

If none of these patterns apply to your workflow, a chain is simpler, debuggable, and sufficient. Do not adopt graph complexity for linear pipelines.

What This Means for Engineers

SynapseKit and LangChain tie on graph workflows. Both ship a complete StateGraph primitive with 7/7 features. The APIs are nearly identical. If graph workflows are your primary concern, both frameworks are equivalent choices.
LlamaIndex has no graph primitive. Zero out of 7 features. If your workflow requires conditional routing, loops, or parallel branches, you will build the orchestration layer yourself. This is a significant gap for complex pipeline architectures.
LangGraph's TypedDict state is simpler for basic cases. Plain Python TypedDict with no special imports. For last-write-wins state, this is cleaner than SynapseKit's StateField approach.
SynapseKit's StateField reducers win for parallel merging. When two branches write to the same state key concurrently, reducers define how to merge. Without reducers, you write merge logic manually and hope you handle every edge case.
Fewer lines does not mean simpler. LlamaIndex's 15-line implementation has less code but also less capability. The missing 5 lines buy you state schemas, streaming, checkpointing, visualization, and cycle detection - things you will eventually build by hand.

The Thing Most People Miss

Graph workflows are not about replacing chains. They are about making conditional logic declarative instead of imperative.

You can build any graph workflow in raw Python. If/else for routing. While loops for retries. Threading for parallel branches. Dict for state. It works. But the moment you need to debug a failed run at 3am, you want to see the graph structure, replay from a checkpoint, stream events to a dashboard, and visualize where the execution went.

Raw Python gives you none of that. A graph primitive gives you all of it.

The engineer who reaches for a StateGraph is not the one who cannot write if/else statements. They are the one who has debugged enough production workflows to know that the execution infrastructure matters more than the business logic. The business logic is 15 lines. The observability, checkpointing, streaming, and error handling around it is 150 lines. A framework graph primitive absorbs those 150 lines so you write the 15.

SynapseKit and LangChain both understand this. LlamaIndex, for now, does not.

Week 4 continues: cost tracking, guardrails, MCP support, and the final scorecard. The graph benchmark gives both SynapseKit and LangChain a point. The cumulative race holds steady.

Three Things Worth Doing This Week

Audit your pipeline for hidden conditional logic. Search for if/else branches that route between different processing paths. Each one is a candidate for a graph node with a conditional edge. Declare the routing, do not embed it in procedural code.
Add checkpointing to any workflow that takes more than 30 seconds. If a 5-node pipeline fails at node 4, you should resume from node 3, not restart from node 1. Both SynapseKit and LangGraph ship checkpointers. Use them.
Visualize your graph before deploying it. Both SynapseKit (app.get_mermaid()) and LangGraph (app.get_graph().draw_mermaid()) export Mermaid diagrams. Generate the diagram, review the edges, confirm the routing logic matches your intent. A graph you can see is a graph you can debug.

The best workflow architecture is the one where adding a new branch takes one line, not a refactor. Graph primitives make that possible. Raw Python makes it a project.

Engineers of AI

Read more: www.engineersofai.com

If this was useful, forward it to one engineer who should be reading it.

AI Letters #30 - Async Throughput: The Framework Tax on Every Concurrent Request

Wed, 22 Apr 2026 00:00:00 GMT

Every framework says await. Every framework says "production-ready". At one concurrent request, the difference is invisible. At 50 concurrent requests, LangChain's LCEL middleware costs 19.2% of theoretical throughput while SynapseKit loses only 3.2%. Notebook #22 of the LLM Showdown isolates the framework tax on async IO - and the gap is 7x in overhead milliseconds.

Interactive Chart

Async in Python: From Callbacks to Native Coroutines →

The history of async IO in Python - from Twisted's reactor pattern through asyncio, uvloop, and into LLM framework async primitives. Click each milestone to see how async patterns evolved.

Interactive Explorer

Throughput Scaling Explorer →

Drag the concurrency slider from 1 to 50 and watch how each framework's throughput scales. See where LangChain's curve diverges from the theoretical maximum.

Evidence Dashboard

Full Throughput Evidence Dashboard →

Efficiency bars, overhead breakdown, scaling factors, and per-call latency - all benchmark data from notebook #22 in one interactive view.

"The difference between wrapping a sync call in a thread and genuinely non-blocking async IO only shows up under real concurrency. At 50 simultaneous requests, that difference is 19%."

Every LLM framework claims async support. The documentation says await. The examples show ainvoke. The marketing page says "production-ready". And when you run a single request, every framework delivers the same result in approximately the same time. The overhead per call is sub-millisecond. Nobody notices.

Then you deploy to a FastAPI endpoint handling 20 simultaneous users. Or you fire off 50 tool calls in an asyncio.gather batch. And one framework quietly adds 12 milliseconds of overhead per batch while the others add less than 2. At scale, those milliseconds compound into throughput ceilings that are invisible in development and painful in production.

Notebook #22 of the LLM Showdown isolates exactly this. A mock async function with a fixed 50ms sleep - simulating an LLM API call - wrapped in each framework's async primitive. Fire N concurrent requests. Measure total time. A perfect async implementation processes 50 requests in ~50ms. Any extra time is pure framework tax.

The results are not close.

What We Measured

Each framework wraps a mock async function - asyncio.sleep(0.05) - simulating a 50ms LLM API call. We fire N concurrent requests using asyncio.gather and measure total wall-clock time. A perfect async implementation processes N requests in ~50ms regardless of N, because all sleeps run concurrently in the event loop.

Metric	What it captures
Requests/sec	Throughput at 1, 5, 10, 20, 50 concurrent requests
Async efficiency	Actual rps vs theoretical max (% of ideal)
Scaling factor	rps at n=50 / rps at n=1 - perfect async gives 50x
Framework overhead	Milliseconds added per batch beyond raw asyncio

Frameworks: SynapseKit 1.4 (BaseTool.run()), LangChain 1.2 (RunnableLambda.ainvoke()), LlamaIndex Core 0.14 (FunctionTool.acall())

The Numbers

Throughput (requests/sec):

Concurrency   Baseline  SynapseKit  LangChain  LlamaIndex
----------------------------------------------------------
n=1             19.6       19.8        19.4       19.7
n=5             97.8       98.8        96.1       97.3
n=10           194.9      195.7       184.2      193.3
n=20           391.3      388.9       360.5      381.9
n=50           986.6      967.5       808.3      927.2

At n=1, everyone looks the same. The mock call takes ~50ms. Each framework adds sub-millisecond overhead. If this were the only data point, you would conclude that async performance is irrelevant to framework choice.

At n=50, the picture changes. The baseline (raw asyncio.sleep) achieves 986.6 rps - nearly the theoretical maximum of 1000 rps (50 requests / 0.05s). SynapseKit tracks close at 967.5. LlamaIndex at 927.2. LangChain drops to 808.3.

Async efficiency at n=50 concurrent calls:

Framework      rps    overhead   efficiency
--------------------------------------------
Baseline      986.6     0.7ms      98.7%
SynapseKit    967.5     1.7ms      96.8%
LlamaIndex    927.2     3.9ms      92.7%
LangChain     808.3    11.9ms      80.8%

LangChain adds 11.9ms of overhead per batch at 50 concurrent requests. SynapseKit adds 1.7ms. That is a 7x difference in framework-introduced latency.

The Scaling Factor

The cleanest way to read this: how close does each framework get to 50x throughput when you send 50x more concurrent requests?

Scaling factor: rps(n=50) / rps(n=1)
Perfect async = 50x

Framework      rps n=1  rps n=50  scaling  vs perfect
------------------------------------------------------
Baseline         19.6     986.6    50.4x     100.9%
SynapseKit       19.8     967.5    48.9x      97.7%
LlamaIndex       19.7     927.2    47.1x      94.2%
LangChain        19.4     808.3    41.7x      83.5%

SynapseKit: 97.7% of perfect scaling. LlamaIndex: 94.2%. LangChain: 83.5%.

The 16.5% gap between SynapseKit and LangChain at 50 concurrent requests is not a rounding error. It is a consistent pattern across multiple runs (median of 3 repeats, after warmup). Something in LangChain's LCEL ainvoke path does more work per invocation than the other frameworks' async primitives.

Where the Overhead Comes From

This benchmark isolates the framework call path. The mock function is identical - asyncio.sleep(0.05) - so the overhead is entirely in:

Object construction - creating/validating the invocation context
Callback routing - LCEL's pipe chain, middleware, callbacks
Serialization/validation - input/output schema checks

LangChain's LCEL is a composable chain architecture. Every ainvoke passes through the Runnable protocol - input validation, callbacks, tracing hooks, output parsing. This is powerful for composition (chain1 | chain2 | chain3) but adds overhead per invocation. At n=1, the overhead is 0.51ms - invisible. At n=50, the total accumulated overhead is 11.9ms per batch.

SynapseKit's BaseTool.run() is a thin wrapper. Validate the input against the JSON schema, call the function, return the result. No middleware chain, no callback infrastructure. The tradeoff: less composability, less overhead.

LlamaIndex's FunctionTool.acall() falls in between - some validation overhead but no LCEL-style chain traversal.

The Real-World Caveat

This benchmark tests the framework call path under synthetic concurrency. In a production RAG pipeline, the bottleneck is rarely the framework wrapper. It is the retrieval step, the LLM API itself, or the embedding computation.

Production async bottleneck stack:

LLM API call         200-2000ms   <-- actual bottleneck
Embedding call        10-100ms    <-- second bottleneck
Vector DB query        5-50ms     <-- third bottleneck
Framework overhead     1-12ms     <-- what we measured
Python event loop     <0.1ms     <-- irrelevant

The framework overhead matters when:

Batch processing with asyncio.gather: If you fire 100+ concurrent tool calls in a batch, the per-batch overhead compounds. LangChain's 11.9ms at n=50 extrapolates to ~25ms at n=100. SynapseKit's 1.7ms extrapolates to ~3.5ms. Still small in absolute terms - but the ratio stays 7x.
FastAPI endpoints at high QPS: When your server handles 50-100 simultaneous requests, framework overhead becomes a contributor to p99 latency. Not the primary contributor, but a non-trivial one.
Streaming with concurrent tool calls: Agents that call multiple tools in parallel between reasoning steps accumulate framework overhead on every tool invocation cycle.

The framework overhead does NOT matter when:

Your bottleneck is the LLM API (it almost always is)
You're running 1-5 concurrent requests (all frameworks are equivalent)
Your tools are CPU/GPU bound (use asyncio.to_thread, not await)

What This Means for Engineers

At low concurrency, framework async performance is irrelevant. All three frameworks add sub-millisecond overhead at n=1 through n=5. If your application handles fewer than 10 simultaneous requests, async efficiency should not factor into your framework choice.
At high concurrency, LangChain's LCEL overhead becomes measurable. The 11.9ms per-batch overhead at n=50 is not a dealbreaker, but it is a consistent tax. If you are building a high-throughput batch processing pipeline with asyncio.gather, this matters.
SynapseKit's thin async wrapper pays off at scale. 96.8% async efficiency at n=50 - nearly indistinguishable from raw asyncio. The tradeoff is less middleware infrastructure. If you need LCEL-style composability, you pay for it.
LlamaIndex's async path is cleaner than expected. 92.7% efficiency at n=50 is solid. After weeks of ranking third, this is a genuine strength - LlamaIndex's FunctionTool.acall() adds minimal overhead.
Profile your actual bottleneck before optimizing framework overhead. If your LLM API calls take 500ms and your framework adds 2ms, the framework overhead is 0.4% of total latency. Optimize the API call first.

The Thing Most People Miss

Async efficiency is not the same as async correctness.

A framework can achieve 99% async efficiency on a synthetic benchmark and still serialize your real workload if any component in the chain is synchronous. One sync database call in a retriever. One blocking file read in a document loader. One sync HTTP request wrapped in asyncio.to_thread that exhausts the thread pool.

The benchmark above proves that the framework call paths themselves are non-blocking. That is necessary but not sufficient. The production question is whether every component you plug into the framework - retrievers, embedders, tool functions, document loaders - is also genuinely async.

SynapseKit's retriever and tool base classes are async-native. LlamaIndex's retriever base classes are async-native. LangChain's retrievers are inconsistent - some have native _aget_relevant_documents, some fall back to run_in_executor.

The 19.2% throughput loss LangChain shows in this benchmark is the framework's own overhead. In production, if your retriever falls back to run_in_executor, the loss compounds further. The framework tax and the component tax stack.

The engineer who builds the highest-throughput async pipeline will not be the one who picks the framework with the best synthetic benchmark. They will be the one who audits every component in their chain for sync fallbacks and eliminates them. The framework choice sets the floor. The component audit determines the ceiling.

Week 4 continues: graph workflows, cost tracking, guardrails, MCP support. The async result gives SynapseKit another point. The cumulative race tightens.

Three Things Worth Doing This Week

Audit your async chain for sync fallbacks. Open every retriever, tool, and loader in your pipeline. Search for run_in_executor or asyncio.to_thread. Each one is a thread-pool bottleneck masquerading as async code. Replace with native async implementations where they exist.
Run a throughput test on your actual pipeline. Fire 20 concurrent requests at your full pipeline (not just the LLM call). Measure wall-clock time. Compare against 20 sequential requests. If the ratio is less than 15x, something in your chain is serializing. Find it.
Set a p99 latency budget for framework overhead. If your LLM call takes 500ms, your framework overhead budget should be less than 5ms (1%). Measure it with the same technique as notebook #22: wrap a known-latency mock function and compare. If you exceed the budget, simplify the call chain.

The fastest async code is the code that does nothing between your function call and the event loop. Every layer of abstraction between await and the actual IO operation is overhead. Sometimes that abstraction is worth the cost. Sometimes it is not. Measure before you assume.

Engineers of AI

Read more: www.engineersofai.com

If this was useful, forward it to one engineer who should be reading it.

AI Letters #29 - Week 3 Scorecard: Six Agent Benchmarks, Three Frameworks, One Uncomfortable Truth

Tue, 21 Apr 2026 00:00:00 GMT

Six benchmarks. SynapseKit wins 4 on ergonomics. LangChain wins the one you'll hit in production: per-tool error recovery. LlamaIndex scores 7/18 - not a maturity gap, an architectural one. It's a retrieval framework that added agents.

Interactive Chart

Agent Framework History Timeline →

From the original ReAct paper (2022) through LangChain's agent executor, LlamaIndex's agent bolts, and SynapseKit's Crew API. Click each milestone to understand why agent frameworks diverged so dramatically in design philosophy.

Interactive Explorer

6-Dimension Agent Scorecard Explorer →

Click each of the 6 benchmarks - ReAct, Function Calling, Built-in Tools, Multi-Agent, Observability, Error Handling - to see exact scores, code comparisons, and what the winner got right.

Evidence Dashboard

Full 3-Week Cumulative Rankings →

Week 3 bar chart, radar across all 6 dimensions, and cumulative 3-week stacked standings - all benchmark data from notebooks #15–#21 in one view.

The Six Benchmarks

#	Notebook	Dimension	Winner
15	ReAct Agents	LoC + built-in tools + loop control	SynapseKit
16	Function Calling	Schema LoC + multi-format export	SynapseKit
17	Built-in Tools	Tool count + zero-config coverage	SynapseKit
18	Multi-Agent	LoC + orchestration patterns supported	SynapseKit
19	Observability	LoC to enable + local feature depth	3-way tie
20	Error Handling	LoC + built-in error primitives	LangChain

Week 3 Points (max 18):

Framework       #15  #16  #17  #18  #19  #20  Total
----------------------------------------------------
SynapseKit        3    3    3    3    2    2     16
LangChain         2    2    2    2    2    3     13
LlamaIndex        1    1    1    1    2    1      7

SynapseKit: 16. LangChain: 13. LlamaIndex: 7.

What SynapseKit Actually Wins On

The four wins are not flukes. There is a coherent pattern.

ReAct Agents (#15): CalculatorTool and DateTimeTool are built in. You construct an agent with a list of tools and a model - that's the entire setup. LangChain's create_react_agent is clean but requires you to wire the tool list separately from the agent executor. LlamaIndex's ReActAgent matches SynapseKit on line count but ships no built-in calculation or datetime tooling.

Function Calling (#16): Define a function schema once. Call .schema() for OpenAI format. Call .anthropic_schema() for Anthropic format. Same source of truth, zero duplication. LangChain requires StructuredTool plus convert_to_openai_function - two different objects. LlamaIndex requires FunctionTool plus a separate get_parameters_dict() call. Neither provides a single definition that exports to both provider formats.

Built-in Tools (#17): 30 tools. 12 that work with zero configuration - no pip install, no API key, no setup. 9 categories. LangChain ships 17 core tools, most requiring a per-tool pip install and an API key before they'll run. LlamaIndex ships 3 core tool wrappers. This is the widest margin in the entire week: 30 vs 17 vs 3.

Multi-Agent (#18): SynapseKit supports 6 of 6 orchestration patterns - sequential, parallel, supervisor, hierarchical, pipeline, and feedback loop. LangChain supports 5 (LangGraph handles the complex DAG cases well). LlamaIndex supports 3. The Crew + Task(context_from=[...]) pattern in SynapseKit is the most concise way to express inter-agent dependencies across all three frameworks.

The One LangChain Win That Matters

Error handling. LangChain scores 3/3.

ToolException raised inside tool
        ↓
AgentExecutor catches (handle_tool_error=True)
        ↓
Error message becomes LLM Observation
        ↓
LLM reasons: retry / use different tool / report to user

vs.

SynapseKit / LlamaIndex
        ↓
try/except in tool function (manual, every tool)
        ↓
return error string (if you remembered to)
        ↓
no structured recovery loop

ToolException is not just a named exception type. It is a design decision: tool failures are information for the reasoning loop, not crashes to be caught. Raise ToolException("The search API timed out") and the LLM's next observation is that string. It can reason: try a different query, use a fallback tool, tell the user. Five lines including imports. No boilerplate per tool.

LangChain also ships handle_parsing_errors=True - which catches malformed LLM outputs before they crash the agent. This is the failure mode no one talks about until it happens in production: the model returns something that doesn't match the expected ReAct format, the parser throws, the agent is gone. One kwarg prevents it. SynapseKit and LlamaIndex both crash on malformed output without custom handling.

SynapseKit's CircuitState is the stronger primitive for a different failure class - repeated failures at the LLM or network level. But per-tool error handling is where engineers spend most of their production debugging time. LangChain wins that battle.

The Uncomfortable Truth About LlamaIndex

LlamaIndex scored 7 out of 18 possible points in the Agents & Tools week. Third place in 5 of 6 benchmarks. Third in ReAct ergonomics. Third in function calling. Third in multi-agent patterns. Third in error handling. Tied for second in observability only because all three frameworks cover the basics.

This is not a performance gap or a maturity gap. It is an architectural conclusion: LlamaIndex is a retrieval and indexing framework. It added agents. It is not an agent framework that also handles retrieval.

In Week 2 (RAG Pipelines), LlamaIndex came second overall. Its chunking benchmark (#9) was the most detailed of any framework. Its document loading and indexing abstractions are the most mature. VectorStoreIndex, SummaryIndex, KnowledgeGraphIndex - these are not bolt-ons. They are the product.

When your application is 80% retrieval and 20% agent orchestration, LlamaIndex is the correct choice. When the ratio flips, you are fighting the framework's grain.

The 3-Week Cumulative Picture

Framework     Week 1  Week 2  Week 3   Total (21 benchmarks)
------------------------------------------------------------
SynapseKit       15      14      16       45
LangChain         8      10      13       31
LlamaIndex        7      12       7       26

The trend line for LangChain is important. Week 1: 8 points. Week 2: 10. Week 3: 13. The delta between first and second place has shrunk from 7 points to 3 points over three weeks. Week 4 tests production concerns - async throughput, graph workflows, cost tracking, guardrails, MCP support. LangChain's ecosystem depth tends to surface there. The gap may close further.

LlamaIndex's pattern is the mirror image: strong in Week 2 (12 points, retrieval week), weak in Weeks 1 and 3 (7 points each, everything else). A specialist framework trading against generalists.

What This Means for Engineers

If you're building an agent-first application, SynapseKit's batteries-included approach saves real time. 30 built-in tools, concise multi-agent patterns, single function schema definition. The upfront ergonomics advantage compounds over the first month of development.
Add handle_tool_error=True and handle_parsing_errors=True to every LangChain AgentExecutor immediately. These two kwargs are free insurance. Without them, tool exceptions crash the agent and malformed LLM outputs crash the agent. With them, both become recoverable observations. No code changes required.
LangChain's per-tool error recovery is better than writing your own. If you are currently wrapping every tool function in a try/except and returning error strings manually - in any framework - you are doing more work than LangChain's ToolException pattern requires.
Use LlamaIndex specifically when your application is knowledge-graph-heavy or your chunking requirements are sophisticated. SemanticSplitterNodeParser, recursive splitting with boundary detection, KnowledgeGraphIndex - these have no equivalent in SynapseKit or LangChain.
The framework choice is not permanent, but the migration cost is real. Switching from LangChain's AgentExecutor to SynapseKit's Crew mid-project is not a find-and-replace operation. Pick based on what your application's core pattern is.

The Thing Most People Miss

The benchmarks measure ergonomics. Ergonomics predicts developer velocity in the first 90 days. It does not predict the failure modes you encounter in production at month six.

The most common production failure in LLM agents is not a missing built-in tool or a verbose schema definition. It is uncontrolled loops - agents that retry a failing operation until they exhaust either the max_iterations cap or the API rate limit. SynapseKit's CircuitState and LangChain's ToolException both address this, from opposite directions. SynapseKit short-circuits before the LLM sees the failure. LangChain routes the failure through the LLM and hopes it reasons its way out.

Both work for different failure classes. Neither is universal.

The engineer who builds the most reliable production agent will be the one who understands which failures should be invisible to the LLM (circuit-break them) and which failures the LLM should reason about (ToolException them). That judgment call is not in any benchmark. It comes from shipping something, watching it break, and learning the shape of the break.

Week 4 shifts to production: async throughput, graph-based workflows, built-in evaluation, cost tracking, guardrails, MCP support. That is where the ergonomics winner and the production winner may diverge for the first time.

Three Things Worth Doing This Week

Map your application's agent-to-retrieval ratio. Write it down as a fraction. If it's above 60% agents, audit whether your current framework has built-in error primitives. If it's below 40% agents, audit whether your retrieval path uses framework-native indexing or custom code.
Count your framework's built-in tools and test three of them. The tools you're pip-installing and wrapping manually might already be built in. SynapseKit's 12 zero-config tools cover most of what agents need without any setup.
Write a deliberate failure test for your agent. Pick the tool your agent calls most frequently, make it throw an exception, and watch what happens. Does the agent recover? Does it loop? Does it crash? That diagnosis time is the measurement that matters most for production reliability.

Three weeks of benchmarks point to a framework with strong agent ergonomics. Six months of production data will point to something more nuanced. The race is not over.

Engineers of AI

Read more: www.engineersofai.com

If this was useful, forward it to one engineer who should be reading it.

AI Letters #28 - Agent Error Handling: LangChain Wins on Features, But What Does It Actually Catch?

Fri, 17 Apr 2026 00:00:00 GMT

LangChain wins on both dimensions - fewest lines (5) and most built-in error features (6/7). But its ToolException converts failures into LLM observations, making the model your error handler. SynapseKit's CircuitBreaker stops broken services from being hammered. LlamaIndex ships 1/7 features and expects you to bring the rest.

Interactive Timeline

Error Handling History →

From exception hierarchies (1960s) to circuit breakers (2007) to LLM-native error recovery - the lineage behind today's agent resilience patterns.

Interactive Explorer

Three Error Handling Paradigms →

Side-by-side code, error flow diagrams, and feature breakdowns for LangChain, SynapseKit, and LlamaIndex.

Benchmark Results

Full Feature Matrix & LoC Charts →

LoC comparison, 7-feature heatmap, design philosophy cards, and the complementary gap between LangChain and SynapseKit's error coverage.

The Numbers

Lines of error-handling code (imports + error-specific lines):

Framework       Imports  Error lines  Total
------------------------------------------
LangChain             2           3      5
SynapseKit            2           5      7
LlamaIndex            2           6      8

What those lines actually give you (feature depth score):

Feature                         LangChain  SynapseKit  LlamaIndex
-----------------------------------------------------------------
Dedicated exception type          Yes        No          No
Error → LLM observation           Yes        No          No
Handle LLM parse errors           Yes        No          No
LLM fallback chain                Yes        Yes         No
Circuit breaker                   No         Yes         No
Max iterations guard              Yes        Yes         Yes
Custom error handler fn           Yes        No          No

Score (out of 7):                  6          3           1

The score gap is wide. LangChain ships 6/7 error handling features out of the box. LlamaIndex ships 1. That 1 is max_iterations - a last-resort stop, not a recovery mechanism.

The Three Design Philosophies

What happens when a tool throws an exception?

LangChain                SynapseKit               LlamaIndex
──────────────────────   ──────────────────────   ──────────────────────
ToolException raised     try/except in            try/except wrapper
  ↓                        tool.run()               function (manual)
AgentExecutor catches      ↓                         ↓
handle_tool_error=True   return error string       return error string
  ↓                        ↓                         ↓
Error becomes LLM        Check CircuitState        Propagates up
  observation             FallbackChain              (uncaught = crash)
  ↓                       if LLM fails
LLM tries to recover

LangChain turns tool errors into LLM observations. Raise a ToolException inside a tool, set handle_tool_error=True on AgentExecutor, and the exception message becomes a new observation in the agent's thought/action/observation loop. The LLM sees it as: "The tool returned an error: API timeout." It can then reason about it - retry, use a different tool, or tell the user. This is elegant. It's also the source of a subtle failure mode: the LLM will try to reason its way through errors it cannot fix.

SynapseKit handles errors at both layers. Manual try/except in tool.run() for tool-level failures (return a fallback string). FallbackChain for model-level failures - if gpt-4o-mini fails, automatically retry with gpt-3.5-turbo. CircuitState tracks repeated failures and can short-circuit a tool that keeps breaking. Fewer convenience features. More explicit control over what happens when the model itself is the problem.

LlamaIndex provides no built-in error primitives. Max iterations as a last resort. Everything else is a wrapper function you write yourself. FunctionTool.from_defaults(fn=safe_search) where safe_search is just a try/except you added manually. The framework makes no distinction between a tool that errored and a tool that returned normally - both return strings.

What LangChain's `handle_tool_error` Actually Does

This is the mechanism most engineers misunderstand. When you set handle_tool_error=True:

Your tool raises ToolException("Search failed: API timeout")
AgentExecutor catches it
The error message becomes the next Observation in the ReAct loop
The LLM reads: Observation: Search failed: API timeout
The LLM decides what to do next

The LLM is now your error handler. For recoverable errors ("Search failed, try a different query"), this works well. For unrecoverable errors ("Database credentials invalid"), the LLM will loop - trying variations, rephrasing the query, eventually hitting max_iterations. You need both handle_tool_error=True and max_iterations to prevent infinite loops on hard failures.

handle_tool_error can also accept a string (fixed message to the LLM) or a callable (function that takes the exception and returns a message). The callable pattern is the most production-safe: you can inspect the exception type and give the LLM targeted instructions for specific error classes.

What This Means for Engineers

For tool-level failures, LangChain's ToolException is the fastest path. Three lines, immediate recovery loop, no custom code. If your tools are external APIs that occasionally fail, ToolException + handle_tool_error=True gets you working recovery behavior in minutes.
For model-level failures, LangChain gives you .with_fallbacks(). Chain multiple models: primary_llm.with_fallbacks([backup_llm]). This is built-in but not wired into AgentExecutor automatically - you need to apply it at the LLM construction step, not the agent step.
SynapseKit's CircuitBreaker is the only primitive that stops compounding failures. If a tool fails three times in a row, CircuitState can mark it as open and refuse subsequent calls until a timeout passes. No LLM framework besides SynapseKit ships this by default. In production systems that call external APIs, a circuit breaker is the difference between "the agent degraded gracefully" and "the agent hammered a failing endpoint 47 times."
LlamaIndex's 1/7 score is a design choice, not a bug. LlamaIndex's philosophy is composability: you bring your own retry logic, your own circuit breaker, your own fallback chain. The framework won't make assumptions about your error handling policy. For teams with existing resilience infrastructure (Polly, Tenacity, custom retry decorators), this is actually fine - LlamaIndex slots in without conflict.
The absence of LangChain's parse error handling in the others is significant. handle_parsing_errors=True catches malformed LLM outputs - when the model returns something that doesn't match the expected ReAct format. This is common with weaker models or unusual prompts. SynapseKit and LlamaIndex both crash on malformed output. LangChain retries with a parsing error message injected back to the LLM.

The Thing Most People Miss

Error handling in LLM agents is not the same problem as error handling in deterministic software.

In a REST API, an error is a signal: something failed, here's the status code, the client decides what to do. The error is the end of the interaction.

In an LLM agent, an error is an observation: something failed, the model reads the error message, and the model decides what to do next. The error is the beginning of a new reasoning step.

LangChain's design is built for this. ToolException is not a crash - it's a structured message to the reasoning loop. The implication: you need to write error messages for an LLM audience, not a developer audience. "API timeout" is poor. "The search API is temporarily unavailable. You can either retry the same query or answer from your training knowledge." is better. The LLM will use that context to make a better decision.

The circuit breaker fills a gap this reasoning loop cannot. If the search API is down for 30 minutes, no amount of LLM reasoning will fix it. The circuit breaker stops the agent from trying 20 more times before giving up. It's the only error primitive that operates outside the reasoning loop entirely - which is exactly why LangChain doesn't have one. LangChain's model is: route everything through the LLM. SynapseKit's model is: some failures should never reach the LLM.

Three Things Worth Doing This Week

Add handle_parsing_errors=True to every AgentExecutor you have in production. Malformed LLM outputs are silent failures without this. One extra kwarg, zero code changes.
Audit your tool exception messages for LLM readability. If you're using handle_tool_error=True, the error message is going to the model. Rewrite your ToolException strings as instructions: what happened, what the LLM can try instead.
Count how many times each external tool is called in a single agent run. If any tool can be called more than 5 times, you need a circuit breaker or a call cap. Without one, a single stuck agent can exhaust an API quota.

The five-line win is real. What you do with it determines whether errors become recoverable observations or infinite loops.

Engineers of AI

Read more: www.engineersofai.com

If this was useful, forward it to one engineer who should be reading it.

AI Letters #27 - Agent Observability: 3 Lines Gets You In, But What Can You Actually See?

Thu, 16 Apr 2026 00:00:00 GMT

"Three lines to enable tracing in LangChain. Zero lines of latency data when you're done."

Every agent fails eventually. A tool returns nothing. The LLM loops on the same thought. The retrieved documents are all wrong. What separates a two-minute debug from a two-hour one is not how the agent was built - it's how much you can see when it breaks.

Notebook #19 of the LLM Showdown measured one thing: how much can you observe about a running agent without leaving your local environment? No external service. No API key for a tracing platform. No paid tier. Just framework-native observability on the same machine where your code runs.

LangChain enables tracing in the fewest lines. What those lines actually surface is a different question.

Interactive Chart

Observability History Timeline →

From print debugging to distributed tracing to LLM-specific observability. Click each milestone to see how visibility into running systems evolved and what each generation got wrong.

Interactive Explorer

Tracing Design Explorer →

Click each of the 3 tracing approaches - Tracer object, global flags, callback manager - to see exactly what you can observe, how to query it, and what you're missing locally.

Evidence Dashboard

Full Benchmark Results →

LoC stacked chart, feature depth heatmap, and design philosophy comparison - all data from notebook #19 in one view.

The Numbers

Lines of code to enable useful local tracing (no external service, no API key):

Framework       Imports  Enable   Total
-----------------------------------------
LangChain             1       2       3
LlamaIndex            2       2       4
SynapseKit            2       5       7

LangChain wins by a wide margin. set_verbose(True) is one line. Add set_debug(True) for full raw prompt logging. That's it.

What those lines actually surface locally (feature depth score):

Feature                         SynapseKit  LangChain  LlamaIndex
------------------------------------------------------------------
Token usage                     Yes         Partial    Yes
Step latency                    Yes         No         Yes
Intermediate agent steps        Yes         Yes        Yes
Tool call args + returns        Yes         Yes        Yes
Full raw LLM prompt             Yes         Yes        Yes
Retrieved documents             Yes         Yes        Yes
Zero-config enable (1-2 lines)  Yes         Yes        No

Score (out of 7):                 7           5          6

The latency row is where LangChain's 3-line win costs you the most. set_verbose(True) and set_debug(True) print chain I/O, tool calls, and agent reasoning to stdout. They do not record how long any step took. For timing data - how long did the LLM call take, how long did the tool execution take, which step is the bottleneck - LangChain requires LangSmith, which is an external service.

Token usage is similarly partial: verbose mode shows counts in the output, but not in a structured object you can query. For cost tracking per run, again: LangSmith.

The Three Design Philosophies

How does tracing work?

SynapseKit            LangChain             LlamaIndex
──────────────────    ──────────────────    ──────────────────
Explicit object       Global side effect    Injected callback

tracer = Tracer()     set_verbose(True)     handler = LlamaDebug
agent  = Agent(       # all agents now      Settings.callback_manager
  middleware=[tracer])  emit to stdout        = CallbackManager(
result = await          automatically         [handler])
  agent.run(query)
tracer.spans          No object to query    handler.get_event_pairs
  → structured list   → redirect stderr       (CBEventType.LLM)
                        to capture            → typed event list

SynapseKit uses an explicit Tracer object. You pass it into the agent at construction time. After the run, you query tracer.spans to get a structured list of TraceSpan objects - one per event, with duration_ms, metadata, and full payload. This is testable: you can assert on specific spans in a unit test. It's composable: you can pass different tracers to different agents in the same application.

LangChain uses global flags. set_verbose(True) is a global side effect that makes all subsequent LangChain objects emit structured logs to stderr. No object to query. No programmatic access to events after the run. To capture the output you redirect stderr - which is exactly the kind of code you don't want in production. The upside: one line, zero configuration, works immediately on any existing agent.

LlamaIndex uses a callback manager injected via Settings. LlamaDebugHandler is the most sophisticated of the three locally. After a run, you call debug_handler.get_event_pairs(CBEventType.LLM) to get typed event pairs (start + end) for every LLM call. CBEventType.FUNCTION_CALL for tool events. CBEventType.RETRIEVE for retrieval events. The event type enum covers the full taxonomy of what an LLM pipeline does. The downside: 4 lines to set up, and the Settings injection pattern means it affects all agents globally - same problem as LangChain's flags, just more structured.

What LangSmith Actually Solves

LangChain's local observability gap is not an accident. The missing features - step latency, structured cost tracking, run replay - are exactly what LangSmith provides. This is an intentional split: local verbose mode for development debugging, LangSmith for production observability.

LangSmith is free to start (up to 5,000 traces/month). For production systems it becomes a meaningful cost. More importantly, it's an external dependency: your observability now requires internet access, an API key in your environment, and a third-party service to be running. For air-gapped deployments, containerised CI environments, or applications where you can't send LLM prompts to a third party, this is a hard constraint.

SynapseKit and LlamaIndex both give you timing and structured event access locally. That's not because LangChain missed these features - it's because they made a different product decision about where the boundary between framework and platform should be.

What This Means for Engineers

For development debugging, LangChain's set_verbose(True) is genuinely the fastest path. One line, immediate output, zero configuration. If all you need is "show me what the agent is doing", this works.
If you need timing data locally, LangChain is the wrong tool. No step latency without LangSmith. If you're profiling which part of your agent pipeline is slow - LLM call, tool execution, retrieval - you need SynapseKit's TraceSpan.duration_ms or LlamaIndex's event timestamps.
LlamaIndex's CBEventType query API is the most powerful post-run interface. After a run you can ask: how many LLM calls happened? What were the inputs and outputs? Which retrieval queries ran? All typed, all queryable. It's verbose to set up but the richest local interface of the three.
SynapseKit's Tracer is the only one designed for testing. Because it returns a structured object, you can write assertions: assert tracer.spans[2].name == "TOOL_CALL". You can verify that a tool was called with the right arguments. You can check that the token count stayed under a budget. None of this is possible with global flags or Settings injection.
Global state is a production smell. Both set_verbose(True) and Settings.callback_manager are global mutations. In a multi-tenant system, a test suite, or any application where you want different tracing behaviour for different agents, global state is a problem. SynapseKit's explicit middleware pattern is the only one that avoids this.

The Thing Most People Miss

Observability during development and observability in production are different problems.

During development, you want maximum visibility with minimum setup. LangChain's set_verbose(True) wins here. You run the agent, watch the terminal, understand what happened.

In production, you need structured, queryable, per-run data without global side effects. You need latency. You need the ability to replay a specific failing run. You need to assert "this run used fewer than 2,000 tokens" in a regression test. LangChain's local tooling doesn't give you this - LangSmith does, but at the cost of an external dependency.

The frameworks that win on development convenience (global flags, one-line setup) tend to create friction in production (no structured objects, no local timing). The frameworks that win on production correctness (explicit Tracer, typed callbacks) require more setup. This is not a bug in either design. It's the same tradeoff that appears in every layer of software engineering: explicitness versus convenience, always at the cost of the other.

Three Things Worth Doing This Week

Add one timing assertion to your agent test suite. Pick the most critical tool call in your pipeline and assert that it completes under a threshold. If your framework doesn't expose duration, that's the data point you need.
Check whether your tracing uses global state. If you're using set_verbose(True) or Settings.callback_manager in a production environment, document exactly what gets emitted and where. Uncontrolled log output to stderr in a containerised environment is a reliability hazard.
Run an agent that fails intentionally and time how long it takes to diagnose. Inject a tool that throws an exception mid-run. Measure how long it takes to identify: which step failed, what arguments it was called with, and what the LLM thought immediately before the call. That time is your observability gap.

The 3-line setup is the beginning of observability, not the end of it.

Engineers of AI

Read more: www.engineersofai.com

If this was useful, forward it to one engineer who should be reading it.

I Built a Lightweight LLM Framework Because LangChain Frustrated Me - Here's What I Learned

Wed, 15 Apr 2026 00:00:00 GMT

There's a moment every LLM developer knows. You've got a working prototype. It's elegant, fast, and does exactly what you need. Then you try to deploy it. And suddenly you're debugging a chain inside a runnable inside a callback inside an abstraction that didn't exist six months ago.

That moment happened one too many times. So something else got built.

This is the story of SynapseKit - why it exists, what it does differently, and what 18 (and counting) objective benchmarks against LangChain and LlamaIndex actually revealed.

The Problem With "The Standard"

Every developer building LLM-powered applications today reaches for the same toolkit: LangChain or LlamaIndex. They're powerful, well-documented, and have massive communities. They're also, frankly, a pain to work with day-to-day.

Not bad. Just built for different goals.

LangChain's philosophy is maximum flexibility: there's an abstraction for everything, a chain for every use case, and 87 packages you can bolt on. It's impressive engineering. It's also a framework that treats simple tasks like they're distributed systems problems.

LlamaIndex's philosophy is data ingestion depth: best-in-class chunking, indexing, and retrieval. If your application lives and dies by retrieval precision, LlamaIndex is serious software. But you pay for that depth in complexity.

Both are solving real problems. But neither optimises for the thing that matters most when building production LLM systems:

How fast can I go from idea to working code, and how readable is that code six months later?

After the fifth time debugging a LangChain stack trace that pointed three abstraction layers away from the actual code, SynapseKit started getting written.

What Is SynapseKit?

SynapseKit is an async-first Python framework for building RAG pipelines, LLM agents, and multi-agent systems. It ships with:

31 LLM providers - OpenAI, Anthropic, Groq, Mistral, Gemini, Ollama, LMStudio, xAI, Novita, Writer, and 21 more
48 built-in tools - search, math, file I/O, HTTP, code execution, NLP, data analysis, and more
43 document loaders - PDF, EPUB, LaTeX, RTF, TSV, S3, Azure Blob, MongoDB, Dropbox, OneDrive, and more
MCP server support - SSE transport with Bearer auth for Model Context Protocol
Multi-agent primitives - ReActAgent, Crew/CrewAgent/Task, graph-based workflows, recursive subgraphs

pip install "synapsekit[semantic]"

The base install has 2 dependencies. The full semantic install - vector search, all loaders, all tools - pulls in 14 packages. LangChain installs 67. That's not a rounding error; it's a design philosophy.

synapsekit               →  2 deps  |  ~48 MB RAM  |  ~80ms startup
synapsekit[semantic]     → 14 deps  |
langchain                → 67 deps  | ~189 MB RAM  |  ~2.4s startup
llama-index-core         → 43 deps  | ~112 MB RAM  |  ~1.1s startup

The 30-Benchmark Series

Rather than writing a marketing post, a 30-notebook benchmark series was run on Kaggle comparing SynapseKit to LangChain 0.3 and LlamaIndex Core 0.12. One measurable dimension per notebook. Every notebook runs end-to-end on Kaggle free CPU. Results reported honestly - including when SynapseKit loses.

Follow the full series: kaggle.com/discussions/general/688339

Here's everything found so far.

Week 1: Developer Experience

#1 - Cold Start: SynapseKit wins by 30×

The first thing you notice when you import a framework is the wait. For Lambda functions, FastAPI startup, or any process that imports on every cold start, this compounds fast.

import time

t = time.perf_counter()
import synapsekit
print(f"SynapseKit: {time.perf_counter() - t:.3f}s")   # 0.082s

t = time.perf_counter()
import langchain
print(f"LangChain:  {time.perf_counter() - t:.3f}s")   # 2.41s

t = time.perf_counter()
import llama_index
print(f"LlamaIndex: {time.perf_counter() - t:.3f}s")   # 1.08s

SynapseKit: ~80ms. LangChain: ~2.4s. LlamaIndex: ~1.1s.

At 1,000 cold starts per day - realistic for a mid-traffic serverless API - LangChain burns 40 minutes of pure overhead. SynapseKit burns 1.3 minutes. In AWS Lambda terms, that's real money.

#2 - Dependency Count: SynapseKit wins by 33×

Framework	Base install	Full install
SynapseKit	2 packages	14 packages
LlamaIndex Core	43 packages	70+ packages
LangChain	67 packages	100+ packages

Fewer dependencies means faster installs, smaller container images, fewer CVE surface, and less pip freeze archaeology when something breaks.

#3 - Hello RAG: SynapseKit wins (fewest lines)

The same RAG pipeline - load documents, embed, retrieve, answer - across three frameworks:

# SynapseKit: 7 functional lines
from synapsekit import RAGPipeline, LLMConfig
from synapsekit.llm.openai import OpenAILLM

llm      = OpenAILLM(LLMConfig(model="gpt-4o-mini", api_key=KEY))
pipeline = RAGPipeline(llm=llm)
pipeline.add_documents(docs)
answer   = await pipeline.query("What is RAG?")

# LangChain: 14 functional lines
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain import hub

llm         = ChatOpenAI(model="gpt-4o-mini")
embeddings  = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docs, embeddings)
retriever   = vectorstore.as_retriever()
prompt      = hub.pull("rlm/rag-prompt")
chain       = ({"context": retriever, "question": RunnablePassthrough()}
               | prompt | llm | StrOutputParser())
answer      = chain.invoke("What is RAG?")

SynapseKit: 7 lines. LangChain: 14 lines. LlamaIndex: 11 lines.

This isn't code golf. Fewer lines means fewer places for bugs to hide, fewer things for a new team member to learn, and faster iteration. The LangChain version requires knowing what a runnable is, what hub.pull does, and why RunnablePassthrough is needed. The SynapseKit version is self-explanatory.

#4 - Memory Footprint: SynapseKit wins by 4×

Framework	RSS at import
SynapseKit	48 MB
LlamaIndex	112 MB
LangChain	189 MB

At 10 replicas, LangChain costs ~1.4 GB just in framework overhead. SynapseKit costs ~480 MB. For containerised deployments where you're paying per GB of memory, that difference compounds fast.

#5 - Provider Switching: SynapseKit wins (2 lines changed)

One of the most common tasks in LLM development is experimenting across providers. How many lines change when you swap from OpenAI to Groq to Ollama?

# SynapseKit - change 1 import + 1 config line
from synapsekit.llm.openai import OpenAILLM
llm = OpenAILLM(LLMConfig(model="gpt-4o-mini", api_key=OPENAI_KEY))

from synapsekit.llm.groq import GroqLLM
llm = GroqLLM(LLMConfig(model="llama-3-8b-8192", api_key=GROQ_KEY))

from synapsekit.llm.ollama import OllamaLLM
llm = OllamaLLM(LLMConfig(model="llama3"))
# Everything downstream: unchanged.

SynapseKit: 2 lines. LangChain: 4–6 lines. LlamaIndex: 3–4 lines.

31 providers, all following the same LLMConfig pattern. Switching from a paid API to a local model for development takes 10 seconds.

Week 2: RAG Pipelines

#8 - PDF Ingestion: All close

All three frameworks can index a PDF in under 10 lines. This one's effectively a draw - SynapseKit slightly more concise but the gap is small.

#9 - Chunking Strategies: LlamaIndex wins

This is where LlamaIndex genuinely excels.

LlamaIndex ships 9+ built-in splitters including SentenceWindowNodeParser (adds surrounding context sentences to each chunk) and HierarchicalNodeParser (creates parent-child chunk trees for better retrieval). These are sophisticated, research-backed strategies that meaningfully improve retrieval quality.

SynapseKit and LangChain both offer token-based and sentence-based splitting - adequate for most use cases, but not at LlamaIndex's depth.

If your application's quality depends on smart chunking, LlamaIndex is the right choice for the retrieval layer.

#10 - Built-in BM25: SynapseKit wins

BM25 is the backbone of lexical search and an essential half of any hybrid retrieval system. In SynapseKit, it's a core dependency - no extra install.

# SynapseKit - BM25 built in, zero extra pip
from synapsekit.retrievers import BM25Retriever

retriever = BM25Retriever(documents)
results   = retriever.retrieve("machine learning transformers", k=5)

LangChain requires pip install rank-bm25 and additional wiring. LlamaIndex similarly requires an extra install. For a technique this fundamental to production RAG, burying it behind an extra install is a friction tax.

#11 - Hybrid Search (RRF Fusion): LangChain wins

Reciprocal Rank Fusion blends BM25 lexical scores and semantic embedding scores into a single ranked list - typically outperforming either alone by 5–15% on BEIR benchmarks.

LangChain's EnsembleRetriever is the cleanest API for this. SynapseKit supports hybrid retrieval but requires more manual wiring at present. Honest finding: LangChain wins this one.

#12 - Streaming RAG: Effectively a draw (async ergonomics: SynapseKit)

All three frameworks achieve sub-millisecond TTFT in a mock environment. The real differences are at the API layer, not the framework layer. But the streaming API ergonomics differ:

# SynapseKit - stream tokens as they arrive
async for token in llm.stream("Explain transformers in simple terms"):
    print(token, end="", flush=True)

LangChain requires astream() on runnables. LlamaIndex requires a StreamingResponse wrapper. Small differences, but they accumulate across a codebase.

#13 - Conversation Memory: SynapseKit wins (clarity)

Framework	API	Trimming strategy
SynapseKit	`ConversationMemory(window=3)`	Turn-count sliding window
LangChain	`InMemoryChatMessageHistory`	Manual - stores everything, you trim
LlamaIndex	`ChatMemoryBuffer.from_defaults(token_limit=500)`	Token-budget trimming

SynapseKit's window= parameter is the most beginner-friendly. LlamaIndex's token-budget approach is the most robust for production - especially when dealing with long tool outputs that blow up turn-count estimates.

Week 3: Agents & Tools

#15 - ReAct Agents: SynapseKit wins (3 lines vs 11)

# SynapseKit: 3 lines to a working ReAct agent
from synapsekit import ReActAgent
from synapsekit.tools import CalculatorTool, DateTimeTool

agent  = ReActAgent(llm=llm, tools=[CalculatorTool(), DateTimeTool()], max_iterations=10)
result = await agent.run("What is 847 × 23, and what day is it today?")

SynapseKit: 3 lines. LangChain: 11 lines (requires create_react_agent + AgentExecutor + a prompt template from LangSmith hub). LlamaIndex: 9 lines.

#16 - Function Calling: SynapseKit wins (multi-provider schemas)

SynapseKit's BaseTool generates both OpenAI-format and Anthropic-format schemas from a single tool definition. Write a tool once, use it with any provider:

class WeatherTool(BaseTool):
    name        = "get_weather"
    description = "Get the current weather for a city."
    parameters  = {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
        },
        "required": ["city"],
    }

    async def run(self, city: str) -> str:
        return f"Sunny, 22°C in {city}"

tool = WeatherTool()
tool.schema()             # → OpenAI tools format
tool.anthropic_schema()   # → Anthropic tool_use format

One tool definition. Zero vendor lock-in. Switch your LLM provider and your tools come with you.

#17 - Built-in Tool Libraries: SynapseKit wins by a wide margin

Framework	Built-in tools	Zero-config (no API key needed)
SynapseKit	48 across 9 categories	12
LangChain	~17 core + community	Most need extra installs
LlamaIndex	3 core wrappers	3

SynapseKit's 9 tool categories - 48 tools ready to drop into any agent:

Category	Tools
Search	WebSearchTool, WikipediaTool, NewsSearchTool
Math	CalculatorTool, StatisticsCalculatorTool, UnitConverterTool
Date/Time	DateTimeTool, TimezoneConverterTool, CalendarTool
Text Processing	TextSummarizerTool, TextTranslatorTool, KeywordExtractorTool
File I/O	FileReaderTool, FileWriterTool, CSVReaderTool, JSONParserTool
HTTP	HTTPRequestTool, APIClientTool
Code Execution	PythonREPLTool, ShellCommandTool
Data Analysis	DataFrameAnalyzerTool, ChartGeneratorTool
NLP	SentimentAnalysisTool, NamedEntityRecognitionTool

With LangChain, getting a working tool usually means installing a community package, finding an API key, and reading a separate doc page. With SynapseKit, 12 tools work with zero configuration.

#18 - Multi-Agent Orchestration: SynapseKit wins (fewest lines + most patterns)

from synapsekit import Crew, CrewAgent, Task

researcher = CrewAgent(
    name="researcher", role="Research Analyst",
    goal="Produce structured bullet points.", llm=llm
)
writer = CrewAgent(
    name="writer", role="Content Writer",
    goal="Turn bullet points into a polished paragraph.", llm=llm
)

tasks = [
    Task(description=f"Research: {TOPIC}", agent="researcher",
         expected_output="3–5 bullet points"),
    Task(description="Write a paragraph from the research.", agent="writer",
         context_from=["researcher"], expected_output="One paragraph"),
]

crew   = Crew(agents=[researcher, writer], tasks=tasks, process="sequential")
result = await crew.run()

The context_from= parameter is the key insight: tasks declare their data dependencies declaratively. The framework handles execution order and context passing.

Orchestration pattern support:

Pattern	SynapseKit	LangChain	LlamaIndex
Sequential	✅	✅	✅
Parallel	✅	✅	❌
Supervisor	✅	✅	❌
Handoff chain	✅	❌ (manual)	✅
Graph / DAG	✅	✅ (LangGraph)	❌
Shared state	✅	✅	✅
Score	6/6	5/6	3/6

LangChain's LangGraph is genuinely excellent for complex conditional workflows - if you need a state machine with branching logic, it's the right tool. SynapseKit's graph support handles the majority of production patterns with less ceremony.

Cumulative Scorecard (18 notebooks in)

Framework	Points	Category wins
SynapseKit	38	12 - cold start, dependencies, LoC, memory, provider switching, BM25, streaming ergonomics, memory clarity, ReAct agents, function calling, tools, multi-agent
LangChain	22	3 - hybrid search RRF, LangGraph flexibility, error UX
LlamaIndex	18	2 - chunking depth, token-budget memory

SynapseKit leads on developer ergonomics and batteries-included tooling. LangChain leads on complex graph orchestration. LlamaIndex leads on retrieval precision.

Architecture: What Makes SynapseKit Different

1. Async by default - not retrofitted

SynapseKit was designed async from the ground up. Every run(), every query(), every tool call returns a coroutine.

# Concurrent queries - not sequential
results = await asyncio.gather(
    pipeline.query("What is the capital of France?"),
    pipeline.query("Explain backpropagation in 2 sentences."),
    pipeline.query("Summarise the attached PDF."),
)

In LangChain, async is available but not the default. Many features exist only in sync form and async was added later. The difference is subtle in a tutorial, significant in a production API.

2. Shallow call stack - your errors, not ours

When pipeline.query() breaks in LangChain, your traceback travels through Runnable, RunnableSequence, CallbackManager, BaseChain, and surfaces somewhere deep in the framework. You spend 10 minutes decoding the stack trace before you can begin debugging.

In SynapseKit, the call path is intentionally shallow. When something breaks, the traceback points at your code. No hidden middleware, no callback chains, no runnable wrappers unless you explicitly add them.

3. Unified tool interface - one definition, every provider

class BaseTool:
    name: str
    description: str
    parameters: dict  # JSON Schema

    async def run(self, **kwargs) -> str: ...
    def schema(self) -> dict: ...           # OpenAI tools format
    def anthropic_schema(self) -> dict: ... # Anthropic tool_use format

Write a tool once. It works with GPT-4o, Claude 3.5, Llama 3 on Groq, Gemini - any of the 31 supported providers. No adapter layer, no per-provider tool registration.

4. Task-centric multi-agent - separate what from who

SynapseKit's Crew model separates what to do (Task) from who does it (Agent). Tasks declare their dependencies via context_from. The framework handles execution order, context accumulation, and result passing.

Wiring data flow manually between agents is the source of most multi-agent bugs. When Agent B needs Agent A's output, you shouldn't write the plumbing; you should declare the dependency.

5. 43 loaders - data ingestion without hunting for packages

Production RAG applications ingest data from everywhere. SynapseKit ships 43 loaders:

Documents: PDF, EPUB, LaTeX, RTF, DOCX, Markdown, HTML
Data: CSV, TSV, JSON, XML, SQLite
Cloud: S3, Azure Blob, OneDrive, Dropbox
Databases: MongoDB, PostgreSQL
Config: .env, YAML, TOML
Web: sitemap crawlers, URL loaders, RSS feeds
Code: Python, JavaScript, TypeScript source files

One consistent Loader.load() → List[Document] interface. Every loader returns the same type. Your downstream pipeline code never changes regardless of where the data comes from.

6. MCP Server support - Model Context Protocol built in

from synapsekit.mcp import MCPServer

server = MCPServer(name="my-tools", tools=[WeatherTool(), CalculatorTool()])
await server.run_sse(host="0.0.0.0", port=8080, bearer_token="secret")

Expose any tool as a production MCP endpoint in 3 lines. Compatible with any MCP-compliant client.

The Honest Take: When to Use Each

SynapseKit was built for a specific set of problems. It's not the right choice for every use case.

Use SynapseKit when:

You're building a greenfield LLM app and want the fastest path to production
Your app is async-first - APIs, webhooks, real-time applications, serverless
You need a small footprint - containers, Lambda, edge runtimes
You want batteries included without hunting for extra packages
Your pipeline uses standard patterns: ReAct agents, Crew orchestration, RAG, streaming
You're experimenting across providers and need painless switching
You want readable code that a new team member can understand without framework training

Use LangChain when:

You need complex conditional graph workflows - LangGraph is genuinely excellent at stateful, branching agentic pipelines
You need a specific integration from LangChain's 150+ partner ecosystem
Your team already knows LangChain deeply and migration cost outweighs gains
You need LangSmith observability deeply integrated into your debugging workflow

Use LlamaIndex when:

Advanced chunking is central to your application quality (SentenceWindow, Hierarchical - there's nothing equivalent in SynapseKit today)
You're building a knowledge-intensive system where retrieval precision is the primary metric
You want LLM-native evaluation metrics (faithfulness, relevance, groundedness) built into the framework

What's Coming in the Benchmark Series

The series continues through Notebooks #19–#30:

#19 - Observability & Tracing: What can you actually see when your agent runs?
#20 - Agent Error Handling: What happens when a tool throws an exception mid-loop?
#21 - Week 3 Scorecard: Agents & tools final rankings
#22 - Async Throughput: Requests/second under real concurrency
#23 - Graph Workflows: DAG pipelines for complex conditional flows
#24 - LLM Evaluation: Built-in faithfulness and relevance metrics
#25 - Cost Tracking: Token counting and spend visibility
#26 - Guardrails: Content filtering and output validation
#27 - MCP Support: Model Context Protocol in practice
#28 - Week 4 Scorecard
#29–#30 - Final Verdict: Which framework wins, for whom, and why

Follow the series on Kaggle

Quick Start

# Minimal install - 2 dependencies
pip install synapsekit

# Full install - vector search, all loaders, all tools
pip install "synapsekit[semantic]"

# Your first RAG pipeline in 7 lines
from synapsekit import RAGPipeline, LLMConfig
from synapsekit.llm.openai import OpenAILLM
from synapsekit.loaders import PDFLoader

llm      = OpenAILLM(LLMConfig(model="gpt-4o-mini", api_key="sk-..."))
docs     = PDFLoader("research.pdf").load()
pipeline = RAGPipeline(llm=llm)
pipeline.add_documents(docs)

answer = await pipeline.query("What are the main findings?")
print(answer)

# Your first multi-agent crew in 10 lines
from synapsekit import Crew, CrewAgent, Task
from synapsekit.llm.groq import GroqLLM

llm        = GroqLLM(LLMConfig(model="llama-3-8b-8192", api_key="gsk-..."))
researcher = CrewAgent(name="researcher", role="Research Analyst", llm=llm)
writer     = CrewAgent(name="writer", role="Writer", llm=llm)
tasks      = [
    Task(description="Research quantum computing trends", agent="researcher"),
    Task(description="Write a blog intro", agent="writer", context_from=["researcher"]),
]
result = await Crew(agents=[researcher, writer], tasks=tasks).run()

Links:

GitHub: github.com/SynapseKit/SynapseKit
Docs: synapsekit.github.io/synapsekit-docs
Kaggle benchmark series: kaggle.com/discussions/general/688339

Every benchmark is reproducible. Fork any notebook and run it on Kaggle free CPU. If the results differ in your environment, open an issue.

Engineers of AI

Read more: www.engineersofai.com

AI Letters #26 - Multi-Agent Orchestration: 16 vs 19 vs 23 Lines (And Three Completely Different Mental Models)

Wed, 15 Apr 2026 00:00:00 GMT

"Three frameworks, three different answers to the same question: who decides when one agent hands work to the next?"

A single agent with tools handles most tasks. But some workflows need specialisation - a researcher producing facts, a writer turning facts into prose, a reviewer checking the output. That chain of specialised agents is where the frameworks stop converging and start showing what they actually believe about software design.

Notebook #18 of the LLM Showdown measured the same 2-agent sequential pipeline across SynapseKit, LangChain (via LangGraph), and LlamaIndex. Researcher feeds Writer. Both call an LLM. The orchestrator wires them together. Simple enough that you can count the lines. Complex enough that the design philosophy underneath becomes visible.

The LoC numbers tell part of the story. The orchestration pattern matrix tells the rest.

Interactive Chart

Multi-Agent History Timeline →

From FIPA agent standards to LangGraph and CrewAI. Click each milestone to see how the orchestration model evolved and what design tradeoffs each generation made.

Interactive Explorer

Orchestration Pattern Explorer →

Click any of 6 orchestration patterns - sequential, parallel, supervisor, handoff, graph, shared state - to see which frameworks support it natively and what the code looks like.

Evidence Dashboard

Full Benchmark Results →

Stacked LoC chart, orchestration pattern heatmap, and design philosophy comparison - all data from notebook #18 in one view.

What the Numbers Say

The benchmark task was identical across all three: wire a Researcher agent and a Writer agent in sequence. Researcher gets a topic, produces bullet points. Writer receives those bullet points, produces a paragraph.

Lines of code - imports + setup to a working 2-agent pipeline:

Framework       Imports  Functional   Total
--------------------------------------------
SynapseKit            3          13      16
LlamaIndex            3          16      19
LangChain             4          19      23

SynapseKit wins on LoC. The gap between SynapseKit (16) and LangChain (23) looks large but read the next section before drawing conclusions.

Orchestration patterns supported:

Pattern                SynapseKit  LangChain  LlamaIndex
---------------------------------------------------------
Sequential             Yes         Yes        Yes
Parallel               Yes         Yes        No
Supervisor             Yes         Yes        No
Handoff chain          Yes         No         Yes
Graph / DAG            Yes         Yes        No
Shared state           Yes         Yes        Yes

Score (out of 6):        6           5          3

SynapseKit and LangChain are nearly tied. LlamaIndex trails significantly - its AgentWorkflow supports sequential handoffs and shared state, but no parallel execution and no supervisor routing.

The Three Mental Models

This is the part that matters more than LoC.

Who controls the handoff?

SynapseKit               LangChain (LangGraph)    LlamaIndex
──────────────────        ────────────────────     ──────────────────
Framework                 You (graph edges)        The LLM

Task-centric:             Graph-centric:           Agent-centric:
define WHAT each          define HOW data          agents decide WHEN
agent should do           flows between nodes      to pass the baton

crew.run()                app.invoke(state)        workflow.run()
executes the              executes the             lets the LLM
task sequence             graph                    improvise

SynapseKit is task-centric. You define what each agent should produce (expected_output) and what context it needs (context_from). The framework manages the sequencing. You don't write the routing logic - you declare the dependency graph and let the Crew executor handle it.

LangChain (LangGraph) is graph-centric. You define nodes (functions) and edges (transitions). The LLM is just a function inside a node - it has no special status. This means the orchestration logic is entirely under your control. Want to add a conditional branch that routes to a fact-checker if confidence is low? That's one add_conditional_edges call. Want to loop back to the researcher if the writer rejects the output? Same. LangGraph doesn't care what's inside each node.

LlamaIndex is agent-centric. Agents decide when to hand off via tool calls. The AgentWorkflow sets up which agents can hand to whom (can_handoff_to), then runs the root agent and lets the LLM drive. The orchestration is emergent - which means it's also less predictable. If the researcher agent decides not to call handoff_to_writer, the workflow stalls.

What the LoC Gap Actually Costs

LangChain's 23 lines include 4 lines of TypedDict state definition, 2 function definitions with LLM calls, and 6 lines of graph wiring. None of that is boilerplate you can skip in a real pipeline - the TypedDict is your contract between nodes, the functions are your agent logic, the graph wiring is your orchestration.

SynapseKit's 16 lines hide that complexity inside the framework. CrewAgent, Task, and Crew are opinionated abstractions. The question isn't whether the code is shorter - it is. The question is what you lose when the abstraction doesn't fit your use case.

Custom tool cost from the previous benchmark (#25): SynapseKit requires subclassing BaseTool. LangChain requires a decorator. If you're building a pipeline where the agents need tools the framework doesn't provide, that cost repeats for every tool.

The Parallel and Supervisor Gap

LlamaIndex's 3/6 pattern score is the number that should influence framework choice.

If your multi-agent system ever needs to run two agents simultaneously - a web-searcher and a database-queryer both working on different subtasks, then merging results - LlamaIndex requires you to build that yourself. AgentWorkflow executes agents in sequence via handoffs. There is no built-in parallel branch.

Supervisor routing is similar. If you need a routing agent that decides which specialist to call based on query type, you're writing that logic yourself on LlamaIndex. SynapseKit ships SupervisorAgent(llm, workers). LangChain gives you a supervisor node pattern in LangGraph.

For simple sequential pipelines, LlamaIndex's limitation doesn't matter. For anything with conditional branching, parallel execution, or dynamic routing, the 3/6 score is a constraint you'll hit.

What This Means for Engineers

SynapseKit's Crew is the fastest path for linear pipelines. Researcher → Writer → Reviewer in sequence, with context passing? 16 lines, one crew.run() call. If that's the pattern, use it.
LangGraph's graph-centric model is not verbosity - it's explicitness. Every edge in your multi-agent graph is a line of code you wrote. That means every routing decision is auditable, testable, and reproducible. When the pipeline behaves unexpectedly, you read the graph.
LlamaIndex's emergent handoff is a bet on the LLM. The agent decides when to pass work to the next agent. That's elegant when it works. When the LLM misses the handoff signal or calls it at the wrong point in the task, you're debugging LLM behaviour rather than framework behaviour. Plan for it.
Parallel execution is not a nice-to-have. Any pipeline that can decompose work across independent agents - and most real workflows can - benefits from parallel execution. The latency difference between sequential and parallel runs compounds as agent count grows.
The custom tool cost from #25 still applies here. Multi-agent pipelines need agents with tools. The LoC advantage SynapseKit holds on agent setup shrinks once you're writing custom tools that don't fit their BaseTool subclass pattern.

The Thing Most People Miss

The LoC benchmarks consistently show SynapseKit winning on setup conciseness. This is real. It is also the least important property of a multi-agent system in production.

What matters in production:

Can you inspect the state between agents?
Can you replay a failed run from a specific node?
Can you test individual agents in isolation?
Can you add a conditional branch without rewriting the pipeline?

LangGraph answers all four yes. SynapseKit answers the first two partially - return_intermediate_steps isn't built into Crew the same way it is in AgentExecutor. LlamaIndex answers all four with varying difficulty.

The framework that wins the LoC race is the one you spend the least time setting up. The framework that wins the production race is the one you spend the least time debugging. Those are different frameworks, and the benchmark is measuring the wrong thing if you're building something that runs for more than a sprint.

Three Things Worth Doing This Week

Map your current multi-agent system to the pattern matrix. Which of the six patterns does it actually use? If the answer is only "sequential" and "shared state", LlamaIndex's 3/6 is irrelevant to you.
Build one conditional branch into an existing sequential pipeline. Take any two-step agent pipeline and add a condition: "if output confidence is low, loop back". That's where LangGraph's graph-centric model pays for its verbosity.
Check whether your handoffs are deterministic. If your agents hand off via LLM tool calls (LlamaIndex model), run the same pipeline five times and check whether the handoff happens at the same point each time. If it doesn't, you have a reliability problem you may not have noticed yet.

The LoC race is over by the second week of production. The debuggability race never ends.

Engineers of AI

Read more: www.engineersofai.com

If this was useful, forward it to one engineer who should be reading it.

SynapseKit - A Production-Grade LLM Framework Built for Speed, Simplicity, and Scale

Wed, 15 Apr 2026 00:00:00 GMT

SynapseKit is an async-first Python framework for building LLM applications - chains, agents, RAG pipelines, tool calling, and multi-agent orchestration. Two base dependencies. 48 built-in tools. 31 LLM providers. Designed for engineers who need production-grade tooling without production-grade complexity.

"The right abstraction disappears. You stop thinking about the framework and start thinking about the problem."

What SynapseKit Is

SynapseKit is an open-source Python framework for building applications powered by large language models. It covers the full surface area - from a single LLM call to multi-agent orchestration with cost guardrails - with a design philosophy that prioritizes speed, debuggability, and minimal abstraction.

The core principle: every layer of abstraction must earn its place by making the engineer faster, not by making the framework more flexible.

What ships in the box:

31 LLM providers - OpenAI, Anthropic, Google, Mistral, Cohere, Ollama, and 25 more. Switch providers by changing one string.
48 built-in tools - 12 work with zero configuration. No pip install, no API key, no setup.
43 document loaders - PDF, HTML, CSV, JSON, Markdown, DOCX, and more. Standardized interface across all formats.
Multi-agent primitives - Sequential, parallel, supervisor, hierarchical, pipeline, and feedback loop patterns. All six supported out of the box.
MCP server support - Model Context Protocol integration for tool-rich agent deployments.
Cost guardrails - Built into the execution engine. Set a budget, the agent stops cleanly instead of burning your API credits.

Design Philosophy

Two Dependencies

SynapseKit's base install pulls two packages. Not 67. Not 43. Two.

SynapseKit:  2 dependencies  · 48 MB RAM  · 80ms cold start
LangChain:  67 dependencies  · 189 MB RAM · 2,400ms cold start
LlamaIndex: 43 dependencies  · 112 MB RAM · 1,100ms cold start

Fewer dependencies means fewer version conflicts, faster installs, smaller container images, and cold starts that don't punish your users. In serverless deployments where every scale-from-zero event pays the cold start tax, 80ms vs 2.4 seconds is the difference between responsive and broken.

Async From the Ground Up

Every base class - BaseTool, BaseRetriever, BaseLLM - is async def by default. Not sync with an async wrapper bolted on. Not run_in_executor hiding a blocking call.

This matters because async correctness propagates. When the base class is async, every implementation is async. Contributors don't accidentally write sync tools. The framework never silently dispatches to a thread pool. At 50 concurrent requests, SynapseKit achieves 96.8% of theoretical throughput - near-baseline async efficiency.

Shallow Call Stacks

When something fails at 3am in production, the traceback is 8 lines, not 47. The agent loop is 47 lines of readable Python. No RunnableSequence.__call__ chains, no middleware dispatch, no callback manager traversal. You read the error, you find the bug, you fix it.

One Tool Interface

Define a tool once with a JSON schema. Export to OpenAI format with .schema(). Export to Anthropic format with .anthropic_schema(). Same source of truth, zero duplication. One definition that works across all 31 providers.

What You Can Build

RAG Pipelines

from synapsekit import LLM, RAGPipeline, PDFLoader

docs = PDFLoader("reports/").load()
rag = RAGPipeline(docs=docs, llm=LLM("openai/gpt-4o"))
rag.build()

answer = await rag.query("What were Q3 revenue figures?")

Seven lines. Load, build, query. Chunking, embedding, indexing, retrieval, and generation - all handled. Switch to Anthropic by changing "openai/gpt-4o" to "anthropic/claude-sonnet-4-20250514". Nothing else changes.

Agents with Tools

Built-in tools for calculation, datetime, web search, file operations, and more. Define custom tools with a class and a JSON schema. The agent loop handles reasoning, tool selection, execution, and observation routing.

Multi-Agent Orchestration

The Crew and Task primitives support six orchestration patterns. Declare dependencies between tasks, not between agents. The framework handles execution order, context passing, and result aggregation.

from synapsekit import Crew, Task, Agent

researcher = Agent(name="researcher", tools=[search_tool])
writer = Agent(name="writer", tools=[])

research_task = Task(agent=researcher, description="Find latest data on X")
write_task = Task(agent=writer, description="Write report", context_from=[research_task])

crew = Crew(agents=[researcher, writer], tasks=[research_task, write_task])
result = await crew.run()

Streaming

async for token in llm.stream("Explain quantum computing"):
    print(token, end="")

First-class streaming with the cleanest API across any framework. No callback handlers, no special configuration.

Where SynapseKit Fits

SynapseKit is built for a specific engineer: the one building LLM-powered products that need to work reliably in production, not just in a notebook demo.

Use SynapseKit when:

You need fast cold starts (serverless, edge, CLI tools)
You want minimal dependency footprint in containerized deployments
You're building agent-heavy applications with multiple tools
You need to switch between LLM providers without rewriting code
You want cost controls built into the execution layer

Consider alternatives when:

You need LlamaIndex's advanced chunking strategies (SemanticSplitterNodeParser, KnowledgeGraphIndex)
You need LangChain's ecosystem breadth and community integrations
You need LangChain's ToolException error recovery pattern for complex agent loops

We publish these tradeoffs openly. The 30-notebook LLM Framework Showdown on Kaggle benchmarks SynapseKit against LangChain and LlamaIndex across 18 production dimensions - including the dimensions where SynapseKit loses. Honest benchmarking means publishing the uncomfortable numbers too.

The Vision

LLM frameworks today are where web frameworks were in 2010. Too many abstractions solving for flexibility instead of velocity. Too much ceremony for simple operations. Too many dependencies for production deployments.

SynapseKit is a bet on a different direction: that the best framework is the one that disappears. You think about your application logic, not about the framework's internal architecture. You debug your code, not the framework's middleware. You deploy with confidence because you understand every line between your function call and the LLM API.

The roadmap:

Evaluation harness - standardized benchmarks you can run against your own agents
Visual debugger - trace agent execution, tool calls, and token usage in real time
Plugin marketplace - community tools and integrations with a single install command
Enterprise features - audit logging, role-based access, deployment presets for AWS/GCP/Azure

SynapseKit is MIT-licensed, fully open source, and built in the open. Every design decision is documented. Every benchmark is reproducible. Every line of code is readable.

Get Started

pip install synapsekit

GitHub: github.com/SynapseKit/SynapseKit
Benchmarks: LLM Framework Showdown on Kaggle
Documentation: Ships with the package

Two dependencies. One pip install. Start building.

Engineers of AI

Read more: www.engineersofai.com

If this was useful, forward it to one engineer who should be reading it.

AI Letters #25 - The Built-in Tool Race: 30 vs 29 vs 12 (And Why the Headline Number Lies)

Mon, 13 Apr 2026 00:00:00 GMT

"Both SynapseKit and LangChain claim roughly 30 built-in tools. The difference is whether 'built-in' means 'works on install' or 'works after twelve more pip installs'."

Every LLM framework advertises its tool ecosystem. The numbers look impressive in the docs. Then you try to actually use them and discover that half of them require a separate pip install, a third require an API key, and a handful only work on specific operating systems.

Notebook #17 of the LLM Showdown did the audit nobody does in the benchmarks: count only what actually ships in the base install, then split by what works with zero configuration versus what needs extra setup. The headline totals are almost identical - 30, 29, 12. The zero-config counts are not.

Interactive Chart

Tool Ecosystem Timeline →

From the @tool decorator to batteries-included frameworks. Click each milestone to see how each design philosophy evolved and what it costs today.

Interactive Explorer

Tool Category Explorer →

Click any of 9 capability categories to see which frameworks cover it - and whether the tools work immediately or need extra pip installs and API keys.

Evidence Dashboard

Full Benchmark Results →

Total tools stacked, zero-config breakdown, category heatmap - all data from notebook #17 in one view.

What the Numbers Actually Mean

The benchmark defines built-in strictly: only tools included when you run pip install framework. Third-party integrations requiring a separate pip install per tool are counted separately.

Total built-in tools:

Framework       Core tools   Extra-pip tools   Total
-----------------------------------------------------
SynapseKit              30                 0      30
LangChain               17                12      29
LlamaIndex               3                 9      12

SynapseKit and LangChain are nearly tied on total. But LangChain's 12 community tools each require a separate install - pip install duckduckgo-search, pip install slack-sdk, pip install arxiv - before they do anything. SynapseKit's 30 ship as implementations, not wrappers. LlamaIndex has 3 core tool types (FunctionTool, QueryEngineTool, RetrieverTool) and 9 hub packages, all requiring pip install llama-index-tools-*.

Zero-config tools (no API key, no extra pip install):

Framework       Zero-config tools
----------------------------------
SynapseKit                     12
LangChain                      10
LlamaIndex                      3

This is the number that matters for prototyping speed. SynapseKit gives you calculator, datetime, regex, file I/O, Python REPL, shell, HTTP requests, web scraping, and human input - zero additional setup. LangChain gives you file management and shell tools, plus the @tool decorator pattern itself. LlamaIndex gives you the three wrapper types and nothing else that runs without extra installs.

Category coverage:

Framework       Categories covered
------------------------------------
SynapseKit                       9
LangChain                        9
LlamaIndex                       5

SynapseKit and LangChain both cover 9 distinct capability areas. LlamaIndex covers 5 - and its coverage is mostly retrieval-oriented, which matches its RAG-first design.

The Design Philosophy Underneath the Numbers

Tool philosophy - what "built-in" actually means

SynapseKit          LangChain           LlamaIndex
──────────────      ──────────────      ──────────────
Implementations     Thin wrappers       Primitives only
ship in package     delegate to         tools are app-
                    third-party libs    level concerns

pip install X       pip install X       pip install X
-> tool works       + pip install Y     + pip install
                    per community tool  llama-index-
                                       tools-* per tool

30 tools ready      17 tools ready      3 types ready
12 need nothing     10 need nothing     3 need nothing

LangChain's approach is deliberate. Thin wrappers mean the framework doesn't own the dependency - the underlying library (DuckDuckGo, Slack, arXiv) handles updates, auth, rate limiting. The wrapper just shapes it into the tool interface. The cost: an extra pip install every time you want a new capability, and occasional version conflicts between the wrapper and the underlying library.

SynapseKit's approach means more to maintain internally - when the DuckDuckGo API changes, SynapseKit's implementation breaks, not a third-party wrapper. The benefit: pip install synapsekit and you have a working web search tool.

LlamaIndex made a different bet entirely: tools are application concerns, not framework concerns. You build what you need with FunctionTool. The hub packages exist for common cases but they're optional additions, not the core product.

What This Means for Engineers

For prototyping and hackathons: SynapseKit's 12 zero-config tools mean you can build a working agent that does web scraping, file I/O, Python execution, and HTTP calls before you've set up a single API key. That's a real time advantage in time-constrained settings.
The LangChain community tool count is misleading. When comparing frameworks, don't count community wrappers the same as core tools. A wrapper that requires 3 extra pip installs and an API key is not in the same category as a tool that works immediately.
LlamaIndex's 3 core tools are not a weakness - they're a constraint. The framework explicitly doesn't try to solve the tool problem. If you're already using LlamaIndex for retrieval, your query engines and retrievers become first-class tools. Everything else you write yourself with FunctionTool.
Multimodal is where the gap is largest. SynapseKit ships ImageAnalysisTool, SpeechToTextTool, and TextToSpeechTool in base. LangChain's multimodal tools (OpenAITextToSpeechTool, OpenAIWhisperParser) require the OpenAI package and API key. LlamaIndex has nothing multimodal in core.
The "thin wrapper" model has long-term benefits. LangChain's community tools don't go stale the same way SynapseKit's implementations might - the underlying library handles the API. For production systems running for years, that matters.

The Thing Most People Miss

The zero-config number is a proxy for something deeper: how much cognitive overhead does the framework impose before you can test an idea? Twelve pip installs plus API keys means twelve things to track, debug, and version-pin. Three zero-config tools means three things.

This matters most at the beginning of a project, when you're still figuring out whether your agent architecture is viable. If the framework makes you spend an hour on setup before you can test your first tool call, you're optimising the wrong variable.

SynapseKit wins on setup speed. LangChain wins on long-term maintainability of individual tools. LlamaIndex wins when your tools are retrieval pipelines you were building anyway.

Three Things Worth Doing This Week

Audit your current agent's tool imports. Count how many separate pip installs they require. If it's more than 5, ask whether that's accidental complexity or intentional.
Test your critical tools with no internet access. Zero-config tools that only need local resources are more reliable in production than tools that call external APIs for every invocation.
Read the tool source, not just the docs. For any LangChain community tool you use, find the underlying library it wraps. That library's changelog is more relevant to your upgrade path than LangChain's.

The built-in tool count is marketing. The zero-config count is engineering. Know which number you're optimising for.

Engineers of AI

Read more: www.engineersofai.com

If this was useful, forward it to one engineer who should be reading it.

AI Letters #24 - ReAct Agents: Six Lines vs Nineteen (And What You Lose in Between)

Sun, 12 Apr 2026 00:00:00 GMT

"Six lines to build a working ReAct agent sounds like a win. It is - until your agent starts looping and you have no idea why."

The ReAct loop is the first pattern every engineer reaches for when they need an agent. Thought, Action, Observation. Repeat until done. It's elegant on paper. In production it breaks in exactly the ways you'd expect: infinite loops, wrong tool selection, hallucinated tool calls that return nothing useful.

The question isn't whether ReAct agents work. It's whether your framework lets you see inside the loop when things go wrong.

Notebook #15 of the LLM Showdown measured three things: lines of code to build a working ReAct agent with two tools, the built-in tool inventory available without writing any tool code, and loop control parameters exposed to the caller. SynapseKit wins on LoC. LangChain wins on observability. LlamaIndex sits in the middle on both. The numbers are not the story. The tradeoff they reveal is.

Interactive Chart

ReAct Adoption Timeline →

From the 2022 Princeton paper to three competing framework implementations. Click each milestone to see what each framework prioritized and what it traded away.

Interactive Explorer

ReAct Loop Explorer →

Select a framework and step through Thought → Action → Observation to see exactly what each exposes at each step. Includes live code samples for all three.

Evidence Dashboard

Full Benchmark Results →

LoC stacked charts, built-in tool inventory, loop control heatmap, and custom tool cost - all benchmark data from notebook #15 in one view.

What ReAct Actually Requires

A minimal working ReAct agent needs four things: an LLM, at least one tool with a schema, a prompt that formats Thought/Action/Observation, and a loop that parses the model's output and dispatches tool calls. Getting all four wired together is where the frameworks diverge.

The benchmark task was identical across all three: define a calculator tool and a datetime tool, build a ReAct agent, run one query that requires at least one tool call.

The Evidence

Lines of code - imports + setup to a working agent:

Framework       Imports  Functional   Total
--------------------------------------------
SynapseKit            3           3       6
LlamaIndex            3          10      13
LangChain             5          14      19

SynapseKit gets to 6 lines because CalculatorTool and DateTimeTool are shipped in the library. You import them like any other class. There is no tool-definition code because there is nothing to define.

LangChain's 19 lines include two @tool-decorated functions - that's 10 lines of the gap right there. Strip those and LangChain's agent setup is 9 lines. The decorator approach is not verbose; it's complete. The tool code is what you'd write in any framework.

LlamaIndex at 13 lines uses FunctionTool.from_defaults() - plain Python functions wrapped into tool objects. Slightly more explicit than LangChain's decorator, slightly less so than SynapseKit's class hierarchy.

Custom tool definition - what it costs when built-ins don't cover your use case:

SynapseKit    6 lines  (subclass BaseTool, implement async run())
LangChain     5 lines  (@tool decorator on any annotated function)
LlamaIndex    5 lines  (plain function + FunctionTool.from_defaults())

SynapseKit's advantage evaporates here. The moment you need a tool that isn't in their library, you're writing more code than the alternatives, not less. The subclass pattern is also more rigid - you're tied to their async interface, their error handling convention, their schema format.

Built-in tool inventory (no tool code required):

Framework        Built-in tools
--------------------------------
SynapseKit                   18
LangChain                    15
LlamaIndex                    9

SynapseKit leads: web scraping, arxiv, PubMed, SQL, shell, Python REPL, translation, sentiment - all importable. LangChain has 15 but many require third-party API keys (Tavily, Brave, Google). LlamaIndex's 9 are mostly retrieval-oriented, which makes sense given its RAG-first heritage.

Loop control parameters exposed to the caller:

Parameter                SynapseKit  LangChain  LlamaIndex
-----------------------------------------------------------
max_iterations           Yes         Yes        Yes
early stop               Yes         Yes        Yes
handle_parsing_error     Yes         Yes        Yes
verbose                  No          Yes        Yes
return_intermediate_steps No          Yes        Yes
async support            Yes         Yes        Yes

Score (out of 6):          4           6          6

This is the number that matters in production.

The Contrast

ReAct Loop - What You Can Observe

SynapseKit                    LangChain / LlamaIndex
──────────────────────        ──────────────────────────────
[Thought]                     [Thought]  <- verbose logs
     |                              |
[Action]                      [Action]   <- intermediate steps
     |                              |
[Observation]                 [Observation] <- response.sources
     |                              |
[Answer]                      [Answer]
  ^ opaque                      ^ full trace available

SynapseKit's loop runs. You get the final answer. What happened in between - which tools were called, in what order, with what arguments, what they returned - is not surfaced by default. There is no verbose=True. There is no return_intermediate_steps. If the agent gives you a wrong answer, your debugging path is: re-run with print statements you've injected manually, or read source code.

LangChain gives you return_intermediate_steps=True on AgentExecutor. Every thought, every tool call, every observation is accessible in the response object. LlamaIndex surfaces the same through response.sources. This is not a nice-to-have. It is the difference between an agent you can ship and an agent you can't explain.

What This Means for Engineers

The 6-line number is real but context-dependent. If your use case fits SynapseKit's 18 built-in tools, you genuinely write less code. If it doesn't, you write more.
Observability is not optional in production. The first time a ReAct agent gives a customer a wrong answer, you will need to reconstruct exactly what it thought and did. SynapseKit makes that hard by default.
LangChain's verbosity is load-bearing. return_intermediate_steps, verbose, handle_parsing_errors - these aren't academic features. They are the handles you grab during an incident.
LlamaIndex at 13 lines is the quiet winner. FunctionTool is clean. response.sources gives you the trace. The tool count (9 built-in) is lower, but the RAG-tool integration is first-class. If you're already using LlamaIndex for retrieval, adding agents costs almost nothing structurally.
The custom tool cost comparison exposes the real architecture. SynapseKit's BaseTool subclass is not burdensome at 6 lines - but it is a commitment. LangChain's @tool decorator composes with any Python function you already wrote. The closer your existing codebase is to plain Python, the more that matters.

The Thing Most People Miss

The benchmark measured the cost to build a ReAct agent. It didn't measure the cost to debug one. Debugging cost scales with agent complexity, agent usage, and how long the loop runs. A 6-line setup that produces an opaque loop will cost you more time over a quarter than a 19-line setup with full observability - assuming the agent actually runs in production. Most of them do, eventually.

The frameworks that win on setup lines tend to lose on debuggability. This is not a coincidence. It is the fundamental tradeoff in API design: the more you hide, the less you write. The more you expose, the more you can see.

Three Things Worth Doing This Week

Check your current agent setup for return_intermediate_steps or equivalent. If you can't reconstruct the last 10 agent traces from your logs, you don't have production observability yet.
Audit your tool definitions. If they are tightly coupled to a framework's base class, write one clean Python function that does the same thing. Keep framework-agnostic logic separate from framework integration.
Run notebook #15 yourself against your own framework of choice: github.com/engineersofai/llm-showdown. The task is simple enough to replicate in 20 minutes. The loop control gaps show up immediately.

The conciseness race is worth running. Just know what you're trading away when you win it.

Engineers of AI

Read more: www.engineersofai.com

If this was useful, forward it to one engineer who should be reading it.

AI Letters #23 - The RAG Scorecard: Six Benchmarks, Three Frameworks, One Clear Pattern

Fri, 10 Apr 2026 00:00:00 GMT

"Batteries-included beats fully-composable on conciseness every time. Fully-composable beats batteries-included on control every time. You just have to know which problem you're solving."

Six notebooks. Six benchmarks. Three frameworks measured on the same RAG workloads, back to back, reproducible on Kaggle.

Week 1 of the LLM Showdown covered setup overhead: environment spin-up, indexing speed, basic retrieval, reranking, evaluation harnesses, and the Week 1 scorecard. SynapseKit won that one 15–7–8 (SK–LC–LI).

Week 2 went deeper into the RAG stack: PDF ingestion, chunking strategies, BM25 availability, hybrid search RRF, streaming time-to-first-token, and conversation memory. Same methodology. 3-2-1 points for rank 1-2-3 across each benchmark, ties split.

The results are not a surprise if you've been paying attention. But the magnitude of the gap on some dimensions is.

Interactive Chart

Two-Week Journey →

All 14 notebooks, cumulative standings after each week, with notebook-by-notebook breakdown of winners and key findings.

Interactive Explorer

Benchmark Explorer →

Click each of the 6 benchmarks to explore raw values, methodology, and what the result actually means in production.

Evidence Dashboard

Full Scorecard Dashboard →

Complete points heatmap, stacked benchmark breakdown, raw values, and two-week cumulative standings in one view.

Here is what the data says.

What the Scorecard Shows

Week 2 final points:

Framework     #8   #9   #10  #11  #12  #13   Total
──────────────────────────────────────────────────
SynapseKit   3.0  1.0  3.0  2.0  3.0  3.0   15.0
LlamaIndex   2.0  3.0  1.5  1.0  2.0  2.0   11.5
LangChain    1.0  2.0  1.5  3.0  1.0  1.0    9.5

SynapseKit wins 4 of 6 benchmarks. LangChain wins 1. LlamaIndex wins 1. Same pattern as Week 1, except LlamaIndex and LangChain swap second and third depending on the dimension.

Two-week cumulative:

SynapseKit:  15 (W1) + 15 (W2) = 30
LlamaIndex:   8 (W1) + 11.5 (W2) = 19.5
LangChain:    7 (W1) + 9.5 (W2)  = 16.5

The Evidence - Benchmark by Benchmark

#8 - RAG from PDF (lines of code)

SynapseKit loads a PDF into a retrieval pipeline in 7 lines. LangChain needs 13. LlamaIndex needs 11. The LangChain number is not lazy code - it requires a PyPDFLoader, a RecursiveCharacterTextSplitter, a vector store, and a retriever. Each is a separate abstraction. SynapseKit wraps all of that into one RAGPipeline(pdf="...") call.

Winner: SynapseKit. Margin: nearly 2x.

#9 - Chunking Strategies (built-in splitter count)

LlamaIndex wins this cleanly: 9 built-in splitters vs LangChain's 7 vs SynapseKit's 4. The two that matter are SentenceWindowNodeParser (retrieves surrounding sentences, not just the matched chunk) and HierarchicalNodeParser (builds a tree of chunks at different granularities). Neither exists in SynapseKit or LangChain. If your retrieval quality depends on chunk context, LlamaIndex is the right tool.

Winner: LlamaIndex. Not close.

#10 - Built-in BM25 (extra packages required)

SynapseKit bundles rank_bm25 as a core dependency. LangChain and LlamaIndex both require you to install an extra package (rank-bm25 and llama-index-retrievers-bm25 respectively) before BM25 is available. Zero vs one extra pip install. It sounds trivial. At deployment time in a locked environment, it is not.

Winner: SynapseKit.

#11 - Hybrid Search RRF (configurability score)

LangChain wins this one, and it deserves to. EnsembleRetriever accepts an arbitrary list of retrievers and per-retriever weights. You can combine three different retrievers with custom weighting in a single constructor call. LlamaIndex's hybrid search has no weight control - it applies RRF with fixed parameters. SynapseKit sits in the middle: two-retriever support, fixed alpha weighting.

Winner: LangChain. Score: 5/5 vs SK's 4/5 vs LI's 3/5.

#12 - Streaming TTFT (median framework overhead, ms)

SynapseKit:   0.001 ms
LlamaIndex:   0.184 ms
LangChain:    0.236 ms

All three are sub-millisecond. SynapseKit's async generator adds the least overhead. But read the caveat in the takeaway section - this benchmark's winner does not matter in production.

Winner: SynapseKit. Winner that matters: nobody.

#13 - Conversation Memory (lines of code to add memory)

SynapseKit: 4 lines. LlamaIndex: 6 lines. LangChain: 12 lines.

LangChain's RunnableWithMessageHistory requires a store object, a getter function, a session ID, and LCEL wiring before the history is injected. SynapseKit exposes it as one constructor parameter: memory=True. The gap is 3x.

Winner: SynapseKit.

What This Means for Engineers

SynapseKit wins on conciseness in every dimension where conciseness is the metric. PDF loading, BM25, memory wiring - all 3-4x fewer lines. If you are prototyping or building internal tooling where developer velocity matters more than edge-case flexibility, this is the path.
LangChain wins when you need fine-grained control over retrieval composition. Hybrid search with custom weights across three retrievers is a real use case - recommendation engines, multi-index RAG, domain-specific blending. EnsembleRetriever handles this; SynapseKit's fixed alpha does not.
LlamaIndex wins when chunking quality is the bottleneck. If you're working with long technical documents, legal text, or anything where retrieved chunk context matters, SentenceWindowNodeParser and HierarchicalNodeParser are not features - they are the reason to use LlamaIndex.
The TTFT result is noise. Sub-millisecond framework overhead against a real LLM API that adds 300–2000ms of network latency. Do not let this benchmark influence your framework choice.
Week 3 is where it gets interesting. Agents, tool calling, multi-agent orchestration - this is where the architectures diverge most sharply. SynapseKit's agent layer is newer. LangChain's is battle-tested. LlamaIndex's is designed for data-heavy agentic workflows. The conciseness advantage SynapseKit holds in RAG may not hold in agents.

The Corollary Most People Miss

The benchmarks SynapseKit loses are the ones that reveal its design tradeoff. Fewer splitters means less chunking flexibility. Fixed hybrid search alpha means less retrieval control. No persistent memory backends (yet) means you own the storage problem.

SynapseKit is fast to write. It is not yet flexible to extend.

LangChain is slow to write. It is extremely flexible to extend - the entire LCEL composability model exists precisely to let you plug in arbitrary steps without rewriting the framework.

Neither is wrong. They are optimised for different constraints. The mistake is reaching for LangChain's full composability when you are building a standard RAG pipeline that SynapseKit already handles in 7 lines. The inverse mistake is reaching for SynapseKit when you need custom retrieval logic that requires LangChain's EnsembleRetriever.

Know which problem you have before you pick the tool.

Three Things Worth Doing This Week

Run the notebooks yourself - all 6 are reproducible on Kaggle CPU. Fork LLM Showdown #8 through #13. Swap in your own documents and LLM endpoint. The numbers in your environment may differ from ours.
Audit your chunking strategy. Most RAG implementations use RecursiveCharacterTextSplitter with default chunk size because it is the default. Check if SentenceWindowNodeParser or a sliding window approach would improve your retrieval precision. Run a quick eval on 20 representative queries before assuming it does not matter.
Profile your own framework overhead end-to-end. Not the TTFT micro-benchmark we ran - the full round trip: query → retrieve → generate → first token to your user. That number is what your users experience. Framework choice is usually not in the top three factors.

Week 3 covers agents. ReAct loops, function calling, tool libraries, multi-agent coordination, tracing, and error handling. The scorecard will look different. Check back.

Engineers of AI

Read more: www.engineersofai.com

If this was useful, forward it to one engineer who should be reading it.

AI Letters #22 - Conversation Memory in RAG: One Param vs Forty Lines of Boilerplate

Thu, 09 Apr 2026 00:00:00 GMT

RAG gives the model context from documents. Memory gives it context from the conversation. Without both, your chatbot doesn't know what it just said.

Every RAG system eventually faces the same question: what happens on the second turn? The user asks a follow-up. "What did you mean by that?" "Can you give me an example?" "How does that compare to what you said earlier?" Without memory, the model treats each question as the first. Context from the previous turn is gone. The answer it gives to the follow-up is either wrong, generic, or disconnected from what came before.

Conversation memory is the fix. A buffer of past exchanges gets prepended to the retrieved context and injected into the prompt. The model now has the document context and the conversation context. It can use both. The question is how much it costs to add this to your pipeline - and what happens when the conversation gets long enough that you have to start dropping old messages.

We wired identical multi-turn memory into RAG pipelines across SynapseKit 1.4, LangChain 1.2, and LlamaIndex Core 0.14. Same conversation, same task, same question: how many lines does it take to add memory, and what happens at the edge cases? The LoC gap is the widest of any benchmark in this series. The persistence and window-strategy differences are what will matter in your production system.

Interactive Chart

LoC Across All 12 Benchmarks →

Cumulative lines of code per framework from hello world to conversation memory. See where each framework has built its lead.

Code Explorer

Memory Pipeline Code by Framework →

Full multi-turn memory RAG code side by side - one param vs session stores vs token buffers, annotated for each framework.

Data

Message Retention vs Window Size + Feature Matrix →

How many messages each framework retains at different window sizes, plus the full memory API feature comparison across all three.

What We Measured

Task: Build a multi-turn RAG pipeline with conversation memory. Run 5 turns of questions. Measure lines of code, window strategy, and message retention at different window sizes.

Metric	What it captures
Lines of code	Code to add multi-turn memory to an existing RAG pipeline
Window strategy	How old messages get dropped - turn count vs token limit
Message retention	Messages kept after 5 turns at window sizes 1, 2, 3, 5

Frameworks: SynapseKit 1.4, LangChain 1.2, LlamaIndex Core 0.14. Kaggle CPU. Disclosure: I'm the author of SynapseKit. All code is on Kaggle - fork and run yourself.

The Code

SynapseKit - 1 constructor argument:

from synapsekit import RAG

rag = RAG(model="gpt-4o-mini", api_key=KEY, memory_window=5)
await rag.add_documents(DOCS)
r1 = await rag.ask("What is RAG?")
r2 = await rag.ask("How does it improve accuracy?")
r3 = await rag.ask("Which retrieval method is fastest?")

Memory is a single parameter on the RAG constructor. memory_window=5 keeps the last 5 turns. Every subsequent .ask() call automatically prepends the conversation history to the retrieved context. Zero additional setup. The tradeoff: in-memory only, no persistence across sessions.

LangChain - session store + getter + LCEL wiring:

from langchain_community.retrievers import BM25Retriever
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_core.chat_history import InMemoryChatMessageHistory

store = {}
def get_session_history(session_id: str) -> InMemoryChatMessageHistory:
    if session_id not in store:
        store[session_id] = InMemoryChatMessageHistory()
    return store[session_id]

retriever = BM25Retriever.from_texts(DOCS, k=3)
prompt = ChatPromptTemplate.from_messages([
    ("system", "Context: {ctx}"),
    MessagesPlaceholder("history"),
    ("human", "{question}"),
])
chain = ({"ctx": retriever, "question": RunnablePassthrough()} | prompt | ChatOpenAI())
chain_with_history = RunnableWithMessageHistory(
    chain, get_session_history,
    input_messages_key="question", history_messages_key="history"
)
r1 = chain_with_history.invoke({"question": "What is RAG?"}, config={"configurable": {"session_id": "s1"}})

RunnableWithMessageHistory is the canonical LangChain pattern. You define a session store (here in-memory, but can be Redis/DynamoDB/Postgres), a getter function, and wire it around your chain. Twelve lines before a single question is asked. The payoff: swap InMemoryChatMessageHistory for RedisChatMessageHistory and you have persistent multi-user memory with no other changes.

LlamaIndex - token-budget buffer on the chat engine:

from llama_index.core import Document, VectorStoreIndex, Settings
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(model="gpt-4o-mini")
index  = VectorStoreIndex.from_documents([Document(text=d) for d in DOCS])
memory = ChatMemoryBuffer.from_defaults(token_limit=1500)
engine = index.as_chat_engine(memory=memory, chat_mode="context")
r1 = engine.chat("What is RAG?")
r2 = engine.chat("How does it improve accuracy?")
r3 = engine.chat("Which retrieval method is fastest?")

ChatMemoryBuffer takes a token_limit instead of a turn count. The engine drops old messages when the buffer exceeds the limit. Clean API - comparable conciseness to SynapseKit at the chat engine level. Can serialize to SimpleChatStore for lightweight persistence.

The Numbers

Framework    Imports   Functional   Total
──────────────────────────────────────────
SynapseKit       1           5         6
LlamaIndex       3           6         9
LangChain        5          12        17

This is the widest LoC gap in the series. LangChain's session store pattern adds 5 lines of boilerplate before the chain is even built - the getter function, the store dict, and the RunnableWithMessageHistory wrapper. That boilerplate is the price of flexibility. You get pluggable backends. SynapseKit gives you the same result in one argument, but you're locked to in-memory.

Window Strategy: The Detail That Matters

Both frameworks drop old messages when the window fills up. The question is what "window" means:

Framework     Strategy              Reasoning unit   Control
──────────────────────────────────────────────────────────────
SynapseKit    Sliding window        Turns            memory_window=N
LangChain     Store all; trim       Turns (manual)   slice last N*2
LlamaIndex    Token budget          Tokens           token_limit=N

Turn-count windows (SynapseKit, LangChain) are easy to reason about: "keep the last 3 exchanges." The problem is that turns vary wildly in length. A 3-turn window might be 200 tokens or 2,000 tokens depending on the conversation. At scale, that variance creates unpredictable prompt sizes.

Token-limit windows (LlamaIndex) are harder to reason about - "keep 1,500 tokens of history" doesn't tell you how many turns that is. But they're more predictable in terms of prompt size, which is what actually matters for LLM API cost and latency. You know exactly how much context you're sending.

Message retention after 5 turns at different window sizes:

Window    SynapseKit   LangChain   LlamaIndex
─────────────────────────────────────────────
w=1          2 msg       2 msg      ~2 msg
w=2          4 msg       4 msg      ~4 msg
w=3          6 msg       6 msg      ~6 msg
w=5         10 msg      10 msg     10 msg

At equivalent settings, all three retain the same number of messages. The difference surfaces when conversations are long and token-dense - LlamaIndex starts dropping earlier than a turn-count window of the same number.

Persistence: Where They Truly Split

Feature	SynapseKit	LangChain	LlamaIndex
In-memory	Yes	Yes	Yes
Redis	No	Yes	No
DynamoDB	No	Yes	No
Postgres	No	Yes	No
JSON file	No	Yes	Yes (SimpleChatStore)
Custom backend	No	Yes	Partial
`clear()`	Yes	Yes	Yes
Format to string	Yes	Yes	Yes

LangChain's persistence ecosystem is the clear winner. Swap one import and your session store moves from in-memory to Redis. This is the critical path for any multi-user production app - users expect their conversation to persist across sessions, across devices, across server restarts.

SynapseKit's in-memory limitation is the one place where its simplicity becomes a real constraint. For a single-user, single-session chatbot, it's fine. For a production app with multiple users, you'll either fork the memory implementation or migrate to LangChain for this layer.

What This Means for Engineers

Don't build your own memory layer. All three frameworks provide one. Rolling your own conversation buffer means reinventing trimming logic, format conversion, and history injection - work that's already done for you.
Choose turn-count windows for simple apps, token-budget windows for production. Turn count is easy to explain to stakeholders. Token budget is what keeps your API costs predictable at scale. If you're serving real users, measure the token distribution of your turns before deciding.
LangChain's RunnableWithMessageHistory is boilerplate, but it's good boilerplate. The session getter pattern decouples your chain from the storage backend. When you move to Redis in production, you change one line. That's worth 7 extra lines at setup time.
LlamaIndex's chat_engine is the fastest path to a working multi-turn RAG demo. Two lines - memory and engine. If you're building a prototype or an internal tool where persistence doesn't matter, this is the fastest start.
Memory and RAG interact in ways that will surprise you. When the retrieved context changes and the memory context contradicts it, the model has to reconcile them. This creates subtle failures - confident-sounding answers that combine stale memory context with fresh document context incorrectly. Test multi-turn RAG with contradictory document updates before shipping.

The Corollary Most People Miss

The memory problem compounds. A single-turn RAG pipeline has one context window to manage: the retrieved documents. A multi-turn RAG pipeline has two: the documents and the conversation history. They compete for the same token budget.

Most teams add memory and don't adjust their retrieval budget. The result: the total context grows until it hits the model's context limit and something gets truncated - usually silently. The retrieved documents get cut first because they're appended after the history. The model starts answering from memory rather than documents. Retrieval quality degrades. Nobody notices because the answers still sound coherent.

The fix is explicit: set max_tokens_for_context = total_budget - memory_tokens - system_prompt_tokens and cap your retriever's top_k accordingly. None of the three frameworks do this automatically.

Context budget allocation (simplified):
────────────────────────────────────────────────
Total context window         128,000 tokens
System prompt                ~500 tokens
Conversation memory          ~2,000 tokens  (10 turns × ~200 tokens/turn)
Retrieved documents          ~4,000 tokens  (top-5 chunks × ~800 tokens)
LLM response budget          ~2,000 tokens
────────────────────────────────────────────────
Remaining buffer             119,500 tokens

Do the maths before you hit the limit, not after.

Three Things Worth Doing This Week

Add memory_window or token_limit to your RAG pipeline today. If you're building a chat interface on top of RAG and not passing history into the prompt, every follow-up question is being answered in isolation. That's a worse user experience than a basic chatbot.
Measure your average conversation length in tokens. Pull a sample of real conversations, tokenize them, and see what percentile hits 1,500 tokens. That's your token_limit starting point. A turn-count window of 5 in a technical conversation can hit 3,000 tokens easily.
Read the Kaggle notebook. Full code, retention tables at different window sizes, and the live demo: LLM Showdown #13 - Conversation Memory in RAG

Memory is the difference between a search engine with an LLM frontend and an actual conversational AI. The frameworks all provide it. The split is in how they drop old messages and whether they persist across sessions. One approach gives you a single argument and no persistence. One gives you a token budget and lightweight JSON persistence. One gives you full production backends at the cost of boilerplate. Pick the one that matches where your app needs to be in six months, not where it is today.

Engineers of AI

Read more: www.engineersofai.com

If this was useful, forward it to one engineer who should be reading it.

AI Letters #21 - Streaming RAG: Time to First Token Across Three Frameworks

Tue, 07 Apr 2026 00:00:00 GMT

When users wait for an LLM, the number that matters is time-to-first-token, not total time. 200ms to first token feels instant. 2 seconds to first token feels broken - even if the full answer arrives faster.

Every LLM UI eventually learns the same lesson. Users don't measure latency the way your dashboard does. They don't care about tokens-per-second, p99 tail latency, or median completion time. They care about one thing: how long until something appears on screen. That number is TTFT - time to first token - and it dominates perceived performance more than any other metric in LLM serving.

The catch is that when you're building a streaming RAG pipeline, the framework itself sits between your .stream() call and the first token your user sees. Every async for, every LCEL graph traversal, every callback dispatch adds latency before a single character leaves the server. In production that overhead is invisible because network latency to OpenAI or Anthropic is 100–1000x larger. But strip out the network with a mock LLM and you can finally see what the framework itself costs you.

We built identical streaming RAG pipelines across SynapseKit 1.4, LangChain 1.2, and LlamaIndex Core 0.14. Same documents, same query, same mock LLM that yields the exact same token list with zero network latency. The result: all three clear the sub-millisecond bar comfortably. Nobody loses on the number. The interesting split is elsewhere - in the shape of the streaming API itself.

Interactive Chart

TTFT vs Network Latency →

How framework overhead compares to real network latency from OpenAI, Anthropic, and a local model. The framework is a rounding error - until it isn't.

Code Explorer

Streaming RAG Code by Framework →

Full streaming pipeline code side by side - imports, setup, and the `.stream()` consumption pattern annotated for each framework.

Data

TTFT Distribution + API Surface Matrix →

Median TTFT, p99 tail, sync vs async support, and callback availability - the full scorecard across all three frameworks.

What We Measured

Task: Build a streaming RAG pipeline (BM25 retrieval + LLM stream). Feed the retrieved context into an LLM that streams tokens. Measure the latency from calling .stream() to receiving the first token.

Metric	What it captures
Lines of code	Code to wire up a streaming RAG pipeline
TTFT (median)	Pure framework overhead with a zero-latency mock LLM
Streaming API surface	Sync vs async, generator vs callback, on-RAG vs on-LLM

Why a mock LLM: real LLM APIs add 100–2000ms of network and provider latency. That swamps any framework difference. Strip it out and the framework overhead finally becomes visible - the part you can actually optimise.

Frameworks: SynapseKit 1.4, LangChain 1.2, LlamaIndex Core 0.14. Kaggle CPU. 50 reps per framework. Disclosure: I'm the author of SynapseKit. All code is on Kaggle - fork and run yourself.

The Code

SynapseKit - async generator on the RAG object itself:

from synapsekit import RAG

rag = RAG(model="gpt-4o-mini", api_key=KEY, provider="openai")
await rag.add_documents(DOCS)
async for token in rag.stream(QUERY):
    print(token, end="", flush=True)

rag.stream(query) is a single method call that streams the full RAG pipeline - retrieve, construct prompt, call LLM, yield tokens. No chain composition, no graph construction. Async-only.

LangChain - LCEL chain with .stream():

from langchain_community.retrievers import BM25Retriever
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

retriever = BM25Retriever.from_texts(DOCS, k=3)
prompt    = ChatPromptTemplate.from_template("Context: {ctx}\n\nQ: {q}")
llm       = ChatOpenAI(model="gpt-4o-mini", streaming=True)
chain     = {"ctx": retriever, "q": RunnablePassthrough()} | prompt | llm
for chunk in chain.stream(QUERY):
    print(chunk.content, end="", flush=True)

LCEL composition makes every step explicit and swappable. More imports, more ceremony, but you can yank out the retriever or add a reranker without touching the stream call. Both sync (.stream) and async (.astream) are native.

LlamaIndex - query_engine(streaming=True):

from llama_index.core import Document, VectorStoreIndex, Settings
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(model="gpt-4o-mini")
index  = VectorStoreIndex.from_documents([Document(text=d) for d in DOCS])
engine = index.as_query_engine(streaming=True)
response = engine.query(QUERY)
for chunk in response.response_gen:
    print(chunk, end="", flush=True)

One flag flip (streaming=True) turns the query engine into a streaming generator. Clean surface. No native async stream on the query engine - you'd wrap it yourself or reach for the lower-level async APIs.

The Numbers

With a mock LLM that yields the same token list at zero network latency, we ran 50 TTFT measurements per framework:

Framework    Median TTFT   p99 TTFT   API shape
────────────────────────────────────────────────────────────
SynapseKit      0.08 ms      0.15 ms   async generator
LangChain       0.12 ms      0.21 ms   sync generator
LlamaIndex      0.14 ms      0.26 ms   sync generator

All three land in the sub-millisecond zone. The framework overhead itself is effectively free. At this resolution the numbers are noise. If you're choosing a framework to optimise TTFT, you're optimising the wrong thing - put your effort into prompt caching, smaller context windows, provider selection, and serving infrastructure. That's where the real milliseconds live.

For reference, here's what actually dominates TTFT in production:

Component                 Typical latency
────────────────────────────────────────────
Framework overhead        < 1 ms
Embedding lookup          5–20 ms
BM25 retrieval            10–50 ms
Network to LLM provider   80–200 ms
LLM first token           150–600 ms
────────────────────────────────────────────
Total TTFT                250 ms – 1 s

The framework is a rounding error. A 0.08ms vs 0.14ms difference cannot be measured in production - it vanishes into jitter.

The API Surface Split

This is where the frameworks actually diverge. When you're writing real code, the shape of the streaming API matters more than its latency.

Feature                 SynapseKit    LangChain    LlamaIndex
──────────────────────────────────────────────────────────────
Primary API             async gen     sync + async  sync gen
Sync support            No            Yes           Yes
Native async on RAG     Yes           Yes           No
Callback handlers       No            Yes           Yes (mgr)
Stream on RAG object    Yes           Yes (LCEL)    Yes (flag)

SynapseKit is async-only. There is no .stream() on a sync path. If your codebase runs in Flask, Django sync views, or a Jupyter notebook without an event loop, every call site needs asyncio.run() or you need to restructure around async. That's a migration, not a drop-in.

LangChain is the most flexible. chain.stream() for sync, chain.astream() for async, plus a callback handler ecosystem (StreamingStdOutCallbackHandler, AsyncIteratorCallbackHandler) for every framework integration you might need. If you're building a Streamlit app, a CLI tool, and an async FastAPI endpoint from the same chain, this is the path.

LlamaIndex sits in the middle. Native sync generators (response.response_gen) are easy to consume. The async story is weaker - the query engine doesn't expose a clean async stream by default. You reach for lower-level LLM APIs or wrap the sync generator in a thread.

What This Means for Engineers

Stop optimising framework TTFT overhead. At sub-millisecond, it's below the noise floor of every real LLM deployment. The TTFT you see in your dashboard is 99%+ network and provider latency. Focus there.
Match the streaming API to your runtime. If your app is async (FastAPI, async workers, LangGraph): SynapseKit and LangChain .astream() are both clean. If your app is sync (Flask, Django sync views, Jupyter, a CLI): LangChain .stream() or LlamaIndex's response_gen let you avoid restructuring.
Use callbacks for UI binding, generators for pipelines. LangChain's callback handler pattern is the cleanest path for tying stream output into progress bars, partial rendering, and multi-consumer fan-out. For a one-consumer pipeline, a generator is simpler.
Stream from the RAG object, not the LLM. All three frameworks can stream from the top-level RAG call (SynapseKit rag.stream, LangChain LCEL chain, LlamaIndex query_engine(streaming=True)). Don't roll your own retrieve + LLM stream loop - you'll reimplement the prompt construction wrong.
Measure TTFT end-to-end, not in isolation. The real number includes retrieval time, prompt build, network round-trip, and the provider's own time-to-first-token. That's the number your users experience. Framework overhead disappears into it.

The Corollary Most People Miss

TTFT is not the only perception metric. Inter-token latency - the jitter between the 2nd, 10th, and 100th tokens - matters almost as much. A stream that arrives in steady 15ms bursts feels smooth. A stream that arrives in a burst, stalls for 200ms, then bursts again feels broken. And inter-token latency is where framework buffering, callback dispatch, and LCEL graph traversal actually can start to matter at production volumes.

None of these frameworks add visible buffering on a mock LLM. But layer in a callback chain, a streaming response wrapper, and a server-sent-events encoder on top, and you can build a pipeline that adds 10–20ms of buffering per token. That's the part you have to profile yourself - and the part no benchmark in this series will catch for you.

Perception metric           What the user feels
──────────────────────────────────────────────────
TTFT                        Did anything happen?
Inter-token latency         Is it flowing or stalling?
Total time                  Was it fast enough to use?

You optimise all three in different ways. Framework choice affects the first two slightly and the third not at all.

Three Things Worth Doing This Week

Instrument TTFT in your production RAG. Log the three numbers that matter: retrieval latency, prompt-build latency, and time-to-first-token from the LLM. If any one is above 300ms, that's where the work is - not in the framework.
Switch from .stream() to .astream() if you're on an async stack. Sync .stream() inside an async handler blocks the event loop. Most teams accidentally run sync streams in async contexts because it was easier to paste the tutorial code.
Read the Kaggle notebook. Full reproducible code, mock LLM implementations for each framework, 50-run TTFT distributions: LLM Showdown #12 - Streaming RAG TTFT

Streaming is the default UX for every modern LLM product. The frameworks all do it. None of them are meaningfully slower than the others. The real question is whether your stream fits your runtime - async or sync, generator or callback, on the RAG or on the LLM. Pick the shape that matches your code, not the one with the lowest microsecond count on a benchmark that doesn't include the network.

Engineers of AI

Read more: www.engineersofai.com

If this was useful, forward it to one engineer who should be reading it.

AI Letters #20 - Hybrid Search: RRF Fusion Across Three Frameworks

Mon, 06 Apr 2026 00:00:00 GMT

Pure vector search misses exact matches. Pure BM25 misses semantics. Hybrid search almost always wins - the question is how much control you get over the fusion.

Every production RAG system eventually hits the same wall. Vector search retrieves semantically similar documents, but it fails on exact-match queries: model names, version numbers, function names, error codes. The query "GPT-4o" and the document "GPT-4o" don't reliably produce close vectors. BM25 doesn't have this problem. It matches terms, weighs them by rarity, and returns the right document.

Reciprocal Rank Fusion - RRF - is the standard way to combine both. It takes two ranked lists, assigns each document a score of 1 / (k + rank), sums the scores, and re-ranks. The parameter k controls how much the top ranks dominate. It requires no score normalisation, works across retrieval algorithms with incompatible score scales, and runs in microseconds.

We built identical hybrid pipelines across SynapseKit 1.4, LangChain 1.2, and LlamaIndex Core 0.14. Same corpus, same query, same task: BM25 + vector, top-3 via RRF. The LoC gap is smaller than the BM25-only benchmark. The configurability gap is not.

Interactive Chart

LoC Across the Series →

Lines of code per framework across all 11 benchmarks - from hello world to hybrid search. See which framework has compounded its lead.

Code Explorer

Hybrid Pipeline Code by Framework →

Full BM25 + vector + RRF pipeline code side by side - imports, setup, and retrieval call annotated for each framework.

Data

LoC, RRF Configurability, and Result Overlap →

Lines of code breakdown, configurable RRF parameters per framework, and result overlap across frameworks on an identical hybrid query.

What We Measured

Task: Index 5 documents with both BM25 and vector search, run an identical query through each hybrid retriever, return top-3 results via RRF fusion.

Metric	What it captures
Lines of code	Code to build and query a hybrid BM25 + vector pipeline
RRF configurability	Parameters exposed: weights, k, retriever count
Result agreement	Overlap in top-3 results across frameworks

Frameworks: SynapseKit 1.4, LangChain 1.2, LlamaIndex Core 0.14. Kaggle CPU. Disclosure: I'm the author of SynapseKit. All code is on Kaggle - fork and run yourself.

The Code

SynapseKit - 8 lines (2 imports + 6 functional):

from synapsekit.retrieval import HybridSearchRetriever, Retriever, InMemoryVectorStore
from synapsekit.embeddings import SynapsekitEmbeddings

emb    = SynapsekitEmbeddings(model="all-MiniLM-L6-v2", use_gpu=False)
r      = Retriever(InMemoryVectorStore(emb))
hybrid = HybridSearchRetriever(r, bm25_weight=0.5, vector_weight=0.5, rrf_k=60)
hybrid.add_documents(DOCS)
await r.add(DOCS)
results = await hybrid.retrieve(QUERY, top_k=3)

A single HybridSearchRetriever class wraps both modes. bm25_weight, vector_weight, and rrf_k are explicit constructor parameters. Limitation: fixed at two retrievers.

LangChain - 11 lines (4 imports + 7 functional):

from langchain_classic.retrievers.ensemble import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_community.embeddings import HuggingFaceEmbeddings

emb    = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vs     = InMemoryVectorStore(emb)
vs.add_texts(DOCS)
bm25   = BM25Retriever.from_texts(DOCS, k=3)
vec_r  = vs.as_retriever(search_kwargs={"k": 3})
hybrid = EnsembleRetriever(retrievers=[bm25, vec_r], weights=[0.5, 0.5])
results = [doc.page_content for doc in hybrid.invoke(QUERY)]

EnsembleRetriever is compositional: pass a list of any retrievers, a matching weights list. Add a third retriever by appending to both lists.

LlamaIndex - 12 lines (4 imports + 8 functional):

from llama_index.core.retrievers import QueryFusionRetriever, VectorIndexRetriever
from llama_index.core import Document, VectorStoreIndex, Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.retrievers.bm25 import BM25Retriever

Settings.llm = None
nodes  = SentenceSplitter(chunk_size=512).get_nodes_from_documents(
             [Document(text=d) for d in DOCS])
bm25_r = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=3)
vec_r  = VectorIndexRetriever(index=VectorStoreIndex(nodes), similarity_top_k=3)
fused  = QueryFusionRetriever([bm25_r, vec_r], similarity_top_k=3,
                              num_queries=1, use_async=False)
results = [n.text for n in fused.retrieve(QUERY)]

Twelve lines. Node parsing is unavoidable LlamaIndex boilerplate. The RRF k parameter is fixed internally and not exposed.

The Numbers

Framework    Imports   Functional   Total
──────────────────────────────────────────
SynapseKit       2           6         8
LangChain        4           7        11
LlamaIndex       4           8        12

The gap is smaller here than in BM25-only (where LangChain won at 3 lines). Hybrid search adds enough setup that the difference compresses. Four lines separate the most concise from the most verbose.

RRF Configurability

Parameter             SynapseKit   LangChain    LlamaIndex
────────────────────────────────────────────────────────────
BM25 weight           Yes          Yes          No
Vector weight         Yes          Yes          No
RRF k constant        Yes          Yes          No
Retriever count       2 only       Unlimited    Unlimited
Async support         Yes          Yes          Yes

Configurability       4/5          5/5          3/5

LlamaIndex's QueryFusionRetriever applies equal weighting to all retrievers. There is no weights parameter. If BM25 produces more false positives than vector, you cannot correct for it.

SynapseKit exposes weights and the k constant explicitly. The tradeoff: fixed at two retrievers. You cannot add a sparse retriever or reranker as a third leg.

LangChain is the most flexible. EnsembleRetriever takes weights=[0.3, 0.5, 0.2] for three retrievers. You can mix BM25 + dense + sparse + reranker in one call and tune the contribution of each signal.

Result Overlap

Query: "How does hybrid search combine BM25 and vector retrieval?"

Rank   SynapseKit                     LangChain                      LlamaIndex
────────────────────────────────────────────────────────────────────────────────
#1     Vector search uses dense...    TF-IDF and BM25 both use...    Hybrid search combines...
#2     Hybrid search combines...      Vector search uses dense...     Vector search uses dense...
#3     BM25 is a probabilistic...     Hybrid search combines...       TF-IDF and BM25 both use...

Jaccard: LangChain vs SynapseKit 0.75  |  LangChain vs LlamaIndex 0.75  |  LlamaIndex vs SynapseKit 0.50

What This Means for Engineers

LangChain's EnsembleRetriever is the right default for production hybrid search. Unlimited retriever composition with per-retriever weights is what you want when you're tuning a real pipeline. The extra 3 lines over SynapseKit are worth it.
LlamaIndex's no-weight limitation is a real constraint. Equal-weighting RRF works as a starting point. It fails when one retrieval mode dominates false positives and you need to downweight it.
SynapseKit's single-class API is convenient for the 2-retriever case. If you're doing standard BM25 + dense and never need a third leg, the explicit bm25_weight, vector_weight, rrf_k API is clean.
RRF k=60 is not magic. Lower k amplifies the importance of rank-1 results. Higher k flattens the distribution. Experiment with k in the range 30–100 before assuming 60 is optimal.
Hybrid search is not free. You're running two retrieval steps plus a merge. Use asyncio.gather() to run BM25 and vector concurrently - LangChain supports ainvoke() on all its retrievers.

The Corollary Most People Miss

Hybrid search architecture choice:
──────────────────────────────────────────────────────────
SynapseKit    Single class, explicit weights, fixed at 2
              + clearest API for standard hybrid
              - cannot extend to 3+ retrieval signals

LangChain     Composable list, per-retriever weights
              + most flexible for production tuning
              - 3 more lines, more imports to manage

LlamaIndex    Composable list, equal weights only
              + supports unlimited retrievers
              - no weight control - blind spot for prod

Three Things Worth Doing This Week

Add asyncio.gather() to your hybrid retriever. If you're running BM25 and vector sequentially, you're paying both latencies. Run them concurrently and your hybrid latency drops to the slower of the two, not the sum.
A/B test RRF k. Change k from 60 to 30 on a sample of your production queries. Lower k amplifies top-rank signals. Measure precision@3 on both.
Read the Kaggle notebook. Full reproducible code, live RRF computation, and result overlap tables: LLM Showdown #11 - Hybrid Search

Hybrid search is the standard, not the exception, for production RAG. The frameworks all implement RRF. What they disagree on is how much of the fusion parameters they expose to you. One treats the weights as fixed. One gives you two weights and a k constant. One gives you a weight list as long as your retriever list. That last one is the one you want when you're optimising recall across different query types.

Engineers of AI

Read more: www.engineersofai.com

If this was useful, forward it to one engineer who should be reading it.

AI Letters #19 - The BM25 Test: One Framework Silently Fails

Sun, 05 Apr 2026 00:00:00 GMT

BM25 ships in one framework, requires a hidden install in another, and silently fails at runtime in the third. Same class name, three very different experiences.

Pure vector search has a blind spot. Exact-match queries - model names, function names, version numbers, proper nouns - embed poorly. The query "GPT-4o" and the document "GPT-4o" don't always produce similar vectors. BM25 does not have this problem. It matches terms, weighs them by rarity, and returns the right document.

Production RAG systems almost always use hybrid search: BM25 for precision on exact matches, vector search for semantic recall, reciprocal rank fusion to merge them. The question of whether BM25 ships out of the box is not academic. It determines whether your pipeline works on day one or fails at 2am in a customer demo.

We tested all three frameworks on an identical task: index five documents, run a BM25 query, get top-3 results. One framework's BM25Retriever class is in its package but silently throws a ModuleNotFoundError at runtime unless you've separately installed a library it doesn't list as a dependency.

Interactive Chart

Install Path by Framework →

What you have to install before BM25 works - extra packages, silent dependencies, and integration packages across all three frameworks.

Code Explorer

BM25 Pipeline Code by Framework →

Full BM25 index + query code side by side - imports, setup, and retrieval call annotated for each framework.

Data

LoC, Extra Installs, and Result Overlap →

Lines of code breakdown, extra packages required, and ranked result comparison for identical query across all three frameworks.

What We Measured

Task: Index 5 documents, run a BM25 query, return top-3 results. Identical corpus and query across all three frameworks.

Metric	What it captures
Extra packages needed	Pip installs beyond the base framework install
Lines of code	Import + functional lines to build and query a BM25 index
Result quality	Top-3 docs returned for an identical keyword query

Frameworks: SynapseKit 1.4, LangChain 1.2, LlamaIndex Core 0.14. Kaggle CPU.

The Install Story

Before a single line of BM25 code runs, you need the right packages.

Framework    Base install              Extra needed              Behavior
──────────────────────────────────────────────────────────────────────────
SynapseKit   pip install synapsekit    none                      Works immediately
LangChain    pip install langchain     pip install rank-bm25     Silent runtime fail if missing
             langchain-community
LlamaIndex   pip install llama-index-core  pip install           ImportError at import time
                                      llama-index-retrievers-bm25

LangChain's behavior is the most dangerous. BM25Retriever lives in langchain-community. The import succeeds. The class is there. But when you call BM25Retriever.from_texts(), it raises ModuleNotFoundError: No module named 'rank_bm25' - a runtime error, not an import error. Your code passes linting, passes static analysis, and fails in production.

LlamaIndex fails at import time - from llama_index.retrievers.bm25 import BM25Retriever - which is the honest failure mode. You find out immediately.

SynapseKit declares rank-bm25 as a core dependency in its pip metadata. It installs with the base package. Nothing extra to do.

The Code

LangChain - 3 lines (1 import + 2 functional):

from langchain_community.retrievers import BM25Retriever

r       = BM25Retriever.from_texts(DOCS, k=3)
results = [doc.page_content for doc in r.invoke(QUERY)]

The cleanest BM25 API across all three. from_texts() takes a list of strings, invoke() returns Document objects. Three lines total.

SynapseKit - 8 lines (2 imports + 6 functional):

from synapsekit.retrieval import HybridSearchRetriever, Retriever, InMemoryVectorStore
from synapsekit.embeddings import SynapsekitEmbeddings

emb    = SynapsekitEmbeddings(model="all-MiniLM-L6-v2", use_gpu=False)
r      = Retriever(InMemoryVectorStore(emb))
hybrid = HybridSearchRetriever(r, bm25_weight=1.0, vector_weight=0.0)
hybrid.add_documents(DOCS)
await r.add(DOCS)
results = await hybrid.retrieve(QUERY, top_k=3)

SynapseKit's BM25 is hybrid-first. There is no standalone keyword retriever - BM25 lives inside HybridSearchRetriever with bm25_weight=1.0. This means you initialise an embedding model and a vector store even when you only want keyword search. The embedding model never runs (weight is 0), but the object must exist. Eight lines for something that should be three.

LlamaIndex - 9 lines (3 imports + 6 functional):

from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core import Document, Settings
from llama_index.core.node_parser import SentenceSplitter

Settings.llm = None
Settings.embed_model = None
nodes   = SentenceSplitter(chunk_size=512).get_nodes_from_documents(
              [Document(text=d) for d in DOCS])
r       = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=3)
results = [n.text for n in r.retrieve(QUERY)]

Three imports, two explicit None assignments to suppress LLM/embedding warnings, and a node parsing step before the retriever can be initialised. Nine lines, most of it overhead suppression.

The Results

Query: "How does BM25 compare to TF-IDF?"

Rank   SynapseKit                    LangChain                     LlamaIndex
───────────────────────────────────────────────────────────────────────────────
#1     TF-IDF weights terms...       RAG feeds retrieved passages   TF-IDF weights terms...
#2     RAG feeds retrieved passages  BM25 is a probabilistic...     BM25 is a probabilistic...
#3     Hybrid search combines...     Hybrid search combines...      Hybrid search combines...

Result overlap (Jaccard): 0.50 across all pairs (2/3 shared each)

All three retrieve the same 3 documents from a 5-document corpus - they differ only on ranking order. That is expected: all three use BM25Okapi from the rank_bm25 library under the hood. Different tokenisation details shift the ranking slightly, but the relevant documents are the same.

The result quality question is a non-issue for BM25. What matters is whether it runs at all.

What This Means for Engineers

LangChain's silent runtime failure is a production hazard. A ModuleNotFoundError inside from_texts() - not at import time - means it passes every pre-deploy check that doesn't exercise the retrieval path. Add rank-bm25 to your requirements file explicitly, always.
SynapseKit's hybrid-first design costs you 5 extra lines for pure keyword search. If you only want BM25, you're initialising an embedding model that never runs. The zero-install story is real; the ergonomics for standalone BM25 are not great.
LlamaIndex's explicit install is the honest design. A separate package for BM25 means the base install stays small. The tradeoff is one more pip install you have to know about - but at least it fails at import time, not at 2am in production.
In practice, you want hybrid search, not pure BM25. Pure BM25 as a benchmark is useful; as a production retriever it leaves semantic recall on the table. The real question is which framework makes hybrid search (BM25 + vector + RRF) easiest to configure - that's the next benchmark.
All three use the same BM25 algorithm. BM25Okapi from rank_bm25 is the de facto standard implementation in Python. The retrieval quality differences you see in production are almost never about the BM25 implementation - they're about tokenisation, stemming, and stopword handling that sits on top of it.

The Corollary Most People Miss

The install story matters more than the LoC story for BM25.

LangChain wins on lines of code (3 vs 8 vs 9). But a 3-line retriever that silently fails in production is worth less than an 8-line retriever that works. The ergonomics cost of SynapseKit's hybrid-first design is real - you shouldn't have to initialise embeddings to do keyword search - but at least it doesn't fail on you.

LlamaIndex's approach is the cleanest philosophically: BM25 is a separate concern, it lives in a separate package, the failure mode is immediate and visible. The ergonomics in code are the worst, but the operational behaviour is the most honest.

Design philosophy comparison:
────────────────────────────────────────────────────────────
SynapseKit    BM25 bundled, hybrid-first API
              ✓ zero extra installs
              ✗ cannot do standalone BM25 without embedding overhead

LangChain     BM25 class included, dependency external
              ✓ cleanest API (3 lines)
              ✗ silent runtime failure if rank-bm25 not installed

LlamaIndex    BM25 in separate package, explicit install
              ✓ honest failure mode (import error, not runtime error)
              ✗ most verbose (9 lines + Settings suppression)

Three Things Worth Doing This Week

Audit your requirements file. If you use LangChain's BM25Retriever, confirm rank-bm25 is in your requirements.txt or pyproject.toml. The import succeeds without it; the runtime doesn't.
Run a hybrid retrieval experiment on your existing RAG pipeline. Add BM25 alongside your vector search, fuse with reciprocal rank fusion, measure precision@3 on 20 representative queries. Most teams see 10–25% improvement on exact-match queries with no change to the embedding model.
Read the Kaggle notebook. Full reproducible code, the live ranked results, and the result overlap analysis: LLM Showdown #10 - Built-in BM25

BM25 is 35 years old and still in production at Google, Elasticsearch, and every search system that handles exact-match queries. The question was never whether to use it. The question was whether your framework ships it without surprises. One does. One requires a hidden install. One fails silently at runtime. Now you know which is which.

Engineers of AI

Read more: www.engineersofai.com

If this was useful, forward it to one engineer who should be reading it.

AI Letters #18 - The Chunking Test: Two Frameworks Are Identical, One Is Not

Sat, 04 Apr 2026 00:00:00 GMT

How you split documents determines what your retriever finds. Most tutorials spend two lines on this. They shouldn't.

Every RAG tutorial reaches the chunking step and sprints past it. "Split into chunks of 500 characters with 50 overlap - done." The code runs. The demo works. The demo is not production.

The split you choose affects embedding quality, retrieval precision, and whether your LLM gets enough context to say something useful. Chunking is not configuration. It's architecture.

We ran all three frameworks against the same document with identical parameters. The line counts came out nearly equal. The chunk outputs did not. One framework's default splitter interprets chunk_size=300 as tokens, not characters - producing 2 chunks averaging 986 characters each instead of 12 chunks averaging 163 characters. Same parameter name, different semantics.

Interactive Chart

Splitter Inventory by Framework →

All built-in splitters across SynapseKit, LangChain, and LlamaIndex - what ships out of the box and what each one is for.

Code Explorer

Full Chunking Code by Framework →

Select a framework to see the complete chunking pipeline with line-by-line annotation: imports, splitter config, and output.

Data

Chunk Count, Size Distribution, and LoC →

Live chunk output from all three frameworks on identical input - count, average size, max size, and size histogram side by side.

What We Measured

Task: Split a 1,972-character document about RAG systems into chunks. Parameters: chunk_size=300, chunk_overlap=30, sentence-aware splitter for each framework. Metrics:

Metric	What it captures
Built-in splitter count	How many strategies ship out of the box
Lines of code	How much code to configure sentence-aware chunking
Chunk output	Count, avg size, size distribution from identical input

Frameworks: SynapseKit 1.4, LangChain 1.2, LlamaIndex Core 0.14. Kaggle CPU environment.

The Splitter Inventory

Before measuring LoC, count what each framework ships.

SynapseKit - 2 splitters:

RecursiveTextChunker - recursive character splitting (default)
TokenChunker - token-count-based splitting

LangChain - 8 splitters:

RecursiveCharacterTextSplitter - recursive character splitting (recommended default)
CharacterTextSplitter - single-separator character splitting
TokenTextSplitter - token-count splitting
SentenceTransformersTokenTextSplitter - sentence-transformer token splitting
MarkdownTextSplitter - markdown-header-aware splitting
PythonCodeTextSplitter - Python AST-aware splitting
HTMLSectionSplitter - HTML section-aware splitting
SemanticChunker - embedding-based semantic splitting (langchain-experimental)

LlamaIndex - 9 splitters:

SentenceSplitter - sentence-aware splitting (default)
TokenTextSplitter - token-count splitting
CodeSplitter - language-aware code splitting
MarkdownNodeParser - markdown-header-aware splitting
JSONNodeParser - JSON-structure-aware splitting
SentenceWindowNodeParser - sentence with surrounding context window
HierarchicalNodeParser - multi-level hierarchical chunks
SemanticSplitterNodeParser - embedding-based semantic splitting
TopicNodeParser - topic-model-based splitting

Two vs eight vs nine. The headline number is misleading though - what matters is whether the advanced splitters solve problems you'll actually encounter. We'll come back to this.

The Code, Side by Side

SynapseKit - 5 lines (1 import, 4 functional):

from synapsekit import Retriever, InMemoryVectorStore, SynapsekitEmbeddings

emb = SynapsekitEmbeddings(model="all-MiniLM-L6-v2", use_gpu=False)
r   = Retriever(InMemoryVectorStore(emb),
                chunk_size=300, chunk_overlap=30)
await r.add([DOCUMENT])

No standalone splitter API. Chunk parameters live on the Retriever. If you want to inspect chunks before indexing, you can't - the split is opaque.

LangChain - 4 lines (1 import, 3 functional):

from langchain_text_splitters import RecursiveCharacterTextSplitter

chunks = RecursiveCharacterTextSplitter(
    chunk_size=300, chunk_overlap=30
).split_text(DOCUMENT)

The cleanest interface of the three. Splitter is a standalone object, inspectable, composable. You can run it before indexing, log the output, swap it for any other splitter without touching the rest of your pipeline.

LlamaIndex - 6 lines (2 imports, 4 functional):

from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import Document

nodes  = SentenceSplitter(
    chunk_size=300, chunk_overlap=30
).get_nodes_from_documents([Document(text=DOCUMENT)])
chunks = [n.text for n in nodes]

Two imports instead of one. Text must be wrapped in a Document object. Output is Node objects, not strings - you extract .text yourself. More verbose, but the Node carries metadata (source, position, relationships) that becomes useful downstream.

Line count totals: SynapseKit 5 / LangChain 4 / LlamaIndex 6. Effectively tied.

The Chunk Output Surprise

Same document. Same chunk_size=300. Same chunk_overlap=30. Here is what each framework produced:

Framework    Chunks   Avg size (chars)   Max size (chars)
──────────────────────────────────────────────────────────
SynapseKit      12         163               254
LangChain       12         163               254
LlamaIndex       2         986              1481

SynapseKit and LangChain are identical - SynapseKit uses the same recursive character algorithm under the hood. LlamaIndex produced 2 chunks averaging 986 characters each.

The reason: LlamaIndex's SentenceSplitter interprets chunk_size as tokens, not characters. chunk_size=300 means 300 tokens, which is roughly 1,200 characters. On a 1,972-character document, that yields 2 chunks - not the 12 you'd expect if you assumed character-based sizing.

This is not a bug. It is the documented behavior. But it is also the most common source of confusion when engineers switch frameworks mid-project. You copy the parameters from a LangChain tutorial, paste them into LlamaIndex, and your chunk distribution changes by an order of magnitude without a single error message.

                    chunk_size=300 means...
                    ┌──────────────────────────────────┐
SynapseKit          │ 300 characters                   │
LangChain           │ 300 characters                   │
LlamaIndex          │ 300 tokens (~1,200 characters)   │
                    └──────────────────────────────────┘

Same document, chunk_size=300, overlap=30:
SynapseKit/LangChain:  [163][163][163][163][163][163][163][163][163][163][163][163]
LlamaIndex:            [──────────────────1481──────────────────][───490───]

What LlamaIndex's Advanced Splitters Actually Do

The splitter count difference is real, but two of LlamaIndex's nine entries represent strategies with no equivalent in the other two frameworks.

SentenceWindowNodeParser: Stores each sentence as an individual node. Attaches the surrounding sentences as a metadata window (configurable, default: 1 sentence each side). At retrieval time, you search against precise single-sentence embeddings - high precision. At generation time, you expand the retrieved node to its window - adequate context. The result is retrieval that finds the exact sentence you need without diluting the embedding with surrounding text. Neither LangChain nor SynapseKit has a built-in equivalent.

HierarchicalNodeParser: Creates three levels of nodes from the same document: large (2048 tokens), medium (512), small (128). Small nodes are indexed for retrieval. When retrieval returns too many small nodes from the same parent (configurable threshold), they are "automerged" into the parent chunk before being sent to the LLM. You get the precision of small chunks with the coherence of large ones. This is a production technique; the LlamaIndex documentation attributes meaningful accuracy gains to it on multi-hop questions.

Switching between LlamaIndex's splitters costs one line - the import:

# One import change. Everything else stays identical.
from llama_index.core.node_parser import SentenceSplitter           # default
from llama_index.core.node_parser import SentenceWindowNodeParser   # precision mode
from llama_index.core.node_parser import HierarchicalNodeParser     # production mode

What This Means for Engineers

Never copy chunk parameters across frameworks without checking the unit. chunk_size=500 in LangChain is characters. In LlamaIndex it is tokens. Verify once, avoid a silent quality regression.
SentenceWindowNodeParser is worth understanding even if you don't use LlamaIndex. The pattern - retrieve at sentence granularity, generate with window context - is implementable in any framework manually. LlamaIndex just makes it one import.
HierarchicalNodeParser solves a real production problem. When retrieval returns five fragments of the same paragraph as separate nodes, your LLM is reading five partial views of the same text. Automerging collapses them into the parent. This is not theoretical - it matters on documents with repeated cross-references.
SynapseKit's 2 splitters are a constraint when you need format-aware splitting. If your corpus includes Markdown docs, Python files, and HTML pages, you need a splitter that understands structure. LangChain and LlamaIndex have these. SynapseKit does not.
LangChain's standalone splitter API is the most flexible for debugging. Because chunking is decoupled from the vector store, you can log chunk distributions before committing to an indexing run. In production, that observability pays back quickly.

The Corollary Most People Miss

The line counts say LangChain is cheapest (4 lines), LlamaIndex most expensive (6), SynapseKit in the middle (5). That is the wrong frame.

The actual cost comparison is: how many lines does it take to switch splitters when your default stops working?

For LlamaIndex: one line (the import). All nine splitters share the same get_nodes_from_documents() interface.

For LangChain: also roughly one line - all splitters expose .split_text() or .split_documents().

For SynapseKit: you cannot switch splitters. The chunking algorithm is not exposed. You take what the Retriever does internally, or you switch frameworks.

Initial LoC favors LangChain. Iteration cost favors LlamaIndex. Lock-in risk penalizes SynapseKit.

Three Things Worth Doing This Week

Print your chunk distribution before indexing. [len(c) for c in chunks] - histogram it. If 20% of your chunks are under 50 characters, your splitter is cutting at punctuation. If 20% exceed your embedding model's token limit, they're being silently truncated.
Test LlamaIndex's SentenceWindowNodeParser on one of your existing retrievers. The interface is one import and one additional retrieval step. If your current precision is poor, sentence-level retrieval with window expansion frequently outperforms standard chunking without any change to the embedding model.
Read the Kaggle notebook. Full reproducible code for all three frameworks, live chunk outputs, and the size distribution charts: LLM Showdown #9 - Chunking Strategies

Chunking determines what your retriever can find. The tutorials that sprint past it in two lines are the same tutorials whose RAG demos fall apart on real documents. The split is not configuration. It is the first decision that determines whether your retrieval is precise or lucky.

Engineers of AI

Read more: www.engineersofai.com

If this was useful, forward it to one engineer who should be reading it.

AI Engineering Letters - EngineersOfAI

AI Letters #35 - Why We Built SynapseKit: The Framework We Deserve

The Problem We Lived​

What SynapseKit Actually Is​

What This Means for You​

1. You Own Your Code​

2. You Keep 90% of Your Cold Start​

3. You See Your Costs​

Why We're Staying Open Source (Forever)​

The Temptation​

The Bet​

What We Monetize​

What the Community Taught Us​

Simplicity Beats Ecosystem​

Cost Visibility Beats Ease​

Async-Native Beats Backwards Compatibility​

Testing Beats Documentation​

Transparency Beats Polish​

How We're Benchmarking Everything (No Illusions)​

Cold Start Benchmarks​

Token Cost Benchmarks​

Latency Benchmarks​

Feature Coverage Benchmarks​

Why We Benchmark​

Why We'll Be the Best Tool​

Dependency Minimalism = Embeddability​

Async-Native = Production-Ready​

Transparency = Trust​

Community = Compounding Returns​

Open Source = Moat​

The 8 Features We're Shipping (v1.8.0 - v2.0.0)​

v1.8.0: Production Grade (June 15)​

v1.9.0: Advanced Retrieval (July 20)​

v2.0.0: Distributed (September 1)​

What Success Looks Like​

Join Us​

The Final Truth​

Resources​

AI Letters #34 - The 30-Day LLM Framework Verdict: 25 Benchmarks, One Clear Answer

What the Data Actually Shows​

The Evidence​

Week 4: Production Readiness​

The Simplest RAG Test (#29)​

The Full 30-Day Pattern​

What This Means for Engineers​

The Part Most People Will Get Wrong​

Three Things Worth Doing This Week​

AI Letters #33 - We Built Traceprop: Finally, an ML Audit Trail That Answers the Regulator's Question

Why We Built This​

What Traceprop Is​

The Attribution Layer: Connecting Predictions to Source Rows​

The Unlearning Layer: GDPR Erasure That Actually Works​

The Multi-Source Case​

The Enforcement Dates​

Why We're Open-Sourcing It​

What to Do Right Now​

AI Letters #32 - Your RAG Has No Immune System

What LangChain 1.x Quietly Removed​

The Three Frameworks, The Same Task​

The Feature Gap Is Not Close​

Lines of Code Tell the Same Story​

Why Regression Tracking Is the Feature Most Teams Need​

What This Means for Engineers​

The Thing Most Teams Get Wrong​

Three Things Worth Doing This Week​

AI Letters #31 - Graph Workflows: When Chains Break and DAGs Take Over

What We Measured​

The Numbers​

The Feature Matrix​

The API Comparison​

The One Meaningful Difference​

When You Need a Graph​

What This Means for Engineers​

The Thing Most People Miss​

Three Things Worth Doing This Week​

AI Letters #30 - Async Throughput: The Framework Tax on Every Concurrent Request

What We Measured​

The Numbers​

The Scaling Factor​

Where the Overhead Comes From​

The Problem We Lived

What SynapseKit Actually Is

What This Means for You

1. You Own Your Code

2. You Keep 90% of Your Cold Start

3. You See Your Costs

Why We're Staying Open Source (Forever)

The Temptation

The Bet

What We Monetize

What the Community Taught Us

Simplicity Beats Ecosystem

Cost Visibility Beats Ease

Async-Native Beats Backwards Compatibility

Testing Beats Documentation

Transparency Beats Polish

How We're Benchmarking Everything (No Illusions)

Cold Start Benchmarks

Token Cost Benchmarks

Latency Benchmarks

Feature Coverage Benchmarks

Why We Benchmark

Why We'll Be the Best Tool

Dependency Minimalism = Embeddability

Async-Native = Production-Ready

Transparency = Trust

Community = Compounding Returns

Open Source = Moat

The 8 Features We're Shipping (v1.8.0 - v2.0.0)

v1.8.0: Production Grade (June 15)

v1.9.0: Advanced Retrieval (July 20)

v2.0.0: Distributed (September 1)

What Success Looks Like

Join Us

The Final Truth

Resources

What the Data Actually Shows

The Evidence

Week 4: Production Readiness

The Simplest RAG Test (#29)

The Full 30-Day Pattern

What This Means for Engineers

The Part Most People Will Get Wrong

Three Things Worth Doing This Week

Why We Built This

What Traceprop Is

The Attribution Layer: Connecting Predictions to Source Rows

The Unlearning Layer: GDPR Erasure That Actually Works

The Multi-Source Case

The Enforcement Dates

Why We're Open-Sourcing It

What to Do Right Now

What LangChain 1.x Quietly Removed

The Three Frameworks, The Same Task

The Feature Gap Is Not Close

Lines of Code Tell the Same Story

Why Regression Tracking Is the Feature Most Teams Need

What This Means for Engineers

The Thing Most Teams Get Wrong

Three Things Worth Doing This Week

What We Measured

The Numbers

The Feature Matrix

The API Comparison

The One Meaningful Difference

When You Need a Graph

What This Means for Engineers

The Thing Most People Miss

Three Things Worth Doing This Week

What We Measured

The Numbers

The Scaling Factor

Where the Overhead Comes From

The Real-World Caveat

What This Means for Engineers

The Thing Most People Miss

Three Things Worth Doing This Week

The Six Benchmarks

What SynapseKit Actually Wins On

The One LangChain Win That Matters