Position · 9 min read · May 22, 2026

Context stuffing vs. tool calling: why many AI projects fail at the data architecture

More context does not make LLMs more reliable; the research shows the opposite. This post covers three documented failure modes (lost in the middle, context rot, tool overload), why data architecture decides the outcome, and when tool calling is the more reliable choice.

Oliver Kraft

CovaSyn


Key takeaways

The common assumption "more context = better answer" is empirically wrong. Research shows: LLMs measurably get worse with longer input.
Three documented failure modes: lost in the middle, context rot, and tool overload under long context.
The most dangerous finding: a model can perform worse with irrelevant context than with no context at all.
Many AI projects fail not at the model but at the data architecture: they stuff context where they should retrieve precisely.
Context stuffing has legitimate uses, but for exact, structured tasks, tool calling is the more reliable architecture.

The expensive assumption

When an AI project fails in pharma, biotech or chemistry, the usual diagnosis is "the model is not good enough yet" or "we need a larger context window". Both are usually wrong. The more common reason is an architectural decision made right at the start, often unconsciously: stuff everything into the context instead of retrieving precisely what is needed.

With context windows of one million tokens, context stuffing sounds tempting: throw data sheets, tables, measurement series and documentation into the prompt and hope the model pulls out what matters. On paper, more information is more knowledge. In practice it is the opposite, and that is now well measured.

Failure mode 1: lost in the middle

The foundational work here is Liu et al. (Stanford, UC Berkeley and Samaya AI; TACL 2024). It shows a U-shaped performance curve: models use information at the start and end of the context well, but accuracy drops significantly when the relevant information sits in the middle. Performance can degrade substantially when only the position of that information changes, a clear sign that current models do not use long contexts robustly.

The most striking single finding: when the relevant information sits in the middle, GPT-3.5-Turbo's accuracy on a multi-document question falls below what the same model achieves with no documents at all. More context made the answer worse than no context. In addition, explicit long-context models often do not outperform their standard counterparts here; the bigger window does not solve the problem.
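
The position effect can be checked on your own data with a simple sweep. The following is a minimal sketch, not the protocol from Liu et al.: the function names, the prompt layout and the example documents are assumptions made for illustration. It builds one prompt per possible position of the relevant document so that position is the only variable that changes.

```python
from typing import List


def build_prompt(question: str, documents: List[str]) -> str:
    """Join numbered documents and the question into a single prompt string."""
    numbered = [f"Document [{i + 1}]: {doc}" for i, doc in enumerate(documents)]
    return "\n".join(numbered) + f"\n\nQuestion: {question}\nAnswer:"


def position_sweep(question: str, relevant: str, distractors: List[str]) -> List[str]:
    """Return one prompt per possible position of the relevant document.

    Prompt k places the relevant document at index k among the distractors,
    so the only difference between prompts is where the answer sits.
    """
    prompts = []
    for k in range(len(distractors) + 1):
        docs = distractors[:k] + [relevant] + distractors[k:]
        prompts.append(build_prompt(question, docs))
    return prompts


# Hypothetical example data, invented for this sketch.
prompts = position_sweep(
    question="What is the melting point of compound X?",
    relevant="Compound X melts at 131 °C.",
    distractors=["Compound Y melts at 118 °C.", "Compound Z melts at 140 °C."],
)
```

Each prompt would then go to the model under test, and accuracy per position would be compared against a closed-book run with no documents at all.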

Failure mode 2: context rot

A more recent study from Chroma (2025) systematically tested 18 frontier models and found a pattern the authors call "context rot": answer quality drops measurably as the input grows longer, across every single tested model, even when the context window is far from full.

Especially relevant in real applications is the distractor effect. Irrelevant context forces the model into an extra search step and degrades its reliability significantly. The kicker: semantically similar but wrong content (distractors) is the most damaging, and that is exactly what occurs constantly in technical domains, where documents resemble each other and differ only in details like a year, a value or a substituent. A well-structured, consistent corpus actually increases distractor density because every false hit looks more plausible.
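
To make the distractor problem concrete, here is a small sketch under stated assumptions: the corpus is invented, and `difflib.SequenceMatcher` is only a crude character-level stand-in for the embedding similarity a real retriever would use. It shows that a wrong document differing only in year and value looks almost identical to the right one, and that an exact metadata filter (the hypothetical `filter_exact` helper) separates them where similarity cannot.

```python
from difflib import SequenceMatcher
from typing import Any, Dict, List


def surface_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a crude proxy, not a semantic score."""
    return SequenceMatcher(None, a, b).ratio()


def filter_exact(docs: List[Dict[str, Any]], **required: Any) -> List[Dict[str, Any]]:
    """Keep only documents whose metadata matches every required key exactly."""
    return [d for d in docs if all(d.get(k) == v for k, v in required.items())]


# Hypothetical corpus: two stability reports that differ only in year and value.
docs = [
    {"kind": "stability", "year": 2023,
     "text": "Stability study 2023: assay 98.2 % after 12 months at 25 °C."},
    {"kind": "stability", "year": 2022,
     "text": "Stability study 2022: assay 98.7 % after 12 months at 25 °C."},
    {"kind": "notice", "year": 2023,
     "text": "The cafeteria is closed on public holidays."},
]

near_duplicate = surface_similarity(docs[0]["text"], docs[1]["text"])
unrelated = surface_similarity(docs[0]["text"], docs[2]["text"])

# Similarity alone cannot tell the 2022 report from the 2023 one;
# an exact match on structured metadata can.
candidates = filter_exact(docs, kind="stability", year=2023)
```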

That is the inconvenient truth behind "RAG is dead, we have million-token windows now": the window is large, attention inside it is not.

Failure mode 3: tool overload and long tool responses

The third mode hits exactly the agent setups that matter in practice. The LongFuncEval paper (2025) measured how tool calling behaves under long context, with a sobering result: a 7 to 85 percent performance drop as more tools become available; 7 to 91 percent collapse as tool responses get longer; and significant degradation across long multi-turn dialogues.

The lesson is not "tools are bad". It is that tool calling also fails when combined with context stuffing. Hundreds of near-identical tools and pages of raw responses just shift the problem. Good agent design means few, precise tools with compact, usable answers.
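
What a "compact, usable answer" can look like is sketched below. This is an illustration only: the raw payload, its field names and the `compact` helper are assumptions, not the response format of any real tool. The idea is that the tool layer strips traces and raw tables before anything reaches the model.

```python
import json
from typing import Any, Dict

# Hypothetical raw backend payload; all field names are invented for this sketch.
RAW_RESPONSE: Dict[str, Any] = {
    "request_id": "abc-123",
    "timing_ms": 412,
    "debug_trace": ["step 1", "step 2", "step 3"],
    "result": {"value": 3.27, "unit": "mg/mL", "ci_low": 2.9, "ci_high": 3.6},
    "raw_table": [[i, i * 0.1] for i in range(500)],
}


def compact(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce a raw tool payload to the few fields the model actually needs."""
    result = raw["result"]
    return {
        "value": result["value"],
        "unit": result["unit"],
        "interval": [result["ci_low"], result["ci_high"]],
    }


answer = compact(RAW_RESPONSE)
raw_size = len(json.dumps(RAW_RESPONSE))
compact_size = len(json.dumps(answer))
```

Only `answer` would be handed back to the model; the raw table and debug trace stay on the tool side, where they can still be logged.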

What context stuffing is still good for

To keep this fair: context stuffing and RAG are not generally wrong. For unstructured, language-oriented tasks they are often the right choice: asking a question against a single document, summarizing a report, searching a knowledge base for a passage, scanning a contract. Wherever the answer is text and nuance matters, giving the model the relevant text makes sense.

The problem appears when you use context stuffing for tasks with an exact, structured, deterministic answer. Computing a solubility, counting atoms, looking up an ICH limit, fitting stability kinetics: these are not reading tasks. Delegating them to the context means turning a precision problem into a probability problem.
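
Counting atoms is a good example of a task that a few lines of deterministic code answer exactly. The sketch below is a deliberately small illustration, not CovaSyn's implementation: `count_atoms` is a hypothetical helper that handles only flat formulas such as C6H12O6 (no parentheses, charges or hydrates) and does not check that a symbol is a real element.

```python
import re
from collections import Counter
from typing import Dict

# One element symbol (capital letter, optional lowercase letter) plus an optional count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def count_atoms(formula: str) -> Dict[str, int]:
    """Count atoms per element in a flat molecular formula.

    Raises ValueError on anything outside the simple symbol+count syntax,
    rather than guessing.
    """
    counts: Counter = Counter()
    pos = 0
    while pos < len(formula):
        match = _TOKEN.match(formula, pos)
        if match is None:
            raise ValueError(f"Unsupported formula syntax at position {pos}: {formula!r}")
        symbol, digits = match.groups()
        counts[symbol] += int(digits) if digits else 1
        pos = match.end()
    return dict(counts)


glucose = count_atoms("C6H12O6")
ethanol = count_atoms("C2H5OH")
```

The same input always yields the same counts, and unsupported input fails loudly instead of producing a plausible-looking guess.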

The blind spot: lots of LLM research, little retrieval research

There is a structural imbalance in the field. Most of the attention, and capital, flows into bigger, more capable models. Comparatively little flows into the question of which information a model should see at which moment. Yet the findings above show that is where the leverage sits: the models are often clever enough to solve the problem, when their context stays clean. It rarely does.

For a company that means in practice: the next model generation will not solve your architecture problem for you. Anyone waiting for "this will get better with GPT-X" is optimizing the wrong variable. Reliable improvement comes from the data architecture, from giving the model the right thing precisely, instead of laying everything in front of it.

The architecture decision, and how we make it at CovaSyn

The rule of thumb that falls out of the research is clear:

Text answer, nuance, interpretation → give context (RAG/stuffing is right).
Exact, structured, verifiable answer → delegate to a tool (tool calling is right).
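
As a minimal sketch, the rule of thumb above can be expressed as a dispatcher. Everything here is assumed for illustration: the `Task` type, the `TOOLS` registry and the `celsius_to_kelvin` tool are hypothetical and do not describe CovaSyn's actual routing. A task with a registered deterministic tool is computed; everything else falls through to the context path.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Task:
    name: str     # identifier of the requested operation
    payload: str  # raw input for the tool, or the question text


def celsius_to_kelvin(payload: str) -> str:
    """Hypothetical deterministic tool: exact unit conversion."""
    return f"{float(payload) + 273.15:.2f} K"


# Registry of curated tools; kept small on purpose.
TOOLS: Dict[str, Callable[[str], str]] = {
    "celsius_to_kelvin": celsius_to_kelvin,
}


def route(task: Task) -> Dict[str, str]:
    """Send exact tasks to a tool and language tasks to the context path."""
    tool = TOOLS.get(task.name)
    if tool is not None:
        return {"route": "tool", "answer": tool(task.payload)}
    return {
        "route": "context",
        "prompt": f"Use the retrieved passages to answer: {task.payload}",
    }


exact = route(Task("celsius_to_kelvin", "25"))
textual = route(Task("summarize_report", "Summarize the stability report."))
```

In a real agent the model itself usually picks the tool from a schema; the point of the sketch is only that the split between "compute" and "read" is an explicit design decision rather than something left to the prompt.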

CovaSyn is built consistently for the second case. Instead of stuffing data sheets and tables into a chemistry agent's context and hoping for the best, it calls a deterministic tool that computes the exact answer and returns it compactly: a solubility with uncertainty interval, a tox flag, a structural descriptor. No distractor noise, no middle-of-context lottery, no page-long raw answer.

We measured the advantage of this architecture on independent data: with a tool layer attached, frontier models jump on chemistry tasks from 14–41 percent to 76–92 percent, not because they get more context but because they retrieve the right, small, exact building block. And because we deliberately curate the tools rather than expose every function, we avoid the tool overload described by LongFuncEval. Our post on RDKit over MCP covers the difference between pure descriptor access and a curated platform.

The bottom line

The most important decision in an AI project is not the model choice but the data architecture, and it gets made early. Anyone who delegates exact tasks to the context bakes in the lost-in-the-middle, context-rot and tool-overload failure modes from the start. Anyone who delegates them to tools sidesteps them. A bigger context window only shifts that boundary; the right architecture removes it.

The free tier lets you see the difference yourself: tool calling instead of context stuffing for your chemistry questions, directly in the agent, with 100 credits per week. See CovaSyn MCP.

FAQ

Does a larger context window help accuracy?

Not reliably. Studies (Liu et al. 2024; Chroma 2025) show that answer quality drops with longer input, even on models built explicitly for long context. The window grows, attention inside it does not.

What is "lost in the middle"?

The finding that LLMs use information at the start and end of a long context well but drop significantly in the middle, a U-shaped performance curve. In extreme cases the answer with context is worse than without any.

What is context rot?

The measurable degradation in LLM answer quality as input length grows. A Chroma study found the effect across all 18 tested frontier models; semantically similar but irrelevant content (distractors) amplifies it.

Context stuffing or tool calling, what should I use?

For unstructured language tasks (summarization, document search), context stuffing or RAG is sensible. For exact, structured, verifiable tasks (calculations, lookups, predictions), tool calling is the more reliable architecture.

Why do many AI projects fail?

Often not at the model but at the data architecture: exact tasks get stuffed into the context instead of delegated to deterministic tools. That bakes the known failure modes in from the start.

Sources

Liu N. F. et al. (2024). Lost in the Middle: How Language Models Use Long Contexts. TACL, arXiv:2307.03172.
Hong K. et al. (2025). Context Rot: How Increasing Input Tokens Impacts LLM Performance. Chroma Research.
LongFuncEval: Measuring the effectiveness of long context models for function calling. arXiv:2505.10570 (2025).
Shi F. et al. (2023). Large Language Models Can Be Easily Distracted by Irrelevant Context (GSM-IC).
