Introduction
I’ve been thinking about MCP scaling problems since I built and deployed an MCP server for the OpenDota API. In that post I talked about the pitfalls of auto-converting a REST API into MCP tools: context pollution, overlapping tool descriptions, the agent struggling to pick the right tool out of 200+ options. At the time, I framed those as reasons to curate your tool set carefully. But I didn’t have a good answer for what you do when you genuinely need a lot of tools.
Then Anthropic published an engineering post that gave me the framing I was missing. By changing how agents interact with MCP tools, they report cutting token usage by over 98%. I want to walk through the problem and the proposed solution here, because I think anyone building MCP integrations beyond the toy stage will run into this wall eventually. I certainly ran into the early signs of it.
I haven’t implemented the code execution pattern myself. This post is me working through Anthropic’s ideas and connecting them to problems I’ve actually seen.
The Two Failure Modes
When an agent connects to MCP servers, there are two distinct ways the architecture bleeds tokens. Both are invisible at small scale, but let’s do the arithmetic and see what happens when you scale up.
Tool Definition Bloat
Most MCP clients load every tool definition into the model’s context before the user even asks a question. Each tool definition includes a name, description, and a full JSON Schema for its parameters. A single tool runs about 200-500 tokens.
Now suppose your agent connects to $n = 20$ MCP servers, each exposing $k = 50$ tools. That’s $n \times k = 1{,}000$ tool definitions. At ~300 tokens each, you’re burning $\sim 300{,}000$ tokens of context before a single user message is processed. The cost scales linearly with the number of tools, and there’s no mechanism in the standard MCP protocol to load them lazily.
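To make that arithmetic concrete, here it is as code (the 300-token average is the same midpoint assumption used above):

```python
# Back-of-envelope cost of eagerly loading every tool definition.
n_servers = 20             # MCP servers the agent connects to
tools_per_server = 50      # tools exposed by each server
avg_tokens_per_tool = 300  # rough midpoint of the 200-500 range

total_tools = n_servers * tools_per_server
context_cost = total_tools * avg_tokens_per_tool

print(total_tools)   # 1000 tool definitions
print(context_cost)  # 300000 tokens before the first user message
```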
I wrote about a version of this in the OpenDota post, specifically about how each MCP tool eats up precious context, and how overlapping descriptions make it harder for the agent to choose the right one. Jeremiah Lowin, the creator of FastMCP, makes the same point. But even if you’re careful about curation, the problem comes back the moment you need to integrate across many services. Real enterprise setups with CRM, calendar, email, database, and monitoring integrations easily hit hundreds of tools.
Intermediate Result Passthrough
The second problem is subtler. When an agent calls a tool directly, the full result gets injected back into the conversation context.
Suppose you ask the agent to summarize a 2-hour meeting transcript fetched from a meeting notes server. That transcript might be $T = 50{,}000$ tokens. With a direct tool call, all $T$ tokens flow through the model’s context. And if the agent needs a follow-up call (say, to check attendee info in a CRM), it processes that same transcript again on the next turn. The cost is at least $T$ tokens per turn, so across $m$ turns it’s $O(m \cdot T)$.
Or suppose you query a dataset with $R = 10{,}000$ rows to find $r = 5$ matching records. All $R$ rows pass through the model even though only $r$ matter. A human would filter first, then read. The naive MCP pattern forces the model to be the filter.
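The same arithmetic for the passthrough problem, using the symbols above ($T$, $m$, $R$, $r$ are from the post; the per-row token count is an illustrative assumption):

```python
# Transcript passthrough: T tokens re-processed on every turn.
T = 50_000  # transcript tokens
m = 3       # conversation turns that carry the transcript
transcript_cost = m * T  # O(m * T) with direct tool calls

# Row filtering: R rows through context vs. r rows after filtering.
R, r = 10_000, 5
tokens_per_row = 50  # illustrative assumption, not from the post
direct_cost = R * tokens_per_row
filtered_cost = r * tokens_per_row

print(transcript_cost)             # 150000
print(direct_cost, filtered_cost)  # 500000 250
```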
Between these two failure modes, what should be a 2,000-token interaction can balloon to 150,000 tokens. That’s 98.7% overhead. The useful signal is a tiny fraction of what the model actually processes.
The Insight
LLMs are already good at writing code. So instead of having the model call tools one-by-one through a chat loop, receiving full results in context each time, have it write a program that calls the tools, processes the results, and returns only what matters.
This is the same principle behind predicate pushdown in Spark or writing SQL with WHERE clauses instead of SELECT * and filtering in your application. In distributed systems, moving computation to the data is almost always cheaper than moving data to the computation. The context window is just another bottleneck with a transfer cost, and the same optimization applies.
The architectural shift is to treat MCP tool definitions not as callable functions for the model, but as a filesystem API the model can import and use from code. Lazy-loading instead of eager-loading.
How It Works
There are three pieces to this pattern.
Organizing Tools as a Filesystem
Instead of registering all tools upfront, you expose them as a navigable directory structure:
```
servers/
├── google-drive/
│   ├── get_document.py
│   ├── list_files.py
│   └── search_files.py
├── salesforce/
│   ├── get_record.py
│   ├── update_record.py
│   └── query_records.py
├── slack/
│   ├── send_message.py
│   ├── get_channel_history.py
│   └── search_messages.py
└── __init__.py
```
The agent starts by listing the servers/ directory (a handful of tokens) and drills into the specific server it needs.
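As a sketch of that first step, here's what the generated code's discovery pass might look like (the toy tree is built locally just to keep the example self-contained):

```python
import tempfile
from pathlib import Path

# Build a toy servers/ tree standing in for the generated tool files.
root = Path(tempfile.mkdtemp())
for server, tool_files in {
    "google-drive": ["get_document.py", "list_files.py"],
    "salesforce": ["query_records.py"],
}.items():
    d = root / "servers" / server
    d.mkdir(parents=True)
    for tool_file in tool_files:
        (d / tool_file).touch()

# Step 1: list the servers (a handful of tokens).
servers = sorted(p.name for p in (root / "servers").iterdir() if p.is_dir())
print(servers)  # ['google-drive', 'salesforce']

# Step 2: drill into only the server the task needs.
tools = sorted(p.name for p in (root / "servers" / "salesforce").glob("*.py"))
print(tools)  # ['query_records.py']
```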
Progressive Disclosure
Rather than the model seeing all $n \times k$ tool schemas at once, you give it a single search interface:
```python
from typing import Literal

async def search_tools(
    query: str,
    detail_level: Literal["name", "description", "full_schema"],
) -> list[ToolInfo]:
    matches = await tool_index.search(query)
    if detail_level == "name":
        return [{"name": t.name} for t in matches]
    elif detail_level == "description":
        return [{"name": t.name, "description": t.description} for t in matches]
    else:
        return matches
```
The model starts broad (“name” level), narrows down, and only pulls full schemas for the 2-3 tools it actually needs. The context cost drops from $O(n \times k)$ to $O(r)$ where $r$ is the number of relevant tools. This is conceptually what Jeremiah Lowin’s guidance on curating tools does at build time, except here it’s happening at runtime.
You might wonder whether this search needs to be semantic (vector embeddings over tool descriptions, etc.). At $n \times k = 1{,}000$ tools, it almost certainly doesn’t. Tool descriptions are written by developers, not arbitrary natural language; keyword or fuzzy matching over a corpus that small is plenty. If your tools are so poorly described that semantic search is the only way to find them, that’s a tool design problem, not a search problem.
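A minimal version of that non-semantic search is just keyword-overlap scoring over the tool names and descriptions. A sketch (the tool dicts here are illustrative, not a real MCP index):

```python
def search_tools(query: str, tools: list[dict], limit: int = 5) -> list[dict]:
    """Rank tools by how many query words hit the name or description."""
    words = set(query.lower().split())

    def score(tool: dict) -> int:
        haystack = f"{tool['name']} {tool['description']}".lower()
        return sum(1 for w in words if w in haystack)

    ranked = sorted(tools, key=score, reverse=True)
    return [t for t in ranked[:limit] if score(t) > 0]

tools = [
    {"name": "query_records", "description": "Run a SOQL query against Salesforce records"},
    {"name": "send_message", "description": "Post a message to a Slack channel"},
    {"name": "get_document", "description": "Fetch a document from Google Drive"},
]
print(search_tools("query salesforce records", tools)[0]["name"])  # query_records
```

At a thousand tools this runs in microseconds, which is the point: no embedding index to build or keep in sync.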
Filtering Data Before It Hits the Model
Instead of the model receiving $R$ rows through the conversation, it writes code that runs in a sandboxed execution environment:
```python
# The model writes this; it runs in a sandbox, not in context
import json

from servers.salesforce import query_records

opportunities = await query_records(
    object="Opportunity",
    query="Amount > 100000 AND StageName = 'Negotiation'",
    fields=["Name", "Amount", "CloseDate", "AccountId"],
)

# Only the r matching records exist here, not R
summary = [
    {"name": opp["Name"], "amount": opp["Amount"], "close_date": opp["CloseDate"]}
    for opp in opportunities.records
]

# This is all the model sees back
print(json.dumps(summary, indent=2))
```
The intermediate data never enters the model’s context window. It’s predicate pushdown for LLM agents.
What Else This Buys You
Privacy
When the model calls tools directly, every intermediate result (PII, credentials, internal metrics) flows through the model’s context. With code execution, intermediate data stays in the sandbox:
```python
import json

from privacy.tokenizer import tokenize

customers = await get_customer_records(region="us-west-2")
safe_customers = [
    {**c, "email": tokenize(c["email"]), "ssn": tokenize(c["ssn"])}
    for c in customers
]
# email becomes "[EMAIL_1]", "[EMAIL_2]", etc.
print(json.dumps(safe_customers, indent=2))
```
The model reasons over placeholders. The actual PII never leaves the execution environment. For anyone dealing with compliance requirements (and having worked at AWS, I know this is a lot of people), this matters.
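The `privacy.tokenizer` module above is a stand-in, not a real library; a minimal implementation that produces stable placeholders could look like this:

```python
from collections import defaultdict

class Tokenizer:
    """Replace sensitive values with stable [LABEL_n] placeholders."""

    def __init__(self):
        self._maps = defaultdict(dict)  # label -> {real value: placeholder}

    def tokenize(self, value: str, label: str = "EMAIL") -> str:
        seen = self._maps[label]
        if value not in seen:
            seen[value] = f"[{label}_{len(seen) + 1}]"
        return seen[value]

    def detokenize(self, placeholder: str, label: str = "EMAIL") -> str:
        # Reverse lookup so real values can be restored inside the sandbox,
        # e.g. when the model's answer needs to reference a specific customer.
        for value, ph in self._maps[label].items():
            if ph == placeholder:
                return value
        raise KeyError(placeholder)

t = Tokenizer()
print(t.tokenize("alice@example.com"))  # [EMAIL_1]
print(t.tokenize("bob@example.com"))    # [EMAIL_2]
print(t.tokenize("alice@example.com"))  # [EMAIL_1] again: stable mapping
```

The stable mapping matters: the model can say "follow up with [EMAIL_1]" and the sandbox can resolve that back to the real address without the address ever entering context.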
State Persistence
Direct tool calls are stateless from turn to turn. With code execution, the agent can write intermediate results to files and build up state across steps:
```python
import json
from pathlib import Path

analysis = await run_expensive_query()
Path("/workspace/analysis_cache.json").write_text(json.dumps(analysis))
# On subsequent turns, the agent reads this instead of re-querying
```
An agent can resume after interruption and avoid redundant API calls across turns.
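A sketch of that read-before-recompute pattern, using a temp directory in place of `/workspace` and a counter standing in for the expensive tool call:

```python
import json
import tempfile
from pathlib import Path

workspace = Path(tempfile.mkdtemp())
cache = workspace / "analysis_cache.json"

calls = 0
def run_expensive_query() -> dict:
    # Stand-in for a slow tool call; counts invocations to show caching works.
    global calls
    calls += 1
    return {"total_pipeline": 4_200_000}

def load_analysis() -> dict:
    if cache.exists():
        return json.loads(cache.read_text())
    result = run_expensive_query()
    cache.write_text(json.dumps(result))
    return result

first = load_analysis()   # misses the cache, runs the query
second = load_analysis()  # reads the cache file instead
print(calls)  # 1
```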
The Tradeoffs
This pattern has real costs.
You’re running LLM-generated code. That code needs to execute in a sandboxed environment with resource limits, network restrictions, and isolation. Whether you use containers, Firecracker microVMs, or something like gVisor, you’re taking on operational overhead that direct tool calls avoid entirely.
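As a toy illustration of the execution side only (this is nowhere near a sandbox on its own; it bounds wall-clock time but provides none of the memory, network, or filesystem isolation that containers, microVMs, or gVisor handle):

```python
import subprocess
import sys

def run_generated_code(code: str, timeout_s: float = 5.0) -> str:
    """Run model-written Python in a separate process with a hard time limit."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout_s,  # raises subprocess.TimeoutExpired on a runaway loop
    )
    if result.returncode != 0:
        # Surface the sandbox's stderr so there's something to debug from.
        raise RuntimeError(result.stderr.strip())
    return result.stdout

out = run_generated_code("print(sum(range(10)))")
print(out.strip())  # 45
```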
With direct tool calls, the MCP client mediates every interaction. With code execution, the model has more autonomy; it can compose calls, loop, and manipulate data in ways that are harder to audit. You need good logging, execution time limits, and ideally a human-in-the-loop for anything sensitive. Everything breaks all the time in production, and LLM-generated code is no exception. When generated code fails, you’re debugging code you didn’t write, running in a sandbox you might not have direct access to. Good observability and execution traces are the only way to stay sane here; something like Lambda Powertools would help a lot.
And not every use case justifies this. If your agent connects to 3-5 MCP servers with 20 total tools, direct tool calls are fine. The complexity of code execution pays for itself at dozens of servers, hundreds of tools, or large intermediate datasets. I wouldn’t have needed any of this for the OpenDota MCP server, but I can see needing it the moment you’re integrating across multiple enterprise services.
Closing Thoughts
A lot of people are saying “MCP is dead, just give agents a CLI.” And reading through Anthropic’s post, I can see why people are arriving at that conclusion: if the answer to scaling MCP is “have the model write code that calls tools in a sandbox,” then what exactly is MCP buying you over a well-structured CLI with good --help output?
I think the answer is that MCP still solves the discovery and interface problem. A standardized protocol for “here are my tools, here are their schemas” is valuable even if the optimal interaction pattern on top of it is code execution rather than direct function dispatch. The protocol and the execution model are separate concerns. Throwing out MCP because direct tool calls don’t scale is like throwing out REST because naive polling doesn’t scale; you fix the interaction pattern, not the interface.
Most agents today use the “load all tools, call them directly” pattern, and that works for getting started. If you want to see what that looks like in practice, I wrote about building an MCP server for the OpenDota API, including deployment to Lambda and the lessons learned from auto-generating tools. But as the number of available MCP servers grows, the agents that scale will be the ones that treat tool access as a code generation problem, not a function dispatch problem.