

Guillaume Lebedel · 11 min
MCP Code Mode: Keeping Tool Responses Out of Agent Context



Your agent just fetched 50 emails, and 20,000 tokens of headers, metadata, timestamps, thread IDs, and body text flowed into the context window. The agent needed 3 subject lines.

This is response bloat: the gap between what an agent asks for and what it gets back. MCP tools return raw API responses because that’s what APIs do. The model receives all of it, processes it, and produces a summary. But by then the context damage is done. Those 20K tokens are baked in for the rest of the conversation.

Multiply this across a multi-step workflow touching 3-4 providers and you’re looking at 50-150K tokens of raw JSON that the model never needed to see.

Why code_execution Doesn’t Solve Agent Context Bloat

Anthropic ships a code_execution tool in Claude. The agent can write and run Python code during a conversation. The Anthropic documentation positions it as a way to process data, run calculations, and analyze results.

It sounds like it solves the response bloat problem. Agent gets data, writes code to process it, returns the summary.

But walk through the actual flow:

  1. Agent has 916 MCP tools + the code_execution tool in its context
  2. User asks: “Find open Jira bugs that don’t have a linked PR in GitHub”
  3. Agent calls jira_list_issues normally through MCP
  4. Raw JSON response enters the context window: ~25K tokens
  5. Agent calls github_list_pull_requests normally through MCP
  6. More raw JSON enters context: ~30K tokens
  7. Agent writes Python code using code_execution to cross-reference the data
  8. Code runs, produces the filtered list

The agent got its answer. But 55K tokens of raw data are sitting in the context window. The code processed data that was already in context. The damage to the agent’s context window happened at step 4, not step 7.
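The cumulative cost can be sketched with the approximate figures from steps 4 and 6 (illustrative numbers, not measured tokenizer counts):

```typescript
// Illustrative sketch of context growth in the code_execution flow above.
// Token counts are the rough estimates from steps 4 and 6; real numbers vary.
const rawResponses = [
  { tool: "jira_list_issues", tokens: 25_000 },
  { tool: "github_list_pull_requests", tokens: 30_000 },
];

// By the time code_execution runs (step 7), the raw JSON is already
// committed to the context window and cannot be removed retroactively.
const committed = rawResponses.reduce((sum, r) => sum + r.tokens, 0);
console.log(`tokens in context before any code runs: ${committed}`);
```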

Source: measured on StackOne’s code mode benchmark (single MCP workflow, 96% token reduction).

MCP Code Mode: Keeping Tool Responses in the Sandbox

Custom code mode flips the architecture. The agent doesn’t have 916 tools. It has 2:

const codeModeTools = [
  {
    name: "search_tools",
    description: "Search for relevant tools by natural language query",
    input_schema: {
      type: "object",
      properties: { query: { type: "string" } },
      required: ["query"]
    }
  },
  {
    name: "execute_code",
    description: "Execute TypeScript code in a sandbox with pre-configured MCP tool wrappers. All discovered tools are available as async functions on the global `tools` object.",
    input_schema: {
      type: "object",
      properties: {
        code: {
          type: "string",
          description: "TypeScript code to execute. Use `tools.tool_name(args)` to call MCP tools. Return a string summary for the user."
        }
      },
      required: ["code"]
    }
  }
];

The agent first searches for the tools it needs (progressive discovery, same as before). Then it writes TypeScript that calls those tools from inside a sandbox. The raw API responses stay in the sandbox. Only the return value, a small string summary, comes back into the LLM’s context.

Here’s what the agent generates for the same Jira+GitHub task, extended to email the results:

// Agent-generated code, executed in sandbox
const bugs = await tools.jira_list_issues({
  query: { type: "Bug", status: "Open" }
});
const prs = await tools.github_list_pull_requests({
  query: { state: "open" }
});

// Cross-reference: find bugs with no matching PR
const prKeys = prs.data
  .map(p => p.title.match(/[A-Z]+-\d+/))
  .flat()
  .filter(Boolean);

const unlinked = bugs.data.filter(b => !prKeys.includes(b.key));

// Send the email
await tools.gmail_send_message({
  to: "guillaume@stackone.com",
  subject: `${unlinked.length} bugs without linked PRs`,
  body: unlinked.map(b => `${b.key}: ${b.title}`).join("\n")
});

return `Found ${unlinked.length} open bugs without linked PRs. Sent list to your inbox.`;

In StackOne’s benchmark, a single MCP workflow generated 55,780 characters of raw JSON inside the sandbox (approximately 14K tokens). What returned to the LLM context was a 500-token summary. That’s a 96% token reduction. The raw data never left the sandbox.
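The arithmetic checks out under the common ~4 characters-per-token heuristic (an approximation, not a tokenizer count):

```typescript
// Sanity-check the benchmark numbers with the ~4 chars/token heuristic.
const rawChars = 55_780;
const approxRawTokens = Math.round(rawChars / 4); // ≈13,945, i.e. "approximately 14K"
const summaryTokens = 500;

const reduction = 1 - summaryTokens / approxRawTokens;
console.log(`≈${approxRawTokens} raw tokens in the sandbox → ${summaryTokens} in context`);
console.log(`reduction ≈ ${(reduction * 100).toFixed(0)}%`);
```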

Sandbox Architecture: Bridge, Auth, Tracing, and Containment

The sandbox needs four properties: tool access, auth isolation, call tracing, and containment.

Tool access is the tricky part. The agent writes code that calls MCP tools, but the sandbox can’t hold an MCP connection directly. Instead, the sandbox runner registers tool wrappers as global functions:

// sandbox-runner.mjs — tool registration
globalThis.tools = {};
let rpcId = 0;
for (const tool of toolDefs) {
  const fnName = tool.name.replace(/-/g, "_");
  globalThis.tools[fnName] = async (args) => {
    const response = await fetch(mcpBaseUrl, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        jsonrpc: "2.0",
        id: ++rpcId, // a JSON-RPC request without an id is a notification and gets no response
        method: "tools/call",
        params: { name: tool.name, arguments: args }
      })
    });
    const result = await response.json();
    return result.result;
  };
}

Auth is baked into the MCP server, not the sandbox. The sandbox calls tools through an HTTP bridge to the MCP server, which handles all authentication with downstream providers. The agent’s code never sees API keys or OAuth tokens.
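A minimal sketch of the server side of that bridge, assuming a per-provider credential store (`keyVault` and the naming convention below are illustrative, not StackOne's actual API):

```typescript
// Hypothetical sketch: the MCP server (not the sandbox) attaches provider
// credentials before forwarding a tool call downstream. Sandbox code only
// ever sees tool results, never these headers.
const keyVault: Record<string, string> = {
  jira: process.env.JIRA_TOKEN ?? "",
  github: process.env.GITHUB_TOKEN ?? "",
  gmail: process.env.GMAIL_TOKEN ?? "",
};

function authHeadersFor(toolName: string): Record<string, string> {
  // Assumed convention: tool names are prefixed with the provider,
  // e.g. "jira_list_issues" -> "jira".
  const provider = toolName.split("_")[0];
  return {
    Authorization: `Bearer ${keyVault[provider] ?? ""}`, // injected server-side only
    "Content-Type": "application/json",
  };
}
```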

Call tracing records every tool invocation the sandbox makes. This matters for auditability. When the agent writes code that sends an email, you need a log of exactly what was sent, to whom, and when. The trace also helps debugging when agent-generated code produces wrong results.
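Tracing can be added without touching agent code by wrapping each tool function at registration time (the trace format here is a sketch, not StackOne's schema):

```typescript
// Sketch: wrap each registered tool function so every invocation is traced.
type TraceEntry = { tool: string; args: unknown; at: string; ms: number };
const trace: TraceEntry[] = [];

function withTracing<A, R>(name: string, fn: (args: A) => Promise<R>) {
  return async (args: A): Promise<R> => {
    const start = Date.now();
    try {
      return await fn(args); // result flows back to the sandbox unchanged
    } finally {
      // Record tool name, arguments, timestamp, and duration for the audit log.
      trace.push({ tool: name, args, at: new Date().toISOString(), ms: Date.now() - start });
    }
  };
}
```

Registering with `globalThis.tools[fnName] = withTracing(tool.name, wrapper)` means the agent's generated code calls tools exactly as before, while every send, create, or delete lands in the trace.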

Containment means no filesystem access, no network access beyond the MCP bridge, and a timeout. If agent-generated code enters an infinite loop or tries something unexpected, the sandbox kills it after 30 seconds.
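The timeout half of containment can be sketched with a deadline race (a simplification: a real sandbox terminates the isolate or process, since `Promise.race` alone does not stop the code from running):

```typescript
// Sketch: race agent-generated code against a 30-second deadline.
// Filesystem/network lockdown is enforced by the runtime, not shown here.
const SANDBOX_TIMEOUT_MS = 30_000;

async function runWithTimeout<T>(
  run: () => Promise<T>,
  ms: number = SANDBOX_TIMEOUT_MS
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`sandbox killed after ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([run(), deadline]);
  } finally {
    clearTimeout(timer);
  }
}
```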

Trade-offs: MCP Code Mode vs Direct Tool Calling

Code mode isn’t free. Three things get harder.

Approval UX changes. With standard tool calling, the user sees “Agent wants to call jira_list_issues with these arguments.” That’s easy to review. With code mode, the user sees a block of TypeScript. Reviewing a 15-line code block is harder than reviewing a tool call. For high-trust environments (internal tools, automated pipelines), this is fine. For customer-facing agents where every action needs explicit approval, the UX needs work.

Error handling shifts to the agent. When a direct tool call fails, the agent sees the error and can retry or try a different approach. When code in the sandbox fails, the agent sees a generic error from the sandbox. The error message needs to include enough context (stack trace, failed tool name, response code) for the agent to diagnose and fix the code.
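One way to give the agent that context is a structured error envelope, flattened into the string the sandbox returns on failure (field names below are illustrative):

```typescript
// Sketch: a structured error shape returned to the agent when sandbox
// code fails, so it can diagnose and rewrite rather than guess.
interface SandboxError {
  kind: "tool_error" | "runtime_error" | "timeout";
  tool?: string;   // which tools.* call failed, if any
  status?: number; // HTTP status from the MCP bridge, if any
  message: string;
  stack?: string;  // trimmed stack trace from the agent's code
}

function formatForAgent(err: SandboxError): string {
  return [
    `[${err.kind}]`,
    err.tool && `tool=${err.tool}`,
    err.status && `status=${err.status}`,
    err.message,
  ]
    .filter(Boolean)
    .join(" ");
}
```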

Not everything benefits from code mode. Simple single-tool calls don’t need a sandbox. If the agent just needs to create one Jira ticket, calling the tool directly is simpler and cheaper than spinning up a sandbox. Code mode shines on multi-step, multi-provider workflows where data processing happens between tool calls.

| Factor | Direct Tool Calling | Code Mode |
| --- | --- | --- |
| Context usage | Full raw response in context | Only summary in context |
| Multi-step workflows | Multiple round trips | Single code block |
| Approval UX | Clear tool+args review | Code block review |
| Error recovery | Agent retries naturally | Needs structured errors |
| Simple tasks | Low overhead | Unnecessary complexity |
| Infrastructure | None | Sandbox runtime needed |

Code mode excels for multi-provider workflows. Direct calling wins for simple, single-tool tasks.

How Cloudflare, Block, and Anthropic Use Code Mode

Cloudflare coined “Code Mode” and ships it with their Agents SDK. Their implementation uses V8 isolates for sandboxing, which gives sub-millisecond cold starts and strong isolation. They measured 81% token reduction in their benchmarks.

Block’s Goose team, writing about code mode, observed: “LLMs are better at writing code to call MCP, than at calling MCP directly.” Their observation matches what we see. When an agent writes code, it can express conditional logic, loops, and data transformations that would otherwise require multiple tool-call round trips.

Anthropic’s research on code execution with MCP measures large token reductions on real workflows. Their implementation differs from StackOne’s (their code_execution tool is a first-party model feature, not a sandbox architecture), but the insight is the same: raw tool responses are the biggest context cost.

When to Use MCP Code Mode vs Direct Tool Calls

The decision comes down to how much intermediate data hits the context window.

| Use Code Mode When | Use Direct Tool Calling When |
| --- | --- |
| Workflow touches 2+ providers | Single tool call, simple arguments |
| Cross-referencing or filtering data between calls | Response is small (under 2K tokens) |
| Raw API responses exceed 10K tokens | User needs to approve each individual action |
| Task involves loops or conditional logic | Error handling needs to be explicit and visible |

The architecture isn’t either/or. At StackOne, agents switch between modes dynamically. Simple queries get direct tool calls, multi-step workflows get routed to code mode. For a concrete setup guide showing how CLI and MCP tools coexist in Claude Code, see the two-loop CLI + MCP architecture. The decision point is straightforward: if the task will generate more than 10K tokens of intermediate data, code mode keeps the raw data out of context entirely.
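The routing heuristic above can be sketched as a small predicate (the inputs and thresholds are assumptions drawn from this section, not StackOne's actual router):

```typescript
// Sketch of dynamic mode routing: direct calls for simple queries,
// code mode for multi-provider or data-heavy workflows.
function shouldUseCodeMode(task: {
  providers: number;               // distinct providers the workflow touches
  estimatedResponseTokens: number; // expected intermediate data volume
  needsPerActionApproval: boolean; // customer-facing, action-by-action review
}): boolean {
  if (task.needsPerActionApproval) return false; // keep each action reviewable
  if (task.providers >= 2) return true;          // cross-provider workflow
  return task.estimatedResponseTokens > 10_000;  // bulky intermediate data
}
```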
