Same answers, up to 91% fewer tokens: we measured it on a frontier model
A graded benchmark of flat MCP tool exposure vs Toolport's lazy discovery on GPT-5.5. Every task completed correctly in both modes; lazy used 74-91% fewer tokens, and the gap grows as you add servers.
We have been saying that loading every MCP server’s full tool list into context on every request is a tax you pay constantly. It is easy to claim a token reduction. It is harder to prove you did not trade accuracy for it. So we measured it properly: a frontier model, the same tasks both ways, and every run graded on whether the answer was actually correct, not just whether the agent stopped.
The result: identical task success, and 74% to 91% fewer tokens, with the gap widening as you connect more servers.
The setup
- Two modes, same tasks, same model.
- Flat: every tool from every connected server is exposed directly. The normal MCP setup.
- Lazy: Toolport advertises 3 meta-tools (toolport_status, toolport_search_tools, toolport_call_tool) and the agent searches for what it needs.
- Model: GPT-5.5. We used a frontier model on purpose, so model capability isn’t the variable and both modes can actually finish the work.
- Tasks (5 runs each): list Stripe products, list Neon projects, list Vercel projects (the last is a two-step task that has to find a team ID first).
- Graded for correctness. A run only counts if the agent’s final answer contains the real items from the account. “Completed” can’t hide a wrong answer or an “I couldn’t do that.”
- Swept across 3 and 6 connected servers to see how the gap scales.
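The grading rule above is simple enough to sketch. This is a minimal illustration, not the harness from the repo: the function name `grade_run`, the case-insensitive substring match, and the product names are all assumptions made for the example.

```python
def grade_run(final_answer: str, expected_items: list[str]) -> bool:
    """Return True only if every expected item appears in the final answer."""
    if not expected_items:
        # An empty expectation would let any answer pass, so refuse to grade it.
        raise ValueError("expected_items must not be empty")
    answer = final_answer.casefold()
    return all(item.casefold() in answer for item in expected_items)


# Hypothetical product names standing in for real account data.
expected = ["Starter Plan", "Pro Plan"]

print(grade_run("Your products are Starter Plan and Pro Plan.", expected))  # True
print(grade_run("I couldn't do that.", expected))                           # False
print(grade_run("I found one product: Starter Plan.", expected))            # False
```

The third call is the case that matters: an agent that stops cleanly with a partial answer still fails the run.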
The numbers
Total tokens to complete the three tasks, median of 5 runs, both modes graded:
| Servers | Tools | Flat | Lazy | Reduction | Correct (flat / lazy) |
|---|---|---|---|---|---|
| 3 | 63 | 179,181 | 47,095 | 74% | 15/15 · 15/15 |
| 6 | 183 | 471,775 | 40,354 | 91% | 15/15 · 15/15 |
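The Reduction column follows directly from the Flat and Lazy totals. A short check, using only the numbers in the table (the helper name `reduction_pct` is ours, for illustration):

```python
def reduction_pct(flat_tokens: int, lazy_tokens: int) -> float:
    """Percentage of flat-mode tokens that lazy mode avoids."""
    return (1 - lazy_tokens / flat_tokens) * 100


# (servers, flat total, lazy total), copied from the table above.
rows = [(3, 179_181, 47_095), (6, 471_775, 40_354)]

for servers, flat, lazy in rows:
    print(f"{servers} servers: {reduction_pct(flat, lazy):.0f}% fewer tokens")
# 3 servers: 74% fewer tokens
# 6 servers: 91% fewer tokens
```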
Two things matter here.
Nobody traded accuracy for tokens. Every task completed correctly in both modes: 30 of 30 graded runs in flat, 30 of 30 in lazy. The cheaper option was not the dumber option.
The savings grow with your catalog. Look at what happens going from 3 to 6 servers: flat mode’s cost more than doubled (179K to 472K tokens) because it re-sends every tool’s schema on every call. Lazy mode’s cost did not grow at all, and in these runs it dipped slightly (47K to 40K), because its tool-definition overhead is a constant ~450 tokens no matter how many servers you connect. The per-request tool-definition overhead tells the same story: flat climbs from 19,002 to 51,533 tokens, lazy stays at a constant 451.
That is the whole point of lazy discovery. The more tools you connect, the more flat exposure costs you on every single turn, and the more lazy discovery saves.
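To make the mechanism concrete, here is a toy model of the two exposure modes. It is a sketch under stated assumptions, not Toolport's implementation: the `LazyCatalog` class, its method signatures, the keyword search, and the five example tools are all hypothetical. Only the three meta-tool names come from the setup above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    name: str
    description: str
    handler: Callable[[dict], object]


class LazyCatalog:
    """Toy stand-in for a lazy tool catalog; not Toolport's real implementation."""

    # The only definitions the model sees up front in lazy mode.
    META_TOOLS = ("toolport_status", "toolport_search_tools", "toolport_call_tool")

    def __init__(self, tools: list[Tool]) -> None:
        self._tools = {tool.name: tool for tool in tools}

    def advertised_flat(self) -> list[str]:
        # Flat mode: every tool definition goes into context on every request.
        return sorted(self._tools)

    def advertised_lazy(self) -> list[str]:
        # Lazy mode: a fixed set, independent of catalog size.
        return list(self.META_TOOLS)

    def search_tools(self, query: str) -> list[str]:
        # Naive keyword match (an assumption); a real search may rank differently.
        words = query.casefold().split()
        return [
            tool.name
            for tool in self._tools.values()
            if all(w in f"{tool.name} {tool.description}".casefold() for w in words)
        ]

    def call_tool(self, name: str, arguments: dict) -> object:
        return self._tools[name].handler(arguments)


# Hypothetical tools; the handlers return canned data instead of calling an API.
catalog = LazyCatalog([
    Tool("stripe_list_products", "List products in a Stripe account", lambda a: ["prod_demo"]),
    Tool("stripe_create_refund", "Refund a Stripe charge", lambda a: "refunded"),
    Tool("neon_list_projects", "List Neon database projects", lambda a: ["proj_demo"]),
    Tool("vercel_list_teams", "List Vercel teams", lambda a: ["team_demo"]),
    Tool("vercel_list_projects", "List Vercel projects for a team", lambda a: ["app_demo"]),
])

matches = catalog.search_tools("stripe products")
print(len(catalog.advertised_flat()), len(catalog.advertised_lazy()))  # 5 3
print(matches)                            # ['stripe_list_products']
print(catalog.call_tool(matches[0], {}))  # ['prod_demo']
```

Adding a sixth or a six-hundredth tool changes the length of `advertised_flat()` but not `advertised_lazy()`, which is the scaling behavior the numbers above show.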
Where this goes at scale
The jump from 3 to 6 servers already shows the trend. With a real-world catalog it gets dramatic. We measured a live setup of 415 tools across 14 servers (no model needed, just the size of the tool definitions): 164,880 tokens of definitions on every request, which Toolport collapses to 660. That is a 99.6% reduction, and on a 200K-context model those definitions alone eat 82% of the window before you have typed a word. The full breakdown, including the dollar math, is in the benchmark.
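The arithmetic behind those two percentages, using only the figures quoted above (the variable names are ours):

```python
flat_definition_tokens = 164_880  # 415 tools across 14 servers
lazy_definition_tokens = 660      # the same catalog behind lazy discovery
context_window = 200_000          # the 200K-context example
tool_count = 415

window_share = flat_definition_tokens / context_window * 100
reduction = (1 - lazy_definition_tokens / flat_definition_tokens) * 100
tokens_per_tool = flat_definition_tokens / tool_count

print(f"definitions fill {window_share:.0f}% of the window")      # 82%
print(f"lazy cuts definition overhead by {reduction:.1f}%")       # 99.6%
print(f"about {tokens_per_tool:.0f} tokens per tool definition")  # 397
```

The last line is a derived average, not a separate measurement: it says each definition in this catalog costs roughly 400 tokens, which is why flat overhead tracks tool count so closely.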
Being honest about it
- This is one frontier model, one machine, three read-only tasks, five runs each. Treat the direction, a large and consistent reduction at equal correctness, as the signal, not the exact percentage. (The definition-overhead numbers are deterministic and need no such hedge.)
- Lazy discovery adds search round-trips. The totals above are already net of that. For a single tiny server it is overkill; the trade pays off once you pass a handful of tools.
- We grade correctness rather than eyeballing it, and the harness is in the repo, so you can run the same thing against your own servers and your own model.
Run it yourself
The graded sweep, the token calculator, and the full method are open source.