
MCP on local models: what actually works

Local models pay the highest price for MCP tool bloat and get the least honest advice about it. Here is what works today with LM Studio, Jan, Goose and Open WebUI, what a 7B model can and can't do, and when to use lazy discovery versus the full catalog.

Here is an uncomfortable fact about running MCP servers against a local model: the machines that can least afford tool bloat are the ones paying the most for it.

A frontier model behind an API can shrug off 24,000 tokens of tool definitions. The bill goes up, the answers stay fine, and the prefill happens on a datacenter GPU you never see. Your local setup gets none of those mercies. Those same 24,000 tokens get processed on your hardware, every single turn, before the model reads a word you typed. They squat in your KV cache. And they eat into a usable context that was probably 8k to 32k to begin with.

Cloud users experience tool bloat as a bigger invoice. You experience it as a spinner.

The math is worse at home

In our benchmark catalog, an average MCP tool definition runs about 397 tokens. Three ordinary servers put roughly 24k tokens of schemas in front of every request. At consumer prompt-processing speeds that is seconds of prefill per turn, spent reading a phone book the model mostly won’t use, and it happens again every time the conversation grows past what the cache can hold.
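To put rough numbers on that, here is a back-of-envelope sketch. The 397-token average and the 24k total come from the figures above; the prompt-processing speeds are placeholder assumptions, not measurements, so substitute whatever your own hardware reports.

```python
# Back-of-envelope prefill cost of carrying every tool schema on every turn.
TOKENS_PER_TOOL = 397    # average tool definition size quoted above
SCHEMA_TOKENS = 24_000   # roughly what three ordinary servers add


def implied_tool_count(schema_tokens: int, tokens_per_tool: int) -> float:
    """How many average-sized tool definitions fit in the schema budget."""
    return schema_tokens / tokens_per_tool


def prefill_seconds(prompt_tokens: int, tokens_per_second: float) -> float:
    """Time to process the prompt once, ignoring any cache reuse."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return prompt_tokens / tokens_per_second


if __name__ == "__main__":
    tools = implied_tool_count(SCHEMA_TOKENS, TOKENS_PER_TOOL)
    print(f"about {tools:.0f} tool definitions")
    # Hypothetical prompt-processing speeds in tokens per second.
    for speed in (250.0, 500.0, 1000.0):
        seconds = prefill_seconds(SCHEMA_TOKENS, speed)
        print(f"{speed:>6.0f} tok/s: {seconds:.0f} s of prefill for schemas alone")
```

This is a worst case for any single turn: when the prompt prefix is still cached, the schemas are not reprocessed. The cost returns whenever the cache is evicted or the prefix changes.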

That is the case for a gateway in one sentence: your agent should load three small meta-tools and search for the rest on demand, instead of carrying every schema everywhere. We measured that swap at up to 91% fewer tool tokens at the same graded task success on a frontier model. On local hardware the same cut is worth more, because you are buying back seconds and VRAM, not just tokens.
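As a rough illustration of what that cut buys on a small context window, the sketch below applies the 91% figure to the 24k baseline. Treat it as an upper bound under an assumption: the 91% was the best case measured on our benchmark, and your catalog may compress less. The context window sizes are examples only.

```python
def tokens_after_cut(baseline_tokens: int, reduction: float) -> int:
    """Tool tokens remaining after cutting `reduction` (0.0 to 1.0) of the baseline."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0.0 and 1.0")
    return round(baseline_tokens * (1.0 - reduction))


def context_share(tool_tokens: int, context_window: int) -> float:
    """Fraction of the context window taken up by tool definitions."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    return tool_tokens / context_window


if __name__ == "__main__":
    baseline = 24_000
    lazy = tokens_after_cut(baseline, 0.91)  # best-case figure quoted above
    for window in (8_192, 16_384, 32_768):   # example context sizes
        print(
            f"{window:>6} ctx: full catalog {context_share(baseline, window):.0%}, "
            f"lazy discovery {context_share(lazy, window):.0%}"
        )
```

A share above 100% means the full catalog does not fit in that window at all, before the conversation even starts.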

The catch nobody puts in the README

Lazy discovery asks the model to do something before it does the thing you asked: notice it needs a tool, search for it, read the result, then call what it found. That is a two-hop plan with tool calls on both hops.
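The two hops can be sketched in a few lines. This is a toy model of the pattern, not Toolport's implementation: the `search_tools` and `call_tool` names, the in-memory catalog, and the keyword matching are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    handler: Callable[..., Any]


# Hypothetical catalog standing in for the tools behind a gateway.
CATALOG = {
    tool.name: tool
    for tool in (
        Tool("read_file", "Read a text file from disk",
             lambda path: f"<contents of {path}>"),
        Tool("list_issues", "List open issues in a repository",
             lambda repo: [f"{repo}#1"]),
    )
}


def search_tools(query: str) -> list[dict[str, str]]:
    """Hop 1: return short summaries of tools matching any query word."""
    words = query.lower().split()
    return [
        {"name": tool.name, "description": tool.description}
        for tool in CATALOG.values()
        if any(word in f"{tool.name} {tool.description}".lower() for word in words)
    ]


def call_tool(name: str, arguments: dict[str, Any]) -> Any:
    """Hop 2: invoke a tool the model discovered through search."""
    if name not in CATALOG:
        raise KeyError(f"unknown tool: {name}")
    return CATALOG[name].handler(**arguments)


if __name__ == "__main__":
    hits = search_tools("issues")  # the model decides it needs a tool
    print(hits)
    print(call_tool(hits[0]["name"], {"repo": "example/repo"}))
```

The model has to get both calls right, in order, and carry the tool name from the first result into the second call. That hand-off is where small models tend to slip.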

Strong models do this without noticing. Small ones often don’t. In our testing, 7B-class models routinely fail the flow, not because the gateway is complicated but because reliable multi-step tool calling is exactly what small models are bad at. If your daily driver is a 7B, lazy discovery will frustrate you, and no gateway vendor should tell you otherwise.

So here is the honest decision tree:

  • Running a capable model (one of the larger local model classes, or anything that handles multi-step tool calls dependably)? Use lazy discovery. Flat context, fast prefill, full catalog one search away.
  • Running something smaller? Have Toolport expose the full catalog instead, and keep your server set lean. Any model with basic tool calling can use it, and you still get the rest of the gateway: one config shared by every client, keys in the OS keychain instead of scattered JSON files, and rug-pull and tool-poisoning detection on everything.

Either way, nothing leaves your machine. The gateway is a local process. If your servers are local too, the whole loop runs offline, which is presumably why you went local in the first place.

Client by client

The local-model apps people actually use are all one toggle away:

  • LM Studio reads ~/.lmstudio/mcp.json. Toolport writes its gateway entry there and your servers appear in tool-enabled chats.
  • Jan keeps its config in mcp_config.json under the app’s data directory. Same deal: one entry, every server.
  • Goose stores extensions in config.yaml next to your model settings. Toolport edits only its own entry and leaves the rest of the file alone.
  • Open WebUI is the interesting one: it doesn’t read MCP configs at all, it consumes OpenAPI tool servers. Toolport’s gateway speaks HTTP/OpenAPI natively, so you point Open WebUI at http://localhost:8765/openapi.json and skip the mcpo bridge entirely. We validated that path end to end.

The full list, with exact paths per OS, is on the clients page.
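If you want to see what an OpenAPI client would be offered before wiring it up, a small script can list the operations in the document. This sketch relies only on the standard OpenAPI layout (a `paths` object keyed by path, then by HTTP method). How the gateway names or groups its operations is not something it assumes.

```python
import json
from urllib.request import urlopen

# HTTP methods that the OpenAPI spec allows as keys inside a path item.
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def list_operations(spec: dict) -> list[tuple[str, str, str]]:
    """Return sorted (METHOD, path, operationId) tuples from an OpenAPI document."""
    operations = []
    for path, item in spec.get("paths", {}).items():
        for method, operation in item.items():
            # Path items can also hold non-operation keys such as "parameters".
            if method.lower() in HTTP_METHODS and isinstance(operation, dict):
                operations.append(
                    (method.upper(), path, operation.get("operationId", ""))
                )
    return sorted(operations)


def fetch_spec(url: str, timeout: float = 5.0) -> dict:
    """Download and parse an OpenAPI JSON document."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

With the gateway running, `list_operations(fetch_spec("http://localhost:8765/openapi.json"))` should return whatever operations Open WebUI will be able to see. An empty list means the document has no paths.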

What to actually do

Start with the model you have and be realistic about it. If it can chain tool calls, turn on lazy discovery and enjoy a context that stays flat at three servers or thirty. If it can’t, run the full catalog through the gateway anyway, keep the server count low, and you have still consolidated your configs, locked up your keys, and put a watcher on every tool definition.

Then open Toolport’s Activity view and watch a few requests go through. Seeing what your model actually does with tools (the search hits, the misses, the calls it makes) tells you more about whether it can handle lazy discovery than any parameter-count rule of thumb.

Toolport is free and open source. Local gateway, local keys, and the token math finally on your side.
