Smart MCP Mapping: How AI Selects the Right Tool From 120+ Options

Maxime Champoux · 10 min read

The average enterprise now connects to 47 SaaS tools. For an AI assistant trying to act on a user request, each of those tools is a possible destination. When you run 120+ MCP connectors and are building toward 500, the question stops being whether the AI can call the right tool. It becomes whether the AI can even find it.

This is the routing problem at the center of Well's architecture. We did not set out to build a routing layer. We set out to build an AI that operates business tools on behalf of founders and teams. The routing layer emerged because the obvious approach failed.

The Naive Approach: Let the Model Figure It Out

When we started, we had twelve connectors. Slack, Gmail, Google Calendar, Notion, a handful of others. The integration strategy was simple: describe every tool in the system prompt, pass the user query, let the LLM pick the right one.

At twelve connectors, this worked. The model had enough context to distinguish between "send a message on Slack" and "create an event in Google Calendar." Token cost was manageable. Latency was acceptable. Accuracy sat around 94%.

Then we added connectors thirteen through forty.

Three things broke simultaneously. First, token costs. Each tool description runs 200 to 800 tokens. Forty tools meant 8,000 to 32,000 tokens of tool descriptions alone, before the user even said anything. At scale, this was burning cash on context that the model mostly ignored.
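The token arithmetic is easy to reproduce. The following is a minimal sketch that assumes only the 200 to 800 tokens-per-description range quoted above; the function name is illustrative and not part of any real system.

```python
# Per-description token range quoted in the article.
MIN_TOKENS_PER_TOOL = 200
MAX_TOKENS_PER_TOOL = 800


def description_overhead(num_tools: int) -> tuple[int, int]:
    """Return (low, high) prompt tokens spent on tool descriptions alone."""
    return num_tools * MIN_TOKENS_PER_TOOL, num_tools * MAX_TOKENS_PER_TOOL


for n in (12, 40, 120, 500):
    low, high = description_overhead(n)
    print(f"{n:>3} tools: {low:,} to {high:,} tokens before the user query")
```

The overhead grows linearly with the connector count and is paid on every request, whether or not the model ever uses most of those descriptions.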

Second, latency. More tokens in, slower response out. Users noticed.

Third, and worst: accuracy dropped. At forty connectors, the model started confusing tools with overlapping capabilities. "Show me overdue invoices" could reasonably route to QuickBooks, Xero, Stripe, or Well's native financial data. The model would pick one, often wrong, with high confidence. It did not know which accounting tool the customer actually used. It guessed.

We measured this. Routing accuracy fell from 94% at twelve connectors to 72% at forty. Projected forward to 120+ connectors, the math was grim. This was not a prompting problem. You cannot prompt-engineer your way out of combinatorial explosion.

The Combinatorial Problem

Consider what happens when a user types "Show me overdue invoices."

The AI must determine: which financial tool is connected? Which one contains invoice data? Does the user's workspace use QuickBooks or Xero? Are there invoices in both? Should it query one or merge results from several?

With 120+ connectors, the selection space is far larger still. Many queries could legitimately hit two or three tools. Some tools expose dozens of sub-capabilities. The actual decision space for a single query can exceed several hundred candidate actions. The model must pick correctly on the first try, because a wrong route means a wrong answer, and wrong answers erode trust fast.

We tried several patches. We grouped tools by category. We added instructions telling the model to prefer certain tools for certain query types. We experimented with two-stage selection: first pick the category, then pick the tool.

Each patch improved accuracy by a few percentage points and added fragility. The category groupings were static. When we added a new connector, we had to manually update the taxonomy. The two-stage approach introduced a failure mode where the model picked the right category but the wrong tool within it. Every fix created a new edge case.

After three months of patching, we accepted the core insight: tool selection is itself an AI problem. It cannot be solved with configuration, taxonomies, or better prompts. It requires a system that learns which tools handle which queries, and improves over time.

Skills: Learned Patterns for Tool Routing

The solution we built is called the skill layer. A skill is a learned association between a type of query and a specific tool or set of tools. Skills are not hand-written rules. They are patterns extracted from successful tool invocations and refined through feedback.

Here is how it works in practice. When a user asks "Show me overdue invoices" and the system successfully routes to QuickBooks and returns correct data, that interaction becomes a training signal. The skill layer records: for this workspace, invoice queries route to QuickBooks. The next time a similar query arrives, the skill layer narrows the candidate set before the LLM ever sees the options.

Skills operate at three levels of specificity. Workspace-level skills capture which tools a particular customer uses. If a workspace has QuickBooks connected but not Xero, invoice queries should never consider Xero. This sounds obvious, but the naive approach had no way to encode it without stuffing the information into every prompt.

Query-type skills capture patterns across workspaces. Invoice queries go to accounting tools. Message queries go to communication tools. Calendar queries go to scheduling tools. These are broader patterns that apply generally.

Intent-level skills handle ambiguity. "Send the invoice to John" is an invoice query that requires both an accounting tool (to fetch the invoice) and a communication tool (to send it). The skill layer learns multi-tool routing patterns, where a single user intent maps to an orchestrated sequence of tool calls.

The result is that instead of presenting 120+ tools to the model, the skill layer narrows the candidate set to three or four. The model picks from a small, relevant set. Accuracy recovered to 95%, slightly above the 94% we saw at twelve connectors, and has since climbed to 97.3% as skills accumulate.
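To make the narrowing step concrete, here is a minimal sketch of the three levels working together. It is not Well's implementation: the class, the category table, the `send_invoice` query type, and the use of plain success counts in place of a learned model are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical query-type skills: which tool categories serve which query type.
# "send_invoice" stands in for an intent that needs two categories at once.
QUERY_TYPE_CATEGORIES = {
    "invoice": {"accounting"},
    "message": {"communication"},
    "calendar": {"scheduling"},
    "send_invoice": {"accounting", "communication"},
}

# Hypothetical connector catalog: tool name -> category.
CATALOG = {
    "quickbooks": "accounting",
    "xero": "accounting",
    "slack": "communication",
    "gmail": "communication",
    "google_calendar": "scheduling",
}


class SkillLayer:
    def __init__(self):
        # (workspace, query_type) -> count of confirmed routes per tool
        self.successes = defaultdict(Counter)

    def record_success(self, workspace, query_type, tool):
        self.successes[(workspace, query_type)][tool] += 1

    def candidates(self, workspace, query_type, connected, limit=4):
        # Workspace level: tools that are not connected are never considered.
        categories = QUERY_TYPE_CATEGORIES.get(query_type, set())
        pool = [t for t in connected if CATALOG.get(t) in categories]
        # Rank by confirmed routes for this workspace, ties broken by name.
        history = self.successes[(workspace, query_type)]
        pool.sort(key=lambda t: (-history[t], t))
        return pool[:limit]


skills = SkillLayer()
connected = {"quickbooks", "slack", "gmail"}  # Xero is not connected
print(skills.candidates("acme", "invoice", connected))
print(skills.candidates("acme", "send_invoice", connected))
```

Only the short list returned by `candidates` would be described to the model, so prompt size depends on the candidate limit rather than on the size of the catalog.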

The Meta-MCP Proxy

Skills handle the intelligence side of routing. The Meta-MCP Proxy handles the infrastructure side.

MCP, the Model Context Protocol, defines how an AI model interacts with external tools. Each of our 120+ connectors exposes its capabilities through MCP. The problem is that an AI agent cannot efficiently maintain live connections to 120+ separate MCP servers. The overhead alone would be prohibitive.

The Meta-MCP Proxy is a single MCP endpoint that sits between the AI and all connectors. From the model's perspective, it talks to one server. Behind that server, the proxy fans out to whichever connector the skill layer selected. It handles connection pooling, authentication forwarding, rate limiting, and error recovery for every connector.

The architecture looks like this: user query arrives, the skill layer narrows the candidate tools, the Meta-MCP Proxy surfaces only those tools to the model, the model selects and invokes, the proxy routes the invocation to the correct connector. The round trip feels instant to the user. Behind the scenes, the proxy is doing significant work.
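The flow above can be sketched in a few lines. This is an illustration under stated assumptions, not the actual proxy: connectors are modeled as plain Python callables rather than MCP servers, no MCP SDK is used, and connection pooling, authentication, rate limiting, and error recovery are left out. All names are hypothetical.

```python
from typing import Callable, Dict, List

# A connector is modeled as a callable: (tool_name, arguments) -> result.
# Real connectors would be MCP servers; this stand-in assumes no SDK.
Connector = Callable[[str, dict], dict]


class MetaProxy:
    def __init__(self):
        self.connectors: Dict[str, Connector] = {}
        self.tool_owner: Dict[str, str] = {}  # tool name -> connector name

    def register(self, connector_name: str, tools: List[str], handler: Connector):
        self.connectors[connector_name] = handler
        for tool in tools:
            self.tool_owner[tool] = connector_name

    def list_tools(self, allowed_connectors: List[str]) -> List[str]:
        # Surface only tools owned by connectors the skill layer selected.
        return sorted(t for t, owner in self.tool_owner.items()
                      if owner in allowed_connectors)

    def call_tool(self, tool: str, arguments: dict) -> dict:
        # Route the model's invocation to the connector that owns the tool.
        owner = self.tool_owner.get(tool)
        if owner is None:
            raise KeyError(f"unknown tool: {tool}")
        return self.connectors[owner](tool, arguments)


proxy = MetaProxy()
proxy.register("quickbooks", ["quickbooks.list_invoices"],
               lambda tool, args: {"source": "quickbooks", "tool": tool, "args": args})
proxy.register("slack", ["slack.send_message"],
               lambda tool, args: {"source": "slack", "tool": tool, "args": args})

visible = proxy.list_tools(["quickbooks"])  # what the model would see
result = proxy.call_tool(visible[0], {"status": "overdue"})
print(visible, result["source"])
```

The model only ever talks to `list_tools` and `call_tool`; which connector sits behind a tool name stays an internal detail of the proxy.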

One design decision that took us time to get right: the proxy does not cache tool descriptions statically. Connector capabilities change. A QuickBooks integration might gain new endpoints after an API update. The proxy maintains a live capability index that refreshes on a schedule and on-demand when a connector reports changes. This matters because stale capability data leads to routing errors that are difficult to debug. The model tries to call a function that no longer exists, or misses a new function that would have been the correct route.
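One way to picture a capability index that refreshes both on a schedule and on demand is a cache with a time-to-live plus an explicit invalidation hook. The sketch below assumes a hypothetical `fetch` callable that asks a connector for its current tool list; the class name, the five-minute default, and the injected clock are illustrative choices, not details from the production system.

```python
import time
from typing import Callable, Dict, List


class CapabilityIndex:
    """Cache of tool names per connector, refreshed on a TTL or on demand."""

    def __init__(self, fetch: Callable[[str], List[str]],
                 ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch      # hypothetical: asks a connector for its tools
        self._ttl = ttl_seconds  # scheduled refresh interval (assumed value)
        self._clock = clock      # injectable so the behavior is testable
        self._tools: Dict[str, List[str]] = {}
        self._fetched_at: Dict[str, float] = {}

    def invalidate(self, connector: str) -> None:
        # On-demand path: the connector reported a capability change.
        self._fetched_at.pop(connector, None)

    def tools(self, connector: str) -> List[str]:
        fetched_at = self._fetched_at.get(connector)
        stale = fetched_at is None or self._clock() - fetched_at >= self._ttl
        if stale:
            self._tools[connector] = list(self._fetch(connector))
            self._fetched_at[connector] = self._clock()
        return self._tools[connector]
```

A connector that announces an API update would trigger `invalidate`, so the next lookup refetches immediately instead of waiting for the interval to expire.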

When Routing Goes Wrong

No routing system is perfect. What matters is how errors surface and how the system learns from them.

We built an evaluation framework that tracks every route decision. Each query gets logged with: the candidate tools considered, the tool selected, whether the invocation succeeded, and whether the user accepted the result. A successful invocation that the user rejects (because it returned data from the wrong source) counts as a routing error.

Errors feed back into the skill layer. If "Show me overdue invoices" routes to QuickBooks but the user corrects it to Xero, the workspace-level skill updates. If enough workspaces show the same correction pattern, the query-type skill adjusts its priors.
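As a rough illustration of that feedback loop, the sketch below logs one record per route decision and promotes a correction from a single workspace to a cross-workspace preference once enough distinct workspaces agree. The field names, the `Feedback` class, and the promotion threshold are assumptions; the article does not specify how priors are actually adjusted.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RouteLog:
    workspace: str
    query_type: str
    candidates: List[str]
    selected: str
    invocation_ok: bool
    user_accepted: bool
    corrected_to: Optional[str] = None  # tool the user switched to, if any

    @property
    def is_routing_error(self) -> bool:
        # A successful call that the user rejects still counts as an error.
        return not (self.invocation_ok and self.user_accepted)


class Feedback:
    def __init__(self, promote_after: int = 3):
        self.workspace_pref = {}             # (workspace, query_type) -> tool
        self.global_pref = {}                # query_type -> tool
        self._correcting = defaultdict(set)  # (query_type, tool) -> workspaces
        self.promote_after = promote_after   # hypothetical threshold

    def observe(self, log: RouteLog) -> None:
        if not log.is_routing_error or log.corrected_to is None:
            return
        # Workspace-level skill updates immediately on a correction.
        self.workspace_pref[(log.workspace, log.query_type)] = log.corrected_to
        # Query-type prior shifts only after enough workspaces agree.
        seen = self._correcting[(log.query_type, log.corrected_to)]
        seen.add(log.workspace)
        if len(seen) >= self.promote_after:
            self.global_pref[log.query_type] = log.corrected_to
```

Counting distinct workspaces rather than raw corrections keeps one very active customer from rewriting the shared prior on its own.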

We track three metrics. Routing accuracy: did the query reach the correct tool? First-try accuracy: did it get there without a correction? Recovery time: when routing fails, how quickly does the system recover and try the right tool?

Current numbers across production workspaces: routing accuracy 97.3%, first-try accuracy 94.1%, median recovery time 1.2 seconds. The gap between routing accuracy and first-try accuracy represents cases where the system's second choice was correct. We are working to close that gap, but the recovery speed means users rarely notice.
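The relationship between the three metrics is easier to see in code. This sketch assumes each logged query is reduced to a tuple of (reached the correct tool, needed a correction, recovery time in seconds); that record shape and the function name are hypothetical.

```python
from statistics import median


def routing_metrics(records):
    """records: list of (reached_correct_tool, needed_correction, recovery_seconds)."""
    if not records:
        raise ValueError("no records to evaluate")
    total = len(records)
    # Routing accuracy counts every query that ended at the right tool.
    reached = sum(1 for ok, _, _ in records if ok)
    # First-try accuracy excludes queries that needed a correction to get there.
    first_try = sum(1 for ok, corrected, _ in records if ok and not corrected)
    recoveries = [secs for _, corrected, secs in records
                  if corrected and secs is not None]
    return {
        "routing_accuracy": reached / total,
        "first_try_accuracy": first_try / total,
        "median_recovery_seconds": median(recoveries) if recoveries else None,
    }


sample = [(True, False, None)] * 8 + [(True, True, 1.0), (False, False, None)]
print(routing_metrics(sample))
```

Because every first-try success is also a routing success, first-try accuracy can never exceed routing accuracy, and the difference between them is the share of queries that were rescued by a second attempt.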

Scaling to 500 Connectors

Everything we have built so far was designed for what comes next. The current 120+ connectors cover the most common SaaS tools. The roadmap targets 500 connectors by end of next year, covering long-tail industry-specific software, regional tools, and legacy systems that businesses still depend on.

At 500 connectors, the naive approach would require passing roughly 150,000 tokens of tool descriptions per query. Even with today's large context windows, this is impractical. More importantly, model accuracy degrades well before context limits are reached. Research from Anthropic and others confirms that LLMs struggle with selection tasks when presented with more than 20 to 30 options, regardless of context window size.

The skill layer scales differently. As connectors grow from 120+ to 500, the candidate set for any given query stays small. A user asking about invoices still gets routed to two or three accounting tools, not 500. The work shifts to the skill layer, which must learn patterns for new connectors quickly and handle edge cases where new tools overlap with existing ones.

This is where the compounding effect shows up. Every query processed, every correction logged, every successful route confirmed adds signal to the skill layer. A new connector benefits from patterns learned across all existing connectors. When we add a new CRM, it inherits routing patterns from Salesforce, HubSpot, and Attio integrations. The 500th connector will route more accurately on day one than the 20th connector did, because the skill layer by then will encode millions of routing decisions.
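One simple way a new connector could inherit patterns is to pool the routing history of existing connectors in the same category and use it as a starting prior. The sketch below is an assumption about how such inheritance might work, not a description of the skill layer itself; the history counts and query-type names are invented for illustration.

```python
from collections import Counter

# Hypothetical history: connector -> count of query types it served successfully.
HISTORY = {
    "salesforce": Counter({"list_deals": 40, "update_contact": 25}),
    "hubspot": Counter({"list_deals": 30, "update_contact": 10, "send_sequence": 5}),
    "quickbooks": Counter({"list_invoices": 80}),
}
CATEGORY = {"salesforce": "crm", "hubspot": "crm", "quickbooks": "accounting"}


def inherited_prior(new_category: str) -> dict:
    """Pool query-type counts from same-category connectors into a normalized prior."""
    pooled = Counter()
    for connector, counts in HISTORY.items():
        if CATEGORY[connector] == new_category:
            pooled.update(counts)
    total = sum(pooled.values())
    return {q: n / total for q, n in pooled.items()} if total else {}


prior = inherited_prior("crm")  # starting point for a newly added CRM connector
print(prior)
```

A connector in a category with no history gets an empty prior and has to learn from its own traffic, which is the cold-start case the early connectors faced.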

Tool Selection as an AI Problem

The deeper insight from building this system is that tool selection belongs in the same category as other AI problems we have learned to stop hand-coding.

Search ranking was once a rules engine. Engineers wrote heuristics for which results should rank higher. Then machine learning ate that problem, and hand-coded ranking rules became a historical curiosity. Recommendation systems followed the same arc, as did fraud detection and content moderation. In each case, the shift happened when the decision space grew beyond what static rules could handle.

Tool routing is on the same trajectory. At twelve connectors, hand-coded rules worked fine. At 120+, they cannot keep up. At 500, they are not even worth attempting. The teams that treat tool selection as a configuration problem will hit a ceiling. The teams that treat it as a learning problem will not.

We see this reflected in the competitive dynamics of the AI agent space. Several companies are building agents that connect to external tools. Most treat routing as a prompt engineering challenge. They write careful system prompts that describe when to use which tool. This works at small scale. It falls apart exactly when the product needs it most, when the customer connects enough tools for the agent to be genuinely useful.

The infrastructure required to solve routing properly is substantial. You need a proxy layer that abstracts connector complexity. A skill system that learns from production traffic. An evaluation framework that catches errors and feeds them back. A capability index that stays current across hundreds of evolving APIs. None of this is visible to the user. All of it is necessary.

We spent eight months building what amounts to an invisible layer. No user ever sees the Meta-MCP Proxy. No one asks about skill-based routing. They ask "Show me overdue invoices," and the right answer appears. The complexity lives entirely beneath the surface.

That is the nature of infrastructure work. It compounds quietly. Each connector added makes the skill layer smarter. Each query processed makes routing more accurate. The moat is not in any single component. It is in the accumulated intelligence of millions of routing decisions, encoded in a system designed to learn from every one of them.

Maxime Champoux, CEO & co-founder, Well

Maxime is the CEO and co-founder of Well. He built Well to rebuild finance around AI-native data, not spreadsheets.
