The framing
A tool definition has three parts: name, description, parameters. The model reads all three before deciding to call the tool. The description is, for all practical purposes, a prompt that runs every time the model evaluates that tool. Treat it like one.
The companion piece on tool-use vs structured output covers when to reach for tools at all. This post assumes you're writing tools and asks: what makes one the model uses well?
Rule 1: Name like a verb the model would already use
Bad: customer_data_retrieval_v2.
Good: get_customer.
The model picks tools partly by lexical match — "I need to find this customer" pattern-matches to get_customer faster than to customer_data_retrieval_v2. Names are verbs plus an object. Skip the version suffix and the module prefix; the system prompt can explain what subsystem these belong to if it matters.
Rule 2: Describe what to use it for, not what it does
Bad: "Returns a JSON object containing the customer record fields." Good: "Use this when you need to look up a customer by ID. Returns name, email, plan tier, and account status."
The model's question isn't "what does this function return." It's "should I call this now?" Lead with the trigger, follow with the return shape. The first sentence of the description should answer the question "in what situation should I call this tool."
Rule 3: Mention what NOT to use it for
Good: "Use this to look up a customer by ID. Don't use this to search for customers by name — use search_customers instead."
Negative space matters. When two tools are adjacent in function, telling the model where each one stops lets it pick the right one. The "don't use this for X — use Y instead" phrasing redirects the model's attention to Y when X isn't the fit.
Rule 4: Make required parameters obviously required
Bad: filters: object with a schema buried in the description. Good: customer_id: string (required) — the unique customer ID, format "cus_*".
Parameter descriptions should include format hints, examples, and units. The model fills parameters by guessing from conversation context — if the format is "cus_*" but you don't mention it, expect it to call your tool with the customer's email instead. Spell out the format. Give an example.
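Putting rules 2 through 4 together, a complete definition might look like this. The shape is an OpenAI-style function schema; the exact field layout is illustrative, and the example ID is made up:

```python
# One tool definition with trigger-first description, a negative-space
# clause, and a structurally required parameter with a format hint.
get_customer = {
    "name": "get_customer",
    "description": (
        "Use this when you need to look up a customer by ID. "
        "Returns name, email, plan tier, and account status. "
        "Don't use this to search for customers by name -- "
        "use search_customers instead."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "customer_id": {
                "type": "string",
                "description": 'The unique customer ID, format "cus_*", e.g. "cus_8f3k2".',
            },
        },
        "required": ["customer_id"],  # required in the schema, not just in prose
    },
}
```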
Rule 5: Return errors that the model can act on
The model receives the tool's response on the next turn and decides what to do with it. If the response is "Error 500," the model has nothing to work with. If the response is "Error: customer not found. Try search_customers with the name or email instead," the model recovers cleanly.
This is server-side, not part of the tool definition, but it's the same skill. Tool error messages are prompts to the future-self of the agent. Write them that way.
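Server-side, that means the handler returns instructions rather than a status code. A sketch, assuming a simple in-memory store standing in for the real backend:

```python
def handle_get_customer(customer_id: str, db: dict) -> str:
    """Tool handler whose error messages are prompts for the agent's next turn."""
    if not customer_id.startswith("cus_"):
        # Recoverable: tell the model what format was expected and where to go.
        return (
            f"Error: '{customer_id}' is not a customer ID (expected format 'cus_*'). "
            "If you only have a name or email, call search_customers first."
        )
    record = db.get(customer_id)
    if record is None:
        return (
            f"Error: no customer with ID '{customer_id}'. "
            "Try search_customers with the name or email instead."
        )
    return str(record)
```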
Rule 6: Fewer tools, broader scope
Teams routinely ship agents with 30+ tools. The model gets worse at picking. Cognitive load applies to the model the same way it does to a human reading an API surface.
The right shape is usually 4-8 tools, each with broad scope and clear separation. "Look up a customer" is one tool that handles ID, email, and name lookup internally — not three separate tools the model has to disambiguate between. Keep surface area small. Push the disambiguation logic into the server-side handler, where it can be tested.
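One way to fold three lookups into one tool: accept a single query and dispatch on its shape inside the handler, where the routing is a plain function you can unit-test. The dispatch heuristics below are illustrative:

```python
def lookup_customer(query: str) -> str:
    """One broad tool: accepts an ID, an email, or a name.
    The model passes whatever identifier it has; the server routes it."""
    if query.startswith("cus_"):
        kind = "id"
    elif "@" in query:
        kind = "email"
    else:
        kind = "name"
    # A real handler would hit the appropriate index; here we just report the route.
    return f"lookup by {kind}: {query}"
```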
The 30-second test
Before shipping a tool definition, read just the name and first sentence of the description out loud. Ask: "would I know when to call this?" If you'd hesitate, the model will too. Rewrite.
Same test for parameters: read the parameter name and the first words of its description. Ask: "could I provide this value from the conversation context I have?" If not, the parameter description is missing format/example info.
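The parameter half of the test can also be roughed out as a heuristic check — does the description carry anything the model can pattern-match against? This lint and its keyword list are hypothetical:

```python
def param_has_hints(desc: str) -> bool:
    """Crude check: does a parameter description include a format hint,
    an example, or a quoted literal? (Illustrative keyword list.)"""
    hints = ("format", "e.g.", "example", '"')
    return any(h in desc.lower() for h in hints)

param_has_hints("the customer")                             # fails the 30-second test
param_has_hints('The unique customer ID, format "cus_*".')  # passes
```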
The eval case we always add
For each tool, an eval case where the model should obviously call it, and another where it obviously shouldn't. If both cases pass, the tool's definition is doing its job. If the "obviously call it" case fails — the model didn't pick the tool — the description is wrong, not the model. Rewrite.
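Concretely, a routing pair for get_customer might look like the following. The case format is whatever your eval harness expects; this dict shape is just a sketch:

```python
# A should-call / shouldn't-call pair for one tool.
tool_routing_cases = [
    {
        "prompt": "Pull up the account for customer cus_8f3k2.",
        "expect_tool": "get_customer",      # obvious trigger: an ID is in hand
    },
    {
        "prompt": "Find the customer named Ana Smith.",
        "expect_tool": "search_customers",  # name lookup belongs to the sibling tool
    },
]
```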
We covered the eval shape in the minimum viable eval; tool-routing cases are 25-40% of the eval suite on every agent we ship.
The summary
Tool definitions are prompts. Name like a verb, describe the trigger, mark required params with formats, write errors as prompts, keep the surface small. Most "the agent is unreliable" complaints we get pulled into to debug come down to bad tool definitions, not bad models. The model is reading the description carefully. Make it worth reading.