The framing
A tool definition has three parts: name, description, parameters. The model reads all three before deciding to call the tool. The description is, for all practical purposes, a prompt that runs every time the model evaluates that tool. Treat it like one.
The companion piece on tool-use vs structured output covers when to reach for tools at all. This post assumes you're writing tools and asks: what makes one the model uses well?
Rule 1: Name like a verb the model would already use
Bad: customer_data_retrieval_v2.
Good: get_customer.
The model picks tools partly by lexical match — "I need to find this customer" pattern-matches to get_customer faster than to customer_data_retrieval_v2. Names are verbs plus an object. Skip the version suffix and the module prefix; the system prompt can explain what subsystem these belong to if it matters.
Rule 2: Describe what to use it for, not what it does
Bad: "Returns a JSON object containing the customer record fields." Good: "Use this when you need to look up a customer by ID. Returns name, email, plan tier, and account status."
The model's question isn't "what does this function return." It's "should I call this now?" Lead with the trigger, follow with the return shape. The first sentence of the description should answer the question "in what situation should I call this tool."
Rule 3: Mention what NOT to use it for
Good: "Use this to look up a customer by ID. Don't use this to search for customers by name — use search_customers instead."
Negative space matters. When two tools are adjacent in function, telling the model where each one stops lets it pick the right one. The "don't use this for X — use Y instead" phrasing redirects the model's attention to Y when X isn't the fit.
Rule 4: Make required parameters obviously required
Bad: filters: object with a schema buried in the description. Good: customer_id: string (required) — the unique customer ID, format "cus_*".
Parameter descriptions should include format hints, examples, and units. The model fills parameters by guessing from conversation context — if the format is "cus_*" but you don't mention it, expect it to call your tool with the customer's email instead. Spell out the format. Give an example.
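Putting rules 2 through 4 together, a complete definition might look like this. The shape is an OpenAI-style function schema; the exact field layout is illustrative, and the example ID is made up:

```python
# One tool definition with trigger-first description, a negative-space
# clause, and a structurally required parameter with a format hint.
get_customer = {
    "name": "get_customer",
    "description": (
        "Use this when you need to look up a customer by ID. "
        "Returns name, email, plan tier, and account status. "
        "Don't use this to search for customers by name -- "
        "use search_customers instead."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "customer_id": {
                "type": "string",
                "description": 'The unique customer ID, format "cus_*", e.g. "cus_8f3k2".',
            },
        },
        "required": ["customer_id"],  # required in the schema, not just in prose
    },
}
```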
Rule 5: Return errors that the model can act on
The model receives the tool's response on the next turn and decides what to do with it. If the response is "Error 500," the model has nothing to work with. If the response is "Error: customer not found. Try search_customers with the name or email instead," the model recovers cleanly.
This is server-side, not part of the tool definition, but it's the same skill. Tool error messages are prompts to the future-self of the agent. Write them that way.
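Server-side, that means the handler returns instructions rather than a status code. A sketch, assuming a simple in-memory store standing in for the real backend:

```python
def handle_get_customer(customer_id: str, db: dict) -> str:
    """Tool handler whose error messages are prompts for the agent's next turn."""
    if not customer_id.startswith("cus_"):
        # Recoverable: tell the model what format was expected and where to go.
        return (
            f"Error: '{customer_id}' is not a customer ID (expected format 'cus_*'). "
            "If you only have a name or email, call search_customers first."
        )
    record = db.get(customer_id)
    if record is None:
        return (
            f"Error: no customer with ID '{customer_id}'. "
            "Try search_customers with the name or email instead."
        )
    return str(record)
```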
Rule 6: Fewer tools, broader scope
Teams routinely ship agents with 30+ tools. The model gets worse at picking. Cognitive load applies to the model the same way it does to a human reading an API surface.
The right shape is usually 4-8 tools, each with broad scope and clear separation. "Look up a customer" is one tool that handles ID, email, and name lookup internally — not three separate tools the model has to disambiguate between. Keep surface area small. Push the disambiguation logic into the server-side handler, where it can be tested.
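One way to fold three lookups into one tool: accept a single query and dispatch on its shape inside the handler, where the routing is a plain function you can unit-test. The dispatch heuristics below are illustrative:

```python
def lookup_customer(query: str) -> str:
    """One broad tool: accepts an ID, an email, or a name.
    The model passes whatever identifier it has; the server routes it."""
    if query.startswith("cus_"):
        kind = "id"
    elif "@" in query:
        kind = "email"
    else:
        kind = "name"
    # A real handler would hit the appropriate index; here we just report the route.
    return f"lookup by {kind}: {query}"
```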
The 30-second test
Before shipping a tool definition, read just the name and first sentence of the description out loud. Ask: "would I know when to call this?" If you'd hesitate, the model will too. Rewrite.
Same test for parameters: read the parameter name and the first words of its description. Ask: "could I provide this value from the conversation context I have?" If not, the parameter description is missing format/example info.
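The parameter half of the test can also be roughed out as a heuristic check — does the description carry anything the model can pattern-match against? This lint and its keyword list are hypothetical:

```python
def param_has_hints(desc: str) -> bool:
    """Crude check: does a parameter description include a format hint,
    an example, or a quoted literal? (Illustrative keyword list.)"""
    hints = ("format", "e.g.", "example", '"')
    return any(h in desc.lower() for h in hints)

param_has_hints("the customer")                             # fails the 30-second test
param_has_hints('The unique customer ID, format "cus_*".')  # passes
```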
The eval case we always add
For each tool, an eval case where the model should obviously call it, and another where it obviously shouldn't. If both cases pass, the tool's definition is doing its job. If the "obviously call it" case fails — the model didn't pick the tool — the description is wrong, not the model. Rewrite.
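Concretely, a routing pair for get_customer might look like the following. The case format is whatever your eval harness expects; this dict shape is just a sketch:

```python
# A should-call / shouldn't-call pair for one tool.
tool_routing_cases = [
    {
        "prompt": "Pull up the account for customer cus_8f3k2.",
        "expect_tool": "get_customer",      # obvious trigger: an ID is in hand
    },
    {
        "prompt": "Find the customer named Ana Smith.",
        "expect_tool": "search_customers",  # name lookup belongs to the sibling tool
    },
]
```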
We covered the eval shape in the minimum viable eval; tool-routing cases are 25-40% of the eval suite on every agent we ship.
The summary
Tool definitions are prompts. Name like a verb, describe the trigger, mark required params with formats, write errors as prompts, keep the surface small. Most "the agent is unreliable" complaints we get pulled into to debug come down to bad tool definitions, not bad models. The model is reading the description carefully. Make it worth reading.