How we picked the ten
Ten is a sample, not a survey. We picked the ten clients whose stacks we know cold — meaning we built or audited the whole pipeline and have access to monthly invoices. Spend ranges from ~$900/mo (smallest, an internal tool at a 12-person startup) to ~$92K/mo (largest, a customer-facing agent on a B2B platform).
Names are scrubbed. Workload shapes are not. If your stack resembles one of these, you can use the median spend as a sanity check.
Profile A: Opus-heavy
Three of the ten clients sit here. The pattern: a small number of high-stakes calls per day, each requiring the strongest reasoning. Examples — legal-document analysis pipeline, financial-modeling assistant, deep-research agent for an enterprise sales team.
- Call volume: 200–2,000 / day
- Average tokens per call: 18,000 in / 4,500 out
- Model split: ~85% Opus 4.7, ~15% Sonnet 4.6 (used for cheap pre-classification)
- Caching: heavy. The system prompt is 6–14K tokens and gets cached aggressively (see the sketch below).
- Median monthly spend: $11,400
- Range across the three: $4,900 – $19,200
Where the money goes: Opus output tokens. Across this profile, output is ~40% of the bill despite being only ~20% of token volume — the $25/Mtok output rate dominates.
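Caching is what keeps this profile affordable. A minimal sketch of the pattern using the Anthropic Python SDK's prompt caching; the model ID string and the stub prompt are placeholders, not any client's actual setup:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Stand-in for the real 6-14K-token system prompt.
SYSTEM_PROMPT = "You are a legal-document analysis assistant. [...]"

def analyze(document_text: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-7",  # placeholder ID; check the current model list
        max_tokens=4096,          # cap output -- see the leaks section below
        system=[{
            "type": "text",
            "text": SYSTEM_PROMPT,
            # Mark the static prefix cacheable: repeat calls sharing this exact
            # prefix pay the cheaper cache-read rate instead of full input price.
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": document_text}],
    )
    return response.content[0].text
```

The prefix has to be byte-identical across calls to hit the cache, which is why we push anything per-request (dates, user names, document metadata) out of the system prompt and into the user turn.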
Profile B: Sonnet-default
Four of the ten. The most common shape we see. Customer-facing agents, internal copilots, broad coding-assistance products. Sonnet 4.6 handles the bulk of traffic; Opus is reserved for hard escalations and Haiku for trivial classification.
- Call volume: 5,000–80,000 / day
- Average tokens per call: 6,200 in / 1,400 out
- Model split: ~75% Sonnet, ~15% Haiku, ~10% Opus
- Caching: heavy on system prompt + middle of conversation
- Median monthly spend: $8,600
- Range across the four: $2,200 – $24,000
Where the money goes: input tokens, overwhelmingly. The single highest-leverage cost optimisation in this profile is always the same: trim the system prompt and aggressively cache what remains.
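The same mechanism extends past the system prompt, per the "middle of conversation" point above. A hedged sketch: a `cache_control` marker on the last prior turn asks the API to cache everything up to that point, so each new turn re-reads the transcript at the cache rate instead of full input price (model ID is again a placeholder):

```python
import anthropic

client = anthropic.Anthropic()

def next_turn(system_prompt: str, history: list[dict], user_msg: str) -> str:
    # history: prior turns, e.g. [{"role": "user", "content": "..."}, ...]
    messages = [dict(m) for m in history]
    if messages:
        last = messages[-1]
        # Cache breakpoint on the last prior turn (assumes plain-string
        # content): the cached prefix covers the system prompt plus the
        # whole transcript up to here.
        last["content"] = [{
            "type": "text",
            "text": last["content"],
            "cache_control": {"type": "ephemeral"},
        }]
    messages.append({"role": "user", "content": user_msg})
    response = client.messages.create(
        model="claude-sonnet-4-6",  # placeholder ID
        max_tokens=1024,
        system=[{"type": "text", "text": system_prompt,
                 "cache_control": {"type": "ephemeral"}}],
        messages=messages,
    )
    return response.content[0].text
```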
Profile C: Haiku-batch
Three of the ten. High-volume, low-stakes-per-call workloads. Email classification at an MSP, content moderation at a UGC platform, embedding-replacement classification for a search product.
- Call volume: 100,000–4,000,000 / day
- Average tokens per call: 800 in / 120 out
- Model split: ~95% Haiku 4.5, ~5% Sonnet (escalations)
- Caching: surprisingly low — these workloads usually have unique inputs
- Median monthly spend: $3,800
- Range across the three: $900 – $14,500
Where the money goes: pure throughput. Per-call cost is tiny; the bill is volume × (input tokens × $1/Mtok + output tokens × $5/Mtok). At 100,000 calls/day that works out to 80M input tokens ($80) and 12M output tokens ($60) per day, roughly $4,200/month, right around this profile's median. The optimisations that matter here are not prompt-shaped; they're "do we even need to call the model on this row?" filtering at the application layer.
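That filtering layer is usually unglamorous. A hypothetical sketch for the email-classification case; the rules, labels, and model ID are illustrative, not any client's production code:

```python
import re
import anthropic

client = anthropic.Anthropic()

# Deterministic rules that settle a row without spending any tokens.
AUTOREPLY = re.compile(r"^(auto[- ]?reply|out of office)", re.IGNORECASE)

def classify_with_haiku(subject: str, body: str) -> str:
    response = client.messages.create(
        model="claude-haiku-4-5",  # placeholder ID
        max_tokens=10,             # a label, not an essay
        messages=[{"role": "user", "content":
                   f"Classify this email as spam/sales/support/other.\n"
                   f"Subject: {subject}\n{body[:2000]}\n"
                   f"Answer with one word."}],
    )
    return response.content[0].text.strip().lower()

def classify(subject: str, body: str) -> str:
    # Tier 0: free. Rules catch the unambiguous bulk first.
    if AUTOREPLY.match(subject):
        return "auto-reply"
    if not body.strip():
        return "empty"
    # Tier 1: Haiku gets only the rows the rules couldn't settle.
    return classify_with_haiku(subject, body)
```

The point is the ordering: the free tier runs first, and at these volumes even a rule that only catches 20% of rows is 20% off the bill.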
The leaks we keep finding
When we audit an incoming engagement (someone bringing us in to fix an existing Claude bill), the same five leaks show up:
- Missing prompt cache. A 12K-token system prompt sent uncached on every call. Trivial to fix, ~30–60% bill reduction depending on hit rate. (See our 71% bill-cut case study.)
- Wrong model on easy traffic. Opus or Sonnet running on traffic Haiku would handle in its sleep. We see this most on classification and routing layers.
- No max-tokens cap. Output drift to 4–8K tokens when the answer needed was 200. Setting a sensible `max_tokens` usually trims 10–20% off the bill.
- Re-sending the conversation history uncached. Every turn re-includes the full transcript with no caching strategy. This compounds fast on long sessions.
- No throttle on the trigger upstream. An app feature that fires the model on every keystroke, every page-view, every webhook — when a debounce or batch would do (see the sketch below).
Across our audits, fixing the top three of these alone usually cuts bills by 40–70% with zero quality regression. We've never finished an audit without finding at least two.
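For that fifth leak the fix lives in the application, not the prompt. A minimal asyncio debounce sketch, assuming a single-user trigger like a keystroke handler; the class and the timing are illustrative:

```python
import asyncio

class Debouncer:
    """Coalesce rapid triggers into one call after `delay` seconds of quiet."""

    def __init__(self, delay: float = 0.5):
        self.delay = delay
        self._task: asyncio.Task | None = None

    def trigger(self, coro_fn, *args):
        # Each new trigger cancels the pending call and restarts the clock,
        # so a burst of keystrokes costs one model call, not dozens.
        if self._task and not self._task.done():
            self._task.cancel()
        self._task = asyncio.create_task(self._fire(coro_fn, *args))

    async def _fire(self, coro_fn, *args):
        await asyncio.sleep(self.delay)
        await coro_fn(*args)
```

Wire `trigger()` into the event handler and put the model call inside the coroutine; the call only fires once the input goes quiet.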
The model-split cheat sheet
The question we get at every kickoff: "what's the right split between Opus, Sonnet, and Haiku?" There isn't a universal answer. The shape we use as a starting point (a router sketch follows below):
- Hard reasoning (multi-step planning, code architecture, document analysis) → Opus 4.7
- Default conversation, default tool-use, default code-edit → Sonnet 4.6
- Classification, extraction, simple routing, structured-output bulk work → Haiku 4.5
The mistake we see most: teams default everything to whatever model they tested with first. Usually Sonnet. They never split out the 40% of traffic Haiku could handle at ~5x lower cost.
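As a starting point, the split can be as crude as a lookup table keyed on task type. These task labels and model IDs are placeholders to refine against real traffic, not a prescription:

```python
# Starting-point router: task type -> model.
MODEL_FOR_TASK = {
    "planning":          "claude-opus-4-7",    # hard multi-step reasoning
    "code_architecture": "claude-opus-4-7",
    "chat":              "claude-sonnet-4-6",  # default conversation / tool use
    "code_edit":         "claude-sonnet-4-6",
    "classification":    "claude-haiku-4-5",   # bulk structured work
    "extraction":        "claude-haiku-4-5",
    "routing":           "claude-haiku-4-5",
}

def pick_model(task_type: str) -> str:
    # Default to Sonnet rather than Opus when unsure: wrong-but-cheap beats
    # wrong-and-expensive, and an escalation path can catch the misses.
    return MODEL_FOR_TASK.get(task_type, "claude-sonnet-4-6")
```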
Plug your own numbers in
Our LLM cost calculator lets you plug in volume, average input/output tokens, model split, and cache hit rate to get a bill estimate. If you run those numbers and your actual bill is more than ~25% above the estimate, you've got a leak. If it's below, you're probably under-provisioning something — that's a different conversation.
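If you'd rather script it than use the calculator, the estimate is a few lines. A sketch under two loud assumptions: you supply the prices from the current price list (only the Haiku figures from Profile C are filled in below), and cached input is modelled as a flat ~90% discount on the input rate:

```python
def monthly_estimate(
    calls_per_day: float,
    avg_in_tokens: float,
    avg_out_tokens: float,
    model_split: dict[str, float],           # model -> fraction of calls, sums to 1.0
    prices: dict[str, tuple[float, float]],  # model -> ($/Mtok in, $/Mtok out)
    cache_hit_rate: float = 0.0,             # fraction of input served from cache
    cache_discount: float = 0.9,             # assumed: cached reads ~90% cheaper
) -> float:
    total_per_day = 0.0
    for model, share in model_split.items():
        in_price, out_price = prices[model]
        in_mtok = calls_per_day * share * avg_in_tokens / 1e6
        out_mtok = calls_per_day * share * avg_out_tokens / 1e6
        effective_in = in_price * (1 - cache_hit_rate * cache_discount)
        total_per_day += in_mtok * effective_in + out_mtok * out_price
    return total_per_day * 30  # rough month

# Profile C-ish numbers: Haiku at $1/Mtok in, $5/Mtok out (from above).
print(monthly_estimate(
    calls_per_day=100_000,
    avg_in_tokens=800,
    avg_out_tokens=120,
    model_split={"haiku": 1.0},
    prices={"haiku": (1.0, 5.0)},
))  # -> 4200.0
```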
And if you want us to do the audit for you, the engagement is typically 1–2 weeks and pays back in the first month. Book a scoping call.