The default position
For 90% of new AI features, the right starting point is a well-prompted frontier model with prompt caching. No fine-tuning. The reasons are honest:
- Better quality out of the box than a fine-tune of a smaller model.
- No data collection, labeling, or training pipeline.
- You can iterate on the prompt in minutes; you can iterate on a fine-tune in days.
- You can swap models when a better one ships; fine-tunes are tied to a base.
The "fine-tune our own model" default from 2023 is mostly dead. A frontier model, a good prompt, caching, and an eval suite eat most of the use cases that used to call for it.
The four cases where fine-tuning wins
1. High-volume classification at a scale where the Haiku rate adds up
You're running 10M+ classifications a month. Haiku at $1/Mtok input is the closest baseline. A fine-tuned 7B-class model on Modal or RunPod can hit Haiku-class quality at meaningfully lower per-call cost, typically 30-60% cheaper at this scale. Below ~10M calls/month the savings don't justify the operational overhead.
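As a rough sketch of the comparison at that volume (the token count, discount, and hosting figure below are assumptions for illustration, not quotes):

```python
# Rough monthly cost comparison at classification scale.
# All numbers are illustrative assumptions, not vendor pricing.

CALLS_PER_MONTH = 10_000_000
AVG_INPUT_TOKENS = 500           # assumed prompt size per classification

# Frontier small model (Haiku-class) at $1 per million input tokens.
haiku_cost = CALLS_PER_MONTH * AVG_INPUT_TOKENS / 1_000_000 * 1.00

# Self-hosted fine-tuned 7B: assume a 45% per-call discount
# plus a fixed hosting bill (assumed $1K/mo).
finetune_cost = haiku_cost * 0.55 + 1_000

print(f"Haiku-class: ${haiku_cost:,.0f}/mo")
print(f"Fine-tuned:  ${finetune_cost:,.0f}/mo")
print(f"Savings:     ${haiku_cost - finetune_cost:,.0f}/mo")
```

Under those assumptions the fine-tune saves money every month, but note the savings are hundreds to low thousands of dollars; the build cost is what dominates, which is why the breakeven math below matters.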
2. Highly-specific domain language the frontier models drift on
Medical coding, legal-contract classification, financial filings — domains where the vocabulary is dense enough that even Sonnet hallucinates terminology under pressure. A fine-tuned model on a labeled corpus locks in the right vocabulary and the right schema. Caveat: the labeled corpus is usually the harder problem; if you don't have one, fine-tuning can't fix that.
3. Latency-critical small-model deployments
These are the cases where you need sub-100ms first-token latency, on-device or in a tight loop. Frontier APIs can't hit those latencies; a fine-tuned small model (7B-13B class) on local infra can. This is rare in our work (voice agents, real-time classification at the edge), but when it's the requirement, fine-tuning is the only path.
4. Output-format consistency at production scale
Workloads where the output format has to be exactly right, every call, with no parser fallbacks. Fine-tuning on a uniform-format corpus is more reliable than prompting for the same format on a frontier model — the format becomes a learned distribution rather than an instruction.
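To make "exactly right, every call, with no parser fallbacks" concrete, here's a minimal sketch of the kind of strict gate such a workload implies. The schema and label set are hypothetical, for illustration only:

```python
import json

def validate_strict(output: str) -> dict:
    """Accept only the exact expected schema; no repair, no fallbacks."""
    record = json.loads(output)  # raises on any malformed JSON
    expected_keys = {"label", "confidence"}  # assumed schema
    if set(record) != expected_keys:
        raise ValueError(f"unexpected keys: {set(record)}")
    if record["label"] not in {"approve", "reject", "escalate"}:
        raise ValueError(f"unknown label: {record['label']}")
    return record

# A fine-tuned model is trained so every completion passes this gate.
# A prompted model passes most of the time, and "most" is the problem
# when the gate runs millions of times a month.
```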
Where fine-tuning loses
Three places we keep redirecting clients away from fine-tuning:
- "Make it sound like our brand voice." System-prompt + 3-5 example outputs gets you 90% of the way there. Fine-tuning is overkill and locks you into the base model.
- "Train it on our knowledge base." This is RAG (or, in 2026, long-context with caching). Fine-tuning on documents to make the model "know" them is a 2023 pattern that doesn't outperform the alternatives anymore.
- "Improve quality on our specific task." 70% of the time, "improve quality" is a prompt problem or an eval problem. Fix those first. The remaining 30% is the cases above.
The breakeven math
When fine-tuning does win, the volume threshold matters. Rough shape we use:
- Build cost: $15K-$60K for a labeled-data fine-tune of a 7-13B model, including data prep, training runs, eval, and deployment. See our GPU shootout for compute costs.
- Operational cost: ~$200-$2K/mo for a hosted fine-tuned model on Modal or RunPod, depending on volume and SLA.
- Per-call savings vs Haiku: ~$0.0003-$0.0008 per call for a typical classification workload.
Math: build cost / per-call savings = breakeven volume. With the figures above, that's roughly 20M-200M calls before the fine-tune pays back, and the monthly operational cost pushes breakeven further out. If you're under that, prompt the frontier model.
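A minimal sketch of that math, using midpoints of the ranges above and an assumed hosting cost:

```python
# Breakeven sketch. Figures are midpoints of the ranges in the text;
# the hosting cost is an assumption for illustration.

BUILD_COST = 40_000          # midpoint of the $15K-$60K build range
SAVINGS_PER_CALL = 0.0005    # midpoint of $0.0003-$0.0008 per call
OPS_PER_MONTH = 1_000        # assumed hosted-model cost

def payback_months(calls_per_month: int) -> float:
    """Months until the build cost is recovered, net of hosting."""
    net_monthly = calls_per_month * SAVINGS_PER_CALL - OPS_PER_MONTH
    if net_monthly <= 0:
        return float("inf")  # never pays back at this volume
    return BUILD_COST / net_monthly

for volume in (1_000_000, 10_000_000, 50_000_000):
    print(f"{volume:>12,} calls/mo -> {payback_months(volume):.1f} months")
```

At 1M calls/month the hosting bill alone exceeds the per-call savings and the fine-tune never pays back; at 10M it takes most of a year. That's the shape behind "if you're under that, prompt the frontier model."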
The scoping conversation we run
When a buyer asks for a fine-tune, we ask three questions:
- What does prompting at the frontier model already give you? Run the workload on Sonnet with caching and a good prompt. Measure quality. The gap to "good enough" is the budget for fine-tuning.
- Do you have at least 1,000 labeled examples of correct output? Usually no. The first half of the engagement becomes data preparation. That's not "fine-tuning" — that's building a training set, which is a different project.
- Is the call volume high enough that the savings justify the build? Run the breakeven math. Most of the time, the answer is "no."
If all three answers favor fine-tuning, we do it. If any of them don't, we recommend prompting and an eval suite instead. Either way, this scoping points in the right direction inside an hour.
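The three questions can be sketched as a decision helper. The function name, inputs, and thresholds here are illustrative, not our actual rubric:

```python
def recommend_finetune(frontier_quality_gap: float,
                       labeled_examples: int,
                       payback_months: float) -> str:
    """Encode the three scoping questions; thresholds are illustrative.

    frontier_quality_gap: measured gap to "good enough" when running
    the workload on a well-prompted frontier model.
    """
    if frontier_quality_gap <= 0.02:   # prompting is already good enough
        return "prompt + eval suite"
    if labeled_examples < 1_000:       # the training set doesn't exist yet
        return "data preparation first"
    if payback_months > 12:            # savings don't justify the build
        return "prompt + eval suite"
    return "fine-tune"
```

For example, a real quality gap, a few thousand labeled examples, and a six-month payback clears all three gates; the far more common case fails on the second or third question.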
The summary
Default to prompting. Fine-tune only on volume, domain-specificity, latency, or strict format constraints. The 2023 reflex of "ship our own model" is mostly retired. The cases where fine-tuning wins are real, narrow, and worth doing well — but they're not where most buyers think they are.