The default position
For 90% of new AI features, the right starting point is a well-prompted frontier model with prompt caching. No fine-tuning. The reasons are honest:
- Better quality out of the box than a fine-tune of a smaller model.
- No data collection, labeling, or training pipeline.
- You can iterate on the prompt in minutes; you can iterate on a fine-tune in days.
- You can swap models when a better one ships; fine-tunes are tied to a base.
The "fine-tune our own model" default from 2023 is mostly dead. A frontier model, a good prompt, caching, and an eval suite eat most of the use cases that used to call for it.
The four cases where fine-tuning wins
1. High-volume classification at a scale where the Haiku rate adds up
You're running 10M+ classifications a month. Haiku at $1/Mtok input is the closest baseline. A fine-tuned 7B-class model on Modal or RunPod can hit Haiku-class quality at meaningfully lower per-call cost, typically 30-60% cheaper at this scale. Below ~10M calls/month the savings don't justify the operational overhead.
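As a rough sketch of the comparison at that volume (the token count, discount, and hosting figure below are assumptions for illustration, not quotes):

```python
# Rough monthly cost comparison at classification scale.
# All numbers are illustrative assumptions, not vendor pricing.

CALLS_PER_MONTH = 10_000_000
AVG_INPUT_TOKENS = 500           # assumed prompt size per classification

# Frontier small model (Haiku-class) at $1 per million input tokens.
haiku_cost = CALLS_PER_MONTH * AVG_INPUT_TOKENS / 1_000_000 * 1.00

# Self-hosted fine-tuned 7B: assume a 45% per-call discount
# plus a fixed hosting bill (assumed $1K/mo).
finetune_cost = haiku_cost * 0.55 + 1_000

print(f"Haiku-class: ${haiku_cost:,.0f}/mo")
print(f"Fine-tuned:  ${finetune_cost:,.0f}/mo")
print(f"Savings:     ${haiku_cost - finetune_cost:,.0f}/mo")
```

Under those assumptions the fine-tune saves money every month, but note the savings are hundreds to low thousands of dollars; the build cost is what dominates, which is why the breakeven math below matters.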
2. Highly-specific domain language the frontier models drift on
Medical coding, legal-contract classification, financial filings — domains where the vocabulary is dense enough that even Sonnet hallucinates terminology under pressure. A fine-tuned model on a labeled corpus locks in the right vocabulary and the right schema. Caveat: the labeled corpus is usually the harder problem; if you don't have one, fine-tuning can't fix that.
3. Latency-critical small-model deployments
These are the cases where you need sub-100ms first-token latency, on-device or in a tight loop. Frontier APIs can't hit those latencies; a fine-tuned small model (7B-13B class) on local infra can. This is rare in our work (voice agents, real-time classification at the edge), but when it's the requirement, fine-tuning is the only path.
4. Output-format consistency at production scale
Workloads where the output format has to be exactly right, every call, with no parser fallbacks. Fine-tuning on a uniform-format corpus is more reliable than prompting for the same format on a frontier model — the format becomes a learned distribution rather than an instruction.
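To make "exactly right, every call, with no parser fallbacks" concrete, here's a minimal sketch of the kind of strict gate such a workload implies. The schema and label set are hypothetical, for illustration only:

```python
import json

def validate_strict(output: str) -> dict:
    """Accept only the exact expected schema; no repair, no fallbacks."""
    record = json.loads(output)  # raises on any malformed JSON
    expected_keys = {"label", "confidence"}  # assumed schema
    if set(record) != expected_keys:
        raise ValueError(f"unexpected keys: {set(record)}")
    if record["label"] not in {"approve", "reject", "escalate"}:
        raise ValueError(f"unknown label: {record['label']}")
    return record

# A fine-tuned model is trained so every completion passes this gate.
# A prompted model passes most of the time, and "most" is the problem
# when the gate runs millions of times a month.
```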
Where fine-tuning loses
Three places we keep redirecting clients away from fine-tuning:
- "Make it sound like our brand voice." System-prompt + 3-5 example outputs gets you 90% of the way there. Fine-tuning is overkill and locks you into the base model.
- "Train it on our knowledge base." This is RAG (or, in 2026, long-context with caching). Fine-tuning on documents to make the model "know" them is a 2023 pattern that doesn't outperform the alternatives anymore.
- "Improve quality on our specific task." 70% of the time, "improve quality" is a prompt problem or an eval problem. Fix those first. The remaining 30% is the cases above.
The breakeven math
When fine-tuning does win, the volume threshold matters. Rough shape we use:
- Build cost: $15K-$60K for a labeled-data fine-tune of a 7-13B model, including data prep, training runs, eval, and deployment. See our GPU shootout for compute costs.
- Operational cost: ~$200-$2K/mo for a hosted fine-tuned model on Modal or RunPod, depending on volume and SLA.
- Per-call savings vs Haiku: ~$0.0003-$0.0008 per call for a typical classification workload.
Math: build cost / per-call savings = breakeven volume. With the figures above, that's roughly 20M-200M calls before the fine-tune pays back, and the monthly operational cost pushes breakeven further out. If you're under that, prompt the frontier model.
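A minimal sketch of that math, using midpoints of the ranges above and an assumed hosting cost:

```python
# Breakeven sketch. Figures are midpoints of the ranges in the text;
# the hosting cost is an assumption for illustration.

BUILD_COST = 40_000          # midpoint of the $15K-$60K build range
SAVINGS_PER_CALL = 0.0005    # midpoint of $0.0003-$0.0008 per call
OPS_PER_MONTH = 1_000        # assumed hosted-model cost

def payback_months(calls_per_month: int) -> float:
    """Months until the build cost is recovered, net of hosting."""
    net_monthly = calls_per_month * SAVINGS_PER_CALL - OPS_PER_MONTH
    if net_monthly <= 0:
        return float("inf")  # never pays back at this volume
    return BUILD_COST / net_monthly

for volume in (1_000_000, 10_000_000, 50_000_000):
    print(f"{volume:>12,} calls/mo -> {payback_months(volume):.1f} months")
```

At 1M calls/month the hosting bill alone exceeds the per-call savings and the fine-tune never pays back; at 10M it takes most of a year. That's the shape behind "if you're under that, prompt the frontier model."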
The scoping conversation we run
When a buyer asks for a fine-tune, we ask three questions:
- What does prompting at the frontier model already give you? Run the workload on Sonnet with caching and a good prompt. Measure quality. The gap to "good enough" is the budget for fine-tuning.
- Do you have at least 1,000 labeled examples of correct output? Usually no. The first half of the engagement becomes data preparation. That's not "fine-tuning" — that's building a training set, which is a different project.
- Is the call volume high enough that the savings justify the build? Run the breakeven math. Most of the time, the answer is "no."
If all three answers favor fine-tuning, we do it. If any of them don't, we recommend prompting and an eval suite instead. Either way, this scoping points in the right direction inside an hour.
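The three questions can be sketched as a decision helper. The function name, inputs, and thresholds here are illustrative, not our actual rubric:

```python
def recommend_finetune(frontier_quality_gap: float,
                       labeled_examples: int,
                       payback_months: float) -> str:
    """Encode the three scoping questions; thresholds are illustrative.

    frontier_quality_gap: measured gap to "good enough" when running
    the workload on a well-prompted frontier model.
    """
    if frontier_quality_gap <= 0.02:   # prompting is already good enough
        return "prompt + eval suite"
    if labeled_examples < 1_000:       # the training set doesn't exist yet
        return "data preparation first"
    if payback_months > 12:            # savings don't justify the build
        return "prompt + eval suite"
    return "fine-tune"
```

For example, a real quality gap, a few thousand labeled examples, and a six-month payback clears all three gates; the far more common case fails on the second or third question.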
The summary
Default to prompting. Fine-tune only on volume, domain-specificity, latency, or strict format constraints. The 2023 reflex of "ship our own model" is mostly retired. The cases where fine-tuning wins are real, narrow, and worth doing well — but they're not where most buyers think they are.