The problem we're solving
Most AI features ship without a regression test. The team's "evaluation" is the engineer who built the feature trying it three times in dev and pronouncing it good. Six weeks later a model version changes, a prompt gets edited, the output gets weird, and nobody notices until a customer complains.
Most attempts to fix this overshoot. Teams stand up an eval framework, target 200 prompts, define rubrics, hire a vendor. Six months later they have 30 prompts written, the framework is half-configured, and nobody runs it. The eval has joined the graveyard of internal tools that were too ambitious to finish.
The minimum viable eval splits the difference: enough to catch the regressions that hurt, simple enough that it survives the engineer who built it leaving.
What we install
The shape we use on essentially every engagement:
- A directory called evals/ at the repo root, with one JSON file per case.
- Twelve cases. Not 200. Twelve. Three "happy path," three "edge case we discovered the hard way," three "things customers asked that surprised us," three "adversarial / would be embarrassing if we got wrong."
- Each case has: an input, a "must contain" list of substrings, a "must not contain" list, and an optional "judge prompt" we use when string-matching is too brittle. (A sample case file is sketched just after this list.)
- A single script — evals/run.py or equivalent — that runs the cases against the live system, scores them, and exits 0 on pass, non-zero on fail.
- Wired to CI. Runs on every PR that touches the prompt or the model wiring. Costs maybe $0.40 per run. Stops bad commits at the door.
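A minimal sketch of what one case file might look like. The field names and the case itself are illustrative, not a fixed schema — pick whatever keys your scorer reads:

```json
{
  "name": "refund-policy-citation",
  "input": "What is your refund policy for annual plans?",
  "must_contain": ["30 days", "pro-rated"],
  "must_not_contain": ["I don't know", "as an AI"],
  "judge_criterion": "Cites the refund policy document by name and quotes at least one passage."
}
```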
That's it. No framework, no UI, no rubric committee. The whole thing is usually 200–400 lines of code and ships in a day.
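For concreteness, here is a condensed sketch of what the scorer can look like. It assumes a `run_system(input)` entry point into the system under test (ours varies per engagement) and only shows the substring checks; the judge path is covered in the next section:

```python
#!/usr/bin/env python3
"""Minimal eval runner: load cases, run the system, substring-check, exit non-zero on failure."""
import json
import sys
from pathlib import Path

from my_app import run_system  # assumed entry point into the system under test


def check_case(case: dict) -> list[str]:
    """Return a list of failure messages for one case (empty list means pass)."""
    output = run_system(case["input"])
    failures = []
    for needle in case.get("must_contain", []):
        if needle.lower() not in output.lower():
            failures.append(f"missing required substring: {needle!r}")
    for needle in case.get("must_not_contain", []):
        if needle.lower() in output.lower():
            failures.append(f"contains forbidden substring: {needle!r}")
    return failures


def main() -> int:
    failed = 0
    for path in sorted(Path(__file__).parent.glob("*.json")):
        case = json.loads(path.read_text())
        failures = check_case(case)
        if failures:
            failed += 1
            print(f"FAIL {path.name}")
            for msg in failures:
                print(f"  - {msg}")
        else:
            print(f"PASS {path.name}")
    return 1 if failed else 0


if __name__ == "__main__":
    sys.exit(main())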
The "judge prompt" trick
Substring matching covers about 70% of cases. The remaining 30% need a model to grade. We use Haiku 4.5 as the judge and a tight prompt:
"You are evaluating a candidate response to the question below. The criterion is: [criterion]. Reply with PASS or FAIL on the first line, then a one-sentence explanation. Be strict. If the response partially matches, FAIL."
Two notes that matter. First, Haiku is the judge, not the system under test: it's cheap and it grades reliably. Second, the criterion has to be specific. "Is the answer good" is not a criterion. "Does the answer cite the source document by name and quote at least one passage" is.
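When a case carries a judge criterion, the scorer hands it off to the judge model. Here is a sketch using the Anthropic Python SDK; the model id string is a placeholder for whichever Haiku version you actually pin:

```python
import anthropic

JUDGE_MODEL = "claude-haiku-4-5"  # placeholder id; pin the exact judge model you use

JUDGE_PROMPT = (
    "You are evaluating a candidate response to the question below. "
    "The criterion is: {criterion}. Reply with PASS or FAIL on the first line, "
    "then a one-sentence explanation. Be strict. If the response partially matches, FAIL.\n\n"
    "Question:\n{question}\n\nCandidate response:\n{response}"
)


def judge(criterion: str, question: str, response: str) -> bool:
    """Ask the judge model to grade one response against one binary criterion."""
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    message = client.messages.create(
        model=JUDGE_MODEL,
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                criterion=criterion, question=question, response=response
            ),
        }],
    )
    verdict = message.content[0].text.strip().splitlines()[0].upper()
    return verdict.startswith("PASS")
```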
What we deliberately don't do
Here are the things we've tried and rejected, since this is the part teams burn time on:
- No exhaustive coverage. Twelve prompts is the floor that catches regressions. Going to fifty doesn't catch more bugs proportionally — it just makes the eval slower and more painful to update.
- No similarity-score grading. "Output should be 0.85 cosine-similar to golden" gets you false confidence. Either substring-match (cheap, reliable) or model-judge (more expensive, more flexible). Skip the embedding similarity middle ground.
- No human-in-the-loop platforms. The whole point is that the eval runs on every PR. If a human has to label, it doesn't run on every PR.
- No "comprehensive rubrics." Three to five criteria per case, each binary. If you can't decide pass/fail, the criterion is wrong.
How to pick the twelve cases
The single most-asked question. The answer is "the cases that, if they regressed silently, you'd be most embarrassed by." For most systems that maps to:
- The flagship demo case the founder shows investors.
- The edge case that already broke once and got patched.
- The customer-reported bug from the last 90 days.
- The "what would a competitor screenshot" case — the worst answer the system could give that a screenshot could embarrass you.
- A "boring path" case that should always be trivially correct.
We start with five and accumulate the rest as the engagement runs. Every time we hit a bug in production, that bug becomes case #N+1 before the fix ships. After 8–12 cases the eval starts catching regressions before they reach prod.
The lifecycle that keeps it alive
Most eval setups die because nobody owns them. We solve that structurally: every PR that touches a prompt or model configuration must update evals/ if the change affects output behaviour. The CI check enforces it. If the eval passed before but the prompt changed, either (a) the eval covers the new behaviour explicitly, or (b) you owe a new case. Adding the case is part of the PR.
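A sketch of the gate itself, assuming GitHub Actions; the path filters and secret name are placeholders for your own repo layout:

```yaml
# .github/workflows/evals.yml
name: evals
on:
  pull_request:
    paths:
      - "prompts/**"   # wherever the prompt templates live
      - "evals/**"
      - "src/llm/**"   # the model wiring
jobs:
  run-evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: python evals/run.py
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```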
This sounds bureaucratic. It is the difference between an eval suite that's still alive a year later and one that's frozen at the five cases the original engineer wrote.
What this catches and what it misses
Catches: model version changes that break tone or structure, prompt edits that drop a critical instruction, tool-use calls that stop firing on a class of input, output format regressions, basic safety failures.
Misses: drift in production traffic patterns that your 12 cases don't cover, latency regressions, cost regressions (separate dashboards for those), and the hardest one — slow degradation in subjective quality that your judge prompt can't detect.
That last one is real. The MVE doesn't substitute for occasional human review of production samples. We pair it with a quarterly "20-sample eyeball" — pull 20 random prod outputs, three engineers read them, write a one-paragraph note. That note sometimes spawns new eval cases.
The takeaway
Eval debt is the #2 reason AI engagements stall in our build checklist diagnostics (the #1 is data debt). The fix is not "buy an eval product." It's "write twelve prompts in a JSON file and a 200-line scorer, and gate your PRs on it."
We install this on day three of every engagement. It has saved every project we've shipped at least once. It has cost no client more than a day to set up. It is the highest leverage thing you can do in the first week of any AI build.