The problem
Client runs a UGC platform with ~12M user-generated items per month flowing through automated moderation (text, image captions, comments, profile bios). Their existing pipeline ran every item through Claude Sonnet with a long policy system prompt, then escalated low-confidence cases to a human team of 9 trust & safety reviewers.
Two things were breaking:
- Cost. Sonnet-on-everything was running ~$40K/month and scaling linearly with item volume.
- Queue depth. The human team's escalation queue was averaging 11K items deep — escalations sat 36+ hours before review, which on a UGC platform is a very long time for a "maybe-policy-violating" item to be live.
The brief: keep the same policy outcomes, fix the cost line, and get the queue down to "a human can actually clear it the same day."
The diagnosis
We sampled 5,000 historical decisions and ran them through both Claude Haiku 4.5 and Sonnet 4.6 with the same policy prompt (the comparison loop is sketched in code after the list below). Two findings drove the design:
- Haiku agreed with Sonnet on ~83% of items. The disagreements clustered in a specific sliver — items with sarcasm, layered context, or cultural references. The bulk of the item stream (clear-cut posts) was being decided identically by both models.
- Most "human escalations" weren't ambiguous. About 60% of items in the queue were Sonnet-flagged-as-uncertain but were trivially obvious one way or the other on review. The model was over-escalating.
So the problem wasn't "we need a smarter model." It was "we're paying Sonnet to do work Haiku could do, and Sonnet is escalating when it should be deciding."
What we shipped
A three-tier cascade with explicit confidence thresholds at each gate. The routing logic and both model tiers run on Modal (serverless GPU/CPU), and every decision plus its audit trail lands in Postgres; a code sketch of the routing follows the tier list.
- Tier 1 — Haiku pre-filter. Every item hits Haiku first with a tightened prompt that asks for a JSON verdict (allow / block / escalate) plus a self-reported confidence score. Items with confidence ≥ 0.92 get the Haiku decision and stop. About 78% of items end here. Cost: $0.0012 per item.
- Tier 2 — Sonnet adjudicator. Anything Haiku flagged escalate or returned with confidence < 0.92 goes to Sonnet with the original item plus Haiku's reasoning. Sonnet returns a final decision plus its own confidence. Items with Sonnet confidence ≥ 0.85 stop here. About 19% of items end at Tier 2. Cost: $0.011 per item.
- Tier 3 — Human queue. Only items where Sonnet's confidence is < 0.85 escalate to the human team. About 3% of items hit this tier. Average queue depth dropped from 11K → 3.2K, and the team clears it inside the same business day.
- Audit trail. Every decision (which tier ruled, which model, what the confidence was, what the reasoning chain looked like) lands in Postgres. Trust & safety can run "show me every Haiku-allow decision in category X for the last week" queries to spot drift.
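In code, the whole cascade is one routing function. A sketch using the production thresholds and reusing the hypothetical `moderate()` helper from the diagnosis section; `record_decision()` stands in for the Postgres audit write:

```python
import json

HAIKU_STOP = 0.92   # Tier-1 stop threshold
SONNET_STOP = 0.85  # Tier-2 stop threshold

def record_decision(tier: int, model: str | None, decision: dict) -> dict:
    # Stand-in for the Postgres audit write: tier, model, confidence, reasoning.
    return {"tier": tier, "model": model, **decision}

def route(item_text: str, policy_prompt: str) -> dict:
    # Tier 1: Haiku decides clear-cut items outright (~78% stop here).
    t1 = moderate("claude-haiku-4-5", policy_prompt, item_text)
    if t1["verdict"] != "escalate" and t1["confidence"] >= HAIKU_STOP:
        return record_decision(1, "haiku", t1)

    # Tier 2: Sonnet sees the item plus Haiku's structured verdict, not a fresh
    # full-context pass (this hand-off is what makes the cost math work below).
    t2 = moderate(
        "claude-sonnet-4-6",  # placeholder ID
        policy_prompt,
        f"{item_text}\n\nTier-1 verdict: {json.dumps(t1)}",
    )
    if t2["confidence"] >= SONNET_STOP:
        return record_decision(2, "sonnet", t2)

    # Tier 3: still uncertain -> human review queue (~3% of items).
    return record_decision(3, None, {"verdict": "needs_human"})
```

The two threshold constants are the tuning levers: raise them and more items escalate, lower them and more items stop early.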
What we evaluated continuously
- Daily replay against a frozen 1,000-item gold-set labelled by the client's senior moderators. Catches prompt drift before it reaches users (sketched in code after this list).
- Tier-1 → Tier-2 disagreement rate (does Haiku agree with Sonnet when both judge the same item?) — early-warning indicator that one tier's prompt or model has shifted.
- Cost per 10K items as a single dashboard number. Anomaly alert at 1.4× rolling median.
- Human reviewer agreement with Tier 2 — sampled 5% of Tier-2-decided items and routed them to humans anyway as a calibration check. Disagreement on < 4% of cases over 90 days.
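The daily replay is the core of the eval harness (the real one is ~150 lines; see Stack). A minimal sketch, assuming gold rows shaped like `{"text": ..., "label": ...}`, the `route()` function above, and a standard Slack incoming webhook:

```python
import os
import requests

def nightly_replay(gold_set: list[dict], policy_prompt: str) -> float:
    mismatches = [
        g for g in gold_set
        if route(g["text"], policy_prompt)["verdict"] != g["label"]
    ]
    rate = len(mismatches) / len(gold_set)
    # Post the delta to the team's Slack channel via an incoming webhook.
    requests.post(
        os.environ["SLACK_WEBHOOK_URL"],
        json={"text": f"Gold-set replay: {rate:.1%} mismatch "
                      f"({len(mismatches)}/{len(gold_set)})"},
    )
    return rate
```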
The cost math
Per-item economics, blended across the volume distribution:
- Before: 12M items × $0.0033 (Sonnet on everything) = ~$40K/month.
- After:
  - 78% × $0.0012 (Haiku) = $11.2K monthly subtotal across 9.36M items
  - 19% × $0.011 (Haiku + Sonnet) = $25.1K — wait, that doesn't help.
Right — the trick that made this work isn't just tier routing, it's that Tier 2 doesn't reprocess the full input. Sonnet receives the original item plus Haiku's already-generated structured summary. Effective input tokens to Sonnet are ~30% of the original item, plus Haiku's ~150-token verdict. So:
- 78% × $0.0012 = $11.2K
- 22% × ($0.0012 Haiku + $0.0034 short-context Sonnet) = ~$12.1K
- Total: ~$23.3K/month from tier routing alone. On top of that we added prompt caching on the policy doc (~6K tokens, identical on every call), which dropped that input portion by 90% and brought the actual landed bill to ~$10.1K/month. The arithmetic is sketched below.
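As plain arithmetic, using this case's per-item prices; swap in your own volume and tier distribution:

```python
ITEMS = 12_000_000
HAIKU = 0.0012         # Tier-1 cost per item
SONNET_SHORT = 0.0034  # short-context Sonnet per item (summary hand-off)

tier1_only = 0.78 * ITEMS * HAIKU                   # items that stop at Tier 1
tier2_path = 0.22 * ITEMS * (HAIKU + SONNET_SHORT)  # items that also hit Sonnet
print(f"tier 1: ${tier1_only/1e3:.1f}K  tier 2 path: ${tier2_path/1e3:.1f}K")
# -> tier 1: $11.2K  tier 2 path: $12.1K, i.e. ~$23.3K/month before prompt caching
```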
Use the LLM cost calculator to model the math at your own item volume — it's the same architecture this case used.
What we'd do differently
- Grow the gold-set sooner. We started with 200 labelled items in week 1; it should have been 1,000 from day one. The smaller set let one regression slip through that the bigger set caught a week later (no production impact, but uncomfortable).
- Make tier-routing observable from day one. The dashboard that shows "% items decided at Tier 1 / Tier 2 / Tier 3" was an afterthought. It should have been built before the cascade went live — it's the single most useful number for spotting drift.
- Prompt-cache the policy doc from architecture day. We added it as a tuning step in week 4; should have been load-bearing from the architecture diagram. Same lesson as our earlier LLM bill case study — caching is architectural, not optimisation.
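Concretely, the caching is one parameter on the call. A sketch with the Anthropic Python SDK's prompt caching, marking the static policy doc with `cache_control` so repeat calls pay the much cheaper cache-read rate; `POLICY_DOC` and the model ID are placeholders:

```python
import anthropic

client = anthropic.Anthropic()
POLICY_DOC = "..."  # the ~6K-token policy document, byte-identical on every call

resp = client.messages.create(
    model="claude-haiku-4-5",  # substitute the current model ID
    max_tokens=200,
    system=[{
        "type": "text",
        "text": POLICY_DOC,
        # Cached across calls; subsequent reads bill at ~10% of the input price.
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "item text here"}],
)
```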
Stack
- Models: Claude Haiku 4.5 (Tier 1, pre-filter) + Claude Sonnet 4.6 (Tier 2, adjudicator)
- Compute: Modal for serverless inference orchestration. At the volume we run, cold starts aren't a concern.
- Storage: Postgres on Neon — decisions, audit trail, gold-set, eval runs.
- Eval harness: ~150 lines of Python that runs the gold-set nightly and posts deltas to a Slack channel.
- Observability: Datadog for cost + latency + tier-distribution dashboards.
Related
- Built a grounded outbound-personalisation agent — same tier-routing principle applied to a different problem.
- Rebuilt a SaaS support stack from scratch in 3 weeks — earlier engagement using the same Modal + Postgres pattern.
- How we cut a client's LLM bill 71% with prompt caching — companion blog post on the caching lesson learned here.