BUILD

Cut a content-moderation team's spend 4× with a Haiku → Sonnet → human cascade

Mid-market UGC platform handling ~12M user-generated items per month. Moved from a single Sonnet-on-everything pipeline to a tiered Haiku → Sonnet → human cascade. Same policy coverage, dramatically lower cost, and a queue the trust & safety team can actually keep up with.

SHIPPED 2026-01-22 · SCOPE 4 WEEKS · STACK CLAUDE HAIKU 4.5 + SONNET 4.6, POSTGRES, MODAL

MONTHLY API SPEND

$40K → $10K

FALSE-POSITIVE RATE

↓ 38%

HUMAN-REVIEW QUEUE

↓ 71%

INCIDENTS IN 90 DAYS

0

The problem

Client runs a UGC platform with ~12M user-generated items per month flowing through automated moderation (text, image captions, comments, profile bios). Their existing pipeline ran every item through Claude Sonnet with a long policy system prompt, then escalated low-confidence cases to a human team of 9 trust & safety reviewers.

Two things were breaking:

  1. Cost. Sonnet-on-everything was running ~$40K/month and scaling linearly with item volume.
  2. Queue depth. The human team's escalation queue was averaging 11K items deep — escalations sat 36+ hours before review, which on a UGC platform is a very long time for a "maybe-policy-violating" item to be live.

The brief: keep the same policy outcomes, fix the cost line, and get the queue down to "a human can actually clear it the same day."

The diagnosis

We sampled 5,000 historical decisions and ran them through both Claude Haiku 4.5 and Sonnet 4.6 with the same policy prompt. Two findings drove the design:

  1. Haiku matched Sonnet's verdict on the large majority of items; most moderation calls are clear-cut.
  2. A meaningful share of Sonnet's escalations were cases where Sonnet's own verdict already matched the human reviewer's final call.

So the problem wasn't "we need a smarter model." It was "we're paying Sonnet to do work Haiku could do, and Sonnet is escalating when it should be deciding."
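The comparison itself is cheap once both sets of verdicts exist. A minimal sketch, assuming the 5,000 sampled decisions have been replayed through both models and collected into dicts; the field names (`haiku_verdict`, `sonnet_verdict`, `sonnet_escalated`, `human_verdict`) are ours for illustration, not the client's schema.

```python
# Hypothetical replay of the 5,000-item sample: verdicts from both models
# (and the human reviewer's final call) already collected into one list.
def diagnose(decisions: list[dict]) -> dict:
    n = len(decisions)
    agreement = sum(
        d["haiku_verdict"] == d["sonnet_verdict"] for d in decisions
    )
    # Escalations where Sonnet's own verdict matched the human's final
    # call: Sonnet was "escalating when it should be deciding".
    needless = sum(
        d["sonnet_escalated"] and d["sonnet_verdict"] == d["human_verdict"]
        for d in decisions
    )
    return {
        "haiku_sonnet_agreement": agreement / n,
        "needless_escalation_rate": needless / n,
    }
```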

What we shipped

A three-tier cascade with explicit confidence thresholds at each gate: Haiku screens every item, Sonnet re-checks only what Haiku flags as low-confidence, and humans see only what Sonnet can't settle. All four components run on Modal (serverless GPU/CPU) and write decisions + audit trail to Postgres.
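The gate logic is small enough to show in full. A minimal sketch, not the production service: the `score_item()` helper, the model labels, and the threshold values are placeholders (the real thresholds were tuned against the gold set).

```python
from dataclasses import dataclass

HAIKU_CONFIDENT = 0.90   # placeholder: above this, Haiku's verdict is final
SONNET_CONFIDENT = 0.75  # placeholder: below this, the item goes to a human

@dataclass
class Decision:
    verdict: str       # e.g. "allow" / "remove" / "review"
    tier: int          # 1 = Haiku, 2 = Sonnet, 3 = human queue
    confidence: float

def score_item(model: str, item: str, context: str | None = None):
    """Placeholder: call `model` with the policy prompt and return
    (verdict, confidence, structured_summary). Wire to your API client."""
    raise NotImplementedError

def moderate(item: str) -> Decision:
    verdict, conf, summary = score_item("haiku", item)
    if conf >= HAIKU_CONFIDENT:
        return Decision(verdict, tier=1, confidence=conf)
    # Tier 2 gets the item plus Haiku's structured summary (see the
    # cost math below), not a from-scratch pass over the full input.
    verdict, conf, _ = score_item("sonnet", item, context=summary)
    if conf >= SONNET_CONFIDENT:
        return Decision(verdict, tier=2, confidence=conf)
    return Decision("review", tier=3, confidence=conf)  # Tier 3: human queue
```

Every `Decision`, whatever the tier, lands in Postgres with the model's confidence and summary as the audit trail.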

What we evaluated continuously

Two numbers, tracked on every prompt or threshold change: agreement with a gold set of labelled items, and the per-tier routing split. Between them they catch drift before it reaches the queue.
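A minimal sketch of the gold-set replay, assuming the labelled items sit in a Postgres table; the table and column names are illustrative, and `moderate()` is the cascade entry point sketched above.

```python
import psycopg  # psycopg 3

def gold_set_pass_rate(conn: psycopg.Connection, moderate) -> float:
    """Replay every labelled item through the cascade and return the
    share of verdicts that match the human label."""
    with conn.cursor() as cur:
        cur.execute("SELECT item_text, labelled_verdict FROM gold_set")
        rows = cur.fetchall()
    hits = sum(moderate(text).verdict == label for text, label in rows)
    return hits / len(rows)
```

A pass rate below the last accepted run blocks the deploy.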

The cost math

Per-item economics, blended across the volume distribution, drive the headline number.

The trick that made this work isn't just tier routing: Tier 2 doesn't reprocess the full input. Sonnet receives the original item plus Haiku's already-generated structured summary. Effective input tokens to Sonnet are ~30% of the original item, plus Haiku's ~150-token verdict, so a Tier 2 check costs a fraction of a full Sonnet pass.
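A worked example of the blend. The token counts, routing share, and per-million-token prices below are illustrative placeholders, not the client's actuals, and the real pipeline also prompt-caches the policy doc on top of this.

```python
PRICE_IN  = {"haiku": 1.00, "sonnet": 3.00}   # $ per 1M input tokens (assumed)
PRICE_OUT = {"haiku": 5.00, "sonnet": 15.00}  # $ per 1M output tokens (assumed)

ITEM_TOKENS    = 600   # average item size (assumed)
SUMMARY_TOKENS = 150   # Haiku's structured verdict, per the write-up
TIER2_INPUT    = 0.30 * ITEM_TOKENS + SUMMARY_TOKENS  # ~30% of item + summary
P_TIER2        = 0.15  # share of items Haiku escalates to Sonnet (assumed)

tier1 = (ITEM_TOKENS * PRICE_IN["haiku"] + SUMMARY_TOKENS * PRICE_OUT["haiku"]) / 1e6
tier2 = (TIER2_INPUT * PRICE_IN["sonnet"] + SUMMARY_TOKENS * PRICE_OUT["sonnet"]) / 1e6
flat  = (ITEM_TOKENS * PRICE_IN["sonnet"] + SUMMARY_TOKENS * PRICE_OUT["sonnet"]) / 1e6

blended = tier1 + P_TIER2 * tier2  # every item pays Tier 1; ~15% also pay Tier 2
print(f"~${blended * 12_000_000:,.0f}/month blended at 12M items")
print(f"~${flat * 12_000_000:,.0f}/month if Sonnet ran everything")
```

With the policy doc prompt-cached on both tiers (see below), the blended number drops further; the shape of the saving, not these placeholder prices, is the point.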

Use the LLM cost calculator to model the math at your own item volume — it's the same architecture this case used.

What we'd do differently

  1. Tighten the gold set sooner. We started with 200 labelled items in week 1; it should have been 1,000 from day one. The smaller set let one regression slip through that the bigger set caught a week later (no production impact, but uncomfortable).
  2. Make tier-routing observable from day one. The dashboard that shows "% items decided at Tier 1 / Tier 2 / Tier 3" was an afterthought. It should have been built before the cascade went live — it's the single most useful number for spotting drift.
  3. Prompt-cache the policy doc from architecture day. We added it as a tuning step in week 4; it should have been load-bearing from the architecture diagram. Same lesson as our earlier LLM bill case study: caching is architectural, not optimisation. The sketch below shows the shape of it.
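What "architectural" looks like in practice: a minimal sketch of policy-doc caching with the Anthropic SDK, marking the long, identical system prompt as cacheable so calls after the first read it at the reduced cached-input rate. `POLICY_DOC` and the model name stand in for the client's actual values.

```python
import anthropic

client = anthropic.Anthropic()
POLICY_DOC = "..."  # the long moderation policy, identical on every call

def classify(item: str) -> str:
    resp = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=300,
        system=[{
            "type": "text",
            "text": POLICY_DOC,
            # Cached across calls: subsequent reads of the policy doc
            # are billed at a fraction of the normal input price.
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": item}],
    )
    return resp.content[0].text
```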

Stack

Claude Haiku 4.5 · Claude Sonnet 4.6 · Postgres · Modal (serverless GPU/CPU)

Related

Spending too much on Sonnet-on-everything? Tier-routing is the unlock.

SCOPE A BUILD → RUN THE NUMBERS →