The problem
Client is a B2B SaaS company with a 3-person SDR team running outbound across 30+ rotating mailboxes (~200K total sends per month). Their existing motion was a generic sequencer with mail-merge variables — sender domains were burning out every 4–6 weeks, reply rate was stuck at 2.1%, and the team was spending most of their day producing "personalised" first lines that all sounded the same.
They'd evaluated three off-the-shelf "AI SDR" tools. Each produced personalisation that looked plausible but referenced things that didn't exist (made-up product features, invented job changes, the wrong company entirely). Reply rates went up 2× on small samples, then collapsed once a hallucination embarrassed a senior buyer in the prospect's company. They wanted a system they could actually ship at full volume without a 3am incident.
The diagnosis
We sampled 800 historical prospects and audited what counted as a "good" personalised first line in their best-performing campaigns. Two patterns:
- Replies came from grounded specificity, not creativity. The sequences with real reply rates referenced public-record facts (a recent funding announcement, a specific job posting, a press release). Generic "I noticed you do X" got ignored.
- Hallucinations all came from the same source. When the model was given a LinkedIn profile and asked to "find an interesting hook," it confabulated. When the model was given five concrete public facts and asked to pick the most relevant one, it didn't.
The job wasn't "make the model creative." It was "give the model grounded inputs and constrain its output."
What we shipped
Three layers plus a human gate, glued to their existing Clay → Smartlead pipeline:
- Research layer. For each prospect, a small Python orchestrator pulls 5 grounded sources: company press releases (last 6 months), recent funding (Crunchbase API), the prospect's recent LinkedIn posts (when accessible via Clay's enrichment), open job postings, and the company's blog feed. Outputs a JSON "evidence pack" with up to 8 timestamped facts and source URLs; the pack's shape is sketched after this list. ~$0.01 in third-party data costs per prospect, no LLM tokens spent.
- Selection layer. The evidence pack goes to Claude Sonnet with a strict schema: pick the single most relevant fact for outbound and explain why in < 30 words. The model is forbidden from inventing or extrapolating; only the literal facts in the evidence pack are allowed. ~250 input tokens, ~80 output tokens. Costs ~$0.0008 per prospect. (The call itself is sketched after this list.)
- Composition layer. The selected fact + the SDR's chosen sequence template go to Sonnet a second time to write the first line and any follow-up references. Same constraint: only the chosen fact is in scope. ~600 input tokens (a cached system prompt brings that down dramatically; see the math below), ~150 output tokens. Costs ~$0.0006 per draft.
- Human gate. The draft lands in a small custom UI for the SDR to scan in ≤ 30 seconds before it's released to Smartlead. They're checking for cringe, not editing. They reject ~3% of drafts; the rest go out unmodified.
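For the curious, here's roughly the shape of the evidence pack the research layer emits. This is a sketch: the field names and the example fact are illustrative, not the production schema.

```python
from dataclasses import dataclass, asdict
from typing import Literal
import json

@dataclass
class Fact:
    fact_id: str
    source: Literal["press_release", "funding", "linkedin_post",
                    "job_posting", "blog"]
    claim: str        # one literal, public-record statement
    source_url: str
    observed_at: str  # ISO-8601 date; lets us enforce the freshness cap

@dataclass
class EvidencePack:
    prospect_id: str
    company_domain: str
    facts: list[Fact]  # capped at 8; an empty pack means the prospect is skipped

# Illustrative example, not real prospect data.
pack = EvidencePack(
    prospect_id="p_123",
    company_domain="acme.example",
    facts=[Fact("f1", "funding",
                "Announced a Series A on 2025-01-15",
                "https://acme.example/press/series-a", "2025-01-15")],
)
print(json.dumps(asdict(pack), indent=2))
```

The important property: every downstream prompt receives only this pack. Nothing else about the prospect is in scope.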
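And a minimal sketch of the selection call, assuming the Anthropic Python SDK. The prompt wording, model id, and helper name are placeholders, not the production values.

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SELECTION_SYSTEM = (
    "You are ranking evidence for an outbound email. You will receive a JSON "
    "evidence pack of literal, sourced facts. Pick the single most relevant "
    "fact for a first touch. Do not invent, merge, or extrapolate facts. "
    'Reply with JSON only: {"fact_id": "...", "why": "<under 30 words>"}'
)

def select_fact(evidence_pack: dict) -> dict:
    resp = client.messages.create(
        model="claude-sonnet-4-6",  # model id assumed; pin whatever you run
        max_tokens=150,
        system=SELECTION_SYSTEM,
        messages=[{"role": "user", "content": json.dumps(evidence_pack)}],
    )
    choice = json.loads(resp.content[0].text)
    # Hard guard: the chosen fact must exist in the pack, or we fail closed.
    valid_ids = {f["fact_id"] for f in evidence_pack["facts"]}
    if choice.get("fact_id") not in valid_ids:
        raise ValueError("model referenced a fact outside the evidence pack")
    return choice
```

The composition call follows the same pattern; the notable difference is that its long system prompt is sent as a list of blocks with `"cache_control": {"type": "ephemeral"}` on the static block, which is where the caching savings in the cost section come from.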
What "working" looks like in the dashboards
- Reply rate blended across all sequences: 14% (up from 2.1% baseline).
- Meeting-booked rate on replies: 28%, roughly the same as the baseline. Quality held as volume grew: the extra replies are real intent, not "stop emailing me."
- Sender-domain reputation stable at "good" across all 30 mailboxes; previously two domains hit "poor" every month and had to be rotated out.
- Drafts auto-approved by SDR: 97%. The 3% rejected fall into clear categories (the chosen fact wasn't actually relevant; the tone landed wrong for the persona).
The cost math
200K sends/month × ~$0.0014 per draft (selection + composition) ≈ $280/month in raw token costs at uncached prices. Prompt caching brings the blended cost down to roughly 30% of that, landing the actual monthly bill at ~$80.
The third-party data layer (Clay credits + Crunchbase API) is the bigger cost line — about $1,200/month at this volume — but the client was already paying for Clay regardless. Use the LLM cost calculator with cache rate set to 70% to reproduce the math at any volume.
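A back-of-envelope version of that calculation, using this section's own numbers. The 0.30 multiplier is the blended caching discount described above (an approximation, not anyone's price sheet):

```python
def monthly_token_cost(sends_per_month: int,
                       raw_cost_per_draft: float,
                       cache_multiplier: float = 0.30) -> float:
    """Blended monthly LLM cost. cache_multiplier is the fraction of the
    uncached cost that survives prompt caching (~0.30 at a 70% cache rate)."""
    return sends_per_month * raw_cost_per_draft * cache_multiplier

raw = 200_000 * 0.0014  # ~$280/month at uncached prices
print(f"uncached ~${raw:.0f}, cached ~${monthly_token_cost(200_000, 0.0014):.0f}")
# uncached ~$280, cached ~$84 -- the ~$80/month figure above
```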
What we evaluated continuously
- Hallucination check. A second Claude pass (different prompt, no system context) looks at every draft plus the original evidence pack and answers a single yes/no question: "does any claim in this draft go beyond the literal facts in the evidence?" Flagged drafts route to immediate human review. Zero confirmed hallucinations escaped to a send in 60 days. (Sketched after this list.)
- Sender-reputation monitoring across all mailboxes — daily Smartlead pull into Postgres, anomaly alert on any mailbox dropping below "good."
- Per-sequence reply-rate dashboard with a 7-day rolling baseline. Drops trigger a manual prompt review; the rolling check is also sketched after this list.
- Random sampling — 1% of drafts are routed to a senior SDR for blind quality review, scored 1–5. Rolling average has held above 4.2 for the entire engagement.
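The hallucination check is the simplest piece of the system and the one we'd least want to live without. A minimal sketch, again assuming the Anthropic SDK, with the prompt paraphrased:

```python
import json
import anthropic

client = anthropic.Anthropic()

CHECK_PROMPT = (
    "Below is an email draft and the evidence pack it was written from. "
    "Answer with exactly one word, YES or NO: does any claim in the draft "
    "go beyond the literal facts in the evidence?"
)

def draft_is_grounded(draft: str, evidence_pack: dict) -> bool:
    resp = client.messages.create(
        model="claude-sonnet-4-6",  # assumed model id
        max_tokens=5,
        # Deliberately no system prompt: the checker shares no context
        # with the pass that wrote the draft.
        messages=[{
            "role": "user",
            "content": (f"{CHECK_PROMPT}\n\nDRAFT:\n{draft}\n\n"
                        f"EVIDENCE:\n{json.dumps(evidence_pack)}"),
        }],
    )
    verdict = resp.content[0].text.strip().upper()
    # Anything other than a clean NO fails closed into human review.
    return verdict == "NO"
```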
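The reply-rate baseline check is one SQL query over the daily stats. A sketch, assuming the psycopg 3 driver and a hypothetical `sequence_daily_stats` table fed by the daily Smartlead pull:

```python
import psycopg

ROLLING_DROP_SQL = """
SELECT sequence_id
FROM   sequence_daily_stats            -- hypothetical table name
WHERE  day >= current_date - 28
GROUP  BY sequence_id
HAVING avg(reply_rate) FILTER (WHERE day >= current_date - 7)
     < 0.7 * avg(reply_rate) FILTER (WHERE day < current_date - 7);
"""

def sequences_needing_review(conn: psycopg.Connection) -> list[str]:
    # Flags any sequence whose 7-day reply rate fell more than 30% below
    # its trailing 3-week baseline. The 0.7 threshold is illustrative.
    with conn.cursor() as cur:
        cur.execute(ROLLING_DROP_SQL)
        return [row[0] for row in cur.fetchall()]
```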
What we'd do differently
- Build the evidence-pack inspector first. When the SDRs occasionally rejected a draft, we couldn't easily see which evidence the system had picked from. We built that view in week 4 — should have been week 1. Made the prompt-tuning cycle 5× faster once it existed.
- Cap the evidence-pack age tighter. "Last 6 months" let in some stale facts that were technically true but felt outdated. We tightened to 90 days mid-engagement and reply rate ticked up another 1.5 points.
- Ship the SDR review UI as a standalone tool. We embedded it inside Smartlead via an iframe initially. The SDRs hated the round-trip. A dedicated lightweight UI took an extra 3 days but cut review time per draft from ~50s to ~30s.
Stack
- Models: Claude Sonnet 4.6 for selection + composition + hallucination check. We tried Haiku for the selection step but the quality drop on relevance ranking wasn't worth the savings at this volume.
- Enrichment: Clay for the orchestration + people data. Crunchbase API for funding events.
- Sender infra: Smartlead for inbox rotation + warmup + send.
- Storage: Postgres on Neon — drafts, evidence packs, reply tracking, eval scores.
- Compute: A small Python service on Modal handles the orchestration; serverless cold starts are fine for this workload.
- Review UI: ~400 lines of vanilla JS + Postgres. No framework.
Related
- Anatomy of a working AI SDR (and why most fail) — the blueprint this case study is the receipts for.
- Cut a content-moderation team's spend 4× with a Haiku → Sonnet → human cascade — same right-size-the-model principle, different problem.
- Lemlist vs Instantly — picking outbound infra to layer this on top of.