The brief we usually get
"We want an AI agent that handles all our customer service." Some version of that opens 90% of CS engagement conversations. The buyer has read about deflection rates from a vendor and wants the same number on their dashboard.
The brief is wrong in three ways. First, "all" is a goal that wastes 80% of the build budget on the long tail. Second, "agent" presumes a chatbot UI, when triage and routing deliver bigger wins than chat. Third, "customer service" is three different jobs (deflection, drafting, routing) that a single agent will fuse badly if you let it.
What we deploy: three components
The shape we install on essentially every CS engagement, in this order:
1. Triage classifier (deploys in week 1)
The first thing we ship is not an agent. It's a Haiku-powered classifier that runs over every incoming ticket and assigns four fields: category, urgency, customer tier, and a "this is answerable from our help docs" flag. Output goes into the existing CS tool's custom fields.
That alone gets the human team a ~25–35% speedup. They can sort, filter, and prioritise the queue better. No customer ever sees the AI. No risk of the AI saying something wrong. Easy win, fast ROI, foundation for the next two components.
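A minimal sketch of the shape, using the Anthropic Python SDK. The model alias, category values, and field names are illustrative, not what we ship; in practice, customer tier often comes from the CRM record rather than the ticket text.

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

TRIAGE_PROMPT = """You are a ticket-triage classifier for a customer-service team.
Return ONLY a JSON object with these keys:
  "category": one of "billing", "bug", "how-to", "account", "other"
  "urgency": one of "low", "normal", "high"
  "customer_tier": one of "free", "pro", "enterprise"
  "answerable_from_docs": true or false

Ticket:
"""

def triage(ticket_text: str) -> dict:
    response = client.messages.create(
        model="claude-haiku-latest",  # illustrative alias; pin a specific version in production
        max_tokens=200,
        messages=[{"role": "user", "content": TRIAGE_PROMPT + ticket_text}],
    )
    return json.loads(response.content[0].text)

# Each key maps onto a custom field in the existing CS tool, e.g.:
#   ticket.custom_fields.update(triage(ticket.body))
# No customer ever sees this output.
```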
2. Drafting assistant (deploys in weeks 2–3)
For the tickets a human is going to answer, we drop a "suggested reply" into their workflow. The agent has read the ticket, found the relevant docs and recent similar tickets, and drafted a reply in the team's voice. The human reviews, edits, sends.
Lift here is real and consistent: ~30–50% reduction in average handle time per ticket. Crucially, the human is still in the loop, and they're using the AI as a writing partner, not a decision-maker. Quality stays high. Errors are caught before they reach the customer.
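A sketch of the assembly step, reusing the client from the triage sketch above. Here search_docs and similar_resolved_tickets are hypothetical stand-ins for whatever retrieval the team already has (help-centre search, a vector store), and the prompt wording is illustrative.

```python
def draft_reply(ticket: dict, voice_guide: str) -> str:
    """Build context, then ask the model for a suggested reply (never auto-sent)."""
    docs = search_docs(ticket["body"], k=3)                    # hypothetical retrieval helper
    examples = similar_resolved_tickets(ticket["body"], k=2)   # hypothetical retrieval helper
    prompt = (
        f"Voice and tone guide:\n{voice_guide}\n\n"
        f"Relevant help-centre articles:\n{docs}\n\n"
        f"Similar recent tickets and the replies we sent:\n{examples}\n\n"
        f"Customer ticket:\n{ticket['body']}\n\n"
        "Draft a reply in our voice. If the docs don't cover the question, "
        "write a note to the agent instead of guessing."
    )
    response = client.messages.create(
        model="claude-haiku-latest",  # illustrative alias
        max_tokens=600,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text  # surfaces in the CS tool as a suggested reply
```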
3. Auto-deflection (deploys in weeks 4–6)
Only after the first two are running do we add the auto-reply layer — and only for the narrow set of tickets where (a) the classifier is highly confident, (b) the docs cleanly answer the question, and (c) the resolution doesn't touch a refund, a payment, or a sensitive-account action.
This is the deflection number the buyer was originally asking about. With the right scope it lands at 40–65% of total ticket volume. The remaining 35–60% — including everything spicy or ambiguous — flows to humans, augmented by component 2.
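Expressed as a hard gate, the three conditions look something like the sketch below. It assumes the triage output also carries a model-reported confidence score (not in the week-1 field list; you add it for this stage), and every threshold and category name is an illustrative starting point, not a universal constant.

```python
SENSITIVE_CATEGORIES = {"billing", "refund", "account"}  # anything touching money or account state

def can_auto_deflect(triage_result: dict, doc_match_score: float) -> bool:
    """All three conditions must hold before a reply goes out without a human."""
    classifier_confident = triage_result.get("confidence", 0.0) >= 0.90        # (a)
    docs_cleanly_answer = (triage_result["answerable_from_docs"]
                           and doc_match_score >= 0.85)                        # (b)
    touches_sensitive = triage_result["category"] in SENSITIVE_CATEGORIES      # (c)
    return classifier_confident and docs_cleanly_answer and not touches_sensitive
```

Anything that fails the gate falls through to the drafting assistant, so a miss costs a few seconds of human review time, not a wrong answer sent to a customer.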
The 20% nobody asks for
The single most valuable thing we deliver is the one no one ever puts in the brief: a feedback loop from the human edits back into the AI's drafting prompt. Every time a human edits the AI's suggested reply, we capture (a) the original draft, (b) the sent reply, and (c) the diff. Twice a week we feed that to a Sonnet-powered analyser that proposes prompt improvements.
This is what keeps the system getting better. Without it, the drafting assistant locks in at "decent" and slowly degrades as the product changes. With it, quality compounds. We've measured handle-time reductions accelerating in months 3–6 specifically because of this loop.
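A minimal sketch of the capture side, with illustrative file paths; the analysis prompt in the closing comment is paraphrased, not our production prompt.

```python
import datetime
import difflib
import json

def capture_edit(ticket_id: str, ai_draft: str, sent_reply: str,
                 log_path: str = "edit_log.jsonl") -> None:
    """Record (original draft, sent reply, diff) for the twice-weekly analysis run."""
    diff = "\n".join(difflib.unified_diff(
        ai_draft.splitlines(), sent_reply.splitlines(),
        fromfile="ai_draft", tofile="sent_reply", lineterm="",
    ))
    record = {
        "ticket_id": ticket_id,
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "draft": ai_draft,
        "sent": sent_reply,
        "diff": diff,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Twice a week, the accumulated records go to a Sonnet-powered analyser with a
# prompt along the lines of: "Here are N cases where a human edited your draft
# before sending. What systematic changes to the drafting prompt would have
# made the drafts closer to what was actually sent?"
```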
What we explicitly don't ship
- An "AI agent" that resolves tickets autonomously across the full surface area. Even with 2026 models, the long tail of CS contains too many bear traps. The 60-80% of volume that's safe to fully automate, automate. The rest, augment.
- A chat widget on the marketing site. Buyers love these. They underperform a well-designed FAQ + search bar by every metric we can measure, and they leak embarrassing screenshots more often than humans realise.
- Voice-bot phone support without a hard scope. Voice bots work for narrow paths (status checks, appointment booking). They badly hurt CSAT on anything else.
- Sentiment-driven escalation. Sounds good in a deck, fails in deployment because the model thinks every ticket with a frustrated word is "urgent." We use rule-based escalation tied to ticket-category and customer-tier instead; a sketch follows this list.
- "Knowledge base auto-generation." The KB is the authoritative source. AI-generated KBs are a way to launder the model's hallucinations into a place humans then trust. Don't.
The numbers we see
Across the last six CS engagements we've shipped, blended outcomes after 90 days of running all three components:
- Tickets fully deflected (no human touch): 38–62% (median: 52%)
- Tickets human-answered with AI draft: 92% of the rest
- Average handle-time reduction on AI-drafted tickets: 41%
- CSAT change: -1 to +3 points (median: +1, statistically a wash)
- Cost per ticket decrease: 35–60%
The CSAT result is the one buyers always pre-worry about. Our take: with the architecture above, CSAT is roughly flat. With an "AI agent handles everything" architecture, CSAT drops by 5–10 points and stays there. The architecture is the difference. (For the anti-pattern in detail, see our When NOT to build with AI piece; engagement #1 is exactly this trap.)
What this stack costs
Build time: 4–6 weeks for a mid-market team (50–500K tickets/yr). Ongoing AI cost: ~$0.04–0.12 per ticket processed across all three components, depending on ticket length. For a team handling 30K tickets/mo, that's $1,200–$3,600/mo in model spend, against whatever fraction of FTE time you're freeing up.
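For concreteness, the per-ticket arithmetic as a two-liner (rates from above; the 30K/mo volume is the worked example, not a constant):

```python
def monthly_model_spend(tickets_per_month: int,
                        low_per_ticket: float = 0.04,
                        high_per_ticket: float = 0.12) -> tuple[float, float]:
    """Blended model spend across all three components."""
    return tickets_per_month * low_per_ticket, tickets_per_month * high_per_ticket

low, high = monthly_model_spend(30_000)
print(f"${low:,.0f} to ${high:,.0f} per month")  # $1,200 to $3,600 per month
```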
The math has not been close on a single engagement we've shipped. Payback is months, not quarters. The reason this post exists: the shape described above is not the one most teams pitch to themselves first. Picking the right scope is worth more than picking the right vendor.