Why these specific five
Each demo below has the same underlying defect: the demo conditions are quietly the inverse of production conditions. One curated input, three rehearsed prompts, the founder driving. Production reverses every one of those.
1. "Chat with your PDF"
The demo: a contract, a marketing brief, an annual report. The presenter asks "what are the key risks?" The model answers beautifully.
What ships: a tool that handles 1–3 documents well and degrades hard at 50. The retrieval starts missing relevant passages, the model starts confidently citing the wrong section, and customers learn not to trust the answers.
The fix: design for the corpus you'll actually have. See our long-context patterns for the specific shapes that work in production. Don't ship "chat with PDFs"; ship "answer this specific question about this specific document family with citations to the actual paragraph."
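The "cited paragraph" half of that fix is mostly plumbing, not modelling. A minimal sketch of the idea, with a toy term-overlap scorer standing in for real embedding retrieval (all names and the scoring function are illustrative, not any library's API):

```python
import re
from collections import Counter

def split_paragraphs(text: str) -> list[str]:
    """Split a document into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def overlap_score(query: str, paragraph: str) -> int:
    """Toy relevance score: count of shared lowercase terms.
    A real system would use embeddings; the citation plumbing is the point."""
    q = Counter(re.findall(r"\w+", query.lower()))
    p = Counter(re.findall(r"\w+", paragraph.lower()))
    return sum(min(q[t], p[t]) for t in q)

def retrieve_with_citations(query: str, doc: str, k: int = 2) -> list[dict]:
    """Return the top-k paragraphs with their paragraph numbers as citations."""
    paras = list(enumerate(split_paragraphs(doc), start=1))
    paras.sort(key=lambda ip: overlap_score(query, ip[1]), reverse=True)
    return [{"citation": f"para {i}", "text": p} for i, p in paras[:k]]

doc = ("Payment is due within 30 days of invoice.\n\n"
       "Either party may terminate with 60 days notice.\n\n"
       "Liability is capped at fees paid in the prior year.")
hits = retrieve_with_citations("How many days notice to terminate?", doc, k=1)
print(hits[0]["citation"])  # para 2
```

The shape matters more than the scorer: every answer carries a pointer the user can check, which is what keeps trust intact at 50 documents.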
2. "AI lead scoring"
The demo: a model evaluates a list of inbound leads, scores them 1–10, the top three are obviously hot, the bottom three are obviously not.
What ships: a model that emits scores that correlate weakly with conversion because there's no labelled training data and the input features are noisy. Sales reps learn to ignore the scores within two months.
The fix: ship a rules-based scoring engine first. Run it for a year. Capture labelled outcomes. Then revisit AI on top of that data. We covered this exact engagement pattern in When NOT to build with AI, engagement #2.
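The rules-based first version can be a page of code, and every score is explainable to the rep reading it. A sketch under assumed rule weights — the features and thresholds here are placeholders your sales team would set, not a recommendation:

```python
from dataclasses import dataclass, asdict

@dataclass
class Lead:
    employees: int
    industry: str
    replied_to_email: bool
    used_free_trial: bool

def score_lead(lead: Lead) -> int:
    """Rules-based 0-10 score. Every point traces to one rule,
    so a rep can see exactly why a lead scored what it did."""
    score = 0
    if lead.employees >= 50:
        score += 3
    if lead.industry in {"fintech", "healthcare"}:
        score += 2
    if lead.replied_to_email:
        score += 3
    if lead.used_free_trial:
        score += 2
    return score

def log_outcome(lead: Lead, score: int, converted: bool) -> dict:
    """Labelled record to accumulate for the future model."""
    return {**asdict(lead), "rule_score": score, "converted": converted}

lead = Lead(employees=120, industry="fintech", replied_to_email=True, used_free_trial=False)
print(score_lead(lead))  # 8
```

The `log_outcome` records are the real payoff: after a year you have exactly the labelled data the AI version was missing.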
3. "AI dashboard insights"
The demo: a dashboard with a panel that says "this week's anomalies," and the model has correctly identified that revenue dropped on Tuesday.
What ships: a panel that surfaces 12 "anomalies" a week, of which 1–2 are real; the rest are noise (weekend effect, seasonal pattern, normal variance the model treats as surprising). Users learn to ignore the panel by week three.
The fix: "anomaly" is a statistics problem pretending to be an AI problem. Use proper statistical methods for the detection layer. Use the model only to explain the anomalies the stats already flagged. The split is load-bearing — getting it wrong means the model is doing both jobs, and it's bad at one of them.
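A sketch of what the statistical detection layer can look like: same-weekday z-scores, so the weekend effect never registers as an anomaly in the first place. The threshold and window are illustrative, and the synthetic series is only there to exercise the function; the model's sole job would be writing the explanation for whatever this flags.

```python
import statistics

def weekday_anomalies(daily: list[float], threshold: float = 3.0, weeks: int = 8) -> list[int]:
    """Flag day indices whose value deviates from same-weekday history.
    Comparing Tuesday to past Tuesdays absorbs the weekly pattern that a
    naive day-over-day check would keep reporting as an anomaly."""
    flags = []
    for i in range(14, len(daily)):              # need >= 2 same-weekday points
        history = daily[i % 7:i:7][-weeks:]      # same weekday, trailing window
        if len(history) < 2:
            continue
        mu = statistics.mean(history)
        sigma = statistics.stdev(history)
        if sigma > 0 and abs(daily[i] - mu) / sigma > threshold:
            flags.append(i)
    return flags

# Synthetic 10-week series: weekdays ~100, weekends ~20, small jitter,
# plus one genuine spike on day 65.
series = [(100 if i % 7 < 5 else 20) + (i % 3) for i in range(70)]
series[65] = 150
print(weekday_anomalies(series))  # [65]
```

Note what the naive version would do with the same data: every Saturday drop from ~100 to ~20 becomes an "anomaly," which is exactly the 12-a-week noise the panel above drowns in.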
4. "AI voice agent on cold leads"
The demo: the founder calls in, the AI agent sounds natural, handles questions, books a meeting. The audience is amazed.
What ships: the agent calls 1,000 cold prospects Tuesday morning. ~85% don't answer. ~10% hang up immediately when they realise it's AI. ~3% engage briefly and then disengage when asked something off-script. ~2% book. Compare to the cost of a human SDR doing the same work, and the math is brutal.
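That math can be written down before the pilot, not after. The funnel rates are the ones above; the per-call cost is a hypothetical placeholder you'd replace with your own telephony and inference numbers:

```python
# Funnel rates from the cold-call scenario; cost figure is a placeholder
# assumption for illustration, not a benchmark.
calls = 1000
rates = {"no_answer": 0.85, "hang_up": 0.10, "brief_engage": 0.03, "booked": 0.02}
assert abs(sum(rates.values()) - 1.0) < 1e-9   # sanity check: funnel sums to 100%

meetings = calls * rates["booked"]             # 20 meetings per 1,000 dials
cost_per_call = 0.50                           # hypothetical all-in cost per dial
cost_per_meeting = calls * cost_per_call / meetings
print(meetings, cost_per_meeting)              # 20.0 meetings, $25.00 each
```

Run the same arithmetic for the human SDR alternative with your own rates and the comparison falls out in a few lines.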
The fix: voice agents work in narrow inbound paths — appointment confirmations, status checks, hours-of-operation questions. They don't work for cold outbound. The gap between demo conditions (warm founder calling, scripted questions) and reality (cold prospect, real questions) is enormous.
5. "AI SEO content engine"
The demo: the model writes a blog post that reads well and ranks for a specific keyword on the founder's laptop in incognito mode.
What ships: 500 published articles that all sound the same, get filtered as low-quality content by Google's ranking systems, and slowly tank the domain's traffic. By month six the team is paying a human writer to manually rewrite the best ones.
The fix: AI-assisted writing where a human is the author works. AI-as-author with human as editor doesn't — not for the volume play, not on the search-traffic curve. If content is worth ranking, it's worth a human's name on it. If it isn't, don't write it.
The pattern that connects all five
Each of these demos features one curated example, one user, one rehearsed flow. Production reverses every variable: many examples (some adversarial), many users (with varying patience), and unrehearsed paths the demo never considered.
The questions to ask before greenlighting any of these are the same. What does this look like at 100x the demo's volume? What does it look like with users who are tired, wrong, or hostile? What does it look like when the input is in the long tail rather than the curated set? Almost any AI demo that survives those questions is worth shipping. Almost none of these five do.
The good demos
For balance: AI demos that survive contact with production look boring. The triage classifier that runs on every ticket. The drafting assistant that suggests a reply the human sends. The eval suite that catches a regression in CI. The cohort-level cost analysis. None of those wow a room. All of them ship. That's the trade.