Why these specific five
Each demo below has the same underlying defect: the demo conditions are quietly the inverse of production conditions. One curated input, three rehearsed prompts, the founder driving. Production reverses every one of those.
1. "Chat with your PDF"
The demo: a contract, a marketing brief, an annual report. The presenter asks "what are the key risks?" The model answers beautifully.
What ships: a tool that handles 1–3 documents well and degrades hard at 50. The retrieval starts missing relevant passages, the model starts confidently citing the wrong section, and customers learn not to trust the answers.
The fix: design for the corpus you'll actually have. See our long-context patterns for the specific shapes that work in production. Don't ship "chat with PDFs"; ship "answer this specific question about this specific document family with citations to the actual paragraph."
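The "cited paragraph" half of that fix is mostly plumbing, not modelling. A minimal sketch of the idea, with a toy term-overlap scorer standing in for real embedding retrieval (all names and the scoring function are illustrative, not any library's API):

```python
import re
from collections import Counter

def split_paragraphs(text: str) -> list[str]:
    """Split a document into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def overlap_score(query: str, paragraph: str) -> int:
    """Toy relevance score: count of shared lowercase terms.
    A real system would use embeddings; the citation plumbing is the point."""
    q = Counter(re.findall(r"\w+", query.lower()))
    p = Counter(re.findall(r"\w+", paragraph.lower()))
    return sum(min(q[t], p[t]) for t in q)

def retrieve_with_citations(query: str, doc: str, k: int = 2) -> list[dict]:
    """Return the top-k paragraphs with their paragraph numbers as citations."""
    paras = list(enumerate(split_paragraphs(doc), start=1))
    paras.sort(key=lambda ip: overlap_score(query, ip[1]), reverse=True)
    return [{"citation": f"para {i}", "text": p} for i, p in paras[:k]]

doc = ("Payment is due within 30 days of invoice.\n\n"
       "Either party may terminate with 60 days notice.\n\n"
       "Liability is capped at fees paid in the prior year.")
hits = retrieve_with_citations("How many days notice to terminate?", doc, k=1)
print(hits[0]["citation"])  # para 2
```

The shape matters more than the scorer: every answer carries a pointer the user can check, which is what keeps trust intact at 50 documents.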
2. "AI lead scoring"
The demo: a model evaluates a list of inbound leads, scores them 1–10, the top three are obviously hot, the bottom three are obviously not.
What ships: a model that emits scores that correlate weakly with conversion because there's no labelled training data and the input features are noisy. Sales reps learn to ignore the scores within two months.
The fix: ship a rules-based scoring engine first. Run it for a year. Capture labelled outcomes. Then revisit AI on top of that data. We covered this exact engagement pattern in When NOT to build with AI, engagement #2.
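The rules-based first version can be a page of code, and every score is explainable to the rep reading it. A sketch under assumed rule weights — the features and thresholds here are placeholders your sales team would set, not a recommendation:

```python
from dataclasses import dataclass, asdict

@dataclass
class Lead:
    employees: int
    industry: str
    replied_to_email: bool
    used_free_trial: bool

def score_lead(lead: Lead) -> int:
    """Rules-based 0-10 score. Every point traces to one rule,
    so a rep can see exactly why a lead scored what it did."""
    score = 0
    if lead.employees >= 50:
        score += 3
    if lead.industry in {"fintech", "healthcare"}:
        score += 2
    if lead.replied_to_email:
        score += 3
    if lead.used_free_trial:
        score += 2
    return score

def log_outcome(lead: Lead, score: int, converted: bool) -> dict:
    """Labelled record to accumulate for the future model."""
    return {**asdict(lead), "rule_score": score, "converted": converted}

lead = Lead(employees=120, industry="fintech", replied_to_email=True, used_free_trial=False)
print(score_lead(lead))  # 8
```

The `log_outcome` records are the real payoff: after a year you have exactly the labelled data the AI version was missing.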
3. "AI dashboard insights"
The demo: a dashboard with a panel that says "this week's anomalies," and the model has correctly identified that revenue dropped on Tuesday.
What ships: a panel that surfaces 12 "anomalies" a week, of which 1–2 are real; the rest are noise (weekend effect, seasonal pattern, normal variance the model treats as surprising). Users learn to ignore the panel by week three.
The fix: "anomaly" is a statistics problem pretending to be an AI problem. Use proper statistical methods for the detection layer. Use the model only to explain the anomalies the stats already flagged. The split is load-bearing — getting it wrong means the model is doing both jobs, and it's bad at one of them.
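A sketch of what the statistical detection layer can look like: same-weekday z-scores, so the weekend effect never registers as an anomaly in the first place. The threshold and window are illustrative, and the synthetic series is only there to exercise the function; the model's sole job would be writing the explanation for whatever this flags.

```python
import statistics

def weekday_anomalies(daily: list[float], threshold: float = 3.0, weeks: int = 8) -> list[int]:
    """Flag day indices whose value deviates from same-weekday history.
    Comparing Tuesday to past Tuesdays absorbs the weekly pattern that a
    naive day-over-day check would keep reporting as an anomaly."""
    flags = []
    for i in range(14, len(daily)):              # need >= 2 same-weekday points
        history = daily[i % 7:i:7][-weeks:]      # same weekday, trailing window
        if len(history) < 2:
            continue
        mu = statistics.mean(history)
        sigma = statistics.stdev(history)
        if sigma > 0 and abs(daily[i] - mu) / sigma > threshold:
            flags.append(i)
    return flags

# Synthetic 10-week series: weekdays ~100, weekends ~20, small jitter,
# plus one genuine spike on day 65.
series = [(100 if i % 7 < 5 else 20) + (i % 3) for i in range(70)]
series[65] = 150
print(weekday_anomalies(series))  # [65]
```

Note what the naive version would do with the same data: every Saturday drop from ~100 to ~20 becomes an "anomaly," which is exactly the 12-a-week noise the panel above drowns in.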
4. "AI voice agent on cold leads"
The demo: the founder calls in, the AI agent sounds natural, handles questions, books a meeting. The audience is amazed.
What ships: the agent calls 1,000 cold prospects Tuesday morning. ~85% don't answer. ~10% hang up immediately when they realise it's AI. ~3% engage briefly and then disengage when asked something off-script. ~2% book. Compare to the cost of a human SDR doing the same work, and the math is brutal.
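That math can be written down before the pilot, not after. The funnel rates are the ones above; the per-call cost is a hypothetical placeholder you'd replace with your own telephony and inference numbers:

```python
# Funnel rates from the cold-call scenario; cost figure is a placeholder
# assumption for illustration, not a benchmark.
calls = 1000
rates = {"no_answer": 0.85, "hang_up": 0.10, "brief_engage": 0.03, "booked": 0.02}
assert abs(sum(rates.values()) - 1.0) < 1e-9   # sanity check: funnel sums to 100%

meetings = calls * rates["booked"]             # 20 meetings per 1,000 dials
cost_per_call = 0.50                           # hypothetical all-in cost per dial
cost_per_meeting = calls * cost_per_call / meetings
print(meetings, cost_per_meeting)              # 20.0 meetings, $25.00 each
```

Run the same arithmetic for the human SDR alternative with your own rates and the comparison falls out in a few lines.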
The fix: voice agents work in narrow inbound paths — appointment confirmations, status checks, hours-of-operation questions. They don't work for cold outbound. The gap between demo conditions (warm founder calling, scripted questions) and reality (cold prospect, real questions) is enormous.
5. "AI SEO content engine"
The demo: the model writes a blog post that reads well and ranks for a specific keyword on the founder's laptop in incognito mode.
What ships: 500 published articles that all sound the same, get filtered as low-quality content by Google's ranking systems, and slowly tank the domain's traffic. By month six the team is paying a human writer to manually rewrite the best ones.
The fix: AI-assisted writing where a human is the author works. AI-as-author with human as editor doesn't — not for the volume play, not on the search-traffic curve. If content is worth ranking, it's worth a human's name on it. If it isn't, don't write it.
The pattern that connects all five
Each of these demos features one curated example, one user, one rehearsed flow. Production reverses every variable: many examples (some adversarial), many users (with varying patience), and unrehearsed paths the demo never considered.
The questions to ask before greenlighting any of these are the same. What does this look like at 100x the demo's volume? What does it look like with users who are tired, wrong, or hostile? What does it look like when the input is in the long tail rather than the curated set? Almost any AI demo that survives those questions is worth shipping. Almost none of these five do.
The good demos
For balance: AI demos that survive contact with production look boring. The triage classifier that runs on every ticket. The drafting assistant that suggests a reply the human sends. The eval suite that catches a regression in CI. The cohort-level cost analysis. None of those wow a room. All of them ship. That's the trade.