PLAYBOOK

7 things that always break the week after demo day

The demo went perfectly. The product team is excited. Production launch is in two weeks. Here are the seven failure modes that show up between the demo and the launch — every time, in some order, across every model — and the specific fix for each.

9 MIN READ · UPDATED 2026-04-19 · BY PINTOED AI STUDIO

1. Rate limits

Symptom: works fine on three concurrent users in the demo, breaks immediately when 80 internal beta testers hit it. The model provider's tier limits become the production bottleneck before anyone planned for them.

Fix: build a token-bucket queue at the gateway layer on day one, not day fifty. Get your tier upgraded with the provider before launch — the upgrade can take 2–4 weeks of lead time at higher tiers. Don't assume the request budget that was fine for the demo will stretch to production traffic.
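
A minimal sketch of the gateway-side bucket, assuming a single-process gateway; the capacity and refill numbers are placeholders to tune against the tier you actually have, not a recommendation:

```python
import threading
import time

class TokenBucket:
    """Refill `rate` tokens per second, hold at most `capacity`."""

    def __init__(self, capacity: float, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """True if the request may go upstream now; False means queue it or return 429."""
        with self.lock:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False

# Illustrative numbers only: a burst of 50 requests, roughly 10/sec sustained.
bucket = TokenBucket(capacity=50, rate=10)
```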

2. Retries that compound

Symptom: the demo's "if it fails, retry once" logic cascades into a thundering herd when the upstream model is having a bad five-minute window. Your bill spikes and the model stays angry.

Fix: exponential backoff with jitter, not linear retry. Cap retries at three. Add a circuit breaker so you don't keep hammering a provider that's down. The naïve retry that worked in dev is the single most common production fire we get paged about.
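
The shape we mean, sketched for a synchronous, single-process caller; the thresholds are illustrative, not tuned:

```python
import random
import time

MAX_ATTEMPTS = 3          # hard cap on retries
TRIP_AFTER = 5            # consecutive failures before the breaker opens
COOL_OFF_SECONDS = 60     # how long to fail fast once it does

_failures = 0
_open_until = 0.0

def call_with_backoff(call):
    """`call` is any zero-argument function that hits the provider and may raise."""
    global _failures, _open_until
    if time.monotonic() < _open_until:
        raise RuntimeError("circuit open: provider is unhealthy, failing fast")
    for attempt in range(MAX_ATTEMPTS):
        try:
            result = call()
            _failures = 0
            return result
        except Exception:
            _failures += 1
            if _failures >= TRIP_AFTER:
                _open_until = time.monotonic() + COOL_OFF_SECONDS
                raise
            if attempt == MAX_ATTEMPTS - 1:
                raise
            # Exponential backoff with full jitter: sleep 0-1s, then 0-2s.
            time.sleep(random.uniform(0, 2 ** attempt))
```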

3. Eval drift

Symptom: someone tweaked the prompt for a specific edge case the product manager raised. Two weeks later, the original demo's golden cases are quietly worse. Nobody noticed.

Fix: install an eval gate before the launch. We covered this in The minimum viable eval — 12 cases, CI-gated, fails the build on regression. Don't ship without one. The drift is invisible until it isn't.
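
A sketch of what the gate can look like in CI; `golden_cases.json` and `run_pipeline` are stand-ins for your own golden set and your real prompt-plus-model call:

```python
# test_golden_cases.py -- any failing case fails the build, which blocks the merge.
import json
import pytest

from my_app.pipeline import run_pipeline  # hypothetical: your prompt + model call

with open("golden_cases.json") as f:       # hypothetical golden set, 12+ cases
    GOLDEN_CASES = json.load(f)

@pytest.mark.parametrize("case", GOLDEN_CASES, ids=lambda c: c["name"])
def test_golden_case(case):
    output = run_pipeline(case["input"])
    # Cheapest possible check: the answer must still contain the known-good substring.
    assert case["must_contain"].lower() in output.lower(), (
        f"Regression on golden case {case['name']!r}"
    )
```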

4. Cost spikes

Symptom: the demo cost $2/hour to run. The first day of production costs $400. By Friday it's $1,400/day. The CFO is on Slack asking what's happening.

Fix: budget alerts in the model-provider dashboard before launch. Hard caps at the application layer (e.g. "no single conversation above 200K tokens, ever"). Aggressive prompt caching on system prompts. We covered the cohort numbers in What our clients actually pay for Claude; the production bill is almost always 10–50x the demo bill on day one because real usage has patterns the demo didn't.
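
One way to express the application-layer cap; this assumes you can read token counts off each response (most provider SDKs report them), and the class and exception names here are ours, not any SDK's:

```python
class BudgetExceeded(Exception):
    pass

class ConversationBudget:
    """Cumulative token cap for one conversation; the '200K tokens, ever' rule as code."""

    def __init__(self, cap: int = 200_000):
        self.cap = cap
        self.used = 0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        """Call after every model response; raises instead of letting the bill run."""
        self.used += input_tokens + output_tokens
        if self.used > self.cap:
            raise BudgetExceeded(f"conversation at {self.used} tokens, cap is {self.cap}")
```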

5. Prompt rot

Symptom: the prompt grew through the build. By demo day it's 2,400 tokens with seven "and very importantly" stanzas, four contradicting instructions about the same edge case, and one line about "don't be racist" added in panic the day before the stakeholder review. Output quality is silently worse than week one.

Fix: schedule a prompt-cleanup pass between demo and launch. Read the whole prompt out loud. Cut anything redundant. Resolve contradictions explicitly (decide which one wins, delete the other). Run the eval to confirm cleanup didn't regress anything. Then freeze it.

6. Tool-call timeouts

Symptom: in the demo, every external API the agent called returned in under a second because dev environments are quiet. In production, the slow tail of those APIs starts timing out under real load. The agent retries, the conversation gets stuck, the user sees a spinner.

Fix: surface tool-call latency as a first-class metric, with aggressive timeouts and a graceful "the lookup is taking a while, let me try a different approach" fallback path encoded in the agent prompt. Long-tail latency from dependencies is the single most underappreciated production risk in agentic systems.
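
A sketch of the wrapper, assuming async tools; `metrics.observe` stands in for whatever your metrics client actually exposes, and the five-second timeout is an assumption to tune per tool:

```python
import asyncio
import time

TOOL_TIMEOUT_SECONDS = 5.0   # deliberately aggressive; the demo's p50 is not production's p99

async def call_tool(tool, args, metrics):
    """Run one external tool call with a timeout, and record its latency either way."""
    start = time.monotonic()
    try:
        result = await asyncio.wait_for(tool(**args), timeout=TOOL_TIMEOUT_SECONDS)
        return {"status": "ok", "result": result}
    except asyncio.TimeoutError:
        # Structured failure the agent prompt knows how to narrate
        # ("the lookup is taking a while, let me try a different approach").
        return {"status": "timeout", "result": None}
    finally:
        metrics.observe("tool_call_latency_seconds", time.monotonic() - start)  # hypothetical client
```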

7. Observability gaps

Symptom: the system is broken in production and you have no idea where. Was it the prompt? The retrieval? The tool call? The downstream API? The model just having an off day? You're reading stack traces when what you needed was a trace of every model call, every input, every output.

Fix: log every model call with input + output + tool calls + token counts, structured, searchable, retained for at least 30 days. We use a simple structured-log shape that any team can adopt in a day; the magic is having it the first time something breaks mysteriously, not the third time. Pair with a sampling-based UI so a non-engineer can spot-check production calls without SQL.
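
The shape is nothing exotic; a minimal sketch, with field names that are ours rather than any standard:

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("model_calls")

def log_model_call(model, prompt, output, tool_calls, input_tokens, output_tokens):
    """One structured record per model call; ship these wherever you retain logs for 30+ days."""
    logger.info(json.dumps({
        "call_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model": model,
        "input": prompt,
        "output": output,
        "tool_calls": tool_calls,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
    }))
```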

The pre-launch checklist we run

We never let a system go to production without all seven of these in place. Concretely, the checklist:

  1. ✓ Tier upgrade with the model provider has been requested.
  2. ✓ Token-bucket rate limiter at the gateway, with a kill switch.
  3. ✓ Exponential-backoff retry with jitter and a circuit breaker.
  4. ✓ Eval suite passing in CI (12+ cases).
  5. ✓ Hard cost caps at the application layer; alerts at the provider.
  6. ✓ Prompt cleaned up, frozen, version-controlled.
  7. ✓ Tool calls have timeouts and an in-prompt fallback path.
  8. ✓ Structured logging of every call with sampling UI.

That's eight items, not seven, because the tier upgrade is its own thing. None of them is hard. All of them are skipped on the majority of the production launches we get pulled in to firefight after the fact.

The pattern

Demos optimise for "does it work in the happy case?" Production asks "does it survive the unhappy case?" Six of these seven failures share the same root: the demo was a single user on a fast day; production is many users on a slow day. Once you've internalised that distinction, the seven specific fixes become obvious. Until you have, you'll re-discover them all on launch day.

For the planning side of this — the questions you should be answering before any of this matters — see our AI build checklist.

Got a launch in two weeks and not sure what's missing? We do pre-launch audits.

BOOK A PRE-LAUNCH AUDIT → SEE SERVICES →