Why this exists separately from your app logs
Most teams have application logging — request IDs, error traces, metrics. That doesn't capture what you need to debug LLM behaviour. The right pipeline records the full prompt, the full response, every tool call, token counts, costs, latencies, and the version of the prompt template that produced each call.
Without this, "the model is acting weird" is unanswerable. With it, "the model is acting weird" becomes a SQL query.
The log schema
One row per model call. Columns we always include:
- `id` — UUID, primary key
- `session_id` — joins multi-turn conversations
- `user_id` — for joining to product analytics, hashed if PII-sensitive
- `timestamp` — UTC, microseconds
- `provider` — anthropic, openai, etc.
- `model` — claude-sonnet-4-6, etc.
- `prompt_template_version` — hash or semver of the system prompt
- `system_prompt` — full text
- `messages` — JSON, the full message array as sent
- `tools` — JSON, the tool definitions sent
- `response` — JSON, the full assistant response including tool calls
- `tool_calls` — JSON, parsed tool calls and results
- `input_tokens`, `output_tokens`, `cache_read_tokens`, `cache_creation_tokens`
- `cost_usd` — computed at ingest from token counts × current rate card
- `latency_ms` — first-token latency and total
- `error` — null on success, otherwise the error
- `metadata` — JSON, app-specific (feature flag, A/B branch, customer tier)
That's it. ~17 columns. No exotic schema. Postgres or ClickHouse works fine; if you have BigQuery already, use that. The only requirement is that the JSON columns are queryable.
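As a concrete sketch, here's what that looks like as Postgres DDL. Names and types are illustrative, not prescriptive: the table name `llm_calls` and the split of `latency_ms` into first-token and total columns are our assumptions.

```sql
-- Illustrative DDL; adjust names and types to taste.
CREATE TABLE llm_calls (
    id                      UUID PRIMARY KEY,
    session_id              UUID,
    user_id                 TEXT,             -- hashed if PII-sensitive
    "timestamp"             TIMESTAMPTZ NOT NULL,  -- UTC, microsecond precision
    provider                TEXT NOT NULL,    -- 'anthropic', 'openai', ...
    model                   TEXT NOT NULL,
    prompt_template_version TEXT,             -- hash or semver of the system prompt
    system_prompt           TEXT,
    messages                JSONB,            -- full message array as sent
    tools                   JSONB,            -- tool definitions sent
    response                JSONB,            -- full assistant response incl. tool calls
    tool_calls              JSONB,            -- parsed tool calls and results
    input_tokens            INTEGER,
    output_tokens           INTEGER,
    cache_read_tokens       INTEGER,
    cache_creation_tokens   INTEGER,
    cost_usd                NUMERIC(12, 6),   -- computed at ingest
    latency_first_token_ms  INTEGER,
    latency_total_ms        INTEGER,
    error                   TEXT,             -- NULL on success
    metadata                JSONB             -- feature flag, A/B branch, customer tier
);

CREATE INDEX ON llm_calls ("timestamp");
CREATE INDEX ON llm_calls (user_id, "timestamp");
```

The same shape ports to ClickHouse or BigQuery; the only property worth preserving is that the JSON columns stay queryable.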
Retention
Two tiers:
- Full-fidelity hot (last 30-90 days): all columns, all rows. Used for debugging.
- Aggregated cold (older than that): keep counts, costs, latencies, errors per (day, feature, model). Drop the prompt and response text. Used for trend analysis.
Why 30-90 days hot: that's enough to debug yesterday's bug, audit last week's rollout, and produce the "what changed" report after a model upgrade. Keeping more than that with full prompt text is expensive in storage and creates a PII liability for no operational benefit.
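The cold-tier rollup is one aggregate query plus a delete, run nightly. A minimal sketch, assuming the `llm_calls` table above, a rollup table `llm_calls_daily`, and a `feature` key in `metadata` (all illustrative):

```sql
-- Roll up yesterday into the cold tier.
INSERT INTO llm_calls_daily (day, feature, model, calls, errors, cost_usd, avg_latency_ms)
SELECT
    date_trunc('day', "timestamp")                  AS day,
    metadata->>'feature'                            AS feature,
    model,
    COUNT(*)                                        AS calls,
    COUNT(*) FILTER (WHERE error IS NOT NULL)       AS errors,
    SUM(cost_usd)                                   AS cost_usd,
    AVG(latency_total_ms)                           AS avg_latency_ms
FROM llm_calls
WHERE "timestamp" >= date_trunc('day', now() - interval '1 day')
  AND "timestamp" <  date_trunc('day', now())
GROUP BY 1, 2, 3;

-- Drop full-fidelity rows past the hot window (90 days here).
DELETE FROM llm_calls WHERE "timestamp" < now() - interval '90 days';
```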
For regulated workloads (financial, medical) the hot tier retention is set by the regulation, not by us — typically 1-7 years, encrypted at rest, with separate access controls. That's a different conversation about compliance, not logging.
Sampling UI
The schema is necessary; the UI is what makes it useful. We build (or wire up) a small internal tool that:
- Lists the most recent 100 calls with their summary fields (user, model, latency, cost, error).
- Has filters: by feature, by model, by date, by user, by error status.
- Click a row → see the full prompt, the full response, the full tool-call trace, the cost breakdown.
- "Random sample" button — pull 20 random calls from the last hour for spot-checking.
- "Replay" button — re-run this exact prompt against current code, diff the new response against the logged one.
Two days of building, then it sits there saving you time for the next year. Almost every "the model is broken" investigation we run starts by clicking around in this UI for 5 minutes and finding the answer.
Cost
For a typical small-to-mid-size client running ~50K-500K model calls per month:
- Storage: ~5-50 GB/mo of prompt+response text (≈100 KB per call at the volumes above). Postgres or BigQuery: $5-30/mo.
- Ingestion: a small process that writes to the DB. Negligible compute.
- UI hosting: an internal Vercel/Render deployment, ~$10/mo.
- Total: $15-50/mo at this scale. Costs scale sub-linearly with volume, because the bulk of the data is compressible text and storage is cheap.
Cheaper than any logging SaaS. More flexible than any logging SaaS. Owned by you. We've never recommended a vendor here — the build is two days and the operational burden is essentially zero.
The first three queries we always run
Once the pipeline has been live for a week, we run these three to find issues:
- `SELECT * WHERE error IS NOT NULL ORDER BY timestamp DESC` — recent errors. Often surfaces a class of failure the team hadn't noticed.
- `SELECT model, COUNT(*), SUM(cost_usd) GROUP BY model` — cost distribution by model. Almost always reveals over-routing to Sonnet that should be Haiku (see Sonnet vs Haiku routing).
- `SELECT prompt_template_version, AVG(output_tokens) GROUP BY 1 ORDER BY 1` — output length over prompt versions. Catches prompt rot and the silent "the model started writing essays" regression.
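Spelled out against the `llm_calls` sketch above, the `FROM` clauses and time windows are the only additions:

```sql
-- 1. Recent errors: often surfaces a failure class nobody had noticed.
SELECT id, "timestamp", model, error
FROM llm_calls
WHERE error IS NOT NULL
ORDER BY "timestamp" DESC
LIMIT 100;

-- 2. Cost distribution by model: spots over-routing to expensive models.
SELECT model, COUNT(*) AS calls, SUM(cost_usd) AS cost
FROM llm_calls
WHERE "timestamp" > now() - interval '7 days'
GROUP BY model
ORDER BY cost DESC;

-- 3. Output length by prompt version: catches prompt rot and
--    "the model started writing essays" regressions.
SELECT prompt_template_version, AVG(output_tokens) AS avg_output_tokens
FROM llm_calls
GROUP BY 1
ORDER BY 1;
```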
What this catches that nothing else does
- Silent quality regressions: output-length drift, refusal-rate drift, tool-call frequency drift. None of these shows up on an APM dashboard (a sample drift query follows this list).
- The "one user is breaking it" pattern. Filter by user, see the conversations, find the adversarial input.
- The "this prompt change broke production" debug. Filter by version, compare before/after.
- Cost spikes. The conversation that ran 80 turns and cost $14 yesterday — and the user who triggered it.
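As an example of the drift checks in the first bullet, here's a tool-call frequency sketch. It assumes `tool_calls` is a JSONB array, per the DDL above:

```sql
-- Average tool calls per model call, per day.
-- A step change right after a deploy is the signal worth investigating.
SELECT
    date_trunc('day', "timestamp")                     AS day,
    AVG(COALESCE(jsonb_array_length(tool_calls), 0))   AS tool_calls_per_request
FROM llm_calls
WHERE "timestamp" > now() - interval '30 days'
GROUP BY 1
ORDER BY 1;
```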
The summary
17 columns, two retention tiers, a small internal UI. $15-50/mo. Install this on day one of any production engagement. Every "the model is broken" investigation gets cheaper. Every regression gets caught faster. The two days of build pay back inside a month.