Why this exists separately from your app logs
Most teams have application logging — request IDs, error traces, metrics. That doesn't capture what you need to debug LLM behaviour. The right pipeline records the full prompt, the full response, every tool call, token counts, costs, latencies, and the version of the prompt template that produced each call.
Without this, "the model is acting weird" is unanswerable. With it, "the model is acting weird" becomes a SQL query.
The log schema
One row per model call. Columns we always include:
- `id` — UUID, primary key
- `session_id` — joins multi-turn conversations
- `user_id` — for joining to product analytics, hashed if PII-sensitive
- `timestamp` — UTC, microseconds
- `provider` — anthropic, openai, etc.
- `model` — claude-sonnet-4-6, etc.
- `prompt_template_version` — hash or semver of the system prompt
- `system_prompt` — full text
- `messages` — JSON, the full message array as sent
- `tools` — JSON, the tool definitions sent
- `response` — JSON, the full assistant response including tool calls
- `tool_calls` — JSON, parsed tool calls and results
- `input_tokens`, `output_tokens`, `cache_read_tokens`, `cache_creation_tokens`
- `cost_usd` — computed at ingest from token counts × current rate card
- `latency_ms` — first-token latency and total
- `error` — null on success, otherwise the error
- `metadata` — JSON, app-specific (feature flag, A/B branch, customer tier)
That's it. ~17 columns. No exotic schema. Postgres or ClickHouse works fine; if you have BigQuery already, use that. The only requirement is that the JSON columns are queryable.
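As a concrete sketch, here's what that looks like as Postgres DDL. Names and types are illustrative, not prescriptive: the table name `llm_calls` and the split of `latency_ms` into first-token and total columns are our assumptions.

```sql
-- Illustrative DDL; adjust names and types to taste.
CREATE TABLE llm_calls (
    id                      UUID PRIMARY KEY,
    session_id              UUID,
    user_id                 TEXT,             -- hashed if PII-sensitive
    "timestamp"             TIMESTAMPTZ NOT NULL,  -- UTC, microsecond precision
    provider                TEXT NOT NULL,    -- 'anthropic', 'openai', ...
    model                   TEXT NOT NULL,
    prompt_template_version TEXT,             -- hash or semver of the system prompt
    system_prompt           TEXT,
    messages                JSONB,            -- full message array as sent
    tools                   JSONB,            -- tool definitions sent
    response                JSONB,            -- full assistant response incl. tool calls
    tool_calls              JSONB,            -- parsed tool calls and results
    input_tokens            INTEGER,
    output_tokens           INTEGER,
    cache_read_tokens       INTEGER,
    cache_creation_tokens   INTEGER,
    cost_usd                NUMERIC(12, 6),   -- computed at ingest
    latency_first_token_ms  INTEGER,
    latency_total_ms        INTEGER,
    error                   TEXT,             -- NULL on success
    metadata                JSONB             -- feature flag, A/B branch, customer tier
);

CREATE INDEX ON llm_calls ("timestamp");
CREATE INDEX ON llm_calls (user_id, "timestamp");
```

The same shape ports to ClickHouse or BigQuery; the only property worth preserving is that the JSON columns stay queryable.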
Retention
Two tiers:
- Full-fidelity hot (last 30-90 days): all columns, all rows. Used for debugging.
- Aggregated cold (older than that): keep counts, costs, latencies, errors per (day, feature, model). Drop the prompt and response text. Used for trend analysis.
Why 30-90 days hot: that's enough to debug yesterday's bug, audit last week's rollout, and produce the "what changed" report after a model upgrade. Keeping more than that with full prompt text is expensive in storage and creates a PII liability for no operational benefit.
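The cold-tier rollup is one aggregate query plus a delete, run nightly. A minimal sketch, assuming the `llm_calls` table above, a rollup table `llm_calls_daily`, and a `feature` key in `metadata` (all illustrative):

```sql
-- Roll up yesterday into the cold tier.
INSERT INTO llm_calls_daily (day, feature, model, calls, errors, cost_usd, avg_latency_ms)
SELECT
    date_trunc('day', "timestamp")                  AS day,
    metadata->>'feature'                            AS feature,
    model,
    COUNT(*)                                        AS calls,
    COUNT(*) FILTER (WHERE error IS NOT NULL)       AS errors,
    SUM(cost_usd)                                   AS cost_usd,
    AVG(latency_total_ms)                           AS avg_latency_ms
FROM llm_calls
WHERE "timestamp" >= date_trunc('day', now() - interval '1 day')
  AND "timestamp" <  date_trunc('day', now())
GROUP BY 1, 2, 3;

-- Drop full-fidelity rows past the hot window (90 days here).
DELETE FROM llm_calls WHERE "timestamp" < now() - interval '90 days';
```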
For regulated workloads (financial, medical) the hot tier retention is set by the regulation, not by us — typically 1-7 years, encrypted at rest, with separate access controls. That's a different conversation about compliance, not logging.
Sampling UI
The schema is necessary; the UI is what makes it useful. We build (or wire up) a small internal tool that:
- Lists the most recent 100 calls with their summary fields (user, model, latency, cost, error).
- Has filters: by feature, by model, by date, by user, by error status.
- Click a row → see the full prompt, the full response, the full tool-call trace, the cost breakdown.
- "Random sample" button — pull 20 random calls from the last hour for spot-checking.
- "Replay" button — re-run this exact prompt against current code, diff the new response against the logged one.
Two days of building, then it sits there saving you time for the next year. Almost every "the model is broken" investigation we run starts by clicking around in this UI for 5 minutes and finding the answer.
Cost
For a typical small-to-mid-size client running ~50K-500K model calls per month:
- Storage: ~5-50 GB/mo of prompt+response text (≈100 KB per call at the volumes above). Postgres or BigQuery: $5-30/mo.
- Ingestion: a small process that writes to the DB. Negligible compute.
- UI hosting: an internal Vercel/Render deployment, ~$10/mo.
- Total: $15-50/mo at this scale. Costs scale sub-linearly with volume, because the bulk of the data is compressible text and storage is cheap.
Cheaper than any logging SaaS. More flexible than any logging SaaS. Owned by you. We've never recommended a vendor here — the build is two days and the operational burden is essentially zero.
The first three queries we always run
Once the pipeline has been live for a week, we run these three to find issues:
- `SELECT * WHERE error IS NOT NULL ORDER BY timestamp DESC` — recent errors. Often surfaces a class of failure the team hadn't noticed.
- `SELECT model, COUNT(*), SUM(cost_usd) GROUP BY model` — cost distribution by model. Almost always reveals over-routing to Sonnet that should be Haiku (see Sonnet vs Haiku routing).
- `SELECT prompt_template_version, AVG(output_tokens) GROUP BY 1 ORDER BY 1` — output length over prompt versions. Catches prompt rot and the silent "the model started writing essays" regression.
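Spelled out against the `llm_calls` sketch above, the `FROM` clauses and time windows are the only additions:

```sql
-- 1. Recent errors: often surfaces a failure class nobody had noticed.
SELECT id, "timestamp", model, error
FROM llm_calls
WHERE error IS NOT NULL
ORDER BY "timestamp" DESC
LIMIT 100;

-- 2. Cost distribution by model: spots over-routing to expensive models.
SELECT model, COUNT(*) AS calls, SUM(cost_usd) AS cost
FROM llm_calls
WHERE "timestamp" > now() - interval '7 days'
GROUP BY model
ORDER BY cost DESC;

-- 3. Output length by prompt version: catches prompt rot and
--    "the model started writing essays" regressions.
SELECT prompt_template_version, AVG(output_tokens) AS avg_output_tokens
FROM llm_calls
GROUP BY 1
ORDER BY 1;
```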
What this catches that nothing else does
- Silent quality regressions: output-length drift, refusal-rate drift, tool-call frequency drift. None of these shows up on an APM dashboard (a sample drift query follows this list).
- The "one user is breaking it" pattern. Filter by user, see the conversations, find the adversarial input.
- The "this prompt change broke production" debug. Filter by version, compare before/after.
- Cost spikes. The conversation that ran 80 turns and cost $14 yesterday — and the user who triggered it.
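As an example of the drift checks in the first bullet, here's a tool-call frequency sketch. It assumes `tool_calls` is a JSONB array, per the DDL above:

```sql
-- Average tool calls per model call, per day.
-- A step change right after a deploy is the signal worth investigating.
SELECT
    date_trunc('day', "timestamp")                     AS day,
    AVG(COALESCE(jsonb_array_length(tool_calls), 0))   AS tool_calls_per_request
FROM llm_calls
WHERE "timestamp" > now() - interval '30 days'
GROUP BY 1
ORDER BY 1;
```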
The summary
17 columns, two retention tiers, a small internal UI. $15-50/mo. Install this on day one of any production engagement. Every "the model is broken" investigation gets cheaper. Every regression gets caught faster. The two days of build pay back inside a month.