PLAYBOOK

The shape of a useful LLM logging pipeline

Companion to the 7 things that break the week after demo day — specifically the observability gap. The exact log schema, retention policy, sampling UI, and budget we install on every engagement. Typically under $50/mo at small client scale, scaling cleanly.

8 MIN READ · UPDATED 2026-03-30 · BY PINTOED AI STUDIO

Why this exists separately from your app logs

Most teams have application logging — request IDs, error traces, metrics. That doesn't capture what you need to debug LLM behaviour. The right pipeline records the full prompt, the full response, every tool call, token counts, costs, latencies, and the version of the prompt template that produced each call.

Without this, "the model is acting weird" is unanswerable. With it, "the model is acting weird" becomes a SQL query.
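As a sketch of what "one row per model call" means in practice, here is a minimal logging wrapper. The `call_model` hook, the table name, and the column set are illustrative assumptions, not a fixed interface:

```python
import time, uuid, sqlite3
from datetime import datetime, timezone

# Illustrative store for the demo; in production this is Postgres/ClickHouse.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE model_calls (
    id TEXT, timestamp TEXT, model TEXT, prompt TEXT, response TEXT,
    input_tokens INTEGER, output_tokens INTEGER, cost_usd REAL,
    latency_ms INTEGER, prompt_template_version TEXT, error TEXT)""")

def logged_call(call_model, model, prompt, template_version):
    """Wrap a provider call so every call writes exactly one row."""
    start = time.monotonic()
    response, error, usage = None, None, {}
    try:
        response, usage = call_model(model, prompt)   # your SDK call goes here
    except Exception as exc:
        error = repr(exc)                             # errors are logged, not lost
    conn.execute(
        "INSERT INTO model_calls VALUES (?,?,?,?,?,?,?,?,?,?,?)",
        (str(uuid.uuid4()), datetime.now(timezone.utc).isoformat(), model,
         prompt, response, usage.get("input_tokens"), usage.get("output_tokens"),
         usage.get("cost_usd"), int((time.monotonic() - start) * 1000),
         template_version, error))
    conn.commit()
    return response

# Stand-in SDK for the demo.
def fake_sdk(model, prompt):
    return "ok", {"input_tokens": 10, "output_tokens": 2, "cost_usd": 0.0001}

logged_call(fake_sdk, "haiku", "hello", "v3")
```

The wrapper, not the handler, owns the row: the call is recorded even when the SDK throws.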

The log schema

One row per model call. Columns we always include:

  - id, timestamp, and the request_id that joins back to your app logs
  - model and provider
  - prompt and response, both stored in full
  - tool_calls, as a JSON column
  - input_tokens, output_tokens, cost_usd, latency_ms
  - prompt_template_version, so you know which template produced the call
  - error, NULL on success

That's it. ~17 columns. No exotic schema. Postgres or ClickHouse works fine; if you already have BigQuery, use that. The only requirement is that the JSON columns are queryable.
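A minimal sketch of that schema in SQLite, covering a subset of the ~17 columns (table and column names are our illustration, not a standard):

```python
import sqlite3

# One row per model call. Column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE model_calls (
    id                      TEXT PRIMARY KEY,
    timestamp               TEXT NOT NULL,   -- ISO 8601, UTC
    request_id              TEXT,            -- joins back to app logs
    model                   TEXT NOT NULL,
    prompt                  TEXT NOT NULL,   -- full prompt, verbatim
    response                TEXT,            -- full response, verbatim
    tool_calls              TEXT,            -- JSON array of tool calls
    input_tokens            INTEGER,
    output_tokens           INTEGER,
    cost_usd                REAL,
    latency_ms              INTEGER,
    prompt_template_version TEXT,            -- which template produced the call
    error                   TEXT             -- NULL on success
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
```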

Retention

Two tiers:

  - Hot: full rows, prompt and response text included, kept 30-90 days in the primary store.
  - Cold: everything older, with the heavy prompt and response text stripped; the metadata (tokens, costs, latencies, versions) stays for long-term trend reporting.

Why 30-90 days hot: that's enough to debug yesterday's bug, audit last week's rollout, and produce the "what changed" report after a model upgrade. Keeping more than that with full prompt text is expensive in storage and creates a PII liability for no operational benefit.

For regulated workloads (financial, medical) the hot tier retention is set by the regulation, not by us — typically 1-7 years, encrypted at rest, with separate access controls. That's a different conversation about compliance, not logging.
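The hot-to-cold transition can be a single scheduled statement: NULL out the text columns past the window, keep the row so aggregates still work. A sketch, assuming a `model_calls` table with ISO-8601 timestamps (names and the 90-day window are illustrative):

```python
import sqlite3
from datetime import datetime, timedelta, timezone

HOT_DAYS = 90  # hot-tier window; 30-90 days depending on the engagement

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE model_calls (
    id TEXT PRIMARY KEY, timestamp TEXT, model TEXT,
    prompt TEXT, response TEXT, cost_usd REAL)""")

# Two rows: one inside the hot window, one well past it.
now = datetime.now(timezone.utc)
old = (now - timedelta(days=200)).isoformat()
conn.execute("INSERT INTO model_calls VALUES ('a', ?, 'haiku', 'p', 'r', 0.01)", (old,))
conn.execute("INSERT INTO model_calls VALUES ('b', ?, 'haiku', 'p', 'r', 0.01)", (now.isoformat(),))

# The retention job: strip prompt/response text past the window.
# ISO-8601 strings in the same format compare correctly as text.
cutoff = (now - timedelta(days=HOT_DAYS)).isoformat()
conn.execute(
    "UPDATE model_calls SET prompt = NULL, response = NULL WHERE timestamp < ?",
    (cutoff,))
conn.commit()
```

The old row loses its text but keeps its cost and token metadata, so the long-range reports keep working.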

Sampling UI

The schema is necessary; the UI is what makes it useful. We build (or wire up) a small internal tool that:

  - samples recent calls, filtered by model, prompt template version, and error status
  - shows the full prompt, the full response, and every tool call for a selected row
  - links each call back to its request ID in the application logs

Two days of building, then it sits there saving you time for the next year. Almost every "the model is broken" investigation we run starts by clicking around in this UI for 5 minutes and finding the answer.
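The core of that tool is one parameterised query. A sketch, against an illustrative subset of the schema:

```python
import sqlite3

def sample_calls(conn, n=5, model=None, errors_only=False):
    """Core query behind the sampling UI: N most recent calls, filtered."""
    where, params = [], []
    if model:
        where.append("model = ?"); params.append(model)
    if errors_only:
        where.append("error IS NOT NULL")
    sql = "SELECT id, model, prompt, response, error FROM model_calls"
    if where:
        sql += " WHERE " + " AND ".join(where)
    sql += " ORDER BY timestamp DESC LIMIT ?"
    return conn.execute(sql, params + [n]).fetchall()

# Demo against an in-memory table (column names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE model_calls (
    id TEXT, timestamp TEXT, model TEXT, prompt TEXT, response TEXT, error TEXT)""")
conn.executemany("INSERT INTO model_calls VALUES (?,?,?,?,?,?)", [
    ("a", "2026-03-01T00:00:00Z", "haiku",  "p1", "r1", None),
    ("b", "2026-03-02T00:00:00Z", "sonnet", "p2", None, "timeout"),
])

failed = sample_calls(conn, errors_only=True)
```

The real tool wraps this in a small web page; the query is the whole trick.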

Cost

For a typical small-to-mid client running ~50K-500K model calls per month:

  - hot-tier storage: a few GB of full rows
  - database: the Postgres you already run, or a small managed instance
  - total: $15-50/mo

Cheaper than any logging SaaS. More flexible than any logging SaaS. Owned by you. We've never recommended a vendor here — the build is two days and the operational burden is essentially zero.
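A back-of-envelope storage check shows why the bill stays small. The 8 KB average row size here is our assumption, not a measured figure:

```python
# Back-of-envelope hot-tier storage, under assumed averages.
calls_per_month = 500_000          # top of the range above
avg_row_kb = 8                     # full prompt + response + metadata (assumption)

rows_hot = calls_per_month * 3     # a 90-day hot window is ~3 months of calls
hot_gb = rows_hot * avg_row_kb / 1024 / 1024
print(f"{hot_gb:.1f} GB hot")      # ~11.4 GB: well inside a small Postgres instance
```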

The first three queries we always run

Once the pipeline has been live for a week, we always run three queries to find issues:

  1. SELECT * WHERE error IS NOT NULL ORDER BY timestamp DESC — recent errors. Often surfaces a class of failure the team hadn't noticed.
  2. SELECT model, COUNT(*), SUM(cost_usd) GROUP BY model — cost distribution by model. Almost always reveals over-routing to Sonnet that should be Haiku (see Sonnet vs Haiku routing).
  3. SELECT prompt_template_version, AVG(output_tokens) GROUP BY 1 ORDER BY 1 — output length over prompt versions. Catches prompt rot and the silent "the model started writing essays" regression.
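Spelled out as runnable SQL against a hypothetical `model_calls` table (the shorthand above omits the FROM clauses):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE model_calls (
    timestamp TEXT, model TEXT, prompt_template_version TEXT,
    output_tokens INTEGER, cost_usd REAL, error TEXT)""")
conn.executemany("INSERT INTO model_calls VALUES (?,?,?,?,?,?)", [
    ("2026-03-01T10:00:00Z", "sonnet", "v1", 900, 0.030, None),
    ("2026-03-02T10:00:00Z", "haiku",  "v2", 120, 0.001, None),
    ("2026-03-03T10:00:00Z", "sonnet", "v2", 150, 0.028, "rate_limit"),
])

# 1. Recent errors.
errors = conn.execute(
    "SELECT * FROM model_calls WHERE error IS NOT NULL "
    "ORDER BY timestamp DESC").fetchall()

# 2. Cost distribution by model.
by_model = conn.execute(
    "SELECT model, COUNT(*), SUM(cost_usd) FROM model_calls "
    "GROUP BY model ORDER BY model").fetchall()

# 3. Output length over prompt template versions.
by_version = conn.execute(
    "SELECT prompt_template_version, AVG(output_tokens) FROM model_calls "
    "GROUP BY 1 ORDER BY 1").fetchall()
```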

What this catches that nothing else does

  - Error classes the team hadn't noticed: query 1 keeps surfacing them.
  - Cost drift from over-routing to a bigger model than the task needs.
  - Silent output-length regressions and prompt rot across template versions.
  - The "what changed" report after a model upgrade, backed by data rather than memory.

The summary

17 columns, two retention tiers, a small internal UI. $15-50/mo. Install this on day one of any production engagement. Every "the model is broken" investigation gets cheaper. Every regression gets caught faster. The two days of build pay back inside a month.

Want this pipeline installed on your stack? 2-day engagement.

BOOK A CALL → SEE SERVICES →