PLAYBOOK

Long-context patterns that replaced our vector DB

The companion piece to "RAG is mostly dead": four concrete patterns that replaced our default RAG stack in 2026, what each one looks like as code, when each one fits, and the small vector store we still keep around for the cases where it actually earns its keep.

READ · 10 MIN · UPDATED 2026-04-18 · BY PINTOED AI STUDIO

The pattern that died

Two years ago, every AI feature started with: "set up a vector DB, embed the corpus, retrieve top-K, stuff into prompt." That stack was the default reflex. It still works. It's also wrong for most jobs we're shipping today.

Three things converged to make it wrong: (1) context windows grew to 1M tokens, at under $20 per million for cached input; (2) prompt caching changed the cost function entirely; (3) tool use and agentic search made model-driven retrieval better than embedding-driven retrieval on most tasks. So: four patterns, in roughly the order we reach for them.

Pattern 1: Stuff the whole document

For documents under ~500K tokens (95% of legal contracts, most product docs, almost all single-customer support tickets) — just send the whole thing in the system prompt with caching enabled.
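
A minimal sketch of the pattern, assuming the Anthropic Python SDK; the model name, helper name, and prompt wording are placeholders, not a prescription:

    import anthropic

    client = anthropic.Anthropic()

    def answer(document_text: str, question: str) -> str:
        response = client.messages.create(
            model="claude-sonnet-4-5",  # placeholder model name
            max_tokens=1024,
            system=[
                {"type": "text",
                 "text": "Answer using only the document below. Quote the passage you rely on."},
                # The whole document goes into the prompt once; cache_control means
                # repeat questions against the same document hit the prompt cache.
                {"type": "text",
                 "text": document_text,
                 "cache_control": {"type": "ephemeral"}},
            ],
            messages=[{"role": "user", "content": question}],
        )
        return response.content[0].text

No chunker, no index, no retrieval step: the prompt is the only moving part.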

What it replaced: chunking the document, embedding the chunks, and pulling the top-K most similar ones into the prompt, with no guarantee they were the right ones.

Lift over the old pattern: the model has the whole document, not a guess about which 5 chunks are relevant. Retrieval misses disappear. Citations work because the model can quote the actual passage. We've seen accuracy improvements of 15–35% on Q&A tasks just from this swap.

When it doesn't fit: documents above ~500K tokens, or corpora of thousands of documents (see pattern 3).

Pattern 2: Cached corpus + on-demand expansion

For mid-sized corpora (50–500 documents, total ~100K–800K tokens), cache the entire flattened corpus once and let the model navigate it directly. Include a brief table of contents at the top of the cached payload so the model can jump to the right document.
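
One way to build and cache the flattened payload, again assuming the Anthropic SDK; the document shape and tag format are illustrative:

    import anthropic

    client = anthropic.Anthropic()

    def flatten_corpus(docs: list[dict]) -> str:
        """docs is assumed to look like [{"id": ..., "title": ..., "text": ...}]."""
        toc = "\n".join(f"[{d['id']}] {d['title']}" for d in docs)
        body = "\n\n".join(
            f"<doc id='{d['id']}' title='{d['title']}'>\n{d['text']}\n</doc>"
            for d in docs
        )
        return f"TABLE OF CONTENTS\n{toc}\n\n{body}"

    def ask(docs: list[dict], question: str) -> str:
        response = client.messages.create(
            model="claude-sonnet-4-5",  # placeholder model name
            max_tokens=1024,
            system=[
                {"type": "text",
                 "text": "Answer from the corpus below. Cite document ids."},
                # The flattened corpus is identical on every request, so it is
                # written to the prompt cache once and read back cheaply after that.
                {"type": "text",
                 "text": flatten_corpus(docs),
                 "cache_control": {"type": "ephemeral"}},
            ],
            messages=[{"role": "user", "content": question}],
        )
        return response.content[0].text

The table of contents at the top is what lets the model jump straight to the relevant document inside the payload instead of scanning the whole thing.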

What it replaced: a standing vector pipeline over the corpus, a retrieval step on every request, and the maintenance work of keeping the index in sync with the documents.

The pattern only works because of caching. On uncached input the cost would be untenable. With cached input at 10% list price, the math flips: it's almost always cheaper than maintaining a vector pipeline plus the model call plus the failure modes of the retrieval step.
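
Back-of-envelope on why, with assumed numbers; the list price, corpus size, and request volume below are illustrative, not anyone's actual bill:

    # Assumptions: 600K-token cached corpus, $3.00 per million input tokens at
    # list price, cached input billed at 10% of list, 10,000 requests per month.
    # Ignores the one-time cache-write premium and output tokens.
    list_price_per_mtok = 3.00
    corpus_tokens = 600_000
    requests_per_month = 10_000

    uncached = corpus_tokens / 1e6 * list_price_per_mtok * requests_per_month
    cached = uncached * 0.10
    print(f"uncached: ${uncached:,.0f}/mo   cached: ${cached:,.0f}/mo")
    # -> uncached: $18,000/mo   cached: $1,800/mo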

Pattern 3: Agentic search (model-driven retrieval)

For larger corpora — thousands of documents, tens of millions of tokens — give the model a search tool and let it issue queries. The "vector DB" becomes a search index the model uses, not a pre-retrieval step that runs before the model.
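
The shape of the loop, assuming the Anthropic SDK's tool-use API; search_corpus is a stub for whatever index you already run:

    import anthropic

    client = anthropic.Anthropic()

    def search_corpus(query: str, limit: int = 10) -> str:
        """Stub: run the query against your existing index (BM25, vectors,
        hosted search, whatever) and return passages with document ids."""
        raise NotImplementedError

    TOOLS = [{
        "name": "search_corpus",
        "description": "Search the corpus. Returns matching passages with document ids.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "limit": {"type": "integer"},
            },
            "required": ["query"],
        },
    }]

    def answer(question: str) -> str:
        messages = [{"role": "user", "content": question}]
        while True:
            response = client.messages.create(
                model="claude-sonnet-4-5",  # placeholder model name
                max_tokens=2048,
                tools=TOOLS,
                messages=messages,
            )
            if response.stop_reason != "tool_use":
                # The model has stopped searching and written its answer.
                return "".join(b.text for b in response.content if b.type == "text")
            # The model chose the query; run it and hand the results back.
            messages.append({"role": "assistant", "content": response.content})
            tool_results = [{
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": search_corpus(**block.input),
            } for block in response.content if block.type == "tool_use"]
            messages.append({"role": "user", "content": tool_results})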

What it replaced: single-shot vector retrieval that runs before the model is involved and gets exactly one chance to pick the right passages.

The implementation can still use vector similarity inside search_corpus — that part doesn't go away. What changes is the orchestration. The model picks the queries, judges the results, and iterates. We've measured retrieval-quality improvements of 25–40% over single-shot vector retrieval on real Q&A workloads.

Cost: more model turns (2–5 typically). But each turn is cheap because of caching, and the alternative — wrong answers because retrieval missed — is more expensive in different ways.

Pattern 4: Hybrid (small vector store + agentic search)

For the largest corpora (millions of documents) or the cases where the model needs cheap, fast routing across categories before doing real work — keep a vector store, but as the first leg of a multi-step search, not the only retrieval step.

Concretely: the model calls a "narrow_to_subset" tool that does vector retrieval to find the relevant 50 documents, then issues keyword or full-text search inside that subset for precision. The vector store becomes a coarse filter; the model handles the fine filter.
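
What that looks like as tools, reusing the loop from the pattern 3 sketch; narrow_to_subset and search_within are names invented for this sketch, and both bodies are stubs for your own backends:

    def narrow_to_subset(query: str) -> str:
        """Stub: coarse vector retrieval over the full corpus; returns ~50 doc ids."""
        raise NotImplementedError

    def search_within(doc_ids: list[str], query: str) -> str:
        """Stub: keyword / full-text search restricted to those doc ids."""
        raise NotImplementedError

    TOOLS = [
        {
            "name": "narrow_to_subset",
            "description": "Vector search over the whole corpus. Returns the ids of "
                           "the ~50 most relevant documents.",
            "input_schema": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
        {
            "name": "search_within",
            "description": "Keyword / full-text search limited to a list of document "
                           "ids. Returns matching passages.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "doc_ids": {"type": "array", "items": {"type": "string"}},
                    "query": {"type": "string"},
                },
                "required": ["doc_ids", "query"],
            },
        },
    ]

    HANDLERS = {"narrow_to_subset": narrow_to_subset, "search_within": search_within}
    # In the pattern 3 loop, the single search_corpus call becomes:
    #   HANDLERS[block.name](**block.input)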

We use this for one client whose corpus is ~4M support tickets. Pure long-context can't fit. Pure agentic search is too slow because the search index is huge. The hybrid lands at ~3 second median latency with ~20% better answer quality than either pure approach.

The decision tree we run

  1. Is the relevant content under ~500K tokens? → Pattern 1 (stuff it).
  2. Is the corpus 50–500 documents, totalling under ~800K tokens? → Pattern 2 (cache the corpus).
  3. Is the corpus larger but searchable in one tool call? → Pattern 3 (agentic search).
  4. Is the corpus too large for a single search to be precise? → Pattern 4 (hybrid).
  5. None of the above and you're sure you need vectors? → Plain old RAG, but think hard about why.
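
If you want the same tree as code, it is small enough to inline; the flags simply mirror the questions above:

    def choose_pattern(relevant_tokens: int, corpus_docs: int, corpus_tokens: int,
                       searchable_in_one_call: bool, single_search_is_precise: bool) -> str:
        if relevant_tokens <= 500_000:
            return "pattern 1: stuff the whole document"
        if 50 <= corpus_docs <= 500 and corpus_tokens <= 800_000:
            return "pattern 2: cache the corpus"
        if searchable_in_one_call:
            return "pattern 3: agentic search"
        if not single_search_is_precise:
            return "pattern 4: hybrid (coarse vector filter + precise search)"
        return "plain RAG, but think hard about why"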

We hit pattern 1 or 2 on roughly 70% of new engagements. Pattern 3 covers another 20%. Pattern 4 is rare. Plain RAG is rarer still.

What the diff looks like in production

A typical migration we run, replacing an existing RAG pipeline:
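
The function and variable names below are invented for illustration; the shape of the change is the point. The chunking and retrieval code comes out, a cached system block goes in, and the ingestion job stops existing:

    import anthropic

    client = anthropic.Anthropic()
    MODEL = "claude-sonnet-4-5"  # placeholder model name

    # BEFORE: retrieval decides what the model gets to see.
    # vector_db, embed, and render stand in for the existing pipeline's pieces.
    def answer_before(question, vector_db, embed, render) -> str:
        chunks = vector_db.search(embed(question), top_k=5)   # retrieval step
        prompt = render(chunks, question)                     # chunk template
        msg = client.messages.create(model=MODEL, max_tokens=1024,
                                     messages=[{"role": "user", "content": prompt}])
        return msg.content[0].text

    # AFTER: the model sees the whole (cached) document; the retrieval step,
    # the chunk template, and the ingestion job that fed them are deleted.
    def answer_after(question, document_text) -> str:
        msg = client.messages.create(
            model=MODEL, max_tokens=1024,
            system=[{"type": "text", "text": document_text,
                     "cache_control": {"type": "ephemeral"}}],
            messages=[{"role": "user", "content": question}],
        )
        return msg.content[0].text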

Where vectors still earn their keep

We're not anti-vector. The cases where we still reach for them: the coarse first-leg filter in pattern 4, where millions of documents need narrowing before a precise search; cheap routing of a request across categories before the model does the real work; and similarity jobs that were never really retrieval at all, such as deduplication, clustering, and "related items" lookups.

Notice none of those is "RAG for Q&A on documents." That use case is the one that ate the world in 2023, and it's the one long-context has eaten back.

The takeaway

RAG isn't dead, but the default reflex of "embed everything first" is. Run the decision tree above. Most teams will land on pattern 1 or 2 and discover they have less infrastructure to maintain, better answer quality, and a flat-or-down bill.

The migration is rarely more than a week of focused work for a mid-sized corpus. If you'd like us to do that audit and migration for you, book a scoping call.

Got a vector DB you don't need anymore? Let's audit it.
