The pattern that died
Two years ago, every AI feature started with: "set up a vector DB, embed the corpus, retrieve top-K, stuff into prompt." That stack was the default reflex. It still works. It's also wrong for most jobs we're shipping today.
Three things converged to make it wrong: (1) context windows grew to 1M tokens, at under $20 per million cached input tokens; (2) prompt caching changed the cost function entirely; (3) tool use and agentic search made model-driven retrieval better than embedding-driven retrieval at most tasks. So: four patterns, in roughly the order we reach for them.
Pattern 1: Stuff the whole document
For documents under ~500K tokens (95% of legal contracts, most product docs, almost all single-customer support tickets) — just send the whole thing in the system prompt with caching enabled.
What it replaced:
- Old: chunk the document, embed each chunk, store in pgvector, retrieve top-5 at query-time, hope the retrieval got the right chunks.
- New: include the full document in the system prompt with a cache-control breakpoint. First call pays full price; every follow-up call gets the document at ~10% of input cost.
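In code, that's a handful of lines. A minimal sketch, assuming the Anthropic Messages API and its prompt-caching blocks (the model name is illustrative; any provider with prompt caching has an equivalent):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def ask_document(document_text: str, question: str) -> str:
    """Send the whole document as a cached system prompt and ask one question."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative; use whatever you run
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": "Answer questions about the document below. "
                        "Quote the relevant passage when you cite.",
            },
            {
                "type": "text",
                "text": document_text,
                # Cache breakpoint: the first call writes the cache at full
                # price; follow-ups read the document at the cached-input rate.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text
```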
Lift over the old pattern: the model has the whole document, not a guess about which 5 chunks are relevant. Retrieval misses disappear. Citations work because the model can quote the actual passage. We've seen accuracy improvements of 15–35% on Q&A tasks just from this swap.
When it doesn't fit: documents above ~500K tokens, or corpora of thousands of documents (see pattern 3).
Pattern 2: Cached corpus + on-demand expansion
For mid-sized corpora (50–500 documents, total ~100K–800K tokens), cache the entire flattened corpus once and let the model navigate it directly. Include a brief table of contents at the top of the cached payload so the model can orient itself.
What it replaced:
- Old: chunk + embed + retrieve, with all the chunking heuristics (overlap, boundary, semantic vs. fixed-size) we used to argue about.
- New: a single cached payload of "here is everything we know about this customer / product / case." The model picks what to cite. Once the payload is cached, each call pays the cached-input rate on it, so the marginal cost feels like a table-of-contents lookup rather than re-sending the whole corpus.
The pattern only works because of caching. On uncached input the cost would be untenable. With cached input at ~10% of list price, the math flips: it's almost always cheaper than running a vector pipeline plus the model call, and it eliminates the retrieval step's failure modes outright.
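A minimal sketch of the payload builder, assuming documents arrive as plain-text strings; the [DOC n] markers and the helper name are ours, not a standard:

```python
def build_corpus_payload(docs: dict[str, str]) -> str:
    """Flatten a corpus into one cacheable string: a table of contents first,
    then every document under a stable [DOC n] marker the model can cite."""
    ids = sorted(docs)
    toc = "\n".join(f"[DOC {i}] {doc_id}" for i, doc_id in enumerate(ids, 1))
    body = "\n\n".join(
        f"[DOC {i}] {doc_id}\n{docs[doc_id]}" for i, doc_id in enumerate(ids, 1)
    )
    return f"TABLE OF CONTENTS\n{toc}\n\nDOCUMENTS\n{body}"
```

The returned string goes into the same cache_control system block as in pattern 1. Because it's byte-identical across calls, every call after the first reads it at the cached rate.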
Pattern 3: Agentic search (model-driven retrieval)
For larger corpora — thousands of documents, tens of millions of tokens — give the model a search tool and let it issue queries. The "vector DB" becomes a search index the model uses, not a pre-retrieval step that runs before the model.
What it replaced:
- Old: embed the user's query, vector-similarity search, return top-K chunks, model generates from chunks. One query attempt per user turn.
- New: a tool definition like search_corpus(query: str, n_results: int). The model decides what to search for, refines the query if results are weak, can search multiple times in one turn, and can combine results from different searches.
The implementation can still use vector similarity inside search_corpus — that part doesn't go away. What changes is the orchestration. The model picks the queries, judges the results, and iterates. We've measured retrieval-quality improvements of 25–40% over single-shot vector retrieval on real Q&A workloads.
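A sketch of that loop, assuming the Anthropic tool-use API; the toy keyword index stands in for whatever search you already run (vector similarity included), and the model name is illustrative:

```python
import anthropic

client = anthropic.Anthropic()

SEARCH_TOOL = {
    "name": "search_corpus",
    "description": "Search the document corpus. Returns top matching passages.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "n_results": {"type": "integer"},
        },
        "required": ["query"],
    },
}

CORPUS: dict[str, str] = {}  # doc_id -> text; load from your store

def search_corpus(query: str, n_results: int = 5) -> str:
    """Toy keyword index so the sketch runs; in production this is your
    existing search index (vector similarity, BM25, or both)."""
    terms = query.lower().split()
    ranked = sorted(
        CORPUS.items(),
        key=lambda kv: -sum(kv[1].lower().count(t) for t in terms),
    )
    return "\n\n".join(f"{d}:\n{text[:500]}" for d, text in ranked[:n_results])

def answer(question: str) -> str:
    """Let the model drive retrieval: search, judge, refine, repeat."""
    messages = [{"role": "user", "content": question}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",  # illustrative model name
            max_tokens=1024,
            tools=[SEARCH_TOOL],
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            return "".join(b.text for b in response.content if b.type == "text")
        # The model asked to search: run each call, return results, loop.
        messages.append({"role": "assistant", "content": response.content})
        tool_results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": search_corpus(**block.input),
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": tool_results})
```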
Cost: more model turns (2–5 typically). But each turn is cheap because of caching, and the alternative — wrong answers because retrieval missed — is more expensive in different ways.
Pattern 4: Hybrid (small vector store + agentic search)
For the largest corpora (millions of documents) or the cases where the model needs cheap, fast routing across categories before doing real work — keep a vector store, but as the first leg of a multi-step search, not the only retrieval step.
Concretely: the model calls a "narrow_to_subset" tool that does vector retrieval to find the relevant 50 documents, then issues keyword or full-text search inside that subset for precision. The vector store becomes a coarse filter; the model handles the fine filter.
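A sketch of the two tools, with toy in-memory indexes standing in for the real vector store and full-text index; the names narrow_to_subset and search_within are just the ones used above:

```python
import numpy as np

# Toy in-memory stand-ins. In production the coarse pass is your vector store
# (pgvector, etc.) and the fine pass is a real full-text index.
DOC_TEXT: dict[str, str] = {}    # doc_id -> text
DOC_IDS: list[str] = []          # row order of the embedding matrix below
DOC_VECS = np.zeros((0, 256))    # one embedding row per document

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy hashing embedding so the sketch runs; swap in a real model."""
    v = np.zeros(dim)
    for token in text.lower().split():
        v[hash(token) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def narrow_to_subset(query: str, k: int = 50) -> list[str]:
    """Coarse filter: cosine similarity across the whole corpus, IDs only.
    Recall-oriented; precision comes from the second pass."""
    scores = DOC_VECS @ embed(query)
    return [DOC_IDS[i] for i in np.argsort(-scores)[:k]]

def search_within(doc_ids: list[str], keywords: str, n_results: int = 10) -> str:
    """Fine filter: keyword search restricted to the subset found above."""
    terms = keywords.lower().split()
    ranked = sorted(
        doc_ids,
        key=lambda d: -sum(DOC_TEXT[d].lower().count(t) for t in terms),
    )
    return "\n\n".join(f"{d}:\n{DOC_TEXT[d][:500]}" for d in ranked[:n_results])
```

Both functions get exposed to the model as tools exactly like search_corpus in pattern 3; the model typically calls narrow_to_subset once, then search_within as often as it needs.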
We use this for one client whose corpus is ~4M support tickets. Pure long-context can't fit. Pure agentic search is too slow because the search index is huge. The hybrid lands at a ~3-second median latency with ~20% better answer quality than either pure approach.
The decision tree we run
- Is the relevant content under ~500K tokens? → Pattern 1 (stuff it).
- Is the corpus 50–500 documents, totalling under ~800K tokens? → Pattern 2 (cache the corpus).
- Is the corpus larger but searchable in one tool call? → Pattern 3 (agentic search).
- Is the corpus too large for a single search to be precise? → Pattern 4 (hybrid).
- None of the above and you're sure you need vectors? → Plain old RAG, but think hard about why.
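The same tree as code, if you want it on one screen; a sketch with the cutoffs from the list hard-coded (they move as pricing and context windows move):

```python
def pick_pattern(relevant_tokens: int, n_docs: int, one_search_is_enough: bool) -> str:
    """Route a job through the decision tree above. Thresholds are the
    rough cutoffs from this post, not universal constants."""
    if relevant_tokens < 500_000:
        return "pattern 1: stuff it"
    if 50 <= n_docs <= 500 and relevant_tokens < 800_000:
        return "pattern 2: cache the corpus"
    if one_search_is_enough:
        return "pattern 3: agentic search"
    return "pattern 4: hybrid (or plain RAG, if you're sure you need vectors)"
```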
We hit pattern 1 or 2 on roughly 70% of new engagements. Pattern 3 covers another 20%. Pattern 4 is rare. Plain RAG is rarer still.
What the diff looks like in production
A typical migration we run, replacing an existing RAG pipeline:
- Removed: vector DB infrastructure (pgvector, Pinecone, etc.), embedding model + pipeline, chunking logic, retrieval ranking heuristics, eval suite for retrieval quality.
- Added: a system-prompt builder with cache-control breakpoints, a single tool function for search if needed, and an eval suite covering end-to-end answer quality (which we wanted anyway).
- Net code change: usually -800 to -2,000 lines.
- Net infra change: one fewer service to operate.
- Net cost change: usually flat to -30%, depending on corpus size and call volume.
- Net quality change: +15–40% on real eval cases.
Where vectors still earn their keep
We're not anti-vector. The cases where we still reach for them:
- Real-time recommendation surfaces where the latency budget is <100ms — model calls don't fit.
- Pre-routing across thousands of categories where a cheap classifier on top of vectors beats a model call.
- The first leg of pattern 4 above.
- Cases where embedding similarity is the actual product (image dedupe, recommendation, semantic clustering).
Notice none of those is "RAG for Q&A on documents." That's the use case that ate the world in 2023, and it's the one long-context has eaten back.
The takeaway
RAG isn't dead, but the default reflex of "embed everything first" is. Run the decision tree above. Most teams will land on pattern 1 or 2 and discover they have less infrastructure to maintain, better answer quality, and a flat-or-down bill.
The migration is rarely more than a week of focused work for a mid-sized corpus. If you'd like us to do that audit and migration for you, book a scoping call.