What RAG was solving
Retrieval-augmented generation existed because models had small context windows and no way to look anything up. If you wanted the model to answer questions about your docs, your only option was: chunk the docs, embed each chunk, store in a vector DB, do a similarity search at query time, stuff the top-k chunks into the prompt, hope the model could reason over them.
That pipeline worked. It also had a long list of failure modes — wrong chunks retrieved, chunks lacking context that lived in neighbouring chunks, the model losing the plot when given disjointed snippets, embedding-model drift across re-indexes. We shipped a lot of RAG systems and spent more time debugging retrieval than we'd like to admit.
What changed in 2025-2026
- Context windows got real. 200K tokens is the norm now (Claude, GPT, Gemini). Gemini routinely handles 1M+. Long-context recall — actually using the content deep in the window — has improved meaningfully.
- Prompt caching ate the cost. Static prefixes cost 10% of the normal input rate on a cache hit. A 50K-token reference doc, once cached, costs almost nothing from the second call onward (rough numbers after this list) — see how this dropped a client's bill 71%.
- Tool-use matured. Models can now reliably call a search function, read the result, decide if it's relevant, and iterate. The model itself becomes the retrieval orchestrator.
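To make that concrete, here is a back-of-envelope calculation for the 50K-token doc above. The only figure taken from this post is the 10% cache-read rate; the $3-per-million-token input price and the 1.25x cache-write premium are illustrative assumptions, so check your provider's current price list.

```python
# Rough cost of keeping a 50K-token reference doc in a cached prompt prefix.
# Prices below are illustrative assumptions, not any provider's actual price list.
DOC_TOKENS = 50_000
BASE_PER_MTOK = 3.00      # USD per million input tokens (assumed)
CACHE_WRITE_MULT = 1.25   # assumed premium on the first, cache-writing call
CACHE_READ_MULT = 0.10    # the "10% of normal input rate" figure quoted above

first_call = DOC_TOKENS / 1e6 * BASE_PER_MTOK * CACHE_WRITE_MULT
later_calls = DOC_TOKENS / 1e6 * BASE_PER_MTOK * CACHE_READ_MULT

print(f"first call (writes the cache): ${first_call:.4f}")   # ~$0.19
print(f"each later call (cache hit):   ${later_calls:.4f}")  # ~$0.015
```

Under those assumptions the doc tokens cost about a cent and a half per query; the per-call bill is dominated by the question and the answer, not the corpus.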
What we use instead
Three patterns now cover most of what RAG used to do:
1. Just put the docs in the prompt (cached)
For corpora under ~150K tokens (a sizeable internal docs set, a product manual, a contract template library), the simplest answer is: stuff the whole thing in the system prompt with cache control enabled. First call writes the cache; every subsequent call reads it at 10% of normal cost. The model sees the entire corpus on every query and reasons over it directly. No retrieval pipeline. No embedding drift. No chunking decisions.
For the 80% of "AI for our docs" use cases we see, this is now the right architecture. It's measurably better at multi-hop questions than chunked RAG was.
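Here's a minimal sketch of pattern 1 with the Anthropic Python SDK. The file path, model name, and instructions are placeholders; the same shape works with any provider that supports prompt caching.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Illustrative: your whole corpus as one string, well under the context limit.
DOCS = open("docs/handbook.md").read()

def ask(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder; use your target model
        max_tokens=1024,
        system=[
            {"type": "text", "text": "Answer using only the documentation below."},
            {
                "type": "text",
                "text": DOCS,
                # Everything up to and including this block is the cacheable prefix:
                # the first call writes the cache, later calls read it at the lower rate.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text
```

You can confirm the cache is actually being hit by checking response.usage.cache_read_input_tokens from the second call onward.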
2. Tool-use over a real search engine
For larger corpora — a help centre with 5K articles, a code base, a ticketing system — give the model a search() tool that hits a real search engine (Algolia, Elasticsearch, Postgres full-text). The model decides what to query, evaluates the results, and re-queries if needed. Better results than vector search for most "find me the docs that talk about X" use cases, and you get keyword precision back.
Bonus: the search engine you're already running for your product probably has better tooling than your vector DB.
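A minimal sketch of that loop with the Anthropic tool-use API; search_backend() and the model name are placeholders for whatever engine and model you actually run.

```python
import anthropic

client = anthropic.Anthropic()

# The tool schema the model sees. The backend is whatever you already run:
# an Algolia/Elasticsearch client, or a Postgres full-text query.
SEARCH_TOOL = {
    "name": "search",
    "description": "Full-text search over the help centre. Returns the top matching articles.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string", "description": "Keywords to search for"}},
        "required": ["query"],
    },
}

def search_backend(query: str) -> str:
    # Placeholder: call your real search engine and return a compact summary
    # of the top hits (title, URL, snippet) as text or JSON.
    raise NotImplementedError

def answer(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",  # placeholder model name
            max_tokens=1024,
            tools=[SEARCH_TOOL],
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            # The model has stopped searching and produced its answer.
            return "".join(b.text for b in response.content if b.type == "text")

        # The model asked to search: run each query, feed the results back,
        # and let it evaluate them and re-query if the first attempt missed.
        messages.append({"role": "assistant", "content": response.content})
        tool_results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": search_backend(**block.input),
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": tool_results})
```

In production you'd cap the loop at a few iterations and log each query the model issues; those logs are far easier to debug than a bad similarity score.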
3. Hybrid for true semantic + a billion docs
The actual remaining use case for vector search is when you have a large corpus AND the query is genuinely semantic (not keyword-retrievable). Document-similarity recommendations, content moderation classification, "find images that look like this." That's a real category, and embeddings still win there. It's just much smaller than the marketing in 2023 made it look.
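For completeness, a minimal sketch of that category (embedding-based document similarity), using sentence-transformers as an illustrative library. The model name and corpus are placeholders; at real scale the vectors would sit behind an ANN index, which is where a vector DB genuinely earns its keep.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, widely available example model

corpus = [
    "How to reset your password",
    "Billing and invoice history",
]  # stand-in for the real document texts

# Unit-normalised embeddings, so a dot product is cosine similarity.
corpus_emb = model.encode(corpus, normalize_embeddings=True)

def most_similar(doc: str, k: int = 5) -> list[int]:
    """Indices of the k corpus docs most semantically similar to `doc`."""
    q = model.encode([doc], normalize_embeddings=True)[0]
    scores = corpus_emb @ q
    return np.argsort(-scores)[:k].tolist()
```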
What this means for your stack
- For most "chat with our docs" use cases: you don't need a vector DB. Cache the docs.
- For larger corpora: give the model a search tool over a real search engine.
- For genuinely semantic + huge: keep vectors. But that's a smaller bucket than the framework marketing suggests.
- If you're already on RAG and it works: don't break it. The above is for greenfield projects.
What this means for the tools
LangChain, LlamaIndex, and the vector-DB-of-the-month exist largely to make the RAG pipeline manageable. As fewer projects need that pipeline, the value proposition shrinks. Most of the systems we ship now are direct API calls to Claude or ChatGPT with prompt caching and a small custom tool layer. No framework. The code is shorter and easier to debug.