What "1M context" really means in practice
The model accepts roughly a million input tokens. That's about 750,000 words: a stack of books, a year of Slack, a corpus of legal contracts. The marketing picture is "ask one question of everything you know." The reality requires more nuance.
Three things that the headline doesn't say:
- Quality at 1M is not the same as quality at 50K. The model is more accurate on focused inputs than on diffuse ones, even when everything fits in the window.
- 1M tokens at full price is real money. Without aggressive caching, this is a feature you'd use four times and then pull on cost grounds.
- Latency on the longest inputs is meaningfully higher than on shorter ones. You can feel it in the UI.
The right framing isn't "the model can read everything." It's "the model can read more than you used to be able to give it, cheaply, if you cache aggressively, and you should still curate."
Four workloads where 1M changed our architecture
1. Whole-codebase reasoning
A medium-sized codebase (~600K tokens) fits in context with room for the conversation. We've moved a chunk of "explain this system" / "find the bug" / "refactor across 14 files" work from RAG-over-codebase patterns to "send the whole codebase, cached." The lift is dramatic on cross-file reasoning — the model can see both the caller and the callee without us having to predict which files to retrieve.
Where it doesn't help: monorepo-scale codebases. Above ~800K tokens we're back to agentic search via tool use (pattern 3 in our long-context patterns piece).
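The packing step is simple enough to show. A minimal sketch, assuming a crude chars-per-token heuristic and the ~800K cutoff above (the extension list and threshold are illustrative, not recommendations):

```python
import os

MAX_CORPUS_TOKENS = 800_000  # past this, fall back to agentic search

def estimate_tokens(text: str) -> int:
    # Rough ~4-chars-per-token heuristic; use your provider's token
    # counting endpoint for anything load-bearing.
    return len(text) // 4

def pack_codebase(root: str, extensions=(".py", ".ts", ".go")) -> str | None:
    """Concatenate source files under path headers so the model can see
    caller and callee together. Returns None when the corpus is too big
    to send whole."""
    parts = []
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                parts.append(f"=== {path} ===\n{f.read()}")
    corpus = "\n\n".join(parts)
    if estimate_tokens(corpus) > MAX_CORPUS_TOKENS:
        return None  # monorepo-scale: switch to tool-driven search
    return corpus
```

The packed string goes into a cached prompt block once; every follow-up question against the same codebase then pays cached-input rates.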
2. Document review where the chain matters
Contract chains, regulatory filings, multi-document due diligence. The kind of work where the answer to "is this clause consistent with the master agreement?" requires the model to hold both documents simultaneously. Pre-1M, we'd extract sections and feed them in pieces. Post-1M, we send the documents whole and let the model do its own cross-referencing.
Quality lift on these tasks has been consistent: 20-30% fewer missed cross-references against the same eval set, and the model can quote the conflicting passages cleanly.
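The mechanics, as a sketch against the Anthropic Python SDK: both documents go into a single cached system block so repeat questions against the same chain stay cheap. The model ID is a placeholder and document loading is elided.

```python
import anthropic

client = anthropic.Anthropic()
MODEL_ID = "claude-opus-4-7"  # hypothetical ID; use your long-context model

def check_consistency(master_agreement: str, amendment: str) -> str:
    response = client.messages.create(
        model=MODEL_ID,
        max_tokens=2_000,
        # Whole documents, one cached block: the first call writes the
        # cache, subsequent calls read it at the discounted rate.
        system=[{
            "type": "text",
            "text": f"MASTER AGREEMENT:\n{master_agreement}\n\n"
                    f"AMENDMENT:\n{amendment}",
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{
            "role": "user",
            "content": "Is every clause in the amendment consistent with "
                       "the master agreement? Quote any conflicting passages.",
        }],
    )
    return response.content[0].text
```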
3. Customer-history summaries with no chunking
For B2B customer-success use cases ("give me a brief on this customer for the meeting in 20 minutes"), we cache the entire customer history (every support ticket, transcripts of every sales call, the last 12 months of usage logs) and let the model produce the brief from the whole thing.
The brief quality is materially different from chunked retrieval. The model picks themes a chunk-retrieval pipeline misses because the themes are subtle and only show up across the full corpus. It's the workload where we've heard the most "wait, that's a good catch" from clients.
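One wrinkle worth sketching: the "meeting in 20 minutes" flow works best when the cache is already warm. A hedged sketch, reusing the placeholder model ID from above; cache lifetimes vary by provider and tier, so check that yours spans the gap between the two calls.

```python
import anthropic

client = anthropic.Anthropic()
MODEL_ID = "claude-opus-4-7"  # hypothetical ID; use your long-context model

def cached_history(history_text: str) -> list[dict]:
    # The block must be byte-identical across calls to hit the cache.
    return [{
        "type": "text",
        "text": history_text,
        "cache_control": {"type": "ephemeral"},
    }]

def warm_cache(history_text: str) -> None:
    # Tiny request whose only job is writing the big prefix to cache.
    client.messages.create(
        model=MODEL_ID, max_tokens=1,
        system=cached_history(history_text),
        messages=[{"role": "user", "content": "ok"}],
    )

def customer_brief(history_text: str) -> str:
    response = client.messages.create(
        model=MODEL_ID, max_tokens=1_500,  # briefs are short; cap hard
        system=cached_history(history_text),
        messages=[{"role": "user", "content":
                   "Write a one-page brief for the upcoming meeting: "
                   "key themes, open issues, usage trends."}],
    )
    return response.content[0].text
```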
4. Long-running agentic sessions
Agents that need to remember a long working session — code review across a full PR, a multi-hour debugging session, a multi-step research task — benefit from not having to summarise and reload context every N turns. The 1M window swallows the session.
Pair this with prompt caching and the cost stays sane even on sessions that hit several hundred thousand tokens of accumulated context. We've seen Claude Code sessions in the 400K-token range that produced cleanly correct results because the agent never lost the thread.
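The caching pattern for a growing session is slightly different: the cache breakpoint has to advance with the transcript. A minimal sketch (tool handling elided, MODEL_ID a placeholder as above):

```python
import anthropic

client = anthropic.Anthropic()
MODEL_ID = "claude-opus-4-7"  # hypothetical ID; use your long-context model
history: list[dict] = []

def take_turn(user_text: str) -> str:
    # Move the cache breakpoint to the newest block so each call reuses
    # the accumulated transcript at cached-input rates.
    for message in history:
        for block in message["content"]:
            block.pop("cache_control", None)
    history.append({"role": "user", "content": [{
        "type": "text",
        "text": user_text,
        "cache_control": {"type": "ephemeral"},
    }]})
    response = client.messages.create(
        model=MODEL_ID, max_tokens=4_000, messages=history)
    reply = response.content[0].text
    history.append({"role": "assistant",
                    "content": [{"type": "text", "text": reply}]})
    return reply
```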
Three workloads where 1M didn't help
1. High-volume classification
Email classification, ticket triage, content moderation. The input on each call is small (under 2K tokens). The 1M window is irrelevant. Haiku 4.5 is the right model here, not Opus. Don't pay Opus rates for problems Haiku solves. We covered this in our cohort cost writeup, profile C.
2. Real-time chat with short turns
Customer-facing chat agents where each turn is short and the conversation accumulates slowly. 1M is overkill. The conversation will end before it gets close. Sonnet 4.6 wins on cost-per-turn and latency for this shape. Opus's 1M strength is wasted.
3. Anything where the input is genuinely small
The trap is "we have 1M, so let's stuff everything." More context isn't free even with caching, and quality on diffuse input is meaningfully worse than on curated input. If your task fits in 30K tokens of well-curated context, putting 300K in there will hurt quality, not help it.
The cost discipline that makes 1M usable
Pricing on Opus 4.7 is roughly $5/Mtok in, $25/Mtok out. A million tokens of input at full price is $5 every call. That math doesn't work. Three habits that do, with the arithmetic sketched in code after the list:
- Cache the bulk. Cached input runs at ~10% of list. The 1M-token corpus you need to send once gets cheap on every subsequent call.
- Curate before you cache. Don't cache 1M tokens of stuff that's mostly irrelevant. Pre-filter. The cache is more valuable when the cached content is high signal.
- Cap output, hard. Output rate at $25/Mtok is the bigger cost driver on most workloads. Tight max-tokens. Tight system prompt. Don't let the model wander.
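A back-of-envelope cost model using the rates above ($5/Mtok in, $25/Mtok out, cached input at ~10% of list). Real price sheets may add cache-write surcharges and tiering, so treat this as a floor-check, not an invoice.

```python
IN_RATE, CACHED_RATE, OUT_RATE = 5.00, 0.50, 25.00  # $ per Mtok, per above

def call_cost(input_tok: int, cached_tok: int, output_tok: int) -> float:
    fresh = input_tok - cached_tok  # tokens paying the full input rate
    return (fresh * IN_RATE
            + cached_tok * CACHED_RATE
            + output_tok * OUT_RATE) / 1_000_000

print(call_cost(1_000_000, 0, 2_000))          # 5.05: uncached, painful
print(call_cost(1_000_000, 1_000_000, 2_000))  # 0.55: cached, capped output
```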
The honest summary
1M context is a real capability that changes how four specific workloads should be built. It's not a magic bullet for "RAG is hard, just stuff everything." Most of our production traffic is still on shorter contexts because the jobs don't need more. When the job genuinely benefits from whole-corpus reasoning, the lift is large enough to change the design — but you have to know which job that is.
For the broader pattern catalogue, long-context patterns that replaced our vector DB covers the four shapes we use. For routing decisions across the Claude family, see our Sonnet 4.6 vs Haiku 4.5 piece.