PLAYBOOK

Migrating from OpenAI to Claude: 5 gotchas

Five concrete things that break on every OpenAI-to-Claude migration we've run. The fixes, the eval rebuild, and the cost delta that clients should actually expect — which usually surprises them in both directions.

9 MIN READ · UPDATED 2026-04-06 · BY PINTOED AI STUDIO

Why teams migrate

The reasons we hear most: stronger long-context performance, better tool-use planning on agent loops, prompt-caching economics that flip a workload's cost profile, internal adoption of Claude Code and a desire to run the same model in the product, and, increasingly, a procurement preference for multi-vendor AI infra.

The reasons not to: the workload is genuinely better-served by GPT, the team has heavy investment in OpenAI-specific patterns (Assistants API, function-calling shaped a particular way), or cost on the existing path is acceptable and the migration risk isn't worth it. We've recommended both directions.

For the broader head-to-head on production workloads, see Claude vs ChatGPT for production agents. This piece is the actual migration playbook.

Gotcha 1: Function calling vs tool use

OpenAI's function-calling API and Claude's tool-use API look similar but have meaningful differences. The biggest one: how the model behaves when it receives tool results.

What breaks: the verbatim port of the OpenAI function-calling pattern produces a Claude agent that's noticeably more verbose, double-confirms more often, or stops calling tools when it should chain.

The fix: rewrite the system prompt for Claude's conventions — be explicit about when to call tools versus when to ask, write tool descriptions that carry the behavioural guidance, and shape the assistant's "thinking" guidance to Claude's actual style. The tool definitions usually need only light reformatting; the prompt that wraps them needs more work than most teams expect.
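As a sketch of the reshaping, assuming the current Anthropic Python SDK; the tool name, model string, and prompt wording here are illustrative, not a recommended setup:

```python
import anthropic

client = anthropic.Anthropic()

# OpenAI's {"type": "function", "function": {...}} wrapper goes away and
# "parameters" becomes "input_schema"; the behavioural guidance moves into
# the tool description and the system prompt.
tools = [{
    "name": "get_order_status",  # hypothetical tool, for illustration only
    "description": ("Look up the current status of an order by its ID. "
                    "Call this whenever the user asks about an order; "
                    "do not ask for confirmation first."),
    "input_schema": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}]

messages = [{"role": "user", "content": "Where is order 812?"}]
response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model string
    max_tokens=1024,
    system=("You are a support agent. Call tools directly when a request needs "
            "data; only ask a clarifying question if a required argument is missing."),
    tools=tools,
    messages=messages,
)

# Tool results go back inside a user-role turn as a tool_result block,
# not as a separate role="tool" message like the Chat Completions API.
tool_use = next(block for block in response.content if block.type == "tool_use")
messages += [
    {"role": "assistant", "content": response.content},
    {"role": "user", "content": [{
        "type": "tool_result",
        "tool_use_id": tool_use.id,
        "content": "shipped 2026-04-02, arriving Thursday",
    }]},
]
followup = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    tools=tools,
    messages=messages,
)
```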

Gotcha 2: Prompt caching is not a drop-in

OpenAI auto-caches repeated prompt prefixes; Claude requires explicit cache_control markers on input blocks. If you port a GPT-4 workload to Sonnet by string-swapping the model name, you'll pay Claude's full input rate on every call and get a much higher bill than expected. The default outcome of a naïve port is "Claude is more expensive."

The fix: place cache breakpoints at the right semantic boundaries — typically end of system prompt, end of tool definitions, end of static context. Done right, you'll see ~70-90% cache hit rates on warm conversations and the bill drops below the original OpenAI cost in most cases. Done wrong, it's a 3x cost regression.
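A minimal sketch of where those breakpoints usually land, assuming the Anthropic Python SDK's cache_control blocks; the constants are placeholders for your own prompt and context:

```python
import anthropic

client = anthropic.Anthropic()

LONG_SYSTEM_PROMPT = "..."  # the full system prompt, identical on every call
STATIC_CONTEXT = "..."      # docs, policies, few-shot examples, also identical

response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model string
    max_tokens=1024,
    tools=[
        # ...all tool definitions, with the breakpoint on the last one...
        {
            "name": "search_kb",  # hypothetical tool
            "description": "Search the internal knowledge base.",
            "input_schema": {"type": "object",
                             "properties": {"query": {"type": "string"}},
                             "required": ["query"]},
            # Breakpoint 1: caches everything up to the end of the tool definitions.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    system=[
        {"type": "text", "text": LONG_SYSTEM_PROMPT},
        {
            "type": "text",
            "text": STATIC_CONTEXT,
            # Breakpoint 2: end of the static prefix. Only what comes after this
            # point is billed at the full input rate once the cache is warm.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "latest user message"}],
)
```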

Our 71% bill-cut case study covers the cache-pattern shape in detail.

Gotcha 3: Output format guarantees differ

OpenAI's structured-output mode enforces JSON schema strictly. Claude's structured output is real but the enforcement profile is different — it's stronger on certain patterns, weaker on others, and the failure modes when the model can't conform are different.

What breaks: code that assumes a parse always succeeds. We've seen production crashes from "the model returned a polite explanation instead of JSON" because the original code had no fallback path. OpenAI's strict mode had masked the need for one.

The fix: add a parse-or-retry layer with a JSON-only system prompt as the retry. Add a structured-output eval case for "did the response parse cleanly" so regressions get caught in CI. Most teams discover their parser was already flaky on OpenAI; the migration just made it visible.
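One way that layer can look, sketched against the Anthropic Python SDK; the JSON-only retry prompt and the single-retry policy are assumptions to adapt, not a fixed recipe:

```python
import json
import anthropic

client = anthropic.Anthropic()

JSON_ONLY_SYSTEM = ("Return ONLY a valid JSON object that matches the requested "
                    "schema. No prose, no markdown fences, no explanation.")

def parse_or_retry(system: str, messages: list, model: str = "claude-sonnet-4-5") -> dict:
    """Call the model, try to parse the reply as JSON, and retry once
    with a stricter JSON-only system prompt before giving up."""
    for attempt in range(2):
        response = client.messages.create(
            model=model,
            max_tokens=1024,
            system=system if attempt == 0 else JSON_ONLY_SYSTEM,
            messages=messages,
        )
        text = response.content[0].text.strip()
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            continue  # the retry gets the stricter system prompt
    raise ValueError("model did not return parseable JSON after retry")
```

The CI eval case for "did the response parse cleanly" then reduces to asserting that this function returns instead of raising.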

Gotcha 4: Refusal behaviour changes

The two model families refuse different content for different reasons with different surface behaviours. A workload that ran smoothly on GPT may hit unexpected refusals on Claude (and vice versa) — usually around legal-adjacent content, financial advice, medical-adjacent content, or anything the original prompt was leaning on the model to "play it safe" by inferring.

The fix: Claude responds well to explicit framing. Adding a clear "you are operating in [context], the user is [role], your job is to [scope]" preamble resolves most of the new refusals. Generally Claude is more comfortable when it understands why the answer is OK to give, less comfortable with implicit "trust me" framings.
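One shape that preamble can take; the slot values below are invented, and the wording is a starting point rather than a canonical template:

```python
# Hypothetical framing preamble; fill the slots with the real deployment context.
PREAMBLE = ("You are operating inside {context}. The user is {role}. "
            "Your job is to {scope}. Answering within that scope is "
            "expected and appropriate.")

TASK_INSTRUCTIONS = "..."  # the rest of the existing system prompt

system_prompt = PREAMBLE.format(
    context="an internal claims-review tool",
    role="a licensed claims adjuster",
    scope="summarise the policy clauses they cite, without giving legal advice",
) + "\n\n" + TASK_INSTRUCTIONS
```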

Gotcha 5: The eval suite needs a rebuild, not a port

Teams assume their existing eval suite will tell them whether the migration is good. It often won't: the cases were written against GPT's quirks and don't always exercise Claude's. The first run typically produces a confidence-shaking drop in pass rate, half of it real (regressions) and half fake (evals coupled to GPT-shaped output).

The fix: rebuild the eval suite as part of the migration. Drop case-by-case literal output matches; rewrite against semantic criteria. We covered the shape we use in the minimum viable eval — that style ports across providers cleanly because the criteria are about correctness, not phrasing.
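As an illustration of the difference, here is a single eval case rewritten against semantic criteria rather than literal output; the field names and expected values are a made-up fixture:

```python
import json

def eval_refund_summary(response_text: str) -> dict:
    """Score one case on what the answer must contain, not how it is phrased,
    so the same case runs unchanged against GPT or Claude output."""
    # Criterion 1: the output parses at all (also catches Gotcha 3 regressions in CI).
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        return {"parses": False, "has_fields": False, "amount_correct": False}

    return {
        "parses": True,
        # Criterion 2: required fields present, whatever the ordering or extra keys.
        "has_fields": {"amount", "currency", "reason"} <= payload.keys(),
        # Criterion 3: a factual constraint from the fixture, not a phrasing match.
        "amount_correct": payload.get("amount") == 42.50,
    }
```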

The migration sequence we run

The week-by-week shape we use on a real engagement:

  1. Week 1. Map every model call, every prompt, every tool definition. Build the cost-comparison spreadsheet on real usage data.
  2. Week 1-2. Rebuild the eval suite against semantic criteria. Get baseline pass rate on the existing OpenAI stack.
  3. Week 2. Port the prompts to Claude conventions. Add cache breakpoints. Fix the parse-or-retry layer.
  4. Week 3. Shadow-eval on Claude — run both side-by-side on the same inputs and compare outputs (a sketch follows this list). Calibrate the eval bar.
  5. Week 3-4. Cut a percentage of traffic to Claude (we usually start at 10%). Watch metrics. Ramp.
  6. Week 4-5. Full cut-over with a feature flag back to OpenAI for one release cycle. Monitor production metrics.
  7. Week 5+. Remove the flag, decommission the OpenAI integration if no use case retains it.
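
The shadow-eval in step 4, sketched with both Python SDKs; the model strings are illustrative, and each case is assumed to carry its input, the original prompt, the ported prompt, and the semantic scorer from the rebuilt eval suite:

```python
import anthropic
import openai

anthropic_client = anthropic.Anthropic()
openai_client = openai.OpenAI()

def shadow_eval(case: dict) -> dict:
    """Run one case against both stacks on the same input and score both
    with the same semantic criteria, so real regressions and eval-coupling
    artefacts show up side by side."""
    gpt_text = openai_client.chat.completions.create(
        model="gpt-4o",  # whatever the incumbent model actually is
        messages=[{"role": "system", "content": case["system_openai"]},
                  {"role": "user", "content": case["input"]}],
    ).choices[0].message.content

    claude_text = anthropic_client.messages.create(
        model="claude-sonnet-4-5",  # illustrative
        max_tokens=1024,
        system=case["system_claude"],  # the ported prompt, not a string swap
        messages=[{"role": "user", "content": case["input"]}],
    ).content[0].text

    return {"case": case["name"],
            "gpt": case["score"](gpt_text),
            "claude": case["score"](claude_text)}
```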

Total engineering time on a typical migration: 60–120 hours, depending on workload complexity. Cost change after stabilisation: usually -20% to -50% for chat-shaped workloads (caching wins), flat to +15% for high-volume Haiku-eligible classification (Haiku-class isn't always cheaper than gpt-4o-mini). See the cohort numbers in our 10 Claude bills piece for shape.

What we tell clients up front

Three things, every time:

The summary

OpenAI-to-Claude migration is not a string swap. It's a 4-6 week engagement that touches prompts, tool definitions, caching, parsing, eval, and refusal handling. Done well, most workloads come out cheaper, more reliable, and easier to evolve. Done as a string swap, you get the model's worst behaviours and miss its best.

For the underlying decision of whether to migrate at all, see Claude vs ChatGPT for production agents. For the model-routing question after you're on Claude, see Sonnet 4.6 vs Haiku 4.5.

Considering the migration? We do this work most weeks.

BOOK A SCOPING CALL → SEE SERVICES →