COMPUTE

Replicate

Fastest way to get an open-source model running as an API. Thousands of community-published models behind a consistent REST shape, per-second billing, and a genuinely pleasant packaging story via Cog. Expensive at scale.

RATING · 8.3 / 10 PRICING · PER-SECOND COMPUTE · CPU FROM $0.36/HR UPDATED · 2026-04-23
TRY REPLICATE → ESTIMATE MY SPEND → FAQ →

Estimate your monthly spend

INTERACTIVE · LIVE · VERIFIED RATES

Replicate bills per second, not per hour. We converted the per-second rate to hourly for readability. 720 hours is 24/7 uptime — rare on Replicate unless you're running a dedicated deployment. Most per-request workloads land between 10 and 80 hours of actual compute time per month.

ESTIMATED MONTHLY SPEND
$504
USD / MONTH

Active compute only · setup and idle time on private deployments bill separately.
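The conversion the calculator above performs is simple enough to sketch. This is a hypothetical re-implementation: the per-second rates are the prices listed later in this review, and the function and dictionary names are ours, not Replicate's.

```python
# Sketch of the spend calculator. Rates are the per-second prices
# from this review's pricing section; names here are our own.
RATES_PER_SEC = {
    "cpu": 0.000100,        # $0.36/hr
    "t4": 0.000225,         # $0.81/hr
    "l40s": 0.000975,       # $3.51/hr
    "a100-80gb": 0.001400,  # $5.04/hr
    "h100": 0.001525,       # $5.49/hr
}

def monthly_spend(hardware: str, compute_hours: float) -> float:
    """Estimated monthly spend for active compute only."""
    per_hour = RATES_PER_SEC[hardware] * 3600
    return round(per_hour * compute_hours, 2)

# 100 hours of actual A100 compute per month yields the $504 figure above
print(monthly_spend("a100-80gb", 100))  # 504.0
```

Note that "compute hours" is accumulated execution time, not wall-clock uptime, which is why 100 hours can represent a whole month of spiky real traffic.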

START ON REPLICATE →

BEST FOR

Prototyping, quick swaps between open-source models, per-request inference without managing infrastructure.

NOT FOR

Cost-sensitive high-volume inference, workloads needing guaranteed dedicated capacity, regulated data.

PRICING

Per-second compute · CPU $0.36/hr · T4 $0.81/hr · L40S $3.51/hr · A100 80GB $5.04/hr · H100 $5.49/hr.

ALTERNATIVES

RunPod Serverless (cheaper), Modal (Python-native), Hugging Face Inference (model-first), self-hosted.

What it is

Replicate is the shortest path between "I want to try this open-source model" and "I have it running behind an HTTP API". The product is conceptually small — you find a model in the library, hit it with a POST request, and it returns predictions — but the execution is unusually well-considered. Every model on the platform exposes the same request shape, the same authentication, the same webhook contract, and the same streaming semantics. Swapping SDXL for Flux for a 70B LLaMA variant is, from the client's perspective, a config change.
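That uniform shape can be sketched with nothing but the standard library. The endpoint and field names below follow Replicate's public REST docs as we understand them; verify them (and the auth header scheme) against the current API reference before relying on this.

```python
import json

# Illustrative sketch of the uniform prediction request. Field names
# are based on Replicate's public REST docs; double-check them against
# the current API reference.
API_URL = "https://api.replicate.com/v1/predictions"

def build_prediction_request(version: str, model_input: dict, token: str):
    """Return (url, headers, body). Only `version` and `model_input`
    change when you swap one model for another."""
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"version": version, "input": model_input})
    return API_URL, headers, body

# Swapping SDXL for an LLM is a different version hash plus input schema:
url, headers, body = build_prediction_request(
    "some-sdxl-version-hash", {"prompt": "a lighthouse at dusk"}, "r8_...")
print(json.loads(body)["input"]["prompt"])  # a lighthouse at dusk
```

The version hash and token above are placeholders; the point is that the envelope around the `input` object never changes between models.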

The company's core engineering bet is Cog, an open-source container format that packages models with their Python dependencies, weights, and a typed input schema. Authors push a Cog image; Replicate hosts and schedules it; consumers hit it via the same API they'd use for any other model on the platform. It's the closest thing the open-source inference world has to a genuine standard, and the fact that it works consistently across thousands of models is the quiet miracle of the platform.

The catalog is the second half of the value. Replicate hosts tens of thousands of community-published models — the current state-of-the-art image generators, a long tail of audio and video models, LLMs at various sizes, vision-language models, specialized fine-tunes of everything. Most are one line of code to call. The practical effect is that the platform functions as a generalized open-source model API, with an ergonomics layer that makes exploratory work fast in a way that nothing else in the category matches.

Positioning-wise, Replicate sits between the GPU rental providers (RunPod, Vast, Lambda) and the managed inference APIs (OpenAI, Anthropic, Gemini). It isn't selling you compute — it's selling you abstracted access to open-source models running on compute. If you want to rent an H100 and control everything, you go to RunPod. If you want a model behind an API without thinking about GPUs, you come here.

Billing follows the abstraction. Rather than reserving a GPU for an hour and paying the hourly rate regardless of utilization, Replicate charges per second of actual execution. A prediction that takes 4.2 seconds costs 4.2 seconds of the underlying hardware rate. For inference workloads where each call is short and request volumes are spiky, this is the right shape — and it's the reason prototyping on Replicate feels essentially free until you actually start shipping.
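To see why the billing shape matters, price the same spiky day of traffic both ways. The rates are the ones listed in this review; the call counts are invented for illustration.

```python
# Compare per-second billing vs renting the GPU by the hour.
# Rates: A100 80GB at $5.04/hr, i.e. $0.0014/sec (from this review).
RATE_PER_SEC = 0.0014
HOURLY_RENTAL = 5.04

def per_second_cost(num_calls: int, avg_sec: float) -> float:
    """Pay only for execution time, summed across calls."""
    return num_calls * avg_sec * RATE_PER_SEC

def hourly_rental_cost(wall_clock_hours: float) -> float:
    """Pay for the whole rental window, idle or not."""
    return wall_clock_hours * HOURLY_RENTAL

# 500 spiky predictions at 4.2 s each, spread across a full day:
print(round(per_second_cost(500, 4.2), 2))   # 2.94
print(round(hourly_rental_cost(24), 2))      # 120.96
```

At low, bursty utilization the per-second lane wins by a wide margin; the crossover comes only when the GPU would be busy most of the rental window.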

What we tested

Based on sustained use across client builds and internal experiments, we've pushed Replicate across the surface area the platform is designed for. We've called SDXL, Flux, and half a dozen other image models at production volumes during prototype phases; we've run LLM inference against 13B-class and 70B-class open-weights models; we've pushed fine-tuning jobs through the LoRA pipelines; we've packaged and published our own Cog models for client integrations; and we've deliberately stress-tested cold starts on less-trafficked community models to see what the worst case actually looks like.

Hardware coverage spans the full lineup: CPU for the odd non-GPU workload, T4 for light inference, L40S as the mid-tier workhorse, A100 80GB for serious generative work, and H100 for the newest large-context LLMs. We've compared warm-path latency, first-request cold-start times, per-second-billing variance, and the failure behavior when a model is in some quasi-broken state because nobody's called it for a week.

On the deployment side we've exercised both semantics that matter: public model calls (billed only for active processing, shared capacity, cold starts possible) and private deployments with reserved capacity (billed for setup, idle, and active uptime on dedicated instances, predictable latency, higher cost). The gap between these two modes is larger than most new users expect, and the choice between them is where most Replicate cost surprises originate.

None of what follows is a formal benchmark. The open-source inference category has plenty of leaderboards. What we can offer is the texture of building real products on Replicate in 2025–2026, from "type the npm command and we're calling SDXL within five minutes" all the way through "the monthly bill is finally real and we need to talk about migrating."

Pricing, in detail

VERIFIED FROM REPLICATE.COM · 2026-04
CPU
$0.36 / HR

For non-GPU workloads or tiny models. $0.000100/sec. Rarely the right pick but useful for orchestration.

  • 4x CPU, 8GB RAM
  • Per-second billing
  • CPU Small tier at $0.09/hr also available
Nvidia T4 · 16GB
$0.81 / HR

Entry GPU. $0.000225/sec. Good enough for small vision models, Whisper, light LLMs.

  • 16GB VRAM, 16GB RAM
  • Cheapest GPU tier on platform
  • Watch cold starts on large weights
Nvidia L40S · 48GB
$3.51 / HR

Mid-tier workhorse at $0.000975/sec. Strong for SDXL, Flux, 13B–34B inference.

  • 48GB VRAM, 65GB RAM
  • Best $/VRAM in lineup
  • Multi-GPU available on committed spend
H100 · 80GB
$5.49 / HR

Top-tier throughput at $0.001525/sec. Worth it when latency or FP8 Transformer Engine gains matter.

  • FP8 + Transformer Engine
  • Multi-GPU H100 on committed spend
  • Peak of the single-GPU price curve
PUBLIC MODELS
PAY PER CALL

Pre-published community models. Billed only for processing time, or per input/output token for some LLMs.

  • "Free to try" for initial exploration
  • No setup or idle charges
  • Cold starts possible on cold workers
BILLING SEMANTICS · READ THIS

Public models bill for active processing only. Private deployments bill for setup + idle + active — reserved capacity is always-on. Fast-booting fine-tunes bill only for active processing. Know which you're running.

SEE FULL PRICING →

What's good

The single biggest reason to use Replicate is time-to-API. From "I've never heard of this model" to "my app is calling it over HTTP" is routinely under five minutes. You find the model page, click the "HTTP" or language SDK tab, copy the snippet, paste your API token, send the request. It works. There is no other platform in the category where this flow is as consistent, and that consistency is what makes the rest of the product valuable.

The consistent API shape across radically different models is the second compounding win. SDXL, Flux, MusicGen, Whisper, Llama, Qwen, SAM, a vision-language model, an obscure research checkpoint someone pushed last week — they all respond to the same prediction endpoint, with the same polling or streaming semantics, and the same webhook contract. Your integration code for model A ports to model B with a change of model slug and input schema. For teams that need to A/B different models during a prototype phase, this is extraordinary leverage.
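The shared polling contract can be sketched independently of any particular model. The status names follow Replicate's documented prediction lifecycle; `fetch_status` is our stand-in for a GET on the prediction URL.

```python
import time

# Sketch of the polling contract shared by every model on the platform.
# Status names follow Replicate's documented prediction lifecycle.
TERMINAL = {"succeeded", "failed", "canceled"}

def wait_for_prediction(fetch_status, interval_sec=1.0, max_polls=300):
    """Poll until the prediction reaches a terminal state."""
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL:
            return status
        time.sleep(interval_sec)
    raise TimeoutError("prediction did not finish in time")

# Identical loop whether the model is SDXL, Whisper, or a 70B LLM:
fake_statuses = iter(["starting", "processing", "succeeded"])
print(wait_for_prediction(lambda: next(fake_statuses), interval_sec=0))
# succeeded
```

In practice you would use webhooks or the SDK's blocking helpers instead of hand-rolling this loop, but the portability argument is exactly this: one client, many models.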

Cog is a genuinely good developer experience. Writing a cog.yaml, defining a predict.py with typed inputs, and running cog push is noticeably less painful than rolling a custom Docker image plus FastAPI plus a model loader plus input validation. The typing flows through to the model's playground page and the OpenAPI schema — you get a usable web UI and a typed client SDK for free. For a solo developer packaging a research checkpoint, or a team exposing an internal model, Cog saves a meaningful amount of undifferentiated infrastructure work.
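For a feel of the format, here is a minimal hypothetical cog.yaml. The field names follow Cog's documented schema; the package list and versions are invented for illustration.

```yaml
# Hypothetical cog.yaml for a small image model. Field names follow
# Cog's documented schema; package versions here are illustrative.
build:
  gpu: true
  python_version: "3.11"
  python_packages:
    - "torch==2.1.0"
    - "diffusers==0.24.0"
predict: "predict.py:Predictor"
```

The `predict` line points at a `Predictor` class in predict.py, whose typed inputs are what become the playground UI and the OpenAPI schema.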

Per-second billing matches the inference workload shape in a way that feels honest. If your model's prediction takes 3.4 seconds, you pay for 3.4 seconds. There's no per-minute rounding, no hourly minimum, no reserved capacity ticking while you sleep — unless you explicitly opt into reserved deployments. For bursty traffic or exploratory work, this is the right billing shape, and it's why the first-month-of-prototyping bill on Replicate is usually a pleasant surprise rather than a nasty one.

Scheduled deployments with reserved capacity are the escape hatch for production workloads that outgrow on-demand. You pin a specific hardware tier, set minimum replicas, and pay for the reserved capacity regardless of utilization — in exchange, cold starts disappear and latency becomes predictable. It's the same mental model as any cloud's reserved instances, adapted to model inference, and it's the right answer for hot paths that are mature enough to justify steady-state spend.

Where Replicate earns its keep

Users report that Replicate feels more like a developer tool than an inference cloud — which is both the compliment and the warning. The ergonomics are the best in the category; the steady-state economics are the worst.

The community catalog is its own moat. Thousands of models, most with working demos, often with the author still maintaining the container — the surface area is large enough that for almost any "can we try X?" question, the answer is "yes, in about ten minutes." We default to Replicate for the exploratory phase of every generative-AI client project, and we've never regretted it once.

Pros & cons

OUR HONEST TAKE

WHAT WORKS

  • Shortest time-to-API in the category — call an open-source model in five minutes.
  • Consistent REST shape across thousands of wildly different models.
  • Cog packaging is genuinely the best open-source model-container format.
  • Per-second billing fits the spiky shape of real inference traffic.
  • First-class SDKs for Python, JS, Go, plus a usable raw HTTP surface.
  • Webhooks and streaming work consistently — no per-model integration work.
  • Scheduled deployments give you a predictable-latency escape hatch when ready.

WHAT DOESN'T

  • Per-second pricing stacks up fast at steady-state volume vs renting your own GPU.
  • Cold starts on less-trafficked models can run 30–120 seconds on the first call.
  • Model quality varies wildly between community-published models — trust carefully.
  • Training-from-scratch is noticeably cheaper on RunPod or Modal.
  • Private deployments bill for idle and setup time, which surprises new users.
  • No SOC 2 / HIPAA posture strong enough for regulated-data workloads.
  • Versioning discipline is on the model author — some community models break without notice.

Common pitfalls

A handful of failure modes show up repeatedly across the Replicate projects we've seen. None are fatal; all are worth naming upfront.

Running production on free-tier semantics. Public models on Replicate are "free to try" in the sense that initial exploration costs nothing meaningful, and this creates a dangerous muscle memory: teams build a prototype, never really think about the billing model, and then discover the first real-traffic invoice. The fix is to understand the two billing lanes before you ship — public models bill only when they run, private deployments bill always, and picking the wrong lane for your traffic shape can swing your bill by an order of magnitude in either direction.

Not using reserved deployments for hot paths. If you have a customer-facing feature that hits the same model on every request, and the feature is latency-sensitive, the public model call with its cold-start risk is the wrong lane. A scheduled deployment with reserved capacity eliminates the cold-start tail entirely — you pay for an always-warm worker, and in exchange your p99 latency looks like your p50. Teams that leave hot paths on public model calls tend to end up with angry users during traffic spikes, right when cold starts are most likely.

Assuming pricing parity across GPUs. The gap between T4 ($0.81/hr) and H100 ($5.49/hr) is nearly 7×, but the throughput gap on many workloads is much less than that. If a T4 can run your model — even a little slower — your per-call cost is dramatically lower than on an A100 or H100. We've seen teams default to A100s out of inertia and halve their bill by dropping to L40S after a short throughput test. Check what the minimum viable hardware actually is before committing.
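A quick way to sanity-check a hardware choice is to price the call, not the hour. The rates below are the ones listed above; the durations are illustrative assumptions, so run your own throughput test before deciding.

```python
# Per-call cost depends on rate x duration, not rate alone.
# Rates are this review's listed prices; durations are assumptions.
def cost_per_call(rate_per_sec: float, duration_sec: float) -> float:
    return rate_per_sec * duration_sec

t4 = cost_per_call(0.000225, 12.0)    # slower hardware, longer run
h100 = cost_per_call(0.001525, 3.0)   # ~4x faster here, ~6.8x pricier
print(f"T4: ${t4:.4f}  H100: ${h100:.4f}")
# T4: $0.0027  H100: $0.0046
```

With these assumed durations the T4 is still cheaper per call despite being four times slower; the H100 only wins when its speedup exceeds its price multiple, or when latency itself is the product requirement.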

Ignoring cold-start warmup behavior. A community model nobody has called in a week will cold-start slowly — sometimes very slowly — because the container has to pull weights onto a fresh worker. The first user hitting your app through that model gets a 30-to-120-second wait, which is a very bad first impression. If your workload depends on a less-trafficked model, either pre-warm it on a schedule, move to a reserved deployment, or design the client to queue cold-start requests behind a loading state.
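A pre-warm loop is a few lines under our own assumptions. `ping_model` stands in for any cheap request (a tiny prediction, say) that keeps a worker's weights loaded; the scheduling here is deliberately naive.

```python
import time

# Naive pre-warm sketch. `ping_model` is a stand-in for any cheap
# request that keeps a worker warm; real setups would use a cron job
# or scheduler rather than a blocking loop.
def prewarm(ping_model, every_sec: float, rounds: int) -> int:
    """Ping the model on a fixed schedule; returns pings sent."""
    sent = 0
    for _ in range(rounds):
        ping_model()
        sent += 1
        time.sleep(every_sec)
    return sent

calls = []
print(prewarm(lambda: calls.append("ping"), every_sec=0, rounds=3))  # 3
```

Remember the pings themselves bill per second, so weigh the warmup cost against simply moving the model to a reserved deployment.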

Over-using custom pushed models. Cog makes it so easy to push a model that teams sometimes push their own variant of every major model instead of using the public catalog. Each private model means private-deployment billing semantics (setup + idle + active), which is a completely different cost profile than just hitting the public version. Push your own when you need the customization; otherwise use what's already on the platform and avoid the always-on billing surface.

Cost surprises from per-second billing at volume. Per-second sounds cheap until you multiply by request volume. At 100,000 SDXL generations per month averaging 4 seconds each on an A100, you've burned 111 hours — at $5.04/hr that's ~$560, and the same workload self-hosted on RunPod A100 at $2.31/hr would cost less than half that at full utilization. Replicate is not the right place to live for high-volume steady-state inference; it's the right place to start and to keep the long tail.
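That arithmetic is worth having as a reusable check before any migration conversation.

```python
# Reproducing the steady-state math from the paragraph above.
def monthly_inference_cost(calls: int, avg_sec: float, hourly_rate: float):
    """Return (compute hours, dollar cost) for a month of traffic."""
    hours = calls * avg_sec / 3600
    return hours, hours * hourly_rate

hours, replicate_cost = monthly_inference_cost(100_000, 4, 5.04)  # A100
_, runpod_cost = monthly_inference_cost(100_000, 4, 2.31)
print(round(hours, 1), round(replicate_cost), round(runpod_cost))
# 111.1 560 257
```

The comparison assumes full utilization on the rented GPU; at lower utilization the self-hosted number rises toward Replicate's, which is the whole argument for starting here and migrating only once the traffic is real.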

What's actually offered

CAPABILITIES AT A GLANCE
PUBLIC MODEL LIBRARY

Tens of thousands of community-published models — image, video, audio, LLMs, vision, specialty.

PER-SECOND BILLING

Pay only for active compute time on public models. No hourly minimums, no rounding.

COG PACKAGING

Open-source container format for pushing your own model with typed inputs and a free playground.

FINE-TUNING PIPELINES

First-class fine-tuning on Flux, LLaMA, SDXL and others. Fast-booting fine-tunes bill only on active.

DEDICATED DEPLOYMENTS

Reserved-capacity deployments with min replicas, predictable latency, and no cold starts.

CONSISTENT REST API

Same prediction / polling / streaming shape across every model on the platform.

CLIENT SDKS

First-party Python, JavaScript/TypeScript, Go, plus raw cURL — all kept current with API changes.

WEBHOOKS + STREAMING

Server-sent events for streaming LLM output, webhooks for long-running predictions. Works on every model.

SEEN ENOUGH?

You can be calling a state-of-the-art open-source model from your code in five minutes — no GPU, no Docker, no setup.

TRY REPLICATE →

What's not

Replicate is not the cheapest place to run steady-state inference. At full utilization on a single GPU, renting the same hardware on RunPod, Vast, or Lambda Labs is meaningfully less expensive — often by a factor of two or more. The premium pays for the abstraction, the catalog, and the consistency. If the abstraction isn't buying you anything (because you've settled on one model and know the workload inside out), you're overpaying.

Cold starts on public models are real. For popular models like SDXL and Flux, the platform keeps enough warm capacity that cold starts are uncommon. For less-trafficked models — some research checkpoint nobody's hit in a week — the cold start is not a subtle phenomenon. Pulling weights onto a fresh worker routinely takes 30–120 seconds, and your first user pays that cost. Public models are not a substitute for a warm inference fleet.

Model quality variance is a direct consequence of the community catalog being open. Most models are fine; some are excellent; a few are broken, abandoned, or subtly different from what the README claims. Replicate doesn't curate as aggressively as, say, Hugging Face, and it shows. Before building against a community model, hit the playground a few times, read the author's recent commits, and confirm the behavior matches your expectations. The platform can't guarantee quality that the author didn't deliver in the first place.

Training-from-scratch is noticeably cheaper on RunPod or Modal because you're paying for long-running GPU time, which is Replicate's weakest pricing lane. Replicate's fine-tuning pipelines are great, and the fast-booting fine-tune billing is honest — but multi-hour pre-training or large-model fine-tuning is better-served by a provider that charges rental rates rather than inference-abstraction rates.

Compliance posture is table-stakes rather than differentiated. Replicate is fine for most commercial workloads and the standard startup use cases. It is not the right fit for HIPAA-covered data, customer PII subject to strict residency requirements, or enterprise procurement reviews that demand SOC 2 Type II out of the gate. For those, the hyperscalers and enterprise-first inference providers are the conservative pick.

Who should use it

Replicate is the right call if you fit one of four profiles.

The prototyper. You're exploring whether a generative-AI feature is viable. You want to try SDXL, then Flux, then a new research checkpoint that dropped this week, without standing up GPU infrastructure for each. Replicate was built for exactly this motion. The cost of exploration is negligible, the catalog breadth is unmatched, and you can decide in an afternoon whether the feature is worth investing further in.

The app builder shipping an AI feature. You've got a web app or mobile app, you want to add an image-generation feature or a voice transcription pipeline or a small LLM call, and you don't want to manage inference infrastructure. Replicate is the right substrate for this from day one — per-second billing means you pay only what your users actually use, cold starts are manageable with reserved deployments on hot paths, and the SDKs are genuinely good. You'll know when the bill justifies migrating; until then, this is the straight-line path.

The early-stage startup piloting AI features. You're seven engineers, you're trying five different AI directions to see which one sticks, and you can't afford to stand up inference infrastructure for each hypothesis. Replicate lets you run all five experiments in parallel with trivial setup cost; the ones that stick can eventually migrate if the economics demand it. The ones that don't stick cost you a few dollars instead of a sprint of infra work. This is exactly the substrate early-stage product iteration wants.

The team running a long tail of models. If you're exposing ten different models to support ten different features and none of them have the volume to justify their own dedicated deployment, Replicate's per-second public model billing is cheaper than standing up ten small always-warm deployments somewhere else. The long-tail case is a genuine sweet spot for the platform.

Who should not use it: anyone running a single hot inference workload at steady-state volume where self-hosting on a provider like RunPod would be two-to-three times cheaper; anyone moving regulated data; anyone whose business case depends on the cheapest possible marginal cost per prediction. For those cases, Replicate's abstraction isn't buying enough to justify the premium.

Verdict

Replicate is the best place in the category to start and one of the worst places to live permanently at high volume. Both of those facts are features, not bugs. The platform is optimized for time-to-working-API and catalog breadth; the premium you pay is for the abstraction layer that makes those two things possible. If the abstraction is earning its keep — during prototyping, during long-tail feature support, during exploratory product work — the price is fair. If the abstraction isn't earning its keep — during steady-state high-volume inference on a single model you've settled on — you're overpaying by a factor of two or more, and you should migrate.

We rate it 8.3 / 10. It loses points for steady-state economics and cold-start tail latency on less-trafficked models; it gains them decisively for time-to-API, Cog ergonomics, and the sheer breadth of the community catalog. For most teams most of the time, this is where open-source model exploration should start.

If you're on the fence, pick a model you've been curious about, paste the snippet into your terminal, and send a request. Five minutes from now you'll have your answer — and you'll have spent about a cent doing it.

Frequently asked

Should I use Replicate or RunPod?

Replicate if you want the fastest path to an open-source model as an API and you don't want to manage GPU infrastructure. RunPod if you want to rent the GPU directly and run your own container, typically at half the per-hour cost. Rule of thumb: prototype on Replicate, move the hot path to RunPod once the economics justify the operational overhead. Most teams end up using both — Replicate for the long tail and experiments, RunPod for the one or two workloads that have graduated to steady-state production.

How bad are cold starts?

Model-dependent. Popular models (SDXL, Flux, mainstream LLMs) usually have enough warm capacity that first-request latency is a few seconds. Less-trafficked community models can run 30–120 seconds on a genuinely cold worker, because weights have to be pulled to a fresh container. If your workload depends on a less-popular model and you can't tolerate that tail, use a scheduled deployment with reserved capacity — cold starts disappear at the cost of always-on billing.

Is Replicate safe for sensitive data?

Replicate does not train on your inputs and has reasonable data-handling practices for commercial use. It's fine for typical startup and SMB workloads. It's not the right fit for HIPAA-covered data, strict PII residency, or enterprise procurement reviews that require SOC 2 Type II as a hard gate. For those workloads, look at enterprise-first inference providers or self-host on a compliant cloud.

Can I run my own model privately?

Yes. Package it with Cog (cog.yaml + predict.py) and cog push. Private models are visible only to you and accounts you share with. Important billing note: private models deployed as dedicated deployments bill for setup + idle + active time, not just active. If you want inference-only billing on a private model, use a fast-booting fine-tune or call the model on-demand without a reserved deployment.

Is Replicate production-ready?

For consumer-facing features that aren't latency-critical and for internal tooling, yes. For hot paths where first-request latency matters, use a scheduled deployment with min replicas to eliminate cold starts — the extra cost is the price of predictability. For anything needing a signed SLA and named enterprise support, the answer is weaker; Replicate's support and SLAs are not at the level of a hyperscaler. Many teams run Replicate in production successfully; the ones who do it well have thought carefully about which lane (public / reserved) each workload belongs in.

What does fine-tuning cost?

Fine-tuning bills at the same per-second rate as inference on the underlying hardware — a Flux LoRA fine-tune running on an H100 for 20 minutes costs roughly $1.83. A LLaMA fine-tune on an A100 80GB for two hours costs roughly $10. For most LoRA-style fine-tunes the all-in cost is trivial. For full fine-tunes or long pre-training runs on larger models, RunPod or Modal are cheaper because you're paying steady-state rental rates rather than inference-abstraction rates.

Is there a free tier?

Replicate's "free to try" semantics on public models mean initial exploration is essentially free — you can hit a dozen different models with dozens of requests before the bill becomes noticeable. There's no fixed free-tier credit in the classical sense; it's pay-per-use from the first request, but per-second billing at these rates means a few test calls cost fractions of a cent. Don't rely on the free-feeling exploration phase as a production lane — any real volume will show up on the invoice.

DONE READING?

Pick a model, paste the snippet, send the request. Five minutes. That's the whole pitch.

TRY REPLICATE → RE-RUN CALCULATOR →

Got an open-source model you want to try? Five minutes to API.

TRY REPLICATE → OR SCOPE A BUILD WITH US →