The best Python-first serverless GPU experience shipping today.
Pay for the polish. Cold starts in seconds, decorators
instead of YAML, and a set of primitives — Volumes, Dicts, scheduled
jobs, web endpoints — that remove half the orchestration you were
going to write anyway.
Drag the slider for expected active compute hours per month; 720 hours is 24/7.
Modal bills per-second, so what you're billed for is typically much
lower than the wall-clock time your deployment exists. Memory and
egress are billed separately and not included here.
Compute only · memory (~$0.024/GB-hr) and bandwidth not included. $30 free credit applied to new accounts.
Modal is a serverless compute platform built, unusually, around a
single language: Python. You write normal Python functions, decorate
them with @app.function(), and Modal handles everything
downstream — containerization, image building, GPU provisioning,
autoscaling, networking, and billing. There is no YAML. There is no
separate "deploy" artifact. The Python file you ran locally is the
thing that runs in the cloud, on an A100, in a few seconds.
That design choice is the whole product. Modal sits deliberately
between two nearby neighbors:
Replicate, which turns
open-source models into HTTP APIs but assumes you're deploying
someone else's model, and a self-hosted stack on
RunPod or AWS, which gives
you full control but makes you write the plumbing. Modal lets you
bring arbitrary Python — your own fine-tunes, your own pipelines,
your own logic — and still skip most of the infrastructure work.
The primitives are well-chosen. Functions are the
basic unit of compute; decorate a Python function, pick a GPU, and
it runs serverlessly on demand. Volumes are
persistent file storage you mount into containers — the right
abstraction for model weights, training checkpoints, and cached
datasets. Dicts and Queues handle shared state and
work queues between functions without wiring Redis yourself.
Scheduled jobs give you cron-style recurrence as a
decorator (@app.function(schedule=modal.Cron("0 3 * * *"))).
Web endpoints turn any function into an HTTPS API
with @modal.web_endpoint(). Each of these replaces
something you were going to build or glue together anyway.
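To make that concrete, here is a minimal sketch of a few of these primitives composing in one app. The app name, function bodies, and queue contents are ours, not from Modal's docs; the decorators and constructors are the ones described above.

import modal

app = modal.App("primitives-demo")

# Shared state and a work queue between functions, with no Redis to run.
state = modal.Dict.from_name("job-state", create_if_missing=True)
jobs = modal.Queue.from_name("job-queue", create_if_missing=True)


@app.function()
def enqueue(urls: list[str]):
    for url in urls:
        jobs.put(url)


@app.function(gpu="T4")
def worker():
    # Pull one item from the shared queue and record progress in the Dict.
    url = jobs.get()
    state.put(url, "done")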
Pricing is per-second across the board — CPU cores, memory, and
GPUs each bill independently. New accounts get $30 of monthly free
credit, which is enough to actually kick the tires on real
workloads (not a marketing crumb). Committed-spend discounts kick
in for teams running above roughly $2k/month, and there's an
Enterprise tier above that with custom pricing and SOC 2 posture.
Modal doesn't try to be cheap. It tries to be the platform that
makes a Python developer productive on a GPU in twenty minutes.
That's the bet, and it's a bet that mostly pays off — with caveats
we'll get to.
What we tested
In our hands-on use across client builds and internal experiments,
we've pushed Modal through essentially every workload shape it's
pitched for. We've deployed ML inference endpoints backed by
fine-tuned 7B and 13B open-weights models, with both always-warm
replicas and scale-to-zero configurations. We've run multi-hour
fine-tuning jobs on A100 80GB and H100 80GB hardware. We've built
nightly scheduled batch pipelines that process embeddings,
transcriptions, and document extractions on a cron schedule. We've
exposed arbitrary Python functions as HTTPS APIs backing internal
tools. And we've stress-tested Volumes as the persistence layer
under all of it.
Hardware coverage: T4, L4, A10G, A100 40GB, A100 80GB, and H100
80GB. We've used single-GPU and multi-GPU configurations; we've
used CPU-only functions for lightweight work alongside GPU-heavy
ones in the same app. We've hit cold starts on purpose — killing
containers and measuring time-to-first-response — and we've
compared cold-start behavior directly against RunPod Serverless on
matched models.
On the evaluation side, five dimensions mattered. First,
developer experience: from pip install modal
to a deployed function, how much friction? Second,
cold-start latency, which is the axis on which
Modal explicitly competes and the one most visible to end users.
Third, true cost — not just the sticker rate, but
what a real workload actually bills when memory, idle warmth, and
egress are included. Fourth, operational reliability:
how often do deploys fail, how does the platform behave under
burst load, what happens when a function crashes? Fifth,
escape hatches: when Modal's abstractions don't
fit, how hard is it to drop down to a lower level?
None of this is a formal benchmark. The serverless-GPU category
has plenty of those. What we can offer is the texture of running
real Python workloads on Modal across a year of client work and
living with the bills and the 3am pages.
Pricing, in detail
VERIFIED FROM MODAL.COM · 2026-04
CPU CORE
$0.135 / CORE-HR
Base compute for non-GPU work. Data pipelines, web endpoints, orchestration logic.
Per-second billing, no minimum
~$3.24/day for 1 core 24/7
Scale-to-zero by default
T4 · 16GB
$0.59 / HR
Cheapest GPU tier. Great for small models, embeddings, Whisper-class workloads.
16GB VRAM — small-model inference
Strong value for embeddings + ASR
Limited for >7B LLMs
L4 · 24GB
$0.91 / HR
Newer-gen inference card. Better perf-per-watt than T4 and more VRAM headroom.
24GB VRAM for 7B-class models
Lower power draw than T4
Sweet spot for quantized inference
A10G · 24GB
$1.10 / HR
Workhorse for small-to-mid inference and light fine-tuning. Solid $/throughput ratio.
24GB VRAM, proven at scale
Good default for production inference
Cheaper than A100, enough for most 7B-13B
A100 · 80GB
$3.40 / HR
Production default for serious inference and fine-tuning up to 30B-class models.
80GB HBM for long context
Native 30B+ fine-tuning
The "just use this" tier
H100 · 80GB
$5.70 / HR
Top-tier throughput. FP8 + Transformer Engine make wall-clock savings real when time matters.
FP8 + TE for 2-3× throughput
70B fine-tuning with multi-GPU
Best when hours matter more than dollars
WHAT ELSE BILLS · VERIFIED
Memory billed separately at ~$0.024/GB-hr.
New accounts get $30/mo free credit.
Committed-spend discounts available for teams above ~$2k/mo. Enterprise pricing is custom.
All rates shown are Modal's base per-second pricing normalized to per-hour. Regional selection and non-preemptible execution can apply multipliers on top — Modal documents these on their pricing page. Plan for 1.1-1.3× the sticker rate on typical production configurations.
What's good
The single biggest reason to use Modal is the
Python-native decorator DX. You take a function
that already works locally, add a decorator specifying the GPU and
the dependencies, and it runs in the cloud. That's not marketing
copy — it's the actual workflow. No Dockerfile (unless you want
one). No separate YAML. No "compile your requirements into a
production artifact" step. The same .py file is the
thing you develop against and the thing that runs in production.
Once you've felt how much friction that removes, the rest of the
serverless GPU category starts to feel archaic.
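Here is roughly what that workflow looks like end to end. This is a sketch with illustrative names and an elided function body; the image builder, decorators, local entrypoint, and the modal run / modal deploy commands are the real Modal features the paragraph describes.

import modal

# Dependencies live in the image definition, not a Dockerfile.
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch", "transformers"
)

app = modal.App("same-file-demo", image=image)


@app.function(gpu="A100")
def generate(prompt: str) -> str:
    # The same function body you ran locally; only the decorator changed.
    ...


@app.local_entrypoint()
def main():
    # `modal run this_file.py` executes main() on your machine and
    # generate() remotely on an A100; `modal deploy this_file.py` turns
    # the same file into the persistent deployment.
    print(generate.remote("hello"))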
Cold starts are faster than RunPod Serverless on matched workloads.
This is the axis where Modal has quietly pulled ahead of the
cheaper competition. Layered image builds, smart caching of
dependencies, and a container runtime tuned for fast scheduling
mean a 10GB-model function routinely comes up in 5-15 seconds
cold — not the 30-60 seconds we see on equivalent RunPod
Serverless deployments. For latency-sensitive apps, that gap is
the difference between "usable" and "needs always-warm workers."
Volumes are the abstraction nobody else gets right.
Model weights, training checkpoints, cached datasets — the stuff
you want to persist between invocations without paying for S3
glue code. A Modal Volume is literally a mounted directory that
survives function restarts, shares across functions, and syncs
atomically. Compared to wiring up S3 with mount tools, signing
URLs, and managing cache layers manually, this saves days of
work per project. It's a small feature that disproportionately
changes what you'll actually build.
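A sketch of the pattern for the training-checkpoint case mentioned above, with hypothetical names and the training loop elided:

import modal

app = modal.App("checkpoint-demo")
ckpts = modal.Volume.from_name("training-checkpoints", create_if_missing=True)


@app.function(gpu="A100", volumes={"/ckpt": ckpts}, timeout=6 * 3600)
def finetune(run_id: str):
    # Training loop elided. Checkpoints are written as ordinary files
    # under /ckpt, then committed so they outlive this container.
    ...
    ckpts.commit()


@app.function(volumes={"/ckpt": ckpts})
def evaluate(run_id: str):
    # Pick up commits made after this container started, then read the
    # checkpoint as a plain local path.
    ckpts.reload()
    ...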
Scheduled jobs are built in at the primitive level. A nightly
batch is a decorator, not a separate cron service. For teams
who've bolted Celery or a dedicated scheduler onto their stack
just to get periodic runs, this collapses a whole layer of
infrastructure into one line of code. We've moved multiple
clients off of Airflow-for-simple-crons to Modal scheduled
functions and saved measurable ops overhead.
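A nightly pipeline in this style looks something like the following. The schedules are real Modal constructs; the job bodies are placeholders of ours.

import modal

app = modal.App("nightly-batch")


@app.function(schedule=modal.Cron("0 3 * * *"), timeout=2 * 3600)
def nightly_embeddings():
    # Runs at 03:00 UTC every day; no Celery beat, no Airflow scheduler.
    ...


@app.function(schedule=modal.Period(hours=6))
def poll_upstream():
    # modal.Period is the interval-style alternative to cron syntax.
    ...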
Web endpoints are the last magic trick. Annotate a function with
@modal.web_endpoint() and Modal generates an HTTPS
URL backed by the function — autoscaling, with a valid
certificate, no configuration. For internal tools, for demos,
for "glue this ML model to the rest of the stack" use cases,
this removes the need for a separate API layer entirely. Some
of our best production value comes from Modal functions fronted
directly by their own web endpoints.
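For example, a minimal inference endpoint might look like the sketch below. The model logic is elided and the names are ours; the decorator stack is the documented pattern.

import modal

app = modal.App("endpoint-demo")


@app.function(gpu="L4")
@modal.web_endpoint(method="POST")
def classify(item: dict):
    # Once deployed with `modal deploy`, this function gets a public
    # HTTPS URL with a valid certificate and autoscaling behind it.
    text = item.get("text", "")
    # ... model inference elided ...
    return {"input_chars": len(text)}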
Where Modal earns its keep
Decorator-based deploys: Python file today, running on A100 in the same session.
Cold starts routinely 5-15 seconds for 10GB models — fast enough to skip always-warm for many workloads.
Volumes handle model-weight persistence without S3 plumbing.
Scheduled jobs replace a separate cron/Airflow layer with a decorator.
Web endpoints turn functions into APIs without writing FastAPI or standing up a load balancer.
No separate orchestrator needed — one platform covers inference, training, batch, and APIs.
The mental model: every piece of infrastructure you were about to
write — the cache, the scheduler, the API layer, the queue — is
already a decorator. That's the thing Modal gets right that the
rest of the category keeps trying to copy.
The image build system deserves its own note. Modal builds images
in layers that cache aggressively between deploys, which means
iteration on a large ML stack isn't punished with 10-minute
rebuilds every time you change a line of code. The first build
takes real time; every subsequent build is seconds. This sounds
mundane until you've spent a week shipping at Docker speed on a
competitor and then tried Modal for comparison.
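The practical consequence is that layer order matters: put the heavy, rarely-changing installs early and your own fast-moving additions last. A sketch, with package names and versions that are illustrative rather than prescribed:

import modal

# Each chained step is a cached layer. Heavy, stable dependencies go
# first; later steps change more often, so only they rebuild on edits.
image = (
    modal.Image.debian_slim(python_version="3.11")
    .apt_install("git")
    .pip_install("torch==2.3.0", "transformers==4.41.0")
    .pip_install("some-fast-moving-internal-lib")  # hypothetical package
    .env({"HF_HOME": "/cache/hf"})
)

app = modal.App("layered-image-demo", image=image)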
Pros & cons
OUR HONEST TAKE
WHAT WORKS
Decorator-based Python DX is in a class by itself — no YAML, no separate deploy artifact.
Cold starts are meaningfully faster than RunPod Serverless on matched models.
Volumes replace S3+mount glue with a first-class persistence primitive.
Scheduled jobs and web endpoints eliminate whole layers of infrastructure.
Per-second billing across CPU, memory, and GPU means right-sizing is actually cheap.
$30/month free credit is enough to actually prototype real workloads, not a token.
WHAT DOESN'T
Price floor higher than RunPod Community — you pay for the polish.
Python-only. No first-class Node, Go, Rust, or JVM story.
The decorator model creates real lock-in once your app is built around Modal primitives.
Enterprise procurement story is lighter than AWS/GCP — less leverage in big security reviews.
Memory billing is easy to miss in early estimates and produces credit surprises.
Region/non-preemptible multipliers can push real cost 1.1-1.3× above sticker.
Multi-GPU setups are supported but less ergonomic than the single-GPU path.
Common pitfalls
A handful of failure modes come up repeatedly in the Modal
projects we've seen — none of them dealbreakers, all of them
worth naming upfront before your first bill surprises you.
Not using Volumes for model weights. The single
most common mistake on Modal is baking model weights into the
container image instead of mounting them from a Volume. Weights
in the image mean every deploy rebuilds the image, every cold
start re-downloads, and your iteration loop gets noticeably
slower. Weights in a Volume mean the model is already there the
moment a container starts, deploys are instant, and your
10-minute cold-start problem becomes a 10-second one. If you
take one thing from this review: use Volumes for your weights.
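The fix is a one-off seeding function that downloads the weights into the Volume once; after that, every container starts with the files already on disk. A hedged sketch follows; the model ID and the huggingface_hub dependency are our choices, not Modal's.

import modal

app = modal.App("weights-in-volume")
weights = modal.Volume.from_name("model-weights", create_if_missing=True)
image = modal.Image.debian_slim().pip_install("huggingface_hub")


@app.function(image=image, volumes={"/weights": weights}, timeout=3600)
def seed_weights():
    # Run once, e.g. `modal run this_file.py::seed_weights`. Inference
    # functions then mount the same Volume and read from /weights.
    from huggingface_hub import snapshot_download

    snapshot_download(
        "mistralai/Mistral-7B-Instruct-v0.2", local_dir="/weights/mistral-7b"
    )
    weights.commit()  # persist the download for other containers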
Paying for idle workers via over-eager warm settings.
Modal's min_containers and
keep_warm parameters let you keep replicas
always-on to dodge cold starts. These work — and they bill. A
team that sets min_containers=2 on an A100 function
to "just be safe" has committed to roughly $5,000/month for
standby capacity before a single request hits the endpoint. Set
warmth deliberately, not defensively. If cold starts are already
acceptable (5-15 seconds for most models), zero warm workers is
the correct default.
Ignoring memory billing. The sticker rate on
GPU cards doesn't include memory, which is priced separately at
roughly $0.024/GB-hour. A function requesting 32GB of RAM on
an always-warm A100 adds ~$0.77/hr on top of the GPU rate — a
20% increase nobody noticed when they drafted the budget. Check
memory allocation explicitly, and prefer tighter memory limits
during development to catch accidental over-provisioning before
it ships.
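The arithmetic behind both of those numbers is worth writing out once, using the rates from the pricing table above and a ~730-hour month:

# Standby cost for min_containers=2 on an A100 80GB function:
a100_rate = 3.40                                 # $/hr, from the table above
hours_per_month = 730
gpu_standby = 2 * a100_rate * hours_per_month    # ~ $4,964/month, GPU only
# Memory bills separately: 32 GB requested on each warm replica adds:
mem_standby = 2 * 32 * 0.024 * hours_per_month   # ~ $1,121/month
print(gpu_standby, mem_standby)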
Assuming CPU compute is free. CPU cores bill
too — at roughly $0.135/core-hour. Orchestration functions, web
endpoints that receive a lot of idle requests, and scheduled
jobs that poll for work all accumulate CPU-hours in ways that
are invisible until the invoice arrives. Our rule: every
always-on CPU function should justify itself. If it could be a
scheduled run or a webhook, it should be.
Not using committed-spend discounts when spend exceeds
$2k/month. Modal offers material discounts for teams
that commit to spend above roughly $2,000/month. Teams that
hit this threshold organically but never negotiate leave real
money on the table — typically 10-20% of their bill. If your
monthly spend is climbing, email their team. This is table-stakes
procurement for any serious spend level, and a surprising number of
customers miss it.
Building non-trivial apps without Modal-specific
orchestration. Modal's primitives (Dicts, Queues,
chained function calls, .spawn() / .map())
are the right way to structure multi-step ML pipelines on the
platform. Teams that try to bolt on Celery, Airflow, or their
own orchestration layer generally end up fighting Modal rather
than using it. Learn the native primitives; they're better than
what you'd bolt on.
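A small example of the native style, with illustrative names: fan work out with .map(), fire off follow-ups with .spawn(), and skip the external orchestrator entirely.

import modal

app = modal.App("pipeline-demo")


@app.function(gpu="T4")
def embed_chunk(chunk: list[str]) -> int:
    # Embed one chunk of documents; inference elided, return the count.
    return len(chunk)


@app.function()
def notify(message: str):
    print(message)


@app.function()
def run_pipeline(chunks: list[list[str]]):
    # .map() fans chunks out across containers and streams results back.
    total = sum(embed_chunk.map(chunks))
    # .spawn() kicks off follow-up work without blocking this function.
    notify.spawn(f"embedded {total} documents")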
What's actually offered
CAPABILITIES AT A GLANCE
PYTHON-NATIVE SDK
Decorators (@app.function, @modal.web_endpoint) turn local Python into deployed serverless code with zero YAML.
SERVERLESS GPUS
Full range: T4, L4, A10G, A100 40/80GB, H100 80GB. Scale to zero by default; scale up per-request.
Where it falls short
The price floor is the first thing to name. Modal is not cheap
on a raw $/GPU-hour basis. An A100 80GB at ~$3.40/hr is
competitive with AWS on-demand and noticeably above RunPod
Community at ~$2.31/hr. For workloads where compute is the
dominant cost and DX isn't the bottleneck, the savings on a
cheaper provider can be real — especially at 24/7 inference
volume. Modal's pitch is explicitly that the polish is worth
the premium, and for most teams it is, but if you're running
thousands of GPU-hours monthly on a predictable workload the
math can favor a different provider.
Python-only is not a soft constraint. If your stack is Node, Go,
Rust, or JVM, Modal isn't the answer. You can sometimes wrap
subprocess calls or ship a Python harness around non-Python
binaries, but you're fighting the grain. Teams on heterogeneous
stacks usually end up picking Modal for the Python parts and a
different platform (Cloudflare Workers, Fly.io, traditional
cloud) for the rest. That's workable but it's a split stack,
which has its own costs.
The decorator model creates lock-in. Once your app is structured
around @app.function(), Modal Volumes, Modal Dicts,
and Modal scheduled jobs, porting to another platform is a
meaningful rewrite. Modal is honest about this — they don't
pretend the abstractions are portable — but it's worth
acknowledging before you build a year's worth of infrastructure
on top. For strategic workloads, pin yourself to plain Python
functions and treat Modal primitives as deployment glue rather
than architectural commitments.
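In practice that means a two-file split, something like the sketch below (file names and functions are ours): the core module never imports modal, and the Modal file is the only thing you would rewrite on a migration.

# core/inference.py: plain Python, no Modal imports, portable anywhere
def generate_reply(prompt: str) -> str:
    ...  # model loading and inference live here


# modal_app.py: the thin deployment shell, the only Modal-aware file
import modal
from core.inference import generate_reply

app = modal.App("thin-shell")


@app.function(gpu="A10G")
@modal.web_endpoint(method="POST")
def reply(item: dict):
    return {"reply": generate_reply(item["prompt"])}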
Enterprise procurement is lighter than a hyperscaler. Modal has
SOC 2 Type II and enterprise contracts, but it doesn't have
the decade-long legal-and-compliance footprint of AWS or GCP.
For organizations whose procurement departments require
multi-region guarantees, named-DPA terms, and extensive
third-party audits baked in, Modal can clear the bar with
effort but isn't the friction-free pick. This is the standard
startup-vs-hyperscaler tradeoff; Modal is better on this axis
than most of its serverless-GPU peers, but it's not AWS.
Who should use it
Modal is the right call for several specific profiles.
Python-first ML teams shipping inference or training.
If your team lives in PyTorch, Transformers, and pandas, and
you're deploying ML as a core product function, Modal is the
highest-productivity place to do it. The decorator DX matches
how ML engineers already think about code, and the primitives
(Volumes for weights, scheduled jobs for retraining, web
endpoints for model APIs) hit exactly the workflows the job
description already covers. We default to Modal for this
profile and only move off when the bill forces a conversation.
Indie ML engineers and two-person startups. The
$30/month free credit plus the scale-to-zero default mean you
can build a real product without infrastructure cost before the
first user shows up. For the specific pattern of "I have an
open-source model fine-tuned for my niche and I want to ship
it as an API," Modal is the path with the least friction in
the industry. You'll save weeks of plumbing over rolling your
own stack on RunPod.
Data pipelines and scheduled batch. If your
workload is "run this Python process on this cadence and keep
state between runs," Modal is a serious alternative to Airflow
or Prefect — especially once you factor in the cost of keeping
a scheduler running. A nightly embedding pipeline, a weekly
model eval, a continuous ETL job — the decorator + Volume
combination is hard to beat on developer-hours-to-production.
Startups moving off Colab. The common
progression: prototype in a notebook, hit Colab's limits, try
to productionize, drown in Docker and AWS plumbing. Modal is
the graceful next step — you keep your Python, you get real
GPUs, and you get the deployment story without learning a new
paradigm. Many of our clients follow exactly this arc, and
Modal absorbs the complexity that would otherwise block them
for weeks.
Who should not use it: teams whose primary constraint is $/GPU-hour at
24/7 volume (go to RunPod or
Vast.ai); non-Python stacks; teams who
only need "hit this open-source model as an API" with zero
custom code (Replicate is simpler);
and organizations whose procurement process structurally favors
a hyperscaler (AWS SageMaker or Google Vertex AI are probably
already on the approved-vendor list).
Verdict
Modal is the best Python-first serverless GPU experience
shipping in 2026. The decorator DX removes enough friction that
you'll build things you wouldn't have bothered building on a
rougher platform, and the cold-start and Volume primitives
make those things actually work in production. The tradeoff is
transparent: you pay more per GPU-hour than you would on a
community-tier provider, you lock yourself into a Python-only
and Modal-native architecture, and your enterprise procurement
story is lighter than AWS. Each of those is a real cost; none
of them are fatal for most teams.
We rate Modal 8.7 / 10. Take half a point off
if you're cost-dominated at 24/7 scale; add it back if your
bottleneck is ML engineering throughput and you'd trade a
percentage of your compute spend for another shipped model. If
you're on the fence, burn the $30 free credit on a real
workload over a weekend — you'll know by Sunday night.
Frequently asked
How does Modal compare to RunPod Serverless?
Modal wins on DX, cold starts, and integrated primitives (Volumes, scheduled jobs, web endpoints). You ship faster and think about less infrastructure. RunPod Serverless wins on raw $/GPU-hour, especially on Community Cloud, and is more flexible if you need a non-Python runtime. For teams whose bottleneck is engineering velocity, Modal. For teams whose bottleneck is the GPU bill at steady 24/7 volume, RunPod. Most of our clients use both — Modal for new builds, RunPod for high-volume stable workloads.
How worried should I be about lock-in?
Realistically worried, but not paralyzed. The lock-in is real: Modal primitives (Volumes, Dicts, scheduled decorators, web endpoints) don't have direct equivalents elsewhere, so a port is a rewrite of your deployment layer. The mitigation is to keep your business logic in plain Python functions and treat Modal-specific code as a thin deployment shell. Done that way, moving off Modal is a weekend of glue rewrite, not a quarter of rearchitecture.
Can Modal replace my whole ML infrastructure stack?
For a lot of teams, functionally yes. Modal can cover model inference (web endpoints), fine-tuning (long-running functions on H100/A100), scheduled retraining (cron decorators), data pipelines (chained function calls), and model storage (Volumes). What it doesn't replace: a proper experiment-tracking tool (use W&B), a feature store (use whatever your team already has), or your data warehouse. But the compute/orchestration layer? Often Modal alone.
How bad are cold starts in practice?
Much better than RunPod Serverless. With weights in a Volume, a 10GB model typically comes up cold in 5-15 seconds on Modal versus 20-40 on RunPod. Without Volumes (weights baked into the image), expect 30-60 seconds either way. Modal's edge disappears if you misuse the platform; with the recommended patterns, cold starts stop being the thing you design around.
How do I estimate what Modal will actually cost me?
Start with our calculator above for the baseline. Then add three things most teams forget: memory (~$0.024/GB-hr, nontrivial on 32GB+ configs), CPU core time on orchestration functions, and any always-warm replicas (multiply the hourly rate by ~720 to see the 24/7 standby cost). Modal's usage dashboard is honest about what you're spending, so monitor the first week closely and tune from there.
When should I ask about committed-spend discounts?
Practically, once your monthly spend clears around $2,000, it's worth reaching out. Modal will offer commitment-based discounts (typically 10-20%) that trade a spend floor for a lower rate. If your spend is growing and stable, take them; if you're still in the exploratory phase and might drop spend next month, don't lock in. The conversation is low-friction — their team is responsive.
Can I use Modal if my stack isn't all Python?
Partially. Modal functions are Python, but a Python function can shell out to a binary of your choice, which covers a lot of "my ML core is Python and everything else isn't" cases. If your primary ML inference or training is in Python — even if the rest of your stack is Node or Go — Modal handles the GPU side fine and your main app talks to it over HTTP. If your ML is natively in JVM, Rust, or non-Python, Modal isn't the right platform; look at RunPod or a traditional cloud instead.