COMPUTE

Modal

The best Python-first serverless GPU experience shipping today. Pay for the polish. Cold starts in seconds, decorators instead of YAML, and a set of primitives — Volumes, Dicts, scheduled jobs, web endpoints — that remove half the orchestration you were going to write anyway.

RATING · 8.7 / 10 PRICING · PER-SECOND · GPUS FROM $0.59/HR UPDATED · 2026-04-23
TRY MODAL → FAQ →


BEST FOR

Python-first ML inference, fine-tuning, data pipelines, scheduled batch. Teams that want DX over raw $/hour.

NOT FOR

Non-Python stacks (Node/Go/Rust), teams chasing the cheapest raw GPU rates, workloads needing hyperscaler procurement posture.

PRICING

Per-second. CPU core ~$0.135/hr · T4 ~$0.59/hr · L4 ~$0.91/hr · A10G ~$1.10/hr · A100 80GB ~$3.40/hr · H100 80GB ~$5.70/hr. $30/mo free credit.

ALTERNATIVES

RunPod Serverless (cheaper, less polish), Replicate (model-as-API only), AWS Lambda+GPU (enterprise default), self-hosted.

What it is

Modal is a serverless compute platform built, unusually, around a single language: Python. You write normal Python functions, decorate them with @app.function(), and Modal handles everything downstream — containerization, image building, GPU provisioning, autoscaling, networking, and billing. There is no YAML. There is no separate "deploy" artifact. The Python file you ran locally is the thing that runs in the cloud, on an A100, in a few seconds.
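To make that concrete, here is roughly what the workflow looks like. This is a minimal sketch, not lifted from Modal's docs; the decorator and image APIs match Modal's published interface as we know it, but names have drifted across versions, so treat it as illustrative.

    import modal

    app = modal.App("hello-gpu")

    # Dependencies live on the image object, not in a Dockerfile.
    image = modal.Image.debian_slim().pip_install("torch")

    @app.function(gpu="A100", image=image)
    def hello() -> str:
        import torch
        # Executes in the cloud on an A100; the return value comes back to you.
        return torch.cuda.get_device_name(0)

    @app.local_entrypoint()
    def main():
        # `modal run hello_gpu.py` runs main() locally and hello() remotely.
        print(hello.remote())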

That design choice is the whole product. Modal sits deliberately between two neighbors: Replicate, which turns open-source models into HTTP APIs but assumes you're deploying someone else's model, and a self-hosted stack on RunPod or AWS, which gives you full control but makes you write the plumbing. Modal lets you bring arbitrary Python — your own fine-tunes, your own pipelines, your own logic — and still skip most of the infrastructure work.

The primitives are well-chosen. Functions are the basic unit of compute; decorate a Python function, pick a GPU, and it runs serverlessly on demand. Volumes are persistent file storage you mount into containers — the right abstraction for model weights, training checkpoints, and cached datasets. Dicts and Queues handle shared state and work queues between functions without wiring Redis yourself. Scheduled jobs give you cron-style recurrence as a decorator (@app.function(schedule=modal.Cron("0 3 * * *"))). Web endpoints turn any function into an HTTPS API with @modal.web_endpoint(). Each of these replaces something you were going to build or glue together anyway.
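A sketch of the shared-state primitives, under the same caveat that names follow Modal's public API as we understand it:

    import modal

    app = modal.App("shared-state")

    # Named, persistent server-side objects; no Redis to stand up or babysit.
    results = modal.Dict.from_name("demo-results", create_if_missing=True)
    jobs = modal.Queue.from_name("demo-jobs", create_if_missing=True)

    @app.function()
    def producer():
        for i in range(10):
            jobs.put(i)

    @app.function()
    def worker():
        item = jobs.get()            # blocks until a job is available
        results[item] = item * item  # visible to every other function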

Pricing is per-second across the board — CPU cores, memory, and GPUs each bill independently. New accounts get $30 of monthly free credit, which is enough to actually kick the tires on real workloads (not a marketing crumb). Committed-spend discounts kick in for teams running above roughly $2k/month, and there's an Enterprise tier above that with custom pricing and SOC 2 posture.

Modal doesn't try to be cheap. It tries to be the platform that makes a Python developer productive on a GPU in twenty minutes. That's the bet, and it's a bet that mostly pays off — with caveats we'll get to.

What we tested

In our hands-on use across client builds and internal experiments, we've pushed Modal through essentially every workload shape it's pitched for. We've deployed ML inference endpoints backed by fine-tuned 7B and 13B open-weights models, with both always-warm replicas and scale-to-zero configurations. We've run multi-hour fine-tuning jobs on A100 80GB and H100 80GB hardware. We've built nightly scheduled batch pipelines that process embeddings, transcriptions, and document extractions on a cron schedule. We've exposed arbitrary Python functions as HTTPS APIs backing internal tools. And we've stress-tested Volumes as the persistence layer under all of it.

Hardware coverage: T4, L4, A10G, A100 40GB, A100 80GB, and H100 80GB. We've used single-GPU and multi-GPU configurations; we've used CPU-only functions for lightweight work alongside GPU-heavy ones in the same app. We've hit cold starts on purpose — killing containers and measuring time-to-first-response — and we've compared cold-start behavior directly against RunPod Serverless on matched models.

On the evaluation side, five dimensions mattered. First, developer experience: from pip install modal to a deployed function, how much friction? Second, cold-start latency, which is the axis on which Modal explicitly competes and the one most visible to end users. Third, true cost — not just the sticker rate, but what a real workload actually bills when memory, idle warmth, and egress are included. Fourth, operational reliability: how often do deploys fail, how does the platform behave under burst load, what happens when a function crashes? Fifth, escape hatches: when Modal's abstractions don't fit, how hard is it to drop down to a lower level?

None of this is a formal benchmark. The serverless-GPU category has plenty of those. What we can offer is the texture of running real Python workloads on Modal across a year of client work and living with the bills and the 3am pages.

Pricing, in detail

VERIFIED FROM MODAL.COM · 2026-04
CPU CORE
$0.135/ CORE-HR

Base compute for non-GPU work. Data pipelines, web endpoints, orchestration logic.

  • Per-second billing, no minimum
  • ~$3.24/day for 1 core 24/7
  • Scale-to-zero by default
T4 · 16GB
$0.59/ HR

Cheapest GPU tier. Great for small models, embeddings, Whisper-class workloads.

  • 16GB VRAM — small-model inference
  • Strong value for embeddings + ASR
  • Limited for >7B LLMs
L4 · 24GB
$0.91/ HR

Newer-gen inference card. Better perf-per-watt than T4 and more VRAM headroom.

  • 24GB VRAM for 7B-class models
  • Lower power draw than T4
  • Sweet spot for quantized inference
A10G · 24GB
$1.10/ HR

Workhorse for small-to-mid inference and light fine-tuning. Solid $/throughput ratio.

  • 24GB VRAM, proven at scale
  • Good default for production inference
  • Cheaper than A100, enough for most 7B-13B
A100 · 80GB
$3.40/ HR

The serious-training tier. Multi-hour fine-tunes and unquantized 13B-class work land here.

  • 80GB VRAM for 13B-class models without quantization
  • The default card for multi-hour fine-tuning
  • Step down to A10G when 24GB is enough
H100 · 80GB
$5.70/ HR

Top-tier throughput. FP8 + Transformer Engine make wall-clock savings real when time matters.

  • FP8 + TE for 2-3× throughput
  • 70B fine-tuning with multi-GPU
  • Best when hours matter more than dollars
WHAT ELSE BILLS · VERIFIED

Memory billed separately at ~$0.024/GB-hr. New accounts get $30/mo free credit. Committed-spend discounts available for teams above ~$2k/mo. Enterprise pricing is custom.

DEPLOY A FUNCTION →

All rates shown are Modal's base per-second pricing normalized to per-hour. Regional selection and non-preemptible execution can apply multipliers on top — Modal documents these on their pricing page. Plan for 1.1-1.3× the sticker rate on typical production configurations.

What's good

The single biggest reason to use Modal is the Python-native decorator DX. You take a function that already works locally, add a decorator specifying the GPU and the dependencies, and it runs in the cloud. That's not marketing copy — it's the actual workflow. No Dockerfile (unless you want one). No separate YAML. No "compile your requirements into a production artifact" step. The same .py file is the thing you develop against and the thing that runs in production. Once you've felt how much friction that removes, the rest of the serverless GPU category starts to feel archaic.

Cold starts are faster than RunPod Serverless on matched workloads. This is the axis where Modal has quietly pulled ahead of the cheaper competition. Layered image builds, smart caching of dependencies, and a container runtime tuned for fast scheduling mean a 10GB-model function routinely comes up in 5-15 seconds cold — not the 30-60 seconds we see on equivalent RunPod Serverless deployments. For latency-sensitive apps, that gap is the difference between "usable" and "needs always-warm workers."

Volumes are the abstraction nobody else gets right. Model weights, training checkpoints, cached datasets — the stuff you want to persist between invocations without paying for S3 glue code. A Modal Volume is literally a mounted directory that survives function restarts, shares across functions, and syncs atomically. Compared to wiring up S3 with mount tools, signing URLs, and managing cache layers manually, this saves days of work per project. It's a small feature that disproportionately changes what you'll actually build.
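The pattern, sketched with a placeholder in place of a real download (the volume name and weights file are hypothetical; the mount and commit() calls are per Modal's docs as we understand them):

    import modal

    app = modal.App("weights-cache")
    vol = modal.Volume.from_name("model-weights", create_if_missing=True)

    @app.function(volumes={"/weights": vol})
    def warm_cache() -> int:
        from pathlib import Path
        ckpt = Path("/weights/model.safetensors")
        if not ckpt.exists():
            # Placeholder for a real one-time fetch (Hugging Face, S3, etc.).
            ckpt.write_bytes(b"fake weights")
            vol.commit()  # persist the write so other containers see it
        return ckpt.stat().st_size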

Scheduled jobs are built in at the primitive level. A nightly batch is a decorator, not a separate cron service. For teams who've bolted Celery or a dedicated scheduler onto their stack just to get periodic runs, this collapses a whole layer of infrastructure into one line of code. We've moved multiple clients off Airflow-for-simple-crons to Modal scheduled functions and saved measurable ops overhead.
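The whole "nightly batch" story, as a sketch:

    import modal

    app = modal.App("nightly-batch")

    @app.function(schedule=modal.Cron("0 3 * * *"))
    def nightly():
        # Fires at 03:00 UTC daily once deployed with `modal deploy`.
        print("processing yesterday's documents")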

Web endpoints are the last magic trick. Annotate a function with @modal.web_endpoint() and Modal generates an HTTPS URL backed by the function — autoscaling, with a valid certificate, no configuration. For internal tools, for demos, for "glue this ML model to the rest of the stack" use cases, this removes the need for a separate API layer entirely. Some of our best production value comes from Modal functions fronted directly by their own web endpoints.
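Illustratively (Modal has renamed its endpoint decorators across releases, so verify the current name against their docs):

    import modal

    app = modal.App("demo-endpoint")

    @app.function()
    @modal.web_endpoint(method="GET")
    def classify(text: str = ""):
        # Query parameters map onto function arguments; the dict returns as JSON.
        return {"label": "positive" if "good" in text else "negative"}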

Where Modal earns its keep

The mental model: every piece of infrastructure you were about to write — the cache, the scheduler, the API layer, the queue — is already a decorator. That's the thing Modal gets right that the rest of the category keeps trying to copy.

The image build system deserves its own note. Modal builds images in layers that cache aggressively between deploys, which means iteration on a large ML stack isn't punished with 10-minute rebuilds every time you change a line of code. The first build takes real time; every subsequent build is seconds. This sounds mundane until you've spent a week shipping at Docker speed on a competitor and then tried Modal for comparison.
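What that looks like in practice, as a sketch (the package choices below are placeholders, not a recommendation):

    import modal

    # Each chained step is a cached layer. Editing the last line re-runs only
    # that layer; the heavy torch/transformers install above it stays cached.
    image = (
        modal.Image.debian_slim(python_version="3.11")
        .apt_install("ffmpeg")
        .pip_install("torch", "transformers")  # heavy, rarely changes
        .pip_install("requests")               # cheap, churns often
    )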

Pros & cons

OUR HONEST TAKE

WHAT WORKS

  • Decorator-based Python DX is in a class by itself — no YAML, no separate deploy artifact.
  • Cold starts are meaningfully faster than RunPod Serverless on matched models.
  • Volumes replace S3+mount glue with a first-class persistence primitive.
  • Scheduled jobs and web endpoints eliminate whole layers of infrastructure.
  • Per-second billing across CPU, memory, and GPU means right-sizing is actually cheap.
  • Layered image builds cache aggressively — iteration speed doesn't collapse as stacks grow.
  • $30/month free credit is enough to actually prototype real workloads, not a token.

WHAT DOESN'T

  • Price floor higher than RunPod Community — you pay for the polish.
  • Python-only. No first-class Node, Go, Rust, or JVM story.
  • The decorator model creates real lock-in once your app is built around Modal primitives.
  • Enterprise procurement story is lighter than AWS/GCP — less leverage in big security reviews.
  • Memory billing is easy to miss in early estimates and produces credit surprises.
  • Region/non-preemptible multipliers can push real cost 1.1-1.3× above sticker.
  • Multi-GPU setups are supported but less ergonomic than the single-GPU path.

Common pitfalls

A handful of failure modes come up repeatedly in the Modal projects we've seen — none of them dealbreakers, all of them worth naming upfront before your first bill surprises you.

Not using Volumes for model weights. The single most common mistake on Modal is baking model weights into the container image instead of mounting them from a Volume. Weights in the image mean every deploy rebuilds the image, every cold start re-downloads, and your iteration loop gets noticeably slower. Weights in a Volume mean the model is already there the moment a container starts, deploys are instant, and your 10-minute cold-start problem becomes a 10-second one. If you take one thing from this review: use Volumes for your weights.
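A sketch of the right shape, using Modal's class-based container lifecycle as we understand it (the weights file and volume name are hypothetical):

    import modal

    app = modal.App("inference")
    vol = modal.Volume.from_name("model-weights", create_if_missing=True)

    @app.cls(gpu="A10G", volumes={"/weights": vol})
    class Model:
        @modal.enter()
        def load(self):
            # Runs once per container start. The weights are already on the
            # mounted Volume, so a cold start pays a disk read, not a download.
            self.weights = open("/weights/model.safetensors", "rb").read()

        @modal.method()
        def predict(self, prompt: str) -> str:
            return f"prediction for {prompt!r} ({len(self.weights)} weight bytes)"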

Paying for idle workers via over-eager warm settings. Modal's min_containers and keep_warm parameters let you keep replicas always-on to dodge cold starts. These work — and they bill. A team that sets min_containers=2 on an A100 function to "just be safe" has committed to roughly $5,000/month for standby capacity before a single request hits the endpoint. Set warmth deliberately, not defensively. If cold starts are already acceptable (5-15 seconds for most models), zero warm workers is the correct default.
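The arithmetic behind that $5,000 figure, with the warmth knob shown (the parameter has gone by keep_warm and min_containers in different Modal versions; check the current docs):

    import modal

    app = modal.App("warmth-demo")

    # 2 always-warm A100s: 2 x $3.40/hr x ~730 hr/mo = ~$4,964/mo, before memory.
    # Scale-to-zero (the default) costs nothing between requests.
    @app.function(gpu="A100", min_containers=0)
    def infer(x: str) -> str:
        return x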

Ignoring memory billing. The sticker rate on GPU cards doesn't include memory, which is priced separately at roughly $0.024/GB-hour. A function requesting 32GB of RAM on an always-warm A100 adds ~$0.77/hr on top of the GPU rate — more than a 20% increase nobody noticed when they drafted the budget. Check memory allocation explicitly, and prefer tighter memory limits during development to catch accidental over-provisioning before it ships.
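The back-of-envelope math, using the rates quoted in this review:

    # 32 GB of RAM on an always-warm function, at ~$0.024/GB-hr:
    mem_hr = 32 * 0.024          # = $0.768/hr
    overhead = mem_hr / 3.40     # ~0.23, i.e. >20% on top of the A100 rate
    monthly = mem_hr * 730       # ~ $560/mo for the memory alone
    print(f"${mem_hr:.2f}/hr, {overhead:.0%} over sticker, ~${monthly:.0f}/mo")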

Assuming CPU compute is free. CPU cores bill too — at roughly $0.135/core-hour. Orchestration functions, web endpoints that receive a lot of idle requests, and scheduled jobs that poll for work all accumulate CPU-hours in ways that are invisible until the invoice arrives. Our rule: every always-on CPU function should justify itself. If it could be a scheduled run or a webhook, it should be.

Not using committed-spend discounts when spend exceeds $2k/month. Modal offers material discounts for teams that commit to spend above roughly $2,000/month. Teams that hit this threshold organically but never negotiate leave real money on the table — typically 10-20% of their bill. If your monthly spend is climbing, email their team. This is table-stakes procurement for any serious spend level and a surprising number of customers miss it.

Building non-trivial apps without Modal-specific orchestration. Modal's primitives (Dicts, Queues, chained function calls, .spawn() / .map()) are the right way to structure multi-step ML pipelines on the platform. Teams that try to bolt on Celery, Airflow, or their own orchestration layer generally end up fighting Modal rather than using it. Learn the native primitives; they're better than what you'd bolt on.
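A sketch of that native style, with a stand-in where a real model would go:

    import modal

    app = modal.App("pipeline")

    @app.function(gpu="T4")
    def embed(doc: str) -> list[float]:
        return [float(len(doc))]  # stand-in for a real embedding model

    @app.function()
    def run_pipeline(docs: list[str]) -> int:
        # .map() fans out across containers and scales with the input;
        # .spawn() fires an async call and returns a handle immediately.
        vectors = list(embed.map(docs))
        embed.spawn("one more, asynchronously")
        return len(vectors)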

What's actually offered

CAPABILITIES AT A GLANCE
PYTHON-NATIVE SDK

Decorators (@app.function, @modal.web_endpoint) turn local Python into deployed serverless code with zero YAML.

SERVERLESS GPUS

Full range: T4, L4, A10G, A100 40/80GB, H100 80GB. Scale to zero by default; scale up per-request.

SCHEDULED JOBS

Cron-native. schedule=modal.Cron("0 3 * * *") replaces whole Airflow deployments for simple cases.

VOLUMES + DICTS

Persistent filesystem and distributed key-value state without standing up S3 or Redis yourself.

WEB ENDPOINTS

Any function becomes an HTTPS API with a decorator — autoscaling, TLS-valid, no load balancer config.

LAYERED IMAGE BUILDS

Fast incremental builds with aggressive caching. First build slow, every subsequent build seconds.

SECRETS MANAGEMENT

First-class secrets injected as environment variables. OIDC and key-vault integrations supported.

INTEGRATIONS

W&B, Datadog, structured logging, GitHub Actions. Plays nicely with the rest of an ML stack.

SEEN ENOUGH?

$30 monthly free credit, no card to start. You can have a GPU-backed HTTPS endpoint in under ten minutes.

TRY MODAL →

What's not

The price floor is the first thing to name. Modal is not cheap on a raw $/GPU-hour basis. An A100 80GB at ~$3.40/hr is competitive with AWS on-demand and noticeably above RunPod Community at ~$2.31/hr. For workloads where compute is the dominant cost and DX isn't the bottleneck, the savings on a cheaper provider can be real — especially at 24/7 inference volume. Modal's pitch is explicitly that the polish is worth the premium, and for most teams it is, but if you're running thousands of GPU-hours monthly on a predictable workload the math can favor a different provider.

Python-only is not a soft constraint. If your stack is Node, Go, Rust, or JVM, Modal isn't the answer. You can sometimes wrap subprocess calls or ship a Python harness around non-Python binaries, but you're fighting the grain. Teams on heterogeneous stacks usually end up picking Modal for the Python parts and a different platform (Cloudflare Workers, Fly.io, traditional cloud) for the rest. That's workable but it's a split stack, which has its own costs.

The decorator model creates lock-in. Once your app is structured around @app.function(), Modal Volumes, Modal Dicts, and Modal scheduled jobs, porting to another platform is a meaningful rewrite. Modal is honest about this — they don't pretend the abstractions are portable — but it's worth acknowledging before you build a year's worth of infrastructure on top. For strategic workloads, pin yourself to plain Python functions and treat Modal primitives as deployment glue rather than architectural commitments.
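One way to structure that, sketched as two hypothetical modules:

    # logic.py: plain Python, zero Modal imports, portable anywhere.
    def summarize(text: str) -> str:
        return text[:100]

    # modal_app.py: the Modal-specific shell, thin enough to rewrite in a day.
    import modal
    from logic import summarize

    app = modal.App("portable-app")

    @app.function()
    def summarize_remote(text: str) -> str:
        return summarize(text)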

Enterprise procurement is lighter than a hyperscaler. Modal has SOC 2 Type II and enterprise contracts, but it doesn't have the decade-long legal-and-compliance footprint of AWS or GCP. For organizations whose procurement departments require multi-region guarantees, named-DPA terms, and extensive third-party audits baked in, Modal can clear the bar with effort but isn't the friction-free pick. This is the standard startup-vs-hyperscaler tradeoff; Modal is better on this axis than most of its serverless-GPU peers, but it's not AWS.

Who should use it

Modal is the right call for several specific profiles.

Python-first ML teams shipping inference or training. If your team lives in PyTorch, Transformers, and pandas, and you're deploying ML as a core product function, Modal is the highest-productivity place to do it. The decorator DX matches how ML engineers already think about code, and the primitives (Volumes for weights, scheduled jobs for retraining, web endpoints for model APIs) hit exactly the workflows the job description already covers. We default to Modal for this profile and only move off when the bill forces a conversation.

Indie ML engineers and two-person startups. The $30/month free credit plus the scale-to-zero default mean you can build a real product without infrastructure cost before the first user shows up. For the specific pattern of "I have an open-source model fine-tuned for my niche and I want to ship it as an API," Modal is the path with the least friction in the industry. You'll save weeks of plumbing over rolling your own stack on RunPod.

Data pipelines and scheduled batch. If your workload is "run this Python process on this cadence and keep state between runs," Modal is a serious alternative to Airflow or Prefect — especially once you factor in the cost of keeping a scheduler running. A nightly embedding pipeline, a weekly model eval, a continuous ETL job — the decorator + Volume combination is hard to beat on developer-hours-to-production.

Startups moving off Colab. The common progression: prototype in a notebook, hit Colab's limits, try to productionize, drown in Docker and AWS plumbing. Modal is the graceful next step — you keep your Python, you get real GPUs, and you get the deployment story without learning a new paradigm. Many of our clients follow exactly this arc, and Modal absorbs the complexity that would otherwise block them for weeks.

Who should not use it: teams whose primary constraint is $/GPU-hour at 24/7 volume (go to RunPod or Vast.ai); non-Python stacks; teams who only need "hit this open-source model as an API" with zero custom code (Replicate is simpler); and organizations whose procurement process structurally favors a hyperscaler (AWS SageMaker or Google Vertex AI are probably already on the approved-vendor list).

Verdict

Modal is the best Python-first serverless GPU experience shipping in 2026. The decorator DX removes enough friction that you'll build things you wouldn't have bothered building on a rougher platform, and the cold-start and Volume primitives make those things actually work in production. The tradeoff is transparent: you pay more per GPU-hour than you would on a community-tier provider, you lock yourself into a Python-only and Modal-native architecture, and your enterprise procurement story is lighter than AWS. Each of those is a real cost; none of them are fatal for most teams.

We rate Modal 8.7 / 10. Take half a point off if you're cost-dominated at 24/7 scale; add it back if your bottleneck is ML engineering throughput and you'd trade a percentage of your compute spend for another shipped model. If you're on the fence, burn the $30 free credit on a real workload over a weekend — you'll know by Sunday night.


Ready to ship Python to a GPU? Takes about ten minutes.

TRY MODAL → OR SCOPE A BUILD WITH US →