
Modal Serverless GPU — Serverless GPU cloud platform for running ML workloads

Serverless GPU cloud platform for running ML workloads. Use when you need on-demand GPU access without infrastructure management, want to deploy ML models as APIs, or need to run batch jobs with automatic scaling.

Source: Optional — install with hermes skills install official/mlops/modal
Path: optional-skills/mlops/modal
Version: 1.0.0
Author: Orchestra Research
License: MIT
Dependencies: modal>=0.64.0
Tags: Infrastructure, Serverless, GPU, Cloud, Deployment, Modal

The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.

Comprehensive guide to running ML workloads on Modal’s serverless GPU cloud platform.

Use Modal when:

  • Running GPU-intensive ML workloads without managing infrastructure
  • Deploying ML models as auto-scaling APIs
  • Running batch processing jobs (training, inference, data processing)
  • Paying per second for GPU time, with no idle costs
  • Prototyping ML applications quickly
  • Running scheduled jobs (cron-like workloads)

Key features:

  • Serverless GPUs: T4, L4, A10G, L40S, A100, H100, H200, B200 on-demand
  • Python-native: Define infrastructure in Python code, no YAML
  • Auto-scaling: Scale to zero, scale to 100+ GPUs instantly
  • Sub-second cold starts: Rust-based infrastructure for fast container launches
  • Container caching: Image layers cached for rapid iteration
  • Web endpoints: Deploy functions as REST APIs with zero-downtime updates

Consider alternatives instead:

  • RunPod: For longer-running pods with persistent state
  • Lambda Labs: For reserved GPU instances
  • SkyPilot: For multi-cloud orchestration and cost optimization
  • Kubernetes: For complex multi-service architectures
pip install modal
modal setup  # Opens browser for authentication

hello_gpu.py:

import modal

app = modal.App("hello-gpu")

@app.function(gpu="T4")
def gpu_info():
    import subprocess
    return subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout

@app.local_entrypoint()
def main():
    print(gpu_info.remote())

Run: modal run hello_gpu.py

A class-based app that loads the model once per container and serves generation requests:

import modal

app = modal.App("text-generation")
image = modal.Image.debian_slim().pip_install("transformers", "torch", "accelerate")

@app.cls(gpu="A10G", image=image)
class TextGenerator:
    @modal.enter()
    def load_model(self):
        from transformers import pipeline
        self.pipe = pipeline("text-generation", model="gpt2", device=0)

    @modal.method()
    def generate(self, prompt: str) -> str:
        return self.pipe(prompt, max_length=100)[0]["generated_text"]

@app.local_entrypoint()
def main():
    print(TextGenerator().generate.remote("Hello, world"))
Component    Purpose
App          Container for functions and resources
Function     Serverless function with compute specs
Cls          Class-based functions with lifecycle hooks
Image        Container image definition
Volume       Persistent storage for models/data
Secret       Secure credential storage
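To show how these pieces compose, here is a minimal sketch that wires an Image, a Volume, and a Secret into a single function. The app name, the "model-cache" and "huggingface" resource names, and the warm_cache helper are illustrative assumptions, not part of this skill:

import modal

app = modal.App("example-pipeline")

# Image, persistent volume, and credentials are all declared in Python
image = modal.Image.debian_slim().pip_install("huggingface_hub")
volume = modal.Volume.from_name("model-cache", create_if_missing=True)
secret = modal.Secret.from_name("huggingface")

@app.function(image=image, volumes={"/models": volume}, secrets=[secret], timeout=600)
def warm_cache(model_id: str):
    import os
    from huggingface_hub import snapshot_download
    # Download the model once and persist it to the shared volume
    snapshot_download(model_id, cache_dir="/models", token=os.environ["HF_TOKEN"])
    volume.commit()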
Command                   Description
modal run script.py       Execute and exit
modal serve script.py     Development with live reload
modal deploy script.py    Persistent cloud deployment
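Once an app is deployed with modal deploy, its functions stay callable from any Python process. A hedged sketch of invoking the hello-gpu function from the quickstart (recent Modal releases expose this as modal.Function.from_name; older releases spell it modal.Function.lookup):

import modal

# Get a handle to a function on an already-deployed app and call it remotely
gpu_info = modal.Function.from_name("hello-gpu", "gpu_info")
print(gpu_info.remote())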
GPU          VRAM      Best For
T4           16 GB     Budget inference, small models
L4           24 GB     Inference, Ada Lovelace architecture
A10G         24 GB     Training/inference, 3.3x faster than T4
L40S         48 GB     Recommended for inference (best cost/performance)
A100-40GB    40 GB     Large model training
A100-80GB    80 GB     Very large models
H100         80 GB     Fastest, FP8 + Transformer Engine
H200         141 GB    Auto-upgrade from H100, 4.8 TB/s bandwidth
B200         —         Latest Blackwell architecture
# Single GPU
@app.function(gpu="A100")

# Specific memory variant
@app.function(gpu="A100-80GB")

# Multiple GPUs (up to 8)
@app.function(gpu="H100:4")

# GPU with fallbacks
@app.function(gpu=["H100", "A100", "L40S"])

# Any available GPU
@app.function(gpu="any")
# Basic image with pip
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch==2.1.0", "transformers==4.36.0", "accelerate"
)

# From a CUDA base image
image = modal.Image.from_registry(
    "nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04",
    add_python="3.11"
).pip_install("torch", "transformers")

# With system packages
image = modal.Image.debian_slim().apt_install("git", "ffmpeg").pip_install("openai-whisper")
volume = modal.Volume.from_name("model-cache", create_if_missing=True)

@app.function(gpu="A10G", volumes={"/models": volume})
def load_model():
    import os
    model_path = "/models/llama-7b"
    if not os.path.exists(model_path):
        model = download_model()
        model.save_pretrained(model_path)
        volume.commit()  # Persist changes
    return load_from_path(model_path)
@app.function()
@modal.fastapi_endpoint(method="POST")
def predict(text: str) -> dict:
    return {"result": model.predict(text)}

from fastapi import FastAPI

web_app = FastAPI()

@web_app.post("/predict")
async def predict(text: str):
    return {"result": await model.predict.remote.aio(text)}

@app.function()
@modal.asgi_app()
def fastapi_app():
    return web_app
Decorator                    Use Case
@modal.fastapi_endpoint()    Simple function → API
@modal.asgi_app()            Full FastAPI/Starlette apps
@modal.wsgi_app()            Django/Flask apps
@modal.web_server(port)      Arbitrary HTTP servers
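For the WSGI path, a minimal sketch of serving a Flask app; Flask and the /echo route are illustrative assumptions, and the pattern mirrors the ASGI example above:

image = modal.Image.debian_slim().pip_install("flask")

@app.function(image=image)
@modal.wsgi_app()
def flask_app():
    from flask import Flask, request

    web_app = Flask(__name__)

    @web_app.post("/echo")
    def echo():
        # Echo the JSON body back — placeholder logic
        return request.get_json(silent=True) or {}

    return web_app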
@app.function()
@modal.batched(max_batch_size=32, wait_ms=100)
async def batch_predict(inputs: list[str]) -> list[dict]:
    # Inputs are automatically batched
    return model.batch_predict(inputs)
# Create secret
modal secret create huggingface HF_TOKEN=hf_xxx

@app.function(secrets=[modal.Secret.from_name("huggingface")])
def download_model():
    import os
    token = os.environ["HF_TOKEN"]
@app.function(schedule=modal.Cron("0 0 * * *"))  # Daily at midnight
def daily_job():
    pass

@app.function(schedule=modal.Period(hours=1))
def hourly_job():
    pass
@app.function(
    container_idle_timeout=300,  # Keep warm 5 min
    allow_concurrent_inputs=10,  # Handle concurrent requests
)
def inference():
    pass

@app.cls(gpu="A100")
class Model:
    @modal.enter()  # Run once at container start
    def load(self):
        self.model = load_model()  # Load during warm-up

    @modal.method()
    def predict(self, x):
        return self.model(x)
@app.function()
def process_item(item):
    return expensive_computation(item)

@app.function()
def run_parallel():
    items = list(range(1000))
    # Fan out to parallel containers
    results = list(process_item.map(items))
    return results
@app.function(
    gpu="A100",
    memory=32768,                # 32GB RAM
    cpu=4,                       # 4 CPU cores
    timeout=3600,                # 1 hour max
    container_idle_timeout=120,  # Keep warm 2 min
    retries=3,                   # Retry on failure
    concurrency_limit=10,        # Max concurrent containers
)
def my_function():
    pass

# Test locally
if __name__ == "__main__":
    result = my_function.local()

# View logs
# modal app logs my-app
Issue                 Solution
Cold start latency    Increase container_idle_timeout, use @modal.enter()
GPU OOM               Use a larger GPU (A100-80GB), enable gradient checkpointing
Image build fails     Pin dependency versions, check CUDA compatibility
Timeout errors        Increase timeout, add checkpointing
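For timeout-prone jobs, checkpointing to a Volume lets a retried container resume where the last run stopped. A rough sketch, assuming user-supplied train_one_epoch, save_checkpoint, and load_checkpoint helpers (not part of this skill):

checkpoints = modal.Volume.from_name("train-checkpoints", create_if_missing=True)

@app.function(gpu="A100", volumes={"/ckpt": checkpoints}, timeout=3600, retries=3)
def train(num_epochs: int = 10):
    import os
    state_path = "/ckpt/state.pt"
    # Resume from the last committed checkpoint, if any
    start_epoch = load_checkpoint(state_path) if os.path.exists(state_path) else 0
    for epoch in range(start_epoch, num_epochs):
        train_one_epoch(epoch)                  # user-supplied training step
        save_checkpoint(state_path, epoch + 1)  # user-supplied serialization
        checkpoints.commit()                    # persist so a retried run can resume here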