What is a Runner?
A Runner is a compute instance of your application running on fal’s infrastructure. Each runner is tied to a specific machine type that determines its hardware resources (CPU cores, RAM, GPU type and count). Runners automatically start when requests arrive and shut down when idle to save costs.

When you deploy an application, fal automatically creates and manages runners that:

- Run on your configured machine type (e.g., `GPU-H100`, `GPU-A100`)
- Can have 1-8 GPUs depending on your `num_gpus` configuration
- Load your model and dependencies during startup
- Serve requests from your users
- Scale up and down based on demand
- Share cached resources to improve performance
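As a rough illustration of the hardware constraints above, a runner's configuration can be modeled like this. The `RunnerConfig` class is hypothetical; only the `GPU-H100`/`GPU-A100` machine types and the 1-8 GPU range come from the text.

```python
from dataclasses import dataclass

# Machine types named in the docs above; the set itself is illustrative.
GPU_MACHINE_TYPES = {"GPU-H100", "GPU-A100"}


@dataclass
class RunnerConfig:
    """Hypothetical model of a runner's hardware configuration."""
    machine_type: str
    num_gpus: int = 1

    def __post_init__(self):
        # Per the docs, a runner can have 1-8 GPUs depending on num_gpus.
        if self.machine_type in GPU_MACHINE_TYPES and not 1 <= self.num_gpus <= 8:
            raise ValueError("num_gpus must be between 1 and 8")


config = RunnerConfig(machine_type="GPU-H100", num_gpus=2)
```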
Runner Lifecycle and States
Runners transition through different states during their lifecycle:

| State | Description |
|---|---|
| PENDING | Runner is waiting to be scheduled on available hardware |
| DOCKER_PULL | Pulling Docker images from registry (if using custom container) |
| SETUP | Running setup() method - loading model and initializing resources |
| IDLE | Ready and waiting for work - no active requests |
| RUNNING | Actively processing one or more requests |
| DRAINING | Finishing current requests, won’t accept new ones |
| TERMINATING | Shutting down, running teardown() if defined |
| TERMINATED | Runner has stopped and resources are released |
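The lifecycle in the table can be sketched as a small state machine. This is a simulation for illustration, not fal's actual scheduler code; the states and the ordering come from the table above.

```python
# Allowed runner state transitions, per the lifecycle table above.
TRANSITIONS = {
    "PENDING": {"DOCKER_PULL", "SETUP"},  # DOCKER_PULL only with a custom container
    "DOCKER_PULL": {"SETUP"},
    "SETUP": {"IDLE"},
    "IDLE": {"RUNNING", "DRAINING"},
    "RUNNING": {"IDLE", "DRAINING"},
    "DRAINING": {"TERMINATING"},
    "TERMINATING": {"TERMINATED"},
    "TERMINATED": set(),
}


def is_valid_path(states):
    """Check that a sequence of states follows the allowed transitions."""
    return all(b in TRANSITIONS[a] for a, b in zip(states, states[1:]))


# Normal startup path for a custom-container runner:
assert is_valid_path(["PENDING", "DOCKER_PULL", "SETUP", "IDLE"])
# A runner cannot skip DRAINING on the way out:
assert not is_valid_path(["RUNNING", "TERMINATED"])
```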
Startup (PENDING → DOCKER_PULL → SETUP → IDLE):
- When demand increases, fal schedules a new runner
- If using a custom container, Docker images are pulled
- Your `setup()` method runs to load models and initialize
- Runner enters IDLE state, ready to serve requests
Serving (IDLE ↔ RUNNING):
- When a request arrives, an IDLE runner transitions to RUNNING
- After completing all requests, it returns to IDLE
- Runners can handle multiple concurrent requests if `max_multiplexing > 1`
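A minimal sketch of the IDLE ↔ RUNNING behavior with multiplexing. The `Runner` class here is a toy model, not fal's implementation; only `max_multiplexing` and the state names come from the docs.

```python
class Runner:
    """Toy model of a runner that multiplexes up to max_multiplexing requests."""

    def __init__(self, max_multiplexing=1):
        self.max_multiplexing = max_multiplexing
        self.active = 0

    @property
    def state(self):
        # RUNNING while any request is in flight, IDLE otherwise.
        return "RUNNING" if self.active > 0 else "IDLE"

    def can_accept(self):
        return self.active < self.max_multiplexing

    def start_request(self):
        if not self.can_accept():
            raise RuntimeError("runner at capacity")
        self.active += 1

    def finish_request(self):
        self.active -= 1


r = Runner(max_multiplexing=2)
r.start_request()
r.start_request()  # two concurrent requests on one runner
assert r.state == "RUNNING" and not r.can_accept()
r.finish_request()
r.finish_request()
assert r.state == "IDLE"  # back to IDLE once all requests complete
```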
Shutdown (DRAINING → TERMINATING → TERMINATED):
- When scaling down or reaching expiration, runners enter DRAINING
- No new requests are routed, but existing requests continue
- After requests complete (or timeout), runner enters TERMINATING
- Your `teardown()` method runs for cleanup
- Runner is terminated and resources are freed
How Requests Reach Your App
When a user calls your app (via `queue.fal.run` or `fal.run`), fal’s infrastructure handles the full path from request to runner:
Scaling from Queue to Runners
- A request enters the queue
- The dispatcher checks if any IDLE runners are available
- If a runner is available, the request is routed immediately
- If all runners are busy, the request waits in the queue while fal scales up new runners based on your scaling parameters:
- `min_concurrency` keeps runners always warm so requests never wait
- `concurrency_buffer` maintains extra headroom above current demand
- `scaling_delay` controls how quickly new runners spin up
- `max_concurrency` caps the total number of runners
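A back-of-the-envelope sketch of how these parameters could combine into a target runner count. The formula is illustrative, not fal's actual autoscaler logic, and `scaling_delay` (which governs timing rather than count) is not modeled.

```python
def target_runners(in_flight, min_concurrency, concurrency_buffer, max_concurrency):
    """Illustrative target: current demand plus buffer, clamped to [min, max]."""
    desired = in_flight + concurrency_buffer
    return max(min_concurrency, min(desired, max_concurrency))


# Quiet period: min_concurrency keeps one runner warm.
assert target_runners(0, min_concurrency=1, concurrency_buffer=0, max_concurrency=10) == 1
# Active period: demand plus headroom from concurrency_buffer.
assert target_runners(5, min_concurrency=1, concurrency_buffer=2, max_concurrency=10) == 7
# Spike: max_concurrency caps the total.
assert target_runners(50, min_concurrency=1, concurrency_buffer=2, max_concurrency=10) == 10
```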
What Happens When a Runner Fails
If a runner crashes or returns an error while processing a request:

- The request is automatically re-queued and dispatched to another runner
- This retries up to 10 times for server errors (503), timeouts (504), and connection failures
- Failed runners are replaced with healthy ones
- Setting `skip_retry_conditions` in your App class disables retries for specific error types
- The `X-Fal-Needs-Retry` response header lets your code force or prevent retries per-response
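The retry behavior described above can be sketched as a dispatcher loop. This is a simulation: the retry count of 10 and the 503/504/connection-failure conditions come from the text, while the function names are illustrative, and `skip_retry_conditions` / `X-Fal-Needs-Retry` overrides are not modeled.

```python
MAX_RETRIES = 10
RETRYABLE_STATUSES = {503, 504}  # server errors and timeouts, per the docs


def dispatch_with_retries(handler, max_retries=MAX_RETRIES):
    """Re-dispatch a request until it succeeds or retries are exhausted."""
    last = None
    for attempt in range(1 + max_retries):
        status = handler(attempt)
        last = status
        if status not in RETRYABLE_STATUSES:
            return status, attempt  # success (or non-retryable error)
    return last, max_retries


# A runner that fails twice with 503, then succeeds on a healthy replacement:
flaky = lambda attempt: 503 if attempt < 2 else 200
status, attempts = dispatch_with_retries(flaky)
assert (status, attempts) == (200, 2)
```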
Your App Doesn’t Know About the Queue
Your endpoint code receives every request as a regular HTTP call — it doesn’t matter whether the caller used `fal.run` (synchronous), `queue.fal.run` (queued), or `ws.fal.run` (WebSocket). The queue and dispatch layer are transparent to your app code.
How Caching Works
fal uses a multi-layer caching system to reduce cold start times as your application serves traffic. The system has three layers, each with different performance characteristics:

| Cache Layer | Speed | Scope | Use Case |
|---|---|---|---|
| Local Node Cache | Fastest | Same physical machine | Runners on same node |
| Distributed Cache | Fast | Same datacenter/region | Runners across nodes |
| Object Store | Moderate | Global | Fallback for cache misses |
When a runner needs a cached resource, it will:

- Check the local node cache first (fastest)
- If not found, check distributed datacenter cache
- If not found, fetch from object store and populate caches
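The three-layer lookup can be sketched as a read-through cache. This is illustrative: the layer names mirror the table above, and the dict-backed stores are stand-ins for fal's actual storage.

```python
class TieredCache:
    """Toy read-through cache: local node -> distributed -> object store."""

    def __init__(self, object_store):
        self.local = {}        # fastest: same physical machine
        self.distributed = {}  # fast: same datacenter/region
        self.object_store = object_store  # moderate: global fallback

    def get(self, key):
        if key in self.local:
            return self.local[key]
        if key in self.distributed:
            value = self.distributed[key]
            self.local[key] = value  # populate the faster layer
            return value
        value = self.object_store[key]  # miss everywhere: fetch globally
        self.distributed[key] = value   # populate both caches on the way back
        self.local[key] = value
        return value


cache = TieredCache({"weights.bin": b"model bytes"})
assert cache.get("weights.bin") == b"model bytes"  # first read hits the object store
assert "weights.bin" in cache.local                # subsequent reads are local
```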
fal caches several kinds of artifacts:

- Docker Image Layers — container images are split into layers, each cached independently
- Model Weights — files downloaded to `/data` are automatically cached (including HuggingFace models at `/data/.cache/huggingface`)
- Compiled Model Caches — PyTorch Inductor compiled models, TensorRT engines, and other JIT compilation artifacts
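In app code, `/data` caching typically means a download can be skipped when the file is already there from a previous runner. A minimal sketch of that pattern, assuming a generic path and a caller-supplied download function (both illustrative; the persistence of `/data` is what the docs describe):

```python
import os


def fetch_weights(path, download):
    """Download model weights only if they aren't already in the cache dir."""
    if os.path.exists(path):  # warm path: cached by an earlier runner on this node
        return path
    os.makedirs(os.path.dirname(path), exist_ok=True)
    download(path)  # cold path: fetch once and populate the cache
    return path
```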
Learn more in the `/data` caching system documentation.