What is a Runner?

A Runner is a compute instance of your application running on fal’s infrastructure. Each runner is tied to a specific machine type that determines its hardware resources (CPU cores, RAM, GPU type and count). Runners automatically start when requests arrive and shut down when idle to save costs. When you deploy an application, fal automatically creates and manages runners that:
  • Run on your configured machine type (e.g., GPU-H100, GPU-A100)
  • Can have 1-8 GPUs depending on your num_gpus configuration
  • Load your model and dependencies during startup
  • Serve requests from your users
  • Scale up and down based on demand
  • Share cached resources to improve performance

Runner Lifecycle and States

Runners transition through different states during their lifecycle:
State         Description
PENDING       Runner is waiting to be scheduled on available hardware
DOCKER_PULL   Pulling Docker images from the registry (if using a custom container)
SETUP         Running the setup() method - loading the model and initializing resources
IDLE          Ready and waiting for work - no active requests
RUNNING       Actively processing one or more requests
DRAINING      Finishing current requests, won't accept new ones
TERMINATING   Shutting down, running teardown() if defined
TERMINATED    Runner has stopped and resources are released
Startup Flow (PENDING → DOCKER_PULL → SETUP → IDLE):
  1. When demand increases, fal schedules a new runner
  2. If using a custom container, Docker images are pulled
  3. Your setup() method runs to load models and initialize
  4. Runner enters IDLE state, ready to serve requests
Request Processing (IDLE → RUNNING):
  • When a request arrives, an IDLE runner transitions to RUNNING
  • After completing all requests, it returns to IDLE
  • Runners can handle multiple concurrent requests if max_multiplexing > 1
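The request-processing transition can be sketched as a small model in which a runner's state follows its active request count. The state names and max_multiplexing come from this page; the class itself is illustrative, not fal's API:

```python
class Runner:
    """Illustrative model: a runner's state follows its active request count."""

    def __init__(self, max_multiplexing: int = 1):
        self.max_multiplexing = max_multiplexing
        self.active_requests = 0

    @property
    def state(self) -> str:
        return "RUNNING" if self.active_requests > 0 else "IDLE"

    def can_accept(self) -> bool:
        # A runner takes new work while it has free multiplexing slots.
        return self.active_requests < self.max_multiplexing

    def start_request(self) -> None:
        if not self.can_accept():
            raise RuntimeError("all multiplexing slots are busy")
        self.active_requests += 1

    def finish_request(self) -> None:
        self.active_requests -= 1


runner = Runner(max_multiplexing=2)
runner.start_request()        # IDLE -> RUNNING
runner.start_request()        # second concurrent request, still RUNNING
assert runner.state == "RUNNING" and not runner.can_accept()
runner.finish_request()
runner.finish_request()       # back to IDLE after all requests complete
assert runner.state == "IDLE"
```

With the default max_multiplexing of 1, the second start_request() above would instead raise, and the dispatcher would route that request to a different runner.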
Shutdown Flow (DRAINING → TERMINATING → TERMINATED):
  1. When scaling down or reaching expiration, runners enter DRAINING
  2. No new requests are routed, but existing requests continue
  3. After requests complete (or timeout), runner enters TERMINATING
  4. Your teardown() method runs for cleanup
  5. Runner is terminated and resources are freed
For details on startup and shutdown hooks, see App Lifecycle.
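The full lifecycle can be summarized as a transition table. This is a sketch built from the states described above, not fal's internal implementation:

```python
# Valid lifecycle transitions, as described above. DRAINING can be
# entered from IDLE or RUNNING when the runner is scaled down or expires.
TRANSITIONS = {
    "PENDING": {"DOCKER_PULL", "SETUP"},  # DOCKER_PULL only for custom containers
    "DOCKER_PULL": {"SETUP"},
    "SETUP": {"IDLE"},
    "IDLE": {"RUNNING", "DRAINING"},
    "RUNNING": {"IDLE", "DRAINING"},
    "DRAINING": {"TERMINATING"},
    "TERMINATING": {"TERMINATED"},
    "TERMINATED": set(),
}


def is_valid_path(path: list[str]) -> bool:
    """Check that each consecutive pair of states is a valid transition."""
    return all(b in TRANSITIONS[a] for a, b in zip(path, path[1:]))


# Startup with a custom container, one request, then shutdown:
assert is_valid_path(["PENDING", "DOCKER_PULL", "SETUP", "IDLE", "RUNNING",
                      "IDLE", "DRAINING", "TERMINATING", "TERMINATED"])
# A runner never goes straight from SETUP to RUNNING:
assert not is_valid_path(["SETUP", "RUNNING"])
```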

How Requests Reach Your App

When a user calls your app (via queue.fal.run or fal.run), fal’s infrastructure handles the full path from request to runner:
Request arrives --> Queue --> Dispatcher --> Runner --> Response
                     |            |
                     |            +--> scales up new runners if needed
                     |
                     +--> request waits here if all runners are busy

Scaling from Queue to Runners

  1. A request enters the queue
  2. The dispatcher checks if any IDLE runners are available
  3. If a runner is available, the request is routed immediately
  4. If all runners are busy, the request waits in the queue while fal scales up new runners based on your scaling parameters:
    • min_concurrency keeps runners always warm so requests never wait
    • concurrency_buffer maintains extra headroom above current demand
    • scaling_delay controls how quickly new runners spin up
    • max_concurrency caps the total number of runners
Requests are never dropped. There is no queue size limit. If your app can’t keep up, requests accumulate in the queue until runners become available.
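The interaction of these parameters can be sketched as a scale-target calculation. The parameter names come from this page, but the formula is illustrative; fal's real autoscaler also factors in scaling_delay and other timing signals, which this sketch ignores:

```python
def desired_runners(queued: int, active: int,
                    min_concurrency: int, max_concurrency: int,
                    concurrency_buffer: int = 0) -> int:
    """Illustrative scale target: current demand plus headroom, clamped
    to the configured floor (min_concurrency) and ceiling (max_concurrency)."""
    demand = queued + active
    target = demand + concurrency_buffer
    return max(min_concurrency, min(target, max_concurrency))


# All runners busy and two requests queued: scale toward demand + buffer.
assert desired_runners(queued=2, active=3,
                       min_concurrency=1, max_concurrency=10,
                       concurrency_buffer=1) == 6
# max_concurrency caps the total even under heavy load.
assert desired_runners(queued=50, active=10,
                       min_concurrency=1, max_concurrency=10) == 10
# With no traffic, min_concurrency keeps runners warm.
assert desired_runners(queued=0, active=0,
                       min_concurrency=2, max_concurrency=10) == 2
```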

What Happens When a Runner Fails

If a runner crashes or returns an error while processing a request:
  1. The request is automatically re-queued and dispatched to another runner
  2. Retries are attempted up to 10 times for server errors (503), timeouts (504), and connection failures
  3. Failed runners are replaced with healthy ones
You can control this behavior:
  • skip_retry_conditions in your App class disables retries for specific error types
  • X-Fal-Needs-Retry response header lets your code force or prevent retries per-response
See Retries for full details.
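The retry policy can be sketched as a dispatch loop. The status codes and the 10-attempt cap come from this page; the dispatch callable is a stand-in for fal's internal routing, not a real API:

```python
MAX_RETRIES = 10
RETRYABLE = {503, 504, "connection_error"}


def process_with_retries(dispatch, skip_retry_conditions=frozenset()):
    """Re-dispatch a request until it succeeds or the retry budget is
    exhausted. `dispatch` stands in for sending the request to a runner
    and returns an HTTP status code. Returns (status, attempts_used)."""
    last = None
    for attempt in range(1 + MAX_RETRIES):
        last = dispatch()
        # Non-retryable results, and errors the app opted out of
        # retrying, are returned to the caller immediately.
        if last not in RETRYABLE or last in skip_retry_conditions:
            return last, attempt + 1
    return last, 1 + MAX_RETRIES


# Two crashed runners (503), then a healthy replacement succeeds on attempt 3.
outcomes = iter([503, 503, 200])
status, attempts = process_with_retries(lambda: next(outcomes))
assert (status, attempts) == (200, 3)

# With 503 in skip_retry_conditions, the error is returned immediately.
status, attempts = process_with_retries(lambda: 503,
                                        skip_retry_conditions={503})
assert (status, attempts) == (503, 1)
```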

Your App Doesn’t Know About the Queue

Your endpoint code receives every request as a regular HTTP call — it doesn’t matter whether the caller used fal.run (synchronous), queue.fal.run (queued), or ws.fal.run (WebSocket). The queue and dispatch layer are transparent to your app code.

How Caching Works

fal uses a multi-layer caching system to reduce cold start times as your application serves traffic. fal’s caching system has three layers, each with different performance characteristics:
Cache Layer         Speed     Scope                    Use Case
Local Node Cache    Fastest   Same physical machine    Runners on the same node
Distributed Cache   Fast      Same datacenter/region   Runners across nodes
Object Store        Moderate  Global                   Fallback for cache misses
When a runner needs a file (model weights, Docker layers, etc.):
  1. Check local node cache first (fastest)
  2. If not found, check distributed datacenter cache
  3. If not found, fetch from object store and populate caches
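The three-layer lookup can be sketched as a fall-through with write-back, an illustrative model of the behavior described above:

```python
def fetch(key: str, local: dict, distributed: dict, object_store: dict):
    """Check caches fastest-first; on a miss, fetch from the object
    store and populate both cache layers on the way back."""
    if key in local:                      # 1. local node cache
        return local[key], "local"
    if key in distributed:                # 2. distributed datacenter cache
        local[key] = distributed[key]     #    warm the local cache too
        return local[key], "distributed"
    value = object_store[key]             # 3. global object store
    distributed[key] = value              #    populate both cache layers
    local[key] = value
    return value, "object_store"


local, distributed = {}, {}
store = {"weights.bin": b"\x00" * 4}

# First runner on a cold node: everything comes from the object store.
_, layer = fetch("weights.bin", local, distributed, store)
assert layer == "object_store"
# A later lookup on the same node hits the local cache.
_, layer = fetch("weights.bin", local, distributed, store)
assert layer == "local"
# A runner on another node in the same datacenter hits the distributed cache.
_, layer = fetch("weights.bin", {}, distributed, store)
assert layer == "distributed"
```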
What gets cached:
  • Docker Image Layers — container images are split into layers, each cached independently
  • Model Weights — files downloaded to /data are automatically cached (including HuggingFace models at /data/.cache/huggingface)
  • Compiled Model Caches — PyTorch Inductor compiled models, TensorRT engines, and other JIT compilation artifacts
As your application serves requests, caches automatically warm up. The first runner downloads everything from the object store and populates the local cache. Subsequent runners on the same node benefit from that local cache, and runners on other nodes benefit from the distributed cache. Over time, cold starts get progressively faster. See Use Persistent Storage for details on the /data caching system.