What is a Runner?

A Runner is a compute instance of your application running on fal’s infrastructure. Each runner is tied to a specific machine type that determines its hardware resources (CPU cores, RAM, GPU type and count). Runners automatically start when requests arrive and shut down when idle to save costs. When you deploy an application, fal automatically creates and manages runners that:
  • Run on your configured machine type (e.g., GPU-H100, GPU-A100)
  • Can have 1-8 GPUs depending on your num_gpus configuration
  • Load your model and dependencies during startup
  • Serve requests from your users
  • Scale up and down based on demand
  • Share cached resources to improve performance

Runner Lifecycle and States

Runners transition through different states during their lifecycle:
State         Description
PENDING       Runner is waiting to be scheduled on available hardware
DOCKER_PULL   Pulling Docker images from the registry (if using a custom container)
SETUP         Running the setup() method - loading the model and initializing resources
IDLE          Ready and waiting for work - no active requests
RUNNING       Actively processing one or more requests
DRAINING      Finishing current requests, won't accept new ones
TERMINATING   Shutting down, running teardown() if defined
TERMINATED    Runner has stopped and resources are released
Startup Flow (PENDING → DOCKER_PULL → SETUP → IDLE):
  1. When demand increases, fal schedules a new runner
  2. If using a custom container, Docker images are pulled
  3. Your setup() method runs to load models and initialize
  4. Runner enters IDLE state, ready to serve requests
Request Processing (IDLE → RUNNING):
  • When a request arrives, an IDLE runner transitions to RUNNING
  • After completing all requests, it returns to IDLE
  • Runners can handle multiple concurrent requests if max_multiplexing > 1
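The request-processing transition can be sketched as a small model in which a runner's state follows its active request count. The state names and max_multiplexing come from this page; the class itself is illustrative, not fal's API:

```python
class Runner:
    """Illustrative model: a runner's state follows its active request count."""

    def __init__(self, max_multiplexing: int = 1):
        self.max_multiplexing = max_multiplexing
        self.active_requests = 0

    @property
    def state(self) -> str:
        return "RUNNING" if self.active_requests > 0 else "IDLE"

    def can_accept(self) -> bool:
        # A runner takes new work while it has free multiplexing slots.
        return self.active_requests < self.max_multiplexing

    def start_request(self) -> None:
        if not self.can_accept():
            raise RuntimeError("all multiplexing slots are busy")
        self.active_requests += 1

    def finish_request(self) -> None:
        self.active_requests -= 1


runner = Runner(max_multiplexing=2)
runner.start_request()        # IDLE -> RUNNING
runner.start_request()        # second concurrent request, still RUNNING
assert runner.state == "RUNNING" and not runner.can_accept()
runner.finish_request()
runner.finish_request()       # back to IDLE after all requests complete
assert runner.state == "IDLE"
```

With the default max_multiplexing of 1, the second start_request() above would instead raise, and the dispatcher would route that request to a different runner.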
Shutdown Flow (DRAINING → TERMINATING → TERMINATED):
  1. When scaling down or reaching expiration, runners enter DRAINING
  2. No new requests are routed, but existing requests continue
  3. After requests complete (or timeout), runner enters TERMINATING
  4. Your teardown() method runs for cleanup
  5. Runner is terminated and resources are freed
For details on startup and shutdown hooks, see App Lifecycle.
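The full lifecycle can be summarized as a transition table. This is a sketch built from the states described above, not fal's internal implementation:

```python
# Valid lifecycle transitions, as described above. DRAINING can be
# entered from IDLE or RUNNING when the runner is scaled down or expires.
TRANSITIONS = {
    "PENDING": {"DOCKER_PULL", "SETUP"},  # DOCKER_PULL only for custom containers
    "DOCKER_PULL": {"SETUP"},
    "SETUP": {"IDLE"},
    "IDLE": {"RUNNING", "DRAINING"},
    "RUNNING": {"IDLE", "DRAINING"},
    "DRAINING": {"TERMINATING"},
    "TERMINATING": {"TERMINATED"},
    "TERMINATED": set(),
}


def is_valid_path(path: list[str]) -> bool:
    """Check that each consecutive pair of states is a valid transition."""
    return all(b in TRANSITIONS[a] for a, b in zip(path, path[1:]))


# Startup with a custom container, one request, then shutdown:
assert is_valid_path(["PENDING", "DOCKER_PULL", "SETUP", "IDLE", "RUNNING",
                      "IDLE", "DRAINING", "TERMINATING", "TERMINATED"])
# A runner never goes straight from SETUP to RUNNING:
assert not is_valid_path(["SETUP", "RUNNING"])
```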

How Requests Reach Your App

When a user calls your app (via queue.fal.run or fal.run), fal’s infrastructure handles the full path from request to runner:
Request arrives --> Queue --> Dispatcher --> Runner --> Response
                     |            |
                     |            +--> scales up new runners if needed
                     |
                     +--> request waits here if all runners are busy

Scaling from Queue to Runners

  1. A request enters the queue
  2. The dispatcher checks if any IDLE runners are available
  3. If a runner is available, the request is routed immediately
  4. If all runners are busy, the request waits in the queue while fal scales up new runners based on your scaling parameters:
    • min_concurrency keeps runners always warm so requests never wait
    • concurrency_buffer maintains extra headroom above current demand
    • scaling_delay controls how quickly new runners spin up
    • max_concurrency caps the total number of runners
Requests are never dropped. There is no queue size limit. If your app can’t keep up, requests accumulate in the queue until runners become available.
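The interaction of these parameters can be sketched as a scale-target calculation. The parameter names come from this page, but the formula is illustrative; fal's real autoscaler also factors in scaling_delay and other timing signals, which this sketch ignores:

```python
def desired_runners(queued: int, active: int,
                    min_concurrency: int, max_concurrency: int,
                    concurrency_buffer: int = 0) -> int:
    """Illustrative scale target: current demand plus headroom, clamped
    to the configured floor (min_concurrency) and ceiling (max_concurrency)."""
    demand = queued + active
    target = demand + concurrency_buffer
    return max(min_concurrency, min(target, max_concurrency))


# All runners busy and two requests queued: scale toward demand + buffer.
assert desired_runners(queued=2, active=3,
                       min_concurrency=1, max_concurrency=10,
                       concurrency_buffer=1) == 6
# max_concurrency caps the total even under heavy load.
assert desired_runners(queued=50, active=10,
                       min_concurrency=1, max_concurrency=10) == 10
# With no traffic, min_concurrency keeps runners warm.
assert desired_runners(queued=0, active=0,
                       min_concurrency=2, max_concurrency=10) == 2
```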

What Happens When a Runner Fails

If a runner crashes or returns an error while processing a request:
  1. The request is automatically re-queued and dispatched to another runner
  2. Retries are attempted up to 10 times for server errors (503), timeouts (504), and connection failures
  3. Failed runners are replaced with healthy ones
You can control this behavior:
  • skip_retry_conditions in your App class disables retries for specific error types
  • X-Fal-Needs-Retry response header lets your code force or prevent retries per-response
See Retries for full details.
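The retry policy can be sketched as a dispatch loop. The status codes and the 10-attempt cap come from this page; the dispatch callable is a stand-in for fal's internal routing, not a real API:

```python
MAX_RETRIES = 10
RETRYABLE = {503, 504, "connection_error"}


def process_with_retries(dispatch, skip_retry_conditions=frozenset()):
    """Re-dispatch a request until it succeeds or the retry budget is
    exhausted. `dispatch` stands in for sending the request to a runner
    and returns an HTTP status code. Returns (status, attempts_used)."""
    last = None
    for attempt in range(1 + MAX_RETRIES):
        last = dispatch()
        # Non-retryable results, and errors the app opted out of
        # retrying, are returned to the caller immediately.
        if last not in RETRYABLE or last in skip_retry_conditions:
            return last, attempt + 1
    return last, 1 + MAX_RETRIES


# Two crashed runners (503), then a healthy replacement succeeds on attempt 3.
outcomes = iter([503, 503, 200])
status, attempts = process_with_retries(lambda: next(outcomes))
assert (status, attempts) == (200, 3)

# With 503 in skip_retry_conditions, the error is returned immediately.
status, attempts = process_with_retries(lambda: 503,
                                        skip_retry_conditions={503})
assert (status, attempts) == (503, 1)
```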

Your App Doesn’t Know About the Queue

Your endpoint code receives every request as a regular HTTP call — it doesn’t matter whether the caller used fal.run (synchronous), queue.fal.run (queued), or ws.fal.run (WebSocket). The queue and dispatch layer are transparent to your app code.

How Caching Works

fal uses a multi-layer caching system to reduce cold start times as your application serves traffic. fal’s caching system has three layers, each with different performance characteristics:
Cache Layer         Speed     Scope                    Use Case
Local Node Cache    Fastest   Same physical machine    Runners on the same node
Distributed Cache   Fast      Same datacenter/region   Runners across nodes
Object Store        Moderate  Global                   Fallback for cache misses
When a runner needs a file (model weights, Docker layers, etc.):
  1. Check local node cache first (fastest)
  2. If not found, check distributed datacenter cache
  3. If not found, fetch from object store and populate caches
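The three-layer lookup can be sketched as a fall-through with write-back, an illustrative model of the behavior described above:

```python
def fetch(key: str, local: dict, distributed: dict, object_store: dict):
    """Check caches fastest-first; on a miss, fetch from the object
    store and populate both cache layers on the way back."""
    if key in local:                      # 1. local node cache
        return local[key], "local"
    if key in distributed:                # 2. distributed datacenter cache
        local[key] = distributed[key]     #    warm the local cache too
        return local[key], "distributed"
    value = object_store[key]             # 3. global object store
    distributed[key] = value              #    populate both cache layers
    local[key] = value
    return value, "object_store"


local, distributed = {}, {}
store = {"weights.bin": b"\x00" * 4}

# First runner on a cold node: everything comes from the object store.
_, layer = fetch("weights.bin", local, distributed, store)
assert layer == "object_store"
# A later lookup on the same node hits the local cache.
_, layer = fetch("weights.bin", local, distributed, store)
assert layer == "local"
# A runner on another node in the same datacenter hits the distributed cache.
_, layer = fetch("weights.bin", {}, distributed, store)
assert layer == "distributed"
```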
What gets cached:
  • Docker Image Layers — container images are split into layers, each cached independently
  • Model Weights — files downloaded to /data are automatically cached (including HuggingFace models at /data/.cache/huggingface)
  • Compiled Model Caches — PyTorch Inductor compiled models, TensorRT engines, and other JIT compilation artifacts
As your application serves requests, caches automatically warm up. The first runner downloads everything from the object store and populates the local cache. Subsequent runners on the same node benefit from that local cache, and runners on other nodes benefit from the distributed cache. Over time, cold starts get progressively faster. See Use Persistent Storage for details on the /data caching system.