What is a Runner?

A runner is a compute instance of your application running on fal’s infrastructure. Each runner is tied to a specific machine type that determines its hardware resources (CPU cores, RAM, GPU type and count). When you deploy an application, fal automatically creates and manages runners that:
  • Run on your configured machine type (e.g., GPU-H100, GPU-A100)
  • Can have 1-8 GPUs depending on your num_gpus configuration
  • Load your model and dependencies during startup
  • Serve requests from your users
  • Scale up and down based on demand
  • Share cached resources to improve performance
Each runner is an isolated environment with its own copy of your application code and loaded models. Machine type configuration example:
import fal

class MyApp(fal.App):
    machine_type = "GPU-H100"  # Specify GPU type
    num_gpus = 2               # Request 2 GPUs per runner
    # ...
For details on available machine types and how to choose the right one, see Machine Types. For multi-GPU workloads, see Multi-GPU Workloads documentation.

Runner Lifecycle and States

Runners transition through different states during their lifecycle. Understanding these states helps you monitor performance and debug issues.

Runner States

State        | Description
PENDING      | Runner is waiting to be scheduled on available hardware
DOCKER_PULL  | Pulling Docker images from the registry (if using a custom container)
SETUP        | Running setup() - loading models and initializing resources
IDLE         | Ready and waiting for work - no active requests
RUNNING      | Actively processing one or more requests
DRAINING     | Finishing current requests, won't accept new ones
TERMINATING  | Shutting down, running teardown() if defined
TERMINATED   | Runner has stopped and resources are released

State Transitions Explained

Startup Flow (PENDING → DOCKER_PULL → SETUP → IDLE):
  1. When demand increases, fal schedules a new runner
  2. If using a custom container, Docker images are pulled
  3. Your setup() method runs to load models and initialize resources
  4. Runner enters IDLE state, ready to serve requests
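To make these phases concrete, here is a minimal sketch of where each kind of work happens (the gpt2 pipeline, the input/output models, and the endpoint path are illustrative assumptions, not something fal prescribes):
import fal
from pydantic import BaseModel
from transformers import pipeline  # assumes transformers is in your app's dependencies


class TextInput(BaseModel):
    prompt: str


class TextOutput(BaseModel):
    result: str


class MyApp(fal.App):
    machine_type = "GPU-H100"

    def setup(self):
        # Runs once while the runner is in the SETUP state: download weights,
        # build the pipeline, and do any warm-up work here, not per request.
        self.pipe = pipeline("text-generation", model="gpt2")

    @fal.endpoint("/")
    def generate(self, input: TextInput) -> TextOutput:
        # Runs while the runner is in the RUNNING state, once per request.
        text = self.pipe(input.prompt, max_new_tokens=32)[0]["generated_text"]
        return TextOutput(result=text)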
Request Processing (IDLE → RUNNING):
  • When a request arrives, an IDLE runner transitions to RUNNING
  • After completing all requests, it returns to IDLE
  • Runners can handle multiple concurrent requests if max_multiplexing > 1
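A sketch of that setting, assuming it is configured on the class like machine_type in the earlier example (the value 4 is illustrative):
import fal


class MyApp(fal.App):
    machine_type = "GPU-H100"
    # Let a single runner work on up to 4 requests at once; it stays in
    # RUNNING until all in-flight requests finish, then returns to IDLE.
    max_multiplexing = 4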
Shutdown Flow (DRAINING → TERMINATING → TERMINATED):
  1. When scaling down or reaching expiration, runners enter DRAINING
  2. No new requests are routed, but existing requests continue
  3. After requests complete (or time out), the runner enters TERMINATING
  4. Your teardown() method runs for cleanup
  5. Runner is terminated and resources are freed
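A minimal sketch of the optional teardown() hook; the log file here is only an illustration of a resource that needs explicit cleanup:
import fal


class MyApp(fal.App):
    machine_type = "GPU-H100"

    def setup(self):
        # Illustrative resource acquired at startup.
        self.log_file = open("/tmp/requests.log", "a")

    def teardown(self):
        # Runs while the runner is in the TERMINATING state, after in-flight
        # requests have drained: flush buffers, close connections, and so on.
        self.log_file.close()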

How Caching Works in fal

fal uses a sophisticated multi-layer caching system to reduce cold start times as your application serves traffic.

Multi-Layer Cache Architecture

fal’s caching system has three layers, each with different performance characteristics:
Cache Layer       | Speed    | Scope                  | Use Case
Local Node Cache  | Fastest  | Same physical machine  | Runners on the same node
Distributed Cache | Fast     | Same datacenter/region | Runners across nodes
Object Store      | Moderate | Global                 | Fallback for cache misses
When a runner needs a file (model weights, Docker layers, etc.):
  1. Check local node cache first (fastest)
  2. If not found, check distributed datacenter cache
  3. If not found, fetch from object store and populate caches

What Gets Cached

Docker Image Layers:
  • Container images are split into layers
  • Each layer is cached independently
  • Shared layers across images are reused
Model Weights:
  • Files downloaded to /data are automatically cached
  • HuggingFace models cached at /data/.cache/huggingface
  • Custom weights you download are cached
Compiled Model Caches:
  • PyTorch Inductor compiled models
  • TensorRT engines
  • Other JIT compilation artifacts
See Use Persistent Storage for details on the /data caching system.
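For example, here is a sketch of downloading weights into the cached /data path during setup(); the Hugging Face repo name is illustrative:
import fal
from huggingface_hub import snapshot_download


class MyApp(fal.App):
    machine_type = "GPU-H100"

    def setup(self):
        # Files written under /data are picked up by fal's caches, so the
        # first runner pays the download cost and later runners reuse it.
        self.weights_dir = snapshot_download(
            "stabilityai/stable-diffusion-xl-base-1.0",  # illustrative repo
            cache_dir="/data/.cache/huggingface",        # cached path mentioned above
        )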

Cache Warming with Traffic

As your application serves requests, caches automatically warm up:
  1. First runner: Downloads everything from object store, populates local cache
  2. Same-node runners: Benefit from the local cache
  3. Other-node runners: Benefit from the distributed cache
  4. Over time: More nodes have cached data, cold starts get faster
This cache warming effect is why production performance improves significantly over time.

Monitoring Runner States

Understanding your runner states helps optimize performance and debug issues.

CLI Commands

# List all runners with their current states
fal runners list

# Filter by specific state
fal runners list --state idle
fal runners list --state running
fal runners list --state pending

# View runner history (up to 24 hours)
fal runners list --since "1h"
fal runners list --since "2024-01-15T10:00:00Z"

# Get detailed information about a specific runner
fal runners get <runner-id>
See fal runners CLI reference for all available commands.

Dashboard Metrics

The fal dashboard provides visual monitoring:
  • Runner state timeline: See state transitions over time
  • State duration breakdown: Understand where time is spent (PENDING, SETUP, RUNNING)
  • Active runners: Monitor current runner count and states
  • Cold start metrics: Track setup duration and cache effectiveness

Best Practices

Optimize Startup Performance

  • Minimize image size: Use smaller base images, multi-stage builds
  • Lazy loading: Load only what you need in setup()
  • Use persistent storage: Download models to /data for caching
  • Compiled caches: Share compilation artifacts across runners
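One way to share compilation artifacts, assuming a PyTorch 2.x app that uses torch.compile: point Inductor's on-disk cache at /data (TORCHINDUCTOR_CACHE_DIR is PyTorch's environment variable, not fal's, and the model here is a placeholder):
import os

# Set before the first compilation so Inductor writes its artifacts to
# persistent storage, where later runners can reuse them.
os.environ.setdefault("TORCHINDUCTOR_CACHE_DIR", "/data/.cache/torchinductor")

import fal
import torch


class MyApp(fal.App):
    machine_type = "GPU-H100"

    def setup(self):
        # Compile and trigger one warm-up call so compilation happens during
        # SETUP rather than on the first user request.
        self.model = torch.compile(torch.nn.Linear(1024, 1024).cuda())
        self.model(torch.randn(1, 1024, device="cuda"))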

Maintain Warm Runners

  • Set appropriate keep_alive: Balance cost vs latency based on your traffic patterns
  • Use min_concurrency: Keep minimum runners warm for predictable latency
  • Monitor IDLE runners: Understand how many runners are waiting vs actively serving
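A sketch of those two settings, shown as class attributes to match the earlier examples; the values are illustrative and should be tuned to your traffic:
import fal


class MyApp(fal.App):
    machine_type = "GPU-H100"
    # Keep an idle runner around after its last request so the next request
    # hits a warm runner instead of triggering a new SETUP.
    keep_alive = 300  # seconds; illustrative value
    # Always keep at least one runner warm for predictable latency.
    min_concurrency = 1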

Monitor and Debug

  • Track state durations: Identify bottlenecks in startup sequence
  • Watch PENDING times: High PENDING times indicate capacity constraints
  • Monitor IDLE → RUNNING: Understand warm start utilization
  • Review TERMINATED runners: Debug failures using --since flag

Scale Effectively

  • Start conservative: Begin with min_concurrency = 0 or 1
  • Monitor and adjust: Use dashboard metrics to tune scaling parameters
  • Plan for traffic spikes: Use concurrency_buffer for headroom
  • Test production patterns: Simulate realistic traffic during testing
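A sketch of a conservative starting configuration with spike headroom, assuming these options are set on the class like the earlier examples (values are illustrative):
import fal


class MyApp(fal.App):
    machine_type = "GPU-H100"
    # Start conservative: no always-on runners until traffic justifies them.
    min_concurrency = 0
    # Keep some extra capacity warm beyond current demand so a sudden burst
    # of requests does not have to wait for new runners to start up.
    concurrency_buffer = 2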