What is a Runner?
A runner is a compute instance of your application running on fal’s infrastructure. Each runner is tied to a specific machine type that determines its hardware resources (CPU cores, RAM, GPU type and count). When you deploy an application, fal automatically creates and manages runners that:

- Run on your configured machine type (e.g., GPU-H100, GPU-A100)
- Can have 1-8 GPUs depending on your num_gpus configuration
- Load your model and dependencies during startup
- Serve requests from your users
- Scale up and down based on demand
- Share cached resources to improve performance
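For context, a minimal application definition with these settings might look like the sketch below. It assumes the fal Python SDK's fal.App class; the attribute names, their placement, and the endpoint itself are illustrative rather than confirmed syntax.

```python
import fal
from pydantic import BaseModel


class Input(BaseModel):
    prompt: str


class Output(BaseModel):
    text: str


class MyApp(fal.App):  # sketch only; check the fal SDK reference for exact options
    machine_type = "GPU-H100"  # hardware the runners are scheduled on
    num_gpus = 1               # GPUs per runner (assumed attribute placement)

    def setup(self):
        # Runs once per runner during the SETUP state (model loading goes here).
        ...

    @fal.endpoint("/")
    def run(self, request: Input) -> Output:
        # Each call is served by an IDLE runner, which moves to RUNNING.
        return Output(text=f"echo: {request.prompt}")
```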
Runner Lifecycle and States
Runners transition through different states during their lifecycle. Understanding these states helps you monitor performance and debug issues.

Runner States
| State | Description |
|---|---|
| PENDING | Runner is waiting to be scheduled on available hardware |
| DOCKER_PULL | Pulling Docker images from registry (if using custom container) |
| SETUP | Running setup() method - loading model and initializing resources |
| IDLE | Ready and waiting for work - no active requests |
| RUNNING | Actively processing one or more requests |
| DRAINING | Finishing current requests, won’t accept new ones |
| TERMINATING | Shutting down, running teardown() if defined |
| TERMINATED | Runner has stopped and resources are released |
State Transitions Explained
Startup Flow (PENDING → DOCKER_PULL → SETUP → IDLE):
- When demand increases, fal schedules a new runner
- If using a custom container, Docker images are pulled
- Your setup() method runs to load models and initialize resources
- The runner enters the IDLE state, ready to serve requests
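For illustration, the work that happens during the SETUP state usually lives in setup(), roughly as sketched below; the library, model, and checkpoint are hypothetical choices, not part of fal's API.

```python
import fal


class DiffusionApp(fal.App):
    machine_type = "GPU-H100"

    def setup(self):
        # Heavy imports and weight loading happen here, while the runner is in
        # the SETUP state, so module import stays cheap.
        import torch
        from diffusers import StableDiffusionPipeline  # hypothetical model stack

        self.pipe = StableDiffusionPipeline.from_pretrained(
            "runwayml/stable-diffusion-v1-5",  # hypothetical checkpoint
            torch_dtype=torch.float16,
        ).to("cuda")
        # When setup() returns, the runner transitions to IDLE.
```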
Serving Flow (IDLE ↔ RUNNING):
- When a request arrives, an IDLE runner transitions to RUNNING
- After completing all requests, it returns to IDLE
- Runners can handle multiple concurrent requests if max_multiplexing > 1
Shutdown Flow (DRAINING → TERMINATING → TERMINATED):
- When scaling down or reaching expiration, runners enter DRAINING
- No new requests are routed, but existing requests continue
- After requests complete (or timeout), runner enters TERMINATING
- Your teardown() method runs for cleanup
- The runner is terminated and resources are freed
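A teardown() hook, when you define one, might look like this minimal sketch; whether you need one depends on what setup() acquires, and the file handle here is just an example.

```python
import fal


class MyApp(fal.App):
    def setup(self):
        # Example resource acquired at startup (hypothetical).
        self.log_file = open("/tmp/requests.log", "a")

    def teardown(self):
        # Runs while the runner is TERMINATING, after in-flight requests drain,
        # giving you a chance to flush and release resources.
        self.log_file.close()
```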
How Caching Works in fal
fal uses a sophisticated multi-layer caching system to reduce cold start times as your application serves traffic.

Multi-Layer Cache Architecture

fal’s caching system has three layers, each with different performance characteristics:

| Cache Layer | Speed | Scope | Use Case |
|---|---|---|---|
| Local Node Cache | Fastest | Same physical machine | Runners on same node |
| Distributed Cache | Fast | Same datacenter/region | Runners across nodes |
| Object Store | Moderate | Global | Fallback for cache misses |
Cache lookups proceed from the fastest layer to the slowest:

- Check local node cache first (fastest)
- If not found, check distributed datacenter cache
- If not found, fetch from object store and populate caches
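Conceptually, the fall-through behaves like the sketch below. This is illustrative Python, not fal's implementation; the function name and data structures are invented for the example.

```python
def fetch(key, local_cache, distributed_cache, object_store):
    """Illustrative three-tier lookup with backfill (not fal's actual code)."""
    if key in local_cache:              # fastest: same physical machine
        return local_cache[key]
    if key in distributed_cache:        # fast: same datacenter/region
        value = distributed_cache[key]
        local_cache[key] = value        # backfill the faster layer
        return value
    value = object_store[key]           # moderate: global fallback
    distributed_cache[key] = value      # populate both caches for next time
    local_cache[key] = value
    return value
```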
What Gets Cached
Docker Image Layers:
- Container images are split into layers
- Each layer is cached independently
- Shared layers across images are reused

Model Weights and Files:
- Files downloaded to /data are automatically cached
- HuggingFace models are cached at /data/.cache/huggingface
- Custom weights you download are cached

Compiled Artifacts:
- PyTorch Inductor compiled models
- TensorRT engines
- Other JIT compilation artifacts
See Use Persistent Storage for more details on the /data caching system.
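As an example of how this works in practice, pointing the HuggingFace cache at /data means the first download is reused by later runners. HF_HOME is HuggingFace's standard cache variable; whether fal already sets it for you, and the model name below, are assumptions made for illustration.

```python
import os

# Direct the HuggingFace cache into fal's persistent /data volume (the path
# noted above) so downloaded weights survive beyond a single runner.
os.environ.setdefault("HF_HOME", "/data/.cache/huggingface")

from transformers import AutoModel

# The first runner downloads the weights; runners on the same node hit the
# local cache, and runners elsewhere hit the distributed cache or object store.
model = AutoModel.from_pretrained("bert-base-uncased")  # hypothetical model
```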
Cache Warming with Traffic
As your application serves requests, caches automatically warm up:

- First runner: Downloads everything from object store, populates local cache
- Same node runners: Benefit from local cache
- Other node runners: Benefit from distributed cache
- Over time: More nodes have cached data, cold starts get faster
Monitoring Runner States
Understanding your runner states helps optimize performance and debug issues.

CLI Commands

The fal runners command lets you inspect your runners and their current states from the terminal. See the fal runners CLI reference for all available commands.
Dashboard Metrics
The fal dashboard provides visual monitoring:

- Runner state timeline: See state transitions over time
- State duration breakdown: Understand where time is spent (PENDING, SETUP, RUNNING)
- Active runners: Monitor current runner count and states
- Cold start metrics: Track setup duration and cache effectiveness
Best Practices
Optimize Startup Performance
- Minimize image size: Use smaller base images and multi-stage builds
- Lazy loading: Load only what you need in setup()
- Use persistent storage: Download models to /data for caching
- Compiled caches: Share compilation artifacts across runners
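As one hedged illustration of the last point: PyTorch's inductor cache directory can be redirected into /data so compiled kernels persist across runners. TORCHINDUCTOR_CACHE_DIR is a standard PyTorch setting and the path below is illustrative; fal's Optimize Startup with Compiled Caches guide describes the mechanism fal itself supports.

```python
import os

# Keep torch.compile / inductor artifacts on the persistent /data volume so a
# later runner can reuse them instead of recompiling.
os.environ.setdefault("TORCHINDUCTOR_CACHE_DIR", "/data/.cache/torchinductor")

import torch

model = torch.nn.Linear(8, 8).cuda()
compiled = torch.compile(model)  # compilation artifacts land in the cache dir
```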
Maintain Warm Runners
- Set appropriate keep_alive: Balance cost vs. latency based on your traffic patterns
- Use min_concurrency: Keep a minimum number of runners warm for predictable latency
- Monitor IDLE runners: Understand how many runners are waiting vs. actively serving
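For example, a latency-sensitive app might combine these two settings as sketched below; where keep_alive and min_concurrency are declared (class keywords here) is an assumption about the fal SDK, so check its reference for the current syntax.

```python
import fal


class WarmApp(
    fal.App,
    keep_alive=300,      # keep an idle runner alive for 300s after its last request
    min_concurrency=1,   # always keep at least one runner warm
):
    machine_type = "GPU-A100"

    def setup(self):
        ...  # model loading as usual
```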
Monitor and Debug
- Track state durations: Identify bottlenecks in startup sequence
- Watch PENDING times: High PENDING times indicate capacity constraints
- Monitor IDLE → RUNNING: Understand warm start utilization
- Review TERMINATED runners: Debug failures using the --since flag
Scale Effectively
- Start conservative: Begin with min_concurrency = 0 or 1
- Monitor and adjust: Use dashboard metrics to tune scaling parameters
- Plan for traffic spikes: Use concurrency_buffer for headroom
- Test production patterns: Simulate realistic traffic during testing
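Putting the scaling knobs together, a configuration for spiky traffic might look like the sketch below; as above, the keyword placement and the max_concurrency name are assumptions about the fal SDK rather than confirmed syntax.

```python
import fal


class SpikyTrafficApp(
    fal.App,
    min_concurrency=1,       # floor of warm runners for baseline latency
    max_concurrency=10,      # ceiling on simultaneous runners under load
    concurrency_buffer=2,    # extra headroom kept ready for sudden spikes
):
    machine_type = "GPU-H100"
```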
Related Resources
- Optimizing Cold Starts - Strategies to reduce cold start times
- Scale Your Application - Detailed scaling configuration
- Monitor Performance - Performance monitoring and metrics
- Use Persistent Storage - Details on the /data caching system
- Optimize Startup with Compiled Caches - Share compilation artifacts
- Core Concepts - Learn about setup() and teardown() lifecycle methods