
Cold Starts vs Warm Starts

A cold start occurs when a new runner needs to be created from scratch. The runner goes through PENDING → DOCKER_PULL → SETUP → IDLE before it can serve requests. A warm start occurs when an existing IDLE runner is reused to handle a new request: IDLE → RUNNING.
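The lifecycle above can be sketched as a small transition map. The state names come from this page; the code itself is only an illustration, not part of the fal SDK:

```python
# Allowed runner state transitions, as described above: a cold start walks
# the full path PENDING -> DOCKER_PULL -> SETUP -> IDLE -> RUNNING, while a
# warm start reuses an IDLE runner directly.
TRANSITIONS = {
    "PENDING": "DOCKER_PULL",
    "DOCKER_PULL": "SETUP",
    "SETUP": "IDLE",
    "IDLE": "RUNNING",
}

def steps_to_serve(state: str) -> int:
    """Number of transitions before a runner in `state` is RUNNING."""
    count = 0
    while state != "RUNNING":
        state = TRANSITIONS[state]
        count += 1
    return count
```

A cold start from PENDING takes four transitions before the request is served; a warm start from IDLE takes one, which is why warm starts are so much faster.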

What Triggers Cold Starts

  • No warm runners available (all busy or expired)
  • Traffic spike exceeds warm runner capacity
  • First deployment
  • Runners expired during low traffic periods

Factors Affecting Cold Start Duration

  • Image size: Larger Docker images take longer to pull
  • Model size: Larger models take longer to download and load
  • Setup complexity: Complex initialization in setup() adds time
  • Cache state: First runs are slower, subsequent runs benefit from caching
  • Hardware availability: GPU availability varies by region and time

Scaling Parameters

The most effective way to reduce cold starts is maintaining warm runners using scaling parameters.

keep_alive

Default: 10 seconds
Keep runners alive after their last request completes.
class MyApp(fal.App):
    keep_alive = 300  # Keep alive for 5 minutes
Benefits: Runners stay warm between requests, reducing cold starts for sporadic traffic.
Trade-offs: A longer keep_alive means higher costs; a shorter one means more cold starts.
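The expiry rule can be illustrated with a small helper. This is a sketch of the behavior described above, not fal's actual scheduler code, and the function name is made up:

```python
def runner_expired(last_request_end: float, now: float,
                   keep_alive: float = 10.0) -> bool:
    """A runner expires once it has sat idle longer than keep_alive seconds
    after its last request completed (times in seconds)."""
    return now - last_request_end > keep_alive
```

With the default of 10 seconds, a runner idle for 11 seconds has expired; with keep_alive = 300 it would still be warm at that point.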

min_concurrency

Default: 0
Maintain a minimum number of runners at all times, regardless of traffic.
class MyApp(fal.App):
    min_concurrency = 2  # Always keep 2 runners warm
Benefits: Guarantees warm runners are always available, eliminating cold starts for baseline capacity.
Trade-offs: Costs money even with zero traffic.
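The floor that min_concurrency puts under the runner count can be sketched as (an illustration of the behavior described above, not fal's internal logic):

```python
def target_runners(in_flight_requests: int, min_concurrency: int = 0) -> int:
    """Desired runner count: demand-driven, but never below min_concurrency,
    even when there is no traffic at all."""
    return max(in_flight_requests, min_concurrency)
```

With min_concurrency = 2, two runners stay warm at zero traffic (and keep billing), while heavier traffic scales past the floor as usual.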

concurrency_buffer

Default: 0
Maintain extra runners beyond current demand.
class MyApp(fal.App):
    concurrency_buffer = 2  # Keep 2 extra runners ready
Benefits: Provides a cushion for sudden traffic increases, reducing cold starts during bursts.
Trade-offs: Higher cost at all traffic levels.
Note: Takes precedence over min_concurrency when higher.

concurrency_buffer_perc

Default: 0
Set the buffer as a percentage of current request volume.
class MyApp(fal.App):
    concurrency_buffer_perc = 20  # 20% buffer
Benefits: Scales the buffer with traffic automatically.
Trade-offs: No buffer during zero traffic; expensive during high traffic.
Note: The actual buffer is the maximum of concurrency_buffer and concurrency_buffer_perc / 100 * request volume.
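The note above — the larger of the absolute and percentage buffers wins — can be sketched as follows. Rounding the percentage buffer up is an assumption here; fal's internal rounding may differ:

```python
import math

def effective_buffer(request_volume: int,
                     concurrency_buffer: int = 0,
                     concurrency_buffer_perc: float = 0.0) -> int:
    """Extra runners kept beyond demand: the maximum of the absolute buffer
    and the percentage-of-volume buffer (ceiling division assumed)."""
    perc_buffer = math.ceil(concurrency_buffer_perc / 100 * request_volume)
    return max(concurrency_buffer, perc_buffer)
```

With concurrency_buffer = 2 and concurrency_buffer_perc = 20, the buffer stays at 2 runners up to 10 concurrent requests and grows with volume beyond that — and the percentage buffer alone contributes nothing at zero traffic, matching the trade-off noted above.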

max_multiplexing

Default: 1 (code-specific parameter)
Number of concurrent requests each runner handles simultaneously.
class MyApp(fal.App):
    max_multiplexing = 4  # Each runner handles up to 4 requests
Benefits: Fewer runners needed, fewer cold starts, better resource utilization.
Trade-offs: Requires async handlers, each request gets fewer resources, and it is not suitable for all workloads.
Note: This is a code-specific parameter; CLI changes reset on the next deployment.
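The async-handler requirement exists because a single runner must interleave several requests at once. The cap can be illustrated with a plain asyncio semaphore — a stand-in for what the platform does per runner, not fal code:

```python
import asyncio

MAX_MULTIPLEXING = 4  # value from the example above

async def runner(requests) -> int:
    """Process requests concurrently, at most MAX_MULTIPLEXING in flight
    at once; returns the peak concurrency actually observed."""
    semaphore = asyncio.Semaphore(MAX_MULTIPLEXING)
    peak = 0
    in_flight = 0

    async def handle(req):
        nonlocal peak, in_flight
        async with semaphore:          # blocks once 4 requests are in flight
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for real async work
            in_flight -= 1

    await asyncio.gather(*(handle(r) for r in requests))
    return peak

peak = asyncio.run(runner(range(10)))
```

Ten queued requests are served by one runner, but never more than four at a time — which is why handlers doing blocking (non-async) work would serialize and defeat the point.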

scaling_delay

Default: 0 seconds
Wait time before scaling up when a request is queued.
class MyApp(fal.App):
    scaling_delay = 30  # Wait 30 seconds before scaling
Benefits: Prevents premature scaling during brief spikes, reducing unnecessary cold starts.
Trade-offs: Requests wait longer during genuine traffic increases.
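The trade-off can be sketched as a simple decision rule. This is an illustration of the behavior described above under the assumption that the delay is measured against a queued request's wait time; it is not fal's actual autoscaler:

```python
def should_scale_up(queue_wait_seconds: float,
                    scaling_delay: float = 0.0) -> bool:
    """Spin up a new runner only once a queued request has waited at least
    scaling_delay seconds; brief spikes drain before any scale-up fires."""
    return queue_wait_seconds >= scaling_delay
```

With the default of 0, any queued request triggers an immediate scale-up; with scaling_delay = 30, a request that a warm runner picks up within 30 seconds never causes a new (cold) runner to start.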

startup_timeout

Default: varies (code-specific parameter)
Maximum time allowed for setup() to complete.
class MyApp(fal.App):
    startup_timeout = 600  # 10 minutes for setup
Benefits: Prevents runners from being killed during long setups and accommodates large model loading.
Trade-offs: Doesn't reduce cold starts (it only prevents failed startups), and long timeouts can mask real issues.
Note: This is a code-specific parameter; CLI changes reset on the next deployment.

Other Optimization Strategies

  • Image optimization: Use smaller base images and multi-stage builds. See Optimize Container Images.
  • Persistent storage: Download models to /data for automatic caching. See Use Persistent Storage.
  • Compiled caches: Share compilation artifacts across runners. See Optimize Startup with Compiled Caches.
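The persistent-storage pattern amounts to a download-once check inside setup(). A minimal sketch, assuming the /data mount from this page; the `download` callable is a hypothetical placeholder for whatever fetches your weights:

```python
from pathlib import Path

def ensure_model(name: str, download, data_dir: str = "/data") -> Path:
    """Return the cached model path under data_dir, downloading only on the
    first run. `download` is a hypothetical callable (Path -> None) that
    fetches the weights; later runners find the file already cached."""
    path = Path(data_dir) / name
    if not path.exists():  # cold cache: pay the download cost exactly once
        path.parent.mkdir(parents=True, exist_ok=True)
        download(path)
    return path            # warm cache: reused across runner restarts
```

The first cold start pays for the download; every subsequent cold start only pays for loading the model from /data, which is typically much faster.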

Cost Considerations

More warm runners = lower latency but higher cost. Balance based on your needs:
  • Latency-critical apps: Accept higher cost for warm runners
  • Cost-sensitive apps: Optimize cold starts, accept some latency
  • Variable traffic: Use buffers and scaling delays