
Cold Starts vs Warm Starts

A cold start occurs when a new runner needs to be created from scratch. The runner goes through PENDING → DOCKER_PULL → SETUP → IDLE before it can serve requests. A warm start occurs when an existing IDLE runner is reused to handle a new request: IDLE → RUNNING.
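The lifecycle above can be sketched as a small transition map. The state names come from this page; the code itself is only an illustration, not part of the fal SDK:

```python
# Allowed runner state transitions, as described above: a cold start walks
# the full path PENDING -> DOCKER_PULL -> SETUP -> IDLE -> RUNNING, while a
# warm start reuses an IDLE runner directly.
TRANSITIONS = {
    "PENDING": "DOCKER_PULL",
    "DOCKER_PULL": "SETUP",
    "SETUP": "IDLE",
    "IDLE": "RUNNING",
}

def steps_to_serve(state: str) -> int:
    """Number of transitions before a runner in `state` is RUNNING."""
    count = 0
    while state != "RUNNING":
        state = TRANSITIONS[state]
        count += 1
    return count
```

A cold start from PENDING takes four transitions before the request is served; a warm start from IDLE takes one, which is why warm starts are so much faster.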

What Triggers Cold Starts

  • No warm runners available (all busy or expired)
  • Traffic spike exceeds warm runner capacity
  • First deployment
  • Runners expired during low traffic periods

Factors Affecting Cold Start Duration

  • Image size: Larger Docker images take longer to pull
  • Model size: Larger models take longer to download and load
  • Setup complexity: Complex initialization in setup() adds time
  • Cache state: First runs are slower, subsequent runs benefit from caching
  • Hardware availability: GPU availability varies by region and time

Scaling Parameters

The most effective way to reduce cold starts is maintaining warm runners using scaling parameters.

keep_alive

Default: 10 seconds
Keep runners alive after their last request completes.
class MyApp(fal.App):
    keep_alive = 300  # Keep alive for 5 minutes
Benefits: Runners stay warm between requests, reducing cold starts for sporadic traffic.
Trade-offs: A longer keep_alive means higher costs; a shorter one means more cold starts.
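The expiry rule can be illustrated with a small helper. This is a sketch of the behavior described above, not fal's actual scheduler code, and the function name is made up:

```python
def runner_expired(last_request_end: float, now: float,
                   keep_alive: float = 10.0) -> bool:
    """A runner expires once it has sat idle longer than keep_alive seconds
    after its last request completed (times in seconds)."""
    return now - last_request_end > keep_alive
```

With the default of 10 seconds, a runner idle for 11 seconds has expired; with keep_alive = 300 it would still be warm at that point.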

min_concurrency

Default: 0
Maintain a minimum number of runners at all times, regardless of traffic.
class MyApp(fal.App):
    min_concurrency = 2  # Always keep 2 runners warm
Benefits: Guarantees warm runners are always available, eliminating cold starts for baseline capacity.
Trade-offs: Costs money even with zero traffic.
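The floor that min_concurrency puts under the runner count can be sketched as (an illustration of the behavior described above, not fal's internal logic):

```python
def target_runners(in_flight_requests: int, min_concurrency: int = 0) -> int:
    """Desired runner count: demand-driven, but never below min_concurrency,
    even when there is no traffic at all."""
    return max(in_flight_requests, min_concurrency)
```

With min_concurrency = 2, two runners stay warm at zero traffic (and keep billing), while heavier traffic scales past the floor as usual.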

concurrency_buffer

Default: 0
Maintain extra runners beyond current demand.
class MyApp(fal.App):
    concurrency_buffer = 2  # Keep 2 extra runners ready
Benefits: Provides a cushion for sudden traffic increases, reducing cold starts during bursts.
Trade-offs: Higher cost at all traffic levels.
Note: Takes precedence over min_concurrency when higher.

concurrency_buffer_perc

Default: 0
Set the buffer as a percentage of current request volume.
class MyApp(fal.App):
    concurrency_buffer_perc = 20  # 20% buffer
Benefits: Scales the buffer with traffic automatically.
Trade-offs: No buffer during zero traffic; expensive during high traffic.
Note: The actual buffer is the maximum of concurrency_buffer and concurrency_buffer_perc / 100 * request volume.
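The note above — the larger of the absolute and percentage buffers wins — can be sketched as follows. Rounding the percentage buffer up is an assumption here; fal's internal rounding may differ:

```python
import math

def effective_buffer(request_volume: int,
                     concurrency_buffer: int = 0,
                     concurrency_buffer_perc: float = 0.0) -> int:
    """Extra runners kept beyond demand: the maximum of the absolute buffer
    and the percentage-of-volume buffer (ceiling division assumed)."""
    perc_buffer = math.ceil(concurrency_buffer_perc / 100 * request_volume)
    return max(concurrency_buffer, perc_buffer)
```

With concurrency_buffer = 2 and concurrency_buffer_perc = 20, the buffer stays at 2 runners up to 10 concurrent requests and grows with volume beyond that — and the percentage buffer alone contributes nothing at zero traffic, matching the trade-off noted above.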

max_multiplexing

Default: 1 (code-specific parameter)
Number of concurrent requests each runner handles simultaneously.
class MyApp(fal.App):
    max_multiplexing = 4  # Each runner handles up to 4 requests
Benefits: Fewer runners needed, fewer cold starts, better resource utilization.
Trade-offs: Requires async handlers, each request gets fewer resources, and it is not suitable for all workloads.
Note: This is a code-specific parameter; CLI changes reset on the next deployment.
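The async-handler requirement exists because a single runner must interleave several requests at once. The cap can be illustrated with a plain asyncio semaphore — a stand-in for what the platform does per runner, not fal code:

```python
import asyncio

MAX_MULTIPLEXING = 4  # value from the example above

async def runner(requests) -> int:
    """Process requests concurrently, at most MAX_MULTIPLEXING in flight
    at once; returns the peak concurrency actually observed."""
    semaphore = asyncio.Semaphore(MAX_MULTIPLEXING)
    peak = 0
    in_flight = 0

    async def handle(req):
        nonlocal peak, in_flight
        async with semaphore:          # blocks once 4 requests are in flight
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for real async work
            in_flight -= 1

    await asyncio.gather(*(handle(r) for r in requests))
    return peak

peak = asyncio.run(runner(range(10)))
```

Ten queued requests are served by one runner, but never more than four at a time — which is why handlers doing blocking (non-async) work would serialize and defeat the point.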

scaling_delay

Default: 0 seconds
Wait time before scaling up when a request is queued.
class MyApp(fal.App):
    scaling_delay = 30  # Wait 30 seconds before scaling
Benefits: Prevents premature scaling during brief spikes, reducing unnecessary cold starts.
Trade-offs: Requests wait longer during genuine traffic increases.
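The trade-off can be sketched as a simple decision rule. This is an illustration of the behavior described above under the assumption that the delay is measured against a queued request's wait time; it is not fal's actual autoscaler:

```python
def should_scale_up(queue_wait_seconds: float,
                    scaling_delay: float = 0.0) -> bool:
    """Spin up a new runner only once a queued request has waited at least
    scaling_delay seconds; brief spikes drain before any scale-up fires."""
    return queue_wait_seconds >= scaling_delay
```

With the default of 0, any queued request triggers an immediate scale-up; with scaling_delay = 30, a request that a warm runner picks up within 30 seconds never causes a new (cold) runner to start.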

startup_timeout

Default: varies (code-specific parameter)
Maximum time allowed for setup() to complete.
class MyApp(fal.App):
    startup_timeout = 600  # 10 minutes for setup
Benefits: Prevents runners from being killed during long setups and accommodates large model loading.
Trade-offs: Doesn't reduce cold starts (it only prevents failed startups), and long timeouts can mask real issues.
Note: This is a code-specific parameter; CLI changes reset on the next deployment.

Other Optimization Strategies

  • Image optimization: Use smaller base images and multi-stage builds. See Optimize Container Images.
  • Persistent storage: Download models to /data for automatic caching. See Use Persistent Storage.
  • Compiled caches: Share compilation artifacts across runners. See Optimize Startup with Compiled Caches.
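The persistent-storage pattern amounts to a download-once check inside setup(). A minimal sketch, assuming the /data mount from this page; the `download` callable is a hypothetical placeholder for whatever fetches your weights:

```python
from pathlib import Path

def ensure_model(name: str, download, data_dir: str = "/data") -> Path:
    """Return the cached model path under data_dir, downloading only on the
    first run. `download` is a hypothetical callable (Path -> None) that
    fetches the weights; later runners find the file already cached."""
    path = Path(data_dir) / name
    if not path.exists():  # cold cache: pay the download cost exactly once
        path.parent.mkdir(parents=True, exist_ok=True)
        download(path)
    return path            # warm cache: reused across runner restarts
```

The first cold start pays for the download; every subsequent cold start only pays for loading the model from /data, which is typically much faster.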

Cost Considerations

More warm runners = lower latency but higher cost. Balance based on your needs:
  • Latency-critical apps: Accept higher cost for warm runners
  • Cost-sensitive apps: Optimize cold starts, accept some latency
  • Variable traffic: Use buffers and scaling delays