Since Serverless billing covers the entire runner lifetime, the most effective cost optimizations focus on reducing idle time, maximizing utilization, and minimizing setup duration.

Reduce Idle Time

Idle time is the single biggest cost lever. Every second a runner is alive but not processing requests is billed.

Tune keep_alive

The keep_alive parameter controls how long a runner stays alive after finishing its last request. Lower values reduce idle billing but increase cold starts.
class MyApp(fal.App, keep_alive=30):  # 30 seconds (default: 60)
    ...
keep_alive        Trade-off
High (300s+)      Low latency, higher idle costs
Medium (30-60s)   Balanced for moderate traffic
Low (10-30s)      Cost-efficient for bursty workloads
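To make the trade-off concrete, here is a back-of-envelope model. The per-second price and the gap between requests are hypothetical placeholders, not fal's actual rates:

```python
# Back-of-envelope: idle cost added by keep_alive vs. cold starts avoided.
# PRICE_PER_SEC and GAP_BETWEEN_REQUESTS are hypothetical, not fal pricing.
PRICE_PER_SEC = 0.0003          # hypothetical GPU $/s
GAP_BETWEEN_REQUESTS = 45       # seconds between consecutive requests

def idle_cost_per_gap(keep_alive: int) -> float:
    """Idle seconds billed after each request, given the traffic gap."""
    return min(keep_alive, GAP_BETWEEN_REQUESTS) * PRICE_PER_SEC

def avoids_cold_start(keep_alive: int) -> bool:
    """The next request hits a warm runner only if it arrives in time."""
    return keep_alive >= GAP_BETWEEN_REQUESTS

for ka in (10, 30, 60, 300):
    print(ka, round(idle_cost_per_gap(ka), 5), avoids_cold_start(ka))
```

With a 45-second gap, keep_alive=60 pays for 45 idle seconds per request but keeps the runner warm; keep_alive=30 bills less idle time but every request becomes a cold start.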

Set min_concurrency carefully

Each runner reserved by min_concurrency runs continuously and is billed 24/7. Only use it for latency-critical applications that cannot tolerate cold starts.
class MyApp(fal.App, min_concurrency=0):  # No always-on runners
    ...
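The standing cost of min_concurrency is easy to estimate. The hourly rate below is a hypothetical placeholder, not fal pricing:

```python
# Rough monthly floor set by always-on runners (hypothetical rate, not fal pricing).
PRICE_PER_HOUR = 1.10           # hypothetical GPU $/hr
HOURS_PER_MONTH = 24 * 30

def monthly_floor(min_concurrency: int) -> float:
    """min_concurrency runners are billed around the clock, traffic or not."""
    return min_concurrency * PRICE_PER_HOUR * HOURS_PER_MONTH

print(monthly_floor(1))   # cost floor of a single always-on runner
print(monthly_floor(0))   # no standing cost
```

Even one always-on runner sets a fixed monthly floor regardless of traffic, which is why the default of 0 is the right starting point for most apps.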

Maximize Runner Utilization

Runners that process one request at a time have idle gaps between sequential requests. Multiplexing eliminates these gaps.

Use multiplexing

max_multiplexing allows a single runner to process multiple requests concurrently, filling idle time while one request waits on I/O.
class MyApp(fal.App, max_multiplexing=4):
    ...
This is especially effective for workloads with I/O waits (network calls, file downloads) where the GPU would otherwise sit idle between operations.
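A toy model (not fal's actual scheduler) shows why overlapping I/O shrinks the time a runner stays busy, and therefore billed:

```python
import math

# Toy model: each request spends `io` seconds waiting on I/O plus `gpu`
# seconds of compute. With multiplexing, I/O waits overlap; GPU compute
# is assumed serialized on the single device.
def runner_busy_time(n_requests: int, io: float, gpu: float, multiplexing: int) -> float:
    """Approximate wall time one runner needs to serve n requests."""
    if multiplexing <= 1:
        return n_requests * (io + gpu)          # strictly sequential
    batches = math.ceil(n_requests / multiplexing)
    return batches * io + n_requests * gpu      # I/O overlapped per batch

print(runner_busy_time(8, io=2.0, gpu=0.5, multiplexing=1))  # → 20.0
print(runner_busy_time(8, io=2.0, gpu=0.5, multiplexing=4))  # → 8.0
```

In this sketch, serving eight I/O-heavy requests with max_multiplexing=4 takes less than half the billed runner time of the sequential case.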

Concurrency buffer

concurrency_buffer pre-warms runners before existing ones reach capacity. This reduces latency spikes but keeps more runners alive.
class MyApp(fal.App, max_multiplexing=4, concurrency_buffer=2):
    ...
Use conservatively — each buffered runner is billed while alive.

Reduce Setup Time

Since setup() time is billed, faster initialization directly reduces cost per cold start. This matters most for applications with frequent cold starts (low keep_alive, bursty traffic).
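One common setup-time saving is caching expensive downloads so only the first cold start pays for them. The cache path and download helper below are hypothetical illustrations, not part of the fal SDK:

```python
from pathlib import Path

# Hypothetical cache location — stand-in for whatever persistent storage
# your deployment provides. Not a fal SDK API.
CACHE_DIR = Path("/tmp/model-cache")

def fetch_weights(name: str, download) -> Path:
    """Run `download` only on the first cold start; later setups hit the cache."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    target = CACHE_DIR / name
    if not target.exists():
        download(target)   # slow path, billed as setup time
    return target
```

Calling a helper like this from setup() means repeated cold starts skip the download entirely, paying only the (much cheaper) cache read.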

Right-Size Your Machine Type

Don’t over-provision GPU resources. A model that runs fine on an A6000 doesn’t need an H100.
  • Compare inference latency across machine types to find the smallest GPU that meets your requirements
  • Consider that a cheaper machine running slightly longer may cost less than a faster, more expensive one
  • See Machine Types for available options
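The second bullet is worth working through with numbers. The prices and latencies below are hypothetical, chosen only to show the comparison:

```python
# Per-request cost comparison across machine types.
# Prices and latencies are hypothetical, not fal's actual rates.
def cost_per_request(price_per_sec: float, latency_sec: float) -> float:
    return price_per_sec * latency_sec

a6000 = cost_per_request(price_per_sec=0.0002, latency_sec=3.0)  # slower, cheaper
h100  = cost_per_request(price_per_sec=0.0010, latency_sec=1.0)  # faster, pricier

print(a6000, h100)
```

In this example the slower machine takes 3x as long per request but is 5x cheaper per second, so it still wins on cost per request. Run the same comparison with your own measured latencies before picking a machine type.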

Use Scaling Parameters Wisely

scaling_delay

Prevents spinning up new runners for short traffic spikes. The platform waits this duration before provisioning additional runners.
class MyApp(fal.App, scaling_delay=15):  # Wait 15s before scaling up
    ...
Useful for workloads with brief bursts that don’t justify new runners.

max_concurrency

Caps the total number of concurrent runners. This sets a hard ceiling on your spend but may increase queue times during traffic spikes.
class MyApp(fal.App, max_concurrency=5):  # At most 5 runners
    ...
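A toy queueing model (not fal's scheduler) makes the trade-off concrete, assuming each runner handles one request at a time:

```python
import math

# Toy model: a burst of requests arrives at once; at most `max_concurrency`
# runners work in parallel, each taking `secs_per_request` per request.
def worst_case_wait(burst: int, max_concurrency: int, secs_per_request: float) -> float:
    """Queue wait for the last request in the burst."""
    rounds = math.ceil(burst / max_concurrency)
    return (rounds - 1) * secs_per_request

print(worst_case_wait(burst=20, max_concurrency=5, secs_per_request=2.0))  # → 6.0
```

Here a burst of 20 requests against a cap of 5 runners queues the last request for three extra rounds. Raising the cap cuts that wait but raises the spend ceiling proportionally.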

Monitor and Iterate

Use the platform’s observability tools to identify cost optimization opportunities:
  • App Analytics: Identify apps with low utilization (high idle time relative to processing time)
  • Error Analytics: Find apps with high error rates that waste compute on failed requests
  • Exporting Metrics: Set up alerts for unusual spending patterns
Start with the defaults, then monitor your App Analytics for a few days. Look for runners with high idle-to-processing ratios — those are your best candidates for keep_alive and min_concurrency adjustments.
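The idle-to-processing check can be scripted once you export metrics. The per-app numbers below are made up for illustration; only the ratio logic matters:

```python
# Flag apps whose idle time dwarfs processing time — candidates for
# lower keep_alive / min_concurrency. The metrics dict is hypothetical
# example data, not output from fal's analytics API.
def idle_ratio(idle_seconds: float, processing_seconds: float) -> float:
    return idle_seconds / max(processing_seconds, 1e-9)

metrics = {
    "upscaler": {"idle": 5400, "processing": 600},   # mostly idle
    "chat":     {"idle": 300,  "processing": 2700},  # well utilized
}

candidates = [app for app, m in metrics.items()
              if idle_ratio(m["idle"], m["processing"]) > 2.0]
print(candidates)  # → ['upscaler']
```

A threshold like 2.0 is a starting point, not a rule; tune it against what a cold start costs your users in latency.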