## Reduce Idle Time

Idle time is the single biggest cost lever. Every second a runner is alive but not processing requests is billed.

### Tune keep_alive

The `keep_alive` parameter controls how long a runner stays alive after finishing its last request. Lower values reduce idle billing but increase cold starts.
| `keep_alive` | Trade-off |
|---|---|
| High (300s+) | Low latency, higher idle costs |
| Medium (30-60s) | Balanced for moderate traffic |
| Low (10-30s) | Cost-efficient for bursty workloads |
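The trade-off can be sketched with a rough back-of-envelope model: each gap between requests bills up to `keep_alive` seconds of idle time, and gaps longer than `keep_alive` cause a cold start on the next request. The gap pattern and per-second rate below are made up for illustration.

```python
# Rough cost model: each gap between requests bills min(gap, keep_alive)
# seconds of idle time; gaps longer than keep_alive also incur a cold
# start on the next request.
def idle_cost(gaps_s, keep_alive_s, rate_per_s):
    idle = sum(min(g, keep_alive_s) for g in gaps_s)
    cold_starts = sum(1 for g in gaps_s if g > keep_alive_s)
    return idle * rate_per_s, cold_starts

# Bursty traffic: mostly short gaps with occasional long lulls.
gaps = [2, 3, 1, 600, 2, 4, 900, 3]
for ka in (10, 60, 300):
    cost, colds = idle_cost(gaps, ka, rate_per_s=0.0005)
    print(f"keep_alive={ka}s -> idle cost ${cost:.4f}, cold starts: {colds}")
```

Running this for your own traffic trace (request timestamps are available from most observability exports) shows where the idle-cost curve flattens, which is usually the right neighborhood for `keep_alive`.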
### Set min_concurrency carefully
Each runner reserved by `min_concurrency` runs continuously and is billed 24/7. Only use it for latency-critical applications that cannot tolerate cold starts.
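Because these runners bill around the clock, the monthly floor cost is easy to estimate before committing. The hourly rate here is hypothetical:

```python
# A runner reserved by min_concurrency bills around the clock,
# whether or not it serves traffic. Rate below is illustrative.
HOURS_PER_MONTH = 730  # average month

def monthly_floor_cost(min_concurrency, rate_per_hour):
    return min_concurrency * rate_per_hour * HOURS_PER_MONTH

print(monthly_floor_cost(2, 1.50))  # two always-on runners at $1.50/h
```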
## Maximize Runner Utilization

Runners that process one request at a time have idle gaps between sequential requests. Multiplexing eliminates these gaps.

### Use multiplexing

`max_multiplexing` allows a single runner to process multiple requests concurrently, filling idle time while one request waits on I/O.
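The effect is the same one `asyncio` concurrency demonstrates: while one request waits on I/O, the runner serves others instead of sitting idle. This is an illustration of the principle, not the platform's internal scheduler:

```python
import asyncio
import time

# While one request awaits I/O (simulated by asyncio.sleep), the
# runner can serve others, so 8 requests finish in ~1 request's time.
async def handle_request(i):
    await asyncio.sleep(0.5)  # stand-in for a network/disk wait
    return i

async def main():
    start = time.perf_counter()
    await asyncio.gather(*(handle_request(i) for i in range(8)))
    elapsed = time.perf_counter() - start
    print(f"8 I/O-bound requests in {elapsed:.2f}s (serial would be ~4s)")
    return elapsed

elapsed = asyncio.run(main())
```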
### Concurrency buffer

`concurrency_buffer` pre-warms runners before existing ones reach capacity. This reduces latency spikes but keeps more runners alive.
## Reduce Setup Time

Since `setup()` time is billed, faster initialization directly reduces cost per cold start. This matters most for applications with frequent cold starts (low `keep_alive`, bursty traffic).
- **FlashPack**: High-throughput tensor loading for faster model initialization.
- **Optimizing Cold Starts**: Strategies for reducing container startup and setup time.
- **Compiled Caches**: Cache compiled kernels to skip recompilation on startup.
- **Persistent Storage**: Use `/data` to cache downloads across runner restarts.
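A common pattern that combines these ideas is checking persistent storage before downloading model weights in `setup()`. The sketch below assumes a hypothetical model URL and filename; only the `/data` mount point comes from this page:

```python
import os
import shutil
import urllib.request

# Cache a downloaded artifact under /data so runner restarts reuse it
# instead of re-downloading (and re-billing) on every cold start.
def fetch_model(url, name, cache_dir="/data/models"):
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, name)
    if not os.path.exists(path):           # cold path: download once
        tmp = path + ".tmp"
        urllib.request.urlretrieve(url, tmp)
        shutil.move(tmp, path)             # publish only complete files
    return path                            # warm path: reuse cached file
```

Downloading to a temporary file and renaming it avoids leaving a half-written artifact in the cache if a runner is killed mid-download.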
## Right-Size Your Machine Type

Don’t over-provision GPU resources. A model that runs fine on an A6000 doesn’t need an H100.

- Compare inference latency across machine types to find the smallest GPU that meets your requirements
- Consider that a cheaper machine running slightly longer may cost less than a faster, more expensive one
- See Machine Types for available options
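The second point is worth doing as arithmetic: per-request cost is latency times machine rate, so a GPU that is half as fast but a fifth of the price still wins. The rates and latencies below are illustrative, not real price quotes:

```python
# Per-request cost = seconds per request * machine rate per second.
# Numbers are made up to illustrate the comparison, not actual pricing.
def cost_per_request(latency_s, rate_per_hour):
    return latency_s * rate_per_hour / 3600

a6000 = cost_per_request(latency_s=1.8, rate_per_hour=0.80)
h100 = cost_per_request(latency_s=0.9, rate_per_hour=4.00)
print(f"A6000: ${a6000:.5f}/req  H100: ${h100:.5f}/req")
# Here the slower, cheaper GPU is cheaper per request, unless the
# extra latency itself violates your requirements.
```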
## Use Scaling Parameters Wisely

### scaling_delay

Prevents spinning up new runners for short traffic spikes. The platform waits this duration before provisioning additional runners.

### max_concurrency

Caps the total number of concurrent runners. This sets a hard ceiling on your spend but may increase queue times during traffic spikes.
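Pulling the knobs from this page together, a cost-conscious bursty-workload configuration might look like the following. The field names come from this page, but the config shape and values are illustrative, not a real SDK API:

```python
# Hypothetical deployment config combining the scaling parameters
# discussed above; values shown lean toward cost efficiency.
scaling_config = {
    "keep_alive": 30,          # seconds idle before a runner shuts down
    "min_concurrency": 0,      # no always-on (24/7 billed) runners
    "max_concurrency": 10,     # hard ceiling on concurrent runners
    "max_multiplexing": 4,     # requests served per runner at once
    "concurrency_buffer": 1,   # pre-warmed spare capacity
    "scaling_delay": 15,       # seconds to wait before scaling up
}
```

A latency-critical service would move in the opposite direction: nonzero `min_concurrency`, higher `keep_alive`, and a larger `concurrency_buffer`, accepting the higher idle bill.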
## Monitor and Iterate

Use the platform’s observability tools to identify cost optimization opportunities:

- App Analytics: Identify apps with low utilization (high idle time relative to processing time)
- Error Analytics: Find apps with high error rates that waste compute on failed requests
- Exporting Metrics: Set up alerts for unusual spending patterns
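The utilization check in the first bullet reduces to one ratio: processing time over billed time. Low values mean you are paying mostly for idle runners. App names and numbers below are made up for illustration:

```python
# Utilization = processing seconds / billed seconds per app.
# A low ratio suggests lowering keep_alive or enabling multiplexing.
def utilization(processing_s, billed_s):
    return processing_s / billed_s if billed_s else 0.0

apps = {"chatbot": (1200, 9000), "batch-ocr": (7000, 7600)}
for name, (proc, billed) in apps.items():
    u = utilization(proc, billed)
    flag = "  <- mostly idle; revisit keep_alive" if u < 0.3 else ""
    print(f"{name}: {u:.0%} utilized{flag}")
```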