## Reduce Idle Time

Idle time is the single biggest cost lever. Every second a runner is alive but not processing requests is billed.

### Tune keep_alive

The `keep_alive` parameter controls how long a runner stays alive after finishing its last request. Lower values reduce idle billing but increase cold starts.
| `keep_alive` | Trade-off |
|---|---|
| High (300s+) | Low latency, higher idle costs |
| Medium (30-60s) | Balanced for moderate traffic |
| Low (10-30s) | Cost-efficient for bursty workloads |
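The trade-off can be sketched with a rough back-of-envelope model: each gap between requests bills up to `keep_alive` seconds of idle time, and gaps longer than `keep_alive` cause a cold start on the next request. The gap pattern and per-second rate below are made up for illustration.

```python
# Rough cost model: each gap between requests bills min(gap, keep_alive)
# seconds of idle time; gaps longer than keep_alive also incur a cold
# start on the next request.
def idle_cost(gaps_s, keep_alive_s, rate_per_s):
    idle = sum(min(g, keep_alive_s) for g in gaps_s)
    cold_starts = sum(1 for g in gaps_s if g > keep_alive_s)
    return idle * rate_per_s, cold_starts

# Bursty traffic: mostly short gaps with occasional long lulls.
gaps = [2, 3, 1, 600, 2, 4, 900, 3]
for ka in (10, 60, 300):
    cost, colds = idle_cost(gaps, ka, rate_per_s=0.0005)
    print(f"keep_alive={ka}s -> idle cost ${cost:.4f}, cold starts: {colds}")
```

Running this for your own traffic trace (request timestamps are available from most observability exports) shows where the idle-cost curve flattens, which is usually the right neighborhood for `keep_alive`.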
### Set min_concurrency carefully
Each runner reserved by `min_concurrency` runs continuously and is billed 24/7. Only use it for latency-critical applications that cannot tolerate cold starts.
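Because these runners bill around the clock, the monthly floor cost is easy to estimate before committing. The hourly rate here is hypothetical:

```python
# A runner reserved by min_concurrency bills around the clock,
# whether or not it serves traffic. Rate below is illustrative.
HOURS_PER_MONTH = 730  # average month

def monthly_floor_cost(min_concurrency, rate_per_hour):
    return min_concurrency * rate_per_hour * HOURS_PER_MONTH

print(monthly_floor_cost(2, 1.50))  # two always-on runners at $1.50/h
```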
## Maximize Runner Utilization

Runners that process one request at a time have idle gaps between sequential requests. Multiplexing eliminates these gaps.

### Use multiplexing

`max_multiplexing` allows a single runner to process multiple requests concurrently, filling idle time while one request waits on I/O.
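The effect is the same one `asyncio` concurrency demonstrates: while one request waits on I/O, the runner serves others instead of sitting idle. This is an illustration of the principle, not the platform's internal scheduler:

```python
import asyncio
import time

# While one request awaits I/O (simulated by asyncio.sleep), the
# runner can serve others, so 8 requests finish in ~1 request's time.
async def handle_request(i):
    await asyncio.sleep(0.5)  # stand-in for a network/disk wait
    return i

async def main():
    start = time.perf_counter()
    await asyncio.gather(*(handle_request(i) for i in range(8)))
    elapsed = time.perf_counter() - start
    print(f"8 I/O-bound requests in {elapsed:.2f}s (serial would be ~4s)")
    return elapsed

elapsed = asyncio.run(main())
```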
### Concurrency buffer

`concurrency_buffer` pre-warms runners before existing ones reach capacity. This reduces latency spikes but keeps more runners alive.
## Reduce Setup Time

Since `setup()` time is billed, faster initialization directly reduces cost per cold start. This matters most for applications with frequent cold starts (low `keep_alive`, bursty traffic).
- **FlashPack**: High-throughput tensor loading for faster model initialization.
- **Optimizing Cold Starts**: Strategies for reducing container startup and setup time.
- **Compiled Caches**: Cache compiled kernels to skip recompilation on startup.
- **Persistent Storage**: Use `/data` to cache downloads across runner restarts.
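A common pattern that combines these ideas is checking persistent storage before downloading model weights in `setup()`. The sketch below assumes a hypothetical model URL and filename; only the `/data` mount point comes from this page:

```python
import os
import shutil
import urllib.request

# Cache a downloaded artifact under /data so runner restarts reuse it
# instead of re-downloading (and re-billing) on every cold start.
def fetch_model(url, name, cache_dir="/data/models"):
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, name)
    if not os.path.exists(path):           # cold path: download once
        tmp = path + ".tmp"
        urllib.request.urlretrieve(url, tmp)
        shutil.move(tmp, path)             # publish only complete files
    return path                            # warm path: reuse cached file
```

Downloading to a temporary file and renaming it avoids leaving a half-written artifact in the cache if a runner is killed mid-download.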
## Right-Size Your Machine Type

Don’t over-provision GPU resources. A model that runs fine on an A6000 doesn’t need an H100.

- Compare inference latency across machine types to find the smallest GPU that meets your requirements
- Consider that a cheaper machine running slightly longer may cost less than a faster, more expensive one
- See Machine Types for available options
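The second point is worth doing as arithmetic: per-request cost is latency times machine rate, so a GPU that is half as fast but a fifth of the price still wins. The rates and latencies below are illustrative, not real price quotes:

```python
# Per-request cost = seconds per request * machine rate per second.
# Numbers are made up to illustrate the comparison, not actual pricing.
def cost_per_request(latency_s, rate_per_hour):
    return latency_s * rate_per_hour / 3600

a6000 = cost_per_request(latency_s=1.8, rate_per_hour=0.80)
h100 = cost_per_request(latency_s=0.9, rate_per_hour=4.00)
print(f"A6000: ${a6000:.5f}/req  H100: ${h100:.5f}/req")
# Here the slower, cheaper GPU is cheaper per request, unless the
# extra latency itself violates your requirements.
```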
## Use Scaling Parameters Wisely

### scaling_delay

Prevents spinning up new runners for short traffic spikes. The platform waits this duration before provisioning additional runners.

### max_concurrency

Caps the total number of concurrent runners. This sets a hard ceiling on your spend but may increase queue times during traffic spikes.
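Pulling the knobs from this page together, a cost-conscious bursty-workload configuration might look like the following. The field names come from this page, but the config shape and values are illustrative, not a real SDK API:

```python
# Hypothetical deployment config combining the scaling parameters
# discussed above; values shown lean toward cost efficiency.
scaling_config = {
    "keep_alive": 30,          # seconds idle before a runner shuts down
    "min_concurrency": 0,      # no always-on (24/7 billed) runners
    "max_concurrency": 10,     # hard ceiling on concurrent runners
    "max_multiplexing": 4,     # requests served per runner at once
    "concurrency_buffer": 1,   # pre-warmed spare capacity
    "scaling_delay": 15,       # seconds to wait before scaling up
}
```

A latency-critical service would move in the opposite direction: nonzero `min_concurrency`, higher `keep_alive`, and a larger `concurrency_buffer`, accepting the higher idle bill.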
## Monitor and Iterate

Use the platform’s observability tools to identify cost optimization opportunities:

- App Analytics: Identify apps with low utilization (high idle time relative to processing time)
- Error Analytics: Find apps with high error rates that waste compute on failed requests
- Exporting Metrics: Set up alerts for unusual spending patterns
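The utilization check in the first bullet reduces to one ratio: processing time over billed time. Low values mean you are paying mostly for idle runners. App names and numbers below are made up for illustration:

```python
# Utilization = processing seconds / billed seconds per app.
# A low ratio suggests lowering keep_alive or enabling multiplexing.
def utilization(processing_s, billed_s):
    return processing_s / billed_s if billed_s else 0.0

apps = {"chatbot": (1200, 9000), "batch-ocr": (7000, 7600)}
for name, (proc, billed) in apps.items():
    u = utilization(proc, billed)
    flag = "  <- mostly idle; revisit keep_alive" if u < 0.3 else ""
    print(f"{name}: {u:.0%} utilized{flag}")
```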