Scale Your Application

Scaling configuration is essential for maintaining performance while controlling costs in production fal Serverless applications. This guide covers scaling parameters, auto-scaling strategies, and cost optimization techniques to help you handle variable traffic loads efficiently.

Configuration Methods

You can configure scaling settings in three ways:

In Code

Configure scaling directly in your application code using class parameters. This is the recommended approach for most scenarios because it keeps your configuration close to your code:

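import fal


class MyApp(
    fal.App,
    keep_alive=300,        # seconds a runner beyond min concurrency stays alive
    min_concurrency=1,     # runners kept alive at all times
    max_concurrency=10,    # upper limit on total runners
    concurrency_buffer=2,  # extra runners kept above current demand
):
    # ...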

Using CLI Commands

Adjust scaling settings for deployed applications using CLI commands. This is useful for production tuning without code changes:

fal apps scale myapp --min-concurrency 2 --max-concurrency 15

Note: By default, redeploying the application does not reset these settings.

Using API

Scale applications programmatically using the REST API. This is useful for integrating scaling operations into CI/CD pipelines or automated management systems.

The scale endpoint requires an API key with ADMIN scope. Check the Authentication section for more information.

Endpoint: PUT https://rest.alpha.fal.ai/applications/scale/{app_name}

Headers: Authorization: Key {fal_key}
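As a rough sketch, the same call can be assembled in Python using only the standard library. The endpoint, the Authorization header format, and the JSON field names are taken from this page. The helper names build_scale_request and send_scale_request are hypothetical, and because the response body is not described here, the sketch returns only the HTTP status code.

```python
import json
import urllib.request

# Endpoint as documented on this page.
SCALE_URL = "https://rest.alpha.fal.ai/applications/scale/{app_name}"


def build_scale_request(app_name: str, fal_key: str, **settings: int) -> urllib.request.Request:
    """Build (without sending) a PUT request that updates scale settings.

    `settings` holds the fields to change, for example min_concurrency=2.
    """
    return urllib.request.Request(
        SCALE_URL.format(app_name=app_name),
        data=json.dumps(settings).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Key {fal_key}",
            "Content-Type": "application/json",
        },
    )


def send_scale_request(request: urllib.request.Request) -> int:
    """Send the request and return the HTTP status code (raises on HTTP errors)."""
    with urllib.request.urlopen(request) as response:
        return response.status


# Usage (needs a real API key with ADMIN scope, so it is left commented out):
# request = build_scale_request("myapp", "<your key>", min_concurrency=2, max_concurrency=15)
# print(send_scale_request(request))
```

Keeping request construction separate from sending makes it possible to inspect or log the payload in a CI/CD pipeline before anything is changed.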

Updating Scale Settings on Redeployment

By default, redeploying an app uses the --no-scale behavior. This means the scale settings defined in your app’s code are ignored, and the app’s existing scale is preserved.

To apply the scale settings from your app’s code during deployment, use the --reset-scale flag:

fal deploy --reset-scale

Minimum Concurrency

Minimum concurrency is the minimum number of runners (application instances) that your app will keep alive at all times. Think of it as your app’s baseline capacity.

If your app takes a while to start up, or if you anticipate sudden spikes in requests, setting a higher minimum concurrency can ensure there are always enough runners ready to respond immediately.

In Code

class MyApp(fal.App, min_concurrency=2):
    # ...

Using CLI

fal apps scale myapp --min-concurrency 2

Using API

curl --location --request PUT 'https://rest.alpha.fal.ai/applications/scale/{app_name}' \
  --header 'Authorization: Key {KEY}' \
  --header 'Content-Type: application/json' \
  --data '{
    "min_concurrency": 2
  }'

Concurrency Buffer

The concurrency buffer provides a cushion of extra runners above what’s currently needed to handle incoming requests. This is useful for apps with slow startup times, as it ensures there are always warm, ready runners to absorb sudden bursts of traffic without delays.

Unlike min concurrency, which sets a fixed floor, the concurrency buffer aims to keep a specified number of additional runners available beyond the live demand.

The system first calculates the number of runners needed for the current request volume. It then adds the concurrency buffer to this number. The result is the total number of runners that will be kept alive.

Note: When you set a concurrency buffer higher than min concurrency, it takes precedence over min concurrency. This means the system will always keep at least the number of runners specified by the buffer (plus current demand), even if this is higher than your min concurrency setting.
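The interaction between the buffer and the min concurrency floor can be sketched as a small function. This is a simplified model based only on the description above, not fal's actual scheduler: it assumes the number of runners kept alive is the larger of the floor and demand plus buffer, and it leaves out max concurrency and multiplexing. The name target_runners is hypothetical.

```python
def target_runners(needed: int, min_concurrency: int, concurrency_buffer: int) -> int:
    """Runners kept alive under a simplified model (an assumption, not fal's scheduler)."""
    # Demand plus buffer, but never below the min concurrency floor.
    return max(min_concurrency, needed + concurrency_buffer)


# Floor dominates while demand is low: min_concurrency=3, concurrency_buffer=2.
print(target_runners(needed=0, min_concurrency=3, concurrency_buffer=2))  # 3
print(target_runners(needed=1, min_concurrency=3, concurrency_buffer=2))  # 3
print(target_runners(needed=4, min_concurrency=3, concurrency_buffer=2))  # 6

# Buffer higher than the floor: min_concurrency=1, concurrency_buffer=2.
print(target_runners(needed=0, min_concurrency=1, concurrency_buffer=2))  # 2
```

In the last call the buffer alone keeps 2 runners alive with no traffic, which is the precedence case described in the note.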

In Code

class MyApp(fal.App, concurrency_buffer=2):
    # ...

Using CLI

fal apps scale myapp --concurrency-buffer 2

Using API

curl --location --request PUT 'https://rest.alpha.fal.ai/applications/scale/{app_name}' \
  --header 'Authorization: Key {KEY}' \
  --header 'Content-Type: application/json' \
  --data '{
    "concurrency_buffer": 2
  }'

Max Concurrency

Max concurrency is the absolute upper limit for the total number of runners that your app can scale up to. This cap helps prevent excessive resource usage and ensures cost control, regardless of how many requests pour in.

In Code

class MyApp(fal.App, max_concurrency=10):
    # ...

Using CLI

fal apps scale myapp --max-concurrency 10

Using API

curl --location --request PUT 'https://rest.alpha.fal.ai/applications/scale/{app_name}' \
  --header 'Authorization: Key {KEY}' \
  --header 'Content-Type: application/json' \
  --data '{
    "max_concurrency": 10
  }'

Keep Alive

Keep alive is the number of seconds a runner (beyond min concurrency) is kept alive for your app. Depending on your traffic pattern, you might want to set this to a higher number, especially if your app is slow to start up.
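One way to picture the rule is the small check below. It is only a sketch under two assumptions that this page does not state: the keep-alive timer counts the time a runner has been idle, and the concurrency buffer is ignored. The function name can_shut_down and its parameters are hypothetical.

```python
def can_shut_down(idle_seconds: float, keep_alive: int, alive_runners: int, min_concurrency: int) -> bool:
    """Decide whether one idle runner may stop (simplified model, not fal's scheduler)."""
    # Runners that make up the min concurrency floor always stay alive.
    if alive_runners <= min_concurrency:
        return False
    # Assumption: a runner beyond the floor stops once it has been idle for keep_alive seconds.
    return idle_seconds >= keep_alive


print(can_shut_down(idle_seconds=120, keep_alive=300, alive_runners=3, min_concurrency=1))   # False
print(can_shut_down(idle_seconds=301, keep_alive=300, alive_runners=3, min_concurrency=1))   # True
print(can_shut_down(idle_seconds=1000, keep_alive=300, alive_runners=1, min_concurrency=1))  # False
```

Under this model, a higher keep alive value trades idle runner time for fewer cold starts when traffic arrives in bursts spaced a few minutes apart.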

In Code

class MyApp(fal.App, keep_alive=300):
    # ...

Using CLI

fal apps scale myapp --keep-alive 300

Using API

curl --location --request PUT 'https://rest.alpha.fal.ai/applications/scale/{app_name}' \
  --header 'Authorization: Key {KEY}' \
  --header 'Content-Type: application/json' \
  --data '{
    "keep_alive": 300
  }'

Max Multiplexing

Maximum multiplexing is the maximum number of requests that can be handled by a single runner at any time. This is useful if your app instance is capable of handling multiple requests at the same time, which typically depends on the machine type and amount of resources that your app needs to process a request.
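Multiplexing also changes the ceiling on how many requests an app can process at once. The arithmetic below is a back-of-the-envelope sketch that assumes every runner can be filled to its multiplexing limit; the helper name peak_capacity is hypothetical.

```python
def peak_capacity(max_concurrency: int, max_multiplexing: int) -> int:
    """Upper bound on simultaneous requests: runners times requests per runner."""
    return max_concurrency * max_multiplexing


print(peak_capacity(max_concurrency=10, max_multiplexing=1))   # 10
print(peak_capacity(max_concurrency=10, max_multiplexing=10))  # 100
```

Raising max multiplexing therefore increases capacity without adding runners, as long as a single runner has the resources to serve that many requests at the same time.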

In Code

class MyApp(fal.App, max_multiplexing=10):
    # ...

Using CLI

fal apps scale myapp --max-multiplexing 10

Using API

curl --location --request PUT 'https://rest.alpha.fal.ai/applications/scale/{app_name}' \
  --header 'Authorization: Key {KEY}' \
  --header 'Content-Type: application/json' \
  --data '{
    "max_multiplexing": 10
  }'

Scaling Examples

No multiplexing

Let’s consider an app with:

  • Min concurrency: 3
  • Concurrency buffer: 2
  • Max multiplexing: 1
  • Max concurrency: 10

[Figure: No multiplexing]

With multiplexing

Let’s consider an app with:

  • Min concurrency: 0
  • Concurrency buffer: 2
  • Max multiplexing: 4
  • Max concurrency: 6

[Figure: With multiplexing]

Since multiplexing of 4 is in place, a single runner can handle 4 requests at the same time. Also notice that even if min concurrency is set to 0, the system will still keep 2 runners alive to handle the buffer.
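The two configurations above can be explored with a small model. It is a sketch, not fal's actual scheduler, and it rests on assumptions: runners needed is the number of in-flight requests divided by max multiplexing, rounded up; the buffer is counted in runners; the result never drops below min concurrency or rises above max concurrency. The names ScaleSettings and target_runners are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleSettings:
    min_concurrency: int
    concurrency_buffer: int
    max_multiplexing: int
    max_concurrency: int


def target_runners(settings: ScaleSettings, in_flight: int) -> int:
    """Runners kept alive for a given number of in-flight requests (simplified model)."""
    # Each runner serves up to max_multiplexing requests at once.
    needed = math.ceil(in_flight / settings.max_multiplexing)
    # Add the buffer, then apply the floor and the cap.
    wanted = max(settings.min_concurrency, needed + settings.concurrency_buffer)
    return min(wanted, settings.max_concurrency)


NO_MULTIPLEXING = ScaleSettings(min_concurrency=3, concurrency_buffer=2, max_multiplexing=1, max_concurrency=10)
WITH_MULTIPLEXING = ScaleSettings(min_concurrency=0, concurrency_buffer=2, max_multiplexing=4, max_concurrency=6)

for in_flight in (0, 1, 4, 9, 20):
    print(
        in_flight,
        target_runners(NO_MULTIPLEXING, in_flight),
        target_runners(WITH_MULTIPLEXING, in_flight),
    )
# Prints (requests, runners without multiplexing, runners with multiplexing):
# 0 3 2
# 1 3 3
# 4 6 3
# 9 10 5
# 20 10 6
```

With no traffic, the first app rests at its floor of 3 runners while the second rests at 2, held there by the buffer alone. Both stop growing once they reach max concurrency.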

Cost Optimization Strategies

  • Start with conservative settings and adjust based on actual usage patterns
  • Use concurrency buffer for apps with slow startup times instead of high min concurrency
  • Enable multiplexing when your app can handle concurrent requests efficiently
  • Monitor usage patterns and adjust scaling parameters accordingly
  • Set reasonable max concurrency to prevent runaway costs