Scaling

Machine Types

To run your app on a different machine type, use the fal apps scale command. For the list of available machine types, see the resources page.

Change Machine Type For New Runners

Changing the machine type only affects new runners; existing runners keep the machine type they started with. To move existing runners to the new machine type, kill them manually with fal workers kill, and they will be replaced by new runners that use the new machine type.

Terminal window
fal apps scale myapp --machine-type GPU-A100

Allow Using Multiple Machine Types

You can allow your app to use multiple machine types, for example to draw from a larger pool of available machines.

Terminal window
fal apps scale myapp --machine-type GPU-A100-40G GPU-A100-80G

Min Concurrency

Min concurrency is the minimum number of application instances (runners) that your app will keep alive at all times. Think of it as your app’s baseline capacity.

If your app takes a while to start up, or if you anticipate sudden spikes in requests, setting a higher min concurrency can ensure there are always enough runners ready to respond immediately.

Terminal window
fal apps scale myapp --min-concurrency 2

Concurrency Buffer

The concurrency buffer provides a cushion of extra runners above what’s currently needed to handle incoming requests. This is useful for apps with slow startup times, as it ensures there are always warm, ready runners to absorb sudden bursts of traffic without delays.

Unlike min concurrency, which sets a fixed floor, the concurrency buffer aims to keep a specified number of additional runners available beyond the live demand.

Note: When you set a concurrency buffer higher than min concurrency, it takes precedence over min concurrency. This means the system will always keep at least the number of runners specified by the buffer (plus current demand), even if this is higher than your min concurrency setting.

How it works

The system first calculates the number of runners needed for the current request volume. It then adds the concurrency buffer to this number. The result, bounded below by min concurrency and above by max concurrency, is the total number of runners that will be kept alive.

Make sure to check the examples to see how the system will behave with different settings.
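The calculation described above can be sketched in a few lines of Python. This is an illustrative model only, not fal's actual scheduler: the function name `runners_to_keep` and its parameters are hypothetical, and it assumes that the buffer is counted in runners, that min concurrency acts as a floor on the buffered total, and that max concurrency caps the result.

```python
import math


def runners_to_keep(active_requests, min_concurrency=0, concurrency_buffer=0,
                    max_concurrency=None, max_multiplexing=1):
    """Estimate how many runners stay alive (hypothetical model, see assumptions above)."""
    # Runners needed for the current load: each runner can serve up to
    # max_multiplexing requests at the same time.
    needed = math.ceil(active_requests / max_multiplexing)
    # Add the buffer on top of demand, then apply the min concurrency floor.
    target = max(min_concurrency, needed + concurrency_buffer)
    # Apply the max concurrency cap, if one is set.
    if max_concurrency is not None:
        target = min(target, max_concurrency)
    return target
```

Under these assumptions, a buffer larger than min concurrency determines the idle runner count, which matches the note above about the buffer taking precedence.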

Terminal window
fal apps scale myapp --concurrency-buffer 2

Max Concurrency

Max concurrency is the absolute upper limit for the total number of runners that your app can scale up to. This cap helps prevent excessive resource usage and ensures cost control, regardless of how many requests pour in.

Terminal window
fal apps scale myapp --max-concurrency 10

Keep Alive

Keep alive is the number of seconds a runner (beyond min concurrency) will be kept alive for your app. Depending on your traffic pattern, you might want to set this to a higher number, especially if your app is slow to start up.

Terminal window
fal apps scale myapp --keep-alive 300

Max Multiplexing

Max multiplexing is the maximum number of requests that a single runner can handle at the same time. This is useful if one instance of your app can process several requests concurrently, which typically depends on the machine type and on the resources your app needs to process a request.

Terminal window
fal apps scale myapp --max-multiplexing 10

Examples

No multiplexing

Let’s consider an app with:

  • Min concurrency: 3
  • Concurrency buffer: 2
  • Max multiplexing: 1
  • Max concurrency: 10
[Figure: No multiplexing]
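To make these settings concrete, the sketch below tabulates the expected runner count at a few request volumes. It is an approximation under stated assumptions, not output from fal: the helper `expected_runners` is hypothetical, and it assumes runners = demand + buffer, floored at min concurrency and capped at max concurrency.

```python
MIN_CONCURRENCY = 3
CONCURRENCY_BUFFER = 2
MAX_CONCURRENCY = 10


def expected_runners(requests):
    # Max multiplexing is 1, so each in-flight request occupies one runner.
    with_buffer = requests + CONCURRENCY_BUFFER
    # Assumed model: floor at min concurrency, cap at max concurrency.
    return min(MAX_CONCURRENCY, max(MIN_CONCURRENCY, with_buffer))


table = {r: expected_runners(r) for r in (0, 1, 2, 5, 8, 12)}
for requests, runners in table.items():
    print(f"{requests:>2} requests -> {runners} runners")
```

In this model the min concurrency of 3 sets the runner count until demand plus buffer exceeds it, and the max concurrency of 10 takes over once 8 or more requests are in flight.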

With multiplexing

Let’s consider an app with:

  • Min concurrency: 0
  • Concurrency buffer: 2
  • Max multiplexing: 4
  • Max concurrency: 6
[Figure: With multiplexing]

Because max multiplexing is 4, a single runner can handle up to 4 requests at the same time. Also notice that even though min concurrency is set to 0, the system still keeps 2 runners alive to satisfy the concurrency buffer.
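The same arithmetic can be worked through for the multiplexing example. This is a sketch under assumptions rather than fal's real behavior: the helper `expected_runners_multiplexed` is hypothetical, the buffer is assumed to be counted in runners (as the 2 idle runners above suggest), and demand is assumed to be rounded up to whole runners.

```python
import math

MIN_CONCURRENCY, CONCURRENCY_BUFFER = 0, 2
MAX_MULTIPLEXING, MAX_CONCURRENCY = 4, 6


def expected_runners_multiplexed(requests):
    # Each runner serves up to MAX_MULTIPLEXING requests, so round up
    # to the number of whole runners needed for the current load.
    busy = math.ceil(requests / MAX_MULTIPLEXING)
    # Assumed model: add the buffer, floor at min, cap at max concurrency.
    return min(MAX_CONCURRENCY, max(MIN_CONCURRENCY, busy + CONCURRENCY_BUFFER))


multiplexed = {r: expected_runners_multiplexed(r) for r in (0, 1, 4, 5, 16, 24)}
for requests, runners in multiplexed.items():
    print(f"{requests:>2} requests -> {runners} runners")
```

Under this model, 1 to 4 concurrent requests fit on one runner, so the total stays at 3, and the cap of 6 runners is reached at 13 or more concurrent requests.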