Deploying to production is the step that turns your local app into a permanent, scalable API endpoint. While fal run gives you a temporary URL for testing, fal deploy creates a persistent deployment that stays available until you explicitly remove it. Your app gets a stable endpoint ID (e.g., your-username/my-model) that callers use with the fal client SDKs or REST API, automatic runner scaling, and built-in reliability through the queue system. Before deploying, make sure you have installed the CLI, written your fal.App, and tested it locally with fal run. This page walks through the deployment process from the initial command through build, configuration, and what happens behind the scenes. For managing deployments after they are live, see Manage Deployments.

Development vs Production

During development, use fal run to get a temporary endpoint for testing. It creates an ephemeral deployment that is destroyed when you stop the process. By default, fal run uses public auth mode so you can test without an API key.
fal run path/to/myapp.py::MyApp
When you are ready for production, switch to fal deploy. This creates a persistent deployment with a permanent URL. By default, fal deploy uses private auth mode, requiring an API key to call your app.
fal deploy path/to/myapp.py::MyApp

Deploying Your App

The simplest deployment requires just the path to your app file and class:
fal deploy path/to/myapp.py::MyApp
fal builds your app into a container image remotely, pushes it to the registry, and makes it available at the endpoint ID your-username/app-name. Callers use this ID with fal_client.subscribe("your-username/app-name", ...) or via the REST API at https://queue.fal.run/your-username/app-name. You do not need Docker installed locally. The build happens entirely in the cloud.
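As a sketch, the endpoint ID maps onto the queue REST URL described above. The payload fields a caller sends are hypothetical here and depend entirely on your app's input schema:

```python
# Sketch: how a caller addresses a deployed app. The endpoint ID maps
# directly onto the queue REST URL described above.
ENDPOINT_ID = "your-username/app-name"  # placeholder endpoint ID

def queue_url(endpoint_id: str) -> str:
    # REST callers POST their payload to https://queue.fal.run/<endpoint-id>
    return f"https://queue.fal.run/{endpoint_id}"

print(queue_url(ENDPOINT_ID))
# With the Python client, the same ID is passed directly:
#   import fal_client
#   result = fal_client.subscribe(ENDPOINT_ID, arguments={"prompt": "..."})
```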

App Name

By default, your app name is derived from the class name (e.g., TextToImage becomes text-to-image). You can override it in your code or via the CLI. The app name is the second part of your endpoint ID (your-username/app-name).
class MyApp(fal.App):
    app_name = "my-custom-name"
fal deploy my_app.py::MyApp --app-name my-custom-name
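The derivation above can be pictured as a CamelCase-to-kebab-case conversion. This is an illustrative guess at the exact rule, not fal's implementation:

```python
import re

# Hypothetical sketch of the default name derivation: split a CamelCase
# class name at uppercase boundaries, join with hyphens, and lowercase.
def default_app_name(class_name: str) -> str:
    return re.sub(r"(?<!^)(?=[A-Z])", "-", class_name).lower()

print(default_app_name("TextToImage"))  # text-to-image
```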

Machine Type

Set the hardware your app runs on using the machine_type class attribute. For GPU workloads, you can also specify num_gpus for multi-GPU configurations. See Machine Types for the full list of available hardware and guidance on choosing the right option.
class MyApp(fal.App):
    machine_type = "GPU-A100"
    num_gpus = 1
You can also specify multiple machine types to expand your available hardware pool, which reduces queue wait times:
class MyApp(fal.App):
    machine_type = ["GPU-A100", "GPU-H100"]

Regions

By default, your app can run in any available region. If you need data residency compliance or want to minimize latency for a specific geography, restrict it:
class MyApp(fal.App):
    machine_type = "GPU-A100"
    regions = ["us-east"]

Environments

You can deploy to different environments (e.g., staging, production) to isolate configurations and secrets:
fal deploy my_app.py::MyApp --env staging

Authentication Modes

Authentication controls who can call your app and who pays for compute. You set it with the --auth flag or the app_auth class attribute.
- Private (default for fal deploy): only the account owner and their team members can call the app, and they must include an API key. All compute is billed to the owner.
- Public (default for fal run): anyone with the URL can call the app without authentication. All compute is billed to the owner. Use this for open demos or internal tools where you control access at another layer.
- Shared: anyone with the URL can call the app, but callers must authenticate with their own API key. Each caller pays for their own usage. This is how all apps in the Model Gallery work. Shared mode requires admin enablement on your account. See Publishing to the Marketplace for the full process of making your app available as a marketplace model.
class MyApp(fal.App):
    app_auth = "private"
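For a private app, callers must attach an API key. As a sketch, assuming fal's documented "Key" authorization scheme and the FAL_KEY environment variable the client SDKs read:

```python
import os

# Sketch: private apps require an API key. The fal clients read it from
# the FAL_KEY environment variable; the REST API expects it in an
# Authorization header using the "Key" scheme.
os.environ.setdefault("FAL_KEY", "key-id:key-secret")  # placeholder value

def auth_header(api_key: str) -> dict[str, str]:
    return {"Authorization": f"Key {api_key}"}

print(auth_header(os.environ["FAL_KEY"]))
```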

Rollout Strategies

When you redeploy an app that is already running, fal needs to transition from the old revision to the new one. You can choose between two strategies:
- Rolling (default): spins up a runner on the new revision before switching traffic. fal waits for this runner to complete setup() and become healthy; only after the new runner is confirmed ready does fal switch the alias. This ensures zero downtime but takes longer to complete. If the new runner fails to start (e.g., setup() crashes), the deployment is aborted and traffic stays on the old revision.
- Recreate: instantly switches the app alias to the new revision. No runners are started proactively, so the first request after redeployment triggers a cold start on the new revision. If your app has min_concurrency > 0, the scaling system will eventually bring runners up to meet that minimum, but this happens through normal scaling, not as part of the deploy itself.
fal deploy path/to/myapp.py::MyApp --strategy recreate
With rolling deployments, you will see a runner spin up during fal deploy even if there is no traffic. This is the health verification step. The deploy command stays open until the runner is confirmed healthy or fails.
For more on managing revisions and rolling back to previous versions, see Rollbacks.
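The two strategies can be summarized in pseudologic. This is purely illustrative, not fal's scheduler code; start_runner and switch_alias are hypothetical stand-ins for the internal machinery:

```python
# Illustrative sketch of the two rollout strategies. start_runner returns
# True once the new runner completes setup() and passes health checks.
def redeploy(strategy, start_runner, switch_alias):
    if strategy == "rolling":
        # New runner must be confirmed healthy before the alias moves.
        if not start_runner():
            return "aborted: traffic stays on the old revision"
        switch_alias()
        return "alias switched after health check (zero downtime)"
    if strategy == "recreate":
        # Alias flips immediately; the next request cold-starts a runner.
        switch_alias()
        return "alias switched immediately; next request cold-starts"
    raise ValueError(f"unknown strategy: {strategy}")
```

For example, `redeploy("rolling", start_runner=lambda: False, switch_alias=lambda: None)` returns the aborted message, matching the behavior described above when setup() crashes.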

How Builds Work

When you run fal deploy, your code is built into a container image remotely. You do not need Docker installed locally.
1. Upload: Your app code is sent to fal’s build service. For non-container apps, files listed in app_files are included. For container apps, files referenced by COPY/ADD in your Dockerfile are included automatically.
2. Cache check: A hash is computed from your Dockerfile, dependencies, build args, and container files. If an identical image already exists, the build is skipped entirely.
3. Build: fal builds your container image in the cloud using Depot, a remote Docker build service.
4. Push: The built image is stored in fal’s container registry.
5. Ready: Runners pull the image when they start up. Image size directly affects pull time and cold start duration.
Since builds happen remotely, your local machine does not need Docker, GPU drivers, or large amounts of disk space. You only need the fal CLI and your code.

Build Caching

fal uses multiple levels of caching to speed up builds. Build hash deduplication checks whether anything changed since the last build — if not, the previous image is reused with no build at all. Docker layer caching stores layers per-user in the registry, so unchanged layers from previous builds are reused and only modified layers are rebuilt. To force a full rebuild bypassing all caching:
fal deploy my_app.py --no-cache
For tips on structuring your Dockerfile for faster builds, see Optimize Container Images.
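Build hash deduplication can be pictured with a small sketch. This is not fal's actual hashing code, just an illustration of the idea that identical inputs always produce the same image key, so a matching key means the build can be skipped:

```python
import hashlib

# Illustrative sketch: hash the Dockerfile, dependency list, and build
# args; identical inputs yield the same key, so the build can be skipped.
def build_hash(dockerfile: str, deps: list[str], build_args: dict[str, str]) -> str:
    h = hashlib.sha256()
    h.update(dockerfile.encode())
    for dep in sorted(deps):  # sorted: dependency order does not matter
        h.update(dep.encode() + b"\x00")
    for key, value in sorted(build_args.items()):
        h.update(f"{key}={value}".encode() + b"\x00")
    return h.hexdigest()

# Same inputs in a different order hash identically: no rebuild needed.
assert build_hash("FROM python:3.11", ["fal", "torch"], {}) == \
       build_hash("FROM python:3.11", ["torch", "fal"], {})
```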

What’s Next