Deployment is where your fal App goes from local development to a production API that serves real traffic. When you run fal deploy, fal builds your code into a container image, pushes it to a registry, and makes it available at a permanent endpoint ID. From that point, runners spin up on demand to handle requests, scale automatically with traffic, and shut down when idle.

You can roll back to any previous version instantly, deploy to separate environments for staging and production, and tune scaling parameters without redeploying.

Before diving into this section, make sure you have installed the CLI and built your app following the Development guides. The pages here cover everything after your code is written: understanding what runners are, deploying to production, managing versions and environments, choosing hardware, and configuring scaling. If you are migrating from another platform, the migration guides can help you get started faster.

Quick Start

The simplest deployment is a single command:
fal deploy path/to/myapp.py::MyApp
This builds your app remotely, creates a persistent deployment, and gives you an endpoint ID like your-username/my-model that callers use with the fal client SDKs.
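Once deployed, the endpoint can be invoked from the fal client SDKs. A minimal sketch in Python, assuming the fal_client package is installed and a FAL_KEY credential is set in the environment; the endpoint ID and arguments below are placeholders for your own app:

```python
import fal_client

# subscribe() submits a request to the app's queue and blocks until the
# result is ready. "your-username/my-model" is a placeholder endpoint ID,
# and the arguments dict depends entirely on your app's input schema.
result = fal_client.subscribe(
    "your-username/my-model",
    arguments={"prompt": "a photo of a cat"},
)
print(result)
```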

Runners and Requests

Before deploying, it helps to understand the execution model. When a caller submits a request, it enters a persistent queue and is dispatched to a runner. Runners are compute instances that pull your container image, run setup(), and serve requests until they scale down. Understanding how requests flow through the queue and how runners start, process, and shut down is essential for debugging latency, configuring scaling, and managing costs.
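The flow above can be illustrated with a toy simulation (this is a self-contained sketch of the execution model, not the fal SDK: the Runner class, setup(), and handle() here are stand-ins for illustration only):

```python
from queue import Queue


class Runner:
    """Illustrative stand-in for a fal runner; not a real fal API."""

    def __init__(self):
        self.ready = False
        self.setup_calls = 0

    def setup(self):
        # Runs once per runner instance, e.g. loading model weights.
        self.setup_calls += 1
        self.ready = True

    def handle(self, request):
        if not self.ready:
            self.setup()  # cold start: setup happens before the first request
        return f"processed {request}"


# Incoming requests land in a persistent queue...
request_queue = Queue()
for i in range(3):
    request_queue.put(f"req-{i}")

# ...and are dispatched to a runner, which pays the setup cost once
# and then serves requests until the queue drains (then scales down).
runner = Runner()
results = []
while not request_queue.empty():
    results.append(runner.handle(request_queue.get()))
```

The key cost implication is visible in the sketch: setup() runs once per runner, not once per request, so warm runners answer quickly while new runners pay the cold-start cost first.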

Deploying and Managing

Deployment creates a versioned revision of your app; each deploy adds a new revision, so you can roll back instantly if something goes wrong. You choose a rollout strategy (recreate for speed, rolling for zero downtime), configure authentication (private, public, or shared billing), and optionally deploy to separate environments for staging and production.
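These options are passed at deploy time. A hedged sketch of what that might look like on the command line; the flag names below (--strategy, --auth) are assumptions about the fal CLI and should be verified against fal deploy --help before use:

```shell
# Roll out with zero downtime and make the endpoint publicly callable.
# --strategy and --auth are assumed flag names; check `fal deploy --help`.
fal deploy path/to/myapp.py::MyApp --strategy rolling --auth public
```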

Scaling and Hardware

Once deployed, you control how your app scales. Scaling parameters determine how many runners stay warm, how quickly new ones spin up, and how many requests each runner handles concurrently. Machine types determine the hardware backing each runner, from lightweight CPU instances to H200 and B200 GPUs.
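In code, these knobs typically live on the app class itself. A minimal sketch, assuming a fal App definition; the exact parameter names, their placement, and the machine type identifier below are assumptions to check against the fal API reference:

```python
import fal


# keep_alive, min_concurrency, and max_concurrency are assumed parameter
# names; "GPU-H100" is an assumed machine type identifier.
class MyApp(
    fal.App,
    keep_alive=300,      # keep an idle runner warm for 300s after the last request
    min_concurrency=0,   # runners kept warm even with no traffic
    max_concurrency=2,   # upper bound on simultaneous runners
):
    machine_type = "GPU-H100"  # hardware backing each runner

    def setup(self):
        # Load models and other heavy resources once per runner here.
        ...
```

The trade-off these parameters express: higher keep_alive and min_concurrency reduce cold-start latency at the cost of paying for idle runners, while max_concurrency caps spend under traffic spikes.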