Deployment is where your fal App goes from local development to a production API that serves real traffic. When you run fal deploy, fal builds your code into a container image, pushes it to a registry, and makes it available at a permanent endpoint ID. From that point, runners spin up on demand to handle requests, scale automatically with traffic, and shut down when idle.

You can roll back to any previous version instantly, deploy to separate environments for staging and production, and tune scaling parameters without redeploying.

Before diving into this section, make sure you have installed the CLI and built your app following the Development guides. The pages here cover everything after your code is written: understanding what runners are, deploying to production, managing versions and environments, choosing hardware, and configuring scaling. If you are migrating from another platform, the migration guides can help you get started faster.

Quick Start

The simplest deployment is a single command:
fal deploy path/to/myapp.py::MyApp
This builds your app remotely, creates a persistent deployment, and gives you an endpoint ID like your-username/my-model that callers use with the fal client SDKs.
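Once deployed, the endpoint can be invoked from the fal client SDKs. A minimal sketch in Python, assuming the fal_client package is installed and a FAL_KEY credential is set in the environment; the endpoint ID and arguments below are placeholders for your own app:

```python
import fal_client

# subscribe() submits a request to the app's queue and blocks until the
# result is ready. "your-username/my-model" is a placeholder endpoint ID,
# and the arguments dict depends entirely on your app's input schema.
result = fal_client.subscribe(
    "your-username/my-model",
    arguments={"prompt": "a photo of a cat"},
)
print(result)
```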

Runners and Requests

Before deploying, it helps to understand the execution model. When a caller submits a request, it enters a persistent queue and is dispatched to a runner. Runners are compute instances that pull your container image, run setup(), and serve requests until they scale down. Understanding how requests flow through the queue and how runners start, process, and shut down is essential for debugging latency, configuring scaling, and managing costs.
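The flow above can be illustrated with a toy simulation (this is a self-contained sketch of the execution model, not the fal SDK: the Runner class, setup(), and handle() here are stand-ins for illustration only):

```python
from queue import Queue


class Runner:
    """Illustrative stand-in for a fal runner; not a real fal API."""

    def __init__(self):
        self.ready = False
        self.setup_calls = 0

    def setup(self):
        # Runs once per runner instance, e.g. loading model weights.
        self.setup_calls += 1
        self.ready = True

    def handle(self, request):
        if not self.ready:
            self.setup()  # cold start: setup happens before the first request
        return f"processed {request}"


# Incoming requests land in a persistent queue...
request_queue = Queue()
for i in range(3):
    request_queue.put(f"req-{i}")

# ...and are dispatched to a runner, which pays the setup cost once
# and then serves requests until the queue drains (then scales down).
runner = Runner()
results = []
while not request_queue.empty():
    results.append(runner.handle(request_queue.get()))
```

The key cost implication is visible in the sketch: setup() runs once per runner, not once per request, so warm runners answer quickly while new runners pay the cold-start cost first.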

Deploying and Managing

Deployment creates a versioned revision of your app; each deploy adds a new revision, so you can roll back instantly if something goes wrong. You choose a rollout strategy (recreate for speed, rolling for zero downtime), configure authentication (private, public, or shared billing), and optionally deploy to separate environments for staging and production.
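These options are passed at deploy time. A hedged sketch of what that might look like on the command line; the flag names below (--strategy, --auth) are assumptions about the fal CLI and should be verified against fal deploy --help before use:

```shell
# Roll out with zero downtime and make the endpoint publicly callable.
# --strategy and --auth are assumed flag names; check `fal deploy --help`.
fal deploy path/to/myapp.py::MyApp --strategy rolling --auth public
```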

Scaling and Hardware

Once deployed, you control how your app scales. Scaling parameters determine how many runners stay warm, how quickly new ones spin up, and how many requests each runner handles concurrently. Machine types determine the hardware backing each runner, from lightweight CPU instances to H200 and B200 GPUs.
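In code, these knobs typically live on the app class itself. A minimal sketch, assuming a fal App definition; the exact parameter names, their placement, and the machine type identifier below are assumptions to check against the fal API reference:

```python
import fal


# keep_alive, min_concurrency, and max_concurrency are assumed parameter
# names; "GPU-H100" is an assumed machine type identifier.
class MyApp(
    fal.App,
    keep_alive=300,      # keep an idle runner warm for 300s after the last request
    min_concurrency=0,   # runners kept warm even with no traffic
    max_concurrency=2,   # upper bound on simultaneous runners
):
    machine_type = "GPU-H100"  # hardware backing each runner

    def setup(self):
        # Load models and other heavy resources once per runner here.
        ...
```

The trade-off these parameters express: higher keep_alive and min_concurrency reduce cold-start latency at the cost of paying for idle runners, while max_concurrency caps spend under traffic spikes.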