fal Serverless lets you deploy your own AI models, pipelines, and applications on GPU infrastructure that scales automatically. You write a Python class, define your hardware requirements, and fal handles provisioning, scaling, networking, and observability. Your code runs on H100s, A100s, or any other machine type you choose, scaling from zero runners to thousands based on demand, and back to zero when traffic stops.

Every model in the Model APIs marketplace is a fal.App running on Serverless. When you deploy your own app, you get the same queue-based reliability, the same analytics dashboard, and the same client SDKs.

The difference is that you control the code, the model weights, and the container environment. You can also publish your app to the marketplace so anyone can call it with their own API key.
Enterprise Feature - Please visit the Serverless Get Started page to request access.

How It Works

The best deployment approach depends on where you are starting from. If you are migrating an existing HTTP server from another provider, you can be up and running with minimal code changes. If you have a custom Docker image, you can bring it directly. If you are starting a new project, fal can build and manage the container for you. All three paths give you the same autoscaling, observability, and runner management.

Migrating an existing server

If you already have a working HTTP server (FastAPI, Flask, or any framework), this is the fastest path. Deploy it with @fal.function and exposed_port, and fal routes traffic to your server’s port with no code changes to your existing application.
import subprocess
import fal
from fal.container import ContainerImage

@fal.function(
    image=ContainerImage.from_dockerfile_str("FROM your-existing-image:latest"),
    machine_type="GPU-A100",
    exposed_port=8000,
)
def run_server():
    subprocess.run(
        ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"],
        check=True,
    )
fal.function supports all the same scaling parameters as fal.App (keep_alive, min_concurrency, max_concurrency, and more). See Migrate a Docker Server for a complete walkthrough and the full fal.function parameter reference. There are also step-by-step guides for Replicate, Modal, and RunPod.
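As a sketch of those scaling parameters applied to the example above (the numeric values here are illustrative assumptions, not recommendations; see the fal.function parameter reference for defaults):

```python
import subprocess

import fal
from fal.container import ContainerImage

@fal.function(
    image=ContainerImage.from_dockerfile_str("FROM your-existing-image:latest"),
    machine_type="GPU-A100",
    exposed_port=8000,
    keep_alive=300,       # seconds an idle runner stays warm (illustrative)
    min_concurrency=1,    # keep one warm runner to avoid cold starts
    max_concurrency=10,   # cap how far the app scales up
)
def run_server():
    # Start the existing server; fal routes traffic to the exposed port.
    subprocess.run(
        ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"],
        check=True,
    )
```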

Migrating a custom container

If you have a Docker image with your model and dependencies baked in but not a full HTTP server, you can bring it directly. Use ContainerImage to reference your Dockerfile or pull from a registry. You keep full control over the build while using fal’s endpoint system and scaling.
import fal
from fal.container import ContainerImage

class MyModel(fal.App):
    image = ContainerImage.from_dockerfile_str("""
        FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
        RUN apt-get update && apt-get install -y ffmpeg
        RUN pip install diffusers transformers
    """)
    machine_type = "GPU-A100"

    def setup(self):
        from diffusers import StableDiffusionXLPipeline
        self.pipe = StableDiffusionXLPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0"
        ).to("cuda")

    @fal.endpoint("/")
    def generate(self, input: dict) -> dict:
        import base64
        import io

        image = self.pipe(input["prompt"]).images[0]
        # PIL images are not JSON-serializable; return the PNG bytes as base64
        buffer = io.BytesIO()
        image.save(buffer, format="PNG")
        return {"image": base64.b64encode(buffer.getvalue()).decode()}
You can pull from private registries (Docker Hub, GCP Artifact Registry, AWS ECR). See Custom Container Images for the full guide.
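One practical note on endpoints like the one above: responses travel as JSON, so binary outputs such as rendered images are typically base64-encoded rather than returned as raw objects. A minimal stdlib-only sketch of that round trip:

```python
import base64

# Stand-in for real PNG bytes produced by a pipeline; base64 turns them
# into a JSON-safe string, and the client decodes back to raw bytes.
png_bytes = b"\x89PNG\r\n\x1a\n"
encoded = base64.b64encode(png_bytes).decode()
decoded = base64.b64decode(encoded)
assert decoded == png_bytes
print(encoded)  # "iVBORw0KGgo="
```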

Starting a new project

If you are building from scratch, use a native fal.App with pip requirements. You write a Python class, list your dependencies, and define your endpoints. fal builds the container for you.
import fal
from pydantic import BaseModel

class Input(BaseModel):
    prompt: str

class Output(BaseModel):
    result: str

class MyModel(fal.App):
    machine_type = "GPU-H100"
    requirements = ["torch", "transformers"]

    def setup(self):
        from transformers import pipeline
        self.pipe = pipeline("text-generation", model="gpt2", device="cuda")

    @fal.endpoint("/")
    def generate(self, input: Input) -> Output:
        result = self.pipe(input.prompt, max_length=50)[0]["generated_text"]
        return Output(result=result)
For the full environment setup options, see Defining Your Environment.

Test and deploy

Regardless of which approach you use, the workflow is the same. Test locally with fal run, then deploy with fal deploy. After deployment, your app gets a persistent URL, a Playground for browser-based testing, and automatic scaling based on incoming traffic. See the Quick Start to try it in under two minutes.
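Once deployed, the app can be called from the Python client SDK. A sketch assuming the `fal-client` package is installed, a `FAL_KEY` is set in the environment, and `your-username/my-model` stands in for the app id that `fal deploy` prints:

```python
import fal_client

# Placeholder app id; the real one is printed by `fal deploy` and shown
# on the app's dashboard page.
result = fal_client.subscribe(
    "your-username/my-model",
    arguments={"prompt": "a watercolor fox"},
)
print(result)
```

`subscribe` submits the request to the app's queue and blocks until the result is ready, which matches the queue-based reliability described above.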

Scaling, Observability, and Cost

Once your app is deployed, fal manages the infrastructure for you. Runners spin up when requests arrive and shut down when idle. You control the tradeoff between latency and cost with parameters like keep_alive (how long idle runners stay warm) and min_concurrency (minimum warm runners). To understand how runners transition between states and how caching reduces cold starts, see Runners and Caching.

Observability is built in. App Analytics shows request volume, latency percentiles, runner utilization, and startup duration. Error Analytics surfaces failing requests with stack traces. You can export metrics to your own stack via the Prometheus-compatible API or forward logs with Log Drains.

For billing, see Serverless Pricing and Optimizing Costs.
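As a configuration sketch of that latency/cost tradeoff (the attribute style mirrors the earlier examples, but the values are illustrative and the exact way your SDK version accepts these parameters may differ; check the fal.App reference):

```python
import fal

class MyModel(fal.App):
    machine_type = "GPU-H100"
    # Latency/cost knobs (illustrative values, not recommendations):
    keep_alive = 300       # seconds an idle runner stays warm before shutdown
    min_concurrency = 1    # warm runners kept ready even with zero traffic
    max_concurrency = 8    # upper bound on concurrent runners
```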

Next Steps

If you are new to fal, start with the Quick Start, which walks through building and deploying a Hello World app in under two minutes. The Deploy Your First Image Generator tutorial applies the same workflow to a real Stable Diffusion XL model. Once you are comfortable with the basics, the App Lifecycle page explains how apps are structured, where code runs, and how runners start up and shut down. For more deployment examples covering text-to-image, video, speech, ComfyUI, and custom containers, browse the Examples section. For recent platform updates, check the Changelog.