fal.App proxy for full control over the API surface. Both approaches give you autoscaling, analytics, and the same infrastructure that powers every model in the marketplace.
This is the fastest path for teams migrating from self-hosted infrastructure, Kubernetes, or other serverless platforms. Your existing server code stays unchanged: you define a Dockerfile (or reference an existing image from a private registry) and tell fal how to start your server. If you are starting from scratch rather than migrating, head to the Quick Start instead.
fal.function vs fal.App
Most of the Serverless documentation focuses on fal.App, the class-based approach where you define setup(), endpoints, and teardown() as methods on a class. For server migration, this guide uses @fal.function instead. It is a decorator-based alternative that wraps a single function rather than a class. You pass all configuration (machine type, scaling parameters, container image) as decorator arguments, and the function body runs on the remote machine.
fal.function is the natural fit for existing servers because you typically just need to start a process and expose a port. You do not need lifecycle hooks or multiple endpoints since your server already handles those. Both fal.function and fal.App support the same scaling parameters (keep_alive, min_concurrency, max_concurrency, and more). See the full parameter reference below for the complete list.
Option 1: Direct Server Mode
Use exposed_port to route requests directly to your container's port. fal forwards all incoming traffic to that port without any intermediate processing. The port can be any valid port number; just make sure it matches the port your server listens on.
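A minimal sketch of direct server mode, assuming a uvicorn-served app. The Dockerfile contents, machine type, port, and start command are placeholders for your own, and the OpenAPI spec is elided (the parameter reference below explains why metadata is required with exposed_port):

```python
import subprocess

import fal
from fal.container import ContainerImage

# Placeholder Dockerfile for an existing server; substitute your own.
DOCKERFILE = """
FROM python:3.11-slim
WORKDIR /app
COPY . /app
RUN pip install -r requirements.txt
"""

@fal.function(
    image=ContainerImage.from_dockerfile_str(DOCKERFILE),
    machine_type="GPU-A100",   # assumption: whatever hardware your server needs
    exposed_port=8080,         # must match the port the server binds to
    keep_alive=300,
    metadata={"openapi": {}},  # your OpenAPI spec; required with exposed_port
)
def run_server():
    # The body runs on the remote machine; fal routes traffic straight to the port.
    subprocess.run(
        ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"],
        check=True,
    )
```

The function body only needs to start the process and keep it in the foreground; fal handles routing, scaling, and shutdown around it.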
Option 2: Proxy App Mode
Use fal.App to wrap your server with custom endpoints. This gives you control over the API surface: you can validate inputs with Pydantic, transform outputs, upload files to the fal CDN, and define a typed schema that powers the Playground UI.
fal.App controls the API. The internal server runs on localhost inside the same container, and your proxy endpoints handle input validation, output processing, and CDN uploads. This approach is ideal when you want a clean typed API over an existing server that has its own internal protocol.
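A sketch of the proxy pattern, assuming the internal server is started as a subprocess and speaks JSON on localhost port 8000. The module name, routes, response shape, and machine type are all placeholders for your server's actual protocol:

```python
import subprocess
import time

import fal
import httpx
from pydantic import BaseModel

class Input(BaseModel):
    prompt: str

class Output(BaseModel):
    text: str

class ProxyApp(fal.App, keep_alive=300):
    machine_type = "GPU-A100"  # placeholder

    def setup(self):
        # Start the existing server inside the same container.
        self.server = subprocess.Popen(["python", "-m", "my_server"])
        # Block until it accepts connections so the first request doesn't fail.
        for _ in range(60):
            try:
                httpx.get("http://127.0.0.1:8000/health", timeout=1)
                break
            except httpx.TransportError:
                time.sleep(1)

    @fal.endpoint("/")
    def generate(self, input: Input) -> Output:
        # Validate with Pydantic, forward to the internal protocol,
        # and return a typed response that powers the Playground schema.
        r = httpx.post(
            "http://127.0.0.1:8000/generate",
            json={"prompt": input.prompt},
            timeout=120,
        )
        r.raise_for_status()
        return Output(text=r.json()["text"])
```

Because the endpoint declares Pydantic models, fal derives the OpenAPI spec automatically; no metadata parameter is needed in this mode.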
Using an External Registry
If your image is already hosted on an external registry (Docker Hub, Google Artifact Registry, Amazon ECR), you can pull it directly instead of building from a Dockerfile string. This avoids rebuilding the image on every deploy and is the recommended approach for production containers that are already built in CI. See Using Private Docker Registries for setup instructions, including authentication for each registry type.
fal.function Reference
The @fal.function decorator used in Option 1 accepts all the same infrastructure and scaling parameters that fal.App supports as class attributes. If you have been using fal.function and did not realize you could configure scaling, this table covers every available parameter.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `image` | `ContainerImage` | `None` | Custom Docker container for the function |
| `machine_type` | `str` or `list[str]` | `"XS"` (CPU) | Hardware to run on. Use a list for fallback types. |
| `num_gpus` | `int` | `None` | Number of GPUs to allocate |
| `exposed_port` | `int` | `None` | Route traffic directly to this port (for existing servers) |
| `requirements` | `list[str]` | `None` | Pip packages to install (when not using `image`) |
| `keep_alive` | `int` | `10` | Seconds an idle runner stays alive before shutting down |
| `min_concurrency` | `int` | `0` | Minimum runners kept warm at all times |
| `max_concurrency` | `int` | `None` | Maximum runners to scale up to |
| `max_multiplexing` | `int` | `1` | Maximum concurrent requests per runner |
| `concurrency_buffer` | `int` | `0` | Extra runners to keep warm above current load |
| `concurrency_buffer_perc` | `int` | `0` | Percentage buffer of runners above current load |
| `scaling_delay` | `int` | `None` | Seconds to wait before scaling up for a new request |
| `request_timeout` | `int` | `None` | Maximum seconds for a single request |
| `startup_timeout` | `int` | `None` | Maximum seconds for the function to start |
| `setup_function` | `Callable` | `None` | One-time initialization function (runs before the first request) |
| `regions` | `list[str]` | `None` | Restrict to specific regions |
| `serve` | `bool` | `False` | Run as an HTTP server on port 8080 |
| `metadata` | `dict` | `None` | App metadata. Pass `{"openapi": {...}}` to provide your OpenAPI spec for the Playground and endpoint listing. Required for `fal.function` with `exposed_port`. |
| `local_python_modules` | `list[str]` | `None` | Local Python modules to ship to the remote environment. See Import Code. |
| `python_version` | `str` | `None` | Python version to use (for the virtualenv kind) |
For fal.App configuration, see App Lifecycle. For scaling parameter details, see Scale Your Application.
Providing an OpenAPI Spec via Metadata
When using fal.function with exposed_port, fal does not automatically generate an OpenAPI spec from your code (unlike fal.App, which derives it from your Pydantic models). To enable the Playground, endpoint listing, and schema validation in the dashboard, pass your OpenAPI spec through the metadata parameter.
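A minimal sketch of what that spec might look like. The path, fields, and titles here are placeholders for your server's actual API:

```python
# A minimal OpenAPI 3 spec describing the server's endpoints, passed to fal
# via the `metadata` parameter as {"openapi": openapi_spec}.
openapi_spec = {
    "openapi": "3.0.0",
    "info": {"title": "My Server", "version": "1.0.0"},
    "paths": {
        "/generate": {  # placeholder route
            "post": {
                "requestBody": {
                    "content": {
                        "application/json": {
                            "schema": {
                                "type": "object",
                                "properties": {"prompt": {"type": "string"}},
                                "required": ["prompt"],
                            }
                        }
                    }
                },
                "responses": {"200": {"description": "Generated result"}},
            }
        }
    },
}
```

If your server is a FastAPI app, you can generate this dict at deploy time with `app.openapi()` instead of writing it by hand.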
Even if your server is built with FastAPI (which exposes /openapi.json automatically), the metadata approach is still required for fal.function deployments. The platform reads the spec from metadata at registration time, not from the running server.
Best Practices
Download model weights to persistent storage (/data) in your setup() method rather than baking them into the Docker image. This keeps your image small, speeds up container pulls, and allows weights to be cached across runner restarts. The /data directory is shared across all runners in your account and persists between deploys.
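That caching pattern can be sketched as a small helper. The `ensure_weights` name and the `fetch` callable are illustrative; on fal you would point `cache_dir` at a path under /data:

```python
import os

def ensure_weights(cache_dir: str, fetch) -> str:
    """Populate cache_dir once; later runners reuse the persisted copy.

    `fetch` is any callable that downloads weights into the directory
    (e.g. huggingface_hub.snapshot_download). A marker file guards against
    a half-finished download being treated as a valid cache.
    """
    marker = os.path.join(cache_dir, ".download_complete")
    if not os.path.exists(marker):
        os.makedirs(cache_dir, exist_ok=True)
        fetch(cache_dir)
        open(marker, "w").close()
    return cache_dir
```

Call this from setup() (or setup_function) with a directory such as /data/models/my-model, so the download happens on the first cold start and every later runner skips it.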
When building your Dockerfile, install fal-specific packages (boto3, protobuf, pydantic) at the end to avoid version conflicts with your existing dependencies. If your base image already includes these packages, the fal runtime will use the versions in your image.
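As a sketch, that ordering looks like the following (the base image and requirements file are placeholders):

```dockerfile
FROM python:3.11-slim
WORKDIR /app

# Your server's own dependencies first, pinned as usual.
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .

# fal-adjacent packages last, so pip resolves them against the versions
# already installed rather than upgrading your pins underneath you.
RUN pip install boto3 protobuf pydantic
```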
Tune keep_alive based on your app’s cold start time and traffic pattern. If your model takes minutes to load, a longer keep_alive avoids paying that cost repeatedly. If your app starts quickly, a shorter value reduces idle billing. See Optimizing Costs for guidance.
Next Steps
For a complete tutorial that applies this pattern to a real server, see the ComfyUI deployment example. For detailed Dockerfile configuration including build args, multi-stage builds, and private registries, see Custom Container Images. To understand how the /data persistent storage works and what gets cached, see Use Persistent Storage.