
App

An App is a Python class that wraps your AI model for deployment. Your app defines what packages it needs, how to load your model, and how users interact with it.
import fal

class MyApp(fal.App):
    machine_type = "GPU-H100"  # Choose your hardware

    def setup(self):
        # Load your model here.
        # Executed once on each runner.
        ...

    @fal.endpoint("/")
    def generate(self, input_data):
        # Your endpoint logic here, usually a model call.
        ...

    def teardown(self):
        # Cleanup code here.
        # Executed on each runner when the runner is shutting down.
        ...

Where Code Runs: Local vs Remote Execution

What runs locally

  • Module import / top level: When Python imports your file, all top-level code executes on your machine. This is where you typically define helpers and constants and construct objects you might pass to the app.
  • Building the app object: The class body for your app is defined locally (like any Python class).
  • Serialization boundary preparation: When you reference local objects from the app (e.g., myobj), we attempt to pickle them locally to ship to the remote runtime.

What runs remotely

  • Class transfer & instantiation: Your fal.App subclass is pickled locally, then unpickled and instantiated remotely in the runtime you configured (e.g. requirements, container image, etc.).
  • App methods / entrypoints: Methods of your app class (e.g. setup and @fal.endpoint handlers) execute on the remote machine.
  • Referenced symbols:
    • Pickled objects (closures, small data, simple classes) are shipped as part of the app payload.
    • Importable code (installed packages or modules present in the remote image) is imported remotely instead of being shipped.

Example

# --- Local (import-time) ---
import os

import fal
from pydantic import BaseModel

# Local constant (pickled if referenced).
# The environment variable is read from the local environment.
CONFIG = {"myparameter": os.environ["MYPARAMETER"]}

# Local helper (pickled by definition if referenced; its body is not executed locally).
def myhelper(x):
    # Runs remotely
    import mylib
    return mylib.helper(x)

# Request/response schemas (illustrative fields), defined locally and pickled with the app.
class MyInput(BaseModel):
    prompt: str

class MyOutput(BaseModel):
    result: str

class MyApp(fal.App):
    def setup(self):
        # Runs remotely once on each runner.
        # Load deps from the remote environment (fast, deterministic).
        import mylib  # must be installed in the remote image/requirements, or dynamically installed before this line
        self.pipeline = mylib.load_pipeline()

    @fal.endpoint("/")
    def generate(self, input: MyInput) -> MyOutput:
        # Runs remotely on each request
        result = self.pipeline(input, k=CONFIG["myparameter"])
        return MyOutput(result=result)

    def teardown(self):
        # Runs remotely on each runner when the runner is shutting down
        self.pipeline.close()

Lifespan

Startup

setup()

When a runner starts, it first initializes the application and calls the setup() method. This is where you should load your model, initialize connections, and prepare any resources needed to serve requests. The runner is not considered ready until setup() completes successfully.
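For instance, here is a minimal sketch of a setup() that loads a model once per runner; the requirements list, the diffusers API calls, and the model name are illustrative assumptions, not fal requirements:

import fal

class DiffusionApp(fal.App):
    machine_type = "GPU-H100"
    requirements = ["torch", "diffusers", "transformers", "accelerate"]

    def setup(self):
        # Heavy imports and weight loading happen here, once per runner.
        # The runner is not marked ready until this returns.
        import torch
        from diffusers import DiffusionPipeline

        self.pipe = DiffusionPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0",  # illustrative model
            torch_dtype=torch.float16,
        ).to("cuda")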

Shutdown

Runners can be terminated for several reasons:
  • Expiration: Runners will be terminated when they reach their expiration time
  • Manual stop/kill: You can manually terminate runners using the CLI or dashboard
  • Scaling activity: Runners may be terminated when scaling down due to reduced demand
When a runner is requested to terminate, it receives a SIGTERM signal. After receiving SIGTERM, no new requests are routed to the runner, and it has 5 seconds to:
  1. Run handle_exit() (if defined)
  2. Finish processing ongoing requests
  3. Run teardown() for cleanup
After this grace period, the runner is forcefully terminated with a SIGKILL.
If you are using fal<1.61.0, runners will be terminated immediately without a grace period.

handle_exit()

The handle_exit() method is called immediately when a SIGTERM is received. Use this to signal your request handlers to stop processing early, so there’s enough time remaining for cleanup in teardown(). Without handle_exit(), long-running requests may consume the entire grace period, causing teardown() to be skipped when SIGKILL arrives.

teardown()

The teardown() method is called after all ongoing requests have finished. Use this to clean up resources, close connections, or perform any final operations before the runner terminates.
While it’s possible to add your own signal handlers, we recommend using handle_exit() instead. The setup(), handle_exit(), and teardown() methods provide a clean and predictable way to manage your application’s lifecycle without the complexity of custom signal handling.

Example

import threading

import fal

class MyApp(fal.App):
    def setup(self):
        # Called when the runner starts
        self.model = load_model()        # placeholder loader
        self.db = connect_to_database()  # placeholder connection
        self.exit = threading.Event()

    @fal.endpoint("/")
    def run(self, input: Input) -> Output:
        result = None
        for i in range(30):
            if self.exit.is_set():
                # SIGTERM received, stop processing early
                break
            # Do some work here and update `result`
        return Output(result=result)

    def handle_exit(self):
        # Called when the runner is exiting (SIGTERM)
        self.exit.set()

    def teardown(self):
        # Called when the runner shuts down
        self.db.close()

Retry Policy

When using the queue, fal automatically retries requests in the following scenarios:
  • Server Error: The connection with the app broke or the runner returned a 503 status code
  • Timeout: The app took longer to respond than the request timeout or the runner returned a 504 status code
By default, fal retries in both situations.

Control Retry Behavior

You can configure your app to skip retries for specific conditions using the skip_retry_conditions option. Available conditions are "server_error" and "timeout".
class MyApp(fal.App):
    skip_retry_conditions = ["timeout"]  # This app won't retry on timeout
    ...
For per-request control, see the Model APIs docs.

Machine Type

Machine Type specifies the hardware (CPU or GPU) your app runs on. Choose based on your model’s needs: "CPU" for lightweight models, "GPU-H100" for most AI models, or "GPU-B200" for large models.
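A sketch of how this is set per app (the two classes below are illustrative):

import fal

class PreprocessApp(fal.App):
    machine_type = "CPU"  # lightweight, CPU-only work

class ModelApp(fal.App):
    machine_type = "GPU-H100"  # typical choice for most AI models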

Runner

A Runner is a compute instance that executes your app using your chosen machine type. Runners automatically start when requests arrive and shut down when idle to save costs.

Endpoint

An Endpoint is a function in your app that users can call via API. It defines how your model processes inputs and returns outputs.
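For example, one app can expose several endpoints at different paths. Below is a minimal sketch, where load_model, generate, and summarize are hypothetical:

import fal
from pydantic import BaseModel

class TextInput(BaseModel):
    prompt: str

class TextOutput(BaseModel):
    text: str

class MyApp(fal.App):
    def setup(self):
        self.model = load_model()  # hypothetical loader

    @fal.endpoint("/")
    def generate(self, input: TextInput) -> TextOutput:
        # Primary endpoint, served at the app's root path
        return TextOutput(text=self.model.generate(input.prompt))

    @fal.endpoint("/summarize")
    def summarize(self, input: TextInput) -> TextOutput:
        # Additional endpoint, served under /summarize
        return TextOutput(text=self.model.summarize(input.prompt))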

Playground

Each endpoint gets an automatic Playground, a web interface where you can test your model with different inputs before integrating it into your application.

fal run vs fal deploy

  • fal run: Test your app on a single cloud GPU during development. Creates a temporary URL that disappears when you stop the command. Defaults to public auth (no authentication required).
  • fal deploy: Deploy your app to production. Creates a permanent URL that stays available until you delete it. Defaults to private auth (API key required).
Both commands support the --auth flag to control access:
  • --auth public: Anyone can call your app without authentication (you pay for usage)
  • --auth private: Requires API key authentication
Use fal run while building and testing, then fal deploy when ready for production use.

Next Steps