Introduction to Private Serverless Apps
Private App Deployments
fal is a Generative Media Cloud with a model marketplace, private app deployments, and a training acceleration platform. This section covers our private app deployment system, which enables you to host your custom apps and workflows on our infrastructure—the same infrastructure that powers our own models.
Key Features
- A unified framework for running, deploying, and productionizing your AI apps
- Access to tens of thousands of GPUs, with dynamic scale up/down policies
- Full observability into requests, responses, and latencies (including custom metrics)
- Native HTTP and WebSocket clients that can be used for both fal-provided models and your own apps
- Access to fal’s Inference Engine for accelerating your apps/workflows
- And much more
Getting Started
Installation
If you have access to our private apps deployment offering, create a fresh virtual environment (we strongly recommend Python 3.11, but other versions are supported) and install the fal package:
pip install --upgrade fal
Authentication
Log in to either your personal account or any team you're a member of. Be careful to select the right entity: private beta access is likely enabled only for your team account, not your personal account.
fal auth login
When prompted, select your team:
If browser didn't open automatically, on your computer or mobile device navigate to [...]
Confirm it shows the following code: [...]
✓ Authenticated successfully, welcome!
Please choose a team account to use or leave blank to use your personal account:[team1/team2/team3]: team1
Confirm you selected the right team:
fal auth whoami
Running Your First App
Every deployment in fal is a subclass of fal.App consisting of one or more @fal.endpoint decorators. For simple apps or workflows with only one endpoint, this is generally the root (/) endpoint.
Here’s an example application that runs stabilityai/stable-diffusion-xl-base-1.0, an open-source text-to-image model:
import fal
from pydantic import BaseModel, Field
from fal.toolkit import Image


class Input(BaseModel):
    prompt: str = Field(
        description="The prompt to generate an image from",
        examples=["A beautiful image of a cat"],
    )


class Output(BaseModel):
    image: Image


class MyApp(fal.App, keep_alive=300, name="my-demo-app"):
    machine_type = "GPU-H100"
    requirements = [
        "hf-transfer==0.1.9",
        "diffusers[torch]==0.32.2",
        "transformers[sentencepiece]==4.51.0",
        "accelerate==1.6.0",
    ]

    def setup(self):
        # Enable HF Transfer for faster downloads
        import os

        os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

        import torch
        from diffusers import StableDiffusionXLPipeline

        # Load any model you want, we'll use stabilityai/stable-diffusion-xl-base-1.0
        # Huggingface models will be automatically downloaded to
        # the persistent storage of your account
        self.pipe = StableDiffusionXLPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0",
            torch_dtype=torch.float16,
            variant="fp16",
            use_safetensors=True,
        ).to("cuda")

        # Warmup the model before the first request
        self.warmup()

    def warmup(self):
        self.pipe("A beautiful image of a cat")

    @fal.endpoint("/")
    def run(self, request: Input) -> Output:
        result = self.pipe(request.prompt)
        image = Image.from_pil(result.images[0])
        return Output(image=image)
The application is divided into four parts:
- I/O: Defines inputs for the inference process (which in this case just takes a prompt) and the outputs (using the special fal.toolkit.Image, which automatically uploads your images to fal's CDN).
- App definition: Each app is completely separated from your local computer. You need to precisely define what dependencies your app requires to run in the cloud. In this case, it's diffusers and other related packages.
- setup() function: Before an app starts serving requests, it will always run the user-defined setup() function. This is where you download your models, load them to GPU memory, and run warmups.
- @fal.endpoint("/"): Using the I/O definitions and the pipeline loaded in setup(), this is where you implement the inference process. In this example, it simply calls the pipeline with the user's prompt and wraps the created PIL image with Image.from_pil() to upload it to the CDN.
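The example above exposes a single root endpoint, but as noted earlier, an app can carry more than one @fal.endpoint. Here is a minimal sketch of adding a second route to MyApp; the /with-negative-prompt path and the NegativePromptInput model are illustrative and not part of the original example:

class NegativePromptInput(BaseModel):
    prompt: str
    negative_prompt: str = ""


class MyApp(fal.App, keep_alive=300, name="my-demo-app"):
    ...  # machine_type, requirements, setup(), and the "/" endpoint as above

    @fal.endpoint("/with-negative-prompt")  # hypothetical second route on the same app
    def run_with_negative_prompt(self, request: NegativePromptInput) -> Output:
        # Same pipeline as the root endpoint, with an extra argument passed through.
        result = self.pipe(request.prompt, negative_prompt=request.negative_prompt)
        return Output(image=Image.from_pil(result.images[0]))

Each endpoint becomes its own path under the same app URL, so the route above would be reachable at .../with-negative-prompt once running.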
Testing Your Application
To run and test your app locally to ensure everything works:
fal run example.py::MyApp
Notes:
- During the first run or after any dependency change, we’ll build a Python environment from scratch. This process might take a couple of minutes, but as long as you don’t change the environment definition, it will reuse the same pre-built environment.
- The initial start will also download the models, but they'll be saved to a persistent location (under /data) where they will always be available. This means the next time you run this app, the model won't have to be downloaded again.
This command will print two links for you to interact with your app and start streaming the logs:
2025-04-07 21:37:41.001 [info ] Access your exposed service at https://fal.run/your-user/051cf487-8f52-43dc-b793-354507637dd0
2025-04-07 21:37:41.001 [info ] Access the playground at https://fal.ai/dashboard/sdk/your-user/051cf487-8f52-43dc-b793-354507637dd0
==> Running
INFO: Started server process [38]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
The first link is an HTTP proxy that goes directly to your app, allowing you to call any endpoint without authentication as long as fal run is active:
curl $FAL_RUN_URL -H 'content-type: application/json' -H 'accept: application/json, */*;q=0.5' -d '{"prompt":"A cat"}'
Alternatively, for root endpoints, you can visit the auto-generated fal playground to interact with your app through the web UI. This option requires authentication (you need to be using your team account if you started the fal run through a team account).
Deploying & Productionizing
Once you feel your app is ready for production and you don't want to maintain a fal run session, you can deploy it. Deployment provides a persistent URL that either routes requests to an existing runner or starts a new one if none are available:
fal deploy example.py::MyApp --auth=private
Registered a new revision for function 'my-demo-app' (revision='5b23e1b1-af88-4ab0-aebc-415b2b1e34b4').
Playground: https://fal.ai/models/fal-ai/my-demo-app/
Endpoints: https://fal.run/fal-ai/my-demo-app/
Once deployed, you can go to the playground or make an authenticated HTTP call to use your app.
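For example, a direct HTTP call to the endpoint URL printed by fal deploy could look like the following, assuming your API key is exported as FAL_KEY and passed via the Authorization: Key header:

curl https://fal.run/fal-ai/my-demo-app/ -H "Authorization: Key $FAL_KEY" -H 'content-type: application/json' -d '{"prompt":"A cat"}'

The JSON response follows your Output model; for the example app it should contain the CDN URL of the generated image.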
Note:
- Since this is the first invocation and you didn't set any minimum number of runners (min_concurrency), it will start a new runner and load the app. Once your request is finished, the runner will remain active for keep_alive seconds, so subsequent requests within the keep_alive window will be served instantly.
- This behavior can be configured with the fal app scale command. See the documentation for more details.
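For example, to keep one runner warm at all times for the app deployed above, you could run the command below (using the app name from the example and the flags listed in the FAQ; other scaling settings have their own options in the Scaling docs):

fal app scale my-demo-app --min-concurrency=1 --max-concurrency=5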
Monitoring & Analytics
For production deployments, you'll want more observability over your deployed app. The Analytics page lets you see all requests and identify failed or unusually slow ones, along with their payloads.
You can see logs attached to a specific request in the request’s detail page, or view global logs for all runners in the logs page.
Advanced Configuration
keep_alive
The keep_alive setting keeps the server running even when there are no active requests, so requests that hit the same application within the specified time frame avoid any startup overhead.
keep_alive is measured in seconds. In the example below, the application will keep running for at least 300 seconds after the last request:
class MyApp(fal.App, keep_alive=300):
    ...
Min & Max Concurrency and Concurrency Buffer
fal applications have a managed autoscaling system. You can configure the autoscaling behavior through min_concurrency, max_concurrency, and concurrency_buffer:
class MyApp(fal.App, keep_alive=300, min_concurrency=1, max_concurrency=5, concurrency_buffer=1):
    ...
- min_concurrency: the number of runners the system should maintain at all times (even when there are no requests)
- max_concurrency: the maximum number of runners the system should have. Once this limit is reached, all subsequent requests are placed in a managed queue
- concurrency_buffer: the number of runners the system should start above the current request volume to handle spiky traffic
For more details refer to Scaling.
FAQ - First Asked Questions
- How can I use local files or my repository with fal?
  If your project is already a Python package (e.g., has __init__.py and can be imported outside of the repo), you should be able to use it as is (import it at the top level and call the relevant functions). Note that if you're using any external dependencies in your project, you'll also need to include them in the requirements field.
- I already have a Dockerfile with all my dependencies. Can I use it?
  Yes! You can either pass us a pre-built Docker image as the base or your Dockerfile, and we'll build it for you. Note that your image's Python version and your local virtual environment's Python version need to match.
- How can I store my secrets?
  Use fal secrets set and you can read them as environment variables from your code (see the sketch after this list)! Docs
- Do you offer persistent storage? How can I use it?
  Anything written to /data will be persistent and available to all runners. Be careful when storing many small files, as this will increase latencies (prefer large single blobs whenever possible, like model weights; see the sketch after this list). Docs
- How can I scale my app?
  The platform offers extensive ways to configure scaling. The simplest approach is to increase the minimum and maximum number of runners via fal app scale $app --min-concurrency=N --max-concurrency=N. Check our docs to tune these variables and learn about other concepts (decaying keep-alive, multiplexing, concurrency buffers, etc.). Docs
- What is the best way to deploy from CI?
  You can create an ADMIN scoped key in your team account and use fal deploy with FAL_KEY set. Make sure to check our testing system as well. Docs
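As referenced in the secrets and persistent storage answers above, here is a minimal sketch of using both inside setup(), after storing a secret with fal secrets set (see the linked Docs for the exact CLI arguments). The secret name HF_TOKEN and the /data/my-weights cache path are assumptions for illustration:

class MyApp(fal.App, keep_alive=300, name="my-demo-app"):
    ...  # machine_type, requirements, and endpoints as in the example above

    def setup(self):
        import os
        from pathlib import Path

        # Secrets created with `fal secrets set` are exposed as environment variables.
        hf_token = os.environ["HF_TOKEN"]  # hypothetical secret name

        # Anything written under /data persists and is shared across runners, so cache
        # large artifacts (e.g. model weights) there instead of re-downloading them.
        cache_dir = Path("/data/my-weights")  # hypothetical cache location
        cache_dir.mkdir(parents=True, exist_ok=True)
        ...  # download gated weights with hf_token into cache_dir, then load them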