fal.App Class Reference
The `fal.App` class provides a high-level interface for creating serverless applications. It wraps your AI models and code for deployment with automatic scaling and management.
Basic Usage
```python
import fal


class MyApp(fal.App):
    def setup(self):
        # Initialize models and resources once per runner
        pass

    @fal.endpoint("/")
    def predict(self, input_data):
        # Process requests
        return {"result": "..."}
```
Configuration Options
You can configure your app using class variables or the `host_kwargs` dictionary for advanced options.
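For options without a dedicated class variable, `host_kwargs` takes a plain dictionary. A minimal sketch; the keys shown here simply mirror documented class variables for illustration, and the full set of accepted keys depends on your fal SDK version:

```python
import fal


class MyApp(fal.App):
    machine_type = "GPU-H100"

    # Advanced host options as a dict. These particular keys duplicate
    # documented class variables purely for illustration; check your
    # fal SDK version for the keys it actually accepts.
    host_kwargs = {
        "request_timeout": 300,
        "keep_alive": 60,
    }
```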
Environment Configuration
`requirements` (list[str])
List of pip packages to install in the environment.
```python
class MyApp(fal.App):
    requirements = ["numpy==1.24.0", "pandas", "torch>=2.0.0"]
```
`local_python_modules` (list[str])
List of local Python module names to include in the deployment.
```python
class MyApp(fal.App):
    local_python_modules = ["my_utils", "custom_models"]
```
Machine Configuration
`machine_type` (str | list[str])
Hardware type(s) to use. Can be a single type or a list of types in order of preference.

CPU Machines:
- "XS" - 0.50 CPU cores, 512MB RAM
- "S" - 1 CPU core, 1GB RAM (default)
- "M" - 2 CPU cores, 2GB RAM
- "L" - 4 CPU cores, 15GB RAM
GPU Machines:
"GPU-A100"
- 12 CPU cores, 60GB RAM, 1 GPU core (40GB VRAM)"GPU-H100"
- 12 CPU cores, 112GB RAM, 1 GPU core (80GB VRAM)"GPU-H200"
- 12 CPU cores, 112GB RAM, 1 GPU core (141GB VRAM)"GPU-B200"
- 24 CPU cores, 112GB RAM, 1 GPU core (192GB VRAM)
```python
class MyApp(fal.App):
    # Single machine type
    machine_type = "GPU-H100"

    # Or with multiple options (fal will pick whichever is available)
    machine_type = ["GPU-H100", "GPU-H200"]
```
`num_gpus` (int)
Number of GPUs required for the application.
```python
class MyApp(fal.App):
    machine_type = "GPU-H100"
    num_gpus = 2  # Request 2 H100 GPUs
```
Timeout Configuration
`request_timeout` (int)
Maximum time in seconds for a single request to complete.
```python
class MyApp(fal.App):
    request_timeout = 300  # 5 minutes
```
`startup_timeout` (int)
Maximum time in seconds for the environment to start up.
```python
class MyApp(fal.App):
    startup_timeout = 600  # 10 minutes for large model loading
```
Authentication
`app_auth` (str)
Authentication mode for the application.
Options:
- "private": Only accessible with your API key
- "public": Accessible without authentication
- "shared": Accessible with any valid fal API key
- None: Inherit from the deployment command
```python
class MyApp(fal.App):
    app_auth = "shared"  # Allow access with any valid fal key
```
App Metadata
`app_name` (str)
Custom name for the application. Auto-generated from class name if not specified.
```python
class MyApp(fal.App):
    app_name = "image-generator-v2"
```
Scaling Configuration
Control how your application scales to handle traffic. These options help balance performance and cost.
`keep_alive` (int)
Time in seconds to keep idle runners alive. Default: 10 seconds.
```python
class MyApp(fal.App):
    keep_alive = 300  # Keep runners alive for 5 minutes after last request
```
`min_concurrency` (int)
Minimum number of runners to keep running at all times. Default: 0.
```python
class MyApp(fal.App):
    min_concurrency = 2  # Always keep 2 runners ready
```
`max_concurrency` (int)
Maximum number of runners that can be created. Default: 10.
```python
class MyApp(fal.App):
    max_concurrency = 50  # Allow up to 50 runners during peak traffic
```
`concurrency_buffer` (int)
Number of extra runners to provision beyond current demand. Default: 0.
```python
class MyApp(fal.App):
    concurrency_buffer = 2  # Keep 2 extra runners ready for traffic spikes
```
`max_multiplexing` (int)
Maximum number of requests a single runner can handle concurrently. Default: 1.
```python
class MyApp(fal.App):
    max_multiplexing = 5  # Each runner can handle 5 concurrent requests
```
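Together, `max_concurrency` and `max_multiplexing` bound how many requests your app can serve at once; anything beyond that waits in the queue. A quick back-of-the-envelope calculation using the example values above (plain arithmetic, not a fal API):

```python
max_concurrency = 50  # runner cap from the example above
max_multiplexing = 5  # concurrent requests per runner

# Upper bound on requests in flight before new requests queue:
peak_in_flight = max_concurrency * max_multiplexing
print(peak_in_flight)  # 250
```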
Note: See the Scaling Guide for detailed explanations and examples of these options.
Complete Example
Here’s a comprehensive example showing all common configuration options:
```python
import fal
from typing import Dict, Any


class ImageGenerationApp(fal.App):
    # Environment setup
    requirements = [
        "torch==2.1.0",
        "transformers==4.35.0",
        "diffusers==0.24.0",
        "accelerate",
        "pillow",
    ]
    local_python_modules = ["custom_pipeline"]

    # Machine configuration
    machine_type = ["GPU-H100", "GPU-H200"]  # will pick whichever is available
    num_gpus = 1

    # Timeouts
    request_timeout = 600  # 10 minutes per request
    startup_timeout = 900  # 15 minutes for model loading

    # Authentication
    app_auth = "shared"  # Accessible with any valid fal key
    app_name = "stable-diffusion-xl"

    # Scaling configuration
    keep_alive = 300        # 5 minutes
    min_concurrency = 1     # Keep 1 runner always ready
    max_concurrency = 10    # Scale up to 10 runners max
    concurrency_buffer = 1  # Keep 1 extra runner for spikes
    max_multiplexing = 1    # 1 request per runner (GPU-bound workload)

    def setup(self):
        """Initialize models once per runner."""
        import torch
        from diffusers import DiffusionPipeline

        self.pipe = DiffusionPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0",
            torch_dtype=torch.float16,
            variant="fp16",
            use_safetensors=True,
        )
        self.pipe.to("cuda")

    @fal.endpoint("/generate")
    def generate(
        self,
        prompt: str,
        negative_prompt: str = "",
        steps: int = 30,
        width: int = 1024,
        height: int = 1024,
    ) -> Dict[str, Any]:
        """Generate an image from a text prompt."""
        image = self.pipe(
            prompt=prompt,
            negative_prompt=negative_prompt,
            num_inference_steps=steps,
            width=width,
            height=height,
        ).images[0]

        # Convert to base64 for API response
        import io
        import base64

        buffer = io.BytesIO()
        image.save(buffer, format="PNG")
        image_base64 = base64.b64encode(buffer.getvalue()).decode()

        return {
            "image": image_base64,
            "content_type": "image/png",
            "width": width,
            "height": height,
        }

    @fal.endpoint("/health")
    def health_check(self) -> Dict[str, str]:
        """Simple health check endpoint."""
        return {"status": "healthy", "model": "sdxl"}
```
See Also
- Getting Started Guide - Quick introduction to building your first app
- Deployment Operations - Production deployment best practices
- Scaling Guide - Detailed scaling configuration