
Build with fal

The generative media platform powering the world’s top AI apps.

Call 1,000+ optimized models through a unified API, or deploy your own on the same infrastructure. Image, video, audio, music, speech, 3D, and real-time streaming. Built to scale to billions of requests.

99.99%+ uptime · Billions of requests per day · 1,000+ endpoints

Platform APIs

REST APIs for model metadata, pricing, usage tracking, logs, files, and metrics

Start with a Model API Call

Most users start here. Pick a model from the Marketplace, get an API key, and make a request. A few lines of code, no infrastructure to manage.
import fal_client

result = fal_client.subscribe(
    "fal-ai/nano-banana-2",
    arguments={"prompt": "a sunset over mountains"},
)
print(result["images"][0]["url"])
Every model supports synchronous calls, async queue, streaming, real-time WebSocket, and webhooks. You can compare models side-by-side in the Sandbox before committing to one.
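
The blocking `subscribe` call above is the simplest path; for long-running generations, the queue interface lets you submit a request now and collect the result later. A minimal sketch, assuming `fal_client`'s `submit` helper (with its `get`/`status` handle methods) and the same model endpoint:

```python
def build_arguments(prompt: str) -> dict:
    """Assemble the request payload; the text-to-image models here take a prompt."""
    return {"prompt": prompt}

def fetch_via_queue(prompt: str) -> dict:
    """Enqueue a request and block until the result is ready (network call)."""
    import fal_client  # requires `pip install fal-client` and a FAL_KEY in the environment

    handler = fal_client.submit(
        "fal-ai/nano-banana-2",
        arguments=build_arguments(prompt),
    )
    # Block for the finished result; you could instead poll handler.status()
    # while doing other work.
    return handler.get()
```

For example, `fetch_via_queue("a sunset over mountains")["images"][0]["url"]` returns the same URL as the synchronous call above, but the request survives client disconnects while it waits in the queue.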

Model APIs Quickstart

Browse models, see pricing, and start generating

Deploy Your Own Models

For teams that need to run custom models, proprietary pipelines, or fine-tuned variants, fal Serverless lets you deploy on the same engine that powers the Marketplace. fal has been running this infrastructure for over 3 years, and every model on the platform goes through the same lifecycle below.
1. Develop

A fal.App is a Python class where your setup() method runs once per runner to load model weights and initialize resources. Your @fal.endpoint methods then serve incoming requests using the initialized state. You declare hardware needs and environment alongside your code, so infrastructure is versioned with your app.
import fal

class MyModel(fal.App):
    machine_type = "GPU-H100"
    
    def setup(self):
        self.model = load_my_model()
    
    @fal.endpoint("/")
    def generate(self, prompt: str):
        return self.model(prompt)

2. Test

fal run spins up a cloud GPU runner and gives you a temporary URL so you can test on the same hardware you’ll use in production. It also generates a playground UI automatically. For CI, AppClient lets you run tests against ephemeral deployments.
fal run my_app.py
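
Because the temporary URL is a plain HTTP endpoint, CI can also exercise it with the standard library alone. A sketch, assuming an endpoint that accepts a JSON body with a `prompt` field (substitute the URL that `fal run` actually prints):

```python
import json
import urllib.request

def build_request(url: str, prompt: str) -> urllib.request.Request:
    """Prepare a POST carrying the JSON body the endpoint expects."""
    return urllib.request.Request(
        url,
        data=json.dumps({"prompt": prompt}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def call_endpoint(url: str, prompt: str) -> dict:
    """Send the request and decode the JSON response (network call)."""
    with urllib.request.urlopen(build_request(url, prompt)) as resp:
        return json.load(resp)
```

Point `call_endpoint` at the temporary URL from `fal run`; a persistent deployment will additionally need your API key in an authorization header.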

3. Deploy

fal deploy creates a persistent, authenticated endpoint with autoscaling and built-in retries. Every deploy creates a new revision for instant rollbacks. For staging and production separation, fal supports multiple environments per app.
fal deploy my_app.py

4. Observe

The dashboard gives you real-time logs, request-level analytics, and error tracking out of the box. Trace individual requests, spot latency regressions, and monitor runner utilization. For external stacks, fal supports Prometheus metrics and log drains to Datadog, Splunk, and Elasticsearch.

5. Scale

fal scales runners from zero to thousands of GPUs based on demand, with a multi-layer caching system that reduces cold starts over time. Scaling parameters let you control the tradeoff: min_concurrency keeps runners warm, max_concurrency caps spend, and concurrency_buffer pre-warms ahead of spikes. See optimizing cold starts and machine types for latency-sensitive workloads.
class MyModel(fal.App):
    min_concurrency = 2      # keep 2 runners warm to avoid cold starts
    max_concurrency = 100    # hard cap on concurrent runners to bound spend
    concurrency_buffer = 3   # pre-warm extra runners ahead of demand spikes

6. Distribute

Endpoints start as private. You can deploy in public mode for open access, or shared mode where callers pay for their own usage. To list on the Marketplace for broader distribution and revenue, see publishing to the marketplace.
class MyModel(fal.App):
    # Auth modes: private (default), public (open access),
    # or shared (callers pay for their own usage).
    app_auth = "shared"

Serverless Quickstart

Deploy your first model in minutes

Train Your Own Models

For training runs, fine-tuning, and workloads that need sustained GPU access, fal Compute gives you dedicated instances with full SSH control. No cold starts, no autoscaling, just raw GPU power billed at a fixed hourly rate.

H100 SXM

Single-GPU instances for development, fine-tuning, and single-GPU training

8x H100 SXM

Multi-GPU instances connected over InfiniBand for distributed training
|          | Compute                           | Serverless                         |
| -------- | --------------------------------- | ---------------------------------- |
| Best for | Training, fine-tuning, batch jobs | API endpoints, on-demand inference |
| Billing  | Per-hour, fixed rate              | Per-second of execution            |
| Scaling  | Manual                            | Automatic                          |
| Access   | Full SSH                          | Managed runners                    |
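
The billing difference comes down to utilization: a dedicated Compute instance bills for every reserved hour, busy or idle, while Serverless bills only for seconds of execution. A back-of-the-envelope sketch with hypothetical rates (real prices are on the fal pricing page):

```python
# Hypothetical rates for illustration only; check the fal pricing page for real ones.
HOURLY_RATE = 2.00        # $/hour for a dedicated Compute instance
PER_SECOND_RATE = 0.0011  # $/second of Serverless execution

def compute_cost(hours_reserved: float) -> float:
    """Compute bills for the whole reservation, busy or idle."""
    return hours_reserved * HOURLY_RATE

def serverless_cost(busy_seconds: float) -> float:
    """Serverless bills only for seconds actually spent executing."""
    return busy_seconds * PER_SECOND_RATE

# Breakeven utilization at these rates: Compute wins once you run more than
# ~1818 busy seconds per reserved hour, i.e. roughly 50% utilization.
breakeven = HOURLY_RATE / PER_SECOND_RATE
print(round(breakeven))  # 1818
```

This is why steady, long-running work such as training favors Compute, while bursty inference traffic favors Serverless.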

Compute Quickstart

Provision your first GPU instance in minutes