Build with fal
The generative media platform powering the world’s top AI apps.
Call 1,000+ optimized models through a unified API, or deploy your own on the same infrastructure. Image, video, audio, music, speech, 3D, and real-time streaming. Built to scale to billions of requests.
99.99%+
Uptime
Billions
Requests/day
1,000+
Endpoints


Model APIs
Call 1,000+ models with one API. Image, video, audio, and multimodal generation. Optimized and production-ready.
Serverless
Deploy your own models. Same infrastructure, same autoscaling, same reliability. From zero to thousands of GPUs.
Compute
Dedicated GPU instances. Full SSH access for training, fine-tuning, and persistent workloads.
Platform APIs
REST APIs for model metadata, pricing, usage tracking, logs, files, and metrics.
Start with a Model API Call
Most users start here. Pick a model from the Marketplace, get an API key, and make a request. Three lines of code, no infrastructure to manage.
Model APIs Quickstart
Browse models, see pricing, and start generating
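A first call can be sketched with Python's standard library. The model id (`fal-ai/flux/dev`) and the `prompt` input field below are illustrative; each model's page documents its exact input schema, and the request assumes your API key is exported as `FAL_KEY`:

```python
import json
import os
import urllib.request

FAL_KEY = os.environ.get("FAL_KEY", "")  # API key from your fal dashboard


def build_request(model_id: str, arguments: dict) -> urllib.request.Request:
    """Build an authenticated POST to a hosted model endpoint."""
    return urllib.request.Request(
        url=f"https://fal.run/{model_id}",
        data=json.dumps(arguments).encode("utf-8"),
        headers={
            "Authorization": f"Key {FAL_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Illustrative text-to-image call; model id and input fields vary per model.
req = build_request("fal-ai/flux/dev", {"prompt": "a lighthouse at dusk"})
# with urllib.request.urlopen(req) as resp:   # uncomment with a valid key set
#     print(json.load(resp))
```

The official client libraries wrap this same request shape with queueing and streaming support, so most production code uses those instead of raw HTTP.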
Deploy Your Own Models
For teams that need to run custom models, proprietary pipelines, or fine-tuned variants, fal Serverless lets you deploy on the same engine that powers the Marketplace. fal has been running this infrastructure for over 3 years, and every model on the platform goes through the same lifecycle below.
Develop
A fal.App is a Python class where your setup() method runs once per runner to load model weights and initialize resources. Your @fal.endpoint methods then serve incoming requests using the initialized state. You declare hardware needs and environment alongside your code, so infrastructure is versioned with your app.
Test
fal run spins up a cloud GPU runner and gives you a temporary URL so you can test on the same hardware you’ll use in production. It also generates a playground UI automatically. For CI, AppClient lets you run tests against ephemeral deployments.
Deploy
fal deploy creates a persistent, authenticated endpoint with autoscaling and built-in retries. Every deploy creates a new revision for instant rollbacks. For staging and production separation, fal supports multiple environments per app.
Observe
The dashboard gives you real-time logs, request-level analytics, and error tracking out of the box. Trace individual requests, spot latency regressions, and monitor runner utilization. For external stacks, fal supports Prometheus metrics and log drains to Datadog, Splunk, and Elasticsearch.
Scale
fal scales runners from zero to thousands of GPUs based on demand, with a multi-layer caching system that reduces cold starts over time. Scaling parameters let you control the tradeoff:
min_concurrency keeps runners warm, max_concurrency caps spend, and concurrency_buffer pre-warms ahead of spikes. See optimizing cold starts and machine types for latency-sensitive workloads.
Distribute
Endpoints start as private. You can deploy in public mode for open access, or shared mode where callers pay for their own usage. To list on the Marketplace for broader distribution and revenue, see publishing to the marketplace.
Serverless Quickstart
Deploy your first model in minutes
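The runner lifecycle described above (setup once per runner, then serve many requests from that state) can be mimicked in plain Python. The class and method names here are hypothetical stand-ins with no GPU or SDK dependency; a real app subclasses fal.App and marks handlers with @fal.endpoint:

```python
class TextToImageApp:
    """Mimics the fal.App lifecycle: setup() runs once per runner,
    endpoint methods then serve every request from the initialized state."""

    def __init__(self):
        self.pipeline = None
        self.setup_calls = 0

    def setup(self):
        # In a real fal.App this loads model weights onto the GPU once.
        self.setup_calls += 1
        self.pipeline = lambda prompt: {"image_url": f"<image for {prompt}>"}

    def generate(self, prompt: str) -> dict:
        # In a real app this would be an @fal.endpoint method.
        return self.pipeline(prompt)


app = TextToImageApp()
app.setup()  # once per runner, not per request
results = [app.generate(p) for p in ["a red fox", "a blue whale"]]
```

From there, per the steps above, fal run exercises the app on a cloud GPU and fal deploy turns it into a persistent, autoscaled endpoint.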
Train Your Own Models
For training runs, fine-tuning, and workloads that need sustained GPU access, fal Compute gives you dedicated instances with full SSH control. No cold starts, no autoscaling, just raw GPU power billed at a fixed hourly rate.
H100 SXM
Single-GPU instances for development, fine-tuning, and single-GPU training
8x H100 SXM
Multi-GPU instances connected over InfiniBand for distributed training
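Distributed training on a multi-GPU instance is typically data-parallel: each GPU computes gradients on its own shard of the batch, then the gradients are averaged across workers (an all-reduce, carried over InfiniBand here) so every worker applies the same update. A framework-free sketch of that averaging step, with made-up numbers:

```python
def all_reduce_mean(worker_grads: list[list[float]]) -> list[float]:
    """Average per-worker gradients elementwise, as an all-reduce would,
    so every worker steps with an identical update (pure-Python sketch)."""
    n_workers = len(worker_grads)
    n_params = len(worker_grads[0])
    return [
        sum(grads[i] for grads in worker_grads) / n_workers
        for i in range(n_params)
    ]


# Two workers, two parameters each: both would step with [2.0, 3.0].
avg = all_reduce_mean([[1.0, 2.0], [3.0, 4.0]])
```

In practice a framework (e.g. torch.distributed) performs this collective on the GPUs directly; the sketch only shows the arithmetic being synchronized.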
| | Compute | Serverless |
|---|---|---|
| Best for | Training, fine-tuning, batch jobs | API endpoints, on-demand inference |
| Billing | Per-hour, fixed rate | Per-second of execution |
| Scaling | Manual | Automatic |
| Access | Full SSH | Managed runners |
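The billing row in the table comes down to simple arithmetic. The $2.00/hour rate below is hypothetical, not a quoted fal price; it only illustrates per-hour versus per-second billing:

```python
def compute_cost(hours_reserved: float, rate_per_hour: float) -> float:
    """Compute (dedicated instance): billed per hour reserved, busy or idle."""
    return hours_reserved * rate_per_hour


def serverless_cost(busy_seconds: float, rate_per_hour: float) -> float:
    """Serverless: billed per second of actual execution only."""
    return busy_seconds * rate_per_hour / 3600


# One wall-clock hour, of which only 6 minutes (360 s) ran inference:
dedicated = compute_cost(1.0, 2.00)      # pays for idle time too -> $2.00
on_demand = serverless_cost(360, 2.00)   # pays only for execution -> $0.20
```

For sustained training, the dedicated instance buys predictable cost and zero cold starts; for bursty inference, per-second billing is usually the cheaper fit.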
Compute Quickstart
Provision your first GPU instance in minutes