This guide shows how to use the `fal.distributed` module to build a production-ready SDXL service that generates multiple image variations simultaneously and supports real-time streaming.
For a comprehensive overview of multi-GPU parallelism strategies and when to use them, see the Multi-GPU Workloads Overview.
🚀 Try this Example
View the complete source code on GitHub. Steps to run:
- Install fal: `pip install fal`
- Authenticate (if not already done): `fal auth login`
- Copy the code below into `parallel_sdxl.py`
- Run the app: `fal run parallel_sdxl.py`

Or clone this repository:
Before you run, make sure you have:
- Authenticated with fal: `fal auth login`
- Activated your virtual environment (recommended): `python -m venv venv && source venv/bin/activate`
Key Features
- Multi-GPU Data Parallelism: Each GPU generates images independently with different random seeds
- Real-time Streaming: Stream intermediate results during generation
- Distributed Coordination: Automatic synchronization and result gathering across GPUs
- Memory Efficient: Uses TinyVAE for fast preview generation
- Production Ready: Includes warmup, error handling, and resource cleanup
Architecture Overview

A `DistributedRunner` in the main application spawns one `DistributedWorker` process per GPU. Each worker loads its own copy of the SDXL pipeline and generates with a rank-specific seed; results are gathered to rank 0, which returns them to the client.
Step-by-Step Implementation
1. Define the Distributed Worker
The worker extends `DistributedWorker` and runs on each GPU:

- `self.device`: PyTorch CUDA device for this worker (`cuda:0`, `cuda:1`, etc.). Always use `.to(self.device)` when loading models.
- `self.rank`: Worker ID (0 to N-1). Rank 0 is typically the "main" worker that returns results.
- `self.world_size`: Total number of workers (GPUs).
- `setup(**kwargs)`: Called once during `runner.start()` to initialize each worker.
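A minimal sketch of such a worker, assuming `DistributedWorker` is importable from `fal.distributed` (the `SDXLWorker` name and the diffusers loading code are illustrative):

```python
import torch
from diffusers import StableDiffusionXLPipeline

from fal.distributed import DistributedWorker  # import path assumed from this guide


class SDXLWorker(DistributedWorker):
    def setup(self, **kwargs):
        # Runs once per GPU during runner.start(); load the model onto
        # this worker's own device (cuda:<rank>).
        self.pipe = StableDiffusionXLPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0",
            torch_dtype=torch.float16,
        ).to(self.device)
```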
2. Implement the Worker Logic
- Independent Generation: Each GPU uses a different random seed automatically
- Distributed Gather: `dist.gather()` collects results from all GPUs to rank 0
- Rank-Specific Logic: Only rank 0 prepares the gather list and returns results
- Memory Cleanup: Clear CUDA cache after generation
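A sketch of that logic, continuing the `SDXLWorker` from step 1 (the `generate` method name and the returned result shape are illustrative):

```python
import torch
import torch.distributed as dist


class SDXLWorker(DistributedWorker):  # continued from step 1
    def generate(self, prompt: str, steps: int = 30):
        # Independent generation: the seed differs per rank, so each GPU
        # produces a distinct variation of the same prompt.
        generator = torch.Generator(device=self.device).manual_seed(self.rank)
        image = self.pipe(
            prompt,
            num_inference_steps=steps,
            generator=generator,
            output_type="pt",  # keep a tensor so dist.gather can move it
        ).images[0]

        # Distributed gather: rank 0 allocates the destination buffers;
        # every other rank passes gather_list=None.
        gather_list = (
            [torch.empty_like(image) for _ in range(self.world_size)]
            if self.rank == 0
            else None
        )
        dist.gather(image, gather_list, dst=0)

        # Memory cleanup: release cached GPU memory after generation.
        torch.cuda.empty_cache()

        if self.rank == 0:
            # Only rank 0 returns results to the runner.
            return {"images": [img.cpu() for img in gather_list]}
```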
3. Add Real-Time Streaming (Optional)
Stream intermediate results during generation:
- Progressive blur reduction as generation progresses
- Updates every 5 steps to balance frequency and overhead
- Uses TinyVAE for fast latent decoding
- Automatic base64 encoding for browser display
`add_streaming_result()` sends updates to the client. API Reference →
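A sketch of such a callback using diffusers' `callback_on_step_end` hook. The `self.taesd` attribute is assumed to be a TinyVAE loaded in `setup()` (e.g. `AutoencoderTiny.from_pretrained("madebyollin/taesdxl")`), and the progressive blur from the example is omitted for brevity:

```python
import base64
import io

import torch
from PIL import Image


class SDXLWorker(DistributedWorker):  # continued
    def on_step_end(self, pipe, step, timestep, callback_kwargs):
        # Pass as `callback_on_step_end=self.on_step_end` in the pipeline call.
        if step % 5 == 0:  # update every 5 steps to balance frequency and overhead
            with torch.no_grad():
                # TinyVAE decodes latents far faster than the full SDXL VAE.
                decoded = self.taesd.decode(callback_kwargs["latents"]).sample[0]
            # Map to 8-bit RGB (adjust if your VAE's output range differs
            # from the usual [-1, 1] diffusers convention).
            frame = ((decoded / 2 + 0.5).clamp(0, 1) * 255).to(torch.uint8)
            preview = Image.fromarray(frame.permute(1, 2, 0).cpu().numpy())
            buf = io.BytesIO()
            preview.save(buf, format="JPEG")
            # add_streaming_result() pushes the frame to the client; base64
            # encoding lets a browser render it directly.
            self.add_streaming_result(
                {"image": base64.b64encode(buf.getvalue()).decode("ascii")}
            )
        return callback_kwargs
```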
4. Create the Main Application
Key Methods Used:
- `DistributedRunner(worker_cls, world_size)` - Creates the runner. API Reference →
- `await runner.start()` - Initializes all workers. API Reference →
- `await runner.invoke(payload)` - Executes workers and returns result. API Reference →
- `runner.stream(payload, as_text_events=True)` - Streams intermediate results. API Reference →
- `num_gpus`: Specify the number of GPUs to use
- `machine_type`: Choose the GPU type (H100, A100, etc.)
- `pyzmq`: Required dependency for distributed communication
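Putting these pieces together, the main app might look like this sketch (the class name, `keep_alive`, and the endpoint shape are illustrative; `GenerationRequest`/`GenerationResponse` are defined in step 5):

```python
import fal

from fal.distributed import DistributedRunner  # import path assumed from this guide


class ParallelSDXLApp(fal.App, keep_alive=300):
    machine_type = "GPU-H100"  # choose the GPU type
    num_gpus = 2               # one distributed worker per GPU
    requirements = ["diffusers", "transformers", "accelerate", "pyzmq"]

    async def setup(self):
        # One SDXLWorker process per GPU; start() calls each worker's setup().
        self.runner = DistributedRunner(SDXLWorker, world_size=self.num_gpus)
        await self.runner.start()

    @fal.endpoint("/")
    async def generate(self, request: GenerationRequest) -> GenerationResponse:
        # invoke() fans the payload out to every worker and returns
        # whatever rank 0 produced.
        result = await self.runner.invoke(request.model_dump())
        return GenerationResponse(**result)
```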
5. Define Input/Output Models
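For example, with pydantic models (the field names are illustrative; `fal.toolkit.Image` is fal's standard image output type):

```python
from pydantic import BaseModel, Field

from fal.toolkit import Image


class GenerationRequest(BaseModel):
    prompt: str = Field(description="Text prompt shared by every worker")
    num_inference_steps: int = Field(default=30, description="Denoising steps")


class GenerationResponse(BaseModel):
    images: list[Image] = Field(description="One variation per GPU")
```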
Running the Application
Local Development

To iterate locally against fal's infrastructure, run the app with `fal run parallel_sdxl.py`.
Production Deployment

To deploy a persistent endpoint, use `fal deploy parallel_sdxl.py`.
After running `fal run` or `fal deploy`, you'll see a URL like `https://fal.ai/dashboard/sdk/username/app-id/`. You can:
- Test in the Playground: Click the URL or visit it in your browser to open the interactive playground
- View on Dashboard: Visit fal.ai/dashboard to see all your apps, monitor usage, and manage deployments
Using the Application
Test in the Playground: After deploying, open the URL provided by the CLI (e.g., `https://fal.ai/dashboard/sdk/username/app-id/`) in your browser to access an interactive playground where you can test your app with a UI.
Call from Code:
Standard Generation:
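For instance, with the `fal-client` Python package (the app id is a placeholder; use the one printed by `fal deploy`):

```python
import fal_client

# subscribe() submits the request and waits for the final result.
result = fal_client.subscribe(
    "username/app-id",
    arguments={
        "prompt": "an astronaut riding a horse, detailed oil painting",
        "num_inference_steps": 30,
    },
)
print(result["images"])
```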
For other languages (JavaScript, TypeScript, etc.) and advanced client usage, see the Client Libraries documentation.
Next Steps
Multi-GPU Workloads Overview
Learn about other parallelism strategies
Deploy Text-to-Image Model
Single-GPU image generation