The fal.distributed module enables you to scale your AI workloads across multiple GPUs, dramatically improving performance for both inference and training tasks. Whether you need to generate multiple images simultaneously or train large models faster, distributed computing on fal makes it straightforward.
Why Use Multiple GPUs?
For Inference
- Higher Throughput: Generate multiple outputs simultaneously (e.g., 4 images at once on 4 GPUs)
- Faster Single Output: Split large models across GPUs for faster generation
- Cost Efficiency: Maximize GPU utilization for batch processing
For Training
- Faster Training: Distribute training across multiple GPUs with synchronized gradient updates
- Larger Batches: Train with bigger batch sizes for better model convergence
- Parallel Preprocessing: Speed up data preprocessing by distributing it across GPUs
Core Concepts
Architecture Overview
Here’s how the distributed computing components work together: a DistributedRunner in your app orchestrates one DistributedWorker process per GPU, communicating with the workers over ZMQ.
DistributedRunner
The DistributedRunner is the main orchestration class that manages multi-GPU workloads. It handles process management, inter-process communication via ZMQ, and coordination between worker processes.
Key Methods
DistributedRunner(worker_cls, world_size)
Creates a runner that will spawn world_size instances of your worker_cls.
await runner.start(**kwargs)
Launches all worker processes and calls their setup() method. Any **kwargs are passed to each worker’s setup. Call this once in your app’s setup().
await runner.invoke(payload)
Executes your worker’s __call__() method with the given payload dict and returns the final result. Use this for standard (non-streaming) requests.
runner.stream(payload, as_text_events=True)
Returns an async iterator that streams intermediate results from workers. Use this with StreamingResponse for real-time progress updates.
Example Usage
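Below is a minimal sketch of wiring a DistributedRunner into a fal app with the methods above. The worker class (SDXLWorker), the request/response models, the num_gpus attribute, and the FastAPI StreamingResponse wiring are illustrative assumptions rather than fixed API; see the linked demos for complete, working code.

```python
import fal
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

from fal.distributed import DistributedRunner
from workers import SDXLWorker  # hypothetical module; the worker is sketched in the next section


class TextToImageRequest(BaseModel):
    prompt: str


class TextToImageResponse(BaseModel):
    images: list[str]  # base64-encoded PNGs


class ParallelSDXLApp(fal.App):
    machine_type = "GPU-H100"
    num_gpus = 2  # assumed attribute name; see "Configuration" below

    async def setup(self):
        # Spawn one worker process per GPU and run each worker's setup().
        self.runner = DistributedRunner(worker_cls=SDXLWorker, world_size=self.num_gpus)
        await self.runner.start()

    @fal.endpoint("/")
    async def generate(self, request: TextToImageRequest) -> TextToImageResponse:
        # invoke() forwards the payload to every worker's __call__() and
        # returns the final result (typically produced by rank 0).
        result = await self.runner.invoke({"prompt": request.prompt})
        return TextToImageResponse(images=result["images"])

    @fal.endpoint("/stream")
    async def generate_stream(self, request: TextToImageRequest) -> StreamingResponse:
        # stream() yields intermediate results emitted via add_streaming_result().
        return StreamingResponse(
            self.runner.stream({"prompt": request.prompt}, as_text_events=True),
            media_type="text/event-stream",
        )
```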
For complete API documentation with all parameters and examples, see the DistributedRunner API Reference.
DistributedWorker
Your custom worker class extends DistributedWorker and runs on each GPU. Each worker is independent and has its own copy of the model.
Methods to Override
def setup(self, **kwargs)
Called once when the worker initializes. Load your model, download weights, and prepare resources here. Receives any **kwargs passed to runner.start().
def __call__(self, streaming=False, **kwargs)
Called for each request. Implement your inference or training logic here. Receives the payload dict from runner.invoke() or runner.stream() as **kwargs.
Properties You’ll Use
self.device - The PyTorch CUDA device for this worker (cuda:0, cuda:1, etc.)
self.rank - The worker’s rank (0 to world_size-1). Rank 0 is typically responsible for returning results.
self.world_size - Total number of workers (GPUs).
Utility Methods
self.add_streaming_result(data, as_text_event=True)
Sends intermediate results to the client during streaming. Only call from rank 0 to avoid duplicates.
self.rank_print(message)
Prints a message with the worker’s rank prefix for easier debugging.
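Putting these pieces together, here is a sketch of a worker that pairs with the runner example above. The SDXL pipeline, the rank-based seeding, and the base64 encoding are illustrative choices rather than part of the fal.distributed API.

```python
import base64
import io

import torch
from diffusers import StableDiffusionXLPipeline

from fal.distributed import DistributedWorker


class SDXLWorker(DistributedWorker):
    def setup(self, **kwargs):
        # Runs once per GPU process: load the model onto this worker's device.
        self.pipe = StableDiffusionXLPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
        ).to(self.device)
        self.rank_print("pipeline loaded")

    def __call__(self, prompt: str, streaming: bool = False, **kwargs):
        # Each rank generates a different variation by seeding with its rank.
        generator = torch.Generator(device=self.device).manual_seed(self.rank)
        image = self.pipe(prompt, generator=generator).images[0]

        if streaming and self.rank == 0:
            # Only rank 0 streams, otherwise the client receives duplicate events.
            self.add_streaming_result({"progress": f"1/{self.world_size} images ready"})

        if self.rank == 0:
            # Rank 0 returns the final result.
            buffer = io.BytesIO()
            image.save(buffer, format="PNG")
            return {"images": [base64.b64encode(buffer.getvalue()).decode()]}
```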
For complete API documentation with detailed examples, see the DistributedWorker API Reference.
PyTorch Distributed Primitives
For training and coordinated inference, you can use PyTorch’s distributed primitives (torch.distributed) directly inside your worker methods.
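A short sketch of the most common collectives, assuming the worker’s default process group is already initialized (call torch.distributed.init_process_group first if it is not). Inside a DistributedWorker you would pass self.device and self.world_size to helpers like these.

```python
import torch
import torch.distributed as dist


def average_across_gpus(local_loss: float, device: torch.device, world_size: int) -> float:
    # Average a scalar across all workers (e.g. a per-rank training loss).
    value = torch.tensor([local_loss], device=device)
    dist.all_reduce(value, op=dist.ReduceOp.SUM)  # sum the tensor across every rank
    return (value / world_size).item()


def share_from_rank_zero(tensor: torch.Tensor) -> torch.Tensor:
    # Overwrite every rank's tensor with rank 0's copy (e.g. shared latents or a seed).
    dist.broadcast(tensor, src=0)
    return tensor


def wait_for_everyone() -> None:
    # Block until every rank reaches this point.
    dist.barrier()
```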
Parallelism Strategies
Inference Strategies
Different parallelism strategies optimize for different inference scenarios. Choose based on your use case and model architecture.
Data Parallelism
Each GPU runs an independent model copy with different inputs. Best for high-throughput scenarios where you need to generate multiple outputs simultaneously (see the sketch below).
Use for: Batch processing, generating multiple image variations, high-throughput workloads
Example: Parallel SDXL in fal-demos
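One way to realize this pattern inside a DistributedWorker, sketched under the assumption that the worker’s torch.distributed process group is initialized and that self.pipe was loaded in setup(): each rank renders its own variation and rank 0 gathers the batch.

```python
import torch
import torch.distributed as dist

from fal.distributed import DistributedWorker


class BatchVariationsWorker(DistributedWorker):  # hypothetical worker; setup() loads self.pipe
    def __call__(self, prompt: str, **kwargs):
        # A different seed per rank yields N different images on N GPUs.
        generator = torch.Generator(device=self.device).manual_seed(self.rank)
        image = self.pipe(prompt, generator=generator).images[0]

        # Rank 0 collects every rank's image and returns the whole batch.
        gathered = [None] * self.world_size if self.rank == 0 else None
        dist.gather_object(image, gathered, dst=0)

        if self.rank == 0:
            return {"images": gathered}
```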
Pipeline Parallelism (PipeFusion)
Split the model into sequential stages across GPUs, processing like an assembly line where each GPU handles specific layers. Reduces latency for single outputs.
Use for: Large DiT models (SD3, FLUX), reducing latency for single-image generation
Example: xFuser’s PipeFusion implementation for DiT models
Tensor Parallelism
Split individual layers and tensors across GPUs, computing portions of each layer in parallel. Required when models are too large to fit on a single GPU.
Use for: Extremely large models that don’t fit in single-GPU memory
Example: Large language models, very large diffusion models
Sequence Parallelism (Ulysses)
Split attention computation across the sequence or spatial dimensions. Particularly effective for long sequences, and can be combined with other strategies.
Use for: Very long sequences, high-resolution images, combining with PipeFusion
Example: xFuser’s Ulysses implementation
CFG Parallelism
Run the conditional and unconditional passes of classifier-free guidance simultaneously on separate GPUs.
Use for: U-Net models (SDXL); requires exactly 2 GPUs
Example: xFuser supports this for U-Net architectures
Hybrid Strategies
Combine multiple approaches (e.g., PipeFusion + Ulysses + CFG) for maximum scaling efficiency across many GPUs.
Use for: Maximum scaling across 4-8+ GPUs, complex production workloads
Example: xFuser configurations for 8 GPU setups
Training Strategies
Multi-GPU training strategies focus on distributing the computational and memory requirements of training large models.
Distributed Data Parallel (DDP)
Each GPU has a full model copy and processes different data batches. Gradients are synchronized across all GPUs after each backward pass, ensuring all copies stay identical (see the sketch below).
Use for: Standard multi-GPU training, best scaling for most use cases
Example: Flux LoRA training in fal-demos
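A minimal PyTorch DDP sketch at the framework level (not fal-specific); MyModel, dataset, and num_epochs are placeholders.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

dist.init_process_group("nccl")                        # one process per GPU
rank = dist.get_rank()
torch.cuda.set_device(rank)

model = DDP(MyModel().to(rank), device_ids=[rank])     # MyModel: placeholder
sampler = DistributedSampler(dataset)                  # dataset: placeholder; each rank gets a shard
loader = DataLoader(dataset, batch_size=8, sampler=sampler)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for epoch in range(num_epochs):                        # num_epochs: placeholder
    sampler.set_epoch(epoch)                           # reshuffle shards every epoch
    for inputs, targets in loader:
        inputs, targets = inputs.to(rank), targets.to(rank)
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        loss.backward()                                # DDP all-reduces gradients here
        optimizer.step()
        optimizer.zero_grad()
```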
Pipeline Parallelism
Split the model into stages and process microbatches through the pipeline. Requires careful load balancing to avoid GPU idle time (bubble overhead).
Use for: Very large models, especially when combined with other parallelism strategies
Example: GPT-3 style training
Tensor Parallelism
Split model layers across GPUs using Megatron-LM-style parallelization, so each layer’s computation is distributed across multiple GPUs.
Use for: Models too large for a single GPU even with gradient checkpointing
Example: Large transformer training
FSDP/ZeRO
Fully Sharded Data Parallel (FSDP) and ZeRO shard optimizer states, gradients, and parameters across GPUs to reduce the memory footprint per GPU (see the sketch below).
Use for: Training very large models under memory constraints, scaling beyond DDP
Example: Large model training (70B+ parameters)
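A minimal FSDP sketch, again at the framework level; MyLargeModel is a placeholder, and production setups usually also configure an auto-wrap policy and mixed precision.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# Wrapping shards parameters, gradients, and optimizer state across all ranks,
# so each GPU holds only a slice of the full training state.
model = FSDP(MyLargeModel().to(rank))                  # MyLargeModel: placeholder

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# The training loop then looks the same as with DDP: forward, backward, step.
```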
Many production workloads benefit from hybrid strategies. For example, xFuser can combine PipeFusion + Ulysses + CFG parallelism to scale across 8+ GPUs efficiently.
Configuration
Specifying GPU Count
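In a fal app this comes down to choosing a multi-GPU machine type and a GPU count. A minimal sketch, assuming num_gpus is set as a class attribute alongside machine_type (as in the runner example above):

```python
import fal


class MultiGPUApp(fal.App):
    machine_type = "GPU-H100"  # multi-GPU capable machine type
    num_gpus = 4               # assumed attribute name: request 4x H100 on one machine
```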
Multi-GPU Machine Types
fal supports various multi-GPU configurations:
- GPU-H100 with num_gpus=2: 2x H100 GPUs
- GPU-H100 with num_gpus=4: 4x H100 GPUs
- GPU-H100 with num_gpus=8: 8x H100 GPUs
- GPU-A100 with num_gpus=2/4/8: 2-8x A100 GPUs
Examples
All examples are available in the fal-demos repository:
- Parallel SDXL: Data parallelism for image generation (code)
- xFuser: Model parallelism with DiT models (code)
- Flux LoRA Training: Complete DDP training pipeline (code)
Next Steps
API Reference
Complete API documentation for DistributedRunner and DistributedWorker
Multi-GPU Inference Tutorial
Step-by-step guide with parallel SDXL
Multi-GPU Training Tutorial
Build a distributed Flux LoRA training service
Event Streaming
Stream intermediate results from workers
Additional Resources
- fal-demos distributed examples - Complete working code
- Python SDK Reference - Full Python API
- Core Concepts - Fundamental fal concepts