fal.distributed
The `fal.distributed` module enables you to scale your AI workloads across multiple GPUs, dramatically improving performance for both inference and training tasks. Whether you need to generate multiple images simultaneously or train large models faster, distributed computing on fal makes it straightforward.
Why Use Multiple GPUs?
For Inference
- Higher Throughput: Generate multiple outputs simultaneously (e.g., 4 images at once on 4 GPUs)
- Faster Single Output: Split large models across GPUs for faster generation
- Cost Efficiency: Maximize GPU utilization for batch processing
For Training
- Faster Training: Distribute training across multiple GPUs with synchronized gradient updates
- Larger Batches: Train with bigger batch sizes for better model convergence
- Parallel Preprocessing: Speed up data preprocessing by distributing it across GPUs
Core Concepts
Architecture Overview
Here’s how the distributed computing components work together:
DistributedRunner
The `DistributedRunner` is the main orchestration class that manages multi-GPU workloads. It handles:
- Process management across multiple GPUs
- Inter-process communication via ZMQ
- Coordination between worker processes
DistributedWorker
Your custom worker class extends `DistributedWorker` and implements:
- `setup()`: Initialize models and resources on each GPU
- `__call__()`: Define the processing logic for each worker
- `self.rank`: GPU rank (0 to N-1)
- `self.world_size`: Total number of GPUs
- `self.device`: PyTorch device for this GPU
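For orientation, here is a minimal worker sketch. Only the `DistributedWorker` base class, `setup()`, `__call__()`, and the `self.rank` / `self.world_size` / `self.device` attributes come from this page; the import path and the commented-out `DistributedRunner` invocation are assumptions, so check the Python SDK Reference for the exact signatures.

```python
import torch

from fal.distributed import DistributedRunner, DistributedWorker  # assumed import path


class MyWorker(DistributedWorker):
    def setup(self):
        # Called once per GPU process: load models and allocate resources here.
        self.model = torch.nn.Linear(16, 16).to(self.device)

    def __call__(self, payload: dict) -> dict:
        # Called per request; self.rank identifies this GPU (0 to world_size - 1).
        x = torch.randn(1, 16, device=self.device)
        y = self.model(x)
        return {"rank": self.rank, "world_size": self.world_size, "output_sum": y.sum().item()}


# Hypothetical orchestration; the exact DistributedRunner constructor and
# invocation may differ from this sketch:
# runner = DistributedRunner(worker_cls=MyWorker, world_size=2)
# results = runner.run(payload={})
```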
PyTorch Distributed Primitives
For training and coordinated inference, you can use PyTorch’s distributed primitives such as `broadcast`, `all_reduce`, and `barrier` from `torch.distributed`.
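As a rough, non-fal-specific illustration, the snippet below uses a few of those primitives to coordinate work across GPUs; it assumes the process group has already been initialized (for example with the NCCL backend) before it runs.

```python
import torch
import torch.distributed as dist


def synchronize_latents(latents: torch.Tensor) -> torch.Tensor:
    # Make sure every GPU starts from the same latents (rank 0 is the source).
    dist.broadcast(latents, src=0)

    # ... each rank performs its share of the work on `latents` ...

    # Sum the partial results from all GPUs in place, then average them.
    dist.all_reduce(latents, op=dist.ReduceOp.SUM)
    latents /= dist.get_world_size()

    # Wait until every rank reaches this point before continuing.
    dist.barrier()
    return latents
```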
Parallelism Strategies
Inference Strategies
Different parallelism strategies optimize for different inference scenarios. Choose based on your use case and model architecture.
Data Parallelism
Each GPU runs an independent model copy with different inputs. Best for high throughput scenarios where you need to generate multiple outputs simultaneously.
Use for: Batch processing, generating multiple image variations, high throughput workloads
Example: Parallel SDXL in fal-demos
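A minimal sketch of the idea (not the fal-demos code itself), assuming a diffusers SDXL pipeline and an already-initialized process group: each rank loads a full model copy and renders a different seed for the same prompt.

```python
import torch
import torch.distributed as dist
from diffusers import StableDiffusionXLPipeline


def generate_variation(prompt: str):
    rank = dist.get_rank()
    device = torch.device(f"cuda:{rank}")

    # Every GPU holds its own full copy of the model.
    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to(device)

    # A different seed per rank, so 4 GPUs return 4 distinct images at once.
    generator = torch.Generator(device=device).manual_seed(1234 + rank)
    return pipe(prompt, generator=generator).images[0]
```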
Pipeline Parallelism (PipeFusion)
Split the model into sequential stages across GPUs, processing like an assembly line where each GPU handles specific layers. Reduces latency for single outputs.
Use for: Large DiT models (SD3, FLUX), reducing latency for single image generation
Example: xFuser’s PipeFusion implementation for DiT models
Tensor Parallelism
Split individual layers and tensors across GPUs, computing portions of each layer in parallel. Required when models are too large to fit on a single GPU.
Use for: Extremely large models that don’t fit in single-GPU memory
Example: Large language models, very large diffusion models
Sequence Parallelism (Ulysses)
Split attention computation across the sequence or spatial dimensions. Particularly effective for long sequences and can be combined with other strategies.
Use for: Very long sequences, high-resolution images, combining with PipeFusion
Example: xFuser’s Ulysses implementation
CFG Parallelism
Parallelize the conditional and unconditional passes of classifier-free guidance, running both simultaneously on separate GPUs.
Use for: U-Net models (SDXL); requires exactly 2 GPUs
Example: xFuser supports this for U-Net architectures
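A conceptual sketch of the idea (not xFuser’s implementation), assuming exactly 2 GPUs and a diffusers-style U-Net: rank 0 runs the conditional pass, rank 1 the unconditional pass, and both combine the gathered predictions.

```python
import torch
import torch.distributed as dist


def cfg_parallel_step(unet, latents, timestep, cond_emb, uncond_emb, guidance_scale=7.5):
    rank = dist.get_rank()  # expects world_size == 2
    emb = cond_emb if rank == 0 else uncond_emb

    # Each GPU runs only one of the two guidance passes.
    noise_pred = unet(latents, timestep, encoder_hidden_states=emb).sample

    # Gather both predictions onto every rank (index 0: conditional, 1: unconditional).
    gathered = [torch.empty_like(noise_pred) for _ in range(2)]
    dist.all_gather(gathered, noise_pred)
    cond, uncond = gathered

    # Standard classifier-free guidance combination.
    return uncond + guidance_scale * (cond - uncond)
```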
Hybrid Strategies
Combine multiple approaches (e.g., PipeFusion + Ulysses + CFG) for maximum scaling efficiency across many GPUs.
Use for: Maximum scaling across 4-8+ GPUs, complex production workloads
Example: xFuser configurations for 8 GPU setups
Training Strategies
Multi-GPU training strategies focus on distributing the computational and memory requirements of training large models.
Distributed Data Parallel (DDP)
Each GPU has a full model copy and processes different data batches. Gradients are synchronized across all GPUs after each backward pass, ensuring all models stay identical.
Use for: Standard multi-GPU training, best scaling for most use cases
Example: Flux LoRA training in fal-demos
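For reference, a minimal DDP training loop sketch (illustrative only, not the Flux LoRA trainer); it assumes the process group is already initialized and that the dataset yields `(input, target)` pairs.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler


def train(model, dataset, epochs: int = 1):
    rank = dist.get_rank()
    device = torch.device(f"cuda:{rank}")
    model = DDP(model.to(device), device_ids=[rank])  # full model copy per GPU

    # DistributedSampler gives each GPU a disjoint shard of the dataset.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(model(inputs.to(device)), targets.to(device))
            loss.backward()   # gradients are all-reduced across GPUs here
            optimizer.step()  # every replica applies the same averaged update
```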
Pipeline Parallelism
Split model into stages and process microbatches through the pipeline. Requires careful load balancing to avoid GPU idle time (bubble overhead).
Use for: Very large models, when combined with other parallelism strategies
Example: GPT-3 style training
Tensor Parallelism
Split model layers across GPUs using Megatron-LM style parallelization. Each layer’s computation is distributed across multiple GPUs.
Use for: Models too large for a single GPU even with gradient checkpointing
Example: Large transformer training
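A toy forward-pass sketch of the idea, using a column-parallel linear layer (illustrative only; real Megatron-style training additionally needs gradient-aware collectives and matching row-parallel layers):

```python
import torch
import torch.distributed as dist


class ColumnParallelLinear(torch.nn.Module):
    """Each GPU owns a slice of the output dimension of one linear layer."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        world_size = dist.get_world_size()
        assert out_features % world_size == 0
        # This rank stores only out_features / world_size of the output columns.
        self.local = torch.nn.Linear(in_features, out_features // world_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local_out = self.local(x)  # partial output computed on this GPU
        shards = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
        dist.all_gather(shards, local_out)
        # Concatenate the per-GPU shards back into the full output dimension.
        return torch.cat(shards, dim=-1)
```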
FSDP/ZeRO
Fully Sharded Data Parallel (FSDP) and ZeRO-style optimizers shard optimizer states, gradients, and parameters across GPUs to reduce the memory footprint per GPU.
Use for: Training very large models under memory constraints, scaling beyond DDP
Example: Large model training (70B+ parameters)
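A minimal PyTorch FSDP sketch of the idea (illustrative only); it assumes the process group is already initialized on each rank.

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def wrap_for_fsdp(model: torch.nn.Module, rank: int):
    # Parameters, gradients, and optimizer state are sharded across GPUs
    # instead of being fully replicated as in DDP.
    model = FSDP(model.to(torch.device(f"cuda:{rank}")))
    # Build the optimizer after wrapping so it sees the sharded (flattened) parameters.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    return model, optimizer
```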
Many production workloads benefit from hybrid strategies. For example, xFuser can combine PipeFusion + Ulysses + CFG parallelism to scale across 8+ GPUs efficiently.
Configuration
Specifying GPU Count
Multi-GPU Machine Types
fal supports various multi-GPU configurations:
- `GPU-H100` with `num_gpus=2`: 2x H100 GPUs
- `GPU-H100` with `num_gpus=4`: 4x H100 GPUs
- `GPU-H100` with `num_gpus=8`: 8x H100 GPUs
- `GPU-A100` with `num_gpus=2/4/8`: 2-8x A100 GPUs
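As a rough sketch of how a configuration like the ones above might be declared: the `machine_type` values and `num_gpus` settings are from this page, but the app class layout and the `@fal.endpoint` route shown here are assumptions; see the Python SDK Reference for the authoritative fields.

```python
import fal


class ParallelApp(fal.App):  # assumed base class from the fal Python SDK
    machine_type = "GPU-H100"  # H100 machine family
    num_gpus = 4               # request a 4x H100 configuration (assumed attribute name)

    @fal.endpoint("/")
    def generate(self, prompt: str) -> dict:
        # Hand the request off to a DistributedRunner here (see Core Concepts above).
        return {"prompt": prompt}
```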
Examples
All examples are available in the fal-demos repository:
- Parallel SDXL: Data parallelism for image generation (code)
- xFuser: Model parallelism with DiT models (code)
- Flux LoRA Training: Complete DDP training pipeline (code)
Next Steps
For detailed implementation examples, check out:
- fal-demos distributed examples - Complete working code
- Python SDK Reference - API documentation
- Core Concepts - Fundamental fal concepts