The fal.distributed module enables you to scale your AI workloads across multiple GPUs, dramatically improving performance for both inference and training tasks. Whether you need to generate multiple images simultaneously or train large models faster, distributed computing on fal makes it straightforward.

Why Use Multiple GPUs?

For Inference

  • Higher Throughput: Generate multiple outputs simultaneously (e.g., 4 images at once on 4 GPUs)
  • Faster Single Output: Split large models across GPUs for faster generation
  • Cost Efficiency: Maximize GPU utilization for batch processing

For Training

  • Faster Training: Distribute training across multiple GPUs with synchronized gradient updates
  • Larger Batches: Train with bigger batch sizes for better model convergence
  • Parallel Preprocessing: Speed up data preprocessing by distributing it across GPUs

Core Concepts

Architecture Overview

Here’s how the distributed computing components work together:

DistributedRunner

The DistributedRunner is the main orchestration class that manages multi-GPU workloads. It handles:
  • Process management across multiple GPUs
  • Inter-process communication via ZMQ
  • Coordination between worker processes

import fal
from fal.distributed import DistributedRunner, DistributedWorker

class MyWorker(DistributedWorker):
    def setup(self):
        # Initialize model on each GPU
        self.model = load_model().to(self.device)
    
    def __call__(self, input_data):
        # Process data on this GPU
        return self.model(input_data)

class MyApp(fal.App):
    num_gpus = 4  # Use 4 GPUs
    
    def setup(self):
        self.runner = DistributedRunner(
            worker_cls=MyWorker,
            world_size=4
        )

DistributedWorker

Your custom worker class extends DistributedWorker and implements:
  • setup(): Initialize models and resources on each GPU
  • __call__(): Define the processing logic for each worker

Key attributes available in your worker (used in the sketch after this list):
  • self.rank: GPU rank (0 to N-1)
  • self.world_size: Total number of GPUs
  • self.device: PyTorch device for this GPU
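
For example, a worker can use these attributes to split an incoming batch so each GPU processes only its own shard. This is a minimal sketch; load_model and the slicing scheme are illustrative, not part of the fal API:

from fal.distributed import DistributedWorker

class ShardedWorker(DistributedWorker):
    def setup(self):
        # Each worker process drives exactly one GPU, exposed as self.device
        self.model = load_model().to(self.device)

    def __call__(self, batch):
        # Illustrative sharding: rank i takes every world_size-th item,
        # so the GPUs cover the whole batch without overlap
        shard = batch[self.rank :: self.world_size]
        return [self.model(item) for item in shard]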

PyTorch Distributed Primitives

For training and coordinated inference, you can use PyTorch’s distributed primitives:

import torch.distributed as dist

# Gather results from all GPUs to rank 0
dist.gather(tensor, gather_list if self.rank == 0 else None, dst=0)

# Broadcast data from rank 0 to all GPUs
dist.broadcast(tensor, src=0)

# Synchronize all GPUs at a barrier
dist.barrier()
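
Inside a worker method, these primitives might be combined like this: rank 0 broadcasts shared conditioning to every GPU, each rank computes its own result, and rank 0 gathers everything back. This is a sketch; encode_prompt, self.model, and the tensor shapes are illustrative:

import torch
import torch.distributed as dist

def generate(self, batch_size=1):
    # Shared conditioning: rank 0 fills it in-place, then broadcasts so every
    # GPU starts from identical inputs (the shape must match on all ranks)
    cond = torch.empty(batch_size, 77, 2048, device=self.device)
    if self.rank == 0:
        cond.copy_(self.encode_prompt())  # hypothetical encoder, runs on rank 0 only
    dist.broadcast(cond, src=0)

    # Each GPU produces its own output
    result = self.model(cond).contiguous()

    # Rank 0 collects one tensor per GPU; other ranks pass None
    gather_list = None
    if self.rank == 0:
        gather_list = [torch.empty_like(result) for _ in range(self.world_size)]
    dist.gather(result, gather_list, dst=0)

    # Make sure every GPU has finished before returning
    dist.barrier()
    return gather_list  # populated on rank 0 only, None elsewhere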

Parallelism Strategies

Inference Strategies

Different parallelism strategies optimize for different inference scenarios. Choose based on your use case and model architecture.

Data Parallelism

Each GPU runs an independent model copy with different inputs. Best for high throughput scenarios where you need to generate multiple outputs simultaneously.
Use for: Batch processing, generating multiple image variations, high throughput workloads
Example: Parallel SDXL in fal-demos
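
A minimal sketch of this pattern, assuming a diffusers StableDiffusionXLPipeline is loaded on each GPU and the seed is offset by rank so every GPU returns a different variation (see the fal-demos example above for the production version):

import torch
from diffusers import StableDiffusionXLPipeline
from fal.distributed import DistributedWorker

class SDXLDataParallelWorker(DistributedWorker):
    def setup(self):
        # Every GPU holds an independent, full copy of the pipeline
        self.pipe = StableDiffusionXLPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0",
            torch_dtype=torch.float16,
        ).to(self.device)

    def __call__(self, prompt: str, seed: int = 0):
        # Offset the seed by rank so each GPU returns a different variation
        generator = torch.Generator(device=self.device).manual_seed(seed + self.rank)
        return self.pipe(prompt, generator=generator).images[0]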

Pipeline Parallelism (PipeFusion)

Split the model into sequential stages across GPUs, processing like an assembly line where each GPU handles specific layers. Reduces latency for single outputs.
Use for: Large DiT models (SD3, FLUX), reducing latency for single image generation
Example: xFuser’s PipeFusion implementation for DiT models
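
The core idea can be sketched with plain PyTorch point-to-point ops: rank 0 runs the first half of the layers and sends its activations to rank 1, which runs the second half. This is a two-stage toy example (self.stage is a hypothetical per-rank module), not the xFuser implementation:

import torch
import torch.distributed as dist

def pipeline_forward(self, x):
    # self.stage holds only this rank's portion of the model's layers
    if self.rank == 0:
        hidden = self.stage(x.to(self.device))
        # Hand activations off to the next stage
        dist.send(hidden.contiguous(), dst=1)
        return None
    else:
        # Assumes the first stage preserves the activation shape of x
        hidden = torch.empty(x.shape, device=self.device, dtype=x.dtype)
        dist.recv(hidden, src=0)
        return self.stage(hidden)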

Tensor Parallelism

Split individual layers and tensors across GPUs, computing portions of each layer in parallel. Required when models are too large to fit on a single GPU.
Use for: Extremely large models that don't fit in single-GPU memory
Example: Large language models, very large diffusion models
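
As an illustration, a single linear layer can be column-sharded so each GPU computes a slice of the output features, followed by an all_gather to reassemble the full output. This is a toy sketch, not Megatron-LM itself; the weight shard here is random:

import torch
import torch.distributed as dist

def sharded_linear(self, x, in_features=1024, out_features=4096):
    # Each rank owns one column shard of the weight matrix
    # (assumes out_features is divisible by world_size)
    shard_out = out_features // self.world_size
    if not hasattr(self, "w_shard"):
        self.w_shard = torch.randn(shard_out, in_features, device=self.device)

    # Local matmul produces this rank's slice of the output features
    local_out = x.to(self.device) @ self.w_shard.t()

    # Gather every rank's slice and concatenate to form the full output
    slices = [torch.empty_like(local_out) for _ in range(self.world_size)]
    dist.all_gather(slices, local_out.contiguous())
    return torch.cat(slices, dim=-1)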

Sequence Parallelism (Ulysses)

Split attention computation across the sequence or spatial dimensions. Particularly effective for long sequences and can be combined with other strategies.
Use for: Very long sequences, high-resolution images, combining with PipeFusion
Example: xFuser’s Ulysses implementation

CFG Parallelism

Parallel conditional and unconditional passes for classifier-free guidance. Runs both guidance passes simultaneously on separate GPUs.
Use for: U-Net models (SDXL); requires exactly 2 GPUs
Example: xFuser supports this for U-Net architectures
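
A sketch of the idea on exactly 2 GPUs: rank 0 runs the unconditional pass, rank 1 the conditional pass, and both ranks exchange predictions to apply the guidance formula (self.unet stands in for the real denoiser call, which also takes a timestep):

import torch
import torch.distributed as dist

def cfg_step(self, latents, text_embeds, null_embeds, guidance_scale=7.5):
    # Requires world_size == 2: rank 0 -> unconditional, rank 1 -> conditional
    embeds = null_embeds if self.rank == 0 else text_embeds
    noise_pred = self.unet(latents.to(self.device), embeds.to(self.device))

    # Exchange the two predictions so both ranks can apply the guidance formula
    preds = [torch.empty_like(noise_pred) for _ in range(2)]
    dist.all_gather(preds, noise_pred.contiguous())
    uncond, cond = preds

    return uncond + guidance_scale * (cond - uncond)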

Hybrid Strategies

Combine multiple approaches (e.g., PipeFusion + Ulysses + CFG) for maximum scaling efficiency across many GPUs.
Use for: Maximum scaling across 4-8+ GPUs, complex production workloads
Example: xFuser configurations for 8 GPU setups

Training Strategies

Multi-GPU training strategies focus on distributing the computational and memory requirements of training large models.

Distributed Data Parallel (DDP)

Each GPU has a full model copy and processes different data batches. Gradients are synchronized across all GPUs after each backward pass, ensuring all models stay identical.
Use for: Standard multi-GPU training, best scaling for most use cases
Example: Flux LoRA training in fal-demos
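
A minimal DDP sketch using PyTorch's built-in wrapper inside a DistributedWorker (build_model, the loss, and the optimizer settings are illustrative; gradient synchronization happens automatically during backward()):

import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from fal.distributed import DistributedWorker

class DDPTrainerWorker(DistributedWorker):
    def setup(self):
        # Wrap the per-GPU model copy; DDP all-reduces gradients during backward()
        model = build_model().to(self.device)  # hypothetical model factory
        self.model = DDP(model, device_ids=[self.device.index])
        self.optimizer = torch.optim.AdamW(self.model.parameters(), lr=1e-4)

    def __call__(self, batch, targets):
        self.optimizer.zero_grad()
        output = self.model(batch.to(self.device))
        loss = torch.nn.functional.mse_loss(output, targets.to(self.device))
        loss.backward()        # gradients are synchronized across all GPUs here
        self.optimizer.step()  # every rank applies the same averaged update
        return loss.item()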

Pipeline Parallelism

Split model into stages and process microbatches through the pipeline. Requires careful load balancing to avoid GPU idle time (bubble overhead).
Use for: Very large models, when combined with other parallelism strategies
Example: GPT-3 style training

Tensor Parallelism

Split model layers across GPUs using Megatron-LM style parallelization. Each layer's computation is distributed across multiple GPUs.
Use for: Models too large for a single GPU even with gradient checkpointing
Example: Large transformer training

FSDP/ZeRO

Fully Sharded Data Parallel (FSDP) and ZeRO-style optimizers shard optimizer states, gradients, and parameters across GPUs, reducing the memory footprint on each GPU.
Use for: Training very large models with memory constraints, scaling beyond DDP
Example: Large model training (70B+ parameters)
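
The wrapping step looks much like DDP, but uses PyTorch's FSDP wrapper so parameters, gradients, and optimizer state are sharded across GPUs (a minimal sketch; build_model is hypothetical, and real setups usually add an auto-wrap policy and mixed precision):

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from fal.distributed import DistributedWorker

class FSDPTrainerWorker(DistributedWorker):
    def setup(self):
        # FSDP shards parameters, gradients, and optimizer state across ranks,
        # gathering full parameters only when they are needed for computation
        model = build_model().to(self.device)  # hypothetical model factory
        self.model = FSDP(model)
        # Create the optimizer after wrapping so it references the sharded params
        self.optimizer = torch.optim.AdamW(self.model.parameters(), lr=1e-4)

    def __call__(self, batch, targets):
        self.optimizer.zero_grad()
        output = self.model(batch.to(self.device))
        loss = torch.nn.functional.mse_loss(output, targets.to(self.device))
        loss.backward()
        self.optimizer.step()
        return loss.item()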

Many production workloads benefit from hybrid strategies. For example, xFuser can combine PipeFusion + Ulysses + CFG parallelism to scale across 8+ GPUs efficiently.

Configuration

Specifying GPU Count

import fal

class MyApp(fal.App):
    num_gpus = 8  # Request 8 GPUs
    machine_type = "GPU-H100"  # Each GPU will be an H100

Multi-GPU Machine Types

fal supports various multi-GPU configurations:
  • GPU-H100 with num_gpus=2: 2x H100 GPUs
  • GPU-H100 with num_gpus=4: 4x H100 GPUs
  • GPU-H100 with num_gpus=8: 8x H100 GPUs
  • GPU-A100 with num_gpus=2/4/8: 2x, 4x, or 8x A100 GPUs

Examples

All examples are available in the fal-demos repository:
  • Parallel SDXL: Data parallelism for image generation (code)
  • xFuser: Model parallelism with DiT models (code)
  • Flux LoRA Training: Complete DDP training pipeline (code)

Next Steps

For detailed implementation examples, check out the fal-demos repository linked in the Examples section above.