The /data filesystem on fal is a distributed, parallel storage system. It performs best when multiple files are read concurrently. Sequential file reads — loading one weight file at a time — significantly underutilize the filesystem and result in slower cold starts.

The Problem

Most model loading libraries (HuggingFace, PyTorch) read weight files sequentially — one shard at a time. On a distributed filesystem, this leaves most of the available bandwidth idle:
Sequential:
  time ──────────────────────────────────►
  ┌──file1──┐┌──file2──┐┌──file3──┐┌──file4──┐

Parallel:
  time ──────────────────►
  ┌──file1──┐
  ┌──file2──┐
  ┌──file3──┐
  ┌──file4──┐

The Solution: Pre-Read Files in Parallel

Before loading your model, pre-read all weight files into the OS page cache using parallel I/O. When the model loader then reads the files, they’re already cached in memory and load instantly.
import subprocess

MODEL_DIR = "/data/models/my-model"

subprocess.check_call(
    f"find '{MODEL_DIR}' -type f | xargs -P 32 -I {{}} cat {{}} > /dev/null",
    shell=True
)
This reads up to 32 files simultaneously, pulling them into the page cache. The subsequent from_pretrained() call then reads from cache instead of the network.
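If you prefer not to shell out to `find` and `xargs`, the same effect can be achieved in pure Python with a thread pool. This is a minimal sketch, not fal API — `preread` and `_read_into_cache` are illustrative names, and the default of 32 workers mirrors the `-P 32` above:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def _read_into_cache(path: Path, chunk_size: int = 8 * 1024 * 1024) -> None:
    # Read the file in large chunks and discard the bytes; the OS keeps
    # the pages in its cache for the next reader (the model loader).
    with open(path, "rb") as f:
        while f.read(chunk_size):
            pass


def preread(directory: str, parallelism: int = 32) -> None:
    """Pre-read every file under `directory` into the OS page cache."""
    files = [p for p in Path(directory).rglob("*") if p.is_file()]
    # Threads work here because file reads release the GIL during I/O.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        list(pool.map(_read_into_cache, files))
```

Both approaches end up in the same place (warm page cache); the subprocess one-liner is shorter, while the Python version avoids a shell dependency and is easier to extend.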

Using It in Your App

Add the pre-read step at the beginning of your setup() method, before model loading:
import fal
import subprocess

class MyModel(fal.App):
    machine_type = "GPU-H100"

    def setup(self):
        import torch
        from diffusers import StableDiffusionXLPipeline

        model_dir = "/data/models/stable-diffusion-xl"

        # Pre-read all model files in parallel
        subprocess.check_call(
            f"find '{model_dir}' -type f | xargs -P 32 -I {{}} cat {{}} > /dev/null",
            shell=True
        )

        # Now load the model -- files are already in page cache
        self.pipe = StableDiffusionXLPipeline.from_pretrained(
            model_dir,
            torch_dtype=torch.float16,
        ).to("cuda")

Python Helper

For a cleaner approach, wrap it in a function:
import subprocess
from pathlib import Path

def preload_files(directory: str, parallelism: int = 32):
    """Pre-read all files in a directory into the OS page cache."""
    subprocess.check_call(
        f"find '{directory}' -type f | xargs -P {parallelism} -I {{}} cat {{}} > /dev/null",
        shell=True
    )
Then call it from your setup() method:

def setup(self):
    preload_files("/data/models/my-model")
    self.model = load_model("/data/models/my-model")
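Since pre-reading is purely an optimization, you may prefer it to fail soft rather than abort `setup()` if, say, the directory is missing. A hedged sketch of a best-effort variant (the name `preload_files_safe` is illustrative):

```python
import subprocess


def preload_files_safe(directory: str, parallelism: int = 32) -> bool:
    """Best-effort pre-read into the page cache.

    Returns False instead of raising, so a failed pre-read only costs
    a slower load rather than a failed deployment.
    """
    try:
        subprocess.check_call(
            f"find '{directory}' -type f | xargs -P {parallelism} -I {{}} cat {{}} > /dev/null",
            shell=True,
        )
        return True
    except subprocess.CalledProcessError:
        return False
```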

When to Use This

| Scenario | Recommendation |
| --- | --- |
| Loading HuggingFace models from /data | Use parallel pre-reading |
| Loading custom PyTorch checkpoints with multiple shard files | Use parallel pre-reading |
| Loading a single large file | Less benefit — the bottleneck is single-file transfer speed. Consider FlashPack instead |
| Models already in local node cache (warm runners) | Minimal benefit — files are already fast to read |

How It Works

fal’s /data is a distributed filesystem that can serve many files concurrently at high throughput. When you read files sequentially, you use only a fraction of the available bandwidth. The xargs -P 32 approach fires off 32 concurrent cat commands, each reading a different file. The OS caches the file contents in memory (page cache), so when your model loader reads the same files moments later, it reads from RAM instead of the network.
Adjust the parallelism (-P 32) based on your model. For models with many small files (e.g., 100+ safetensors shards), higher parallelism helps. For models with a few large files, lower parallelism (8-16) is sufficient.
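If you want to pick the parallelism empirically rather than guess, you can time a pre-read pass at different `-P` values. A minimal sketch (the timing function is illustrative, not part of fal's API):

```python
import subprocess
import time


def time_preread(directory: str, parallelism: int) -> float:
    """Time one parallel pre-read pass over `directory`, in seconds."""
    start = time.perf_counter()
    subprocess.check_call(
        f"find '{directory}' -type f | xargs -P {parallelism} -I {{}} cat {{}} > /dev/null",
        shell=True,
    )
    return time.perf_counter() - start


# Example (illustrative path). Note that each measurement should start
# from a cold cache — e.g. a fresh runner — because the page cache
# persists between runs and would make later values look artificially fast:
#   for p in (8, 16, 32, 64):
#       print(f"-P {p}: {time_preread('/data/models/my-model', p):.1f}s")
```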

Comparison with FlashPack

| | Parallel Pre-Reading | FlashPack |
| --- | --- | --- |
| What it does | Pre-reads files into OS cache | Streams tensors directly to GPU |
| Works with | Any files in any format | PyTorch models (custom format) |
| Requires conversion | No — works with existing files | Yes — must convert to .flashpack format |
| Best for | Quick win with existing models | Maximum performance with converted models |
Both techniques can be combined — use parallel pre-reading as a quick improvement, then migrate to FlashPack for even faster loading.