Docker images and files on /data are cached automatically — you don’t need to configure anything. Compiled model caches (PyTorch Inductor) require explicit use of the synchronized_inductor_cache() API. Understanding the cache layers helps you reason about why cold starts improve over time and how to structure your app for the best startup performance. For strategies to reduce cold starts further, see Optimizing Cold Starts.
Cache Layers
Docker images and /data storage use different caching mechanisms.
/data Storage Cache
Files on /data (model weights, compiled caches, any files your app writes) go through a three-layer cache:
| Cache Layer | Speed | Scope | Backing Infrastructure |
|---|---|---|---|
| Local Node Cache | 10-15 GB/s | Same physical machine | RAID 5 NVMe drives on the node |
| Distributed Cache | 6-8 GB/s | Same datacenter/region | 100 Gbps network across all servers |
| Object Store | 1.5-8 GB/s | Global | Backing store for all files |
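As a rough illustration of what these throughput figures mean for load times, here is a back-of-the-envelope calculation. The 15 GB weights file is an arbitrary example, and the math ignores latency and protocol overhead:

```python
def transfer_seconds(size_gb: float, throughput_gb_s: float) -> float:
    """Idealized transfer time: size divided by sustained throughput."""
    return size_gb / throughput_gb_s

# 15 GB of model weights at each tier's lower-bound throughput:
local_s = transfer_seconds(15, 10)        # node-local NVMe: 1.5 s
regional_s = transfer_seconds(15, 6)      # distributed cache: 2.5 s
cold_s = transfer_seconds(15, 1.5)        # object store: 10.0 s
```

The spread between the cold and warm numbers is why cache placement, not raw download speed, dominates cold-start behavior.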
When a runner reads a file from /data, it checks the local node cache first. If the file isn’t found locally, it checks the distributed datacenter cache. If that also misses, it fetches from the object store and populates both caches on the way back. This means the first runner in a region pays the full download cost, but every runner after it benefits from progressively faster access.
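The lookup order above is a classic read-through cache. Here is a simplified model of that behavior, using in-memory dicts to stand in for the two cache tiers — a sketch of the concept, not the platform's implementation:

```python
from typing import Callable, Dict

def read_through(path: str,
                 local: Dict[str, bytes],
                 distributed: Dict[str, bytes],
                 fetch_from_object_store: Callable[[str], bytes]) -> bytes:
    # 1. Check the fastest tier first: the node-local cache.
    if path in local:
        return local[path]
    # 2. Miss: try the datacenter-wide distributed cache.
    if path in distributed:
        data = distributed[path]
        local[path] = data  # populate the faster tier on the way back
        return data
    # 3. Miss everywhere: pay the full object-store download once...
    data = fetch_from_object_store(path)
    # ...and populate both caches so later readers in the region benefit.
    distributed[path] = data
    local[path] = data
    return data
```

After the first read, a second runner on the same node hits tier 1, and a runner on a different node in the same region hits tier 2 — the object store is only contacted once per region.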
Docker Image Cache
Docker images are cached separately using the node’s Docker daemon. The scheduler prefers nodes that already have your image cached. If no cached node is available, the image is pulled from the container registry. Build-time layer caching is handled through BuildKit’s registry cache, so rebuilds after small Dockerfile changes only rebuild the affected layers.
What Gets Cached
Docker Image Layers — container images are split into layers, and each layer is cached independently. If you change a single dependency in your Dockerfile, only the affected layers are re-pulled; the base image and unchanged layers come from cache. See Optimize Container Images for tips on structuring your Dockerfile for better layer caching.
Model Weights — files downloaded to /data are automatically cached across runners through the three-layer /data cache described above. This includes Hugging Face models (cached at /data/.cache/huggingface via the HF_HOME environment variable), weights downloaded with download_model_weights(), and any other files you write to /data. The first runner downloads the weights; subsequent runners read from the local or distributed cache.
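Because caching is path-based, the main operational knob is making sure downloads actually land on /data. A minimal sketch — the path mirrors the default described above, and HF_HOME is the standard Hugging Face environment variable; set it before importing libraries that read it:

```python
import os

# Point Hugging Face's cache at /data so model downloads flow through
# the shared three-layer cache instead of ephemeral container storage.
os.environ["HF_HOME"] = "/data/.cache/huggingface"

# Any library imported after this point that respects HF_HOME will
# read and write its model cache under /data.
```

Writing to any other location (for example, the container's default home directory) bypasses the cache entirely, so every runner would re-download the weights.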
Compiled Model Caches — PyTorch Inductor compiled models and other JIT compilation artifacts. If you use torch.compile(), you can persist and share the resulting compiled kernels across runners using synchronized_inductor_cache(). This stores the cache on /data (at /data/inductor-caches/<GPU_TYPE>/), where it benefits from the same three-layer cache. See Compiled Kernel Caching for setup instructions.
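To see how the per-GPU layout composes, here is a hypothetical helper — inductor_cache_dir is not a platform API, and the snippet uses TORCHINDUCTOR_CACHE_DIR, PyTorch's standard environment variable for overriding the Inductor cache location:

```python
import os

def inductor_cache_dir(gpu_type: str, root: str = "/data/inductor-caches") -> str:
    # Compiled kernels are specific to a GPU architecture, so the cache
    # is keyed by GPU type, matching the /data/inductor-caches/<GPU_TYPE>/
    # layout described above. (Hypothetical helper, not a platform API.)
    return os.path.join(root, gpu_type)

# Pointing PyTorch's Inductor cache at a /data path lets compiled
# kernels persist across runners via the same three-layer cache.
os.environ["TORCHINDUCTOR_CACHE_DIR"] = inductor_cache_dir("A100")
```

Keying by GPU type matters because a kernel compiled for one architecture generally cannot be reused on another; sharing a single directory across GPU types would cause cache misses or recompilation.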
How Caches Warm Up
Caches warm progressively as your application serves traffic:
- First runner — pulls the Docker image from the registry and downloads model weights from the object store. Populates the local and distributed /data caches. This is the slowest cold start.
- Second runner on the same node — Docker image is already in the node’s Docker cache, and model weights are in the local NVMe cache. Cold start is significantly faster.
- Runners on other nodes — model weights are served from the distributed /data cache. The scheduler prefers nodes where the Docker image is already cached, but may pull from the registry if needed.
- Over time — as more nodes serve your app, caches spread across the region. Cold starts converge to the time it takes to run setup() with all files already available locally.
Caches are not permanent. Local caches may be evicted when nodes are recycled or under memory pressure. If your app hasn’t had traffic for a while, the first cold start after a quiet period may be slower as caches are repopulated.