setup() runs once per runner, not once per request. This is where you load model weights into GPU memory, initialize connections, and prepare any state that your endpoint methods will use. Objects you attach to self (like self.model) persist for the lifetime of the runner, so expensive operations only happen at startup. The trade-off is that everything in setup() contributes to your app’s cold start time, which is the delay a request experiences when no idle runner is available and a new one must be provisioned.
setup() downloads SDXL weights (cached to /data automatically for Hugging Face models), loads them into GPU memory, and runs a warmup inference to compile any lazy kernels. All of this happens once. Every subsequent request reuses self.pipe without any initialization overhead.
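The lifecycle above can be sketched in plain Python. This is an illustrative stand-in, not the platform's actual API: the class name, method names, and the simulated "weight load" are hypothetical, and a real setup() would load SDXL via diffusers and move it to the GPU. The point of the sketch is the pattern itself: expensive work happens once in setup(), state persists on self, and every request reuses it.

```python
import time

class ImageGenerator:
    """Illustrative runner: setup() runs once, predict() runs per request.
    Names here are hypothetical, not the platform's real API."""

    def setup(self):
        # Expensive one-time work. In a real app this would be something like
        # StableDiffusionXLPipeline.from_pretrained(...).to("cuda") plus a
        # warmup inference; a short sleep stands in for the weight load here.
        time.sleep(0.01)
        self.pipe = lambda prompt: f"image for {prompt!r}"  # stand-in pipeline
        self.setup_calls = getattr(self, "setup_calls", 0) + 1

    def predict(self, prompt: str) -> str:
        # Per-request path: reuses self.pipe with no initialization overhead.
        return self.pipe(prompt)

# Simulate the runner lifecycle: setup once, then serve many requests.
runner = ImageGenerator()
runner.setup()  # cold-start cost is paid exactly once per runner
outputs = [runner.predict(p) for p in ("a cat", "a dog")]
```

Because self.pipe is created in setup() rather than predict(), only the first request on a fresh runner pays the loading cost; the trade-off described above is that everything inside setup() adds to that cold start.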
What to Read Next
The pages in this section cover the infrastructure that supports your setup method. Persistent Storage explains the /data volume where model weights and datasets are cached across runners, including how the distributed filesystem works, how to upload files outside of your app, and how to handle concurrent writes safely. Downloading Models and Files covers the download_file() and download_model_weights() toolkit utilities, along with detailed Hugging Face optimization techniques for faster initial downloads, parallel file pre-reading, and compiled kernel caching.
For strategies to reduce cold start time beyond what is covered here, see Optimizing Cold Starts, which covers container image optimization, scaling parameters, and FlashPack for high-throughput tensor loading.