Running Diffusion Models
In this example, we demonstrate how you can write your own Stable Diffusion tool, run it from your terminal, and deploy it to fal as a scalable and production-grade HTTP API without changing a single line of code.
Starting small
We can start by importing `fal` and defining a global variable called `MODEL_NAME` to denote which model from the HF model hub we want to use (it can be any SD/SDXL model, or any fine-tuned variant).
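A minimal sketch of that setup could look like this (the checkpoint name below is just an example; any compatible model ID works):

```python
import fal

# Illustrative checkpoint; any SD/SDXL model (or fine-tune) from the HF hub works here.
MODEL_NAME = "runwayml/stable-diffusion-v1-5"
```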
Then we proceed to define a cached function[link] that loads the model onto the GPU when it is not already present. This should save us a lot of time when we invoke our tool multiple times in a row (or when lots of API requests hit within a short time frame).
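As a sketch, assuming fal's `@fal.cached` decorator and a standard `diffusers` pipeline, the loader could look roughly like this:

```python
@fal.cached
def load_model():
    """Load the pipeline onto the GPU once and reuse it across invocations."""
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
    pipe.to("cuda")
    return pipe
```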
Taking inputs and returning outputs
To enable the automatic web endpoint fal offers through `serve=True`, we'll have to define our inputs and outputs in a structured way through Pydantic. Although this looks like a web-only concern, nothing actually prevents you from using the same I/O models for the CLI as well, which is what we are going to do.
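A sketch of those models, with illustrative field names and fal.toolkit's `Image` type assumed for the output, might look like this:

```python
from pydantic import BaseModel, Field
from fal.toolkit import Image


class DiffusionOptions(BaseModel):
    prompt: str = Field(description="Text prompt to generate an image for.")
    steps: int = Field(default=30, description="Number of denoising steps.")


class Result(BaseModel):
    image: Image = Field(description="The generated image, hosted on fal's storage.")
```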
Stable Diffusion App
We can annotate our inference function with the necessary packages (which are just `diffusers` and `transformers` for this example) and mark it as a served function by setting `serve=True`. This workflow can run on different GPUs; check https://fal.ai/pricing for a list of options.
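Putting these pieces together, a sketch of the served function might look like the following; the exact requirements list (extra runtime packages like torch/accelerate are added here), the `machine_type`, the `keep_alive` value, and the `Image.from_pil` upload helper are assumptions to adjust to your own setup:

```python
@fal.function(
    requirements=["diffusers", "transformers", "torch", "accelerate"],
    machine_type="GPU",
    keep_alive=60,
    serve=True,
)
def run_stable_diffusion(options: DiffusionOptions) -> Result:
    # 1. Get the (cached) diffusers pipeline; only the first call pays the load cost.
    pipe = load_model()

    # 2. Run inference with the given options to generate the image.
    images = pipe(options.prompt, num_inference_steps=options.steps).images

    # 3. Upload the image to fal's storage and wrap it in the result object.
    return Result(image=Image.from_pil(images[0]))
```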
The inference logic itself should be quite self-explanatory, but to summarize it in three steps, at each invocation this function:
- Gets the `diffusers` pipeline, which includes the actual model. Although the first invocation will be a bit expensive (~15 seconds), all subsequent cached invocations will come at virtually no extra cost.
- Runs the pipeline with the given options to perform inference and generate the image.
- Uploads the image to fal's storage servers and returns a result object.
Using the app in the CLI
To try out and play with your new Stable Diffusion app, you can write a very small CLI interface for it and start running it locally.
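A minimal sketch of such a CLI, assuming fal functions expose an `.on(...)` method for overriding their settings (here, `serve`), could look like this:

```python
import argparse


def main():
    parser = argparse.ArgumentParser(description="Generate an image with Stable Diffusion.")
    parser.add_argument("prompt", help="Text prompt to generate an image for.")
    parser.add_argument("--steps", type=int, default=30, help="Number of denoising steps.")
    args = parser.parse_args()

    # Re-target the served function as a plain fal function for local invocation.
    local_diffusion = run_stable_diffusion.on(serve=False)
    result = local_diffusion(DiffusionOptions(prompt=args.prompt, steps=args.steps))

    # The uploaded image's URL is assumed to be available on the result object.
    print(result.image.url)


if __name__ == "__main__":
    main()
```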
As you might have noticed, we create a new function called `local_diffusion` by setting the `serve` property to `False` when performing our invocations through Python. This ensures that our application works both as a web app when run through `run_stable_diffusion()` (or deployed) and as a regular function that can be called from Python.
Productionizing your API
To share this app with others in the form of an HTTP API, all you have to do is call `fal`'s serve command and let it deploy the function to the serverless runtime for you. Each HTTP request will automatically wake up a server (if there isn't one already), process the request, hang around for a while in case subsequent requests arrive (within the defined `keep_alive` window), and finally shut itself down to avoid incurring costs when idle.
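Assuming the code above lives in a file called app.py, the deployment could look roughly like this (the exact CLI syntax may differ between fal versions):

```bash
# Serve the function on fal's serverless runtime; prints the endpoint URL when ready.
fal fn serve app.py run_stable_diffusion
```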
As long as you have the credentials[link] set, you can invoke this API from anywhere, whether from your terminal or from your own frontend.
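For instance, assuming key-based authentication through an Authorization header and a placeholder endpoint URL (use the one printed by the serve command), a curl invocation might look like this:

```bash
# Placeholder URL; substitute the endpoint printed when you served the function.
curl -X POST "https://<your-endpoint-url>" \
  -H "Authorization: Key $FAL_KEY" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "an astronaut riding a horse on mars", "steps": 30}'
```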