Introduction to Private Serverless Models
In addition to using our blazing-fast public API endpoints, you can also take advantage of fal’s infrastructure for your own private AI models. This section explains how to deploy a custom private AI model to fal’s infrastructure.
Install the fal SDK Python package
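The SDK is distributed on PyPI, so assuming a standard Python environment it can be installed with pip:

```shell
pip install fal
```
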
Speed run Stable Diffusion
The example below uses the diffusers library to run a simple Stable Diffusion pipeline.
The first time you run this application, fal will create a virtual environment that satisfies the requirements listed in the `requirements` variable. This environment is cached and reused for each subsequent invocation of the API.
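Assuming the class lives in a file called example.py, the app can be started for development with the fal CLI (the `file.py::ClassName` target syntax here is an assumption):

```shell
fal run example.py::MyApp
```
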
Once the application has started, it is ready to accept requests!
In this code:

- `MyApp` is a class that inherits from `fal.App`. This structure allows the creation of a complex application with multiple endpoints, which are defined using the `@fal.endpoint` decorator.
- `machine_type` is a class attribute that specifies the type of machine on which the application will run. Here, `"GPU-A100"` is specified.
- `requirements` is another class attribute that lists the dependencies needed for the application to run; the entries shown stand in for your actual dependencies.
- The `setup()` method is overridden to initialize the models used by the application. It is executed once when the application starts.
- The `@fal.endpoint` decorator defines the routes, or endpoints, of the application. In this example, only one endpoint is defined: `"/"`.
Deploying your application
Once your application is ready for deployment, you can use the fal CLI to deploy it:
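A sketch of the deploy command, again assuming the app lives in example.py and the `file.py::ClassName` target syntax:

```shell
fal deploy example.py::MyApp
```
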
In this command, we instruct fal to deploy the `MyApp` class from `example.py` as an application.
Upon successful deployment, fal will provide a URL, for example https://fal.run/777/my_app. This URL is the public access point to your deployed application, allowing you to interact with the API endpoints defined in your `MyApp` class.
Setup Functions and keep_alive
keep_alive
`keep_alive` is a configuration setting that enables the server to continue running even when there are no active requests.
By setting `keep_alive`, you can ensure that requests hitting the same application within the specified time frame avoid that startup overhead entirely.
`keep_alive` is measured in seconds; in the example below, the application keeps running for at least 300 seconds after the last request.
setup()
When managing AI workloads, it’s vital to prevent the same model from being reloaded into memory each time the serverless application is invoked. Each application can define a `setup()` method. This method is invoked once during application startup, and its result is cached in memory for the entire server lifecycle.
Min/Max Concurrency
fal applications have a simple managed autoscaling system. You can configure the autoscaling behavior through `min_concurrency` and `max_concurrency`.
- `min_concurrency` indicates the number of replicas the system should maintain when there are no requests.
- `max_concurrency` indicates the maximum number of replicas the system should have. Once this limit is reached, all subsequent requests are placed in a managed queue.