Optimizing Routing Behavior

When multiple runners of the same application are available, there is no defined behavior for picking which one serves a particular request; the assumption is that all runners behave identically for a given set of inputs.

This assumption may not hold for applications that hold state, such as an in-memory cache keyed on certain input parameters. For example, if you are serving an application that can run any diffusion model but can only keep 3 distinct models in memory, you want to minimize cache misses for the user-provided model name, because loading a model from scratch incurs a significant performance penalty.
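Such a bounded cache is commonly implemented as an LRU (least-recently-used) structure. A minimal sketch, where the 3-model capacity and the `loader` callback are illustrative assumptions rather than anything fal provides:

```python
from collections import OrderedDict
from typing import Any, Callable


class ModelCache:
    """Keeps at most `capacity` models in memory, evicting the least recently used."""

    def __init__(self, capacity: int = 3) -> None:
        self.capacity = capacity
        self._models: OrderedDict[str, Any] = OrderedDict()

    def get(self, name: str, loader: Callable[[str], Any]) -> Any:
        if name in self._models:
            # Cache hit: mark as most recently used and return immediately.
            self._models.move_to_end(name)
            return self._models[name]
        if len(self._models) >= self.capacity:
            # Evict the least recently used model to stay within capacity.
            self._models.popitem(last=False)
        # Cache miss: pay the expensive load once, then keep the result.
        model = self._models[name] = loader(name)
        return model
```

Every eviction followed by a reload of the same model is exactly the penalty that hint-based routing tries to avoid.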

This is where semantically-aware routing hints come into play. Instead of treating all runners equally, a runner can provide hints that allow fal’s router to preferentially route requests to the runners best able to serve them. To do this:

  1. You need to add an `X-Fal-Runner-Hint` header to your requests with a semantically meaningful string of your choice
  2. The application should implement a `provide_hints()` method that returns the list of hints the runner can serve

When both are present, fal’s router will try to match each request’s hint to a runner that advertises it. If no runner advertises the hint, or all runners that do are busy, fal will route the request to any available runner without waiting.
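The matching behavior described above can be sketched as follows; the `Runner` shape and the `pick_runner` helper are hypothetical illustrations, not fal’s actual router:

```python
import random
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Runner:
    hints: list = field(default_factory=list)
    busy: bool = False


def pick_runner(runners: list, hint: Optional[str]) -> Optional[Runner]:
    """Prefer an idle runner advertising the hint; otherwise any idle runner."""
    idle = [r for r in runners if not r.busy]
    if not idle:
        return None
    if hint is not None:
        matching = [r for r in idle if hint in r.hints]
        if matching:
            return random.choice(matching)
    # No idle runner advertises the hint: route anywhere rather than waiting.
    return random.choice(idle)
```

The key property is the fallback: a hint biases placement toward warm runners but never blocks a request when none of them are free.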

```python
from typing import Any

import fal
from fal.toolkit import Image
from pydantic import BaseModel, Field


class Input(BaseModel):
    model: str = Field()
    prompt: str = Field()


class Output(BaseModel):
    image: Image = Field()


class AnyModelRunner(fal.App):
    def setup(self) -> None:
        self.models = {}

    def provide_hints(self) -> list[str]:
        # Choose to specialize on already loaded models; at first this will be
        # empty, so we'll be picked for any request, but as the cache slowly
        # builds up, the runner becomes preferable compared to others.
        return list(self.models.keys())

    def load_model(self, name: str) -> Any:
        from diffusers import DiffusionPipeline

        if name in self.models:
            return self.models[name]

        pipeline = DiffusionPipeline.from_pretrained(name)
        pipeline.to("cuda")
        self.models[name] = pipeline
        return pipeline

    @fal.endpoint("/")
    def run_model(self, input: Input) -> Output:
        model = self.load_model(input.model)
        result = model(input.prompt)
        return Output(image=Image.from_pil(result.images[0]))
```
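On the client side, you then send the hint header with each request so the router can prefer runners that already have the model loaded. A minimal sketch using the standard library; the app URL below is a placeholder, and real requests to a deployed fal app also need your authentication headers:

```python
import json
import urllib.request

# Placeholder URL -- substitute your deployed app's endpoint.
url = "https://fal.run/your-username/any-model-runner"

payload = {
    "model": "stabilityai/stable-diffusion-xl-base-1.0",
    "prompt": "a watercolor painting of a fox",
}

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        # Use the model name as the hint: it matches what provide_hints()
        # returns once a runner has loaded this model.
        "X-Fal-Runner-Hint": payload["model"],
    },
    method="POST",
)
# urllib.request.urlopen(req)  # send (requires a real URL and credentials)
```

Because the hint string is opaque to fal, any stable value works; using the model name keeps it aligned with the keys of `self.models` in the app above.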