hint parameter (sent as the X-Fal-Runner-Hint header) that describes what the request needs. On the server side, your app implements a provide_hints() method that tells fal what each runner is currently specialized for. When both are present, fal’s router tries to match requests to runners with compatible hints. If no matching runner is available or all matching runners are busy, the request goes to any available runner without waiting. Hints are best-effort: they improve cache hit rates but never block a request from being processed.
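As a rough illustration of the caller side, the sketch below builds (but does not send) an HTTP request that carries the hint in the X-Fal-Runner-Hint header. The URL, the payload fields, and the build_request helper are hypothetical placeholders, and authentication is omitted; in practice the client library's hint parameter would set this header for you.

```python
import json
import urllib.request


def build_request(url: str, payload: dict, hint: str) -> urllib.request.Request:
    """Build a POST request that carries a runner hint (hypothetical helper)."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # The hint describes what this request needs, e.g. a model name.
            "X-Fal-Runner-Hint": hint,
        },
    )


# Placeholder URL and payload, for illustration only.
req = build_request(
    "https://example.invalid/my-app",
    {"model": "flux-schnell", "prompt": "a cat"},
    hint="flux-schnell",
)
```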
How It Works
The router matches the hint string from the caller against the list of strings each runner reports via provide_hints(). The matching is exact: if the caller sends hint="flux-schnell" and a runner’s provide_hints() returns ["flux-schnell", "sd-xl"], that runner is preferred. If no runner has a matching hint, the request goes to any available runner.
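The matching rule can be sketched in a few lines of plain Python. This is a simplified model, not fal's actual router: the Runner class, the pick_runner function, and the choice of "first available runner" as the fallback are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Runner:
    """Hypothetical view of a runner as the router might see it."""
    name: str
    hints: List[str] = field(default_factory=list)
    busy: bool = False


def pick_runner(runners: List[Runner], hint: Optional[str]) -> Optional[Runner]:
    """Prefer an idle runner whose hints contain the exact hint string."""
    available = [r for r in runners if not r.busy]
    if hint is not None:
        for runner in available:
            if hint in runner.hints:  # exact string match, no prefix matching
                return runner
    # Best-effort fallback: any available runner (here, simply the first one).
    return available[0] if available else None
```

With runners reporting ["sd-xl"] and ["flux-schnell", "sd-xl"], a request hinted "flux-schnell" goes to the second runner. A request hinted "flux" matches neither, because matching is exact, and falls back to whichever runner is available.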
provide_hints() is called after every response and the result is sent back to the platform as a response header. This means hints update dynamically as the runner loads and unloads models. A runner that starts empty will initially match any request, and as it loads models, it becomes specialized for those models.
Example
This app serves any Hugging Face diffusion model by name. Each runner maintains a cache of loaded models. The hint is the model name, which matches what provide_hints() reports.
Application
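A minimal sketch of the runner-side logic, written as a plain Python class so it runs without any dependencies. In a real app, provide_hints() would be a method on your fal app class and the loader would fetch an actual diffusion pipeline; the ModelRunner class, the generate method, and the fake_load function here are hypothetical stand-ins.

```python
from typing import Callable, Dict, List


def fake_load(model_name: str) -> str:
    """Hypothetical stand-in for an expensive model load."""
    return f"pipeline:{model_name}"


class ModelRunner:
    """Keeps a cache of loaded models and reports their names as hints."""

    def __init__(self, loader: Callable[[str], object] = fake_load) -> None:
        self.loader = loader
        self.models: Dict[str, object] = {}

    def generate(self, model_name: str, prompt: str) -> str:
        # Load the model only on a cache miss; repeat requests reuse it.
        if model_name not in self.models:
            self.models[model_name] = self.loader(model_name)
        pipeline = self.models[model_name]
        return f"{pipeline} <- {prompt}"

    def provide_hints(self) -> List[str]:
        # Report the names of the models this runner currently has loaded.
        return list(self.models)
```

Because the hints are derived from the cache itself, they stay in sync with whatever the runner has loaded, with no separate bookkeeping.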
When a runner first starts, provide_hints() returns an empty list, so the runner matches any request. As it loads models, the hints update to include the loaded model names. Over time, the router naturally specializes runners by directing repeat requests for the same model to the same runner.