Platform GPU Monitoring
fal continuously monitors GPU metrics across all runners, including temperature, clock frequencies, and throttling events. When issues are detected, the operations team is alerted and can cordon the node (preventing new runners from being scheduled), drain existing runners, and perform GPU resets. If the issue persists, the node is escalated for hardware replacement. This monitoring runs automatically and requires no configuration; you benefit from it regardless of whether you have custom health checks enabled.

Application Health Checks
Platform monitoring catches hardware-level failures, but it cannot detect application-level problems like a corrupted model state, a leaked GPU memory allocation, or an external dependency that went down. For these, you can define a health check endpoint that fal calls periodically to verify your runner is functioning correctly. If the check fails failure_threshold consecutive calls (default 3), the runner is terminated and replaced. Health checks run every 15 seconds when call_regularly=True.
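The consecutive-failure behavior described above can be sketched as a simple counter. This is a simulation of the accounting, not fal's actual implementation; the class and method names are illustrative:

```python
class HealthTracker:
    """Simulates consecutive-failure accounting for a runner (illustrative names)."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, healthy: bool) -> bool:
        """Record one health-check result; return True if the runner should be replaced."""
        if healthy:
            self.consecutive_failures = 0  # any success resets the streak
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold


tracker = HealthTracker()
results = [tracker.record(h) for h in [False, True, False, False, False]]
# The success at index 1 resets the counter, so replacement only triggers
# once three failures occur in a row (index 4).
```

The key property is that only *consecutive* failures count: a single healthy response clears the streak, so a transient blip does not terminate the runner.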
Non-Invasive vs Invasive Checks
Health checks with call_regularly=True run in parallel with request processing. Keep these lightweight, since they share GPU and CPU resources with active requests: check connection status, memory usage, or simple assertions rather than running inference.
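A lightweight check of this kind might look like the sketch below. The inputs stand in for values you would read from your own framework (for example, torch.cuda.memory_allocated() in PyTorch); the function name and the 95% leak threshold are assumptions for illustration:

```python
def lightweight_health_check(gpu_mem_used_bytes: int,
                             gpu_mem_total_bytes: int,
                             model_loaded: bool) -> bool:
    """Cheap, non-invasive assertions only: no inference, no GPU kernels.

    Suitable for call_regularly=True, where the check runs alongside
    active requests and must not compete for the GPU.
    """
    if not model_loaded:
        # Model state is gone or was never initialized.
        return False
    # Flag a likely memory leak if nearly all GPU memory is consumed
    # (the 0.95 cutoff is an illustrative assumption, not a fal default).
    if gpu_mem_used_bytes / gpu_mem_total_bytes > 0.95:
        return False
    return True
```

Each condition here completes in microseconds, so the check can safely run every 15 seconds without affecting in-flight requests.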
For more thorough checks that need exclusive GPU access (e.g., running a test inference), set call_regularly=False. In this mode, the health check only runs when the gateway sends an x-fal-runner-health-check header, which happens between requests or after specific error conditions.