The HTTP status code your endpoint returns has consequences beyond the response itself. Different codes determine whether the runner stays alive or is terminated, whether the request is retried on a new runner, and how billing works. Returning the wrong code (e.g., 503 for a normal error) can kill a healthy runner unnecessarily.
This page covers how fal interprets each status code, which code to use in different situations, and how connection errors and timeouts interact with the runner lifecycle. For controlling retry behavior beyond status codes, see Retries. For adding proactive health checks, see Health Check Endpoint.
Startup
fal considers a runner ready to serve requests after the setup() method completes successfully. If there is no setup() method, the runner is ready as soon as the web server is up.
If setup() fails or the web server port never opens, the runner is immediately terminated as unhealthy and no requests are forwarded to it.
Always use setup() to load models and perform a warmup inference. This ensures your runner is fully functional before receiving real traffic.
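The pattern looks like this in outline. This is a sketch of the load-then-warmup idea only: `DummyModel` is a stand-in for your real model, and in a real app the class would subclass `fal.App` as in the examples below.

```python
class DummyModel:
    """Stands in for a real model; the first inference is the slow one."""
    def __init__(self):
        self.compiled = False

    def predict(self, prompt: str) -> str:
        # First call pays one-time costs (e.g., CUDA context creation,
        # kernel compilation, cache population).
        self.compiled = True
        return f"output for {prompt!r}"


class MyApp:
    """Sketch of the setup()-then-serve pattern (subclass fal.App for real)."""
    def setup(self):
        # Load weights once at startup, not per request.
        self.model = DummyModel()
        # Warmup inference: the first real request stays fast, and setup()
        # fails early (terminating the runner) if the model is broken.
        self.model.predict("warmup")


app = MyApp()
app.setup()  # the runner would be marked ready only after this returns
```

Because a failed `setup()` terminates the runner before it receives traffic, putting the warmup here turns a broken model into a clean startup failure instead of a stream of 500s.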
Status Code Reference
The status code your endpoint returns determines what happens to the runner and whether queue-based requests are retried. Direct calls via run() and stream() are never retried regardless of status code.
| Status Code | Runner Impact | Retried (queue only) |
|---|---|---|
| 2XX | Healthy | N/A |
| 4XX | Healthy | No |
| 500 | TCP health check triggered | No |
| 502 | TCP health check triggered | No |
| 503 | Immediately terminated | Yes |
| 504 | TCP health check triggered | Yes |
How each code works
2XX and 4XX — The runner remains healthy and continues serving requests. 4XX responses are treated as client errors and are never retried.
500 and 502 — The platform runs a TCP health check on the runner. If the check passes, the runner stays alive and continues serving requests. If it fails, the runner is terminated and replaced. The request is not automatically retried.
503 — The runner is immediately terminated after a single 503 response. Queue-based requests are automatically retried on a new runner (up to 10 times). Use this only when the runner is genuinely broken (e.g., GPU OOM, corrupted state).
504 — The platform runs a TCP health check and automatically requeues the request for retry. The runner is not immediately terminated but may be replaced if the health check fails. Use this when an upstream dependency timed out but your runner is still functional.
Never return 503 for normal application errors. A single 503 immediately kills your runner. Use 500 for application-level errors where the runner is still functional.
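The guidance in this section can be collapsed into a small helper that maps a caught exception to a status code. The exception classes here are illustrative choices, not a fal convention; adapt them to your own error types.

```python
def status_for(exc: Exception) -> int:
    """Map an exception to a response status per the rules above.

    Illustrative only: which exceptions mean "bad input" or "runner
    broken" depends entirely on your application.
    """
    if isinstance(exc, ValueError):
        return 422  # bad input: runner stays healthy, never retried
    if isinstance(exc, TimeoutError):
        return 504  # upstream timeout: request is requeued for retry
    if isinstance(exc, RuntimeError) and "out of memory" in str(exc).lower():
        return 503  # runner genuinely broken: terminate and retry elsewhere
    return 500      # generic failure: TCP health check decides runner's fate
```

Keeping this mapping in one place makes it easy to audit that nothing returns 503 unless the runner is truly unrecoverable.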
Which Status Code to Use
| Situation | Recommended Code | Why |
|---|---|---|
| Bad user input, validation failure | 422 or 400 | Client error, runner stays healthy |
| Model inference failed but runner is fine | 500 | TCP health check runs, runner likely survives |
| External API or dependency timed out | 504 | Request retried, runner not killed |
| GPU OOM, corrupted model state, runner broken | 503 | Runner terminated and replaced |
| Rate limiting the caller | 429 | Client error, queue-based requests automatically retried |
```python
import fal
from fastapi.responses import JSONResponse


class MyApp(fal.App):
    @fal.endpoint("/")
    def predict(self, input: dict) -> dict:
        try:
            result = self.model.run(input)
            return result
        except ValueError as e:
            # Bad input: 422 keeps the runner healthy and is never retried.
            return JSONResponse(
                status_code=422,
                content={"detail": str(e)},
            )
        except RuntimeError as e:
            if "out of memory" in str(e).lower():
                # Runner is genuinely broken: 503 terminates it, and
                # queue-based requests are retried on a replacement.
                return JSONResponse(
                    status_code=503,
                    content={"detail": "GPU out of memory"},
                )
            # Runner is still functional: 500 triggers only a TCP health check.
            return JSONResponse(
                status_code=500,
                content={"detail": "Inference failed"},
            )
```
Connection Errors and Timeouts
Beyond status codes, two additional scenarios affect runner lifecycle:
| Scenario | What happens | Queue requests | Direct requests |
|---|---|---|---|
| App crashes (connection breaks) | Runner terminated | Retried on new runner | Returns 503 |
| Request timeout exceeded | Runner terminated | Retried on new runner | Returns 504 |
In both cases the runner is shut down because it may be in a faulty state. The platform spins up a replacement.
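Because direct `run()` and `stream()` calls are never retried by the platform, a caller that wants resilience against these two terminal responses must retry on its own. The wrapper below is a client-side sketch, not part of the fal SDK; `call` is any function you supply that returns a `(status_code, body)` pair.

```python
import time


def call_with_retry(call, max_attempts=3, backoff=1.0):
    """Client-side retry for direct calls, which the platform never retries.

    `call` is a zero-argument function returning (status_code, body).
    This wrapper is a sketch, not a fal SDK feature.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = call()
        # 503/504 on a direct call mean the runner crashed or timed out;
        # a fresh attempt should land on a replacement runner.
        if status not in (503, 504) or attempt == max_attempts:
            return status, body
        time.sleep(backoff * attempt)  # simple linear backoff
```

Any other status (2XX, 4XX, 500, 502) is returned to the caller immediately, matching the platform's own no-retry behavior for those codes.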
Overriding Retry Behavior
You can override the default retry logic on a per-response basis using the X-Fal-Needs-Retry response header. This takes precedence over the status-code-based logic and skip_retry_conditions.
| Header Value | Behavior |
|---|---|
| 1 | Force a retry, even if the status code would not normally trigger one |
| 0 | Prevent a retry, even if the status code would normally trigger one (e.g., return 503 without requeuing the request) |
```python
import fal
from fastapi.responses import JSONResponse


class MyApp(fal.App):
    @fal.endpoint("/")
    def predict(self, input: dict) -> dict:
        try:
            return self.model.run(input)
        except TransientError:  # your own transient-failure exception type
            # 500 would not normally be retried; the header forces a retry.
            return JSONResponse(
                status_code=500,
                headers={"X-Fal-Needs-Retry": "1"},
                content={"detail": "Transient error"},
            )
```
See Retries for the full reference including skip_retry_conditions, client-side retry control, and timeout interactions.