The HTTP status code your endpoint returns has consequences beyond the response itself. Different codes determine whether the runner stays alive or is terminated, whether the request is retried on a new runner, and how billing works. Returning the wrong code (e.g., 503 for a normal error) can kill a healthy runner unnecessarily.
This page covers how fal interprets each status code, which code to use in different situations, and how connection errors and timeouts interact with the runner lifecycle. For controlling retry behavior beyond status codes, see Retries. For adding proactive health checks, see Health Check Endpoint.
Startup
fal considers a runner ready to serve requests after the setup() method completes successfully. If there is no setup() method, the runner is ready as soon as the web server is up.
If setup() fails or the web server port never opens, the runner is immediately terminated as unhealthy and no requests are forwarded to it.
Always use setup() to load models and perform a warmup inference. This ensures your runner is fully functional before receiving real traffic.
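The pattern looks like this in outline. This is a sketch of the load-then-warmup idea only: `DummyModel` is a stand-in for your real model, and in a real app the class would subclass `fal.App` as in the examples below.

```python
class DummyModel:
    """Stands in for a real model; the first inference is the slow one."""
    def __init__(self):
        self.compiled = False

    def predict(self, prompt: str) -> str:
        # First call pays one-time costs (e.g., CUDA context creation,
        # kernel compilation, cache population).
        self.compiled = True
        return f"output for {prompt!r}"


class MyApp:
    """Sketch of the setup()-then-serve pattern (subclass fal.App for real)."""
    def setup(self):
        # Load weights once at startup, not per request.
        self.model = DummyModel()
        # Warmup inference: the first real request stays fast, and setup()
        # fails early (terminating the runner) if the model is broken.
        self.model.predict("warmup")


app = MyApp()
app.setup()  # the runner would be marked ready only after this returns
```

Because a failed `setup()` terminates the runner before it receives traffic, putting the warmup here turns a broken model into a clean startup failure instead of a stream of 500s.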
Status Code Reference
The status code your endpoint returns determines what happens to the runner and whether queue-based requests are retried. Direct calls via run() and stream() are never retried regardless of status code.
| Status Code | Runner Impact | Retried (queue only) |
|---|---|---|
| 2XX | Healthy | N/A |
| 4XX | Healthy | No |
| 500 | TCP health check triggered | No |
| 502 | TCP health check triggered | No |
| 503 | Immediately terminated | Yes |
| 504 | TCP health check triggered | Yes |
How each code works
2XX and 4XX — The runner remains healthy and continues serving requests. 4XX responses are treated as client errors and are never retried.
500 and 502 — The platform runs a TCP health check on the runner. If the check passes, the runner stays alive and continues serving requests. If it fails, the runner is terminated and replaced. The request is not automatically retried.
503 — The runner is immediately terminated after a single 503 response. Queue-based requests are automatically retried on a new runner (up to 10 times). Use this only when the runner is genuinely broken (e.g., GPU OOM, corrupted state).
504 — The platform runs a TCP health check and automatically requeues the request for retry. The runner is not immediately terminated but may be replaced if the health check fails. Use this when an upstream dependency timed out but your runner is still functional.
Never return 503 for normal application errors. A single 503 immediately kills your runner. Use 500 for application-level errors where the runner is still functional.
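The guidance in this section can be collapsed into a small helper that maps a caught exception to a status code. The exception classes here are illustrative choices, not a fal convention; adapt them to your own error types.

```python
def status_for(exc: Exception) -> int:
    """Map an exception to a response status per the rules above.

    Illustrative only: which exceptions mean "bad input" or "runner
    broken" depends entirely on your application.
    """
    if isinstance(exc, ValueError):
        return 422  # bad input: runner stays healthy, never retried
    if isinstance(exc, TimeoutError):
        return 504  # upstream timeout: request is requeued for retry
    if isinstance(exc, RuntimeError) and "out of memory" in str(exc).lower():
        return 503  # runner genuinely broken: terminate and retry elsewhere
    return 500      # generic failure: TCP health check decides runner's fate
```

Keeping this mapping in one place makes it easy to audit that nothing returns 503 unless the runner is truly unrecoverable.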
Which Status Code to Use
| Situation | Recommended Code | Why |
|---|---|---|
| Bad user input, validation failure | 422 or 400 | Client error, runner stays healthy |
| Model inference failed but runner is fine | 500 | TCP health check runs, runner likely survives |
| External API or dependency timed out | 504 | Request retried, runner not killed |
| GPU OOM, corrupted model state, runner broken | 503 | Runner terminated and replaced |
| Rate limiting the caller | 429 | Client error, queue-based requests automatically retried |
```python
import fal
from fastapi.responses import JSONResponse


class MyApp(fal.App):
    @fal.endpoint("/")
    def predict(self, input: dict) -> dict:
        try:
            result = self.model.run(input)
            return result
        except ValueError as e:
            # Bad input: 422 keeps the runner healthy and is never retried.
            return JSONResponse(
                status_code=422,
                content={"detail": str(e)},
            )
        except RuntimeError as e:
            if "out of memory" in str(e).lower():
                # Runner is genuinely broken: 503 terminates it, and
                # queue-based requests are retried on a replacement.
                return JSONResponse(
                    status_code=503,
                    content={"detail": "GPU out of memory"},
                )
            # Runner is still functional: 500 triggers only a TCP health check.
            return JSONResponse(
                status_code=500,
                content={"detail": "Inference failed"},
            )
```
Connection Errors and Timeouts
Beyond status codes, two additional scenarios affect runner lifecycle:
| Scenario | What happens | Queue requests | Direct requests |
|---|---|---|---|
| App crashes (connection breaks) | Runner terminated | Retried on new runner | Returns 503 |
| Request timeout exceeded | Runner terminated | Retried on new runner | Returns 504 |
In both cases the runner is shut down because it may be in a faulty state. The platform spins up a replacement.
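Because direct `run()` and `stream()` calls are never retried by the platform, a caller that wants resilience against these two terminal responses must retry on its own. The wrapper below is a client-side sketch, not part of the fal SDK; `call` is any function you supply that returns a `(status_code, body)` pair.

```python
import time


def call_with_retry(call, max_attempts=3, backoff=1.0):
    """Client-side retry for direct calls, which the platform never retries.

    `call` is a zero-argument function returning (status_code, body).
    This wrapper is a sketch, not a fal SDK feature.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = call()
        # 503/504 on a direct call mean the runner crashed or timed out;
        # a fresh attempt should land on a replacement runner.
        if status not in (503, 504) or attempt == max_attempts:
            return status, body
        time.sleep(backoff * attempt)  # simple linear backoff
```

Any other status (2XX, 4XX, 500, 502) is returned to the caller immediately, matching the platform's own no-retry behavior for those codes.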
Overriding Retry Behavior
You can override the default retry logic on a per-response basis using the X-Fal-Needs-Retry response header. This takes precedence over the status-code-based logic and skip_retry_conditions.
| Header Value | Behavior |
|---|---|
| 1 | Force a retry, even if the status code would not normally trigger one |
| 0 | Prevent a retry, even if the status code would normally trigger one (e.g., return 503 without requeuing the request) |
```python
import fal
from fastapi.responses import JSONResponse


class MyApp(fal.App):
    @fal.endpoint("/")
    def predict(self, input: dict) -> dict:
        try:
            return self.model.run(input)
        except TransientError:  # your own transient-failure exception type
            # 500 would not normally be retried; the header forces a retry.
            return JSONResponse(
                status_code=500,
                headers={"X-Fal-Needs-Retry": "1"},
                content={"detail": "Transient error"},
            )
```
See Retries for the full reference including skip_retry_conditions, client-side retry control, and timeout interactions.