When a request fails while being processed through the queue, fal automatically retries it on a new runner. This covers infrastructure-level failures like runner crashes, network issues, and timeouts. Retries happen transparently, up to 10 attempts, before the request is marked as failed. Direct calls via run() and stream() (without the queue) are never retried.

The status code your endpoint returns determines whether the runner stays alive, whether the request is retried, and how the platform reacts. Returning the wrong code can kill a healthy runner or prevent a retry that should happen.

When a request does fail, the response includes an error_type field and an X-Fal-Error-Type header with a machine-readable category (e.g. request_timeout, runner_disconnected) that you can use for programmatic retry logic and monitoring. See Request Error Types for the full reference.

This page is the complete reference for understanding failures on fal: what triggers retries, what each status code does, how timeouts interact, and how to override default behavior with response headers. For a broader view of how retries fit into the request lifecycle, see Understanding Requests.

Status Code Reference

The status code your endpoint returns determines what happens to the runner and whether queue-based requests are retried.
| Status Code | Runner Impact | Retried (queue only) |
| --- | --- | --- |
| 2XX | Healthy | N/A |
| 4XX | Healthy | No |
| 500 | TCP health check triggered | No |
| 502 | TCP health check triggered | No |
| 503 | Immediately terminated | Yes |
| 504 | TCP health check triggered | Yes |
2XX and 4XX: The runner remains healthy and continues serving requests. 4XX responses are treated as client errors and are never retried.

500 and 502: The platform runs a TCP health check on the runner. If the check passes, the runner stays alive and continues serving requests. If it fails, the runner is terminated and replaced. The request is not automatically retried.

503: The runner is immediately terminated after a single 503 response. Queue-based requests are automatically retried on a new runner (up to 10 times). Use this only when the runner is genuinely broken (e.g., GPU OOM, corrupted state).

504: The platform runs a TCP health check and automatically requeues the request for retry. The runner is not immediately terminated, but it may be replaced if the health check fails.
Never return 503 for normal application errors. A single 503 immediately kills your runner. Use 500 for application-level errors where the runner is still functional.
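The table above can be expressed as a small lookup, which is handy for tests or client-side monitoring. This is a reading aid that mirrors the documented semantics, not a fal API:

```python
def queue_retry_expected(status: int) -> bool:
    """True if fal retries a queue-based request that ended with `status`."""
    if status in (503, 504):
        return True   # 503: runner terminated; 504: health check + requeue
    return False      # 2XX, 4XX, 500, 502: never auto-retried


def runner_impact(status: int) -> str:
    """Runner-side effect of returning `status`, per the table above."""
    if 200 <= status < 300 or 400 <= status < 500:
        return "healthy"
    if status == 503:
        return "terminated"
    if status in (500, 502, 504):
        return "tcp_health_check"
    return "unspecified"
```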

Which Status Code to Use

| Situation | Recommended Code | Why |
| --- | --- | --- |
| Bad user input, validation failure | 422 or 400 | Client error, runner stays healthy |
| Model inference failed but runner is fine | 500 | Health check runs, runner likely survives |
| External API or dependency timed out | 504 | Request retried, runner not killed |
| GPU OOM, corrupted model state, runner broken | 503 | Runner terminated and replaced |
| Rate limiting the caller | 429 | Client error, runner stays healthy, no retry |
```python
import fal
from fastapi.responses import JSONResponse

class MyApp(fal.App):
    @fal.endpoint("/")
    def predict(self, input: dict) -> dict:
        try:
            result = self.model.run(input)
            return result
        except ValueError as e:
            # Client error: runner stays healthy, no retry
            return JSONResponse(
                status_code=422,
                content={"detail": str(e)},
            )
        except RuntimeError as e:
            if "out of memory" in str(e).lower():
                # Runner is genuinely broken: terminate it, retry elsewhere
                return JSONResponse(
                    status_code=503,
                    content={"detail": "GPU out of memory"},
                )
            # Application error: health check runs, runner likely survives
            return JSONResponse(
                status_code=500,
                content={"detail": "Inference failed"},
            )
```

Connection Errors and Timeouts

Beyond status codes, two additional scenarios affect runner lifecycle:
| Scenario | What happens | Queue requests | Direct requests |
| --- | --- | --- | --- |
| App crashes (connection breaks) | Runner terminated | Retried on new runner | Returns 503 |
| Request timeout exceeded | Runner terminated | Retried on new runner | Returns 504 |
In both cases the runner is shut down because it may be in a faulty state. The platform spins up a replacement.
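Because direct calls are never retried by fal, callers that need resilience must retry themselves. A minimal sketch, assuming a zero-argument `call` wrapper and a hypothetical DirectCallError that carries the HTTP status (a real client would translate fal_client errors into this shape):

```python
import time

# Failures fal surfaces for direct calls that are worth retrying client-side.
RETRYABLE_STATUSES = {503, 504}


class DirectCallError(Exception):
    """Illustrative error carrying the HTTP status of a failed direct call."""
    def __init__(self, status: int):
        self.status = status
        super().__init__(f"direct call failed with status {status}")


def call_with_retry(call, max_attempts=3, base_delay=0.5):
    """Retry a direct (non-queue) call on 503/504, since fal will not."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except DirectCallError as e:
            if e.status not in RETRYABLE_STATUSES or attempt == max_attempts:
                raise  # non-retryable status, or out of attempts
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
```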

When Retries Happen

fal retries queue-based requests under three conditions. Each corresponds to a value you can use in skip_retry_conditions to disable it.
| Condition | Value | What triggers it | Runner impact |
| --- | --- | --- | --- |
| Server error | "server_error" | Runner returned HTTP 503, runner disconnected, runner sent an incomplete response, or runner returned HTTP 504 | 503: runner terminated. 504: health check triggered. |
| Timeout | "timeout" | Request exceeded the app's request_timeout and the gateway killed the connection | Runner terminated |
| Connection error | "connection_error" | The HTTP session between the gateway and the runner was unexpectedly closed | Runner terminated |
Each condition maps to a different failure mode. Server errors indicate the runner is in a bad state. Timeouts indicate the request took too long. Connection errors indicate a network-level failure between the gateway and the runner.

Controlling Retry Behavior

App-Level: skip_retry_conditions

Configure your app to skip retries for specific conditions. Pass one or more of the condition values from the table above.
```python
class MyApp(fal.App):
    skip_retry_conditions = ["timeout"]
```
This is useful when your model has long-running requests that exceed request_timeout for legitimate reasons. Without this setting, fal would retry the request on a new runner, which wastes compute and delays the final failure response. You can combine multiple conditions:
```python
class MyApp(fal.App):
    skip_retry_conditions = ["timeout", "server_error"]
```

Per-Response: X-Fal-Needs-Retry

Override the default retry behavior on a per-response basis by returning the X-Fal-Needs-Retry header from your endpoint. This takes precedence over both the status-code-based retry logic and skip_retry_conditions.
| Header Value | Behavior |
| --- | --- |
| 1 | Force a retry, even if the status code would not normally trigger one |
| 0 | Prevent a retry, even if the status code would normally trigger one |
```python
import fal
from fastapi.responses import JSONResponse

class MyApp(fal.App):
    @fal.endpoint("/")
    def run(self, input: Input) -> Output:
        try:
            result = self.model.run(input)
            return result
        except TransientError:
            # A 500 alone would not be retried; the header forces a retry
            return JSONResponse(
                status_code=500,
                headers={"X-Fal-Needs-Retry": "1"},
                content={"detail": "Transient error, please retry"},
            )
        except NonRetryableError:
            # A 503 alone would be retried; the header suppresses the retry
            return JSONResponse(
                status_code=503,
                headers={"X-Fal-Needs-Retry": "0"},
                content={"detail": "Non-retryable error"},
            )
```

Per-Response: x-fal-stop-runner

Control whether the runner is terminated after a response, independent of the status code. This header is stripped from the response before it reaches the caller.
| Header Value | Behavior |
| --- | --- |
| 1 / true | Force runner termination (same effect as a 503, but works with any status code) |
| 0 / false | Prevent runner termination (allows returning 503 for retry without killing the runner) |
Use this when you want to decouple retry behavior from runner termination. For example, you might want to trigger a retry (X-Fal-Needs-Retry: 1) but keep the runner alive (x-fal-stop-runner: false), or terminate a runner (x-fal-stop-runner: true) without triggering a retry (X-Fal-Needs-Retry: 0).
```python
import fal
from fastapi.responses import JSONResponse

@fal.endpoint("/")
def predict(self, input: dict) -> dict:
    try:
        return self.model.run(input)
    except CorruptedStateError:
        # Retry on a fresh runner and terminate this one
        return JSONResponse(
            status_code=500,
            headers={
                "x-fal-stop-runner": "true",
                "X-Fal-Needs-Retry": "1",
            },
            content={"detail": "Runner state corrupted, retrying on fresh runner"},
        )
```
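Because the two headers vary independently, the four combinations can be captured in a small helper. This is illustrative only and not part of the fal SDK:

```python
def failure_headers(retry: bool, stop_runner: bool) -> dict:
    """Build response headers that decouple retry from runner termination.

    Illustrative helper mirroring the header semantics documented above.
    """
    return {
        "X-Fal-Needs-Retry": "1" if retry else "0",
        "x-fal-stop-runner": "true" if stop_runner else "false",
    }

# e.g. trigger a retry elsewhere but keep this runner serving:
#   failure_headers(retry=True, stop_runner=False)
```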

Per-Request: Client-Side Control

When calling your app (or any model) from client code, you can control retry behavior per-request using headers. Pass the x-fal-no-retry header to prevent fal from retrying a specific request:
```python
import fal_client

result = fal_client.subscribe(
    "your-username/your-app-name",
    arguments={"prompt": "a sunset"},
    headers={"x-fal-no-retry": "1"},
)
```
For supported models, fal may route failed requests to equivalent fallback endpoints. To disable this per-request, pass x-app-fal-disable-fallback:
```python
result = fal_client.subscribe(
    "your-username/your-app-name",
    arguments={"prompt": "a sunset"},
    headers={"x-app-fal-disable-fallback": "1"},
)
```

Timeouts and Retries

fal has four timeout mechanisms, each operating at a different stage of the request lifecycle, and each interacts with retries differently.

The app-level request_timeout controls how long a single request can execute on a runner. If your endpoint handler exceeds this limit, the gateway kills the connection, terminates the runner, and retries the request (unless you set skip_retry_conditions = ["timeout"]).
```python
class MyApp(fal.App):
    request_timeout = 600  # 10 minutes per request
```
The app-level startup_timeout controls how long a new runner has to complete setup() and open its HTTP port. If setup takes longer, the runner is terminated and replaced. This is not a retry condition because the request has not started processing yet. The request stays in the queue and waits for a healthy runner.
```python
class MyApp(fal.App):
    startup_timeout = 600  # 10 minutes for setup
```
The caller-level start_timeout (sent as the X-Fal-Request-Timeout header) controls the total deadline for the request, including queue wait time, runner acquisition, and processing. If it is exceeded, fal returns a 504 with no retry, and the runner is not terminated.
```python
result = fal_client.subscribe(
    "fal-ai/nano-banana-2",
    arguments={"prompt": "a sunset"},
    start_timeout=30,
)
```
The client-level client_timeout (Python SDK only) is enforced entirely on the client side. The client stops polling and raises an exception locally. The request may still be processing on the server.
```python
result = fal_client.subscribe(
    "fal-ai/nano-banana-2",
    arguments={"prompt": "a sunset"},
    client_timeout=60,
)
```
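The client-side semantics can be sketched as a local polling loop with a deadline. This is illustrative; the real SDK handles polling internally, and the names below are assumptions:

```python
import time

def poll_with_client_timeout(check_status, client_timeout, interval=0.01):
    """Enforce a deadline purely on the client, like client_timeout.

    `check_status` returns a result when done, or None while still processing.
    If the deadline passes, we stop polling and raise locally; the server
    may keep processing the request regardless.
    """
    deadline = time.monotonic() + client_timeout
    while time.monotonic() < deadline:
        result = check_status()
        if result is not None:
            return result
        time.sleep(interval)
    raise TimeoutError(
        "client_timeout exceeded; request may still be processing server-side"
    )
```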
| Timeout | Set by | When it applies | Retries | Runner impact |
| --- | --- | --- | --- | --- |
| request_timeout | App developer | During request processing | Yes (condition: "timeout") | Terminated |
| startup_timeout | App developer | During runner startup / setup() | No (request stays queued) | Terminated and replaced |
| start_timeout / X-Fal-Request-Timeout | Caller (server-side) | Total lifecycle including queue | Never | Not affected |
| client_timeout | Caller (client-side) | Total time client waits | N/A (client stops polling) | Not affected |
See Scale Your Application for configuring request_timeout and startup_timeout. The caller-level parameters are documented on the Async Inference page.

Request Error Types

When a request fails, the response body includes a detail string and an error_type field identifying the failure category. The same value is available in the X-Fal-Error-Type response header.
```json
{
  "detail": "Request timed out",
  "error_type": "request_timeout"
}
```
Use error_type to build programmatic retry logic and monitor failure patterns. Runner and timeout errors are typically transient and worth retrying. Client errors (client_disconnected, bad_request) should not be retried.
| Error Type | Description | Typical Status Code |
| --- | --- | --- |
| request_timeout | The request exceeded the allowed processing time. | 504 |
| startup_timeout | The runner did not start within the allowed time. | 504 |
| runner_scheduling_failure | No runner could be allocated to handle the request. | 503 |
| runner_connection_timeout | The connection to the runner timed out. | 503 |
| runner_disconnected | The runner disconnected unexpectedly during processing. | 503 |
| runner_connection_refused | The runner refused the connection. | 503 |
| runner_connection_error | A general connection error occurred with the runner. | 503 |
| runner_incomplete_response | The runner sent an incomplete response payload. | 502 |
| runner_server_error | The runner encountered an internal server error. | 500 |
| client_disconnected | The client closed the connection before the response was sent. | 499 |
| client_cancelled | The request was cancelled by the client. | 499 |
| bad_request | The request was malformed (e.g., invalid timeout header). | 400 |
| internal_error | An unexpected internal error occurred. | 500 |
This error format is different from model validation errors, which return a detail array of typed error objects. Request errors return a flat object with detail as a string.
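A client-side classifier over error_type might look like the following sketch. Which types count as transient is an assumption drawn from the table above (internal_error is deliberately left out); tune the sets for your workload:

```python
# Transient failures typically worth retrying client-side (assumption
# based on the error-type table; adjust for your own monitoring).
TRANSIENT_ERROR_TYPES = {
    "request_timeout",
    "startup_timeout",
    "runner_scheduling_failure",
    "runner_connection_timeout",
    "runner_disconnected",
    "runner_connection_refused",
    "runner_connection_error",
    "runner_incomplete_response",
    "runner_server_error",
}

# Caller-side failures that should never be retried automatically.
CLIENT_ERROR_TYPES = {"client_disconnected", "client_cancelled", "bad_request"}


def should_retry(error_type: str) -> bool:
    """Decide whether a failed request is worth retrying, by error_type.

    error_type comes from the response body or the X-Fal-Error-Type header.
    Unknown types default to no retry.
    """
    if error_type in CLIENT_ERROR_TYPES:
        return False
    return error_type in TRANSIENT_ERROR_TYPES
```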