When a request fails while being processed through the queue, fal automatically retries it on a new runner. This covers infrastructure-level failures like runner crashes, network issues, and timeouts. Retries happen transparently, up to 10 attempts, before the request is marked as failed. Direct calls via run() and stream() (without the queue) are never retried.

The status code your endpoint returns determines whether the runner stays alive, whether the request is retried, and how the platform reacts. Returning the wrong code can kill a healthy runner or prevent a retry that should happen.

When a request does fail, the response includes an error_type field and an X-Fal-Error-Type header with a machine-readable category (e.g. request_timeout, runner_disconnected) that you can use for programmatic retry logic and monitoring. See Request Error Types for the full reference.

This page is the complete reference for understanding failures on fal: what triggers retries, what each status code does, how timeouts interact, and how to override default behavior with response headers. For a broader view of how retries fit into the request lifecycle, see Understanding Requests.

Status Code Reference

The status code your endpoint returns determines what happens to the runner and whether queue-based requests are retried.
| Status Code | Runner Impact | Retried (queue only) |
| --- | --- | --- |
| 2XX | Healthy | N/A |
| 4XX | Healthy | No |
| 500 | TCP health check triggered | No |
| 502 | TCP health check triggered | No |
| 503 | Immediately terminated | Yes |
| 504 | TCP health check triggered | Yes |
2XX and 4XX: The runner remains healthy and continues serving requests. 4XX responses are treated as client errors and are never retried.

500 and 502: The platform runs a TCP health check on the runner. If the check passes, the runner stays alive and continues serving requests. If it fails, the runner is terminated and replaced. The request is not automatically retried.

503: The runner is immediately terminated after a single 503 response. Queue-based requests are automatically retried on a new runner (up to 10 times). Use this only when the runner is genuinely broken (e.g., GPU OOM, corrupted state).

504: The platform runs a TCP health check and automatically requeues the request for retry. The runner is not immediately terminated, but it may be replaced if the health check fails.
Never return 503 for normal application errors. A single 503 immediately kills your runner. Use 500 for application-level errors where the runner is still functional.
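The table above can be expressed as a small lookup, which is handy for tests or client-side monitoring. This is a reading aid that mirrors the documented semantics, not a fal API:

```python
def queue_retry_expected(status: int) -> bool:
    """True if fal retries a queue-based request that ended with `status`."""
    if status in (503, 504):
        return True   # 503: runner terminated; 504: health check + requeue
    return False      # 2XX, 4XX, 500, 502: never auto-retried


def runner_impact(status: int) -> str:
    """Runner-side effect of returning `status`, per the table above."""
    if 200 <= status < 300 or 400 <= status < 500:
        return "healthy"
    if status == 503:
        return "terminated"
    if status in (500, 502, 504):
        return "tcp_health_check"
    return "unspecified"
```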

Which Status Code to Use

| Situation | Recommended Code | Why |
| --- | --- | --- |
| Bad user input, validation failure | 422 or 400 | Client error, runner stays healthy |
| Model inference failed but runner is fine | 500 | Health check runs, runner likely survives |
| External API or dependency timed out | 504 | Request retried, runner not killed |
| GPU OOM, corrupted model state, runner broken | 503 | Runner terminated and replaced |
| Rate limiting the caller | 429 | Client error, runner stays healthy, no retry |
```python
import fal
from fastapi.responses import JSONResponse

class MyApp(fal.App):
    @fal.endpoint("/")
    def predict(self, input: dict) -> dict:
        try:
            result = self.model.run(input)
            return result
        except ValueError as e:
            # Client error: runner stays healthy, no retry
            return JSONResponse(
                status_code=422,
                content={"detail": str(e)},
            )
        except RuntimeError as e:
            if "out of memory" in str(e).lower():
                # Runner is genuinely broken: terminate it, retry elsewhere
                return JSONResponse(
                    status_code=503,
                    content={"detail": "GPU out of memory"},
                )
            # Application error: health check runs, runner likely survives
            return JSONResponse(
                status_code=500,
                content={"detail": "Inference failed"},
            )
```

Connection Errors and Timeouts

Beyond status codes, two additional scenarios affect runner lifecycle:
| Scenario | What happens | Queue requests | Direct requests |
| --- | --- | --- | --- |
| App crashes (connection breaks) | Runner terminated | Retried on new runner | Returns 503 |
| Request timeout exceeded | Runner terminated | Retried on new runner | Returns 504 |
In both cases the runner is shut down because it may be in a faulty state. The platform spins up a replacement.
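Because direct calls are never retried by fal, callers that need resilience must retry themselves. A minimal sketch, assuming a zero-argument `call` wrapper and a hypothetical DirectCallError that carries the HTTP status (a real client would translate fal_client errors into this shape):

```python
import time

# Failures fal surfaces for direct calls that are worth retrying client-side.
RETRYABLE_STATUSES = {503, 504}


class DirectCallError(Exception):
    """Illustrative error carrying the HTTP status of a failed direct call."""
    def __init__(self, status: int):
        self.status = status
        super().__init__(f"direct call failed with status {status}")


def call_with_retry(call, max_attempts=3, base_delay=0.5):
    """Retry a direct (non-queue) call on 503/504, since fal will not."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except DirectCallError as e:
            if e.status not in RETRYABLE_STATUSES or attempt == max_attempts:
                raise  # non-retryable status, or out of attempts
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
```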

When Retries Happen

fal retries queue-based requests under three conditions. Each corresponds to a value you can use in skip_retry_conditions to disable it.
| Condition | Value | What triggers it | Runner impact |
| --- | --- | --- | --- |
| Server error | "server_error" | Runner returned HTTP 503, runner disconnected, runner sent an incomplete response, or runner returned HTTP 504 | 503: runner terminated. 504: health check triggered. |
| Timeout | "timeout" | Request exceeded the app's request_timeout and the gateway killed the connection | Runner terminated |
| Connection error | "connection_error" | The HTTP session between the gateway and the runner was unexpectedly closed | Runner terminated |
Each condition maps to a different failure mode. Server errors indicate the runner is in a bad state. Timeouts indicate the request took too long. Connection errors indicate a network-level failure between the gateway and the runner.

Controlling Retry Behavior

App-Level: skip_retry_conditions

Configure your app to skip retries for specific conditions. Pass one or more of the condition values from the table above.
```python
class MyApp(fal.App):
    skip_retry_conditions = ["timeout"]
```
This is useful when your model has long-running requests that exceed request_timeout for legitimate reasons. Without this setting, fal would retry the request on a new runner, which wastes compute and delays the final failure response. You can combine multiple conditions:
```python
class MyApp(fal.App):
    skip_retry_conditions = ["timeout", "server_error"]
```

Per-Response: X-Fal-Needs-Retry

Override the default retry behavior on a per-response basis by returning the X-Fal-Needs-Retry header from your endpoint. This takes precedence over both the status-code-based retry logic and skip_retry_conditions.
| Header Value | Behavior |
| --- | --- |
| 1 | Force a retry, even if the status code would not normally trigger one |
| 0 | Prevent a retry, even if the status code would normally trigger one |
```python
import fal
from fastapi.responses import JSONResponse

class MyApp(fal.App):
    @fal.endpoint("/")
    def run(self, input: Input) -> Output:
        try:
            result = self.model.run(input)
            return result
        except TransientError:
            # A 500 alone would not be retried; the header forces a retry
            return JSONResponse(
                status_code=500,
                headers={"X-Fal-Needs-Retry": "1"},
                content={"detail": "Transient error, please retry"},
            )
        except NonRetryableError:
            # A 503 alone would be retried; the header suppresses the retry
            return JSONResponse(
                status_code=503,
                headers={"X-Fal-Needs-Retry": "0"},
                content={"detail": "Non-retryable error"},
            )
```

Per-Response: x-fal-stop-runner

Control whether the runner is terminated after a response, independent of the status code. This header is stripped from the response before it reaches the caller.
| Header Value | Behavior |
| --- | --- |
| 1 / true | Force runner termination (same effect as a 503, but works with any status code) |
| 0 / false | Prevent runner termination (allows returning 503 for retry without killing the runner) |
Use this when you want to decouple retry behavior from runner termination. For example, you might want to trigger a retry (X-Fal-Needs-Retry: 1) but keep the runner alive (x-fal-stop-runner: false), or terminate a runner (x-fal-stop-runner: true) without triggering a retry (X-Fal-Needs-Retry: 0).
```python
import fal
from fastapi.responses import JSONResponse

@fal.endpoint("/")
def predict(self, input: dict) -> dict:
    try:
        return self.model.run(input)
    except CorruptedStateError:
        # Retry on a fresh runner and terminate this one
        return JSONResponse(
            status_code=500,
            headers={
                "x-fal-stop-runner": "true",
                "X-Fal-Needs-Retry": "1",
            },
            content={"detail": "Runner state corrupted, retrying on fresh runner"},
        )
```
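Because the two headers vary independently, the four combinations can be captured in a small helper. This is illustrative only and not part of the fal SDK:

```python
def failure_headers(retry: bool, stop_runner: bool) -> dict:
    """Build response headers that decouple retry from runner termination.

    Illustrative helper mirroring the header semantics documented above.
    """
    return {
        "X-Fal-Needs-Retry": "1" if retry else "0",
        "x-fal-stop-runner": "true" if stop_runner else "false",
    }

# e.g. trigger a retry elsewhere but keep this runner serving:
#   failure_headers(retry=True, stop_runner=False)
```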

Per-Request: Client-Side Control

When calling your app (or any model) from client code, you can control retry behavior per-request using headers. Pass the x-fal-no-retry header to prevent fal from retrying a specific request:
```python
import fal_client

result = fal_client.subscribe(
    "your-username/your-app-name",
    arguments={"prompt": "a sunset"},
    headers={"x-fal-no-retry": "1"},
)
```
For supported models, fal may route failed requests to equivalent fallback endpoints. To disable this per-request, pass x-app-fal-disable-fallback:
```python
result = fal_client.subscribe(
    "your-username/your-app-name",
    arguments={"prompt": "a sunset"},
    headers={"x-app-fal-disable-fallback": "1"},
)
```

Timeouts and Retries

fal has four timeout mechanisms, each operating at a different stage of the request lifecycle, and each interacts with retries differently.

The app-level request_timeout controls how long a single request can execute on a runner. If your endpoint handler exceeds this limit, the gateway kills the connection, terminates the runner, and retries the request (unless you set skip_retry_conditions = ["timeout"]).
```python
class MyApp(fal.App):
    request_timeout = 600  # 10 minutes per request
```
The app-level startup_timeout controls how long a new runner has to complete setup() and open its HTTP port. If setup takes longer, the runner is terminated and replaced. This is not a retry condition because the request has not started processing yet. The request stays in the queue and waits for a healthy runner.
```python
class MyApp(fal.App):
    startup_timeout = 600  # 10 minutes for setup
```
The caller-level start_timeout (sent as the X-Fal-Request-Timeout header) controls the total deadline for the request, including queue wait time, runner acquisition, and processing. If it is exceeded, fal returns a 504 with no retry, and the runner is not terminated.
```python
result = fal_client.subscribe(
    "fal-ai/nano-banana-2",
    arguments={"prompt": "a sunset"},
    start_timeout=30,
)
```
The client-level client_timeout (Python SDK only) is enforced entirely on the client side. The client stops polling and raises an exception locally. The request may still be processing on the server.
```python
result = fal_client.subscribe(
    "fal-ai/nano-banana-2",
    arguments={"prompt": "a sunset"},
    client_timeout=60,
)
```
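The client-side semantics can be sketched as a local polling loop with a deadline. This is illustrative; the real SDK handles polling internally, and the names below are assumptions:

```python
import time

def poll_with_client_timeout(check_status, client_timeout, interval=0.01):
    """Enforce a deadline purely on the client, like client_timeout.

    `check_status` returns a result when done, or None while still processing.
    If the deadline passes, we stop polling and raise locally; the server
    may keep processing the request regardless.
    """
    deadline = time.monotonic() + client_timeout
    while time.monotonic() < deadline:
        result = check_status()
        if result is not None:
            return result
        time.sleep(interval)
    raise TimeoutError(
        "client_timeout exceeded; request may still be processing server-side"
    )
```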
| Timeout | Set by | When it applies | Retries | Runner impact |
| --- | --- | --- | --- | --- |
| request_timeout | App developer | During request processing | Yes (condition: "timeout") | Terminated |
| startup_timeout | App developer | During runner startup / setup() | No (request stays queued) | Terminated and replaced |
| start_timeout / X-Fal-Request-Timeout | Caller (server-side) | Total lifecycle including queue | Never | Not affected |
| client_timeout | Caller (client-side) | Total time client waits | N/A (client stops polling) | Not affected |
See Scale Your Application for configuring request_timeout and startup_timeout. The caller-level parameters are documented on the Async Inference page.

Request Error Types

When a request fails, the response body includes a detail string and an error_type field identifying the failure category. The same value is available in the X-Fal-Error-Type response header.
```json
{
  "detail": "Request timed out",
  "error_type": "request_timeout"
}
```
Use error_type to build programmatic retry logic and monitor failure patterns. Runner and timeout errors are typically transient and worth retrying. Client errors (client_disconnected, bad_request) should not be retried.
| Error Type | Description | Typical Status Code |
| --- | --- | --- |
| request_timeout | The request exceeded the allowed processing time. | 504 |
| startup_timeout | The runner did not start within the allowed time. | 504 |
| runner_scheduling_failure | No runner could be allocated to handle the request. | 503 |
| runner_connection_timeout | The connection to the runner timed out. | 503 |
| runner_disconnected | The runner disconnected unexpectedly during processing. | 503 |
| runner_connection_refused | The runner refused the connection. | 503 |
| runner_connection_error | A general connection error occurred with the runner. | 503 |
| runner_incomplete_response | The runner sent an incomplete response payload. | 502 |
| runner_server_error | The runner encountered an internal server error. | 500 |
| client_disconnected | The client closed the connection before the response was sent. | 499 |
| client_cancelled | The request was cancelled by the client. | 499 |
| bad_request | The request was malformed (e.g., invalid timeout header). | 400 |
| internal_error | An unexpected internal error occurred. | 500 |
This error format is different from model validation errors, which return a detail array of typed error objects. Request errors return a flat object with detail as a string.
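A client-side classifier over error_type might look like the following sketch. Which types count as transient is an assumption drawn from the table above (internal_error is deliberately left out); tune the sets for your workload:

```python
# Transient failures typically worth retrying client-side (assumption
# based on the error-type table; adjust for your own monitoring).
TRANSIENT_ERROR_TYPES = {
    "request_timeout",
    "startup_timeout",
    "runner_scheduling_failure",
    "runner_connection_timeout",
    "runner_disconnected",
    "runner_connection_refused",
    "runner_connection_error",
    "runner_incomplete_response",
    "runner_server_error",
}

# Caller-side failures that should never be retried automatically.
CLIENT_ERROR_TYPES = {"client_disconnected", "client_cancelled", "bad_request"}


def should_retry(error_type: str) -> bool:
    """Decide whether a failed request is worth retrying, by error_type.

    error_type comes from the response body or the X-Fal-Error-Type header.
    Unknown types default to no retry.
    """
    if error_type in CLIENT_ERROR_TYPES:
        return False
    return error_type in TRANSIENT_ERROR_TYPES
```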