Understanding Requests

When you call a model or your own deployed app on fal, the request passes through several layers before your code processes it and returns a result. Understanding these layers helps you reason about latency, retries, timeouts, and why requests behave the way they do. This page covers the request lifecycle from the caller’s perspective down to the runner, and how requests interact with runners and the retry system.

fal’s infrastructure is built around a persistent queue that decouples callers from runners. When you use queue-based methods (submit(), subscribe()), your request enters the queue and benefits from automatic retries, status tracking, and durability. Direct methods (run(), stream()) bypass the queue and connect straight to a runner, which is faster but means no retries and no status polling. Both paths share the same runner and scaling infrastructure underneath.

Request Lifecycle (Queue-Based)

When using submit() or subscribe(), a request moves through three states visible to callers via the queue status API. Direct calls via run() and stream() bypass the queue entirely and do not have these states.

| State | What is happening | Caller sees |
| --- | --- | --- |
| IN_QUEUE | Request is waiting in the queue for an available runner | `queue_position` indicating how many requests are ahead |
| IN_PROGRESS | A runner is executing your endpoint handler | `logs` from your code (when enabled) |
| COMPLETED | Result is stored and ready for retrieval | Full response payload |
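The three states above can be pictured as a simple polling loop. The following is a minimal sketch, not fal's client code: `poll_until_completed` and `make_fake_status_source` are hypothetical helpers, and the status payloads are hard-coded stand-ins for what the queue status API would return.

```python
from enum import Enum
from typing import Callable, Dict, List


class QueueStatus(str, Enum):
    """The three caller-visible states from the table above."""
    IN_QUEUE = "IN_QUEUE"
    IN_PROGRESS = "IN_PROGRESS"
    COMPLETED = "COMPLETED"


def poll_until_completed(
    fetch_status: Callable[[], Dict], max_polls: int = 100
) -> List[QueueStatus]:
    """Call fetch_status() until it reports COMPLETED; return every state seen.

    fetch_status is a hypothetical callable standing in for one status request.
    """
    seen: List[QueueStatus] = []
    for _ in range(max_polls):
        status = QueueStatus(fetch_status()["status"])
        seen.append(status)
        if status is QueueStatus.COMPLETED:
            return seen
    raise TimeoutError(f"request not completed after {max_polls} polls")


def make_fake_status_source() -> Callable[[], Dict]:
    """Return a fake status source that replays a fixed sequence (illustrative only)."""
    responses = iter([
        {"status": "IN_QUEUE", "queue_position": 2},
        {"status": "IN_QUEUE", "queue_position": 0},
        {"status": "IN_PROGRESS", "logs": []},
        {"status": "COMPLETED"},
    ])
    return lambda: next(responses)
```

A real caller would sleep between polls or use subscribe() instead of looping by hand; the sketch omits the delay so it runs instantly.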

Cancellation is handled separately from the status lifecycle. When a caller cancels a request, queued requests are removed from the queue and in-progress requests receive a cancellation signal. The cancel API returns CANCELLATION_REQUESTED (202) or ALREADY_COMPLETED (400) rather than transitioning to a pollable status.
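To make the cancellation rules concrete, here is a small in-memory model of them. It is only a sketch under stated assumptions: `FakeQueueState` and its fields are hypothetical names, and the real service's behavior for unknown request IDs is not described on this page, so the sketch simply raises `KeyError`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class FakeQueueState:
    """Hypothetical in-memory stand-in for the queue, for illustration only."""
    queued: List[str] = field(default_factory=list)
    # request_id -> True once a cancellation signal has been sent
    in_progress: Dict[str, bool] = field(default_factory=dict)
    completed: Set[str] = field(default_factory=set)

    def cancel(self, request_id: str) -> Tuple[int, str]:
        """Return (HTTP status, body status) as described in the text above."""
        if request_id in self.completed:
            return 400, "ALREADY_COMPLETED"
        if request_id in self.queued:
            # Queued requests are removed from the queue.
            self.queued.remove(request_id)
            return 202, "CANCELLATION_REQUESTED"
        if request_id in self.in_progress:
            # In-progress requests receive a cancellation signal.
            self.in_progress[request_id] = True
            return 202, "CANCELLATION_REQUESTED"
        # Assumption: unknown IDs are outside what this page documents.
        raise KeyError(request_id)
```

Note that neither branch produces a new pollable status; the caller only learns whether the cancellation was accepted.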

How Requests Flow

Submission

The caller submits a request via the SDK or REST API. fal assigns a request_id and places the request in the persistent queue. The request enters IN_QUEUE state. By default there is no queue size limit and requests are never dropped. Callers can optionally set fal_max_queue_length to reject requests with 429 if the queue exceeds a threshold.
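The submission step can be sketched as an unbounded queue with an optional per-request cap. This is an illustrative model, not fal's implementation: `PersistentQueueSketch` and `QueueFull` are hypothetical names, the use of UUIDs for `request_id` is an assumption, and so is the exact boundary at which the cap triggers (the sketch rejects once the queue already holds `max_queue_length` requests).

```python
import uuid
from collections import deque
from typing import Any, Deque, Optional, Tuple


class QueueFull(Exception):
    """Raised by the sketch where the real service would answer with HTTP 429."""
    status_code = 429


class PersistentQueueSketch:
    """Hypothetical in-memory model of the submission step."""

    def __init__(self) -> None:
        self.pending: Deque[Tuple[str, Any]] = deque()

    def submit(self, payload: Any, max_queue_length: Optional[int] = None) -> str:
        # With no cap (the default), requests are always accepted.
        # Assumption: the cap compares against the current queue length with >=.
        if max_queue_length is not None and len(self.pending) >= max_queue_length:
            raise QueueFull("queue length threshold reached")
        request_id = str(uuid.uuid4())  # assumption: ID format is illustrative
        self.pending.append((request_id, payload))  # request is now IN_QUEUE
        return request_id
```

Leaving `max_queue_length` unset mirrors the default described above, where nothing is dropped no matter how long the queue grows.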

Dispatch

The dispatcher checks for available IDLE runners. If a runner is free, the request is routed immediately and enters IN_PROGRESS. If all runners are busy, the request waits in the queue while fal scales up new runners. Runners with matching routing hints are preferred when available.
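The dispatch decision described above can be sketched as a small selection function. This is a simplified model under assumptions: `Runner`, `pick_runner`, and the `hint` field are hypothetical, and the real dispatcher's tie-breaking among equally suitable runners is not documented here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Runner:
    """Hypothetical runner record; only IDLE and RUNNING are modeled."""
    runner_id: str
    state: str  # "IDLE" or "RUNNING"
    hint: Optional[str] = None  # routing hint this runner matches, if any


def pick_runner(runners: List[Runner], request_hint: Optional[str] = None) -> Optional[Runner]:
    """Return an IDLE runner, preferring one whose hint matches the request.

    Returns None when every runner is busy, meaning the request stays
    IN_QUEUE while new runners are scaled up.
    """
    idle = [r for r in runners if r.state == "IDLE"]
    if not idle:
        return None
    if request_hint is not None:
        for runner in idle:
            if runner.hint == request_hint:
                return runner
    # No matching hint: fall back to the first idle runner (assumed ordering).
    return idle[0]
```

A hint is a preference rather than a requirement in this sketch: a request with a hint still gets dispatched to a non-matching idle runner instead of waiting.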

Processing

The runner receives the request as a standard HTTP call. Your endpoint handler runs, processes the input, and returns a response. The runner is RUNNING while it handles the request and transitions back to IDLE when the handler returns. If the runner fails while processing a queue-based request, the request is retried automatically.

Result

The response is stored and the request enters COMPLETED. The caller retrieves the result by polling or streaming status, or receives it via webhook. For direct run() calls, the response is returned in the same HTTP connection.

Your endpoint code receives every request as a regular HTTP call. It does not matter whether the caller used run(), submit(), or stream(). The queue and dispatch layer are transparent to your app code.

Requests and Retries

Retries only apply to queue-based requests. Direct calls via run() and stream() return errors immediately with no retry. When a runner fails while processing a queued request, the request is placed in a scheduled requeue with a backoff delay, then re-enters the queue and is dispatched to a different runner. This happens automatically for server errors (503), timeouts (504), and connection failures, up to 10 attempts. The retry is transparent to the caller: they continue polling the same request_id and eventually get a result or a final failure.
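The requeue behavior can be sketched as a bounded retry loop. This is a model for reasoning, not fal's scheduler: `run_with_requeue` and `attempt_fn` are hypothetical, the exponential backoff formula and `base_delay` are assumptions (the page only says a backoff delay exists), and "up to 10 attempts" is read here as 10 attempts in total.

```python
from typing import Callable, List, Optional, Tuple

RETRYABLE_STATUS = {503, 504}  # server errors and timeouts, per the text above
MAX_ATTEMPTS = 10              # assumption: counts total attempts, not extra retries


def run_with_requeue(
    attempt_fn: Callable[[int], int],
    max_attempts: int = MAX_ATTEMPTS,
    base_delay: float = 0.5,
) -> Tuple[Optional[int], int, List[float]]:
    """Simulate requeueing; return (final status, attempts used, backoff delays).

    attempt_fn(attempt_number) returns an HTTP status or raises ConnectionError.
    Delays are computed but not slept, so the sketch runs instantly.
    """
    delays: List[float] = []
    status: Optional[int] = None
    for attempt in range(1, max_attempts + 1):
        try:
            status = attempt_fn(attempt)
        except ConnectionError:
            status = None  # connection failures are retried like 503/504
        if status is not None and status not in RETRYABLE_STATUS:
            return status, attempt, delays
        if attempt < max_attempts:
            # Assumed exponential backoff before the request re-enters the queue.
            delays.append(base_delay * 2 ** (attempt - 1))
    return status, max_attempts, delays
```

From the caller's side none of this is visible: the same request_id is polled throughout, and only the final status surfaces.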

The start_timeout clock runs continuously across all retry attempts. If you set start_timeout=30 and the first attempt fails after 20 seconds, the second attempt only has 10 seconds left before the server returns 504. This prevents retries from running indefinitely. You can control retry behavior at three levels: app-level with skip_retry_conditions, per-response with the X-Fal-Needs-Retry header, and per-request with the X-Fal-No-Retry header from the caller.
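The start_timeout arithmetic above amounts to a single shared budget. A minimal sketch, assuming the budget is measured in seconds and that all time since submission counts against it; `remaining_start_budget` is a hypothetical helper, not part of any fal SDK.

```python
def remaining_start_budget(start_timeout: float, elapsed_so_far: float) -> float:
    """Seconds left before the server would return 504.

    elapsed_so_far is the total time since the request was submitted,
    summed across every attempt so far.
    """
    return max(0.0, start_timeout - elapsed_so_far)


# The example from the text: start_timeout=30, first attempt fails after 20 seconds.
second_attempt_budget = remaining_start_budget(30, 20)
```

Here `second_attempt_budget` is 10, matching the example: the second attempt inherits whatever the first one left over rather than getting a fresh 30 seconds.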

Understanding Runners

Runner lifecycle states, startup, shutdown, and scaling

Retries and Error Handling

Status codes, response headers, timeouts, and retry control

Caching

How cold start caching affects request latency

Handle Cancellations

Implement cancel endpoints for in-progress requests
