WebSocket-based inference for ultra-low-latency applications
Real-time inference uses WebSockets for persistent connections, enabling sub-100ms image generation. This is ideal for interactive applications such as real-time creativity tools and camera-based inputs.
Unlike queue-based inference, real-time connections bypass the queue entirely and route inputs directly to a runner. This eliminates queue wait time, and because the WebSocket connection is persistent, the runner stays warm for every message after the first. The initial connection may still incur a cold start if no runner is already available.
Only models that explicitly support real-time inference can be used with the realtime client. Standard queue-based models do not have a realtime endpoint.
WebSocket connections from browsers cannot safely embed API keys. There are two approaches for client-side authentication: a proxy URL or a token provider.
For more control, use a tokenProvider function that fetches short-lived JWT tokens from your backend. This is useful when you need per-user authentication or want to restrict which apps a token can access.
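The sketch below illustrates the token provider idea in Python: a callable that returns a short-lived token and only goes back to the backend when the cached one is close to expiring. The class name `CachedTokenProvider`, the `fetch_token` callable, and the `refresh_margin` parameter are all hypothetical names introduced for illustration; caching is an assumption layered on top of the description above, not part of the fal SDK.

```python
import time
from typing import Callable, Optional, Tuple


class CachedTokenProvider:
    """Caches a short-lived token and refetches it shortly before expiry."""

    def __init__(
        self,
        fetch_token: Callable[[], Tuple[str, float]],
        refresh_margin: float = 10.0,
        clock: Callable[[], float] = time.time,
    ) -> None:
        # fetch_token is a hypothetical callable that asks your backend for
        # a token and returns (token, expiry as a Unix timestamp).
        self._fetch_token = fetch_token
        self._refresh_margin = refresh_margin
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def __call__(self) -> str:
        # Refetch when there is no token yet or the cached one is about to expire.
        about_to_expire = self._clock() >= self._expires_at - self._refresh_margin
        if self._token is None or about_to_expire:
            self._token, self._expires_at = self._fetch_token()
        return self._token
```

Injecting the clock keeps the refresh logic testable without waiting for real tokens to expire.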
Protect your token endpoint with authentication. The endpoint that generates fal tokens should verify that the request comes from an authenticated user in your application. Without proper authentication, anyone could use your endpoint to generate tokens and consume your fal credits.
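As a rough illustration of that check, the sketch below refuses to mint a token unless the request carries a session that resolves to a known user. It is framework-agnostic and uses only the standard library. `handle_token_request`, `resolve_user`, and `mint_token` are hypothetical names: `resolve_user` stands in for your application's session lookup, and `mint_token` for whatever call your backend makes to obtain a fal token.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional


@dataclass
class TokenResponse:
    status: int
    body: dict


def handle_token_request(
    headers: Mapping[str, str],
    resolve_user: Callable[[str], Optional[str]],
    mint_token: Callable[[str], str],
) -> TokenResponse:
    """Return a token only for requests that carry a valid session."""
    credentials = headers.get("Authorization", "")
    if not credentials.startswith("Bearer "):
        return TokenResponse(401, {"error": "missing credentials"})

    # resolve_user (hypothetical) maps a session value to a user id, or None.
    user_id = resolve_user(credentials[len("Bearer "):])
    if user_id is None:
        return TokenResponse(401, {"error": "invalid session"})

    # Only authenticated users reach the token-minting step.
    return TokenResponse(200, {"token": mint_token(user_id)})
```

Because the minting call is reached only after the session check succeeds, unauthenticated requests cannot consume credits through this endpoint.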
Real-time WebSocket connections bypass the queue and connect directly to a runner. Several request parameters that work with queue-based inference do not apply:
| Parameter | Behavior with Real-Time |
| --- | --- |
| start_timeout | No effect. There is no queue wait |
| priority | No effect. No queue ordering |
| webhook_url | Not supported. Results stream back over the WebSocket |
| Automatic retries | Not available. Failed messages return errors on the connection |
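When code paths are shared between queue-based and real-time calls, it can help to drop the queue-only options explicitly rather than pass them along silently. The helper below is a minimal sketch under that assumption; `split_realtime_options` and `QUEUE_ONLY_OPTIONS` are hypothetical names, not part of any SDK, and the option keys are taken from the parameters listed above.

```python
from typing import Any, Dict, List, Tuple

# Queue-only options listed above, with the reason each one does not
# apply to a real-time connection.
QUEUE_ONLY_OPTIONS = {
    "start_timeout": "no effect: there is no queue wait",
    "priority": "no effect: there is no queue ordering",
    "webhook_url": "not supported: results stream back over the WebSocket",
}


def split_realtime_options(
    options: Dict[str, Any],
) -> Tuple[Dict[str, Any], List[str]]:
    """Separate options that still apply from queue-only ones."""
    kept = {k: v for k, v in options.items() if k not in QUEUE_ONLY_OPTIONS}
    warnings = [
        f"{k}: {QUEUE_ONLY_OPTIONS[k]}" for k in options if k in QUEUE_ONLY_OPTIONS
    ]
    return kept, warnings
```

The returned warnings can be logged so callers notice that a setting such as `priority` is being ignored.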
Both realtime and streaming give you faster feedback than polling, but they serve different use cases.
| Feature | Realtime (WebSocket) | Streaming (SSE) |
| --- | --- | --- |
| Direction | Bidirectional (client and server) | One-way (server to client) |
| Connection | Persistent, reusable | New connection per request |
| Latency | Lower (connection reuse) | Higher (new connection each time) |
| Best for | Interactive apps, back-to-back requests | Progressive output, previews |
| Protocol | Binary msgpack | JSON over SSE |
Use realtime when clients send multiple requests in quick succession over a persistent connection, like interactive image editing or camera-based inputs. Use streaming when you want to show progressive output from a single request, like image generation previews or LLM tokens.
The realtime client uses msgpack for binary serialization across all SDKs, which is more efficient than JSON for transmitting image data. In Python, realtime() and realtime_async() provide a RealtimeConnection with send() and recv() methods. In JavaScript, fal.realtime.connect() uses callback-based onResult and onError handlers.
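To see why a binary format helps with image data, the sketch below compares the raw size of a byte payload with the size of the same payload embedded in JSON. JSON cannot carry raw bytes, so they are typically base64-encoded first, which adds roughly a third to the size, whereas a binary format can carry the bytes as they are plus a small header. The example uses only the standard library and does not call msgpack or the fal SDK; `json_base64_size` is a hypothetical helper, and the `"image"` field name is an assumption.

```python
import base64
import json


def json_base64_size(data: bytes) -> int:
    """Size in bytes of a JSON message carrying `data` as base64 text."""
    encoded = base64.b64encode(data).decode("ascii")
    return len(json.dumps({"image": encoded}).encode("utf-8"))


# Deterministic stand-in for image bytes (25,600 bytes).
image_bytes = bytes(range(256)) * 100

raw_size = len(image_bytes)
json_size = json_base64_size(image_bytes)

# Ratio of the JSON message size to the raw payload size.
overhead_ratio = json_size / raw_size
print(f"raw: {raw_size} bytes, JSON+base64: {json_size} bytes, ratio: {overhead_ratio:.2f}")
```

For messages sent many times per second, that extra third is paid on every frame, which is the cost a binary encoding avoids.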