Real-time inference uses WebSockets for persistent connections, enabling sub-100ms image generation. This is ideal for interactive applications such as real-time creativity tools and camera-based inputs. Unlike queue-based inference, real-time connections bypass the queue entirely and route inputs directly to a runner. This eliminates queue wait time, and because the WebSocket stays open, the runner stays warm for every message after the initial connection. The first connection may still incur a cold start if no runner is already available.
Only models that expose an explicit real-time endpoint can be used with the realtime client; standard queue-based models do not have one.

Quick Start

import { fal } from "@fal-ai/client";

const connection = fal.realtime.connect("fal-ai/fast-lcm-diffusion", {
  onResult: (result) => {
    // Called for every result streamed back over the WebSocket
    console.log(result);
  },
  onError: (error) => {
    // Called when the connection or an individual message fails
    console.error(error);
  },
});

connection.send({
  prompt: "a sunset over mountains",
  sync_mode: true, // return the result as a base64 data URL instead of a hosted URL
  image_url: "data:image/png;base64,..."
});

Performance Tips

For the fastest inference:
  • Use 512x512 input dimensions; 768x768 and 1024x1024 also work well, but 512x512 is optimal
  • Provide images as base64-encoded data URLs
  • Set sync_mode: true to receive base64-encoded responses
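The data-URL tip above amounts to base64-encoding your image bytes and prefixing a MIME header. A minimal sketch (the helper name is illustrative, not part of the fal client; assumes a Node.js environment for `Buffer`):

```typescript
// Build a base64 data URL from raw image bytes, suitable for the
// image_url field when sync_mode is enabled.
function toDataUrl(bytes: Uint8Array, mime: string = "image/png"): string {
  const base64 = Buffer.from(bytes).toString("base64");
  return `data:${mime};base64,${base64}`;
}
```

In a browser you would typically get the same result from `canvas.toDataURL("image/png")` or a `FileReader` instead.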

Keeping API Keys Secure

WebSocket connections from browsers cannot safely embed API keys. There are two approaches for client-side authentication: a proxy URL or a token provider.

Proxy URL

The simplest approach is to point the client at a server-side proxy that adds your API key:
import { fal } from "@fal-ai/client";

fal.config({
  proxyUrl: "/api/fal/proxy",
});

const connection = fal.realtime.connect("fal-ai/fast-lcm-diffusion", {
  connectionKey: "realtime-demo", // reuse the same underlying connection across renders
  throttleInterval: 128, // send at most one message every 128ms
  onResult(result) {
    // handle result
  },
});
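To build intuition for what `throttleInterval` does, here is a leading-edge rate limiter sketch (an assumption about the mechanism, not the client's actual implementation; the real client may coalesce payloads rather than drop them). The injectable `now` clock exists only to make the sketch testable:

```typescript
// At most one send per intervalMs; calls inside the window are rejected.
function makeThrottle(intervalMs: number, now: () => number = Date.now) {
  let last = -Infinity;
  return (send: () => void): boolean => {
    const t = now();
    if (t - last < intervalMs) return false; // within the window: skipped
    last = t;
    send();
    return true;
  };
}
```

With a 128ms interval, rapid input events (pointer moves, camera frames) collapse to at most ~8 messages per second, which keeps the runner from queuing stale frames.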

Proxy Setup

Learn how to set up a server-side proxy

Token Provider

For more control, use a tokenProvider function that fetches short-lived JWT tokens from your backend. This is useful when you need per-user authentication or want to restrict which apps a token can access.
Protect your token endpoint with authentication. The endpoint that generates fal tokens should verify that the request comes from an authenticated user in your application. Without proper authentication, anyone could use your endpoint to generate tokens and consume your fal credits.
Client-side example:
import { fal, type TokenProvider } from "@fal-ai/client";

const myTokenProvider: TokenProvider = async (app) => {
  const response = await fetch(`/api/fal/token?app=${app}`);
  const { token } = await response.json();
  return token;
};

const connection = fal.realtime.connect("fal-ai/fast-lcm-diffusion", {
  tokenProvider: myTokenProvider,
  onResult: (result) => {
    console.log(result);
  },
});

connection.send({
  prompt: "a cat",
  sync_mode: true,
});
Next.js API Route example (app/api/fal/token/route.ts):
import { NextRequest, NextResponse } from "next/server";

export async function GET(request: NextRequest) {
  // IMPORTANT: Add your own authentication logic here
  // const session = await getServerSession();
  // if (!session) {
  //   return NextResponse.json({ error: "Unauthorized" }, { status: 401 });
  // }

  const { searchParams } = new URL(request.url);
  const app = searchParams.get("app");

  if (!app) {
    return NextResponse.json({ error: "Missing app parameter" }, { status: 400 });
  }

  const response = await fetch("https://rest.alpha.fal.ai/tokens/realtime", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Key ${process.env.FAL_KEY}`,
    },
    body: JSON.stringify({
      allowed_apps: [app],
      duration: 120,
    }),
  });

  const data = await response.json();
  return NextResponse.json({ token: data.token });
}
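Beyond authenticating the caller, you may also want to restrict which apps the route will mint tokens for, so a client cannot request a token scoped to an arbitrary model. A sketch of such a guard (the allow-list contents and helper name are illustrative):

```typescript
// Only issue tokens for apps this deployment actually uses.
const ALLOWED_APPS = new Set(["fal-ai/fast-lcm-diffusion"]);

function isAllowedApp(app: string | null): boolean {
  return app !== null && ALLOWED_APPS.has(app);
}
```

In the route above, this check would replace the bare `if (!app)` test, returning a 403 for apps outside the allow-list before calling the fal token endpoint.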
The tokenProvider also works for streaming with connectionMode: "client":
const stream = await fal.stream("fal-ai/flux/dev", {
  connectionMode: "client",
  tokenProvider: myTokenProvider,
  input: { prompt: "a cat" },
});

Differences from Queue-Based Inference

Real-time WebSocket connections bypass the queue and connect directly to a runner. Several request parameters that work with queue-based inference do not apply:
Parameter            Behavior with real-time
start_timeout        No effect; there is no queue wait.
priority             No effect; there is no queue ordering.
webhook_url          Not supported; results stream back over the WebSocket.
Automatic retries    Not available; failed messages return errors on the connection.
X-Fal-No-Retry       No effect; there is no retry mechanism to disable.

Realtime vs Streaming

Both realtime and streaming give you faster feedback than polling, but they serve different use cases.
Feature       Realtime (WebSocket)                       Streaming (SSE)
Direction     Bidirectional (client and server)          One-way (server to client)
Connection    Persistent, reusable                       New connection per request
Latency       Lower (connection reuse)                   Higher (new connection each time)
Best for      Interactive apps, back-to-back requests    Progressive output, previews
Protocol      Binary msgpack                             JSON over SSE
Use realtime when clients send multiple requests in quick succession over a persistent connection, like interactive image editing or camera-based inputs. Use streaming when you want to show progressive output from a single request, like image generation previews or LLM tokens.

Protocol Details

The realtime client uses msgpack for binary serialization across all SDKs, which is more efficient than JSON for transmitting image data. In Python, realtime() and realtime_async() provide a RealtimeConnection with send() and recv() methods. In JavaScript, fal.realtime.connect() uses callback-based onResult and onError handlers.
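The size advantage of binary serialization is easy to quantify: embedding raw bytes in JSON requires base64, which inflates the payload to 4/3 of its original size, while msgpack carries the bytes verbatim plus a small length header. A sketch of the base64 growth formula (the function name is illustrative):

```typescript
// Size in bytes of a base64 encoding of byteLen raw bytes
// (every 3 input bytes become 4 output characters, with padding).
function base64Size(byteLen: number): number {
  return Math.ceil(byteLen / 3) * 4;
}
```

For a 1 MB image, that is roughly 333 KB of pure encoding overhead per message, which matters when frames are sent many times per second over a realtime connection.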

Video Tutorial

Build a Real-Time AI Image App with WebSockets, Next.js, and fal.ai: