Real-time inference uses WebSockets for persistent connections, enabling sub-100ms image generation. This is ideal for interactive applications such as real-time creativity tools and camera-based inputs. Unlike queue-based inference, real-time connections bypass the queue entirely and route inputs directly to a runner. This eliminates queue wait time, and because the WebSocket stays open, the runner stays warm for every message after the initial connection. The first connection may still incur a cold start if no runner is already available.
Only models that expose an explicit real-time endpoint can be used with the realtime client; standard queue-based models do not have one.

Quick Start

import { fal } from "@fal-ai/client";

const connection = fal.realtime.connect("fal-ai/fast-lcm-diffusion", {
  onResult: (result) => {
    // Called for every result streamed back over the WebSocket
    console.log(result);
  },
  onError: (error) => {
    // Called when the connection or an individual message fails
    console.error(error);
  },
});

connection.send({
  prompt: "a sunset over mountains",
  sync_mode: true, // return the result as a base64 data URL instead of a hosted URL
  image_url: "data:image/png;base64,..."
});

Performance Tips

For the fastest inference:
  • Use 512x512 input dimensions; 768x768 and 1024x1024 also work well, but 512x512 is optimal
  • Provide images as base64-encoded data URLs
  • Set sync_mode: true to receive base64-encoded responses
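The data-URL tip above amounts to base64-encoding your image bytes and prefixing a MIME header. A minimal sketch (the helper name is illustrative, not part of the fal client; assumes a Node.js environment for `Buffer`):

```typescript
// Build a base64 data URL from raw image bytes, suitable for the
// image_url field when sync_mode is enabled.
function toDataUrl(bytes: Uint8Array, mime: string = "image/png"): string {
  const base64 = Buffer.from(bytes).toString("base64");
  return `data:${mime};base64,${base64}`;
}
```

In a browser you would typically get the same result from `canvas.toDataURL("image/png")` or a `FileReader` instead.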

Keeping API Keys Secure

WebSocket connections from browsers cannot safely embed API keys. There are two approaches for client-side authentication: a proxy URL or a token provider.

Proxy URL

The simplest approach is to point the client at a server-side proxy that adds your API key:
import { fal } from "@fal-ai/client";

fal.config({
  proxyUrl: "/api/fal/proxy",
});

const connection = fal.realtime.connect("fal-ai/fast-lcm-diffusion", {
  connectionKey: "realtime-demo", // reuse the same underlying connection across renders
  throttleInterval: 128, // send at most one message every 128ms
  onResult(result) {
    // handle result
  },
});
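To build intuition for what `throttleInterval` does, here is a leading-edge rate limiter sketch (an assumption about the mechanism, not the client's actual implementation; the real client may coalesce payloads rather than drop them). The injectable `now` clock exists only to make the sketch testable:

```typescript
// At most one send per intervalMs; calls inside the window are rejected.
function makeThrottle(intervalMs: number, now: () => number = Date.now) {
  let last = -Infinity;
  return (send: () => void): boolean => {
    const t = now();
    if (t - last < intervalMs) return false; // within the window: skipped
    last = t;
    send();
    return true;
  };
}
```

With a 128ms interval, rapid input events (pointer moves, camera frames) collapse to at most ~8 messages per second, which keeps the runner from queuing stale frames.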

Proxy Setup

Learn how to set up a server-side proxy

Token Provider

For more control, use a tokenProvider function that fetches short-lived JWT tokens from your backend. This is useful when you need per-user authentication or want to restrict which apps a token can access.
Protect your token endpoint with authentication. The endpoint that generates fal tokens should verify that the request comes from an authenticated user in your application. Without proper authentication, anyone could use your endpoint to generate tokens and consume your fal credits.
Client-side example:
import { fal, type TokenProvider } from "@fal-ai/client";

const myTokenProvider: TokenProvider = async (app) => {
  const response = await fetch(`/api/fal/token?app=${app}`);
  const { token } = await response.json();
  return token;
};

const connection = fal.realtime.connect("fal-ai/fast-lcm-diffusion", {
  tokenProvider: myTokenProvider,
  onResult: (result) => {
    console.log(result);
  },
});

connection.send({
  prompt: "a cat",
  sync_mode: true,
});
Next.js API Route example (app/api/fal/token/route.ts):
import { NextRequest, NextResponse } from "next/server";

export async function GET(request: NextRequest) {
  // IMPORTANT: Add your own authentication logic here
  // const session = await getServerSession();
  // if (!session) {
  //   return NextResponse.json({ error: "Unauthorized" }, { status: 401 });
  // }

  const { searchParams } = new URL(request.url);
  const app = searchParams.get("app");

  if (!app) {
    return NextResponse.json({ error: "Missing app parameter" }, { status: 400 });
  }

  const response = await fetch("https://rest.alpha.fal.ai/tokens/realtime", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Key ${process.env.FAL_KEY}`,
    },
    body: JSON.stringify({
      allowed_apps: [app],
      duration: 120,
    }),
  });

  const data = await response.json();
  return NextResponse.json({ token: data.token });
}
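Beyond authenticating the caller, you may also want to restrict which apps the route will mint tokens for, so a client cannot request a token scoped to an arbitrary model. A sketch of such a guard (the allow-list contents and helper name are illustrative):

```typescript
// Only issue tokens for apps this deployment actually uses.
const ALLOWED_APPS = new Set(["fal-ai/fast-lcm-diffusion"]);

function isAllowedApp(app: string | null): boolean {
  return app !== null && ALLOWED_APPS.has(app);
}
```

In the route above, this check would replace the bare `if (!app)` test, returning a 403 for apps outside the allow-list before calling the fal token endpoint.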
The tokenProvider also works for streaming with connectionMode: "client":
const stream = await fal.stream("fal-ai/flux/dev", {
  connectionMode: "client",
  tokenProvider: myTokenProvider,
  input: { prompt: "a cat" },
});

Differences from Queue-Based Inference

Real-time WebSocket connections bypass the queue and connect directly to a runner. Several request parameters that work with queue-based inference do not apply:
Parameter            Behavior with real-time
start_timeout        No effect; there is no queue wait.
priority             No effect; there is no queue ordering.
webhook_url          Not supported; results stream back over the WebSocket.
Automatic retries    Not available; failed messages return errors on the connection.
X-Fal-No-Retry       No effect; there is no retry mechanism to disable.

Realtime vs Streaming

Both realtime and streaming give you faster feedback than polling, but they serve different use cases.
Feature       Realtime (WebSocket)                       Streaming (SSE)
Direction     Bidirectional (client and server)          One-way (server to client)
Connection    Persistent, reusable                       New connection per request
Latency       Lower (connection reuse)                   Higher (new connection each time)
Best for      Interactive apps, back-to-back requests    Progressive output, previews
Protocol      Binary msgpack                             JSON over SSE
Use realtime when clients send multiple requests in quick succession over a persistent connection, like interactive image editing or camera-based inputs. Use streaming when you want to show progressive output from a single request, like image generation previews or LLM tokens.

Protocol Details

The realtime client uses msgpack for binary serialization across all SDKs, which is more efficient than JSON for transmitting image data. In Python, realtime() and realtime_async() provide a RealtimeConnection with send() and recv() methods. In JavaScript, fal.realtime.connect() uses callback-based onResult and onError handlers.
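The size advantage of binary serialization is easy to quantify: embedding raw bytes in JSON requires base64, which inflates the payload to 4/3 of its original size, while msgpack carries the bytes verbatim plus a small length header. A sketch of the base64 growth formula (the function name is illustrative):

```typescript
// Size in bytes of a base64 encoding of byteLen raw bytes
// (every 3 input bytes become 4 output characters, with padding).
function base64Size(byteLen: number): number {
  return Math.ceil(byteLen / 3) * 4;
}
```

For a 1 MB image, that is roughly 333 KB of pure encoding overhead per message, which matters when frames are sent many times per second over a realtime connection.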

Video Tutorial

Build a Real-Time AI Image App with WebSockets, Next.js, and fal.ai: