In this example, we will create a chatbot app using fal-serverless, Streamlit, and FlexGen. The chatbot will respond to user input by generating text using a pre-trained large language model that is handled by the FlexGen library. This example is adapted from the FlexGen chatbot repository.

Step 1: Install dependencies and authenticate

We only need three packages for this app: fal-serverless, streamlit and streamlit-chat.

pip install fal-serverless streamlit streamlit-chat
fal-serverless auth login

See the fal-serverless authentication documentation for more information.
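
Before moving on, it can help to confirm that the three packages are importable from the Python environment you will run the app in. The sketch below is only an illustration: `missing_modules` is a hypothetical helper, and the module names are the import names used later in this example (with underscores, unlike the pip package names).

```python
from importlib.util import find_spec

# Import names used later in this example (underscores, unlike the pip names)
REQUIRED_MODULES = ["fal_serverless", "streamlit", "streamlit_chat"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

If anything is reported as missing, rerun the pip command above in the same environment.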

Step 2: Define isolated function

Next, in a Python file, we define the isolated function that generates the chatbot's responses. Here's the isolated function definition:

from fal_serverless import isolated

# The decorator arguments are assumptions (they are not given in this text):
# adjust the package list and machine type to your fal-serverless setup.
@isolated(requirements=["flexgen", "transformers", "torch"], machine_type="GPU")
def run(context):
    import argparse
    import os

    # Set the cache location before importing transformers, which reads it at import time
    os.environ['TRANSFORMERS_CACHE'] = '/data/hfcache'

    from transformers import AutoTokenizer
    from flexgen.flex_opt import (Policy, OptLM, TorchDevice, TorchDisk, TorchMixedDevice,
                                  CompressionConfig, Env, get_opt_config, add_parser_arguments)

    def run_chat(args, context):
        # Initialize environment
        gpu = TorchDevice("cuda:0")
        cpu = TorchDevice("cpu")
        disk = TorchDisk(args.offload_dir)
        env = Env(gpu=gpu, cpu=cpu, disk=disk, mixed=TorchMixedDevice([gpu, cpu, disk]))

        # Offloading policy
        policy = Policy(1, 1,
                        args.percent[0], args.percent[1],
                        args.percent[2], args.percent[3],
                        args.percent[4], args.percent[5],
                        overlap=True, sep_layer=True, pin_weight=args.pin_weight,
                        cpu_cache_compute=False, attn_sparsity=1.0,
                        compress_weight=args.compress_weight,
                        comp_weight_config=CompressionConfig(
                            num_bits=4, group_size=64,
                            group_dim=0, symmetric=False),
                        compress_cache=args.compress_cache,
                        comp_cache_config=CompressionConfig(
                            num_bits=4, group_size=64,
                            group_dim=2, symmetric=False))

        # Model
        tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b", padding_side="left")
        tokenizer.add_bos_token = False
        # Token id of the newline, used to stop generation after one line
        stop = tokenizer("\n").input_ids[0]

        opt_config = get_opt_config(args.model)
        model = OptLM(opt_config, env, args.path, policy)

        # Chat
        inputs = tokenizer([context])
        # The sampling settings below are assumed from the FlexGen chatbot example
        output_ids = model.generate(
            inputs.input_ids,
            do_sample=True,
            temperature=0.7,
            max_new_tokens=96,
            stop=stop)
        outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
        # Keep only the first generated line after the context
        try:
            index = outputs.index("\n", len(context))
        except ValueError:
            outputs += "\n"
            index = outputs.index("\n", len(context))

        outputs = outputs[:index + 1]
        return outputs

    parser = argparse.ArgumentParser()
    # Register FlexGen's command-line options (--model, --path, --percent, ...)
    add_parser_arguments(parser)
    args = parser.parse_args([
        "--model", "facebook/opt-6.7b",
        "--path", "/data/flexgen/weights",
        "--offload-dir", "/data/flexgen/offload",
        "--percent", "100", "0", "100", "0", "100", "0",
        "--pin-weight", "true",
    ])

    return run_chat(args, context=context)

The run function defines an inner function, run_chat, which initializes the environment by setting up the GPU, CPU, and disk devices required for the computation. run_chat then loads a pre-trained language model using the OptLM class from the FlexGen library; the model is initialized with an offloading policy and the devices defined in the environment. It generates a response by passing the input text through the model's generate method and truncates the result at the first newline after the context, so only one assistant line is returned. The run function itself prepares the arguments for run_chat and calls it.
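
To make the offloading policy more concrete, here is a small sketch of how the six `--percent` values can be read. It assumes the order documented by FlexGen: the GPU share and CPU share of the weights, then of the attention cache, then of the activations, with whatever remains going to disk. `Placement` and `describe_percent` are hypothetical helpers written for this illustration and are not part of FlexGen.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    """Share of one tensor type held on each device, in percent."""
    gpu: int
    cpu: int

    @property
    def disk(self) -> int:
        # Whatever is not on the GPU or the CPU is offloaded to disk
        return 100 - self.gpu - self.cpu


def describe_percent(percent):
    """Map the six --percent values to weight, cache, and activation placement."""
    if len(percent) != 6:
        raise ValueError("expected six values: (GPU, CPU) for weights, cache, activations")
    placements = {}
    for i, name in enumerate(["weights", "cache", "activations"]):
        gpu, cpu = percent[2 * i], percent[2 * i + 1]
        if gpu < 0 or cpu < 0 or gpu + cpu > 100:
            raise ValueError(f"invalid split for {name}: {gpu}% GPU, {cpu}% CPU")
        placements[name] = Placement(gpu, cpu)
    return placements


# The setting used in this example keeps everything on the GPU
for name, p in describe_percent([100, 0, 100, 0, 100, 0]).items():
    print(f"{name}: {p.gpu}% GPU, {p.cpu}% CPU, {p.disk}% disk")
```

Lowering the GPU shares would move part of the model to CPU memory or disk, which is how FlexGen fits larger models on smaller GPUs at the cost of speed.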

Step 3: Set up the Streamlit App

In the same file, we create a Streamlit app that lets users interact with the chatbot. We use st.session_state to hold the session history, and streamlit_chat to display messages.

Here's the code for the Streamlit app:

import streamlit as st
from streamlit_chat import message

# Seed the conversation with a short example exchange
if "context" not in st.session_state:
    st.session_state['context'] = (
        "A chat between a curious human and a knowledgeable artificial intelligence assistant.\n"
        "Human: Hello! What can you do?\n"
        "Assistant: As an AI assistant, I can answer questions and chat with you.\n"
        "Human: What is the name of the tallest mountain in the world?\n"
        "Assistant: Everest.\n"
    )

if "output" not in st.session_state:
    st.session_state['output'] = ""

if "chat_history" not in st.session_state:
    st.session_state['chat_history'] = []

st.title("fal-serverless bot")

def get_text():
    input_text = st.text_input("You: ", "", key="input")
    return input_text

user_input = get_text()

if user_input:
    st.session_state["chat_history"].append((user_input, True))
    st.session_state['context'] += "Human: " + user_input + "\n"
    # run() returns the full context followed by the newly generated line
    st.session_state['output'] = run(st.session_state['context'])
    response = st.session_state['output'][len(st.session_state['context']):]
    # Drop the leading "Assistant:" label (10 characters) before displaying
    st.session_state["chat_history"].append((response[10:], False))
    st.session_state['context'] = st.session_state['output']

if st.session_state["chat_history"]:
    history = st.session_state["chat_history"]
    # Show the newest messages first; index-based keys stay unique even if a message repeats
    for idx in reversed(range(len(history))):
        text, is_user = history[idx]
        message(text, is_user=is_user, key=f"message-{idx}")

We initialize session state variables to store the chat history and context. The get_text function uses st.text_input to allow users to enter text.

The main app logic generates a response to the user's input using the isolated run function, and appends the user input and the response to the chat history. The chat history is then displayed, newest message first, using the message function from streamlit_chat.
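
The bookkeeping in a single chat turn can be tried without Streamlit or a GPU. The sketch below is an illustration under stated assumptions: `chat_turn` and `fake_run` are hypothetical names, a plain dict stands in for `st.session_state`, and `fake_run` replaces the isolated run function with a canned reply. Unlike the app code, it checks for the "Assistant:" label instead of slicing a fixed 10 characters.

```python
ASSISTANT_PREFIX = "Assistant:"


def fake_run(context):
    """Stand-in for the isolated run function: the context plus one generated line."""
    return context + "Assistant: This is a canned reply.\n"


def chat_turn(state, user_input, generate):
    """Apply one chat turn to a dict shaped like st.session_state."""
    state["chat_history"].append((user_input, True))
    state["context"] += "Human: " + user_input + "\n"
    output = generate(state["context"])
    # The output echoes the context, so the new text is whatever follows it
    response = output[len(state["context"]):]
    # Remove the speaker label and surrounding whitespace before display
    if response.startswith(ASSISTANT_PREFIX):
        response = response[len(ASSISTANT_PREFIX):]
    response = response.strip()
    state["chat_history"].append((response, False))
    # The next turn continues from the full transcript
    state["context"] = output
    return response


state = {"context": "Human: Hello!\nAssistant: Hi.\n", "chat_history": []}
print(chat_turn(state, "How are you?", fake_run))
```

Because the context always carries the whole transcript, each call to the model sees every earlier turn, which is what lets the chatbot refer back to previous messages.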

Step 4: Running the app

In your terminal, run the following command, passing the path to the file that contains the code above:

streamlit run

This will launch the Streamlit app in your default web browser.


In this example, we demonstrated how to use the FlexGen library and fal-serverless to create a chatbot app that responds to user input by generating text using a pre-trained language model. With these tools, you can create your own chatbots!