Running Llama 2 with vLLM
vLLM is a powerful tool for running inference with modern text generation / completion models from various architectures. In this example we’ll use it with fal’s serverless runtime, pairing it with a fast AI accelerator card (like an A100) to unlock vLLM’s high-throughput inference capabilities.
Starting the party!
Let’s start the party as usual by importing fal. But before we can proceed to setting a static MODEL_NAME, we first have to decide which model to use.
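Here is how that looks as a minimal sketch (the model name below is just a placeholder until we pick one in the next section):

```python
import fal

# Placeholder; we'll settle on an actual model name in the next section.
MODEL_NAME = "<model-name-from-hf-hub>"
```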
Selecting a model
vLLM supports a wide range of models, including Llama, Llama 2, GPT-J, OPT, and more (the full list of supported models is here).
Using Llama 2
We are going to use Llama 2 as our primary LLM candidate, but since it’s a gated model, we first have to request access to it through the Hugging Face Hub. If you’d rather skip this process and use an open source model instead, feel free to jump to the next section.
- Go to Llama 2’s landing page by Meta and fill out the form to request access (using the same e-mail address as your HF account).
- Then log in to your HF account and request access for the model you’d like to use under the meta-llama org. We’ll be using meta-llama/Llama-2-13b-hf, so you can use this link to request access.
- Generate an HF access token and set the HUGGING_FACE_HUB_TOKEN secret in fal.
- Proceed to change the model name to the model you’ve requested access to (as shown in the sketch below).
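For example, once your access request is approved, pointing MODEL_NAME at the gated checkpoint is all it takes (a sketch using the 13B variant mentioned above):

```python
# Gated model; requires the HUGGING_FACE_HUB_TOKEN secret to be set in fal.
MODEL_NAME = "meta-llama/Llama-2-13b-hf"
```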
Optional: Using an Open Source model
If you already have a public model in mind, you can use it here (as long as it’s on the Hugging Face Hub); otherwise, feel free to pass Open Llama 13B as an example to continue with this tutorial.
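For example, with Open Llama 13B the assignment might look like this (assuming the openlm-research/open_llama_13b repository on the Hub):

```python
# Public model; no access request or token needed.
MODEL_NAME = "openlm-research/open_llama_13b"
```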
Preparing the vLLM engine
As with most of our examples, the model is going to be initialized in a cached function to prevent reloading it every time our function is invoked. All we have to do is pass the name of the model from the HF Hub and let vLLM take care of the rest of the initialization sequence.
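A minimal sketch of that cached initializer might look like the following (assuming fal exposes its caching helper as fal.cached; the exact name may differ depending on your fal SDK version):

```python
import fal
from vllm import LLM


@fal.cached
def get_engine() -> LLM:
    # Loaded once and reused across invocations thanks to the cache.
    return LLM(model=MODEL_NAME)
```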
Defining I/O
For this example, we accept only two parameters (the prompt itself and the maximum number of tokens), but this part is mainly up to your imagination and you can customize it however you’d like.
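As a sketch, the I/O could be modeled with pydantic; the field names and defaults below are our own choices, not something vLLM or fal mandates:

```python
from pydantic import BaseModel, Field


class Input(BaseModel):
    prompt: str = Field(description="The prompt to complete.")
    max_tokens: int = Field(default=256, description="Maximum number of tokens to generate.")


class Output(BaseModel):
    output: str = Field(description="The generated completion.")
```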
Putting the band together
Now that we have the model, the engine, and the I/O, we should be able to put this together and have it ready to go! vLLM (and torch 2.0) will be our only dependencies, and as always this is ready to be turned into an API endpoint by enabling the serve annotation.
In terms of parameters and customizability, vLLM offers really interesting options, so depending on your needs (and your inputs), don’t forget to check them out. For this example, we only need to pass max_tokens to avoid going over a certain limit.
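Putting all of the pieces from the previous sections into one file might look roughly like this; note that the fal.function arguments below (requirements, machine_type, serve) and the machine type string are assumptions that may need adjusting for your fal SDK version:

```python
import fal
from pydantic import BaseModel, Field
from vllm import LLM, SamplingParams

# Recap of the sketches from the earlier sections.
MODEL_NAME = "meta-llama/Llama-2-13b-hf"


class Input(BaseModel):
    prompt: str = Field(description="The prompt to complete.")
    max_tokens: int = Field(default=256, description="Maximum number of tokens to generate.")


class Output(BaseModel):
    output: str = Field(description="The generated completion.")


@fal.cached
def get_engine() -> LLM:
    # Initialized once per worker and reused across requests.
    return LLM(model=MODEL_NAME)


@fal.function(
    requirements=["vllm", "torch>=2.0"],
    # Pick an A100-backed machine type; exact identifiers depend on fal's offerings.
    machine_type="GPU",
    serve=True,
)
def generate(input: Input) -> Output:
    engine = get_engine()
    # Cap generation length so a single request can't run away.
    params = SamplingParams(max_tokens=input.max_tokens)
    # vLLM returns one RequestOutput per prompt; we send a single prompt.
    [result] = engine.generate([input.prompt], params)
    return Output(output=result.outputs[0].text)
```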
Sharing your AI friend with others
Let’s deploy this LLM and play with it through our HTTP client of choice!
And start talking to this imaginary friend of ours:
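For example, with Python’s requests library (the endpoint URL below is a placeholder; use the URL printed by your deployment, and note that the field names follow the I/O sketch above):

```python
import requests

# Placeholder URL; substitute the endpoint printed when you deploy with fal.
ENDPOINT_URL = "https://<your-app-endpoint>.fal.run"

response = requests.post(
    ENDPOINT_URL,
    json={"prompt": "My imaginary friend's favorite hobby is", "max_tokens": 64},
    timeout=600,
)
response.raise_for_status()
print(response.json()["output"])
```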