Running Llama 2 with vLLM
vLLM is a powerful tool for running inference with modern text generation / completion models from various architectures. In this example we’ll use it with fal’s serverless runtime, pairing it with a fast AI accelerator card (like an A100) to unlock vLLM’s high-throughput inference capabilities.
Starting the party!
Let’s start the party as usual by importing fal. But before we can proceed to setting a static MODEL_NAME, we first have to decide which model to use.
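Here is how that looks as a minimal sketch (the model name below is just a placeholder until we pick one in the next section):

```python
import fal

# Placeholder; we'll settle on an actual model name in the next section.
MODEL_NAME = "<model-name-from-hf-hub>"
```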
Selecting a model
vLLM supports a wide range of models, including Llama, Llama 2, GPT-J, OPT, and more (the full list of supported models is here).
Using Llama 2
We are going to use Llama 2 as our primary LLM candidate, but since it’s a gated model, we first have to request access to it through the Hugging Face Hub. If you’d rather skip this process and use an open source model instead, feel free to jump to the next section.
- Go to Llama 2’s landing page by Meta and fill out the form to request access (using the same e-mail address as your HF account).
- Then log in to your HF account and request access for the model you’d like to use under the meta-llama org. We’ll be using meta-llama/Llama-2-13b-hf, so you can use this link to request access.
- Generate an HF access token and set the HUGGING_FACE_HUB_TOKEN secret in fal.
- Proceed to change the model name to the model you’ve requested access to (as shown in the sketch below).
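For example, once your access request is approved, pointing MODEL_NAME at the gated checkpoint is all it takes (a sketch using the 13B variant mentioned above):

```python
# Gated model; requires the HUGGING_FACE_HUB_TOKEN secret to be set in fal.
MODEL_NAME = "meta-llama/Llama-2-13b-hf"
```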
Optional: Using an Open Source model
If you already have a public model in mind, you can use it here (as long as it’s on the Hugging Face Hub); otherwise, feel free to pass Open Llama 13B as an example to continue with this tutorial.
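For example, with Open Llama 13B the assignment might look like this (assuming the openlm-research/open_llama_13b repository on the Hub):

```python
# Public model; no access request or token needed.
MODEL_NAME = "openlm-research/open_llama_13b"
```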
Preparing the vLLM engine
As with most of our examples, the model is going to be initialized in a cached function to prevent reloading it every time our function is invoked. All we have to do is pass the name of the model from the HF Hub and let vLLM take care of the rest of the initialization sequence.
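A minimal sketch of that cached initializer might look like the following (assuming fal exposes its caching helper as fal.cached; the exact name may differ depending on your fal SDK version):

```python
import fal
from vllm import LLM


@fal.cached
def get_engine() -> LLM:
    # Loaded once and reused across invocations thanks to the cache.
    return LLM(model=MODEL_NAME)
```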
Defining I/O
For this example, we accept only two parameters (the prompt itself and the maximum number of tokens), but this part is mainly up to your imagination and you can customize it however you’d like.
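As a sketch, the I/O could be modeled with pydantic; the field names and defaults below are our own choices, not something vLLM or fal mandates:

```python
from pydantic import BaseModel, Field


class Input(BaseModel):
    prompt: str = Field(description="The prompt to complete.")
    max_tokens: int = Field(default=256, description="Maximum number of tokens to generate.")


class Output(BaseModel):
    output: str = Field(description="The generated completion.")
```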
Putting the band together
Now that we have the model, the engine, and the I/O, we should be able to put this together and have it ready to go! vLLM (and torch 2.0) will be our only dependencies, and as always this is ready to be turned into an API endpoint by enabling the serve annotation.
In terms of parameters and customizability, vLLM offers really interesting options, so depending on your needs (and your inputs), don’t forget to check them out. For this example, we only need to pass max_tokens to avoid going over a certain limit.
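Putting all of the pieces from the previous sections into one file might look roughly like this; note that the fal.function arguments below (requirements, machine_type, serve) and the machine type string are assumptions that may need adjusting for your fal SDK version:

```python
import fal
from pydantic import BaseModel, Field
from vllm import LLM, SamplingParams

# Recap of the sketches from the earlier sections.
MODEL_NAME = "meta-llama/Llama-2-13b-hf"


class Input(BaseModel):
    prompt: str = Field(description="The prompt to complete.")
    max_tokens: int = Field(default=256, description="Maximum number of tokens to generate.")


class Output(BaseModel):
    output: str = Field(description="The generated completion.")


@fal.cached
def get_engine() -> LLM:
    # Initialized once per worker and reused across requests.
    return LLM(model=MODEL_NAME)


@fal.function(
    requirements=["vllm", "torch>=2.0"],
    # Pick an A100-backed machine type; exact identifiers depend on fal's offerings.
    machine_type="GPU",
    serve=True,
)
def generate(input: Input) -> Output:
    engine = get_engine()
    # Cap generation length so a single request can't run away.
    params = SamplingParams(max_tokens=input.max_tokens)
    # vLLM returns one RequestOutput per prompt; we send a single prompt.
    [result] = engine.generate([input.prompt], params)
    return Output(output=result.outputs[0].text)
```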
Sharing your AI friend with others
Let’s deploy this LLM and play with it through our HTTP client of choice!
And start talking to this imaginary friend of ours:
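For example, with Python’s requests library (the endpoint URL below is a placeholder; use the URL printed by your deployment, and note that the field names follow the I/O sketch above):

```python
import requests

# Placeholder URL; substitute the endpoint printed when you deploy with fal.
ENDPOINT_URL = "https://<your-app-endpoint>.fal.run"

response = requests.post(
    ENDPOINT_URL,
    json={"prompt": "My imaginary friend's favorite hobby is", "max_tokens": 64},
    timeout=600,
)
response.raise_for_status()
print(response.json()["output"])
```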