Running Llama 2 with vLLM
vLLM is a powerful tool for running inference with modern text generation / completion models across a wide range of architectures. In this example we’ll use it with fal’s serverless runtime to unlock vLLM’s high-throughput inference capabilities by pairing it with a fast AI accelerator card (like an A100).
Starting the party!
Let’s start the party as usual by importing fal. But before we can proceed to setting a static MODEL_NAME, we first have to decide which model to use.
import fal
MODEL_NAME = "..."
Selecting a model
vLLM supports a wide range of models, including Llama, Llama 2, GPT-J, OPT, and more (the full list of supported models is here).
Using Llama 2
We are going to use Llama 2 as our primary LLM candidate, but since it’s a gated model, we first have to request access to it through the Hugging Face Hub. If you don’t want to deal with this process and would rather use an open source model instead, feel free to skip to the next section.
- Go to Llama 2’s landing page by Meta and fill out the form to request access (using the same e-mail address as your HF account).
- Then log in to your HF account and request access for the model you’d like to use under the meta-llama org. We’ll be using meta-llama/Llama-2-13b-hf, so you can use this link to request access.
- Generate an HF access token and set the HUGGING_FACE_HUB_TOKEN secret in fal (if you want to confirm the token actually has access before deploying, see the sketch after this list):

  $ fal secrets set HUGGING_FACE_HUB_TOKEN="my token"

- Proceed to change the model name to the model you’ve requested access to.

MODEL_NAME = "meta-llama/Llama-2-13b-hf"
Optional: Using an Open Source model
If you already have a public model in mind, you can use it here (as long as it’s on the Hugging Face Hub); otherwise, feel free to pass Open Llama 13B as an example and continue with this tutorial.
MODEL_NAME = "openlm-research/open_llama_13b"
Preparing the vLLM engine
As with most of our examples, the model is going to be initialized in a cached function to prevent reloads every time our function is invoked. All we have to do is pass the name of the model from the HF Hub and let vLLM take care of the rest of the initialization sequence.
@fal.cached
def prepare_engine(model_name: str = MODEL_NAME):
    """Prepare the vLLM text generation engine.

    You can change the model to any of the supported ones as long as you
    have access to that model on HF Hub (or they are public).
    """
    from vllm import LLM

    engine = LLM(model=model_name)
    return engine
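If the defaults don’t fit your hardware, vLLM’s LLM constructor also accepts options such as dtype, tensor_parallel_size, and gpu_memory_utilization. The snippet below is only a sketch of how prepare_engine could be extended; the specific values are assumptions you should tune for your own machine type.

@fal.cached
def prepare_engine(model_name: str = MODEL_NAME):
    """Same engine, with a few illustrative (assumed) tuning knobs."""
    from vllm import LLM

    engine = LLM(
        model=model_name,
        dtype="float16",             # assumed precision; "auto" is the default
        tensor_parallel_size=1,      # bump this if you run on multiple GPUs
        gpu_memory_utilization=0.9,  # fraction of GPU memory vLLM may claim
    )
    return engine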
Defining I/O
For this example, we’ll accept only two parameters (the prompt itself and the maximum number of tokens), but this part is mainly up to your imagination and you can customize it however you’d like.
import fal
from pydantic import BaseModel

[...]

class ChatOptions(BaseModel):
    prompt: str
    max_tokens: int = 100


class ChatResponse(BaseModel):
    response: str
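If you want callers to control more of the sampling behaviour, the input model can simply grow extra fields. The temperature and top_p fields below are illustrative additions, not part of the original example; they only become useful once you forward them to SamplingParams, as sketched after the service definition below.

class ChatOptions(BaseModel):
    prompt: str
    max_tokens: int = 100
    # Illustrative extras -- forward these to SamplingParams in the handler.
    temperature: float = 0.8
    top_p: float = 0.95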
Putting the band together
Now that we have the model, the engine, and the I/O, we should be able to put it all together and have it ready to go! vLLM (and torch 2.0) will be our only dependencies, and as always this is ready to be turned into an API endpoint by enabling the serve option.
In terms of parameters and customizability, vLLM offers really interesting options, so depending on your needs (and your inputs) don’t forget to check them out. As for this example, we only need to pass max_tokens to avoid going over a certain limit.
@fal.function(
    "virtualenv",
    requirements=[
        "vllm",
        "torch==2.0.1",
    ],
    machine_type="GPU",
    keep_alive=60,
    serve=True,
)
def basic_completion_service(options: ChatOptions) -> ChatResponse:
    from vllm import SamplingParams

    engine = prepare_engine()
    sampling_params = SamplingParams(max_tokens=options.max_tokens)
    result = engine.generate(options.prompt, sampling_params)
    completion = result[0].outputs[0]
    return ChatResponse(response=completion.text)
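If you adopted the extra ChatOptions fields sketched earlier, wiring them through is a small change inside the handler. Again, temperature and top_p are assumed additions rather than part of the original example, and the stop sequence below is purely hypothetical.

from vllm import SamplingParams

# Drop-in replacement for the sampling_params line inside
# basic_completion_service, assuming the extended ChatOptions fields.
sampling_params = SamplingParams(
    max_tokens=options.max_tokens,
    temperature=options.temperature,
    top_p=options.top_p,
    stop=["\n\n"],  # hypothetical stop sequence, adjust to taste
)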
Sharing your AI friend with others
Let’s deploy this LLM and play with it through our HTTP client of choice!
$ fal deploy app.py::basic_completion_service --app-name ai-friend
Registered a new revision for function 'ai-friend' (revision='[...]').
URL: https://fal.run/$USER/ai-friend
And start talking to the imaginary friend of ours:
$ curl $APP_URL \
  -H 'Authorization: Key $FAL_KEY' \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json, */*;q=0.5' \
  -d '{"prompt": "once in a time, there was a cat named "}'
# It should print something like this:
{
  "result": "..."
}
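If you’d rather talk to the endpoint from Python than from the shell, a small requests-based client does the job. This is only a sketch under a couple of assumptions: the URL and key come from the same environment variables used above, and the response body carries the generated text as shown in the JSON above.

# call_ai_friend.py -- minimal sketch of a Python client for the endpoint
import os

import requests

APP_URL = os.environ["APP_URL"]  # e.g. https://fal.run/$USER/ai-friend
FAL_KEY = os.environ["FAL_KEY"]

payload = {"prompt": "once in a time, there was a cat named ", "max_tokens": 100}
resp = requests.post(
    APP_URL,
    headers={
        "Authorization": f"Key {FAL_KEY}",
        "Content-Type": "application/json",
    },
    json=payload,
    timeout=600,  # generation can take a while on cold starts
)
resp.raise_for_status()
print(resp.json())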