vLLM is a fast and easy-to-use library for inference and serving of large language models (LLMs).
Installing vLLM
pip install vllm
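After installation, the snippet below is an optional sanity check: it only confirms that the package imports and prints the installed version.
import vllm

# Optional sanity check: confirm that vLLM imports and print its version.
print(vllm.__version__)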
vLLM can serve large models either as a local (offline) inference service or as an API service.
1. Local inference service
from vllm import LLM, SamplingParams
# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="facebook/opt-125m")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
2. API service
OpenAI-compatible API server
(1) Using the vllm serve command
vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
(2) Using the API server entrypoint directly
python -m vllm.entrypoints.openai.api_server --model <model>
Calling the API
(1) Using curl
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
(2) Using Python
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
print(completion.choices[0].message)
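The same client can also call the /v1/completions endpoint used in the curl example above. A minimal sketch, assuming the server is serving Qwen/Qwen2.5-1.5B-Instruct and was started with the same API key:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",  # must match the --api-key passed to vllm serve, if any
)

# Text-completion request mirroring the curl example above.
completion = client.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # assumes the server is serving this model
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)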
3. Memory-related parameters
--gpu-memory-utilization: fraction of GPU memory to use, default 0.9 (90%). vLLM pre-allocates this fraction of GPU memory for the model and the KV cache, so raising it leaves more room for the KV cache.
--max-num-seqs: maximum number of sequences processed in one iteration, default 256. Larger values allow more concurrent requests but increase memory usage.
--max-model-len: model context length. If not set, it is read from the model config; smaller values use less memory.
--max-num-batched-tokens: maximum number of tokens batched in one iteration.
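The same settings are also accepted as keyword arguments when constructing an offline LLM. A minimal sketch; the model name and values are only illustrative:
from vllm import LLM

# Offline equivalents of the serve flags above (values are illustrative).
llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # any model, used here only for illustration
    gpu_memory_utilization=0.85,         # --gpu-memory-utilization
    max_num_seqs=128,                    # --max-num-seqs
    max_model_len=4096,                  # --max-model-len
    max_num_batched_tokens=8192,         # --max-num-batched-tokens
)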
4. Multi-GPU deployment
To shard a model across multiple GPUs, set the tensor parallel size, i.e. the number of GPUs to use on each node. The pipeline parallel size sets the number of nodes.
from vllm import LLM
llm = LLM("facebook/opt-13b", tensor_parallel_size=4)
output = llm.generate("San Francisco is a")
Or:
vllm serve facebook/opt-13b \
    --tensor-parallel-size 4
To run vLLM across multiple nodes, every node must have an identical execution environment, including the model path and the Python environment; using a Docker image is an easy way to guarantee this.
The following script, run_cluster.sh, starts the containers:
#!/bin/bash
# Check for minimum number of required arguments
if [ $# -lt 4 ]; then
    echo "Usage: $0 docker_image head_node_address --head|--worker path_to_hf_home [additional_args...]"
    exit 1
fi
# Assign the first three arguments and shift them away
DOCKER_IMAGE="$1"
HEAD_NODE_ADDRESS="$2"
NODE_TYPE="$3" # Should be --head or --worker
PATH_TO_HF_HOME="$4"
shift 4
# Additional arguments are passed directly to the Docker command
ADDITIONAL_ARGS=("$@")
# Validate node type
if [ "${NODE_TYPE}" != "--head" ] && [ "${NODE_TYPE}" != "--worker" ]; then
echo "Error: Node type must be --head or --worker"
exit 1
fi
# Define a function to cleanup on EXIT signal
cleanup() {
    docker stop node
    docker rm node
}
trap cleanup EXIT
# Command setup for head or worker node
RAY_START_CMD="ray start --block"
if [ "${NODE_TYPE}" == "--head" ]; then
RAY_START_CMD+=" --head --port=6379"
else
RAY_START_CMD+=" --address=${HEAD_NODE_ADDRESS}:6379"
fi
# Run the docker command with the user specified parameters and additional arguments
docker run \
    --entrypoint /bin/bash \
    --network host \
    --name node \
    --shm-size 10.24g \
    --gpus all \
    -v "${PATH_TO_HF_HOME}:/root/.cache/huggingface" \
    "${ADDITIONAL_ARGS[@]}" \
    "${DOCKER_IMAGE}" -c "${RAY_START_CMD}"
Pick one node as the head node and run:
bash run_cluster.sh \
    vllm/vllm-openai \
    ip_of_head_node \
    --head \
    /path/to/the/huggingface/home/in/this/node \
    -e VLLM_HOST_IP=ip_of_this_node
On every other node, run:
bash run_cluster.sh \
    vllm/vllm-openai \
    ip_of_head_node \
    --worker \
    /path/to/the/huggingface/home/in/this/node \
    -e VLLM_HOST_IP=ip_of_this_node
Enter a container with docker exec -it node /bin/bash. Inside the container, all GPUs in the cluster can be used as if they were on a single node, for example:
vllm serve /path/to/the/model/in/the/container \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2
Or:
vllm serve /path/to/the/model/in/the/container \
    --tensor-parallel-size 16
5. Deploying GGUF models
vLLM currently only supports single-file GGUF models. If a model is split across multiple GGUF files, merge them into one file first with the gguf-split tool.
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
Or, in Python:
from vllm import LLM, SamplingParams
# In this script, we demonstrate how to pass input to the chat method:
conversation = [
    {
        "role": "system",
        "content": "You are a helpful assistant"
    },
    {
        "role": "user",
        "content": "Hello"
    },
    {
        "role": "assistant",
        "content": "Hello! How can I assist you today?"
    },
    {
        "role": "user",
        "content": "Write an essay about the importance of higher education.",
    },
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
          tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.chat(conversation, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
In both cases, the tokenizer comes from the originally downloaded base model rather than from the GGUF file.