vLLM is a fast and easy-to-use library for inference and serving of large language models (LLMs).
Installing vLLM
pip install vllm
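After installation, the snippet below is an optional sanity check: it only confirms that the package imports and prints the installed version.
import vllm

# Optional sanity check: confirm that vLLM imports and print its version.
print(vllm.__version__)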
vLLM can serve large models either as a local (offline) inference service or as an API service.
1. Local inference service
from vllm import LLM, SamplingParams
# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="facebook/opt-125m")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
2. API service
OpenAI-compatible API server
(1) Using the vllm serve command
vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
(2) Using the API server entrypoint directly
python -m vllm.entrypoints.openai.api_server --model <model>
Calling the API
(1) Using curl
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
(2) Using Python
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
print(completion.choices[0].message)
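The same client can also call the /v1/completions endpoint used in the curl example above. A minimal sketch, assuming the server is serving Qwen/Qwen2.5-1.5B-Instruct and was started with the same API key:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",  # must match the --api-key passed to vllm serve, if any
)

# Text-completion request mirroring the curl example above.
completion = client.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # assumes the server is serving this model
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)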
3. Memory-related parameters
--gpu-memory-utilization: fraction of GPU memory to use, default 0.9 (90%). vLLM pre-allocates this fraction of GPU memory for the model and the KV cache, so raising it leaves more room for the KV cache.
--max-num-seqs: maximum number of sequences processed in one iteration, default 256. Larger values allow more concurrent requests but increase memory usage.
--max-model-len: model context length. If not set, it is read from the model config; smaller values use less memory.
--max-num-batched-tokens: maximum number of tokens batched in one iteration.
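The same settings are also accepted as keyword arguments when constructing an offline LLM. A minimal sketch; the model name and values are only illustrative:
from vllm import LLM

# Offline equivalents of the serve flags above (values are illustrative).
llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # any model, used here only for illustration
    gpu_memory_utilization=0.85,         # --gpu-memory-utilization
    max_num_seqs=128,                    # --max-num-seqs
    max_model_len=4096,                  # --max-model-len
    max_num_batched_tokens=8192,         # --max-num-batched-tokens
)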
4. Multi-GPU deployment
To shard a model across multiple GPUs, set the tensor parallel size, i.e. the number of GPUs to use on each node. The pipeline parallel size sets the number of nodes.
from vllm import LLM
llm = LLM("facebook/opt-13b", tensor_parallel_size=4)
output = llm.generate("San Francisco is a")
Or:
vllm serve facebook/opt-13b \
    --tensor-parallel-size 4
To run vLLM across multiple nodes, every node must have an identical execution environment, including the model path and the Python environment; using a Docker image is an easy way to guarantee this.
The following script, run_cluster.sh, starts the containers:
#!/bin/bash
# Check for minimum number of required arguments
if [ $# -lt 4 ]; then
    echo "Usage: $0 docker_image head_node_address --head|--worker path_to_hf_home [additional_args...]"
    exit 1
fi
# Assign the first three arguments and shift them away
DOCKER_IMAGE="$1"
HEAD_NODE_ADDRESS="$2"
NODE_TYPE="$3" # Should be --head or --worker
PATH_TO_HF_HOME="$4"
shift 4
# Additional arguments are passed directly to the Docker command
ADDITIONAL_ARGS=("$@")
# Validate node type
if [ "${NODE_TYPE}" != "--head" ] && [ "${NODE_TYPE}" != "--worker" ]; then
echo "Error: Node type must be --head or --worker"
exit 1
fi
# Define a function to cleanup on EXIT signal
cleanup() {
    docker stop node
    docker rm node
}
trap cleanup EXIT
# Command setup for head or worker node
RAY_START_CMD="ray start --block"
if [ "${NODE_TYPE}" == "--head" ]; then
RAY_START_CMD+=" --head --port=6379"
else
RAY_START_CMD+=" --address=${HEAD_NODE_ADDRESS}:6379"
fi
# Run the docker command with the user specified parameters and additional arguments
docker run \
    --entrypoint /bin/bash \
    --network host \
    --name node \
    --shm-size 10.24g \
    --gpus all \
    -v "${PATH_TO_HF_HOME}:/root/.cache/huggingface" \
    "${ADDITIONAL_ARGS[@]}" \
    "${DOCKER_IMAGE}" -c "${RAY_START_CMD}"
Pick one node as the head node and run:
bash run_cluster.sh \
    vllm/vllm-openai \
    ip_of_head_node \
    --head \
    /path/to/the/huggingface/home/in/this/node \
    -e VLLM_HOST_IP=ip_of_this_node
On every other node, run:
bash run_cluster.sh \
    vllm/vllm-openai \
    ip_of_head_node \
    --worker \
    /path/to/the/huggingface/home/in/this/node \
    -e VLLM_HOST_IP=ip_of_this_node
Enter a container with docker exec -it node /bin/bash. Inside the container, all GPUs in the cluster can be used as if they were on a single node, for example:
vllm serve /path/to/the/model/in/the/container \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2
Or:
vllm serve /path/to/the/model/in/the/container \
    --tensor-parallel-size 16
5. Deploying GGUF models
vLLM currently only supports single-file GGUF models. If a model is split across multiple GGUF files, merge them into one file first with the gguf-split tool.
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
Or, in Python:
from vllm import LLM, SamplingParams
# In this script, we demonstrate how to pass input to the chat method:
conversation = [
    {
        "role": "system",
        "content": "You are a helpful assistant"
    },
    {
        "role": "user",
        "content": "Hello"
    },
    {
        "role": "assistant",
        "content": "Hello! How can I assist you today?"
    },
    {
        "role": "user",
        "content": "Write an essay about the importance of higher education.",
    },
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
          tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.chat(conversation, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
In both cases, the tokenizer comes from the originally downloaded base model rather than from the GGUF file.