Deploying the chatglm2-6b model using Triton | JD Cloud Technical Team

1. Technical introduction

NVIDIA Triton Inference Server is an inference serving solution for the cloud, optimized for both CPUs and GPUs.

Supported model types include TensorRT, TensorFlow, PyTorch (e.g. meta-llama/Llama-2-7b), Python (e.g. chatglm), ONNX Runtime and OpenVINO.

NVIDIA Triton Server is a high-performance inference server with the following features:

1. High performance: Triton Server provides high throughput and low latency for workloads that use GPUs for inference, and can serve multiple models simultaneously.

2. Memory management: Large models often require large amounts of GPU memory for inference. Triton Server has a flexible memory management mechanism that can effectively manage and allocate GPU memory so that large-model inference runs efficiently.

3. Scalability: Triton Server handles highly concurrent inference requests through parallel processing and asynchronous inference, and can automatically scale up and down according to load.

4. Multi-model support: Triton Server can deploy and manage multiple models at the same time. This allows you to share server resources and deploy and manage different models in a consistent manner.

5. Flexibility: Triton Server supports multiple model formats and inference frameworks, including TensorFlow, PyTorch, ONNX, etc. You can use your favorite models and tools for model development and training, and easily deploy them to Triton Server.

6. Advanced features: Triton Server provides many advanced features, such as model version management, request concurrency control, dynamic batch size optimization, request time tracking, etc. These features enhance model deployment and management capabilities.

2. Practice

Official documentation (Serve a Model in 3 Easy Steps):

https://github.com/triton-inference-server/server

Serve a model in N easy steps:

Step 1: Pull triton-server code

git clone -b r23.08 https://github.com/triton-inference-server/server.git

Step 2: Use the tritonserver:22.12-py3 image to start the triton-server container

docker run --gpus all --shm-size=1g --ulimit memlock=-1 -p 8000:8000 -p 8001:8001 -p 8002:8002 --ulimit stack=67108864 -ti nvcr.io/nvidia/tritonserver:22.12-py3

Pay attention to the -p port mappings; they are troublesome to change later.

The tritonserver version must correspond to the python_backend version; for example, the 22.12 image pairs with the r22.12 python_backend branch used below.

Step 3: Download the python inference backend python_backend

Documentation: https://github.com/triton-inference-server/python_backend

Download python backend code:

git clone https://github.com/triton-inference-server/python_backend -b r22.12

Work inside the container. If you exit the container partway through, re-enter it with: docker exec -it <container name> /bin/bash

If the clone fails inside the container, clone it on the host and copy it in: docker cp python_backend busy_galileo:/opt

Step 4: Create model directory

cd python_backend

1) Create model directory: mkdir -p models/chatglm2-6b/1/

2) Copy chatglm2-6b from the host into the model directory in the container: docker cp chatglm2-6b <container name>:<path in container>/models/chatglm2-6b

3) Create the model configuration file: vi models/chatglm2-6b/config.pbtxt. It defines the model name, backend, input/output parameters, model path and other settings:

name: "chatglm2-6b"
backend: "python"
max_batch_size: 1

input [
  {
    name: "QUERY"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "max_new_tokens"
    data_type: TYPE_UINT32
    dims: [ -1 ]
  },
  {
    name: "top_k"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "length_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "repetition_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "bos_token_id"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "eos_token_id"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "do_sample"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    optional: true
  },
  {
    name: "num_beams"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    optional: true
  }
]
output [
  {
    name: "OUTPUT"
    data_type: TYPE_STRING
    dims: [ -1, -1 ]
  }
]

instance_group [
  {
    kind: KIND_GPU
  }
]

parameters {
  key: "model_path"
  value: {
    string_value: "/opt/tritonserver/python_backend/models/chatglm2-6b"
  }
}

Create model.py, which implements the custom inference logic in Python: vi models/chatglm2-6b/1/model.py

The model's inputs, outputs and parameters are processed here with Python code.

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    @staticmethod
    def auto_complete_config(auto_complete_model_config):
        """`auto_complete_config` is called only once when loading the model.
        Implementing this function is optional; it can be used to set model
        properties such as max_batch_size, inputs and outputs.
        """
        return auto_complete_model_config

    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` function is optional. This function allows
        the model to initialize any state associated with this model.

        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device
            ID
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """
        print('Initialized...')

    def execute(self, requests):
        """`execute` must be implemented in every Python model. `execute`
        function receives a list of pb_utils.InferenceRequest as the only
        argument. This function is called when an inference is requested
        for this model.

        Parameters
        ----------
        requests : list
          A list of pb_utils.InferenceRequest

        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`
        """

        responses = []
        # The actual chatglm2-6b generation logic goes here; see the sketch
        # after this skeleton for one possible implementation.
        return responses

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is optional. This function allows
        the model to perform any necessary clean ups before exit.
        """
        print('Cleaning up...')
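
The skeleton above only shows the python_backend hooks. Below is a minimal sketch of how initialize and execute could be filled in for chatglm2-6b. It assumes the weights sit under the model_path parameter defined in config.pbtxt, that torch and transformers are installed in the container (see Step 5), and it simply maps max_new_tokens onto chat()'s max_length; the remaining optional inputs (top_k, temperature, etc.) would be read the same way.

import json

import numpy as np
import torch
import triton_python_backend_utils as pb_utils
from transformers import AutoModel, AutoTokenizer


class TritonPythonModel:
    def initialize(self, args):
        # "model_path" is the parameter defined in config.pbtxt above.
        model_config = json.loads(args["model_config"])
        model_path = model_config["parameters"]["model_path"]["string_value"]
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        self.model = AutoModel.from_pretrained(model_path, trust_remote_code=True).half().cuda()
        self.model.eval()

    def execute(self, requests):
        responses = []
        for request in requests:
            # QUERY is TYPE_STRING, so as_numpy() yields bytes objects.
            query = pb_utils.get_input_tensor_by_name(request, "QUERY").as_numpy()[0][0].decode("utf-8")
            max_new_tokens = int(
                pb_utils.get_input_tensor_by_name(request, "max_new_tokens").as_numpy()[0][0]
            )

            with torch.no_grad():
                # chatglm2-6b exposes a chat() helper through trust_remote_code;
                # max_new_tokens is mapped to chat()'s max_length here for simplicity.
                answer, _history = self.model.chat(
                    self.tokenizer, query, history=[], max_length=max_new_tokens
                )

            output = pb_utils.Tensor("OUTPUT", np.array([answer.encode("utf-8")], dtype=np.object_))
            responses.append(pb_utils.InferenceResponse(output_tensors=[output]))
        return responses

    def finalize(self):
        del self.model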

Step 5: Install the inference environment and various software

The CUDA version and the graphics driver must correspond; that is, the CUDA toolkit version must match the driver version.

For the corresponding relationship, see the official website: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cuda-major-component-versions

1) Introduction and installation of torch:

PyTorch (torch) is a scientific computing framework designed to provide efficient tensor operations and automatic differentiation for machine learning and other scientific computing tasks. It also provides a rich library of pretrained models and algorithms, so users can quickly build and train models for a variety of machine learning tasks.

pip install ./torch-1.12.1+cu116-cp38-cp38-linux_x86_64.whl

2) Graphics card driver:

sh ./NVIDIA-Linux-x86_64-460.106.00.run

3) cudnn introduction and installation:

cuDNN (CUDA Deep Neural Network library) is a GPU-accelerated library for deep neural networks (DNNs) provided by NVIDIA, designed to optimize and accelerate neural network training and inference in deep learning tasks.

cuDNN provides a set of core algorithms and functions for common deep learning tasks such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). These algorithms and functions are highly optimized for GPU architecture to provide the best performance and efficiency.

wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64/libcudnn8_8.1.1.33-1+cuda11.2_amd64.deb

dpkg -i libcudnn8_8.1.1.33-1+cuda11.2_amd64.deb

4) cuda:

CUDA (Compute Unified Device Architecture) is a parallel computing platform and API developed by NVIDIA for GPU programming.

Through the CUDA library, model inference can be performed synchronously or asynchronously on the GPU, while supporting batch processing and multi-card parallel computing to improve the speed and efficiency of model inference.

wget https://developer.download.nvidia.com/compute/cuda/11.2.0/local_installers/cuda_11.2.0_460.27.04_linux.run

sudo sh cuda_11.2.0_460.27.04_linux.run

5) Various software

nohup apt-get update

nohup apt-get install -y autoconf autogen clangd gdb git-lfs libb64-dev libz-dev locales-all mosh openssh-server python3-dev rapidjson-dev sudo tmux unzip zstd zip zsh
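
With the environment installed, a quick check from Python (a minimal sketch; it only assumes the PyTorch wheel installed in step 1) confirms that torch can see CUDA, cuDNN and the GPU:

import torch

# The versions reported here should match the toolkit, driver and cuDNN installed above.
print("torch:", torch.__version__)                 # e.g. 1.12.1+cu116
print("CUDA available:", torch.cuda.is_available())
print("CUDA (torch build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))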

Step 6: Start triton-server

CUDA_VISIBLE_DEVICES=0 setsid tritonserver --model-repository=/opt/tritonserver/python_backend/models --backend-config=python,shm-region-prefix-name=prefix1_ --http-port 8000 --grpc-port 8001 --metrics-port 8002 --log-verbose 1 --log-file /opt/tritonserver/logs/triton_server_gpu0.log

After a successful start: HTTP port 8000, gRPC port 8001, metrics port 8002.
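
Once the server is up, Triton's standard KServe v2 health endpoints on the HTTP port can be used to confirm that both the server and the chatglm2-6b model are ready before sending inference requests:

import requests

BASE = "http://localhost:8000"

# Server-level liveness and readiness checks.
print(requests.get(f"{BASE}/v2/health/live").status_code)   # 200 when the server is up
print(requests.get(f"{BASE}/v2/health/ready").status_code)  # 200 when ready to serve

# Model-level readiness check for chatglm2-6b.
print(requests.get(f"{BASE}/v2/models/chatglm2-6b/ready").status_code)  # 200 when the model is loaded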

3. Test

A simple Python script calls the HTTP interface:

import requests
# Define the model input data
data = {
    "inputs": [
        {
            "name": "QUERY",
            "shape": [1, 1],
            "datatype": "BYTES",
            "data": ["川普是不是四川人"]
        },
        {
            "name": "max_new_tokens",
            "shape": [1, 1],
            "datatype": "UINT32",
            "data": [15000]
        },
    ]
}
headers = {
    'Content-Type': 'application/json',
}
# Send the POST request
response = requests.post('http://localhost:8000/v2/models/chatglm2-6b/infer', headers=headers, json=data)
result = response.json()
print(result)

response:

{
	"model_name": "chatglm2-6b",
	"model_version": "1",
	"outputs": [
		{
			"data": [
				"\n\n 川普不是四川人,他出生于美国宾夕法尼亚州,是一个美国政治家、企业家和电视名人。"
			],
			"datatype": "BYTES",
			"name": "OUTPUT",
			"shape": []
		}
	]
}
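
Since port 8001 is also mapped, the same request can be sent over gRPC with the tritonclient package (pip install tritonclient[grpc]). A minimal sketch, reusing the input names, shapes and datatypes from config.pbtxt above:

import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# TYPE_STRING inputs use the "BYTES" wire datatype.
query = grpcclient.InferInput("QUERY", [1, 1], "BYTES")
query.set_data_from_numpy(np.array([["川普是不是四川人"]], dtype=np.object_))

max_new_tokens = grpcclient.InferInput("max_new_tokens", [1, 1], "UINT32")
max_new_tokens.set_data_from_numpy(np.array([[15000]], dtype=np.uint32))

result = client.infer(model_name="chatglm2-6b", inputs=[query, max_new_tokens])
print(result.as_numpy("OUTPUT"))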

4. Technical direction

CI (Continuous Integration) / CD (Continuous Delivery / Continuous Deployment)

Achievable in the future:

1. Use k8s to automate container deployment (similar to Xingyun)

2. Save a complete Docker image of the large-model runtime environment, so that only the model files need to be downloaded into the corresponding directory to start the service

3. Deploy multiple open source models on a single machine, expose an inference interface for each model, and compare their responses

4. Create a Dockerfile to automatically build base containers

k8s documentation

https://kubernetes.io/zh-cn/docs/tasks/tools/

Install Docker, kubeadm and kubelet on all nodes

Deploy Kubernetes Master

Deploy a container network plug-in

Deploy Kubernetes Node and add the node to the Kubernetes cluster

Author: JD Technology Yang Jian

Source: JD Cloud Developer Community. Please indicate the source when reprinting.

