1. Technical introduction
NVIDIA Triton Inference Server is a cloud and edge inference solution optimized for both CPUs and GPUs.
Supported backends include TensorRT, TensorFlow, PyTorch (for example, meta-llama/Llama-2-7b), Python (for example, chatglm), ONNX Runtime, and OpenVINO.
NVIDIA Triton Server is a high-performance inference server with the following features:
1. High performance: Triton Server delivers high throughput and low latency for GPU inference workloads, and it can serve multiple models simultaneously.
2. Memory management: Large models often require large amounts of GPU memory for inference. Triton Server has a flexible memory management mechanism that manages and allocates GPU memory so that large models can be served efficiently.
3. Scalability: Triton Server handles highly concurrent inference requests through parallel processing and asynchronous inference. Combined with an orchestrator such as Kubernetes, deployments can scale out and in with load.
4. Multi-model support: Triton Server can deploy and manage multiple models at the same time. This allows you to share server resources and deploy and manage different models in a consistent manner.
5. Flexibility: Triton Server supports multiple model formats and inference frameworks, including TensorFlow, PyTorch, ONNX, etc. You can use your favorite models and tools for model development and training, and easily deploy them to Triton Server.
6. Advanced features: Triton Server provides many advanced features, such as model version management, request concurrency control, dynamic batching, and request tracing. These features make models easier to deploy and manage.
2. Practice
Official documentation (Serve a Model in 3 Easy Steps):
https://github.com/triton-inference-server/server
Serve a Model in N Easy Steps
Step 1: Pull the triton-server code
git clone -b r23.08 https://github.com/triton-inference-server/server.git
Step 2: Start a triton-server container from the tritonserver:22.12-py3 image
docker run --gpus all --shm-size=1g --ulimit memlock=-1 -p 8000:8000 -p 8001:8001 -p 8002:8002 --ulimit stack=67108864 -ti nvcr.io/nvidia/tritonserver:22.12-py3
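Pay attention to the -p port mappings (8000 for HTTP, 8001 for gRPC, 8002 for metrics); they are troublesome to change after the container has been created.
The tritonserver version and the python_backend version must match. For example, the 22.12 image pairs with the r22.12 branch of python_backend.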
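The version-matching rule can be checked mechanically before anything is built. The following is a minimal sketch, assuming the release appears as YY.MM in the image tag (as in tritonserver:22.12-py3) and as rYY.MM in the python_backend branch name (as in r22.12); the helper names are hypothetical and not part of Triton.

```python
import re


def image_release(image: str) -> str:
    """Extract the YY.MM release from an image tag such as 'tritonserver:22.12-py3'."""
    match = re.search(r":(\d{2}\.\d{2})(?:-|$)", image)
    if match is None:
        raise ValueError(f"no YY.MM release tag found in {image!r}")
    return match.group(1)


def branch_release(branch: str) -> str:
    """Extract the YY.MM release from a branch name such as 'r22.12'."""
    match = re.fullmatch(r"r(\d{2}\.\d{2})", branch)
    if match is None:
        raise ValueError(f"{branch!r} is not of the form rYY.MM")
    return match.group(1)


def versions_match(image: str, branch: str) -> bool:
    """Return True when the image and the backend branch name the same release."""
    return image_release(image) == branch_release(branch)


if __name__ == "__main__":
    print(versions_match("nvcr.io/nvidia/tritonserver:22.12-py3", "r22.12"))
```

Running such a check before cloning the backend catches a mismatched pair early, when it is still cheap to fix.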
Step 3: Download the Python inference backend, python_backend
Documentation: https://github.com/triton-inference-server/python_backend
Download the python_backend code:
git clone https://github.com/triton-inference-server/python_backend -b r22.12
Run this inside the container. If you exit the container partway through, re-enter it with: docker exec -it <container name> /bin/bash
If the clone fails inside the container, download the code on the host and copy it into the container (busy_galileo is the container name in this example): docker cp python_backend busy_galileo:/opt
Step 4: Create the model directory
cd python_backend
1) Create the model directory: mkdir -p models/chatglm2-6b/1/
2) Copy chatglm2 from the host into the model directory in the container: docker cp chatglm2-6b <container name>:/<path in container>/models/chatglm2-6b
3) Create the model configuration file, which declares the model's inputs, outputs, parameters, model path, and so on: vi models/chatglm2-6b/config.pbtxt
name: "chatglm2-6b"
backend: "python"
max_batch_size: 1
input [
  {
    name: "QUERY"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "max_new_tokens"
    data_type: TYPE_UINT32
    dims: [ -1 ]
  },
  {
    name: "top_k"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "length_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "repetition_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "bos_token_id"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "eos_token_id"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    optional: true
  },
  {
    name: "do_sample"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    optional: true
  },
  {
    name: "num_beams"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    optional: true
  }
]
output [
  {
    name: "OUTPUT"
    data_type: TYPE_STRING
    dims: [ -1, -1 ]
  }
]
instance_group [
  {
    kind: KIND_GPU
  }
]
parameters {
  key: "model_path"
  value: {
    string_value: "/opt/tritonserver/python_backend/models/chatglm2-6b"
  }
}
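The inputs declared in this configuration determine the JSON body a client sends over HTTP. The sketch below builds such a body, assuming the usual mapping from configuration data types to request datatype strings (TYPE_STRING to BYTES, TYPE_UINT32 to UINT32, TYPE_FP32 to FP32, TYPE_BOOL to BOOL) and a shape of [1, 1] per input, that is, a batch of one holding one element. The function name build_infer_payload is hypothetical.

```python
# Request datatype strings for the optional inputs declared in config.pbtxt.
OPTIONAL_INPUT_TYPES = {
    "top_k": "UINT32",
    "top_p": "FP32",
    "temperature": "FP32",
    "length_penalty": "FP32",
    "repetition_penalty": "FP32",
    "bos_token_id": "UINT32",
    "eos_token_id": "UINT32",
    "do_sample": "BOOL",
    "num_beams": "UINT32",
}


def _tensor(name: str, datatype: str, value) -> dict:
    """Describe one input tensor with batch size 1 and a single element."""
    return {"name": name, "shape": [1, 1], "datatype": datatype, "data": [value]}


def build_infer_payload(query: str, max_new_tokens: int, **optional) -> dict:
    """Build the request body for the two required inputs plus any optional ones."""
    inputs = [
        _tensor("QUERY", "BYTES", query),
        _tensor("max_new_tokens", "UINT32", max_new_tokens),
    ]
    for name, value in optional.items():
        if name not in OPTIONAL_INPUT_TYPES:
            raise ValueError(f"{name!r} is not declared in config.pbtxt")
        inputs.append(_tensor(name, OPTIONAL_INPUT_TYPES[name], value))
    return {"inputs": inputs}


if __name__ == "__main__":
    print(build_infer_payload("hello", 64, top_p=0.8, do_sample=True))
```

Rejecting undeclared input names on the client side gives a clearer error than letting the server refuse the request.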
4) Create model.py, which implements the custom inference logic in Python: vi models/chatglm2-6b/1/model.py
The model's inputs, outputs, and parameters are processed in this Python script.
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    @staticmethod
    def auto_complete_config(auto_complete_model_config):
        """`auto_complete_config` is called only once when loading the model."""
        # config.pbtxt already declares the full configuration, so return it unchanged.
        return auto_complete_model_config

    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` function is optional. This function allows
        the model to initialize any state associated with this model.

        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device ID
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """
        print('Initialized...')

    def execute(self, requests):
        """`execute` must be implemented in every Python model. `execute`
        function receives a list of pb_utils.InferenceRequest as the only
        argument. This function is called when an inference is requested
        for this model.

        Parameters
        ----------
        requests : list
          A list of pb_utils.InferenceRequest

        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`
        """
        responses = []
        for request in requests:
            # Read the QUERY input declared in config.pbtxt.
            query = pb_utils.get_input_tensor_by_name(request, "QUERY").as_numpy()
            # The ChatGLM2 generation call is not shown in this article; as a
            # placeholder, echo the query back so that exactly one response
            # is returned per request.
            output = pb_utils.Tensor("OUTPUT", query)
            responses.append(pb_utils.InferenceResponse(output_tensors=[output]))
        return responses

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is optional. This function allows
        the model to perform any necessary clean ups before exit.
        """
        print('Cleaning up...')
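With config.pbtxt and model.py in place, the repository should contain models/chatglm2-6b/config.pbtxt and models/chatglm2-6b/1/model.py. The following sketch checks that layout before the server is started; the function name missing_model_files is hypothetical, and the check covers only these two files, not the model weights.

```python
from pathlib import Path


def missing_model_files(repository: str, model: str, version: str = "1") -> list:
    """Return the expected files that are absent from the model repository."""
    root = Path(repository) / model
    expected = [
        root / "config.pbtxt",        # model configuration
        root / version / "model.py",  # Python backend entry point
    ]
    return [str(path) for path in expected if not path.is_file()]


if __name__ == "__main__":
    # The repository path is the one used in this article; adjust as needed.
    missing = missing_model_files("/opt/tritonserver/python_backend/models", "chatglm2-6b")
    print("layout OK" if not missing else f"missing: {missing}")
```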
Step 5: Install the inference environment and supporting software
The CUDA toolkit and the graphics driver must be compatible: the installed driver must meet the minimum version required by the CUDA toolkit.
For the compatibility table, see the official release notes: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cuda-major-component-versions
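Driver versions such as 460.106.00 and 460.27.04 must be compared numerically, component by component; a plain string comparison would rank 460.106.00 below 460.27.04. The sketch below shows such a comparison. The two versions are taken from the installer file names used in this step (the driver installer and the driver version embedded in the CUDA 11.2.0 runfile name); treating the latter as the required minimum is an assumption for illustration, and the authoritative minimums are in the release notes linked above.

```python
def parse_version(version: str) -> tuple:
    """Split a dotted version such as '460.106.00' into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


def driver_satisfies(installed: str, required: str) -> bool:
    """Return True when the installed driver is at least the required version."""
    return parse_version(installed) >= parse_version(required)


if __name__ == "__main__":
    installed = "460.106.00"  # from NVIDIA-Linux-x86_64-460.106.00.run
    required = "460.27.04"    # from cuda_11.2.0_460.27.04_linux.run (assumed minimum)
    print(driver_satisfies(installed, required))
```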
1) Introduction and installation of torch:
PyTorch (torch) is a scientific computing framework that provides efficient tensor operations and automatic differentiation for machine learning and other scientific computing tasks.
Its ecosystem also offers a rich library of pre-trained models and algorithms, so users can quickly build and train models for a variety of machine learning tasks.
pip install ./torch-1.12.1+cu116-cp38-cp38-linux_x86_64.whl
2) Graphics card driver:
sh ./NVIDIA-Linux-x86_64-460.106.00.run
3) Introduction and installation of cuDNN:
The CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated deep neural network (DNN) library provided by NVIDIA. It is designed to optimize and accelerate neural network training and inference in deep learning tasks.
cuDNN provides a set of core algorithms and functions for common deep learning models such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). These algorithms and functions are highly optimized for GPU architectures.
wget https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64/libcudnn8_8.1.1.33-1+cuda11.2_amd64.deb
dpkg -i libcudnn8_8.1.1.33-1+cuda11.2_amd64.deb
4) CUDA:
CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA for GPU programming.
With CUDA, model inference can run synchronously or asynchronously on the GPU, with support for batch processing and multi-GPU parallel computing to improve inference speed and efficiency.
wget https://developer.download.nvidia.com/compute/cuda/11.2.0/local_installers/cuda_11.2.0_460.27.04_linux.run
sudo sh cuda_11.2.0_460.27.04_linux.run
5) Other software packages:
nohup apt-get update
nohup apt-get install -y autoconf autogen clangd gdb git-lfs libb64-dev libz-dev locales-all mosh openssh-server python3-dev rapidjson-dev sudo tmux unzip zstd zip zsh
Step 6: Start triton-server
CUDA_VISIBLE_DEVICES=0 setsid tritonserver --model-repository=/opt/tritonserver/python_backend/models --backend-config=python,shm-region-prefix-name=prefix1_ --http-port 8000 --grpc-port 8001 --metrics-port 8002 --log-verbose 1 --log-file /opt/tritonserver/logs/triton_server_gpu0.log
After a successful start, the server listens on HTTP port 8000, gRPC port 8001, and metrics port 8002.
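Whether the server is actually ready can be confirmed through the readiness endpoint of the HTTP inference protocol, GET /v2/health/ready, which returns status 200 once the server can accept requests. The sketch below uses only the standard library; the function name is_server_ready and the default base URL (the HTTP port mapped in Step 2) are assumptions.

```python
import urllib.error
import urllib.request


def is_server_ready(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """Return True when GET /v2/health/ready answers with HTTP 200."""
    url = base_url.rstrip("/") + "/v2/health/ready"
    # Ignore proxy environment variables so the check reaches the server directly.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all mean "not ready".
        return False


if __name__ == "__main__":
    print("ready" if is_server_ready() else "not ready")
```

Polling this function in a loop after launching tritonserver avoids sending inference requests while the model is still loading.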
3. Test
Call the HTTP interface from a short Python script:
import requests

# Define the model's input data
data = {
    "inputs": [
        {
            "name": "QUERY",
            "shape": [1, 1],
            "datatype": "BYTES",
            "data": ["川普是不是四川人"]  # "Is Trump from Sichuan?"
        },
        {
            "name": "max_new_tokens",
            "shape": [1, 1],
            "datatype": "UINT32",
            "data": [15000]
        },
    ]
}
headers = {
    'Content-Type': 'application/json',
}
# Send the POST request
response = requests.post('http://localhost:8000/v2/models/chatglm2-6b/infer', headers=headers, json=data)
result = response.json()
print(result)
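Response (the model's answer reads: "Trump is not from Sichuan. He was born in Pennsylvania, USA, and is an American politician, entrepreneur, and television personality."):
{
    "model_name": "chatglm2-6b",
    "model_version": "1",
    "outputs": [
        {
            "data": [
                "\n\n 川普不是四川人,他出生于美国宾夕法尼亚州,是一个美国政治家、企业家和电视名人。"
            ],
            "datatype": "BYTES",
            "name": "OUTPUT",
            "shape": []
        }
    ]
}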
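A client usually wants only the generated text, not the whole response object. The sketch below pulls the data of a named output out of a response shaped like the one above, and raises if the server returned an error object instead (failed requests carry an "error" field in the JSON body). The function name extract_output is hypothetical.

```python
def extract_output(result: dict, name: str = "OUTPUT") -> list:
    """Return the data list of the named output from an inference response."""
    if "error" in result:
        raise RuntimeError(f"inference failed: {result['error']}")
    for output in result.get("outputs", []):
        if output.get("name") == name:
            return output.get("data", [])
    raise KeyError(f"output {name!r} not found in response")


if __name__ == "__main__":
    sample = {
        "model_name": "chatglm2-6b",
        "model_version": "1",
        "outputs": [
            {"data": ["example answer"], "datatype": "BYTES", "name": "OUTPUT", "shape": []}
        ],
    }
    # Strip the leading newlines and spaces that the model may emit.
    print(extract_output(sample)[0].strip())
```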
4. Technical direction
CI (Continuous Integration) / CD (Continuous Delivery / Continuous Deployment)
Possible future work:
1. Use k8s to automate container deployment, similar to Xingyun
2. Save a complete Docker image of the large-model runtime environment, so that starting the service only requires downloading the model files to the corresponding directory
3. Deploy multiple open-source models on a single machine, provide an inference interface for each model, and compare the quality of their responses
4. Create a Dockerfile to build the base container automatically
k8s documentation
https://kubernetes.io/zh-cn/docs/tasks/tools/
Install Docker, kubeadm, and kubelet on all nodes
Deploy the Kubernetes master
Deploy the container network plug-in with kubectl
Deploy the Kubernetes nodes and join them to the Kubernetes cluster
Author: JD Technology Yang Jian
Source: JD Cloud Developer Community Please indicate the source when reprinting