ChatGPT replacement - ChatGLM multi-user parallel access deployment

        For the basic environment configuration and deployment of the ChatGLM dialogue model, please refer to the previous blog post "ChatGPT Replacement-ChatGLM Environment Construction and Deployment Operation" at "https://blog.csdn.net/suiyingy/article/details/130370190". However, the default deployment only supports single-user access: when multiple users call it at the same time, their requests are queued. Several related multi-user projects on Github have been tested, but some of them still do not meet the requirements. This section systematically introduces how to let multiple users access the ChatGLM deployment interfaces at the same time, covering http, websocket (streaming output, stream) and the web page. The main contents are as follows.

        (1) api.py http multi-user parallel

        (2) api.py websocket multi-user parallel (streaming output, stream)

        (3) web_demo.py multi-user parallel

        The programs involved in this section can be written from the descriptions in the text, or downloaded from "https://download.csdn.net/download/suiyingy/87742178", which contains all the programs mentioned in this article.

1 api.py http multi-user parallel

1.1 fastapi parallel

        The api.py of the ChatGLM-6B project is an http post service program written with fastapi. For its detailed introduction and calling method, please refer to the previous blog post. After the program is running, when multiple users call the http interface at the same time, the requests are executed in a queue, that is, the current request has to wait until the previous user has obtained a result before it is executed.
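        For reference, the default endpoint in api.py is declared with async def, roughly as in the simplified excerpt below (the actual file also parses max_length, top_p and temperature and logs the result):

# Simplified excerpt of the default api.py endpoint: the blocking model.chat call
# inside this async handler is why concurrent requests are processed one by one.
from fastapi import Request

@app.post("/")
async def create_item(request: Request):
    # app, model and tokenizer are defined elsewhere in api.py
    global model, tokenizer
    json_post_raw = await request.json()
    prompt = json_post_raw.get('prompt')
    history = json_post_raw.get('history')
    response, history = model.chat(tokenizer, prompt, history=history)
    return {"response": response, "history": history, "status": 200}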

        The key to making this interface parallel is to remove async from create_item. The corresponding program is shown below; this function was automatically generated by the RdFast applet. We can write the program according to the following description, or download it from "https://download.csdn.net/download/suiyingy/87742178" (the api_http_one_worker.py file).

# This function was automatically generated by the RdFast applet.
import json
import datetime

from pydantic import BaseModel

class User(BaseModel):
    prompt: str
    history: list

@app.post("/http/noasync")
def create_item(request: User):
    # app, model, tokenizer and torch_gc come from the original api.py
    global model, tokenizer
    json_post_raw = request.dict()
    json_post = json.dumps(json_post_raw)
    json_post_list = json.loads(json_post)
    prompt = json_post_list.get('prompt')
    history = json_post_list.get('history')
    max_length = json_post_list.get('max_length')
    top_p = json_post_list.get('top_p')
    temperature = json_post_list.get('temperature')
    response, history = model.chat(tokenizer,
                                   prompt,
                                   history=history,
                                   max_length=max_length if max_length else 2048,
                                   top_p=top_p if top_p else 0.7,
                                   temperature=temperature if temperature else 0.95)
    now = datetime.datetime.now()
    time = now.strftime("%Y-%m-%d %H:%M:%S")
    answer = {
        "response": response,
        "history": history,
        "status": 200,
        "time": time
    }
    log = "[" + time + "] " + '", prompt:"' + prompt + '", response:"' + repr(response) + '"'
    print(log)
    torch_gc()
    return answer

        We test by sending "hello" and simulating three users accessing the service at the same time. Before the modification, the three users obtained their results after 2.08s, 4.05s and 6.02s respectively; after the modification, the times were 6.73s, 6.78s and 6.88s respectively. Before the modification the requests were executed sequentially, and the last user had to wait 6.02s for a result. After the modification the requests are executed in parallel, and the three users obtain their results almost at the same time.

        Since the model parameters are shared between the threads and the threads run alternately, the total time to obtain all results actually increases in the multi-threaded case. Therefore, this modification is not well suited to the http mode and is more suitable for websocket streaming output. The test program that simulates multi-user calls is shown below.

import json
import time
import requests
import threading

def get_ans(id, prompt):
    t0 = time.time()
    headers = {'Content-Type': 'application/json'}
    url = 'http://IP:Port/http/noasync'  # replace IP and Port with the server address
    data = {'prompt': prompt, 'history': []}
    data = json.dumps(data)
    response = requests.post(url=url, data=data, headers=headers)
    print(id, 'elapsed:', round(time.time() - t0, 2), 's, result:', response.text)

if __name__ == '__main__':
    t1 = threading.Thread(target=get_ans, args=('Thread 1', 'hello'))
    t2 = threading.Thread(target=get_ans, args=('Thread 2', 'hello'))
    t3 = threading.Thread(target=get_ans, args=('Thread 3', 'hello'))
    t1.start()
    t2.start()
    t3.start()

1.2 fastapi multi-worker parallelism

        Fastapi parallelism can also be controlled with the uvicorn startup parameter workers. If workers is set directly to a value greater than 1 in api.py, i.e. "uvicorn.run(app, host='0.0.0.0', port=8000, workers=2)", the error "WARNING: You must pass the application as an import string to enable 'reload' or 'workers'." is reported and the program exits. The correct form is "uvicorn.run('api:app', host='0.0.0.0', port=8000, workers=2)", where api is the name of the current python file.
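        A minimal sketch of the corrected startup block is shown below (the module name api is assumed to match the file name api.py; adjust host, port and workers as needed):

import uvicorn

if __name__ == '__main__':
    # Pass the app as an import string ('file_name:app') so that uvicorn can
    # spawn multiple worker processes.
    uvicorn.run('api:app', host='0.0.0.0', port=8000, workers=2)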

        This is equivalent to each worker process importing api.py and running its own copy of the app, with the number of copies determined by workers. Because each worker imports the module rather than executing the '__main__' block, variables defined inside '__main__' are not visible to it, so the model definition must be placed at the module (global) level as shown below; otherwise the error "NameError: name 'model' is not defined" is reported.

from fastapi import FastAPI
from transformers import AutoTokenizer, AutoModel

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("chatglm-6b-int4-qe", trust_remote_code=True)
model = AutoModel.from_pretrained("chatglm-6b-int4-qe", trust_remote_code=True).half().cuda()
model.eval()

        With multiple workers, similar to the previous section, the running time is basically the same whether or not async is used. However, the base GPU memory usage grows with the number of workers, because each worker loads its own copy of the model. In actual operation a single worker requires about 10GB of GPU memory, including model loading and inference. The GPU memory occupied after model loading for different numbers of workers is listed below. With the method in Section 1.1, a single process needs only 3939MB.

Workers=1, 7329MB
Workers=2, 17875MB
Workers=3, 24843MB
Workers=4, 31811MB
Workers=5, 38779MB

        We can write the program according to the above description, or download it from "https://download.csdn.net/download/suiyingy/87742178" (the api_http_three_worker.py file).

2 api.py websocket multi-user parallel

        A fastapi websocket endpoint can be created as follows; the sample program was automatically generated by the RdFast applet.

# This sample program was automatically generated by the RdFast applet.
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
app = FastAPI()
connected_websockets = {}
@app.websocket("/ws/{client_id}")
async def websocket_endpoint(websocket: WebSocket, client_id: str):
    await websocket.accept()
    connected_websockets[client_id] = websocket
    try:
        while True:
            # Receive a message sent by the client over the websocket
            data = await websocket.receive_text()
            # Broadcast the received message to all connected clients
            for ws in connected_websockets.values():
                await ws.send_text(f"Client {client_id}: {data}")
    except WebSocketDisconnect:
        # Remove the client from the connected set when it disconnects
        del connected_websockets[client_id]

        Combining the above program with ChatGLM gives a websocket api interface for ChatGLM. A sample program is shown below.

@app.websocket("/ws/{client_id}")
async def websocket_endpoint(ws: WebSocket, client_id: str):
    await ws.accept()
    print('connected')
    try:
        while True:
            # Receive the prompt sent by the client over the websocket
            data = await ws.receive_text()
            print('received message:', data)
            resp0 = ''
            for response, history in model.stream_chat(tokenizer, data, [], max_length=2048, top_p=0.7, temperature=0.95):
                print('response:', response)
                # Send only the newly generated part of the streamed response
                res = response.replace(resp0, '')
                resp0 = response
                await ws.send_text(res)
            await ws.send_text('<rdfast stop>')  # custom end-of-stream marker
    except WebSocketDisconnect:
        print('connection closed')

        We can write the program according to the above description, or download it from "https://download.csdn.net/download/suiyingy/87742178" (the api_http_one_worker.py file). The websocket test program is shown below.

# Requires the websocket-client package: pip install websocket-client
from websocket import create_connection

def connect_node(ques):
    ans = ''
    url = "ws://IP:Port/ws/2"  # replace IP and Port with the server address
    ws = create_connection(url)
    ws.send(ques)
    while True:
        try:
            recv_text = ws.recv()
            print(recv_text)
            if '<rdfast stop>' in recv_text:
                print('break')
                break
            ans += recv_text
        except Exception as e:
            print('except: ', str(e))
            recv_text = ws.recv()
            break
    print(ans)
    ws.close()

connect_node('hello')

        Similar to the http interface, when async is used, multiple users calling the websocket interface queue up for results. Unlike the http case, simply removing async does not work here, because the handler has to await the receive and send calls, which requires an async function. Using the multi-worker startup method from Section 1.2 allows multiple users to obtain results at the same time, and the program is basically the same; see also api_http_three_worker.py from "https://download.csdn.net/download/suiyingy/87742178".
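        To check the concurrent behaviour, the single-client test above can be wrapped in threads, analogous to the http test in Section 1.1. A minimal sketch, reusing the connect_node function defined above:

import threading

# Start several websocket clients at once; with the multi-worker startup the
# streamed replies should arrive in parallel rather than one after another.
threads = [threading.Thread(target=connect_node, args=('hello',)) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()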

        In addition, different python packages support different working modes. For example, websocket-server supports multiple users calling the websocket interface at the same time; it is installed with "pip install websocket-server". The error prompt "KeyError: 'upgrade'" may appear when running the program, but it does not affect obtaining the results. For the websocket-server version of the ChatGLM program, see api_ws.py from "https://download.csdn.net/download/suiyingy/87742178".
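        As a rough illustration of that approach, a websocket-server based ChatGLM endpoint could be sketched as below. This is only an assumption of how such a program might be structured, not the actual contents of api_ws.py; it assumes model and tokenizer have been loaded as in api.py and uses the standard websocket-server callbacks.

# Sketch only: assumes model and tokenizer are already loaded as in api.py.
from websocket_server import WebsocketServer

def message_received(client, server, message):
    # Treat each incoming message as a prompt and stream back the increments.
    resp0 = ''
    for response, history in model.stream_chat(tokenizer, message, [],
                                               max_length=2048, top_p=0.7,
                                               temperature=0.95):
        server.send_message(client, response.replace(resp0, ''))
        resp0 = response
    server.send_message(client, '<rdfast stop>')  # custom end-of-stream marker

server = WebsocketServer(host='0.0.0.0', port=8000)
server.set_fn_message_received(message_received)
server.run_forever()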

3 web_demo.py multi-user parallel

        By default, multiple users of web_demo.py queue up for access; with the following startup call they can obtain results at the same time. concurrency_count is the maximum number of users who can obtain results simultaneously, i.e. the concurrency. max_size is the queue length, i.e. how many users may wait in the queue at most. To use it, simply replace the last line of web_demo.py with the startup call below. web_demo2.py is implemented with streamlit, which supports simultaneous access by multiple users by default.

demo.queue(concurrency_count=5, max_size=500).launch(
    share=False,
    inbrowser=True,
    server_name="0.0.0.0",
    server_port=8000)

      This article comes from the AIGC column "Python AIGC large model training and inference from scratch" at "https://blog.csdn.net/suiyingy/article/details/130169592".

