This article is compiled from the speech "Kitex Thrift Streaming Practice in AI Scenarios" delivered by Du Shaofeng, R&D engineer on the ByteDance Flow team, at the CloudWeGo Technology Salon "Cloud Native ✖️ Microservice Architecture and Technical Practice in the AI Era" held in Beijing on March 30, 2024.
Overview
The ByteDance Prompt platform aims to provide users with full-lifecycle Prompt capabilities covering development, tuning, evaluation and application. Among these, streaming large model output to produce a typewriter effect is a crucial feature. An implementation based on SSE (Server-Sent Events) is feasible, but it requires writing an extra HTTP service, which adds development complexity. Polling is simple, but the user experience is poor and feels clumsy. gRPC offers excellent performance, but it can introduce compatibility issues that complicate deployment and maintenance. We therefore implemented the streaming interface with Kitex's Thrift Streaming capability, giving users a smooth and efficient typewriter-style output of large model results.
1. Business background
With the continuous development of AI technology, people's lives are undergoing profound changes. Take ByteDance's AI product Doubao as an example: its agents bring many novel experiences to users. Intelligent bots such as the AI boyfriend and AI girlfriend are particularly popular; they can interact with users in a humorous way while also showing a gentle and considerate side.
All of this rests on a concept closely tied to large models: the prompt. Simply put, a Prompt is the text fed to a pre-trained model to guide it to generate output that meets specific needs. Figuratively speaking, a Prompt is like building an exclusive dream for the large model: through it, we can guide the model to give more appropriate, targeted answers in specific scenarios.
Taking the AI girlfriend as an example, we tell the large model through a carefully designed prompt that its role is a gentle and considerate virtual girlfriend. At the same time, we set some constraints, such as requiring it to communicate with users gently and considerately, and to have skills such as listening, understanding, encouragement and suggestion. In addition, we describe its workflow in detail, for example greeting the user and asking for their name, giving the user a suitable nickname, and then communicating with the user in depth and offering useful suggestions.
Through such prompts, we build a complete "dream" for the large model, letting it understand that it is an AI girlfriend and how it should interact with users. With this prompt activated, when we chat with the large model it replies according to the prompt: when we say hello, it asks for our name, gives us a cute nickname, and then offers encouragement and comfort.
As can be seen from this example, Prompt plays a decisive role in the output of large models in specific scenarios. Furthermore, it will also affect the consumption of tokens and the response time of large models during the output process. Therefore, an excellent Prompt is crucial to improving model output.
2. Demand scenarios
The ByteDance Flow team is building a comprehensive, mature platform to help prompt developers design, iterate, evaluate and optimize their prompts, thereby improving the expressiveness of LLMs (large language models). During the development phase, we plan to provide structured generation and guided generation to help users write efficient, accurate prompts and debug them accordingly.
As development progresses, we will further introduce automatic tuning techniques such as CoT (Chain of Thought) and Few-shot prompting, as well as the APO (Auto Prompt Optimization) method, to help improve the accuracy of Prompt answers. At the same time, we will also provide prompt expansion capabilities to optimize how efficiently large models consume tokens.
In addition, in order to comprehensively evaluate the effectiveness of Prompt, we will score Prompt based on diverse data sets and conduct in-depth analysis of its performance bottlenecks to make targeted improvements. Eventually, we will provide one-click deployment capabilities, allowing developers to easily integrate Prompt capabilities and the large models behind them into their applications.
Of course, none of these functions can be realized without the support of real-time streaming technology. AI products you may have used, such as GPT, Doubao, and Baidu AI search, all reply in a typewriter style after the user asks a question, letting the user see data flowing onto the screen continuously, which improves both the fluency of the conversation and the perceived response speed. This real-time streaming technology is the most basic capability our Prompt platform needs to provide. By splitting data into multiple data streams for network transmission, we can effectively reduce network latency, improve performance, and ensure that users have a better experience when interacting with large language models.
3. Solution
In order to implement the streaming output function, we conducted in-depth research and considered several options:
- Polling
- HTTP SSE
- Kitex gRPC Streaming (Protobuf)
- Kitex Thrift Streaming
First, polling was ruled out because it is inflexible and did not meet our needs. Second, although HTTP-based SSE is feasible, we also have strict requirements for RPC (remote procedure call), so we needed a solution that fits our RPC stack. In addition, we found that streaming support over the Protobuf protocol did not fully meet our needs, particularly for Thrift interfaces. Finally, we noticed Kitex's support for Thrift Streaming. At the time, Kitex Thrift Streaming was still under development; we decided to become its first users and built the basic framework of the entire Prompt platform on it.
In terms of architecture design, we first benchmarked LangChain and built the LLM engineering service. On this basis, we further built the Prompt service to provide basic Prompt management and application capabilities. To interact with the front end, we expose an HTTP interface through the API Gateway. For communication between microservices, we use the Kitex framework to support both streaming and non-streaming interfaces, ensuring efficient data transmission and processing.
Through this solution, we successfully implemented the streaming output function, providing users with a smoother and more efficient AI interaction experience. At the same time, we have also laid a solid foundation for future expansion and optimization.
4. Practice and pitfalls
- Streaming call process
The streaming call process starts when the user initiates a question. This request is first sent to the gateway, which then establishes a connection with the downstream Prompt RPC interface. The Prompt RPC interface further establishes communication with the LLM engineering service, which is responsible for continuously interacting with the model and obtaining the output results of the model. These results are transmitted upward layer by layer in a streaming manner until they reach the gateway layer, and are finally displayed to the user in a streaming manner.
During this process, we wrote a streaming interface in the Prompt service to handle streaming calls. The interface first establishes a connection with the downstream by calling the downstream interface, then continuously receives the streamed packets returned by the downstream in a for loop. Each received packet is passed transparently to the upper layer through the Send method, until an error is encountered or the stream is closed and the loop ends.
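To make this relay loop concrete, here is a minimal Go sketch of the pattern. The stream interfaces and the Chunk struct are stand-ins for the code Kitex generates from the Thrift IDL (the real type and method names depend on the IDL), so it illustrates the shape of the logic rather than the exact service code.

```go
package promptsvc

import (
	"errors"
	"io"
)

// Chunk is a stand-in for the response struct generated from the Thrift IDL.
type Chunk struct {
	Content string
}

// llmStream mimics the client-side stream of the downstream LLM engineering
// service: the Prompt service only needs to Recv from it.
type llmStream interface {
	Recv() (*Chunk, error)
}

// promptStream mimics the server-side stream of the Prompt service's
// streaming method: we only need to Send on it.
type promptStream interface {
	Send(*Chunk) error
}

// forward relays model output chunk by chunk from the downstream stream to
// the caller, exiting when the downstream closes the stream (io.EOF) or when
// any error is returned.
func forward(down llmStream, up promptStream) error {
	for {
		chunk, err := down.Recv()
		if errors.Is(err, io.EOF) {
			return nil // downstream finished normally
		}
		if err != nil {
			return err // surface the error so the loop ends promptly
		}
		if err := up.Send(chunk); err != nil {
			return err
		}
	}
}
```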
During implementation we appreciated the simplicity of Kitex Thrift Streaming, but we also ran into problems, especially with error handling: the code did not produce the expected results at runtime and even drove the CPU load too high.
After further analysis of the error logs, we found error messages in individual requests, specifically a QPM (queries per minute) limit error on the first packet. According to our code logic, the for loop should exit quickly on such errors, but that is not what happened. So we used the troubleshooting facilities provided by Kitex to locate the problem. Kitex provides instrumentation points for RPCStart and RPCEnd, as well as finer-grained events for packet receiving and sending. Analyzing these events, we found that Kitex treated the entire request as a normal response, and a large number of packets were sent on the calling link. Inspecting the details of a single packet also showed that Kitex recognized it as a normal response.
Our preliminary judgment was that business errors were being ignored in Kitex's streaming processing, so errors were not correctly identified. After we communicated with the Kitex team, they made corresponding adjustments, such as adding recognition of biz status errors (business status errors) to the code.
Building on this error handling experience, we analyzed other abnormal scenarios that streaming calls may encounter, such as permission errors in the connection establishment phase, TPM/QPM overruns in the first packet phase, and stream timeouts and content review errors in the intermediate packet phase. We focused on how Kitex Thrift Streaming handles errors in these scenarios, for example whether it can return error information quickly when establishing a connection, and whether it can stop waiting on the stream promptly when the first or an intermediate packet returns an error. After joint adjustments and testing with the Kitex team, error handling in these scenarios finally met expectations.
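The sketch below shows how such a receive loop can distinguish a normal end of stream from a business error. It assumes the biz status error helpers available in recent Kitex versions (kerrors.FromBizStatusError); the stream and Chunk types are again stand-ins for generated code.

```go
package promptsvc

import (
	"errors"
	"io"
	"log"

	"github.com/cloudwego/kitex/pkg/kerrors"
)

// Chunk stands in for the generated response struct.
type Chunk struct{ Content string }

type recvStream interface {
	Recv() (*Chunk, error)
}

// consume drains the stream and classifies how it ended, so that business
// errors (for example a first-packet QPM limit) break the loop instead of
// being treated as part of a normal response.
func consume(st recvStream) error {
	for {
		chunk, err := st.Recv()
		if err == nil {
			log.Printf("chunk: %s", chunk.Content)
			continue
		}
		if errors.Is(err, io.EOF) {
			return nil // stream closed normally
		}
		// Business status errors travel separately from transport errors;
		// recent Kitex versions let the caller unwrap them explicitly.
		if bizErr, ok := kerrors.FromBizStatusError(err); ok {
			log.Printf("biz error: code=%d msg=%s", bizErr.BizStatusCode(), bizErr.BizMessage())
			return err
		}
		return err // transport or framework errors also end the loop
	}
}
```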
- Service governance
In terms of service governance, we pay special attention to two key aspects: timeouts and rate limiting.
First, timeout management is crucial. Because our modules interact with large models, a single interaction can involve response times on the order of seconds or even minutes. We therefore set minute-level timeout limits for stream processing at both the HTTP layer and the RPC layer. This avoids service blocking caused by a for loop that never exits and ensures the stability and availability of the service.
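A minimal sketch of that idea at the RPC layer follows, assuming the deadline is applied via context and that the same context is used to open the downstream stream so a blocked Recv is actually unblocked when the deadline expires. The 3-minute value and the open/send parameters are illustrative assumptions.

```go
package promptsvc

import (
	"context"
	"errors"
	"io"
	"time"
)

type Chunk struct{ Content string }

type recvStream interface {
	Recv() (*Chunk, error)
}

// relayWithDeadline wraps the whole streaming exchange in a minute-level
// deadline so a stalled stream exits instead of blocking the handler forever.
func relayWithDeadline(parent context.Context,
	open func(ctx context.Context) (recvStream, error),
	send func(*Chunk) error) error {

	ctx, cancel := context.WithTimeout(parent, 3*time.Minute)
	defer cancel()

	// Open the downstream stream with the same ctx so the deadline also
	// cancels a Recv that is blocked waiting for the next packet.
	st, err := open(ctx)
	if err != nil {
		return err
	}
	for {
		if err := ctx.Err(); err != nil {
			return err // deadline reached: stop waiting on the stream
		}
		chunk, err := st.Recv()
		if errors.Is(err, io.EOF) {
			return nil
		}
		if err != nil {
			return err
		}
		if err := send(chunk); err != nil {
			return err
		}
	}
}
```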
In terms of rate limiting, Kitex supports rate limiting when a stream is created, but for the LLM scenario we care not only about QPM limits at connection establishment but also about limiting large model token consumption. Large model inference consumes a large number of tokens, and without limits this can exhaust resources and crash the service. Therefore, we use Kitex for connection-establishment rate limiting and, at the same time, use our own distributed components to calculate token consumption for different models and apply token-level rate limiting accordingly. This effectively controls resource usage and avoids service overload.
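The following sketch illustrates the idea of token-level rate limiting with a local, in-memory counter. Our production implementation relies on a distributed component; the per-model quotas and the one-minute window here are illustrative assumptions only.

```go
package promptsvc

import (
	"fmt"
	"sync"
	"time"
)

// tokenLimiter is an in-memory sketch of per-model token budgeting.
type tokenLimiter struct {
	mu     sync.Mutex
	window time.Time
	used   map[string]int64 // tokens consumed per model in the current window
	quota  map[string]int64 // tokens allowed per model per window
}

func newTokenLimiter(quota map[string]int64) *tokenLimiter {
	return &tokenLimiter{
		window: time.Now().Truncate(time.Minute),
		used:   map[string]int64{},
		quota:  quota,
	}
}

// Allow records n tokens against the model's budget and reports whether the
// request should proceed; the counter resets every minute. Models without a
// configured quota are rejected.
func (l *tokenLimiter) Allow(model string, n int64) error {
	l.mu.Lock()
	defer l.mu.Unlock()

	if now := time.Now().Truncate(time.Minute); now.After(l.window) {
		l.window, l.used = now, map[string]int64{}
	}
	if l.used[model]+n > l.quota[model] {
		return fmt.Errorf("token limit exceeded for model %s", model)
	}
	l.used[model] += n
	return nil
}
```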
However, we also have expectations of Kitex: we hope it can provide customizable rate limiting at packet granularity in the future. That would let us define rate limiting rules more flexibly and control resource usage more precisely, further improving the stability and performance of the service.
5. Future expectations
With the continuous development and application of AI technology, we have higher expectations for the capabilities of microservice frameworks in AI scenarios. Especially in terms of convenience, capabilities in AI scenarios, and adaptation of traditional framework capabilities, we hope to see more innovation and progress.
- Convenience
First of all, in terms of convenience, we expect the microservice framework to support integration with more testing tools, especially for testing streaming interfaces. Currently, testing the Kitex Thrift Streaming interface still has limitations and mainly relies on writing non-streaming interfaces that wrap the streaming calls. In the future, we hope the streaming interface can support various testing tools more conveniently, for example through generalized calls, to improve development efficiency.
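The kind of wrapper this refers to can look like the sketch below: a non-streaming method that opens the streaming call, drains all chunks, and returns them as one ordinary response that generic testing tools can invoke. The openStream parameter stands in for the generated streaming client method and is an assumption for illustration.

```go
package promptsvc

import (
	"context"
	"errors"
	"io"
	"strings"
)

type Chunk struct{ Content string }

type recvStream interface {
	Recv() (*Chunk, error)
}

// DebugChat wraps a streaming call in a non-streaming interface: it drains
// every chunk and returns the concatenated text as a single response.
func DebugChat(ctx context.Context,
	openStream func(ctx context.Context) (recvStream, error)) (string, error) {

	st, err := openStream(ctx)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	for {
		chunk, err := st.Recv()
		if errors.Is(err, io.EOF) {
			return sb.String(), nil // stream finished: return the full text
		}
		if err != nil {
			return "", err
		}
		sb.WriteString(chunk.Content)
	}
}
```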
- Capabilities in AI scenarios
With the rapid development of AI technology, more and more products are incorporating AI capabilities to optimize user experience and functionality. In AI scenarios, we have higher expectations of microservice frameworks such as Kitex, hoping they can better support the integration and orchestration of AI components and adapt traditional framework capabilities.
- Out-of-the-box AI component orchestration capabilities
In current development practice, when AI capabilities need to be integrated, developers usually have to handle complex logic themselves, such as assembling and invoking prompts, parsing large model output, and converting the results into a machine-usable form. This not only increases the difficulty of development but also reduces development efficiency. Therefore, we expect the Kitex framework to provide out-of-the-box AI component orchestration capabilities.
Specifically, we expect the framework to pre-install a series of encapsulated AI components, such as prompt components, large model components, result parsing components, and RPC calling components. These components should be highly configurable and extensible so that they can be adapted to different business needs. Developers only need to pass business logic into these components without caring about the implementation details inside the components, so they can focus more on the implementation of business logic.
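Purely as an illustration of what such orchestration could look like (a hypothetical sketch, not an existing Kitex API), the components might share a small interface and be chained so that business code only supplies the steps and their configuration:

```go
package aicomponents

import "context"

// Component is a hypothetical interface that prompt, model, parsing and RPC
// components could share; Run turns one step's output into the next's input.
type Component interface {
	Run(ctx context.Context, in map[string]any) (map[string]any, error)
}

// Chain wires components in order, hiding the plumbing between steps.
func Chain(steps ...Component) Component {
	return chain(steps)
}

type chain []Component

func (c chain) Run(ctx context.Context, in map[string]any) (map[string]any, error) {
	out := in
	var err error
	for _, step := range c {
		if out, err = step.Run(ctx, out); err != nil {
			return nil, err
		}
	}
	return out, nil
}
```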
- Flexible AI component orchestration capabilities
In addition to providing preset AI components, we also expect the Kitex framework to support flexible AI component orchestration capabilities. This means that the framework should provide an expression language or visualization tool that allows developers to easily orchestrate these AI components according to business needs. In this way, developers can define the order of execution, communication methods, parallel processing strategies, etc. between components without going into the details of the interactions between components. This will greatly improve the development efficiency and maintainability of AI applications.
- Adapting traditional framework capabilities to LLM links
In AI scenarios, traditional framework capabilities such as service governance, transparent metadata transmission, and observability remain highly relevant. We therefore look forward to the Kitex framework adapting and optimizing in these areas.
First, in terms of service governance, since AI applications may involve long inference processes, the framework needs to provide timeout and rate limiting strategies for response times on the order of seconds or even minutes. At the same time, it also needs to consider how to handle exceptions related to AI components.
Secondly, in terms of metadata transparent transmission, we expect the framework to support the transmission of metadata between AI components for more refined monitoring and debugging. This will help us better understand the operating status of AI applications and quickly locate problems.
Finally, in terms of observability, we expect the Kitex framework to provide comprehensive logging, tracing, and metrics collection for end-to-end monitoring and analysis of AI links. This will help us discover potential performance bottlenecks and optimization points in time, thereby improving the performance and stability of AI applications.
To sum up, our expectations for the Kitex framework in AI scenarios focus on out-of-the-box AI component orchestration capabilities, flexible AI component orchestration capabilities, and the adaptation of traditional framework capabilities to LLM links. We believe that with continued technological progress and in-depth cooperation between teams, these expectations will gradually become reality, bringing greater convenience and efficiency to the development of AI applications.
In fact, our team has conducted in-depth cooperation with the Kitex team to discuss how to better support AI scenarios in the microservices framework. We believe that in the near future, we will be able to launch an MVP version of the solution to provide business developers with a framework that can be easily and seamlessly integrated with AI capabilities. It's going to be an exciting time and we're looking forward to it.