
This article is the third in the CloudWeGo second anniversary celebration series.

Looking back on the CloudWeGo Rust Team's work over the past year, two keywords sum it up: performance optimization and ecosystem building. This article has three parts. The first briefly reviews how Volo has developed this year. The second focuses on the performance optimizations in Volo. The third covers the areas our future work will focus on.

1. Volo this year

In August 2022, we published an official announcement titled "The first RPC framework in China based on the Rust language - Volo is officially open source", which introduced Volo's features in detail, along with the components open-sourced alongside it, such as Pilota, Motore, and Metainfo. A year later, Volo and its related components have changed a lot. In brief: Volo - feature completion and performance optimization; Pilota - capability upgrades; Motore - stabilization; Metainfo - ease of use.

In particular, we would like to walk you through some of Volo's key milestones and technical updates this year.

Shortly after open-sourcing, we received our first PR from community member @anwentec. It enables users to develop with Volo on Windows, which greatly improved the framework's multi-platform support.
Soon afterwards came the first major performance optimization since the release: the codec refactoring. It was originally inspired by a PR that community member @ii64 proposed in Pilota to support the Thrift protocol. After some discussion, we found that Volo at the time could not properly support user-supplied custom codecs, which led to the codec refactoring and optimization Volo has today.
It is worth mentioning that during this process we also released two crates, linkedbytes and faststr, which not only helped with the optimization but also enrich Rust's open source ecosystem.
Finally, on the codec side, we also used unsafe code to bypass some bounds checks, which helps the compiler generate more efficient SIMD instructions and greatly improves performance. For more detail on Volo's progress, see the Release Notes on the CloudWeGo official website.

2. Performance optimization

In an RPC framework, the most expensive parts are serialization and network communication, and our performance work focuses mainly on these two. The figure below shows a complete RPC call path; our optimizations are concentrated on its CPU-intensive and IO-intensive tasks. The codec refactoring and the unsafe codec optimization described in detail below both target the encode and decode steps of serialization. If you want to get involved in in-depth performance optimization, they should be a good reference.

2.1 Codec refactoring optimization

This optimization is mainly about zero-copy handling of memory. When making an RPC call, the user's request structure has to be serialized into a binary byte stream held in user-space memory, which is then written into kernel-space memory through the write system call to be sent. The zero-copy part we optimized is the first step, storing into user-space memory. Most implementations pay a copy when serializing types such as String and Vec<u8>, because the write system call needs its input to be one contiguous block of memory. So the question is: if contiguous memory is not required, can this copy be skipped? It can. We save the copy by reusing the memory already held in the user's request structure and stringing those pieces of memory together as a linked list for writing.

Reusing memory inevitably means introducing reference counting to decide when that memory can be released. As a result, the original String and Vec<u8> types no longer meet our needs; we need types like Arc<String> and Arc<Vec<u8>>. Fortunately, for Vec<u8> the open source community already offers the Bytes structure in the bytes library as an alternative. String, however, has no good replacement, which is one of the reasons the faststr library was born.
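As a minimal sketch of why reference counting makes reuse possible, the snippet below uses only the standard library. The `Request` type and the `borrow_for_encoding` function are hypothetical names made up for illustration; they are not part of Volo, bytes, or faststr.

```rust
use std::sync::Arc;

/// Hypothetical request type: both fields are reference-counted, so an
/// encoder can keep a handle to them without copying the underlying bytes.
pub struct Request {
    pub name: Arc<String>,
    pub payload: Arc<Vec<u8>>,
}

/// Hands out cheap handles to the request's buffers.
/// Each `Arc::clone` only increments a counter; no bytes are copied, and
/// the memory is freed once the last handle is dropped.
pub fn borrow_for_encoding(req: &Request) -> (Arc<String>, Arc<Vec<u8>>) {
    (Arc::clone(&req.name), Arc::clone(&req.payload))
}
```

The handles returned here point at the same allocations as the request itself, which is the property the encoder relies on when it defers the actual write.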

The faststr library mainly provides the FastStr structure, whose representation is shown below. It is essentially a collection of several string types, which spares users much of the mental burden of choosing a string type. Besides meeting the memory-reuse requirement above, it also optimizes small strings, for example by storing them directly on the stack. Some readers may ask: if &str can do the job, why is faststr needed at all? In practice &str is not always enough, because in some scenarios we cannot express its lifetime. For a detailed explanation, see the faststr documentation.
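To make the idea of "a collection of string types" concrete, here is a rough sketch using only the standard library. `MiniStr`, its variants, and the `INLINE_CAP` threshold are all hypothetical; the real FastStr layout may differ, so treat this as an illustration of the approach rather than the library's implementation.

```rust
use std::sync::Arc;

/// Hypothetical inline threshold; the value is chosen only for illustration.
const INLINE_CAP: usize = 22;

/// Hypothetical string type that bundles several representations.
#[derive(Clone)]
pub enum MiniStr {
    /// Static data: no allocation and no reference count.
    Static(&'static str),
    /// Small strings are stored inside the value itself (on the stack).
    Inline { len: u8, buf: [u8; INLINE_CAP] },
    /// Larger strings are shared through a reference count.
    Shared(Arc<str>),
}

impl MiniStr {
    /// Picks the inline form for short input and the shared form otherwise.
    pub fn new(s: &str) -> Self {
        if s.len() <= INLINE_CAP {
            let mut buf = [0u8; INLINE_CAP];
            buf[..s.len()].copy_from_slice(s.as_bytes());
            MiniStr::Inline { len: s.len() as u8, buf }
        } else {
            MiniStr::Shared(Arc::from(s))
        }
    }

    pub fn from_static(s: &'static str) -> Self {
        MiniStr::Static(s)
    }

    pub fn as_str(&self) -> &str {
        match self {
            MiniStr::Static(s) => *s,
            MiniStr::Inline { len, buf } => {
                // The bytes were copied from a valid &str in `new`.
                std::str::from_utf8(&buf[..*len as usize])
                    .expect("inline bytes are valid UTF-8")
            }
            MiniStr::Shared(s) => &**s,
        }
    }
}
```

Unlike a plain &str, a value of this kind owns or shares its data, so it can be stored in a struct or moved across tasks without a lifetime parameter.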

The linkedbytes library mainly borrows the idea of a linked list and writes the String and Vec<u8> memory we reused above through the writev system call. LinkedBytes has two main parts: a bytes field that temporarily stores memory that is neither String nor Vec<u8>, and a list field that strings the pieces of memory together. The insert logic is simple: when a Bytes is inserted, the contiguous memory currently held in the temporary buffer is first split off and appended to list, and then the incoming Bytes is appended. That's it.
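The insert logic described above can be sketched with the standard library alone. The `Chain` and `Node` types and their methods are hypothetical stand-ins, not the linkedbytes API; shared buffers are modelled with `Arc<[u8]>` instead of `Bytes`, and the write goes through `Write::write_vectored`, which sockets and files typically back with writev on Unix.

```rust
use std::io::{self, IoSlice, Write};
use std::sync::Arc;

/// One segment: bytes we encoded ourselves, or a buffer shared with the request.
pub enum Node {
    Owned(Vec<u8>),
    Shared(Arc<[u8]>),
}

impl Node {
    fn as_slice(&self) -> &[u8] {
        match self {
            Node::Owned(v) => v.as_slice(),
            Node::Shared(b) => &b[..],
        }
    }
}

/// Hypothetical stand-in for LinkedBytes.
#[derive(Default)]
pub struct Chain {
    /// Scratch buffer for small, freshly encoded fields.
    bytes: Vec<u8>,
    /// Finished segments, in wire order.
    list: Vec<Node>,
}

impl Chain {
    pub fn put_slice(&mut self, data: &[u8]) {
        self.bytes.extend_from_slice(data);
    }

    /// Splits off the scratch buffer, then appends the shared buffer without copying it.
    pub fn insert(&mut self, shared: Arc<[u8]>) {
        if !self.bytes.is_empty() {
            self.list.push(Node::Owned(std::mem::take(&mut self.bytes)));
        }
        self.list.push(Node::Shared(shared));
    }

    /// Writes all segments with vectored I/O, retrying after partial writes.
    pub fn write_to<W: Write>(&self, w: &mut W) -> io::Result<()> {
        let mut segments: Vec<&[u8]> = self
            .list
            .iter()
            .map(Node::as_slice)
            .chain(std::iter::once(self.bytes.as_slice()))
            .filter(|s| !s.is_empty())
            .collect();
        let mut start = 0; // index of the first segment not fully written
        while start < segments.len() {
            let slices: Vec<IoSlice<'_>> =
                segments[start..].iter().map(|s| IoSlice::new(*s)).collect();
            let mut written = w.write_vectored(&slices)?;
            if written == 0 {
                return Err(io::ErrorKind::WriteZero.into());
            }
            // Skip segments that were written completely.
            while start < segments.len() && written >= segments[start].len() {
                written -= segments[start].len();
                start += 1;
            }
            // Trim the segment that was written only in part.
            if start < segments.len() {
                let current: &[u8] = segments[start];
                segments[start] = &current[written..];
            }
        }
        Ok(())
    }
}
```

The point of the design is that a large payload is never copied into the scratch buffer; only small framing bytes are, and the kernel gathers the pieces at write time.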

2.2 Unsafe codec optimization

This optimization is mainly about helping the compiler generate efficient assembly code. Take encode as an example. Normally, when writing a Vec<i64> to memory, it is natural to write the code shown below: iterate over the Vec and call the put_i64() method for each element.
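A minimal sketch of that straightforward encoder follows. To keep it free of external crates, `put_i64` here is a hand-written stand-in for the bytes crate's `BufMut::put_i64` (which also writes big-endian); `encode_checked` is a hypothetical name, and this is not Volo's actual code.

```rust
/// Stand-in for `BufMut::put_i64`: appends `v` in big-endian byte order.
/// `extend_from_slice` checks the remaining capacity, and grows the buffer
/// if needed, on every single call.
pub fn put_i64(buf: &mut Vec<u8>, v: i64) {
    buf.extend_from_slice(&v.to_be_bytes());
}

/// Encodes each value with a separate, individually checked write.
pub fn encode_checked(values: &[i64], buf: &mut Vec<u8>) {
    for &v in values {
        put_i64(buf, v);
    }
}
```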

But a closer look at the implementation of put_i64() shows that on every write it first checks whether there is enough memory, and grows the buffer before writing if there is not. So if we allocate enough memory up front, the bounds check can be dropped entirely. With a slight modification, the code becomes what is shown below.
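One way to express "reserve once, then write without per-element checks" is sketched below with the standard library only. `encode_unchecked` is a hypothetical name and the code is an illustration of the technique, not Volo's implementation; whether a given compiler and target actually vectorize this loop is not guaranteed.

```rust
/// Appends all values in big-endian order with a single capacity check.
pub fn encode_unchecked(values: &[i64], buf: &mut Vec<u8>) {
    const WIDTH: usize = std::mem::size_of::<i64>();
    // The only capacity check: make room for every value up front.
    buf.reserve(values.len() * WIDTH);
    let old_len = buf.len();
    unsafe {
        // SAFETY: `reserve` guarantees at least `values.len() * WIDTH` bytes
        // of capacity past `old_len`, so every write below stays in bounds.
        let mut dst = buf.as_mut_ptr().add(old_len);
        for &v in values {
            let bytes = v.to_be_bytes();
            std::ptr::copy_nonoverlapping(bytes.as_ptr(), dst, WIDTH);
            dst = dst.add(WIDTH);
        }
        // SAFETY: all bytes in `old_len..new_len` were initialized above.
        buf.set_len(old_len + values.len() * WIDTH);
    }
}
```

The unsafe block is sound only because the reservation and the number of bytes written are computed from the same `values.len()`; changing one without the other would break that invariant.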

With the code written, the next step is performance testing: write a simple bench and run it. The result surprised us. You may be thinking what we thought before, that merely removing a bounds check should not improve performance much. In fact, as the figure below shows, the gain is 7 to 8 times.

So we have to look more closely at why. As the saying goes, deep optimization cannot avoid assembly. Let's compile both versions to assembly and compare. The following two pictures show part of the assembly code for the write path.

Readers familiar with SIMD instructions will have seen it right away: the assembly generated after removing the bounds check uses SIMD instructions to speed up the memory writes. A single instruction writes multiple values, so the performance gain makes sense.

3. Future prospects

Finally, here is a preview of some projects we are currently experimenting with and the areas we will focus on optimizing in the future.

3.1 New projects

The first is the Shmipc-rs project. Readers familiar with CloudWeGo may know that Shmipc is already open source, but so far only as the Spec and the Go implementation. Shmipc-rs is the Rust implementation, and it will later be integrated into Volo to improve performance. Shmipc is an inter-process communication mechanism based on shared memory, mainly suited to large-packet, high-throughput scenarios.

The second is the Volo-http project, which offers a development experience consistent with the Axum framework widely used in the community. Its middleware is built on our own open source Motore, which should bring some performance improvements in the future. We also expect to combine it with the Volo-gRPC project to provide features such as a Gateway. It is already usable, and everyone is welcome to try it and build it with us.

3.2 Ease of use optimization

The first is documentation. Volo currently has many features, but most lack documentation explaining them, so users cannot easily use and explore them. We will track the work of filling in the documentation through issues, and everyone is welcome to take part.

The second is best practices. Volo currently ships only a few simple example demos for users to learn from; there are no small or medium-sized projects that users can study to understand the framework. This also needs strengthening. If there is a project you would recommend implementing, you are welcome to open an issue to discuss it.

The above is a review and outlook for the first anniversary of Volo's open-sourcing, written on the occasion of CloudWeGo's second anniversary. I hope it is helpful to everyone.

Project address

GitHub: https://github.com/cloudwego
Official website: www.cloudwego.io

Volo's first anniversary of open source: performance optimization and ecosystem building