ByteDance Kitex in practice in Semir's e-commerce scenario

As the number of enterprise users grows, the CloudWeGo team will continue to share how different enterprises have adopted CloudWeGo in response to their needs and technical problems in different scenarios. These case studies cover the technical problems faced by different industries, the reasoning behind technology selection, and the resulting performance and usage experience, in order to help more users adopt CloudWeGo.

In recent years, the e-commerce industry has developed rapidly, and Semir's online e-commerce business has surged, bringing high-concurrency, high-performance requirements. Semir has officially become an enterprise user of CloudWeGo. By running Kitex on Istio, it has greatly improved its capacity to handle highly concurrent workloads. The following content comes from a talk by Semir development engineer Liang Dongpo.

1. Semir e-commerce order circulation center - Tianshu

Business growth

Tianshu, Semir's order circulation center, connects to the major e-commerce platforms, processes orders, commodities, refunds and other information in a unified manner, and then passes them to downstream systems. It is the intermediate hub between the platforms and the downstream systems. Semir e-commerce currently operates on dozens of e-commerce platforms, such as Tmall, Doudian, JD.com and Pinduoduo. Because the interfaces and connection methods of these platforms are not uniform, we developed this system specifically to handle all platform integrations in one place, convert the data into a unified format, and send it to downstream systems such as the OMS and WMS. The system has played an important role in handling peak order traffic during e-commerce events such as 6.18 and Double Eleven.

From 2015 to 2021, Semir's Double Eleven business volume grew very rapidly. In 2015, Double Eleven sales performance was 300 million+, while last year's Double Eleven sales performance was 2 billion+. In 2021, the gross merchandise volume (GMV) exceeded 10 billion. As the business grows, the requirements for the performance and stability of the order system keep rising. Moreover, as the system scales, the number of Pods and services in the cluster continues to increase, which poses a great challenge to the underlying architecture. The platforms already migrated from the old system include Youzan, Douyin, Pinduoduo and Kuaishou, and the number of Pods in the cluster has exceeded 200. Once platforms such as JD.com, Vipshop and Tmall are onboarded, the number of Pods will grow several-fold, which requires a mature system architecture as support.
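The "unified format" idea described above can be illustrated with a small adapter sketch. This is not Tianshu's actual code: the UnifiedOrder type, the Adapter interface, and both payload shapes are hypothetical and chosen only to show how platform-specific push payloads might be normalized before being handed to downstream systems.

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
)

// UnifiedOrder is a hypothetical platform-independent order record
// handed to downstream systems such as the OMS and WMS.
type UnifiedOrder struct {
	Platform   string
	OrderID    string
	AmountCent int64
}

// Adapter converts one platform's raw push payload into a UnifiedOrder.
type Adapter interface {
	Parse(raw []byte) (UnifiedOrder, error)
}

// douyinAdapter assumes an illustrative payload shape; the real
// platform API is not described in this article.
type douyinAdapter struct{}

func (douyinAdapter) Parse(raw []byte) (UnifiedOrder, error) {
	var p struct {
		ShopOrderID string `json:"shop_order_id"`
		PayAmount   int64  `json:"pay_amount"` // assumed to be in cents
	}
	if err := json.Unmarshal(raw, &p); err != nil {
		return UnifiedOrder{}, err
	}
	return UnifiedOrder{Platform: "douyin", OrderID: p.ShopOrderID, AmountCent: p.PayAmount}, nil
}

// youzanAdapter likewise assumes an illustrative payload shape.
type youzanAdapter struct{}

func (youzanAdapter) Parse(raw []byte) (UnifiedOrder, error) {
	var p struct {
		Tid   string  `json:"tid"`
		Total float64 `json:"total"` // assumed to be in currency units
	}
	if err := json.Unmarshal(raw, &p); err != nil {
		return UnifiedOrder{}, err
	}
	cents := int64(math.Round(p.Total * 100))
	return UnifiedOrder{Platform: "youzan", OrderID: p.Tid, AmountCent: cents}, nil
}

// adapters maps a platform name to the adapter that understands it.
var adapters = map[string]Adapter{
	"douyin": douyinAdapter{},
	"youzan": youzanAdapter{},
}

// Normalize picks the adapter for the platform and parses the payload.
func Normalize(platform string, raw []byte) (UnifiedOrder, error) {
	a, ok := adapters[platform]
	if !ok {
		return UnifiedOrder{}, fmt.Errorf("no adapter for platform %q", platform)
	}
	return a.Parse(raw)
}

func main() {
	o, err := Normalize("douyin", []byte(`{"shop_order_id":"D1","pay_amount":1990}`))
	fmt.Println(o, err)
}
```

With this shape, onboarding another platform would only mean adding one more adapter to the map, while downstream code keeps consuming the same record type.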

Problems faced

With the rise of live streaming, we have invited influencer hosts and celebrities to sell goods through live broadcasts. During a live broadcast, the order volume often spikes within a few seconds. After orders are pushed to the system, if the system processes them slowly, they cannot flow into the downstream systems in time, and the downstream OMS does not know that such a large order volume has been generated. Inventory can then no longer be synchronized, which leads to overselling. In the e-commerce industry, overselling is a very serious problem. If orders cannot be shipped in time after users place them, we not only need a lot of manpower to explain and apologize to customers, but also have to compensate users with coupons. We may even receive a large number of complaints, which seriously damages our reputation on the e-commerce platform, and the platform will also penalize us. We have experienced a case in which, with GMV exceeding 10 million, the order system was delayed by more than half an hour, which had a great impact on us.

Therefore, during big promotions such as Double Eleven and 6.18, and especially when the order volume skyrocketed within a short time during live broadcasts, Semir's original system architecture could no longer keep up and could not process order data in a timely manner. This affected our shipping and inventory synchronization, and indirectly caused various kinds of asset losses.

Technical challenges

The technical challenges are mainly in the following three aspects:

High concurrency: In e-commerce scenarios, whether user-facing, such as flash sales (seckill), or business-facing, such as order processing, a system that cannot handle high concurrency will be hard to scale and cannot keep up with business growth.
High performance: Besides using high concurrency to process business quickly, raw performance is also a challenge. For example, in the current pandemic climate, all industries are reducing costs and increasing efficiency. If performance problems cannot be solved, server resources will keep growing, which greatly increases enterprise costs.
Technical support: Most of the resources and energy of e-commerce companies go to the sales and operations side, while investment in technology is relatively weak. Therefore, technology selection needs to be weighed in terms of reliability, security, and support.

2. Project technology selection

How to choose

As for the development language, no language is simply good or bad; what matters is whether it suits the scenario. We chose Golang after weighing performance, multithreading, compilation, and efficiency.

For the microservice framework, the team evaluated Google's open-source gRPC and ByteDance's open-source CloudWeGo-Kitex and ran performance stress tests on both. Based on the stress tests carried out by our test engineers, we finally chose CloudWeGo-Kitex as our microservice framework.

There are two main reasons for choosing Kitex. First, Kitex is backed by a strong technical team that provides timely and effective technical support. Second, in our stress tests, Kitex performed better than the other microservice frameworks.
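To make the concurrency consideration concrete, here is a minimal sketch of a bounded worker pool built from goroutines and a channel, a common Go pattern for absorbing a burst of orders without unbounded concurrency. It is an illustration only, not Semir's implementation: ProcessAll, errRejected, and the order IDs are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errRejected is a hypothetical error used to simulate a failed order.
var errRejected = errors.New("order rejected")

// ProcessAll runs handle over every order ID using a fixed number of
// goroutines and returns how many orders failed.
func ProcessAll(orderIDs []string, workers int, handle func(string) error) (failed int) {
	if workers < 1 {
		workers = 1 // avoid blocking forever when no worker is requested
	}
	jobs := make(chan string)
	var (
		wg sync.WaitGroup
		mu sync.Mutex // protects failed
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range jobs {
				if err := handle(id); err != nil {
					mu.Lock()
					failed++
					mu.Unlock()
				}
			}
		}()
	}
	for _, id := range orderIDs {
		jobs <- id
	}
	close(jobs) // lets the workers exit their range loops
	wg.Wait()
	return failed
}

func main() {
	ids := make([]string, 1000)
	for i := range ids {
		ids[i] = fmt.Sprintf("order-%d", i)
	}
	failed := ProcessAll(ids, 8, func(id string) error {
		if id == "order-13" {
			return errRejected
		}
		return nil
	})
	fmt.Println("failed:", failed)
}
```

The worker count caps how many orders are in flight at once, which is one simple way to keep a sudden spike from exhausting downstream resources.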

About Microservices

Using a microservice framework inevitably raises the question of service registration: should we pick a common open-source registry (ZooKeeper, Eureka, Nacos, Consul, or etcd), or go directly with a cloud-native service mesh (Istio)? Below, the two forms of microservice cluster are introduced in terms of traffic forwarding, service registration, and service discovery.

The first is Kubernetes Native. Each node in the Kubernetes cluster runs a Kube-proxy component, which communicates with the Kubernetes API Server, watches for changes in Services and nodes, and performs load-balanced forwarding. This forwarding uses the TCP protocol by default. Since K8s load balancing does not support the RPC protocol (HTTP/2), an additional third-party service registry is required.

The second is an Istio-based service mesh, which needs no additional registry component. Istio takes over the K8s network and, by means of a Sidecar Proxy, extracts traffic control from the service layer in Kubernetes. Istio's control plane is built on Envoy's xDS protocol, and the routing and forwarding function originally performed by Kube-proxy is moved into each Pod. Istio provides traffic management, policy control, observability and more, and decouples the "application" from the "network", so no third-party registry is needed.

So how do registration and discovery work in these two forms? The left side of the figure below shows the usual flow with a service registry: the target service first registers its instances with the registry; the client then obtains the target instances from the registry, selects one according to the load-balancing strategy, and completes the request.

On the right is the Istio-based service mesh. Roughly, when the Client accesses the target service, the traffic first enters the Service's Proxy and is intercepted by it. The Proxy obtains the mapping between services and service instances from the service discovery component (Pilot), together with the load-balancing strategy, and uses them to select an instance of the Service. In general, the two flows are roughly the same, but they are implemented differently, and each has its own strengths.
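The registry-based flow described above (register, discover, then select an instance by a load-balancing strategy) can be sketched with an in-memory stand-in. This is a simplified illustration, not the API of ZooKeeper, Nacos, or any other real registry; Registry, RoundRobin, and the instance addresses are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Registry is a minimal in-memory stand-in for a service registry.
type Registry struct {
	mu        sync.RWMutex
	instances map[string][]string // service name -> instance addresses
}

func NewRegistry() *Registry {
	return &Registry{instances: make(map[string][]string)}
}

// Register is what a server instance calls when it starts.
func (r *Registry) Register(service, addr string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.instances[service] = append(r.instances[service], addr)
}

// Discover is what a client calls to obtain the current instances.
// It returns a copy so callers cannot mutate the registry's state.
func (r *Registry) Discover(service string) []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return append([]string(nil), r.instances[service]...)
}

// RoundRobin is a client-side load-balancing strategy.
// It is not safe for concurrent use; a real client would guard next.
type RoundRobin struct{ next int }

// Pick returns the next address in rotation.
func (b *RoundRobin) Pick(addrs []string) (string, error) {
	if len(addrs) == 0 {
		return "", errors.New("no instances available")
	}
	addr := addrs[b.next%len(addrs)]
	b.next++
	return addr, nil
}

func main() {
	reg := NewRegistry()
	reg.Register("server-douyin", "10.0.0.1:8888")
	reg.Register("server-douyin", "10.0.0.2:8888")

	lb := &RoundRobin{}
	for i := 0; i < 3; i++ {
		addr, err := lb.Pick(reg.Discover("server-douyin"))
		fmt.Println(addr, err)
	}
}
```

In the mesh-based flow, the same two responsibilities (instance lookup and instance selection) still exist, but they move out of the application and into the sidecar proxy.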

The basic structure of the Tianshu system

When mature platforms such as Douyin, Kuaishou, Pinduoduo, and Youzan generate orders, they send them to the service mesh as push messages. The orders pass through the Ingress Gateway, which manages traffic entering the mesh, and are forwarded by VirtualService to different services in the mesh, which then call one another internally. Kitex serves as the RPC framework for the microservices, and both service discovery and service registration are handled by the cloud-native service mesh Istio.
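The forwarding step described above is configured declaratively in Istio, but the underlying idea, matching a request path and choosing a destination service, can be shown with a conceptual sketch. The code below is not Istio configuration or Istio code; the Rule type, the path prefixes, and the server-youzan host are hypothetical, and only server-douyin.default.svc.cluster.local appears in this article.

```go
package main

import (
	"fmt"
	"strings"
)

// Rule maps a URI prefix to the in-mesh host that should receive the request.
type Rule struct {
	Prefix      string
	Destination string
}

// Route returns the destination of the first rule whose prefix matches
// the path, mirroring first-match-wins evaluation of ordered rules.
func Route(rules []Rule, path string) (string, bool) {
	for _, r := range rules {
		if strings.HasPrefix(path, r.Prefix) {
			return r.Destination, true
		}
	}
	return "", false
}

func main() {
	// Hypothetical prefixes for platform push endpoints.
	rules := []Rule{
		{Prefix: "/push/douyin", Destination: "server-douyin.default.svc.cluster.local"},
		{Prefix: "/push/youzan", Destination: "server-youzan.default.svc.cluster.local"},
	}
	dest, ok := Route(rules, "/push/douyin/order")
	fmt.Println(dest, ok)
}
```

Because the rules live in mesh configuration rather than application code, a new platform can be routed to a new service without changing the existing services.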

Kitex accesses Istio

So how does Kitex connect to Istio? As shown in the figure below, after the server registers the service, the client is created with its server host set to the service's internal address in the actual cluster, for example server-douyin.default.svc.cluster.local. As mentioned above, a third-party service registry is no longer needed.

Since Kitex uses the gRPC protocol here, the gRPC protocol must be specified when creating the client.
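The two client settings described above, the cluster-internal host and the gRPC transport, can be sketched as follows. The runnable part only builds the address; the Kitex call is shown in a comment because the generated client package depends on the project's IDL. The package name douyinservice and port 8888 are assumptions, while client.WithHostPorts and client.WithTransportProtocol(transport.GRPC) are Kitex client options.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// ClusterAddr builds the cluster-internal address of a Kubernetes Service:
// <service>.<namespace>.svc.cluster.local:<port>.
// "cluster.local" is the default cluster domain and may differ per cluster.
func ClusterAddr(service, namespace string, port int) string {
	host := service + "." + namespace + ".svc.cluster.local"
	return net.JoinHostPort(host, strconv.Itoa(port))
}

func main() {
	// Port 8888 is an assumption; use the port your Service exposes.
	addr := ClusterAddr("server-douyin", "default", 8888)
	fmt.Println(addr)

	// With Kitex, the address and transport would then be passed as client
	// options. The generated package name "douyinservice" is hypothetical:
	//
	//   cli, err := douyinservice.NewClient("server-douyin",
	//       client.WithHostPorts(addr),
	//       client.WithTransportProtocol(transport.GRPC))
}
```

Because the address is a plain Kubernetes Service DNS name, the client needs no resolver or registry extension; the sidecar takes care of choosing the actual instance.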

How are the client and server deployed in Istio? There are two steps:

1. Enable automatic injection for the namespace: kubectl label namespace default istio-injection=enabled. After injection, each Pod contains two important containers: istio-proxy, which is responsible for traffic interception and proxying, such as traffic forwarding; and server-douyin, the application container that we develop.

2. Deploy the image built from the Go code to the cluster.

For example, we created a Deployment named server-douyin; because it acts as a server, a corresponding Service must also be created.

Stress test comparison

We compared Kitex and gRPC under the same server hardware resources and network environment as follows:

Stress testing tool: JMeter;
Servers: Alibaba Cloud ECS (8 vCPU, 16 GiB), 5 units;
Cluster: Kubernetes 1.20.11;
Service mesh: Istio v1.10.5.39.

The comparison shows that, within the same fixed time window, Kitex processes more orders per unit of time. For a fixed number of orders, Kitex takes less time to process them, and the larger the order quantity, the more obvious the performance difference. Overall, Kitex performs very well when handling large order volumes.

Why Kitex Provides a Performance Benefit

When the CloudWeGo team came to Semir for technical support, they mentioned some performance optimizations, including those in the self-developed network library Netpoll, such as:

Connection utilization;
Scheduling delay optimization;
I/O call optimization;
Serialization/deserialization optimization;
......

For more information, please check the CloudWeGo official website: CloudWeGo

CloudWeGo Team Technical Support

After Semir chose Kitex, the CloudWeGo technical team provided ample technical support, including on-site support and remote assistance. This also gives the Semir team confidence in using Kitex: whatever technical problems they encounter, a strong technical team is there to help solve them.

3. Follow-up planning

How to choose between Thrift and Protobuf

We chose gRPC with Protobuf at the beginning of the project because we had chosen the Istio service mesh, and we chose Istio mainly because it offers traffic forwarding, service governance and other functions. For example, in the e-commerce scenario, the push messages of different platforms can all be forwarded to different services through VirtualService, which is quite convenient. However, because the routing and forwarding function originally performed by Kube-proxy now sits in each Pod, response latency increases. In addition, since the Sidecar intercepts traffic and adds more hops, it consumes more resources.

As for Thrift, it is the protocol Kitex supports by default. ByteDance has made many performance optimizations for it, such as using SIMD to optimize Thrift encoding, reducing function calls, and reducing memory operations. It has also open-sourced Frugal, a high-performance Thrift codec that requires no generated code, delivers high performance (in multi-core scenarios, Frugal's performance can reach 5 times that of traditional encoding and decoding methods) and is stable, which further improves performance and development efficiency.

Therefore, Semir is currently considering switching to the Thrift protocol in the architecture of the next system version.

Service, win-win cooperation

The e-commerce products developed by Semir can not only be used by its own e-commerce brands but can also serve others. Therefore, we also hope to have deeper technical cooperation with the Kitex team, for example on an e-commerce cloud.

4. More information

Official website address: CloudWeGo
Project address: https://github.com/cloudwego
Kitex: https://github.com/cloudwego/kitex
Netpoll: https://github.com/cloudwego/netpoll
Hertz: https://github.com/cloudwego/hertz
Thriftgo: https://github.com/cloudwego/thriftgo
Netpoll-http2: https://github.com/cloudwego/netpoll-http2