Getting to know the high-performance service governance framework Kmesh

In March, the openEuler community launched an innovative project: Kmesh, a high-performance service governance framework that brings a new data-plane experience to the service mesh through architectural innovation. Starting from the service mesh itself, this article walks you through the past and present of Kmesh.

Kmesh has been released with openEuler 23.03; interested readers are welcome to download and try it.

Repository: https://gitee.com/openeuler/Kmesh

What is a service mesh

The term service mesh was coined in 2016 by Buoyant, the company that developed Linkerd. William Morgan, Buoyant's CEO, gave the original definition of a service mesh:

A service mesh is a dedicated infrastructure layer for handling service-to-service communication. It’s responsible for the reliable delivery of requests through the complex topology of services that comprise a modern, cloud native application. In practice, the service mesh is typically implemented as an array of lightweight network proxies that are deployed alongside application code, without the application needing to be aware.

Roughly speaking, a service mesh is the infrastructure layer that handles communication between services, providing transparent and reliable network communication for modern cloud-native applications through an array of network proxies.

In essence, the service mesh addresses how microservices can communicate with one another more effectively. Through governance rules such as load balancing, canary routing, circuit breaking, and rate limiting, it orchestrates traffic sensibly and maximizes the service capacity of the cluster. It is the product of the evolution of service governance.

We can divide the evolution of service governance into three generations and compare them briefly; the evolution shows service governance capabilities being progressively decoupled from business logic and sinking into the infrastructure.

As the infrastructure layer for handling inter-service communication, the service mesh effectively fills the gaps in Kubernetes' microservice governance. As a next-generation cloud-native technology, it has become a key component of cloud infrastructure.

As a trending technology direction in recent years, many service mesh implementations have emerged in the industry, such as Linkerd, Istio, Consul Connect, and Kuma. They share a similar software architecture; Istio is used as the example in what follows.

Taking a Kubernetes cluster as an example: when a Pod instance is created, the service mesh software transparently injects a proxy container (also called a sidecar; Istio's default sidecar is Envoy) into the Pod. The basic flow of Pod-to-Pod communication is as follows:

  • Traffic is transparently hijacked to the proxy component in the Pod via iptables rules;
  • The proxy applies traffic governance logic to the request (e.g. circuit breaking, routing, load balancing), selects the peer service instance to communicate with, and forwards the message;
  • The proxy in the peer Pod hijacks the inbound traffic, performs basic traffic governance (e.g. rate limiting), and then forwards it to the application container;
  • Once the peer Pod has processed the request, the response is returned to the requesting Pod along the same path.
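The transparent hijack in the first step is typically implemented with iptables rules installed in the Pod's network namespace. Below is a minimal sketch; Istio's real ruleset (installed by istio-init) is considerably more elaborate, and the port numbers (15001 outbound, 15006 inbound) and proxy UID 1337 follow Istio's conventions:

```shell
# Let traffic sent by the proxy itself pass through untouched
# (UID 1337 is the sidecar's UID in Istio), otherwise redirection would loop.
iptables -t nat -A OUTPUT -p tcp -m owner --uid-owner 1337 -j RETURN

# Redirect all other outbound TCP traffic to the proxy's outbound port.
iptables -t nat -A OUTPUT -p tcp -j REDIRECT --to-ports 15001

# Redirect all inbound TCP traffic to the proxy's inbound port.
iptables -t nat -A PREROUTING -p tcp -j REDIRECT --to-ports 15006
```

Because the rules live in the Pod's own network namespace, neither the application nor other Pods need to know the proxy exists.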

Problems and Challenges of Service Mesh Data Plane

As shown above, the service mesh achieves application-transparent service governance by introducing a proxy layer on the data plane. This does not come for free, however: the proxy layer inevitably adds latency to service communication and degrades performance.

Taking the data published on the Istio website as an example, at cluster scale the single-hop data-plane latency between microservices increases by 2.65 ms. Bear in mind that in a microservice cluster, a single external request often traverses multiple in-cluster microservice calls, so the latency overhead introduced by the mesh adds up quickly. As service meshes see wider adoption, the extra latency introduced by the proxy architecture has become a key problem facing the service mesh.

To investigate, we tested an HTTP L7 load-balancing scenario and profiled where the time goes on the mesh data-plane communication path. The time-cost breakdown is summarized below.

A detailed analysis of mesh traffic shows that microservice communication goes from 1 connection establishment to 3, and from 2 protocol-stack traversals to 6. The time is mainly spent on extra data copies, connection setup and communication, and context switching; the actual traffic governance accounts for only a small share of the overhead.

The question, then, is how to reduce the latency overhead of the mesh while preserving its transparent governance of applications.

The high-performance service governance framework Kmesh

Based on the performance analysis above, we optimized the mesh data plane in two stages.

Kmesh 1.0: accelerating the mesh data plane with sockmap

Sockmap is an eBPF feature introduced in Linux 4.14. It redirects data flows between sockets on the same node without traversing the full kernel protocol stack, optimizing socket-to-socket data forwarding performance on that path.

In the service mesh scenario, traffic between the business container in a Pod and the local proxy component traverses the complete kernel protocol stack by default; this is the overhead that sockmap can optimize away.

The basic steps of sockmap acceleration:

  • Attach an eBPF program (prog type: BPF_PROG_TYPE_SOCK_OPS) on the connection-establishment path to intercept all TCP connection setups, and store the socket information of both ends of the connection in the sockmap table:

    • on BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB, add a sockmap record for the client side;
    • on BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB, add a sockmap record for the server side;
  • Attach an eBPF program (prog type: BPF_PROG_TYPE_SK_MSG) on the sendmsg path to intercept message-send operations:

    • look up the sockmap table with the current socket's information, find the associated peer socket, and redirect the traffic directly into the peer socket's receive queue.
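The two attachment points above can be sketched in eBPF C roughly as follows. This is an illustrative sketch modeled on public sockmap examples, not Kmesh's actual source: it keys a BPF_MAP_TYPE_SOCKHASH by the connection 4-tuple, and the port byte-order normalization (local_port in host order, remote_port in network order) follows the kernel's bpf_sock_ops/sk_msg_md conventions:

```c
// SPDX-License-Identifier: GPL-2.0
// Illustrative sockmap acceleration sketch (not Kmesh's actual source).
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct sock_key {
    __u32 sip;    /* local IPv4  */
    __u32 dip;    /* remote IPv4 */
    __u32 sport;  /* local port, host byte order  */
    __u32 dport;  /* remote port, host byte order */
};

struct {
    __uint(type, BPF_MAP_TYPE_SOCKHASH);
    __uint(max_entries, 65536);
    __type(key, struct sock_key);
    __type(value, __u32);
} sock_ops_map SEC(".maps");

/* BPF_PROG_TYPE_SOCK_OPS: on every TCP connection setup, record the
 * socket of each end (client and server) in the sockhash. */
SEC("sockops")
int record_sock(struct bpf_sock_ops *skops)
{
    if (skops->op == BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB ||   /* client side */
        skops->op == BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB) {  /* server side */
        struct sock_key key = {
            .sip   = skops->local_ip4,
            .dip   = skops->remote_ip4,
            .sport = skops->local_port,
            .dport = bpf_ntohl(skops->remote_port),
        };
        bpf_sock_hash_update(skops, &sock_ops_map, &key, BPF_ANY);
    }
    return 0;
}

/* BPF_PROG_TYPE_SK_MSG: on sendmsg, build the key as the *peer* socket
 * would have recorded it, look it up, and splice the payload straight
 * into the peer socket's receive queue, skipping the protocol stack. */
SEC("sk_msg")
int redirect_msg(struct sk_msg_md *msg)
{
    struct sock_key peer = {
        .sip   = msg->remote_ip4,
        .dip   = msg->local_ip4,
        .sport = bpf_ntohl(msg->remote_port),
        .dport = msg->local_port,
    };
    return bpf_msg_redirect_hash(msg, &sock_ops_map, &peer, BPF_F_INGRESS);
}

char _license[] SEC("license") = "GPL";
```

In a typical deployment, the object is compiled with clang -target bpf, the sockops program is attached to a cgroup, and the sk_msg program is attached to the sockhash map (e.g. via bpftool or libbpf).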

With the sockmap-accelerated mesh data plane, measurements in a scenario with 60 persistent connections show average service-access latency reduced by 10–15% compared with the unmodified mesh.

Sockmap is by now a fairly common way to optimize the service mesh data plane, but judging by the results, a 15% latency reduction does not genuinely solve the mesh's latency problem.

Kmesh 2.0: sinking traffic governance into the OS with a programmable kernel

The performance analysis above shows that, of the extra overhead introduced by the mesh, the traffic governance itself is cheap; most of the time is spent steering traffic to the proxy components. So: can traffic governance bypass the proxy entirely and be completed inline on the send/receive path? Network communication naturally passes through the kernel protocol stack; if the protocol stack itself could govern traffic, wouldn't that solve the problem?

Kmesh is the high-performance service governance framework we propose. Built on a programmable kernel, it sinks traffic governance into the OS: the mesh data plane no longer passes through proxy components, service-to-service communication drops from 3 hops to 1, and governance is genuinely completed inline on the traffic path between microservices.

The main components of Kmesh's software architecture include:

  • kmesh-controller: the Kmesh management program, responsible for Kmesh lifecycle management, XDS protocol integration, observability and O&M, and other functions;
  • kmesh-api: the API layer Kmesh exposes externally, mainly including the orchestration API converted from xDS, the observability/O&M channel, etc.;
  • kmesh-runtime: the in-kernel runtime supporting L3–L7 traffic orchestration;
  • kmesh-orchestration: L3–L7 traffic orchestration implemented with eBPF, e.g. routing, canary release, load balancing;
  • kmesh-probe: an observability and O&M probe providing end-to-end observation capability.

We deployed an Istio mesh environment and, for the HTTP L7 load-balancing scenario, ran a comparative performance test of the mesh data plane (test tool: fortio) with different data-plane software (Envoy vs. Kmesh).
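For reference, fortio drives this kind of comparison with its load subcommand; the parameter values below are illustrative, as the article does not list the exact settings used:

```shell
# 64 concurrent connections, fixed request rate, 30 s run against the
# HTTP service under test (<service>:<port> is a placeholder).
fortio load -c 64 -qps 2000 -t 30s http://<service>:<port>/
```

fortio reports throughput and latency percentiles, which makes the Envoy/Kmesh comparison straightforward.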

The results show that with Kmesh, service-to-service performance inside the mesh is 5 times that of Istio's native data plane (Envoy). We also measured plain Kubernetes service-to-service performance without a mesh, and it is almost identical to Kmesh's numbers, further confirming the latency performance of the Kmesh data plane. (The test scenario is L7 load balancing in a laboratory environment; in real governance scenarios the effect will be less ideal, with a preliminary estimate of 2–3 times better than Istio.)

Summary

As a next-generation cloud-native technology, the service mesh provides transparent service governance for applications, but its proxy architecture introduces extra latency overhead, which has become the key obstacle to wider mesh adoption. Kmesh approaches the problem from the OS: a service governance framework based on a programmable kernel that sinks traffic governance capabilities into the OS, greatly improving mesh data-plane performance and offering a new line of thinking for the evolution of the mesh data plane.

As a new community project, Kmesh is still in its infancy and will continue to improve its L4/L7 traffic governance capabilities. For more information, visit the project homepage: https://gitee.com/openeuler/Kmesh

Source: blog.csdn.net/openEuler_/article/details/131783531