Message queue CKafka transoceanic data synchronization performance optimization

Introduction

This article describes a problem of large cross-region data synchronization delay that CKafka encountered in a transoceanic scenario. Cross-region delay problems like this one are fairly typical, so the troubleshooting process is recorded here in detail as a summary.

1. Background

To meet customers' demands for cross-region disaster recovery and cold backup, the message queue CKafka provides cross-region data synchronization through its connector feature, supporting near-real-time synchronization with second-level latency.

The overall architecture diagram:

As shown in the figure above, CKafka's cross-region data synchronization is built on an underlying Kafka Connect cluster, and connectivity between the cloud environments is established through VPCGW PrivateLink.

The main process of data synchronization is as follows:

1. The Connect cluster initializes Connect Tasks, and each Task creates multiple worker consumer clients (the exact number depends on the number of partitions of the source instance) to pull data from the source CKafka instance.

2. After pulling data from the source instance, the Connect cluster starts producers that send the data to the target CKafka instance.
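To make the flow concrete, the following is a minimal illustrative Java sketch of this pull-then-produce pattern. It is not CKafka's or Kafka Connect's actual implementation; the bootstrap addresses, topic name, and group ID are placeholders.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MirrorTaskSketch {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "source-ckafka:9092");  // placeholder source address
        consumerProps.put("group.id", "cross-region-sync");            // placeholder consumer group
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "target-ckafka:9092");  // placeholder target address
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(List.of("sync-topic"));                 // placeholder topic
            while (true) {
                // Step 1: pull a batch of records from the source instance.
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
                // Step 2: forward each record to the same topic on the target instance.
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    producer.send(new ProducerRecord<>(record.topic(), record.key(), record.value()));
                }
                producer.flush();
                consumer.commitSync();
            }
        }
    }
}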

In one customer scenario, the customer wanted to use this cross-region synchronization capability to synchronize data from a CKafka instance in Hong Kong to a CKafka instance in US East, and ran into a strange cross-region delay problem during use.

2. Problem symptoms

While using the cross-region synchronization capability, the customer found that the data synchronization delay from Hong Kong to US East was very large, and it could be clearly seen that the Connect cluster, acting as the consumer pulling data from the source instance (Hong Kong), had built up a very large message backlog.

Message backlog

Based on past experience, synchronization between our domestic regions never shows such a large delay. Why was the delay so large this time across regions?

3. Problem analysis

Common causes of message accumulation

In Kafka production and consumption, the common causes of message accumulation are as follows:

● Broker cluster load is too high: high CPU, memory, or disk I/O on the brokers slows down consumption throughput.

● Insufficient consumer processing capacity: if consumers cannot process messages in time, messages accumulate. This can be addressed by adding consumers or optimizing the consumers' processing logic.

● Abnormal consumer exit: if a consumer exits abnormally, messages are not consumed in time and a large backlog of unconsumed messages builds up on the Broker. Such failures can be detected and handled promptly by monitoring consumer status and health.

● Consumers fail to commit offsets: if consumers fail to commit offsets, messages may be consumed repeatedly or lost, and a large backlog of unconsumed messages builds up on the Broker. Offset atomicity and consistency can be ensured by improving the consumer's offset commit logic or by using Kafka's transaction mechanism.

● Network or Broker failure: if the network or a Broker fails, messages cannot be transmitted or stored in time, so a large backlog accumulates on the Broker. This can be mitigated by improving network stability and reliability, or by increasing the number of Brokers and their fault tolerance.

● The producer sends messages too fast: if the producer sends faster than consumers can process, messages accumulate. This can be addressed by throttling the producer or adding consumers.

Based on the above possible causes, we first checked the load on all nodes of the Connect cluster and on all nodes of the source and target CKafka instances. All monitoring metrics were healthy, cluster load was very low, and there were no anomalies or performance bottlenecks in the consumption capacity of the Connect consumers.

However, the fetch rate was very low, with an average consumption rate of only 325 KB/s, which did not meet expectations.

(Note: the bytes-consumed-rate metric in the above figure represents the number of bytes consumed per second.)
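As a side note, the same bytes-consumed-rate figure can be read programmatically from the Kafka Java consumer's metrics() map; the group and metric names below are the standard Kafka client metric names, and the helper itself is only an illustration.

import java.util.Map;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

public class ConsumeRatePrinter {
    // Prints the consumer's bytes-consumed-rate metric (bytes consumed per second).
    public static void printBytesConsumedRate(KafkaConsumer<byte[], byte[]> consumer) {
        for (Map.Entry<MetricName, ? extends Metric> entry : consumer.metrics().entrySet()) {
            MetricName name = entry.getKey();
            if ("consumer-fetch-manager-metrics".equals(name.group())
                    && "bytes-consumed-rate".equals(name.name())) {
                System.out.printf("%s %s = %s%n", name.name(), name.tags(), entry.getValue().metricValue());
            }
        }
    }
}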

Since there was no problem with the cluster load, we conducted a deeper investigation and analysis:

Stage 1 analysis: checking the network speed

The message delay was long, and the first thing we suspected was a network problem, so we immediately stress-tested the network, measuring the network speed with iperf3 and wget.

The iperf3 test showed a speed of 225 Mbps.

Using wget from the Connect cluster directly against Hong Kong, the download speed was 20 MB/s.

These two tests showed that, in the same environment, the network transmission rate was not low and could reach 20 MB/s. So if the network bandwidth was not the problem, where was the problem?

Stage 2 analysis: kernel tuning parameters

The network itself was fine. Could it be that Kafka's network-related application parameters or the kernel's network-related parameters were set unreasonably?

1. We first adjusted the kernel parameters. The kernel parameters related to the network mainly include:

System defaults:
net.core.rmem_max=212992
net.core.wmem_max=212992
net.core.rmem_default=212992
net.core.wmem_default=212992
net.ipv4.tcp_rmem="4096    87380   67108864"
net.ipv4.tcp_wmem="4096    65536   67108864"

---------------------------------------------------------
Adjusted kernel parameters:
sysctl -w net.core.rmem_max=51200000
sysctl -w net.core.wmem_max=51200000
sysctl -w net.core.rmem_default=2097152
sysctl -w net.core.wmem_default=2097152
sysctl -w net.ipv4.tcp_rmem="40960 873800 671088640"
sysctl -w net.ipv4.tcp_wmem="40960 655360 671088640"

Change the TCP congestion control algorithm to BBR:
sysctl -w net.ipv4.tcp_congestion_control=bbr

We increased the values of these kernel parameters across the board (even though we felt the system defaults were not small), and also changed the TCP congestion control algorithm.

Here is why we adjusted the TCP congestion control algorithm.

(Reference: [Translation] [Paper] BBR: Congestion Control Based on Congestion (Not Packet Loss) (ACM, 2017))

Because this delay occurs across regions and across an ocean, using BBR can significantly improve network throughput and reduce latency. The throughput improvement is especially noticeable on long-distance paths, such as trans-Pacific transfers of files or large volumes of data, and especially under network conditions with slight packet loss. The latency improvement is mostly seen in the last mile of the path, which is often affected by buffer bloat. "Buffer bloat" refers to network devices or systems designed with unnecessarily large buffers. When a network link becomes congested, packets queue for long periods in these oversized buffers; in a FIFO queuing system, an excessively large buffer leads to longer queues and higher latency without improving network throughput. Since BBR does not try to fill up the buffer, it tends to do a better job of avoiding buffer bloat.

After adjusting the kernel parameters, verification showed that the delay had not improved much.

2. Prompted by a cloud product technical support expert, we confirmed that the connection's receive buffer was set too small and that the kernel parameter adjustments were not taking effect; we suspected that the application layer was setting it explicitly.

So we adjusted Kafka's application-level network parameters socket.send.buffer.bytes and receive.buffer.bytes:

(1) Adjust the socket.send.buffer.bytes parameter on the brokers of the source and target CKafka instances from the default 64 KB to -1, so that the operating system's socket send buffer is used.

The relevant Kafka source code for the socket send buffer:

【Tips】:

In Kafka, the size of the TCP send buffer is determined jointly by the application and the operating system. The application can control it by setting the socket.send.buffer.bytes parameter, and the operating system also constrains it through the TCP/IP stack parameters.

The socket.send.buffer.bytes value set by the application affects the TCP send buffer size, but the operating system also imposes its own limit. If the value set by the application exceeds the operating system's limit, the actual TCP send buffer is capped at that limit. If the application sets socket.send.buffer.bytes=-1, the TCP send buffer defaults to the operating system's own setting. Note that the TCP send buffer size affects both network throughput and latency: if it is too small, throughput and performance drop; if it is too large, network latency increases. It therefore needs to be tuned to the actual situation to achieve the best balance of performance and reliability.
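To illustrate the point that the buffer size is decided jointly by the application and the operating system, here is a small standalone Java probe (not Kafka code): it requests an 8 MB socket send buffer and prints what the OS actually granted. On Linux the granted size is capped by net.core.wmem_max, and the kernel typically reports roughly twice the requested value to account for bookkeeping overhead.

import java.net.Socket;

public class SendBufferProbe {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket()) {
            // Ask the operating system for an 8 MB send buffer (SO_SNDBUF).
            socket.setSendBufferSize(8 * 1024 * 1024);
            // The value actually granted is limited by the OS (net.core.wmem_max on Linux).
            System.out.println("Effective SO_SNDBUF: " + socket.getSendBufferSize() + " bytes");
        }
    }
}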

(2) Adjust the receive.buffer.bytes parameter of the client-side Connect consumers from the default 64 KB to -1, so that the operating system's socket receive buffer is used, and raise the client's maximum per-partition fetch size, max.partition.fetch.bytes, to 5 MB.
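For reference, these two client-side settings correspond to the standard Kafka consumer configuration keys receive.buffer.bytes and max.partition.fetch.bytes. The snippet below is only an illustration with placeholder addresses and group ID; in a Kafka Connect deployment these values would typically be supplied through consumer.-prefixed properties in the worker configuration.

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TunedConsumerFactory {
    public static KafkaConsumer<byte[], byte[]> create() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "source-ckafka:9092");  // placeholder address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "cross-region-sync");            // placeholder group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        // receive.buffer.bytes = -1: use the operating system's socket receive buffer
        // instead of the 64 KB client default.
        props.put(ConsumerConfig.RECEIVE_BUFFER_CONFIG, -1);
        // max.partition.fetch.bytes: allow up to 5 MB per partition per fetch.
        props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 5 * 1024 * 1024);
        return new KafkaConsumer<>(props);
    }
}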

After making the adjustment, we quickly coordinated with the customer to restart the cluster and verify it. The effect was obvious: the average speed of a single connection increased from about 300 KB/s to more than 2 MB/s:

It can be seen that increasing Kafka's socket send and receive buffer parameters had a clear effect, and the synchronization rate went up. But just when we thought the delay problem was solved, it appeared again!

Stage 3 analysis: digging into the root cause

The Kafka parameter adjustments from the second stage above were applied to the customer's cluster. After a day of observation, the customer reported that the cluster's overall delay had improved, but the delay of some partitions was still very large. We also observed that the synchronization rate of about half of the partitions was still very low.

(Note: the bytes-consumed-rate metric in the above figure represents the number of bytes consumed per second.)

(1) Why are some connection speeds still very low?

Through the operations console, we first determined the consumer group ID corresponding to the partitions with a low consumption rate, and used that consumer group ID to locate the corresponding slow TCP connections.

After locating a connection, we captured and analyzed its packets:

The capture shows that after the server sends a burst of data, it pauses for about one round-trip time (RTT) before sending again. Adding up the packets in each burst gives a total of about 64 KB, which basically means the TCP send window is limited to 64 KB. No such limitation was found when capturing other connections running at normal speed. Generally speaking, the actual size of the TCP send window depends on the Window Scale option, which can only be confirmed when the connection is established.
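To put that 64 KB limit into perspective: TCP can keep at most one send window of data in flight per round trip, so assuming a Hong Kong to US East round-trip time on the order of 200 ms, a single connection is capped at roughly 64 KB / 0.2 s ≈ 320 KB/s, which matches the ~325 KB/s average consumption rate observed earlier.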

【Tips】: TCP Window Scale is the TCP window scaling factor. (Reference: "How to determine TCP initial window size and scaling option? Which factors affect the determination?" - Red Hat Customer Portal)

In the original TCP protocol, the window field in the TCP header is only 16 bits, so the maximum TCP window can only reach 64 KB, which limits TCP's transmission speed and efficiency. To solve this problem, the TCP Window Scale mechanism was introduced.

effective TCP window size = (advertised receive window field) × 2^(Window Scale option value)

Note that the TCP Window Scale option must be negotiated during the TCP three-way handshake, which determines how and by how much the TCP window size is scaled.
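As a worked example: if the handshake negotiates a window scale factor of 7 and the receiver advertises a 65535-byte window field, the effective window is 65535 × 2^7 ≈ 8 MB; without the Window Scale option, the same connection can never have more than 64 KB of data in flight.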

To capture the connection-establishment phase, we tried restarting the consumption task of a single partition, but found that as long as the task was restarted, the consumption speed recovered and the window size was no longer a bottleneck.

(2) Why is the sending window limited?

To reproduce the problem, we simulated the customer's usage scenario and reproduced it end to end. We finally confirmed that the problem only occurs when the task is fully restarted. During the restart, we performed a full packet capture on the server side. After identifying a normal connection and an abnormal connection and comparing their connection-establishment handshakes, we confirmed that the Window Scale option does not take effect on the slow connections!

Normal connection establishment process:

Slow connection establishment process:

As can be seen from the figures above, on the slow connection the server's SYN/ACK packet carries no "WS=2", indicating that the Window Scale option was not enabled. As a result, the send window of the entire connection is limited to 64 KB and throughput cannot increase. When the client returned the final ACK, the capture also clearly showed "no window scaling used".

(3) Why does Window Scale sometimes fail to take effect?

At this point we needed to explain why the server sometimes does not enable Window Scale when it sends the SYN/ACK. A colleague from the compute team pointed us to a similar case to study: "kubernetes - deep dive - an exception caused by restarting etcd" (SegmentFault). The key takeaway: when SYN cookies are in effect and the peer does not carry the TCP Timestamp option (according to how SYN cookies work, the option information is encoded into the low 6 bits of the TSval field of the packet returned to the peer), the kernel calls tcp_clear_options and discards options such as the window scale factor. From the system logs we could also observe that a SYN flood was indeed triggered when the task was restarted as a whole.

(4) Why did the server not receive Tsval (TCP Timestamp Value)?

As mentioned above, our data synchronization traffic passes through an in-house VPCGW. We captured packets on both the client and the server and finally confirmed that the VPCGW was swallowing the TSval sent by the client. We also confirmed with the VPCGW R&D team that not forwarding the TCP Timestamp option in a NAT environment is expected behavior, mainly to avoid packet loss in certain corner cases (see the CSDN blog post on tcp_timestamps problems in NAT environments). However, that problem no longer exists in newer kernels, so support for passing the Timestamp option through can be put on the roadmap.

Root cause

After analyzing and digging all the way down, the root cause of the problem was clear:

The Connect consumers started in batches, creating a large number of new TCP connections in a short period of time, which triggered the server's SYN cookie protection logic. Because the client's Timestamp option never reached the server (it was stripped by the VPCGW), the server cleared the window scale factor, so the maximum send window of those connections was capped at 64 KB, which severely limits transmission performance when the network delay is large.

4. Our solutions

Once the root cause was found, the solution became clear.

● Avoidance solution: we reduced the concurrency of Connect Worker initialization, slowing down the rate at which new TCP connections are established so that SYN cookies are never triggered, which ensures the performance of subsequent data synchronization (a minimal sketch of this throttling idea follows this list).

● Final solution: push for VPCGW to support passing the TCP Timestamp option through.
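To illustrate the avoidance idea, here is a minimal, hypothetical Java sketch of throttled task start-up: consumer tasks are launched in small batches with a pause in between, so that new TCP connections are not all created at once and the broker's SYN cookie protection is not triggered. The batch size and pause duration are placeholder values, not CKafka's actual parameters.

import java.util.List;
import java.util.concurrent.TimeUnit;

public class StaggeredStarter {
    // Starts consumer tasks in small batches instead of all at once, so that the
    // broker does not see a sudden burst of new TCP connections.
    public static void startInBatches(List<Runnable> consumerTasks, int batchSize, long pauseMillis)
            throws InterruptedException {
        for (int i = 0; i < consumerTasks.size(); i++) {
            new Thread(consumerTasks.get(i), "sync-task-" + i).start();
            // Pause after every batch to spread out connection establishment.
            if ((i + 1) % batchSize == 0) {
                TimeUnit.MILLISECONDS.sleep(pauseMillis);
            }
        }
    }
}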

5. Summary

On the surface, this looked like a problem of slow cross-region data synchronization requests, but after digging all the way down it turned out to be a very low-level network problem.

This kind of problem is fairly rare because the conditions required to trigger it are complex: large cross-region network latency, a large number of TCP connections established at the same time, and the TCP Timestamp option being consumed by the VPCGW along the forwarding path all have to coincide.

When facing problems, we need to stay humble and dig all the way down to the root cause!


Source: my.oschina.net/u/4587289/blog/10089900