Troubleshooting a RocketMQ cluster node flash failure

1. Description of the problem

Received an alert from the monitoring platform: the RocketMQ cluster was abnormal.

CPU usage on the RocketMQ node spiked:

[monitoring screenshot: CPU usage]

Packet loss reached 100%:

[monitoring screenshot: packet loss]

Disk IO spiked:

[monitoring screenshot: disk IO]

The memory metric showed an interruption:

[monitoring screenshot: memory]


2. Check the broker node's logs

2.1 GC logs show nothing unusual

$ cd /dev/shm/

2020-02-23T23:02:11.864+0800: 11683510.551: Total time for which application threads were stopped: 0.0090361 seconds, Stopping threads took: 0.0002291 seconds

2020-02-23T23:03:36.815+0800: 11683595.502: [GC pause (G1 Evacuation Pause) (young) 11683595.502: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 7992, predicted base time: 5.44 ms, remaining time: 194.56 ms, target pause time: 200.00 ms]

 11683595.502: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 255 regions, survivors: 1 regions, predicted young region time: 1.24 ms]

 11683595.502: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 255 regions, survivors: 1 regions, old: 0 regions, predicted pause time: 6.68 ms, target pause time: 200.00 ms]

, 0.0080468 secs]

   [Parallel Time: 4.1 ms, GC Workers: 23]

      [GC Worker Start (ms): Min: 11683595502.0, Avg: 11683595502.3, Max: 11683595502.5, Diff: 0.5]

      [Ext Root Scanning (ms): Min: 0.6, Avg: 0.9, Max: 1.5, Diff: 0.9, Sum: 20.5]

      [Update RS (ms): Min: 0.0, Avg: 1.0, Max: 2.5, Diff: 2.5, Sum: 23.7]

         [Processed Buffers: Min: 0, Avg: 11.8, Max: 35, Diff: 35, Sum: 271]

      [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.7]

      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.2]

      [Object Copy (ms): Min: 0.0, Avg: 1.1, Max: 1.8, Diff: 1.8, Sum: 26.4]

      [Termination (ms): Min: 0.0, Avg: 0.2, Max: 0.3, Diff: 0.3, Sum: 5.4]

         [Termination Attempts: Min: 1, Avg: 3.5, Max: 7, Diff: 6, Sum: 80]

      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 1.0]

      [GC Worker Total (ms): Min: 3.2, Avg: 3.4, Max: 3.6, Diff: 0.5, Sum: 77.9]

      [GC Worker End (ms): Min: 11683595505.6, Avg: 11683595505.6, Max: 11683595505.7, Diff: 0.1]

   [Code Root Fixup: 0.1 ms]

   [Code Root Purge: 0.0 ms]

   [Clear CT: 0.8 ms]

   [Other: 3.0 ms]

      [Choose CSet: 0.0 ms]

      [Ref Proc: 1.1 ms]

      [Ref Enq: 0.0 ms]

      [Redirty Cards: 1.0 ms]

      [Humongous Register: 0.0 ms]

      [Humongous Reclaim: 0.0 ms]

      [Free CSet: 0.3 ms]

   [Eden: 4080.0M(4080.0M)->0.0B(4080.0M) Survivors: 16.0M->16.0M Heap: 4341.0M(16.0G)->262.4M(16.0G)]

 [Times: user=0.07 sys=0.00, real=0.01 secs]
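
As a quick cross-check that the pauses really are negligible, the stop-the-world times can be extracted from the GC log and sorted. This is only a sketch: the GC log file name under /dev/shm/ is an assumption (use whatever -Xloggc points at on your broker), and it relies on -XX:+PrintGCApplicationStoppedTime being enabled, which the excerpt above shows it is.

# sketch: list the 10 longest stop-the-world pauses in the GC log
# (for the log format shown above, field 11 is the pause time in seconds)
$ grep -h "Total time for which application threads were stopped" /dev/shm/*gc*.log \
    | awk '{print $11}' | sort -rn | head -10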


2.2 Broker log shows remoting timeouts

2020-02-23T23:02:11.864 ERROR BrokerControllerScheduledThread1 - SyncTopicConfig Exception, x.x.x.x:10911 

org.apache.rocketmq.remoting.exception.RemotingTimeoutException: wait response on the channel <x.x.x.x:10909> timeout, 3000(ms)

        at org.apache.rocketmq.remoting.netty.NettyRemotingAbstract.invokeSyncImpl(NettyRemotingAbstract.java:427) ~[rocketmq-remoting-4.5.2.jar:4.5.2]

        at org.apache.rocketmq.remoting.netty.NettyRemotingClient.invokeSync(NettyRemotingClient.java:375) ~[rocketmq-remoting-4.5.2.jar:4.5.2]

From the RocketMQ broker and GC logs we can see network timeouts, which caused master-slave synchronization problems; nothing in the broker itself looks wrong.
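
Since SyncTopicConfig is the slave pulling topic configuration from its master, the replication path can also be checked from the RocketMQ side with the bundled mqadmin tool. This is a sketch: the name server address below is a placeholder, not a value taken from the incident.

# sketch: check broker liveness and master/slave roles from the name server
# <namesrv-ip> is a placeholder for the real name server address
$ sh bin/mqadmin clusterList -n <namesrv-ip>:9876
# per-broker runtime statistics for the node that timed out
$ sh bin/mqadmin brokerStatus -n <namesrv-ip>:9876 -b x.x.x.x:10911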

Looking at the monitoring, CPU, disk IO, and the network all went bad at roughly the same time. So which came first: did a disk IO spike drive the CPU up and then cause the network problem, or did the CPU spike first and take down the network and disk IO? The machine runs nothing but the RocketMQ process and its load is normally not high, so the CPU, network, and disk IO problems were not driven by application load. Could it be Alibaba Cloud network jitter? If Alibaba Cloud's network were to blame, why did only a few nodes of the cluster jitter while business machines in the same zone saw nothing? The next step was to analyze the Linux system logs.
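
Before digging into the system logs, one way to line up the three metrics on the box itself (rather than only in the alarm platform) is to replay the sysstat history. This assumes sysstat is installed and its cron collection is enabled; the file name and time window below are chosen to match the log timestamps above (around 23:00 on the 23rd).

# sketch: replay resource history around the incident window
$ sar -u     -f /var/log/sa/sa23 -s 23:00:00 -e 23:10:00   # CPU
$ sar -d     -f /var/log/sa/sa23 -s 23:00:00 -e 23:10:00   # disk IO
$ sar -n DEV -f /var/log/sa/sa23 -s 23:00:00 -e 23:10:00   # network interfaces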


3. View the system log

# grep -ci "page allocation failure" /var/log/messages*

The system log contains entries like "page allocation failure. order:0, mode:0x20": the kernel could not find a free page to satisfy the request. order:0 means even a single 4 KB page was unavailable, and mode 0x20 (GFP_ATOMIC on these kernels) means the allocation was made in a context that cannot wait for reclaim, typically the network driver handling incoming packets, which matches the observed packet loss.

[screenshot: page allocation failure entries in /var/log/messages]
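
Two read-only files under /proc show where memory actually stands when these failures occur: /proc/buddyinfo lists how many free pages of each order remain per zone, and /proc/zoneinfo shows the current per-zone watermarks that section 4 discusses.

# sketch: free pages per order, per zone (columns are order 0, 1, 2, ...)
$ cat /proc/buddyinfo
# current per-zone watermarks (pages free vs. min / low / high)
$ grep -B1 -A3 "pages free" /proc/zoneinfo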

Searching for this error turned up the following Red Hat solution:

https://access.redhat.com/solutions/90883

The recommendation is to set these two parameters:
Increase the vm.min_free_kbytes value, for example to a value higher than a single allocation request.
Change vm.zone_reclaim_mode to 1 if it is set to zero, so the system can reclaim memory back from cached memory.
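
Before changing anything it is worth recording the current values. The defaults vary by kernel and hardware: min_free_kbytes is auto-sized from the amount of RAM, and zone_reclaim_mode may already be 0 or 1 depending on the NUMA topology.

# sketch: current values of the two parameters to be tuned
$ sysctl vm.min_free_kbytes vm.zone_reclaim_mode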


3.1 Modify the configuration file

$ sed -i '/swappiness=1/a\vm.zone_reclaim_mode = 1\nvm.min_free_kbytes = 512000'  /etc/sysctl.conf && sysctl -p /etc/sysctl.conf

zone_reclaim_mode defaults to 0, meaning zone reclaim is disabled; setting it to 1 turns zone reclaim on so that memory is reclaimed from the local node first.

min_free_kbytes is the minimum amount of free memory the kernel keeps in reserve; the per-zone watermarks are derived from it.

The sed command above appends these two settings right after the existing vm.swappiness=1 line in /etc/sysctl.conf and then reloads the file with sysctl -p.


Simply put, RocketMQ is a heavy memory consumer. Without these kernel limits in place, RocketMQ keeps requesting memory from the system until free memory is exhausted; once memory is exhausted the system can no longer service requests, which shows up as network packet loss and CPU spikes.
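
A simple way to watch this happening on a broker is to keep an eye on free memory and the page cache; with the default settings, MemFree keeps shrinking as the page cache grows until atomic allocations start failing.

# sketch: watch free memory and cache every 5 seconds (values in MB)
$ vmstat -S M 5
# or a one-shot view of the same counters
$ grep -E "^(MemFree|Cached):" /proc/meminfo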

The broker nodes run CentOS 6.10; upgrading the OS to CentOS 7 or later may avoid this problem.


4. Analysis of the cause

When a zone's free memory drops below watermark[low], the kernel thread kswapd starts reclaiming memory in the background and keeps going until the zone's free memory reaches watermark[high]. If applications allocate memory faster than kswapd can reclaim it and free memory falls below watermark[min], the kernel falls back to direct reclaim: memory is reclaimed synchronously in the context of the allocating process, and the allocation succeeds only after enough free pages have been recovered. This blocks the application, adds response latency, and can eventually trigger the OOM killer. The memory below watermark[min] is a reserve that belongs to the kernel for specific uses, so it is not handed out to ordinary user-space allocations.
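
For reference, on the 2.6.32-era kernels that CentOS 6 ships, the two higher watermarks are derived from watermark[min], which itself comes from min_free_kbytes split across zones in proportion to their size. The value chosen in section 3.1 therefore translates roughly as follows; this is a back-of-the-envelope sketch, not exact per-zone numbers.

# rough relationship used by these kernels:
#   watermark[low]  = watermark[min] * 5/4
#   watermark[high] = watermark[min] * 3/2
# with vm.min_free_kbytes = 512000 (~500 MB) that gives approximately:
$ echo "low  ~ $((512000 * 5 / 4)) kB"   # 640000 kB (~625 MB): kswapd wakes below this
$ echo "high ~ $((512000 * 3 / 2)) kB"   # 768000 kB (~750 MB): kswapd stops reclaiming here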

The impact of the min_free_kbytes value

If min_free_kbytes is set larger, all three watermarks move up and the gaps between them grow accordingly. kswapd therefore starts reclaiming earlier and reclaims more memory each time (it only stops at watermark[high]), so the system holds a larger free-memory reserve and the amount of memory applications can actually use shrinks to some extent. In the extreme case where min_free_kbytes approaches the total memory size, too little memory is left for applications and OOM can occur frequently.

If min_free_kbytes is set too small, the reserved memory is too small. kswapd itself needs to allocate small amounts of memory while reclaiming (it runs with the PF_MEMALLOC flag, which allows it to dip into the reserve), and a process chosen by the OOM killer may also need some of the reserve in order to exit. In both cases the reserve is what keeps the system from deadlocking when memory is exhausted.

Note: the analysis above is adapted from material published online.


Source: blog.51cto.com/536410/2482248