Diagnosing frequent virtual machine OOM problems

 Huo Mingming  360 Cloud Computing 

Editor's note

The author of this article, Huo Mingming, is responsible for technical evangelism and solution promotion for the virtualization and containerization services of the 360 HULK cloud platform. This article explores the OOM killer, a kernel feature that, when the host runs out of memory, uses a set of heuristics to pick a process to kill. The article was first published on opsdev, and the author has authorized this reprint.

PS: More first-line technology, in all its diverse forms, can be found in "HULK First-Line Technology Talk". Please follow us!

Preface

A virtual machine being killed by the OOM killer is a problem that IaaS platform operators run into frequently. Sure enough, some time ago we ran into a case where the virtual machines of certain business lines were being OOM-killed repeatedly. Let's take a look at the cause.


Scenario:

  • IaaS management platform: OpenStack

  • Compute nodes: CentOS 7.2, QEMU, KVM, 128GB memory

1. Identifying the problem

The symptom was that business virtual machines went down after running for a while, without anyone shutting them down. After checking the operation history and audit records to confirm that no human operation was involved, the compute node's system log showed that the downtime was caused by the OOM killer being triggered by insufficient host memory.
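As a quick aid, a minimal Python sketch like the one below can pull those OOM events out of the kernel ring buffer (on CentOS 7 the same lines also land in /var/log/messages):

    import subprocess

    # Minimal sketch: scan the kernel ring buffer for OOM killer events.
    # Assumes the events are still in the ring buffer and that `dmesg -T`
    # is available (util-linux on CentOS 7).
    def find_oom_events():
        out = subprocess.check_output(["dmesg", "-T"]).decode("utf-8", "replace")
        return [line for line in out.splitlines()
                if "invoked oom-killer" in line or "Out of memory" in line]

    for line in find_oom_events():
        print(line)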


So the cause was found, but it seemed odd. Why?


First, memory overcommitment is not enabled on the compute nodes hosting these virtual machines.


Second, we reserve 12GB of memory for the compute node OS (12GB / 128GB = 9.375%). In other words, even if every virtual machine uses all of the memory allocated to it, the total memory used by the virtual machines should not exceed 100% - 9.375% = 90.625% of the host's memory. By this theoretical value, unless the OS itself consumes a very large amount of memory, OOM should not occur.



2. Troubleshooting

From experience, the services running on the compute node OS will not eat up anything close to 12GB of memory unless one of them has a memory leak. With this doubt in mind, we restarted the virtual machines that had been OOM-killed and kept watching memory usage on the host.


After running for a while, a virtual machine was OOM-killed again. Yet there was no memory leak in the services on the OS, and their total memory usage was normal, about 4GB. At that point the theoretical maximum memory usage was about (128 - 12 + 4) / 128 = 93.75%, which should not trigger OOM, and that is without even counting the 4GB of swap. This ruled out OS memory usage on the compute node as the cause of the OOM.
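For reference, the headroom arithmetic above can be written out as a small sketch, using this node's figures (128GB total, 12GB reserved for the OS, roughly 4GB actually used by the OS):

    # Minimal sketch of the headroom arithmetic, using this node's figures.
    total_gib = 128.0
    reserved_gib = 12.0   # memory set aside for the compute node OS
    os_used_gib = 4.0     # what the OS services were actually using

    vm_share = (total_gib - reserved_gib) / total_gib                   # 90.625%
    worst_case = (total_gib - reserved_gib + os_used_gib) / total_gib   # 93.75%
    print("VM share of memory: %.3f%%" % (vm_share * 100))
    print("worst-case usage:   %.2f%%" % (worst_case * 100))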


Since the OS memory usage was fine, we looked at it from another angle: was there a problem with the virtual machines' memory usage?


By collecting statistics on the memory used by the virtual machine processes (qemu-kvm) on the compute node before and after OOM was triggered, we found a "big" problem.

[Figure: qemu-kvm process list showing memory usage in the RES column]


As shown in the figure above, the RES column shows that the memory actually used is well above the amount allocated to each virtual machine: a 4-core, 8GB virtual machine was actually using roughly 8.3 to 8.9GB, and a 2-core, 4GB virtual machine roughly 4.6 to 4.8GB. At this point we know who is using the extra memory.
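A rough way to gather this comparison is to walk /proc and match each qemu-kvm process's resident set against the memory size on its command line. The sketch below assumes the plain "-m <MiB>" form; libvirt can also emit "-m size=...", which it does not try to parse:

    import os
    import re

    # Minimal sketch: for each qemu-kvm process, print the memory size from
    # its command line (-m, in MiB) next to its resident set (VmRSS).
    def qemu_memory_report():
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open("/proc/%s/cmdline" % pid) as f:
                    argv = f.read().split("\0")
                if not argv or "qemu" not in os.path.basename(argv[0]):
                    continue
                alloc_mib = None
                if "-m" in argv:
                    m = re.match(r"\d+", argv[argv.index("-m") + 1])
                    alloc_mib = int(m.group()) if m else None
                with open("/proc/%s/status" % pid) as f:
                    rss_kib = next(int(l.split()[1]) for l in f if l.startswith("VmRSS:"))
                print("pid %s: allocated %s MiB, RES %d MiB" % (pid, alloc_mib, rss_kib // 1024))
            except (IOError, IndexError, StopIteration, ValueError):
                continue

    qemu_memory_report()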


Why does a virtual machine use more memory than it was allocated? Looking inside the virtual machine, its memory usage is very close to full, but it does not exceed the allocated value.

[Figure: memory usage as seen from inside the guest]


Searching with these questions in mind, we found that others have had similar doubts:

https://lime-technology.com/forums/topic/48093-kvm-memory-leakingoverhead/


The gist of that thread is that, in addition to the memory used inside the guest, the qemu-kvm process also needs memory for the virtual devices it provides to the guest, and this overhead is also accounted to the virtual machine's qemu-kvm process.
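If your libvirt exposes the rss statistic in `virsh dommemstat` (the CentOS 7 builds we use do), the per-VM overhead can be estimated directly by comparing the balloon size ("actual") with the host-side resident set ("rss"), both reported in KiB. A minimal sketch, assuming both fields are present:

    import subprocess

    # Minimal sketch: per-domain overhead = host-side RSS minus guest memory size.
    def dommemstat(domain):
        out = subprocess.check_output(["virsh", "dommemstat", domain]).decode()
        stats = {}
        for line in out.splitlines():
            parts = line.split()
            if len(parts) == 2:
                stats[parts[0]] = int(parts[1])
        return stats

    domains = subprocess.check_output(["virsh", "list", "--name"]).decode().split()
    for dom in domains:
        s = dommemstat(dom)
        if "rss" in s and "actual" in s:
            print("%s: overhead ~%d MiB" % (dom, (s["rss"] - s["actual"]) // 1024))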


We have located the problem. How do we solve it and reduce the chance of virtual machines being OOM-killed?


3. Solutions

  1. Increase the memory reserved for the OS. Enlarging the reserved space absorbs part of the virtual machines' extra memory overhead, so overall memory usage stays below the OOM threshold (see the sketch after this list).

  2. Increase swap space. At present our compute nodes all have 4GB of swap, which is a bit small for a node with 128GB of memory. We found that whenever a virtual machine was OOM-killed, swap utilization was at 100%, which matches the preconditions for OOM. If your nodes have SSDs, it is advisable to increase swap appropriately.

  3. Modify the OpenStack logic so that the memory accounted for a virtual machine during scheduling is larger than its flavor size, reserving room for the extra memory the process actually consumes. However, this approach is not generic and is not recommended.
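As a rough planning aid for option 1, the sketch below estimates a value for Nova's reserved_host_memory_mb from the per-VM overhead we observed. The per-VM overhead and VM count here are assumptions taken from our own environment, not universal constants:

    # Minimal sketch: size the host reservation so per-VM qemu overhead
    # no longer pushes the node over the OOM threshold.
    os_reserved_mib = 12 * 1024    # current OS reservation
    per_vm_overhead_mib = 900      # observed extra RES per VM (assumption)
    max_vms_per_node = 30          # planning figure for this node (assumption)

    reserved_host_memory_mb = os_reserved_mib + per_vm_overhead_mib * max_vms_per_node
    print("suggested reserved_host_memory_mb = %d" % reserved_host_memory_mb)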


References:

1. https://lime-technology.com/forums/topic/48093-kvm-memory-leakingoverhead/

2. https://unix.stackexchange.com/questions/140322/kvm-killed-by-oomkiller

