A complete troubleshooting record for memory leaks outside the JVM heap

Preface

This article records the process and reasoning behind troubleshooting an off-heap JVM memory leak in production, including some analysis of how the JVM allocates memory and a rundown of common troubleshooting methods and tools. I hope it helps.

I took quite a few detours during the investigation, but I will still lay out the complete train of thought here as a cautionary tale, and summarize a few principles for quickly investigating memory leak problems.

"The main content of this article:"

  • Fault description and troubleshooting process

  • Analysis of failure causes and solutions

  • JVM heap memory and off-heap memory allocation principle

  • Introduction and use of commonly used process memory leak troubleshooting instructions and tools


Fault description

During the lunch break on August 12th, our commercial service received an alarm that the physical memory (16G) occupied by the service process exceeded the 80% threshold and was still rising.

Pulling up the chart in the monitoring system:

It looked like a memory leak in the Java process. Since our heap is capped at 4G, a process footprint well above 4G and close to filling physical memory pointed to a leak outside the JVM heap.

Confirm the startup configuration of the service process at that time:

-Xms4g -Xmx4g -Xmn2g -Xss1024K -XX:PermSize=256m -XX:MaxPermSize=512m -XX:ParallelGCThreads=20 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+UseCMSCompactAtFullCollection -XX:CMSInitiatingOccupancyFraction=80

Although no new code had been released that day, that morning we had been using the message queue to push a historical-data repair script, and this task calls one of our service interfaces in large volume, so we initially suspected it was related to that interface.

The following figure shows the change in traffic of the calling interface on the day:

It can be seen that the number of calls at the time of the incident has increased a lot (5000+ times per minute) compared to the normal situation (200+ times per minute).

"We temporarily stopped the script from sending messages. The number of calls to this interface dropped to 200+ times per minute, and the container memory no longer increased at an extremely high slope. Everything seemed to be back to normal."

Next, check whether this interface has a memory leak.


Investigation process

First of all, let's review the memory allocation of the Java process to facilitate our explanation of the troubleshooting ideas below.

"Take the JDK1.8 version we use online as an example . " There are many summaries on JVM memory allocation on the Internet, so I won't do second creation.

The JVM memory area is divided into two blocks: heap area and non-heap area.

  • Heap area: the young generation and the old generation that we are all familiar with.

  • Non-heap area: as shown in the figure, the non-heap area includes the metaspace and direct memory.

"An additional note here is that the permanent generation (which is native to JDK8) stores the classes used by the JVM during runtime, and the objects of the permanent generation are garbage collected during full GC."

After reviewing the JVM memory allocation, let us return to the fault.

Heap memory analysis

Although we were fairly sure from the start that the heap was not to blame, because the leaked usage exceeded the 4G heap limit, we still looked at the heap memory for clues.

We looked at the occupancy curves of the young and old generations and the collection counts; as usual, nothing looked seriously wrong. We then dumped a snapshot of the JVM heap on the container at the scene of the incident.

Heap memory dump

Heap memory snapshot dump command:

jmap -dump:live,format=b,file=xxxx.hprof pid

❝Aside: You can also use jmap -histo:live pid to view the live objects in the heap directly.

After exporting, download the dump file to a local machine, then open it with Eclipse MAT (Memory Analyzer) or the JVisualVM bundled with the JDK.

Use MAT to open the file as shown:

"You can see that there are some nio-related large objects in the heap memory, such as the nioChannel that is receiving message queue messages, and nio.HeapByteBuffer, but the number is not large and cannot be used as a basis for judgment. Let's observe it first."

Next, I started reading the interface code. The main logic is to call the group's WCS client and, after looking up a database table, write the data to WCS. There is no other logic.

Having found nothing special in the logic, I began to wonder whether the WCS client package itself had a memory leak. The suspicion arose because the WCS client is built on top of the SCF client, and as an RPC framework its underlying communication layer may allocate direct memory.

"Did my code set off a bug in the WCS client, causing it to continuously apply for direct memory calls and eventually eat up the memory."

I contacted the WCS on-call engineer and described the problem to them. They replied that they would run a local stress test of the write operation to see whether our problem could be reproduced.

Since waiting for their feedback would take time, we set out to figure out the cause ourselves.

"I focused my suspicion on direct memory. I suspected that the amount of interface calls was too large and the client's improper use of nio resulted in the use of ByteBuffer to request too much direct memory."

"Voiceover: The final result proved that this preconceived idea led to a detour in the investigation process. In the investigation process of the problem, it is possible to narrow the investigation scope with reasonable guesses, but it is best to first consider every possibility Make it clear that when you find yourself going deep into a certain possibility to no avail, you should look back and examine other possibilities in time.”

Sandbox environment reproduction

To recreate the failure scenario, I requested a stress-testing machine in the sandbox environment to stay consistent with the production environment.

"First, let's simulate the memory overflow situation (a large number of calls to the interface):"

We let the script continue to push data, call our interface, and we continue to observe the memory usage.

Once the calls started, memory kept growing with no apparent restriction (no Full GC was triggered by any limit).

"Next, let's simulate the normal call volume (normal call interface):"

We then directed the interface's normal call volume (much smaller, one batch of calls every 10 minutes) at the stress-testing machine, and got the old-generation and physical-memory trends shown in the figure below:

"The question is: why does the memory keep going up and eat up the memory?"

At the time, we speculated that because the JVM process had no limit on direct memory (-XX:MaxDirectMemorySize), the off-heap memory kept rising and would not trigger a Full GC.

"The picture above can draw two conclusions:"

  • When the leaking interface's call volume is large, if the heap (for example the old generation) has not yet met the conditions for a Full GC, no Full GC happens and the off-heap memory climbs all the way up.

  • Under the usual low call volume, the leak is slower, a Full GC eventually comes along, and the leaked portion is reclaimed. This is why nothing went wrong day to day and the service had been running normally for a long time.

"As mentioned above, there is no direct memory limit in the startup parameters of our process, so we added the -XX:MaxDirectMemorySize configuration and tested it again in the sandbox environment."

It turned out that the physical memory occupied by the process kept rising past the limit we had set; the configuration did not appear to take effect.

This surprised me. Is there a problem with the memory limit of the JVM?

"When I get here, I can see that my thinking is obsessed with direct memory leaks during the investigation process, and it is gone forever."

"Voiceover: We should trust the JVM's mastery of memory. If we find that the parameters are invalid, we should find the reason from ourselves and see if we use the parameters wrong."

Direct memory analysis

To find out what was actually in the direct memory, I got to work on it. Unlike heap memory, direct memory does not let you easily see all the objects it holds, so some commands are needed to troubleshoot it. I tried several approaches to look for problems there.

View process memory information: pmap

pmap - report the memory map of a process

The pmap command reports a process's memory mappings and is a handy tool for Linux debugging and operations work.

pmap -x pid | sort -n -k3    # pipe to sort if the output needs to be sorted (here by the RSS column)

Running it produced the following output:

..
00007fa2d4000000    8660    8660    8660 rw---   [ anon ]
00007fa65f12a000    8664    8664    8664 rw---   [ anon ]
00007fa610000000    9840    9832    9832 rw---   [ anon ]
00007fa5f75ff000   10244   10244   10244 rw---   [ anon ]
00007fa6005fe000   59400   10276   10276 rw---   [ anon ]
00007fa3f8000000   10468   10468   10468 rw---   [ anon ]
00007fa60c000000   10480   10480   10480 rw---   [ anon ]
00007fa614000000   10724   10696   10696 rw---   [ anon ]
00007fa6e1c59000   13048   11228       0 r-x-- libjvm.so
00007fa604000000   12140   12016   12016 rw---   [ anon ]
00007fa654000000   13316   13096   13096 rw---   [ anon ]
00007fa618000000   16888   16748   16748 rw---   [ anon ]
00007fa624000000   37504   18756   18756 rw---   [ anon ]
00007fa62c000000   53220   22368   22368 rw---   [ anon ]
00007fa630000000   25128   23648   23648 rw---   [ anon ]
00007fa63c000000   28044   24300   24300 rw---   [ anon ]
00007fa61c000000   42376   27348   27348 rw---   [ anon ]
00007fa628000000   29692   27388   27388 rw---   [ anon ]
00007fa640000000   28016   28016   28016 rw---   [ anon ]
00007fa620000000   28228   28216   28216 rw---   [ anon ]
00007fa634000000   36096   30024   30024 rw---   [ anon ]
00007fa638000000   65516   40128   40128 rw---   [ anon ]
00007fa478000000   46280   46240   46240 rw---   [ anon ]
0000000000f7e000   47980   47856   47856 rw---   [ anon ]
00007fa67ccf0000   52288   51264   51264 rw---   [ anon ]
00007fa6dc000000   65512   63264   63264 rw---   [ anon ]
00007fa6cd000000   71296   68916   68916 rwx--   [ anon ]
00000006c0000000 4359360 2735484 2735484 rw---   [ anon ]

You can see that the last line is the heap memory mapping, occupying 4G; the rest are individually quite small, but this information still doesn't reveal the problem.

NativeMemoryTracking

❝Native Memory Tracking (NMT) is a HotSpot VM feature for analyzing the VM's internal memory usage. We can use jcmd (bundled with the JDK) to access NMT data.

NMT must first be enabled via the VM startup parameters; note that turning it on costs roughly 5%-10% in performance.

-XX:NativeMemoryTracking=[off | summary | detail]
# off: disabled (the default)
# summary: only collect aggregate memory usage per category
# detail: collect memory usage by individual call sites
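For example, in our case the flag could be appended to the startup command roughly like this (the jar name here is just a placeholder):

java -XX:NativeMemoryTracking=detail -jar our-service.jar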

Then start the process, and you can use the following command to view its native memory usage:

jcmd <pid> VM.native_memory [summary | detail | baseline | summary.diff | detail.diff | shutdown] [scale= KB | MB | GB]

# summary: memory usage aggregated by category
# detail: detailed memory usage, including virtual memory usage on top of the summary information
# baseline: take a snapshot of current memory usage for later comparison
# summary.diff: summary-level comparison against the last baseline
# detail.diff: detailed comparison against the last baseline
# shutdown: turn NMT off

We use:

jcmd pid VM.native_memory detail scale=MB > temp.txt

The result is shown below:

None of the information above pointed clearly at the problem; at least, I could not see it from these numbers.

The investigation seems to have reached a deadlock.

Seemingly no way forward

When the investigation stalled, we got replies from WCS and SCF: both confirmed there was no memory leak in their packages. WCS does not use direct memory at all, and while SCF is the underlying RPC protocol, it would not leave such an obvious memory bug, otherwise there would have been plenty of reports from other users already.

View JVM memory information: jmap

Unable to find the problem by other means, I spun up a fresh sandbox container, started the service process, and ran jmap to look at the actual configuration of the JVM memory:

jmap -heap pid

and got the following output:

Attaching to process ID 1474, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.66-b17

using parallel threads in the new generation.
using thread-local object allocation.
Concurrent Mark-Sweep GC

Heap Configuration:
   MinHeapFreeRatio         = 40
   MaxHeapFreeRatio         = 70
   MaxHeapSize              = 4294967296 (4096.0MB)
   NewSize                  = 2147483648 (2048.0MB)
   MaxNewSize               = 2147483648 (2048.0MB)
   OldSize                  = 2147483648 (2048.0MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 0 (0.0MB)

Heap Usage:
New Generation (Eden + 1 Survivor Space):
   capacity = 1932787712 (1843.25MB)
   used     = 1698208480 (1619.5378112792969MB)
   free     = 234579232 (223.71218872070312MB)
   87.86316621615607% used
Eden Space:
   capacity = 1718091776 (1638.5MB)
   used     = 1690833680 (1612.504653930664MB)
   free     = 27258096 (25.995346069335938MB)
   98.41346682518548% used
From Space:
   capacity = 214695936 (204.75MB)
   used     = 7374800 (7.0331573486328125MB)
   free     = 207321136 (197.7168426513672MB)
   3.4349974840697497% used
To Space:
   capacity = 214695936 (204.75MB)
   used     = 0 (0.0MB)
   free     = 214695936 (204.75MB)
   0.0% used
concurrent mark-sweep generation:
   capacity = 2147483648 (2048.0MB)
   used     = 322602776 (307.6579818725586MB)
   free     = 1824880872 (1740.3420181274414MB)
   15.022362396121025% used

29425 interned Strings occupying 3202824 bytes

From the output, the young and old generations look quite normal, the metaspace occupies only 20M, and direct memory seems to be around 2G...

Wait. Why is MaxMetaspaceSize = 17592186044415 MB? It looks as if there is no limit at all.

Take a closer look at our startup parameters:

-Xms4g -Xmx4g -Xmn2g -Xss1024K -XX:PermSize=256m -XX:MaxPermSize=512m -XX:ParallelGCThreads=20 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+UseCMSCompactAtFullCollection -XX:CMSInitiatingOccupancyFraction=80

What we configured, -XX:PermSize=256m -XX:MaxPermSize=512m, is memory for the permanent generation. But starting with 1.8, the HotSpot VM removed the permanent generation and replaced it with the metaspace. Since we run JDK 1.8 online, we had effectively placed no limit at all on the maximum metaspace size; -XX:PermSize=256m and -XX:MaxPermSize=512m are obsolete parameters on 1.8.
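On 1.8, the corresponding limits are the metaspace flags instead; for example (the values here are purely illustrative, not a recommendation):

-XX:MetaspaceSize=256m -XX:MaxMetaspaceSize=512m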

The following figure shows how the permanent generation changed between 1.7 and 1.8:

"Could it be that the metaspace memory was leaked?"

I chose to test locally, where it is easy to change parameters and to use JVisualVM to watch the memory changes visually.

Use JVisualVM to observe the process running

First, limit the metaspace with the parameters -XX:MetaspaceSize=64m -XX:MaxMetaspaceSize=128m, then call the problematic interface locally.

The result:

"It can be seen that when the meta space is exhausted, the system starts Full GC, the meta space memory is reclaimed, and many classes are unloaded."

Then we remove the metaspace limit, i.e., go back to the parameters that had the problem before:

-Xms4g -Xmx4g -Xmn2g -Xss1024K -XX:ParallelGCThreads=20 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+UseCMSCompactAtFullCollection -XX:CMSInitiatingOccupancyFraction=80 -XX:MaxDirectMemorySize=2g -XX:+UnlockDiagnosticVMOptions

The result:

"It can be seen that the meta space is rising, and the loaded classes are also rising with the increase in the number of calls, showing a positive correlation trend."

Light at the end of the tunnel

The problem became clear all at once: with every call to the interface, some class was most likely being created over and over again, filling up the metaspace.

Observe JVM class loading with -verbose

❝When debugging a program, it is sometimes necessary to check which classes it loads, how memory is being reclaimed, which native interfaces are called, and so on. This is what the -verbose option is for. In MyEclipse you can set it via right-click (as below), or you can pass java -verbose on the command line.

-verbose:class  show class loading activity
-verbose:gc     show memory reclamation (GC) activity in the VM
-verbose:jni    show native method (JNI) calls

Locally, we add the -verbose:class startup parameter and call the interface in a loop.

You can see that countless com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto classes are generated:

[Loaded com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto from file:/C:/Users/yangzhendong01/.m2/repository/com/alibaba/fastjson/1.2.71/fastjson-1.2.71.jar]
[Loaded com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto from file:/C:/Users/yangzhendong01/.m2/repository/com/alibaba/fastjson/1.2.71/fastjson-1.2.71.jar]
[Loaded com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto from file:/C:/Users/yangzhendong01/.m2/repository/com/alibaba/fastjson/1.2.71/fastjson-1.2.71.jar]
[Loaded com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto from file:/C:/Users/yangzhendong01/.m2/repository/com/alibaba/fastjson/1.2.71/fastjson-1.2.71.jar]
[Loaded com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto from file:/C:/Users/yangzhendong01/.m2/repository/com/alibaba/fastjson/1.2.71/fastjson-1.2.71.jar]
[Loaded com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto from file:/C:/Users/yangzhendong01/.m2/repository/com/alibaba/fastjson/1.2.71/fastjson-1.2.71.jar]

After many calls had accumulated a certain number of these classes, we manually triggered a Full GC to collect the class loaders, and found that a large number of fastjson-related classes were reclaimed.
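For reference, one way to trigger a Full GC by hand on a test instance (assuming the JDK's jcmd is available) is:

jcmd <pid> GC.run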

"If you use jmap to check the class loading before recycling, you can also find a large number of fastjson related classes:"

jmap -clstats 7984

Now I had a direction: read the code carefully this time and check where the logic uses fastjson. I found the following code:

/**
 * Returns the JSON string of the bean, converting camelCase fields to snake_case.
 * @param bean the entity object.
 */
public static String buildData(Object bean) {
    try {
        SerializeConfig CONFIG = new SerializeConfig();
        CONFIG.propertyNamingStrategy = PropertyNamingStrategy.SnakeCase;
        return JSON.toJSONString(bean, CONFIG);
    } catch (Exception e) {
        return null;
    }
}

Root cause of the problem

Before calling WCS, we serialize an entity class with camelCase fields into snake_case fields. This requires Fastjson's SerializeConfig, which we instantiated inside a static method. When a SerializeConfig is created, it generates an ASM proxy class by default to serialize the target object; these are exactly the com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto classes that keep being created above. If we reused the same SerializeConfig, fastjson would look up the already generated proxy class and reuse it. But a new SerializeConfig() cannot find the previously generated proxy class, so it keeps generating new proxy classes for WlkCustomerDto.

The two screenshots below show the source code at the problem location:

We made SerializeConfig a static field of the class, and the problem was solved:

private static final SerializeConfig CONFIG = new SerializeConfig();

static {
    CONFIG.propertyNamingStrategy = PropertyNamingStrategy.SnakeCase;
}
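Putting it together, the corrected helper might look roughly like the sketch below; the enclosing class name JsonUtils is just an assumption for illustration, the rest follows the snippet above:

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.PropertyNamingStrategy;
import com.alibaba.fastjson.serializer.SerializeConfig;

public final class JsonUtils {

    // Created once and reused, so fastjson can find and reuse the ASM proxy
    // classes it has already generated for each bean type.
    private static final SerializeConfig CONFIG = new SerializeConfig();

    static {
        CONFIG.propertyNamingStrategy = PropertyNamingStrategy.SnakeCase;
    }

    /**
     * Returns the JSON string of the bean, converting camelCase fields to snake_case.
     */
    public static String buildData(Object bean) {
        try {
            return JSON.toJSONString(bean, CONFIG);
        } catch (Exception e) {
            return null;
        }
    }
}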

What does fastjson SerializeConfig do

Introduction to SerializeConfig:

❝The main function of SerializeConfig is to configure and record the serialization class (an implementation of the ObjectSerializer interface) corresponding to each Java type. For example, Boolean.class uses BooleanCodec (the name tells you this class handles both serialization and deserialization) as its serializer, and float[].class uses FloatArraySerializer. Some of these serializers are implemented by default in FastJSON (such as those for the basic Java types), some are generated through the ASM framework (such as those for user-defined classes), and some are user-defined serializers (for example, the framework's default for the Date type serializes to milliseconds, while an application may need seconds). Whether a serializer is generated with ASM or implemented as a JavaBean-based serializer is decided in part by whether the code is running on Android (the environment variable "java.vm.name" being "dalvik" or "lemur" indicates Android), but that is not the only check; more specific judgments follow later.

In theory, if the same SerializeConfig instance serializes the same class repeatedly, it will find the proxy class generated for that class earlier and reuse it. But our service instantiated a new SerializeConfig to configure Fastjson every time the interface was called. Since the ASM proxy was not disabled, and each call used a brand-new instance, the lookup for an existing proxy class always came up empty, so Fastjson kept creating new proxy classes and loading them into the metaspace. That is what caused the metaspace to keep expanding and eventually exhaust the machine's memory.
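To see this leak pattern in isolation, a minimal local reproduction could look like the sketch below (the class names are hypothetical, and SampleDto simply stands in for WlkCustomerDto). Each iteration builds a fresh SerializeConfig, so fastjson generates a brand-new ASM serializer class every time, and the loaded-class count reported by the JVM keeps climbing:

import java.lang.management.ManagementFactory;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.PropertyNamingStrategy;
import com.alibaba.fastjson.serializer.SerializeConfig;

public class MetaspaceLeakRepro {

    // A stand-in for the real DTO that gets serialized on every interface call.
    public static class SampleDto {
        private String customerName = "test";
        public String getCustomerName() { return customerName; }
        public void setCustomerName(String customerName) { this.customerName = customerName; }
    }

    public static void main(String[] args) {
        SampleDto dto = new SampleDto();
        for (int i = 0; i < 50_000; i++) {
            // A new SerializeConfig per call: fastjson cannot find a previously
            // generated proxy class, so it generates (and loads) another one.
            SerializeConfig config = new SerializeConfig();
            config.propertyNamingStrategy = PropertyNamingStrategy.SnakeCase;
            JSON.toJSONString(dto, config);

            if (i % 1_000 == 0) {
                System.out.println("loaded classes: "
                        + ManagementFactory.getClassLoadingMXBean().getLoadedClassCount());
            }
        }
    }
}

Run it with -verbose:class (or watch it in JVisualVM) and the effect described above shows up directly.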

Why the problem only appeared after upgrading to JDK 1.8

The cause deserves a closer look: why did this problem not occur before the upgrade? To answer that we need to look at the differences between the HotSpot VM shipped with JDK 1.8 and the one shipped with 1.7.

❝Starting from JDK 1.8, the bundled HotSpot VM removes the old permanent generation and adds a metaspace. Functionally, the metaspace can be considered similar to the permanent generation in that its main job is to store class metadata, but the actual mechanism is quite different.

First of all, the default maximum size of the metaspace is the physical memory of the whole machine, so a continuously expanding metaspace makes the Java program eat into the system's available memory until none is left, whereas the permanent generation has a fixed default size and will not grow to consume all of the machine's memory. When the allocated space is exhausted, both trigger a full GC, but the difference lies in what can be collected. During a full GC of the permanent generation, class metadata (Class objects) is reclaimed with a mechanism similar to heap collection: objects unreachable from the GC roots can be collected. The metaspace instead decides whether class metadata can be reclaimed based on whether the Classloader that loaded it can be reclaimed; as long as the Classloader cannot be collected, the class metadata it loaded will not be collected either.

This also explains why our two services only ran into trouble after upgrading to 1.8. On earlier JDK versions, each fastjson call likewise created many proxy classes and loaded many Class instances into the permanent generation, but those Class instances were created during the method call and were unreachable once the call returned, so when the permanent generation filled up and a full GC was triggered, they were reclaimed.

On 1.8, however, these proxy classes are loaded through the main thread's Classloader, which is never collected while the program is running, so the proxy classes loaded through it are never collected either. That is what caused the metaspace to keep expanding until it exhausted the machine's memory.

This problem is not limited to fastjson; it can occur wherever a program loads and creates classes at runtime. Frameworks in particular make heavy use of bytecode-enhancement tools like ASM and javassist, and by the analysis above, before JDK 1.8 the dynamically loaded classes could in most cases be reclaimed during a full GC, so problems rarely surfaced and many frameworks and toolkits never dealt with them. Once you upgrade to 1.8, these problems may be exposed.

 

Summary

With the problem solved, I reviewed the entire troubleshooting process. It exposed quite a few gaps on my side. The most important was that I was not familiar with how memory allocation differs between JVM versions, which led to wrong judgments about the permanent generation and the metaspace; I took many detours and spent a lot of time digging into direct memory.

Secondly, an investigation needs to be both careful and comprehensive. It is best to lay out all the possibilities first, otherwise it is easy to get trapped inside an investigation scope you have drawn for yourself and run into a dead end.

Finally, the takeaways from this problem:

  • Starting from JDK 1.8, the bundled HotSpot VM removes the old permanent generation and adds a metaspace. Functionally the metaspace can be considered similar to the permanent generation in that its main job is to store class metadata, but the actual mechanism is quite different.

  • JVM memory needs to be limited at startup, not only the familiar heap memory but also direct memory and the metaspace. This is the last line of defense for keeping online services running normally.

  • When using third-party libraries, pay attention to how the calling code is written and avoid patterns that obviously leak memory.

  • Be especially careful with libraries that rely on bytecode-enhancement tools such as ASM (particularly on JDK 1.8 and later).


