Preface
This article records the process and reasoning behind troubleshooting an off-heap memory leak in an online JVM, including some analysis of JVM memory allocation principles and an overview of common JVM troubleshooting methods and tools. I hope it helps.
I took plenty of detours during the investigation, but in this article I will still lay out the complete train of thought as a lesson for others. The article also summarizes several principles for quickly investigating memory leak problems.
The main contents of this article:
Fault description and troubleshooting process
Analysis of failure causes and solutions
JVM heap memory and off-heap memory allocation principle
Introduction and use of commonly used process memory leak troubleshooting instructions and tools
Fault description
During the lunch break on August 12th, our commercial service received an alarm that the physical memory (16G) occupied by the service process exceeded the 80% threshold and was still rising.
Pulling up the chart from the monitoring system:
It looked like a memory leak in the Java process. Since our heap is capped at 4G, memory usage well above 4G that nearly fills physical memory almost certainly points to a leak outside the JVM heap.
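A quick way to confirm this from inside the process is to compare the JVM's own heap ceiling with the memory the OS reports for the process. A minimal sketch (the class name is my own; run it in any JVM started with the same flags):

```java
public class HeapCeiling {
    public static void main(String[] args) {
        // Roughly the -Xmx value: the most the Java heap can ever grow to.
        long maxHeap = Runtime.getRuntime().maxMemory();
        // Heap memory currently committed by the JVM.
        long committed = Runtime.getRuntime().totalMemory();
        System.out.println("max heap  = " + (maxHeap >> 20) + " MB");
        System.out.println("committed = " + (committed >> 20) + " MB");
        // If `top` shows an RSS far above maxHeap, the excess lives outside
        // the Java heap: metaspace, direct buffers, thread stacks, native code.
    }
}
```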
Confirm the startup configuration of the service process at that time:
-Xms4g -Xmx4g -Xmn2g -Xss1024K -XX:PermSize=256m -XX:MaxPermSize=512m -XX:ParallelGCThreads=20 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+UseCMSCompactAtFullCollection -XX:CMSInitiatingOccupancyFraction=80
Although no new code had been deployed that day, that morning we had been using a message queue to push a script repairing historical data, and this task made a large number of calls to one of our service interfaces, so we initially suspected a connection to this interface.
The following figure shows the change in traffic of the calling interface on the day:
It can be seen that the number of calls at the time of the incident has increased a lot (5000+ times per minute) compared to the normal situation (200+ times per minute).
"We temporarily stopped the script from sending messages. The number of calls to this interface dropped to 200+ times per minute, and the container memory no longer increased at an extremely high slope. Everything seemed to be back to normal."
Next, check whether this interface has a memory leak.
Investigation process
First of all, let's review the memory allocation of the Java process to facilitate our explanation of the troubleshooting ideas below.
Take JDK 1.8, the version we use online, as an example. There are already many good summaries of JVM memory allocation on the Internet, so I won't repeat them here.
The JVM memory area is divided into two blocks: heap area and non-heap area.
Heap area: the young generation and old generation we are all familiar with.
Non-heap area: As shown in the figure, the non-heap area has a metadata area and direct memory.
One extra note here: the metaspace (which replaced the permanent generation in JDK 8) stores the classes the JVM uses at runtime, and objects in the metaspace are garbage collected during a full GC.
After reviewing the JVM memory allocation, let us return to the fault.
Heap memory analysis
Although we had basically ruled out the heap at the start, because the leaked memory exceeded the 4G heap limit, we still checked the heap memory for clues.
We observed the memory occupancy curves of the young and old generations and the GC count statistics; as usual, there was no major problem. We then dumped a snapshot of the JVM heap on the container at the accident scene.
Heap memory dump
Heap memory snapshot dump command:
jmap -dump:live,format=b,file=xxxx.hprof pid
❝Voiceover: You can also use jmap -histo:live pid to view the live objects in the heap directly.
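Besides shelling into the container and running jmap, the same snapshot can be taken from inside the process through the HotSpot diagnostic MXBean. A small sketch, assuming a HotSpot JVM (the class and file names are illustrative):

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    /** Dumps this JVM's heap to an .hprof file; live=true keeps only reachable objects. */
    public static void dump(String path, boolean live) throws Exception {
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(path, live); // fails if the target file already exists
    }

    public static void main(String[] args) throws Exception {
        dump("self.hprof", true);
        System.out.println("wrote self.hprof");
    }
}
```

The resulting file opens in MAT or JVisualVM exactly like a jmap dump.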
After exporting, download the dump file to a local machine, then open it with Eclipse's MAT (Memory Analyzer Tool) or the JVisualVM bundled with the JDK.
Use MAT to open the file as shown:
You can see some large NIO-related objects in the heap, such as the nioChannel receiving message-queue messages and nio.HeapByteBuffer, but their number is not large enough to serve as a basis for judgment, so we just kept an eye on them for the moment.
Next, I started browsing the interface code. Its main logic is to call the group's WCS client and write the data looked up from a database table to WCS; there is no other additional logic.
Finding no special logic, I began to wonder whether there was a memory leak inside the WCS client package. The reason for this suspicion is that the WCS client is built on top of the SCF client; as an RPC framework, its underlying communication protocol may allocate direct memory.
Had my code triggered a bug in the WCS client that made it keep allocating direct memory until it ate up all the machine's memory?
I contacted the WCS duty officer and described the problem we encountered with them. They responded to us and will perform a pressure test of the write operation locally to see if our problem can be reproduced.
Since waiting for their feedback would take time, we set out to find the cause ourselves.
I focused my suspicion on direct memory: perhaps the volume of interface calls was too large, and improper use of NIO by the client caused ByteBuffer to allocate too much direct memory.
❝Voiceover: The final result proved that this preconceived idea led the investigation astray. It is fine to narrow the scope with reasonable guesses, but it is best to lay out every possibility clearly first; when you find yourself going deep down one path to no avail, look back promptly and examine the other possibilities.
Sandbox environment reproduction
In order to restore the failure scenario at that time, I applied for a stress testing machine in the sandbox environment to ensure consistency with the online environment.
"First, let's simulate the memory overflow situation (a large number of calls to the interface):"
We let the script continue to push data, call our interface, and we continue to observe the memory usage.
Once the calls started, memory kept growing with no apparent limit, and no full GC was triggered to constrain it.
"Next, let's simulate the normal call volume (normal call interface):"
We routed the interface's normal call volume (lower, with one batch of calls every 10 minutes) to the stress-testing machine and obtained the old-generation memory and physical memory trends shown below:
"The question is: why does the memory keep going up and eat up the memory?"
At the time, we speculated that because the JVM process did not limit the direct memory size (-XX:MaxDirectMemorySize), off-heap memory kept rising and would never trigger a full GC.
Two conclusions can be drawn from the chart above:
When the leaking interface is called heavily, if conditions in the heap (for example, in the old generation) never satisfy the full-GC trigger, no full GC happens and direct memory climbs steadily.
Under the usual low call volume, the leak is slow; a full GC always arrives eventually and reclaims the leaked portion. This is why the service ran normally for a long time without any problem.
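To make the direct-memory behavior concrete, here is a minimal sketch of how NIO acquires memory outside the heap (my own demo class, not our service code):

```java
import java.nio.ByteBuffer;

public class DirectMemoryDemo {
    public static void main(String[] args) {
        // allocateDirect takes native memory that -Xmx does not bound. If
        // -XX:MaxDirectMemorySize is not set, HotSpot defaults the cap to
        // roughly the -Xmx value, and the native memory is only freed when
        // the owning DirectByteBuffer object is garbage collected -- which is
        // why a quiet heap can let direct memory pile up between full GCs.
        ByteBuffer buf = ByteBuffer.allocateDirect(16 * 1024 * 1024); // 16 MB off-heap
        System.out.println("direct=" + buf.isDirect() + " capacity=" + buf.capacity());
    }
}
```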
"As mentioned above, there is no direct memory limit in the startup parameters of our process, so we added the -XX:MaxDirectMemorySize configuration and tested it again in the sandbox environment."
It turned out that the physical memory occupied by the process kept rising past the limit we set; the configuration seemed to have no effect.
This surprised me. Is there a problem with the memory limit of the JVM?
Looking back at this point, you can see that my thinking during the investigation was fixated on a direct-memory leak and had drifted far off course.
❝Voiceover: Trust the JVM's handling of memory. If a parameter appears to have no effect, look for the cause in yourself first and check whether you are using it incorrectly.
Direct memory analysis
To find out what was in direct memory, I started digging into it. Direct memory is not like heap memory, where all occupied objects are easy to see; we need some commands to inspect it. I tried several methods to see what was wrong in direct memory.
View process memory information pmap
pmap — report the memory map of a process
The pmap command is used to report the memory mapping relationship of the process and is a good tool for Linux debugging and operation and maintenance.
pmap -x pid | sort -n -k3   # pipe to sort if you want the output ordered by RSS
After execution, I got the following output:
..
00007fa2d4000000 8660 8660 8660 rw--- [ anon ]
00007fa65f12a000 8664 8664 8664 rw--- [ anon ]
00007fa610000000 9840 9832 9832 rw--- [ anon ]
00007fa5f75ff000 10244 10244 10244 rw--- [ anon ]
00007fa6005fe000 59400 10276 10276 rw--- [ anon ]
00007fa3f8000000 10468 10468 10468 rw--- [ anon ]
00007fa60c000000 10480 10480 10480 rw--- [ anon ]
00007fa614000000 10724 10696 10696 rw--- [ anon ]
00007fa6e1c59000 13048 11228 0 r-x-- libjvm.so
00007fa604000000 12140 12016 12016 rw--- [ anon ]
00007fa654000000 13316 13096 13096 rw--- [ anon ]
00007fa618000000 16888 16748 16748 rw--- [ anon ]
00007fa624000000 37504 18756 18756 rw--- [ anon ]
00007fa62c000000 53220 22368 22368 rw--- [ anon ]
00007fa630000000 25128 23648 23648 rw--- [ anon ]
00007fa63c000000 28044 24300 24300 rw--- [ anon ]
00007fa61c000000 42376 27348 27348 rw--- [ anon ]
00007fa628000000 29692 27388 27388 rw--- [ anon ]
00007fa640000000 28016 28016 28016 rw--- [ anon ]
00007fa620000000 28228 28216 28216 rw--- [ anon ]
00007fa634000000 36096 30024 30024 rw--- [ anon ]
00007fa638000000 65516 40128 40128 rw--- [ anon ]
00007fa478000000 46280 46240 46240 rw--- [ anon ]
0000000000f7e000 47980 47856 47856 rw--- [ anon ]
00007fa67ccf0000 52288 51264 51264 rw--- [ anon ]
00007fa6dc000000 65512 63264 63264 rw--- [ anon ]
00007fa6cd000000 71296 68916 68916 rwx-- [ anon ]
00000006c0000000 4359360 2735484 2735484 rw--- [ anon ]
You can see that the last line is the heap-memory mapping, occupying about 4G; the other mappings all have a very small footprint. We still could not spot the problem from this information.
NativeMemoryTracking
❝Native Memory Tracking (NMT) is a HotSpot VM feature for analyzing the VM's internal memory usage. We can use jcmd (bundled with the JDK) to access NMT data.
NMT must first be turned on through the VM startup parameters, but it should be noted that turning on NMT will bring about 5%-10% performance loss.
-XX:NativeMemoryTracking=[off | summary | detail]
# off: disabled (the default)
# summary: only track aggregate memory usage per category.
# detail: Collect memory usage by individual call sites.
Then, with the process running, use the following command to view its native memory:
jcmd <pid> VM.native_memory [summary | detail | baseline | summary.diff | detail.diff | shutdown] [scale= KB | MB | GB]
# summary: memory usage summarized by category.
# detail: detailed memory usage; on top of the summary it also includes virtual memory usage.
# baseline: take a memory snapshot for later comparison.
# summary.diff: summary compared against the last baseline.
# detail.diff: detail compared against the last baseline.
# shutdown: turn NMT off.
We use:
jcmd pid VM.native_memory detail scale=MB > temp.txt
Get the result as shown:
None of the outputs above pointed clearly at the problem; at least I could not see it from this information.
The investigation seems to have reached a deadlock.
Hitting a dead end
When the investigation stalled, we got replies from WCS and SCF. Both parties confirmed there was no memory leak in their packages: WCS did not use direct memory at all, and while SCF serves as the underlying RPC protocol, it would not leave such an obvious memory bug; otherwise there would already be plenty of reports online.
View JVM memory information jmap
Still unable to find the problem, I opened yet another sandbox container, ran the service process, and then ran the jmap command to look at the actual configuration of the JVM memory:
jmap -heap pid
got the answer:
Attaching to process ID 1474, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.66-b17
using parallel threads in the new generation.
using thread-local object allocation.
Concurrent Mark-Sweep GC
Heap Configuration:
MinHeapFreeRatio = 40
MaxHeapFreeRatio = 70
MaxHeapSize = 4294967296 (4096.0MB)
NewSize = 2147483648 (2048.0MB)
MaxNewSize = 2147483648 (2048.0MB)
OldSize = 2147483648 (2048.0MB)
NewRatio = 2
SurvivorRatio = 8
MetaspaceSize = 21807104 (20.796875MB)
CompressedClassSpaceSize = 1073741824 (1024.0MB)
MaxMetaspaceSize = 17592186044415 MB
G1HeapRegionSize = 0 (0.0MB)
Heap Usage:
New Generation (Eden + 1 Survivor Space):
capacity = 1932787712 (1843.25MB)
used = 1698208480 (1619.5378112792969MB)
free = 234579232 (223.71218872070312MB)
87.86316621615607% used
Eden Space:
capacity = 1718091776 (1638.5MB)
used = 1690833680 (1612.504653930664MB)
free = 27258096 (25.995346069335938MB)
98.41346682518548% used
From Space:
capacity = 214695936 (204.75MB)
used = 7374800 (7.0331573486328125MB)
free = 207321136 (197.7168426513672MB)
3.4349974840697497% used
To Space:
capacity = 214695936 (204.75MB)
used = 0 (0.0MB)
free = 214695936 (204.75MB)
0.0% used
concurrent mark-sweep generation:
capacity = 2147483648 (2048.0MB)
used = 322602776 (307.6579818725586MB)
free = 1824880872 (1740.3420181274414MB)
15.022362396121025% used
29425 interned Strings occupying 3202824 bytes
From the output, the young and old generations looked quite normal, the metaspace occupied only 20M, and direct memory seemed to be 2G...
Huh? Why is MaxMetaspaceSize = 17592186044415 MB? It looks as if there is no limit at all.
Take a closer look at our startup parameters:
-Xms4g -Xmx4g -Xmn2g -Xss1024K -XX:PermSize=256m -XX:MaxPermSize=512m -XX:ParallelGCThreads=20 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+UseCMSCompactAtFullCollection -XX:CMSInitiatingOccupancyFraction=80
What is configured is -XX:PermSize=256m -XX:MaxPermSize=512m, the memory of the permanent generation. But starting with 1.8, the HotSpot VM removed the permanent generation and replaced it with the metaspace. Since we run JDK 1.8 online, we had set no limit at all on the metaspace's maximum capacity: -XX:PermSize=256m -XX:MaxPermSize=512m are simply obsolete parameters on 1.8.
The following figure describes the change from 1.7 to 1.8, the permanent generation:
"Could it be that the metaspace memory was leaked?"
I chose to test locally, which is convenient to change the parameters, and it is also convenient to use the JVisualVM tool to visually see the memory changes.
Use JVisualVM to observe the process running
First limit the metaspace with the parameters -XX:MetaspaceSize=64m -XX:MaxMetaspaceSize=128m, then call the problematic interface locally.
Get as shown:
You can see that when the metaspace is exhausted, the system triggers a full GC, the metaspace memory is reclaimed, and many classes are unloaded.
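Metaspace usage can also be watched from inside the process via the standard memory-pool MXBeans, without attaching JVisualVM. A small sketch, assuming a JDK 8+ HotSpot VM (the class name is mine):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class MetaspaceWatch {
    public static void main(String[] args) {
        // HotSpot exposes class metadata as a pool literally named "Metaspace".
        // max == -1 means unlimited, i.e. -XX:MaxMetaspaceSize was not set.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getName().contains("Metaspace")) {
                System.out.println(pool.getName()
                        + ": used=" + pool.getUsage().getUsed()
                        + " max=" + pool.getUsage().getMax());
            }
        }
    }
}
```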
Then we remove the metaspace limit, i.e. use the parameters that had problems before:
-Xms4g -Xmx4g -Xmn2g -Xss1024K -XX:ParallelGCThreads=20 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+UseCMSCompactAtFullCollection -XX:CMSInitiatingOccupancyFraction=80 -XX:MaxDirectMemorySize=2g -XX:+UnlockDiagnosticVMOptions
Get as shown:
You can see that the metaspace keeps rising, and the number of loaded classes also rises with the number of calls, showing a positive correlation.
A light at the end of the tunnel
The problem suddenly became clear: most likely, each interface call was constantly creating new classes that filled up the metaspace.
Observing JVM class loading: -verbose
❝When debugging a program, you sometimes need to check the classes the program loads, garbage collection activity, native interface calls, and so on; this is what the -verbose options are for. In MyEclipse you can set them via right-click (as below), or pass -verbose to java on the command line.
-verbose:class  log class loading and unloading
-verbose:gc     log garbage collection activity in the VM
-verbose:jni    log native method (JNI) calls
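The counters behind the -verbose:class log are also exposed through the ClassLoadingMXBean, which is handy for watching class counts from monitoring code. A minimal sketch (the class name and the sample class being loaded are arbitrary choices of mine):

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

public class ClassLoadWatch {
    public static void main(String[] args) throws Exception {
        ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
        long before = bean.getLoadedClassCount();
        // Touch a class that is probably not loaded yet; the counter moves
        // the same way the -verbose:class log grows. A count that climbs
        // steadily under constant traffic is exactly the smell we saw here.
        Class.forName("java.util.zip.Adler32");
        long after = bean.getLoadedClassCount();
        System.out.println("loaded classes: " + before + " -> " + after);
    }
}
```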
In the local environment, we add the startup parameter -verbose:class and call the interface in a loop.
You can see countless com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto classes being generated:
[Loaded com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto from file:/C:/Users/yangzhendong01/.m2/repository/com/alibaba/fastjson/1.2.71/fastjson-1.2.71.jar]
[Loaded com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto from file:/C:/Users/yangzhendong01/.m2/repository/com/alibaba/fastjson/1.2.71/fastjson-1.2.71.jar]
[Loaded com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto from file:/C:/Users/yangzhendong01/.m2/repository/com/alibaba/fastjson/1.2.71/fastjson-1.2.71.jar]
[Loaded com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto from file:/C:/Users/yangzhendong01/.m2/repository/com/alibaba/fastjson/1.2.71/fastjson-1.2.71.jar]
[Loaded com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto from file:/C:/Users/yangzhendong01/.m2/repository/com/alibaba/fastjson/1.2.71/fastjson-1.2.71.jar]
[Loaded com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto from file:/C:/Users/yangzhendong01/.m2/repository/com/alibaba/fastjson/1.2.71/fastjson-1.2.71.jar]
After many calls had accumulated a certain number of classes, we manually triggered a full GC to reclaim the class loaders, and found that a large number of fastjson-related classes were reclaimed.
Checking class loading with jmap before the collection also shows a large number of fastjson-related classes:
jmap -clstats 7984
Now I had a direction. This time I examined the code carefully, checking where fastjson was used in the code logic, and found the following code:
/**
 * Returns a JSON string, converting camelCase fields to snake_case.
 * @param bean the entity object.
 */
public static String buildData(Object bean) {
    try {
        // A brand-new SerializeConfig is instantiated on every call.
        SerializeConfig CONFIG = new SerializeConfig();
        CONFIG.propertyNamingStrategy = PropertyNamingStrategy.SnakeCase;
        return JSON.toJSONString(bean, CONFIG);
    } catch (Exception e) {
        return null;
    }
}
Root cause of the problem
Before calling WCS, we serialize entity classes with camelCase fields into snake_case fields. This requires fastjson's SerializeConfig, and we instantiated it inside a static method. When a SerializeConfig is created, it by default creates an ASM proxy class to serialize the target object; this is the source of the com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto classes created so frequently above. If we reused the SerializeConfig, fastjson would look up the previously generated proxy class and reuse it. But because new SerializeConfig() can never find the originally generated proxy class, it keeps generating new WlkCustomerDto proxy classes.
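This cache-or-regenerate pattern is not unique to fastjson. The JDK's own dynamic proxies show the healthy version: generated proxy classes are cached per class loader and interface list, so repeated requests reuse one class instead of filling the metaspace. A sketch by way of analogy (my own demo class, not fastjson internals):

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

public class ProxyCacheDemo {
    public static void main(String[] args) {
        InvocationHandler h = (proxy, method, methodArgs) -> null;
        // The JDK caches generated proxy classes keyed by (loader, interfaces),
        // so both calls below return the very same Class object.
        Class<?> c1 = Proxy.newProxyInstance(ProxyCacheDemo.class.getClassLoader(),
                new Class<?>[]{Runnable.class}, h).getClass();
        Class<?> c2 = Proxy.newProxyInstance(ProxyCacheDemo.class.getClassLoader(),
                new Class<?>[]{Runnable.class}, h).getClass();
        System.out.println("same proxy class: " + (c1 == c2)); // true
        // fastjson keeps the analogous cache inside each SerializeConfig, so a
        // fresh SerializeConfig per call defeats the cache and generates a
        // fresh ASM class per call -- the leak described above.
    }
}
```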
The following two pictures show the fastjson source code at the problem location:
We made the SerializeConfig a static field of the class, and the problem was solved:
private static final SerializeConfig CONFIG = new SerializeConfig();
static {
CONFIG.propertyNamingStrategy = PropertyNamingStrategy.SnakeCase;
}
What does fastjson's SerializeConfig do?
Introduction to SerializeConfig:
❝The main job of SerializeConfig is to configure and record the serializer class (an implementation of the ObjectSerializer interface) for each Java type. For example, Boolean.class uses BooleanCodec as its serializer (as the name suggests, this class implements serialization and deserialization together), and float[].class uses FloatArraySerializer. Some of these serializers are implemented by default in fastjson (such as those for the basic Java types), some are generated through the ASM framework (such as those for user-defined classes), and some are user-defined serializers (for example, the Date type's default implementation converts to milliseconds, while an application may need seconds). Whether to generate a serializer with ASM or use a JavaBean-based one partly depends on whether the code runs on Android (the environment variable "java.vm.name" being "dalvik" or "lemur" indicates Android), but that is not the only check; there are more specific judgments later on.
In theory, if every call used the same SerializeConfig instance, serializing the same class would find the previously generated proxy class and reuse it. But our service instantiated a new SerializeConfig on every interface call to configure fastjson's serialization, so the lookup for an existing proxy class always failed. With the ASM proxy enabled, fastjson kept creating new proxy classes and loading them into the metaspace, which kept expanding until it exhausted the machine's memory.
Why the problem only appeared after upgrading to JDK 1.8
The cause is worth a closer look: why didn't this problem occur before the upgrade? To answer that, we need to analyze the differences between the HotSpot VMs bundled with JDK 1.8 and 1.7.
❝Starting with JDK 1.8, the bundled HotSpot VM drops the old permanent generation and adds a metaspace region. Functionally, the metaspace can be considered similar to the permanent generation, its main role being to store class metadata, but the actual mechanism is quite different.
First, the default maximum of the metaspace is the physical memory of the whole machine, so a continuously expanding metaspace lets the Java program encroach on the system's available memory until none is left, whereas the permanent generation has a fixed default size and cannot grow to fill the machine. When their allocated memory is exhausted, both trigger a full GC, but they reclaim differently. During a full GC, class metadata (Class objects) in the permanent generation is reclaimed with a mechanism similar to heap collection: objects unreachable from root references can be collected. The metaspace instead decides whether class metadata can be reclaimed based on whether the ClassLoader that loaded it can be reclaimed; as long as the ClassLoader is alive, none of the class metadata it loaded will be freed. This also explains why our two services only ran into trouble after upgrading to 1.8: on earlier JDK versions, although each fastjson call created many proxy classes and loaded many Class instances into the permanent generation, these Class instances were created during the method call and became unreachable once it returned, so when the permanent generation filled up and triggered a full GC, they were collected.
On 1.8, however, these proxy classes are loaded through the main thread's ClassLoader, which is never reclaimed while the program runs, so the proxy classes it loaded can never be reclaimed either. This led to the continuous expansion of the metaspace, which eventually exhausted the machine's memory.
This problem is not limited to fastjson; it can occur anywhere classes are loaded and created programmatically. It is especially common in frameworks, which make heavy use of bytecode-enhancement tools like ASM and javassist. Per the analysis above, before JDK 1.8 dynamically loaded classes could in most cases be reclaimed during a full GC, so problems rarely surfaced, and many frameworks and toolkits therefore never dealt with this issue. Once upgraded to 1.8, these problems may be exposed.
Summary
With the problem solved, I reviewed the whole troubleshooting process. It exposed many gaps for me. The most important was my unfamiliarity with the memory allocation of different JVM versions, which led to wrong judgments about the old generation and the metaspace, sent me down many detours, and cost a lot of time investigating direct memory.
Second, an investigation needs to be both careful and comprehensive. It is best to sort out all the possibilities first; otherwise it is easy to lock yourself into a scope of your own making and end up in a dead end.
Finally, sum up the gains from this problem:
Starting with JDK 1.8, the bundled HotSpot VM removes the old permanent generation and adds a metaspace region. Functionally, the metaspace is similar to the permanent generation in that it mainly stores class metadata, but the actual mechanism is quite different.
JVM memory needs to be limited at startup: not only the familiar heap memory, but also direct memory and the metaspace. This is the last line of defense for keeping online services healthy.
When using third-party class libraries, pay extra attention to how you write the calling code and avoid obvious memory leaks.
Be careful with libraries that use bytecode-enhancement tools such as ASM, especially after JDK 1.8.
References
Observe the process of class loading when the program is running
blog.csdn.net/tenderhearted/article/details/39642275
Metaspace overall introduction (the reason why the permanent generation is replaced, the characteristics of the metaspace, the analysis method of the metaspace memory)
https://www.cnblogs.com/duanxz/p/3520829.html
Common troubleshooting procedures for java memory usage exceptions (including off-heap memory exceptions)
https://my.oschina.net/haitaohu/blog/3024843
Full interpretation of off-heap memory of JVM source code analysis
http://lovestblog.cn/blog/2015/05/12/direct-buffer/
Unloading of JVM classes
https://www.cnblogs.com/caoxb/p/12735525.html
fastjson opens asm on jdk1.8
https://github.com/alibaba/fastjson/issues/385
fastjson:PropertyNamingStrategy_cn
https://github.com/alibaba/fastjson/wiki/PropertyNamingStrategy_cn
Be wary of Metaspace memory leaks caused by dynamic agents
https://blog.csdn.net/xyghehehehe/article/details/78820135