JMH Essentials for Performance Tuning, Part 5: JMH Profilers
The JMH Essentials series (continuously updated)
I. Preface
The previous four articles covered what JMH is, the basics of JMH, how to write correct microbenchmark cases, and JMH's advanced usage. This article introduces JMH's profilers. 【Unit conversion: 1 second (s) = 1,000,000 microseconds (us) = 1,000,000,000 nanoseconds (ns)】
The official JMH source code (samples included, under the jmh-samples module) can be downloaded from: https://github.com/openjdk/jmh/tags.
The official JMH samples can be browsed online at: http://hg.openjdk.java.net/code-tools/jmh/file/tip/jmh-samples/src/main/java/org/openjdk/jmh/samples/.
Parts of this article draw on the book 《Java高并发编程详解:深入理解并发核心库》 by 汪文君; readers who want more depth are encouraged to buy a legitimate copy.
This article is an original work by @大白有点菜. Please do not plagiarize it, and credit the source when reposting. If you find it useful, likes and follows are appreciated!
II. JMH Profilers
JMH ships with a number of very useful profilers that help you understand benchmarks more deeply, and can even help developers analyze the code under test. The built-in profilers include:
Profiler | Description |
---|---|
CL | Class loading statistics while the benchmark method runs |
COMP | JIT compiler profiling of the benchmark method via standard MBeans |
GC | GC profiling of the benchmark method via standard MBeans |
HS_CL | HotSpot™ classloader profiling via implementation-specific MBeans |
HS_COMP | HotSpot™ JIT compiler profiling via implementation-specific MBeans |
HS_GC | HotSpot™ memory manager (GC) profiling via implementation-specific MBeans |
HS_RT | HotSpot™ runtime profiling via implementation-specific MBeans |
HS_THR | HotSpot™ threading profiling via implementation-specific MBeans |
STACK | JVM thread stack sampling and analysis |
【The official profiler sample (JMHSample_35_Profilers) is reproduced below, comments included.】
package cn.zhuangyt.javabase.jmh.jmh_sample;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.profile.ClassloaderProfiler;
import org.openjdk.jmh.profile.CompilerProfiler;
import org.openjdk.jmh.profile.DTraceAsmProfiler;
import org.openjdk.jmh.profile.GCProfiler;
import org.openjdk.jmh.profile.LinuxPerfAsmProfiler;
import org.openjdk.jmh.profile.LinuxPerfNormProfiler;
import org.openjdk.jmh.profile.LinuxPerfProfiler;
import org.openjdk.jmh.profile.StackProfiler;
import org.openjdk.jmh.profile.WinPerfAsmProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
public class JMHSample_35_Profilers {
/**
* This sample serves as the profiler overview.
*
* JMH has a few very handy profilers that help to understand your benchmarks. While
* these profilers are not the substitute for full-fledged external profilers, in many
* cases, these are handy to quickly dig into the benchmark behavior. When you are
* doing many cycles of tuning up the benchmark code itself, it is important to have
* a quick turnaround for the results.
*
* Use -lprof to list the profilers. There are quite a few profilers, and this sample
* would expand on a handful of most useful ones. Many profilers have their own options,
* usually accessible via -prof <profiler-name>:help.
*
* Since profilers are reporting on different things, it is hard to construct a single
* benchmark sample that will show all profilers in action. Therefore, we have a couple
* of benchmarks in this sample.
*
*/
/*
* ================================ MAPS BENCHMARK ================================
*/
@State(Scope.Thread)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public static class Maps {
private Map<Integer, Integer> map;
@Param({"hashmap", "treemap"})
private String type;
private int begin;
private int end;
@Setup
public void setup() {
switch (type) {
case "hashmap":
map = new HashMap<>();
break;
case "treemap":
map = new TreeMap<>();
break;
default:
throw new IllegalStateException("Unknown type: " + type);
}
begin = 1;
end = 256;
for (int i = begin; i < end; i++) {
map.put(i, i);
}
}
@Benchmark
public void test(Blackhole bh) {
for (int i = begin; i < end; i++) {
bh.consume(map.get(i));
}
}
/*
* ============================== HOW TO RUN THIS TEST: ====================================
*
* You can run this test:
*
* a) Via the command line:
* $ mvn clean install
* $ java -jar target/benchmarks.jar JMHSample_35.*Maps -prof stack
* $ java -jar target/benchmarks.jar JMHSample_35.*Maps -prof gc
*
* b) Via the Java API:
* (see the JMH homepage for possible caveats when running from IDE:
* http://openjdk.java.net/projects/code-tools/jmh/)
*/
public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_35_Profilers.Maps.class.getSimpleName())
.addProfiler(StackProfiler.class)
// .addProfiler(GCProfiler.class)
.build();
new Runner(opt).run();
}
/**
Running this benchmark will yield something like:
Benchmark (type) Mode Cnt Score Error Units
JMHSample_35_Profilers.Maps.test hashmap avgt 5 1553.201 ± 6.199 ns/op
JMHSample_35_Profilers.Maps.test treemap avgt 5 5177.065 ± 361.278 ns/op
Running with -prof stack will yield:
....[Thread state: RUNNABLE]........................................................................
99.0% 99.0% org.openjdk.jmh.samples.JMHSample_35_Profilers$Maps.test
0.4% 0.4% org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Maps_test.test_avgt_jmhStub
0.2% 0.2% sun.reflect.NativeMethodAccessorImpl.invoke0
0.2% 0.2% java.lang.Integer.valueOf
0.2% 0.2% sun.misc.Unsafe.compareAndSwapInt
....[Thread state: RUNNABLE]........................................................................
78.0% 78.0% java.util.TreeMap.getEntry
21.2% 21.2% org.openjdk.jmh.samples.JMHSample_35_Profilers$Maps.test
0.4% 0.4% java.lang.Integer.valueOf
0.2% 0.2% sun.reflect.NativeMethodAccessorImpl.invoke0
0.2% 0.2% org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Maps_test.test_avgt_jmhStub
Stack profiler is useful to quickly see if the code we are stressing actually executes. As many other
sampling profilers, it is susceptible for sampling bias: it can fail to notice quickly executing methods,
for example. In the benchmark above, it does not notice HashMap.get.
Next up, GC profiler. Running with -prof gc will yield:
Benchmark (type) Mode Cnt Score Error Units
JMHSample_35_Profilers.Maps.test hashmap avgt 5 1553.201 ± 6.199 ns/op
JMHSample_35_Profilers.Maps.test:·gc.alloc.rate hashmap avgt 5 1257.046 ± 5.675 MB/sec
JMHSample_35_Profilers.Maps.test:·gc.alloc.rate.norm hashmap avgt 5 2048.001 ± 0.001 B/op
JMHSample_35_Profilers.Maps.test:·gc.churn.PS_Eden_Space hashmap avgt 5 1259.148 ± 315.277 MB/sec
JMHSample_35_Profilers.Maps.test:·gc.churn.PS_Eden_Space.norm hashmap avgt 5 2051.519 ± 520.324 B/op
JMHSample_35_Profilers.Maps.test:·gc.churn.PS_Survivor_Space hashmap avgt 5 0.175 ± 0.386 MB/sec
JMHSample_35_Profilers.Maps.test:·gc.churn.PS_Survivor_Space.norm hashmap avgt 5 0.285 ± 0.629 B/op
JMHSample_35_Profilers.Maps.test:·gc.count hashmap avgt 5 29.000 counts
JMHSample_35_Profilers.Maps.test:·gc.time hashmap avgt 5 16.000 ms
JMHSample_35_Profilers.Maps.test treemap avgt 5 5177.065 ± 361.278 ns/op
JMHSample_35_Profilers.Maps.test:·gc.alloc.rate treemap avgt 5 377.251 ± 26.188 MB/sec
JMHSample_35_Profilers.Maps.test:·gc.alloc.rate.norm treemap avgt 5 2048.003 ± 0.001 B/op
JMHSample_35_Profilers.Maps.test:·gc.churn.PS_Eden_Space treemap avgt 5 392.743 ± 174.156 MB/sec
JMHSample_35_Profilers.Maps.test:·gc.churn.PS_Eden_Space.norm treemap avgt 5 2131.767 ± 913.941 B/op
JMHSample_35_Profilers.Maps.test:·gc.churn.PS_Survivor_Space treemap avgt 5 0.131 ± 0.215 MB/sec
JMHSample_35_Profilers.Maps.test:·gc.churn.PS_Survivor_Space.norm treemap avgt 5 0.709 ± 1.125 B/op
JMHSample_35_Profilers.Maps.test:·gc.count treemap avgt 5 25.000 counts
JMHSample_35_Profilers.Maps.test:·gc.time treemap avgt 5 26.000 ms
There, we can see that the tests are producing quite some garbage. "gc.alloc" would say we are allocating 1257
and 377 MB of objects per second, or 2048 bytes per benchmark operation. "gc.churn" would say that GC removes
the same amount of garbage from Eden space every second. In other words, we are producing 2048 bytes of garbage per
benchmark operation.
If you look closely at the test, you can get a (correct) hypothesis this is due to Integer autoboxing.
Note that "gc.alloc" counters generally produce more accurate data, but they can also fail when threads come and
go over the course of the benchmark. "gc.churn" values are updated on each GC event, and so if you want a more accurate
data, running longer and/or with small heap would help. But anyhow, always cross-reference "gc.alloc" and "gc.churn"
values with each other to get a complete picture.
It is also worth noticing that non-normalized counters are dependent on benchmark performance! Here, "treemap"
tests are 3x slower, and thus both allocation and churn rates are also comparably lower. It is often useful to look
into non-normalized counters to see if the test is allocation/GC-bound (figure the allocation pressure "ceiling"
for your configuration!), and normalized counters to see the more precise benchmark behavior.
As most profilers, both "stack" and "gc" profile are able to aggregate samples from multiple forks. It is a good
idea to run multiple forks with the profilers enabled, as it improves results error estimates.
*/
}
/*
* ================================ CLASSLOADER BENCHMARK ================================
*/
@State(Scope.Thread)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public static class Classy {
/**
* Our own crippled classloader, that can only load a simple class over and over again.
*/
public static class XLoader extends URLClassLoader {
private static final byte[] X_BYTECODE = new byte[]{
(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0x00, 0x00, 0x00, 0x34, 0x00, 0x0D, 0x0A, 0x00, 0x03, 0x00,
0x0A, 0x07, 0x00, 0x0B, 0x07, 0x00, 0x0C, 0x01, 0x00, 0x06, 0x3C, 0x69, 0x6E, 0x69, 0x74, 0x3E, 0x01, 0x00, 0x03,
0x28, 0x29, 0x56, 0x01, 0x00, 0x04, 0x43, 0x6F, 0x64, 0x65, 0x01, 0x00, 0x0F, 0x4C, 0x69, 0x6E, 0x65, 0x4E, 0x75,
0x6D, 0x62, 0x65, 0x72, 0x54, 0x61, 0x62, 0x6C, 0x65, 0x01, 0x00, 0x0A, 0x53, 0x6F, 0x75, 0x72, 0x63, 0x65, 0x46,
0x69, 0x6C, 0x65, 0x01, 0x00, 0x06, 0x58, 0x2E, 0x6A, 0x61, 0x76, 0x61, 0x0C, 0x00, 0x04, 0x00, 0x05, 0x01, 0x00,
0x01, 0x58, 0x01, 0x00, 0x10, 0x6A, 0x61, 0x76, 0x61, 0x2F, 0x6C, 0x61, 0x6E, 0x67, 0x2F, 0x4F, 0x62, 0x6A, 0x65,
0x63, 0x74, 0x00, 0x20, 0x00, 0x02, 0x00, 0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x04, 0x00,
0x05, 0x00, 0x01, 0x00, 0x06, 0x00, 0x00, 0x00, 0x1D, 0x00, 0x01, 0x00, 0x01, 0x00, 0x00, 0x00, 0x05, 0x2A,
(byte) 0xB7, 0x00, 0x01, (byte) 0xB1, 0x00, 0x00, 0x00, 0x01, 0x00, 0x07, 0x00, 0x00, 0x00, 0x06, 0x00, 0x01, 0x00,
0x00, 0x00, 0x01, 0x00, 0x01, 0x00, 0x08, 0x00, 0x00, 0x00, 0x02, 0x00, 0x09,
};
public XLoader() {
super(new URL[0], ClassLoader.getSystemClassLoader());
}
@Override
protected Class<?> findClass(final String name) throws ClassNotFoundException {
return defineClass(name, X_BYTECODE, 0, X_BYTECODE.length);
}
}
@Benchmark
public Class<?> load() throws ClassNotFoundException {
return Class.forName("X", true, new XLoader());
}
/*
* ============================== HOW TO RUN THIS TEST: ====================================
*
* You can run this test:
*
* a) Via the command line:
* $ mvn clean install
* $ java -jar target/benchmarks.jar JMHSample_35.*Classy -prof cl
* $ java -jar target/benchmarks.jar JMHSample_35.*Classy -prof comp
*
* b) Via the Java API:
* (see the JMH homepage for possible caveats when running from IDE:
* http://openjdk.java.net/projects/code-tools/jmh/)
*/
public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_35_Profilers.Classy.class.getSimpleName())
.addProfiler(ClassloaderProfiler.class)
// .addProfiler(CompilerProfiler.class)
.build();
new Runner(opt).run();
}
/**
Running with -prof cl will yield:
Benchmark Mode Cnt Score Error Units
JMHSample_35_Profilers.Classy.load avgt 15 34215.363 ± 545.892 ns/op
JMHSample_35_Profilers.Classy.load:·class.load avgt 15 29374.097 ± 716.743 classes/sec
JMHSample_35_Profilers.Classy.load:·class.load.norm avgt 15 1.000 ± 0.001 classes/op
JMHSample_35_Profilers.Classy.load:·class.unload avgt 15 29598.233 ± 3420.181 classes/sec
JMHSample_35_Profilers.Classy.load:·class.unload.norm avgt 15 1.008 ± 0.119 classes/op
Here, we can see the benchmark indeed load class per benchmark op, and this adds up to more than 29K classloads
per second. We can also see the runtime is able to successfully keep the number of loaded classes at bay,
since the class unloading happens at the same rate.
This profiler is handy when doing the classloading performance work, because it says if the classes
were actually loaded, and not reused across the Class.forName calls. It also helps to see if the benchmark
performs any classloading in the measurement phase. For example, if you have non-classloading benchmark,
you would expect these metrics be zero.
Another useful profiler that could tell if compiler is doing a heavy work in background, and thus interfering
with measurement, -prof comp:
Benchmark Mode Cnt Score Error Units
JMHSample_35_Profilers.Classy.load avgt 5 33523.875 ± 3026.025 ns/op
JMHSample_35_Profilers.Classy.load:·compiler.time.profiled avgt 5 5.000 ms
JMHSample_35_Profilers.Classy.load:·compiler.time.total avgt 5 479.000 ms
We seem to be at proper steady state: out of 479 ms of total compiler work, only 5 ms happen during the
measurement window. It is expected to have some level of background compilation even at steady state.
As most profilers, both "cl" and "comp" are able to aggregate samples from multiple forks. It is a good
idea to run multiple forks with the profilers enabled, as it improves results error estimates.
*/
}
/*
* ================================ ATOMIC LONG BENCHMARK ================================
*/
@State(Scope.Benchmark)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public static class Atomic {
private AtomicLong n;
@Setup
public void setup() {
n = new AtomicLong();
}
@Benchmark
public long test() {
return n.incrementAndGet();
}
/*
* ============================== HOW TO RUN THIS TEST: ====================================
*
* You can run this test:
*
* a) Via the command line:
* $ mvn clean install
* $ java -jar target/benchmarks.jar JMHSample_35.*Atomic -prof perf -f 1 (Linux)
* $ java -jar target/benchmarks.jar JMHSample_35.*Atomic -prof perfnorm -f 3 (Linux)
* $ java -jar target/benchmarks.jar JMHSample_35.*Atomic -prof perfasm -f 1 (Linux)
* $ java -jar target/benchmarks.jar JMHSample_35.*Atomic -prof xperfasm -f 1 (Windows)
* $ java -jar target/benchmarks.jar JMHSample_35.*Atomic -prof dtraceasm -f 1 (Mac OS X)
* b) Via the Java API:
* (see the JMH homepage for possible caveats when running from IDE:
* http://openjdk.java.net/projects/code-tools/jmh/)
*/
public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_35_Profilers.Atomic.class.getSimpleName())
.addProfiler(LinuxPerfProfiler.class)
// .addProfiler(LinuxPerfNormProfiler.class)
// .addProfiler(LinuxPerfAsmProfiler.class)
// .addProfiler(WinPerfAsmProfiler.class)
// .addProfiler(DTraceAsmProfiler.class)
.build();
new Runner(opt).run();
}
/**
Dealing with nanobenchmarks like these requires looking into the abyss of runtime, hardware, and
generated code. Luckily, JMH has a few handy tools that ease the pain. If you are running Linux,
then perf_events are probably available as standard package. This kernel facility taps into
hardware counters, and provides the data for user space programs like JMH. Windows has less
sophisticated facilities, but also usable, see below.
One can simply run "perf stat java -jar ..." to get the first idea how the workload behaves. In
JMH case, however, this will cause perf to profile both host and forked JVMs.
-prof perf avoids that: JMH invokes perf for the forked VM alone. For the benchmark above, it
would print something like:
Perf stats:
--------------------------------------------------
4172.776137 task-clock (msec) # 0.411 CPUs utilized
612 context-switches # 0.147 K/sec
31 cpu-migrations # 0.007 K/sec
195 page-faults # 0.047 K/sec
16,599,643,026 cycles # 3.978 GHz [30.80%]
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
17,815,084,879 instructions # 1.07 insns per cycle [38.49%]
3,813,373,583 branches # 913.870 M/sec [38.56%]
1,212,788 branch-misses # 0.03% of all branches [38.91%]
7,582,256,427 L1-dcache-loads # 1817.077 M/sec [39.07%]
312,913 L1-dcache-load-misses # 0.00% of all L1-dcache hits [38.66%]
35,688 LLC-loads # 0.009 M/sec [32.58%]
<not supported> LLC-load-misses:HG
<not supported> L1-icache-loads:HG
161,436 L1-icache-load-misses:HG # 0.00% of all L1-icache hits [32.81%]
7,200,981,198 dTLB-loads:HG # 1725.705 M/sec [32.68%]
3,360 dTLB-load-misses:HG # 0.00% of all dTLB cache hits [32.65%]
193,874 iTLB-loads:HG # 0.046 M/sec [32.56%]
4,193 iTLB-load-misses:HG # 2.16% of all iTLB cache hits [32.44%]
<not supported> L1-dcache-prefetches:HG
0 L1-dcache-prefetch-misses:HG # 0.000 K/sec [32.33%]
10.159432892 seconds time elapsed
We can already see this benchmark goes with good IPC, does lots of loads and lots of stores,
all of them are more or less fulfilled without misses. The data like this is not handy though:
you would like to normalize the counters per benchmark op.
This is exactly what -prof perfnorm does:
Benchmark Mode Cnt Score Error Units
JMHSample_35_Profilers.Atomic.test avgt 15 6.551 ± 0.023 ns/op
JMHSample_35_Profilers.Atomic.test:·CPI avgt 3 0.933 ± 0.026 #/op
JMHSample_35_Profilers.Atomic.test:·L1-dcache-load-misses avgt 3 0.001 ± 0.022 #/op
JMHSample_35_Profilers.Atomic.test:·L1-dcache-loads avgt 3 12.267 ± 1.324 #/op
JMHSample_35_Profilers.Atomic.test:·L1-dcache-store-misses avgt 3 0.001 ± 0.006 #/op
JMHSample_35_Profilers.Atomic.test:·L1-dcache-stores avgt 3 4.090 ± 0.402 #/op
JMHSample_35_Profilers.Atomic.test:·L1-icache-load-misses avgt 3 0.001 ± 0.011 #/op
JMHSample_35_Profilers.Atomic.test:·LLC-loads avgt 3 0.001 ± 0.004 #/op
JMHSample_35_Profilers.Atomic.test:·LLC-stores avgt 3 ≈ 10⁻⁴ #/op
JMHSample_35_Profilers.Atomic.test:·branch-misses avgt 3 ≈ 10⁻⁴ #/op
JMHSample_35_Profilers.Atomic.test:·branches avgt 3 6.152 ± 0.385 #/op
JMHSample_35_Profilers.Atomic.test:·bus-cycles avgt 3 0.670 ± 0.048 #/op
JMHSample_35_Profilers.Atomic.test:·context-switches avgt 3 ≈ 10⁻⁶ #/op
JMHSample_35_Profilers.Atomic.test:·cpu-migrations avgt 3 ≈ 10⁻⁷ #/op
JMHSample_35_Profilers.Atomic.test:·cycles avgt 3 26.790 ± 1.393 #/op
JMHSample_35_Profilers.Atomic.test:·dTLB-load-misses avgt 3 ≈ 10⁻⁴ #/op
JMHSample_35_Profilers.Atomic.test:·dTLB-loads avgt 3 12.278 ± 0.277 #/op
JMHSample_35_Profilers.Atomic.test:·dTLB-store-misses avgt 3 ≈ 10⁻⁵ #/op
JMHSample_35_Profilers.Atomic.test:·dTLB-stores avgt 3 4.113 ± 0.437 #/op
JMHSample_35_Profilers.Atomic.test:·iTLB-load-misses avgt 3 ≈ 10⁻⁵ #/op
JMHSample_35_Profilers.Atomic.test:·iTLB-loads avgt 3 0.001 ± 0.034 #/op
JMHSample_35_Profilers.Atomic.test:·instructions avgt 3 28.729 ± 1.297 #/op
JMHSample_35_Profilers.Atomic.test:·minor-faults avgt 3 ≈ 10⁻⁷ #/op
JMHSample_35_Profilers.Atomic.test:·page-faults avgt 3 ≈ 10⁻⁷ #/op
JMHSample_35_Profilers.Atomic.test:·ref-cycles avgt 3 26.734 ± 2.081 #/op
It is customary to trim the lines irrelevant to the particular benchmark. We show all of them here for
completeness.
We can see that the benchmark does ~12 loads per benchmark op, and about ~4 stores per op, most of
them fitting in the cache. There are also ~6 branches per benchmark op, all are predicted as well.
It is also easy to see the benchmark op takes ~28 instructions executed in ~27 cycles.
The output would get more interesting when we run with more threads, say, -t 8:
Benchmark Mode Cnt Score Error Units
JMHSample_35_Profilers.Atomic.test avgt 15 143.595 ± 1.968 ns/op
JMHSample_35_Profilers.Atomic.test:·CPI avgt 3 17.741 ± 28.761 #/op
JMHSample_35_Profilers.Atomic.test:·L1-dcache-load-misses avgt 3 0.175 ± 0.406 #/op
JMHSample_35_Profilers.Atomic.test:·L1-dcache-loads avgt 3 11.872 ± 0.786 #/op
JMHSample_35_Profilers.Atomic.test:·L1-dcache-store-misses avgt 3 0.184 ± 0.505 #/op
JMHSample_35_Profilers.Atomic.test:·L1-dcache-stores avgt 3 4.422 ± 0.561 #/op
JMHSample_35_Profilers.Atomic.test:·L1-icache-load-misses avgt 3 0.015 ± 0.083 #/op
JMHSample_35_Profilers.Atomic.test:·LLC-loads avgt 3 0.015 ± 0.128 #/op
JMHSample_35_Profilers.Atomic.test:·LLC-stores avgt 3 1.036 ± 0.045 #/op
JMHSample_35_Profilers.Atomic.test:·branch-misses avgt 3 0.224 ± 0.492 #/op
JMHSample_35_Profilers.Atomic.test:·branches avgt 3 6.524 ± 2.873 #/op
JMHSample_35_Profilers.Atomic.test:·bus-cycles avgt 3 13.475 ± 14.502 #/op
JMHSample_35_Profilers.Atomic.test:·context-switches avgt 3 ≈ 10⁻⁴ #/op
JMHSample_35_Profilers.Atomic.test:·cpu-migrations avgt 3 ≈ 10⁻⁶ #/op
JMHSample_35_Profilers.Atomic.test:·cycles avgt 3 537.874 ± 595.723 #/op
JMHSample_35_Profilers.Atomic.test:·dTLB-load-misses avgt 3 0.001 ± 0.006 #/op
JMHSample_35_Profilers.Atomic.test:·dTLB-loads avgt 3 12.032 ± 2.430 #/op
JMHSample_35_Profilers.Atomic.test:·dTLB-store-misses avgt 3 ≈ 10⁻⁴ #/op
JMHSample_35_Profilers.Atomic.test:·dTLB-stores avgt 3 4.557 ± 0.948 #/op
JMHSample_35_Profilers.Atomic.test:·iTLB-load-misses avgt 3 ≈ 10⁻³ #/op
JMHSample_35_Profilers.Atomic.test:·iTLB-loads avgt 3 0.016 ± 0.052 #/op
JMHSample_35_Profilers.Atomic.test:·instructions avgt 3 30.367 ± 15.052 #/op
JMHSample_35_Profilers.Atomic.test:·minor-faults avgt 3 ≈ 10⁻⁵ #/op
JMHSample_35_Profilers.Atomic.test:·page-faults avgt 3 ≈ 10⁻⁵ #/op
JMHSample_35_Profilers.Atomic.test:·ref-cycles avgt 3 538.697 ± 590.183 #/op
Note how this time the CPI is awfully high: 17 cycles per instruction! Indeed, we are making almost the
same ~30 instructions, but now they take >530 cycles. Other counters highlight why: we now have cache
misses on both loads and stores, on all levels of cache hierarchy. With a simple constant-footprint
like ours, that's an indication of sharing problems. Indeed, our AtomicLong is heavily-contended
with 8 threads.
"perfnorm", again, can (and should!) be used with multiple forks, to properly estimate the metrics.
The last, but not the least player on our field is -prof perfasm. It is important to follow up on
generated code when dealing with fine-grained benchmarks. We could employ PrintAssembly to dump the
generated code, but it will dump *all* the generated code, and figuring out what is related to our
benchmark is a daunting task. But we have "perf" that can tell what program addresses are really hot!
This enables us to contrast the assembly output.
-prof perfasm would indeed contrast out the hottest loop in the generated code! It will also point
fingers at "lock xadd" as the hottest instruction in our code. Hardware counters are not very precise
about the instruction addresses, so sometimes they attribute the events to the adjacent code lines.
Hottest code regions (>10.00% "cycles" events):
....[Hottest Region 1]..............................................................................
[0x7f1824f87c45:0x7f1824f87c79] in org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Atomic_test::test_avgt_jmhStub
; - org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Atomic_test::test_avgt_jmhStub@29 (line 201)
; implicit exception: dispatches to 0x00007f1824f87d21
0x00007f1824f87c25: test %r11d,%r11d
0x00007f1824f87c28: jne 0x00007f1824f87cbd ;*ifeq
; - org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Atomic_test::test_avgt_jmhStub@32 (line 201)
0x00007f1824f87c2e: mov $0x1,%ebp
0x00007f1824f87c33: nopw 0x0(%rax,%rax,1)
0x00007f1824f87c3c: xchg %ax,%ax ;*aload
; - org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Atomic_test::test_avgt_jmhStub@13 (line 199)
0x00007f1824f87c40: mov 0x8(%rsp),%r10
0.00% 0x00007f1824f87c45: mov 0xc(%r10),%r11d ;*getfield n
; - org.openjdk.jmh.samples.JMHSample_35_Profilers$Atomic::test@1 (line 280)
; - org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Atomic_test::test_avgt_jmhStub@16 (line 199)
0.19% 0.02% 0x00007f1824f87c49: test %r11d,%r11d
0x00007f1824f87c4c: je 0x00007f1824f87cad
0x00007f1824f87c4e: mov $0x1,%edx
0x00007f1824f87c53: lock xadd %rdx,0x10(%r12,%r11,8)
;*invokevirtual getAndAddLong
; - java.util.concurrent.atomic.AtomicLong::incrementAndGet@8 (line 200)
; - org.openjdk.jmh.samples.JMHSample_35_Profilers$Atomic::test@4 (line 280)
; - org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Atomic_test::test_avgt_jmhStub@16 (line 199)
95.20% 95.06% 0x00007f1824f87c5a: add $0x1,%rdx ;*ladd
; - java.util.concurrent.atomic.AtomicLong::incrementAndGet@12 (line 200)
; - org.openjdk.jmh.samples.JMHSample_35_Profilers$Atomic::test@4 (line 280)
; - org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Atomic_test::test_avgt_jmhStub@16 (line 199)
0.24% 0.00% 0x00007f1824f87c5e: mov 0x10(%rsp),%rsi
0x00007f1824f87c63: callq 0x00007f1824e2b020 ; OopMap{[0]=Oop [8]=Oop [16]=Oop [24]=Oop off=232}
;*invokevirtual consume
; - org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Atomic_test::test_avgt_jmhStub@19 (line 199)
; {optimized virtual_call}
0.20% 0.01% 0x00007f1824f87c68: mov 0x18(%rsp),%r10
0x00007f1824f87c6d: movzbl 0x94(%r10),%r11d ;*getfield isDone
; - org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Atomic_test::test_avgt_jmhStub@29 (line 201)
0.00% 0x00007f1824f87c75: add $0x1,%rbp ; OopMap{r10=Oop [0]=Oop [8]=Oop [16]=Oop [24]=Oop off=249}
;*ifeq
; - org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Atomic_test::test_avgt_jmhStub@32 (line 201)
0.20% 0.01% 0x00007f1824f87c79: test %eax,0x15f36381(%rip) # 0x00007f183aebe000
; {poll}
0x00007f1824f87c7f: test %r11d,%r11d
0x00007f1824f87c82: je 0x00007f1824f87c40 ;*aload_2
; - org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Atomic_test::test_avgt_jmhStub@35 (line 202)
0x00007f1824f87c84: mov $0x7f1839be4220,%r10
0x00007f1824f87c8e: callq *%r10 ;*invokestatic nanoTime
; - org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Atomic_test::test_avgt_jmhStub@36 (line 202)
0x00007f1824f87c91: mov (%rsp),%r10
....................................................................................................
96.03% 95.10% <total for region 1>
perfasm would also print the hottest methods to show if we indeed spending time in our benchmark. Most of the time,
it can demangle VM and kernel symbols as well:
....[Hottest Methods (after inlining)]..............................................................
96.03% 95.10% org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Atomic_test::test_avgt_jmhStub
0.73% 0.78% org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Atomic_test::test_AverageTime
0.63% 0.00% org.openjdk.jmh.infra.Blackhole::consume
0.23% 0.25% native_write_msr_safe ([kernel.kallsyms])
0.09% 0.05% _raw_spin_unlock ([kernel.kallsyms])
0.09% 0.00% [unknown] (libpthread-2.19.so)
0.06% 0.07% _raw_spin_lock ([kernel.kallsyms])
0.06% 0.04% _raw_spin_unlock_irqrestore ([kernel.kallsyms])
0.06% 0.05% _IO_fwrite (libc-2.19.so)
0.05% 0.03% __srcu_read_lock; __srcu_read_unlock ([kernel.kallsyms])
0.04% 0.05% _raw_spin_lock_irqsave ([kernel.kallsyms])
0.04% 0.06% vfprintf (libc-2.19.so)
0.04% 0.01% mutex_unlock ([kernel.kallsyms])
0.04% 0.01% _nv014306rm ([nvidia])
0.04% 0.04% rcu_eqs_enter_common.isra.47 ([kernel.kallsyms])
0.04% 0.02% mutex_lock ([kernel.kallsyms])
0.03% 0.07% __acct_update_integrals ([kernel.kallsyms])
0.03% 0.02% fget_light ([kernel.kallsyms])
0.03% 0.01% fput ([kernel.kallsyms])
0.03% 0.04% rcu_eqs_exit_common.isra.48 ([kernel.kallsyms])
1.63% 2.26% <...other 319 warm methods...>
....................................................................................................
100.00% 98.97% <totals>
....[Distribution by Area]..........................................................................
97.44% 95.99% <generated code>
1.60% 2.42% <native code in ([kernel.kallsyms])>
0.47% 0.78% <native code in (libjvm.so)>
0.22% 0.29% <native code in (libc-2.19.so)>
0.15% 0.07% <native code in (libpthread-2.19.so)>
0.07% 0.38% <native code in ([nvidia])>
0.05% 0.06% <native code in (libhsdis-amd64.so)>
0.00% 0.00% <native code in (nf_conntrack.ko)>
0.00% 0.00% <native code in (hid.ko)>
....................................................................................................
100.00% 100.00% <totals>
Since program addresses change from fork to fork, it does not make sense to run perfasm with more than
a single fork.
*/
}
}
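The GC profiler discussion in the Maps sample attributes the per-op garbage to Integer autoboxing. As background, boxing only allocates outside the small-value cache that `Integer.valueOf` maintains; the following is a minimal stdlib-only sketch of that behavior (not part of the official sample):

```java
// Minimal sketch: Integer.valueOf caches values in [-128, 127] (guaranteed
// by the JLS); outside that range every boxing operation allocates a fresh
// Integer object -- exactly the kind of per-op garbage "-prof gc" surfaces.
public class AutoboxDemo {
    public static void main(String[] args) {
        Integer a = 100, b = 100;   // autoboxing inside the cache range
        System.out.println(a == b); // true: both refer to the cached instance

        Integer c = 200, d = 200;   // autoboxing outside the cache range
        System.out.println(c == d); // false: two distinct allocations
    }
}
```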
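The contended AtomicLong behavior that the perf counters expose can also be reproduced outside JMH with a plain program. This is only an illustrative sketch (the thread count and iteration count are arbitrary choices), not a benchmark:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the contended pattern from the Atomic sample: 8 threads
// hammering one AtomicLong. Each incrementAndGet is a single atomic
// read-modify-write, so the result is exact, but with several cores
// fighting over the same cache line each operation becomes far slower
// than the single-threaded case -- the effect the CPI numbers show.
public class ContendedCounter {
    public static void main(String[] args) throws InterruptedException {
        final AtomicLong counter = new AtomicLong();
        Thread[] threads = new Thread[8];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 100_000; i++) {
                    counter.incrementAndGet();
                }
            });
            threads[t].start();
        }
        for (Thread th : threads) {
            th.join();
        }
        System.out.println(counter.get()); // always 800000: atomicity holds
    }
}
```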
1. Adding the JMH dependencies
Search the Maven repository for the jmh-core and jmh-generator-annprocess artifacts, version 1.36. The "<scope>test</scope>" element of the jmh-generator-annprocess dependency must be commented out; otherwise running the benchmarks from the main source set fails.
<!-- https://mvnrepository.com/artifact/org.openjdk.jmh/jmh-core -->
<dependency>
<groupId>org.openjdk.jmh</groupId>
<artifactId>jmh-core</artifactId>
<version>1.36</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.openjdk.jmh/jmh-generator-annprocess -->
<dependency>
<groupId>org.openjdk.jmh</groupId>
<artifactId>jmh-generator-annprocess</artifactId>
<version>1.36</version>
<!-- <scope>test</scope>-->
</dependency>
2. StackProfiler
StackProfiler not only outputs thread stack information; it also aggregates statistics on thread states over the run, such as the percentage of time threads spend in the RUNNABLE and WAITING states.
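What StackProfiler reports is essentially periodic sampling of each thread's `Thread.getState()` together with its stack trace. A stdlib-only sketch of the raw data it aggregates (the 200 ms pause is an arbitrary grace period for the worker to reach the WAITING state):

```java
// Sketch of the data a stack profiler samples: thread states. The worker
// parks in Object.wait(), so a sample taken afterwards observes WAITING --
// the same state bucket StackProfiler reports percentages for.
public class ThreadStateDemo {
    public static void main(String[] args) throws InterruptedException {
        final Object lock = new Object();
        Thread waiter = new Thread(() -> {
            synchronized (lock) {
                try {
                    lock.wait(); // releases the monitor and waits
                } catch (InterruptedException ignored) {
                }
            }
        });
        waiter.start();
        Thread.sleep(200); // give the worker time to reach WAITING
        System.out.println(waiter.getState()); // WAITING
        synchronized (lock) {
            lock.notifyAll(); // wake the worker so the JVM can exit
        }
        waiter.join();
    }
}
```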
【StackProfiler sample: code】
package cn.zhuangyt.javabase.jmh;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.profile.StackProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import org.openjdk.jmh.runner.options.TimeValue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
/**
* JMH测试17:StackProfiler 样例
* @author 大白有点菜
*/
@BenchmarkMode(Mode.AverageTime)
@Fork(1)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Group)
public class JmhTestApp17_StackProfiler {
private BlockingQueue<Integer> queue;
private final static int VALUE = Integer.MAX_VALUE;
@Setup
public void init()
{
this.queue = new ArrayBlockingQueue<>(10);
}
@GroupThreads(5)
@Group("blockingQueue")
@Benchmark
public void put() throws InterruptedException
{
this.queue.put(VALUE);
}
@GroupThreads(5)
@Group("blockingQueue")
@Benchmark
public int take() throws InterruptedException
{
return this.queue.take();
}
public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JmhTestApp17_StackProfiler.class.getSimpleName())
.timeout(TimeValue.seconds(10))
// 增加StackProfiler
.addProfiler(StackProfiler.class)
.build();
new Runner(opt).run();
}
}
【StackProfiler样例 - 代码运行结果】
# JMH version: 1.36
# VM version: JDK 1.8.0_281, Java HotSpot(TM) 64-Bit Server VM, 25.281-b09
# VM invoker: D:\Develop\JDK\jdk1.8.0_281\jre\bin\java.exe
# VM options: -javaagent:C:\Program Files\JetBrains\IntelliJ IDEA 2021.1.3\lib\idea_rt.jar=1431:C:\Program Files\JetBrains\IntelliJ IDEA 2021.1.3\bin -Dfile.encoding=UTF-8
# Blackhole mode: full + dont-inline hint (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 s per iteration, ***WARNING: The timeout might be too low!***
# Threads: 10 threads (1 group; 5x "put", 5x "take" in each group), will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: cn.zhuangyt.javabase.jmh.JmhTestApp17_StackProfiler.blockingQueue
# Run progress: 0.00% complete, ETA 00:01:40
# Fork: 1 of 1
# Warmup Iteration 1: (benchmark timed out, interrupted 1 times) 11.438 ±(99.9%) 6.489 us/op
# Warmup Iteration 2: (benchmark timed out, interrupted 1 times) 11.984 ±(99.9%) 6.463 us/op
# Warmup Iteration 3: 10.120 ±(99.9%) 6.382 us/op
# Warmup Iteration 4: (benchmark timed out, interrupted 1 times) 12.545 ±(99.9%) 7.091 us/op
# Warmup Iteration 5: 10.464 ±(99.9%) 4.765 us/op
Iteration 1: 9.519 ±(99.9%) 4.039 us/op
put: 9.834 ±(99.9%) 13.564 us/op
take: 9.205 ±(99.9%) 7.101 us/op
·stack: <delayed till summary>
Iteration 2: 10.761 ±(99.9%) 4.969 us/op
put: 10.628 ±(99.9%) 12.275 us/op
take: 10.893 ±(99.9%) 14.461 us/op
·stack: <delayed till summary>
Iteration 3: 11.941 ±(99.9%) 9.002 us/op
put: 11.009 ±(99.9%) 9.926 us/op
take: 12.873 ±(99.9%) 32.438 us/op
·stack: <delayed till summary>
Iteration 4: 13.876 ±(99.9%) 10.938 us/op
put: 13.309 ±(99.9%) 23.108 us/op
take: 14.444 ±(99.9%) 34.647 us/op
·stack: <delayed till summary>
Iteration 5: 11.473 ±(99.9%) 7.817 us/op
put: 11.103 ±(99.9%) 20.193 us/op
take: 11.843 ±(99.9%) 21.887 us/op
·stack: <delayed till summary>
Result "cn.zhuangyt.javabase.jmh.JmhTestApp17_StackProfiler.blockingQueue":
11.514 ±(99.9%) 6.182 us/op [Average]
(min, avg, max) = (9.519, 11.514, 13.876), stdev = 1.606
CI (99.9%): [5.332, 17.696] (assumes normal distribution)
Secondary result "cn.zhuangyt.javabase.jmh.JmhTestApp17_StackProfiler.blockingQueue:put":
11.176 ±(99.9%) 4.977 us/op [Average]
(min, avg, max) = (9.834, 11.176, 13.309), stdev = 1.293
CI (99.9%): [6.199, 16.154] (assumes normal distribution)
Secondary result "cn.zhuangyt.javabase.jmh.JmhTestApp17_StackProfiler.blockingQueue:take":
11.852 ±(99.9%) 7.625 us/op [Average]
(min, avg, max) = (9.205, 11.852, 14.444), stdev = 1.980
CI (99.9%): [4.226, 19.477] (assumes normal distribution)
Secondary result "cn.zhuangyt.javabase.jmh.JmhTestApp17_StackProfiler.blockingQueue:·stack":
Stack profiler:
....[Thread state distributions]....................................................................
88.7% WAITING
11.3% RUNNABLE
....[Thread state: WAITING].........................................................................
88.7% 100.0% sun.misc.Unsafe.park
....[Thread state: RUNNABLE]........................................................................
9.1% 80.2% java.net.SocketInputStream.socketRead0
1.7% 15.3% sun.misc.Unsafe.unpark
0.5% 4.3% sun.misc.Unsafe.park
0.0% 0.1% java.util.concurrent.ArrayBlockingQueue.take
0.0% 0.0% java.util.concurrent.ArrayBlockingQueue.put
0.0% 0.0% java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await
0.0% 0.0% cn.zhuangyt.javabase.jmh.jmh_generated.JmhTestApp17_StackProfiler_blockingQueue_jmhTest.blockingQueue_AverageTime
0.0% 0.0% cn.zhuangyt.javabase.jmh.jmh_generated.JmhTestApp17_StackProfiler_blockingQueue_jmhTest.take_avgt_jmhStub
# Run complete. Total time: 00:01:43
REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.
Benchmark Mode Cnt Score Error Units
JmhTestApp17_StackProfiler.blockingQueue avgt 5 11.514 ± 6.182 us/op
JmhTestApp17_StackProfiler.blockingQueue:put avgt 5 11.176 ± 4.977 us/op
JmhTestApp17_StackProfiler.blockingQueue:take avgt 5 11.852 ± 7.625 us/op
JmhTestApp17_StackProfiler.blockingQueue:·stack avgt NaN ---
我们在 Options 中增加了 StackProfiler,它不仅可以分析线程的堆栈情况,还输出了线程状态的分布情况。
从上面的输出结果可以看到,线程状态的分布为 WAITING:88.7%,RUNNABLE:11.3%。考虑到我们使用的是 BlockingQueue,put 和 take 线程会频繁地阻塞等待,这种分布是合理的。
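可以用一个独立的小例子(非 JMH,仅为演示,类名为笔者假设)验证这种 WAITING 状态的来源:当线程在空的 ArrayBlockingQueue 上调用 take() 时,它会通过 LockSupport.park(底层正是输出中出现的 sun.misc.Unsafe.park)进入 WAITING 状态。

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ParkStateDemo {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(10);
        // 该线程在空队列上调用 take(),会一直阻塞
        Thread taker = new Thread(() -> {
            try {
                queue.take();
            } catch (InterruptedException ignored) {
                // 被中断后退出
            }
        });
        taker.start();
        Thread.sleep(500); // 等待线程进入阻塞状态
        // take() 内部通过 LockSupport.park 挂起线程,因此状态为 WAITING
        System.out.println(taker.getState());
        taker.interrupt();
        taker.join();
    }
}
```

这也解释了为什么在基准测试中大部分时间线程都停留在 sun.misc.Unsafe.park 上。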
3、GcProfiler
GcProfiler 可用于分析基准方法执行过程中垃圾回收器所花费的时间以及内存分配情况,本节将使用自定义的类加载器进行类的加载。
【GcProfiler样例 - 代码】
package cn.zhuangyt.javabase.jmh;
import java.net.URL;
import java.net.URLClassLoader;
/**
* 大白有点菜 类加载器
* @author 大白有点菜
*/
public class DbydcClassLoader extends URLClassLoader {
private final byte[] bytes;
public DbydcClassLoader(byte[] bytes) {
super(new URL[0], ClassLoader.getSystemClassLoader());
this.bytes = bytes;
}
@Override
protected Class<?> findClass(String name) throws ClassNotFoundException {
return defineClass(name, bytes, 0, bytes.length);
}
}
package cn.zhuangyt.javabase.jmh;
/**
* 大白有点菜 类
* @author 大白有点菜
*/
public class Dbydc {
private String name = "大白有点菜";
private int age = 18;
private byte[] data = new byte[1024 * 10];
public static void main(String[] args) {
}
}
package cn.zhuangyt.javabase.jmh;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.profile.GCProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;
/**
* JMH测试18:GcProfiler 样例
* @author 大白有点菜
*/
@BenchmarkMode(Mode.AverageTime)
@Fork(1)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Thread)
public class JmhTestApp18_GcProfiler {
private byte[] dbydcBytes;
private DbydcClassLoader classLoader;
@Setup
public void init() throws IOException
{
this.dbydcBytes = Files.readAllBytes(
Paths.get("D:\\Dbydc.class")
);
this.classLoader = new DbydcClassLoader(dbydcBytes);
}
@Benchmark
public Object testLoadClass() throws ClassNotFoundException, IllegalAccessException, InstantiationException {
Class<?> dbydcClass = Class.forName("cn.zhuangyt.javabase.jmh.Dbydc", true, classLoader);
return dbydcClass.newInstance();
}
public static void main(String[] args) throws RunnerException {
final Options opts = new OptionsBuilder()
.include(JmhTestApp18_GcProfiler.class.getSimpleName())
// 增加 GCProfiler,输出基准方法执行过程中的GC信息
.addProfiler(GCProfiler.class)
// 将最大堆内存设置为128MB,会有多次的GC发生
.jvmArgsAppend("-Xmx128M")
.build();
new Runner(opts).run();
}
}
有些地方需要注意一下:编译 Dbydc.java 得到 Dbydc.class 其实很简单,添加一个 main 方法运行一下即可,Dbydc.class 会生成在 target 目录下。笔者把 Dbydc.class 文件拷贝到 D 盘根目录下以方便测试。使用 Class.forName 方法加载 Dbydc 时,需要写出完整的全限定类名(即包含所在的包路径),只写类名会抛出找不到类的异常(ClassNotFoundException)。
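下面用一个独立的小例子说明 Class.forName 必须使用全限定类名(这里以 JDK 自带的 java.util.ArrayList 为例):

```java
public class ForNameDemo {
    public static void main(String[] args) throws Exception {
        // 使用全限定类名可以正常加载
        Class<?> ok = Class.forName("java.util.ArrayList");
        System.out.println(ok.getName());
        try {
            // 只写类名(缺少包名)会抛出 ClassNotFoundException
            Class.forName("ArrayList");
        } catch (ClassNotFoundException e) {
            System.out.println("ClassNotFoundException");
        }
    }
}
```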
【GcProfiler样例 - 代码运行结果】
# JMH version: 1.36
# VM version: JDK 1.8.0_281, Java HotSpot(TM) 64-Bit Server VM, 25.281-b09
# VM invoker: D:\Develop\JDK\jdk1.8.0_281\jre\bin\java.exe
# VM options: -javaagent:C:\Program Files\JetBrains\IntelliJ IDEA 2021.1.3\lib\idea_rt.jar=9976:C:\Program Files\JetBrains\IntelliJ IDEA 2021.1.3\bin -Dfile.encoding=UTF-8 -Xmx128M
# Blackhole mode: full + dont-inline hint (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: cn.zhuangyt.javabase.jmh.JmhTestApp18_GcProfiler.testLoadClass
# Run progress: 0.00% complete, ETA 00:01:40
# Fork: 1 of 1
# Warmup Iteration 1: 1.332 us/op
# Warmup Iteration 2: 1.167 us/op
# Warmup Iteration 3: 1.159 us/op
# Warmup Iteration 4: 1.416 us/op
# Warmup Iteration 5: 1.164 us/op
Iteration 1: 1.169 us/op
·gc.alloc.rate: 8385.220 MB/sec
·gc.alloc.rate.norm: 10280.000 B/op
·gc.count: 2041.000 counts
·gc.time: 1168.000 ms
Iteration 2: 1.173 us/op
·gc.alloc.rate: 8359.449 MB/sec
·gc.alloc.rate.norm: 10280.000 B/op
·gc.count: 2037.000 counts
·gc.time: 1175.000 ms
Iteration 3: 1.158 us/op
·gc.alloc.rate: 8462.958 MB/sec
·gc.alloc.rate.norm: 10280.000 B/op
·gc.count: 2060.000 counts
·gc.time: 1202.000 ms
Iteration 4: 1.155 us/op
·gc.alloc.rate: 8489.859 MB/sec
·gc.alloc.rate.norm: 10280.000 B/op
·gc.count: 2068.000 counts
·gc.time: 1196.000 ms
Iteration 5: 1.233 us/op
·gc.alloc.rate: 7953.332 MB/sec
·gc.alloc.rate.norm: 10280.000 B/op
·gc.count: 1938.000 counts
·gc.time: 1107.000 ms
Result "cn.zhuangyt.javabase.jmh.JmhTestApp18_GcProfiler.testLoadClass":
1.178 ±(99.9%) 0.122 us/op [Average]
(min, avg, max) = (1.155, 1.178, 1.233), stdev = 0.032
CI (99.9%): [1.056, 1.300] (assumes normal distribution)
Secondary result "cn.zhuangyt.javabase.jmh.JmhTestApp18_GcProfiler.testLoadClass:·gc.alloc.rate":
8330.164 ±(99.9%) 837.078 MB/sec [Average]
(min, avg, max) = (7953.332, 8330.164, 8489.859), stdev = 217.387
CI (99.9%): [7493.085, 9167.242] (assumes normal distribution)
Secondary result "cn.zhuangyt.javabase.jmh.JmhTestApp18_GcProfiler.testLoadClass:·gc.alloc.rate.norm":
10280.000 ±(99.9%) 0.001 B/op [Average]
(min, avg, max) = (10280.000, 10280.000, 10280.000), stdev = 0.001
CI (99.9%): [10280.000, 10280.000] (assumes normal distribution)
Secondary result "cn.zhuangyt.javabase.jmh.JmhTestApp18_GcProfiler.testLoadClass:·gc.count":
10144.000 ±(99.9%) 0.001 counts [Sum]
(min, avg, max) = (1938.000, 2028.800, 2068.000), stdev = 52.371
CI (99.9%): [10144.000, 10144.000] (assumes normal distribution)
Secondary result "cn.zhuangyt.javabase.jmh.JmhTestApp18_GcProfiler.testLoadClass:·gc.time":
5848.000 ±(99.9%) 0.001 ms [Sum]
(min, avg, max) = (1107.000, 1169.600, 1202.000), stdev = 37.740
CI (99.9%): [5848.000, 5848.000] (assumes normal distribution)
# Run complete. Total time: 00:01:41
REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.
Benchmark Mode Cnt Score Error Units
JmhTestApp18_GcProfiler.testLoadClass avgt 5 1.178 ± 0.122 us/op
JmhTestApp18_GcProfiler.testLoadClass:·gc.alloc.rate avgt 5 8330.164 ± 837.078 MB/sec
JmhTestApp18_GcProfiler.testLoadClass:·gc.alloc.rate.norm avgt 5 10280.000 ± 0.001 B/op
JmhTestApp18_GcProfiler.testLoadClass:·gc.count avgt 5 10144.000 counts
JmhTestApp18_GcProfiler.testLoadClass:·gc.time avgt 5 5848.000 ms
运行上面的基准测试方法,除了得到 testLoadClass() 方法的基准数据之外,还会得到GC相关的信息。
根据 GcProfiler 的输出信息可以看到,在基准方法执行过程中,GC 总共发生了 10144 次,总耗时 5848 毫秒;在此期间也发生了大量的堆内存分配,平均每秒约有 8330.164MB 的对象被创建,换算到 testLoadClass 方法的每次调用,大约会分配 10280 字节(B/op)的内存。
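gc.alloc.rate.norm 给出的 10280 B/op 与 Dbydc 类的字段基本吻合:其中 byte[1024 * 10] 数组的数据部分就占 10240 字节,剩余约 40 字节对应对象头、实例字段和数组对象头等开销(具体数值与 JVM 实现有关,这里仅作粗略核算):

```java
public class AllocNormCheck {
    public static void main(String[] args) {
        int arrayBytes = 1024 * 10;           // Dbydc 中 byte[] data 的数据部分
        int reported = 10280;                 // GcProfiler 报告的 gc.alloc.rate.norm
        int overhead = reported - arrayBytes; // 对象头、字段、数组头等开销
        System.out.println(arrayBytes);
        System.out.println(overhead);
    }
}
```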
4、ClassLoaderProfiler
ClassLoaderProfiler 可以帮助我们观察基准方法执行过程中有多少类被加载和卸载。考虑到同一个类加载器对同一个类只会加载一次,我们需要将 Warmup 设置为 0,以避免在热身阶段就加载了基准测试方法所需的所有类。
【ClassLoaderProfiler样例 - 代码】
package cn.zhuangyt.javabase.jmh;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.profile.ClassloaderProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;
/**
* JMH测试19:ClassLoaderProfiler 样例
* @author 大白有点菜
*/
@BenchmarkMode(Mode.AverageTime)
@Fork(1)
// 将热身批次设置为0
@Warmup(iterations = 0)
@Measurement(iterations = 5)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Thread)
public class JmhTestApp19_ClassLoaderProfiler {
private byte[] dbydcBytes;
private DbydcClassLoader classLoader;
@Setup
public void init() throws IOException
{
this.dbydcBytes = Files.readAllBytes(
Paths.get("D:\\Dbydc.class")
);
this.classLoader = new DbydcClassLoader(dbydcBytes);
}
@Benchmark
public Object testLoadClass() throws ClassNotFoundException, IllegalAccessException, InstantiationException {
Class<?> dbydcClass = Class.forName("cn.zhuangyt.javabase.jmh.Dbydc", true, classLoader);
return dbydcClass.newInstance();
}
public static void main(String[] args) throws RunnerException {
final Options opts = new OptionsBuilder()
.include(JmhTestApp19_ClassLoaderProfiler.class.getSimpleName())
// 增加CL Profiler,输出类的加载、卸载信息
.addProfiler(ClassloaderProfiler.class)
.build();
new Runner(opts).run();
}
}
【ClassLoaderProfiler样例 - 代码运行结果】
# JMH version: 1.36
# VM version: JDK 1.8.0_281, Java HotSpot(TM) 64-Bit Server VM, 25.281-b09
# VM invoker: D:\Develop\JDK\jdk1.8.0_281\jre\bin\java.exe
# VM options: -javaagent:C:\Program Files\JetBrains\IntelliJ IDEA 2021.1.3\lib\idea_rt.jar=11136:C:\Program Files\JetBrains\IntelliJ IDEA 2021.1.3\bin -Dfile.encoding=UTF-8
# Blackhole mode: full + dont-inline hint (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: <none>
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: cn.zhuangyt.javabase.jmh.JmhTestApp19_ClassLoaderProfiler.testLoadClass
# Run progress: 0.00% complete, ETA 00:00:50
# Fork: 1 of 1
Iteration 1: 1.251 us/op
·class.load: 952.182 classes/sec
·class.load.norm: ≈ 10⁻⁵ classes/op
·class.unload: ≈ 0 classes/sec
·class.unload.norm: ≈ 0 classes/op
Iteration 2: 1.243 us/op
·class.load: ≈ 0 classes/sec
·class.load.norm: ≈ 0 classes/op
·class.unload: ≈ 0 classes/sec
·class.unload.norm: ≈ 0 classes/op
Iteration 3: 1.218 us/op
·class.load: ≈ 0 classes/sec
·class.load.norm: ≈ 0 classes/op
·class.unload: ≈ 0 classes/sec
·class.unload.norm: ≈ 0 classes/op
Iteration 4: 1.179 us/op
·class.load: ≈ 0 classes/sec
·class.load.norm: ≈ 0 classes/op
·class.unload: ≈ 0 classes/sec
·class.unload.norm: ≈ 0 classes/op
Iteration 5: 1.149 us/op
·class.load: ≈ 0 classes/sec
·class.load.norm: ≈ 0 classes/op
·class.unload: ≈ 0 classes/sec
·class.unload.norm: ≈ 0 classes/op
Result "cn.zhuangyt.javabase.jmh.JmhTestApp19_ClassLoaderProfiler.testLoadClass":
1.208 ±(99.9%) 0.167 us/op [Average]
(min, avg, max) = (1.149, 1.208, 1.251), stdev = 0.043
CI (99.9%): [1.041, 1.375] (assumes normal distribution)
Secondary result "cn.zhuangyt.javabase.jmh.JmhTestApp19_ClassLoaderProfiler.testLoadClass:·class.load":
190.436 ±(99.9%) 1639.715 classes/sec [Average]
(min, avg, max) = (≈ 0, 190.436, 952.182), stdev = 425.829
CI (99.9%): [≈ 0, 1830.151] (assumes normal distribution)
Secondary result "cn.zhuangyt.javabase.jmh.JmhTestApp19_ClassLoaderProfiler.testLoadClass:·class.load.norm":
≈ 10⁻⁶ classes/op
Secondary result "cn.zhuangyt.javabase.jmh.JmhTestApp19_ClassLoaderProfiler.testLoadClass:·class.unload":
≈ 0 classes/sec
Secondary result "cn.zhuangyt.javabase.jmh.JmhTestApp19_ClassLoaderProfiler.testLoadClass:·class.unload.norm":
≈ 0 classes/op
# Run complete. Total time: 00:00:51
REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.
Benchmark Mode Cnt Score Error Units
JmhTestApp19_ClassLoaderProfiler.testLoadClass avgt 5 1.208 ± 0.167 us/op
JmhTestApp19_ClassLoaderProfiler.testLoadClass:·class.load avgt 5 190.436 ± 1639.715 classes/sec
JmhTestApp19_ClassLoaderProfiler.testLoadClass:·class.load.norm avgt 5 ≈ 10⁻⁶ classes/op
JmhTestApp19_ClassLoaderProfiler.testLoadClass:·class.unload avgt 5 ≈ 0 classes/sec
JmhTestApp19_ClassLoaderProfiler.testLoadClass:·class.unload.norm avgt 5 ≈ 0 classes/op
运行上面的基准测试方法,我们会看到只有第一个度量批次加载了大量的类(952.182 classes/sec),余下几次度量中不再有类的加载,这符合JVM类加载器的基本逻辑。
输出汇总中的 190.436 classes/sec 是 5 个批次的平均值,由第一个批次的类加载平摊而来,并不表示每秒都持续有类被加载。
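正如前面所说,在同一个类加载器中,同一个类只会被真正加载一次,后续的 Class.forName 会直接返回缓存的 Class 对象,下面用 JDK 自带的类做一个简单验证:

```java
public class ClassCacheDemo {
    public static void main(String[] args) throws ClassNotFoundException {
        ClassLoader loader = ClassCacheDemo.class.getClassLoader();
        // 同一个类加载器对同一个类的两次加载,返回的是同一个 Class 对象
        Class<?> first = Class.forName("java.util.HashMap", true, loader);
        Class<?> second = Class.forName("java.util.HashMap", true, loader);
        System.out.println(first == second);
    }
}
```

这也正是 ClassLoaderProfiler 样例中将 Warmup 设置为 0 的原因:一旦热身阶段完成了类加载,度量阶段就观察不到类加载行为了。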
5、CompilerProfiler
CompilerProfiler 会统计代码执行过程中 JIT 编译器所花费的优化时间,我们可以打开 verbose 模式(本例中为 VerboseMode.EXTRA)观察更详细的输出。
【CompilerProfiler样例 - 代码】
package cn.zhuangyt.javabase.jmh;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.profile.CompilerProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import org.openjdk.jmh.runner.options.VerboseMode;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;
/**
* JMH测试20:CompilerProfiler 样例
* @author 大白有点菜
*/
@BenchmarkMode(Mode.AverageTime)
@Fork(1)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Thread)
public class JmhTestApp20_CompilerProfiler {
private byte[] dbydcBytes;
private DbydcClassLoader classLoader;
@Setup
public void init() throws IOException
{
this.dbydcBytes = Files.readAllBytes(
Paths.get("D:\\Dbydc.class")
);
this.classLoader = new DbydcClassLoader(dbydcBytes);
}
@Benchmark
public Object testLoadClass() throws ClassNotFoundException, IllegalAccessException, InstantiationException {
Class<?> dbydcClass = Class.forName("cn.zhuangyt.javabase.jmh.Dbydc", true, classLoader);
return dbydcClass.newInstance();
}
public static void main(String[] args) throws RunnerException {
final Options opts = new OptionsBuilder()
.include(JmhTestApp20_CompilerProfiler.class.getSimpleName())
.addProfiler(CompilerProfiler.class)
.verbosity(VerboseMode.EXTRA)
.build();
new Runner(opts).run();
}
}
【CompilerProfiler样例 - 代码运行结果】
# JMH version: 1.36
# VM version: JDK 1.8.0_281, Java HotSpot(TM) 64-Bit Server VM, 25.281-b09
# VM invoker: D:\Develop\JDK\jdk1.8.0_281\jre\bin\java.exe
# VM options: -javaagent:C:\Program Files\JetBrains\IntelliJ IDEA 2021.1.3\lib\idea_rt.jar=11289:C:\Program Files\JetBrains\IntelliJ IDEA 2021.1.3\bin -Dfile.encoding=UTF-8
# Blackhole mode: full + dont-inline hint (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: cn.zhuangyt.javabase.jmh.JmhTestApp20_CompilerProfiler.testLoadClass
# Run progress: 0.00% complete, ETA 00:01:40
Forking using command: [D:\Develop\JDK\jdk1.8.0_281\jre\bin\java.exe, -javaagent:C:\Program Files\JetBrains\IntelliJ IDEA 2021.1.3\lib\idea_rt.jar=11289:C:\Program Files\JetBrains\IntelliJ IDEA 2021.1.3\bin, -Dfile.encoding=UTF-8, -XX:CompileCommandFile=C:\Users\ZHUANG~1\AppData\Local\Temp\jmh6395784747893774195compilecommand, -cp, "D:\Develop\JDK\jdk1.8.0_281\jre\lib\charsets.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\deploy.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\ext\access-bridge-64.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\ext\cldrdata.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\ext\dnsns.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\ext\jaccess.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\ext\jfxrt.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\ext\localedata.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\ext\nashorn.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\ext\sunec.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\ext\sunjce_provider.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\ext\sunmscapi.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\ext\sunpkcs11.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\ext\zipfs.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\javaws.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\jce.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\jfr.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\jfxswt.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\jsse.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\management-agent.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\plugin.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\resources.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\rt.jar;D:\Projects\JavaProject\spring-cloud-study\java-base\target\classes;D:\Develop\MavenRepository\org\springframework\boot\spring-boot-starter-web\2.6.3\spring-boot-starter-web-2.6.3.jar;D:\Develop\MavenRepository\org\springframework\boot\spring-boot-starter\2.6.3\spring-boot-starter-2.6.3.jar;D:\Develop\MavenRepository\org\springframework\boot\spring-boot\2.6.3\spring-boot-2.6.3.jar;D:\Develop\MavenRepository\org\springframework\boot\spring-boot-autoconfigure\2.6.3\spring-boot-autoconfigure-2.6.3.jar;
D:\Develop\MavenRepository\org\springframework\boot\spring-boot-starter-logging\2.6.3\spring-boot-starter-logging-2.6.3.jar;D:\Develop\MavenRepository\ch\qos\logback\logback-classic\1.2.10\logback-classic-1.2.10.jar;D:\Develop\MavenRepository\ch\qos\logback\logback-core\1.2.10\logback-core-1.2.10.jar;D:\Develop\MavenRepository\org\apache\logging\log4j\log4j-to-slf4j\2.17.1\log4j-to-slf4j-2.17.1.jar;D:\Develop\MavenRepository\org\apache\logging\log4j\log4j-api\2.17.1\log4j-api-2.17.1.jar;D:\Develop\MavenRepository\org\slf4j\jul-to-slf4j\1.7.33\jul-to-slf4j-1.7.33.jar;D:\Develop\MavenRepository\jakarta\annotation\jakarta.annotation-api\1.3.5\jakarta.annotation-api-1.3.5.jar;D:\Develop\MavenRepository\org\yaml\snakeyaml\1.29\snakeyaml-1.29.jar;D:\Develop\MavenRepository\org\springframework\boot\spring-boot-starter-json\2.6.3\spring-boot-starter-json-2.6.3.jar;D:\Develop\MavenRepository\com\fasterxml\jackson\core\jackson-databind\2.13.1\jackson-databind-2.13.1.jar;D:\Develop\MavenRepository\com\fasterxml\jackson\core\jackson-annotations\2.13.1\jackson-annotations-2.13.1.jar;D:\Develop\MavenRepository\com\fasterxml\jackson\core\jackson-core\2.13.1\jackson-core-2.13.1.jar;D:\Develop\MavenRepository\com\fasterxml\jackson\datatype\jackson-datatype-jdk8\2.13.1\jackson-datatype-jdk8-2.13.1.jar;D:\Develop\MavenRepository\com\fasterxml\jackson\datatype\jackson-datatype-jsr310\2.13.1\jackson-datatype-jsr310-2.13.1.jar;D:\Develop\MavenRepository\com\fasterxml\jackson\module\jackson-module-parameter-names\2.13.1\jackson-module-parameter-names-2.13.1.jar;D:\Develop\MavenRepository\org\springframework\boot\spring-boot-starter-tomcat\2.6.3\spring-boot-starter-tomcat-2.6.3.jar;D:\Develop\MavenRepository\org\apache\tomcat\embed\tomcat-embed-core\9.0.56\tomcat-embed-core-9.0.56.jar;D:\Develop\MavenRepository\org\apache\tomcat\embed\tomcat-embed-el\9.0.56\tomcat-embed-el-9.0.56.jar;D:\Develop\MavenRepository\org\apache\tomcat\embed\tomcat-embed-websocket\9.0.56\tomcat-embed-websocket-9.0
.56.jar;D:\Develop\MavenRepository\org\springframework\spring-web\5.3.15\spring-web-5.3.15.jar;D:\Develop\MavenRepository\org\springframework\spring-beans\5.3.15\spring-beans-5.3.15.jar;D:\Develop\MavenRepository\org\springframework\spring-webmvc\5.3.15\spring-webmvc-5.3.15.jar;D:\Develop\MavenRepository\org\springframework\spring-aop\5.3.15\spring-aop-5.3.15.jar;D:\Develop\MavenRepository\org\springframework\spring-context\5.3.15\spring-context-5.3.15.jar;D:\Develop\MavenRepository\org\springframework\spring-expression\5.3.15\spring-expression-5.3.15.jar;D:\Develop\MavenRepository\org\slf4j\slf4j-api\1.7.33\slf4j-api-1.7.33.jar;D:\Develop\MavenRepository\org\hamcrest\hamcrest\2.2\hamcrest-2.2.jar;D:\Develop\MavenRepository\org\springframework\spring-core\5.3.15\spring-core-5.3.15.jar;D:\Develop\MavenRepository\org\springframework\spring-jcl\5.3.15\spring-jcl-5.3.15.jar;D:\Develop\MavenRepository\org\openjdk\jmh\jmh-core\1.36\jmh-core-1.36.jar;D:\Develop\MavenRepository\net\sf\jopt-simple\jopt-simple\5.0.4\jopt-simple-5.0.4.jar;D:\Develop\MavenRepository\org\apache\commons\commons-math3\3.2\commons-math3-3.2.jar;D:\Develop\MavenRepository\org\openjdk\jmh\jmh-generator-annprocess\1.36\jmh-generator-annprocess-1.36.jar;D:\Develop\MavenRepository\junit\junit\4.13.2\junit-4.13.2.jar;D:\Develop\MavenRepository\org\hamcrest\hamcrest-core\2.2\hamcrest-core-2.2.jar;D:\Develop\MavenRepository\com\google\guava\guava\31.0.1-jre\guava-31.0.1-jre.jar;D:\Develop\MavenRepository\com\google\guava\failureaccess\1.0.1\failureaccess-1.0.1.jar;D:\Develop\MavenRepository\com\google\guava\listenablefuture\9999.0-empty-to-avoid-conflict-with-guava\listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar;D:\Develop\MavenRepository\com\google\code\findbugs\jsr305\3.0.2\jsr305-3.0.2.jar;D:\Develop\MavenRepository\org\checkerframework\checker-qual\3.12.0\checker-qual-3.12.0.jar;D:\Develop\MavenRepository\com\google\errorprone\error_prone_annotations\2.7.1\error_prone_annotations-2.7.1.j
ar;D:\Develop\MavenRepository\com\google\j2objc\j2objc-annotations\1.3\j2objc-annotations-1.3.jar;C:\Program Files\JetBrains\IntelliJ IDEA 2021.1.3\lib\idea_rt.jar", org.openjdk.jmh.runner.ForkedMain, 127.0.0.1, 11291]
# Fork: 1 of 1
# Warmup Iteration 1: 1.175 us/op
# Warmup Iteration 2: 1.165 us/op
# Warmup Iteration 3: 1.104 us/op
# Warmup Iteration 4: 1.028 us/op
# Warmup Iteration 5: 1.093 us/op
Iteration 1: 1.244 us/op
·compiler.time.profiled: ≈ 0 ms
·compiler.time.total: 317.000 ms
Iteration 2: 1.091 us/op
·compiler.time.profiled: 2.000 ms
·compiler.time.total: 320.000 ms
Iteration 3: 1.140 us/op
·compiler.time.profiled: 1.000 ms
·compiler.time.total: 321.000 ms
Iteration 4: 1.182 us/op
·compiler.time.profiled: ≈ 0 ms
·compiler.time.total: 321.000 ms
Iteration 5: 1.119 us/op
·compiler.time.profiled: ≈ 0 ms
·compiler.time.total: 321.000 ms
Result "cn.zhuangyt.javabase.jmh.JmhTestApp20_CompilerProfiler.testLoadClass":
1.155 ±(99.9%) 0.229 us/op [Average]
(min, avg, max) = (1.091, 1.155, 1.244), stdev = 0.060
CI (99.9%): [0.926, 1.385] (assumes normal distribution)
Secondary result "cn.zhuangyt.javabase.jmh.JmhTestApp20_CompilerProfiler.testLoadClass:·compiler.time.profiled":
3.000 ±(99.9%) 0.001 ms [Sum]
(min, avg, max) = (≈ 0, 0.600, 2.000), stdev = 0.894
CI (99.9%): [3.000, 3.000] (assumes normal distribution)
Secondary result "cn.zhuangyt.javabase.jmh.JmhTestApp20_CompilerProfiler.testLoadClass:·compiler.time.total":
321.000 ±(99.9%) 0.001 ms [Maximum]
(min, avg, max) = (317.000, 320.000, 321.000), stdev = 1.732
CI (99.9%): [321.000, 321.000] (assumes normal distribution)
# Run complete. Total time: 00:01:41
REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.
Benchmark Mode Cnt Score Error Units
JmhTestApp20_CompilerProfiler.testLoadClass avgt 5 1.155 ± 0.229 us/op
JmhTestApp20_CompilerProfiler.testLoadClass:·compiler.time.profiled avgt 5 3.000 ms
JmhTestApp20_CompilerProfiler.testLoadClass:·compiler.time.total avgt 5 321.000 ms
我们可以看到,在整个基准执行过程中,compiler.time.profiled(度量区间内的 JIT 编译耗时)累计为 3 毫秒,compiler.time.total(JVM 进程启动以来的 JIT 编译总耗时)为 321 毫秒。
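CompilerProfiler 的数据来源于 JMX 的 CompilationMXBean(即前文表格中所说的 Standard MBean),我们也可以直接在代码里读取 JIT 编译的累计耗时,下面是一个简单示意:

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

public class JitTimeDemo {
    public static void main(String[] args) {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        // 是否支持统计编译耗时(HotSpot 支持)
        System.out.println(jit.isCompilationTimeMonitoringSupported());
        // JIT 编译器名称
        System.out.println(jit.getName());
        // 进程启动以来的累计编译耗时(毫秒),应为非负数
        System.out.println(jit.getTotalCompilationTime() >= 0);
    }
}
```

compiler.time.total 对应的就是 getTotalCompilationTime() 在度量结束时的读数。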