JMH Essentials for Performance Tuning, Part 5: JMH Profilers
The JMH Essentials series (continuously updated)
I. Preface
The previous four articles covered what JMH is, the basics of JMH, how to write correct microbenchmark cases, and JMH's advanced usage. This article introduces JMH's profilers. 【Unit conversion: 1 second (s) = 1,000,000 microseconds (us) = 1,000,000,000 nanoseconds (ns)】
The official JMH source code (samples included, under the jmh-samples module) can be downloaded from: https://github.com/openjdk/jmh/tags.
The official JMH samples can be browsed online at: http://hg.openjdk.java.net/code-tools/jmh/file/tip/jmh-samples/src/main/java/org/openjdk/jmh/samples/.
Parts of this article draw on the book 《Java高并发编程详解:深入理解并发核心库》 by 汪文君; readers who want more depth are encouraged to buy a legitimate copy.
This article is an original work by @大白有点菜. Please do not plagiarize it, and credit the source when reposting. If you find it useful, likes and follows are appreciated!
II. JMH Profilers
JMH ships with a number of very useful profilers that help you understand benchmarks more deeply, and can even help developers analyze the code under test. The built-in profilers include:
Profiler | Description |
---|---|
CL | Class loading statistics while the benchmark method runs |
COMP | JIT compiler profiling of the benchmark method via standard MBeans |
GC | GC profiling of the benchmark method via standard MBeans |
HS_CL | HotSpot™ classloader profiling via implementation-specific MBeans |
HS_COMP | HotSpot™ JIT compiler profiling via implementation-specific MBeans |
HS_GC | HotSpot™ memory manager (GC) profiling via implementation-specific MBeans |
HS_RT | HotSpot™ runtime profiling via implementation-specific MBeans |
HS_THR | HotSpot™ threading profiling via implementation-specific MBeans |
STACK | JVM thread stack sampling and analysis |
【The official profiler sample (JMHSample_35_Profilers) is reproduced below, comments included.】
package cn.zhuangyt.javabase.jmh.jmh_sample;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.profile.ClassloaderProfiler;
import org.openjdk.jmh.profile.CompilerProfiler;
import org.openjdk.jmh.profile.DTraceAsmProfiler;
import org.openjdk.jmh.profile.GCProfiler;
import org.openjdk.jmh.profile.LinuxPerfAsmProfiler;
import org.openjdk.jmh.profile.LinuxPerfNormProfiler;
import org.openjdk.jmh.profile.LinuxPerfProfiler;
import org.openjdk.jmh.profile.StackProfiler;
import org.openjdk.jmh.profile.WinPerfAsmProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
public class JMHSample_35_Profilers {
/**
* This sample serves as the profiler overview.
*
* JMH has a few very handy profilers that help to understand your benchmarks. While
* these profilers are not the substitute for full-fledged external profilers, in many
* cases, these are handy to quickly dig into the benchmark behavior. When you are
* doing many cycles of tuning up the benchmark code itself, it is important to have
* a quick turnaround for the results.
*
* Use -lprof to list the profilers. There are quite a few profilers, and this sample
* would expand on a handful of most useful ones. Many profilers have their own options,
* usually accessible via -prof <profiler-name>:help.
*
* Since profilers are reporting on different things, it is hard to construct a single
* benchmark sample that will show all profilers in action. Therefore, we have a couple
* of benchmarks in this sample.
*
*/
/*
* ================================ MAPS BENCHMARK ================================
*/
@State(Scope.Thread)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public static class Maps {
private Map<Integer, Integer> map;
@Param({"hashmap", "treemap"})
private String type;
private int begin;
private int end;
@Setup
public void setup() {
switch (type) {
case "hashmap":
map = new HashMap<>();
break;
case "treemap":
map = new TreeMap<>();
break;
default:
throw new IllegalStateException("Unknown type: " + type);
}
begin = 1;
end = 256;
for (int i = begin; i < end; i++) {
map.put(i, i);
}
}
@Benchmark
public void test(Blackhole bh) {
for (int i = begin; i < end; i++) {
bh.consume(map.get(i));
}
}
/*
* ============================== HOW TO RUN THIS TEST: ====================================
*
* You can run this test:
*
* a) Via the command line:
* $ mvn clean install
* $ java -jar target/benchmarks.jar JMHSample_35.*Maps -prof stack
* $ java -jar target/benchmarks.jar JMHSample_35.*Maps -prof gc
*
* b) Via the Java API:
* (see the JMH homepage for possible caveats when running from IDE:
* http://openjdk.java.net/projects/code-tools/jmh/)
*/
public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_35_Profilers.Maps.class.getSimpleName())
.addProfiler(StackProfiler.class)
// .addProfiler(GCProfiler.class)
.build();
new Runner(opt).run();
}
/**
Running this benchmark will yield something like:
Benchmark (type) Mode Cnt Score Error Units
JMHSample_35_Profilers.Maps.test hashmap avgt 5 1553.201 ± 6.199 ns/op
JMHSample_35_Profilers.Maps.test treemap avgt 5 5177.065 ± 361.278 ns/op
Running with -prof stack will yield:
....[Thread state: RUNNABLE]........................................................................
99.0% 99.0% org.openjdk.jmh.samples.JMHSample_35_Profilers$Maps.test
0.4% 0.4% org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Maps_test.test_avgt_jmhStub
0.2% 0.2% sun.reflect.NativeMethodAccessorImpl.invoke0
0.2% 0.2% java.lang.Integer.valueOf
0.2% 0.2% sun.misc.Unsafe.compareAndSwapInt
....[Thread state: RUNNABLE]........................................................................
78.0% 78.0% java.util.TreeMap.getEntry
21.2% 21.2% org.openjdk.jmh.samples.JMHSample_35_Profilers$Maps.test
0.4% 0.4% java.lang.Integer.valueOf
0.2% 0.2% sun.reflect.NativeMethodAccessorImpl.invoke0
0.2% 0.2% org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Maps_test.test_avgt_jmhStub
Stack profiler is useful to quickly see if the code we are stressing actually executes. As many other
sampling profilers, it is susceptible for sampling bias: it can fail to notice quickly executing methods,
for example. In the benchmark above, it does not notice HashMap.get.
Next up, GC profiler. Running with -prof gc will yield:
Benchmark (type) Mode Cnt Score Error Units
JMHSample_35_Profilers.Maps.test hashmap avgt 5 1553.201 ± 6.199 ns/op
JMHSample_35_Profilers.Maps.test:·gc.alloc.rate hashmap avgt 5 1257.046 ± 5.675 MB/sec
JMHSample_35_Profilers.Maps.test:·gc.alloc.rate.norm hashmap avgt 5 2048.001 ± 0.001 B/op
JMHSample_35_Profilers.Maps.test:·gc.churn.PS_Eden_Space hashmap avgt 5 1259.148 ± 315.277 MB/sec
JMHSample_35_Profilers.Maps.test:·gc.churn.PS_Eden_Space.norm hashmap avgt 5 2051.519 ± 520.324 B/op
JMHSample_35_Profilers.Maps.test:·gc.churn.PS_Survivor_Space hashmap avgt 5 0.175 ± 0.386 MB/sec
JMHSample_35_Profilers.Maps.test:·gc.churn.PS_Survivor_Space.norm hashmap avgt 5 0.285 ± 0.629 B/op
JMHSample_35_Profilers.Maps.test:·gc.count hashmap avgt 5 29.000 counts
JMHSample_35_Profilers.Maps.test:·gc.time hashmap avgt 5 16.000 ms
JMHSample_35_Profilers.Maps.test treemap avgt 5 5177.065 ± 361.278 ns/op
JMHSample_35_Profilers.Maps.test:·gc.alloc.rate treemap avgt 5 377.251 ± 26.188 MB/sec
JMHSample_35_Profilers.Maps.test:·gc.alloc.rate.norm treemap avgt 5 2048.003 ± 0.001 B/op
JMHSample_35_Profilers.Maps.test:·gc.churn.PS_Eden_Space treemap avgt 5 392.743 ± 174.156 MB/sec
JMHSample_35_Profilers.Maps.test:·gc.churn.PS_Eden_Space.norm treemap avgt 5 2131.767 ± 913.941 B/op
JMHSample_35_Profilers.Maps.test:·gc.churn.PS_Survivor_Space treemap avgt 5 0.131 ± 0.215 MB/sec
JMHSample_35_Profilers.Maps.test:·gc.churn.PS_Survivor_Space.norm treemap avgt 5 0.709 ± 1.125 B/op
JMHSample_35_Profilers.Maps.test:·gc.count treemap avgt 5 25.000 counts
JMHSample_35_Profilers.Maps.test:·gc.time treemap avgt 5 26.000 ms
There, we can see that the tests are producing quite some garbage. "gc.alloc" would say we are allocating 1257
and 377 MB of objects per second, or 2048 bytes per benchmark operation. "gc.churn" would say that GC removes
the same amount of garbage from Eden space every second. In other words, we are producing 2048 bytes of garbage per
benchmark operation.
If you look closely at the test, you can get a (correct) hypothesis this is due to Integer autoboxing.
Note that "gc.alloc" counters generally produce more accurate data, but they can also fail when threads come and
go over the course of the benchmark. "gc.churn" values are updated on each GC event, and so if you want a more accurate
data, running longer and/or with small heap would help. But anyhow, always cross-reference "gc.alloc" and "gc.churn"
values with each other to get a complete picture.
It is also worth noticing that non-normalized counters are dependent on benchmark performance! Here, "treemap"
tests are 3x slower, and thus both allocation and churn rates are also comparably lower. It is often useful to look
into non-normalized counters to see if the test is allocation/GC-bound (figure the allocation pressure "ceiling"
for your configuration!), and normalized counters to see the more precise benchmark behavior.
As most profilers, both "stack" and "gc" profile are able to aggregate samples from multiple forks. It is a good
idea to run multiple forks with the profilers enabled, as it improves results error estimates.
*/
}
/*
* ================================ CLASSLOADER BENCHMARK ================================
*/
@State(Scope.Thread)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public static class Classy {
/**
* Our own crippled classloader, that can only load a simple class over and over again.
*/
public static class XLoader extends URLClassLoader {
private static final byte[] X_BYTECODE = new byte[]{
(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0x00, 0x00, 0x00, 0x34, 0x00, 0x0D, 0x0A, 0x00, 0x03, 0x00,
0x0A, 0x07, 0x00, 0x0B, 0x07, 0x00, 0x0C, 0x01, 0x00, 0x06, 0x3C, 0x69, 0x6E, 0x69, 0x74, 0x3E, 0x01, 0x00, 0x03,
0x28, 0x29, 0x56, 0x01, 0x00, 0x04, 0x43, 0x6F, 0x64, 0x65, 0x01, 0x00, 0x0F, 0x4C, 0x69, 0x6E, 0x65, 0x4E, 0x75,
0x6D, 0x62, 0x65, 0x72, 0x54, 0x61, 0x62, 0x6C, 0x65, 0x01, 0x00, 0x0A, 0x53, 0x6F, 0x75, 0x72, 0x63, 0x65, 0x46,
0x69, 0x6C, 0x65, 0x01, 0x00, 0x06, 0x58, 0x2E, 0x6A, 0x61, 0x76, 0x61, 0x0C, 0x00, 0x04, 0x00, 0x05, 0x01, 0x00,
0x01, 0x58, 0x01, 0x00, 0x10, 0x6A, 0x61, 0x76, 0x61, 0x2F, 0x6C, 0x61, 0x6E, 0x67, 0x2F, 0x4F, 0x62, 0x6A, 0x65,
0x63, 0x74, 0x00, 0x20, 0x00, 0x02, 0x00, 0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x04, 0x00,
0x05, 0x00, 0x01, 0x00, 0x06, 0x00, 0x00, 0x00, 0x1D, 0x00, 0x01, 0x00, 0x01, 0x00, 0x00, 0x00, 0x05, 0x2A,
(byte) 0xB7, 0x00, 0x01, (byte) 0xB1, 0x00, 0x00, 0x00, 0x01, 0x00, 0x07, 0x00, 0x00, 0x00, 0x06, 0x00, 0x01, 0x00,
0x00, 0x00, 0x01, 0x00, 0x01, 0x00, 0x08, 0x00, 0x00, 0x00, 0x02, 0x00, 0x09,
};
public XLoader() {
super(new URL[0], ClassLoader.getSystemClassLoader());
}
@Override
protected Class<?> findClass(final String name) throws ClassNotFoundException {
return defineClass(name, X_BYTECODE, 0, X_BYTECODE.length);
}
}
@Benchmark
public Class<?> load() throws ClassNotFoundException {
return Class.forName("X", true, new XLoader());
}
/*
* ============================== HOW TO RUN THIS TEST: ====================================
*
* You can run this test:
*
* a) Via the command line:
* $ mvn clean install
* $ java -jar target/benchmarks.jar JMHSample_35.*Classy -prof cl
* $ java -jar target/benchmarks.jar JMHSample_35.*Classy -prof comp
*
* b) Via the Java API:
* (see the JMH homepage for possible caveats when running from IDE:
* http://openjdk.java.net/projects/code-tools/jmh/)
*/
public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_35_Profilers.Classy.class.getSimpleName())
.addProfiler(ClassloaderProfiler.class)
// .addProfiler(CompilerProfiler.class)
.build();
new Runner(opt).run();
}
/**
Running with -prof cl will yield:
Benchmark Mode Cnt Score Error Units
JMHSample_35_Profilers.Classy.load avgt 15 34215.363 ± 545.892 ns/op
JMHSample_35_Profilers.Classy.load:·class.load avgt 15 29374.097 ± 716.743 classes/sec
JMHSample_35_Profilers.Classy.load:·class.load.norm avgt 15 1.000 ± 0.001 classes/op
JMHSample_35_Profilers.Classy.load:·class.unload avgt 15 29598.233 ± 3420.181 classes/sec
JMHSample_35_Profilers.Classy.load:·class.unload.norm avgt 15 1.008 ± 0.119 classes/op
Here, we can see the benchmark indeed load class per benchmark op, and this adds up to more than 29K classloads
per second. We can also see the runtime is able to successfully keep the number of loaded classes at bay,
since the class unloading happens at the same rate.
This profiler is handy when doing the classloading performance work, because it says if the classes
were actually loaded, and not reused across the Class.forName calls. It also helps to see if the benchmark
performs any classloading in the measurement phase. For example, if you have non-classloading benchmark,
you would expect these metrics be zero.
Another useful profiler that could tell if compiler is doing a heavy work in background, and thus interfering
with measurement, -prof comp:
Benchmark Mode Cnt Score Error Units
JMHSample_35_Profilers.Classy.load avgt 5 33523.875 ± 3026.025 ns/op
JMHSample_35_Profilers.Classy.load:·compiler.time.profiled avgt 5 5.000 ms
JMHSample_35_Profilers.Classy.load:·compiler.time.total avgt 5 479.000 ms
We seem to be at proper steady state: out of 479 ms of total compiler work, only 5 ms happen during the
measurement window. It is expected to have some level of background compilation even at steady state.
As most profilers, both "cl" and "comp" are able to aggregate samples from multiple forks. It is a good
idea to run multiple forks with the profilers enabled, as it improves results error estimates.
*/
}
/*
* ================================ ATOMIC LONG BENCHMARK ================================
*/
@State(Scope.Benchmark)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public static class Atomic {
private AtomicLong n;
@Setup
public void setup() {
n = new AtomicLong();
}
@Benchmark
public long test() {
return n.incrementAndGet();
}
/*
* ============================== HOW TO RUN THIS TEST: ====================================
*
* You can run this test:
*
* a) Via the command line:
* $ mvn clean install
* $ java -jar target/benchmarks.jar JMHSample_35.*Atomic -prof perf -f 1 (Linux)
* $ java -jar target/benchmarks.jar JMHSample_35.*Atomic -prof perfnorm -f 3 (Linux)
* $ java -jar target/benchmarks.jar JMHSample_35.*Atomic -prof perfasm -f 1 (Linux)
* $ java -jar target/benchmarks.jar JMHSample_35.*Atomic -prof xperfasm -f 1 (Windows)
* $ java -jar target/benchmarks.jar JMHSample_35.*Atomic -prof dtraceasm -f 1 (Mac OS X)
* b) Via the Java API:
* (see the JMH homepage for possible caveats when running from IDE:
* http://openjdk.java.net/projects/code-tools/jmh/)
*/
public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JMHSample_35_Profilers.Atomic.class.getSimpleName())
.addProfiler(LinuxPerfProfiler.class)
// .addProfiler(LinuxPerfNormProfiler.class)
// .addProfiler(LinuxPerfAsmProfiler.class)
// .addProfiler(WinPerfAsmProfiler.class)
// .addProfiler(DTraceAsmProfiler.class)
.build();
new Runner(opt).run();
}
/**
Dealing with nanobenchmarks like these requires looking into the abyss of runtime, hardware, and
generated code. Luckily, JMH has a few handy tools that ease the pain. If you are running Linux,
then perf_events are probably available as standard package. This kernel facility taps into
hardware counters, and provides the data for user space programs like JMH. Windows has less
sophisticated facilities, but also usable, see below.
One can simply run "perf stat java -jar ..." to get the first idea how the workload behaves. In
JMH case, however, this will cause perf to profile both host and forked JVMs.
-prof perf avoids that: JMH invokes perf for the forked VM alone. For the benchmark above, it
would print something like:
Perf stats:
--------------------------------------------------
4172.776137 task-clock (msec) # 0.411 CPUs utilized
612 context-switches # 0.147 K/sec
31 cpu-migrations # 0.007 K/sec
195 page-faults # 0.047 K/sec
16,599,643,026 cycles # 3.978 GHz [30.80%]
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
17,815,084,879 instructions # 1.07 insns per cycle [38.49%]
3,813,373,583 branches # 913.870 M/sec [38.56%]
1,212,788 branch-misses # 0.03% of all branches [38.91%]
7,582,256,427 L1-dcache-loads # 1817.077 M/sec [39.07%]
312,913 L1-dcache-load-misses # 0.00% of all L1-dcache hits [38.66%]
35,688 LLC-loads # 0.009 M/sec [32.58%]
<not supported> LLC-load-misses:HG
<not supported> L1-icache-loads:HG
161,436 L1-icache-load-misses:HG # 0.00% of all L1-icache hits [32.81%]
7,200,981,198 dTLB-loads:HG # 1725.705 M/sec [32.68%]
3,360 dTLB-load-misses:HG # 0.00% of all dTLB cache hits [32.65%]
193,874 iTLB-loads:HG # 0.046 M/sec [32.56%]
4,193 iTLB-load-misses:HG # 2.16% of all iTLB cache hits [32.44%]
<not supported> L1-dcache-prefetches:HG
0 L1-dcache-prefetch-misses:HG # 0.000 K/sec [32.33%]
10.159432892 seconds time elapsed
We can already see this benchmark goes with good IPC, does lots of loads and lots of stores,
all of them are more or less fulfilled without misses. The data like this is not handy though:
you would like to normalize the counters per benchmark op.
This is exactly what -prof perfnorm does:
Benchmark Mode Cnt Score Error Units
JMHSample_35_Profilers.Atomic.test avgt 15 6.551 ± 0.023 ns/op
JMHSample_35_Profilers.Atomic.test:·CPI avgt 3 0.933 ± 0.026 #/op
JMHSample_35_Profilers.Atomic.test:·L1-dcache-load-misses avgt 3 0.001 ± 0.022 #/op
JMHSample_35_Profilers.Atomic.test:·L1-dcache-loads avgt 3 12.267 ± 1.324 #/op
JMHSample_35_Profilers.Atomic.test:·L1-dcache-store-misses avgt 3 0.001 ± 0.006 #/op
JMHSample_35_Profilers.Atomic.test:·L1-dcache-stores avgt 3 4.090 ± 0.402 #/op
JMHSample_35_Profilers.Atomic.test:·L1-icache-load-misses avgt 3 0.001 ± 0.011 #/op
JMHSample_35_Profilers.Atomic.test:·LLC-loads avgt 3 0.001 ± 0.004 #/op
JMHSample_35_Profilers.Atomic.test:·LLC-stores avgt 3 ≈ 10⁻⁴ #/op
JMHSample_35_Profilers.Atomic.test:·branch-misses avgt 3 ≈ 10⁻⁴ #/op
JMHSample_35_Profilers.Atomic.test:·branches avgt 3 6.152 ± 0.385 #/op
JMHSample_35_Profilers.Atomic.test:·bus-cycles avgt 3 0.670 ± 0.048 #/op
JMHSample_35_Profilers.Atomic.test:·context-switches avgt 3 ≈ 10⁻⁶ #/op
JMHSample_35_Profilers.Atomic.test:·cpu-migrations avgt 3 ≈ 10⁻⁷ #/op
JMHSample_35_Profilers.Atomic.test:·cycles avgt 3 26.790 ± 1.393 #/op
JMHSample_35_Profilers.Atomic.test:·dTLB-load-misses avgt 3 ≈ 10⁻⁴ #/op
JMHSample_35_Profilers.Atomic.test:·dTLB-loads avgt 3 12.278 ± 0.277 #/op
JMHSample_35_Profilers.Atomic.test:·dTLB-store-misses avgt 3 ≈ 10⁻⁵ #/op
JMHSample_35_Profilers.Atomic.test:·dTLB-stores avgt 3 4.113 ± 0.437 #/op
JMHSample_35_Profilers.Atomic.test:·iTLB-load-misses avgt 3 ≈ 10⁻⁵ #/op
JMHSample_35_Profilers.Atomic.test:·iTLB-loads avgt 3 0.001 ± 0.034 #/op
JMHSample_35_Profilers.Atomic.test:·instructions avgt 3 28.729 ± 1.297 #/op
JMHSample_35_Profilers.Atomic.test:·minor-faults avgt 3 ≈ 10⁻⁷ #/op
JMHSample_35_Profilers.Atomic.test:·page-faults avgt 3 ≈ 10⁻⁷ #/op
JMHSample_35_Profilers.Atomic.test:·ref-cycles avgt 3 26.734 ± 2.081 #/op
It is customary to trim the lines irrelevant to the particular benchmark. We show all of them here for
completeness.
We can see that the benchmark does ~12 loads per benchmark op, and about ~4 stores per op, most of
them fitting in the cache. There are also ~6 branches per benchmark op, all are predicted as well.
It is also easy to see the benchmark op takes ~28 instructions executed in ~27 cycles.
The output would get more interesting when we run with more threads, say, -t 8:
Benchmark Mode Cnt Score Error Units
JMHSample_35_Profilers.Atomic.test avgt 15 143.595 ± 1.968 ns/op
JMHSample_35_Profilers.Atomic.test:·CPI avgt 3 17.741 ± 28.761 #/op
JMHSample_35_Profilers.Atomic.test:·L1-dcache-load-misses avgt 3 0.175 ± 0.406 #/op
JMHSample_35_Profilers.Atomic.test:·L1-dcache-loads avgt 3 11.872 ± 0.786 #/op
JMHSample_35_Profilers.Atomic.test:·L1-dcache-store-misses avgt 3 0.184 ± 0.505 #/op
JMHSample_35_Profilers.Atomic.test:·L1-dcache-stores avgt 3 4.422 ± 0.561 #/op
JMHSample_35_Profilers.Atomic.test:·L1-icache-load-misses avgt 3 0.015 ± 0.083 #/op
JMHSample_35_Profilers.Atomic.test:·LLC-loads avgt 3 0.015 ± 0.128 #/op
JMHSample_35_Profilers.Atomic.test:·LLC-stores avgt 3 1.036 ± 0.045 #/op
JMHSample_35_Profilers.Atomic.test:·branch-misses avgt 3 0.224 ± 0.492 #/op
JMHSample_35_Profilers.Atomic.test:·branches avgt 3 6.524 ± 2.873 #/op
JMHSample_35_Profilers.Atomic.test:·bus-cycles avgt 3 13.475 ± 14.502 #/op
JMHSample_35_Profilers.Atomic.test:·context-switches avgt 3 ≈ 10⁻⁴ #/op
JMHSample_35_Profilers.Atomic.test:·cpu-migrations avgt 3 ≈ 10⁻⁶ #/op
JMHSample_35_Profilers.Atomic.test:·cycles avgt 3 537.874 ± 595.723 #/op
JMHSample_35_Profilers.Atomic.test:·dTLB-load-misses avgt 3 0.001 ± 0.006 #/op
JMHSample_35_Profilers.Atomic.test:·dTLB-loads avgt 3 12.032 ± 2.430 #/op
JMHSample_35_Profilers.Atomic.test:·dTLB-store-misses avgt 3 ≈ 10⁻⁴ #/op
JMHSample_35_Profilers.Atomic.test:·dTLB-stores avgt 3 4.557 ± 0.948 #/op
JMHSample_35_Profilers.Atomic.test:·iTLB-load-misses avgt 3 ≈ 10⁻³ #/op
JMHSample_35_Profilers.Atomic.test:·iTLB-loads avgt 3 0.016 ± 0.052 #/op
JMHSample_35_Profilers.Atomic.test:·instructions avgt 3 30.367 ± 15.052 #/op
JMHSample_35_Profilers.Atomic.test:·minor-faults avgt 3 ≈ 10⁻⁵ #/op
JMHSample_35_Profilers.Atomic.test:·page-faults avgt 3 ≈ 10⁻⁵ #/op
JMHSample_35_Profilers.Atomic.test:·ref-cycles avgt 3 538.697 ± 590.183 #/op
Note how this time the CPI is awfully high: 17 cycles per instruction! Indeed, we are making almost the
same ~30 instructions, but now they take >530 cycles. Other counters highlight why: we now have cache
misses on both loads and stores, on all levels of cache hierarchy. With a simple constant-footprint
like ours, that's an indication of sharing problems. Indeed, our AtomicLong is heavily-contended
with 8 threads.
"perfnorm", again, can (and should!) be used with multiple forks, to properly estimate the metrics.
The last, but not the least player on our field is -prof perfasm. It is important to follow up on
generated code when dealing with fine-grained benchmarks. We could employ PrintAssembly to dump the
generated code, but it will dump *all* the generated code, and figuring out what is related to our
benchmark is a daunting task. But we have "perf" that can tell what program addresses are really hot!
This enables us to contrast the assembly output.
-prof perfasm would indeed contrast out the hottest loop in the generated code! It will also point
fingers at "lock xadd" as the hottest instruction in our code. Hardware counters are not very precise
about the instruction addresses, so sometimes they attribute the events to the adjacent code lines.
Hottest code regions (>10.00% "cycles" events):
....[Hottest Region 1]..............................................................................
[0x7f1824f87c45:0x7f1824f87c79] in org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Atomic_test::test_avgt_jmhStub
; - org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Atomic_test::test_avgt_jmhStub@29 (line 201)
; implicit exception: dispatches to 0x00007f1824f87d21
0x00007f1824f87c25: test %r11d,%r11d
0x00007f1824f87c28: jne 0x00007f1824f87cbd ;*ifeq
; - org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Atomic_test::test_avgt_jmhStub@32 (line 201)
0x00007f1824f87c2e: mov $0x1,%ebp
0x00007f1824f87c33: nopw 0x0(%rax,%rax,1)
0x00007f1824f87c3c: xchg %ax,%ax ;*aload
; - org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Atomic_test::test_avgt_jmhStub@13 (line 199)
0x00007f1824f87c40: mov 0x8(%rsp),%r10
0.00% 0x00007f1824f87c45: mov 0xc(%r10),%r11d ;*getfield n
; - org.openjdk.jmh.samples.JMHSample_35_Profilers$Atomic::test@1 (line 280)
; - org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Atomic_test::test_avgt_jmhStub@16 (line 199)
0.19% 0.02% 0x00007f1824f87c49: test %r11d,%r11d
0x00007f1824f87c4c: je 0x00007f1824f87cad
0x00007f1824f87c4e: mov $0x1,%edx
0x00007f1824f87c53: lock xadd %rdx,0x10(%r12,%r11,8)
;*invokevirtual getAndAddLong
; - java.util.concurrent.atomic.AtomicLong::incrementAndGet@8 (line 200)
; - org.openjdk.jmh.samples.JMHSample_35_Profilers$Atomic::test@4 (line 280)
; - org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Atomic_test::test_avgt_jmhStub@16 (line 199)
95.20% 95.06% 0x00007f1824f87c5a: add $0x1,%rdx ;*ladd
; - java.util.concurrent.atomic.AtomicLong::incrementAndGet@12 (line 200)
; - org.openjdk.jmh.samples.JMHSample_35_Profilers$Atomic::test@4 (line 280)
; - org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Atomic_test::test_avgt_jmhStub@16 (line 199)
0.24% 0.00% 0x00007f1824f87c5e: mov 0x10(%rsp),%rsi
0x00007f1824f87c63: callq 0x00007f1824e2b020 ; OopMap{[0]=Oop [8]=Oop [16]=Oop [24]=Oop off=232}
;*invokevirtual consume
; - org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Atomic_test::test_avgt_jmhStub@19 (line 199)
; {optimized virtual_call}
0.20% 0.01% 0x00007f1824f87c68: mov 0x18(%rsp),%r10
0x00007f1824f87c6d: movzbl 0x94(%r10),%r11d ;*getfield isDone
; - org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Atomic_test::test_avgt_jmhStub@29 (line 201)
0.00% 0x00007f1824f87c75: add $0x1,%rbp ; OopMap{r10=Oop [0]=Oop [8]=Oop [16]=Oop [24]=Oop off=249}
;*ifeq
; - org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Atomic_test::test_avgt_jmhStub@32 (line 201)
0.20% 0.01% 0x00007f1824f87c79: test %eax,0x15f36381(%rip) # 0x00007f183aebe000
; {poll}
0x00007f1824f87c7f: test %r11d,%r11d
0x00007f1824f87c82: je 0x00007f1824f87c40 ;*aload_2
; - org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Atomic_test::test_avgt_jmhStub@35 (line 202)
0x00007f1824f87c84: mov $0x7f1839be4220,%r10
0x00007f1824f87c8e: callq *%r10 ;*invokestatic nanoTime
; - org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Atomic_test::test_avgt_jmhStub@36 (line 202)
0x00007f1824f87c91: mov (%rsp),%r10
....................................................................................................
96.03% 95.10% <total for region 1>
perfasm would also print the hottest methods to show if we indeed spending time in our benchmark. Most of the time,
it can demangle VM and kernel symbols as well:
....[Hottest Methods (after inlining)]..............................................................
96.03% 95.10% org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Atomic_test::test_avgt_jmhStub
0.73% 0.78% org.openjdk.jmh.samples.generated.JMHSample_35_Profilers_Atomic_test::test_AverageTime
0.63% 0.00% org.openjdk.jmh.infra.Blackhole::consume
0.23% 0.25% native_write_msr_safe ([kernel.kallsyms])
0.09% 0.05% _raw_spin_unlock ([kernel.kallsyms])
0.09% 0.00% [unknown] (libpthread-2.19.so)
0.06% 0.07% _raw_spin_lock ([kernel.kallsyms])
0.06% 0.04% _raw_spin_unlock_irqrestore ([kernel.kallsyms])
0.06% 0.05% _IO_fwrite (libc-2.19.so)
0.05% 0.03% __srcu_read_lock; __srcu_read_unlock ([kernel.kallsyms])
0.04% 0.05% _raw_spin_lock_irqsave ([kernel.kallsyms])
0.04% 0.06% vfprintf (libc-2.19.so)
0.04% 0.01% mutex_unlock ([kernel.kallsyms])
0.04% 0.01% _nv014306rm ([nvidia])
0.04% 0.04% rcu_eqs_enter_common.isra.47 ([kernel.kallsyms])
0.04% 0.02% mutex_lock ([kernel.kallsyms])
0.03% 0.07% __acct_update_integrals ([kernel.kallsyms])
0.03% 0.02% fget_light ([kernel.kallsyms])
0.03% 0.01% fput ([kernel.kallsyms])
0.03% 0.04% rcu_eqs_exit_common.isra.48 ([kernel.kallsyms])
1.63% 2.26% <...other 319 warm methods...>
....................................................................................................
100.00% 98.97% <totals>
....[Distribution by Area]..........................................................................
97.44% 95.99% <generated code>
1.60% 2.42% <native code in ([kernel.kallsyms])>
0.47% 0.78% <native code in (libjvm.so)>
0.22% 0.29% <native code in (libc-2.19.so)>
0.15% 0.07% <native code in (libpthread-2.19.so)>
0.07% 0.38% <native code in ([nvidia])>
0.05% 0.06% <native code in (libhsdis-amd64.so)>
0.00% 0.00% <native code in (nf_conntrack.ko)>
0.00% 0.00% <native code in (hid.ko)>
....................................................................................................
100.00% 100.00% <totals>
Since program addresses change from fork to fork, it does not make sense to run perfasm with more than
a single fork.
*/
}
}
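The GC profiler discussion in the Maps sample attributes the per-op garbage to Integer autoboxing. As background, boxing only allocates outside the small-value cache that `Integer.valueOf` maintains; the following is a minimal stdlib-only sketch of that behavior (not part of the official sample):

```java
// Minimal sketch: Integer.valueOf caches values in [-128, 127] (guaranteed
// by the JLS); outside that range every boxing operation allocates a fresh
// Integer object -- exactly the kind of per-op garbage "-prof gc" surfaces.
public class AutoboxDemo {
    public static void main(String[] args) {
        Integer a = 100, b = 100;   // autoboxing inside the cache range
        System.out.println(a == b); // true: both refer to the cached instance

        Integer c = 200, d = 200;   // autoboxing outside the cache range
        System.out.println(c == d); // false: two distinct allocations
    }
}
```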
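The contended AtomicLong behavior that the perf counters expose can also be reproduced outside JMH with a plain program. This is only an illustrative sketch (the thread count and iteration count are arbitrary choices), not a benchmark:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the contended pattern from the Atomic sample: 8 threads
// hammering one AtomicLong. Each incrementAndGet is a single atomic
// read-modify-write, so the result is exact, but with several cores
// fighting over the same cache line each operation becomes far slower
// than the single-threaded case -- the effect the CPI numbers show.
public class ContendedCounter {
    public static void main(String[] args) throws InterruptedException {
        final AtomicLong counter = new AtomicLong();
        Thread[] threads = new Thread[8];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 100_000; i++) {
                    counter.incrementAndGet();
                }
            });
            threads[t].start();
        }
        for (Thread th : threads) {
            th.join();
        }
        System.out.println(counter.get()); // always 800000: atomicity holds
    }
}
```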
1. Adding the JMH dependencies
Search the Maven repository for the jmh-core and jmh-generator-annprocess artifacts, version 1.36. The "<scope>test</scope>" element of the jmh-generator-annprocess dependency must be commented out; otherwise running the benchmarks from the main source set fails.
<!-- https://mvnrepository.com/artifact/org.openjdk.jmh/jmh-core -->
<dependency>
<groupId>org.openjdk.jmh</groupId>
<artifactId>jmh-core</artifactId>
<version>1.36</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.openjdk.jmh/jmh-generator-annprocess -->
<dependency>
<groupId>org.openjdk.jmh</groupId>
<artifactId>jmh-generator-annprocess</artifactId>
<version>1.36</version>
<!-- <scope>test</scope>-->
</dependency>
2. StackProfiler
StackProfiler not only outputs thread stack information; it also aggregates statistics on thread states over the run, such as the percentage of time threads spend in the RUNNABLE and WAITING states.
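What StackProfiler reports is essentially periodic sampling of each thread's `Thread.getState()` together with its stack trace. A stdlib-only sketch of the raw data it aggregates (the 200 ms pause is an arbitrary grace period for the worker to reach the WAITING state):

```java
// Sketch of the data a stack profiler samples: thread states. The worker
// parks in Object.wait(), so a sample taken afterwards observes WAITING --
// the same state bucket StackProfiler reports percentages for.
public class ThreadStateDemo {
    public static void main(String[] args) throws InterruptedException {
        final Object lock = new Object();
        Thread waiter = new Thread(() -> {
            synchronized (lock) {
                try {
                    lock.wait(); // releases the monitor and waits
                } catch (InterruptedException ignored) {
                }
            }
        });
        waiter.start();
        Thread.sleep(200); // give the worker time to reach WAITING
        System.out.println(waiter.getState()); // WAITING
        synchronized (lock) {
            lock.notifyAll(); // wake the worker so the JVM can exit
        }
        waiter.join();
    }
}
```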
【StackProfiler sample: code】
package cn.zhuangyt.javabase.jmh;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.profile.StackProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import org.openjdk.jmh.runner.options.TimeValue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
/**
* JMH测试17:StackProfiler 样例
* @author 大白有点菜
*/
@BenchmarkMode(Mode.AverageTime)
@Fork(1)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Group)
public class JmhTestApp17_StackProfiler {
private BlockingQueue<Integer> queue;
private final static int VALUE = Integer.MAX_VALUE;
@Setup
public void init()
{
this.queue = new ArrayBlockingQueue<>(10);
}
@GroupThreads(5)
@Group("blockingQueue")
@Benchmark
public void put() throws InterruptedException
{
this.queue.put(VALUE);
}
@GroupThreads(5)
@Group("blockingQueue")
@Benchmark
public int take() throws InterruptedException
{
return this.queue.take();
}
public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(JmhTestApp17_StackProfiler.class.getSimpleName())
.timeout(TimeValue.seconds(10))
// 增加StackProfiler
.addProfiler(StackProfiler.class)
.build();
new Runner(opt).run();
}
}
【StackProfiler样例 - 代码运行结果】
# JMH version: 1.36
# VM version: JDK 1.8.0_281, Java HotSpot(TM) 64-Bit Server VM, 25.281-b09
# VM invoker: D:\Develop\JDK\jdk1.8.0_281\jre\bin\java.exe
# VM options: -javaagent:C:\Program Files\JetBrains\IntelliJ IDEA 2021.1.3\lib\idea_rt.jar=1431:C:\Program Files\JetBrains\IntelliJ IDEA 2021.1.3\bin -Dfile.encoding=UTF-8
# Blackhole mode: full + dont-inline hint (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 s per iteration, ***WARNING: The timeout might be too low!***
# Threads: 10 threads (1 group; 5x "put", 5x "take" in each group), will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: cn.zhuangyt.javabase.jmh.JmhTestApp17_StackProfiler.blockingQueue
# Run progress: 0.00% complete, ETA 00:01:40
# Fork: 1 of 1
# Warmup Iteration 1: (benchmark timed out, interrupted 1 times) 11.438 ±(99.9%) 6.489 us/op
# Warmup Iteration 2: (benchmark timed out, interrupted 1 times) 11.984 ±(99.9%) 6.463 us/op
# Warmup Iteration 3: 10.120 ±(99.9%) 6.382 us/op
# Warmup Iteration 4: (benchmark timed out, interrupted 1 times) 12.545 ±(99.9%) 7.091 us/op
# Warmup Iteration 5: 10.464 ±(99.9%) 4.765 us/op
Iteration 1: 9.519 ±(99.9%) 4.039 us/op
put: 9.834 ±(99.9%) 13.564 us/op
take: 9.205 ±(99.9%) 7.101 us/op
·stack: <delayed till summary>
Iteration 2: 10.761 ±(99.9%) 4.969 us/op
put: 10.628 ±(99.9%) 12.275 us/op
take: 10.893 ±(99.9%) 14.461 us/op
·stack: <delayed till summary>
Iteration 3: 11.941 ±(99.9%) 9.002 us/op
put: 11.009 ±(99.9%) 9.926 us/op
take: 12.873 ±(99.9%) 32.438 us/op
·stack: <delayed till summary>
Iteration 4: 13.876 ±(99.9%) 10.938 us/op
put: 13.309 ±(99.9%) 23.108 us/op
take: 14.444 ±(99.9%) 34.647 us/op
·stack: <delayed till summary>
Iteration 5: 11.473 ±(99.9%) 7.817 us/op
put: 11.103 ±(99.9%) 20.193 us/op
take: 11.843 ±(99.9%) 21.887 us/op
·stack: <delayed till summary>
Result "cn.zhuangyt.javabase.jmh.JmhTestApp17_StackProfiler.blockingQueue":
11.514 ±(99.9%) 6.182 us/op [Average]
(min, avg, max) = (9.519, 11.514, 13.876), stdev = 1.606
CI (99.9%): [5.332, 17.696] (assumes normal distribution)
Secondary result "cn.zhuangyt.javabase.jmh.JmhTestApp17_StackProfiler.blockingQueue:put":
11.176 ±(99.9%) 4.977 us/op [Average]
(min, avg, max) = (9.834, 11.176, 13.309), stdev = 1.293
CI (99.9%): [6.199, 16.154] (assumes normal distribution)
Secondary result "cn.zhuangyt.javabase.jmh.JmhTestApp17_StackProfiler.blockingQueue:take":
11.852 ±(99.9%) 7.625 us/op [Average]
(min, avg, max) = (9.205, 11.852, 14.444), stdev = 1.980
CI (99.9%): [4.226, 19.477] (assumes normal distribution)
Secondary result "cn.zhuangyt.javabase.jmh.JmhTestApp17_StackProfiler.blockingQueue:·stack":
Stack profiler:
....[Thread state distributions]....................................................................
88.7% WAITING
11.3% RUNNABLE
....[Thread state: WAITING].........................................................................
88.7% 100.0% sun.misc.Unsafe.park
....[Thread state: RUNNABLE]........................................................................
9.1% 80.2% java.net.SocketInputStream.socketRead0
1.7% 15.3% sun.misc.Unsafe.unpark
0.5% 4.3% sun.misc.Unsafe.park
0.0% 0.1% java.util.concurrent.ArrayBlockingQueue.take
0.0% 0.0% java.util.concurrent.ArrayBlockingQueue.put
0.0% 0.0% java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await
0.0% 0.0% cn.zhuangyt.javabase.jmh.jmh_generated.JmhTestApp17_StackProfiler_blockingQueue_jmhTest.blockingQueue_AverageTime
0.0% 0.0% cn.zhuangyt.javabase.jmh.jmh_generated.JmhTestApp17_StackProfiler_blockingQueue_jmhTest.take_avgt_jmhStub
# Run complete. Total time: 00:01:43
REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.
Benchmark Mode Cnt Score Error Units
JmhTestApp17_StackProfiler.blockingQueue avgt 5 11.514 ± 6.182 us/op
JmhTestApp17_StackProfiler.blockingQueue:put avgt 5 11.176 ± 4.977 us/op
JmhTestApp17_StackProfiler.blockingQueue:take avgt 5 11.852 ± 7.625 us/op
JmhTestApp17_StackProfiler.blockingQueue:·stack avgt NaN ---
我们在 Options 中增加了 StackProfiler,它不仅可以分析线程的堆栈情况,还输出了线程状态的分布情况。
从上面的输出结果可以看到,线程状态的分布为 WAITING:88.7%,RUNNABLE:11.3%。考虑到我们使用的是 BlockingQueue,put 和 take 线程会频繁地阻塞等待,这种分布是合理的。
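可以用一个独立的小例子(非 JMH,仅为演示,类名为笔者假设)验证这种 WAITING 状态的来源:当线程在空的 ArrayBlockingQueue 上调用 take() 时,它会通过 LockSupport.park(底层正是输出中出现的 sun.misc.Unsafe.park)进入 WAITING 状态。

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ParkStateDemo {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(10);
        // 该线程在空队列上调用 take(),会一直阻塞
        Thread taker = new Thread(() -> {
            try {
                queue.take();
            } catch (InterruptedException ignored) {
                // 被中断后退出
            }
        });
        taker.start();
        Thread.sleep(500); // 等待线程进入阻塞状态
        // take() 内部通过 LockSupport.park 挂起线程,因此状态为 WAITING
        System.out.println(taker.getState());
        taker.interrupt();
        taker.join();
    }
}
```

这也解释了为什么在基准测试中大部分时间线程都停留在 sun.misc.Unsafe.park 上。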
3、GcProfiler
GcProfiler 可用于分析基准方法执行过程中垃圾回收器所花费的时间以及内存分配情况,本节将使用自定义的类加载器进行类的加载。
【GcProfiler样例 - 代码】
package cn.zhuangyt.javabase.jmh;
import java.net.URL;
import java.net.URLClassLoader;
/**
* 大白有点菜 类加载器
* @author 大白有点菜
*/
public class DbydcClassLoader extends URLClassLoader {
private final byte[] bytes;
public DbydcClassLoader(byte[] bytes) {
super(new URL[0], ClassLoader.getSystemClassLoader());
this.bytes = bytes;
}
@Override
protected Class<?> findClass(String name) throws ClassNotFoundException {
return defineClass(name, bytes, 0, bytes.length);
}
}
package cn.zhuangyt.javabase.jmh;
/**
* 大白有点菜 类
* @author 大白有点菜
*/
public class Dbydc {
private String name = "大白有点菜";
private int age = 18;
private byte[] data = new byte[1024 * 10];
public static void main(String[] args) {
}
}
package cn.zhuangyt.javabase.jmh;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.profile.GCProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;
/**
* JMH测试18:GcProfiler 样例
* @author 大白有点菜
*/
@BenchmarkMode(Mode.AverageTime)
@Fork(1)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Thread)
public class JmhTestApp18_GcProfiler {
private byte[] dbydcBytes;
private DbydcClassLoader classLoader;
@Setup
public void init() throws IOException
{
this.dbydcBytes = Files.readAllBytes(
Paths.get("D:\\Dbydc.class")
);
this.classLoader = new DbydcClassLoader(dbydcBytes);
}
@Benchmark
public Object testLoadClass() throws ClassNotFoundException, IllegalAccessException, InstantiationException {
Class<?> dbydcClass = Class.forName("cn.zhuangyt.javabase.jmh.Dbydc", true, classLoader);
return dbydcClass.newInstance();
}
public static void main(String[] args) throws RunnerException {
final Options opts = new OptionsBuilder()
.include(JmhTestApp18_GcProfiler.class.getSimpleName())
// 增加 GCProfiler,输出基准方法执行过程中的GC信息
.addProfiler(GCProfiler.class)
// 将最大堆内存设置为128MB,会有多次的GC发生
.jvmArgsAppend("-Xmx128M")
.build();
new Runner(opts).run();
}
}
有些地方需要注意一下:编译 Dbydc.java 得到 Dbydc.class 其实很简单,添加一个 main 方法运行一下即可,Dbydc.class 会生成在 target 目录下。笔者把 Dbydc.class 文件拷贝到 D 盘根目录下以方便测试。使用 Class.forName 方法加载 Dbydc 时,需要写出完整的全限定类名(即包含所在的包路径),只写类名会抛出找不到类的异常(ClassNotFoundException)。
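下面用一个独立的小例子说明 Class.forName 必须使用全限定类名(这里以 JDK 自带的 java.util.ArrayList 为例):

```java
public class ForNameDemo {
    public static void main(String[] args) throws Exception {
        // 使用全限定类名可以正常加载
        Class<?> ok = Class.forName("java.util.ArrayList");
        System.out.println(ok.getName());
        try {
            // 只写类名(缺少包名)会抛出 ClassNotFoundException
            Class.forName("ArrayList");
        } catch (ClassNotFoundException e) {
            System.out.println("ClassNotFoundException");
        }
    }
}
```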
【GcProfiler样例 - 代码运行结果】
# JMH version: 1.36
# VM version: JDK 1.8.0_281, Java HotSpot(TM) 64-Bit Server VM, 25.281-b09
# VM invoker: D:\Develop\JDK\jdk1.8.0_281\jre\bin\java.exe
# VM options: -javaagent:C:\Program Files\JetBrains\IntelliJ IDEA 2021.1.3\lib\idea_rt.jar=9976:C:\Program Files\JetBrains\IntelliJ IDEA 2021.1.3\bin -Dfile.encoding=UTF-8 -Xmx128M
# Blackhole mode: full + dont-inline hint (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: cn.zhuangyt.javabase.jmh.JmhTestApp18_GcProfiler.testLoadClass
# Run progress: 0.00% complete, ETA 00:01:40
# Fork: 1 of 1
# Warmup Iteration 1: 1.332 us/op
# Warmup Iteration 2: 1.167 us/op
# Warmup Iteration 3: 1.159 us/op
# Warmup Iteration 4: 1.416 us/op
# Warmup Iteration 5: 1.164 us/op
Iteration 1: 1.169 us/op
·gc.alloc.rate: 8385.220 MB/sec
·gc.alloc.rate.norm: 10280.000 B/op
·gc.count: 2041.000 counts
·gc.time: 1168.000 ms
Iteration 2: 1.173 us/op
·gc.alloc.rate: 8359.449 MB/sec
·gc.alloc.rate.norm: 10280.000 B/op
·gc.count: 2037.000 counts
·gc.time: 1175.000 ms
Iteration 3: 1.158 us/op
·gc.alloc.rate: 8462.958 MB/sec
·gc.alloc.rate.norm: 10280.000 B/op
·gc.count: 2060.000 counts
·gc.time: 1202.000 ms
Iteration 4: 1.155 us/op
·gc.alloc.rate: 8489.859 MB/sec
·gc.alloc.rate.norm: 10280.000 B/op
·gc.count: 2068.000 counts
·gc.time: 1196.000 ms
Iteration 5: 1.233 us/op
·gc.alloc.rate: 7953.332 MB/sec
·gc.alloc.rate.norm: 10280.000 B/op
·gc.count: 1938.000 counts
·gc.time: 1107.000 ms
Result "cn.zhuangyt.javabase.jmh.JmhTestApp18_GcProfiler.testLoadClass":
1.178 ±(99.9%) 0.122 us/op [Average]
(min, avg, max) = (1.155, 1.178, 1.233), stdev = 0.032
CI (99.9%): [1.056, 1.300] (assumes normal distribution)
Secondary result "cn.zhuangyt.javabase.jmh.JmhTestApp18_GcProfiler.testLoadClass:·gc.alloc.rate":
8330.164 ±(99.9%) 837.078 MB/sec [Average]
(min, avg, max) = (7953.332, 8330.164, 8489.859), stdev = 217.387
CI (99.9%): [7493.085, 9167.242] (assumes normal distribution)
Secondary result "cn.zhuangyt.javabase.jmh.JmhTestApp18_GcProfiler.testLoadClass:·gc.alloc.rate.norm":
10280.000 ±(99.9%) 0.001 B/op [Average]
(min, avg, max) = (10280.000, 10280.000, 10280.000), stdev = 0.001
CI (99.9%): [10280.000, 10280.000] (assumes normal distribution)
Secondary result "cn.zhuangyt.javabase.jmh.JmhTestApp18_GcProfiler.testLoadClass:·gc.count":
10144.000 ±(99.9%) 0.001 counts [Sum]
(min, avg, max) = (1938.000, 2028.800, 2068.000), stdev = 52.371
CI (99.9%): [10144.000, 10144.000] (assumes normal distribution)
Secondary result "cn.zhuangyt.javabase.jmh.JmhTestApp18_GcProfiler.testLoadClass:·gc.time":
5848.000 ±(99.9%) 0.001 ms [Sum]
(min, avg, max) = (1107.000, 1169.600, 1202.000), stdev = 37.740
CI (99.9%): [5848.000, 5848.000] (assumes normal distribution)
# Run complete. Total time: 00:01:41
REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.
Benchmark Mode Cnt Score Error Units
JmhTestApp18_GcProfiler.testLoadClass avgt 5 1.178 ± 0.122 us/op
JmhTestApp18_GcProfiler.testLoadClass:·gc.alloc.rate avgt 5 8330.164 ± 837.078 MB/sec
JmhTestApp18_GcProfiler.testLoadClass:·gc.alloc.rate.norm avgt 5 10280.000 ± 0.001 B/op
JmhTestApp18_GcProfiler.testLoadClass:·gc.count avgt 5 10144.000 counts
JmhTestApp18_GcProfiler.testLoadClass:·gc.time avgt 5 5848.000 ms
运行上面的基准测试方法,除了得到 testLoadClass() 方法的基准数据之外,还会得到GC相关的信息。
根据 GcProfiler 的输出信息可以看到,在基准方法执行过程中,GC 总共发生了 10144 次,总耗时 5848 毫秒;在此期间也发生了大量的堆内存分配,平均每秒约有 8330.164MB 的对象被创建,换算到 testLoadClass 方法的每次调用,大约会分配 10280 字节(B/op)的内存。
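gc.alloc.rate.norm 给出的 10280 B/op 与 Dbydc 类的字段基本吻合:其中 byte[1024 * 10] 数组的数据部分就占 10240 字节,剩余约 40 字节对应对象头、实例字段和数组对象头等开销(具体数值与 JVM 实现有关,这里仅作粗略核算):

```java
public class AllocNormCheck {
    public static void main(String[] args) {
        int arrayBytes = 1024 * 10;           // Dbydc 中 byte[] data 的数据部分
        int reported = 10280;                 // GcProfiler 报告的 gc.alloc.rate.norm
        int overhead = reported - arrayBytes; // 对象头、字段、数组头等开销
        System.out.println(arrayBytes);
        System.out.println(overhead);
    }
}
```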
4、ClassLoaderProfiler
ClassLoaderProfiler 可以帮助我们观察基准方法执行过程中有多少类被加载和卸载。考虑到同一个类加载器对同一个类只会加载一次,我们需要将 Warmup 设置为 0,以避免在热身阶段就加载了基准测试方法所需的所有类。
【ClassLoaderProfiler样例 - 代码】
package cn.zhuangyt.javabase.jmh;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.profile.ClassloaderProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;
/**
* JMH测试19:ClassLoaderProfiler 样例
* @author 大白有点菜
*/
@BenchmarkMode(Mode.AverageTime)
@Fork(1)
// 将热身批次设置为0
@Warmup(iterations = 0)
@Measurement(iterations = 5)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Thread)
public class JmhTestApp19_ClassLoaderProfiler {
private byte[] dbydcBytes;
private DbydcClassLoader classLoader;
@Setup
public void init() throws IOException
{
this.dbydcBytes = Files.readAllBytes(
Paths.get("D:\\Dbydc.class")
);
this.classLoader = new DbydcClassLoader(dbydcBytes);
}
@Benchmark
public Object testLoadClass() throws ClassNotFoundException, IllegalAccessException, InstantiationException {
Class<?> dbydcClass = Class.forName("cn.zhuangyt.javabase.jmh.Dbydc", true, classLoader);
return dbydcClass.newInstance();
}
public static void main(String[] args) throws RunnerException {
final Options opts = new OptionsBuilder()
.include(JmhTestApp19_ClassLoaderProfiler.class.getSimpleName())
// 增加CL Profiler,输出类的加载、卸载信息
.addProfiler(ClassloaderProfiler.class)
.build();
new Runner(opts).run();
}
}
【ClassLoaderProfiler样例 - 代码运行结果】
# JMH version: 1.36
# VM version: JDK 1.8.0_281, Java HotSpot(TM) 64-Bit Server VM, 25.281-b09
# VM invoker: D:\Develop\JDK\jdk1.8.0_281\jre\bin\java.exe
# VM options: -javaagent:C:\Program Files\JetBrains\IntelliJ IDEA 2021.1.3\lib\idea_rt.jar=11136:C:\Program Files\JetBrains\IntelliJ IDEA 2021.1.3\bin -Dfile.encoding=UTF-8
# Blackhole mode: full + dont-inline hint (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: <none>
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: cn.zhuangyt.javabase.jmh.JmhTestApp19_ClassLoaderProfiler.testLoadClass
# Run progress: 0.00% complete, ETA 00:00:50
# Fork: 1 of 1
Iteration 1: 1.251 us/op
·class.load: 952.182 classes/sec
·class.load.norm: ≈ 10⁻⁵ classes/op
·class.unload: ≈ 0 classes/sec
·class.unload.norm: ≈ 0 classes/op
Iteration 2: 1.243 us/op
·class.load: ≈ 0 classes/sec
·class.load.norm: ≈ 0 classes/op
·class.unload: ≈ 0 classes/sec
·class.unload.norm: ≈ 0 classes/op
Iteration 3: 1.218 us/op
·class.load: ≈ 0 classes/sec
·class.load.norm: ≈ 0 classes/op
·class.unload: ≈ 0 classes/sec
·class.unload.norm: ≈ 0 classes/op
Iteration 4: 1.179 us/op
·class.load: ≈ 0 classes/sec
·class.load.norm: ≈ 0 classes/op
·class.unload: ≈ 0 classes/sec
·class.unload.norm: ≈ 0 classes/op
Iteration 5: 1.149 us/op
·class.load: ≈ 0 classes/sec
·class.load.norm: ≈ 0 classes/op
·class.unload: ≈ 0 classes/sec
·class.unload.norm: ≈ 0 classes/op
Result "cn.zhuangyt.javabase.jmh.JmhTestApp19_ClassLoaderProfiler.testLoadClass":
1.208 ±(99.9%) 0.167 us/op [Average]
(min, avg, max) = (1.149, 1.208, 1.251), stdev = 0.043
CI (99.9%): [1.041, 1.375] (assumes normal distribution)
Secondary result "cn.zhuangyt.javabase.jmh.JmhTestApp19_ClassLoaderProfiler.testLoadClass:·class.load":
190.436 ±(99.9%) 1639.715 classes/sec [Average]
(min, avg, max) = (≈ 0, 190.436, 952.182), stdev = 425.829
CI (99.9%): [≈ 0, 1830.151] (assumes normal distribution)
Secondary result "cn.zhuangyt.javabase.jmh.JmhTestApp19_ClassLoaderProfiler.testLoadClass:·class.load.norm":
≈ 10⁻⁶ classes/op
Secondary result "cn.zhuangyt.javabase.jmh.JmhTestApp19_ClassLoaderProfiler.testLoadClass:·class.unload":
≈ 0 classes/sec
Secondary result "cn.zhuangyt.javabase.jmh.JmhTestApp19_ClassLoaderProfiler.testLoadClass:·class.unload.norm":
≈ 0 classes/op
# Run complete. Total time: 00:00:51
REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.
Benchmark Mode Cnt Score Error Units
JmhTestApp19_ClassLoaderProfiler.testLoadClass avgt 5 1.208 ± 0.167 us/op
JmhTestApp19_ClassLoaderProfiler.testLoadClass:·class.load avgt 5 190.436 ± 1639.715 classes/sec
JmhTestApp19_ClassLoaderProfiler.testLoadClass:·class.load.norm avgt 5 ≈ 10⁻⁶ classes/op
JmhTestApp19_ClassLoaderProfiler.testLoadClass:·class.unload avgt 5 ≈ 0 classes/sec
JmhTestApp19_ClassLoaderProfiler.testLoadClass:·class.unload.norm avgt 5 ≈ 0 classes/op
运行上面的基准测试方法,我们会看到只有第一个度量批次加载了大量的类(952.182 classes/sec),余下几次度量中不再有类的加载,这符合JVM类加载器的基本逻辑。
输出汇总中的 190.436 classes/sec 是 5 个批次的平均值,由第一个批次的类加载平摊而来,并不表示每秒都持续有类被加载。
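正如前面所说,在同一个类加载器中,同一个类只会被真正加载一次,后续的 Class.forName 会直接返回缓存的 Class 对象,下面用 JDK 自带的类做一个简单验证:

```java
public class ClassCacheDemo {
    public static void main(String[] args) throws ClassNotFoundException {
        ClassLoader loader = ClassCacheDemo.class.getClassLoader();
        // 同一个类加载器对同一个类的两次加载,返回的是同一个 Class 对象
        Class<?> first = Class.forName("java.util.HashMap", true, loader);
        Class<?> second = Class.forName("java.util.HashMap", true, loader);
        System.out.println(first == second);
    }
}
```

这也正是 ClassLoaderProfiler 样例中将 Warmup 设置为 0 的原因:一旦热身阶段完成了类加载,度量阶段就观察不到类加载行为了。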
5、CompilerProfiler
CompilerProfiler 会统计代码执行过程中 JIT 编译器所花费的优化时间,我们可以打开 verbose 模式(本例中为 VerboseMode.EXTRA)观察更详细的输出。
【CompilerProfiler样例 - 代码】
package cn.zhuangyt.javabase.jmh;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.profile.CompilerProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import org.openjdk.jmh.runner.options.VerboseMode;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;
/**
* JMH测试20:CompilerProfiler 样例
* @author 大白有点菜
*/
@BenchmarkMode(Mode.AverageTime)
@Fork(1)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Thread)
public class JmhTestApp20_CompilerProfiler {
private byte[] dbydcBytes;
private DbydcClassLoader classLoader;
@Setup
public void init() throws IOException
{
this.dbydcBytes = Files.readAllBytes(
Paths.get("D:\\Dbydc.class")
);
this.classLoader = new DbydcClassLoader(dbydcBytes);
}
@Benchmark
public Object testLoadClass() throws ClassNotFoundException, IllegalAccessException, InstantiationException {
Class<?> dbydcClass = Class.forName("cn.zhuangyt.javabase.jmh.Dbydc", true, classLoader);
return dbydcClass.newInstance();
}
public static void main(String[] args) throws RunnerException {
final Options opts = new OptionsBuilder()
.include(JmhTestApp20_CompilerProfiler.class.getSimpleName())
.addProfiler(CompilerProfiler.class)
.verbosity(VerboseMode.EXTRA)
.build();
new Runner(opts).run();
}
}
【CompilerProfiler样例 - 代码运行结果】
# JMH version: 1.36
# VM version: JDK 1.8.0_281, Java HotSpot(TM) 64-Bit Server VM, 25.281-b09
# VM invoker: D:\Develop\JDK\jdk1.8.0_281\jre\bin\java.exe
# VM options: -javaagent:C:\Program Files\JetBrains\IntelliJ IDEA 2021.1.3\lib\idea_rt.jar=11289:C:\Program Files\JetBrains\IntelliJ IDEA 2021.1.3\bin -Dfile.encoding=UTF-8
# Blackhole mode: full + dont-inline hint (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: cn.zhuangyt.javabase.jmh.JmhTestApp20_CompilerProfiler.testLoadClass
# Run progress: 0.00% complete, ETA 00:01:40
Forking using command: [D:\Develop\JDK\jdk1.8.0_281\jre\bin\java.exe, -javaagent:C:\Program Files\JetBrains\IntelliJ IDEA 2021.1.3\lib\idea_rt.jar=11289:C:\Program Files\JetBrains\IntelliJ IDEA 2021.1.3\bin, -Dfile.encoding=UTF-8, -XX:CompileCommandFile=C:\Users\ZHUANG~1\AppData\Local\Temp\jmh6395784747893774195compilecommand, -cp, "D:\Develop\JDK\jdk1.8.0_281\jre\lib\charsets.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\deploy.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\ext\access-bridge-64.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\ext\cldrdata.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\ext\dnsns.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\ext\jaccess.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\ext\jfxrt.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\ext\localedata.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\ext\nashorn.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\ext\sunec.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\ext\sunjce_provider.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\ext\sunmscapi.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\ext\sunpkcs11.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\ext\zipfs.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\javaws.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\jce.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\jfr.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\jfxswt.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\jsse.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\management-agent.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\plugin.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\resources.jar;D:\Develop\JDK\jdk1.8.0_281\jre\lib\rt.jar;D:\Projects\JavaProject\spring-cloud-study\java-base\target\classes;D:\Develop\MavenRepository\org\springframework\boot\spring-boot-starter-web\2.6.3\spring-boot-starter-web-2.6.3.jar;D:\Develop\MavenRepository\org\springframework\boot\spring-boot-starter\2.6.3\spring-boot-starter-2.6.3.jar;D:\Develop\MavenRepository\org\springframework\boot\spring-boot\2.6.3\spring-boot-2.6.3.jar;D:\Develop\MavenRepository\org\springframework\boot\spring-boot-autoconfigure\2.6.3\spring-boot-autoconfigure-2.6.3.jar;
D:\Develop\MavenRepository\org\springframework\boot\spring-boot-starter-logging\2.6.3\spring-boot-starter-logging-2.6.3.jar;D:\Develop\MavenRepository\ch\qos\logback\logback-classic\1.2.10\logback-classic-1.2.10.jar;D:\Develop\MavenRepository\ch\qos\logback\logback-core\1.2.10\logback-core-1.2.10.jar;D:\Develop\MavenRepository\org\apache\logging\log4j\log4j-to-slf4j\2.17.1\log4j-to-slf4j-2.17.1.jar;D:\Develop\MavenRepository\org\apache\logging\log4j\log4j-api\2.17.1\log4j-api-2.17.1.jar;D:\Develop\MavenRepository\org\slf4j\jul-to-slf4j\1.7.33\jul-to-slf4j-1.7.33.jar;D:\Develop\MavenRepository\jakarta\annotation\jakarta.annotation-api\1.3.5\jakarta.annotation-api-1.3.5.jar;D:\Develop\MavenRepository\org\yaml\snakeyaml\1.29\snakeyaml-1.29.jar;D:\Develop\MavenRepository\org\springframework\boot\spring-boot-starter-json\2.6.3\spring-boot-starter-json-2.6.3.jar;D:\Develop\MavenRepository\com\fasterxml\jackson\core\jackson-databind\2.13.1\jackson-databind-2.13.1.jar;D:\Develop\MavenRepository\com\fasterxml\jackson\core\jackson-annotations\2.13.1\jackson-annotations-2.13.1.jar;D:\Develop\MavenRepository\com\fasterxml\jackson\core\jackson-core\2.13.1\jackson-core-2.13.1.jar;D:\Develop\MavenRepository\com\fasterxml\jackson\datatype\jackson-datatype-jdk8\2.13.1\jackson-datatype-jdk8-2.13.1.jar;D:\Develop\MavenRepository\com\fasterxml\jackson\datatype\jackson-datatype-jsr310\2.13.1\jackson-datatype-jsr310-2.13.1.jar;D:\Develop\MavenRepository\com\fasterxml\jackson\module\jackson-module-parameter-names\2.13.1\jackson-module-parameter-names-2.13.1.jar;D:\Develop\MavenRepository\org\springframework\boot\spring-boot-starter-tomcat\2.6.3\spring-boot-starter-tomcat-2.6.3.jar;D:\Develop\MavenRepository\org\apache\tomcat\embed\tomcat-embed-core\9.0.56\tomcat-embed-core-9.0.56.jar;D:\Develop\MavenRepository\org\apache\tomcat\embed\tomcat-embed-el\9.0.56\tomcat-embed-el-9.0.56.jar;D:\Develop\MavenRepository\org\apache\tomcat\embed\tomcat-embed-websocket\9.0.56\tomcat-embed-websocket-9.0
.56.jar;D:\Develop\MavenRepository\org\springframework\spring-web\5.3.15\spring-web-5.3.15.jar;D:\Develop\MavenRepository\org\springframework\spring-beans\5.3.15\spring-beans-5.3.15.jar;D:\Develop\MavenRepository\org\springframework\spring-webmvc\5.3.15\spring-webmvc-5.3.15.jar;D:\Develop\MavenRepository\org\springframework\spring-aop\5.3.15\spring-aop-5.3.15.jar;D:\Develop\MavenRepository\org\springframework\spring-context\5.3.15\spring-context-5.3.15.jar;D:\Develop\MavenRepository\org\springframework\spring-expression\5.3.15\spring-expression-5.3.15.jar;D:\Develop\MavenRepository\org\slf4j\slf4j-api\1.7.33\slf4j-api-1.7.33.jar;D:\Develop\MavenRepository\org\hamcrest\hamcrest\2.2\hamcrest-2.2.jar;D:\Develop\MavenRepository\org\springframework\spring-core\5.3.15\spring-core-5.3.15.jar;D:\Develop\MavenRepository\org\springframework\spring-jcl\5.3.15\spring-jcl-5.3.15.jar;D:\Develop\MavenRepository\org\openjdk\jmh\jmh-core\1.36\jmh-core-1.36.jar;D:\Develop\MavenRepository\net\sf\jopt-simple\jopt-simple\5.0.4\jopt-simple-5.0.4.jar;D:\Develop\MavenRepository\org\apache\commons\commons-math3\3.2\commons-math3-3.2.jar;D:\Develop\MavenRepository\org\openjdk\jmh\jmh-generator-annprocess\1.36\jmh-generator-annprocess-1.36.jar;D:\Develop\MavenRepository\junit\junit\4.13.2\junit-4.13.2.jar;D:\Develop\MavenRepository\org\hamcrest\hamcrest-core\2.2\hamcrest-core-2.2.jar;D:\Develop\MavenRepository\com\google\guava\guava\31.0.1-jre\guava-31.0.1-jre.jar;D:\Develop\MavenRepository\com\google\guava\failureaccess\1.0.1\failureaccess-1.0.1.jar;D:\Develop\MavenRepository\com\google\guava\listenablefuture\9999.0-empty-to-avoid-conflict-with-guava\listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar;D:\Develop\MavenRepository\com\google\code\findbugs\jsr305\3.0.2\jsr305-3.0.2.jar;D:\Develop\MavenRepository\org\checkerframework\checker-qual\3.12.0\checker-qual-3.12.0.jar;D:\Develop\MavenRepository\com\google\errorprone\error_prone_annotations\2.7.1\error_prone_annotations-2.7.1.j
ar;D:\Develop\MavenRepository\com\google\j2objc\j2objc-annotations\1.3\j2objc-annotations-1.3.jar;C:\Program Files\JetBrains\IntelliJ IDEA 2021.1.3\lib\idea_rt.jar", org.openjdk.jmh.runner.ForkedMain, 127.0.0.1, 11291]
# Fork: 1 of 1
# Warmup Iteration 1: 1.175 us/op
# Warmup Iteration 2: 1.165 us/op
# Warmup Iteration 3: 1.104 us/op
# Warmup Iteration 4: 1.028 us/op
# Warmup Iteration 5: 1.093 us/op
Iteration 1: 1.244 us/op
·compiler.time.profiled: ≈ 0 ms
·compiler.time.total: 317.000 ms
Iteration 2: 1.091 us/op
·compiler.time.profiled: 2.000 ms
·compiler.time.total: 320.000 ms
Iteration 3: 1.140 us/op
·compiler.time.profiled: 1.000 ms
·compiler.time.total: 321.000 ms
Iteration 4: 1.182 us/op
·compiler.time.profiled: ≈ 0 ms
·compiler.time.total: 321.000 ms
Iteration 5: 1.119 us/op
·compiler.time.profiled: ≈ 0 ms
·compiler.time.total: 321.000 ms
Result "cn.zhuangyt.javabase.jmh.JmhTestApp20_CompilerProfiler.testLoadClass":
1.155 ±(99.9%) 0.229 us/op [Average]
(min, avg, max) = (1.091, 1.155, 1.244), stdev = 0.060
CI (99.9%): [0.926, 1.385] (assumes normal distribution)
Secondary result "cn.zhuangyt.javabase.jmh.JmhTestApp20_CompilerProfiler.testLoadClass:·compiler.time.profiled":
3.000 ±(99.9%) 0.001 ms [Sum]
(min, avg, max) = (≈ 0, 0.600, 2.000), stdev = 0.894
CI (99.9%): [3.000, 3.000] (assumes normal distribution)
Secondary result "cn.zhuangyt.javabase.jmh.JmhTestApp20_CompilerProfiler.testLoadClass:·compiler.time.total":
321.000 ±(99.9%) 0.001 ms [Maximum]
(min, avg, max) = (317.000, 320.000, 321.000), stdev = 1.732
CI (99.9%): [321.000, 321.000] (assumes normal distribution)
# Run complete. Total time: 00:01:41
REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.
Benchmark Mode Cnt Score Error Units
JmhTestApp20_CompilerProfiler.testLoadClass avgt 5 1.155 ± 0.229 us/op
JmhTestApp20_CompilerProfiler.testLoadClass:·compiler.time.profiled avgt 5 3.000 ms
JmhTestApp20_CompilerProfiler.testLoadClass:·compiler.time.total avgt 5 321.000 ms
我们可以看到,在整个基准执行过程中,compiler.time.profiled(度量区间内的 JIT 编译耗时)累计为 3 毫秒,compiler.time.total(JVM 进程启动以来的 JIT 编译总耗时)为 321 毫秒。
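CompilerProfiler 的数据来源于 JMX 的 CompilationMXBean(即前文表格中所说的 Standard MBean),我们也可以直接在代码里读取 JIT 编译的累计耗时,下面是一个简单示意:

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

public class JitTimeDemo {
    public static void main(String[] args) {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        // 是否支持统计编译耗时(HotSpot 支持)
        System.out.println(jit.isCompilationTimeMonitoringSupported());
        // JIT 编译器名称
        System.out.println(jit.getName());
        // 进程启动以来的累计编译耗时(毫秒),应为非负数
        System.out.println(jit.getTotalCompilationTime() >= 0);
    }
}
```

compiler.time.total 对应的就是 getTotalCompilationTime() 在度量结束时的读数。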