pprof: A Full Walkthrough of alloc

memprofile

malloc

Heap allocations in Go go through the runtime's malloc path (mallocgc).

malloc samples allocations: by default, one sample is taken per 512 KB allocated on average (the sampling interval is drawn from an exponential distribution with that mean). The sampled data is stored in a memRecord.

memrecord

From the memRecord's point of view, mark termination is the dividing line: the memory allocated before the previous mark termination, together with the memory swept between that mark termination and the current one (which is exactly the memory allocated in the previous round), forms one snapshot.

In other words, the records we obtain through memprofile are really a snapshot of memory as of two mark terminations ago.

Inside memRecord, a length-3 array named future holds memRecordCycle entries, one per GC cycle.

active holds the currently valid profiling data, i.e. the data from two GCs ago.


// A memRecord is the bucket data for a bucket of type memProfile,
// part of the memory profile.
type memRecord struct {
   // The following complex 3-stage scheme of stats accumulation
   // is required to obtain a consistent picture of mallocs and frees
   // for some point in time.
   // The problem is that mallocs come in real time, while frees
   // come only after a GC during concurrent sweeping. So if we would
   // naively count them, we would get a skew toward mallocs.
   //
   // Hence, we delay information to get consistent snapshots as
   // of mark termination. Allocations count toward the next mark
   // termination's snapshot, while sweep frees count toward the
   // previous mark termination's snapshot:
   //
   //              MT          MT          MT          MT
   //             .·|         .·|         .·|         .·|
   //          .·˙  |      .·˙  |      .·˙  |      .·˙  |
   //       .·˙     |   .·˙     |   .·˙     |   .·˙     |
   //    .·˙        |.·˙        |.·˙        |.·˙        |
   //
   //       alloc → ▲ ← free
   //               ┠┅┅┅┅┅┅┅┅┅┅┅P
   //       C+2     →    C+1    →  C
   //
   //                   alloc → ▲ ← free
   //                           ┠┅┅┅┅┅┅┅┅┅┅┅P
   //                   C+2     →    C+1    →  C
   //
   // Since we can't publish a consistent snapshot until all of
   // the sweep frees are accounted for, we wait until the next
   // mark termination ("MT" above) to publish the previous mark
   // termination's snapshot ("P" above). To do this, allocation
   // and free events are accounted to *future* heap profile
   // cycles ("C+n" above) and we only publish a cycle once all
   // of the events from that cycle must be done. Specifically:
   //
   // Mallocs are accounted to cycle C+2.
   // Explicit frees are accounted to cycle C+2.
   // GC frees (done during sweeping) are accounted to cycle C+1.
   //
   // After mark termination, we increment the global heap
   // profile cycle counter and accumulate the stats from cycle C
   // into the active profile.

   // active is the currently published profile. A profiling
   // cycle can be accumulated into active once its complete.
   active memRecordCycle

   // future records the profile events we're counting for cycles
   // that have not yet been published. This is ring buffer
   // indexed by the global heap profile cycle C and stores
   // cycles C, C+1, and C+2. Unlike active, these counts are
   // only for a single cycle; they are not cumulative across
   // cycles.
   //
   // We store cycle C here because there's a window between when
   // C becomes the active cycle and when we've flushed it to
   // active.
   future [3]memRecordCycle
}
N GC cycle -> GC() -> mark -> stop the world -> mProf_NextCycle() -> start the world ->
mProf_Flush() -> mark done -> sweep -> GC done -> mProf_PostSweep() -> N+1 GC cycle
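The ring-buffer bookkeeping implied by this flow can be illustrated with a toy model. memRecord and memRecordCycle mirror the runtime's fields, but nextCycle and flush here are simplified stand-ins for mProf_NextCycle and mProf_Flush (no locking, single record):

```go
package main

import "fmt"

type memRecordCycle struct {
	allocs, frees           uintptr
	alloc_bytes, free_bytes uintptr
}

func (a *memRecordCycle) add(b *memRecordCycle) {
	a.allocs += b.allocs
	a.frees += b.frees
	a.alloc_bytes += b.alloc_bytes
	a.free_bytes += b.free_bytes
}

type memRecord struct {
	active memRecordCycle
	future [3]memRecordCycle
}

var cycle uint32 // toy stand-in for mProf.cycle

// nextCycle mimics mProf_NextCycle: advance the cycle counter at mark
// termination, so new allocations land in a fresh future slot.
func nextCycle() { cycle++ }

// flush mimics flushing one record: fold the now-complete cycle C into
// active and clear its slot so it can later be reused as C+2.
func flush(r *memRecord) {
	mpc := &r.future[cycle%3]
	r.active.add(mpc)
	*mpc = memRecordCycle{}
}

func main() {
	var r memRecord
	// An allocation is recorded in cycle C+2, as in mProf_Malloc.
	r.future[(cycle+2)%3].allocs++
	r.future[(cycle+2)%3].alloc_bytes += 512

	flush(&r)
	fmt.Println(r.active.allocs) // 0: the alloc is still two cycles away
	nextCycle()
	flush(&r)
	fmt.Println(r.active.allocs) // 0: still one cycle away
	nextCycle()
	flush(&r)
	fmt.Println(r.active.allocs) // 1: published two mark terminations later
}
```

This is exactly why the published profile lags real allocations by two mark terminations.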

memRecordCycle records the bytes allocated, the number of allocations, the bytes freed, and the number of frees.

// memRecordCycle
type memRecordCycle struct {
   allocs, frees           uintptr
   alloc_bytes, free_bytes uintptr
}

bucket

mProf_Malloc takes the call stack of the user's malloc and uses it to look up (or create) a bucket; the bucket then stores the call stack along with the memRecordCycle data.

This way, identical call stacks are stored only once.

When statistics are finally gathered, the memRecordCycle entries of all buckets are accumulated to produce figures such as the total allocated bytes.
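That accumulation step might be sketched like this, with simplified stand-ins for the runtime's bucket chain (these structs are illustrative, not the real layout):

```go
package main

import "fmt"

// Simplified stand-ins for the runtime's types, just to show how
// per-bucket stats are summed into totals.
type memRecordCycle struct {
	allocs, frees           uintptr
	alloc_bytes, free_bytes uintptr
}

type bucket struct {
	allnext *bucket // chain of all buckets, as hung off mbuckets
	active  memRecordCycle
}

// totals walks the allnext chain and accumulates the active cycle of
// every bucket, mirroring how aggregate alloc figures are derived.
func totals(head *bucket) (allocs, allocBytes uintptr) {
	for b := head; b != nil; b = b.allnext {
		allocs += b.active.allocs
		allocBytes += b.active.alloc_bytes
	}
	return
}

func main() {
	b2 := &bucket{active: memRecordCycle{allocs: 3, alloc_bytes: 3072}}
	b1 := &bucket{allnext: b2, active: memRecordCycle{allocs: 1, alloc_bytes: 512}}
	a, ab := totals(b1)
	fmt.Println(a, ab) // → 4 3584
}
```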


// Called by malloc to record a profiled block.
func mProf_Malloc(p unsafe.Pointer, size uintptr) {
   var stk [maxStack]uintptr
   nstk := callers(4, stk[:])
   lock(&proflock)
   b := stkbucket(memProfile, size, stk[:nstk], true)
   c := mProf.cycle
   mp := b.mp()
   mpc := &mp.future[(c+2)%uint32(len(mp.future))]
   mpc.allocs++
   mpc.alloc_bytes += size
   unlock(&proflock)

   // Setprofilebucket locks a bunch of other mutexes, so we call it outside of proflock.
   // This reduces potential contention and chances of deadlocks.
   // Since the object must be alive during call to mProf_Malloc,
   // it's fine to do this non-atomically.
   systemstack(func() {
      setprofilebucket(p, b)
   })
}

Every user allocation carries a call stack, which may contain many frames.


The stack pointers, allocation information, and so on are stored in a bucket.

The bucket struct is shown below.

A bucket is immediately followed in memory by a []uintptr array (the addresses of the call stack) and a memRecord or blockRecord.

Buckets are linked together through a hash map, buckhash, which resolves collisions by chaining.

buckhash hashes the call stack; only entries with the same call stack and the same malloc size end up in the same bucket.

Whenever a new bucket is created, the corresponding global list head is updated; for memProfile this means pointing mbuckets at the new bucket.

Through mbuckets, every bucket can be reached.
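A toy version of the buckhash lookup, assuming a much smaller table and a simple multiplicative hash (the runtime's table size and hash function differ; this only illustrates the chaining and the stack+size equality check):

```go
package main

import "fmt"

type bucket struct {
	next *bucket // hash-chain link within one buckhash slot
	size uintptr
	stk  []uintptr
}

const buckHashSize = 179 // toy table size; the runtime's is much larger

var buckhash [buckHashSize]*bucket

// hash mixes the call-stack PCs and the allocation size.
func hash(stk []uintptr, size uintptr) uintptr {
	h := size
	for _, pc := range stk {
		h = h*31 + pc
	}
	return h % buckHashSize
}

func equal(a, b []uintptr) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// stkbucket returns the bucket for (stk, size), creating it on a miss
// and pushing it onto the chain, as the runtime's stkbucket does.
func stkbucket(stk []uintptr, size uintptr) *bucket {
	i := hash(stk, size)
	for b := buckhash[i]; b != nil; b = b.next {
		if b.size == size && equal(b.stk, stk) {
			return b // same stack and size: reuse the existing bucket
		}
	}
	b := &bucket{next: buckhash[i], size: size, stk: append([]uintptr(nil), stk...)}
	buckhash[i] = b
	return b
}

func main() {
	s := []uintptr{0x10b0074, 0x10341e6}
	b1 := stkbucket(s, 64)
	b2 := stkbucket(s, 64)  // same stack+size: same bucket
	b3 := stkbucket(s, 128) // different size: new bucket
	fmt.Println(b1 == b2, b1 == b3) // → true false
}
```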

type bucket struct {
    next    *bucket
    allnext *bucket
    typ     bucketType
    hash    uintptr
    size    uintptr
    nstk    uintptr
}

// newBucket allocates a bucket with the given type and number of stack entries.
func newBucket(typ bucketType, nstk int) *bucket {
   size := unsafe.Sizeof(bucket{}) + uintptr(nstk)*unsafe.Sizeof(uintptr(0))
   switch typ {
   default:
      throw("invalid profile bucket type")
   case memProfile:
      size += unsafe.Sizeof(memRecord{})
   case blockProfile, mutexProfile:
      size += unsafe.Sizeof(blockRecord{})
   }

   b := (*bucket)(persistentalloc(size, 0, &memstats.buckhash_sys))
   bucketmem += size
   b.typ = typ
   b.nstk = uintptr(nstk)
   return b
}


// stk returns the slice in b holding the stack.
func (b *bucket) stk() []uintptr {
   stk := (*[maxStack]uintptr)(add(unsafe.Pointer(b), unsafe.Sizeof(*b)))
   return stk[:b.nstk:b.nstk]
}

MemProfile

MemProfile is what runtime/pprof actually runs when producing the allocs profile.

It traverses all of mbuckets to build a mapping from call stack to allocs and alloc_bytes.

While traversing mbuckets:

  • If every bucket's active (the current memRecord) has zero allocs and frees, GC has not run yet; in that case the memRecordCycle data of all three cycles C, C+1 and C+2 is collected.
  • Otherwise, only active is collected, i.e. the snapshot from two mark terminations ago.

So strictly speaking, if fewer than two GCs have completed, what we get is not a snapshot of a consistent state.

The first line of the file records the in-use objects and bytes, the total allocated objects and bytes, and MemProfileRate. (Notably, the rate is multiplied by 2 here for compatibility with an old C++ profiler, so it is not exact.)

fmt.Fprintf(w, "heap profile: %d: %d [%d: %d] @ heap/%d\n",
   total.InUseObjects(), total.InUseBytes(),
   total.AllocObjects, total.AllocBytes,
   2*runtime.MemProfileRate)

Then the call stacks are printed one by one. For each call stack, the first line shows the bytes allocated by that stack and the stack addresses.

After that, each frame of the call stack is printed from top to bottom.

If the first frame's function name is in the runtime it is skipped (runtime frames are usually useless when debugging allocs), unless every frame belongs to the runtime.

#   0x10b0074  main.main+0x34    /Users/bytedance/go/src/awesomeProject22/main.go:16
#  0x10341e6  runtime.main+0x226 /Users/bytedance/go/go1.17/src/runtime/proc.go:255
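The frame-skipping rule can be sketched as follows; printableFrames is a hypothetical helper (the real logic lives in runtime/pprof's printStackRecord), but it captures the behavior seen in the sample output above, where runtime.main survives because it comes after main.main:

```go
package main

import (
	"fmt"
	"strings"
)

// printableFrames drops leading frames whose function name starts with
// "runtime.", unless that would drop every frame, in which case the
// whole stack is kept.
func printableFrames(names []string) []string {
	for i, name := range names {
		if !strings.HasPrefix(name, "runtime.") {
			return names[i:]
		}
	}
	return names // all frames are runtime frames: keep them all
}

func main() {
	fmt.Println(printableFrames([]string{"runtime.mallocgc", "main.main", "runtime.main"}))
	// → [main.main runtime.main]
	fmt.Println(printableFrames([]string{"runtime.gcBgMarkWorker", "runtime.systemstack"}))
	// → [runtime.gcBgMarkWorker runtime.systemstack]
}
```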

For each frame, the following data is printed: the PC, the function name, the offset from the function entry, the file name, and the line number.

fmt.Fprintf(w, "#\t%#x\t%s+%#x\t%s:%d\n", frame.PC, name, frame.PC-frame.Entry, frame.File, frame.Line)

Finally, memstats and MaxRSS are printed.

Flame graph

To dump the allocs profile directly from code:

p := pprof.Lookup("allocs")
f, _ := os.OpenFile("allocs", os.O_CREATE|os.O_RDWR|os.O_TRUNC, 0755)
defer f.Close()
p.WriteTo(f, 1) // debug=1: human-readable text
f1, _ := os.OpenFile("allocs.proto", os.O_CREATE|os.O_RDWR|os.O_TRUNC, 0755)
defer f1.Close()
p.WriteTo(f1, 0) // debug=0: protobuf

When debug is 1, a human-readable text file is written; when it is 0, the profile is written in protobuf form.

The difference is not just the file format: in the proto case, the data itself is also post-processed.

If we write both the text and the proto forms, we find that the total shown in the proto-generated UI does not match the total heap alloc (or any other figure) in the text output; it is much larger.


When generating the proto file, the sampled alloc_objects, alloc_bytes, inuse_objects and inuse_bytes values are scaled up according to MemProfileRate.

values[0], values[1] = scaleHeapSample(r.AllocObjects, r.AllocBytes, rate)
values[2], values[3] = scaleHeapSample(r.InUseObjects(), r.InUseBytes(), rate)

// scaleHeapSample adjusts the data from a heap Sample to
// account for its probability of appearing in the collected
// data. heap profiles are a sampling of the memory allocations
// requests in a program. We estimate the unsampled value by dividing
// each collected sample by its probability of appearing in the
// profile. heap profiles rely on a poisson process to determine
// which samples to collect, based on the desired average collection
// rate R. The probability of a sample of size S to appear in that
// profile is 1-exp(-S/R).
func scaleHeapSample(count, size, rate int64) (int64, int64) {
   if count == 0 || size == 0 {
      return 0, 0
   }

   if rate <= 1 {
      // if rate==1 all samples were collected so no adjustment is needed.
      // if rate<1 treat as unknown and skip scaling.
      return count, size
   }

   avgSize := float64(size) / float64(count)
   scale := 1 / (1 - math.Exp(-avgSize/float64(rate)))

   return int64(float64(count) * scale), int64(float64(size) * scale)
}
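To see why the proto totals dwarf the raw sampled values, we can run the function above on a single sampled 1 KiB object at the default 512 KiB rate. The sample probability is 1-exp(-1024/524288), roughly 1/512.5, so the estimate is scaled up by about 512x (the surrounding main is just a driver):

```go
package main

import (
	"fmt"
	"math"
)

// scaleHeapSample as quoted from the runtime/pprof source above.
func scaleHeapSample(count, size, rate int64) (int64, int64) {
	if count == 0 || size == 0 {
		return 0, 0
	}
	if rate <= 1 {
		// rate==1: all samples collected, no adjustment needed.
		return count, size
	}
	avgSize := float64(size) / float64(count)
	scale := 1 / (1 - math.Exp(-avgSize/float64(rate)))
	return int64(float64(count) * scale), int64(float64(size) * scale)
}

func main() {
	// One sampled allocation of 1024 bytes, default rate 512 KiB.
	c, s := scaleHeapSample(1, 1024, 512*1024)
	fmt.Println(c, s) // → 512 524800
}
```

Note that the scaling depends on the average object size: large objects are almost certain to be sampled (probability near 1), so they are barely scaled, while small objects are scaled up heavily.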


Reposted from juejin.im/post/7018389092224204814