memprofile
malloc
Memory allocated on the Go heap goes through the runtime's malloc function (mallocgc).
malloc samples these allocations: by default one sample per 512 KB allocated on average (the sampling interval is exponentially distributed with that mean, controlled by runtime.MemProfileRate), and the sampled data is stored in a memRecord.
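Because sampling is size-dependent, large allocations are almost always captured while small ones are usually missed. Under the Poisson sampling model that runtime/pprof assumes (see scaleHeapSample later in this article), an allocation of size S is sampled with probability 1-exp(-S/R). A minimal sketch of that math (the function name here is mine, not the runtime's):

```go
package main

import (
	"fmt"
	"math"
)

// sampleProbability estimates the chance that a single allocation of
// `size` bytes is picked up by the heap profiler when the average
// sampling rate is `rate` bytes (runtime.MemProfileRate, 512 KiB by
// default). It mirrors the 1-exp(-S/R) model used by scaleHeapSample
// in runtime/pprof.
func sampleProbability(size, rate float64) float64 {
	return 1 - math.Exp(-size/rate)
}

func main() {
	rate := 512.0 * 1024 // default runtime.MemProfileRate
	for _, size := range []float64{16, 4096, 512 * 1024, 10 * 1024 * 1024} {
		fmt.Printf("size=%10.0fB  p(sampled)=%.4f\n", size, sampleProbability(size, rate))
	}
}
```

An allocation the same size as the rate is sampled with probability 1-1/e ≈ 0.63, while a 16-byte allocation is almost never sampled, which is exactly why the proto output has to scale samples back up.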
memrecord
From the memRecord's point of view, mark termination is the boundary. The memory allocated before the previous mark termination, together with the memory freed between the previous and the current mark termination (which is exactly what was allocated in the previous round), forms one snapshot.
In other words, the records we obtain through memprofile are in fact a snapshot of memory as of two mark terminations ago.
A memRecord contains a future array of length 3 holding memRecordCycle values, one per GC cycle.
active holds the currently valid profiling data, i.e. the data from two GCs ago.
// A memRecord is the bucket data for a bucket of type memProfile,
// part of the memory profile.
type memRecord struct {
	// The following complex 3-stage scheme of stats accumulation
	// is required to obtain a consistent picture of mallocs and frees
	// for some point in time.
	// The problem is that mallocs come in real time, while frees
	// come only after a GC during concurrent sweeping. So if we would
	// naively count them, we would get a skew toward mallocs.
	//
	// Hence, we delay information to get consistent snapshots as
	// of mark termination. Allocations count toward the next mark
	// termination's snapshot, while sweep frees count toward the
	// previous mark termination's snapshot:
	//
	//              MT          MT          MT          MT
	//             .·|         .·|         .·|         .·|
	//          .·˙  |      .·˙  |      .·˙  |      .·˙  |
	//       .·˙     |   .·˙     |   .·˙     |   .·˙     |
	//    .·˙        |.·˙        |.·˙        |.·˙        |
	//
	//       alloc → ▲ ← free
	//               ┠┅┅┅┅┅┅┅┅┅┅┅P
	//       C+2            →    C+1    →  C
	//
	//                   alloc → ▲ ← free
	//                           ┠┅┅┅┅┅┅┅┅┅┅┅P
	//                   C+2            →    C+1    →  C
	//
	// Since we can't publish a consistent snapshot until all of
	// the sweep frees are accounted for, we wait until the next
	// mark termination ("MT" above) to publish the previous mark
	// termination's snapshot ("P" above). To do this, allocation
	// and free events are accounted to *future* heap profile
	// cycles ("C+n" above) and we only publish a cycle once all
	// of the events from that cycle must be done. Specifically:
	//
	// Mallocs are accounted to cycle C+2.
	// Explicit frees are accounted to cycle C+2.
	// GC frees (done during sweeping) are accounted to cycle C+1.
	//
	// After mark termination, we increment the global heap
	// profile cycle counter and accumulate the stats from cycle C
	// into the active profile.

	// active is the currently published profile. A profiling
	// cycle can be accumulated into active once its complete.
	active memRecordCycle

	// future records the profile events we're counting for cycles
	// that have not yet been published. This is ring buffer
	// indexed by the global heap profile cycle C and stores
	// cycles C, C+1, and C+2. Unlike active, these counts are
	// only for a single cycle; they are not cumulative across
	// cycles.
	//
	// We store cycle C here because there's a window between when
	// C becomes the active cycle and when we've flushed it to
	// active.
	future [3]memRecordCycle
}
N gc cycle -> GC() -> mark -> stop the world
mProf_NextCycle() -> Start the world -> mProf_Flush() -> mark done -> sweep -> gc done -> mProf_PostSweep() -> N + 1 gc cycle
memRecordCycle records the number of allocations and allocated bytes, together with the number of frees and freed bytes.
// memRecordCycle
type memRecordCycle struct {
	allocs, frees           uintptr
	alloc_bytes, free_bytes uintptr
}
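The three-stage handoff can be illustrated with a toy simulation of the future ring buffer (the helper names are mine; the real logic lives in mProf_Malloc, mProf_Free, mProf_NextCycle and mProf_FlushLocked):

```go
package main

import "fmt"

type memRecordCycle struct {
	allocs, frees         uintptr
	allocBytes, freeBytes uintptr
}

type memRecord struct {
	active memRecordCycle
	future [3]memRecordCycle
}

// recordMalloc accounts an allocation to cycle C+2, like mProf_Malloc.
func recordMalloc(r *memRecord, c uint32, size uintptr) {
	mpc := &r.future[(c+2)%3]
	mpc.allocs++
	mpc.allocBytes += size
}

// recordSweepFree accounts a GC free to cycle C+1, like mProf_Free
// called during sweeping.
func recordSweepFree(r *memRecord, c uint32, size uintptr) {
	mpc := &r.future[(c+1)%3]
	mpc.frees++
	mpc.freeBytes += size
}

// flush publishes cycle C into active and clears its slot, like
// mProf_FlushLocked after mark termination advanced the cycle counter.
func flush(r *memRecord, c uint32) {
	mpc := &r.future[c%3]
	r.active.allocs += mpc.allocs
	r.active.frees += mpc.frees
	r.active.allocBytes += mpc.allocBytes
	r.active.freeBytes += mpc.freeBytes
	*mpc = memRecordCycle{}
}

func main() {
	var r memRecord
	c := uint32(0)
	recordMalloc(&r, c, 100) // lands in cycle C+2

	c++ // mark termination: mProf_NextCycle advances C
	flush(&r, c)
	fmt.Println(r.active.allocs) // 0: the malloc is still in a future cycle

	recordSweepFree(&r, c, 100) // the sweep free joins the malloc in one slot

	c++ // next mark termination
	flush(&r, c)
	fmt.Println(r.active.allocs, r.active.frees) // 1 1: a consistent snapshot
}
```

Note how the malloc and its matching sweep free, although they happen in different GC phases, end up in the same ring slot and are published together, which is the whole point of the scheme.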
bucket
mProf_Malloc takes the call stack of the user's malloc and finds or creates a bucket for it; the bucket then stores that call stack along with the memRecord (and its memRecordCycles).
The point of this is to avoid storing identical call stacks more than once.
At reporting time, accumulating the memRecordCycles of all buckets yields totals such as TotalAlloc.
// Called by malloc to record a profiled block.
func mProf_Malloc(p unsafe.Pointer, size uintptr) {
	var stk [maxStack]uintptr
	nstk := callers(4, stk[:])
	lock(&proflock)
	b := stkbucket(memProfile, size, stk[:nstk], true)
	c := mProf.cycle
	mp := b.mp()
	mpc := &mp.future[(c+2)%uint32(len(mp.future))]
	mpc.allocs++
	mpc.alloc_bytes += size
	unlock(&proflock)

	// Setprofilebucket locks a bunch of other mutexes, so we call it outside of proflock.
	// This reduces potential contention and chances of deadlocks.
	// Since the object must be alive during call to mProf_Malloc,
	// it's fine to do this non-atomically.
	systemstack(func() {
		setprofilebucket(p, b)
	})
}
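The same accumulation is visible through the public API: runtime.MemProfile returns one MemProfileRecord per bucket (i.e. per unique allocation stack), and summing the records reproduces the totals. A small sketch (heapTotals is my name, not a runtime function):

```go
package main

import (
	"fmt"
	"runtime"
)

// heapTotals walks every memory-profile record (one per bucket, i.e.
// per unique allocation stack) and accumulates the per-stack counters
// into overall totals, which is essentially what pprof's reporting does.
func heapTotals() (allocBytes, freeBytes int64, stacks int) {
	// First call sizes the slice; a little headroom covers buckets
	// created between the two calls.
	n, _ := runtime.MemProfile(nil, true)
	rec := make([]runtime.MemProfileRecord, n+50)
	n, ok := runtime.MemProfile(rec, true)
	if !ok {
		return 0, 0, 0 // profile grew between calls; a real caller would retry
	}
	for _, r := range rec[:n] {
		allocBytes += r.AllocBytes
		freeBytes += r.FreeBytes
	}
	return allocBytes, freeBytes, n
}

func main() {
	alloc, free, n := heapTotals()
	fmt.Printf("alloc=%dB free=%dB in-use=%dB across %d stacks\n",
		alloc, free, alloc-free, n)
}
```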
Each user call may carry a deep call stack.
The stack pointers, allocation data and so on are stored in a bucket, shown below.
A bucket is immediately followed in memory by a []uintptr array (the addresses of the call stack) and a memRecord or blockRecord.
Buckets are linked together through a buckhash hash table, resolving collisions by chaining.
buckhash hashes the call stack; only entries with the same call stack and the same malloc size land in the same bucket.
Whenever a new bucket is created, the corresponding global list head is updated as well; for memprofile this means pointing mbuckets at the new bucket.
Walking mbuckets therefore reaches every bucket.
type bucket struct {
	next    *bucket
	allnext *bucket
	typ     bucketType
	hash    uintptr
	size    uintptr
	nstk    uintptr
}
// newBucket allocates a bucket with the given type and number of stack entries.
func newBucket(typ bucketType, nstk int) *bucket {
	size := unsafe.Sizeof(bucket{}) + uintptr(nstk)*unsafe.Sizeof(uintptr(0))
	switch typ {
	default:
		throw("invalid profile bucket type")
	case memProfile:
		size += unsafe.Sizeof(memRecord{})
	case blockProfile, mutexProfile:
		size += unsafe.Sizeof(blockRecord{})
	}

	b := (*bucket)(persistentalloc(size, 0, &memstats.buckhash_sys))
	bucketmem += size
	b.typ = typ
	b.nstk = uintptr(nstk)
	return b
}

// stk returns the slice in b holding the stack.
func (b *bucket) stk() []uintptr {
	stk := (*[maxStack]uintptr)(add(unsafe.Pointer(b), unsafe.Sizeof(*b)))
	return stk[:b.nstk:b.nstk]
}
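The buckhash lookup described above amounts to a chained hash table keyed by (call stack, size). A simplified stand-in (names loosely mirror stkbucket; the real version also folds the bucket type into the hash and uses a stronger hash function):

```go
package main

import "fmt"

type bucket struct {
	next  *bucket   // hash-chain link (bucket.next in the runtime)
	stk   []uintptr // call-stack PCs stored with the bucket
	size  uintptr   // allocation size recorded alongside the stack
	count int
}

const numBuckets = 1 << 10

var buckhash [numBuckets]*bucket

// hash mixes the stack PCs and the size into a table index.
func hash(stk []uintptr, size uintptr) uintptr {
	h := size
	for _, pc := range stk {
		h = h*31 + pc
	}
	return h % numBuckets
}

// stkbucket returns the bucket for (stk, size), creating it if needed,
// so identical call stacks are stored only once.
func stkbucket(stk []uintptr, size uintptr) *bucket {
	i := hash(stk, size)
	for b := buckhash[i]; b != nil; b = b.next {
		if b.size == size && equal(b.stk, stk) {
			return b
		}
	}
	b := &bucket{next: buckhash[i], stk: append([]uintptr(nil), stk...), size: size}
	buckhash[i] = b
	return b
}

func equal(a, b []uintptr) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	s := []uintptr{0x1000, 0x2000}
	b1 := stkbucket(s, 64)
	b2 := stkbucket(s, 64) // same stack and size → same bucket
	b1.count++
	b2.count++
	fmt.Println(b1 == b2, b1.count) // true 2
}
```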
Memprofile
memprofile is what runtime/pprof actually executes when the allocs profile is requested.
It walks all of mbuckets and produces a mapping from call stack to allocs and alloc_bytes.
While iterating over mbuckets:
- If every bucket's active record (the current memRecord) has zero allocs and frees, GC has not run yet; in that case the memRecordCycles of all cycles C, C+1 and C+2 are collected.
- Otherwise only the active data is collected, i.e. the snapshot as of two mark terminations ago.
Strictly speaking, then, if fewer than two GCs have completed, the result cannot be called a snapshot of a consistent state.
The first line of the file lists the total allocated bytes, the number of in-use objects and similar figures, plus MemProfileRate. (Notably, the rate is multiplied by 2 here for compatibility with an old C++ profiler, so it is not exact.)
fmt.Fprintf(w, "heap profile: %d: %d [%d: %d] @ heap/%d\n",
	total.InUseObjects(), total.InUseBytes(),
	total.AllocObjects, total.AllocBytes,
	2*runtime.MemProfileRate)
Then the call stacks are visited one by one. For each call stack, the first line prints the bytes it allocated and the stack addresses.
After that, each frame of the call stack is printed from top to bottom.
If the first function name belongs to the runtime it is skipped (runtime functions are usually of no use when debugging allocs), unless every function name is a runtime one.
# 0x10b0074 main.main+0x34 /Users/bytedance/go/src/awesomeProject22/main.go:16
# 0x10341e6 runtime.main+0x226 /Users/bytedance/go/go1.17/src/runtime/proc.go:255
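That skipping rule can be sketched as follows (a simplified stand-in for the logic in printStackRecord; the function name is mine):

```go
package main

import (
	"fmt"
	"strings"
)

// trimRuntimeFrames drops leading runtime.* frames from a symbolized
// stack, unless every frame is a runtime frame, in which case the
// stack is returned untouched (otherwise nothing would be printed).
func trimRuntimeFrames(names []string) []string {
	for i, name := range names {
		if !strings.HasPrefix(name, "runtime.") {
			return names[i:]
		}
	}
	return names
}

func main() {
	fmt.Println(trimRuntimeFrames([]string{"runtime.mallocgc", "main.work", "main.main"}))
	// [main.work main.main]
	fmt.Println(trimRuntimeFrames([]string{"runtime.gcBgMarkWorker", "runtime.systemstack"}))
	// [runtime.gcBgMarkWorker runtime.systemstack]
}
```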
For each frame, the following data is printed: the PC address, the function name and the offset into it, and the file name and line number.
fmt.Fprintf(w, "#\t%#x\t%s+%#x\t%s:%d\n", frame.PC, name, frame.PC-frame.Entry, frame.File, frame.Line)
Finally, memstats and MaxRSS are printed.
Flame graph
When we want to obtain the allocs profile directly ourselves, it looks like this:
p := pprof.Lookup("allocs")
f, _ := os.OpenFile("allocs", os.O_CREATE|os.O_RDWR|os.O_TRUNC, 0755)
p.WriteTo(f, 1)
f1, _ := os.OpenFile("allocs.proto", os.O_CREATE|os.O_RDWR|os.O_TRUNC, 0755)
p.WriteTo(f1, 0)
When debug is 1, the profile is written as a human-readable text file; when it is 0, it is written in proto form.
The difference is not only the file format: in the proto case the returned data itself gets some extra processing.
If we write out both the text and the proto file, we will find that the total shown in the proto-generated UI is much larger and matches neither the text file's total heap alloc nor anything else in it.
When generating the proto file, the sampled alloc_bytes, alloc_objects, inuse_bytes and inuse_objects values are scaled up according to MemProfileRate.
values[0], values[1] = scaleHeapSample(r.AllocObjects, r.AllocBytes, rate)
values[2], values[3] = scaleHeapSample(r.InUseObjects(), r.InUseBytes(), rate)
// scaleHeapSample adjusts the data from a heap Sample to
// account for its probability of appearing in the collected
// data. heap profiles are a sampling of the memory allocations
// requests in a program. We estimate the unsampled value by dividing
// each collected sample by its probability of appearing in the
// profile. heap profiles rely on a poisson process to determine
// which samples to collect, based on the desired average collection
// rate R. The probability of a sample of size S to appear in that
// profile is 1-exp(-S/R).
func scaleHeapSample(count, size, rate int64) (int64, int64) {
	if count == 0 || size == 0 {
		return 0, 0
	}

	if rate <= 1 {
		// if rate==1 all samples were collected so no adjustment is needed.
		// if rate<1 treat as unknown and skip scaling.
		return count, size
	}

	avgSize := float64(size) / float64(count)
	scale := 1 / (1 - math.Exp(-avgSize/float64(rate)))

	return int64(float64(count) * scale), int64(float64(size) * scale)
}
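To see the magnitude of this adjustment, consider one sampled 1 KiB allocation at the default 512 KiB rate: its sampling probability is roughly 1/512, so the estimate scales it up to about 512 allocations. A runnable repetition of the function above with a usage example:

```go
package main

import (
	"fmt"
	"math"
)

// scaleHeapSample divides a sampled (count, size) pair by its
// probability 1-exp(-S/R) of having been sampled at all, estimating
// the true, unsampled totals (same code as shown above).
func scaleHeapSample(count, size, rate int64) (int64, int64) {
	if count == 0 || size == 0 {
		return 0, 0
	}
	if rate <= 1 {
		// rate==1 means every allocation was sampled; rate<1 is unknown.
		return count, size
	}
	avgSize := float64(size) / float64(count)
	scale := 1 / (1 - math.Exp(-avgSize/float64(rate)))
	return int64(float64(count) * scale), int64(float64(size) * scale)
}

func main() {
	// One sampled allocation of 1 KiB at the default 512 KiB rate.
	c, s := scaleHeapSample(1, 1024, 512*1024)
	fmt.Println(c, s) // ≈512 allocations, ≈512 KiB estimated in total
}
```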