Go interview essentials: how is map implemented under the hood?

What is a map

Wikipedia defines map like this:

In computer science, an associative array, map, symbol table, or dictionary is an abstract data type composed of a collection of (key, value) pairs, such that each possible key appears at most once in the collection.

In short: in computer science, an associative array, map, symbol table, or dictionary is an abstract data type composed of a set of <key, value> pairs, in which the same key appears at most once.

There are two key points: a map is composed of key-value pairs, and each key appears only once.

The main operations related to map are:

  1. Add a kv pair - Add or insert;
  2. Delete a kv pair - Remove or delete;
  3. Modify v corresponding to a certain k - Reassign;
  4. Query the v corresponding to a certain k - Lookup;

Simply put, these are the basic create, delete, read, and update operations.
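As a minimal sketch, the four operations listed above look like this on Go's built-in map. The function name applyOps and the sample keys are made up for illustration.

```go
package main

import "fmt"

// applyOps runs the four basic operations on a map and returns
// the final map plus the result of looking up a deleted key.
func applyOps() (map[string]int, bool) {
	ages := make(map[string]int)
	ages["alice"] = 30   // add
	ages["bob"] = 25     // add
	ages["alice"] = 31   // modify the value for an existing key
	delete(ages, "bob")  // delete
	_, ok := ages["bob"] // lookup; ok reports whether the key exists
	return ages, ok
}

func main() {
	ages, bobFound := applyOps()
	fmt.Println(ages["alice"], len(ages), bobFound) // 31 1 false
}
```

The two-value form of the lookup is what distinguishes "key absent" from "key present with a zero value".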

The design of a map is also known as "the dictionary problem": designing a data structure that maintains a collection of data and supports adding, deleting, querying, and modifying its entries. There are two main data structures for this: the hash table (hash lookup table) and the search tree.

The hash lookup table uses a hash function to assign keys to different buckets (that is, different indexes of the array). This way, the overhead is mainly in the calculation of the hash function and the constant access time of the array. In many scenarios, the performance of hash lookup tables is very high.

Hash lookup tables generally have a "collision" problem: different keys are hashed to the same bucket. There are two common ways to deal with it: chaining and open addressing. Chaining implements each bucket as a linked list, and keys that fall into the same bucket are inserted into that list. With open addressing, after a collision occurs, an empty slot further along the array is chosen according to some rule, and the new key is placed there.
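To make chaining concrete, here is a small textbook-style sketch of a chained hash table. It is not how the Go runtime stores its buckets; the type and method names (chainedTable, Put, Get) are hypothetical, and FNV-1a from the standard library stands in for the hash function.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// entry is one node in a bucket's linked list.
type entry struct {
	key   string
	value int
	next  *entry
}

// chainedTable is a fixed-size hash table that resolves collisions by chaining.
type chainedTable struct {
	buckets []*entry
}

func newChainedTable(n int) *chainedTable {
	return &chainedTable{buckets: make([]*entry, n)}
}

// index hashes the key with FNV-1a and reduces it to a bucket index.
func (t *chainedTable) index(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(len(t.buckets)))
}

// Put updates the key if it is already in the chain, otherwise prepends a new entry.
func (t *chainedTable) Put(key string, value int) {
	i := t.index(key)
	for e := t.buckets[i]; e != nil; e = e.next {
		if e.key == key {
			e.value = value
			return
		}
	}
	t.buckets[i] = &entry{key: key, value: value, next: t.buckets[i]}
}

// Get walks the chain of the key's bucket.
func (t *chainedTable) Get(key string) (int, bool) {
	for e := t.buckets[t.index(key)]; e != nil; e = e.next {
		if e.key == key {
			return e.value, true
		}
	}
	return 0, false
}

func main() {
	t := newChainedTable(2) // two buckets, so three keys must collide somewhere
	t.Put("a", 1)
	t.Put("b", 2)
	t.Put("c", 3)
	v, ok := t.Get("c")
	fmt.Println(v, ok) // 3 true
}
```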

The search tree approach generally uses self-balancing search trees, such as AVL trees and red-black trees. In interviews, candidates are often asked about red-black trees and sometimes even asked to write one by hand, when in many cases the interviewer could not write it either, which is rather unreasonable.

The worst-case lookup cost of a self-balancing search tree is O(logN), while the worst case for a hash lookup table is O(N). The average lookup cost of a hash lookup table, however, is O(1), and with a well-designed hash function the worst case essentially never occurs. Another difference is that traversing a self-balancing search tree returns keys in ascending order, while a hash lookup table returns them in no particular order.
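Because a hash table gives no ordering guarantee, code that needs keys in ascending order has to sort them itself. A minimal sketch, with sortedKeys as a hypothetical helper name:

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys collects the keys of m and returns them in ascending order.
// Ranging over the map directly would yield them in an unspecified order.
func sortedKeys(m map[string]int) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	m := map[string]int{"c": 3, "a": 1, "b": 2}
	fmt.Println(sortedKeys(m)) // [a b c]
}
```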

How map is implemented under the hood

First declare the Go version I use:

go version go1.9.2 darwin/amd64

As mentioned earlier, there are several ways to implement a map. Go uses a hash lookup table and resolves hash collisions by chaining.

Next we'll explore the core principles of map and get a glimpse of its internal structure.

Map memory model

In the source code, the structure representing map is hmap, which is the "abbreviation" of hashmap:

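// A header for a Go map.
type hmap struct {
	// Number of elements; len(map) returns this value directly
	count     int
	flags     uint8
	// log_2 of the number of buckets
	B         uint8
	// Approximate number of overflow buckets
	noverflow uint16
	// Hash seed, passed to the hash function when hashing a key
	hash0     uint32
	// Pointer to the buckets array, of size 2^B;
	// nil if the element count is 0
	buckets    unsafe.Pointer
	// During a same-size grow, buckets has the same length as oldbuckets;
	// during a doubling grow, buckets is twice as long as oldbuckets
	oldbuckets unsafe.Pointer
	// Evacuation progress; buckets below this value have been evacuated
	nevacuate  uintptr
	extra *mapextra // optional fields
}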

To explain: B is the base-2 logarithm of the length of the buckets array, which means the array holds 2^B buckets. The keys and values are stored in the buckets, as discussed later.

buckets is a pointer, which ultimately points to a structure:

type bmap struct {
	tophash [bucketCnt]uint8
}

But this is only the surface structure (src/runtime/hashmap.go). During compilation the compiler adds fields to it, dynamically creating a new structure:

type bmap struct {
    topbits  [8]uint8
    keys     [8]keytype
    values   [8]valuetype
    pad      uintptr
    overflow uintptr
}

bmap is what we often call a "bucket". A bucket can hold up to 8 keys. These keys fall into the same bucket because their hash results belong to the same "class": the low-order bits that select the bucket are identical. Within the bucket, the high 8 bits of the key's hash are used to determine where the key sits (a bucket has up to 8 positions).

Here’s an overall picture:

[Figure: overall structure of hmap and its buckets]

When the key and value of the map are not pointers, and each is smaller than 128 bytes, the bmap is marked as containing no pointers, which avoids scanning the entire hmap during GC. However, bmap has an overflow field of pointer type, which would break the claim that bmap contains no pointers. In this case, the overflow pointers are moved to the extra field of hmap.

type mapextra struct {
	// overflow[0] contains overflow buckets for hmap.buckets.
	// overflow[1] contains overflow buckets for hmap.oldbuckets.
	overflow [2]*[]*bmap

	// nextOverflow holds a pointer to a free, preallocated overflow bucket
	nextOverflow *bmap
}

bmap is where kv is stored. Let's zoom in and take a closer look at the internal composition of bmap.

[Figure: memory layout of a bmap bucket]

The picture above shows the memory model of a bucket; HOB Hash refers to the top hash. Note that the keys are stored together and the values are stored together, rather than in the interleaved form key/value/key/value/.... The source code explains that the advantage is that in some cases padding fields can be omitted, saving memory.

For example, there is a map of this type:

map[int64]int8

If the data were stored as key/value/key/value/..., an extra 7 bytes of padding would be needed after each key/value pair. With all keys grouped together and all values grouped together, in the form key/key/.../value/value/..., padding is needed only at the end.
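The saving can be checked with unsafe.Sizeof. The sketch below models the two layouts for map[int64]int8 with hypothetical struct types (pair, interleaved, grouped); these are illustrations, not the runtime's actual types, and the exact sizes assume a typical 64-bit platform where int64 is 8-byte aligned.

```go
package main

import (
	"fmt"
	"unsafe"
)

// pair models one interleaved key/value slot: 8 bytes of key,
// 1 byte of value, then padding up to the next 8-byte boundary.
type pair struct {
	k int64
	v int8
}

// interleaved stores key/value/key/value/...
type interleaved struct {
	pairs [8]pair
}

// grouped stores key/key/.../value/value/...
type grouped struct {
	keys   [8]int64
	values [8]int8
}

// layoutSizes returns the sizes of the two layouts in bytes.
func layoutSizes() (uintptr, uintptr) {
	return unsafe.Sizeof(interleaved{}), unsafe.Sizeof(grouped{})
}

func main() {
	a, b := layoutSizes()
	// On amd64 this prints 128 72: 8*(8+1+7) bytes versus 8*8 + 8*1 bytes.
	fmt.Println(a, b)
}
```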

Each bucket is designed to hold up to 8 key-value pairs. If a ninth key-value pair falls into the same bucket, another bucket is constructed and linked to it through the overflow pointer.

Creating a map

Syntactically, creating a map is simple:

ageMp := make(map[string]int)
// Specify an initial size hint for the map
ageMp := make(map[string]int, 8)

// ageMp is nil; adding an element to it panics immediately
var ageMp map[string]int
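The nil-map behavior mentioned in the last comment can be observed directly. In this small sketch the helper name writeToNil is made up; it recovers from the panic so the program can report it.

```go
package main

import "fmt"

// writeToNil attempts a write on a nil map and reports whether it panicked.
func writeToNil() (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	var m map[string]int
	m["tom"] = 18 // assignment to an entry in a nil map panics
	return false
}

func main() {
	var m map[string]int
	// Reading from a nil map is safe and yields zero values.
	fmt.Println(m == nil, len(m), m["tom"]) // true 0 0
	fmt.Println(writeToNil())               // true
}
```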

From the assembly output we can see that the underlying call is to the makemap function, whose main job is to initialize the fields of the hmap structure, such as computing B and setting the hash seed hash0.

func makemap(t *maptype, hint int64, h *hmap, bucket unsafe.Pointer) *hmap {
	// Various sanity checks omitted...

	// Find a B such that the map's load factor stays within the normal range
	B := uint8(0)
	for ; overLoadFactor(hint, B); B++ {
	}

	// Initialize the hash table.
	// If B is 0, buckets is allocated lazily, at assignment time.
	// If the length is large, allocating the memory takes a little longer.
	buckets := bucket
	var extra *mapextra
	if B != 0 {
		var nextOverflow *bmap
		buckets, nextOverflow = makeBucketArray(t, B)
		if nextOverflow != nil {
			extra = new(mapextra)
			extra.nextOverflow = nextOverflow
		}
	}

	// Initialize hmap
	if h == nil {
		h = (*hmap)(newobject(t.hmap))
	}
	h.count = 0
	h.B = B
	h.extra = extra
	h.flags = 0
	h.hash0 = fastrand()
	h.buckets = buckets
	h.oldbuckets = nil
	h.nevacuate = 0
	h.noverflow = 0

	return h
}
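The loop that chooses B can be tried out in isolation. The article does not show overLoadFactor, so the body below is an assumption: it treats a map as overloaded once it has at least 8 elements and more than 6.5 elements per bucket on average. pickB is a hypothetical wrapper around the loop from makemap.

```go
package main

import "fmt"

const (
	bucketCnt  = 8   // slots per bucket, as described in the article
	loadFactor = 6.5 // assumed average-elements-per-bucket threshold
)

// overLoadFactor is an assumed stand-in for the runtime function of the
// same name: it reports whether count elements in 2^B buckets is too dense.
func overLoadFactor(count int64, B uint8) bool {
	return count >= bucketCnt && float32(count) >= loadFactor*float32(uint64(1)<<B)
}

// pickB repeats the loop from makemap: the smallest B that is not overloaded.
func pickB(hint int64) uint8 {
	B := uint8(0)
	for ; overLoadFactor(hint, B); B++ {
	}
	return B
}

func main() {
	for _, hint := range []int64{0, 7, 8, 100, 1000} {
		fmt.Printf("hint=%d B=%d buckets=%d\n", hint, pickB(hint), 1<<pickB(hint))
	}
}
```

Under these assumptions a hint below 8 gives B = 0, which is the case where makemap leaves bucket allocation for later.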

[Extension 1] What is the difference between slice and map when they are used as function parameters?

Note that this function returns *hmap, a pointer, whereas the makeslice function we discussed earlier returns a slice structure:

func makeslice(et *_type, len, cap int) slice

Let’s review the structure definition of slice:

// runtime/slice.go
type slice struct {
    array unsafe.Pointer // pointer to the elements
    len   int            // length
    cap   int            // capacity
}

The structure contains underlying data pointers.

This difference between makemap and makeslice has a consequence: when a map is passed as a function argument, operations on the map inside the function affect the caller's map; for a slice this is not always the case (as discussed in the earlier article on slices).

The main reason: one is a pointer (*hmap) and the other is a structure (slice). Function arguments in Go are always passed by value, so the function works on a local copy of each argument. After a *hmap pointer is copied, it still points to the same map, so operations on the map inside the function affect the caller's map. After a slice is copied, the function holds a new slice header: changes to its length, capacity, or array pointer (for example through append) do not affect the caller's slice, although writes to existing elements still go through the shared underlying array.
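A short sketch of the difference, using made-up helper names (addToMap, appendInside, setFirst). It also shows that writing to an existing slice element is visible to the caller, because the copied slice header still points at the same underlying array.

```go
package main

import "fmt"

// addToMap inserts into the map the caller passed in.
func addToMap(m map[string]int) {
	m["new"] = 1
}

// appendInside appends to its local copy of the slice header
// and returns the length seen inside the function.
func appendInside(s []int) int {
	s = append(s, 4)
	return len(s)
}

// setFirst writes through the shared underlying array.
func setFirst(s []int) {
	s[0] = 100
}

func main() {
	m := map[string]int{}
	addToMap(m)

	s := []int{1, 2, 3}
	inner := appendInside(s)
	setFirst(s)

	// The map gained an entry; the slice kept its length but its first element changed.
	fmt.Println(len(m), inner, len(s), s[0]) // 1 4 3 100
}
```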

Hash function

A key point of map is the choice of hash function. When the program starts, it checks whether the CPU supports AES instructions. If it does, the AES-based hash is used; otherwise memhash is used. This is done in the alginit() function, located in src/runtime/alg.go.

Hash functions come in cryptographic and non-cryptographic varieties. The cryptographic kind is generally used for encrypting data, producing digital digests, and so on; typical examples are md5, sha1, sha256, and aes256 (the last is a cipher rather than a hash function). The non-cryptographic kind is generally used for lookup, which is the use case for map.

There are two main points to consider when choosing a hash function: performance and collision probability.

As discussed earlier, this is the structure that represents a type:

type _type struct {
	size       uintptr
	ptrdata    uintptr // size of memory prefix holding all pointers
	hash       uint32
	tflag      tflag
	align      uint8
	fieldalign uint8
	kind       uint8
	alg        *typeAlg
	gcdata    *byte
	str       nameOff
	ptrToThis typeOff
}

The alg field is related to hashing; it is a pointer to the following structure:

// src/runtime/alg.go
type typeAlg struct {
	// (ptr to object, seed) -> hash
	hash func(unsafe.Pointer, uintptr) uintptr
	// (ptr to object A, ptr to object B) -> ==?
	equal func(unsafe.Pointer, unsafe.Pointer) bool
}

typeAlg contains two functions: hash computes the hash of a value of the type, and equal reports whether two values of the type are equal.

For the string type, its hash and equal functions are as follows:

func strhash(a unsafe.Pointer, h uintptr) uintptr {
	x := (*stringStruct)(a)
	return memhash(x.str, h, uintptr(x.len))
}

func strequal(p, q unsafe.Pointer) bool {
	return *(*string)(p) == *(*string)(q)
}

According to the type of key, the alg field of the _type structure will be set with the hash and equal functions of the corresponding type.

Key lookup process

After the key is hashed, the resulting hash value has 64 bits in total (on a 64-bit machine; 32-bit machines are not discussed here, since 64-bit is now mainstream). When calculating which bucket the key falls into, only the last B bits are used. Remember the B mentioned earlier? If B = 5, then the number of buckets, that is, the length of the buckets array, is 2^5 = 32.

For example, after a key is calculated by a hash function, the hash result is:

 10010111 | 000011110110110010001111001010100010010110010101010 │ 01010

Take the last 5 bits, 01010, whose value is 10: this is bucket No. 10. The operation is effectively a modulo (remainder) operation, but modulo is relatively expensive, so the implementation uses a bitwise operation instead.
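The equivalence between the mask and the remainder can be checked with the example hash above. In this sketch, bucketIndex and exampleHash are illustrative names, and the binary literal syntax requires Go 1.13 or later.

```go
package main

import "fmt"

// exampleHash is the 64-bit hash value from the example in the text,
// split into the top 8 bits, the middle bits, and the low 5 bits.
const exampleHash uint64 = 0b10010111_000011110110110010001111001010100010010110010101010_01010

// bucketIndex keeps only the low B bits of the hash.
func bucketIndex(hash uint64, B uint8) uint64 {
	mask := uint64(1)<<B - 1 // for B = 5 this is 31, binary 11111
	return hash & mask
}

func main() {
	fmt.Println(bucketIndex(exampleHash, 5)) // 10
	fmt.Println(exampleHash % 32)            // 10, same result via remainder
}
```

The mask trick only works because the number of buckets is always a power of two.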

Then the high 8 bits of the hash value are used to find the position of the key within the bucket. That describes looking up an existing key. A bucket starts out with no keys, and a newly added key is placed in the first empty slot.

The bucket index is the bucket's position in the buckets array. When two different keys fall into the same bucket, a hash collision occurs. Collisions are resolved by chaining: within the bucket, scan from front to back for the first empty slot and place the key there. When looking up a key, first find the corresponding bucket, then traverse the keys in that bucket.

Here is a reference to a picture in Cao Da's github blog. The original picture is an ASCII picture, which is full of geek flavor. You can find Cao Da's blog from the reference materials. I recommend everyone to take a look.

[Figure: locating a key by bucket index and tophash, from Cao Da's blog]

In the figure above, B = 5 is assumed, so the total number of buckets is 2^5 = 32. First, compute the hash of the key being looked up. The low 5 bits, 00110, identify bucket No. 6. The high 8 bits, 10010111, correspond to decimal 151. Searching bucket No. 6 for a tophash value (HOB hash) of 151 finds the key in slot 2, and the lookup is complete.

If it is not found in the bucket and the overflow is not empty, it will continue to search in the overflow bucket until it is found or all key slots have been searched, including all overflow buckets.
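The search just described can be modeled in a few lines. This is a simplified sketch rather than the runtime code: the bucket type below is hypothetical, fixed to string keys and int values, and uses an ordinary Go pointer for the overflow link.

```go
package main

import "fmt"

const bucketCnt = 8

// bucket is a simplified model of bmap for string keys and int values.
type bucket struct {
	tophash  [bucketCnt]uint8
	keys     [bucketCnt]string
	values   [bucketCnt]int
	overflow *bucket
}

// lookup scans the bucket and then its overflow chain. A slot is only
// compared by key when its tophash matches, which skips most slots cheaply.
func lookup(b *bucket, top uint8, key string) (int, bool) {
	for ; b != nil; b = b.overflow {
		for i := 0; i < bucketCnt; i++ {
			if b.tophash[i] != top {
				continue
			}
			if b.keys[i] == key {
				return b.values[i], true
			}
		}
	}
	return 0, false
}

func main() {
	over := &bucket{}
	over.tophash[0], over.keys[0], over.values[0] = 151, "go", 2009

	first := &bucket{overflow: over}
	first.tophash[2], first.keys[2], first.values[2] = 151, "map", 42

	fmt.Println(lookup(first, 151, "map"))     // 42 true (slot 2 of the first bucket)
	fmt.Println(lookup(first, 151, "go"))      // 2009 true (found in the overflow bucket)
	fmt.Println(lookup(first, 151, "missing")) // 0 false
}
```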

Let's take a look at the source code. From the assembly, we can see that lookups are handled by the mapaccess family of functions. These functions behave similarly; the differences are discussed in the next section. Here we look directly at mapaccess1:

func mapaccess1(t *maptype, h *hmap, key unsafe.Pointer) unsafe.Pointer {
	// ...

	// If h is nil or empty, return the zero value
	if h == nil || h.count == 0 {
		return unsafe.Pointer(&zeroVal[0])
	}

	// A write is in progress concurrently with this read
	if h.flags&hashWriting != 0 {
		throw("concurrent map read and map write")
	}

	// The hash algorithm for each key type is determined at compile time
	alg := t.key.alg

	// Compute the hash, mixing in hash0 as a seed for randomness
	hash := alg.hash(key, uintptr(h.hash0))

	// For example, if B=5 then m is 31, all ones in binary.
	// ANDing hash with m makes the bucket number
	// depend only on the low B bits of hash.
	m := uintptr(1)<<h.B - 1

	// b is the address of the bucket
	b := (*bmap)(add(h.buckets, (hash&m)*uintptr(t.bucketsize)))

	// oldbuckets is non-nil, so the map is in the middle of growing
	if c := h.oldbuckets; c != nil {
		// If this is not a same-size grow (see the later discussion of growth),
		// i.e. the response to condition 1
		if !h.sameSizeGrow() {
			// There are twice as many new buckets as old ones, so halve the mask
			m >>= 1
		}

		// Locate the key's bucket in the old bucket array
		oldb := (*bmap)(add(c, (hash&m)*uintptr(t.bucketsize)))

		// If oldb has not been evacuated to the new buckets,
		// search the old bucket instead
		if !evacuated(oldb) {
			b = oldb
		}
	}

	// Take the top 8 bits of the hash;
	// on a 64-bit machine this is a right shift by 56
	top := uint8(hash >> (sys.PtrSize*8 - 8))

	// Add minTopHash so the value cannot collide with a state marker
	if top < minTopHash {
		top += minTopHash
	}
	for {
		// Scan the 8 slots of the bucket
		for i := uintptr(0); i < bucketCnt; i++ {
			// tophash does not match; keep going
			if b.tophash[i] != top {
				continue
			}
			// tophash matches; compute the key's location
			k := add(unsafe.Pointer(b), dataOffset+i*uintptr(t.keysize))
			// The key is stored as a pointer
			if t.indirectkey {
				// Dereference it
				k = *((*unsafe.Pointer)(k))
			}
			// If the keys are equal
			if alg.equal(key, k) {
				// Compute the value's location
				v := add(unsafe.Pointer(b), dataOffset+bucketCnt*uintptr(t.keysize)+i*uintptr(t.valuesize))
				// Dereference the value
				if t.indirectvalue {
					v = *((*unsafe.Pointer)(v))
				}
				return v
			}
		}

		// Bucket exhausted without a match; move on to the overflow bucket
		b = b.overflow(t)
		// No more overflow buckets, so the key is absent;
		// return the zero value
		if b == nil {
			return unsafe.Pointer(&zeroVal[0])
		}
	}
}

The function returns a pointer to h[key]. If the key is not in h, it returns a pointer to a zero value of the map's value type rather than nil.

The code is fairly straightforward overall. Following the comments above step by step is enough to understand it.

Here, let's look at how the key and value are located and how the loop as a whole is written.

// Formula for locating a key
k := add(unsafe.Pointer(b), dataOffset+i*uintptr(t.keysize))

// Formula for locating a value
v := add(unsafe.Pointer(b), dataOffset+bucketCnt*uintptr(t.keysize)+i*uintptr(t.valuesize))

b is the address of the bmap. Here bmap is still the structure defined in the source code, which contains only a tophash array; only the structure expanded by the compiler contains the key, value, and overflow fields. dataOffset is the offset of the first key relative to the starting address of the bmap:

dataOffset = unsafe.Offsetof(struct {
		b bmap
		v int64
	}{}.v)

Therefore, the keys in a bucket start at address unsafe.Pointer(b)+dataOffset. The address of the i-th key is that base plus the size of i keys. Since the values come after all the keys, the address of the i-th value must additionally skip over all bucketCnt keys. With this in mind, the key and value formulas above are easy to understand.

Now for the structure of the big loop. The outermost layer is an infinite loop that uses

b = b.overflow(t)

to walk through all the buckets in the chain, which is effectively a linked list of buckets.

Once a bucket is located, the inner loop traverses all of its cells (slots), that is, bucketCnt = 8 of them. The whole loop looks like this:

[Figure: the lookup loop over buckets and their cells]

Now back to minTopHash. When the tophash value of a cell is less than minTopHash, it marks the migration state of that cell. Because these state values live in the same tophash array as ordinary hash values, any top hash computed from a key that is less than minTopHash has minTopHash added to it. This keeps normal top hash values distinct from the values that represent state.
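The adjustment is the same few lines that appear in mapaccess1. As a standalone sketch, assuming a 64-bit hash so that the shift is 56 (the function name tophash is chosen here for illustration):

```go
package main

import "fmt"

const minTopHash = 4 // smallest tophash value for a normally filled cell

// tophash returns the top 8 bits of a 64-bit hash, bumped out of the
// range reserved for state markers.
func tophash(hash uint64) uint8 {
	top := uint8(hash >> 56)
	if top < minTopHash {
		top += minTopHash
	}
	return top
}

func main() {
	fmt.Println(tophash(0x9700000000000000)) // 151: already outside the reserved range
	fmt.Println(tophash(0x0200000000000000)) // 6: 2 would look like a state marker, so 4 is added
}
```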

The following states characterize the bucket situation:

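// Empty cell; also the initial state of a bucket
empty          = 0
// Empty cell; the cell has been evacuated to the new bucket
evacuatedEmpty = 1
// The key/value has been evacuated, and the key is in the first half of the new buckets;
// this is covered later, in the discussion of growth.
evacuatedX     = 2
// Same as above, but the key is in the second half
evacuatedY     = 3
// Minimum tophash value for a normally filled cell
minTopHash     = 4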

The function used in the source code to determine whether the bucket has been relocated:

func evacuated(b *bmap) bool {
	h := b.tophash[0]
	return h > empty && h < minTopHash
}

Only the first value of the tophash array is checked, to see whether it lies strictly between 0 and 4. Comparing with the constants above, when the top hash is one of the three values evacuatedEmpty, evacuatedX, or evacuatedY, all the keys in this bucket have been moved to the new buckets.
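To see which values the check accepts, here is a small standalone version. It takes the first tophash byte directly instead of a *bmap, which is a simplification made for this sketch.

```go
package main

import "fmt"

const (
	empty          = 0
	evacuatedEmpty = 1
	evacuatedX     = 2
	evacuatedY     = 3
	minTopHash     = 4
)

// evacuated applies the same range test as the runtime function,
// but on the first tophash byte passed in directly.
func evacuated(tophash0 uint8) bool {
	return tophash0 > empty && tophash0 < minTopHash
}

func main() {
	for _, h := range []uint8{empty, evacuatedEmpty, evacuatedX, evacuatedY, minTopHash, 151} {
		fmt.Println(h, evacuated(h))
	}
}
```

Only 1, 2, and 3 report true; 0 (an empty cell) and any ordinary top hash of 4 or more report false.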

Origin blog.csdn.net/zy_dreamer/article/details/132799666