[Project] Implementing a high-concurrency memory pool from scratch


Table of contents

1. Project introduction

1. The prototype of the project

2. The technologies involved in the project and bloggers’ previous reference articles

3. Pooling technology

4. Internal and external fragmentation of the memory pool

2. Let’s first look at a fixed-length memory pool design.

3. Three-layer framework design of high-concurrency memory pool

1. Implementation of thread cache

1.1 Thread cache overall framework

1.2 Hash bucket mapping alignment rules

1.3 Thread Local Storage (TLS) achieves lock-free access

2. Implementation of central cache

2.1 The overall framework of the central cache

2.2 Page Management

2.3 Use the singleton pattern to create a global static central cache object (eager initialization)

2.4 Use the slow start feedback adjustment algorithm to solve the memory page allocation problem

2.5 How to retrieve memory objects from the central cache?

3. Implementation of page cache

3.1 The overall framework of the page cache

3.2 How does the central cache apply for memory from the page cache?

3.3 The central cache releases its bucket lock while applying to the page cache for memory

4. Memory recycling mechanism of three-layer cache

1. Recycling mechanism of thread cache

2. Central cache recycling mechanism

2.1 The central cache recycles the memory objects returned by the thread cache

2.2 The mechanism by which the central cache returns memory pages to the page cache

3. Page cache recycling mechanism

3.1 The page cache attempts to merge the preceding and following pages to reduce external fragmentation.

5. Cache application problem larger than 256KB

6. Use a fixed-length memory pool to separate the project from new/delete

7. Analysis of performance bottlenecks of high-concurrency memory pools

8. Use radix tree instead of unordered_map to improve memory pool performance

1. Two different radix trees

2. Why does using a data structure like a radix tree not require locking?


Project code link: high-concurrency memory pool project (gitee.com). This project extracts the core part of Google's open-source tcmalloc and improves the efficiency of requesting and releasing memory in a multi-threaded environment.

Adaptation environment: Windows X86

1. Project introduction

1. The prototype of the project

The prototype of this project is Thread-Caching Malloc, an open source project of Google, which is thread caching malloc, or tcmalloc for short. It was written by Google's top C++ experts at the time, and it was very well-known. Many companies used it as a performance optimization, and the Go language directly used it as a memory allocator.

This project extracts the core part of tcmalloc and builds a high-concurrency memory pool from scratch. In a multi-threaded environment it can replace the system memory allocation functions such as malloc and free more efficiently.

2. The technologies involved in the project and bloggers’ previous reference articles

Language: C/C++ (including C++11)

[C++11] Detailed explanation of common features of C++11

Data structure: singly linked list, headed two-way circular linked list, hash bucket

[Data structure] Implementation of singly linked list

[Data structure] Implementation of leading two-way circular linked list

[C++] Hash (unordered series associative containers)

Operating system: memory management, multi-threading, mutex locks

[C language] realloc, malloc, calloc

[C++] Dynamic memory management and generic programming

【C++11】Multi-threading

[Linux System Programming] Creation, waiting, and termination of multi-threads

[Linux system programming] Mutual exclusion and synchronization of multi-threads

Design Pattern: Singleton Pattern

[C++] Special class design + singleton mode

[Linux system programming] Thread pool based on singleton mode lazy implementation

3. Pooling technology

        Every time a program asks the operating system for a resource there is some overhead, and in scenarios where resources are frequently requested and released this overhead adds up. Pooling reduces that cost: the program requests a surplus of resources from the operating system in advance as a "reservoir", and whenever a thread needs a resource it takes one from this pool. This avoids the drop in efficiency and the memory fragmentation caused by frequently requesting and releasing small resources. (In fact, the library function malloc is essentially a memory pool as well.)

        Pooling appears in many places in computing: besides memory pools there are object pools, connection pools, thread pools, and so on. Take a server-side thread pool as an example: a number of threads are started in advance and put to sleep; when a client request arrives, a thread in the pool is woken up to handle it, and once the request has been processed the thread goes back to sleep.

4. Internal and external fragmentation of the memory pool

Internal fragmentation: the requested size is smaller than the block the memory pool hands out; the unused excess inside the block is internal fragmentation.

External fragmentation: although the total free space in the memory pool is larger than a request, the request still fails because the free space is not contiguous; this is external fragmentation.

2. Let’s first look at a fixed-length memory pool design.

        Design a fixed-length memory pool: _memory points to the start of the pool, each allocation hands out one block of type T, and _memory += sizeof(T) advances the pointer so that it marks the starting address of the next allocation.

        When a block is returned it is head-inserted into the free list _freeList; the first 4/8 bytes of each returned block are used as a pointer to the next returned block.

T* New() interface: used to request a block

1. If there are returned blocks, they are reused first. Note that when the current chunk has been handed out almost to the end, what remains may be smaller than one block; in that case the leftover space is abandoned and a new chunk is requested;

2. When requesting a new chunk, use the Windows VirtualAlloc interface to get page-granular memory directly from the system rather than malloc (similar interfaces on Linux are brk and mmap);

3. Consider that type T may be smaller than a pointer; in that case the fixed block size must be bumped up to the size of a pointer (each block must at least be able to hold the next pointer, otherwise head insertion into the free list would be impossible);

4. After the block's raw space is obtained, T's constructor must still be called explicitly via placement new. (Why not construct the object earlier? Because the object must be constructed in the memory the pool has just handed out, so that the constructed data lands inside the block; if the constructor ran before the space was allocated, the constructed data would be written somewhere else!)

void Delete(T* obj) interface: used to clean up a returned block (note: the block is only cleaned, not freed)

1. For a returned block, first call T's destructor explicitly to clean up the object, then head-insert the block into the free list _freeList.

2. Doesn't the fixed-length memory pool ever release its memory? It doesn't need to: the whole point of a memory pool is to serve workloads that frequently request and release resources, so its blocks are reused over and over; this is not a memory leak. As long as the process and its workload keep running the pool keeps its memory, and when the process exits the memory is reclaimed by the operating system.

#pragma once
#include <iostream>
#include <exception>
#include <vector>
#include <ctime>

#ifdef _WIN32
	#include <Windows.h>
#else
	//include headers for brk/mmap etc. on Linux
#endif
using std::cout;
using std::endl;
inline static void* SystemAlloc(size_t kpage)
{
#ifdef _WIN32
	void* ptr = VirtualAlloc(0, kpage <<13, MEM_COMMIT | MEM_RESERVE,
		PAGE_READWRITE);
#else
	void* ptr = nullptr;// brk/mmap etc. on Linux
#endif
	if (ptr == nullptr)
		throw std::bad_alloc();
	return ptr;
}
//template <size_t N>//N: the size of each memory block
//class ObjectPool//fixed-length memory pool
//{};
template <class T>//T: the memory object type; every object has the same size, which is the block size
class ObjectPool//fixed-length memory pool
{
public:
	T* New()//a single allocation from the pool
	{
		T* obj = nullptr;
		//prefer reusing blocks that have already been returned
		if (_freeList != nullptr)
		{
			obj = (T*)_freeList;
			_freeList = *(void**)_freeList;
		}
		else
		{
			if (_remainBytes < sizeof(T))//when the remaining memory is smaller than one T object, grab a new chunk
			{
				_remainBytes = 128 * 1024;
				_memory = (char*)SystemAlloc(_remainBytes>>13);//128KB at 8KB per page = 16 pages
				if (nullptr == _memory)
				{
					throw std::bad_alloc();
				}
				
			}
			//hand out one block from the current chunk
			obj = (T*)_memory;
			size_t objSize = sizeof(T) < sizeof(void*) ? sizeof(void*) : sizeof(T);
			_memory += objSize;
			_remainBytes -= objSize;
		}
		//use placement new to explicitly call T's constructor
		new(obj)T;
		return obj;
	}
	void Delete(T* obj)
	{
		obj->~T();//explicitly call obj's destructor to clean up the object
		//head-insert the returned block into the singly linked free list
		*(void**)obj = _freeList;//the first 4/8 bytes of the block (32-bit vs 64-bit) hold the next pointer
		_freeList = obj;
	}
private:
	char* _memory = nullptr;//start address of the current chunk
	size_t _remainBytes = 0;//bytes remaining in the current chunk
	void* _freeList = nullptr;//head pointer of the singly linked list of returned blocks
};
struct TreeNode
{
	int val;
	TreeNode* _left;
	TreeNode* _right;
	TreeNode()
		:val(0)
		,_left(nullptr)
		,_right(nullptr)
	{}
};
void TestObjectPool()//test code
{
	//number of allocate/release rounds
	const size_t Rounds = 5;
	//how many allocations and releases per round
	const size_t N = 10000;
	std::vector<TreeNode*> v1;
	v1.reserve(N);
	size_t begin1 = clock();
	for (size_t j = 0; j < Rounds; ++j)
	{
		for (int i = 0; i < N; ++i)
		{
			v1.push_back(new TreeNode);
		}
		for (int i = 0; i < N; ++i)
		{
			delete v1[i];
		}
		v1.clear();
	}
	size_t end1 = clock();
	std::vector<TreeNode*> v2;
	v2.reserve(N);
	ObjectPool<TreeNode> TNPool;
	size_t begin2 = clock();
	for (size_t j = 0; j < Rounds; ++j)
	{
		for (int i = 0; i < N; ++i)
		{
			v2.push_back(TNPool.New());
		}
		for (int i = 0; i < N; ++i)
		{
			TNPool.Delete(v2[i]);
		}
		v2.clear();
	}
	size_t end2 = clock();
	cout << "new cost time:" << end1 - begin1 << endl;
	cout << "object pool cost time:" << end2 - begin2 << endl;
}

        The test code compares the fixed-length memory pool with new by allocating tree nodes in batches; the fixed-length pool is faster. A fixed-length pool is only one building block, though. The actual design of this project follows:

3. Three-layer framework design of high-concurrency memory pool

        Many modern development environments are multi-core and multi-threaded, so memory allocation inevitably involves fierce lock contention. malloc itself is already quite good, but tcmalloc, the prototype of our project, does even better in multi-threaded, high-concurrency scenarios. The high-concurrency memory pool we implement therefore needs to address the following issues:

1. Performance issues. 2. Lock competition problem in multi-threaded environment. 3. Memory fragmentation problem. 

The high-concurrency memory pool mainly consists of three parts:

1. Thread cache: used for allocations smaller than 256KB. Every thread created in the process has its own independent thread cache, so threads requesting memory from it need no lock, whereas traditional malloc must lock in a multi-threaded environment. This is the foundation of the high-concurrency memory pool's efficiency.

2. Central cache: when the thread cache runs out, the thread goes to the central cache to request objects. The central cache also takes resources back at appropriate times so that other threads are not starved. The central cache is shared, so requests to it must be protected by locks; but because bucket locks are used, locking only matters when two threads happen to hit the same bucket, and a thread only comes here after its thread cache is exhausted, so contention on the central cache is not fierce.

3. Page cache: when the central cache has used up the objects in a span, it obtains memory from the page cache in units of pages, cuts it into fixed-size blocks and hands them to the central cache. When all the small blocks cut from a span in the central cache have been returned, the page cache reclaims the span and merges adjacent pages to alleviate memory fragmentation.

If the page cache is not enough, interfaces such as VirtualAlloc will be used to apply for resources in the heap area.

1. Implementation of thread cache

1.1 Thread cache overall framework

        The fixed-length memory pool showed that a pool beats malloc when the size is fixed, but it can only serve objects of one size. So we design a memory pool that offers many fixed sizes, to be chosen as needed.

        The thread cache is a hash-bucket structure: the table is indexed by the size of the block, and each bucket holds free blocks of that bucket's size. Every thread has its own thread cache object, so no lock is needed when requesting memory here.

        When memory is requested, a block whose size is greater than or equal to the request is handed out. For example, if only 10 bytes are needed but the smallest suitable block is 16 bytes, the extra 6 bytes become internal fragmentation. When the block is returned, it is inserted back into the corresponding hash bucket.

1.2 Hash bucket mapping alignment rules

        The thread cache covers sizes up to 256KB. We cannot map every possible size to its own bucket, so a range of sizes has to map to the same bucket according to some rule.

        How should the ranges be chosen? With 256KB to cover, alignment intervals that are too large cause internal fragmentation, while intervals that are too small produce too many hash buckets.

        The first bucket must be 8 bytes, because a pointer on a 64-bit machine is 8 bytes and a block must at least hold a pointer. With the starting size fixed, how are the alignment intervals chosen?

Overall, internal fragmentation is kept to roughly 10% at most:

[1,128]                  8-byte aligned        freelist[0,16)

[128+1, 1024]            16-byte aligned       freelist[16,72)

[1024+1, 8*1024]         128-byte aligned      freelist[72,128)

[8*1024+1, 64*1024]      1024-byte aligned     freelist[128,184)

[64*1024+1, 256*1024]    8*1024-byte aligned   freelist[184,208)

        Explanation: a request of 129 bytes falls in the 16-byte-aligned range, so it is served by the 144-byte bucket (128 + 16 = 144); the internal fragmentation ratio is (144 - 129) / 144 ≈ 10.4%.

        Given a request of bytes bytes from outside, the following code returns the block size after rounding the request up to its alignment (the commented-out helper is the straightforward version; the uncommented one is the bit trick used by the tcmalloc authors):

/*size_t _RoundUp(size_t size, size_t alignNum)
{
    size_t alignSize = 0;
    if (size % alignNum != 0)
    {
        alignSize = ((size / alignNum) + 1) * alignNum;
    }
    else
        alignSize = size;
    return alignSize;
}*/
static inline size_t _RoundUp(size_t bytes, size_t alignNum)//bytes: requested size; alignNum: alignment (a power of two)
{
    return (bytes + alignNum - 1) & ~(alignNum - 1);
}
//returns the block size a request is rounded up to
static inline size_t RoundUp(size_t bytes)
{
    if (bytes <= 128)
    {
        return _RoundUp(bytes, 8);
    }
    else if (bytes <= 1024)
    {
        return _RoundUp(bytes, 16);
    }
    else if (bytes <= 8 * 1024)
    {
        return _RoundUp(bytes, 128);
    }
    else if (bytes <= 64 * 1024)
    {
        return _RoundUp(bytes, 1024);
    }
    else if (bytes <= 256 * 1024)
    {
        return _RoundUp(bytes, 8 * 1024);
    }
    else
    {
        assert(false);
        return -1;
    }
}

The following code computes which bucket a request maps to after it has been rounded up:

//static inline size_t _Index(size_t bytes,size_t alignNum)//bytes: requested size; alignNum: alignment
//{
//	if (bytes % alignNum == 0)
//	{
//		return bytes / alignNum - 1;
//	}
//	else
//		return bytes / alignNum;
//}
static inline size_t _Index(size_t bytes,size_t align_shift)//bytes: requested bytes minus the upper bound of the previous range; align_shift: log2 of the alignment
{
    return ((bytes + (1 << align_shift) - 1) >> align_shift) - 1;
}
// compute which free-list bucket the request maps to
static inline size_t Index(size_t bytes)//bytes: requested size
{
    assert(bytes <= MAX_BYTES);
    // how many free lists each range contributes
    static int group_array[4] = { 16, 56, 56, 56 };
    if (bytes <= 128) {
        return _Index(bytes, 3);
    }
    else if (bytes <= 1024) {
        return _Index(bytes - 128, 4) + group_array[0];
    }
    else if (bytes <= 8 * 1024) {
        return _Index(bytes - 1024, 7) + group_array[1] + group_array[0];
    }
    else if (bytes <= 64 * 1024) {
        return _Index(bytes - 8 * 1024, 10) + group_array[2] + group_array[1]
            + group_array[0];
    }
    else if (bytes <= 256 * 1024) {
        return _Index(bytes - 64 * 1024, 13) + group_array[3] +
            group_array[2] + group_array[1] + group_array[0];
    }
    else {
        assert(false);
    }
    return -1;
}
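A quick sanity check (a hypothetical test snippet, not part of the project code; it assumes RoundUp and Index live in the SizeClass class used later in the article) confirming the 129-byte example above:

#include <cassert>

void TestSizeClass()
{
	// 129 bytes falls in (128, 1024], which is 16-byte aligned, so it rounds up to 144 bytes...
	assert(SizeClass::RoundUp(129) == 144);
	// ...and buckets [0,16) already cover [1,128], so 129..144 bytes maps to bucket 16
	assert(SizeClass::Index(129) == 16);
	// an 8-byte request uses the very first bucket
	assert(SizeClass::RoundUp(8) == 8);
	assert(SizeClass::Index(8) == 0);
}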

1.3 Thread Local Storage (TLS) achieves lock-free access

class ThreadCache
{
public:
	//allocate and release objects
	void* Allocate(size_t size);
	void Deallocate(void* ptr, size_t size);
	// fetch objects from the central cache
	void* FetchFromCentralCache(size_t index, size_t size);//index: which bucket; size: block size after alignment
private:
	FreeList _freeList[NFREE_LIST];//hash table of 208 free-list buckets
};

//each thread has its own pTLSThreadCache
static __declspec(thread) ThreadCache* pTLSThreadCache = nullptr;//pTLSThreadCache: pointer to this thread's ThreadCache object

        This code uses the Microsoft Visual C++ keyword __declspec(thread) to declare the variable as thread-local storage (TLS). Each thread gets its own copy of the variable, which only that thread can access and modify; it is not shared with other threads. Since every thread has its own ThreadCache instance, contention and synchronization between threads are avoided, improving the program's concurrency and performance.
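As a side note, __declspec(thread) is MSVC-specific. A minimal sketch of the same per-thread pointer in portable form, assuming the project were built with a C++11 compiler or ported to Linux (not part of the original code):

// C++11 portable form: every thread gets its own pTLSThreadCache
static thread_local ThreadCache* pTLSThreadCache = nullptr;

// GCC/Clang-specific form on Linux
// static __thread ThreadCache* pTLSThreadCache = nullptr;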

        The concrete ThreadCache interfaces follow. Note that when a thread's cache does not have enough memory, it asks the central cache for more.

void* ThreadCache::Allocate(size_t bytes)//allocate an object
{
	assert(bytes <= MAX_BYTES);
	size_t alignNum = SizeClass::RoundUp(bytes);//block size after alignment
	size_t index = SizeClass::Index(bytes);//which bucket it maps to
	if (!_freeList[index].Empty())//take a block from this bucket's free list
	{
		return _freeList[index].Pop();
	}
	else//the free list is empty, so fetch from the central cache
	{
		return FetchFromCentralCache(index, alignNum);//index: which bucket; alignNum: block size after alignment
	}
}
void ThreadCache::Deallocate(void* ptr, size_t bytes)//release an object
{
	assert(ptr);
	assert(bytes <= MAX_BYTES);
	size_t index = SizeClass::Index(bytes);//which bucket it maps to
	//head insertion
	_freeList[index].Push(ptr);
}

The following two wrapper interfaces take the requested byte count from the caller and return or release the corresponding block.

//static, so that including this header from multiple source files does not cause a function redefinition
static void* ConcurrentAlloc(size_t size)//each thread calls this to allocate a block from its own thread cache
{
	if (nullptr == pTLSThreadCache)//no lock needed here: each thread has its own pTLSThreadCache
	{
		pTLSThreadCache = new ThreadCache;
	}
	//cout << std::this_thread::get_id() << ":" << pTLSThreadCache << endl;
	return pTLSThreadCache->Allocate(size);
}
static void ConcurrentFree(void* ptr,size_t size)//called from outside to release a block
{
	assert(pTLSThreadCache);
	pTLSThreadCache->Deallocate(ptr, size);
}

2. Implementation of central cache

2.1 The overall framework of the central cache

        The central cache is very similar to the thread cache: it is also a hash-bucket structure mapped by the same rules, both have 208 buckets, and a block of a given size maps to the same bucket index in both. The differences are:

1. The hash buckets here must be protected by locks, and the lock used is a bucket lock (each bucket has its own lock to ensure thread safety).

2. The central cache's buckets are doubly linked circular lists with a sentinel head, because when a span's pages are reclaimed they must be returned to the page cache below, which requires insertion and deletion at arbitrary positions.

3. The central cache is a singleton, whereas every thread has its own thread cache.

4. What hangs in a bucket here is not individual small blocks but spans (large, page-granular chunks of memory; spans in different buckets may hold different numbers of pages). The actual cut-up small blocks hang under each span.

5. Blocks a thread cache takes from the central cache go back into the thread cache when the thread frees them. If a thread frequently requests blocks and keeps everything it frees in its own "pocket", its thread cache grows without bound. So the thread cache must hand blocks back at the appropriate time, and the central cache reclaims them and redistributes them to other threads that need memory, achieving balanced scheduling.

6. How does the central cache know that all the "borrowed" blocks of a span have come back? Add a member variable that records how many blocks have been lent out; decrement it each time one is returned, and when it reaches 0, recycling can begin.

7. Likewise, when the central cache is short of memory it asks the layer below for space, and when that layer takes pages back it keeps merging them to reduce (external) fragmentation.

2.2 Page Management

        We can define a class Span to manage a large block of memory made of multiple consecutive pages. These pages need a number (similar to an address), so the class has a _pageId member that records the page number.

        For example, a 32-bit machine has 2^32 / 2^13 = 2^19 pages, but on a 64-bit machine the count explodes to 2^64 / 2^13 = 2^51 pages. To keep the page-number type large enough on 64-bit machines, conditional compilation is used. Note that under Win64 both _WIN64 and _WIN32 are defined, so _WIN64 must be checked first:

#ifdef _WIN64
	typedef unsigned long long PAGE_ID;
#elif _WIN32
	typedef size_t PAGE_ID;
#else
	//Linux and other platforms
#endif

//structure managing a large block of memory made of multiple consecutive pages
//(a node of the doubly linked circular list with a sentinel head)
struct Span
{
	PAGE_ID _pageId = 0;//page number, similar to an address. A 32-bit program has 2^32/2^13 = 2^19 pages of 8KB each; 64-bit has 2^51 pages.
	size_t _n = 0;//number of pages
	//doubly linked circular list with a sentinel head
	Span* _next = nullptr;//pointer to the next span in the list
	Span* _prev = nullptr;//pointer to the previous span in the list
	size_t _useCount = 0;//count of small blocks already "lent" to the thread cache
	void* _freeList = nullptr;//singly linked free list of cut, not-yet-allocated small blocks hanging under this span
};
//doubly linked circular list of spans with a sentinel head (plus a bucket lock)
class SpanList
{
public:
	SpanList()
	{
		_head = new Span;
		assert(_head);
		_head->_prev = _head;
		_head->_next = _head;
	}
	void Insert(Span* pos, Span* newSpan);//insert newSpan before pos
	void Erase(Span* pos);
	std::mutex _mtx;//bucket lock (public so callers such as the central cache can lock the bucket directly)
private:
	Span* _head;//sentinel head
};

2.3 Use the singleton pattern to create a global static central cache object (eager initialization)

        In the thread cache we used TLS to give each thread its own private cache, which is what makes that layer lock-free. The central cache, however, uses bucket locks for thread safety, so all threads must access one and the same central cache object; that is exactly the singleton pattern: a static global object that all threads access.

//eagerly initialized ("hungry") singleton
class CentralCache
{
public:
	//static member function that returns the singleton; static so it can be called without an object
	static CentralCache* GetInStance()
	{
		return &_sInst;
	}
private:
	SpanList _spanLists[NFREE_LIST];//208 buckets
private:
	static CentralCache _sInst;//static singleton object (declare here, define in a source file; defining it in the header would give every including source file its own definition)
	//private constructor + deleted copy construction and assignment
	CentralCache() = default;//declaring the deleted copy constructor suppresses the implicit default constructor, so declare it explicitly
	CentralCache(const CentralCache&) = delete;
	CentralCache& operator=(const CentralCache&) = delete;
};
CentralCache CentralCache::_sInst;//definition of the singleton object (in a source file)

2.4 Use the slow start feedback adjustment algorithm to solve the memory page allocation problem

When a thread cache makes a memory request to the central cache, there are two issues that need attention:

1. How many pages of memory should the central cache give it each time? If you give too much, resources will become idle; if you give too little, the thread application frequency will increase, the probability of access conflicts will increase, and the efficiency will decrease.

2. The thread cache requests objects of different sizes; what is the central cache's allocation rule? A thread cache asking for 8-byte blocks and one asking for 256KB blocks should clearly receive different numbers of pages.

        The number of objects handed over is derived from the size the thread cache is requesting: small objects are given more at a time (up to a cap), large objects fewer.

// how many objects the thread cache fetches from the central cache in one batch
static size_t NumMoveSize(size_t size)//size: size of a single object
{
    assert(size > 0);
    // [2, 512]: upper bound on how many objects one (slow-start) batch may move
    // small objects get a high batch cap, large objects a low one
    int num = MAX_BYTES / size;//256KB / size
    if (num < 2)//for 256KB large objects, hand over at least 2
        num = 2;
    if (num > 512)//for small objects, cap the batch at 512
        num = 512;
    return num;
}

        For an 8-byte request, the code above says the central cache could hand over 512 objects at a time; if the thread cache only ever needs a few of them, most would sit idle. So the following logic adds slow growth per size class (the FreeList class gains a new member, initialized to 1 and exposed through MaxSize()):

void* ThreadCache::FetchFromCentralCache(size_t index, size_t size)
{
	//slow-start adjustment algorithm
	size_t batchNum = std::min(_freeList[index].MaxSize(),SizeClass::NumMoveSize(size));
	if (batchNum == _freeList[index].MaxSize())
	{
		_freeList[index].MaxSize() += 1;
	}
	return nullptr;//only the slow-start piece is shown here; fetching the batch itself is sketched below
}
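The function above only shows the slow-start piece. A minimal sketch of the rest, assuming the central cache's FetchRangeObj presented next in 2.5 and a FreeList::PushRange(start, end) helper (the helper name is an assumption, it is not shown in this article):

void* ThreadCache::FetchFromCentralCache(size_t index, size_t size)
{
	//slow-start adjustment, as above
	size_t batchNum = std::min(_freeList[index].MaxSize(), SizeClass::NumMoveSize(size));
	if (batchNum == _freeList[index].MaxSize())
	{
		_freeList[index].MaxSize() += 1;
	}
	void* start = nullptr;
	void* end = nullptr;
	//the central cache may hand over fewer than batchNum objects if its span runs out
	size_t actualNum = CentralCache::GetInStance()->FetchRangeObj(start, end, batchNum, size);
	assert(actualNum >= 1);
	if (actualNum == 1)
	{
		//only one object: give it straight to the caller
		return start;
	}
	//keep the extra objects in this bucket's free list for later requests
	_freeList[index].PushRange(NextObj(start), end);
	return start;
}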

2.5 How to retrieve memory objects from the central cache?

Here are some basic linked list operations:

size_t CentralCache::FetchRangeObj(void*& start, void*& end, size_t batchNum, size_t size)
{
	size_t index = SizeClass::Index(size); //compute which bucket the (rounded-up) request maps to
	_spanLists[index]._mtx.lock();//lock the bucket as soon as the thread enters
	Span* span = CentralCache::GetOneSpan(_spanLists[index], size);//get a non-empty span
	assert(span);
	assert(span->_freeList);
	start = span->_freeList;//hand out objects starting from the head of the span's free list
	end = start;
	for (size_t i = 0; i < batchNum - 1; ++i)
	{
		end = NextObj(end);
	}
	span->_freeList= NextObj(end);
	NextObj(end) = nullptr;
	_spanLists[index]._mtx.unlock();
	return batchNum;
}

        However, when the number of objects requested is larger than the number left in the current span, the code above walks off the end of the list and crashes on a null-pointer dereference. In that case end must be stopped before it runs off the list, and only as many objects as the span still has are taken.

/**
  * @brief  Fetch a batch of objects from the central cache for the thread cache
  * @param  start: start address of the fetched objects
  * @param  end: end address of the fetched objects
  * @param  batchNum: how many objects the central cache should hand over, per the slow-start algorithm
  * @param  size: size of a single object requested by the thread cache
  * @retval the number of objects actually handed over (the current span may not have enough left)
  */
size_t CentralCache::FetchRangeObj(void*& start, void*& end, size_t batchNum, size_t size)
{
	size_t index = SizeClass::Index(size); //compute which bucket the (rounded-up) request maps to
	_spanLists[index]._mtx.lock();//lock the bucket as soon as the thread enters
	Span* span = CentralCache::GetOneSpan(_spanLists[index], size);//get a non-empty span
	//take objects from this non-empty span and hand them to the thread cache
	assert(span);
	assert(span->_freeList);
	start = span->_freeList;//hand out objects starting from the head of the span's free list
	end = start;
	size_t i = 0;
	while (i < batchNum - 1 && NextObj(end) != nullptr)
	{
		++i;
		end = NextObj(end);
	}
	span->_freeList= NextObj(end);
	NextObj(end) = nullptr;
	span->_useCount += i + 1;//record how many objects this span has lent out
	_spanLists[index]._mtx.unlock();
	return i + 1;
}

3. Implementation of page cache

3.1 The overall framework of the page cache

        The page cache differs somewhat from the thread cache and the central cache. Its hash buckets are divided by page count, a direct-addressing mapping: when the central cache needs a span of a certain number of pages, it goes to the bucket for that page count. The spans in a page cache bucket are linked in a doubly linked circular list with a sentinel head and are not cut into small objects.

Answers to some page cache details:

1. The logic of page cache

If the central cache needs a two-page span but the two-page bucket happens to be empty, the page cache looks for a bucket with more than two pages, splits a span from it, hands the two-page part to the central cache, and hangs the remainder on the bucket matching its size. If every bucket above is also empty, the page cache asks the heap for a large 128-page block and allocates from that.

2. Why should the page cache be designed in singleton mode?

Unlike the thread cache, which uses TLS so that each thread owns a private copy, the page cache is a single global object that all threads lock and access, so it is designed as a singleton (also eagerly initialized).

3. Why is the largest page set at 128 pages?

This is because the largest object in the central cache and thread cache is only 256KB, and by the design of the upper layers at most 2 such objects (512KB in total) may be requested at once. At 4KB per page that would be exactly 512KB; with this project's 8KB pages, 128 pages is 1MB, which is more than enough.

4. Why can't the page cache use bucket locks like the central cache, and instead needs one overall lock?

A bucket lock could be used, but it would reduce efficiency. As described in point 1, when the target bucket is empty the page cache traverses upward looking for a non-empty bucket; with bucket locks, that traversal would repeatedly acquire and release locks, slowing it down. So the page cache is protected by a single overall lock.

5. How to perform memory recycling?

When a span's useCount in the central cache drops to 0, all the small blocks cut from it and lent to thread caches have been returned, so the central cache gives the span back to the page cache. The page cache uses the page numbers to check whether the adjacent pages are free and, if so, merges them to reduce memory fragmentation.
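Putting these answers together, a minimal sketch of the PageCache class (member names follow the code used later in this article; the exact layout and the GetInstance name are assumptions):

//eagerly initialized singleton, like CentralCache
class PageCache
{
public:
	static PageCache* GetInstance()
	{
		return &_sInst;
	}
	//get a k-page span (shown in 3.2)
	Span* NewSpan(size_t k);
	//take back a span from the central cache and merge its neighbours (shown in section 4)
	void ReleaseSpanToPageCache(Span* span);
	std::mutex _pageMtx;//one big lock for the whole page cache, not one per bucket
private:
	SpanList _spanLists[NPAGES];//bucket i holds spans of exactly i pages, i in [1,128]
	std::unordered_map<PAGE_ID, Span*> _idSpanMap;//page id -> the span that owns that page
	PageCache() = default;
	PageCache(const PageCache&) = delete;
	PageCache& operator=(const PageCache&) = delete;
	static PageCache _sInst;
};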

3.2 How does the central cache apply for memory from the page cache?

1. Process of thread applying for memory

        As described above, when a thread finds its thread cache short of memory, it uses the slow-start feedback algorithm to request a batch from the central cache. If the corresponding central cache bucket is empty, the central cache in turn requests memory from the page cache. The number of pages requested is derived from the same NumMoveSize upper limit used by the slow-start algorithm: for example, for 8-byte objects the upper limit is 512 objects, 512 × 8 bytes = 4096 bytes, and shifting right by 13 bits (dividing by the 8KB page size) gives 0, so anything smaller than a page is rounded up to one page.

/**
  * @brief  Used when the central cache asks the page cache for memory: computes how many pages to request
  * @param  size: size of a single object
  * @retval number of pages to request
  */
static size_t NumMovePage(size_t size)
{
    size_t num = NumMoveSize(size);//upper limit of objects the thread cache fetches from the central cache at once
    size_t npage = num * size;
    npage >>= PAGE_SHIFT;//shift right by 13 bits, i.e. divide by the 8KB page size
    if (npage == 0)
        npage = 1;
    return npage;
}

2. Split memory

1. Recover the span's starting virtual address from its page number (address = page ID << PAGE_SHIFT).

2. Cut the obtained span into small objects and link them into the free list that hangs under the span in the central cache.

        The insertion must keep the small objects in address order, so that when the thread cache later receives them the memory stays as contiguous as possible, improving utilization. Method: take the first block as the head of the list, then tail-insert the remaining blocks one by one from left to right, as sketched below.
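A minimal sketch of that cutting step, assuming PAGE_SHIFT is 13 (8KB pages), size is the aligned object size of this bucket, and span has just come back from NewSpan:

//recover the span's starting virtual address and total byte count from its page id
char* start = (char*)(span->_pageId << PAGE_SHIFT);
size_t bytes = span->_n << PAGE_SHIFT;
char* endAddr = start + bytes;

//take the first block as the head of the free list,
//then tail-link the rest from left to right so the objects stay in address order
span->_freeList = start;
void* tail = start;
start += size;
while (start < endAddr)
{
	NextObj(tail) = start;//write the next pointer into the previous block
	tail = start;
	start += size;
}
NextObj(tail) = nullptr;//terminate the list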

3. Get the span of k pages from the page cache

NewSpan, shown below, is the method used to obtain a k-page span.

        If a 3-page span is wanted and the 3-page bucket is not empty, simply pop one from the front and take it. If a 2-page span is wanted but the 2-page bucket is empty, go to a larger bucket, say 3 pages, take a span and split it: 2 pages go to the central cache and the remaining 1-page span is hung on the 1-page bucket. If the search reaches the top and every bucket is empty, the page cache requests 128 pages of memory from the heap and then calls itself recursively to do the splitting and mounting.

/**
  * @brief  Get a k-page span from the page cache
  * @param  k: number of pages
  * @retval address of the span obtained
  */
Span* PageCache::NewSpan(size_t k)
{
	assert(k > 0 && k < NPAGES);
	//first check whether the k-page bucket has a span; if so, just take it
	if (!_spanLists[k].Empty())
	{
		return _spanLists[k].PopFront();
	}
	//otherwise check whether any larger bucket has a span
	for (auto i = k+1; i < NPAGES; ++i)
	{
		if (!_spanLists[i].Empty())
		{
			//split it into k pages and i-k pages: the k-page span goes to the central cache,
			//the (i-k)-page span is hung on bucket i-k
			Span* nSpan = _spanLists[i].PopFront();
			Span* kSpan = new Span;
			//cut k pages off the front of nSpan
			kSpan->_pageId = nSpan->_pageId;
			kSpan->_n = k;
			nSpan->_pageId += k;
			nSpan->_n -= k;
			//head-insert the remainder into its bucket
			_spanLists[nSpan->_n].PushFront(nSpan);
			return kSpan;
		}
	}
	//nothing found anywhere: the page cache asks the heap for a 128-page span
	Span* bigSpan = new Span;
	void* ptr = SystemAlloc(NPAGES - 1);
	bigSpan->_pageId = (PAGE_ID)ptr >> PAGE_SHIFT;
	bigSpan->_n = NPAGES - 1;
	//insert bigSpan into its bucket
	_spanLists[bigSpan->_n].PushFront(bigSpan);
	return PageCache::NewSpan(k);//recurse so the next call does the splitting and mounting
}

3.3 The central cache releases its bucket lock while applying to the page cache for memory
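A hedged sketch of GetOneSpan showing this lock handling (the Begin/End iteration helpers on SpanList and the PageCache GetInstance/_pageMtx names are assumptions; the cutting step is the one sketched in 3.2):

Span* CentralCache::GetOneSpan(SpanList& list, size_t size)
{
	//1. see whether this bucket already has a span with free objects
	Span* it = list.Begin();
	while (it != list.End())
	{
		if (it->_freeList != nullptr)
			return it;
		it = it->_next;
	}
	//2. nothing usable: release the bucket lock so other threads can still
	//   return objects to this bucket while we go down to the page cache
	list._mtx.unlock();
	//3. the page cache is protected by one overall lock
	PageCache::GetInstance()->_pageMtx.lock();
	Span* span = PageCache::GetInstance()->NewSpan(SizeClass::NumMovePage(size));
	span->_isUse = true;//mark the span as owned by the central cache
	PageCache::GetInstance()->_pageMtx.unlock();
	//4. cut the span into size-byte objects (see the sketch in 3.2);
	//   no lock is needed here because no other thread can see this span yet
	//5. re-take the bucket lock and publish the span
	list._mtx.lock();
	list.PushFront(span);
	return span;
}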

4. Memory recycling mechanism of three-layer cache

1. Recycling mechanism of thread cache

        When the free list in a thread cache bucket grows longer than one batch of requests, recycling begins. (As mentioned earlier, the batch size grows under the slow-start feedback algorithm, so the more often a bucket has requested memory from the central cache, the longer its list is allowed to get before the next recycle.)
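A minimal sketch of that trigger inside Deallocate, assuming the FreeList tracks its length through Size() and supports PopRange, with a ListTooLong helper (all three names are assumptions) that hands the batch back through the central cache's ReleaseListToSpans mentioned in section 8:

void ThreadCache::Deallocate(void* ptr, size_t bytes)
{
	assert(ptr);
	assert(bytes <= MAX_BYTES);
	size_t index = SizeClass::Index(bytes);
	_freeList[index].Push(ptr);
	//once the list holds more than one batch's worth of objects, give a batch back
	if (_freeList[index].Size() >= _freeList[index].MaxSize())
	{
		ListTooLong(_freeList[index], bytes);
	}
}

void ThreadCache::ListTooLong(FreeList& list, size_t size)
{
	void* start = nullptr;
	void* end = nullptr;
	//pop exactly one batch (MaxSize objects) off the front of the free list
	list.PopRange(start, end, list.MaxSize());
	//return the batch to the spans it came from
	CentralCache::GetInStance()->ReleaseListToSpans(start, size);
}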

2. Central cache recycling mechanism

2.1 The central cache recycles the memory objects returned by the thread cache

        The blocks returned to the central cache do not have consecutive addresses; they may belong to different spans of the same central cache bucket. How do we hang each block back under the right span? Keep an unordered_map<PAGE_ID, Span*> that maps the page ID a block lives in to its span, so the right span can be found for every returned block.

        One might ask: when the thread cache returns objects, why not just take each returned object's page ID and scan the _spanLists of the corresponding central cache bucket to find its span? The idea is simple, but the time complexity is O(N^2), which is unacceptable. Adding the unordered_map gives O(1) lookup per object, and walking the returned objects is O(N), so the overall cost drops to O(N).

2.2 The mechanism by which the central cache returns memory pages to the page cache

        When a span's _useCount in the central cache (the number of blocks lent to thread caches) reaches 0, every block lent out from that span has come back, and the page cache's ReleaseSpanToPageCache function is called to reclaim the pages.
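The code screenshot that accompanied the original post is not reproduced here; below is a hedged sketch of the logic from 2.1 and 2.2, using the unordered_map<PAGE_ID, Span*> _idSpanMap kept by the page cache (the MapObjectToSpan helper name and the PageCache GetInstance/_pageMtx names are assumptions):

//find the span an object belongs to: its page id is simply the address >> PAGE_SHIFT
Span* PageCache::MapObjectToSpan(void* obj)
{
	PAGE_ID id = (PAGE_ID)obj >> PAGE_SHIFT;
	std::unique_lock<std::mutex> lock(_pageMtx);//the map may be modified concurrently
	auto ret = _idSpanMap.find(id);
	assert(ret != _idSpanMap.end());
	return ret->second;
}

//2.1: take back the objects a thread cache returns, one by one
//2.2: when a span's _useCount drops to 0, hand the whole span to the page cache
void CentralCache::ReleaseListToSpans(void* start, size_t size)
{
	size_t index = SizeClass::Index(size);
	_spanLists[index]._mtx.lock();
	while (start)
	{
		void* next = NextObj(start);
		Span* span = PageCache::GetInstance()->MapObjectToSpan(start);
		//head-insert the object back into its own span's free list
		NextObj(start) = span->_freeList;
		span->_freeList = start;
		span->_useCount--;
		if (span->_useCount == 0)
		{
			//every object lent out from this span has come back: detach it from the bucket
			_spanLists[index].Erase(span);
			span->_freeList = nullptr;
			//drop the bucket lock while the page cache merges pages
			_spanLists[index]._mtx.unlock();
			PageCache::GetInstance()->_pageMtx.lock();
			PageCache::GetInstance()->ReleaseSpanToPageCache(span);
			PageCache::GetInstance()->_pageMtx.unlock();
			_spanLists[index]._mtx.lock();
		}
		start = next;
	}
	_spanLists[index]._mtx.unlock();
}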

3. Page cache recycling mechanism

3.1 The page cache attempts to merge the preceding and following pages to reduce external fragmentation.

        The large blocks that the page cache originally handed to the central cache are cut into small pieces and borrowed by thread caches, so by the time they come back their addresses are scattered. When the page cache receives these smaller spans again, it needs to try to merge the pages in front of and behind them to reduce external fragmentation.

        Only spans that are sitting in the page cache may take part in merging. How do we tell that a span is currently in the page cache? The Span class has a _useCount member, but _useCount == 0 does not mean the span is in the page cache: the central cache may have just obtained a large span from the page cache and not yet handed any of it out, in which case _useCount is also 0, and merging it at that moment would cause problems. The correct solution is to add a member bool _isUse = false to the Span class: it is set to true when the span is given to the central cache, so whenever it is false the span is in the page cache and may be merged.

        For example, suppose a span has page ID 2000 and currently holds 6 consecutive pages. We then look in both directions for adjacent pages: the page just in front of it and the page just behind it.

        Merging starts from the span being returned and probes outward in both directions, checking whether the neighbouring page IDs belong to spans that can be reclaimed. The process repeats until the neighbouring span on a side cannot be merged, the merged span would exceed the 128-page maximum, or the neighbouring page ID has no entry in the hash map at all (that page was never handed to the memory pool by the operating system, so naturally there is no mapping for it).

/**
  * @brief  The page cache takes back a span from the central cache and merges adjacent spans
  * @param  span: address of the span
  * @retval none
  */
void PageCache::ReleaseSpanToPageCache(Span* span)
{
	//try to merge the pages in front of and behind this span
	//merge forward
	while (1)
	{
		PAGE_ID prevId = span->_pageId - 1;//first check whether the previous span can be reclaimed
		auto ret = _idSpanMap.find(prevId);
		if (ret == _idSpanMap.end())//no mapping: the page in front does not belong to the pool, stop merging
		{
			break;
		}
		Span* prevSpan = ret->second;
		if (prevSpan->_isUse == true)//the span is in use by the central cache and cannot be reclaimed
		{
			break;
		}
		if (prevSpan->_n + span->_n > NPAGES-1)//the merged span would exceed 128 pages
		{
			break;
		}
		//merge
		span->_pageId = prevSpan->_pageId;
		span->_n += prevSpan->_n;
		_spanLists[prevSpan->_n].Erase(prevSpan);
		delete prevSpan;
	}
	//merge backward
	while (1)
	{
		PAGE_ID nextId = span->_pageId +span->_n;//first check whether the next span can be reclaimed
		auto ret = _idSpanMap.find(nextId);
		if (ret == _idSpanMap.end())//no mapping: the page behind does not belong to the pool, stop merging
		{
			break;
		}
		Span* nextSpan = ret->second;
		if (nextSpan->_isUse == true)//the span is in use by the central cache and cannot be reclaimed
		{
			break;
		}
		if (nextSpan->_n + span->_n > NPAGES - 1)//the merged span would exceed 128 pages
		{
			break;
		}
		//merge
		span->_n += nextSpan->_n;
		_spanLists[nextSpan->_n].Erase(nextSpan);
		delete nextSpan;
	} 
	//finally, head-insert the merged span into the page cache bucket for its page count
	_spanLists[span->_n].PushFront(span);
	span->_isUse = false;//mark the span as back in the page cache so it can be merged later
	_idSpanMap[span->_pageId] = span;//map the first page of the span
	_idSpanMap[span->_pageId+span->_n-1] = span;//map the last page of the span
}

5. Cache application problem larger than 256KB

        If a thread applies for a cache smaller than 256K at a time, it must follow the normal process, that is, request memory from the three-layer cache;

        If a thread's single request for an object is greater than or equal to 32 pages and less than or equal to 128 pages, that is, when 256KB <= application size <= 1024KB, it can directly request memory from the page cache;

        If a thread's single request is larger than 128 pages, i.e. larger than 1024KB, the memory is requested directly from the heap.
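A hedged sketch of how ConcurrentAlloc might branch for these large requests, assuming PAGE_SHIFT = 13, that SizeClass::RoundUp is extended to align sizes above 256KB to a whole page, and that NewSpan falls through to SystemAlloc when more than 128 pages are asked for:

static void* ConcurrentAlloc(size_t size)
{
	if (size > MAX_BYTES)//more than 256KB: bypass the thread cache and the central cache
	{
		//round up to whole pages and ask the page cache directly;
		//for more than 128 pages, NewSpan is assumed to go straight to the heap (SystemAlloc)
		size_t alignSize = SizeClass::RoundUp(size);
		size_t k = alignSize >> PAGE_SHIFT;
		PageCache::GetInstance()->_pageMtx.lock();
		Span* span = PageCache::GetInstance()->NewSpan(k);
		PageCache::GetInstance()->_pageMtx.unlock();
		//convert the span's page id back to a usable address
		return (void*)(span->_pageId << PAGE_SHIFT);
	}
	//normal path: go through this thread's thread cache
	if (nullptr == pTLSThreadCache)
	{
		pTLSThreadCache = new ThreadCache;
	}
	return pTLSThreadCache->Allocate(size);
}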

6. Use a fixed-length memory pool to separate the project from new/delete

        pageCache.cpp uses new and delete in many places, and section 2 of this article already designed a fixed-length memory pool. In this project the objects being newed and deleted are all Span objects, which is exactly the fixed-length case, and section 2 showed that the fixed-length pool allocates much faster than new, so replacing new/delete with the pool's allocation and release interfaces is a natural fit.

        In addition, STL's unordered_map is used to store the mapping, and it calls new internally; later, the radix tree is used to move away from unordered_map as well.
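A minimal sketch of the substitution, assuming the page cache owns an ObjectPool<Span> member (the _spanPool name and the two helper functions are assumptions used only to illustrate the replacement):

//a fixed-length pool dedicated to Span objects, owned by the page cache
static ObjectPool<Span> _spanPool;

static Span* AcquireSpan()
{
	return _spanPool.New();//replaces: new Span
}

static void ReturnSpan(Span* s)
{
	_spanPool.Delete(s);//replaces: delete s
}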

7. Analysis of performance bottlenecks of high-concurrency memory pools

        At this point the memory pool is still slower than the library's malloc and free at allocating and releasing memory in a multi-threaded environment. Profiling with Visual Studio's built-in performance profiler shows that most of the cost comes from locking, unlocking and lookups around unordered_map<PAGE_ID, Span*> _idSpanMap.

Therefore, radix tree is used to replace unordered_map to optimize performance.

8. Use radix tree instead of unordered_map to improve memory pool performance

1. Two different radix trees

Use one-level and two-level radix trees to optimize performance:

        The one-level tree uses direct addressing and maps every page ID to a Span* one-to-one. In a 32-bit environment the space needed is 2^19 × 4 bytes = 2MB, which is fine; in a 64-bit environment it would be 2^51 × 8 bytes, on the order of petabytes, which is clearly unusable.

        The two-level tree occupies the same total space as the one-level tree, so it is also unsuitable for 64-bit machines. For an incoming page ID, only the low 19 bits are meaningful (the higher bits are all 0): the top 5 of those 19 bits select the slot in the first layer and the low 14 bits select the slot in the second layer, so every ID has a unique mapping position. Either the one-level or the two-level radix tree can be chosen for this project.
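A hedged sketch of that two-level radix tree, modeled on tcmalloc's TCMalloc_PageMap2 and specialized for the 32-bit, 8KB-page case (BITS = 19, split into 5 root bits and 14 leaf bits; the exact class shape is an assumption):

#include <cstring>
#include <cassert>

//two-level radix tree mapping PAGE_ID -> void* (here a Span*)
template <int BITS>
class TCMalloc_PageMap2
{
private:
	static const int ROOT_BITS = 5;
	static const int ROOT_LENGTH = 1 << ROOT_BITS;//32 first-layer slots
	static const int LEAF_BITS = BITS - ROOT_BITS;//14
	static const int LEAF_LENGTH = 1 << LEAF_BITS;//16384 second-layer slots per leaf
	struct Leaf
	{
		void* values[LEAF_LENGTH];
	};
	Leaf* root_[ROOT_LENGTH];//first layer
public:
	TCMalloc_PageMap2()
	{
		std::memset(root_, 0, sizeof(root_));
	}
	void* get(PAGE_ID k) const
	{
		PAGE_ID i1 = k >> LEAF_BITS;        //upper 5 bits pick the first-layer slot
		PAGE_ID i2 = k & (LEAF_LENGTH - 1); //lower 14 bits pick the second-layer slot
		if ((k >> BITS) > 0 || root_[i1] == nullptr)
			return nullptr;
		return root_[i1]->values[i2];
	}
	void set(PAGE_ID k, void* v)
	{
		PAGE_ID i1 = k >> LEAF_BITS;
		PAGE_ID i2 = k & (LEAF_LENGTH - 1);
		assert(i1 < (PAGE_ID)ROOT_LENGTH);
		if (root_[i1] == nullptr)
		{
			root_[i1] = new Leaf;//a leaf is created on demand and never moves afterwards
			std::memset(root_[i1], 0, sizeof(Leaf));
		}
		root_[i1]->values[i2] = v;
	}
};

In the project this structure would stand in for _idSpanMap: set(span->_pageId, span) when establishing a mapping, and get((PAGE_ID)obj >> PAGE_SHIFT) when looking an object up.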

2. Why does using a data structure like a radix tree not require locking?

        Let’s first talk about why unordered_map needs to be locked! The unordered_map of this project is used to establish the mapping relationship between PAGE_ID and Span*, and the hash is used to find and add the mapping relationship.

Lookups are used when:

1. When releasing the object, you need to find the mapping relationship. (ConcurrentFree)

2. The objects returned by a thread cache may belong to different spans of the same central cache bucket; each returned object needs a lookup to determine which span it should go back to. (ReleaseListToSpans)

Mappings are added when: (memory handed out of the page cache gets mapped)

1. Every span that the layer below obtains from the page cache must have its pages registered in the unordered_map one by one. (NewSpan)

2. The page cache recycles the centrally cached span and merges adjacent spans to create a mapping (ReleaseSpanToPageCache)

        Although unordered_map lookup is O(1), lookups must be protected by a lock to avoid interference from concurrent insertions: a rehash, or a new node added to a bucket's chain after a collision, can corrupt a concurrent lookup. The radix tree, by contrast, is a one-to-one mapping: every page ID has its own dedicated slot, and at any given moment only one thread is reading or writing a particular slot, so lookups and insertions for different objects do not interfere with each other and no additional lock is needed when looking up.


Origin blog.csdn.net/gfdxx/article/details/131116337