[C++] STL - use a hash table to encapsulate unordered_map and unordered_set

1. Hash table source code

From our earlier study of unordered_map and unordered_set, we know that both are built on a hash table. In this post we use the open-hashing (hash bucket) table implemented in the previous blog to simulate unordered_map and unordered_set. Source code of that hash table:

Hash/Hash/HashBucket.h wei/cplusplus - Code Cloud - Open Source China (gitee.com)

Next, we modify the hash table so that it can encapsulate unordered_map and unordered_set well.


2. Control of hash table template parameters

unordered_map is a KV model and unordered_set is a K model, but the hash table (bucket) implemented earlier stores pair key-value pairs, that is, the KV model only. That does not fit the K model of unordered_set, so to make the table generic we need to adjust its template parameters.

Change the template parameters of the hash node:

  • To distinguish it from the template parameters of the original hash table, the type stored in a node is named T. If the upper container is a K model, T is K; if it is a KV model, T is a pair key-value pair.

template<class T>
struct HashNode
{
	T _data;
	HashNode<T>* _next;
	// Constructor
	HashNode(const T& data)
		:_data(data)
		, _next(nullptr)
	{}
};

Change the template parameters for the hash table:

  • Here we set the second template parameter as T, which is convenient for identifying the data type passed in later
// unordered_map -> HashBucket<K, pair<const K, V>> _hb;
// unordered_set -> HashBucket<K, K> _hb;
template<class K, class T, class Hash = HashFunc<K>>
class HashBucket;

Parameter control of unordered_set:

  • If the upper layer uses the unordered_set container, then the parameters of the hash table correspond to K, K type.
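template<class K, class Hash = HashFunc<K>>
class unordered_set
{
	// SetKeyOfT is the key-extraction functor introduced in the next section
private:
	HashBucket_realize::HashBucket<K, K, Hash, SetKeyOfT> _hb;
};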

Parameter control of unordered_map:

  • If the upper layer uses the unordered_map container, then the parameters of the hash table correspond to the K, pair<const K, V> types.
template<class K, class V, class Hash = HashFunc<K>>
class unordered_map
{
	// MapKeyOfT is the key-extraction functor introduced in the next section
private:
	HashBucket_realize::HashBucket<K, pair<const K, V>, Hash, MapKeyOfT> _hb;
};

3. Construct a functor for the upper container to facilitate subsequent mapping

During hash mapping we need the key of an element so that the hash function can compute its bucket. After the previous step the data stored in a hash node has type T, which may be a bare key or a key-value pair. The underlying hash table cannot tell which, so each upper container supplies a functor that tells the table how to get the key out of a T.

Functor for unordered_set:

  • For K-type containers such as unordered_set, we can directly return the key.
template<class K, class Hash = HashFunc<K>>
class unordered_set
{
	// Functor: for a set, the stored data is the key itself
	struct SetKeyOfT
	{
		const K& operator()(const K& key)
		{
			return key;
		}
	};
private:
	HashBucket_realize::HashBucket<K, K, Hash, SetKeyOfT> _hb;
};

Note:

Although the T that unordered_set passes to the hash table is already the key, the underlying hash table does not know which container sits above it, so unordered_set must provide a functor as well.

Functor for unordered_map:

  • The data type of unordered_map is a pair key-value pair, we only need to take out the first data key in the key-value pair and return it.
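template<class K, class V, class Hash = HashFunc<K>>
class unordered_map
{
	// Functor: for a map, the key is the first member of the pair.
	// The parameter type must match the stored type pair<const K, V> exactly;
	// taking pair<K, V> would bind to a converted temporary and return a dangling reference.
	struct MapKeyOfT
	{
		const K& operator()(const pair<const K, V>& kv)
		{
			return kv.first;
		}
	};
private:
	HashBucket_realize::HashBucket<K, pair<const K, V>, Hash, MapKeyOfT> _hb;
};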

Note:

Because a hash node stores a T that may be either a key or a key-value pair, the underlying hash table cannot know how to find the key inside it. The upper container therefore provides a functor that extracts the key from a T.
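To make the role of the functor concrete, here is a minimal sketch of generic code that only knows T and KeyOfT, in the same spirit as the hash table. The functors are fixed to string keys for brevity, and same_key and demo_key_of_t are hypothetical names used only for this illustration; they are not part of the hash table source.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Stand-ins for the functors defined inside the two containers.
struct SetKeyOfT {
    const std::string& operator()(const std::string& key) const { return key; }
};
struct MapKeyOfT {
    const std::string& operator()(const std::pair<const std::string, int>& kv) const {
        return kv.first;
    }
};

// Generic code in the style of the hash table: it never looks inside T itself,
// it always asks KeyOfT for the key.
template <class T, class KeyOfT>
bool same_key(const T& a, const T& b) {
    KeyOfT kot;
    return kot(a) == kot(b);
}

// Small self-check (hypothetical helper, call it from main).
inline void demo_key_of_t() {
    typedef std::pair<const std::string, int> KV;
    std::string s1 = "pear", s2 = "pear";
    KV kv1("pear", 1), kv2("pear", 2), kv3("plum", 1);
    assert((same_key<std::string, SetKeyOfT>(s1, s2)));
    assert((same_key<KV, MapKeyOfT>(kv1, kv2)));   // values differ, keys match
    assert(!(same_key<KV, MapKeyOfT>(kv1, kv3)));  // keys differ
}
```

The same same_key template works for both data layouts, which is exactly what lets one hash table serve both containers.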

Change the template parameter of the underlying hash:

  • With the functors in place, the template parameters of the underlying hash table are extended to receive them from the upper container.
template<class K, class T, class Hash, class KeyOfT>
class HashBucket
  • The first parameter K: the key type. Lookup searches by key, so K is required.

  • The second parameter T: the data type stored in a hash table node, such as int, double, pair, or string.

  • The third parameter Hash: the hash function used to convert a key to an integer.

  • The fourth parameter KeyOfT: the functor that extracts the key from a T (the node data type).


4. Strings cannot be taken modulo directly

  • A string cannot be used as an operand of %, which is the most common problem when hashing.

After the above analysis, we add a template parameter to the hash table. At this time, no matter whether the upper container is unordered_set or unordered_map, we can obtain the key value of the element through the functor provided by the upper container.

In everyday code, however, strings are very common keys. For example, when we use unordered_map to count how many times each fruit appears, the fruit names are the keys.

A string is not an integer, so it cannot be used directly to compute a hash address. We first have to convert the string to an integer and then feed that integer to the hash function.

Unfortunately, no conversion from strings to integers can be one-to-one: an integer type has a limited range (a 32-bit unsigned integer can hold at most 4294967295), while the number of possible strings is unbounded.

So whatever method we use to convert strings to integers, hash collisions will occur; only the probability of a collision differs.

Among the string hash functions that have been tried in practice, BKDRHash performs well and is very simple to code. It is named after Brian Kernighan and Dennis Ritchie, whose book "The C Programming Language" presents it. It is a simple and fast multiply-and-add hash; Java's String.hashCode uses the same scheme with a multiplier of 31.

Therefore we add one more functor to the template parameters of the hash table, which converts a key to an integer.

template<class K, class T, class Hash, class KeyOfT>
class HashBucket

If the user does not pass a functor, the containers fall back to the default HashFunc<K>, which simply converts the key to size_t. Because string keys are so common, we also write a specialization of the class template for string; when the key is a string, the functor returns an integer computed with the BKDRHash algorithm.

template<class K>
struct HashFunc
{
	size_t operator()(const K& key) // return the key itself, converted to size_t
	{
		return (size_t)key;
	}
};
// Specialization for string
template<>
struct HashFunc<string>
{
	size_t operator()(const string& s) // BKDRHash algorithm
	{
		size_t value = 0;
		for (auto ch : s)
		{
			value = value * 131 + ch;
		}
		return value;
	}
};
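As a quick worked example of the multiply-and-add scheme, the sketch below hashes a few short strings and maps one of them to a bucket. bkdr_hash, bucket_index, and demo_bkdr are hypothetical free-function names used only here, and the loop reads bytes as unsigned char (an assumption that differs from the functor above only for non-ASCII bytes).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Same multiply-and-add scheme as the string specialization, with multiplier 131.
inline std::size_t bkdr_hash(const std::string& s) {
    std::size_t value = 0;
    for (unsigned char ch : s) {   // unsigned char avoids sign extension for bytes >= 0x80
        value = value * 131 + ch;
    }
    return value;
}

// Division-remainder mapping from a key to a bucket index.
inline std::size_t bucket_index(const std::string& key, std::size_t bucket_count) {
    return bkdr_hash(key) % bucket_count;
}

// Small self-check (hypothetical helper, call it from main).
inline void demo_bkdr() {
    assert(bkdr_hash("") == 0);
    assert(bkdr_hash("a") == 97);                 // 'a' is 97 in ASCII
    assert(bkdr_hash("ab") == 97 * 131 + 98);     // 12805
    assert(bkdr_hash("ab") != bkdr_hash("ba"));   // character order matters
    assert(bucket_index("ab", 53) == 12805 % 53);
}
```

Because each step multiplies the running value before adding the next character, strings made of the same characters in a different order get different hash values, which a plain sum of characters would not achieve.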

5. Default member functions of the hash table

1. Constructor

The hash table has two member variables. When we instantiate an object:

  • _tables is initialized by the default constructor of vector.
  • _n is set to 0 by the default member initializer we gave it.
vector<Node*> _tables; // the hash table (array of bucket head pointers)
size_t _n = 0; // number of valid elements in the hash table

We write a constructor that sizes the table with the first entry of the prime number table.

// Constructor
//HashBucket() = default; // explicitly ask the compiler to generate the default constructor
HashBucket()
	:_n(0)
{
	//_tables.resize(10);
	_tables.resize(__stl_next_prime(0));
}

Note:

If we did not need to pre-size the table, the compiler-generated default constructor would be enough. However, we write a copy constructor later, and once any constructor is declared the compiler no longer generates the default constructor. In that case we would have to use the default keyword to explicitly request it.
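The effect of the default keyword can be seen in a small sketch. Table and demo_default_ctor are hypothetical names for this illustration only; the struct is a stand-in, not the hash table itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Table {
    // Declaring the copy constructor below suppresses the implicit default
    // constructor, so it is requested explicitly. Removing this line makes
    // `Table t;` fail to compile.
    Table() = default;
    Table(const Table& other) : buckets(other.buckets), n(other.n) {}

    std::vector<int> buckets;
    std::size_t n = 0;   // default member initializer, used by Table()
};

// Small self-check (hypothetical helper, call it from main).
inline void demo_default_ctor() {
    Table t;      // uses the defaulted constructor
    Table u(t);   // uses the user-written copy constructor
    assert(t.n == 0 && u.buckets.empty());
}
```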


2. Copy constructor

Copying a hash table requires a deep copy; otherwise the copy and the original would share the same nodes.

The copy constructor works as follows:

  1. Resize the hash table to the size of hb._tables.
  2. Copy the nodes in each bucket of hb._tables into the new hash table one by one.
  3. Update the number of valid elements in the hash table.
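// Copy constructor
HashBucket(const HashBucket& hb)
{
	// 1. Resize the hash table to the size of hb._tables
	_tables.resize(hb._tables.size());
	// 2. Copy the nodes of every bucket of hb._tables into this table (deep copy)
	for (size_t i = 0; i < hb._tables.size(); i++)
	{
		if (hb._tables[i]) // the bucket is not empty
		{
			Node* cur = hb._tables[i];
			while (cur) // until every node of this bucket has been copied
			{
				Node* copy = new Node(cur->_data); // create the copied node
				// push the copied node at the head of the current bucket
				copy->_next = _tables[i];
				_tables[i] = copy;
				cur = cur->_next; // move on to the next node to copy
			}
		}
	}
	// 3. Update the number of valid elements
	_n = hb._n;
}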

3. Assignment operator overloading function

The assignment operator takes its parameter by value, so the copy constructor runs when the argument is passed. We then swap the two member variables of that copy with those of the current hash table. When the function returns, the copy goes out of scope and is destroyed automatically, which releases the data the current hash table held before the assignment.

// Assignment operator
HashBucket& operator=(HashBucket hb)
{
	// swap the two member variables with those of the by-value copy
	_tables.swap(hb._tables);
	swap(_n, hb._n);

	return *this; // supports chained assignment
}
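The same copy-and-swap idiom can be tried on a much smaller class. The sketch below uses a hypothetical IntStack that owns a singly linked list, just like one bucket of the hash table; all names in it are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

class IntStack {
    struct Node { int value; Node* next; };
    Node* head_ = nullptr;
    std::size_t size_ = 0;
public:
    IntStack() = default;
    IntStack(const IntStack& other) {            // deep copy, order preserved
        Node** tail = &head_;
        for (Node* cur = other.head_; cur; cur = cur->next) {
            *tail = new Node{cur->value, nullptr};
            tail = &(*tail)->next;
        }
        size_ = other.size_;
    }
    IntStack& operator=(IntStack other) {        // by value: the copy is made here
        std::swap(head_, other.head_);
        std::swap(size_, other.size_);
        return *this;                            // other's destructor frees the old nodes
    }
    ~IntStack() {
        while (head_) { Node* next = head_->next; delete head_; head_ = next; }
    }
    void push(int v) { head_ = new Node{v, head_}; ++size_; }
    int top() const { return head_->value; }
    std::size_t size() const { return size_; }
};

// Small self-check (hypothetical helper, call it from main).
inline void demo_copy_and_swap() {
    IntStack a;
    a.push(1);
    a.push(2);
    IntStack b;
    b.push(9);
    b = a;                 // old node of b is released by the temporary
    a.push(3);             // does not affect b: the copy was deep
    assert(b.size() == 2 && b.top() == 2);
}
```

A side benefit of taking the parameter by value is that self-assignment needs no special case, because the swap always happens with a separate copy.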

4. Destructor

Because every node in the hash table is allocated with new, the nodes must be released when the hash table is destroyed. In the destructor we walk the buckets one by one and delete every node in each non-empty bucket.

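// Destructor
~HashBucket()
{
	// release the nodes of the hash table one by one
	for (size_t i = 0; i < _tables.size(); i++)
	{
		Node* cur = _tables[i];
		while (cur) // until every node of this bucket has been released
		{
			Node* next = cur->_next; // remember the next node
			delete cur; // release the node
			cur = next;
		}
		_tables[i] = nullptr; // mark the bucket as empty
	}
}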

6. Implementation of the underlying iterator of the hash table

1. The basic framework of the iterator

The forward iterator of the hash table is essentially a wrapper around a hash node pointer. However, operator++ has to find the next non-empty bucket in the hash table, so besides the node pointer the iterator also stores the address of the hash table. A constructor initializes _node and _hb.

Notice that the iterator uses the hash table and the hash table uses the iterator, that is, the two classes refer to each other.

  • If the iterator is defined before the hash table, the compiler reports the hash table as undefined (the compiler only looks for identifiers earlier in the file).

  • If the hash table is defined before the iterator, the compiler reports the iterator as undefined.

Here the iterator is placed above the hash table (HashBucket) and uses HashBucket internally. Since the compiler only looks upward, HashBucket is not yet known at that point, so we forward-declare the HashBucket class before the iterator.

// Forward declaration of the hash table
template<class K, class T, class Hash, class KeyOfT>
class HashBucket;
// Forward iterator
template<class K, class T, class Hash, class KeyOfT>
struct __HTIterator
{
	typedef HashNode<T> Node; // hash node type
	typedef __HTIterator<K, T, Hash, KeyOfT> Self; // type of this forward iterator
	typedef HashBucket<K, T, Hash, KeyOfT> HB; // hash table type

	Node* _node; // node pointer
	HB* _hb; // address of the hash table
	// Constructor
	__HTIterator(Node* node, HB* hb);
	// operator*
	T& operator*();
	// operator->
	T* operator->();
	// operator==
	bool operator==(const Self& s) const;
	// operator!=
	bool operator!=(const Self& s) const;
	// Prefix ++
	Self& operator++();
	// Postfix ++ (returns the old value by copy, not by reference)
	Self operator++(int);
};

2. ++ operator overloading

Note that this is a hash bucket structure in which every bucket is a singly linked list. Within one bucket, ++it simply follows the list from the head node. Because there are many lists, however, the iterator must move on to the next bucket once the current one is exhausted, so operator++ follows these rules:

  • If the current node is not the last node in the current hash bucket, go to the next node of the current hash bucket after ++.
  • If the current node is the last node of the current hash bucket, then ++ goes to the first node of the next non-empty hash bucket.
// Prefix ++
Self& operator++()
{
	if (_node->_next)
	{
		_node = _node->_next;
	}
	else // the current bucket is exhausted, move to the next non-empty bucket
	{
		KeyOfT kot; // extracts the key
		Hash hash; // converts the key to an integer
		size_t hashi = hash(kot(_node->_data)) % _hb->_tables.size();
		++hashi;
		while (hashi < _hb->_tables.size())
		{
			if (_hb->_tables[hashi]) // found a non-empty bucket, point at its head node
			{
				_node = _hb->_tables[hashi];
				break;
			}
			else
			{
				hashi++;
			}
		}
		// no non-empty bucket was found: use nullptr as the end marker
		if (hashi == _hb->_tables.size())
		{
			_node = nullptr;
		}
	}
	return *this;
}

// Postfix ++ (returns by value: returning a reference to the local tmp would dangle)
Self operator++(int)
{
	Self tmp = *this;
	operator++();
	return tmp;
}

When looking for the next node, operator++ of the forward iterator accesses the member variable _tables of the hash table. _tables is a private member, so the forward iterator class must be declared a friend of the hash table class.

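template<class K, class T, class Hash, class KeyOfT>
class HashBucket
{
	// Make the iterator a friend of HashBucket.
	// The friend template uses its own parameter names so that they do not
	// shadow the parameters of the enclosing class template.
	template<class K1, class T1, class Hash1, class KeyOfT1>
	friend struct __HTIterator;

	typedef HashNode<T> Node; // hash node type
public:
	// ...
};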

Note: The iterator of the hash table is a forward (one-way) iterator, so there is no -- operator overload.


3. == and != operator overloading

To compare whether the two iterators are equal, you only need to judge whether the nodes encapsulated by the two iterators are the same.

// operator!=
bool operator!=(const Self& s) const
{
	return _node != s._node;
}
// operator==
bool operator==(const Self& s) const
{
	return _node == s._node;
}

4. * and -> operator overloading

  • The * operator returns a reference to the hash node data
  • The -> operator returns the address of the hash node data
// operator*
T& operator*()
{
	return _node->_data; // reference to the data in the hash node
}
// operator->
T* operator->()
{
	return &(_node->_data); // address of the data in the hash node
}

7. begin and end of the hash table

Here we typedef the forward iterator type. To let code outside the class use the name iterator, the typedef must be in the public section. After that, begin() and end() can be implemented.

template<class K, class T, class Hash, class KeyOfT>
class HashBucket
{
	// Make the iterator a friend of HashBucket
	template<class K1, class T1, class Hash1, class KeyOfT1>
	friend struct __HTIterator;

	typedef HashNode<T> Node; // hash node type
public:
	typedef __HTIterator<K, T, Hash, KeyOfT> iterator; // forward iterator type
};

begin():

  1. Traverse the hash table and return an iterator to the first node of the first non-empty bucket. The iterator is built with the forward iterator's constructor, passing the node and the this pointer (the address of the hash table).
  2. If no such node is found by the end of the traversal, every bucket is empty, so return end() directly.
// begin()
iterator begin()
{
	for (size_t i = 0; i < _tables.size(); i++)
	{
		Node* cur = _tables[i];
		// the head node of the first non-empty bucket
		if (cur)
		{
			return iterator(cur, this);
		}
	}
	return end();
}

end():

  • end() of the hash table returns an iterator constructed with a null node pointer and this as the hash table pointer.
// end()
iterator end()
{
	return iterator(nullptr, this);
}
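Taken together, begin(), operator++ and end() visit the buckets from index 0 upward and, inside each bucket, follow the list from its head. The sketch below reproduces that visiting order on a bare vector of list heads; Node, first_non_empty, visit_all, and demo_traversal are hypothetical names for this illustration and are not the iterator itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Node { int data; Node* next; };

// Index of the first non-empty bucket at or after `start`,
// or buckets.size() if there is none (the end() case).
inline std::size_t first_non_empty(const std::vector<Node*>& buckets, std::size_t start) {
    while (start < buckets.size() && buckets[start] == nullptr) ++start;
    return start;
}

// Collects the data in the order an iterator would visit it.
inline std::vector<int> visit_all(const std::vector<Node*>& buckets) {
    std::vector<int> out;
    for (std::size_t i = first_non_empty(buckets, 0); i < buckets.size();
         i = first_non_empty(buckets, i + 1)) {
        for (Node* cur = buckets[i]; cur; cur = cur->next) out.push_back(cur->data);
    }
    return out;
}

// Small self-check (hypothetical helper, call it from main).
inline void demo_traversal() {
    Node c{7, nullptr};
    Node b{5, &c};            // bucket 3 holds 5 -> 7
    Node a{3, nullptr};       // bucket 1 holds 3
    std::vector<Node*> buckets(5, nullptr);
    buckets[1] = &a;
    buckets[3] = &b;
    std::vector<int> order = visit_all(buckets);
    assert(order.size() == 3 && order[0] == 3 && order[1] == 5 && order[2] == 7);
    assert(first_non_empty(buckets, 4) == buckets.size());  // nothing left: end
}
```

The visiting order depends on bucket indices, not on insertion order, which is why the containers are called unordered.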

8. Optimization of hash table (prime number table)

With the division-remainder method it is best to take the remainder modulo a prime number, which makes hash collisions less likely, so we use a table of primes to choose the bucket count.

inline unsigned long __stl_next_prime(unsigned long n)
{
	// prime sequence
	static const int __stl_num_primes = 28;
	static const unsigned long __stl_prime_list[__stl_num_primes] =
	{
		53ul, 97ul, 193ul, 389ul, 769ul,
		1543ul, 3079ul, 6151ul, 12289ul, 24593ul,
		49157ul, 98317ul, 196613ul, 393241ul, 786433ul,
		1572869ul, 3145739ul, 6291469ul, 12582917ul, 25165843ul,
		50331653ul, 100663319ul, 201326611ul, 402653189ul, 805306457ul,
		1610612741ul, 3221225473ul, 4294967291ul
	};

	// return the first prime in the table that is greater than n
	for (int i = 0; i < __stl_num_primes; ++i)
	{
		if (__stl_prime_list[i] > n)
		{
			return __stl_prime_list[i];
		}
	}

	// n is at least as large as every entry: return the largest prime
	return __stl_prime_list[__stl_num_primes - 1];
}
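A shortened sketch shows how such a table is typically used when the hash table grows. next_prime, grown_bucket_count, and demo_prime_growth are hypothetical names, the list holds only the first six primes of the table above, and expanding when the element count reaches the bucket count (load factor 1) is an assumption about the Insert function, which is not shown in this post.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>

// First listed prime greater than n, or the largest listed prime if none is.
inline unsigned long next_prime(unsigned long n) {
    static const unsigned long primes[] = {53ul, 97ul, 193ul, 389ul, 769ul, 1543ul};
    const unsigned long* last = std::end(primes);
    const unsigned long* pos = std::upper_bound(std::begin(primes), last, n);
    return pos == last ? *(last - 1) : *pos;
}

// Assumed growth policy: expand to the next prime once the load factor reaches 1.
inline std::size_t grown_bucket_count(std::size_t element_count, std::size_t bucket_count) {
    return element_count >= bucket_count
        ? static_cast<std::size_t>(next_prime(bucket_count))
        : bucket_count;
}

// Small self-check (hypothetical helper, call it from main).
inline void demo_prime_growth() {
    assert(next_prime(0) == 53);       // initial size used by the constructor
    assert(next_prime(53) == 97);      // strictly greater than n
    assert(next_prime(5000) == 1543);  // past the end of this short list
    assert(grown_bucket_count(53, 53) == 97);
    assert(grown_bucket_count(10, 53) == 53);
}
```

std::upper_bound returns the first element greater than n in a sorted range, so it gives the same result as the linear scan in __stl_next_prime.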

9. insert operation and [] operator overloading

  • unordered_map is a KV model and inserts a pair key-value pair, while unordered_set inserts a bare key, so the two insert functions take different parameters.

In the standard library, the single-element insert of both containers returns pair<iterator, bool>, so we change the return type of our insert functions to match.

unordered_set has no [] operator overload, so only unordered_map needs to provide it:

  1. Call insert to insert the key-value pair and receive the result ret, whose first member is an iterator.
  2. Return the value of the element through that iterator.

Note: The first member of the inserted key-value pair is the key passed in by the user, and the second is a value-initialized object V() of the value type.

// operator[]
V& operator[](const K& key)
{
	pair<iterator, bool> ret = insert(make_pair(key, V()));
	return ret.first->second;
}
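The same two steps can be tried against the standard container, whose insert also returns pair<iterator, bool>. subscript and demo_subscript are hypothetical names for this sketch; they only mirror what operator[] does above.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>

// Mirrors operator[]: insert (key, V()) and return a reference to the mapped value.
// If the key already exists, insert leaves the map unchanged and the returned
// iterator points at the existing element.
template <class K, class V>
V& subscript(std::unordered_map<K, V>& m, const K& key) {
    std::pair<typename std::unordered_map<K, V>::iterator, bool> ret =
        m.insert(std::make_pair(key, V()));
    return ret.first->second;
}

// Small self-check (hypothetical helper, call it from main).
inline void demo_subscript() {
    std::unordered_map<std::string, int> counts;
    std::string key = "apple";
    subscript(counts, key)++;   // inserted with value 0, then incremented to 1
    subscript(counts, key)++;   // already present, incremented to 2
    assert(counts.size() == 1);
    assert(counts[key] == 2);
}
```

This is why counting with countMap[e]++ works: the first access creates the entry with a zero value, and later accesses reuse it.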

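Next, the Insert function of the hash table must return pair<iterator, bool> as well, to match what unordered_map expects. Two return statements change: when the key already exists, Insert returns an iterator to the existing element together with false; after a successful insertion, it returns an iterator to the new node together with true.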


10. Hash table (modified version) source code link

The modified hash table source code link:
HashBucket.h wei/cplusplus - Code Cloud - Open Source China (gitee.com)


11. unordered_set, unordered_map simulation implementation code

1. The code of unordered_set

#pragma once
#include "HashBucket.h"

namespace unordered_set_realize
{
	template<class K, class Hash = HashFunc<K>>
	class unordered_set
	{
		struct SetKeyOfT
		{
			const K& operator()(const K& key)
			{
				return key;
			}
		};

	public:
		typedef typename HashBucket_realize::HashBucket<K, K, Hash, SetKeyOfT>::iterator iterator;

		iterator begin()
		{
			return _hb.begin();
		}

		iterator end()
		{
			return _hb.end();
		}

		pair<iterator, bool> insert(const K& key)
		{
			return _hb.Insert(key);
		}

	private:
		HashBucket_realize::HashBucket<K, K, Hash, SetKeyOfT> _hb;
	};

	void test_unordered_set()
	{
		unordered_set<int> us;
		us.insert(13);
		us.insert(3);
		us.insert(23);
		us.insert(5);
		us.insert(5);
		us.insert(6);
		us.insert(15);
		us.insert(223342);
		us.insert(22);

		unordered_set<int>::iterator it = us.begin();
		while (it != us.end())
		{
			cout << *it << " ";
			++it;
		}
		cout << endl;

		for (auto e : us)
		{
			cout << e << " ";
		}
		cout << endl;
	}
}

2. The code of unordered_map

#pragma once
#include "HashBucket.h"

namespace unordered_map_realize
{
	template<class K, class V, class Hash = HashFunc<K>>
	class unordered_map
	{
		struct MapKeyOfT
		{
			const K& operator()(const pair<const K, V>& kv)
			{
				return kv.first;
			}
		};

	public:
		typedef typename HashBucket_realize::HashBucket<K, pair<const K, V>, Hash, MapKeyOfT>::iterator iterator;

		iterator begin()
		{
			return _hb.begin();
		}

		iterator end()
		{
			return _hb.end();
		}

		pair<iterator, bool> insert(const pair<K, V>& data)
		{
			return _hb.Insert(data);
		}

		V& operator[](const K& key)
		{
			pair<iterator, bool> ret = _hb.Insert(make_pair(key, V()));
			return ret.first->second;
		}

	private:
		HashBucket_realize::HashBucket<K, pair<const K, V>, Hash, MapKeyOfT> _hb;
	};

	void test_unordered_map()
	{
		string arr[] = { "apple", "watermelon", "banana", "strawberry", "apple", "watermelon",
			"apple", "apple", "watermelon", "apple", "banana", "apple", "banana" };

		unordered_map<string, int> countMap;
		for (auto& e : arr)
		{
			countMap[e]++;
		}

		for (const auto& kv : countMap)
		{
			cout << kv.first << ":" << kv.second << endl;
		}
	}
}

Reference blog:

1. Detailed explanation of STL (thirteen) - use a hash table to encapsulate unordered_map and unordered_set_2021 dragon at the same time

Origin blog.csdn.net/m0_64224788/article/details/130186863