Principle and Implementation of C++ Bloom Filter

concept

The Bloom filter was proposed by Burton Howard Bloom in 1970. It consists of a long binary vector (a bit array) and a series of hash functions that map elements to positions in it. A Bloom filter is used to test whether an element is in a set. Its advantage is that its space efficiency and query time are far better than those of general-purpose lookup structures; its disadvantages are a certain false-positive rate and the difficulty of deleting elements.

For example, suppose we want to find out whether a certain string exists among 4 billion records. Our first thought is to traverse the data and compare the entries one by one, but with this much data a traversal costs a lot of time, and keeping everything in memory costs a lot of space. We can solve the problem with a bitmap plus hashing, which decides whether the string exists in time that does not depend on the amount of data. This combination of bitmap and hash is what we call a Bloom filter. It does not confirm with 100% certainty that the string exists; a positive answer only means that the string may exist. The time complexity is O(k), where k is the number of hash functions. Let's take a look at the principle of the Bloom filter.
If you are not familiar with bitmaps yet, see the earlier blog post "C++ bitmap and bitmap implementation".

principle

Pass the string through k hash functions to get k numbers, then set the bit at each of those k positions in the bitmap to 1.
Why can't we confirm with 100% certainty that a string exists? A Bloom filter is built from hashes and a bitmap, and hashing inevitably produces collisions: different strings may hash to the same position. When a position holds 1, there is no way to tell which string set it, so we can only say that the string may exist. If any of the string's positions holds 0, the string definitely does not exist.
Therefore, the choice of hash functions and the number of hash functions are both very important. The number of hash functions follows from a well-known formula: k = (m / n) * ln2 (m = number of bits in the bitmap; n = number of elements; k = number of hash functions). The hash functions used here are also established ones that keep hash collisions low.
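To make the formula concrete, here is a minimal sketch that computes the hash count for a given bitmap size and estimates the resulting false-positive rate. The helper names `optimal_hash_count` and `false_positive_rate` are made up for this illustration, and the rate uses the standard approximation p ≈ (1 - e^(-k*n/m))^k, which the article itself does not state.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: number of hash functions for m bits and n elements,
// k = (m / n) * ln2, rounded to the nearest integer and never below 1.
inline std::size_t optimal_hash_count(std::size_t m, std::size_t n) {
    assert(n > 0 && "need at least one expected element");
    const double k = static_cast<double>(m) / static_cast<double>(n) * std::log(2.0);
    const long rounded = std::lround(k);
    return rounded < 1 ? 1 : static_cast<std::size_t>(rounded);
}

// Hypothetical helper: approximate false-positive probability after n
// insertions into m bits with k hash functions, p ~= (1 - e^(-k*n/m))^k.
inline double false_positive_rate(std::size_t m, std::size_t n, std::size_t k) {
    assert(m > 0 && "bitmap must not be empty");
    const double exponent =
        -static_cast<double>(k) * static_cast<double>(n) / static_cast<double>(m);
    return std::pow(1.0 - std::exp(exponent), static_cast<double>(k));
}
```

With 5 bits per element (m = 500, n = 100), `optimal_hash_count` gives 3, and the estimated false-positive rate with 3 hash functions is roughly 9%.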

implementation

Member variables and constructor: The member variables are a bitmap and its number of bits. The constructor initializes both. The number of bits the bitmap needs is m = k * n / ln2, rounded up to a whole number. With k = 3 hash functions this is about 4.33 * n, which the code rounds up to 5 * n.
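The sizing rule can be sketched as a small function. The name `bits_needed` is hypothetical and not part of the article's code; it only evaluates m = k * n / ln2 and rounds up.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: bits needed for n elements with k hash functions,
// m = k * n / ln2, rounded up to a whole number of bits.
inline std::size_t bits_needed(std::size_t k, std::size_t n) {
    assert(k > 0 && n > 0);
    const double m = static_cast<double>(k) * static_cast<double>(n) / std::log(2.0);
    return static_cast<std::size_t>(std::ceil(m));
}
```

For k = 3 and n = 100 this gives 433 bits, so allocating 5 * n = 500 bits leaves some headroom.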

Storing data: To store a string, we pass it through the k hash functions to obtain k indexes. A hash value may exceed the number of bits in the bitmap, so we take it modulo the bit count to reduce it to the range [0, m) and prevent out-of-bounds access. We then set the bit at each resulting position in the bitmap to 1. For the hash functions, we directly use well-known existing ones (see the code at the end of the article).

Finding data: Finding works the same way: compute the hash positions, then check whether the bit at each position is 1. If any one of them is 0, the string definitely does not exist and we return false. If all of them are 1, we return true, meaning the string may exist.
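The storing and finding steps can be tried out without the BitMap class from the earlier post. The following is a minimal sketch, assuming a `std::vector<bool>` as the bit storage; the names `TinyBloom`, `seeded_hash` and `kSeeds` are invented for this example, and the three multipliers are the ones used by the hash functions at the end of the article.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Multipliers taken from the article's three hash functions.
constexpr std::size_t kSeeds[] = {131, 65599, 1313131};

// Multiplicative string hash; `seed` is the multiplier.
inline std::size_t seeded_hash(const std::string& s, std::size_t seed) {
    std::size_t h = 0;
    for (unsigned char ch : s) h = h * seed + ch;
    return h;
}

// Hypothetical stand-in for the article's BloomFilter, backed by vector<bool>.
class TinyBloom {
public:
    explicit TinyBloom(std::size_t bits) : bits_(bits, false) { assert(bits > 0); }

    // Storing: reduce each hash to [0, m) and set that bit.
    void insert(const std::string& s) {
        for (std::size_t seed : kSeeds)
            bits_[seeded_hash(s, seed) % bits_.size()] = true;
    }

    // Finding: any 0 bit means "definitely absent"; all 1 means "may exist".
    bool may_contain(const std::string& s) const {
        for (std::size_t seed : kSeeds)
            if (!bits_[seeded_hash(s, seed) % bits_.size()]) return false;
        return true;
    }

private:
    std::vector<bool> bits_;
};
```

A deliberately tiny filter shows the false-positive behaviour: with only 8 bits, inserting "a" also makes `may_contain("i")` return true, because the single characters hash to 97 and 105, and both are 1 modulo 8.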

Deleting data: The Bloom filter does not support deletion, because a bit may be shared by several strings. Suppose one hash position of the string str is also a hash position of the string str2. Deleting str would mean setting the bits at all of str's hash positions to 0. After that, str2 can no longer be found, because one of its positions now holds 0, which says that str2 does not exist even though it was never deleted. This is why the Bloom filter does not support deletion.
Providing deletion is not impossible, but the price is relatively high. We can attach a counter to each bit: every insertion that hits the position increments the counter, every deletion decrements it, and the position only counts as 0 again once the counter has dropped back to 0. This is not very practical either. For example, we could widen each position from 1 bit to 2 bits, where a non-zero value is the counter and 00 means that no data is present; but with a lot of data the counter will eventually overflow, so even more bits are needed. That increases the space used, and the Bloom filter still cannot be 100% sure whether the data exists, so the cost outweighs the benefit and deletion is generally not provided.
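The counter idea can be sketched as follows. This is only an illustration under assumptions: the class name `CountingBloom` is invented, each position uses an 8-bit counter (rather than the 2 bits mentioned above) that saturates at 255 instead of overflowing, and the multipliers are the ones from the article's hash functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical counting variant: one 8-bit counter per position.
class CountingBloom {
public:
    explicit CountingBloom(std::size_t slots) : counts_(slots, 0) { assert(slots > 0); }

    void insert(const std::string& s) {
        for (std::size_t seed : kSeeds) {
            std::uint8_t& c = counts_[index(s, seed)];
            if (c < 255) ++c;  // saturate instead of wrapping around to 0
        }
    }

    // Only meaningful for strings that were actually inserted.
    void remove(const std::string& s) {
        if (!may_contain(s)) return;  // definitely absent, nothing to undo
        for (std::size_t seed : kSeeds) {
            std::uint8_t& c = counts_[index(s, seed)];
            if (c > 0 && c < 255) --c;  // a saturated counter stays stuck at 255
        }
    }

    bool may_contain(const std::string& s) const {
        for (std::size_t seed : kSeeds)
            if (counts_[index(s, seed)] == 0) return false;
        return true;
    }

private:
    static constexpr std::size_t kSeeds[3] = {131, 65599, 1313131};

    // Multiplicative string hash reduced to [0, number of counters).
    std::size_t index(const std::string& s, std::size_t seed) const {
        std::size_t h = 0;
        for (unsigned char ch : s) h = h * seed + ch;
        return h % counts_.size();
    }

    std::vector<std::uint8_t> counts_;
};
```

Here removing str leaves str2 findable even if the two share a position, because the shared counter only drops from 2 to 1. The price is visible too: every position now takes 8 bits instead of 1. Iterating over the in-class `kSeeds` array as written relies on C++17 or later.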

code:

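#include <cstddef>
#include <string>

// Folds every character of the string into a running hash value
// (multiplicative string hash, multiplier 131).
struct HashFun1
{
	std::size_t operator()(const std::string& str) const
	{
		std::size_t hash = 0;
		for (const auto& ch : str)
		{
			hash = hash * 131 + ch;
		}
		return hash;
	}
};

// Same scheme with multiplier 65599.
struct HashFun2
{
	std::size_t operator()(const std::string& str) const
	{
		std::size_t hash = 0;
		for (const auto& ch : str)
		{
			hash = hash * 65599 + ch;
		}
		return hash;
	}
};

// Same scheme with multiplier 1313131.
struct HashFun3
{
	std::size_t operator()(const std::string& str) const
	{
		std::size_t hash = 0;
		for (const auto& ch : str)
		{
			hash = hash * 1313131 + ch;
		}
		return hash;
	}
};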
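// T is the element type; each HashFunN is a functor that maps a T to a size_t.
template<class T, class HashFun1, class HashFun2, class HashFun3>
class BloomFilter
{
public:
	// num is the expected number of elements (must be greater than 0).
	// With 3 hash functions, 3 / ln2 is about 4.33 bits per element,
	// rounded up to 5.
	explicit BloomFilter(const std::size_t num)
		:_bit(5 * num)
		, _bitCount(5 * num)
	{}

	void set(const T& val)
	{
		HashFun1 h1;
		HashFun2 h2;
		HashFun3 h3;
		// Reduce each hash to [0, _bitCount) to prevent out-of-bounds access.
		// The indexes are size_t: an int would overflow for large bitmaps.
		std::size_t idx1 = h1(val) % _bitCount;
		std::size_t idx2 = h2(val) % _bitCount;
		std::size_t idx3 = h3(val) % _bitCount;
		_bit.set(idx1);
		_bit.set(idx2);
		_bit.set(idx3);
	}

	// Returns false if val is definitely absent, true if it may be present.
	bool find(const T& val)
	{
		HashFun1 h1;
		HashFun2 h2;
		HashFun3 h3;
		std::size_t idx1 = h1(val) % _bitCount;
		std::size_t idx2 = h2(val) % _bitCount;
		std::size_t idx3 = h3(val) % _bitCount;

		if (!_bit.find(idx1))
			return false;
		if (!_bit.find(idx2))
			return false;
		if (!_bit.find(idx3))
			return false;

		return true; // may exist
	}
private:
	BitMap _bit; // bitmap (implemented in the previous blog post)
	std::size_t _bitCount;
};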
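The class above depends on the BitMap from the previous post, which is not reproduced here. To compile the code on its own, a stand-in along the following lines should be enough, placed before the BloomFilter class. Its interface (a constructor taking the bit count, `set(idx)` and `find(idx)`) is only inferred from how BloomFilter uses it; the original class may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed stand-in for the BitMap of the earlier post, not the original code.
class BitMap {
public:
    // Allocates enough 32-bit words to hold bitCount bits, all cleared.
    explicit BitMap(std::size_t bitCount)
        : _bits(bitCount / 32 + 1, 0), _bitCount(bitCount) {}

    // Sets the bit at position idx to 1.
    void set(std::size_t idx) {
        assert(idx < _bitCount);
        _bits[idx / 32] |= (std::uint32_t{1} << (idx % 32));
    }

    // Returns true if the bit at position idx is 1.
    bool find(std::size_t idx) const {
        assert(idx < _bitCount);
        return ((_bits[idx / 32] >> (idx % 32)) & 1u) != 0;
    }

private:
    std::vector<std::uint32_t> _bits;
    std::size_t _bitCount;
};
```

With this in place, a filter declared as `BloomFilter<std::string, HashFun1, HashFun2, HashFun3> bf(100);` returns true from `bf.find("hello")` after `bf.set("hello")`.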

Origin blog.csdn.net/qq_44443986/article/details/117363623