Don't know what a Bloom filter is? This article explains it clearly!

I first ran into the Bloom filter in two scenarios: massive data processing and cache penetration. I consulted some materials to understand it, but many of the existing ones did not meet my needs, so I decided to write my own summary article about the Bloom filter. I hope that through this article more people will understand the Bloom filter and actually use it!

1. What is a Bloom filter?

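Bloom filter: a rather clever probabilistic data structure made up of two parts, a binary vector (that is, a bit array) and a series of random mapping functions (hash functions).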
  • Characteristics: insertion and lookup are efficient, but deletion is not easy, and the results returned are probabilistic.
    It can tell you that a piece of data definitely does not exist (which is why it can sit in front of a cache to prevent cache penetration, by checking whether a key could exist in the database) or that a piece of data may exist.
    Compared with a traditional List, Map or Set, it is more efficient and takes up less space.
  • Usage scenario: a Bloom filter can handle a large amount of data with a small amount of memory, storing it and judging whether a given item exists (definitely does not exist, or may exist).
    Each element of the bit array occupies only 1 bit and can only be 0 or 1, so a bit array of 1 million elements occupies only 1,000,000 bits / 8 = 125,000 bytes = 125,000 / 1024 KB ≈ 122 KB. Why is it so small? Let's analyze how it works.
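The memory arithmetic above can be checked with a few lines of Java. This is only an illustrative sketch; the class and method names (BitArrayMemory, bytesForBits) are made up for this example.

```java
final class BitArrayMemory {
    /** Bytes needed to store the given number of bits, rounded up to a whole byte. */
    static long bytesForBits(long bits) {
        return (bits + 7) / 8;
    }

    public static void main(String[] args) {
        long bits = 1_000_000L;            // one bit per element of the bit array
        long bytes = bytesForBits(bits);   // 125000
        long kilobytes = bytes / 1024;     // integer division, 122
        System.out.println(bits + " bits = " + bytes + " bytes, about " + kilobytes + " KB");
    }
}
```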

2. How the Bloom filter works.

  • When an element is added to the Bloom filter, the following operations are performed:
  1. The element is run through the Bloom filter's hash functions to obtain hash values (there are as many hash values as there are hash functions).
  2. For each hash value, the bit at the corresponding index of the bit array is set to 1.
  • When we need to judge whether an element exists in the Bloom filter:
  1. The same hash functions are applied to the element again.
  2. Once the hash values are obtained, check whether every bit at the corresponding indexes of the bit array is 1. If all of them are 1, the element may be in the Bloom filter; if any of them is not 1, the element is definitely not in the Bloom filter.

A simple example:
[Figure omitted]
As shown in the figure, when the bit array is initialized, all positions are 0. When a string is added to the Bloom filter, the string produces different hash values through the different hash functions, and the bit at the index corresponding to each hash value is then set to 1.
To determine whether a string is in the Bloom filter, we only need to run the same hash calculations on the given string again and check whether every corresponding bit in the bit array is 1. If all of them are 1, the string is considered to be in the Bloom filter (with a small chance of error); if any of them is not 1, the string is definitely not in the Bloom filter.

Different strings may produce the same hash value, and the positions produced by several hashes may coincide. To reduce such collisions, you can increase the size of the bit array or adjust the hash functions (their number or their algorithm).

In summary: when the Bloom filter says that an element exists, there is a small probability that it is wrong. When the Bloom filter says that an element does not exist, the element is definitely absent.
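To see this asymmetry in practice, here is a minimal sketch with a deliberately tiny bit array. The class name TinyBloomDemo, the array size and the two seed constants are arbitrary choices for this illustration and not part of any library. Elements that were added are always reported as present, while absent elements are sometimes reported as present because their bits happen to have been set by other elements.

```java
import java.util.BitSet;

final class TinyBloomDemo {
    private static final int SIZE = 64;                         // deliberately small so collisions are common
    private static final int[] SEEDS = {0x9E3779B1, 0x85EBCA6B}; // arbitrary odd constants, one per hash function
    private final BitSet bits = new BitSet(SIZE);

    /** Maps a string to an index in [0, SIZE - 1] using the given seed. */
    private static int index(String s, int seed) {
        int mixed = s.hashCode() * seed;
        mixed ^= mixed >>> 15;               // fold high bits into the low bits
        return Math.floorMod(mixed, SIZE);
    }

    void add(String s) {
        for (int seed : SEEDS) {
            bits.set(index(s, seed));
        }
    }

    boolean mightContain(String s) {
        for (int seed : SEEDS) {
            if (!bits.get(index(s, seed))) {
                return false;                // one unset bit proves the element was never added
            }
        }
        return true;                         // all bits set: present, or a false positive
    }

    public static void main(String[] args) {
        TinyBloomDemo filter = new TinyBloomDemo();
        for (int i = 0; i < 40; i++) {
            filter.add("item-" + i);
        }
        int falseNegatives = 0;
        for (int i = 0; i < 40; i++) {
            if (!filter.mightContain("item-" + i)) falseNegatives++;
        }
        int falsePositives = 0;
        for (int i = 0; i < 1000; i++) {
            if (filter.mightContain("other-" + i)) falsePositives++;
        }
        System.out.println("false negatives: " + falseNegatives); // always 0
        System.out.println("false positives among 1000 absent keys: " + falsePositives);
    }
}
```

With only 64 bits for 40 elements the false positive count is high; enlarging the bit array brings it down, which is the adjustment described above.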

3. Bloom filter usage scenarios.

  • Determining whether given data exists:
  1. Determining whether a number is in a very large set of numbers (more than 500 million).
  2. Preventing cache penetration (requests for keys that do not exist bypass the cache and hit the database directly; the filter is used to judge whether a key exists in the database).
  3. Email spam filtering and blacklist features.
  • Deduplication:
  1. For example, when crawling URLs, deduplicating the URLs that have already been crawled.
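The cache penetration case can be sketched as follows. This is a simplified illustration, not a production design: GuardedLookup and its fields are hypothetical names, the cache is a plain HashMap, the database is any function that returns null for a missing key, and the filter is passed in as a Predicate so that any Bloom filter implementation could be plugged in.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Predicate;

final class GuardedLookup {
    private final Predicate<String> mightExist;        // e.g. a Bloom filter's membership check
    private final Function<String, String> database;   // returns null when the key is absent
    private final Map<String, String> cache = new HashMap<>();
    int databaseCalls = 0;                              // counted only to make the effect visible

    GuardedLookup(Predicate<String> mightExist, Function<String, String> database) {
        this.mightExist = mightExist;
        this.database = database;
    }

    Optional<String> get(String key) {
        if (!mightExist.test(key)) {
            return Optional.empty();                    // definitely absent: skip cache and database
        }
        String cached = cache.get(key);
        if (cached != null) {
            return Optional.of(cached);
        }
        databaseCalls++;
        String loaded = database.apply(key);
        if (loaded != null) {
            cache.put(key, loaded);
        }
        return Optional.ofNullable(loaded);
    }

    public static void main(String[] args) {
        Map<String, String> db = new HashMap<>();
        db.put("user:1", "Alice");
        // An exact key check stands in for the Bloom filter here;
        // a real filter could also answer true for a few absent keys.
        GuardedLookup lookup = new GuardedLookup(db::containsKey, db::get);
        System.out.println(lookup.get("user:1"));    // Optional[Alice]
        System.out.println(lookup.get("user:404"));  // Optional.empty
        System.out.println("database calls: " + lookup.databaseCalls); // database calls: 1
    }
}
```

Requests for keys the filter rejects never reach the database, which is the whole point of placing the filter in front of the cache.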

4. Manually implementing a Bloom filter in Java.

Below we implement a Bloom filter by hand. What we need:

  1. A suitably sized bit array to hold the data.
  2. Several different hash functions.
  3. A method that adds an element to the bit array (the Bloom filter).
  4. A method that judges whether an element exists in the bit array (the Bloom filter).
import java.util.BitSet;
public class MyBloomFilter {
    /**
     * Size of the bit array
     */
    private static final int DEFAULT_SIZE = 2 << 24;
    /**
     * Seeds used to create 6 different hash functions
     */
    private static final int[] SEEDS = new int[]{3, 13, 46, 71, 91, 134};
    /**
     * The bit array. Each element can only be 0 or 1
     */
    private BitSet bits = new BitSet(DEFAULT_SIZE);
    /**
     * Array of objects that each wrap one hash function
     */
    private SimpleHash[] func = new SimpleHash[SEEDS.length];
    /**
     * Initializes the array of hash objects; each object uses a different hash function
     */
    public MyBloomFilter() {
        // Initialize several different hash functions
        for (int i = 0; i < SEEDS.length; i++) {
            func[i] = new SimpleHash(DEFAULT_SIZE, SEEDS[i]);
        }
    }
    /**
     * Adds an element to the bit array
     */
    public void add(Object value) {
        for (SimpleHash f : func) {
            bits.set(f.hash(value), true);
        }
    }
    /**
     * Checks whether the given element exists in the bit array
     */
    public boolean contains(Object value) {
        boolean ret = true;
        for (SimpleHash f : func) {
            ret = ret && bits.get(f.hash(value));
        }
        return ret;
    }
    /**
     * Static nested class used for hashing
     */
    public static class SimpleHash {
        private int cap;
        private int seed;
        public SimpleHash(int cap, int seed) {
            this.cap = cap;
            this.seed = seed;
        }
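        /**
         * Computes the hash value, an index in [0, cap - 1]
         */
        public int hash(Object value) {
            if (value == null) {
                return 0;
            }
            int h = value.hashCode();
            // Mix the high bits into the low bits, scale by the seed, then mask into range.
            // Masking with (cap - 1) keeps the index non-negative and below cap; it assumes cap is a power of two.
            return (cap - 1) & (seed * (h ^ (h >>> 16)));
        }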

    }
}

Test:

        String value1 = "https://javaguide.cn/";
        String value2 = "https://github.com/Snailclimb";
        MyBloomFilter filter = new MyBloomFilter();
        System.out.println(filter.contains(value1));
        System.out.println(filter.contains(value2));
        filter.add(value1);
        filter.add(value2);
        System.out.println(filter.contains(value1));
        System.out.println(filter.contains(value2));

Output:

false
false
true
true

5. Use the Bloom filter that comes with Google's open source Guava.

In real projects there is no need to implement the Bloom filter yourself. The Bloom filter implementation in Guava is considered authoritative, so we can simply use it.

First, add the dependency to the project:

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>28.0-jre</version>
</dependency>

The actual use is as follows:
We create a Bloom filter that can hold up to 1500 integers and tolerates a false positive probability of 1% (0.01):

        // Create the Bloom filter object
        BloomFilter<Integer> filter = BloomFilter.create(
                Funnels.integerFunnel(),
                1500,
                0.01);
        // Check whether the given elements exist
        System.out.println(filter.mightContain(1));
        System.out.println(filter.mightContain(2));
        // Add the elements to the Bloom filter
        filter.put(1);
        filter.put(2);
        System.out.println(filter.mightContain(1));
        System.out.println(filter.mightContain(2));

In our example, when the mightContain() method returns true, we can be about 99% sure that the element is in the filter, and when it returns false, we can be 100% sure that the element is not in the filter.
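To get a feel for what an expected size of 1500 and a false positive probability of 0.01 mean for the filter's size, the sketch below applies the standard textbook sizing formulas m = -n ln p / (ln 2)^2 for the number of bits and k = (m / n) ln 2 for the number of hash functions. The class and method names are made up for this example, and Guava's own internal rounding may differ slightly from this calculation.

```java
final class BloomSizing {
    /** Number of bits m for n expected insertions and false positive probability p. */
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    /** Number of hash functions k for m bits and n expected insertions. */
    static int optimalHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1500;      // expected insertions, as in the example above
        double p = 0.01;    // tolerated false positive probability
        long m = optimalBits(n, p);
        int k = optimalHashFunctions(n, m);
        System.out.println("bits: " + m + ", hash functions: " + k);
    }
}
```

For these inputs the formulas give roughly 14,400 bits (under 2 KB) and 7 hash functions.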

The Bloom filter implementation provided by Guava is very good (see its source code for details), but it has one major drawback: it can only be used on a single machine (and expanding its capacity is not easy), whereas today's Internet applications are generally distributed. To solve this problem, we need to use the Bloom filter in Redis.

6. Bloom filter in Redis

  • Introduction
    Since Redis v4.0 there has been a Module (module / plug-in) feature. Redis Modules let Redis extend its functionality through external modules, and the Bloom filter is one such Module. For details, see the official Redis introduction to Redis Modules: https://redis.io/modules.

In addition, the official website recommends RedisBloom as the Bloom filter module for Redis: https://github.com/RedisBloom/RedisBloom . Others include:

redis-lua-scaling-bloom-filter (lua script implementation): https://github.com/erikdubbelboer/redis-lua-scaling-bloom-filter
pyreBloom (fast Redis Bloom filter in Python): https://github.com/seomoz/pyreBloom
...
RedisBloom provides client support for multiple languages, including Python, Java, JavaScript and PHP.

  • Installation using Docker
    Trying out the Bloom filter in Redis is very simple: just use Docker! Searching Google for docker redis bloomfilter and skipping the first result (an advertisement) turns up the answer (this is how I usually solve such problems, so I am sharing it). The address is https://hub.docker.com/r/redislabs/rebloom/ (the introduction there is very detailed).

The specific operations are as follows:

➜  ~ docker run -p 6379:6379 --name redis-redisbloom redislabs/rebloom:latest
➜  ~ docker exec -it redis-redisbloom bash
root@21396d02c252:/data# redis-cli
127.0.0.1:6379> 
  • List of common commands
    Note: key is the name of the Bloom filter and item is the element to add.

    • BF.ADD: Adds an element to the Bloom filter, creating the filter if it does not exist yet. Format: BF.ADD {key} {item}.

    • BF.MADD: Adds one or more elements to the Bloom filter, creating the filter if it does not exist yet. This command works the same way as BF.ADD, except that it accepts multiple inputs and returns multiple values. Format: BF.MADD {key} {item} [item…].

    • BF.EXISTS: Determines whether an element exists in the Bloom filter. Format: BF.EXISTS {key} {item}.

    • BF.MEXISTS: Determines whether one or more elements exist in the Bloom filter. Format: BF.MEXISTS {key} {item} [item…].

    • The BF.RESERVE command deserves a separate introduction.
      Its format is as follows:
      BF.RESERVE {key} {error_rate} {capacity} [EXPANSION expansion].
      The meaning of each parameter is briefly as follows:

      1. key: the name of the Bloom filter.
      2. error_rate: the expected probability of false positives. This should be a decimal value between 0 and 1. For example, for an expected false positive rate of 0.1% (1 in 1000), error_rate should be set to 0.001. The closer this number is to zero, the greater the memory consumption per item and the higher the CPU usage per operation.
      3. capacity: the capacity of the filter. When the number of stored elements exceeds this value, performance begins to degrade. The actual degradation depends on how far the limit is exceeded. As the number of elements in the filter grows exponentially, performance decreases linearly.
        Optional parameters:
      4. expansion: if a new sub-filter is created, its size is the size of the current filter multiplied by expansion. The default expansion value is 2, which means that each subsequent sub-filter is twice the size of the previous one.
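The effect of the expansion parameter can be illustrated with simple arithmetic. The sketch below is based only on the description above (each new sub-filter is expansion times the size of the previous one); it does not call Redis, and the class name and the initial capacity of 1000 are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class SubFilterGrowth {
    /** Capacities of the first `count` sub-filters, each `expansion` times the previous one. */
    static List<Long> capacities(long initialCapacity, int expansion, int count) {
        List<Long> result = new ArrayList<>();
        long current = initialCapacity;
        for (int i = 0; i < count; i++) {
            result.add(current);
            current *= expansion;
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: capacity 1000 with the default expansion of 2.
        System.out.println(capacities(1000, 2, 4)); // [1000, 2000, 4000, 8000]
    }
}
```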

  • Actual use

127.0.0.1:6379> BF.ADD myFilter java
(integer) 1
127.0.0.1:6379> BF.ADD myFilter javaguide
(integer) 1
127.0.0.1:6379> BF.EXISTS myFilter java
(integer) 1
127.0.0.1:6379> BF.EXISTS myFilter javaguide
(integer) 1
127.0.0.1:6379> BF.EXISTS myFilter github
(integer) 0

Reference: JavaGuide (with only minor changes)
More: Deng Xin

Origin blog.csdn.net/qq_42634696/article/details/105245665