Google Guava BloomFilter Source Code Analysis

A Simple Code Example

First, let's look at how Guava's BloomFilter is used.
Add the dependency

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>27.0.1-jre</version>
</dependency>

A simple implementation

package com.example.demo;

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnel;
import com.google.common.hash.Funnels;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

import java.nio.charset.Charset;

@SpringBootApplication
public class DemoApplication {

    public static void main(String[] args) {
        BloomFilter<CharSequence> bloomFilter = BloomFilter.create(
                Funnels.stringFunnel(Charset.forName("utf-8")),
                10000,0.0001);

        for(int i =0;i<5000;i++){
            bloomFilter.put(""+i);
        }
        System.out.println("Finished writing data");

        for(int i =0;i<10000;i++){
            if(bloomFilter.mightContain(""+i)){
                System.out.println(i+" exists");
            }else {
                System.out.println(i+" does not exist");
            }
        }

        SpringApplication.run(DemoApplication.class, args);
    }

}

Under the hood, Guava stores the bits in an array of long values.
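
To make the idea of "bits stored in longs" concrete, here is a minimal sketch of a bit array backed by a long[]. The class name LongBitArraySketch is made up for this post, and the code is deliberately not thread-safe, so it should be read as an illustration of the indexing arithmetic rather than as Guava's LockFreeBitArray.

```java
// Hypothetical sketch, not Guava's LockFreeBitArray: a plain, non-thread-safe bit array on long[].
class LongBitArraySketch {
    private final long[] data;
    private final long bitSize;

    LongBitArraySketch(long bits) {
        // Round up so that every index in [0, bits) falls inside some 64-bit word.
        this.data = new long[(int) ((bits + 63) / 64)];
        this.bitSize = (long) data.length * 64;
    }

    /** Sets the bit at index; returns true only if the bit was 0 before the call. */
    boolean set(long index) {
        int word = (int) (index >>> 6); // index / 64 selects the long
        long mask = 1L << index;        // a long shift uses only the low 6 bits, i.e. index % 64
        boolean changed = (data[word] & mask) == 0;
        data[word] |= mask;
        return changed;
    }

    boolean get(long index) {
        return (data[(int) (index >>> 6)] & (1L << index)) != 0;
    }

    long bitSize() {
        return bitSize;
    }

    public static void main(String[] args) {
        LongBitArraySketch bits = new LongBitArraySketch(100);
        System.out.println("bitSize = " + bits.bitSize());
        System.out.println("first set(70) changed a bit: " + bits.set(70));
        System.out.println("second set(70) changed a bit: " + bits.set(70));
        System.out.println("get(70) = " + bits.get(70) + ", get(71) = " + bits.get(71));
    }
}
```

Because the storage is rounded up to whole longs, the usable bit count in this sketch is a multiple of 64, which is one reason a bit size need not be a power of two.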

Source Code Analysis

Guava's Bloom filter involves two classes: BloomFilter and BloomFilterStrategies.

Without further ado, let's go straight to the source.
BloomFilter has four fields:

  /** The bit set of the BloomFilter (not necessarily power of 2!) */
  private final LockFreeBitArray bits;

  /** Number of hashes per element */
  private final int numHashFunctions;

  /** The funnel to translate Ts to bytes */
  private final Funnel<? super T> funnel;

  /** The strategy we employ to map an element T to {@code numHashFunctions} bit indexes. */
  private final Strategy strategy;
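  • Funnel is an interface defined in Guava. It works together with PrimitiveSink and mainly converts data of any type into Java primitive values (char, byte, int, and so on). The default implementation is backed by java.nio.ByteBuffer, and everything is ultimately converted into a byte array.
  • Strategy is an interface defined inside the BloomFilter class. It has three methods: put (insert an element), mightContain (check whether an element may be present), and ordinal. The interface is implemented by BloomFilterStrategies, which is an enum.
  • numHashFunctions is the number of hash functions.
  • LockFreeBitArray wraps the operations on the bit array, such as setting a bit to 1 and computing the size in bits. (It is defined in BloomFilterStrategies.)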

create

static <T> BloomFilter<T> create(
      Funnel<? super T> funnel, long expectedInsertions, double fpp, Strategy strategy) {
    checkNotNull(funnel);
    checkArgument(
        expectedInsertions >= 0, "Expected insertions (%s) must be >= 0", expectedInsertions);
    checkArgument(fpp > 0.0, "False positive probability (%s) must be > 0.0", fpp);
    checkArgument(fpp < 1.0, "False positive probability (%s) must be < 1.0", fpp);
    checkNotNull(strategy);

    if (expectedInsertions == 0) {
      expectedInsertions = 1;
    }
    /*
     * TODO(user): Put a warning in the javadoc about tiny fpp values, since the resulting size
     * is proportional to -log(p), but there is not much of a point after all, e.g.
     * optimalM(1000, 0.0000000000000001) = 76680 which is less than 10kb. Who cares!
     */
    long numBits = optimalNumOfBits(expectedInsertions, fpp);
    int numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, numBits);
    try {
      return new BloomFilter<T>(new LockFreeBitArray(numBits), numHashFunctions, funnel, strategy);
    } catch (IllegalArgumentException e) {
      throw new IllegalArgumentException("Could not create BloomFilter of " + numBits + " bits", e);
    }
  }

expectedInsertions: the expected number of elements
fpp: the acceptable false positive probability

Computing m, the number of bits

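This is a direct application of the formula m = -n * ln(p) / (ln 2)^2, where n is the expected number of insertions and p is the false positive probability: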

  static long optimalNumOfBits(long n, double p) {
    if (p == 0) {
      p = Double.MIN_VALUE;
    }
    return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
  }
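
As a quick sanity check, the formula can be evaluated outside Guava. The sketch below copies the arithmetic of the listing above into a standalone class (the class name OptimalBitsExample is made up for this post) and applies it to the demo's parameters and to the example quoted in the TODO comment inside create.

```java
class OptimalBitsExample {
    // Same arithmetic as the listing above: m = -n * ln(p) / (ln 2)^2, truncated to a long.
    static long optimalNumOfBits(long n, double p) {
        if (p == 0) {
            p = Double.MIN_VALUE;
        }
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    public static void main(String[] args) {
        // Parameters of the demo at the top of this post: 10000 insertions, fpp 0.0001.
        long demoBits = optimalNumOfBits(10000, 0.0001);
        System.out.println(demoBits + " bits, about " + demoBits / 8 / 1024 + " KB");

        // The example from the TODO comment in create(); the comment quotes 76680.
        System.out.println(optimalNumOfBits(1000, 0.0000000000000001));
    }
}
```

For the demo's parameters this comes out at a little over 190,000 bits, so a filter for 10000 elements at a 0.01% false positive rate needs only about 23 KB.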

Number of hash functions

This is again a direct formula, k = (m / n) * ln 2, rounded to the nearest integer and never less than 1:

static int optimalNumOfHashFunctions(long n, long m) {
    // (m / n) * log(2), but avoid truncation due to division!
    return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
  }
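
The comment about avoiding truncation is easy to miss, so here is a small sketch that compares the listing's arithmetic with a naive variant that divides the two longs first. The class name and the truncatingVariant method are made up for this post; only optimalNumOfHashFunctions mirrors the Guava code.

```java
class OptimalHashFunctionsExample {
    // Same arithmetic as the listing above: the cast to double happens before the division.
    static int optimalNumOfHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    // Hypothetical "wrong" variant for comparison: m / n is an integer division here.
    static int truncatingVariant(long n, long m) {
        return Math.max(1, (int) Math.round((m / n) * Math.log(2)));
    }

    public static void main(String[] args) {
        // Demo parameters: n = 10000 and the bit count computed for fpp = 0.0001.
        System.out.println(optimalNumOfHashFunctions(10000, 191701));

        // With m = 29 and n = 10 the two variants disagree: 2.9 * ln 2 versus 2 * ln 2.
        System.out.println(optimalNumOfHashFunctions(10, 29));
        System.out.println(truncatingVariant(10, 29));
    }
}
```

For the demo's parameters the result is 13 hash functions, which is why each put or mightContain call touches 13 bit positions.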

For how these formulas are derived, see
https://editor.csdn.net/md/?articleId=105003199

Inserting an element (put)

Guava implements element insertion through BloomFilterStrategies. It is an enum with two values, corresponding to a 32-bit hash mapping and a 64-bit hash mapping.

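The 32-bit mapping calls Hashing.murmur3_128().hashObject(object, funnel).asLong() to obtain a single long hash value (asLong() returns the first 8 bytes of the 128-bit hash), then takes its low 32 bits as hash1 and its high 32 bits as hash2.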

      long hash64 = Hashing.murmur3_128().hashObject(object, funnel).asLong();
      int hash1 = (int) hash64;
      int hash2 = (int) (hash64 >>> 32);
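
The two casts can be tried in isolation. In the sketch below the input is an arbitrary constant standing in for the murmur3 output, and the class and method names are made up for this post.

```java
class SplitHash32Example {
    // Low 32 bits: the narrowing cast simply discards the upper half.
    static int low32(long hash64) {
        return (int) hash64;
    }

    // High 32 bits: unsigned shift first, then the same narrowing cast.
    static int high32(long hash64) {
        return (int) (hash64 >>> 32);
    }

    public static void main(String[] args) {
        long hash64 = 0x123456789ABCDEF0L; // arbitrary value, not a real murmur3 result
        int hash1 = low32(hash64);
        int hash2 = high32(hash64);
        System.out.println(Integer.toHexString(hash1)); // the low half, 9abcdef0
        System.out.println(Integer.toHexString(hash2)); // the high half, 12345678
        // hash1 is negative as an int because its top bit is set, which is why
        // the loop that follows has to deal with negative combined hashes.
        System.out.println(hash1 < 0);
    }
}
```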

The 64-bit mapping calls Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal() to obtain a byte array of length 16 (128 bits in total). hash1 is built from the first 8 bytes (64 bits) and hash2 from the last 8 bytes (64 bits).

byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();
long hash1 = lowerEight(bytes);
long hash2 = upperEight(bytes);
    
private /* static */ long lowerEight(byte[] bytes) {
      return Longs.fromBytes(
          bytes[7], bytes[6], bytes[5], bytes[4], bytes[3], bytes[2], bytes[1], bytes[0]);
}

private /* static */ long upperEight(byte[] bytes) {
      return Longs.fromBytes(
          bytes[15], bytes[14], bytes[13], bytes[12], bytes[11], bytes[10], bytes[9], bytes[8]);
}
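
Passing bytes[7] down to bytes[0] into Longs.fromBytes, which treats its first argument as the most significant byte, amounts to reading the eight bytes in little-endian order. The sketch below reproduces that with plain shifts so it runs without Guava; the class name and the littleEndianLong helper are made up for this post.

```java
class EightBytesExample {
    // Reads 8 bytes starting at offset, treating bytes[offset] as the least significant byte.
    static long littleEndianLong(byte[] bytes, int offset) {
        long value = 0;
        for (int i = 7; i >= 0; i--) {
            // & 0xFFL avoids sign extension when a byte is negative.
            value = (value << 8) | (bytes[offset + i] & 0xFFL);
        }
        return value;
    }

    static long lowerEight(byte[] bytes) {
        return littleEndianLong(bytes, 0);
    }

    static long upperEight(byte[] bytes) {
        return littleEndianLong(bytes, 8);
    }

    public static void main(String[] args) {
        byte[] bytes = new byte[16];
        bytes[0] = 0x01;
        bytes[1] = 0x02; // second byte contributes 0x02 << 8
        bytes[8] = 0x03;
        System.out.println(lowerEight(bytes)); // 0x0201
        System.out.println(upperEight(bytes)); // 0x03
    }
}
```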

Why derive two hash values from the high and low bits?

Because if a single hash computation can yield two hash values, there is no need to run the hash twice, which saves time.

Deriving multiple hash values and setting the bits

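The 32-bit strategy derives the hash values as hash1 + i * hash2. When the result is negative, it flips all the bits (~combinedHash) rather than taking the absolute value, which always yields a non-negative number.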

	for (int i = 1; i <= numHashFunctions; i++) {
        int combinedHash = hash1 + (i * hash2);
        // Flip all the bits if it's negative (guaranteed positive number)
        if (combinedHash < 0) {
          combinedHash = ~combinedHash;
        }
        bitsChanged |= bits.set(combinedHash % bitSize);
      }
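
One way to see why the code flips the bits instead of calling Math.abs is to try both on Integer.MIN_VALUE. The sketch below also reproduces the index loop on small, made-up hash values; the class and method names are hypothetical.

```java
class Index32Example {
    // Same trick as the listing: a negative x becomes -x - 1, which is never negative.
    static int flipIfNegative(int x) {
        return x < 0 ? ~x : x;
    }

    // Bit indexes produced by the 32-bit scheme for the given pair of hashes.
    static long[] indexes32(int hash1, int hash2, int numHashFunctions, long bitSize) {
        long[] result = new long[numHashFunctions];
        for (int i = 1; i <= numHashFunctions; i++) {
            int combinedHash = hash1 + (i * hash2); // int arithmetic, may wrap around
            result[i - 1] = flipIfNegative(combinedHash) % bitSize;
        }
        return result;
    }

    public static void main(String[] args) {
        // Math.abs cannot represent +2^31, so it returns Integer.MIN_VALUE unchanged.
        System.out.println(Math.abs(Integer.MIN_VALUE) < 0);
        System.out.println(flipIfNegative(Integer.MIN_VALUE) == Integer.MAX_VALUE);

        for (long index : indexes32(5, 3, 3, 128)) {
            System.out.println(index); // 5+3, 5+6, 5+9
        }
    }
}
```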

The 64-bit strategy derives the hash values by starting from hash1 and repeatedly adding hash2 (combinedHash += hash2).

long combinedHash = hash1;
      for (int i = 0; i < numHashFunctions; i++) {
        // Make the combined hash positive and indexable
        bitsChanged |= bits.set((combinedHash & Long.MAX_VALUE) % bitSize);
        combinedHash += hash2;
      }

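The 32-bit and 64-bit mappings derive the additional hash values in the same way; they are just written differently. The small difference is that in the 32-bit version the first hash is hash1 + hash2, the second is hash1 + 2 * hash2, and so on, while in the 64-bit version the first is hash1, the second is hash1 + hash2, and the third is hash1 + 2 * hash2.

To me it looks like the two were written by different people, haha.

The AND with Long.MAX_VALUE (0111…1) is there in case the sum overflows and becomes negative: it clears the sign bit, so the index is always non-negative.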
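
The masking step can be checked on a few edge values. The sketch below reproduces the 64-bit index loop on small, made-up hash values and on a pair chosen to overflow; the class and method names are hypothetical.

```java
class Index64Example {
    // Bit indexes produced by the 64-bit scheme for the given pair of hashes.
    static long[] indexes64(long hash1, long hash2, int numHashFunctions, long bitSize) {
        long[] result = new long[numHashFunctions];
        long combinedHash = hash1;
        for (int i = 0; i < numHashFunctions; i++) {
            // Clearing the sign bit keeps the operand of % non-negative.
            result[i] = (combinedHash & Long.MAX_VALUE) % bitSize;
            combinedHash += hash2; // may wrap around to a negative value
        }
        return result;
    }

    public static void main(String[] args) {
        // Masking is not the same as taking the absolute value: -1 maps to Long.MAX_VALUE.
        System.out.println((-1L & Long.MAX_VALUE) == Long.MAX_VALUE);
        System.out.println((Long.MIN_VALUE & Long.MAX_VALUE) == 0);

        for (long index : indexes64(5, 3, 3, 128)) {
            System.out.println(index); // 5, 5+3, 5+6
        }

        // Overflow case: Long.MAX_VALUE + 1 wraps to Long.MIN_VALUE, which masks to 0.
        for (long index : indexes64(Long.MAX_VALUE, 1, 2, 128)) {
            System.out.println(index);
        }
    }
}
```

Compared with the 32-bit loop, this version needs no branch and no multiplication inside the loop, which matches the Javadoc comment on MURMUR128_MITZ_64 in the full listing.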

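Why derive multiple hash values by addition?

Division won't work: what happens when the value reaches 0?
Subtraction won't work either: a negative result cannot be mapped to a bit position.
Multiplication: it overflows easily. Fixing up the sign after an overflow does not seem to do any harm, but it feels hard to control, and the positions jump too far apart. Since it overflows so often, the sign fix-up would run again and again, which wastes performance.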

Checking whether a key exists (mightContain)

32-bit

@Override
    public <T> boolean mightContain(
        T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
      long bitSize = bits.bitSize();
      long hash64 = Hashing.murmur3_128().hashObject(object, funnel).asLong();
      int hash1 = (int) hash64;
      int hash2 = (int) (hash64 >>> 32);

      for (int i = 1; i <= numHashFunctions; i++) {
        int combinedHash = hash1 + (i * hash2);
        // Flip all the bits if it's negative (guaranteed positive number)
        if (combinedHash < 0) {
          combinedHash = ~combinedHash;
        }
        if (!bits.get(combinedHash % bitSize)) {
          return false;
        }
      }
      return true;
    }

64-bit

@Override
    public <T> boolean mightContain(
        T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
      long bitSize = bits.bitSize();
      byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();
      long hash1 = lowerEight(bytes);
      long hash2 = upperEight(bytes);

      long combinedHash = hash1;
      for (int i = 0; i < numHashFunctions; i++) {
        // Make the combined hash positive and indexable
        if (!bits.get((combinedHash & Long.MAX_VALUE) % bitSize)) {
          return false;
        }
        combinedHash += hash2;
      }
      return true;
    }
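
Putting the pieces together, here is a toy Bloom filter that follows the same put and mightContain structure as the 64-bit strategy but runs without Guava. Everything in it is an assumption made for illustration: the class name ToyBloomFilter, the plain long[] storage, and in particular the two hash functions (an FNV-1a-style hash over chars and a SplitMix64-style mixer), which stand in for murmur3_128 and are not what Guava uses.

```java
// Toy Bloom filter for illustration only; the hash functions are stand-ins, not murmur3_128.
class ToyBloomFilter {
    private final long[] data; // bit storage, 64 bits per long
    private final long bitSize;
    private final int numHashFunctions;

    ToyBloomFilter(long expectedInsertions, double fpp) {
        long n = Math.max(1, expectedInsertions);
        long m = Math.max(1, (long) (-n * Math.log(fpp) / (Math.log(2) * Math.log(2))));
        this.numHashFunctions = Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
        this.data = new long[(int) ((m + 63) / 64)];
        this.bitSize = (long) data.length * 64;
    }

    /** Sets the k bits for the key; returns true if at least one bit changed. */
    boolean put(String key) {
        long hash1 = fnv1a64(key);
        long hash2 = mix64(hash1);
        boolean bitsChanged = false;
        long combinedHash = hash1;
        for (int i = 0; i < numHashFunctions; i++) {
            long index = (combinedHash & Long.MAX_VALUE) % bitSize;
            int word = (int) (index >>> 6);
            long mask = 1L << index;
            bitsChanged |= (data[word] & mask) == 0;
            data[word] |= mask;
            combinedHash += hash2;
        }
        return bitsChanged;
    }

    /** Returns false if the key was definitely never put; true means "possibly present". */
    boolean mightContain(String key) {
        long hash1 = fnv1a64(key);
        long hash2 = mix64(hash1);
        long combinedHash = hash1;
        for (int i = 0; i < numHashFunctions; i++) {
            long index = (combinedHash & Long.MAX_VALUE) % bitSize;
            if ((data[(int) (index >>> 6)] & (1L << index)) == 0) {
                return false;
            }
            combinedHash += hash2;
        }
        return true;
    }

    // FNV-1a-style hash over the chars of the string (assumption: stand-in for murmur3).
    private static long fnv1a64(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // SplitMix64-style finalizer, used here to derive a second hash from the first.
    private static long mix64(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    public static void main(String[] args) {
        ToyBloomFilter filter = new ToyBloomFilter(10000, 0.0001);
        for (int i = 0; i < 5000; i++) {
            filter.put("" + i);
        }
        int falsePositives = 0;
        for (int i = 5000; i < 10000; i++) {
            if (filter.mightContain("" + i)) {
                falsePositives++;
            }
        }
        System.out.println("false positives among 5000 absent keys: " + falsePositives);
    }
}
```

The only hard guarantee of such a structure is the absence of false negatives: every key that was put is reported as possibly present. How many false positives the toy version produces depends on the quality of the stand-in hashes, so the count printed by main should be treated as an observation, not a promise.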

Full code of the two enum instances

MURMUR128_MITZ_32() {
    @Override
    public <T> boolean put(
        T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
      long bitSize = bits.bitSize();
      long hash64 = Hashing.murmur3_128().hashObject(object, funnel).asLong();
      int hash1 = (int) hash64;
      int hash2 = (int) (hash64 >>> 32);

      boolean bitsChanged = false;
      for (int i = 1; i <= numHashFunctions; i++) {
        int combinedHash = hash1 + (i * hash2);
        // Flip all the bits if it's negative (guaranteed positive number)
        if (combinedHash < 0) {
          combinedHash = ~combinedHash;
        }
        bitsChanged |= bits.set(combinedHash % bitSize);
      }
      return bitsChanged;
    }

    @Override
    public <T> boolean mightContain(
        T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
      long bitSize = bits.bitSize();
      long hash64 = Hashing.murmur3_128().hashObject(object, funnel).asLong();
      int hash1 = (int) hash64;
      int hash2 = (int) (hash64 >>> 32);

      for (int i = 1; i <= numHashFunctions; i++) {
        int combinedHash = hash1 + (i * hash2);
        // Flip all the bits if it's negative (guaranteed positive number)
        if (combinedHash < 0) {
          combinedHash = ~combinedHash;
        }
        if (!bits.get(combinedHash % bitSize)) {
          return false;
        }
      }
      return true;
    }
  },
  /**
   * This strategy uses all 128 bits of {@link Hashing#murmur3_128} when hashing. It looks different
   * than the implementation in MURMUR128_MITZ_32 because we're avoiding the multiplication in the
   * loop and doing a (much simpler) += hash2. We're also changing the index to a positive number by
   * AND'ing with Long.MAX_VALUE instead of flipping the bits.
   */
  MURMUR128_MITZ_64() {
    @Override
    public <T> boolean put(
        T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
      long bitSize = bits.bitSize();
      byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();
      long hash1 = lowerEight(bytes);
      long hash2 = upperEight(bytes);

      boolean bitsChanged = false;
      long combinedHash = hash1;
      for (int i = 0; i < numHashFunctions; i++) {
        // Make the combined hash positive and indexable
        bitsChanged |= bits.set((combinedHash & Long.MAX_VALUE) % bitSize);
        combinedHash += hash2;
      }
      return bitsChanged;
    }

    @Override
    public <T> boolean mightContain(
        T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
      long bitSize = bits.bitSize();
      byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();
      long hash1 = lowerEight(bytes);
      long hash2 = upperEight(bytes);

      long combinedHash = hash1;
      for (int i = 0; i < numHashFunctions; i++) {
        // Make the combined hash positive and indexable
        if (!bits.get((combinedHash & Long.MAX_VALUE) % bitSize)) {
          return false;
        }
        combinedHash += hash2;
      }
      return true;
    }

    private /* static */ long lowerEight(byte[] bytes) {
      return Longs.fromBytes(
          bytes[7], bytes[6], bytes[5], bytes[4], bytes[3], bytes[2], bytes[1], bytes[0]);
    }

    private /* static */ long upperEight(byte[] bytes) {
      return Longs.fromBytes(
          bytes[15], bytes[14], bytes[13], bytes[12], bytes[11], bytes[10], bytes[9], bytes[8]);
    }
  };

Summary

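  1. The BloomFilter class accepts the input, uses the formulas to estimate the parameters, and finally sets up the filter with an instance of the Strategy interface.
  2. BloomFilterStrategies is an enum with two members that implement the Strategy interface, MURMUR128_MITZ_32 and MURMUR128_MITZ_64. It also wraps an array of longs as the Bloom filter's underlying bit array, whose get and set methods carry out the core bit operations.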

HashCode

HashCode has three implementation classes: IntHashCode, LongHashCode, and BytesHashCode. Their hash values are stored as an int (32 bits), a long (64 bits), and a byte[] respectively.


Reposted from blog.csdn.net/N_a_n/article/details/105009907