Hand-writing a Bloom filter to prevent cache penetration

What is a Bloom filter?

The Bloom filter was proposed by Bloom in 1970. It is essentially a long binary vector combined with a set of mapping (hash) functions. A Bloom filter can be used to test whether an element is a member of a set. Its advantage is that it is far more space- and time-efficient than general-purpose algorithms; its disadvantages are a certain false-positive rate and the difficulty of removing elements.

To determine whether an element exists, the traditional approach is to traverse the collection and compare elements one by one. Linked lists, arrays, trees and so on all work, but different data structures give different lookup performance, that is, different time and space complexity.
These data structures must store the raw data, so as elements are added, the required storage keeps growing and retrieval becomes slower and slower.

The idea behind a Bloom filter is: create a long bit array (Bit Array), with all bits initialized to 0. When an element is added, several different hash functions are applied to it to obtain a set of indexes, and the bits of the array at those indexes are set to 1. To determine whether an element is present, we only need to check whether all of its corresponding bits are 1.
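The mechanism above can be sketched in a few lines of plain Java using `java.util.BitSet`. This is a minimal, self-contained illustration only; the bit-array size of 1024 and the two stand-in hash functions are arbitrary choices for the example, not part of any real design:

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: a bit array plus two hash functions.
public class TinyBloom {
	private static final int SIZE = 1024;
	private final BitSet bits = new BitSet(SIZE);

	// Map an element to its bit indexes (two arbitrary stand-in hashes).
	private int[] indexes(String element) {
		int h1 = Math.abs(element.hashCode() % SIZE);
		int h2 = Math.abs((element.hashCode() * 31 + 17) % SIZE);
		return new int[]{h1, h2};
	}

	// Adding an element sets all of its bits to 1.
	public void put(String element) {
		for (int i : indexes(element)) bits.set(i);
	}

	// false => definitely absent; true => possibly present (may be a false positive).
	public boolean mightContain(String element) {
		for (int i : indexes(element)) {
			if (!bits.get(i)) return false;
		}
		return true;
	}
}
```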

Bloom filter advantages:

  • Does not store the elements themselves, so the memory footprint is small
  • High performance

Bloom filter disadvantages:

  • Does not support removing elements
  • Has a certain false-positive rate

When the bit array is large enough and the false-positive rate is kept small, a Bloom filter can effectively filter out the vast majority of requests for nonexistent data.
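How large the bit array must be follows from two standard sizing formulas: for n expected elements and a target false-positive rate p, the optimal bit count is m = -n·ln(p)/(ln 2)² and the optimal number of hash functions is k = round(m/n · ln 2). A quick sketch of the calculation (the class name and the example values n = 1,000,000, p = 0.01 are chosen for illustration):

```java
// Standard Bloom filter sizing formulas.
public class BloomSizing {
	// m = -n * ln(p) / (ln 2)^2  -> required number of bits
	static long optimalNumOfBits(long n, double p) {
		return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
	}

	// k = round(m / n * ln 2)  -> optimal number of hash functions
	static int optimalNumOfHashFunctions(long n, long m) {
		return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
	}

	public static void main(String[] args) {
		long m = optimalNumOfBits(1_000_000, 0.01);
		int k = optimalNumOfHashFunctions(1_000_000, m);
		System.out.println(m + " bits, " + k + " hash functions");
	}
}
```

For a million elements at a 1% error rate this works out to roughly 9.6 million bits (about 1.2 MB) and 7 hash functions.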

What is cache penetration?

When using Redis as a cache in front of a relational database, the usual logic is: query the Redis cache first; only when the key is not in the cache, query the database.

Using a cache reduces the load on the database, but it introduces the problem of "cache penetration".

Cache penetration occurs when a query asks for data that does not exist at all: the cache misses, so the request goes to the database; the database finds nothing, so nothing is written back to the cache. As a result, every request for that nonexistent data goes straight to the database, putting pressure on it.

Requests for data that "definitely does not exist" should never reach the database at all, sparing it unnecessary load.

Possible causes

  • Bugs in the business code itself
  • Web crawlers
  • Deliberate attacks

A Bloom filter can help us identify requests for data that "definitely does not exist", so we can reject them directly and avoid touching the database.

Solutions

  • Cache empty objects (simple)
  • Bloom filter
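The first option, caching empty objects, can be sketched as follows. This is a minimal in-memory illustration: the `ConcurrentHashMap` stands in for Redis, and `NULL_MARKER` and the `Function` database stand-in are names invented for the example. With real Redis you would also give the empty marker a short TTL so nonexistent keys do not occupy memory forever:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of the "cache empty objects" solution: on a DB miss, a marker is
// cached so repeated lookups for a nonexistent id never reach the database.
public class EmptyObjectCache {
	private static final Object NULL_MARKER = new Object();
	private final Map<String, Object> cache = new ConcurrentHashMap<>();
	private final Function<String, Object> database; // stand-in for the DB query
	public int dbHits = 0; // counts real database lookups, for demonstration

	public EmptyObjectCache(Function<String, Object> database) {
		this.database = database;
	}

	public Optional<Object> get(String id) {
		Object cached = cache.get(id);
		if (cached != null) {
			// Cache hit: the marker means "known to be absent"
			return cached == NULL_MARKER ? Optional.empty() : Optional.of(cached);
		}
		dbHits++;
		Object fromDb = database.apply(id);
		// Cache the result either way, using the marker for misses
		cache.put(id, fromDb == null ? NULL_MARKER : fromDb);
		return Optional.ofNullable(fromDb);
	}
}
```

The drawback of this approach is that an attacker querying many different nonexistent ids fills the cache with markers, which is why the Bloom filter option below scales better.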

Bloom filter implementation

The implementation here is based on the BitMaps feature provided by Redis. BitMaps supports bit-level operations directly on binary data, which exactly matches our needs.

Redis Bloom filter

@Component
public class RedisBloomFilter {
	// expected total number of elements
	private long size = 1000000;
	// acceptable false-positive rate
	private double fpp = 0.01;
	// size of the bit vector
	private long numBits;
	// number of hash functions
	private int numHashFunctions;
	// key of the bitmap in Redis
	private final String key = "goods_filter";
	@Autowired
	private RedisTemplate redisTemplate;
	@Autowired
	private GoodsMapper goodsMapper;

	@PostConstruct
	private void init(){
		numBits = optimalNumOfBits();
		numHashFunctions = optimalNumOfHashFunctions();
		// rebuild the filter from the existing data
		List<Goods> goods = goodsMapper.selectList(null);
		for (Goods good : goods) {
			put(String.valueOf(good.getId()));
		}
	}

	// put an id into the Bloom filter
	public void put(String id){
		long[] indexs = getIndexs(id);
		// set the corresponding bits to 1
		for (long index : indexs) {
			redisTemplate.opsForValue().setBit(key, index, true);
		}
	}

	// check whether the id might exist
	public boolean isExist(String id){
		long[] indexs = getIndexs(id);
		// if ANY bit is 0, the id definitely does not exist;
		// only when ALL bits are 1 might it exist
		for (long index : indexs) {
			if (!redisTemplate.opsForValue().getBit(key, index)) {
				return false;
			}
		}
		return true;
	}

	// compute the bitmap indexes for a key (algorithm adapted from Guava)
	private long[] getIndexs(String key) {
		long hash1 = hash(key);
		long hash2 = hash1 >>> 16;
		long[] result = new long[numHashFunctions];
		for (int i = 0; i < numHashFunctions; i++) {
			long combinedHash = hash1 + i * hash2;
			if (combinedHash < 0) {
				combinedHash = ~combinedHash;
			}
			result[i] = combinedHash % numBits;
		}
		return result;
	}

	// compute the hash of a key (algorithm adapted from Guava)
	private long hash(String key) {
		Charset charset = Charset.defaultCharset();
		return Hashing.murmur3_128().hashObject(key, Funnels.stringFunnel(charset)).asLong();
	}

	// compute the bit-vector size (algorithm adapted from Guava)
	private long optimalNumOfBits(){
		return (long)((double)(-size) * Math.log(fpp) / (Math.log(2.0D) * Math.log(2.0D)));
	}
	// compute the number of hash functions (algorithm adapted from Guava)
	private int optimalNumOfHashFunctions() {
		return Math.max(1, (int)Math.round((double)numBits / (double)size * Math.log(2.0D)));
	}
}

Controller layer: an example of querying goods information by ID

@RestController
@RequestMapping("goods")
public class GoodsController {
	@Autowired
	private GoodsMapper goodsMapper;
	@Autowired
	private RedisBloomFilter redisBloomFilter;
	@Autowired
	private RedisTemplate<String,Object> redisTemplate;

	// query goods by ID, going through the Bloom filter first
	@GetMapping("/{id}")
	public R id(@PathVariable String id){
		// check the Bloom filter first to drop requests for data that cannot exist
		if (!redisBloomFilter.isExist(id)) {
			System.err.println("id:" + id + ", rejected by Bloom filter...");
			return R.success(null);
		}
		// the Bloom filter says the id may exist, so run the normal lookup
		return R.success(noFilter(id));
	}

	// lookup without the Bloom filter
	private Object noFilter(String id){
		// check the Redis cache first
		Object o = redisTemplate.opsForValue().get(id);
		if (o != null) {
			// cache hit
			System.err.println("id:" + id + ", hit Redis cache...");
			return o;
		}
		// cache miss: query the database
		System.err.println("id:" + id + ", querying DB...");
		Goods goods = goodsMapper.selectById(id);
		// store the result in Redis (guard against caching a miss as null)
		if (goods != null) {
			redisTemplate.opsForValue().set(id, goods);
		}
		return goods;
	}
}

Start the project and request several nonexistent IDs; the results are shown below:
[screenshot omitted]
For data that may exist, the test result is shown below:
[screenshot omitted]

Summary

Using a Redis cache can reduce database load to some extent, but in special cases such as malicious attacks, if the program does not handle them, the database is still at risk.

A Bloom filter can effectively filter out the vast majority of meaningless DB queries. When the data volume is very large, hash collisions cause a certain false-positive rate, but the filter is still very efficient.



Origin: blog.csdn.net/qq_32099833/article/details/103844890