An Introduction to the Roaring Bitmaps Data Structure

Background:

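  BitMap is a commonly used data structure. Bitmap indexes are widely used in databases and search engines: they can quickly tell whether a given value is present, they represent a set of integers compactly, and they can speed up queries considerably. However, a plain BitMap still uses a lot of memory (its size grows linearly with the range of values it covers), so in practice the BitMap itself usually needs to be compressed. Roaring BitMaps (RBM for short) is one such compression scheme.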

  In short: BitMap is a data structure / compression scheme, and RBM is a data structure / compression scheme built on the BitMap idea.
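  To make the memory concern concrete, here is a minimal sketch of an uncompressed bitmap in Python. The class name SimpleBitmap and its methods are hypothetical, chosen only for illustration; they are not taken from any library.

```python
class SimpleBitmap:
    """Uncompressed bitmap over the integers [0, size). Illustrative only."""

    def __init__(self, size):
        self.size = size
        # One bit per possible value, rounded up to whole bytes.
        self.bits = bytearray((size + 7) // 8)

    def add(self, value):
        # Byte index is value // 8, bit position inside the byte is value % 8.
        self.bits[value >> 3] |= 1 << (value & 7)

    def contains(self, value):
        return bool(self.bits[value >> 3] & (1 << (value & 7)))


bitmap = SimpleBitmap(1_000_000)
bitmap.add(42)
bitmap.add(999_999)
print(bitmap.contains(42), bitmap.contains(43))  # True False
print(len(bitmap.bits))  # 125000 bytes, although only two values are stored
```

  The allocation depends on the size of the range, not on how many values are actually present, which is the cost that compressed bitmaps try to avoid.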

How it works:

  Here is an excerpt from the original paper:

  1.   We partition the range of 32-bit indexes ([0, n)) into chunks of 2^16 integers sharing the same 16 most significant digits. We use specialized containers to store their 16 least significant bits.
  2.   When a chunk contains no more than 4096 integers, we use a sorted array of packed 16-bit integers. When there are more than 4096 integers, we use a 2^16-bit bitmap. Thus, we have two types of containers: an array container for sparse chunks and a bitmap container for dense chunks. The 4096 threshold insures that at the level of the containers, each integer uses no more than 16 bits: we either use 2^16 bits for more than 4096 integers, using less than 16 bits/integer, or else we use exactly 16 bits/integer.
  3.   The containers are stored in a dynamic array with the shared 16 most-significant bits: this serves as a first-level index. The array keeps the containers sorted by the 16 most-significant bits. We expect this first-level index to be typically small: when n = 1 000 000, it contains at most 16 entries. Thus it should often remain in the CPU cache. The containers themselves should never use much more than 8 kB.

  In plain terms:

  1. Split every 32-bit integer in the range [0, n) into two parts: its high 16 bits and its low 16 bits.

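  2. The high 16 bits are used to look up where the data is stored, and the low 16 bits are stored in a container (in effect a HashMap-like structure, except that the keys are kept in a sorted array rather than hashed).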
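  The split in the two steps above can be sketched with a shift and a mask. The helper names split and join are hypothetical, introduced only for this example.

```python
def split(value):
    """Split a 32-bit unsigned integer into (high 16 bits, low 16 bits)."""
    if not 0 <= value < 2**32:
        raise ValueError("expected a 32-bit unsigned integer")
    return value >> 16, value & 0xFFFF


def join(high, low):
    """Rebuild the original 32-bit value from its two halves."""
    return (high << 16) | low


high, low = split(1_000_000)
print(high, low)  # 15 16960
print(join(high, low))  # 1000000
```

  Every value below 1 000 000 has a high half between 0 and 15, which agrees with the paper's remark that the first-level index has at most 16 entries when n = 1 000 000.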

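  A note on containers: the containers are kept in a dynamic array. When a chunk holds no more than 4096 values, they are stored in a sorted array of 16-bit shorts; when it holds more than 4096, they are stored in a 2^16-bit BitMap.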
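  The following is a rough sketch of this two-level layout, not the implementation from the paper or from any Roaring library. The class ToyRoaring and all of its members are hypothetical. It assumes Python's array("H") as a stand-in for a packed 16-bit array and a bytearray of 8192 bytes as the 2^16-bit bitmap, and it only handles insertion and lookup (no removal, and no conversion back from bitmap to array).

```python
import bisect
from array import array

ARRAY_LIMIT = 4096  # threshold quoted from the paper


class ToyRoaring:
    """Toy two-level structure: sorted keys plus one container per key."""

    def __init__(self):
        self.keys = []        # sorted high-16-bit keys (first-level index)
        self.containers = []  # containers[i] holds the low halves for keys[i]

    def add(self, value):
        high, low = value >> 16, value & 0xFFFF
        i = bisect.bisect_left(self.keys, high)
        if i == len(self.keys) or self.keys[i] != high:
            self.keys.insert(i, high)
            self.containers.insert(i, array("H"))  # new chunks start sparse
        container = self.containers[i]
        if isinstance(container, bytearray):       # bitmap container
            container[low >> 3] |= 1 << (low & 7)
            return
        j = bisect.bisect_left(container, low)     # array container
        if j < len(container) and container[j] == low:
            return                                 # already present
        container.insert(j, low)
        if len(container) > ARRAY_LIMIT:
            # The chunk became dense: convert it to a 2^16-bit bitmap.
            bitmap = bytearray(2**16 // 8)
            for x in container:
                bitmap[x >> 3] |= 1 << (x & 7)
            self.containers[i] = bitmap

    def contains(self, value):
        high, low = value >> 16, value & 0xFFFF
        i = bisect.bisect_left(self.keys, high)
        if i == len(self.keys) or self.keys[i] != high:
            return False
        container = self.containers[i]
        if isinstance(container, bytearray):
            return bool(container[low >> 3] & (1 << (low & 7)))
        j = bisect.bisect_left(container, low)
        return j < len(container) and container[j] == low

    def container_kinds(self):
        return ["bitmap" if isinstance(c, bytearray) else "array"
                for c in self.containers]


rb = ToyRoaring()
for v in range(5000):        # dense chunk under key 0
    rb.add(v)
rb.add(1_000_000)            # sparse chunk under key 15
print(rb.keys)               # [0, 15]
print(rb.container_kinds())  # ['bitmap', 'array']
print(rb.contains(4999), rb.contains(5000), rb.contains(1_000_000))  # True False True
```

  Binary search over the sorted keys plays the role of the first-level index; a real implementation would also support removal, set operations, and serialization.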

  Why use two different data structures to store the low 16 bits:

    short array: 2 bytes * 4096 = 8 KB

    BitMap: covers the whole 16-bit range, 65536 bits / 8 = 8192 bytes = 8 KB, no matter how many values are stored

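  So when a chunk holds fewer than 4096 values, the short array takes less space; at exactly 4096 values the two are the same size.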
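  The arithmetic can be checked with a few lines of Python. The function names below are hypothetical, and the sizes count only the stored values, ignoring any per-object overhead.

```python
BITMAP_BYTES = 2**16 // 8  # one bit per possible low-16-bit value: 8192 bytes


def array_bytes(count):
    """Bytes used by a sorted array of `count` 16-bit values."""
    return 2 * count


def cheaper_container(count):
    """Container chosen by the paper's rule: array up to 4096 values, bitmap above."""
    return "array" if count <= 4096 else "bitmap"


for count in (100, 4096, 5000):
    print(count, array_bytes(count), BITMAP_BYTES, cheaper_container(count))
# 100 200 8192 array
# 4096 8192 8192 array
# 5000 10000 8192 bitmap
```

  With 5000 values the bitmap spends 65536 / 5000, roughly 13.1 bits per value, which is the "less than 16 bits/integer" case mentioned in the paper excerpt.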

  


Reposted from www.cnblogs.com/souyoulang/p/9903202.html