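HBase in Practice (16): BlockCache

0 Introduction

As with other databases, optimizing I/O is central to improving HBase performance, and caching is the most important part of that optimization.

By the 80/20 rule, 80% of business requests concentrate on 20% of hot data, so caching that portion can greatly improve system performance.

HBase provides two cache structures: MemStore and BlockCache. MemStore is the write cache. A write first puts the data into the MemStore and appends it sequentially to the HLog; once certain conditions are met, the data in the MemStore is flushed to disk in one batch. This design greatly improves HBase write performance. MemStore also matters for reads: without it, reading freshly written data would require file I/O, which is clearly expensive. BlockCache is the read cache. HBase caches the blocks it reads during a file lookup so that later requests for the same or neighboring data can be served from memory, avoiding expensive I/O.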
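The write path described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `WritePathSketch`, its fields, and the entry-count flush threshold are hypothetical names chosen for illustration, and real HBase flushes on memory size and other conditions rather than on a simple entry count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy write path: append to a log, buffer in a sorted in-memory map, flush at a threshold. */
final class WritePathSketch {
    final List<String> wal = new ArrayList<>();                            // stands in for the HLog
    final TreeMap<String, String> memStore = new TreeMap<>();              // stands in for the MemStore
    final List<TreeMap<String, String>> flushedFiles = new ArrayList<>();  // stands in for files on disk
    final int flushThreshold;                                              // hypothetical: entries before a flush

    WritePathSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void put(String rowKey, String value) {
        wal.add(rowKey + "=" + value);   // sequential append
        memStore.put(rowKey, value);     // buffered in memory, sorted by row key
        if (memStore.size() >= flushThreshold) {
            flushedFiles.add(new TreeMap<>(memStore)); // one sorted "file" per flush
            memStore.clear();
        }
    }

    /** Reads check the MemStore first, then the flushed files from newest to oldest. */
    String get(String rowKey) {
        String value = memStore.get(rowKey);
        if (value != null) {
            return value;
        }
        for (int i = flushedFiles.size() - 1; i >= 0; i--) {
            value = flushedFiles.get(i).get(rowKey);
            if (value != null) {
                return value;
            }
        }
        return null;
    }
}
```

A freshly written row is answered from `memStore` without touching `flushedFiles`, which is the read benefit the paragraph above attributes to MemStore.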

1. Contents of the BlockCache

Knowing what the BlockCache holds helps when sizing it.

Your data: Each time a Get or Scan operation occurs, the result is added to the BlockCache if it was not already cached there. If you use the BucketCache, data blocks are always cached in the BucketCache.
Row keys: When a value is loaded into the cache, its row key is also cached. This is one reason to make your row keys as small as possible. A larger row key takes up more space in the cache.
hbase:meta: The hbase:meta catalog table keeps track of which RegionServer is serving which regions. It can consume several megabytes of cache if you have a large number of regions, and has in-memory access priority, which means HBase attempts to keep it in the cache as long as possible.
Indexes of HFiles: HBase stores its data in HDFS in a format called HFile. These HFiles contain indexes which allow HBase to seek for data within them without needing to open the entire HFile. The size of an index is a factor of the block size, the size of your row keys, and the amount of data you are storing. For big data sets, the size can exceed 1 GB per RegionServer, although the entire index is unlikely to be in the cache at the same time. If you use the BucketCache, indexes are always cached on-heap.
Bloom filters: If you use Bloom filters, they are stored in the BlockCache. If you use the BucketCache, Bloom filters are always cached on-heap.

The sum of the sizes of these objects is highly dependent on your usage patterns and the characteristics of your data. For this reason, the HBase Web UI and Cloudera Manager each expose several metrics to help you size and tune the BlockCache.

The cached items above fall into two categories:

Data itself: your data, row keys
Indexes: hbase:meta, indexes of HFiles, Bloom filters

With BucketCache, the smallest unit of cached data is a block.

2. BlockCache

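The BlockCache is scoped to the RegionServer: each RegionServer has exactly one BlockCache, which is initialized when the RegionServer starts. HBase has implemented three BlockCache schemes so far. LRUBlockCache is the original implementation and remains the default. HBase 0.92 added the second scheme, SlabCache (see HBASE-4027). After HBase 0.96, an alternative scheme, BucketCache, became available (see HBASE-7404).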

2.1 LRU Least-Recently-Used

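An LRU cache evicts the least recently used data to make room for newly read data. Recently read data also tends to be the data read most often, so an LRU cache improves system performance.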
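A minimal sketch of LRU eviction, built on the JDK's `LinkedHashMap` in access order. The class name `LruSketch` and the entry-count capacity are assumptions for illustration; this is not how LRUBlockCache itself is implemented.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Entry-count-bounded LRU map: the least recently accessed entry is evicted first. */
final class LruSketch<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruSketch(int capacity) {
        super(16, 0.75f, true); // true = iterate in access order, oldest access first
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; returning true drops the least recently used entry.
        return size() > capacity;
    }
}
```

With a capacity of 2, inserting `a` and `b`, reading `a`, and then inserting `c` evicts `b`, because `a` was touched more recently.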

2.2 LRUBlockCache

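LRUBlockCache divides the cache into three regions: single-access, multi-access, and in-memory, which take 25%, 50%, and 25% of the total BlockCache size respectively.

The in-memory region holds data meant to stay resident in memory, typically small, frequently accessed data such as metadata. Users can also place a column family in the in-memory region by setting the column family attribute IN_MEMORY => 'true' when creating the table. The design clearly resembles the young, old, and perm generations of the JVM. Whichever region a block is in, the system applies a strict Least-Recently-Used algorithm.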
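The 25/50/25 split can be written out as simple arithmetic. The sketch below is illustrative only: `LruPrioritySketch` and its methods are hypothetical names, and the promotion rule (a block starts in single-access and moves to multi-access on a repeat access) is an assumption inferred from the region names rather than something stated in the text.

```java
/** Region sizing and region choice for the three-way split described above. */
final class LruPrioritySketch {
    enum Priority { SINGLE, MULTI, MEMORY }

    /** Bytes reserved for a region: 25% single-access, 50% multi-access, 25% in-memory. */
    static long regionBytes(long totalCacheBytes, Priority priority) {
        switch (priority) {
            case MULTI:
                return totalCacheBytes * 50 / 100;
            default: // SINGLE and MEMORY both get a quarter
                return totalCacheBytes * 25 / 100;
        }
    }

    /** Assumed rule: in-memory families go to MEMORY; other blocks are SINGLE until accessed again. */
    static Priority priorityFor(boolean inMemoryFamily, int accessCount) {
        if (inMemoryFamily) {
            return Priority.MEMORY;
        }
        return accessCount > 1 ? Priority.MULTI : Priority.SINGLE;
    }
}
```

For a 4 GB BlockCache this gives 1 GB, 2 GB, and 1 GB for the single-access, multi-access, and in-memory regions.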

Internally, LruBlockCache stores all cached blocks in a ConcurrentHashMap.

  /** Concurrent map (the cache) */
  private final Map<BlockCacheKey,LruCachedBlock> map;

  // Initialized in the constructor:
  map = new ConcurrentHashMap<BlockCacheKey,LruCachedBlock>(mapInitialSize,
        mapLoadFactor, mapConcurrencyLevel);

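Pros and cons of the LRU scheme:
- Pro: it manages the cache with the HashMap provided by the JVM, which is simple and effective.
- Con: Full GC. With a large heap, a single Full GC can last a long time, even minutes.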

2.2 SlabCache

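SlabCache uses Java NIO DirectByteBuffer to store blocks off-heap.

SlabCache has two cache regions, taking 80% and 20% of the total BlockCache size. Each region stores blocks of a fixed size:

The former mainly stores blocks of at most 64K and the latter blocks of at most 128K. A block that is too large cannot be cached in either region.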
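The fixed-size layout can be sketched as a simple lookup. `SlabSketch` and its method names are hypothetical; only the 64K/128K slot sizes and the 80%/20% split come from the text.

```java
import java.util.OptionalInt;

/** Picks the slab region for a block, following the two fixed slot sizes described above. */
final class SlabSketch {
    static final int SMALL_SLOT_BYTES = 64 * 1024;   // region with 80% of the cache
    static final int LARGE_SLOT_BYTES = 128 * 1024;  // region with 20% of the cache

    /** Slot size that would hold the block, or empty when no region can cache it. */
    static OptionalInt slotFor(int blockBytes) {
        if (blockBytes <= SMALL_SLOT_BYTES) {
            return OptionalInt.of(SMALL_SLOT_BYTES);
        }
        if (blockBytes <= LARGE_SLOT_BYTES) {
            return OptionalInt.of(LARGE_SLOT_BYTES);
        }
        return OptionalInt.empty();
    }

    /** Number of fixed-size slots a region gets from its percentage share of the cache. */
    static long slotCount(long cacheBytes, int sharePercent, int slotBytes) {
        return cacheBytes * sharePercent / 100 / slotBytes;
    }
}
```

Because every slot in a region has the same size, a 10K block still occupies a full 64K slot, and a block larger than 128K gets no slot at all.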

Q: What if the user sets BlockSize = 256K?

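In practice HBase pairs SlabCache with LRUBlockCache, a combination called DoubleBlockCache. During a random read, a block loaded from HDFS is stored in both caches. A cache read first looks in LRUBlockCache; on a miss it looks in SlabCache, and on a hit there the block is also put into LRUBlockCache.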
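
The two-level lookup just described can be sketched with two plain maps. `DoubleCacheSketch` is a hypothetical name, and eviction is left out entirely; the sketch only shows the store-in-both and promote-on-hit behavior.

```java
import java.util.HashMap;
import java.util.Map;

/** Two-level lookup: on-heap first, then off-heap, promoting on an off-heap hit. */
final class DoubleCacheSketch {
    final Map<String, byte[]> onHeap = new HashMap<>();   // stands in for LRUBlockCache
    final Map<String, byte[]> offHeap = new HashMap<>();  // stands in for SlabCache

    /** A block loaded from HDFS is stored in both caches. */
    void cacheBlock(String key, byte[] block) {
        onHeap.put(key, block);
        offHeap.put(key, block.clone());
    }

    /** Returns the block, or null when both caches miss. */
    byte[] getBlock(String key) {
        byte[] block = onHeap.get(key);
        if (block != null) {
            return block;
        }
        block = offHeap.get(key);
        if (block != null) {
            onHeap.put(key, block); // promote so the next read hits on-heap
        }
        return block;
    }
}
```
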
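What are the problems with SlabCache?

The fixed-size memory layout in SlabCache leads to low actual memory utilization, and caching blocks in LRUBlockCache still produces heavy memory fragmentation from JVM GC.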

2.3 BucketCache

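2.3.1 BucketCache memory layout

Each bucket has a baseoffset variable and a size label. baseoffset is the starting address of the bucket in the physical space, so the physical address of a block is uniquely determined by baseoffset plus the block's offset within the bucket. The size label gives the size of the blocks the bucket can hold.

By default there are 14 bucket sizes: 4, 8, 16, 32, 40, 48, 56, 64, 96, 128, 192, 256, 384, 512 KB. Each size label carries an extra 1 KB on top of the listed value, so the 64 KB entry is a 65K label and the largest is (512+1)K.

Advantage: blocks of different sizes can be cached.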
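As a rough illustration of how a size label and a physical address could be derived, the sketch below picks the smallest label that fits a block and computes an address from baseoffset. `BucketSizeSketch` and its methods are hypothetical names. The extra 1 KB per label is an assumption drawn from the 65K and (512+1)K labels mentioned in the text, and the slot-index addressing assumes equal-sized slots packed from the start of the bucket.

```java
/** Size-label lookup and address arithmetic for the bucket layout described above. */
final class BucketSizeSketch {
    // Listed sizes in KB; each label is assumed to be the listed value plus 1 KB.
    static final int[] LISTED_KB = {4, 8, 16, 32, 40, 48, 56, 64, 96, 128, 192, 256, 384, 512};

    static int sizeLabelBytes(int index) {
        return (LISTED_KB[index] + 1) * 1024;
    }

    /** Index of the smallest size label that can hold the block, or -1 if none can. */
    static int sizeIndexFor(int blockBytes) {
        for (int i = 0; i < LISTED_KB.length; i++) {
            if (blockBytes <= sizeLabelBytes(i)) {
                return i;
            }
        }
        return -1;
    }

    /** Physical offset = bucket baseoffset + offset of the slot inside the bucket. */
    static long physicalOffset(long baseOffset, int slotIndex, int sizeLabelBytes) {
        return baseOffset + (long) slotIndex * sizeLabelBytes;
    }
}
```

Under these assumptions a 64 KB block does not fit the 57K label and lands in a 65K bucket, which wastes far less space than a fixed 128K slot would.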

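2.3.2 Bucket management with the BucketAllocator class

HBase groups buckets by size label; buckets with the same size label are managed by one BucketSizeInfo.
At startup HBase assigns one bucket to each size label, and all remaining buckets get the largest size label, (512+1)K by default.
A bucket's size label can change dynamically. For example, if 64K blocks are numerous and the 65K buckets are used up, completely free buckets with other size labels can be converted into 65K buckets, provided at least one bucket of the original size is kept.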
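
The assignment rules above can be sketched over a plain array. `BucketAllocatorSketch` and its methods are hypothetical and much simpler than HBase's BucketAllocator; size labels are represented only by their index, and the search order for a donor bucket is an arbitrary choice.

```java
import java.util.Arrays;

/** Initial label assignment and free-bucket conversion, following the rules described above. */
final class BucketAllocatorSketch {
    /** For each bucket, the index of the size label it starts with. */
    static int[] initialAssignment(int bucketCount, int sizeLabelCount) {
        int[] labelOfBucket = new int[bucketCount];
        for (int b = 0; b < bucketCount; b++) {
            // One bucket per label first; every bucket left over gets the largest label.
            labelOfBucket[b] = Math.min(b, sizeLabelCount - 1);
        }
        return labelOfBucket;
    }

    /** Converts one completely free bucket to wantedLabel, never taking a label's last bucket. */
    static boolean reassignFreeBucket(int[] labelOfBucket, boolean[] completelyFree, int wantedLabel) {
        for (int b = 0; b < labelOfBucket.length; b++) {
            int current = labelOfBucket[b];
            if (current == wantedLabel || !completelyFree[b]) {
                continue;
            }
            long bucketsWithSameLabel = Arrays.stream(labelOfBucket).filter(l -> l == current).count();
            if (bucketsWithSameLabel > 1) {
                labelOfBucket[b] = wantedLabel;
                return true;
            }
        }
        return false;
    }
}
```
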

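2.3.3 Block cache write and read flow

RAMCache is a HashMap that stores the mapping from blockKey to block;
WriterThread is the hub of block writes and is responsible for writing blocks into the memory space asynchronously;
BucketAllocator manages the buckets and allocates memory space for blocks;
IOEngine is the concrete memory management module and writes block data into the memory space at the assigned address;
BackingMap is also a HashMap; it stores the mapping from blockKey to the physical memory offset and is used to locate a block by its blockKey.

In the flow diagram, solid lines denote the cache-block flow and dashed lines the get-block flow.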
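
How the five components might cooperate can be sketched in a few lines. This is a simplified, single-threaded model under stated assumptions: `BucketCacheFlowSketch` is a hypothetical name, the writer thread is replaced by a `drainWriteQueue()` call so the behavior is deterministic, allocation is a bump pointer instead of a real BucketAllocator, and the read order (RAMCache first, then BackingMap) is assumed rather than taken from the text. A full cache is not handled.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Toy cache-block and get-block flow through RAMCache, writer, allocator, IOEngine, BackingMap. */
final class BucketCacheFlowSketch {
    final Map<String, byte[]> ramCache = new HashMap<>();    // blockKey -> block awaiting write
    final Queue<String> writeQueue = new ArrayDeque<>();     // work handed to the writer
    final Map<String, int[]> backingMap = new HashMap<>();   // blockKey -> {offset, length}
    final ByteBuffer ioEngine;                                // stand-in for the IOEngine's storage
    private int nextFreeOffset = 0;                           // trivial stand-in for BucketAllocator

    BucketCacheFlowSketch(int capacityBytes) {
        ioEngine = ByteBuffer.allocate(capacityBytes);
    }

    /** Cache-block flow, step 1: park the block in RAMCache and queue it for writing. */
    void cacheBlock(String key, byte[] block) {
        ramCache.put(key, block);
        writeQueue.add(key);
    }

    /** Cache-block flow, step 2: what the writer thread would do asynchronously. */
    void drainWriteQueue() {
        String key;
        while ((key = writeQueue.poll()) != null) {
            byte[] block = ramCache.get(key);
            if (block == null) {
                continue; // already written by an earlier queue entry
            }
            int offset = nextFreeOffset;             // "allocate" space for the block
            nextFreeOffset += block.length;
            for (int i = 0; i < block.length; i++) {
                ioEngine.put(offset + i, block[i]);  // write the bytes at the assigned address
            }
            backingMap.put(key, new int[] {offset, block.length});
            ramCache.remove(key);
        }
    }

    /** Get-block flow: pending blocks come from RAMCache, written ones via BackingMap. */
    byte[] getBlock(String key) {
        byte[] pending = ramCache.get(key);
        if (pending != null) {
            return pending;
        }
        int[] location = backingMap.get(key);
        if (location == null) {
            return null;
        }
        byte[] out = new byte[location[1]];
        for (int i = 0; i < out.length; i++) {
            out[i] = ioEngine.get(location[0] + i);
        }
        return out;
    }
}
```
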

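2.3.4 BucketCache working modes

BucketCache has three working modes: heap, offheap, and file.

They differ in the final storage medium, that is, in the IOEngine described above.

heap and offheap modes

Both use memory as the final storage medium. Allocating memory is faster in offheap mode, while reading is faster in heap mode. heap mode is affected by GC.

file mode

It uses Fusion-IO, SSD, or similar devices as the storage medium. Compared with expensive memory this offers much larger capacity, which can greatly improve the cache hit rate.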
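
The idea that the three modes differ only in the storage behind a common interface can be sketched with JDK classes. The interface and class names below (`IoEngineSketch`, `BufferEngine`, `FileEngine`) are hypothetical and are not HBase's actual IOEngine types; heap and offheap are modeled with `ByteBuffer.allocate` and `ByteBuffer.allocateDirect`, and file mode with a `FileChannel`.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Common contract: write or read a run of bytes at an absolute offset. */
interface IoEngineSketch {
    void write(ByteBuffer src, long offset) throws IOException;
    void read(ByteBuffer dst, long offset) throws IOException;
}

/** heap mode (direct = false) or offheap mode (direct = true). */
final class BufferEngine implements IoEngineSketch {
    private final ByteBuffer store;

    BufferEngine(int capacityBytes, boolean direct) {
        store = direct ? ByteBuffer.allocateDirect(capacityBytes) : ByteBuffer.allocate(capacityBytes);
    }

    public void write(ByteBuffer src, long offset) {
        ByteBuffer view = store.duplicate(); // independent position, shared content
        view.position((int) offset);
        view.put(src);
    }

    public void read(ByteBuffer dst, long offset) {
        ByteBuffer view = store.duplicate();
        view.position((int) offset);
        view.limit((int) offset + dst.remaining());
        dst.put(view); // copies the bytes out of the store into dst
    }
}

/** file mode: the same contract backed by a file, for example on an SSD. */
final class FileEngine implements IoEngineSketch, AutoCloseable {
    private final FileChannel channel;

    FileEngine(Path path) throws IOException {
        channel = FileChannel.open(path, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
    }

    public void write(ByteBuffer src, long offset) throws IOException {
        while (src.hasRemaining()) {
            offset += channel.write(src, offset);
        }
    }

    public void read(ByteBuffer dst, long offset) throws IOException {
        while (dst.hasRemaining()) {
            int n = channel.read(dst, offset);
            if (n < 0) {
                break; // end of file
            }
            offset += n;
        }
    }

    public void close() throws IOException {
        channel.close();
    }
}
```

Reading from the offheap variant always copies bytes out of the direct buffer into the caller's buffer, while a heap-backed store already lives inside the JVM heap and is therefore subject to GC.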

3. BlockCache parameter settings

heap mode

<property><name>hbase.bucketcache.ioengine</name><value>heap</value></property>
<!-- fraction of the whole JVM heap used by the bucket cache -->
<property><name>hbase.bucketcache.size</name><value>0.4</value></property>
<!-- share of the bucket cache within the combined cache -->
<property><name>hbase.bucketcache.combinedcache.percentage</name><value>0.9</value></property>

offheap mode

<property><name>hbase.bucketcache.ioengine</name><value>offheap</value></property>
<property><name>hbase.bucketcache.size</name><value>0.4</value></property>
<property><name>hbase.bucketcache.combinedcache.percentage</name><value>0.9</value></property>

file mode

<property><name>hbase.bucketcache.ioengine</name><value>file:/cache_path</value></property>
<!-- bucket cache capacity in MB (10 * 1024) -->
<property><name>hbase.bucketcache.size</name><value>10240</value></property>
<!-- cache path -->
<property><name>hbase.bucketcache.persistent.path</name><value>file:/cache_path</value></property>
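
The two ways `hbase.bucketcache.size` is used above (a fraction of the JVM heap in one case, a size in MB in the other) can be written as one small helper. `BucketCacheSizeSketch` is a hypothetical name, and the rule that any value below 1 is treated as a fraction is an assumption made to reconcile the two usages, not a statement about every HBase version.

```java
/** Interprets a configured bucket cache size as either a heap fraction or a size in MB. */
final class BucketCacheSizeSketch {
    static long bucketCacheBytes(double configuredSize, long maxHeapBytes) {
        if (configuredSize < 1.0) {
            // Fraction of the JVM heap, as in the heap and offheap examples (0.4).
            return (long) (maxHeapBytes * configuredSize);
        }
        // Size in MB, as in the file mode example (10 * 1024).
        return (long) configuredSize * 1024L * 1024L;
    }
}
```

With a 10 GB heap, 0.4 comes to roughly 4 GB, while 10240 means 10 GB of cache on the file-backed device regardless of heap size.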

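4. BlockCache usage scenarios

When every read hits the cache, LRU (LRUBlockCache) clearly beats CBC (CombinedBlockCache), so choose LRU when the total data volume is small relative to the JVM heap.
In all other scenarios, where some reads miss the cache, LRU's GC performance is only about one third of CBC's, while throughput, read/write latency, IO, and CPU are roughly equal, so CBC is recommended.

LRU wins on every metric when all reads hit the cache, yet roughly ties with CBC when many reads miss. The reason is that on a cache hit, CBC has to copy the block from off-heap memory into the JVM heap before returning it to the user, a longer path than LRU's on-heap access, so latency is higher. When many reads miss the cache, memory operations account for only a small share of the work and the latency bottleneck is IO, which brings the two schemes to roughly the same numbers.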

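BlockCache configuration scenarios

1) If requests fit the cache well and the hit rate is high, LRUBlockCache delivers about 20% higher throughput than CombinedBlockCache (at the cost of some garbage collection).

2) If the data to cache exceeds the heap size, use the off-heap BlockCache.

3) When fetching data with a scan, setCacheBlocks controls whether the block cache is used; enable it only for frequently accessed rows.

4) For a Scan used as MapReduce input, call setCacheBlocks(false).

5) If the cache sees a persistently high eviction rate, which causes heavy garbage collection in LruBlockCache, use CombinedBlockCache.

6) CombinedBlockCache in file mode on SSD has better garbage collection behavior, but lower throughput, than CombinedBlockCache in offheap mode.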

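5. Summary

Start with the default LRUBlockCache and its default parameters.

Best parameters: tune from monitoring metrics, increasing memory while keeping GC time in check.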