How to count daily and monthly active users: Redis HyperLogLog explained

HyperLogLog is a probabilistic data structure used to estimate the cardinality of a data set. The data set can be, for example, the IP addresses of website visitors, e-mail addresses, or user IDs.

Cardinality is the number of distinct values in a set. For example, the cardinality of {a, b, c, d} is 4, and the cardinality of {a, b, c, d, a} is also 4: although a appears twice, it is counted only once.

Computing the exact cardinality of a data set requires a large amount of memory, because the data set itself has to be stored. While traversing the data set, the only way to tell whether the current value has already been seen is to compare it, one by one, with the values already traversed. As the data set grows, memory consumption can no longer be ignored and may even become the key problem.
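As a point of comparison, exact counting can be sketched in Python with a set. This is only an illustration (the function name exact_cardinality is made up here): every distinct value has to stay in memory so that later values can be checked against it.

```python
def exact_cardinality(values):
    """Count distinct values exactly by remembering every value seen so far."""
    seen = set()
    for value in values:
        # The membership check only works because all earlier values are stored.
        if value not in seen:
            seen.add(value)
    return len(seen)


print(exact_cardinality(["a", "b", "c", "d", "a"]))  # a is counted only once
```

The memory held by seen grows with the number of distinct values, which is exactly the cost HyperLogLog avoids.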

There are generally three ways to count the cardinality of a set with Redis: HashMap, BitMap and HyperLogLog. With the first two, memory consumption increases sharply as the size of the set grows by orders of magnitude; with HyperLogLog it does not.

Redis's HyperLogLog trades accuracy for memory: it needs only 12K of memory to count up to 2^64 elements with a standard error of 0.81%. This makes HyperLogLog well suited to scenarios such as daily and monthly active user statistics, where high accuracy is not required.

This is a surprising result: such a small amount of memory can record the cardinality of such a large amount of data. Below we take a closer look at how HyperLogLog is used, its basic principles, its implementation in the Redis source code, and a concrete experiment.

HyperLogLog in Redis

Redis provides three commands for working with HyperLogLog: PFADD, PFCOUNT and PFMERGE.

PFADD adds elements to a HyperLogLog.

> PFADD visitors alice bob carol
(integer) 1
> PFCOUNT visitors
(integer) 3

If the approximate cardinality estimated by the HyperLogLog changes after PFADD is executed, the command returns 1; otherwise it returns 0. If the given key does not exist when the command is executed, Redis first creates an empty HyperLogLog structure and then executes the command.

PFCOUNT returns the approximate cardinality of the given HyperLogLog. After computing the cardinality, PFCOUNT stores the value in a cache inside the HyperLogLog, so the cardinality does not need to be computed again until the next PFADD that succeeds in changing it.

PFMERGE merges multiple HyperLogLogs into one. The cardinality of the merged HyperLogLog is close to the cardinality of the union of the sets observed by all the input HyperLogLogs.

> PFADD customers alice dan
(integer) 1
> PFMERGE everyone visitors customers
OK
> PFCOUNT everyone
(integer) 4

Memory consumption comparison test

Below we compare the memory consumption of three data structures, HashMap, BitMap and HyperLogLog, with an actual experiment.

We first use a Lua script to insert a given number of numbers into the corresponding Redis data structure, then run the bgsave command, and finally use the rdb command from redis-rdb-tools to view the memory occupied by each key.

The Lua script is shown below. Readers who are not familiar with running Lua scripts in Redis can refer to my earlier article "Distributed rate limiting based on Redis and Lua".

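-- KEYS[1]: target key
-- ARGV[1]: how many numbers to insert
-- ARGV[2]: 0 = hash, 1 = HyperLogLog, anything else = bitmap
local key = KEYS[1]
local size = tonumber(ARGV[1])
local method = tonumber(ARGV[2])

for i=1,size,1 do
  if (method == 0)
  then
    redis.call('hset',key,i,1)      -- one hash field per number
  elseif (method == 1)
  then
    redis.call('pfadd',key, i)      -- add the number to the HyperLogLog
  else
    redis.call('setbit', key, i, 1) -- set the bit at offset i
  end
end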

We load the Lua script into Redis with the script load command of redis-cli, then use evalsha to insert ten million numbers into each of the three data structures (HashMap, HyperLogLog and BitMap), and finally use the rdb command to view the memory consumption of each structure.

[root@VM_0_11_centos ~]# redis-cli -a 082203 script load "$(cat HyperLogLog.lua)"
"6255c6d0a1f32349f59fd2c8711e93f2fbc7ecf8"
[root@VM_0_11_centos ~]# redis-cli -a 082203 evalsha 6255c6d0a1f32349f59fd2c8711e93f2fbc7ecf8 1 hashmap 10000000 0
(nil)
[root@VM_0_11_centos ~]# redis-cli -a 082203 evalsha 6255c6d0a1f32349f59fd2c8711e93f2fbc7ecf8 1 hyperloglog 10000000 1
(nil)
[root@VM_0_11_centos ~]# redis-cli -a 082203 evalsha 6255c6d0a1f32349f59fd2c8711e93f2fbc7ecf8 1 bitmap 10000000 2
(nil)


[root@VM_0_11_centos ~]# rdb -c memory dump.rdb 
database,type,key,size_in_bytes,encoding,num_elements,len_largest_element,expiry

0,string,bitmap,1310768,string,1250001,1250001,
0,string,hyperloglog,14392,string,12304,12304,
0,hash,hashmap,441326740,hashtable,10000000,8,

We ran two experiments, inserting ten thousand numbers and ten million numbers respectively. The memory consumption of the three data structures is shown below.

statistics table

The table shows that at the ten-thousand scale BitMap consumes the least memory, while at the ten-million scale HyperLogLog consumes the least. Overall, HyperLogLog's memory consumption stays at 14392 bytes regardless of scale, which shows its unique advantage in memory consumption.
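The sizes that rdb reports for the bitmap and hyperloglog keys can be checked with a little arithmetic. The sketch below is a back-of-the-envelope calculation, not the output of any tool. The helper names are hypothetical, and the 16-byte header is an assumption derived from the hllhdr fields shown in the source code section (4 + 1 + 3 + 8 bytes).

```python
def bitmap_bytes(max_offset):
    """Bytes needed for a bitmap whose highest set bit offset is max_offset."""
    return max_offset // 8 + 1


def hll_dense_bytes(buckets=2 ** 14, bits_per_bucket=6, header_bytes=16):
    """Bytes for a dense HyperLogLog: header plus tightly packed buckets."""
    return header_bytes + buckets * bits_per_bucket // 8


print(bitmap_bytes(10_000_000))  # the Lua script sets offsets 1..10,000,000
print(hll_dense_bytes())         # does not depend on how many elements were added
```

These evaluate to 1250001 and 12304, which match the num_elements column of the rdb output above. The size_in_bytes column is somewhat larger, presumably because it also includes per-key overhead.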

Basic principles

HyperLogLog is a probabilistic data structure: it uses a probabilistic algorithm to approximate cardinality. The idea behind the algorithm comes from the Bernoulli process.

A Bernoulli process is a coin-tossing experiment. When a fair coin is tossed it may land heads or tails, each with probability 1/2. In the Bernoulli process used here, we keep tossing the coin until it lands heads and record the number of tosses k. For example, if the first toss lands heads, k is 1; if the first toss lands tails, we keep tossing, and if heads does not appear until the third toss, k is 3.

For n Bernoulli processes we get n toss counts $k_1, k_2, \dots, k_n$, and we call the largest of them $k_{max}$.

After some mathematical derivation, we can draw a conclusion: $2^{k_{max}}$ can be used as an estimate of n. In other words, the number of Bernoulli processes that were carried out can be roughly estimated from the maximum toss count.
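The conclusion can be tried out with a small simulation. This is a minimal sketch in Python, not part of Redis. The function names and the fixed seed are arbitrary choices made for illustration.

```python
import random


def tosses_until_heads(rng):
    """Toss a fair coin until it lands heads; return the number of tosses k."""
    k = 1
    while rng.random() < 0.5:  # treat a value below 0.5 as tails
        k += 1
    return k


def estimate_runs(n, rng):
    """Run n Bernoulli processes and estimate n as 2 ** k_max."""
    k_max = max(tosses_until_heads(rng) for _ in range(n))
    return 2 ** k_max


rng = random.Random(42)  # fixed seed so repeated runs print the same numbers
for n in (100, 10_000, 100_000):
    print(n, estimate_runs(n, rng))
```

A single maximum is a noisy estimator, so the printed estimates can easily be off by a factor of two or more. That is the error the bucketing described below is meant to reduce.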

schematic diagram

Below we explain how HyperLogLog simulates the Bernoulli process and, in the end, estimates the cardinality of a set.

When an element is added, HyperLogLog uses a hash function to convert the element into a 64-bit bit string. For example, the input 5 is converted to 101 (leading zeros are omitted, here and below). Such a bit string resembles a Bernoulli coin-tossing process: 0 stands for tails and 1 stands for heads. If a piece of data is eventually converted to 10010000, then reading from the low bit to the high bit we can regard the bit string as one Bernoulli process in which the first 1 appears at position 5, that is, heads appears only on the fifth toss.
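Reading a bit string from the low bit to the high bit until the first 1 can be sketched as follows. The helper name first_one_position is hypothetical, and the bit trick is just one convenient way to do it in Python.

```python
def first_one_position(bits):
    """1-based position of the lowest set bit, reading from low to high."""
    if bits == 0:
        raise ValueError("no 1 bit: heads never came up")
    # bits & -bits isolates the lowest set bit; bit_length gives its position.
    return (bits & -bits).bit_length()


print(first_one_position(0b10010000))  # 5: four tails, then heads on toss five
print(first_one_position(0b101))       # 1: heads on the first toss
```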

The basic idea of HyperLogLog is therefore to take the bit strings of all the numbers in a set and use the largest position at which the first 1 appears to estimate the cardinality of the whole set. This estimate has a large error, however. To reduce the error, HyperLogLog introduces bucketed averaging: the elements are spread over m buckets and the harmonic mean of the buckets is used.

schematic diagram

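HyperLogLog in Redis uses 2^14 buckets in total, that is, 16384 buckets. Each bucket is a 6-bit array, as shown in the figure below.

bucket diagram

HyperLogLog takes the low 14 bits of the 64-bit bit string described above; their value is the bucket index. It then stores in that bucket the position at which the first 1 appears in the remaining 50 bits. The largest possible position within 50 bits is 50, so the 6-bit array in each bucket is enough to hold the value.

Before storing, the new value is compared with the old value in the bucket. The bucket is updated only if the new value is larger; otherwise it is left unchanged. An example is shown in the figure below.

example diagram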
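The bucket selection and the "only store a larger value" rule can be sketched as below. This is a simplified illustration, not the Redis implementation: hash64 uses BLAKE2b as a stand-in 64-bit hash, the function names are hypothetical, and the sentinel bit (which makes an all-zero remainder yield 51) is an assumption of this sketch.

```python
import hashlib

P = 14        # low bits used as the bucket index
Q = 64 - P    # remaining 50 bits used for the first-1 position
M = 1 << P    # 16384 buckets


def hash64(element):
    """Stand-in 64-bit hash (assumption: not the hash function Redis uses)."""
    digest = hashlib.blake2b(element.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def split_hash(h):
    """Return (bucket index, position of the first 1 in the remaining bits)."""
    index = h & (M - 1)
    rest = (h >> P) | (1 << Q)  # sentinel bit: an all-zero remainder gives Q + 1
    return index, (rest & -rest).bit_length()


def add(buckets, element):
    """Store the position in its bucket only if it is larger than the old value."""
    index, position = split_hash(hash64(element))
    if position > buckets[index]:
        buckets[index] = position
        return True
    return False


buckets = [0] * M
print(add(buckets, "alice"), add(buckets, "alice"))  # True False
```

Adding the same element twice changes nothing the second time, which is why duplicates do not affect the estimate.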

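At this point, for performance reasons, the current cardinality is not recomputed. Instead, the flag bit in the card field of the HyperLogLog header is set to 1, which means that the cached value is stale and has to be recomputed the next time pfcount is executed. Later, when pfcount finds the flag set, it recomputes the cardinality and stores it in the cardinality cache.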
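The lazy cache can be sketched like this. The exact position of the flag is an assumption here (the highest bit of the last card byte, with the cached value stored little-endian), and the function names are hypothetical rather than taken from Redis.

```python
CACHE_INVALID = 0x80  # assumed: highest bit of the last card byte


def invalidate_cache(card):
    """Mark the cached cardinality as stale (done when a bucket changes)."""
    card[7] |= CACHE_INVALID


def cache_is_valid(card):
    return (card[7] & CACHE_INVALID) == 0


def store_cache(card, cardinality):
    """Write a freshly computed cardinality; this also clears the stale flag."""
    card[:] = cardinality.to_bytes(8, "little")


def pfcount(card, recompute):
    """Return the cached value, recomputing it only if the flag is set."""
    if not cache_is_valid(card):
        store_cache(card, recompute())
    return int.from_bytes(card, "little")


card = bytearray(8)
invalidate_cache(card)             # e.g. after an add changed a bucket
print(pfcount(card, lambda: 3))    # flag set: recomputes and caches 3
print(pfcount(card, lambda: 999))  # cache still valid: prints 3 again
```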

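When the approximate cardinality is computed, the value in each bucket is read and plugged into the DV formula mentioned above; after harmonic averaging and result correction, we obtain the estimated cardinality.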
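As a rough illustration of the harmonic mean and the correction step, here is the textbook HyperLogLog estimator in Python. It is assumed to correspond to the DV formula referred to in the text; the estimator actually used in the Redis source may differ in its details, and the function name estimate is hypothetical.

```python
import math


def estimate(buckets):
    """Textbook HyperLogLog estimate from a list of bucket values."""
    m = len(buckets)
    alpha = 0.7213 / (1 + 1.079 / m)  # bias constant, valid for m >= 128
    # Harmonic mean: each bucket contributes 2 ** -value to the denominator.
    raw = alpha * m * m / sum(2.0 ** -value for value in buckets)
    zeros = buckets.count(0)
    if raw <= 2.5 * m and zeros > 0:
        # Small-range correction: fall back to linear counting on empty buckets.
        return m * math.log(m / zeros)
    return raw


print(estimate([0] * 16384))  # 0.0, no bucket has been touched yet
```

The harmonic mean is used because it is far less sensitive than the arithmetic mean to a few buckets that happen to hold unusually large values.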

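Redis source code analysis

Let's first look at the definition of the HyperLogLog object.

struct hllhdr {
    char magic[4];      /* Magic value "HYLL" */
    uint8_t encoding;   /* Dense or sparse layout: HLL_DENSE or HLL_SPARSE. */
    uint8_t notused[3]; /* Reserved, all zero. */
    uint8_t card[8];    /* Cached cardinality */
    uint8_t registers[]; /* Data bytes */
};

The registers array in the HyperLogLog object holds the buckets. It has two storage layouts, a dense one and a sparse one. The two differ only in how the buckets are stored and represented, and they show how far Redis goes to save memory.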

Dense storage structure

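Let's start with the relatively simple dense layout, which is very straightforward: since we need 2^14 buckets of 6 bits each, we simply use enough uint8_t bytes to hold them. The only complication is converting between byte positions and buckets, because a byte has 8 bits while a bucket needs only 6.

So we need to convert a bucket index into the corresponding byte offset offset_bytes and the bit offset offset_bits within that byte. Note that the layout is little-endian, with the high bits on the right, so the bits have to be reversed.

When offset_bits is less than or equal to 2, the bucket lies entirely within that byte, and reversing the byte is enough to obtain the bucket's value.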

schematic diagram

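If offset_bits is greater than 2, the bucket is spread over two bytes. In that case the contents of both bytes are reversed and then concatenated to obtain the bucket's value, as shown in the figure below.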

schematic diagram
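The offset arithmetic can be sketched in Python with shifts and masks, which gives the same result as the reverse-and-concatenate picture. The function names are hypothetical, the sketch assumes values between 0 and 63, and it is not a transcription of the Redis macros.

```python
BITS = 6
MAX_VALUE = (1 << BITS) - 1  # 63


def locate(index):
    """Return (offset_bytes, offset_bits) of the bucket's first bit."""
    first_bit = index * BITS
    return first_bit // 8, first_bit % 8


def get_bucket(data, index):
    offset_bytes, offset_bits = locate(index)
    low = data[offset_bytes] >> offset_bits
    if offset_bits <= 2:
        return low & MAX_VALUE                  # bucket fits in one byte
    high = data[offset_bytes + 1] << (8 - offset_bits)
    return (low | high) & MAX_VALUE             # bucket spans two bytes


def set_bucket(data, index, value):
    offset_bytes, offset_bits = locate(index)
    data[offset_bytes] &= ~(MAX_VALUE << offset_bits) & 0xFF
    data[offset_bytes] |= (value << offset_bits) & 0xFF
    if offset_bits > 2:
        spill = 8 - offset_bits                 # bits that fit in the first byte
        data[offset_bytes + 1] &= ~(MAX_VALUE >> spill) & 0xFF
        data[offset_bytes + 1] |= value >> spill


data = bytearray(2 ** 14 * BITS // 8)  # 12288 bytes for 16384 buckets
set_bucket(data, 1, 0b101011)          # bucket 1 starts at bit 6, spans bytes 0 and 1
print(locate(1), get_bucket(data, 1))  # (0, 6) 43
```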

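Sparse storage structure

HyperLogLog's sparse layout exists to save memory. Unlike the dense layout, it does not actually allocate that many bytes to represent the 2^14 buckets; it uses a special byte encoding instead.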

schematic diagram

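To express the sparse layout conveniently, Redis assigns an opcode to each of the three byte forms above.

  • ZERO: one byte, representing a run of consecutive buckets whose count is 0. The first two bits are the flag 00, and the last 6 bits give the number of buckets, at most 64.
  • XZERO: two bytes, representing a run of consecutive buckets whose count is 0. The first two bits are the flag 01, and the last 14 bits give the number of buckets, at most 16384.
  • VAL: one byte, representing a run of consecutive buckets that share the same count. The first bit is the flag 1, the next five bits give the count stored in the buckets, so the largest count that can be represented is 32, and the last two bits give the number of consecutive buckets.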

schematic diagram

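So a HyperLogLog object in its initial state needs only 2 bytes, a single XZERO, to store its data, instead of consuming 12K of memory. When only a few elements have been inserted, a small number of XZERO, VAL and ZERO opcodes is enough to represent it, as shown in the figure below.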

schematic diagram
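A decoder for the three opcodes can be sketched as follows. It assumes that run lengths and counts are stored minus one, which is what lets 6 bits reach 64, 14 bits reach 16384 and 5 bits reach 32. The function name decode_sparse is hypothetical.

```python
def decode_sparse(data):
    """Expand sparse opcodes into a flat list of bucket counts."""
    buckets = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte & 0x80:                       # VAL: 1vvvvvll
            value = ((byte >> 2) & 0x1F) + 1  # count 1..32
            run = (byte & 0x03) + 1           # 1..4 buckets
            buckets.extend([value] * run)
            i += 1
        elif byte & 0x40:                     # XZERO: 01xxxxxx yyyyyyyy
            run = (((byte & 0x3F) << 8) | data[i + 1]) + 1  # 1..16384 buckets
            buckets.extend([0] * run)
            i += 2
        else:                                 # ZERO: 00xxxxxx
            buckets.extend([0] * ((byte & 0x3F) + 1))       # 1..64 buckets
            i += 1
    return buckets


empty = decode_sparse(bytes([0x7F, 0xFF]))  # a single XZERO covering every bucket
print(len(empty), sum(empty))               # 16384 0
print(decode_sparse(bytes([0x01, 0x89])))   # [0, 0, 3, 3]
```

The first call shows why an empty HyperLogLog fits in two bytes: one XZERO opcode is enough to describe all 16384 empty buckets.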

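Redis converts a HyperLogLog from the sparse layout to the dense layout when either of the following happens:

  • Any count changes from 32 to 33, because the VAL opcode can no longer hold it; the largest count it can represent is 32.
  • The total number of bytes used by the sparse layout exceeds 3000 bytes. This threshold can be adjusted with the hll-sparse-max-bytes configuration parameter.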
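The two conditions amount to a simple check. The sketch below only restates them in Python; the function and constant names are hypothetical and do not come from the Redis source.

```python
VAL_MAX_COUNT = 32       # largest count a VAL opcode can hold
SPARSE_MAX_BYTES = 3000  # default threshold quoted above


def must_convert_to_dense(new_count, sparse_bytes, max_bytes=SPARSE_MAX_BYTES):
    """True if the sparse layout can no longer be used after an update."""
    return new_count > VAL_MAX_COUNT or sparse_bytes > max_bytes


print(must_convert_to_dense(33, 120))   # True: the count no longer fits in VAL
print(must_convert_to_dense(12, 3001))  # True: the sparse encoding grew too large
print(must_convert_to_dense(12, 120))   # False: stays sparse
```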

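Because the HyperLogLog source code in Redis involves a lot of bit manipulation, we will not walk through it in more detail here. If you have a better understanding of HyperLogLog or any questions, feel free to leave a comment.

References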

https://thoughtbot.com/blog/hyperloglogs-in-redis
https://juejin.im/post/5c7fe7525188251ba53b0623
https://juejin.im/post/5bef9c706fb9a049c23204a3

Origin yq.aliyun.com/articles/705688