Cardinality estimation

2016-12-02 zoyanhui

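Cardinality is the number of distinct values in a set.
Counting cardinality is therefore a count after deduplication. For example, the collection {0, 1, 2, 2, 4, 5} has cardinality 5 but contains 6 items, because 2 appears twice.
Cardinality estimation estimates the number of distinct values in a collection. It is neither an estimate of the total amount of data nor an exact computation of the cardinality. Instead it uses probabilistic algorithms to estimate the cardinality with a small error, at low cost in space and time.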
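
As a baseline for comparison, here is a minimal sketch of exact distinct counting in Python. The function name `exact_cardinality` is made up for this illustration. It has to keep every distinct value in memory, which is the cost the estimators below try to avoid.

```python
def exact_cardinality(items):
    """Return the number of distinct values by storing each one in a set."""
    return len(set(items))


data = [0, 1, 2, 2, 4, 5]
print(len(data))                # 6 items in total
print(exact_cardinality(data))  # 5 distinct values
```

The memory used by the set grows linearly with the number of distinct values, so this approach stops being practical for very large streams.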

Flajolet–Martin algorithm

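This is a probabilistic algorithm. It hashes each element of the set, as uniformly as possible, into the range
![](http://www.forkosh.com/mathtex.cgi? \Small [0, 2^L-1])

Let the value of the k-th bit in the binary representation of y be
![](http://www.forkosh.com/mathtex.cgi? \Small bit(y,k))
so that
![](http://www.forkosh.com/mathtex.cgi? \Small y=\sum_{k \ge 0}bit(y,k)2^k)
Define ρ(y) as the position of the least-significant 1 bit of y:
![](http://www.forkosh.com/mathtex.cgi? \Small \rho(y)=\min\{k \ge 0 : bit(y,k)\ne0\}, \quad \rho(0)=L)
The algorithm is as follows: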

Initialize a bit-vector BITMAP of length L to all 0's.

For each element x in M:
![](http://www.forkosh.com/mathtex.cgi? \Small y=hash(x); \quad index=\rho(y) )
![](http://www.forkosh.com/mathtex.cgi? \Small BITMAP[index]=1 )
Let R denote the smallest index i such that
![](http://www.forkosh.com/mathtex.cgi? \Small BITMAP[i]=0 )
Estimate the cardinality of M as
![](http://www.forkosh.com/mathtex.cgi? \Small 2^{R}/\phi )
where ![](http://www.forkosh.com/mathtex.cgi? \Small \phi \approx 0.77351 )
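
The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the hash is SHA-256 truncated to 32 bits (any reasonably uniform hash would do), and the names `hash32`, `rho`, and `fm_estimate` are invented for this example. Hash values equal to 0, for which ρ = L, are simply skipped because they fall outside the bitmap.

```python
import hashlib

L = 32         # number of hash bits and length of the bitmap
PHI = 0.77351  # correction factor quoted above


def hash32(x):
    """Map x to an integer in [0, 2**L - 1] (assumed hash: truncated SHA-256)."""
    digest = hashlib.sha256(str(x).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def rho(y):
    """0-based position of the least-significant 1 bit of y, with rho(0) = L."""
    if y == 0:
        return L
    return (y & -y).bit_length() - 1


def fm_estimate(items):
    """Single-bitmap Flajolet-Martin estimate of the number of distinct items."""
    bitmap = [0] * L
    for x in items:
        index = rho(hash32(x))
        if index < L:  # rho(0) = L has no slot in the bitmap
            bitmap[index] = 1
    r = 0
    while r < L and bitmap[r] == 1:
        r += 1  # r ends at the smallest index whose bit is still 0
    return 2 ** r / PHI


print(fm_estimate(range(10_000)))
```

With a single bitmap the estimate can only take values of the form 2^R/φ, so it is coarse. It is typically within a small factor of the true cardinality rather than within a few percent.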

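Idea behind the algorithm:

Suppose M contains n distinct values and each x in M is hashed uniformly into [0, 2^L-1]. Then BITMAP[0] is set by about n/2 of them: ρ(y) = 0 exactly when the lowest bit of y is 1, that is, when y is odd, and under a uniform distribution about half of the hashed values are odd. So roughly n/2 values set BITMAP[0] = 1, and by the same argument roughly n/4 set BITMAP[1] = 1.
It follows that a position i well above log2(n) almost certainly has BITMAP[i] = 0, while a position well below log2(n) almost certainly has BITMAP[i] = 1. R, the first position still at 0, marks roughly how high n distinct values can push the 1s in BITMAP, so n is approximately 2^R. For the correction factor and a rigorous proof, see the paper referenced below.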
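
The halving argument can be checked empirically. The sketch below hashes n distinct inputs and counts how many land on each value of ρ. The counts should be close to n/2, n/4, n/8, and so on. The hash (truncated SHA-256) and the helper names are assumptions made for this illustration.

```python
import hashlib
from collections import Counter

L = 32  # number of hash bits


def hash32(x):
    """Map x to an integer in [0, 2**L - 1] (assumed hash: truncated SHA-256)."""
    digest = hashlib.sha256(str(x).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def rho(y):
    """0-based position of the least-significant 1 bit of y, with rho(0) = L."""
    if y == 0:
        return L
    return (y & -y).bit_length() - 1


def rho_histogram(n):
    """Count how many of the inputs 0..n-1 hash to each value of rho."""
    return Counter(rho(hash32(i)) for i in range(n))


n = 100_000
hist = rho_histogram(n)
for i in range(5):
    # observed count next to the expected n / 2**(i + 1)
    print(i, hist[i], n / 2 ** (i + 1))
```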

[References]

Flajolet, Philippe; Martin, G. Nigel (1985). "Probabilistic counting algorithms for data base applications" (PDF). Journal of Computer and System Sciences. 31 (2): 182–209. doi:10.1016/0022-0000(85)90041-8.

LogLog algorithm

LogLog is also a probabilistic algorithm. It uses m small registers (buckets), and the larger m is, the more accurate the estimate. The procedure is as follows.

As in the algorithm above, the original elements are first hashed into a multiset M of hashed values:
![](http://www.forkosh.com/mathtex.cgi? \Small M: Multiset \ of \ hashed \ values; \ m = 2^k )

Initialize the m registers to 0:
![](http://www.forkosh.com/mathtex.cgi? \Small s^{(i)} := 0; \ 1 \le i \le m; \ m=2^k )
Define ρ(y) as the position of the first 1 bit of y, with bits numbered from 1 starting at the leftmost bit:
![](http://www.forkosh.com/mathtex.cgi? \Small \rho(y)=\min\{i \ge 1 : bit(y,i)\ne0\} \qquad [1])
For each x in M:
![](http://www.forkosh.com/mathtex.cgi? \Small for \quad x =b_1b_2\cdots \in M \quad do)
![](http://www.forkosh.com/mathtex.cgi? \Small set \quad j :=1 + \langle b_1 \cdots b_k \rangle_2; \quad value \ of \ first \ k \ bits \ in \ x; \ m=2^k )
![](http://www.forkosh.com/mathtex.cgi? \Small set \quad s^{(j)} := \max \lbrack s^{(j)}, \rho(b_{k+1}b_{k+2}\cdots) \rbrack ; )
Compute the cardinality estimate:
![](http://www.forkosh.com/mathtex.cgi? \Small return \quad E := \alpha_m m 2^{\frac{1}{m}\sum_{j}s^{(j)}} )
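
A minimal Python sketch of this procedure follows. Several things here are assumptions made for illustration: the hash is SHA-256 truncated to 32 bits, bucket indices are 0-based (so j = ⟨b_1…b_k⟩_2 without the leading 1 +), ρ is taken as the 1-based position of the leftmost 1 bit, and the constant 0.39701 is the large-m limit of α_m given in the LogLog paper, used in place of the exact α_m.

```python
import hashlib

HASH_BITS = 32
ALPHA_INF = 0.39701  # large-m limit of alpha_m, used as an approximation


def hash32(x):
    """Map x to a 32-bit integer (assumed hash: truncated SHA-256)."""
    digest = hashlib.sha256(str(x).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def rank(w, width):
    """1-based position of the leftmost 1 bit in a width-bit word.

    Returns width + 1 when w is 0.
    """
    return width - w.bit_length() + 1


def loglog_estimate(items, k=10):
    """LogLog estimate using m = 2**k registers."""
    m = 1 << k
    rest = HASH_BITS - k
    registers = [0] * m
    for x in items:
        h = hash32(x)
        j = h >> rest              # first k bits select the bucket
        w = h & ((1 << rest) - 1)  # remaining bits feed rho
        registers[j] = max(registers[j], rank(w, rest))
    mean = sum(registers) / m      # arithmetic mean of the registers
    return ALPHA_INF * m * 2 ** mean


print(loglog_estimate(f"user-{i}" for i in range(100_000)))
```

Each register only stores a small integer (at most 23 with these parameters), which is where the log log in the name comes from.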

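Idea behind the algorithm:

1. LogLog improves on the Flajolet–Martin algorithm above: the elements of M are spread over m buckets, and each bucket is then handled in the same spirit as Flajolet–Martin.
2. In a set M with n distinct elements, about n/2^r of them have a ρ value (formula [1] in the LogLog algorithm) equal to r. For r = 1 this is the n/2 elements that set BITMAP[0] = 1 (see the Flajolet–Martin algorithm). Hence
![](http://www.forkosh.com/mathtex.cgi? \Small R(M) := \max_{x \in M}\rho(x) )
can serve as a rough estimate of ![](http://www.forkosh.com/mathtex.cgi? \Small \log_2n ).
3. The value of the first k bits of x is used as the bucket index among the m buckets, and x is placed in that bucket. Each bucket therefore receives about n/m elements.
4. Applying the estimate from step 2 to each bucket i gives an estimate of that bucket's cardinality n_i:
![](http://www.forkosh.com/mathtex.cgi? \Small \log_2{n_i} \approx s^{(i)})
5. Combining steps 3 and 4, the expected cardinality of each bucket is n/m, so its logarithm is approximated by the mean of the per-bucket estimates:
![](http://www.forkosh.com/mathtex.cgi? \Small \log_2{\frac{n}{m}} \approx \frac{1}{m}\sum_{j=1}^{m}s^{(j)})
So roughly,
![](http://www.forkosh.com/mathtex.cgi? \Small E \approx m 2^{\frac{1}{m}\sum_{j}s^{(j)}} )
A more careful analysis gives
![](http://www.forkosh.com/mathtex.cgi? \Small E := \alpha_m m 2^{\frac{1}{m}\sum_{j}s^{(j)}} )
where
![](http://www.forkosh.com/mathtex.cgi? \Small \alpha_m := \lbrack \Gamma(-1/m)\frac{1-2^{1/m}}{\log 2} \rbrack ^{-m}; \quad \Gamma(s)=\frac1s\int_{0}^{\infty}e^{-t}t^s dt)
corrects the systematic bias introduced by averaging.
For the detailed derivation and proofs, see the paper referenced below.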
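
The correction factor α_m can be evaluated numerically from the formula above. The sketch below uses `math.gamma` from the Python standard library, which accepts negative non-integer arguments. The function name `loglog_alpha` is made up for this example. As m grows, the values approach roughly 0.397.

```python
import math


def loglog_alpha(m):
    """alpha_m = (Gamma(-1/m) * (1 - 2**(1/m)) / ln 2) ** (-m), for m >= 2."""
    if m < 2:
        raise ValueError("m must be at least 2 (Gamma has a pole at -1)")
    # -expm1(ln 2 / m) equals 1 - 2**(1/m) but avoids cancellation for large m.
    one_minus_root = -math.expm1(math.log(2) / m)
    base = math.gamma(-1.0 / m) * one_minus_root / math.log(2)
    return base ** (-m)


for m in (16, 64, 256, 1024):
    print(m, loglog_alpha(m))
```

Both Γ(-1/m) and 1 - 2^(1/m) are negative, so their product is positive and the power is well defined.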

[References]

Durand, Marianne; Flajolet, Philippe (2003). "Loglog Counting of Large Cardinalities" (PDF). Algorithms - ESA 2003. Lecture Notes in Computer Science. 2832. p. 605. doi:10.1007/978-3-540-39658-1_55. ISBN 978-3-540-20064-2.

HyperLogLog algorithm

HyperLogLog improves on LogLog by changing how the m buckets are averaged: LogLog uses the geometric mean, whereas HyperLogLog uses the harmonic mean.
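
To see why the choice of mean matters, the sketch below compares the two means of 2^s over a small set of registers in which one register holds an unusually large value. The function names and the register values are invented for this illustration. The harmonic mean is pulled up much less by the outlier.

```python
def geometric_mean_of_powers(registers):
    """Geometric mean of 2**s over the registers (the LogLog-style average)."""
    return 2 ** (sum(registers) / len(registers))


def harmonic_mean_of_powers(registers):
    """Harmonic mean of 2**s over the registers (the HyperLogLog-style average)."""
    return len(registers) / sum(2.0 ** -s for s in registers)


even = [5] * 16
skewed = [5] * 15 + [20]  # one register with an outlier value

print(geometric_mean_of_powers(even), harmonic_mean_of_powers(even))      # both 32
print(geometric_mean_of_powers(skewed), harmonic_mean_of_powers(skewed))  # roughly 61 vs roughly 34
```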

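The practical HyperLogLog algorithm adds further corrections, applied separately to the small, intermediate, and large ranges of the estimate. The algorithm is described below, for cardinalities in the range [0, 10^9]: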

![](http://www.forkosh.com/mathtex.cgi? \Small Let \quad h: D \to \{0,1\}^{32} \quad hash \ data \ from \ D \ to \ binary \ 32-bit \ words.)
![](http://www.forkosh.com/mathtex.cgi? \Small \rho(y)=\min\{i \ge 1 : bit(y,i)\ne0\} )
![](http://www.forkosh.com/mathtex.cgi? \Small define \quad \alpha_{16}=0.673; \ \alpha_{32}=0.697; \ \alpha_{64}=0.709; \ \alpha_{m}=0.7213/(1+1.079/m) \ for \ m \ge 128; )
Program HyperLogLog(input M: multiset of items from domain D):

![](http://www.forkosh.com/mathtex.cgi? \Small assume \quad m = 2^k \ with \ k \in [4 \cdots 16] )
![](http://www.forkosh.com/mathtex.cgi? \Small initialize \ a \ collection \ of \ m \ registers, \ s^{(1)},s^{(2)},\cdots,s^{(m)}, \ to \ 0;)
![](http://www.forkosh.com/mathtex.cgi? \Small for \quad v \in M \quad do)
![](http://www.forkosh.com/mathtex.cgi? \Small set \quad x := h(v) = b_1b_2\cdots )
![](http://www.forkosh.com/mathtex.cgi? \Small set \quad j :=1 + \langle b_1 \cdots b_k \rangle_2; \quad value \ of \ first \ k \ bits \ in \ x; \ m=2^k )
![](http://www.forkosh.com/mathtex.cgi? \Small set \quad s^{(j)} := \max \lbrack s^{(j)}, \rho(b_{k+1}b_{k+2}\cdots) \rbrack ; )
![](http://www.forkosh.com/mathtex.cgi? \Small compute \quad E := \alpha_m m^2 \lbrack \sum_{j=1}^{m}2^{-s^{(j)}} \rbrack ^{-1} )
small range correction
![](http://www.forkosh.com/mathtex.cgi? \Small if \ E \le \frac52m \ then )
![](http://www.forkosh.com/mathtex.cgi? \Small let \ V \ be \ the \ number \ of \ registers \ equal \ to \ 0 )
![](http://www.forkosh.com/mathtex.cgi? \Small if \ V \ne 0 \ then \ set \ E^{*}:=m\log(m/V) \ else \ set \ E^{*}:=E; \ endif)
![](http://www.forkosh.com/mathtex.cgi? \Small \ endif)
intermediate range - no correction
![](http://www.forkosh.com/mathtex.cgi? \Small \ if \ E \le \frac1{30}2^{32} \ then \ set \ E^{*}:=E; \ endif)
large range correction
![](http://www.forkosh.com/mathtex.cgi? \Small \ if \ E > \frac1{30}2^{32} \ then \ set \ E^{*}:=-2^{32}\log(1-E/2^{32}); \ endif)
return
![](http://www.forkosh.com/mathtex.cgi? \Small cardinality \ estimate \ E^{*} \ with \ typical \ relative \ error \ \pm1.04/\sqrt{m})
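
The program above can be sketched in Python roughly as follows. This is an illustration under assumptions, not a production implementation: the hash h is SHA-256 truncated to 32 bits, bucket indices are 0-based, the three range branches are treated as mutually exclusive, and the names `hash32`, `alpha`, and `hyperloglog_estimate` are invented for this example.

```python
import hashlib
import math

HASH_BITS = 32


def hash32(x):
    """Map x to a 32-bit integer (assumed hash: truncated SHA-256)."""
    digest = hashlib.sha256(str(x).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def alpha(m):
    """Bias-correction constant alpha_m from the definitions above."""
    if m == 16:
        return 0.673
    if m == 32:
        return 0.697
    if m == 64:
        return 0.709
    return 0.7213 / (1 + 1.079 / m)


def hyperloglog_estimate(items, k=10):
    """HyperLogLog estimate with m = 2**k registers and range corrections."""
    if not 4 <= k <= 16:
        raise ValueError("k must be between 4 and 16")
    m = 1 << k
    rest = HASH_BITS - k
    registers = [0] * m
    for v in items:
        x = hash32(v)
        j = x >> rest              # first k bits select the register
        w = x & ((1 << rest) - 1)  # remaining bits feed rho
        # rho: 1-based position of the leftmost 1 bit of w (rest + 1 if w == 0)
        registers[j] = max(registers[j], rest - w.bit_length() + 1)
    e = alpha(m) * m * m / sum(2.0 ** -s for s in registers)
    if e <= 2.5 * m:               # small range correction
        zeros = registers.count(0)
        return m * math.log(m / zeros) if zeros != 0 else e
    if e <= 2 ** 32 / 30:          # intermediate range, no correction
        return e
    return -(2 ** 32) * math.log(1 - e / 2 ** 32)  # large range correction


print(hyperloglog_estimate(f"user-{i}" for i in range(100_000)))
```

With k = 10 the typical relative error is about 1.04/√1024, a little over 3%, while the sketch keeps only 1024 small registers regardless of how many items are seen.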

[References]

Flajolet, P.; Fusy, E.; Gandouet, O.; Meunier, F. (2007). "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm" (PDF). AOFA ’07: Proceedings of the 2007 International Conference on the Analysis of Algorithms.

Comparison of cardinality estimation algorithms

[Figure: comparison of the cardinality estimation algorithms (Paste_Image.png, image not included)]

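Applications

Cardinality estimation is widely used. It is a good choice when storage space and real-time response matter a great deal and a certain amount of error in the result is acceptable.
Examples of applications:

Redis uses the HyperLogLog algorithm.
In database optimization, cardinality estimation is used to estimate the sizes of the projections, joins, and similar operations that a complex query involves.
Routers use cardinality estimation to analyze specific kinds of events online in real time, for example to guard against DoS attacks and port scans.
In complex network topologies, it is used to estimate the number of connected node pairs.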
