redis zskiplist

2019-08-15 本文已影响0人多多的大白

skiplist 是一种有序数据结构。它通过在每个节点中维持多个指向其他节点的指针，从而达到快速访问节点的目的。跳跃表支持平均O(log N) 最坏 O(N)
复杂度的节点查找。

1、跳表介绍

为了使得链表支持类似二分查找的算法，对原始的链表进行修改，修改后的链表就是跳跃表，简称跳表。跳表支持快速的插入、删除、查找操作，是一种动态的数据结构。

1-1

如果我们想查找图1-1 的链表里面的某个元素，那必须从头遍历到尾才能找到，时间复杂度为O(n)

1-2

我们可以给原始链表加一级索引，如上图1-2，从图中可以看出如果我们此时查找是6这个元素的，从红线表示看我们只需要查找5次，原始链表查询需要6次

1-3

我们可以给原始链表再增加一级索引，来减少我们的查找时间复杂度。

由于这个数据量不大，查找效率提升的不明显，对于数据量较大的时候，查找效率提高很快，拿空间换时间。
JAVA中也有跳表的实现-ConcurrentSkipListMap。

2、redis skiplist 结构

在redis中对外暴露的数据结构：sorted set，然后sorted set 底端不仅仅使用skiplist，还是使用ziplist和dict （关于ziplist 后面章节会介绍），那么具体什么情况下面才会使用skiplist

# Similarly to hashes and lists, sorted sets are also specially encoded in
# order to save a lot of space. This encoding is only used when the length and
# elements of a sorted set are below the following limits:
  zset-max-ziplist-entries 128
  zset-max-ziplist-value 64

通过redis.conf 上面这俩个配置来具体确定 ziplist 到skiplist的转换

zset-max-ziplist-entries ：当sorted set中的元素个数，即(数据, score)对的数目超过128的时。
zset-max-ziplist-value ：当sorted set中插入的任意一个数据的长度超过了64的时。

1-1

#define ZSKIPLIST_MAXLEVEL 32 /* Should be enough for 2^32 elements */
#define ZSKIPLIST_P 0.25      /* Skiplist P = 1/4 */
/* ZSETs use a specialized version of Skiplists */
typedef struct zskiplistNode {
    sds ele;
    double score;
    struct zskiplistNode *backward;
    struct zskiplistLevel {
        struct zskiplistNode *forward;
        unsigned int span;
    } level[];
} zskiplistNode;
typedef struct zskiplist {
    struct zskiplistNode *header, *tail;
    unsigned long length;
    int level;
} zskiplist;
typedef struct zset {
    dict *dict;
    zskiplist *zsl;
} zset;

上面代码出自redis-4.0 的 server.h ，对上面结构做些简单分析

ZSKIPLIST_MAXLEVEL：创建zskiplist会创建一个空的zskiplistNode （zskiplist.header的指向）该node.level[] 的长度即该参数，还有就是每次创建新node的时会随机数字，该随机数范围最大值就是该参数。生成的随机数为node.level[]的长度
ZSKIPLIST_P:生成随机数概率期望

/* Returns a random level for the new skiplist node we are going to create.
 * The return value of this function is between 1 and ZSKIPLIST_MAXLEVEL
 * (both inclusive), with a powerlaw-alike distribution where higher
 * levels are less likely to be returned. */
int zslRandomLevel(void) {
    int level = 1;
    while ((random()&0xFFFF) < (ZSKIPLIST_P * 0xFFFF))
        level += 1;
    return (level<ZSKIPLIST_MAXLEVEL) ? level : ZSKIPLIST_MAXLEVEL;
}

zskiplistNode定义了skiplist节点结构。
ele : 存储数据为sds 类型（关于sds 可参考前面章节）
score ：数据对应的分数
*backward：指向位于当前节点的前一个节点（双向链表提升遍历时间复杂度）
level ：节点中用 L1 、 L2 、 L3 等字样标记节点的各个层（参考图1-1）存放指向各层链表后一个节点的指针。forward表示每层的一个向后指针。span 表示记录两个节点之间的距离，可以用于计算元素排名(rank)。
如图1-1 我们需要找到o3这个元素对应的score为2.0，从header开始找到level最大一个元素即o3，o3.span 为3 ，在根据o3 的backward 找到o2，o3到o2的span为1，那么计算排名则为（3-1）-1 =1 ，rank是从0开始计算，所以从图中可以看出o2的rank。这也是redis对skiplist做的一层扩展，并且level[] 是一个柔性数组。
zskiplist 定义了skiplist真正结构
header和tail 分别指向skiplist 的头尾节点。
length ：记录所有的节点数，其中不包含新创建zskiplist时创建的空node。
level ：记录所有节点 zskiplistNode.level[] 长度最大的数字，其中也不计算新创建zskiplist时创建的空node（这个node的level[] 长度为32）

3、redis skiplist 创建过程以及skiplist的转换

创建过程

为什么把创建过程单独拿出来作为一项，因为笔者在刚刚开始的时候对跳表完全不了解，看了很多资料也还是一知半解。特别当笔者第一次看到图1-1（别人画的借用下）完全不理解header指向为什么是一个null的node 而且还是长度为32的level[]，所以这里面做下特殊的源码讲解。

/* Create a skiplist node with the specified number of levels.
 * The SDS string 'ele' is referenced by the node after the call. */
zskiplistNode *zslCreateNode(int level, double score, sds ele) {
    zskiplistNode *zn =
        zmalloc(sizeof(*zn)+level*sizeof(struct zskiplistLevel));
    zn->score = score;
    zn->ele = ele;
    return zn;
}
/* Create a new skiplist. */
zskiplist *zslCreate(void) {
    int j;
    zskiplist *zsl;
    zsl = zmalloc(sizeof(*zsl));
    zsl->level = 1;
    zsl->length = 0;
    zsl->header = zslCreateNode(ZSKIPLIST_MAXLEVEL,0,NULL);
    for (j = 0; j < ZSKIPLIST_MAXLEVEL; j++) {
        zsl->header->level[j].forward = NULL;
        zsl->header->level[j].span = 0;
    }
    zsl->header->backward = NULL;
    zsl->tail = NULL;
    return zsl;
}

上述代码出自redis-4.0 的 t_zset.c ，简单做下分析

zslCreate 创建skiplist 并且创建header 的node，可以看出来传过去的参数为ZSKIPLIST_MAXLEVEL，直接创建一个长度为32的level[]，看到这里就很清楚上图1-1 上面所画的。
zslCreateNode 创建具体node节点。

下面分析插入节点操作：

/* Insert a new node in the skiplist. Assumes the element does not already
 * exist (up to the caller to enforce that). The skiplist takes ownership
 * of the passed SDS string 'ele'. */
zskiplistNode *zslInsert(zskiplist *zsl, double score, sds ele) {
    zskiplistNode *update[ZSKIPLIST_MAXLEVEL], *x;
    unsigned int rank[ZSKIPLIST_MAXLEVEL];
    int i, level;
    serverAssert(!isnan(score));
    x = zsl->header;
    for (i = zsl->level-1; i >= 0; i--) {
        /* store rank that is crossed to reach the insert position */
        rank[i] = i == (zsl->level-1) ? 0 : rank[i+1];
        while (x->level[i].forward &&
                (x->level[i].forward->score < score ||
                    (x->level[i].forward->score == score &&
                    sdscmp(x->level[i].forward->ele,ele) < 0)))
        {
            rank[i] += x->level[i].span;
            x = x->level[i].forward;
        }
        update[i] = x;
    }
    /* we assume the element is not already inside, since we allow duplicated
     * scores, reinserting the same element should never happen since the
     * caller of zslInsert() should test in the hash table if the element is
     * already inside or not. */
    level = zslRandomLevel();
    if (level > zsl->level) {
        for (i = zsl->level; i < level; i++) {
            rank[i] = 0;
            update[i] = zsl->header;
            update[i]->level[i].span = zsl->length;
        }
        zsl->level = level;
    }
    x = zslCreateNode(level,score,ele);
    for (i = 0; i < level; i++) {
        x->level[i].forward = update[i]->level[i].forward;
        update[i]->level[i].forward = x;
        /* update span covered by update[i] as x is inserted here */
        x->level[i].span = update[i]->level[i].span - (rank[0] - rank[i]);
        update[i]->level[i].span = (rank[0] - rank[i]) + 1;
    }
    /* increment span for untouched levels */
    for (i = level; i < zsl->level; i++) {
        update[i]->level[i].span++;
    }
    x->backward = (update[0] == zsl->header) ? NULL : update[0];
    if (x->level[0].forward)
        x->level[0].forward->backward = x;
    else
        zsl->tail = x;
    zsl->length++;
    return x;
}

代码很长，总体思路大概分为3步：

根据目前传入的score找到插入位置x，这个过程会保存各层x的前一个位置节点。
根据随机函数获取level，并且生成新节点。
修改各个指针的指向，将创建的新节点插入。

skiplist的转换

前面介绍sorted set并不是一开始就用skiplist 而是后面才开始转换成的。通过上面介绍的zset-max-ziplist-entries 和zset-max-ziplist-value来控制的。

当数据量不满足zset-max-ziplist-entries 和zset-max-ziplist-value时，sorted set是由一个ziplist来实现的。
当满足zset-max-ziplist-entries 和zset-max-ziplist-value其中之一，sorted set会通zset 的数据结构来实现转换skiplist

typedef struct zset {
    dict *dict;
    zskiplist *zsl;
} zset;

4、为什么redis用skiplist而不用平衡树

关于这个话我们可以从查找、新增删除的角度考虑

平衡树:采用二分法思维把数据按规则组装成一个树形结构的数据，用这个树形结构的数据减少无关数据的检索，大大的提升了数据检索的速度。

查找：在平衡树上，我们找到指定范围的小值之后，还需要以中序遍历的顺序继续寻找其它不超过大值的节点。然而用skiplist 进行范围查找的时候，我们只需要找到小值在对第一层进行遍历即可得到；查找单个key 时平衡树和skiplist 时间复杂度基本差不多都为O(log n)
新增删除：平衡树新增和删除可能会引起子树的调整，而skiplist 只需要改变向前指针或者向后指针的改变，操作简单快速。

从上面三点其实已经说明很多选择skiplist的优势，还有一点skiplist 实现起来比平衡树简单很多。

Redis作者：(原话地址）

There are a few reasons:

They are not very memory intensive. It's up to you basically. Changing parameters about the probability of a node to have a given number of levels will make then less memory intensive than btrees.

A sorted set is often target of many ZRANGE or ZREVRANGE operations, that is, traversing the skip list as a linked list. With this operation the cache locality of skip lists is at least as good as with other kind of balanced trees.

They are simpler to implement, debug, and so forth. For instance thanks to the skip list simplicity I received a patch (already in Redis master) with augmented skip lists implementing ZRANK in O(log(N)). It required little changes to the code.

5、总结

redis skiplist 在原始的跳表上面做了一些优化。
redis sorted set 在zset-max-ziplist-entries 和zset-max-ziplist-value都不满足的情况时使用ziplist 来节省内存开销，超过那俩个参数利用查询、新增时间复杂度比较低的skiplist来实现。