Redis 存储效率的追求-ziplist

2019-08-22 本文已影响0人多多的大白

前面章节介绍过zskiplist ，其中多次提到ziplist。主要说明redis sorted set 结构为了节约内存开销，在数据量比较小的情况，会使用ziplist 来作为数据结构。下面我们主要介绍zip的结构以及这样的结构为了节省内存做了哪些的约定。
作为一直用高级语言Java来讨生活的笔者来说，一直认为这种数据结构的思路是可以运用到我们实际的项目中，值得我们去思考一番。

1、什么是ziplist

Redis 官方在源码上注释如：

/* The ziplist is a specially encoded dually linked list that is designed
 * to be very memory efficient. It stores both strings and integer values,
 * where integers are encoded as actual integers instead of a series of
 * characters. It allows push and pop operations on either side of the list
 * in O(1) time. However, because every operation requires a reallocation of
 * the memory used by the ziplist, the actual complexity is related to the
 * amount of memory used by the ziplist.

上面说明来自redis-4.0-14 ziplist.c ，该文件上有对ziplist 做了比较详细的解释，包含ziplist的结构。
翻译大意就是：

ziplist 是为了存储效率提供的一种经过特殊编码的双向链表。它用于存储字符串和整型，其中整型用二进制来进行编码，它能以O(1)的时间复杂度在表的两端提供push和pop操作。

/* The size of a ziplist header: two 32 bit integers for the total
 * bytes count and last item offset. One 16 bit integer for the number
 * of items field. */
#define ZIPLIST_HEADER_SIZE     (sizeof(uint32_t)*2+sizeof(uint16_t))
/* Create a new empty ziplist. */
unsigned char *ziplistNew(void) {
    unsigned int bytes = ZIPLIST_HEADER_SIZE+1;
    unsigned char *zl = zmalloc(bytes);
    ZIPLIST_BYTES(zl) = intrev32ifbe(bytes);
    ZIPLIST_TAIL_OFFSET(zl) = intrev32ifbe(ZIPLIST_HEADER_SIZE);
    ZIPLIST_LENGTH(zl) = 0;
    zl[bytes-1] = ZIP_END;
    return zl;
}

官方解释ziplist 是一个双向链表，我们都知道一个普通的链表每一项都占用独立的内存空间，然后通过指针或者引用来进行关联，这会导致大量的内存碎片。ziplist占用一整块的内存区域，从上面创建ziplist，我们就可以看出，也可以理解它就是一个表，而非链表。

2、ziplist 结构

ziplist 并不像前面章节所写的zskiplist、sds 等有自定义的结构，从ziplistNew代码就可以看出来。ziplist 为了节省内存在很多细节也做了约定，后面有很多地方可以看出。

ziplist的内存结构如下

/**
 * The general layout of the ziplist is as follows:
 *  <zlbytes> <zltail> <zllen> <entry> <entry> ... <entry> <zlend>
 **/

 *
 * NOTE: all fields are stored in little endian, if not specified otherwise.
 *
 * <uint32_t zlbytes> is an unsigned integer to hold the number of bytes that
 * the ziplist occupies, including the four bytes of the zlbytes field itself.
 * This value needs to be stored to be able to resize the entire structure
 * without the need to traverse it first.
 *
 * <uint32_t zltail> is the offset to the last entry in the list. This allows
 * a pop operation on the far side of the list without the need for full
 * traversal.
 *
 * <uint16_t zllen> is the number of entries. When there are more than
 * 2^16-2 entires, this value is set to 2^16-1 and we need to traverse the
 * entire list to know how many items it holds.
 *
 * <uint8_t zlend> is a special entry representing the end of the ziplist.
 * Is encoded as a single byte equal to 255. No other normal entry starts
 * with a byte set to the value of 255.

各个部分在内存中都是连续的，ziplist.c 注释上都对内存结构进行了解释，下面看下各个部分的定义：

zlbytes：32bit 表示ziplist占用的字节总数，其中包含自己本身占用的4个字节。
zltail：32bit 表示ziplist 最后一个entry 距离列表起始地址有多少个字节。它的存在意味可以很方便地找到最后一项，而不需要遍历整个列表，并且可以快速的进行pop和push操作。
zllen：16bit 表示entry的个数，zllen只有16bit 意味最大存储为2¹⁶-1。如果entry的个数小于等于2¹⁶-2，该值表示entry的个数。如果超过的话，zllen 所有的bit 都为1，那么如果需要entry的个数，就只能遍历整个列表了（我们可以从ziplistLen 这个API就可以看出来）。
entry :表示真正存放数据的数据项。
zlend ：8bit ，ziplist最后一项，表示结尾，固定值255。

下面我们在看下entry的构成

 /** Every entry in the ziplist is prefixed by metadata that contains two pieces
 * of information. First, the length of the previous entry is stored to be
 * able to traverse the list from back to front. Second, the entry encoding is
 * provided. It represents the entry type, integer or string, and in the case
 * of strings it also represents the length of the string payload.
 * So a complete entry is stored like this:
 **/
  <prevlen> <encoding> <entry-data>

prevlen:表示前一个数据项的字节总数，为了让ziplist能够从后向前遍历（从后一项的位置，只需向前偏移prevlen个字节，就找到了前一项。这里也就体现前面说的它像个单向链表的特性。prevlen这个字段采用的是变长编码。
我们看下为什么叫变长编码？
官方解释：

/** The length of the previous entry, <prevlen>, is encoded in the following way:
 * If this length is smaller than 254 bytes, it will only consume a single
 * byte representing the length as an unsinged 8 bit integer. When the length
 * is greater than or equal to 254, it will consume 5 bytes. The first byte is
 * set to 254 (FE) to indicate a larger value is following. The remaining 4
 * bytes take the length of the previous entry as value.
 *
 * So practically an entry is encoded in the following way:
 *
 * <prevlen from 0 to 253> <encoding> <data>
 *
 * Or alternatively if the previous entry length is greater than 253 bytes
 * the following encoding is used:
 *
 * 0xFE <4 bytes unsigned little endian prevlen> <encoding> <data>
**/

这句话的意思是：有俩种可能，一种是1个字节表示，一种是5个字节表示。
1、如果前一项字节数小于254的话，prevlen就用一个字节表示。
2、如果大于253的话，就用5个字节表示，第一个字节为FE （即254），后面4个字节真正存储前一个数据项的占用字节数。

encoding：表示当前数据项所保存数据的类型以及长度。也是采用变长编码。最前的2个字节决定存储是字符串类型还是整型，如果都为11的话表示为不同长度的整型，其他情况为字符串类型。

|00pppppp| - 1 byte。第1个字节最高两个bit是00，那么<encoding>字段只有1个字节，剩余的6个bit用来表示字符串的长度值，最高可以表示63 (2^6-1)。
|01pppppp|qqqqqqqq| - 2 bytes。第1个字节最高两个bit是01，那么<encoding>字段占2个字节，总共有14个bit用来表示字符串的长度值，最高可以表示16383 (2^14-1)。
10__|qqqqqqqq|rrrrrrrr|ssssssss|tttttttt| - 5 bytes。第1个字节最高两个bit是10，那么len字段占5个字节，总共使用32个bit来表示字符串的长度值（6个bit舍弃不用），最高可以表示2^32-1。
在这三种情况下，<data>都是按字符串来存储的；从下面第4种情况开始，<data>开始变为按整数来存储了。
|11000000| - 1 byte。<encoding>字段占用1个字节，值为0xC0，后面的数据<data>存储为2个字节的int16_t类型。
|11010000| - 1 byte。<encoding>字段占用1个字节，值为0xD0，后面的数据<data>存储为4个字节的int32_t类型。
|11100000| - 1 byte。<encoding>字段占用1个字节，值为0xE0，后面的数据<data>存储为8个字节的int64_t类型。
|11110000| - 1 byte。<encoding>字段占用1个字节，值为0xF0，后面的数据<data>存储为3个字节长的整数。
|11111110| - 1 byte。<encoding>字段占用1个字节，值为0xFE，后面的数据<data>存储为1个字节的整数。
|1111xxxx| - - (xxxx的值在0001和1101之间)。这是一种特殊情况，xxxx从1到13一共13个值，这时就用这13个值来表示真正的数据。注意，这里是表示真正的数据，而不是数据长度了。也就是说，在这种情况下，后面不再需要一个单独的<data>字段来表示真正的数据了，而是<encoding>和<data>合二为一了。另外，由于xxxx只能取0001和1101这13个值了（其它可能的值和其它情况冲突了，比如0000和1110分别同前面第7种第8种情况冲突，1111跟结束标记冲突），而小数值应该从0开始，因此这13个值分别表示0到12，即xxxx的值减去1才是它所要表示的那个整数数据的值。

demo

做个简单的demo。可以根据上图的解释理解。

3、连锁更新

上面我们对ziplist存储结构已经清楚了，就是存储一块大的连续的内存块里面，如果针对这样一个结构进行插入数据、删除数据，看着都觉得很麻烦。我们可以想象下插入数据的一个过程，需要计算插入的位置，还需要计算entry里面的prevlen，prevlen的规则又比较复杂。

entry

假设我们现在有上图这样一个ziplist存储，其中entry1-entryN的data 都是小于254
entry1-entryN的prevlen的长度都占用1个字节。
如果在此时我们插入了一个数据长度大于254 暂且叫它entryM吧，插入entry1 的前面，这时会发现我们entry1-entryN 的prevlen长度都必须从1个字节变成5个字节（前面介绍过prevlen，它是一个变长编码），整个ziplist 就会发生连锁反应，全部都得变，大量的内存拷贝性能肯定会急剧下降。作为做内存存储的redis ，我们肯定会认为是灾难。
其实并没有我们想象的那么糟糕，Redis 肯定早就考虑过这样的情况，所以才敢大胆的这么玩。
下面介绍为什么可以不用这么慌：

zset-max-ziplist-entries 128
zset-max-ziplist-value 64

这俩个配置参数在前面章节《Redis zskiplist》多次介绍过，sorted set 在数据量比较小的情况下使用ziplist ，数据量达到上面的配置参数就直接用zskiplist来进行存储了。

hash-max-ziplist-entries 512
hash-max-ziplist-value 64

这俩个配置参数是在使用Hash 情况的ziplist转换，当数据量达到上面的数据，ziplist 就会转换成dist。

当hash中的数据项（即field-value对）的数目超过512的时候，也就是ziplist数据项超过1024的时候。
当hash中插入的任意一个value的长度超过了64的时候。

当然，这些配置写redis.conf里面，意味着我们可以随意配置。但是在清楚这4个配置参数的情况下，我们肯定不会随性的配置这里面的值。

综合上面的介绍：
1.压缩列表里要恰好有多个连续的、长度在250-253 字节之间的节点连锁更新才有可能被引发，在实际中，这种情况并不多见。
2.即使出现连锁更新，但只要被更新的节点数量不多，就不会对性能造成较大的影响。

4、ziplist的插入

有这么多规矩和约定的ziplist，想必插入操作肯定是很复杂的。下面我们介绍下
ziplist的插入API----ziplistInsert。

/* Insert item at "p". */
unsigned char *__ziplistInsert(unsigned char *zl, unsigned char *p, unsigned char *s, unsigned int slen) {
    size_t curlen = intrev32ifbe(ZIPLIST_BYTES(zl)), reqlen;
    unsigned int prevlensize, prevlen = 0;
    size_t offset;
    int nextdiff = 0;
    unsigned char encoding = 0;
    long long value = 123456789; /* initialized to avoid warning. Using a value
                                    that is easy to see if for some reason
                                    we use it uninitialized. */
    zlentry tail;
    /* Find out prevlen for the entry that is inserted. */
    if (p[0] != ZIP_END) {
        ZIP_DECODE_PREVLEN(p, prevlensize, prevlen);
    } else {
        unsigned char *ptail = ZIPLIST_ENTRY_TAIL(zl);
        if (ptail[0] != ZIP_END) {
            prevlen = zipRawEntryLength(ptail);
        }
    }
    /* See if the entry can be encoded */
    if (zipTryEncoding(s,slen,&value,&encoding)) {
        /* 'encoding' is set to the appropriate integer encoding */
        reqlen = zipIntSize(encoding);
    } else {
        /* 'encoding' is untouched, however zipStoreEntryEncoding will use the
         * string length to figure out how to encode it. */
        reqlen = slen;
    }
    /* We need space for both the length of the previous entry and
     * the length of the payload. */
    reqlen += zipStorePrevEntryLength(NULL,prevlen);
    reqlen += zipStoreEntryEncoding(NULL,encoding,slen);
    /* When the insert position is not equal to the tail, we need to
     * make sure that the next entry can hold this entry's length in
     * its prevlen field. */
    int forcelarge = 0;
    nextdiff = (p[0] != ZIP_END) ? zipPrevLenByteDiff(p,reqlen) : 0;
    if (nextdiff == -4 && reqlen < 4) {
        nextdiff = 0;
        forcelarge = 1;
    }
    /* Store offset because a realloc may change the address of zl. */
    offset = p-zl;
    zl = ziplistResize(zl,curlen+reqlen+nextdiff);
    p = zl+offset;
    /* Apply memory move when necessary and update tail offset. */
    if (p[0] != ZIP_END) {
        /* Subtract one because of the ZIP_END bytes */
        memmove(p+reqlen,p-nextdiff,curlen-offset-1+nextdiff);
        /* Encode this entry's raw length in the next entry. */
        if (forcelarge)
            zipStorePrevEntryLengthLarge(p+reqlen,reqlen);
        else
            zipStorePrevEntryLength(p+reqlen,reqlen);
        /* Update offset for tail */
        ZIPLIST_TAIL_OFFSET(zl) =
            intrev32ifbe(intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl))+reqlen);
        /* When the tail contains more than one entry, we need to take
         * "nextdiff" in account as well. Otherwise, a change in the
         * size of prevlen doesn't have an effect on the *tail* offset. */
        zipEntry(p+reqlen, &tail);
        if (p[reqlen+tail.headersize+tail.len] != ZIP_END) {
            ZIPLIST_TAIL_OFFSET(zl) =
                intrev32ifbe(intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl))+nextdiff);
        }
    } else {
        /* This element will be the new tail. */
        ZIPLIST_TAIL_OFFSET(zl) = intrev32ifbe(p-zl);
    }
    /* When nextdiff != 0, the raw length of the next entry has changed, so
     * we need to cascade the update throughout the ziplist */
    if (nextdiff != 0) {
        offset = p-zl;
        zl = __ziplistCascadeUpdate(zl,p+reqlen);
        p = zl+offset;
    }
    /* Write the entry */
    p += zipStorePrevEntryLength(p,prevlen);
    p += zipStoreEntryEncoding(p,encoding,slen);
    if (ZIP_IS_STR(encoding)) {
        memcpy(p,s,slen);
    } else {
        zipSaveInteger(p,value,encoding);
    }
    ZIPLIST_INCR_LENGTH(zl,1);
    return zl;
}

我们来简单解析一下这段代码：

这个函数是在指定的位置p插入一段新的数据，待插入数据的地址指针是s，长度为slen。插入后形成一个新的数据项，占据原来p的配置，原来位于p位置的数据项以及后面的所有数据项，需要统一向后移动，给新插入的数据项留出空间。参数p指向的是ziplist中某一个数据项的起始位置，或者在向尾端插入的时候，它指向ziplist的结束标记<zlend>。
函数开始先计算出待插入位置前一个数据项的长度prevlen。这个长度要存入新插入的数据项的<prevrawlen>字段。
然后计算当前数据项占用的总字节数reqlen，它包含三部分：<prevrawlen>, <len>和真正的数据。其中的数据部分会通过调用zipTryEncoding先来尝试转成整数。
由于插入导致的ziplist对于内存的新增需求，除了待插入数据项占用的reqlen之外，还要考虑原来p位置的数据项（现在要排在待插入数据项之后）的<prevrawlen>字段的变化。本来它保存的是前一项的总长度，现在变成了保存当前插入的数据项的总长度。这样它的<prevrawlen>字段本身需要的存储空间也可能发生变化，这个变化可能是变大也可能是变小。这个变化了多少的值nextdiff，是调用zipPrevLenByteDiff计算出来的。如果变大了，nextdiff是正值，否则是负值。
现在很容易算出来插入后新的ziplist需要多少字节了，然后调用ziplistResize来重新调整大小。ziplistResize的实现里会调用allocator的zrealloc，它有可能会造成数据拷贝。
现在额外的空间有了，接下来就是将原来p位置的数据项以及后面的所有数据都向后挪动，并为它设置新的<prevrawlen>字段。此外，还可能需要调整ziplist的<zltail>字段。
最后，组装新的待插入数据项，放在位置p。

ziplist API

unsigned char *ziplistNew(void);
unsigned char *ziplistMerge(unsigned char **first, unsigned char **second);
unsigned char *ziplistPush(unsigned char *zl, unsigned char *s, unsigned int slen, int where);
unsigned char *ziplistIndex(unsigned char *zl, int index);
unsigned char *ziplistNext(unsigned char *zl, unsigned char *p);
unsigned char *ziplistPrev(unsigned char *zl, unsigned char *p);
unsigned char *ziplistInsert(unsigned char *zl, unsigned char *p, unsigned char *s, unsigned int slen);
unsigned char *ziplistDelete(unsigned char *zl, unsigned char **p);
unsigned char *ziplistFind(unsigned char *p, unsigned char *vstr, unsigned int vlen, unsigned int skip);
unsigned int ziplistLen(unsigned char *zl);

ziplist的数据类型，没有用自定义的struct之类的来表达，而就是简单的unsigned char *。这是因为ziplist本质上就是一块连续内存，内部组成结构又是一个高度动态的设计（变长编码），也没法用一个固定的数据结构来表达。
ziplistNew: 创建一个空的ziplist（只包含<zlbytes><zltail><zllen><zlend>）。
ziplistMerge: 将两个ziplist合并成一个新的ziplist。
ziplistPush: 在ziplist的头部或尾端插入一段数据（产生一个新的数据项）。注意一下这个接口的返回值，是一个新的ziplist。调用方必须用这里返回的新的ziplist，替换之前传进来的旧的ziplist变量，而经过这个函数处理之后，原来旧的ziplist变量就失效了。为什么一个简单的插入操作会导致产生一个新的ziplist呢？这是因为ziplist是一块连续空间，对它的追加操作，会引发内存的realloc，因此ziplist的内存位置可能会发生变化。实际上，我们在之前介绍sds的文章中提到过类似这种接口使用模式（参见sdscatlen函数的说明）。
ziplistIndex: 返回index参数指定的数据项的内存位置。index可以是负数，表示从尾端向前进行索引。
ziplistNext和ziplistPrev分别返回一个ziplist中指定数据项p的后一项和前一项。
ziplistInsert: 在ziplist的任意数据项前面插入一个新的数据项。
ziplistDelete: 删除指定的数据项。
ziplistFind: 查找给定的数据（由vstr和vlen指定）。注意它有一个skip参数，表示查找的时候每次比较之间要跳过几个数据项。为什么会有这么一个参数呢？其实这个参数的主要用途是当用ziplist表示hash结构的时候，是按照一个field，一个value来依次存入ziplist的。也就是说，偶数索引的数据项存field，奇数索引的数据项存value。当按照field的值进行查找的时候，就需要把奇数项跳过去。
ziplistLen: 计算ziplist的长度（即包含数据项的个数）。

5、总结

ziplist 是一个连续的内存空间，设计上遵循着比较多的规矩，这种结构并不擅长更新操作。所以sortedset 、hash 都是在数据量小的情况使用这种结构。
ziplist 也提现了Redis 对存储效率的追求，不在乎复杂的设计，只要能提升性能都是值得的。