Prometheus tsdb index源码分析

2019-10-21 本文已影响0人酱油王0901

To be continued...

Prometheus index是为了对监控数据进行快速检索和查询而设计的，主要用来记录chunk中时序的偏移位置。

Index structure

index的数据格式如下所示：

+----------------------------+---------------------+
| magic(0xBAAAD700) <4b>     | version(1) <1 byte> |
+----------------------------+---------------------+
| +----------------------------------------------+ |
| |                 Symbol Table                 | |
| +----------------------------------------------+ |
| |                    Series                    | |
| +----------------------------------------------+ |
| |                 Label Index 1                | |
| +----------------------------------------------+ |
| |                      ...                     | |
| +----------------------------------------------+ |
| |                 Label Index N                | |
| +----------------------------------------------+ |
| |                   Postings 1                 | |
| +----------------------------------------------+ |
| |                      ...                     | |
| +----------------------------------------------+ |
| |                   Postings N                 | |
| +----------------------------------------------+ |
| |               Label Index Table              | |
| +----------------------------------------------+ |
| |                 Postings Table               | |
| +----------------------------------------------+ |
| |                      TOC                     | |
| +----------------------------------------------+ |
+--------------------------------------------------+

Magic
- 4 bytes的Magic Number
Version
- 1 byte的version
TOC (Table of Content)
- TOC记录的是上图中各部分的起始位置，即offset。
Symbol Table
Label Index Table
Postings Table

const indexTOCLen = 6*8 + 4

// TOC represents index Table Of Content that states where each section of index starts.
type TOC struct {
    Symbols           uint64
    Series            uint64
    LabelIndices      uint64
    LabelIndicesTable uint64
    Postings          uint64
    PostingsTable     uint64
}

从上述可以看出TOC共占用52个字节，其中包括6个uint64的offset，最后4个字节的CRC。可以用hexdump命令查看index文件的最后52个字节。

0005810 8b 2d 9a 00 00 00 00 00 00 00 05 00 00 00 00 00
0005820 00 0b f2 00 00 00 00 00 00 29 b9 00 00 00 00 00
0005830 00 46 70 00 00 00 00 00 00 2c a8 00 00 00 00 00
0005840 00 47 05 aa 37 2c f0
0005847
(END)

index主要包括读写两部分。下面分别对其进行描述。

Writer

Writer实现IndexWriter接口，而IndexWriter中定义了序列化index的几个阶段。

// IndexWriter serializes the index for a block of series data.
// The methods must be called in the order they are specified in.
type IndexWriter interface {
    // AddSymbols registers all string symbols that are encountered in series
    // and other indices.
    AddSymbols(sym map[string]struct{}) error

    // AddSeries populates the index writer with a series and its offsets
    // of chunks that the index can reference.
    // Implementations may require series to be insert in increasing order by
    // their labels.
    // The reference numbers are used to resolve entries in postings lists that
    // are added later.
    AddSeries(ref uint64, l labels.Labels, chunks ...chunks.Meta) error

    // WriteLabelIndex serializes an index from label names to values.
    // The passed in values chained tuples of strings of the length of names.
    WriteLabelIndex(names []string, values []string) error

    // WritePostings writes a postings list for a single label pair.
    // The Postings here contain refs to the series that were added.
    WritePostings(name, value string, it index.Postings) error

    // Close writes any finalization and closes the resources associated with
    // the underlying writer.
    Close() error
}

index的写分为5个阶段，顺序执行。

const (
    idxStageNone indexWriterStage = iota
    idxStageSymbols
    idxStageSeries
    idxStageLabelIndex
    idxStagePostings
    idxStageDone
)

// Writer implements the IndexWriter interface for the standard
// serialization format.
type Writer struct {
    f    *os.File
    fbuf *bufio.Writer
    pos  uint64

    toc   TOC
    stage indexWriterStage

    // Reusable memory.
    buf1    encoding.Encbuf
    buf2    encoding.Encbuf
    uint32s []uint32

    symbols       map[string]uint32     // symbol offsets
    seriesOffsets map[uint64]uint64     // offsets of series
    labelIndexes  []labelIndexHashEntry // label index offsets
    postings      []postingsHashEntry   // postings lists offsets

    // Hold last series to validate that clients insert new series in order.
    lastSeries labels.Labels

    crc32 hash.Hash

    Version int
}

其中labelIndexHashEntry和postingHashEntry结构如下：

type labelIndexHashEntry struct {
    keys   []string
    offset uint64
}

type postingsHashEntry struct {
    name, value string
    offset      uint64
}

Symbol Table

The symbol table holds a sorted list of deduplicated strings that occurred in label pairs of the stored series. They can be referenced from subsequent sections and significantly reduce the total index size.

The section contains a sequence of the string entries, each prefixed with the string's length in raw bytes. All strings are utf-8 encoded. Strings are referenced by sequential indexing. The strings are sorted in lexicographically ascending order.

func (w *Writer) AddSymbols(sym map[string]struct{}) error

首先从sym中获取所以的symbols，即所有的key。
对其进行排序。
将symbols按照如下格式进行write。

其具体格式如下：

+--------------------+---------------------+
| len <4b>           |  len(symbols) <4b>  |
+--------------------+---------------------+
| +----------------------+---------------+ |
| | len(str_1) <uvarint> | str_1 <bytes> | |
| +----------------------+---------------+ |
| |                . . .                 | |
| +----------------------+---------------+ |
| | len(str_n) <uvarint> | str_n <bytes> | |
| +----------------------+---------------+ |
+--------------------+---------------------+
| CRC32 <4b>                               |
+--------------------+---------------------+

Series

The section contains a sequence of series that hold the label set of the series as well as its chunks within the block. The series are sorted lexicographically by their label sets.

Each series section is aligned to 16 bytes. The ID for a series is the offset/16. This serves as the series' ID in all subsequent references. Thereby, a sorted list of series IDs implies a lexicographically sorted list of series label sets.

// AddSeries adds the series one at a time along with its chunks.
func (w *Writer) AddSeries(ref uint64, lset labels.Labels, chunks ...chunks.Meta) error

Label Offset Table & Postings offset Table

A label offset table stores a sequence of label offset entries. Every label offset entry holds the label name and the offset to its values in the label index section. They are used to track label index sections. They are read into memory when an index file is loaded.

A postings offset table stores a sequence of postings offset entries. Every postings offset entry holds the label name/value pair and the offset to its series list in the postings section. They are used to track postings sections. They are read into memory when an index file is loaded.

在idxStageDone阶段，会调用writeLabelIndexesOffsetTable和writePostingsOffsetTable方法。

// writeLabelIndexesOffsetTable writes the label indices offset table.
func (w *Writer) writeLabelIndexesOffsetTable() error {
    w.buf2.Reset()
    w.buf2.PutBE32int(len(w.labelIndexes))

    for _, e := range w.labelIndexes {
        w.buf2.PutUvarint(len(e.keys))
        for _, k := range e.keys {
            w.buf2.PutUvarintStr(k)
        }
        w.buf2.PutUvarint64(e.offset)
    }

    w.buf1.Reset()
    // len(buf2)的长度包括labelIndexes的长度所占的4个字节，
    // 但是不包括CRC所占的4个字节
    w.buf1.PutBE32int(w.buf2.Len())
    w.buf2.PutHash(w.crc32)

    return w.write(w.buf1.Get(), w.buf2.Get())
}

可以看出Label Index Table的大致结构如下：

+------------------------+-------------------------+
| len <4b>               | len(labelIndexes) <4b>  |
+------------------------+-------------------------+
|    +----------------------------------------+    |
|    |  len(keys) <uvarint>                   |    |
|    +----------------------+-----------------+    |
|    | len(name) <uvarint>  | name <bytes>    |    |
|    +----------------------+-----------------+    |
|    |                 . . .                  |    |
|    +----------------------------------------+    |
|    |  offset <uvarint64>                    |    |
|    +----------------------------------------+    |
|                       . . .                      |
+------------------------+-------------------------+
|      CRC32 <4b>                                  |
+------------------------+-------------------------+

// writePostingsOffsetTable writes the postings offset table.
func (w *Writer) writePostingsOffsetTable() error {
    w.buf2.Reset()
    w.buf2.PutBE32int(len(w.postings))

    for _, e := range w.postings {
        w.buf2.PutUvarint(2)
        w.buf2.PutUvarintStr(e.name)
        w.buf2.PutUvarintStr(e.value)
        w.buf2.PutUvarint64(e.offset)
    }

    w.buf1.Reset()
    // len(buf2)的长度包括postings的长度所占的4个字节，
    // 但是不包括CRC所占的4个字节
    w.buf1.PutBE32int(w.buf2.Len())
    w.buf2.PutHash(w.crc32)

    return w.write(w.buf1.Get(), w.buf2.Get())
}

可以看出Postings Table的大致结构如下：

+------------------------+-------------------------+
| len <4b>               | len(postings) <4b>      |
+------------------------+-------------------------+
|    +----------------------------------------+    |
|    |  n = 2 <1b>  <uvarint>                 |    |
|    +----------------------+-----------------+    |
|    | len(name) <uvarint>  | name <bytes>    |    |
|    +----------------------+-----------------+    |
|    | len(value) <uvarint> | value <bytes>   |    |
|    +----------------------------------------+    |
|    |  offset <uvarint64>                    |    |
|    +----------------------------------------+    |
|                       . . .                      |
+------------------------+-------------------------+
|      CRC32 <4b>                                  |
+------------------------+-------------------------+

References

Index disk format