Prometheus tsdb index源码分析
To be continued...
Prometheus index
是为了对监控数据进行快速检索和查询而设计的,主要用来记录chunk
中时序的偏移位置。
Index structure
index的数据格式如下所示:
+----------------------------+---------------------+
| magic(0xBAAAD700) <4b> | version(1) <1 byte> |
+----------------------------+---------------------+
| +----------------------------------------------+ |
| | Symbol Table | |
| +----------------------------------------------+ |
| | Series | |
| +----------------------------------------------+ |
| | Label Index 1 | |
| +----------------------------------------------+ |
| | ... | |
| +----------------------------------------------+ |
| | Label Index N | |
| +----------------------------------------------+ |
| | Postings 1 | |
| +----------------------------------------------+ |
| | ... | |
| +----------------------------------------------+ |
| | Postings N | |
| +----------------------------------------------+ |
| | Label Index Table | |
| +----------------------------------------------+ |
| | Postings Table | |
| +----------------------------------------------+ |
| | TOC | |
| +----------------------------------------------+ |
+--------------------------------------------------+
- Magic
- 4 bytes的
Magic Number
- 4 bytes的
- Version
- 1 byte的
version
- 1 byte的
- TOC (Table of Content)
- TOC记录的是上图中各部分的起始位置,即
offset
。
- TOC记录的是上图中各部分的起始位置,即
- Symbol Table
- Label Index Table
- Postings Table
TOC
The table of contents serves as an entry point to the entire index and points to various sections in the file. If a reference is zero, it indicates the respective section does not exist and empty results should be returned upon lookup.
const indexTOCLen = 6*8 + 4
// TOC represents index Table Of Content that states where each section of index starts.
type TOC struct {
Symbols uint64
Series uint64
LabelIndices uint64
LabelIndicesTable uint64
Postings uint64
PostingsTable uint64
}
从上述可以看出TOC
共占用52个字节,其中包括6个uint64的offset
,最后4个字节的CRC
。可以用hexdump
命令查看index文件的最后52个字节。
0005810 8b 2d 9a 00 00 00 00 00 00 00 05 00 00 00 00 00
0005820 00 0b f2 00 00 00 00 00 00 29 b9 00 00 00 00 00
0005830 00 46 70 00 00 00 00 00 00 2c a8 00 00 00 00 00
0005840 00 47 05 aa 37 2c f0
0005847
(END)
index主要包括读写两部分。下面分别对其进行描述。
Writer
Writer
实现IndexWriter
接口,而IndexWriter
中定义了序列化index
的几个阶段。
// IndexWriter serializes the index for a block of series data.
// The methods must be called in the order they are specified in.
type IndexWriter interface {
// AddSymbols registers all string symbols that are encountered in series
// and other indices.
AddSymbols(sym map[string]struct{}) error
// AddSeries populates the index writer with a series and its offsets
// of chunks that the index can reference.
// Implementations may require series to be insert in increasing order by
// their labels.
// The reference numbers are used to resolve entries in postings lists that
// are added later.
AddSeries(ref uint64, l labels.Labels, chunks ...chunks.Meta) error
// WriteLabelIndex serializes an index from label names to values.
// The passed in values chained tuples of strings of the length of names.
WriteLabelIndex(names []string, values []string) error
// WritePostings writes a postings list for a single label pair.
// The Postings here contain refs to the series that were added.
WritePostings(name, value string, it index.Postings) error
// Close writes any finalization and closes the resources associated with
// the underlying writer.
Close() error
}
index的写分为5个阶段,顺序执行。
const (
idxStageNone indexWriterStage = iota
idxStageSymbols
idxStageSeries
idxStageLabelIndex
idxStagePostings
idxStageDone
)
// Writer implements the IndexWriter interface for the standard
// serialization format.
type Writer struct {
f *os.File
fbuf *bufio.Writer
pos uint64
toc TOC
stage indexWriterStage
// Reusable memory.
buf1 encoding.Encbuf
buf2 encoding.Encbuf
uint32s []uint32
symbols map[string]uint32 // symbol offsets
seriesOffsets map[uint64]uint64 // offsets of series
labelIndexes []labelIndexHashEntry // label index offsets
postings []postingsHashEntry // postings lists offsets
// Hold last series to validate that clients insert new series in order.
lastSeries labels.Labels
crc32 hash.Hash
Version int
}
其中labelIndexHashEntry
和postingHashEntry
结构如下:
type labelIndexHashEntry struct {
keys []string
offset uint64
}
type postingsHashEntry struct {
name, value string
offset uint64
}
Symbol Table
- The symbol table holds a sorted list of deduplicated strings that occurred in label pairs of the stored series. They can be referenced from subsequent sections and significantly reduce the total index size.
- The section contains a sequence of the string entries, each prefixed with the string's length in raw bytes. All strings are utf-8 encoded. Strings are referenced by sequential indexing. The strings are sorted in lexicographically ascending order.
func (w *Writer) AddSymbols(sym map[string]struct{}) error
- 首先从sym中获取所以的symbols,即所有的key。
- 对其进行排序。
- 将symbols按照如下格式进行write。
其具体格式如下:
+--------------------+---------------------+
| len <4b> | len(symbols) <4b> |
+--------------------+---------------------+
| +----------------------+---------------+ |
| | len(str_1) <uvarint> | str_1 <bytes> | |
| +----------------------+---------------+ |
| | . . . | |
| +----------------------+---------------+ |
| | len(str_n) <uvarint> | str_n <bytes> | |
| +----------------------+---------------+ |
+--------------------+---------------------+
| CRC32 <4b> |
+--------------------+---------------------+
Series
- The section contains a sequence of series that hold the label set of the series as well as its chunks within the block. The series are sorted lexicographically by their label sets.
- Each series section is aligned to 16 bytes. The ID for a series is the offset/16. This serves as the series' ID in all subsequent references. Thereby, a sorted list of series IDs implies a lexicographically sorted list of series label sets.
// AddSeries adds the series one at a time along with its chunks.
func (w *Writer) AddSeries(ref uint64, lset labels.Labels, chunks ...chunks.Meta) error
Label Offset Table & Postings offset Table
A label offset table stores a sequence of label offset entries. Every label offset entry holds the label name and the offset to its values in the label index section. They are used to track label index sections. They are read into memory when an index file is loaded.
A postings offset table stores a sequence of postings offset entries. Every postings offset entry holds the label name/value pair and the offset to its series list in the postings section. They are used to track postings sections. They are read into memory when an index file is loaded.
在idxStageDone
阶段,会调用writeLabelIndexesOffsetTable
和writePostingsOffsetTable
方法。
// writeLabelIndexesOffsetTable writes the label indices offset table.
func (w *Writer) writeLabelIndexesOffsetTable() error {
w.buf2.Reset()
w.buf2.PutBE32int(len(w.labelIndexes))
for _, e := range w.labelIndexes {
w.buf2.PutUvarint(len(e.keys))
for _, k := range e.keys {
w.buf2.PutUvarintStr(k)
}
w.buf2.PutUvarint64(e.offset)
}
w.buf1.Reset()
// len(buf2)的长度包括labelIndexes的长度所占的4个字节,
// 但是不包括CRC所占的4个字节
w.buf1.PutBE32int(w.buf2.Len())
w.buf2.PutHash(w.crc32)
return w.write(w.buf1.Get(), w.buf2.Get())
}
可以看出Label Index Table的大致结构如下:
+------------------------+-------------------------+
| len <4b> | len(labelIndexes) <4b> |
+------------------------+-------------------------+
| +----------------------------------------+ |
| | len(keys) <uvarint> | |
| +----------------------+-----------------+ |
| | len(name) <uvarint> | name <bytes> | |
| +----------------------+-----------------+ |
| | . . . | |
| +----------------------------------------+ |
| | offset <uvarint64> | |
| +----------------------------------------+ |
| . . . |
+------------------------+-------------------------+
| CRC32 <4b> |
+------------------------+-------------------------+
// writePostingsOffsetTable writes the postings offset table.
func (w *Writer) writePostingsOffsetTable() error {
w.buf2.Reset()
w.buf2.PutBE32int(len(w.postings))
for _, e := range w.postings {
w.buf2.PutUvarint(2)
w.buf2.PutUvarintStr(e.name)
w.buf2.PutUvarintStr(e.value)
w.buf2.PutUvarint64(e.offset)
}
w.buf1.Reset()
// len(buf2)的长度包括postings的长度所占的4个字节,
// 但是不包括CRC所占的4个字节
w.buf1.PutBE32int(w.buf2.Len())
w.buf2.PutHash(w.crc32)
return w.write(w.buf1.Get(), w.buf2.Get())
}
可以看出Postings Table的大致结构如下:
+------------------------+-------------------------+
| len <4b> | len(postings) <4b> |
+------------------------+-------------------------+
| +----------------------------------------+ |
| | n = 2 <1b> <uvarint> | |
| +----------------------+-----------------+ |
| | len(name) <uvarint> | name <bytes> | |
| +----------------------+-----------------+ |
| | len(value) <uvarint> | value <bytes> | |
| +----------------------------------------+ |
| | offset <uvarint64> | |
| +----------------------------------------+ |
| . . . |
+------------------------+-------------------------+
| CRC32 <4b> |
+------------------------+-------------------------+