2019-05-14 Reading: Data provena

2019-05-24  SeanC52111

Timestamp related data and query

Data type: time series data
Data arrive synchronously or asynchronously from multiple sources
4 processes:

Time indexing

A. Time Bucketing
Each time bucket can handle one hour's worth of data. Alternative policies might vary the bucket extents from one time period to another. For example, a bucketing policy may specify that the buckets for events from earlier than today are three-hour buckets, but that the buckets for events occurring during the last 24 hours are hashed by the hour.
In order to improve efficiency further, buckets are instantiated using a lazy allocation policy (as late as possible) in primary memory (RAM). In-memory buckets have a maximum capacity and, when they reach their limit, they will be committed to disk and replaced by a new bucket. Bucket storage size is another element of the bucketing policy and varies along with the size of the temporal extent. Finally, bucket policies typically enforce that buckets (a) do not overlap, and (b) cover all possible incoming timestamps.
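
A minimal Python sketch of lazy allocation and capacity-based commit follows. It is only an illustration under assumptions: the class name BucketStore, the fixed one-hour extent, integer timestamps in seconds, and the in-memory list standing in for disk are all hypothetical and not taken from the source.

```python
class BucketStore:
    """In-memory time buckets, created lazily and committed when full."""

    def __init__(self, extent_seconds=3600, max_events=3):
        self.extent = extent_seconds   # temporal extent of each bucket (fixed here)
        self.max_events = max_events   # capacity before a bucket is committed
        self.in_memory = {}            # bucket start time -> list of (timestamp, event)
        self.committed = []            # stand-in for disk: (bucket start, events) pairs

    def add(self, timestamp, event):
        # Align the timestamp down to the start of its bucket.
        start = timestamp - timestamp % self.extent
        # Lazy allocation: the bucket is created only when its first event arrives.
        bucket = self.in_memory.setdefault(start, [])
        bucket.append((timestamp, event))
        # A full bucket is committed; a later event for the same time span
        # lazily allocates a fresh in-memory bucket.
        if len(bucket) >= self.max_events:
            self.committed.append((start, bucket))
            del self.in_memory[start]
```

Because a bucket is dropped from memory when it is committed, the same time span can be committed more than once, which is why a single span may end up with several on-disk pieces.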

Each incoming event is assigned to the time bucket whose temporal criteria match the event's timestamp. In one implementation, we can use half-open intervals, defined by a start time and an end time, where the start time is an inclusive boundary and the end time is an exclusive boundary. This ensures that an event occurring exactly on a bucket boundary is assigned to exactly one bucket.
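
The half-open rule can be sketched in Python as below. The function name assign_bucket and the representation of a policy as a sorted list of bucket edges are assumptions made for illustration; the edges in the usage example mix three-hour and one-hour extents.

```python
from bisect import bisect_right

def assign_bucket(boundaries, timestamp):
    """Return the half-open interval [start, end) that contains timestamp.

    boundaries is a sorted list of bucket edges; consecutive edges define
    one bucket, so buckets cannot overlap and leave no gaps between them.
    """
    # bisect_right places a timestamp equal to an edge into the bucket
    # that starts at that edge (start inclusive, end exclusive).
    i = bisect_right(boundaries, timestamp)
    if i == 0 or i == len(boundaries):
        raise ValueError("timestamp outside the covered range")
    return boundaries[i - 1], boundaries[i]

# Two three-hour buckets followed by two one-hour buckets (seconds).
edges = [0, 10800, 21600, 25200, 28800]
print(assign_bucket(edges, 21599))  # (10800, 21600)
print(assign_bucket(edges, 21600))  # (21600, 25200)
```

An event stamped exactly 21600 lands only in the bucket that starts at 21600, never in the one that ends there.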

B. Segmentation
Once an appropriate bucket has been identified for an event, the raw event data is segmented. A segment is a substring of the incoming event text, and a segmentation is the collection of segments that the segmentation algorithm produces for the incoming event data.
A segment substring may overlap another substring, but if it does, it must be contained entirely within that substring. We allow this property to apply recursively to the containing substring so that the segment hierarchy forms a tree on the incoming text.
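
The nesting rule can be checked with a short Python sketch. Representing segments as (start, end) character offsets with an exclusive end, and the function name is_valid_segmentation, are assumptions for illustration; the notes do not describe how segments are stored.

```python
def is_valid_segmentation(segments):
    """Return True if every pair of segments is either disjoint or nested.

    segments is a list of (start, end) offsets into the event text,
    with end exclusive.
    """
    # Sort by start, longest first, so a container precedes its contents.
    ordered = sorted(segments, key=lambda s: (s[0], -s[1]))
    stack = []  # chain of currently open, nested segments
    for start, end in ordered:
        # Drop segments that end before this one begins.
        while stack and stack[-1][1] <= start:
            stack.pop()
        # The innermost open segment must fully contain the new one.
        if stack and end > stack[-1][1]:
            return False  # partial overlap, so the hierarchy is not a tree
        stack.append((start, end))
    return True

text = "10.0.0.1 GET /a"
# "10.0.0.1" contains "10"; "GET" is disjoint from both.
print(is_valid_segmentation([(0, 8), (0, 2), (9, 12)]))  # True
print(is_valid_segmentation([(0, 8), (5, 12)]))          # False
```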

C. Archiving and Indexing Events
Indexing is split into two separate phases: hot indexing and warm indexing. Hot indexes are managed entirely in RAM, are optimized for the smallest possible insert time, are not searchable, and do not persist. Warm indexes are searchable and persistent, but immutable. When hot indexes need to be made searchable or need to be persistent, they are converted into warm indexes.
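
A rough Python sketch of the two phases is shown below. The class names HotIndex and WarmIndex, the posting-list layout, and the integer event ids are hypothetical, and persistence to disk is left out to keep the example short.

```python
class HotIndex:
    """Append-only and in memory; deliberately offers no search method."""

    def __init__(self):
        self._postings = []  # (segment, event_id) pairs in arrival order

    def insert(self, segment, event_id):
        # A plain append keeps insert time as small as possible.
        self._postings.append((segment, event_id))

    def to_warm(self):
        # Conversion groups postings by segment so they become searchable.
        grouped = {}
        for segment, event_id in self._postings:
            grouped.setdefault(segment, []).append(event_id)
        return WarmIndex(grouped)


class WarmIndex:
    """Searchable and never modified after construction."""

    def __init__(self, grouped):
        # Sorted tuples cannot be changed in place.
        self._index = {seg: tuple(sorted(set(ids))) for seg, ids in grouped.items()}

    def search(self, segment):
        return self._index.get(segment, ())


hot = HotIndex()
hot.insert("error", 2)
hot.insert("error", 1)
hot.insert("disk", 2)
warm = hot.to_warm()
print(warm.search("error"))  # (1, 2)
```

The design choice illustrated here is that the cost of sorting and grouping is paid once at conversion time rather than on every insert.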

During the course of the indexing process, it is possible that a single time bucket will be filled and committed to disk several times. This results in multiple, independently searchable indexes in secondary storage for a single time span. In an exemplary implementation, there is a merging process that takes two or more warm indexes as input and merges them into a single warm index for that time bucket. This is a performance optimization and is not strictly required for searching.
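
The merge step might look like the following Python sketch. It assumes, purely for illustration, that a warm index is a dict mapping each segment to a sorted tuple of event ids; the function name merge_warm is hypothetical.

```python
from heapq import merge

def merge_warm(*indexes):
    """Merge warm indexes for one time bucket into a single new index.

    Each input maps a segment to a sorted tuple of event ids. The inputs
    are left untouched, in keeping with warm indexes being immutable.
    """
    merged = {}
    for index in indexes:
        for segment, ids in index.items():
            # heapq.merge combines already-sorted sequences in sorted order.
            merged[segment] = tuple(merge(merged.get(segment, ()), ids))
    return merged

first = {"error": (1, 4), "disk": (2,)}
second = {"error": (3,), "net": (5,)}
print(merge_warm(first, second))
# {'error': (1, 3, 4), 'disk': (2,), 'net': (5,)}
```

A search over the merged index consults one structure instead of one per commit, which is the optimization described above; searching each unmerged index and combining the results would return the same event ids.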
