Bigtable: A Distributed Storage System for Structured Data
Chang et al. OSDI'06
Recently I had a chance to read the legendary Bigtable paper. I have zero expertise in the field of data management, or big data if you'd like it fancier. It's a ten-year-old paper, but its design is still widely adopted in industry, e.g., by HBase. Back in 2006, there were few real big data systems. Bigtable was used internally at Google for two years before they published a paper about it. Google used Bigtable to store various kinds of data, from URLs to images.
Introduction
Bigtable has achieved several goals: wide applicability, scalability, high performance, and high availability.
Bigtable does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format, and allows clients to reason about the locality properties of the data represented in the underlying storage.
Besides, Bigtable uses arbitrary strings as keys to index data, in terms of row and column names. The values (data) are also treated as uninterpreted strings; it is the clients' responsibility to serialize their data.
Bigtable schema parameters let clients dynamically control whether to serve data out of memory or from disk.
Data Model
Basically, a Bigtable is a sorted map: (row:string, column:string, time:int64) -> string.
A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.
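To make the abstraction concrete, here is a minimal Python sketch of the map Bigtable exposes. This is my own illustration, not Google's actual API; the function names and the webtable-style example keys are only for demonstration.

```python
# Sketch of Bigtable's data model: a sparse, sorted map
# (row:str, column:str, timestamp:int) -> bytes.

table = {}

def put(row: str, column: str, timestamp: int, value: bytes) -> None:
    # Values are uninterpreted byte strings; clients serialize their own data.
    table[(row, column, timestamp)] = value

def get(row: str, column: str, timestamp: int) -> bytes:
    return table[(row, column, timestamp)]

# Webtable-style example from the paper: row keys are reversed URLs,
# so pages from the same domain sort near each other.
put("com.cnn.www", "contents:", 3, b"<html>...</html>")
put("com.cnn.www", "anchor:cnnsi.com", 9, b"CNN")

assert get("com.cnn.www", "anchor:cnnsi.com", 9) == b"CNN"
```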
Rows
The row keys are arbitrary strings. Every read or write of data under a single row key is atomic, which means a single-row operation either takes effect entirely or not at all, even when multiple clients touch the same row simultaneously.
Every read or write of data under a single row key is atomic (regardless of the number of different columns being read or written in the row), a design decision that makes it easier for clients to reason about the system’s behavior in the presence of concurrent updates to the same row.
Concurrent reads are fine; atomicity does not forbid them, it only guarantees that each operation observes (or produces) a consistent row state rather than a partially applied update.
Bigtable maintains data in lexicographic order by row key. The row range for a table is dynamically partitioned. Each row range is called a tablet, which is the unit of distribution and load balancing.
This makes reads of short row ranges efficient: they typically require communication with only a small number of machines.
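The benefit of lexicographic ordering can be sketched in a few lines of Python. Keeping row keys sorted means a short range scan is one contiguous slice (in Bigtable, one or a few tablets); the row keys below are hypothetical examples of mine.

```python
import bisect

# Rows kept in lexicographic order, as Bigtable maintains them.
rows = sorted(["com.cnn.www", "com.cnn.www/sports", "com.cnn.www/tech",
               "org.apache.hbase", "org.wikipedia.www"])

def scan(start: str, end: str) -> list[str]:
    # Half-open row range [start, end): a contiguous slice of the sorted keys.
    lo = bisect.bisect_left(rows, start)
    hi = bisect.bisect_left(rows, end)
    return rows[lo:hi]

# All cnn.com pages are contiguous thanks to reversed-URL row keys.
assert scan("com.cnn", "com.cno") == ["com.cnn.www", "com.cnn.www/sports",
                                      "com.cnn.www/tech"]
```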
Column Families
Unlike rows, columns are grouped into sets.
Column keys are grouped into sets called column families, which form the basic unit of access control. All data stored in a column family is usually of the same type (we compress data in the same column family together).
The number of distinct column families is intentionally kept small, whereas the number of columns per family is unbounded.
A column key is named using the syntax family:qualifier.
Column family names must be printable, but qualifiers may be arbitrary strings.
Access control and both disk and memory accounting are performed at the column-family level.
This means an application can be restricted to viewing only the data in the column families it is authorized to access.
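A small Python sketch of how family:qualifier keys support per-family access control. The split is the paper's naming convention; the ACL structure and client names here are my own hypothetical illustration.

```python
# Split a column key into its family and qualifier parts.
def split_column(column_key: str) -> tuple[str, str]:
    family, _, qualifier = column_key.partition(":")
    return family, qualifier

# Hypothetical per-family ACL: which families each client may read.
acl = {"crawler": {"contents", "anchor"}, "analyst": {"anchor"}}

def can_read(client: str, column_key: str) -> bool:
    # Access control is checked at the column-family level, not per column.
    family, _ = split_column(column_key)
    return family in acl.get(client, set())

assert split_column("anchor:cnnsi.com") == ("anchor", "cnnsi.com")
assert can_read("analyst", "anchor:cnnsi.com")
assert not can_read("analyst", "contents:")
```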
Timestamps
Instead of storing only one string (the data) in a cell, Bigtable allows a cell to contain multiple versions of the same data; these versions are indexed by timestamp. Timestamps can be assigned by Bigtable or by client applications.
Bigtable also allows clients to specify per-column-family settings to garbage-collect cell versions automatically, e.g., keep only the last n versions, or only versions that are new enough.
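A Python sketch of multi-version cells with "keep only the last n versions" garbage collection. The function and variable names are mine; the time-based GC policy the paper also mentions is omitted for brevity.

```python
# A cell is a list of (timestamp, value) pairs, kept newest-first.
def put_version(cell: list, timestamp: int, value: bytes, keep_last: int = 3) -> None:
    cell.append((timestamp, value))
    cell.sort(key=lambda tv: tv[0], reverse=True)  # newest version first
    del cell[keep_last:]                           # garbage-collect old versions

cell: list = []
for ts in [1, 2, 3, 4]:
    put_version(cell, ts, f"v{ts}".encode())

# Only the three most recent versions survive garbage collection.
assert [ts for ts, _ in cell] == [4, 3, 2]
```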
The above covers the basic structure of Bigtable's data model. I won't cover the rest of the paper here; for an open-source implementation of these ideas, see Apache HBase.