Bigtable: A Distributed Storage System for Structured Data


1 Bigtable: A Distributed Storage System for Structured Data
Ido Hakimi

2 Outline Introduction Data Model API Building Blocks Implementation
Refinements Performance Evaluation Future Work

3 Why not a Relational Database?
The scale is too large for most commercial databases, and even if it weren't, the cost would be very high. Low-level storage optimizations help performance significantly. (Speaker note: explain the history of storage.)

4 What is Bigtable? A distributed storage system for managing structured data, designed to scale to petabytes of storage across thousands of commodity servers. Its key properties: wide applicability, scalability, high performance, and high availability. (Speaker note: explain the meaning of each square in the diagram.)

5 Outline Introduction Data Model API Building Blocks Implementation
Refinements Performance Evaluation Future Work

6 Data Model – Overview A sparse, distributed, persistent, multidimensional sorted map. (Next: the Webtable example, row by row.)

7 Data Model – Example “Webtable” stores a copy of web pages and their related information. Row key: the URL, with the hostname reversed (e.g. com.cnn.www). Column key: attribute name. Timestamp: the time the page was fetched.
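As an illustration of the reversed-hostname convention, here is a minimal C++ helper (hypothetical, not part of the Bigtable API) that turns a hostname into the row-key form, so pages from the same domain sort next to each other:

#include <sstream>
#include <string>
#include <vector>

// Reverse the dot-separated components of a hostname, e.g.
// "maps.google.com" -> "com.google.maps".
std::string ReverseHostname(const std::string& hostname) {
  std::vector<std::string> parts;
  std::stringstream ss(hostname);
  for (std::string part; std::getline(ss, part, '.'); )
    parts.push_back(part);
  std::string out;
  for (auto it = parts.rbegin(); it != parts.rend(); ++it) {
    if (!out.empty()) out += '.';
    out += *it;
  }
  return out;
}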

8 Data Model – Rows Row key: an arbitrary string (typically 10-100 bytes, at most 64KB)
Every read or write of data under a single row key is atomic

9 Data Model – Rows Rows are sorted by row key in lexicographic order
Tablet: a contiguous range of rows; the unit of distribution and load balancing; provides good locality for data access

10 Data Model – Columns Column families: groups of column keys, usually of the same type
The unit of access control. Column key syntax: family:qualifier. A table has at most a few hundred column families. For example, the Webtable has a language family with only one column key, and an anchor family in which each column key names a referring site. Access control, and both disk and memory accounting, are performed at the column-family level.

11 Data Model – Timestamps
Timestamps index multiple versions of the same data; they need not be the “real time”. Bigtable timestamps are 64-bit integers. Applications that need to avoid timestamp collisions can assign timestamps themselves. Bigtable garbage-collects old data: for example, a column family can be configured to keep only the last n versions of a cell, or only values written in the last seven days.

12 Outline Introduction Data Model API Building Blocks Implementation
Refinements Performance Evaluation Future Work

13 API Write or delete values in Bigtable
Look up values from individual rows. Iterate over a subset of the data in a table. (What you can do with the API, and how you access the data model.)

14 API – Update a Row

15 API – Update a Row Opens a Table

16 API – Update a Row We’re going to mutate the row

17 API – Update a Row Store a new item under the column key “anchor:www.c-span.org”

18 API – Update a Row Delete an item under the column key “anchor:www.abc.com”

19 API – Update a Row The entire mutation is applied atomically; the full write example follows below.
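For reference, this is the C++ write example from the Bigtable paper that slides 14-19 step through:

// Open the table
Table *T = OpenOrDie("/bigtable/web/webtable");

// Write a new anchor and delete an old anchor
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:www.c-span.org", "CNN");
r1.Delete("anchor:www.abc.com");
Operation op;
Apply(&op, &r1);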

20 API – Iterate over a Table
Create a Scanner instance

21 API – Iterate over a Table
Access “anchor” column family

22 API – Iterate over a Table
Specify “return all versions”

23 API – Iterate over a Table
Specify a row key

24 API – Iterate over a Table
Iterate over rows; the full scan example follows below.
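Again for reference, the C++ scan example from the Bigtable paper that slides 20-24 step through:

Scanner scanner(T);
ScanStream *stream;
stream = scanner.FetchColumnFamily("anchor");
stream->SetReturnAllVersions();
scanner.Lookup("com.cnn.www");
for (; !stream->Done(); stream->Next()) {
  printf("%s %s %lld %s\n",
         scanner.RowName(),
         stream->ColumnName(),
         stream->MicroTimestamp(),
         stream->Value());
}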

25 Outline Introduction Data Model API Building Blocks Implementation
Refinements Performance Evaluation

26 Building Blocks GFS: stores log and data files
Provides scalability, reliability, performance, and fault tolerance. Chubby: a highly-available and persistent distributed lock service. Bigtable processes often share machines with processes from other applications; Bigtable depends on a cluster management system for scheduling jobs, managing resources on shared machines, dealing with machine failures, and monitoring machine status.

27 Building Blocks

28 SSTable SSTable file format
A persistent, ordered, immutable map of key-value (string-to-string) pairs, used internally to store Bigtable data. Each SSTable contains a sequence of blocks, typically 64KB each (configurable). A block index stored at the end of the SSTable is used to locate blocks; it is loaded into memory when the SSTable is opened. A lookup then needs a single disk seek: first find the right block by binary search in the in-memory index, then read that block from disk. Optionally, an SSTable can be mapped completely into memory, allowing lookups and scans without touching disk.
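A minimal sketch of that lookup path; the names (SSTableReader, ReadBlockFromDisk) are illustrative, not the real Bigtable internals:

#include <cstdint>
#include <map>
#include <optional>
#include <string>

struct Block {
  // Decoded key-value pairs of one ~64KB block.
  std::map<std::string, std::string> entries;
};

// Hypothetical stand-in for reading and decoding one block
// at a given file offset.
Block ReadBlockFromDisk(int64_t offset) { return Block{}; }

class SSTableReader {
 public:
  std::optional<std::string> Lookup(const std::string& key) {
    // Binary search in the in-memory index finds the single
    // candidate block; one disk seek then reads it.
    auto it = index_.lower_bound(key);  // first block whose last key >= key
    if (it == index_.end()) return std::nullopt;
    Block block = ReadBlockFromDisk(it->second);
    auto kv = block.entries.find(key);
    if (kv == block.entries.end()) return std::nullopt;
    return kv->second;
  }
 private:
  // Last key of each block -> file offset of that block;
  // loaded when the SSTable is opened.
  std::map<std::string, int64_t> index_;
};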

29 Tablet Contains some range of rows of the table
Built out of multiple SSTables. Each tablet is approximately 100-200MB in size by default. (Diagram: a tablet with start key “aardvark” and end key “apple”, backed by SSTables made of 64K blocks plus a block index.)

30 Table Multiple tablets make up a table, and SSTables can be shared between tablets
Tablets do not overlap; SSTables can overlap. (Diagram: two adjacent tablets, “aardvark”-“apple” and “apple”-“boat”, sharing one of their SSTables.)

31 Outline Introduction Data Model API Building Blocks Implementation
Refinements Performance

32 Bigtable Components A library that is linked into every client
Many tablet servers, which handle reads and writes to tablets from clients. One tablet master, which assigns tablets to tablet servers, detects the addition and expiration of tablet servers, and balances tablet-server load. Bigtable relies on Chubby, a highly-available and persistent distributed lock service. A Chubby service consists of five active replicas, one of which is elected master and actively serves requests; Chubby uses the Paxos algorithm. Each Chubby client maintains a session with the service; a session expires if the client cannot renew its lease within the expiration time, and an expired session loses any locks. Bigtable uses Chubby to ensure there is at most one active master at any time, to store the bootstrap location of Bigtable data, to discover tablet servers and finalize tablet-server deaths, and to store access control lists. If Chubby becomes unavailable for an extended period, Bigtable becomes unavailable.

33 Architecture The master is responsible for:
assigning tablets to tablet servers; detecting the addition and expiration of tablet servers; balancing tablet-server load; garbage collection of files in GFS; and handling schema changes such as table and column-family creation. Each tablet server manages a set of tablets (ten to a thousand per server), handles reads and writes to the tablets it has loaded, and splits tablets that have grown too large.

34 Tablet Location Three-level hierarchy, analogous to a B+-tree
The root tablet (only one; stores the addresses of the METADATA tablets) and the METADATA tablets (store the addresses of user tablets). Each METADATA row stores approximately 1KB of data in memory. With 128MB METADATA tablets, the scheme can address 2^61 bytes.
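The capacity claim follows from simple arithmetic (assuming ~1KB per METADATA row and 128MB tablets throughout):

\[
\frac{128\,\text{MB}}{1\,\text{KB}} = 2^{17}\ \text{rows per METADATA tablet},\quad
(2^{17})^2 = 2^{34}\ \text{user tablets},\quad
2^{34} \times 2^{27}\,\text{B} = 2^{61}\ \text{bytes}.
\]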

35 Tablet Location The client library caches (multiple) tablet locations
If the cache is stale, it queries again: when the client does not know the location of a tablet, or discovers that cached location information is incorrect, it recursively moves up the tablet-location hierarchy. If the client cache is empty, a lookup takes three network round-trips, including one read from Chubby. Tablet locations are served from memory, so no GFS accesses are required; the client library further reduces this cost in the common case by prefetching tablet locations, reading the metadata for more than one tablet whenever it reads the METADATA table. Client data never moves through the master: clients communicate directly with tablet servers for reads and writes, and most clients never communicate with the master at all.

36 Tablet Assignment Each tablet is assigned to at most one tablet server at a time. When a tablet is unassigned and a tablet server with sufficient room is available, the master assigns the tablet by sending the server a tablet load request. The master uses Chubby to keep track of live tablet servers: when a tablet server starts, it creates, and acquires an exclusive lock on, a uniquely-named file in a specific Chubby directory (the servers directory), which the master monitors to discover tablet servers. A tablet server stops serving its tablets if it loses its exclusive lock, e.g. due to a network partition that caused it to lose its Chubby session. (Chubby provides an efficient mechanism that lets a tablet server check whether it still holds its lock without incurring network traffic.) A tablet server attempts to reacquire an exclusive lock on its file as long as the file still exists; if the file no longer exists, the server can never serve again, so it kills itself. Whenever a tablet server terminates (e.g. because the cluster management system is removing its machine from the cluster), it attempts to release its lock so that the master will reassign its tablets more quickly.

37 Tablet Assignment Case 1: some tablets are unassigned
The master assigns them to tablet servers with sufficient room. Case 2: a tablet server stops serving; the master detects this and assigns the outstanding tablets to other servers. Case 3: too many small tablets; the master initiates a merge. Case 4: a tablet grows too large; the corresponding tablet server initiates a split and notifies the master.

38 Tablet Serving A tablet is stored as a sequence of SSTables in GFS
Tablet mutations are logged in a commit log that stores redo records. Recently committed updates are kept in memory in a sorted buffer, the memtable; older updates live in the SSTables in GFS.

39 Tablet Serving - Recovery
The tablet server fetches the tablet's metadata from the METADATA table, which contains the list of SSTables that comprise the tablet and a set of redo points. The server reads the indices of those SSTables into memory, then reconstructs the memtable by applying all the mutations committed after the redo points.

40 Tablet Serving - Write operation
1. The tablet server checks that the request is well-formed and the sender is authorized. 2. The mutation is written to the commit log. 3. The write is committed. 4. Its contents are inserted into the memtable.
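A minimal sketch of this write path; CommitLog, the validity checks, and the flattened row/column key are all illustrative, not the real Bigtable internals:

#include <map>
#include <string>

struct Mutation {
  std::string row, column, value;
};

class CommitLog {
 public:
  // Append a redo record (stubbed; the real log lives in GFS).
  void Append(const Mutation& m) {}
};

class WritePath {
 public:
  bool Apply(const Mutation& m) {
    // 1. Check that the request is well-formed and authorized.
    if (!IsWellFormed(m) || !IsAuthorized(m)) return false;
    // 2-3. Log the mutation; once logged, it is committed.
    log_.Append(m);
    // 4. Insert its contents into the sorted, in-memory memtable
    //    (illustrative flattened "row/column" key).
    memtable_[m.row + "/" + m.column] = m.value;
    return true;
  }
 private:
  bool IsWellFormed(const Mutation&) const { return true; }  // stub
  bool IsAuthorized(const Mutation&) const { return true; }  // stub
  CommitLog log_;
  std::map<std::string, std::string> memtable_;
};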

41 Tablet Serving - Read operation
1. The tablet server checks that the request is well-formed and the sender is authorized. 2. The read executes on a merged view of the memtable and the tablet's SSTables; since both are lexicographically sorted, the merged view can be formed efficiently.
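A sketch of the merged view for a single-key read, with newer sources shadowing older ones (the types are illustrative):

#include <map>
#include <optional>
#include <string>
#include <vector>

using SortedMap = std::map<std::string, std::string>;

std::optional<std::string> MergedLookup(
    const std::string& key,
    const SortedMap& memtable,
    const std::vector<SortedMap>& sstables /* newest first */) {
  // The memtable holds the most recent mutations, so consult it first.
  if (auto it = memtable.find(key); it != memtable.end()) return it->second;
  // Fall back to the SSTables, newest to oldest.
  for (const auto& sst : sstables)
    if (auto it = sst.find(key); it != sst.end()) return it->second;
  return std::nullopt;
}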

42 Compactions The memtable grows as write operations execute
Two types of compactions: minor compaction, and merging compaction (whose special case is major compaction).

43 Compactions - Minor compaction (when the memtable size reaches a threshold)
1. Freeze the memtable. 2. Create a new memtable. 3. Convert the frozen memtable to an SSTable and write it to GFS.
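A sketch of these three steps, reusing an illustrative memtable type like the one in the write-path sketch above:

#include <map>
#include <string>
#include <utility>

using MemTable = std::map<std::string, std::string>;

// Hypothetical stand-in for serializing a sorted map as an
// immutable SSTable in GFS.
void WriteSSTableToGFS(const MemTable& frozen) {}

class Tablet {
 public:
  void MinorCompaction() {
    // 1. Freeze the current memtable.
    MemTable frozen = std::move(memtable_);
    // 2. Incoming writes go to a fresh memtable, so reads and
    //    writes continue during the compaction.
    memtable_ = MemTable{};
    // 3. Convert the frozen memtable to an SSTable in GFS.
    WriteSSTableToGFS(frozen);
  }
 private:
  MemTable memtable_;
};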

44 Compactions - Merging compaction (periodically)
1. Freeze the memtable. 2. Create a new memtable. 3. Merge a few SSTables and the frozen memtable into a single new SSTable.

45 Compactions - Major compaction
A special case of merging compaction that merges all SSTables and the memtable into exactly one SSTable.

46 Compactions Why freeze & create a new memtable?
So that incoming read and write operations can continue during compactions. Advantages of compaction: it releases tablet-server memory, and it reduces the amount of data that must be read from the commit log during recovery if the tablet server dies.

47 Outline Introduction Data Model API Building Blocks Implementation
Refinements Performance

48 Refinements - Locality groups
Clients can group multiple column families together into a locality group; families that are not typically accessed together go in different locality groups. For each tablet, each locality group is stored in a separate SSTable, making reads and writes more efficient. In the Webtable, for example, page metadata such as language can live in a different locality group than the page contents, so a metadata scan never reads the contents.

49 Refinements - Compression
Similar data sits in the same column, in neighbouring rows, and in multiple versions, so it compresses well. Compression is customizable and applied at the SSTable-block level (the smallest component), using a two-pass scheme: 1. Bentley and McIlroy's scheme, which compresses long common strings across a large window; 2. a fast compression algorithm that looks for repetitions in a small window. Experimental compression ratio: 10% of the original size (versus 25-33% for Gzip).

50 Refinements - Caching Tablet servers use two levels of caching: the Scan Cache caches the key-value pairs returned by the SSTable interface, and the Block Cache caches SSTable blocks read from GFS.

51 Refinements - Bloom Filters A Bloom filter per SSTable lets the server ask whether an SSTable might contain data for a given row/column pair, drastically reducing disk seeks for reads of non-existent rows or columns.

52 Refinements – Commit-log Each tablet server uses a single commit log for all its tablets, co-mingling their mutations in one physical file; this keeps GFS write load down but complicates recovery, which is addressed by sorting log entries so a recovering server reads only the entries for its own tablets.

53 Refinements – Tablet Recovery Before unloading a tablet, the source tablet server performs a minor compaction (then a second, quick one for mutations that arrived meanwhile), so the next server can load the tablet without replaying the commit log.

54 Refinements – Exploiting immutability Because SSTables are immutable, reads need no synchronization on file contents, obsolete SSTables can be garbage-collected, and tablet splits are fast since child tablets share the parent's SSTables. Only the memtable is mutable, and contention on it is reduced with copy-on-write.

55 Outline Introduction Data Model API Building Blocks Implementation
Refinements Performance

56 Performance (2006)

57 Performance Random reads are slow because the tablet server's channel to GFS saturates (each read pulls an entire SSTable block over the network). Random reads (mem) are fast because they are served from memory without touching GFS. Random and sequential writes outperform sequential reads because writes touch only the commit log and the memtable. Sequential reads beat random reads thanks to block caching. Scans are faster still because the tablet server can return more data per RPC.

58 Performance Scalability differs markedly across operations
Random reads (mem) saw a throughput increase of ~300x for a 500x increase in tablet servers; plain random reads scale poorly.

