Bigtable: A Distributed Storage System for Structured Data
Written by: Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, Google, Inc.
Presented by: Manoher Shatha & Naveen Kumar Ratkal
Overview
Introduction
Data Model
API
Building Blocks
Implementation
Refinements
Real Applications
Conclusion
Discussion
Introduction
Bigtable is a distributed storage system for managing structured data. It is designed to scale to a very large size: petabytes of data across thousands of commodity servers.
Why not a commercial database?
Few commercial databases can store petabytes of data, and those that can are very costly.
Low-level storage optimizations are difficult to perform on a commercial database.
With good design and implementation, Bigtable achieved wide applicability, scalability, and high availability.
Bigtable is used by more than sixty Google products and projects.
Database vs. Bigtable
Feature               Database                              Bigtable
Relational model      Supported by most databases           No
Atomic transactions   All transactions are atomic           Limited
Data types            Many data types                       Uninterpreted strings of characters
ACID                  Yes                                   No
Operations            Yes (insert, delete, update, etc.)    Yes (read, write, update, delete, etc.)
Data Model
Figure 1: Web Table. Row "com.cnn.www" with a "contents:" column (versions at timestamps t3, t5, t6) and the columns "anchor:cnnsi.com" ("CNN", t9) and "anchor:my.look.ca" ("CNN.com", t8).
Bigtable is a multidimensional sorted map.
The map is indexed by row key, column key, and timestamp: (row: string, column: string, time: int64) -> string.
Rows are kept in lexicographic order by row key.
The row range of a table is dynamically partitioned; each row range is called a "tablet".
Column keys use the syntax family:qualifier.
A cell can store multiple versions of its data, distinguished by timestamp.
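The indexing rule above can be illustrated with a small in-memory sketch, assuming C++17. It is not Bigtable's implementation: `CellKey`, `ToyTable`, and `ReadLatest` are hypothetical names invented here, and a `std::map` stands in for the distributed sorted map. Timestamps are ordered newest first so that the latest version of a cell is found first.

```cpp
#include <cstdint>
#include <limits>
#include <map>
#include <optional>
#include <string>

// Hypothetical key for one cell version: (row, column, timestamp).
struct CellKey {
    std::string row;
    std::string column;      // written as "family:qualifier"
    std::int64_t timestamp;
};

// Rows and columns sort lexicographically; timestamps sort newest first.
struct CellKeyLess {
    bool operator()(const CellKey& a, const CellKey& b) const {
        if (a.row != b.row) return a.row < b.row;
        if (a.column != b.column) return a.column < b.column;
        return a.timestamp > b.timestamp;
    }
};

// Toy stand-in for one table: a sorted map from cell key to an
// uninterpreted string value.
using ToyTable = std::map<CellKey, std::string, CellKeyLess>;

// Returns the most recent version of (row, column), if the cell exists.
std::optional<std::string> ReadLatest(const ToyTable& table,
                                      const std::string& row,
                                      const std::string& column) {
    // With newest-first ordering, the largest possible timestamp is the
    // smallest key for this cell, so lower_bound lands on the newest version.
    const CellKey probe{row, column, std::numeric_limits<std::int64_t>::max()};
    auto it = table.lower_bound(probe);
    if (it == table.end() || it->first.row != row || it->first.column != column) {
        return std::nullopt;
    }
    return it->second;
}
```

Because the map is sorted by row first, all cells of one row are adjacent, which is what makes row-range scans and tablet partitioning natural.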
API
Writing to Bigtable
    // Open the table
    Table *T = OpenOrDie("/bigtable/web/webtable");
    // Write a new anchor and delete an old anchor
    RowMutation r1(T, "com.cnn.www");
    r1.Set("anchor:...", "CNN");
    r1.Delete("anchor:...");
    Operation op;
    Apply(&op, &r1);
Taken from the paper.
API Contd…
Reading from Bigtable
    Scanner scanner(T);
    ScanStream *stream;
    stream = scanner.FetchColumnFamily("anchor");
    stream->SetReturnAllVersions();
    scanner.Lookup("com.cnn.www");
    for (; !stream->Done(); stream->Next()) {
      printf("%s %s %lld %s\n",
             scanner.RowName(),
             stream->ColumnName(),
             stream->MicroTimestamp(),
             stream->Value());
    }
Taken from the paper.
Building Blocks
GFS: Bigtable uses the Google File System to store its data.
Cluster management: Google's cluster management system manages the Bigtable cluster.
Chubby: a distributed lock service that allows a multi-thousand-node Bigtable cluster to stay coordinated. It runs five replicas, one of which is elected master.
SSTables
SSTable is the underlying file format used to store Bigtable data.
SSTables are immutable: new data is written to a new SSTable, and obsolete SSTables are left for garbage collection.
Figure: SSTable, a sequence of 64K blocks followed by a block index (from Erik Paulson's presentation).
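The block-plus-index layout can be sketched as follows, assuming C++17. `ToySSTable` is a hypothetical in-memory stand-in, not the real file format: blocks here are bounded by entry count rather than by 64K of bytes, and everything lives in memory rather than in GFS. The point is only that the table is built once, never modified, and that a lookup consults the index to pick a single block.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy SSTable: sorted key/value entries split into blocks,
// plus an index that records the last key of each block.
class ToySSTable {
public:
    using Entry = std::pair<std::string, std::string>;

    // Built once from already-sorted data; there are no mutating methods.
    ToySSTable(const std::map<std::string, std::string>& sorted,
               std::size_t entries_per_block) {
        std::vector<Entry> block;
        for (const auto& kv : sorted) {
            block.emplace_back(kv.first, kv.second);
            if (block.size() == entries_per_block) Flush(block);
        }
        if (!block.empty()) Flush(block);
    }

    std::optional<std::string> Get(const std::string& key) const {
        // Binary search the index for the first block whose last key >= key.
        auto it = std::lower_bound(index_.begin(), index_.end(), key);
        if (it == index_.end()) return std::nullopt;
        const auto& block = blocks_[static_cast<std::size_t>(it - index_.begin())];
        // Only this one block is searched.
        for (const auto& entry : block) {
            if (entry.first == key) return entry.second;
        }
        return std::nullopt;
    }

    std::size_t block_count() const { return blocks_.size(); }

private:
    // Records the block's last key in the index and stores the block.
    void Flush(std::vector<Entry>& block) {
        index_.push_back(block.back().first);
        blocks_.push_back(std::move(block));
        block.clear();
    }

    std::vector<std::string> index_;
    std::vector<std::vector<Entry>> blocks_;
};
```

Since nothing in the class mutates after construction, concurrent readers would need no locking, which is the property the later "Exploiting immutability" slide relies on.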
Tablet & Table
A tablet contains a contiguous range of rows.
Figure: Tablet, with start row "aardvark" and end row "apple", stored as a set of SSTables, each made of 64K blocks and an index (from Erik Paulson's presentation).
Figure: Table, made of tablets such as aardvark..apple and apple_two..boat, each backed by SSTables (from Erik Paulson's presentation).
Bigtable contains tables, each table contains a set of tablets, and each tablet contains a set of rows ( MB).
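Finding the tablet that owns a row can be sketched as an ordered lookup over tablet end keys, assuming C++17. `ToyTabletDirectory` and the tablet names are hypothetical; the sketch only shows how sorted, non-overlapping row ranges let a single ordered search locate the right tablet.

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical directory for one table: each tablet is registered under
// its END row key (inclusive). Ranges are assumed not to overlap.
class ToyTabletDirectory {
public:
    void AddTablet(const std::string& end_row, const std::string& tablet_name) {
        by_end_row_[end_row] = tablet_name;
    }

    // The owning tablet is the first one whose end row is >= the row key.
    std::optional<std::string> Locate(const std::string& row) const {
        auto it = by_end_row_.lower_bound(row);
        if (it == by_end_row_.end()) return std::nullopt;  // past the last tablet
        return it->second;
    }

private:
    std::map<std::string, std::string> by_end_row_;
};
```

With tablets ending at "apple" and "boat", as in the figure, the row "apple_two" sorts after "apple" and is therefore served by the second tablet.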
Implementation
Bigtable has three components:
A library linked into every client.
A master server.
Tablet servers.
Tablet servers can be added dynamically based on the workload.
The master assigns tablets to tablet servers.
The master is lightly loaded.
Tablet Location
Figure: Tablet Location Hierarchy. Chubby file -> Root tablet (1st METADATA tablet) -> Other METADATA tablets -> User Table 1 ... User Table N.
Tablet Assignment
Each tablet is assigned to a tablet server, which serves client requests for it.
The master keeps track of the live tablet servers and assigns tablets to them over RPC.
A tablet server acquires a lock on a file in a Chubby directory.
If the tablet server terminates, it releases the lock on the file.
The tablet server sends its status to the master.
How does the master learn about tablets and tablet servers?
The master acquires a unique master lock in Chubby.
The master scans the servers directory in Chubby to find the live servers.
The master communicates with each live tablet server to learn which tablets it is already serving.
The master scans the METADATA table to find the unassigned tablets.
These steps are taken from the Bigtable paper.
Tablet Assignment Contd…
Figure: The master acquires a unique lock in the Chubby directory and scans it to learn about the live tablet servers, communicates with all the tablet servers to get details about the tablets they are serving, and scans the METADATA table.
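The last three startup steps amount to a set difference: tablets listed in METADATA minus tablets that some live server already reports. A minimal sketch, assuming C++17 and plain in-memory sets in place of Chubby, RPC, and the METADATA table; `FindUnassignedTablets` and its parameters are hypothetical names, not part of Bigtable.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical inputs standing in for what the master learns at startup:
//   live_servers     - server names found in the Chubby servers directory
//   reported         - tablets each server says it is serving
//   metadata_tablets - every tablet listed in the METADATA table
// Returns the tablets that no live server is serving, in sorted order.
std::vector<std::string> FindUnassignedTablets(
    const std::set<std::string>& live_servers,
    const std::map<std::string, std::set<std::string>>& reported,
    const std::set<std::string>& metadata_tablets) {
    // Collect tablets served by live servers only; reports from servers
    // that are no longer live are ignored.
    std::set<std::string> assigned;
    for (const auto& server : live_servers) {
        auto it = reported.find(server);
        if (it != reported.end()) {
            assigned.insert(it->second.begin(), it->second.end());
        }
    }
    // Anything in METADATA that is not assigned still needs a server.
    std::vector<std::string> unassigned;
    for (const auto& tablet : metadata_tablets) {
        if (assigned.count(tablet) == 0) unassigned.push_back(tablet);
    }
    return unassigned;
}
```

The resulting list is what the master would then hand out to tablet servers.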
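Tablet Serving
The commit log stores the updates that are made to the data.
Recently committed updates are held in memory in the memtable.
Recovery process: the tablet server rebuilds the memtable by replaying the updates in the commit log.
When a read or write arrives at a tablet server, the server checks that the request is well-formed and that the sender is authorized. Chubby holds the permission file.
A valid mutation is written to the commit log, using group commit, and then applied to the memtable.
Figure: Tablet Representation. Write operations go to the tablet log in GFS and to the memtable in memory; read operations see the memtable together with the SSTable files in GFS. Figure is taken from the Bigtable paper.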
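The write, read, and recovery paths can be sketched in a few lines, assuming C++17. `ToyTablet` is a hypothetical single-process model: the commit log is a vector rather than a GFS file, SSTables are plain sorted maps, and request validation, authorization, group commit, and deletions are all left out.

```cpp
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical single-tablet model of the serving path.
struct ToyTablet {
    std::vector<std::pair<std::string, std::string>> commit_log;  // durable in the real system
    std::map<std::string, std::string> memtable;                  // in memory
    std::vector<std::map<std::string, std::string>> sstables;     // oldest first

    // Write path: log the mutation first, then apply it to the memtable.
    void Write(const std::string& key, const std::string& value) {
        commit_log.emplace_back(key, value);
        memtable[key] = value;
    }

    // Read path: the memtable holds the newest data, then SSTables are
    // consulted from newest to oldest.
    std::optional<std::string> Read(const std::string& key) const {
        if (auto it = memtable.find(key); it != memtable.end()) return it->second;
        for (auto sst = sstables.rbegin(); sst != sstables.rend(); ++sst) {
            if (auto it = sst->find(key); it != sst->end()) return it->second;
        }
        return std::nullopt;
    }

    // Recovery: rebuild the memtable by replaying the commit log in order.
    void Recover() {
        memtable.clear();
        for (const auto& entry : commit_log) memtable[entry.first] = entry.second;
    }
};
```

Clearing `memtable` and then calling `Recover()` mimics a tablet server restart: the logged writes reappear even though the in-memory state was lost.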
Compaction
Figure: Minor Compaction. As write operations fill the memtable, it is frozen and converted to an SSTable.
Figure: Merging Compaction. The small new SSTables produced by minor compactions are merged into new, larger SSTables.
A merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction.
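Both compaction steps can be sketched over in-memory sorted maps, assuming C++17. `SortedRun`, `MinorCompaction`, and `MergeRuns` are hypothetical names; real compactions read and write SSTable files and also handle deletion markers, which this sketch ignores.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for either a memtable or an SSTable.
using SortedRun = std::map<std::string, std::string>;

// Minor compaction: the current memtable becomes a new SSTable (appended
// as the newest run) and an empty memtable takes its place.
void MinorCompaction(SortedRun& memtable, std::vector<SortedRun>& sstables) {
    if (memtable.empty()) return;
    sstables.push_back(std::move(memtable));
    memtable.clear();
}

// Merging compaction: combine several runs, ordered oldest to newest,
// into one run. When a key appears more than once, the newest value wins.
SortedRun MergeRuns(const std::vector<SortedRun>& oldest_to_newest) {
    SortedRun merged;
    for (const auto& run : oldest_to_newest) {
        for (const auto& [key, value] : run) {
            merged[key] = value;  // later runs overwrite earlier ones
        }
    }
    return merged;
}
```

Calling `MergeRuns` on every SSTable of a tablet and keeping only the result corresponds to a major compaction in this toy model.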
Refinements
Locality Groups
Figure: Locality Group, showing the "contents:", "anchor:cnnsi.com", and "anchor:my.look.ca" columns of row "com.cnn.www" (from Jeff Dean's presentation).
Multiple column families can be grouped into a locality group.
Separating column families into different locality groups makes reads more efficient.
Additional parameters can be specified for each locality group.
Refinements Contd…
Caching for Read Performance
There are two types of caches.
Scan Cache: caches key-value pairs.
Block Cache: caches complete SSTable blocks.
Commit-log Implementation
What about a separate commit log for each tablet? What would be the bottleneck?
It is better to use a single commit-log file.
During recovery, the commit log entries are sorted by key (table, row name, log sequence number) to avoid duplicate reads.
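Sorting a shared commit log so that each tablet's mutations become contiguous can be sketched as below, assuming C++17. `LogEntry` and `SortForRecovery` are hypothetical names, and the log is an in-memory vector rather than a file in GFS.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical record in a commit log shared by many tablets.
struct LogEntry {
    std::string table;
    std::string row;
    std::uint64_t sequence;  // log sequence number
    std::string mutation;
};

// Orders entries by (table, row name, log sequence number), so that all
// mutations for one tablet's row range sit next to each other and stay
// in the order in which they were logged.
void SortForRecovery(std::vector<LogEntry>& log) {
    std::sort(log.begin(), log.end(), [](const LogEntry& a, const LogEntry& b) {
        return std::tie(a.table, a.row, a.sequence) <
               std::tie(b.table, b.row, b.sequence);
    });
}
```

After sorting, a server recovering one tablet can read one contiguous slice of the log instead of scanning every entry.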
Refinements Contd…
Speeding up Tablet Recovery
If the master decides to move a tablet from one tablet server to another based on load, the tablet undergoes minor compaction twice.
Exploiting Immutability
Because SSTables are immutable, reads need no synchronization when accessing their data.
Deleted data is reclaimed by garbage collection.
Real Applications
Figure: Characteristics of a few tables in production use. Figure is taken from the Bigtable paper.
Real Applications Contd…
Google Analytics
Google Analytics is a service that helps webmasters analyze traffic patterns at their websites.
To enable the service, webmasters embed a small JavaScript program in their web pages.
We describe two of the tables used by Google Analytics.
The raw click table (~200 TB) maintains a row for each end-user session.
The summary table (~20 TB) contains various predefined summaries for each website.
Figure: Raw Click Table, with columns URL, Address (IP address), and Location (user location, e.g. Norfolk, VA).
Figure: Summary Table, with columns URL, Language (e.g. English), and Page Rank (e.g. 6).
Real Applications Contd…
Personalized Search
Personalized Search is a service that records user queries and clicks across a variety of Google properties.
Personalized Search stores each user's data in Bigtable.
Each user has a unique userID and is assigned a row named by that userID.
All user actions are stored in a table.
The Personalized Search data is replicated across several Bigtable clusters to increase availability and to reduce latency due to distance from clients.
Figure: UserTable, with a row per User ID and columns Date (e.g. 04/06/2007) and Text (words used, e.g. Bloom Filter).
Conclusion
What did we discuss? Google's Bigtable, its architecture, and some real applications that use it.
Bigtable is a feasible solution for storing large amounts of structured data.
It reduces the amount of space required to store the data.
It is difficult for new users to use Bigtable.
Discussion
How is server expansion done, and are tablets redistributed immediately?
What happens when a tablet server crashes?
Questions?