Published by Jaren Brisendine. Modified over 10 years ago.
1
Bigtable: A Distributed Storage System for Structured Data
Fay Chang et al. (Google, Inc.)
Presenter: Kyungho Jeon (kyunghoj@buffalo.edu)
10/22/2012, Fall 2012: CSE 704 Web-scale Data Management
2
Motivation and Design Goal
Distributed storage system for structured data
– Scalability: petabytes of data on thousands of (commodity) machines
– Wide applicability: throughput-oriented and latency-sensitive workloads
– High performance
– High availability
3
Data Model
4
Data Model
Not a full relational data model
Provides a simple data model
– Supports dynamic control over data layout
– Allows clients to reason about the locality properties of their data
5
Data Model – A Big Table
A table in Bigtable is a:
– Sparse
– Distributed
– Persistent
– Multidimensional
– Sorted map
6
Data Model
7
Data Model
Rows
– Data maintained in lexicographic order by row key
– Tablet: rows with consecutive keys; the unit of distribution and load balancing
Columns
– Column families
– Column key: family:qualifier
Cells
Timestamps
8
Data Model – WebTable Example
A large collection of web pages and related information
9
Data Model – WebTable Example
Row Key
– Bigtable maintains data in lexicographic order by row key
– Tablet: a group of rows with consecutive keys; the unit of distribution
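Because rows are kept in sorted order and each tablet holds a run of consecutive keys, finding the tablet for a row key amounts to a binary search over tablet boundaries. The sketch below is illustrative only: the boundary list, the example row keys, and the name `tablet_index` are assumptions, not part of Bigtable's API.

```python
import bisect

# Hypothetical tablet boundaries: each entry is the last row key a tablet holds.
# Tablet i covers the keys in (end_keys[i-1], end_keys[i]].
TABLET_END_KEYS = ["com.cnn.www", "com.google.www", "org.wikipedia.en"]

def tablet_index(row_key, end_keys=TABLET_END_KEYS):
    """Return the index of the tablet whose key range contains row_key."""
    i = bisect.bisect_left(end_keys, row_key)
    if i == len(end_keys):
        raise KeyError(f"{row_key!r} sorts after the last tablet")
    return i
```

Row keys that sort next to each other, such as pages of one domain stored under a reversed hostname, land in the same or a neighbouring tablet.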
10
Data Model – WebTable Example
Column Family
– A column family is the unit of access control
11
Data Model – WebTable Example
Column
– A column key is written as “family:qualifier”
12
Data Model – WebTable Example
Column
– A column can be added to a column family once the family has been created
13
Data Model – WebTable Example
Cell
– The storage referenced by a particular row key, column key, and timestamp
14
Data Model – WebTable Example
Each cell can contain multiple versions of the same data, indexed by timestamp
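The data model described on these slides can be sketched as a map from (row key, column key, timestamp) to a value. This is a toy in-memory illustration, not how Bigtable stores data; the class `ToyTable`, its methods, and the sample rows are all assumptions made for the example.

```python
import time

class ToyTable:
    """Maps (row key, column key, timestamp) to an uninterpreted value."""

    def __init__(self):
        self.cells = {}  # (row, column) -> {timestamp: value}

    def put(self, row, column, value, timestamp=None):
        ts = time.time_ns() // 1000 if timestamp is None else timestamp
        self.cells.setdefault((row, column), {})[ts] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (latest if None)."""
        versions = self.cells.get((row, column), {})
        eligible = [ts for ts in versions if timestamp is None or ts <= timestamp]
        return versions[max(eligible)] if eligible else None

    def scan(self):
        """Yield cells in lexicographic row order, newest version first."""
        for row, column in sorted(self.cells):
            for ts in sorted(self.cells[(row, column)], reverse=True):
                yield row, column, ts, self.cells[(row, column)][ts]

table = ToyTable()
table.put("com.cnn.www", "contents:", "<html>v1", timestamp=3)
table.put("com.cnn.www", "contents:", "<html>v2", timestamp=5)
table.put("com.cnn.www", "anchor:cnnsi.com", "CNN", timestamp=9)
```

The map is sparse because a row only stores the columns that were actually written; nothing is allocated for absent cells.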
15
API
16
API
– Write or delete values in Bigtable
– Look up values from individual rows
– Iterate over a subset of the data in a table
17
API – Update a Row
18
API – Update a Row
Open a table
19
API – Update a Row
Prepare a mutation for the row
20
API – Update a Row
Store a new item under the column key “anchor:www.c-span.org”
21
API – Update a Row
Delete an item under the column key “anchor:www.abc.com”
22
API – Update a Row
Apply the mutation atomically
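The update steps above (prepare a row mutation, queue a set and a delete, then apply them atomically) can be sketched with a toy store. The classes `ToyRowStore` and `ToyRowMutation`, the row key, and the stored values are hypothetical stand-ins; the real client API is not reproduced here.

```python
import threading

class ToyRowMutation:
    """Collects the operations to apply to a single row."""

    def __init__(self, row_key):
        self.row_key = row_key
        self.ops = []

    def set(self, column, value):
        self.ops.append(("set", column, value))

    def delete(self, column):
        self.ops.append(("delete", column, None))

class ToyRowStore:
    def __init__(self):
        self.rows = {}  # row key -> {column key: value}
        self._lock = threading.Lock()

    def apply(self, mutation):
        """Apply every queued operation on the row, or none of them."""
        with self._lock:
            updated = dict(self.rows.get(mutation.row_key, {}))  # work on a copy
            for op, column, value in mutation.ops:
                if op == "set":
                    updated[column] = value
                else:  # "delete"
                    updated.pop(column, None)
            self.rows[mutation.row_key] = updated  # publish in one step

store = ToyRowStore()
store.rows["com.cnn.www"] = {"anchor:www.abc.com": "ABC"}
mutation = ToyRowMutation("com.cnn.www")
mutation.set("anchor:www.c-span.org", "CNN")
mutation.delete("anchor:www.abc.com")
store.apply(mutation)
```

Building the new row on a copy and swapping it in at the end means a reader never observes the set without the delete.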
23
API – Iterate over a Table
Create a Scanner instance
24
API – Iterate over a Table
Access the “anchor” column family
25
API – Iterate over a Table
Specify “return all versions”
26
API – Iterate over a Table
Specify a row key
27
API – Iterate over a Table
Iterate over rows
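The scanner steps above (pick a column family, ask for all versions, pick a row, iterate) can be mimicked over a plain dictionary. The function `scan_row`, the cell layout, and the sample data are assumptions made for illustration and do not reflect the real Scanner interface.

```python
def scan_row(cells, row_key, family, all_versions=True):
    """Yield (column, timestamp, value) for one row and one column family.

    `cells` maps (row, column, timestamp) -> value; a hypothetical layout.
    """
    prefix = family + ":"
    matches = sorted(
        (col, ts) for (row, col, ts) in cells
        if row == row_key and col.startswith(prefix)
    )
    latest = {}
    for col, ts in matches:
        latest[col] = ts  # ascending order, so the newest timestamp wins
    # Emit columns in order, newest version first within each column.
    for col, ts in sorted(matches, key=lambda m: (m[0], -m[1])):
        if all_versions or latest[col] == ts:
            yield col, ts, cells[(row_key, col, ts)]

cells = {
    ("com.cnn.www", "anchor:cnnsi.com", 9): "CNN",
    ("com.cnn.www", "anchor:cnnsi.com", 4): "CNN Sports",
    ("com.cnn.www", "anchor:my.look.ca", 8): "CNN.com",
    ("com.cnn.www", "contents:", 6): "<html>...",
}
```

Calling `scan_row(cells, "com.cnn.www", "anchor", all_versions=False)` restricts the result to the newest version of each anchor column.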
28
API – Other Features
– Single-row transactions
– Execution of client-supplied scripts in the address space of the server
– Input source and output target for MapReduce jobs
29
A Typical Google Machine
30
A Google Cluster
31
A Google Cluster
32
Building Blocks
Chubby
– Highly available and persistent distributed lock service
GFS
– Stores logs and data files
– SSTable: Google’s immutable file format; a persistent, ordered, immutable map from keys to values (http://code.google.com/p/leveldb/)
33
SSTable
For more info: http://www.igvita.com/2012/02/06/sstable-and-log-structured-storage-leveldb/
34
Chubby
Highly available and persistent distributed lock service
– 5 replicas, one of which is elected as the master
– Paxos
– Provides a namespace that consists of directories and small files
35
Implementation
– Client library
– Master: one and only one!
– Tablet servers: many
36
Implementation - Master
Responsible for assigning tablets to tablet servers
– Detecting the addition and removal of tablet servers
– Tablet-server load balancing
– Garbage-collecting files in GFS
Handles schema changes
Single-master system (as in GFS)
37
Tablet Server
– Manages a set of tablets
– Handles read and write requests to its tablets
– Splits tablets that have grown too large
38
How Does a Client Find a Tablet?
39
Tablet Assignment
– Each tablet is assigned to at most one tablet server at a time
– When a tablet is unassigned and a tablet server is available, the master assigns the tablet by sending a tablet load request to that tablet server
– Bigtable uses Chubby to keep track of tablet servers
40
Tablet Assignment
Detecting a tablet server that is no longer serving its tablets
– The master periodically asks each tablet server for the status of its lock
– If a tablet server reports that it has lost its lock, or if the master cannot reach it, the master attempts to acquire an exclusive lock on the server’s file
– If the lock is acquired, Chubby is alive, so the tablet server must have a problem
– The master deletes the server’s file in Chubby to ensure that the tablet server can never serve again
– The master then moves all tablets previously assigned to that server into the set of unassigned tablets
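The detection steps above can be sketched as a single health-check round. `ToyChubby` is a hypothetical in-memory stand-in for the lock service, and `check_server` with its parameters is an illustrative assumption rather than Bigtable's actual code.

```python
class ToyChubby:
    """Stand-in for a lock service: one file (and lock) per tablet server."""

    def __init__(self):
        self.lock_holder = {}  # server file name -> current holder, or None

    def try_acquire(self, name, who):
        if name in self.lock_holder and self.lock_holder[name] is None:
            self.lock_holder[name] = who
            return True
        return False

    def delete(self, name):
        self.lock_holder.pop(name, None)

def check_server(chubby, server, holds_lock, reachable, assignments, unassigned):
    """Run one health-check round; return True if the server was retired."""
    if reachable and holds_lock:
        return False                       # server is healthy
    if not chubby.try_acquire(server, "master"):
        return False                       # lock still held elsewhere; retry later
    chubby.delete(server)                  # the server can never serve again
    unassigned.update(assignments.pop(server, set()))
    return True

chubby = ToyChubby()
chubby.lock_holder = {"ts1": "ts1", "ts2": None}   # ts2 has lost its lock
assignments = {"ts1": {"tablet-a"}, "ts2": {"tablet-b", "tablet-c"}}
unassigned = set()
healthy_result = check_server(chubby, "ts1", True, True, assignments, unassigned)
failed_result = check_server(chubby, "ts2", False, True, assignments, unassigned)
```

Note that an unreachable server that still holds its lock is left alone in this sketch, since the master cannot take the lock away from it.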
41
Tablet Assignment
When a master is started, it:
– Grabs a unique master lock in Chubby
– Scans the servers directory in Chubby to find the live servers
– Communicates with every live tablet server to discover the current tablet assignments
– Scans the METADATA table and adds unassigned tablets to the set of unassigned tablets
42
Tablet Serving
43
Tablet Serving
Memtable
– A sorted buffer
– Maintains the updates on a row-by-row basis
– Each row is copy-on-write to maintain row-level consistency
– Older updates are stored in a sequence of SSTables
44
Tablet Serving
45
Tablet Serving - Write
Write operation
– The server checks that the operation is valid
– A valid mutation is written to the commit log
– After the write has been committed, its contents are inserted into the memtable
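The write path above (validate, append to the commit log, then insert into the memtable) can be sketched as follows. The class `ToyTabletWriter` is hypothetical, and its validity check (the column family must be known) is an assumption chosen only to make the example concrete.

```python
class ToyTabletWriter:
    def __init__(self, known_families):
        self.known_families = set(known_families)
        self.commit_log = []   # append-only; stands in for the log file in GFS
        self.memtable = {}     # (row, column, timestamp) -> value

    def write(self, row, column, timestamp, value):
        family = column.split(":", 1)[0]
        if family not in self.known_families:                    # 1. validity check
            raise ValueError(f"unknown column family {family!r}")
        self.commit_log.append((row, column, timestamp, value))  # 2. log first
        self.memtable[(row, column, timestamp)] = value          # 3. then memtable

writer = ToyTabletWriter(["anchor"])
writer.write("com.cnn.www", "anchor:cnnsi.com", 1, "CNN")
```

Logging before touching the memtable is what allows the in-memory state to be rebuilt after a crash.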
46
Tablet Serving
47
Tablet Serving - Read
Read operation
– The server checks that the operation is valid
– A valid operation is executed on a merged view of the sequence of SSTables and the memtable
– The merged view can be formed efficiently because the SSTables and the memtable are lexicographically sorted data structures
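Because every source is already sorted, the merged view described above can be produced by a streaming k-way merge. The sketch below uses Python's `heapq.merge`; the function `merged_view` and the list-based SSTable layout are assumptions, and deletion markers are left out for brevity.

```python
import heapq

def merged_view(memtable, sstables):
    """Yield (key, value) in key order; newer sources shadow older ones.

    `memtable` is a dict; `sstables` is a list of sorted (key, value) lists,
    ordered newest first. Both layouts are assumptions for this sketch.
    """
    sources = [sorted(memtable.items())] + sstables
    # Tag each entry with its source age so equal keys sort newest first.
    tagged = [
        [(key, age, value) for key, value in source]
        for age, source in enumerate(sources)
    ]
    last_key = object()  # sentinel that equals no real key
    for key, _age, value in heapq.merge(*tagged):
        if key != last_key:      # the first hit for a key is its newest version
            yield key, value
            last_key = key

memtable = {"b": "b2"}
sstables = [[("a", "a1"), ("b", "b1")], [("b", "b0"), ("c", "c0")]]
view = list(merged_view(memtable, sstables))
```

The merge never loads a whole SSTable at once; it only compares the current head of each sorted source.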
48
Tablet Serving - Recover
49
Tablet Serving - Recover
Recover a tablet
– The tablet server reads the tablet’s metadata from the METADATA table
– The metadata contains the list of SSTables that comprise the tablet and a set of redo points
– The server reads the indices of the SSTables into memory and reconstructs the memtable by applying all of the updates that have committed since the redo points
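The last recovery step, rebuilding the memtable by replaying the commit log from a redo point, can be sketched as below. The function `recover_memtable` and the log layout with a single integer redo point are simplifying assumptions for illustration.

```python
def recover_memtable(commit_log, redo_point):
    """Rebuild a memtable from log entries at or after `redo_point`.

    `commit_log` is a list of (sequence_number, key, value) tuples in commit
    order. This layout is hypothetical.
    """
    memtable = {}
    for seq, key, value in commit_log:
        if seq < redo_point:
            continue             # already persisted in an SSTable
        memtable[key] = value    # later entries overwrite earlier ones
    return memtable

log = [(1, "a", "old"), (2, "b", "x"), (3, "a", "new")]
recovered = recover_memtable(log, redo_point=2)
```

Entries before the redo point are skipped because their effects are already captured in the tablet's SSTables.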
50
Compaction
Minor compaction
– When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable
Major compaction
– Rewrites all SSTables into exactly one SSTable
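The two compactions can be sketched with a toy tablet. The class `ToyTablet` is hypothetical, and its threshold counts memtable entries rather than bytes, which is a simplification made for the example.

```python
class ToyTablet:
    def __init__(self, threshold=2):
        self.threshold = threshold
        self.memtable = {}
        self.sstables = []   # newest first; each is a sorted list of (key, value)

    def write(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.threshold:
            self.minor_compaction()

    def minor_compaction(self):
        # Freeze the current memtable, start a new one, persist the frozen one.
        frozen, self.memtable = self.memtable, {}
        self.sstables.insert(0, sorted(frozen.items()))

    def major_compaction(self):
        merged = {}
        for sstable in reversed(self.sstables):   # oldest first
            merged.update(sstable)                # newer values overwrite older
        self.sstables = [sorted(merged.items())]

tablet = ToyTablet(threshold=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    tablet.write(key, value)
sstables_before = [list(s) for s in tablet.sstables]
tablet.major_compaction()
```

After the major compaction, a read has only one SSTable to consult instead of one per earlier minor compaction.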
51
Compaction (diagram labels: Write Op, Commit Log, memtable, SSTable, Memory, GFS)
52
Compaction (diagram, continued: threshold reached)
53
Compaction (diagram, continued: threshold reached)
54
Compaction (diagram, continued: a new memtable)
55
Compaction (diagram, continued: major compaction)
56
Schema Management
– Bigtable schemas are stored in Chubby
– The master updates a schema by rewriting the corresponding schema file in Chubby
57
Optimization
Locality Group
– Client-defined
– An abstraction that enables clients to control their data’s storage layout
– A separate SSTable is generated for each locality group in each tablet during compaction
– A locality group can be declared to be in-memory
58
Optimization
Compression
– Clients can control whether the SSTables for a locality group are compressed
59
Optimization
Two-level caching for read performance
– Scan cache (higher level): caches the key-value pairs returned by the SSTable interface to the tablet server code
– Block cache (lower level): caches SSTable blocks read from GFS
60
Optimization
Bloom Filters
61
Optimization
Commit-Log Implementation
– Uses one log per tablet server
– Recovery? A tablet server hosting 100 tablets fails; 100 other machines are each assigned a single tablet; 100 reads of the log? Sort the commit log by
– Writing commit logs: two log-writer threads
62
Performance Evaluation
Sequential writes/reads
– Row keys with names 0 to R-1, partitioned into 10N equal-sized ranges
– Wrote a single string under each row key
– 1 GB per tablet server
Scan
– Uses the Bigtable Scan API
Random writes/reads
– Similar to sequential writes/reads, but the row key was hashed
Random reads (mem)
– 100 MB per tablet server; the locality group is marked as in-memory
63
Single Tablet Server Performance
64
Aggregate Throughput
65
Real Applications
66
Lessons Learned
– Failures!
– Delay new features until it is clear how they will be used
– Monitoring
– Simple design!
67
Acknowledgement
Jeff Dean, “Handling Large Datasets at Google: Current Systems and Future Directions”