CSCI5570 Large Scale Data Processing Systems NoSQL James Cheng CSE, CUHK Slide Ack.: modified based on the slides from Jinyang Li, Shimin Chen, Mohsen Taheriyan
Bigtable: A Distributed Storage System for Structured Data Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber OSDI 2006
Bigtable Description Outline Motivation and goals Interfaces and semantics Architecture Implementation details
Motivation
- Lots of (semi-)structured data at Google
  - Web data: contents, crawl metadata, links, anchors, PageRank, ...
  - Per-user data: preference settings, recent queries/search results, ...
  - Map data: physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, ...
- Scale is huge
  - Billions of Web pages, many versions per page (~20K/version)
  - Hundreds of millions of users, thousands of queries/sec
  - 100TB+ of satellite image data
  (Above numbers are as of 2006-7!)
Goals
- Store petabytes of data across thousands of commodity servers
- Want asynchronous processes to continuously update different pieces of the data
- Want access to the most recent data at any time
- Need to support:
  - Very high read/write rates (millions of ops per second)
  - Efficient retrieval of small subsets of the data
  - Efficient scans over all or subsets of the data
- Often want to examine data changes over time
  - E.g., contents of a web page over multiple crawls
Why Not a Commercial DB?
- Scale is too large for most commercial databases
- Cost of licensing per machine would be very high
- Building internally means the system can be applied across many projects at low incremental cost
- Low-level storage optimizations help performance, and are much harder when running on top of a database layer
- "Also fun and challenging to build large-scale systems"
Bigtable
- A distributed storage system for (semi-)structured data
- Scalable
  - Thousands of servers
  - Terabytes of in-memory data, petabytes of disk-based data
  - Millions of reads/writes per second, efficient scans
- Self-managing
  - Servers can be added/removed dynamically
  - Servers adjust to load imbalance
- Allows clients to reason about the locality properties of the data in the underlying storage
Bigtable Applications
- A flexible, high-performance solution for many Google products (as of 2008): Web indexing, personalized search, Google Earth, Google Analytics, Google Finance, ...
- Backend bulk processing: throughput-oriented batch-processing jobs
- Real-time data serving: latency-sensitive serving of data to end users
Bigtable Description Outline Motivation and goals Interfaces and semantics Architecture Implementation details
Basic Data Model
- "A Bigtable is a sparse, distributed, persistent multi-dimensional sorted map"
- Indexed by: (row:string, column:string, time:int64) → string (a small sketch follows)
- Example (Webtable): row key "com.cnn.www" has a "contents:" cell holding "<html>..." at timestamps t3, t5, t6, plus anchor cells "anchor:cnnsi.com" and "anchor:my.look.ca" holding the link texts "CNN" (t9) and "CNN.com" (t8)
- Row key: up to 64KB, typically 10-100 bytes; rows are sorted by key (here reversed URLs)
- Columns are grouped into column families; each cell holds timestamped versions, garbage-collected per policy
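To make the sorted-map view concrete, here is a minimal, self-contained C++ sketch (an illustration using std::map, not Bigtable code; all names are made up). The timestamp is negated in the key so that, as in Bigtable, the newest version of a cell sorts first.

    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <string>
    #include <tuple>

    // Key = (row, column, timestamp). Timestamps are negated so that, within a
    // cell, the most recent version sorts first.
    using Key = std::tuple<std::string, std::string, std::int64_t>;

    int main() {
        std::map<Key, std::string> webtable;

        // Row "com.cnn.www" with contents cells and two anchor cells.
        webtable[{"com.cnn.www", "contents:",         -6}] = "<html>...";
        webtable[{"com.cnn.www", "contents:",         -5}] = "<html>...";
        webtable[{"com.cnn.www", "anchor:cnnsi.com",  -9}] = "CNN";
        webtable[{"com.cnn.www", "anchor:my.look.ca", -8}] = "CNN.com";

        // Because the map is sorted by (row, column, -timestamp), a range scan
        // over one row visits its columns in order, newest version first.
        for (const auto& [key, value] : webtable) {
            const auto& [row, column, neg_ts] = key;
            std::cout << row << " / " << column << " @t" << -neg_ts
                      << " -> " << value << "\n";
        }
    }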
Rows
- Row names/keys are arbitrary strings, ordered lexicographically
- Rows that are close together lexicographically usually reside on one or a small number of machines
  - Hence programmers can choose row names to achieve good locality in their programs
  - Example: com.cnn.www vs. www.cnn.com, which one provides more locality for a "site:" query?
- Access to data in a row is atomic
  - The row is the only unit of atomicity in Bigtable
  - Makes concurrent updates to the same row easier to reason about
- Does not support the full relational model
  - No table-wide integrity constraints
  - No multi-row transactions
Row-based Locality
- Matches from the same site should be retrievable together by accessing a minimal number of machines
- Using reversed-DNS row keys clusters URLs from the same site together, which speeds up site-local queries (see the sketch below)
  - e.g., edition.cnn.com, money.cnn.com, www.cnn.com become com.cnn.edition, com.cnn.money, com.cnn.www, which sort adjacently
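A small illustrative sketch of the key transformation itself (not code from the paper; the function name ReverseDns is made up): reverse the dot-separated DNS labels so that pages from the same site share a key prefix and therefore sort next to each other.

    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    // "edition.cnn.com" -> "com.cnn.edition": reverse the dot-separated labels.
    std::string ReverseDns(const std::string& host) {
        std::vector<std::string> labels;
        std::stringstream ss(host);
        for (std::string label; std::getline(ss, label, '.'); )
            labels.push_back(label);
        std::string out;
        for (auto it = labels.rbegin(); it != labels.rend(); ++it) {
            if (!out.empty()) out += '.';
            out += *it;
        }
        return out;
    }

    int main() {
        // All three keys now share the prefix "com.cnn." and sort adjacently.
        for (const auto& h : {"edition.cnn.com", "money.cnn.com", "www.cnn.com"})
            std::cout << h << " -> " << ReverseDns(h) << "\n";
    }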
Columns
- Columns have a two-level name structure: family:optional_qualifier (e.g., anchor:cnnsi.com)
- Column family
  - Basic unit of access control: access is granted (or denied) at the granularity of a whole column family
  - Has associated type information; all data stored in a column family is usually of the same type (allows compression)
  - Small number of distinct column families per table, but an unbounded number of columns
- Qualifiers provide an additional level of indexing, if desired
- Example (Webtable): row com.cnn.www has a "contents:" column plus anchor columns "anchor:cnnsi.com" and "anchor:stanford.edu" holding the referring link texts ("CNN", "CNN homepage")
Timestamps
- Used to store different versions of data in a cell; 64-bit integers
- New writes default to the current time, but timestamps can also be set explicitly by clients
- Each column family has two garbage-collection settings:
  - "Only keep the most recent K values in a cell"
  - "Keep values until they are older than K seconds"
- Lookup options:
  - "Return the most recent K values"
  - "Return all values in a timestamp range (or all values)"
- Example use: keep multiple versions of crawled Web pages
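As a rough sketch of how the two garbage-collection settings act on one cell's version list, the helper below (hypothetical; the names and the rule for combining both settings are assumptions, and in the real system GC happens during compactions) keeps a version only if it is among the most recent K and not older than the age limit.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Version { std::int64_t timestamp_sec; /* value omitted for brevity */ };

    // One cell's versions, sorted newest first. Hypothetical helper applying
    // "only keep most recent K values" and "keep values until they are older
    // than K seconds" together.
    void GarbageCollectCell(std::vector<Version>& versions,
                            std::size_t keep_most_recent_k,
                            std::int64_t max_age_sec,
                            std::int64_t now_sec) {
        std::vector<Version> kept;
        for (std::size_t i = 0; i < versions.size(); ++i) {
            const bool within_k   = i < keep_most_recent_k;
            const bool new_enough = (now_sec - versions[i].timestamp_sec) <= max_age_sec;
            if (within_k && new_enough) kept.push_back(versions[i]);
        }
        versions.swap(kept);
    }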
The Bigtable API
- Metadata operations
  - Create/delete tables or column families
  - Change metadata, access control rights
- Writes (atomic per row)
  - Set(): write cells in a row
  - DeleteCells(): delete cells in a row
  - DeleteRow(): delete all cells in a row
- Reads
  - Scanner: read arbitrary cells in a Bigtable
  - Each row read is atomic; single-row transactions support, e.g., atomic read-modify-write sequences on a row
  - Can restrict returned rows to a particular range
  - Can ask for data from just one row, all rows, etc.
  - Can ask for all columns, just certain column families, or specific columns
API Examples: Write/Modify

    // Open the table
    Table *T = OpenOrDie("/bigtable/web/webtable");

    // Write a new anchor and delete an old anchor
    RowMutation r1(T, "com.cnn.www");
    r1.Set("anchor:www.c-span.org", "CNN");
    r1.Delete("anchor:www.abc.com");
    Operation op;
    Apply(&op, &r1);

Apply() performs an atomic modification of a single row; there is no support for (RDBMS-style) multi-row transactions.
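The write example has a read-side counterpart in the paper that iterates over the anchors of a row with a Scanner, continuing with the table handle T opened above. The snippet below is paraphrased from that example, so treat the exact method names as approximate rather than as a public API.

    // Paraphrased from the paper's read example; iterate over all versions of
    // all "anchor:*" cells in row "com.cnn.www".
    Scanner scanner(T);
    ScanStream *stream;
    stream = scanner.FetchColumnFamily("anchor");
    stream->SetReturnAllVersions();
    scanner.Lookup("com.cnn.www");
    for (; !stream->Done(); stream->Next()) {
        printf("%s %s %lld %s\n",
               scanner.RowName(),
               stream->ColumnName(),
               stream->MicroTimestamp(),
               stream->StringValue());
    }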
Bigtable Description Outline Motivation and goals Interfaces and semantics Architecture Implementation details
Tablets
- A Bigtable table is partitioned into many tablets based on row keys (see the lookup sketch below)
- Tablets (100-200MB each) are stored in a particular structure in GFS
- The tablet is the unit of distribution and load balancing in Bigtable
- Each tablet is served by one tablet server
- Tablet serving is stateless (all persistent state is in GFS), hence tablets can be restarted on a server at any time
- Example: one tablet covers rows from com.aaa up to com.cnn.www, the next covers rows from com.cnn.www up to com.dodo.www, and so on
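As a minimal illustration of the row-range partitioning (purely a sketch: the server names are made up, ranges are treated as half-open [start_row, end_row) as the slide's figure suggests, and real clients locate tablets through the METADATA hierarchy described later), tablets can be kept in a map keyed by their end row so the tablet covering a row is found with one upper_bound lookup.

    #include <iostream>
    #include <map>
    #include <string>

    struct TabletInfo { std::string start_row, end_row, server; };

    // Tablets indexed by their (exclusive) end row. upper_bound(row) returns
    // the first tablet whose end row is strictly greater than the row, i.e.
    // the tablet whose range [start_row, end_row) covers it.
    using TabletMap = std::map<std::string, TabletInfo>;

    const TabletInfo* FindTablet(const TabletMap& tablets, const std::string& row) {
        auto it = tablets.upper_bound(row);
        return it == tablets.end() ? nullptr : &it->second;
    }

    int main() {
        TabletMap tablets;
        tablets["com.cnn.www"]  = {"com.aaa",     "com.cnn.www",  "tabletserver-1"};  // hypothetical servers
        tablets["com.dodo.www"] = {"com.cnn.www", "com.dodo.www", "tabletserver-2"};

        if (const TabletInfo* t = FindTablet(tablets, "com.cnn.www/sports.html"))
            std::cout << "row served by " << t->server << "\n";   // tabletserver-2
    }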
Tablet Structure: SSTable
- Bigtable uses Google SSTables as a key building block
- An SSTable ("Sorted Strings Table") is an immutable, sorted file of key-value pairs
  - Provides a persistent, ordered, immutable map from keys to values, where both keys and values are arbitrary byte strings
  - Keys are <row, column, timestamp>
  - SSTable files are stored in GFS
  - Internally, an SSTable is a sequence of blocks (typically 64KB) plus a block index that records the key range of each block
- SSTables are written by appending and are never updated in place (deletes are expressed as special deletion entries)
  - Why do you think they don't use something that supports updates?
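To see why the block index makes point lookups cheap (one in-memory index search, then at most one 64KB block to read), here is a small self-contained sketch; the layout and names are deliberately simplified and are not the actual SSTable file format.

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <utility>
    #include <vector>

    // Simplified SSTable: data blocks of sorted key/value pairs plus an
    // in-memory index holding the first key of each block. A real SSTable
    // stores the index at the end of the file and loads it when opened.
    struct Block { std::vector<std::pair<std::string, std::string>> entries; };

    struct SSTable {
        std::vector<std::string> first_keys;  // first key of each block, sorted
        std::vector<Block> blocks;

        const std::string* Lookup(const std::string& key) const {
            // Find the last block whose first key is <= key (search in RAM).
            auto it = std::upper_bound(first_keys.begin(), first_keys.end(), key);
            if (it == first_keys.begin()) return nullptr;   // key precedes all blocks
            const Block& b = blocks[(it - first_keys.begin()) - 1];
            // One block read from GFS in the real system; binary search within it.
            auto e = std::lower_bound(b.entries.begin(), b.entries.end(), key,
                [](const auto& kv, const std::string& k) { return kv.first < k; });
            return (e != b.entries.end() && e->first == key) ? &e->second : nullptr;
        }
    };

    int main() {
        SSTable t;
        t.first_keys = {"apple", "mango"};
        t.blocks.resize(2);
        t.blocks[0].entries = {{"apple", "1"}, {"kiwi", "2"}};
        t.blocks[1].entries = {{"mango", "3"}, {"pear", "4"}};
        if (const std::string* v = t.Lookup("kiwi")) std::cout << *v << "\n";  // prints 2
    }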
Tablet Structure
- A tablet stores a range of rows from a table using SSTable files, which are stored in GFS
- Example: a tablet covering rows from aardvark to apple, backed by several SSTables
Tablet Splitting
- When tablets grow too big, they are split
  - e.g., a large tablet covering rows from com.aaa through com.zuppa/menu.html is split into two smaller tablets
- There's merging, too
Servers
- One Bigtable master
  - Assigns/load-balances tablets across tablet servers
  - Detects tablet servers that come up or go down
  - Garbage-collects deleted tablets
  - Coordinates metadata updates (e.g., create table, ...)
  - Does NOT serve tablet-location lookups (we'll see how clients find tablet locations)
  - The master is stateless: its state is in Chubby and in Bigtable itself (recursively)!
- Many tablet servers
  - Each manages a set of tablets
  - Handle read/write requests to their tablets
  - Split tablets that have grown too large
  - Tablet servers are also stateless: their state is in GFS!
System Architecture
- Bigtable clients access the system through the Bigtable client library
- Metadata operations go to the Bigtable master, which performs metadata ops and load balancing
- Reads/writes go directly to the Bigtable tablet servers, which serve the data
- GFS holds the tablet data; Chubby holds metadata and handles master election
Locating Tablets
- Since tablets move around from server to server, given a row, how do clients find the right machine?
  - Each tablet covers a row range (startRowIndex, endRowIndex); we need the tablet whose row range covers the target row
- One approach: ask the Bigtable master
  - A central server would almost certainly become a bottleneck in a large system
- Instead: store special tables containing tablet-location info in the Bigtable cell itself (a recursive design)
Tablets are located using a hierarchical (B+-tree-like) three-level structure:
1. A Chubby file stores the location of the root METADATA tablet
2. The root METADATA tablet (stored as one single, unsplittable tablet) points to all other METADATA tablets
3. Each METADATA tablet maps <table_id, end_row> → tablet location for the user tablets (UserTable_1 ... UserTable_N)
- Each METADATA record is ~1KB; a METADATA tablet is limited to 128MB
- With these limits, the 3-level location scheme can address 2^34 tablets
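The 2^34 figure follows directly from the numbers on the slide; below is a short sketch of the arithmetic, assuming ~1KB per METADATA record and the 128MB cap per METADATA tablet (variable names are just for illustration).

    #include <cstdint>
    #include <cstdio>

    int main() {
        const std::uint64_t kMetadataRowBytes   = 1ULL << 10;   // ~1KB per METADATA record
        const std::uint64_t kMetadataTabletSize = 128ULL << 20; // METADATA tablet capped at 128MB

        // Tablet pointers per METADATA tablet: 128MB / 1KB = 2^17.
        std::uint64_t rows_per_metadata_tablet = kMetadataTabletSize / kMetadataRowBytes;

        // Level 1: Chubby file -> root METADATA tablet.
        // Level 2: the root tablet addresses up to 2^17 other METADATA tablets.
        // Level 3: each of those addresses up to 2^17 user tablets.
        std::uint64_t addressable_tablets =
            rows_per_metadata_tablet * rows_per_metadata_tablet;  // 2^17 * 2^17 = 2^34

        std::printf("rows per METADATA tablet: %llu\n",
                    (unsigned long long)rows_per_metadata_tablet);
        std::printf("addressable user tablets: %llu (= 2^34)\n",
                    (unsigned long long)addressable_tablets);
        return 0;
    }

At roughly 128MB per user tablet, 2^34 tablets correspond to about 2^61 bytes of addressable data.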
Other Components (Building Blocks)
- Google File System (GFS): raw storage
- Scheduler: schedules jobs onto machines
- Chubby: lock service
How Bigtable uses these building blocks:
- GFS: stores all persistent state (log and data files)
- Scheduler: schedules the jobs involved in Bigtable serving
- Chubby: master election, keeping track of tablet servers
GFS
- The GFS master manages metadata; data transfers happen directly between clients and chunkservers
- Files are broken into chunks (typically 64MB)
- Chunks are replicated across three machines (chunkservers) for reliability
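As a minimal sketch of how a client-side GFS read is routed (illustrative only; the 64MB chunk size is the only number taken from the slide), the byte offset is first converted into a chunk index, which is what the client asks the GFS master about before fetching the data directly from a chunkserver.

    #include <cstdint>
    #include <cstdio>

    int main() {
        const std::uint64_t kChunkSize = 64ULL << 20;   // 64MB chunks, as on the slide
        std::uint64_t file_offset = 200ULL << 20;       // example: byte 200MB of a file

        std::uint64_t chunk_index  = file_offset / kChunkSize;  // which chunk to ask the master about
        std::uint64_t chunk_offset = file_offset % kChunkSize;  // offset within that chunk

        std::printf("chunk %llu, offset %llu\n",
                    (unsigned long long)chunk_index,
                    (unsigned long long)chunk_offset);
        return 0;
    }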
Chubby
- A highly available, persistent distributed lock service with a file-system-like interface
- Provides a namespace of directories and small files; each directory or file can be used as a lock
- Intuitively, Chubby provides locks that can also store a small amount of data, which anyone can read but only the holder of a writer's lock can write
- It also provides notifications of file updates
- Uses Paxos to provide consistency
  - Paxos: a family of protocols for solving consensus in a network of unreliable processors
  - Consensus: the process of agreeing on one result among a group of participants
Chubby (cont.)
Bigtable uses Chubby to:
- Ensure that there is at most one active master at any time
- Store the bootstrap location of Bigtable data
- Discover tablet servers and finalize tablet server deaths
- Store Bigtable schema information (the column family information for each table)
- Store access control lists
Typical Cluster
- Cluster-wide services: cluster scheduling master, lock service (Chubby), GFS master (all by-and-large stateless!)
- Each machine (running Linux) hosts a Bigtable tablet server or the Bigtable master, user tasks (e.g., MapReduce), a scheduler slave, and a GFS chunkserver
- The GFS chunkservers are the stateful part: they hold the persistent data
Specific Example: Web Search
The workload mixes:
- inserts or updates of page contents and anchors to pages
- targeted queries (which should not require scans!)
- full or partial table scans
Bigtable Description Outline Motivation and goals Interfaces and semantics Architecture Implementation details
Tablet Assignment Master assigns tablets to tablet servers
Tablet Assignment
- Each tablet is assigned to one tablet server at a time
- Bigtable uses Chubby to keep track of tablet servers
  - When a tablet server starts, it creates, and acquires an exclusive lock on, a uniquely named file in a specific Chubby directory (a separate Chubby file stores the location of the root tablet, as described earlier)
  - The master monitors this directory to discover tablet servers
  - A tablet server stops serving its tablets if it loses its exclusive lock (e.g., due to a network partition); it may try to reacquire the lock if the file still exists, otherwise it kills itself
- Requiring an exclusive lock ensures that each tablet is served by at most one tablet server
Tablet Assignment (cont.)
- The master periodically asks each tablet server for the status of its lock
  - If it cannot reach a tablet server after several attempts, the master tries to acquire an exclusive lock on that server's Chubby file
  - If the master can acquire the lock, Chubby is live and the tablet server is either dead or cut off from Chubby
  - The master then prevents the tablet server from ever serving again by deleting its file, and moves all of its tablets into the set of unassigned tablets
- The master assigns unassigned tablets to tablet servers with sufficient room
- Master failures do not change tablet assignments
Master Startup Operation
- Grabs a unique master lock in Chubby (prevents concurrent master instantiations)
- Scans the servers directory in Chubby to find live tablet servers
- Communicates with every live tablet server to discover the tablets they already serve
- Scans the METADATA table to learn the full set of tablets
- Tablets not already assigned are marked for assignment
Tablet Serving
- Updates are appended to a per-tablet commit log
- Recently committed updates are kept in memory in a memtable
- Older updates are stored in a sequence of SSTables
Tablet Serving (cont.)
- Write
  - The tablet server checks authorization for the write
  - The write is appended to the commit log, and its contents are then inserted into the memtable
- Read
  - The tablet server checks authorization for the read
  - The read is executed on a merged view of the sequence of SSTables and the memtable
  - The merged view can be formed efficiently since the SSTables and the memtable are sorted (see the sketch below)
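A minimal, self-contained sketch of the read path (illustrative only; real Bigtable merges sorted iterators and uses typed deletion entries rather than a magic marker string, and SortedRun, kDeletionMarker, and Read are made-up names): the memtable is consulted first, then the SSTables from newest to oldest, so newer values and deletion markers shadow older data.

    #include <iostream>
    #include <map>
    #include <optional>
    #include <string>
    #include <vector>

    using SortedRun = std::map<std::string, std::string>;  // memtable or one SSTable
    const std::string kDeletionMarker = "__DELETED__";     // stand-in for a typed delete entry

    // Consult the memtable, then the SSTables from newest to oldest; the first
    // run that mentions the key wins (a deletion marker means "no value").
    std::optional<std::string> Read(const std::string& key,
                                    const SortedRun& memtable,
                                    const std::vector<SortedRun>& sstables_newest_first) {
        std::vector<const SortedRun*> runs;
        runs.push_back(&memtable);
        for (const SortedRun& sst : sstables_newest_first) runs.push_back(&sst);

        for (const SortedRun* run : runs) {
            auto it = run->find(key);
            if (it == run->end()) continue;                          // not mentioned; try older run
            if (it->second == kDeletionMarker) return std::nullopt;  // deleted; stop here
            return it->second;                                       // newest surviving value
        }
        return std::nullopt;                                         // never written
    }

    int main() {
        SortedRun memtable = {{"com.cnn.www|contents:", "<html>new</html>"}};
        std::vector<SortedRun> sstables = {
            {{"com.cnn.www|anchor:cnnsi.com", kDeletionMarker}},   // newest SSTable
            {{"com.cnn.www|anchor:cnnsi.com", "CNN"},
             {"com.cnn.www|contents:", "<html>old</html>"}}        // older SSTable
        };
        auto v = Read("com.cnn.www|contents:", memtable, sstables);
        std::cout << (v ? *v : "(not found)") << "\n";  // <html>new</html>
        auto a = Read("com.cnn.www|anchor:cnnsi.com", memtable, sstables);
        std::cout << (a ? *a : "(deleted)") << "\n";    // (deleted)
    }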
Tablet Serving: Recovering a Tablet
- The tablet server reads the tablet's entry from the METADATA table
  - The metadata contains the list of SSTables and pointers to any commit logs that may contain data for the tablet
- The server reads the indices of the SSTables into memory
- The server reconstructs the memtable by replaying all of the updates committed since the redo points
Compaction
- Minor compaction: convert the memtable into an SSTable
  - Reduces memory usage
  - Reduces log traffic (replay work) on restart
- Merging compaction: merge a few SSTables and the memtable into one new SSTable, bounding the number of SSTables a read must consult
- Major compaction: a merging compaction that rewrites all SSTables into exactly one SSTable
  - Saves disk space (high compression rate, and deleted entries are removed for good)
- A minimal sketch of minor and major compaction follows this list
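The sketch below illustrates the two extremes on the same SortedRun/kDeletionMarker stand-ins used in the read sketch above (function names are made up; a regular merging compaction over only a few SSTables would have to keep deletion markers, since older SSTables may still hold the deleted data): minor compaction turns the frozen memtable into a new immutable SSTable, and a major compaction merges everything into exactly one SSTable while dropping deletion markers.

    #include <iterator>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    using SortedRun = std::map<std::string, std::string>;    // memtable or one SSTable
    const std::string kDeletionMarker = "__DELETED__";       // stand-in for a typed delete entry

    // Minor compaction: the frozen memtable simply becomes a new (newest) SSTable,
    // which would then be written out to GFS as an immutable file.
    SortedRun MinorCompaction(SortedRun&& frozen_memtable) {
        return std::move(frozen_memtable);
    }

    // Major compaction: merge all SSTables (newest first) into exactly one SSTable.
    // For each key the newest entry wins; deletion markers are dropped for good,
    // which is what reclaims the space used by deleted data.
    SortedRun MajorCompaction(const std::vector<SortedRun>& sstables_newest_first) {
        SortedRun merged;
        for (const SortedRun& sst : sstables_newest_first)
            for (const auto& [key, value] : sst)
                merged.emplace(key, value);  // emplace keeps the first (newest) entry seen
        for (auto it = merged.begin(); it != merged.end(); )
            it = (it->second == kDeletionMarker) ? merged.erase(it) : std::next(it);
        return merged;
    }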
Bigtable Summary
- A scalable distributed storage system for semi-structured data
- Offers a multi-dimensional map interface: <row, column, timestamp> → value
- Offers atomic reads/writes within a row
- Key design philosophy: statelessness, which is central to its scalability
  - All Bigtable servers (including the master) are stateless
  - All state is stored in the reliable GFS and Chubby systems
  - Bigtable leverages the strong-semantics operations of these systems (appends in GFS, file locks in Chubby)
- Availability and scalability: if a tablet server (or the master) dies, a new one can easily be started, since servers are stateless