Bigtable: A Distributed Storage System for Structured Data

Presentation transcript:

Bigtable: A Distributed Storage System for Structured Data
Presenter: Ido Hakimi

Outline Introduction Data Model API Building Blocks Implementation Refinements Performance Evaluation Future Work

Why not a Relational Database?
- The scale is too large for most commercial databases
- Even if it weren't, the cost would be very high
- Low-level storage optimizations help performance significantly

What is Bigtable?
A distributed storage system for managing structured data, designed to scale to petabytes of storage across thousands of commodity servers.
Design goals:
- Wide applicability
- Scalability
- High performance
- High availability

Outline Introduction Data Model API Building Blocks Implementation Refinements Performance Evaluation Future Work

Data Model – Overview
A sparse, distributed, persistent, multidimensional sorted map, indexed by (row key, column key, timestamp); each value is an uninterpreted array of bytes.
Running example: the Webtable.

Data Model – Example
The "Webtable" stores copies of web pages and their related information.
- row key: the page's URL, with the hostname components reversed (e.g., com.cnn.www)
- column key: attribute name (e.g., contents:, anchor:referring-site)
- timestamp: the time the page was fetched

Data Model – Rows
- Row key: an arbitrary string (typically 10-100 bytes, up to 64KB)
- Every read or write of data under a single row key is atomic

Data Model – Rows
- Rows are sorted by row key in lexicographic order
- Tablet: a contiguous range of rows
  - the unit of distribution and load balancing
  - gives good locality for data access

Data Model – Columns
- Column family: a group of column keys, usually of the same type; the unit of access control
- Column key syntax: family:qualifier
- A table has at most a few hundred column families
- Examples from the Webtable: the language family has a single column key; the anchor family has one column key per referring URL
- Access control and both disk and memory accounting are performed at the column-family level

Data Model – Timestamps
- Timestamps index multiple versions of the same data; they need not correspond to "real" time
- Bigtable timestamps are 64-bit integers; applications that need to avoid collisions can assign timestamps themselves
- Old versions are cleaned up by garbage collection: per column family, you can keep only the last n versions of a cell, or only the values written in the last seven days, for example
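
To make the three-dimensional map concrete, here is a minimal Python sketch of the data model as described above. MiniTable and all of its methods are hypothetical illustrations, not Bigtable's API: a map keyed by (row, column, timestamp), rows kept in lexicographic order, cell versions newest first, and a simple keep-last-n garbage-collection policy.

```python
from collections import defaultdict

class MiniTable:
    """Toy model of the Bigtable data model:
    (row: str, column: str, timestamp: int) -> value: bytes."""

    def __init__(self, max_versions=3):
        # row key -> column key -> list of (timestamp, value), newest first
        self._rows = {}
        self._max_versions = max_versions

    def set(self, row, column, timestamp, value):
        columns = self._rows.setdefault(row, defaultdict(list))
        versions = columns[column]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: -tv[0])   # keep versions newest first
        del versions[self._max_versions:]      # GC: keep only the last n versions

    def get(self, row, column, at=None):
        for ts, value in self._rows.get(row, {}).get(column, []):
            if at is None or ts <= at:         # newest version not newer than 'at'
                return ts, value
        return None

    def scan(self, start="", end=None):
        # Rows are kept in lexicographic order, like rows within a tablet.
        for row in sorted(self._rows):
            if row >= start and (end is None or row < end):
                yield row, dict(self._rows[row])

# Keys shaped like the Webtable example: reversed hostname as the row key.
t = MiniTable()
t.set("com.cnn.www", "contents:", 6, b"<html>...")
t.set("com.cnn.www", "anchor:cnnsi.com", 9, b"CNN")
print(t.get("com.cnn.www", "anchor:cnnsi.com"))   # (9, b'CNN')
print(list(t.scan(start="com.", end="com.d")))
```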

Outline Introduction Data Model API Building Blocks Implementation Refinements Performance Evaluation Future Work

API
- Write or delete values in Bigtable
- Look up values from individual rows
- Iterate over a subset of the data in a table
In short: what you can do with the API, and how you access the data model.

API – Update a Row
The next slides step through the paper's example of updating a single row of the Webtable.

API – Update a Row Opens a Table

API – Update a Row
We're going to mutate the row

API – Update a Row
Store a new item under the column key "anchor:www.c-span.org"

API – Update a Row
Delete an item under the column key "anchor:www.abc.com"

API – Update a Row Atomic Mutation
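
The example in the paper is written in C++ against Google's client library; the following is a hedged Python sketch of the same sequence of steps. Table, RowMutation, and apply are hypothetical names used only for illustration, showing how the buffered set/delete operations land on the row as one atomic mutation.

```python
class RowMutation:
    """Buffers changes to a single row; Table.apply() makes them visible atomically."""
    def __init__(self, table, row_key):
        self.table, self.row_key = table, row_key
        self.ops = []                                  # buffered (kind, column, value)

    def set(self, column, value):
        self.ops.append(("set", column, value))

    def delete(self, column):
        self.ops.append(("delete", column, None))

class Table:
    def __init__(self, name):
        self.name, self.rows = name, {}

    def apply(self, mutation):
        # All buffered operations on the row take effect together,
        # mirroring Bigtable's per-row atomicity guarantee.
        row = self.rows.setdefault(mutation.row_key, {})
        for kind, column, value in mutation.ops:
            if kind == "set":
                row[column] = value
            else:
                row.pop(column, None)

# Mirrors the steps walked through in the slides above:
T = Table("/bigtable/web/webtable")          # open the table
r1 = RowMutation(T, "com.cnn.www")           # we're going to mutate this row
r1.set("anchor:www.c-span.org", "CNN")       # store a new item
r1.delete("anchor:www.abc.com")              # delete an item
T.apply(r1)                                  # apply as one atomic mutation
```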

API – Iterate over a Table Create a Scanner instance

API – Iterate over a Table Access “anchor” column family

API – Iterate over a Table Specify “return all versions”

API – Iterate over a Table Specify a row key

API – Iterate over a Table Iterate over rows
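
Similarly hedged, here is a self-contained Python sketch of the scan just described; the webtable dict and its contents are illustrative, loosely modeled on the paper's example data. It restricts the scan to the anchor column family, returns all versions, and iterates over the entries of one row.

```python
# webtable[row][column] is a list of (timestamp, value) pairs, newest first.
webtable = {
    "com.cnn.www": {
        "anchor:cnnsi.com":  [(9, "CNN")],
        "anchor:my.look.ca": [(8, "CNN.com")],
        "contents:":         [(6, "<html>...")],
    },
}

row_key = "com.cnn.www"                           # specify a row key
for column in sorted(webtable[row_key]):          # walk the row's columns in order
    if not column.startswith("anchor:"):          # restrict to the "anchor" family
        continue
    for ts, value in webtable[row_key][column]:   # return all versions
        print(row_key, column, ts, value)
```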

Outline Introduction Data Model API Building Blocks Implementation Refinements Performance Evaluation

Building Blocks
- GFS: stores log and data files; provides scalability, reliability, performance, and fault tolerance
- Chubby: a highly available and persistent distributed lock service
- Cluster management system: Bigtable processes often share machines with processes from other applications, so Bigtable depends on a cluster management system for scheduling jobs, managing resources on shared machines, dealing with machine failures, and monitoring machine status

Building Blocks

SSTable
- The SSTable file format stores persistent, ordered, immutable key-value (string to string) pairs; it is used internally to store Bigtable data
- Each SSTable contains a sequence of blocks, typically 64KB each (configurable)
- A block index stored at the end of the SSTable is used to locate blocks; the index is loaded into memory when the SSTable is opened
- A lookup needs a single disk seek: first find the right block with a binary search in the in-memory index, then read that block from disk
- Optionally, an SSTable can be completely mapped into memory, so lookups and scans never touch disk
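
A minimal sketch of that lookup path, assuming a toy in-memory representation (the SSTable class below is illustrative, not the real file format): entries are split into fixed-size blocks, a small index of block start keys is kept in memory, and a lookup does one binary search in the index followed by reading a single block.

```python
import bisect

class SSTable:
    """Toy SSTable: immutable, sorted key/value pairs split into fixed-size
    blocks, with an in-memory index of each block's first key."""

    def __init__(self, sorted_items, block_size=4):
        self.blocks = [sorted_items[i:i + block_size]
                       for i in range(0, len(sorted_items), block_size)]
        # Block index: first key of every block (held in memory once opened).
        self.index = [block[0][0] for block in self.blocks]

    def lookup(self, key):
        if not self.blocks:
            return None
        # Binary search the in-memory index to pick the one candidate block...
        i = max(0, bisect.bisect_right(self.index, key) - 1)
        # ...then read that single block (one disk seek in the real system).
        for k, v in self.blocks[i]:
            if k == key:
                return v
        return None

items = sorted({"aardvark": 1, "apple": 2, "boat": 3, "cat": 4, "dog": 5}.items())
sst = SSTable(items, block_size=2)
print(sst.lookup("boat"))   # -> 3
print(sst.lookup("zebra"))  # -> None
```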

Tablet
- Contains a range of rows of the table (e.g., start: aardvark, end: apple)
- Built out of multiple SSTables, each consisting of 64K blocks plus a block index
- Each tablet is approximately 100-200 MB in size by default

Table
- Multiple tablets make up a table, and tablets can share SSTables
- Tablets do not overlap (e.g., one tablet covers aardvark..apple, the next apple..boat), but SSTables can overlap

Outline Introduction Data Model API Building Blocks Implementation Refinements Performance

Bigtable Components
- A client library that is linked into every client
- Many tablet servers, which handle reads and writes to the tablets they serve
- One master, which assigns tablets to tablet servers, detects the addition and expiration of tablet servers, and balances tablet-server load
- Chubby, a highly available and persistent distributed lock service:
  - a Chubby service consists of five active replicas, one of which is elected master and actively serves requests; Chubby uses the Paxos algorithm to keep its replicas consistent
  - each Chubby client maintains a session with the service; if a client cannot renew its session lease before it expires, the session expires and the client loses its locks
  - Bigtable uses Chubby to ensure there is at most one active master at any time, to store the bootstrap location of Bigtable data, to discover tablet servers and finalize tablet-server deaths, and to store access control lists
  - if Chubby becomes unavailable for an extended period, Bigtable becomes unavailable

Architecture
- The master is responsible for:
  - assigning tablets to tablet servers
  - detecting the addition and expiration of tablet servers
  - balancing tablet-server load
  - garbage collection of files in GFS
  - handling schema changes such as table and column-family creation
- Each tablet server manages a set of tablets (ten to a thousand per server); it handles reads and writes to the tablets it has loaded and splits tablets that have grown too large

Tablet Location
- A three-level hierarchy, analogous to a B+-tree:
  - a file in Chubby stores the location of the root tablet
  - the root tablet (only one; it is never split) stores the addresses of all METADATA tablets
  - METADATA tablets store the addresses of user tablets
- Each METADATA row stores approximately 1KB of data in memory; with 128MB METADATA tablets, the scheme can address 2^34 tablets (2^61 bytes of data in 128MB tablets)

Tablet Location
- The client library caches tablet locations; if the cache is empty, or a cached location turns out to be stale, the client recursively moves up the location hierarchy and queries again
- With an empty cache, locating a tablet takes three network round trips, including one read from Chubby
- Tablet locations are served from memory, so no GFS accesses are required; the library further reduces the cost by prefetching, reading the entries for more than one tablet whenever it reads the METADATA table
- Client data never moves through the master: clients communicate directly with tablet servers for reads and writes, and most clients never talk to the master at all
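
A hedged sketch of the three-level lookup with client-side caching. All names are hypothetical, and each dict below is a stand-in for one network round trip to Chubby, the root tablet, or a METADATA tablet: an empty cache costs three round trips, a warm cache costs none.

```python
CHUBBY_FILE = "rootserver:9000"                           # location of the root tablet
ROOT_TABLET = {"rootserver:9000": "metaserver:9001"}      # -> METADATA tablet location
METADATA_TABLETS = {"metaserver:9001": {"com.cnn.www": "tabletserver:9042"}}

class Client:
    def __init__(self):
        self.cache = {}            # row key -> user tablet location
        self.round_trips = 0

    def locate(self, row_key):
        if row_key in self.cache:                        # common case: cache hit,
            return self.cache[row_key]                   # no extra round trips
        self.round_trips += 1                            # 1. read the Chubby file
        root_location = CHUBBY_FILE
        self.round_trips += 1                            # 2. read the root tablet
        meta_location = ROOT_TABLET[root_location]
        self.round_trips += 1                            # 3. read a METADATA tablet
        location = METADATA_TABLETS[meta_location][row_key]
        self.cache[row_key] = location                   # cache it (a real client also
        return location                                  # prefetches nearby entries)

c = Client()
print(c.locate("com.cnn.www"), c.round_trips)   # 3 round trips with an empty cache
print(c.locate("com.cnn.www"), c.round_trips)   # cached: still 3 in total
```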

Tablet Assignment
- Each tablet is assigned to at most one tablet server at a time
- When a tablet is unassigned and a tablet server has room, the master assigns the tablet by sending it a tablet load request
- The master uses Chubby to keep track of live tablet servers:
  - when a tablet server starts, it creates, and acquires an exclusive lock on, a uniquely named file in a specific Chubby directory; the master monitors this directory (the servers directory) to discover tablet servers
  - a tablet server stops serving its tablets if it loses its exclusive lock, e.g., due to a network partition that caused it to lose its Chubby session (Chubby lets a tablet server check whether it still holds its lock without incurring network traffic)
  - a tablet server will try to reacquire the lock as long as its file still exists; if the file no longer exists, the server can never serve again, so it kills itself
  - whenever a tablet server terminates (e.g., because the cluster management system is removing its machine from the cluster), it tries to release its lock so the master can reassign its tablets more quickly

Tablet Assignment
- Case 1: some tablets are unassigned: the master assigns them to tablet servers with sufficient room
- Case 2: a tablet server stops serving: the master detects this and reassigns the outstanding tablets to other servers
- Case 3: too many small tablets: the master initiates a merge
- Case 4: a tablet grows too large: the owning tablet server initiates a split and notifies the master

Tablet Serving
- The persistent state of a tablet is stored as a sequence of SSTables in GFS
- Tablet mutations are logged in a commit log that stores redo records
- Recently committed updates are kept in memory in a sorted buffer, the memtable; older updates live in the SSTables in GFS

Tablet Serving - Recovery
1. The tablet server fetches the tablet's metadata from the METADATA table, which contains the list of SSTables that comprise the tablet and a set of redo points into the commit logs.
2. The server reads the indices of those SSTables into memory.
3. The server reconstructs the memtable by applying all mutations committed after the redo points.

Tablet Serving - Write operation
1. The tablet server checks that the request is well-formed and that the sender is authorized.
2. The mutation is appended to the commit log.
3. The write is committed.
4. Its contents are inserted into the memtable.

Tablet Serving - Read operation
1. The tablet server checks that the request is well-formed and that the sender is authorized.
2. The read is executed on a merged view of the memtable and the tablet's SSTables; since both are lexicographically sorted, the merged view can be formed efficiently.
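
A minimal sketch of this serving path, assuming a toy single-version store (the Tablet class and its fields are illustrative): writes append a redo record to the commit log and then update the memtable, while reads consult a merged view in which newer sources shadow older SSTables.

```python
class Tablet:
    def __init__(self):
        self.commit_log = []       # redo records (stand-in for the GFS commit log)
        self.memtable = {}         # recently committed updates, in memory
        self.sstables = []         # older, immutable data, newest first

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. log the mutation (redo record)
        self.memtable[key] = value             # 2. then insert it into the memtable

    def read(self, key):
        # Merged view: the memtable first, then SSTables from newest to oldest.
        for source in [self.memtable, *self.sstables]:
            if key in source:
                return source[key]
        return None

t = Tablet()
t.sstables = [{"com.cnn.www|contents:": "<html>old"}]
t.write("com.cnn.www|contents:", "<html>new")
print(t.read("com.cnn.www|contents:"))   # the memtable value shadows the SSTable
```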

Compactions
- The memtable grows as write operations execute
- Two kinds of compaction keep it bounded: minor compactions and merging compactions (a major compaction is a special case of merging)

Compactions - Minor compaction (when the memtable size reaches a threshold)
1. Freeze the memtable
2. Create a new memtable
3. Convert the frozen memtable to an SSTable and write it to GFS

Compactions - Merging compaction (performed periodically)
1. Freeze the memtable
2. Create a new memtable
3. Merge a few SSTables and the frozen memtable into a single new SSTable

Compactions - Major compaction
- A special case of merging compaction that rewrites all SSTables (and the memtable) into exactly one SSTable
- The resulting SSTable contains no deleted data, letting Bigtable reclaim the resources used by deletion entries

Compactions
- Why freeze the memtable and create a new one? So that incoming read and write operations can continue during the compaction.
- Advantages of compaction:
  - it releases memory on the tablet server
  - it reduces the amount of data that has to be read from the commit log during recovery if this tablet server dies
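
Continuing the hypothetical Tablet sketch above, these two functions sketch a minor compaction and a merging compaction: the old memtable is frozen and replaced so reads and writes keep flowing, and its contents end up in immutable SSTables, shrinking the state the commit log must cover.

```python
def minor_compaction(tablet):
    frozen = tablet.memtable                # 1. freeze the current memtable
    tablet.memtable = {}                    # 2. create a new one: R/W keep flowing
    new_sstable = dict(sorted(frozen.items()))
    tablet.sstables.insert(0, new_sstable)  # 3. write it out as a new (immutable) SSTable
    tablet.commit_log.clear()               # the redo point advances: less log to replay

def merging_compaction(tablet, k=2):
    # Merge the memtable and the k newest SSTables into one new SSTable
    # (a major compaction is the case where k covers all SSTables).
    merged = {}
    for source in reversed([tablet.memtable, *tablet.sstables[:k]]):
        merged.update(source)               # newer sources overwrite older values
    tablet.memtable = {}
    tablet.sstables[:k] = [dict(sorted(merged.items()))]
```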

Outline Introduction Data Model API Building Blocks Implementation Refinements Performance

Refinements - Locality groups
- Clients can group multiple column families into a locality group; families that are not typically accessed together go into different locality groups
- For each tablet, each locality group is stored in a separate SSTable
- This makes reads and writes more efficient, since a scan touches only the SSTables of the relevant group

Refinements - Compression
- There is a lot of similar data: the same column in neighbouring rows, and multiple versions of the same cell
- Clients choose a customized compression format, applied per SSTable block (the smallest readable unit)
- Many clients use a two-pass scheme:
  1. Bentley and McIlroy's scheme, which compresses long common strings across a large window
  2. a fast compression algorithm that looks for repetitions in a small window
- Experimental compression ratio: roughly 10% of the original size (Gzip typically achieves 25-33%)

Refinements - Caching
- Tablet servers use two levels of caching to improve read performance: the Scan Cache caches the key-value pairs returned by the SSTable interface (useful when the same data is read repeatedly), and the Block Cache caches SSTable blocks read from GFS (useful when reading data near recently read data)

Refinements - Bloom Filters
- A read may have to consult every SSTable that makes up a tablet's state; if those SSTables are not in memory, that can mean many disk accesses
- Clients can ask for a Bloom filter per SSTable in a locality group; the filter tells the server whether an SSTable might contain data for a given row/column pair, so most lookups for non-existent rows or columns never touch disk
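
A small, self-contained Bloom filter sketch (hypothetical parameters and hashing, not Bigtable's implementation) showing why the filter helps: a negative answer is definitive, so the tablet server can skip the disk read for that SSTable entirely.

```python
import hashlib

class BloomFilter:
    """Probabilistic set: answers 'definitely not present' or 'maybe present'."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.bits = bytearray(num_bits)
        self.num_bits, self.num_hashes = num_bits, num_hashes

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent": the SSTable need not be read at all.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("com.cnn.www/anchor:cnnsi.com")
print(bf.might_contain("com.cnn.www/anchor:cnnsi.com"))  # True
print(bf.might_contain("org.example/contents:"))         # almost surely False
```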

Refinements – Commit-log
- Each tablet server appends mutations for all of its tablets to a single commit log, co-mingling records for different tablets in one physical file; this keeps the number of concurrent GFS writes small and makes group commit effective
- During recovery, the log is sorted by (table, row, log sequence number) so that each recovering tablet server reads only the entries for the tablets it is picking up

Refinements – Tablet Recovery
- Before a tablet is moved to another tablet server, the source server performs a minor compaction (in practice two: one while still serving, and a very quick second one for any mutations that arrived in between), so the new server does not need to replay the commit log at all

Refinements – Exploiting immutability
- SSTables are immutable, so reading them requires no synchronization, and removing deleted data becomes garbage collection of obsolete SSTables (a mark-and-sweep over the SSTables registered in the METADATA table)
- The only mutable structure accessed by both reads and writes is the memtable; each memtable row is copy-on-write, so reads and writes can proceed in parallel
- Immutability also makes tablet splits fast: the child tablets simply share the parent's SSTables

Outline Introduction Data Model API Building Blocks Implementation Refinements Performance

Performance (2006)

Performance
- Random reads are slowest because the tablet server's channel to GFS saturates: each read transfers a whole 64KB SSTable block to use a single small value
- Random reads (mem) are fast because the benchmarked locality group is stored in memory, so reads are served from the tablet server's memory rather than GFS
- Random and sequential writes beat sequential reads because writes touch only the commit log and the memtable, not SSTables in GFS
- Sequential reads beat random reads because of block caching: a fetched block serves the following requests
- Scans are even faster because the tablet server can return more data per RPC, reducing RPC overhead

Performance
- Aggregate throughput scales very differently across operations as tablet servers are added
- Random reads (mem) scale best: roughly a 300x increase in throughput for a 500x increase in the number of tablet servers
- Random reads scale poorly, since each read still moves a 64KB block over the shared network