Big Table – Alon Pluda

(big) Table of contents: Introduction, Data model, API, Building blocks, Implementation, Refinements, Performance, Applications, Conclusions

(big) Table of contents: Introduction, Data model, API, Building blocks, Implementation, Refinements, Performance, Applications, Conclusions (illustration: a Lego table)

Introduction
Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.
Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance.
These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and in terms of latency requirements (from backend bulk processing to real-time data serving).
Bigtable is not a relational database; it is a sparse, distributed, persistent, multi-dimensional sorted map (a key/value store).

(big) Table of contents: Introduction, Data model, API, Building blocks, Implementation, Refinements, Performance, Applications, Conclusions

Data model
Example (the Webtable): row keys are reversed URLs (com.bbc.www, com.cnn.www, il.co.ynet.www, org.apache.hadoop, org.apache.hbase, ...), kept in sorted order; column families include content: (a family with a single column key), language:, and anchor (a family with multiple column keys, e.g. anchor:cnnsi.com and anchor:my.look.ca); a cell such as content: for com.cnn.www holds several timestamped versions (t1, t2, t3).
• Every cell is an uninterpreted array of bytes.
• Two per-column-family settings control automatic garbage collection of versions: keep only the last n versions of a cell, or keep only versions that are new enough.
• The row range of a table is dynamically partitioned into tablets.
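
To make the shape of this model concrete, here is a small illustrative sketch in C++ (not Bigtable code): the table is modeled as a nested std::map keyed by row, then column, then timestamp, with timestamps sorted newest-first. The Webtable-style keys, values, and timestamps are made up for the example.

#include <cstdint>
#include <functional>
#include <iostream>
#include <map>
#include <string>

// Toy, single-machine model of Bigtable's logical data model:
// (row:string, column:string, timestamp:int64) -> uninterpreted bytes.
// Timestamps are ordered descending so the newest version comes first.
using Timestamp = int64_t;
using Versions  = std::map<Timestamp, std::string, std::greater<Timestamp>>;
using Columns   = std::map<std::string, Versions>;  // "family:qualifier" -> versions
using Table     = std::map<std::string, Columns>;   // row key, sorted lexicographically

int main() {
  Table webtable;
  // Row keys are reversed URLs so pages from the same domain sort near each other.
  webtable["com.cnn.www"]["content:"][2] = "<html>...";
  webtable["com.cnn.www"]["content:"][3] = "<html>...";
  webtable["com.cnn.www"]["anchor:cnnsi.com"][9] = "CNN";
  webtable["com.cnn.www"]["anchor:my.look.ca"][8] = "CNN.com";

  // Read the newest version of one cell.
  const Versions& cell = webtable["com.cnn.www"]["content:"];
  std::cout << "latest content (t=" << cell.begin()->first << "): "
            << cell.begin()->second << "\n";
  return 0;
}

The lexicographic ordering of the outer map mirrors how Bigtable keeps rows sorted, which is why reversing URLs clusters pages from the same domain.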

(big) Table of contents: Introduction, Data model, API, Building blocks, Implementation, Refinements, Performance, Applications, Conclusions

API
The Bigtable API provides functions for:
• Creating and deleting tables and column families
• Changing cluster, table, and column-family metadata
• Single-row transactions (atomic read-modify-write on the data stored under one row key)
• Using cells as integer counters
• Executing client-supplied scripts in the address space of the servers

API
// Open the table
Table *T = OpenOrDie("/bigtable/web/webtable");

// Write a new anchor and delete an old anchor
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:www.c-span.org", "CNN");
r1.Delete("anchor:www.abc.com");
Operation op;
Apply(&op, &r1);

API
Scanner scanner(T);
ScanStream *stream;
stream = scanner.FetchColumnFamily("anchor");
stream->SetReturnAllVersions();
scanner.Lookup("com.cnn.www");
for (; !stream->Done(); stream->Next()) {
  printf("%s %s %lld %s\n",
         scanner.RowName(),
         stream->ColumnName(),
         stream->MicroTimestamp(),
         stream->Value());
}

(big) Table of contents: Introduction, Data model, API, Building blocks, Implementation, Refinements, Performance, Applications, Conclusions

Building blocks
Bigtable is built on several other pieces of Google infrastructure:
• Google File System (GFS): Bigtable uses the distributed Google File System to store its metadata, data, and logs.
• Google SSTable file format (Sorted String Table): Bigtable data is stored internally in the SSTable file format; an SSTable provides a persistent, ordered, immutable map from keys to values.
• Google Chubby: Bigtable relies on a highly available and persistent distributed lock service called Chubby.

Building blocks
Google File System (GFS): Bigtable runs over GFS on a shared pool of machines that also run a wide variety of other distributed applications, and it depends on a cluster management system for scheduling jobs, dealing with machine failures, and monitoring machine status.
Bigtable keeps three major kinds of files in GFS:
– Log files: each tablet has its own commit log.
– Data: the data of each tablet, stored in SSTable files.
– METADATA: tablet locations, also stored in SSTable files.

Building blocks
Google SSTable file format:
• An SSTable contains a sequence of blocks (typically 64 KB each).
• A block index, stored at the end of the file, is used to locate blocks; the index is loaded into memory when the SSTable is opened.
• A lookup can be performed with a single disk seek: first find the appropriate block with a binary search in the in-memory index, then read that block from disk.
• Optionally, an SSTable can be completely mapped into memory.
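
A minimal sketch of the lookup path just described, under the assumption that an SSTable can be modeled as sorted key/value blocks plus an in-memory index of each block's last key; this is not the real on-disk format, and the in-memory Block stands in for the single disk read of the real system.

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Toy SSTable: an immutable, sorted key->value sequence grouped into blocks,
// with an index holding each block's last key (in the real format the index
// sits at the end of the file and blocks are read from disk on demand).
struct Block {
  std::vector<std::pair<std::string, std::string>> entries;  // sorted by key
};

struct SSTable {
  std::vector<Block> blocks;
  std::vector<std::string> index;  // index[i] = last key stored in blocks[i]
};

// Lookup: binary-search the in-memory index to choose a single block, then
// binary-search inside that block (the real system reads the block from disk here).
bool SSTableGet(const SSTable& t, const std::string& key, std::string* value) {
  auto it = std::lower_bound(t.index.begin(), t.index.end(), key);
  if (it == t.index.end()) return false;  // key is past the end of the table
  const Block& b = t.blocks[it - t.index.begin()];
  auto e = std::lower_bound(
      b.entries.begin(), b.entries.end(), key,
      [](const std::pair<std::string, std::string>& p, const std::string& k) {
        return p.first < k;
      });
  if (e == b.entries.end() || e->first != key) return false;
  *value = e->second;
  return true;
}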

Building blocks
Google Chubby:
• A Chubby cell consists of five active replicas, one of which is elected master and actively serves requests.
• Chubby provides a namespace that contains directories and small files (less than 256 KB):
– Each directory or file can be used as a lock.
– Reads and writes to a file are atomic.
– The Chubby client library provides consistent caching of Chubby files.
– Each Chubby client maintains a session with the Chubby service.
• Bigtable uses Chubby files for many tasks:
– To ensure there is at most one active master at any time
– To store the bootstrap location of Bigtable data (the root tablet)
– To discover tablet servers and finalize tablet server deaths
– To store Bigtable schema information (the column families of each table)
– To store access control lists (ACLs)

(big) Table of contents: Introduction, Data model, API, Building blocks, Implementation, Refinements, Performance, Applications, Conclusions

Implementation
1. Master server
– Assigns tablets to tablet servers
– Detects the addition and expiration of tablet servers
– Balances tablet-server load
– Garbage-collects files in GFS
– Handles schema changes (table creation, column-family creation/deletion)
2. Tablet server
– Manages a set of tablets
– Handles read and write requests to its tablets
– Splits tablets that have grown too large (100-200 MB)
3. Client
– Does not rely on the master for tablet location information
– Communicates directly with tablet servers for reads and writes

Implementation
(Figure: the three-level, B+ tree-like hierarchy of tablet locations: a Chubby file points to the root tablet, the root tablet points to the METADATA tablets, and the METADATA tablets point to the user tablets.)

Implementation
Tablet assignment:
• Each tablet is assigned to one tablet server at a time.
• The master keeps track of:
– the set of live tablet servers (tracked via Chubby, under the “servers” directory)
– the current assignment of tablets to tablet servers
– the currently unassigned tablets (found by scanning the B+ tree of METADATA tablets)
• When a tablet is unassigned, the master assigns it to an available tablet server by sending a tablet-load request to that server.

Implementation
Master startup:
• When a master is started by the cluster management system, it needs to discover the current tablet assignments before it can change them:
1. The master grabs a unique master lock in Chubby to prevent concurrent master instantiations.
2. The master scans the “servers” directory in Chubby to find the live tablet servers.
3. The master communicates with every live tablet server to discover which tablets are already assigned to each server.
4. The master adds the root tablet to the set of unassigned tablets if an assignment for the root tablet was not discovered in step 3.
5. The master scans the METADATA table to learn the full set of tablets (and detect unassigned tablets).

Implementation
Tablet service – writes:
• The server checks that the request is well-formed.
• The server checks that the sender is authorized (against the list of permitted writers in a Chubby file).
• A valid mutation is written to the commit log.
• After the write has been committed, its contents are inserted into the memtable.
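
A toy sketch of this write path; the commit log, the Chubby ACL check, and the memtable are reduced to in-memory stand-ins, and all names here are hypothetical rather than Bigtable's internal interfaces. A std::nullopt value in the memtable stands for a deletion marker.

#include <map>
#include <optional>
#include <string>
#include <vector>

struct Mutation {
  std::string row, column;
  std::optional<std::string> value;   // std::nullopt = delete the cell
};

class Tablet {
 public:
  bool Apply(const std::string& sender, const Mutation& m) {
    if (!WellFormed(m)) return false;                 // 1. request is well-formed
    if (!Authorized(sender)) return false;            // 2. sender in permitted-writers list
    commit_log_.push_back(m);                         // 3. append mutation to the commit log
    memtable_[m.row + "/" + m.column] = m.value;      // 4. apply to the in-memory memtable
    return true;
  }

 private:
  bool WellFormed(const Mutation& m) const { return !m.row.empty() && !m.column.empty(); }
  bool Authorized(const std::string&) const { return true; }      // stand-in for the Chubby ACL
  std::vector<Mutation> commit_log_;                              // stand-in for the GFS commit log
  std::map<std::string, std::optional<std::string>> memtable_;    // sorted in-memory buffer
};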

Implementation
Tablet service – reads:
• The server checks that the request is well-formed.
• The server checks that the sender is authorized (against the list of permitted readers in a Chubby file).
• A valid read operation is executed on a merged view of the sequence of SSTables and the memtable.
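
A minimal sketch of the merged view, assuming the memtable and the SSTables are simplified to sorted maps from "row/column" keys to optional values (std::nullopt standing in for a deletion marker): the memtable is consulted first, then SSTables from newest to oldest, so newer writes shadow older ones.

#include <map>
#include <optional>
#include <string>
#include <vector>

// Merged-view read over the in-memory memtable and a stack of SSTables.
// Layers are ordered newest-first, so the first hit is the current value;
// a std::nullopt hit means the cell was deleted.
using Layer = std::map<std::string, std::optional<std::string>>;

std::optional<std::string> MergedRead(const Layer& memtable,
                                      const std::vector<Layer>& sstables,  // newest first
                                      const std::string& key) {
  if (auto it = memtable.find(key); it != memtable.end()) return it->second;
  for (const Layer& sst : sstables) {
    if (auto it = sst.find(key); it != sst.end()) return it->second;
  }
  return std::nullopt;  // key not present in any layer
}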

Implementation
Tablet service – recovery:
• The tablet server finds the tablet's SSTables and commit-log redo points by reading the tablet's metadata from the METADATA tablets (searching the B+ tree).
• The tablet server reads the indices of those SSTables into memory and reconstructs the memtable by applying all of the updates that have committed since the redo points.
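
A sketch of the replay step under the same simplified key/value model; LogEntry and its sequence-number field are illustrative, not the real commit-log format.

#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Rebuild the memtable by re-applying commit-log mutations recorded at or
// after the redo point; later entries overwrite earlier ones.
struct LogEntry {
  uint64_t seqno;
  std::string key;                   // simplified "row/column" key
  std::optional<std::string> value;  // std::nullopt = deletion marker
};

std::map<std::string, std::optional<std::string>>
RecoverMemtable(const std::vector<LogEntry>& commit_log, uint64_t redo_point) {
  std::map<std::string, std::optional<std::string>> memtable;
  for (const LogEntry& e : commit_log) {
    if (e.seqno >= redo_point) memtable[e.key] = e.value;
  }
  return memtable;
}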

Implementation
Compaction – minor compaction:
• When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS.
• This shrinks the memory usage of the tablet server and reduces the amount of data that has to be read from the commit log during recovery.

Implementation
Compaction – merging compaction:
• Every minor compaction creates a new SSTable; if this behavior continued unchecked, read operations might need to merge updates from an arbitrary number of SSTables.
• Therefore, when the number of SSTables reaches a bound, the server reads the contents of a few SSTables and the memtable and writes out a single new SSTable.

Implementation
Compaction – major compaction:
• A merging compaction that rewrites all SSTables into exactly one SSTable containing no deletion information or deleted data.
• Bigtable cycles through all of its tablets and regularly applies major compactions to them, reclaiming the resources used by deleted data in a timely fashion.
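
Using the same simplified layers as the read sketch, here is a sketch of how a merging compaction folds several inputs into one output, and how a major compaction additionally drops deletion markers; it illustrates the idea, not Bigtable's compaction code.

#include <map>
#include <optional>
#include <string>
#include <vector>

// Merge several layers (newest first) into a single output layer.
// insert() keeps the first value seen per key, i.e. the newest version.
// A major compaction (drop_deletes = true) also purges deletion markers,
// so the resulting single SSTable contains no deleted data.
using Layer = std::map<std::string, std::optional<std::string>>;

Layer Compact(const std::vector<Layer>& inputs /* newest first */, bool drop_deletes) {
  Layer output;
  for (const Layer& layer : inputs) {
    for (const auto& [key, value] : layer) {
      output.insert({key, value});
    }
  }
  if (drop_deletes) {
    for (auto it = output.begin(); it != output.end();) {
      if (!it->second.has_value()) it = output.erase(it);
      else ++it;
    }
  }
  return output;
}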

(big) Table of contents: Introduction, Data model, API, Building blocks, Implementation, Refinements, Performance, Applications, Conclusions

Refinements
Caching for read performance: to improve read performance, tablet servers use two levels of caching.
• Scan Cache: a higher-level cache that caches the key/value pairs returned by the SSTable interface.
• Block Cache: a lower-level cache that caches SSTable blocks read from the file system.
(Flowchart on the slide: is K13 in the Scan Cache? If not, is the block containing K13 in the Block Cache? If not, read K13's block from the SSTable, then return K13's value.)
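
A sketch of the two cache levels, with GFS reduced to an in-memory stand-in and plain unordered maps in place of properly bounded caches; the structure (check the Scan Cache, then the Block Cache, then fall back to a block read) follows the flow above, while all names are made up for the example.

#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

// Block id -> (key -> value); a stand-in for SSTable blocks stored in GFS.
using BlockMap = std::unordered_map<uint64_t, std::unordered_map<std::string, std::string>>;

struct Caches {
  std::unordered_map<std::string, std::string> scan_cache;  // key -> value (higher level)
  BlockMap block_cache;                                     // block id -> block (lower level)
};

std::optional<std::string> CachedRead(Caches& c, const BlockMap& gfs,
                                      uint64_t block_id, const std::string& key) {
  // 1. Scan Cache: exact key/value pairs returned by recent reads.
  if (auto it = c.scan_cache.find(key); it != c.scan_cache.end()) return it->second;

  // 2. Block Cache: the SSTable block holding the key may already be in memory.
  auto bit = c.block_cache.find(block_id);
  if (bit == c.block_cache.end()) {
    bit = c.block_cache.emplace(block_id, gfs.at(block_id)).first;  // "read block from GFS"
  }

  // 3. Search inside the block and populate the higher-level cache.
  auto kit = bit->second.find(key);
  if (kit == bit->second.end()) return std::nullopt;
  c.scan_cache[key] = kit->second;
  return kit->second;
}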

Refinements
Bloom filters:
• A read operation has to read from all SSTables that make up the state of a tablet; if these SSTables are not in memory, we may end up doing many disk accesses.
• We reduce the number of accesses by using Bloom filters: a Bloom filter tells us, with high probability, whether an SSTable contains a specified row/column pair, and it never gives a false negative.
• Bloom filters use only a small amount of memory.
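
A minimal Bloom filter sketch showing the property relied on above (membership tests with no false negatives and occasional false positives); the bit-array size, the number of probes, and the string-append hashing trick are arbitrary choices for the example, not Bigtable's parameters.

#include <bitset>
#include <cstddef>
#include <functional>
#include <string>

// k hash probes into a fixed bit array. Add() sets k bits per key;
// MayContain() returns false only if the key was definitely never added,
// which lets the read path skip the SSTable (and its disk access) entirely.
class BloomFilter {
 public:
  void Add(const std::string& key) {
    for (std::size_t i = 0; i < kProbes; ++i) bits_.set(Probe(key, i));
  }
  bool MayContain(const std::string& key) const {
    for (std::size_t i = 0; i < kProbes; ++i)
      if (!bits_.test(Probe(key, i))) return false;  // definitely absent
    return true;                                     // probably present
  }

 private:
  static constexpr std::size_t kBits = 1 << 16;
  static constexpr std::size_t kProbes = 4;
  static std::size_t Probe(const std::string& key, std::size_t i) {
    // Derive k different hashes by appending a per-probe salt character.
    return std::hash<std::string>{}(key + static_cast<char>('a' + i)) % kBits;
  }
  std::bitset<kBits> bits_;
};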

(big) Table of contents: Introduction, Data model, API, Building blocks, Implementation, Refinements, Performance, Experience, Conclusions

Performance
Test cluster (diagram: 500 tablet servers, 500 clients, and a GFS cell of 1,786 machines):
• 500 tablet servers
– configured to use 1 GB of RAM
– dual-core 2 GHz Opteron machines with Gigabit Ethernet NICs
– writing to a GFS cell of 1,786 machines, each with 2 x 400 GB IDE drives
• 500 client machines
• Network round-trip time between any two machines was under 1 millisecond.

Performance
Benchmarks:
• Sequential writes – used R row keys, partitioned and assigned to the N clients
• Sequential reads – used R row keys, partitioned and assigned to the N clients
• Random writes – like sequential writes, except the row key is hashed modulo R
• Random reads – like sequential reads, except the row key is hashed modulo R
• Random reads (memory) – like the random reads benchmark, except the locality group that contains the data is marked as in-memory
• Scans – like sequential reads, but uses the support provided by the Bigtable API for scanning over all values in a row range (reduces the number of RPCs)

(big) Table of contents: Introduction, Data model, API, Building blocks, Implementation, Refinements, Performance, Applications, Conclusions

Experience – Characteristics of a few tables in production use (table shown on the slide).

Experience
Google Earth: Google operates a collection of services that provide users with access to high-resolution satellite imagery of the world's surface. These services rely heavily on Bigtable to store their data, and they require both low-latency reaction for interactive serving and storage of very large amounts of data.
