CSCI5570 Large Scale Data Processing Systems


CSCI5570 Large Scale Data Processing Systems
NoSQL
James Cheng, CSE, CUHK
Slide acknowledgement: adapted from slides by Jinyang Li, Shimin Chen, and Mohsen Taheriyan

Bigtable: A Distributed Storage System for Structured Data
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber
OSDI 2006

Bigtable Description: Outline
- Motivation and goals
- Interfaces and semantics
- Architecture
- Implementation details

Motivation
- Lots of (semi-)structured data at Google
  - Web data: contents, crawl metadata, links, anchors, PageRank, ...
  - Per-user data: user preference settings, recent queries/search results, ...
  - Map data: physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, ...
- Scale is huge
  - Billions of Web pages, many versions per page (~20 KB/version)
  - Hundreds of millions of users, thousands of queries/sec
  - 100TB+ of satellite image data
- (The numbers above are as of 2006-07!)

Goals
- Petabytes of data across thousands of commodity servers
- Want asynchronous processes to continuously update different pieces of data
- Want access to the most recent data at any time
- Need to support:
  - Very high read/write rates (millions of ops per second)
  - Efficient retrieval of small subsets of the data
  - Efficient scans over the entirety or subsets of the data
- Often want to examine data changes over time
  - E.g., the contents of a web page over multiple crawls

Why Not a Commercial DB?
- Scale is too large for most commercial databases
- Cost of licensing per machine would be very high
  - Building internally means the system can be applied across many projects for low incremental cost
- Low-level storage optimizations help performance
  - Much harder when running on top of a database layer
- "Also fun and challenging to build large-scale systems"

Bigtable
- A distributed storage system for (semi-)structured data
- Scalable
  - Thousands of servers
  - Terabytes of in-memory data, petabytes of disk-based data
  - Millions of reads/writes per second, efficient scans
- Self-managing
  - Servers can be added/removed dynamically
  - Servers adjust to load imbalance
- Allows clients to reason about the locality properties of data in the underlying storage

Bigtable Applications
- A flexible, high-performance solution for many Google products (as of 2008): web indexing, personalized search, Google Earth, Google Analytics, Google Finance, ...
- Backend bulk processing: throughput-oriented batch-processing jobs
- Real-time data serving: latency-sensitive serving of data to end users

Bigtable Description: Outline
- Motivation and goals
- Interfaces and semantics
- Architecture
- Implementation details

Basic Data Model
- "A Bigtable is a sparse, distributed, persistent multi-dimensional sorted map"
- Indexed by: (row:string, column:string, time:int64) → string
- Example (the webtable figure): row "com.cnn.www" has a "contents:" column holding the page HTML at timestamps t3, t5, and t6, plus anchor columns "anchor:cnnsi.com" = "CNN" (t9) and "anchor:my.look.ca" = "CNN.com" (t8)
- Row keys: up to 64KB, typically 10-100B, sorted (here by reversed URL)
- Each cell holds timestamped versions, garbage-collected per column family
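To make the three-part index concrete, here is a minimal C++ sketch of the logical model only, using nested standard maps (purely illustrative; it says nothing about Bigtable's physical storage):

  #include <cstdint>
  #include <functional>
  #include <map>
  #include <string>

  // Logical view only: a Bigtable maps (row, column, timestamp) -> value,
  // with rows sorted lexicographically and each cell's versions sorted
  // newest-first.
  using Timestamp = int64_t;
  using Cell    = std::map<Timestamp, std::string, std::greater<Timestamp>>;
  using Columns = std::map<std::string, Cell>;    // "family:qualifier" -> versions
  using Table   = std::map<std::string, Columns>; // row key -> columns

  int main() {
    Table webtable;
    webtable["com.cnn.www"]["contents:"][6]         = "<html>...";
    webtable["com.cnn.www"]["anchor:cnnsi.com"][9]  = "CNN";
    webtable["com.cnn.www"]["anchor:my.look.ca"][8] = "CNN.com";

    // Most recent version of "contents:" for row "com.cnn.www".
    const std::string& latest =
        webtable["com.cnn.www"]["contents:"].begin()->second;
    return latest.empty() ? 1 : 0;
  }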

Rows
- Row names/keys are arbitrary strings and are ordered lexicographically
- Rows close together lexicographically are usually on one or a small number of machines
  - Hence, programmers can choose row names to achieve good locality in their programs
  - Example: com.cnn.www vs. www.cnn.com – which one provides more locality for a "site:" query?
- Access to data in a row is atomic
  - The row is the only unit of atomicity in Bigtable
  - Makes concurrent updates to the same row easier to reason about
- Does not support the relational model
  - No table-wide integrity constraints
  - No multi-row transactions

Row-based Locality
- Matches from the same site (e.g., www.cnn.com) should be retrieved together by accessing the minimal number of machines
- Using reversed-DNS URLs as row keys clusters URLs from the same site together, speeding up site-local queries
  - edition.cnn.com, money.cnn.com, www.cnn.com → com.cnn.edition, com.cnn.money, com.cnn.www
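A tiny illustration of the key transformation being described (a hypothetical helper, not part of the Bigtable API):

  #include <iostream>
  #include <sstream>
  #include <string>
  #include <vector>

  // Reverse the dot-separated components of a hostname, so that
  // "www.cnn.com" becomes "com.cnn.www" and all cnn.com pages cluster.
  std::string ReverseHostname(const std::string& host) {
    std::vector<std::string> parts;
    std::stringstream ss(host);
    std::string part;
    while (std::getline(ss, part, '.')) parts.push_back(part);
    std::string out;
    for (auto it = parts.rbegin(); it != parts.rend(); ++it) {
      if (!out.empty()) out += '.';
      out += *it;
    }
    return out;
  }

  int main() {
    std::cout << ReverseHostname("edition.cnn.com") << "\n";  // com.cnn.edition
    std::cout << ReverseHostname("money.cnn.com")   << "\n";  // com.cnn.money
    std::cout << ReverseHostname("www.cnn.com")     << "\n";  // com.cnn.www
  }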

Columns
- Columns have a two-level name structure: family:optional_qualifier (e.g., anchor:cnnsi.com)
- Column family
  - Basic unit of access control: access is granted to a whole column family or not at all
  - Has associated type information; all data stored in a column family is usually of the same type (allows compression)
- Small number of distinct column families in a table, but an unbounded number of columns
  - Qualifiers provide an additional level of indexing, if desired
- Example: row com.cnn.www with a "contents:" column plus anchor columns such as "anchor:cnnsi.com" and "anchor:stanford.edu" holding the anchor text (e.g., "CNN", "CNN homepage")

Timestamps Used to store different versions of data in a cell 64-bit integers New writes default to current time, but timestamps for writes can also be set explicitly by clients Each column family has two garbage-collection settings: “Only keep most recent K values in a cell” “Keep values until they are older than K seconds” Lookup options: “Return most recent K values” “Return all values in timestamp range (or all values)” Example uses: Keep versions of the Web
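A small sketch of how the two garbage-collection settings could be applied to one cell's version map (hypothetical helpers illustrating the policies; when and how Bigtable actually applies them is an implementation detail not covered here):

  #include <cstddef>
  #include <cstdint>
  #include <functional>
  #include <map>
  #include <string>

  using Timestamp = int64_t;
  // Versions of one cell, newest first.
  using Cell = std::map<Timestamp, std::string, std::greater<Timestamp>>;

  // "Only keep the most recent K values in a cell."
  void KeepNewestK(Cell* cell, std::size_t k) {
    auto it = cell->begin();
    for (std::size_t i = 0; i < k && it != cell->end(); ++i) ++it;
    cell->erase(it, cell->end());
  }

  // "Keep values until they are older than max_age" (same unit as timestamps).
  void KeepNewEnough(Cell* cell, Timestamp now, Timestamp max_age) {
    const Timestamp cutoff = now - max_age;
    cell->erase(cell->upper_bound(cutoff), cell->end());  // drop anything older
  }

  int main() {
    Cell versions = {{9, "CNN.com"}, {8, "CNN"}, {3, "<html>..."}};
    KeepNewestK(&versions, 2);                            // keeps t=9 and t=8
    KeepNewEnough(&versions, /*now=*/10, /*max_age=*/5);  // keeps t >= 5
    return versions.size() == 2 ? 0 : 1;
  }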

The Bigtable API
- Metadata operations
  - Create/delete tables or column families
  - Change metadata, access control rights
- Writes (atomic)
  - Set(): write cells in a row
  - DeleteCells(): delete cells in a row
  - DeleteRow(): delete all cells in a row
- Reads
  - Scanner: read arbitrary cells in a Bigtable
  - Each row read is atomic (e.g., supports an atomic read-modify-write sequence)
  - Can restrict returned rows to a particular range
  - Can ask for data from just one row, all rows, etc.
  - Can ask for all columns, just certain column families, or specific columns

API Examples: Write/Modify

  // Open the table
  Table *T = OpenOrDie("/bigtable/web/webtable");

  // Write a new anchor and delete an old anchor
  RowMutation r1(T, "com.cnn.www");
  r1.Set("anchor:www.c-span.org", "CNN");
  r1.Delete("anchor:www.abc.com");
  Operation op;
  Apply(&op, &r1);  // atomic row modification

- No support for (RDBMS-style) multi-row transactions
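A companion read example, modeled on the scan API in the Bigtable paper (the method names follow the paper's read example; treat the snippet as illustrative):

  // Iterate over all anchors of one row, all versions
  Scanner scanner(T);
  ScanStream *stream;
  stream = scanner.FetchColumnFamily("anchor");
  stream->SetReturnAllVersions();
  scanner.Lookup("com.cnn.www");
  for (; !stream->Done(); stream->Next()) {
    printf("%s %s %lld %s\n",
           scanner.RowName(),
           stream->ColumnName(),
           stream->MicroTimestamp(),
           stream->Value());
  }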

Bigtable Description: Outline
- Motivation and goals
- Interfaces and semantics
- Architecture
- Implementation details

Tablets
- A Bigtable table is partitioned into many tablets based on row keys
- Tablets (100-200MB each) are stored in a particular structure in GFS
- The tablet is the unit of distribution and load balancing in Bigtable
- Each tablet is served by one tablet server
- Tablets are stateless (all state is in GFS), so serving a tablet can be restarted on another machine at any time
- Example (the webtable figure): one tablet covers rows [com.aaa, com.cnn.www) and the next covers [com.cnn.www, com.dodo.www), each holding sorted rows with their "contents:", "language:", etc. columns

Tablet Structure: SSTable
- Uses Google SSTables (Sorted Strings Tables), a key building block
- An SSTable is an immutable, sorted file of key/value pairs: a persistent, ordered, immutable map from keys to values, where both keys and values are arbitrary byte strings
  - Keys are <row, column, timestamp>
  - SSTable files are stored in GFS
  - SSTables allow only appends, no in-place updates (deletion entries are possible)
    - Why do you think they don't use something that supports updates?
- Internally, an SSTable is a sequence of 64KB blocks plus a block index (block ranges) used to locate blocks
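A minimal sketch of the lookup path this block index enables (assumed in-memory structures; the real SSTable format is internal): the index maps the last key of each block to the block, so a point lookup touches at most one block.

  #include <map>
  #include <optional>
  #include <string>

  // One block: sorted key/value pairs, immutable once written.
  struct Block {
    std::map<std::string, std::string> entries;
  };

  // Simplified in-memory SSTable: the block index maps the last key of each
  // (nominally 64KB) block to that block, so a point lookup consults at most
  // one block.
  struct SSTable {
    std::map<std::string, Block> index_by_last_key;

    std::optional<std::string> Get(const std::string& key) const {
      // First block whose last key is >= the lookup key.
      auto it = index_by_last_key.lower_bound(key);
      if (it == index_by_last_key.end()) return std::nullopt;
      auto entry = it->second.entries.find(key);
      if (entry == it->second.entries.end()) return std::nullopt;
      return entry->second;
    }
  };

  int main() {
    SSTable sst;
    sst.index_by_last_key["com.cnn.www"].entries = {
        {"com.cnn.edition", "EN"}, {"com.cnn.www", "<html>..."}};
    return sst.Get("com.cnn.edition").has_value() ? 0 : 1;
  }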

Tablet Structure
- A tablet stores a range of rows from a table using SSTable files, which are stored in GFS
- Example (figure): a tablet covering rows [aardvark, apple) backed by several SSTables

Tablet Splitting
- When tablets grow too big, they are split (there is merging, too)
- Example (figure): a tablet holding rows from com.aaa through com.zuppa/menu.html is split into two smaller tablets
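A toy illustration of splitting a tablet's row range at a chosen key (in the real system a split is cheap: roughly, only metadata changes, and the child tablets can initially keep using the parent's SSTables):

  #include <map>
  #include <string>
  #include <utility>

  using KVMap = std::map<std::string, std::string>;

  struct ToyTablet {
    std::string start_row, end_row;  // serves rows in [start_row, end_row)
    KVMap rows;
  };

  // Split one tablet into two around a chosen split key.
  std::pair<ToyTablet, ToyTablet> Split(const ToyTablet& t,
                                        const std::string& split_key) {
    ToyTablet left{t.start_row, split_key, {}};
    ToyTablet right{split_key, t.end_row, {}};
    for (const auto& [row, value] : t.rows)
      (row < split_key ? left : right).rows.emplace(row, value);
    return {std::move(left), std::move(right)};
  }

  int main() {
    ToyTablet t{"com.aaa", "com.zzz",
                {{"com.cnn.www", "<html>"}, {"com.yahoo/kids.html", "<html>"}}};
    auto [left, right] = Split(t, "com.dodo.www");
    return (left.rows.size() == 1 && right.rows.size() == 1) ? 0 : 1;
  }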

Servers
- One master
  - Assigns/load-balances tablets across tablet servers
  - Detects tablet servers coming up and going down
  - Garbage-collects deleted tablets
  - Coordinates metadata updates (e.g., create table, ...)
  - Does NOT provide tablet locations (we'll see how clients get those)
  - The master is stateless – its state is in Chubby and in Bigtable itself (recursively)!
- Many tablet servers
  - Each manages a set of tablets
  - Handle read/write requests to their tablets
  - Split tablets that have grown too large
  - Tablet servers are also stateless – their state is in GFS!

System Architecture
- Bigtable client + Bigtable client library: issues Open() and read/write requests; metadata operations go to the master
- Bigtable tablet master: performs metadata operations and load balancing
- Bigtable tablet servers: serve tablet data; reads and writes go directly from clients to tablet servers
- GFS: holds tablet data (logs and SSTables)
- Chubby: holds metadata, handles master election

Locating Tablets
- Since tablets move around from server to server, given a row, how do clients find the right machine?
  - Tablet properties: startRowIndex and endRowIndex; need to find the tablet whose row range covers the target row
- One approach: ask the Bigtable master
  - A central server would almost certainly be a bottleneck in a large system
- Instead: store special tables containing tablet location info in the Bigtable cell itself (a recursive design)

Locating Tablets (cont.)
- Tablets are located using a hierarchical, B+-tree-like structure:
  - A Chubby file stores the location of the root (1st) METADATA tablet
  - The root METADATA tablet (stored as one single tablet, never split) points to all other METADATA tablets
  - The other METADATA tablets map <table_id, end_row> → location for the tablets of every user table (UserTable_1 ... UserTable_N)
- Each METADATA record is ~1KB; a METADATA tablet is at most 128MB
- A 3-level location scheme can therefore address 2^34 tablets
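A quick check of the capacity figure as a small self-contained program (the 1KB and 128MB constants come straight from the slide):

  #include <cstdint>
  #include <iostream>

  int main() {
    // Each METADATA row is ~1KB and a METADATA tablet is capped at 128MB,
    // so one METADATA tablet holds roughly 2^17 location records.
    const uint64_t records_per_metadata_tablet = (128ull << 20) / (1ull << 10);

    // The root METADATA tablet points to up to 2^17 other METADATA tablets,
    // each of which points to up to 2^17 user tablets: 2^34 tablets in total.
    const uint64_t addressable_tablets =
        records_per_metadata_tablet * records_per_metadata_tablet;

    std::cout << records_per_metadata_tablet << " records per METADATA tablet\n"
              << addressable_tablets << " addressable user tablets (2^34)\n";
  }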

Other Components (Building Blocks)
- Building blocks:
  - Google File System (GFS): raw storage
  - Scheduler: schedules jobs onto machines
  - Chubby: lock service
- How Bigtable uses them:
  - GFS: stores all persistent state (log and data files)
  - Scheduler: schedules the jobs involved in Bigtable serving
  - Chubby: master election, keeping track of tablet servers

GFS
- The GFS master manages metadata
- Data transfers happen directly between clients and chunkservers
- Files are broken into chunks (typically 64MB)
- Chunks are replicated across three machines for reliability
- (Figure: a GFS master, clients, and chunkservers holding replicated chunks C0-C5)

Chubby
- A highly available and persistent distributed lock service with a file-system interface
- Provides a namespace of directories and small files; each directory or file can be used as a lock
- Intuitively, Chubby provides locks that can also store a small amount of data, which can be read by anyone but written only by the holder of a writer's lock
- Also provides notifications for file updates
- Uses Paxos to provide consistency
  - Paxos: a family of protocols for solving consensus in a network of unreliable processors
  - Consensus: the process of agreeing on one result among a group of participants

Bigtable uses Chubby to:
- Ensure that there is at most one active master at any time
- Store the bootstrap location of Bigtable data (the root tablet)
- Discover tablet servers and finalize tablet server deaths
- Store Bigtable schema information (the column families of each table)
- Store access control lists

A Typical Cluster
- Cluster-wide services: a cluster scheduling master, a lock service (Chubby), and a GFS master – by and large stateless!
- Each machine runs Linux and hosts a GFS chunkserver, a scheduler slave, user tasks (e.g., MapReduce), and a Bigtable tablet server (one machine also runs the Bigtable master)
- The persistent state lives at the machine level, in the GFS chunkservers – stateful!

Specific Example: Web Search
- Inserts or updates of page contents and anchors to pages
- Targeted queries (should not require scans!)
- Full or partial table scans

Bigtable Description: Outline
- Motivation and goals
- Interfaces and semantics
- Architecture
- Implementation details

Tablet Assignment
- The master assigns tablets to tablet servers

Tablet Assignment
- Each tablet is assigned to one tablet server
- Bigtable uses Chubby to keep track of tablet servers
  - When a tablet server starts, it creates, and acquires an exclusive lock on, a uniquely named file in a specific Chubby directory
  - The master monitors this directory to discover tablet servers
  - A tablet server stops serving its tablets if it loses its exclusive lock (e.g., due to a network partition); it may try to reacquire the lock if the file still exists, otherwise it kills itself
  - Requiring an exclusive lock ensures that each tablet is served by at most one tablet server

Tablet Assignment (cont.)
- The master periodically asks each tablet server for the status of its lock
  - If it cannot reach a tablet server after several attempts, the master tries to acquire an exclusive lock on that tablet server's file
  - If the master can acquire the lock, Chubby is live and the tablet server is either dead or cut off from Chubby
  - The master then prevents that tablet server from serving by deleting its file, and marks all of its tablets as unassigned
- The master assigns unassigned tablets to tablet servers with sufficient room
- Master failures do not change tablet assignments
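A rough, self-contained sketch of that monitoring decision, using fake stand-ins for Chubby and the tablet server (every name below is invented for illustration; this is not Chubby's real API, only the decision logic described above):

  #include <iostream>
  #include <string>

  struct FakeChubby {
    bool lock_available = false;  // true if the tablet server has lost its lock
    bool TryAcquireExclusiveLock(const std::string&) { return lock_available; }
    void DeleteFile(const std::string& f) { std::cout << "deleted " << f << "\n"; }
  };

  struct FakeTabletServer {
    bool reachable = true;
    bool holds_lock = true;
    bool ReportsLockHeld() const { return reachable && holds_lock; }
  };

  // Master's periodic check for one tablet server.
  void CheckTabletServer(FakeChubby& chubby, FakeTabletServer& server,
                         const std::string& server_file) {
    if (server.ReportsLockHeld()) return;  // healthy: nothing to do

    // Server unreachable or lost its lock: let Chubby arbitrate. If the
    // master can grab the server's lock, Chubby is alive and the server is
    // dead (or partitioned away from Chubby), so it must never serve again.
    if (chubby.TryAcquireExclusiveLock(server_file)) {
      chubby.DeleteFile(server_file);
      std::cout << "tablets of " << server_file << " marked unassigned\n";
    }
  }

  int main() {
    FakeChubby chubby{/*lock_available=*/true};
    FakeTabletServer dead_server{/*reachable=*/false, /*holds_lock=*/false};
    CheckTabletServer(chubby, dead_server, "/chubby/servers/ts-42");
  }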

Master Startup Operation
- Grabs a unique master lock in Chubby
  - Prevents concurrent master instantiations
- Scans the directory in Chubby to find live tablet servers
- Communicates with every live tablet server
  - Discovers the tablets each one is serving
- Scans the METADATA table to learn the full set of tablets
  - Unassigned tablets are marked for assignment

Tablet Serving
- Updates are written to a tablet (commit) log
- Recent updates are kept in the memtable (in memory); older updates are stored in a sequence of SSTables

Tablet Serving: Write and Read
- Write
  - The tablet server checks authorization for the write
  - The write is appended to the commit log, and its contents are then inserted into the memtable
- Read
  - The tablet server checks authorization for the read
  - The read is executed on a merged view of the sequence of SSTables and the memtable
  - The merged view can be formed efficiently since the SSTables and the memtable are sorted
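A minimal sketch of the merged read view over in-memory stand-ins (real SSTables live in GFS): sources are consulted newest-first, so the memtable shadows older SSTables.

  #include <map>
  #include <optional>
  #include <string>
  #include <vector>

  using KVMap = std::map<std::string, std::string>;  // sorted key -> value

  // Point lookup against a merged view of the memtable and the SSTables.
  // `sources` is ordered newest-first (memtable, then SSTables newest to
  // oldest), so the first hit wins: newer data shadows older data.
  std::optional<std::string> MergedRead(const std::vector<const KVMap*>& sources,
                                        const std::string& key) {
    for (const KVMap* source : sources) {
      auto it = source->find(key);
      if (it != source->end()) return it->second;
    }
    return std::nullopt;
  }

  int main() {
    KVMap memtable  = {{"com.cnn.www", "<html>new</html>"}};
    KVMap sstable_1 = {{"com.cnn.money", "<html>...</html>"},
                       {"com.cnn.www", "<html>old</html>"}};
    auto value = MergedRead({&memtable, &sstable_1}, "com.cnn.www");
    return (value && *value == "<html>new</html>") ? 0 : 1;
  }

Range scans work the same way, except the sorted sources are combined with a k-way merge instead of a point lookup.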

Tablet Serving: Recovering a Tablet
- The tablet server reads the tablet's metadata from the METADATA table
  - The metadata contains the list of SSTables and pointers to any commit logs that may contain data for the tablet
- The server reads the indices of the SSTables into memory
- The server reconstructs the memtable by applying all of the updates since the redo points

Compaction
- Minor compaction: convert the memtable into an SSTable
  - Reduces memory usage
  - Reduces log traffic on restart
- Merging compaction: merge a few SSTables and the memtable into one new SSTable
  - Bounds the number of SSTables that reads must consult
- Major compaction: a merging compaction that rewrites all SSTables into exactly one SSTable
  - Saves disk space (high compression rate) and removes deleted entries
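A sketch of what a compaction does to the data, with plain sorted maps standing in for SSTables and a marker value standing in for a deletion entry (simplified; versioning is ignored):

  #include <map>
  #include <set>
  #include <string>
  #include <vector>

  using KVMap = std::map<std::string, std::string>;
  const std::string kDeleted = "<DELETED>";  // stand-in for a deletion entry

  // Merge several sorted inputs, ordered newest-first, into one new
  // SSTable-like map. Newer inputs shadow older ones. In a major compaction
  // the output covers all data for the tablet, so deletion entries can be
  // dropped instead of carried forward.
  KVMap Compact(const std::vector<const KVMap*>& newest_first, bool major) {
    KVMap out;
    std::set<std::string> decided;  // keys whose newest value has been seen
    for (const KVMap* input : newest_first) {
      for (const auto& [key, value] : *input) {
        if (!decided.insert(key).second) continue;  // a newer input already won
        if (value == kDeleted && major) continue;   // drop deleted entries
        out[key] = value;
      }
    }
    return out;
  }

  int main() {
    KVMap memtable = {{"com.cnn.www", kDeleted}};
    KVMap sstable  = {{"com.cnn.money", "<html>...</html>"},
                      {"com.cnn.www", "<html>old</html>"}};
    KVMap merged = Compact({&memtable, &sstable}, /*major=*/true);
    return merged.count("com.cnn.www") == 0 ? 0 : 1;  // deleted row is gone
  }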

Bigtable Summary
- A scalable distributed storage system for semi-structured data
- Offers a multi-dimensional-map interface: <row, column, timestamp> → value
- Offers atomic reads/writes within a row
- Key design philosophy: statelessness, which is key for scalability
  - All Bigtable servers (including the master) are stateless
  - All state is stored in the reliable GFS and Chubby systems
  - Bigtable leverages strong-semantics operations in these systems (appends in GFS, file locks in Chubby)
- Availability and scalability: if a tablet server (or the master) dies, a replacement can easily be started, since servers are stateless