Managing Data in the Cloud

Scaling in the Cloud (figure): client sites reach the application through a load balancer (proxy) and a tier of app servers backed by a MySQL master DB, with replication to MySQL slave DBs. The database becomes the scalability bottleneck and cannot leverage elasticity. CS271

Scaling in the Cloud with Key-Value Stores (figure): client sites reach the load balancer (proxy) and Apache + app servers, which read and write a key-value store tier instead of a relational database. CS271

CAP Theorem (Eric Brewer): a distributed system can simultaneously provide at most two of Consistency, Availability, and Partition tolerance. "Towards Robust Distributed Systems," PODC 2000; "CAP Twelve Years Later: How the 'Rules' Have Changed," IEEE Computer 2012. CS271

Key Value Stores. Key-value data model: the key is the unique identifier and the granularity for consistent access; the value can be structured or unstructured. Gained widespread popularity: in house, Bigtable (Google), PNUTS (Yahoo!), Dynamo (Amazon); open source, HBase, Hypertable, Cassandra, Voldemort. A popular choice for the modern breed of web applications. CS271

Big Table (Google). Data model: a sparse, persistent, multi-dimensional sorted map, partitioned across multiple servers. The map is indexed by a row key, a column key, and a timestamp; the value is an uninterpreted array of bytes: (row: byte[], column: byte[], time: int64) → byte[]. CS271
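
To make the data model concrete, here is a minimal sketch in Python of such a sorted, multi-versioned map; the class and method names (TinyBigtableModel, put, get, scan) are illustrative, not Bigtable's actual API.

```python
class TinyBigtableModel:
    """Toy model of Bigtable's map: (row, column, timestamp) -> bytes.
    scan() iterates rows in sorted key order; each (row, column) cell keeps
    multiple timestamped versions, newest first."""

    def __init__(self):
        self._rows = {}  # row key (bytes) -> {column (bytes) -> [(ts, value), ...]}

    def put(self, row, column, ts, value):
        versions = self._rows.setdefault(row, {}).setdefault(column, [])
        versions.append((ts, value))
        versions.sort(reverse=True)               # newest version first

    def get(self, row, column, ts=None):
        """Return the newest value at or before ts (or the newest overall)."""
        for version_ts, value in self._rows.get(row, {}).get(column, []):
            if ts is None or version_ts <= ts:
                return value
        return None

    def scan(self, start_row, end_row):
        """Row keys are sorted lexicographically, which enables range scans."""
        for row in sorted(self._rows):
            if start_row <= row < end_row:
                yield row, self._rows[row]

# Example: a web-table cell keyed by reversed URL, as in the paper's running example.
t = TinyBigtableModel()
t.put(b"com.cnn.www", b"contents:", ts=3, value=b"<html>...</html>")
print(t.get(b"com.cnn.www", b"contents:"))
```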

Architecture Overview: a shared-nothing architecture consisting of thousands of commodity PC nodes, with Bigtable's data model layered on top of the Google File System. CS271

Atomicity Guarantees in Big Table Every read or write of data under a single row is atomic. Objective: make read operations single-sited! CS271

Big Table's Building Blocks. Google File System (GFS): highly available distributed file system that stores log and data files. Chubby: highly available, persistent distributed lock manager. Tablet servers: handle reads and writes to their tablets and split tablets that grow too large; each tablet is typically 100-200 MB in size. Master server: assigns tablets to tablet servers, detects the addition and deletion of tablet servers, and balances tablet-server load. CS271

Overview of Bigtable Architecture (figure): Chubby and the master handle lease management and control operations; tablet servers, each with master and Chubby proxies, a log manager, and a cache manager, serve tablets T1…Tn and store their data in the Google File System. CS271

GFS Architectural Design. A GFS cluster: a single master and multiple chunkservers, accessed by multiple clients, running on commodity Linux machines. A file: represented as fixed-size chunks, each labeled with a 64-bit unique global ID, stored at chunkservers, with 3-way replication across chunkservers. CS271
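
A rough sketch of the chunking scheme under the stated parameters (64 MB chunks, 64-bit IDs, 3-way replication); the ToyMaster class and its placement policy are made up for illustration and omit real GFS machinery such as leases, heartbeats, and re-replication.

```python
import random

CHUNK_SIZE = 64 * 1024 * 1024      # 64 MB fixed-size chunks
REPLICATION = 3                     # 3-way replication across chunkservers

class ToyMaster:
    def __init__(self, chunkservers):
        self.chunkservers = chunkservers
        self.file_to_chunks = {}    # path -> [chunk_id, ...]
        self.chunk_locations = {}   # chunk_id -> [chunkserver, ...]

    def create_file(self, path, size_bytes):
        num_chunks = max(1, -(-size_bytes // CHUNK_SIZE))   # ceiling division
        chunk_ids = [random.getrandbits(64) for _ in range(num_chunks)]  # 64-bit global IDs
        self.file_to_chunks[path] = chunk_ids
        for cid in chunk_ids:
            # pick 3 distinct chunkservers to hold replicas of this chunk
            self.chunk_locations[cid] = random.sample(self.chunkservers, REPLICATION)

    def lookup(self, path, chunk_index):
        cid = self.file_to_chunks[path][chunk_index]
        return cid, self.chunk_locations[cid]

master = ToyMaster(chunkservers=[f"cs{i}" for i in range(8)])
master.create_file("/crawl/part-0001", size_bytes=300 * 1024 * 1024)  # ~5 chunks
print(master.lookup("/crawl/part-0001", chunk_index=2))
```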

Architectural Design (figure): the application's GFS client asks the GFS master for chunk locations and then exchanges chunk data directly with GFS chunkservers, each storing chunks in its local Linux file system. CS271

Single-Master Design: simple. The master answers only chunk-location requests; a client typically asks for multiple chunk locations in a single request, and the master also predictively returns the locations of chunks immediately following those requested. CS271
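
A toy illustration of that batching and prefetching behavior; the classes and the PREFETCH window are hypothetical, but they show why one master round trip can serve several subsequent chunk accesses from the client's cache.

```python
class ToyLocationMaster:
    PREFETCH = 2    # also return locations for this many following chunks

    def __init__(self, chunks_per_file):
        # chunk index -> replica locations (hypothetical layout)
        self.locations = {i: [f"cs{(i + r) % 5}" for r in range(3)]
                          for i in range(chunks_per_file)}

    def get_locations(self, first_chunk, count):
        last = min(first_chunk + count + self.PREFETCH, len(self.locations))
        return {i: self.locations[i] for i in range(first_chunk, last)}

class ToyClient:
    def __init__(self, master):
        self.master = master
        self.cache = {}      # chunk index -> replicas, avoids re-asking the master

    def locate(self, chunk_index):
        if chunk_index not in self.cache:
            # one request fetches several locations, plus a prefetch window
            self.cache.update(self.master.get_locations(chunk_index, count=4))
        return self.cache[chunk_index]

client = ToyClient(ToyLocationMaster(chunks_per_file=20))
client.locate(0)          # one round trip caches chunks 0..5
print(client.locate(3))   # served from the client cache, no master contact
```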

Metadata. The master stores three major types: file and chunk namespaces (persistent in the operation log), file-to-chunk mappings (persistent in the operation log), and the locations of each chunk's replicas (not persistent; rebuilt by asking the chunkservers). All metadata is kept in memory: fast, and it enables quick global scans for garbage collection and reorganization. Only about 64 bytes of metadata per 64 MB of data. CS271
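
A back-of-the-envelope sketch of what that ratio means for master memory (and why many small files later became a problem, as the GFS Revisited slides below discuss):

```python
# Back-of-the-envelope: master memory needed for chunk metadata, assuming the
# slide's figure of ~64 bytes of metadata per 64 MB chunk.
BYTES_PER_CHUNK_METADATA = 64
CHUNK_SIZE = 64 * 1024 ** 2

def master_metadata_bytes(total_data_bytes):
    chunks = -(-total_data_bytes // CHUNK_SIZE)          # ceiling division
    return chunks * BYTES_PER_CHUNK_METADATA

petabyte = 1024 ** 5
print(master_metadata_bytes(petabyte) / 1024 ** 3, "GiB of metadata per PiB of data")
# -> 1.0 GiB: tiny for large files, but millions of sub-64MB files each still
#    cost a full metadata entry, which is the master-memory pressure described
#    in the GFS Revisited slides below.
```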

Mutation Operation in GFS. A mutation is any write or append operation. The data needs to be written to all replicas, and all replicas must apply mutations in the same order even when multiple clients issue them concurrently (GFS achieves this by granting a lease to one replica, the primary, which picks a serial order for the chunk). CS271
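
A simplified sketch of how a single ordering point keeps replicas consistent; the classes below are illustrative only and ignore the real GFS details such as the separation of data flow from control flow and lease expiration.

```python
import itertools

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = []                 # mutations applied, in order

    def apply(self, serial_no, mutation):
        self.data.append((serial_no, mutation))

class PrimaryReplica(Replica):
    """The replica holding the chunk lease assigns one serial order to all
    mutations and forwards that order to the secondaries."""
    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries = secondaries
        self.serial = itertools.count(1)

    def mutate(self, mutation):
        serial_no = next(self.serial)          # single ordering point
        self.apply(serial_no, mutation)
        for replica in self.secondaries:
            replica.apply(serial_no, mutation) # every replica sees the same order
        return serial_no

secondaries = [Replica("cs2"), Replica("cs3")]
primary = PrimaryReplica("cs1", secondaries)
primary.mutate(b"append: record A")            # concurrent clients all go through
primary.mutate(b"append: record B")            # the primary, so replicas agree
assert secondaries[0].data == secondaries[1].data == primary.data
```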

GFS Revisited. "GFS: Evolution on Fast-Forward," an interview with the GFS designers in CACM, March 2011. The single master was critical for early deployment. "the choice to establish 64MB … was much larger than the typical file-system block size, but only because the files generated by Google's crawling and indexing system were unusually large." As the application mix changed over time, GFS had to … deal efficiently with large numbers of files requiring far less than 64MB (think in terms of Gmail, for example). The problem was not so much with the number of files itself, but rather with the memory demands all of those files made on the centralized master, thus exposing one of the bottleneck risks inherent in the original GFS design. CS271

GFS Revisited (Cont'd). "the initial emphasis in designing GFS was on batch efficiency as opposed to low latency." On the original single-master design: "A single point of failure may not have been a disaster for batch-oriented applications, but it was certainly unacceptable for latency-sensitive applications, such as video serving." Future directions: a distributed master, etc. An interesting and entertaining read. CS271

PNUTS Overview. Data model: a simple relational model (really a key-value store), with single-table scans with predicates. Fault tolerance: redundancy at multiple levels (data, metadata, etc.); leverages relaxed consistency for high availability, allowing reads and writes despite failures. Pub/sub message system: the Yahoo! Message Broker for asynchronous updates. CS271

Asynchronous replication CS271

Consistency Model. Hide the complexity of data replication. Falls between the two extremes of one-copy serializability and eventual consistency. Key assumption: applications manipulate one record at a time. Per-record timeline consistency: all replicas of a record preserve the same update order. CS271

Implementation. A read returns a consistent version. One replica is designated as the master (per record), and all updates are forwarded to that master. Master designation is adaptive: the replica receiving most of the writes becomes the master. CS271
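
A minimal sketch of per-record mastership and timeline consistency, with illustrative class names (RecordMaster, RecordReplica); real PNUTS propagates updates asynchronously through the Yahoo! Message Broker rather than in a synchronous loop.

```python
class RecordReplica:
    def __init__(self):
        self.version = 0
        self.value = None

    def apply(self, version, value):
        assert version == self.version + 1      # in-order delivery of updates
        self.version, self.value = version, value

class RecordMaster(RecordReplica):
    def __init__(self, remote_replicas):
        super().__init__()
        self.remote_replicas = remote_replicas

    def write(self, value):
        self.apply(self.version + 1, value)     # stamp a new version locally
        for r in self.remote_replicas:          # then push it out (in version order)
            r.apply(self.version, value)
        return self.version

replicas = [RecordReplica(), RecordReplica()]
master = RecordMaster(replicas)
master.write({"user": "Brian", "status": "inserted"})   # v.1
master.write({"user": "Brian", "status": "updated"})    # v.2
print(replicas[0].version)   # every replica moves forward along the same timeline
```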

Consistency model. Goal: make it easier for applications to reason about updates and cope with asynchrony. What happens to a record with primary key "Brian"? (Figure: the record is inserted as v.1, updated through v.2 … v.7, then deleted as v.8, all within generation 1 of its timeline.) CS271

Consistency model (figure): a plain read may be served by any replica, returning either a stale version or the current version of the record. CS271

Consistency model (figure): a read-up-to-date request is guaranteed to return the current version, never a stale one. CS271

Consistency model (figure): a read of "at least v.6" may be served by any replica whose version is v.6 or newer, so it can still return a version that is stale relative to the current one. CS271

Consistency model (figure): a write goes to the record's master and appends a new current version to the record's timeline. CS271

Consistency model (figure): a conditional write ("write if the current version is v.7") succeeds only if the record is still at v.7; if the record has moved on, the write is rejected with an ERROR, which prevents lost updates. CS271
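
The preceding figures correspond to per-record calls of the kind described in the PNUTS paper (read-any, read-critical, read-latest, write, test-and-set-write). The sketch below gives toy semantics for them; ToyRecord and its fields are invented for illustration.

```python
class ToyRecord:
    def __init__(self):
        self.master_version, self.master_value = 0, None
        self.replica_versions = {}               # region -> (version, value), may lag

    def write(self, value):
        self.master_version += 1
        self.master_value = value
        return self.master_version

    def test_and_set_write(self, expected_version, value):
        if expected_version != self.master_version:
            raise ValueError("ERROR: record was modified concurrently")
        return self.write(value)

    def read_any(self, region):                  # may return a stale version
        return self.replica_versions.get(region, (0, None))

    def read_critical(self, region, min_version):  # at least as new as min_version
        version, value = self.read_any(region)
        if version >= min_version:
            return version, value
        return self.read_latest()                # fall back to the master copy

    def read_latest(self):                       # always the current version
        return self.master_version, self.master_value

r = ToyRecord()
v = r.write("v.7 contents")                      # ordinary write
r.test_and_set_write(expected_version=v, value="v.8 contents")     # succeeds
try:
    r.test_and_set_write(expected_version=v, value="lost update")  # now stale -> ERROR
except ValueError as e:
    print(e)
```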

PNUTS Architecture (figure): clients issue requests through a REST API to routers, which locate records on storage units; the tablet controller maintains the tablet-to-storage-unit mapping, and the message broker carries updates. Routers, message broker, and storage units form the data-path components. CS271

PNUTS architecture (figure): a local region contains clients, the REST API, routers, the Yahoo! Message Broker (YMB), the tablet controller, and storage units; updates are propagated asynchronously to remote regions. CS271

System Architecture: Key Features Pub/Sub Mechanism: Yahoo! Message Broker Physical Storage: Storage Unit Mapping of records: Tablet Controller Record locating: Routers CS271

Highlights of PNUTS Approach: shared-nothing architecture; multiple datacenters for geographic distribution; timeline consistency and access to stale data; a publish-subscribe system for reliable, fault-tolerant communication; replication with a per-record master. CS271

Amazon’s Key-Value Store: Dynamo Adapted from Amazon’s Dynamo Presentation CS271

Highlights of Dynamo: high write availability; optimistic replication, with vector clocks for conflict resolution; consistent hashing (as in Chord) in a controlled environment; quorums for relaxed consistency. CS271
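
A minimal sketch, not Amazon's code, of two of the mechanisms named above: consistent-hashing placement on a ring and the R + W > N quorum condition; the node names and the N, R, W values are illustrative.

```python
import bisect, hashlib

N, R, W = 3, 2, 2                       # illustrative quorum configuration

def ring_hash(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.points = sorted((ring_hash(n), n) for n in nodes)

    def preference_list(self, key):
        """The N nodes clockwise from the key's position hold its replicas."""
        start = bisect.bisect(self.points, (ring_hash(key), ""))
        return [self.points[(start + i) % len(self.points)][1] for i in range(N)]

ring = Ring([f"node{i}" for i in range(8)])
print("replicas for cart:alice ->", ring.preference_list("cart:alice"))

# Reads succeed after R replicas answer, writes after W; choosing R + W > N
# makes every read quorum overlap every write quorum.
assert R + W > N
```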

Too many choices – Which system should I use? Cooper et al., SOCC 2010 CS271

Benchmarking Serving Systems. A standard benchmarking tool for evaluating key-value stores: the Yahoo! Cloud Serving Benchmark (YCSB). Evaluate different systems on common workloads, with a focus on performance and scale-out. CS271

Benchmark tiers. Tier 1 – Performance: latency versus throughput as the offered throughput increases. Tier 2 – Scalability: latency as database and system size increase ("scale-out"), and latency as we elastically add servers ("elastic speedup"). CS271
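
A tiny stand-in for a Tier 1 run (latency versus offered throughput); this is not the YCSB harness, just an illustration of the measurement, with fake_kv_op standing in for a real read or update against a store.

```python
import random, statistics, time

def fake_kv_op():
    time.sleep(random.uniform(0.0005, 0.0015))   # stand-in for a read/update

def measure(target_ops_per_sec, duration_sec=1.0):
    latencies, deadline = [], time.time() + duration_sec
    interval = 1.0 / target_ops_per_sec
    while time.time() < deadline:
        start = time.time()
        fake_kv_op()
        latencies.append(time.time() - start)
        time.sleep(max(0.0, interval - (time.time() - start)))  # pace to the target rate
    return statistics.mean(latencies), len(latencies) / duration_sec

for target in (100, 200, 400):
    avg_latency, achieved = measure(target)
    print(f"target {target} ops/s -> achieved {achieved:.0f} ops/s, "
          f"mean latency {avg_latency * 1000:.2f} ms")
```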

Workload A – update heavy (50/50 read/update). Cassandra (based on Dynamo) is optimized for heavy updates; Cassandra uses hash partitioning.

Workload B – read heavy (95/5 read/update). PNUTS uses MySQL, and MySQL is optimized for read operations. CS271

Workload E – short scans (scans of 1-100 records of size 1 KB). HBase uses an append-only log, so it is optimized for scans; the same holds for MySQL and PNUTS. Cassandra uses hash partitioning, so its scan performance is poor. CS271

Summary: different databases are suitable for different workloads; these are evolving systems and the landscape is changing dramatically; there is an active development community around the open-source systems. CS271

Two approaches to scalability. Scale-up: the classical enterprise setting (RDBMS); flexible ACID transactions; transactions in a single node. Scale-out: cloud friendly (key-value stores); execution at a single server; limited functionality and guarantees; no multi-row or multi-step transactions. CS271

Key-Value Store Lessons What are the design principles learned?

Design Principles [DNIS 2010]: Separate system state from application state. System metadata is critical but small; application data has varying needs. Separation allows the use of different classes of protocols for each. CS271

Design Principles: Decouple ownership from data storage. Ownership is exclusive read/write access to data; decoupling allows lightweight ownership migration. (Figure: classical DBMSs bundle ownership, i.e. multi-step transactions or read/write access, with the transaction manager, recovery, cache manager, and storage; the decoupled design separates ownership from storage.) CS271

Design Principles: Limit most interactions to a single node. This allows horizontal scaling, graceful degradation during failures, and no distributed synchronization. Thanks: Curino et al., VLDB 2010. CS271

Design Principles Limited distributed synchronization is practical Maintenance of metadata Provide strong guarantees only for data that needs it CS271

Fault-tolerance in the Cloud. Need to tolerate catastrophic failures: geographic replication. How do we support ACID transactions over data replicated at multiple datacenters? One-copy serializability: clients can access data in any datacenter, and it appears as a single copy with atomic access. SBBD 2012

Megastore: Entity Groups (Google, CIDR 2011). Entity groups are sub-databases formed by static partitioning; transactions within an entity group are cheap (the common case), while cross-entity-group transactions are expensive (the rare case). SBBD 2012

Megastore Entity Groups are Semantically Predefined. Email: each email account forms a natural entity group, and operations within an account are transactional; for example, a user's send-message operation is guaranteed to be observed even after fail-over to another replica. Blogs: a user's profile is an entity group; operations such as creating a new blog rely on asynchronous messaging with two-phase commit. Maps: dividing the globe into non-overlapping patches, each patch can be an entity group. SBBD 2012
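
A toy sketch of why single-group transactions are the cheap, common case: each entity group serializes its own commits against its own log, so no cross-group coordination is needed. The EntityGroup class is invented for illustration; real Megastore replicates each group's log with Paxos.

```python
class EntityGroup:
    def __init__(self, name):
        self.name = name
        self.log = []            # per-group write-ahead log
        self.state = {}

    def commit(self, ops):
        """Atomically apply a set of key -> value writes within this group."""
        self.log.append(ops)     # one log position == one committed transaction
        self.state.update(ops)

# One entity group per email account, as in the Email example above.
alice = EntityGroup("email:alice")
alice.commit({"sent/42": "Hi Bob", "counters/sent": 1})   # cheap, single-group

# A cross-group operation (e.g. also updating Bob's inbox) would need
# two-phase commit or asynchronous messaging between groups (the rare case).
bob = EntityGroup("email:bob")
bob.commit({"inbox/17": "Hi Bob (from Alice)"})
```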

Megastore Slides adapted from authors’ presentation SBBD 2012

Google's Spanner: Database Tech That Can Span the Planet (OSDI 2012) SBBD 2012

The Big Picture (OSDI 2012), figure: 2PC for atomicity; 2PL + wound-wait for isolation; Paxos for consistency; TrueTime powered by GPS and atomic clocks; movedir for load balancing; tablets and logs stored as SSTables on the Colossus File System.

TrueTime. TrueTime: APIs that provide real time with bounds on error, powered by GPS and atomic clocks. It enforces external consistency: if the start of T2 occurs after the commit of T1, then the commit timestamp of T2 must be greater than the commit timestamp of T1. Concurrency control: update transactions use 2PL; read-only transactions use real time to return a consistent snapshot.
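
A sketch of the TrueTime idea and the commit-wait rule; the EPSILON constant is an assumed uncertainty bound standing in for the GPS/atomic-clock error, and the function names mirror, but are not, the real API.

```python
# now() returns an interval [earliest, latest] guaranteed to contain real time.
# A transaction picks a commit timestamp s >= now().latest and then "commit
# waits" until after(s) is true before making its writes visible, so if T2
# starts after T1 commits, T2's timestamp is strictly greater (external
# consistency).
import time

EPSILON = 0.004   # assumed clock uncertainty, in seconds

def tt_now():
    t = time.time()
    return (t - EPSILON, t + EPSILON)      # [earliest, latest]

def tt_after(t):
    return tt_now()[0] > t                 # t has definitely passed

def commit(writes):
    s = tt_now()[1]                        # commit timestamp >= latest possible "now"
    while not tt_after(s):                 # commit wait, roughly 2 * EPSILON
        time.sleep(EPSILON / 4)
    # only now release locks / make `writes` visible at timestamp s
    return s

t1 = commit({"x": 1})
t2 = commit({"x": 2})                      # starts after t1's commit finished
assert t2 > t1
```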

Primary References
Chang, Dean, Ghemawat, Hsieh, Wallach, Burrows, Chandra, Fikes, Gruber: Bigtable: A Distributed Storage System for Structured Data. OSDI 2006.
Ghemawat, Gobioff, Leung: The Google File System. SOSP 2003.
McKusick, Quinlan: GFS: Evolution on Fast-Forward. Communications of the ACM, 2010.
Cooper, Ramakrishnan, Srivastava, Silberstein, Bohannon, Jacobsen, Puz, Weaver, Yerneni: PNUTS: Yahoo!'s Hosted Data Serving Platform. VLDB 2008.
DeCandia, Hastorun, Jampani, Kakulapati, Lakshman, Pilchin, Sivasubramanian, Vosshall, Vogels: Dynamo: Amazon's Highly Available Key-Value Store. SOSP 2007.
Cooper, Silberstein, Tam, Ramakrishnan, Sears: Benchmarking Cloud Serving Systems with YCSB. SoCC 2010.
CS271