PNUTS: Yahoo!'s Hosted Data Serving Platform. Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver and Ramana Yerneni (Yahoo! Research)


PNUTS: Yahoo!'s Hosted Data Serving Platform
Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver and Ramana Yerneni (Yahoo! Research)
Presented by Team Silverlining: Rakesh Nair, Navya Sruti Sirugudi, Shantanu Sardal, Smruti Aski, Chandra Sekhar

Distributed Databases – Overview
Web applications need:
- Scalability, and the ability to scale linearly
- Geographic scope
- High availability and fault tolerance
Web applications typically have:
- Simplified query needs: no joins or aggregations
- Relaxed consistency needs: applications can tolerate stale or reordered data

Agenda
- Introduction
- PNUTS Features
- Architecture
- PNUTS Applications
- Experimental Results
- Feature Enhancements
- Related Work

PNUTS
- A massive-scale hosted database system
- Focus on data serving for web applications
- Provides data storage organized as hashed or ordered tables
- Low latency for large numbers of concurrent requests
- Novel per-record consistency guarantees

What is PNUTS?
A parallel database with geographic replication, indexes and views, a structured but flexible schema, and a hosted, managed infrastructure. Tables are defined with familiar DDL, for example:

CREATE TABLE Parts (
  ID VARCHAR,
  StockNumber INT,
  Status VARCHAR
  ...
)

Features
- Data model: relational data model, scatter-gather operations, asynchronous notifications, bulk loading
- Fault tolerance: employs redundancy; supports low-latency reads and writes even after failure
- Pub-sub message system: asynchronous operations carried out using the Yahoo! Message Broker (YMB)
- Record-level mastering: all high-latency operations are asynchronous
- Hosting: centrally managed database service shared by multiple applications

Design Decisions
- Record-level, asynchronous geographic replication
- Guaranteed message delivery service
- A consistency model that is not fully serializable
- Hashed and ordered table organizations, with a flexible schema
- Data management as a hosted service

Scalability
Data-path components: storage units, routers, and a tablet controller; clients reach the system through a REST API, and a message broker carries updates.

Replication
Each local region (storage units, routers, tablet controller, with clients on the REST API) is replicated to remote regions through the Yahoo! Message Broker (YMB).

Data and Query Model
- Data is organized into tables of records with attributes
- The PNUTS query language supports selection and projection from a single table
- PNUTS allows applications to declare tables as hashed or ordered

Query Model
- Per-record operations: get, set, delete
- Multi-record operations: multiget, scan, getrange
- Web service (RESTful) API
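The per-record and multi-record calls above can be sketched with a minimal in-memory store. The class and method names are illustrative only, not the actual Yahoo! API; a real client would issue these calls over the RESTful web service.

```python
# Minimal in-memory sketch of the PNUTS-style per-record call surface.
# Names (SimpleStore, etc.) are hypothetical, not from the real system.

class SimpleStore:
    def __init__(self):
        self._data = {}  # key -> record (dict of attributes)

    def get(self, key):
        return self._data.get(key)

    def set(self, key, record):
        self._data[key] = record

    def delete(self, key):
        self._data.pop(key, None)

    def multiget(self, keys):
        # Multi-record read: fetch several keys in one call.
        return {k: self._data.get(k) for k in keys}

store = SimpleStore()
store.set("alice", {"status": "online"})
store.set("bob", {"status": "away"})
print(store.multiget(["alice", "bob"]))
```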

Consistency Model
- Web applications typically manipulate one record at a time, so PNUTS offers per-record timeline consistency
- Data in PNUTS is replicated across sites
- Each record carries a sequence number (number of updates since creation) and a version number (changed on each update)
- A hidden field in each record stores which copy is the master copy
- Updates can be submitted to any copy; they are forwarded to the master and applied in the order received by the master
- Each record also records the origin of its last few updates; the current master can change mastership based on this information
- A mastership change is simply a record update

Consistency Model
Goal: make it easier for applications to reason about updates and cope with asynchrony.
Example: what happens to a record with primary key "Brian"? Over time the record is inserted, updated, and eventually deleted, producing an ordered timeline of versions v.1 through v.8 within Generation 1.
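The timeline above can be sketched as follows: all updates to a record are serialized through that record's master replica, which stamps each with an increasing version number; replicas that replay this ordered log converge on the same timeline. The class below is an illustrative model, not PNUTS code.

```python
# Sketch of per-record timeline consistency: the master replica applies
# updates in a single order and assigns increasing version numbers
# (v.1, v.2, ...). Replicas replaying the log see the same timeline.

class RecordTimeline:
    def __init__(self):
        self.version = 0
        self.value = None
        self.log = []          # ordered update log, as replicas would see it

    def apply_update(self, value):
        self.version += 1      # next version on the timeline
        self.value = value
        self.log.append((self.version, value))
        return self.version

master = RecordTimeline()
master.apply_update("inserted")
master.apply_update("updated")
v = master.apply_update("updated again")
print(v)  # 3
```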

Consistency Model (APIs)
Read: may return any copy of the record, either the current version or a stale version.

Consistency Model (APIs)
Read up-to-date: always returns the current version of the record, never a stale one.

Consistency Model (APIs)
Read-critical(required version): returns a version of the record at least as new as the required version (e.g. read ≥ v.6).

Consistency Model (APIs)
Write: applies an update to the record, producing a new current version.

Consistency Model (APIs)
Test-and-set-write(required version): the write succeeds only if the record is still at the required version (e.g. write if = v.7); otherwise it returns an error.

Consistency Model (APIs)
Mechanism: these guarantees are implemented through per-record mastership.
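The API variants on the preceding slides can be sketched as one toy versioned record. The method names mirror the slide terms, but the single-copy implementation is illustrative; in the real system read-any may hit a stale replica, and the version checks happen at the record's master.

```python
# Toy model of the consistency API variants: read (any), read-critical,
# write, and test-and-set-write. One copy stands in for all replicas.

class VersionedRecord:
    def __init__(self):
        self.version = 0
        self.value = None

    def write(self, value):
        self.version += 1
        self.value = value
        return self.version

    def read_any(self):
        # May legitimately return a stale copy; with a single copy
        # it is always current.
        return self.value

    def read_critical(self, required_version):
        # Succeeds only with a copy at least as new as required_version.
        if self.version < required_version:
            raise RuntimeError("copy too stale for read-critical")
        return self.value

    def test_and_set_write(self, required_version, value):
        # Write only if the record is still at the expected version.
        if self.version != required_version:
            raise RuntimeError("version mismatch: concurrent update")
        return self.write(value)

rec = VersionedRecord()
v1 = rec.write("a")
rec.test_and_set_write(v1, "b")   # succeeds: record is still at v1
```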

System Architecture
- The system is divided into regions, typically geographically distributed
- Each region contains a complete copy of each table
- A pub/sub mechanism (the Yahoo! Message Broker) provides reliability and replication
- Data tables are horizontally partitioned into groups of records called tablets
- Each server might have hundreds or thousands of tablets

Tablet Splitting and Balancing
- Each storage unit holds many tablets (horizontal partitions of a table)
- Tablets may grow over time; overfull tablets are split
- A storage unit may become a hotspot; load is shed by moving tablets to other servers
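The split step above can be sketched as follows: once a tablet exceeds a size threshold, its key range is cut at a midpoint key and the records divide between two new tablets. The threshold and the simple midpoint policy are assumptions for illustration, not the production policy.

```python
# Sketch of splitting an overfull tablet at a midpoint key.
# SPLIT_THRESHOLD is an arbitrary illustrative value.

SPLIT_THRESHOLD = 4

def maybe_split(tablet):
    """tablet: sorted list of (key, record) pairs.
    Returns a list of one tablet (no split) or two tablets (split)."""
    if len(tablet) <= SPLIT_THRESHOLD:
        return [tablet]
    mid = len(tablet) // 2
    return [tablet[:mid], tablet[mid:]]

tablet = [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("e", 5), ("f", 6)]
left, right = maybe_split(tablet)
print(left[-1][0], right[0][0])  # c d  (the split point)
```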

Reading Data
Three components: storage unit (SU), router, and tablet controller.
- Each router holds an interval mapping from each tablet's boundaries to the SU containing that tablet
- For ordered tables, the primary-key space is divided into intervals
- For hash tables, the hash space is divided into intervals, one per tablet

Tablet Controller
- Routers contain only a cached copy of the interval mapping; the mapping is owned by the tablet controller
- Routers fetch an updated mapping from the tablet controller when a read request fails
- This simplifies recovery from router failures
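The refresh-on-failure pattern above can be sketched in a few lines: the router serves lookups from its cache and only goes back to the tablet controller when a lookup misses. Class and attribute names here are illustrative.

```python
# Sketch of the router's cached interval mapping with refresh on failure.
# The mapping is owned by the tablet controller; the router caches it.

class TabletController:
    def __init__(self):
        self.mapping = {"tablet1": "SU1"}   # tablet -> storage unit

    def current_mapping(self):
        return dict(self.mapping)

class Router:
    def __init__(self, controller):
        self.controller = controller
        self.cache = controller.current_mapping()  # cached copy only

    def route(self, tablet):
        su = self.cache.get(tablet)
        if su is None:
            # Stale or missing entry: refetch the authoritative mapping.
            self.cache = self.controller.current_mapping()
            su = self.cache[tablet]
        return su

ctrl = TabletController()
router = Router(ctrl)
ctrl.mapping["tablet2"] = "SU3"   # mapping changes after the router cached it
print(router.route("tablet2"))    # SU3, after one refresh round-trip
```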

Accessing a Single Record
The client sends "Get key k" to a router (1), which forwards the request to the SU holding k (2); the SU returns the record for key k (3), and the result is passed back to the client (4).

Bulk Read
The client sends a set of keys {k1, k2, ..., kn} to a scatter/gather server (1), which issues the individual reads (Get k1, Get k2, ...) to the storage units holding them (2).

Range Queries
The router maps key intervals to storage units:
  MIN–Canteloupe → SU1
  Canteloupe–Lime → SU3
  Lime–Strawberry → SU2
  Strawberry–MAX → SU1
A range query such as "Grapefruit ... Pear?" is split at the interval boundaries: "Grapefruit ... Lime?" goes to SU3 and "Lime ... Pear?" goes to SU2. The records (Apple, Avocado, Banana, Blueberry, Canteloupe, Grape, Kiwi, Lemon, Lime, Mango, Orange, Strawberry, Tomato, Watermelon) are spread across storage units 1-3 according to these intervals.
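The interval lookup above amounts to a binary search over the sorted boundary keys. A minimal sketch using the slide's boundaries (the `bisect` module does the search; the function name is illustrative):

```python
# Router interval lookup for an ordered table, using the boundaries from
# the slide: MIN-Canteloupe -> SU1, Canteloupe-Lime -> SU3,
# Lime-Strawberry -> SU2, Strawberry-MAX -> SU1.
import bisect

boundaries = ["Canteloupe", "Lime", "Strawberry"]   # interval split points
units = ["SU1", "SU3", "SU2", "SU1"]                # one unit per interval

def storage_unit_for(key):
    # bisect_right finds which interval the primary key falls into.
    return units[bisect.bisect_right(boundaries, key)]

print(storage_unit_for("Grapefruit"))  # SU3 (Canteloupe <= Grapefruit < Lime)
print(storage_unit_for("Tomato"))      # SU1 (Strawberry <= Tomato < MAX)
```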

Updates
The write path: the client sends "Write key k" to a router (1), which forwards it to the record's master storage unit (2). The SU publishes the write to the message broker (3-4); once the broker commits the message it returns SUCCESS (5-6), and the new sequence number for key k is returned through the router to the client (7-8).

Yahoo! Message Broker
- A distributed publish-subscribe service
- Guarantees delivery once a message is published
- Messages are logged at the site where they are published, and at other sites when received
- Messages published to a particular YMB cluster are delivered in the same order at all other clusters
- Record updates are published to YMB by the master copy (record-level mastering)
- All replicas subscribe to the updates, and receive them in the same order for a particular record
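The delivery property that replication relies on, namely that every subscriber sees a topic's messages in the same publish order, can be shown with a toy single-process broker. This is a sketch, not the Yahoo! Message Broker, which is distributed and logs messages for guaranteed delivery.

```python
# Toy pub-sub broker: messages published to one topic reach every
# subscriber in the same publish order, so replicas subscribed to a
# record's topic converge on the same update sequence.

class Broker:
    def __init__(self):
        self.subscribers = {}   # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, message):
        # Deliver in publish order to every subscriber of the topic.
        for cb in self.subscribers.get(topic, []):
            cb(message)

broker = Broker()
replica_a, replica_b = [], []
broker.subscribe("tablet-42", replica_a.append)
broker.subscribe("tablet-42", replica_b.append)
broker.publish("tablet-42", "update v.1")
broker.publish("tablet-42", "update v.2")
print(replica_a == replica_b)  # True: both replicas saw the same order
```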

Asynchronous Replication

Other Features
- Per-record transactions
- Copying a tablet (e.g., on failure): request a copy, publish a checkpoint message, take a copy of the tablet as of when the checkpoint is received, then apply later updates
- Tablet splits have to be coordinated across all copies

Query Processing
- A range scan can span tablets; it is executed by the scatter-gather engine (in the router)
- Only one tablet is scanned at a time, since the client may not need all results at once
- A continuation object is returned to the client to indicate where the range scan should continue
- Notification: one pub-sub topic per tablet; clients know about tables, not tablets, and are automatically subscribed to all of a table's tablets, even as tablets are added or removed
- The usual pub-sub problem of undelivered notifications is handled in the usual way
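The one-tablet-at-a-time scan with a continuation can be sketched as follows. The function shape and the integer continuation token are assumptions for illustration; PNUTS returns an opaque continuation object.

```python
# Sketch of a range scan that returns one tablet's worth of results plus
# a continuation token telling the client where to resume.

def scan_range(tablets, start, end, cont=0):
    """tablets: list of sorted (key, record) lists, in key order.
    Returns (results, continuation); continuation is None when done."""
    for i in range(cont, len(tablets)):
        hits = [(k, r) for k, r in tablets[i] if start <= k < end]
        if hits:
            nxt = i + 1 if i + 1 < len(tablets) else None
            return hits, nxt          # one tablet per call; resume at nxt
    return [], None

tablets = [[("a", 1), ("b", 2)], [("c", 3), ("d", 4)]]
page1, cont = scan_range(tablets, "a", "z")
page2, cont = scan_range(tablets, "a", "z", cont)
print(page1, page2)
```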

PNUTS Applications
- User database: millions of active Yahoo! users (user profiles, IM buddy lists); record timelines with relaxed consistency; a hosted DB shared by many apps
- Social and Web 2.0 apps: rapidly evolving and expanding, so a flexible schema; connections in a social graph fit the ordered-table abstraction
- Content metadata: bulk data lives in a distributed FS while metadata lives in PNUTS, helping high-performance operations like file creation, deletion, and renaming
- Listings management: comparison shopping; ordered tables and views keep data sorted by price, rating, etc.
- Session data: large session-state storage; PNUTS as a service gives easy access to a session store

Experimental Setup
- Production PNUTS code, enhanced with the ordered table type
- Three PNUTS regions: two on the west coast, one on the east coast
- Per region: 5 storage units, 2 message brokers, 1 router
- West: dual 2.8 GHz Xeon, 4 GB RAM, 6-disk RAID 5 array
- East: quad 2.13 GHz Xeon, 4 GB RAM, 1 SATA disk
- Workload: requests/second, 0-50% writes, 80% locality

Inserts
Inserts required 75.6 ms per insert in West 1 (the tablet master), ms per insert into the non-master West 2, and ms per insert into the non-master East.

10% writes by default

Scalability

Request Skew

Size of Range Scans

Related Work
- Distributed and parallel databases, especially query processing and transactions
- BigTable, Dynamo, S3, SimpleDB, SQL Server Data Services, Cassandra
- Distributed filesystems: Ceph, Boxwood, Sinfonia
- Distributed (P2P) hash tables: Chord, Pastry, ...
- Database replication: master-slave, epidemic/gossip, synchronous, ...

Conclusions and Ongoing Work
PNUTS is an interesting research product:
- Research: consistency, performance, fault tolerance, rich functionality
- Product: make it work, keep it (relatively) simple, learn from experience and real applications
Ongoing work:
- Indexes and materialized views
- Bundled updates
- Batch query processing

Summary
Aims of PNUTS:
- Rich database functionality with low latency at massive scale
- Trade-offs among functionality, performance, and scalability
Key design points:
- Asynchronous replication for low write latency
- A consistency model with useful guarantees that does not sacrifice scalability
- A hosted service to minimize operating costs for applications
- A limited feature set, preserving reliability and scale
Novel aspects:
- Per-record timeline consistency with asynchronous replication
- The message broker as both replication mechanism and redo log
- Flexible mapping of tablets to storage units, enabling automatic failover and load balancing

Thank You! Questions?