PNUTS: Yahoo!’s Hosted Data Serving Platform Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen,

Presentation transcript:

PNUTS: Yahoo!’s Hosted Data Serving Platform Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver and Ramana Yerneni Yahoo! Research With some additions by S. Sudarshan

2 How do I build a cool new web app? Option 1: Code it up! Make it live! Scale it later. It gets posted to Slashdot. Scale it now! Flickr, Twitter, MySpace, Facebook, …

3 How do I build a cool new web app? Option 2: Make it industrial strength! Evaluate scalable database backends; evaluate scalable indexing systems; evaluate scalable caching systems; architect data partitioning schemes; architect data replication schemes; architect monitoring and reporting infrastructure; write application; go live; realize it doesn’t scale as well as you hoped; rearchitect around bottlenecks. 1 year later: ready to go!

4 Example: social network updates. Brian, Sonja, Jimi, Brandon, Kurt. What are my friends up to? Sonja: Brandon:

5 Example: social network updates 16 Mike <ph.. 6 Jimi <ph.. 8 Mary <re.. 12 Sonja <ph.. 15 Brandon <po.. 17 Bob <re.. Flower

6 What do we need from our DBMS? Web applications need: scalability (and the ability to scale linearly), geographic scope, high availability. Web applications typically have: simplified query needs (no joins, aggregations) and relaxed consistency needs (applications can tolerate stale or reordered data).

7 What is PNUTS?

8 CREATE TABLE Parts ( ID VARCHAR, StockNumber INT, Status VARCHAR … ) Parallel database; geographic replication; indexes and views; structured, flexible schema; hosted, managed infrastructure. [Figure: table rows A to F, each tagged E, W, or C, shown in several copies]

9 Query model. Per-record operations: Get, Set, Delete. Multi-record operations: Multiget, Scan, Getrange. Web service (RESTful) API.

10 Detailed architecture [Figure labels: data-path components, storage units, routers, tablet controller, REST API, clients, message broker]

11 Detailed architecture [Figure labels: storage units, routers, tablet controller, REST API, clients, local region, remote regions, YMB]

12 Tablet splitting and balancing. Each storage unit has many tablets (horizontal partitions of the table). Tablets may grow over time; overfull tablets split. A storage unit may become a hotspot; shed load by moving tablets to other servers.
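
The splitting and load-shedding ideas on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `MAX_KEYS` threshold, the `Tablet` class, and the use of tablet count as the load measure are all hypothetical and are not taken from the slides.

```python
from bisect import insort

MAX_KEYS = 4  # hypothetical split threshold, chosen only for illustration


class Tablet:
    """A horizontal partition: a sorted run of keys from the table."""

    def __init__(self, keys=()):
        self.keys = sorted(keys)


def insert_key(tablets, index, key):
    """Insert key into tablets[index]; split the tablet if it becomes overfull."""
    tablet = tablets[index]
    insort(tablet.keys, key)
    if len(tablet.keys) > MAX_KEYS:
        mid = len(tablet.keys) // 2
        upper = Tablet(tablet.keys[mid:])   # new tablet takes the upper half
        tablet.keys = tablet.keys[:mid]     # original keeps the lower half
        tablets.insert(index + 1, upper)


def shed_load(storage_units):
    """Move one tablet from the unit with the most tablets to the one with the fewest."""
    busiest = max(storage_units, key=lambda name: len(storage_units[name]))
    idlest = min(storage_units, key=lambda name: len(storage_units[name]))
    if len(storage_units[busiest]) - len(storage_units[idlest]) > 1:
        storage_units[idlest].append(storage_units[busiest].pop())


# A full tablet receives one more key and splits in two.
tablets = [Tablet(["apple", "banana", "grape", "kiwi"])]
insert_key(tablets, 0, "lemon")
split_result = [t.keys for t in tablets]

# A unit holding three tablets sheds one to an empty unit.
units = {"SU1": [Tablet(), Tablet(), Tablet()], "SU2": []}
shed_load(units)
unit_sizes = {name: len(ts) for name, ts in units.items()}
```

A real system would presumably measure load by request rate rather than tablet count; tablet count is used here only to keep the sketch short.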

13 Query processing

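14 Accessing data: (1) the client sends Get key k to a router; (2) the router forwards Get key k to the storage unit (SU); (3) the SU returns the record for key k to the router; (4) the router returns the record to the client.

15 Bulk read: (1) the client sends {k1, k2, …, kn} to the scatter/gather server; (2) the server issues Get k1, Get k2, Get k3, … to the storage units (SUs).

16 Range queries. Router interval map: MIN-Cantaloupe → SU1; Cantaloupe-Lime → SU3; Lime-Strawberry → SU2; Strawberry-MAX → SU1. The query Grapefruit…Pear? is split into Grapefruit…Lime? (SU3) and Lime…Pear? (SU2). Keys in the ordered table: Apple, Avocado, Banana, Blueberry, Cantaloupe, Grape, Kiwi, Lemon, Lime, Mango, Orange, Strawberry, Tomato, Watermelon.

17 Updates [Figure: a write to key k passes through the routers, the storage unit (SU), and the message brokers in eight numbered steps; the message brokers return SUCCESS, and the client receives the sequence # for key k]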
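
The router's interval map on this slide behaves like a sorted list of boundaries that can be searched by bisection. The sketch below is one way to express that in Python; the half-open interval convention `[low, high)`, the use of an empty string for MIN, and the function names are assumptions made for the example, not details given in the slides.

```python
from bisect import bisect_right

# Lower boundary of each interval and the storage unit that owns it.
# "" stands in for MIN; the last interval runs to MAX.
BOUNDS = ["", "Cantaloupe", "Lime", "Strawberry"]
OWNERS = ["SU1", "SU3", "SU2", "SU1"]


def route(key):
    """Return the storage unit whose interval [low, high) contains key."""
    return OWNERS[bisect_right(BOUNDS, key) - 1]


def split_range(low, high):
    """Split the query range [low, high) into one sub-range per interval."""
    pieces = []
    i = bisect_right(BOUNDS, low) - 1
    while low < high:
        end = BOUNDS[i + 1] if i + 1 < len(BOUNDS) else None  # None means MAX
        piece_high = high if end is None or high <= end else end
        pieces.append((low, piece_high, OWNERS[i]))
        low = piece_high
        i += 1
    return pieces


# The slide's example: Grapefruit…Pear spans two intervals.
pieces = split_range("Grapefruit", "Pear")
```

Here `pieces` comes out as the two sub-queries shown on the slide: Grapefruit to Lime on SU3, then Lime to Pear on SU2.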

18 Yahoo Message Bus: a distributed publish-subscribe service. Guarantees delivery once a message is published; logging happens at the site where the message is published, and at other sites when it is received. Guarantees that messages published to a particular cluster will be delivered in the same order at all other clusters. Record updates are published to YMB by the master copy. All replicas subscribe to the updates and get them in the same order for a particular record.

19 Asynchronous replication and consistency

20 Asynchronous replication

21 Consistency model. Goal: make it easier for applications to reason about updates and cope with asynchrony. What happens to a record with primary key “Brian”? [Timeline: record inserted, updates, delete; versions v.1 to v.8 in Generation 1]

22 Consistency model: Read [Timeline: v.1 to v.8, Generation 1; stale version and current version marked]

23 Consistency model: Read up-to-date [Timeline: v.1 to v.8, Generation 1; stale version and current version marked]

24 Consistency model: Read-critical(required version), shown as Read ≥ v.6 [Timeline: v.1 to v.8, Generation 1; stale version and current version marked]

25 Consistency model: Write [Timeline: v.1 to v.8, Generation 1; stale version and current version marked]

26 Consistency model: Test-and-set-write(required version), shown as Write if = v.7, which returns ERROR [Timeline: v.1 to v.8, Generation 1; stale version and current version marked]

27 Consistency model. Mechanism: per-record mastership [Timeline: v.1 to v.8, Generation 1; Write if = v.7 returns ERROR; stale version and current version marked]
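
The five calls shown on the consistency-model slides can be mimicked with a toy versioned record. This is a minimal sketch under stated assumptions: a single master copy, a single lagging replica, and the class, method, and exception names are all invented for illustration. It is not the PNUTS API.

```python
class VersionMismatch(Exception):
    """Raised when test-and-set-write finds a different current version."""


class Record:
    def __init__(self, value):
        self.master = [(1, value)]  # version history at the master copy
        self.replica_upto = 1       # highest version the replica has applied

    def _current(self):
        return self.master[-1]

    def _replica(self):
        return self.master[self.replica_upto - 1]

    def replicate(self):
        """Model asynchronous replication catching the replica up."""
        self.replica_upto = self._current()[0]

    def read_any(self):
        """Read: may return a stale version."""
        return self._replica()

    def read_latest(self):
        """Read up-to-date: returns the current version."""
        return self._current()

    def read_critical(self, required_version):
        """Return a version at least as new as required_version."""
        version, value = self._replica()
        if version >= required_version:
            return version, value
        return self._current()

    def write(self, value):
        version = self._current()[0] + 1
        self.master.append((version, value))
        return version

    def test_and_set_write(self, required_version, value):
        """Write only if the current version equals required_version."""
        current = self._current()[0]
        if current != required_version:
            raise VersionMismatch(
                f"current version is v.{current}, not v.{required_version}")
        return self.write(value)


rec = Record("a")                 # v.1
rec.write("b")                    # v.2; the replica is still at v.1
stale = rec.read_any()
latest = rec.read_latest()
critical = rec.read_critical(2)
try:
    rec.test_and_set_write(1, "c")
    tas_failed = False
except VersionMismatch:
    tas_failed = True
new_version = rec.test_and_set_write(2, "c")
```

The failed `test_and_set_write(1, "c")` corresponds to the ERROR on the slide: the caller's required version no longer matches the current one.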

28 Record and Tablet Mastership. Data in PNUTS is replicated across sites. A hidden field in each record stores which copy is the master copy. Updates can be submitted to any copy; they are forwarded to the master and applied in the order received by the master. The record also contains the origin of the last few updates, and mastership can be changed by the current master based on this information. A mastership change is simply a record update. Tablet mastership: required to ensure primary key consistency; can be different from record mastership.
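
To make the mastership idea concrete, here is a small Python sketch of a record that tracks the origin of recent updates and moves mastership when they all come from one other region. The window length `HISTORY` and the "all recent updates from the same non-master region" policy are assumptions picked for the example; the slide only says that the decision is based on the origin of the last few updates.

```python
from collections import deque

HISTORY = 3  # hypothetical length of the "last few updates" window


class Record:
    def __init__(self, master_region):
        self.master = master_region            # hidden field: which copy is master
        self.origins = deque(maxlen=HISTORY)   # origin regions of recent updates
        self.log = []                          # updates in the order the master applied them

    def update(self, origin_region, value):
        """Accept an update submitted at any region; the master applies it in arrival order."""
        self.log.append(value)
        self.origins.append(origin_region)
        # Assumed policy: hand mastership over once every recent update
        # came from the same region and that region is not the master.
        if (len(self.origins) == HISTORY
                and len(set(self.origins)) == 1
                and origin_region != self.master):
            self.master = origin_region  # in PNUTS this change is itself a record update
            self.origins.clear()


rec = Record("west")
rec.update("east", 1)
rec.update("east", 2)
master_before = rec.master   # two east updates are not enough yet
rec.update("east", 3)
master_after = rec.master    # third consecutive east update moves mastership
```

Moving the master toward the region that issues most writes means later writes from that region no longer need forwarding to another region.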

29 Other Features. Per-record transactions. Copying a tablet (e.g., on failure): request copy; publish checkpoint message; get copy of tablet as of when checkpoint is received; apply later updates. Tablet split: has to be coordinated across all copies.

30 Query Processing. A range scan can span tablets, but only one tablet is scanned at a time. The client may not need all results at once, so a continuation object is returned to the client to indicate where the range scan should continue. Notification: one pub-sub topic per tablet. The client knows about tables but does not know about tablets; it is automatically subscribed to all tablets, even as tablets are added or removed. Usual problem with pub-sub: undelivered notifications, handled in the usual way.
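
The continuation-object pattern for range scans can be sketched as follows. The tablet layout, the shape of the continuation (a tablet index plus the resume key), and the function names are hypothetical; the slide states only that one tablet is scanned at a time and that a continuation object tells the client where to resume.

```python
from bisect import bisect_left

# Hypothetical ordered table split into non-empty tablets of sorted keys.
TABLETS = [
    ["apple", "banana"],
    ["grape", "kiwi", "lemon"],
    ["mango", "orange"],
]


def scan(low, high, continuation=None):
    """Scan one tablet of the range [low, high) and return (keys, continuation).

    The returned continuation is None when the scan is finished; otherwise it
    is passed back unchanged on the next call to resume at the next tablet.
    """
    index, start = continuation if continuation else (0, low)
    while index < len(TABLETS):
        tablet = TABLETS[index]
        keys = [k for k in tablet[bisect_left(tablet, start):] if k < high]
        index += 1
        more = index < len(TABLETS) and TABLETS[index][0] < high
        if keys or not more:
            return keys, ((index, start) if more else None)
    return [], None


def scan_all(low, high):
    """Client-side loop: keep calling scan until no continuation comes back."""
    batches, token = [], None
    while True:
        keys, token = scan(low, high, token)
        batches.append(keys)
        if token is None:
            return batches


batches = scan_all("banana", "mango")
```

Each element of `batches` is the result of scanning a single tablet, so a client that needs only the first few results can stop after the first call and discard the continuation.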

31 Experiments

32 Experimental setup. Production PNUTS code, enhanced with ordered table type. Three PNUTS regions: 2 west coast, 1 east coast. 5 storage units, 2 message brokers, 1 router. West: dual 2.8 GHz Xeon, 4GB RAM, 6 disk RAID 5 array. East: quad 2.13 GHz Xeon, 4GB RAM, 1 SATA disk. Workload: requests/second; 0-50% writes; 80% locality.

33 Inserts required 75.6 ms per insert in West 1 (tablet master), […] ms per insert into the non-master West 2, and […] ms per insert into the non-master East.

34 10% writes by default

35 Scalability

36 Request skew

37 Size of range scans

38 Related work. Distributed and parallel databases: especially query processing and transactions. BigTable, Dynamo, S3, SimpleDB, SQL Server Data Services, Cassandra. Distributed filesystems: Ceph, Boxwood, Sinfonia. Distributed (P2P) hash tables: Chord, Pastry, … Database replication: master-slave, epidemic/gossip, synchronous…

39 Conclusions and ongoing work. PNUTS is an interesting research product. Research: consistency, performance, fault tolerance, rich functionality. Product: make it work, keep it (relatively) simple, learn from experience and real applications. Ongoing work: indexes and materialized views; bundled updates; batch query processing.

40 Thanks! research.yahoo.com