1
PNUTS: Yahoo!’s Hosted Data Serving Platform
Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver and Ramana Yerneni
Yahoo! Research
With some additions by S. Sudarshan
2
How do I build a cool new web app?
Option 1:
- Code it up! Make it live! Scale it later
- It gets posted to Slashdot
- Scale it now!
- Flickr, Twitter, MySpace, Facebook, …
3
How do I build a cool new web app?
Option 2:
- Make it industrial strength!
- Evaluate scalable database backends
- Evaluate scalable indexing systems
- Evaluate scalable caching systems
- Architect data partitioning schemes
- Architect data replication schemes
- Architect monitoring and reporting infrastructure
- Write application
- Go live
- Realize it doesn’t scale as well as you hoped
- Rearchitect around bottlenecks
- 1 year later – ready to go!
4
Example: social network updates
[Figure: users Brian, Sonja, Jimi, Brandon, and Kurt. Brian asks "What are my friends up to?"; updates are shown for Sonja and Brandon.]
5
Example: social network updates
6   Jimi     <ph..
8   Mary     <re..
12  Sonja    <ph..
15  Brandon  <po..
16  Mike     <ph..
17  Bob      <re..
[Photo: Flower, www.flickr.com]
6
What do we need from our DBMS?
Web applications need:
- Scalability
  - And the ability to scale linearly
- Geographic scope
- High availability
Web applications typically have:
- Simplified query needs
  - No joins, aggregations
- Relaxed consistency needs
  - Applications can tolerate stale or reordered data
7
What is PNUTS?
8
CREATE TABLE Parts ( ID VARCHAR, StockNumber INT, Status VARCHAR … )

ID  StockNumber  Status
A   42342        E
B   42521        W
C   66354        W
D   12352        E
E   75656        C
F   15677        E

[Figure: the same Parts rows shown in several replicas]
- Parallel database
- Geographic replication
- Indexes and views
- Structured, flexible schema
- Hosted, managed infrastructure
9
Query model
- Per-record operations
  - Get
  - Set
  - Delete
- Multi-record operations
  - Multiget
  - Scan
  - Getrange
- Web service (RESTful) API
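To make the operation list above concrete, here is a minimal in-memory sketch of the six calls. It is not the PNUTS API: the class name `ToyTable`, the method signatures, and the half-open interpretation of `getrange` are all assumptions made for illustration.

```python
from bisect import bisect_left, insort


class ToyTable:
    """In-memory stand-in for one table (hypothetical, not the real PNUTS API)."""

    def __init__(self):
        self._rows = {}   # primary key -> record
        self._keys = []   # keys kept sorted so range access stays ordered

    # Per-record operations
    def get(self, key):
        return self._rows.get(key)

    def set(self, key, record):
        if key not in self._rows:
            insort(self._keys, key)
        self._rows[key] = record

    def delete(self, key):
        if key in self._rows:
            del self._rows[key]
            self._keys.pop(bisect_left(self._keys, key))

    # Multi-record operations
    def multiget(self, keys):
        return {k: self._rows[k] for k in keys if k in self._rows}

    def scan(self):
        return [(k, self._rows[k]) for k in self._keys]

    def getrange(self, low, high):
        """Records with low <= key < high (half-open range is an assumption)."""
        start = bisect_left(self._keys, low)
        end = bisect_left(self._keys, high)
        return [(k, self._rows[k]) for k in self._keys[start:end]]
```

Keeping a sorted key list next to the dictionary is one simple way to support both point lookups and ordered range reads in a toy model.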
10
Detailed architecture
[Figure: data-path components. Labels: Clients, REST API, Routers, Tablet controller, Storage units, Message Broker]
11
Detailed architecture
[Figure labels: Clients, REST API, Routers, Tablet controller, Storage units, Local region, YMB, Remote regions]
12
Tablet splitting and balancing
- Each storage unit has many tablets (horizontal partitions of the table)
- Tablets may grow over time
  - Overfull tablets split
- Storage unit may become a hotspot
  - Shed load by moving tablets to other servers
[Figure labels: Storage unit, Tablet]
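The two mechanisms on this slide can be sketched roughly as follows. The split point (the median key), the load metric (record count), and both function names are assumptions for illustration; the slides do not say how PNUTS picks them.

```python
def split_tablet(tablet, max_records):
    """Split a key-sorted list of (key, record) pairs in half if it is overfull."""
    if len(tablet) <= max_records:
        return [tablet]
    mid = len(tablet) // 2
    return [tablet[:mid], tablet[mid:]]


def shed_load(units):
    """Move one tablet from the most loaded storage unit to the least loaded.

    `units` maps a storage-unit name to its list of tablets. Load is
    approximated by total record count (an assumption of this sketch).
    """
    def load(name):
        return sum(len(t) for t in units[name])

    hot = max(units, key=load)
    cold = min(units, key=load)
    if hot != cold and len(units[hot]) > 1:
        tablet = min(units[hot], key=len)   # move the smallest tablet
        units[hot].remove(tablet)
        units[cold].append(tablet)
    return units
```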
13
Query processing
14
Accessing data
[Figure: numbered message flow to a storage unit (SU). Steps 1 and 2: Get key k. Steps 3 and 4: Record for key k.]
15
Bulk read
[Figure: step 1 sends {k1, k2, … kn} to the scatter/gather server; step 2 issues Get k1, Get k2, Get k3 to the storage units (SU).]
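A scatter/gather read of this kind might look like the sketch below: keys are grouped by owning storage unit, each group is fetched in parallel, and the partial results are merged. The `route` callback and the dictionaries standing in for storage units are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def multiget(keys, route, storage_units):
    """Scatter keys to the storage units that own them, then gather the results.

    `route` maps a key to a storage-unit name, and `storage_units` maps a
    name to a dict acting as that unit's local store (both hypothetical).
    """
    by_unit = defaultdict(list)
    for k in keys:
        by_unit[route(k)].append(k)

    def fetch(item):
        unit, unit_keys = item
        store = storage_units[unit]
        return {k: store[k] for k in unit_keys if k in store}

    results = {}
    with ThreadPoolExecutor() as pool:
        # One request per storage unit, issued in parallel.
        for partial in pool.map(fetch, by_unit.items()):
            results.update(partial)
    return results
```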
16
Range queries
[Figure: a router in front of Storage unit 1, Storage unit 2, and Storage unit 3]
Keys: Apple, Avocado, Banana, Blueberry, Canteloupe, Grape, Kiwi, Lemon, Lime, Mango, Orange, Strawberry, Tomato, Watermelon
Query: Grapefruit…Pear? is split into Grapefruit…Lime? and Lime…Pear?
Router interval map:
MIN-Canteloupe         SU1
Canteloupe-Lime        SU3
Lime-Strawberry        SU2
Strawberry-MAX         SU1
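The interval map on this slide lends itself to a binary search. The sketch below uses the slide's boundaries and owners; treating each lower bound as inclusive, using the empty string for MIN, and the function names are assumptions.

```python
from bisect import bisect_right

# Interval map from the slide: tablet i covers [BOUNDARIES[i], BOUNDARIES[i + 1]).
BOUNDARIES = ["", "Canteloupe", "Lime", "Strawberry"]   # "" stands in for MIN
OWNERS = ["SU1", "SU3", "SU2", "SU1"]


def route_key(key):
    """Return the storage unit whose tablet contains `key`."""
    return OWNERS[bisect_right(BOUNDARIES, key) - 1]


def split_range(low, high):
    """Split [low, high) into per-tablet pieces tagged with their storage unit."""
    pieces = []
    i = bisect_right(BOUNDARIES, low) - 1
    while low < high:
        upper = BOUNDARIES[i + 1] if i + 1 < len(BOUNDARIES) else None
        end = high if upper is None or high <= upper else upper
        pieces.append((low, end, OWNERS[i]))
        low = end
        i += 1
    return pieces
```

With this map, `split_range("Grapefruit", "Pear")` yields one piece for SU3 (Grapefruit to Lime) and one for SU2 (Lime to Pear), matching the split shown on the slide.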
17
Updates
[Figure: eight numbered steps among routers, a storage unit (SU), and message brokers. Labels: Write key k, SUCCESS, Sequence # for key k.]
18
Yahoo! Message Broker (YMB)
- Distributed publish-subscribe service
- Guarantees delivery once a message is published
  - Logging at the site where the message is published, and at other sites when received
- Guarantees that messages published to a particular cluster will be delivered in the same order at all other clusters
- Record updates are published to YMB by the master copy
  - All replicas subscribe to the updates, and get them in the same order for a particular record
19
Asynchronous replication and consistency
20
Asynchronous replication
21
Consistency model
- Goal: make it easier for applications to reason about updates and cope with asynchrony
- What happens to a record with primary key “Brian”?
[Timeline: record inserted, updates, delete; versions v.1 to v.8 in Generation 1]
22
Consistency model
Read
[Timeline: v.1 to v.8, Generation 1; labels: Stale version, Current version]
23
Consistency model
Read up-to-date
[Timeline: v.1 to v.8, Generation 1; labels: Stale version, Current version]
24
Consistency model
Read-critical(required version): Read ≥ v.6
[Timeline: v.1 to v.8, Generation 1; labels: Stale version, Current version]
25
Consistency model
Write
[Timeline: v.1 to v.8, Generation 1; labels: Stale version, Current version]
26
Consistency model
Test-and-set-write(required version): Write if = v.7, otherwise ERROR
[Timeline: v.1 to v.8, Generation 1; labels: Stale version, Current version]
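The calls named on the consistency-model slides (Read, Read up-to-date, Read-critical, Write, Test-and-set-write) can be modelled on a single record with one master copy and one lagging replica. This is only a sketch of the semantics as the slides label them; the class, the method names, and the exception are hypothetical.

```python
class VersionConflict(Exception):
    """Raised when a test-and-set write names a version that is not current."""


class VersionedRecord:
    """One record with a master copy and a possibly stale local replica."""

    def __init__(self, value):
        self.master = (1, value)    # (version, value) at the record master
        self.replica = (1, value)   # local copy; may lag behind the master

    def read(self):
        # Plain read: whatever the local replica has, possibly stale.
        return self.replica

    def read_up_to_date(self):
        # Always served from the master copy.
        return self.master

    def read_critical(self, required_version):
        # The replica is good enough only if it is at least this new.
        if self.replica[0] >= required_version:
            return self.replica
        return self.master

    def write(self, value):
        self.master = (self.master[0] + 1, value)
        return self.master[0]

    def test_and_set_write(self, required_version, value):
        if self.master[0] != required_version:
            raise VersionConflict(f"current version is {self.master[0]}")
        return self.write(value)

    def sync_replica(self):
        # Stand-in for asynchronous replication catching up.
        self.replica = self.master
```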
27
Consistency model
Write if = v.7, otherwise ERROR
[Timeline: v.1 to v.8, Generation 1; labels: Stale version, Current version]
Mechanism: per-record mastership
28
Record and Tablet Mastership
- Data in PNUTS is replicated across sites
- A hidden field in each record stores which copy is the master copy
  - Updates can be submitted to any copy
  - They are forwarded to the master and applied in the order received by the master
- Record also contains the origin of the last few updates
  - Mastership can be changed by the current master, based on this information
  - A mastership change is simply a record update
- Tablet mastership
  - Required to ensure primary key consistency
  - Can be different from record mastership
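The mastership rule described above can be sketched as a small record update routine. The history length of 3, the "all recent updates came from one other region" rule, and the field names are assumptions for this sketch; the slide only says that the record keeps the origin of the last few updates and that the current master may hand mastership over based on them.

```python
from collections import deque

HISTORY = 3   # number of update origins remembered (value assumed for this sketch)


def new_record(value, master):
    """Create a record whose hidden fields track its master and recent origins."""
    return {"value": value, "version": 1, "master": master,
            "origins": deque(maxlen=HISTORY)}


def apply_update(record, value, origin):
    """Apply an update at the master copy; `origin` is the submitting region."""
    record["value"] = value
    record["version"] += 1
    record["origins"].append(origin)
    # If every remembered update came from one other region, move mastership
    # there. The change is just another field written with the record.
    if (len(record["origins"]) == HISTORY
            and len(set(record["origins"])) == 1
            and origin != record["master"]):
        record["master"] = origin
    return record
```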
29
Other Features
- Per-record transactions
- Copying a tablet (on failure, for example)
  - Request copy
  - Publish checkpoint message
  - Get copy of tablet as of when checkpoint is received
  - Apply later updates
- Tablet split
  - Has to be coordinated across all copies
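The tablet-copy steps above can be illustrated with a single ordered message stream that carries both updates and a checkpoint marker. This is a simplified model, not the PNUTS recovery code: the stream format, the `CHECKPOINT` sentinel, and the function name are hypothetical.

```python
CHECKPOINT = object()   # marker published into the update stream


def recover_tablet(stream):
    """Replay an ordered stream of (key, value) updates containing one CHECKPOINT.

    Returns (source, copy): the source tablet after the whole stream, and a
    new copy built from the snapshot taken at the checkpoint plus the
    updates that arrived after it.
    """
    source, snapshot, later = {}, None, []
    for msg in stream:
        if msg is CHECKPOINT:
            snapshot = dict(source)        # tablet as of the checkpoint
            continue
        key, value = msg
        source[key] = value
        if snapshot is not None:
            later.append(msg)              # updates the new copy still needs
    if snapshot is None:
        raise ValueError("no checkpoint in stream")
    copy = dict(snapshot)
    for key, value in later:               # apply later updates in order
        copy[key] = value
    return source, copy
```

Because the checkpoint travels in the same ordered stream as the updates, the new copy can tell exactly which updates the snapshot already includes.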
30
Query Processing
- Range scan can span tablets
  - Only one tablet scanned at a time
  - Client may not need all results at once
  - Continuation object returned to client to indicate where range scan should continue
- Notification
  - One pub-sub topic per tablet
  - Client knows about tables, does not know about tablets
  - Automatically subscribed to all tablets, even as tablets are added/removed
  - Usual problem with pub-sub: undelivered notifications, handled in usual way
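A range scan that visits one tablet per call and hands back a continuation might look like this. The slides do not describe what the continuation object contains; here it is simply the index of the next tablet, and both function names are hypothetical.

```python
def range_scan(tablets, low, high, continuation=None):
    """Scan a single tablet and return (rows, continuation).

    `tablets` is a list of key-sorted (key, value) lists in key order.
    A continuation of None means the scan is finished.
    """
    index = 0 if continuation is None else continuation
    if index >= len(tablets):
        return [], None
    rows = [(k, v) for k, v in tablets[index] if low <= k < high]
    next_index = index + 1
    return rows, (next_index if next_index < len(tablets) else None)


def scan_all(tablets, low, high):
    """Client loop: keep calling range_scan until no continuation is returned."""
    results, token = [], None
    while True:
        rows, token = range_scan(tablets, low, high, token)
        results.extend(rows)
        if token is None:
            return results
```

A client that needs only the first few rows can stop after any call and keep the continuation for later instead of running the loop to the end.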
31
Experiments
32
Experimental setup
- Production PNUTS code
  - Enhanced with ordered table type
- Three PNUTS regions
  - 2 west coast, 1 east coast
  - 5 storage units, 2 message brokers, 1 router
  - West: Dual 2.8 GHz Xeon, 4 GB RAM, 6-disk RAID 5 array
  - East: Quad 2.13 GHz Xeon, 4 GB RAM, 1 SATA disk
- Workload
  - 1200-3600 requests/second
  - 0-50% writes
  - 80% locality
33
Inserts required:
- 75.6 ms per insert in West 1 (tablet master)
- 131.5 ms per insert into the non-master West 2
- 315.5 ms per insert into the non-master East
34
10% writes by default
35
Scalability
36
Request skew
37
Size of range scans
38
Related work
- Distributed and parallel databases
  - Especially query processing and transactions
- BigTable, Dynamo, S3, SimpleDB, SQL Server Data Services, Cassandra
- Distributed filesystems
  - Ceph, Boxwood, Sinfonia
- Distributed (P2P) hash tables
  - Chord, Pastry, …
- Database replication
  - Master-slave, epidemic/gossip, synchronous…
39
Conclusions and ongoing work
- PNUTS is an interesting research product
  - Research: consistency, performance, fault tolerance, rich functionality
  - Product: make it work, keep it (relatively) simple, learn from experience and real applications
- Ongoing work
  - Indexes and materialized views
  - Bundled updates
  - Batch query processing
40
Thanks!
cooperb@yahoo-inc.com
research.yahoo.com