
1 Distributed Systems Tutorial 11 – Yahoo! PNUTS
Written by Alex Libov, based on an OSCON 2011 presentation. Winter semester, 2014-2015

2 Yahoo! PNUTS
 A massively parallel and geographically distributed database system for Yahoo!’s web applications
 Provides data storage organized as hashed or ordered tables
 Low latency for large numbers of concurrent requests, including updates and queries
 Per-record consistency guarantees

3 Consistency
 Serializability of general transactions is inefficient and often unnecessary
 If a user changes an avatar, posts new pictures, or invites several friends to connect, little harm is done if the new avatar is not initially visible to one friend
 Many distributed applications go to the other extreme and provide only eventual consistency, which is too weak and inadequate for web applications
 PNUTS proposes a consistency model that falls between these two extremes

4 System Architecture
 Data is organized into tables of records with attributes
 In addition to typical data types, “blob” is a valid data type, allowing arbitrary structures inside a record
 Data tables are horizontally partitioned into groups of records called tablets
 Tablets are scattered across many servers: each server might have hundreds or thousands of tablets, but each tablet is stored on a single server within a region

5 Distributed Hash Table
[Figure: the hash space is divided into tablets at boundary points such as 0x0000, 0x2AF3, 0x911F]

6 Distributed Hash Table
[Figure: tablets clustered by key range]
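The two partitioning schemes on the slides above can be sketched as an interval mapping from a key to a tablet. This is a hypothetical illustration, not PNUTS code; the boundary values and the 16-bit hash are made up for the example.

```python
import bisect
import hashlib

def hash_of(key: str) -> int:
    """Stable 16-bit hash of the key (illustrative only)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 0x10000

# Hash-partitioned table: tablets cover ranges of the hash space.
hash_boundaries = [0x0000, 0x2AF3, 0x911F]  # lower bounds of 3 tablets

def hash_tablet(key: str) -> int:
    """Index of the tablet whose hash range contains hash_of(key)."""
    return bisect.bisect_right(hash_boundaries, hash_of(key)) - 1

# Ordered table: tablets cover ranges of the key space itself.
key_boundaries = ["", "h", "p"]  # tablet 0: [""-"h"), tablet 1: ["h"-"p"), ...

def ordered_tablet(key: str) -> int:
    """Index of the tablet whose key range contains the key."""
    return bisect.bisect_right(key_boundaries, key) - 1

print(ordered_tablet("alice"))  # falls in the first key range
```

Either way, locating a record is a binary search over a small sorted boundary list, which is why the router can cache the whole map cheaply.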

7 Query Model
 PNUTS supports very simple queries, sacrificing a rich API in favor of response time and overall simplicity
 No joins, group-by, etc.; these are stated as future work
 The system is designed to work well with queries that read and write single records or small groups of records

8 PNUTS – Single Region
 Tablet controller: a single pair of active/standby servers; maintains the map from database.table.key to tablet to storage unit
 Routers: route client requests to the correct storage unit; cache the maps from the tablet controller
 Storage units: store records; service get/set/delete requests
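The single-region request path can be sketched as a router that caches the tablet map and falls back to the tablet controller on a cache miss. The class names here are illustrative, not PNUTS APIs.

```python
class TabletController:
    """An active/standby pair in the real system; one object here."""
    def __init__(self, assignment):
        self.assignment = assignment  # tablet id -> storage unit address

    def lookup(self, tablet_id):
        return self.assignment[tablet_id]

class Router:
    def __init__(self, controller, tablet_of):
        self.controller = controller
        self.tablet_of = tablet_of    # function: key -> tablet id
        self.cache = {}               # cached tablet -> storage unit map

    def route(self, key):
        tablet = self.tablet_of(key)
        if tablet not in self.cache:  # cache miss: ask the controller
            self.cache[tablet] = self.controller.lookup(tablet)
        return self.cache[tablet]

controller = TabletController({0: "su-1", 1: "su-2"})
router = Router(controller, tablet_of=lambda k: 0 if k < "m" else 1)
print(router.route("alice"))  # -> su-1
```

Because the map changes only on tablet moves and splits, the cache stays valid for long stretches, keeping the controller off the hot path.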

9 Tablet Splitting & Balancing
 Each storage unit has many tablets (horizontal partitions of the table)
 Tablets may grow over time; overfull tablets split
 A storage unit may become a hotspot; shed load by moving tablets to other servers
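The split-and-balance mechanics above can be sketched in a few lines. The threshold and the midpoint split policy are made up for illustration; the real system decides these operationally.

```python
MAX_RECORDS = 4  # illustrative split threshold

def split_if_overfull(tablet):
    """Split a sorted list of records at its midpoint when overfull."""
    if len(tablet) <= MAX_RECORDS:
        return [tablet]
    mid = len(tablet) // 2
    return [tablet[:mid], tablet[mid:]]

def shed_load(units):
    """Move one tablet from the most-loaded storage unit to the least-loaded."""
    hot = max(units, key=lambda u: len(units[u]))
    cold = min(units, key=lambda u: len(units[u]))
    if len(units[hot]) - len(units[cold]) > 1:
        units[cold].append(units[hot].pop())
    return units

parts = split_if_overfull(["a", "b", "c", "d", "e", "f"])
print(len(parts))  # 2 tablets after the split
```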

10 Consistency Options
[Figure: spectrum from availability to consistency]
 Eventual consistency: low-latency updates and inserts, done locally
 Record timeline consistency: each record is assigned a “master region”; inserts succeed, but updates could fail during outages
 Primary key constraint + record timeline: each tablet and record is assigned a “master region”; inserts and updates could fail during outages

11 Record Timeline Consistency
 One of the replicas is designated as the master, per record
 All updates to that record are forwarded to the master
 If a replica is receiving the majority of write requests, it becomes the master
 Each update advances the generation of the record
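The per-record mastership scheme above can be sketched as follows: every write serializes through the record's master, which assigns an increasing generation number before the update reaches the replicas. This is a hypothetical sketch (propagation is synchronous here; in PNUTS it is asynchronous via the message broker).

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.store = {}  # key -> (generation, value)

    def apply(self, key, generation, value):
        self.store[key] = (generation, value)

class RecordMaster:
    def __init__(self, replicas):
        self.replicas = replicas
        self.generations = {}  # key -> latest generation

    def write(self, key, value):
        # All updates for a key serialize through its master,
        # so every replica sees the same total order per record.
        generation = self.generations.get(key, 0) + 1
        self.generations[key] = generation
        for r in self.replicas:  # asynchronous in the real system
            r.apply(key, generation, value)
        return generation

replicas = [Replica("west"), Replica("east")]
master = RecordMaster(replicas)
master.write("alice", "Awake")
master.write("alice", "At work")
print(replicas[0].store["alice"])  # -> (2, 'At work')
```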

12 Record Timeline Consistency
Transactions:
 Alice changes status from “Sleeping” to “Awake”
 Alice changes location from “Home” to “Work”
Region 1 (master): (Alice, Home, Sleeping) → (Alice, Home, Awake) → (Alice, Work, Awake)
Region 2 (replica): (Alice, Home, Sleeping) → (Alice, Work, Awake)
No replica should see the record as (Alice, Work, Sleeping)

13 API Calls
 Read-any: returns a possibly stale version of the record. The returned record is always a valid one from the record’s history. This call has lower latency than other read calls with stricter guarantees.
 Read-critical(required version): returns a version of the record that is strictly newer than, or the same as, the required version.
 Read-latest: returns the latest copy of the record that reflects all writes that have succeeded.
 Write: gives the same ACID guarantees as a transaction with a single write operation in it. This call is useful for blind writes, e.g., a user updating his status on his profile.
 Test-and-set-write(required version): performs the requested write to the record if and only if the present version of the record is the same as the required version.

14 Eventual Consistency
 Timeline consistency comes at a price
 Writes not originating in the record’s master region are forwarded to the master and have longer latency
 The mastership of a record can migrate between replicas
 When the master region is down, the record is unavailable for writes
 In eventual consistency mode, on conflict, the latest write per field wins
 Target customers: those that externally guarantee no conflicts, and those that understand and can cope with them
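The "latest write per field wins" rule can be sketched as a field-by-field merge of two replica copies, keeping whichever value carries the newer timestamp. The record layout (field -> (timestamp, value)) is an assumption made for this illustration.

```python
def merge(a, b):
    """Merge two replica copies of a record field by field.

    Each copy maps field -> (timestamp, value); for every field,
    the write with the newer timestamp wins.
    """
    out = dict(a)
    for field, (ts, val) in b.items():
        if field not in out or ts > out[field][0]:
            out[field] = (ts, val)
    return out

west = {"status": (10, "Awake"),    "location": (5, "Home")}
east = {"status": (8, "Sleeping"),  "location": (12, "Work")}
print(merge(west, east))  # status taken from west, location from east
```

Note how the merged record can be a combination neither replica ever held, which is precisely the anomaly timeline consistency rules out.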

15 Yahoo! Message Broker (YMB)
 A topic-based publish/subscribe system
 Data updates are considered “committed” when they have been published to YMB
 At some point after being committed, the update is asynchronously propagated to other regions and applied to their replicas
 YMB guarantees that published messages will be delivered to all topic subscribers even in the presence of single broker machine failures, by logging the message to multiple disks on different servers: two copies are logged initially, and more copies are logged as the message propagates
 A message is not purged from the YMB log until PNUTS has verified that the update is applied to all replicas of the database
 YMB provides partial ordering of published messages: messages published to a particular YMB cluster will be delivered to all subscribers in the order they were published
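The commit-via-publish flow can be sketched as below: a write returns as soon as the broker has logged it, and subscribers (replicas) receive it later, in publish order. This is a single-process toy, not YMB; durability and multi-broker details are elided.

```python
from collections import deque

class Broker:
    def __init__(self, subscribers):
        self.log = deque()            # stands in for YMB's on-disk log
        self.subscribers = subscribers

    def publish(self, update):
        self.log.append(update)       # durable in YMB (multiple disk copies)
        return "committed"            # the writer may return success now

    def deliver(self):
        # Asynchronous propagation, in publish order per broker cluster.
        while self.log:
            update = self.log.popleft()  # purged once applied everywhere
            for sub in self.subscribers:
                sub.append(update)

west, east = [], []
broker = Broker([west, east])
print(broker.publish(("alice", "Awake")))  # -> committed
broker.deliver()
print(east)  # -> [('alice', 'Awake')]
```

The key property carried over from the slide: the update is durable (and the client unblocked) before any replica has applied it.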

16 Recovery
 Recovering from a failure involves copying lost tablets from another replica
 A three-step process:
 1. The tablet controller requests a copy from a particular remote replica (the “source tablet”)
 2. A “checkpoint message” is published to YMB, to ensure that any in-flight updates at the time the copy is initiated are applied to the source tablet
 3. The source tablet is copied to the destination region
 To support this recovery protocol, tablet boundaries are kept synchronized across replicas, and tablet splits are conducted by having all regions split a tablet at the same point, coordinated by a two-phase commit between regions
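The three steps above can be sketched in order: request the source copy, flush in-flight updates into it (the checkpoint's job), then ship it to the destination. Every name and data structure here is hypothetical; the real checkpoint is a YMB message, modeled below as draining a list.

```python
def recover_tablet(tablets, source_region, dest_region, tablet_id, in_flight):
    """Recover one tablet in the destination region from a remote replica.

    tablets: (region, tablet id) -> dict of records
    in_flight: updates published before the copy began but not yet applied
    """
    # Step 1: the tablet controller requests a copy from the source region.
    source = tablets[(source_region, tablet_id)]
    # Step 2: the checkpoint forces in-flight updates into the source tablet,
    # so the copy reflects everything committed before recovery started.
    while in_flight:
        key, value = in_flight.pop(0)
        source[key] = value
    # Step 3: the source tablet is copied to the destination region.
    tablets[(dest_region, tablet_id)] = dict(source)
    return tablets[(dest_region, tablet_id)]

tablets = {("west", 7): {"alice": "Awake"}}
copied = recover_tablet(tablets, "west", "east", 7,
                        in_flight=[("bob", "Asleep")])
print(copied)  # includes both the old record and the in-flight update
```

The sketch also shows why synchronized tablet boundaries matter: step 3 is a whole-tablet copy, which only works if both regions agree on what tablet 7 contains.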

17 For more info
http://www.mpi-sws.org/~druschel/courses/ds/papers/cooper-pnuts.pdf

