Dynamo: Amazon’s Highly Available Key-value Store (SOSP’07)
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels

Amazon eCommerce platform

Scale:
- > 10M customers at peak times
- > 10K servers, distributed around the world
- 3M checkout operations per day (peak season 2006)

Problem: reliability at massive scale. The slightest outage has significant financial consequences and impacts customer trust.

Amazon eCommerce platform: requirements

Key requirements:
- Data/service availability is the key issue: an “always writeable” data store.
- Low latency, delivered to (almost) all clients/users.
- Example SLA: 300 ms response time for 99.9% of requests, at a peak load of 500 requests/sec. Why not average/median? Averages and medians hide the tail latency seen by the most intensive (and often most valuable) users; the sketch below illustrates this.

Architectural requirements:
- Incremental scalability
- Symmetry
- Ability to run on a heterogeneous platform
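
To see why the SLA targets the 99.9th percentile rather than the mean, here is a small illustration with synthetic (made-up) latency data: the mean and median look healthy while a slow tail blows through a 300 ms bound.

```python
import random

random.seed(42)

# Synthetic latencies (ms): most requests are fast, but a small fraction
# hit slow paths (overloaded replicas, GC pauses, cache misses, ...).
latencies = [random.gauss(40, 10) for _ in range(9990)] + \
            [random.uniform(300, 900) for _ in range(10)]

latencies.sort()
mean = sum(latencies) / len(latencies)
p50 = latencies[len(latencies) // 2]
p999 = latencies[int(0.999 * len(latencies))]

print(f"mean  = {mean:6.1f} ms")   # looks fine
print(f"p50   = {p50:6.1f} ms")    # looks fine
print(f"p99.9 = {p999:6.1f} ms")   # exposes the tail the SLA is written against
```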

Data Access Model

Data stored as (key, object) pairs. Interface: put(key, object) and get(key). Key identifiers are generated by hashing (the paper applies MD5 to the key).

Objects are opaque to the store.

Application examples: shopping carts, customer preferences, session management, sales rank, product catalog, S3.
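
A minimal sketch of this interface as a single-node, in-memory stand-in (routing, replication, and versioning come later in the deck); the class and method names are illustrative:

```python
import hashlib

class ToyStore:
    """Single-node stand-in for Dynamo's put/get interface."""

    def __init__(self):
        self._data = {}

    @staticmethod
    def _key_id(key: str) -> int:
        # Dynamo derives a 128-bit identifier by applying MD5 to the key.
        return int.from_bytes(hashlib.md5(key.encode()).digest(), "big")

    def put(self, key: str, obj: bytes) -> None:
        # Objects are opaque blobs; the store never interprets them.
        self._data[self._key_id(key)] = obj

    def get(self, key: str) -> bytes | None:
        return self._data.get(self._key_id(key))

store = ToyStore()
store.put("cart:alice", b"opaque-serialized-cart")
print(store.get("cart:alice"))
```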

Not a database!

Databases: ACID properties (atomicity, consistency, isolation, durability). Dynamo relaxes the “C” to increase availability; there is no isolation, and only single-key updates are supported.

Further assumptions:
- Relatively small objects (< 1 MB)
- Operations do not span multiple objects
- Friendly (cooperative) environment: one Dynamo instance per service, with on the order of hundreds of hosts per service

Design intuition

Requirements: high data availability; an always-writeable data store.

Solution idea: use multiple replicas, but avoid the synchronous replica coordination used by solutions that provide strong consistency. The tradeoff is consistency versus availability: weak consistency models improve availability.

The problems this introduces: when to resolve possible conflicts, and who should resolve them.
- When: at read time (this allows an “always writeable” data store).
- Who: the application or the data store.

Key technical problems

- Partitioning the key/data space
- High availability for writes
- Handling temporary failures
- Recovering from permanent failures
- Membership and failure detection

Problem / Technique / Advantage:
- Partitioning: consistent hashing. Advantage: incremental scalability.
- High availability for writes: eventual consistency; vector clocks with reconciliation during reads. Advantage: version size is decoupled from update rates.
- Handling temporary failures: “sloppy” quorum protocol and hinted handoff. Advantage: provides high availability and durability guarantees when some of the replicas are not available.
- Recovering from permanent failures: anti-entropy using Merkle trees. Advantage: synchronizes divergent replicas in the background.
- Membership and failure detection: gossip-based membership protocol. Advantage: preserves symmetry and avoids a centralized registry for storing membership and node-liveness information.

Partition Algorithm: Consistent Hashing

Each data item is replicated at N hosts. Replication uses a “preference list”: the list of nodes responsible for storing a particular key.
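
A minimal sketch of a consistent-hash ring with a preference list, under simplifying assumptions (no virtual nodes; the list is simply the first N distinct nodes clockwise from the key's position):

```python
import bisect
import hashlib

def ring_pos(s: str) -> int:
    """Position on the ring: MD5 truncated to 32 bits for readability."""
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big")

class Ring:
    def __init__(self, nodes, n=3):
        self.n = n                                   # replication factor
        self.points = sorted((ring_pos(name), name) for name in nodes)

    def preference_list(self, key: str):
        """First n distinct nodes encountered walking clockwise from hash(key)."""
        start = bisect.bisect(self.points, (ring_pos(key), ""))
        picked = []
        for i in range(len(self.points)):
            _, name = self.points[(start + i) % len(self.points)]
            if name not in picked:
                picked.append(name)
            if len(picked) == self.n:
                break
        return picked

ring = Ring(["node-A", "node-B", "node-C", "node-D", "node-E"])
print(ring.preference_list("cart:alice"))  # coordinator plus its 2 successors
```

Adding or removing one node only remaps the keys in its arc of the ring, which is what gives incremental scalability.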

Quorum systems

Multiple replicas provide durability, but synchronous replica coordination is avoided.

Traditional quorum system: R (respectively W) is the minimum number of nodes that must participate in a successful read (write) operation. Choosing R + W > N yields a quorum-like system: every read set overlaps every write set, as checked in the sketch below.

Problem: the latency of a get (or put) operation is dictated by the slowest of the R (or W) replicas, so to provide better latency R and W are usually configured to be less than N.

Dynamo uses a “sloppy quorum” (see hinted handoff below).
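
A small check of the quorum intersection argument for a common Dynamo-style configuration (N = 3, R = 2, W = 2, so R + W > N): any R replicas a read contacts must include at least one of the W replicas the latest write reached.

```python
from itertools import combinations

N, R, W = 3, 2, 2                  # R + W > N
replicas = range(N)

# Exhaustively verify: every possible read set intersects every write set.
for write_set in combinations(replicas, W):
    for read_set in combinations(replicas, R):
        assert set(write_set) & set(read_set), "quorums failed to overlap"

print(f"R + W = {R + W} > N = {N}: every read sees at least one fresh replica")
```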

Data versioning

Multiple replicas without synchronous replica coordination raise two questions.

When to resolve possible conflicts? At read time, which allows an “always writeable” data store:
- a put() call may return to its caller before the update has been applied at all the replicas;
- a get() call may return several versions of the same object.

Who should resolve them?
- The application: use vector clocks to capture the causal ordering between different versions of the same object.
- The data store: use logical clocks or physical time (e.g., “last write wins”).

Vector Clocks

Each version of each object has one associated vector clock: a list of (node, counter) pairs.

Reconciliation: if every counter in the first object’s clock is less than or equal to the corresponding counter in the second object’s clock, then the first is an ancestor of the second and can be discarded. Otherwise the versions are concurrent and need application-level reconciliation.

Example:
- D1 ([Sx,1]): write handled by Sx
- D2 ([Sx,2]): write handled by Sx (D1 is an ancestor of D2 and is discarded)
- D3 ([Sx,2],[Sy,1]): write handled by Sy
- D4 ([Sx,2],[Sz,1]): write handled by Sz (D3 and D4 are concurrent)
- D5 ([Sx,3],[Sy,1],[Sz,1]): reconciled and written by Sx
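
A self-contained sketch of the ancestor test, replaying exactly this history; clocks are represented as dicts from node name to counter (illustrative code, not Dynamo's):

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock `a` dominates `b`, i.e. `b` is an ancestor of (or equal to) `a`."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def concurrent(a: dict, b: dict) -> bool:
    return not descends(a, b) and not descends(b, a)

D1 = {"Sx": 1}
D2 = {"Sx": 2}                     # write handled by Sx
D3 = {"Sx": 2, "Sy": 1}            # write handled by Sy, based on D2
D4 = {"Sx": 2, "Sz": 1}            # write handled by Sz, also based on D2
D5 = {"Sx": 3, "Sy": 1, "Sz": 1}   # reconciled and written by Sx

assert descends(D2, D1)                        # D1 can be discarded
assert concurrent(D3, D4)                      # neither dominates: reconcile
assert descends(D5, D3) and descends(D5, D4)   # D5 subsumes both branches
print("history matches the slide's example")
```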

Recap: Problem / Technique / Advantage:
- Partitioning: consistent hashing. Advantage: incremental scalability.
- High availability for writes: eventual consistency; vector clocks with reconciliation during reads. Advantage: version size is decoupled from update rates.
- Handling temporary failures: “sloppy” quorum protocol and hinted handoff. Advantage: provides high availability and durability guarantees when some of the replicas are not available.
- Recovering from permanent failures: anti-entropy using Merkle trees. Advantage: synchronizes divergent replicas in the background.
- Membership and failure detection: gossip-based membership protocol. Advantage: preserves symmetry and avoids a centralized registry for storing membership and node-liveness information.

Other techniques

Node synchronization:
- hinted handoff
- Merkle hash trees

“Hinted handoff”

Assume a replication factor of N = 3. When A is temporarily down or unreachable during a write, the replica is sent to D instead. D is hinted that the replica belongs to A, and it delivers the replica back to A once A recovers. Again: “always writeable”.
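
A minimal sketch of this behavior, assuming in-memory nodes and a single standby (the names and structure are illustrative):

```python
class Node:
    def __init__(self, name: str):
        self.name = name
        self.up = True
        self.store = {}    # key -> value
        self.hints = []    # (intended_owner, key, value)

    def write(self, key, value, hint_for=None):
        if hint_for is not None:
            self.hints.append((hint_for, key, value))  # kept in a separate local list
        else:
            self.store[key] = value

def put(key, value, preference_list, standby):
    """Write to each preferred node; divert to the standby with a hint if one is down."""
    for node in preference_list:
        if node.up:
            node.write(key, value)
        else:
            standby.write(key, value, hint_for=node)

def handoff(standby):
    """Deliver hinted replicas back to owners that have recovered."""
    remaining = []
    for owner, key, value in standby.hints:
        if owner.up:
            owner.write(key, value)
        else:
            remaining.append((owner, key, value))
    standby.hints = remaining

A, B, C, D = (Node(n) for n in "ABCD")
A.up = False                       # A is unreachable during the write
put("k", "v", [A, B, C], standby=D)
A.up = True
handoff(D)
assert A.store["k"] == "v"         # the replica reached A after recovery
```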

Why replication is tricky

The claim: Dynamo replicates each data item on N successors; a pair (k, v) is stored by the node closest to k and replicated on the N successors of that node. Why is this hard?

Replication Meets Epidemics

Candidate algorithm (sketched in code below):
- for each (k, v) stored locally, compute SHA(k.v);
- every period, pick a random leaf-set neighbor;
- ask the neighbor for all its hashes;
- for each unrecognized hash, ask for the key and value.

This is an epidemic algorithm: all n members will have all (k, v) pairs within O(log n) periods. But as is, each exchange costs O(C), where C is the number of items stored at the original node.
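
A minimal sketch of one round of this naive exchange between two in-memory stores; the whole-hash-set transfer here is the O(C) cost that the Merkle trees below avoid:

```python
import hashlib

def digest(k, v) -> str:
    return hashlib.sha1(f"{k}.{v}".encode()).hexdigest()

def anti_entropy_round(local: dict, neighbor: dict) -> None:
    """Pull every item the neighbor holds that we don't recognize by hash."""
    known = {digest(k, v) for k, v in local.items()}
    # This step costs O(C): the neighbor ships hashes for everything it stores.
    for k, v in neighbor.items():
        if digest(k, v) not in known:
            local[k] = v

a = {"x": 1, "y": 2}
b = {"y": 2, "z": 3}
anti_entropy_round(a, b)
anti_entropy_round(b, a)
assert a == b == {"x": 1, "y": 2, "z": 3}   # both replicas converge
```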

Merkle Trees

An efficient summarization technique: interior nodes are the secure hashes of their children, e.g. I = SHA(A.B), N = SHA(K.L), etc.

[Figure: a binary Merkle tree over leaf blocks A..H, with interior nodes I..N and root R.]

Merkle Trees

Merkle trees are an efficient summary technique. If the top node is signed and distributed, this signature can later be used to verify any individual block using only O(log n) nodes, where n is the number of leaves. E.g., to verify block C, one needs only R, N, I, C, and D.

[Figure: the same tree, highlighting the verification path for block C.]
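
A minimal sketch of building such a tree and verifying one leaf against the root with an O(log n) sibling path (illustrative, not Dynamo's implementation; assumes a power-of-two number of blocks):

```python
import hashlib

def sha(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_levels(blocks):
    """The tree as a list of levels: hashed leaves first, root last."""
    levels = [[sha(b) for b in blocks]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([sha(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def proof(levels, index):
    """Sibling hashes along the path from leaf `index` up to the root."""
    path = []
    for level in levels[:-1]:
        sibling = index ^ 1
        path.append((level[sibling], sibling < index))  # (hash, sibling-is-left?)
        index //= 2
    return path

def verify(block, path, root) -> bool:
    h = sha(block)
    for sibling, sibling_is_left in path:
        h = sha(sibling + h) if sibling_is_left else sha(h + sibling)
    return h == root

blocks = [bytes([c]) * 4 for c in b"ABCDEFGH"]  # 8 leaf blocks, as in the figure
levels = build_levels(blocks)
root = levels[-1][0]                            # in practice: signed and distributed
# Verifying block C (index 2) needs only its path hashes D, I, N plus the root R.
assert verify(blocks[2], proof(levels, 2), root)
print("block C verified against the root")
```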

Using Merkle Trees as Summaries

Improvement: use a Merkle tree to summarize keys. B gets the tree root from A; if it is the same as B's local root, the replicas agree and we are done. Otherwise, recurse down the tree to find the differences.

[Figure: A's and B's Merkle trees over the hash space [0, 2^160), split into the ranges [0, 2^159) and [2^159, 2^160).]

Using Merkle Trees as Summaries

Improvement: use a Merkle tree to summarize keys. B gets the tree root from A; if it is the same as the local root, done; otherwise, recurse down the tree to find the differences. The new cost is O(d log C), where d is the number of differences and C is the size of the disk.

[Figure: the same pair of trees, with the recursion descending only into subtrees whose hashes differ.]
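
A minimal sketch of the recursive comparison, using the same level-by-level layout as the sketch above (in a real system the two trees live on different nodes and the hashes travel over the network):

```python
import hashlib

def sha(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_levels(blocks):
    levels = [[sha(b) for b in blocks]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([sha(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def diff(tree_a, tree_b, level=None, index=0):
    """Leaf indices that differ; descends only into subtrees whose hashes differ."""
    if level is None:
        level = len(tree_a) - 1            # start the comparison at the root
    if tree_a[level][index] == tree_b[level][index]:
        return []                          # identical subtree: prune it entirely
    if level == 0:
        return [index]                     # reached a divergent leaf
    return (diff(tree_a, tree_b, level - 1, 2 * index) +
            diff(tree_a, tree_b, level - 1, 2 * index + 1))

a = [b"v0", b"v1", b"v2", b"v3", b"v4", b"v5", b"v6", b"v7"]
b = list(a)
b[5] = b"v5-stale"                             # one divergent key
print(diff(build_levels(a), build_levels(b)))  # -> [5], after O(d log C) hash checks
```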

Using Merkle Trees as Summaries

Still too costly: if A is down for an hour and then comes back, the changes it missed will be randomly scattered throughout the tree, because keys are placed by hash.

[Figure: the same pair of trees, with differences spread across both halves of the hash space.]

Using Merkle Trees as Summaries

Still too costly: if A is down for an hour and then comes back, changes will be randomly scattered throughout the tree. Solution: order values by time instead of hash, which localizes recent values to one side of the tree.

[Figure: trees over a 64-bit time space [0, 2^64), split into [0, 2^63) and [2^63, 2^64); recent changes cluster in the newer half.]

Implementation

Java; non-blocking I/O. The local persistence component allows different storage engines to be plugged in:
- Berkeley Database (BDB) Transactional Data Store: objects of tens of kilobytes
- MySQL: larger objects
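
A minimal sketch of what such a pluggable local-persistence interface might look like (the class and method names are hypothetical, not Dynamo's actual API):

```python
from abc import ABC, abstractmethod

class StorageEngine(ABC):
    """Contract the rest of the node is written against; engines are swappable."""

    @abstractmethod
    def put(self, key: bytes, value: bytes) -> None: ...

    @abstractmethod
    def get(self, key: bytes) -> bytes | None: ...

class InMemoryEngine(StorageEngine):
    """Stand-in engine; a deployment would wrap BDB or MySQL behind this API."""

    def __init__(self):
        self._data = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes) -> bytes | None:
        return self._data.get(key)

def smoke_test(engine: StorageEngine) -> None:
    engine.put(b"k", b"v")             # callers never care which engine this is
    assert engine.get(b"k") == b"v"

smoke_test(InMemoryEngine())
```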

Evaluation

Trading latency & durability

Load balance

Versions

Divergent versions are rare in practice (per the paper, measured on the shopping cart service over 24 hours):
- 1 version: 99.94% of requests
- 2 versions: 0.00057%
- 3 versions: 0.00047%
- 4 versions: 0.00009%

Summary: Problem / Technique / Advantage:
- Partitioning: consistent hashing. Advantage: incremental scalability.
- High availability for writes: eventual consistency; vector clocks with reconciliation during reads. Advantage: version size is decoupled from update rates.
- Handling temporary failures: “sloppy” quorum and hinted handoff. Advantage: provides high availability and durability guarantees when some of the replicas are not available.
- Recovering from permanent failures: anti-entropy using Merkle trees. Advantage: synchronizes divergent replicas in the background.
- Membership and failure detection: gossip-based membership protocol and failure detection. Advantage: preserves symmetry and avoids a centralized registry for storing membership and node-liveness information.