CSCI5570 Large Scale Data Processing Systems NoSQL Slide Ack.: modified based on the slides from Peter Vosshall James Cheng CSE, CUHK
Dynamo: Amazon's highly available key-value store Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels SOSP
Why are we reading this paper? Database, eventually consistent, write any replica A real system: used for e.g. shopping cart at Amazon More availability, less consistency than PNUTS Influential design; inspired e.g. Facebook's Cassandra 3
Amazon’s eCommerce Platform Architecture Loosely coupled, service oriented architecture Stateful services own and manage their own state Stringent latency requirements – services must adhere to formal SLAs – measured at the 99.9th percentile Availability is paramount Large scale (and growing) 4
Motivation Amazon.com: one of the largest e-commerce operations in the world Reliability at massive scale: – tens of thousands of servers and network components – highly decentralized, loosely coupled – slightest outage has significant financial consequences and impacts on customer trust 5
Motivation Most services on Amazon only need primary-key access to a data store => key-value store State management (primary factor for scalability and availability) => need a highly available, scalable storage system RDBMS is a poor fit – most features unused – scales up, not out – availability limitations Consistency vs. Availability – High availability is very important – User-perceived consistency is very important – Trade off strong consistency in favor of high availability 6
Key Requirements “Always Writable” – accept writes during failure scenarios – allow write operations without prior context User-perceived consistency Guaranteed performance (99.9th percentile latency) Incremental scalability “Knobs” to tune tradeoffs between cost, consistency, durability and latency No existing production-ready solutions met these requirements 7
What is Dynamo A highly available and scalable distributed data storage system – a key-value store – data partitioned and replicated using consistent hashing (convenient for adding/removing nodes) – consistency facilitated by object versioning – a quorum-like technique and a decentralized replica synchronization protocol to maintain consistency among replicas during updates – always writeable and eventually consistent – a gossip based protocol to detect failure and membership 8
System Assumptions & Requirements Query model – simple reads/writes by a key – no operation spans multiple data items – no need for relation schema ACID properties – weaker consistency – no isolation guarantee and only single key updates Security – run internally, no security requirements 9
System Interface A simple primary-key only interface – get(key) locate the object replicas associated with key return a single object or a list of objects with conflicting versions along with a context – put(key, context, object) determines where the replicas of the object should be placed based on key, and writes the replicas to disk context: system metadata about the object 10
Techniques used in Dynamo
Problem | Technique used | Advantage
Partitioning | Consistent hashing | Incremental scalability
High availability for writes | Vector clocks with reconciliation during reads | Version size is decoupled from update rates
Handling temporary failures | Sloppy quorum and hinted handoff | Provides high availability and durability guarantee when some of the replicas are not available
Recovering from permanent failures | Anti-entropy using Merkle trees | Synchronizes divergent replicas in the background
Membership and failure detection | Gossip-based membership protocol and failure detection | Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information
11
Partitioning For incremental scaling => dynamically partitioning data when storage nodes added/removed By consistent hashing 12
Consistent hashing [Figure: hash ring starting at 0 with nodes A, B, C and key positions h(key1), h(key2)] Each node is assigned a random position on the ring. Each data item is hashed to a position on the ring and stored on the first node found clockwise from that position. 13
Incremental Scaling [Figure: the same ring after node D is added, with key positions h(key1), h(key2)] Problem: random node assignment => non-uniform data and load distribution. Solution: virtual nodes (next slide). 14
Load Balancing [Figure: ring with several points each for nodes A, B, C and D] Virtual nodes: each physical node is assigned to multiple points on the ring 15
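The ring with virtual nodes can be sketched in a few lines of Python. This is a minimal illustration, not Dynamo's implementation: the class name `HashRing`, the MD5-based `ring_position` function, the `node#i` labelling of virtual nodes and the choice of 64 virtual nodes per physical node are all assumptions made for the example.

```python
import bisect
import hashlib


def ring_position(label: str) -> int:
    """Map a string to a ring position (MD5 is an illustrative choice)."""
    return int(hashlib.md5(label.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, vnodes_per_node: int = 64):
        self.vnodes_per_node = vnodes_per_node
        self._ring = []  # sorted list of (position, physical node)

    def add_node(self, node: str) -> None:
        # Each physical node is placed at several points on the ring.
        for i in range(self.vnodes_per_node):
            bisect.insort(self._ring, (ring_position(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def lookup(self, key: str) -> str:
        """Return the owner of the first virtual node clockwise from the key."""
        if not self._ring:
            raise LookupError("ring is empty")
        idx = bisect.bisect_left(self._ring, (ring_position(key), ""))
        return self._ring[idx % len(self._ring)][1]  # wrap around past 0


ring = HashRing()
for name in ("A", "B", "C"):
    ring.add_node(name)

keys = [f"key{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.add_node("D")  # incremental scaling: one node joins
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all of them to D")
```

The point of the usage part is the incremental-scaling property: when D joins, the only keys that change owner are the ones D takes over, and the other nodes never exchange keys among themselves.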
Replication For high availability and durability By replicating data on N storage nodes 16
Replication [Figure: ring starting at 0 with nodes A to F and key position h(key1)] Each key is stored at the first node clockwise from its position and replicated at the next N-1 clockwise successor nodes. With N = 3, node i stores keys in the ranges (i-3, i-2], (i-2, i-1] and (i-1, i]. These N nodes are called the preference list of the key. 17
Load Balancing [Figure: ring with virtual nodes of A, B, C and D and key position h(key2)] The preference list (PL) may contain multiple virtual nodes of the same physical node, e.g., if N=3, the PL of key2 is {C, B, C}. Fix: allow only one virtual node of each physical node when constructing the PL, e.g., the PL of key2 then becomes {C, B, A}. 18
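The preference-list rule can be sketched as a clockwise walk that skips virtual nodes whose physical node is already in the list. The ring positions below are hypothetical, chosen only so that the walk from h(key2) meets C, B, C, A in that order as in the slide's example; the function name and the `distinct` flag are assumptions for illustration.

```python
import bisect

# Hypothetical ring: (position, physical node) pairs sorted by position.
RING = [(10, "A"), (20, "C"), (30, "B"), (40, "C"), (50, "A"), (60, "D")]


def preference_list(ring, key_pos, n, distinct=True):
    """Walk clockwise from key_pos and collect n replica holders.

    With distinct=True, a virtual node is skipped when its physical
    node is already in the list.
    """
    positions = [p for p, _ in ring]
    start = bisect.bisect_left(positions, key_pos)
    result = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]  # wrap around the ring
        if distinct and node in result:
            continue
        result.append(node)
        if len(result) == n:
            break
    return result


KEY2_POS = 15  # assumed position of h(key2)
print(preference_list(RING, KEY2_POS, 3, distinct=False))  # ['C', 'B', 'C']
print(preference_list(RING, KEY2_POS, 3))                  # ['C', 'B', 'A']
```

Without the distinctness rule, two of the three replicas of key2 would sit on the same machine, so a single failure of C would remove two copies at once.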
Tradeoffs efficiency/scalability => partition availability => replication always writeable => allowed to write just one replica always writeable + replicas + partitions = conflicting versions Dynamo solution: eventual consistency by data versioning 19
Eventual Consistency accept writes at any replica allow divergent replicas allow reads to see stale or conflicting data resolve conflicts when failures go away – reader must merge and then write 20
Unhappy Consequences of Eventual Consistency No notion of “latest version” Can read multiple conflicting versions Application must merge and resolve conflicts No atomic operations (e.g. no PNUTS test-and-set-write) 21
Techniques Used Vector clocks – Distributed time “Sloppy quorum” – Hinted handoff Anti-entropy mechanism – Merkle trees 22
Distributed Time The notion of time is well-defined (and measurable) at each single location But the relationship between time at different locations is unclear – Can minimize discrepancies, but never eliminate them Examples: – If two file servers get different update requests to same file, what should be the order of those requests? 23
A Baseball Example Four locations: pitcher’s mound (P), home plate, first base, and third base Ten events: e1: pitcher (P) throws ball toward home e2: ball arrives at home e3: batter (B) hits ball toward pitcher e4: batter runs toward first base e5: runner runs toward home e6: ball arrives at pitcher e7: pitcher throws ball toward first base e8: runner arrives at home e9: ball arrives at first base e10: batter arrives at first base [Figure: baseball diamond with home plate, 1st base, 2nd base and 3rd base, showing the pitcher (P), batter (B) and runner (R)] 24
A Baseball Example Pitcher knows e1 happens before e6, which happens before e7 Home plate umpire knows e2 is before e3, which is before e4, which is before e8, … Relationship between e8 and e9 is unclear 25
Ways to Synchronize Send message from first base to home when ball arrives? – Or both home and first base send messages to a central timekeeper when runner/ball arrives – But: How long does this message take to arrive? 26
Logical Time 27
Global Logical Time 28
Concurrency 29
Back to Baseball e1: pitcher (P) throws ball toward home e2: ball arrives at home e3: batter (B) hits ball toward pitcher e4: batter runs toward first base e5: runner runs toward home e6: ball arrives at pitcher e7: pitcher throws ball toward first base e8: runner arrives at home e9: ball arrives at first base e10: batter arrives at first base 30
Vector Clocks 31
Vector Clock Algorithm 32
Vector clocks on the baseball example
Vector: [p,f,h,t]
Event | Vector | Action
e1 | [1,0,0,0] | pitcher throws ball to home
e2 | [1,0,1,0] | ball arrives at home
e3 | [1,0,2,0] | batter hits ball to pitcher
e4 | [1,0,3,0] | batter runs to first base
e5 | [0,0,0,1] | runner runs to home
e6 | [2,0,2,0] | ball arrives at pitcher
e7 | [3,0,2,0] | pitcher throws ball to 1st base
e8 | [1,0,4,1] | runner arrives at home
e9 | [3,1,2,0] | ball arrives at first base
e10 | [3,2,3,0] | batter arrives at first base
33
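The vectors in this table can be reproduced with the standard vector clock rules: a location increments its own entry on every local event, and on receiving something it first takes the element-wise maximum with the sender's vector. The sketch below assumes that the ball and the players act as the "messages" between locations; the function names are made up for the example.

```python
PROCS = ["p", "f", "h", "t"]  # pitcher, first base, home plate, third base


def local_event(clocks, proc):
    """Local event (or send): increment the location's own entry."""
    clocks[proc][PROCS.index(proc)] += 1
    return list(clocks[proc])  # snapshot that travels with the message


def receive(clocks, proc, msg):
    """Receive: element-wise max with the sender's vector, then increment."""
    clocks[proc] = [max(a, b) for a, b in zip(clocks[proc], msg)]
    return local_event(clocks, proc)


def happened_before(a, b):
    return a != b and all(x <= y for x, y in zip(a, b))


def concurrent(a, b):
    return not happened_before(a, b) and not happened_before(b, a)


clocks = {p: [0, 0, 0, 0] for p in PROCS}
e = {}
e[1] = local_event(clocks, "p")     # pitcher throws ball toward home
e[2] = receive(clocks, "h", e[1])   # ball arrives at home
e[3] = local_event(clocks, "h")     # batter hits ball toward pitcher
e[4] = local_event(clocks, "h")     # batter runs toward first base
e[5] = local_event(clocks, "t")     # runner runs toward home
e[6] = receive(clocks, "p", e[3])   # ball arrives at pitcher
e[7] = local_event(clocks, "p")     # pitcher throws ball toward first base
e[8] = receive(clocks, "h", e[5])   # runner arrives at home
e[9] = receive(clocks, "f", e[7])   # ball arrives at first base
e[10] = receive(clocks, "f", e[4])  # batter arrives at first base

for i in range(1, 11):
    print(f"e{i}: {e[i]}")
print("e8 and e9 concurrent:", concurrent(e[8], e[9]))  # True
```

Comparing e8 = [1,0,4,1] with e9 = [3,1,2,0] shows that neither vector dominates the other, which matches the earlier observation that the relationship between e8 and e9 is unclear.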
Important Points Physical Clocks – Can keep closely synchronized, but never perfect Logical Clocks – Encode causality relationship – Vector clocks provide exact causality information 34
Vector Clocks in Dynamo Consistency Management – Each put() creates new, immutable version – Dynamo tracks version history When vector clocks grow large – Keep recently-updated entries [Figure: version history of an object] Write handled by Sx => D1([Sx, 1]); write handled by Sx => D2([Sx, 2]); writes handled by Sy and by Sz, both based on D2 => D3([Sx, 2],[Sy, 1]) and D4([Sx, 2],[Sz, 1]); reconciled and written by Sx => D5([Sx, 3],[Sy, 1],[Sz, 1]) 35
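The D1 to D5 history can be replayed with vector clocks stored as node-to-counter maps. This is a sketch under assumptions: the helper names `increment`, `descends` and `merge` are hypothetical, and clock truncation is left out.

```python
def increment(clock, node):
    """Return a new clock with the coordinating node's counter bumped."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def descends(a, b):
    """True if version a is equal to or a descendant of version b."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def merge(a, b):
    """Element-wise maximum, used when a client reconciles two versions."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


d1 = increment({}, "Sx")             # D1([Sx, 1])
d2 = increment(d1, "Sx")             # D2([Sx, 2])
d3 = increment(d2, "Sy")             # D3([Sx, 2], [Sy, 1])
d4 = increment(d2, "Sz")             # D4([Sx, 2], [Sz, 1])

# Neither D3 nor D4 descends from the other, so a read returns both.
in_conflict = not descends(d3, d4) and not descends(d4, d3)

# The client merges the two versions and writes back through Sx.
d5 = increment(merge(d3, d4), "Sx")  # D5([Sx, 3], [Sy, 1], [Sz, 1])

print(in_conflict, d5 == {"Sx": 3, "Sy": 1, "Sz": 1})  # True True
```

D2 can be discarded once D3 or D4 exists because both descend from it, whereas D3 and D4 must both be kept until some client reconciles them.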
Execution of get() and put() A consistency protocol similar to quorum Quorum: R + W > N – Consider N healthy nodes – R is the minimum number of nodes that must participate in a successful read operation – W is the minimum number of nodes that must participate in a successful write operation – never wait for all N – because R + W > N, every read set overlaps every write set, so at least one of the R nodes read has seen the latest successful write 36
Main advantage of Dynamo is flexible N, R, W – What do you get by varying them? Configurability
N | R | W | Application
3 | 2 | 2 | Consistent, durable, interactive, user state (typical configuration)
n | 1 | n | High performance read engine
1 | 1 | 1 | Distributed web cache
37
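The overlap property behind R + W > N can be checked by brute force for small N. The sketch below enumerates every possible read set and write set; the function name `always_overlap` is an assumption for the example.

```python
from itertools import combinations


def always_overlap(n, r, w):
    """True if every r-node read set shares a node with every w-node write set."""
    nodes = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(nodes, r)
        for write_set in combinations(nodes, w)
    )


# The typical configuration (N=3, R=2, W=2) always overlaps.
print(always_overlap(3, 2, 2))  # True
# Read-optimised configuration (N=3, R=1, W=3) also overlaps.
print(always_overlap(3, 1, 3))  # True
# N=3, R=1, W=1 does not: a read can miss the only node that was written.
print(always_overlap(3, 1, 1))  # False

# The exhaustive check agrees with the rule R + W > N.
for n in range(1, 6):
    for r in range(1, n + 1):
        for w in range(1, n + 1):
            assert always_overlap(n, r, w) == (r + w > n)
```

Lowering R or W below the overlap threshold trades this guarantee for lower latency, which is the kind of tuning the N, R, W knobs are meant to allow.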
Write by Quorum [Figure: ring with nodes A to F and key position h(key1); put(key1, v1) is written locally as Key1=v1] Generate vector clock for the new version and write locally. Preference list: A, F, B 38
Write by Quorum [Figure: the write Key1=v1 is forwarded to the other replicas] Send new version to top-N reachable nodes in preference list 39
Write by Quorum [Figure: a replica reports success and put(key1, v1) returns success] Write is successful if W-1 other nodes respond (the local write counts toward W); here W=2 40
Read by Quorum [Figure: ring with nodes A to F; get(key1) triggers a local read and forwarded reads of Key1=v1] Request all existing versions from top-N reachable nodes in preference list 41
Read by Quorum [Figure: get(key1) returns v1] Read is successful if R nodes respond with success; here R=2. If there are multiple causally unrelated versions, return all of them 42
Failures -- two levels Temporary failures vs. Permanent failures Node unreachable -- what to do? – if really dead, need to make new copies to maintain fault-tolerance – if really dead, want to avoid repeatedly waiting for it – if just temporary, hugely wasteful to make new copies 43
Temporary failure handling: quorum Goal – do not block waiting for unreachable nodes – get should have high prob of seeing most recent “put”s “Sloppy quorum”: N is not all nodes, but first N reachable nodes in preference list – each node pings to keep rough estimate of up/down 44
Sloppy Quorum [Figure: ring with nodes A to F; put(key1, v2) arrives while the replicas hold Key1=v1 and B is unreachable] Preference list: A, F, B => A, F, D 45
Sloppy Quorum [Figure: local write of Key1=v2; the other replicas still hold Key1=v1] 46
Sloppy Quorum [Figure: the write is forwarded; D stores Key1=v2 with hint: B] Send the write to another node with a hint naming the intended recipient (now temporarily down) 47
Sloppy Quorum [Figure: the replicas report success and the put returns success] Send the write to another node with a hint naming the intended recipient (now temporarily down) 48
Sloppy Quorum [Figure: D still holds Key1=v2 with hint: B] When the failed node recovers, the hinted replica is transferred to it and deleted from the other node 49
Sloppy Quorum [Figure: the hinted replica Key1=v2 is handed from D to B] When the failed node recovers, the hinted replica is transferred to it and deleted from the other node. Preference list: A, F, B 50
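The sloppy quorum walk-through above can be simulated in memory. This is a simplified sketch: versions and vector clocks are ignored, the clockwise order of C and E after D is assumed, and the names `Node`, `put` and `handoff` are hypothetical.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.store = {}   # key -> value for replicas this node owns
        self.hinted = {}  # (intended node, key) -> value held for another node


# Assumed clockwise order of nodes starting from h(key1).
ORDER = ["A", "F", "B", "D", "C", "E"]


def put(nodes, order, key, value, n=3, w=2):
    """Write to the first n reachable nodes; stand-ins keep a hint."""
    intended = order[:n]
    targets = [name for name in order if nodes[name].up][:n]
    missing = [name for name in intended if name not in targets]
    acks = 0
    for name in targets:
        if name in intended:
            nodes[name].store[key] = value
        else:
            # Stand-in node: remember which node the replica belongs to.
            nodes[name].hinted[(missing.pop(0), key)] = value
        acks += 1
    return acks >= w


def handoff(nodes):
    """Deliver hinted replicas whose intended node is reachable again."""
    for node in nodes.values():
        for (target, key), value in list(node.hinted.items()):
            if nodes[target].up:
                nodes[target].store[key] = value
                del node.hinted[(target, key)]


nodes = {name: Node(name) for name in ORDER}
put(nodes, ORDER, "key1", "v1")        # stored on A, F, B
nodes["B"].up = False                  # B becomes temporarily unreachable
ok = put(nodes, ORDER, "key1", "v2")   # goes to A, F and stand-in D
print(ok, nodes["D"].hinted)           # True {('B', 'key1'): 'v2'}
nodes["B"].up = True                   # B recovers
handoff(nodes)
print(nodes["B"].store, nodes["D"].hinted)  # {'key1': 'v2'} {}
```

Keeping hinted replicas in a separate map means D never serves key1 as if it owned it, and the hint can be dropped as soon as B has the data.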
Permanent failure handling: anti-entropy Anti-entropy: comparing all the replicas of each piece of data that exist and updating each replica to the newest version Merkle tree: a hash tree – a leaf is the hash of the value of an individual key – an internal node is the hash of its children 51
Permanent failure handling: anti-entropy Use of Merkle tree to detect inconsistencies between replicas – compare the hash values of the roots of two (sub)trees – if they are equal: all leaves are equal => no synchronization needed – if they differ: compare the children of the two trees, recursively, until reaching the leaves; the keys that need synchronization are those whose leaves have different hash values – no data transfer at internal node level 52
Permanent failure handling: anti-entropy Use of Merkle tree in Dynamo – each node maintains a Merkle tree for each key range on the ring – two nodes hosting a common key range compare whether keys within the key range are up-to-date by exchanging the root of their Merkle trees 53
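The comparison procedure can be sketched with a small binary Merkle tree. The sketch assumes both replicas build their trees over the same sorted list of keys for the shared key range, so the two trees have the same shape; SHA-256 and the names `MerkleNode`, `build` and `diff` are illustrative choices.

```python
import hashlib


def digest(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


class MerkleNode:
    def __init__(self, value_hash, keys, left=None, right=None):
        self.hash = value_hash
        self.keys = keys      # keys covered by this subtree
        self.left = left
        self.right = right


def build(keys, replica):
    """Build a tree over a non-empty sorted key list for one replica."""
    if len(keys) == 1:
        # Leaf: hash of the value stored for this key (None if missing).
        return MerkleNode(digest(repr(replica.get(keys[0])).encode()), keys)
    mid = len(keys) // 2
    left = build(keys[:mid], replica)
    right = build(keys[mid:], replica)
    # Internal node: hash of its children's hashes.
    return MerkleNode(digest(left.hash + right.hash), keys, left, right)


def diff(a, b):
    """Return the keys whose values differ, descending only where hashes differ."""
    if a.hash == b.hash:
        return []             # identical subtree, nothing to synchronize
    if a.left is None:
        return list(a.keys)   # differing leaf
    return diff(a.left, b.left) + diff(a.right, b.right)


keys = [f"k{i}" for i in range(8)]
replica1 = {k: "v1" for k in keys}
replica2 = dict(replica1)
replica2["k5"] = "v2"         # one divergent key

out_of_sync = diff(build(keys, replica1), build(keys, replica2))
print(out_of_sync)            # ['k5']
```

Only hashes along the path to the differing leaf are compared, so two replicas that agree on a key range exchange just the root hash.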
Membership and Failure Detection Ring membership – addition of nodes (for incremental scaling) – removal of nodes (due to failures) One node chosen (probably randomly each time) to write any membership change to persistent store A gossip-based protocol propagates membership changes: every second, each node contacts a peer chosen at random and the two reconcile their membership info 54
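Gossip-based propagation of a membership change can be simulated as follows. This is a toy model under assumptions: each node's view maps a node name to a hypothetical (version, status) pair, reconciliation keeps the entry with the higher version, and one loop iteration stands in for one gossip interval.

```python
import random


def reconcile(view_a, view_b):
    """Merge two membership views, keeping the higher-versioned entry."""
    merged = dict(view_a)
    for node, (version, status) in view_b.items():
        if node not in merged or version > merged[node][0]:
            merged[node] = (version, status)
    return merged


def gossip_round(views, rng):
    """Each node contacts one random peer from its own view; both adopt the merge."""
    for name in list(views):
        peer = rng.choice(sorted(n for n in views[name] if n != name))
        merged = reconcile(views[name], views[peer])
        views[name] = merged
        views[peer] = dict(merged)


def converged(views):
    reference = next(iter(views.values()))
    return all(view == reference for view in views.values())


rng = random.Random(0)
names = ["A", "B", "C", "D", "E", "F"]
views = {n: {m: (1, "member") for m in names} for n in names}

# Node G joins through A: at first only A knows G, and G only knows A.
views["A"]["G"] = (1, "member")
views["G"] = {"G": (1, "member"), "A": (1, "member")}

rounds = 0
while not converged(views) and rounds < 200:
    gossip_round(views, rng)
    rounds += 1

print("converged:", converged(views), "after", rounds, "rounds")
```

No node acts as a central registry here: the change spreads because each exchange copies knowledge both ways, so the number of informed nodes grows quickly from round to round.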
Wrap-up Main ideas: – eventual consistency, consistent hashing, allow conflicting writes, client merges Maybe a good way to get high availability + no blocking on WAN Awkward model for some applications (stale reads, merges) Services that use Dynamo: best seller lists, shopping carts, customer preferences, session management, sales rank, product catalog, etc. No agreement on whether it's good for storage systems – Unclear what's happened to Dynamo at Amazon in the meantime – Almost certainly significant changes (2007->2016) 55