Duke Systems Dynamo Jeff Chase Duke University
[Pic courtesy of Alex Smola]
A quick note on terminology The Dynamo paper casually overloads at least the following terms: node, partition, consistent.
Dynamo: a familiar abstraction? Reliability and scalability of a system depend on how it manages its state. Dynamo is a key-value store. Keys are 128-bit (BigInts). Values are binary objects or “blobs” (untyped). Put/get writes/reads a value in its entirety. Access by “primary key only”. Application services are built above the store. The store has S servers with hash-based data distribution. So how/why is it different from our toy KVStore? Specifically: how is the abstraction different?
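KVStore

    import akka.actor.Actor

    // Put and Get are the message case classes, defined elsewhere.
    class KVStore extends Actor {
      // In-memory map from 128-bit keys to untyped values.
      private val store = new scala.collection.mutable.HashMap[BigInt, Any]

      override def receive = {
        case Put(key, cell) => sender() ! store.put(key, cell)  // replies with the previous value, if any
        case Get(key) => sender() ! store.get(key)              // replies with an Option of the stored value
      }
    }

[Pic courtesy of Alex Smola]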
Dynamo: multi-version KV store Each value has a version stamp, also called context. Get(key) → {[version, value]*} Put(key, version/context, value) → success Why does Put take a stamp for the last version read? Why can Get return multiple values? Under what circumstances can this occur? If there are multiple values, the app is responsible for reconciling them. How?
Dynamo: replication Dynamo is a highly available KV store. Dynamo uses replication to mask failures. Each (key, value) pair is stored on N servers/replicas. How is it different from replication in Chubby? Why is it different?
Dynamo vs. Chubby (replication) Chubby: primary/backup, synchronous, consensus. Dynamo: – Symmetric: no primary! – Asynchronous: some writes wait for W<N/2 replicas to respond. – The other replicas learn of a write eventually. [Figure: path of a client request in Chubby and in Dynamo]
Dynamo vs. Chubby: CAP Where are they on CAP? Chubby: all replicas apply all writes in the same (total) order, read always returns value of last write. If safety cannot be guaranteed, then fail the operation. Dynamo: never block or fail an operation. Do what you must to get it done. If the going gets rough, just muddle through and sort it out later. Dynamo provides eventual consistency, which allows for updates to be propagated to all replicas asynchronously. A put() call may return to its caller before the update has been applied at all the replicas, which can result in scenarios where a subsequent get() operation may return an object that does not have the latest updates. …under certain failure scenarios (e.g., server outages or network partitions), updates may not arrive at all replicas for an extended period of time.
Dr. Werner Vogels is Vice President & Chief Technology Officer at Amazon.com. Prior to joining Amazon, he worked as a researcher at Cornell University.
Vogels on consistency The scenario: A updates a “data object” in a “storage system”. Consistency “has to do with how observers see these updates”. Strong consistency: “After the update completes, any subsequent access will return the updated value.” Eventual consistency: “If no new updates are made to the object, eventually all accesses will return the last updated value.”
Dynamo’s “killer app”: shopping carts For a number of Amazon services, rejecting customer updates could result in a poor customer experience. For instance, the shopping cart service must allow customers to add and remove items from their shopping cart even amidst network and server failures. This requirement forces us to push the complexity of conflict resolution to the reads in order to ensure that writes are never rejected.
Conflicts and resolution In order to provide this kind of guarantee, Dynamo treats the result of each modification as a new and immutable version of the data. It allows for multiple versions of an object to be present in the system at the same time. Most of the time, new versions subsume the previous version(s), and the system itself can determine the authoritative version (syntactic reconciliation). However, version branching may happen, in the presence of failures combined with concurrent updates, resulting in conflicting versions of an object. In these cases, the system cannot reconcile the multiple versions of the same object and the client must perform the reconciliation in order to collapse multiple branches of data evolution back into one (semantic reconciliation). Get(key) → {[version, value]*} Put(key, version/context, value) → success
Vector clocks Dynamo version stamps are a kind of vector clock. We will discuss them later. For now, what we care about are the properties of vector clocks. Given two versions A and B, the system can determine from their version stamps if: – A happened before B: the writer of B had read and seen A before choosing the new value B, so B supersedes A. – B happened before A: etc., so A supersedes B. – A and B were concurrent, i.e., conflicting. The writer of version A had not seen version B when it wrote A. The writer of version B had not seen version A when it wrote B. These versions need to be reconciled/merged… somehow.
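The three cases above can be sketched in a few lines of Scala. This is a minimal illustration, not Dynamo's actual implementation: the `VClock` representation (a map from node name to counter), the object name `VectorClocks`, and the helper names are all assumptions made for this example.

```scala
object VectorClocks {
  // Hypothetical representation: one counter per node that has coordinated a write.
  type VClock = Map[String, Long]

  sealed trait Order
  case object Before extends Order      // a happened before b, so b supersedes a
  case object After extends Order       // b happened before a, so a supersedes b
  case object Equal extends Order       // same version
  case object Concurrent extends Order  // conflicting versions; the app must reconcile

  // a <= b iff every counter in a is <= the matching counter in b (missing counts as 0).
  private def leq(a: VClock, b: VClock): Boolean =
    a.forall { case (node, n) => n <= b.getOrElse(node, 0L) }

  def compare(a: VClock, b: VClock): Order =
    (leq(a, b), leq(b, a)) match {
      case (true, true)   => Equal
      case (true, false)  => Before
      case (false, true)  => After
      case (false, false) => Concurrent
    }

  // Stamp for a new write handled by `node`, given the stamp the writer last read.
  def increment(clock: VClock, node: String): VClock =
    clock.updated(node, clock.getOrElse(node, 0L) + 1)

  // Stamp covering both branches: element-wise maximum of the two clocks.
  def merge(a: VClock, b: VClock): VClock =
    (a.keySet ++ b.keySet)
      .map(k => k -> math.max(a.getOrElse(k, 0L), b.getOrElse(k, 0L)))
      .toMap
}
```

For instance, two writes that both start from the same stamp but are handled by different nodes compare as `Concurrent`; a later write whose stamp is built from the `merge` of both branches compares as `After` each of them.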
Dynamo vector clocks: example
Application-level conflict resolution The next design choice is who performs the process of conflict resolution… since the application is aware of the data schema it can decide on the conflict resolution method that is best suited for its client’s experience. For example, the shopping cart application requires that an “Add to Cart” operation can never be forgotten or rejected. If the most recent state of the cart is unavailable, and a user makes changes to an older version of the cart, that change is still meaningful and should be preserved. But at the same time it shouldn’t supersede the currently unavailable state of the cart, which itself may contain changes that should be preserved. …When…the latest version is not available, the item is added to (or removed from) the older version and the divergent versions are reconciled later. … For example, the application that maintains customer shopping carts can choose to “merge” the conflicting versions and return a single unified shopping cart…Using this reconciliation mechanism, an “add to cart” operation is never lost. However, deleted items can resurface.
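One way to see why adds are never lost but deletes can resurface is to model a cart as a set of items and reconcile by union. This is only a sketch under that assumption: the `Cart` type and `reconcile` function are hypothetical, and the paper does not spell out the cart's actual merge logic.

```scala
object CartMerge {
  // Hypothetical cart representation: just the set of item names.
  type Cart = Set[String]

  // Semantic reconciliation by set union: every item present in any branch survives.
  def reconcile(versions: Seq[Cart]): Cart =
    versions.foldLeft(Set.empty[String])(_ ++ _)
}
```

Suppose the cart is `Set("book", "lamp")`, one branch removes "lamp" and a concurrent branch adds "pen" to the older version. Reconciling the two branches yields a cart containing "book", "lamp", and "pen": the add is preserved, and the deleted lamp is back.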
Finding the key for a value named by a string

KVStore:

    import java.security.MessageDigest

    private def hashForKey(anything: String): BigInt = {
      val md: MessageDigest = MessageDigest.getInstance("MD5")
      val digest: Array[Byte] = md.digest(anything.getBytes)
      BigInt(1, digest)  // signum 1: read the 16 digest bytes as a non-negative 128-bit integer
    }

Dynamo: Dynamo treats both the key and the object supplied by the caller as an opaque array of bytes. It applies a MD5 hash on the key to generate a 128-bit identifier, which is used to determine the storage nodes that are responsible for serving the key.
How to distribute keys across the servers? How KVStore does it:

    private def route(key: BigInt): ActorRef = {
      stores((key % stores.length).toInt)
    }

How should we map a key to N replicas? What if the number of nodes (stores.length) changes? What if nodes disagree on the set of active peers (stores)? What could go wrong?
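To get a feel for the "what if stores.length changes" question, the following sketch counts how many keys change servers under the same modulo rule. The names `ModuloRouting`, `serverFor`, and `fractionMoved` are made up for this illustration.

```scala
object ModuloRouting {
  // Same rule as route(): server index = key mod number of servers.
  def serverFor(key: BigInt, servers: Int): Int = (key % servers).toInt

  // Fraction of the sample keys whose server changes when the server count changes.
  def fractionMoved(keys: Seq[BigInt], before: Int, after: Int): Double =
    keys.count(k => serverFor(k, before) != serverFor(k, after)).toDouble / keys.size
}
```

With keys 0 through 19, growing from 4 to 5 servers leaves only keys 0, 1, 2, and 3 in place; the other 16 of 20 keys move, even though only one server was added.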
Consistent hashing in Dynamo Dynamo’s partitioning scheme relies on consistent hashing to distribute the load across multiple storage hosts. In consistent hashing [10], the output range of a hash function is treated as a fixed circular space or “ring” (i.e. the largest hash value wraps around to the smallest hash value). Each node in the system is assigned a random value within this space which represents its “position” on the ring. Each data item identified by a key is assigned to a node by hashing the data item’s key to yield its position on the ring, and then walking the ring clockwise to find the first node with a position larger than the item’s position. [Figure: ring showing the preference list for a key]
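The ring walk described above can be sketched as follows. This is an illustrative sketch, not Dynamo's code: the object name `Ring` and its functions are hypothetical, and node positions are derived by hashing the node name (an assumption made so the example is repeatable; the paper assigns random positions).

```scala
import java.security.MessageDigest
import scala.collection.immutable.TreeMap

object Ring {
  // MD5 of a string as a non-negative 128-bit position on the ring.
  def hash(s: String): BigInt =
    BigInt(1, MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8")))

  // One position per node, sorted by position.
  def build(nodes: Seq[String]): TreeMap[BigInt, String] =
    TreeMap(nodes.map(n => hash(n) -> n): _*)

  // Nodes in clockwise order, starting at the first position strictly larger
  // than the key's position and wrapping around the end of the hash range.
  def clockwise(ring: TreeMap[BigInt, String], key: String): Seq[String] = {
    val h = hash(key)
    val (upTo, after) = ring.toSeq.partition { case (pos, _) => pos <= h }
    (after ++ upTo).map(_._2)
  }

  // The node responsible for the key.
  def coordinator(ring: TreeMap[BigInt, String], key: String): String =
    clockwise(ring, key).head

  // Preference list: the first n distinct nodes clockwise from the key.
  def preferenceList(ring: TreeMap[BigInt, String], key: String, n: Int): Seq[String] =
    clockwise(ring, key).distinct.take(n)
}
```

Adding a node to the ring only changes the coordinator for keys that fall between the new node's predecessor and the new node itself; every other key keeps its coordinator, which is the contrast with modulo routing.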
Consistent hashing Consistent hashing is a technique to assign data objects (or functions) to servers. Key benefit: adjusts efficiently to churn. – Adjust as servers leave (fail) and join (recover). Used in Internet server clusters and also in distributed hash tables (DHTs) for peer-to-peer services. Developed at MIT for Akamai CDN. Reference: Karger, Lehman, Leighton, Panigrahy, Levine, Lewin. “Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the WWW.” ACM STOC 1997, ≈2000 citations
Consistent Hashing Slides from Bruce Maggs
Hashing Universe U of all possible objects, set B of buckets. object: set of web objects with same serial number. bucket: web server. Hash function h: U → B assigns objects to buckets. E.g., h(x) = ((a x + b) mod P) mod |B|, where P is prime, P > |U|; a, b are chosen uniformly at random from Z_P; x is a serial number.
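Difficulty changing number of buckets Compare f(d) = (d + 1) mod 4 with f(d) = (d + 1) mod 5: changing the number of buckets changes the bucket of most objects. [Figure: objects and buckets under each function]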
Consistent Hashing Idea: Map both objects and buckets to unit circle. Assign object to next bucket on circle in clockwise order. [Figure: objects and buckets on the unit circle, with a new bucket added]
Complication – Different Views Low-level DNS servers act independently and may have different ideas about how many and which servers are alive. [Figure: Akamai low-level DNS servers resolving a212.g.akamai.net select servers within cluster]
Properties of Consistent Hashing Monotonicity: When a bucket is added/removed, the only objects affected are those that are/were mapped to the bucket. Balance: Objects are assigned to buckets “randomly”. -- can be improved by mapping each bucket to multiple places on unit circle Load: Objects are assigned to buckets evenly, even over a set of views. Spread: An object should be mapped to a small number of buckets over a set of (potentially diverging) views.
Consistent hashing in practice I The basic consistent hashing algorithm presents some challenges. First, the random position assignment of each node on the ring leads to non-uniform data and load distribution. Second, the basic algorithm is oblivious to the heterogeneity in the performance of nodes. To address these issues, Dynamo uses a variant of consistent hashing (similar to the one used in [10, 20]): instead of mapping a node to a single point in the circle, each node gets assigned to multiple points in the ring. To this end, Dynamo uses the concept of “virtual nodes”. A virtual node looks like a single node in the system, but each node can be responsible for more than one virtual node. Effectively, when a new node is added to the system, it is assigned multiple positions (henceforth, “tokens”) in the ring. Read and write operations involve the first N healthy [physical] nodes in the preference list, skipping over those that are down or inaccessible. When all nodes are healthy, the top N nodes in a key’s preference list are accessed. When there are node failures or network partitions, nodes that are lower ranked in the preference list are accessed.
Balanced elasticity with virtual nodes Using virtual nodes has the following advantages: - If a node becomes unavailable (due to failures or routine maintenance), the load handled by this node is evenly dispersed across the remaining available nodes. - When a node becomes available again, or a new node is added to the system, the newly available node accepts a roughly equivalent amount of load from each of the other available nodes. - The number of virtual nodes that a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure.
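A small experiment can make the first advantage concrete. The sketch below gives each physical node several tokens and counts where keys land; the object name `VirtualNodes`, its functions, and the choice to derive token positions by hashing "node#i" are all assumptions for illustration (the paper uses random token values).

```scala
import java.security.MessageDigest
import scala.collection.immutable.TreeMap

object VirtualNodes {
  def hash(s: String): BigInt =
    BigInt(1, MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8")))

  // tokens(node) = number of virtual nodes for that physical node;
  // a node with more capacity can be given more tokens.
  def build(tokens: Map[String, Int]): TreeMap[BigInt, String] = {
    val positions = tokens.toSeq.flatMap { case (node, t) =>
      (0 until t).map(i => hash(s"$node#$i") -> node)
    }
    TreeMap(positions: _*)
  }

  // Owner of a key: the node holding the first token clockwise from the key's position.
  def owner(ring: TreeMap[BigInt, String], key: String): String = {
    val it = ring.iteratorFrom(hash(key) + 1)
    if (it.hasNext) it.next()._2 else ring.head._2  // wrap around the ring
  }

  // Number of sample keys owned by each physical node.
  def load(ring: TreeMap[BigInt, String], keys: Seq[String]): Map[String, Int] =
    keys.groupBy(k => owner(ring, k)).map { case (n, ks) => n -> ks.size }
}
```

Building one ring with nodes A through D and another without C, then calling `load` on the keys that C used to own, shows them spread over the remaining nodes rather than dumped on a single successor. Keys that C did not own stay where they were.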
The need for fixed-size partitions Strategy 1: T random tokens per node and partition by token value While using this strategy, the following problems were encountered. First, when a new node joins the system…the nodes handing the key ranges off to the new node have to scan their local persistence store to retrieve the appropriate set of data items. Note that performing such a scan operation on a production node is tricky as scans are highly resource intensive operations…during busy shopping season, when the nodes are handling millions of requests a day, the bootstrapping has taken almost a day to complete. Second, when a node joins/leaves the system…the Merkle trees for the [many] new ranges need to be recalculated. Finally, there was no easy way to take a snapshot of the entire key space due to the randomness in key ranges…archiving the entire key space requires us to retrieve the keys from each node separately, which is highly inefficient.
Consistent hashing in practice II Strategy 2: T random tokens per node and equal sized partitions. In this strategy, the hash space is divided into Q equally sized partitions/ranges and each node is assigned T random tokens…. A partition is placed on the first N unique nodes that are encountered while walking the consistent hashing ring clockwise from the end of the partition. Figure 7 illustrates this strategy for N=3. In this example, nodes A, B, C are encountered while walking the ring from the end of the partition that contains key k1.
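The placement rule in Strategy 2 can be sketched as below, assuming a 128-bit hash space split into Q equal ranges. The names `FixedPartitions`, `partitionOf`, `partitionEnd`, and `replicas` are hypothetical, and the token map is supplied by the caller rather than generated randomly.

```scala
import scala.collection.immutable.TreeMap

object FixedPartitions {
  // Size of the hash space for a 128-bit hash such as MD5.
  val Space: BigInt = BigInt(1) << 128

  // Index (0 until q) of the equal-sized partition containing hash value h.
  def partitionOf(h: BigInt, q: Int): Int = ((h * q) / Space).toInt

  // End of partition p: the exclusive upper bound of its hash range.
  def partitionEnd(p: Int, q: Int): BigInt = (Space * (p + 1)) / q

  // First n distinct nodes met walking the token ring clockwise
  // from the end of partition p (wrapping around at the top of the range).
  def replicas(p: Int, q: Int, tokens: TreeMap[BigInt, String], n: Int): Seq[String] = {
    val start = partitionEnd(p, q) % Space
    val (before, from) = tokens.toSeq.partition { case (pos, _) => pos < start }
    (from ++ before).map(_._2).distinct.take(n)
  }
}
```

Because partition boundaries depend only on Q, they stay fixed when nodes join or leave; only the mapping from partitions to nodes changes.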
Riak is a commercial open-source KV store inspired by Dynamo. Its partitioning scheme is similar to Dynamo.
Benefits of fixed-size partitions Since partition ranges are fixed, they can be stored in separate files, meaning a partition can be relocated as a unit by simply transferring the file (avoiding random accesses needed to locate specific items).… Archiving the entire dataset stored by Dynamo is simpler … because the partition files can be archived separately. [Reconciliation by Merkle trees is easier.]
Tweaking the partition scheme For Q partitions, choose T = Q/S. Juggle tokens around the nodes as desired to tweak the partitioning scheme on the fly. Keep a token list for each node. Claims are made, but details (vs. Strategy 2) are fuzzy. (?) Strategy 3: Q/S tokens per node, equal-sized partitions: Similar to strategy 2, but each node is assigned T=Q/S tokens where S is the number of nodes in the system. When a node leaves the system, its tokens are randomly distributed to the remaining nodes…. Similarly, when a node joins the system it "steals" tokens from nodes in the system in a way that preserves these properties. …each node needs to maintain the information regarding the partitions assigned to each node.
Load skew vs. metadata size [Figure; axis label: measure of load skew]
Also to discuss Ring membership changes by admin command. Ring membership propagation by gossip and “seeds”. Anti-entropy reconciliation with Merkle trees
Merkle Hash Tree Goal: compute a single hash/signature over a set of objects (or KV pairs). Fast update when the set changes. Also enable proofs that a given object is in the set. And fast “diffing” of sets [Dynamo].
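The "fast diffing" property can be illustrated with a small binary hash tree. This is a sketch under simplifying assumptions: both replicas hold the same number of values in the same slot order, SHA-256 is used as the hash, and the names `Merkle`, `build`, and `diff` are invented for the example.

```scala
import java.security.MessageDigest

object Merkle {
  sealed trait Node { def hash: String }
  case class Leaf(index: Int, hash: String) extends Node
  case class Branch(left: Node, right: Node, hash: String) extends Node

  // Hex-encoded SHA-256 of a string.
  def sha(s: String): String =
    MessageDigest.getInstance("SHA-256").digest(s.getBytes("UTF-8"))
      .map("%02x".format(_)).mkString

  // Build a tree over values(lo until hi); each parent hashes its children's hashes.
  private def buildRange(values: IndexedSeq[String], lo: Int, hi: Int): Node =
    if (hi - lo == 1) Leaf(lo, sha(values(lo)))
    else {
      val mid = (lo + hi) / 2
      val l = buildRange(values, lo, mid)
      val r = buildRange(values, mid, hi)
      Branch(l, r, sha(l.hash + r.hash))
    }

  def build(values: IndexedSeq[String]): Node = {
    require(values.nonEmpty, "need at least one value")
    buildRange(values, 0, values.length)
  }

  // Indices of leaves that differ. Subtrees with equal hashes are skipped
  // without descending, which is what makes the comparison fast.
  def diff(a: Node, b: Node): Seq[Int] = (a, b) match {
    case _ if a.hash == b.hash                  => Nil
    case (Leaf(i, _), Leaf(_, _))               => Seq(i)
    case (Branch(al, ar, _), Branch(bl, br, _)) => diff(al, bl) ++ diff(ar, br)
    case _ => sys.error("trees have different shapes")
  }
}
```

If two replicas agree on everything, comparing the root hashes alone settles it. If one value differs, only the path from the root to that leaf is explored.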
Quorum How to build a replicated store that is atomic (consistent) always, and available unless there is a partition? Read and write operations complete only when a minimum number (a quorum) of replicas ack them. Set the quorum size so that any read set is guaranteed to overlap with any write set. This property is sufficient to ensure that any read “sees” the value of the “latest” write. So it ensures consistency, but it must deny service if “too many” replicas fail or become unreachable.
Quorum consistency [Keith Marzullo] rv+wv > n
Weighted quorum voting [Keith Marzullo] Any write quorum must intersect every other quorum. rv+wv > n
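The overlap condition can be checked by brute force for small configurations. The sketch below covers only the unweighted case (one vote per replica, so rv = R and wv = W); the names `Quorum` and `alwaysOverlap` are made up for this illustration.

```scala
object Quorum {
  // True if every possible read set of size r shares at least one replica
  // with every possible write set of size w, out of n replicas.
  def alwaysOverlap(n: Int, r: Int, w: Int): Boolean = {
    val replicas = (0 until n).toSet
    val reads = replicas.subsets(r).toList
    val writes = replicas.subsets(w).toList
    reads.forall(rs => writes.forall(ws => (rs intersect ws).nonEmpty))
  }
}
```

For N = 3, the setting R = 2, W = 2 always overlaps, while R = 1, W = 1 does not: a read can land on a replica that the write never reached. In general the check succeeds exactly when R + W > N.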
“Sloppy quorum” aka partial quorum What if R+W < N+1? Dynamo allows configurations that set R and W much lower than a full quorum. – E.g., the Dynamo paper describes “buffered writes” that return after writing to the memory of a single node! Good: reads and/or writes don’t have to wait for a full quorum of replicas to respond → lower latency. Good: better availability in failure scenarios. Bad: reads may return stale data → eventual consistency. Bad: replicas may diverge.
Quantifying latency Focus on tail latency: A common approach in the industry for forming a performance oriented SLA is to describe it using average, median and expected variance. At Amazon we have found that these metrics are not good enough… In this paper there are many references to this 99.9th percentile of distributions, which reflects Amazon engineers’ relentless focus on performance from the perspective of the customers’ experience. Many papers report on averages, so these are included where it makes sense for comparison purposes. Nevertheless, Amazon’s engineering and optimization efforts are not focused on averages.
Cumulative Distribution Function (CDF) of response time R Understand how/why the mean (average) response time can be misleading: a few requests have very long response times. [Figure: CDF with markers at the 10% quantile (x1), 50% (median value), and 90% quantile (x2)] 80% of the requests (90-10) have response time R with x1 < R < x2. “Tail” of 10% of requests with response time R > x2. What’s the mean R?
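A tiny numeric example shows how a long tail pulls the mean away from what most requests see. The sample data and the names `Latency`, `percentile`, and `mean` are invented for this sketch; the percentile uses the nearest-rank definition, which is one of several common conventions.

```scala
object Latency {
  // Nearest-rank percentile: the smallest sample such that
  // at least p percent of all samples are less than or equal to it.
  def percentile(samples: Seq[Double], p: Double): Double = {
    require(samples.nonEmpty && p > 0 && p <= 100)
    val sorted = samples.sorted
    val rank = math.ceil(p * sorted.size / 100.0).toInt
    sorted(rank - 1)
  }

  def mean(samples: Seq[Double]): Double = samples.sum / samples.size
}
```

Take 99 requests at 10 ms and one at 1000 ms: `Seq.fill(99)(10.0) :+ 1000.0`. The median is 10 ms, but the mean is 19.9 ms, a value that no request actually experienced and that says nothing about the one request that took a full second.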