Duke Systems: Dynamo. Jeff Chase, Duke University

[Pic courtesy of Alex Smola]

A quick note on terminology The Dynamo paper casually overloads at least the following terms: node, partition, consistent.

Dynamo: a familiar abstraction? Reliability and scalability of a system depend on how it manages its state. Dynamo is a key-value store:
– Keys are 128-bit (BigInts)
– Values are binary objects or “blobs” (untyped)
– Put/get writes/reads value in its entirety
– Access by “primary key only”
– Application services built above the store
– Store has S servers with hash-based data distribution
So how/why is it different from our toy KVStore? Specifically: how is the abstraction different?

KVStore

class KVStore extends Actor {
  private val store = new scala.collection.mutable.HashMap[BigInt, Any]
  override def receive = {
    case Put(key, cell) => sender ! store.put(key, cell)
    case Get(key)       => sender ! store.get(key)
  }
}

[Pic courtesy of Alex Smola]

Dynamo: multi-version KV store Each value has a version stamp, also called context. Why does Put take a stamp for the last version read? Why can Get return multiple values? Under what circumstances can this occur? If there are multiple values, the app is responsible for reconciling them. How?
Get(key) → {[version, value]*}
Put(key, version/context, value) → success
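
For concreteness, here is a minimal sketch of such a multi-version get/put interface in Scala. The names (Version, VersionedValue, MultiVersionStore) are illustrative assumptions, not Dynamo's actual API.

// Sketch of a multi-version KV interface (hypothetical names, not Dynamo's real API).
// A version stamp ("context") accompanies every value; get may return several
// conflicting versions, and put passes back the context of the version(s) it read.
case class Version(stamp: Map[String, Long])              // e.g., a vector clock
case class VersionedValue(version: Version, value: Array[Byte])

trait MultiVersionStore {
  // Returns every version the replicas currently hold for this key (possibly more than one).
  def get(key: BigInt): Seq[VersionedValue]

  // The context identifies the version(s) this write is based on, so the store
  // can tell superseding writes apart from concurrent (conflicting) ones.
  def put(key: BigInt, context: Version, value: Array[Byte]): Boolean
}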

Dynamo: replication Dynamo is a highly available KV store. Dynamo uses replication to mask failures. Each (key, value) pair is stored on N servers/replicas. How is it different from replication in Chubby? Why is it different?

Dynamo vs. Chubby (replication) Chubby: primary/backup, synchronous, consensus. Dynamo:
– Symmetric: no primary!
– Asynchronous: some writes wait for only W < N/2 replicas to respond.
– The other replicas learn of a write eventually.
[Figure: a client request as handled by Chubby vs. Dynamo]

Dynamo vs. Chubby: CAP Where are they on CAP?
Chubby: all replicas apply all writes in the same (total) order, read always returns value of last write. If safety cannot be guaranteed, then fail the operation.
Dynamo: never block or fail an operation. Do what you must to get it done. If the going gets rough, just muddle through and sort it out later.
Dynamo provides eventual consistency, which allows for updates to be propagated to all replicas asynchronously. A put() call may return to its caller before the update has been applied at all the replicas, which can result in scenarios where a subsequent get() operation may return an object that does not have the latest updates. …under certain failure scenarios (e.g., server outages or network partitions), updates may not arrive at all replicas for an extended period of time.

Dr. Werner Vogels is Vice President & Chief Technology Officer at Amazon.com. Prior to joining Amazon, he worked as a researcher at Cornell University.

Vogels on consistency The scenario: A updates a “data object” in a “storage system”. Consistency “has to do with how observers see these updates”. Strong consistency: “After the update completes, any subsequent access will return the updated value.” Eventual consistency: “If no new updates are made to the object, eventually all accesses will return the last updated value.”

Dynamo’s “killer app”: shopping carts For a number of Amazon services, rejecting customer updates could result in a poor customer experience. For instance, the shopping cart service must allow customers to add and remove items from their shopping cart even amidst network and server failures. This requirement forces us to push the complexity of conflict resolution to the reads in order to ensure that writes are never rejected.

Conflicts and resolution In order to provide this kind of guarantee, Dynamo treats the result of each modification as a new and immutable version of the data. It allows for multiple versions of an object to be present in the system at the same time. Most of the time, new versions subsume the previous version(s), and the system itself can determine the authoritative version (syntactic reconciliation). However, version branching may happen, in the presence of failures combined with concurrent updates, resulting in conflicting versions of an object. In these cases, the system cannot reconcile the multiple versions of the same object and the client must perform the reconciliation in order to collapse multiple branches of data evolution back into one (semantic reconciliation).
Get(key) → {[version, value]*}
Put(key, version/context, value) → success

Vector clocks Dynamo version stamps are a kind of vector clock. We will discuss them later. For now, what we care about are the properties of vector clocks. Given two versions A and B, the system can determine from their version stamps if:
– A happened before B: the writer of B had read and seen A before choosing the new value B, so B supersedes A.
– B happened before A: etc., so A supersedes B.
– A and B were concurrent, i.e., conflicting. The writer of version A had not seen version B when it wrote A. The writer of version B had not seen version A when it wrote B. These versions need to be reconciled/merged… somehow.
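
As a concrete illustration, here is a minimal Scala sketch of the comparison the system performs on two version stamps. The representation (a map from node id to update counter) is an assumption for the example, not Dynamo's wire format.

sealed trait Order
case object Before     extends Order  // A happened before B, so B supersedes A
case object After      extends Order  // B happened before A, so A supersedes B
case object Concurrent extends Order  // conflicting versions that must be reconciled

// Compare two version stamps under the vector-clock partial order.
def compare(a: Map[String, Long], b: Map[String, Long]): Order = {
  val nodes = a.keySet ++ b.keySet
  val aLeB  = nodes.forall(n => a.getOrElse(n, 0L) <= b.getOrElse(n, 0L))
  val bLeA  = nodes.forall(n => b.getOrElse(n, 0L) <= a.getOrElse(n, 0L))
  (aLeB, bLeA) match {
    case (true, false) => Before
    case (false, true) => After
    case _             => Concurrent   // identical stamps also land here; callers may treat them as the same version
  }
}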

Dynamo vector clocks: example

Application-level conflict resolution The next design choice is who performs the process of conflict resolution… since the application is aware of the data schema it can decide on the conflict resolution method that is best suited for its client’s experience. For example, the shopping cart application requires that an “Add to Cart” operation can never be forgotten or rejected. If the most recent state of the cart is unavailable, and a user makes changes to an older version of the cart, that change is still meaningful and should be preserved. But at the same time it shouldn’t supersede the currently unavailable state of the cart, which itself may contain changes that should be preserved. …When…the latest version is not available, the item is added to (or removed from) the older version and the divergent versions are reconciled later. … For example, the application that maintains customer shopping carts can choose to “merge” the conflicting versions and return a single unified shopping cart…Using this reconciliation mechanism, an “add to cart” operation is never lost. However, deleted items can resurface.
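
One hedged sketch of how such an application-level merge might look for carts, assuming a cart is just a set of item ids (the real cart service and its schema are not described in the paper):

case class Cart(items: Set[String])

// Semantic reconciliation by union: every "add to cart" on any branch survives the merge.
def reconcile(branches: Seq[Cart]): Cart =
  Cart(branches.flatMap(_.items).toSet)

This also shows why “deleted items can resurface”: a plain union cannot distinguish “never added” from “added and then removed” on a divergent branch.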

Finding the key for a value named by a string
KVStore:

import java.security.MessageDigest

private def hashForKey(anything: String): BigInt = {
  val md: MessageDigest = MessageDigest.getInstance("MD5")
  val digest: Array[Byte] = md.digest(anything.getBytes)
  BigInt(1, digest)
}

Dynamo: Dynamo treats both the key and the object supplied by the caller as an opaque array of bytes. It applies an MD5 hash on the key to generate a 128-bit identifier, which is used to determine the storage nodes that are responsible for serving the key.

How to distribute keys across the servers? How KVStore does it:

private def route(key: BigInt): ActorRef = {
  stores((key % stores.length).toInt)
}

How should we map a key to N replicas? What if the number of nodes (stores.length) changes? What if nodes disagree on the set of active peers (stores)? What could go wrong?
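
To see why naive mod-N routing handles churn badly, here is a small illustrative Scala sketch (the sample keys and constants are arbitrary) that counts how many keys move when one server is added:

// With mod-N placement, adding a single server remaps almost every key,
// forcing a near-total reshuffle of the stored data.
def modRoute(key: BigInt, numStores: Int): Int = (key mod BigInt(numStores)).toInt

val keys  = (1 to 10000).map(i => BigInt(i) * BigInt(2654435761L))   // arbitrary sample keys
val moved = keys.count(k => modRoute(k, 10) != modRoute(k, 11))
println(s"$moved of ${keys.size} keys change servers when going from 10 to 11 stores")
// Roughly 10/11 of the keys move; consistent hashing would move only about 1/11 of them.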

Consistent hashing in Dynamo Dynamo’s partitioning scheme relies on consistent hashing to distribute the load across multiple storage hosts. In consistent hashing [10], the output range of a hash function is treated as a fixed circular space or “ring” (i.e., the largest hash value wraps around to the smallest hash value). Each node in the system is assigned a random value within this space which represents its “position” on the ring. Each data item identified by a key is assigned to a node by hashing the data item’s key to yield its position on the ring, and then walking the ring clockwise to find the first node with a position larger than the item’s position. [Figure: the ring and a key’s preference list]
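
A minimal ring lookup can be sketched with a sorted map; this is illustrative only, not Dynamo's implementation:

import scala.collection.immutable.TreeMap

// Node positions and key hashes share one circular space; a key is served by the
// first node found walking clockwise, i.e., the smallest node position >= the key's hash.
class Ring(positions: TreeMap[BigInt, String]) {
  def lookup(keyHash: BigInt): String = {
    val clockwise = positions.iteratorFrom(keyHash)
    if (clockwise.hasNext) clockwise.next()._2
    else positions.head._2                     // wrap around past the largest position
  }
}

val ring = new Ring(TreeMap(BigInt(10) -> "A", BigInt(40) -> "B", BigInt(90) -> "C"))
// ring.lookup(BigInt(50)) == "C";  ring.lookup(BigInt(95)) wraps around to "A"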

Consistent hashing Consistent hashing is a technique to assign data objects (or functions) to servers. Key benefit: adjusts efficiently to churn.
– Adjusts as servers leave (fail) and join (recover).
Used in Internet server clusters and also in distributed hash tables (DHTs) for peer-to-peer services. Developed at MIT for the Akamai CDN.
“Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the WWW.” Karger, Lehman, Leighton, Panigrahy, Levine, Lewin. ACM STOC 1997. ≈2000 citations.

Consistent Hashing Slides from Bruce Maggs

Hashing Universe U of all possible objects, set B of buckets.
– object: set of web objects with same serial number
– bucket: web server
Hash function h: U → B assigns objects to buckets.
E.g., h(x) = (((a·x + b) mod P) mod |B|), where P is prime, P > |U|, a and b are chosen uniformly at random from Z_P, and x is a serial number.
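
For reference, a hedged Scala rendering of that hash family; the prime P and the constants a, b below are arbitrary example values, not values from the slides.

val P = BigInt(2147483647L)                    // 2^31 - 1, assumed to be > |U|
def makeHash(a: BigInt, b: BigInt, numBuckets: Int): BigInt => Int =
  (x: BigInt) => (((a * x + b) mod P) mod BigInt(numBuckets)).toInt

val h = makeHash(a = BigInt(48611), b = BigInt(90001), numBuckets = 8)
// h(BigInt(serialNumber)) picks the bucket (web server) for that serial number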

Difficulty changing the number of buckets [Figure: the same objects assigned by f(d) = d + 1 mod 4 vs. f(d) = d + 1 mod 5; changing the number of buckets moves most objects.]

Consistent Hashing Idea: map both objects and buckets to the unit circle. Assign each object to the next bucket on the circle in clockwise order. [Figure: objects and buckets on the unit circle; a new bucket takes objects only from the next bucket clockwise.]

Complication – Different Views Low-level DNS servers act independently and may have different ideas about how many and which servers are alive. [Figure: Akamai low-level DNS servers independently selecting servers within a cluster for a212.g.akamai.net]

Properties of Consistent Hashing
– Monotonicity: when a bucket is added/removed, the only objects affected are those that are/were mapped to the bucket.
– Balance: objects are assigned to buckets “randomly” (can be improved by mapping each bucket to multiple places on the unit circle).
– Load: objects are assigned to buckets evenly, even over a set of views.
– Spread: an object should be mapped to a small number of buckets over a set of (potentially diverging) views.

Consistent hashing in practice I The basic consistent hashing algorithm presents some challenges. First, the random position assignment of each node on the ring leads to non-uniform data and load distribution. Second, the basic algorithm is oblivious to the heterogeneity in the performance of nodes. To address these issues, Dynamo uses a variant of consistent hashing (similar to the one used in [10, 20]): instead of mapping a node to a single point in the circle, each node gets assigned to multiple points in the ring. To this end, Dynamo uses the concept of “virtual nodes”. A virtual node looks like a single node in the system, but each node can be responsible for more than one virtual node. Effectively, when a new node is added to the system, it is assigned multiple positions (henceforth, “tokens”) in the ring. Read and write operations involve the first N healthy [physical] nodes in the preference list, skipping over those that are down or inaccessible. When all nodes are healthy, the top N nodes in a key’s preference list are accessed. When there are node failures or network partitions, nodes that are lower ranked in the preference list are accessed.
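
Building on the earlier ring sketch, one hedged way to model virtual nodes and a preference list of N distinct physical nodes (illustrative, not Dynamo's code):

import scala.collection.immutable.TreeMap

// Each physical node owns several tokens ("virtual nodes") on the ring. A key's
// preference list is the first N distinct physical nodes reached by walking
// clockwise from the key's position, skipping tokens owned by nodes already chosen.
def buildRing(tokens: Map[String, Seq[BigInt]]): TreeMap[BigInt, String] =
  TreeMap(tokens.toSeq.flatMap { case (node, ts) => ts.map(t => t -> node) }: _*)

def preferenceList(ring: TreeMap[BigInt, String], keyHash: BigInt, n: Int): Seq[String] = {
  val clockwiseOnce = (ring.iteratorFrom(keyHash) ++ ring.iterator).map(_._2).toSeq
  clockwiseOnce.distinct.take(n)
}

val vnodeRing = buildRing(Map(
  "A" -> Seq(BigInt(5), BigInt(60)),
  "B" -> Seq(BigInt(20), BigInt(80)),
  "C" -> Seq(BigInt(40), BigInt(95))))
// preferenceList(vnodeRing, BigInt(50), n = 3) == Seq("A", "B", "C")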

Balanced elasticity with virtual nodes Using virtual nodes has the following advantages:
– If a node becomes unavailable (due to failures or routine maintenance), the load handled by this node is evenly dispersed across the remaining available nodes.
– When a node becomes available again, or a new node is added to the system, the newly available node accepts a roughly equivalent amount of load from each of the other available nodes.
– The number of virtual nodes that a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure.

The need for fixed-size partitions Strategy 1: T random tokens per node and partition by token value While using this strategy, the following problems were encountered. First, when a new node joins the system…the nodes handing the key ranges off to the new node have to scan their local persistence store to retrieve the appropriate set of data items. Note that performing such a scan operation on a production node is tricky as scans are highly resource intensive operations…during busy shopping season, when the nodes are handling millions of requests a day, the bootstrapping has taken almost a day to complete. Second, when a node joins/leaves the system…the Merkle trees for the [many] new ranges need to be recalculated. Finally, there was no easy way to take a snapshot of the entire key space due to the randomness in key ranges…archiving the entire key space requires us to retrieve the keys from each node separately, which is highly inefficient.

Consistent hashing in practice II Strategy 2: T random tokens per node and equal sized partitions. In this strategy, the hash space is divided into Q equally sized partitions/ranges and each node is assigned T random tokens…. A partition is placed on the first N unique nodes that are encountered while walking the consistent hashing ring clockwise from the end of the partition. Figure 7 illustrates this strategy for N=3. In this example, nodes A, B, C are encountered while walking the ring from the end of the partition that contains key k1.
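
A hedged sketch of Strategy 2's placement rule, reusing the buildRing/preferenceList helpers from the virtual-node sketch above (Q and N are configuration parameters; this is illustrative, not Dynamo's code):

// The hash space is divided into Q equal ranges; each range is placed on the
// first N distinct nodes found walking clockwise from the end of the range.
val hashSpace = BigInt(2).pow(128)             // Dynamo keys hash to 128-bit MD5 values

def partitionOf(keyHash: BigInt, q: Int): Int =
  (keyHash * q / hashSpace).toInt              // which equal-sized range the key falls in

def placement(ring: scala.collection.immutable.TreeMap[BigInt, String], q: Int, n: Int): Map[Int, Seq[String]] =
  (0 until q).map { i =>
    val rangeEnd = (BigInt(i) + 1) * hashSpace / q
    i -> preferenceList(ring, rangeEnd mod hashSpace, n)
  }.toMap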

Riak is a commercial open-source KV store inspired by Dynamo. Its partitioning scheme is similar to Dynamo’s.

Benefits of fixed-size partitions Since partition ranges are fixed, they can be stored in separate files, meaning a partition can be relocated as a unit by simply transferring the file (avoiding random accesses needed to locate specific items).… Archiving the entire dataset stored by Dynamo is simpler … because the partition files can be archived separately. [Reconciliation by Merkle trees is easier.]

Tweaking the partition scheme For Q partitions, choose T = Q/S. Juggle tokens around the nodes as desired to tweak the partitioning scheme on the fly. Keep a token list for each node. Claims are made, but details (vs. Strategy 2) are fuzzy. (?)
Strategy 3: Q/S tokens per node, equal-sized partitions: Similar to strategy 2, but each node is assigned T = Q/S tokens where S is the number of nodes in the system. When a node leaves the system, its tokens are randomly distributed to the remaining nodes….Similarly, when a node joins the system it "steals" tokens from nodes in the system in a way that preserves these properties. …each node needs to maintain the information regarding the partitions assigned to each node.

Load skew vs. metadata size [Figure: a measure of load skew plotted against the per-node membership metadata size for the partitioning strategies.]

Also to discuss
– Ring membership changes by admin command.
– Ring membership propagation by gossip and “seeds”.
– Anti-entropy reconciliation with Merkle trees.

Merkle Hash Tree Goal: compute a single hash/signature over a set of objects (or KV pairs).
– Fast update when the set changes.
– Also enables proofs that a given object is in the set.
– And fast “diffing” of sets [Dynamo].
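
A hedged Scala sketch of such a tree over a (non-empty) sorted list of leaf hashes; this is a generic Merkle tree, not Dynamo's specific implementation:

import java.security.MessageDigest

sealed trait Merkle { def hash: Array[Byte] }
case class Leaf(hash: Array[Byte]) extends Merkle
case class Node(hash: Array[Byte], left: Merkle, right: Merkle) extends Merkle

def md5(bytes: Array[Byte]): Array[Byte] = MessageDigest.getInstance("MD5").digest(bytes)

// Build the tree bottom-up; each interior hash covers its children's key ranges.
def build(leafHashes: Seq[Array[Byte]]): Merkle = leafHashes match {
  case Seq(single) => Leaf(single)
  case hashes =>
    val (l, r)   = hashes.splitAt(hashes.length / 2)
    val (lt, rt) = (build(l), build(r))
    Node(md5(lt.hash ++ rt.hash), lt, rt)
}

// Two replicas compare root hashes; if they match, their key ranges are identical and
// anti-entropy can stop. If not, they recurse into children and exchange only the
// subtrees (ranges) that disagree.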

Quorum How to build a replicated store that is atomic (consistent) always, and available unless there is a partition? Read and write operations complete only when a minimum number (a quorum) of replicas ack them. Set the quorum size so that any read set is guaranteed to overlap with any write set. This property is sufficient to ensure that any read “sees” the value of the “latest” write. So it ensures consistency, but it must deny service if “too many” replicas fail or become unreachable.
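
A tiny sketch of the quorum arithmetic, using the usual N/R/W notation (this only checks the overlap conditions; it is not a protocol implementation):

// R + W > N forces any read set to intersect any write set, so a read sees the
// latest committed write. 2W > N forces any two write sets to intersect, which
// serializes writes against each other.
case class QuorumConfig(n: Int, r: Int, w: Int) {
  def readsSeeLatestWrite: Boolean = r + w > n
  def writesIntersect: Boolean     = 2 * w > n
}

// QuorumConfig(n = 3, r = 2, w = 2)  -> both conditions hold (a strict quorum)
// QuorumConfig(n = 3, r = 1, w = 1)  -> neither holds (a "sloppy"/partial quorum)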

Quorum consistency [Keith Marzullo] rv+wv > n

Weighted quorum voting [Keith Marzullo] Any write quorum must intersect every other quorum. rv+wv > n

“Sloppy quorum” aka partial quorum What if R+W < N+1? Dynamo allows configurations that set R and W much lower than a full quorum.
– E.g., the Dynamo paper describes “buffered writes” that return after writing to the memory of a single node!
Good: reads and/or writes don’t have to wait for a full quorum of replicas to respond → lower latency.
Good: better availability in failure scenarios.
Bad: reads may return stale data → eventual consistency.
Bad: replicas may diverge.

Quantifying latency Focus on tail latency: A common approach in the industry for forming a performance oriented SLA is to describe it using average, median and expected variance. At Amazon we have found that these metrics are not good enough… In this paper there are many references to this 99.9th percentile of distributions, which reflects Amazon engineers’ relentless focus on performance from the perspective of the customers’ experience. Many papers report on averages, so these are included where it makes sense for comparison purposes. Nevertheless, Amazon’s engineering and optimization efforts are not focused on averages.

Cumulative Distribution Function (CDF) of response time R [Figure: CDF with the 10% quantile (x1), the median (50%), and the 90% quantile (x2) marked; 80% of the requests (90 minus 10) have response time R with x1 < R < x2, and a “tail” of 10% of requests has response time R > x2.] Understand how/why the mean (average) response time can be misleading: a few requests have very long response times. What’s the mean R?
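
A small illustrative Scala sketch (synthetic numbers) of why a tail percentile is reported rather than the mean:

// A handful of very slow requests barely move the median but drag the mean upward;
// the 99.9th percentile captures the worst customer experience directly.
def percentile(sortedMs: Vector[Double], p: Double): Double =
  sortedMs(math.min(sortedMs.length - 1, (p / 100.0 * sortedMs.length).toInt))

val samples = (Vector.fill(9990)(10.0) ++ Vector.fill(10)(2000.0)).sorted
val mean    = samples.sum / samples.length
println(f"mean=${mean}%.1f ms  median=${percentile(samples, 50)}%.1f ms  p99.9=${percentile(samples, 99.9)}%.1f ms")
// mean ≈ 12.0 ms, median = 10.0 ms, p99.9 = 2000.0 ms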