VICTORIA UNIVERSITY OF WELLINGTON
Te Whare Wananga o te Upoko o te Ika a Maui

SWEN 432 Advanced Database Design and Implementation
Amazon’s Dynamo
Lecturer: Dr. Pavle Mogin

Advanced Database Design and Implementation 2015 Amazon’s Dynamo

Plan for Amazon’s Dynamo
  Context
  Data Model
  Partitioning and Replication
  Data Versioning
  Executing get() and put()
  Membership changes
  Replica Synchronization and Anti-Entropy Algorithm
    – Readings: Have a look at Readings on the Home Page

Context
  Dynamo is one of the CDBMSs used at Amazon
    – The others are SimpleDB and S3 (Simple Storage Service)
    – Dynamo is used for simple services that access data via the primary key, such as the Shopping Cart application
  At Amazon, Dynamo is used to manage services that:
    – Have very high reliability requirements, and
    – Need tight control over the tradeoffs between availability, consistency, cost-effectiveness, and performance
  Dynamo has been in use since 2006 and has influenced the design of a number of other NoSQL CDBMSs

Design Requirements
  Technical context:
    – The infrastructure is made up of tens of thousands of servers and network components located in many data centres around the world
    – Commodity hardware is used
    – Component failure is a “standard mode of operation”
    – Amazon uses a highly decentralized, loosely coupled, service-oriented architecture consisting of hundreds of services
  Business considerations:
    – A strict internal service level agreement (SLA) has to be met for practically all customers, regardless of the amount of processing their requests need
      – A simple SLA: a response time of 300 ms for 99.9% of requests at a peak client load of 500 requests per second
    – High reliability, since even the slightest outage has significant financial consequences and impacts users’ trust
    – High scalability to support continuous growth

System Design (Data Model and API)
  Data model: key/value
    – Most services at Amazon only need to store and retrieve data by primary key and do not require complex querying or data management functionality
    – The value part is a BLOB
    – Updates are limited to one key/value pair with no references
  Operations:
    – get(key), returning a list of objects and a context
    – put(key, context, value), with no return value
    – The “context” is system metadata containing a version vector
    – The get() operation may return more than one value if there is a conflict between objects with the given key
    – Dynamo treats both key and value as opaque arrays of bytes
    – The key is hashed with the MD5 algorithm to determine the storage node
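
The shape of the two operations can be illustrated with a small sketch. This is not Dynamo's code: `ToyStore`, `key_position`, and the encoding of the context as a dictionary from node name to counter are assumptions made only for illustration, and conflict handling is left out so that get() always returns at most one value.

```python
import hashlib
from typing import Dict, List, Tuple

# Assumed encoding of the context: a version vector mapping node name -> counter.
Context = Dict[str, int]


def key_position(key: bytes) -> int:
    """Hash an opaque key with MD5 and return the digest as a 128-bit integer."""
    return int.from_bytes(hashlib.md5(key).digest(), "big")


class ToyStore:
    """Hypothetical single-node store that mimics the shape of get() and put()."""

    def __init__(self, node_name: str = "n1") -> None:
        self.node_name = node_name
        self._data: Dict[bytes, Tuple[bytes, Context]] = {}

    def get(self, key: bytes) -> Tuple[List[bytes], Context]:
        """Return the list of stored values and the context to pass to put()."""
        if key not in self._data:
            return [], {}
        value, context = self._data[key]
        return [value], dict(context)

    def put(self, key: bytes, context: Context, value: bytes) -> None:
        """Store a new version; the node bumps its own counter in the vector."""
        new_context = dict(context)
        new_context[self.node_name] = new_context.get(self.node_name, 0) + 1
        self._data[key] = (value, new_context)
```

A client is expected to read first and hand the returned context back on the next write, for example `values, ctx = store.get(b"cart:42")` followed by `store.put(b"cart:42", ctx, new_value)`.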

Design (Partitioning and Replication)
  To provide incremental scalability, Dynamo uses consistent hashing to dynamically partition data across the storage hosts currently present
    – Each physical node hosts a number of virtual nodes according to its performance
  Dynamo uses optimistic replication to ensure availability and durability in an environment where machine crashes are a standard mode of operation
    – Each data object is replicated n times
      – A typical value for n at Amazon is 3
    – Each node holds a list of nodes, called the preference list, for each key k to be stored
      – A node from the top of the preference list becomes responsible for storing and replicating an update to the object with the key k
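
A minimal sketch of a consistent hashing ring with virtual nodes may help here. The class `Ring`, the token labels of the form `host#i`, and the rule of walking clockwise until n distinct physical hosts are found are assumptions for illustration, not details given on the slide.

```python
import bisect
import hashlib
from typing import Dict, List


def ring_position(label: str) -> int:
    """Map a label (a key or a virtual-node name) to a point on the ring."""
    return int.from_bytes(hashlib.md5(label.encode()).digest(), "big")


class Ring:
    """Toy consistent hashing ring; more virtual nodes means a larger share."""

    def __init__(self, vnodes_per_host: Dict[str, int]) -> None:
        self._tokens: List[int] = []      # sorted ring positions
        self._owner: Dict[int, str] = {}  # token -> physical host
        for host, count in vnodes_per_host.items():
            for i in range(count):
                token = ring_position(f"{host}#{i}")
                self._owner[token] = host
                bisect.insort(self._tokens, token)

    def preference_list(self, key: str, n: int = 3) -> List[str]:
        """Walk clockwise from the key and collect n distinct physical hosts."""
        start = bisect.bisect_right(self._tokens, ring_position(key))
        hosts: List[str] = []
        for step in range(len(self._tokens)):
            token = self._tokens[(start + step) % len(self._tokens)]
            host = self._owner[token]
            if host not in hosts:
                hosts.append(host)
            if len(hosts) == n:
                break
        return hosts
```

For example, `Ring({"A": 8, "B": 8, "C": 16, "D": 8}).preference_list("cart:42")` returns three distinct hosts, and host C, with twice as many virtual nodes, owns roughly twice as much of the ring. Adding a host only inserts new tokens, so only the key ranges next to those tokens change owner.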

System Design (Data Versioning)
  Dynamo is designed to be an eventually consistent system that is always available for updates:
    – An update operation returns before all replica nodes have received and applied the update
    – An update is accepted from a client even if it is apparent that the client is not aware of the latest version of the object
  To handle multiple versions of an object, Dynamo:
    – Uses a version vector (called a vector clock), and
    – Always creates a new, immutable version of the updated object
  Many object versions are reconciled syntactically by Dynamo itself
    – This is possible whenever the version vectors of two replicas are ordered
  Some reads, however, may return a set of conflicting object versions that have to be reconciled semantically by an application that knows the schema and business logic
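
Syntactic reconciliation can be sketched as follows, assuming a version vector is represented as a dictionary from node name to counter. The names `descends` and `reconcile` and the node names `Sx`, `Sy`, `Sz` are hypothetical. A version is dropped only when another version's vector is greater than or equal to it in every component; whatever remains is a set of concurrent versions that the application has to merge.

```python
from typing import Dict, List, Tuple

Clock = Dict[str, int]  # assumed encoding: node name -> update counter


def descends(a: Clock, b: Clock) -> bool:
    """True if clock a has seen every update recorded in clock b."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())


def reconcile(versions: List[Tuple[str, Clock]]) -> List[Tuple[str, Clock]]:
    """Drop every version whose clock is strictly older than another one."""
    survivors: List[Tuple[str, Clock]] = []
    for value, clock in versions:
        dominated = any(
            descends(other, clock) and other != clock
            for _, other in versions
        )
        if not dominated and (value, clock) not in survivors:
            survivors.append((value, clock))
    return survivors
```

With `{"Sx": 1}` and `{"Sx": 2}` the vectors are ordered, so only the second version survives. With `{"Sx": 2, "Sy": 1}` and `{"Sx": 2, "Sz": 1}` neither vector descends from the other, so both versions are returned to the reader.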

Design (get() and put())
  Dynamo allows any storage node to receive a get() or put() request for any key
    – The node then uses the preference list to forward the request to a healthy, high-priority storage host (the coordinator)
  To provide a consistent view to clients, Dynamo applies a quorum consistency protocol
    – Values of r = 2, w = 2, and n = 3 satisfy Amazon’s SLA, where r and w are the minimum numbers of storage hosts that must take part in a successful read or write, respectively
    – The parameters r, w, and n are configurable by the application
    – Applications needing the highest level of availability may set w = 1
      – Then a write request is rejected only if all nodes in the system are unavailable
    – To achieve a higher level of durability, w should be greater than 1
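
The effect of the parameters can be checked with a few lines of arithmetic. The condition r + w > n is not stated on the slide; it is the usual requirement for a read set and a write set to share at least one replica. The function names below are hypothetical.

```python
from typing import List


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True if every read set of size r must intersect every write set of size w."""
    return r + w > n


def operation_succeeds(replica_up: List[bool], needed: int) -> bool:
    """An operation succeeds when at least `needed` replicas can take part."""
    return sum(replica_up) >= needed
```

With n = 3, r = 2, w = 2 the quorums overlap and a write still succeeds with one replica down. With w = 1 a write succeeds as long as a single replica responds, at the price of weaker durability.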

Design (Membership Changes)
  Dynamo uses a gossip-based network communication protocol to transfer messages between nodes
  Node outages (due to failure or maintenance) are often transient, although they may last for extended intervals
  A node outage rarely signifies a permanent departure and therefore should not result in rebalancing of the partition assignment
  For these reasons, Dynamo uses an explicit mechanism for adding nodes to and removing nodes from a Dynamo consistent hashing ring

Design (Node Addition and Removal) (1)
  An administrator uses a command line tool to connect to a node and issues a command for a node addition or removal
  The node stores the membership change
  The gossip protocol is used to propagate membership changes
    – Each second, a node chooses a random peer with which to exchange information about membership changes
  When a new node joins the consistent hashing ring, a token is chosen for each of its virtual nodes and stored permanently
    – Tokens are spread to other nodes by gossip, together with the membership change information
    – With this information, any node is able to send a request to a node responsible for the key range
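
How a membership change spreads by gossip can be simulated in a few lines. This is a simplified sketch under stated assumptions: one loop iteration stands for one second, every node may pick any other node as its peer, and the two peers simply merge their sets of known changes. The function names and the fixed random seed are hypothetical.

```python
import random
from typing import Dict, List, Set


def gossip_round(views: Dict[str, Set[str]], rng: random.Random) -> None:
    """Each node picks one random peer and the two merge what they know."""
    nodes = list(views)
    for node in nodes:
        peer = rng.choice([p for p in nodes if p != node])
        merged = views[node] | views[peer]
        views[node] = set(merged)
        views[peer] = set(merged)


def rounds_until_converged(nodes: List[str], change: str,
                           seed: int = 0, max_rounds: int = 1000) -> int:
    """Count gossip rounds until every node has learned about `change`."""
    rng = random.Random(seed)
    views: Dict[str, Set[str]] = {node: set() for node in nodes}
    views[nodes[0]].add(change)  # the node contacted by the administrator
    rounds = 0
    while any(change not in view for view in views.values()):
        if rounds >= max_rounds:
            raise RuntimeError("gossip did not converge")
        gossip_round(views, rng)
        rounds += 1
    return rounds
```

Calling `rounds_until_converged([f"n{i}" for i in range(8)], "add n8")` typically returns a small number, since the set of informed nodes can roughly double in each round.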

Design (Node Addition and Removal) (2)
  Adding nodes to the system changes the ownership of key ranges on the ring
  When a node determines that it is no longer responsible for a key range, it transfers the objects in that range to the new node
  When a node is removed, database objects are relocated in the reverse process
  Temporary failure detection is performed during gossiping
    – To avoid failed attempts during get() and put() operations and data transfers, node A considers node B temporarily inaccessible if node B does not respond to node A’s gossip message

Design (Handling of Failures)
  Hinted handoff is a technique used to compensate for not relocating the database objects of temporarily failed nodes
  Dynamo’s quorum is a sloppy one, since the first n healthy nodes from the preference list for the key are used when executing a read or write operation
    – Some of these n nodes may not even be responsible for the key
  Hence, a new object may be written to a node j that is not responsible for the key, instead of to the node i that is the intended recipient of the object’s replica
  The new object is stored along with a hint about the intended recipient node i of the replica
  When node i recovers, node j sends the object to it and then deletes its local copy

Hinted Handoff (Example)
  [Figure: a consistent hashing ring with nodes A to H and replication factor n = 3. The preference list for the key k is C, D, E, F, G, ... Labels in the figure: Coordinator; Temporary down; The object (k, o) stored here; Hinted Handoff; Not responsible for Range BC.]
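
Hinted handoff can be sketched with a toy model. The classes and functions below are hypothetical, and the scenario in the usage note (node D down, node F holding the hinted replica) is chosen only for illustration; it is not taken from the figure.

```python
from typing import Dict, List, Tuple


class Node:
    """Toy storage node with a regular store and a separate list of hints."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.up = True
        self.store: Dict[str, str] = {}
        self.hinted: List[Tuple[str, str, str]] = []  # (intended node, key, value)


def sloppy_write(preference_list: List[Node], key: str, value: str,
                 n: int = 3) -> List[str]:
    """Write to the first n healthy nodes; stand-ins keep a hint instead."""
    intended = preference_list[:n]
    skipped = [node for node in intended if not node.up]
    written: List[str] = []
    for node in preference_list:
        if len(written) == n:
            break
        if not node.up:
            continue
        if node in intended:
            node.store[key] = value
        else:
            # This node stands in for an intended replica that is down.
            target = skipped.pop(0)
            node.hinted.append((target.name, key, value))
        written.append(node.name)
    return written


def deliver_hints(holder: Node, nodes_by_name: Dict[str, Node]) -> None:
    """Send hinted objects to recovered nodes and delete the local copies."""
    remaining: List[Tuple[str, str, str]] = []
    for target_name, key, value in holder.hinted:
        target = nodes_by_name[target_name]
        if target.up:
            target.store[key] = value
        else:
            remaining.append((target_name, key, value))
    holder.hinted = remaining
```

If the preference list is C, D, E, F, G and D is down, `sloppy_write` stores the object on C and E and leaves a hint for D on F. Once D is up again, `deliver_hints` moves the object to D and F forgets it.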

Design (Replica Synchronization)
  Hinted handoff works well if system membership changes are infrequent and failures are transient
  There are scenarios in which hinted replicas become unavailable before they can be returned to the original replica node
  To detect inconsistencies between replicas faster and to minimize the amount of data transferred between nodes, Dynamo uses Merkle trees
  Merkle trees are used to discover differences between the key sets of the same key range held on different nodes

Design (Merkle Trees) (1)
  A Merkle tree is a full binary hash tree whose leaves are hashes of individual keys
  Parent nodes are hashes of their concatenated children
  Let k be the number of keys and h the height of the tree; then:
    – The number of tree leaves is $2^{h-1} \geq k$, and
    – The k-th key has to be replicated $r = 2^{h-1} - k$ times in order to get a full tree
  Example:
    – For k = 5 and h = 4, the number of replicated keys is $r = 2^{3} - 5 = 3$
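
The arithmetic above can be checked with a short sketch. It assumes that h is taken as the smallest height whose leaf level can hold all k keys, and that padding is done by repeating the last key; the function names are hypothetical.

```python
from typing import List, Tuple


def merkle_shape(k: int) -> Tuple[int, int]:
    """Return (h, r): the tree height and how often the last key is repeated."""
    if k < 1:
        raise ValueError("at least one key is required")
    h = 1
    while 2 ** (h - 1) < k:
        h += 1
    return h, 2 ** (h - 1) - k


def padded_keys(keys: List[str]) -> List[str]:
    """Repeat the last key until the number of leaves is a power of two."""
    _, r = merkle_shape(len(keys))
    return keys + [keys[-1]] * r
```

For five keys, `merkle_shape(5)` gives a height of 4 and 3 repetitions, matching the example on the slide.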

Design (Merkle Trees) (2)
  Two Merkle trees have the same values in corresponding nodes if they are produced from the same set of keys by applying the same hash function
  Merkle trees are used to compare the key ranges of two replicas in the following way:
    – If the tree roots are the same, then the key ranges contain the same keys
    – If the tree roots are different, then their subtrees have to be compared
  By applying the rule above recursively, one eventually finds a missing key

Merkle Trees (Example) (1)
  This is an extremely simplified example, far from reality
  The only aim of the example is to give an idea of how Merkle trees might be built and used to synchronize replicas
  Assume:
    – Replica 1 contains the keys 159, 973, 414, 003
    – Replica 2 contains the keys 159, 973, 414
    – We use the hash function h(k) = k mod 7

Merkle Trees (Example) (2)
  [Figure: the Merkle trees built for the two replicas]
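
The example can be worked through in code. The sketch below uses the keys and the hash function h(k) = k mod 7 from the example, but two details are assumptions because the original figure is not available here: a parent is computed by writing the two child hashes next to each other as decimal digits and hashing the resulting number, and the shorter replica is padded by repeating its last key. All function names are hypothetical.

```python
from typing import List


def toy_hash(k: int) -> int:
    """The toy hash function from the example: h(k) = k mod 7."""
    return k % 7


def pad_keys(keys: List[int]) -> List[int]:
    """Repeat the last key until the number of leaves is a power of two."""
    size = 1
    while size < len(keys):
        size *= 2
    return keys + [keys[-1]] * (size - len(keys))


def build_tree(keys: List[int]) -> List[List[int]]:
    """Return the tree as a list of levels: levels[0] = leaves, levels[-1] = root."""
    level = [toy_hash(k) for k in pad_keys(keys)]
    levels = [level]
    while len(level) > 1:
        # Assumed parent rule: concatenate the child hashes, then hash again.
        level = [toy_hash(int(f"{level[i]}{level[i + 1]}"))
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_leaves(a: List[List[int]], b: List[List[int]]) -> List[int]:
    """Descend only into subtrees whose hashes differ; return leaf indexes."""
    if len(a) != len(b):
        raise ValueError("trees must have the same height")
    result: List[int] = []
    stack = [(len(a) - 1, 0)]  # (level, index), starting at the root
    while stack:
        lvl, i = stack.pop()
        if a[lvl][i] == b[lvl][i]:
            continue  # identical subtree, nothing to compare below it
        if lvl == 0:
            result.append(i)
        else:
            stack.append((lvl - 1, 2 * i + 1))
            stack.append((lvl - 1, 2 * i))
    return result


replica1 = [159, 973, 414, 3]
replica2 = [159, 973, 414]
tree1, tree2 = build_tree(replica1), build_tree(replica2)
print(tree1[-1], tree2[-1])             # the two roots differ
print(differing_leaves(tree1, tree2))   # index of the leaf that differs
```

Under these assumptions the leaf hashes of replica 1 are 5, 0, 1, 3 and those of replica 2 are 5, 0, 1, 1. The left subtrees agree, so only the right subtrees are descended, and the comparison ends at the fourth leaf, where replica 2 is missing key 003. A hash with only seven possible values collides easily, which is one reason the example is far from reality.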

Design (Anti-Entropy Algorithm)
  Each physical node maintains a separate Merkle tree for each key range hosted by one of its virtual nodes
  Two nodes exchange the roots of the Merkle trees for the key ranges they have in common
  By applying the tree traversal scheme described above, the nodes determine whether they have any differences
    – If a difference exists, the nodes take the corresponding corrective action by copying the missing object

Failure of a Whole Data Centre
  A highly available storage system should be able to handle the failure of an entire data centre
    – Data centre failures happen due to power outages, cooling failures, network failures, and natural disasters
  Dynamo is configured in such a way that each object is replicated across multiple data centres
    – The nodes in the preference list for a key belong to multiple data centres

Summary (1)
  Dynamo is one of Amazon’s CDBMSs
  It has been in use since 2006 and has influenced a number of other CDBMSs, including Cassandra
  Data model: key-value with a very simple API
  Data partitioning and replication: consistent hash ring with optimistic replication
  Data versioning: vector clocks

Summary (2)
  Network communication: gossip protocol
  Handling of failures: hinted handoff is used to compensate for not relocating the database objects of temporarily failed nodes
  Replica synchronization: Merkle trees are used to detect inconsistencies
  Anti-entropy algorithm: two nodes exchange the roots of the Merkle trees for the key ranges they have in common, find differences in the key ranges, and apply corrective actions