1 Dynamo Amazon’s Highly Available Key-value Store Scott Dougan
2 Amazon One of Amazon’s biggest challenges is reliability at a massive scale. Even a small outage has large financial consequences and damages customer trust. Amazon also provides services for many other websites.
3 Amazon Tens of thousands of servers handle Amazon’s requests. With this many servers, small and large components fail continuously.
4 Amazon Strict requirements for performance, reliability and efficiency, plus the need to support continued growth. Customers should still be able to add items to their shopping cart even if disks are failing, network routes are flapping or data centers are being destroyed by tornadoes.
5 Dynamo Dynamo was introduced to some of Amazon’s core services to provide an “always-on” experience. In certain failure scenarios it sacrifices consistency rather than availability. Uses object versioning, conflict resolution techniques and consistent hashing.
6 Dynamo Uses a variety of well known techniques to achieve scalability and availability. Treats failure as a normal case. Manages the state of services. Minimal need for administration.
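7 Assumptions and Requirements Query Model: simple read and write operations to a data item that is uniquely identified by a key. ACID Properties: stores that provide ACID (Atomicity, Consistency, Isolation, Durability) guarantees tend to have poor availability, so Dynamo accepts weaker consistency, provides no isolation guarantees and permits only single-key updates. Efficiency: latency requirements are measured at the 99.9th percentile of the distribution.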
8 Service Level Agreements Clients and services agree on several system-related characteristics, including the expected use of a particular API and the expected service latency. An example would be a service that guarantees a response within 300ms for 99.9% of its requests at a peak load of 500 requests per second.
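A latency SLA of this shape can be checked by counting the fraction of requests answered within the limit. The sketch below is illustrative only: `meets_sla`, its parameters and the sample data are hypothetical names chosen for this example, not part of Dynamo.

```python
from fractions import Fraction


def meets_sla(latencies_ms, limit_ms=300, target=Fraction("0.999")):
    """Return True if at least `target` of the requests finished within `limit_ms`.

    Fractions are used so that the comparison against 99.9% is exact.
    """
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    within = sum(1 for ms in latencies_ms if ms <= limit_ms)
    return Fraction(within, len(latencies_ms)) >= target


# Hypothetical samples: 1000 requests, with one or two slow outliers.
good = [100] * 999 + [900]
bad = [100] * 998 + [900] * 2
```

Both sample sets have a mean latency far below 300ms, yet only `good` satisfies the SLA. This is why a percentile target is stricter than an average.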
9 Amazon’s Platform
10 Design An eventually consistent data store. Update conflicts are resolved on reads, so writes are never rejected. Rejecting customer updates could result in a poor customer experience.
11 Conflict Conflict resolution can be performed by either the data store or the application. The data store can only apply simple policies such as “last write wins”. The application can merge the conflicting versions.
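The difference between the two approaches can be sketched with a shopping cart. This is a minimal illustration, assuming each version carries a timestamp and a set of items; the `Version` type and both function names are hypothetical, not Dynamo’s API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    timestamp: float      # assumed wall-clock time of the write
    items: frozenset      # cart contents in this version


def last_write_wins(versions):
    """Data-store style resolution: keep the newest version, discard the rest."""
    return max(versions, key=lambda v: v.timestamp)


def merge_carts(versions):
    """Application style resolution: union the items of all conflicting versions."""
    merged = frozenset().union(*(v.items for v in versions))
    return Version(max(v.timestamp for v in versions), merged)


a = Version(1.0, frozenset({"book"}))
b = Version(2.0, frozenset({"lamp"}))
```

Here `last_write_wins([a, b])` silently drops the book, while `merge_carts([a, b])` keeps both items. The price of merging by union is that an item deleted in one version can reappear in the merged cart.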
12 Nodes Each Dynamo instance is classified as a node. Every node has the same set of responsibilities. Symmetry simplifies the process of debugging, provisioning and maintenance.
13 Nodes Decentralized peer-to-peer techniques are favoured because centralized control has resulted in outages. The system needs to exploit heterogeneity in its infrastructure: the work distribution must be proportional to the capabilities of the individual servers.
14 Peer-to-Peer Systems First generation: Freenet, Gnutella. Second generation: Pastry, Chord.
15 Dynamo
16 Interface Two basic operations: get() and put(). An MD5 hash of the key produces a 128-bit identifier that determines which storage nodes are responsible for serving the key.
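Turning a key into a 128-bit ring position can be sketched with Python’s standard `hashlib`. The function name `key_position` is a hypothetical helper for this example.

```python
import hashlib


def key_position(key: bytes) -> int:
    """Map a key to an integer in [0, 2**128) using its MD5 digest."""
    digest = hashlib.md5(key).digest()   # 16 bytes = 128 bits
    return int.from_bytes(digest, "big")
```

The same key always maps to the same position, so any node can work out where a key belongs without consulting a central directory.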
17 Partitioning Each node is responsible for the area between it and its predecessor. Each node gets assigned to multiple positions in the ring by using virtual nodes. Each position in the ring is called a token.
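The ring lookup described above can be sketched as follows. This is a simplified model, assuming a tiny ring with hand-picked integer tokens; the `Ring` class and its method names are hypothetical, and a real deployment would place tokens in the 128-bit hash space.

```python
import bisect


class Ring:
    """Consistent-hashing ring where each physical node owns several tokens."""

    def __init__(self):
        self._tokens = []   # sorted token positions on the ring
        self._owner = {}    # token -> physical node name

    def add_node(self, node, tokens):
        """Place a node on the ring at each of its tokens (its virtual nodes)."""
        for token in tokens:
            if token in self._owner:
                raise ValueError(f"token {token} already assigned")
            bisect.insort(self._tokens, token)
            self._owner[token] = node

    def coordinator(self, position):
        """Return the node owning the first token at or after `position`.

        A token covers the range between its predecessor and itself, and the
        lookup wraps around to the first token past the end of the ring.
        """
        if not self._tokens:
            raise LookupError("ring is empty")
        i = bisect.bisect_left(self._tokens, position)
        return self._owner[self._tokens[i % len(self._tokens)]]


ring = Ring()
ring.add_node("A", [10, 60])
ring.add_node("B", [30, 90])
```

With these tokens, positions 0 to 10 and 31 to 60 belong to A, positions 11 to 30 and 61 to 90 belong to B, and positions above 90 wrap around to A’s token at 10. Because each node holds several tokens, its load is spread over multiple small ranges rather than one large one.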
18 Conflicts
19 Execution Both get() and put() can either go through a load balancer first or connect directly to a node. Bypassing the load balancer gives lower latency because it avoids a potential forwarding step. Reads and writes use the first N healthy nodes in the preference list.
20 Quorum The consistency protocol has two key configurable values: R is the minimum number of nodes that must participate in a successful read, and W is the minimum number of nodes that must participate in a successful write. Setting R + W > N yields a quorum-like system.
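The reason R + W > N matters is that every read set must then share at least one node with every write set. The sketch below checks this by brute force for small N; both function names are hypothetical helpers for this example.

```python
from itertools import combinations


def is_quorum(n, r, w):
    """The configuration rule: read and write sets are forced to overlap."""
    return r + w > n


def always_overlap(n, r, w):
    """Brute-force check that every R-subset intersects every W-subset of N nodes."""
    nodes = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(nodes, r)
        for write_set in combinations(nodes, w)
    )
```

For example, with N = 3 the setting R = 2, W = 2 guarantees overlap, while R = 1, W = 1 does not, so a read may miss the latest write.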
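21 Hinted handoff Dynamo doesn’t enforce strict quorum membership; instead it uses a “sloppy quorum”. Reads and writes are performed on the first N healthy nodes in the preference list, skipping nodes that are unavailable. A replica written to a stand-in node carries a hint naming the intended recipient and is handed back once that node recovers. With W set to 1, a write is rejected only if every node in the system is unavailable.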
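Replica selection under a sloppy quorum can be sketched as below. It assumes the caller already has the nodes in ring order starting from the key’s position and a set of nodes currently believed healthy; `pick_replicas`, `write_accepted` and the node names are hypothetical.

```python
def pick_replicas(ring_order, healthy, n):
    """Choose the first n healthy nodes, attaching a hint to each stand-in.

    Returns a list of (node, hint) pairs. `hint` is None for an intended
    replica, or the name of the unavailable node the stand-in covers for.
    """
    intended = ring_order[:n]
    down = [node for node in intended if node not in healthy]
    chosen = []
    for node in ring_order:
        if len(chosen) == n:
            break
        if node not in healthy:
            continue  # skip unavailable nodes instead of failing the request
        if node in intended:
            chosen.append((node, None))
        else:
            chosen.append((node, down.pop(0)))  # stand-in remembers whom to hand off to
    return chosen


def write_accepted(chosen, w):
    """A write succeeds once at least w replicas can take it."""
    return len(chosen) >= w
```

With nodes A to E, N = 3 and A unavailable, the write goes to B, C and D, and D stores the hint "A" so the replica can be delivered to A when it returns.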
22 Merkle trees A Merkle tree is a hash tree whose leaves are hashes of the values of individual keys and whose parent nodes are hashes of their children. Part of the Merkle tree can be used to verify part of the data. If the root hashes of two trees are equal, the leaves are also equal and no synchronization is needed.
23 Ring Membership An administrator uses a command line tool or a browser to connect to a Dynamo node and issue a membership change. Every second, each node contacts a peer chosen at random and the two reconcile their membership change histories. Some nodes act as seeds, which are known to all nodes.
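The comparison can be sketched as follows: build a tree over each replica’s values, compare the roots, and descend only into branches whose hashes differ. This is a minimal illustration, assuming SHA-256 as the hash and that both replicas hold the same number of keys in the same order; the function names are hypothetical.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(values):
    """Return the tree as a list of levels: leaf hashes first, root last."""
    if not values:
        raise ValueError("need at least one value")
    level = [_h(v) for v in values]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                 # odd count: pair the last hash with itself
            level = level + [level[-1]]
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_leaves(a_levels, b_levels):
    """Return indices of leaves that differ, visiting only mismatching branches."""
    if len(a_levels[0]) != len(b_levels[0]):
        raise ValueError("trees must cover the same number of keys")
    top = len(a_levels) - 1
    if a_levels[top][0] == b_levels[top][0]:
        return []                          # equal roots: nothing to synchronize
    suspects = [0]
    for depth in range(top - 1, -1, -1):
        next_suspects = []
        for i in suspects:
            for child in (2 * i, 2 * i + 1):
                if child < len(a_levels[depth]) and a_levels[depth][child] != b_levels[depth][child]:
                    next_suspects.append(child)
        suspects = next_suspects
    return suspects
```

When two replicas agree, a single root comparison settles it. When they differ, only the hashes along the mismatching branches need to be exchanged, which keeps the amount of transferred data small.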
24 Failure Detection
25 Adding/Removing Nodes When a node is added to the system, it is assigned tokens scattered randomly on the ring. Existing nodes that are no longer responsible for some key ranges transfer those keys to the new node. This approach has been shown to distribute the load of keys uniformly across the storage nodes.
26 Implementation Dynamo allows different storage engines to be plugged in: Berkeley Database (BDB) Transactional Data Store, BDB Java Edition, MySQL, and an in-memory buffer with persistent backing store.
27 Models Business-logic-specific reconciliation. Timestamp-based reconciliation. High-performance read engine.
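28 Optimizations Each write is stored in an in-memory buffer and periodically written to disk by a writer thread. This has lowered the 99.9th percentile latency by a factor of 5 during peak traffic. The trade-off is a higher chance of losing data, because a server crash loses writes still queued in the buffer.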
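The durability trade-off of a write buffer can be sketched with a toy store. This is a simplified, single-threaded model in which `flush()` stands in for the periodic writer thread and a dictionary stands in for the disk; the `BufferedStore` class is hypothetical.

```python
class BufferedStore:
    """Toy key-value store that acknowledges writes before they reach disk."""

    def __init__(self):
        self.buffer = {}   # pending writes held in memory
        self.disk = {}     # stand-in for durable storage

    def put(self, key, value):
        """Fast path: record the write in memory only."""
        self.buffer[key] = value

    def get(self, key):
        """Reads check the buffer first so they see writes not yet flushed."""
        if key in self.buffer:
            return self.buffer[key]
        return self.disk.get(key)

    def flush(self):
        """What the writer thread would do periodically."""
        self.disk.update(self.buffer)
        self.buffer.clear()

    def crash(self):
        """Simulate a server crash: buffered writes vanish. Returns how many were lost."""
        lost = len(self.buffer)
        self.buffer.clear()
        return lost
```

A write that is flushed before the crash survives, while one that is still in the buffer is gone even though the client was told it succeeded.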
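29 Load Distribution Strategy 1: T random tokens per node, partitioned by token value. Strategy 2: T random tokens per node and equal-sized partitions. Strategy 3: Q/S tokens per node and equal-sized partitions, where Q is the number of partitions and S is the number of nodes.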
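The equal-sized partition scheme with Q/S tokens per node can be sketched as below, taking Q as the number of partitions and S as the number of nodes. The round-robin assignment is an assumption made to keep the example deterministic, and both function names are hypothetical.

```python
def assign_partitions(q, nodes):
    """Give each of the S nodes exactly Q/S of the Q equal-sized partitions.

    Round-robin placement is an illustrative choice, not a prescribed policy.
    """
    s = len(nodes)
    if s == 0 or q % s:
        raise ValueError("Q must be a positive multiple of the number of nodes")
    return {node: list(range(i, q, s)) for i, node in enumerate(nodes)}


def partition_of(position, q, space=2 ** 128):
    """Index of the equal-sized partition containing a ring position."""
    if not 0 <= position < space:
        raise ValueError("position outside the hash space")
    return position * q // space


layout = assign_partitions(12, ["A", "B", "C"])
```

Because partition boundaries are fixed, adding or removing a node only changes which node owns a partition, never where the partition starts or ends.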
30 Load Distribution
31 Load Distribution
32 Client/Server Driven
33 Time Slices Background tasks caused resource contention and affected the performance of regular put and get operations. Each background task uses an admission controller to reserve runtime slices of a shared resource. The controller monitors foreground operations and decides how many time slices are available to background tasks.
34 References DECANDIA, G., HASTORUN, D., JAMPANI, M., KAKULAPATI, G., LAKSHMAN, A., PILCHIN, A., SIVASUBRAMANIAN, S., VOSSHALL, P., AND VOGELS, W. Dynamo: Amazon’s highly available key-value store. In Proceedings of the Twenty-First ACM SIGOPS Symposium on Operating Systems Principles (2007), ACM Press, New York, NY, USA, pp. 205–220.