Dynamo: Amazon’s Highly Available Key-value Store
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels
Presented by Ramprasad Ramachandran, 07/20/2010
Motivation
- Amazon's platform runs on tens of thousands of servers and network components across multiple data centers.
- Even the slightest outage has financial consequences and affects customer trust.
- The platform must provide an “always-on” experience to customers.
- Hundreds of services run in a highly decentralized SOA environment.
- Many services need only primary-key access to a data store.
- Solution: Dynamo, a highly available, distributed data store.
System Assumptions and Requirements
- Query Model: simple read and write operations on a data item identified by a primary key.
- ACID Properties: data stores that provide ACID guarantees tend to have poor availability.
- Efficiency: stringent latency requirements, generally measured at the 99.9th percentile of the distribution, to meet strict Service Level Agreements (SLAs).
- Other Assumptions: Dynamo is used only by Amazon's internal services, which do not need authentication or authorization.
Service Level Agreements
- In an SLA, clients and services agree on system-related characteristics.
- Example: a service guarantees a response within 300 ms for 99.9% of its requests at a peak client load of 500 requests per second.
- SLAs are measured and expressed at the 99.9th percentile of the latency distribution.
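A minimal sketch of how the example SLA above could be checked against measured latencies. The function names and the sample data are hypothetical, and the target is passed as an integer per-mille value (999 for 99.9%) so the comparison avoids floating-point rounding.

```python
def meets_sla(latencies_ms, limit_ms=300, target_per_mille=999):
    """True if at least target_per_mille/1000 of the requests finished within limit_ms."""
    within = sum(1 for ms in latencies_ms if ms <= limit_ms)
    return within * 1000 >= target_per_mille * len(latencies_ms)


def mean(values):
    return sum(values) / len(values)


# Hypothetical sample: 998 fast requests and 2 slow ones.
latencies = [20] * 998 + [900] * 2
average_ms = mean(latencies)      # far below 300 ms
sla_ok = meets_sla(latencies)     # False: only 99.8% of requests are within 300 ms
```

The average looks healthy while the 99.9th-percentile target is missed, which is why a tail percentile says more about the experience of every customer than a mean does.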
SOA of Amazon's Platform (figure)
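Design Considerations
- Strong consistency comes at the cost of data availability when failures occur.
- High data availability is possible with optimistic replication techniques, but they can lead to conflicting updates.
- When are conflicts resolved? During reads, so that the data store stays “always writeable”.
- Who resolves them? The application or the data store.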
Design Considerations (Contd.)
Other key principles:
- Incremental Scalability
- Symmetry
- Decentralization
- Heterogeneity
System Architecture: System Interface
Two operations: get() and put().
- get(key): locates the object replicas associated with the key.
- put(key, context, object): determines where the replicas of the object should be placed and writes them to disk.
- The context encodes system metadata about the object.
System Architecture (Contd.): Partitioning Algorithm
- Incremental scaling is a key requirement.
- Dynamo partitions data with consistent hashing.
- The output range of the hash function is treated as a ring.
- Basic consistent hashing leads to non-uniform data and load distribution.
- It is also oblivious to heterogeneity in the performance of nodes.
- Solution: virtual nodes.
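The ring with virtual nodes can be sketched as follows. This is an illustration under stated assumptions, not Dynamo's implementation: the `Ring` class, the `node#i` token labels and the default of 8 virtual nodes are hypothetical choices, and MD5 is used only to obtain a 128-bit ring position.

```python
import bisect
import hashlib


def ring_position(label):
    """Map a string onto the ring using MD5 (a 128-bit position)."""
    return int.from_bytes(hashlib.md5(label.encode("utf-8")).digest(), "big")


class Ring:
    def __init__(self):
        self.tokens = []  # sorted (position, physical node) pairs

    def add_node(self, node, virtual_nodes=8):
        # Each physical node claims several positions ("virtual nodes").
        for i in range(virtual_nodes):
            bisect.insort(self.tokens, (ring_position(f"{node}#{i}"), node))

    def owner(self, key):
        # First token clockwise from the key's position, wrapping around.
        index = bisect.bisect_left(self.tokens, (ring_position(key),))
        return self.tokens[index % len(self.tokens)][1]


ring = Ring()
for name in ("A", "B", "C"):
    ring.add_node(name)

keys = [f"key-{i}" for i in range(1000)]
before = {k: ring.owner(k) for k in keys}
ring.add_node("D")
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Every key in `moved` now belongs to the new node D; no key changes hands between the existing nodes, which is what makes scaling incremental. A more capable machine could be given a larger `virtual_nodes` value to take a larger share of the ring.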
System Architecture (Contd.): Replication
- Each key is assigned a coordinator node.
- The coordinator replicates the key at the N-1 clockwise successor nodes.
- The preference list is the list of nodes responsible for storing a particular key.
- With virtual nodes, the first N successor positions of a key can be owned by fewer than N physical nodes.
- Solution: skip positions so that the preference list contains only distinct physical nodes.
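The preference-list rule above can be sketched as a clockwise walk that skips positions owned by a physical node already chosen. The token positions and node names below are made up for illustration.

```python
import bisect


def preference_list(tokens, key_position, n):
    """tokens: sorted (position, physical node) pairs; returns n distinct physical nodes."""
    start = bisect.bisect_left(tokens, (key_position,))
    chosen = []
    for step in range(len(tokens)):
        _, node = tokens[(start + step) % len(tokens)]
        if node not in chosen:      # skip virtual nodes of an already chosen machine
            chosen.append(node)
        if len(chosen) == n:
            break
    return chosen


# Hypothetical ring: A and B each own two virtual nodes.
tokens = sorted([(10, "A"), (20, "A"), (30, "B"), (40, "C"), (50, "B"), (60, "D")])

first = preference_list(tokens, 5, 3)     # ['A', 'B', 'C'], the second A token is skipped
wrapped = preference_list(tokens, 45, 3)  # ['B', 'D', 'A'], the walk wraps around the ring
```

The first entry of each list plays the role of the coordinator for that key.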
System Architecture (Contd.): Data Versioning
- Updates are propagated to replicas asynchronously, so multiple versions of an object can exist.
- This might result in objects having distinct version sub-histories.
- Solution: vector clocks, which capture causality between different versions.
- A vector clock is a list of (node, counter) pairs associated with every version of every object.
- Syntactic reconciliation: the system discards a version that a newer version subsumes.
- Semantic reconciliation: the client merges versions that conflict.
System Architecture (Contd.): Data Versioning (Contd.) (figure)
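A small sketch of vector clocks as dictionaries from node name to counter. The helper names are hypothetical; the sequence D1 to D5 with nodes Sx, Sy and Sz follows the version-evolution example in the Dynamo paper.

```python
def increment(clock, node):
    """Return a copy of the clock with the node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a, b):
    """True if clock a has seen everything clock b has seen."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())


def syntactic_reconcile(clocks):
    """Drop every clock that is a strict ancestor of another clock in the list."""
    return [c for c in clocks
            if not any(other != c and descends(other, c) for other in clocks)]


def merge(clocks):
    """Pointwise maximum, used when a client writes back a reconciled version."""
    merged = {}
    for clock in clocks:
        for node, counter in clock.items():
            merged[node] = max(merged.get(node, 0), counter)
    return merged


d1 = increment({}, "Sx")   # {Sx: 1}
d2 = increment(d1, "Sx")   # {Sx: 2}
d3 = increment(d2, "Sy")   # {Sx: 2, Sy: 1}
d4 = increment(d2, "Sz")   # {Sx: 2, Sz: 1}

siblings = syntactic_reconcile([d2, d3, d4])  # d2 is subsumed; d3 and d4 conflict
d5 = increment(merge(siblings), "Sx")         # {Sx: 3, Sy: 1, Sz: 1}
```

Deciding what the merged value of D3 and D4 should be is the semantic part and is left to the application; the clocks only record that both were seen.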
System Architecture (Contd.): Routing Requests
- A quorum-like protocol is used to maintain consistency.
- Two configurable values, R and W: the minimum number of nodes that must participate in a successful read and write operation respectively.
- Setting R + W > N yields a quorum-like system.
- Two strategies for selecting a node:
  - Generic load balancer: chooses a node based on load information.
  - Partition-aware client library: routes the request directly to the appropriate nodes.
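The reason for the condition R + W > N can be checked by brute force: only then does every possible read set share at least one node with every possible write set. The function name is hypothetical and the enumeration is only practical for small N.

```python
from itertools import combinations


def always_overlap(n, r, w):
    """True if every set of r readers intersects every set of w writers among n nodes."""
    nodes = range(n)
    return all(set(readers) & set(writers)
               for readers in combinations(nodes, r)
               for writers in combinations(nodes, w))


common = always_overlap(3, 2, 2)   # True:  R + W = 4 > N = 3
weak = always_overlap(3, 1, 2)     # False: R + W = 3 is not greater than N
```

With (N, R, W) = (3, 2, 2) a read always contacts at least one replica that took part in the latest successful write.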
System Architecture (Contd.): Routing Requests (Contd.)
Example: consider put()
- The coordinator generates the vector clock for the new version.
- It writes the new version locally.
- It sends the new version to the N highest-ranked nodes.
- The write is successful if at least W-1 nodes respond.
Example: consider get()
- The coordinator requests all existing versions of the key from the N highest-ranked nodes.
- It waits for R responses before returning the result to the client.
- If there are multiple versions, they are reconciled and the reconciled version is written back.
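The put() and get() steps above can be sketched with in-memory replicas. This is a simplification under stated assumptions: the `Replica` class is hypothetical, a plain integer stands in for the vector clock, the coordinator is assumed to be up, and the read returns only the newest version instead of reconciling siblings.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}  # key -> (value, version)

    def store(self, key, value, version):
        if not self.up:
            return False            # no acknowledgement from a down node
        self.data[key] = (value, version)
        return True

    def fetch(self, key):
        if not self.up:
            raise ConnectionError(self.name)
        return self.data.get(key)


def put(preference_list, key, value, version, w):
    coordinator, *others = preference_list
    coordinator.store(key, value, version)            # write locally first
    acks = sum(1 for replica in others if replica.store(key, value, version))
    return acks >= w - 1                              # coordinator counts as one write


def get(preference_list, key, r):
    responses = []
    for replica in preference_list:
        try:
            responses.append(replica.fetch(key))
        except ConnectionError:
            continue
    if len(responses) < r:
        raise RuntimeError("fewer than R replicas responded")
    found = [item for item in responses if item is not None]
    return max(found, key=lambda item: item[1]) if found else None


a, b, c = Replica("A"), Replica("B"), Replica("C")
c.up = False
write_ok = put([a, b, c], "cart", "book", version=1, w=2)   # True: B acknowledged
read = get([a, b, c], "cart", r=2)                          # ('book', 1)
```

A real coordinator would send the requests in parallel and return as soon as R (or W-1) responses arrive rather than polling every replica in turn.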
System Architecture (Contd.): Hinted Handoff
- A traditional quorum approach would cause unavailability and reduce durability even under the simplest failures.
- Solution: a “sloppy quorum”.
  - When a node is unavailable, another node stores the replica locally on its behalf.
  - The replica is sent back to the intended node when that node becomes available again.
  - The temporary node then deletes its local copy.
- Each object is replicated across multiple data centers.
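A minimal sketch of the three hinted-handoff steps above. The `Node` class and both function names are hypothetical; the hint is modelled as a tuple that remembers the intended owner.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}    # key -> value, replicas this node owns
        self.hints = []   # (intended owner, key, value) held for other nodes


def write_with_handoff(key, value, intended, fallbacks):
    """Write to every intended node; park the replica on a fallback if one is down."""
    spare = iter(node for node in fallbacks if node.up)
    for node in intended:
        if node.up:
            node.data[key] = value
            continue
        stand_in = next(spare, None)
        if stand_in is None:
            raise RuntimeError("no healthy fallback node available")
        stand_in.hints.append((node, key, value))


def deliver_hints(node):
    """Hand parked replicas back to owners that have recovered, then drop them locally."""
    remaining = []
    for owner, key, value in node.hints:
        if owner.up:
            owner.data[key] = value
        else:
            remaining.append((owner, key, value))
    node.hints = remaining


a, b, c, d = Node("A"), Node("B"), Node("C"), Node("D")
a.up = False
write_with_handoff("cart", "book", intended=[a, b, c], fallbacks=[d])
hints_while_down = len(d.hints)   # 1: D holds A's replica
a.up = True
deliver_hints(d)                  # A receives the replica, D forgets it
```

In practice `deliver_hints` would run periodically in the background, so a write never has to wait for the failed node.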
System Architecture (Contd.): Replica Synchronization (anti-entropy protocol)
- Hinted handoff does not address certain durability issues.
- Solution: Merkle trees.
- Leaves are hashes of the values of individual keys.
- Each branch of the tree can be checked independently.
- This also reduces the number of disk reads.
- Do the parent nodes in the two trees have the same hash value?
  - Yes: the replicas are synchronized.
  - No: exchange the hash values of the children and continue down to the leaves.
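The comparison procedure above can be sketched with two small hash trees. The sketch assumes that both replicas cover the same ordered list of keys and that the number of keys is a power of two; SHA-256 and all function names are illustrative choices.

```python
import hashlib


def digest(data):
    return hashlib.sha256(data).digest()


def leaf_hashes(store, keys):
    """One leaf per key slot, hashing the key together with its current value."""
    return [digest(f"{key}={store.get(key)}".encode("utf-8")) for key in keys]


def build(leaves):
    """Return all levels of the tree: levels[0] are the leaves, levels[-1] is [root]."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        below = levels[-1]
        levels.append([digest(below[i] + below[i + 1]) for i in range(0, len(below), 2)])
    return levels


def differing(tree_a, tree_b, level=None, index=0):
    """Indices of leaves that differ, descending only into subtrees whose hashes differ."""
    if level is None:
        level = len(tree_a) - 1
    if tree_a[level][index] == tree_b[level][index]:
        return []                 # identical subtree: nothing below needs checking
    if level == 0:
        return [index]
    return (differing(tree_a, tree_b, level - 1, 2 * index)
            + differing(tree_a, tree_b, level - 1, 2 * index + 1))


keys = [f"k{i}" for i in range(8)]
store_a = {key: "v1" for key in keys}
store_b = dict(store_a)
store_b["k5"] = "stale"

tree_a = build(leaf_hashes(store_a, keys))
tree_b = build(leaf_hashes(store_b, keys))
out_of_sync = [keys[i] for i in differing(tree_a, tree_b)]   # ['k5']
```

Only the hashes along the path to `k5` are compared in detail; the other half of the tree is dismissed after a single comparison.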
System Architecture (Contd.): Ring Membership
- Outages may be transient or may last a long time.
- A transient outage should not cause re-balancing of the partition assignment or repair of the unreachable replicas.
- Solution: an explicit mechanism to add nodes to or remove nodes from the ring.
- When a new node starts up:
  - It chooses its set of tokens (virtual nodes).
  - It maps nodes to their respective tokens.
  - The mapping is persisted on disk.
- Membership changes are reconciled by a gossip-based protocol.
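A sketch of how pairwise gossip exchanges can spread membership changes. The view format (node name mapped to a version number and a status) and the rule that the higher version wins are assumptions made for this illustration.

```python
def reconcile(view_a, view_b):
    """Merge two membership views, keeping the entry with the higher version per node."""
    merged = dict(view_a)
    for node, (version, status) in view_b.items():
        if node not in merged or merged[node][0] < version:
            merged[node] = (version, status)
    return merged


def exchange(views, first, second):
    """One gossip exchange: both peers end up with the merged view."""
    merged = reconcile(views[first], views[second])
    views[first] = dict(merged)
    views[second] = dict(merged)


# Each node initially knows only about its own explicit join.
views = {
    "A": {"A": (1, "joined")},
    "B": {"B": (1, "joined")},
    "C": {"C": (1, "joined")},
}

exchange(views, "A", "B")
exchange(views, "B", "C")
a_lags_behind = "C" not in views["A"]   # True: A has not heard about C yet
exchange(views, "A", "C")               # now all three views agree
```

The intermediate state shows that views can differ for a while; repeated exchanges with randomly chosen peers make them agree eventually.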
System Architecture (Contd.): External Discovery
- The membership mechanism above can temporarily produce a logically partitioned ring.
- Solution: some nodes play the role of seeds.
- Seeds are nodes discovered through an external mechanism.
- They are obtained from either a static configuration or a configuration service.
- Seeds are known to all nodes.
- All nodes eventually reconcile their membership with a seed.
System Architecture (Contd.): Failure Detection
- Nodes need to avoid communicating with unreachable peers.
- Failure detection is purely local to a node.
- A gossip-style protocol is used to detect the arrival or removal of a node.
- Example: node A considers node B failed if B fails to respond to A, even if B can still communicate with another node C.
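The purely local view in the example can be sketched as one detector object per node. The class, its method names and the fixed retry interval are hypothetical; timestamps are passed in explicitly so the behaviour is easy to follow.

```python
class LocalFailureDetector:
    """Tracks which peers this node currently treats as failed."""

    def __init__(self, retry_after_s):
        self.retry_after_s = retry_after_s
        self.failed_at = {}  # peer -> time of the last failed attempt

    def record(self, peer, responded, now):
        if responded:
            self.failed_at.pop(peer, None)
        else:
            self.failed_at[peer] = now

    def should_contact(self, peer, now):
        # Contact healthy peers, and re-probe failed ones after the retry interval.
        failed = self.failed_at.get(peer)
        return failed is None or now - failed >= self.retry_after_s


detector_a = LocalFailureDetector(retry_after_s=10)
detector_c = LocalFailureDetector(retry_after_s=10)

detector_a.record("B", responded=False, now=0)        # B did not answer A
a_avoids_b = not detector_a.should_contact("B", now=1)
c_still_uses_b = detector_c.should_contact("B", now=1)
a_retries_b = detector_a.should_contact("B", now=10)  # time to probe B again
```

Nodes A and C reach different conclusions about B, and neither needs a globally agreed failure state to route requests.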
Implementation
Component                          Implementation
Request coordination               Java
Membership and failure detection   Java
Local persistence engine           Java
Pluggable storage engines          BDB Transactional Data Store, MySQL, etc.
Communications                     Java NIO channels
Performance and Experiences
- Main patterns in which Dynamo is used:
  - Business-logic-specific reconciliation.
  - Timestamp-based reconciliation.
  - High-performance read engines.
- Client applications can tune the values of N, R and W to get the desired levels of performance, availability and durability.
- The typical value of N is 3.
- A common (N, R, W) configuration of Dynamo instances is (3, 2, 2).
Performance and Experiences (Contd.): Load Imbalance in Dynamo (figure)
Performance and Experiences (Contd.): Effect of Partitioning Strategies on Load Distribution
Three strategies:
- Strategy 1: T random tokens per node, partition by token value.
- Strategy 2: T random tokens per node, equal-sized partitions, with Q >> N and Q >> S*T.
- Strategy 3: Q/S tokens per node, equal-sized partitions.
Q is the number of partitions (ranges), S is the number of nodes and T is the number of tokens per node.
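A sketch of the third strategy: the ring is cut into Q equal-sized partitions and each of the S nodes holds Q/S of them. The round-robin initial placement and the rule for picking donors when a node joins are assumptions made for this illustration, not the placement scheme of the paper.

```python
from collections import Counter


def partition_of(key_hash, q, ring_size=2 ** 128):
    """Index of the equal-sized partition that a ring position falls into."""
    return key_hash * q // ring_size


def assign(q, nodes):
    """Initial placement: partitions dealt out round-robin, Q/S per node."""
    return {partition: nodes[partition % len(nodes)] for partition in range(q)}


def add_node(assignment, new_node):
    """The new node takes partitions from nodes holding more than Q/(S+1)."""
    result = dict(assignment)
    owners = set(result.values()) | {new_node}
    target = len(result) // len(owners)
    load = Counter(result.values())
    for partition in sorted(result):
        if load[new_node] >= target:
            break
        donor = result[partition]
        if load[donor] > target:
            result[partition] = new_node
            load[donor] -= 1
            load[new_node] += 1
    return result


before = assign(12, ["A", "B", "C"])   # 4 partitions each
after = add_node(before, "D")          # 3 partitions each
moved = [p for p in before if before[p] != after[p]]
```

Partition boundaries never move; only whole partitions change owner, so a joining node can be handed complete partitions instead of scanning for the keys in an arbitrary range.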
Performance and Experiences (Contd.): Comparison between the three strategies (figure)
Performance and Experiences (Contd.): Coordination
- Server-driven coordination: the request coordinator on a storage node runs the state machine, and a load balancer is used to route requests.
- Client-driven coordination: the state machine is moved to the client, so request coordination is done locally.
Performance and Experiences (Contd.): Balance between Background and Foreground Tasks
- Background tasks include replica synchronization and data handoff.
- Foreground tasks are the normal get/put operations.
- Background tasks caused resource contention problems.
- They were therefore integrated with an admission control mechanism.
- The mechanism controls the runtime slices of the resource that background tasks receive.
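One way such an admission controller could behave is sketched below. The class name, the latency target, the slice counts and the policy (halve the background slices when foreground latency is over target, give one back otherwise) are all assumptions for illustration; the slides do not specify the actual policy.

```python
class AdmissionController:
    """Decides how many time slices background tasks may use in the next interval."""

    def __init__(self, latency_target_ms, max_slices):
        self.latency_target_ms = latency_target_ms
        self.max_slices = max_slices
        self.slices = max_slices

    def observe(self, foreground_latency_ms):
        if foreground_latency_ms > self.latency_target_ms:
            self.slices //= 2                 # foreground is suffering: back off quickly
        elif self.slices < self.max_slices:
            self.slices += 1                  # foreground is healthy: recover slowly
        return self.slices


controller = AdmissionController(latency_target_ms=50, max_slices=8)
history = [controller.observe(ms) for ms in (80, 80, 20, 20)]   # [4, 2, 3, 4]
```

Feeding the controller a tail-latency measurement of get/put operations keeps replica synchronization and handoff from competing with customer-facing requests.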
Conclusion: Pros
- Instances can be tuned to their needs by changing the parameters (N, R, W).
- Exposes data consistency and reconciliation logic issues to developers.
- Incrementally scalable data store.
Conclusion: Cons
- Semantic reconciliation places an additional burden on application developers.
- Only small-scale scalability test results are shown, for example with only 30 nodes.
- Due to sensitivity issues, some of the key topics are not covered well.
- Some issues were not thoroughly examined because the authors had not yet faced such problems.
- Nodes are assigned at random, so a low-performing node could be assigned to clients.
- Application developers need to know which (N, R, W) configuration meets their needs.
References
- G. DeCandia, D. Hastorun, et al. Dynamo: Amazon's Highly Available Key-Value Store. Proceedings of the ACM Symposium on Operating System Principles (SOSP).
- Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine, M., and Lewin, D. Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing (El Paso, Texas, United States, May 1997). STOC '97. ACM Press, New York, NY.
- Gray, J., Helland, P., O'Neil, P., and Shasha, D. The dangers of replication and a solution. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data (Montreal, Quebec, Canada, June 1996). J. Widom, Ed. SIGMOD '96. ACM Press, New York, NY.
- Welsh, M., Culler, D., and Brewer, E. SEDA: an architecture for well-conditioned, scalable internet services. In Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles (Banff, Alberta, Canada, October 2001). SOSP '01. ACM Press, New York, NY.
Q&A