Lecture 9: Dynamo Instructor: Weidong Shi (Larry), PhD

COSC6376 Cloud Computing, Computer Science Department, University of Houston

Outline Dynamo

Dynamo

DynamoDB Scalable: Dynamo architecture. Reliable: replicas over multiple data centers. Speed: fast, single-digit milliseconds. Schemaless.

Data Model Table: a container, similar to a worksheet in Excel (the slide also calls it a "domain"). Queries cannot span tables. Item: an item name mapped to (attribute, value) pairs. An item is stored in a table, like a row in a worksheet, with attributes as the column names. Example: table "cars", item 1: "car1": {"make": "BMW", "year": "2009"}

Data Model Primary key of table: single key (hash). Data types: simple (string and number) and multi-valued (string set and number set). Example (Ruby, Fog library):
dynamo = Fog::AWS::DynamoDB.new(
  aws_access_key_id: "YOUR KEY",
  aws_secret_access_key: "YOUR SECRET")
# Create the table "people" with a string ("S") hash key named "username".
dynamo.create_table("people",
  {HashKeyElement: {AttributeName: "username", AttributeType: "S"}},
  {ReadCapacityUnits: 5, WriteCapacityUnits: 5})

Example

Access methods Amazon DynamoDB is a web service that uses HTTP and HTTPS as the transport and JavaScript Object Notation (JSON) as the message serialization format. APIs: Java, PHP, .NET

High Availability for writes and Vector Clock

Data Versioning A put() call may return to its caller before the update has been applied at all the replicas. A get() call may return many versions of the same object. Challenge: an object may have distinct version sub-histories, which the system will need to reconcile in the future. Solution: Dynamo uses vector clocks to capture causality between different versions of the same object.

Logical Clock Introduced by Lamport in 1978. Defines the “happened before” relation and the clock condition. Connects these concerns to special relativity.

Event Ordering and Clock [Figure: event-ordering diagram with three time axes]

Happened Before Assume that sending or receiving a message is an event in a process. Then the “happened before” relation, denoted by “->”, is defined as follows. (1) If a and b are events in the same process, and a comes before b, then a -> b. (2) If a is the sending of a message by one process and b is the receipt of the same message by another process, then a -> b. (3) If a -> b and b -> c, then a -> c. [Figures: events a and b on the timeline of process P; a message sent at a in process P and received at b in process Q]

Happened Before Two distinct events a and b are said to be concurrent if neither a -> b nor b -> a holds. [Figure: timelines of processes P, Q, and R]

Logical Clock If A happened before B, then it is possible for A to causally affect B. If neither can affect the other, then they are concurrent. A logical clock is a function Ci which assigns a number Ci(a) to any event a in process Pi. Clock Condition: for any events a, b, if a -> b then C(a) < C(b).

Lamport Clock Each node p keeps a logical clock, Cp. Each node increments its logical clock between successive events: Cp ← Cp + 1. A sender includes its clock value in the message as the timestamp: ts = Cp. A receiver q advances its clock to be greater than both the message’s timestamp and its own clock: Cq ← max(ts, Cq) + 1.
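The update rules on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the class name LamportClock and its method names are made up for the example, and the receive rule is taken to be max(ts, Cq) + 1 so that the result is strictly greater than both values.

```python
class LamportClock:
    """Scalar logical clock for one process (hypothetical helper class)."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event: Cp <- Cp + 1.
        self.time += 1
        return self.time

    def send(self):
        # Sending is itself an event; the message carries ts = Cp.
        return self.tick()

    def receive(self, ts):
        # Receipt: move strictly past both the message timestamp and the local clock.
        self.time = max(ts, self.time) + 1
        return self.time


p, q = LamportClock(), LamportClock()
a = p.tick()        # a local event in process P
ts = p.send()       # P sends a message stamped with its clock
b = q.receive(ts)   # Q receives it
print(a, ts, b)     # the three timestamps are strictly increasing
```

Because the receiver jumps past the sender's timestamp, the send event always gets a smaller clock value than the matching receive event, which is what the clock condition requires.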

Logical Clock Logical clocks satisfy the clock condition if the following hold. C1. If a and b are events in process P, and a comes before b, then Cp(a) < Cp(b). C2. If a is the sending of a message by process P, and b is the receipt of that message by process Q, then Cp(a) < Cq(b).

Vector Clocks

Vector Clocks Vector clocks are constructed by letting each node p maintain a vector VCp. VCp[p] is the number of events that have occurred so far at node p. In other words, VCp[p] is the local logical clock at node p. If VCq[p] = k, then node q knows that k events have occurred at p; it is thus node q’s knowledge of the local time at node p. Vector clocks preserve more information than scalar logical clocks. [Figure: timelines of process p and process q labeled with vector clock values VCp = (0,0), (1,0), (3,3) and VCq = (0,0), (2,1), (2,2)]

Vector Clock Keep a timestamp for each process. A process increments its own timestamp before each event. A process updates its values of other processes’ timestamps when receiving messages.

Vector Clocks Before executing an event node p executes VCp [ p ] ← VCp [p ] + 1. When node p sends a message m to node q, it sets the message’s vector timestamp ts (m) equal to VCp after having executed the previous step. Upon the receipt of a message m, node q adjusts its own vector by setting VCq [k ] ← max{VCq [k ], ts (m)[k ]} for each k, after which it executes the first step and delivers the message to the application.
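The three steps above can be sketched as follows. This is an illustrative sketch only: the class name VectorClock, the use of integer node indices, and the fixed vector length n are assumptions made for the example.

```python
class VectorClock:
    """Vector clock for one node among n nodes (hypothetical helper class)."""

    def __init__(self, node, n):
        self.node = node      # index of this node in the vector
        self.vc = [0] * n

    def event(self):
        # Step 1: VCp[p] <- VCp[p] + 1 before executing an event.
        self.vc[self.node] += 1

    def send(self):
        # Step 2: increment first, then stamp the message with a copy of VCp.
        self.event()
        return list(self.vc)

    def receive(self, ts):
        # Step 3: component-wise max with ts(m), then count the receipt as an event.
        self.vc = [max(mine, theirs) for mine, theirs in zip(self.vc, ts)]
        self.event()


p = VectorClock(0, 2)
q = VectorClock(1, 2)
ts = p.send()     # p's vector becomes [1, 0]; the message carries [1, 0]
q.receive(ts)     # q merges to [1, 0], then increments its own entry
print(p.vc, q.vc)
```

After the exchange, q's entry for p records that q knows about one event at p, which is the "knowledge of the local time at node p" described earlier.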

Vector Clock A vector clock is a list of (node, counter) pairs. Every version of every object is associated with one vector clock. If every counter in the first object’s clock is less than or equal to the corresponding node’s counter in the second object’s clock, then the first is an ancestor of the second and can be forgotten. Otherwise, the two versions are in conflict and need to be reconciled.
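The ancestor test can be written as a short comparison over (node, counter) pairs. The sketch below assumes clocks are stored as Python dicts and that a node missing from a clock counts as 0; the function names and the node labels Sx, Sy, Sz are illustrative.

```python
def descends(newer, older):
    """True if every counter in `older` is <= the matching counter in `newer`."""
    return all(count <= newer.get(node, 0) for node, count in older.items())


def compare(a, b):
    """Classify version clock `a` relative to version clock `b`."""
    a_le_b = descends(b, a)
    b_le_a = descends(a, b)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "ancestor"      # a can be forgotten in favor of b
    if b_le_a:
        return "descendant"
    return "concurrent"        # conflicting versions, must be reconciled


d2 = {"Sx": 2}
d3 = {"Sx": 2, "Sy": 1}   # written via node Sy on top of d2
d4 = {"Sx": 2, "Sz": 1}   # written via node Sz on top of d2
print(compare(d2, d3), compare(d3, d4))
```

Here d2 is an ancestor of both d3 and d4, while d3 and d4 are concurrent, so a read that sees both would have to return both versions for reconciliation.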

Vector Clock Example

Handling Temporary Failures and Hinted Handoff

Sloppy Quorum N is the number of nodes that store a replica of each key. R (W) is the minimum number of nodes that must participate in a successful read (write) operation. Setting R + W > N yields a quorum-like system. In this model, the latency of a get (or put) operation is dictated by the slowest of the R (or W) replicas. For this reason, R and W are usually configured to be less than N, to provide better latency.
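Why R + W > N gives quorum-like behavior can be checked by brute force: the condition holds exactly when every possible read set shares at least one replica with every possible write set. The sketch below assumes a fixed set of N replicas (a strict quorum, ignoring the stand-in nodes a sloppy quorum may use); the function name always_overlap is made up for the example.

```python
from itertools import combinations


def always_overlap(n, r, w):
    """True if every r-node read set intersects every w-node write set."""
    replicas = range(n)
    return all(set(read_set) & set(write_set)
               for read_set in combinations(replicas, r)
               for write_set in combinations(replicas, w))


# N = 3 with R = W = 2 satisfies R + W > N; R = 1, W = 2 does not.
print(always_overlap(3, 2, 2), always_overlap(3, 1, 2))
```

With overlap guaranteed, at least one node contacted by a read has seen the latest successful write; without it, a read can miss that write entirely.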

Hinted Handoff Assume N = 3. When A is temporarily down or unreachable during a write, the replica is sent to D instead. D receives a hint that the replica belongs to A and delivers it to A once A has recovered. Again: “always writeable”.
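The scenario on this slide can be simulated with a toy model. This is a simplified sketch under stated assumptions: the Node class, the write and handoff functions, and the idea that the ring list is already ordered as the key's preference list are all illustrative, not Dynamo's actual interfaces.

```python
class Node:
    """Toy storage node (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.store = {}    # replicas this node owns: key -> value
        self.hinted = []   # (intended owner name, key, value) held for other nodes


def write(key, value, ring, n=3):
    """Write to the first n nodes; replicas for down nodes go to healthy stand-ins."""
    preferred = ring[:n]
    down = [node for node in preferred if not node.up]
    standins = [node for node in ring[n:] if node.up]
    for node in preferred:
        if node.up:
            node.store[key] = value
    for owner, standin in zip(down, standins):
        standin.hinted.append((owner.name, key, value))


def handoff(standin, nodes_by_name):
    """Deliver hinted replicas to owners that are back up; keep the rest."""
    remaining = []
    for owner_name, key, value in standin.hinted:
        owner = nodes_by_name[owner_name]
        if owner.up:
            owner.store[key] = value
        else:
            remaining.append((owner_name, key, value))
    standin.hinted = remaining


a, b, c, d = (Node(name) for name in "ABCD")
ring = [a, b, c, d]
a.up = False                      # A is temporarily unreachable
write("k", "v1", ring)            # the write still succeeds on B, C, and stand-in D
hint_before = list(d.hinted)      # D holds the replica with a hint naming A
a.up = True                       # A recovers
handoff(d, {node.name: node for node in ring})
print(hint_before, a.store, d.hinted)
```

The write never blocks on A, which is the "always writeable" property; D only keeps the replica until it can be handed back to its intended owner.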

Recovering from Permanent Failures and Merkle Trees

Replica Synchronization: The Merkle Hash Tree A Merkle tree is a tree of hashes in which the leaves are hashes of the authentic data values n1, n2, ..., nw. The value of an internal node A is ha = h(h(n1)||h(n2)). The value of the root node is hr = h(ha||hb). [Figure: tree with leaves h(n1), h(n2), h(n3), h(n4), internal nodes ha and hb, and root hr]
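The formulas above translate directly into code. The sketch below is illustrative: SHA-256 is an assumed choice of hash function h (the slide does not name one), the function names are made up, and the number of values is assumed to be a power of two so that every level pairs up evenly.

```python
import hashlib


def h(data: bytes) -> bytes:
    # Assumed hash function for the example.
    return hashlib.sha256(data).digest()


def merkle_levels(values):
    """Return all tree levels, from leaf hashes up to the single root hash."""
    level = [h(v) for v in values]
    levels = [level]
    while len(level) > 1:
        # Each parent is the hash of its two children concatenated: h(left || right).
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


replica_a = [b"n1", b"n2", b"n3", b"n4"]
replica_b = [b"n1", b"n2", b"n3-stale", b"n4"]

levels_a = merkle_levels(replica_a)
levels_b = merkle_levels(replica_b)
root_a, root_b = levels_a[-1][0], levels_b[-1][0]

# Roots differ, so the replicas are out of sync; compare children to narrow it down.
same_left = levels_a[1][0] == levels_b[1][0]    # ha: covers n1, n2
same_right = levels_a[1][1] == levels_b[1][1]   # hb: covers n3, n4
print(root_a == root_b, same_left, same_right)
```

Comparing roots first and descending only into subtrees whose hashes differ lets two replicas find the out-of-date values without exchanging all of the data.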

Membership and Failure Detection and Gossip Protocol

Gossip Protocol Gossip-based algorithms spread information the way gossip spreads among people. They propagate information in large peer-to-peer systems deployed on the Internet or ad hoc networks. Easy to deploy. Robust. Resilient to failure.

Gossip How do you gossip? If someone tells you a hot piece of gossip, you’ll try to tell other people. If you tell one person, and they didn’t know it beforehand, you’ll feel some satisfaction, and want to tell another person. If you tell N people, and they all know it, you lose interest in telling more people.

Bad news travels fast

Gossip Protocol In a gossip protocol, each node is in one of three states. Infected: holds data that it is willing to spread. Susceptible: has not yet seen this data. Removed: not able or willing to spread data. Anti-entropy: node P picks another node Q at random and exchanges updates. Three approaches to the exchange: P only pushes to Q; P only pulls from Q; P and Q do an exchange (push-pull).

Gossip Protocol When it comes to rapidly spreading updates, only pushing updates turns out to be a bad choice: once many nodes are infected, an infected node is unlikely to pick a susceptible node to push to. A pull-based approach works much better when many nodes are infected. A round is a period of time in which each node has had a chance to be active. It takes O(log N) rounds to propagate a single update to all nodes.
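A small simulation makes the three anti-entropy variants concrete. This is a rough sketch under simplifying assumptions: rounds are synchronous, every node is active once per round, a node acts on the state it had at the start of the round, and the function name and parameters are made up for the example.

```python
import random


def rounds_to_spread(n, mode, seed=0):
    """Count rounds until one update reaches all n nodes under the given exchange mode."""
    rng = random.Random(seed)
    infected = [False] * n
    infected[0] = True                  # node 0 starts with the update
    rounds = 0
    while not all(infected):
        rounds += 1
        snapshot = list(infected)       # who had the update when the round began
        for p in range(n):
            q = rng.randrange(n - 1)    # pick a random partner q != p
            if q >= p:
                q += 1
            if mode in ("push", "push-pull") and snapshot[p]:
                infected[q] = True      # P pushes its update to Q
            if mode in ("pull", "push-pull") and snapshot[q]:
                infected[p] = True      # P pulls the update from Q
    return rounds


for mode in ("push", "pull", "push-pull"):
    print(mode, rounds_to_spread(64, mode))
```

In push mode the number of infected nodes can at most double per round, so at least log2(N) rounds are needed; running the simulation for several values of N is a quick way to see the logarithmic growth mentioned above.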

Implementation Implemented in Java. The local persistence component allows different storage engines to be plugged in: Berkeley Database (BDB) Transactional Data Store for objects of tens of kilobytes; MySQL for objects larger than tens of kilobytes; BDB Java Edition, etc.