Lecture 9: Dynamo Instructor: Weidong Shi (Larry), PhD

1 Lecture 9: Dynamo Instructor: Weidong Shi (Larry), PhD
COSC6376 Cloud Computing Lecture 9: Dynamo Instructor: Weidong Shi (Larry), PhD Computer Science Department University of Houston

2 Outline Dynamo

3 Dynamo

4 DynamoDB
Scalable: Dynamo architecture
Reliable: replicas over multiple data centers
Speed: fast, single-digit milliseconds
Schemaless

5 Data Model
Table: a container, similar to a worksheet in Excel; cannot query across domains.
Item: item name -> (attribute, value) pairs. An item is stored in a domain (a row in a worksheet; attributes are column names).
Example: domain "cars", Item 1: "car1": {"make": "BMW", "year": "2009"}

6 Data Model
dynamo = Fog::AWS::DynamoDB.new(
  aws_access_key_id: "YOUR KEY",
  aws_secret_access_key: "YOUR SECRET")
dynamo.create_table("people",
  {HashKeyElement: {AttributeName: "username", AttributeType: "S"}},
  {ReadCapacityUnits: 5, WriteCapacityUnits: 5})
Primary key of table: single key (hash)
Data types: simple (string and number); multi-valued (string set and number set)

7 Example

8 Access methods
Amazon DynamoDB is a web service that uses HTTP and HTTPS as the transport and JavaScript Object Notation (JSON) as the message serialization format.
APIs: Java, PHP, .NET

9 High Availability for Writes and Vector Clocks

10 Data Versioning
A put() call may return to its caller before the update has been applied at all the replicas. A get() call may return many versions of the same object.
Challenge: an object can have distinct version sub-histories, which the system will need to reconcile in the future.
Solution: use vector clocks to capture causality between different versions of the same object.

11 Logical Clock
Lamport, 1978.
Defines the "happened before" relationship and the clock condition.
Connects these concerns to special relativity.

12 Event Ordering and Clock

13 Happened Before
Assume that sending or receiving a message is an event in a process. Then we can define the "happened before" relation, denoted by "->", as follows:
1. If a and b are events in the same process, and a comes before b, then a -> b.
2. If a is the sending of a message by one process and b is the receipt of the same message by another process, then a -> b.
3. If a -> b and b -> c, then a -> c.

14 Happened Before
Two distinct events a and b are said to be concurrent if neither a -> b nor b -> a holds.

15 Logical Clock
If a happened before b, then it is possible for a to causally affect b. If neither can affect the other, then they are concurrent.
A logical clock is a function Ci which assigns a number Ci(a) to any event a in process Pi.
Clock Condition: for any events a, b: if a -> b then C(a) < C(b).

16 Lamport Clock
Each node keeps a logical clock, Cp.
Each node updates its logical clock between successive events: Cp ← Cp + 1.
A sender includes its clock value, ts, in the message: ts = Cp.
A receiver advances its clock to be greater than both the message's timestamp and its own clock: Cq ← max(ts, Cq) + 1.

17 Logical Clock
Logical clocks satisfy the clock condition:
C1. If a and b are events in process P, and a comes before b, then Cp(a) < Cp(b).
C2. If a is the sending of a message by process P, and b is the receipt of that message by process Q, then Cp(a) < Cq(b).
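The Lamport clock rules above can be sketched in a few lines. This is an illustrative class of my own (`LamportClock` is not from the lecture); the receive rule uses max(ts, Cq) + 1 so the receipt event itself gets a timestamp strictly greater than both clocks.

```ruby
# Minimal Lamport clock sketch: each node keeps a single counter.
class LamportClock
  attr_reader :time

  def initialize
    @time = 0
  end

  # Local event: advance the clock between successive events.
  def tick
    @time += 1
  end

  # On send: tick, then attach the clock value as the message timestamp.
  def send_event
    tick
    @time
  end

  # On receive: jump past both the sender's timestamp and our own clock.
  def receive(ts)
    @time = [@time, ts].max + 1
  end
end

p_clock = LamportClock.new
q_clock = LamportClock.new
ts = p_clock.send_event   # event a at P: Cp = 1
q_clock.receive(ts)       # event b at Q: Cq = max(0, 1) + 1 = 2
```

Here a -> b across the message, and indeed Cp(a) = 1 < Cq(b) = 2, satisfying conditions C1 and C2.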

18 Vector Clocks

19 Vector Clocks
Vector clocks are constructed by letting each node i maintain a vector VCi:
VCp[p] is the number of events that have occurred so far at node p. In other words, VCp[p] is the local logical clock at node p.
If VCq[p] = k, then node q knows that k events have occurred at p. It is thus node q's knowledge of the local time at node p.
Vector clocks preserve more information than logical clocks.

20 Vector Clock
Keep a timestamp for each process.
A process increments its own timestamp before each event.
A process updates its values of other processes' timestamps when receiving messages.

21 Vector Clocks
1. Before executing an event, node p executes VCp[p] ← VCp[p] + 1.
2. When node p sends a message m to node q, it sets the message's vector timestamp ts(m) equal to VCp after having executed step 1.
3. Upon the receipt of a message m, node q adjusts its own vector by setting VCq[k] ← max{VCq[k], ts(m)[k]} for each k, after which it executes step 1 and delivers the message to the application.
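The three update steps above can be sketched as follows (an illustrative class of my own devising; node ids are simply array indexes 0...n):

```ruby
# Vector clock sketch: each node holds one counter per node.
class VectorClock
  attr_reader :vc

  def initialize(node_id, n)
    @id = node_id
    @vc = Array.new(n, 0)
  end

  # Step 1: before executing an event, increment own component.
  def tick
    @vc[@id] += 1
  end

  # Step 2: tick, then send a copy of the vector as the timestamp.
  def send_message
    tick
    @vc.dup
  end

  # Step 3: component-wise max with the timestamp, then tick and deliver.
  def receive(ts)
    @vc = @vc.zip(ts).map { |own, other| [own, other].max }
    tick
  end
end

node_p = VectorClock.new(0, 2)
node_q = VectorClock.new(1, 2)
ts = node_p.send_message   # node_p.vc == [1, 0]
node_q.receive(ts)         # node_q.vc == [1, 1]
```

After the exchange, node q's clock [1, 1] records both its own event and its knowledge of one event at p.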

22 Vector Clock
A vector clock is a list of (node, counter) pairs.
Every version of every object is associated with one vector clock.
If every counter in the first object's clock is less than or equal to the corresponding counter in the second object's clock, then the first version is an ancestor of the second and can be forgotten.
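That dominance test can be written directly (a hypothetical helper of my own, with clocks as plain counter arrays):

```ruby
# first is an ancestor of second if every counter is <= its counterpart.
def ancestor?(first, second)
  first.zip(second).all? { |a, b| a <= b }
end

ancestor?([1, 0], [1, 1])   # => true: [1, 0] can be forgotten
ancestor?([2, 0], [1, 1])   # => false
ancestor?([1, 1], [2, 0])   # => false: neither dominates the other, so
                            #    the versions are concurrent and must be
                            #    reconciled
```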

23 Vector Clock Example

24 Handling Temporary Failures and Hinted Handoff

25 Sloppy Quorum R/W is the minimum number of nodes that must participate in a successful read/write operation. Setting R + W > N yields a quorum-like system. In this model, the latency of a get (or put) operation is dictated by the slowest of the R (or W) replicas. For this reason, R and W are usually configured to be less than N, to provide better latency.
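The overlap condition is simple arithmetic (hypothetical helper, not from the lecture):

```ruby
# With N replicas, a read set of size R and a write set of size W must
# intersect whenever R + W > N, so a read sees the latest write.
def quorum_overlap?(n, r, w)
  r + w > n
end

quorum_overlap?(3, 2, 2)   # => true: a classic quorum configuration
quorum_overlap?(3, 1, 1)   # => false: reads may miss recent writes
```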

26 Hinted Handoff
Assume N = 3. When A is temporarily down or unreachable during a write, the replica is sent to D instead. D is hinted that the replica belongs to A, and D will deliver it to A when A recovers.
Again: "always writeable".

27 Recovering from Permanent Failures and Merkle Trees

28 Replica Synchronization: The Merkle Hash Tree
A Merkle tree is a tree of hashes in which the leaves are hashes of the authentic data values n1, n2, ..., nw. The value of an internal node a is ha = h(h(n1) || h(n2)); the value of the root node is hr = h(ha || hb).
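The tree from the slide can be built directly. This is a sketch under the assumption that SHA-256 stands in for the hash function h; `merkle_root` is my own name, not Dynamo's API:

```ruby
require "digest"

# h: the hash function applied to leaves and internal nodes.
def h(x)
  Digest::SHA256.hexdigest(x)
end

# Build the root hash bottom-up: hash each value, then pair and re-hash
# until one hash remains (assumes the number of values is a power of 2).
def merkle_root(values)
  level = values.map { |v| h(v) }
  until level.size == 1
    level = level.each_slice(2).map { |l, r| h(l + r) }
  end
  level.first
end

root_a = merkle_root(%w[n1 n2 n3 n4])
root_b = merkle_root(%w[n1 n2 n3 n4])
root_c = merkle_root(%w[n1 nX n3 n4])   # one value differs

root_a == root_b   # => true: replicas in sync, nothing to transfer
root_a == root_c   # => false: descend the tree to find the stale range
```

Two replicas compare root hashes first; only on a mismatch do they recurse into subtrees, which is why Merkle trees make anti-entropy cheap.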

29 Membership and Failure Detection: Gossip Protocol

30 Gossip Protocol
Gossip-based algorithms propagate information in large peer-to-peer systems deployed on the Internet or on ad hoc networks.
Easy to deploy, robust, and resilient to failure.
Similar to how gossip spreads.

31 Gossip How do you gossip?
If someone tells you a hot piece of gossip, you'll try to tell other people. If you tell one person and they didn't know it beforehand, you'll feel some satisfaction and want to tell another person. If you tell N people and they all already knew it, you lose interest in telling more people.

32 Bad news travels fast

33 Gossip Protocol
Nodes are in one of three states:
Infected: holds data that it is willing to spread.
Susceptible: has not yet seen this data.
Removed: not able or willing to spread data.
Anti-entropy: node P picks another node Q at random and exchanges updates. Three approaches to the exchange: P only pushes to Q; P only pulls from Q; P and Q do an exchange.

34 Gossip Protocol
When it comes to rapidly spreading updates, only pushing updates turns out to be a bad choice; a pull-based approach works much better when many nodes are already infected.
A round is a period of time in which each node has had a chance to be active. It takes O(lg N) rounds to propagate a single update to all nodes.
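A toy push-only simulation illustrates the spread (my own sketch, not Dynamo's actual gossip implementation; the round cap just guarantees termination):

```ruby
require "set"

# Push-only anti-entropy: each round, every infected node pushes the
# update to one peer chosen uniformly at random (possibly itself).
def push_round(infected, n)
  infected.to_a.each { infected << rand(n) }
  infected
end

nodes = 32
infected = Set.new([0])   # node 0 starts with the update
rounds = 0
while infected.size < nodes && rounds < 1000
  push_round(infected, nodes)
  rounds += 1
end
# With overwhelming probability every node is infected well within the
# cap; the typical count is on the order of lg(nodes) rounds.
```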


39 Implementation
Java.
The local persistence component allows different storage engines to be plugged in:
Berkeley Database (BDB) Transactional Data Store: objects of tens of kilobytes
MySQL: objects larger than tens of kilobytes
BDB Java Edition, etc.

