Chapter 8 Coordination. Topics Election algorithms Mutual exclusion Deadlock Transaction.


Election Algorithms An election algorithm is the way the nodes in a distributed system (DS) elect a new coordinator when the old one has failed or has been cut off from the network. In the following algorithms, each processor (node) has a unique ID. Communications are reliable (messages are not dropped or corrupted).

Requirements Safety: each process Pi has coordinator = null or coordinator = P, where P is the live process with the highest ID. Liveness: each process Pi eventually has coordinator ≠ null or has failed.

The Bully Algorithm (Garcia-Molina) “The node with the highest ID bullies its way into leadership.” When a process P notices that the coordinator has failed, it holds an election: 1. P sends an ELECTION message (E-message) to all processes with higher IDs. 2. If no one responds, P wins the election and becomes coordinator. 3. If one of the higher-ups, say Q, answers, Q takes over and P’s job is done.

An Example a) Process 4 holds an election. b) Processes 5 and 6 respond, telling 4 to stop. c) Now 5 and 6 each hold an election.

An Example (Cont.) d) Process 6 tells 5 to stop. e) Process 6 wins and tells everyone.

The Cost In a network of N nodes, assume the coordinator with ID N fails. If the process with ID N-1 starts the election, the cost is O(N) messages. If the lowest-numbered node starts the election, the cost is O(N²) messages.
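The bully steps and their message cost can be sketched as a small simulation. This is a minimal sketch, not a networked implementation: the function name `bully_election`, the IDs 1..7, and the choice to count one reply per live higher process and N-1 COORDINATOR announcements are all assumptions made for illustration.

```python
from collections import deque


def bully_election(ids, alive, starter):
    """Simulate one bully election; return (leader, total_messages)."""
    messages = 0
    started = set()                 # processes that have already held an election
    pending = deque([starter])
    leader = None
    while pending:
        p = pending.popleft()
        if p in started:
            continue
        started.add(p)
        higher = [q for q in ids if q > p]
        messages += len(higher)             # ELECTION to every higher ID
        responders = [q for q in higher if q in alive]
        messages += len(responders)         # one reply from each live higher process
        if responders:
            pending.extend(responders)      # each responder holds its own election
        else:
            leader = p                      # nobody higher answered: p wins
            messages += len(ids) - 1        # COORDINATOR announcement to everyone else
    return leader, messages


ids = list(range(1, 8))       # hypothetical IDs 1..7
alive = set(ids) - {7}        # the old coordinator (ID 7) has crashed
print(bully_election(ids, alive, starter=6))   # election started by N-1
print(bully_election(ids, alive, starter=4))
print(bully_election(ids, alive, starter=1))   # election started by the lowest ID
```

Under these counting assumptions the election started by node 6 needs 7 messages, while the one started by node 1 needs 42, which illustrates the gap between the O(N) and O(N²) cases.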

A Ring Election Algorithm Nodes are physically or logically organized in a ring. Each node knows its successor. Node states are: Normal, Election, Leader. Any node that notices that the leader is not functioning changes its state to Election, creates an election message containing its ID, and sends it to its clockwise neighbor.

An Example

A Ring Election Algorithm (2) When a node receives an election message: If the message does not contain its own ID, it adds its ID to the message and sends it to its successor. If the message already contains its own ID, the message has gone all the way around the ring, so the node sends a COORDINATOR message that names the list member with the highest ID as the coordinator. This message circulates once.
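The ring election can be sketched as follows. This is a minimal single-initiator sketch; the name `ring_election`, the example ring, and the assumption that a crashed successor is simply skipped are illustrative choices, not part of the slides.

```python
def ring_election(ring, alive, starter):
    """ring lists node IDs in clockwise order; starter must be alive.

    Returns (leader, member_list, total_messages).
    """
    n = len(ring)

    def successor(i):
        # Next live node clockwise (assumption: crashed nodes are skipped).
        j = (i + 1) % n
        while ring[j] not in alive:
            j = (j + 1) % n
        return j

    start = ring.index(starter)
    members = [starter]          # the election message carries the IDs seen so far
    messages = 1                 # starter sends the message to its successor
    i = successor(start)
    while i != start:
        members.append(ring[i])  # each node adds its own ID and forwards
        i = successor(i)
        messages += 1
    leader = max(members)        # back at the starter: pick the highest ID
    messages += len(members)     # COORDINATOR message circulates once
    return leader, members, messages


print(ring_election([3, 5, 0, 7, 2], alive={3, 5, 0, 2}, starter=5))
```

With four live nodes the run uses 8 messages, one full circuit for the election message and one for the COORDINATOR message.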

An Example

An Example (Cont.)

Complexity In the best case, only one node starts an election message, so the number of messages is 2N. In the worst case, N nodes each start an election message, resulting in O(N²) messages. Improvements: drop election messages arriving in less than time T, where T is the time a message takes to traverse the ring. Does it work?

LCR Ring Election Each node sends a message with its ID around the ring. When a process receives an incoming message, it compares the ID with its own. If the incoming ID is greater than its own, it passes it to the next node; if it is less than its own, it discards it; if it is equal to its own, it declares itself leader. (Figure: a ring of nodes 3, 5 and 0 sending Elect 3, Elect 5 and Elect 0.)

Complexity If messages are passed clockwise, only one survives after the first round. If messages are passed counter-clockwise... Best case O(N), worst case O(N²). (Figure: a ring of nodes 1, 2, 3 and 0 sending Elect 0, Elect 1, Elect 2 and Elect 3.)
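The LCR rule and its best and worst cases can be sketched with a round-based simulation. This is a sketch under the assumption of synchronous rounds, in which node `ring[i]` always sends to `ring[(i + 1) % n]`; the function name `lcr_election` is hypothetical.

```python
def lcr_election(ring):
    """Return (leader, total_messages) for one LCR election on the given ring."""
    n = len(ring)
    outgoing = list(ring)        # outgoing[i]: ID node i sends this round (None = nothing)
    messages = 0
    leader = None
    while leader is None:
        arriving = [None] * n
        for i, token in enumerate(outgoing):
            if token is not None:
                arriving[(i + 1) % n] = token
                messages += 1
        outgoing = [None] * n
        for i, token in enumerate(arriving):
            if token is None:
                continue
            if token > ring[i]:
                outgoing[i] = token      # larger ID: pass it on
            elif token == ring[i]:
                leader = ring[i]         # own ID came back: declare leader
            # smaller ID: discard
    return leader, messages


print(lcr_election([0, 1, 2, 3]))   # IDs increase in the sending direction
print(lcr_election([3, 2, 1, 0]))   # IDs decrease in the sending direction
```

In the first ring every token except the largest is discarded after one hop, giving 7 messages for N = 4; in the second each token travels as far as it can, giving 10 messages, that is N(N+1)/2.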

HS (Hirschberg-Sinclair) Ring Election (1) Motivation: O(N²) is a lot of messages; improve it to O(N log N). Assumptions: the ring size can be unknown. Communications must be bidirectional. All nodes start at more or less the same time. Each node operates in phases and sends out tokens. The tokens carry hop counts and direction flags in addition to the ID of the sender. (Figure: node 3 sends two tokens, “ID=3, 2 hops, clockwise” and “ID=3, 2 hops, counter-clockwise”.)

HS Ring Election (2) Phases are numbered 0, 1, 2, 3, …, ⌈log₂ N⌉. In each phase k, node j sends out tokens u_j containing its ID in both directions. The tokens travel only 2^k hops outbound and then return to their origin j. If both tokens make it back, process j continues with the next phase (increments k). If either token fails to make it back, process j simply waits to be told the result of the election. (Figure: tokens from node 3 traveling outbound and inbound.)

HS Ring Election (3) All processes always relay inbound tokens. If a process i receives a token u_j going in the outbound direction, it compares the token’s ID with its own. If its own ID is larger, it simply discards the token. If its own ID is smaller, it relays the token as requested. If its own ID is equal to the token ID, it has received its own token in the outbound direction, so the token has gone clear around the ring and the process declares itself leader. (Figure: node 4 receives the token “ID=3, 2 hops, clockwise”.)

Complexity Communications complexity: in the first phase (k = 0), every process sends out 2 tokens, which go one hop and return. This is a total of 4N messages for the tokens to go out and return. In phase k, where k > 0, a node sends out tokens only if it was not overruled in the previous phase, that is, by a process within a distance of 2^(k-1) in either direction. This implies that within any group of 2^(k-1) + 1 consecutive nodes, at most one goes on to send out tokens in phase k. Each such node causes at most 4·2^k messages (two tokens, out and back), so a phase costs at most 8N messages; with about log₂ N phases, the message complexity is O(N log N).
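The phase structure of HS can be sketched at a coarse level. This is a simplified, sequential simulation and not a faithful asynchronous implementation: it walks each token hop by hop, counts one message per hop, and stops as soon as a token arrives back at its origin in the outbound direction, so the message count is approximate. The name `hs_election` is hypothetical and IDs are assumed unique.

```python
def hs_election(ring):
    """Return (leader, approximate_messages) for an HS election on the ring."""
    n = len(ring)
    active = list(range(n))          # positions still sending tokens
    messages = 0
    k = 0
    while True:
        reach = 2 ** k               # tokens travel 2^k hops outbound in phase k
        survivors = []
        for j in active:
            returned = 0
            for step in (1, -1):     # clockwise and counter-clockwise tokens
                pos = j
                for _ in range(reach):
                    pos = (pos + step) % n
                    messages += 1            # one outbound hop
                    if pos == j:             # own token arrived outbound: leader
                        return ring[j], messages
                    if ring[pos] > ring[j]:  # a larger ID discards the token
                        break
                else:
                    messages += reach        # token is relayed back inbound
                    returned += 1
            if returned == 2:                # both tokens came back: next phase
                survivors.append(j)
        active = survivors
        k += 1


print(hs_election([3, 5, 0]))
print(hs_election(list(range(16))))
```

The node with the largest ID is never overruled, so it survives every phase and is elected once 2^k reaches the ring size.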

Mutual Exclusion in DS Mutual exclusion is needed to restrict access to a shared resource. On a centralized system we use semaphores, monitors and similar constructs to enforce mutual exclusion. We need the same capabilities in a DS. As in the single-processor case, we are interested in safety (mutual exclusion), progress, and bounded waiting (fairness).

Solutions Centralized lock manager. Token-passing lock manager. Distributed lock manager: Ricart/Agrawala algorithm, voting, quorums.

A Centralized Algorithm a) Process 1 asks the coordinator for permission to enter a critical region. Permission is granted. b) Process 2 then asks permission to enter the same critical region. The coordinator does not reply. c) When process 1 exits the critical region, it tells the coordinator, which then replies to 2.
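The three steps above can be sketched with a tiny in-process coordinator. This is a minimal sketch, assuming a FIFO queue of waiting requests; the class name `Coordinator` and its methods are hypothetical, and real message passing is replaced by method calls.

```python
from collections import deque


class Coordinator:
    """Centralized lock manager for one critical region."""

    def __init__(self):
        self.holder = None
        self.waiting = deque()

    def request(self, pid):
        """Return True if permission is granted at once; otherwise queue pid and stay silent."""
        if self.holder is None:
            self.holder = pid
            return True
        self.waiting.append(pid)
        return False

    def release(self, pid):
        """Called when pid leaves the region; return the process granted next, or None."""
        if self.holder != pid:
            raise ValueError("only the current holder can release")
        self.holder = self.waiting.popleft() if self.waiting else None
        return self.holder


c = Coordinator()
print(c.request(1))   # a) process 1 is granted permission
print(c.request(2))   # b) process 2 gets no reply and is queued
print(c.release(1))   # c) process 1 exits, the coordinator now replies to 2
```

One entry costs three messages in this scheme: request, grant, and release.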

Problems with Centralized Locking? Other issues?

The Token Ring Algorithm Assumption: Processes are ordered in a ring. Communications are reliable and can be limited to one direction. The size of the ring can be unknown, and each process is only required to know its immediate neighbor. A single token circulates around the ring (in one direction only). (Figure: a ring of nodes 3, 5 and 0 with a token.)

Algorithm Details When a process has the token, it can enter the CR at most once. Then it must pass the token on. Only the process with the token can enter the CR, thus mutual exclusion is ensured. Bounded waiting holds since the token circulates. Liveness: as long as the process with the token doesn’t fail, progress is ensured. Global snapshots can be used if a lost token is suspected. (Figure: a ring of nodes 3, 5 and 0 with a token.)
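The circulation rule can be sketched as a loop over the ring. This is a minimal sketch with no failures; the function name `token_ring` and the `wants` mapping (process ID to number of CR entries it still wants) are assumptions made for illustration.

```python
def token_ring(ring, wants, start=0):
    """Circulate the token until all requests are served.

    Returns (order_of_CR_entries, token_hops).
    """
    assert set(wants) <= set(ring), "every requester must be on the ring"
    pending = dict(wants)
    order = []
    holder = start
    hops = 0
    while any(pending.values()):
        pid = ring[holder]
        if pending.get(pid, 0) > 0:
            order.append(pid)        # enter the CR at most once per token visit
            pending[pid] -= 1
        holder = (holder + 1) % len(ring)   # then pass the token on
        hops += 1
    return order, hops


print(token_ring([3, 5, 0], wants={3: 2, 0: 1}))
```

Process 3 wants the CR twice but must wait for the token to go around once more before its second entry, which is the bounded-waiting property in action.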

Problems with Token-Algorithm 1. How to distinguish a lost token from one that is being held for a long time? 2. What happens if the token holder crashes for some time? 3. How to maintain the logical ring if a participant drops out of the system (voluntarily or by failure)? 4. How to identify and add new participants? 5. The token is passed around the ring perpetually, even when no participant wants to enter its CS ⇒ unnecessary overhead that consumes bandwidth. 6. The ring imposes an average delay of N/2 hops, which limits scalability.

Distributed Algorithm: Ricart and Agrawala Timestamp Algorithm Assumption: there is a total ordering of all events in the system (Lamport’s timestamps will provide this). Communications are reliable. Each process must maintain a queue for each critical region or resource if there is more than one resource to be shared. (Figure: processes 0, 1 and 2 sharing a resource.)

Ricart and Agrawala (2) When a process wants to enter the Critical Region or obtain a resource, it sends a message with its ID and a Lamport timestamp (t, pid) to all other processes. It can proceed to enter the CR when it gets an “OK” message from all other processes. When it is done with the CR, it sends an “OK” message to every process on its wait queue and removes them from the queue.

Ricart and Agrawala (3) When a process, P1, receives a request for the resource from process, P2: If P1 is not in the CR and does not want the CR, it sends back an “OK” message. If P1 is currently in the CR, it does not reply, but queues P2’s request. If P1 wants to enter the CR but has not yet received all the permissions, it compares the timestamp in P2’s message with the one in the message that P1 sent out to request the CR. The lowest timestamp wins. If TS(P1) < TS(P2), then P2’s message is put on the queue. If TS(P1) > TS(P2), then P1 sends P2 an “OK” message.

Ricart and Agrawala (4) a) Two processes want to enter the same critical region at the same moment. b) Process 0 has the lowest timestamp, so it wins. c) When process 0 is done, it sends an OK also, so 2 can now enter the critical region.
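The request-handling rules can be sketched per process. This is a minimal sketch without a network layer: the class `Process`, its state names, and the timestamps 8 and 12 in the usage are hypothetical, and requests are compared as `(timestamp, pid)` tuples so that ties cannot occur.

```python
class Process:
    """One participant in the Ricart and Agrawala algorithm."""

    def __init__(self, pid):
        self.pid = pid
        self.state = "RELEASED"      # RELEASED, WANTED or HELD
        self.request_ts = None
        self.deferred = []           # requests queued until we leave the CR

    def request(self, clock):
        self.state = "WANTED"
        self.request_ts = (clock, self.pid)
        return self.request_ts       # sent to all other processes

    def on_request(self, ts, sender):
        """Return True if OK is sent at once, False if the request is queued."""
        if self.state == "HELD" or (self.state == "WANTED" and self.request_ts < ts):
            self.deferred.append(sender)
            return False
        return True

    def enter(self):
        self.state = "HELD"          # call only after OK from all other processes

    def release(self):
        self.state = "RELEASED"
        self.request_ts = None
        to_notify, self.deferred = self.deferred, []
        return to_notify             # processes that now receive OK


p0, p1, p2 = Process(0), Process(1), Process(2)
ts0 = p0.request(8)                  # hypothetical timestamps
ts2 = p2.request(12)
print(p1.on_request(ts0, 0), p2.on_request(ts0, 0))   # both answer OK to process 0
print(p1.on_request(ts2, 2), p0.on_request(ts2, 2))   # process 0 queues the request from 2
p0.enter()
print(p0.release())                  # process 0 is done and sends OK to 2
```

Process 0 has the lower timestamp, so it collects both OK messages first; process 2 gets its last OK only when process 0 leaves the critical region.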

Analysis No tokens are needed. Cooperative voting determines the sequence of CSs. The algorithm does not rely on an interconnection medium that offers ordered messages. Serialization is based on logical timestamps (total ordering). If a participant wants to enter its CS, it asks all others for permission and does not proceed until all others have agreed. If a participant gets a permission request and is not interested in its CS, it returns permission immediately to the requester. Message complexity: 2(N-1). The algorithm ensures mutual exclusion (no two requests have the lowest timestamp), progress (some request has the lowest timestamp), and bounded waiting.

Voting for Mutual Exclusion Potential problems: You must be sure you have more votes than any other process to enter the CR: if P1 has 4 votes, P2 has 3 and P3 has 2, P1 has the most votes, but how does it know without (costly) communication with the other contenders? Just having 4 votes is not enough: what if P1 has 4 and P2 has 5? Potential solution: require a simple majority to win. But 4 is not a majority of 9, so in this example no one can go. Worse: the processes are deadlocked. There must be a way to resolve this kind of deadlock.

Timestamp Resolution When a process makes a request, it attaches a Lamport timestamp. Voters prefer candidates with the smaller timestamp. If voter V has voted for P1 and then receives a request for its vote from P2 with an earlier timestamp, V will try to retrieve its vote. V retrieves its vote by sending an INQUIRE message to P1. If P1 has not yet received all the votes it needs, it must relinquish V’s vote, in which case V gives its vote to P2. This avoids deadlock. When P1 is finished with the CR, it sends release messages to all its voters, so they can give their votes to new candidates.

Anti-quorum Resolution An anti-quorum is any set of nodes that has a non-empty intersection with all quorums. A voter votes YES to one process and NO to other processes seeking the same resource. When a process gets a quorum of YES votes, it proceeds to the CR. When it gets an anti-quorum of NO votes, it knows it will not get enough YES votes, so it “withdraws its candidacy” and releases its votes. After waiting a specified time, it tries again to gain enough votes.

Quorums Do we need to get a majority of votes or is there some smaller set of votes that will do? Different nodes could have different voting districts as long as any two districts have a non-empty intersection. Quorums have the property that any 2 have a non-empty intersection. Simple majorities are quorums. Any 2 sets whose sizes are simple majorities must have at least one element in common.

Quorums (2) Grid quorum: arrange the nodes in a logical (square) grid. A quorum is all of a row plus all of a column. Quorum size is 2*sqrt(n) - 1. Finite projective plane (Maekawa): if N = 7, form voting sets (quorums) of 3 nodes each.
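The grid construction and its intersection property can be checked with a few lines. This is a minimal sketch, assuming nodes are numbered 0..n-1 row by row and that n is a perfect square; the function name `grid_quorum` is hypothetical.

```python
import math
from itertools import combinations


def grid_quorum(node, n):
    """Return the quorum of `node`: its whole row plus its whole column."""
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError("n must be a perfect square")
    row, col = divmod(node, side)
    row_members = {row * side + c for c in range(side)}
    col_members = {r * side + col for r in range(side)}
    return row_members | col_members


n = 9
print(sorted(grid_quorum(4, n)))     # row and column of the center node
print(all(grid_quorum(a, n) & grid_quorum(b, n)
          for a, b in combinations(range(n), 2)))
```

Any row crosses any column, so two grid quorums always share at least one node, and each quorum has 2*sqrt(n) - 1 = 5 members for n = 9.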

Comparison
Algorithm | Messages per entry/exit | Delay before entry (in message times) | Problems
Centralized | 3 | 2 | Coordinator crash
Token ring | 1 to ∞ | 0 to n-1 | Lost token, process crash
Distributed | 2(n-1) | - | Crash of process
Voting | 2(n-1) | - | Crash of process

Transaction Properties (ACID) Atomicity: either all operations of the transaction are properly reflected in the database or none are. Consistency: execution of a transaction in isolation preserves the consistency of the database. Isolation: although multiple transactions may execute concurrently, each transaction must be unaware of other concurrently executing transactions. Intermediate transaction results must be hidden from other concurrently executed transactions. Durability: after a transaction completes successfully, the changes it has made to the database persist, even if there are system failures.

Example: Funds Transfer Transaction to transfer $50 from account A to account B: 1. read(A) 2. A := A - 50 3. write(A) 4. read(B) 5. B := B + 50 6. write(B). Consistency requirement: the sum of A and B is unchanged by the execution of the transaction. Atomicity requirement: if the transaction fails after step 3 and before step 6, the system ensures that its updates are not reflected in the database.

Example: Funds Transfer continued Durability requirement: once the user has been notified that the transaction has completed (i.e., the transfer of the $50 has taken place), the updates to the DB must persist despite failures. Isolation requirement: if another transaction is allowed to access the partially updated database between steps 3 and 6, it will see an inconsistent database (the sum A + B will be less than it should be). Isolation can be ensured by running transactions serially.
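The atomicity and consistency requirements of the transfer can be sketched in memory. This is a minimal sketch, assuming the database is a plain dictionary and that a crash between steps 3 and 6 is simulated by an exception; the function `transfer` and the flag `fail_after_debit` are hypothetical.

```python
def transfer(db, src, dst, amount, fail_after_debit=False):
    """Move `amount` from src to dst; restore the old values if anything fails."""
    before = dict(db)                # saved copy used for rollback
    try:
        a = db[src]                  # 1. read(A)
        a = a - amount               # 2. A := A - amount
        db[src] = a                  # 3. write(A)
        if fail_after_debit:
            raise RuntimeError("simulated failure between step 3 and step 6")
        b = db[dst]                  # 4. read(B)
        b = b + amount               # 5. B := B + amount
        db[dst] = b                  # 6. write(B)
    except Exception:
        db.clear()
        db.update(before)            # atomicity: none of the updates survive
        raise


db = {"A": 100, "B": 200}
transfer(db, "A", "B", 50)
print(db, sum(db.values()))          # the sum A + B is unchanged

try:
    transfer(db, "A", "B", 50, fail_after_debit=True)
except RuntimeError:
    pass
print(db)                            # the failed transfer left no trace
```

Durability and isolation are outside the scope of this sketch, since nothing is written to stable storage and there is no concurrency.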

The Transaction Model
Primitive | Description
BEGIN_TRANSACTION | Mark the start of a transaction
END_TRANSACTION | Terminate the transaction and try to commit
ABORT_TRANSACTION | Kill the transaction and restore the old values
READ | Read data from a file, a table, or otherwise
WRITE | Write data to a file, a table, or otherwise

Transaction Types Flat transactions: no partial results are available. Nested transactions: a nested transaction is logically decomposed into a hierarchy of sub-transactions; this allows partial results to be committed. Distributed transactions: a distributed transaction is a logically flat, indivisible transaction that operates on distributed data.

Distributed Transactions: Illustration

Private Workspace a) The file index and disk blocks for a three-block file. b) The situation after a transaction has modified block 0 and appended block 3. c) After committing. Q: What is the cost of copying data?

More Efficient Implementation Two common methods of implementation are write-ahead logs and before/after images. With write-ahead logs, the transactions act on the permanent workspace, but before they can make a change, a log record is written to stable storage with the transaction and data item ID and the old and new values. This log can then be used if the transaction aborts and the changes need to be rolled back.

Write-ahead Log
(a) x = 0; y = 0; BEGIN_TRANSACTION; x = x + 1; y = y + 2; x = y * y; END_TRANSACTION;
(b) Log: [x = 0/1]
(c) Log: [x = 0/1] [y = 0/2]
(d) Log: [x = 0/1] [y = 0/2] [x = 1/4]
a) A transaction. b) to d) The log before each statement is executed.
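The log-then-write discipline and the rollback it enables can be sketched as follows. This is a minimal in-memory sketch: the class `WriteAheadLog` is hypothetical, and a Python list stands in for stable storage.

```python
class WriteAheadLog:
    """Record (item, old value, new value) before each change is applied."""

    def __init__(self, store):
        self.store = store
        self.log = []

    def write(self, key, new_value):
        self.log.append((key, self.store[key], new_value))   # log record first
        self.store[key] = new_value                          # then the real update

    def abort(self):
        for key, old_value, _new in reversed(self.log):      # undo, newest first
            self.store[key] = old_value
        self.log.clear()

    def commit(self):
        self.log.clear()             # the changes are already in place


store = {"x": 0, "y": 0}
txn = WriteAheadLog(store)
txn.write("x", store["x"] + 1)
txn.write("y", store["y"] + 2)
txn.write("x", store["y"] * store["y"])
print(txn.log)                       # matches the three log records shown above
txn.abort()
print(store)                         # the old values are restored
```

Undoing the records in reverse order matters here because x is written twice; replaying them oldest first would leave x at 1 instead of 0.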

Before- and After- Images A before- and after-image is kept for each data item. When a data item is changed, the old value is written to the before-image and the new value is the after- image. Other transactions are not allowed to “see” the new value until the current transaction commits. The after-image is made permanent and durable once the transaction which wrote it commits. If the transaction aborts, the before-image is restored.

DBMS Organization General organization of managers for handling transactions.

DBMS Organization

Levels of Consistency (SQL92) Serializable: the default. Repeatable read: only committed records may be read, and repeated reads of the same record must return the same value. However, a transaction may not be serializable. Read committed: only committed records can be read, but successive reads of a record may return different (but committed) values. Read uncommitted: even uncommitted records may be read (browse).

Serializability
(a) BEGIN_TRANSACTION x = 0; x = x + 1; END_TRANSACTION
(b) BEGIN_TRANSACTION x = 0; x = x + 2; END_TRANSACTION
(c) BEGIN_TRANSACTION x = 0; x = x + 3; END_TRANSACTION
Schedule 1 | x = 0; x = x + 1; x = 0; x = x + 2; x = 0; x = x + 3; | Legal
Schedule 2 | x = 0; x = 0; x = x + 1; x = x + 2; x = 0; x = x + 3; | Legal
Schedule 3 | x = 0; x = 0; x = x + 1; x = 0; x = x + 2; x = x + 3; | Illegal
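The three schedules can be checked mechanically. This is a minimal sketch that uses a simplified test, namely whether the final value of x matches the result of some serial order of (a), (b) and (c); it is an illustration of the idea, not a general serializability checker, and all names in it are hypothetical.

```python
from itertools import permutations


def reset(x):
    return 0                         # models "x = 0"


def add(k):
    return lambda x: x + k           # models "x = x + k"


def run(ops):
    x = None
    for op in ops:
        x = op(x)
    return x


transactions = {"a": [reset, add(1)], "b": [reset, add(2)], "c": [reset, add(3)]}

# Final values of x produced by every serial order of the three transactions.
serial_results = {
    run([op for name in order for op in transactions[name]])
    for order in permutations(transactions)
}

schedule1 = [reset, add(1), reset, add(2), reset, add(3)]
schedule2 = [reset, reset, add(1), add(2), reset, add(3)]
schedule3 = [reset, reset, add(1), reset, add(2), add(3)]

for name, schedule in (("1", schedule1), ("2", schedule2), ("3", schedule3)):
    final = run(schedule)
    print("Schedule", name, "final x =", final,
          "legal" if final in serial_results else "illegal")
```

Schedules 1 and 2 end with x = 3, which a serial execution can produce, while schedule 3 ends with x = 5, which no serial order yields.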

Two-Phase Locking (2PL)

Strict 2PL

Pessimistic Timestamp Ordering Target: enforce serializability Every transaction gets a (Lamport, totally ordered) timestamp. Every data item has a read ts and a write ts and a commit bit c. The commit bit c is true if and only if the most recent transaction to write to that item has committed. The scheduler maintains the item timestamps and checks to make sure the reads and writes are correct.

Read Too Late (Timeline: T1 starts, T2 starts, T2 writes X, T1 reads X?) T1 tries to read X, but ts(T1) < write-ts(X), meaning X has been written by a later transaction. T1 should not be allowed to read X because it was written by a transaction that occurs later in the serialization order (transactions are serialized by start time). Solution: T1 is aborted.

Write Too Late (Timeline: T1 starts, T2 starts, T2 reads X, T1 writes X?) T1 tries to write X, but ts(T1) < read-ts(X): a later transaction has already read X and should have read the value that T1 is about to write. Solution: T1 is aborted.

Dirty Reads (Timeline: T2 starts, T1 starts, T2 writes X, T1 reads X?, T2 aborts) T1 reads X, which was last written by T2. The timestamps are properly ordered, but the commit bit c = false, so if T2 later aborts then T1 must abort too. Solution: we can avoid cascading aborts by delaying T1’s read until T2 has committed (though this is not necessary to ensure serializability).

Thomas Write Rule (Timeline: T1 starts, T2 starts, T2 writes X, T1 writes X?) T2 has written to X before T1. When T1 tries to write, the appropriate action is to do nothing. No other transaction T3 that should have read T1’s value of X got T2’s value instead, because it would have been aborted for a too-late read. Future reads of X want T2’s value or a later value, not T1’s value. Solution: T1’s write can be skipped.

TS Ordering Rules When the scheduler receives a read request for X from transaction T: If ts(T) >= write-ts(X) and c(X) is true, grant the request and set read-ts(X) to MAX{ts(T), read-ts(X)}. If ts(T) >= write-ts(X) and c(X) is false, delay T until c(X) becomes true or the transaction that wrote X aborts. If ts(T) < write-ts(X), abort T and restart it with a new timestamp.

TS Ordering Rules, continued When the scheduler receives a write request for X from transaction T: If ts(T) >= read-ts(X) and ts(T) >= write-ts(X), grant the request, set write-ts(X) to ts(T) and set c(X) = false. If ts(T) >= read-ts(X) and ts(T) < write-ts(X), don’t do the operation but allow T to continue as if it had been done (Thomas write rule). If ts(T) < read-ts(X), abort T and restart it with a new timestamp.
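The read and write rules translate almost directly into two functions. This is a minimal sketch of the scheduler decisions only: the class `Item`, the return strings, and the initial timestamps of 0 are assumptions, and actually delaying, aborting or restarting a transaction is left to the caller.

```python
class Item:
    """Per-item bookkeeping: read timestamp, write timestamp and commit bit."""

    def __init__(self):
        self.read_ts = 0
        self.write_ts = 0
        self.committed = True


def read(item, ts):
    if ts < item.write_ts:
        return "abort"               # read too late
    if not item.committed:
        return "delay"               # wait for the writer to commit or abort
    item.read_ts = max(ts, item.read_ts)
    return "grant"


def write(item, ts):
    if ts < item.read_ts:
        return "abort"               # write too late
    if ts < item.write_ts:
        return "skip"                # Thomas write rule: pretend the write happened
    item.write_ts = ts
    item.committed = False
    return "grant"


x = Item()
print(write(x, 2), read(x, 1))       # T2 writes X, then T1 reads too late

x = Item()
print(read(x, 2), write(x, 1))       # T2 reads X, then T1 writes too late

x = Item()
print(write(x, 2), write(x, 1))      # T1's late write is skipped

x = Item()
print(write(x, 1), read(x, 2))       # T2's read is delayed: the writer has not committed
```

The four usage lines correspond to the read-too-late, write-too-late, Thomas write rule and dirty-read cases, with ts(T1) = 1 and ts(T2) = 2 as hypothetical timestamps.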

Optimistic Timestamp Ordering In any optimistic concurrency control, each transaction does its writes to a private workspace until completion of a validation phase. In the validation phase, the scheduler validates the transaction by comparing its read set and write set with those of other transactions. After validation, the write set values are written to the database and the transaction commits. Validation is frequently done with the help of timestamps.

Two-Phase Commit (2PC) When several databases take part in a single transaction, a protocol called Two-Phase Commit is used. Each database is assumed to have its own local “resource manager”. A single system component called the Coordinator controls the whole process.

Steps Phase 1: The coordinator sends a VOTE_REQUEST message. Clients return VOTE_COMMIT or VOTE_ABORT. Phase 2: The coordinator collects all votes and sends GLOBAL_COMMIT or GLOBAL_ABORT. Each client commits or aborts. Important factor: time-out.
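The two phases can be sketched from the coordinator’s point of view. This is a minimal sketch with no real messaging or logging: the function `two_phase_commit` is hypothetical, each participant is modeled as a callable that returns its vote, and treating a missing vote (`None`, standing for a time-out) as an abort is an assumption of this sketch.

```python
def two_phase_commit(participants):
    """participants maps a name to a callable returning its vote.

    Returns (global_decision, collected_votes).
    """
    # Phase 1: send VOTE_REQUEST to everyone and collect the answers.
    votes = {}
    for name, vote in participants.items():
        votes[name] = vote()         # None models a participant that timed out

    # Phase 2: commit only if every participant voted to commit.
    if all(v == "VOTE_COMMIT" for v in votes.values()):
        decision = "GLOBAL_COMMIT"
    else:
        decision = "GLOBAL_ABORT"
    return decision, votes


print(two_phase_commit({"db1": lambda: "VOTE_COMMIT", "db2": lambda: "VOTE_COMMIT"})[0])
print(two_phase_commit({"db1": lambda: "VOTE_COMMIT", "db2": lambda: "VOTE_ABORT"})[0])
print(two_phase_commit({"db1": lambda: "VOTE_COMMIT", "db2": lambda: None})[0])
```

A single VOTE_ABORT or a single missing vote is enough to abort the whole transaction, which is why time-outs matter so much in this protocol.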

2PC (2) a) The finite state machine for the coordinator in 2PC. b) The finite state machine for a participant. 1) What if a client fails? 2) What if the coordinator fails?
