Lecture 19-1 Computer Science 425 Distributed Systems CS 425 / ECE 428 Fall 2013 Indranil Gupta (Indy) October 29, 2013 Lecture 19 Gossiping Reading: Section.

Presentation transcript:

Lecture 19-1 Computer Science 425 Distributed Systems CS 425 / ECE 428 Fall 2013 Indranil Gupta (Indy) October 29, 2013 Lecture 19 Gossiping Reading: Section 18.4 (relevant parts) © 2013, I. Gupta.

Lecture 19-2 Passive (Primary-Backup) Replication
– Request Communication: the request is issued to the primary RM and carries a unique request id.
– Coordination: Primary takes requests atomically, in order, checks id (resends response if not new id).
– Execution: Primary executes & stores the response.
– Agreement: If update, primary sends updated state/result, req-id and response to all backup RMs (1-phase commit enough).
– Response: primary sends result to the front end.
[Figure: clients, front ends, a primary RM and backup RMs]
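The five phases above can be sketched in a few lines. This is a minimal illustration, not the course's implementation: the counter service, the class names `Primary` and `Backup`, and the method names are all hypothetical.

```python
class Backup:
    """Backup RM: passively stores what the primary sends."""

    def __init__(self):
        self.state = 0
        self.responses = {}

    def install(self, state, req_id, response):
        self.state = state
        self.responses[req_id] = response


class Primary:
    """Primary RM for a toy counter service (hypothetical example service)."""

    def __init__(self, backups):
        self.state = 0
        self.responses = {}  # request id -> stored response
        self.backups = backups

    def handle(self, req_id, delta):
        # Coordination: a known id is a retransmission, so resend the stored response.
        if req_id in self.responses:
            return self.responses[req_id]
        # Execution: run the request and store the response.
        self.state += delta
        response = self.state
        self.responses[req_id] = response
        # Agreement: push new state, request id and response to all backups.
        for backup in self.backups:
            backup.install(self.state, req_id, response)
        # Response: the result goes back to the front end.
        return response
```

Because the response is stored under the request id, a retransmitted request is answered without being executed twice.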

Lecture 19-3 Active Replication
– Request Communication: The request contains a unique identifier and is multicast to all by a reliable totally-ordered multicast.
– Coordination: Group communication ensures that requests are delivered to each RM in the same order (but may be at different physical times!).
– Execution: Each replica executes the request. (Correct replicas return same result since they are running the same program, i.e., they are replicated protocols or replicated state machines)
– Agreement: No agreement phase is needed, because of multicast delivery semantics of requests
– Response: Each replica sends response directly to FE
[Figure: clients, front ends and a group of RMs]

Lecture 19-4 Eager versus Lazy
Eager replication, e.g., B-multicast, R-multicast, etc. (previously in the course)
– Multicast request to all RMs immediately
Alternative: Lazy replication
– “Don’t hurry; Be lazy.”
– Allow replicas to converge eventually and lazily
– Propagate updates and queries lazily, e.g., when network bandwidth is available
– Allow other RMs to be disconnected/unavailable
– May provide weaker consistency than sequential consistency, but improves performance
Lazy replication can be provided by using gossiping

Lecture 19-5 Multicast
[Figure: a distributed group of processes at Internet-based hosts; one process has a piece of information to be communicated to everyone]

Lecture 19-6 Fault-tolerance and Scalability
[Figure: a multicast sender and a multicast protocol over possibly 1000’s of processes; processes may crash and packets may be dropped]

Lecture 19-7 Centralized (B-multicast)
[Figure: UDP/TCP packets]
Simplest implementation
Problems?

Lecture 19-8 R-multicast
[Figure: UDP/TCP packets]
Reliability (atomicity)
+ Every process B-multicasts the message
Overhead is quadratic in N

Lecture 19-9 Tree-Based
[Figure: UDP/TCP packets]
Application-level: SRM, RMTP, TRAM, TMTP
Also network-level: IP multicast
Tree setup and maintenance
Problems?

A Third Approach
[Figure: multicast sender]

Gossip messages (UDP)
Periodically, transmit to b random targets

Other processes do same after receiving multicast
Gossip messages (UDP)

Lecture 19-13

“Epidemic” Multicast (or “Gossip”)
[Figure: infected and non-infected processes exchanging gossip messages (UDP)]
Protocol rounds (local clock)
b random targets per round

Properties
Claim that this simple protocol
– Is lightweight in large groups
– Spreads a multicast quickly
– Is highly fault-tolerant
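To get a feel for these claims, here is a small simulation sketch of push gossip in synchronized rounds. The function name `gossip_rounds`, the `loss` parameter and the choice to let a process pick itself as a target are assumptions made for illustration.

```python
import random


def gossip_rounds(n, b=1, loss=0.0, seed=0, max_rounds=10_000):
    """Rounds until all n processes are infected (or max_rounds is hit).

    Each infected process sends b gossips per round to uniformly random
    processes; each gossip is dropped independently with probability loss.
    """
    rng = random.Random(seed)
    infected = {0}  # process 0 is the multicast sender
    rounds = 0
    while len(infected) < n and rounds < max_rounds:
        newly_infected = set()
        for _ in infected:
            for _ in range(b):
                target = rng.randrange(n)
                if rng.random() >= loss:  # the gossip survived the network
                    newly_infected.add(target)
        infected |= newly_infected
        rounds += 1
    return rounds
```

For example, `gossip_rounds(1024)` can be compared with `gossip_rounds(1024, loss=0.5)` to see how packet loss stretches the number of rounds.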

Lecture 19-16 Analysis
For analysis purposes, assume loose synchronization and # gossip targets (i.e., b) = 1
In the first few rounds, gossip spreads like a tree
– Very few processes receive multiple gossip messages
If q(i) = fraction of non-infected processes after round i, then q(i) is initially close to 1, and later:
– Prob.(given process is non-infected after round i+1) = Prob.(given process was non-infected after round i) TIMES Prob.(not being picked as gossip target during round i+1)
– N(1-q(i)) gossips go out, each to a random process
– Probability of a given non-infected process not being picked by any given gossip is (1-1/N)
Source: “Epidemic algorithms for replicated database management”, Demers et al

Lecture 19-17 Gossip is fast and lightweight
(1) In first few rounds, takes O(log(N)) rounds to get to about half the processes
– Think of a binary tree
Later, if q(i) is the fraction of processes that have not received the gossip after round i, then:
$q(i+1) = q(i)\,(1 - 1/N)^{N(1 - q(i))}$
For large N and q(i+1) close to 0, approximates to:
$q(i+1) \approx q(i)\,e^{-1}$
(2) In the end game, it takes O(log(N)) rounds for q(i+1) to be whittled down to close to 0
(1)+(2) = O(log(N)) = Latency of gossip with high probability = Average number of gossips each process sends out
Source: “Epidemic algorithms for replicated database management”, Demers et al
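The recurrence from the analysis (each of the N(1-q(i)) gossips misses a given process with probability 1-1/N) can be iterated numerically. This is a sketch under the same assumptions (b = 1, one initially infected process); the stopping rule, fewer than one expected non-infected process, is an assumption chosen for illustration.

```python
def rounds_until_spread(n):
    """Iterate q(i+1) = q(i) * (1 - 1/n)**(n * (1 - q(i))) until q(i) * n < 1."""
    q = 1.0 - 1.0 / n  # one infected process initially
    rounds = 0
    while q * n >= 1.0:  # at least one non-infected process expected
        q = q * (1.0 - 1.0 / n) ** (n * (1.0 - q))
        rounds += 1
    return rounds


if __name__ == "__main__":
    for n in (10**3, 10**4, 10**5, 10**6):
        print(n, rounds_until_spread(n))
```

Printing the result for N = 10^3 up to 10^6 shows the round count growing roughly by a constant for every factor of 10 in N, which is the O(log(N)) behavior claimed above.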

Fault-tolerance
Packet loss
– 50% packet loss: analyze with b replaced with b/2
– To achieve same reliability as 0% packet loss, takes twice as many rounds
– Work it out!
Process failure
– 50% of processes fail: analyze with N replaced with N/2 and b replaced with b/2
– Same as above
– Work it out!
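One way to work out the packet-loss case, as a sketch: extend the b = 1 analysis to b gossip targets per round (an assumption about how the general recurrence looks) and compare the end-game round counts for b and b/2. The symbols q_0 and \epsilon (the fraction of non-infected processes at the start and end of the end game) are introduced here for illustration.

```latex
% Assumption: each infected process sends b gossips per round,
% so bN(1 - q(i)) gossips go out in round i+1.
q(i+1) = q(i)\left(1 - \frac{1}{N}\right)^{bN(1 - q(i))}
       \approx q(i)\,e^{-b}
\quad (\text{large } N,\ q(i) \text{ close to } 0)

% Rounds needed in the end game to shrink q from q_0 down to \epsilon:
r(b) \approx \frac{1}{b}\ln\frac{q_0}{\epsilon},
\qquad
r\!\left(\tfrac{b}{2}\right) \approx \frac{2}{b}\ln\frac{q_0}{\epsilon} = 2\,r(b)
```

So with half of the gossips lost, the end game takes about twice as many rounds, which is still O(log(N)).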

Fault-tolerance
With failures, is it possible that the epidemic might die out quickly?
Possible, but improbable:
– Once a few processes are infected, with high probability, the epidemic will not die out
– So the analysis we saw in the previous slides is actually behavior with high probability
Think: why do rumors spread so fast? Why do infectious diseases cascade quickly into epidemics? Why does a worm like Blaster spread rapidly?

So,…
Is this all theory and a bunch of equations?
Or are there implementations yet?

Some implementations
– Amazon Web Services EC2/S3 (rumored)
– Clearinghouse project: and database transactions [PODC ‘87]
– refDBMS system [Usenix ‘94]
– Bimodal Multicast [ACM TOCS ‘99]
– Ad-hoc networks [Li Li et al, Infocom ‘02]
– Delay-Tolerant Networks [Y. Li et al ‘09]
– Usenet NNTP (Network News Transport Protocol)! [‘79]: newsgroup servers use gossip

NNTP Inter-server Protocol
1. Each client uploads and downloads news posts from a news server
2. Server retains news posts for a while, transmits them lazily, deletes them after a while

23 Gossip-style Membership (Remember this?)
[Figure: process pi holds an array of heartbeat seq. numbers for a member subset]
Good accuracy properties

24 Gossip-Style Failure Detection (Remember this?)
Protocol
– Each process maintains a membership list
– Each process periodically increments its own heartbeat counter
– Each process periodically gossips its membership list
– On receipt, the heartbeats are merged, and local times are updated
[Figure: membership lists with columns Address, Heartbeat Counter, Time (local); current time: 70 at node 2 (asynchronous clocks). Fig and animation by: Dongyun Jin and Thuy Ngyuen]

25 Gossip-Style Failure Detection
Now you know:
– In a group of N processes, it takes O(log(N)) time for a heartbeat update to propagate to everyone with high probability
– Very robust against failures – even if a large number of processes crash, most/all of the remaining processes still receive all heartbeats
Failure detection: If the heartbeat has not increased for more than T_fail seconds, the member is considered failed
– T_fail usually set to O(log(N)).
But entry not deleted immediately: wait another T_cleanup seconds (usually = T_fail)
Why not delete it immediately after the T_fail timeout?

26 Gossip-Style Failure Detection
What if an entry pointing to a failed node is deleted right after T_fail (=24) seconds?
Fix: remember for another T_fail
[Figure: current time: 75 at node 2]
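The membership-list merge and the two timeouts can be sketched as follows. The class and method names are hypothetical, and leaving suspected entries out of outgoing gossip is an assumption of this sketch. The point it illustrates is the last slide: an entry kept for T_cleanup after T_fail cannot be resurrected by a stale heartbeat from another node.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    heartbeat: int
    local_time: float  # local clock reading when the heartbeat last increased


class Member:
    def __init__(self, addr, t_fail, t_cleanup):
        self.addr = addr
        self.t_fail = t_fail
        self.t_cleanup = t_cleanup
        self.table = {addr: Entry(0, 0.0)}

    def tick(self, now):
        """Periodically increment our own heartbeat counter."""
        own = self.table[self.addr]
        own.heartbeat += 1
        own.local_time = now

    def snapshot(self, now):
        """Membership list to gossip: our own entry plus entries not suspected failed."""
        return {a: e.heartbeat for a, e in self.table.items()
                if a == self.addr or now - e.local_time <= self.t_fail}

    def merge(self, remote, now):
        """Merge a received list (addr -> heartbeat); refresh local time only on increase."""
        for addr, hb in remote.items():
            entry = self.table.get(addr)
            if entry is None:
                self.table[addr] = Entry(hb, now)
            elif hb > entry.heartbeat:
                entry.heartbeat = hb
                entry.local_time = now

    def failed(self, now):
        return {a for a, e in self.table.items()
                if a != self.addr and now - e.local_time > self.t_fail}

    def cleanup(self, now):
        """Delete entries only after T_fail + T_cleanup without a heartbeat increase."""
        limit = self.t_fail + self.t_cleanup
        for addr in [a for a, e in self.table.items()
                     if a != self.addr and now - e.local_time > limit]:
            del self.table[addr]
```

If the entry were deleted at T_fail instead, the `entry is None` branch in `merge` would re-add the failed member with a fresh local time as soon as any stale list arrived.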

Selecting Gossip Partners
The frequency with which RMs send gossip messages depends on the application.
Policy for choosing a partner to exchange gossip with:
– Random policies: choose a partner randomly (perhaps with weighted probabilities)
» Fastest, does not pay attention to updates, not so good on topology
– Deterministic policies: an RM can examine its timestamp table and choose the RM that is the furthest behind in the updates it has received.
» Somewhat fast, pays attention to updates, not so good on topology
– Topological policies: choose gossip targets based on round-trip times (RTTs), or network topology.
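The first two policies might look like this in code. It is only a sketch: the function names are hypothetical, and measuring "furthest behind" by the sum of a peer's vector timestamp entries is an assumption, since the slide does not fix a metric.

```python
import random


def pick_random(peers, rng, weights=None):
    """Random policy: uniform choice, or weighted if weights are given."""
    return rng.choices(peers, weights=weights, k=1)[0]


def pick_furthest_behind(timestamp_table):
    """Deterministic policy over a timestamp table (peer -> vector timestamp).

    Assumption: the peer with the smallest total update count is the
    furthest behind.
    """
    return min(timestamp_table, key=lambda rm: sum(timestamp_table[rm]))
```

A topological policy would need RTT measurements or topology data, which this sketch leaves out.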

28 Multi-level Gossiping
Network topology is hierarchical
Random gossip target selection => core routers face O(N) load (Why?)
Fix: Select gossip target in subnet i, which contains n_i nodes, with probability 1/n_i
– Router load = O(1)
– Dissemination time = O(log(N))
Why?
Can extend to multi-level hierarchical topology
[Figure: a router connecting subnets, with N/2 nodes in a subnet]
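A quick way to check the O(1) router load claim is to count cross-router gossips per round. The sketch below assumes one reading of the fix: each of the n_i nodes in a subnet sends one gossip per round and picks a target across the router with probability 1/n_i. The function names are hypothetical.

```python
import random


def cross_subnet_gossips(n_i, rng):
    """Gossips that cross the router in one round from a subnet of n_i nodes."""
    return sum(1 for _ in range(n_i) if rng.random() < 1.0 / n_i)


def average_router_load(n_i, rounds, seed=0):
    """Average number of cross-router gossips per round over many rounds."""
    rng = random.Random(seed)
    return sum(cross_subnet_gossips(n_i, rng) for _ in range(rounds)) / rounds
```

The expected value is n_i × 1/n_i = 1 gossip per round regardless of subnet size, which is why the average stays near 1 for both small and large n_i.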

Gossiping Architecture: Query and Update Operations
[Figure: clients issue Query and Update operations through their FEs. An FE sends (Query, prev TS) to an RM and gets back (Val, new TS); it sends (Update, prev TS) and gets back an Update id. The RMs in the service exchange gossip.]

Gossiping Architecture
The RMs exchange “gossip” messages (1) periodically and (2) amongst each other. Gossip messages convey updates they have each received from clients, and serve to achieve anti-entropy (convergence of all RMs).
Properties:
– Each client obtains a consistent service over time: in response to a query, an RM may have to wait until it receives “required” updates from other RMs. The RM then provides client with data that at least reflects the updates that the client has observed so far.
– Relaxed consistency among replicas: RMs may be inconsistent at any given point of time. Yet all RMs eventually receive all updates and they apply updates with ordering guarantees. Provides eventual consistency.

Various Timestamps
Virtual timestamps are used to control the order of operation processing. The timestamp contains an entry for each RM (i.e., it is a vector timestamp).
Each front end keeps a vector timestamp, prev, that reflects the latest data values accessed by that front end. The FE sends this along with every request it sends to any RM.
Replies to FE:
– When an RM returns a value as a result of a query operation, it supplies a new timestamp, new.
– An update operation returns a timestamp, update id.
Each returned timestamp is merged with the FE’s previous timestamp to record the data that has been observed by the client.
– Merging is a pairwise max operation applied to each element i (from 1 to N)
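The merge rule is short enough to write out. A minimal sketch, with a hypothetical `FrontEnd` class holding the prev timestamp as a tuple:

```python
def merge(a, b):
    """Pairwise max of two vector timestamps of equal length."""
    return tuple(max(x, y) for x, y in zip(a, b))


class FrontEnd:
    def __init__(self, n):
        self.prev = (0,) * n  # one entry per RM

    def on_reply(self, returned_ts):
        """Merge the timestamp returned by a query (new) or an update (update id)."""
        self.prev = merge(self.prev, returned_ts)
```

Since max is commutative and idempotent, replies can be merged in any order and more than once without changing the result.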

Front ends Propagate Their Timestamps
[Figure: clients and their FEs exchange vector timestamps with each other and with the RMs of the service; the RMs gossip]
Since client-to-client communication can also lead to causal relationships between operations applied to services, the FE piggybacks its timestamp on messages to other clients.
Expanded on next slide…

A Gossip Replica Manager
[Figure: a replica manager with its value and value timestamp, replica timestamp, update log (records holding OperationID, Update, Prev), executed operation table and timestamp table. Updates arrive from FEs, stable updates are applied to the value, and gossip messages carrying the replica timestamp and replica log are exchanged with other replica managers.]

Value: value of the object maintained by the RM.
Value timestamp: the timestamp that represents the updates reflected in the value. Updated whenever an update operation is applied.

Update log: records all update operations as soon as they are received, until they are reflected in Value.
– Keeps all the updates that are not stable, where a stable update is one that has been received by all other RMs and can be applied consistently with its ordering guarantees.
– Keeps stable updates that have been applied, but cannot be purged yet, because no confirmation has been received from all other RMs.
Replica timestamp: represents updates that have been accepted by the RM into the log.

Executed operation table: contains the FE-supplied ids of updates (stable ones) that have been applied to the value.
– Used to prevent an update being applied twice, as an update may arrive from a FE and in gossip messages from other RMs.
Timestamp table: contains, for each other RM, the latest timestamp that has arrived in a gossip message from that other RM.

The ith element of a vector timestamp held by RM i corresponds to the total number of updates received from FEs by RM i.
The jth element of a vector timestamp held by RM i (j not equal to i) equals the number of updates received by RM j that have been forwarded to RM i in gossip messages.

Update Operations
Each update request u contains
– The update operation, u.op
– The FE’s timestamp, u.prev
– A unique id that the FE generates, u.id
Upon receipt of an update request, RM i
– Checks if u has been processed by looking up u.id in the executed operation table and in the update log.
– If not, increments the i-th element in the replica timestamp by 1 to keep track of the number of updates directly received from FEs.
– Places a record for the update in the RM’s log: logRecord := <i, ts, u.op, u.prev, u.id>, where ts is derived from u.prev by replacing u.prev’s ith element by the ith element of its replica timestamp.
– Returns ts back to the FE, which merges it with its timestamp.

Update Operation (Cont’d)
The stability condition for an update u is u.prev <= valueTS, i.e., all the updates on which this update depends have already been applied to the value.
When the update operation u becomes stable, the RM does the following
– value := apply(value, u.op)
– valueTS := merge(valueTS, ts) (update the value timestamp)
– executed := executed ∪ {u.id} (update the executed operation table)
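Update handling and the stability check can be sketched together. This is an illustration, not the textbook's code: the class name, the 0-based RM index, returning `None` for a duplicate, and using "append the operation to a list" as a stand-in for apply(value, u.op) are all assumptions.

```python
from dataclasses import dataclass, field


def merge(a, b):
    return tuple(max(x, y) for x, y in zip(a, b))


def leq(a, b):
    """Vector timestamp comparison: a <= b in every element."""
    return all(x <= y for x, y in zip(a, b))


@dataclass
class ReplicaManager:
    i: int  # index of this RM in the vector timestamps (0-based here)
    n: int  # number of RMs
    value: list = field(default_factory=list)
    value_ts: tuple = ()
    replica_ts: tuple = ()
    log: list = field(default_factory=list)  # records (i, ts, op, prev, id)
    executed: set = field(default_factory=set)

    def __post_init__(self):
        self.value_ts = (0,) * self.n
        self.replica_ts = (0,) * self.n

    def receive_update(self, op, prev, uid):
        # Already processed? Look in the executed table and in the log.
        if uid in self.executed or any(rec[4] == uid for rec in self.log):
            return None
        # Count one more update received directly from an FE.
        rts = list(self.replica_ts)
        rts[self.i] += 1
        self.replica_ts = tuple(rts)
        # ts is u.prev with element i replaced by the replica timestamp's element i.
        ts = list(prev)
        ts[self.i] = self.replica_ts[self.i]
        ts = tuple(ts)
        self.log.append((self.i, ts, op, prev, uid))
        self.apply_stable()
        return ts  # the FE merges this into its own timestamp

    def apply_stable(self):
        progress = True
        while progress:  # applying one update may make others stable
            progress = False
            for (_, ts, op, prev, uid) in self.log:
                if uid not in self.executed and leq(prev, self.value_ts):
                    self.value = self.value + [op]  # stand-in for apply(value, u.op)
                    self.value_ts = merge(self.value_ts, ts)
                    self.executed.add(uid)
                    progress = True
```

An update whose prev mentions updates this RM has not yet applied stays in the log until those updates arrive, for example through gossip.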

Exchange of Gossiping Messages
A gossip message m consists of the log of the RM, m.log, and the replica timestamp, m.ts.
– Replica timestamp contains info about non-stable updates
An RM that receives a gossip message m has three tasks:
– (1) Merge the arriving log with its own.
» Let replicaTS denote the recipient RM’s replica timestamp. A record r in m.log is added to the recipient’s log unless r.ts <= replicaTS.
» replicaTS ← merge(replicaTS, m.ts)
– (2) Apply any updates that have become stable but have not yet been executed (stable updates in the arrived log may cause some pending updates to become stable)
– (3) Garbage collect: Eliminate records from the log and the executed operation table when it is known that the updates have been applied everywhere.
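The three tasks can be sketched as one function over a plain dictionary of RM state. The field names and the `sender` parameter are hypothetical. For task (3) the sketch only purges the log, using the rule that a record created by RM c can go once every other RM's entry in the timestamp table has caught up with r.ts[c]; that rule is an assumption here, and purging the executed operation table is left out.

```python
def merge(a, b):
    return tuple(max(x, y) for x, y in zip(a, b))


def leq(a, b):
    return all(x <= y for x, y in zip(a, b))


def receive_gossip(state, m_log, m_ts, sender):
    """state keys: value, value_ts, replica_ts, log, executed, ts_table.

    ts_table must hold an entry for every other RM (all zeros initially).
    Log records are dicts with keys rm, ts, op, prev, id.
    """
    # (1) Merge the arriving log with our own, then advance the replica timestamp.
    for r in m_log:
        if not leq(r["ts"], state["replica_ts"]):
            state["log"].append(r)
    state["replica_ts"] = merge(state["replica_ts"], m_ts)

    # (2) Apply updates that are now stable and not yet executed.
    progress = True
    while progress:
        progress = False
        for r in state["log"]:
            if r["id"] not in state["executed"] and leq(r["prev"], state["value_ts"]):
                state["value"].append(r["op"])  # stand-in for apply(value, r.op)
                state["value_ts"] = merge(state["value_ts"], r["ts"])
                state["executed"].add(r["id"])
                progress = True

    # (3) Garbage collect log records known to have reached every other RM.
    state["ts_table"][sender] = m_ts

    def known_everywhere(r):
        c = r["rm"]
        return all(ts[c] >= r["ts"][c] for ts in state["ts_table"].values())

    state["log"] = [r for r in state["log"]
                    if not (r["id"] in state["executed"] and known_everywhere(r))]
```

The loop in step (2) repeats until nothing changes, so records that arrive out of causal order within one gossip message are still applied in dependency order.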

Query Operations
A query request q contains the operation, q.op, and the timestamp, q.prev, sent by the FE.
Let valueTS denote the RM’s value timestamp, then q can be applied if q.prev <= valueTS
The RM keeps q on a hold back queue until the condition is fulfilled.
– If valueTS is (2,5,5) and q.prev is (2,4,6), then one update from RM 3 is missing.
Once the query is applied, the RM returns new ← valueTS to the FE (along with the value), and the FE merges new with its timestamp.
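The hold-back test and the slide's example can be checked in a few lines. A sketch with hypothetical helper names; RMs are numbered from 1 in the output to match the slide.

```python
def leq(a, b):
    return all(x <= y for x, y in zip(a, b))


def try_query(value, value_ts, q_prev):
    """Return (value, new) if the query can be applied now, else None (hold back)."""
    if leq(q_prev, value_ts):
        return value, value_ts  # new := valueTS
    return None


def missing_updates(value_ts, q_prev):
    """Map RM number (1-based) -> number of updates the RM still lacks for this query."""
    return {j + 1: p - v
            for j, (p, v) in enumerate(zip(q_prev, value_ts)) if p > v}
```

With valueTS = (2,5,5) and q.prev = (2,4,6), `try_query` holds the query back and `missing_updates` reports one update from RM 3.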

More Examples
Bayou
– Replicated database with weaker guarantees than sequential consistency
– Uses gossip, timestamps and concept of anti-entropy
– Section
Coda
– Provides high availability in spite of disconnected operation, e.g., roving and transiently-disconnected laptops
– Based on AFS
– Aims to provide constant data availability
– Section

Summary
Reading for this lecture: Section 18.4
MP3: By now you must have a design and must have started coding