1
Sergio Rajsbaum 2006 Lecture 1 Introduction to Principles of Distributed Computing Sergio Rajsbaum Math Institute UNAM, Mexico
2
Sergio Rajsbaum 2006 Lecture 1
Part I: Two-phase commit. An example of a distributed protocol
Part II: What is a distributed system and its parameters. Problems solved in such a system. The need for a theoretical foundation
3
Sergio Rajsbaum 2006 Part I: Two-phase commit An example of an important distributed DBMS protocol
4
Sergio Rajsbaum 2006 Principles of Distributed Computing
With the development of networking infrastructures and software, it is increasingly rare to find a system that does not interact with other systems
Distributed computing studies systems where components interact and collaborate
Principles of distributed computing tries to understand the fundamental possibilities and limitations of such systems, with a precise, scientific approach
Goal: to design efficient and reliable systems, and techniques to design them, analyze them and prove them correct, or to prove impossibility results when no protocol exists
5
Sergio Rajsbaum 2006 Distributed Commit
An example from databases [Garcia-Molina, Ullman, Widom, Database Systems, 2001]
6
Sergio Rajsbaum 2006 Distributed Commit
A distributed transaction with components at several sites should execute atomically
Example: A manager of a chain of stores wants to query all the stores, find the inventory of toothbrushes at each, and issue instructions to move toothbrushes from store to store in order to balance the inventory
The operation is done by a single global transaction T that has a component Ti at the i-th store and a component T0 at the office where the manager is located
7
Sergio Rajsbaum 2006 Sequence of activities performed by T
1. Component T0 is created at the site of the manager
2. T0 sends messages to all the stores instructing them to create components Ti
3. Each Ti executes a query at store i to discover the number of toothbrushes in inventory and reports this number to T0
4. T0 takes these numbers and determines, by some algorithm we shall not discuss, what shipments of toothbrushes are desired. T0 then sends messages such as “store 10 should ship 500 toothbrushes to store 7” to the appropriate stores
5. Stores receiving instructions update their inventory and perform the shipments
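To make the sequence concrete, here is a minimal, single-process Python sketch of steps 1–5. The inventories, the toy rebalancing rule in plan_shipments, and the printed messages are made-up placeholders, since the lecture deliberately does not specify the redistribution algorithm; a real T0 and Ti would run at different sites and exchange real messages.

```python
def plan_shipments(inventories):
    """Toy rebalancing rule: move half the gap from the fullest to the emptiest store."""
    src = max(inventories, key=inventories.get)
    dst = min(inventories, key=inventories.get)
    qty = (inventories[src] - inventories[dst]) // 2
    return {(src, dst): qty} if qty > 0 else {}

def run_global_transaction(inventories):
    # Steps 1-3: T0 is created, asks every store to create Ti, and each Ti
    # reports its local toothbrush count back to T0.
    reports = dict(inventories)          # stands in for the Ti -> T0 replies
    # Step 4: T0 decides the shipments (the real algorithm is not discussed).
    shipments = plan_shipments(reports)
    # Step 5: the stores receiving instructions update inventory and ship.
    for (src, dst), qty in shipments.items():
        inventories[src] -= qty
        inventories[dst] += qty
        print(f"store {src} ships {qty} toothbrushes to store {dst}")
    return inventories

print(run_global_transaction({7: 100, 10: 1100}))
```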
8
Sergio Rajsbaum 2006 Atomicity
We must make sure this does not happen: some of the actions of T get executed, but others do not
We do assume atomicity of each Ti, through mechanisms such as logging and recovery
Failures make it difficult to achieve atomicity of T
–A site fails or is disconnected from the network
–A bug in the algorithm to redistribute toothbrushes instructs store 10 to ship more than it has
9
Sergio Rajsbaum 2006 Example of failures
Suppose T10 replies to T0’s first message with its inventory. The machine at store 10 then crashes, and the instructions from T0 are never received by T10
However, T7 sees no problem and receives the instructions from T0
Can distributed transaction T ever commit?
10
Sergio Rajsbaum 2006 Two-phase commit
To guarantee atomicity, distributed DBMSs use a protocol for deciding whether or not to commit a distributed transaction
Each component of the transaction will commit, or none will
The protocol is coordinator based. The coordinator could be the site at which the transaction originates, such as T0. Call it C
11
Sergio Rajsbaum 2006 Phase 1
1. C sends “prepare T” to each site
2. Each site receiving it decides whether to commit or abort its component of T. It must eventually send this response to C
3. A site that wants to commit performs all actions associated with the local component of T, to be sure the local component does not abort later due to a local failure, and sends “ready T” to C. Only C can instruct it to abort later on
4. If a site sends “don’t commit”, it can locally abort T, since T will surely abort even if the other sites want to commit
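A minimal Python sketch of a participant's phase-1 behaviour, assuming the local vote, the log, and the outgoing messages can be modelled as plain Python values; the names can_commit_locally, log, and outbox are illustrative, not part of any real DBMS interface.

```python
def participant_phase1(can_commit_locally, log, outbox):
    """React to the coordinator's "prepare T" message (steps 2-4 above)."""
    if can_commit_locally:
        # Step 3: force the local work and the vote to the log now, so a later
        # local failure cannot make this component abort on its own...
        log.append("ready T")
        # ...then promise the vote to C. From now on only C may order an abort.
        outbox.append("ready T")
    else:
        # Step 4: a "don't commit" vote means T will surely abort globally,
        # so the local component can be aborted right away.
        outbox.append("don't commit T")
        log.append("abort T")

log, outbox = [], []
participant_phase1(can_commit_locally=True, log=log, outbox=outbox)
print(log, outbox)   # ['ready T'] ['ready T']
```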
12
Sergio Rajsbaum 2006 Phase 2
1. It begins when C has received the responses “ready” or “don’t commit” from all sites, or after a timeout in case a site fails or gets disconnected
2. If C receives “ready” from all, it decides to commit T and sends “commit T” to all
3. If C receives a “don’t commit”, it aborts T and sends “abort T” to all
4. If a site receives “commit” it commits its local component of T; if it receives “abort” it aborts it
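A minimal sketch of the coordinator's phase-2 decision rule. Here responses is assumed to map each site to "ready", "don't commit", or None when no answer arrived before the timeout; this illustrates the rule above, not the code of any particular system.

```python
def coordinator_phase2(responses):
    """Return the message C broadcasts once all answers (or the timeout) are in."""
    if all(vote == "ready" for vote in responses.values()):
        return "commit T"      # step 2: every site voted ready
    return "abort T"           # step 3: a "don't commit", a crash, or a timeout

print(coordinator_phase2({7: "ready", 10: "ready"}))         # commit T
print(coordinator_phase2({7: "ready", 10: None}))            # abort T (no answer)
print(coordinator_phase2({7: "don't commit", 10: "ready"}))  # abort T
```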
13
Sergio Rajsbaum 2006 Recovery in case of failures
Two cases: when C fails, and when another site fails
14
Sergio Rajsbaum 2006 Recovery of a non-C site
Suppose a site S fails during a two-phase commit. We need to make sure that when it recovers it does not introduce inconsistencies
The hard case: its last log entry says it sent “ready” to C
Then it must communicate with at least one other site to find out the global decision for T. In the worst case, no other site can be contacted and the local component of T must be kept active
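The recovery rule can be sketched as follows, assuming last_log_entry is the final record S wrote before crashing and ask_other_sites stands in for contacting C or any other site; both names are hypothetical.

```python
def recover_participant(last_log_entry, ask_other_sites):
    """Decide what a recovering site S does about T, based on its own log."""
    if last_log_entry in ("commit T", "abort T"):
        return last_log_entry        # the outcome was already recorded locally
    if last_log_entry == "ready T":
        # The hard case: S voted ready but never learned the global decision.
        outcome = ask_other_sites()  # "commit T", "abort T", or None if nobody answers
        if outcome is None:
            return "block"           # keep the local component of T active and retry
        return outcome
    # S never voted ready, so C cannot have decided commit: aborting is safe.
    return "abort T"

print(recover_participant("ready T", ask_other_sites=lambda: None))        # block
print(recover_participant("ready T", ask_other_sites=lambda: "commit T"))  # commit T
```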
15
Sergio Rajsbaum 2006 What if C fails?
The surviving participant sites must wait for C to recover, or, after a waiting time, elect a new coordinator
Leader election is in itself a complex problem. The simple solution of exchanging IP addresses and electing the smallest often works, but may fail
Once a new leader C’ exists, it polls the sites about the status of all transactions
If some site had already received a “commit” from C, then C’ sends “commit” to all
If some site had already received an “abort”, it sends it to all
If no site received such messages, and at least one site was not ready to commit, it is safe to abort T
16
Sergio Rajsbaum 2006 The hard case
No site had received “commit” or “abort”, and every site is ready to commit
C’ cannot be sure whether C found some reason to abort T (perhaps for a local reason, or because of delays), or decided to commit it and had already committed it locally. It must wait until the original C recovers
In real systems the administrator has the ability to intervene manually and force the waiting transaction components to finish
The result is a possible loss of atomicity, and the person executing T will be notified
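A minimal sketch of how a new coordinator C' could combine the polled answers, including the blocking case on this slide. The status strings ("committed", "aborted", "ready", "unprepared") are an assumed encoding for illustration, not taken from a specific protocol description.

```python
def new_coordinator_decision(statuses):
    """What C' tells the surviving sites after polling them about T."""
    if "committed" in statuses:
        return "commit T"   # some site already received "commit" from C
    if "aborted" in statuses:
        return "abort T"    # some site already received "abort" from C
    if "unprepared" in statuses:
        return "abort T"    # at least one site never voted ready: safe to abort
    # Every reachable site is "ready" and none knows the outcome: C may have
    # committed or aborted locally, so C' must block until C recovers (or an
    # administrator intervenes, possibly losing atomicity).
    return "block"

print(new_coordinator_decision(["ready", "ready"]))       # block
print(new_coordinator_decision(["ready", "unprepared"]))  # abort T
```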
17
Sergio Rajsbaum 2006 So, is the protocol correct? Is it efficient? Is manual intervention unavoidable?
18
Sergio Rajsbaum 2006 How to analyze the protocol?
What exactly does it mean to be correct?
–Not just “either all commit or all abort”
–What is the termination correctness condition?
Under what circumstances is it correct?
–E.g. trivial if no failures are possible
–What if there are malicious failures?
–How to choose the timeouts?
What is its performance?
–Depends on the timeouts and delays
–On the type and number of failures
Is it efficient? Are there more efficient protocols?
19
Sergio Rajsbaum 2006 Part II: What is a distributed system and its parameters. Problems solved in such a system. The need for a theoretical foundation
20
Sergio Rajsbaum 2006 What is distributed computing?
Any system where several independent computing components interact
This broad definition encompasses
–VLSI chips, and any modern PC
–tightly-coupled shared-memory multiprocessors
–local area clusters of workstations
–the Internet, the Web, Web services
–wireless networks, sensor networks, ad-hoc networks
–cooperating robots, mobile agents, P2P systems
21
Sergio Rajsbaum 2006 Computing components
Referred to as processors or processes in the literature
Can represent a
–microprocessor
–process in a multiprocessing operating system
–Java thread
–mobile agent, mobile node (e.g. laptop), robot
–computing element in a VLSI chip
22
Sergio Rajsbaum 2006 Interaction – message passing vs. shared memory
Processors need to communicate with each other to collaborate, via
Message passing
–Point-to-point channels, defining an interconnection graph
–All-to-all using an underlying infrastructure (e.g. TCP/IP)
–Broadcast: wireless, satellite
Shared memory
–Shared objects: read/write, test&set, compare&swap, etc.
–Usually harder to implement, easier to program
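A small, single-process Python sketch contrasting the two interaction styles: a queue.Queue stands in for a point-to-point channel, and a lock-protected object stands in for an atomic test&set register. The lock only emulates the atomicity that real hardware (or a real network channel) would provide.

```python
import queue
import threading

# Message passing: a point-to-point channel is just a FIFO of messages.
channel = queue.Queue()
channel.put("prepare T")     # the sender deposits a message...
print(channel.get())         # ...and the receiver eventually takes it out

# Shared memory: a shared test&set object.
class TestAndSet:
    def __init__(self):
        self._bit = False
        self._guard = threading.Lock()

    def test_and_set(self):
        with self._guard:    # emulates the atomicity of a hardware test&set
            old, self._bit = self._bit, True
            return old       # False exactly once: that caller "wins" the resource

ts = TestAndSet()
print(ts.test_and_set())     # False -> the first caller acquires the resource
print(ts.test_and_set())     # True  -> every later caller sees it taken
```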
23
Sergio Rajsbaum 2006 A distributed system
[Diagram: processors collaborating over communication media]
24
Sergio Rajsbaum 2006 Failures
Any system that includes many components running over a long period of time must consider the possibility of failures of processors and communication media, of different severity
–from processor crashes or message losses, to
–malicious Byzantine behavior
25
Sergio Rajsbaum 2006 A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable. Leslie Lamport
26
Sergio Rajsbaum 2006 A note on parallel vs. distributed computing
27
Sergio Rajsbaum 2006 Parallel vs. distributed computing
Distributed computing
–focuses on the more loosely coupled end of the spectrum
–Each processor has its own semi-independent agenda, but needs to coordinate with others for fault-tolerance, sharing resources, availability, etc.
–Processors tend to be somewhat independent: sometimes physically separated, running at independent speeds, communicating through media with unpredictable delays
–Multi-layered; many different issues need to be addressed: communication, synchronization, fault-tolerance…
Parallel processing
–studies the more tightly-coupled end of the spectrum
–usually all processors are dedicated to performing one large task as fast as possible. Parallelizing a problem is a main goal
28
Sergio Rajsbaum 2006 Many kinds of problems
Clock synchronization
Routing
Broadcasting
Naming
P2P, how to share and find resources
Sharing resources, mutual exclusion
Increasing fault-tolerance, failure detection
Security, authentication, cryptography
Database transactions, atomic commitment
Backups, reliable storage, file systems
Applications: airline reservation, banking, electronic commerce, publish/subscribe systems, web search, web caching, …
29
Sergio Rajsbaum 2006 Multi-layered, complex interactions
An example:
A fault-tolerant broadcast service is useful to build a higher-level database transaction module
Naming and authentication are required
And it may work more efficiently if clocks are tightly synchronized
And good routing schemes should exist
If the clock synchronization is attacked, the whole system may be compromised
30
Sergio Rajsbaum 2006 Chaos
We need a good foundation: principles of distributed computing
31
Sergio Rajsbaum 2006 Chaos
Too many models, problems, and orthogonal, interacting issues
Very hard to get things right, or to reproduce operating scenarios
Sometimes it is easy to adapt a solution to a different model; sometimes a small change in the model makes a problem unsolvable
32
Sergio Rajsbaum 2006 Distributed computing theory
Models
–Good models [Schneider, Ch. 2 in Distributed Systems, Mullender (Ed.)]
–Relations between models: solve a problem only once; solve it in the strongest possible model
Problems
–Search for paradigms that represent fundamental distributed computing issues
–Relations between problems: hierarchies of solvable and unsolvable problems; reductions
Solutions
–Design of algorithms, verification techniques, programming abstractions
–Impossibility results and lower bounds
Efficiency measures
–Time, communication, failures, recovery time, bottlenecks, congestion
33
Sergio Rajsbaum 2006 Bibliography: theory of distributed computing textbooks
Attiya, Welch, Distributed Computing, Wiley-Interscience, 2nd ed., 2004
Garg, Elements of Distributed Computing, Wiley-IEEE, 2002
Lynch, Distributed Algorithms, Morgan Kaufmann, 1997
Tel, Introduction to Distributed Algorithms, Cambridge University Press, 2nd ed., 2001
34
Sergio Rajsbaum 2006 Bibliography: other sources
Distributed Algorithms and Systems http://www.md.chalmers.se/~tsigas/DISAS/index.html
Conferences: DISC, PODC, …
Journals: Distributed Computing, …
–Special issue for PODC’s 20th anniversary, Sept. 2003
ACM SIGACT News Distributed Computing Column. Also one in the EATCS Bulletin