1
Performing Tasks in Asynchronous Environments
Dariusz Kowalski, University of Connecticut & Warsaw University
Joint work with Alex Shvartsman, University of Connecticut & MIT
2
Performing Work with Asynchronous Processors
Do-All problem ([DHW] et al.)
The DA(p,t) problem abstracts the basic problem of cooperation in a distributed setting: p processors must perform t tasks, and at least one processor must learn that all tasks are done [Dwork, Halpern, Waarts 92/98].
Tasks are:
- known to every processor
- similar: each takes a similar number of local steps
- independent: may be performed in any order
- idempotent: may be performed more than once, including concurrently
3
Do-All: synchronous model with crashes
Model: processors are synchronous and may fail by crashing.
Solutions: the problem is well understood and the results are close to optimal.
Shared-memory model (communication by read/write):
- Kanellakis, P.C., Shvartsman, A.A.: Fault-tolerant parallel computation. Kluwer Academic Publishers (1997)
Message-passing model (communication by exchanging messages):
- Dwork, C., Halpern, J., Waarts, O.: Performing work efficiently in the presence of faults. SIAM Journal on Computing, 27 (1998)
- De Prisco, R., Mayer, A., Yung, M.: Time-optimal message-efficient work performance in the presence of faults. Proc. of 13th PODC (1994)
- Chlebus, B., De Prisco, R., Shvartsman, A.A.: Performing tasks on synchronous restartable message-passing processors. Distributed Computing, 14 (2001)
4
Do-All: asynchronous models
Shared-memory model (communication by read/write): widely studied, but solutions are far from optimal.
- Kanellakis, P.C., Shvartsman, A.A.: Fault-tolerant parallel computation. Kluwer Academic Publishers (1997)
- Anderson, R.J., Woll, H.: Algorithms for the certified Write-All problem. SIAM Journal on Computing, 26 (1997)
- Kedem, Z., Palem, K., Raghunathan, A., Spirakis, P.: Combining tentative and definite executions for very fast dependable parallel computing. Proc. of 23rd STOC (1991)
Message-passing model (communication by exchanging messages): no interesting solutions until recently.
5
Shared-Memory vs. Message-Passing
Shared-Memory (atomic registers):
- processors communicate by read/write in shared memory
- atomicity guarantees that a read outputs the last written value
- one read/write operation per local clock cycle
- information propagates and is persistent
Hence cooperation is always possible, although delayed. Here processor scheduling is the major challenge.
Message-Passing:
- processors communicate by exchanging messages
- the duration of a local step may be unbounded
- message delays may be unbounded
- information may not propagate: send/receive depend on delay
6
Message-delay-sensitive approach
Even if message delays are bounded by d (d-adversary), cooperation may be difficult.
Observation: if $d = \Omega(t)$ then work must be $\Omega(t \cdot p)$.
This means that cooperation is difficult and that addressing scheduling alone is not enough: algorithm design and analysis must be d-sensitive.
Message-delay-sensitive approach:
- Dwork, C., Lynch, N., Stockmeyer, L.: Consensus in the presence of partial synchrony. J. of the ACM, 35 (1988)
7
Measures of efficiency
Termination time: the first time when all tasks are done and at least one processor knows about it.
- used only to define work and message complexity
- not interesting on its own: if all processors but one are delayed, then trivially time is $\Omega(t)$
Work: the sum, over all processors, of the number of local steps taken until termination time.
Message complexity (message-passing model): the number of point-to-point messages sent until termination time.
8
Structure of the presentation
Part 1: Shared-memory model
- Model and bibliography
- Improving the AW algorithm in shared memory by better scheduling of processors (task load-balancing)
Part 2: Message-passing model
- Model: asynchrony, message delay, and modeling issues
- Delay-sensitive lower bounds for Do-All
- Progress-tree Do-All algorithms
  - Simulating shared memory and Anderson-Woll (AW)
  - Asynchronous message-passing progress-tree algorithm
- Permutation Do-All algorithms
9
Shared-Memory - model and goal
We consider the following model:
- p asynchronous processors with PID in {0,…,p-1}
- processors communicate by read/write in shared memory
- atomicity: a read outputs the last written value
- one read/write operation per local clock cycle
Write-All: write 1's into t locations of a given array.
Goal: improve the scheduling of cooperating asynchronous processors, leading to better load-balancing with respect to tasks.
10
Write-All: Selected Bibliography
Introducing the Write-All problem:
- Kanellakis, P.C., Shvartsman, A.A.: Efficient parallel algorithms can be made robust. PODC (1989), Distributed Computing (1992)
AW algorithm with work $O(t\, p^{\varepsilon})$:
- Anderson, R.J., Woll, H.: Algorithms for the certified Write-All problem. SIAM Journal on Computing, 26 (1997)
Randomized algorithm with work $O(t + p \log p)$:
- Martel, C., Subramonian, R.: On the complexity of Certified Write-All algorithms. J. Algorithms 16 (1994)
First work-optimal deterministic algorithm for $t = \Omega(p^4 \log p)$:
- Malewicz, G.: A work-optimal deterministic algorithm for the asynchronous Certified Write-All problem. PODC (2003)
11
Progress tree algorithms [BKRS, AW]
- shared memory
- p processors, t tasks (p = t)
- q permutations of [q]
- q-ary progress tree of depth $\log_q p$; nodes are binary completion bits
- permutations establish the order in which the children of a node are visited
- the p processors traverse the tree and use the q-ary expansion of their PID to choose permutations [Anderson, Woll]
(Figure: a tree node with children numbered 1, 2, 3, ..., q.)
12
Algorithm AWT [Anderson, Woll]
The progress tree data structure is stored in shared memory.
Example: p = t = 9, q = 3
- $\Sigma$: list of 3 schedules from $S_3$
- T: ternary tree with 9 leaves (progress tree), node values 0-1
- PID(j): j-th digit of the ternary representation of PID
(Figure: ternary progress tree with numbered nodes. Schedule 0 is used by PID = 0, 3, 6; schedule 1 by PID = 1, 4, 7; schedule 2 by PID = 2, 5, 8. Example: $7 = 21_3$.)
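The digit selection on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `pid_digits` is hypothetical, and which digit is consulted at which tree level is an assumption (the slide's grouping 0,3,6 / 1,4,7 / 2,5,8 corresponds to the least significant digit).

```python
def pid_digits(pid: int, q: int, depth: int) -> list[int]:
    """Return the q-ary digits of pid, most significant first, padded to `depth` digits."""
    digits = []
    for _ in range(depth):
        digits.append(pid % q)  # extract the current least significant digit
        pid //= q
    return digits[::-1]


# p = t = 9 and q = 3: the tree has depth log_3 9 = 2, so every PID has two digits.
# Grouping PIDs by their least significant digit reproduces the slide's three groups.
groups = {d: [pid for pid in range(9) if pid_digits(pid, 3, 2)[-1] == d] for d in range(3)}
```

Each digit indexes one of the q schedules, so processors that share a digit visit the children of the corresponding node in the same order.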
13
Contention of permutations
- $S_n$: group of all permutations on the set [n], with composition $\circ$ and identity
- $\pi, \rho$: permutations in $S_n$
- $\Sigma$: set of q permutations from $S_n$
- i is an lrm (left-to-right maximum) in $\pi$ if $\pi(i) > \max_{j<i} \pi(j)$
- $\mathrm{LRM}(\pi)$: number of lrm in $\pi$ [Knuth]
- $\mathrm{Cont}(\Sigma, \rho) = \sum_{\sigma \in \Sigma} \mathrm{LRM}(\rho^{-1} \circ \sigma)$
- Contention of $\Sigma$: $\mathrm{Cont}(\Sigma) = \max_{\rho \in S_n} \mathrm{Cont}(\Sigma, \rho)$ [AW]
Theorem [AW]: For any n > 0 there exists a set $\Sigma$ of n permutations from $S_n$ with $\mathrm{Cont}(\Sigma) \le 3 n H_n = \Theta(n \log n)$.
[Knuth] Knuth, D.E.: The art of computer programming, Vol. 3 (third edition). Addison-Wesley Pub Co. (1998)
(Figure: example permutation 10 3 5 2 4 6 1 9 7 8 11.)
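The contention definitions can be checked by brute force on tiny inputs. The sketch below is illustrative only: the function names are hypothetical, permutations are 0-based tuples, and the composition is read as $(\rho^{-1} \circ \sigma)(i) = \rho^{-1}(\sigma(i))$, which is an assumption about the convention. The exhaustive maximum over all of $S_n$ is feasible only for very small n.

```python
from itertools import permutations


def lrm(pi):
    """Count the left-to-right maxima of the sequence pi."""
    count, best = 0, float("-inf")
    for value in pi:
        if value > best:
            count, best = count + 1, value
    return count


def compose_inv(rho, sigma):
    """Return rho^{-1} o sigma for 0-based permutations given as sequences."""
    inverse = [0] * len(rho)
    for index, value in enumerate(rho):
        inverse[value] = index
    return tuple(inverse[s] for s in sigma)


def cont(schedules, rho):
    """Cont(Sigma, rho): sum of LRM(rho^{-1} o sigma) over the schedules sigma."""
    return sum(lrm(compose_inv(rho, sigma)) for sigma in schedules)


def contention(schedules):
    """Cont(Sigma): maximum of Cont(Sigma, rho) over all rho in S_n (exhaustive search)."""
    n = len(schedules[0])
    return max(cont(schedules, rho) for rho in permutations(range(n)))
```

For n = 3, two identical schedules have contention 6, while a schedule paired with its reverse has contention 4, which illustrates why the choice of schedules matters.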
14
Procedure "Oblivious Do"
- n: number of jobs and of units (processors)
- $\Sigma = \langle \sigma_0, \ldots, \sigma_{n-1} \rangle$: list of n schedules from $S_n$
Procedure Oblivious:
  Forall processors PID = 0 to n-1
    for i = 1 to n do
      perform Job($\sigma_{PID}(i)$)
An execution of Job($\sigma_{PID}(i)$) by processor PID is primary if job $\sigma_{PID}(i)$ has not been performed previously.
Lemma [AW]: In algorithm Oblivious with n units, n jobs, and the list $\Sigma$ of n permutations from $S_n$, the number of primary job executions is at most $\mathrm{Cont}(\Sigma)$.
15
AWT(q) - new progress tree traversal algorithm
Instead of using q permutations on the set [q], we use q permutations on the set [n], where $n = q^2 \log q$.
Example: p = 6, t = 16, q = 2, n = 4
- $\Sigma$: list of 2 schedules from $S_4$
- T: 4-ary tree with 16 leaves (progress tree), node values 0-1
- PID(j): j-th digit of the q-ary (here binary) representation of PID
(Figure: 4-ary progress tree with numbered nodes. Schedule 0 is used by even PIDs and schedule 1 by odd PIDs. Example: $5 = 101_2$.)
16
Main result
Set $n = q^2 \log q$ and let $\Sigma$ be a list of q schedules from $S_n$.
For a set $\Psi$ of permutations, define $\mathrm{Cont}(\Sigma, \Psi) = \max_{\rho \in \Psi} \mathrm{Cont}(\Sigma, \rho)$.
Lemma: For sufficiently large q and any set $\Psi$ of at most $\exp(q^2 \log^2 q)$ permutations on the set $[q^2 \log q]$, there is a list $\Sigma$ of q schedules from $S_n$ such that $\mathrm{Cont}(\Sigma, \Psi) \le q^2 \log q + 6 q \log q$.
Taking $q = \log p$ and applying the Lemma gives:
Theorem: For every $\varepsilon > 0$, sufficiently large p and $t = \Omega(p^{2+\varepsilon})$, algorithm AWT(q) performs work $O(t)$.
17
Message-Passing - model and goals
We consider the following model:
- p asynchronous processors with PID in {0,…,p-1}
- processors communicate by message passing
- in one local step each processor can send a message to any subset of processors
- messages incur delays between send and receive
- all received messages can be processed during one local step
Goal: understand the impact of message delay on the efficiency of algorithmic solutions for Do-All.
18
Lower bound - randomized algorithms
Theorem: Any randomized algorithm solving DA with t tasks using p asynchronous message-passing processors performs expected work $\Omega(t + p\, d \log_{d+1} t)$ against any d-adversary.
Proof (sketch): The adversary partitions the computation into stages, each containing d time units, and constructs the delay pattern stage after stage:
- it delays all messages sent in a stage so that they are received at the end of the stage
- during the stage it delays a linear number of processors (those that want to perform more than a $(1 - 1/(3d))$ fraction of the undone tasks)
- the selection is on-line and, with high probability, has good properties
19
Simulating shared-memory algorithms
Write-All algorithm AWT:
- Anderson, R.J., Woll, H.: Algorithms for the certified Write-All problem. SIAM Journal on Computing, 26 (1997)
Quorum systems and atomic memory services:
- Attiya, H., Bar-Noy, A., Dolev, D.: Sharing memory robustly in message passing systems. J. of the ACM, 42 (1995)
- Lynch, N., Shvartsman, A.: RAMBO: A Reconfigurable Atomic Memory Service. Proc. of 16th DISC (2002)
Emulating asynchronous shared-memory algorithms:
- Momenzadeh, M.: Emulating shared-memory Do-All in asynchronous message passing systems. Masters Thesis, CSE, University of Conn (2003)
20
Atomic memory is not required
We use q-ary progress trees as the main data structure that is "written" and "read"; note that atomicity is not required.
If two writes occur (each writes the entire tree), then a subsequent read may obtain a third value that was never written.
(Figure: two writes of trees with root 0 and children 1, 0 and 0, 1; the read returns root 0 with children 1, 1.)
Property of monotone progress:
- a 1 at tree node i indicates that all tasks attached to the leaves of the subtree rooted at i have been performed
- if 1 is written at node i in the progress tree of a processor, it remains 1 forever
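One way to see why monotone progress makes atomicity unnecessary is to treat a progress tree as a flat list of bits and combine two views with a bitwise OR. This is a sketch under assumptions: the flat root-first layout and the name `merge_trees` are hypothetical, and the slides do not spell out the exact update rule.

```python
def merge_trees(local, received):
    """Combine two progress trees (flat 0/1 lists, root first) node by node with OR.

    Because a node only ever changes from 0 to 1, the OR of any two views
    never claims progress that was not actually made.
    """
    if len(local) != len(received):
        raise ValueError("progress trees must have the same shape")
    return [a | b for a, b in zip(local, received)]


# The slide's scenario: one view has the first child done, the other the second.
mixed = merge_trees([0, 1, 0], [0, 0, 1])
```

Here `mixed` is the "third value" from the slide: it was never written as a whole, yet every 1 in it reflects tasks that really were completed.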
21
Algorithm DA(q) - traverse progress tree
Instead of using shared memory, processors broadcast their progress trees as soon as local progress is recorded.
Example: p = t = 9, q = 3
- $\Sigma$: list of 3 schedules from $S_3$
- T: ternary tree with 9 leaves (progress tree), node values 0-1
- PID(j): j-th digit of the ternary representation of PID
(Figure: ternary progress tree with numbered nodes. Schedule 0 is used by PID = 0, 3, 6; schedule 1 by PID = 1, 4, 7; schedule 2 by PID = 2, 5, 8. Example: $7 = 21_3$.)
22
Algorithm DA(q) - case $p \ge t$
23
Procedure DOWORK
24
Algorithm DA(q) - analysis
Modification of algorithm DA(q) for p < t: we partition the t tasks into p jobs of size t/p and let algorithm DA(q) work with these jobs. It takes a processor O(t/p) work (instead of constant work) to process such a job (job unit).
Since in each step a processor broadcasts at most one message to the p-1 other processors, we obtain:
Theorem 4: For any constant $\varepsilon > 0$ there is a constant q such that algorithm DA(q) has work $W(p,t,d) = O(t\, p^{\varepsilon} + p\, d\, \lceil t/d \rceil^{\varepsilon})$ and message complexity $O(p \cdot W(p,t,d))$ against any d-adversary (d = o(t)).
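The grouping of t tasks into p jobs can be sketched as below. The slides only say the jobs have size t/p; using contiguous ranges and letting sizes differ by at most one when p does not divide t are assumptions made here, and `partition_into_jobs` is a hypothetical name.

```python
def partition_into_jobs(t: int, p: int) -> list[range]:
    """Split tasks 0..t-1 into p contiguous jobs whose sizes differ by at most one."""
    base, extra = divmod(t, p)
    jobs, start = [], 0
    for j in range(p):
        size = base + (1 if j < extra else 0)  # the first `extra` jobs get one more task
        jobs.append(range(start, start + size))
        start += size
    return jobs


# Example: 11 tasks shared among 3 processors' worth of jobs.
jobs = partition_into_jobs(11, 3)
```

Performing one job then costs a processor about t/p local steps, which is the O(t/p) factor mentioned in the analysis.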
25
Permutation algorithms - case p t
Algorithms proceed in a loop:
- select the next task using the Order+Select rule
- perform the selected task
- send messages, receive messages, and update state
Order+Select rules:
- PaRan1: initially processor PID permutes the tasks randomly; PID selects the first task remaining on its schedule
- PaRan2: no initial order; PID selects a task from the remaining set randomly
- PaDet: initially processor PID chooses schedule $\sigma_{PID}$ in $\Sigma$; PID selects the first task remaining on schedule $\sigma_{PID}$
$\Sigma$: list of p schedules from $S_t$
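The three Order+Select rules differ only in how a processor picks its next task. The sketch below shows the selection step alone, with message handling omitted; the function names, the `done` set standing for the tasks a processor currently knows to be performed, and the use of Python's `random.Random` are all assumptions for illustration.

```python
import random


def random_schedule(t, rng):
    """PaRan1 setup: a uniformly random order of tasks 0..t-1."""
    order = list(range(t))
    rng.shuffle(order)
    return order


def select_from_schedule(schedule, done):
    """PaRan1 and PaDet: the first task on the schedule not known to be done, or None."""
    for task in schedule:
        if task not in done:
            return task
    return None


def select_random(t, done, rng):
    """PaRan2: a uniformly random task among those not known to be done, or None."""
    remaining = [task for task in range(t) if task not in done]
    return rng.choice(remaining) if remaining else None
```

For PaDet the schedule passed to `select_from_schedule` would be the processor's fixed entry in the list of p schedules, whereas PaRan1 would pass the result of `random_schedule`.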
26
d-Contention of permutations
We introduce the notion of d-contention:
- i is a d-lrm in $\pi$ if $|\{ j < i : \pi(i) < \pi(j) \}| < d$
- $\mathrm{LRM}_d(\pi)$: number of d-lrm in $\pi$
- $\mathrm{Cont}_d(\Sigma, \rho) = \sum_{\sigma \in \Sigma} \mathrm{LRM}_d(\rho^{-1} \circ \sigma)$
- d-contention of $\Sigma$: $\mathrm{Cont}_d(\Sigma) = \max_{\rho} \mathrm{Cont}_d(\Sigma, \rho)$
Theorem: For sufficiently large p and n, there is a list $\Sigma$ of p permutations from $S_n$ such that, for every integer d > 1, $\mathrm{Cont}_d(\Sigma) \le n \log n + 5 p d \ln(e + n/d)$. Moreover, a random $\Sigma$ is good with high probability.
(Figure: example permutation 10 3 5 2 4 6 1 9 7 8 11 with d = 2.)
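The d-lrm definition can be checked directly on a small sequence. The sketch below uses a hypothetical function name and a simple quadratic scan; the example sequence 10, 3, 5, 2, 4, 6, 1, 9, 7, 8, 11 is one reading of the digits in the slide's figure and should be treated as an assumption.

```python
def d_lrm(pi, d):
    """Count positions i of pi that have fewer than d larger elements before them."""
    count = 0
    for i, value in enumerate(pi):
        larger_before = sum(1 for earlier in pi[:i] if earlier > value)
        if larger_before < d:
            count += 1
    return count


example = (10, 3, 5, 2, 4, 6, 1, 9, 7, 8, 11)
ordinary = d_lrm(example, 1)  # d = 1 gives the ordinary left-to-right maxima
relaxed = d_lrm(example, 2)   # d = 2 also admits elements with one larger predecessor
```

With d = 1 only 10 and 11 qualify; with d = 2 the elements 3, 5, 6 and 9 qualify as well, since each is preceded by exactly one larger value.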
27
d-Contention and work
Lemma: For algorithms PaDet and PaRan1, the respective worst-case work and expected work is at most $\mathrm{Cont}_d(\Sigma)$ against any d-adversary.
Example: p = 2, t = 11, d = 2
(Figure: the two processors' schedules over tasks 1, ..., 11 and the order in which the tasks are performed: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11.)
28
Permutation algorithms - results
Theorem: Randomized algorithms PaRan1 and PaRan2 perform expected work $O(t \log p + p\, d \log(t/d))$ and have expected communication $O(t\, p \log p + p^2 d \log(t/d))$ against any d-adversary (d = o(t)).
Corollary: There exists a deterministic list of schedules $\Sigma$ such that algorithm PaDet performs work $O(t \log p + p \min\{t,d\} \log(2 + t/d))$ and has communication $O(t\, p \log p + p^2 \min\{t,d\} \log(2 + t/d))$ when p t.
29
Conclusions and open problems
- Work-optimal Write-All algorithm for $t = \Omega(p^{2+\varepsilon})$
- First message-delay-sensitive analysis of the Do-All problem for asynchronous processors in the message-passing model:
  - lower bounds for deterministic and randomized algorithms
  - deterministic and randomized algorithms with subquadratic (in p and t) work for any message delay d as long as d = o(t)
Among the interesting open questions are:
- is there work-optimal scheduling for $t = \Omega(p \log p)$?
- for algorithm PaDet: how to construct the list of permutations efficiently
- closing the gap between the upper and the lower bounds
- investigating algorithms that simultaneously control work and message complexity