1 Performing Tasks in Asynchronous Environments
Dariusz Kowalski, University of Connecticut & Warsaw University
Joint work with Alex Shvartsman, University of Connecticut & MIT

2 The Do-All problem [DHW]
The DA(p,t) problem abstracts the basic problem of cooperation in a distributed setting: p processors must perform t tasks, and at least one processor must learn that all tasks are done [Dwork Halpern Waarts 92/98].
Tasks are:
- known to every processor
- similar: each takes a similar number of local steps
- independent: they may be performed in any order
- idempotent: they may be performed concurrently

3 Do-All: the synchronous model with crashes
Model: processors are synchronous and may fail by crashing.
Solutions: the problem is well understood; the results are close to optimal.
Shared-memory model (communication by read/write):
- Kanellakis, P.C., Shvartsman, A.A.: Fault-Tolerant Parallel Computation. Kluwer Academic Publishers (1997)
Message-passing model (communication by exchanging messages):
- Dwork, C., Halpern, J., Waarts, O.: Performing work efficiently in the presence of faults. SIAM Journal on Computing, 27 (1998)
- De Prisco, R., Mayer, A., Yung, M.: Time-optimal message-efficient work performance in the presence of faults. Proc. of 13th PODC (1994)
- Chlebus, B., De Prisco, R., Shvartsman, A.A.: Performing tasks on synchronous restartable message-passing processors. Distributed Computing, 14 (2001)

4 Do-All: asynchronous models
Shared-memory model (communication by read/write): widely studied, but the solutions are far from optimal.
- Kanellakis, P.C., Shvartsman, A.A.: Fault-Tolerant Parallel Computation. Kluwer Academic Publishers (1997)
- Anderson, R.J., Woll, H.: Algorithms for the certified Write-All problem. SIAM Journal on Computing, 26 (1997)
- Kedem, Z., Palem, K., Raghunathan, A., Spirakis, P.: Combining tentative and definite executions for very fast dependable parallel computing. Proc. of 23rd STOC (1991)
Message-passing model (communication by exchanging messages): no interesting solutions until recently.

5 Shared Memory vs. Message Passing
Shared memory (atomic registers):
- processors communicate by reads/writes in shared memory
- atomicity guarantees that a read returns the last written value
- one read/write operation per local clock cycle
- information propagates and is persistent
Hence cooperation is always possible, although it may be delayed; here processor scheduling is the major challenge.
Message passing:
- processors communicate by exchanging messages
- the duration of a local step may be unbounded
- message delays may be unbounded
- information may fail to propagate: sends and receives depend on the delays

6 The message-delay-sensitive approach
Even if message delays are bounded by d (a d-adversary), cooperation may be difficult.
Observation: If d = Ω(t), then work must be Ω(t·p): with all messages delayed by Θ(t), processors work in isolation and cannot learn which tasks the others have completed, so essentially every processor must perform every task.
This means that cooperation is difficult and that addressing scheduling alone is not enough: algorithm design and analysis must be d-sensitive.
On the message-delay-sensitive approach, cf.: Dwork, C., Lynch, N., Stockmeyer, L.: Consensus in the presence of partial synchrony. J. of the ACM, 35 (1988)

7 Measures of efficiency
Termination time: the first time at which all tasks are done and at least one processor knows it.
- Used only to define work and message complexity.
- Not interesting on its own: if all processors but one are delayed, then the time is trivially Ω(t).
Work: the sum, over all processors, of the number of local steps taken until termination time.
Message complexity (message-passing model): the number of all point-to-point messages sent until termination time.

8 Structure of the presentation
Part 1: Shared-memory model
- Model and bibliography
- Improving the AW algorithm in shared memory by better processor scheduling (task load-balancing)
Part 2: Message-passing model
- Model: asynchrony, message delay, and modeling issues
- Delay-sensitive lower bounds for Do-All
- Progress-tree Do-All algorithms: simulating shared memory and Anderson-Woll (AW); an asynchronous message-passing progress-tree algorithm
- Permutation Do-All algorithms

9 Shared Memory: model and goal
We consider the following model:
- p asynchronous processors, with PIDs in {0, ..., p-1}
- processors communicate by reads/writes in shared memory
- atomicity: a read returns the last written value
- one read/write operation per local clock cycle
Write-All: write 1's into all t locations of a given array.
Goal: improve the scheduling of cooperating asynchronous processors, leading to better load-balancing with respect to the tasks. A naive baseline appears below.
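To make the Write-All setting and the work measure concrete, here is a minimal sketch in Python (all names hypothetical; this is the trivial baseline, not an algorithm from these slides): each of p asynchronous processors obliviously sweeps the whole array, which is always correct but does Θ(t·p) work, the figure the algorithms below improve on.

```python
# Naive Write-All baseline: p processors each sweep all t cells.
# The adversary controls the interleaving; work is the total number
# of local steps, here exactly t * p.
import random

def naive_write_all(p, t, seed=0):
    shared = [0] * t                 # shared array of t cells
    pcs = [0] * p                    # each processor's progress in its sweep
    work = 0
    rng = random.Random(seed)
    while any(pc < t for pc in pcs):
        pid = rng.randrange(p)       # adversary picks who takes the next step
        if pcs[pid] < t:
            # processor pid writes the next cell of its own rotated sweep
            shared[(pid + pcs[pid]) % t] = 1
            pcs[pid] += 1
            work += 1
    assert all(shared)               # every cell holds 1 on termination
    return work

print(naive_write_all(p=4, t=16))    # 64 = t * p
```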

10 Write-All: selected bibliography
Introducing the Write-All problem:
- Kanellakis, P.C., Shvartsman, A.A.: Efficient parallel algorithms can be made robust. PODC (1989); Distributed Computing (1992)
The AW algorithm, with work O(t·p^ε):
- Anderson, R.J., Woll, H.: Algorithms for the certified Write-All problem. SIAM Journal on Computing, 26 (1997)
A randomized algorithm with expected work O(t + p·log p):
- Martel, C., Subramonian, R.: On the complexity of certified Write-All algorithms. J. Algorithms, 16 (1994)
The first work-optimal deterministic algorithm, for t = Ω(p^4 log p):
- Malewicz, G.: A work-optimal deterministic algorithm for the asynchronous certified Write-All problem. PODC (2003)

11 Progress tree algorithms [BKRS, AW]
- Shared memory, p processors, t tasks (p = t)
- Ψ: q permutations of the set [q]
- q-ary progress tree of depth log_q p; nodes hold binary completion bits
- The permutations establish the order in which the children of a node are visited
- The p processors traverse the tree, using the q-ary expansion of their PID to choose permutations [Anderson Woll]
(figure: a q-ary progress tree, the children of a node labeled 1, 2, 3, ..., q)
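A small sketch of the PID-to-permutation mapping the slide describes (helper names are hypothetical, and taking digits least-significant-first is just one possible convention): the j-th digit of the q-ary expansion of a processor's PID selects which of the q permutations orders the children at depth j.

```python
# How a processor derives its child-visit orders from its PID.
def qary_digits(pid, q, depth):
    """Digits of pid in base q, least-significant first, padded to depth."""
    digits = []
    for _ in range(depth):
        digits.append(pid % q)
        pid //= q
    return digits

# q = 3 permutations of the q = 3 children, as in the slide's example
perms = [(0, 1, 2), (1, 2, 0), (2, 0, 1)]

q, depth = 3, 2                      # ternary tree with 9 leaves (t = 9)
for pid in range(9):
    ds = qary_digits(pid, q, depth)
    orders = [perms[d] for d in ds]  # one child order per tree level
    print(f"PID {pid}: digits {ds} -> child orders {orders}")
```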

12 Algorithm AWT [Anderson Woll]
The progress tree data structure is stored in shared memory.
Example: p = t = 9, q = 3
- Ψ: list of 3 schedules from S_3
- T: ternary progress tree with 9 leaves, holding 0-1 values
- PID(j): the j-th digit of the ternary representation of PID
- π_0 used by PID = 0, 3, 6; π_1 by PID = 1, 4, 7; π_2 by PID = 2, 5, 8
(figure: the ternary tree with its nodes numbered, and the traversal of processor PID 7 = 21 in ternary)

13 Contention of permutations
- S_n: the group of all permutations of the set [n], with composition ∘ and identity id_n
- σ, π: permutations in S_n; Ψ: a set of q permutations from S_n
- i is an lrm (left-to-right maximum) in π if π(i) > max_{j<i} π(j); LRM(π) is the number of lrm's in π [Knuth]
- Cont(σ, Ψ) = Σ_{π∈Ψ} LRM(π⁻¹ ∘ σ)
- Contention of Ψ: Cont(Ψ) = max_σ Cont(σ, Ψ) [AW]
Theorem [AW]: For any n > 0 there exists a set Ψ of n permutations from S_n with Cont(Ψ) ≤ 3nH_n = Θ(n log n).
[Knuth] Knuth, D.E.: The Art of Computer Programming, Vol. 3 (third edition). Addison-Wesley (1998)
(figure: an example permutation with its left-to-right maxima marked)
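The contention definitions translate directly into code. This brute-force sketch (fine only for tiny n; permutations are tuples mapping positions 0..n-1 to values 0..n-1) is a direct transcription of the formulas above:

```python
# LRM and contention, computed by brute force over all sigma in S_n.
from itertools import permutations

def lrm(pi):
    """Number of left-to-right maxima of permutation pi."""
    best, count = -1, 0
    for v in pi:
        if v > best:
            best, count = v, count + 1
    return count

def inverse(pi):
    inv = [0] * len(pi)
    for i, v in enumerate(pi):
        inv[v] = i
    return inv

def compose(pi, sigma):
    """(pi o sigma)(i) = pi(sigma(i))."""
    return tuple(pi[s] for s in sigma)

def contention(psi):
    """Cont(Psi) = max over sigma of the sum of LRM(pi^{-1} o sigma)."""
    n = len(psi[0])
    return max(sum(lrm(compose(inverse(pi), sigma)) for pi in psi)
               for sigma in permutations(range(n)))

psi = [(0, 1, 2), (1, 2, 0), (2, 0, 1)]   # the three cyclic shifts in S_3
print(contention(psi))
```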

14 Procedure "Oblivious Do"
n: the number of jobs and of units; Ψ: a list of n schedules from S_n
Procedure Oblivious:
  forall processors PID = 0 to n-1 in parallel
    for i = 1 to n do perform Job(π_PID(i))
An execution of Job(π_PID(i)) by processor PID is primary if job π_PID(i) has not been performed before.
Lemma [AW]: In algorithm Oblivious with n units, n jobs, and the list Ψ of n permutations from S_n, the number of primary job executions is at most Cont(Ψ).
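Here is a minimal simulation of procedure Oblivious under an adversarial interleaving (the scheduling harness is hypothetical; only the selection rule comes from the slide), counting primary job executions:

```python
# Simulate Oblivious: each processor performs jobs in its schedule's order;
# the adversary chooses which processor takes the next step. The [AW] lemma
# says the primary-execution count never exceeds Cont(Psi).
import random

def oblivious(psi, seed=0):
    n = len(psi)
    done = [False] * n
    next_step = [0] * n                # next index in each processor's schedule
    primary = 0
    rng = random.Random(seed)
    active = set(range(n))
    while active:
        pid = rng.choice(sorted(active))   # adversary schedules one step
        job = psi[pid][next_step[pid]]
        if not done[job]:
            primary += 1                   # a primary execution: job was new
        done[job] = True
        next_step[pid] += 1
        if next_step[pid] == n:
            active.discard(pid)
    return primary

psi = [(0, 1, 2), (1, 2, 0), (2, 0, 1)]
print(oblivious(psi))
```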

15 AWT(q): a new progress tree traversal algorithm
Instead of using q permutations on the set [q], we use q permutations on the set [n], where n = q² log q.
Example: p = 6, t = 16, q = 2, n = 4
- Ψ: list of 2 schedules from S_4
- T: 4-ary progress tree with 16 leaves, holding 0-1 values
- PID(j): the j-th digit of the binary representation of PID
- π_0 used by even PID; π_1 by odd PID
(figure: the 4-ary tree and the traversal of processor PID 5 = 101 in binary)

16 Main result
Set n = q² log q and let Ψ be a list of q schedules from S_n.
Define Cont(Ψ, Λ) = max_{σ∈Λ} Cont(σ, Ψ).
Lemma: For sufficiently large q and any set Λ of at most exp(q² log² q) permutations of the set [q² log q], there is a list Ψ of q schedules from S_n such that Cont(Ψ, Λ) ≤ q² log q + 6q log q.
Take q = log p and Ψ from the Lemma above.
Theorem: For every ε > 0 and sufficiently large p, if t = Ω(p^{2+ε}), then algorithm AWT(q) performs work O(t).

17 Message Passing: model and goals
We consider the following model:
- p asynchronous processors, with PIDs in {0, ..., p-1}
- processors communicate by message passing
- in one local step a processor can send a message to any subset of the processors
- messages incur delays between send and receive
- all received messages can be processed within one local step
Goal: understand the impact of message delay on the efficiency of algorithmic solutions for Do-All.

18 Lower bound for randomized algorithms
Theorem: Any randomized algorithm solving DA with t tasks on p asynchronous message-passing processors performs expected work Ω(t + p·d·log_{d+1} t) against some d-adversary.
Proof (sketch): The adversary partitions the computation into stages of d time units each and constructs the delay pattern stage by stage:
- it delays all messages sent in a stage so that they are received at the end of that stage
- it delays a linear number of processors (those that want to perform more than a (1 - 1/(3d)) fraction of the undone tasks) during the stage
The selection is made on-line and has the required properties with high probability.
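A hedged reading of how the stage structure adds up to the stated bound (illustrative bookkeeping, not the actual proof): each of the Ω(log_{d+1} t) stages needed to exhaust the tasks forces the Ω(p) delayed processors to spend d steps without useful progress, and every task costs at least one step, so

$$ W \;=\; \Omega(t) \;+\; \Omega(p)\cdot d \cdot \Omega\!\left(\log_{d+1} t\right) \;=\; \Omega\!\left(t + p\,d\,\log_{d+1} t\right). $$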

19 Simulating shared-memory algorithms
Write-All algorithm AWT:
- Anderson, R.J., Woll, H.: Algorithms for the certified Write-All problem. SIAM Journal on Computing, 26 (1997)
Quorum systems and atomic memory services:
- Attiya, H., Bar-Noy, A., Dolev, D.: Sharing memory robustly in message passing systems. J. of the ACM, 42 (1995)
- Lynch, N., Shvartsman, A.: RAMBO: A reconfigurable atomic memory service. Proc. of 16th DISC (2002)
Emulating asynchronous shared-memory algorithms:
- Momenzadeh, M.: Emulating shared-memory Do-All in asynchronous message passing systems. Master's Thesis, CSE, University of Connecticut (2003)

20 Atomic memory is not required
We use q-ary progress trees as the main data structure that is "written" and "read"; note that atomicity is not required.
If two writes of the entire tree occur, a subsequent read may return a third value that was never written: e.g., writes of 010 and 001 followed by a read of 011.
Property of monotone progress:
- a 1 at tree node i indicates that all tasks attached to the leaves of the subtree rooted at i have been performed
- once a 1 is written at node i of a processor's progress tree, it remains 1 forever
Hence a "mixed" read only over-reports 1's that are in fact valid, so such reads are harmless.
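Because 1's are monotone facts, a processor can fold any received tree into its own with a pointwise OR; a minimal sketch (the function name is hypothetical):

```python
# Merging progress trees without atomicity: pointwise OR is safe because
# a 1 is a permanent fact ("this subtree's tasks are all done").
def merge_progress(local, received):
    """Pointwise OR of two 0/1 progress trees of equal length."""
    return [a | b for a, b in zip(local, received)]

mine = [0, 1, 0]
theirs = [0, 0, 1]
print(merge_progress(mine, theirs))   # [0, 1, 1] -- the slide's "third value"
```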

21 Algorithm DA_q: traversing the progress tree
Instead of using shared memory, processors broadcast their progress trees as soon as local progress is recorded.
Example: p = t = 9, q = 3
- Ψ: list of 3 schedules from S_3
- T: ternary progress tree with 9 leaves, holding 0-1 values
- PID(j): the j-th digit of the ternary representation of PID
- π_0 used by PID = 0, 3, 6; π_1 by PID = 1, 4, 7; π_2 by PID = 2, 5, 8
(figure: the same ternary tree as in the AWT example, traversed by processor PID 7 = 21 in ternary)

22 Algorithm DA_q: the case p ≥ t
(pseudocode figure)

23 Procedure DOWORK
(pseudocode figure)
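The DOWORK pseudocode itself was a figure and did not survive transcription; the following is only a rough sketch of the loop structure suggested by slides 20-21 (all names are hypothetical): descend the tree along unfinished children in the PID-selected orders, perform the leaf's task, mark it, and broadcast the updated tree.

```python
# One local step of a DA_q-style traversal over an array-encoded q-ary tree
# (root at index 0, children of node v at v*q+1 .. v*q+q, leaves = tasks).
def da_q_round(tree, q, depth, child_orders, do_task, broadcast):
    """Returns False once the root is marked, i.e., all tasks are done."""
    if tree[0]:
        return False                       # root is 1: everything is done
    node = 0
    for level in range(depth):
        first = node * q + 1
        nxt = next((first + c for c in child_orders[level]
                    if not tree[first + c]), None)
        if nxt is None:                    # merged trees may lag behind:
            tree[node] = 1                 # all children done, mark and share
            broadcast(tree)
            return True
        node = nxt                         # descend toward an unfinished leaf
    leaf_offset = (q ** depth - 1) // (q - 1)
    do_task(node - leaf_offset)            # leaf i carries task i
    tree[node] = 1
    broadcast(tree)                        # monotone 1's are always safe to send
    return True

# Single-processor run: q = 3, depth = 2, t = 9 tasks.
q, depth = 3, 2
tree = [0] * ((q ** (depth + 1) - 1) // (q - 1))
orders = [(2, 0, 1), (1, 2, 0)]            # child orders picked by PID digits
while da_q_round(tree, q, depth, orders, do_task=print, broadcast=lambda _: None):
    pass
```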

24 Algorithm DA_q: analysis
Modification of algorithm DA_q for p < t: partition the t tasks into p jobs of size t/p and let algorithm DA_q work with these jobs. It then takes a processor O(t/p) work (instead of constant) to process one such job (job unit).
Since in each step a processor broadcasts at most one message to the p-1 other processors, we obtain:
Theorem 4: For any constant ε > 0 there is a constant q such that algorithm DA_q has work W(p,t,d) = O(t·p^ε + p·d·⌈t/d⌉^ε) and message complexity O(p·W(p,t,d)) against any d-adversary (d = o(t)).

25 Permutation algorithms: the case p ≤ t
The algorithms proceed in a loop:
- select the next task using an ORDER+SELECT rule
- perform the selected task
- send messages, receive messages, and update the local state
ORDER+SELECT rules (a sketch of the selection step follows below):
- PaRan1: initially processor PID permutes the tasks randomly; PID selects the first task remaining on its schedule
- PaRan2: no initial order; PID selects a task uniformly at random from those remaining
- PaDet: initially processor PID chooses schedule π_PID from Ψ; PID selects the first task remaining on schedule π_PID
Ψ: a list of p schedules from S_t
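A sketch of the three selection rules (the loop harness and all names are hypothetical; the slide specifies only the rules themselves):

```python
# ORDER+SELECT rules: how a processor picks its next task from the set of
# tasks it still believes undone.
import random

def select_first_on_schedule(schedule, undone):
    """PaRan1 / PaDet step: first task of my schedule still believed undone
    (the schedule is random for PaRan1, taken from Psi for PaDet)."""
    return next((task for task in schedule if task in undone), None)

def select_paran2(undone, rng):
    """PaRan2 step: a uniformly random task among those believed undone."""
    return rng.choice(sorted(undone)) if undone else None

# PaRan1 and PaDet differ only in where the schedule comes from:
rng = random.Random(0)
t = 8
schedule_ran1 = rng.sample(range(t), t)   # PaRan1: random initial order
schedule_det = list(range(t))             # PaDet: pi_PID taken from Psi
undone = set(range(t))
while undone:
    task = select_first_on_schedule(schedule_ran1, undone)
    undone.discard(task)                  # perform the task, update local view
```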

26 d-Contention of permutations
We introduce the notion of d-contention:
- i is a d-lrm in π if |{j < i : π(j) > π(i)}| < d
- LRM_d(π): the number of d-lrm's in π
- Cont_d(σ, Ψ) = Σ_{π∈Ψ} LRM_d(π⁻¹ ∘ σ)
- d-contention of Ψ: Cont_d(Ψ) = max_σ Cont_d(σ, Ψ)
Theorem: For sufficiently large p and n, there is a list Ψ of p permutations from S_n such that, for every integer d > 1, Cont_d(Ψ) ≤ n log n + 5pd ln(e + n/d). Moreover, a random Ψ is good with high probability.
(figure: an example permutation with its d-lrm's marked for d = 2)
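The d-lrm definition in code (a direct transcription; the quadratic scan is fine for illustration). Note that LRM_1 coincides with the ordinary left-to-right maxima:

```python
# Count d-left-to-right maxima: position i qualifies if fewer than d
# earlier positions carry a larger value.
def lrm_d(pi, d):
    count = 0
    for i, v in enumerate(pi):
        if sum(1 for j in range(i) if pi[j] > v) < d:
            count += 1
    return count

pi = (2, 0, 3, 1, 4)
print(lrm_d(pi, 1), lrm_d(pi, 2))   # 3 (positions 0, 2, 4) and 4 (adds position 1)
```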

27 d-Contention and work
Lemma: Against any d-adversary, the worst-case work of algorithm PaDet and the expected work of algorithm PaRan1 are each at most Cont_d(Ψ).
Example: p = 2, t = 11, d = 2
(figure: the two processors' schedules and the resulting order 1, 2, ..., 11 in which the tasks are performed)

28 Permutation algorithms: results
Theorem: The randomized algorithms PaRan1 and PaRan2 perform expected work O(t·log p + p·d·log(t/d)) and have expected communication O(t·p·log p + p²·d·log(t/d)) against any d-adversary (d = o(t)).
Corollary: There exists a deterministic list of schedules Ψ such that algorithm PaDet performs work O(t·log p + p·min{t,d}·log(2 + t/d)) and has communication O(t·p·log p + p²·min{t,d}·log(2 + t/d)) when p ≤ t.

29 Conclusions and open problems
- A work-optimal Write-All algorithm for t = Ω(p^{2+ε})
- The first message-delay-sensitive analysis of the Do-All problem for asynchronous processors in the message-passing model:
  - lower bounds for deterministic and randomized algorithms
  - deterministic and randomized algorithms with subquadratic (in p and t) work for any message delay d, as long as d = o(t)
Among the interesting open questions:
- is there a work-optimal scheduling for t = Θ(p log p)?
- for algorithm PaDet: how can the list Ψ of permutations be constructed efficiently?
- closing the gaps between the upper and lower bounds
- investigating algorithms that simultaneously control work and message complexity

