Published by Kerrie Underwood. Modified over 9 years ago.
1
On-line adaptive parallel prefix computation
Jean-Louis Roch, Daouda Traore, Julien Bernard
INRIA-CNRS Moais team - LIG Grenoble, France
EUROPAR’2006 - Dresden, Germany - August 29th, 2006
Contents
I. Motivation
II. Work-stealing scheduling of parallel algorithms
III. Processor-oblivious parallel prefix computation
2
Prefix problem:
- input: $a_0, a_1, \ldots, a_n$
- output: $\pi_1, \ldots, \pi_n$ with $\pi_i = a_0 * a_1 * \cdots * a_i$
Sequential algorithm: for ( π[0] = a[0], i = 1 ; i <= n ; i++ ) π[i] = π[i-1] * a[i] ;
- performs only n operations
Fine-grain optimal parallel algorithm:
- critical time = $2 \log n$ but performs $2n$ ops [Ladner-Fischer-81]
Parallel prefix on fixed architecture: tight lower bound on p identical processors:
- optimal time $T_p = 2n / (p+1)$ but performs $2np/(p+1)$ ops [Nicolau&al. 1996]
Parallel requires up to twice as many operations as sequential!
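The operation counts quoted above can be checked on a small example. The sketch below is illustrative only: `sequential_prefix` mirrors the loop on the slide, while `recursive_prefix` is one simple pairwise scheme chosen here as a stand-in for a fine-grain parallel algorithm (it is not claimed to be the exact Ladner-Fischer construction). Both function names are hypothetical. String concatenation is used in the usage note because it is associative but not commutative, so ordering mistakes would show up.

```python
from operator import add

def sequential_prefix(a, op=add):
    """Return (prefixes, ops) with prefixes[i] = a[0] op ... op a[i]."""
    pi = [a[0]]
    ops = 0
    for x in a[1:]:
        pi.append(op(pi[-1], x))
        ops += 1
    return pi, ops

def recursive_prefix(a, op=add):
    """Pairwise prefix scheme (assumed stand-in for a fine-grain parallel one).

    Each level combines adjacent pairs (independent operations), recurses on
    the half-size sequence, then fixes up the even positions.
    """
    n = len(a)
    if n == 1:
        return [a[0]], 0
    pairs = [op(a[i], a[i + 1]) for i in range(0, n - 1, 2)]
    ops = len(pairs)
    sub, sub_ops = recursive_prefix(pairs, op)
    ops += sub_ops
    out = [a[0]]
    for i in range(1, n):
        if i % 2 == 1:
            out.append(sub[i // 2])                 # already a full prefix
        else:
            out.append(op(sub[i // 2 - 1], a[i]))   # one extra operation
            ops += 1
    return out, ops
```

On 8 elements the sequential loop performs 7 operations and the pairwise scheme 11, which stays below 2n while needing only a logarithmic number of dependent rounds.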
3
The problem
Dynamic architecture: non-fixed number of resources, variable speeds
- e.g. grid, but not only: SMP server in multi-user mode
To design a single algorithm that efficiently computes prefix(a) on an arbitrary dynamic architecture.
Which algorithm to choose: sequential algorithm, parallel P=2, parallel P=100, ..., parallel P=max?
[Figure: multi-user SMP server, grid, heterogeneous network]
4
Lower bound for prefix on processors with changing speeds
- Model of heterogeneous processors with changing speed [Bender&al 02]
  => $\Pi_i(t)$ = instantaneous speed of processor i at time t (in #operations * per second)
  Assumption: $\Pi_{max}(t) < \text{constant} \cdot \Pi_{min}(t)$
  Def: $\Pi_{ave}$ = average speed per processor for a computation with duration T
- Theorem 2: lower bound for the time of prefix computation on p processors with changing speeds [formula missing in source]
  Sketch of the proof:
  - extension of the lower bound on p identical processors [Faith82]
  - based on the analysis of the number of performed operations
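One plausible way to formalize the average speed defined above is shown below. This is an assumption made for illustration, not a formula taken from the slides: it writes $\Pi_i(t)$ for the instantaneous speed of processor $i$ and averages over both the $p$ processors and the duration $T$.

```latex
\Pi_{ave} \;=\; \frac{1}{p\,T} \sum_{i=1}^{p} \int_{0}^{T} \Pi_i(t)\,\mathrm{d}t
```

Under this reading, $p \cdot T \cdot \Pi_{ave}$ is the total number of operations the $p$ processors could perform during the computation, which is the quantity an operation-counting lower bound would compare against.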
5
Changing speeds and work-stealing
Work-stealing scheduling adapts on-line to processor availability and speeds [Bender-02]
Principle of work-stealing = “greedy” schedule, but distributed and randomized:
- Each processor locally manages the tasks it creates.
- When idle, a processor steals the oldest ready task from a remote, non-idle victim processor (randomly chosen).
“Work” $W_1$ = total #operations performed
“Depth” $W_\infty$ = #ops on a critical path (parallel time on $\infty$ resources)
[Bender-Rabin02]
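The stealing rule described above can be sketched with one double-ended queue per processor. This is a minimal single-threaded illustration, not a real scheduler: the `Worker` class and its method names are hypothetical, and the choice that the owner works on its newest task is an assumption (the slide only specifies that thieves take the oldest one).

```python
from collections import deque
import random

class Worker:
    """One processor with its local task queue (left = oldest, right = newest)."""

    def __init__(self, name):
        self.name = name
        self.tasks = deque()

    def push(self, task):
        # Tasks created by this processor are managed locally.
        self.tasks.append(task)

    def pop_local(self):
        # Assumption: the owner takes its most recently created task.
        return self.tasks.pop() if self.tasks else None

    def steal_from(self, workers, rng):
        # An idle processor picks a random non-idle victim
        # and takes that victim's OLDEST ready task.
        busy = [w for w in workers if w is not self and w.tasks]
        if not busy:
            return None
        victim = rng.choice(busy)
        return victim.tasks.popleft()
```

For example, if one worker has pushed tasks t1, t2, t3 in that order, an idle worker steals t1 while the owner continues with t3.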
6
Work-stealing and adaptation
“Work” $W_1$ = total #operations performed
“Depth” $W_\infty$ = #ops on a critical path (parallel time on $\infty$ resources)
Interest: if $W_1$ is fixed and $W_\infty$ is small, work-stealing gives a near-optimal adaptive schedule with good probability on p processors with average speed $\Pi_{ave}$
Moreover: #steals = #task migrations < $p \cdot W_\infty$ [Blumofe 98, Narlikar 01, Bender 02]
But lower bounds for prefix:
- Minimal work $W_1 = n$ implies $W_\infty = n$
- Minimal depth $W_\infty = 2 \log n$ requires $W_1 \approx 2n$
With work-stealing, how to reach the lower bound?
7
General approach: coupling two algorithms:
- a sequential algorithm with an optimal number of operations $W_s$
- a fine-grain parallel algorithm with minimal critical time $W_\infty$ but parallel work >> $W_s$
Folk technique: parallel, then sequential
- Run the parallel algorithm down to a certain “grain”, then use the sequential one.
- Drawback with changing speeds: either too many idle processors or too many operations.
Work-preserving speed-up technique [Bini-Pan94]: sequential, then parallel
Cascading [Jaja92] = careful interplay of both algorithms to build one with both $W_\infty$ small and $W_1 = O(W_{seq})$
- Use the work-optimal sequential algorithm to reduce the size.
- Then use the time-optimal parallel algorithm to decrease the time.
- Drawback: sequential at coarse grain and parallel at fine grain.
How to get both work $W_1$ and depth $W_\infty$ small?
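To make the cost of a fixed grain concrete, here is a sketch of a common static block scheme: the input is cut into p blocks once and for all, each block is scanned sequentially, and a fix-up pass applies the carry of the preceding blocks. It is an assumed illustration of a static-grain coupling, not code from the presentation; `block_prefix` is a hypothetical name, and the "parallel" phases are executed one after another here because only the operation count matters.

```python
from operator import add

def block_prefix(a, p, op=add):
    """Static-grain prefix on p blocks; returns (prefixes, ops)."""
    n = len(a)
    size = -(-n // p)                      # ceil(n / p): grain fixed up front
    blocks = [a[i:i + size] for i in range(0, n, size)]
    ops = 0
    # Phase 1 (one block per processor): local sequential prefix.
    local = []
    for b in blocks:
        acc = [b[0]]
        for x in b[1:]:
            acc.append(op(acc[-1], x))
            ops += 1
        local.append(acc)
    # Phase 2 (short, sequential): carry entering each block after the first.
    carry = [local[0][-1]]
    for blk in local[1:-1]:
        carry.append(op(carry[-1], blk[-1]))
        ops += 1
    # Phase 3 (one block per processor): apply the carry to every element.
    out = list(local[0])
    for k in range(1, len(local)):
        for v in local[k]:
            out.append(op(carry[k - 1], v))
            ops += 1
    return out, ops
```

With 12 elements and p = 3 this performs 18 operations against 11 for the sequential loop, and the split cannot react if one of the three processors slows down, which is the drawback noted above.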
8
Alternative: concurrently sequential and parallel
Based on work-stealing and the work-first principle:
- Always execute a sequential algorithm, to reduce parallelism overhead.
- Use the parallel algorithm only if a processor becomes idle (i.e. work-stealing), by extracting parallelism from the sequential computation (i.e. adaptive granularity).
Hypothesis: two algorithms:
- 1 sequential: SeqCompute
- 1 parallel: LastPartComputation: at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm
Self-adaptive granularity based on work-stealing
[Diagram: SeqCompute, Extract_par, LastPartComputation, SeqCompute]
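The extraction step can be pictured as splitting the index range that the sequential process has not yet reached. The sketch below is an assumption about what such a split could look like: the function name `extract_par`, the half-and-half ratio, and the `grain` parameter are all hypothetical, since the slides do not state the split policy.

```python
def extract_par(lo, hi, grain=1):
    """Split the remaining work [lo, hi) of a running sequential computation.

    Returns (kept, stolen): the victim keeps the first part and the thief
    receives the LAST part. If too little work remains, nothing is stolen.
    """
    remaining = hi - lo
    if remaining <= grain:
        return (lo, hi), None
    mid = lo + (remaining + 1) // 2        # victim keeps the (larger) first half
    return (lo, mid), (mid, hi)
```

Because the split is computed only when a steal request arrives, the grain is decided at run time by processor availability rather than fixed in advance.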
9
Alternative: concurrently sequential and parallel
[Diagram: SeqCompute, preempt]
10
Alternative: concurrently sequential and parallel
[Diagram: SeqCompute, merge/jump, complete, Seq]
11
Adaptive Prefix on 3 processors
[Figure, step 1: inputs a_1 ... a_12; processes: Main Seq., Work-stealer 1, Work-stealer 2; event: steal request. Legend: Parallel, Sequential.]
12
Adaptive Prefix on 3 processors
[Figure, step 2: inputs shown as blocks a_1 ... a_4 and a_5 ... a_12; processes: Main Seq., Work-stealer 1, Work-stealer 2; partial products of the form a_5 * ... * a_i; event: steal request. Legend: Parallel, Sequential.]
13
Adaptive Prefix on 3 processors
[Figure, step 3: inputs shown as blocks a_1 ... a_4, a_5 ... a_8 and a_9 ... a_12; processes: Main Seq., Work-stealer 1, Work-stealer 2; partial products of the form a_5 * ... * a_i and a_9 * ... * a_i; event: preempt. Legend: Parallel, Sequential.]
14
Adaptive Prefix on 3 processors
[Figure, step 4: same blocks a_1 ... a_4, a_5 ... a_8 and a_9 ... a_12, with further prefixes and partial products a_5 * ... * a_i and a_9 * ... * a_i filled in; event: preempt. Legend: Parallel, Sequential.]
15
Adaptive Prefix on 3 processors
[Figure, step 5: same blocks a_1 ... a_4, a_5 ... a_8 and a_9 ... a_12, with all prefixes up to index 12 filled in from the partial products a_5 * ... * a_i and a_9 * ... * a_i. Legend: Parallel, Sequential.]
16
Adaptive Prefix on 3 processors
[Figure, step 6: same final state as the previous step. Legend: Parallel, Sequential.]
Implicit critical path on the sequential process
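The steal / preempt / merge-and-jump sequence pictured on these slides can be imitated by a small lock-step simulation. Everything below is a hedged sketch under stated assumptions, not the authors' implementation: `adaptive_prefix` is a hypothetical name, time advances in rounds with at most one operation per process, victims are chosen deterministically (largest remaining range) instead of randomly, a thief takes the last half of the victim's remaining range, and the finalization of a stolen block is performed immediately at merge time and merely counted as work-stealer operations.

```python
from operator import add

def adaptive_prefix(a, p, op=add):
    """Simulate 1 main sequential process plus (p - 1) work-stealers.

    Returns (prefixes, main_ops, stealer_ops).
    """
    n = len(a)
    pi = [None] * n
    pi[0] = a[0]
    m, main_end = 1, n          # main computes pi[m] next and owns [m, main_end)
    segs = {}                   # start -> {"next", "end", "beta"} for stolen ranges
    active = [None] * (p - 1)   # start index of each stealer's current segment
    main_ops = stealer_ops = 0
    while m < n:
        # Main sequential process.
        if m < main_end:
            pi[m] = op(pi[m - 1], a[m])
            m += 1
            main_ops += 1
        else:
            # Reached a stolen segment: preempt it, merge, and jump.
            seg = segs.pop(m)
            beta, k = seg["beta"], seg["next"]   # beta[j] = a[m] op ... op a[m+j]
            for i in range(m, k):
                pi[i] = op(pi[m - 1], beta[i - m])
            main_ops += 1                        # the jump to pi[k-1]
            stealer_ops += k - m - 1             # other finalizations (counted only)
            m, main_end = k, seg["end"]
        # Work-stealers.
        for t in range(p - 1):
            seg = segs.get(active[t])
            if seg is not None and seg["next"] < seg["end"]:
                seg["beta"].append(op(seg["beta"][-1], a[seg["next"]]))
                seg["next"] += 1
                stealer_ops += 1
                continue
            # Idle: look for the largest range with at least 2 items left.
            best, best_rem = None, 1
            if main_end - m > best_rem:
                best, best_rem = "main", main_end - m
            for s, sg in segs.items():
                if sg["end"] - sg["next"] > best_rem:
                    best, best_rem = s, sg["end"] - sg["next"]
            if best is None:
                continue
            if best == "main":
                lo, hi = m, main_end
            else:
                lo, hi = segs[best]["next"], segs[best]["end"]
            mid = lo + (hi - lo + 1) // 2        # victim keeps the first half
            if best == "main":
                main_end = mid
            else:
                segs[best]["end"] = mid
            segs[mid] = {"next": mid + 1, "end": hi, "beta": [a[mid]]}
            active[t] = mid
    return pi, main_ops, stealer_ops
```

With p = 1 no steal ever happens and the run degenerates to the plain sequential loop; with more processes each stolen element costs up to two operations (one local product, one finalization), so the total stays below 2n.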
17
Analysis of the algorithm
Theorem 3: Execution time [formula missing in source; the slide sets it against the lower bound]
Sketch of the proof: analysis of the operations performed:
- The sequential main performs S operations on one processor.
- The (p-1) work-stealers perform X = 2(n-S) operations with depth log X.
- Each non-constant-time task can potentially be split (variable speeds).
- The coupling ensures both algorithms complete simultaneously: $T_s = T_p - O(\log X)$
This makes it possible to bound the total number X of operations performed and the overhead of parallelism = (S+X) - #ops_optimal.
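The way simultaneous completion pins down S can be worked through in a simplified setting. The derivation below is a sketch under assumptions that are not in the slides: all processors run at unit speed, and the $O(\log X)$ term is dropped.

```latex
\begin{align*}
  \text{main: } T_s &= S,
  \qquad
  \text{work-stealers: } T_p \approx \frac{X}{p-1} = \frac{2(n-S)}{p-1} \\
  T_s = T_p
  \;&\Longrightarrow\; S\,(p-1) = 2n - 2S
  \;\Longrightarrow\; S = \frac{2n}{p+1} \\
  S + X &= \frac{2n}{p+1} + \frac{2n\,(p-1)}{p+1} = \frac{2np}{p+1}
\end{align*}
```

Under these assumptions the completion time $2n/(p+1)$ and the operation count $2np/(p+1)$ coincide with the figures quoted for p identical processors at the start of the talk.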
18
Adaptive prefix: experiments 1
Single-user context: processor-adaptive prefix achieves near-optimal performance:
- close to the lower bound both on 1 processor and on p processors
- less sensitive to system overhead: even better than the theoretically “optimal” off-line parallel algorithm on p processors
[Figure: prefix sum of 8·10^6 doubles on an 8-processor SMP (IA64 1.5 GHz / Linux), single-user context. Axes: time (s) vs. #processors. Curves: pure sequential, optimal off-line on p procs, adaptive.]
19
Adaptive prefix: experiments 2
Multi-user context: additional external load: (9-p) additional external dummy processes are executed concurrently.
- Processor-adaptive prefix computation is always the fastest.
- 15% benefit over a parallel algorithm for p processors with off-line schedule.
[Figure: prefix sum of 8·10^6 doubles on an 8-processor SMP (IA64 1.5 GHz / Linux), multi-user context with external load (9-p external processes). Axes: time (s) vs. #processors. Curves: off-line parallel algorithm for p processors, adaptive.]
20
Conclusion
The interplay of an on-line parallel algorithm directed by a work-stealing schedule is useful for the design of processor-oblivious algorithms.
Application to prefix computation:
- theoretically, reaches the lower bound on heterogeneous processors with changing speeds
- practically, achieves near-optimal performance on multi-user SMPs
Generic adaptive scheme to implement parallel algorithms with provable performance
- work in progress: parallel 3D reconstruction [oct-tree scheme with deadline constraint]
21
Thank you!
Interactive Distributed Simulation [B Raffin & E Boyer]
- 5 cameras
- 6 PCs
3D-reconstruction + simulation + rendering
-> Adaptive scheme to maximize 3D-reconstruction precision within fixed timestamp [L Suares, B Raffin, JL Roch]
22
The Prefix race: sequential / parallel fixed / adaptive
[Figure: race between Adaptive 8 proc., Parallel 8 proc., Parallel 7 proc., Parallel 6 proc., Parallel 5 proc., Parallel 4 proc., Parallel 3 proc., Parallel 2 proc. and Sequential]
On each of the 10 executions, adaptive completes first.
23
Adaptive prefix: some experiments
Single-user context: adaptive is equivalent to:
- sequential on 1 proc
- optimal parallel-2 proc. on 2 processors
- ...
- optimal parallel-8 proc. on 8 processors
Multi-user context: adaptive is the fastest; 15% benefit over a static-grain algorithm.
[Figures: prefix of 10000 elements on an 8-processor SMP (IA64 / Linux). Axes: time (s) vs. #processors. Curves: parallel, adaptive; the multi-user plot includes external load.]
24
With * = double sum ( r[i] = r[i-1] + x[i] )
[Figures: single user; processors with variable speeds]
Remark for n = 4,096,000 doubles:
- “pure” sequential: 0.20 s
- minimal “grain” = 100 doubles: 0.26 s on 1 proc and 0.175 s on 2 procs (close to lower bound)
Finest “grain” limited to 1 page = 16384 bytes = 2048 doubles