On-line adaptive parallel prefix computation Jean-Louis Roch, Daouda Traoré and Julien Bernard Presented by Andreas Söderström, ITN
The prefix problem
Given X = x_1, x_2, …, x_n, compute the n products π_k = x_1 ∘ x_2 ∘ … ∘ x_k for 1 ≤ k ≤ n, where ∘ is some associative operation
Example: ∘ = + (i.e. addition), X = 1, 3, 5, 7
  π_1 = 1
  π_2 = 1+3 = 4
  π_3 = 1+3+5 = 9
  π_4 = 1+3+5+7 = 16
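As a point of reference, the definition above can be sketched as a plain sequential scan. The function name `prefix` is made up for this illustration; `itertools.accumulate` is the standard-library routine that performs the running combination.

```python
from itertools import accumulate
import operator


def prefix(xs, op=operator.add):
    """Return [pi_1, ..., pi_n], where pi_k = xs[0] op xs[1] op ... op xs[k-1].

    `op` must be associative; it does not have to be commutative.
    """
    return list(accumulate(xs, op))


if __name__ == "__main__":
    print(prefix([1, 3, 5, 7]))  # [1, 4, 9, 16], the example from the slide
```

This sequential version performs n-1 applications of the operation, which is the baseline the parallel variants are compared against.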
Parallel prefix sum (first pass)
[Figure: Step 0 to Step 3 of the first pass]
Parallel prefix sum (second pass)
For every even position, use the value of the parent node
For every odd position p_n, compute p_(n-1) + p_n
[Figure: the second pass, shown step by step from Step 0]
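The two passes can be sketched recursively as follows. This is a minimal single-threaded illustration, assuming 1-based positions as on the slides and assuming that the "parent node" of a pair is the combination of its two children; the name `two_pass_prefix` is hypothetical. Within each pass the loop iterations are independent of each other, which is what a parallel implementation would exploit.

```python
import operator


def two_pass_prefix(xs, op=operator.add):
    """Prefix computation via pairwise combination (first pass) and
    distribution of parent values (second pass). Runs sequentially here."""
    n = len(xs)
    if n <= 1:
        return list(xs)

    # First pass: combine neighbouring pairs; each result is a parent node.
    parents = [op(xs[i], xs[i + 1]) for i in range(0, n - 1, 2)]

    # The prefix of the parents gives the prefix at every even position.
    parent_prefix = two_pass_prefix(parents, op)

    # Second pass (positions are 1-based to match the slides).
    out = [None] * n
    out[0] = xs[0]  # position 1 has no predecessor
    for pos in range(2, n + 1):
        if pos % 2 == 0:
            # Even position: use the value of the parent node.
            out[pos - 1] = parent_prefix[pos // 2 - 1]
        else:
            # Odd position: prefix at pos-1 (an even position) combined
            # with the element at pos.
            out[pos - 1] = op(parent_prefix[(pos - 1) // 2 - 1], xs[pos - 1])
    return out


if __name__ == "__main__":
    print(two_pass_prefix([1, 3, 5, 7]))  # [1, 4, 9, 16]
```

The operands are always combined left to right, so the sketch also works for associative operations that are not commutative, such as string concatenation.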
Parallel prefix computation
Parallel time: 2n/p + O(log n) for p < n/(log n)
Lower bound for parallel time: 2n/(p+1) for n > p(p+1)/2
Assumes identical processors!
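To get a feel for the two expressions above, the following sketch evaluates them for concrete values. It makes two assumptions that the slide does not state: the logarithm in the condition p < n/(log n) is taken to base 2, and the O(log n) term is left out because its constant is not given, so only the leading term 2n/p is computed. The function names are hypothetical.

```python
import math


def leading_parallel_time(n, p):
    """Leading term 2n/p of the parallel time (O(log n) term omitted).

    Assumption: log is base 2 in the validity condition p < n / log n.
    """
    if not p < n / math.log2(n):
        raise ValueError("formula stated only for p < n / log n")
    return 2 * n / p


def lower_bound(n, p):
    """Lower bound 2n/(p+1) on the parallel time, stated for n > p(p+1)/2."""
    if not n > p * (p + 1) / 2:
        raise ValueError("bound stated only for n > p(p+1)/2")
    return 2 * n / (p + 1)


if __name__ == "__main__":
    n, p = 1_000_000, 8
    print(leading_parallel_time(n, p))   # 250000.0
    print(round(lower_bound(n, p), 2))   # 222222.22
```

For these values the leading term exceeds the lower bound by a factor of (p+1)/p, and that gap shrinks as p grows.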
Parallel prefix computation
Potential practical problems:
  The processor setup may be heterogeneous
  Processor load may vary due to other users computing on the same machine
  An off-line optimal schedule is potentially no longer optimal!
Solution:
  Use on-line scheduling!
The basic idea
Combine a sequentially optimal algorithm with fine-grained parallelism using work stealing
[Figure: processes P0, P1, P2, …, Pn; label: Steal work]
The algorithm
Sequential process P_s:
The sequential process P_s starts working on [π_1, π_k], i.e. value indices [1,k], where indices [k+1,m] have been stolen
When P_s reaches index k, it communicates π_k to the parallel process P_v that has stolen [k+1,m] and retrieves the last index n computed by P_v together with the local prefix result r_n
P_s uses associativity to calculate π_n = π_k ∘ r_n and continues the computation from index n+1
The algorithm
Parallel process P_v:
P_v scans for active processes (either P_s or another P_v) and steals part of the work from that process
P_v computes the local prefix operation on the stolen interval
The computation of P_v depends on a previous value and needs to be finalized when that value is known
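The interplay between P_s and P_v can be illustrated with a deterministic, single-threaded simulation. This is only a sketch under simplifying assumptions, not the authors' implementation: there is exactly one thief, the steal has already happened at 0-based position `k`, and the hypothetical parameter `stolen_done` fixes how many stolen elements P_v managed to process before P_s arrived. All names are made up for the illustration.

```python
import operator


def adaptive_prefix_sketch(xs, k, stolen_done, op=operator.add):
    """Simulate one sequential process P_s and one thief P_v.

    xs          : non-empty input sequence
    k           : P_s keeps 0-based positions [0, k), 1 <= k <= len(xs);
                  P_v has stolen the remaining positions
    stolen_done : number of stolen elements P_v processed before P_s reached k
    """
    n = len(xs)
    out = [None] * n

    # P_s: sequential prefix on its own interval.
    acc = xs[0]
    out[0] = acc
    for i in range(1, k):
        acc = op(acc, xs[i])
        out[i] = acc

    # P_v: local prefix on the stolen interval, without knowing pi_k yet.
    local = []
    for i in range(k, min(k + stolen_done, n)):
        local.append(xs[i] if not local else op(local[-1], xs[i]))

    j = k + len(local)  # first position P_v did not reach
    if local:
        pi_k = acc
        # Jump: P_s combines pi_k with P_v's last local result (associativity).
        acc = op(pi_k, local[-1])
        out[j - 1] = acc
        # Finalize: the other local results become true prefixes once pi_k is known.
        for offset, r in enumerate(local[:-1]):
            out[k + offset] = op(pi_k, r)

    # P_s continues sequentially with whatever P_v did not compute.
    for i in range(j, n):
        acc = op(acc, xs[i])
        out[i] = acc
    return out


if __name__ == "__main__":
    print(adaptive_prefix_sketch([1, 3, 5, 7, 9, 11], k=2, stolen_done=3))
    # [1, 4, 9, 16, 25, 36]
```

Whatever value `stolen_done` takes, the result is the same prefix sequence; only the split of the work between the two processes changes, which is the property the on-line scheduling relies on.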
The algorithm
[Figure: execution diagram for processes P0, P1, P; legend: Result, Jump, Finalize, Stealable]
Performance
If a processor is or becomes slow, part of its work can be stolen by an idle processor
Asymptotic optimality (proof provided in the paper)
Performance: P homogeneous processors
Performance: P heterogeneous processors
Questions?