On-line adaptive parallel prefix computation Jean-Louis Roch, Daouda Traoré and Julien Bernard Presented by Andreas Söderström, ITN.



The prefix problem
Given X = x_1, x_2, …, x_n, compute the n products π_k = x_1 ∘ x_2 ∘ … ∘ x_k for 1 ≤ k ≤ n, where ∘ is some associative operation.
Example: ∘ = + (i.e. addition), X = 1, 3, 5, 7:
π_1 = 1
π_2 = 1 + 3 = 4
π_3 = 1 + 3 + 5 = 9
π_4 = 1 + 3 + 5 + 7 = 16
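The definition above can be sketched as a plain sequential scan. This is a minimal illustrative sketch, not code from the paper; the function name `prefix` and the default `operator.add` are assumptions for the example:

```python
from operator import add

def prefix(xs, op=add):
    """Compute all prefixes pi_k = x_1 o ... o x_k for an associative op."""
    out = []
    acc = None
    for x in xs:
        # Fold the next element into the running prefix.
        acc = x if acc is None else op(acc, x)
        out.append(acc)
    return out

# With op = +, X = 1,3,5,7 gives the prefixes from the slide:
print(prefix([1, 3, 5, 7]))  # [1, 4, 9, 16]
```

Because the only requirement on ∘ is associativity, the same function works for multiplication, string concatenation, matrix products, and so on.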

Parallel prefix sum (first pass)
[Diagram: adjacent elements are combined pairwise up a tree over steps 0–3]

Parallel prefix sum (second pass)
For every even position, use the value of the parent node.
For every odd position p_n, compute p_(n-1) + p_n.
[Diagram: steps 0–3]
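The two passes above can be simulated sequentially. This is a sketch under stated assumptions: the recursion stands in for the per-step parallelism of the tree, and "even/odd position" follows the slide's 1-indexed convention (so 0-indexed odd slots take the parent's value):

```python
from operator import add

def parallel_prefix(xs, op=add):
    """Simulate the two-pass tree prefix algorithm sequentially."""
    n = len(xs)
    if n == 1:
        return xs[:]
    # First pass: combine adjacent pairs (these ops would run in parallel).
    pairs = [op(xs[i], xs[i + 1]) for i in range(0, n - 1, 2)]
    if n % 2:
        pairs.append(xs[-1])  # odd leftover element
    parent = parallel_prefix(pairs, op)  # recurse on the half-size list
    # Second pass: expand the parent's prefixes back to full size.
    out = [None] * n
    for i in range(n):
        if i % 2 == 1:        # 1-indexed even position: parent's value
            out[i] = parent[i // 2]
        elif i == 0:
            out[i] = xs[0]
        else:                 # 1-indexed odd position: previous prefix + own
            out[i] = op(parent[i // 2 - 1], xs[i])
    return out

print(parallel_prefix([1, 3, 5, 7]))  # [1, 4, 9, 16]
```

Each pass does O(n) operations spread over O(log n) parallel steps, which is where the 2n/p + O(log n) parallel time on the next slide comes from.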

Parallel prefix computation
Parallel time: 2n/p + O(log n) for p < n/(log n).
Lower bound for parallel time: 2n/(p+1) for n > p(p+1)/2.
Assumes identical processors!

Parallel prefix computation
Potential practical problems:
The processor setup may be heterogeneous.
Processor load may vary due to other users computing on the same machine.
Off-line optimal scheduling is then potentially no longer optimal!
Solution: use on-line scheduling!

The basic idea
Combine a sequentially optimal algorithm with fine-grained parallelism using work stealing.
[Diagram: idle processors P0, P1, P2, …, Pn steal work from busy ones]

The algorithm: sequential process P_s
The sequential process P_s starts working on [π_1, π_k], i.e. value indices [1, k], where indices [k+1, m] have been stolen.
When P_s reaches index k, it communicates π_k to the parallel process P_v that has stolen [k+1, m], and recovers the last index n computed by P_v together with the local prefix result r_n.
P_s uses associativity to calculate π_n = π_k ∘ r_n and continues the computation from index n+1.

The algorithm: parallel process P_v
P_v scans for active processes (either P_s or another P_v) and steals part of the work from that process.
P_v computes the local prefix operation on the stolen interval.
The computation of P_v depends on a previous value, so it needs to be finalized once that value is known.
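The way the thief's unfinalized work combines with π_k can be sketched as follows. This is an illustrative sketch of the associativity argument only, not the paper's scheduler: the helper names are hypothetical, a single steal of the tail block is assumed, and ∘ = + is used for concreteness:

```python
from operator import add

def local_prefix(block, op=add):
    """What P_v computes on its stolen block, without knowing pi_k."""
    out, acc = [], None
    for x in block:
        acc = x if acc is None else op(acc, x)
        out.append(acc)
    return out

def finalize(pi_k, local, op=add):
    """Once pi_k is known, associativity fixes up each stolen prefix:
    pi_i = pi_k o r_i, one extra op per element."""
    return [op(pi_k, r) for r in local]

xs = [1, 3, 5, 7, 2, 4]
k = 3                            # P_s owns xs[:3]; a thief stole xs[3:]
own = local_prefix(xs[:k])       # sequential part: [1, 4, 9]
stolen = local_prefix(xs[k:])    # thief's local prefixes: [7, 9, 13]
full = own + finalize(own[-1], stolen)
print(full)                      # [1, 4, 9, 16, 18, 22]
```

Note that the thief's local scan and the owner's scan proceed concurrently; the finalize pass is the price paid for starting the stolen block before π_k exists.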

The algorithm
[Diagram: processors P0, P1, P2 with jump, finalize, and stealable regions, and the resulting prefix values]

Performance
If a processor is, or becomes, slow, part of its work can be stolen by an idle processor.
Asymptotic optimality (proof provided in the paper).

Performance
[Chart: results on p homogeneous processors]

Performance
[Chart: results on p heterogeneous processors]

Questions?