Introduction to parallel algorithms


Introduction to parallel algorithms
CIS 5930-09 – Spring 2005
Ashok Srinivasan (www.cs.fsu.edu/~asriniva), Florida State University

Outline
- Background
- Primitives
- Algorithms
- Important points

Background
- Terminology
  - Time complexity
  - Speedup
  - Efficiency
  - Scalability
- Communication cost model

Time complexity
- Parallel computation: a group of processors works together to solve a problem
- The time required for the computation is the period from when the first processor starts working until the last processor stops
- (Figure: timelines comparing sequential, bad parallel, ideal parallel, and realistic parallel executions)

Other terminology
- Notation: P = number of processors, T1 = time on one processor, TP = time on P processors
- Speedup: S = T1 / TP
- Efficiency: E = S / P
- Work: W = P TP
- Scalability
  - How does TP decrease as we increase P to solve the same problem?
  - How should the problem size increase with P to keep E constant?
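A tiny worked example of these definitions. The numbers (T1 = 100 s, TP = 16 s, P = 8) are made up for illustration and are not from the slides.

```c
#include <stdio.h>

int main(void) {
    /* Illustrative values, not from the slides. */
    double T1 = 100.0;  /* time on one processor (s) */
    double TP = 16.0;   /* time on P processors (s)  */
    int    P  = 8;      /* number of processors      */

    double S = T1 / TP; /* speedup    = 6.25           */
    double E = S / P;   /* efficiency = 0.78125        */
    double W = P * TP;  /* work       = 128 proc-sec   */

    printf("S = %.2f, E = %.3f, W = %.0f\n", S, E, W);
    return 0;
}
```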

Communication cost model
- Processes spend some time doing useful work and some time communicating
- Model the cost of sending one message as TC = ts + L tb, where L is the message size, ts is the startup (latency) cost, and tb is the per-word transfer cost
- The cost is independent of the location of the processes
- Any process can communicate with any other process
- A process can simultaneously send and receive one message
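A minimal sketch of the model as code; the values of ts and tb below are assumed for illustration only.

```c
/* Predicted time to send one message of L words under TC = ts + L*tb. */
#include <stdio.h>

double comm_cost(double ts, double tb, double L) {
    return ts + L * tb;
}

int main(void) {
    double ts = 1e-6;  /* startup (latency) cost in seconds -- assumed */
    double tb = 1e-9;  /* per-word transfer cost in seconds -- assumed */
    printf("TC for a 1000-word message: %g s\n", comm_cost(ts, tb, 1000.0));
    return 0;
}
```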

I/O model
- We will ignore I/O issues, for the most part
- We assume that input and output are distributed across the processors in a manner of our choosing
- Example: sorting
  - Input: x1, x2, ..., xn, with xi initially on processor i
  - Output: the sorted permutation xp1, xp2, ..., xpn, with xpi on processor i and xpi < xp(i+1)

Primitives
- Reduction
- Broadcast
- Gather/Scatter
- All gather
- Prefix

Reduction -- 1
- Goal: compute x1 + x2 + ... + xn, with xi initially on processor i
- All values are sent to one processor, which adds them sequentially (figure: processors 2 through n send x2, ..., xn to processor 1)
- Tn = (n-1) + (n-1)(ts + tb)
- Sn = 1/(1 + ts + tb)
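A minimal MPI sketch of this naive scheme (my illustration, not code from the slides): every rank sends its value to rank 0, which receives and adds them one at a time.

```c
/* Naive reduction: each rank sends its value to rank 0, which
 * receives and adds them one by one (n-1 messages, n-1 additions). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double x = rank + 1.0;  /* this rank's value (illustrative) */
    if (rank == 0) {
        double sum = x, v;
        for (int src = 1; src < p; src++) {
            MPI_Recv(&v, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += v;
        }
        printf("sum = %g\n", sum);
    } else {
        MPI_Send(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```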

Reduction -- 2
- Apply Reduction-1 to {x1, ..., xn/2} and to {xn/2+1, ..., xn} in parallel, then combine the two partial sums with one more message and one more addition
- Tn = (n/2 - 1) + (n/2 - 1)(ts + tb) + (ts + tb) + 1 = n/2 + (n/2)(ts + tb)
- Sn ~ 2/(1 + ts + tb)

Reduction -- 3
- Apply Reduction-2 recursively: divide and conquer
- (Figure: the values are split into halves {x1, ..., xn/2} and {xn/2+1, ..., xn}, then quarters {x1, ..., xn/4}, {xn/4+1, ..., xn/2}, {xn/2+1, ..., x3n/4}, {x3n/4+1, ..., xn}; Reduction-1 is applied to each piece and the partial sums are combined up the tree)
- Tn ~ log2 n + (ts + tb) log2 n
- Sn ~ (n / log2 n) × 1/(1 + ts + tb)
- Note that any associative operator can be used in place of +
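A sketch of the divide-and-conquer combination step using point-to-point MPI; it assumes the number of ranks is a power of two. In practice MPI_Reduce provides the same result as a portable collective.

```c
/* Tree (divide-and-conquer) reduction sketch: in round k, a rank whose
 * lower k+1 bits are zero receives a partial sum from rank + 2^k.
 * Assumes the number of ranks is a power of two. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double sum = rank + 1.0;  /* local value (illustrative) */
    for (int step = 1; step < p; step *= 2) {
        if (rank % (2 * step) == 0) {           /* receiver this round */
            double v;
            MPI_Recv(&v, 1, MPI_DOUBLE, rank + step, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += v;
        } else if (rank % (2 * step) == step) { /* sender, then done */
            MPI_Send(&sum, 1, MPI_DOUBLE, rank - step, 0, MPI_COMM_WORLD);
            break;
        }
    }
    if (rank == 0) printf("sum = %g\n", sum);
    MPI_Finalize();
    return 0;
}
```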

Parallel addition features
- If n >> P:
  - Each processor adds n/P distinct numbers, then a parallel reduction is performed on the P partial sums
  - TP ~ n/P + (1 + ts + tb) log P
- Optimal P is obtained by differentiating with respect to P: Popt ~ n/(1 + ts + tb); if communication cost is high, fewer processors ought to be used
- E = [1 + (1 + ts + tb) P log P / n]^-1
  - As problem size increases, efficiency increases
  - As the number of processors increases, efficiency decreases
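A minimal MPI sketch of parallel addition (the problem size n and data values are illustrative): each rank sums its local block of n/P numbers, then MPI_Reduce combines the P partial sums.

```c
/* Parallel addition of n numbers: each rank sums its block locally,
 * then the P partial sums are combined with a reduction. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    const long n = 1000000;                  /* problem size (illustrative) */
    long lo = rank * (n / p);
    long hi = (rank == p - 1) ? n : lo + n / p;

    double local = 0.0;                      /* sum of this rank's numbers */
    for (long i = lo; i < hi; i++)
        local += (double)(i + 1);            /* stand-in for the actual data */

    double total;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum = %.0f (expected %.0f)\n", total, 0.5 * n * (n + 1.0));
    MPI_Finalize();
    return 0;
}
```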

Some common collective operations
- Broadcast: one process sends the same data A to all processes
- Gather: the data A, B, C, D held by the individual processes are collected onto one process
- Scatter: one process distributes the distinct pieces A, B, C, D to the processes
- All gather: every process ends up with all the data A, B, C, D
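These operations map directly onto MPI collectives; a compact sketch follows. The buffer sizes and values are illustrative, and the fixed-size array assumes at most 64 ranks.

```c
/* The four collectives from the slide, expressed as MPI calls. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double a = 0.0;     /* one value per rank                 */
    double all[64];     /* room for up to 64 ranks (assumed)  */

    if (rank == 0) a = 3.14;
    MPI_Bcast(&a, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);            /* broadcast  */

    double mine = rank;                                          /* gather     */
    MPI_Gather(&mine, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    double piece;                                                /* scatter    */
    MPI_Scatter(all, 1, MPI_DOUBLE, &piece, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    MPI_Allgather(&mine, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE,      /* all gather */
                  MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```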

Broadcast
- (Figure: the data x1 spreads along a binary tree of processors)
- T ~ (ts + L tb) log P, where L is the length of the data

Gather/Scatter
- Gather: data move towards the root of a binary tree, and the message size doubles at each level (figure: x1, ..., x8 combine into x1-2, x3-4, x5-6, x7-8, then x1-4, x5-8, then x1-8, with message lengths L, 2L, 4L)
- Note: sum over i = 0 to log P - 1 of 2^i = (2^(log P) - 1)/(2 - 1) = P - 1 ~ P
- T ~ ts log P + P L tb
- Scatter: review question (the same tree traversed in reverse, from the root toward the leaves)

All gather
- Equivalent to each processor broadcasting to all the processors
- (Figure: step 1 of a recursive-doubling all gather; neighbouring pairs exchange their single values x1, ..., x8 in messages of length L)

All gather (continued)
- (Figure: step 2; each pair now holds two values x1-2, x3-4, x5-6, x7-8, and pairs of pairs exchange messages of length 2L)

All gather (continued)
- (Figure: step 3; the two halves, holding x1-4 and x5-8, exchange messages of length 4L)

All gather (continued)
- (Figure: after the final step every processor holds x1-8)
- Tn ~ ts log P + P L tb

Pipelining
- Useful when repeatedly and regularly performing a large number of primitive operations
- Optimal time for a single broadcast is log P, so doing n broadcasts one after another takes n log P time
- Pipelining the broadcasts takes n + P time: almost constant amortized time per broadcast when n >> P, since n + P << n log P
- Review question: how can you accomplish this time complexity? (one possible approach is sketched below)
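One possible answer (my sketch, not necessarily the one the slides intend): push the n items along a chain of processes, so that after roughly P steps to fill the pipeline, one item is delivered per step.

```c
/* Pipelined broadcast sketch: rank 0 injects n items one at a time into a
 * chain 0 -> 1 -> ... -> P-1; each rank forwards an item as soon as it has
 * received it. Total time is roughly n + P steps instead of n log P. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    const int n = 1000;        /* number of items to broadcast (illustrative) */
    double item;

    for (int i = 0; i < n; i++) {
        if (rank == 0)
            item = (double)i;  /* the i-th value to broadcast */
        else
            MPI_Recv(&item, 1, MPI_DOUBLE, rank - 1, i,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (rank < p - 1)
            MPI_Send(&item, 1, MPI_DOUBLE, rank + 1, i, MPI_COMM_WORLD);
        /* ... use item here ... */
    }
    MPI_Finalize();
    return 0;
}
```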

Sequential prefix
- Input: values xi, 1 <= i <= n
- Output: Xi = x1 * x2 * ... * xi, 1 <= i <= n, where * is an associative operator
- Algorithm:
  - X1 = x1
  - for i = 2 to n: Xi = Xi-1 * xi
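In C, with + as the associative operator, the sequential algorithm is just:

```c
/* Sequential prefix (scan) with + as the associative operator. */
#include <stdio.h>

void prefix(const double *x, double *X, int n) {
    X[0] = x[0];
    for (int i = 1; i < n; i++)
        X[i] = X[i - 1] + x[i];   /* Xi = X(i-1) * xi, with * = + */
}

int main(void) {
    double x[] = {1, 2, 3, 4, 5}, X[5];
    prefix(x, X, 5);
    for (int i = 0; i < 5; i++) printf("%g ", X[i]);   /* 1 3 6 10 15 */
    printf("\n");
    return 0;
}
```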

Parallel prefix
- Input: processor i has xi
- Output: processor i has Xi = x1 * x2 * ... * xi
- Divide and conquer: define f(a,b) as follows
  - if a == b: Xi = xi, on Proc Pi
  - else:
    - compute f(a, (a+b)/2) and f((a+b)/2 + 1, b) in parallel
    - Pi and Pj send Xi and Xj to each other, for a <= i <= (a+b)/2 and j = i + (a+b)/2
    - Xi = Xi * Xj on Pi; Xj = Xi * Xj on Pj
- f(a,b) yields Xi = xa * ... * xi on Proc Pi, together with the range total xa * ... * xb, for a <= i <= b; f(1,n) solves the problem
- T(n) = T(n/2) + 2 + (ts + tw) => T(n) = O(log n)
- An iterative implementation improves the constant
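MPI exposes the prefix primitive directly as MPI_Scan; a minimal sketch with one value per rank and + as the operator (values are illustrative):

```c
/* Parallel prefix (scan): after MPI_Scan, rank i holds x0 + x1 + ... + xi.
 * Implementations typically use an O(log P)-step scheme much like the
 * divide-and-conquer algorithm on the slide. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double x = rank + 1.0;  /* this rank's value (illustrative) */
    double X;               /* prefix result                    */
    MPI_Scan(&x, &X, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: prefix = %g\n", rank, X);
    MPI_Finalize();
    return 0;
}
```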

Iterative parallel prefix example

Algorithms
- Linear recurrence
- Matrix-vector multiplication

Linear recurrence
- Determine each xi, 2 <= i <= n, where xi = ai xi-1 + bi xi-2 and x0, x1 are given
- Sequential solution: for i = 2 to n, compute xi directly from the recurrence
- This approach is not easily parallelized, since each xi depends on the two previous values
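A minimal C version of the sequential solution, with illustrative coefficients (ai = bi = 1 gives a Fibonacci-like sequence):

```c
/* Sequential evaluation of x[i] = a[i]*x[i-1] + b[i]*x[i-2]. */
#include <stdio.h>

int main(void) {
    enum { N = 8 };
    double a[N], b[N], x[N];
    x[0] = 1.0; x[1] = 1.0;                                 /* given x0, x1 */
    for (int i = 2; i < N; i++) { a[i] = 1.0; b[i] = 1.0; } /* illustrative */
    for (int i = 2; i < N; i++)
        x[i] = a[i] * x[i - 1] + b[i] * x[i - 2];
    printf("x[%d] = %g\n", N - 1, x[N - 1]);  /* 21 for these coefficients */
    return 0;
}
```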

Linear recurrence in parallel
- Given xi = ai xi-1 + bi xi-2, consider two consecutive terms:
  - x2i = a2i x2i-1 + b2i x2i-2
  - x2i+1 = a2i+1 x2i + b2i+1 x2i-1
- Rewrite this in matrix form, with Xi = (x2i, x2i+1)^T and Xi-1 = (x2i-2, x2i-1)^T:
  Xi = Ai Xi-1, where Ai = | b2i          a2i               |
                           | a2i+1 b2i    b2i+1 + a2i+1 a2i |
- Therefore Xi = Ai Ai-1 ... A1 X0
- This is a parallel prefix computation, since matrix multiplication is associative
- Solved in O(log n) time
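A sketch of how this can be expressed with MPI: each rank contributes one 2x2 matrix Ai, and MPI_Scan with a user-defined (non-commutative) matrix-product operation computes the prefix products. The coefficient values, rank-to-index mapping, and initial conditions below are my assumptions for illustration.

```c
/* The recurrence as a parallel prefix of 2x2 matrix products.
 * Rank r holds A_{r+1}; MPI_Scan with a user-defined "multiply" op leaves
 * rank r holding P_r = A_{r+1} * A_r * ... * A_1, so X_{r+1} = P_r * X0. */
#include <mpi.h>
#include <stdio.h>

/* 2x2 matrices stored row-major in double[4]. */
static void matmul2(const double *a, const double *b, double *c) {
    c[0] = a[0]*b[0] + a[1]*b[2];  c[1] = a[0]*b[1] + a[1]*b[3];
    c[2] = a[2]*b[0] + a[3]*b[2];  c[3] = a[2]*b[1] + a[3]*b[3];
}

/* MPI user op: inout := inout * in.  'in' is the partial product from lower
 * ranks, so this rank's matrix ends up on the left, as the formula requires. */
static void matprod_op(void *invec, void *inoutvec, int *len, MPI_Datatype *dt) {
    double *in = invec, *io = inoutvec, tmp[4];
    (void)dt;
    for (int k = 0; k < *len; k++) {
        matmul2(io + 4*k, in + 4*k, tmp);
        for (int j = 0; j < 4; j++) io[4*k + j] = tmp[j];
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Ai for ai = bi = 1 (Fibonacci-like): [ 1 1 ; 1 2 ] (illustrative). */
    double A[4] = {1.0, 1.0, 1.0, 2.0}, P[4];

    MPI_Datatype mat2x2;
    MPI_Type_contiguous(4, MPI_DOUBLE, &mat2x2);
    MPI_Type_commit(&mat2x2);

    MPI_Op prod;
    MPI_Op_create(matprod_op, 0 /* not commutative */, &prod);

    MPI_Scan(A, P, 1, mat2x2, prod, MPI_COMM_WORLD);

    /* Apply the accumulated matrix to X0 = (x0, x1) = (1, 1). */
    double x0 = 1.0, x1 = 1.0;
    double xe = P[0]*x0 + P[1]*x1;  /* x_{2(r+1)}   */
    double xo = P[2]*x0 + P[3]*x1;  /* x_{2(r+1)+1} */
    printf("rank %d: x_%d = %g, x_%d = %g\n", rank, 2*rank+2, xe, 2*rank+3, xo);

    MPI_Op_free(&prod);
    MPI_Type_free(&mat2x2);
    MPI_Finalize();
    return 0;
}
```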

Matrix-vector multiplication
- c = A b; often performed repeatedly, as in bi = A bi-1, so we need the same data distribution for c and b
- One-dimensional decomposition
  - Example: row-wise block-striped distribution for A, with b and c replicated
  - Each process computes its components of c independently, then all-gathers the components of c

1-D matrix-vector multiplication
- Distribution: A row-wise block-striped, b and c replicated
- Each process computes its components of c independently: time = Θ(n^2 / P)
- Then all-gather the components of c: time = ts log P + tb n
- Note: P < n
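A minimal MPI sketch of this scheme; the matrix size, data values, and the assumption that P divides n are mine.

```c
/* 1-D (row-wise block-striped) matrix-vector multiply: each rank owns
 * n/P rows of A and all of b, computes its n/P entries of c, then the
 * full c is assembled on every rank with an all-gather. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    const int n = 512;       /* matrix dimension (illustrative, P divides n) */
    const int rows = n / p;  /* rows owned by this rank                      */

    double *A = malloc((size_t)rows * n * sizeof *A);  /* my block of rows */
    double *b = malloc((size_t)n * sizeof *b);         /* replicated       */
    double *c = malloc((size_t)n * sizeof *c);         /* replicated       */
    for (int i = 0; i < rows * n; i++) A[i] = 1.0;     /* illustrative data */
    for (int j = 0; j < n; j++) b[j] = 1.0;

    double *c_local = c + rank * rows;  /* my slice of c, written in place */
    for (int i = 0; i < rows; i++) {    /* local compute: Theta(n^2 / P)   */
        double s = 0.0;
        for (int j = 0; j < n; j++) s += A[i * n + j] * b[j];
        c_local[i] = s;
    }
    /* All-gather the pieces of c so every rank holds the full vector. */
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  c, rows, MPI_DOUBLE, MPI_COMM_WORLD);

    free(A); free(b); free(c);
    MPI_Finalize();
    return 0;
}
```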

2-D matrix-vector multiplication
- (Figure: A partitioned into √P × √P blocks Aij, with the blocks Bi of b and Ci of c stored along the first column of processes)
- Process Pi0 sends Bi to P0i: time = ts + tb n/√P
- Processes P0j broadcast Bj to all Pij: time = ts log √P + tb n log √P / √P
- Processes Pij compute Cij = Aij Bj: time = Θ(n^2 / P)
- Processes Pij reduce Cij onto Pi0, 0 <= i < √P
- Total time = Θ(n^2 / P + ts log P + tb n log P / √P); P < n^2
- More scalable than the one-dimensional decomposition
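A sketch of the 2-D scheme using row and column sub-communicators. It assumes P is a perfect square and √P divides n; the data values are illustrative.

```c
/* 2-D decomposition sketch for c = A b on a sqrt(P) x sqrt(P) process grid.
 * Pij owns the (n/q) x (n/q) block Aij; the block Bi of b starts on Pi0.
 * Steps follow the slide: send Bi to P0i, broadcast down columns, local
 * multiply, then reduce partial results across each row onto Pi0. */
#include <mpi.h>
#include <math.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, P;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    const int q = (int)(sqrt((double)P) + 0.5);  /* grid dimension sqrt(P) */
    const int i = rank / q, j = rank % q;        /* my grid coordinates    */
    const int n = 512, nb = n / q;               /* matrix and block size  */

    double *Aij = malloc((size_t)nb * nb * sizeof *Aij);
    double *Bj  = malloc((size_t)nb * sizeof *Bj);
    double *Cij = malloc((size_t)nb * sizeof *Cij);
    for (int k = 0; k < nb * nb; k++) Aij[k] = 1.0;  /* illustrative data */
    for (int k = 0; k < nb; k++) Bj[k] = 1.0;

    /* Row and column communicators: same i => same row, same j => same column. */
    MPI_Comm row, col;
    MPI_Comm_split(MPI_COMM_WORLD, i, j, &row);
    MPI_Comm_split(MPI_COMM_WORLD, j, i, &col);

    /* Step 1: Pi0 sends Bi to P0i (P00 already has B0). */
    if (j == 0 && i > 0)
        MPI_Send(Bj, nb, MPI_DOUBLE, i /* world rank of P0i */, 0, MPI_COMM_WORLD);
    if (i == 0 && j > 0)
        MPI_Recv(Bj, nb, MPI_DOUBLE, j * q /* world rank of Pj0 */, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Step 2: P0j broadcasts Bj down its column. */
    MPI_Bcast(Bj, nb, MPI_DOUBLE, 0 /* row 0 within the column comm */, col);

    /* Step 3: local multiply, Cij = Aij * Bj. */
    for (int r = 0; r < nb; r++) {
        double s = 0.0;
        for (int k = 0; k < nb; k++) s += Aij[r * nb + k] * Bj[k];
        Cij[r] = s;
    }

    /* Step 4: sum the partial results across each row onto Pi0. */
    double *Ci = (j == 0) ? malloc((size_t)nb * sizeof *Ci) : NULL;
    MPI_Reduce(Cij, Ci, nb, MPI_DOUBLE, MPI_SUM, 0 /* column 0 */, row);

    free(Aij); free(Bj); free(Cij); free(Ci);
    MPI_Finalize();
    return 0;
}
```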

Important points
- Efficiency
  - Increases with problem size
  - Decreases with the number of processors
- Aggregation of tasks to increase granularity reduces communication overhead
- Data distribution
  - 2-dimensional may be more scalable than 1-dimensional
  - Also affects load balance
- General techniques
  - Divide and conquer
  - Pipelining