Decomposition
Data Decomposition – Dividing the data into subgroups and assigning each piece to a different processor – Example: Embarrassingly parallel applications
Functional Decomposition – Dividing an algorithm into its functional pieces and executing the pieces on separate processors – Example: Pipelining

Pipelined Computations
Divide a problem into a series of tasks. Each processor completes its task and pipes the result to the next processor.
Example: Summing groups of numbers. [Figure: a running total, starting at zero, passes through processors P0–P5; each Pk adds its group sum ∑A[i_k] before forwarding, and the final total emerges from the last stage.]
Question: Is this data decomposition or functional decomposition?

Where is Pipelining Applicable?
Type 1 – More than one instance of a problem – Example: Multiple simulations with different parameter settings
Type 2 – A series of data items, each requiring multiple operations – Example: Signal filtering or the Sieve of Eratosthenes
Type 3 – Partial results passed on while processing continues – Example: Solving sets of linear equations
Considerations
– Is there a series of sequential tasks?
– Is the processing of each task approximately equal?
– Can items be grouped to minimize communication cost?
– If stages exceed processors: group stages, or wrap the last stage back to the first
– Determine where the result will be at the end of the process

Summing Numbers Example
Process P_0:
    send(&number, P_1);
Process P_i (0 < i < N-1):
    recv(&sum, P_(i-1));
    sum += number;
    send(&sum, P_(i+1));
Process P_(N-1):
    recv(&sum, P_(N-2));
    sum += number;
    save or display the result
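A minimal MPI sketch of this pipelined sum, assuming each process contributes one local value and that the rank order in MPI_COMM_WORLD is the pipeline order; the variable names and the local values are illustrative assumptions, not part of the original slides.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int number = rank + 1;      /* this stage's local value (assumed) */
        int sum = 0;

        if (rank > 0)               /* all but the first stage receive the running sum */
            MPI_Recv(&sum, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        sum += number;              /* add this stage's contribution */

        if (rank < nprocs - 1)      /* all but the last stage pass the sum along */
            MPI_Send(&sum, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
        else
            printf("Pipeline sum = %d\n", sum);

        MPI_Finalize();
        return 0;
    }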

Application: Remove frequencies from a signal
– Sequential algorithm: Fourier analysis, O(N lg N)
– Parallel: Apply filters to the signal by convolution, O(N * FilterLength)
– Filter examples: Chebyshev, Butterworth, etc.
– Deriving a filter: Place poles and zeroes in the z-domain, then perform the inverse transformation.
– Filters can be used to manipulate signals, detect patterns, etc.
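A small sketch of the per-stage filtering step, assuming a finite impulse response (FIR) filter applied by direct convolution; the function and array names are illustrative, not taken from the slides. The cost is O(n * filter_len), matching the estimate above.

    /* Apply one FIR filter stage to a block of samples by direct convolution.
       out[i] = sum over k of coeff[k] * in[i - k]; samples before the start
       of the block are treated as zero. */
    void fir_filter(const double *in, double *out, int n,
                    const double *coeff, int filter_len) {
        for (int i = 0; i < n; i++) {
            double acc = 0.0;
            for (int k = 0; k < filter_len && k <= i; k++)
                acc += coeff[k] * in[i - k];
            out[i] = acc;
        }
    }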

Chebyshev Filter Design
[Figures: Chebyshev poles and zeroes in the z-domain; Chebyshev frequency response]
Note: Depending on the placement of the poles (+) and zeroes (o), the filter will affect a signal differently.

Type 1: Multiple Instances
Notation
1. m = number of instances, p = number of processors
2. t_start = message latency, t_data = time per data item transmitted
3. n = data items transmitted per instance
4. t_m = total time to process one instance
5. Total pipeline cycles = m + p - 1
6. Assume equal processing time per stage
Timing
– Sequential execution: t_1 = m * t_m
– Parallel processing time: (m + p - 1) * t_m / p
– Parallel communication time: (m + p - 1) * (t_start + n * t_data)
– Speedup: S = t_1 / t_p = m * t_m / ((m + p - 1) * (t_m/p + t_start + n * t_data))
[Figure: space-time diagram of instances 1–5 flowing through processors P0–P5]
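As a worked example with assumed values (illustrative numbers, not from the slides): suppose m = 100 instances, p = 5 processors, t_m = 10, t_start = 1, and n * t_data = 1. The pipeline runs for m + p - 1 = 104 cycles, so t_p = 104 * (10/5 + 1 + 1) = 416, while t_1 = 100 * 10 = 1000, giving a speedup of about 1000/416 ≈ 2.4. With no communication cost the same pipeline would give 1000/(104 * 2) ≈ 4.8, close to the ideal value of p = 5, which shows how per-cycle communication overhead limits the speedup.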

Type 2: Multiple Data Elements
Example: Signal filter. Each process removes one or more frequencies from a digitized signal.
[Figure: an unfiltered signal, a stream of data items d0 … d9, flows through processors P0–P5, each applying one filter f0 … f5; the filtered signal emerges from the last stage.]

Type 2 Timing Diagram

Type 3: Partial Processing
– The next stage receives information to continue processing
– Additional processing continues at the source processor
Question: How do we determine speed-up?
[Figures: space-time diagrams for processors P0–P5 showing idle vs. executing time, one for the linear-equations case and one for a more balanced load]

Operation at Each Processor (Types 1 and 2)
Processor with rank r = 0
– Generate the instance (type 1) or the data (type 2) to process
– Process appropriately
– Send a message to the processor with rank 1
Processors with rank r = 1, 2, …, p-2
– Receive a message from the processor with rank r-1
– Process appropriately
– Send a message to the processor with rank r+1
Processor with rank r = p-1
– Receive a message from the processor with rank r-1
– Process appropriately
– Output the final results
Examples
1) Adding numbers: n1 -> n1+n2 -> n1+n2+n3 -> …
2) Frequency removal: f(t) -> f0; f(t-f0) -> f1; f(t-f0-f1) -> …

Parallel Pipeline Sort
[Figure: step-by-step table showing the numbers 4, 3, 1, 2, 5 fed through processors P0–P4; each processor keeps the largest number it has seen so far and passes smaller numbers along.]
Pseudo code (each processor; the first number received is kept as x_max)
    Receive x_i
    IF x_i < x_max
        Send x_i
    ELSE
        Send x_max
        x_max = x_i
Note: Processors can hold blocks of numbers for better efficiency.
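A minimal MPI sketch of this pipeline sort, assuming one value per process and that rank 0 generates all the input values; the generated values and the resulting descending order along the ranks are illustrative assumptions.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, p;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int incoming = p - rank;          /* how many values this stage will see */
        int x_max = 0;                    /* largest value kept at this stage */

        for (int i = 0; i < incoming; i++) {
            int x;
            if (rank == 0)
                x = (7 * i + 3) % 20;     /* illustrative "unsorted" input values */
            else
                MPI_Recv(&x, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            if (i == 0) {
                x_max = x;                /* first value seen becomes the running max */
            } else if (x > x_max) {
                int tmp = x_max;          /* keep the larger value, pass the old max on */
                x_max = x;
                MPI_Send(&tmp, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
            } else {
                MPI_Send(&x, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
            }
        }
        /* After the pipeline drains, rank r holds the (r+1)-th largest value. */
        printf("Rank %d holds %d\n", rank, x_max);

        MPI_Finalize();
        return 0;
    }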

Bi-Directional Pipeline
Use the pipeline to return results to the master – useful for line, ring, or hypercube topologies.
Example: Sorting
[Figure: space-time diagram of the sorting phase (numbers flowing from P0 toward P5) followed by the gather phase (sorted numbers flowing back to P0).]
Phases: N generate steps + (N-1) propagate steps + (N-1) return steps = 3N - 2 steps
Sort phase
    If (myid == 0) generate number
    Else receive(&number, P_(myid-1))
    If (number > max and myid < P-1) { send(max, P_(myid+1)); max = number; }
Gather phase
    If (myid < P-1) receive sorted numbers from P_(myid+1)
    If (myid > 0) send sorted numbers to P_(myid-1)

Sieve of Eratosthenes

Prime Number Generation: Sieve of Eratosthenes (Type 2 pipeline)
Concept
– Each processor filters blocks of non-primes from the flow of data
– The "potential" prime numbers pass through to the next processor
Pseudo-code
– The master processor generates an array of the odd numbers up to n
– Each processor, in a loop after receiving a group of numbers: filter the group; pass the unfiltered numbers down the pipeline
– Gather all of the primes
Notes
– Wrapping the pipeline into a ring could help maintain load balance
– A termination message determines when the pipeline empties
Question: What range of numbers should each processor get?

Implementation
Sequential code (the sieve itself), time O(n log log n):
    for (i = 2; i < n; i++)
        prime[i] = 1;
    for (i = 2; i <= sqrt_n; i++)
        if (prime[i] == 1)
            for (j = i + i; j < n; j = j + i)
                prime[j] = 0;
Parallel code, processor P_i (i > 0) – each stage tests divisibility by its assigned range [MIN, MAX) of candidate divisors:
    recv(&number, rank-1);
    prime = TRUE;
    for (x = MIN; x < MAX; x++)
        if (number % x == 0) { prime = FALSE; break; }
    if (prime) send(&number, rank+1);
Termination:
    recv(&number, rank-1);
    send(&number, rank+1);
    if (number == terminator) break;
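A minimal MPI sketch of the pipelined filter, simplified so that each stage adopts a single prime (the first number it receives) rather than a block of divisors; the TERMINATOR sentinel, the candidate range generated by rank 0, and the fact that at most p-1 primes are reported are illustrative assumptions of this sketch, not the slides' exact scheme. It requires at least two processes.

    #include <stdio.h>
    #include <mpi.h>

    #define TERMINATOR -1   /* assumed sentinel that flushes the pipeline */

    int main(int argc, char *argv[]) {
        int rank, p, n = 50;   /* candidates 2..n, an illustrative range */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        if (rank == 0) {
            /* Master: feed the candidate numbers into the pipeline, then the terminator. */
            for (int x = 2; x <= n; x++)
                MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            int t = TERMINATOR;
            MPI_Send(&t, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else {
            int my_prime = 0;  /* the first number this stage receives is kept as its prime */
            while (1) {
                int x;
                MPI_Recv(&x, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                if (x == TERMINATOR) {
                    if (rank < p - 1) MPI_Send(&x, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
                    break;
                }
                if (my_prime == 0) {
                    my_prime = x;                 /* adopt and report a new prime */
                    printf("Rank %d: prime %d\n", rank, my_prime);
                } else if (x % my_prime != 0 && rank < p - 1) {
                    MPI_Send(&x, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
                }
                /* multiples of my_prime are filtered out; survivors reaching the
                   last stage are simply dropped in this sketch */
            }
        }
        MPI_Finalize();
        return 0;
    }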

Upper Triangular Matrix
– All entries below the diagonal are zero
– Useful for solving N equations in N unknowns

Solving Sets of Linear Equations (a type 3 pipeline example)
Upper triangular form (the a_{i,j} and b_i are known constants):
    a_{n-1,0} x_0 + a_{n-1,1} x_1 + … + a_{n-1,n-1} x_{n-1} = b_{n-1}
    a_{n-2,0} x_0 + a_{n-2,1} x_1 + … + a_{n-2,n-2} x_{n-2} = b_{n-2}
    …
    a_{1,0} x_0 + a_{1,1} x_1 = b_1
    a_{0,0} x_0 = b_0
Back substitution:
    x_0 = b_0 / a_{0,0}
    x_1 = (b_1 - a_{1,0} x_0) / a_{1,1}
    x_2 = (b_2 - a_{2,0} x_0 - a_{2,1} x_1) / a_{2,2}
    General solution: x_i = (b_i - Σ_{j=0..i-1} a_{i,j} x_j) / a_{i,i}
Sequential code:
    x[0] = b[0] / a[0][0];
    for (i = 1; i < n; i++) {
        sum = 0;
        for (j = 0; j < i; j++)
            sum += a[i][j] * x[j];
        x[i] = (b[i] - sum) / a[i][i];
    }
Parallel pseudo code – process P_0 computes x[0] = b[0] / a[0][0] and sends it on; process P_i (1 <= i < n):
    sum = 0;
    for (j = 0; j < i; j++) {
        recv(&x[j], P_(i-1));
        send(&x[j], P_(i+1));
        sum += a[i][j] * x[j];
    }
    x[i] = (b[i] - sum) / a[i][i];
    send(&x[i], P_(i+1));
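A minimal MPI sketch of this type 3 pipeline, assuming one equation (one row of a and one entry of b) per process and the process rank equal to the equation index i; the hard-coded 4x4 example system and the requirement of exactly N processes are illustrative assumptions.

    #include <stdio.h>
    #include <mpi.h>

    #define N 4   /* assumes exactly N processes, one equation per process */

    int main(int argc, char *argv[]) {
        int i;  /* this process's rank doubles as its equation index */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &i);

        /* Illustrative triangular system: a[i][j] and b[i] are known constants. */
        double a[N][N] = {{2,0,0,0},{1,3,0,0},{2,1,4,0},{1,1,1,5}};
        double b[N]    = {2, 5, 11, 13};
        double x[N], sum = 0.0;

        for (int j = 0; j < i; j++) {
            /* receive x[j] computed upstream, forward it, and accumulate */
            MPI_Recv(&x[j], 1, MPI_DOUBLE, i - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (i < N - 1)
                MPI_Send(&x[j], 1, MPI_DOUBLE, i + 1, 0, MPI_COMM_WORLD);
            sum += a[i][j] * x[j];
        }
        x[i] = (b[i] - sum) / a[i][i];      /* back-substitute this process's unknown */
        if (i < N - 1)
            MPI_Send(&x[i], 1, MPI_DOUBLE, i + 1, 0, MPI_COMM_WORLD);
        printf("Process %d: x[%d] = %g\n", i, i, x[i]);

        MPI_Finalize();
        return 0;
    }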

Pipeline Solution
    DO
        IF p ≠ master, receive x_j from the previous processor
        IF p ≠ P-1, send x_j to the next processor
        back substitute using x_j
    UNTIL x_i is evaluated
    IF p ≠ P-1, send x_i to the next processor
Notes:
1. Processing continues after values are sent down the pipeline
2. Is the load imbalanced?

Illustration of Type 3 Solution
[Figure: time diagram for processors P0–P3. P0 computes x0 and sends it on; P1 receives x0, computes x1, and forwards x0, x1; P2 receives x0, x1 and computes x2; P3 receives x0, x1, x2 and computes x3.]
Question: How balanced is this load?