Computer Science and Engineering Parallel and Distributed Processing CSE 8380 February 8, 2005 Session 8

Contents
- Computing the sum on the EREW PRAM
- Computing all partial sums on the EREW PRAM
- Matrix multiplication on the CREW PRAM
- Other algorithms

Recall (PRAM Model)
- Synchronized read-compute-write cycle
- Variants: EREW, ERCW, CREW, CRCW
- Complexity measures: T(n), P(n), C(n)
[Figure: a control unit drives processors P1, P2, ..., Pp, each with its own private memory, all connected to a shared global memory]

Sum on EREW PRAM
- Compute the sum of an array A[1..n]
- We use n/2 processors
- The sum ends up in location A[n]
- For simplicity, we assume n is an integral power of 2
- The work is done in log n iterations: in the first iteration all processors are active, in the second only half of them, and so on

Example: Sum of an array of numbers on the EREW model
[Figure: trace of algorithm Sum_EREW for n = 8 on A[1..8]. Active processors: iteration 1: P1, P2, P3, P4; iteration 2: P2, P4; iteration 3: P4, which leaves the total in A[8]]

Group Work
1. Discuss the algorithm with your neighbor
2. Design the main loops
3. Discuss the complexity

Algorithm Sum_EREW
for i = 1 to log n do
    forall Pj, where 1 ≤ j ≤ n/2, do in parallel
        if (2j mod 2^i) = 0 then
            A[2j] ← A[2j] + A[2j − 2^(i−1)]
        endif
    endfor
endfor
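
To make the loop structure concrete, here is a minimal Python sketch that simulates Sum_EREW on one machine; the function name and the 0-based indexing are illustrative choices, not part of the lecture. Reads are gathered before writes to mimic the synchronous read-compute-write cycle.

```python
import math

def sum_erew(A):
    """Simulate Sum_EREW: n/2 virtual processors run log n
    synchronous iterations; the total ends up in the slide's A[n],
    which is A[n-1] here because this sketch is 0-indexed."""
    n = len(A)
    for i in range(1, int(math.log2(n)) + 1):
        # Read phase: every active processor P_j reads its operand.
        updates = []
        for j in range(1, n // 2 + 1):
            if (2 * j) % (2 ** i) == 0:
                # Slide's A[2j - 2^(i-1)], shifted to 0-based indexing.
                updates.append((2 * j - 1, A[2 * j - 1 - 2 ** (i - 1)]))
        # Write phase: all additions happen after all reads.
        for idx, val in updates:
            A[idx] += val
    return A[n - 1]

print(sum_erew([1, 2, 3, 4, 5, 6, 7, 8]))  # prints 36
```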

Complexity
- Run time: T(n) = O(log n)
- Number of processors: P(n) = n/2
- Cost: C(n) = O(n log n)
- Is it cost optimal?
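
A quick check for the discussion question: C(n) = P(n) × T(n) = (n/2) × O(log n) = O(n log n), while a single processor sums n numbers in O(n) time, so Sum_EREW is not cost optimal.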

All Partial Sums - EREW PRAM
- Compute all partial sums of an array A[1..n]: these are A[1], A[1]+A[2], A[1]+A[2]+A[3], ..., A[1]+A[2]+...+A[n]
- At first glance the problem looks inherently sequential, because one must add up the first k elements before adding in element k+1
- We will see that it can be parallelized
- Let's extend Sum_EREW to do that

All Partial Sums (cont.)
- We noticed that in Sum_EREW most processors are idle most of the time
- By exploiting these idle processors, we can compute all partial sums in the same amount of time it takes to compute the single sum

All Partial Sums (cont.)
- Compute all partial sums of A[1..n]
- We use n − 1 processors (P2, P3, ..., Pn)
- A[k] is replaced by the sum of all elements preceding and including A[k]
- In algorithm Sum_EREW only n/2^i processors are active at iteration i, while in Allsums_EREW nearly all processors are in use

Example: All partial sums on EREW PRAM
[Figure: trace of algorithm Allsums_EREW for n = 8 on A[1..8]. Active processors: iteration 1: P2, P3, ..., P8; iteration 2: P3, P4, ..., P8; iteration 3: P5, P6, P7, P8. Afterwards A[k] holds A[1] + ... + A[k]]

Group Work
1. Discuss the algorithm with your neighbor
2. Design the main loops
3. Discuss the complexity

Algorithm Allsums_EREW
for i = 1 to log n do
    forall Pj, where 2^(i−1) < j ≤ n, do in parallel
        A[j] ← A[j] + A[j − 2^(i−1)]
    endfor
endfor
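
Again, a minimal Python sketch simulating Allsums_EREW (names and 0-based indexing are illustrative choices), with the read phase separated from the write phase to match the synchronous PRAM step:

```python
import math

def allsums_erew(A):
    """Simulate Allsums_EREW: in iteration i, every processor P_j
    with j > 2^(i-1) (slide indexing) adds A[j - 2^(i-1)] into A[j].
    0-indexed here, so the condition becomes j >= 2^(i-1)."""
    n = len(A)
    for i in range(1, int(math.log2(n)) + 1):
        d = 2 ** (i - 1)
        # Read phase: gather operands before any processor writes.
        reads = [(j, A[j - d]) for j in range(d, n)]
        # Write phase.
        for j, val in reads:
            A[j] += val
    return A

print(allsums_erew([1, 2, 3, 4, 5, 6, 7, 8]))
# prints [1, 3, 6, 10, 15, 21, 28, 36]
```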

Complexity
- Run time: T(n) = O(log n)
- Number of processors: P(n) = n − 1
- Cost: C(n) = O(n log n)

Matrix Multiplication
- Multiply two n × n matrices
- For clarity, we assume n is a power of 2
- We use the CREW model to allow concurrent reads
- The two matrices A[1..n, 1..n] and B[1..n, 1..n] are in shared memory
- We will use n^3 processors
- We will also show how to reduce the number of processors

Matrix Multiplication (cont.)
- The n^3 processors are arranged in a three-dimensional array; processor Pi,j,k is the one with index (i, j, k)
- We will use the 3-dimensional array C[1..n, 1..n, 1..n] in shared memory as working space
- The resulting matrix is stored in locations C[i, j, n], where 1 ≤ i, j ≤ n

Two Steps
1. All n^3 processors operate in parallel to compute n^3 multiplications (for each of the n^2 cells in the output matrix, n products are computed)
2. The n products for each cell are summed to produce its final value

Matrix Multiplication Using n^3 Processors
The two steps of the algorithm:
1. Each processor Pi,j,k computes the product A[i,k] · B[k,j] and stores it in C[i,j,k]
2. The idea of algorithm Sum_EREW is applied along the k dimension, n^2 times in parallel, to compute C[i,j,n], where 1 ≤ i, j ≤ n

Algorithm MatMult_CREW
/* Step 1 */
forall Pi,j,k, where 1 ≤ i, j, k ≤ n, do in parallel
    C[i,j,k] ← A[i,k] * B[k,j]
endfor
/* Step 2 */
for l = 1 to log n do
    forall Pi,j,k, where 1 ≤ i, j ≤ n and 1 ≤ k ≤ n/2, do in parallel
        if (2k mod 2^l) = 0 then
            C[i,j,2k] ← C[i,j,2k] + C[i,j,2k − 2^(l−1)]
        endif
    endfor
endfor
/* The output matrix is stored in locations C[i,j,n], where 1 ≤ i, j ≤ n */
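
A minimal Python sketch of MatMult_CREW (an illustrative simulation, not the course's code): step 1 fills the working array C with the n^3 products, and step 2 applies the Sum_EREW doubling along the k dimension. Within one doubling round no written cell is read by another update of the same round, so plain sequential loops reproduce the synchronous PRAM behavior.

```python
import math

def matmult_crew(A, B):
    """Simulate MatMult_CREW on 0-indexed n x n lists (n a power
    of two). Step 1: C[i][j][k] = A[i][k] * B[k][j]. Step 2: the
    Sum_EREW doubling along k; the result sits in C[i][j][n-1]."""
    n = len(A)
    # Step 1: n^3 independent products (concurrent reads are fine
    # on CREW: row i of A and column j of B are each read by n processors).
    C = [[[A[i][k] * B[k][j] for k in range(n)]
          for j in range(n)] for i in range(n)]
    # Step 2: doubling along the k dimension, n^2 cells in parallel.
    for l in range(1, int(math.log2(n)) + 1):
        for i in range(n):
            for j in range(n):
                for k in range(1, n // 2 + 1):
                    if (2 * k) % (2 ** l) == 0:
                        C[i][j][2 * k - 1] += C[i][j][2 * k - 1 - 2 ** (l - 1)]
    return [[C[i][j][n - 1] for j in range(n)] for i in range(n)]

print(matmult_crew([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# prints [[19, 22], [43, 50]]
```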

Complexity
- Run time: T(n) = O(log n)
- Number of processors: P(n) = n^3
- Cost: C(n) = O(n^3 log n)
- Is it cost optimal?
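
For the discussion question: the cost O(n^3 log n) exceeds the O(n^3) work of the standard sequential algorithm, so this version is not cost optimal; the reduction to n^3/log n processors shown a few slides ahead brings the cost down to O(n^3).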

Example
Multiplying two 2 × 2 matrices using algorithm MatMult_CREW. After step 1:
k = 1 plane:
  C[1,1,1] ← A[1,1]B[1,1]   C[1,2,1] ← A[1,1]B[1,2]   (processors P1,1,1 and P1,2,1)
  C[2,1,1] ← A[2,1]B[1,1]   C[2,2,1] ← A[2,1]B[1,2]   (processors P2,1,1 and P2,2,1)
k = 2 plane:
  C[1,1,2] ← A[1,2]B[2,1]   C[1,2,2] ← A[1,2]B[2,2]   (processors P1,1,2 and P1,2,2)
  C[2,1,2] ← A[2,2]B[2,1]   C[2,2,2] ← A[2,2]B[2,2]   (processors P2,1,2 and P2,2,2)

Example (cont.)
Multiplying two 2 × 2 matrices using algorithm MatMult_CREW. After step 2:
k = 2 plane:
  C[1,1,2] ← C[1,1,2] + C[1,1,1]   C[1,2,2] ← C[1,2,2] + C[1,2,1]   (processors P1,1,2 and P1,2,2)
  C[2,1,2] ← C[2,1,2] + C[2,1,1]   C[2,2,2] ← C[2,2,2] + C[2,2,1]   (processors P2,1,2 and P2,2,2)

Matrix Multiplication: Reducing the Number of Processors to n^3/log n
Processors are arranged in an n × n × n/(log n) three-dimensional array.
1. Each processor Pi,j,k, where 1 ≤ k ≤ n/log n, computes the sum of log n products. This step produces n^3/log n partial sums.
2. The partial sums produced in step 1 are added to form the resulting matrix, as discussed previously.
Complexity analysis
- Run time: T(n) = O(log n)
- Number of processors: P(n) = n^3/log n
- Cost: C(n) = O(n^3)
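
A small Python sketch of the blocking idea, under stated assumptions: the helper name is illustrative, log n is taken base 2 and assumed to divide n, and step 2's doubling is abbreviated to a plain sum() for brevity.

```python
import math

def matmult_reduced(A, B):
    """Sketch of the n^3/log n scheme, 0-indexed. Step 1: each
    virtual processor (i, j, blk) sums a block of log n products
    sequentially. Step 2: the n/log n partial sums per cell are
    combined (Sum_EREW doubling in the real algorithm)."""
    n = len(A)
    b = int(math.log2(n))               # log n products per processor
    assert n % b == 0, "sketch assumes log n divides n"
    result = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Step 1: n/log n partial sums for cell (i, j).
            partial = [sum(A[i][k] * B[k][j]
                           for k in range(blk * b, (blk + 1) * b))
                       for blk in range(n // b)]
            # Step 2: O(log n) doubling steps, abbreviated here.
            result[i][j] = sum(partial)
    return result

I4 = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
print(matmult_reduced(I4, I4))  # identity times identity: identity
```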

Searching
- Given a sorted array A = (a1, a2, ..., ai, ..., an) and a value x, determine whether x = ai for some i
- Sequential binary search: O(log n)
- Simple idea: divide the list among the processors and let each processor conduct its own binary search
  - EREW PRAM: O(log(n/p)) for the local searches + O(log p) to broadcast x = O(log n)
  - CREW: O(log(n/p)), since all processors can read x concurrently

Parallel Binary Search
- Split A into p + 1 segments of almost equal length
- Compare x with the p elements at the boundaries between successive segments
- Either x equals one of these boundary elements, or the search is restricted to exactly one of the p + 1 segments
- Repeat until x is found or the length of the remaining list is ≤ p
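
A sketch of one way to realize this in Python; the function name, the boundary-choice details, and the sequential simulation of the p parallel comparisons are all illustrative assumptions, not part of the slides. Each round shrinks the candidate range by a factor of about p + 1, giving roughly O(log n / log p) rounds on a CREW PRAM.

```python
def parallel_binary_search(A, x, p):
    """(p+1)-ary search on a sorted list A. Each round, p virtual
    processors compare x against the p segment-boundary elements
    'in parallel' (simulated by a loop); either one of them hits x,
    or exactly one of the p+1 segments can still contain x."""
    lo, hi = 0, len(A)                  # current candidate range A[lo:hi]
    while hi - lo > p:
        # p boundary positions splitting A[lo:hi] into p+1 segments
        bounds = [lo + (hi - lo) * (t + 1) // (p + 1) for t in range(p)]
        for b in bounds:                # one parallel comparison round
            if A[b] == x:
                return b
        starts = [lo] + [b + 1 for b in bounds]
        ends = bounds + [hi]
        for s, e in zip(starts, ends):  # keep the one surviving segment
            if e == hi or A[e] > x:
                lo, hi = s, e
                break
    for i in range(lo, hi):             # remaining list has length <= p
        if A[i] == x:
            return i
    return -1

print(parallel_binary_search(list(range(0, 100, 2)), 42, 3))  # prints 21
```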