Lecture 3 PRAM Algorithms Parallel Computing Fall 2008


1 Lecture 3 PRAM Algorithms Parallel Computing Fall 2008

2 Four Subclasses of PRAM
Depending on how concurrent access to a single memory cell (of the shared memory) is resolved, there are various PRAM variants. ER (Exclusive Read) and EW (Exclusive Write) PRAMs do not allow concurrent access to the shared memory; CR (Concurrent Read) and CW (Concurrent Write) PRAMs do. Combining the rules for read and write access gives four PRAM variants:
EREW: access to a memory location is exclusive; no concurrent read or write operations are allowed. The weakest PRAM model.
CREW: multiple read accesses to a memory location are allowed; multiple write accesses are serialized.
ERCW: multiple write accesses to a memory location are allowed; multiple read accesses are serialized. Can simulate an EREW PRAM.
CRCW: multiple read and write accesses to a common memory location are allowed. The most powerful PRAM model; can simulate both the EREW and the CREW PRAM.

3 Resolving Concurrent Write Access
(1) In the arbitrary PRAM, if multiple processors write into a single shared-memory cell, an arbitrary one of them succeeds in writing into this cell.
(2) In the common PRAM, all processors writing into the same cell must write the same value.
(3) In the priority PRAM, the processor with the highest priority (the smallest- or largest-indexed processor) succeeds in writing.
(4) In the combining PRAM, if more than one processor writes into the same memory cell, the value written depends on the combining operator: for the sum operator the sum of the values is written; for the maximum operator the maximum is written.
Note: An algorithm designed for the common PRAM can be executed on a priority or arbitrary PRAM and exhibit similar complexity. The same holds for an arbitrary PRAM algorithm when run on a priority PRAM.
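As a minimal sketch (not from the slides), the four write-resolution rules can be expressed as a function applied to the writes issued to one cell in a single step. The processor ids, values, and the choice of "first writer" for the arbitrary rule are illustrative assumptions:

```python
# Sketch of the four concurrent-write resolution rules for one shared
# cell. `writes` is a list of (processor_id, value) pairs issued in the
# same PRAM step.

def resolve_write(writes, rule, combine=None):
    """Return the value stored in the cell after one concurrent-write step."""
    if rule == "arbitrary":
        # Any one writer may succeed; this sketch just picks the first.
        return writes[0][1]
    if rule == "common":
        values = {v for _, v in writes}
        assert len(values) == 1, "common PRAM: all writers must agree"
        return values.pop()
    if rule == "priority":
        # Convention here: the smallest processor index wins.
        return min(writes)[1]
    if rule == "combining":
        result = writes[0][1]
        for _, v in writes[1:]:
            result = combine(result, v)
        return result
    raise ValueError(rule)

writes = [(2, 5), (0, 7), (1, 3)]
print(resolve_write(writes, "priority"))                       # 7
print(resolve_write(writes, "combining", lambda a, b: a + b))  # 15
```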

4 Parallel Algorithm Assumptions
Convention: In this subject we name processors arbitrarily either 0, 1, ..., p − 1 or 1, 2, ..., p.
The input to a particular problem resides in the cells of the shared memory. To simplify the exposition of our algorithms, we assume that a cell is wide enough (in bits or bytes) to accommodate a single instance of the input (e.g. a key or a floating-point number). If the input is of size n, the first n cells, numbered 0, ..., n − 1, store the input.
We assume that the number of processors of the PRAM is n or a polynomial function of the input size n. Processor indices are 0, 1, ..., n − 1.

5 Parallel Sum (Compute x0 + x1 + ... + xn−1)
Algorithm Parallel Sum.
      M[0]       M[1]  M[2]   M[3]  M[4]       M[5]  M[6]   M[7]
t=0:  x0         x1    x2     x3    x4         x5    x6     x7
t=1:  x0+x1            x2+x3        x4+x5            x6+x7
t=2:  x0+...+x3               x4+...+x7
t=3:  x0+...+x7
This EREW PRAM algorithm consists of lg n steps. In step i, if j is divisible by 2^i, processor j reads shared-memory cells j and j + 2^(i−1), combines (sums) these values, and stores the result into memory cell j. After lg n steps the sum resides in cell 0.
Algorithm Parallel Sum has T = O(lg n), P = n, and W = O(n lg n); W2 = O(n).
Processors used: t=1: P0, P2, P4, P6; t=2: P0, P4; t=3: P0.

6 Parallel Sum (Compute x0 + x1 + ... + xn−1)
// pid() returns the id of the processor issuing the call.
begin Parallel Sum (n)
1. i = 1; j = pid();
2. while (j mod 2^i == 0 and 2^i <= n)
3.   a = C[j];
4.   b = C[j + 2^(i−1)];
5.   C[j] = a + b;
6.   i = i + 1;
7. end
end Parallel Sum
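The steps above can be simulated sequentially; this is a minimal sketch (assumed, not from the slides) of the in-place EREW scheme, where in step i every processor j with j mod 2^i == 0 adds cell j + 2^(i−1) into cell j:

```python
import math

def parallel_sum_inplace(C):
    """Simulate the in-place EREW Parallel Sum; len(C) must be a power of two."""
    n = len(C)
    for i in range(1, int(math.log2(n)) + 1):
        stride = 2 ** i
        # All processors j with j mod 2^i == 0 act in the same step;
        # their reads and writes touch disjoint cells (EREW).
        for j in range(0, n, stride):
            C[j] = C[j] + C[j + stride // 2]
    return C[0]

print(parallel_sum_inplace([1, 2, 3, 4, 5, 6, 7, 8]))  # 36
```

After the loop the total sits in cell 0, matching the t=3 row of the diagram on the previous slide.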

7 Parallel Sum (Compute x0 + x1 + ... + xn−1)
A sequential algorithm that solves this problem requires n − 1 additions. For a PRAM implementation, value xi is initially stored in shared-memory cell i. The sum x0 + x1 + ... + xn−1 is to be computed in T = lg n parallel steps. Without loss of generality, let n be a power of two.
If a combining CRCW PRAM with combining operator sum is used to solve this problem, the resulting algorithm is quite simple. In the first step processor i reads memory cell i storing xi. In the following step processor i writes the read value into an agreed cell, say 0. The time is T = O(1), and the processor count is P = O(n).
A more interesting algorithm is the one presented below for the EREW PRAM. The algorithm consists of lg n steps. In step i, each processor j < n/2^i reads shared-memory cells 2j and 2j + 1, combines (sums) these values, and stores the result into memory cell j. After lg n steps the sum resides in cell 0.
Algorithm Parallel Sum has T = O(lg n), P = n, and W = O(n lg n); W2 = O(n).

8 Parallel Sum (Compute x0 + x1 + ... + xn−1)
// pid() returns the id of the processor issuing the call.
begin Parallel Sum (n)
1. i = 1; j = pid();
2. while (j < n / 2^i)
3.   a = C[2j];
4.   b = C[2j + 1];
5.   C[j] = a + b;
6.   i = i + 1;
7. end
end Parallel Sum
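A minimal sequential sketch (assumed, not from the slides) of this compacting variant, where in step i every processor j < n/2^i writes C[2j] + C[2j+1] into cell j:

```python
def parallel_sum_compact(C):
    """Simulate the compacting EREW Parallel Sum; len(C) a power of two."""
    active = len(C) // 2
    while active >= 1:
        # One synchronous PRAM step: all reads happen before all writes,
        # so the new values are computed first, then stored.
        sums = [C[2 * j] + C[2 * j + 1] for j in range(active)]
        for j in range(active):
            C[j] = sums[j]
        active //= 2
    return C[0]

print(parallel_sum_compact([1, 2, 3, 4, 5, 6, 7, 8]))  # 36
```

The two-phase read-then-write inside the loop mirrors the synchronous PRAM step semantics.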

9 Parallel Sum: An example
Algorithm Parallel Sum (compacting EREW version).
      M[0]       M[1]       M[2]   M[3]   M[4]  M[5]  M[6]  M[7]
t=0:  x0         x1         x2     x3     x4    x5    x6    x7
t=1:  x0+x1      x2+x3      x4+x5  x6+x7
t=2:  x0+...+x3  x4+...+x7
t=3:  x0+...+x7

10 Parallel Sum
Algorithm Parallel Sum can easily be extended to the case where n is not a power of two. Parallel Sum is a first instance of a problem with a trivial sequential but more complex parallel solution. Instead of the operator Sum, other operators such as Multiply, Maximum, Minimum, or in general any associative operator, could have been used. An associative operator ⊗ is one such that (a ⊗ b) ⊗ c = a ⊗ (b ⊗ c).
Exercise 1: Can you improve Parallel Sum so that T remains the same, P = O(n/lg n), and W = O(n)? Explain.
Exercise 2: What if we have p processors, where p < n? (You may assume that n is a multiple of p.)
Exercise 3: Generalize the Parallel Sum algorithm to any associative operator.
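To illustrate the role of associativity, here is a minimal sequential sketch (assumed, not from the slides) of the halving scheme parameterized by an arbitrary associative operator op:

```python
def parallel_reduce(C, op):
    """Reduce C with an associative operator op; len(C) a power of two."""
    active = len(C) // 2
    while active >= 1:
        # One synchronous step: processor j applies op to cells 2j, 2j+1.
        vals = [op(C[2 * j], C[2 * j + 1]) for j in range(active)]
        for j in range(active):
            C[j] = vals[j]
        active //= 2
    return C[0]

print(parallel_reduce([3, 1, 4, 1, 5, 9, 2, 6], max))             # 9
print(parallel_reduce([3, 1, 4, 1, 5, 9, 2, 6], lambda a, b: a * b))  # 6480
```

Associativity is exactly what makes the tree-shaped grouping of operands legal: (a ⊗ b) ⊗ (c ⊗ d) equals the left-to-right sequential result.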

11 PRAM Algorithm: Broadcasting
A message (say, a word) is stored in cell 0 of the shared memory. We would like this message to be read by all n processors of a PRAM. On a CREW PRAM this requires one parallel step (each processor i concurrently reads cell 0). On an EREW PRAM broadcasting can be performed in O(lg n) steps. The structure of the algorithm is the reverse of parallel sum: in step i each processor with index j less than 2^i reads the contents of cell j and copies it into cell j + 2^i. After lg n steps each processor i reads the message from cell i.
A CR?W (CREW or CRCW) PRAM algorithm that solves the broadcasting problem has performance P = O(n), T = O(1), and W = O(n). The EREW PRAM algorithm that solves the broadcasting problem has performance P = O(n), T = O(lg n), and W = O(n lg n); W2 = O(n).

12 Broadcasting
begin Broadcast (M)
1. i = 0; j = pid(); if (j == 0) C[0] = M;
2. while (2^i < P)
3.   if (j < 2^i)
4.     C[j + 2^i] = C[j];
5.   i = i + 1;
6. end
7. Processor j reads M from C[j].
end Broadcast
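A minimal sequential sketch (assumed, not from the slides) of the EREW broadcast: each round doubles the number of cells holding the message, so lg n rounds fill all n cells.

```python
def broadcast(message, n):
    """Return the shared memory after an EREW broadcast to n cells (n a power of two)."""
    C = [None] * n
    C[0] = message          # processor 0 places the message
    copies = 1              # 2^i cells hold the message before step i
    while copies < n:
        # One EREW step: processor j < 2^i reads C[j], writes C[j + 2^i];
        # all reads and writes touch disjoint cells.
        for j in range(copies):
            C[j + copies] = C[j]
        copies *= 2
    return C

print(broadcast("msg", 8))  # ['msg', 'msg', 'msg', 'msg', 'msg', 'msg', 'msg', 'msg']
```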

13 PRAM Algorithm: Matrix Multiplication
A simple algorithm for multiplying two n × n matrices on a CREW PRAM with time complexity T = O(lg n) and P = n^3 follows. For convenience, processors are indexed as triples (i, j, k), where i, j, k = 1, ..., n. In the first step processor (i, j, k) concurrently reads a_ij and b_jk and performs the multiplication a_ij * b_jk. In the following steps, for all i, k the results of processors (i, *, k) are combined, using the parallel sum algorithm, to form c_ik = Σ_j a_ij * b_jk. After lg n steps the result c_ik is thus computed.
The same algorithm also works on the EREW PRAM with the same time and processor complexity. Only the first step of the CREW algorithm needs to change: we avoid concurrent reads by broadcasting element a_ij to processors (i, j, *) using the broadcasting algorithm of the EREW PRAM in O(lg n) steps. Similarly, b_jk is broadcast to processors (*, j, k).
The above algorithm also shows how an n-processor EREW PRAM can simulate an n-processor CREW PRAM with an O(lg n) slowdown.

14 Matrix Multiplication
                                              CREW      EREW
1. a_ij to all (i,j,*) procs                  O(1)      O(lg n)
   b_jk to all (*,j,k) procs                  O(1)      O(lg n)
2. a_ij * b_jk at (i,j,k) proc                O(1)      O(1)
3. parallel sum Σ_j a_ij * b_jk,              O(lg n)   O(lg n)
   (i,*,k) procs; n procs participate
4. c_ik = Σ_j a_ij * b_jk                     O(1)      O(1)
T = O(lg n), P = O(n^3), W = O(n^3 lg n), W2 = O(n^3)
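The table above can be simulated sequentially; this is a sketch (assumed, not from the slides) of the CREW version: the n^3 products a_ij * b_jk are formed first, then lg n parallel-sum rounds combine the products of processors (i, *, k) along j:

```python
def pram_matmul(A, B):
    """Simulate the n^3-processor CREW matrix multiply; n a power of two."""
    n = len(A)
    # Step 1-2: processor (i, j, k) computes a_ij * b_jk.
    prod = [[[A[i][j] * B[j][k] for k in range(n)] for j in range(n)]
            for i in range(n)]
    # Step 3: compacting parallel sum over j for each pair (i, k).
    active = n // 2
    while active >= 1:
        for i in range(n):
            for k in range(n):
                for j in range(active):
                    prod[i][j][k] = prod[i][2 * j][k] + prod[i][2 * j + 1][k]
        active //= 2
    # Step 4: c_ik is left in slot j = 0.
    return [[prod[i][0][k] for k in range(n)] for i in range(n)]

print(pram_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```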

15 PRAM Algorithm: Logical AND Operation
Problem. Let X1, ..., Xn be binary/boolean values. Find X = X1 ∧ X2 ∧ ... ∧ Xn.
The sequential problem accepts a direct P = 1, T = O(n), W = O(n) solution. An EREW PRAM algorithm for this problem works the same way as the Parallel Sum algorithm; its performance is P = O(n), T = O(lg n), W = O(n lg n), along with the improvements in P and W mentioned for the Parallel Sum algorithm.
In the remainder we investigate a CRCW PRAM algorithm. Let binary value Xi reside in shared-memory location i. We can find X = X1 ∧ X2 ∧ ... ∧ Xn in constant time on a CRCW PRAM: processor 1 first writes a 1 into shared-memory cell 0; then, if Xi = 0, processor i writes a 0 into cell 0. The result X is then stored in this memory cell: the value in cell 0 stays 1 (TRUE) unless some processor writes a 0 into it, in which case some Xi is 0 and the result X should be FALSE, as it is.

16 Logical AND Operation
begin Logical AND (X1, ..., Xn)
1. Processor 1 writes 1 into cell 0.
2. if Xi = 0, processor i writes 0 into cell 0.
end Logical AND
Exercise: Give an O(1) CRCW algorithm for LOGICAL OR.
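A minimal sequential sketch (assumed, not from the slides) of the O(1) CRCW Logical AND above:

```python
def crcw_and(X):
    """Simulate the two-step CRCW Logical AND over boolean values X."""
    cell0 = 1                 # step 1: processor 1 writes a 1 into cell 0
    # Step 2: every processor i with X[i] == 0 writes a 0 into cell 0.
    # All such writers write the same value, so the concurrent write is
    # legal even under the common CRCW rule of slide 3.
    for x in X:
        if x == 0:
            cell0 = 0
    return cell0

print(crcw_and([1, 1, 1, 1]))  # 1
print(crcw_and([1, 0, 1, 1]))  # 0
```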

17 End Thank you!