5 PRAM and Basic Algorithms

Topics in This Chapter
5.1 PRAM Submodels and Assumptions
5.2 Data Broadcasting
5.3 Semigroup or Fan-in Computation
5.4 Parallel Prefix Computation
5.5 Ranking the Elements of a Linked List
5.6 Matrix Multiplication

5.1 PRAM Submodels and Assumptions

Fig. 4.6 Conceptual view of a parallel random-access machine (PRAM).

Processor i can do the following in the three phases of one cycle:
1. Fetch a value from address s_i in shared memory
2. Perform computations on data held in local registers
3. Store a value into address d_i in shared memory

Because the addresses s_i and d_i are determined by Processor i, independently of all other processors, it is possible that several processors may want to read data from the same memory location or write their values into a common location. Hence, four submodels of the PRAM model have been defined based on whether concurrent reads (writes) from (to) the same location are allowed. The four possible combinations, depicted in Fig. 5.1, are:
- EREW: Exclusive-read, exclusive-write
- ERCW: Exclusive-read, concurrent-write
- CREW: Concurrent-read, exclusive-write
- CRCW: Concurrent-read, concurrent-write

Fig. 5.1 Submodels of the PRAM model.

Types of CRCW PRAM

CRCW PRAM is further classified according to how concurrent writes are handled. Here are a few example submodels based on the semantics of concurrent writes in CRCW PRAM:
- Undefined: In case of multiple writes, the value written is undefined (CRCW-U).
- Detecting: A special code representing "detected collision" is written (CRCW-D).
- Common: Multiple writes are allowed only if all processors store the same value (CRCW-C). This is sometimes called the consistent-write submodel.
- Random: The value written is randomly chosen from among those offered (CRCW-R). This is sometimes called the arbitrary-write submodel.
- Priority: The processor with the lowest index succeeds in writing its value (CRCW-P).
- Max/Min: The largest/smallest of the multiple values is written (CRCW-M).
- Reduction: The arithmetic sum (CRCW-S), logical AND (CRCW-A), logical XOR (CRCW-X), or some other combination of the multiple values is written.
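
To make the write-resolution semantics concrete, here is a minimal Python sketch (not from the slides) that resolves a set of conflicting writes to one location under several of these submodels. The (processor_id, value) representation and the function names are this sketch's own assumptions.

import random
from functools import reduce

# Writes aimed at one shared-memory cell: (processor_id, value) pairs.
writes = [(3, 7), (0, 7), (5, 2)]

def crcw_c(ws):                  # Common: all written values must agree
    vals = {v for _, v in ws}
    assert len(vals) == 1, "conflicting writes are illegal in CRCW-C"
    return vals.pop()

def crcw_r(ws):                  # Random/arbitrary: any one value is kept
    return random.choice(ws)[1]

def crcw_p(ws):                  # Priority: lowest processor index wins
    return min(ws)[1]

def crcw_m(ws):                  # Max: the largest value is written
    return max(v for _, v in ws)

def crcw_s(ws):                  # Reduction: e.g., the arithmetic sum
    return reduce(lambda a, b: a + b, (v for _, v in ws))

print(crcw_p(writes), crcw_m(writes), crcw_s(writes))   # 7 7 16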

Power of PRAM Submodels

These submodels are all different from each other and from EREW and CREW. One way to order these submodels is by their computational power. The following relationships have been established between some of the PRAM submodels:

EREW < CREW < CRCW-D < CRCW-C < CRCW-R < CRCW-P

Some Elementary PRAM Computations

Initializing an n-vector (base address = B) to all 0s (the vector is viewed as ⌈n/p⌉ segments of p elements each):

for j = 0 to ⌈n/p⌉ – 1 Processor i do
  if jp + i < n then M[B + jp + i] := 0
endfor

Adding two n-vectors and storing the results in a third (base addresses B′, B″, B).

Convolution of two n-vectors: W_k = Σ_{i+j=k} U_i × V_j (base addresses B_W, B_U, B_V).
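
As a concreteness check, here is a minimal Python simulation of the initialization loop. The sequential inner loop over i stands in for the p processors acting in lockstep, and the memory list is this sketch's stand-in for the shared memory.

import math

n, p, B = 10, 4, 0                     # vector length, processor count, base address
memory = [None] * (B + n)              # stand-in for the shared memory

for j in range(math.ceil(n / p)):      # ceil(n/p) parallel steps
    for i in range(p):                 # "Processor i do" (conceptually parallel)
        if j * p + i < n:
            memory[B + j * p + i] = 0

print(memory)                          # [0, 0, ..., 0] (n zeros)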

5.2 Data Broadcasting

Simple, or one-to-all, broadcasting is used when one processor needs to send a data value to all other processors. In the CREW or CRCW submodels, broadcasting is trivial, as the sending processor can write the data value into a memory location, with all processors reading that data value in the following machine cycle. Thus, simple broadcasting is done in Θ(1) steps. Multicasting within groups is equally simple if each processor knows its group membership(s) and only members of each group read the multicast data for that group. All-to-all broadcasting, where each of the p processors needs to send a data value to all other processors, can be done through p separate broadcast operations in Θ(p) steps, which is optimal.

The above scheme is clearly inapplicable to broadcasting in the EREW model.


Fig. 5.2 Data broadcasting in EREW PRAM via recursive doubling.

Making p copies of B[0] by recursive doubling (some of these copies are redundant; Fig. 5.3 below avoids them):

for k = 0 to ⌈log₂ p⌉ – 1 Processor j, 0 ≤ j < p, do
  Copy B[j] into B[j + 2^k]
endfor

EREW PRAM algorithm for broadcasting by Processor i:

Processor i write the data value into B[0]
s := 1
while s < p
  Processor j, 0 ≤ j < min(s, p – s), do
    Copy B[j] into B[j + s]
  s := 2s
endwhile
Processor j, 0 ≤ j < p, read the data value in B[j]

Fig. 5.3 EREW PRAM data broadcasting without redundant copying.
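
A minimal Python simulation of this algorithm, with sequential loops standing in for the parallel processors. Note that in each round the reads (indices below s) and the writes (indices s and above) touch disjoint cells, so no concurrent access arises.

p = 5
B = [None] * p          # broadcast vector in shared memory
B[0] = 42               # Processor i writes its data value into B[0]

s = 1
while s < p:
    for j in range(min(s, p - s)):   # conceptually parallel copies
        B[j + s] = B[j]
    s *= 2

assert B == [42] * p    # every processor can now read its own copy B[j]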

All-to-All Broadcasting on EREW PRAM

To perform all-to-all broadcasting, so that each processor broadcasts a value that it holds to each of the other p – 1 processors, we let Processor j write its value into B[j], rather than into B[0]. Thus, in one memory access step, all of the values to be broadcast are written into the broadcast vector B. Each processor then reads the other p – 1 values in p – 1 memory accesses. To ensure that all reads are exclusive, Processor j begins reading the values starting with B[j + 1], wrapping around to B[0] after reading B[p – 1].

EREW PRAM algorithm for all-to-all broadcasting:

Processor j, 0 ≤ j < p, write own data value into B[j]
for k = 1 to p – 1
  Processor j, 0 ≤ j < p, do
    Read the data value in B[(j + k) mod p]
endfor

This Θ(p)-step algorithm is time-optimal.
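
A Python sketch of the read schedule (the received lists are this sketch's way of recording what each processor sees): in step k every processor reads at the same offset k from its own index, so all reads are exclusive.

p = 4
B = [10, 20, 30, 40]                 # B[j] was written by Processor j
received = [[B[j]] for j in range(p)]

for k in range(1, p):                # p - 1 exclusive-read steps
    for j in range(p):               # conceptually parallel
        received[j].append(B[(j + k) % p])

print(received)
# Processor j sees all p values, starting with its own:
# [[10, 20, 30, 40], [20, 30, 40, 10], [30, 40, 10, 20], [40, 10, 20, 30]]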

Naive EREW PRAM sorting algorithm (using all-to-all broadcasting):

Processor j, 0 ≤ j < p, write 0 into R[j]
for k = 1 to p – 1
  Processor j, 0 ≤ j < p, do
    l := (j + k) mod p
    if S[l] < S[j] or (S[l] = S[j] and l < j)
      then R[j] := R[j] + 1
    endif
endfor
Processor j, 0 ≤ j < p, write S[j] into S[R[j]]

Because each data element must be given a unique rank, ties are broken by using the processor ID. In other words, if Processors i and j (i < j) hold equal data values, the value in Processor i is deemed smaller for ranking purposes.

This Θ(p)-step sorting algorithm is far from optimal: sorting is possible in O(log p) time, and the O(p²) computational work involved here is significantly greater than the O(p log p) work required for sorting p elements on a single processor.
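
A direct Python rendering of this ranking-based sort, with sequential loops standing in for the parallel steps (the sample values are arbitrary):

p = 5
S = [3, 1, 4, 1, 5]                  # S[j] is held by Processor j
R = [0] * p                          # ranks, initially 0

for k in range(1, p):                # p - 1 exclusive-read steps
    for j in range(p):               # conceptually parallel
        l = (j + k) % p
        if S[l] < S[j] or (S[l] == S[j] and l < j):
            R[j] += 1                # one more key ranks below S[j]

out = [None] * p
for j in range(p):                   # Processor j writes S[j] into slot R[j]
    out[R[j]] = S[j]

print(out)                           # [1, 1, 3, 4, 5]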

5.3 Semigroup or Fan-in Computation

This computation is trivial for a CRCW PRAM of the "reduction" variety if the reduction operator happens to be ⊗. For example, computing the arithmetic sum (logical AND, logical XOR) of p values, one per processor, is trivial for the CRCW-S (CRCW-A, CRCW-X) PRAM; it can be done in a single cycle by each processor writing its corresponding value into a common location that will then hold the arithmetic sum (logical AND, logical XOR) of all of the values.

EREW PRAM semigroup computation algorithm:

Processor j, 0 ≤ j < p, copy X[j] into S[j]
s := 1
while s < p
  Processor j, 0 ≤ j < p – s, do
    S[j + s] := S[j] ⊗ S[j + s]
  s := 2s
endwhile
Broadcast S[p – 1] to all processors

Fig. 5.4 Semigroup computation in EREW PRAM.

This algorithm is optimal for the EREW PRAM, but its speedup of O(p / log p) is not.
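
A Python simulation of Fig. 5.4, with + standing in for the associative operator ⊗. The per-round snapshot old mimics the PRAM's synchronous read-then-write step, which a naive sequential loop would otherwise violate.

p = 8
X = [1, 2, 3, 4, 5, 6, 7, 8]
S = X[:]                             # Processor j copies X[j] into S[j]

s = 1
while s < p:
    old = S[:]                       # all reads in one step see old values
    for j in range(p - s):           # conceptually parallel
        S[j + s] = old[j] + old[j + s]
    s *= 2

print(S[p - 1])                      # 36; this value is then broadcast
print(S)                             # S also holds all prefixes (see Section 5.4)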

When each of the p processors is in charge of n/p elements, rather than just one element, the semigroup computation is performed as follows:
1. Each processor first combines its n/p elements in n/p steps to get a single value.
2. Then, the algorithm just discussed is used, with its first step replaced by copying the result of Step 1 into S[j].

It is instructive to evaluate the speedup and efficiency of the above algorithm for an n-input semigroup computation using p processors, both for n = p and for n > p (page 97).
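
A hedged sketch of that evaluation (the step counts are approximate: n/p local combining steps, about log₂ p doubling steps, and about log₂ p steps for the final broadcast):

\[
T(1) = n - 1, \qquad T(n, p) \approx \frac{n}{p} + 2\log_2 p
\]
\[
S(n, p) = \frac{T(1)}{T(n, p)} \approx \frac{n}{n/p + 2\log_2 p}, \qquad
E(n, p) = \frac{S(n, p)}{p}
\]
\[
n = p:\; S \approx \frac{p}{1 + 2\log_2 p} = \Theta\!\left(\frac{p}{\log p}\right);
\qquad
n = \Omega(p \log p):\; E = \Theta(1)
\]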

5.4 Parallel Prefix Computation

We see in Fig. 5.4 that, as the semigroup computation result emerges in S[p – 1], all partial prefixes are also obtained in the previous elements of S. Figure 5.6 is identical to Fig. 5.4, except that it includes shading to show that the number of correct prefix results doubles in each step.

Fig. 5.6 Parallel prefix computation in EREW PRAM via recursive doubling.

This is the same as the first part of the semigroup computation (no final broadcasting).
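
A Python sketch of this recursive-doubling prefix computation, written in the equivalent "shift by s" form (S[j] absorbs S[j – s] in each round); + again stands in for ⊗, and the snapshot mimics the synchronous PRAM step.

p = 8
S = [1, 2, 3, 4, 5, 6, 7, 8]

s = 1
while s < p:
    old = S[:]                       # synchronous read-then-write step
    for j in range(s, p):            # conceptually parallel
        S[j] = old[j - s] + old[j]
    s *= 2

print(S)                             # [1, 3, 6, 10, 15, 21, 28, 36]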

The previous algorithm is quite efficient, but there are other ways of performing parallel prefix computation on the PRAM. In particular, the divide-and-conquer paradigm leads to two other solutions to this problem. In the following, we deal only with the case of a p-input problem, where p (the number of inputs or processors) is a power of 2. As in Section 5.3, the pair of integers u:v represents the combination (e.g., sum) of all input values from x_u to x_v. Figure 5.7 depicts our first divide-and-conquer algorithm. We view the problem as composed of two subproblems: computing the odd-indexed results s_1, s_3, s_5, ... and computing the even-indexed results s_0, s_2, s_4, ....

First Divide-and-Conquer Parallel-Prefix Algorithm

Fig. 5.7 Parallel prefix computation using a divide-and-conquer scheme. Each vertical line represents a location in shared memory.

T(p) = T(p/2) + 2, which gives T(p) ≅ 2 log₂ p.

In hardware, this scheme is the basis for the Brent–Kung carry-lookahead adder.
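
A recursive Python sketch of this scheme under the slide's assumption that p is a power of 2 (with + for ⊗; the function name is this sketch's own): pairwise combining yields the odd-indexed prefixes via the half-size subproblem, and one more combining step fills in the even-indexed ones, matching T(p) = T(p/2) + 2.

def prefix_dc1(x):
    p = len(x)
    if p == 1:
        return x[:]
    # Step 1: combine adjacent pairs (one parallel step)
    pairs = [x[2 * i] + x[2 * i + 1] for i in range(p // 2)]
    # Recurse: prefixes of the pair sums are the odd-indexed results
    odd = prefix_dc1(pairs)
    s = [None] * p
    for i in range(p // 2):          # one more parallel step
        s[2 * i + 1] = odd[i]
        # even-indexed result: preceding odd prefix combined with x[2i];
        # operand order is preserved, so no commutativity is needed
        s[2 * i] = x[0] if i == 0 else odd[i - 1] + x[2 * i]
    return s

print(prefix_dc1([1, 2, 3, 4, 5, 6, 7, 8]))
# [1, 3, 6, 10, 15, 21, 28, 36]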

Second Divide-and-Conquer Parallel-Prefix Algorithm

Figure 5.8 depicts a second divide-and-conquer algorithm. We view the input list as composed of two sublists: the even-indexed inputs x_0, x_2, x_4, ... and the odd-indexed inputs x_1, x_3, x_5, .... Parallel prefix computation is performed separately on each sublist, leading to partial results as shown in Fig. 5.8 (a sequence of digits indicates the combination of elements with those indices). The final results are obtained by pairwise combination of adjacent partial results in a single PRAM step. The total computation time is given by the recurrence

T(p) = T(p/2) + 1

Even though this latter algorithm is more efficient than the first divide-and-conquer scheme, it is applicable only if the operator ⊗ is commutative (why?).

Fig. 5.8 Another divide-and-conquer scheme for parallel prefix computation. Each vertical line represents a location in shared memory.

T(p) = T(p/2) + 1, which gives T(p) = log₂ p. This algorithm is strictly optimal, but it requires a commutative operator.
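
A recursive Python sketch of the second scheme (again + for ⊗, p a power of 2, function name assumed): the even- and odd-indexed sublists are prefixed independently, conceptually in parallel, so only the single pairwise-combination step adds to the recurrence. The final combination interleaves operands out of their original order, which is exactly where commutativity is needed.

def prefix_dc2(x):
    p = len(x)
    if p == 1:
        return x[:]
    E = prefix_dc2(x[0::2])          # prefixes of x0, x2, x4, ...
    O = prefix_dc2(x[1::2])          # prefixes of x1, x3, x5, ...
    s = [None] * p
    for i in range(p // 2):          # one final parallel combining step
        # E[i] covers the even inputs up to x(2i); O[i-1] covers the odd
        # inputs before position 2i, so the operands arrive out of order
        s[2 * i] = E[i] if i == 0 else E[i] + O[i - 1]
        s[2 * i + 1] = E[i] + O[i]
    return s

print(prefix_dc2([1, 2, 3, 4, 5, 6, 7, 8]))
# [1, 3, 6, 10, 15, 21, 28, 36]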