1 Computer Science and Engineering
Parallel and Distributed Processing
CSE 8380, February 8, 2005, Session 8

2 Contents
- Computing the sum on the EREW PRAM
- Computing all partial sums on the EREW PRAM
- Matrix multiplication on the CREW PRAM
- Other algorithms

3 Recall: the PRAM model
- Synchronized read-compute-write cycles
- Variants: EREW, ERCW, CREW, CRCW
- Complexity measures: T(n), P(n), C(n)
[Diagram: processors P1, P2, ..., Pp, each with a private memory, connected through a control unit to a shared global memory]

4 Sum on the EREW PRAM
- Compute the sum of an array A[1..n]
- We use n/2 processors
- The sum ends up in location A[n]
- For simplicity, we assume n is an integral power of 2
- The work is done in log n iterations: in the first iteration all processors are active, in the second only half of them are active, and so on.

5 Example: sum of an array of numbers on the EREW model
Trace of algorithm sum_EREW for n = 8:

Initial:            A = 5  2  10   1  8  12  7   3
After iteration 1:  A = 5  7  10  11  8  20  7  10   (active: P1, P2, P3, P4)
After iteration 2:  A = 5  7  10  18  8  20  7  30   (active: P2, P4)
After iteration 3:  A = 5  7  10  18  8  20  7  48   (active: P4)

The sum, 48, ends up in A[8].

6 Group Work
1. Discuss the algorithm with your neighbor
2. Design the main loops
3. Discuss the complexity

7 Algorithm sum_EREW
for i = 1 to log n do
    forall P_j, where 1 ≤ j ≤ n/2, do in parallel
        if (2j mod 2^i) = 0 then
            A[2j] ← A[2j] + A[2j − 2^(i−1)]
        endif
    endforall
endfor
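The loop structure above can be checked with a sequential simulation. The sketch below is plain Python, not a real PRAM: each outer iteration stands for one synchronized parallel step, and the inner loop plays the roles of processors P_1 through P_{n/2}.

```python
import math

def sum_erew(A):
    # Sequential simulation of algorithm sum_EREW (a sketch, not a PRAM):
    # the outer loop is one synchronized parallel step per iteration,
    # the inner loop enumerates the processors P_1 .. P_{n/2}.
    A = list(A)                       # work on a copy; the sum ends in A[n]
    n = len(A)                        # assumed to be a power of 2
    for i in range(1, int(math.log2(n)) + 1):
        for j in range(1, n // 2 + 1):           # processor P_j
            if (2 * j) % (2 ** i) == 0:          # is P_j active this step?
                # 1-based A[2j] += A[2j - 2^(i-1)], shifted to 0-based
                A[2 * j - 1] += A[2 * j - 1 - 2 ** (i - 1)]
    return A[n - 1]

print(sum_erew([5, 2, 10, 1, 8, 12, 7, 3]))      # prints 48
```

Within one iteration the active cells A[2j] are multiples of 2^i apart while the cells read are not, so no order dependence arises and the sequential inner loop is a faithful simulation.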

8 Complexity
Run time: T(n) = O(log n)
Number of processors: P(n) = n/2
Cost: C(n) = O(n log n)
Is it cost optimal?

9 All partial sums on the EREW PRAM
Compute all partial sums of an array A[1..n]:
A[1], A[1]+A[2], A[1]+A[2]+A[3], ..., A[1]+A[2]+...+A[n]
At first glance this looks inherently sequential, because one must add up the first k elements before adding in element k+1. We will see that it can be parallelized by extending sum_EREW.

10 All partial sums (cont.)
- We noticed that in sum_EREW most processors are idle most of the time.
- By exploiting these idle processors, we can compute all partial sums in the same amount of time it takes to compute the single sum.

11 All partial sums (cont.)
- Compute all partial sums of A[1..n]
- We use n − 1 processors (P2, P3, ..., Pn)
- A[k] is replaced by the sum of all elements preceding and including A[k]
- In algorithm sum_EREW, only n/2^i processors are active at iteration i; in allsums_EREW, nearly all processors are in use.

12 Example: all partial sums on the EREW PRAM
Trace of algorithm allsums_EREW for n = 8:

Initial:            A = 5  2  10   1   8  12   7   3
After iteration 1:  A = 5  7  12  11   9  20  19  10   (active: P2, P3, ..., P8)
After iteration 2:  A = 5  7  17  18  21  31  28  30   (active: P3, P4, ..., P8)
After iteration 3:  A = 5  7  17  18  26  38  45  48   (active: P5, P6, P7, P8)

13 Group Work
1. Discuss the algorithm with your neighbor
2. Design the main loops
3. Discuss the complexity

14 Algorithm allsums_EREW
for i = 1 to log n do
    forall P_j, where 2^(i−1) + 1 ≤ j ≤ n, do in parallel
        A[j] ← A[j] + A[j − 2^(i−1)]
    endforall
endfor
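A sequential sketch of the algorithm above follows; it is an illustration, not a real PRAM. The `prev` snapshot models the synchronized read-compute-write cycle: in a real parallel step every processor reads values from before the step, so a naive in-place loop would be wrong.

```python
import math

def allsums_erew(A):
    # Sketch of algorithm allsums_EREW, simulated sequentially.  The
    # snapshot `prev` models the synchronized read-compute-write cycle:
    # every processor reads values from before the current parallel step.
    A = list(A)
    n = len(A)                        # assumed to be a power of 2
    for i in range(1, int(math.log2(n)) + 1):
        d = 2 ** (i - 1)
        prev = list(A)                # values as of the previous step
        for j in range(d + 1, n + 1):            # processor P_j, j > 2^(i-1)
            A[j - 1] = prev[j - 1] + prev[j - 1 - d]
    return A

print(allsums_erew([5, 2, 10, 1, 8, 12, 7, 3]))
# prints [5, 7, 17, 18, 26, 38, 45, 48]
```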

15 Complexity
Run time: T(n) = O(log n)
Number of processors: P(n) = n − 1
Cost: C(n) = O(n log n)

16 Matrix Multiplication
- Multiply two n × n matrices
- For clarity, we assume n is a power of 2
- We use the CREW model to allow concurrent reads
- The two matrices A[1..n, 1..n] and B[1..n, 1..n] are in shared memory
- We will use n^3 processors
- We will also show how to reduce the number of processors

17 Matrix Multiplication (cont.)
- The n^3 processors are arranged in a three-dimensional array; processor P_{i,j,k} is the one with index (i, j, k).
- We use the 3-dimensional array C[1..n, 1..n, 1..n] in shared memory as working space.
- The resulting matrix is stored in locations C[i, j, n], where 1 ≤ i, j ≤ n.

18 Two steps
1. All n^3 processors operate in parallel to compute n^3 multiplications. (For each of the n^2 cells in the output matrix, n products are computed.)
2. The n products for each cell are summed to produce its final value.

19 Matrix multiplication using n^3 processors
The two steps of the algorithm:
1. Each processor P_{i,j,k} computes the product A[i,k] · B[k,j] and stores it in C[i,j,k].
2. The idea of algorithm sum_EREW is applied along the k dimension, n^2 times in parallel, to compute C[i,j,n], where 1 ≤ i, j ≤ n.

20 Algorithm MatMult_CREW
/* step 1 */
forall P_{i,j,k}, where 1 ≤ i, j, k ≤ n, do in parallel
    C[i,j,k] ← A[i,k] × B[k,j]
endforall
/* step 2 */
for l = 1 to log n do
    forall P_{i,j,k}, where 1 ≤ i, j ≤ n and 1 ≤ k ≤ n/2, do in parallel
        if (2k mod 2^l) = 0 then
            C[i,j,2k] ← C[i,j,2k] + C[i,j,2k − 2^(l−1)]
        endif
    endforall
endfor
/* the output matrix is stored in locations C[i,j,n], where 1 ≤ i, j ≤ n */
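Both steps can be checked with a sequential sketch; plain Python lists stand in for the shared 3-D array, and step 2 reuses the sum_EREW pairwise-summation pattern along the k dimension.

```python
import math

def matmult_crew(A, B):
    # Sketch simulating MatMult_CREW: C[i][j][k] stands in for the shared
    # 3-D working array, and step 2 applies the sum_EREW pattern along k.
    n = len(A)                        # n x n matrices, n a power of 2
    # step 1: processor P_{i,j,k} computes one product A[i,k] * B[k,j]
    C = [[[A[i][k] * B[k][j] for k in range(n)]
          for j in range(n)] for i in range(n)]
    # step 2: pairwise summation along the k dimension
    for l in range(1, int(math.log2(n)) + 1):
        for i in range(n):
            for j in range(n):
                for k in range(1, n // 2 + 1):
                    if (2 * k) % (2 ** l) == 0:
                        C[i][j][2*k - 1] += C[i][j][2*k - 1 - 2**(l - 1)]
    # the result matrix lives in C[i, j, n] (index n-1 when 0-based)
    return [[C[i][j][n - 1] for j in range(n)] for i in range(n)]

print(matmult_crew([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]
```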

21 Complexity
Run time: T(n) = O(log n)
Number of processors: P(n) = n^3
Cost: C(n) = O(n^3 log n)
Is it cost optimal?

22 Example: multiplying two 2 × 2 matrices using algorithm MatMult_CREW
After step 1 (each cell computed by processor P_{i,j,k}):
k = 1:  C[1,1,1] ← A[1,1]B[1,1]   C[1,2,1] ← A[1,1]B[1,2]
        C[2,1,1] ← A[2,1]B[1,1]   C[2,2,1] ← A[2,1]B[1,2]
k = 2:  C[1,1,2] ← A[1,2]B[2,1]   C[1,2,2] ← A[1,2]B[2,2]
        C[2,1,2] ← A[2,2]B[2,1]   C[2,2,2] ← A[2,2]B[2,2]

23 Example (cont.): multiplying two 2 × 2 matrices using algorithm MatMult_CREW
After step 2 (computed by processors P_{1,1,2}, P_{1,2,2}, P_{2,1,2}, P_{2,2,2}):
k = 2:  C[1,1,2] ← C[1,1,2] + C[1,1,1]   C[1,2,2] ← C[1,2,2] + C[1,2,1]
        C[2,1,2] ← C[2,1,2] + C[2,1,1]   C[2,2,2] ← C[2,2,2] + C[2,2,1]

24 Matrix multiplication: reducing the number of processors to n^3 / log n
Processors are arranged in an n × n × n/(log n) three-dimensional array.
1. Each processor P_{i,j,k}, where 1 ≤ k ≤ n/log n, computes the sum of log n products. This step produces n^3 / log n partial sums.
2. The partial sums produced in step 1 are added to produce the resulting matrix, as discussed previously.
Complexity analysis:
Run time: T(n) = O(log n)
Number of processors: P(n) = n^3 / log n
Cost: C(n) = O(n^3)
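One way the grouping in step 1 might be simulated is sketched below. This is an illustration under assumptions not fixed by the slides: the k dimension is split into contiguous blocks of about log n products, and step 2 is shown as a plain Python sum where a real PRAM would combine the n/log n partial sums pairwise in O(log n) time.

```python
import math

def matmult_reduced(A, B):
    # Sketch of the n^3/log n-processor variant.  Each simulated processor
    # sums one block of about log n products sequentially (step 1); the
    # partial sums are then combined (step 2), shown here as a plain sum
    # for brevity where a real PRAM would sum them pairwise.
    n = len(A)
    g = max(1, int(math.log2(n)))     # block size: log n products
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            partial = [sum(A[i][t] * B[t][j]
                           for t in range(k0, min(k0 + g, n)))
                       for k0 in range(0, n, g)]   # one block per processor
            C[i][j] = sum(partial)    # step 2 (pairwise on a real PRAM)
    return C
```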

25 Searching
Given A = a1, a2, ..., ai, ..., an and x, determine whether x = ai for some i.
Sequential binary search: O(log n)
Simple idea: divide the list among the processors and let each processor conduct its own binary search.
EREW PRAM: O(log(n/p)) + O(log p) = O(log n)
CREW PRAM: O(log(n/p))

26 Parallel Binary Search
- Split A into p + 1 segments of almost equal length.
- Compare x with the p elements at the boundaries between successive segments.
- Either x equals one of the boundary elements, or the search is restricted to exactly one of the p + 1 segments.
- Repeat until x is found or the length of the remaining list is ≤ p.
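The steps above can be sketched as a sequential simulation. The probe placement (every `step`-th position) is one possible choice, not prescribed by the slides; each pass of the inner loop stands for one parallel comparison step by processors P_1 through P_p.

```python
def parallel_search(A, x, p):
    # Sketch of p-ary search over a sorted list A, simulated sequentially.
    # Each while-iteration is one parallel step: p simulated processors
    # probe the boundaries of p+1 nearly equal segments, shrinking the
    # candidate range to a single segment.
    lo, hi = 0, len(A) - 1
    while hi - lo + 1 > p:
        step = (hi - lo + 1) // (p + 1)
        probes = [lo + (k + 1) * step for k in range(p)]   # P_1 .. P_p
        next_lo, next_hi = lo, hi
        for q in probes:              # one parallel comparison step
            if A[q] == x:
                return q
            if A[q] < x:
                next_lo = q + 1       # x lies to the right of this boundary
            elif next_hi == hi:
                next_hi = q - 1       # first boundary that exceeds x
        lo, hi = next_lo, next_hi
    for q in range(lo, hi + 1):       # at most p candidates remain
        if A[q] == x:
            return q
    return -1
```

Each round keeps only one of the p + 1 segments, so the range shrinks by roughly a factor of p + 1 per step, giving the O(log(n/p)) bound stated on the previous slide (on a CREW PRAM, where all processors may read x concurrently).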

