PRAM and Parallel Computing Jie Liu, Ph.D. Professor Computer Science Division Western Oregon University Monmouth, Oregon, USA liuj@wou.edu
Outline
- The fastest computers
- The PRAM model
- The O(1) algorithm that finds the max
- An elegant parallel merge sorting algorithm
- A practical parallel sorting algorithm we developed
- Amdahl’s Law and Gustafson-Barsis’ Law
- Technologies we should pay attention to
- Q&A session
World's top 10 fastest computers
Multi-Core Programming Sequential Parallel
PRAM
More About PRAM
Each PRAM processor can either:
- Perform the prescribed operation (the same for all processors),
- Carry out an I/O operation,
- Idle, or
- Activate another processor
So it takes n active processors one step to activate another n processors, after which we have 2n active processors.
Now two questions:
- What happens if two processors write to the same memory location?
- How many steps does it take to activate p processors?
Handling Writing Conflicts in PRAM
- EREW (Exclusive Read Exclusive Write)
- CREW (Concurrent Read Exclusive Write)
- CRCW (Concurrent Read Concurrent Write)
  - Common: all processors writing to the same location must write the same value
  - Arbitrary: one of the competing values is picked and written
  - Priority: the processor with the highest priority is the winner
Which one of the above is a multi-core computer?
Activating n Processors
What is the complexity? O(log n), because the number of active processors doubles at every step.
The activation pattern forms a binomial tree.
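The doubling argument can be checked with a small sketch. It assumes exactly one processor is active at the start and that every active processor activates one new processor per step; the function name `activation_steps` is made up for illustration.

```python
def activation_steps(p: int) -> int:
    """Count the steps needed to grow from 1 active processor to at least p.

    Assumption: every active processor activates exactly one other
    processor per step, so the active count doubles each step.
    """
    active, steps = 1, 0
    while active < p:
        active *= 2  # each active processor wakes up one more
        steps += 1
    return steps
```

For example, `activation_steps(8)` takes 3 steps (1, 2, 4, 8), which matches the logarithmic growth on the slide.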
Finding Max in Constant Time O(1)
Input: an array of n integers arrA[0..n-1]
Output: the largest number in arrA[0..n-1]
Global variables: arrB[0..n-1], i, j
Assume the computer is a CRCW/Common PRAM with n^2 processors

FindingMax(arrA[0..n-1])
{
    for all i where 0 <= i <= n-1
        arrB[i] = 1
    for all i, j where 0 <= i, j <= n-1
        if (arrA[i] < arrA[j])
            arrB[i] = 0
    for all i where 0 <= i <= n-1
        if (arrB[i] == 1)
            return arrA[i]
}
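Finding Max – how does it work
    for all i where 0 <= i <= n-1
        arrB[i] = 1
    for all i, j where 0 <= i, j <= n-1
        if (arrA[i] < arrA[j])
            arrB[i] = 0
    for all i where 0 <= i <= n-1
        if (arrB[i] == 1)
            return arrA[i]
After the first loop, every arrB[i] is 1.
The second loop writes a 0 to arrB[i] if arrA[i] is smaller than any other element of arrA. Several processors may write to the same arrB[i] in the same step; this is allowed on a CRCW/Common machine because they all write the same value, 0.
After the second loop, arrB[i] is still 1 only where arrA[i] is a maximum, so the third loop returns the largest value.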
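To see the three steps at work, here is a sequential simulation in Python. It is only a sketch: each loop iteration stands in for one PRAM processor that would run in the same time step, so the simulation takes O(n^2) time instead of O(1). The name `find_max_crcw` is made up for illustration.

```python
from itertools import product

def find_max_crcw(arr_a):
    """Sequentially simulate the three parallel steps of the CRCW/Common max algorithm."""
    n = len(arr_a)
    # Step 1: n processors each set one flag to 1.
    arr_b = [1] * n
    # Step 2: one simulated processor per (i, j) pair.
    for i, j in product(range(n), repeat=2):
        if arr_a[i] < arr_a[j]:
            # Every writer to arr_b[i] writes the same value, 0,
            # which is what the Common rule requires.
            arr_b[i] = 0
    # Step 3: any element whose flag survived is a maximum.
    for i in range(n):
        if arr_b[i] == 1:
            return arr_a[i]
    raise ValueError("empty input")
```

If the maximum occurs more than once, several flags survive and they all hold the same value, so it does not matter which one is returned.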
Finding Max questions
- How do we do it sequentially, and what is the complexity then?
- How do we do it in parallel, and what is the complexity? How many processors are needed?
- On the PRAM, what is the minimum amount of time required to run the algorithm, assuming only one processor is activated at the start?
- What is the cost? The cost of a parallel algorithm is defined as the number of processors × execution time.
For our algorithm, the cost is O(n^2), or even O(n^2 log n), while the sequential one is O(n), so ours is NOT cost optimal.
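The two cost figures can be worked out from the definition. The derivation below assumes that the extra log factor comes from the time to activate all n^2 processors starting from a single active one; the slide does not state this explicitly.

```latex
\begin{align*}
\text{cost} &= (\text{number of processors}) \times (\text{execution time}) \\
\text{ignoring activation:}\quad & n^2 \times O(1) = O(n^2) \\
\text{counting activation:}\quad & n^2 \times O(\log n^2) = n^2 \times O(2 \log n) = O(n^2 \log n)
\end{align*}
```

Either way the cost exceeds the O(n) work of the sequential scan.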
Merging Two Sorted Arrays
The problem: n is an even number. An array of size n stores two sorted sequences of integers, each of size n/2. We need to merge the two sorted segments in O(log n) steps.
Merging Two Sorted Arrays (2)
- The sequential approach uses two yardsticks and has no concurrency to exploit, which calls for a new algorithm.
- Key idea: if we know there are k elements smaller than A[i], we can copy A[i] to position k + 1 in one step (assuming the array is 1-based).
- If i <= n/2, then there are i - 1 elements in the first half smaller than A[i].
- Now how can we find the number of elements in the second half of A that are also smaller than A[i]? Binary search, an O(log n) algorithm!
- The sequential algorithm identifies a spot and finds the element to occupy the spot. The parallel algorithm identifies an element and finds the spot it needs to occupy.
Merging Two Sorted Arrays In Parallel
// A[1] to A[n/2] and A[n/2 + 1] to A[n] are two sorted sections
MergeArray(A[1..n])
{
    int x, low, high, index
    for all i where 1 <= i <= n
    // The lower half searches the upper half; the upper half searches the lower half
    {
        low = 1                 // assume A[i] is in the upper half, so search the lower half
        high = n/2
        if i <= (n/2)           // A[i] is in the lower half, so search the upper half
        {
            low = (n/2) + 1
            high = n
        }
        x = A[i]
        // perform binary search
        repeat
        {
            index = floor((low + high) / 2)
            if x < A[index]
                high = index - 1
            else
                low = index + 1
        } until low > high
        A[high + i - n/2] = x
    }
}
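The pseudocode can be tried out with a sequential Python simulation. This is a sketch under stated assumptions: the array is 0-based (so the target index gains a +1 compared with the slide), all values are distinct, and the result goes to a separate output list so that simulated writes never overwrite values that other "processors" still need to read. The name `merge_halves` is made up for illustration.

```python
def merge_halves(a):
    """Simulate the parallel merge: each element finds its own final position.

    Assumes len(a) is even, both halves are sorted, and all values are distinct.
    """
    n = len(a)
    half = n // 2
    out = [None] * n
    for i in range(n):  # each iteration stands in for one processor
        x = a[i]
        # Lower-half elements search the upper half, and vice versa.
        if i < half:
            low, high = half, n - 1
        else:
            low, high = 0, half - 1
        # Binary search: afterwards, high is the last index in the
        # searched half whose value is not greater than x.
        while low <= high:
            index = (low + high) // 2
            if x < a[index]:
                high = index - 1
            else:
                low = index + 1
        # 0-based equivalent of A[high + i - n/2] = x
        out[high + i - half + 1] = x
    return out
```

For example, `merge_halves([3, 9, 17, 42, 5, 22, 26, 51])` returns `[3, 5, 9, 17, 22, 26, 42, 51]`. Each element needs only its own binary search, so with n processors the whole merge takes O(log n) steps.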
A practical parallel sorting algorithm
Sorting on a real shared-memory parallel computer has its own characteristics:
- The entire array is accessible
- The number of processors is much smaller than the number of elements
- Generally there must be some partitioning of the data
- The distance data moves is irrelevant to the cost
We developed a practical algorithm that also uses this “move to” idea.
My students called our algorithm Jie-Sort; I call it J-Sort.
J-Sort through an example
The array: 5 17 42 3 9 22 51 26 15 32 19 99
[Figure: rows labeled Marking S1, Prefix Sum S1, Marking S2, Prefix Sum S2, and Partitioned Array]
What if you have only 4 processors?
When the number of processors is fixed
- First, divide the array into p = 4 chunks
- Find the sizes of S1 and S2 for each chunk
- Perform a prefix sum on the size arrays
- Copy the elements
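The four steps can be sketched in Python as a sequential simulation. This is not the J-Sort implementation itself; it only illustrates the chunk, count, prefix-sum, and copy pattern. It assumes that S1 holds the elements smaller than some pivot and S2 holds the rest; the pivot value, the function name `partition_by_chunks`, and the chunk boundaries are all made up for illustration.

```python
from itertools import accumulate

def partition_by_chunks(arr, pivot, p=4):
    """Partition arr around pivot using per-chunk counts and prefix sums.

    Assumption: S1 = elements < pivot, S2 = everything else.
    Returns the partitioned list and the size of S1.
    """
    n = len(arr)
    # Step 1: split into p nearly equal chunks, one per simulated processor.
    bounds = [n * k // p for k in range(p + 1)]
    chunks = [arr[bounds[k]:bounds[k + 1]] for k in range(p)]
    # Step 2: each processor counts its own S1 and S2 elements.
    s1_sizes = [sum(1 for x in c if x < pivot) for c in chunks]
    s2_sizes = [len(c) - s for c, s in zip(chunks, s1_sizes)]
    # Step 3: exclusive prefix sums give each processor its write offsets.
    total_s1 = sum(s1_sizes)
    s1_start = [0] + list(accumulate(s1_sizes))[:-1]
    s2_start = [total_s1 + off for off in [0] + list(accumulate(s2_sizes))[:-1]]
    # Step 4: each processor copies its elements to their final region.
    out = [None] * n
    for k, chunk in enumerate(chunks):
        i1, i2 = s1_start[k], s2_start[k]
        for x in chunk:
            if x < pivot:
                out[i1] = x
                i1 += 1
            else:
                out[i2] = x
                i2 += 1
    return out, total_s1
```

Because the offsets come from prefix sums, no two simulated processors ever write to the same slot, so the copy step needs no locking on a shared-memory machine.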
J-Sort is cost optimal
Our proof shows that the cost is O(n log n), which is the cost of merge sort and the lower bound for comparison-based sorting algorithms, so J-Sort is cost optimal!
J-Sort Perform
Amdahl’s Law and Gustafson-Barsis’ Law
Amdahl’s Law: Let s be the fraction of operations in a computation that must be performed sequentially, where 0 ≤ s ≤ 1. The maximum speedup achievable by a parallel computer with p processors performing the computation is
$$\text{speedup} \le \frac{1}{s + (1 - s)/p}$$
Gustafson-Barsis’ Law: Given a parallel program solving a problem using p processors, let s denote the fraction of the total execution time spent in sequential code. The maximum speedup achievable by this program is
$$\text{speedup} \le p + (1 - p)s$$
These two laws appear to contradict each other. How can we explain this?
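Plugging numbers into the two standard formulas makes the gap concrete. The sketch below uses the textbook forms of both laws, 1 / (s + (1 - s)/p) for Amdahl and p + (1 - p)s for Gustafson-Barsis; the function names are made up for illustration.

```python
def amdahl_speedup(s: float, p: int) -> float:
    """Upper bound on speedup when a fraction s of the operations is sequential."""
    return 1.0 / (s + (1.0 - s) / p)

def gustafson_barsis_speedup(s: float, p: int) -> float:
    """Scaled speedup when a fraction s of the parallel run time is sequential."""
    return p + (1 - p) * s
```

With s = 0.1 and p = 10, Amdahl gives about 5.26 while Gustafson-Barsis gives 9.1. Note that s is not the same quantity in the two laws: in Amdahl's Law it is measured on the sequential computation, in Gustafson-Barsis' Law on the parallel execution.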
10 technologies we should pay attention to
- 5G + Cloud + “As a service” model
- Big Data/BI/ML + deep learning
- DBMS for analytics
- Autonomous vehicles
- Blockchain
- Artificial Intelligence
- Virtual & Augmented Reality
- Internet of Things
- Parallel Processing
- Mobile software development: Android surpassed Windows and is the most popular OS