CS 6068 Parallel Computing Fall 2015 Week 4 – Sept 21

CS 6068 Parallel Computing Fall 2015 Week 4 – Sept 21
Prof. Fred Annexstein @proffreda fred.annexstein@uc.edu
Office Hours: 12:30 MW or by appointment Tel: 513-556-1807
Meeting: Mondays 6:00-8:50 PM, Baldwin

Week 4: Plan
• Basic Parallel Operations: Scatter, Gather, Reduce, Prefix-Scan, Histogram
• More Common Communication Patterns: Compact/Filter, Segmented Scan with applications, Sparse Matrix Products (CSR format)
• Homework

Locality
Maintaining “locality of reference” is critical for the performance of both CPUs and GPUs. The GPU has much higher sequential memory access performance than the CPU, so to get peak memory performance, computation must be structured around sequential memory accesses. This is good news for accelerating linear algebra operations such as adding two large vectors, where data is naturally accessed sequentially. However, algorithms that do a lot of pointer chasing will suffer from poor random-access performance.

Scatter and Gather Communication Operations
Scatter is a data communication operation that writes or permutes data based on an array of location addresses; Quicksort has a scatter operation underlying its partition step. Gather is an operation in which each thread does a computation using data read from one or more arbitrary locations.
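A minimal sequential sketch of the two patterns (the helper names and signatures are illustrative, not from the slides):

def scatter(src, addr):
    # out[addr[i]] = src[i]; addr is assumed to be a permutation of 0..n-1
    out = [None] * len(src)
    for i, a in enumerate(addr):
        out[a] = src[i]
    return out

def gather(src, idx):
    # each output position reads from an arbitrary source location
    return [src[j] for j in idx]

print(scatter([10, 20, 30], [2, 0, 1]))  # [20, 30, 10]
print(gather([10, 20, 30], [2, 2, 0]))   # [30, 30, 10]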

Stencil Communication Operation
A fixed-size stencil window (sometimes called a filter) is centered on every array element. GPU applications often apply a small 2-D stencil to a large 2-D image array. A common class of applications is convolution filtering: blurring, sharpening, edge detection, embossing, etc. Digital audio and electronic filters work with convolution as well, but in 1-D.

Udacity-HW#2: Gaussian Blurring
Gaussian blurring is defined as exponentially weighted averaging, typically using a square stencil at every pixel. In Udacity-HW#2 you will write a kernel that applies a Gaussian filter, providing a weighted average of each color channel, with one kernel call per channel (a sequential reference sketch follows below).
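For reference, a sequential Python sketch of the weighted-average stencil for one channel; in the actual HW#2 CUDA kernel the two outer loops become the thread grid. The function name and the clamp-to-edge border policy are assumptions here:

def blur_channel(channel, stencil):
    # channel: 2-D list of intensities; stencil: square (2k+1) x (2k+1)
    # weight matrix (for a Gaussian blur, the weights sum to 1)
    h, w = len(channel), len(channel[0])
    k = len(stencil) // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dr in range(-k, k + 1):
                for dc in range(-k, k + 1):
                    rr = min(max(r + dr, 0), h - 1)  # clamp at image borders
                    cc = min(max(c + dc, 0), w - 1)
                    acc += stencil[dr + k][dc + k] * channel[rr][cc]
            out[r][c] = acc
    return out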

Compact
The compact operation applies a predicate filter to each item in a sequence and returns the sequence of those items for which predicate(item) is true. For example, to compute some small primes:

>>> def pred(x): return (x % 2 != 0 and x % 3 != 0)
>>> compact(pred, range(2, 24))
[5, 7, 11, 13, 17, 19, 23]
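As a sequential reference point (the compact function used above is not defined in the transcript), compact is just a filter, here a one-line sketch:

def compact(pred, seq):
    # keep only the items for which the predicate holds
    return [item for item in seq if pred(item)]

print(compact(lambda x: x % 2 != 0 and x % 3 != 0, range(2, 24)))
# [5, 7, 11, 13, 17, 19, 23]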

Implementing Compact using Scatter
We assume that the predicate filter can be applied to each element in the list in parallel in constant (fast) time, resulting in a bit vector of Boolean values:

input: 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
flags: 0 0 0 1 0 1 0 0 0  1  0  1  0  0  0  1  0  1

Let's look at the running sum of these Boolean values:

scan:  0 0 0 1 1 2 2 2 2  3  3  4  4  4  4  5  5  6

We can use these values as scatter addresses.

Scatter Operations
Recall that scatter is a data communication operation that moves or permutes data based on an array of location addresses. The index of each element in the filtered (compacted) list is given by the running sum, or more precisely, by the exclusive sum-scan. For example, 17 lands at index 4 of the compacted list: its running-sum value is 5, which is 1 greater than that of its left neighbor (4), so 17 is a survivor and its exclusive-scan value of 4 is its scatter address.
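Putting the last three slides together, a sketch of compact built from a map (flags), an exclusive scan, and a scatter; the sequential loops stand in for the parallel primitives, and the function name is illustrative:

def compact_by_scatter(pred, seq):
    seq = list(seq)                              # assumes a nonempty input
    flags = [1 if pred(x) else 0 for x in seq]   # parallel map
    addr = [0] * len(flags)                      # exclusive scan of the flags
    for i in range(1, len(flags)):
        addr[i] = addr[i - 1] + flags[i - 1]
    out = [None] * (addr[-1] + flags[-1])        # total count of survivors
    for i, x in enumerate(seq):                  # scatter the survivors
        if flags[i]:
            out[addr[i]] = x
    return out

print(compact_by_scatter(lambda x: x % 2 != 0 and x % 3 != 0, range(2, 24)))
# [5, 7, 11, 13, 17, 19, 23]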

Questions
What is the parallel complexity of doing the running sum?
What are the work and step complexities of the general compact operation on arrays?

The Scan Operation
One of the most important data-parallel primitives. Inclusive and exclusive versions of the running sum of 1 2 3 4 5 6 7 8: the inclusive scan is 1 3 6 10 15 21 28 36, and the exclusive scan is 0 1 3 6 10 15 21 28.
Coding the scan: iterative and recursive versions.
PRAM algorithms of Hillis-Steele and Blelloch.

Coding the Scan Operation: Recursive

def scan(add, x):
    # inclusive scan: recursively scan each half, then add the
    # last element of the first half into every element of the second
    if len(x) == 1:
        return x
    mid = len(x) // 2
    firsthalf = scan(add, x[:mid])
    secondhalf = scan(add, x[mid:])
    s = [add(firsthalf[-1], y) for y in secondhalf]
    return firsthalf + s

Hillis and Steele PRAM Version
A PRAM is like a CUDA block with unlimited threads.
For i = 0 to log n - 1 steps: if thread id x >= 2^i, then the thread adds to its currently held value the value held by thread x - 2^i.
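A sequential simulation of the Hillis-Steele inclusive scan, where each list comprehension plays the role of one parallel step (the function name is illustrative):

def hillis_steele_scan(add, x):
    # on each pass, thread j (for j >= 2^i) adds the value held
    # by thread j - 2^i to its own value
    x = list(x)
    n, i = len(x), 1
    while i < n:
        x = [x[j] if j < i else add(x[j - i], x[j]) for j in range(n)]
        i *= 2
    return x

print(hillis_steele_scan(lambda a, b: a + b, [1, 2, 3, 4, 5, 6, 7, 8]))
# [1, 3, 6, 10, 15, 21, 28, 36]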

Complexity of Hillis-Steele
Each of the log n steps performs up to n - 1 additions in parallel: Work = O(n log n), Steps = log n.

Blelloch PRAM version
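The body of this slide did not survive the transcript. As a sketch: Blelloch's exclusive scan runs an up-sweep (reduce) phase followed by a down-sweep phase. A sequential simulation, assuming the input length is a power of two (the function name is illustrative):

def blelloch_scan(add, identity, x):
    # exclusive scan; assumes len(x) is a power of two
    x = list(x)
    n = len(x)
    d = 1
    while d < n:                      # up-sweep: build a reduction tree
        for i in range(2 * d - 1, n, 2 * d):
            x[i] = add(x[i - d], x[i])
        d *= 2
    x[n - 1] = identity               # clear the root
    d = n // 2
    while d >= 1:                     # down-sweep: push partial sums back down
        for i in range(2 * d - 1, n, 2 * d):
            t = x[i - d]
            x[i - d] = x[i]
            x[i] = add(t, x[i])
        d //= 2
    return x

print(blelloch_scan(lambda a, b: a + b, 0, [1, 2, 3, 4, 5, 6, 7, 8]))
# [0, 1, 3, 6, 10, 15, 21, 28]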

Comparing PRAM Scan Algorithms
Hillis-Steele: Work = O(n log n), Steps = log n
Blelloch: Work = O(n), Steps = 2 log n
Work-efficient versus step-efficient – when to choose which?

Histograms / Data Binning
Sequential algorithm: loop through all data items, compute the appropriate bin for each item, and increment that bin's value by 1.
Parallel algorithm: must deal with race conditions. Use atomics? Use reduce or scan? Use local bins for each thread, as sketched below.
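A sketch of the local-bins strategy in Python, with the thread loop standing in for parallel execution; bin_of is an assumed binning function passed by the caller:

def histogram(data, nbins, bin_of, nthreads=4):
    # each "thread" builds a private histogram over its slice of the data,
    # so no two threads ever touch the same counter (no race conditions)
    local_hists = []
    for t in range(nthreads):                 # conceptually parallel
        h = [0] * nbins
        for item in data[t::nthreads]:
            h[bin_of(item)] += 1
        local_hists.append(h)
    # merge the local histograms: one reduction per bin
    return [sum(h[b] for h in local_hists) for b in range(nbins)]

print(histogram(list(range(16)), 4, lambda v: v // 4))  # [4, 4, 4, 4]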

HW#3: Tone Mapping
Uses reduce, scan, and histogram to map pixel intensities via tone mapping: start with an array of brightness values, compute the minimum and maximum (reduce), then compute a histogram based on these values.

Review: Complexity Analysis of the Recursive Scan

def prefix(add, x):
    if len(x) == 1:
        return x
    n = len(x)
    firsthalf = prefix(add, x[:n // 2])
    secondhalf = prefix(add, x[n // 2:])
    # add the total of the first half into every element of the second half
    secondhalf = [add(firsthalf[-1], y) for y in secondhalf]
    return firsthalf + secondhalf

Is this work efficient?? (The combine step does n/2 additions, so W(n) = 2W(n/2) + n/2 = O(n log n): not work-efficient.)

Recursive Odd/Evens
Suppose we extracted the odd- and even-indexed elements from an array and ran scan in parallel on both. Are we nearly done?? Improved complexity?

def extract_even_indexed(x):
    # elements at indices 0, 2, 4, ... (assumes len(x) is even)
    result = [0] * (len(x) // 2)
    for i in range(0, len(x) - 1, 2):
        result[i // 2] = x[i]
    return result

A Work-efficient Solution

def extract_odd_indexed(x):
    # elements at indices 1, 3, 5, ...
    return x[1::2]

def interleave(a, b):
    # a[0], b[0], a[1], b[1], ...
    out = []
    for p, q in zip(a, b):
        out += [p, q]
    return out

def prefix(f, x):
    # inclusive scan; assumes len(x) is a power of two
    if len(x) == 1:
        return x
    # begin parallel
    e = extract_even_indexed(x)
    o = extract_odd_indexed(x)
    # end parallel
    s = list(map(f, e, o))          # pairwise sums: a half-size scan input
    r1 = [0] + prefix(f, s)         # recurse on the half-size problem
    r2 = list(map(f, e + [0], r1))  # scan values at the even positions
    return interleave(r2[:-1], r1[1:])
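A quick sanity check, plus the payoff: the combining work at each level is O(n) on a problem of half the size, so W(n) = W(n/2) + O(n) = O(n).

import operator
print(prefix(operator.add, [1, 2, 3, 4, 5, 6, 7, 8]))
# [1, 3, 6, 10, 15, 21, 28, 36]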

Segmented Scan
Flags indicate segment boundaries within the array, and the scan operation is applied to each segment independently. Work an example using both inclusive and exclusive scans (an inclusive sketch follows below). Next: application to sparse matrices.
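A sequential sketch of an inclusive segmented scan, using a flags array where 1 marks the first element of a segment (the representation is one common choice, not the only one):

def segmented_scan(add, data, flags):
    # inclusive segmented scan: the running sum restarts
    # at every position where flags[i] == 1
    out, running = [], None
    for v, f in zip(data, flags):
        running = v if (f == 1 or running is None) else add(running, v)
        out.append(running)
    return out

print(segmented_scan(lambda a, b: a + b,
                     [1, 2, 3, 4, 5, 6, 7, 8],
                     [1, 0, 0, 1, 0, 1, 0, 0]))
# [1, 3, 6, 4, 9, 6, 13, 21]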

Sparse Matrix – Dense Vector Products
Compressed Sparse Row (CSR) format:
Value = the nonzero data, in one array
Column = the column index of each nonzero
Rowptr = pointers to the location of the first nonzero in each row
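A small example and a sequential sketch of the CSR sparse-matrix–dense-vector product; a parallel version assigns rows (i.e., segments) to threads, which is where segmented scan comes in:

# the 4x4 matrix [[1, 0, 0, 2],
#                 [0, 3, 0, 0],
#                 [0, 0, 0, 0],
#                 [4, 0, 5, 0]] in CSR form:
value  = [1, 2, 3, 4, 5]
column = [0, 3, 1, 0, 2]
rowptr = [0, 2, 3, 3, 5]

def spmv_csr(value, column, rowptr, x):
    # y[r] = sum of value[j] * x[column[j]] over row r's nonzeros
    return [sum(value[j] * x[column[j]]
                for j in range(rowptr[r], rowptr[r + 1]))
            for r in range(len(rowptr) - 1)]

print(spmv_csr(value, column, rowptr, [1, 1, 1, 1]))  # [3, 3, 0, 9]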

Sorting
Sorting is studied for its classic patterns of data communication.
First look at favorite sequential algorithms and discuss their parallelization:
BubbleSort, QuickSort, MergeSort
Three-phase strategy

Sorting Networks
Oblivious algorithms are easily mapped to GPUs.
Bitonic sorting: Ken Batcher (Kent State; Goodyear, Akron)
Bitonic sequence: defined as a sequence of numbers with one direction change

Recursive Bitonic Sorting

def bitonic_sort(up, x):
    if len(x) <= 1:
        return x
    # sort the two halves in opposite directions to form a bitonic sequence
    first = bitonic_sort(True, x[:len(x) // 2])
    second = bitonic_sort(False, x[len(x) // 2:])
    return bitonic_merge(up, first + second)

Bitonic Merge

def bitonic_merge(up, x):
    # assume the input x is bitonic; a sorted list is returned
    if len(x) == 1:
        return x
    bitonic_compare(up, x)
    first = bitonic_merge(up, x[:len(x) // 2])
    second = bitonic_merge(up, x[len(x) // 2:])
    return first + second

def bitonic_compare(up, x):
    # compare-exchange corresponding items of the two halves, in place
    dist = len(x) // 2
    for i in range(dist):
        if (x[i] > x[i + dist]) == up:
            x[i], x[i + dist] = x[i + dist], x[i]
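A quick check of the two routines together (the input length must be a power of two):

print(bitonic_sort(True, [10, 3, 7, 1, 8, 2, 9, 5]))
# [1, 2, 3, 5, 7, 8, 9, 10]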

Proof that Bitonic Sorting is Correct
Assume the input is a 0/1 sequence (applying Knuth's 0/1 sorting principle) and that the length n of the sequence is a power of 2.
If the sequence is of length 1, do nothing. Otherwise, proceed as follows:
– Split the bitonic 0/1 sequence of length n into the first half and the second half, e.g. 0000…01111…10000…0
– Perform n/2 compare-interchange operations in parallel, of the form (i, i + n/2) for 0 ≤ i < n/2 (i.e., between corresponding items of the two halves)
– Claim: either the first half is all 0's and the second half is bitonic, or the first half is bitonic and the second half is all 1's
– Therefore, it is sufficient to apply the same construction recursively on the two halves. Done!

Radix Sort – High-Performance Sort
Relies on the numerical representation. Example: binary numbers. Start with the LSB and move toward the MSB; each stage splits the numbers into 2 sets (a sketch follows below).
Work complexity. Step complexity. Optimizations.
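A sketch of the binary LSB-first radix sort, where each pass is a stable split (two compacts, one per bit value), so scan-based compaction is the parallel workhorse:

def radix_sort(xs, nbits):
    # each pass moves the keys whose bit b is 0 ahead of those whose bit is 1,
    # preserving relative order (a stable split)
    for b in range(nbits):
        zeros = [x for x in xs if not (x >> b) & 1]   # compact on bit == 0
        ones  = [x for x in xs if (x >> b) & 1]       # compact on bit == 1
        xs = zeros + ones
    return xs

print(radix_sort([5, 3, 7, 0, 6, 2, 4, 1], 3))  # [0, 1, 2, 3, 4, 5, 6, 7]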

HW#4: Remove Red-Eye Effect
Stencil – normalized cross-correlation (NCC) scoring (done!)
Focus of HW4: sort all pixels using their NCC scores
Map operation: remove red from the highest-scoring pixels (done!)