Parallel Programming – OpenMP, Scan, Work Complexity, and Step Complexity David Monismith CS599 Based upon notes from GPU Gems 3, Chapter 39 - http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html.

Slides:

Advertisements

Similar presentations

List Ranking and Parallel Prefix

Advertisements

Parallel Algorithms.

Section 5: More Parallel Algorithms

MATH 224 – Discrete Mathematics

Introduction to Computer Science 2 Lecture 7: Extended binary trees

Lecture 3: Parallel Algorithm Design

Starting Parallel Algorithm Design David Monismith Based on notes from Introduction to Parallel Programming 2 nd Edition by Grama, Gupta, Karypis, and.

Chapter 6: Arrays Java Software Solutions for AP* Computer Science

Lecture 8 – Collective Pattern Collectives Pattern Parallel Computing CIS 410/510 Department of Computer and Information Science.

CS 240A: Parallel Prefix Algorithms or Tricks with Trees

CUDA Tricks Presented by Damodaran Ramani. Synopsis Scan Algorithm Applications Specialized Libraries CUDPP: CUDA Data Parallel Primitives Library Thrust:

Advanced Topics in Algorithms and Data Structures 1 Rooting a tree For doing any tree computation, we need to know the parent p ( v ) for each node v.

Parallel Sorting – Odd-Even Sort David Monismith CS 599 Notes based upon multiple sources provided in the footers of each slide.

Lecture 5: Linear Time Sorting Shang-Hua Teng. Sorting Input: Array A[1...n], of elements in arbitrary order; array size n Output: Array A[1...n] of the.

CMPT 225 Priority Queues and Heaps. Priority Queues Items in a priority queue have a priority The priority is usually numerical value Could be lowest.

Advanced Topics in Algorithms and Data Structures 1 Lecture 4 : Accelerated Cascading and Parallel List Ranking We will first discuss a technique called.

Accelerated Cascading Advanced Algorithms & Data Structures Lecture Theme 16 Prof. Dr. Th. Ottmann Summer Semester 2006.

Parallel Prefix Sum (Scan) GPU Graphics Gary J. Katz University of Pennsylvania CIS 665 Adapted from articles taken from GPU Gems III.

Lecture 5: Master Theorem and Linear Time Sorting

Advanced Topics in Algorithms and Data Structures Page 1 An overview of lecture 3 A simple parallel algorithm for computing parallel prefix. A parallel.

DAST 2005 Week 4 – Some Helpful Material Randomized Quick Sort & Lower bound & General remarks…

MergeSort Source: Gibbs & Tamassia. 2 MergeSort MergeSort is a divide and conquer method of sorting.

Upcrc.illinois.edu OpenMP Lab Introduction. Compiling for OpenMP Open project Properties dialog box Select OpenMP Support from C/C++ -> Language.

This module was created with support form NSF under grant # DUE Module developed by Martin Burtscher Module B1 and B2: Parallelization.

1 Chapter 1 Analysis Basics. 2 Chapter Outline What is analysis? What to count and consider Mathematical background Rates of growth Tournament method.

Introduction to CUDA Programming Scans Andreas Moshovos Winter 2009 Based on slides from: Wen Mei Hwu (UIUC) and David Kirk (NVIDIA) White Paper/Slides.

Analysis of Algorithms

Big Oh Algorithms are compared to each other by expressing their efficiency in big-oh notation Big O notation is used in Computer Science to describe the.

Analysis of Algorithms These slides are a modified version of the slides used by Prof. Eltabakh in his offering of CS2223 in D term 2013.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lecture 12: Application Lessons When the tires.

Parallel Algorithms Patrick Cozzi University of Pennsylvania CIS Spring 2012.

Parallel Algorithms Patrick Cozzi University of Pennsylvania CIS Fall 2013.

Data Structure Introduction.

Priority Queues and Heaps. October 2004John Edgar2  A queue should implement at least the first two of these operations:  insert – insert item at the.

Introduction to Analysis of Algorithms CS342 S2004.

© David Kirk/NVIDIA, Wen-mei W. Hwu, and John Stratton, ECE 498AL, University of Illinois, Urbana-Champaign 1 CUDA Lecture 7: Reductions and.

CS 193G Lecture 5: Parallel Patterns I. Getting out of the trenches So far, we’ve concerned ourselves with low-level details of kernel programming Mapping.

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, CS/EE 217 GPU Architecture and Parallel Programming Lecture 11 Parallel Computation.

CS 361 – Chapter 5 Priority Queue ADT Heap data structure –Properties –Internal representation –Insertion –Deletion.

Fall 2008Simple Parallel Algorithms1. Fall 2008Simple Parallel Algorithms2 Scalar Product of Two Vectors Let a = (a 1, a 2, …, a n ); b = (b 1, b 2, …,

Data Structures and Algorithms (AT70.02) Comp. Sc. and Inf. Mgmt. Asian Institute of Technology Instructor: Prof. Sumanta Guha Slide Sources: CLRS “Intro.

Recurrences (in color) It continues…. Recurrences When an algorithm calls itself recursively, its running time is described by a recurrence. When an algorithm.

Parallel Programming - Sorting David Monismith CS599 Notes are primarily based upon Introduction to Parallel Programming, Second Edition by Grama, Gupta,

GPGPU: Parallel Reduction and Scan Joseph Kider University of Pennsylvania CIS Fall 2011 Credit: Patrick Cozzi, Mark Harris Suresh Venkatensuramenan.

3/12/2013Computer Engg, IIT(BHU)1 PRAM ALGORITHMS-3.

B-Trees Katherine Gurdziel 252a-ba. Outline What are b-trees? How does the algorithm work? –Insertion –Deletion Complexity What are b-trees used for?

1/16 CALCULATING PREFIX SUMS Vladimir Jocovi ć 2012/0011.

A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency Lecture 3 Parallel Prefix, Pack, and Sorting Steve Wolfman, based on work by Dan.

1 CS4402 – Parallel Computing Lecture 7 - Simple Parallel Sorting. - Parallel Merge Sort.

1/16 CALCULATING PREFIX SUMS Vladimir Jocovi ć 2012/0011.

Arrays Department of Computer Science. C provides a derived data type known as ARRAYS that is used when large amounts of data has to be processed. “ an.

Parallel Prefix, Pack, and Sorting. Outline Done: –Simple ways to use parallelism for counting, summing, finding –Analysis of running time and implications.

Chapter 15 Running Time Analysis. Topics Orders of Magnitude and Big-Oh Notation Running Time Analysis of Algorithms –Counting Statements –Evaluating.

CS 179: GPU Programming Lecture 7. Week 3 Goals: – More involved GPU-accelerable algorithms Relevant hardware quirks – CUDA libraries.

Parallel primitives – Scan operation CDP – Written by Uri Verner 1 GPU Algorithm Design.

Priority Queues and Heaps. John Edgar  Define the ADT priority queue  Define the partially ordered property  Define a heap  Implement a heap using.

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, CS/EE 217 GPU Architecture and Parallel Programming Lecture 12 Parallel Computation.

David Kauchak cs062 Spring 2010

Lecture 3: Parallel Algorithm Design

Introduction to CUDA Programming

CS 179: GPU Programming Lecture 7.

Parallel Computation Patterns (Scan)

GPGPU: Parallel Reduction and Scan

Mattan Erez The University of Texas at Austin

© 2012 Elsevier, Inc. All rights reserved.

ECE408 Applied Parallel Programming Lecture 14 Parallel Computation Patterns – Parallel Prefix Sum (Scan) Part-2 © David Kirk/NVIDIA and Wen-mei W.

ECE 498AL Lecture 15: Reductions and Their Implementation

Parallel build blocks.

David Kauchak cs302 Spring 2012

CS203 Lecture 14.

Presentation transcript:

Parallel Programming – OpenMP, Scan, Work Complexity, and Step Complexity David Monismith CS599 Based upon notes from GPU Gems 3, Chapter 39 - http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html

Work Complexity In terms of data how many or what order of operations (Big-Oh) are required to perform a particular algorithm Example – summation Given an array of n elements, it takes O(n) operations to find the sum of those elements on a single core processor. We want parallel algorithms to be work efficient – to take the same amount of work (operations) as the serial version. This means that a parallelized version of sum should take O(n) operations (additions) to finish.

Reduction Work Complexity Recall – we have studied reduction In class we have seen that a sum reduction takes O(n) work to run to completion. The sum reduction takes the same order of work (operations) to complete as the serial version. Therefore, we say a sum reduction is work-efficient.

Prefix Sum (Scan) The prefix sum or scan operation is useful for computing cumulative sums. Example: Given the following array, compute the prefix sum. +---+---+---+---+---+---+ | 7 | 2 | 3 | 1 | 9 | 2 | 0 1 2 3 4 5

Uses for Scan Sorting Lexical Analysis String Comparison Polynomial Evaluation Stream Compaction Building Histograms and Data Structures

Exclusive Prefix Sum (Exclusive Scan) Example: Given the following array, compute the prefix sum. +---+---+---+---+---+---+ | 7 | 2 | 3 | 1 | 9 | 2 | 0 1 2 3 4 5 | 0 | 7 | 9 | 12| 13| 22|

Inclusive Prefix Sum (Inclusive Scan) Example: Given the following array, compute the prefix sum. +---+---+---+---+---+---+ | 7 | 2 | 3 | 1 | 9 | 2 | 0 1 2 3 4 5 | 7 | 9 | 12| 13| 22| 24|

Naïve Prefix Sum (Exclusive Scan) Algorithm int arr[N] = {7, 2, 3, 1, 9, 2}; int prefixSum[N]; int i, j; prefixSum[0] = 0; for(i = 1; i < N; i++) prefixSum[i]=prefixSum[i-1]+arr[j-1];

Naïve Parallel Inclusive Scan (Hillis and Steele 1986) +---+---+---+---+---+---+ | 7 | 2 | 3 | 1 | 9 | 2 | 0 1 2 3 4 5 | \ | \ | \ | \ | \ | V V V V V V | 7 | 9 | 5 | 4 | 10| 11| \ \ | \ | \ | | \ \| \| \| | \ \ \ \ | \ |\ |\ |\ | \ | \ | \ | \ | \| \| \| \| V V V V | 7 | 9 | 12| 13| 15| 15|

Parallel Scan Example Continued +---+---+---+---+---+---+ | 7 | 9 | 12| 13| 15| 15| 0 1 2 3 4 5 \ \ | | \ \ | | \ \ | | \ \ | | \ \ | | \ \ | | \ \ | | \ \ | | \ \ | | \ \| | \ \ | \ |\ | \ | \ | \| \| V V | 7 | 9 | 12| 13| 22| 24|

Naïve Parallel Scan const int N = 6; int log_of_N = log2(8); int arr[N] = {7, 2, 3, 1, 9, 2}; int scanSum[2][N]; int i, j; int in = 1, out = 0; memcpy(scanSum[out], arr, N*sizeof(int)); memset(scanSum[in], 0, int two_i = 0, two_i_1 = 0;

Naïve Parallel Scan for(i = 1; i <= log_of_N; i++) { two_i_1 = 1 << (i-1); two_i = 1 << i; out = 1 - out; in = 1 - out; for(j = 0; j < N; j++) if(j >= two_i_1) scanSum[out][j] = scanSum[in][j] + scanSum[in][j - two_i_1]; else scanSum[out][j] = scanSum[in][j]; }

Adding OpenMP Parallelism for(i = 1; i <= log_of_N; i++) { two_i_1 = 1 << (i-1); two_i = 1 << i; out = 1 - out; in = 1 - out; #pragma omp parallel for private(j) shared(scanSum, in, out) for(j = 0; j < N; j++) if(j >= two_i_1) scanSum[out][j] = scanSum[in][j] + scanSum[in][j - two_i_1]; else scanSum[out][j] = scanSum[in][j]; }

Naïve Parallel Scan Analysis Not work efficient Performs O(n lg n) additions Note that the serial version performs O(n) additions Requires O(lg n) steps This means we can complete a scan in O(lg n) time if we have O(n lg n) processors

Work Efficient Parallel Scan (Blelloch 1990) Goal – build a scan algorithm that avoids the extra O(lg n) work Solution – make use of balanced binary trees Build a balanced binary tree on the input data and sweep the tree to compute the prefix sum Consists of two parts Up Sweep Traverse tree from leaves to root computing partial sums during the traversal The root node (last element in the array) will hold the sum of all elements Down Sweep Traverse from root to leaves using the partial sums to compute the cumulative sums in place. For the exclusive prefix sum, we replace the root with zero

+---+---+---+---+---+---+---+---+ | 7 | 9 | 3 | 13| 9 | 11| 8 | 36| ^ /| / | / | / | / | | 7 | 9 | 3 | 13| 9 | 11| 8 | 23| ^ ^ /| /| / | / | / | / | | 7 | 9 | 3 | 4 | 9 | 11| 8 | 12| ^ ^ ^ ^ / | / | / | / | / | / | / | / | | 7 | 2 | 3 | 1 | 9 | 2 | 8 | 4 | 0 1 2 3 4 5 6 7 Blelloch Up Sweep

Blelloch Algorithm – Up Sweep for(i = 0; i <= log_of_N_1; i++) { two_i_p1 = 1 << (i+1); two_i = 1 << i; for(j = 0; j < N; j+=two_i_p1) scanSum[j + two_i_p1 - 1] = scanSum[j + two_i - 1] + scanSum[j + two_i_p1 - 1]; }

+---+---+---+---+---+---+---+---+ | 7 | 9 | 3 | 13| 9 | 11| 8 | 36| zero| V | 7 | 9 | 3 | 13| 9 | 11| 8 | 0 | \ / | \ / | / \ | zero / \ | V V | 7 | 9 | 3 | 0 | 9 | 11| 8 | 13| \ / | \ / | / \ | / \ | V V V V | 7 | 0 | 3 | 9 | 9 | 13| 8 | 24| \ /| \ /| \ /| \ /| / \| / \| / \| / \| V V V V V V V V | 0 | 7 | 9 | 12| 13| 22| 24| 32| 0 1 2 3 4 5 6 7 Blelloch Down Sweep

Blelloch Algorithm – Down Sweep scanSum[N-1] = 0; for(i = log_of_N_1; i >= 0; i--) { two_i_p1 = 1 << (i+1); two_i = 1 << i; for(j = 0; j < N; j+=two_i_p1) { long t = scanSum[j + two_i - 1]; scanSum[j + two_i - 1] = scanSum[j + two_i_p1 - 1]; scanSum[j + two_i_p1 - 1] = t + }

Analysis This algorithm does complete the summation in O(n) operations, giving the appearance of efficiency. Actually, the algorithm requires 2*(n-1) additions and n-1 swaps to run to completion. For large arrays, this algorithm will indeed out-perform the Hillis and Steele Algorithm.