Performance Primitives for Massive Multithreading
P J Narayanan
Centre for Visual Information Technology, IIIT Hyderabad

Lessons from GPU Computing
Massively multithreaded: several thousands to millions of threads are needed for good performance
Good performance depends on many factors:
– Resource utilization: shared memory, registers
– Memory access: locality, arithmetic intensity
The optimum point may change with the architecture
– Retuning is infeasible for every developer
Solution: use standard libraries or primitives
– Implemented well, keeping the trade-offs in mind
– Used by everyone: build your algorithms using them

What Are the Primitives?
Standard data-parallel primitives:
– scan, reduce
– sort, split
But also:
– segmented split
– scatter, gather, data-copy
– transpose
Could have domain-specific primitives:
– Graph theory, numerical algorithms
– Computer vision, image processing
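These primitives need not be written from scratch. A minimal, hedged example of calling two of them through Thrust (the talk itself uses CUDPP later; the function name here is mine):

#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/scan.h>
#include <thrust/sort.h>

void primitivesDemo()
{
    thrust::device_vector<int> v(1 << 20, 1);                  // a million ones on the GPU
    thrust::inclusive_scan(v.begin(), v.end(), v.begin());     // scan: v[i] becomes i + 1
    thrust::sort(v.begin(), v.end(), thrust::greater<int>());  // sort, descending
}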

Computing Using Primitives
A typical program will/should have 75-80% of the work done through such primitives
The application developer writes glue kernels to connect and clean up the components
– The code for this is simple and perhaps unchanging
– Even inefficient implementations are non-critical
Example: a program with running time T uses primitives for 75% of its operations, and a new architecture doubles their performance
– New running time (with no speedup for the non-primitive part): 0.5 * (0.75 T) + 0.25 T = 0.625 T, instead of the ideal 0.5 T
– 0.6 T if 80% was using primitives, and 0.55 T if 90%
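The same accounting, restated as a small Amdahl-style formula (f is the fraction of work in primitives, s the speedup the new architecture gives them; the symbols are mine):

\[
  T_{\mathrm{new}} = \left( \frac{f}{s} + (1 - f) \right) T,
  \qquad
  \frac{0.75}{2}\,T + 0.25\,T = 0.625\,T .
\]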

Primitive vs Library
Both are motivated by similar thinking: reuse!
A primitive is typically an algorithmic step, which finds diverse use
– Used as a low-level step of an algorithm
A library function provides an end-to-end functionality
– Used to achieve a high-level functionality
– Could be a "primitive" at a sufficiently high level!
Use a library if available: it avoids even the development needed when using primitives!

K-Means Clustering
An iteration (with N vectors of d dimensions and K clusters):
– Each vector finds its distance to each cluster center: O(NKd) operations
– Each vector attaches itself to the closest center and takes its label: O(NK) operations to find the minimum distance
– Compute the mean of each cluster, i.e., of the vectors with the same label: O(Nd) operations to find the K means
GPU implementation of clustering of 128-dimensional SIFT vectors, a frequent problem in Computer Vision (a sequential reference of one iteration follows)
[Figure: the iteration cycle — Compute Distances → Assign New Labels → Recompute Cluster Means]
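A minimal sequential reference of one K-Means iteration, just to make the operation counts above concrete (this is not the GPU code; names are mine):

#include <vector>
#include <cfloat>

void kmeansIteration(const std::vector<float>& x,   // N x d input vectors, row major
                     std::vector<float>& c,         // K x d centers, row major
                     std::vector<int>& label,
                     int N, int d, int K)
{
    // Steps 1 and 2: distances and closest center, O(N*K*d) work.
    for (int v = 0; v < N; ++v) {
        float best = FLT_MAX;
        for (int k = 0; k < K; ++k) {
            float acc = 0.0f;
            for (int j = 0; j < d; ++j) {
                float diff = x[v*d + j] - c[k*d + j];
                acc += diff * diff;
            }
            if (acc < best) { best = acc; label[v] = k; }
        }
    }
    // Step 3: new means, O(N*d) work.
    std::vector<float> sum(K*d, 0.0f);
    std::vector<int> cnt(K, 0);
    for (int v = 0; v < N; ++v) {
        ++cnt[label[v]];
        for (int j = 0; j < d; ++j) sum[label[v]*d + j] += x[v*d + j];
    }
    for (int k = 0; k < K; ++k)
        if (cnt[k])
            for (int j = 0; j < d; ++j) c[k*d + j] = sum[k*d + j] / cnt[k];
}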

SIFT Clustering
Problem: cluster a few (4-8) million 128-dimensional SIFT vectors into a few (1-2) thousand clusters using K-Means
Representation: row major, i.e., the d components of each of the N vectors are stored together, tightly (N rows of d each)
Given: initial cluster means (could be random vectors)
Output: K cluster means and N labels, one for each input vector, giving cluster membership
Large amount of computation; well suited to a GPU-like architecture

Data Representation
[Figure: N × d input vectors in row major; K × d cluster centers in row major; N cluster labels]

Distance Computation
1. Loop over the K clusters, loading c cluster centers into shared memory at a time
2. A block of t threads loops over all d components of t input vectors, loading component v_i and accumulating (C_i - v_i)^2
3. Write the distances into a K x N array, with the K distances for a vector stored consecutively
Shared memory is used to the maximum and all memory accesses are perfectly coalesced (a simplified sketch follows)
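A minimal sketch of this scheme under simplifying assumptions (one center staged in shared memory at a time, i.e. c = 1, and one vector per thread; the talk's version also stages the t input vectors through shared memory so the reads of the input are coalesced too; names are mine):

__global__ void computeDistances(const float* vecs,     // N x d input, row major
                                 const float* centers,  // K x d centers, row major
                                 float* dist,           // K x N output distances
                                 int n, int d, int k)
{
    extern __shared__ float cen[];                      // d floats: the staged center
    int v = blockIdx.x * blockDim.x + threadIdx.x;      // one input vector per thread
    for (int c = 0; c < k; ++c) {
        for (int j = threadIdx.x; j < d; j += blockDim.x)
            cen[j] = centers[c * d + j];                // cooperative, coalesced load
        __syncthreads();
        if (v < n) {
            float acc = 0.0f;
            for (int j = 0; j < d; ++j) {               // accumulate (C_i - v_i)^2
                float diff = vecs[v * d + j] - cen[j];
                acc += diff * diff;
            }
            dist[c * n + v] = acc;                      // coalesced write over v
        }
        __syncthreads();
    }
}

// Launch sketch: computeDistances<<<(n + 127) / 128, 128, d * sizeof(float)>>>(...);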

After Distance Evaluations
[Figure: the K × N vector-to-cluster distance matrix]

Finding Closest Center
We need to know the index of the center that gave the minimum distance
A block of t threads loads t distances for a particular center
Each thread keeps track of the minimum distance and the corresponding index across the K centers
The index is written into a new labels array of length N
All memory accesses are perfectly coalesced (see the sketch below)
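A minimal sketch, assuming the K × N distance matrix stores the N distances for each center contiguously, which is what makes the per-center loads coalesced (names are mine):

#include <cfloat>

__global__ void closestCenter(const float* dist, int n, int k, int* labels)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;   // one input vector per thread
    if (v >= n) return;
    float best = FLT_MAX;
    int bestIdx = 0;
    for (int c = 0; c < k; ++c) {                    // scan the K distances of vector v
        float dcv = dist[c * n + v];                 // consecutive threads read consecutive v: coalesced
        if (dcv < best) { best = dcv; bestIdx = c; }
    }
    labels[v] = bestIdx;                             // index of the closest center
}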

New Cluster Centers
The new labels are given in the input-vector order
Next step: find the mean of all vectors with the same label; find their sum first
Rearrange the input vectors so that the vectors of each cluster are placed together
Column-major storage makes the memory accesses non-coalesced and inefficient; rearrange and convert to row major
Summing is easy thereafter!

Finding New Centers
1. gIndex = splitGatherIndex(newLabels)
2. dCopy = gather(inputVectors, gIndex)
3. temp = transpose(dCopy)
4. Perform a segmented add-reduce of temp, with segments at the label boundaries; store the results in a d × K array newCenters
5. inputVectors = transpose(dCopy)
6. centers = transpose(newCenters)
Now the input vectors are rearranged, with new cluster centers. (We also need to keep track of a composition of gIndex values to maintain the connection to the original input vectors.) A Thrust-based sketch of step 1 follows.
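A hedged sketch of splitGatherIndex using Thrust as a stand-in for the talk's CUDPP primitives: it produces the gather index and, as a by-product, the per-cluster histogram from the sorted labels (names are mine):

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/iterator/constant_iterator.h>

void splitGatherIndex(thrust::device_vector<int>& labels,       // n new labels (left sorted)
                      thrust::device_vector<int>& gIndex,       // out: gather index, length n
                      thrust::device_vector<int>& clusterSize,  // out: histogram, length <= k
                      int k)
{
    int n = (int)labels.size();
    gIndex.resize(n);
    thrust::sequence(gIndex.begin(), gIndex.end());              // 0, 1, ..., n-1
    // split: a stable sort of vector ids by cluster label keeps within-cluster order
    thrust::stable_sort_by_key(labels.begin(), labels.end(), gIndex.begin());
    // histogram: segmented sum of 1s over runs of equal labels
    thrust::device_vector<int> ids(k);
    clusterSize.resize(k);
    thrust::reduce_by_key(labels.begin(), labels.end(),
                          thrust::constant_iterator<int>(1),
                          ids.begin(), clusterSize.begin());
}

The gathered rows then go through a transpose kernel (step 3), after which the per-dimension segmented sums of step 4 can likewise be computed with reduce_by_key over the transposed data.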

Input: input vectors (n × dim), cluster centers (k × dim), n, dim, k
Output: new membership array (n × 1), new cluster centers (k × dim), global index (n × 1)

Storage per Block
[Figure: four input vectors of dim components each, and the current center, in shared memory]
Four input vectors are loaded per block, and their componentwise differences from the center are stored in shared memory, consuming 2*2048 bytes; the center is also in shared memory
On the differences we perform a tree-based addition for each vector

Algorithm Flow
1. Perform distance evaluations between the input and the current centers to generate the new membership array
2. Apply split sort on the membership array, sorting by cluster-center ids
3. Create a flag array and perform a segmented scan to get the histogram for each cluster
4. Rearrange the data as per cluster ids
5. Perform a transpose on the rearranged data for coalesced access
6. Use CUDPP segmented scan on the rearranged data, followed by CUDPP compact to extract the summation
7. Divide the summation by the histogram generated for each cluster to get the new cluster centers (see the sketch below)
8. Update the global index
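A minimal sketch of step 7, assuming newCenters holds the K × d per-cluster sums and hist the per-cluster counts from step 3 (names are mine):

__global__ void divideByCount(float* newCenters, const int* hist, int k, int d)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element of the K x d array
    if (i < k * d && hist[i / d] > 0)
        newCenters[i] /= hist[i / d];                // mean = sum / cluster size
}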

The Global Index is initialized by GlobalIndex[i] = i
After sorting the membership array we have sorted_membership_index[], i.e., the order in which the vectors are to be arranged
The sorted membership index after split sort is used to update the global index: GlobalIndex[sorted_membership_index[i]] = i (see the sketch below)
In the final Global Index, i is the actual vector id of the input vectors and GlobalIndex[i] is the position of the i-th vector in the final rearranged input data
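This update is a scatter; a hedged Thrust equivalent of GlobalIndex[sorted_membership_index[i]] = i (names are mine):

#include <thrust/device_vector.h>
#include <thrust/scatter.h>
#include <thrust/iterator/counting_iterator.h>

void updateGlobalIndex(const thrust::device_vector<int>& sortedIdx,
                       thrust::device_vector<int>& globalIndex)
{
    thrust::scatter(thrust::counting_iterator<int>(0),
                    thrust::counting_iterator<int>((int)sortedIdx.size()),
                    sortedIdx.begin(),
                    globalIndex.begin());            // globalIndex[sortedIdx[i]] = i
}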

Distance Evaluation
The sequential approach takes O(dim) steps
A simple tree-based parallel approach takes O(log(dim)) steps to evaluate the net distance
In a block, only 256/2^i threads are active during the i-th iteration of a distance evaluation
Performed effectively in shared memory
Reduces the step complexity from O(dim) to O(log dim)

Distance Evaluation
[Figure: tree-based addition of 8 values in log2(8) = 3 iterations]

Algorithm for Distance Evaluation (the slide's pseudocode, cleaned up into a working kernel)

#include <cfloat>

// One block per input vector, one thread per component (blockDim.x == dim).
__global__ void distanceKernel(const float* d_input, const float* d_centers,
                               int dim, int no_centers, int* d_labels)
{
    extern __shared__ float shared[];             // dim floats
    int vec = blockIdx.x;                         // this block's input vector
    int tid = threadIdx.x;
    float minDist = FLT_MAX;                      // only thread 0's copy is used
    int minIdx = 0;
    for (int c = 0; c < no_centers; ++c) {
        float diff = d_input[vec * dim + tid] - d_centers[c * dim + tid];
        shared[tid] = diff * diff;                // squared componentwise difference
        __syncthreads();
        for (int j = dim / 2; j > 0; j /= 2) {    // tree-based addition in log(dim) steps
            if (tid < j)
                shared[tid] += shared[tid + j];
            __syncthreads();
        }
        if (tid == 0 && shared[0] < minDist) {    // track the closest center so far
            minDist = shared[0];
            minIdx = c;
        }
        __syncthreads();
    }
    if (tid == 0)
        d_labels[vec] = minIdx;                   // label of the closest center
}

Kernel-Level Execution
[Figure: active thread ids across iterations for dim = 128]
With every iteration, the number of active threads reduces by a factor of 2

Kernel Functions
– Distance: evaluates the distance between vectors (block 128,4; grid n/4p,p)
– Get_long_membership: creates a variable of type long consisting of the membership id and the corresponding vector id
– SplitSort: sorts the membership array as per cluster ids
– CUDPPSegmentedScan: scan operation on the sorted membership array
– Get_flag: generates the flag array for the CUDPP operations (block 256,1)
– Gather_histogram: gathers the final values after the scan
– Rearrange_data: arranges the input as per cluster ids (block 128,4; grid n/4p,p)
– Transpose: performs a transpose on the rearranged data
– CUDPPCompact: extracts the summed-up center values

Rearranging Data
[Figure: N × d input vectors in row major with vector ids; the rearranged vectors in row major, grouped by cluster according to the sorted membership array]

Center Evaluation
[Figure: the rearranged input vectors (N × dim, with vector ids) and the transposed vectors (dim × N)]
We may apply a segmented scan on the transposed vectors, which is a coalesced operation; the flag values can be obtained with the help of the histograms generated for each cluster

Global Index
[Figure: updating the global index array after every iteration]
GlobalIndex[membership_sorted_index[i]] = i

Why Use Split Sort and Transpose?
Evaluating the new centers requires concurrent writes, which are not easily parallelizable
Split sort sorts the membership array, grouping vector ids belonging to the same cluster together
This is helpful for rearranging the entire input vectors as per their clusters
The transpose provides coalesced access for center evaluation using segmented scan

Issues
The major share of the time is consumed by the distance evaluations as the input size increases
The input size and the number of clusters largely control the performance

Result: Kmeans++ to Generate Initial Centers
Time taken to generate the initial cluster centers:

Input size | Cluster centers | CPU (P4, 2.4 GHz) | GPU (GTX 280)
1,000      | …               | … ms              | … ms
10,000     | …               | … ms              | … ms
1,00,000   | …               | … ms              | … ms
1 Million  | …               | … ms              | … ms

Results: Variation with Number of Input Vectors (128 dimensions)
Time taken per iteration to generate the new membership array and new cluster centers (excluding time for kmeans++):

Input size | Cluster centers | CPU (P4, 2.4 GHz) | GPU (GTX 280)
1,000      | …               | … ms              | 9.91 ms
10,000     | …               | … ms              | 487.3 ms
1,00,000   | …               | … ms              | … ms
1 Million  | …               | … ms              | … ms

Result: Variation of Cluster Centers
N = 10000, dimension = 128:

Input size | Cluster centers | GPU (GTX 280)
1,00,000   | …               | … ms
1,00,000   | …               | … ms
1,00,000   | …               | … ms
1,00,000   | …               | … ms
1,00,000   | …               | … ms

Result: Variation with Dimension of SIFT Vector
N = 10000, cluster centers = 8000:

Input size | Dimension | GPU (GTX 280)
1,00,000   | …         | … ms
1,00,000   | …         | … ms
1,00,000   | …         | … ms
1,00,000   | …         | … ms

Result: Coalesced vs Non-Coalesced
Coalesced involves a transpose followed by a segmented scan; non-coalesced involves a gather followed by a segmented scan:

Input size – Cluster centers | Non-coalesced | Coalesced
1,000 – …                    | … ms          | 0.077 ms
10,000 – …                   | … ms          | 0.217 ms
1,00,000 – …                 | … ms          | 1.955 ms
1,00,000 – …                 | … ms          | … ms

Result: Membership vs New Centers
The membership generation consumes the major chunk of the time:

Input size – Cluster centers | Membership | New centers
1,000 – …                    | … ms       | 5.68 ms
10,000 – …                   | … ms       | … ms
1,00,000 – …                 | … ms       | … ms
10,00,000 – …                | … ms       | … ms