CS6963 L18: Global Synchronization and Sorting

L18: Synchronization and Sorting 2 CS6963 Administrative
-Grading
  -Should have exams. Nice job!
  -Design review written feedback later today
-TODAY: Cross-cutting systems seminar, Monday, April 20, 12:15-1:30PM, LCR: "Technology Drivers for Multicore Architectures," Rajeev Balasubramonian, Mary Hall, Ganesh Gopalakrishnan, John Regehr
-Deadline Extended to May 4: Symposium on Application Accelerators in High Performance Computing
-Final Reports on projects
  -Poster session April 29 with dry run April 27
  -Also, submit written document and software by May 6
  -Invite your friends! I'll invite faculty, NVIDIA, graduate students, application owners, ...

L18: Synchronization and Sorting 3 CS6963 Final Project Presentation
-Dry run on April 27
  -Easels, tape and poster board provided
  -Tape a set of PowerPoint slides to a standard 2'x3' poster, or bring your own poster.
-Final Report on Projects due May 6
  -Submit code
  -And written document, roughly 10 pages, based on earlier submission.
  -In addition to original proposal, include:
    -Project Plan and How Decomposed (from DR)
    -Description of CUDA implementation
    -Performance Measurement
    -Related Work (from DR)

L18: Synchronization and Sorting 4 CS6963 Sources for Today's Lecture
-Global barrier, Vasily Volkov code
-Suggested locking code
-Bitonic sort (CUDA zone)
-Erik Sintorn, Ulf Assarsson. Fast Parallel GPU-Sorting Using a Hybrid Algorithm. Journal of Parallel and Distributed Computing, Volume 68, Issue 10, October 2008.

L18: Synchronization and Sorting 5 CS6963 Global Barrier – What does it do?

// Assumes that the block size equals the number of blocks.
__global__ void barrier( volatile int *slave2master, volatile int *master2slave, int niterations )
{
    for( int id = 1; id <= niterations; id++ )
    {
        if( blockIdx.x == 0 )
        {
            // Master thread block: wait until all slaves signal, then signal.
            if( threadIdx.x != 0 )
                while( slave2master[threadIdx.x] != id );
            __syncthreads();
            master2slave[threadIdx.x] = id;
        }
        else
        {
            // Slave thread block: signal to the master, wait for reply.
            if( threadIdx.x == 0 )
            {
                slave2master[blockIdx.x] = id;
                while( master2slave[blockIdx.x] != id );
            }
            __syncthreads();
        }
    }
}
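This barrier only works if every block is resident on the GPU at the same time (a block that is not yet scheduled can never signal, and the master spins forever), and the comment requires gridDim.x == blockDim.x. A hypothetical host-side launch sketch, with illustrative names (numBlocks, d_s2m, d_m2s) that are not from the lecture:

```cuda
// Launch sketch: numBlocks must not exceed the number of blocks that can be
// co-resident on the device, and blockDim must equal gridDim.
int numBlocks = 30;                     // illustrative; tune to the device
int *d_s2m, *d_m2s;
cudaMalloc( &d_s2m, numBlocks * sizeof(int) );
cudaMalloc( &d_m2s, numBlocks * sizeof(int) );
// Flags start at 0 so the first iteration (id == 1) must wait.
cudaMemset( d_s2m, 0, numBlocks * sizeof(int) );
cudaMemset( d_m2s, 0, numBlocks * sizeof(int) );
barrier<<< numBlocks, numBlocks >>>( d_s2m, d_m2s, 100 );
cudaDeviceSynchronize();
```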

L18: Synchronization and Sorting 6 CS6963 Questions
-Safe? Why?
-Multiple iterations?

L18: Synchronization and Sorting 7 CS6963 Simple Lock Using Atomic Updates
Can you use atomic updates to create a lock variable?
Consider these primitives:
    int lockVar;
    atomicAdd(&lockVar, 1);
    atomicAdd(&lockVar, -1);
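One common answer is to build a spinlock from compare-and-swap rather than paired adds. The sketch below is illustrative, not the lecture's suggested code; the helper names lock/unlock are assumptions:

```cuda
// Sketch of a spinlock on a global lock variable (0 = free, 1 = held).
// Caution: acquire from a single thread per block. If all threads of a
// warp spin on the same lock, the winning thread may never get to run
// the critical section and the warp can deadlock.
__device__ void lock( int *lockVar )
{
    // Atomically flip 0 -> 1; keep spinning until the swap succeeds.
    while( atomicCAS( lockVar, 0, 1 ) != 0 )
        ;
}

__device__ void unlock( int *lockVar )
{
    // Release by atomically writing 0 back.
    atomicExch( lockVar, 0 );
}
```

With the atomicAdd primitives on the slide, a thread can detect contention (the old value is nonzero) but cannot atomically test-and-set in one step, which is why atomicCAS is the more natural building block.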

L18: Synchronization and Sorting 8 CS6963 Sorting: Bitonic Sort from CUDA zone

__global__ static void bitonicSort( int *values )
{
    extern __shared__ int shared[];
    const unsigned int tid = threadIdx.x;

    // Copy input to shared mem.
    shared[tid] = values[tid];
    __syncthreads();

    // Parallel bitonic sort.
    for (unsigned int k = 2; k <= NUM; k *= 2)
    {
        // Bitonic merge:
        for (unsigned int j = k / 2; j > 0; j /= 2)
        {
            unsigned int ixj = tid ^ j;
            if (ixj > tid)
            {
                if ((tid & k) == 0)
                {
                    if (shared[tid] > shared[ixj])
                        swap(shared[tid], shared[ixj]);
                }
                else
                {
                    if (shared[tid] < shared[ixj])
                        swap(shared[tid], shared[ixj]);
                }
            }
            __syncthreads();
        }
    }

    // Write result.
    values[tid] = shared[tid];
}

L18: Synchronization and Sorting 9 CS6963 Features of Implementation
-All data fits in shared memory (sorted from there)
-Time complexity: O(n (log n)^2) total work; with n threads, the network runs in O((log n)^2) parallel steps

L18: Synchronization and Sorting 10 CS6963 New Sorting Algorithm (Sintorn and Assarsson)
Each pass:
-Merge 2L sorted lists into L sorted lists
-When L is less than twice the number of SMs, switch to the three-part algorithm:
Three parts:
-Histogramming: to find Pivot Points that split the input list into L independent sublists
-Bucketsort: to split into lists that can be sorted by the next step
-Vector-Mergesort:
  -Elements are grouped into 4-float vectors and a kernel sorts each vector internally
  -Repeat until sublist is sorted
Results:
-20% improvement over radix sort, the best GPU algorithm
-Several times faster than quicksort on CPU
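The innermost Vector-Mergesort step can be sketched as follows. This is an illustrative kernel, not the paper's code (the names sortFloat4 and sortVectors are assumptions): each thread sorts one 4-float vector internally using a five-comparator sorting network.

```cuda
// Each thread sorts one float4 with a 5-comparator sorting network:
// compare-swap pairs (x,y),(z,w), then (x,z),(y,w), then (y,z).
__device__ float4 sortFloat4( float4 v )
{
    #define CSWAP(a, b) { if ((a) > (b)) { float t = (a); (a) = (b); (b) = t; } }
    CSWAP(v.x, v.y); CSWAP(v.z, v.w);   // sort the two pairs
    CSWAP(v.x, v.z); CSWAP(v.y, v.w);   // merge across pairs
    CSWAP(v.y, v.z);                    // fix the middle elements
    #undef CSWAP
    return v;
}

__global__ void sortVectors( float4 *data )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = sortFloat4( data[i] );
}
```

In the full algorithm, these internally sorted vectors are then merged pairwise, pass after pass, until each sublist produced by the bucketsort is fully sorted.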