© David Kirk/NVIDIA and Wen-mei W

Slides:

Advertisements

Similar presentations

© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, ECE408 Applied Parallel Programming Lecture 17 Atomic.

Advertisements

GPU Programming Lecture 7 Atomic Operations and Histogramming

ECE 598HK Computational Thinking for Many-core Computing Lecture 2: Many-core GPU Performance Considerations © Wen-mei W. Hwu and David Kirk/NVIDIA,

Introduction to CUDA Programming Histograms and Sparse Array Multiplication Andreas Moshovos Winter 2009 Based on documents from: NVIDIA & Appendix A of.

©Wen-mei W. Hwu and David Kirk/NVIDIA, SSL(2014), ECE408/CS483/ECE498AL, University of Illinois, ECE408/CS483 Applied Parallel Programming Lecture.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 Structuring Parallel Algorithms.

L9: Control Flow CS6963. Administrative Project proposals Due 5PM, Friday, March 13 (hard deadline) MPM Sequential code and information posted on website.

CUDA Programming continued ITCS 4145/5145 Nov 24, 2010 © Barry Wilkinson Revised.

L8: Memory Hierarchy Optimization, Bandwidth CS6963.

ECE408/CS483 Applied Parallel Programming Lecture 9: Tiled Convolution

L8: Control Flow CS6963. Administrative Next assignment on the website – Description at end of class – Due Wednesday, Feb. 17, 5PM (done?) – Use handin.

©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University of Illinois, ECE408/CS483 Applied Parallel Programming Lecture 7: DRAM.

© David Kirk/NVIDIA and Wen-mei W. Hwu, , SSL 2014, ECE408/CS483, University of Illinois, Urbana-Champaign 1 ECE408 / CS483 Applied Parallel Programming.

© David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 – July 2, Taiwan 2008 CUDA Course Programming Massively Parallel Processors: the CUDA experience.

© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, ECE408 Applied Parallel Programming Lecture 11 Parallel.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lectures 7: Threading Hardware in G80.

Introduction to CUDA Programming Histograms and Sparse Array Multiplication Andreas Moshovos Winter 2009 Based on documents from: NVIDIA & Appendix A of.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lectures 8: Memory Hardware in G80.

Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS Spring 2012.

L8: Writing Correct Programs, cont. and Control Flow L8: Control Flow CS6235.

© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, ECE408 Applied Parallel Programming Lecture 12 Parallel.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 CS 395: CUDA Lecture 5 Memory coalescing (from.

1 ECE408/CS483 Applied Parallel Programming Lecture 10: Tiled Convolution Analysis © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, ECE 498AL, University of Illinois, Urbana-Champaign ECE408 / CS483 Applied Parallel Programming.

1 ECE 8823A GPU Architectures Module 5: Execution and Resources - I.

GPU Architecture and Programming

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 Control Flow/ Thread Execution.

© David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30-July 2, Taiwan 2008 CUDA Course Programming Massively Parallel Processors: the CUDA experience.

© David Kirk/NVIDIA, Wen-mei W. Hwu, and John Stratton, ECE 498AL, University of Illinois, Urbana-Champaign 1 CUDA Lecture 7: Reductions and.

© David Kirk/NVIDIA and Wen-mei W. Hwu Urbana, Illinois, August 10-14, VSCSE Summer School 2009 Many-core Processors for Science and Engineering.

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, CS/EE 217 GPU Architecture and Parallel Programming Lecture 11 Parallel Computation.

© David Kirk/NVIDIA and Wen-mei W. Hwu University of Illinois, CS/EE 217 GPU Architecture and Parallel Programming Lecture 15: Atomic Operations.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 Introduction to CUDA C (Part 2)

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 ECE408 / CS483 Applied Parallel Programming.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lecture 13: Application Lessons When the tires.

CUDA Odds and Ends Patrick Cozzi University of Pennsylvania CIS Fall 2013.

© David Kirk/NVIDIA and Wen-mei W. Hwu University of Illinois, CS/EE 217 GPU Architecture and Parallel Programming Lecture 10 Reduction Trees.

©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010 VSCSE Summer School Proven Algorithmic Techniques for Many-core Processors Lecture.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, ECE 498AL, University of Illinois, Urbana-Champaign ECE408 / CS483 Applied Parallel Programming.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana Champaign 1 Programming Massively Parallel Processors CUDA Memories.

1 2D Convolution, Constant Memory and Constant Caching © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois,

© David Kirk/NVIDIA and Wen-mei W. Hwu, CS/EE 217 GPU Architecture and Programming Lecture 2: Introduction to CUDA C.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 ECE 8823A GPU Architectures Module 2: Introduction.

1 ECE 8823A GPU Architectures Module 4: Memory Model and Locality © David Kirk/NVIDIA and Wen-mei Hwu, ECE408/CS483/ECE498al, University of Illinois,

©Wen-mei W. Hwu and David Kirk/NVIDIA, University of Illinois, CS/EE 217 GPU Architecture and Parallel Programming Lecture 6: DRAM Bandwidth.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors Lecture.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 ECE408 / CS483 Applied Parallel Programming.

1 GPU Programming Lecture 5: Convolution © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois,

1 CS/EE 217 GPU Architecture and Parallel Programming Lecture 9: Tiled Convolution Analysis © David Kirk/NVIDIA and Wen-mei W. Hwu,

© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, ECE408 Applied Parallel Programming Lecture 19: Atomic.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana Champaign 1 ECE 498AL Programming Massively Parallel Processors.

CS 179: GPU Computing Recitation 2: Synchronization, Shared memory, Matrix Transpose.

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, CS/EE 217 GPU Architecture and Parallel Programming Lecture 12 Parallel Computation.

ECE408 Fall 2015 Applied Parallel Programming Lecture 11 Parallel Computation Patterns – Reduction Trees © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al,

ECE408/CS483 Fall 2015 Applied Parallel Programming Lecture 7: DRAM Bandwidth ©Wen-mei W. Hwu and David Kirk/NVIDIA, ECE408/CS483/ECE498AL, University.

Recitation 2: Synchronization, Shared memory, Matrix Transpose

L18: CUDA, cont. Memory Hierarchy and Examples

Parallel Computation Patterns (Scan)

Parallel Computation Patterns (Reduction)

Memory and Data Locality

ECE408 Applied Parallel Programming Lecture 14 Parallel Computation Patterns – Parallel Prefix Sum (Scan) Part-2 © David Kirk/NVIDIA and Wen-mei W.

ECE 8823A GPU Architectures Module 5: Execution and Resources - I

ECE 498AL Lecture 10: Control Flow

© David Kirk/NVIDIA and Wen-mei W. Hwu,

Mattan Erez The University of Texas at Austin

Mattan Erez The University of Texas at Austin

ECE 498AL Spring 2010 Lecture 10: Control Flow

Parallel Computation Patterns (Histogram)

© David Kirk/NVIDIA and Wen-mei W

Presentation transcript:

ECE408 Fall 2015 Applied Parallel Programming Lecture 19: Atomic Operations and Histogramming © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012

CUDA Atomic Operations __syncthreads(); int atomicAdd(int *address, int val); int atomicSub(int *address, int val); int atomicExch(int *address, int val); int atomicMin(int *address, int val); int atomicMax(int *address, int val); unsigned int atomicInc(int *address, int val); unsigned int atomicDec(int *address, int val); int atomicCAS(int *address, int compare, int val); Also atomicAnd(), atomicOr(), atomicXor()… int *address can be in shared memory or global memory © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012

A Histogram Kernel The kernel receives a pointer to the input buffer Each thread process the input in a strided pattern __global__ void hist(unsigned char *buffer, int size, unsigned int *histo) { int i = threadIdx.x + blockIdx.x * blockDim.x; // stride is total number of threads int stride = blockDim.x * gridDim.x; // All threads handle blockDim.x * gridDim.x // consecutive elements while (i < size) { atomicAdd( &(histo[buffer[i]]), 1); i += stride; } © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012

Atomic Operations on DRAM An atomic operation starts with a read, with a latency of a few hundred cycles The atomic operation ends with a write, with a latency of a few hundred cycles During this whole time, no one else can access the location © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012

Atomic Operations on DRAM Each Load-Modify-Store has two full memory access delays All atomic operations on the same variable (RAM location) are serialized time Operation DRAM delay DRAM delay transfer delay atomic operation N © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012

Latency determines throughput of atomic operations Throughput of an atomic operation is the rate at which the application can execute an atomic operation on a particular location. The rate is limited by the total latency of the read-modify-write sequence, typically more than 1000 cycles for global memory (DRAM) locations. This means that if many threads attempt to do atomic operation on the same location (contention), the memory bandwidth is reduced to < 1/1000! © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012

Hardware Improvements (cont.) Atomic operations on Fermi L2 cache medium latency, but still serialized Global to all blocks “Free improvement” on Global Memory atomics time .. data transfer data transfer atomic operation N atomic operation N+1 © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012

Hardware Improvements Atomic operations on Shared Memory Very short latency, but still serialized Private to each thread block Need algorithm work by programmers (more later) time .. data transfer atomic operation N atomic operation N+1 © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012

Atomics in Shared Memory Requires Privatization __global__ void hist(unsigned char *buffer, int size, unsigned int *histo) { __shared__ unsigned int histo_private[256]; if (threadIdx.x < 256) histo_private[threadidx.x] = 0; __syncthreads(); int i = threadIdx.x + blockIdx.x * blockDim.x; // stride is total number of threads int stride = blockDim.x * gridDim.x; while (i < size) { atomicAdd( &(private_histo[buffer[i]), 1); i += stride; } // wait for all other threads in the block to finish if (threadIdx.x < 256) atomicAdd( &(histo[threadIdx.x]),private_histo[threadIdx.x] ); © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012

More on Privatization Privatization is a powerful and frequently used techniques for parallelizing applications The operation needs to be associative and commutative Histogram add operation is associative and commutative The histogram size needs to be small Fits into shared memory What if the histogram is too large to privatize? © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012

ANY MORE QUESTIONS? © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, 2007-2012