© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lecture 13: Application Lessons
When the tires hit the road… (Part 2)

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 2 Build Scan From Partial Sums
Iterate log(n) times. Each thread adds the value stride elements away to its own value, and sets the value stride elements away to its own previous value.
[Figure: the T array at iteration 2 (2 threads), showing strides 4 and 2. Each box corresponds to a single thread.]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 3 Build Scan From Partial Sums
Done! We now have a completed scan that we can write out to device memory.
Total steps: 2 * log(n). Total work: 2 * (n-1) adds = O(n). Work efficient!
[Figure: the T array at iteration log(n) (n/2 threads), showing strides 4, 2, and 1. Each box corresponds to a single thread.]
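For reference, a minimal sketch of this distribution phase in the Blelloch-style exclusive-scan formulation; it assumes temp[] in shared memory already holds the sum tree, temp[n-1] has been cleared, n is a power of two, and tid = threadIdx.x. The formulation and names are our assumptions, not necessarily the original kernel's:

    // Down-sweep: walk back down the tree, halving the stride each iteration.
    for (int stride = n >> 1; stride >= 1; stride >>= 1) {
        __syncthreads();                       // finish the previous level first
        if (tid < n / (2 * stride)) {          // n/(2*stride) active threads
            int a = stride * (2 * tid + 1) - 1;  // value 'stride' elements away
            int b = stride * (2 * tid + 2) - 1;  // this thread's own value
            float t = temp[a];
            temp[a] = temp[b];   // set the value stride away to our previous value
            temp[b] += t;        // add the value stride away to our own value
        }
    }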

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 4 Bank Conflicts
The current scan implementation has many shared memory bank conflicts.
– These really hurt performance on hardware.
We can make some simple modifications to our memory addressing to save a lot of cycles.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 5 Bank Conflicts
Bank conflicts occur when multiple threads access the same shared memory bank with different addresses.
There is no penalty if all threads access different banks, or if all threads access the exact same address.
An access costs 2*N cycles if there is a conflict, where N is the maximum number of threads accessing a single bank.
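To make the three cases concrete, here is a minimal sketch; the 16-bank, bank = (index mod 16) layout is G80's, while the array name and size are our own:

    __shared__ float s[256];   // on G80, element i lives in bank (i % 16)
    int tid = threadIdx.x;

    float a = s[tid];       // conflict-free: a 16-thread half-warp spans all 16 banks
    float b = s[2 * tid];   // 2-way conflict: threads tid and tid+8 of a half-warp
                            // map to the same bank, so the access costs 2*2 cycles
    float c = s[0];         // no penalty: every thread reads the exact same
                            // address, which the hardware broadcasts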

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 6 Build the Sum Tree
Iterate log(n) times. Each thread adds the value stride elements away to its own value.
[Figure: the T array at iteration 1 (n/2 threads, stride 1). Each box corresponds to a single thread.]
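As a baseline for the conflict analysis on the following slides, here is a minimal unpadded sketch of this sum-tree phase, written to match the indexing of the conflict-free version on slide 16; temp[] is the shared memory array, n the element count, tid = threadIdx.x, and the names are ours:

    int s = 1;                              // current stride
    for (int d = n >> 1; d > 0; d >>= 1) {  // log(n) iterations, d active threads
        __syncthreads();                    // finish the previous level first
        if (tid < d) {
            int a = s * (2 * tid);          // this thread's own value
            int b = s * (2 * tid + 1);      // the value 'stride' elements away
            temp[a] += temp[b];             // add it to our own value
        }
        s *= 2;                             // double the stride each level
    }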

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 7 Initial Bank Conflicts on Load
Each thread loads two shared memory data elements. It is tempting to interleave the loads:

    temp[2*thid]   = g_idata[2*thid];
    temp[2*thid+1] = g_idata[2*thid+1];

Threads (0, 1, 2, …, 8, 9, 10, …) then hit banks (0, 2, 4, …, 0, 2, 4, …): a 2-way conflict.
It is better to load one element from each half of the array:

    temp[thid]         = g_idata[thid];
    temp[thid + (n/2)] = g_idata[thid + (n/2)];
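In context, the conflict-free load might look like the following kernel prologue; the signature and names are our assumption, with n/2 threads per block scanning n elements:

    __global__ void preScan(float *g_odata, const float *g_idata, int n) {
        extern __shared__ float temp[];  // n floats, allocated at launch
        int thid = threadIdx.x;
        // Each load instruction makes a half-warp touch 16 consecutive
        // addresses, i.e. 16 distinct banks: no conflicts.
        temp[thid]         = g_idata[thid];
        temp[thid + n / 2] = g_idata[thid + n / 2];
        __syncthreads();
        // ... sum-tree and scan phases follow ...
    }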

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 8 Bank Conflicts in the tree algorithm
When we build the sums, each thread reads two shared memory locations and writes one: Th(0,8) access bank 0.
First iteration: 2 threads access each of 8 banks.
[Figure: each box corresponds to a single thread; like-colored arrows represent simultaneous memory accesses.]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 9 Bank Conflicts in the tree algorithm
When we build the sums, each thread reads two shared memory locations and writes one: Th(1,9) access bank 2, etc.
First iteration: 2 threads access each of 8 banks.
[Figure: each box corresponds to a single thread; like-colored arrows represent simultaneous memory accesses.]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 10 Bank Conflicts in the tree algorithm
2nd iteration: even worse!
– 4-way bank conflicts; for example, Th(0,4,8,12) access bank 1, Th(1,5,9,13) access bank 5, etc.
2nd iteration: 4 threads access each of 4 banks.
[Figure: each box corresponds to a single thread; like-colored arrows represent simultaneous memory accesses.]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 11 Bank Conflicts in the tree algorithm
We can use padding to prevent bank conflicts.
– Just add a word of padding every 16 words: no more conflicts!
Now, within a 16-thread half-warp, all threads access different banks. (Note that only arrows with the same color happen simultaneously.)
[Figure: the array with a pad word (P) inserted after every 16 elements.]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 12 Use Padding to Reduce Conflicts
This is a simple modification to the last exercise. After you compute a shared memory address like this:

    address = 2 * stride * thid;

add padding like this:

    address += (address >> 4);  // divide by NUM_BANKS

This removes most bank conflicts.
– Not all, in the case of deep trees.
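The same offset computation is often packaged as a helper macro; the sketch below mirrors the form used in NVIDIA's scan sample, but take the exact names as illustrative rather than authoritative. Note that the shared memory array must also be declared with room for the pad words (n + n/NUM_BANKS elements rather than n):

    #define NUM_BANKS     16   // G80
    #define LOG_NUM_BANKS 4
    // One pad word is skipped for every NUM_BANKS elements of payload.
    #define CONFLICT_FREE_OFFSET(n) ((n) >> LOG_NUM_BANKS)

    int address = 2 * stride * thid;
    address += CONFLICT_FREE_OFFSET(address);  // jump over the pad words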

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 13 Scan Bank Conflicts (1)
A full binary tree with 64 leaf nodes: multiple 2- and 4-way bank conflicts.
Shared memory cost for the whole tree:
– 1 32-thread warp = 6 cycles per thread without conflicts, counting 2 shared memory reads and one write (s[a] += s[b]).
– 6 * (…) = 102 cycles, where each iteration's 6 cycles are scaled by its conflict degree.
– 36 cycles if there were no bank conflicts (6 iterations * 6 cycles).

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 14 Scan Bank Conflicts (2)
It's much worse with bigger trees! A full binary tree with 128 leaf nodes:
– Only the last 6 iterations shown (root and 5 levels below).
Cost for the whole tree:
– 12*2 + 6*(…) = 186 cycles.
– 48 cycles if there were no bank conflicts: 12*1 + (6*6).

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 15 Scan Bank Conflicts (3)
A full binary tree with 512 leaf nodes:
– Only the last 6 iterations shown (root and 5 levels below).
Cost for the whole tree:
– 48*2 + 24*4 + 12*8 + 6*(…) = 570 cycles.
– 120 cycles if there were no bank conflicts!

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 16 Fixing Scan Bank Conflicts
Insert padding every NUM_BANKS elements:

    const int LOG_NUM_BANKS = 4;  // 16 banks on G80
    int tid = threadIdx.x;
    int s = 1;
    // Traversal from the leaves up to the root
    for (int d = n >> 1; d > 0; d >>= 1) {
        __syncthreads();                // finish the previous level first
        if (tid < d) {
            int a = s * (2 * tid);
            int b = s * (2 * tid + 1);
            a += (a >> LOG_NUM_BANKS);  // insert pad word
            b += (b >> LOG_NUM_BANKS);  // insert pad word
            shared[a] += shared[b];
        }
        s *= 2;                         // double the stride each level
    }

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 17 Fixing Scan Bank Conflicts
A full binary tree with 64 leaf nodes: no more bank conflicts!
– However, there are ~8 cycles of addressing overhead for each s[a] += s[b] (8 cycles/iteration * 6 iterations = 48 extra cycles).
– So it is just barely worth the overhead on a small tree: 84 cycles vs. 102 with conflicts vs. 36 optimal.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 18 Fixing Scan Bank Conflicts
A full binary tree with 128 leaf nodes:
– Only the last 6 iterations shown (root and 5 levels below).
No more bank conflicts!
– A significant performance win: 106 cycles vs. 186 with bank conflicts vs. 48 optimal.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 19 Fixing Scan Bank Conflicts
A full binary tree with 512 leaf nodes:
– Only the last 6 iterations shown (root and 5 levels below).
Wait, we still have bank conflicts.
– The method is not foolproof, but still much improved: 304 cycles vs. 570 with bank conflicts vs. 120 optimal.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 20 Fixing Scan Bank Conflicts
It's possible to remove all bank conflicts:
– Just do multi-level padding.
– Example, two-level padding:

    const int LOG_NUM_BANKS = 4;  // 16 banks on G80
    int tid = threadIdx.x;
    int s = 1;
    // Traversal from the leaves up to the root
    for (int d = n >> 1; d > 0; d >>= 1) {
        __syncthreads();                              // finish the previous level
        if (tid < d) {
            int a = s * (2 * tid);
            int b = s * (2 * tid + 1);
            int offset = (a >> LOG_NUM_BANKS);        // first level
            a += offset + (offset >> LOG_NUM_BANKS);  // second level
            offset = (b >> LOG_NUM_BANKS);            // first level
            b += offset + (offset >> LOG_NUM_BANKS);  // second level
            temp[a] += temp[b];
        }
        s *= 2;                                       // double the stride
    }

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 21 Fixing Scan Bank Conflicts
A full binary tree with 512 leaf nodes:
– Only the last 6 iterations shown (root and 5 levels below).
– No bank conflicts, but an extra cycle of overhead per address calculation.
Not worth it: 440 cycles vs. 304 with 1-level padding.
– With 1-level padding, bank conflicts occur only in warp 0, so the remaining cost due to conflicts is very small, and removing them hurts all other warps.