L14: Application Case Studies II CS6963. Administrative Issues Project proposals – Due 5PM, Wednesday, March 17 (hard deadline) – handin cs6963 prop CS6963.

Slides:



Advertisements
Similar presentations
L9: Floating Point Issues CS6963. Outline Finish control flow and predicated execution discussion Floating point – Mostly single precision until recent.
Advertisements

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lecture 11: Floating-Point Considerations.
ECE 598HK Computational Thinking for Many-core Computing Lecture 2: Many-core GPU Performance Considerations © Wen-mei W. Hwu and David Kirk/NVIDIA,
© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign 1 ECE498AL Programming Massively Parallel Processors Lecture16-17:
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, ECE 498AL, University of Illinois, Urbana-Champaign ECE408 / CS483 Applied Parallel Programming.
L15: Review for Midterm. Administrative Project proposals due today at 5PM (hard deadline) – handin cs6963 prop March 31, MIDTERM in class L15: Review.
L13: Review for Midterm. Administrative Project proposals due Friday at 5PM (hard deadline) No makeup class Friday! March 23, Guest Lecture Austin Robison,
L13: Application Case Studies CS6963. Administrative Issues Next assignment, triangular solve – Due 5PM, Monday, March 8 – handin cs6963 lab 3 ” Project.
L10: Floating Point Issues and Project CS6963. Administrative Issues Project proposals – Due 5PM, Friday, March 13 (hard deadline) Homework (Lab 2) –
L9: Control Flow CS6963. Administrative Project proposals Due 5PM, Friday, March 13 (hard deadline) MPM Sequential code and information posted on website.
GPU Acceleration of the Generalized Interpolation Material Point Method Wei-Fan Chiang, Michael DeLisi, Todd Hummel, Tyler Prete, Kevin Tew, Mary Hall,
L9: Next Assignment, Project and Floating Point Issues CS6963.
L8: Memory Hierarchy Optimization, Bandwidth CS6963.
L12: Sparse Linear Algebra on GPUs CS6963. Administrative Issues Next assignment, triangular solve – Due 5PM, Monday, March 8 – handin cs6963 lab 3 ”
© David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 – July 2, Taiwan 2008 CUDA Course Programming Massively Parallel Processors: the CUDA experience.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lectures 7: Threading Hardware in G80.
CUDA Performance Considerations Patrick Cozzi University of Pennsylvania CIS Spring 2011.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lectures 8: Memory Hardware in G80.
Introduction to CUDA 1 of 2 Patrick Cozzi University of Pennsylvania CIS Fall 2012.
L9: Project Discussion and Floating Point Issues CS6235.
© David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, Taiwan CUDA Course Programming Massively Parallel Processors: the CUDA experience.
© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, ECE408 Applied Parallel Programming Lecture 12 Parallel.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 CS 395 Winter 2014 Lecture 17 Introduction to Accelerator.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lecture 12: Application Lessons When the tires.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 CS 395: CUDA Lecture 5 Memory coalescing (from.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lectures 9: Memory Hardware in G80.
CUDA Performance Patrick Cozzi University of Pennsylvania CIS Fall
L13: Application Case Studies CS6235. Outline Application Case Studies – Advanced MRI Reconstruction Reading: Kirk and Hwu, Chapter 8 or From the sample.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors Lecture.
CUDA Performance Considerations (1 of 2) Patrick Cozzi University of Pennsylvania CIS Spring 2011.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lectures 8: Threading Hardware in G80.
CUDA Performance Considerations (2 of 2) Patrick Cozzi University of Pennsylvania CIS Spring 2011.
1)Leverage raw computational power of GPU  Magnitude performance gains possible.
© David Kirk/NVIDIA, Wen-mei W. Hwu, and John Stratton, ECE 498AL, University of Illinois, Urbana-Champaign 1 CUDA Lecture 7: Reductions and.
L12: Application Case Studies CS6235. Administrative Issues Next assignment, triangular solve – Due 5PM, Tuesday, March 5 – handin cs6235 lab 3 ” Project.
© David Kirk/NVIDIA and Wen-mei W. Hwu Urbana, Illinois, August 10-14, VSCSE Summer School 2009 Many-core Processors for Science and Engineering.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign 1 Floating-Point Considerations.
© David Kirk/NVIDIA and Wen-mei W. Hwu University of Illinois, CS/EE 217 GPU Architecture and Parallel Programming Lecture 10 Reduction Trees.
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010 VSCSE Summer School Proven Algorithmic Techniques for Many-core Processors Lecture.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana Champaign 1 Programming Massively Parallel Processors CUDA Memories.
1 2D Convolution, Constant Memory and Constant Caching © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University of Illinois,
Introduction to CUDA 1 of 2 Patrick Cozzi University of Pennsylvania CIS Fall 2014.
© David Kirk/NVIDIA and Wen-mei W. Hwu Urbana, Illinois, August 10-14, VSCSE Summer School 2009 Many-core Processors for Science and Engineering.
© David Kirk/NVIDIA and Wen-mei W. Hwu Urbana, Illinois, August 10-14, VSCSE Summer School 2009 Many-core Processors for Science and Engineering.
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010 VSCSE Summer School Proven Algorithmic Techniques for Many-core Processors Lecture.
CS6963 L13: Application Case Study III: Molecular Visualization and Material Point Method.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana Champaign 1 ECE 498AL Programming Massively Parallel Processors.
CS 179: GPU Computing Recitation 2: Synchronization, Shared memory, Matrix Transpose.
ECE 498AL Lectures 8: Bank Conflicts and Sample PTX code
ECE 498AL Spring 2010 Lectures 8: Threading & Memory Hardware in G80
© David Kirk/NVIDIA and Wen-mei W. Hwu,
L18: CUDA, cont. Memory Hierarchy and Examples
ECE 498AL Spring 2010 Lecture 11: Floating-Point Considerations
© David Kirk/NVIDIA and Wen-mei W. Hwu,
L4: Memory Hierarchy Optimization II, Locality and Data Placement
2008 Taiwan CUDA Course Programming Massively Parallel Processors: the CUDA experience Lecture 8: Application Case Study - Quantitative MRI Reconstruction.
Programming Massively Parallel Processors Lecture Slides for Chapter 9: Application Case Study – Electrostatic Potential Calculation © David Kirk/NVIDIA.
ECE408 Applied Parallel Programming Lecture 14 Parallel Computation Patterns – Parallel Prefix Sum (Scan) Part-2 © David Kirk/NVIDIA and Wen-mei W.
L6: Memory Hierarchy Optimization IV, Bandwidth Optimization
ECE 498AL Lecture 15: Reductions and Their Implementation
ECE 8823A GPU Architectures Module 3: CUDA Execution Model -I
Mattan Erez The University of Texas at Austin
© David Kirk/NVIDIA and Wen-mei W. Hwu,
L6: Memory Hierarchy Optimization IV, Bandwidth Optimization
© David Kirk/NVIDIA and Wen-mei W. Hwu,
Mattan Erez The University of Texas at Austin
Mattan Erez The University of Texas at Austin
CS179: GPU PROGRAMMING Recitation 2 GPU Memory Synchronization
CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming
(non-Cartesian) trajectory with linear-solver-based reconstruction.
Presentation transcript:

L14: Application Case Studies II CS6963

Administrative Issues Project proposals – Due 5PM, Wednesday, March 17 (hard deadline) – handin cs6963 prop CS L14: Application Case Studies II

Outline How to approach your projects Application Case Studies – Advanced MRI Reconstruction Reading: Kirk and Hwu, Chapter 7, /Chapter7-MRI-Case-Study.pdf CS L14: Application Case Studies II

Approaching Projects/STRSM/Case Studies 1.Parallelism? How do dependences constrain partitioning strategies? 2.Analyze data accesses for different partitioning strategies Start with global memory: coalesced? Consider reuse: within a thread? Within a block? Across blocks? 3.Data Placement (adjust partitioning strategy?) Registers, shared memory, constant memory, texture memory or just leave in global memory 4.Tuning Unrolling, fine-tune partitioning, floating point, control flow, … 4 L14: Application Case Studies II CS6963

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign 5 Reconstructing MR Images Cartesian Scan DataSpiral Scan Data Gridding FFTLS Cartesian scan data + FFT: Slow scan, fast reconstruction, images may be poor

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign 6 Reconstructing MR Images Cartesian Scan DataSpiral Scan Data Gridding FFTLeast-Squares (LS) Spiral scan data + LS Superior images at expense of significantly more computation

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign 7 Least-Squares Reconstruction Compute Q for F H F Acquire Data Compute F H d Find ρ Q depends only on scanner configuration F H d depends on scan data ρ found using linear solver Accelerate Q and F H d on G80 – Q: 1-2 days on CPU – F H d: 6-7 hours on CPU – ρ: 1.5 minutes on CPU

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign 8 for (m = 0; m < M; m++) { phiMag[m] = rPhi[m]*rPhi[m] + iPhi[m]*iPhi[m]; for (n = 0; n < N; n++) { expQ = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]); rQ[n] +=phiMag[m]*cos(expQ); iQ[n] +=phiMag[m]*sin(expQ); } (a) Q computation for (m = 0; m < M; m++) { rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m]; iMu[m] = rPhi[m]*iD[m] – iPhi[m]*rD[m]; for (n = 0; n < N; n++) { expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]); cArg = cos(expFhD); sArg = sin(expFhD); rFhD[n] += rMu[m]*cArg – iMu[m]*sArg; iFhD[n] += iMu[m]*cArg + rMu[m]*sArg; } }(b) F H d computation Q v.s. F H D

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign 9 for (m = 0; m < M; m++) { rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m]; iMu[m] = rPhi[m]*iD[m] – iPhi[m]*rD[m]; for (n = 0; n < N; n++) { expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]); cArg = cos(expFhD); sArg = sin(expFhD); rFhD[n] += rMu[m]*cArg – iMu[m]*sArg; iFhD[n] += iMu[m]*cArg + rMu[m]*sArg; } Algorithms to Accelerate Scan data – M = # scan points – kx, ky, kz = 3D scan data Pixel data – N = # pixels – x, y, z = input 3D pixel data – rFhD, iFhD= output pixel data Complexity is O(MN) Inner loop – 13 FP MUL or ADD ops – 2 FP trig ops – 12 loads, 2 stores

Step 1. Consider Parallelism to Evaluate Partitioning Options for (m = 0; m < M; m++) { rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m]; iMu[m] = rPhi[m]*iD[m] – iPhi[m]*rD[m]; for (n = 0; n < N; n++) { expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]); cArg = cos(expFhD); sArg = sin(expFhD); rFhD[n] += rMu[m]*cArg – iMu[m]*sArg; iFhD[n] += iMu[m]*cArg + rMu[m]*sArg; } What about M total threads? Note: M is O(millions) (Step 2) What happens to data accesses with this strategy?

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign 11 __global__ void cmpFHd(float* rPhi, iPhi, phiMag, kx, ky, kz, x, y, z, rMu, iMu, int N) { int m = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x; rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m]; iMu[m] = rPhi[m]*iD[m] – iPhi[m]*rD[m]; for (n = 0; n < N; n++) { expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]); cArg = cos(expFhD); sArg = sin(expFhD); rFhD[n] += rMu[m]*cArg – iMu[m]*sArg; iFhD[n] += iMu[m]*cArg + rMu[m]*sArg; } One Possibility This code does not work correctly! The accumulation needs to use atomic operation.

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign 12 Back to the Drawing Board – Maybe map the n loop to threads? for (m = 0; m < M; m++) { rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m]; iMu[m] = rPhi[m]*iD[m] – iPhi[m]*rD[m]; for (n = 0; n < N; n++) { expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]); cArg = cos(expFhD); sArg = sin(expFhD); rFhD[n] += rMu[m]*cArg – iMu[m]*sArg; iFhD[n] += iMu[m]*cArg + rMu[m]*sArg; }

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign 13 for (m = 0; m < M; m++) { rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m]; iMu[m] = rPhi[m]*iD[m] – iPhi[m]*rD[m]; for (n = 0; n < N; n++) { expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]); cArg = cos(expFhD); sArg = sin(expFhD); rFhD[n] += rMu[m]*cArg – iMu[m]*sArg; iFhD[n] += iMu[m]*cArg + rMu[m]*sArg; } (a) F H d computation for (m = 0; m < M; m++) { rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m]; iMu[m] = rPhi[m]*iD[m] – iPhi[m]*rD[m]; } for (m = 0; m < M; m++) { for (n = 0; n < N; n++) { expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]); cArg = cos(expFhD); sArg = sin(expFhD); rFhD[n] += rMu[m]*cArg – iMu[m]*sArg; iFhD[n] += iMu[m]*cArg + rMu[m]*sArg; } }(b) after loop fission

© David Kirk/NVIDIA and Wen- mei W. Hwu, University of Illinois, Urbana- Champaign 14 __global__ void cmpMu(float* rPhi, iPhi, rD, iD, rMu, iMu) { int m = blockIdx.x * MU_THREAEDS_PER_BLOCK + threadIdx.x; rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m]; iMu[m] = rPhi[m]*iD[m] – iPhi[m]*rD[m]; } A Separate cmpMu Kernel

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign 15 for (m = 0; m < M; m++) { for (n = 0; n < N; n++) { expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]); cArg = cos(expFhD); sArg = sin(expFhD); rFhD[n] += rMu[m]*cArg – iMu[m]*sArg; iFhD[n] += iMu[m]*cArg + rMu[m]*sArg; } } (a) before loop interchange for (n = 0; n < N; n++) { for (m = 0; m < M; m++) { expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]); cArg = cos(expFhD); sArg = sin(expFhD); rFhD[n] += rMu[m]*cArg – iMu[m]*sArg; iFhD[n] += iMu[m]*cArg + rMu[m]*sArg; } } (b) after loop interchange Figure 7.9 Loop interchange of the F H D computation

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign 16 __global__ void cmpFHd(float* kx, ky, kz, x, y, z, rMu, iMu, int M) { int n = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x; for (m = 0; m < M; m++) { float expFhD = 2*PI*(kx[m]*x[n]+ky[m]*y[n]+kz[m]*z[n]); float cArg = cos(expFhD); float sArg = sin(expFhD); rFhD[n] += rMu[m]*cArg – iMu[m]*sArg; iFhD[n] += iMu[m]*cArg + rMu[m]*sArg; } 6 Step 2. New FHd kernel

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign 17 __global__ void cmpFHd(float* rPhi, iPhi, phiMag, kx, ky, kz, x, y, z, rMu, iMu, int M) { int n = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x; float xn_r = x[n]; float yn_r = y[n]; float zn_r = z[n]; float rFhDn_r = rFhD[n]; float iFhDn_r = iFhD[n]; for (m = 0; m < M; m++) { float expFhD = 2*PI*(kx[m]*xn_r+ky[m]*yn_r+kz[m]*zn_r); float cArg = cos(expFhD); float sArg = sin(expFhD); rFhDn_r += rMu[m]*cArg – iMu[m]*sArg; iFhDn_r += iMu[m]*cArg + rMu[m]*sArg; } rFhD[n] = rFhD_r; iFhD[n] = iFhD_r; } Step 3. Using Registers to Reduce Global Memory Traffic Still too much stress on memory! Note that kx, ky and kz are read- only and based on m

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign 18 Tiling of Scan Data LS recon uses multiple grids – Each grid operates on all pixels – Each grid operates on a distinct subset of scan data – Each thread in the same grid operates on a distinct pixel for (m = 0; m < M/32; m++) { exQ = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]) rQ[n] += phi[m]*cos(exQ) iQ[n] += phi[m]*sin(exQ) } 12 Thread n operates on pixel n:

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign 19 __constant__ float kx_c[CHUNK_SIZE], ky_c[CHUNK_SIZE], kz_c[CHUNK_SIZE]; … __ void main() { int i; for (i = 0; i < M/CHUNK_SIZE; i++); cudaMemcpyToSymbol(kx_c,&kx[i*CHUNK_SIZE],4*CHUNK_SIZE); cudaMemcpyToSymbol(ky_c,&ky[i*CHUNK_SIZE],4*CHUNK_SIZE); cudaMemcpyToSymbol(kz_c,&ky[i*CHUNK_SIZE],4*CHUNK_SIZE); … cmpFHD >> (rPhi, iPhi, phiMag, x, y, z, rMu, iMu, int M); } /* Need to call kernel one more time if M is not */ /* perfect multiple of CHUNK SIZE */ } Tiling k-space data to fit into constant memory

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign 20 __global__ void cmpFHd(float* x, y, z, rMu, iMu, int M) { int n = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x; float xn_r = x[n]; float yn_r = y[n]; float zn_r = z[n]; float rFhDn_r = rFhD[n]; float iFhDn_r = iFhD[n]; for (m = 0; m < M; m++) { float expFhD = 2*PI*(kx_c[m]*xn_r+ky_c[m]*yn_r+kz_c[m]*zn_r); float cArg = cos(expFhD); float sArg = sin(expFhD); rFhDn_r += rMu[m]*cArg – iMu[m]*sArg; iFhDn_r += iMu[m]*cArg + rMu[m]*sArg; } rFhD[n] = rFhD_r; iFhD[n] = iFhD_r; } Revised Kernel for Constant Memory

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign 21 Sidebar: Cache-Conscious Data Layout kx, ky, kz, and phi components of same scan point have spatial and temporal locality –Prefetching –Caching Old layout does not fully leverage that locality New layout does fully leverage that locality 25

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign 22 struct kdata { float x, float y, float z; } k; __constant__ struct kdata k_c[CHUNK_SIZE]; … __ void main() { int i; for (i = 0; i < M/CHUNK_SIZE; i++); cudaMemcpyToSymbol(k_c, k, 12*CHUNK_SIZE); cmpFHD >> (); } 6 Adjusting K-space Data Layout

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign 23 __global__ void cmpFHd(float* rPhi, iPhi, phiMag, x, y, z, rMu, iMu, int M) { int n = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x; float xn_r = x[n]; float yn_r = y[n]; float zn_r = z[n]; float rFhDn_r = rFhD[n]; float iFhDn_r = iFhD[n]; for (m = 0; m < M; m++) { float expFhD = 2*PI*(k[m].x*xn_r+k[m].y*yn_r+k[m].z*zn_r); float cArg = cos(expFhD); float sArg = sin(expFhD); rFhDn_r += rMu[m]*cArg – iMu[m]*sArg; iFhDn_r += iMu[m]*cArg + rMu[m]*sArg; } rFhD[n] = rFhD_r; iFhD[n] = iFhD_r; } 6 Figure 7.16 Adjusting the k-space data memory layout in the F H d kernel

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign 24 Overcoming Mem BW Bottlenecks Old bottleneck: off-chip BW –Solution: constant memory –FP arithmetic to off-chip loads: 421 to 1 Performance –22.8 GFLOPS (F H d) New bottleneck: trig operations 17

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign Recall from L9: Arithmetic Instruction Throughput int and float add, shift, min, max and float mul, mad: 4 cycles per warp – int multiply (*) is by default 32-bit requires multiple cycles / warp – Use __mul24() / __umul24() intrinsics for 4-cycle 24-bit int multiply Integer divide and modulo are expensive – Compiler will convert literal power-of-2 divides to shifts – Be explicit in cases where compiler can’t tell that divisor is a power of 2! – Useful trick: foo % n == foo & (n-1) if n is a power of 2 25 L9: Projects and Floating Point

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign Recall from L9: Arithmetic Instruction Throughput Reciprocal, reciprocal square root, sin/cos, log, exp: 16 cycles per warp – These are the versions prefixed with “__” – Examples:__rcp(), __sin(), __exp() Other functions are combinations of the above – y / x == rcp(x) * y == 20 cycles per warp – sqrt(x) == rcp(rsqrt(x)) == 32 cycles per warp 26 L9: Projects and Floating Point

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign 27 Using Super Function Units Old bottleneck: trig operations –Solution: SFUs Performance –92.2 GFLOPS (F H d) New bottleneck: overhead of branches and address calculations

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign 28 Sidebar: Effects of Approximations Avoid temptation to measure only absolute error (I 0 – I) –Can be deceptively large or small Metrics –PSNR: Peak signal-to-noise ratio –SNR: Signal-to-noise ratio Avoid temptation to consider only the error in the computed value –Some apps are resistant to approximations; others are very sensitive 20 A.N. Netravali and B.G. Haskell, Digital Pictures: Representation, Compression, and Standards (2nd Ed), Plenum Press, New York, NY (1995).

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign 29 Experimental Tuning: Tradeoffs In the Q kernel, three parameters are natural candidates for experimental tuning –Loop unrolling factor (1, 2, 4, 8, 16) –Number of threads per block (32, 64, 128, 256, 512) –Number of scan points per grid (32, 64, 128, 256, 512, 1024, 2048) Can’t optimize these parameters independently –Resource sharing among threads (register file, shared memory) –Optimizations that increase a thread’s performance often increase the thread’s resource consumption, reducing the total number of threads that execute in parallel Optimization space is not linear –Threads are assigned to SMs in large thread blocks –Causes discontinuity and non-linearity in the optimization space

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign 30 Step 4: Overcoming Bottlenecks (Overheads) Old bottleneck: Overhead of branches and address calculations –Solution: Loop unrolling and experimental tuning Performance –145 GFLOPS (F H d) 21

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign Summary of Results QFHdFHd ReconstructionRun Time (m)GFLOPRun Time (m) GFLOPLinear Solver (m) Recon. Time (m) Gridding + FFT (CPU, DP) N/A 0.39 LS (CPU, DP) LS (CPU, SP) LS (GPU, Naïve) LS (GPU, CMem) LS (GPU, CMem, SFU) LS (GPU, CMem, SFU, Exp) X228X 357X

What’s Coming Before Spring Break – Complete MRI application study + more STRSM – MPM/GIMP application study from last year (read GPU Acceleration of the Generalized Interpolation Material Point Method Wei-Fan Chiang, Michael DeLisi, Todd Hummel, Tyler Prete, Kevin Tew, Mary Hall, Phil Wallstedt, and James Guilkey ( GPU Acceleration of the Generalized Interpolation Material Point Methodhttp://saahpc.ncsa.illinois.edu/09/papers/Chiang_paper.pdf – Review for Midterm (week after spring break) CS L14: Application Case Studies II