CS 179: Lecture 4 Lab Review 2
Groups of Threads (Hierarchy) (largest to smallest)
  - “Grid”: All of the threads
      - Size: (number of threads per block) * (number of blocks)
  - “Block”:
      - Size: User-specified
          - Should be a multiple of 32 (often, higher is better)
          - Upper limit given by hardware (512 in Tesla, 1024 in Fermi)
      - Features: Shared memory, synchronization
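The size arithmetic above can be sketched in a few lines of host-side C++. This is a minimal sketch, not part of the lab code: the helper names `roundUpToWarpMultiple`, `blocksNeeded`, and `gridThreads` are hypothetical, and the warp size of 32 is taken from the slides.

```cpp
#include <cassert>

// Warp size as stated on the slides.
constexpr unsigned kWarpSize = 32;

// Hypothetical helper: round a requested block size up to a multiple of 32.
unsigned roundUpToWarpMultiple(unsigned threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return ((threadsPerBlock + kWarpSize - 1) / kWarpSize) * kWarpSize;
}

// Hypothetical helper: number of blocks needed so that each of n elements
// gets its own thread (ceiling division).
unsigned blocksNeeded(unsigned n, unsigned threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Grid size as defined above: (threads per block) * (number of blocks).
unsigned gridThreads(unsigned threadsPerBlock, unsigned numBlocks) {
    return threadsPerBlock * numBlocks;
}
```

For example, a requested block size of 100 rounds up to 128, and covering 1000 elements with 128-thread blocks takes 8 blocks, for a grid of 1024 threads.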
Groups of Threads
  - “Warp”: Group of 32 threads
      - Execute in lockstep (same instructions)
      - Susceptible to divergence!
Divergence
  “Two roads diverged in a wood…
  …and I took both”
Divergence
  What happens:
  - Executes normally until if-statement
  - Branches to calculate Branch A (blue threads)
  - Goes back (!) and branches to calculate Branch B (red threads)
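The "goes back and takes the other branch" behavior can be modeled on the CPU by counting how many serialized passes one warp needs for an if/else. This is a simplified sketch under the lockstep model described above, not a description of any specific hardware scheduler; the function name `branchPasses` is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Simplified model: a warp makes one pass for each branch that at least one
// of its threads takes. takesBranchA[i] is true if thread i takes Branch A,
// false if it takes Branch B.
int branchPasses(const std::vector<bool>& takesBranchA) {
    assert(!takesBranchA.empty());
    bool anyA = false;
    bool anyB = false;
    for (bool a : takesBranchA) {
        if (a) {
            anyA = true;
        } else {
            anyB = true;
        }
    }
    // 1 pass if the warp agrees, 2 passes if it diverges.
    return (anyA ? 1 : 0) + (anyB ? 1 : 0);
}
```

In this model, a warp whose 32 threads all take the same branch needs one pass, while a single dissenting thread is enough to force two.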
“Divergent tree”
  [Figure: reduction tree. Assume 512 threads in block. Thread indices shown at successive levels: … 506, 508, 510; … 500, 504, 508; … 488, 496, 504; … 464, 480, 496]
“Divergent tree”
  //Let our shared memory block be partial_outputs[]...
  synchronize threads before starting...
  set offset to 1
  while ( (offset * 2) <= block dimension):
      if (thread index % (offset * 2) is 0):
          add partial_outputs[thread index + offset] to partial_outputs[thread index]
      double the offset
      synchronize threads
  Get thread 0 to atomicAdd() partial_outputs[0] to output

  Assumes block size is power of 2…
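The pseudocode above can be checked with a sequential CPU model. This is a sketch under stated assumptions, not the lab's kernel: the inner loop stands in for the threads of one block, the end of each outer iteration stands in for the thread synchronization, and the name `divergentTreeSum` is hypothetical. The final atomicAdd() is replaced by returning the block's total.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU model of the strided ("divergent") tree reduction over one block.
// partial plays the role of partial_outputs[]; its size is the block dimension.
float divergentTreeSum(std::vector<float> partial) {
    const std::size_t n = partial.size();
    assert(n > 0 && (n & (n - 1)) == 0);  // block size must be a power of 2

    for (std::size_t offset = 1; offset * 2 <= n; offset *= 2) {
        // Each tid models one thread; only every (offset * 2)-th one accumulates.
        for (std::size_t tid = 0; tid < n; ++tid) {
            if (tid % (offset * 2) == 0) {
                partial[tid] += partial[tid + offset];
            }
        }
        // Reaching this point models "synchronize threads".
    }
    return partial[0];  // the value thread 0 would atomicAdd() to the output
}
```

Running the threads one after another is safe in this model because, within an iteration, the elements being read (index `tid + offset`) are never written by any thread.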
“Non-divergent tree”
  Example purposes only! Real blocks are way bigger!
“Non-divergent tree”
  //Let our shared memory block be partial_outputs[]...
  synchronize threads before starting...
  set offset to highest power of 2 that’s less than the block dimension
  while (offset >= 1):
      if (thread index < offset):
          add partial_outputs[thread index + offset] to partial_outputs[thread index]
      halve the offset
      synchronize threads
  Get thread 0 to atomicAdd() partial_outputs[0] to output

  Assumes block size is power of 2…
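The same kind of sequential CPU model works for this version. Again a sketch under stated assumptions rather than the lab's kernel: the inner loop stands in for the threads of one block, the end of each outer iteration stands in for the synchronization, and the name `nonDivergentTreeSum` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU model of the contiguous ("non-divergent") tree reduction over one block.
// partial plays the role of partial_outputs[]; its size is the block dimension.
float nonDivergentTreeSum(std::vector<float> partial) {
    const std::size_t n = partial.size();
    assert(n > 0 && (n & (n - 1)) == 0);  // block size must be a power of 2

    // For a power-of-2 block, n / 2 is the highest power of 2 below n.
    for (std::size_t offset = n / 2; offset >= 1; offset /= 2) {
        // Each tid models one thread; only the first `offset` threads accumulate.
        for (std::size_t tid = 0; tid < n; ++tid) {
            if (tid < offset) {
                partial[tid] += partial[tid + offset];
            }
        }
        // Reaching this point models "synchronize threads".
    }
    return partial[0];  // the value thread 0 would atomicAdd() to the output
}
```

The result matches the strided version; what changes is which threads do the work, so that the active threads sit next to each other.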
“Divergent tree”
  Where is the divergence?
  - Two branches:
      - Accumulate
      - Do nothing
  - If the second branch does nothing, then where is the performance loss?
“Divergent tree” – Analysis
  First iteration (reduce 512 -> 256):
  - Warp of threads 0-31 (after calculating polynomial):
      - Thread 0: Accumulate
      - Thread 1: Do nothing
      - Thread 2: Accumulate
      - Thread 3: Do nothing
      - …
  - Warp of threads 32-63: (same thing!)
  - … (up to) warp of threads 480-511
  - Number of executing warps: 512 / 32 = 16
“Divergent tree” – Analysis
  Second iteration (reduce 256 -> 128):
  - Warp of threads 0-31 (after calculating polynomial):
      - Thread 0: Accumulate
      - Threads 1-3: Do nothing
      - Thread 4: Accumulate
      - Threads 5-7: Do nothing
      - …
  - Warp of threads 32-63: (same thing!)
  - … (up to) warp of threads 480-511
  - Number of executing warps: 16 (again!)
“Divergent tree” – Analysis (Process continues, until offset is large enough to separate warps)
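The per-iteration warp counts in this analysis can be reproduced with a small host-side counter. This is a sketch, assuming a warp "executes" whenever at least one of its threads satisfies the accumulate condition; the name `activeWarpsDivergent` is hypothetical.

```cpp
#include <cassert>
#include <vector>

// For each iteration of the strided ("divergent") tree, count the warps that
// contain at least one accumulating thread.
std::vector<int> activeWarpsDivergent(unsigned blockDim) {
    assert(blockDim > 0 && (blockDim & (blockDim - 1)) == 0);  // power of 2
    const unsigned warpSize = 32;
    std::vector<int> counts;

    for (unsigned offset = 1; offset * 2 <= blockDim; offset *= 2) {
        int active = 0;
        for (unsigned warpStart = 0; warpStart < blockDim; warpStart += warpSize) {
            bool anyAccumulates = false;
            for (unsigned tid = warpStart;
                 tid < warpStart + warpSize && tid < blockDim; ++tid) {
                if (tid % (offset * 2) == 0) {
                    anyAccumulates = true;
                }
            }
            if (anyAccumulates) {
                ++active;
            }
        }
        counts.push_back(active);
    }
    return counts;
}
```

For a 512-thread block this gives 16, 16, 16, 16, 16, 8, 4, 2, 1: the count stays at 16 for the first five iterations and only starts to fall once the offset reaches 32, which is the point where whole warps no longer contain an accumulating thread.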
“Non-divergent tree” – Analysis
  First iteration (reduce 512 -> 256): (Part 1)
  - Warp of threads 0-31: Accumulate
  - Warp of threads 32-63: Accumulate
  - … (up to) warp of threads 224-255
  - Then what?
“Non-divergent tree” – Analysis
  First iteration (reduce 512 -> 256): (Part 2)
  - Warp of threads 256-287: Do nothing!
  - … (up to) warp of threads 480-511
  - Number of executing warps: 256 / 32 = 8 (Was 16 previously!)
“Non-divergent tree” – Analysis
  Second iteration (reduce 256 -> 128):
  - Warps of threads 0-31, …, 96-127: Accumulate
  - Warps of threads 128-159, …, 480-511: Do nothing!
  - Number of executing warps: 128 / 32 = 4 (Was 16 previously!)
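The warp counts for this version can be reproduced the same way. This is a sketch, assuming a warp "executes" whenever at least one of its threads has an index below the current offset; the name `activeWarpsNonDivergent` is hypothetical.

```cpp
#include <cassert>
#include <vector>

// For each iteration of the contiguous ("non-divergent") tree, count the warps
// that contain at least one accumulating thread (thread index < offset).
std::vector<int> activeWarpsNonDivergent(unsigned blockDim) {
    assert(blockDim > 0 && (blockDim & (blockDim - 1)) == 0);  // power of 2
    const unsigned warpSize = 32;
    std::vector<int> counts;

    for (unsigned offset = blockDim / 2; offset >= 1; offset /= 2) {
        int active = 0;
        for (unsigned warpStart = 0; warpStart < blockDim; warpStart += warpSize) {
            // Threads in a warp are contiguous, so the warp has an accumulating
            // thread exactly when its first thread is below the offset.
            if (warpStart < offset) {
                ++active;
            }
        }
        counts.push_back(active);
    }
    return counts;
}
```

For a 512-thread block this gives 8, 4, 2, 1, 1, 1, 1, 1, 1. Only the last five iterations, where the offset drops below 32, leave a single partially active warp.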
What happened? “Implicit divergence”
Why did we do this?
  - Performance improvements
  - Reveals GPU internals!
Final Puzzle
  - What happens when the polynomial order increases?
  - All these threads that we think are competing… are they?
The Real World
In medicine…
  - More sensitive devices -> more data!
  - More intensive algorithms
  - Real-time imaging and analysis
  - Most are parallelizable problems!
MRI
  - “k-space”: Inverse FFT
  - Real-time and high-resolution imaging
CT, PET
  - Low-dose techniques
      - Safety!
  - 4D CT imaging
  - X-ray CT vs. PET CT
  - Texture memory!
Radiation Therapy
  - Goal: Give sufficient dose to cancerous cells, minimize dose to healthy cells
  - More accurate algorithms possible!
      - Accuracy = safety!
  - 40 minutes -> 10 seconds
Notes
  - Office hours:
      - Kevin: Monday 8-10 PM
      - Ben: Tuesday 7-9 PM
      - Connor: Tuesday 8-10 PM
  - Lab 2: Due Wednesday (4/16), 5 PM