1
Weekly Report: Reduction
Ph.D. Student: Leo Lee
Date: Oct. 30, 2009
2
Outline
– Introduction
– 7 implementations
– Work plan
3
Parallel Reduction
A common and important data-parallel primitive, and a good case study in optimization:
– Easy to implement, but hard to make highly efficient.
– NVIDIA supplies 7 versions for computing the sum of an array.
– I am studying them one by one.
4
Parallel Reduction
To handle large arrays, the algorithm needs multiple thread blocks:
– Each block reduces a portion of the array.
How are partial results communicated between thread blocks?
– There is no global synchronization: it would be expensive and risks deadlock.
– Instead, decompose the reduction into multiple kernel launches.
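The multi-kernel decomposition can be sketched on the host side as follows. This is a minimal sketch, not code from the report: `reduce0` stands for any of the block-level kernels, and it assumes n and threads are powers of two.

```cuda
// Host-side sketch of the multi-kernel decomposition. Each launch
// reduces n elements to one partial sum per block; the partial sums
// are then reduced by another launch, until one value remains.
// Assumes n and threads are powers of two; reduce0 stands for any
// of the block-level kernels (names here are illustrative).
void reduceArray(int *d_in, int *d_out, unsigned int n, unsigned int threads)
{
    while (n > 1) {
        if (threads > n) threads = n;          // shrink for the last levels
        unsigned int blocks = n / threads;
        unsigned int smemBytes = threads * sizeof(int);
        reduce0<<<blocks, threads, smemBytes>>>(d_in, d_out);

        // The block sums become the input of the next launch.
        int *tmp = d_in; d_in = d_out; d_out = tmp;
        n = blocks;
    }
    // After the final swap, d_in[0] holds the total sum.
}
```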
5
Optimization Goal
Reach GPU peak performance:
– GFLOP/s: for compute-bound kernels
– Bandwidth: for memory-bound kernels
Reductions have low arithmetic intensity:
– 1 flop per element loaded
– So the target is peak bandwidth!
6
Reduction 1: Interleaved Addressing
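The first version, following NVIDIA's reduction slides (`int` data assumed here):

```cuda
// Kernel 1: interleaved addressing. Each thread loads one element
// into shared memory, then pairs are summed with a doubling stride.
// The (tid % (2*s) == 0) test makes the branching highly divergent.
__global__ void reduce0(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];  // block's partial sum
}
```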
8
Hardware
My computer: an NVIDIA GeForce 8500 GT
9
Performance for 4M element reduction
NVIDIA's results: see the original NVIDIA slides. On my computer, the same code gives:

Kernel  Time (ms)  Bandwidth (GB/s)
1       53         0.31
10
Reduction 2: Interleaved Addressing
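In NVIDIA's second version, only the inner loop changes relative to kernel 1: the divergent modulo test is replaced by a strided index.

```cuda
// Kernel 2: same interleaved addressing, but the divergent modulo
// branch is replaced by a strided index, so the active threads stay
// contiguous within warps. Shared-memory bank conflicts appear instead.
for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    int index = 2 * s * tid;
    if (index < blockDim.x)
        sdata[index] += sdata[index + s];
    __syncthreads();
}
```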
11
Performance for 4M element reduction
NVIDIA's results: see the original NVIDIA slides. On my computer:

Kernel  Time (ms)  Bandwidth (GB/s)  Step speedup  Cumulative speedup
1       53         0.31
2       25         0.66              2.12          2.12
12
Reduction 3: Sequential Addressing
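NVIDIA's third version again changes only the loop: the stride now starts at half the block size and halves each step.

```cuda
// Kernel 3: sequential addressing. Each active thread tid adds the
// element s positions away, with s halving every step. Accesses are
// contiguous, so shared-memory bank conflicts are eliminated.
for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s)
        sdata[tid] += sdata[tid + s];
    __syncthreads();
}
```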
13
Performance for 4M element reduction
NVIDIA's results: see the original NVIDIA slides. On my computer:

Kernel  Time (ms)  Bandwidth (GB/s)  Step speedup  Cumulative speedup
1       53         0.31
2       25         0.66              2.12          2.12
3       12         1.37              2.01          4.42
14
Reduction 3: Sequential Addressing
15
Reduction 4: First Add During Load
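In NVIDIA's fourth version only the load changes: half as many blocks are launched, and each thread sums two input elements while loading.

```cuda
// Kernel 4: first add during load. With sequential addressing, half
// the threads were idle on the very first loop iteration. Instead,
// launch half as many blocks and do the first add while loading.
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];
__syncthreads();
// ...then reduce sdata with kernel 3's sequential-addressing loop.
```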
16
Performance for 4M element reduction
NVIDIA's results: see the original NVIDIA slides. On my computer:

Kernel  Time (ms)  Bandwidth (GB/s)  Step speedup  Cumulative speedup
1       53         0.31
2       25         0.66              2.12          2.12
3       12         1.37              2.01          4.42
4       6.85       2.44              1.75          7.74
17
Instruction Bottleneck
Address arithmetic and loop overhead:
– At 17 GB/s, still far from bandwidth bound.
– The remaining cost is ancillary instructions: ones that are not loads, stores, or arithmetic for the core computation.
Strategy: unroll loops.
– When s <= 32, only one warp is left.
– Instructions are SIMD-synchronous within a warp.
– So the if (tid < s) test is unnecessary there.
– Unroll the last 6 iterations.
18
Reduction 5: Unroll the Last Warp
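Following NVIDIA's fifth version, the last 6 iterations move into an unrolled warp-level helper (on the G80-era hardware discussed here, a warp executes in SIMD lockstep, so no __syncthreads() is needed inside it):

```cuda
// Kernel 5: unroll the last warp. For s <= 32 only one warp is
// active, so neither __syncthreads() nor the (tid < s) test is
// needed. 'volatile' stops the compiler from caching sdata values
// in registers between the unrolled adds.
__device__ void warpReduce(volatile int *sdata, unsigned int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

// The main loop now stops at s = 32 and hands off to warpReduce:
for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
    if (tid < s)
        sdata[tid] += sdata[tid + s];
    __syncthreads();
}
if (tid < 32) warpReduce(sdata, tid);
```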
19
Performance for 4M element reduction
NVIDIA's results: see the original NVIDIA slides. On my computer:

Kernel  Time (ms)  Bandwidth (GB/s)  Step speedup  Cumulative speedup
1       53         0.31
2       25         0.66              2.12          2.12
3       12         1.37              2.01          4.42
4       6.85       2.44              1.75          7.74
5       4.1        -                 1.67          13
20
Further optimization
– Complete unrolling
– Multiple adds per thread
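These last two steps can be sketched from NVIDIA's slides (a sketch only, since the work plan covers studying them in detail): the block size becomes a compile-time template parameter, so every size test resolves at compile time, and each thread first sums many grid-strided elements.

```cuda
// Sketch of NVIDIA's final kernels: complete unrolling via the
// blockSize template parameter, plus a grid-stride load loop so
// each thread performs multiple adds. Assumes blockSize <= 512
// and n a multiple of 2*blockSize; warpReduce is the unrolled
// warp loop from kernel 5.
template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockSize * 2) + tid;
    unsigned int gridSize = blockSize * 2 * gridDim.x;

    sdata[tid] = 0;
    while (i < n) {                        // multiple adds per thread
        sdata[tid] += g_idata[i] + g_idata[i + blockSize];
        i += gridSize;
    }
    __syncthreads();

    // Completely unrolled reduction: each 'if' is evaluated at
    // compile time, leaving only the needed steps in the binary.
    if (blockSize >= 512) { if (tid < 256) sdata[tid] += sdata[tid + 256]; __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) sdata[tid] += sdata[tid + 128]; __syncthreads(); }
    if (blockSize >= 128) { if (tid < 64)  sdata[tid] += sdata[tid + 64];  __syncthreads(); }
    if (tid < 32) warpReduce(sdata, tid);
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
```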
21
Other work
– Read two papers on matrix multiplication.
– Began reading books on parallel computing.
22
Work plan
– Study the last two reduction kernels.
– Re-read the CUDA programming guide.
23
Thanks