1
Weekly Report: Reduction
Ph.D. Student: Leo Lee
Date: Oct. 30, 2009
2
Outline
– Introduction
– 7 implementations
– Work plan
3
Parallel Reduction
A common and important data-parallel primitive, and a good case study in optimization:
– Easy to implement, but hard to make highly efficient.
– NVIDIA supplies 7 versions for computing the sum of an array.
– I am studying them one by one.
4
Parallel Reduction
To handle large arrays, the algorithm needs multiple thread blocks:
– Each block reduces a portion of the array.
How are partial results communicated between thread blocks?
– There is no global synchronization: it would be expensive and risks deadlock.
– Instead, decompose the reduction into multiple kernel launches.
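The multi-kernel decomposition can be sketched on the host side as follows. This is a minimal sketch, not code from the report: `reduce0` stands for any of the block-level kernels, and it assumes n and threads are powers of two.

```cuda
// Host-side sketch of the multi-kernel decomposition. Each launch
// reduces n elements to one partial sum per block; the partial sums
// are then reduced by another launch, until one value remains.
// Assumes n and threads are powers of two; reduce0 stands for any
// of the block-level kernels (names here are illustrative).
void reduceArray(int *d_in, int *d_out, unsigned int n, unsigned int threads)
{
    while (n > 1) {
        if (threads > n) threads = n;          // shrink for the last levels
        unsigned int blocks = n / threads;
        unsigned int smemBytes = threads * sizeof(int);
        reduce0<<<blocks, threads, smemBytes>>>(d_in, d_out);

        // The block sums become the input of the next launch.
        int *tmp = d_in; d_in = d_out; d_out = tmp;
        n = blocks;
    }
    // After the final swap, d_in[0] holds the total sum.
}
```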
5
Optimization Goal
Reach GPU peak performance:
– GFLOP/s: for compute-bound kernels
– Bandwidth: for memory-bound kernels
Reductions have low arithmetic intensity:
– 1 flop per element loaded
– So the target is peak bandwidth!
6
Reduction 1: Interleaved Addressing
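The first version, following NVIDIA's reduction slides (`int` data assumed here):

```cuda
// Kernel 1: interleaved addressing. Each thread loads one element
// into shared memory, then pairs are summed with a doubling stride.
// The (tid % (2*s) == 0) test makes the branching highly divergent.
__global__ void reduce0(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];  // block's partial sum
}
```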
8
Hardware
My computer: an NVIDIA GeForce 8500 GT
9
Performance for 4M element reduction
NVIDIA's results: see the original NVIDIA slides. On my computer, the same code gives:

Kernel  Time (ms)  Bandwidth (GB/s)
1       53         0.31
10
Reduction 2: Interleaved Addressing
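In NVIDIA's second version, only the inner loop changes relative to kernel 1: the divergent modulo test is replaced by a strided index.

```cuda
// Kernel 2: same interleaved addressing, but the divergent modulo
// branch is replaced by a strided index, so the active threads stay
// contiguous within warps. Shared-memory bank conflicts appear instead.
for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    int index = 2 * s * tid;
    if (index < blockDim.x)
        sdata[index] += sdata[index + s];
    __syncthreads();
}
```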
11
Performance for 4M element reduction
NVIDIA's results: see the original NVIDIA slides. On my computer:

Kernel  Time (ms)  Bandwidth (GB/s)  Step speedup  Cumulative speedup
1       53         0.31
2       25         0.66              2.12          2.12
12
Reduction 3: Sequential Addressing
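NVIDIA's third version again changes only the loop: the stride now starts at half the block size and halves each step.

```cuda
// Kernel 3: sequential addressing. Each active thread tid adds the
// element s positions away, with s halving every step. Accesses are
// contiguous, so shared-memory bank conflicts are eliminated.
for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s)
        sdata[tid] += sdata[tid + s];
    __syncthreads();
}
```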
13
Performance for 4M element reduction
NVIDIA's results: see the original NVIDIA slides. On my computer:

Kernel  Time (ms)  Bandwidth (GB/s)  Step speedup  Cumulative speedup
1       53         0.31
2       25         0.66              2.12          2.12
3       12         1.37              2.01          4.42
14
Reduction 3: Sequential Addressing
15
Reduction 4: First Add During Load
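In NVIDIA's fourth version only the load changes: half as many blocks are launched, and each thread sums two input elements while loading.

```cuda
// Kernel 4: first add during load. With sequential addressing, half
// the threads were idle on the very first loop iteration. Instead,
// launch half as many blocks and do the first add while loading.
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];
__syncthreads();
// ...then reduce sdata with kernel 3's sequential-addressing loop.
```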
16
Performance for 4M element reduction
NVIDIA's results: see the original NVIDIA slides. On my computer:

Kernel  Time (ms)  Bandwidth (GB/s)  Step speedup  Cumulative speedup
1       53         0.31
2       25         0.66              2.12          2.12
3       12         1.37              2.01          4.42
4       6.85       2.44              1.75          7.74
17
Instruction Bottleneck
Address arithmetic and loop overhead:
– At 17 GB/s, still far from bandwidth bound.
– The remaining cost is ancillary instructions: ones that are not loads, stores, or arithmetic for the core computation.
Strategy: unroll loops.
– When s <= 32, only one warp is left.
– Instructions are SIMD-synchronous within a warp.
– So the if (tid < s) test is unnecessary there.
– Unroll the last 6 iterations.
18
Reduction 5: Unroll the Last Warp
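Following NVIDIA's fifth version, the last 6 iterations move into an unrolled warp-level helper (on the G80-era hardware discussed here, a warp executes in SIMD lockstep, so no __syncthreads() is needed inside it):

```cuda
// Kernel 5: unroll the last warp. For s <= 32 only one warp is
// active, so neither __syncthreads() nor the (tid < s) test is
// needed. 'volatile' stops the compiler from caching sdata values
// in registers between the unrolled adds.
__device__ void warpReduce(volatile int *sdata, unsigned int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

// The main loop now stops at s = 32 and hands off to warpReduce:
for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
    if (tid < s)
        sdata[tid] += sdata[tid + s];
    __syncthreads();
}
if (tid < 32) warpReduce(sdata, tid);
```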
19
Performance for 4M element reduction
NVIDIA's results: see the original NVIDIA slides. On my computer:

Kernel  Time (ms)  Bandwidth (GB/s)  Step speedup  Cumulative speedup
1       53         0.31
2       25         0.66              2.12          2.12
3       12         1.37              2.01          4.42
4       6.85       2.44              1.75          7.74
5       4.1        -                 1.67          13
20
Further optimization
– Complete unrolling
– Multiple adds per thread
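These last two steps can be sketched from NVIDIA's slides (a sketch only, since the work plan covers studying them in detail): the block size becomes a compile-time template parameter, so every size test resolves at compile time, and each thread first sums many grid-strided elements.

```cuda
// Sketch of NVIDIA's final kernels: complete unrolling via the
// blockSize template parameter, plus a grid-stride load loop so
// each thread performs multiple adds. Assumes blockSize <= 512
// and n a multiple of 2*blockSize; warpReduce is the unrolled
// warp loop from kernel 5.
template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockSize * 2) + tid;
    unsigned int gridSize = blockSize * 2 * gridDim.x;

    sdata[tid] = 0;
    while (i < n) {                        // multiple adds per thread
        sdata[tid] += g_idata[i] + g_idata[i + blockSize];
        i += gridSize;
    }
    __syncthreads();

    // Completely unrolled reduction: each 'if' is evaluated at
    // compile time, leaving only the needed steps in the binary.
    if (blockSize >= 512) { if (tid < 256) sdata[tid] += sdata[tid + 256]; __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) sdata[tid] += sdata[tid + 128]; __syncthreads(); }
    if (blockSize >= 128) { if (tid < 64)  sdata[tid] += sdata[tid + 64];  __syncthreads(); }
    if (tid < 32) warpReduce(sdata, tid);
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
```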
21
Other work
– Read two papers on matrix multiplication.
– Began reading books on parallel computing.
22
Work plan
– Study the last two reduction kernels.
– Re-read the CUDA programming guide.
23
Thanks