Published by Gilbert Marshall. Modified over 8 years ago.
1
An Insightful and Quantitative Performance Optimization Chain for GPUs
Jia Haipeng
Institute of Software, Chinese Academy of Sciences
2
Motivation
- Modern GPU architectures
  - Increasingly diversified: NVIDIA GPUs, AMD GPUs
- Optimizing GPU kernels
  - A challenging task that requires detailed knowledge of the underlying hardware
  - Explicit parallelization and an explicit memory hierarchy
  - Performance portability of GPU programs is increasingly difficult for common programmers
- Challenges for programmers with limited GPU hardware knowledge
  - Implementing high-performance GPU kernels directly
  - Identifying performance bottlenecks
  - Choosing which optimizations to adopt, and in what order
3
OpenCL and GPU Architecture
- OpenCL (Open Computing Language)
  - An open industry standard for general-purpose parallel programming across platforms
  - Provides portable and efficient access to the power of heterogeneous computing platforms
- GPU
  - Two major GPU vendors: NVIDIA and AMD
  - They adopt different architectures
  - They share some architectural similarities
4
NVIDIA and AMD GPU: Architecture Similarities
- Hierarchical architecture
  - GPU -> compute unit -> processing cores
- Hierarchical memory model
  - Off-chip memory (global and constant memory)
  - On-chip memory (local memory, cache and registers)
- Programming model
  - STMD (Single Thread Multiple Data)
  - Multi-threading scheduling unit (warp or wavefront)
- Thread organization
  - Work-item (thread) -> work-group (block) -> grid
- Scheduling strategy
  - Zero-overhead scheduling
  - Warps/wavefronts execute interleaved to tolerate intra-warp stalls
5
NVIDIA and AMD GPU: Architecture Differences
- Design of processing cores and register files
  - NVIDIA GPU: scalar architecture
  - AMD GPU: vector architecture
- These differences lead to different program optimization techniques
- The work focuses on the NVIDIA C2050 GPU and the AMD HD 5850 GPU
6
Memory-bound and Computation-bound Kernels
- Computation intensity
  - Definition: number of single-precision floating-point operations performed per byte of off-chip memory traffic
  - Kernel computation intensity
    - Total amount of computation divided by the total amount of data transferred from off-chip memory
  - Hardware computation intensity
    - Throughput of arithmetic instructions divided by throughput of memory access instructions
    - For simplicity, use peak performance divided by peak memory bandwidth
- GPU kernels
  - Memory-bound kernel
    - Kernel computation intensity < hardware computation intensity
    - The most effective optimization is to improve the utilization of memory bandwidth
  - Computation-bound kernel
    - Kernel computation intensity > hardware computation intensity
    - The most effective optimization is to improve the utilization of computing resources
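The classification on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the peak figures are the ones listed in the GPU configuration table of this talk (1030 GFlops and 144 GB/s for the C2050, 2090 GFlops and 128 GB/s for the HD 5850). It follows the usual Roofline reading, in which a kernel whose intensity lies below the hardware intensity is limited by memory bandwidth.

```python
def kernel_intensity(operations, off_chip_bytes):
    """Operations performed per byte moved to or from off-chip memory."""
    return operations / off_chip_bytes


def hardware_intensity(peak_gflops, peak_bandwidth_gbs):
    """Simplified hardware intensity: peak performance over peak bandwidth."""
    return peak_gflops / peak_bandwidth_gbs


def classify(kernel_ci, hardware_ci):
    """A kernel below the hardware intensity cannot keep the cores busy."""
    return "memory-bound" if kernel_ci < hardware_ci else "computation-bound"


# Peak figures taken from the configuration table of the talk.
c2050 = hardware_intensity(1030.0, 144.0)
hd5850 = hardware_intensity(2090.0, 128.0)

# Hypothetical kernel: 8 operations for every 8 bytes of off-chip traffic.
ci = kernel_intensity(8, 8)
print(classify(ci, c2050), classify(ci, hd5850))
```

With these figures the hardware intensity is roughly 7.2 operations per byte on the C2050 and roughly 16.3 on the HD 5850, so a kernel needs a higher intensity on the AMD part before it stops being memory-bound.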
7
Performance Optimization Chain
- Threshold chain
  - Utilization of off-chip memory bandwidth
- Tradeoff chain
  - Utilization of computation resources
- Data locality
- Each architecture and kernel requires a different balance among these three
- Performance depends on how well the kernel characteristics map onto the hardware characteristics of the architecture
8
Threshold Chain
- A set of optimization methods that improve the utilization of off-chip memory bandwidth
- Optimization space
  - Eliminating channel conflict (ECC)
    - Continuous memory access
  - Reducing memory transactions (RMT)
    - Continuous and aligned memory access
    - Vector memory access
  - Using FastPath (UFP, for AMD GPU only)
    - Vector memory access
    - AMD has CompletePath and FastPath
- These performance aspects must be satisfied or mitigated in order to achieve good performance
9
Threshold Chain
- Comparison of the utilization of off-chip memory bandwidth with different vector lengths (figure panels: NVIDIA C2050 GPU, AMD HD5850 GPU)
10
Threshold Chain
- Comparison of the utilization of off-chip memory bandwidth with various strides and offsets (figure panels: NVIDIA C2050 GPU, AMD HD5850 GPU)
- Resulting threshold chain
  - NVIDIA C2050 GPU: continuous -> alignment -> vector
  - AMD HD5850 GPU: continuous -> vector -> alignment
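Why stride and offset matter can be illustrated with a toy transaction counter. The model below is an assumption made for illustration, not the measurement method of the talk: it supposes 32 threads per scheduling unit, 4-byte elements, and memory served in aligned 128-byte segments, and it counts how many distinct segments one scheduling unit touches.

```python
def transactions(stride, offset, elem_size=4, threads=32, segment=128):
    """Count the aligned segments touched by one scheduling unit.

    stride is measured in elements, offset in bytes. All default
    parameters are illustrative assumptions, not vendor-specified values.
    """
    touched = set()
    for t in range(threads):
        addr = offset + t * stride * elem_size
        first = addr // segment
        last = (addr + elem_size - 1) // segment
        touched.update(range(first, last + 1))
    return len(touched)


for stride, offset in [(1, 0), (1, 4), (2, 0), (32, 0)]:
    print(stride, offset, transactions(stride, offset))
```

Under these assumptions a continuous, aligned access needs one segment, a 4-byte misalignment doubles that, and a stride of 32 elements places every thread in its own segment.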
11
Tradeoff Chain
- A set of optimization methods that help make full use of computation resources
- It is not clear in advance whether a particular performance aspect should be maximized or minimized for an application on a given architecture
- The chain only provides insights for performance improvement, not accurate predictions
- Optimization space
  - Improving thread-level parallelism (TLP)
  - Improving instruction-level parallelism (ILP)
  - Reducing dynamic instruction count per thread (RDIC)
  - Instruction selection optimizations (INS)
12
Tradeoff Chain
- Comparison of performance with different ILP (figure panels: NVIDIA C2050 GPU, AMD HD5850 GPU)
- Only one work-group runs on each compute unit
- Block sizes and ILP are varied
13
Data Locality
- Data locality
  - Computation is cheap, data movement is expensive
  - Maximize locality to minimize data movement
- Computation intensity wall
  - Computation intensity can constrain performance like a wall
  - Improve data locality to increase computation intensity
- Optimization methods
  - Storing read-only data in cache or constant memory
  - Improving data reuse
  - Loop reordering
  - Rewriting data structures
  - Data padding
14
Insightful Optimization Chain
- The Roofline model is used to make the optimization chain insightful (figure panels: NVIDIA C2050 GPU, AMD HD5850 GPU)
- The computation intensity of a kernel determines its optimization region
- Each node of the optimization chain suggests the corresponding method
- The order of the nodes suggests the optimization order
- The ridge point marks the minimum computation intensity required to achieve peak performance
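The Roofline bound itself is a one-line formula: attainable performance is the smaller of the peak performance and the product of computation intensity and peak bandwidth. A minimal sketch follows, with hypothetical function names and the peak figures from the configuration table of this talk.

```python
def ridge_point(peak_gflops, peak_bandwidth_gbs):
    """Minimum intensity (operations per byte) at which peak is reachable."""
    return peak_gflops / peak_bandwidth_gbs


def attainable_gflops(intensity, peak_gflops, peak_bandwidth_gbs):
    """Roofline bound: limited by bandwidth left of the ridge, by peak right of it."""
    return min(peak_gflops, intensity * peak_bandwidth_gbs)


# (peak GFlops, peak GB/s) as listed in the configuration table.
GPUS = {"NVIDIA C2050": (1030.0, 144.0), "AMD HD 5850": (2090.0, 128.0)}

for name, (peak, bandwidth) in GPUS.items():
    bounds = [attainable_gflops(ci, peak, bandwidth) for ci in (1.0, 8.0, 32.0)]
    print(name, round(ridge_point(peak, bandwidth), 2), bounds)
```

A kernel whose intensity lies left of the ridge point falls in the bandwidth-limited region, where the threshold chain and data locality apply first. To the right of the ridge point the bound is the peak itself, and the tradeoff chain becomes the relevant one.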
15
Experimental Evaluation

Configuration of GPUs

                    NVIDIA C2050    AMD HD 5850
Clock rate          1.15 GHz        0.725 GHz
#PEs                448             288
#CUs                14              18
Peak perf.          1030 GFlops     2090 GFlops
Memory              3.0 GB          1.0 GB
Peak bandwidth      144 GB/s        128 GB/s
#Registers/CU       16K (only one value survives in the transcript)
#Local memory/CU    48K             32K
SDK version         SDK 4.1         SDK 2.6

Case studies
- Matrix Transpose
- Laplace Transform
- Image Integral
16
1. Matrix Transpose
- Algorithm
  - Input and output matrices reside at separate memory locations
  - Offset by 4 bytes to test the performance impact of alignment
  - Computation intensity on Char is 2 * 4 / 8 = 1
- Optimization chain
  - Char4 instead of Char, using FastPath on AMD GPU
  - Using local memory to re-map threads to tile elements
  - Diagonal block reordering (eliminating channel conflict)
  - Setting the offset value to 0
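Two steps of this chain, staging a tile in local memory and diagonal block reordering, can be mimicked on the CPU. The sketch below is a plain Python stand-in rather than the kernel from the talk: the function name is hypothetical, the matrix is assumed square with a side divisible by the tile size, and the diagonal mapping is the commonly used one in which work-group (bx, by) processes tile ((bx + by) mod grid, bx).

```python
def transpose_tiled_diagonal(src, tile):
    """Transpose a square matrix tile by tile, visiting tiles diagonally."""
    n = len(src)
    assert n % tile == 0, "side length must be a multiple of the tile size"
    grid = n // tile
    dst = [[0] * n for _ in range(n)]
    for by in range(grid):
        for bx in range(grid):
            # Diagonal reordering: consecutive work-group ids land on
            # different tile columns instead of marching along one row.
            tile_y = bx
            tile_x = (bx + by) % grid
            # Stage the source tile (stand-in for GPU local memory).
            local = [[src[tile_y * tile + r][tile_x * tile + c]
                      for c in range(tile)] for r in range(tile)]
            # Write the tile back transposed, row by row.
            for r in range(tile):
                for c in range(tile):
                    dst[tile_x * tile + r][tile_y * tile + c] = local[c][r]
    return dst


matrix = [[row * 4 + col for col in range(4)] for row in range(4)]
result = transpose_tiled_diagonal(matrix, 2)
print(result)
```

The remapping is a bijection over the grid, so every tile is still processed exactly once; only the visiting order changes, which is what spreads concurrent accesses across memory channels on real hardware.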
17
1. Matrix Transpose
- The bottleneck is off-chip memory channel conflict
- Vector memory access improves performance more on the AMD HD5850 GPU than on the NVIDIA C2050 GPU
- Alignment has an important influence on performance on the NVIDIA GPU
- Performance improved by 26.1 and 42.4 times on the AMD GPU and the NVIDIA GPU respectively
18
2. Laplace Transform
- The Laplace transform calculates the Laplace value of the source matrix by adding up the second x and y derivatives, calculated using the Laplacian
- Computation intensity is 67 / (9 * 4) = 1.8
  - The calculation of each element performs 67 operations: 8 additions + 9 multiplications + 10 * 4 address calculations + 9 iterations
  - The calculation of each element needs 9 operands
- After data locality improvement, computation intensity increased to 12.6
  - Utilizing local memory
  - Moving the Laplacian matrix to constant memory
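The per-element work counted on this slide corresponds to a 3x3 stencil. The sketch below is a CPU reference in Python, not the talk's kernel. The coefficient matrix is the standard five-point Laplacian, which is an assumption here because the slides do not list the coefficients, and border elements are simply left at zero.

```python
# Assumed coefficients: the standard five-point Laplacian stencil.
LAPLACIAN = [[0.0, 1.0, 0.0],
             [1.0, -4.0, 1.0],
             [0.0, 1.0, 0.0]]


def laplace(src):
    """Apply the 3x3 stencil to every interior element of a 2D list."""
    height, width = len(src), len(src[0])
    dst = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            acc = 0.0
            # Nine neighbours are read and nine products are accumulated
            # for each output element.
            for ky in range(3):
                for kx in range(3):
                    acc += LAPLACIAN[ky][kx] * src[y + ky - 1][x + kx - 1]
            dst[y][x] = acc
    return dst


# For f(x, y) = x*x + y*y the discrete Laplacian is 4 at every interior point.
field = [[float(x * x + y * y) for x in range(5)] for y in range(5)]
out = laplace(field)
print(out[2][2])
```

Neighbouring output elements share most of their nine inputs, which is why staging tiles in local memory raises the computation intensity: each value fetched from off-chip memory is reused by several work-items.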
19
2. Laplace Transform
- Increasing data locality can improve performance significantly
- Increasing ILP has a tradeoff impact on performance because of the register restriction
- Using ILP is more effective on the AMD GPU than on the NVIDIA GPU
- Performance improved by 14.1 and 7.8 times on the AMD GPU and the NVIDIA GPU respectively
20
3. Image Integral
- Rapid feature evaluation, for example in face detection
- Implemented by running the scan algorithm twice, first on the matrix rows and then on the columns
- Two phases (figure panels: up-sweep phase, down-sweep phase)
  - Up-sweep phase: traverses the tree from the leaves to the root, computing partial sums at the internal nodes of the tree
  - Down-sweep phase: traverses back down the tree from the root, using the partial sums computed by the up-sweep phase to build the scan in place on the array
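The two phases can be written out sequentially to show what each one does. The sketch below is a Python rendering of the classic work-efficient tree-based scan, not the talk's kernel: the function names are hypothetical, the input length is assumed to be a power of two, and because this scheme yields an exclusive scan, the input is added back to obtain the inclusive sums an integral image needs.

```python
def exclusive_scan(values):
    """Tree-based scan: an up-sweep followed by a down-sweep."""
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    data = list(values)
    # Up-sweep: build partial sums from the leaves towards the root.
    step = 1
    while step < n:
        for i in range(2 * step - 1, n, 2 * step):
            data[i] += data[i - step]
        step *= 2
    # Down-sweep: clear the root, then push prefixes back towards the leaves.
    data[n - 1] = 0
    step = n // 2
    while step >= 1:
        for i in range(2 * step - 1, n, 2 * step):
            left = data[i - step]
            data[i - step] = data[i]
            data[i] += left
        step //= 2
    return data


def inclusive_scan(values):
    """Inclusive prefix sums derived from the exclusive scan."""
    return [e + v for e, v in zip(exclusive_scan(values), values)]


def integral_image(image):
    """Scan every row, then every column of the row-scanned result."""
    rows = [inclusive_scan(row) for row in image]
    cols = [inclusive_scan(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


print(exclusive_scan([1, 2, 3, 4]))
print(integral_image([[1, 2], [3, 4]]))
```

On a GPU, each inner `for` loop corresponds to one parallel step in which every index is handled by a different work-item, so the number of parallel steps grows with the logarithm of the row length.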
21
3. Image Integral
- Increasing data locality can improve performance significantly
- A work-efficient algorithm can improve performance by reducing algorithmic complexity
- Increasing ILP has a tradeoff impact on performance because of the register restriction
- Performance improved by 8.2 and 7.1 times on the AMD GPU and the NVIDIA GPU respectively
22
Conclusions
- Proposed an insightful and quantitative optimization chain for both NVIDIA and AMD GPUs
- The performance optimization chain captures almost all the primary performance factors
- With the help of this optimization chain, common programmers can write high-performance kernels directly and easily
23
Thank You! Q&A