The Problem
Finding a needle in a haystack:
- An expert (CPU)
- A group of non-experts (GPU)
Micro-benchmarking GPU micro-architectures
Suhas Thejaswi Muniyappa
Department of Computer Science, Aalto University
Overview
- Micro-processor trend
  - CPU micro-processor trend
  - GPU micro-processor trend
- Micro-benchmarking
  - Pointer chase
  - Fine-grain pointer chase
  - Piecewise linear fine-grain pointer chase
  - Hardware characteristics
CPU micro-processor trend
- Hardware support for advanced instructions.
- Hardware documentation is available.
- Expensive hardware.
- No significant change in per-core performance over the past decade.
- Execution must be parallelized to achieve speedup.
GPU micro-processor trend
- Low hardware cost.
- High arithmetic throughput and memory bandwidth.
- Thousands of cores.
- Built for processing graphics.
- No hardware support for advanced instructions.
- Limited documentation of the memory hierarchy.
- How can the limitations of GPUs be overcome?
Micro-benchmarking
- Hacking into the system to reveal hardware details.
- Using access latency to determine the hardware architecture.
- Details of the memory system are necessary to achieve optimal hardware performance.
Pointer chase
- Benchmarking approach for CPUs by Saavedra and Smith (1995).
- Each array element is initialized with the index of the next memory access.
- Access latency depends on the stride size.
- Only the average memory access latency is recorded.
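A minimal CPU-side sketch of the pointer-chase idea, written in Python for readability. The function names `build_chain` and `average_latency_ns` are illustrative assumptions, not taken from the cited work. In Python the interpreter overhead dominates the timing, so a real benchmark would be written in C or CUDA; the sketch only shows how the array is initialized and why each access depends on the previous one.

```python
import time


def build_chain(num_elements, stride):
    """Element i holds the index of the next access: i + stride, wrapping around."""
    return [(i + stride) % num_elements for i in range(num_elements)]


def average_latency_ns(chain, num_accesses):
    """Chase the chain and return (average ns per access, final index)."""
    j = 0
    start = time.perf_counter_ns()
    for _ in range(num_accesses):
        # The next index is only known once the current element has been read,
        # so the accesses cannot be overlapped or reordered.
        j = chain[j]
    elapsed = time.perf_counter_ns() - start
    return elapsed / num_accesses, j


if __name__ == "__main__":
    chain = build_chain(num_elements=1 << 16, stride=16)
    avg, _ = average_latency_ns(chain, num_accesses=100_000)
    print(f"average latency per access: {avg:.1f} ns (includes interpreter overhead)")
```

Repeating the run for several array sizes and strides gives the latency-versus-stride data that the classic method analyzes.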
Fine-grain pointer chase
- Record and analyze the latency of every memory access.
- Mei and Chu (2016) designed fine-grain benchmarks for GPUs.
- Access latencies are stored in shared memory.
- Shared memory is not sufficient for large arrays.
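The fine-grain variant can be sketched in the same way: every access is timed individually and the result is written to a fixed-size buffer. In this hedged Python sketch the list `latencies` stands in for GPU shared memory and `buffer_capacity` for its size; both names, and the function `fine_grain_chase`, are assumptions made for illustration. On a GPU the timestamps would come from a device clock rather than `time.perf_counter_ns`.

```python
import time


def fine_grain_chase(chain, num_accesses, buffer_capacity):
    """Time each access separately; return (visited indices, per-access latencies in ns)."""
    if num_accesses > buffer_capacity:
        # Mirrors the limitation on the slide: the buffer cannot hold
        # one latency per access once the experiment gets large.
        raise ValueError("latency buffer too small for the requested accesses")
    latencies = [0] * num_accesses  # stands in for shared memory
    visited = [0] * num_accesses
    j = 0
    for k in range(num_accesses):
        t0 = time.perf_counter_ns()
        j = chain[j]  # dependent load
        t1 = time.perf_counter_ns()
        latencies[k] = t1 - t0
        visited[k] = j
    return visited, latencies


if __name__ == "__main__":
    chain = [(i + 32) % 4096 for i in range(4096)]
    visited, latencies = fine_grain_chase(chain, num_accesses=256, buffer_capacity=1024)
    print(list(zip(visited, latencies))[:5])
```

Keeping the visited index next to each latency is what makes it possible to tell which individual accesses were slow.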
Piecewise fine-grain pointer chase
- After each iteration, the shared memory contents are written to disk storage.
- A sliding-window approach is used to record the access latencies.
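A rough sketch of the sliding-window idea, under the assumption that the chase is run one window at a time and the small buffer is flushed to a file after each window. The function name `piecewise_chase`, the `window` parameter, and the output format (one "access index, latency" pair per line) are all hypothetical choices for this illustration.

```python
import time


def piecewise_chase(chain, num_accesses, window, out_path):
    """Chase the chain in windows; flush per-access latencies to disk after each window."""
    j = 0
    buf = [0] * window  # small fixed buffer, standing in for shared memory
    with open(out_path, "w") as out:
        for base in range(0, num_accesses, window):
            count = min(window, num_accesses - base)
            for k in range(count):
                t0 = time.perf_counter_ns()
                j = chain[j]  # the chase resumes where the last window stopped
                buf[k] = time.perf_counter_ns() - t0
            # Flush this window so the buffer can be reused for the next one.
            for k in range(count):
                out.write(f"{base + k} {buf[k]}\n")
    return j


if __name__ == "__main__":
    chain = [(i + 32) % 65536 for i in range(65536)]
    piecewise_chase(chain, num_accesses=10_000, window=512, out_path="latencies.txt")
```

Because only `window` latencies are held at any time, the number of recorded accesses is no longer bounded by the buffer size.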
Hardware characteristics
L1 cache
- The hardware characteristics are deduced from the access latencies.
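One simple way to turn latencies into a hardware parameter is to look for the array size at which the latency jumps. The sketch below is a threshold heuristic assumed for illustration, not the procedure from the cited papers; the function name `estimate_capacity`, the `jump_ratio` parameter, and the numbers in the usage example are made up and are not measurements.

```python
def estimate_capacity(measurements, jump_ratio=1.5):
    """Return the largest array size before the first latency jump, or None.

    measurements: list of (array_size_bytes, latency) pairs sorted by size.
    A jump is any step where the latency grows by more than jump_ratio.
    """
    for (size_prev, lat_prev), (_, lat_next) in zip(measurements, measurements[1:]):
        if lat_next > jump_ratio * lat_prev:
            return size_prev
    return None


if __name__ == "__main__":
    # Synthetic, illustrative values only (size in bytes, latency in cycles).
    synthetic = [(4096, 30), (8192, 30), (16384, 31), (32768, 120), (65536, 121)]
    print(estimate_capacity(synthetic))
```

With fine-grain data the same idea can be applied per access, for example to see which accesses in a sequence miss the cache.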
Summary
- GPUs can be used for general-purpose computations.
- GPUs provide an environment for executing algorithms that scale.
- Details of the memory system are necessary to achieve optimal hardware performance.
- Benchmarking reveals hardware characteristics that are not disclosed by the hardware manufacturers.
References
[1] Mei, X., and Chu, X. Dissecting memory hierarchy through microbenchmarking. IEEE Transactions on Parallel and Distributed Systems, Preprint, 99 (2016), 1.
[2] Mei, X., Zhao, C., and Chu, X. Benchmarking the memory hierarchy of modern GPUs. Network and Parallel Computing: 11th IFIP WG 10.3 International Conference Proceedings (NPC) (2014), 144-156.
[3] Saavedra, R.H., and Smith, A.J. Measuring cache and TLB performance and their effect on benchmark runtimes. IEEE Transactions on Computers 44, 10 (1995), 1223-1235.
[4] Saavedra, R.H. CPU performance evaluation and execution time prediction using narrow spectrum benchmarking. PhD thesis, University of California, Berkeley, 1992.
Questions?
Thank you