Performance and Power Analysis on ATI GPU: A Statistical Approach Ying Zhang, Yue Hu, Bin Li, and Lu Peng Department of Electrical and Computer Engineering.


Performance and Power Analysis on ATI GPU: A Statistical Approach
Ying Zhang, Yue Hu, Bin Li, and Lu Peng
Department of Electrical and Computer Engineering, Louisiana State University, LA, USA

GPUs are important nowadays
–Entertainment: sophisticated computer games, high-definition videos
–Scientific computation: biology, aerography, astronomy, and lots of other domains

Prior studies on GPUs
Performance
–[1] [2] explore the Nvidia GTX 280 using microbenchmarks
–[3] [4] analyze GPU performance with well-built models
Power & Energy
–[5] introduces an integrated model for performance and power analysis
–[6] predicts power from performance metrics
–[7] [8] investigate the energy efficiency of different computing platforms

Our study
–Most previous work focuses on Nvidia's designs; ATI GPUs are different. Can we obtain new findings?
–Our target: a recent ATI GPU
–Microbenchmarking-based studies usually focus on a few well-known components
–Our flow: GPU performance/power profile → statistical analysis tool → overall picture → microbenchmarks → detailed investigation of key factors

Contributions
Correlating the computation throughput and performance metrics
–Relative importance of different metrics
–Partial dependence between the throughput and the metrics
Identifying decisive factors in GPU power consumption
–Finding variables that have a significant impact on GPU power
Extracting instructive principles
–Proposing possible solutions for software optimization
–Pointing out hardware components that need to be further upgraded
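The partial-dependence idea above can be sketched as follows: clamp one metric to each value on a grid and average the model's predictions. The model, counter names, and data below are hypothetical stand-ins, not the paper's actual GPU profiles.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Hypothetical profile: 100 kernels, 3 performance counters in [0, 1]
X = rng.uniform(0, 1, size=(100, 3))
# Synthetic "throughput" dominated by counter 0
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, 100)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

def partial_dependence(model, X, feature, grid):
    """Average prediction with one feature clamped to each grid value."""
    pd = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature] = v
        pd.append(model.predict(Xv).mean())
    return np.array(pd)

grid = np.linspace(0, 1, 5)
pd0 = partial_dependence(model, X, 0, grid)
# For the dominant counter, partial dependence rises across the grid
```

A rising curve for one counter and a flat one for another is exactly the kind of evidence the paper uses to rank metrics.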

Target GPU: ATI Radeon HD 5870
(architecture diagram: SIMD engines, each composed of thread processors)

Random Forest Model
–Accurately captures the decisive factors among numerous input variables
–Ensemble model consisting of several regression trees
–Provides useful analysis tools: relative variable importance, partial dependence plots
–Validated with leave-one-out cross-validation: repeatedly choose one sample for validation and the others for training
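A minimal sketch of the modeling loop described above, using synthetic data (the counters, kernel counts, and coefficients are invented for illustration): leave-one-out cross-validation of a random-forest regressor, then variable importances from a forest trained on all samples.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(1)
# Hypothetical profile: 30 kernels, 4 counters; counter 0 drives "throughput"
X = rng.uniform(0, 1, size=(30, 4))
y = 4.0 * X[:, 0] + 0.3 * X[:, 2] + rng.normal(0, 0.05, 30)

# Leave-one-out cross-validation: each sample is held out exactly once
errors = []
for train_idx, test_idx in LeaveOneOut().split(X):
    rf = RandomForestRegressor(n_estimators=50, random_state=0)
    rf.fit(X[train_idx], y[train_idx])
    errors.append(abs(rf.predict(X[test_idx])[0] - y[test_idx][0]))
mean_abs_error = float(np.mean(errors))

# Relative variable importance from a forest trained on all samples
rf_full = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
importances = rf_full.feature_importances_  # sums to 1; counter 0 should dominate
```

The importance vector is what ranks the metrics; the LOO error indicates whether the forest generalizes well enough for that ranking to be trusted.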

Experiment setup
Testbed
–A computer equipped with an ATI Radeon HD 5870
–ATI Stream Profiler v2.1 integrated in MS Visual Studio 2010
Benchmarks
–OpenCL benchmarks from the ATI Stream SDK
Other equipment
–Yokogawa WT210 digital power meter

Overall Procedure
–Target system → performance profile → Random Forest → performance model
–Power meter → power consumption → Random Forest → power model

Performance Characterization

Make better use of the FastPath
Both paths are memory write paths:
–FastPath: efficient; supports non-atomic 32-bit operations
–CompletePath: much slower; supports atomic and other operations

Power Consumption Analysis

Case study on packing ratio
–Packing ratio: utilization of the 5-way VLIW processor (slots x, y, z, w, t)
–The kernel packing ratio can be tuned by changing the operations in the for loop
–Question: is a kernel with a higher packing ratio (vs., e.g., an 80% packing ratio) more power-consuming?
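As a rough illustration of the metric (not the paper's profiling tool), the packing ratio of a 5-way VLIW kernel is just the fraction of slots filled across its instruction bundles; the bundle stream below is hypothetical.

```python
# Hypothetical VLIW bundles: number of slots (out of 5) filled per instruction
bundles = [5, 4, 5, 3, 5, 4, 5, 5, 4, 5]

def packing_ratio(slot_counts, width=5):
    """Fraction of VLIW slots actually filled across the kernel."""
    return sum(slot_counts) / (width * len(slot_counts))

ratio = packing_ratio(bundles)  # 45 filled slots out of 50 -> 0.9
```

Padding the loop body with independent operations raises this ratio toward 1.0, which is how the case study varies it.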

Results
–Power increases linearly with the packing ratio
–The 4 ALUs consume the same power; the SFU consumes more
–Measured with 5 ADD operations; what if the SFU performs other operations?

Results – cont'd
–The SFU consumes identical power regardless of operation type
–Can we save energy?

Results – cont'd
–Reducing the usage of the SFU saves power, but performance is degraded
–The power reduction cannot compensate for the performance degradation
–SFU power should be further decreased (e.g., by reducing idle power)
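The energy argument above is simply E = P × t. With hypothetical numbers (not the paper's measurements), a power saving that slows the kernel down more than proportionally still costs energy:

```python
# Hypothetical numbers, not measured values from the paper
baseline_power, baseline_time = 100.0, 10.0  # watts, seconds
tuned_power, tuned_time = 90.0, 12.0         # 10% less power, 20% slower

baseline_energy = baseline_power * baseline_time  # 1000 J
tuned_energy = tuned_power * tuned_time           # 1080 J

# Energy went up: the power reduction did not compensate for the slowdown
energy_increased = tuned_energy > baseline_energy
```

This is why the slide concludes that lowering the SFU's own (especially idle) power is the more promising direction than avoiding the SFU in software.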

Hardware and Software Optimization
Performance
–Enhance special components (CompletePath & FastPath)
–Efficiently use data fetched from global memory
–Make best use of the FastPath
Power/Energy
–Optimize the SFU to reduce its power consumption
–Appropriately tune the workflow to reduce SFU usage

Summary
Performance characterization
–Relative importance of different metrics
–Partial dependence between the throughput and the metrics
Analysis of power consumption
–Finding variables that have a significant impact on GPU power
–Studying the differences between the FUs in the VLIW
Extracting instructive principles
–Proposing possible solutions for performance optimization and energy saving


References
[1] H. Wong, M. Papadopoulou, M. Alvandi, and A. Moshovos, "Demystifying GPU microarchitecture through microbenchmarking," in ISPASS.
[2] Y. Zhang and J. Owens, "A quantitative performance analysis model for GPU architectures," in HPCA.
[3] S. Baghsorkhi, M. Delahaye, S. Patel, W. Gropp, and W. Hwu, "An adaptive performance modeling tool for GPU architectures," in PPoPP.
[4] S. Hong and H. Kim, "An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness," in ISCA.
[5] S. Hong and H. Kim, "An integrated GPU power and performance model," in ISCA.
[6] H. Nagasaka, N. Maruyama, A. Nukada, T. Endo, and S. Matsuoka, "Statistical power modeling of GPU kernels using performance counters," in GreenComp.
[7] D. Ren and R. Suda, "Investigation on the power efficiency of multi-core and GPU processing element in large scale SIMD computation with CUDA," in GreenComp.
[8] M. Rofouei, T. Stathopoulos, S. Ryffel, W. Kaiser, and M. Sarrafzadeh, "Energy-aware high performance computing with graphics processing units," in HotPower.