An Integrated GPU Power and Performance Model (ISCA 2010)
Sunpyo Hong, Hyesoon Kim

Outline
- Motivation
- Integrated Performance and Power (IPP)
- Modules
- Results
- Conclusion

Motivation
- GPUs have become popular for running general-purpose applications
- GFLOPS are increasing significantly
- Cost: an increasing number of cores
- (Figure: GFLOPS growth across GPU generations, 2007-2010)

Motivation
- Power consumption has also been increasing (figure: GPU power across the 2007-2010 generations)
- What are the effects of the increased number of cores on power?
- What is the relationship with respect to performance?

IPP System
- We propose the Integrated Performance and Power system (IPP)
- Input: a GPU kernel; consumers: the programmer, the compiler, and the H/W dynamic power manager
- (Diagram: IPP modules: Performance Prediction (ISCA 2009), Power and Temperature Prediction, and Performance/Watt plus Number-of-Active-Cores Prediction)

Performance, Power, Perf / Power
- Which kinds of applications benefit from using fewer cores?
- (Figure: performance, power (W), and performance/watt vs. number of active cores for Type 1 and Type 2 applications)
- GOAL: maximize the performance/watt metric by using fewer cores for Type 2 applications

Outline
- Motivation
- Integrated Performance and Power (IPP)
- Modules
- Results
- Conclusion

IPP System Modules
- Performance Prediction
- Power and Temperature Prediction
- Performance/Watt and Number-of-Active-Cores Prediction
Next: the power and temperature prediction module.

Power Prediction Module
- Methodology is based on the empirical CPU power model (Isci'03)
- GPU_Power = Runtime_Power + Idle_Power
- Runtime_Power = Σ_components Runtime_Power_Component, where each component is AccessRate × MaxPower
  - AccessRate: how often the unit is accessed
  - MaxPower: the unit's maximum allowable power consumption
- Divide the GPU into architectural units and track which unit is active
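To make the structure concrete, here is a minimal Python sketch of the top-level equation; the unit names and numbers are placeholders for illustration, not values from the paper:

```python
IDLE_POWER = 25.0  # watts; a placeholder, not a measured value

def gpu_power(access_rates, max_powers, idle_power=IDLE_POWER):
    """GPU_Power = Idle_Power + sum over units of (AccessRate x MaxPower)."""
    runtime_power = sum(access_rates[unit] * max_powers[unit]
                        for unit in access_rates)
    return idle_power + runtime_power

# Hypothetical example: a kernel that mostly exercises the FP unit.
print(gpu_power({"FP": 0.6, "Global memory": 0.1},
                {"FP": 0.2, "Global memory": 52.0}))
```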

GPU Architectural Units
- Which architectural unit does each instruction access?
- There is no speculative execution, so each PTX instruction can be associated with a specific architectural unit:
  - All instructions -> FDS (Fetch/Decode/Schedule)
  - add_fp, sub_fp, mul_fp, fma_fp, neg_fp, min_fp, lg2_fp, ex2_fp, mad_fp, div_fp, abs_fp -> FP unit
  - sin_fp, cos_fp, rcp_fp, sqrt_fp, rsqrt_fp -> SFU
- (Figure: an SM containing fetch/decode/schedule, register file, INT units, FP units, SFU, texture cache, and constant cache, alongside memory and other logic)
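As a small data-structure illustration of this association (opcode spellings follow the slide; real PTX uses forms such as add.f32 or sin.approx.f32):

```python
# Hypothetical lookup following the slide's instruction-to-unit examples.
FP_OPS  = {"add_fp", "sub_fp", "mul_fp", "fma_fp", "mad_fp", "div_fp",
           "neg_fp", "min_fp", "abs_fp", "lg2_fp", "ex2_fp"}
SFU_OPS = {"sin_fp", "cos_fp", "rcp_fp", "sqrt_fp", "rsqrt_fp"}

def units_for(opcode):
    """Every instruction uses FDS (fetch/decode/schedule); FP and SFU
    instructions additionally exercise their execution unit."""
    units = {"FDS"}
    if opcode in FP_OPS:
        units.add("FP")
    elif opcode in SFU_OPS:
        units.add("SFU")
    return units

print(units_for("mad_fp"))  # {'FDS', 'FP'}
```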

Access Rate
- Runtime_Power_Component = AccessRate × MaxPower
- AccessRate: how often the architectural unit is accessed per unit of time
- AccessRate = (DAC_per_th × Warps_per_SM) / (Exec_cycles / 4)
  - DAC_per_th: data access count per thread, i.e., the total number of accesses to a specific unit
  - Warps_per_SM: active number of warps allocated per SM
  - Exec_cycles: predicted execution cycles (from the performance model)
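A direct transcription of the formula as a sketch (parameter names are mine):

```python
def access_rate(dac_per_thread, warps_per_sm, exec_cycles):
    """AccessRate = (DAC_per_th * Warps_per_SM) / (Exec_cycles / 4).

    dac_per_thread -- data access count to this unit per thread
    warps_per_sm   -- active number of warps allocated per SM
    exec_cycles    -- predicted execution cycles from the performance model
    """
    return (dac_per_thread * warps_per_sm) / (exec_cycles / 4.0)
```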

Max Power Parameters (I)
- Runtime_Power_Component = AccessRate × MaxPower
- MaxPower: allowable maximum power consumption per architectural unit
- Run microbenchmarks that each stress particular units (FP, SFU, INT, global memory, shared memory, texture cache, constant cache)
- Each microbenchmark i contributes a row of measured access rates AR_{i,unit} and a measured power P_i, giving the linear system AR × M = P
- Solve for the set of MaxPower values (M_A, M_B, ..., M_Z)
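The slide's matrix equation is an ordinary linear system. The authors do not say which solver they used; least squares is one natural choice, sketched here with hypothetical numbers:

```python
import numpy as np

# Rows = microbenchmarks, columns = architectural units (hypothetical data).
AR = np.array([[0.9, 0.0, 0.1],    # access rates measured for benchmark 1
               [0.1, 0.8, 0.1],    # ... benchmark 2
               [0.2, 0.1, 0.7]])   # ... benchmark 3
P = np.array([30.0, 45.0, 28.0])   # measured power per benchmark (watts)

# Solve AR @ M = P for the per-unit MaxPower vector M.
M, *_ = np.linalg.lstsq(AR, P, rcond=None)
print(M)  # estimated MaxPower values (M_A, M_B, M_C)
```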

Max Power Parameters (II)
Empirical power parameters:

  Unit(s)                              MaxPower
  FP                                   0.2
  REG                                  0.3
  ALU, SFU                             0.5
  INT                                  0.25
  FDS (Fetch/Dec/Sch), Shared memory   1
  Texture cache                        0.9
  Constant cache                       0.4
  Const_SM                             0.813
  Global memory, Local memory          52

(Figure: power consumption vs. model on GTX280)
The prediction error on the microbenchmarks is 2.7%.

Runtime Power
- GPU_Power = Σ Runtime_Power_Component + Idle_Power, with Runtime_Power_Component = AccessRate × MaxPower
- (Figure: an example GPU application accessing Fetch/Decode/Schedule, registers, the FP unit, and the memory system; the active units' runtime power components sum to 27 W)

Power vs. Number of Active SMs
- Power consumption also depends on the number of active SMs
- Modeled using a logarithmic function:
  Power_SMs = Max_SM × log10(α × Active_SMs + β), with α = (10 − β) / Num_SMs and β = 1.1
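A sketch of this scaling; note it is self-consistent in that with all SMs active the argument becomes exactly 10, so the full Max_SM power is reported:

```python
import math

def power_vs_active_sms(max_sm_power, active_sms, num_sms, beta=1.1):
    """Power_SMs = Max_SM * log10(alpha * Active_SMs + beta),
    with alpha = (10 - beta) / Num_SMs and beta = 1.1 (from the slide)."""
    alpha = (10.0 - beta) / num_sms
    return max_sm_power * math.log10(alpha * active_sms + beta)

print(power_vs_active_sms(60.0, active_sms=30, num_sms=30))  # 60.0
```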

Temperature Effects
- Temperature affects power consumption
- (Figure: GPU power (W) vs. temperature (°C); a power delta of 14 watts over a temperature delta of 22 degrees)
- The higher power consumption due to increased temperature is also modeled
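The slide gives only the measured deltas; one simple way to approximate the effect, purely as an illustration and not necessarily the paper's actual model, is a linear correction derived from those deltas:

```python
# Illustrative linear correction using the slide's reported deltas
# (~14 W over ~22 degrees C); the paper's temperature model may differ.
WATTS_PER_DEGREE = 14.0 / 22.0

def temperature_adjusted_power(base_power, temp_c, base_temp_c):
    """Scale power up as the chip heats beyond its baseline temperature."""
    return base_power + WATTS_PER_DEGREE * (temp_c - base_temp_c)
```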

IPP System Modules
- Performance Prediction
- Power and Temperature Prediction
- Performance/Watt and Number-of-Active-Cores Prediction
Next: the performance prediction module.

Performance Metrics
- MWP: metric of memory-level parallelism; the maximum number of warps that can overlap memory accesses (e.g., MWP = 2)
- CWP: computation warp parallelism; the number of warps that can execute during one memory access (waiting) period (e.g., CWP = 4)
- (Figure: warp execution timelines with memory waiting periods, illustrating MWP = 2 and CWP = 4)

Performance Predictions
- Case 1: MWP > CWP; performance is dominated by the computation cycles
- Case 2: CWP > MWP; performance is dominated by the memory cycles
- Case 3: not enough warps (N = CWP and N = MWP); under-utilized cycles
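A schematic of the case selection; the slide omits the full ISCA 2009 cycle equations, so this sketch only picks the dominant term:

```python
def dominant_cost(mwp, cwp, n_warps, comp_cycles, mem_cycles):
    """Pick the dominant cost term per the slide's three cases; the full
    ISCA 2009 model replaces these labels with detailed cycle equations."""
    if n_warps == mwp and n_warps == cwp:              # Case 3
        return "not enough warps: under-utilized cycles"
    if mwp > cwp:                                      # Case 1
        return f"computation-dominated (~{comp_cycles} cycles)"
    return f"memory-dominated (~{mem_cycles} cycles)"  # Case 2: CWP > MWP
```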

IPP System Modules
- Performance Prediction
- Power and Temperature Prediction
- Performance/Watt and Number-of-Active-Cores Prediction
Next: the performance/watt and number-of-active-cores prediction modules.

Performance / Watt Prediction
- Uses the outputs from the performance and power modules:
  Performance/Watt = performance prediction / power prediction
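In code the combination is a single division; a sketch (names are mine):

```python
def performance_per_watt(predicted_performance, predicted_power):
    """Performance/Watt = performance prediction / power prediction,
    e.g., a predicted throughput figure divided by predicted watts."""
    return predicted_performance / predicted_power
```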

Number of Active Cores Prediction
- First, check whether the kernel cannot saturate the bandwidth:
  1) Not enough warps to saturate the BW
  2) CWP < MWP: computationally intensive code
  3) MWP < MWP_peak_BW: does not reach the machine's peak BW
- If any condition is true, choose the maximum number of cores

Number of Active Cores Prediction (cont.)
- Otherwise, find the number of cores that saturates the bandwidth:
  Number of Cores = Memory_Bandwidth / (BW_per_warp × N)
  - Memory_Bandwidth: machine-specified bandwidth parameter (GB/s)
  - BW_per_warp: bandwidth each warp consumes (so BW_per_warp × N is the bandwidth each core uses)
  - N: number of warps allocated per core
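Combining this slide with the previous one, a hedged sketch of the whole core-count decision (all names are mine):

```python
import math

def active_cores_for_kernel(num_cores, mem_bandwidth, bw_per_warp,
                            warps_per_core, mwp, cwp, mwp_peak_bw,
                            enough_warps):
    """Return the number of active cores IPP would select.

    If the kernel cannot saturate bandwidth (any of the three conditions
    from the previous slide holds), use all cores; otherwise use just
    enough cores that bw_per_warp * warps_per_core * cores reaches the
    machine's memory bandwidth.
    """
    if (not enough_warps) or (cwp < mwp) or (mwp < mwp_peak_bw):
        return num_cores
    cores = mem_bandwidth / (bw_per_warp * warps_per_core)
    return min(num_cores, math.ceil(cores))
```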

Outline
- Motivation
- Integrated Performance and Power (IPP)
- Modules
- Results
- Conclusion

Benchmarks
Evaluated various GPGPU benchmarks, bandwidth-limited vs. non-bandwidth-limited:

  Benchmark          Description                              Bandwidth (GB/s)
  SVM                Kernel from an SVM-based algorithm       54.679
  Binomial (Bino)    American option pricing                  3.689
  Sepia              Filter for artificially aging images     12.012
  Convolve (Conv)    2D separable image convolution           16.208
  Blackscholes (Bs)  European option pricing                  51.033
  Cmem               Matrix add FP operations                 64.617
  Matrixmul (Nmat)   Naive version of matrix multiplication   123.33
  Dotp               Matrix dot product                       111.313
  Madd               Matrix multiply-add                      115.058
  Dmadd              Matrix double-memory multiply-add        109.996
  Mmul               Matrix single multiply                   114.997

Power Decomposition
- (Figure: power breakdown for each benchmark)
- The average power prediction error for GPGPU applications is 8.94%

Performance Per Watt
- (Figure: performance per watt vs. number of active cores, with the IPP-selected core counts marked)
- IPP achieves 85% of the energy savings of the best manual mapping

Energy Savings
- The measured runtime energy saving is 10.9%
- With power gating, the predicted energy saving is 25.8%

Outline
- Motivation
- Integrated Performance and Power (IPP)
- Modules
- Results
- Conclusion

Conclusions
- Introduced IPP (Integrated Power and Performance system) for GPU architectures and GPGPU kernels
- IPP extends the empirical CPU power model to GPUs
- Power prediction: for GPGPU benchmarks, the prediction error is 8.94%
- IPP predicts the number of cores that will achieve energy savings
- Saved 10.9% energy for bandwidth-limited applications by using fewer cores
- The estimated energy saving for a power-gated system is 25.8%

Thank you