The CRISP Performance Model for Dynamic Voltage and Frequency Scaling in a GPGPU. Rajib Nath, Dean Tullsen. MICRO 2015.

Presentation transcript:


Dynamic Voltage & Frequency Scaling (DVFS)
DVFS trades among power, energy, temperature, reliability, variability, and performance.
What is the performance impact of DVFS?

DVFS Performance Model for GPGPUs
 DVFS opportunities in GPGPUs
−GPGPU chips consume more power than CPU chips
−Provision for DVFS exists
−The voltage range is wide
−Recent research shows energy-saving opportunities
 Challenges: SIMD and SIMT

Outline
 DVFS Performance Models for CPUs
 Limitations of Existing Models
 Critical Stalled Path (CRISP)
−Model Components
−Model Parameterization
−Hardware Mechanism and Overhead
 Experiments
−Execution Time Prediction
−Energy Savings

DVFS Performance Models for CPUs
 Proportionate: execution time assumed to scale directly with frequency (×1, ×2)
 Sampling
 Empirical: β estimated from aggregate metrics, e.g., LLC miss counts; does not account for MLP
 Analytical

Existing Analytical Models for CPUs
 Stall counter [CF 2010]
 Miss model [CF 2010]
 Leading loads [TOC 2010]
 Critical path [Micro 2012]
 Fundamental assumptions
−T_memory does not scale with core frequency
−Cores never stall for stores
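The analytical CPU models above share one linear prediction step: compute cycles stretch with the clock period while memory time stays fixed; the models differ only in how they measure the memory portion. A minimal sketch of that shared step (function name and units are illustrative, not from the paper):

```python
def predict_time(t_compute_cycles, t_memory_cycles, f_base, f_target):
    """Linear DVFS prediction shared by stall/miss/leading-loads models.

    Compute cycles take proportionally longer wall time at a lower clock,
    while memory time (measured in f_base cycle units) is assumed
    frequency-invariant. The result is in f_base cycle units.
    """
    scale = f_base / f_target  # e.g., 700 MHz -> 350 MHz gives 2.0
    return t_compute_cycles * scale + t_memory_cycles

# Example: 27 compute cycles and 4 stall cycles measured at 700 MHz,
# predicted at half frequency:
print(predict_time(27, 4, 700e6, 350e6))  # -> 58.0
```

These numbers match the STALL-model row of the CRISP example later in the deck (4 + 27×2 = 58).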

Outline
 DVFS Performance Models for CPUs
 Limitations of Existing Models
 Critical Stalled Path (CRISP)
−Model Components
−Model Parameterization
−Hardware Mechanism and Overhead
 Experiments
−Execution Time Prediction
−Energy Savings

Limitations of CPU Models on GPGPUs
 Fundamental assumptions of CPU-based models
−T_memory does not scale with core frequency
−Cores never stall for stores
 Challenges in GPGPUs (SIMD & SIMT)
−L1 cache misses
−Memory/computation overlap
−Store stalls
−Complex stall classification

Limitations of CPU Models on GPGPUs: Memory/Computation Overlap
Overlapped computation may make the kernel fully compute bound at a lower frequency.
(Figure: execution timeline at frequency f.)

Limitations of CPU Models on GPGPUs: Memory/Computation Overlap
Ignoring the scaling of overlapped computation causes under-prediction of execution time.
(Figure: execution timeline at frequency f.)

Limitations of CPU Models on GPGPUs: Memory/Computation Overlap
Ignoring the scaling of overlapped computation causes under-prediction of execution time.
(Figure: performance prediction for 15 cores, 48 warps/core; frequency decreases left to right; prediction baseline frequency is 700 MHz; the kernel transitions from memory bound to compute bound.)
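The under-prediction can be illustrated with the component numbers from the CRISP example slide later in the deck (LCP = 20 cycles, of which 17 are overlapped compute; CSP = 11 cycles, of which 10 are compute). This is a hedged sketch, not the exact arithmetic of any one published model: a CPU-style model that counts the whole LCP as fixed memory time scales only the non-overlapped compute.

```python
scale = 2  # frequency halved

# Component cycles at the base frequency (from the CRISP example slide).
lcp, lcp_compute = 20, 17   # load critical path; 17 cycles are overlapped compute
csp, csp_compute = 11, 10   # compute store path; 1 cycle is store stall

# CPU-style view: the whole LCP is frequency-invariant memory time,
# so only the non-overlapped compute stretches (store stalls ignored).
cpu_model = lcp + csp_compute * scale  # 20 + 20 = 40

# CRISP view: overlapped compute inside the LCP scales too, and the
# store stall bounds the CSP from below.
crisp = max(lcp_compute * scale, lcp) + max(csp, csp_compute * scale)  # 34 + 20 = 54

print(cpu_model, crisp)  # -> 40 54: the CPU-style model under-predicts
```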

Limitations of CPU Models on GPGPUs: Store Stalls
Ignoring store stall cycles causes over-prediction of execution time: one SIMD store may fork into 32 stores.

Settings                     | LSQ Full (Cycle %)
1 Core, 1 Thread             | 0
1 Core, 1 Warp (32 threads)  | 66

(Figure: performance prediction; frequency decreases left to right; prediction baseline frequency is 700 MHz; transition from memory bound to compute bound.)

Outline
 DVFS Performance Models for CPUs
 Limitations of Existing Models
 Critical Stalled Path (CRISP)
−Model Components
−Model Parameterization
−Hardware Mechanism and Overhead
 Experiments
−Execution Time Prediction
−Energy Savings

CRItical Stalled Path (CRISP)
 A GPGPU kernel has 3 different phases: load outstanding, pure compute, store stall
 Execution time = load critical path (LCP) + compute store path (CSP)
 LCP and CSP scale independently with frequency

Load Critical Path (LCP) Portion
 The LCP is the longest sequence of dependent load latencies [CRIT]
 An LCP cycle is either overlapped computation or a load stall
 Example: load latencies A=8, B=5, C=12, D=5 with dependences A→B, A→C, A→D, B→D; the critical path is A→C, of length 8+12=20
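The slide's example reduces to a longest-path computation over the load-dependence DAG. A small sketch, with latencies and edges taken from the example above:

```python
from functools import lru_cache

# Load latencies and dependences from the slide's example.
latency = {"A": 8, "B": 5, "C": 12, "D": 5}
deps = {"A": ["B", "C", "D"], "B": ["D"]}  # X -> Y means Y depends on X

@lru_cache(maxsize=None)
def longest_from(load):
    """Length of the longest dependent-load chain starting at `load`."""
    children = deps.get(load, [])
    tail = max((longest_from(c) for c in children), default=0)
    return latency[load] + tail

# The load critical path is the longest chain starting from any load.
lcp = max(longest_from(l) for l in latency)
print(lcp)  # -> 20 (the chain A -> C: 8 + 12)
```

The competing chain A→B→D totals only 8+5+5 = 18, so A→C sets the LCP.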

Load Critical Path (LCP) Portion: lm kernel
(Figure: frequency scaling of the lm kernel, showing overlapped computation, load stalls, the load critical path, the LEAD/CRIT load prediction, and the compute store path; frequency decreases left to right; transition from memory bound to compute bound.)

Compute Store Path (CSP) Portion
 Cycles outside the LCP belong to the CSP
 All pure compute and store stall phases belong to the CSP
(Figure: timeline showing non-overlapped computation, store stalls, the compute store path, and the load critical path.)

Compute Store Path (CSP) Portion: cfd kernel
(Figure: frequency scaling of the cfd kernel, showing non-overlapped computation, store stalls, and the compute store path; frequency decreases left to right; transition from memory bound to compute bound.)

CRISP Example
Components (cycles at base frequency f):

LCP = 20 | CSP = 11
Load Stall 3 | Store Stall 1
Compute 17 | Compute 10

Model predictions (units 1/f):

Model | T_memory | T_compute | Time (at f/2)
STALL | 4 | 27 | 58
MISS | 24 | 7 | 38
LEAD | | |
CRIT | | |
CRISP | | | 54

CRISP model: 54 = max(17×2, 20) + max(11, 10×2)
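The CRISP prediction on this slide follows directly from the decomposition: each path's wall time is the longer of its frequency-invariant memory bound and its scaled compute cycles. A sketch of the slide's arithmetic (function name illustrative, not the paper's full parameterization):

```python
def crisp_predict(lcp_total, lcp_compute, csp_total, csp_compute, scale):
    """CRISP-style prediction: LCP and CSP scale independently.

    Each path's time is the max of its total length at the base frequency
    (the memory-bound floor, frequency-invariant) and its compute cycles
    stretched by the frequency ratio `scale`.
    """
    lcp_time = max(lcp_compute * scale, lcp_total)
    csp_time = max(csp_total, csp_compute * scale)
    return lcp_time + csp_time

# Slide example: LCP = 20 (17 overlapped compute + 3 load stall),
# CSP = 11 (10 compute + 1 store stall), frequency halved:
print(crisp_predict(20, 17, 11, 10, 2))  # -> 54 = max(34, 20) + max(11, 20)
```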

Hardware Mechanism & Overhead of CRISP
 Hardware requirements
−3 counters: Global LCP, Load Stall, Store Stall
−One timestamp register (ts) per MSHR entry
(Figure: MSHR table with columns MSHR | ts | L | ts+L, alongside the Global LCP, Load Stall, and Store Stall counters.)

Hardware Mechanism & Overhead of CRISP 21 Global LCP = 21 MSHRtsLts+L A088 B8513 C81220 D13518 Load Stall= 4 Store Stall= 1 21  On load miss – Update time stamp register (ts) with global LCP  On load stall −Increment load stall counter and global LCP counter  On load returns after L – Update global LCP with max(LCP, ts+L)  On store stall −Increment store stall counter

Outline
 DVFS Performance Models for CPUs
 Limitations of Existing Models
 Critical Stalled Path (CRISP)
−Model Components
−Model Parameterization
−Hardware Mechanism and Overhead
 Experiments
−Execution Time Prediction
−Energy Savings

Execution Time Prediction
 Reduces maximum error by 3.66×
 Average prediction error of 4%, vs. 11% for the best alternative, across all 6 target frequencies
 Example prediction: the lm kernel, which transitions between memory bound and compute bound at 300 MHz
(Figure: prediction error; frequency decreases left to right; prediction baseline frequency is 700 MHz.)

Energy Savings
 At the EDP optimum:
−EDP savings 10.72% vs. 6.72%
−Energy savings 12.87%
−Performance overhead 3.44%
 At the ED²P optimum:
−ED²P savings 8.98% vs. 4.91%
(Figure: lm prediction; frequency decreases left to right.)

Conclusion
 Two fundamentals for performance models in GPGPUs:
−Memory/computation overlap
−Store-related stalls
 A runtime analytical model for DVFS in GPGPUs:
−Better performance prediction accuracy
−More energy savings