Warped Gates: Gating Aware Scheduling and Power Gating for GPGPUs

Slides:

Advertisements

Similar presentations

CS 7810 Lecture 4 Overview of Steering Algorithms, based on Dynamic Code Partitioning for Clustered Architectures R. Canal, J-M. Parcerisa, A. Gonzalez.

Advertisements

Reducing Leakage Power in Peripheral Circuits of L2 Caches Houman Homayoun and Alex Veidenbaum Dept. of Computer Science, UC Irvine {hhomayou,

International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems.

1 A Self-Tuning Cache Architecture for Embedded Systems Chuanjun Zhang*, Frank Vahid**, and Roman Lysecky *Dept. of Electrical Engineering Dept. of Computer.

Chimera: Collaborative Preemption for Multitasking on a Shared GPU

Keeping Hot Chips Cool Ruchir Puri, Leon Stok, Subhrajit Bhattacharya IBM T.J. Watson Research Center Yorktown Heights, NY Circuits R-US.

Lizhong Chen and Timothy M. Pinkston SMART Interconnects Group

Chuanjun Zhang, UC Riverside 1 Low Static-Power Frequent-Value Data Caches Chuanjun Zhang*, Jun Yang, and Frank Vahid** *Dept. of Electrical Engineering.

Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow Wilson W. L. Fung Ivan Sham George Yuan Tor M. Aamodt Electrical and Computer Engineering.

June 20 th 2004University of Utah1 Microarchitectural Techniques to Reduce Interconnect Power in Clustered Processors Karthik Ramani Naveen Muralimanohar.

CS 7810 Lecture 12 Power-Aware Microarchitecture: Design and Modeling Challenges for Next-Generation Microprocessors D. Brooks et al. IEEE Micro, Nov/Dec.

On the Limits of Leakage Power Reduction in Caches Yan Meng, Tim Sherwood and Ryan Kastner UC, Santa Barbara HPCA-2005.

UPC Reducing Power Consumption of the Issue Logic Daniele Folegnani and Antonio González Universitat Politècnica de Catalunya.

Reducing the Complexity of the Register File in Dynamic Superscalar Processors Rajeev Balasubramonian, Sandhya Dwarkadas, and David H. Albonesi In Proceedings.

September 28 th 2004University of Utah1 A preliminary look Karthik Ramani Power and Temperature-Aware Microarchitecture.

Instruction Set Architecture (ISA) for Low Power Hillary Grimes III Department of Electrical and Computer Engineering Auburn University.

Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt.

CS 7810 Lecture 13 Pipeline Gating: Speculation Control For Energy Reduction S. Manne, A. Klauser, D. Grunwald Proceedings of ISCA-25 June 1998.

Feb 14 th 2005University of Utah1 Microarchitectural Wire Management for Performance and Power in partitioned architectures Rajeev Balasubramonian Naveen.

Feb 14 th 2005University of Utah1 Microarchitectural Wire Management for Performance and Power in Partitioned Architectures Rajeev Balasubramonian Naveen.

Mrinmoy Ghosh Weidong Shi Hsien-Hsin (Sean) Lee

TASK ADAPTATION IN REAL-TIME & EMBEDDED SYSTEMS FOR ENERGY & RELIABILITY TRADEOFFS Sathish Gopalakrishnan Department of Electrical & Computer Engineering.

Dynamically Trading Frequency for Complexity in a GALS Microprocessor Steven Dropsho, Greg Semeraro, David H. Albonesi, Grigorios Magklis, Michael L. Scott.

CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA

Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor Mark Gebhart 1,2 Stephen W. Keckler 1,2 Brucek Khailany 2 Ronny Krashinsky.

Speculative Software Management of Datapath-width for Energy Optimization G. Pokam, O. Rochecouste, A. Seznec, and F. Bodin IRISA, Campus de Beaulieu

Adaptive Cache Partitioning on a Composite Core Jiecao Yu, Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Scott Mahlke Computer Engineering Lab University.

Exploiting Program Hotspots and Code Sequentiality for Instruction Cache Leakage Management J. S. Hu, A. Nadgir, N. Vijaykrishnan, M. J. Irwin, M. Kandemir.

Fine-grained minimal overhead value- based core power gating on GPGPU Christopher Fritz CSE691, May 2015

1 of 20 Phase-based Cache Reconfiguration for a Highly-Configurable Two-Level Cache Hierarchy This work was supported by the U.S. National Science Foundation.

University of Michigan Electrical Engineering and Computer Science Composite Cores: Pushing Heterogeneity into a Core Andrew Lukefahr, Shruti Padmanabha,

Power Management in High Performance Processors through Dynamic Resource Adaptation and Multiple Sleep Mode Assignments Houman Homayoun National Science.

SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill,

Energy-Effective Issue Logic Hasan Hüseyin Yılmaz.

1 Estimating the Worst-Case Energy Consumption of Embedded Software Ramkumar Jayaseelan Tulika Mitra Xianfeng Li School of Computing National University.

Houman Homayoun, Sudeep Pasricha, Mohammad Makhzan, Alex Veidenbaum Center for Embedded Computer Systems, University of California, Irvine,

VGreen: A System for Energy Efficient Manager in Virtualized Environments G. Dhiman, G Marchetti, T Rosing ISLPED 2009.

Feb 14 th 2005University of Utah1 Microarchitectural Wire Management for Performance and Power in Partitioned Architectures Rajeev Balasubramonian Naveen.

 GPU Power Model Nandhini Sudarsanan Nathan Vanderby Neeraj Mishra Usha Vinodh

Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation Ping Xiang, Yi Yang, Huiyang Zhou 1 The 20th IEEE International Symposium On High.

© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Intra-Warp Compaction Techniques.

 GPU Power Model Nandhini Sudarsanan Nathan Vanderby Neeraj Mishra Usha Vinodh

The life of an instruction in EV6 pipeline Constantinos Kourouyiannis.

1 November 11, 2015 A Massively Parallel, Hybrid Dataflow/von Neumann Architecture Yoav Etsion November 11, 2015.

An Integrated GPU Power and Performance Model (ISCA’10, June 19–23, 2010, Saint-Malo, France. International Symposium on Computer Architecture)

Sunpyo Hong, Hyesoon Kim

Exploiting Value Locality in Physical Register Files Saisanthosh Balakrishnan Guri Sohi University of Wisconsin-Madison 36 th Annual International Symposium.

Jason Jong Kyu Park1, Yongjun Park2, and Scott Mahlke1

Equalizer: Dynamically Tuning GPU Resources for Efficient Execution Ankit Sethia* Scott Mahlke University of Michigan.

Cache Pipelining with Partial Operand Knowledge Erika Gunadi and Mikko H. Lipasti Department of Electrical and Computer Engineering University of Wisconsin—Madison.

Managing DRAM Latency Divergence in Irregular GPGPU Applications Niladrish Chatterjee Mike O’Connor Gabriel H. Loh Nuwan Jayasena Rajeev Balasubramonian.

Presented by Rania Kilany.  Energy consumption  Energy consumption is a major concern in many embedded computing systems.  Cache Memories 50%  Cache.

The CRISP Performance Model for Dynamic Voltage and Frequency Scaling in a GPGPU Rajib Nath, Dean Tullsen 1 Micro 2015.

Warped-Slicer: Efficient Intra-SM Slicing through Dynamic Resource Partitioning for GPU Multiprogramming Qiumin Xu*, Hyeran Jeon ✝, Keunsoo Kim ❖, Won.

Zorua: A Holistic Approach to Resource Virtualization in GPUs

Gwangsun Kim, Jiyun Jeong, John Kim

Adaptive Cache Partitioning on a Composite Core

Warped Gates: Gating Aware Scheduling and Power Gating for GPGPUs

SECTIONS 1-7 By Astha Chawla

ISPASS th April Santa Rosa, California

Microarchitectural Techniques for Power Gating of Execution Units

DynaMOS: Dynamic Schedule Migration for Heterogeneous Cores

Xia Zhao*, Zhiying Wang+, Lieven Eeckhout*

GPU Computing Architecture

RegLess: Just-in-Time Operand Staging for GPUs

Milad Hashemi, Onur Mutlu, Yale N. Patt

A High Performance SoC: PkunityTM

Operation of the Basic SM Pipeline

ITAP: Idle-Time-Aware Power Management for GPU Execution Units

Presentation transcript:

Warped Gates: Gating Aware Scheduling and Power Gating for GPGPUs Supported by Warped Gates: Gating Aware Scheduling and Power Gating for GPGPUs Mohammad Abdel-Majeed* Daniel Wong* Murali Annavaram Ming Hsieh Department of Electrical Engineering University of Southern California * Equal Contribution MICRO-2013

Component Energy Breakdown for GTX480[1] Problem Overview Execution unit accounts for majority of energy consumption in GPGPU, even more than Mem and Reg! Leakage energy is becoming a greater concern with technology scaling Component Energy Breakdown for GTX480[1] Traditional microprocessor power gating techniques are ineffective in GPGPUs [1] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, “GPUWattch: enabling energy optimizations in GPGPUs,” presented at the ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture, 2013. Overview | 2

64KB shared Memory/L1 cache GPGPU Overview (GTX480) SM V Dec_INST R Fetch and decode I_Buffer Instruction Cache SFU LD/ST C C INT Unit Operands Result Queue FP Unit Warp Scheduler (2-level) Register File 128KB Execution Units 64KB shared Memory/L1 cache SFU LD/ST SP0 SP1 SP accounts for 98% of Execution Unit Leakage Energy Execution units account for 68% of total on chip area Overview | 3

Cumulative energy savings Power Gating Overview Cuts off leakage current that flows through a circuit block Power gate at SP granularity Important Parameters: Wakeup Delay – Time to return to Vdd (3 cycles) Breakeven Time – # of consecutive power gated cycles required to compensate PG energy overhead (9-24 cycles) Idle Detect - # of idle cycles before power gating[2] Idle_detect Busy Uncompensated Cycle 1 Cycle 1+BET Cycles>idle_detect Static Energy t0 t1 t3 t4 Cumulative energy savings Wakeup Cycles>wakeup_delay Eoverhead Overhead to sleep and Wakeup Overhead to Sleep t2 [2] Z. Hu, et. al. Microarchitectural techniques for power gating of execution units. In ISLPED ’04. Compensated Cycles>BET time Breakeven Time Time Overview | 4

Power Gating Challenges in GPGPUs

Power Gating Challenges in GPGPUs Traditional microprocessors experience idle periods many 10s of cycles long[3] Int. Unit Idle period length distribution for hotspot Assume 5 idle detect, 14 BET Energy Loss or Neutral Lost Opportunity Energy Savings Need to increase idle period length [3] S. Dropsho, et. al. Managing static leakage energy in microprocessor functional units. In Proceedings of the MICRO 35, 2012 Challenges | 6

Warp Scheduler Effect on Power Gating Idle periods interrupted by instructions that are greedily scheduled INT Need to coalesce warp issues by resource type FP INT INT FP INT Ready Warps Idle Periods INT FP Challenges | 7

Gating Aware Two-level Scheduler GATES: Gating Aware Two-level Scheduler Issue warps based on execution unit resource type GATES | 8

Gating Aware Two-level Scheduler (GATES) Idle periods are coalesced INT INT INT INT FP FP Ready Warps Idle Period INT FP GATES | 9

Gating Aware Two-level Scheduler (GATES) Per instruction type active warps subset Instruction Issue Priority Dynamic priority switching Switch highest priority when it out of ready warps Two-level GATES GATES | 10

Effect of GATES on Idle Period Length ~3x increase in positive power gating events ~2x increase in negative power gating events Need to further stretch idle periods Two-level GATES GATES | 11

Blackout Power Gating Forced idleness of execution units to meet BET

Blackout Power Gating X X Force idleness until break even time has passed Even when there are pending instructions Would this not cause performance loss? No, because of GPGPU-specific large heterogeneity of execution units and good mix of instruction types Idle_detect Busy Uncompensated Cycle 1 Cycle 1+BET Cycles>idle_detect Wakeup Cycles>wakeup_delay X X Compensated Cycles>BET time Blackout | 13

Blackout Power Gating ~2.4x increase in positive PG events over GATES (GATES ~3x w.r.t. baseline) GATES GATES + Blackout Blackout | 14

Blackout Policies X Naïve Blackout GATES and Blackout is independent Can lead to overaggressive power gating Idle Detect Warp Scheduler (GATES) C C INT X SP0 SP1 Blackout | 15

Active Warp Count Based Blackout Policies Coordinated Blackout PG only when active warps count = 0 Idle Detect Warp Scheduler (GATES) C C Active Warp Count Based Dynamic priority switching is Blackout aware ✓ INT SP0 SP1 Blackout | 16

Impact of Blackout Some benchmarks still show poor performance Not enough active warps to hide forced idleness Goal is as close to 0% overhead as possible

Adaptive Idle Detect Reducing Worst Case Blackout Impact

High Correlation vs Runtime Adaptive Idle Detect Dynamically change idle detect to avoid aggressive PG Infer performance loss due to Blackout “Critical Wakeup” – Wakeup that occur the moment blackout period ends High Correlation vs Runtime Adaptive Idle Detect | 19

Adaptive Idle Detect Warped Gates Independent idle detect values for INT and FP pipelines Break execution time into epoch (1000 cycles) If critical wakeup > threshold, idleDetect++ Conservatively decrement idleDetect every 4 epochs Bound idle detect between 5 – 10 cycles GATES Adaptive Idle Detect Blackout Warped Gates Adaptive Idle Detect | 20

Architectural Support 2-bit type indicator 2 counters keep track of number of INT/FP instr in active subset. Used to determine dynamic priority Architectural Support | 21

Evaluation Evaluation | 22

Evaluation Methodology GPGPU-Sim v3.0.2 Nvidia GTX480 GPUWattch and McPAT for Energy and Area estimation 18 Benchmarks from ISPASS, Rodinia, Parboil Power Gating parameters Wakeup delay – 3 cycles Breakeven time – 14 cycles Idle detect – 5 cycles Evaluation | 23

Power Gating Wakeups / Overhead Coalescing idle periods – fewer, but longer, idle periods Blackout reduces PG overhead by 26% Warped Gates reduces PG overhead by 46% Evaluation | 24

Integer Unit Static Energy Savings Blackout/Warped Gates is able to save energy when ConvPG cannot Warped Gates saves ~1.5x static energy w.r.t. ConvPG Evaluation | 25

FP Unit Static Energy Savings Warped Gates save ~1.5x static energy w.r.t. ConvPG (Ignores Integer only benchmarks) Evaluation | 26

Performance Impact Naïve Blackout has high overhead due to aggressive PG Both ConvPG and Warped Gates has ~1% overhead Evaluation | 27

Conclusion Execution units – largest energy usage in GPGPUs Static energy becoming increasingly important Traditional microprocessor power gating techniques ineffective in GPGPUs due to short idle periods GATES – Scheduler level technique to increase idle periods by coalescing instruction type issues Blackout – Forced idleness of execution unit to avoid negative power gating events Adaptive Idle Detect – Limit performance impact Warped Gates able to save 1.5x more static power than traditional microprocessor techniques, with negligible performance loss Conclusion | 28

Thank you! Questions? Conclusion | 29