Multi-Story Power Distribution Networks for GPUs

Multi-Story Power Distribution Networks for GPUs
Qixiang Zhang, Liangzhen Lai, Mark Gottscho, and Puneet Gupta
Electrical Engineering Department
nanocad.ee.ucla.edu
mgottscho@ucla.edu

Problem: GPU Power Delivery is Expensive
GPUs draw large currents due to high power consumption at low voltage
  Power loss & voltage noise in the power distribution network (PDN)
  Many supply and ground pins required
Design consequences
  High package cost
  Reduced I/O pin availability
  Inefficient PDN
  Aging & wearout
Goal: Reduce the overhead of power delivery in GPUs
16-Mar-2016 Mark Gottscho / UCLA

Previous Work: Multi-Story PDNs to the Rescue?
Idea: Stack multiple voltage planes for logic! [Gu ISLPED’05]
Challenge: How to partition logic such that the current demand of each story is matched?
  Gate-level, functional units, cores, …?
Figure: Multi-Story PDN concept [Gu ISLPED’05]
We adapt this idea to GPUs at the core level and improve it further

Proposal: Multi-Story Approach is Ideal for GPUs
We propose multi-story PDNs for GPUs!
These low-cost techniques can help stabilize the voltage rails:
  Hardware
    Auxiliary regulator
    On-chip supercapacitors
    Dynamic Current Compensation (DCC)
  Software
    Static SIMT Thread Scheduling (SSTS)
GPUs are good for current matching at the core level
  Architectural homogeneity
  Regular layout
  SIMT model: single-instruction, multiple-thread
  Minimal communication between threads
Figure: NVIDIA Fermi Block Diagram [NVIDIA]
Motivational Results: GPGPU-Sim (NVIDIA GTX 480 with 14 cores) + HSPICE
  Benchmarks shown: NQU, LPS, STO, RAY

Conventional 1-Story PDN for GPUs
All cores in a voltage domain share a common off-chip supply and ground
1-Story PDN has high off-chip current demand

Proposed 2-Story PDNs for GPUs
GPU cores are divided across two stacked voltage domains.
New node V_inter:
  virtual ground for the upper story
  virtual supply for the bottom story
2-Story: off-chip current demand 1/2X, resistive power losses 1/4X, power pins 1/2X!
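The 1/2X and 1/4X figures follow from delivering the same power at twice the supply voltage. A minimal sketch of that arithmetic, assuming a fixed total power, a lumped PDN resistance, and illustrative numbers (200 W, 1.0 V per core, 1 mOhm) that are not taken from the talk:

```python
def pdn_figures(total_power_w, core_voltage_v, pdn_resistance_ohm, stories):
    """Return (off-chip current in A, resistive PDN loss in W).

    Assumes the stories are stacked in series, so the off-chip supply is
    stories * core_voltage_v while the total power stays the same.
    """
    supply_voltage_v = stories * core_voltage_v
    current_a = total_power_w / supply_voltage_v
    loss_w = current_a ** 2 * pdn_resistance_ohm  # I^2 * R loss
    return current_a, loss_w


# Hypothetical operating point, chosen only to make the ratios visible.
i1, loss1 = pdn_figures(200.0, 1.0, 1e-3, stories=1)
i2, loss2 = pdn_figures(200.0, 1.0, 1e-3, stories=2)
current_ratio = i2 / i1      # about 0.5
loss_ratio = loss2 / loss1   # about 0.25
```

If the number of power pins is taken to scale with the off-chip current (an assumption), the pin count halves along with the current.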

Proposed 2-Story, 1-Regulator GPU
Problem: V_inter nodes are floating!
They are sensitive to minor current imbalances between cores.
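One way to see the sensitivity is to treat each story as a resistor sized by its nominal current draw, so that the two stories form a voltage divider across the off-chip supply. This is a deliberately crude sketch: the resistive load model and the numbers (2.0 V supply, 10 A and 11 A nominal draws) are assumptions, not values from the talk.

```python
def v_inter(supply_v, top_nominal_a, bottom_nominal_a):
    """Floating mid-node voltage for two series-stacked resistive stories.

    Each story is modeled as R = V_core / I_nominal. The divider
    supply_v * R_bottom / (R_top + R_bottom) then simplifies to the
    expression below, independent of V_core.
    """
    return supply_v * top_nominal_a / (top_nominal_a + bottom_nominal_a)


balanced = v_inter(2.0, 10.0, 10.0)    # 1.0 V, both stories see 1.0 V
imbalanced = v_inter(2.0, 10.0, 11.0)  # about 0.952 V across the bottom story
shift_percent = 100.0 * (imbalanced - balanced) / balanced
```

Under this model a 10% imbalance in nominal current moves the bottom story's voltage by roughly 5%, with the hungrier story receiving the lower voltage.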

Proposed 2-Story, 2-Regulator GPU
V_inter nodes are stabilized by the auxiliary regulator, but this costs extra pins and power.

Results: Conventional 1-Story vs. 2-Stories
The 2-story, 1-regulator design is the most efficient and cheapest, BUT it is unreliable without fixes
Target: 10% MVS

On-Chip Supercapacitors Stabilize V_inter for 1-Reg.
Supercaps near GPU cores can filter transient voltage noise on V_inter instead of an auxiliary regulator

Results: 2-Story, 1-Regulator with On-Chip Supercaps
RAY Benchmark
38 uF per core required for 10% MVS on 2-story, 1-regulator
Assume on-chip supercapacitor density is 23 pF/um^2 [Leung 2015, El-Kady 2013] and that supercaps are not stackable on logic/metal
Supercap area overhead est. 1.65 mm^2 per core, 4.6% for chip
Supercaps make the 2-story, 1-regulator design more reliable and more efficient with low overhead
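The per-core area estimate can be reproduced from the two numbers on the slide (38 uF per core, 23 pF/um^2). The sketch below also backs out the die area implied by the 4.6% figure for the 14-core configuration; that die area is an inference from the slide's own numbers, not a value stated in the talk.

```python
def supercap_area_mm2(capacitance_uf, density_pf_per_um2=23.0):
    """Area needed for a given capacitance at a given areal density."""
    capacitance_pf = capacitance_uf * 1e6            # 1 uF = 1e6 pF
    area_um2 = capacitance_pf / density_pf_per_um2
    return area_um2 / 1e6                            # 1 mm^2 = 1e6 um^2


CORES = 14  # GTX 480 configuration simulated in the talk

per_core_mm2 = supercap_area_mm2(38.0)               # about 1.65 mm^2
implied_die_mm2 = CORES * per_core_mm2 / 0.046       # about 503 mm^2 (inferred)

# Same function applied to the 8 uF per core used in the later LPS result.
lps_fraction = CORES * supercap_area_mm2(8.0) / implied_die_mm2  # just under 1%
```

The 8 uF case lands just under 1% of the inferred die area, which is consistent with the "approx. 1%" overhead quoted for the supercaps plus DCC configuration.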

Dynamic Current Compensation (DCC)
DCC can actively balance current demand among cores when supercaps cannot fix steady-state mismatches
  Voltage-controlled current source (VCCS)
  Ring oscillator (RO)
Control latency is critical to stability
DCC can assist supercaps in stabilizing V_inter
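To illustrate why latency matters, the sketch below integrates a constant current mismatch onto the V_inter capacitance and lets a delayed compensating current push back. The proportional control law, the gain, the capacitance, and the mismatch are all hypothetical choices for illustration; the talk does not specify the controller. Only the 1 us latency echoes a figure quoted in the results.

```python
def simulate_v_inter_error(latency_s, gain_a_per_v, mismatch_a=0.1,
                           cap_f=100e-6, dt_s=10e-9, duration_s=200e-6):
    """Return the V_inter error trace in volts (forward-Euler integration).

    Model (assumed): C * d(err)/dt = mismatch - gain * err(t - latency).
    """
    delay_steps = int(round(latency_s / dt_s))
    steps = int(round(duration_s / dt_s))
    error = [0.0]
    for n in range(steps):
        # The compensator only sees the error from `delay_steps` samples ago.
        seen = error[n - delay_steps] if n >= delay_steps else 0.0
        i_comp = gain_a_per_v * seen
        error.append(error[n] + dt_s / cap_f * (mismatch_a - i_comp))
    return error


no_dcc = simulate_v_inter_error(latency_s=0.0, gain_a_per_v=0.0)
fast_dcc = simulate_v_inter_error(latency_s=1e-6, gain_a_per_v=50.0)
slow_dcc = simulate_v_inter_error(latency_s=5e-6, gain_a_per_v=50.0)
```

With no compensation the error drifts linearly, which is the steady-state mismatch that capacitance alone cannot remove. With 1 us latency the error settles near mismatch / gain. With 5 us latency the same gain oscillates with growing amplitude, since delayed proportional feedback into an integrator is stable only while gain * latency / C stays below pi/2.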

Results: 2-Story, 1-Regulator with Supercaps & DCC
LPS Benchmark (C_supercap = 8 uF per core)
10% power loss & 10% MVS with approx. 1% supercap die area overhead and up to 1 us VCCS latency

Static SIMT Thread Scheduling (SSTS)
Current profiles may be imperfectly matched for different cores
Propose a software-based solution
  Given prior knowledge of workload characteristics…
  Minimize the average difference in top/bottom story current demand via thread placement
  We use a greedy thread partitioning algorithm akin to Fiduccia-Mattheyses (FM)
SSTS is well suited for compensating static power offsets
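The flavor of such a partitioner can be sketched as follows. This is not the authors' algorithm: it is a generic largest-first greedy split followed by pairwise swap refinement, and it assumes each schedulable unit of work comes with a single profiled average-current estimate (the `currents` list is hypothetical input). It also ignores any requirement that both stories host the same number of cores.

```python
def partition_work(currents):
    """Split work units into top/bottom stories with balanced total current.

    Returns (top_indices, bottom_indices, imbalance).
    """
    top, bottom = [], []
    sum_top = sum_bot = 0.0
    # Greedy pass: place the largest remaining unit on the lighter story.
    order = sorted(range(len(currents)), key=lambda i: currents[i], reverse=True)
    for idx in order:
        if sum_top <= sum_bot:
            top.append(idx)
            sum_top += currents[idx]
        else:
            bottom.append(idx)
            sum_bot += currents[idx]
    # Refinement: accept any pairwise swap that strictly reduces the imbalance.
    improved = True
    while improved:
        improved = False
        for a in range(len(top)):
            for b in range(len(bottom)):
                delta = currents[top[a]] - currents[bottom[b]]
                new_diff = abs((sum_top - delta) - (sum_bot + delta))
                if new_diff < abs(sum_top - sum_bot) - 1e-12:
                    top[a], bottom[b] = bottom[b], top[a]
                    sum_top -= delta
                    sum_bot += delta
                    improved = True
    return top, bottom, abs(sum_top - sum_bot)


# Hypothetical per-unit current estimates in amperes.
top, bottom, imbalance = partition_work([5.0, 4.0, 3.0, 3.0, 2.0, 1.0])
```

For this input the greedy pass already reaches a perfectly balanced split (9 A per story), so the refinement loop makes no swaps. It only matters for inputs where the greedy placement leaves a gap that exchanging two units can close.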

Results: 2-Story, 1-Regulator with Supercaps & SSTS
SSTS can achieve a similar result to DCC without extra hardware, but cannot manage dynamic variation

Practical Considerations
Multiple virtual ground planes required in silicon
  Triple-well or moat isolation processes between stories [Pei et al. IEDM’14]
Boot time: need to control V_inter due to gate oxide breakdown
  Slowly ramp off-chip voltage
Process variations & aging cause power mismatches
  Proposed techniques can compensate
Memory/NoC/IO power distribution
  Use separate domains + level shifters

Conclusion: Multi-Story PDNs Promising for GPUs
Benefits
  Fewer required power pins
  More efficient power delivery
Our innovations
  Application of multi-story to GPU
  Auxiliary regulator
  On-chip supercaps
  DCC
  SSTS
Future Work: DVFS for multi-story GPUs

Thank you!