Temperature-Aware GPU Design

Jeremy W. Sheaffer, Kevin Skadron, David P. Luebke
{jws9c, skadron,
University of Virginia, Charlottesville, VA

Problem Statement

Cooling for graphics processors is becoming prohibitively expensive, but cooling solutions are designed for worst-case behavior. Since power dissipation is spatially non-uniform across the chip, localized heating occurs much faster than chip-wide heating, which leads to “hot spots” and spatial gradients that can cause accelerated aging and timing errors. Reducing hot spots reduces cooling requirements. In fact, as true worst-case behavior is rare, a solution designed for the worst case is overdesigned for typical operating conditions. However, a package designed for typical behavior could be overcome by some unusual application, requiring dynamic thermal management (DTM).

Architecture-Level Thermal Modeling

Requirements: General, simple, and fast, and must model heating at the granularity of architectural objects
- Must be able to dynamically calculate temperatures for each block in the architecture
- Must be able to simulate billions of clock cycles in a few hours
- Must be general enough to use for modeling a variety of processor architectures
- Must be able to reason about results at the architecture level

Solution: Derive an equivalent circuit of lumped thermal resistances and capacitances. This circuit must be derived at the granularity of the processor architecture.

Key components:
- Floorplanning
- Lumped-RC circuit derivation

GPU Simulation with Qsilver

To study thermal issues in a GPU, we have developed a simulator called Qsilver that:
- models GPU clock-cycle-by-cycle activity and power in the microarchitecture domain
- uses the Chromium system to intercept a stream of OpenGL calls, annotating it with aggregate information about the vertices and fragments, textures, lighting, and other relevant rendering state

Qsilver is useful for:
- analyzing performance bottlenecks
- estimating power
- exploring new graphics architectural ideas

We have used Qsilver to analyze a hypothetical fixed-function console-like GPU architecture. For these results, we augment Qsilver with an architectural thermal model called HotSpot that tracks temperature in each functional unit over time.

Floorplans

[Figure: four floorplans (Default, Separating Hot Units, High Resolution, Partitioned High Resolution). Blocks are labeled Framebuffer Control, Framebuffer and Data Compression, Vertex Engine, Fragment Engine, Rasterizer, Host Interface, Texture Cache, 2D Video, and Unused.]

In order to add thermal modeling to Qsilver, the simulator must first be instrumented with an architectural floorplan. From the left, these floorplans are:
- Default: based on an nVIDIA marketing photo. We use this chip to drive an 800×600, console-like display in our simulations.
- Separating Hot Units: based on the default floorplan. The two hottest units, framebuffer operations and the vertex engine, are separated.
- High Resolution: also based on the default, but modified to drive a PC display at 1280×1024. The framebuffer, fragment engine, and texture cache are enlarged to maintain reasonable power densities under higher workload.
- Partitioned High Resolution: this novel floorplan maintains the functional unit area of the high resolution design, but partitions units into separate blocks per pipe, and separates hot blocks from cooler ones.
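As a rough illustration of the lumped-RC idea behind the thermal model, the sketch below treats each floorplan block as a node with a thermal capacitance, a vertical resistance to ambient, and lateral resistances to adjacent blocks, and advances the temperatures with forward Euler. It is not HotSpot's implementation. The two-block floorplan, every resistance, capacitance, and power value, and the function names step_temperatures and simulate are assumptions chosen only for illustration.

```python
# Hypothetical two-block floorplan; every number here is an illustrative assumption.
T_AMBIENT_C = 45.0
POWER_W = {"vertex_engine": 4.0, "texture_cache": 1.0}
CAP_J_PER_K = {"vertex_engine": 0.05, "texture_cache": 0.05}
R_AMBIENT_K_PER_W = {"vertex_engine": 10.0, "texture_cache": 10.0}
# Lateral resistance between each pair of adjacent blocks.
R_LATERAL_K_PER_W = {frozenset({"vertex_engine", "texture_cache"}): 5.0}


def step_temperatures(temps, power, cap, r_ambient, r_lateral, t_ambient, dt):
    """Advance every block temperature by one forward-Euler step of dt seconds."""
    new_temps = {}
    for name, t in temps.items():
        # Heat leaving vertically through the package to ambient (W).
        flow_out = (t - t_ambient) / r_ambient[name]
        # Heat exchanged laterally with each adjacent block (W).
        for pair, r in r_lateral.items():
            if name in pair:
                (other,) = pair - {name}
                flow_out += (t - temps[other]) / r
        # C * dT/dt = dissipated power - heat flowing out of the block.
        new_temps[name] = t + dt * (power[name] - flow_out) / cap[name]
    return new_temps


def simulate(power, cap, r_ambient, r_lateral, t_ambient, dt, steps):
    """Start all blocks at ambient and integrate for `steps` time steps."""
    temps = {name: t_ambient for name in power}
    for _ in range(steps):
        temps = step_temperatures(
            temps, power, cap, r_ambient, r_lateral, t_ambient, dt
        )
    return temps


final = simulate(
    POWER_W, CAP_J_PER_K, R_AMBIENT_K_PER_W, R_LATERAL_K_PER_W,
    T_AMBIENT_C, dt=1e-3, steps=10000,
)
for name, t in final.items():
    print(f"{name}: {t:.1f} C")
```

With these assumed values the higher-power block settles about 6 °C hotter than its neighbor, which is the kind of per-block spatial gradient a floorplan-level model is meant to expose. A smaller lateral resistance would narrow the gap.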
Simulator Setup and Output

For these results, our simulator is configured to model a system:
- Built on a 180nm process at 1.8V and 300MHz
- Using an aluminum cooling solution with no fan
- With a temperature sensor on each functional unit block

We assume that the vendor specifies a 100°C maximum safe operating temperature and enable dynamic thermal management at 97°C to account for sensor imprecision.

We have implemented the following DTM techniques on Qsilver:
- Clock Gating: the clock is stopped until the chip drops below the threshold temperature.
- Fetch Gating: a single stage in the pipeline is slowed down. We implement this in both the vertex fetch and rasterization stages.
- Dynamic Voltage Scaling: DVS scales the core voltage, and with it frequency, yielding a cubic reduction in power.
- Multiple Clock Domains: MCD also scales voltage and frequency, but at the granularity of individual functional units.

Both DVS and MCD incur a synchronization time penalty whenever they are enabled or disabled.

Thermal Simulation Results

Performance cost (Cost) and maximum temperature in °C (Max T) for each DTM technique on each floorplan:

Technique                    Default           Separating        High              Partitioned
                                               Hot Units         Resolution        High Resolution
                             Cost     Max T    Cost     Max T    Cost     Max T    Cost     Max T
No DTM                       0.0%     ...      ...      ...      ...      ...      ...      100.9
Clock Gating                 62.0%    ...      ...      ...      ...      ...      ...      97.0
Fetch Gating (Vertex Fetch)  25.9%    ...      ...      ...      ...      ...      ...      98.1
Fetch Gating (Rasterizer)    90.1%    ...      ...      ...      ...      ...      ...      97.8
Dynamic Voltage Scaling      13.1%    ...      ...      ...      ...      ...      ...      97.0
Multiple Clock Domains       16.7%    ...      ...      ...      ...      ...      ...      97.4

[Figure: four thermal maps.] From left to right, the maps show: no architectural thermal management on the default floorplan, which yields a very hot vertex engine; the floorplan with the hot units moved apart, combined with DVS, which makes the chip cooler with a less pronounced spatial thermal gradient; fetch gating on the high resolution system; and DVS on the redesigned high-resolution chip, where the effect of separating hot spots on the spatial gradient is more obvious. Combining static and dynamic techniques is a double win. Note that, to better illustrate their full dynamic range, these thermal maps are not all on the same scale.
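The 97°C trigger and the cubic power reduction attributed to DVS can be sketched together. Dynamic power goes roughly as C·V²·f, so scaling voltage and frequency by the same factor s scales power by s³. The toy loop below applies that relation to a single lumped thermal node. This is not Qsilver's implementation: the single-node model, the 30 W nominal power, the thermal resistance and capacitance, the 0.8 scale factor, and the names dvs_power and run are all assumptions, and the synchronization penalty is deliberately left out.

```python
TRIGGER_C = 97.0   # DTM engages at this sensor temperature (from the text)
LIMIT_C = 100.0    # vendor-specified maximum safe temperature (from the text)


def dvs_power(nominal_power_w, scale):
    """Power after scaling voltage and frequency together by `scale`.

    Dynamic power is roughly C * V^2 * f; with f proportional to V this is cubic.
    """
    return nominal_power_w * scale ** 3


def run(dtm_enabled, steps=20000, dt=0.01, nominal_power_w=30.0,
        r_k_per_w=2.0, c_j_per_k=10.0, ambient_c=45.0, dvs_scale=0.8):
    """Simulate one lumped thermal node; all default values are hypothetical.

    Returns (maximum temperature in C, fractional performance cost).
    """
    temp = ambient_c
    max_temp = temp
    work = 0.0
    for _ in range(steps):
        # Engage DVS only while the sensor reads at or above the trigger.
        throttled = dtm_enabled and temp >= TRIGGER_C
        scale = dvs_scale if throttled else 1.0
        power = dvs_power(nominal_power_w, scale)
        # Forward-Euler update of C * dT/dt = P - (T - T_ambient) / R.
        temp += dt * (power - (temp - ambient_c) / r_k_per_w) / c_j_per_k
        max_temp = max(max_temp, temp)
        # Work completed this step is proportional to clock frequency.
        work += scale
    return max_temp, 1.0 - work / steps


for enabled in (False, True):
    max_t, cost = run(dtm_enabled=enabled)
    print(f"DTM={enabled}: max {max_t:.2f} C, "
          f"over limit: {max_t > LIMIT_C}, cost {cost:.1%}")
```

Under these assumed parameters the unmanaged node drifts past the 100°C limit, while the managed one hovers near the trigger at a modest loss of work. Adding a per-transition stall to model the synchronization penalty would raise that cost, and would likely call for some hysteresis around the trigger so the controller does not toggle on every step.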