Techniques for Multicore Thermal Management Field Cady, Bin Fu and Kai Ren.

Techniques for Multicore Thermal Management
Overview and comparison of techniques
Plus:
– Determining the critical thread
– DVFS details
– Thread movement

Taxonomy
Stop & Go vs DVFS
– Stop & Go: suspend core operation for 30 ms when the temperature rises above a threshold
– DVFS: dynamic voltage and frequency scaling, using a controller from control theory
Distributed vs Global
– Apply the above to each core individually (distributed) or to all cores at once (global)
– Performance asymmetry: different cores face different demands
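The two throttling styles and the distributed/global split can be sketched in a few lines. This is only an illustration: the 30 ms stall comes from the slide, while the trip temperature, the proportional gain, the minimum frequency scale, and all function names are assumptions made up for the example.

```python
STOP_MS = 30          # stall length taken from the slide
THRESHOLD_C = 85.0    # assumed trip temperature (hypothetical value)

def stop_go_step(temp_c, stalled_ms_left):
    """Advance one 1 ms tick; return (core_runs, stalled_ms_left)."""
    if stalled_ms_left > 0:              # still serving an earlier stall
        return False, stalled_ms_left - 1
    if temp_c > THRESHOLD_C:             # trip: this tick plus 29 more are stalled
        return False, STOP_MS - 1
    return True, 0

def dvfs_step(temp_c, freq_scale, gain=0.02, min_scale=0.2):
    """Proportional controller: nudge the frequency scale toward the threshold."""
    error = THRESHOLD_C - temp_c         # negative when the core is too hot
    return max(min_scale, min(1.0, freq_scale + gain * error))

def dvfs_all(temps, scales, distributed=True):
    """Distributed: each core reacts to its own sensor. Global: all follow the hottest."""
    if distributed:
        return [dvfs_step(t, s) for t, s in zip(temps, scales)]
    hottest = max(temps)
    return [dvfs_step(hottest, s) for s in scales]
```

With one hot core and one cool core, the distributed variant slows only the hot core, while the global variant slows both, which is where performance asymmetry hurts.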

Taxonomy (cont.)
Migration
– Moving threads between cores
– Timescale on the order of a millisecond, much slower than DVFS
– Migration is the “outer loop” of control, riding on top of DVFS or Stop-Go
Migrate the “critical” thread
– Measure criticality with a heat sensor
– Or with cache misses as a proxy
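A rough sketch of one migration decision follows. The slides do not spell out the exact policy, so this assumes the critical thread is the one with the highest score (a sensor reading or a counter-based proxy) and that it swaps places with the thread on the coolest core. All names are hypothetical.

```python
def pick_migration(core_of_thread, score_of_thread, temp_of_core):
    """Return (thread, src_core, dst_core), or None if the thread is already on the coolest core."""
    critical = max(score_of_thread, key=score_of_thread.get)  # highest score = critical
    src = core_of_thread[critical]
    dst = min(temp_of_core, key=temp_of_core.get)             # coolest core
    if dst == src:
        return None
    return critical, src, dst

def migrate(core_of_thread, move):
    """Swap the critical thread with whichever thread occupies the target core."""
    thread, src, dst = move
    new_map = dict(core_of_thread)
    for other, core in core_of_thread.items():
        if core == dst:
            new_map[other] = src
    new_map[thread] = dst
    return new_map
```

Such a step would run in the slower outer loop, with DVFS or Stop-Go still acting on each core in between.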

Aside: Criticality
In a separate paper, Bhattacharjee and Martonosi define the “critical” thread as the slowest thread
If we know which thread is critical:
– Steal tasks from the critical thread
– Guide DVFS to prefer the critical thread
They explored several proxies for criticality
13-32% performance boost from task stealing on a 32-core machine

Criticality (cont.)
Cache misses are an excellent proxy for criticality
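As a minimal sketch of how the proxy might feed task stealing: the thread with the most cache misses is treated as critical, and an idle thread takes work from its queue. Using a single miss count per thread is an assumption for illustration, and the function names are hypothetical.

```python
def critical_thread(cache_misses):
    """Treat the thread with the most cache misses as the critical (slowest) one."""
    return max(cache_misses, key=cache_misses.get)

def steal_from_critical(queues, cache_misses, idle_thread):
    """Move the last queued task of the critical thread to the idle thread; return it."""
    victim = critical_thread(cache_misses)
    if victim == idle_thread or not queues[victim]:
        return None                      # nothing to steal
    task = queues[victim].pop()
    queues[idle_thread].append(task)
    return task
```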

Donald and Martonosi: comparison of techniques
Goal: maximize performance subject to a temperature constraint
Measure performance in BIPS (billions of instructions per second) and “duty cycle”, i.e. the percentage of useful time, scaled for DVFS frequency
Run on SPEC benchmarks
Simulated 4-core processor
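One plausible reading of the duty-cycle metric, sketched under assumptions: time spent running counts in proportion to the frequency scale in effect, and stalled time counts as zero. The interval format and function name are made up for the example.

```python
def duty_cycle(intervals):
    """intervals: list of (duration_ms, running, freq_scale) tuples for one core."""
    total = sum(duration for duration, _, _ in intervals)
    useful = sum(duration * scale
                 for duration, running, scale in intervals if running)
    return useful / total
```

Under this reading, a Stop-Go core that runs 70 ms at full speed and stalls for 30 ms scores 0.7, while a DVFS core running the whole 100 ms at 80% frequency scores 0.8.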

Results
All normalized to distributed Stop-Go

Stop-Go was terrible!
– Why didn’t they try it with a lower frequency?
– Was 30 ms the right length of time to stop?
They subsequently focus solely on DVFS, even though the hardware is trickier

Migration Policies

Summary & Conclusion
DVFS is far superior to Stop-Go
Distributed control helps, especially for Stop-Go
Migration helps for Stop-Go
Counter-based and sensor-based migration are comparable

DVFS
Dynamic voltage and frequency scaling (per core).
Dynamic voltage scaling is a power management technique in computer architecture in which the voltage supplied to a component is increased or decreased.
Dynamic frequency scaling (also known as CPU throttling) is a technique in computer architecture in which a processor is run at a less-than-maximum frequency in order to conserve power.

Challenge
Multiple cores may need to be manipulated simultaneously to control both power and temperature for a CMP chip.
This requires Multi-Input-Multi-Output (MIMO) control.
Application software is often still designed for single-core processors.
Power shifting is needed.
Heterogeneous cores
The workload of a CMP processor is unpredictable at design time and may vary significantly at runtime

DVFS

Open-Loop Control
P(k+1) = P(k) + A Δf(k)
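The model can be evaluated directly. The slide does not define its symbols, so this sketch assumes P(k) is a vector of per-core power at control period k, Δf(k) is the vector of frequency changes, and A is a fixed gain matrix (watts per GHz here). The function name is hypothetical.

```python
def predict_power(p, a, delta_f):
    """Evaluate P(k+1) = P(k) + A Δf(k) with plain lists (a is a list of rows)."""
    return [p_i + sum(a_ij * df_j for a_ij, df_j in zip(row, delta_f))
            for p_i, row in zip(p, a)]

# Two cores: lower core 0 by 0.2 GHz and raise core 1 by 0.5 GHz.
p_next = predict_power([10.0, 12.0],
                       [[5.0, 0.0], [0.0, 4.0]],
                       [-0.2, 0.5])
```

Because A is fixed, any error in the assumed gains goes uncorrected, which is the weakness of the open-loop form.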

Using Feedback (Closed-Loop)
Dynamically change matrix A.
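A single-core sketch of what changing A from feedback could look like: the controller picks a frequency change from the current gain estimate, then blends the estimate toward the gain actually observed. The scalar simplification, the blending rate, and the function names are all assumptions, not the method from the slides.

```python
def control_step(p_measured, p_target, gain):
    """Choose Δf so the model P(k+1) = P(k) + gain * Δf lands on p_target."""
    return (p_target - p_measured) / gain

def update_gain(gain, delta_f, p_before, p_after, rate=0.5):
    """Blend the old gain toward the observed ΔP / Δf from the last period."""
    if abs(delta_f) < 1e-9:              # no frequency change, nothing to learn
        return gain
    observed = (p_after - p_before) / delta_f
    return (1 - rate) * gain + rate * observed
```

In the multi-core case the same idea would apply to the entries of the matrix A rather than to a single number.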

Thread Motion: Fine-Grained Power Management for Multi-Core Systems

Motivation
Limitations of DVFS
– Coarse grained
  Initiated by the OS on a millisecond timescale
  Voltage transition delay ~ 10 microseconds
  Too slow to respond to fine variations in program behavior (cache miss ~ nanoseconds)
– Per-core DVFS with multiple VF settings
  High cost of off-chip regulators
  Poor scalability with a large number of cores

Thread Motion
Idea of Thread Motion
– Move threads between cores with two VF domains
– Threads experience virtually continuous voltage levels

Thread Motion
TM Manager
– A separate embedded microcontroller running the TM algorithm
Effective IPC
– Maintain a table of IPC for each application
– High IPC: compute-intensive
– Low IPC: cache misses, memory access latency
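The IPC table kept by the TM manager might look like the following sketch. The class name, the threshold of 1.0 between "compute" and "memory" phases, and the method names are assumptions for illustration only.

```python
class IpcTable:
    """Per-application IPC table with a simple high/low classification."""

    def __init__(self, threshold=1.0):   # hypothetical cut-off between phases
        self.threshold = threshold
        self.ipc = {}

    def record(self, app, instructions, cycles):
        """Store the IPC measured over the last interval."""
        self.ipc[app] = instructions / cycles

    def phase(self, app):
        """High IPC suggests compute-intensive; low IPC suggests memory stalls."""
        return "compute" if self.ipc[app] >= self.threshold else "memory"
```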

Thread Motion: Algorithm
Movement Policy
– Assign a thread in a compute-intensive phase to a high-VF core
– Intra-cluster movement is considered first
Trigger point:
– TM-interval: fixed intervals of ~200 cycles
– Miss-driven: move a thread that has taken a cache miss
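The policy and the miss-driven trigger can be sketched for a single cluster as follows. This assumes one thread per core and ranks threads purely by IPC. The function names and data layout are hypothetical, and inter-cluster movement is left out.

```python
def assign_threads(ipc_of_thread, high_vf_cores, low_vf_cores):
    """Interval trigger: the highest-IPC threads get the high-VF cores first."""
    ranked = sorted(ipc_of_thread, key=ipc_of_thread.get, reverse=True)
    cores = list(high_vf_cores) + list(low_vf_cores)
    return dict(zip(ranked, cores))

def on_cache_miss(thread, core_of_thread, ipc_of_thread, high_vf_cores):
    """Miss-driven trigger: a stalled thread on a high-VF core yields it."""
    if core_of_thread[thread] not in high_vf_cores:
        return core_of_thread            # already on a low-VF core
    candidates = [t for t, c in core_of_thread.items()
                  if c not in high_vf_cores]
    if not candidates:
        return core_of_thread
    best = max(candidates, key=ipc_of_thread.get)
    swapped = dict(core_of_thread)
    swapped[thread], swapped[best] = core_of_thread[best], core_of_thread[thread]
    return swapped
```

For example, with IPCs of 0.3, 1.8 and 0.9 for threads a, b and c and one high-VF core, b takes the high-VF core, and if b then misses in the cache it trades places with c.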

Thread Motion Better Quality