Compiler-Directed Power Density Reduction in NoC-Based Multi-Core Designs Sri Hari Krishna Narayanan, Mahmut Kandemir, Ozcan Ozturk Embedded Mobile Computing.

Presentation transcript:

Compiler-Directed Power Density Reduction in NoC-Based Multi-Core Designs
Sri Hari Krishna Narayanan, Mahmut Kandemir, Ozcan Ozturk
Embedded Mobile Computing Center (EMC²), The Pennsylvania State University
International Symposium on Quality Electronic Design, 03/27-29, 2006, San Jose

2 Introduction to the Problem
• Increasing transistor counts and rising clock frequencies lead to increased power dissipation.
• Increased scaling, coupled with increased power dissipation, has led to increased power density.
• Increased power density leads to rising thermal problems, which require solutions.

3 Solutions to Thermal Issues in Multiprocessor Environments
Dynamic Thermal Management
• Heo et al., ISLPED 2003: activity migration between two processors.
• Shang et al., MICRO 2003: communication is routed away from a potential hotspot. Upon a thermal emergency, communication is throttled.

4 Problems with the Current Solutions
Repeated suspension of execution or communication leads to performance loss, so it is beneficial to reduce the number of suspensions. How?
• Reduce the number of thermal emergencies by reducing the power density.
• Reduce the power density by changing which processors are active and how much computation each performs, within certain bounds.

5 Default Mapping
Figure labels: Code; Default Code Mapping Module; Default (performance-oriented) Mapping.
Code shown in the figure:
    #define N 5000
    #define ITER 1
    int du1[N], du2[N], du3[N];
    int au1[N][N][2], au2[N][N][2], au3[N][N][2];
    int a11=1, a12=-1, a13=-1;
    int a21=2, a22=3, a23=-3;
    int a31=5, a32=-5, a33=-2;
    int l;
    /* Initialization loop */
    int sig = 1;
    int main(){
      int kx; int ky; int kz;
      printf("Thread:%d\n",mp_numthreads());
      for(kx = 0; kx < N; kx = kx + 1) {
        for(ky = 0; ky < N; ky = ky + 1) {
          for(kz = 0; kz <= 1; kz = kz + 1) {
            au1[kx][ky][kz] = 1;
            au2[kx][ky][kz] = 1;
            au3[kx][ky][kz] = 1;
          }
        }
      }
    } /* main */
Default mapping: performance oriented
• Active processors are close to each other.
• Less communication cost.
• Higher power density, hence more thermal emergencies.
• We propose to change this mapping into a temperature-aware one.

6 Integer Linear Programming Model
Phase 1
• Increases the bounding box of the active processors, given a communication cost limit, and hence reduces the overall power density.
[Figure: mapping initially and after Phase 1]

7 Integer Linear Programming Model
Constraints *
• The number of active processors remains constant.
• The communication between active processors in the new mapping must stay under the sum of the old communication and the allowed relaxation.
• The area of the bounding box must be maximized.
* Exact mathematical expressions are given in the paper.
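As a rough illustration of the Phase 1 constraints, the C sketch below evaluates a candidate mapping instead of solving the ILP. It is not the formulation from the paper. The `Tile` type, the `traffic` matrix, the hop-count cost model, and all function names are assumptions made for this example, and "under" is read here as "at most".

```c
#include <stdlib.h>

/* Hypothetical mesh coordinate of an active processor. */
typedef struct { int x; int y; } Tile;

/* Area, in tiles, of the smallest rectangle enclosing all n (n >= 1) tiles. */
int bounding_box_area(const Tile *t, int n) {
    int minx = t[0].x, maxx = t[0].x, miny = t[0].y, maxy = t[0].y;
    for (int i = 1; i < n; i++) {
        if (t[i].x < minx) minx = t[i].x;
        if (t[i].x > maxx) maxx = t[i].x;
        if (t[i].y < miny) miny = t[i].y;
        if (t[i].y > maxy) maxy = t[i].y;
    }
    return (maxx - minx + 1) * (maxy - miny + 1);
}

/* Assumed cost model: traffic volume times hop distance, summed over
 * all ordered pairs. traffic is an n*n row-major matrix in which
 * traffic[i * n + j] is the volume sent from processor i to j. */
long comm_cost(const Tile *t, int n, const int *traffic) {
    long cost = 0;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            int hops = abs(t[i].x - t[j].x) + abs(t[i].y - t[j].y);
            cost += (long)traffic[i * n + j] * hops;
        }
    }
    return cost;
}

/* Returns 1 if new_map keeps the same n processors active, stays within
 * the relaxed communication budget, and enlarges the bounding box. */
int phase1_accepts(const Tile *old_map, const Tile *new_map, int n,
                   const int *traffic, long relaxation) {
    long budget = comm_cost(old_map, n, traffic) + relaxation;
    if (comm_cost(new_map, n, traffic) > budget) return 0;
    return bounding_box_area(new_map, n) > bounding_box_area(old_map, n);
}
```

An ILP solver would search over all placements for the largest bounding box. This sketch only checks one proposed placement against the same three conditions.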

8 Phase 1
Figure labels: Code (same listing as on slide 5); Default Code Mapping Module; ILP Module; Default (performance-oriented) Mapping; Overall power density reduced mapping.
Phase 1 mapping
• Overall density is reduced.
• Communication cost is increased.

9 Integer Linear Programming Model
Phase 2
• Given the reduced overall power density mapping from Phase 1, a new mapping with reduced local power density is generated.
[Figure: mapping after Phase 1 and after Phase 2]

10 Integer Linear Programming Model
Constraints *
• Each old active processor that has high power density is split.
• Each split processor performs the same communication as the old processor.
• The area of the bounding box remains constant.
• The total power spent within the bounding box is minimized by minimizing the communication path.
* Exact mathematical expressions are given in the paper.

11 Phase 1 and Phase 2
Figure labels: Code (same listing as on slide 5); Default Code Mapping Module; ILP Module; Default (performance-oriented) Mapping; Overall power density reduced mapping; Thermal-aware mapping.

12 Implementation
Figure labels: Code (same listing as on slide 5); Profiling; Cycle times; Chunk sizes; Proc. Energy; Communication; Router Energy; HotSpot + Shutdown.
HotSpot
• Temperature estimation tool developed by Skadron at UVa.
• T(i+Δ) = HS(T(i), floorplan, power, cycles, Δ)
Shutdown
• Any processor or router that is too hot must be turned off to allow it to cool down.

13 Algorithm
1. Initially mark processors as being active.
2. While (all execution is not completed) {
   2.a Time_Taken = Time_Taken + Δ
   2.b If a processor was active
       2.b.i. Reduce the chunks that it has to execute by 1.
   2.c Calculate the new current temperature for all processors: T(i+Δ) = HS(T(i), floorplan, power, cycles, Δ)
   2.d If a processor is too hot
       2.d.i. Mark it as inactive.
   2.e If a router is too hot
       2.e.i. Mark all processors communicating through it as inactive.
   2.f Determine all the active processors and routers for the next scheduling step.
   }
3. Return Time_Taken
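The loop above can be sketched in C under strong simplifying assumptions. `toy_next_temp` is a made-up first-order heating and cooling rule standing in for HotSpot, not HotSpot's model or API. Routers (step 2.e) are omitted, time is counted in scheduling steps, and `NPROC`, `AMBIENT`, and the constants are hypothetical values chosen only to keep the example small.

```c
#include <assert.h>

#define NPROC 4        /* hypothetical core count for illustration */
#define AMBIENT 40.0   /* hypothetical ambient temperature */

/* Toy stand-in for HS(...): an active core gains 10 degrees, then every
 * core loses half of its excess over ambient. NOT the HotSpot model. */
static double toy_next_temp(double t, int active) {
    double heated = t + (active ? 10.0 : 0.0);
    return heated - 0.5 * (heated - AMBIENT);
}

/* Runs the scheduling loop and returns the number of steps taken.
 * chunks[i] is the work left on core i and is consumed in place. */
int simulate(int chunks[NPROC], double threshold) {
    double temp[NPROC];
    int active[NPROC];
    int time_taken = 0;

    assert(threshold > AMBIENT);  /* otherwise a hot core never resumes */
    for (int i = 0; i < NPROC; i++) {            /* step 1 */
        temp[i] = AMBIENT;
        active[i] = chunks[i] > 0;
    }
    for (;;) {
        int remaining = 0;
        for (int i = 0; i < NPROC; i++) remaining += chunks[i];
        if (remaining == 0) break;               /* all execution completed */

        time_taken += 1;                         /* step 2.a */
        for (int i = 0; i < NPROC; i++) {
            if (active[i]) chunks[i] -= 1;       /* step 2.b */
            temp[i] = toy_next_temp(temp[i], active[i]);  /* step 2.c */
            /* steps 2.d and 2.f: a core runs next step only if it still
             * has work and is below the threshold temperature */
            active[i] = chunks[i] > 0 && temp[i] < threshold;
        }
    }
    return time_taken;                           /* step 3 */
}
```

With these toy constants, a core with five chunks and a threshold of 48 overheats after its third step and sits out one step, so the run takes six steps instead of five. This is the kind of suspension cost the temperature-aware mapping aims to avoid.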

14 NoC Multi-core Model
• Routers are roughly 1/5th the area of the processors.
• Processors communicate using x-y routing.
  • Used to estimate the cost of communication.
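For reference, x-y (dimension-ordered) routing on a mesh sends a message along the X dimension first and then along Y, so the hop count equals the Manhattan distance between the two tiles. The minimal C sketch below shows one way this could be used as a communication cost estimate. The `Tile` type and both function names are assumptions for illustration, not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical tile coordinate on the mesh: column x, row y. */
typedef struct { int x; int y; } Tile;

/* Number of hops between src and dst under x-y routing. */
int xy_hops(Tile src, Tile dst) {
    return abs(dst.x - src.x) + abs(dst.y - src.y);
}

/* Writes every tile visited, source and destination included, into
 * path and returns how many were written. path must have room for
 * xy_hops(src, dst) + 1 entries. */
int xy_route(Tile src, Tile dst, Tile *path) {
    int n = 0;
    Tile cur = src;
    assert(path != NULL);
    path[n++] = cur;
    while (cur.x != dst.x) {          /* travel along X first */
        cur.x += (dst.x > cur.x) ? 1 : -1;
        path[n++] = cur;
    }
    while (cur.y != dst.y) {          /* then travel along Y */
        cur.y += (dst.y > cur.y) ? 1 : -1;
        path[n++] = cur;
    }
    return n;
}
```

The route also identifies which routers a message passes through, which matters when a hot router forces the processors communicating through it to be marked inactive.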

15 Parameters Used
Parameter               Brief Explanation
Processor               300 MHz, single issue
Chip Area               8.2 mm * 7 mm
Threshold Temperature   86.12 C W1
Mesh Size               5 * 5 grid
Processor Area          1.4 mm * 1.4 mm
Router Area             0.24 mm * 1.44 mm

16 Benchmarks Used
Columns: Benchmark; Cycles (millions); Processor Energy (uJ); Router Energy (uJ)
Benchmarks: adi, eflux, tsf, syntc, syntc

17 Results – Thermal Emergencies

18 Results - Performance

19 Conclusions
• Dynamic thermal management leads to suspension of execution.
• We propose a novel compiler-directed mechanism to reduce occurrences of thermal emergencies.
• By reducing the number of thermal emergencies, performance is improved.

Thank you!