Temperature-Sensitive Loop Parallelization for Chip Multiprocessors Sri HK Narayanan, Guilin Chen, Mahmut Kandemir, Yuan Xie Embedded Mobile Computing.

Presentation transcript:

Temperature-Sensitive Loop Parallelization for Chip Multiprocessors
Sri HK Narayanan, Guilin Chen, Mahmut Kandemir, Yuan Xie
Embedded Mobile Computing Center (EMC²), The Pennsylvania State University
International Conference on Computer Design, October 2-5, 2005, San Jose

2 Outline
- Motivation
- Related Works
- Our Approach
- Example
- Experimental Results & Conclusion

3 Motivation
Thermal hotspots are a cause for concern
- Caused by increasing power density
- Can result in permanent chip damage
How to avoid damage
- Cooling techniques
How to prevent hotspots
- Hardware techniques
- This paper proposes a compiler-directed technique to avoid hotspots in CMPs

4 Related work: Dynamic Thermal Management
When one unit overheats, migrate its functionality to a distant, spare unit
- Dual pipeline (Intel, ISQED '02)
- Spare register file (Skadron et al. 2003)
- Separate core (CMP) (Heo et al., ISLPED 2003)
- Microarchitectural clusters (Intel, ICCD 2004)
Raises many interesting issues
- Cost-benefit tradeoff for extra area
- Use both resources (scheduling)
- Run-time thermal sensing/estimation
Yesterday, UC Riverside (Session 2.2) proposed a run-time thermal tracking method

5 Related work: Design-time techniques
- PSU:
  - Thermal-Aware IP Virtualization and Placement for Networks-on-Chip Architecture, ICCD 2004
  - Thermal-Aware Allocation and Scheduling for MPSOC Design, DATE 2005
  - Thermal-Aware Floorplanning Using Genetic Algorithms, ISQED 2005
  - Thermal-Aware Voltage-Island Architecting, the other paper in this session
- Other groups:
  - Thermal-Aware High Level Synthesis (Northwestern Univ.: Memik, R. Dick; ISLPED 2005, ASP-DAC 2006)
  - Many more in this conference
- Industry:
  - Gradient Design Automation (a start-up that showcased at DAC 2005)

6 CMP
“Intel researchers and scientists are experimenting with ‘many tens of cores, potentially even hundreds of cores per die, per single processor die...’”
(Justin R. Rattner, Intel director of the Corporate Technology Group, Spring 2005 IDF)
Last night: panel discussion on CMP
Industry examples (images not captured in this transcript)

7 This paper: compiler approach
Temperature- and performance-sensitive loop scheduling
- Schedules different loop iterations on the CMP
- Data-locality aware and hence performance aware
Intuition behind the approach
- Let "hot" cores idle while cool cores work.
- Static scheduling of parallelized loop iterations at compile time

8 How can the compiler schedule temperature-aware code?
This work targets loop-intensive programs run on embedded CMPs.
Loop nests are divided into chunks; each chunk has a cycle count and a power consumption.
Let the starting temperature of a processor be T_c.
The temperature after executing the chunk is
- T_c' = F(T_c, chunk cycle count, floorplan, chunk power)
- The chunk cycle count and chunk power are obtained by profiling the code.
- Floorplan and physical parameters remain constant.

9 Thermal modeling
Want a good model of chip temperature
- That accounts for adjacency and package
- That does not require detailed designs
- That is fast enough for practical use
A compact model based on thermal R, C (HotSpot)
- Parameterized to automatically derive a model for various architectures, power models, floorplans, and thermal packages

10 Temperature Estimation
The temperature of each block depends on the power consumption and the location of the blocks.
The thermal resistance R_ij of PE_i with respect to PE_j is the temperature rise at PE_i caused by one unit of power dissipated at PE_j.
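To make the thermal-resistance definition concrete, here is a minimal sketch in C. It assumes steady state and linear superposition, so the rise at each PE is the sum of the contributions of all PEs. Neither assumption is stated on the slide, and HotSpot itself also models thermal capacitance. The function name `temperature_rise` and the row-major layout of `R` are illustrative choices, not part of the original work.

```c
#include <assert.h>
#include <stddef.h>

/* Steady-state temperature rise of each PE, assuming linear superposition:
 *   rise[i] = sum over j of R[i][j] * power[j]
 * R is an n-by-n matrix stored row-major; R[i*n + j] is the temperature
 * rise at PE i per unit of power dissipated at PE j. */
void temperature_rise(size_t n, const double *R, const double *power,
                      double *rise)
{
    assert(R != NULL && power != NULL && rise != NULL);
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < n; j++)
            sum += R[i * n + j] * power[j];
        rise[i] = sum;
    }
}
```

With two PEs, R = {2.0, 0.5, 0.5, 2.0} and power = {1.0, 2.0}, the sketch gives rises of 3.0 and 4.5: each PE heats itself most, but the neighbour's power still contributes.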

11 Running Example: Basic Schedule
Jacobi's Algorithm:
    for (i=1; i<=600; i++)
      for (j=1; j<=1000; j++)
        B[i][j] = (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) / 4;
Parallelized Algorithm for 5 cores (code for core k):
    for (i=k*120+1; i<=(k+1)*120; i++)
      for (j=1; j<=1000; j++)
        B[i][j] = (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) / 4;
Parallel Schedule: a grid with one row per time slot and one column per core (P0 to P7); each cell holds an iteration chunk number (cell values not captured in this transcript).
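The bounds k*120+1 and (k+1)*120 in the parallelized loop are a block partition of the 600 outer iterations over 5 cores. The sketch below computes those bounds for any core. It assumes iterations are numbered from 1 and that the iteration count divides evenly by the core count, as it does in the example. `rows_for_core` and `struct row_range` are hypothetical names introduced for illustration.

```c
#include <assert.h>

/* Inclusive range of outer-loop iterations owned by one core. */
struct row_range {
    int first;
    int last;
};

/* Block-partition iterations 1..total_rows across num_cores cores.
 * Assumes total_rows is divisible by num_cores (600 / 5 = 120 in the
 * running example). */
struct row_range rows_for_core(int k, int total_rows, int num_cores)
{
    assert(num_cores > 0 && total_rows % num_cores == 0);
    assert(k >= 0 && k < num_cores);
    int per_core = total_rows / num_cores;
    struct row_range r = { k * per_core + 1, (k + 1) * per_core };
    return r;
}
```

For the running example, `rows_for_core(0, 600, 5)` covers iterations 1 to 120 and `rows_for_core(4, 600, 5)` covers 481 to 600.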

12 Analysis of Basic Schedule
Analysis
- Great locality
- Uses only 5 processors
- Will definitely overheat
Schedule grid: time slots by cores P0 to P7 (cell values not captured in this transcript)
Assumptions in the example
1. Initial temperature is 0
2. Threshold temperature is 2
3. An idle slot reduces the temperature by 1 degree (but never below 0)
4. So at most 2 active slots can be scheduled together on one core
5. The ideal number of active processors at any time is
Due to Jacobi's algorithm, consecutive iteration chunks exhibit locality

13 Pure Temperature-Aware Scheduling
Algorithm
- Start with the time slot at 0 and all iterations unscheduled
- While unscheduled iterations exist
  - Select the coolest A processors whose temperature is below the threshold.
  - Schedule chunks on those processors in the current time slot.
  - Reduce the number of chunks to be scheduled.
  - Increase the time slot by 1.
Analysis
- Poor locality
- 1 extra time slot is used.
- No temperature problems
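The loop above can be sketched in C under the toy thermal model from slide 12: an active slot heats a core by 1, an idle slot cools it by 1 (never below 0), and a core may start a chunk only while its temperature is below the threshold. The real system obtains temperatures from HotSpot, so this is only an illustration of the selection loop, not the authors' implementation. All names (`temperature_aware_schedule`, `struct schedule`, `MAX_PROCS`, `MAX_SLOTS`) are hypothetical, and ties between equally cool cores are broken by lowest core index, which the slide does not specify.

```c
#include <assert.h>

#define MAX_PROCS 16
#define MAX_SLOTS 64

/* chunk[t][p] is the chunk run on processor p in time slot t, or -1 if idle. */
struct schedule {
    int num_slots;
    int chunk[MAX_SLOTS][MAX_PROCS];
};

/* Toy model (assumed): active slot = +1 degree, idle slot = -1 degree
 * (floor 0); a core is eligible only while its temperature < threshold. */
void temperature_aware_schedule(int num_chunks, int num_procs, int max_active,
                                int threshold, struct schedule *out)
{
    assert(num_procs > 0 && num_procs <= MAX_PROCS);
    assert(max_active > 0 && threshold > 0 && num_chunks >= 0);

    int temp[MAX_PROCS] = {0};
    int next_chunk = 0;
    out->num_slots = 0;

    while (next_chunk < num_chunks) {
        int t = out->num_slots;
        assert(t < MAX_SLOTS);
        for (int p = 0; p < num_procs; p++)
            out->chunk[t][p] = -1;

        /* Pick up to max_active of the coolest eligible cores. */
        for (int picked = 0;
             picked < max_active && next_chunk < num_chunks; picked++) {
            int best = -1;
            for (int p = 0; p < num_procs; p++) {
                if (out->chunk[t][p] != -1 || temp[p] >= threshold)
                    continue;
                if (best < 0 || temp[p] < temp[best])
                    best = p;
            }
            if (best < 0)
                break; /* every remaining core is at the threshold */
            out->chunk[t][best] = next_chunk++;
        }

        /* Advance the thermal state by one time slot. */
        for (int p = 0; p < num_procs; p++) {
            if (out->chunk[t][p] != -1)
                temp[p]++;
            else if (temp[p] > 0)
                temp[p]--;
        }
        out->num_slots++;
    }
}
```

Called with 30 chunks, 8 cores, at most 5 active cores, and threshold 2, the sketch needs 7 time slots, one more than the 6 slots of the basic schedule, and only 4 cores are eligible in the fifth slot. The chunk count and the limit of 5 active cores are assumptions inferred from the slot counts in the slide 16 grid, since the transcript does not state them.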

14 Pure Temperature Aware Scheduling Original Schedule

15 Pure Locality-Aware Scheduling
Algorithm
- Start with a clean slate.
- For each iteration chunk
  - Schedule it on the processor with which it has the greatest locality, keeping at most two chunks together.
  - If more slots are required (when all processors are exhausted), increase the schedule length. Otherwise move to the next processor.
Analysis
- Very good locality
- However, 2 extra time slots are used.
- No temperature problems

16 Locality- and temperature-aware scheduling
Algorithm
- Use temperature-aware scheduling to obtain the schedulable slots.
- Use locality-aware scheduling to assign chunks to these slots.
Schedulable-slot grid (cores P0 to P7; the transcript preserves only how many slots are marked in each time slot, not which cores):
    Time slot:     1  2  3  4  5  6  7
    Marked slots:  5  5  5  5  4  5  1
Chunk set shown with the first grid: C = { I0, I1, I2, I3, I4 }; with the second grid: C = { }
Analysis - Best of both worlds
- Great locality
- No temperature problems
- Good performance
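One simple reading of the second step is sketched below in C: given a grid of schedulable slots produced by a temperature-aware pass, hand out chunk numbers in processor-major order, so each core receives a contiguous run of chunks and consecutive chunks (which share data in the Jacobi example) stay on the same core. This is an assumption about how locality could be exploited, not necessarily the authors' heuristic. The names `assign_chunks_by_locality`, `MAX_PROCS`, and `MAX_SLOTS` are hypothetical.

```c
#include <assert.h>

#define MAX_PROCS 16
#define MAX_SLOTS 64

/* Fill every schedulable slot (active[t][p] != 0) with a chunk number so
 * that each processor receives a contiguous run of chunk IDs.
 * chunk[t][p] is set to -1 for idle slots.
 * Returns the number of chunks placed. */
int assign_chunks_by_locality(int num_slots, int num_procs,
                              int active[][MAX_PROCS],
                              int chunk[][MAX_PROCS])
{
    assert(num_slots >= 0 && num_slots <= MAX_SLOTS);
    assert(num_procs > 0 && num_procs <= MAX_PROCS);

    int next_chunk = 0;
    for (int p = 0; p < num_procs; p++) {       /* processor-major order */
        for (int t = 0; t < num_slots; t++)
            chunk[t][p] = active[t][p] ? next_chunk++ : -1;
    }
    return next_chunk;
}
```

Because the slot grid is fixed before chunks are assigned, the temperature guarantee of the first pass is untouched; only the mapping of chunk numbers to slots changes.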

17 Tool flow
Phase 1 - Profiling
- Input: the original code (excerpt shown on the slide):
    #define N 5000
    #define ITER 1
    int du1[N], du2[N], du3[N];
    int au1[N][N][2], au2[N][N][2], au3[N][N][2];
    int a11=1, a12=-1, a13=-1;
    int a21=2, a22=3, a23=-3;
    int a31=5, a32=-5, a33=-2;
    int l;
    /* Initialization loop */
    int sig = 1;
    int main()
    {
      int kx; int ky; int kz;
      printf("Thread:%d\n", mp_numthreads());
      for (kx = 0; kx < N; kx = kx + 1) {
        for (ky = 0; ky < N; ky = ky + 1) {
          for (kz = 0; kz <= 1; kz = kz + 1) {
            au1[kx][ky][kz] = 1;
            au2[kx][ky][kz] = 1;
            au3[kx][ky][kz] = 1;
          }
        }
      }
    } /* main */
- Output: Cycle Times, Chunk Sizes, Energy Consumption
Phase 2 - Temperature Sensitive Scheduling
- Inputs: profiling data and Architecture Details
- Scheduler + HotSpot produce the Temperature Sensitive Schedule
Phase 3 - Locality Based Scheduling
- Scheduler produces the Temperature & Locality Sensitive Schedule
Phase 4 - Code Generation
- Code Generator + Omega Library produce the optimized, temperature sensitive code (the slide repeats the same code excerpt as the output)

18 Experiments
Five loop-intensive codes were tested.
Table columns: Benchmark, Cycles (millions), Energy (J; unit prefix and values not captured in this transcript)
Benchmarks: 3step-log, Adi, Btrix, Eflux, Tsf

19 adi - Threshold Temperature 88 °C

20 eflux - Threshold Temperature 88 °C

21 adi - Threshold Temperature 88 °C

22 eflux - Threshold Temperature 88 °C

23 Sensitivity Analysis adi - Threshold Temperature 87 °C

24 Sensitivity Analysis adi - Threshold Temperature 86 °C

25 Sensitivity Analysis adi - Threshold Temperature 85 °C

26 Sensitivity Analysis adi - Threshold Temperature 84 °C

27 Experiments

28 Experiments

29 Conclusion
Implemented a compiler-directed scheduling algorithm that is both temperature sensitive and performance aware.
Achieves impressive reductions in average and peak chip temperature.
This allows software to take up the burden of preventing chip damage due to thermal effects.
- Chips can be aggressively scaled
- Cooling costs can be reduced
- Lowers the need for hardware-based thermal management schemes.

Thank you!