1 Understanding the Energy-Delay Tradeoff of ILP-based Compilation Techniques on a VLIW Architecture G. Pokam, F. Bodin CPC 2004 Chiemsee, Germany, July.

Presentation transcript:

1 Understanding the Energy-Delay Tradeoff of ILP-based Compilation Techniques on a VLIW Architecture G. Pokam, F. Bodin CPC 2004 Chiemsee, Germany, July 7-9

2 Motivation
- Source of complexity on high-performance VLIW processors:
  - hardware duplication: many FUs of different types (ALUs, LSUs, FPUs, BR, etc.)
  - need for a large register file
- Power growth factor: compiler, architecture complexity

3 Motivation
- Assume a fixed [...]; does compiling for higher ILP result in dissipating less power?
- Which issues (architecture, software, etc.) affect power when compiling for ILP?
- Try to figure out what happens analytically!

4 Agenda
- Motivation
- Used metrics
- Energy model
- Tradeoff analysis
- Hyperblock example
- Experiments
- Conclusions

5 Metric
- Performance to energy ratio (PTE) [Gonzales, R. et al.]:
  - [...]: number of operations per basic block
  - [...]: average number of operations per bundle
  - [...]: energy per basic block
- higher is better
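As a rough illustration of how such a ratio behaves, here is a minimal Python sketch. The slide's exact formula did not survive in this transcript, so the form used here (average operations per bundle divided by energy per operation) is an assumption, as are the function and parameter names.

```python
def pte(num_ops, ops_per_bundle, energy):
    """Hypothetical performance-to-energy ratio for one basic block.

    Assumed form (not taken from the slides): performance is the average
    number of operations per bundle, and cost is the energy per operation.
    """
    if num_ops <= 0 or energy <= 0:
        raise ValueError("num_ops and energy must be positive")
    energy_per_op = energy / num_ops
    return ops_per_bundle / energy_per_op


# Illustrative numbers only: same block, denser schedule, slightly more energy.
before = pte(num_ops=8, ops_per_bundle=1.0, energy=4.0)
after = pte(num_ops=8, ops_per_bundle=2.0, energy=5.0)
print(after > before)
```

Under this assumed form, a transformation pays off only if the gain in operations per bundle outweighs the growth in energy per operation, which matches the "higher is better" reading of the slide.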

7 Energy Model
- The execution of a bundle dissipates an energy made up of:
  - energy base cost
  - energy due to execution of the bundle
  - energy due to D-cache misses
  - energy due to I-cache misses
- Consider loop-intensive kernels …
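The four components listed on this slide can be sketched as a simple additive model. This is only an illustration: the class name, field names, and the numbers are hypothetical, and the slide's actual equation is not preserved in the transcript.

```python
from dataclasses import dataclass


@dataclass
class BundleEnergy:
    """Energy terms for one executed bundle (hypothetical names, arbitrary units)."""
    base: float         # energy base cost
    execute: float      # energy due to execution of the bundle
    dcache_miss: float  # energy due to D-cache misses
    icache_miss: float  # energy due to I-cache misses

    def total(self):
        # Assumption: the four terms simply add up.
        return self.base + self.execute + self.dcache_miss + self.icache_miss


def block_energy(bundles):
    """Energy of a basic block as the sum over its bundles."""
    return sum(b.total() for b in bundles)


# Illustrative values only, not measurements from the talk.
bundles = [BundleEnergy(1.0, 2.0, 0.5, 0.25), BundleEnergy(1.0, 1.5, 0.0, 0.0)]
print(block_energy(bundles))
```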

9 Analysis
- Use [...] as a lever for power exploration
- Assume R is a CFG region to be transformed into an ILP region H; a sufficient condition for this is given by [...]

10 Analysis
- Idea:
  - keep track of IPC values that improve energy efficiency
  - solve the PTE inequality at [...]:
    - [...]: average number of operations in the transformed region
    - [...]: average number of operations in the CFG region R

11 Analysis
[...] where:
- f: execution frequency
- N: number of operations
- n: number of bundles
- s: number of stalls due to D-cache misses
- m: number of basic blocks in the region
C is a measure of extra work!
The shape of the ILP transform function depends on the sign of C.

12 [...] vs. [...]
- C < 0: exponential shape; means high extra work (dependence height mismatch, resource contention); e.g. hyperblock: compensation code
- C = 0: linear shape; negligible extra work
- C > 0: logarithmic shape; optimal scenario; e.g. hyperblock: instruction merging

14 Hyperblock framework
- predication model via the select instruction: slct dest = cond, src1, src2
- only hammock regions are considered
- single-entry, single-exit hyperblock
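To make the select-based predication concrete, the sketch below models if-conversion of a small hammock (an if/else that rejoins at a single point) in Python. It assumes `slct` writes `src1` to `dest` when `cond` holds and `src2` otherwise; the helper names and the example computation are illustrative and not taken from the talk.

```python
def slct(cond, src1, src2):
    """Model of `slct dest = cond, src1, src2` (assumed operand semantics)."""
    return src1 if cond else src2


def abs_diff_branchy(a, b):
    # Original hammock: two basic blocks guarded by a branch.
    if a > b:
        r = a - b
    else:
        r = b - a
    return r


def abs_diff_predicated(a, b):
    # If-converted form: both paths execute, then a select picks the result.
    cond = a > b
    t1 = a - b  # former "then" block, now unconditional
    t2 = b - a  # former "else" block, now unconditional
    return slct(cond, t1, t2)


print(abs_diff_branchy(3, 7), abs_diff_predicated(3, 7))
```

The predicated version removes the branch and exposes both subtractions to the scheduler, at the cost of always executing the path that is not taken. That extra work is the kind of overhead the energy analysis has to account for.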

15 Transformation heuristic
1. build the loop tree
2. traverse the loop tree from innermost to outermost loop
3. evaluate profit for each candidate loop region
4. propagate profit to CFG after transformation
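Steps 2 and 3 of the heuristic can be sketched as a post-order walk over a loop tree. This is a simplified illustration under stated assumptions: the `Loop` class, the precomputed `profit` field, and the threshold are hypothetical, and the profit propagation of step 4 is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Loop:
    """Node of a loop tree; children are the loops nested directly inside."""
    name: str
    profit: float = 0.0  # hypothetical estimated gain of transforming this loop
    children: list = field(default_factory=list)


def innermost_first(loop):
    """Yield loops so that every nested loop comes before its enclosing loop."""
    for child in loop.children:
        yield from innermost_first(child)
    yield loop


def select_regions(root, threshold=0.0):
    """Names of loops whose estimated profit exceeds the threshold."""
    return [l.name for l in innermost_first(root) if l.profit > threshold]


# Illustrative loop nest with made-up profit values.
root = Loop("outer", 0.05, [Loop("inner1", 0.2), Loop("inner2", -0.1)])
print([l.name for l in innermost_first(root)])
print(select_regions(root))
```

Visiting inner loops first means the hottest code is evaluated before the loops that enclose it, which is one natural way to read "from innermost to outermost".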

17 Platform
- Lx platform from STMicroelectronics
  - 4-issue VLIW machine
  - 64 GPRs, 8 CBRs
  - 4 ALUs, 1 LD/ST, 2 MULs, 1 BU
- Instruction-based energy model from STMicroelectronics
- Lx compiler
  - prefetch disabled
  - only scalar optimizations (-O2)

18 Methodology
- Post-pass optimization
- Tool flow (diagram labels): source, Lx compiler, .s file, SALTO, absciss
- Phases: phase 1, phase 2
- Instrumentation: BB frequency, D-cache misses per BB
- Hyperblock formation
- Hyperblock optimization: instr. promotion, instr. merging, instr. renaming
- Configurations: original CFG, selective hyperblock, all hyperblock

19 Results
- negligible IPC improvement
- relatively larger increase of operation count and static schedule length?

21 Conclusions
- Analytical scheme to understand the impact of ILP compilation on energy
- Heuristic shows 17% energy-delay improvement on a restricted hyperblock scheme
  => programs suffer from limited ILP, which quickly turns into wasted energy
  => need to go beyond compiler-centric approaches in order to overcome ILP limitations
- What is missing:
  - impact of post-optimization passes has not been determined
  - only a restricted hyperblock scheme has been evaluated
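For readers unfamiliar with the energy-delay figure quoted above, the following sketch shows how a relative improvement in the energy-delay product is commonly computed. The slides do not state how the 17% was derived, so the formula choice, the function names, and the input numbers are all assumptions for illustration.

```python
def energy_delay_product(energy, delay):
    """Energy-delay product: lower is better."""
    return energy * delay


def edp_improvement(energy_before, delay_before, energy_after, delay_after):
    """Relative reduction of the energy-delay product (0.1 means 10% better)."""
    before = energy_delay_product(energy_before, delay_before)
    after = energy_delay_product(energy_after, delay_after)
    return 1.0 - after / before


# Made-up inputs: 10% less energy and 5% less delay after the transformation.
gain = edp_improvement(100.0, 100.0, 90.0, 95.0)
print(f"{gain:.1%}")
```

Because energy and delay multiply, a transformation that shortens the schedule but inflates the operation count can still come out worse on this metric.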

22 Thanks!