fakultät für informatik informatik 12 technische universität dortmund Additional compiler optimizations Peter Marwedel TU Dortmund Informatik 12 Germany.

Slides:



Advertisements
Similar presentations
Spatial Computation Thesis committee: Seth Goldstein Peter Lee Todd Mowry Babak Falsafi Nevin Heintze Ph.D. Thesis defense, December 8, 2003 SCS Mihai.
Advertisements

Technische universität dortmund fakultät für informatik informatik 12 Models of computation Peter Marwedel TU Dortmund Informatik 12 Graphics: © Alexandra.
Fakultät für informatik informatik 12 technische universität dortmund Resource Access Protocols Peter Marwedel Informatik 12 TU Dortmund Germany 2008/12/06.
Fakultät für informatik informatik 12 technische universität dortmund Optimizations - Compilation for Embedded Processors - Peter Marwedel TU Dortmund.
Technische universität dortmund fakultät für informatik informatik 12 Specifications and Modeling Peter Marwedel TU Dortmund, Informatik 12 Graphics: ©
Evaluation and Validation
Fakult ä t f ü r informatik informatik 12 technische universit ä t dortmund Data flow models Peter Marwedel TU Dortmund, Informatik 12 Graphics: © Alexandra.
Fakultät für informatik informatik 12 technische universität dortmund Specifications and Modeling Peter Marwedel TU Dortmund, Informatik 12 Graphics: ©
Fakultät für informatik informatik 12 technische universität dortmund Standard Optimization Techniques Peter Marwedel Informatik 12 TU Dortmund Germany.
Fakultät für informatik informatik 12 technische universität dortmund SDL Peter Marwedel TU Dortmund, Informatik 12 Graphics: © Alexandra Nolte, Gesine.
Technische universität dortmund fakultät für informatik informatik 12 Discrete Event Models Peter Marwedel TU Dortmund, Informatik 12 Germany
Technische universität dortmund fakultät für informatik informatik 12 Specifications and Modeling Peter Marwedel TU Dortmund, Informatik
fakultät für informatik informatik 12 technische universität dortmund Optimizations - Compilation for Embedded Processors - Peter Marwedel TU Dortmund.
Technische universität dortmund fakultät für informatik informatik 12 Embedded System Design: Embedded Systems Foundations of Cyber-Physical Systems Peter.
Fakultät für informatik informatik 12 technische universität dortmund Optimizing embedded software for timing-predictability and memory-awareness Peter.
Fakultät für informatik informatik 12 technische universität dortmund Classical scheduling algorithms for periodic systems Peter Marwedel TU Dortmund,
Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors Onur Mutlu, The University of Texas at Austin Jared Start,
Re-examining Instruction Reuse in Pre-execution Approaches By Sonya R. Wolff Prof. Ronald D. Barnes June 5, 2011.
© 2004 Wayne Wolf Topics Task-level partitioning. Hardware/software partitioning.  Bus-based systems.
- 1 -  P. Marwedel, Univ. Dortmund, Informatik 12, 2003 Universität Dortmund Hardware/Software Codesign.
Technische universität dortmund fakultät für informatik informatik 12 Embedded System Design: Embedded Systems Foundations of Cyber-Physical Systems Jian-Jia.
Fakultät für informatik informatik 12 technische universität dortmund Lab 3: Scheduling Solution - Session 10 - Heiko Falk TU Dortmund Informatik 12 Germany.
Petri Nets Jian-Jia Chen (slides are based on Peter Marwedel)
Technische universität dortmund fakultät für informatik informatik 12 Specifications, Modeling, and Model of Computation Jian-Jia Chen (slides are based.
Technische universität dortmund fakultät für informatik informatik 12 Discrete Event Models Jian-Jia Chen (slides are based on Peter Marwedel) TU Dortmund,
Technische universität dortmund fakultät für informatik informatik 12 Limits of von-Neumann (thread-based) computing Jian-Jia Chen (Slides are based on.
Compiler Support for Superscalar Processors. Loop Unrolling Assumption: Standard five stage pipeline Empty cycles between instructions before the result.
1 Optimizing compilers Managing Cache Bercovici Sivan.
Hardware/ Software Partitioning 2011 年 12 月 09 日 Peter Marwedel TU Dortmund, Informatik 12 Germany Graphics: © Alexandra Nolte, Gesine Marwedel, 2003 These.
HW 2 is out! Due 9/25!. CS 6290 Static Exploitation of ILP.
Computer Organization and Architecture (AT70.01) Comp. Sc. and Inf. Mgmt. Asian Institute of Technology Instructor: Dr. Sumanta Guha Slide Sources: Based.
1 Adapted from UCB CS252 S01, Revised by Zhao Zhang in IASTATE CPRE 585, 2004 Lecture 14: Hardware Approaches for Cache Optimizations Cache performance.
fakultät für informatik informatik 12 technische universität dortmund Test Peter Marwedel TU Dortmund Informatik 12 Germany Graphics: © Alexandra Nolte,
Fakultät für informatik informatik 12 technische universität dortmund Lab 4: Exploiting the memory hierarchy - Session 14 - Peter Marwedel Heiko Falk TU.
Fakultät für informatik informatik 12 technische universität dortmund Lab 3: Scheduling - Session 10 - Peter Marwedel Heiko Falk TU Dortmund Informatik.
Mapping of Applications to Platforms Peter Marwedel TU Dortmund, Informatik 12 Germany Graphics: © Alexandra Nolte, Gesine Marwedel, 2003 These slides.
Compiler Challenges for High Performance Architectures
Compiler Challenges, Introduction to Data Dependences Allen and Kennedy, Chapter 1, 2.
Cpeg421-08S/final-review1 Course Review Tom St. John.
Aggregating Processor Free Time for Energy Reduction Aviral Shrivastava 1 Eugene Earlie 2 Nikil Dutt 1 Alex Nicolau 1 1 Center For Embedded Computer Systems,
Reference Book: Modern Compiler Design by Grune, Bal, Jacobs and Langendoen Wiley 2000.
Fakultät für informatik informatik 12 technische universität dortmund Classical scheduling algorithms for periodic systems Peter Marwedel TU Dortmund,
Universität Dortmund  P. Marwedel, Univ. Dortmund, Informatik 12, 2003 Hardware/software partitioning  Functionality to be implemented in software.
Compiler course 1. Introduction. Outline Scope of the course Disciplines involved in it Abstract view for a compiler Front-end and back-end tasks Modules.
CSc 453 Final Code Generation Saumya Debray The University of Arizona Tucson.
Evaluation and Validation Peter Marwedel TU Dortmund, Informatik 12 Germany 2013 年 12 月 02 日 These slides use Microsoft clip arts. Microsoft copyright.
CDA 5155 Superscalar, VLIW, Vector, Decoupled Week 4.
designKilla: The 32-bit pipelined processor Brought to you by: Victoria Farthing Dat Huynh Jerry Felker Tony Chen Supervisor: Young Cho.
Fakultät für informatik informatik 12 technische universität dortmund Worst-Case Execution Time Analysis - Session 19 - Heiko Falk TU Dortmund Informatik.
Compilers for Embedded Systems Ram, Vasanth, and VJ Instructor : Dr. Edwin Sha Synthesis and Optimization of High-Performance Systems.
1 Compiler Design (40-414)  Main Text Book: Compilers: Principles, Techniques & Tools, 2 nd ed., Aho, Lam, Sethi, and Ullman, 2007  Evaluation:  Midterm.
ECEG-3202 Computer Architecture and Organization Chapter 7 Reduced Instruction Set Computers.
Chapter# 6 Code generation.  The final phase in our compiler model is the code generator.  It takes as input the intermediate representation(IR) produced.
Memory-Aware Compilation Philip Sweany 10/20/2011.
Presented by : A best website designer company. Chapter 1 Introduction Prof Chung. 1.
Fakultät für informatik informatik 12 technische universität dortmund Prepass Optimizations - Session 11 - Heiko Falk TU Dortmund Informatik 12 Germany.
Compilers: History and Context COMP Outline Compilers and languages Compilers and architectures – parallelism – memory hierarchies Other uses.
Compiler Design (40-414) Main Text Book:
Chapter 1 Introduction.
Chapter 1 Introduction.
Optimizations - Compilation for Embedded Processors -
Standard Optimization Techniques
CSCI1600: Embedded and Real Time Software
Lecture 14: Reducing Cache Misses
Evaluation and Validation
Compiler Construction
Instruction Level Parallelism (ILP)
Cache Performance Improvements
CSc 453 Final Code Generation
CSCI1600: Embedded and Real Time Software
Presentation transcript:

fakultät für informatik informatik 12 technische universität dortmund Additional compiler optimizations Peter Marwedel TU Dortmund Informatik 12 Germany Graphics: © Alexandra Nolte, Gesine Marwedel, 2003

- 2 - technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Structure of this course 2: Specification 3: ES-hardware 4: system software (RTOS, middleware, …) 8: Test 5: Validation & Evaluation (energy, cost, performance, …) 7: Optimization 6: Application mapping Application Knowledge Design repository Design Numbers denote sequence of chapters

- 3 - technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Retargetable Compilers vs. Standard Compilers Developer retargetability: compiler specialists responsible for retargeting compilers. User retargetability: users responsible for retargeting compiler. Retargetable Compiler Standard Compiler Machine model Machine model Compiler mainly manual process Machine model Machine model Compiler mainly automatic process

- 4 - technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Compiler structure Compiler frontend Codeselection Register allocation Scheduling Optimisations Intermediate re- presentation (IR) C-source Object code Optimisations

- 5 - technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Code selection = covering DFGs ** + MEM load mul add mac Does not yet consider data moves to input registers.

- 6 - technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Code selection by tree parsing (1) terminals: {MEM, *, +} non-terminals: {reg1,reg2,reg3} start symbol: reg1 rules: add (cost=2): reg1 -> + (reg1, reg2) mul (cost=2): reg1 -> * ( reg1,reg2) mac (cost=3): reg1 -> + (*(reg1,reg2), reg3) load (cost=1): reg1 -> MEM mov2(cost=1): reg2 -> reg1 mov3(cost=1): reg3 -> reg1 Specification of grammar for generating a iburg parser*:

- 7 - technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Code selection by tree parsing (2) - nodes annotated with (register/pattern/cost)-triples - load(cost=1): reg1 -> MEM mov2(cost=1): reg2 -> reg1 mov3(cost=1): reg3 -> reg1 add (cost=2): reg1 -> +(reg1, reg2) mul (cost=2): reg1 -> *(reg1,reg2) mac (cost=3): reg1->+(*(reg1,reg2), reg3) ** + reg1:load:1 reg2:mov2:2 reg3:mov3:2 reg1:mul:5 reg2:mov2:6 reg3:mov3:6 MEM reg1:add:13 reg1:mac:12

- 8 - technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Code selection by tree parsing (3) - final selection of cheapest set of instructions - ** + MEM reg1:add:13 reg1:mac:12 mac load mul mov2 mov3 mov2 Includes routing of values between various registers!

- 9 - technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Prefetching Prefetch instructions load values into the cache Pipeline not stalled for prefetching Prefetching instructions introduced in ~ Potentially, all miss latencies can be avoided Disadvantages: Increased # of instructions Potential premature eviction of cache line Potentially pre-loads lines that are never used Steps Determination of references requiring prefetches Insertion of prefetches (early enough!) [R. Allen, K. Kennedy: Optimizing Compilers for Modern Architectures, Morgan-Kaufman, 2002]

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Results for prefetching [Mowry, as cited by R. Allen & K. Kennedy] © Morgan-Kaufman, 2002 Not very impressive!

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Optimization for exploiting processor-memory interface: Problem Definition (1) [A. Shrivastava, E. Earlie, N. Dutt, A. Nicolau: Aggregating processor free time for energy reduction, Intern. Conf. on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005, pp ] XScale is stalled for 30% of time, but each stall duration is small Average stall duration = 4 cycles Longest stall duration < 100 cycles Break-even stall duration for profitable switching 360 cycles Maximum processor stall < 100 cycles NOT possible to switch the processor to IDLE mode Based on slide by A. Shrivastava

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Optimization for exploiting processor-memory interface: Problem Definition (2) CT (Computation Time): Time to execute an iteration of the loop, assuming all data is present in the cache DT (Data Transfer Time): Time to transfer data required by an iteration of a loop between cache and memory Consider the execution of a memory-bound loop (DT > CT) Processor has to stall Time Activity Processor Activity Memory Bus Activity Processor activity is dis-continuous Memory activity is dis-continuous for (int i=0; i<1000; i++) c[i] = a[i] + b[i]; Based on slide by A. Shrivastava

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Optimization for exploiting processor-memory interface: Prefetching Solution Time Activity Processor Activity Memory Bus Activity Each processor activity period increases for (int i=0; i<1000; i++) prefetch a[i+4]; prefetch b[i+4]; prefetch c[i+4]; c[i] = a[i] + b[i]; Memory activity is continuous Processor activity is dis-continuous Memory activity is continuous Total execution time reduces Based on slide by A. Shrivastava

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Memory hierarchy description languages: ArchC Consists of description of ISA and HW architecture Extension of SystemC (can be generated from ArchC): Storage class structure [P. Viana, E. Barros, S. Rigo, R. Azevedo, G. Araújo: Exploring Memory Hierarchy with ArchC, 15th Symposium on Computer Architecture and High Performance Computing, 2003, pp. 2 – 9]

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Example: Description of a simple cache-based architecture

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Memory Aware Compilation and Simulation Framework (for C) MACC [M. Verma, L. Wehmeyer, R. Pyka, P. Marwedel, L. Benini: Compilation and Simulation Tool Chain for Memory Aware Energy Optimizations, Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS VI), 2006]. Application C code Source-level memory optimizer encc, ARM gcc, M5 DSP Array partitioning SPM overlay Executable binary Energy database Memory hierarchy description Compilation Framework Profile report Memory simulator Processor simulators (ARM7/M5) Profiler Simulation Framework MPSoC simulator

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Memory architecture MACCv2 Query can include address, time stamp, value, … Query can request energy, delay, stored values Query processed along a chain of HW components, incl. busses, ports, address translations etc., each adding delay & energy [R. Pyka et al.: Versatile System level Memory Description Approach for embedded MPSoCs, University of Dortmund, Informatik 12, 2007] API query to model simplifies integration into compiler External XML representation REQ Energy= ? Cycles= ? +10 Energy +5 Cycles +1 Energy +2 Cycles +1 Energy +0 Cycles CPU1 MM ASPC-1 - IFETCH - DRD - DWR - MAINAS ASPC-M - 0…ffff ASPC-B - 0 … 3ffff

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Controlling tool chain generation through an architecture description language (ADL): EXPRESSION Overall information flow [P. Mishra, A. Shrivastava, N. Dutt: Architecture description language (ADL)-driven software toolkit generation for architectural exploration of programmable SOCs, ACM Trans. Des. Autom. Electron. Syst. (TODAES), 2006, pp ]

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Description of Memories in EXPRESSION Generic approach, based on the analysis of a wide range of systems; Used for verification. (STORAGE_SECTION (DataL1 (TYPE DCACHE) (WORDSIZE 64) (LINESIZE 8) (NUM_LINES 1024) (ASSOCIATIVITY 2) (READ_LATENCY 1)... (REPLACEMENT_POLICY LRU) (WRITE_POLICY WRITE_BACK) ) (ScratchPad (TYPE SRAM) (ADDRESS_RANGE ) …. ) (SB (TYPE STREAM_BUFFER) ….. (InstL1 (TYPE ICACHE) ……… ) (L2 (TYPE DCACHE) ……. ) (MainMemory (TYPE DRAM) ) (Connect (TYPE CONNECTIVITY) (CONNECTIONS (InstL1, L2) (DataL1, SB) (SB, L2) (L2, MainMemory) )))

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 EXPRESSION: results q

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Optimization for main memory Exploiting burst mode of DRAM (1) [P. Grun, N. Dutt, A. Nicolau: Memory aware compilation through accurate timing extraction, DAC, 2000, pp. 316 – 321] Supported trafos: memory mapping, code reordering or loop unrolling

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Optimization for main memory Exploiting burst mode of DRAM (2) Timing extracted from EXPRESSION model for(i=0; i<9;i+=3){ a=a+x[i]+x[i+1]+x[i+2]+ y[i]+y[i+1]+y[i+2]; b=b+z[i]+z[i+1]+z[i+2]+ u[i]+u[i+1]+u[i+2];} Open circles of original paper changed into closed circles (column decodes). 2 banks

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Memory hierarchies beyond main memory Massive datasets are being collected everywhere Storage management software is billion-$ industry Examples (2002): Phone: AT&T 20TB phone call database, wireless tracking Consumer: WalMart 70TB database, buying patterns WEB: Web crawl of 200M pages and 2000M links, Akamai stores 7 billion clicks per day Geography: NASA satellites generate 1.2TB per day [© Larse Arge, I/O-Algorithms, More New Information Over Next 2 Years Than in All Previous History

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Example: LIDAR Terrain Data ~1,2 km ~ 280 km/h at m ~ 1,5 m between measurements COWI A/S (and others) is currently scanning Denmark [© Larse Arge, I/O-Algorithms,

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Application Example: Flooding Prediction +1 meter +2 meter [© Larse Arge, I/O-Algorithms,

fakultät für informatik informatik 12 technische universität dortmund Dynamic Voltage Scaling Peter Marwedel TU Dortmund Informatik 12 Germany 2009/01/17 Graphics: © Alexandra Nolte, Gesine Marwedel, 2003

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Voltage Scaling and Power Management Dynamic Voltage Scaling V dd

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Recap from chapter 3: Fundamentals of dynamic voltage scaling (DVS) Power consumption of CMOS circuits (ignoring leakage): Delay for CMOS circuits: Decreasing V dd reduces P quadratically, while the run-time of algorithms is only linearly increased (ignoring the effects of the memory system).

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Example: Processor with 3 voltages Case a): Complete task ASAP Task that needs to execute 10 9 cycles within 25 seconds. E a = 10 9 x 40 x = 40 [J]

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Case b): Two voltages E b = x – x = 32.5 [J]

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Case c): Optimal voltage E c = 10 9 x 25 x = 25 [J]

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Observations A minimum energy consumption is achieved for the ideal supply voltage of 4 Volts. In the following: variable voltage processor = processor that allows any supply voltage up to a certain maximum. It is expensive to support truly variable voltages, and therefore, actual processors support only a few fixed voltages.

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Generalisation Lemma [Ishihara, Yasuura]: If a variable voltage processor completes a task before the deadline, the energy consumption can be reduced. If a processor uses a single supply voltage V and completes a task T just at its deadline, then V is the unique supply voltage which minimizes the energy consumption of T. If a processor can only use a number of discrete voltage levels, then a voltage schedule with at most two voltages minimizes the energy consumption under any time constraint. If a processor can only use a number of discrete voltage levels, then the two voltages which minimize the energy consumption are the two immediate neighbors of the ideal voltage V ideal possible for a variable voltage processor.

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 The case of multiple tasks: Assignment of optimum voltages to a set of tasks N : the number of tasks EC j : the number of executed cycles of task j L : the number of voltages of the target processor V i : the i th voltage, with 1 i L F i : the clock frequency for supply voltage V i T : the global deadline at which all tasks must have been completed X i, j : the number of clock cycles task j is executed at voltage V i SC j : the average switching capacitance during the execution of task j (SC i comprises the actual capacitance CL and the switching activity )

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Designing an I(L)P model Simplifying assumptions of the IP-model include the following: There is one target processor that can be operated at a limited number of discrete voltages. The time for voltage and frequency switches is negligible. The worst case number of cycles for each task are known.

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Energy Minimization using an Integer Programming Model Minimize subject to and

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Dynamic power management (DPM) Dynamic Power management tries to assign optimal power saving states. Questions: When to go to an power-saving state? Different, but typically complex models: Markov chains, renewal theory, ….

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Experimental Results © Yasuura, 2000

technische universität dortmund fakultät für informatik p. marwedel, informatik 12, 2009 Summary Retargetability Optimizations exploiting memory hierarchies Prefetching Memory-architecture aware compilation -Burst mode access exploited by EXPRESSION Support for FLASH memory Memory hierarchies beyond main memory Dynamic voltage scaling (DVS) An ILP model for voltage assignment in a multi- tasking system Dynamic power management (DPM) (briefly)