High Performance Embedded Computing © 2007 Elsevier. Chapter 3, part 1: Programs. Wayne Wolf.



© 2006 Elsevier Topics Code generation and back-end compilation. Memory-oriented software optimizations.

© 2006 Elsevier Embedded vs. general-purpose compilers General-purpose compilers must generate code for a wide range of programs:  No real-time requirements.  Often no explicit low-power requirements.  Generally want fast compilation times. Embedded compilers must meet real-time, low-power requirements.  May be willing to wait longer for compilation results.

© 2006 Elsevier Code generation steps Instruction selection chooses opcodes, modes. Register allocation binds values to registers.  Many DSPs and ASIPs have irregular register sets. Address generation selects addressing mode, registers, etc. Instruction scheduling is important for pipelining and parallelism.

© 2006 Elsevier twig model for instruction selection twig models instructions and programs as graphs, and covers the program graph with instruction graphs.  Covering can be driven by costs.

© 2006 Elsevier twig instruction models Rewriting rule:  replacement <- template {cost} = action. Dynamic programming can be used to cover the program with instructions when the instructions are tree-structured.  Must use heuristics for more general instructions.
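The cost-driven covering idea can be sketched as a small dynamic program over an expression tree. This is a minimal illustration in the spirit of twig, not its actual implementation; the instruction patterns, names, and costs (including the fused `madd`) are invented for the example.

```python
class Node:
    """Expression-tree node; leaves are VAR/CONST, interior nodes are ops."""
    def __init__(self, op, left=None, right=None):
        self.op, self.left, self.right = op, left, right

# Hypothetical instruction patterns: (name, cost, matcher). A matcher
# returns the list of subtrees still needing coverage, or None on no match.
def match_add(n):        # add r, r      <- ADD(reg, reg)
    return [n.left, n.right] if n.op == "ADD" else None

def match_const(n):      # li r, imm     <- CONST
    return [] if n.op == "CONST" else None

def match_var(n):        # ld r, var     <- VAR
    return [] if n.op == "VAR" else None

def match_mul(n):        # mul r, r      <- MUL(reg, reg)
    return [n.left, n.right] if n.op == "MUL" else None

def match_madd(n):       # madd r, r, r  <- ADD(MUL(reg, reg), reg)
    if n.op == "ADD" and n.left is not None and n.left.op == "MUL":
        return [n.left.left, n.left.right, n.right]
    return None

PATTERNS = [("add", 1, match_add), ("li", 1, match_const),
            ("ld", 1, match_var), ("mul", 2, match_mul),
            ("madd", 2, match_madd)]

def cover(n):
    """Return (cost, instructions) of the cheapest cover of tree n."""
    best = None
    for name, cost, matcher in PATTERNS:
        kids = matcher(n)
        if kids is None:
            continue
        total, insns = cost, [name]
        for k in kids:
            c, i = cover(k)
            total += c
            insns = i + insns
        if best is None or total < best[0]:
            best = (total, insns)
    return best

# a*b + c: the fused madd (cost 2 + three loads) beats mul + add (cost 3 + loads).
tree = Node("ADD", Node("MUL", Node("VAR"), Node("VAR")), Node("VAR"))
cost, insns = cover(tree)
```

The cheapest cover picks the fused pattern because dynamic programming compares the total cost of each candidate tiling at every node.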

© 2006 Elsevier ASIP instruction description PEAS-III describes pipeline resources used by an instruction. Leupers and Marwedel model instructions as register transfers and NOPs. Register transfers are executed under conditions.

© 2006 Elsevier Register allocation and lifetimes Variable lifetime: the time during which the variable's value must be maintained, from first definition to last use. Allocate registers to program variables: a 1-to-1 mapping (# registers = # program variables) is excessive; an N-to-M mapping lets registers share variable values. Take advantage of variable lifetimes to minimize the number of registers. May still need to spill registers to memory.

© 2006 Elsevier Clique covering – minimize registers Nodes are variables; edges connect variables with disjoint lifetimes.  Clique: every pair of vertices is connected by an edge, so a clique is a cluster of variables whose lifetimes are mutually disjoint and which can therefore share one register. Cliques in the graph describe registers; cliques should be maximal to minimize the number of registers. Clique covering is performed by graph-coloring heuristics.

© 2006 Elsevier Clique covering example Schedule (one operation per cycle): t=0: x1 <= a+b; t=1: x2 <= a+c; t=2: x3 <= x1-d; t=3: x4 <= x1+d; t=4: x5 <= x3+x2; t=5: x6 <= x2+x5; t=6: x7 <= x4+x6 (must hold final output). Resulting register assignment: Reg 1: x3, x5, x6, x7. Reg 2: x1, x4. Reg 3: x2.
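The register count in this example can be reproduced with a simple greedy coloring of the interference graph. A sketch, assuming half-open live ranges derived from the schedule above (a value read at time t frees its register for a value written at t); the particular variable-to-register mapping may differ from the slide's, but the register count matches.

```python
# Live ranges [definition, last use) from the schedule above.
live = {"x1": (0, 3), "x2": (1, 5), "x3": (2, 4), "x4": (3, 6),
        "x5": (4, 5), "x6": (5, 6), "x7": (6, 7)}  # x7 held past t=6

def overlaps(a, b):
    """Half-open interval overlap: live ranges that share a cycle interfere."""
    return a[0] < b[1] and b[0] < a[1]

def color(live):
    """Greedy coloring of the interference graph; returns var -> register."""
    assignment = {}
    for v in live:  # insertion order = definition order
        taken = {assignment[u] for u in assignment
                 if overlaps(live[u], live[v])}
        reg = 0
        while reg in taken:
            reg += 1
        assignment[v] = reg
    return assignment

regs = color(live)
num_regs = max(regs.values()) + 1  # 3 registers, matching the slide
```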

Functional Unit Allocation © 2006 Elsevier [Figure: data flow graph for x1 <= a+b; x2 <= a+c; x3 <= x1-d; x4 <= x1+d; x5 <= x3+x2; x6 <= x2+x5; x7 <= x4+x6, scheduled two ways. As Soon As Possible (ASAP): critical path = 4 cycles, 3 +/- units required. As Late As Possible (ALAP): 2 +/- units required.] Notice that scheduling decisions affect variable liveness and register allocation!

© 2006 Elsevier VLIW register files VLIW register sets are often partitioned.  Values must be explicitly copied. Jacome and de Veciana divide program into windows:  Window start and stop, data path resource, set of activities bound to that resource within the time range. Construct basic windows, then aggregated windows. Schedule aggregated windows while propagating delays.

© 2006 Elsevier FlexWare instruction definition [Lie94] © 1994 IEEE

© 2006 Elsevier Other techniques PEAS-III categorizes instructions: arithmetic/logic, control, load/store, stack, special.  Compiler traces resource utilization, calculates latency and throughput. Mesman et al. modeled code scheduling constraints with constraint graph.  Model data dependencies, multicycle ops, etc.  Solve system by adding some edges to fix some operation times.

© 2006 Elsevier Araujo and Malik Optimal selection/allocation/scheduling algorithm for a restricted class of architectures: each storage location has either one or an unbounded number of registers available. Uses a tree-grammar parser to select instructions and allocate registers; uses an O(n) algorithm to schedule instructions. [Ara95] © 1995 IEEE

© 2006 Elsevier Araujo and Malik algorithm [Ara95] © 1995 IEEE

© 2006 Elsevier Code placement Place code to minimize cache conflicts. Possible cache conflicts may be determined using addresses; interesting conflicts are determined through analysis. May require blank areas in program.

© 2006 Elsevier Hwu and Chang Analyzed traces to find relative execution times. Inline expanded infrequently used subroutines. Placed frequently-used traces using greedy algorithm.

© 2006 Elsevier McFarling Analyzed program structure, trace information. Annotated program with loop execution count, basic block size, procedure call frequency. Walked through program to propagate labels, group code based on labels, place code groups to minimize interference.

© 2006 Elsevier McFarling procedure inlining Estimated the number of cache misses in a loop in terms of:  s_l = effective loop body size.  s_b = basic block size.  f = average execution frequency of the block.  M_l = number of misses per loop instance.  l = average number of loop iterations.  S = cache size. Estimated the new cache miss rate under inlining; used a greedy algorithm to select functions to inline.

© 2006 Elsevier Pettis and Hansen Profiled programs using gprof. Put caller and callee close together in the program, increasing the chance they would be on the same page. Ordered procedures using the call graph, weighted by number of invocations, merging the most heavily weighted edges first. Optimized if-then-else code to take advantage of the processor’s branch prediction mechanism. Identified basic blocks that were not executed by the given input data and moved them into separate procedures to improve cache behavior.
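The edge-merging step can be sketched as a greedy pass over the weighted call graph. This is a simplification of Pettis and Hansen's chain-merging (their algorithm also considers chain orientation when concatenating); the call graph, procedure names, and weights here are invented.

```python
# Hypothetical weighted call graph: (caller, callee) -> invocation count.
edges = {("main", "parse"): 120, ("parse", "lex"): 400,
         ("main", "report"): 3, ("parse", "error"): 1,
         ("lex", "getc"): 900}

def order_procedures(edges):
    """Merge procedure chains along the heaviest call-graph edges first."""
    procs = sorted({p for e in edges for p in e})
    chain_of = {p: p for p in procs}   # procedure -> chain id
    chains = {p: [p] for p in procs}   # chain id -> ordered procedures
    for (a, b), w in sorted(edges.items(), key=lambda kv: -kv[1]):
        ca, cb = chain_of[a], chain_of[b]
        if ca == cb:
            continue                   # already in the same chain
        chains[ca].extend(chains[cb])  # concatenate the two chains
        for p in chains[cb]:
            chain_of[p] = ca
        del chains[cb]
    # Concatenate any remaining chains into one final layout order.
    return [p for c in chains.values() for p in c]

layout = order_procedures(edges)
```

Hot caller/callee pairs (lex-getc, parse-lex) end up adjacent in the layout, which is exactly the locality the placement is after.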

© 2006 Elsevier Tomiyama and Yasuura Formulated trace placement as an integer linear programming problem. The basic method increased code size; an improved method combined traces to create merged traces that fit evenly into cache lines.

© 2006 Elsevier FlexWare programming environment [Pau02] © 2002 IEEE

© 2006 Elsevier Memory-oriented optimizations Memory is a key bottleneck in many embedded systems. Memory usage can be optimized at any level of the memory hierarchy. Can target data or instructions. Global flow analysis can be particularly useful.

© 2006 Elsevier Loop transformations Data dependencies may be within or between loop iterations. A loop nest has loops enclosed by other loops. A perfect loop nest has no conditional statements.

© 2006 Elsevier Types of loop transformations Loop permutation changes order of loops. Index rewriting changes the form of the loop indexes. Loop unrolling copies the loop body. Loop splitting creates separate loops for operations in the loop body. Loop merging combines loop bodies. Loop padding adds data elements to change cache characteristics.
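One of these transformations, loop unrolling, can be illustrated directly. A minimal sketch; the array contents and trip count are invented, and the unrolled version assumes the trip count is divisible by the unroll factor.

```python
N = 8
a = list(range(N))
b = [0] * N

def original(a, b):
    """Original loop: one element per iteration."""
    out = b[:]
    for i in range(N):
        out[i] = a[i] * 2
    return out

def unrolled(a, b):
    """After unrolling by 2: the body is copied, halving loop overhead.
    Assumes N is even (otherwise a cleanup loop is needed)."""
    out = b[:]
    for i in range(0, N, 2):
        out[i] = a[i] * 2
        out[i + 1] = a[i + 1] * 2
    return out
```

Both versions compute the same result; unrolling also exposes more independent operations per iteration for scheduling.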

© 2006 Elsevier Polytope model Loop transformations can be modeled as matrix operations:
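As a concrete illustration (a standard textbook formulation, not taken from the slide), loop interchange of a doubly nested loop maps the iteration vector through a permutation matrix:

```latex
\begin{pmatrix} i' \\ j' \end{pmatrix}
=
\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}
\begin{pmatrix} i \\ j \end{pmatrix}
=
\begin{pmatrix} j \\ i \end{pmatrix}
```

Other unimodular transformations (skewing, reversal) are expressed the same way, with legality checked against the loop's dependence vectors.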

© 2006 Elsevier Loop permutation and fusion Before fusion, two separate loops over i: one executing a[i] = a[i] + 5 and one executing b[i] = a[i] + 10. After loop fusion, a single loop over i executes both statements: a[i] = a[i] + 5; b[i] = a[i] + 10.

© 2006 Elsevier Kandemir et al. loop energy experiments [Kan00] © 2000 ACM Press

© 2006 Elsevier Java transformations Real-Time Specification for Java (RTSJ) specifies Java for real time:  Scheduling: requires fixed-priority scheduler with at least 28 priorities.  Memory management: allows program to operate outside the heap.  Synchronization: additional mechanisms.

© 2006 Elsevier Optimizing compiler flow (Bacon et al.) Procedure restructuring inlines functions, eliminates tail recursion, etc. High-level data flow optimization reduces operator strength, moves loop-invariant code, etc. Partial evaluation simplifies algebra, computes constants, etc. Loop preparation peels loops, etc. Loop reordering interchanges, skews, etc.
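Two of the transformations named above, strength reduction and loop-invariant code motion, can be shown on a toy loop. A hedged sketch with invented inputs; a compiler performs these rewrites on its intermediate representation, not on source.

```python
def before(xs, a, b):
    """Naive loop: i * 4 and a * b are recomputed every iteration."""
    out = []
    for i in range(len(xs)):
        out.append(xs[i] + i * 4 + (a * b))
    return out

def after(xs, a, b):
    """Same computation after strength reduction (i * 4 becomes a
    running sum) and loop-invariant code motion (a * b hoisted out)."""
    inv = a * b        # hoisted loop-invariant expression
    step = 0           # replaces the multiplication i * 4
    out = []
    for x in xs:
        out.append(x + step + inv)
        step += 4
    return out
```

The strength-reduced version replaces a multiply per iteration with an add, which mattered greatly on the embedded targets this chapter considers.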

© 2006 Elsevier Catthoor et al. methodology Memory-oriented data flow analysis and model extraction. Global data flow transformations. Global loop and control flow optimizations. Data reuse decisions for memory hierarchy. Memory organization. In-place optimization.

© 2006 Elsevier Buffer management Excessive dynamic memory management wastes cycles and energy with no functional improvement. IMEC: analyze code to understand data transfer requirements; balance concerns across the program. Panda et al.: loop transformations can improve buffer utilization. Before: for (i=0; i<N; ++i) for (j=0; j<N-L; ++j) b[i][j] = 0; for (i=0; i<N; ++i) for (j=0; j<N-L; ++j) for (k=0; k<L; ++k) b[i][j] += a[i][j+k]; After fusing the two i/j nests: for (i=0; i<N; ++i) for (j=0; j<N-L; ++j) { b[i][j] = 0; for (k=0; k<L; ++k) b[i][j] += a[i][j+k]; } The initialization and accumulation of each b[i][j] now occur closer together.

© 2006 Elsevier Cache optimizations Strategies:  Move data to reduce the number of conflicts.  Move data to take advantage of prefetching. Need:  Load map.  Information on access frequencies.

© 2006 Elsevier Cache data placement Panda et al.: place data to reduce cache conflicts. 1. Build closeness graph for accesses. 2. Cluster variables into cache-line sized units. 3. Build a cluster interference graph. 4. Use interference graph to optimize placement. [Pan97] © 1997 ACM Press

© 2006 Elsevier Array placement Panda et al.: improved the conflict test to handle arrays. Given addresses X and Y, a cache line size of k words, and a cache holding M words, they derive formulas for when X and Y overlap in the cache:
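One standard direct-mapped formulation (an assumption here, not necessarily Panda et al.'s exact overlap test): with line size k and M/k lines in the cache, X and Y map to the same cache line when

```latex
\left\lfloor X / k \right\rfloor \bmod (M/k)
\;=\;
\left\lfloor Y / k \right\rfloor \bmod (M/k)
```

The array version of the test applies such a condition across the ranges of addresses touched by each array reference.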

© 2006 Elsevier Array assignment algorithm [Pan97] © 1997 IEEE

© 2006 Elsevier Data and loop transformations Kandemir et al.: combine data and loop transformations to optimize cache performance. Transform the loop nest so that the innermost loop index is the only index appearing in one dimension of an array reference (and is unused in the other dimensions). Align the right-hand-side references to conform to the left-hand side. Search the right-hand-side transformations to choose the best one.

© 2006 Elsevier Scratch pad optimizations Panda et al.: assign scalars statically; analyze cache conflicts to choose between scratch pad and cache for arrays. VAC(u): variable access count. IAC(u): interference access count. IF(u): interference factor, IF(u) = VAC(u) + IAC(u). LCF(u): loop conflict factor. TCF(u): total conflict factor.

© 2006 Elsevier Scratch pad allocation formulation AD(c): access density.
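The flavor of the formulation can be sketched as a greedy knapsack over access density. This is a simplification, assuming AD(c) = TCF(c) / size(c); the candidate variables, their TCF values, sizes, and the scratch-pad capacity are invented, and Panda et al.'s actual algorithm differs in detail.

```python
# Hypothetical candidates: (name, TCF(u) total conflict factor, size in words).
candidates = [("hist", 400, 256), ("coeff", 120, 32),
              ("tmp", 90, 64), ("table", 500, 1024)]
SPM_WORDS = 512  # assumed scratch-pad capacity

def allocate_spm(candidates, capacity):
    """Greedily place the densest candidates (highest TCF per word) in the
    scratch pad until capacity runs out; the rest go through the cache."""
    chosen, used = [], 0
    for name, tcf, size in sorted(candidates, key=lambda c: -c[1] / c[2]):
        if used + size <= capacity:
            chosen.append(name)
            used += size
    return chosen, used

chosen, used = allocate_spm(candidates, SPM_WORDS)
```

Note that the large table has the highest raw conflict count but the lowest density, so it loses its scratch-pad slot to several smaller, denser variables.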

© 2006 Elsevier Scratch pad allocation algorithm [Pan00] © 2000 ACM Press

© 2006 Elsevier Scratch pad allocation performance [Pan00] © 2000 ACM Press

© 2006 Elsevier Main memory-oriented optimizations Memory chips provide several useful modes:  Burst mode accesses sequential locations.  Paged modes allow only part of the address to be transmitted.  Banked memories allow parallel accesses. Access times depend on address(es) being accessed.