Platform-based Design


Platform-based Design: Generating ILP code
TU/e 5kk70
Henk Corporaal, Bart Mesman

Overview
- Enhance performance: architecture methods
- Instruction Level Parallelism
- VLIW
- Examples: C6, TM, TTA
- Clustering and reconfigurable components
- Code generation: compiler basics, mapping and scheduling
- Hands-on

Compiler basics: Overview
- Compiler trajectory / structure / passes
- Control Flow Graph (CFG)
- Mapping and scheduling
- Basic block list scheduling
- Extended scheduling scope
- Loop scheduling
- Loop transformations

Compiler basics: trajectory
Source program → Preprocessor → Compiler → Assembler → Loader/Linker → Object program
The compiler reports error messages; library code is added at the loader/linker stage.

Compiler basics: structure / passes
Source code
→ Lexical analyzer: token generation
→ Parsing: syntax checking, semantic checking, parse tree generation
→ Intermediate code generation
→ Code optimization: data flow analysis, local optimizations, global optimizations
→ Code generation: code selection, peephole optimizations
→ Register allocation: building the interference graph, graph coloring, spill code insertion, caller/callee save and restore code
→ Sequential code
→ Scheduling and allocation: exploiting ILP
→ Object code

Compiler basics: structure, a simple example
From HLL to (sequential) assembly code for: position := initial + rate * 60
Lexical analyzer:
  id := id + id * 60
Syntax analyzer: builds the parse tree for the assignment (:=, +, *)
Intermediate code generator:
  temp1 := inttoreal(60)
  temp2 := id3 * temp1
  temp3 := id2 + temp2
  id1 := temp3
Code optimizer:
  temp1 := id3 * 60.0
  id1 := id2 + temp1
Code generator:
  movf id3, r2
  mulf #60.0, r2, r2
  movf id2, r1
  addf r2, r1
  movf r1, id1

Compiler basics: Control Flow Graph (CFG)
C input code:
  if (a > b) { r = a % b; }
  else       { r = b % a; }
CFG:
  BB1: sub t1, a, b
       bgz t1, 2, 3
  BB2: rem r, a, b
       goto 4
  BB3: rem r, b, a
       goto 4
  BB4: ...
A program is a collection of functions, each function is a collection of basic blocks, each basic block contains a set of instructions, and each instruction consists of several transports.
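As a sketch of how a compiler carves a linear instruction list into such basic blocks, the following Python fragment (the instruction encoding is hypothetical) applies the classic leader rule: the first instruction, every branch target, and every instruction after a branch starts a new block.

  # Split a list of (label, opcode, branch_targets) triples into basic blocks.
  def basic_blocks(instrs):
      targets = {t for _, _, ts in instrs for t in ts}
      leaders = set()
      for i, (label, op, ts) in enumerate(instrs):
          if i == 0 or label in targets:
              leaders.add(i)                 # entry point or branch target
          if ts and i + 1 < len(instrs):
              leaders.add(i + 1)             # instruction after a branch
      blocks, cur = [], []
      for i, ins in enumerate(instrs):
          if i in leaders and cur:
              blocks.append(cur)
              cur = []
          cur.append(ins)
      blocks.append(cur)
      return blocks

  # The if/else example above in this encoding:
  code = [(1, "sub;bgz", [2, 3]),
          (2, "rem;goto", [4]),
          (3, "rem;goto", [4]),
          (4, "...", [])]
  print(len(basic_blocks(code)))   # -> 4 basic blocks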

Compiler basics: Basic optimizations
- Machine independent optimizations
- Machine dependent optimizations

Compiler basics: Basic optimizations
Machine independent optimizations:
- Common subexpression elimination
- Constant folding
- Copy propagation
- Dead-code elimination
- Induction variable elimination
- Strength reduction
- Algebraic identities:
  - Commutative expressions
  - Associativity: tree height reduction (note: not always allowed, due to limited precision)
All these optimizations are explained in Aho, Sethi and Ullman; for details check any compiler book!
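As a small illustration of one of these passes, here is a minimal constant-folding sketch over three-address tuples (the tuple encoding is hypothetical); folded values are also propagated into later operands.

  # Fold operations whose operands are all constants, propagating results.
  def constant_fold(code):
      consts, out = {}, []
      for dst, op, x, y in code:
          x = consts.get(x, x)               # substitute known constants
          y = consts.get(y, y)
          if isinstance(x, (int, float)) and isinstance(y, (int, float)):
              consts[dst] = {"+": x + y, "*": x * y}[op]
          else:
              out.append((dst, op, x, y))
      return out, consts

  # temp1 := 60 * 1.0; temp2 := id3 * temp1   -- temp1 folds to 60.0
  code = [("temp1", "*", 60, 1.0), ("temp2", "*", "id3", "temp1")]
  print(constant_fold(code))
  # -> ([('temp2', '*', 'id3', 60.0)], {'temp1': 60.0})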

Compiler basics: Basic optimizations
Machine dependent optimization example: what is the optimal implementation of a*34?
Using the multiplier: mul Tb, Ta, 34
- Pro: no thinking required
- Con: may take many cycles
Alternative, shifts and adds:
  SHL Tc, Ta, 1      ; Tc = a*2
  ADD Tb, Tc, Tzero  ; Tb = a*2
  SHL Tc, Tc, 4      ; Tc = a*32
  ADD Tb, Tb, Tc     ; Tb = a*2 + a*32 = a*34
- Pro: may take fewer cycles
- Cons: uses more registers; additional instructions (I-cache load / code size)
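A sketch of how a compiler could derive such a sequence in general (Python; the helper and temporary names are hypothetical): decompose the constant into its set bits, emitting one shift per bit and summing the terms.

  # Emit shift/add pseudo-instructions computing dst = src * c (c > 0).
  def expand_mul_by_const(src, dst, c):
      ops, first = [], True
      for i in range(c.bit_length()):
          if (c >> i) & 1:
              if i == 0:
                  term = src
              else:
                  ops.append(("SHL", "t%d" % i, src, i))   # t_i = src << i
                  term = "t%d" % i
              if first:
                  ops.append(("MOV", dst, term))
                  first = False
              else:
                  ops.append(("ADD", dst, dst, term))
      return ops

  print(expand_mul_by_const("Ta", "Tb", 34))
  # a*34 = a*2 + a*32:
  # [('SHL','t1','Ta',1), ('MOV','Tb','t1'), ('SHL','t5','Ta',5), ('ADD','Tb','Tb','t5')]

Whether this beats the multiplier depends on the number of set bits in the constant and on the machine's shift and multiply latencies.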

Compiler basics: Register allocation
Register organization: conventions are needed for parameter passing and register usage across function calls. For example, with 32 registers:
  r0        hard-wired 0
  r1 - r10  argument and result transfer
  r11 - r20 caller saved registers (temporaries)
  r21 - r31 callee saved registers

Register allocation using graph coloring
Given a set of registers, what is the most efficient mapping of registers to program variables in terms of execution time of the program?
Some definitions:
- A variable is defined at a point in the program when a value is assigned to it.
- A variable is used at a point in the program when its value is referenced in an expression.
- The live range of a variable is the execution range between the definitions and uses of that variable.

Register allocation using graph coloring
Example program and live ranges:
  a :=        | live: a
  c :=        | live: a, c
  b :=        | live: a, b, c
     := b     | live: a, c
  d :=        | live: a, c, d
     := a     | live: c, d
     := c     | live: d
     := d     |
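A minimal sketch deriving these live ranges programmatically (Python; the statement encoding is hypothetical): each variable lives from its definition to its last use.

  # Each statement is (defined_var_or_None, [used_vars]).
  stmts = [("a", []), ("c", []), ("b", []),
           (None, ["b"]), ("d", []),
           (None, ["a"]), (None, ["c"]), (None, ["d"])]

  def live_ranges(stmts):
      rng = {}
      for i, (d, uses) in enumerate(stmts):
          if d is not None and d not in rng:
              rng[d] = [i, i]                # range starts at the definition
          for u in uses:
              rng[u][1] = i                  # ...and extends to the last use
      return rng

  print(live_ranges(stmts))
  # -> {'a': [0, 5], 'c': [1, 6], 'b': [2, 3], 'd': [4, 7]}

Two variables interfere when these ranges overlap; note that b ([2,3]) and d ([4,7]) do not.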

Register allocation using graph coloring
Interference graph: nodes are the variables a, b, c, d; an edge connects two variables whose live ranges overlap (here every pair except b-d).
Coloring: a = red, b = green, c = blue, d = green.
The graph needs 3 colors, so the program needs 3 registers.
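A minimal sketch of the coloring step (greedy, in a fixed node order; the interference edges are those of the example above):

  # Greedy coloring: each node gets the lowest color unused by its
  # already-colored neighbors. Colors number the machine registers.
  interference = {"a": {"b", "c", "d"}, "b": {"a", "c"},
                  "c": {"a", "b", "d"}, "d": {"a", "c"}}

  def greedy_color(graph):
      color = {}
      for node in graph:
          used = {color[n] for n in graph[node] if n in color}
          color[node] = next(c for c in range(len(graph)) if c not in used)
      return color

  print(greedy_color(interference))   # -> {'a': 0, 'b': 1, 'c': 2, 'd': 1}

Three colors suffice, matching the slide; production allocators pick the order more carefully (see the simplify/select scheme below).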

Register allocation using graph coloring: spill / reload code
Spill/reload code is needed when there are not enough colors (registers) to color the interference graph.
Example: only two registers available!
Program:
  a :=
  c :=
  store c
  b :=
     := b
  d :=
     := a
  load c
     := c
     := d
Spilling c to memory splits its live range: between the store and the load at most two variables are live at any point, so two registers suffice.

Register allocation for a monolithic RF
Scheme of the optimistic register allocator: Renumber → Build → Spill costs → Simplify → Select, with spill code inserted and the process repeated when Select fails.
- Renumber: find all live ranges in a procedure and number (rename) them uniquely.
- Build: construct the interference graph.
- Spill costs: in preparation for coloring, compute a spill cost estimate for every live range: the sum of the execution frequencies of the transports that define or use the variable of the live range.
- Simplify: remove nodes with degree < k from the graph in an arbitrary order and push them on a stack. Whenever all remaining nodes have degree >= k, choose a spill candidate; this node is also removed and optimistically pushed on the stack, hoping a color will be available in spite of its high degree.
- Select: pop each node from the stack, reinsert it in the interference graph, and give it a color distinct from its neighbors. When no color is available for some node, leave it uncolored and continue with the next node.
- Spill code: in the final phase, insert spill code for the live ranges of all uncolored nodes.
Some symbolic registers must be mapped onto a specific machine register (like the stack pointer); these get their color in the Simplify stage instead of being pushed on the stack. The other machine registers are divided into caller-saved and callee-saved registers, and the allocator computes a caller-saved and a callee-saved cost. The caller-saved cost applies to symbolic registers with live ranges across a procedure call: twice the execution frequency of the transport. The callee-saved cost of a symbolic register is twice the execution frequency of the procedure to which the transport of the symbolic register belongs. With these two costs in mind the allocator chooses a machine register.
The Select phase picks the color (= machine register) for a variable that minimizes the heuristic
  h = fdep(col, var) + caller_callee(col, var)
where fdep(col, var) is a measure for the introduction of false dependencies, and caller_callee(col, var) is the cost of mapping var on a caller- or callee-saved register.
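A compact sketch of the Simplify/Select core in Python (optimistic coloring with k colors; spill-cost ordering, spill-code insertion and the h heuristic are omitted):

  # Optimistic simplify/select: push low-degree nodes, then pop and
  # color; nodes that receive no color are the spill candidates.
  def simplify_select(graph, k):
      g = {v: set(ns) for v, ns in graph.items()}
      stack = []
      while g:
          v = next((v for v in g if len(g[v]) < k), None)
          if v is None:                             # all degrees >= k:
              v = max(g, key=lambda v: len(g[v]))   # optimistic spill pick
          stack.append(v)
          for n in g.pop(v):
              g[n].discard(v)
      color, uncolored = {}, []
      while stack:
          v = stack.pop()
          used = {color[n] for n in graph[v] if n in color}
          free = [c for c in range(k) if c not in used]
          if free:
              color[v] = free[0]
          else:
              uncolored.append(v)                   # spill code needed
      return color, uncolored

  interference = {"a": {"b", "c", "d"}, "b": {"a", "c"},
                  "c": {"a", "b", "d"}, "d": {"a", "c"}}
  print(simplify_select(interference, 3))   # 3 registers: no spills
  print(simplify_select(interference, 2))   # 2 registers: something spills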

Compiler basics: Code selection
CISC era (before 1985):
- Code size important: determine the shortest sequence of code
- Many options may exist, so use pattern matching
- Example (M68020): D1 := D1 + M[ M[10+A1] + 16*D2 + 20 ] maps to the single instruction ADD ([10,A1], D2*16, 20), D1
RISC era:
- Performance important
- Only few possible code sequences
- New implementations of old architectures optimize the RISC part of the instruction set only, e.g. i486 / Pentium / M68020

Overview
- Enhance performance: architecture methods
- Instruction Level Parallelism
- VLIW
- Examples: C6, TM, TTA
- Clustering
- Code generation: compiler basics; mapping and scheduling operations
- Design Space Exploration: TTA framework

Mapping / Scheduling: placing operations in space and time
Source code:
  d = a * b;
  e = a + d;
  f = 2 * b + d;
  r = f - e;
  x = z + y;
The corresponding Data Dependence Graph (DDG) contains the operations *, +, +, - and + with inputs a, b, 2, z, y and results d, e, f, r, x.

How to map these operations?
Architecture constraints:
- One function unit
- All operations have single-cycle latency
With a single FU the five operations must be issued one per cycle, respecting the dependences of the DDG.

How to map these operations?
Architecture constraints:
- One add-sub unit and one mul unit
- All operations have single-cycle latency
Now an addition/subtraction and a multiplication can be issued in the same cycle on their own units, shortening the schedule.

There are many mapping solutions
Pareto graph (solution space): execution time T versus cost. A point x is Pareto-optimal if there is no point y with y_i < x_i for all dimensions i, i.e. no other solution is better in every respect.

Basic Block Scheduling
- Make a dependence graph
- Determine minimal length
- Determine:
  - ASAP (As Soon As Possible) times: the earliest times at which instructions can be scheduled
  - ALAP (As Late As Possible) times: the latest times at which instructions can be scheduled
  - Slack of each operation = ALAP - ASAP
  - Priority of operations
- Place each operation in the first cycle with sufficient resources
Notes:
- A basic block is a (maximal) piece of consecutive instructions which can only be entered at the first instruction and left at the end
- Scheduling order is sequential
- Priority is determined by the heuristic used, e.g. slack plus other contributions

Basic Block Scheduling: determine ASAP and ALAP times
We assume all operations are single cycle! In the figure each operation of the dependence graph (loads, ADD, SUB, NEG, MUL) is annotated with its <ASAP cycle, ALAP cycle> pair: <1,1>, <2,2>, <3,3> and <4,4> on the critical path, and <1,3>, <2,3>, <2,4>, <1,4> off it. The slack of an operation is ALAP - ASAP; operations with equal ASAP and ALAP times (zero slack) lie on the critical path.
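A small runnable sketch of this computation (Python; the DDG encoding is hypothetical, and unit latency is assumed as on the slide):

  # ASAP/ALAP/slack on a DAG of unit-latency operations.
  # preds maps each operation to the operations it depends on.
  preds = {"ld1": [], "ld2": [], "mul": ["ld1", "ld2"],
           "add": ["mul"], "st": ["add"]}

  def schedule_times(preds):
      succs = {v: [] for v in preds}
      for v, ps in preds.items():
          for p in ps:
              succs[p].append(v)
      indeg = {v: len(ps) for v, ps in preds.items()}
      order, ready = [], [v for v in preds if indeg[v] == 0]
      while ready:                            # topological order (Kahn)
          v = ready.pop()
          order.append(v)
          for s in succs[v]:
              indeg[s] -= 1
              if indeg[s] == 0:
                  ready.append(s)
      asap = {}
      for v in order:                         # earliest cycle after all preds
          asap[v] = 1 + max((asap[p] for p in preds[v]), default=0)
      length = max(asap.values())             # critical path length
      alap = {}
      for v in reversed(order):               # latest cycle before all succs
          alap[v] = min((alap[s] for s in succs[v]), default=length + 1) - 1
      return {v: (asap[v], alap[v], alap[v] - asap[v]) for v in order}

  print(schedule_times(preds))   # op -> (ASAP, ALAP, slack)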

Cycle-based list scheduling
  proc Schedule(DDG = (V,E))
  beginproc
    ready = { v | ¬∃(u,v) ∈ E }          // operations without predecessors
    ready' = ready
    sched = ∅
    current_cycle = 0
    while sched ≠ V do
      for each v ∈ ready' do
        if ¬ResourceConfl(v, current_cycle, sched) then
          cycle(v) = current_cycle
          sched = sched ∪ {v}
        endif
      endfor
      current_cycle = current_cycle + 1
      ready  = { v | v ∉ sched ∧ ∀(u,v) ∈ E: u ∈ sched }
      ready' = { v | v ∈ ready ∧ ∀(u,v) ∈ E: cycle(u) + delay(u,v) ≤ current_cycle }
    endwhile
  endproc
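The same algorithm as a runnable Python sketch (assumptions: explicit per-edge delays and a resource model of k identical function units; a real scheduler would order the ready set by priority, e.g. slack):

  # Cycle-based list scheduling with k identical function units.
  # preds maps op -> list of (predecessor, delay) pairs.
  def list_schedule(preds, k):
      cycle, unsched, t = {}, set(preds), 0
      while unsched:
          # ready: predecessors scheduled and their results available by t
          ready = [v for v in sorted(unsched)
                   if all(p in cycle and cycle[p] + d <= t for p, d in preds[v])]
          slots = k
          for v in ready:
              if slots == 0:          # resource conflict: retry next cycle
                  break
              cycle[v] = t
              unsched.remove(v)
              slots -= 1
          t += 1
      return cycle

  ddg = {"ld1": [], "ld2": [], "mul": [("ld1", 1), ("ld2", 1)],
         "add": [("mul", 1)], "st": [("add", 1)]}
  print(list_schedule(ddg, 1))   # one FU: fully serialized, 5 cycles
  print(list_schedule(ddg, 2))   # two FUs: ld1 and ld2 issue together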

Scheduling: Overview
Transforming a sequential program into a parallel program:
  read sequential program
  read machine description file
  for each procedure do
    perform function inlining
    transform an irreducible CFG into a reducible CFG
    perform control flow analysis
    perform loop unrolling
    perform data flow analysis
    perform memory reference disambiguation
    perform register allocation
    for each scheduling scope do
      perform instruction scheduling
  write out the parallel program

Extended basic block scheduling: Code Motion
Example CFG with basic blocks A (top), B and C (the two branch sides), and D (the join):
  A: a) add r3, r4, 4
     b) beq ...
  B: c) add r1, r1, r2
  C: d) sub r3, r3, r2
  D: e) mul r1, r1, r3
Why move code?
- Downward code motions: a → B, a → C, a → D, c → D, d → D
- Upward code motions: c → A, d → A, e → B, e → C, e → A

Extended scheduling scope
Code:
  A;
  if cond then B else C;
  D;
  if cond then E else F;
  G;
CFG (Control Flow Graph): A branches to B and C, which join at D; D branches to E and F, which join at G.

Scheduling scopes
- Trace
- Superblock
- Hyperblock/region
- Decision tree

Create and Enlarge Scheduling Scope
- Trace: a linear, profiled path through the CFG of A..G (e.g. A, B, D, E, G); side entries and exits remain.
- Superblock: the same path after tail duplication (creating copies such as D', E', G'), so the scope has a single entry and no join points.

Create and Enlarge Scheduling Scope
- Decision tree: tail duplication removes all join points below a block with multiple entries, turning the region into a tree (duplicates such as D', E', F', G', G'').
- Hyperblock / region: multiple paths are combined into a single scope (e.g. by if-conversion), rather than following just one path.

Comparing scheduling scopes
Why choose a specific scope? The first four below are all acyclic, i.e. they contain no back edges.
- Trace (Fisher, IEEE Trans. on Computers, 1981):
  - Use standard list scheduling
  - However: lots of bookkeeping and code copying for code motions past fork and join points
- Superblock (Hwu et al., Journal of Supercomputing, May 1993):
  - Easier than trace scheduling: no join points, so no copying during scheduling; only upward code motion, so no motion past forks
  - Tail duplication needed
- Decision tree:
  - Follows multiple paths
  - No join points, so no complex bookkeeping; no incoming edges, so no code duplication during scheduling
  - Each block with multiple entries becomes a root, so trees are small and tail duplication is needed
- Hyperblock (Warter et al., conf. on PLDI, June 1993):
  - A superblock with multiple paths, if-converted; single entry
  - Re-if-conversion needed for architectures without guarded execution
- Region (Bernstein and Rodeh, conf. on PLDI, 1991):
  - Corresponds to bodies of natural loops
  - Regions can be nested (hierarchical scheduling)
  - No profiling needed for region selection (in contrast to the former scopes)
  - Very large scope (encompasses the other approaches)
- Loop:
  - Keeps multiple iterations active at a time
  - Different approaches, to be discussed later on
Disadvantages of trace and superblock scheduling:
- They follow only one path
- They require a high completion ratio: if the first block is executed, all blocks should have a high probability of being executed
- This requires biased branches and accurate (static) branch prediction

Code movement (upwards) within regions: what to check?
When an operation (e.g. an add) is moved from its source block up to a destination block:
- In the destination block, a copy may be needed
- In each intermediate block, check for off-liveness of the operation's result
(Figure legend: I = instruction; the add moves from the source block to the destination block.)

Extended basic block scheduling: Code Motion
- A dominates B: A is always executed before B.
  - Consequently, if A does not dominate B, code motion from B to A requires code duplication.
- B post-dominates A: B is always executed after A.
  - If B does not post-dominate A, code motion from B to A is speculative.
Example CFG with blocks A, B, C, D, E, F:
  Q1: does C dominate E?
  Q2: does C dominate D?
  Q3: does F post-dominate D?
  Q4: does D post-dominate B?
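Dominators can be computed with a simple iterative data-flow sketch, shown below in Python; the CFG edges here are hypothetical (the slide's figure defines the real ones).

  # dom(entry) = {entry};  dom(n) = {n} ∪ (∩ over predecessors p of dom(p))
  preds = {"A": [], "B": ["A"], "C": ["A"],
           "D": ["B"], "E": ["B", "C"], "F": ["D", "E"]}

  def dominators(preds, entry="A"):
      nodes = set(preds)
      dom = {n: set(nodes) for n in nodes}
      dom[entry] = {entry}
      changed = True
      while changed:
          changed = False
          for n in nodes - {entry}:
              new = {n} | set.intersection(*(dom[p] for p in preds[n]))
              if new != dom[n]:
                  dom[n], changed = new, True
      return dom

  doms = dominators(preds)
  print("C" in doms["E"])   # Q1 for these edges: False (E is also reached via B)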

Scheduling: Loops
Loop optimizations:
- Loop unrolling
- Loop peeling

Scheduling: Loops
Problems with unrolling:
- Exploits only the parallelism within sets of n iterations
- Iteration start-up latency
- Code expansion
(The figure compares resource utilization over time for basic block scheduling, basic block scheduling with unrolling, and software pipelining.)

Software pipelining
Software pipelining a loop is:
- Scheduling the loop such that iterations start before preceding iterations have finished, or equivalently:
- Moving operations across the backedge
Example: y = a·x, with a loop body of LD, ML (multiply), ST:
- Sequential: 3 cycles/iteration
- Unrolling (3 iterations): 5/3 cycles/iteration
- Software pipelining: 1 cycle/iteration, with a new LD-ML-ST triple starting every cycle
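A toy Python sketch of this overlap (not the slide's figure): with II = 1, it prints which stage of which iteration executes in each cycle.

  # Print the software pipeline for a 3-stage body, initiation interval II.
  STAGES = ["LD", "ML", "ST"]

  def pipeline(n_iters, ii=1):
      depth = len(STAGES)
      for cycle in range((n_iters - 1) * ii + depth):
          row = ["%s(i%d)" % (STAGES[cycle - i * ii], i)
                 for i in range(n_iters)
                 if 0 <= cycle - i * ii < depth]   # iteration i starts at i*II
          print("cycle %d: %s" % (cycle, "  ".join(row)))

  pipeline(4)
  # cycle 0: LD(i0)
  # cycle 1: ML(i0)  LD(i1)
  # cycle 2: ST(i0)  ML(i1)  LD(i2)   <- steady-state kernel: 1 cycle/iteration
  # cycle 3: ST(i1)  ML(i2)  LD(i3)
  # ...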

Software pipelining (cont'd)
Basic techniques:
- Modulo scheduling (Rau, Lam): list scheduling with modulo resource constraints
- Kernel recognition techniques: unroll the loop, schedule the iterations, identify a repeating pattern. Examples:
  - Perfect pipelining (Aiken and Nicolau)
  - URPR (Su, Ding and Xia)
  - Petri net pipelining (Allan)
- Enhanced pipeline scheduling (Ebcioğlu): fill the first cycle of an iteration, then copy this instruction over the backedge. This algorithm is the one most used in commercial compilers.

Software pipelining: Modulo scheduling
Example: modulo scheduling a loop.
(a) Example loop:
  for (i = 0; i < n; i++)
    A[i+6] = 3*A[i] - 1;
(b) Code without loop control:
  ld  r1, (r2)
  mul r3, r1, 3
  sub r4, r3, 1
  st  r4, (r5)
(c) Software pipeline: overlapped copies of this body form a prologue, a kernel and an epilogue. The prologue fills the software pipeline with iterations; the epilogue drains it.

Software pipelining: determine II, the Initiation Interval
Cyclic data dependences bound II. For
  for (i = 0; ...) A[i+6] = 3*A[i] - 1;
the body ld r1,(r2) → mul r3,r1,3 → sub r4,r3,1 → st r4,(r5) carries (delay, iteration distance) pairs: (1,0) along the chain within one iteration, (0,1) between consecutive iterations for register reuse, and (1,6) from the st back to the ld, since A[i+6] is read again 6 iterations later.
The scheduling constraint is:
  cycle(v) ≥ cycle(u) + delay(u,v) - II · distance(u,v)

Modulo scheduling constraints
MII, the minimum initiation interval, is bounded by cyclic dependences and by resources:
  MII = max{ ResMII, RecMII }
Resources: each unit of a resource can start at most one operation per II cycles, so
  ResMII = max over resources r of ⌈ N_r / a_r ⌉
where N_r is the number of operations in one iteration using resource r and a_r is the number of available units of r.
Cycles: summing cycle(v) ≥ cycle(u) + delay(u,v) - II·distance(u,v) around a dependence cycle c gives
  0 ≥ delay(c) - II · distance(c)
Therefore:
  II ≥ delay(c) / distance(c)
Or:
  RecMII = max over cycles c of ⌈ delay(c) / distance(c) ⌉
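A small numeric sketch for the example loop (assumption: one FU type shared by all four operations; the only recurrence is the st → ld dependence with distance 6):

  import math

  # ResMII: n_ops operations per iteration on n_fus identical units.
  def res_mii(n_ops, n_fus):
      return math.ceil(n_ops / n_fus)

  # RecMII over dependence cycles given as (total_delay, total_distance).
  def rec_mii(cycles):
      return max(math.ceil(d / dist) for d, dist in cycles)

  # ld -> mul -> sub -> st -> ld: delay 1+1+1+1 = 4, distance 6.
  recurrences = [(4, 6)]
  print(max(res_mii(4, 1), rec_mii(recurrences)))   # 1 FU:  MII = 4
  print(max(res_mii(4, 4), rec_mii(recurrences)))   # 4 FUs: MII = 1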

Let's go back to: The Role of the Compiler
Nine steps are required to translate an HLL program (see the online book chapter):
1. Front-end compilation
2. Determine dependencies
3. Graph partitioning: make multiple threads (or tasks)
4. Bind partitions to compute nodes
5. Bind operands to locations
6. Bind operations to time slots: scheduling
7. Bind operations to functional units
8. Bind transports to buses
9. Execute operations and perform transports

Division of responsibilities between hardware and compiler
The boundary between compiler and hardware responsibility shifts per architecture class; the compiler performs the steps up to the boundary, the hardware the rest:
- Superscalar: compiler does (1) the frontend
- Dataflow: compiler does up to (2) determining dependencies
- Multi-threaded: up to (3) binding of operands
- Independence architectures: up to (4) scheduling
- VLIW: up to (5) binding of operations
- TTA: up to (6) binding of transports
- (7) Execution itself is always done by hardware

Overview
- Enhance performance: architecture methods
- Instruction Level Parallelism
- VLIW
- Examples: C6, TM, TTA
- Clustering
- Code generation
- Design Space Exploration: TTA framework

Mapping applications to processors: the MOVE framework
With user interaction, an optimizer explores the Pareto curve (solution space: cost versus execution time). Architecture parameters drive a parametric compiler and a hardware generator, both of which feed results back to the optimizer. The outputs are parallel object code and a chip: a TTA-based system.

TTA (MOVE) organization
Function units (ALU, load/store, immediate unit, instruction unit) and register files (integer RF, float RF, boolean RF) connect through sockets to a set of transport buses, with separate data and instruction memories.

Code generation trajectory for TTAs
- Frontend: GCC or SUIF (adapted)
- Application (C) → compiler frontend → sequential code → compiler backend → parallel code
- The backend uses the architecture description and profiling data
- Sequential simulation (with input/output) produces the profiling data; parallel simulation validates the parallel code

Exploration: TTA resource reduction
(The figure shows how execution time grows as machine resources are removed.)

Exploration: TTA connectivity reduction
(The figure plots execution time against the number of connections removed.)
- At first, removing connections eliminates critical connections and reduces bus delay
- Eventually the FU stage constrains the cycle time

Can we do better?
Yes!! How?
- Transformations
- SFUs: Special Function Units
- Vector processing
- Multiple processors
The goal is to push the cost versus execution time trade-off further.

Transforming the specification
Tree height reduction: based on associativity of the + operation, a + (b + c) = (a + b) + c, a chain of additions can be rebalanced into a tree of + operations, shortening the critical path.

Transforming the specification
Original code:
  d = a * b;
  e = a + d;
  f = 2 * b + d;
  r = f - e;
  x = z + y;
Since r = f - e = (2*b + d) - (a + d) = 2*b - a, this reduces to:
  r = 2*b - a;   // a shift (b << 1) and a subtract
  x = z + y;
The DDG shrinks from five operations to three (<<, -, +).

Changing the architecture: adding SFUs (special function units)
Example: a 4-input adder replacing two chained + operations. Why is this faster? The two additions execute as one unit, avoiding the intermediate register-file transport between them.

Changing the architecture: adding SFUs (special function units)
In the extreme case, put everything into one unit: a spatial mapping with no control flow. However, this offers no flexibility / programmability!! (Though one could use FPGAs.)

SFUs: fine grain patterns
Why use fine grain SFUs?
- Code size reduction
- Register file #ports reduction
- Could be cheaper and/or faster
- Transport reduction
- Power reduction (avoid charging non-local wires)
- Supports a whole application domain!
Which patterns need support? Detection of recurring operation patterns is needed.

SFUs: covering results
Adding only 20 'patterns' of 2 operations dramatically reduces the number of operations (by about 40%)!!

Exploration: resulting architecture
Architecture for image processing:
- 9 buses
- 4 RFs
- 4 Addercmp FUs
- 2 Multiplier FUs
- 2 Diffadd FUs
- Stream input and output
Note the several SFUs and the reduced connectivity.

Conclusions
- Billions of embedded processing systems are shipped per year; how do we design these systems quickly, cheaply, correctly, at low power...? What will their processing platform look like?
- VLIWs are very powerful and flexible, and can be easily tuned to the application domain
- TTAs are even more flexible, scalable, and lower power

Conclusions
- Compilation for ILP architectures is mature, and is used in commercial compilers
- However:
  - There is a great discrepancy between available and exploitable parallelism
  - Advanced code scheduling techniques are needed to exploit ILP

Bottom line: Do not pay for hardware if you can do it by software!!

Hands-on (2005)
Map JPEG to a TTA processor; see the web page: http://www.ics.ele.tue.nl/~heco/courses/pam
- Install the TTA tools (compiler and simulator)
- Go through all listed steps
- Perform DSE: design space exploration
- Add an SFU
- Hand in a 1 or 2 page report in 2 weeks

Hands-on (2006/7)
Let's look at DSE: Design Space Exploration. We will use the Imagine processor: http://cva.stanford.edu/projects/imagine/

Hands-on 1 (2008/9)
VLIW processor of Silicon Hive:
- Map an image processing algorithm
- Optimize the mapping
- Optimize the architecture