Platform-based Design


1 Platform-based Design
Generating ILP code TU/e 5kk70 Henk Corporaal Bart Mesman

2 Overview Enhance performance: architecture methods
Instruction Level Parallelism
VLIW
Examples: C6, TM, TTA
Clustering and Reconfigurable components
Code generation: compiler basics; mapping and scheduling
Hands-on

3 Compiler basics Overview Compiler trajectory / structure / passes
Control Flow Graph (CFG)
Mapping and Scheduling:
  basic block list scheduling
  extended scheduling scope
  loop scheduling
  loop transformations

4 Compiler basics: trajectory
Source program -> Preprocessor -> Compiler -> Assembler -> Loader/Linker -> Object program
(the compiler emits error messages; the loader/linker adds library code)

5 Compiler basics: structure / passes
Source code
-> Lexical analyzer (token generation)
-> Parsing (check syntax, check semantics, parse tree generation)
-> Intermediate code
-> Code optimization (data flow analysis, local optimizations, global optimizations)
-> Code generation (code selection, peephole optimizations)
-> Register allocation (building the interference graph, graph coloring, spill code insertion, caller/callee save and restore code)
-> Sequential code
-> Scheduling and allocation (exploiting ILP)
-> Object code

6 position := initial + rate * 60
Compiler basics: structure. Simple example: from HLL to (sequential) assembly code.
Source: position := initial + rate * 60
After lexical analysis: id1 := id2 + id3 * 60
After syntax analysis (parse tree): :=(id1, +(id2, *(id3, 60)))
Intermediate code:
  temp1 := inttoreal(60)
  temp2 := id3 * temp1
  temp3 := id2 + temp2
  id1 := temp3
After code optimization:
  temp1 := id3 * 60.0
  id1 := id2 + temp1
Code generation:
  movf id3, r2
  mulf #60.0, r2, r2
  movf id2, r1
  addf r2, r1
  movf r1, id1

7 Compiler basics: Control flow graph (CFG)
C input code: if (a > b) { r = a % b; } else { r = b % a; }
CFG:
  BB1: sub t1, a, b; bgz t1, 2, 3
  BB2: rem r, a, b; goto 4
  BB3: rem r, b, a; goto 4
  BB4: ...
A program is a collection of functions, each function is a collection of basic blocks, each basic block contains a set of instructions, and each instruction consists of several transports.

8 Compiler basics: Basic optimizations
Machine independent optimizations
Machine dependent optimizations

9 Compiler basics: Basic optimizations
Machine independent optimizations:
  Common subexpression elimination
  Constant folding
  Copy propagation
  Dead-code elimination
  Induction variable elimination
  Strength reduction
  Algebraic identities: commutative expressions; associativity: tree height reduction
  Note: not always allowed (due to limited precision)
For details check any compiler book! All these optimizations are explained in Aho, Sethi and Ullman.
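A source-level sketch of two of these, induction variable elimination plus strength reduction, applied by hand (function names and the pointer arithmetic are illustrative; compilers do this on intermediate code):

/* Before: the address expression i*4 is recomputed every iteration
   (assuming 4-byte ints). */
int sum_before(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += *(const int *)((const char *)a + i * 4);  /* multiply per iteration */
    return s;
}

/* After induction variable elimination + strength reduction: the
   multiplication is replaced by a pointer that advances by one element. */
int sum_after(const int *a, int n) {
    int s = 0;
    for (const int *p = a; p != a + n; p++)
        s += *p;                                       /* add replaces multiply */
    return s;
}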

10 Compiler basics: Basic optimizations
Machine dependent optimization example: what is the optimal implementation of a*34?
Use the multiplier: mul Tb, Ta, 34
  Pro: no thinking required
  Con: may take many cycles
Alternative:
  SHL Tc, Ta, 1
  ADD Tb, Tc, Tzero
  SHL Tc, Tc, 4
  ADD Tb, Tb, Tc
  Pros: may take fewer cycles
  Cons: uses more registers; additional instructions (I-cache load / code size)
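The shift-add version works because 34 = 32 + 2. A C-level sketch mirroring the four-instruction sequence above (function name is mine):

/* a*34 via shifts and adds: 34*a = 2*a + 32*a = (a<<1) + ((a<<1)<<4). */
int mul34(int a) {
    int t = a << 1;        /* t = 2*a   (SHL Tc, Ta, 1)      */
    int r = t;             /* r = 2*a   (ADD Tb, Tc, Tzero)  */
    t <<= 4;               /* t = 32*a  (SHL Tc, Tc, 4)      */
    return r + t;          /* 34*a      (ADD Tb, Tb, Tc)     */
}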

11 Compiler basics: Register allocation
Register organization: conventions needed for parameter passing and register usage across function calls. Example convention:
  r21-r31: callee saved registers
  r11-r20: caller saved registers (temporaries)
  r1-r10: argument and result transfer
  r0: hard-wired 0

12 Register allocation using graph coloring
Given a set of registers, what is the most efficient mapping of registers to program variables in terms of execution time of the program?
Some definitions:
  A variable is defined at a point in a program when a value is assigned to it.
  A variable is used at a point in a program when its value is referenced in an expression.
  The live range of a variable is the execution range between definitions and uses of the variable.

13 Register allocation using graph coloring
Example program and live ranges:
  a :=
  c :=
  b :=
     := b
  d :=
     := a
     := c
     := d
[Figure: the live ranges of a, b, c and d drawn alongside the program.]

14 Graph needs 3 colors => program needs 3 registers
Register allocation using graph coloring
Interference graph (edge = overlapping live ranges): a-b, a-c, a-d, b-c, c-d
Coloring: a = red, b = green, c = blue, d = green
Graph needs 3 colors => program needs 3 registers
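A minimal greedy-coloring sketch in C for this interference graph; the adjacency matrix encodes the live-range overlaps listed above, and the fixed visit order stands in for a real allocator's simplify/select ordering (next slides):

#include <stdio.h>

#define N 4   /* variables a, b, c, d */

int main(void) {
    const char *name[N] = {"a", "b", "c", "d"};
    /* interference graph: edge = live ranges overlap */
    int adj[N][N] = {
        /*        a  b  c  d */
        /* a */ { 0, 1, 1, 1 },
        /* b */ { 1, 0, 1, 0 },
        /* c */ { 1, 1, 0, 1 },
        /* d */ { 1, 0, 1, 0 },
    };
    int color[N];
    for (int v = 0; v < N; v++) {
        int used[N] = {0};
        for (int u = 0; u < v; u++)        /* colors taken by earlier neighbors */
            if (adj[v][u]) used[color[u]] = 1;
        int c = 0;
        while (used[c]) c++;               /* pick the lowest free color */
        color[v] = c;
        printf("%s -> r%d\n", name[v], color[v]);
    }
    return 0;                              /* three colors suffice, as on the slide */
}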

15 Register allocation using graph coloring
Spill / reload code is needed when there are not enough colors (registers) to color the interference graph.
Example: only two registers available!!
Program:
  a :=
  c :=
  store c
  b :=
     := b
  d :=
     := a
  load c
     := c
     := d

16 Register allocation for a monolithic RF
Scheme of the optimistic register allocator: Renumber -> Build -> Spill costs -> Simplify -> Select -> Spill code.
Renumber: the first phase finds all live ranges in a procedure and numbers (renames) them uniquely.
Build: this phase constructs the interference graph.
Spill costs: in preparation for coloring, a spill cost estimate is computed for every live range. The cost is simply the sum of the execution frequencies of the transports that define or use the variable of the live range.
Simplify: this phase removes nodes with degree < k in an arbitrary order from the graph and pushes them on a stack. Whenever it discovers that all remaining nodes have degree >= k, it chooses a spill candidate. This node is also removed from the graph and optimistically pushed on the stack, hoping a color will be available in spite of its high degree.
Select: colors are selected for nodes. In turn, each node is popped from the stack, reinserted in the interference graph and given a color distinct from its neighbors. Whenever no color is available for some node, it is left uncolored and the next node is handled.
Spill code: in the final phase spill code is inserted for the live ranges of all uncolored nodes.
Some symbolic registers must be mapped onto a specific machine register (like the stack pointer). These registers get their color in the Simplify stage instead of being pushed on the stack. The other machine registers are divided into caller-saved and callee-saved registers, and the allocator computes a caller-saved and a callee-saved cost. The caller-saved cost of a symbolic register applies when it has a live range across a procedure call; the cost per symbolic register is twice the execution frequency of its transport. The callee-saved cost of a symbolic register is twice the execution frequency of the procedure to which the transport of the symbolic register belongs. With these two costs in mind the allocator chooses a machine register.
The Select phase selects a color (= machine register) for a variable that minimizes the heuristic
  h = fdep(col, var) + caller_callee(col, var)
where fdep(col, var) is a measure for the introduction of false dependencies, and caller_callee(col, var) is the cost of mapping var on a caller- or callee-saved register.

17 Compiler basics: Code selection
CISC era (before 1985):
  code size important
  determine shortest sequence of code; many options may exist
  pattern matching
  Example M68020: D1 := D1 + M[M[10+A1] + 16*D2 + 20]  ->  ADD ([10,A1], D2*16, 20), D1
RISC era:
  performance important
  only few possible code sequences
  new implementations of old architectures optimize the RISC part of the instruction set only, e.g. i486 / Pentium / M68020

18 Overview Enhance performance: architecture methods
Instruction Level Parallelism
VLIW
Examples: C6, TM, TTA
Clustering
Code generation: compiler basics; mapping and scheduling operations
Design Space Exploration: TTA framework

19 Mapping / Scheduling: placing operations in space and time
Data Dependence Graph (DDG) for:
d = a * b;
e = a + d;
f = 2 * b + d;
r = f - e;
x = z + y;

20 How to map these operations?
Architecture constraints: one function unit; all operations single-cycle latency.
[Figure: the DDG mapped onto one FU; the six operations are serialized over cycles 1-6.]

21 How to map these operations?
Architecture constraints: one add-sub unit and one mul unit; all operations single-cycle latency.
[Figure: the DDG mapped onto the Mul and Add-sub units; independent operations now share cycles, shortening the schedule.]

22 There are many mapping solutions
Pareto graph (solution space): execution time T versus cost.
Point x is Pareto if there is no point y for which yi < xi for all i.

23 Basic Block Scheduling
Make a dependence graph
Determine minimal length
Determine:
  ASAP (As Soon As Possible) times = earliest times instructions can be scheduled
  ALAP (As Late As Possible) times = latest times instructions can be scheduled
  Slack of each operation = ALAP - ASAP
  Priority of operations
Place each operation in the first cycle with sufficient resources
Notes:
  A Basic Block is a (maximal) piece of consecutive instructions which can only be entered at the first instruction and left at the end
  Scheduling order is sequential
  Priority is determined by the heuristic used, e.g. slack + other contributions

24 Basic Block Scheduling: determine ASAP and ALAP times
We assume all operations are single cycle!
[Figure: example DDG with inputs A, B, C, X, y, z; each operation is annotated <ASAP cycle, ALAP cycle>, e.g. ADD <1,1>, SUB <2,2>, NEG <1,3>, LD <2,3>, MUL <2,4>; the difference ALAP - ASAP is the operation's slack.]
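A self-contained sketch computing ASAP, ALAP and slack for the DDG of the earlier example (d = a*b; e = a+d; f = 2*b+d; r = f-e; x = z+y), assuming single-cycle operations; node numbering and edge order are my own:

#include <stdio.h>

#define N 6
#define E 5

int main(void) {
    /* 0: d=a*b  1: e=a+d  2: t=2*b  3: f=t+d  4: r=f-e  5: x=z+y */
    const char *op[N] = {"d=a*b", "e=a+d", "t=2*b", "f=t+d", "r=f-e", "x=z+y"};
    int src[E] = {0, 0, 2, 1, 3};          /* edges in topological order */
    int dst[E] = {1, 3, 3, 4, 4};
    int asap[N] = {0}, alap[N];

    /* ASAP: longest path from the sources (all latencies = 1 cycle) */
    for (int e = 0; e < E; e++)
        if (asap[src[e]] + 1 > asap[dst[e]])
            asap[dst[e]] = asap[src[e]] + 1;

    int len = 0;                           /* critical path length */
    for (int v = 0; v < N; v++)
        if (asap[v] > len) len = asap[v];

    /* ALAP: longest path to the sinks, scheduled backwards */
    for (int v = 0; v < N; v++) alap[v] = len;
    for (int e = E - 1; e >= 0; e--)
        if (alap[dst[e]] - 1 < alap[src[e]])
            alap[src[e]] = alap[dst[e]] - 1;

    for (int v = 0; v < N; v++)
        printf("%-6s ASAP=%d ALAP=%d slack=%d\n",
               op[v], asap[v], alap[v], alap[v] - asap[v]);
    return 0;
}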

25 Cycle based list scheduling
proc Schedule(DDG = (V,E))
beginproc
  ready = { v | ¬∃ (u,v) ∈ E }          -- operations without predecessors
  ready' = ready
  sched = ∅
  current_cycle = 0
  while sched ≠ V do
    for each v ∈ ready' do
      if ¬ResourceConfl(v, current_cycle, sched) then
        cycle(v) = current_cycle
        sched = sched ∪ {v}
      endif
    endfor
    current_cycle = current_cycle + 1
    ready  = { v | v ∉ sched ∧ ∀ (u,v) ∈ E: u ∈ sched }
    ready' = { v | v ∈ ready ∧ ∀ (u,v) ∈ E: cycle(u) + delay(u,v) ≤ current_cycle }
  endwhile
endproc
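A runnable C rendering of this cycle-based list scheduler for the same DDG, with one shared function unit and unit delays; the index order of operations stands in for a priority heuristic:

#include <stdio.h>

#define N 6
#define E 5

int main(void) {
    const char *op[N] = {"d=a*b", "e=a+d", "t=2*b", "f=t+d", "r=f-e", "x=z+y"};
    int src[E] = {0, 0, 2, 1, 3};
    int dst[E] = {1, 3, 3, 4, 4};
    int cycle[N], done[N] = {0}, nsched = 0, cur = 0;
    int fu_slots = 1;                       /* one function unit per cycle */

    while (nsched < N) {
        int used = 0;
        for (int v = 0; v < N; v++) {
            if (done[v]) continue;
            int ready = 1;                  /* all preds scheduled and finished */
            for (int e = 0; e < E; e++)
                if (dst[e] == v && (!done[src[e]] || cycle[src[e]] + 1 > cur))
                    ready = 0;
            if (ready && used < fu_slots) { /* no resource conflict this cycle */
                cycle[v] = cur;
                done[v] = 1;
                nsched++;
                used++;
            }
        }
        cur++;
    }
    for (int v = 0; v < N; v++)
        printf("cycle %d: %s\n", cycle[v], op[v]);
    return 0;                               /* 6 cycles on one FU, as on slide 20 */
}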

26 Scheduling: Overview
Transforming a sequential program into a parallel program:
read sequential program
read machine description file
for each procedure do
  perform function inlining
  transform an irreducible CFG into a reducible CFG
  perform control flow analysis
  perform loop unrolling
  perform data flow analysis
  perform memory reference disambiguation
  perform register allocation
  for each scheduling scope do
    perform instruction scheduling
write out the parallel program

27 Extended basic block scheduling: Code Motion
CFG blocks: A: a) add r3, r4, 4; b) beq ...  B: c) add r1, r1, r2  C: d) sub r3, r3, r2  D: e) mul r1, r1, r3
Why move code?
Downward code motions: a -> B, a -> C, a -> D, c -> D, d -> D
Upward code motions: c -> A, d -> A, e -> B, e -> C, e -> A

28 Extended Scheduling scope
Code:
A;
if cond then B else C;
D;
if cond then E else F;
G;
CFG: A branches to B and C, which join at D; D branches to E and F, which join at G.

29 Scheduling scopes Hyperblock/region Trace Superblock Decision tree

30 Create and Enlarge Scheduling Scope
[Figure: the CFG (A; B|C; D; E|F; G) converted into a trace and into a superblock; the superblock requires tail duplication, creating copies E', D', G'.]

31 Create and Enlarge Scheduling Scope
[Figure: the same CFG converted into a decision tree (tail duplication creates E', F', G', G'') and into a hyperblock/region.]

32 Comparing scheduling scopes
Why choose a specific scope? The first four are all acyclic, i.e. they contain no back edges.
Trace (Fisher, IEEE Trans. on Computers, 1981):
  use standard list scheduling
  however: lots of bookkeeping and code copying for code motions past fork and join points
Superblock (Hwu et al., Journal of Supercomputing, May 1993):
  easier than trace scheduling: no join points -> no copying during scheduling; only upward code motion -> no motion past forks
  tail duplication needed
Decision tree:
  follows multiple paths
  no join points -> no complex bookkeeping
  no incoming edges -> no code duplication during scheduling
  each block with multiple entries becomes a root -> trees are small -> tail duplication needed
Hyperblock (Warter et al., PLDI, June 1993):
  superblock with multiple paths, if-converted
  single entry
  re-if-conversion for architectures without guarded execution
Region (Bernstein and Rodeh, PLDI 1991):
  regions correspond to bodies of natural loops
  regions can be nested (hierarchical scheduling)
  no profiling needed for region selection (in contrast to the former scopes)
  very large scope (encompasses the other approaches)
Loop:
  keep multiple iterations active at a time
  different approaches, to be discussed later on
Disadvantages of trace and superblock scheduling:
  they follow only one path
  they require a high completion ratio: if the first block is executed, all blocks should have a high probability of being executed
  this requires biased branches and accurate (static) branch prediction

33 Code movement (upwards) within regions: what to check?
[Figure: upward movement of an 'add' from a source block, through intermediate blocks, to a destination block. Legend: at blocks with a second incoming path a copy of the moved code is needed; at intermediate blocks, check that the result is not live on the off-path (off-liveness).]

34 Extended basic block scheduling: Code Motion
A dominates B <=> A is always executed before B.
Consequently: A does not dominate B => code motion from B to A requires code duplication.
B post-dominates A <=> B is always executed after A.
B does not post-dominate A => code motion from B to A is speculative.
Example CFG with blocks A, B, C, D, E, F:
Q1: does C dominate E?
Q2: does C dominate D?
Q3: does F post-dominate D?
Q4: does D post-dominate B?
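Dominators can be computed mechanically with the standard iterative fixed-point algorithm; a minimal C sketch, run on the CFG of the extended-scope example (A -> B,C; B,C -> D; D -> E,F; E,F -> G), since that graph's edges are fully specified above:

#include <stdio.h>

#define N 7  /* A=0, B=1, C=2, D=3, E=4, F=5, G=6 */

int main(void) {
    const char *name = "ABCDEFG";
    /* pred[v] = bitmask of v's predecessors */
    unsigned pred[N] = {0,                 /* A: entry            */
                        1u<<0,             /* B: {A}              */
                        1u<<0,             /* C: {A}              */
                        (1u<<1)|(1u<<2),   /* D: {B,C}            */
                        1u<<3,             /* E: {D}              */
                        1u<<3,             /* F: {D}              */
                        (1u<<4)|(1u<<5)};  /* G: {E,F}            */
    unsigned dom[N];
    dom[0] = 1u << 0;                      /* entry dominates only itself */
    for (int v = 1; v < N; v++) dom[v] = (1u << N) - 1;  /* start from "all" */

    int changed = 1;
    while (changed) {                      /* iterate to a fixed point */
        changed = 0;
        for (int v = 1; v < N; v++) {
            unsigned d = (1u << N) - 1;
            for (int p = 0; p < N; p++)
                if (pred[v] & (1u << p)) d &= dom[p];  /* intersect over preds */
            d |= 1u << v;                  /* every block dominates itself */
            if (d != dom[v]) { dom[v] = d; changed = 1; }
        }
    }
    for (int v = 0; v < N; v++) {
        printf("dom(%c) = {", name[v]);
        for (int p = 0; p < N; p++)
            if (dom[v] & (1u << p)) printf(" %c", name[p]);
        printf(" }\n");
    }
    return 0;
}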

35 Scheduling: Loops
Loop optimizations: loop unrolling, loop peeling

36 Basic block scheduling
Problems with unrolling:
  exploits only parallelism within sets of n iterations
  iteration start-up latency
  code expansion
[Figure: resource utilization over time for basic block scheduling, basic block scheduling with unrolling, and software pipelining.]

37 Software pipelining
Software pipelining a loop is:
scheduling the loop such that iterations start before preceding iterations have finished, or:
moving operations across the backedge.
Example: y = a.x (each iteration: LD, MUL, ST)
[Figure: unrolling gives 5/3 cycles/iteration; software pipelining reaches 1 cycle/iteration, versus 3 cycles/iteration unscheduled.]
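In C terms (a sketch: the source loop stays serial; the pipelined schedule, shown as comments, exists only in the generated code):

/* Source loop y[i] = a * x[i]: each iteration is LD; MUL; ST. */
void scale(float *y, const float *x, float a, int n) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i];
}

/* Software-pipelined schedule (n >= 2), one iteration per cycle in steady state:
     prologue:  LD x[0]
                MUL a*x[0] | LD x[1]
     kernel  :  ST y[i-2]  | MUL a*x[i-1] | LD x[i]     <- 1 cycle/iteration
     epilogue:  ST y[n-2]  | MUL a*x[n-1]
                ST y[n-1]
   versus 3 cycles/iteration when LD, MUL, ST are serialized. */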

38 Software pipelining (cont’d)
Basic techniques:
Modulo scheduling (Rau, Lam): list scheduling with modulo resource constraints
Kernel recognition techniques: unroll the loop, schedule the iterations, identify a repeating pattern
  Examples: Perfect pipelining (Aiken and Nicolau), URPR (Su, Ding and Xia), Petri net pipelining (Allan)
Enhanced pipeline scheduling (Ebcioğlu): fill the first cycle of an iteration, copy this instruction over the backedge
This algorithm is the one most used in commercial compilers

39 Software pipelining: Modulo scheduling
Example: modulo scheduling a loop
(a) Example loop: for (i = 0; i < n; i++) A[i+6] = 3*A[i] - 1;
(b) Code without loop control: ld r1,(r2); mul r3,r1,3; sub r4,r3,1; st r4,(r5)
(c) Software pipeline: prologue, kernel, epilogue
The prologue fills the SW pipeline with iterations; the epilogue drains the SW pipeline.

40 Software pipelining: determine II, the Initiation Interval
Cyclic data dependences for: for (i=0; ...) A[i+6] = 3*A[i] - 1
ld r1,(r2) -> mul r3,r1,3 -> sub r4,r3,1 -> st r4,(r5)
Edges are annotated with (delay, iteration distance) pairs: (1,0) within an iteration, and (1,6) for the loop-carried dependence back to the load six iterations later.
Scheduling constraint along each edge: cycle(v) >= cycle(u) + delay(u,v) - II * distance(u,v)

41 Modulo scheduling constraints
MII, minimum initiation interval, bounded by cyclic dependences and resources:
MII = max{ ResMII, RecMII }
Resources: ResMII = max over resources r of ceil( (#operations using r) / (#units of r) )
Cycles: each dependence cycle c requires delay(c) <= II * distance(c)
Therefore: RecMII = max over cycles c of ceil( delay(c) / distance(c) )
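Worked numbers for the running example; the machine assumptions are mine (one load/store unit shared by ld and st, unit delays on the recurrence):

#include <stdio.h>

/* MII = max(ResMII, RecMII) for A[i+6] = 3*A[i] - 1.
   Assumptions: ld and st share a single load/store unit; the recurrence
   ld -> mul -> sub -> st -> ld has total delay 4 and iteration distance 6. */
int main(void) {
    int ldst_ops = 2, ldst_units = 1;
    int res_mii = (ldst_ops + ldst_units - 1) / ldst_units;      /* ceil(2/1) = 2 */

    int cyc_delay = 4, cyc_distance = 6;
    int rec_mii = (cyc_delay + cyc_distance - 1) / cyc_distance; /* ceil(4/6) = 1 */

    int mii = res_mii > rec_mii ? res_mii : rec_mii;
    printf("ResMII=%d RecMII=%d MII=%d\n", res_mii, rec_mii, mii);
    return 0;
}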

42 Let's go back to: The Role of the Compiler
9 steps required to translate an HLL program (see online book chapter):
1. Front-end compilation
2. Determine dependencies
3. Graph partitioning: make multiple threads (or tasks)
4. Bind partitions to compute nodes
5. Bind operands to locations
6. Bind operations to time slots: scheduling
7. Bind operations to functional units
8. Bind transports to buses
9. Execute operations and perform transports

43 Division of responsibilities between hardware and compiler
Application
(1) Frontend
(2) Determine dependencies
(3) Binding of operands
(4) Scheduling
(5) Binding of operations
(6) Binding of transports
(7) Execute
The compiler/hardware boundary per architecture class: Superscalar after (1), Dataflow after (2), Multi-threaded after (3), Independence architectures after (4), VLIW after (5), TTA after (6). Steps above the boundary are the responsibility of the compiler; the remaining steps are the responsibility of the hardware.

44 Overview Enhance performance: architecture methods
Instruction Level Parallelism
VLIW
Examples: C6, TM, TTA
Clustering
Code generation
Design Space Exploration: TTA framework

45 Mapping applications to processors MOVE framework
[Figure: MOVE framework. An optimizer, steered by user interaction and feedback, sets architecture parameters for a parametric compiler and a hardware generator; Pareto curves (cost vs. execution time) show the solution space. Outputs: parallel object code and a chip, together the TTA based system.]

46 TTA (MOVE) organization
[Figure: TTA (MOVE) organization. Function units (load/store, ALU) and register files (integer, float, boolean, immediate) connect via sockets to the transport buses, between the data memory and the instruction memory with its instruction unit.]

47 Code generation trajectory for TTAs
Frontend: GCC or SUIF (adapted)
Application (C) -> compiler frontend -> sequential code -> sequential simulation (input/output, profiling data)
Sequential code + architecture description -> compiler backend -> parallel code -> parallel simulation (input/output)

48 Exploration: TTA resource reduction

49 Exploration: TTA connectivity reduction
[Figure: execution time versus number of connections removed. Removing connections reduces bus delay until critical connections disappear; ultimately the FU stage constrains the cycle time.]

50 Can we do better? Yes!! How?
Transformations
SFUs: Special Function Units
Vector processing
Multiple Processors
[Pareto graph axes: cost vs. execution time]

51 Transforming the specification
Based on associativity of the + operation: a + (b + c) = (a + b) + c
[Figure: a serial chain of + operations rebalanced into a tree of + operations.]
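A source-level illustration (function names mine): both forms compute a+b+c+d, but the balanced tree has depth 2 instead of 3, so its two inner additions can execute in parallel:

/* Serial chain: ((a+b)+c)+d -- three dependent additions, depth 3. */
int sum_chain(int a, int b, int c, int d)    { return ((a + b) + c) + d; }

/* Balanced tree: (a+b)+(c+d) -- the inner adds are independent, depth 2. */
int sum_balanced(int a, int b, int c, int d) { return (a + b) + (c + d); }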

52 Transforming the specification
Original:
d = a * b; e = a + d; f = 2 * b + d; r = f - e; x = z + y;
Transformed (since r = f - e = (2*b + d) - (a + d) = 2*b - a):
r = 2*b - a; x = z + y;
[Figure: the DDG shrinks to a shift (b << 1) and a subtract for r, plus the add for x.]

53 Changing the architecture adding SFUs: special function units
[Figure: two chained 2-input adders replaced by one 4-input adder SFU.]
Why is this faster?

54 Changing the architecture adding SFUs: special function units
In the extreme case, put everything into one unit!
Spatial mapping - no control flow.
However: no flexibility / programmability!! But one could use FPGAs.

55 SFUs: fine grain patterns
Why use fine grain SFUs:
  Code size reduction
  Register file #ports reduction
  Could be cheaper and/or faster
  Transport reduction
  Power reduction (avoid charging non-local wires)
  Supports the whole application domain!
Which patterns need support? Detection of recurring operation patterns is needed.

56 SFUs: covering results
Adding only 20 'patterns' of 2 operations reduces the number of operations dramatically (by about 40%)!!

57 Exploration: resulting architecture
Architecture for image processing: 9 buses, 4 RFs, 4 adder-cmp FUs, 2 multiplier FUs, 2 diff-add FUs, stream input and output.
Several SFUs; note the reduced connectivity.

58 Conclusions Billions of embedded processing systems / year
how to design these systems quickly, cheaply, correctly, and at low power?
what will their processing platform look like?
VLIWs are very powerful and flexible, and can easily be tuned to an application domain
TTAs are even more flexible, scalable, and lower power

59 Conclusions
Compilation for ILP architectures is mature and used in commercial compilers.
However:
there is a great discrepancy between available and exploitable parallelism
advanced code scheduling techniques are needed to exploit ILP

60 Bottom line: Do not pay for hardware if you can do it in software!!

61 Hands-on (2005) Map JPEG to a TTA processor
See web page:
Install TTA tools (compiler and simulator)
Go through all listed steps
Perform DSE: design space exploration
Add SFU
1 or 2 page report in 2 weeks

62 Hands-on (2006/7) Let’s look at DSE: Design Space Exploration
We will use the Imagine processor

63 Hands-on 1 (2008/9) VLIW processor of Silicon Hive
Map an image processing algorithm
Optimize the mapping
Optimize the architecture

