Embedded Systems in Silicon TD5102 Compilers with emphasis on ILP compilation Henk Corporaal http://www.ics.ele.tue.nl/~heco/courses/EmbSystems Technical University Eindhoven DTI / NUS Singapore 2005/2006
Compiling for ILP Architectures
Overview:
– Motivation and Goals
– Measuring and exploiting available parallelism
– Compiler basics
– Scheduling for ILP architectures
– Summary and Conclusions
Motivation
– Performance requirements increase
– Applications may contain much instruction-level parallelism
– Processors offer lots of hardware concurrency
Problem to be solved: how do we exploit this concurrency automatically?
Goals of code generation
– High speedup: exploit all the hardware concurrency and extract all application parallelism
  – obey true dependences only
  – resolve false dependences by renaming
– No code rewriting: automatic parallelization (however, application tuning may be required)
– Limit code expansion
Overview
– Motivation and Goals
– Measuring and exploiting available parallelism
– Compiler basics
– Scheduling for ILP architectures
– Summary and Conclusions
Measuring and exploiting available parallelism
How to measure parallelism within applications?
– Using an existing compiler
– Using trace analysis:
  – track all real data dependences (RAWs) of the instructions in the issue window, both register and memory dependences
  – check for correct branch prediction: if the prediction is correct, continue; if it is wrong, flush the schedule and restart in the next cycle
Trace analysis
Program:
  for i := 0..2
    A[i] := i;
  S := X+3;
Compiled code:
        set r1,0
        set r2,3
        set r3,&A
  Loop: st r1,0(r3)
        add r1,r1,1
        add r3,r3,4
        brne r1,r2,Loop
        add r1,r5,3
Execution trace (16 instructions):
  set r1,0; set r2,3; set r3,&A
  st r1,0(r3); add r1,r1,1; add r3,r3,4; brne r1,r2,Loop   (iteration 1)
  st r1,0(r3); add r1,r1,1; add r3,r3,4; brne r1,r2,Loop   (iteration 2)
  st r1,0(r3); add r1,r1,1; add r3,r3,4; brne r1,r2,Loop   (iteration 3)
  add r1,r5,3
How parallel can this code be executed?
Trace analysis
Parallel trace (one line per cycle; later iterations start before the preceding brne resolves):
  set r1,0 | set r2,3 | set r3,&A
  st r1,0(r3) | add r1,r1,1 | add r3,r3,4
  st r1,0(r3) | add r1,r1,1 | add r3,r3,4 | brne r1,r2,Loop
  st r1,0(r3) | add r1,r1,1 | add r3,r3,4 | brne r1,r2,Loop
  brne r1,r2,Loop
  add r1,r5,3
Max ILP = Speedup = L_serial / L_parallel = 16 / 6 = 2.7
Ideal Processor
Assumptions for an ideal/perfect processor:
1. Register renaming: an infinite number of virtual registers => all register WAW & WAR hazards are avoided
2. Branch and jump prediction: perfect => all program instructions are available for execution
3. Memory-address alias analysis: all addresses are known; a store can be moved before a load provided the addresses are not equal
Also:
– unlimited number of instructions issued per cycle (unlimited resources)
– unlimited instruction window
– perfect caches
– 1-cycle latency for all instructions (including FP * and /)
Programs were compiled using the MIPS compiler at maximum optimization level
Upper Limit to ILP: Ideal Processor
[Chart] Measured IPC: integer programs 18–60, floating-point programs 75–150
Different effects reduce the exploitable parallelism
– Reduced window size, i.e. the number of instructions to choose from
– Non-perfect branch prediction:
  – perfect (oracle model)
  – dynamic predictor (e.g. a 2-bit prediction table with a finite number of entries)
  – static prediction (using profiling)
  – no prediction
– Restricted number of registers for renaming; typical superscalars have O(100) registers
– Restricted number of other resources, like FUs
Different effects reduce the exploitable parallelism (cont'd)
Non-perfect alias analysis (memory disambiguation). Models to use:
– perfect
– inspection: no dependence in cases like
    r1 := 0(r9)        r1 := 0(fp)
    4(r9) := r2        0(gp) := r2
  A more advanced analysis may disambiguate most stack and global references, but not the heap references
– none
Important: good branch prediction, 128 registers for renaming, alias analysis on stack and global accesses, and (for floating-point code) a large window size
Summary
The amount of parallelism is limited:
– higher in multimedia applications
– higher in kernels
Trace analysis detects all types of parallelism: task, data and operation types.
The detected parallelism depends on:
– quality of the compiler
– hardware
– source-code transformations
Overview
– Motivation and Goals
– Measuring and exploiting available parallelism
– Compiler basics
– Scheduling for ILP architectures
– Source level transformations
– Compilation frameworks
– Summary and Conclusions
Compiler basics
Overview:
– Compiler trajectory / structure / passes
– Abstract Syntax Tree (AST)
– Control Flow Graph (CFG)
– Data Dependence Graph (DDG)
– Basic optimizations
– Register allocation
– Code selection
Compiler basics: trajectory
Source program → Preprocessor → Compiler → Assembler → Loader/Linker → Object program
(the compiler emits error messages; the linker adds library code)
Compiler basics: structure / passes
Source code
→ Lexical analyzer (token generation)
→ Parsing (syntax and semantic checks, parse-tree generation) → intermediate code
→ Code optimization (data-flow analysis, local and global optimizations)
→ Code generation (code selection, peephole optimizations) → sequential code
→ Register allocation (building the interference graph, graph coloring, spill-code insertion, caller/callee save and restore code)
→ Scheduling and allocation (exploiting ILP) → object code
Compiler basics: structure
Simple compilation example for: position := initial + rate * 60
After the lexical analyzer:  id1 := id2 + id3 * 60
After the syntax analyzer:   parse tree of the assignment (:=, +, * nodes over the identifiers and 60)
Intermediate code generator:
  temp1 := inttoreal(60)
  temp2 := id3 * temp1
  temp3 := id2 + temp2
  id1 := temp3
Code optimizer:
  temp1 := id3 * 60.0
  id1 := id2 + temp1
Code generator:
  movf id3, r2
  mulf #60, r2, r2
  movf id2, r1
  addf r2, r1
  movf r1, id1
Compiler basics: structure – SUIF-1 toolkit example
– Front ends: FORTRAN (via FORTRAN-to-C) and C; pre-processing, C front-end, conversion of non-standard structures to SUIF
– High-SUIF analyses and transformations: constant propagation, forward propagation, induction variable identification, scalar privatization analysis, reduction analysis, locality optimization and parallelism analysis, parallel code generation, FORTRAN-specific transformations
– Output converters: SUIF to text, SUIF to postscript, SUIF to C
– Low-level path (high-SUIF to low-SUIF): constant propagation, strength reduction, dead-code elimination, register allocation, assembly code generation → assembly code
Compiler basics: Abstract Syntax Tree (AST)
C input code:
  if (a > b) { r = a % b; }
  else       { r = b % a; }
Parse tree ('infinite' nesting):
  Stat: IF
    cond: Cmp > (Var a, Var b)
    then: Statlist → Stat → Expr → Assign (Var r, Binop % (Var a, Var b))
    else: Statlist → Stat → Expr → Assign (Var r, Binop % (Var b, Var a))
Compiler basics: Control Flow Graph (CFG)
C input code:
  if (a > b) { r = a % b; }
  else       { r = b % a; }
CFG:
  BB1: sub t1, a, b
       bgz t1, 2, 3
  BB2: rem r, a, b
       goto 4
  BB3: rem r, b, a
       goto 4
  BB4: ...
A program is a collection of functions, each function is a collection of basic blocks, each basic block contains a set of instructions, and each instruction consists of several transports.
Compiler basics: Data Dependence Graph (DDG)
Source:
  a := b + 15;
  c := 3.14 * d;
  e := c / f;
Translation to a DDG: each statement becomes a small graph of loads (ld &b, ld &d, ld &f), operations (+ 15, * 3.14, /) and stores (st &a, st &c, st &e); the edges are the data dependences between them, e.g. the ld of &c feeding the division comes after the st of &c.
Compiler basics: Basic optimizations
– Machine independent optimizations
– Machine dependent optimizations
(details are in any good compiler book)
Machine independent optimizations
– Common subexpression elimination
– Constant folding
– Copy propagation
– Dead-code elimination
– Induction variable elimination
– Strength reduction
– Algebraic identities:
  – commutative expressions
  – associativity: tree-height reduction
  – note: not always allowed (due to limited precision)
Machine dependent optimization example
What's the optimal implementation of a*34?
– Use a multiplier: mul Tb,Ta,34
  – Pro: no thinking required
  – Con: may take many cycles
– Alternative:
    SHL Tc, Ta, 1
    ADD Tb, Tc, Tzero
    SHL Tc, Tc, 4
    ADD Tb, Tb, Tc
  – Pro: may take fewer cycles
  – Cons: uses more registers; additional instructions (I-cache load / code size)
Compiler basics: Register allocation
Register organization: conventions are needed for parameter passing and register usage across function calls. A MIPS-style example:
  r0          hard-wired 0
  r1 – r10    argument and result transfer
  r11 – r20   caller-saved registers (temporaries)
  r21 – r31   callee-saved registers
Register allocation using graph coloring
Given a set of registers, what is the most efficient mapping of registers to program variables in terms of execution time of the program?
– A variable is defined at a point in the program when a value is assigned to it.
– A variable is used at a point in the program when its value is referenced in an expression.
– The live range of a variable is the execution range between its definitions and its uses.
Register allocation using graph coloring
Example program with live ranges:
  a :=          | a live
  c :=          | a, c
  b :=          | a, b, c
     := b       | a, c        (last use of b)
  d :=          | a, c, d
     := a       | c, d        (last use of a)
     := c       | d           (last use of c)
     := d       |             (last use of d)
Register allocation using graph coloring
Interference graph (an edge connects variables whose live ranges overlap):
  a – b, a – c, a – d, b – c, c – d
Coloring: a = red, b = green, c = blue, d = green
The graph needs 3 colors (chromatic number = 3) => the program needs 3 registers
Register allocation using graph coloring
Spill/reload code is needed when there are not enough colors (registers) to color the interference graph.
Example: only two registers available!
  a :=
  c :=
  store c        (spill)
  b :=
     := b
  d :=
     := a
  load c         (reload)
     := c
     := d
Compiler basics: Code selection
CISC era:
– Code size important
– Determine the shortest code sequence; many options may exist
– Pattern matching. Example (M68020), for D1 := D1 + M[ M[10+A1] + 16*D2 + 20 ]:
    ADD ([10,A1], D2*16, 20), D1
RISC era:
– Performance important
– Only a few possible code sequences
– New implementations of old architectures optimize only the RISC part of the instruction set; e.g. i486 / Pentium / M68020
Overview
– Motivation and Goals
– Measuring and exploiting available parallelism
– Compiler basics
– Scheduling for ILP architectures
– Source level transformations
– Compilation frameworks
– Summary and Conclusions
What is scheduling?
Time allocation:
– Assigning instructions or operations to time slots
– Preserving dependences: register dependences, memory dependences
– Optimizing code with respect to performance / code size / power consumption / ...
Space allocation — satisfy resource constraints:
– Bind operations to FUs
– Bind variables to registers / register files
– Bind transports to buses
Why scheduling?
Let's look at the execution time:
  T_execution = N_cycles x T_cycle = N_instructions x CPI x T_cycle
Scheduling may reduce T_execution:
– Reduce CPI (cycles per instruction):
  – early scheduling of long-latency operations
  – avoid pipeline stalls due to structural, data and control hazards
  – allow N_issue > 1 and therefore CPI < 1
– Reduce N_instructions: compact many operations into each instruction (VLIW)
Scheduling data hazards (RaW dependences)
Avoiding RaW stalls: reordering of instructions by the compiler.
Example: avoiding a one-cycle load interlock for
  a = b + c
  d = e - f
Unscheduled code:
  Lw R1,b
  Lw R2,c
  Add R3,R1,R2    ; interlock
  Sw a,R3
  Lw R1,e
  Lw R2,f
  Sub R4,R1,R2    ; interlock
  Sw d,R4
Scheduled code:
  Lw R1,b
  Lw R2,c
  Lw R5,e         ; extra register needed!
  Add R3,R1,R2
  Lw R2,f
  Sw a,R3
  Sub R4,R5,R2
  Sw d,R4
Scheduling control hazards
A branch requires 3 actions:
– Compute the new address
– Determine the condition
– Perform the actual branch (if taken): PC := new address
[Pipeline diagram: IF ID OF EX WB stages; while the branch resolves, the instructions after it are already being fetched — effectively a predict-not-taken scheme]
Control hazards: what's the penalty?
  CPI = CPI_ideal + f_branch x P_branch
  P_branch = N_delayslots x miss_rate
Superscalars tend to have a large branch penalty P_branch due to:
– many pipeline stages
– multiple instructions (or operations) per cycle
Note: the lower the CPI, the larger the effect of penalties
What can we do about control hazards and the CPI penalty?
– Keep the penalty P_branch low:
  – early computation of the new PC
  – early determination of the condition
  – visible delay slots filled by the compiler (MIPS)
– Branch prediction
– Reduce control dependences (control height reduction) [Schlansker and Kathail, Micro'95]
– Remove branches: if-conversion
  – conditional instructions: CMOVE, conditional skip of the next instruction
  – guarding all instructions: TriMedia
Scheduling: Conditional instructions
Example: CMOVE (supported by Alpha)
  if (A == 0) S = T;    assume: r1: A, r2: S, r3: T
Object code:
     Bnez r1, L
     Mov r2, r3
  L: ...
After conversion:
     Cmovz r2, r3, r1
Scheduling: Conditional instructions
Conditional instructions are useful, however:
– Squashed instructions still take execution time and execution resources; consequently, long target blocks cannot be if-converted
– The condition has to be known early
– Moving operations across multiple branches requires complicated predicates
– Compatibility: requires a change of ISA (instruction set architecture)
Practice:
– Current superscalars support a limited set of conditional instructions
  – CMOVE: Alpha, MIPS, PowerPC, SPARC
  – HP PA: any RR instruction can conditionally squash the next instruction
– Large VLIWs profit from making all instructions conditional
  – guarded execution: TriMedia, Intel/HP IA-64, TI C6x
Guarded execution: IF-conversion
Before:
        SLT r1,r2,r3
        BEQ r1,r0,else
  then: ADDI r2,r2,1
        ..X..
        j cont
  else: SUBI r2,r2,1
        ..Y..
  cont: MUL r4,r2
After IF-conversion:
        SLT b1,r2,r3
   b1:  ADDI r2,r2,1
  !b1:  SUBI r2,r2,1
   b1:  ..X..
  !b1:  ..Y..
        MUL r4,r2
Scheduling: Conditional instructions
Full guard support — if-conversion of conditional code.
Assume:
  t_branch   branch latency
  p_branch   branching probability
  t_true     execution time of the TRUE branch
  t_false    execution time of the FALSE branch
Execution times of the original and if-converted code for a non-ILP architecture:
  t_original_code     = (1 + p_branch) x t_branch + p_branch x t_true + (1 - p_branch) x t_false
  t_if_converted_code = t_true + t_false
Scheduling: Conditional instructions
[Graph] Speedup of if-converted code for non-ILP architectures: only interesting for short target blocks!
Scheduling: Conditional instructions
[Graph] Speedup of if-converted code for ILP architectures with sufficient resources: a much larger area of interest, since
  t_if_converted = max(t_true, t_false)
Scheduling: Conditional instructions
Full guard support for large ILP architectures has a number of advantages:
– Removing unpredictable branches
– Enlarging the scheduling scope
– Enabling software pipelining
– Enhancing code motion when speculation is not allowed
– Resource sharing; even when speculation is allowed, guarding may be profitable
Scheduling: Overview
Transforming a sequential program into a parallel program:
  read sequential program
  read machine description file
  for each procedure do
    perform function inlining
  for each procedure do
    transform an irreducible CFG into a reducible CFG
    perform control flow analysis
    perform loop unrolling
    perform data flow analysis
    perform memory reference disambiguation
    perform register allocation
    for each scheduling scope do
      perform instruction scheduling
  write parallel program
Scheduling: Integer Linear Programming
Integer linear programming scheduling method. Introduce:
– Decision variables: x_{i,j} = 1 if operation i is scheduled in cycle j
– Constraints like:
  – limited resources: for every cycle j and every operation type t,
      sum over all operations i of type t of x_{i,j} <= M_t
    where M_t is the number of resources of type t
  – data dependence constraints
  – timing constraints
Problem: too many decision variables
List Scheduling
– Make a dependence graph
– Determine the minimal schedule length
– Determine ASAP, ALAP, and slack of each operation
– Place each operation in the first cycle with sufficient resources
Notes:
– The scheduling order is sequential
– Priority is determined by the heuristic used, e.g. slack
Basic Block Scheduling
[Figure: dependence graph of a basic block computing X, y and z from inputs A, B and C via LD, ADD, SUB, MUL and NEG operations; each node is annotated with its ASAP cycle, ALAP cycle and slack]
ASAP and ALAP formulas
  asap(v)  = max { asap(u) + delay(u,v) | (u,v) ∈ E }   if pred(v) ≠ ∅
           = 0                                           otherwise
  alap(v)  = min { alap(u) - delay(v,u) | (v,u) ∈ E }   if succ(v) ≠ ∅
           = L_max                                       otherwise
  slack(v) = alap(v) - asap(v)
Cycle-based list scheduling
  proc Schedule (DDG = (V,E))
  beginproc
    ready  = { v | ¬∃(u,v) ∈ E }        // all nodes without predecessors
    ready' = ready                       // all nodes schedulable in this cycle
    sched  = ∅
    current_cycle = 0
    while sched ≠ V do
      for each v ∈ ready' do
        if ¬ResourceConfl(v, current_cycle, sched) then
          cycle(v) = current_cycle
          sched = sched ∪ {v}
        endif
      endfor
      current_cycle = current_cycle + 1
      ready  = { v | v ∉ sched ∧ ∀(u,v) ∈ E: u ∈ sched }
      ready' = { v | v ∈ ready ∧ ∀(u,v) ∈ E: cycle(u) + delay(u,v) ≤ current_cycle }
    endwhile
  endproc
Problem with basic block scheduling
– Basic blocks contain on average only about 6 instructions
– Unrolling may help for loops
– Go beyond basic blocks:
  1. Extended basic block scheduling
  2. Software pipelining
Extended basic block scheduling: Scope
Partitioning a CFG into scheduling scopes:
[Figure: a CFG with blocks A–G partitioned as a Trace, and as a Superblock obtained by tail duplication of blocks E, D and G]
Extended basic block scheduling: Scope
Partitioning a CFG into scheduling scopes (cont'd):
[Figure: the same CFG partitioned as a Hyperblock/region, and as a Decision Tree obtained by tail duplication]
Extended basic block scheduling: Scope
[Table comparing the scheduling scopes: trace, superblock, hyperblock/region and decision tree]
Extended basic block scheduling: Code Motion
Example CFG:
  A: a) add r4, r4, 4
     b) beq ...
  B: c) add r1, r1, r2
  C: d) sub r1, r1, r2
  D: e) st r1, 8(r4)
Downward code motions? a→B, a→C, a→D, c→D, d→D
Upward code motions? c→A, d→A, e→B, e→C, e→A
Extended basic block scheduling: Code Motion
[Figure: moving an operation b from a source block to a destination block; legend: basic blocks between the source and destination blocks, control flow edges where off-liveness checks have to be performed, basic blocks where duplicates b' have to be placed, destination blocks, source blocks]
SCP (single copy on a path) rule: no path may exist between 2 different duplication (D) blocks
Extended basic block scheduling: Code Motion
– A dominates B <=> A is always executed before B
  – Consequently: if A does not dominate B, code motion from B to A requires code duplication
– B post-dominates A <=> B is always executed after A
  – Consequently: if B does not post-dominate A, code motion from B to A is speculative
[Figure: example CFG with A at the top, branching to B and C, then D and E, joining in F]
Q1: does C dominate E?
Q2: does C dominate D?
Q3: does F post-dominate D?
Q4: does D post-dominate B?
Scheduling: Loops
Loop optimizations:
– Loop peeling: peel one iteration (C') out of the loop body C
– Loop unrolling: duplicate the loop body (C, C', C'')
[Figure: CFG A→B→C→D with the peeled and unrolled variants]
Scheduling: Loops
Problems with unrolling:
– Exploits only parallelism within sets of n iterations
– Iteration start-up latency
– Code expansion
[Figure: resource utilization over time for basic block scheduling, basic block scheduling with unrolling, and software pipelining]
Software pipelining
Software pipelining a loop is:
– Scheduling the loop such that iterations start before preceding iterations have finished, or:
– Moving operations across the backedge
Example: y = a.x with a 3-operation body (LD, ML, ST)
– sequential: 3 cycles/iteration
– unrolling (3 iterations): 5/3 cycles/iteration
– software pipelining: 1 cycle/iteration
Software pipelining: Modulo scheduling
Example: modulo scheduling a loop
(a) Example loop:
  for (i = 0; i < n; i++)
    a[i+6] = 3*a[i] - 1;
(b) Code without loop control:
  ld  r1,(r2)
  mul r3,r1,3
  sub r4,r3,1
  st  r4,(r5)
(c) Software pipeline: the four-operation bodies of consecutive iterations are overlapped, one stage per iteration per cycle.
– The prologue fills the SW pipeline with iterations
– The kernel is the steady state
– The epilogue drains the SW pipeline
Summary and Conclusions
Compilation for ILP architectures is getting mature and is entering the commercial arena.
However: there is a great discrepancy between available and exploitable parallelism.
What if you need more parallelism?
– source-to-source transformations
– use other algorithms