Platform-based Design TU/e 5kk70 Henk Corporaal Bart Mesman Exploiting ILP VLIW architectures (part b) ILP compilation (part a)

6/25/2015 Processor Architectures and Program Mapping H. Corporaal, J. van Meerbergen, and B. Mesman
What are we talking about?
ILP = Instruction Level Parallelism = the ability to perform multiple operations (or instructions), from a single instruction stream, in parallel
VLIW = Very Long Instruction Word architecture
Instruction format: operation 1 | operation 2 | operation 3 | operation 4 | operation 5

VLIW evaluation
Strong points of VLIW:
–Scalable (add more FUs)
–Flexible (an FU can be almost anything; e.g. multimedia support)
Weak points, with N FUs:
–Bypassing complexity: O(N^2)
–Register file complexity: O(N)
–Register file size: O(N^2)
Register file design restricts FU flexibility
Solution: ?

VLIW evaluation
[Figure: VLIW CPU with instruction memory, instruction fetch unit, instruction decode unit, FU-1 to FU-5, bypassing network, register file, and data memory. Control problem with N function units; annotations O(N^2) and O(N)-O(N^2).]

Solution
TTA: Transport Triggered Architecture
[Figure: function unit symbols >, st, *, +-]

Transport Triggered Architecture
General organization of a TTA
[Figure: CPU with instruction memory, instruction fetch unit, instruction decode unit, FU-1 to FU-5, bypassing network, register file, and data memory]

TTA structure; datapath details
[Figure: datapath with sockets, data memory, and instruction memory]

TTA hardware characteristics
Modular: building blocks are easy to reuse
Very flexible and scalable
–easy inclusion of Special Function Units (SFUs)
Very low complexity
–more than 50% reduction in the number of register ports
–reduced bypass complexity (no associative matching)
–up to 80% reduction in bypass connectivity
–trivial decoding
–reduced register pressure
–easy register file partitioning (a single port is enough!)

TTA software characteristics
More difficult to schedule!
But: extra scheduling optimizations
add r3, r1, r2
becomes
r1 -> add.o1; r2 -> add.o2; add.r -> r3
That does not look like an improvement!?!
[Figure: adder FU with operand inputs o1 and o2 and result output r]

Programming TTAs
How to do data operations?
1. Transport of operands to FU
 –Operand move(s)
 –Trigger move
2. Transport of results from FU
 –Result move(s)
How to do control flow?
1. Jump: #jump-address -> pc
2. Branch: #displacement -> pcd
3. Call: pc -> r; #call-address -> pcd
Example: add r3,r1,r2 becomes
r1 -> Oint     // operand move to integer unit
r2 -> Tadd     // trigger move to integer unit
...            // addition operation in progress
Rint -> r3     // result move from integer unit
[Figure: FU pipeline with trigger, operand, internal stage, and result]
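The add r3,r1,r2 example above can be sketched as a small expansion routine. This is only an illustration: the function name `to_tta_moves` and the tuple representation of a move are assumptions, and the port names simply follow the slide's Oint / Tadd / Rint pattern.

```python
def to_tta_moves(op, dst, src1, src2, fu="int"):
    """Expand a three-operand operation into TTA moves (source, destination).

    Hypothetical helper: port names follow the slide's naming pattern
    (O<fu> = operand register, T<op> = trigger register, R<fu> = result register).
    """
    return [
        (src1, f"O{fu}"),   # operand move to the function unit
        (src2, f"T{op}"),   # trigger move: writing this port starts the operation
        (f"R{fu}", dst),    # result move from the function unit
    ]


moves = to_tta_moves("add", "r3", "r1", "r2")
for src, dst in moves:
    print(f"{src} -> {dst}")
```

One operation becomes three moves; the scheduler is then free to place each move separately.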

Scheduling example
VLIW:
add r1,r1,r2
sub r4,r1,95
TTA:
r1 -> add.o1, r2 -> add.o2
add.r -> sub.o1, 95 -> sub.o2
sub.r -> r4
[Figure: TTA with integer RF, immediate unit, two integer ALUs, and load/store unit]

TTA instruction format
General MOVE field: g | i | src | dst
–g: guard specifier
–i: immediate specifier
–src: source
–dst: destination
General MOVE instruction, multiple fields: move 1 | move 2 | move 3 | move 4
How to use immediates?
–Small, 6 bits: g | 1 | imm | dst
–Long, 32 bits: g | 0 | Ir-1 | dst | imm

Programming TTAs
How to do conditional execution: each move is guarded
Example:
r1 -> cmp.o1    // operand move to compare unit
r2 -> cmp.o2    // trigger move to compare unit
cmp.r -> g      // put result in boolean register g
g: r3 -> r4     // guarded move takes place when r1 = r2
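The effect of the guarded-move example can be mimicked in a few lines. This is a behavioral sketch only, assuming the compare unit tests for equality as the slide's comment suggests; the function name and the dictionary register file are made up for illustration.

```python
def run_guarded_example(r1, r2, r3, r4):
    """Simulate: r1 -> cmp.o1; r2 -> cmp.o2; cmp.r -> g; g: r3 -> r4."""
    regs = {"r1": r1, "r2": r2, "r3": r3, "r4": r4}
    g = regs["r1"] == regs["r2"]   # cmp.r -> g (boolean guard register)
    if g:                          # g: r3 -> r4, the move happens only if g is true
        regs["r4"] = regs["r3"]
    return regs


print(run_guarded_example(5, 5, 7, 0)["r4"])  # guard true: r4 takes the value of r3
print(run_guarded_example(5, 6, 7, 0)["r4"])  # guard false: r4 is unchanged
```

No branch is needed: the move is always issued, and the guard decides whether it takes effect.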

Register file port pressure for TTAs

Summary of TTA advantages
Better usage of transport capacity
–Instead of 3 transports per dyadic operation, about 2 are needed
–Number of register ports reduced by at least 50%
–Inter-FU connectivity reduced by 50-70%
 No full connectivity required
Both the transport capacity and the number of register ports become independent design parameters; this removes one of the major bottlenecks of VLIWs
Flexible: FUs can incorporate arbitrary functionality
Scalable: the number of FUs, register files, etc. can be changed
FU splitting results in extra exploitable concurrency
TTAs are easy to design and can have short cycle times

TTA automatic DSE
[Figure: the Move framework, with architecture parameters, optimizer, parametric compiler, hardware generator, feedback, user interaction, parallel object code, and chip; Pareto curve (solution space) of cost versus execution time]

Overview
Enhance performance: architecture methods
Instruction Level Parallelism
VLIW
Examples
–C6
–TM
–TTA
Clustering and Reconfigurable components
Code generation
Hands-on

Clustered VLIW
Clustering = splitting up the VLIW data path; the same can be done for the instruction path
[Figure: three clusters, each with FUs, a loop buffer, and a register file; Level 1 instruction cache, Level 1 data cache, Level 2 (shared) cache]

Clustered VLIW
Why clustering?
Timing: faster clock
Lower cost
–silicon area
–T2M (time to market)
Lower energy

Fine-grained reconfigurable: Xilinx XC4000 FPGA
[Figure: programmable interconnect, I/O Blocks (IOBs), Configurable Logic Blocks (CLBs)]

Coarse-grained reconfigurable: Chameleon CS2000
Highlights:
–32-bit datapath (ALU/Shift)
–16x24 multiplier
–distributed local memory
–fixed timing

Hybrid FPGAs: Virtex II-Pro
[Figure, courtesy of Xilinx (Virtex II Pro): PowerPCs, reconfigurable logic blocks, memory blocks, GHz IO with up to 16 serial transceivers]

HW or SW reconfigurable?
[Figure: FPGA versus VLIW. Axes: data path granularity (fine to coarse) and reconfiguration time (1 cycle, loop buffer, context, reset); other labels: subword parallelism, spatial mapping, temporal mapping, configuration bandwidth]

Granularity Makes Differences

Overview
Enhance performance: architecture methods
Instruction Level Parallelism
VLIW
Examples
–C6
–TM
–TTA
Clustering and Reconfigurable components
Code generation
Hands-on

Compiler basics
Overview
–Compiler trajectory / structure / passes
–Control Flow Graph (CFG)
–Mapping and Scheduling
–Basic block list scheduling
–Extended scheduling scope
–Loop scheduling
–Loop transformations

Compiler basics: trajectory
[Figure: source program -> preprocessor -> compiler -> assembler -> loader/linker -> object program; other labels: error messages, library code]

Compiler basics: structure / passes
Data: source code -> intermediate code -> sequential code -> object code
Lexical analyzer
–token generation
Parsing
–check syntax
–check semantic
–parse tree generation
Code optimization
–data flow analysis
–local optimizations
–global optimizations
Code generation
–code selection
–peephole optimizations
Register allocation
–making interference graph
–graph coloring
–spill code insertion
–caller / callee save and restore code
Scheduling and allocation
–exploiting ILP

Compiler basics: structure
Simple compilation example
Source:
 position := initial + rate * 60
After the lexical analyzer:
 id1 := id2 + id3 * 60
After the syntax analyzer:
 [Figure: parse tree with := at the root, + below it, and * below +]
After the intermediate code generator:
 temp1 := inttoreal(60)
 temp2 := id3 * temp1
 temp3 := id2 + temp2
 id1 := temp3
After the code optimizer:
 temp1 := id3 * 60.0
 id1 := id2 + temp1
After the code generator:
 movf id3, r2
 mulf #60, r2, r2
 movf id2, r1
 addf r2, r1
 movf r1, id1

Compiler basics: Control flow graph (CFG)
C input code:
if (a > b) { r = a % b; } else { r = b % a; }
CFG:
1: sub t1, a, b
   bgz t1, 2, 3
2: rem r, a, b
   goto 4
3: rem r, b, a
   goto 4
4: ...
A program is a collection of functions; each function is a collection of basic blocks; each BB contains a set of instructions; each instruction consists of several transports, ...
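One way to hold the CFG of this example in memory is a dictionary of basic blocks with successor lists. The layout below (keys `code` and `succ`, helper `predecessors`) is an assumed representation for illustration; block 4 is left empty because its contents are elided in the slide.

```python
# Basic blocks of the if/else example, keyed by block number.
cfg = {
    1: {"code": ["sub t1, a, b", "bgz t1, 2, 3"], "succ": [2, 3]},
    2: {"code": ["rem r, a, b", "goto 4"], "succ": [4]},
    3: {"code": ["rem r, b, a", "goto 4"], "succ": [4]},
    4: {"code": [], "succ": []},  # contents elided in the source
}


def predecessors(graph):
    """Invert the successor lists to get the predecessors of every block."""
    preds = {block: [] for block in graph}
    for block, node in graph.items():
        for succ in node["succ"]:
            preds[succ].append(block)
    return preds


print(predecessors(cfg))
```

Block 4 has two predecessors, which marks it as the join point of the two branches.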

Compiler basics: Basic optimizations
Machine independent optimizations
Machine dependent optimizations

Compiler basics: Basic optimizations
Machine independent optimizations
–Common subexpression elimination
–Constant folding
–Copy propagation
–Dead-code elimination
–Induction variable elimination
–Strength reduction
–Algebraic identities
 Commutative expressions
 Associativity: tree height reduction
 –Note: not always allowed (due to limited precision)
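The note on limited precision can be made concrete with IEEE 754 double-precision floats, where addition is not associative. The values below are chosen for illustration and are not from the slides.

```python
a, b, c = 1e20, 1.0, -1e20

# Original evaluation order: b is far below the precision of a and is absorbed.
left_to_right = (a + b) + c

# Reassociated order (as tree height reduction might produce): a and c cancel first.
reassociated = (a + c) + b

print(left_to_right, reassociated)
```

The two orders give different results, so a compiler may only reassociate floating-point expressions when the language semantics or the user explicitly allow it.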

Compiler basics: Basic optimizations
Machine dependent optimization example
What's the optimal implementation of a*34?
–Use multiplier: mul Tb, Ta, 34
 Pro: no thinking required
 Con: may take many cycles
–Alternative:
 SHL Tc, Ta, 1
 ADD Tb, Tc, Tzero
 SHL Tc, Tc, 4
 ADD Tb, Tb, Tc
 Pros: may take fewer cycles
 Cons: uses more registers; additional instructions (I-cache load / code size)
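The shift-and-add alternative works because 34 = 2 + 32. A quick check that mirrors the four instructions one to one (the function name is made up; Python integers stand in for machine registers, so overflow is not modeled):

```python
def mul34_shift_add(a):
    """Compute a * 34 with the slide's SHL/ADD sequence."""
    tc = a << 1    # SHL Tc, Ta, 1      -> Tc = 2a
    tb = tc + 0    # ADD Tb, Tc, Tzero  -> Tb = 2a
    tc = tc << 4   # SHL Tc, Tc, 4      -> Tc = 32a
    tb = tb + tc   # ADD Tb, Tb, Tc     -> Tb = 2a + 32a = 34a
    return tb


print(all(mul34_shift_add(a) == a * 34 for a in range(-100, 100)))
```

The sequence needs the extra register Tc and four instructions, which is the cost listed under "Cons".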

Compiler basics: Register allocation
Register organization
Conventions are needed for parameter passing and register usage across function calls
[Figure: register file r0 to r31 with boundaries marked at r1, r10/r11, and r20/r21; region labels: hard-wired 0, argument and result transfer, temporaries, caller saved registers, callee saved registers]

Register allocation using graph coloring
Given a set of registers, what is the most efficient mapping of registers to program variables in terms of execution time of the program?
A variable is defined at a point in a program when a value is assigned to it.
A variable is used at a point in a program when its value is referenced in an expression.
The live range of a variable is the execution range between definitions and uses of a variable.

Register allocation using graph coloring
Example program:
a :=
c :=
b :=
  := b
d :=
  := a
  := c
  := d
[Figure: live ranges of a, b, c, d]

Register allocation using graph coloring
[Figure: interference graph with nodes a, b, c, d, and the same graph colored]
Coloring: a = red, b = green, c = blue, d = green
Graph needs 3 colors => program needs 3 registers
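The example can be reproduced with a short sketch: live ranges are taken as statement intervals read off the example program (statements numbered 1 to 8), two variables interfere when their intervals overlap, and colors are assigned greedily. The greedy pass is a simplification chosen for brevity, not the allocator described in these slides, and all names are illustrative.

```python
# (first definition, last use) per variable, numbering the 8 statements from 1.
live = {"a": (1, 6), "c": (2, 7), "b": (3, 4), "d": (5, 8)}


def interference(ranges):
    """Connect every pair of variables whose live ranges overlap."""
    graph = {v: set() for v in ranges}
    names = list(ranges)
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            (s1, e1), (s2, e2) = ranges[u], ranges[v]
            if s1 <= e2 and s2 <= e1:
                graph[u].add(v)
                graph[v].add(u)
    return graph


def greedy_color(graph, order, colors):
    """Give each variable the first color not used by an already colored neighbor."""
    assignment = {}
    for v in order:
        used = {assignment[n] for n in graph[v] if n in assignment}
        # Raises StopIteration when no color is left, i.e. a spill would be needed.
        assignment[v] = next(c for c in colors if c not in used)
    return assignment


graph = interference(live)
coloring = greedy_color(graph, ["a", "b", "c", "d"], ["red", "green", "blue"])
print(coloring)
```

Only b and d do not interfere, so they can share a register; every other pair overlaps, which is why three colors are required.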

Register allocation using graph coloring: spill/reload code
Spill/reload code is needed when there are not enough colors (registers) to color the interference graph
Example: only two registers available!
Program:
a :=
c :=
store c
b :=
  := b
d :=
  := a
load c
  := c
  := d
[Figure: live ranges of a, b, c, d]

Register allocation for a monolithic RF
Scheme of the optimistic register allocator: Renumber, Build, Spill costs, Simplify, Select, Spill code
The Select phase selects a color (= machine register) for a variable that minimizes the heuristic:
h1 = fdep(col, var) + caller_callee(col, var)
where:
fdep(col, var): a measure for the introduction of false dependencies
caller_callee(col, var): cost for mapping var on a caller or callee saved register

Compiler basics: Code selection
CISC era
–Code size important
–Determine shortest sequence of code
 Many options may exist
–Pattern matching
 Example M68020: D1 := D1 + M[ M[10+A1] + 16*D2 + 20 ] -> ADD ([10,A1], D2*16, 20) D1
RISC era
–Performance important
–Only few possible code sequences
–New implementations of old architectures optimize the RISC part of the instruction set only; e.g. for i486 / Pentium / M68020

Mapping / Scheduling: placing operations in space and time
d = a * b;
e = a + d;
f = 2 * b + d;
r = f - e;
x = z + y;
[Figure: Data Dependence Graph (DDG) with inputs a, b, 2, z, y; two * nodes, three + nodes, one - node; values d, e, f, r, x]

How to map these operations?
Architecture constraints:
–One function unit
–All operations single cycle latency
[Figure: the DDG next to a schedule placing *, *, +, +, -, + in cycles 1 to 6]

How to map these operations?
Architecture constraints:
–One Add-sub and one Mul unit
–All operations single cycle latency
[Figure: the DDG next to a schedule table with columns Mul and Add-sub and cycle rows 1 to 6]

There are many mapping solutions
[Figure: Pareto curve (solution space); axes: cost and execution time T]

Basic Block Scheduling
Make a dependence graph
Determine minimal length
Determine ASAP, ALAP, and slack of each operation
Place each operation in the first cycle with sufficient resources
Note:
–Scheduling order sequential
–Priority determined by the heuristic used; e.g. slack
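ASAP, ALAP, and slack can be computed with one forward and one backward pass over the dependence graph. The sketch below applies this to the DDG of the earlier mapping example (d = a * b; e = a + d; f = 2 * b + d; r = f - e; x = z + y), assuming single-cycle latency and cycles numbered from 0; the operation names are made up for illustration.

```python
# Operations in topological order, and dependence edges (producer, consumer).
ops = ["mul_d", "mul_2b", "add_e", "add_f", "sub_r", "add_x"]
edges = [
    ("mul_d", "add_e"), ("mul_d", "add_f"), ("mul_2b", "add_f"),
    ("add_e", "sub_r"), ("add_f", "sub_r"),
]


def asap_alap(ops, edges, latency=1):
    """Return (asap, alap, slack) dictionaries; ops must be in topological order."""
    preds = {v: [u for u, w in edges if w == v] for v in ops}
    succs = {v: [w for u, w in edges if u == v] for v in ops}
    asap = {}
    for v in ops:  # forward pass: as soon as all inputs are available
        asap[v] = max((asap[u] + latency for u in preds[v]), default=0)
    last = max(asap.values())  # minimal schedule length is last + 1 cycles
    alap = {}
    for v in reversed(ops):  # backward pass: as late as the consumers allow
        alap[v] = min((alap[w] - latency for w in succs[v]), default=last)
    slack = {v: alap[v] - asap[v] for v in ops}
    return asap, alap, slack


asap, alap, slack = asap_alap(ops, edges)
print(slack)
```

Operations with zero slack lie on the critical path; only the independent addition x = z + y can be moved without lengthening the schedule.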

Basic Block Scheduling
[Figure: dependence graph with LD, MUL, ADD, SUB, and NEG nodes, inputs A, B, C, and results X, y, z, annotated with ASAP cycle, ALAP cycle, and slack]

Cycle based list scheduling
proc Schedule (DDG = (V,E))
beginproc
  ready = { v | ¬∃ (u,v) ∈ E }
  ready' = ready
  sched = ∅
  current_cycle = 0
  while sched ≠ V do
    for each v ∈ ready' do
      if ¬ResourceConfl(v, current_cycle, sched) then
        cycle(v) = current_cycle
        sched = sched ∪ {v}
      endif
    endfor
    current_cycle = current_cycle + 1
    ready = { v | v ∉ sched ∧ ∀ (u,v) ∈ E, u ∈ sched }
    ready' = { v | v ∈ ready ∧ ∀ (u,v) ∈ E, cycle(u) + delay(u,v) ≤ current_cycle }
  endwhile
endproc
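A compact Python rendering of the cycle-based list-scheduling procedure above, applied to the DDG of the earlier mapping example. Several things here are assumptions made for illustration: the resource model (a count of units per unit type), a uniform delay on every edge, priority given by list order rather than by slack, and all names.

```python
ops = ["mul_d", "mul_2b", "add_e", "add_f", "sub_r", "add_x"]  # priority order
edges = [
    ("mul_d", "add_e"), ("mul_d", "add_f"), ("mul_2b", "add_f"),
    ("add_e", "sub_r"), ("add_f", "sub_r"),
]


def list_schedule(ops, edges, unit_of, units, delay=1):
    """Assign a cycle to every operation; assumes an acyclic graph and >= 1 unit per type."""
    preds = {v: [u for u, w in edges if w == v] for v in ops}
    cycle = {}
    current = 0
    while len(cycle) < len(ops):
        busy = {}  # units of each type already claimed in the current cycle
        for v in ops:
            if v in cycle:
                continue
            # Ready: every predecessor is scheduled and its result is available.
            if not all(u in cycle and cycle[u] + delay <= current for u in preds[v]):
                continue
            kind = unit_of[v]
            if busy.get(kind, 0) < units[kind]:  # no resource conflict
                cycle[v] = current
                busy[kind] = busy.get(kind, 0) + 1
        current += 1
    return cycle


# One Mul unit and one Add-sub unit.
two_units = list_schedule(
    ops, edges,
    unit_of={"mul_d": "mul", "mul_2b": "mul", "add_e": "addsub",
             "add_f": "addsub", "sub_r": "addsub", "add_x": "addsub"},
    units={"mul": 1, "addsub": 1},
)
# A single function unit that executes every operation.
one_unit = list_schedule(ops, edges, unit_of={v: "fu" for v in ops}, units={"fu": 1})
print(two_units)
print(one_unit)
```

With a single function unit the six operations need six cycles; with separate Mul and Add-sub units this priority order finishes in four.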
