1
Platform-based Design (TU/e 5kk70)
Henk Corporaal, Bart Mesman
ILP compilation (part b)
2
7/15/2015 Processor Architectures and Program Mapping H. Corporaal, J. van Meerbergen, and B. Mesman
Overview
Enhance performance: architecture methods
Instruction Level Parallelism
VLIW
Examples
–C6
–TM
–TTA
Clustering
Code generation / scheduling
Design Space Exploration: TTA framework
3
Scheduling: Overview
Transforming a sequential program into a parallel program:
read sequential program
read machine description file
for each procedure do
    perform function inlining
for each procedure do
    transform an irreducible CFG into a reducible CFG
    perform control flow analysis
    perform loop unrolling
    perform data flow analysis
    perform memory reference disambiguation
    perform register allocation
    for each scheduling scope do
        perform instruction scheduling
write parallel program
4
Extended basic block scheduling: Code Motion
Block A: a) add r3, r4, 4   b) beq ...
Block B: c) add r1, r1, r2
Block C: d) sub r3, r3, r2
Block D: e) mul r1, r1, r3
Downward code motions? a → B, a → C, a → D, c → D, d → D
Upward code motions? c → A, d → A, e → B, e → C, e → A
5
Extended Scheduling scope
Code: A; If cond Then B Else C; D; If cond Then E Else F; G;
CFG (Control Flow Graph): blocks A, B, C, D, E, F, G
6
Scheduling scopes
Trace
Superblock
Decision tree
Hyperblock/region
7
Create and Enlarge Scheduling Scope
[Figure: the CFG with blocks A to G shown as a trace, and as a superblock in which tail duplication adds the copies D’, E’, G’]
8
Create and Enlarge Scheduling Scope
[Figure: the CFG with blocks A to G shown as a hyperblock/region, and as a decision tree in which tail duplication adds the copies D’, E’, F’, G’, G’’]
9
Comparing scheduling scopes
10
Code movement (upwards) within regions
[Figure: an add operation moved upwards from a source block to a destination block. Legend: code movement; copy needed; intermediate block, check for off-liveness]
11
Extended basic block scheduling: Code Motion
A dominates B: A is always executed before B
–Consequently: if A does not dominate B, code motion from B to A requires code duplication
B post-dominates A: B is always executed after A
–Consequently: if B does not post-dominate A, code motion from B to A is speculative
[Figure: CFG with blocks A, B, C, D, E, F]
Q1: does C dominate E?
Q2: does C dominate D?
Q3: does F post-dominate D?
Q4: does D post-dominate B?
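The two definitions above can be checked mechanically. The following is a minimal sketch, not the course's tooling: it computes dominator and post-dominator sets with the usual iterative fixed-point method. Because the edges of the A to F graph on this slide are not given in the text, the example uses the CFG of the earlier "Extended Scheduling scope" slide (A; If cond Then B Else C; D; If cond Then E Else F; G). The function names `dominators` and `post_dominators` are illustrative.

```python
def dominators(succ, entry):
    """Return {node: set of nodes dominating it} for a CFG given as a
    successor map, using iterative data-flow analysis to a fixed point."""
    nodes = set(succ)
    pred = {n: set() for n in nodes}
    for u, targets in succ.items():
        for v in targets:
            pred[v].add(u)
    # Start from "everything dominates everything" and shrink.
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            pred_doms = [dom[p] for p in pred[n]]
            common = set.intersection(*pred_doms) if pred_doms else set()
            new = {n} | common
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def post_dominators(succ, exit_node):
    """Post-dominators are dominators of the reversed CFG, rooted at the exit."""
    reversed_succ = {n: [] for n in succ}
    for u, targets in succ.items():
        for v in targets:
            reversed_succ[v].append(u)
    return dominators(reversed_succ, exit_node)


# CFG assumed from the "Extended Scheduling scope" slide.
CFG = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E", "F"],
    "E": ["G"],
    "F": ["G"],
    "G": [],
}

dom = dominators(CFG, "A")
pdom = post_dominators(CFG, "G")

print("B dominates D:", "B" in dom["D"])
print("E post-dominates D:", "E" in pdom["D"])
```

In this assumed CFG, B does not dominate D, so moving an operation from D up into B also needs a copy in C; E does not post-dominate D, so moving an operation from E up into D is speculative.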
12
Scheduling: Loops
Loop Optimizations:
Loop peeling
Loop unrolling
[Figures: a CFG with blocks A, B, C, D in which the loop body C gains the copies C’ and C’’, once by peeling and once by unrolling]
13
Scheduling: Loops
Problems with unrolling:
–Exploits only parallelism within sets of n iterations
–Iteration start-up latency
–Code expansion
[Figure: resource utilization over time for basic block scheduling, basic block scheduling and unrolling, and software pipelining]
14
Software pipelining
Software pipelining a loop is:
–Scheduling the loop such that iterations start before preceding iterations have finished
Or:
–Moving operations across the backedge
Example: y = a.x (operations per iteration: LD, ML, ST)
–Basic schedule: 3 cycles/iteration
–Unrolling: 5/3 cycles/iteration
–Software pipelining: 1 cycle/iteration
[Figure: the LD, ML, ST operations of successive iterations, overlapped for the unrolled and the software-pipelined schedule]
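The three figures quoted for y = a.x can be reproduced with a small calculation. This is a sketch under assumptions the slide text does not state explicitly: LD, ML and ST each take one cycle and form a dependence chain, the loop is unrolled three times (inferred from the 5/3 figure), and there are enough functional units to start one iteration per cycle. The function names are illustrative.

```python
from fractions import Fraction


def cycles_per_iteration_unrolled(chain_length, unroll_factor):
    """Average cycles per iteration when `unroll_factor` iterations are
    overlapped inside one unrolled body: the last iteration starts
    `unroll_factor - 1` cycles after the first and needs `chain_length`
    cycles itself."""
    total_cycles = chain_length + unroll_factor - 1
    return Fraction(total_cycles, unroll_factor)


def cycles_per_iteration_pipelined(initiation_interval):
    """In the steady state of a software pipeline a new iteration starts
    every `initiation_interval` cycles."""
    return Fraction(initiation_interval)


print(cycles_per_iteration_unrolled(3, 1))   # no unrolling
print(cycles_per_iteration_unrolled(3, 3))   # unrolled three times
print(cycles_per_iteration_pipelined(1))     # software pipelined, II = 1
```

Unrolling further only approaches 1 cycle/iteration asymptotically, and each unrolled body still pays the start-up latency, which is the motivation for software pipelining.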
15
Software pipelining (cont’d)
Basic techniques:
Modulo scheduling (Rau, Lam)
–list scheduling with modulo resource constraints
Kernel recognition techniques
–unroll the loop
–schedule the iterations
–identify a repeating pattern
–Examples: Perfect pipelining (Aiken and Nicolau), URPR (Su, Ding and Xia), Petri net pipelining (Allan)
Enhanced pipeline scheduling (Ebcioğlu)
–fill first cycle of iteration
–copy this instruction over the backedge
16
Software pipelining: Modulo scheduling
Example: Modulo scheduling a loop

(a) Example loop
for (i = 0; i < n; i++)
    a[i+6] = 3* a[i] - 1;

(b) Code without loop control
ld  r1,(r2)
mul r3,r1,3
sub r4,r3,1
st  r4,(r5)

(c) Software pipeline (one column per iteration, one row per cycle)
Prologue   ld  r1,(r2)
           mul r3,r1,3   ld  r1,(r2)
           sub r4,r3,1   mul r3,r1,3   ld  r1,(r2)
Kernel     st  r4,(r5)   sub r4,r3,1   mul r3,r1,3   ld  r1,(r2)
Epilogue                 st  r4,(r5)   sub r4,r3,1   mul r3,r1,3
                                       st  r4,(r5)   sub r4,r3,1
                                                     st  r4,(r5)

Prologue fills the SW pipeline with iterations
Epilogue drains the SW pipeline
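The prologue, kernel and epilogue of this example can be generated mechanically. The sketch below assumes that every operation takes one cycle, that the initiation interval is 1, and that stage s of an iteration runs s cycles after the iteration starts. It ignores register lifetimes and resource limits, which a real modulo scheduler has to handle. `software_pipeline` and `phase_of` are hypothetical helper names.

```python
def software_pipeline(ops, iterations, ii=1):
    """Return a list of cycles; each cycle is a list of (operation, iteration)
    pairs. Iteration k starts at cycle k * ii, and its operations follow in
    consecutive cycles."""
    total_cycles = (iterations - 1) * ii + len(ops)
    schedule = [[] for _ in range(total_cycles)]
    for iteration in range(iterations):
        for stage, op in enumerate(ops):
            schedule[iteration * ii + stage].append((op, iteration))
    return schedule


def phase_of(cycle, row, ops):
    """Classify a cycle for the II = 1 case: the kernel is any cycle in which
    every pipeline stage is busy."""
    if len(row) == len(ops):
        return "kernel"
    return "prologue" if cycle < len(ops) - 1 else "epilogue"


OPS = ["ld r1,(r2)", "mul r3,r1,3", "sub r4,r3,1", "st r4,(r5)"]
schedule = software_pipeline(OPS, iterations=4)

for cycle, row in enumerate(schedule):
    names = ", ".join(f"{op} [it {it}]" for op, it in row)
    print(f"cycle {cycle} ({phase_of(cycle, row, OPS)}): {names}")
```

With four iterations the schedule has seven cycles: three prologue cycles, one kernel cycle and three epilogue cycles. With more iterations only the number of kernel cycles grows, which is why the kernel can be emitted once as the new loop body.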
17
Software pipelining: determine II, the Initiation Interval
For (i=0;.....) A[i+6]= 3*A[i]-1
Cyclic data dependences; edges are labeled (delay, distance):
ld r1, (r2) and mul r3, r1, 3: (0,1) and (1,0)
mul r3, r1, 3 and sub r4, r3, 1: (0,1) and (1,0)
sub r4, r3, 1 and st r4, (r5): (0,1) and (1,0)
st r4, (r5) and ld r1, (r2): (1,6)
cycle(v) ≥ cycle(u) + delay(u,v) - II·distance(u,v)
18
Modulo scheduling constraints
MII, the minimum initiation interval, is bounded by cyclic dependences and resources:
MII = max{ ResMII, RecMII }
Resources: Cycles: Therefore: Or:
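As a hedged illustration of MII = max{ ResMII, RecMII }, the sketch below uses the textbook bounds rather than the formulas from this slide, which did not survive in the transcript. ResMII is taken as the largest ratio of operations per unit type to available units, rounded up. RecMII is the smallest II for which no dependence cycle has a positive total of delay - II * distance. The edge directions and the machine with one memory unit, one multiplier and one ALU are assumptions made for the example.

```python
from math import ceil


def res_mii(op_unit, unit_count):
    """Resource bound: each unit type must fit its operations into II cycles."""
    usage = {}
    for unit in op_unit.values():
        usage[unit] = usage.get(unit, 0) + 1
    return max(ceil(used / unit_count[unit]) for unit, used in usage.items())


def rec_mii(nodes, edges, max_ii=64):
    """Recurrence bound: smallest II such that the dependence graph with edge
    weights delay - II * distance has no positive cycle. Positive cycles are
    detected with a Floyd-Warshall longest-path pass."""
    neg_inf = float("-inf")
    for ii in range(1, max_ii + 1):
        longest = {u: {v: neg_inf for v in nodes} for u in nodes}
        for u, v, delay, distance in edges:
            longest[u][v] = max(longest[u][v], delay - ii * distance)
        for k in nodes:
            for i in nodes:
                for j in nodes:
                    via_k = longest[i][k] + longest[k][j]
                    if via_k > longest[i][j]:
                        longest[i][j] = via_k
        if all(longest[v][v] <= 0 for v in nodes):
            return ii
    raise ValueError("no feasible II up to max_ii")


# Dependence graph of a[i+6] = 3*a[i] - 1; each edge is (from, to, delay, distance).
# Edge directions are an assumption based on the (delay, distance) labels.
NODES = ["ld", "mul", "sub", "st"]
EDGES = [
    ("ld", "mul", 1, 0), ("mul", "ld", 0, 1),
    ("mul", "sub", 1, 0), ("sub", "mul", 0, 1),
    ("sub", "st", 1, 0), ("st", "sub", 0, 1),
    ("st", "ld", 1, 6),
]

# Hypothetical machine: ld and st share a single memory unit.
OP_UNIT = {"ld": "mem", "mul": "mul", "sub": "alu", "st": "mem"}
UNIT_COUNT = {"mem": 1, "mul": 1, "alu": 1}

rec = rec_mii(NODES, EDGES)
res = res_mii(OP_UNIT, UNIT_COUNT)
mii = max(res, rec)
print(f"RecMII = {rec}, ResMII = {res}, MII = {mii}")
```

On this hypothetical machine the single memory unit is the bottleneck. With a second memory unit the resource bound would drop to 1 and the loop could start a new iteration every cycle.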
19
The Role of the Compiler
9 steps required to translate an HLL program (see online book chapter):
1. Front-end compilation
2. Determine dependencies
3. Graph partitioning: make multiple threads (or tasks)
4. Bind partitions to compute nodes
5. Bind operands to locations
6. Bind operations to time slots: Scheduling
7. Bind operations to functional units
8. Bind transports to buses
9. Execute operations and perform transports
20
Division of responsibilities between hardware and compiler
[Figure: the steps from Application to Execute (Frontend, Determine Dependencies, Binding of Operands, Scheduling, Binding of Operations, Binding of Transports, Execute), with the boundary between "Responsibility of compiler" and "Responsibility of hardware" marked for each architecture class: Superscalar, Dataflow, Multi-threaded, Indep. Arch, VLIW, TTA]
21
Overview
Enhance performance: architecture methods
Instruction Level Parallelism
VLIW
Examples
–C6
–TM
–TTA
Clustering
Code generation
Design Space Exploration: TTA framework
22
Mapping applications to processors
MOVE framework
[Figure: the Move framework for a TTA based system. Architecture parameters feed a parametric compiler and a hardware generator, which produce parallel object code and a chip; an optimizer closes the feedback loop, with user interaction. The results form a Pareto curve (solution space) of cost against execution time]
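The Pareto curve in this figure contains the design points that no other point beats on both cost and execution time. A minimal sketch of that selection follows. The (cost, execution time) pairs are made-up numbers for illustration, not results from the MOVE framework, and `pareto_front` is an illustrative name.

```python
def pareto_front(points):
    """Return the (cost, exec_time) points that are not dominated: no other
    point is at least as cheap and strictly faster. Sorting by cost lets a
    single pass keep only points that improve on the best time so far."""
    front = []
    best_time = float("inf")
    for cost, exec_time in sorted(points):
        if exec_time < best_time:
            front.append((cost, exec_time))
            best_time = exec_time
    return front


# Hypothetical exploration results: (cost, execution time).
CANDIDATES = [(10, 90), (12, 70), (15, 72), (20, 40), (25, 45), (30, 38)]

front = pareto_front(CANDIDATES)
print(front)
```

Here (15, 72) and (25, 45) drop out because a cheaper configuration is also faster. The designer then picks a point on the remaining curve according to the cost budget or the timing requirement.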
23
TTA (MOVE) organization
[Figure: functional units attached through sockets to the transport buses, with Data Memory and Instruction Memory]
24
Code generation trajectory for TTAs
[Figure: Application (C) → Compiler frontend → Sequential code → Compiler backend → Parallel code. Sequential simulation and parallel simulation use Input/Output and produce profiling data; the backend also reads the architecture description]
Frontend: GCC or SUIF (adapted)
25
Exploration: TTA resource reduction
26
Exploration: TTA connectivity reduction
[Figure: execution time plotted against the number of connections removed, starting at 0. Annotations: reducing bus delay; FU stage constrains cycle time; critical connections disappear]
27
Can we do better?
How?
–Transformations
–SFUs: Special Function Units
–Multiple Processors
[Figure: cost against execution time]
28
Transforming the specification
Based on associativity of the + operation:
a + (b + c) = (a + b) + c
[Figure: a chain of + operations rewritten as a balanced tree of + operations]
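Associativity matters for scheduling because it shortens the critical path. A sequential chain of additions over n operands is n - 1 operations deep, while a balanced tree over the same operands is only about log2(n) deep, so its additions can be issued in parallel. The sketch below assumes four operands, which is a guess at the figure, and uses integers; floating-point addition is not exactly associative, so a compiler may only reorder it when that is permitted. `chain_depth` and `balanced_sum` are illustrative names.

```python
def chain_depth(n_operands):
    """Depth of ((a + b) + c) + ...: every addition waits for the previous one."""
    return n_operands - 1


def balanced_sum(values):
    """Add a non-empty sequence pairwise, level by level.
    Returns (sum, number of levels); the level count is the critical-path
    length in additions."""
    level = list(values)
    depth = 0
    while len(level) > 1:
        # All additions within one level are independent of each other.
        level = [sum(level[i:i + 2]) for i in range(0, len(level), 2)]
        depth += 1
    return level[0], depth


total, depth = balanced_sum([1, 2, 3, 4])
print(f"sum = {total}, chain depth = {chain_depth(4)}, balanced depth = {depth}")
```

Both forms execute three additions for four operands; only the dependence depth changes, which is what a VLIW or TTA scheduler can exploit.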
29
Transforming the specification
Original:
d = a * b;
e = a + d;
f = 2 * b + d;
r = f - e;
x = z + y;
Transformed:
r = 2*b - a;
x = z + y;
[Figure: data flow graph of the transformed code: r = (b << 1) - a and x = z + y]
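The simplification follows from r = f - e = (2*b + d) - (a + d) = 2*b - a, so d cancels and the multiplication disappears. A quick way to gain confidence in such a rewrite is to compare both versions on random inputs, as in this sketch. It uses Python's unbounded integers and writes 2*b as a shift, as in the data flow graph; the function names are illustrative.

```python
import random


def original(a, b, y, z):
    """The specification as written on the slide."""
    d = a * b
    e = a + d
    f = 2 * b + d
    r = f - e
    x = z + y
    return r, x


def transformed(a, b, y, z):
    """The simplified specification: d cancels out of r, and 2*b becomes a shift."""
    r = (b << 1) - a
    x = z + y
    return r, x


random.seed(0)  # fixed seed so the check is reproducible
for _ in range(1000):
    args = [random.randint(-1000, 1000) for _ in range(4)]
    assert original(*args) == transformed(*args)

print(original(3, 5, 1, 2), transformed(3, 5, 1, 2))
```

Random testing does not prove equivalence, but here the algebra above does; the test mainly guards against transcription errors.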
30
Changing the architecture
Adding SFUs: special function units
[Figure: a tree of + operations replaced by a single 4-input adder]
Why is this faster?
31
Changing the architecture
Adding SFUs: special function units
In the extreme case, put everything into one unit!
Spatial mapping: no control flow
However: no flexibility / programmability!
32
SFUs: fine grain patterns
Why use fine-grain SFUs:
–Code size reduction
–Register file #ports reduction
–Could be cheaper and/or faster
–Transport reduction
–Power reduction (avoid charging non-local wires)
–Supports whole application domain!
Which patterns need support?
Detection of recurring operation patterns is needed
33
SFUs: covering results
34
Exploration: resulting architecture
Architecture for image processing:
–9 buses
–4 RFs
–4 Addercmp FUs
–2 Multiplier FUs
–2 Diffadd FUs
–stream input, stream output
Note the reduced connectivity
35
Conclusions
Billions of embedded processing systems
–how to design these systems quickly, cheaply, correctly, with low power, ...?
–what will their processing platform look like?
VLIWs are very powerful and flexible
–can be easily tuned to the application domain
TTAs are even more flexible, scalable, and lower power
36
Conclusions
Compilation for ILP architectures is mature and is entering the commercial area.
However:
–Great discrepancy between available and exploitable parallelism
Advanced code scheduling techniques are needed to exploit ILP
37
Bottom line:
38
Hands-on (not this year)
Map JPEG to a TTA processor
–see web page: http://www.ics.ele.tue.nl/~heco/courses/pam
Install TTA tools (compiler and simulator)
Go through all listed steps
Perform DSE: design space exploration
Add SFU
1 or 2 page report in 2 weeks
39
Hands-on
Let’s look at DSE: Design Space Exploration
We will use the Imagine processor
http://cva.stanford.edu/projects/imagine/