Platform-based Design TU/e 5kk70 Henk Corporaal Bart Mesman ILP compilation (part b)

Presentation transcript:

Platform-based Design TU/e 5kk70 Henk Corporaal Bart Mesman ILP compilation (part b)

Overview
Enhance performance: architecture methods
Instruction Level Parallelism
VLIW
Examples
–C6
–TM
–TTA
Clustering
Code generation / scheduling
Design Space Exploration: TTA framework

Scheduling: Overview
Transforming a sequential program into a parallel program:
read sequential program
read machine description file
for each procedure do
  perform function inlining
for each procedure do
  transform an irreducible CFG into a reducible CFG
  perform control flow analysis
  perform loop unrolling
  perform data flow analysis
  perform memory reference disambiguation
  perform register allocation
  for each scheduling scope do
    perform instruction scheduling
write parallel program

Extended basic block scheduling: Code Motion
CFG with four basic blocks:
A: a) add r3, r4, 4   b) beq ...
B: c) add r1, r1, r2
C: d) sub r3, r3, r2
D: e) mul r1, r1, r3
Downward code motions? a → B, a → C, a → D, c → D, d → D
Upward code motions? c → A, d → A, e → B, e → C, e → A
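As a hedged illustration (not on the slide), a hypothetical C source with this diamond-shaped CFG could look as follows; the branch condition is not given, so "cond" is assumed. Moving c) or d) upward into A is speculative, because neither B nor C post-dominates A.

/* Hypothetical source for the diamond CFG A -> {B, C} -> D.
   Operation labels a)-e) match the slide. */
int code_motion_example(int r1, int r2, int r4, int cond) {
    int r3 = r4 + 4;        /* a) add r3, r4, 4  -- block A */
    if (cond) {             /* b) beq ...        -- block A */
        r1 = r1 + r2;       /* c) add r1, r1, r2 -- block B */
    } else {
        r3 = r3 - r2;       /* d) sub r3, r3, r2 -- block C */
    }
    return r1 * r3;         /* e) mul r1, r1, r3 -- block D */
}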

Extended Scheduling scope
Code:
A;
If cond Then B Else C;
D;
If cond Then E Else F;
G;
CFG: Control Flow Graph over blocks A, B, C, D, E, F, G

Scheduling scopes
Trace
Superblock
Decision tree
Hyperblock/region

Create and Enlarge Scheduling Scope
Figure: a trace through blocks A, B, C, D, E, F, G, and the corresponding superblock obtained by tail duplication (duplicated blocks D', E', G') so that the scope has a single entry.

Create and Enlarge Scheduling Scope
Figure: the same CFG (blocks A–G) enlarged into a decision tree by tail duplication (duplicated blocks D', E', F', G', G''), and into a hyperblock/region that includes both paths of the conditional.

Comparing scheduling scopes

Code movement (upwards) within regions
Figure: an add operation is moved upward from its source block to a destination block. Legend: intermediate blocks on the path must be checked for off-liveness, and a copy of the moved operation is needed in blocks reached through other entries.

Extended basic block scheduling: Code Motion
A dominates B → A is always executed before B
–Consequently: A does not dominate B → code motion from B to A requires code duplication
B post-dominates A → B is always executed after A
–Consequently: B does not post-dominate A → code motion from B to A is speculative
Example CFG with blocks A, B, C, D, E, F:
Q1: does C dominate E?
Q2: does C dominate D?
Q3: does F post-dominate D?
Q4: does D post-dominate B?

Scheduling: Loops
Figure: loop optimizations on a loop with body C (surrounded by blocks A, B, D):
Loop peeling: a copy C' of the body is peeled off in front of the loop.
Loop unrolling: the body is replicated inside the loop (C, C', C'').
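A hedged, self-contained C illustration (not from the slides) of the two transformations on a simple loop:

/* Three variants of the same loop, shown one after another for comparison. */
void loop_opts(int *a, const int *b, int n) {
    int i;

    /* Original loop */
    for (i = 0; i < n; i++)
        a[i] += b[i];

    /* Loop peeling: the first iteration is peeled off (assumes n >= 1),
       e.g. to remove a dependence or to align the remaining iterations. */
    a[0] += b[0];
    for (i = 1; i < n; i++)
        a[i] += b[i];

    /* Loop unrolling by 2 (assumes n is even): exposes ILP inside the body
       at the cost of code expansion. */
    for (i = 0; i < n; i += 2) {
        a[i]     += b[i];
        a[i + 1] += b[i + 1];
    }
}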

Scheduling: Loops
Problems with unrolling:
Exploits only parallelism within sets of n iterations
Iteration start-up latency
Code expansion
Figure: resource utilization over time for basic block scheduling, basic block scheduling with unrolling, and software pipelining.

Software pipelining
Software pipelining a loop is:
–Scheduling the loop such that iterations start before preceding iterations have finished
Or:
–Moving operations across the backedge
Example: y = a*x, with operations LD, ML, ST per iteration:
–Sequential schedule: 3 cycles/iteration
–Unrolling (by 3): 5/3 cycles/iteration
–Software pipelining: 1 cycle/iteration

Software pipelining (cont'd)
Basic techniques:
Modulo scheduling (Rau, Lam)
–list scheduling with modulo resource constraints
Kernel recognition techniques
–unroll the loop
–schedule the iterations
–identify a repeating pattern
–Examples: Perfect pipelining (Aiken and Nicolau), URPR (Su, Ding and Xia), Petri net pipelining (Allan)
Enhanced pipeline scheduling (Ebcioğlu)
–fill first cycle of iteration
–copy this instruction over the backedge

Software pipelining: Modulo scheduling
Example: modulo scheduling a loop
(a) Example loop:
  for (i = 0; i < n; i++) a[i+6] = 3*a[i] - 1;
(b) Code without loop control:
  ld  r1,(r2)
  mul r3,r1,3
  sub r4,r3,1
  st  r4,(r5)
(c) Software pipeline: successive copies of this body are overlapped. The Prologue fills the SW pipeline with iterations, the Kernel is the repeating steady state, and the Epilogue drains the SW pipeline.
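A minimal C-level sketch of the resulting software pipeline for this loop, assuming n >= 3, single-cycle operations, and ignoring the actual register and FU assignment:

void swp_loop(int *a, int n) {
    int r1, r3, r4;
    /* prologue: start the first three iterations */
    r1 = a[0];                                 /* ld  of iteration 0 */
    r3 = 3 * r1;  r1 = a[1];                   /* mul 0, ld 1 */
    r4 = r3 - 1;  r3 = 3 * r1;  r1 = a[2];     /* sub 0, mul 1, ld 2 */
    /* kernel: one iteration completes per "cycle" */
    for (int i = 0; i < n - 3; i++) {
        a[i + 6] = r4;                         /* st  of iteration i   */
        r4 = r3 - 1;                           /* sub of iteration i+1 */
        r3 = 3 * r1;                           /* mul of iteration i+2 */
        r1 = a[i + 3];                         /* ld  of iteration i+3 */
    }
    /* epilogue: drain the last three iterations */
    a[n + 3] = r4;  r4 = r3 - 1;  r3 = 3 * r1;
    a[n + 4] = r4;  r4 = r3 - 1;
    a[n + 5] = r4;
}

The loop-carried dependence has distance 6 (iteration i reads a[i], written 6 iterations earlier), so overlapping four iterations in the kernel is safe.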

Software pipelining: determine II, the Initiation Interval
For the loop: for (i = 0; ...) A[i+6] = 3*A[i] - 1
Dependence graph over the body
  ld  r1,(r2)
  mul r3,r1,3
  sub r4,r3,1
  st  r4,(r5)
with edges labeled (delay, distance): each pair of successive operations is connected by edges (1,0) and (0,1), and there is a loop-carried edge from st back to ld labeled (1,6).
Cyclic data dependences impose: cycle(v) ≥ cycle(u) + delay(u,v) - II·distance(u,v)

Modulo scheduling constraints
MII, the minimum initiation interval, is bounded by cyclic dependences and resources:
MII = max{ ResMII, RecMII }
Resources: ResMII = max over all resource types r of ceil( #operations using r per iteration / #units of r )
Cycles: summing the dependence constraint around a cycle c gives 0 ≥ delay(c) - II·distance(c)
Therefore: II ≥ delay(c) / distance(c) for every cycle c
Or: RecMII = max over all cycles c of ceil( delay(c) / distance(c) )
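A small C sketch (data structures assumed, not part of the slides) of how the two lower bounds could be computed:

/* ResMII: for each resource class, ceil(#operations needing it / #units available). */
int res_mii(const int *ops_per_class, const int *units_per_class, int nclasses) {
    int mii = 1;
    for (int c = 0; c < nclasses; c++) {
        int r = (ops_per_class[c] + units_per_class[c] - 1) / units_per_class[c];
        if (r > mii) mii = r;
    }
    return mii;
}

/* RecMII: for each dependence cycle, ceil(total delay / total distance). */
int rec_mii(const int *cycle_delay, const int *cycle_distance, int ncycles) {
    int mii = 1;
    for (int c = 0; c < ncycles; c++) {
        int r = (cycle_delay[c] + cycle_distance[c] - 1) / cycle_distance[c];
        if (r > mii) mii = r;
    }
    return mii;
}

/* MII = max(ResMII, RecMII); modulo scheduling then tries II = MII, MII+1, ... */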

The Role of the Compiler
9 steps required to translate an HLL program (see online book chapter):
1. Front-end compilation
2. Determine dependencies
3. Graph partitioning: make multiple threads (or tasks)
4. Bind partitions to compute nodes
5. Bind operands to locations
6. Bind operations to time slots: scheduling
7. Bind operations to functional units
8. Bind transports to buses
9. Execute operations and perform transports

Division of responsibilities between hardware and compiler
Steps from application to execution: frontend, determine dependencies, binding of operands, scheduling, binding of operations, binding of transports, execute.
Figure: the compiler/hardware boundary shifts across architectures. For a superscalar the compiler only performs the frontend and the hardware determines dependencies, binds and schedules at run time; moving through dataflow, multi-threaded and independence architectures to VLIW and TTA, ever more of these steps become the responsibility of the compiler, until for a TTA the hardware only executes.

Overview
Enhance performance: architecture methods
Instruction Level Parallelism
VLIW
Examples
–C6
–TM
–TTA
Clustering
Code generation
Design Space Exploration: TTA framework

Mapping applications to processors: MOVE framework
Figure: in the MOVE framework, an optimizer proposes architecture parameters; a parametric compiler produces parallel object code and a hardware generator produces the chip (the TTA-based system); feedback and user interaction steer the search, and the explored solutions form a Pareto curve (cost vs. execution time) of the solution space.

TTA (MOVE) organization
Figure: TTA organization with function units and register files connected to the transport buses through sockets, plus separate data memory and instruction memory.

Code generation trajectory for TTAs
Figure: the application (C) is translated by the compiler frontend (GCC or SUIF, adapted) into sequential code, which can be simulated sequentially to obtain profiling data; the compiler backend, steered by the architecture description and the profiling data, produces parallel code for parallel simulation.

Exploration: TTA resource reduction

Exploration: TTA connectivity reduction
Figure: execution time versus number of connections removed. Removing connections reduces bus delay until the FU stage constrains the cycle time; when too many connections are removed, critical connections disappear and execution time increases.

Can we do better? How?
Transformations
SFUs: Special Function Units
Multiple Processors
Figure: cost vs. execution time trade-off.

Transforming the specification
Based on associativity of the + operation: a + (b + c) = (a + b) + c
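A hedged C illustration of why this matters for ILP: re-associating a sum reduces the height of the expression tree, so more additions can be issued in parallel.

/* Serial form: three dependent additions (tree height 3). */
int sum_serial(int a, int b, int c, int d)   { return ((a + b) + c) + d; }

/* Re-associated form: (a+b) and (c+d) are independent, tree height 2. */
int sum_balanced(int a, int b, int c, int d) { return (a + b) + (c + d); }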

Transforming the specification
Original:    d = a * b;  e = a + d;  f = 2 * b + d;  r = f - e;  x = z + y;
Transformed: r = 2*b - a;  x = z + y;
Figure: the resulting datapath computes r with a shift (<<1) and a subtract, and x with an add.
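The algebra behind the transformation, written out as a hypothetical C sketch: substituting d, e and f gives r = (2*b + d) - (a + d) = 2*b - a, so the multiplication and the temporaries drop out and r can be computed with a shift and a subtract.

int r_original(int a, int b) {
    int d = a * b;
    int e = a + d;
    int f = 2 * b + d;
    return f - e;            /* = 2*b - a */
}

int r_transformed(int a, int b) {
    return (b << 1) - a;     /* datapath in the figure: shift, then subtract */
}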

Changing the architecture
Adding SFUs: special function units
Figure: a multiple-input adder SFU. Why is this faster?

Changing the architecture
Adding SFUs: special function units
In the extreme case, put everything into one unit!
Spatial mapping - no control flow
However: no flexibility / programmability!!

SFUs: fine grain patterns
Why use fine grain SFUs:
–Code size reduction
–Register file #ports reduction
–Could be cheaper and/or faster
–Transport reduction
–Power reduction (avoid charging non-local wires)
–Supports the whole application domain!
Which patterns need support? Detection of recurring operation patterns is needed (see the sketch below).
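As a hedged example of such a recurring pattern (names assumed; the slides give no code), the difference-and-accumulate step common in image processing kernels could be covered by the "Diffadd" SFU that appears in the resulting architecture two slides further on.

/* Two dependent operations that a fused "diffadd" SFU could execute as one. */
int diffadd_step(int acc, int a, int b) {
    int diff = a > b ? a - b : b - a;   /* absolute difference */
    return acc + diff;                  /* accumulate */
}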

SFUs: covering results

Exploration: resulting architecture
Architecture for image processing: 9 buses, 4 RFs, 4 Addercmp FUs, 2 Multiplier FUs, 2 Diffadd FUs, stream input and stream output units.
Note the reduced connectivity.

Conclusions
Billions of embedded processing systems
–how to design these systems quickly, cheaply, correctly, at low power, ...?
–what will their processing platform look like?
VLIWs are very powerful and flexible
–can be easily tuned to the application domain
TTAs are even more flexible, scalable, and lower power

Conclusions
Compilation for ILP architectures is mature and has entered the commercial arena. However:
–there is a great discrepancy between available and exploitable parallelism
Advanced code scheduling techniques are needed to exploit ILP.

Bottom line:

Hands-on (not this year)
Map JPEG to a TTA processor
–see web page:
Install TTA tools (compiler and simulator)
Go through all listed steps
Perform DSE: design space exploration
Add SFU
1 or 2 page report in 2 weeks

Hands-on
Let's look at DSE: Design Space Exploration
We will use the Imagine processor