VLIW Compilation Techniques in a Superscalar Environment. Kemal Ebcioglu, Randy D. Groves, Ki-Chang Kim, Gabriel M. Silberman and Isaac Ziv. PLDI 1994.


Presented by Jason Horihan.

Why do we need a special compiler when we have "Super Beast" superscalar processors that extract ILP for us? Processor hardware can only look ahead a small distance to extract ILP, and branch prediction is not perfect and can only take us so far.

VLIW Scheduling Techniques: Speculative Load/Store Motion out of Loops, Unspeculation, Scheduling, Limited Combining, Basic Block Expansion, and Prolog Tailoring. All of these are implemented at the code generation stage of the compiler.

Speculative Load/Store Motion out of Loops. Loads and stores can be moved if:
1. Within each group of loads and stores:
- each instruction uses the same base register
- each instruction has the same displacement from this base
- each instruction operates on operands of identical data length and type

2. The base register of each group is not written to in the loop.
3. There is no overlap between the group operands and any other memory reference in the loop.
4. On every path to the entrance of the loop, there is a load of an address constant into the base register, or a load or store to the same location, to ensure "safe" operation.

Original Code:
    Ld  r4, a(r2)
    ....
L1: Ld  r12, a(r2)
    Ai  r12, r12, 6
    St  r12, a(r2)
    .....
    Br  L1

Transformed Code:
    Ld  r4, a(r2)
    ....
    Ld  r10, a(r2)
L1: Mv  r12, r10
    Ai  r12, r12, 6
    Mv  r10, r12
    .....
    Br  L1
    St  r10, a(r2)
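The transformation above can be sketched in Python on a toy instruction representation (tuples of opcode, destination, and an address pair); the tuple format and the single-group restriction are illustrative assumptions, not the paper's actual IR:

```python
# A minimal sketch of speculative load/store motion out of a loop.
# Instructions are tuples: ("Ld", dest, (disp, base)), ("St", src, (disp, base)),
# or ALU ops whose destination is element 1. This handles only the simplest
# case of one load/store pair to one location.

def hoist_load_sink_store(preheader, body, temp_reg):
    """If the loop body loads from and stores to one memory location whose
    base register is never written in the loop, replace the memory accesses
    with register moves: load once before the loop, store once after it."""
    loads = [i for i in body if i[0] == "Ld"]
    stores = [i for i in body if i[0] == "St"]
    if len(loads) != 1 or len(stores) != 1:
        return preheader, body, []           # only the simplest case here
    (_, dst, addr), (_, src, addr2) = loads[0], stores[0]
    if addr != addr2:
        return preheader, body, []           # must be the same location
    written = {i[1] for i in body if i[0] != "St"}
    if addr[1] in written:
        return preheader, body, []           # condition 2: base not written
    new_body = []
    for ins in body:
        if ins[0] == "Ld":
            new_body.append(("Mv", dst, temp_reg))   # Ld becomes Mv dst <- temp
        elif ins[0] == "St":
            new_body.append(("Mv", temp_reg, src))   # St becomes Mv temp <- src
        else:
            new_body.append(ins)
    new_pre = preheader + [("Ld", temp_reg, addr)]   # hoisted load
    epilog = [("St", temp_reg, addr)]                # sunk store
    return new_pre, new_body, epilog
```

Running this on the slide's loop body (load, add-immediate, store of `a(r2)`) reproduces the transformed code: the load moves into the preheader, the store sinks past the loop, and the memory operations inside the loop become register moves through `r10`.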

Unspeculation. Instructions moved above conditional branches to improve performance can lower performance when execution goes down the path where the speculative instructions were not needed. Moving some of these speculative instructions down into one of the paths can increase performance.

To perform unspeculation on an instruction (or a group of instructions), these conditions must be met:
1. The destination register(s) of the speculative group must ALL be dead on one of the paths.
2. Any instructions between the speculative instruction and the conditional branch must not define or use any of the registers used in the speculative instructions.
3. The instructions cannot have side effects.
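The three conditions can be checked mechanically. The sketch below is an illustrative test over a toy instruction format (dicts with `defs`/`uses` register sets and an optional `side_effects` flag), with liveness information assumed to be available; none of this is the paper's actual implementation:

```python
# A toy version of the unspeculation legality test: a speculative
# instruction hoisted above a branch may be sunk into one successor path
# if its destinations are dead on the other path, no intervening
# instruction touches its registers, and it has no side effects.

def can_unspeculate(spec_ins, between, live_on_other_path):
    dests = spec_ins["defs"]
    # condition 1: every destination dead on the path we are leaving
    if dests & live_on_other_path:
        return False
    # condition 2: intervening instructions must not define or use our registers
    regs = dests | spec_ins["uses"]
    for ins in between:
        if (ins["defs"] | ins["uses"]) & regs:
            return False
    # condition 3: no side effects (e.g. stores, calls)
    return not spec_ins.get("side_effects", False)
```

If the test passes, the speculative instruction can be deleted from above the branch and re-emitted only on the path that actually needs its result.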

Scheduling: Loop Unrolling, Renaming, Global Scheduling, and Software Pipelining.
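The first two items fit together: unrolling replicates the loop body, and renaming gives each copy fresh temporaries so the copies have no false dependences and can be scheduled in parallel. A minimal sketch, assuming a toy tuple format `(opcode, dest, uses...)` and renaming only compiler temporaries (registers named `t...`); real unrolling must also adjust induction variables and the loop branch:

```python
# Unroll a straight-line loop body `factor` times, renaming the
# temporaries each copy defines so later copies do not reuse (and thus
# falsely depend on) the previous copy's registers.

def unroll(body, factor, fresh):
    out = []
    for i in range(factor):
        ren = {}                                  # temp -> fresh name, per copy
        for (op, d, *uses) in body:
            uses = [ren.get(u, u) for u in uses]  # rewrite uses of renamed temps
            if d.startswith("t"):                 # rename compiler temporaries only
                ren[d] = fresh(d, i)
            out.append((op, ren.get(d, d), *uses))
    return out
```

For example, unrolling a load-then-accumulate body twice with `fresh = lambda r, i: f"{r}_{i}"` yields two independent loads into `t1_0` and `t1_1`, which a scheduler can then overlap.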

Limited Combining. Similar to value numbering, but spans multiple blocks.
1. Starts with a load immediate or a register move.
2. Searches the sequence of following instructions, following unconditional jumps, until a last use is found.
3. The source and destination registers of the starting instruction cannot be set in the sequence.

If the search succeeds, the entire sequence of instructions, from the instruction after the starting instruction through the last-use instruction, is inserted in place of the starting instruction. Occurrences of the destination register of the starting instruction are replaced with its source register. A branch from the "new" last-use instruction is inserted to jump to the instruction after the "old" last-use instruction.

Original Code:
    Mv  r5, r4
    ....
    Br  L3
    ....
L3: Ld  r3, 4(r5)
    ....
    Br  L4
    ....
L4: Ld  r7, 8(r5)

Transformed Code:
    ....
    Ld  r3, 4(r4)
    ....
    Ld  r7, 8(r4)
    Br  L10
L3: Ld  r3, 4(r5)
    ....
    Br  L4
    ....
L4: Ld  r7, 8(r5)
L10:
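The core of the substitution step can be sketched in a few lines. This assumes the instruction sequence has already been flattened across unconditional jumps, and uses an illustrative tuple format `(opcode, dest, uses...)` where uses are plain register names; both are simplifications of what the paper describes:

```python
# Limited combining, substitution step: given a starting register move,
# build the copy of the following sequence (through the last use of the
# move's destination) with the destination rewritten to the source.

def combine_move(start, seq):
    """start is ("Mv", dst, src); seq is the flattened sequence after it.
    Returns the rewritten copy to insert in place of the move, or None if
    dst or src is redefined, or dst is never used (nothing to combine)."""
    _, dst, src = start
    last = -1
    for k, (op, d, *uses) in enumerate(seq):
        if d in (dst, src):
            return None                 # rule 3: neither register may be set
        if dst in uses:
            last = k                    # track the last use of dst
    if last < 0:
        return None
    sub = lambda r: src if r == dst else r
    return [(op, d, *[sub(u) for u in uses]) for (op, d, *uses) in seq[:last + 1]]
```

On the slide's example, starting from `Mv r5, r4`, the two loads through `r5` come back rewritten to use `r4` directly; the caller would then append the branch to the point after the old last use (the `L10` label above).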

Basic Block Expansion. The main goal is to eliminate unconditional jumps at the end of some basic blocks. Begin by copying instructions at the target of the unconditional branch and inserting them before the unconditional branch. When enough consecutive non-branch instructions have been gathered, the copying stops.

Original Code:
    ....
    Bz  r1, L1
    Op
    Br  L2
    ....
L2: Bz  r3, Lx
    Op1
    Op2
    Br  L2
L3:

Transformed Code:
    ....
    Bz  r1, L1
    Op
    Bz  r3, Lx
    Op1
    Op2
    Br  L2
    ....
L2:  Bz  r3, Lx
     Op1
     Op2
L2a: Br  L2
L3:
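A simplified version of this expansion, for the easy case only: when the entire target block is short enough to copy, the trailing unconditional jump disappears altogether. The slide's scheme is more general (it can stop partway through the target and branch to a fresh label such as `L2a`); the size limit and tuple format here are illustrative assumptions:

```python
# Basic block expansion, simplest case: replace a trailing unconditional
# jump by an inline copy of its (short) target block, so the block now
# ends where the target block ended.

def expand_block(body, target_body, limit=8):
    """body is a list of instruction tuples ending in ("Br", label);
    target_body is the instruction list of that label's block. Returns
    the expanded body, or the original body if expansion does not apply."""
    *head, jump = body
    if jump[0] != "Br" or len(target_body) > limit:
        return body                      # not an unconditional jump, or too big
    return head + list(target_body)      # copied block's own branch now ends us
```

The payoff on a VLIW/superscalar target is that the copied instructions can now be scheduled together with the instructions that preceded the jump, instead of sitting behind an unconditional control transfer.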

Prolog Tailoring. When entering and exiting a procedure, registers must be saved and restored in the prolog and epilog. Prolog tailoring delays the saving of the registers until absolutely necessary. This shortens the execution path and saves only what is necessary for a given path. Exception handlers must be changed accordingly.

Prolog Tailoring algorithm:
1. Generate a "MustKill" set for each node in the program graph.
2. If, at a given node, a register that hasn't been saved before will definitely be killed, code must be generated to save this register.

Original:
Proc p1:
    save r1, r2, r3, r4
    ....
    Ld  r2, ...
    Ld  r1, ...
    ....
    restore r1, r2
    return
L1: ld  r3, ..
    ....
    ld  r4, ..
    ld  r3, ...
    ....
    restore r3, r4
    return

Transformed:
Proc p1:
    ....
    save r1, r2
    Ld  r2, ...
    Ld  r1, ...
    ....
    restore r1, r2
    return
L1: save r3
    ld  r3, ..
    ....
    save r4
    ld  r4, ..
    ld  r3, ...
    restore r4
    ....
    restore r3
    return
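The save-placement walk in step 2 can be sketched as a traversal that carries the set of already-saved registers down each path and emits a save the first time a node's MustKill set demands one. The dict-based CFG representation (`name`, `kills`, `succs`) is an illustrative assumption, and this handles only acyclic flow:

```python
# Prolog tailoring, save placement: walk the control-flow graph from the
# procedure entry, carrying the set of registers already saved on this
# path; whenever a node's MustKill set contains unsaved registers, record
# a save of exactly those registers at that node.

def place_saves(entry):
    out = {}
    def walk(node, saved):
        need = node["kills"] - saved          # killed here but not yet saved
        if need:
            out[node["name"]] = need          # emit "save <need>" at this node
        for succ in node["succs"]:
            walk(succ, saved | need)          # successors inherit the saves
    walk(entry, frozenset())
    return out
```

On a CFG shaped like the slide's example (one path killing r1/r2, another killing r3 and then r4), the saves land exactly where the transformed code puts them: `r1, r2` on the first path, `r3` then `r4` on the second, and nothing in the shared prolog.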

Results. SPECint92 measurements (yeah!) comparing the Xlc compiler against the VLIW compiler, with columns: Benchmark, Xlc Time, Xlc Specmark, VLIW Time, VLIW Specmark, for the benchmarks Espresso, Li, Eqntott, Compress, Sc, Gcc, and overall SPECint. Measurements were done on an RS/6000 model 980. (The numeric table values did not survive in this transcript.)

Questions?