CSE 431 Computer Architecture, Fall 2005
Lecture 17: VLIW Processors
Mary Jane Irwin (www.cse.psu.edu/~mji)
[Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005]

Review: Multi-Issue Datapath Responsibilities

Must handle, with a combination of hardware and software:
- Data dependencies (aka data hazards)
  - True data dependencies (read after write)
    - Use data forwarding hardware
    - Use compiler scheduling
  - Storage dependencies (aka name dependencies); register renaming solves both kinds:
    - Antidependencies (write after read)
    - Output dependencies (write after write)
- Procedural dependencies (aka control hazards)
  - Use aggressive branch prediction (speculation)
  - Use predication
- Resource conflicts (aka structural hazards)
  - Use resource duplication or resource pipelining to reduce (or eliminate) resource conflicts
  - Use arbitration for the result and commit buses and for register file read and write ports

Review: Multiple-Issue Processor Styles

Dynamic multiple-issue processors (aka superscalar)
- Decisions on which instructions to execute simultaneously (in the range of 2 to 8 in 2005) are made dynamically, at run time, by the hardware
- E.g., IBM Power 2, Pentium 4, MIPS R10K, HP PA 8500

Static multiple-issue processors (aka VLIW)
- Decisions on which instructions to execute simultaneously are made statically, at compile time, by the compiler
- E.g., Intel Itanium and Itanium 2 for the IA-64 ISA, whose style is called EPIC (Explicitly Parallel Instruction Computing)
  - 128-bit "bundles" containing three 41-bit instructions plus a 5-bit template field (which specifies the FU each instruction needs)
  - Five functional unit types (IntALU, MMedia, DMem, FPALU, Branch)
  - Extensive support for speculation and predication

History of VLIW Processors

- Started with (horizontal) microprogramming
  - Very wide microinstructions were used to directly generate control signals in single-issue processors (e.g., the IBM 360 series)
- VLIW for multi-issue processors first appeared in the Multiflow and Cydrome machines of the 1980s
- Commercial VLIW processors (as of 2005)
  - Intel i860 RISC (dual mode: scalar and VLIW)
  - Intel IA-64 (EPIC: Itanium and Itanium 2)
  - Transmeta Crusoe
  - Lucent/Motorola StarCore
  - ADI TigerSHARC
  - Infineon (Siemens) Carmel

Static Multiple Issue Machines (VLIW)

Static multiple-issue processors (aka VLIW) use the compiler to decide which instructions to issue and execute simultaneously
- Issue packet: the set of instructions that are bundled together and issued in one clock cycle; think of it as one large instruction with multiple operations
- The mix of instructions in the packet (bundle) is usually restricted, making it a single "instruction" with several predefined fields
- The compiler does static branch prediction and code scheduling to reduce (control) or eliminate (data) hazards

VLIWs have
- Multiple functional units (like superscalar processors)
- Multi-ported register files (again like superscalar processors)
- A wide program bus

An Example: A VLIW MIPS

Consider a 2-issue MIPS with a 2-instruction, 64-bit bundle:

    | ALU Op (R format) or Branch (I format) | Load or Store (I format) |

- Instructions are always fetched, decoded, and issued in pairs
  - If one instruction of the pair cannot be used, it is replaced with a noop
- Needs 4 register file read ports and 2 write ports, plus a separate memory address adder
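To make the bundle format concrete, here is a small sketch (not part of the lecture, and not how any real assembler is organized) of packing the two 32-bit instruction words of this hypothetical 2-issue machine into one 64-bit bundle, with empty slots filled by noops:

```python
# Sketch of the 2-issue VLIW bundle: the ALU/branch word occupies the
# high 32 bits, the load/store word the low 32 bits; an unused slot is
# filled with a MIPS nop (encoded as all zeros).

NOP = 0x00000000  # MIPS nop: sll $0, $0, 0

def pack_bundle(alu_word=None, mem_word=None):
    """Return a 64-bit bundle; empty slots become nops."""
    hi = alu_word if alu_word is not None else NOP
    lo = mem_word if mem_word is not None else NOP
    return (hi << 32) | lo

def unpack_bundle(bundle):
    """Split a 64-bit bundle back into its two 32-bit slots."""
    return (bundle >> 32) & 0xFFFFFFFF, bundle & 0xFFFFFFFF

# A bundle with only a load in the memory slot: the ALU slot is a nop.
b = pack_bundle(mem_word=0x8E280000)   # lw $t0, 0($s1)
assert unpack_bundle(b) == (NOP, 0x8E280000)
```

The slot restriction on the slide (ALU/branch in one field, load/store in the other) is what lets the hardware route each slot straight to its functional unit without any issue logic.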

A MIPS VLIW (2-issue) Datapath

[Datapath figure: instruction memory, PC and +4 adder, multi-ported register file, ALU plus a separate memory address adder, data memory, and sign extenders for the two instruction slots]

- No hazard detection hardware (so no load-use pairs allowed; the compiler must schedule around them)

Code Scheduling Example

Consider the following loop code:

    lp: lw   $t0, 0($s1)      # $t0 = array element
        addu $t0, $t0, $s2    # add scalar in $s2
        sw   $t0, 0($s1)      # store result
        addi $s1, $s1, -4     # decrement pointer
        bne  $s1, $0, lp      # branch if $s1 != 0

Must "schedule" the instructions to avoid pipeline stalls:
- Instructions in one bundle must be independent
- Load-use instructions must be separated from their loads by one cycle
- Notice that the first two instructions have a load-use dependency, and the next two and the last two have data dependencies
- Assume branches are perfectly predicted by the hardware
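The assembly walks an array backward, from the address in $s1 down to 0, adding the scalar in $s2 to each word in place. A minimal Python model of what it computes (word-indexed rather than byte-addressed, so the step is 1 instead of 4):

```python
def add_scalar(a, s):
    """Model of the loop: add scalar s to each element of a, in place.
    The assembly steps byte addresses down by 4; here we step word
    indices down by 1."""
    i = len(a) - 1          # $s1 initially addresses the last element
    while i >= 0:           # bne $s1, $0, lp
        a[i] = a[i] + s     # lw / addu / sw
        i -= 1              # addi $s1, $s1, -4
    return a

assert add_scalar([1, 2, 3], 10) == [11, 12, 13]
```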

The Scheduled Code (Not Unrolled)

        ALU or branch        Data transfer       CC
    lp:                      lw   $t0, 0($s1)     1
        addi $s1, $s1, -4                         2
        addu $t0, $t0, $s2                        3
        bne  $s1, $0, lp     sw   $t0, 4($s1)     4

(The sw offset becomes 4 to compensate for the addi being scheduled before it.)

Four clock cycles to execute 5 instructions, for a
- CPI of 0.8 (versus the best case of 0.5)
- IPC of 1.25 (versus the best case of 2.0)
- Noops don't count towards performance!
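A quick check of the slide's arithmetic: only the 5 useful instructions count (the 3 empty bundle slots hold noops), so

```python
# 5 useful instructions executed in 4 clock cycles
instructions, cycles = 5, 4
cpi = cycles / instructions   # cycles per instruction
ipc = instructions / cycles   # instructions per cycle
assert cpi == 0.8             # best case would be 4 cycles / 8 slots = 0.5
assert ipc == 1.25            # best case would be 8 slots / 4 cycles = 2.0
```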

Loop Unrolling

- Loop unrolling: multiple copies of the loop body are made, and instructions from different iterations are scheduled together, as a way to increase ILP
- Apply loop unrolling (4 times for our example) and then schedule the resulting code
  - Eliminate unnecessary loop overhead instructions
  - Schedule so as to avoid load-use hazards
- During unrolling the compiler applies register renaming to eliminate all data dependencies that are not true data dependencies
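Continuing the Python model from before, an unroll-by-4 version looks like this (like the slide's example, it assumes the trip count is a multiple of 4):

```python
def add_scalar_unrolled(a, s):
    """Unroll-by-4 version of the loop model (requires len(a) % 4 == 0).
    The four updates per iteration are independent of each other, which
    is what lets a 2-issue scheduler overlap them; in the assembly,
    register renaming ($t0..$t3) is what makes them independent."""
    i = len(a)
    while i != 0:
        a[i - 1] += s   # iteration 1's element
        a[i - 2] += s   # iteration 2's element
        a[i - 3] += s   # iteration 3's element
        a[i - 4] += s   # iteration 4's element
        i -= 4          # one pointer update per 4 elements (one addi)
    return a

assert add_scalar_unrolled([1, 2, 3, 4], 10) == [11, 12, 13, 14]
```

Note the loop overhead (pointer update and branch) is now paid once per 4 elements instead of once per element.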

Unrolled Code Example

    lp: lw   $t0, 0($s1)      # $t0 = array element
        lw   $t1, -4($s1)     # $t1 = array element
        lw   $t2, -8($s1)     # $t2 = array element
        lw   $t3, -12($s1)    # $t3 = array element
        addu $t0, $t0, $s2    # add scalar in $s2
        addu $t1, $t1, $s2    # add scalar in $s2
        addu $t2, $t2, $s2    # add scalar in $s2
        addu $t3, $t3, $s2    # add scalar in $s2
        sw   $t0, 0($s1)      # store result
        sw   $t1, -4($s1)     # store result
        sw   $t2, -8($s1)     # store result
        sw   $t3, -12($s1)    # store result
        addi $s1, $s1, -16    # decrement pointer
        bne  $s1, $0, lp      # branch if $s1 != 0

The Scheduled Code (Unrolled)

        ALU or branch           Data transfer        CC
    lp: addi $s1, $s1, -16      lw   $t0, 0($s1)      1
                                lw   $t1, 12($s1)     2
        addu $t0, $t0, $s2      lw   $t2, 8($s1)      3
        addu $t1, $t1, $s2      lw   $t3, 4($s1)      4
        addu $t2, $t2, $s2      sw   $t0, 16($s1)     5
        addu $t3, $t3, $s2      sw   $t1, 12($s1)     6
                                sw   $t2, 8($s1)      7
        bne  $s1, $0, lp        sw   $t3, 4($s1)      8

(The load and store offsets are adjusted because the addi is scheduled in the first cycle.)

Eight clock cycles to execute 14 instructions, for a
- CPI of 0.57 (versus the best case of 0.5)
- IPC of 1.75 (versus the best case of 2.0)
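Comparing the two schedules on a per-element basis shows what unrolling bought us (14 / 8 is exactly 1.75, which the original slide rounds to 1.8):

```python
# Cycles per array element, before and after unrolling
not_unrolled = 4 / 1    # 4 cycles per iteration, 1 element each
unrolled     = 8 / 4    # 8 cycles per iteration, 4 elements each
assert not_unrolled == 4.0 and unrolled == 2.0

speedup = not_unrolled / unrolled
assert speedup == 2.0   # unrolling doubles throughput in this example

# The slide's IPC: 14 instructions in 8 cycles
assert 14 / 8 == 1.75
```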

Speculation

Speculation is used to allow execution of future instructions that (may) depend on the speculated instruction
- Speculate on the outcome of a conditional branch (branch prediction)
- Speculate that a store (for which we don't yet know the address) that precedes a load does not refer to the same address, allowing the load to be scheduled before the store (load speculation)

Must have (hardware and/or software) mechanisms for
- Checking to see if the guess was correct
- Recovering from the effects of the instructions that were executed speculatively if the guess was incorrect
  - In a VLIW processor the compiler can insert additional instructions that check the accuracy of the speculation and can provide a fix-up routine to use when the speculation was incorrect

Ignore and/or buffer exceptions created by speculatively executed instructions until it is clear that they should really occur
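The check-and-fix-up pattern for load speculation can be sketched as follows (a hypothetical illustration of the idea, not any real ISA's check instruction; memory is modeled as a dict from address to value):

```python
def speculated_sequence(mem, store_addr, store_val, load_addr):
    """Hoist a load above an earlier store, then check whether the two
    addresses collided and redo the load (the fix-up) if they did."""
    spec = mem[load_addr]          # speculative load, moved up early
    mem[store_addr] = store_val    # the store it was hoisted above
    if store_addr == load_addr:    # compiler-inserted check
        spec = mem[load_addr]      # fix-up routine: redo the load
    return spec

mem = {0: 1, 4: 2}
assert speculated_sequence(mem, 4, 99, 0) == 1   # no collision: guess was right
assert speculated_sequence(mem, 0, 7, 0) == 7    # collision: fix-up ran
```

The common case (no collision) pays only the cheap check; the fix-up path is taken only when the speculation was wrong.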

Predication

Predication can be used to eliminate branches by making the execution of an instruction dependent on a "predicate". For example,

    if (p) {statement 1} else {statement 2}

would normally compile using two branches. With predication it would compile as

    (p)  statement 1
    (~p) statement 2

- The use of (condition) indicates that the instruction is committed only if condition is true
- Predication can be used to speculate as well as to eliminate branches
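A branch-free model of the idea (a hypothetical example computation, not from the slides): both arms execute, and the predicate gates only which result commits, so no control flow is needed between them:

```python
def predicated_exec(p, x):
    """Both arms are computed unconditionally; the predicate gates
    commit, playing the role of (p)/(~p) instruction predicates."""
    r1 = x + 1               # (p)  statement 1: always computed
    r2 = x - 1               # (~p) statement 2: always computed
    return r1 if p else r2   # commit gated by the predicate (a select)

assert predicated_exec(True, 10) == 11
assert predicated_exec(False, 10) == 9
```

The final select is the software analogue of a conditional-move: the wasted arm costs an issue slot, but the unpredictable branch, and any misprediction penalty, is gone.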

Compiler Support for VLIW Processors

- The compiler packs groups of independent instructions into the bundle
  - Done by code re-ordering (trace scheduling)
- The compiler uses loop unrolling to expose more ILP
- The compiler uses register renaming to solve name dependencies and ensures no load-use hazards occur
- While superscalars use dynamic branch prediction, VLIWs primarily depend on the compiler for branch prediction
  - Loop unrolling reduces the number of conditional branches
  - Predication eliminates if-then-else branch structures by replacing them with predicated instructions
- The compiler predicts memory bank references to help minimize memory bank conflicts

VLIW Advantages & Disadvantages (vs. Superscalar)

Advantages
- Simpler hardware (potentially less power hungry)
- Potentially more scalable
  - Allow more instructions per VLIW bundle and add more FUs

Disadvantages
- Programmer/compiler complexity and longer compilation times
  - Deep pipelines and long latencies can be confusing (making peak performance elusive)
- Lock-step operation: on a hazard, all further issue stalls until the hazard is resolved (hence the need for predication)
- Object (binary) code incompatibility
- Needs lots of program memory bandwidth
- Code bloat
  - Noops waste program memory space
  - Loop unrolling to expose more ILP uses more program memory space

CISC vs RISC vs SS vs VLIW

                      CISC                      RISC                       Superscalar                      VLIW
    Instr size        variable                  fixed                      fixed                            fixed (but large)
    Instr format      variable                  fixed                      fixed                            fixed
    Registers         few, some special         many GP                    GP and rename (RUU)              many, many GP
    Memory reference  embedded in many instrs   load/store                 load/store                       load/store
    Key issues        decode complexity         data forwarding, hazards   hardware dependency resolution   (compiler) code scheduling

[Instruction flow diagrams: IF ID EX M WB pipelines, showing one instruction at a time for CISC, overlapped single-issue pipelining for RISC, and multiple overlapped pipelines per cycle for superscalar and VLIW]

Next Lecture and Reminders

- Next lecture
  - Reading assignment: PH 7.1
- Reminders
  - HW4 due November 3
  - Final exam (tentatively) scheduled: Tuesday, December 13th, 2:30-4:20, location TBD