POLITECNICO DI MILANO
Parallelism in wonderland: are you ready to see how deep the rabbit hole goes?
ILP: VLIW Architectures
Marco D. Santambrogio
Simone Campanoni

Outline
- Introduction to ILP (a short reminder)
- VLIW architectures

Beyond CPI = 1
The initial goal was to achieve CPI = 1. Can we improve beyond this? There are two approaches: superscalar and VLIW.
Superscalar: a varying number of instructions per cycle (1 to 8), scheduled by the compiler or by hardware (Tomasulo), e.g. IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000. The successful approach (to date) for general-purpose computing.

Beyond CPI = 1
(Very) Long Instruction Word, (V)LIW: a fixed number of instructions (4-16) scheduled by the compiler, which packs the operations into wide templates. It has so far found more success in DSP and multimedia applications.
Joint HP/Intel agreement in 1999/2000: Intel Architecture-64 (Merced/IA-64), 64-bit addresses, in the style of an "Explicitly Parallel Instruction Computer" (EPIC).
But first, a little context ...

Definition of ILP
ILP = potential overlap of execution among unrelated instructions. Overlapping is possible if there are:
- No structural hazards
- No RAW, WAR, or WAW stalls
- No control stalls

Instruction Level Parallelism
Two strategies to support ILP:
- Dynamic scheduling: depend on the hardware to locate the parallelism
- Static scheduling: rely on software to identify potential parallelism
Hardware-intensive approaches dominate the desktop and server markets.

Dynamic Scheduling
The hardware reorders instruction execution to reduce pipeline stalls while maintaining data flow and exception behavior. Main advantages:
- It enables handling some cases where dependences are unknown at compile time
- It simplifies the compiler
- It allows code compiled for one pipeline to run efficiently on a different pipeline
These advantages are gained at the cost of a significant increase in hardware complexity and power consumption.

Static Scheduling
Compilers can use sophisticated algorithms for code scheduling to exploit ILP (Instruction Level Parallelism). However, the amount of parallelism available within a basic block – a straight-line code sequence with no branches in except to the entry and no branches out except at the exit – is quite small, and data dependences can further limit the ILP we can exploit within a basic block to much less than the average basic block size. To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks (i.e. across branches).
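As a concrete illustration (a hypothetical C fragment, not from the course material), the if-body below is a single basic block, and its RAW dependence chain leaves almost no ILP to exploit:

int basic_block_demo(int x, int a, int b, int c, int d)
{
    int r = 0;
    if (x > 0) {        /* branch in: block entry */
        int t = a + b;
        int u = t * c;  /* RAW dependence on t */
        int v = u - d;  /* RAW dependence on u */
        r = v;          /* RAW dependence on v */
    }                   /* branch out: block exit */
    return r;
}

The four statements must execute serially, so a wider machine gains nothing here; more parallelism has to come from moving code across the surrounding branches.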

Detection and Resolution of Dependences: Static Scheduling
- Static detection and resolution of dependences (→ static scheduling) is accomplished by the compiler: dependences are avoided by code reordering.
- Output of the compiler: code reordered into dependency-free code.
- Typical example: VLIW (Very Long Instruction Word) processors expect dependency-free code.

Very Long Instruction Word Architectures
- The processor can initiate multiple operations per cycle
- The schedule is specified completely by the compiler (unlike superscalar)
- Low hardware complexity (no scheduling hardware, reduced support for variable-latency instructions)
- No instruction reordering performed by the hardware
- Explicit parallelism
- Single control flow

VLIW: Very Long Instruction Word
- Multiple operations packed into one instruction
- Each operation slot is for a fixed function
- Constant operation latencies are specified
- The architecture requires a guarantee of:
  - Parallelism within an instruction => no cross-operation RAW check
  - No data use before data ready => no data interlocks
Example instruction format: Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2
- Two integer units, single-cycle latency
- Two load/store units, three-cycle latency
- Two floating-point units, four-cycle latency
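As a rough sketch, the six-slot word above could be modeled as a record with one fixed-function field per slot (field names and widths are illustrative, not a real ISA):

#include <stdint.h>

/* Hypothetical encoding of the 6-slot VLIW word described above. */
typedef struct {
    uint32_t int_op1;   /* integer unit 1,    1-cycle latency */
    uint32_t int_op2;   /* integer unit 2,    1-cycle latency */
    uint32_t mem_op1;   /* load/store unit 1, 3-cycle latency */
    uint32_t mem_op2;   /* load/store unit 2, 3-cycle latency */
    uint32_t fp_op1;    /* FP unit 1,         4-cycle latency */
    uint32_t fp_op2;    /* FP unit 2,         4-cycle latency */
} vliw_word;            /* 6 x 32 bits = one 192-bit instruction */

Every slot must hold something each cycle: when the compiler finds no independent operation for a slot, it encodes a NOP there, which is one source of the code-size growth discussed later.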

VLIW Compiler Responsibilities
The compiler:
- Schedules code to maximize parallel execution, exploiting ILP and LLP (loop-level parallelism). It must map the operations onto the machine's functional units, and this mapping must account for time constraints and the dependences among the tasks.
- Guarantees intra-instruction parallelism
- Schedules to avoid data hazards (no interlocks), typically separating dependent operations with explicit NOPs
The goal is to minimize the total execution time for the program. A toy scheduling sketch follows.
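To make the mapping problem concrete, here is a toy list scheduler (my own simplification, not the course's algorithm). It packs the loop body from the next slides into cycles while respecting RAW dependences, the two-slots-per-unit-class limit, and the latencies given earlier. Anti-dependences (e.g. add r2 versus the later sd) are deliberately ignored; a real compiler would handle them by renaming or by adjusting displacements.

#include <stdio.h>

enum unit { INT, MEM, FP };
static const int latency[] = { 1, 3, 4 };  /* INT, MEM, FP */

typedef struct {
    enum unit u;
    int dep[2];        /* indices of producer ops, -1 if none */
    const char *name;
} op;

int main(void)
{
    /* One iteration of: B[i] = A[i] + C  (f0 holds C).
       Only RAW (flow) dependences are recorded. */
    op prog[] = {
        { MEM, { -1, -1 }, "ld   f1, 0(r1)"  },
        { FP,  {  0, -1 }, "fadd f2, f0, f1" },
        { MEM, {  1, -1 }, "sd   f2, 0(r2)"  },
        { INT, { -1, -1 }, "add  r1, 8"      },
        { INT, { -1, -1 }, "add  r2, 8"      },
    };
    const int n = sizeof prog / sizeof prog[0];
    int start[5], done = 0;
    for (int i = 0; i < n; i++) start[i] = -1;

    for (int cycle = 0; done < n; cycle++) {
        int used[3] = { 0, 0, 0 };       /* slots taken this cycle */
        for (int i = 0; i < n; i++) {
            if (start[i] >= 0 || used[prog[i].u] == 2) continue;
            int ready = 1;
            for (int d = 0; d < 2; d++) {
                int p = prog[i].dep[d];
                if (p >= 0 && (start[p] < 0 ||
                               start[p] + latency[prog[p].u] > cycle))
                    ready = 0;           /* producer not finished */
            }
            if (ready) { start[i] = cycle; used[prog[i].u]++; done++; }
        }
    }
    for (int i = 0; i < n; i++)
        printf("cycle %d: %s\n", start[i], prog[i].name);
    return 0;
}

With the stated latencies the printed schedule spans eight cycles (ld in cycle 0, fadd in cycle 3, sd in cycle 7), matching the one-fadd-per-eight-cycles result on the loop-execution slide below.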

A VLIW Machine Configuration
[Figure: example VLIW machine configuration; the diagram is not preserved in this transcript.]

VLIW: Pros and Cons
Pros:
- Simple hardware
- Easy to extend the number of FUs
- Good compilers can effectively detect parallelism
Cons:
- Huge number of registers needed to keep the FUs active (to store operands and results)
- Large data-transport capacity needed between FUs and register files, and between register files and memory
- High bandwidth needed between the i-cache and the fetch unit
- Large code size

Loop Execution

for (i=0; i<N; i++)
    B[i] = A[i] + C;

Compiled code:
loop: ld   f1, 0(r1)
      add  r1, 8
      fadd f2, f0, f1
      sd   f2, 0(r2)
      add  r2, 8
      bne  r1, r3, loop

Compiled schedule (slots: Int 1, Int 2, M 1, M 2, FP+, FPx):
cycle 1:  ld f1 (M 1)
cycle 2:  add r1 (Int 1)
cycle 3:  (empty)
cycle 4:  fadd f2 (FP+)
cycles 5-7: (empty)
cycle 8:  add r2 (Int 1), bne (Int 2), sd f2 (M 1)

The three-cycle load latency delays the fadd to cycle 4, and the four-cycle fadd latency delays the sd to cycle 8, so one iteration occupies eight cycles.
How many FP ops/cycle? 1 fadd / 8 cycles = 0.125

Loop Unrolling

for (i=0; i<N; i++)
    B[i] = A[i] + C;

Unroll the inner loop to perform 4 iterations at once:

for (i=0; i<N; i+=4) {
    B[i]   = A[i]   + C;
    B[i+1] = A[i+1] + C;
    B[i+2] = A[i+2] + C;
    B[i+3] = A[i+3] + C;
}

Values of N that are not multiples of the unrolling factor need to be handled with a final cleanup loop, as sketched below.
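A minimal sketch of the full transformation, with the cleanup loop the slide calls for (function and variable names are mine):

/* Unroll by 4, with a cleanup loop for the 0-3 leftover elements
   when N is not a multiple of the unrolling factor. */
void add_const(double *B, const double *A, double C, int N)
{
    int i;
    for (i = 0; i + 3 < N; i += 4) {   /* unrolled body */
        B[i]     = A[i]     + C;
        B[i + 1] = A[i + 1] + C;
        B[i + 2] = A[i + 2] + C;
        B[i + 3] = A[i + 3] + C;
    }
    for (; i < N; i++)                 /* cleanup loop */
        B[i] = A[i] + C;
}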

Scheduling Loop Unrolled Code

Unroll 4 ways:
loop: ld   f1, 0(r1)
      ld   f2, 8(r1)
      ld   f3, 16(r1)
      ld   f4, 24(r1)
      add  r1, 32
      fadd f5, f0, f1
      fadd f6, f0, f2
      fadd f7, f0, f3
      fadd f8, f0, f4
      sd   f5, 0(r2)
      sd   f6, 8(r2)
      sd   f7, 16(r2)
      sd   f8, 24(r2)
      add  r2, 32
      bne  r1, r3, loop

Schedule (slots: Int 1, Int 2, M 1, M 2, FP+, FPx):
cycle 1:  ld f1 (M 1), ld f2 (M 2)
cycle 2:  ld f3 (M 1), ld f4 (M 2), add r1 (Int 1)
cycle 3:  (empty)
cycle 4:  fadd f5 (FP+)
cycle 5:  fadd f6 (FP+)
cycle 6:  fadd f7 (FP+)
cycle 7:  fadd f8 (FP+)
cycle 8:  sd f5 (M 1)
cycle 9:  sd f6 (M 1)
cycle 10: sd f7 (M 1)
cycle 11: sd f8 (M 1), add r2 (Int 1), bne (Int 2)

Only one fadd can issue per cycle (the FP+ slot), and each sd must wait four cycles for its fadd, so the unrolled body still takes 11 cycles.
How many FLOPS/cycle? 4 fadds / 11 cycles = 0.36

Software Pipelining

Unroll 4 ways first:
loop: ld   f1, 0(r1)
      ld   f2, 8(r1)
      ld   f3, 16(r1)
      ld   f4, 24(r1)
      add  r1, 32
      fadd f5, f0, f1
      fadd f6, f0, f2
      fadd f7, f0, f3
      fadd f8, f0, f4
      sd   f5, 0(r2)
      sd   f6, 8(r2)
      sd   f7, 16(r2)
      add  r2, 32
      sd   f8, -8(r2)
      bne  r1, r3, loop

In the software-pipelined schedule, each long instruction mixes operations from different iterations: the loads of one iteration issue alongside the fadds of the previous iteration and the stores of the iteration before that. A prolog fills the pipeline, the steady-state kernel then iterates, and an epilog drains it.

How many FLOPS/cycle? 4 fadds / 4 cycles = 1
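The pipelined schedule itself cannot be expressed in C, but its prolog/kernel/epilog shape can. A minimal sketch with a software-pipeline depth of two, in which the load for iteration i+1 overlaps the add and store of iteration i (names are mine):

void add_const_swp(double *B, const double *A, double C, int N)
{
    if (N <= 0) return;
    double a = A[0];                  /* prolog: first load */
    for (int i = 0; i < N - 1; i++) { /* kernel */
        double next = A[i + 1];       /* load for iteration i+1 */
        B[i] = a + C;                 /* add + store for iteration i */
        a = next;
    }
    B[N - 1] = a + C;                 /* epilog: drain the pipeline */
}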

Software Pipelining vs. Loop Unrolling

[Figure: two performance-over-time plots, one for the loop-unrolled code and one for the software-pipelined code, marking the startup overhead, the wind-down overhead, and the loop iterations.]

Software pipelining pays the startup/wind-down costs only once per loop, not once per iteration.
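A back-of-envelope comparison under the cycle counts of the previous slides (the 7-cycle fill/drain charge is my estimate, not a figure from the slides):

#include <stdio.h>

/* 11 cycles per 4-iteration unrolled body vs. a 4-cycle
   software-pipelined kernel plus a one-time fill/drain cost. */
static long cycles_unrolled(long n) { return (n / 4) * 11; }
static long cycles_swp(long n)      { return (n / 4) * 4 + 7; }

int main(void)
{
    for (long n = 4; n <= 4096; n *= 4)
        printf("N=%5ld  unrolled=%6ld  sw-pipelined=%6ld\n",
               n, cycles_unrolled(n), cycles_swp(n));
    return 0;
}

As N grows, the pipelined loop approaches the ideal 1 cycle per element, while the unrolled loop stays at 11/4 = 2.75 cycles per element.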

What follows