The EPIC-VLIW Approach


The EPIC-VLIW Approach
- Explicitly Parallel Instruction Computing (EPIC) is a "philosophy"; Very Long Instruction Word (VLIW) is an implementation of EPIC.
- The concept derives from horizontal microprogramming, namely a sequence of steps (microoperations) that interprets the ISA:
  - if only one microoperation is activated per cycle: vertical microprogramming;
  - if several units (at the extreme, all of them: PC increment, add, floating point, register-file read, register-file write, etc.) can be activated in the same cycle: horizontal microprogramming.
VLIW CSE 471

The EPIC "philosophy"
- The compiler generates packets, or bundles, of instructions that can execute together.
- Instructions execute in order (static scheduling) and are assumed to have fixed latencies.
- The architecture should provide features that assist the compiler in exploiting ILP: branch prediction, load speculation (see later), and the associated recovery mechanisms.
- Difficulties occur with unpredictable latencies:
  - branches → use predication in addition to static and dynamic branch prediction;
  - pointer-based computations → use cache hints and speculative loads.

Why EPIC?
- Dynamically scheduled processors achieve better performance (lower CPI) than statically scheduled ones. So why EPIC?
- Statically scheduled hardware is simpler.
- Static scheduling can look at the whole program rather than a relatively small instruction window, giving more possibilities for optimization. For example:
    R1 ← R2 + R3   (latency 1)
    R3 ← R4 * R5   (latency m)
  Knowing both latencies, the compiler could start the multiply m-2 cycles before the addition.

Other Static Scheduling Techniques
- Eliminate branches via predication (next slides).
- Loop unrolling.
- Software pipelining (see in a few slides).
- Global scheduling, e.g., the trace scheduling technique: focus on the critical path.
- Software prefetching (we'll talk about prefetching at length later).

Predication: Basic Idea
- Associate a Boolean condition (predicate) with the issue, execution, or commit of an instruction; the stage in which the predicate is tested is an implementation choice.
- If the predicate is true, the result of the instruction is kept; if it is false, the instruction is nullified.
- Distinguish between partial predication (only a few opcodes can be predicated) and full predication (every instruction is predicated).

Predication Benefits
- Allows the compiler to overlap the execution of independent control constructs without code explosion.
- Allows the compiler to reduce the frequency of branch instructions and, consequently, of branch mispredictions.
- Reduces the number of branches to be tested in a given cycle.
- Reduces the number of multiple execution paths and the associated hardware costs (copies of register maps, etc.).
- Allows code movement in superblocks.

Predication Costs
- Increased fetch utilization.
- Increased register consumption.
- If the predicate is tested at commit time, increased functional-unit utilization.
- With code movement, increased complexity of exception handling (for example, extra instructions must be inserted for exception checking).
- If every instruction is predicated, instructions become larger, which impacts the I-cache.

Flavors of Predication Implementation
- Predication has its roots in vector machines like the CRAY-1: vector masks control vector operations on an element-by-element basis.
- Partial predication is often limited to conditional moves, as in the Alpha, MIPS R10000, IBM PowerPC, SPARC, and the Intel P6 microarchitecture.
- Full predication: every instruction is predicated, as in the Intel Itanium (IA-64 ISA).

Partial Predication: Conditional Moves
- CMOV R1, R2, R3: move R2 to R1 if R3 = 0.
- Main compiler use, for "if (cond) S1" (with result in Rres):
  (1) compute the result of S1 in Rs1;
  (2) compute the condition in Rcond;
  (3) CMOV Rres, Rs1, Rcond.
- No need (in this example) for branch prediction.
- Very useful if the condition can be computed ahead of, or in parallel with, the result.
- But: increases register pressure (Rcond occupies a general register).

Other Forms of Partial Predication
- Select: "select dest, src1, src2, cond" corresponds to the C expression dest = cond ? src1 : src2. Note that the destination register is always assigned a value. Used in the Multiflow, the first commercial VLIW machine.
- Nullify: any register-register instruction can nullify the next instruction, thus making it conditional.

Full Predication
- Predicates are defined with instructions of the form:
    pred_<cmp> Pout1<type>, Pout2<type>, src1, src2 (Pin)
  where Pout1 and Pout2 are assigned values according to the comparison between src1 and src2 and the <cmp> "opcode".
- The predicate types are most often U (unconditional) and its complement Ū, and OR and its complement.
- The predicate-define instruction can itself be predicated with the value of Pin. There are definite rules for this; e.g., if Pin = 0, U and Ū are set to 0 independently of the result of the comparison, and the OR predicates are not modified.

If-conversion
- The "if" condition sets p1 with a U-type compare.
- The "then" arm is executed predicated on p1 (U).
- The "else" arm is executed predicated on the complement of p1 (Ū).
- The "join" is in general predicated on some form of OR predicate.

IA-64: Explicitly Parallel Architecture
- A 128-bit bundle holds three 41-bit instructions (slots 0-2) plus a 5-bit template.
- Instruction types: M = Memory, I = Integer, F = Floating-point, L = Long Immediate, B = Branch.
- The IA-64 template specifies:
  - the type of operation in each slot, e.g., MFI, MMI, MII, MLI, MIB, MMF, MFB, MMB, MBB, BBB;
  - intra-bundle relationships, e.g., M / MI or MI / I ("/" is a "stop", meaning no parallelism across it);
  - inter-bundle relationships.
- The most common combinations are covered by templates, with headroom for additional templates.
- This simplifies hardware requirements and scales compatibly to future generations.

Itanium Overview
[This slide consists of a block-diagram figure not captured in the transcript.]

IA-64's Large Register File
- 128 integer registers (GR0-GR127, 64 bits each plus a NaT bit): 32 static, 96 stacked and rotating.
- 128 floating-point registers (82 bits each, FR0 fixed at 0.0 and FR1 at 1.0): 32 static, 96 rotating.
- 64 predicate registers (PR0-PR63, 1 bit each): 16 static, 48 rotating.
- 8 branch registers (BR0-BR7).

Itanium implementation
- Can execute 2 bundles (6 instructions) per cycle.
- 10-stage pipeline.
- 4 integer units (2 of which can handle loads/stores), 2 floating-point units, and 3 branch units.
- Issues in order and executes in order, but can complete out of order; uses a (restricted) register-scoreboard technique to resolve dependences.

Itanium implementation
- Predication reduces the number of branches and of mispredictions; nonetheless, Itanium has a sophisticated branch predictor:
  - a two-level branch predictor of the SAs variety;
  - some provision for multiway branches (several basic blocks can terminate in the same bundle);
  - 4 registers for highly predictable target addresses (e.g., ends of loops), hence no bubble on those taken branches;
  - a return address stack;
  - hints from the compiler.
- Instructions can be prefetched from L2 into the instruction buffer.

Itanium implementation
- There are instruction queues between the fetch unit and the execution units; branch bubbles can therefore often be absorbed by the long latencies (and stalls) in the execute stages.
- Some form of scoreboarding is used to detect dependences.

Traditional Register Models
- Procedure-register model: when procedure A calls procedure B, the procedures must share space in the register file, and there is a performance penalty due to register save/restore at call boundaries.
- Traditional register stacks eliminate the need for save/restore by reserving fixed-size blocks in the register file; however, fixed blocks waste resources.
- (I think the "traditional register stack" model they refer to is the "register windows" model used in SPARC.)

IA-64 Register Stack
- Traditional register stacks reserve fixed-size blocks, eliminating save/restore but wasting resources.
- IA-64 is able to reserve variable-size blocks, so no resources are wasted.

Software pipelining
- Reorganize loops with loop-carried dependences by "symbolically" unrolling them.
- The new code consists of statements from distinct iterations of the original code: take a "horizontal" slice through several (dependent) iterations.
- Original loop body (each step depends on the previous one): load x[i]; use x[i]; store x[i].
- Pipelined kernel, one slice across iterations i, i+1, i+2: store x[i]; use x[i+1]; load x[i+2].

Software Pipelining via Rotating Registers
- Traditional architectures need complex software loop unrolling for pipelining; the resulting code expansion increases cache misses and reduces performance.
- IA-64 uses rotating registers (r0 → r1, r1 → r2, etc., in successive iterations) to achieve software pipelining without code expansion, which reduces cache misses and yields higher performance.

IA-64 Floating-Point Architecture
- 128 floating-point registers (82-bit floating-point numbers) with multiple read and write ports, allowing parallel execution of multiple floating-point operations.
- Multiple FMAC units perform a simultaneous multiply-accumulate: a 3-input, 1-output operation, a * b + c → d.
- Shorter latency than an independent multiply followed by an add, with greater internal precision and a single rounding error.
VLIW CSE 471