VLIW CSE 471 Autumn 021 A (naïve) Primer on VLIW – EPIC with slides borrowed/edited from an Intel-HP presentation VLIW direct descendant of horizontal.

Slides:



Advertisements
Similar presentations
VLIW Very Large Instruction Word. Introduction Very Long Instruction Word is a concept for processing technology that dates back to the early 1980s. The.
Advertisements

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Advanced Computer Architecture COE 501.
CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture VLIW Steve Ko Computer Sciences and Engineering University at Buffalo.
POLITECNICO DI MILANO Parallelism in wonderland: are you ready to see how deep the rabbit hole goes? ILP: VLIW Architectures Marco D. Santambrogio:
Dynamic Branch PredictionCS510 Computer ArchitecturesLecture Lecture 10 Dynamic Branch Prediction, Superscalar, VLIW, and Software Pipelining.
1 Advanced Computer Architecture Limits to ILP Lecture 3.
Pipeline Hazards Pipeline hazards These are situations that inhibit that the next instruction can be processed in the next stage of the pipeline. This.
Sim-alpha: A Validated, Execution-Driven Alpha Simulator Rajagopalan Desikan, Doug Burger, Stephen Keckler, Todd Austin.
Computer Architecture Computer Architecture Processing of control transfer instructions, part I Ola Flygt Växjö University
8 Processing of control transfer instructions TECH Computer Science 8.1 Introduction 8.2 Basic approaches to branch handling 8.3 Delayed branching 8.4.
EECC551 - Shaaban #1 lec # 6 Fall Multiple Instruction Issue: CPI < 1 To improve a pipeline’s CPI to be better [less] than one, and to.
PART 4: (2/2) Central Processing Unit (CPU) Basics CHAPTER 13: REDUCED INSTRUCTION SET COMPUTERS (RISC) 1.
Limits on ILP. Achieving Parallelism Techniques – Scoreboarding / Tomasulo’s Algorithm – Pipelining – Speculation – Branch Prediction But how much more.
1 Lecture 7: Static ILP, Branch prediction Topics: static ILP wrap-up, bimodal, global, local branch prediction (Sections )
Mult. Issue CSE 471 Autumn 011 Multiple Issue Alternatives Superscalar (hardware detects conflicts) –Statically scheduled (in order dispatch and hence.
1 Lecture 7: Out-of-Order Processors Today: out-of-order pipeline, memory disambiguation, basic branch prediction (Sections 3.4, 3.5, 3.7)
Chapter 15 IA-64 Architecture No HW, Concentrate on understanding these slides Next Monday we will talk about: Microprogramming of Computer Control units.
1 Lecture 5: Pipeline Wrap-up, Static ILP Basics Topics: loop unrolling, VLIW (Sections 2.1 – 2.2) Assignment 1 due at the start of class on Thursday.
Trace Caches J. Nelson Amaral. Difficulties to Instruction Fetching Where to fetch the next instruction from? – Use branch prediction Sometimes there.
Branch Target Buffers BPB: Tag + Prediction
Microprocessors Introduction to ia64 Architecture Jan 31st, 2002 General Principles.
EECS 470 Superscalar Architectures and the Pentium 4 Lecture 12.
EECC551 - Shaaban #1 lec # 8 Winter Multiple Instruction Issue: CPI < 1 To improve a pipeline’s CPI to be better [less] than one, and to.
1 Lecture 8: Instruction Fetch, ILP Limits Today: advanced branch prediction, limits of ILP (Sections , )
Chapter 15 IA-64 Architecture. Reflection on Superscalar Machines Superscaler Machine: A Superscalar machine employs multiple independent pipelines to.
Chapter 21 IA-64 Architecture (Think Intel Itanium)
1 Lecture 7: Static ILP and branch prediction Topics: static speculation and branch prediction (Appendix G, Section 2.3)
Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)
Reduced Instruction Set Computers (RISC) Computer Organization and Architecture.
IA-64 ISA A Summary JinLin Yang Phil Varner Shuoqi Li.
Anshul Kumar, CSE IITD CS718 : VLIW - Software Driven ILP Example Architectures 6th Apr, 2006.
Spring 2003CSE P5481 VLIW Processors VLIW (“very long instruction word”) processors instructions are scheduled by the compiler a fixed number of operations.
Introducing The IA-64 Architecture - Kalyan Gopavarapu - Kalyan Gopavarapu.
Super computers Parallel Processing By Lecturer: Aisha Dawood.
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 1 Compiler Techniques for ILP  So far we have explored dynamic hardware techniques for.
ECEG-3202 Computer Architecture and Organization Chapter 7 Reduced Instruction Set Computers.
Branch.1 10/14 Branch Prediction Static, Dynamic Branch prediction techniques.
Next Generation ISA Itanium / IA-64. Operating Environments IA-32 Protected Mode/Real Mode/Virtual Mode - if supported by the OS IA-64 Instruction Set.
UltraSPARC III Hari P. Ananthanarayanan Anand S. Rajan.
Superscalar - summary Superscalar machines have multiple functional units (FUs) eg 2 x integer ALU, 1 x FPU, 1 x branch, 1 x load/store Requires complex.
COMPSYS 304 Computer Architecture Speculation & Branching Morning visitors - Paradise Bay, Bay of Islands.
IA-64 Architecture Muammer YÜZÜGÜLDÜ CMPE /12/2004.
Instruction-Level Parallelism and Its Dynamic Exploitation
Computer Architecture
CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue
Part IV Data Path and Control
COSC3330 Computer Architecture
CS203 – Advanced Computer Architecture
/ Computer Architecture and Design
Henk Corporaal TUEindhoven 2009
The EPIC-VLIW Approach
Lecture 6: Static ILP, Branch prediction
Branch statistics Branches occur every 4-6 instructions (16-25%) in integer programs; somewhat less frequently in scientific ones Unconditional branches.
Yingmin Li Ting Yan Qi Zhao
Henk Corporaal TUEindhoven 2011
Control unit extension for data hazards
Sampoorani, Sivakumar and Joshua
Instruction Level Parallelism (ILP)
CC423: Advanced Computer Architecture ILP: Part V – Multiple Issue
CSE 586 Computer Architecture Lecture 4
VLIW direct descendant of horizontal microprogramming
Control unit extension for data hazards
Dynamic Hardware Prediction
How to improve (decrease) CPI
Control unit extension for data hazards
Lecture 5: Pipeline Wrap-up, Static ILP
Presentation transcript:

VLIW CSE 471 Autumn 021 A (naïve) Primer on VLIW – EPIC with slides borrowed/edited from an Intel-HP presentation VLIW direct descendant of horizontal microprogramming –Two commercially unsuccessful machines: Multiflow and Cydrome Compiler generates instructions that can execute together –Instructions executed in order and assumed to have a fixed latency Difficulties occur with unpredictable latencies : –Branch prediction -> Use of predication in addition to static and dynamic branch prediction –Pointer-based computations -> Use cache hints and speculative loads

VLIW CSE 471 Autumn 022 IA-64 : Explicitly Parallel Architecture IA-64 template specifies –The type of operation for each instruction, e.g. MFI, MMI, MII, MLI, MIB, MMF, MFB, MMB, MBB, BBB –Intra-bundle relationship, e.g. M / MI or MI / I (/ is a “stop” meaning no parallelism ) –Inter-bundle relationship Most common combinations covered by templates –Headroom for additional templates Simplifies hardware requirements Scales compatibly to future generations Instruction 2 41 bits Instruction 1 41 bits Instruction 0 41 bits Template 5 bits 128 bits (bundle) M=MemoryF=Floating-pointI=Integer L=Long Immediate B=Branch (MMI) Memory (M) Integer (I)

VLIW CSE 471 Autumn 023 (Merced) Itanium implementation Can execute 2 bundles (6 instructions) per cycle 10 stage pipeline 4 integer units (2 of them can handle load-store), 2 f-p units and 3 branch units Issue in order, execute in order but can complete out of order. Uses a (restricted) register scoreboard technique to resolve dependencies.

VLIW CSE 471 Autumn 024 Itanium implementation (?) Predication reduces number of branches and number of mispredicts, Nonetheless: sophisticated branch predictor –Compiler hints: BPR instruction provides “easy” to predict branch address; reduces number of entries in BTB –Two-level hardware prediction Sas (4,2) (512 entry local history table 4-way set-associative, indexing 128 PHT –one per set- each with 16 entries –2-bit saturating counters). Number of bubbles on predicted branch taken: 2 or 3 –And a 64-entry BTB (only 1 cycle stall) + return stack –Mispredicted branch penalty: 9 cycles

VLIW CSE 471 Autumn 025 Itanium implementation (?) There are “instruction queues” between the fetch unit and the execution units. Therefore branch bubbles can often be absorbed because of long latencies (and stalls) in the execute stages

VLIW CSE 471 Autumn 026 IA-64 for High Performance Number of branches in large server apps overwhelm traditional processors –IA-64 predication removes branches, avoids mispredicts –Alas, full predication, i.e., predicating every instruction, does not improve performance as much as one would hope and is often detrimental (e.g., instructions are longer hence I-cache miss rate could be higher) Environments with a large number of users require high performance –IA-64 uses speculation to reduce impact of memory latency –64-bit addressing enables systems with very large virtual and physical memory

VLIW CSE 471 Autumn 027 Middle Tier Application Needs Mid-tier applications have diverse code requirements –Integer code with many small loops –Significant call / return requirements (C++, Java) IA-64’s register model supports these various requirements –Large register file provides significant resources for optimized performance –Register stack to handle call-intensive code –Rotating registers enables efficient loop execution

VLIW CSE 471 Autumn 028 IA-64’s Large Register File BR7 BR0 Branch Registers Stacked, Rotating GR1 GR31 GR127 GR32GR0NaT 32 Static 0 Integer Registers 630 Predicate Registers Registers 1 PR1 PR63 PR0 PR15 PR16 48 Rotating 16 Static bit 0 96 Rotating GR1 GR31 GR127 GR32GR0 32 Static 0.0 Floating-Point Registers 810

VLIW CSE 471 Autumn 029 Traditional Register Models Traditional Register Stacks A B MemoryRegister A Traditional Register Models Procedure A calls procedure B Procedures must share space in register Performance penalty due to register save / restore A B C D A B C D Eliminate the need for save / restore by reserving fixed blocks in register However, fixed blocks waste resources RegisterProcedureProcedures ? I think that the “traditional register stack” model they refer to is the “register windows” model used in Sparc

VLIW CSE 471 Autumn 0210 Traditional Register Stacks IA-64 Register Stack A B C D A B C D Eliminate the need for save / restore by reserving fixed blocks in register However, fixed blocks waste resources RegisterProcedures IA-64 Register Stack A B C A B C D RegisterProcedures IA-64 able to reserve variable block sizes No wasted resources D ? D D

VLIW CSE 471 Autumn 0211 Software pipelining Reorganize loops with loop-carried dependences by “symbolically” unrolling them –New code : statements of distinct iterations of original code –Take an “horizontal” slice of several (dependent) iterations Iter. i Iter. i + 1 Iter. i + 2 Store x[i] Use x[i] Load x[i] Original code New code Store x[i] Use x[i-1] Load x[i-2] dependence

VLIW CSE 471 Autumn 0212 Software Pipelining via Rotating Registers Traditional architectures need complex software loop unrolling for pipelining –Results in code expansion --> Increases cache misses --> Reduces performance IA-64 utilizes rotating registers (r0 ->r1, r1 -> r2 etc in successive iterations) to achieve software pipelining –Avoids code expansion --> Reduces cache misses --> Higher performance Sequential Loop Execution Time Software Pipelining Loop Execution Time

VLIW CSE 471 Autumn 0213 IA-64 Floating-Point Architecture 128 registers –Allows parallel execution of multiple floating-point operations Simultaneous Multiply - Accumulate (FMAC) –3-input, 1-output operation : a * b + c -> d –Shorter latency than independent multiply and add –Greater internal precision and single rounding error Memory 128 FP RegisterFile Multiple read ports Multiple write ports... FMAC #1 FMAC #2 ABC D X+ (82 bit floating point numbers) FMAC FMAC

VLIW CSE 471 Autumn 0214 Predication Basic Idea Associate a Boolean condition (predicate) with the issue, execution, or commit of an instruction –The stage in which to test the predicate is an implementation choice If the predicate is true, the result of the instruction is kept If the predicate is false, the instruction is nullified Distinction between –Partial predication: only a few opcodes can be predicated –Full predication: every instruction is predicated

VLIW CSE 471 Autumn 0215 Predication Benefits Allows compiler to overlap the execution of independent control constructs w/o code explosion Allows compiler to reduce frequency of branch instructions and, consequently, of branch mispredictions Reduces the number of branches to be tested in a given cycle Reduces the number of multiple execution paths and associated hardware costs (copies of register maps etc.) Allows code movement in superblocks

VLIW CSE 471 Autumn 0216 Predication Costs Increased fetch utilization Increased register consumption If predication is tested at commit time, increased functional-unit utilization With code movement, increased complexity of exception handling –For example, insert extra instructions for exception checking

VLIW CSE 471 Autumn 0217 Flavors of Predication Implementation Has its roots in vector machines like CRAY-1 –Creation of vector masks to control vector operations on an element per element basis Often (partial) predication limited to conditional moves as, e.g., in the Alpha, MIPS 10000, Power PC, SPARC and the Pentium Pro Full predication: Every instruction predicated as in IA-64

VLIW CSE 471 Autumn 0218 Partial Predication: Conditional Moves CMOV R1, R2, R3 –Move R2 to R1 if R3 = 0 Main compiler use: If (cond ) S1 (with result in Rres) –(1) Compute result of S1 in Rs1; –(2) Compute condition in Rcond; –(3) CMOV Rres, Rs1, Rcond Increases register pressure (Rcond is general register) No need (in this example) for branch prediction Very useful if condition can be computed ahead or, e.g., in parallel with result.

VLIW CSE 471 Autumn 0219 Other Forms of Partial Predication Select dest, src1, src2,cond –Corresponds to C-like --- dest = ( (cond) ? src1 : src2) –Note the destination register is always assigned a value –Use in the Multiflow (first commercial VLIW machine) Nullify –Any register-register instruction can nullify the next instruction, thus making it conditional

VLIW CSE 471 Autumn 0220 Full Predication Define predicates with instructions of the form: Pred_ Pout1, Pout2,, src1, src2 (P in ) where – Pout1 and Pout2 are assigned values according to the comparison between src1 and src2 and the cmp “opcode” –The predicate types are most often U (unconditional) and U its complement, and OR and OR –The predicate define instruction can itself be predicated with the value of P in There are definite rules for that, e.g., if P in = 0, U and U are set to 0 independently of the result of the comparison and the OR predicates are not modified.

VLIW CSE 471 Autumn 0221 If-conversion The if condition will set p1 to U The then will be executed predicated on p1(U) The else will be executed predicated on p1(U) The “join” will in general be predicated on some form of OR predicate if then else join