COSC3330 Computer Architecture

Presentation transcript:

COSC3330 Computer Architecture Lecture 16. VLIW Instructor: Weidong Shi (Larry), PhD Computer Science Department University of Houston

Topic VLIW

Example Pipelined ILP Machine
- Two integer units, single-cycle latency
- Two load/store units, three-cycle latency
- Two floating-point units, four-cycle latency
- Maximum throughput: six instructions per cycle
[Figure: pipeline diagram; each box is one pipeline stage, and the depth of each pipeline is its latency in cycles]
How much instruction-level parallelism (ILP) is required to keep the machine's pipelines busy?
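One way to approach the question above is to count how many independent operations must be in flight to fill every stage of every pipeline. The sketch below assumes each unit is fully pipelined with one stage per cycle of latency; the names UNITS, peak_issue_width and required_in_flight are illustrative and not from the slides.

```python
# Functional-unit mix from the slide: (name, number of units, latency in cycles).
UNITS = [
    ("integer", 2, 1),
    ("load/store", 2, 3),
    ("floating-point", 2, 4),
]

def peak_issue_width(units):
    """Operations that can start in a single cycle (one per unit)."""
    return sum(count for _, count, _ in units)

def required_in_flight(units):
    """Independent operations needed to occupy every stage of every pipeline.

    Assumes each unit is fully pipelined, so a unit with latency L
    holds L operations at once.
    """
    return sum(count * latency for _, count, latency in units)

print(peak_issue_width(UNITS))    # 6
print(required_in_flight(UNITS))  # 2*1 + 2*3 + 2*4 = 16
```

Under these assumptions the machine needs 16 independent operations in flight to stay fully busy, well above its issue width of 6.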

Sequential ISA Bottleneck
[Figure: compilation and execution flow]
- Sequential source code, e.g. a = foo(b); for (i=0, i<
- Superscalar compiler: finds independent operations, schedules operations
- Sequential machine code
- Superscalar processor: checks instruction dependencies, schedules execution

VLIW: Very Long Instruction Word
Instruction word slots: Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2
- Two integer units, single-cycle latency
- Two load/store units, three-cycle latency
- Two floating-point units, four-cycle latency
- Multiple operations packed into one instruction
- Each operation slot is for a fixed function
- Constant operation latencies are specified
- Architecture requires guarantee of:
  - Parallelism within an instruction => no cross-operation RAW check
  - No data use before data ready => no data interlocks

Very-Long Instruction Word (VLIW) Computers
[Figure: PC, instruction memory, an instruction word of the form Op Rd Ra Rb | Op Rd Ra Rb | Op Rd Ra Rb, and a shared register file]
- Instruction word consists of several conventional 3-operand instructions, one for each of the ALUs
- Register file has 3N ports to feed N ALUs
- All ALU-ALU communication takes place via the register file

Why VLIWs?
- Opportunity for much simpler hardware
- Compiler discovers dependencies and places the instructions
- Simple encodings
- Potentially fewer transistors than other designs
- Reduced speculation; out-of-order execution not needed
- Size efficiencies, price, power consumption
- Is this true for Itanium?

VLIW Compiler Responsibilities
The compiler:
- Schedules to maximize parallel execution
- Guarantees intra-instruction parallelism
- Schedules to avoid data hazards
- Typically separates operations with explicit NOPs
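The responsibilities above can be illustrated with a toy static scheduler. This is a minimal sketch, not a real VLIW compiler: it assumes the six-slot machine from the earlier slides, single-assignment registers (so only RAW dependencies matter), and no memory dependencies. The operation mnemonics and the names LATENCY, SLOTS and schedule are made up for illustration.

```python
# Latencies and slot layout of the example machine from the earlier slides.
LATENCY = {"int": 1, "mem": 3, "fp": 4}
SLOTS = ["int", "int", "mem", "mem", "fp", "fp"]

def schedule(ops):
    """Pack operations into long instruction words, padding with NOPs.

    ops: list of (text, unit, dest, sources) in program order.
    Assumes every register is written once, so only RAW hazards exist.
    """
    ready = {}   # register -> first cycle in which its value may be read
    words = []   # one list per cycle, one entry per slot
    for text, unit, dest, sources in ops:
        # Earliest cycle at which all source operands are available.
        cycle = max((ready.get(s, 0) for s in sources), default=0)
        while True:
            while len(words) <= cycle:
                words.append(["nop"] * len(SLOTS))
            free = [i for i, kind in enumerate(SLOTS)
                    if kind == unit and words[cycle][i] == "nop"]
            if free:
                words[cycle][free[0]] = text
                break
            cycle += 1   # all slots of this type are taken; try the next word
        if dest is not None:
            ready[dest] = cycle + LATENCY[unit]
    return words

OPS = [
    ("ld f1=[r1]",    "mem", "f1", ["r1"]),
    ("ld f2=[r2]",    "mem", "f2", ["r2"]),
    ("fmul f3=f1,f2", "fp",  "f3", ["f1", "f2"]),
    ("add r3=r1,8",   "int", "r3", ["r1"]),
    ("fadd f4=f3,f1", "fp",  "f4", ["f3", "f1"]),
]

for cycle, word in enumerate(schedule(OPS)):
    print(cycle, word)
```

With these five operations the schedule needs 8 instruction words, and 43 of the 48 slots hold NOPs. This shows how exposed latencies translate directly into explicit NOPs when the code has little parallelism.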

Design Philosophy: VLIW vs. Superscalar RISC
[Figure: both flows start from the same normal source code and target the same ILP hardware]
- Superscalar RISC: a normal compiler produces the object code; scheduling and operation-independence recognition are done in hardware at run time
- VLIW: a normal compiler plus scheduling and operation-independence recognition in software at compile time

Early VLIW Machines
- Multiflow Trace (1987)
  - Commercialization of ideas from Fisher's Yale group, including "trace scheduling"
  - Available in configurations with 7, 14, or 28 operations/instruction
  - 28 operations packed into a 1024-bit instruction word
- Cydrome Cydra-5 (1987)
  - 7 operations encoded in a 256-bit instruction word
  - Rotating register file

Josh Fisher “In recognition of 25 years of seminal contributions to instruction-level parallelism, pioneering work on VLIW architectures, and the formulation of the Trace Scheduling compilation technique” 2003 Eckert-Mauchly Award

Intel/HP EPIC
- Explicitly Parallel Instruction Computing (EPIC)
- Close kin of VLIW (e.g., the compiler holds the key to high performance)
- New Intel architecture (designed from the ground up)
- Not compatible with x86
- 64-bit: IA-64 (not x86-64)
- RISC + superscalar
An Itanium instruction bundle:
  ld4 r43=[r38]
  add r38=16,r38
  br.call.sptk b0=printf# ;;

Intel Itanium
- Execution: 6 instructions per clock
- 10-stage pipeline
- 4 ALU, 4 multimedia ALU, 4 FP (up to 8 FP ops/cycle), 2 load/store, 3 branch units
- 128 64-bit general-purpose registers
- 128 80-bit floating-point registers

Intel Itanium ISA
Itanium instruction "bundle" (VLIW)
- 128 bits each
- Contains three Itanium instructions (aka syllables)
- Template bits in each bundle specify dependencies both within a bundle and between sequential bundles
- A collection of independent bundles forms a "group" (delimited by stops)
Each Itanium instruction
- Fixed length, 41 bits
- Left-most 4 bits (40-37) are the major opcode (e.g. FP ld/st, INT ld/st, ALU)
- Contains at most three 7-bit register specifiers
Bundle bit layout: [127:87] instruction slot | [86:46] instruction slot | [45:5] instruction slot | [4:0] template
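The bundle layout above (three 41-bit slots plus a 5-bit template in 128 bits) can be sketched as plain bit packing. The function names are illustrative, and the slot values used below are arbitrary placeholders rather than real instruction encodings.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOT_SHIFTS = (5, 46, 87)   # low bit of each 41-bit slot within the bundle

def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three slots")
    bundle = template
    for slot, shift in zip(slots, SLOT_SHIFTS):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each slot must fit in 41 bits")
        bundle |= slot << shift
    return bundle

def unpack_bundle(bundle):
    """Return (template, (slot_a, slot_b, slot_c)) from a 128-bit bundle."""
    mask = (1 << SLOT_BITS) - 1
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = tuple((bundle >> shift) & mask for shift in SLOT_SHIFTS)
    return template, slots

def major_opcode(slot):
    """Bits 40..37 of a 41-bit instruction."""
    return (slot >> 37) & 0xF

bundle = pack_bundle(0x02, (0x123, 0x456, 0xA << 37))
print(hex(bundle))
print(unpack_bundle(bundle))
```

The arithmetic confirms the sizes on the slide: 3 x 41 + 5 = 128 bits, so the three slots and the template exactly fill a bundle.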

Intel Itanium ISA
Each IA-64 instruction belongs to one of 6 types and may be executed on one or more execution unit types.
4 functional unit categories:
- I unit (integer)
- F unit (floating-point)
- M unit (memory)
- B unit (branch)
6 instruction types:
- Integer ALU (A-type): executed on M or I units
- Non-ALU integer (I-type): executed on I units
- Memory (M-type): executed on M units
- Floating-point (F-type): executed on F units
- Branch (B-type): executed on B units
- Extended (L/X-type): executed on I or B units

Encoding Instruction Bundle
  { .mii
    ld4 r28=[r8]
    add r9 = 2,r1 ;;
    add r30 = 1,r9
  }
- MI_I format: template encoded as "02"
- Use ";;" as the "stop bit" in assembly code to separate dependent instructions
- Instructions between ";;" belong to the same "instruction group"
- RAW and WAW dependencies are not allowed within the same instruction group
- Each instruction slot can represent one functional unit type based on the encoding (e.g. slot 0 can be M-unit or B-unit)
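The instruction-group rule above can be checked mechanically. The sketch below is a simplified model: it tracks only register names, treats every stop as a group boundary, and ignores the special cases that the real architecture defines. The tuple format and the name find_group_hazards are assumptions made for illustration.

```python
def find_group_hazards(instructions):
    """Report RAW and WAW register dependencies inside an instruction group.

    instructions: list of (text, dests, sources, stop_after), where
    stop_after is True if the instruction is followed by ";;".
    """
    hazards = []
    written = {}   # register -> instruction that wrote it in the current group
    for text, dests, sources, stop_after in instructions:
        for reg in sources:
            if reg in written:
                hazards.append(("RAW", reg, written[reg], text))
        for reg in dests:
            if reg in written:
                hazards.append(("WAW", reg, written[reg], text))
        for reg in dests:
            written[reg] = text
        if stop_after:
            written = {}   # a stop ends the group
    return hazards

# The bundle from the slide, with the stop after the second instruction.
WITH_STOP = [
    ("ld4 r28=[r8]",  ["r28"], ["r8"], False),
    ("add r9=2,r1",   ["r9"],  ["r1"], True),
    ("add r30=1,r9",  ["r30"], ["r9"], False),
]
# The same instructions with the stop removed.
WITHOUT_STOP = [(t, d, s, False) for t, d, s, _ in WITH_STOP]

print(find_group_hazards(WITH_STOP))      # []
print(find_group_hazards(WITHOUT_STOP))   # one RAW hazard on r9
```

Removing the stop puts the producer and the consumer of r9 in the same group, which is the dependency the ";;" in the slide's example is there to separate.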

Intel Itanium ISA
There are 12 basic bundle types. Each basic type has two versions, one with a stop after the third slot and one without:
- Without a trailing stop: MII, MI_I, MLX, MMI, M_MI, MFI, MMF, MIB, MBB, BBB, MMB, MFB
- With a trailing stop: MII_, MI_I_, MLX_, MMI_, M_MI_, MFI_, MMF_, MIB_, MBB_, BBB_, MMB_, MFB_
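The template names can be generated and decoded with a few lines of code. This sketch only manipulates the names from the slide: the list order follows the slide and is not the numeric template encoding, and the helper names are illustrative.

```python
BASIC_TEMPLATES = ["MII", "MI_I", "MLX", "MMI", "M_MI", "MFI",
                   "MMF", "MIB", "MBB", "BBB", "MMB", "MFB"]

def all_templates(basic=BASIC_TEMPLATES):
    """Each basic type exists without and with a stop after the third slot."""
    names = []
    for name in basic:
        names.append(name)         # no stop at the end of the bundle
        names.append(name + "_")   # stop at the end of the bundle
    return names

def slot_units(name):
    """Unit letter for each of the three slots, ignoring stop markers."""
    return [ch for ch in name if ch != "_"]

def stops_after(name):
    """Zero-based slot indices that are followed by a stop ("_")."""
    stops, slot = [], -1
    for ch in name:
        if ch == "_":
            stops.append(slot)
        else:
            slot += 1
    return stops

templates = all_templates()
print(len(templates))          # 24
print(slot_units("MI_I_"))     # ['M', 'I', 'I']
print(stops_after("MI_I_"))    # [1, 2]
```

The 24 resulting templates fit in the 5-bit template field, which can distinguish at most 32 values.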

Itanium Instruction Example
  { .mii
    add r1 = r2, r3
    sub r4 = r4, r5 ;;
    shr r1 = r4, r1 ;;
  }
  { .mmi
    ld8 r2 = [r1] ;;
    st8 [r1] = r23
    tbit p1,p2 = r4, 5
  }
  { .mbb
    ld8 r45 = [r55]
    (p3) br.call b1 = func1
    (p4) br.cond Label1
  }
  { .mfi
    st4 [r45] = r6
    fmac f1 = f2, f3
    add r3 = r3, 8 ;;
  }

Predication
[Figure: traditional architectures (cmp followed by a branch to separate "then" and "else" blocks) versus the Itanium architecture (cmp sets predicates p1 and p2, which guard the "then" and "else" instructions)]
- Converts branches to conditional execution
- Executes multiple paths simultaneously
- Exposes parallelism and reduces the critical path
- Better utilizes wider machines
- Reduces mispredicted branches
The figure shows how traditional architectures view a particular segment of code. The jumps represent branches. If the condition in the first block is true, the "then" instructions 3 and 4 are executed; otherwise the "else" instructions 5 and 6 are executed. Traditional architectures try to predict the correct flow, and mispredicted branches incur significant performance penalties.
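The conversion from branches to conditional execution can be modelled with a tiny interpreter. This is a sketch under simplifying assumptions: only three made-up operations are supported, p0 stands for an always-true predicate, and the program and register names are hypothetical examples rather than code from the lecture.

```python
def execute(program, regs):
    """Run a straight-line predicated program and return the final registers.

    program: list of tuples (op, qualifying_predicate, *operands).
    There are no branches: every instruction is visited, and it takes
    effect only if its qualifying predicate is true.
    """
    preds = {"p0": True}   # p0 models an always-true predicate
    regs = dict(regs)
    for op, pred, *args in program:
        if not preds[pred]:
            continue       # predicated off: the instruction has no effect
        if op == "cmp.eq":
            p_true, p_false, a, b = args
            preds[p_true] = regs[a] == regs[b]
            preds[p_false] = not preds[p_true]
        elif op == "add":
            dest, a, b = args
            regs[dest] = regs[a] + regs[b]
        elif op == "sub":
            dest, a, b = args
            regs[dest] = regs[a] - regs[b]
        else:
            raise ValueError("unknown operation: " + op)
    return regs

# if (r1 == r2) r3 = r4 + r5; else r3 = r4 - r5;
PROGRAM = [
    ("cmp.eq", "p0", "p1", "p2", "r1", "r2"),
    ("add",    "p1", "r3", "r4", "r5"),   # "then" path, guarded by p1
    ("sub",    "p2", "r3", "r4", "r5"),   # "else" path, guarded by p2
]

print(execute(PROGRAM, {"r1": 1, "r2": 1, "r3": 0, "r4": 10, "r5": 3})["r3"])  # 13
print(execute(PROGRAM, {"r1": 1, "r2": 2, "r3": 0, "r4": 10, "r5": 3})["r3"])  # 7
```

Both paths appear in the same straight-line instruction stream, so there is no branch to mispredict; the cost is that instructions from the path not taken still occupy issue slots.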

More Examples of Parallel Compare
Source code:
  if (c1 && c2 && c3 && c4)
      r1 = r2 + r3;
  else  // !c1 || !c2 || !c3 || !c4
      r4 = r5 - r6;
Itanium code:
  cmp.eq p1,p2 = r0,r0 ;;
  cmp.eq.and.orcm p1,p2 = c1,r0
  cmp.eq.and.orcm p1,p2 = c2,r0
  cmp.eq.and.orcm p1,p2 = c3,r0
  cmp.eq.and.orcm p1,p2 = c4,r0
  (p1) add r1=r2,r3
  (p2) sub r4=r5,r6
[Figure: flow graph testing c1, c2, c3 and c4, with "then" and "else" exits]

More Examples of Parallel Compare
- Parallel cmp.eq.and or cmp.eq.or writes the same value to both predicates
- Use cmp.eq.and.orcm or cmp.eq.or.andcm to write complementary predicates
- Also called DeMorgan type (for complementary output)
Example:
  cmp.ge.and.orcm p6,p7 = 80, r4

And Predicate Usage
  cmp.eq.and p1,p2 = 80, r4
- p1 = p1 and (80 == r4)
How to initialize p1 and p2:
  cmp.eq.unc p1,p2 = r0,r0
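The predicate updates described on the parallel-compare slides can be modelled as small functions on a pair of booleans. This is a sketch of the update rules only: the argument result stands for the truth value of one condition, how that value is produced by a specific compare relation is left aside, and the function names are illustrative.

```python
from itertools import product

def cmp_unc(result):
    """Normal/unconditional compare: p1 = result, p2 = not result."""
    return result, not result

def cmp_and(p1, p2, result):
    """and type: both predicates are ANDed with the result."""
    return p1 and result, p2 and result

def cmp_or(p1, p2, result):
    """or type: both predicates are ORed with the result."""
    return p1 or result, p2 or result

def cmp_and_orcm(p1, p2, result):
    """and.orcm type: p1 &= result, p2 |= not result (complementary outputs)."""
    return p1 and result, p2 or (not result)

def cmp_or_andcm(p1, p2, result):
    """or.andcm type: p1 |= result, p2 &= not result (complementary outputs)."""
    return p1 or result, p2 and (not result)

def all_true(conditions):
    """Evaluate c1 && c2 && ... with one initialization and and.orcm updates."""
    p1, p2 = cmp_unc(True)   # initialization: p1 = 1, p2 = 0
    for c in conditions:
        p1, p2 = cmp_and_orcm(p1, p2, c)
    return p1, p2

for conds in product([False, True], repeat=4):
    p1, p2 = all_true(conds)
    assert p1 == all(conds) and p2 == (not all(conds))
print(all_true([True, True, True, True]))    # (True, False)
print(all_true([True, False, True, True]))   # (False, True)
```

Each update can only clear p1 and only set p2, so the final values do not depend on the order in which the conditions are applied. That order independence is what lets such compares target the same predicates in parallel.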

Design Philosophy: VLIW vs. Superscalar

VLIW - Compiler Challenges
- Very complex compiler
- Statically predictable branches
- Static disambiguation of memory addresses
- Information unavailable at static compile time
- Interprocedural optimization is difficult
- Code bloat
  - Compiler specifies the placement of each instruction and places NOPs to preserve instruction execution order
  - Many NOPs

HW Issues - Scalability
[Figure: PC feeding instruction memories 1 to N, each supplying operations of the form Op Rd Ra Rb to register files 1 to N]
- Register file ports cause these structures to get large
- Clustering: several functional units share a register file, and the compiler orchestrates the movement of data among them
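The effect of clustering on port count can be estimated from the 3N-ports rule given on the earlier VLIW slide (two read ports and one write port per ALU). The sketch below is a back-of-the-envelope calculation only: it assumes ALUs are divided evenly among clusters and ignores any extra ports needed to copy data between clusters. The function name is illustrative.

```python
def ports_per_register_file(num_alus, num_clusters=1):
    """Ports on each register file when ALUs are split evenly into clusters.

    Uses the 3-ports-per-ALU rule (two reads, one write). Ports for
    inter-cluster copies are not counted.
    """
    if num_clusters < 1 or num_alus % num_clusters != 0:
        raise ValueError("ALUs must divide evenly among clusters")
    return 3 * (num_alus // num_clusters)

for clusters in (1, 2, 4):
    print(clusters, ports_per_register_file(8, clusters))
# 1 cluster  -> 24 ports on a single register file
# 2 clusters -> 12 ports on each of two register files
# 4 clusters -> 6 ports on each of four register files
```

Clustering keeps each register file small at the price of making the compiler responsible for moving values between clusters.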

Other Hardware Issues
- Compatibility of code
  - Backward compatibility or upgradeability
  - Due to exposed implementation details
- Multiflow sold machines from 7-wide to 28-wide
  - Each required recompilation of the source program