COSC3330 Computer Architecture

Presentation on theme: "COSC3330 Computer Architecture"— Presentation transcript:

1 COSC3330 Computer Architecture
Lecture 16. VLIW
Instructor: Weidong Shi (Larry), PhD
Computer Science Department, University of Houston

2 Topic VLIW

3 Example Pipelined ILP Machine
Two Integer Units, Single Cycle Latency
Two Load/Store Units, Three Cycle Latency
Two Floating-Point Units, Four Cycle Latency
Max Throughput: Six Instructions per Cycle
(Figure: pipeline diagram. Each box is one pipeline stage; the depth of each unit is its latency in cycles.)
How much instruction-level parallelism (ILP) is required to keep the machine's pipelines busy?
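The closing question can be answered with a short calculation. The sketch below assumes every unit is fully pipelined, so each pipeline stage can hold one independent operation; the names `UNITS`, `required_ilp` and `peak_issue_rate` are illustrative and not from the lecture.

```python
# Functional-unit mix from the slide: name -> (number of units, latency in cycles).
UNITS = {
    "integer": (2, 1),
    "load/store": (2, 3),
    "floating-point": (2, 4),
}


def required_ilp(units):
    """Independent operations that must be in flight to fill every pipeline stage."""
    return sum(count * latency for count, latency in units.values())


def peak_issue_rate(units):
    """Operations that can start per cycle (one per functional unit)."""
    return sum(count for count, _latency in units.values())


print("peak issue rate:", peak_issue_rate(UNITS))  # 2 + 2 + 2
print("required ILP:", required_ilp(UNITS))        # 2*1 + 2*3 + 2*4
```

Under this assumption the machine needs 16 independent operations in flight to sustain its peak of six instructions per cycle.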

4 Sequential ISA Bottleneck
(Figure: flow from source code to execution.)
Sequential source code, e.g. a = foo(b); for (i=0, i< [truncated in the source]
Superscalar compiler: finds independent operations, schedules operations
Sequential machine code
Superscalar processor: checks instruction dependencies, schedules execution

5 VLIW: Very Long Instruction Word
Instruction word slots: Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2
Two Integer Units, Single Cycle Latency
Two Load/Store Units, Three Cycle Latency
Two Floating-Point Units, Four Cycle Latency
Multiple operations packed into one instruction
Each operation slot is for a fixed function
Constant operation latencies are specified
Architecture requires guarantee of:
  Parallelism within an instruction => no cross-operation RAW check
  No data use before data ready => no data interlocks
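Because the hardware performs neither check, the two guarantees have to hold in the code itself. The following is a minimal sketch of a checker for them, assuming one instruction word issues per cycle and using the latencies listed on the slide; `Op`, `LATENCY` and `find_violations` are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple, Optional, Tuple

# Latencies from the slide, in cycles.
LATENCY = {"int": 1, "mem": 3, "fp": 4}


class Op(NamedTuple):
    unit: str              # "int", "mem" or "fp"
    dest: Optional[str]    # destination register, or None
    srcs: Tuple[str, ...]  # source registers


def find_violations(words):
    """words[i] is the list of operations issued together in cycle i."""
    ready_at = {}  # register -> first cycle in which its pending result may be read
    violations = []
    for cycle, word in enumerate(words):
        for i, op in enumerate(word):
            for reg in op.srcs:
                if ready_at.get(reg, 0) > cycle:
                    violations.append((cycle, reg, "result not ready"))
                elif any(j != i and other.dest == reg
                         for j, other in enumerate(word)):
                    violations.append((cycle, reg, "RAW inside one instruction"))
        # Record this word's results only after all of its reads were checked.
        for op in word:
            if op.dest is not None:
                ready_at[op.dest] = cycle + LATENCY[op.unit]
    return violations


# A load followed by a dependent add: legal only if the add waits three cycles.
ok = [[Op("mem", "r1", ("r8",))], [], [], [Op("int", "r2", ("r1", "r3"))]]
too_early = [[Op("mem", "r1", ("r8",))], [Op("int", "r2", ("r1",))]]
same_word = [[Op("int", "r1", ("r2",)), Op("int", "r3", ("r1",))]]

print(find_violations(ok))
print(find_violations(too_early))
print(find_violations(same_word))
```

The empty lists in `ok` stand for instruction words that contain only NOPs.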

6 Very-Long Instruction Word (VLIW) Computers
Instruction word consists of several conventional 3-operand instructions, one for each of the ALUs
Instruction word format: Op Rd Ra Rb | Op Rd Ra Rb | Op Rd Ra Rb
(Figure: the PC addresses the instruction memory; each operation drives one ALU, and all ALUs share one register file.)
Register file has 3N ports to feed N ALUs.
All ALU-ALU communication takes place via register file.

7 Why VLIWs?
Opportunity for much simpler hardware
Compiler discovers dependencies
Compiler places the instructions
Simple encodings
Potentially lower # of transistors than other designs
Reduced speculation, Out-of-Order not needed
Size efficiencies, price, power consumption
Is this true for Itanium?

8 VLIW Compiler Responsibilities
The compiler:
  Schedules to maximize parallel execution
  Guarantees intra-instruction parallelism
  Schedules to avoid data hazards
  Typically separates operations with explicit NOPs
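These responsibilities can be illustrated with a toy scheduler. The sketch below is a simple greedy placement, not the algorithm of any real VLIW compiler: it assumes the six-slot machine from the earlier slides, considers only RAW dependences, and assumes each register is written at most once. All names (`SLOTS`, `schedule`, the operation tuples) are made up for the example.

```python
LATENCY = {"int": 1, "mem": 3, "fp": 4}  # cycles, from the example machine
SLOTS = {"int": 2, "mem": 2, "fp": 2}    # operation slots per instruction word


def schedule(ops):
    """ops: list of (name, unit, dest, srcs) in program order.

    Returns a list of instruction words; each word is a flat list of six
    operation names, with unused slots filled by explicit "nop" entries.
    """
    ready_at = {}  # register -> first cycle in which it can be read
    words = []     # per cycle: unit -> names placed in that unit's slots
    for name, unit, dest, srcs in ops:
        # Earliest cycle in which all source operands are available.
        cycle = max((ready_at.get(reg, 0) for reg in srcs), default=0)
        while True:
            while len(words) <= cycle:
                words.append({unit_name: [] for unit_name in SLOTS})
            if len(words[cycle][unit]) < SLOTS[unit]:
                break
            cycle += 1  # slots of this unit type are full, try the next word
        words[cycle][unit].append(name)
        ready_at[dest] = cycle + LATENCY[unit]
    padded = []
    for word in words:
        flat = []
        for unit_name, width in SLOTS.items():
            flat += word[unit_name] + ["nop"] * (width - len(word[unit_name]))
        padded.append(flat)
    return padded


program = [
    ("ld_a", "mem", "r1", ("r8",)),
    ("ld_b", "mem", "r2", ("r9",)),
    ("fmul", "fp", "f1", ("r1", "r2")),
    ("add", "int", "r3", ("r8", "r9")),
    ("fadd", "fp", "f2", ("f1",)),
]

for cycle, word in enumerate(schedule(program)):
    print(cycle, word)
```

For this dependence chain most slots end up holding NOPs, which is the code-size cost of encoding the schedule explicitly.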

9 Design Philosophy: VLIW vs. Superscalar
(Figure: two paths from the same normal source code to the same ILP hardware.)
Superscalar: normal compiler produces RISC object code; scheduling and operation-independence recognition are done by hardware at run time.
VLIW: normal compiler plus scheduling and operation-independence-recognizing software; the work is done at compile time.
The same ILP hardware is used in both cases.

10 Early VLIW Machines
Multiflow Trace (1987)
  commercialization of ideas from Fisher’s Yale group including “trace scheduling”
  available in configurations with 7, 14, or 28 operations/instruction
  28 operations packed into a 1024-bit instruction word
Cydrome Cydra-5 (1987)
  7 operations encoded in 256-bit instruction word
  rotating register file

11 Josh Fisher
2003 Eckert-Mauchly Award
“In recognition of 25 years of seminal contributions to instruction-level parallelism, pioneering work on VLIW architectures, and the formulation of the Trace Scheduling compilation technique”

12 Intel/HP EPIC
Explicitly Parallel Instruction Computing (EPIC)
A close relative of VLIW (e.g., the compiler holds the key to high performance)
New Intel architecture (designed from the ground up)
Not compatible with x86
64-bit, IA-64 (not x86-64)
RISC + Superscalar
An Itanium Instruction Bundle:
  ld4 r43=[r38]
  add r38=16,r [operand truncated in the source]
  br.call.sptk b0=printf# ;;

13 Intel Itanium
Execution: 6 inst. per clock
10 stage pipeline
4 ALU, 4 Multimedia ALU, 4 FP (up to 8 FP ops./cycle), 2 Load / Store, 3 Branch units
[...] bit general purpose registers
[...] bit floating-point registers

14 Intel Itanium ISA
Itanium Instruction “Bundle” (VLIW)
  128 bits each
  Contains three Itanium instructions (aka syllables)
  Template bits in each bundle specify dependencies both within a bundle as well as between sequential bundles
  A run of independent instructions, which may span bundles, forms a “group” (delimited by stops)
Each Itanium Instruction
  Fixed length, 41 bits long
  Left-most 4 bits (40-37) are the major opcode (e.g. FP ld/st, INT ld/st, ALU)
  Contains at most three 7-bit register specifiers
(Figure: bundle layout with field boundaries at bits 127, 86, 45, 5 and 4: Instruction Slot 1, Instruction Slot 2, Instruction Slot 3, Template.)
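The field widths add up exactly: 5 template bits plus three 41-bit slots make 128 bits. A minimal sketch of packing and unpacking such a bundle follows; it assumes the template occupies bits 4-0 and the slots sit at bits 45-5, 86-46 and 127-87, matching the boundaries in the figure. The function names are illustrative, and the slot contents are arbitrary bit patterns rather than real Itanium encodings.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slots):
    """Combine a 5-bit template and three 41-bit instructions into one integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template
    for index, insn in enumerate(slots):
        if not 0 <= insn <= SLOT_MASK:
            raise ValueError("each instruction must fit in 41 bits")
        bundle |= insn << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [(bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & SLOT_MASK
             for index in range(3)]
    return template, slots


def major_opcode(insn):
    """Bits 40-37 of a 41-bit instruction."""
    return (insn >> 37) & 0xF


# Arbitrary bit patterns, chosen only to exercise the field boundaries.
example_slots = [0x8 << 37, 1, SLOT_MASK]
bundle = pack_bundle(0x02, example_slots)
print(hex(bundle), bundle.bit_length())
print(unpack_bundle(bundle))
```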

15 Intel Itanium ISA
Each IA-64 instruction is categorized into one of 6 types and may be executed on one or more execution unit types.
4 functional unit categories:
  I unit (integer)
  F unit (floating-point)
  M unit (memory)
  B unit (branch)
6 instruction types:
  Integer ALU (A-type), executed on M or I units
  Non-ALU Integer (I-type), executed on I units
  Memory (M-type), executed on M units
  Floating-point (F-type), executed on F units
  Branch (B-type), executed on B units
  Extended (L/X-type), executed on I or B units
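The type-to-unit mapping can be written down as a small table and used to check whether three instructions fit a given bundle template. This is a sketch under simplifying assumptions: the extended L/X type is left out because it occupies two slots, so MLX templates are rejected, and `UNITS_FOR_TYPE` and `fits` are names invented for the example.

```python
# Instruction type -> unit types that can execute it (from the slide).
UNITS_FOR_TYPE = {
    "A": {"M", "I"},  # integer ALU
    "I": {"I"},       # non-ALU integer
    "M": {"M"},       # memory
    "F": {"F"},       # floating-point
    "B": {"B"},       # branch
}


def fits(template, insn_types):
    """True if each instruction type may execute on the unit its slot names.

    template is a name such as "MII" or "MI_I"; underscores (stops) are ignored.
    """
    slots = template.replace("_", "")
    if len(slots) != 3 or len(insn_types) != 3:
        raise ValueError("expected three slots and three instructions")
    if any(slot not in "MIFB" for slot in slots):
        raise ValueError("templates with L/X slots are not handled in this sketch")
    return all(slot in UNITS_FOR_TYPE[insn_type]
               for slot, insn_type in zip(slots, insn_types))


print(fits("MII", ["A", "I", "I"]))   # an add in the M slot, then two integer ops
print(fits("MMI", ["M", "I", "I"]))   # an I-type instruction cannot use an M slot
```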

16 Encoding Instruction Bundle
{ .mii
  ld4 r28=[r8]
  add r9 = 2,r1;;
  add r30= 1,r9
}
MI_I format => Template encoded “02”
Use “;;” as “stop bit” in assembly code to separate dependent instructions
Instructions between “;;” belong to the same “instruction group”
RAW and WAW are not allowed in the same instruction group
Each instruction slot can represent one functional unit type based on encoding (e.g. slot 0 can be M-unit or B-unit)
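The stop rule can be checked mechanically. The sketch below models each instruction as a small record with destination registers, source registers and a flag for a trailing “;;”, splits the stream into instruction groups, and reports RAW or WAW register hazards inside a group. It is a simplification: the architecture's special-case exceptions to the rule are ignored, and `Insn`, `split_groups` and `group_hazards` are hypothetical names.

```python
from typing import NamedTuple, Tuple


class Insn(NamedTuple):
    dests: Tuple[str, ...]
    srcs: Tuple[str, ...]
    stop: bool  # True if the instruction is followed by ";;"


def split_groups(insns):
    """Cut the instruction stream into instruction groups at each stop."""
    groups, current = [], []
    for insn in insns:
        current.append(insn)
        if insn.stop:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


def group_hazards(group):
    """Return ("RAW", reg) / ("WAW", reg) pairs found inside one group."""
    written = set()
    hazards = []
    for insn in group:
        hazards += [("RAW", reg) for reg in insn.srcs if reg in written]
        hazards += [("WAW", reg) for reg in insn.dests if reg in written]
        written.update(insn.dests)
    return hazards


# The bundle from the slide: ld4 r28=[r8]; add r9=2,r1 ;; add r30=1,r9
with_stop = [
    Insn(("r28",), ("r8",), False),
    Insn(("r9",), ("r1",), True),
    Insn(("r30",), ("r9",), False),
]
# Same code with the stop removed, which puts the dependent add in the same group.
without_stop = [insn._replace(stop=False) for insn in with_stop]

print([group_hazards(g) for g in split_groups(with_stop)])
print([group_hazards(g) for g in split_groups(without_stop)])
```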

17 Intel Itanium ISA
There are 12 basic bundle types:
  MII, MI_I, MLX, MMI, M_MI, MFI, MMF, MIB, MBB, BBB, MMB, MFB
Each basic type has two versions, one with a stop after the third slot and one without.
The versions with a stop after the third slot:
  MII_, MI_I_, MLX_, MMI_, M_MI_, MFI_, MMF_, MIB_, MBB_, BBB_, MMB_, MFB_

18 Itanium Instruction Example
{ .mii
  add r1 = r2, r3
  sub r4 = r4, r5;;
  shr r1 = r4, r1;;
}
{ .mmi
  ld8 r2 = [r1];;
  st8 [r1] = r23
  tbit p1,p2 = r4, 5
}
{ .mbb
  ld8 r45 = [r55]
  (p3)br.call b1=func1
  (p4)br.cond Label1
}
{ .mfi
  st4 [r45] = r6
  fmac f1=f2,f3
  add r3=r3, 8;;
}

19 Predication
(Figure: the same if/then/else code on traditional architectures, where cmp is followed by branches to the then and else blocks, and on the Itanium™ architecture, where cmp sets predicates p1 and p2 that guard the then and else instructions.)
Converts branches to conditional execution
Executes multiple paths simultaneously
Exposes parallelism and reduces critical path
Better utilizes wider machines
Reduces mispredicted branches
The figure shows how traditional architectures view a segment of code. The jumps represent branches. If the condition in the first block is true, the “then” instructions 3 and 4 are executed; otherwise the “else” instructions 5 and 6 are executed. Traditional architectures try to predict the correct flow and pay a significant performance penalty for each mispredicted branch.
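The effect of converting a branch into conditional execution can be sketched in a few lines. The example below is a software model only: registers are dictionary entries, the compare produces two complementary predicates, and both guarded operations are evaluated while only the one whose predicate is true updates its destination. The register names and both functions are made up for illustration.

```python
def run_with_branch(regs):
    """Conventional control flow: only one path is ever executed."""
    out = dict(regs)
    if out["r1"] == out["r2"]:
        out["r3"] = out["r4"] + out["r5"]  # then
    else:
        out["r3"] = out["r4"] - out["r5"]  # else
    return out


def run_predicated(regs):
    """If-converted form: no branch, both operations are issued."""
    out = dict(regs)
    # Models "cmp.eq p1,p2 = r1,r2": p1 gets the result, p2 its complement.
    p1 = out["r1"] == out["r2"]
    p2 = not p1
    guarded = [
        (p1, "r3", out["r4"] + out["r5"]),  # (p1) add r3 = r4, r5
        (p2, "r3", out["r4"] - out["r5"]),  # (p2) sub r3 = r4, r5
    ]
    for predicate, dest, value in guarded:
        if predicate:  # a false predicate turns the operation into a no-op
            out[dest] = value
    return out


equal = {"r1": 7, "r2": 7, "r3": 0, "r4": 10, "r5": 4}
different = {"r1": 7, "r2": 8, "r3": 0, "r4": 10, "r5": 4}
print(run_predicated(equal)["r3"], run_predicated(different)["r3"])
```

Since the two guarded operations have complementary predicates, they can share an issue cycle without a branch prediction being involved.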

20 More Examples of Parallel Compare
Itanium code:
  cmp.eq p1,p2 = r0,r0;;
  cmp.eq.and.orcm p1,p2 = c1,r0
  cmp.eq.and.orcm p1,p2 = c2,r0
  cmp.eq.and.orcm p1,p2 = c3,r0
  cmp.eq.and.orcm p1,p2 = c4,r0
  (p1) add r1=r2,r3
  (p2) sub r4=r5,r6
Equivalent source code:
  if (c1 && c2 && c3 && c4)
    r1 = r2 + r3;
  else // !c1 || !c2 || !c3 || !c4
    r4 = r5 - r6;
(Figure: conditions c1, c2, c3 and c4 are evaluated in parallel and select the then or else path.)

21 More Examples of Parallel Compare
Parallel cmp.eq.and or cmp.eq.or write the same values to both predicates
Use cmp.eq.and.orcm or cmp.eq.or.andcm for writing complementary predicates
Also called DeMorgan type (for complementary output)
Example: cmp.ge.and.orcm p6,p7 = 80, r4

22 And Predicate Usage
cmp.eq.and p1,p2 = 80, r4
  p1 = p1 and (80 == r4)
How to initialize p1 and p2:
  cmp.eq.unc p1,p2 = r0,r0
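The parallel compare types on the last three slides can be modeled as small update functions on the predicate pair. This is a simplified sketch: each function takes the current (p1, p2) and the boolean outcome of one comparison, the qualifying predicate is assumed true, and how the conditions are encoded in the compare operands is not modeled. The function names are illustrative.

```python
from itertools import permutations, product


def cmp_normal(result):
    """Plain compare: p1 gets the result, p2 its complement."""
    return result, not result


def cmp_and(p1, p2, result):
    """and type: a false result clears both predicates, otherwise no change."""
    return (p1, p2) if result else (False, False)


def cmp_or(p1, p2, result):
    """or type: a true result sets both predicates, otherwise no change."""
    return (True, True) if result else (p1, p2)


def cmp_and_orcm(p1, p2, result):
    """and.orcm type: a false result clears p1 and sets p2."""
    return (p1, p2) if result else (False, True)


def cmp_or_andcm(p1, p2, result):
    """or.andcm type: a true result sets p1 and clears p2."""
    return (True, False) if result else (p1, p2)


def all_conditions(conditions):
    """Models the sequence on the slide: initialize, then AND the conditions."""
    p1, p2 = cmp_normal(True)  # cmp.eq p1,p2 = r0,r0 gives p1 = 1, p2 = 0
    for outcome in conditions:
        p1, p2 = cmp_and_orcm(p1, p2, outcome)
    return p1, p2


print(all_conditions([True, True, True, True]))   # then path selected
print(all_conditions([True, False, True, True]))  # else path selected
```

Each update either leaves the pair unchanged or writes the same constants, so the result does not depend on the order of the compares, which is what lets them share one instruction group.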

23 Design Philosophy: VLIW vs. Superscalar

24 VLIW - Compiler Challenges
Very complex compiler
Statically predictable branches
Static disambiguation of memory addresses
Information unavailable at static compile time
Interprocedural optimization is difficult
Code bloat
  Compiler specifies placement of each instruction
  Places NOPs to preserve instruction execution order
  Many NOPs

25 HW Issues - Scalability
(Figure: the PC addresses Instruction Memory 1 through Instruction Memory N; each operation group, Op Rd Ra Rb | Op Rd Ra Rb | Op Rd Ra Rb, is served by its own register file, Register File 1 through Register File N.)
The number of register file ports causes these structures to get large.
Clustering: several functional units share a register file, and the compiler orchestrates the movement of data among the clusters.
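A rough port count shows why clustering helps. The sketch below reuses the earlier rule of three register file ports per ALU and assumes the ALUs divide evenly among the clusters; any extra ports needed for copies between clusters are ignored, and both function names are illustrative.

```python
PORTS_PER_ALU = 3  # two read ports and one write port per 3-operand operation


def ports_monolithic(n_alus):
    """Ports on a single register file shared by all ALUs."""
    return PORTS_PER_ALU * n_alus


def ports_per_cluster(n_alus, n_clusters):
    """Ports on each register file when the ALUs are split into clusters."""
    if n_clusters <= 0 or n_alus % n_clusters != 0:
        raise ValueError("ALUs must divide evenly among the clusters")
    return PORTS_PER_ALU * (n_alus // n_clusters)


for width in (7, 14, 28):
    print(width, ports_monolithic(width), ports_per_cluster(width, 7))
```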

26 Other Hardware Issues
Compatibility of code
  Backward compatibility or upgradeability is hard because implementation details are exposed
  Multiflow sold machines from 7-wide to 28-wide
  Each required recompilation of the source program

