Independence Instruction Set Architectures
- Conventional ISA
  - Instructions execute in order
  - No way of stating that instruction A is independent of instruction B
  - Independence must be detected at runtime; cost: time, power, complexity
- Idea: change the execution model at the ISA level
  - Allow specification of independence
- VLIW
  - Goals: flexible enough, and a good match for the technology
- Vectors and SIMD
  - Only for a set of the same operation
ECE 1773 – Fall 2006 © A. Moshovos (U. of Toronto). Some material by Wen-Mei Hwu (UIUC) and S. Mahlke (Michigan)
VLIW
- Very Long Instruction Word
- Instruction format: | ALU1 | ALU2 | MEM1 | control |
- #1 defining attribute: the four instructions are independent
  - Some parallelism can be expressed this way
- Extending the ability to specify parallelism
  - Take technology into consideration
  - Recall delay slots
  - This leads to the #2 defining attribute: NUAL (non-unit assumed latency)
NUAL vs. UAL
- Unit Assumed Latency (UAL)
  - The semantics of the program are that each instruction completes before the next one is issued
  - This is the conventional sequential model
- Non-Unit Assumed Latency (NUAL)
  - At least one operation has a non-unit assumed latency, L, which is greater than 1
  - The semantics of the program are correctly understood if exactly the next L-1 instructions are understood to have issued before this operation completes
- NUAL: result observation is delayed by L cycles
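The difference between the two models can be sketched with a toy interpreter. This is only an illustration under stated assumptions: the `run` helper, the register names, and the latency tables are hypothetical, one operation issues per cycle, and a result becomes visible only once its assumed latency has elapsed.

```python
def run(program, latency):
    """Issue one operation per cycle. A result becomes visible
    latency[kind] cycles after its operation issues (toy model)."""
    regs = {}
    in_flight = []  # entries are (visible_cycle, dest, value)
    for cycle, (dest, kind, fn, srcs) in enumerate(program):
        # Make visible every result whose assumed latency has elapsed.
        ready = [r for r in in_flight if r[0] <= cycle]
        in_flight = [r for r in in_flight if r[0] > cycle]
        for _, d, v in sorted(ready, key=lambda r: r[0]):
            regs[d] = v
        # Read sources as they are visible right now (unset registers read 0).
        value = fn(*(regs.get(s, 0) for s in srcs))
        in_flight.append((cycle + latency[kind], dest, value))
    # Drain whatever is still in flight at the end of the program.
    for _, d, v in sorted(in_flight, key=lambda r: r[0]):
        regs[d] = v
    return regs

UAL = {"load": 1, "alu": 1}    # every result is visible to the next instruction
NUAL = {"load": 3, "alu": 1}   # assumed: loads have latency L = 3

load = ("r1", "load", lambda: 7, ())
add = ("r2", "alu", lambda r1: r1 + 1, ("r1",))
nop = ("r0", "alu", lambda: 0, ())

naive = [load, add]                 # consumer right after the load
scheduled = [load, nop, nop, add]   # L-1 = 2 instructions issue in between

ual_result = run(naive, UAL)["r2"]
nual_naive_result = run(naive, NUAL)["r2"]
nual_scheduled_result = run(scheduled, NUAL)["r2"]
```

Under UAL the consumer sees the loaded value. Under NUAL the naive program reads the old register contents, so the compiler has to place L-1 other instructions (here NOPs) between producer and consumer.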
#2 Defining Attribute: NUAL
- Assumed latencies for all operations
- [Figure: a sequence of VLIW words (ALU1 | ALU2 | MEM1 | control); the results of each word become "visible" some words later]
- Glorified delay slots
- Additional opportunities for specifying parallelism
#3 DF: Resource Assignment
- The VLIW also implies allocation of resources
- This instruction format (ALU1 | ALU2 | MEM1 | control) maps well onto the following datapath:
  - ALU1 -> ALU
  - ALU2 -> ALU
  - MEM1 -> cache
  - control -> Control Flow Unit
VLIW: Definition
- Multiple independent functional units
- Each instruction consists of multiple independent instructions
  - Each of them is aligned to a functional unit
- Latencies are fixed
  - Architecturally visible
- The compiler packs instructions into a VLIW and also schedules all hardware resources
- The entire VLIW issues as a single unit
- Result: ILP with simple hardware
  - Compact, fast hardware control
  - Fast clock
- At least, this is the goal
VLIW Example
[Figure: block diagram with an I-fetch & issue unit, two FUs, two memory ports, and a multi-ported register file]
VLIW Example
- Instruction format: | ALU1 | ALU2 | MEM1 | control |
- Program order and execution order
  - [Figure: a sequence of VLIW words, each | ALU1 | ALU2 | MEM1 | control |]
- Instructions in a VLIW are independent
- Latencies are fixed in the architecture spec.
- Hardware does not check anything
- Software has to schedule so that everything works
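What "software has to schedule" means can be sketched with a greedy packer. This is a minimal illustration, not a production scheduler: the `schedule` function, the operation names, and the latency values are assumptions, only true data dependences are tracked, and the dependence graph is assumed to be acyclic.

```python
# Slots per VLIW word, matching the ALU1 | ALU2 | MEM1 | control format.
SLOTS = {"ALU": 2, "MEM": 1, "CTRL": 1}
# Assumed, architecturally visible latencies (all at least 1 cycle).
LATENCY = {"ALU": 1, "MEM": 2, "CTRL": 1}

def schedule(ops):
    """ops: list of (name, unit, deps) in program order.
    Returns a list of VLIW words; each word is a list of op names.
    An empty word is a word full of NOPs."""
    visible_at = {}          # op name -> cycle at which its result is visible
    remaining = list(ops)
    words = []
    cycle = 0
    while remaining:
        free = dict(SLOTS)
        word = []
        for op in list(remaining):
            name, unit, deps = op
            ready = all(d in visible_at and visible_at[d] <= cycle for d in deps)
            if ready and free[unit] > 0:
                free[unit] -= 1
                word.append(name)
                remaining.remove(op)
                # Latency >= 1, so nothing in this word can depend on this op.
                visible_at[name] = cycle + LATENCY[unit]
        words.append(word)
        cycle += 1
    return words

ops = [
    ("ld_x", "MEM", []),
    ("inc_i", "ALU", []),
    ("inc_j", "ALU", []),
    ("add", "ALU", ["ld_x", "inc_i"]),
    ("st_y", "MEM", ["add"]),
]
words = schedule(ops)
```

The three independent operations share the first word. The second word is empty because the load's two-cycle latency has not elapsed yet, which is the price of fixed latencies that the hardware never checks.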
Compilers are King
- VLIW philosophy:
  - “dumb” hardware
  - “intelligent” compiler
- Key technologies
  - Predicated Execution
  - Trace Scheduling
  - If-Conversion
  - Software Pipelining
Predicated Execution
- Instructions are predicated
  - if (cond) then perform instruction
  - In practice:
    - calculate result
    - if (cond) destination = result
- Converts control flow dependences to data dependences
- Example:
    if (a == 0) b = 1; else b = 2;
  becomes
    true;  pred = (a == 0)
    pred;  b = 1
    !pred; b = 2
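The conversion on this slide can be checked with a small model. It is a sketch under assumptions: `guarded_write` is a hypothetical helper standing in for one predicated instruction, and registers are modelled as a dictionary.

```python
def branchy(a):
    # Original control flow: one of two paths executes.
    if a == 0:
        b = 1
    else:
        b = 2
    return b

def guarded_write(regs, pred, dest, value):
    """Model of one predicated instruction: the result is always
    computed; the destination is updated only when pred holds."""
    if pred:
        regs[dest] = value

def predicated(a):
    # Straight-line code: every instruction is issued, none is skipped.
    regs = {"a": a}
    guarded_write(regs, True, "pred", regs["a"] == 0)   # true;  pred = (a == 0)
    guarded_write(regs, regs["pred"], "b", 1)           # pred;  b = 1
    guarded_write(regs, not regs["pred"], "b", 2)       # !pred; b = 2
    return regs["b"]

same = all(branchy(a) == predicated(a) for a in range(-3, 4))
```

Both writes to b now depend on the value of pred rather than on a branch, which is the control-to-data dependence conversion named above.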
Predicated Execution: Trade-offs
- Is predicated execution always a win?
- Is predication meaningful for VLIW only?
Trace Scheduling
- Goal:
  - Create a large continuous piece of code
  - Schedule to the max: exploit parallelism
- “Fact” of life:
  - Basic blocks are small
  - Scheduling across BBs is difficult
- But:
  - While many control flow paths exist
  - There are few “hot” ones
Trace Scheduling
- Static control speculation
  - Assume a specific path
  - Schedule accordingly
  - Introduce check and repair code where necessary
- First used to compact microcode:
  - Fisher, J. Trace scheduling: A technique for global microcode compaction. IEEE Transactions on Computers C-30, 7 (July 1981), 478-490.
Trace Scheduling: Example
- Assume AC is the common path
- [Figure: control-flow graph with blocks A, B, and C; A and C are scheduled together as one trace (A&C), and a Repair block is added off the trace]
- Expand the scope/flexibility of code motion
Trace Scheduling: Example #2
[Figure: blocks bA, bB, bC, bD, and bE before and after trace scheduling; the scheduled version adds a check, repair blocks, and an "all OK" exit]
Trace Scheduling Example
- Original code:
    test = a[i] + 20;
    if (test > 0) then
      sum = sum + 10
    else
      sum = sum + c[i]
    c[x] = c[y] + 10
- Straight code:
    test = a[i] + 20
    if (test <= 0) then goto repair
    … (assume delay)
- Repair code:
    repair:
      sum = sum - 10
      sum = sum + c[i]
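The check-and-repair idea in this example can be sketched as two functions that must agree. Several things here are assumptions: the speculative `total + 10` is placed before the check (the repair code on the slide implies it was hoisted), `goto repair` is modelled with an `if`, the store to `c[x]` is kept after the repair so it cannot alias the `c[i]` read, and `total` stands in for `sum`.

```python
def original(a, c, i, x, y, total):
    test = a[i] + 20
    if test > 0:
        total = total + 10
    else:
        total = total + c[i]
    c[x] = c[y] + 10
    return total

def trace_scheduled(a, c, i, x, y, total):
    # The trace assumes test > 0 is the hot path.
    test = a[i] + 20
    total = total + 10          # speculative: executed before the check resolves
    if test <= 0:               # check failed: run the repair code
        total = total - 10      # undo the speculative update
        total = total + c[i]    # perform the off-trace work
    c[x] = c[y] + 10
    return total

a = [5, -30]                    # a[0] takes the hot path, a[1] the repair path
results = []
for i in (0, 1):
    c_ref, c_new = [1, 2, 3], [1, 2, 3]
    ref = original(a, c_ref, i, 2, 0, 100)
    new = trace_scheduled(a, c_new, i, 2, 0, 100)
    results.append((ref, new, c_ref == c_new))
```

The hot path pays nothing extra, while the rare path pays for the undo and the redo.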
SIMD Single Instruction Multiple Data
SIMD: Motivation Contd.
- Recall: part of architecture is understanding application needs
- Many apps:
    for i = 0 to infinity
      a(i) = b(i) + c
- Same operation over many tuples of data
- Mostly independent across iterations
ECE1773 Portions from Hill, Wood, Sohi and Smith (Wisconsin), Culler (Berkeley), Kozyrakis (Stanford). © Moshovos
Some things are naturally parallel
Sequential Execution Model / SISD
    int a[N]; // N is large
    for (i = 0; i < N; i++)
      a[i] = a[i] * fade;
- Flow of control / thread
- One instruction at a time
- Optimizations possible at the machine level
[Figure: execution over time]
Data Parallel Execution Model / SIMD
    int a[N]; // N is large
    for all elements do in parallel
      a[i] = a[i] * fade;
- This has been tried before: ILLIAC III, UIUC, 1966
  - http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4038028&tag=1
  - http://ed-thelen.org/comp-hist/vs-illiac-iv.html
[Figure: execution over time]
SIMD Processing
- SCALAR (1 operation): add r3, r1, r2 computes a single r3 = r1 + r2
- SIMD (N operations): add r3, r1, r2 computes r3 = r1 + r2 for N pairs of elements at once
[Figure: pipeline timing diagram over cycles C1 to C10; each of two instructions goes through fetch, decode, and rf, followed by three exec/wb pairs]
[Figure: pipeline timing diagram over cycles C1 to C10; each of three instructions goes through fetch, decode, and rf, followed by three exec/wb pairs]
SIMD Architecture
- Replicate the datapath, not the control
- All PEs work in tandem
- CU orchestrates operations
[Figure: block diagram with a CU (μCU, regs), PEs with ALUs, and a MEM per PE]
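A rough model of "replicate the datapath, not the control" is sketched below. The `PE` class, the three-opcode instruction set, and the interleaved data layout are all assumptions made for illustration; the only point is that one instruction stream drives every PE.

```python
class PE:
    """One processing element: an accumulator plus its own local memory."""
    def __init__(self, local_mem):
        self.mem = list(local_mem)
        self.acc = 0

    def execute(self, opcode, operand):
        if opcode == "load":
            self.acc = self.mem[operand]
        elif opcode == "mul_imm":
            self.acc = self.acc * operand
        elif opcode == "store":
            self.mem[operand] = self.acc
        else:
            raise ValueError(opcode)

def control_unit(pes, program):
    # A single instruction stream; each instruction is broadcast to all PEs.
    for opcode, operand in program:
        for pe in pes:   # conceptually simultaneous in hardware
            pe.execute(opcode, operand)

a = list(range(1, 9))
fade = 3
num_pes = 4
# Interleave the array across the PEs: PE k holds a[k], a[k + 4], ...
pes = [PE(a[k::num_pes]) for k in range(num_pes)]

program = []
for addr in range(len(a) // num_pes):
    program += [("load", addr), ("mul_imm", fade), ("store", addr)]
control_unit(pes, program)

# Gather the distributed result back into one array.
result = [0] * len(a)
for k, pe in enumerate(pes):
    result[k::num_pes] = pe.mem
```

The program has six instructions for eight elements; with more PEs the same six instructions would cover more data.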
Multimedia extensions SIMD in modern CPUs
MMX: Basics
- Multimedia applications are becoming popular
- Are current ISAs a good match for them?
- Methodology:
  - Consider a number of “typical” applications
  - Can we do better?
  - Cost vs. performance vs. utility tradeoffs
- Net result: Intel’s MMX
- Can also be viewed as an attempt to maintain market share
  - If people are going to use these kinds of applications, we had better support them
Multimedia Applications
- Most multimedia apps have lots of parallelism:
    for I = here to infinity
      out[I] = in_a[I] * in_b[I]
- At runtime:
    out[0] = in_a[0] * in_b[0]
    out[1] = in_a[1] * in_b[1]
    out[2] = in_a[2] * in_b[2]
    out[3] = in_a[3] * in_b[3]
    …..
- Also, they work on short integers:
  - in_a[i] is 0 to 255, for example (color)
  - or 0 to 64k (16-bit audio)
Observations
- 32-bit registers are wasted
  - Only part of each register is used, and we know it
- ALUs are underutilized, and we know it
- Instruction specification is inefficient
  - Even though we know that many of the same operations will be performed, we still have to specify each of them individually
  - Instruction bandwidth
  - Discovering parallelism
- Memory ports?
  - Could read four elements of an array with one 32-bit load
  - Same for stores
  - The hardware will have a hard time discovering this
  - Coalescing and dependences
MMX Contd.
- Can do better than a traditional ISA
  - New data types
  - New instructions
- Pack data in 64-bit words
  - bytes
  - “words” (16 bits)
  - “double words” (32 bits)
- Operate on packed data like short vectors
  - SIMD
  - First used in Livermore S-1 (> 20 years)
MMX: Example
- Up to 8 operations (64-bit) go in parallel
- Potential improvement: 8x
  - In practice less, but still good
- Besides, another reason to think your machine is obsolete
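Eight byte-wide additions in one 64-bit operation can be imitated in software. The sketch below is a software emulation, not MMX itself: `pack`, `unpack`, and `padd_bytes` are hypothetical helper names, and the per-lane wraparound behaviour is chosen to resemble a packed byte add.

```python
LOW7 = 0x7F7F7F7F7F7F7F7F   # low seven bits of every byte lane
HIGH = 0x8080808080808080   # top bit of every byte lane

def pack(lanes):
    """Pack eight values in 0..255 into one 64-bit integer (lane 0 lowest)."""
    return int.from_bytes(bytes(lanes), "little")

def unpack(word):
    """Split a 64-bit integer back into its eight byte lanes."""
    return list(word.to_bytes(8, "little"))

def padd_bytes(x, y):
    """Add eight byte lanes at once, wrapping around within each lane."""
    # Adding only the low 7 bits of each lane cannot carry into the next lane.
    partial = (x & LOW7) + (y & LOW7)
    # Restore the top bit of each lane: x7 XOR y7 XOR carry-in from bit 6.
    return partial ^ ((x ^ y) & HIGH)

a = [0, 1, 127, 128, 200, 255, 16, 99]
b = [1, 255, 1, 128, 100, 255, 240, 1]
packed_sum = unpack(padd_bytes(pack(a), pack(b)))
scalar_sum = [(p + q) & 0xFF for p, q in zip(a, b)]
```

One add on the packed words replaces eight scalar adds, which is where the potential 8x comes from. The masking is needed only because a plain integer add would let carries leak between lanes; hardware SIMD units cut the carry chain at lane boundaries instead.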
Data Types
Vector Processors
- Scalar processors operate on single numbers (scalars)
- Vector processors operate on vectors of numbers
  - Linear sequences of numbers
- SCALAR (1 operation): add r3, r1, r2
- VECTOR (N operations): vadd.vv v3, v1, v2, applied across the vector length
From Christos Kozyrakis, Stanford
[Figure: pipeline timing diagram over cycles C1 to C10 with stages fetch, decode, rf, exec, and wb]
Example of Simple Vector Processor
What’s in a Vector Processor
- A scalar processor (e.g. a MIPS processor)
  - Scalar register file (32 registers)
  - Scalar functional units (arithmetic, load/store, etc.)
- A vector register file (a 2D register array)
  - Each register is an array of elements
  - E.g. 32 registers with 32 64-bit elements per register
  - MVL = maximum vector length = max # of elements per register
- A set of vector functional units
  - Integer, FP, load/store, etc.
  - Sometimes vector and scalar units are combined (share ALUs)
Vector Code Example
    Y[0:63] = Y[0:63] + a * X[0:63]
    LD      R0, a
    VLD     V1, 0(Rx)      # V1 = X[]
    VLD     V2, 0(Ry)      # V2 = Y[]
    VMUL.SV V3, R0, V1     # V3 = X[]*a
    VADD.VV V4, V2, V3     # V4 = Y[]+V3
    VST     V4, 0(Ry)      # store in Y[]
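The six-instruction sequence can be traced with a small functional model. This is a sketch under assumptions: the helper functions mirror the mnemonics on the slide but are hypothetical, memory is a flat Python list, the vector length is fixed at 64 elements, and integer data is used to keep the comparison exact.

```python
VLEN = 64  # elements per vector operation, matching Y[0:63]

def vld(mem, base):
    """VLD: load VLEN consecutive elements starting at base."""
    return mem[base:base + VLEN]

def vmul_sv(scalar, v):
    """VMUL.SV: multiply every element of v by a scalar."""
    return [scalar * x for x in v]

def vadd_vv(v1, v2):
    """VADD.VV: element-wise add of two vectors."""
    return [p + q for p, q in zip(v1, v2)]

def vst(mem, base, v):
    """VST: store the vector to consecutive locations starting at base."""
    mem[base:base + len(v)] = v

# Flat memory holding X followed by Y (layout chosen for illustration).
X = list(range(64))
Y = [10 * k for k in range(64)]
mem = X + Y
Rx, Ry = 0, 64
a = 3

R0 = a                    # LD      R0, a
V1 = vld(mem, Rx)         # VLD     V1, 0(Rx)
V2 = vld(mem, Ry)         # VLD     V2, 0(Ry)
V3 = vmul_sv(R0, V1)      # VMUL.SV V3, R0, V1
V4 = vadd_vv(V2, V3)      # VADD.VV V4, V2, V3
vst(mem, Ry, V4)          # VST     V4, 0(Ry)

expected = [yk + a * xk for xk, yk in zip(X, Y)]
```

Six instructions cover all 64 elements, where a scalar loop would issue a load, multiply, add, and store per element plus loop overhead.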