1 Recap Superscalar and VLIW Processors
2 A Model of an Ideal Processor
Provides a base for ILP measurements:
- No structural hazards
- Register renaming: infinite virtual registers, so all WAW and WAR hazards are avoided
- Machine with perfect speculation
- Branch prediction: perfect; no mispredictions
- Jump prediction: all jumps perfectly predicted
Only true data dependences are left, and these cannot be avoided.
3 Upper Bound on ILP
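With every other limit removed, the ILP of the ideal machine is set by the longest chain of true (read-after-write) dependences. The following is a minimal sketch of that bound, not a measurement tool: it assumes unit latency for every instruction, and the `(dest, sources)` encoding and the function name `dataflow_ilp` are hypothetical, introduced only for illustration.

```python
def dataflow_ilp(instrs):
    """Return (critical_path_cycles, ilp) for a straight-line program.

    instrs: list of (dest, [sources]) in program order.
    Assumptions: unit latency, unlimited issue width, perfect renaming,
    so only true (read-after-write) dependences constrain the schedule.
    """
    if not instrs:
        return 0, 0.0
    ready = {}   # register name -> cycle at which its newest value is available
    depth = 0    # length of the longest dependence chain seen so far
    for dest, sources in instrs:
        # An instruction may start once all of its source values exist.
        start = max((ready.get(s, 0) for s in sources), default=0)
        finish = start + 1
        # Renaming: a new write never waits for older readers or writers.
        ready[dest] = finish
        depth = max(depth, finish)
    return depth, len(instrs) / depth


program = [
    ("r1", ["r0"]),
    ("r2", ["r1"]),        # true dependence on r1
    ("r3", ["r0"]),
    ("r4", ["r3", "r2"]),  # true dependences on r3 and r2
    ("r1", ["r0"]),        # WAW/WAR on r1, removed by renaming
]
depth, ilp = dataflow_ilp(program)
print(depth, round(ilp, 2))
```

For this five-instruction example the longest chain is r1 -> r2 -> r4, so the program needs 3 cycles and the bound is 5/3 instructions per cycle.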
4 More Realistic HW: Branch Impact
Window: 2000 instructions; max 64 instr/cycle issue; many registers
5 Renaming Register Impact
Window: 2000 instructions; max 64 instr/cycle issue
6 Window Impact
64 instr/cycle issue; 64 renaming registers
7 How do we take advantage of this large amount of ILP?
- Superscalar processors
- VLIW (Very Long Instruction Word) processors
All high-performance modern processors (e.g., Pentium, Sparc, Itanium) use one of the above techniques.
8 Superscalar Pipelines
A pipeline that can complete more than 1 instruction per cycle is called a superscalar pipeline. We know how to build pipelines with multiple functional units, so we can execute more than one instruction at a time. If we can also issue more than 1 instruction into the pipe at a time, then we can complete more than 1 instruction per cycle. This implies that we need to fetch and decode 2 or more instructions per cycle.
9 Multiple Issue Processors
Superscalar processors
- Variable number of instructions issued per clock cycle
- Instruction scheduling:
  - Static: compiler technique; instructions execute in program order
  - Dynamic: scoreboarding or Tomasulo's algorithm; instructions execute out of order
VLIW (Very Long Instruction Word) and EPIC (Explicitly Parallel Instruction Computing)
- Fixed number of instructions formatted as one large instruction, or as a fixed instruction packet with the parallelism among instructions indicated explicitly
- Statically scheduled by the compiler
10 Multiple-Issue Processor Types
Common name                Issue structure  Hazard detection  Scheduling                Distinguishing characteristic            Examples
Superscalar (static)       dynamic          hardware          static                    in-order execution                       Sun UltraSPARC
Superscalar (dynamic)      dynamic          hardware          dynamic                   some out-of-order execution              IBM Power 2
Superscalar (speculative)  dynamic          hardware          dynamic with speculation  out-of-order execution with speculation  Pentium III/4, Alpha, HP PA8500, IBM RS64III
VLIW/LIW                   static           software          static                    no hazards between issue packets         Trimedia, i860
EPIC                       mostly static    mostly software   mostly static             explicit dependences marked by compiler  Itanium
11 Superscalar
- 0-8 instructions issued per cycle
- Static scheduling: all pipeline hazards are checked and instructions issue in order
- The pipeline control logic checks for hazards between the instructions already in execution and the instructions in the new issue packet
- In case of a hazard, only the instructions preceding the offending one in the instruction sequence are issued
- All instructions in the issue packet are checked at the same time by the issue hardware
- Complexity of the issue hardware: this stage is pipelined in all dynamic superscalar systems
[Figure: instruction memory, issue packet, issue HW, pipeline]
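The in-order issue rule described above can be sketched as follows. This is a simplified illustration, not the logic of any particular processor: only register hazards are modelled, structural hazards are ignored, and the `(dest, sources)` encoding and the name `issue_in_order` are assumptions made for the example.

```python
def issue_in_order(packet, in_flight_dests):
    """Return how many instructions of the issue packet issue this cycle.

    packet: list of (dest, [sources]) in program order; dest may be None.
    in_flight_dests: registers still to be written by executing instructions.
    """
    pending = set(in_flight_dests)
    issued = 0
    for dest, sources in packet:
        # RAW: a source is produced by an in-flight or earlier packet instruction.
        raw = any(s in pending for s in sources)
        # WAW: the destination is still waiting for an earlier write.
        waw = dest is not None and dest in pending
        if raw or waw:
            break  # in order: nothing after the first hazard may issue
        if dest is not None:
            pending.add(dest)
        issued += 1
    return issued


packet = [("r1", ["r2"]), ("r3", ["r1"]), ("r4", ["r5"])]
print(issue_in_order(packet, set()))    # second instruction reads r1
print(issue_in_order(packet, {"r2"}))   # first instruction must wait for r2
```

In the first call only one instruction issues, because the second depends on the first; the independent third instruction is held back as well, which is the cost of in-order issue.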
12 Example: Superscalar of Degree 3
[Figure: pipeline diagram with stages fetch, decode, execute, write back]
13 A Superscalar MIPS
- Issue 2 instructions simultaneously: 1 FP and 1 integer
- Fetch two instructions per clock cycle: one integer and one FP
- Can only issue the 2nd instruction if the 1st instruction issues
- Need more ports to the register file

Type  Pipe stages
Int.  IF  ID  EX  MEM WB
FP    IF  ID  EX  MEM WB
Int.      IF  ID  EX  MEM WB
FP        IF  ID  EX  MEM WB
Int.          IF  ID  EX  MEM WB
FP            IF  ID  EX  MEM WB
14 Limits to Superscalar Execution
- Difficulties in scheduling within the constraints on the number of functional units and the ILP in the code chunk
- Instruction decode complexity increases with the number of issued instructions
- Data and control dependences are in general more costly in a superscalar processor than in a single-issue processor
- Techniques to enlarge the instruction window to extract more ILP are important
15 Some VLIW Characteristics
- Can be hard to exploit parallelism: n functional units and k pipeline stages need n x k independent instructions to keep the machine busy
- Memory and register bandwidth
- Complexity increases with the number of functional units
- Code size
- Relies heavily on compiler technology
16 Unrolled Loop that Minimizes Stalls for 1-issue Pipelines
Loop: LD   F0,0(R1)
      LD   F6,-8(R1)
      LD   F10,-16(R1)
      LD   F14,-24(R1)
      ADDD F4,F0,F2
      ADDD F8,F6,F2
      ADDD F12,F10,F2
      ADDD F16,F14,F2
      SD   0(R1),F4
      SD   -8(R1),F8
      SD   -16(R1),F12
      SUBI R1,R1,#32
      BNEZ R1,LOOP
      SD   8(R1),F16    ; 8-32 = -24
14 clock cycles, or 3.5 per iteration
Latencies: LD to ADDD: 1 cycle; ADDD to SD: 2 cycles
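The claim that this ordering runs without stalls can be checked with a small model. The sketch below is an illustration under stated assumptions: it models only the two latencies given on the slide (LD to ADDD and ADDD to SD), treats every other dependence as needing no stall, and does not model the branch delay slot. The tuple encoding and the name `run_single_issue` are hypothetical.

```python
# Stall cycles required between a producer and a dependent consumer,
# taken from the slide: LD -> ADDD: 1 cycle, ADDD -> SD: 2 cycles.
LATENCY = {("LD", "ADDD"): 1, ("ADDD", "SD"): 2}


def run_single_issue(schedule):
    """Return (stall_cycles, total_cycles) for an in-order, 1-issue pipeline.

    schedule: list of (opcode, dest, [sources]) in issue order.
    Dependences not listed in LATENCY are assumed to need no stall.
    """
    producer = {}  # register -> (opcode that wrote it, cycle it issued)
    cycle = 0
    stalls = 0
    for op, dest, sources in schedule:
        cycle += 1
        for src in sources:
            if src in producer:
                p_op, p_cycle = producer[src]
                earliest = p_cycle + LATENCY.get((p_op, op), 0) + 1
                if earliest > cycle:
                    stalls += earliest - cycle
                    cycle = earliest
        if dest is not None:
            producer[dest] = (op, cycle)
    return stalls, cycle


unrolled = [
    ("LD", "F0", ["R1"]), ("LD", "F6", ["R1"]),
    ("LD", "F10", ["R1"]), ("LD", "F14", ["R1"]),
    ("ADDD", "F4", ["F0", "F2"]), ("ADDD", "F8", ["F6", "F2"]),
    ("ADDD", "F12", ["F10", "F2"]), ("ADDD", "F16", ["F14", "F2"]),
    ("SD", None, ["F4", "R1"]), ("SD", None, ["F8", "R1"]),
    ("SD", None, ["F12", "R1"]),
    ("SUBI", "R1", ["R1"]), ("BNEZ", None, ["R1"]),
    ("SD", None, ["F16", "R1"]),
]
rolled = [
    ("LD", "F0", ["R1"]), ("ADDD", "F4", ["F0", "F2"]),
    ("SD", None, ["F4", "R1"]), ("SUBI", "R1", ["R1"]), ("BNEZ", None, ["R1"]),
]
print(run_single_issue(unrolled))  # 4 iterations
print(run_single_issue(rolled))    # 1 iteration
```

Under this model the unrolled schedule takes 14 cycles with no stalls, while a single unscheduled iteration takes 8 cycles, 3 of them stalls.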
17 Loop Unrolling in Superscalar
      Integer instruction   FP instruction     Clock cycle
Loop: LD   F0,0(R1)         --                 1
      LD   F6,-8(R1)        --                 2
      LD   F10,-16(R1)      ADDD F4,F0,F2      3
      LD   F14,-24(R1)      ADDD F8,F6,F2      4
      LD   F18,-32(R1)      ADDD F12,F10,F2    5
      SD   0(R1),F4         ADDD F16,F14,F2    6
      SD   -8(R1),F8        ADDD F20,F18,F2    7
      SD   -16(R1),F12      --                 8
      SD   -24(R1),F16      --                 9
      SUBI R1,R1,#40        --                 10
      BNEZ R1,LOOP          --                 11
      SD   8(R1),F20        --                 12   ; 8-40 = -32
12 clocks, or 2.4 clocks per iteration
18 Multiple Issue Challenges
While the integer/FP split is simple for the hardware, a CPI of 0.5 is reached only for programs with:
- Exactly 50% FP operations, and
- No hazards
If more instructions issue at the same time, decode and issue become harder:
- Even a 2-issue superscalar must examine 2 opcodes and 6 register specifiers and decide whether 1 or 2 instructions can issue
Reducing the stalls becomes extremely difficult; it takes all the techniques covered so far and more advanced ones.
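The issue decision for a pair of instructions can be sketched as below. This is a minimal illustration under assumptions that are not on the slide: the integer-class instruction comes first in the pair, the opcode classification sets are illustrative, and only a dependence of the second instruction on the first instruction's destination is checked.

```python
# Illustrative opcode classes (assumed; loads and stores use the integer pipe).
INT_OPS = {"LD", "SD", "SUBI", "ADDI", "BNEZ"}
FP_OPS = {"ADDD", "MULTD"}


def can_dual_issue(first, second):
    """Return 2 if both instructions may issue together, otherwise 1.

    Each instruction is (opcode, dest, src1, src2); unused fields are None.
    Together the pair carries the 2 opcodes and 6 register specifiers
    that the issue logic has to examine.
    """
    op1, d1, *_ = first
    op2, d2, s2a, s2b = second
    # Structural rule: one integer-class instruction plus one FP instruction.
    if not (op1 in INT_OPS and op2 in FP_OPS):
        return 1
    # Data rule: the second must not read or overwrite the first's result.
    if d1 is not None and d1 in (d2, s2a, s2b):
        return 1
    return 2


print(can_dual_issue(("LD", "F10", "R1", None), ("ADDD", "F4", "F0", "F2")))
print(can_dual_issue(("LD", "F0", "R1", None), ("ADDD", "F4", "F0", "F2")))
```

The first pair is independent and issues together; in the second pair the ADDD reads the F0 being loaded, so only the load issues.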
19 VLIW Processors
Very Long Instruction Word (VLIW) processors
- Trade instruction space for simple decoding
- The long instruction word has room for many operations
- By definition, all the operations the compiler puts in the long instruction word can execute in parallel
- E.g., 2 integer operations, 2 FP operations, 2 memory references, 1 branch: at 16 to 24 bits per field, the word is 7*16 = 112 bits to 7*24 = 168 bits wide
- Needs a compiling technique that identifies the instructions to be put into each long instruction word
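The word layout in the example above can be written down directly. The sketch below is illustrative only: the slot mix is the one from the slide, while the names `SLOTS`, `word_width` and `pack_word` are hypothetical, and the operations passed in are assumed to be independent, which is what the compiler is supposed to guarantee.

```python
# Slot mix from the slide: 2 integer, 2 FP, 2 memory references, 1 branch.
SLOTS = {"int": 2, "fp": 2, "mem": 2, "branch": 1}


def word_width(bits_per_field):
    """Width in bits of one long instruction word."""
    return sum(SLOTS.values()) * bits_per_field


def pack_word(ops):
    """Fill one long word from a list of (unit, text) operations.

    Returns (packed_ops, leftover_ops, nop_slots). Operations that find no
    free slot of their unit type are left over for a later word.
    """
    free = dict(SLOTS)
    packed, leftover = [], []
    for unit, text in ops:
        if free.get(unit, 0) > 0:
            free[unit] -= 1
            packed.append((unit, text))
        else:
            leftover.append((unit, text))
    return packed, leftover, sum(free.values())


print(word_width(16), word_width(24))
loads = [("mem", "LD F0,0(R1)"), ("mem", "LD F6,-8(R1)"), ("mem", "LD F10,-16(R1)")]
packed, leftover, nops = pack_word(loads)
print(len(packed), len(leftover), nops)
```

With three loads available, two fit in the memory slots, one waits for the next word, and the other five slots stay empty. Empty slots are where the code-size cost of VLIW comes from.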
20 Loop Unrolling in VLIW
Memory reference 1   Memory reference 2   FP operation 1     FP operation 2     Int. op/branch    Clock
LD F0,0(R1)          LD F6,-8(R1)         --                 --                 --                1
LD F10,-16(R1)       LD F14,-24(R1)       --                 --                 --                2
LD F18,-32(R1)       LD F22,-40(R1)       ADDD F4,F0,F2      ADDD F8,F6,F2      --                3
LD F26,-48(R1)       --                   ADDD F12,F10,F2    ADDD F16,F14,F2    --                4
--                   --                   ADDD F20,F18,F2    ADDD F24,F22,F2    --                5
SD 0(R1),F4          SD -8(R1),F8         ADDD F28,F26,F2    --                 --                6
SD -16(R1),F12       SD -24(R1),F16       --                 --                 --                7
SD -32(R1),F20       SD -40(R1),F24       --                 --                 SUBI R1,R1,#56    8
SD 8(R1),F28         --                   --                 --                 BNEZ R1,LOOP      9
Unrolled 7 times to avoid delays
7 results in 9 clocks, or 1.3 clocks per iteration
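The figures quoted for this schedule, and how full the long words actually are, follow from counting the operations in each clock. A short sketch, where the per-clock counts are read off the table above and the variable names are illustrative:

```python
SLOTS_PER_WORD = 5   # 2 memory references, 2 FP operations, 1 integer/branch
ITERATIONS = 7       # the loop body is unrolled 7 times

# Number of operations placed in clocks 1..9 of the schedule.
ops_per_clock = [2, 2, 4, 3, 2, 3, 2, 3, 2]

clocks = len(ops_per_clock)
ops = sum(ops_per_clock)
clocks_per_iteration = clocks / ITERATIONS
avg_ops_per_clock = ops / clocks
slot_utilization = ops / (clocks * SLOTS_PER_WORD)

print(f"{ops} operations in {clocks} clocks")
print(f"{clocks_per_iteration:.2f} clocks per iteration")
print(f"{avg_ops_per_clock:.2f} operations per clock")
print(f"{slot_utilization:.0%} of the slots are used")
```

The schedule issues 23 operations in 9 clocks, about 2.6 per clock, and roughly half of the 45 available slots hold an operation.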
21 Commercial Superscalar and VLIW Processors
22 Pentium 4 Pipeline Stages vs. Pentium 3 Pipeline Stages
Typical P6 pipeline (10 stages): 1 Fetch, 2 Fetch, 3 Decode, 4 Decode, 5 Decode, 6 Rename, 7 ROB Rd, 8 Rdy/Sch, 9 Dispatch, 10 Exec
Typical Pentium 4 pipeline (20 stages): 1-2 TC Nxt IP, 3-4 TC Fetch, 5 Drive, 6 Alloc, 7-8 Rename, 9 Que, 10-12 Sch, 13-14 Disp, 15-16 RF, 17 Ex, 18 Flgs, 19 BrCk, 20 Drive
23 Pentium 3 Pipeline Architecture
- It is a 3-way issue superscalar
- It has 5 execution units (integer ALU, integer multiply, FP multiply, FP add, FP divide)
24 Pentium 3 Pipeline Stages
Stage  Work
1-2    Fetch
3-5    Decode
6      Rename registers
7      ROB (reordering instructions)
8      Rdy/Sch (scheduling instructions to be executed)
9      Dispatch
10     Exec
25 Pentium 4 Pipeline Stages
Stage  Work
1-2    Trace cache next instruction pointer
3-4    Trace cache fetch
5      Drive
6      Allocation
7-8    Rename
9      Queue
10-12  Schedule
13-14  Dispatch
15-16  Register files
17     Execute
18     Flags
19     Branch check
20     Drive
- Increasing the number of pipeline stages allows a higher clock frequency
- It took the industry 28 years to hit 1 GHz and only 18 months to reach 2 GHz
- The price paid for deeper pipelines is that it is very difficult to avoid stalls (that is why, when the Pentium 4 was introduced, its performance was worse than the Pentium 3's)
- It is a 5-issue superscalar processor
26-38 Pentium 4 Pipeline Stages in Detail
[Figure, repeated on each slide: Pentium 4 block diagram above the 20-stage pipeline. Labels: 3.2 GB/s system interface; L2 cache and control; BTB; BTB & I-TLB; decoder; trace cache; rename/alloc; uOP queues; schedulers; integer RF; FP RF; code ROM; store AGU; load AGU; ALU; FP move; FP store; Fmul; Fadd; MMX; SSE; L1 D-cache and D-TLB]
- TC Nxt IP (trace cache next instruction pointer): pointer indicating the location of the next instruction.
- TC Fetch (trace cache fetch): read the decoded instructions (uOPs) from the trace cache.
- Drive (wire delay): drive the uOPs to the allocator.
- Alloc (allocate): allocate the resources required for execution, including load buffers, store buffers, etc.
- Rename: register renaming.
- Que (queue): write into the uOP queue. uOPs are placed into the queues, where they are held until there is room in the schedulers.
- Sch (schedule): write into the schedulers and compute dependencies; watch for dependencies to resolve.
- Disp (dispatch): send the uOPs to the appropriate execution unit.
- RF (register file): read the register file. These are the source(s) for the pending operation (ALU or other).
- Ex (execute): execute the uOPs on the appropriate execution port.
- Flgs (flags): compute flags (zero, negative, etc.). These are typically input to a branch instruction.
- Br Ck (branch check): the branch operation compares the actual branch direction with the prediction.
- Drive (wire delay): drive the result of the branch check to the front end of the machine.
39 Commercial EPIC Processors Itanium
40 Itanium® Processor Family Architecture
EPIC: explicitly parallel instruction computing
- Instruction encoding: bundles and templates
- Large register resources: 128 integer, 128 floating point
- Support for software pipelining, predication, and speculation (control, data, load)
41 EPIC: Explicitly Parallel Instruction Computing
- Focused on parallel execution
- Instructions are issued in bundles
- Instructions are distributed among the processor's execution units according to type
- Currently up to two complete bundles can be dispatched per clock cycle
- Pipeline stages: 10 (Itanium® 1), 8 (Itanium® 2)
43 Instruction Format: Bundles & Templates
- Bundle: set of three instructions (41 bits each)
- Template: identifies the types of instructions in the bundle
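Three 41-bit instruction slots account for 123 bits of a 128-bit bundle, which leaves 5 bits for the template. The sketch below packs and unpacks such a bundle as a Python integer. The placement of the template in the low-order bits, followed by slots 0 to 2, reflects the usual description of the Itanium bundle layout but should be treated as an assumption here; the function names are hypothetical.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into one 128-bit integer."""
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template does not fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("instruction does not fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & TEMPLATE_MASK
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
        for i in range(3)
    ]
    return template, slots


bundle = pack_bundle(0x10, [0x123, SLOT_MASK, 0x1])
print(TEMPLATE_BITS + 3 * SLOT_BITS, bundle.bit_length() <= 128)
print(unpack_bundle(bundle) == (0x10, [0x123, SLOT_MASK, 0x1]))
```

The instruction values used here are arbitrary bit patterns, not real Itanium encodings.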
44 Instruction Format: Bundles & Templates
Instruction types:
- M: Memory
- I: Shifts and multimedia
- A: Integer arithmetic and logical unit
- B: Branch
- F: Floating point
- L+X: Long (move, branch, ...)
45 Bundle Templates
- Not all combinations of A, I, M, F, B, L and X are permitted
- Group "stops" are explicitly encoded as part of the template, so a stop cannot be placed just anywhere
- Some bundles are identical except for the group stop
46 [Figure: handwritten code (instructions separated by ";;" stops) -> code generator -> instruction bundles (three slots plus a template, padded with nops) -> fetch -> execution]
The code generator creates bundles, possibly including nops. Can the bundle pair execute in parallel? Itanium® fetches 2 bundles at a time for execution; they may or may not execute in parallel.
There are two difficulties:
1) Finding instruction triplets that match the defined templates.
2) Matching pairs of bundles that can execute in parallel.
47 Explicitly Parallel Instruction Computing (EPIC)
[Figure: 128-bit instruction bundles (slots S2, S1, S0 and template T) flow from the I-cache into the processor, are dispatched to the EPIC functional units (MEM, INT, FP, B, B, B), and leave as retired instruction bundles]
- Fetch one or more bundles for execution (implementation dependent; Itanium® takes two)
- Try to execute all instructions in parallel, depending on the available units
48 Itanium 8-stage Pipelines
- In-order issue, out-of-order completion
- All functional units are fully pipelined
- Small branch misprediction penalties
[Figure: front end IPG, ROT -> instruction buffer -> EXP, REN, REG, followed by the per-unit pipelines: integer EXE, DET, WRB; multimedia MM1, MM2; memory L1D1, L1D2, L1D3; floating point FP1, FP2, FP3, FP4]
49 Itanium 2 Eight-stage Pipeline
Core pipeline: IPG, ROT, EXP, REN, REG, EXE, DET, WB
FP pipeline:   FP1, FP2, FP3, FP4, WB
L2 pipeline:   L2N, L2I, L2A, L2M, L2D, L2C, L2W

Stage     Work
IPG       IP generate, L1I cache (6 inst) and TLB access
ROT       Instruction rotate and buffer (6 inst)
EXP       Expand, port assignment and routing
REN       INT and FP register rename
REG       INT and FP register file read
EXE       ALU execute, L1D cache and TLB access + L2 cache tag access
DET       Exception detect, branch correction
WB        Writeback, INT register update
FP1-WB    FP FMAC pipeline (2) + register write
L2N-L2I   L2 queue nominate/issue (4), speculatively issued with L1 request
L2A-L2W   L2 access, rotate, correct, write (4)