Presentation is loading. Please wait.

Presentation is loading. Please wait.

CSCE 432/832 High Performance Processor Architectures Superscalar Organization Adopted from Lecture notes based in part on slides created by Mikko H. Lipasti,

Similar presentations


Presentation on theme: "CSCE 432/832 High Performance Processor Architectures Superscalar Organization Adopted from Lecture notes based in part on slides created by Mikko H. Lipasti,"— Presentation transcript:

1 CSCE 432/832 High Performance Processor Architectures Superscalar Organization
Adopted from Lecture notes based in part on slides created by Mikko H. Lipasti, John Shen, Mark Hill, David Wood, Guri Sohi, and Jim Smith CS252 S05

2 Limitations of Scalar Pipelines
Scalar upper bound on throughput IPC <= 1 or CPI >= 1 Inefficient unified pipeline Long latency for each instruction Rigid pipeline stall policy One stalled instruction stalls all newer instructions 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

3 CSCE 432/832, Superscalar Organization
Parallel Pipelines Temporal vs. Spatial vs. Both Temporal vs. Spatial vs. Both 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

4 Intel Pentium Parallel Pipeline
486 pipeline on the left: 2 decode stages due to complex ISA Pentium paralle pipeline: U pipe is universal (can handle any op), V pipe can’t handle the most complex ops Stages: Fetch and align, decode & generate control word, decode control word & gen mem addr, ALU or D$ Used branch prediction Approximates dual-ported D$ with 8-way interleaved banks; have to deal with occasional bank conflicts 486 pipeline on the left: 2 decode stages due to complex ISA Pentium parallel pipeline: U pipe is universal (can handle any op), V pipe can’t handle the most complex ops 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

5 Diversified Pipelines
Unified pipelines are inefficient and unnecessary. In a scalar organization they make sense. With multiple issue, specialized pipelines make much more sense. Note that all instructions are treated identically in IF, ID, (also RD, more or less), and WB. Why? Because they behave very much the same way. What mix? 1st generation: Int & FPU, later: more; must determine for your workload. Unified pipelines are inefficient and unnecessary. In a scalar organization they make sense. With multiple issue, specialized pipelines make much more sense. Note that all instructions are treated identically in IF, ID, (also RD, more or less), and WB. Why? Because they behave very much the same way. What mix? 1st generation: Int & FPU, later: more; must determine for your workload. 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

6 Power4 Diversified Pipelines
PC I-Cache BR Scan Predict Fetch Q Decode Reorder Buffer BR/CR Issue Q CR Unit FX/LD 1 FX1 LD1 FX/LD 2 LD2 FX2 FP FP1 FP2 StQ D-Cache IBM Power4, introduced in Leadership performance at 1.3GHz. Most recently used in the Apple PowerMac G5, running at 2GHz. The G5 also has an Altivec unit for media processing instructions, not shown in this diagram. All units shown are fully pipelined Note two clusters of integer units, separate FP cluster, also BR and CR units. IBM Power4, introduced in Leadership performance at 1.3GHz. Most recently used in the Apple PowerMac G5, running at 2GHz. The G5 also has an Altivec unit for media processing instructions, not shown in this diagram. All units shown are fully pipelined Note two clusters of integer units, separate FP cluster, also BR and CR units. 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

7 Rigid Pipeline Stall Policy
Bypassing of Stalled Instruction Stalled Backward Propagation of Stalling Not Allowed Think McDonalds: even though looking over the menu is pipelined with ordering, food prep, and delivery, the two are stalled: if the person in front of you has a special order (no ketchup or pickles) and the employee can’t find it under the heatlamp, all customers behind this customer are stalled. Stupid. Wendy’s model is much better, where ordering is in-order, but food prep and order delivery occur out of order. Although I noted recently that Wendy’s is giving up on this as well. Think McDonalds: even though looking over the menu is pipelined with ordering, food prep, and delivery, the two are stalled: if the person in front of you has a special order (no ketchup or pickles) and the employee can’t find it under the heatlamp, all customers behind this customer are stalled. Stupid. Wendy’s model is much better, where ordering is in-order, but food prep and order delivery occur out of order. 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

8 CSCE 432/832, Superscalar Organization
Dynamic Pipelines Like Wendy’s: once ID/RD has determined what you need, you get queued up, and others behind you can get past you. In-order front end, OOO or dynamic execution in a micro-dataflow-machine, in-order backend Interlock hardware (later) maintains dependences Reorder buffer tracks completion, exceptions, provides precise interrupts: drain pipeline, restart Inorder machine state follows the sequential execution model inherited from nonpipelined/pipelined machines Like Wendy’s: once ID/RD has determined what you need, you get queued up, and others behind you can get past you. In-order front end, OOO or dynamic execution in a micro-dataflow-machine, in-order backend Interlock hardware (later) maintains dependences Reorder buffer tracks completion, exceptions, provides precise interrupts: drain pipeline, restart Inorder machine state follows the sequential execution model inherited from nonpipelined/pipelined machines 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

9 CSCE 432/832, Superscalar Organization
Interstage Buffers Key differentiator for OOO pipelines Scalar pipe: just pipeline latches or flip-flops In-order superscalar pipe: just wider ones Out-of-order: start to look more like register files, with random access necessary, or shift registers. May require effective crossbar between slots before/after buffer May need to be a multiported CAM Key differentiator for OOO pipelines Scalar pipe: just pipeline latches or flip-flops In-order superscalar pipe: just wider ones Out-of-order: start to look more like register files, with random access necessary, or shift registers. May require effective crossbar between slots before/after buffer May need to be a multiported CAM 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

10 Superscalar Pipeline Stages
Program Order Out of Order In Program Order 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

11 Overcoming Limitations of Scalar Pipelines
Scalar upper bound on throughput IPC <= 1 or CPI >= 1 Solution: wide (superscalar) pipeline Inefficient unified pipeline Long latency for each instruction Solution: diversified, specialized pipelines Rigid pipeline stall policy One stalled instruction stalls all newer instructions Solution: Out-of-order execution, distributed execution pipelines 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

12 Impediments to High IPC
11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

13 Superscalar Pipeline Design
Instruction Fetching Issues Instruction Decoding Issues Instruction Dispatching Issues Instruction Execution Issues Instruction Completion & Retiring Issues 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

14 CSCE 432/832, Superscalar Organization
Instruction Flow Objective: Fetch multiple instructions per cycle Challenges: Branches: control dependences Branch target misalignment Instruction cache misses Solutions Code alignment (static vs.dynamic) Prediction/speculation Instruction Memory PC 3 instructions fetched Don’t starve the pipeline: n/cycle Must fetch n/cycle from I$ Don’t starve the pipeline: n/cycle Must fetch n/cycle from I$ 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

15 CSCE 432/832, Superscalar Organization
I-Cache Organization R o w D e c d r Cache Line TAG Address 1 cache line = 1 physical row 1 cache line = 2 physical rows 1 cache line == 1 physical row 1 cache line == 2 physical rows 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

16 Fetch Alignment Fetch size n=4: losing fetch bandwidth if not aligned
Static/compiler: align branch targets at 00 (may need to pad with NOPs); implementation specific Fetch size n=4: losing fetch bandwidth if not aligned Static/compiler: align branch targets at 00 (may need to pad with NOPs); implementation specific 11/22/2018 CS252 S05

17 RIOS-I Fetch Hardware 1989 design used in the first IBM RS/6000 (POWER or RIOS-I): 4-wide machine with Int, FP, BR, CR (typically 2 or fewer issue) 2-way set-assoc, linesize 64B spans 4 physical rows, each instruction word interleaved Say fetch i is B10, i+1 is B11, i+2 is B12, i+3 is B13. T-logic detects misalignment and chooses appropriate index Mux selects based on tag match. I-buffer network rotates instructions so they leave in program order Efficiency: 13/16 x 4 + 1/16 x 3 + 1/16 x 2 + 1/16 x 1 = 3.65 I/cycle (pretty close to 4). Naïve would be ¼ x 4 + ¼ x 3 + ¼ x 2 + ¼ x 1 = 2.5 I/cycle “Interleaved sequential” improves by interleaving tag array; allows combining of ops from two cache lines. If both hit, can get 4 every cycle. 1989 design used in the first IBM RS/6000 (POWER or RIOS-I): 4-wide machine with Int, FP, BR, CR (typically 2 or fewer issue), 2-way set-assoc, linesize 64B spans 4 physical rows, each instruction word interleaved Say fetch i is B10, i+1 is B11, i+2 is B12, i+3 is B13. T-logic detects misalignment and chooses appropriate index Mux selects based on tag match. I-buffer network rotates instructions so they leave in program order Efficiency: 13/16 x 4 + 1/16 x 3 + 1/16 x 2 + 1/16 x 1 = 3.65 I/cycle (pretty close to 4). Naïve would be ¼ x 4 + ¼ x 3 + ¼ x 2 + ¼ x 1 = 2.5 I/cycle “Interleaved sequential” improves by interleaving tag array; allows combining of ops from two cache lines. If both hit, can get 4 every cycle. 11/22/2018 CS252 S05

18 CSCE 432/832, Superscalar Organization
Issues in Decoding Primary Tasks Identify individual instructions (!) Determine instruction types Determine dependences between instructions Two important factors Instruction set architecture Pipeline width RISC vs. CISC: inherently serial Find branches early: redirect fetch Detect dependences: n*n comparators (pair-wise) RISC: fixed length, regular format, easier CISC: can be multiple stages (lots of work), P6: I$ => decode is 5 cycles, often translates into internal RISC-like uops or ROPs RISC vs. CISC: inherently serial Find branches early: redirect fetch Detect dependences: nxn comparators (pairwise) RISC: fixed length, regular format, easier CISC: can be multiple stages (lots of work), P6: I$ => decode is 5 cycles, often translates into internal RISC-like uops or ROPs 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

19 Pentium Pro Fetch/Decode
16B/cycle delivered from I$ into FIFO instruction buffer Decoder 0 is fully general, 1 & 2 can handle only simple uops. Enter centralized window (reservation station); wait here until operands ready, structural hazards resolved. Why is this bad? Branch penalty; need a good branch predictor. Other option: predecode bits 16B/cycle delivered from I$ into FIFO instruction buffer Decoder 0 is fully general, 1 & 2 can handle only simple uops. Enter centralized window (reservation station); wait here until operands ready, structural hazards resolved. Why is this bad? Branch penalty; need a good branch predictor. Other option: predecode bits 11/22/2018 CS252 S05

20 Predecoding in the AMD K5
K5: notoriously late and slow, but still interesting (AMD’s first non-clone x86 processor) ~50% larger I$, predecode bits generated as instructions fetched from memory on a cache miss: Powerful principle in architecture: memoization! Predecode records start and end of x86 ops, # of ROPs, location of opcodes & prefixes Up to 4 ROPs per cycle. Also useful in RISCs: PPC 620 used 7 bits/inst PA8000, MIPS R10000 used 4/5 bits/inst These used to ID branches early, reduce branch penalty K5: notoriously late and slow, but still interesting (AMD’s first non-clone x86 processor) ~50% larger I$, predecode bits generated as instructions fetched from memory on a cache miss: Powerful principle in architecture: memorization! Predecode records start and end of x86 ops, # of ROPs, location of opcodes & prefixes Up to 4 ROPs per cycle. Also useful in RISCs: PPC 620 used 7 bits/inst; PA8000, MIPS R10000 used 4/5 bits/inst; These used to ID branches early, reduce branch penalty 11/22/2018 CS252 S05

21 Instruction Dispatch and Issue
Parallel pipeline Centralized instruction fetch Centralized instruction decode Diversified pipeline Distributed instruction execution 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

22 Necessity of Instruction Dispatch
Must have complex interstage buffers to hold instructions to avoid rigid pipeline Must have complex interstage buffers to hold instructions to avoid rigid pipeline 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

23 Centralized Reservation Station
Dispatch: based on type; Issue: when instruction enters functional unit to execute (same thing here) Centralized: efficient, shared resource; has scaling problems (later) Dispatch: based on type; Issue: when instruction enters functional unit to execute (same thing here) Centralized: efficient, shared resource; has scaling problems (later) 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

24 Distributed Reservation Station
Distributed, with localized control (easy win: break up based on data type, I.e. FP vs. integer) Less efficient utilization, but each unit is smaller since can be single-ported (for dispatch and issue) Must tune for proper utilization Must make 1000 little decisions (juggle 100 ping pong balls) Distributed, with localized control (easy win: break up based on data type, I.e. FP vs. integer) Less efficient utilization, but each unit is smaller since can be single-ported (for dispatch and issue) Must tune for proper utilization Must make 1000 little decisions (juggle 100 ping pong balls) 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

25 Issues in Instruction Execution
Current trends More parallelism  bypassing very challenging Deeper pipelines More diversity Functional unit types Integer Floating point Load/store  most difficult to make parallel Branch Specialized units (media) RAW/WAR/WAW for load/store requires 32-bit or 64-bit comparators (not 5-6 as in pipelined processor with register identifiers) RAW/WAR/WAW for load/store requires 32-bit or 64-bit comparators (not 5-6 as in pipelined processor with register identifiers) 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

26 CSCE 432/832, Superscalar Organization
Bypass Networks PC I-Cache BR Scan Predict Fetch Q Decode Reorder Buffer BR/CR Issue Q CR Unit FX/LD 1 FX1 LD1 FX/LD 2 LD2 FX2 FP FP1 FP2 StQ D-Cache O(n2) interconnect from/to FU inputs and outputs Associative tag-match to find operands Solutions (hurt IPC, help cycle time) Use RF (register file) only (Power4) with no bypass network Decompose into clusters (21264) Draw bypass between integer/br/cr units; 4 sources, 12 sinks Draw bypass between integer/br/cr units; 4 sources, 12 sinks 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

27 CSCE 432/832, Superscalar Organization
Specialized units NOTE TO SELF: update this to look at e.g. staggered adders in Pentium 4 instead (lose HW problem in Ch 3, though…) TI SuperSPARC integer unit: inorder processor, didn’t want to stall dual issue of two dependent ops. Can still issue, second op executed by ALU C IBM POWER/PowerPC FMA or MAF: 3 source operands (loss of regularity in ISA) MIPS R8000 also had this MIPS R10000 (OOO) gave up on it, decode cracks FMA into M and A 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

28 CSCE 432/832, Superscalar Organization
New Instruction Types Subword parallel vector extensions Media data (pixels, quantized datum) often 1-2 bytes Several operands packed in single 32/64b register {a,b,c,d} and {e,f,g,h} stored in two 32b registers Vector instructions operate on 4/8 operands in parallel New instructions, e.g. motion estimation me = |a – e| + |b – f| + |c – g| + |d – h| Substantial throughput improvement Usually requires hand-coding of critical loops 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

29 Issues in Completion/Retirement
Out-of-order execution ALU instructions Load/store instructions In-order completion/retirement Precise exceptions Memory coherence and consistency Solutions Reorder buffer Store buffer Load queue snooping (later) Precise exception – clean instr. Boundary for restart Memory consistency – WAW, also subtle multiprocessor issues (in 757) Memory coherence – RAR expect later load to also see new value seen by earlier load Precise exception – clean instruction boundary for restart Memory consistency – WAW, also subtle multiprocessor issues Memory coherence – RAR expect later load to also see new value seen by earlier load 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

30 A Dynamic Superscalar Processor
ROB – preallocated at dispatch: bookkeeping, store results, forward results (possibly) Complete – commit results to RF (no way to undo) Retire – memory update: delay stores to let loads go early ROB – pre-allocated at dispatch: bookkeeping, store results, forward results (possibly) Complete – commit results to RF (no way to undo) Retire – memory update: delay stores to let loads go early 11/22/2018 CS252 S05

31 Impediments to High IPC
11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

32 CSCE 432/832, Superscalar Organization
Superscalar Summary Instruction flow Branches, jumps, calls: predict target, direction Fetch alignment Instruction cache misses Register data flow Register renaming: RAW/WAR/WAW Memory data flow In-order stores: WAR/WAW Store queue: RAW Data cache misses 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05


Download ppt "CSCE 432/832 High Performance Processor Architectures Superscalar Organization Adopted from Lecture notes based in part on slides created by Mikko H. Lipasti,"

Similar presentations


Ads by Google