CSCE 432/832 High Performance Processor Architectures Superscalar Organization Adopted from Lecture notes based in part on slides created by Mikko H. Lipasti,

Slides:



Advertisements
Similar presentations
Computer Organization and Architecture
Advertisements

CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture VLIW Steve Ko Computer Sciences and Engineering University at Buffalo.
Dynamic ILP: Scoreboard Professor Alvin R. Lebeck Computer Science 220 / ECE 252 Fall 2008.
Lecture Objectives: 1)Define pipelining 2)Calculate the speedup achieved by pipelining for a given number of instructions. 3)Define how pipelining improves.
Superscalar Organization Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John P. Shen Updated by Mikko Lipasti.
Spring 2003CSE P5481 Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers)
National & Kapodistrian University of Athens Dep.of Informatics & Telecommunications MSc. In Computer Systems Technology Advanced Computer Architecture.
Mult. Issue CSE 471 Autumn 011 Multiple Issue Alternatives Superscalar (hardware detects conflicts) –Statically scheduled (in order dispatch and hence.
1  2004 Morgan Kaufmann Publishers Chapter Six. 2  2004 Morgan Kaufmann Publishers Pipelining The laundry analogy.
Review of CS 203A Laxmi Narayan Bhuyan Lecture2.
EECS 470 Superscalar Architectures and the Pentium 4 Lecture 12.
Computer ArchitectureFall 2007 © October 29th, 2007 Majd F. Sakr CS-447– Computer Architecture.
ECE/CS 552: Introduction to Superscalar Processors Instructor: Mikko H Lipasti Fall 2010 University of Wisconsin-Madison Lecture notes partially based.
OOO execution © Avi Mendelson, 4/ MAMAS – Computer Architecture Lecture 7 – Out Of Order (OOO) Avi Mendelson Some of the slides were taken.
COMP381 by M. Hamdi 1 Commercial Superscalar and VLIW Processors.
1 Superscalar Pipelines 11/24/08. 2 Scalar Pipelines A single k stage pipeline capable of executing at most one instruction per clock cycle. All instructions,
1 Advanced Computer Architecture Dynamic Instruction Level Parallelism Lecture 2.
Dynamic Pipelines. Interstage Buffers Superscalar Pipeline Stages In Program Order In Program Order Out of Order.
CSE 340 Computer Architecture Summer 2014 Basic MIPS Pipelining Review.
Spring 2003CSE P5481 Precise Interrupts Precise interrupts preserve the model that instructions execute in program-generated order, one at a time If an.
Lecture 9. MIPS Processor Design – Pipelined Processor Design #1 Prof. Taeweon Suh Computer Science Education Korea University 2010 R&E Computer System.
ECE/CS 552: Pipeline Hazards © Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim.
Advanced Pipelining 7.1 – 7.5. Peer Instruction Lecture Materials for Computer Architecture by Dr. Leo Porter is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike.
1 Lecture 10: Memory Dependence Detection and Speculation Memory correctness, dynamic memory disambiguation, speculative disambiguation, Alpha Example.
PipeliningPipelining Computer Architecture (Fall 2006)
CS203 – Advanced Computer Architecture ILP and Speculation.
ECE/CS 552: Introduction to Superscalar Processors and the MIPS R10000 © Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill,
Use of Pipelining to Achieve CPI < 1
CS 352H: Computer Systems Architecture
Dynamic Scheduling Why go out of style?
Instruction Level Parallelism
William Stallings Computer Organization and Architecture 8th Edition
/ Computer Architecture and Design
CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue
Morgan Kaufmann Publishers
PowerPC 604 Superscalar Microprocessor
Out of Order Processors
CS203 – Advanced Computer Architecture
Chapter 14 Instruction Level Parallelism and Superscalar Processors
Pipeline Implementation (4.6)
Appendix C Pipeline implementation
CS203 – Advanced Computer Architecture
ECE/CS 552: Pipelining to Superscalar
Flow Path Model of Superscalars
Superscalar Organization ECE/CS 752 Fall 2017
Pipelining: Advanced ILP
Morgan Kaufmann Publishers The Processor
Morgan Kaufmann Publishers The Processor
Lecture 6: Advanced Pipelines
Out of Order Processors
Pipelining Multicycle, MIPS R4000, and More
Superscalar Pipelines Part 2
Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2
Introduction to Computer Engineering
Lecture 11: Memory Data Flow Techniques
Ka-Ming Keung Swamy D Ponpandi
Adapted from the slides of Prof
Lecture 7: Dynamic Scheduling with Tomasulo Algorithm (Section 2.4)
How to improve (decrease) CPI
CC423: Advanced Computer Architecture ILP: Part V – Multiple Issue
* From AMD 1996 Publication #18522 Revision E
Adapted from the slides of Prof
Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/30/2011
CS203 – Advanced Computer Architecture
ECE/CS 552: Pipelining to Superscalar
CSC3050 – Computer Architecture
Dynamic Pipelines Like Wendy’s: once ID/RD has determined what you need, you get queued up, and others behind you can get past you. In-order front end,
The University of Adelaide, School of Computer Science
Ka-Ming Keung Swamy D Ponpandi
ECE/CS 552: Introduction to Superscalar Processors
Presentation transcript:

CSCE 432/832 High Performance Processor Architectures Superscalar Organization Adopted from Lecture notes based in part on slides created by Mikko H. Lipasti, John Shen, Mark Hill, David Wood, Guri Sohi, and Jim Smith CS252 S05

Limitations of Scalar Pipelines Scalar upper bound on throughput IPC <= 1 or CPI >= 1 Inefficient unified pipeline Long latency for each instruction Rigid pipeline stall policy One stalled instruction stalls all newer instructions 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

CSCE 432/832, Superscalar Organization Parallel Pipelines Temporal vs. Spatial vs. Both Temporal vs. Spatial vs. Both 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

Intel Pentium Parallel Pipeline 486 pipeline on the left: 2 decode stages due to complex ISA Pentium paralle pipeline: U pipe is universal (can handle any op), V pipe can’t handle the most complex ops Stages: Fetch and align, decode & generate control word, decode control word & gen mem addr, ALU or D$ Used branch prediction Approximates dual-ported D$ with 8-way interleaved banks; have to deal with occasional bank conflicts 486 pipeline on the left: 2 decode stages due to complex ISA Pentium parallel pipeline: U pipe is universal (can handle any op), V pipe can’t handle the most complex ops 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

Diversified Pipelines Unified pipelines are inefficient and unnecessary. In a scalar organization they make sense. With multiple issue, specialized pipelines make much more sense. Note that all instructions are treated identically in IF, ID, (also RD, more or less), and WB. Why? Because they behave very much the same way. What mix? 1st generation: Int & FPU, later: more; must determine for your workload. Unified pipelines are inefficient and unnecessary. In a scalar organization they make sense. With multiple issue, specialized pipelines make much more sense. Note that all instructions are treated identically in IF, ID, (also RD, more or less), and WB. Why? Because they behave very much the same way. What mix? 1st generation: Int & FPU, later: more; must determine for your workload. 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

Power4 Diversified Pipelines PC I-Cache BR Scan Predict Fetch Q Decode Reorder Buffer BR/CR Issue Q CR Unit FX/LD 1 FX1 LD1 FX/LD 2 LD2 FX2 FP FP1 FP2 StQ D-Cache IBM Power4, introduced in 2001. Leadership performance at 1.3GHz. Most recently used in the Apple PowerMac G5, running at 2GHz. The G5 also has an Altivec unit for media processing instructions, not shown in this diagram. All units shown are fully pipelined Note two clusters of integer units, separate FP cluster, also BR and CR units. IBM Power4, introduced in 2001. Leadership performance at 1.3GHz. Most recently used in the Apple PowerMac G5, running at 2GHz. The G5 also has an Altivec unit for media processing instructions, not shown in this diagram. All units shown are fully pipelined Note two clusters of integer units, separate FP cluster, also BR and CR units. 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

Rigid Pipeline Stall Policy Bypassing of Stalled Instruction Stalled Backward Propagation of Stalling Not Allowed Think McDonalds: even though looking over the menu is pipelined with ordering, food prep, and delivery, the two are stalled: if the person in front of you has a special order (no ketchup or pickles) and the employee can’t find it under the heatlamp, all customers behind this customer are stalled. Stupid. Wendy’s model is much better, where ordering is in-order, but food prep and order delivery occur out of order. Although I noted recently that Wendy’s is giving up on this as well. Think McDonalds: even though looking over the menu is pipelined with ordering, food prep, and delivery, the two are stalled: if the person in front of you has a special order (no ketchup or pickles) and the employee can’t find it under the heatlamp, all customers behind this customer are stalled. Stupid. Wendy’s model is much better, where ordering is in-order, but food prep and order delivery occur out of order. 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

CSCE 432/832, Superscalar Organization Dynamic Pipelines Like Wendy’s: once ID/RD has determined what you need, you get queued up, and others behind you can get past you. In-order front end, OOO or dynamic execution in a micro-dataflow-machine, in-order backend Interlock hardware (later) maintains dependences Reorder buffer tracks completion, exceptions, provides precise interrupts: drain pipeline, restart Inorder machine state follows the sequential execution model inherited from nonpipelined/pipelined machines Like Wendy’s: once ID/RD has determined what you need, you get queued up, and others behind you can get past you. In-order front end, OOO or dynamic execution in a micro-dataflow-machine, in-order backend Interlock hardware (later) maintains dependences Reorder buffer tracks completion, exceptions, provides precise interrupts: drain pipeline, restart Inorder machine state follows the sequential execution model inherited from nonpipelined/pipelined machines 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

CSCE 432/832, Superscalar Organization Interstage Buffers Key differentiator for OOO pipelines Scalar pipe: just pipeline latches or flip-flops In-order superscalar pipe: just wider ones Out-of-order: start to look more like register files, with random access necessary, or shift registers. May require effective crossbar between slots before/after buffer May need to be a multiported CAM Key differentiator for OOO pipelines Scalar pipe: just pipeline latches or flip-flops In-order superscalar pipe: just wider ones Out-of-order: start to look more like register files, with random access necessary, or shift registers. May require effective crossbar between slots before/after buffer May need to be a multiported CAM 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

Superscalar Pipeline Stages Program Order Out of Order In Program Order 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

Overcoming Limitations of Scalar Pipelines Scalar upper bound on throughput IPC <= 1 or CPI >= 1 Solution: wide (superscalar) pipeline Inefficient unified pipeline Long latency for each instruction Solution: diversified, specialized pipelines Rigid pipeline stall policy One stalled instruction stalls all newer instructions Solution: Out-of-order execution, distributed execution pipelines 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

Impediments to High IPC 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

Superscalar Pipeline Design Instruction Fetching Issues Instruction Decoding Issues Instruction Dispatching Issues Instruction Execution Issues Instruction Completion & Retiring Issues 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

CSCE 432/832, Superscalar Organization Instruction Flow Objective: Fetch multiple instructions per cycle Challenges: Branches: control dependences Branch target misalignment Instruction cache misses Solutions Code alignment (static vs.dynamic) Prediction/speculation Instruction Memory PC 3 instructions fetched Don’t starve the pipeline: n/cycle Must fetch n/cycle from I$ Don’t starve the pipeline: n/cycle Must fetch n/cycle from I$ 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

CSCE 432/832, Superscalar Organization I-Cache Organization R o w D e c d r • Cache Line TAG Address 1 cache line = 1 physical row 1 cache line = 2 physical rows 1 cache line == 1 physical row 1 cache line == 2 physical rows 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

Fetch Alignment Fetch size n=4: losing fetch bandwidth if not aligned Static/compiler: align branch targets at 00 (may need to pad with NOPs); implementation specific Fetch size n=4: losing fetch bandwidth if not aligned Static/compiler: align branch targets at 00 (may need to pad with NOPs); implementation specific 11/22/2018 CS252 S05

RIOS-I Fetch Hardware 1989 design used in the first IBM RS/6000 (POWER or RIOS-I): 4-wide machine with Int, FP, BR, CR (typically 2 or fewer issue) 2-way set-assoc, linesize 64B spans 4 physical rows, each instruction word interleaved Say fetch i is B10, i+1 is B11, i+2 is B12, i+3 is B13. T-logic detects misalignment and chooses appropriate index Mux selects based on tag match. I-buffer network rotates instructions so they leave in program order Efficiency: 13/16 x 4 + 1/16 x 3 + 1/16 x 2 + 1/16 x 1 = 3.65 I/cycle (pretty close to 4). Naïve would be ¼ x 4 + ¼ x 3 + ¼ x 2 + ¼ x 1 = 2.5 I/cycle “Interleaved sequential” improves by interleaving tag array; allows combining of ops from two cache lines. If both hit, can get 4 every cycle. 1989 design used in the first IBM RS/6000 (POWER or RIOS-I): 4-wide machine with Int, FP, BR, CR (typically 2 or fewer issue), 2-way set-assoc, linesize 64B spans 4 physical rows, each instruction word interleaved Say fetch i is B10, i+1 is B11, i+2 is B12, i+3 is B13. T-logic detects misalignment and chooses appropriate index Mux selects based on tag match. I-buffer network rotates instructions so they leave in program order Efficiency: 13/16 x 4 + 1/16 x 3 + 1/16 x 2 + 1/16 x 1 = 3.65 I/cycle (pretty close to 4). Naïve would be ¼ x 4 + ¼ x 3 + ¼ x 2 + ¼ x 1 = 2.5 I/cycle “Interleaved sequential” improves by interleaving tag array; allows combining of ops from two cache lines. If both hit, can get 4 every cycle. 11/22/2018 CS252 S05

CSCE 432/832, Superscalar Organization Issues in Decoding Primary Tasks Identify individual instructions (!) Determine instruction types Determine dependences between instructions Two important factors Instruction set architecture Pipeline width RISC vs. CISC: inherently serial Find branches early: redirect fetch Detect dependences: n*n comparators (pair-wise) RISC: fixed length, regular format, easier CISC: can be multiple stages (lots of work), P6: I$ => decode is 5 cycles, often translates into internal RISC-like uops or ROPs RISC vs. CISC: inherently serial Find branches early: redirect fetch Detect dependences: nxn comparators (pairwise) RISC: fixed length, regular format, easier CISC: can be multiple stages (lots of work), P6: I$ => decode is 5 cycles, often translates into internal RISC-like uops or ROPs 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

Pentium Pro Fetch/Decode 16B/cycle delivered from I$ into FIFO instruction buffer Decoder 0 is fully general, 1 & 2 can handle only simple uops. Enter centralized window (reservation station); wait here until operands ready, structural hazards resolved. Why is this bad? Branch penalty; need a good branch predictor. Other option: predecode bits 16B/cycle delivered from I$ into FIFO instruction buffer Decoder 0 is fully general, 1 & 2 can handle only simple uops. Enter centralized window (reservation station); wait here until operands ready, structural hazards resolved. Why is this bad? Branch penalty; need a good branch predictor. Other option: predecode bits 11/22/2018 CS252 S05

Predecoding in the AMD K5 K5: notoriously late and slow, but still interesting (AMD’s first non-clone x86 processor) ~50% larger I$, predecode bits generated as instructions fetched from memory on a cache miss: Powerful principle in architecture: memoization! Predecode records start and end of x86 ops, # of ROPs, location of opcodes & prefixes Up to 4 ROPs per cycle. Also useful in RISCs: PPC 620 used 7 bits/inst PA8000, MIPS R10000 used 4/5 bits/inst These used to ID branches early, reduce branch penalty K5: notoriously late and slow, but still interesting (AMD’s first non-clone x86 processor) ~50% larger I$, predecode bits generated as instructions fetched from memory on a cache miss: Powerful principle in architecture: memorization! Predecode records start and end of x86 ops, # of ROPs, location of opcodes & prefixes Up to 4 ROPs per cycle. Also useful in RISCs: PPC 620 used 7 bits/inst; PA8000, MIPS R10000 used 4/5 bits/inst; These used to ID branches early, reduce branch penalty 11/22/2018 CS252 S05

Instruction Dispatch and Issue Parallel pipeline Centralized instruction fetch Centralized instruction decode Diversified pipeline Distributed instruction execution 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

Necessity of Instruction Dispatch Must have complex interstage buffers to hold instructions to avoid rigid pipeline Must have complex interstage buffers to hold instructions to avoid rigid pipeline 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

Centralized Reservation Station Dispatch: based on type; Issue: when instruction enters functional unit to execute (same thing here) Centralized: efficient, shared resource; has scaling problems (later) Dispatch: based on type; Issue: when instruction enters functional unit to execute (same thing here) Centralized: efficient, shared resource; has scaling problems (later) 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

Distributed Reservation Station Distributed, with localized control (easy win: break up based on data type, I.e. FP vs. integer) Less efficient utilization, but each unit is smaller since can be single-ported (for dispatch and issue) Must tune for proper utilization Must make 1000 little decisions (juggle 100 ping pong balls) Distributed, with localized control (easy win: break up based on data type, I.e. FP vs. integer) Less efficient utilization, but each unit is smaller since can be single-ported (for dispatch and issue) Must tune for proper utilization Must make 1000 little decisions (juggle 100 ping pong balls) 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

Issues in Instruction Execution Current trends More parallelism  bypassing very challenging Deeper pipelines More diversity Functional unit types Integer Floating point Load/store  most difficult to make parallel Branch Specialized units (media) RAW/WAR/WAW for load/store requires 32-bit or 64-bit comparators (not 5-6 as in pipelined processor with register identifiers) RAW/WAR/WAW for load/store requires 32-bit or 64-bit comparators (not 5-6 as in pipelined processor with register identifiers) 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

CSCE 432/832, Superscalar Organization Bypass Networks PC I-Cache BR Scan Predict Fetch Q Decode Reorder Buffer BR/CR Issue Q CR Unit FX/LD 1 FX1 LD1 FX/LD 2 LD2 FX2 FP FP1 FP2 StQ D-Cache O(n2) interconnect from/to FU inputs and outputs Associative tag-match to find operands Solutions (hurt IPC, help cycle time) Use RF (register file) only (Power4) with no bypass network Decompose into clusters (21264) Draw bypass between integer/br/cr units; 4 sources, 12 sinks Draw bypass between integer/br/cr units; 4 sources, 12 sinks 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

CSCE 432/832, Superscalar Organization Specialized units NOTE TO SELF: update this to look at e.g. staggered adders in Pentium 4 instead (lose HW problem in Ch 3, though…) TI SuperSPARC integer unit: inorder processor, didn’t want to stall dual issue of two dependent ops. Can still issue, second op executed by ALU C IBM POWER/PowerPC FMA or MAF: 3 source operands (loss of regularity in ISA) MIPS R8000 also had this MIPS R10000 (OOO) gave up on it, decode cracks FMA into M and A 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

CSCE 432/832, Superscalar Organization New Instruction Types Subword parallel vector extensions Media data (pixels, quantized datum) often 1-2 bytes Several operands packed in single 32/64b register {a,b,c,d} and {e,f,g,h} stored in two 32b registers Vector instructions operate on 4/8 operands in parallel New instructions, e.g. motion estimation me = |a – e| + |b – f| + |c – g| + |d – h| Substantial throughput improvement Usually requires hand-coding of critical loops 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

Issues in Completion/Retirement Out-of-order execution ALU instructions Load/store instructions In-order completion/retirement Precise exceptions Memory coherence and consistency Solutions Reorder buffer Store buffer Load queue snooping (later) Precise exception – clean instr. Boundary for restart Memory consistency – WAW, also subtle multiprocessor issues (in 757) Memory coherence – RAR expect later load to also see new value seen by earlier load Precise exception – clean instruction boundary for restart Memory consistency – WAW, also subtle multiprocessor issues Memory coherence – RAR expect later load to also see new value seen by earlier load 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

A Dynamic Superscalar Processor ROB – preallocated at dispatch: bookkeeping, store results, forward results (possibly) Complete – commit results to RF (no way to undo) Retire – memory update: delay stores to let loads go early ROB – pre-allocated at dispatch: bookkeeping, store results, forward results (possibly) Complete – commit results to RF (no way to undo) Retire – memory update: delay stores to let loads go early 11/22/2018 CS252 S05

Impediments to High IPC 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05

CSCE 432/832, Superscalar Organization Superscalar Summary Instruction flow Branches, jumps, calls: predict target, direction Fetch alignment Instruction cache misses Register data flow Register renaming: RAW/WAR/WAW Memory data flow In-order stores: WAR/WAW Store queue: RAW Data cache misses 11/22/2018 CSCE 432/832, Superscalar Organization CS252 S05