

Dynamic Pipelines
Like Wendy’s: once ID/RD has determined what you need, you get queued up, and others behind you can get past you.
In-order front end; OOO (dynamic) execution in a micro-dataflow machine; in-order back end.
Interlock hardware (later) maintains dependences.
Reorder buffer tracks completion and exceptions, and provides precise interrupts: drain pipeline, restart.
In-order machine state follows the sequential execution model inherited from nonpipelined/pipelined machines.
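A minimal Python sketch of the reorder buffer idea on this slide, assuming a toy model in which instructions are allocated in program order, may complete in any order, and leave the buffer only from the oldest end. The class and method names (ReorderBuffer, dispatch, complete, retire) are made up for illustration and do not describe any particular processor.

```python
from collections import deque


class ReorderBuffer:
    """Toy reorder buffer: allocate in order, complete in any order, retire in order."""

    def __init__(self):
        self.entries = deque()  # oldest instruction sits at the left end

    def dispatch(self, name):
        # Allocate an entry in program order; the instruction has not finished yet.
        entry = {"name": name, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry):
        # Execution may finish out of order; we only mark the entry as done.
        entry["done"] = True

    def retire(self):
        # Retire from the oldest end only, stopping at the first unfinished entry.
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["name"])
        return retired


rob = ReorderBuffer()
load = rob.dispatch("load")
add = rob.dispatch("add")
mul = rob.dispatch("mul")

rob.complete(mul)        # youngest instruction finishes first
first = rob.retire()     # nothing retires: the older load is still pending
rob.complete(load)
second = rob.retire()    # only the load retires; add still blocks mul
rob.complete(add)
third = rob.retire()     # add and mul now retire together, in program order
```

Because retirement stops at the first unfinished entry, the architected state always corresponds to a clean instruction boundary, which is what makes a precise restart possible.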

Interstage Buffers
Key differentiator for OOO pipelines.
Scalar pipe: just pipeline latches or flip-flops.
In-order superscalar pipe: just wider ones.
Out-of-order: start to look more like register files, with random access necessary, or shift registers.
May require an effective crossbar between slots before/after the buffer.
May need to be a multiported CAM.

Superscalar Pipeline Stages
(Figure: front-end stages in program order, execution out of order, back-end stages in program order.)

Impediments to High IPC

Superscalar Pipeline Design
Instruction fetching issues
Instruction decoding issues
Instruction dispatching issues
Instruction execution issues
Instruction completion & retiring issues

Instruction Flow
Objective: fetch multiple instructions per cycle.
Challenges: branches (control dependences), branch target misalignment, instruction cache misses.
Solutions: code alignment (static vs. dynamic), prediction/speculation.
Don’t starve the pipeline: it consumes n instructions per cycle, so the front end must fetch n per cycle from the I$.
(Figure: PC indexes the instruction memory; 3 instructions fetched.)
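The misalignment problem can be sketched with a small calculation. This assumes, purely for illustration, that a fetch cannot cross a cache-line boundary, that instructions are 4 bytes, and that a line holds 4 of them; the function name and default parameters are hypothetical.

```python
def instructions_fetched(pc, fetch_width=4, line_bytes=16, inst_bytes=4):
    """Instructions delivered in one cycle when a fetch cannot cross a cache line."""
    slots_per_line = line_bytes // inst_bytes
    slot = (pc // inst_bytes) % slots_per_line   # position of the PC within its line
    return min(fetch_width, slots_per_line - slot)


aligned = instructions_fetched(0x100)      # PC at the start of a line
off_by_one = instructions_fetched(0x104)   # PC one slot into the line
last_slot = instructions_fetched(0x10C)    # PC in the final slot of the line
```

Under these assumptions a branch target that lands one slot into a line yields only 3 of the 4 possible instructions, and a target in the last slot yields 1, which is why code alignment helps keep the pipeline fed.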

I-Cache Organization
(Figure: two I-cache layouts indexed by address, each row holding a tag and cache line. In one, 1 cache line = 1 physical row; in the other, 1 cache line = 2 physical rows.)

Issues in Decoding
Primary tasks: identify individual instructions (!), determine instruction types, determine dependences between instructions.
Two important factors: instruction set architecture, pipeline width.
RISC: fixed length, regular format, easier.
CISC: finding instruction boundaries in variable-length code is inherently serial; decode can be multiple stages (lots of work). P6: I$ => decode is 5 cycles; often translates into internal RISC-like uops or ROPs.
Find branches early: redirect fetch.
Detect dependences: n x n comparators (pairwise).
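The pairwise dependence check can be sketched as follows. This is a simplified model, assuming each instruction has one destination and two source registers; raw_dependences and comparator_count are illustrative names, and the sketch reports every older writer that matches a source rather than only the most recent one.

```python
from itertools import combinations


def raw_dependences(group):
    """Return (older, younger) index pairs where the younger reads the older's destination.

    group: list of (dest, src1, src2) register names in program order.
    """
    deps = []
    for (i, older), (j, younger) in combinations(enumerate(group), 2):
        if older[0] in younger[1:]:
            deps.append((i, j))
    return deps


def comparator_count(n, sources_per_inst=2):
    """Comparators needed to check every source against every older destination."""
    return sources_per_inst * n * (n - 1) // 2


group = [
    ("r1", "r2", "r3"),   # i0: r1 = r2 op r3
    ("r4", "r1", "r5"),   # i1 reads r1 written by i0
    ("r6", "r4", "r1"),   # i2 reads r4 (from i1) and r1 (from i0)
    ("r7", "r8", "r9"),   # i3 is independent
]
deps = raw_dependences(group)
cost_4_wide = comparator_count(4)
cost_8_wide = comparator_count(8)
```

Doubling the decode width from 4 to 8 more than quadruples the comparator count in this model, which is the quadratic growth the slide refers to.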

Predecoding in the AMD K5
K5: notoriously late and slow, but still interesting (AMD’s first non-clone x86 processor).
~50% larger I$; predecode bits are generated as instructions are fetched from memory on a cache miss.
Powerful principle in architecture: memoization!
Predecode records the start and end of x86 ops, # of ROPs, and the location of opcodes & prefixes.
Up to 4 ROPs per cycle.
Also useful in RISCs: PPC 620 used 7 bits/inst; PA8000, MIPS R10000 used 4/5 bits/inst.
These are used to ID branches early and reduce the branch penalty.

Instruction Dispatch and Issue
Parallel pipeline: centralized instruction fetch, centralized instruction decode.
Diversified pipeline: distributed instruction execution.

Necessity of Instruction Dispatch
Must have complex interstage buffers to hold instructions, to avoid a rigid pipeline.

Centralized Reservation Station
Dispatch: based on type.
Issue: when an instruction enters a functional unit to execute (same thing here).
Centralized: efficient, shared resource; has scaling problems (later).
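A rough sketch of how a centralized reservation station lets ready instructions pass stalled ones. It assumes a toy model in which each waiting instruction holds a set of producer tags, results are broadcast by tag, and selection is oldest-first up to an issue width; all names here are hypothetical.

```python
class ReservationStation:
    """Toy centralized reservation station with tag wakeup and oldest-first select."""

    def __init__(self):
        self.entries = []  # kept in dispatch (program) order

    def dispatch(self, name, waiting_on):
        # waiting_on: tags of producers whose results this instruction still needs.
        self.entries.append({"name": name, "waiting_on": set(waiting_on)})

    def wakeup(self, tag):
        # A producer broadcasts its tag; every waiting entry clears that dependence.
        for entry in self.entries:
            entry["waiting_on"].discard(tag)

    def issue(self, width):
        # Select up to `width` ready entries, oldest first, and remove them.
        ready = [e for e in self.entries if not e["waiting_on"]][:width]
        ready_ids = {id(e) for e in ready}
        self.entries = [e for e in self.entries if id(e) not in ready_ids]
        return [e["name"] for e in ready]


rs = ReservationStation()
rs.dispatch("load", [])
rs.dispatch("add", ["load"])   # needs the load result
rs.dispatch("sub", [])         # independent, so it can pass the stalled add

cycle1 = rs.issue(width=2)
rs.wakeup("load")              # load result becomes available
cycle2 = rs.issue(width=2)
```

The wakeup loop touches every entry on every broadcast, which hints at why a single shared structure becomes harder to scale as it grows.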

Distributed Reservation Station
Distributed, with localized control (easy win: break up based on data type, i.e. FP vs. integer).
Less efficient utilization, but each unit is smaller since it can be single-ported (for dispatch and issue).
Must tune for proper utilization.
Must make 1000 little decisions (juggle 100 ping pong balls).

Issues in Instruction Execution
Current trends:
More parallelism -> bypassing very challenging
Deeper pipelines
More diversity
Functional unit types: integer; floating point; load/store (most difficult to make parallel); branch; specialized units (media).
RAW/WAR/WAW detection for load/store requires 32-bit or 64-bit address comparators (not the 5-6 bit comparators a pipelined processor needs for register identifiers).

Bypass Networks
O(n^2) interconnect from/to FU inputs and outputs.
Associative tag-match to find operands.
Solutions (hurt IPC, help cycle time): use RF only (Power4) with no bypass network; decompose into clusters (21264).
(Figure labels: PC, I-Cache, BR Scan, Predict, Fetch Q, Decode, Reorder Buffer, BR/CR Issue Q, CR Unit, FX/LD 1, FX1, LD1, FX/LD 2, LD2, FX2, FP, FP1, FP2, StQ, D-Cache.)
Draw bypass between integer/br/cr units; 4 sources, 12 sinks.
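The quadratic growth can be illustrated with a simple count. This assumes a full bypass network in which every functional unit output can feed every functional unit input, with two inputs per unit; the function name and the two-input assumption are illustrative only and are not derived from the machine in the figure.

```python
def full_bypass_paths(num_units, inputs_per_unit=2):
    """Point-to-point paths needed when every FU output can feed every FU input."""
    outputs = num_units
    inputs = num_units * inputs_per_unit
    return outputs * inputs


growth = [full_bypass_paths(n) for n in (2, 4, 8)]
```

Each doubling of the number of units quadruples the path count in this model, which is why designs either drop the bypass network or split the units into clusters.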

Specialized units
NOTE TO SELF: update this to look at e.g. staggered adders in Pentium 4 instead (lose HW problem in Ch 3, though…)
TI SuperSPARC integer unit: in-order processor, didn’t want to stall dual issue of two dependent ops. Can still issue; the second op is executed by ALU C.
IBM POWER/PowerPC FMA or MAF: 3 source operands (loss of regularity in ISA).
MIPS R8000 also had this.
MIPS R10000 (OOO) gave up on it; decode cracks FMA into M and A.

New Instruction Types
Subword parallel vector extensions.
Media data (pixels, quantized datum) often 1-2 bytes.
Several operands packed in a single 32/64b register: {a,b,c,d} and {e,f,g,h} stored in two 32b registers.
Vector instructions operate on 4/8 operands in parallel.
New instructions, e.g. motion estimation: me = |a - e| + |b - f| + |c - g| + |d - h|
Substantial throughput improvement.
Usually requires hand-coding of critical loops.
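The motion-estimation expression above can be sketched in Python, assuming four unsigned bytes packed little-endian into one 32-bit word. The helper names (pack_bytes, unpack_bytes, sum_abs_diff) are hypothetical; real hardware would compute the whole sum in a single instruction rather than a loop.

```python
def pack_bytes(values):
    """Pack byte-sized values into one integer, first value in the lowest byte."""
    word = 0
    for i, value in enumerate(values):
        word |= (value & 0xFF) << (8 * i)
    return word


def unpack_bytes(word, count=4):
    """Split a packed word back into `count` byte-sized operands."""
    return [(word >> (8 * i)) & 0xFF for i in range(count)]


def sum_abs_diff(reg_x, reg_y):
    """me = |a - e| + |b - f| + |c - g| + |d - h| over two packed 32b registers."""
    return sum(abs(x - y) for x, y in zip(unpack_bytes(reg_x), unpack_bytes(reg_y)))


reg_x = pack_bytes([10, 200, 30, 40])   # {a, b, c, d}
reg_y = pack_bytes([12, 190, 30, 45])   # {e, f, g, h}
me = sum_abs_diff(reg_x, reg_y)         # 2 + 10 + 0 + 5
```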

Issues in Completion/Retirement
Out-of-order execution: ALU instructions, load/store instructions.
In-order completion/retirement: precise exceptions, memory coherence and consistency.
Solutions: reorder buffer, store buffer, load queue snooping (later).
Precise exception: clean instruction boundary for restart.
Memory consistency: WAW, also subtle multiprocessor issues (in 757).
Memory coherence: RAR; expect a later load to also see the new value seen by an earlier load.

A Dynamic Superscalar Processor
ROB: preallocated at dispatch; bookkeeping, store results, forward results (possibly).
Complete: commit results to RF (no way to undo).
Retire: memory update; delay stores to let loads go early.
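Delaying stores until retirement can be sketched with a toy store buffer. This assumes stores wait in program order, loads search the buffer youngest-first before falling back to memory, and memory is a plain dictionary that reads as 0 when an address is absent; the class and method names are illustrative.

```python
class StoreBuffer:
    """Toy store buffer: stores update memory only at retirement, loads can forward."""

    def __init__(self, memory):
        self.memory = memory
        self.pending = []  # (address, value) pairs in program order

    def store(self, address, value):
        # The store has executed but must not change memory until it retires.
        self.pending.append((address, value))

    def load(self, address):
        # Full-address comparison against every pending store, youngest first.
        for pending_address, value in reversed(self.pending):
            if pending_address == address:
                return value
        return self.memory.get(address, 0)

    def retire_oldest(self):
        # Retirement performs the memory update, in program order.
        if self.pending:
            address, value = self.pending.pop(0)
            self.memory[address] = value


memory = {0x10: 1}
buffer = StoreBuffer(memory)
buffer.store(0x10, 5)
buffer.store(0x10, 7)

forwarded = buffer.load(0x10)        # sees the youngest pending store
before_retire = memory[0x10]         # memory itself is still unchanged
buffer.retire_oldest()
after_first = memory[0x10]
buffer.retire_oldest()
after_second = memory[0x10]
```

Retiring stores in order keeps writes to the same address in program order, while forwarding from the buffer lets a later load obtain the right value without waiting for the store to reach memory.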

Impediments to High IPC

Superscalar Summary
Instruction flow: branches, jumps, calls (predict target, direction); fetch alignment; instruction cache misses.
Register data flow: register renaming (RAW/WAR/WAW).
Memory data flow: in-order stores (WAR/WAW); store queue (RAW); data cache misses.
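Register renaming, listed in the summary above, can be sketched as follows. This is a simplified model that assumes an unlimited supply of physical registers and no free list; the rename function and the r/p register naming are hypothetical.

```python
def rename(instructions, num_arch_regs=8):
    """Rename (dest, src1, src2) triples so that every write gets a fresh physical register."""
    map_table = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}
    next_phys = num_arch_regs
    renamed = []
    for dest, src1, src2 in instructions:
        # Read the sources through the current mapping first: this preserves RAW.
        s1, s2 = map_table[src1], map_table[src2]
        # Give the destination a new physical register: this removes WAR and WAW.
        map_table[dest] = f"p{next_phys}"
        next_phys += 1
        renamed.append((map_table[dest], s1, s2))
    return renamed


program = [
    ("r1", "r2", "r3"),   # i0 writes r1, reads r2
    ("r2", "r1", "r4"),   # i1 reads r1 (RAW on i0), writes r2 (WAR with i0)
    ("r1", "r5", "r6"),   # i2 writes r1 again (WAW with i0)
]
renamed = rename(program)
```

After renaming, i1 still reads the register that i0 writes, so the true dependence survives, while the two writes to r1 and the write to r2 land in distinct physical registers and no longer constrain the schedule.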