Flow Path Model of Superscalars


Flow Path Model of Superscalars
[Diagram: the three flow paths through a superscalar pipeline.]
- Instruction flow (FETCH/DECODE): I-cache and branch predictor feed FETCH, which fills the instruction buffer ahead of DECODE.
- Register data flow (EXECUTE): integer, floating-point, media, and memory functional units, together with the reorder buffer (ROB).
- Memory data flow (COMMIT): store queue and D-cache.

Instruction Fetch Buffer
Fetch Unit → Fetch Buffer → Out-of-order Core
The fetch buffer smooths out the rate mismatch between fetch and execution, since neither the fetch bandwidth nor the execution bandwidth is consistent from cycle to cycle. Fetch bandwidth should be higher than execution bandwidth: we prefer to have a stockpile of instructions in the buffer to hide cache miss latencies. This requires both raw cache bandwidth and control flow speculation.
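As a rough illustration of why the buffer helps (a toy sketch, not from the slides; the widths and the miss timing are made-up numbers), the following simulation models a fetch stage that delivers 4 instructions per cycle except during a 3-cycle I-cache miss, feeding a core that consumes 3 per cycle:

```python
from collections import deque

FETCH_WIDTH = 4             # instructions fetched per cycle (assumed)
EXEC_WIDTH = 3              # instructions the core consumes per cycle (assumed)
MISS_CYCLES = {10, 11, 12}  # hypothetical 3-cycle I-cache miss

buffer = deque()
for cycle in range(16):
    if cycle not in MISS_CYCLES:               # fetch delivers nothing during the miss
        buffer.extend(f"i{cycle}.{k}" for k in range(FETCH_WIDTH))
    consumed = min(EXEC_WIDTH, len(buffer))    # the core drains what it can
    for _ in range(consumed):
        buffer.popleft()
    print(f"cycle {cycle:2d}: consumed {consumed}, buffer occupancy {len(buffer)}")
```

Because fetch outruns execution in miss-free cycles, the buffer builds a cushion that keeps the core fed through the miss; with fetch bandwidth merely equal to execution bandwidth, the core would stall instead.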

Instruction Flow Bandwidth

Instruction Cache Basics
[Diagram: the PC is split as ..xx RRR CC 00; a row decoder uses the RRR bits to select a cache line (rows 000-111), and a multiplexer uses the CC bits (00-11) to select an instruction within that line.]
Example: 4 instructions per cache line.
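A minimal sketch (not from the slides; the 8-row depth and 4-byte instruction size are assumptions for illustration) of how the PC fields in the diagram could be decoded:

```python
INSTR_BYTES = 4          # fixed 4-byte instructions (assumed)
INSTRS_PER_LINE = 4      # from the slide: 4 instructions per cache line
NUM_ROWS = 8             # 8 rows -> 3 row bits RRR (assumed for illustration)

def decode_pc(pc: int):
    """Split a PC into the row (RRR) and column (CC) fields of the diagram."""
    word = pc // INSTR_BYTES                     # drop the low '00' byte-offset bits
    column = word % INSTRS_PER_LINE              # CC: which instruction within the line
    row = (word // INSTRS_PER_LINE) % NUM_ROWS   # RRR: which cache line (row)
    return row, column

# The row decoder is driven by `row`; the multiplexer selects `column` within the line.
print(decode_pc(0x130))   # -> (3, 0) for this example PC
```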

Spatial Locality and Fetch Bandwidth
[Diagram: the same organization (row decoder indexed by RRR, PC = ..xx RRR CC 00), but the entire selected line, Inst0-Inst3, is read out in one access, so spatial locality lets the fetch unit obtain up to 4 instructions per cycle.]

Fetch Group Misalignment
[Diagram: with PC = ..xx 000 01 00 the fetch group starts at the second slot of the line, so only three of its instructions (Inst0-Inst2) come out of the row accessed in cycle i; the fourth (Inst3) lies in the next line and needs a second access in cycle i+1.]
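The cost of misalignment can be sketched with a small calculation (assumed parameters matching the example above: 4-wide fetch, 4 instructions per line, 4-byte instructions):

```python
FETCH_WIDTH = 4       # desired instructions per cycle
LINE_INSTRS = 4       # instructions per cache line (from the earlier slide)
INSTR_BYTES = 4       # assumed instruction size

def instrs_this_cycle(pc: int) -> int:
    """Instructions obtainable from one cache access, without auto-alignment."""
    slot = (pc // INSTR_BYTES) % LINE_INSTRS     # starting slot within the line
    return min(FETCH_WIDTH, LINE_INSTRS - slot)

for slot in range(LINE_INSTRS):
    pc = slot * INSTR_BYTES
    print(f"fetch group starting at slot {slot}: {instrs_this_cycle(pc)} instruction(s) in cycle i")
```

Averaged over the four possible alignments, a single access delivers only (4+3+2+1)/4 = 2.5 instructions per cycle, which is why designs such as the IBM RS/6000 on the next slide add auto-alignment hardware.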

IBM RS/6000 Auto-alignment
[Diagram: the I-cache is built from interleaved, independently addressed SRAM modules holding odd/even directory sets A & B; T-logic and a mux network driven by the IFAR steer instructions from consecutive modules into the instruction buffer network, with TLB-hit control and the interlock, dispatch, branch, and execution logic downstream.]
- 2-way set-associative I-cache, built from 8 SRAM modules of 256 instructions each
- 16 instructions per cache line (What is a cache line?)
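The idea behind the auto-alignment network can be sketched abstractly (a simplification under assumed parameters, not the RS/6000's actual control logic): with instructions interleaved across independently addressed modules, a misaligned fetch is served in a single access by giving the wrapped-around modules the next row address.

```python
NUM_MODULES = 4   # assumed: instructions interleaved across 4 independent SRAM modules
FETCH_WIDTH = 4   # fetch 4 consecutive instructions per access (assumed)

def module_reads(start_index):
    """Row address each module must use so that FETCH_WIDTH consecutive
    instructions starting at `start_index` come out in a single access."""
    reads = {}
    for k in range(FETCH_WIDTH):
        idx = start_index + k
        reads[idx % NUM_MODULES] = idx // NUM_MODULES   # module -> row
    return reads

print(module_reads(0))   # aligned:    {0: 0, 1: 0, 2: 0, 3: 0}
print(module_reads(6))   # misaligned: {2: 1, 3: 1, 0: 2, 1: 2}
```

Modules whose instructions have wrapped into the next line simply receive the incremented row address; in the figure, this steering is the job of the T-logic and the mux/instruction-buffer network.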

Instruction Decoding Issues
Primary tasks:
- Identify individual instructions
- Determine instruction types
- Detect inter-instruction dependences
Two important factors:
- Instruction set architecture
- Width of the parallel pipeline
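The dependence-detection task can be illustrated with a small sketch (not from the slides; register names are hypothetical): for each instruction in a decode group, compare its destination against the sources of later instructions (RAW) and against their destinations (WAW).

```python
# Each decoded instruction in the group: (destination register, source registers).
group = [
    ("r1", ["r2", "r3"]),   # i0: r1 <- r2 op r3
    ("r4", ["r1", "r5"]),   # i1: reads r1 -> RAW dependence on i0
    ("r1", ["r6", "r7"]),   # i2: writes r1 again -> WAW dependence on i0
]

for i, (dst_i, _) in enumerate(group):
    for j in range(i + 1, len(group)):
        dst_j, srcs_j = group[j]
        if dst_i in srcs_j:
            print(f"RAW: i{j} reads {dst_i} produced by i{i}")
        if dst_i == dst_j:
            print(f"WAW: i{j} writes {dst_i} already written by i{i}")
```

An N-wide decoder needs on the order of N^2 such comparisons every cycle, which is one reason the width of the parallel pipeline is listed as a key factor.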

Intel Pentium Pro Fetch/Decode Unit
[Diagram: x86 macro-instruction bytes from the IFU fill a 16-byte instruction buffer; three parallel decoders (one complex decoder backed by the uROM producing up to 4 uops, and two simple decoders producing 1 uop each) feed a 6-entry uop queue, with next-address and branch-address calculation alongside; up to 3 uops are issued to dispatch per cycle.]
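A minimal sketch (an assumed, simplified in-order steering policy, based only on the decoder widths shown above) of the consequence of asymmetric decoders: an instruction that expands to more than one uop must wait for the complex decoder, so instruction order affects decode throughput.

```python
def assign_decoders(uop_counts):
    """Greedily map a window of macro-instructions (by uop count) onto a
    4-1-1 decoder group for one cycle; returns how many are decoded."""
    capacities = [4, 1, 1]          # complex decoder, then two simple decoders
    decoded = 0
    for cap in capacities:
        if decoded >= len(uop_counts):
            break
        if uop_counts[decoded] <= cap:
            decoded += 1
        else:
            break                   # a complex op waits for the complex decoder
    return decoded

print(assign_decoders([1, 1, 1]))   # 3: all three decoders used
print(assign_decoders([3, 1, 1]))   # 3: the complex op lands on the complex decoder
print(assign_decoders([1, 3, 1]))   # 1: the complex op cannot use a simple decoder
```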

Predecoding in the AMD K5
[Diagram: 8 instruction bytes (64 bits) arrive from memory and pass through predecode logic that attaches 5 predecode bits to each byte, so 8 instruction bytes plus 40 predecode bits are written into the I-cache; a 16-byte fetch (128 bits plus 80 predecode bits) is read out to decode, translate, and dispatch, which emits up to 4 ROPs.]
Predecoding is also useful for RISC ISAs!
Cost: cache size, refill time.
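The storage overhead follows directly from the numbers on the slide (a worked check, not additional slide content): 5 predecode bits per instruction byte means 40 extra bits for every 64 bits of instructions.

```python
PREDECODE_BITS_PER_BYTE = 5        # from the slide
BYTES_PER_REFILL = 8               # 8 instruction bytes come from memory

data_bits = BYTES_PER_REFILL * 8                              # 64 instruction bits
predecode_bits = BYTES_PER_REFILL * PREDECODE_BITS_PER_BYTE   # 40 predecode bits
overhead = predecode_bits / data_bits

print(data_bits, predecode_bits, f"{overhead:.1%}")           # 64 40 62.5%
```

That 62.5% growth in the I-cache array is the "cache size" cost the slide mentions; generating the 5 bits on every miss is the "refill time" cost.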

Control Dependence

IBM’s Experience on Pipelined Processors [Agerwala and Cocke 1987]
Code characteristics (dynamic instruction mix):
- loads: 25%
- stores: 15%
- ALU/register-register: 40%
- branches: 20%
  - 1/3 unconditional (always taken); unconditional branches: 100% schedulable
  - 1/3 conditional, taken
  - 1/3 conditional, not taken
  - conditional branches overall: 50% schedulable
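One way to read these numbers (a worked calculation using only the slide's figures; interpreting "schedulable" as "the branch delay can be filled by the compiler" is an assumption):

```python
branch_frac = 0.20                 # branches: 20% of dynamic instructions
uncond_frac, cond_frac = 1/3, 2/3  # split of branches per the slide
uncond_schedulable = 1.00          # unconditional: 100% schedulable
cond_schedulable = 0.50            # conditional: 50% schedulable

unscheduled = branch_frac * (uncond_frac * (1 - uncond_schedulable)
                             + cond_frac * (1 - cond_schedulable))
print(f"{unscheduled:.1%} of dynamic instructions are branches whose "
      "delay cannot be hidden by static scheduling")   # ~6.7%
```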

Control Flow Graph
Shows possible paths of control flow through basic blocks.
Control dependence: node X is control dependent on node Y if the computation in Y determines whether X executes.
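To make the definition concrete, here is a minimal sketch (illustrative, not from the slides) of a diamond-shaped CFG and a naive check of whether a block executes on every path from the branch. Real compilers define control dependence via postdominators, but the diamond captures the intuition: B and C are control dependent on A's branch, while D is not.

```python
# Basic blocks and their successors: A's conditional branch picks B or C;
# both rejoin at D (the classic diamond).
cfg = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": [],
}

def on_all_paths(cfg, start, target):
    """True if every path from `start` reaches `target` (so `target` does not
    control-depend on the branch in `start`); simple recursion over acyclic CFGs."""
    succs = cfg[start]
    if not succs:
        return start == target
    return all(node == target or on_all_paths(cfg, node, target) for node in succs)

print(on_all_paths(cfg, "A", "D"))  # True  -> D executes regardless of A's branch
print(on_all_paths(cfg, "A", "B"))  # False -> B is control dependent on A
```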

Mapping CFG to Linear Instruction Sequence
[Diagram: the basic blocks of a control flow graph placed in memory in different linear orders, e.g. B-C-D, C-B-D, and D-B-C; each ordering turns some CFG edges into fall-through paths and others into explicit branches.]

Branch Types and Implementation
Types of branches:
- Conditional or unconditional?
- Subroutine call (aka link): needs to save the PC?
- How is the branch target computed? (see the sketch after this list)
  - Static targets, e.g. immediate, PC-relative
  - Dynamic targets, e.g. register indirect
Conditional branch architectures:
- Condition codes (N-Z-C-V), e.g. PowerPC
- General-purpose register, e.g. Alpha, MIPS
- Special-purpose register, e.g. Power's loop count register
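A minimal sketch (illustrative encodings, not any real ISA) of the two target-computation styles named above: a static PC-relative target can be formed in the front end from the instruction alone, while a register-indirect target needs the register value at execute time.

```python
INSTR_BYTES = 4  # assumed fixed instruction size

def pc_relative_target(pc: int, imm: int) -> int:
    """Static target: next-PC plus a sign-extended immediate from the instruction."""
    return pc + INSTR_BYTES + imm

def register_indirect_target(regfile: dict, reg: str) -> int:
    """Dynamic target: whatever the named register holds when the branch executes."""
    return regfile[reg]

print(hex(pc_relative_target(0x1000, -16)))                    # 0xff4: computable at decode
print(hex(register_indirect_target({"r31": 0x2040}, "r31")))   # 0x2040: known only after the register read
```

The distinction matters for instruction fetch: static targets can be computed or predicted early, while dynamic targets generally require structures such as a return address stack or an indirect-branch predictor.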

Condition Resolution

Target Address Generation

What’s So Bad About Branches?
Performance penalties:
- Branches use up execution resources
- Fragmentation of I-cache lines
- Disruption of sequential control flow:
  - need to determine branch direction (conditional branches)
  - need to determine branch target
Branches rob instruction fetch bandwidth and ILP.

Riseman and Foster’s Study
7 benchmark programs on the CDC-3600.
Assume an infinite machine: infinite memory and instruction stack, register file, and functional units; consider only true dependences, i.e. the data-flow limit.
If execution is bounded to a single basic block (no bypassing of branches), the maximum speedup is 1.72.
Suppose one can bypass conditional branches and jumps (i.e. assume the actual branch path is always known, so that branches do not impede instruction execution):

Branches bypassed:  0     1     2     8     32    128
Max speedup:        1.72  2.72  3.62  7.21  24.4  51.2