Wrap-Up CSC 333. – 2 – Overview Wrap-Up of PIPE Design Performance analysis Fetch stage design Exceptional conditions Modern High-Performance Processors.

Slides:



Advertisements
Similar presentations
Lecture 4: CPU Performance
Advertisements

1 Pipelining Part 2 CS Data Hazards Data hazards occur when the pipeline changes the order of read/write accesses to operands that differs from.
Mehmet Can Vuran, Instructor University of Nebraska-Lincoln Acknowledgement: Overheads adapted from those provided by the authors of the textbook.
Pipeline Computer Organization II 1 Hazards Situations that prevent starting the next instruction in the next cycle Structural hazards – A required resource.
1 Seoul National University Wrap-Up. 2 Overview Seoul National University Wrap-Up of PIPE Design  Exception conditions  Performance analysis Modern.
Spring 2003CSE P5481 Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers)
Chapter 12 CPU Structure and Function. CPU Sequence Fetch instructions Interpret instructions Fetch data Process data Write data.
Computer Organization and Architecture
1 IF IDEX MEM L.D F4,0(R2) MUL.D F0, F4, F6 ADD.D F2, F0, F8 L.D F2, 0(R2) WB IF IDM1 MEM WBM2M3M4M5M6M7 stall.
Computer Architecture Carnegie Mellon University
PipelinedImplementation Part I CSC 333. – 2 – Overview General Principles of Pipelining Goal Difficulties Creating a Pipelined Y86 Processor Rearranging.
Chapter 12 Pipelining Strategies Performance Hazards.
EECS 470 Superscalar Architectures and the Pentium 4 Lecture 12.
7/2/ _23 1 Pipelining ECE-445 Computer Organization Dr. Ron Hayne Electrical and Computer Engineering.
Appendix A Pipelining: Basic and Intermediate Concepts
Group 5 Alain J. Percial Paula A. Ortiz Francis X. Ruiz.
Pipelining III Topics Hazard mitigation through pipeline forwarding Hardware support for forwarding Forwarding to mitigate control (branch) hazards Systems.
David O’Hallaron Carnegie Mellon University Processor Architecture PIPE: Pipelined Implementation Part I Processor Architecture PIPE: Pipelined Implementation.
Architecture Basics ECE 454 Computer Systems Programming
Ch2. Instruction-Level Parallelism & Its Exploitation 2. Dynamic Scheduling ECE562/468 Advanced Computer Architecture Prof. Honggang Wang ECE Department.
Data Hazard Solution 2: Data Forwarding Our naïve pipeline would experience many data stalls  Register isn’t written until completion of write-back stage.
Lecture 15: Pipelining and Hazards CS 2011 Fall 2014, Dr. Rozier.
Edited By Miss Sarwat Iqbal (FUUAST) Last updated:21/1/13
1 Seoul National University Pipelined Implementation : Part I.
CSE 340 Computer Architecture Summer 2014 Basic MIPS Pipelining Review.
CS.305 Computer Architecture Enhancing Performance with Pipelining Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from.
CSCE 614 Fall Hardware-Based Speculation As more instruction-level parallelism is exploited, maintaining control dependences becomes an increasing.
Computer Architecture: Wrap-up CENG331 - Computer Organization Instructors: Murat Manguoglu(Section 1) Erol Sahin (Section 2 & 3) Adapted from slides of.
Pipeline Architecture I Slides from: Bryant & O’ Hallaron
1 Lecture 7: Speculative Execution and Recovery Branch prediction and speculative execution, precise interrupt, reorder buffer.
PipelinedImplementation Part II PipelinedImplementation.
Superscalar - summary Superscalar machines have multiple functional units (FUs) eg 2 x integer ALU, 1 x FPU, 1 x branch, 1 x load/store Requires complex.
Implementing Precise Interrupts in Pipelined Processors James E. Smith Andrew R.Pleszkun Presented By: Shrikant G.
Adapted from Computer Organization and Design, Patterson & Hennessy, UCB ECE232: Hardware Organization and Design Part 13: Branch prediction (Chapter 4/6)
COMPSYS 304 Computer Architecture Speculation & Branching Morning visitors - Paradise Bay, Bay of Islands.
LECTURE 10 Pipelining: Advanced ILP. EXCEPTIONS An exception, or interrupt, is an event other than regular transfers of control (branches, jumps, calls,
Real-World Pipelines Idea –Divide process into independent stages –Move objects through stages in sequence –At any given times, multiple objects being.
1 Pipelined Implementation. 2 Outline Handle Control Hazard Handle Exception Performance Analysis Suggested Reading 4.5.
Pipelining: Implementation CPSC 252 Computer Organization Ellen Walker, Hiram College.
PipeliningPipelining Computer Architecture (Fall 2006)
Pentium 4 Deeply pipelined processor supporting multiple issue with speculation and multi-threading 2004 version: 31 clock cycles from fetch to retire,
Real-World Pipelines Idea Divide process into independent stages
Address – 32 bits WRITE Write Cache Write Main Byte Offset Tag Index Valid Tag Data 16K entries 16.
Computer Architecture Chapter (14): Processor Structure and Function
Systems I Pipelining IV
Morgan Kaufmann Publishers
Administrivia Midterm to be posted on Tuesday after class
Pipelining: Advanced ILP
Morgan Kaufmann Publishers The Processor
Lecture 6: Advanced Pipelines
Pipelined Implementation : Part I
Seoul National University
Seoul National University
Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2
Systems I Pipelining III
Systems I Pipelining II
Pipelined Implementation : Part I
Seoul National University
Control unit extension for data hazards
Pipeline Architecture I Slides from: Bryant & O’ Hallaron
Pipelined Implementation : Part I
Pipelined Implementation
Computer Architecture
Control unit extension for data hazards
Systems I Pipelining II
Systems I Pipelining II
Control unit extension for data hazards
Presentation transcript:

Wrap-Up CSC 333

– 2 – Overview Wrap-Up of PIPE Design Performance analysis Fetch stage design Exceptional conditions Modern High-Performance Processors Out-of-order execution

– 3 – Performance Metrics Clock rate Measured in Megahertz or Gigahertz Function of stage partitioning and circuit design Keep amount of work per stage small Rate at which instructions executed CPI: cycles per instruction On average, how many clock cycles does each instruction require? Function of pipeline design and benchmark programs E.g., how frequently are branches mispredicted?

– 4 – CPI for PIPE CPI  1.0 Fetch instruction each clock cycle Effectively process new instruction almost every cycle Although each individual instruction has latency of 5 cycles CPI > 1.0 Sometimes must stall or cancel branches Computing CPI C clock cycles I instructions executed to completion B bubbles injected (C = I + B) CPI = C/I = (I+B)/I = B/I Factor B/I represents average penalty due to bubbles

– 5 – CPI for PIPE (Cont.) B/I = LP + MP + RP LP: Penalty due to load/use hazard stalling Fraction of instructions that are loads0.25 Fraction of load instructions requiring stall0.20 Number of bubbles injected each time1  LP = 0.25 * 0.20 * 1 = 0.05 MP: Penalty due to mispredicted branches Fraction of instructions that are cond. jumps 0.20 Fraction of cond. jumps mispredicted0.40 Number of bubbles injected each time 2  MP = 0.20 * 0.40 * 2 = 0.16 RP: Penalty due to ret instructions Fraction of instructions that are returns0.02 Number of bubbles injected each time 3  RP = 0.02 * 3 = 0.06 Net effect of penalties = 0.27  CPI = 1.27 (Not bad!) Typical Values

– 6 – Fetch Logic Revisited During Fetch Cycle  Select PC  Read bytes from instruction memory  Examine icode to determine instruction length  Increment PCTiming Steps 2 & 4 require significant amount of time

– 7 – Standard Fetch Timing Must Perform Everything in Sequence Can’t compute incremented PC until know how much to increment it by Select PC Mem. ReadIncrement need_regids, need_valC 1 clock cycle

– 8 – A Fast PC Increment Circuit 3-bit adder need_ValC need_regids 0 29-bit incre- menter MUX High-order 29 bits Low-order 3 bits High-order 29 bitsLow-order 3 bits 01 PC incrPC SlowFast carry

– 9 – Modified Fetch Timing 29-Bit Incrementer Acts as soon as PC selected Output not needed until final MUX Works in parallel with memory read Select PC Mem. Read Incrementer need_regids, need_valC 3-bit add MUX 1 clock cycle Standard cycle

– 10 – More Realistic Fetch Logic Fetch Box Integrated into instruction cache Fetches entire cache block (16 or 32 bytes) Selects current instruction from current block Works ahead to fetch next block As reaches end of current block At branch target

– 11 – Exceptions Conditions under which pipeline cannot continue normal operationCauses Halt instruction(Current) Bad address for instruction or data(Previous) Invalid instruction(Previous) Pipeline control error(Previous) Desired Action Complete some instructions Either current or previous (depends on exception type) Discard others Call exception handler Like an unexpected procedure call

– 12 – Exception Examples Detect in Fetch Stage irmovl $100,%eax rmmovl %eax,0x10000(%eax) # invalid address jmp $-1 # Invalid jump target.byte 0xFF # Invalid instruction code halt # Halt instruction Detect in Memory Stage

– 13 – Exceptions in Pipeline Processor #1 Desired Behavior rmmovl should cause exception # demo-exc1.ys irmovl $100,%eax rmmovl %eax,0x10000(%eax) # Invalid address nop.byte 0xFF # Invalid instruction code 0x000: irmovl $100,%eax 1234 FDEM FDE 0x006: rmmovl %eax,0x1000(%eax) 0x00c: nop 0x00d:.byte 0xFF FD F W 5 M E D Exception detected

– 14 – Exceptions in Pipeline Processor #2 Desired Behavior No exception should occur # demo-exc2.ys 0x000: xorl %eax,%eax # Set condition codes 0x002: jne t # Not taken 0x007: irmovl $1,%eax 0x00d: irmovl $2,%edx 0x013: halt 0x014: t:.byte 0xFF # Target 0x000: xorl %eax,%eax 123 FDE FD 0x002: jne t 0x014: t:.byte 0xFF 0x???: (I’m lost!) F Exception detected 0x007: irmovl $1,%eax 4 M E F D W 5 M D F E E D M 6 M E W 7 W M 8 W 9

– 15 – Maintaining Exception Ordering Add exception status field to pipeline registers Fetch stage sets to either “AOK,” “ADR” (when bad fetch address), or “INS” (illegal instruction) Decode & execute pass values through Memory either passes through or sets to “ADR” Exception triggered only when instruction hits write back F predPC W icodevalEvalMdstEdstMexc M BchicodevalEvalAdstEdstMexc E icodeifunvalCvalAvalBdstEdstMsrcAsrcBexc D rBvalCvalPicodeifunrAexc

– 16 – Side Effects in Pipeline Processor Desired Behavior rmmovl should cause exception No following instruction should have any effect # demo-exc3.ys irmovl $100,%eax rmmovl %eax,0x10000(%eax) # invalid address addl %eax,%eax # Sets condition codes 0x000: irmovl $100,%eax 1234 FDEM FDE 0x006: rmmovl %eax,0x1000(%eax) 0x00c: addl %eax,%eax FD W 5 M E Exception detected Condition code set

– 17 – Avoiding Side Effects Presence of Exception Should Disable State Update When detect exception in memory stage Disable condition code setting in execute Must happen in same clock cycle When exception passes to write-back stage Disable memory write in memory stage Disable condition code setting in execute stageImplementation Hardwired into the design of the PIPE simulator You have no control over this

– 18 – Rest of Exception Handling Calling Exception Handler Push PC onto stack Either PC of faulting instruction or of next instruction Usually pass through pipeline along with exception status Jump to handler address Usually fixed address Defined as part of ISAImplementation Haven’t tried it yet!

– 19 – Modern CPU Design

– 20 – Instruction Control Grabs Instruction Bytes From Memory Based on Current PC + Predicted Targets for Predicted Branches Hardware dynamically guesses whether branches taken/not taken and (possibly) branch target Translates Instructions Into Operations Primitive steps required to perform instruction Typical instruction requires 1–3 operations Converts Register References Into Tags Abstract identifier linking destination of one operation with sources of later operations

– 21 – Execution Unit Multiple functional units Each can operate in independently Operations performed as soon as operands available Not necessarily in program order Within limits of functional units Control logic Ensures behavior equivalent to sequential program execution Execution Functional Units Integer/ Branch FP Add FP Mult/Div LoadStore Data Cache Prediction OK? Data Addr.. General Integer Operation Results Register Updates Operations

– 22 – CPU Capabilities of Pentium III Multiple Instructions Can Execute in Parallel 1 load 1 store 2 integer (one may be branch) 1 FP Addition 1 FP Multiplication or Division Some Instructions Take > 1 Cycle, but Can be Pipelined InstructionLatencyCycles/Issue Load / Store31 Integer Multiply41 Integer Divide3636 Double/Single FP Multiply52 Double/Single FP Add31 Double/Single FP Divide3838

PentiumPro Block Diagram P6 Microarchitecture PentiumPro Pentium II Pentium III Microprocessor Report 2/16/95

– 24 – PentiumPro Operation Translates instructions dynamically into “Uops” 118 bits wide Holds operation, two sources, and destination Executes Uops with “Out of Order” engine Uop executed when Operands available Functional unit available Execution controlled by “Reservation Stations” Keeps track of data dependencies between uops Allocates resources

– 25 – PentiumPro Branch Prediction Critical to Performance 11–15 cycle penalty for misprediction Branch Target Buffer 512 entries 4 bits of history Adaptive algorithm Can recognize repeated patterns, e.g., alternating taken–not taken Handling BTB misses Detect in cycle 6 Predict taken for negative offset, not taken for positive Loops vs. conditionals

– 26 – Example Branch Prediction Branch History Encode information about prior history of branch instructions Predict whether or not branch will be taken State Machine Each time branch taken, transition to right When not taken, transition to left Predict branch taken when in state Yes! or Yes? TTT Yes!Yes?No?No! NT T

– 27 – Pentium 4 Block Diagram Next generation microarchitecture Intel Tech. Journal Q1, 2001

– 28 – Pentium 4 Features Trace Cache Replaces traditional instruction cache Caches instructions in decoded form Reduces required rate for instruction decoder Double-Pumped ALUs Simple instructions (add) run at 2X clock rate Very Deep Pipeline 20+ cycle branch penalty Enables very high clock rates Slower than Pentium III for a given clock rate L2 Cache Instruct. Decoder Trace Cache IA32 Instrs. uops Operations

– 29 – Processor Summary Design Technique Create uniform framework for all instructions Want to share hardware among instructions Connect standard logic blocks with bits of control logicOperation State held in memories and clocked registers Computation done by combinational logic Clocking of registers/memories sufficient to control overall behavior Enhancing Performance Pipelining increases throughput and improves resource utilization Must make sure maintains ISA behavior