Computer System Design

Computer System Design, Chapter 3: Processors
From Computer System Design: System-on-Chip by M. Flynn & W. Luk, Wiley 2011 (copyright 2011)

Processor design: simple processor
1. Processor core selection
2. Baseline processor pipeline
   - in-order execution performance
3. Buffer design
   - maximum-rate
   - mean-rate
4. Dealing with branches
   - branch target capture
   - branch prediction

Processor design: robust processor
- vector processors
- VLIW processors
- superscalar processors
  - out-of-order execution
  - ensuring correct program execution

1. Processor core selection
constraints:
- compute limited
- real-time limit (must be addressed first)
- other limitations
balance the design to achieve the constraints
secondary targets:
- software design effort
- fault tolerance

Types of pipelined processors

2. Baseline processor pipeline
Optimum pipelining depends on the probability b of a pipeline break.
Optimal number of stages: S_opt = f(b)
Need to minimize b to increase S_opt, so must minimize the effects of:
- branches
- data dependencies
- resource limitations
Also must manage cache misses. (A sketch of one such model follows.)
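
As a concrete illustration of S_opt = f(b), here is one commonly used cost/performance model; it is an assumption, since the slide does not spell out f. Suppose each pipeline break flushes S - 1 stages, so CPI ~ 1 + b(S - 1), and total logic delay T is split over S stages with per-stage latch overhead c. Then time per instruction is (T/S + c)(1 + b(S - 1)), and minimizing over S gives S_opt = sqrt((1 - b)T / (bc)). The Python sketch below evaluates this with illustrative values of T and c.

import math

def s_opt(b, T=20.0, c=0.5):
    # Optimal pipeline depth under the model above: b = probability of
    # a pipeline break, T = total logic delay, c = per-stage latch
    # overhead (same units as T). T and c here are assumptions.
    return math.sqrt((1.0 - b) * T / (b * c))

for b in (0.05, 0.10, 0.20):
    print(f"b = {b:.2f} -> S_opt ~ {s_opt(b):.1f} stages")
# Larger b pushes S_opt down, which is why b must be minimized.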

Simple pipelined processors
Interlocks: used to stall subsequent instructions

Interlocks

In-order processor performance
instruction execution time: linear sum of decode + pipeline delays + memory delays
processor performance breakdown: T_TOTAL = T_EX + T_D + T_M
- T_EX = execution time (1 + run-on execution)
- T_D = pipeline delays (resource, data, control)
- T_M = memory delays (TLB, cache miss)

3. Buffer design
buffers minimize memory delays: delays caused by variation in throughput between the pipeline and memory
two types of buffer design criteria:
- maximum rate: for units that have high request rates; the buffer is sized to mask the service latency and is generally kept full (often a fixed data rate), e.g. instruction or video buffers
- mean rate: for units with a lower expected request rate; size the buffer to minimize the probability of overflowing, e.g. store buffer

Maximum-rate buffer design
buffer is sized to avoid runout: the processor stalling while the buffer is empty awaiting service
example: instruction buffer
- need buffer input rate > buffer output rate
- then size to cover the latency at maximum demand
buffer size (BF) should be:
  s: items processed (used or serviced) per cycle
  p: items fetched in an access
  (first term: allow processing during the current cycle)

Maximum-rate buffer: example (branch target fetch)
assumptions:
- decode consumes at most 1 inst/clock
- I-cache supplies 2 inst/clock bandwidth at 6 clocks latency
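
A minimal simulation sketch of runout under these numbers (s = 1 consumed per clock, p = 2 per fetch, L = 6 clocks latency). The fetch model, pipelined with at most one launch per clock and launching only when the buffer has room for the returning items, is an assumption, not from the slide.

def stalls(BF, s=1, p=2, L=6, cycles=200, warmup=20):
    # Count decode stalls (after warmup) for an instruction buffer of
    # size BF. Each fetch returns p items L clocks after launch.
    buf, in_flight, stall = 0, 0, 0
    arriving = [0] * (cycles + L + 1)
    for t in range(cycles):
        buf += arriving[t]              # fetched items arrive
        in_flight -= arriving[t]
        if buf >= s:
            buf -= s                    # decode consumes s items
        elif t >= warmup:
            stall += 1                  # runout: decode stalls
        if buf + in_flight + p <= BF:   # room for another fetch?
            arriving[t + L] += p
            in_flight += p
    return stall

for BF in (4, 6, 7, 8):
    print(f"BF = {BF}: {stalls(BF)} stalls")
# The smallest BF with zero steady-state stalls masks the 6-clock latency.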

Mean-rate buffer design
use inequalities from probability theory to determine the buffer size
Little's theorem: mean occupancy = mean request rate (requests/cycle) × mean time to service a request
for an infinite buffer, assume a distribution of buffer occupancy q, with mean occupancy Q and standard deviation σ
use Markov's inequality for a buffer of size BF:
  prob. of overflow = p(q ≥ BF) ≤ Q/BF
use Chebyshev's inequality for a buffer of size BF:
  prob. of overflow = p(q ≥ BF) ≤ σ²/(BF − Q)²
given an acceptable probability of overflow p, conservatively select
  BF = min(Q/p, Q + σ/√p)
i.e. pick the smaller BF that still meets the overflow/stall target
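
A small sketch turning these bounds into a sizing rule. The 5% overflow target is an illustrative assumption; Q and σ² match the store-buffer example on the next slide.

import math

def mean_rate_bf(Q, sigma, p_overflow):
    # Smallest buffer size meeting the overflow target: the tighter of
    # the Markov bound (Q/p) and the Chebyshev bound (Q + sigma/sqrt(p)).
    return min(Q / p_overflow, Q + sigma / math.sqrt(p_overflow))

print(mean_rate_bf(0.3, math.sqrt(0.3), 0.05))   # ~2.75, so use 3 entries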

Mean-rate buffer: example (store buffer)
memory references from the pipeline: reads go to the data cache, writes go through the store buffer
assumptions:
- when the store buffer is full, writes have priority
- write request rate = 0.15 inst/cycle
- store latency to data cache = 2 clocks, so Q = 0.15 × 2 = 0.3 (Little's theorem)
- given σ² = 0.3
if we use a 2-entry write buffer (BF = 2):
  P = min(Q/BF, σ²/(BF − Q)²) = min(0.15, 0.104) ≈ 0.10
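
Checking the slide's arithmetic with the same bounds (a one-off sketch):

Q, var, BF = 0.3, 0.3, 2
markov = Q / BF                      # 0.15
chebyshev = var / (BF - Q) ** 2      # 0.3 / 1.7^2 ~ 0.104
print(min(markov, chebyshev))        # ~0.10, matching the slide's P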

4. Dealing with branches
- need to eliminate the branch delay: branch target capture with a branch target buffer (BTB)
- need to predict the outcome: branch prediction
  - static prediction (simplest, least accurate)
  - bimodal
  - 2-level adaptive
  - combined (most expensive, most accurate)

Branch problem: if 20% of instructions are BC (conditional branch) with a 5-cycle penalty, this may add 0.2 × 5 = 1.0 CPI to each instruction.

Prediction based on history

Branch prediction
- fixed: simple/trivial, e.g. always fetch in-line unless branch
- static: varies by opcode type or target direction
- dynamic: varies with current program behaviour

Branch target buffer: reduces the branch delay to zero if guessed correctly
- can be used with the I-cache
- on a BTB hit, the BTB returns the target instruction and address: no delay if the prediction is correct
- on a BTB miss, the cache returns the branch as usual
- 70%-98% effective with 512 entries, depending on the code

Branch target buffer

Static branch prediction
based on:
- branch opcode (e.g. BR, BC, etc.)
- branch direction (forward, backward)
70%-80% effective

Dynamic branch prediction: bimodal
based on past history: branch taken / not taken
use an n = 2 bit saturating counter of history:
- set initially by the static predictor
- increment when taken, decrement when not taken
if supported by a BTB (same penalty for a missed guess of either path):
- predict not taken for 00, 01
- predict taken for 10, 11
store the bits in a table addressed by low-order instruction address bits, or in the cache line
large tables: 93.5% correct for SPEC
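
A minimal sketch of the bimodal scheme just described. The table size, the initial counter value, and the index function (pc >> 2 drops the byte offset, assuming 4-byte instructions) are assumptions.

class Bimodal:
    # 2-bit saturating counters indexed by low-order instruction
    # address bits: 00/01 predict not taken, 10/11 predict taken.
    def __init__(self, index_bits=12):
        self.mask = (1 << index_bits) - 1
        self.table = [1] * (1 << index_bits)   # start weakly not-taken

    def predict(self, pc):
        return self.table[(pc >> 2) & self.mask] >= 2

    def update(self, pc, taken):
        i = (pc >> 2) & self.mask
        if taken:
            self.table[i] = min(self.table[i] + 1, 3)   # saturate at 11
        else:
            self.table[i] = max(self.table[i] - 1, 0)   # saturate at 00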

Dynamic branch prediction: two-level adaptive
how it works:
- create a branch history table of the outcomes of the last n branch occurrences (one shift register per entry), addressed by branch instruction address bits (the pattern table); so TTUU (T = taken, U = not taken) is 1100, which becomes the address of an entry in a bimodal table
- the bimodal table is addressed by the content of the pattern table (the pattern history table)
on average up to 95% correct; up to 97.1% correct on SPEC
slow: needs two table accesses; uses much support hardware
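
A sketch of the two-level scheme: per-branch shift registers whose contents (e.g. TTUU -> 0b1100) index a shared table of 2-bit counters, so prediction takes the two table accesses noted above. Table sizes and index functions are assumptions.

class TwoLevelAdaptive:
    def __init__(self, hist_bits=4, entries=1024):
        self.h_mask = (1 << hist_bits) - 1
        self.entries = entries
        self.history = [0] * entries            # one shift register per entry
        self.counters = [1] * (1 << hist_bits)  # bimodal (pattern history) table

    def predict(self, pc):
        h = self.history[(pc >> 2) % self.entries]  # first access: history
        return self.counters[h] >= 2                # second access: counter

    def update(self, pc, taken):
        i = (pc >> 2) % self.entries
        h = self.history[i]
        if taken:
            self.counters[h] = min(self.counters[h] + 1, 3)
        else:
            self.counters[h] = max(self.counters[h] - 1, 0)
        self.history[i] = ((h << 1) | int(taken)) & self.h_mask  # shift in outcome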

2-level adaptive predictor: average & SPECmark performance
(Figure: prediction accuracy of the 2-level adaptive predictor (average) vs. the 2-bit bimodal and static predictors.)

Combined branch predictor
- use both bimodal and 2-level predictors; usually the pattern table in the 2-level predictor is replaced by a single global branch shift register
- best in a mixed program environment of small and large programs
- instruction address bits address both predictors, plus another 2-bit saturating counter (the voting table) that stores the results of recent branch contests: if both are wrong or both right, no change; otherwise increment/decrement toward the winner
- also 97+% correct
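
A sketch of the combined scheme: a bimodal component, a 2-level component driven by a single global history register, and a 2-bit voting counter per entry, updated only when the two components disagree. All sizes and index functions are assumptions.

class Combined:
    def __init__(self, index_bits=12):
        self.mask = (1 << index_bits) - 1
        self.bimodal = [1] * (1 << index_bits)
        self.global2l = [1] * (1 << index_bits)  # indexed by global history
        self.vote = [2] * (1 << index_bits)      # >= 2 trusts the 2-level side
        self.ghist = 0

    def _preds(self, pc):
        i = (pc >> 2) & self.mask
        return i, self.bimodal[i] >= 2, self.global2l[self.ghist & self.mask] >= 2

    def predict(self, pc):
        i, p_bi, p_2l = self._preds(pc)
        return p_2l if self.vote[i] >= 2 else p_bi

    def update(self, pc, taken):
        i, p_bi, p_2l = self._preds(pc)
        if p_bi != p_2l:                 # a contest: one right, one wrong
            if p_2l == taken:
                self.vote[i] = min(self.vote[i] + 1, 3)
            else:
                self.vote[i] = max(self.vote[i] - 1, 0)
        g = self.ghist & self.mask
        for tbl, j in ((self.bimodal, i), (self.global2l, g)):
            tbl[j] = min(tbl[j] + 1, 3) if taken else max(tbl[j] - 1, 0)
        self.ghist = ((self.ghist << 1) | int(taken)) & self.mask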

Branch management: summary
techniques range from the simplest, cheapest, and least effective (simple approaches, not covered) through the BTB to the most complex, most expensive, and most effective

More robust processors
- vector processors
- VLIW (very long instruction word) processors
- superscalar processors

Vector stride corresponds to access pattern

Vector registers: essential to a vector processor

Vector instruction execution depends on VR read ports

Vector instruction execution with dependency

Vector instruction chaining

Chaining path

Generic vector processor

Multiple issue machines: VLIW
- typically over a 200-bit instruction word
- for VLIW, most of the work is done by the compiler (trace scheduling)

Generic VLIW processor

Multiple issue machines: superscalar
detecting independent instructions; three types of dependencies (instruction format is opcode dest, src1, src2):
- RAW (read after write): an instruction needs the result of a previous instruction... an essential dependency.
    ADD R1, R2, R3
    MUL R6, R1, R7
- WAR (write after read): an instruction writes before a previously issued instruction can read the value from the same location... an ordering dependency.
    DIV R1, R2, R3
    ADD R2, R6, R7
- WAW (write after write): a write hazard to the same location... shouldn't occur with well-compiled code.
    ADD R1, R6, R7
    (followed by a second write to R1)
A sketch of detecting these dependencies follows.
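
The small sketch below classifies the dependency between two instructions in program order, using the slide's opcode dest, src1, src2 format; it is purely illustrative.

def hazards(first, second):
    # first, second: (opcode, dest, src1, src2) tuples in program order.
    _, d1, s1a, s1b = first
    _, d2, s2a, s2b = second
    found = []
    if d1 in (s2a, s2b): found.append("RAW")  # second reads first's result
    if d2 in (s1a, s1b): found.append("WAR")  # second overwrites a source
    if d2 == d1:         found.append("WAW")  # both write the same register
    return found

print(hazards(("ADD", "R1", "R2", "R3"), ("MUL", "R6", "R1", "R7")))  # ['RAW']
print(hazards(("DIV", "R1", "R2", "R3"), ("ADD", "R2", "R6", "R7")))  # ['WAR']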

Reducing dependencies: renaming
WAR and WAW are caused by reusing the same register for 2 separate computations; they can be eliminated by renaming the register used by the second computation, using hidden registers. So
  ST A, R1
  LD R1, B
becomes
  ST A, R1
  LD Rs1, B
where Rs1 is a new rename register.
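
A minimal renaming sketch in the register-register format of the previous slide: every destination gets a fresh register from a free list, so WAR and WAW vanish while RAW chains follow the latest names. The register names and free list are illustrative assumptions.

def rename(instrs, free):
    mapping = {}                                   # architectural -> rename reg
    out = []
    for op, dest, *srcs in instrs:
        srcs = [mapping.get(s, s) for s in srcs]   # sources use latest names
        mapping[dest] = free.pop(0)                # each write gets a fresh reg
        out.append((op, mapping[dest], *srcs))
    return out

prog = [("DIV", "R1", "R2", "R3"),   # reads R2
        ("ADD", "R2", "R6", "R7")]   # writes R2: WAR on R2
print(rename(prog, ["Rs1", "Rs2", "Rs3"]))
# [('DIV', 'Rs1', 'R2', 'R3'), ('ADD', 'Rs2', 'R6', 'R7')] - WAR eliminated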

Instruction issuing process
- detect independent instructions (instruction window)
- rename registers: typically 32 user-visible registers, extended to 45-60 total registers
- dispatch: send renamed instructions to functional units
- schedule the resources: can't necessarily issue instructions even if independent

Detect and rename (issue)
- instruction window: N instructions checked
- up to M instructions may be issued per cycle

Generic superscalar processor (M issue)

Dataflow management: issue and rename
Tomasulo's algorithm:
- issue instructions to functional units (reservation stations) with available operand values
- unavailable source operands are given the name (tag) of the reservation station whose result is the operand
- continue issuing until the unit's reservation stations are full
- un-issued instructions are pending and held in a buffer; new instructions that depend on pending ones are also pending

Dataflow issue with reservation stations
each reservation station has:
- registers to hold S1 and S2 values (if available), or
- tags to indicate where the values will come from
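
A sketch of that structure: at issue, ready operands are copied into the station as values, and pending ones are recorded as the tag of the producing station. The data layout and names are assumptions, not the book's implementation.

from dataclasses import dataclass

@dataclass
class Station:
    op: str
    v1: int = None; v2: int = None   # operand values, when available
    t1: str = None; t2: str = None   # otherwise, tags of producing stations

regfile = {"R2": 10, "R3": 5}        # architectural values that are ready
reg_tag = {"R1": "RS0"}              # R1 pending: station RS0 will produce it

def issue(op, src1, src2):
    rs = Station(op)
    for src, vf, tf in ((src1, "v1", "t1"), (src2, "v2", "t2")):
        if src in reg_tag:
            setattr(rs, tf, reg_tag[src])   # not ready: wait on the tag
        else:
            setattr(rs, vf, regfile[src])   # ready: copy the value now
    return rs

print(issue("MUL", "R1", "R2"))
# Station(op='MUL', v1=None, v2=10, t1='RS0', t2=None)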

Generic Superscalar

Managing out of order execution Simple register file organization Centralised reorder buffer

Managing out of order execution Distributed reorder buffer

ARM processor (ARM 1020)
- simple, in-order 6-8 stage pipeline
- widely used in SoCs

Freescale E600 data paths
- used in complex SoCs
- out-of-order execution
- branch history
- vector instructions
- multiple caches

Summary: processor design
1. Processor core selection
2. Baseline processor pipeline
   - in-order execution performance
3. Buffer design
   - maximum-rate
   - mean-rate
4. Dealing with branches
   - branch target capture
   - branch prediction