Laboratory for Computer Science

Slides:



Advertisements
Similar presentations
CH14 Instruction Level Parallelism and Superscalar Processors
Advertisements

Topics Left Superscalar machines IA64 / EPIC architecture
Computer Organization and Architecture
Morgan Kaufmann Publishers The Processor
1 Pipelining Part 2 CS Data Hazards Data hazards occur when the pipeline changes the order of read/write accesses to operands that differs from.
CMSC 611: Advanced Computer Architecture Pipelining Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.
Instructor: Yuzhuang Hu Final Exam! The final exam is scheduled on 7 th, August, Friday 7:00 pm – 10:00 pm.
Advanced Computer Architectures Laboratory on DLX Pipelining Vittorio Zaccaria.
COMP25212 Further Pipeline Issues. Cray 1 COMP25212 Designed in 1976 Cost $8,800,000 8MB Main Memory Max performance 160 MFLOPS Weight 5.5 Tons Power.
1 Lecture 3: Instruction Set Architecture ISA types, register usage, memory addressing, endian and alignment, quantitative evaluation.
Mehmet Can Vuran, Instructor University of Nebraska-Lincoln Acknowledgement: Overheads adapted from those provided by the authors of the textbook.
CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.
1 Advanced Computer Architecture Limits to ILP Lecture 3.
OMSE 510: Computing Foundations 4: The CPU!
Pipeline Computer Organization II 1 Hazards Situations that prevent starting the next instruction in the next cycle Structural hazards – A required resource.
Lecture Objectives: 1)Define pipelining 2)Calculate the speedup achieved by pipelining for a given number of instructions. 3)Define how pipelining improves.
ELEN 468 Advanced Logic Design
Review: Pipelining. Pipelining Laundry Example Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold Washer takes 30 minutes Dryer.
Instruction-Level Parallelism (ILP)
Chapter 8. Pipelining. Instruction Hazards Overview Whenever the stream of instructions supplied by the instruction fetch unit is interrupted, the pipeline.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Pipelining III Steve Ko Computer Sciences and Engineering University at Buffalo.
Computer Organization and Architecture The CPU Structure.
1  2004 Morgan Kaufmann Publishers Chapter Six. 2  2004 Morgan Kaufmann Publishers Pipelining The laundry analogy.
L18 – Pipeline Issues 1 Comp 411 – Spring /03/08 CPU Pipelining Issues Finishing up Chapter 6 This pipe stuff makes my head hurt! What have you.
L17 – Pipeline Issues 1 Comp 411 – Fall /1308 CPU Pipelining Issues Finishing up Chapter 6 This pipe stuff makes my head hurt! What have you been.
RISC. Rational Behind RISC Few of the complex instructions were used –data movement – 45% –ALU ops – 25% –branching – 30% Cheaper memory VLSI technology.
Appendix A Pipelining: Basic and Intermediate Concepts
S. Barua – CPSC 440 CHAPTER 5 THE PROCESSOR: DATAPATH AND CONTROL Goals – Understand how the various.
ENGS 116 Lecture 51 Pipelining and Hazards Vincent H. Berk September 30, 2005 Reading for today: Chapter A.1 – A.3, article: Patterson&Ditzel Reading for.
Instruction Sets and Pipelining Cover basics of instruction set types and fundamental ideas of pipelining Later in the course we will go into more depth.
Lecture 15: Pipelining and Hazards CS 2011 Fall 2014, Dr. Rozier.
Chun Chiu. Overview What is RISC? Characteristics of RISC What is CISC? Why using RISC? RISC Vs. CISC RISC Pipelines Advantage of RISC / disadvantage.
1 Pipelining Reconsider the data path we just did Each instruction takes from 3 to 5 clock cycles However, there are parts of hardware that are idle many.
Chapter 2 Summary Classification of architectures Features that are relatively independent of instruction sets “Different” Processors –DSP and media processors.
RISC By Ryan Aldana. Agenda Brief Overview of RISC and CISC Features of RISC Instruction Pipeline Register Windowing and renaming Data Conflicts Branch.
CS1104 – Computer Organization PART 2: Computer Architecture Lecture 12 Overview and Concluding Remarks.
EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining.
Chapter 8 CPU and Memory: Design, Implementation, and Enhancement The Architecture of Computer Hardware and Systems Software: An Information Technology.
University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell CS352H: Computer Systems Architecture Topic 8: MIPS Pipelined.
Chapter 4 CSF 2009 The processor: Pipelining. Performance Issues Longest delay determines clock period – Critical path: load instruction – Instruction.
Comp Sci pipelining 1 Ch. 13 Pipelining. Comp Sci pipelining 2 Pipelining.
RISC and CISC. What is CISC? CISC is an acronym for Complex Instruction Set Computer and are chips that are easy to program and which make efficient use.
CSCI-365 Computer Organization Lecture Note: Some slides and/or pictures in the following are adapted from: Computer Organization and Design, Patterson.
Pipelining and Parallelism Mark Staveley
Oct. 18, 2000Machine Organization1 Machine Organization (CS 570) Lecture 4: Pipelining * Jeremy R. Johnson Wed. Oct. 18, 2000 *This lecture was derived.
Lecture 9. MIPS Processor Design – Pipelined Processor Design #1 Prof. Taeweon Suh Computer Science Education Korea University 2010 R&E Computer System.
Csci 136 Computer Architecture II – Superscalar and Dynamic Pipelining Xiuzhen Cheng
High Performance Computing1 High Performance Computing (CS 680) Lecture 2a: Overview of High Performance Processors * Jeremy R. Johnson *This lecture was.
CSCI-365 Computer Organization Lecture Note: Some slides and/or pictures in the following are adapted from: Computer Organization and Design, Patterson.
ECE/CS 552: Pipeline Hazards © Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim.
CS203 – Advanced Computer Architecture Pipelining Review.
Pipelining: Implementation CPSC 252 Computer Organization Ellen Walker, Hiram College.
Advanced Architectures
Pipelining Chapter 6.
CSCI206 - Computer Organization & Programming
Morgan Kaufmann Publishers
ELEN 468 Advanced Logic Design
Overview Introduction General Register Organization Stack Organization
Single Clock Datapath With Control
CS 152 Computer Architecture and Engineering Lecture 4 - Pipelining
Pipelining Chapter 6.
Central Processing Unit
CSC 4250 Computer Architectures
CSCI206 - Computer Organization & Programming
Control unit extension for data hazards
Control unit extension for data hazards
Computer Architecture
Introduction to Computer Organization and Architecture
Control unit extension for data hazards
Lecture 4: Instruction Set Design/Pipelining
Presentation transcript:

Laboratory for Computer Science Pipeline Hazards Krste Asanovic Laboratory for Computer Science M.I.T.

Pipelined DLX Datapath without interlocks and jumps

Data Hazards ... r1 ← r0 + 10 r4 ← r1 + 17 Oops!

Resolving Data Hazards 1. Interlocks Freeze earlier pipeline stages until data becomes available 2. Bypasses If data is available somewhere in the datapath provide a bypass to get it to the right stage

Interlocks to resolve Data Hazards ... r1 ← r0 + 10 r4 ← r1 + 17

Stalled Stages and Pipeline Bubbles

Interlock Control Logic worksheet Compare the source registers of the instruction in the decode stage with the destination register of the uncommitted instructions.

Interlock Control Logic ignoring jumps & branches Should we always stall if the rs field matches some rd? not every instruction writes registers ⇒ we not every instruction reads registers ⇒ re

Source & Destination Registers

Deriving the Stall Signal Stall if the source registers of the instruction in the decode stage matches the destination register of the uncommitted instructions.

The Stall Signal

Hazards due to Loads & Stores Is there any possible data hazard in this instruction sequence?

Hazards due to Loads & Stores depends on the memory system However, the hazard is avoided because our memory system completes writes in one cycle !

VAX: A Complex Instruction Set

VAX Extremes Long instructions, e.g., emodh #1.2334, 70000(r1)[r2], – 54 byte encoding (rumors of even longer instruction sequences...) Memory referencing behavior – In worst case, a single instruction could require at least 41 different memory pages to be resident in memory to complete execution – Doesn’t include string instructions which were designed to be interruptable

Reduced Instruction Set Computers (Cocke, IBM; Patterson, UC Berkeley; Hennessy, Stanford) Compilers have difficulty using complex instructions VAX: 60% of microcode for 20% of instructions, only responsible for 0.2% execution time IBM experiment retargets 370 compiler to use simple subset of ISA => Compiler generated faster code! Simple instruction sets don’t need microcode Use fast memory near processor as cache, not microcode storage Design ISA for simple pipelined implementation Fixed length, fixed format instructions Load/store architecture with up to one memory access/instruction Few addressing modes, synthesize others with code sequence Register-register ALU operations Delayed branch

MIPS R2000 (One of first commercial RISCs, 1986) Load/Store architecture • 32x32-bit GPR (R0 is wired), HI & LO SPR (for multiply/divide) • 74 instructions • Fixed instruction size (32 bits), only 3 formats • PC-relative branches, register indirect jumps • Only base+displacement addressing mode • No condition bits, compares write GPRs, branches test GPRs • Delayed loads and branches Five-stage instruction pipeline • Fetch, Decode, Execute, Memory, Write Back • CPI of 1 for register-to-register ALU instructions • 8 MHz clock •Tightly-coupled off-chip FP accelerator (R2010)

RISC/CISC Comparisons (late 80s/early 90s) R2000 vs VAX 8700 [Bhandarkar and Clark, ‘91] • R2000 has ~2.7x advantage with equivalent technology Intel 80486 vs Intel i860 (both 1989) • Same company, same CAD tools, same process • i860 2-4x faster - even more on some floating-point tasks DEC nVAX vs Alpha 21064 (both 1992) • Alpha 2-4x faster

Complications due to Jumps A jump instruction kills (not stalls) the following instruction How?

Pipelining Jumps Any interaction between stall and jump?

Jump Pipeline Diagrams

Pipelining Conditional Branches Branch condition is not known until the execute stage what action should be taken in the decode stage ?

Conditional Branches: solution 1 If the branch is taken - kill the two following instructions - the instruction at the decode stage is not valid ⇒ stall signal is not valid

New Stall Signal stall = (((rf1D =wsE).weE + (rf1D =wsM).weM + (rf1D =wsW).weW).re1D + ((rf2D =wsE).weE + (rf2D =wsM).weM + (rf2D =wsW).weW).re2D) . !((opcodeE=BEQZ).z + (opcodeE=BNEZ).!z)

Control Equations for PC Muxes Solution 1

Conditional Branches: solution 2 Test for zero at the decode stage Need to kill only one instruction ! Wouldn’t work if DLX had general branch conditions (i.e., r1>r2)?

Conditional Branches: solution 3 Delayed Branches Change the semantics of branches and jumps ⇒ Instruction after branch always executed, regardless if branch taken or not taken. Need not kill any instructions !

Annulling Branches

Branch Delay Slots Advantages Disadvantages First introduced in pipelined microcode engines. Adopted for early RISC machines with single-issue pipelines, made visible to user-level software. Advantages • simplifies control logic for simple pipeline • compiler helps reduce branch hazard penalties – ~70% of single delay slots usefully filled Disadvantages • complicates ISA specification and programming – adds extra “next-PC” state to programming model • complicates control logic for more aggressive implementations – e.g., out-of-order superscalar designs ⇒Out of favor for new (post-1990) general-purpose ISAs Later lectures will cover advanced techniques for control hazards: • dynamic (run-time) schemes, e.g., branch prediction • static (compile-time) schemes, e.g., predicated execution

Pipelining Delayed Jumps & Links

Bypassing Each stall or kill introduces a bubble in the pipeline ⇒ CPI > 1 A new datapath, i.e., a bypass, can get the data from the output of the ALU to its input

Adding Bypasses ... (I1) r1 ← r0 + 10 (I2) r4 ← r1 + 17 Of course you can add many more bypasses!

The Bypass Signal deriving it from the stall signal

Usefulness of a Bypass

Bypass and Stall Signals

Fully Bypassed Datapath

Why an Instruction may not be dispatched every cycle (CPI>1) Full bypassing may be too expensive to implement – typically all frequently used paths provided – some infrequently used bypass paths may increase cycle time and/or cost more than their benefit in reducing CPI Loads have two cycle latency – Instruction after load cannot use load result – MIPS-I ISA defined load delay slots, a software-visible pipeline hazard (compiler schedules independent instruction or inserts NOP to avoid hazard). Removed in MIPS-II. Conditional branches may cause bubbles – kill following instruction(s) if no delay slots (Machines with software-visible delay slots may execute significant number of NOP instructions inserted by compiler code scheduling)