CSE431 L13 SS Execute & Commit.1Irwin, PSU, 2005 CSE 431 Computer Architecture Fall 2005 Lecture 13: SS Backend (Execute, Writeback & Commit) Mary Jane.

Presentation transcript:

CSE 431 Computer Architecture, Fall 2005
Lecture 13: SS Backend (Execute, Writeback & Commit)
Mary Jane Irwin
[Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005 and Superscalar Microprocessor Design, Johnson, © 1992]

SS Pipeline
- FETCH (in order): fetch multiple instructions
- DECODE & DISPATCH (in order): decode instructions; 1st RUU function (RUU allocation, src operand copying (or Tag field set), dst Tag field set)
- ISSUE & EXECUTE (out of order): 3rd RUU function (when both src operands are Ready and an FU is free, schedule the Result Bus and issue the instr for execution)
- WRITE BACK RESULT (out of order): 2nd RUU function (copy Result Bus data to matching src's and to the RUU dst entry)
- COMMIT (in order): 4th RUU function (for the instr at RUU_Head (if executed), write dst Contents to RegFile)
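The in-order / out-of-order / in-order split above can be sketched as a small queue model. This is only an illustrative sketch, not the lecture's simulator: the class name `RUU`, its methods, and the dictionary fields are assumptions made for the example, and source operands and issue logic are left out.

```python
from collections import deque

class RUU:
    """Toy register update unit (hypothetical model, not from the slides):
    entries are allocated in order, may complete out of order,
    and are committed to the register file in order from the head."""

    def __init__(self, size):
        self.size = size
        self.entries = deque()  # entries[0] plays the role of RUU_Head
        self.regfile = {}
        self.next_tag = 0

    def dispatch(self, dst):
        """1st RUU function: allocate an entry at the tail and set its dst tag."""
        if len(self.entries) >= self.size:
            return None  # RUU full, so dispatch stalls
        tag = self.next_tag
        self.next_tag += 1
        self.entries.append({"tag": tag, "dst": dst, "executed": False, "value": None})
        return tag

    def write_back(self, tag, value):
        """2nd RUU function: copy the result into the matching RUU dst entry."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["executed"] = True
                entry["value"] = value
                return

    def commit(self):
        """4th RUU function: retire the head entry only if it has executed."""
        if self.entries and self.entries[0]["executed"]:
            entry = self.entries.popleft()
            self.regfile[entry["dst"]] = entry["value"]
            return entry["tag"]
        return None
```

If a younger instruction writes back first, `commit()` returns `None` until the head entry has executed, which is what keeps the register file update in program order.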

Result Buses Utilization
- Our SS model has only one Result Bus to carry results generated by the FUs to the RUU and LSQ
  - Even at high levels of performance, the utilization of the Result Bus (i.e., the fraction of its capacity actually used) is only about 70%
- If an FU's request for the Result Bus is not granted, instruction issue to that FU is stalled until the bus request can be granted (i.e., the FU remains "busy")
[Figure: average number of results waiting for the Result Bus vs. number of Result Buses. From Johnson, 1992]

Arbitrating For Result Buses
- Since there are usually fewer Result Buses than FUs, the FUs must arbitrate for use of the existing Result Buses
  - The arbiter decides not only which FU is granted use of the Result Buses, but also which of the two buses is to be used
    - Prioritizing old requests over new helps prevent starvation
- Adding a second Result Bus reduces the impact of bus contention and improves performance (by almost 19%)
- But adding a third Result Bus yields only a very small improvement in performance (less than 3%)
From Johnson, 1992
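An oldest-first arbiter of the kind described above might look like the following minimal sketch. The function name `arbitrate` and the `(fu_name, request_cycle)` request format are assumptions for illustration; the slides do not specify an implementation.

```python
def arbitrate(requests, num_buses):
    """Grant Result Buses to the oldest requests first (hypothetical model).

    requests: list of (fu_name, request_cycle) pairs for FUs waiting this cycle.
    Returns (grants, waiting): grants maps bus index -> fu_name, and waiting
    lists the requests that must retry in the next cycle.
    """
    # sorted() is stable, so requests made in the same cycle keep their list order.
    ordered = sorted(requests, key=lambda req: req[1])
    grants = {bus: fu for bus, (fu, _) in enumerate(ordered[:num_buses])}
    waiting = ordered[num_buses:]
    return grants, waiting
```

Because the oldest outstanding request always wins a bus, no FU can be passed over indefinitely, which is the starvation argument made on the slide.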

Result Forwarding
- Result forwarding supplies operands directly to the waiting instr's in the RUU to resolve true dependencies that could not be resolved during decode
  - The cost of forwarding is the comparison logic in the RUU that compares the Result Bus Tag to the source operand Tags
  - Need a set of comparators for each Result Bus
- About two-thirds of all results are forwarded to one waiting operand, and about one-sixth are forwarded to more than one
[Figure: fraction of results vs. number of RUU entries receiving the result as an input operand. From Johnson, 1992]
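The tag comparison that forwarding requires can be sketched as a broadcast over the waiting RUU entries. This is a software stand-in for what would be parallel comparators in hardware; the function name `forward` and the entry layout (`srcs`, `tag`, `ready`, `value`) are assumptions for the example.

```python
def forward(ruu_entries, result_tag, result_value):
    """Broadcast one Result Bus value to every waiting source operand.

    Each entry is a dict with a "srcs" list; each source operand is a dict
    with "tag", "ready" and "value" fields (hypothetical layout).
    Returns the number of operands that captured the result.
    """
    captured = 0
    for entry in ruu_entries:
        for src in entry["srcs"]:
            # One comparator per source operand: Result Bus tag vs. operand tag.
            if not src["ready"] and src["tag"] == result_tag:
                src["value"] = result_value
                src["ready"] = True
                captured += 1
    return captured
```

With more than one Result Bus, this loop would run once per bus each cycle, which corresponds to the extra set of comparators noted above.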

- Add a slide on the impact of different numbers and types of FUs?
- Add a slide on the impact of being able to commit more than one instruction at a time? (If any, probably small?)

Performance Advantages
- Hardware complexity arises from four major hardware features
  - Out-of-order issue (OOI)
  - Register renaming
  - Branch prediction
  - 4-way instruction fetch and decode

Feature                   Performance advantage
OOI                       52%
Register Renaming         36%
Branch Prediction         30%
4-way Fetch & Decode      18%

From Johnson, 1992

ILP in a Perfect OOI-OOC Processor
- The perfect processor has
  - An infinite number of rename registers, which eliminates all storage hazards (i.e., write-before-write (WAW) and write-before-read (WAR))
  - No limit (fetch, decode, dispatch, issue, FU, buses, ports) on the number of instr's that can begin execution simultaneously, as long as read-before-write (RAW) true data hazards are not present
  - Perfect branch and jump (including jump register) prediction
  - Loads can be moved before stores (as long as the addresses are not identical) with memory address analysis
  - All FUs have a 1 cycle latency
  - Perfect caches with 1 cycle latency
From H&P, 2003
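Under these assumptions the only thing that limits ILP is the longest chain of true dependences. A minimal sketch of that dataflow limit follows; the function name `ideal_ilp` and the dependence-list encoding are assumptions for illustration, not something taken from H&P or the slides.

```python
def ideal_ilp(deps):
    """Dataflow-limit ILP for a perfect processor (hypothetical helper).

    deps[i] lists the indices of earlier instructions whose results
    instruction i reads (true dependences only). Every instruction has a
    1-cycle latency and there are no resource limits.
    """
    if not deps:
        return 0.0
    finish = []  # finish[i] = cycle in which instruction i completes
    for producers in deps:
        start = max((finish[p] for p in producers), default=0)
        finish.append(start + 1)  # 1-cycle FU latency
    return len(deps) / max(finish)
```

For example, five instructions whose longest dependence chain is three instructions long complete in three cycles, giving an ILP of 5/3.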

Effect of Instruction Window Size on ILP
- Instruction window: the set of instructions that are examined simultaneously for execution
From H&P, 2003
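A rough way to see why window size matters is to let only the oldest unexecuted instructions be candidates each cycle. The sketch below is a simplified model under stated assumptions, not the H&P methodology: the name `windowed_ilp` is hypothetical, window entries are freed as soon as an instruction executes, latency is 1 cycle, and every index in `deps[i]` is smaller than `i`.

```python
def windowed_ilp(deps, window):
    """ILP when only the `window` oldest unexecuted instructions are examined.

    deps[i] lists earlier instructions whose results instruction i reads.
    Simplified model: 1-cycle latency, no other resource limits.
    """
    if window < 1:
        raise ValueError("window must hold at least one instruction")
    n = len(deps)
    if n == 0:
        return 0.0
    done = [False] * n
    remaining = list(range(n))  # unexecuted instructions in program order
    cycles = 0
    while remaining:
        cycles += 1
        in_window = remaining[:window]
        # Ready means every producer finished in an earlier cycle.
        ready = {i for i in in_window if all(done[p] for p in deps[i])}
        for i in ready:
            done[i] = True
        remaining = [i for i in remaining if i not in ready]
    return n / cycles
```

With two independent three-instruction chains laid out one after the other, a window of 6 overlaps the chains completely (ILP 2.0), while a window of 2 mostly sees one chain at a time (ILP 1.2).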

Effect of Realistic Branch Prediction on ILP
- On a processor with an instruction window size of 2K and maximum 64-way issue capability
From H&P, 2003

Effect of Finite Rename Registers
- On a processor with an instruction window size of 2K, maximum 64-way issue capability, and a tournament branch predictor with 8K entries
From H&P, 2003

A SS Example
- Intel Pentium 4 (IA-32 ISA)
  - Decodes the IA-32 instructions into micro-operations (µops)
  - Does register renaming with a RUU-like structure
  - Has a 20 stage pipeline
    [Pipeline diagram, with # cycles per stage. Stages: T$ access (Bpredict), RUU allocation, Instr dispatch, RegFile access, Execution, Commit. Queues: RUU queue, FU queues, µop queue]
  - 7 FUs: 2 integer ALUs, 1 FP ALU, 1 FP move, load, store, complex
  - Up to 126 instructions in flight, including 48 loads and 24 stores
  - 4K entry branch predictor

Another SS Example
- IBM Power4, 970 (64-bit data width)
  - Fetch 8 instr per cycle, dispatch up to 5, issue (execute) up to 8, commit up to 5
    - 5-way superscalar
    - More than 200 instructions "in flight" at any given time (5x20 in the "dispatch groups", the rest in fetch, decode, and commit)
  - 7 FUs: 2 integer ALUs, 2 FP ALUs, 2 load/store, vector engine
  - Varying pipeline depth
    - 9 fetch and decode stages + 16 stages for integer, 17 stages for load/store, 21 stages for floating point, and up to 25 stages for vector

Next Lecture and Reminders
- Next lecture
  - Exam review
- Next lecture after exam
  - Reading assignment: PH 6.9
- Reminders
  - HW3, Part 2 (the SimpleScalar experiments) due Friday, October 21st ( solution to the TA, by
  - Evening midterm exam scheduled (Tuesday night)
    - Tuesday, October 18th, 20:15 to 22:15, Location 113 IST
    - One student is taking the conflict exam; you know who you are and when it is scheduled
  - Final exam (tentatively) scheduled
    - Tuesday, December 13th, 2:30 to 4:20, Location TBD