CSE502: Computer Architecture Instruction Commit.

Slides:



Advertisements
Similar presentations
Topics Left Superscalar machines IA64 / EPIC architecture
Advertisements

1/1/ / faculty of Electrical Engineering eindhoven university of technology Speeding it up Part 3: Out-Of-Order and SuperScalar execution dr.ir. A.C. Verschueren.
CS6290 Pentiums. Case Study1 : Pentium-Pro Basis for Centrinos, Core, Core 2 (We’ll also look at P4 after this.)
1 Lecture 11: Modern Superscalar Processor Models Generic Superscalar Models, Issue Queue-based Pipeline, Multiple-Issue Design.
CS6290 Speculation Recovery. Loose Ends Up to now: –Techniques for handling register dependencies Register renaming for WAR, WAW Tomasulo’s algorithm.
Lecture 7: Register Renaming. 2 A: R1 = R2 + R3 B: R4 = R1 * R R1 R2 R3 R4 Read-After-Write A A B B
Computer Structure 2014 – Out-Of-Order Execution 1 Computer Structure Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.
1 Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ.
Spring 2003CSE P5481 Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers)
CPE 731 Advanced Computer Architecture ILP: Part IV – Speculative Execution Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University.
Chapter 8. Pipelining. Instruction Hazards Overview Whenever the stream of instructions supplied by the instruction fetch unit is interrupted, the pipeline.
Chapter 12 CPU Structure and Function. CPU Sequence Fetch instructions Interpret instructions Fetch data Process data Write data.
Computer Organization and Architecture
Computer Architecture 2011 – Out-Of-Order Execution 1 Computer Architecture Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.
1 Lecture 7: Out-of-Order Processors Today: out-of-order pipeline, memory disambiguation, basic branch prediction (Sections 3.4, 3.5, 3.7)
Chapter 12 Pipelining Strategies Performance Hazards.
Chapter 12 CPU Structure and Function. Example Register Organizations.
1 Lecture 9: Dynamic ILP Topics: out-of-order processors (Sections )
Group 5 Alain J. Percial Paula A. Ortiz Francis X. Ruiz.
In-Line Interrupt Handling for Software Managed TLBs Aamer Jaleel and Bruce Jacob Electrical and Computer Engineering University of Maryland at College.
Lecture 13: Commit, Exceptions, Interrupts. Commit is typically the last stage of the pipeline Anything that an instruction does at this point is irrevocable.
Recall: Three I/O Methods Synchronous: Wait for I/O operation to complete. Asynchronous: Post I/O request and switch to other work. DMA (Direct Memory.
1 Appendix A Pipeline implementation Pipeline hazards, detection and forwarding Multiple-cycle operations MIPS R4000 CDA5155 Spring, 2007, Peir / University.
CSE 340 Computer Architecture Summer 2014 Basic MIPS Pipelining Review.
CS.305 Computer Architecture Enhancing Performance with Pipelining Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from.
Processor Architecture
Operating Systems 1 K. Salah Module 1.2: Fundamental Concepts Interrupts System Calls.
Processes and Virtual Memory
1 Lecture 7: Speculative Execution and Recovery Branch prediction and speculative execution, precise interrupt, reorder buffer.
Implementing Precise Interrupts in Pipelined Processors James E. Smith Andrew R.Pleszkun Presented By: Shrikant G.
OOO Pipelines - II Smruti R. Sarangi IIT Delhi 1.
Samira Khan University of Virginia Feb 9, 2016 COMPUTER ARCHITECTURE CS 6354 Precise Exception The content and concept of this course are adapted from.
Computer Architecture Chapter (14): Processor Structure and Function
Data Prefetching Smruti R. Sarangi.
William Stallings Computer Organization and Architecture 8th Edition
Smruti R. Sarangi IIT Delhi
Computer Structure Multi-Threading
OOO Execution of Memory Operations
OOO Execution of Memory Operations
CSE 502: Computer Architecture
CS5100 Advanced Computer Architecture Hardware-Based Speculation
Module: Handling Exceptions
Pipelining: Advanced ILP
Sequential Execution Semantics
Lecture 6: Advanced Pipelines
Lecture 14 Virtual Memory and the Alpha Memory Hierarchy
Lecture 11: Out-of-order Processors
Lecture: Out-of-order Processors
Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2
Smruti R. Sarangi IIT Delhi
ECE 2162 Reorder Buffer.
Smruti R. Sarangi Computer Science and Engineering, IIT Delhi
Lecture: Out-of-order Processors
Lecture 8: Dynamic ILP Topics: out-of-order processors
Adapted from the slides of Prof
15-740/ Computer Architecture Lecture 5: Precise Exceptions
Super-scalar.
Data Prefetching Smruti R. Sarangi.
Adapted from the slides of Prof
Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/30/2011
CSE 451: Operating Systems Autumn 2003 Lecture 10 Paging & TLBs
Additional ILP Topics Prof. Eric Rotenberg
CSE 451: Operating Systems Autumn 2003 Lecture 10 Paging & TLBs
Control Hazards Branches (conditional, unconditional, call-return)
Lecture 9: Dynamic ILP Topics: out-of-order processors
Conceptual execution on a processor which exploits ILP
Spring 2019 Prof. Eric Rotenberg
Presentation transcript:

CSE502: Computer Architecture Instruction Commit

CSE502: Computer Architecture The End of the Road (um… Pipe) Commit is typically the last stage of the pipeline Anything an insn. does at this point is irrevocable – Only actions following sequential execution allowed – E.g., wrong path instructions may not commit They do not exist in the sequential execution

CSE502: Computer Architecture Everything Should Appear In-Order ISA defines program execution in sequential order To the outside, CPU must appear to execute in order When is someone looking?

CSE502: Computer Architecture “Looking” at CPU State When OS swaps contexts – OS saves the current program state (requires “looking”) – Allows restoring the state later When program has a fault (e.g., page fault) – OS steps in and “looks” at the “current” CPU state

CSE502: Computer Architecture Implementation in the CPU ARF keeps state corresponding to committed insns. – Commit from ROB happens in order – ARF always contains some RF state of sequential execution Whoever wants to “look” should look in ARF – What about insns. that executed out of order?

CSE502: Computer Architecture Only the Sequential Part Matterns LSQ PRF RS fPC ROB PC Memory Sequential View of the Processor PC Memory RF State of the Superscalar Out-of-Order Processor RF What if there’s no ARF?

CSE502: Computer Architecture View of the Unified Register File PRFsRAT aRAT If you need to “see” a register, you go through the aRAT first. ARF

CSE502: Computer Architecture View of Branch Mispredictions ROB LSQ PRF RS fPC PC Memory ARF Mispredicted Branch Wrong-path instructions are flushed… architected state has never been touched Fetch correct path instructions Which can update the architected state when they commit

CSE502: Computer Architecture Committing Instructions (1/2) “Retire” vs. “Commit” – Sometimes people use this to mean the same thing – Sometimes they mean different things Check the context! Insn. commits by making “effects” visible – (A)RF, Memory/$, PC

CSE502: Computer Architecture Committing Instructions (2/2) When an insn. executes, it modifies processor state – Update a register – Update memory – Update the PC To make “effects” visible, core copies values – Value from Physical Reg to Architected Reg – Value from LSQ to memory/cache – Value from ROB to Architected PC

CSE502: Computer Architecture x86 Commit (1/2) ROB contains uops, outside world knows insns. ROB uop 1 (ADD) uop 1 (SUB) commit ADD EAX, EBX SUB EBX, ECX ADD EDX, [EAX] commit POPA ???? If we take an interrupt right now, we’ll see a half-executed instruction! uop 1 (LD) uop 2 (ADD) uop 1 (POP) uop 2 (POP) uop 3 (POP) uop 4 (POP) uop 5 (POP) uop 6 (POP) uop 7 (POP) uop 8 (POP)

CSE502: Computer Architecture x86 Commit (2/2) Works when uop-flow length ≤ commit width What to do with long flows? – In all cases: can’t commit until all uops in a flow completed – Just commit N uops per cycle... but make commit uninterruptable ROB uop 1 uop 2 uop 3 uop 4 uop 5 uop 6 uop 7 uop 8 commit POPA Timer interrupt! Defer: Can’t act on this yet... Now do something about the interrupt.

CSE502: Computer Architecture Handling REP-prefixed Instructions (1/2) Ex. REP STOSB (memset EAX value ECX times) – Entire sequence is one x86 instruction – What if REPs for 1,000,000,000 iterations? Can’t “lock up” for a billion cycles while uops commit Can’t wait to commit until all uops are done – Can’t even fetch the entire instruction – not enough space in ROB At the ISA level, REP iterations are interruptible... – Treat each iteration as a separate “macro-op”

CSE502: Computer Architecture Handling REP-prefixed Instructions (2/2) MOV EDI, ; the array we want to memset SUB EAX, EAX; zero CLD; clear direction flag (REP fwd) MOV ECX, 4; do for 100 iterations REP STOSB; memset! ADD EBX, EDX; unrelated instruction MOV EDI, xxx SUB EAX, EAX CLD MOV ECX, 4 uCMP ECX, 0 uJCZ STA tmp, EDI STD EAX, tmp SUB ECX, 1 ADD EDI, 1 uCMP ECX, 0 uJCZ STA tmp, EDI STD EAX, tmp SUB ECX, 1 ADD EDI, 1 uCMP ECX, 0 uJCZ STA tmp, EDI STD EAX, tmp SUB ECX, 1 ADD EDI, 1 uCMP ECX, 0 uJCZ STA tmp, EDI STD EAX, tmp SUB ECX, 1 ADD EDI, 1 uCMP ECX, 0 uJCZ ADD EBX, EDX Check for zero iterations (could happen with MOV ECX, 0 ) MOVS flow REP overhead for 1 st iteration MOVS flow REP overhead for 2nd iter. MOVS flow REP overhead for 3 rd iter MOVS flow REP 4 th iter A: B: C: D: All of these are interruptible points (commit can stop and effects be seen by outside world), since they all have well- defined ISA-level states: A: ECX=3, EDI = ptr+1 B: ECX=2, EDI = ptr+2 C: ECX=1, EDI = ptr+3 D: ECX=0, EDI = ptr+4

CSE502: Computer Architecture Blocked Commit To commit N insns. per cycle, ROB needs N ports – (in addition to ports for dispatch, issue, exec, and WB) inst 1 inst 2 inst 3 inst 4 Four read ports for four commits ROB inst 1 inst 2 inst 3 inst 4 One wide read port ROB Can’t reuse ROB entries until all in block have committed. Can’t commit across blocks. Reduces cost, lowers IPC due to constraints.

CSE502: Computer Architecture Faults Divide-by-Zero, Overflow, Page-Fault All occur at a specific point in execution (precise) DBZ! Trap (resume execution) Divide may have executed before other instructions due to OoO scheduling! DBZ! Trap? (when?) CPU maintains appearance of sequential execution

CSE502: Computer Architecture Timing of DBZ Fault Need to hold on to your faults RS ROB Exec: DBZ Architected State Architected State Let earlier instructions commit The arch. state is the same as just before the divide executed in the sequential order Now, raise the DBZ fault and when you switch to the kernel, everything appears as it should On a fault, flush the machine and switch to the kernel Just make note of the fault, but don’t do anything (yet)

CSE502: Computer Architecture Speculative Faults Faults might not be faults… ROB DBZ! Branch Mispredict The fault goes away Which is what we want, since in a sequential execution, the wrong-path divide would not have executed (and faulted) Which is what we want, since in a sequential execution, the wrong-path divide would not have executed (and faulted) (flush wrong-path) Buffer faults until commit to avoid speculative faults

CSE502: Computer Architecture Timing of TLB Miss Store must re-execute (or re-commit) – Cannot leave the ROB TLB miss Trap (resume execution) … Walk page-table, may find a page fault Re-execute store Store TLB miss can stall the core

CSE502: Computer Architecture Load Faults are Similar Load issues, misses in TLB – When load is oldest, switch to kernel for page-table walk …could be painful; there are lots of loads Modern processors use hardware page-table walkers – OS loads a few registers with PT information (pointers) – Simple logic fetches mapping info from memory – Requires page-table format is specified by the ISA

CSE502: Computer Architecture Asynchronous Interrupts Some interrupts are not associated with insns. – Timer interrupt – I/O interrupt (disk, network, etc…) – Low battery, UPS shutdown When the CPU “notices” doesn’t matter (too much) Key Pressed Key Pressed Key Pressed

CSE502: Computer Architecture Two Options for Handling Async Interrupts Handle immediately – Use current architected state and flush the pipeline Deferred – Stop fetching, let processor drain, then switch to handler What if CPU takes a fault in the mean time? Which came “first”, the async. interrupt or the fault?

CSE502: Computer Architecture Store Retirement (1/2) Stores forward to later loads (for same address) – Normally, LSQ provides this facility ld 17 st ld 33 st ld D$ 17 At commit, store Updates cache st ld After store has left the LSQ, the D$ can provide the correct value D$

CSE502: Computer Architecture Store Retirement (2/2) Can’t free LSQ Store entry until write is done – Enables forwarding until loads can get value from cache Have to re-check TLB when doing write – TLB contents at Execute were speculative Store may stall commit for a long time – If there’s a cache miss – If there’s a TLB miss (with HW TLB walk) store All instructions may have successfully executed, but none can commit! All instructions may have successfully executed, but none can commit! Don’t we check DTLB during store- address computation anyway? Do we need to do it again here? Don’t we check DTLB during store- address computation anyway? Do we need to do it again here?

CSE502: Computer Architecture Writeback Buffer (1/2) Want to get stores out of the way quickly store D$ store WB Buffer ld Even if store misses in cache, entering WB buffer counts as committing. Allows other insns. to commit. WB buffer is part of the cache hierarchy. May need to provide values to later loads. Eventually, the cache update occurs, the WB buffer entry is emptied. ld Cache can now provide the correct value. Usually fast, but potential structural hazard

CSE502: Computer Architecture Writeback Buffer (2/2) Stores enter WB Buffer in program order Multiple stores can exist to same address – Only the last store is “visible” AddrValue oldest youngest next to write to cache Load Store 42 Load 42 No one can “see” this store anymore!

CSE502: Computer Architecture Write Combining Buffer (1/2) Augment WBB to combine writes together AddrValue Load Store 42 Load 42 Only one writeback Now instead of two If Stores to same address, combine the writes

CSE502: Computer Architecture Write Combining Buffer (2/2) Can combine stores to same cache line $-Line Addr Cache Line Data Store 84 One cache write can serve multiple original store instructions Benefit: reduces cache traffic, reduces pressure on store buffers Benefit: reduces cache traffic, reduces pressure on store buffers 5678 Aggressiveness of write-combining may be limited by memory ordering model Writeback/combining buffer can be implemented in/integrated with the MSHRs Only certain memory regions may be “write-combinable” (e.g., USWC in x86)

CSE502: Computer Architecture Senior Store Queue Use STQ as WBB (not necessarily write combining) STQ Store STQ head STQ tail DL1 L2 STQ head Store STQ tail New stores cannot allocate into Senior STQ entries until stores complete No WBB and no stall on Store commit While stores are completing, other accesses (loads, etc…) can continue getting the values from the “senior” STQ

CSE502: Computer Architecture Retire Insn. retires by cleaning up all related state Besides updating architected state … needs to deallocate resources – ROB/LSQ entries – Physical register – “colors” of various sorts – RAT checkpoints Most are FIFO’s or Queues – Alloc/dealloc is usually just inc/dec head/tail pointers Unified PRF requires a little more work – Have to return “old” mapping to free list