Advanced Microarchitecture Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by Ilhyun Kim Updated by Mikko Lipasti.

Outline
Instruction scheduling overview
– Scheduling atomicity
– Speculative scheduling
– Scheduling recovery
Complexity-effective instruction scheduling techniques
– CRIB reading
Scalable load/store handling
– NoSQ reading
Building large instruction windows
– Runahead, CFP, iCFP
Control Independence
3D die stacking

Readings
Read on your own:
– Shen & Lipasti, Chapter 10 on Advanced Register Data Flow (skim)
– I. Kim and M. Lipasti, “Understanding Scheduling Replay Schemes,” in Proceedings of the 10th International Symposium on High-performance Computer Architecture (HPCA-10), February
– Srikanth Srinivasan, Ravi Rajwar, Haitham Akkary, Amit Gandhi, and Mike Upton, “Continual Flow Pipelines,” in Proceedings of ASPLOS 2004, October
– Ahmed S. Al-Zawawi, Vimal K. Reddy, Eric Rotenberg, Haitham H. Akkary, “Transparent Control Independence,” in Proceedings of ISCA-34
To be discussed in class:
– T. Shaw, M. Martin, A. Roth, “NoSQ: Store-Load Communication without a Store Queue,” in Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
– Erika Gunadi, Mikko Lipasti, “CRIB: Combined Rename, Issue, and Bypass,” ISCA
– Andrew Hilton, Amir Roth, “BOLT: Energy-efficient Out-of-Order Latency-Tolerant execution,” Proceedings of HPCA
– Loh, G. H., Xie, Y., and Black, B., “Processor Design in 3D Die-Stacking Technologies,” IEEE Micro 27, 3 (May 2007)

Register Dataflow

Instruction scheduling
A process of mapping a series of instructions into execution resources
– Decides when and where an instruction is executed
[Figure: a data dependence graph mapped to two functional units, FU0 and FU1, over successive cycles starting at n]

Instruction scheduling
A set of wakeup and select operations
– Wakeup
  Broadcasts the tags of the selected parent instructions
  Dependent instructions match the broadcast tags to determine whether their source operands are ready
  Resolves true data dependences
– Select
  Picks instructions to issue among a pool of ready instructions
  Resolves resource conflicts
  – Issue bandwidth
  – Limited number of functional units / memory ports
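
The wakeup and select steps above can be sketched as a small loop. This is a minimal illustration, not a hardware model: the function name `schedule`, the oldest-first select policy, and the six-instruction dependence graph are assumptions chosen so that the result lines up with the two-FU example on the "Wakeup and Select" slide.

```python
def schedule(sources, issue_width):
    """Simulate wakeup/select over a window of instructions.

    sources: dict mapping an instruction tag to the set of parent tags
             it still waits on (hypothetical input format).
    Returns one list of issued tags per cycle.
    """
    pending = {tag: set(srcs) for tag, srcs in sources.items()}
    timeline = []
    while pending:
        # Select: grant up to issue_width ready entries, oldest tag first.
        ready = sorted(tag for tag, srcs in pending.items() if not srcs)
        if not ready:
            raise ValueError("no ready instruction: unresolved dependence")
        granted = ready[:issue_width]
        for tag in granted:
            del pending[tag]
        # Wakeup: broadcast the granted tags; dependents clear matching sources.
        for srcs in pending.values():
            srcs.difference_update(granted)
        timeline.append(granted)
    return timeline


# Assumed dependence graph: 2, 3, 4 depend on 1; 5 on 2; 6 on 3 and 5.
window = {1: set(), 2: {1}, 3: {1}, 4: {1}, 5: {2}, 6: {3, 5}}
print(schedule(window, issue_width=2))
```

With two issue slots, instruction 4 is ready in the second cycle but loses the select arbitration to 2 and 3, which is the resource conflict that select resolves.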

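Scheduling loop
Basic wakeup and select operations
[Figure: wakeup logic compares the broadcast tags (tag 1 … tag W) against each entry's left and right source tags (tagL, tagR) and sets readyL / readyR; an entry with both operands ready raises a request; select logic grants among request 0 … request n and issues the selected instruction to an FU; the tag of the selected instruction is broadcast back to the wakeup logic, closing the scheduling loop]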

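Wakeup and Select
[Figure: dependence graph of instructions 1 to 6 scheduled on FU0 and FU1]
Cycle n: ready to issue 1; select 1, wakeup 2, 3, 4
Cycle n+1: ready to issue 2, 3, 4; select 2, 3, wakeup 5, 6
Cycle n+2: ready to issue 4, …; select 4, 5, wakeup 6
Cycle n+3: select 6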

Scheduling Atomicity
Operations in the scheduling loop must occur within a single clock cycle
– For back-to-back execution of dependent instructions
Atomic scheduling
– Cycle n: select 1, wakeup 2, 3
– Cycle n+1: select 2, 3, wakeup 4
– Cycle n+2: select 4
Non-atomic 2-cycle scheduling
– Cycle n: select 1
– Cycle n+1: wakeup 2, 3
– Cycle n+2: select 2, 3
– Cycle n+3: wakeup 4
– Cycle n+4: select 4

Implication of scheduling atomicity
Pipelining is a standard way to improve clock frequency
Hard to pipeline instruction scheduling logic without losing ILP
– ~10% IPC loss in 2-cycle scheduling
– ~19% IPC loss in 3-cycle scheduling
A major obstacle to building high-frequency microprocessors
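
To see why a pipelined scheduling loop costs ILP, the sketch below computes the earliest select cycle of each instruction when a dependent can only be selected `loop_latency` cycles after its parent. The function name `issue_cycles` and the unlimited issue width are assumptions made for illustration; the sketch shows the stretched dependence chain only and does not reproduce the IPC figures quoted above.

```python
def issue_cycles(parents, loop_latency):
    """Earliest select cycle per instruction, relative to cycle n = 0.

    parents: dict mapping an instruction to the list of its parents,
             with keys in program order (parents appear before children).
    loop_latency: cycles taken by one wakeup + select loop iteration.
    """
    cycle = {}
    for instr, ps in parents.items():
        # A dependent can be selected only after every parent's
        # select has been followed by a full scheduling loop.
        cycle[instr] = max((cycle[p] + loop_latency for p in ps), default=0)
    return cycle


# Dependence graph assumed from the atomicity slide: 2 and 3 depend on 1, 4 on 2 and 3.
graph = {1: [], 2: [1], 3: [1], 4: [2, 3]}
print(issue_cycles(graph, loop_latency=1))  # atomic scheduling
print(issue_cycles(graph, loop_latency=2))  # 2-cycle scheduling
```

Under these assumptions instruction 4 is selected at n+2 with an atomic loop and at n+4 with a 2-cycle loop, matching the two timelines on the previous slide.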

Scheduler Designs
Data-Capture Scheduler
– Keeps the most recent register value in reservation stations
– Data forwarding and wakeup are combined
[Figure: Register File, data-captured scheduling window (reservation station), and Functional Units, with a combined forwarding and wakeup path back to the window and a register update path to the Register File]

Scheduler Designs
Non-Data-Capture Scheduler
– Keeps the most recent register value in RF (physical registers)
– Data forwarding and wakeup are decoupled
Complexity benefits
– Simpler scheduler / data / wakeup path
[Figure: non-data-capture scheduling window, Register File, and Functional Units, with separate forwarding and wakeup paths]

Mapping to pipeline stages
[Figure: pipeline stage mapping for AMD K7 (data-capture) and Pentium 4 (non-data-capture), marking the data and data / wakeup paths]

Scheduling atomicity & non-data-capture scheduler
[Figure: a pipeline with an atomic Sched/Exe stage (Fetch, Decode, Sched/Exe, Writeback, Commit) compared with a non-data-capture pipeline (Fetch, Decode, Schedule, Dispatch, RF, Exe, Writeback, Commit) in which the wakeup/select loop spans multiple cycles]
Multi-cycle scheduling loop
Scheduling atomicity is not maintained
– Separated by extra pipeline stages (Disp, RF)
– Unable to issue dependent instructions consecutively
→ Solution: speculative scheduling

Speculative Scheduling
Speculatively wake up dependent instructions even before the parent instruction starts execution
– Keeps the scheduling loop within a single clock cycle
But nobody knows what will happen in the future
Source of uncertainty in instruction scheduling: loads
– Cache hit / miss
– Store-to-load aliasing
→ Eventually affects timing decisions
Scheduler assumes that all types of instructions have pre-determined fixed latencies
– Load instructions are assumed to have the common-case (over 90% in general) $DL1 hit latency
– If incorrect, subsequent (dependent) instructions are replayed

Speculative Scheduling Overview
[Figure: pipeline Fetch, Decode, Schedule, Dispatch, RF, Exe, Writeback/Recover, Commit. Speculative wakeup/select happens in Schedule; speculatively issued instructions occupy the stages up to Exe; when a latency changes, instructions with an invalid input value are re-scheduled from Writeback/Recover.]
Unlike the original Tomasulo’s algorithm
– Instructions are scheduled BEFORE actual execution occurs
– Assumes instructions have pre-determined fixed latencies
  ALU operations: fixed latency
  Load operations: assumes $DL1 latency (common case)

Scheduling replay
Speculation needs verification / recovery
– There’s no free lunch
If the actual load latency is longer than what was speculated (i.e., cache miss)
– Best solution (disregarding complexity): replay data-dependent instructions issued under the load shadow
[Figure: pipeline Fetch, Decode, Rename, Queue, Sched, Disp, RF, Exe, Retire/WB, Commit, showing the instruction flow forward and the verification flow back to the scheduler when a cache miss is detected]

Wavefront propagation
Speculative execution wavefront
– Speculative image of execution (from the scheduler’s perspective)
Both wavefronts propagate along dependence edges at the same rate (1 level / cycle)
– The real wavefront runs behind the speculative wavefront
The load resolution loop delay complicates the recovery process
– A scheduling miss is reported a couple of clock cycles after issue
[Figure: pipeline Fetch, Decode, Rename, Queue, Sched, Disp, RF, Exe, Retire/WB, Commit, showing the speculative execution wavefront (dependence linking), the real execution wavefront (data linking), the instruction flow, and the verification flow]

Load resolution feedback delay in instruction scheduling
Scheduling runs multiple clock cycles ahead of execution
– But instructions can keep track of only one level of dependence at a time (using source operand identifiers)
[Figure: scheduler-to-execution path (Broadcast / wakeup, Select, Dispatch / Payload, RF, Misc., Execution) holding instructions N through N-4; the time delay between scheduling and feedback determines which instructions in this path must be recovered]

Issues in scheduling replay
Cannot stop speculative wavefront propagation
– Both wavefronts propagate at the same rate
– Dependent instructions are unnecessarily issued under load misses
[Figure: Sched / Issue, Exe, and checker stages over cycles n to n+3, with the cache miss signal fed back from the checker]

Requirements of scheduling replay
Conditions for ideal scheduling replay
– All mis-scheduled dependent instructions are invalidated instantly
– Independent instructions are unaffected
Multiple levels of dependence tracking are needed
– e.g., Am I dependent on the current cache miss?
– Longer load resolution loop delay → tracking more levels
Propagation of recovery status should be faster than speculative wavefront propagation
Recovery should be performed on the transitive closure of dependent instructions
[Figure: dependence graph rooted at a load miss]

Scheduling replay schemes
Alpha 21264: Non-selective replay
– Replays all dependent and independent instructions issued under the load shadow
– Analogous to squashing recovery in branch misprediction
– Simple but high performance penalty
  Independent instructions are unnecessarily replayed
[Figure: LD, ADD, OR, AND, BR flowing through Sched, Disp, RF, Exe, Retire; on a cache miss, ALL instructions in the load shadow are invalidated and replayed once the miss is resolved]

Position-based selective replay
Ideal selective recovery
– Replay dependent instructions only
Dependence tracking is managed in matrix form
– Column: load issue slot, row: pipeline stage
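
The difference between non-selective replay and ideal selective replay can be sketched as two set computations over the load shadow. The helper names and the dependences among LD, ADD, OR, AND, BR are assumptions for illustration; the slides do not give the actual dependence graph, and this sketch does not model the position-based matrix hardware.

```python
def dependents_closure(miss_load, consumers):
    """Transitive closure of instructions that depend on miss_load.

    consumers: dict mapping a producer to the list of its direct consumers.
    """
    seen, stack = set(), [miss_load]
    while stack:
        current = stack.pop()
        for nxt in consumers.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def replay_sets(miss_load, shadow, consumers):
    """Return (non_selective, selective) replay sets for one load miss."""
    non_selective = set(shadow)  # everything issued under the load shadow
    selective = dependents_closure(miss_load, consumers) & set(shadow)
    return non_selective, selective


# Assumed dependences: ADD consumes LD, AND consumes ADD; OR and BR are independent.
shadow = ["ADD", "OR", "AND", "BR"]
consumers = {"LD": ["ADD"], "ADD": ["AND"]}
non_selective, selective = replay_sets("LD", shadow, consumers)
print(sorted(non_selective), sorted(selective))
```

Under these assumptions the non-selective scheme replays all four instructions while the selective scheme replays only ADD and AND, which is the gap the position-based matrix is meant to close.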

Low-complexity scheduling techniques
FIFO (Palacharla, Jouppi, Smith, 1996)
– Replaces conventional scheduling logic with multiple FIFOs
  Steering logic puts instructions into different FIFOs considering dependences
  A FIFO contains a chain of dependent instructions
  Only the head instructions are considered for issue
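
A simplified steering rule in the spirit of the description above: place an instruction behind its producer if that producer sits at the tail of a FIFO, otherwise start a new chain in an empty FIFO. This is a sketch under stated assumptions, not the exact heuristic from the paper; the function name `steer`, the `(dest, srcs)` instruction format, and the absence of issue (FIFOs never drain here) are all simplifications.

```python
def steer(program, num_fifos):
    """Statically steer instructions into FIFOs of dependent chains.

    program: list of (dest_register, [source_registers]) in program order.
    Returns a list of FIFOs, each a list of instruction indices.
    """
    fifos = [[] for _ in range(num_fifos)]
    last_writer = {}  # register -> index of the instruction that last wrote it
    for idx, (dest, srcs) in enumerate(program):
        target = None
        # Prefer a FIFO whose tail instruction produces one of our sources.
        for f, queue in enumerate(fifos):
            if queue and any(last_writer.get(s) == queue[-1] for s in srcs):
                target = f
                break
        # Otherwise start a new dependence chain in an empty FIFO.
        if target is None:
            target = next((f for f, q in enumerate(fifos) if not q), None)
        if target is None:
            raise RuntimeError("no suitable FIFO: dispatch would stall here")
        fifos[target].append(idx)
        last_writer[dest] = idx
    return fifos


program = [("r1", []), ("r2", ["r1"]), ("r3", []), ("r4", ["r2"]), ("r5", ["r3"])]
print(steer(program, num_fifos=2))
```

Because each FIFO holds one dependence chain, only the FIFO heads can possibly be ready, so the select logic examines `num_fifos` candidates instead of the whole window.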

FIFO (cont’d) Scheduling example

FIFO (cont’d)
Performance
– Comparable performance to conventional scheduling
– Reduced scheduling logic complexity
– Many related papers on clustered microarchitecture

CRIB
Reading
– Erika Gunadi, Mikko Lipasti: CRIB: Combined Rename, Issue, and Bypass, ISCA
Goals
– Match OOO performance per cycle
– Match OOO frequency
– Match OOO area
– Reduce power significantly
– Eliminate pipelines, latches, rename structures, issue logic

CRIB Data Movement
[Figure: data movement in a conventional physical-register-file-style OoO core (Front-End, RAT, ROB, RS, PRF, Bypass, ALU) compared with in-place execution in CRIB (Front-End, CRIB, ARF)]

In-place Execution
First proposed by Ultrascalar [1999]
– Place instructions in execution stations
– Route operands to instructions
– Goal: massively wide issue
– Power constraints not even on the horizon
CRIB: in-place execution as enabler
– Eliminate pipelined execution lanes, multiported RF, renaming, wakeup & select, clock loads
– Enable efficient speculation recovery
– Enable variable execution latency tolerance

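CRIB Concept
Data values propagate combinationally (no latches)
– Completion bit propagates synchronously (latched)
Instructions stay until all are finished
When all are finished, latch data into ARF latches
[Figure: one CRIB entry between the Previous Entry and the Next Entry; register columns R0 to R3, each with a completion bit C, feed Source1 and Source2 of the ALU; the Destination column takes the ALU output under a write enable (WE)]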

Renaming in CRIB
All connections form in parallel after dispatch
Dependences are resolved by positional renaming
Instructions issue subject to the readiness of their operands
Example (entries in program order):
  Add R2, R0, R0
  Sub R3, R0, R2
  Add R2, R2, R3
  Add R2, R0, R3
[Figure: the four entries over register columns R0 to R3 with completion bits, evaluated across Cyc 1, Cyc 2, Cyc 3]
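
Positional renaming can be sketched by walking the entries in order and letting each entry read, per register column, the value produced by the nearest older writer. The function name `crib_ready_cycles`, the 1-cycle ALU latency, and the assumption that all architected registers are valid at cycle 0 are illustrative choices, not details given on the slide.

```python
def crib_ready_cycles(entries):
    """Cycle in which each CRIB entry's result becomes complete.

    entries: list of (dest, src1, src2) register names in program order.
    Assumes a 1-cycle ALU and all architected registers valid at cycle 0.
    """
    complete_at = {}  # register column -> cycle its value is complete at this position
    cycles = []
    for dest, src1, src2 in entries:
        # An entry evaluates once both source columns carry a completed value.
        start = max(complete_at.get(src1, 0), complete_at.get(src2, 0))
        done = start + 1
        cycles.append(done)
        # Entries below this one see this entry's output in the dest column,
        # which is what makes the renaming positional.
        complete_at[dest] = done
    return cycles


entries = [
    ("R2", "R0", "R0"),  # Add R2, R0, R0
    ("R3", "R0", "R2"),  # Sub R3, R0, R2
    ("R2", "R2", "R3"),  # Add R2, R2, R3
    ("R2", "R0", "R3"),  # Add R2, R0, R3
]
print(crib_ready_cycles(entries))
```

Under these assumptions the four entries complete in cycles 1, 2, 3, and 3, consistent with the Cyc 1 to Cyc 3 labels in the figure; no rename table is consulted because the column position alone identifies the producer.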

Scaling Up CRIB
Multiple CRIB partitions maintained as a circular queue
Only the head ARF has committed state
– Other latches are left transparent
[Figure: Front End, CRIB partitions with ARFs and LQ Bank / SQ pairs, and a shared Mult/Div unit and Cache Port]

Data Propagation across partitions
Transparent latches for data
Regular latches for complete bit
Data values take one additional cycle to travel to the next partition
[Figure: CRIB 0 and CRIB 1, each holding Add R2, R2, R1 over register columns R0 to R3 with completion bits, shown across Cycle 0 to Cycle 3]

CRIB Pipeline Diagram
Fewer pipe stages
– Remove rename stage from front-end
– Remove issue and RF from middle
Combine dependence and data linking
[Figure: OoO pipeline (Fetch, Align, Dec, Alloc, Rnm, Disp, Issue, RF, Int / A-Gen / Load, WB, Cmt) with separate dependence linking and data linking, versus CRIB pipeline (Fetch, Align, Dec, Alloc, Disp, Int / A-Gen / Load, WB) with combined dependence / data linking]

Load-Store Ordering
Loads/stores are ordered aggressively
Recovery: replay in place
No prediction needed; recovery is cheap
[Figure: entries ADD R2, R3, R1; ADD R2, R1, R1; LD R3, R1, R2; ST R0, R1, 1 over register columns R0 to R3, with Data / Addr paths to the LQ and SQ and a misorder signal]

Branch Misprediction
Mispredicted branch drives a global signal up the CRIB
Forces younger instructions to transform into NOPs
Simpler than checkpointing or ROB unrolling
[Figure: a branch mispredict raises a flush signal across register columns R0 to R3; Instruction 0 is kept while Instruction 2 and Instruction 3 become NOPs]

CRIB Findings
CRIB proposal appears promising
– Competitive IPC and area
– Dramatic power reductions
Over baseline1 (“Bobcat”)
– 45% less energy per instruction
– 20-30% better IPC
Over baseline2 (“Nehalem”)
– 75% less energy per instruction
– INT IPC slightly better, FP IPC slightly worse

CRIB Summary
Instructions are inserted from the front end
Instructions inside a CRIB execute subject to readiness of operands
Data propagates without latches
Complete bit ensures that data propagates synchronously
A CRIB retires when all of its instructions are done executing
When a CRIB retires, data are latched in the ARF

Memory Dataflow

Scalable Load/Store Queues
Load queue/store queue
– Large instruction window: many loads and stores have to be buffered (25%/15% of mix)
– Expensive searches
  Positional-associative searches in SQ
  Associative lookups in LQ (coherence, speculative load scheduling)
– Power/area/delay are prohibitive

Store Queue/Load Queue Scaling
Multilevel queues
Bloom filters (quick check for independence)
Eliminate associative load queue via replay [Cain 2004]
– Issue loads again at commit, in order
– Check to see if same value is returned
– Filter load checks for efficiency:
  Most loads don’t issue out of order (no speculation)
  Most loads don’t coincide with coherence traffic
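
The Bloom filter idea listed above can be sketched as follows: in-flight store addresses are hashed into a bit vector, and a load whose bits are not all set is guaranteed independent and can skip the expensive associative search. The class name, the two hash functions, and the 64-bit vector are assumptions for illustration; removing addresses when stores retire (for example with counters) is omitted.

```python
class StoreBloomFilter:
    """Summary of in-flight store addresses (illustrative sketch only)."""

    def __init__(self, num_bits=64):
        self.num_bits = num_bits
        self.bits = 0

    def _positions(self, addr):
        # Two cheap hash functions over the 8-byte-aligned address,
        # chosen for illustration rather than for hash quality.
        word = addr >> 3
        return (word % self.num_bits, (word * 7 + 3) % self.num_bits)

    def insert(self, addr):
        for pos in self._positions(addr):
            self.bits |= 1 << pos

    def may_alias(self, addr):
        # False means definitely independent; True means a full check is needed.
        return all((self.bits >> pos) & 1 for pos in self._positions(addr))


bloom = StoreBloomFilter()
bloom.insert(0x1000)              # an older store is in flight
print(bloom.may_alias(0x1000))    # same address: must search the store queue
print(bloom.may_alias(0x2008))    # bits not set: provably independent
```

A filter like this can report false positives (an unnecessary search) but never false negatives, which is why it is safe as a quick independence check.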

SVW and NoSQ
Store Vulnerability Window (SVW)
– Assign sequence numbers to stores
– Track writes to cache with sequence numbers
– Efficiently filter out safe loads/stores by only checking against writes in vulnerability window
NoSQ
– Rely on load/store alias prediction to satisfy dependent pairs
– Use SVW technique to check
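
A much-simplified model of the SVW filter described above: each committed store gets a sequence number, a small table remembers the sequence number of the last store written per address hash, and a load is re-checked at commit only if a store wrote its address after the point the load already observed. The class and method names, the table size, and the hash are hypothetical; the published design has further details that are not modeled here.

```python
class SVWFilter:
    """Store sequence numbers tracked per address hash (illustrative sketch)."""

    def __init__(self, table_size=16):
        self.table_size = table_size
        self.last_store_ssn = [0] * table_size  # 0 means no store seen yet
        self.next_ssn = 1

    def _index(self, addr):
        return (addr >> 3) % self.table_size

    def commit_store(self, addr):
        """Assign the next sequence number and record the write."""
        ssn = self.next_ssn
        self.next_ssn += 1
        self.last_store_ssn[self._index(addr)] = ssn
        return ssn

    def committed_ssn(self):
        """Sequence number of the youngest committed store."""
        return self.next_ssn - 1

    def load_needs_check(self, addr, ssn_at_execute):
        # The load is vulnerable only to stores that wrote this address
        # after the load executed, i.e. with a larger sequence number.
        return self.last_store_ssn[self._index(addr)] > ssn_at_execute


svw = SVWFilter()
svw.commit_store(0x108)
snapshot = svw.committed_ssn()   # load executes early, having seen store 1
svw.commit_store(0x210)          # older store commits after the load executed
print(svw.load_needs_check(0x210, snapshot))  # inside the vulnerability window
print(svw.load_needs_check(0x108, snapshot))  # safe: filtered out
```

Hash collisions in the table only cause extra checks, never missed ones, so the filter stays conservative.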

Store/Load Optimizations
Weakness: predictor still fails
– Machine should fail gracefully, not fall off a cliff
– Glass jaw
Several other concurrent proposals
– DMDC, Fire-and-forget, …

Key Challenge: MLP
Tolerate/overlap memory latency
– Once first miss is encountered, find another one
Naïve solution
– Implement a very large ROB, IQ, LSQ
– Power/area/delay make this infeasible
Build virtual instruction window

Runahead
Use poison bits to eliminate miss-dependent load program slice
– Forward load slice processing is a very old idea
  Massive Memory Machine [Garcia-Molina et al. 84]
  Datascalar [Burger, Kaxiras, Goodman 97]
– Runahead proposed by [Dundas, Mudge 97]
Checkpoint state, keep running
When miss completes, return to checkpoint
– May need runahead cache for store/load communication
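
Poison-bit propagation can be sketched as a forward pass over the instructions that follow the missing load: a destination register becomes poisoned if any source is poisoned, and is cleaned again when an independent instruction overwrites it. The function name `poisoned_slice` and the `(dest, srcs)` instruction format are assumptions for illustration; checkpointing and the runahead cache are not modeled.

```python
def poisoned_slice(program, miss_index):
    """Indices of instructions in the forward slice of a missing load.

    program: list of (dest_register_or_None, [source_registers]) in program order.
    miss_index: position of the load that missed in the cache.
    """
    poisoned = set()
    slice_indices = []
    for idx in range(miss_index, len(program)):
        dest, srcs = program[idx]
        if idx == miss_index or any(s in poisoned for s in srcs):
            slice_indices.append(idx)
            if dest is not None:
                poisoned.add(dest)      # result is invalid: propagate poison
        elif dest is not None:
            poisoned.discard(dest)      # overwritten with a valid value
    return slice_indices


program = [
    ("r1", ["r9"]),   # 0: load that misses
    ("r2", ["r1"]),   # 1: depends on the miss
    ("r3", ["r8"]),   # 2: independent
    ("r1", ["r3"]),   # 3: independent, overwrites r1 with a valid value
    ("r4", ["r1"]),   # 4: reads the new, valid r1
    ("r5", ["r2"]),   # 5: depends on the miss through r2
]
print(poisoned_slice(program, miss_index=0))
```

Everything outside the returned slice can keep executing during the miss, which is where later independent misses are discovered and overlapped.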

Waiting Instruction Buffer [Lebeck et al. ISCA 2002]
Capture forward load slice in separate buffer
– Propagate poison bits to identify slice
Relieve pressure on issue queue
Reinsert instructions when load completes
Very similar to Intel Pentium 4 replay mechanism
– But not publicly known at the time

Continual Flow Pipelines [Srinivasan et al. 2004]
Slice buffer extension of WIB
– Store operands in slice buffer as well to free up buffer entries on OOO window
– Relieve pressure on rename/physical registers
Applicable to
– data-capture machines (Intel P6) or
– physical register file machines (Pentium 4)
Also extended to in-order machines (iCFP)
Challenge: how to buffer loads/stores
Reading: Hilton & Roth, BOLT, HPCA 2010

Instruction Flow

Transparent Control Independence
Control flow graph convergence
– Execution reconverges after branches
– If-then-else constructs, etc.
Can we fetch/execute instructions beyond the convergence point?
How do we resolve ambiguous register and memory dependences?
– Writes may or may not occur in branch shadow
TCI employs CFP-like slice buffer to solve these problems
– Instructions with ambiguous dependences are buffered
– Reinsert them the same way the forward load miss slice is reinserted
“Best” CI proposal to date, but still very complex and expensive, with moderate payback

Summary of Advanced Microarchitecture
Instruction scheduling overview
– Scheduling atomicity
– Speculative scheduling
– Scheduling recovery
Complexity-effective instruction scheduling techniques
– CRIB reading
Scalable load/store handling
– NoSQ reading
Building large instruction windows
– Runahead, CFP, iCFP
Control Independence