Federation: Repurposing Scalar Cores for Out-of-Order Instruction Issue David Tarjan*, Michael Boyer, and Kevin Skadron* University of Virginia Department.

Presentation transcript:

Federation: Repurposing Scalar Cores for Out-of-Order Instruction Issue. David Tarjan*, Michael Boyer, and Kevin Skadron*, University of Virginia, Department of Computer Science. * Currently on internship/sabbatical at NVIDIA Research.

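Motivation [Figure: three chip organizations labeled Homogeneous, Heterogeneous, and Adaptive (Federation), each with a shared L2. Legend: multithreaded scalar in-order (IO) core; 2-way out-of-order (OO) core.]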

Basic Insights: A multithreaded in-order core has many registers which can be reused for a reorder buffer or active list. If cores are small, single-cycle communication between neighbors is feasible. Prior work on making large OOO cores feasible can be applied at the low end to make low-cost OOO possible.

In-order & Out-of-order Pipelines [Figure: the in-order pipeline has the stages Fetch, Decode, Execute, Mem, Writeback. The out-of-order pipeline keeps these stages and adds Bpred, Allocate, Rename, Issue, and Commit.]

Issue Queue Example [Figure: issue queue (IQ) table with a Ready Bits column and Subscriber Slot columns; cell contents not recoverable.] Huang et al., Energy-Efficient Hybrid Wakeup Logic, ISLPED 2002. Sassone et al., Matrix Scheduler Reloaded, ISCA.

Simplified Load-Store Queue: Memory Alias Table (MAT). No store forwarding. No conservative waiting on stores. Only detect memory order violations after they have occurred, and flush the pipeline when the offending instruction commits. Amir Roth, Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization, ISCA 2005.

MAT Example (1): In flight: st 0x13, r5 and ld r1, 0x. [Figure: MAT contents.]

MAT Example (2): In flight: st 0x13, r5 and ld r1, 0x13 (EXE). The ld executes and increments the MAT counter.

MAT Example (3): In flight: st 0x13, r5 (COM) and ld r1, 0x13. The st commits and sets the flag (!) in the MAT.

MAT Example (4): In flight: ld r1, 0x13 (COM). The ld commits, sees the flag, and flushes the pipeline.

MAT Example (5): In flight: ld r1, 0x. The MAT is reset and execution resumes.
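
The five steps above can be sketched as a small Python model. This is only an illustration under stated assumptions, not the authors' hardware: the class name `MemoryAliasTable`, the 16-entry table, indexing by address modulo table size, and the counter decrement on a clean load commit are all hypothetical choices. The slides only state that a load increments a counter when it executes, a committing store sets a flag, and a committing load that sees the flag flushes the pipeline and resets the MAT.

```python
class MemoryAliasTable:
    """Toy model of the MAT protocol walked through on the slides."""

    def __init__(self, entries=16):
        # Assumption: one counter and one violation flag per entry.
        self.counters = [0] * entries
        self.flags = [False] * entries

    def _index(self, addr):
        # Assumption: simple modulo hash; distinct addresses may alias.
        return addr % len(self.counters)

    def load_execute(self, addr):
        # A load that executes (possibly early) increments its counter.
        self.counters[self._index(addr)] += 1

    def store_commit(self, addr):
        # Commit is in order, so a nonzero counter here means a younger
        # load to an aliasing address has already executed.
        i = self._index(addr)
        if self.counters[i] > 0:
            self.flags[i] = True

    def load_commit(self, addr):
        """Return True if the pipeline must be flushed."""
        i = self._index(addr)
        if self.flags[i]:
            self.reset()
            return True
        # Assumption: a cleanly committing load releases its counter.
        self.counters[i] -= 1
        return False

    def reset(self):
        n = len(self.counters)
        self.counters = [0] * n
        self.flags = [False] * n


# Replay of the slide sequence: the ld executes before the older st commits.
mat = MemoryAliasTable()
mat.load_execute(0x13)           # step 2: ld executes, counter incremented
mat.store_commit(0x13)           # step 3: st commits, flag set
flushed = mat.load_commit(0x13)  # step 4: ld commits, sees flag, flushes
```

Because the table is indexed by a hash, two unrelated addresses that map to the same entry also trigger a flush; this is the kind of false positive that the Commit slide mentions recovering from.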

Performance Impact

Performance

Energy Efficiency

Area Efficiency

Conclusions: Two in-order cores can be federated at run-time to form a 2-way OO core. Almost doubling the IPC of a throughput core is possible with very little extra hardware. Traditional OO structures are not wanted, because their performance comes at too high a price. Federation offers the best combined area- and energy-efficiency.

Q & A

Backup

Core Fusion Data Figure from Ipek et al., “Core Fusion: Accommodating Software Diversity in Chip Multiprocessors”, ISCA 2007

Overall Results: Scalar in-order core is 8KB I/D, 256KB L2. Base 2-way core has 16KB I- and D-caches, 256KB L2, 32-entry ROB, 16-entry issue queue, 16-entry LSQ, bimodal bpred. 4-way core is 32KB I/D, 2MB L2, 128-entry ROB, 32-entry IQ and LSQ, tournament bpred.

Branch Prediction: Use only a Next Line and Set (NLS) predictor, a bimodal predictor, and a Return Address Stack (RAS). NLS is adequate if the instruction working set does not exceed the I$ size. A small bimodal predictor is adequate for a small-window processor.

Fetch: The two I$s act as one I$ of twice the size and associativity (with random replacement). More logic and buffers are needed to capture two instructions. An extra cycle is needed to route instructions from the two I$s to the two decoders.
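
One way to picture two caches acting as a single cache of twice the size and associativity is the sketch below. It is a simplified model with assumed parameters: the class names `SmallCache` and `FederatedCache`, the 4-set, 2-way geometry, and the 32-byte line size are hypothetical and do not come from the slides. Both halves are probed for the same set index, and a miss fills a randomly chosen way across both halves.

```python
import random


class SmallCache:
    """One core's cache: a list of sets, each holding `ways` tags."""

    def __init__(self, num_sets=4, ways=2):
        self.num_sets = num_sets
        self.sets = [[None] * ways for _ in range(num_sets)]

    def lookup(self, set_idx, tag):
        return tag in self.sets[set_idx]


class FederatedCache:
    """Two SmallCaches probed together; same sets, twice the ways."""

    def __init__(self, cache_a, cache_b, line_bytes=32, seed=0):
        assert cache_a.num_sets == cache_b.num_sets
        self.halves = (cache_a, cache_b)
        self.num_sets = cache_a.num_sets
        self.line_bytes = line_bytes
        self.rng = random.Random(seed)  # seeded only for repeatability

    def _split(self, addr):
        line = addr // self.line_bytes
        return line % self.num_sets, line // self.num_sets

    def access(self, addr):
        """Return True on a hit; on a miss, fill a random way."""
        set_idx, tag = self._split(addr)
        # Conceptually both halves are checked in parallel.
        if any(c.lookup(set_idx, tag) for c in self.halves):
            return True
        # Random replacement over the ways of both halves.
        candidates = [(c, w) for c in self.halves
                      for w in range(len(c.sets[set_idx]))]
        cache, way = self.rng.choice(candidates)
        cache.sets[set_idx][way] = tag
        return False


icache = FederatedCache(SmallCache(), SmallCache())
first = icache.access(0x100)   # cold miss, line is filled
second = icache.access(0x100)  # same line again
```

The same structure matches the description of the merged D$ on the Memory Access slide, where each core's D$ is responsible for half of the merged cache's ways.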

Decode: Cancel the second instruction if the first turns out to be a branch. An extra cycle is needed to route decoded instructions to the new allocate stage.

Allocate: New logic and free lists to allocate ROB and IQ entries.

Rename: A new table is required, since renaming needs too many ports for the existing structures. There is one centralized rename table, not a distributed one. It has a separate table (or a field in each RAT entry) holding the IQ-slot number of each register's producer instruction (see our new issue queue).

Issue: Uses a simple lookup table as the wakeup structure, where instructions subscribe to their input instructions (explained in detail later). Centralized: one IQ for the two cores.
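
A minimal sketch of subscription-based wakeup follows, assuming a table in which each IQ entry holds ready bits for its source operands and a small number of subscriber slots. The names `IQEntry` and `SubscriptionIssueQueue`, the limit of two subscribers per entry, waking consumers at issue time (ignoring execution latency), and raising an error instead of stalling when the subscriber slots are full are all illustrative assumptions. The `producer_slot` map plays the role of the per-register producer IQ-slot field described on the Rename slide.

```python
from dataclasses import dataclass, field


@dataclass
class IQEntry:
    name: str
    ready: list                                       # one bit per source operand
    subscribers: list = field(default_factory=list)   # (consumer slot, operand index)


class SubscriptionIssueQueue:
    MAX_SUBSCRIBERS = 2  # assumption: two subscriber slots per entry

    def __init__(self, size=8):
        self.slots = [None] * size
        self.producer_slot = {}  # register name -> IQ slot of in-flight producer

    def insert(self, name, dest, sources):
        slot = self.slots.index(None)  # ValueError if the queue is full
        ready = []
        for op_idx, reg in enumerate(sources):
            prod = self.producer_slot.get(reg)
            if prod is None:
                ready.append(True)     # value already available
            else:
                entry = self.slots[prod]
                if len(entry.subscribers) >= self.MAX_SUBSCRIBERS:
                    raise RuntimeError("no free subscriber slot")
                entry.subscribers.append((slot, op_idx))  # subscribe to producer
                ready.append(False)
        self.slots[slot] = IQEntry(name, ready)
        if dest is not None:
            self.producer_slot[dest] = slot
        return slot

    def ready_slots(self):
        return [i for i, e in enumerate(self.slots)
                if e is not None and all(e.ready)]

    def issue(self, slot):
        entry = self.slots[slot]
        # Wakeup is a table lookup: set the ready bit of each subscriber.
        for consumer, op_idx in entry.subscribers:
            self.slots[consumer].ready[op_idx] = True
        for reg, s in list(self.producer_slot.items()):
            if s == slot:
                del self.producer_slot[reg]
        self.slots[slot] = None


iq = SubscriptionIssueQueue()
a = iq.insert("add r1, r2, r3", dest="r1", sources=["r2", "r3"])
b = iq.insert("sub r4, r1, r5", dest="r4", sources=["r1", "r5"])
before = iq.ready_slots()  # only the add has all operands ready
iq.issue(a)                # issuing the add wakes its subscriber
after = iq.ready_slots()
```

The point of the design is that wakeup needs no associative tag broadcast: a producer only indexes its own entry and writes the ready bits of the few consumers recorded there.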

Register File: The register file is mirrored in the two cores. No extra copy instructions or load-balancing questions.

Execute: Add an extra cycle for copying the result to the other core's register file (like EV6).

Memory Access: The two D$s are checked in parallel, each responsible for half of the merged D$'s ways. No standard LSQ, only a Memory Alias Table (details later). The MAT only detects ordering violations and sends a signal to the pipeline.

Commit: Centralized commit, no slippage. Branch mispredictions are recovered from at commit, since no checkpoints of the RAT are taken on branches. Memory order violations (or false positives) reported by the MAT are also recovered from here.