A Position-Insensitive Finished Store Buffer Erika Gunadi and Mikko H. Lipasti Department of Electrical and Computer Engineering University of Wisconsin—Madison.

2 Motivation
- As microprocessors get wider and deeper:
  - More in-flight stores
  - Need a larger store queue
  - Increased access time and power consumption
- SQ access time needs to be <= D$ access time
  - Avoids replay in case of store-to-load forwarding

3 A Brief Store Queue Overview
- Serves 2 main purposes:
  - To maintain the order of in-flight stores
  - To forward store data to later loads
- Commonly designed as a circular buffer
  - Allocate entry on dispatch
  - Deallocate entry on retirement
- Equipped with forwarding logic
  - CAM structure for address match
  - Select logic to pick the youngest older matching store
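
The circular-buffer behaviour on this slide can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class name `CircularStoreQueue` and its methods are hypothetical, and real hardware tracks far more state per entry.

```python
class CircularStoreQueue:
    """Toy model of a conventional store queue kept as a circular buffer."""

    def __init__(self, size):
        self.size = size
        self.entries = [None] * size   # None marks an unallocated slot
        self.head = 0                  # oldest in-flight store
        self.tail = 0                  # next slot to allocate
        self.count = 0

    def dispatch(self):
        """Allocate an entry at dispatch; return its index, or None if full."""
        if self.count == self.size:
            return None                # dispatch would have to stall
        idx = self.tail
        self.entries[idx] = {"addr": None, "data": None}
        self.tail = (self.tail + 1) % self.size
        self.count += 1
        return idx

    def execute(self, idx, addr, data):
        """Fill in address and data once the store has executed."""
        self.entries[idx] = {"addr": addr, "data": data}

    def retire(self):
        """Deallocate the oldest entry at retirement and return it."""
        if self.count == 0:
            return None
        entry = self.entries[self.head]
        self.entries[self.head] = None
        self.head = (self.head + 1) % self.size
        self.count -= 1
        return entry


sq = CircularStoreQueue(4)
a = sq.dispatch()
b = sq.dispatch()
sq.execute(b, addr=0x40, data=7)   # stores may execute out of order
sq.execute(a, addr=0x80, data=3)
first = sq.retire()                # but they leave the queue in program order
```

Because entries are allocated in program order, an entry's position encodes its age, which is the property the FSB later gives up.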

4 Store to Load Forwarding
- Each load needs to search the store queue for any matching older stores
- Forwarding logic consists of 3 components:
  - Store Address CAM
  - Select Logic
  - Store Data RAM
[Figure: Store Address CAM, Select Logic, Store Data RAM]
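
As a rough illustration of the three components, the following sketch walks a load through an address match, a select step, and a data read in a position-ordered queue. The function name `forward` and its parameters are hypothetical, and the sequential loop stands in for what hardware does in parallel.

```python
def forward(store_addrs, store_data, head, count, load_addr, load_pos):
    """Return forwarded data for a load, or None if no older store matches.

    head/count describe the occupied part of a circular store queue.
    load_pos is the number of in-flight stores older than the load.
    """
    size = len(store_addrs)
    # 1. Store Address CAM: compare the load address with every older store.
    older = [(head + i) % size for i in range(min(load_pos, count))]
    matches = [idx for idx in older if store_addrs[idx] == load_addr]
    if not matches:
        return None
    # 2. Select logic: 'older' is in age order, so the last match is the
    #    youngest store that is still older than the load.
    chosen = matches[-1]
    # 3. Store Data RAM: read the data of the selected entry.
    return store_data[chosen]


addrs = [0x10, 0x20, 0x10, 0x30]
data = [1, 2, 3, 4]
# Occupied entries in age order: index 2, 3, 0 (head=2, count=3).
value = forward(addrs, data, head=2, count=3, load_addr=0x10, load_pos=3)
```

The select step here relies on queue position to determine age, which is the part that does not carry over to an unordered buffer.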

5 SQ Access Latency
- Major components of latency: CAM and select
- CAM is scalable, select is not

6 SQ Energy per Access
- Major component of energy: CAM

7 Outline
- Motivation and Background
- Finished Store Buffer (FSB)
- Initial Study
- Details of Design
- Methodology
- Results
- Conclusion

8 SQ Occupancy Study
- Most of the time, <= 50% of in-flight stores are finished and waiting to retire
- The number of waiting-to-retire stores does not scale linearly with the size of the OoO window
- FSB sizes of 12, 20, 32, and 52 entries are used for window sizes of 128, 256, 512, and 1024

9 Finished Store Buffer
- The forwarding logic only cares about waiting-to-retire stores
  - As shown, these are fewer than 50% of in-flight stores
- The ROB can be used to track store order
- Finished Store Buffer
  - Much smaller than a conventional store queue
  - Does not maintain positional store ordering

10 FSB Diagram
- Allocate FSB entry at schedule
- Deallocate FSB entry at retirement
- FSB is maintained using a free list
- A store is issued only if there is an available entry
[Figure: pipeline diagram; labels: Fetch, Dec, Rnm, Disp, Queue, Read, Exe, WB, Ret, Sched, FSB, Conventional SQ]
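
A minimal sketch of free-list management, assuming a software model in which each entry is an (address, data, color) tuple. The class and method names are hypothetical, and the `color` field is simply an age tag of the kind introduced on the select-logic slide.

```python
class FinishedStoreBuffer:
    """Toy FSB: entries are handed out from a free list, not in age order."""

    def __init__(self, size):
        self.entries = [None] * size
        self.free_list = list(range(size))

    def can_issue_store(self):
        """A store may be scheduled only if an entry is available."""
        return bool(self.free_list)

    def allocate(self, addr, data, color):
        """Called at schedule; returns the FSB index, or None if full."""
        if not self.free_list:
            return None
        idx = self.free_list.pop()
        self.entries[idx] = (addr, data, color)
        return idx

    def retire(self, idx):
        """Called at retirement; returns the entry and frees its slot."""
        entry = self.entries[idx]
        self.entries[idx] = None
        self.free_list.append(idx)
        return entry


fsb = FinishedStoreBuffer(2)
i0 = fsb.allocate(0x10, 1, color=0)
i1 = fsb.allocate(0x20, 2, color=1)
full = fsb.allocate(0x30, 3, color=2)   # no free entry: the store must wait
oldest = fsb.retire(i0)
i2 = fsb.allocate(0x30, 3, color=2)     # reuses the slot freed by i0
```

After the last call the youngest store sits in whichever slot happened to be free, so buffer position says nothing about relative age.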

11 Forwarding Logic
- A load checks the FSB for a matching store
- FSB position does not reflect relative age
  - Needs non-positional select logic
- Same problem as in a non-compacting scheduler
  - Solutions: Buyuktosunoglu [SOC 2002], Robery [US Patent], and Sassone [ISCA 2007]
  - A solution similar to Buyuktosunoglu's is used since it requires the fewest bits

12 Youngest Select Logic
- 4-entry FSB, 3-bit color (111: youngest, 000: oldest)
- Modifications:
  - Add one more bit and simple reverse logic to handle wraparound
  - Restructure the algorithm hierarchically; checking happens in parallel
[Figure: select logic with 4 inputs; signals A2[3:0], A1[3:0], A0[3:0], S[3:0]; four stores to address A and a load from A; one-hot select signal]
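
One way to picture a non-positional youngest select is a bit-by-bit elimination over the color bits, from the most significant bit down. The sketch below is an assumption about the general technique, not the circuit from the slides: the function name is hypothetical, the `match` vector is assumed to already exclude stores younger than the load, and the extra wraparound bit is not modeled.

```python
def youngest_select(colors, match, width=3):
    """Return a one-hot list marking the matching entry with the largest color.

    colors: per-entry age tags (larger means younger), assumed unique.
    match:  per-entry booleans from the address CAM.
    """
    candidates = list(match)
    for bit in reversed(range(width)):
        # Keep only candidates with this bit set, if there are any.
        with_one = [c and ((col >> bit) & 1) == 1
                    for c, col in zip(candidates, colors)]
        if any(with_one):
            candidates = with_one
    return candidates


colors = [0b010, 0b101, 0b011, 0b001]
select = youngest_select(colors, [True, True, True, False])
# Entry 1 carries the largest color among the matches, so it is selected.
```

Each pass only needs one bit per entry plus an OR across entries, which is why the per-bit checks lend themselves to a hierarchical, parallel structure.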

13 FSB Corner Cases
- Deadlock avoidance
  - Deadlock happens when a store to issue is the oldest in the window and the FSB is full
  - Reserves an entry in the FSB for the oldest store
- In-order retirement
  - Keeps the FSB index in the ROB entry and uses it to index the FSB at retirement
- Branch misprediction
  - Assigns a store color to each branch
  - Uses it to determine which FSB entries to invalidate
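
The corner cases above might be modeled as follows. Everything here is a hedged sketch: `GuardedFSB` and its methods are hypothetical names, a branch is assumed to carry the color of the last store before it, larger colors are assumed to be younger, and color wraparound is ignored.

```python
class GuardedFSB:
    """Toy FSB that reserves one entry for the oldest store in the window."""

    def __init__(self, size):
        self.colors = [None] * size    # None means the entry is free

    def free_count(self):
        return self.colors.count(None)

    def try_issue(self, color, is_oldest_in_window):
        """Allocate unless that would consume the reserved entry."""
        needed = 1 if is_oldest_in_window else 2
        if self.free_count() < needed:
            return None                # the store stays in the scheduler
        idx = self.colors.index(None)
        self.colors[idx] = color
        return idx                     # the index would be kept in the ROB entry

    def retire(self, idx):
        """In-order retirement: the ROB supplies the index to free."""
        self.colors[idx] = None

    def squash(self, branch_color):
        """Invalidate every store younger than the mispredicted branch."""
        for i, c in enumerate(self.colors):
            if c is not None and c > branch_color:
                self.colors[i] = None


fsb = GuardedFSB(2)
young = fsb.try_issue(color=1, is_oldest_in_window=False)
blocked = fsb.try_issue(color=2, is_oldest_in_window=False)   # reserved slot
oldest = fsb.try_issue(color=0, is_oldest_in_window=True)     # may use it
```

Holding one slot back means the oldest store can always make progress, so retirement never waits on a store that cannot obtain an entry.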

14 Methodology
- SimpleScalar / Alpha 3.0 tool set
- Machine configuration
  - 12-stage pipeline, 4-wide machine
  - 128 ROB, 96 PRF
  - 32 LQ, 24 SQ, 32 scheduler
  - 2 integer ALUs, 1 mult/div, 1 memory port
  - I-Cache: 64KB, DM, 64B, 2-cycle
  - D-Cache: 64KB, 4-way, 64B, 3-cycle
  - L2: 2MB, 8-way, 128B, 8-cycle
  - Memory: 150-cycle

15 Modeling
- To estimate timing and power for the select logic:
  - Implemented in Verilog
  - Synthesized using Synopsys Design Compiler and LSI Logic's gflxp 0.11 micron CMOS standard cell library
- To estimate timing and power for RAM and CAM structures: CACTI

16 Access Latency Comparison
- Due to fewer entries, select logic for the FSB is faster
- CAM latency is similar

17 Energy per Access Comparison
- Fewer entries -> less CAM power
- Subarrays do not reduce energy, only latency

18 IPC Comparison (SPEC INT)
- FSB: 12, 20, 32, 52 entries for the different window sizes
- FSB-min: the most aggressive limit
  - To avoid stalls, only needs 20% * machine width * issue-to-retire stages entries
  - 5, 10, 20, and 40 entries for the different window sizes
- Both FSB and FSB-min show less than 1% average slowdown
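
The FSB-min sizing rule can be written as a one-line calculation. The helper below is a hypothetical illustration: the 20% factor comes from the slide, while the example width and stage count are made-up inputs, since the deck does not give the issue-to-retire depth for each window size.

```python
def fsb_min_entries(machine_width, issue_to_retire_stages, percent=20):
    """Entries needed: percent * machine width * issue-to-retire stages,
    rounded up to a whole entry (integer arithmetic avoids float error)."""
    return -(-percent * machine_width * issue_to_retire_stages // 100)


# Hypothetical inputs: a 4-wide machine with 6 stages between issue and retire.
entries = fsb_min_entries(machine_width=4, issue_to_retire_stages=6)
```

The product bounds how many finished stores can be in flight between issue and retirement, on the assumption that roughly one in five instructions is a store.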

19 IPC Comparison (SPEC FP)
- Sixtrack with 1024 ROB experiences 5% slowdown
  - Caused by retirement stalls on unfinished stores
  - Slowdown is less than 1% with 2 reservation slots
- In some cases, FSB slightly outperforms the baseline IPC
  - Happens when the store queue size limits instruction dispatch in the baseline

20 Prior Work
- SQIP [Sha, 2005]
  - Removes the associative search of the SQ
  - Loads use store-set to predict the index of a forwarding SQ entry
  - Misprediction is detected by precommit re-execution and results in a pipeline flush
- ULB-LSQ [Sethumadhavan, 2007]
  - Unordered SQ, allocated at issue time
  - Similar to our approach
  - Differs in forwarding policy and overflow handling

21 Prior Work
- [Franklin, 1996]: ARB in Multiscalar
- [Sethumadhavan, 2003], [Park, 2003]: Filtering mechanism (bloom filter and store set) to reduce store queue access
- [Baugh, 2004]: Decomposed store queue functionality, only stores in forwarding group need to be put into the forwarding buffer
- [Torres, 2005]: 2-level SQ, predicted forwarding stores in L1, validation is done in L2
- [Roth, 2005]: SVW, breaking SQ functionality into RSQ and FSQ, validation is done using load re-execution
- [Sha, 2005], [Stone, 2005]: SQIP and AIMD, removing the associative search capability from SQ
- [Subramanian, 2006], [Sha, 2006]: FnF and NoSQ, eliminate the whole SQ, load re-execution for validation
- [Sethumadhavan, 2007]: ULB-LSQ, unordered store queue that is allocated at issue time

22 Conclusion
- FSB, an alternative way to build the SQ
  - Only contains finished stores
  - Much smaller
  - More scalable
  - Minimal IPC impact, < 1%
  - Lower power
  - Possible higher frequency
- FSB-min, a more aggressive approach
  - Also has minimal IPC impact
- Future work
  - Load Queue
  - Better deadlock handling

23 Thank you Questions?