A Position-Insensitive Finished Store Buffer Erika Gunadi and Mikko H. Lipasti Department of Electrical and Computer Engineering University of Wisconsin—Madison
2 Motivation As microprocessors get wider and deeper More in-flight stores Need a larger store queue Increases access time and power consumption Need SQ access time <= D$ access time Avoids replay in case of store-to-load forwarding
3 A Brief Store Queue Overview Serve 2 main purposes: To maintain the order of in-flight stores To forward store data to later loads Commonly designed as a circular buffer Allocate entry on dispatch Deallocate entry on retirement Equipped with forwarding logic CAM structure for address match Select logic to pick the youngest older matching store
4 Store to Load Forwarding Each load needs to search the store queue for any matching older stores Forwarding logic consists of 3 components: Store Address CAM, Select Logic, Store Data RAM
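The forwarding search in a conventional, position-ordered store queue can be sketched as follows. This is a minimal behavioral model, not the hardware described on the slides: the names `StoreEntry` and `forward_from_sq`, and the choice to pass the number of older stores as `n_older`, are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StoreEntry:
    addr: int  # store address (the CAM key)
    data: int  # store data (the RAM payload)


def forward_from_sq(sq: List[Optional[StoreEntry]], head: int,
                    n_older: int, load_addr: int) -> Optional[int]:
    """Return data from the youngest older matching store, or None.

    sq      : circular buffer; None marks an empty slot
    head    : index of the oldest in-flight store
    n_older : number of stores older than the load (hypothetical parameter)
    """
    size = len(sq)
    # Store Address CAM: compare the load address against every entry.
    match = [e is not None and e.addr == load_addr for e in sq]
    # Select logic: position encodes age, so walk from the youngest
    # older store back toward the head and take the first match.
    for k in range(1, n_older + 1):
        idx = (head + n_older - k) % size
        if match[idx]:
            # Store Data RAM: read the selected entry.
            return sq[idx].data
    return None
```

Because position encodes age here, the select step is a priority search over positions. That dependence on position is what the rest of the talk removes.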
5 SQ Access Latency Major components of latency: CAM and Select CAM is scalable, Select is not
6 SQ Energy per Access Major component of energy: CAM
7 Outline Motivation and Background Finished Store Buffer (FSB) Initial Study Details of Design Methodology Results Conclusion
8 SQ Occupancy Study Most of the time, <= 50% of in-flight stores are finished and waiting to retire The number of waiting-to-retire stores does not scale linearly with the size of the OoO window FSB sizes of 12, 20, 32, and 52 entries are used for window sizes of 128, 256, 512, and 1024
9 Finished Store Buffer The forwarding logic only cares about waiting-to-retire stores As shown, these are fewer than 50% of in-flight stores ROB can be used to track store order Finished Store Buffer Much smaller than conventional store queue Does not maintain positional store ordering
10 FSB Diagram Allocate FSB entry at schedule Deallocate FSB entry at retirement FSB is maintained using a free-list A store is issued only if there is an available entry [Figure: pipeline stages Fetch, Dec, Rnm, Disp, Queue, Sched, Read, Exe, WB, Ret, marking the spans covered by the FSB and by a conventional SQ]
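The free-list management described on this slide can be sketched as below. This is an illustrative model under stated assumptions: the class name `FinishedStoreBuffer`, its method names, and the tuple layout of an entry are hypothetical and not taken from the talk.

```python
from typing import List, Optional, Tuple


class FinishedStoreBuffer:
    """Unordered buffer of finished stores, managed with a free list."""

    def __init__(self, n_entries: int) -> None:
        # Each slot holds (addr, data) or None when free.
        self.entries: List[Optional[Tuple[int, int]]] = [None] * n_entries
        self.free: List[int] = list(range(n_entries))

    def try_allocate(self, addr: int, data: int) -> Optional[int]:
        """Called when a store is scheduled.

        Returns the slot index, or None if the buffer is full, in which
        case the store must not issue yet.
        """
        if not self.free:
            return None
        idx = self.free.pop()  # any free slot will do: position is not age
        self.entries[idx] = (addr, data)
        return idx

    def retire(self, idx: int) -> None:
        """Called when the store retires; the slot returns to the free list."""
        self.entries[idx] = None
        self.free.append(idx)
```

Since any free slot can be handed out, an entry's index carries no age information, which is why a separate age tag is needed for forwarding.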
11 Forwarding Logic Load checks the FSB for matching store FSB position does not reflect relative age Non-positional select logic Same problem in a non-compacting scheduler Solutions: Buyuktosunoglu [SOC 2002], Robery [US Patent], and Sassone [ISCA 2007] A solution similar to Buyuktosunoglu's is used since it requires the fewest bits
12 Youngest Select Logic 4-entry FSB, 3-bit color (111: youngest, 000: oldest) Modification Add one more bit and a simple reverse logic to handle wrap-around Restructure the algorithm hierarchically so that checking happens in parallel [Figure: select logic with 4 inputs; four FSB entries (st A) are compared against a load (ld A) using signals A2[3:0], A1[3:0], A0[3:0] to produce a one-hot select signal S[3:0]]
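The color-based selection can be modeled behaviorally as below. This is a minimal sketch, not the gate-level design from the slide. It assumes each store is tagged with a color from a wrapping counter that is one bit wider than the 3-bit example, that the load carries the color of the youngest store older than itself, and that in-flight colors span less than half the counter range. The modular distance comparison stands in for the reverse logic, and names such as `youngest_older_match` are hypothetical.

```python
from typing import List, Optional, Tuple

COLOR_BITS = 4            # assumed: 3 color bits plus one wrap-around bit
MOD = 1 << COLOR_BITS

# An FSB entry is (addr, color, data); None marks a free slot.
Entry = Optional[Tuple[int, int, int]]


def youngest_older_match(entries: List[Entry], load_addr: int,
                         load_color: int) -> List[bool]:
    """Return a one-hot select vector over FSB slots (all False if no match)."""
    best_idx = None
    best_dist = None
    for idx, entry in enumerate(entries):
        if entry is None:
            continue
        addr, color, _data = entry
        if addr != load_addr:          # address CAM miss
            continue
        # Distance in colors from the store up to the load, modulo wrap.
        dist = (load_color - color) % MOD
        if dist >= MOD // 2:           # store is younger than the load
            continue
        if best_dist is None or dist < best_dist:
            best_idx, best_dist = idx, dist
    return [i == best_idx for i in range(len(entries))]
```

The loop is sequential only for readability. The slide's hierarchical restructuring performs the equivalent comparisons in parallel.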
13 FSB Corner Cases Deadlock avoidance Happens when a store to issue is the oldest in the window and the FSB is full Reserves an entry in the FSB for the oldest store In-order retirement Keeps the FSB index in the ROB entry, uses it to index into the FSB at retirement Branch misprediction Assigns store color to each branch Uses it to determine which FSB entries to invalidate
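Two of these corner cases, the reserved entry and misprediction recovery, can be sketched together. This is an illustration under stated assumptions: colors are modeled as monotonically increasing integers (no wrap-around) to keep the comparison simple, and the class `FSBWithCornerCases` and its methods are hypothetical names, not the authors' implementation.

```python
from typing import Dict, List, Optional


class FSBWithCornerCases:
    def __init__(self, n_entries: int, reserved: int = 1) -> None:
        self.colors: Dict[int, int] = {}       # slot index -> store color
        self.free: List[int] = list(range(n_entries))
        self.reserved = reserved                # slots kept for the oldest store

    def try_issue_store(self, color: int, is_oldest: bool) -> Optional[int]:
        """Allocate a slot, or return None if the store must wait.

        Only the oldest store in the window may take a reserved slot, so it
        can always finish and retire, which avoids deadlock.
        """
        needed = 1 if is_oldest else self.reserved + 1
        if len(self.free) < needed:
            return None
        idx = self.free.pop()
        self.colors[idx] = color
        return idx      # in the design, this index is kept in the ROB entry

    def retire(self, idx: int) -> None:
        """In-order retirement: the ROB supplies the slot index directly."""
        del self.colors[idx]
        self.free.append(idx)

    def squash_after_branch(self, branch_color: int) -> List[int]:
        """Invalidate every store younger than the mispredicted branch."""
        victims = sorted(i for i, c in self.colors.items() if c > branch_color)
        for i in victims:
            del self.colors[i]
            self.free.append(i)
        return victims
```

The `reserved` parameter defaults to one slot, following this slide. A larger value corresponds to keeping more than one reservation slot.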
14 Methodology SimpleScalar/Alpha 3.0 tool set Machine configuration 12-stage pipeline, 4-wide machine 128 ROB, 96 PRF 32 LQ, 24 SQ, 32 scheduler 2 integer ALUs, 1 mult/div, 1 memory port I-Cache: 64KB, DM, 64B, 2-cycle D-Cache: 64KB, 4-way, 64B, 3-cycle L2: 2MB, 8-way, 128B, 8-cycle Memory: 150-cycle
15 Modeling To estimate timing and power for the select logic Implemented in Verilog Synthesized using Synopsys Design Compiler and LSI Logic’s gflxp 0.11 micron CMOS standard cell library To estimate timing and power for RAM and CAM structures -> CACTI
16 Access Latency Comparison Due to fewer entries, select logic for FSB is faster CAM latency is similar
17 Energy per Access Comparison Fewer entries -> less CAM power Subarrays do not reduce energy, only latency
18 IPC Comparison (SPEC INT) FSB: 12, 20, 32, 52 entries for different window sizes FSB-min: the most aggressive limit To avoid stalls, only needs 20% * machine width * number of issue-to-retire stages 5, 10, 20, and 40 entries for different window sizes Both FSB and FSB-min show less than 1% average slowdown
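The FSB-min sizing rule is simple arithmetic and can be written as a small helper. This is a sketch under stated assumptions: the function name `fsb_min_entries` is hypothetical, the 20% factor is taken from the slide as given, rounding up is an assumption, and the width and stage count in the usage example are illustrative values rather than the configurations used in the study.

```python
def fsb_min_entries(machine_width: int, issue_to_retire_stages: int,
                    percent: int = 20) -> int:
    """Entries = percent% * machine width * issue-to-retire stages, rounded up.

    Integer arithmetic is used to avoid floating-point rounding surprises.
    """
    product = percent * machine_width * issue_to_retire_stages
    return -(-product // 100)   # ceiling division


# Hypothetical example: a 4-wide machine with 6 stages between issue and retire.
example = fsb_min_entries(4, 6)
```

With these illustrative inputs, 20% * 4 * 6 = 4.8, which rounds up to 5 entries.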
19 IPC Comparison (SPEC FP) Sixtrack with 1024 ROB experiences 5% slowdown Retirement stall of unfinished stores Slowdown less than 1% with 2 reservation slots In some cases, FSB slightly outperforms the baseline IPC Happens when the store queue size limits instruction dispatch in the baseline
20 Prior Work SQIP [Sha, 2005] Remove the associative search of SQ Loads use store-set to predict the index of a forwarding SQ entry Misprediction is detected by precommit re-execution, results in pipeline flush ULB-LSQ [Sethumadhavan, 2007] Unordered SQ, allocated at issue time Similar to our approach Differs in forwarding policy and overflow handling
21 Prior Work [Franklin, 1996]: ARB in Multiscalar [Sethumadhavan, 2003], [Park, 2003]: Filtering mechanism (bloom filter and store set) to reduce store queue access [Baugh, 2004]: Decomposed store queue functionality, only stores in forwarding group need to be put into the forwarding buffer [Torres, 2005]: 2-level SQ, predicted forwarding stores in L1, validation is done in L2 [Roth, 2005]: SVW, breaking SQ functionality into RSQ and FSQ, validation is done using load re-execution [Sha, 2005], [Stone, 2005]: SQIP and AIMD, removing the associative search capability from SQ [Subramanian, 2006], [Sha, 2006]: FnF and NoSQ, eliminate the whole SQ, load re-execution for validation [Sethumadhavan, 2007]: ULB-LSQ, unordered store queue that is allocated at issue time
22 Conclusion FSB, an alternative way to build the SQ Only contains finished stores Much smaller More scalable Minimal IPC impact, < 1% Lower power Possibly higher frequency FSB-min, a more aggressive approach Also has minimal IPC impact Future work Load Queue Better deadlock handling
23 Thank you Questions?