Temporally Silent Stores (Alternatively: Louder Silent Stores) Kevin M. Lepak Mikko H. Lipasti University of Wisconsin—Madison PHARM Team http://www.ece.wisc.edu/~pharm
Introduction
- Many stores do not update system state
- “Silent stores” are writes that do not change the value at a memory location
- Our own prior work has shown (some might say ad nauseam) that silent stores are exploitable
  - Uniprocessors: memory write port reductions, reducing write-throughs, etc.
  - Multiprocessors: reducing sharing misses and invalidate traffic
- Other researchers have found silent stores useful
  - Purser et al. MICRO-2000, Yoaz et al. HPCA-2001, Steffan et al. HPCA-2002, others
October 7, 2002 Kevin Lepak and Mikko Lipasti--PHARM Team, University of Wisconsin ASPLOS-2002
What Is Temporal Silence?
- The key idea behind silence is no observable change in system state
- What if we change the state but then change it back?
- Examples:
  - Adding/removing items from a shared work stack
  - Flags indicating condition of a device/data structure
  - Lock variables (revert to unheld value when released)
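The lock example can be made concrete with a small store classifier. This is only a sketch under stated assumptions: the function name `classify_stores`, the trace format, and the rule that a store is temporally silent when it restores the value held just before the latest change are all illustrative choices, not the authors' definition (the talk allows reversion to any previous version).

```python
def classify_stores(stores, initial=0):
    """Label each (addr, value) store in program order.

    'silent'            - writes the value already in memory
    'temporally_silent' - restores the value held before the latest change
    'update'            - any other value-changing store
    Assumption: untouched memory holds `initial`, and only one previous
    version per address is remembered.
    """
    current = {}    # addr -> value now in memory
    previous = {}   # addr -> value held just before the latest change
    labels = []
    for addr, value in stores:
        cur = current.get(addr, initial)
        if value == cur:
            labels.append("silent")
            continue
        reverted = addr in previous and previous[addr] == value
        labels.append("temporally_silent" if reverted else "update")
        previous[addr] = cur
        current[addr] = value
    return labels


LOCK = 0x100  # hypothetical address of a lock word (0 = unheld, 1 = held)
# acquire, release, redundant release
print(classify_stores([(LOCK, 1), (LOCK, 0), (LOCK, 0)]))
# prints ['update', 'temporally_silent', 'silent']
```

The acquire is an ordinary update, the release is the reversion (TS) store, and the redundant release is a conventional silent store.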
Temporal Silence (TS) In MPs
[Figure: multiprocessor example with labels “Intermediate Value Store”, “Temporally Silent Pair”, and “Reversion (TS) Store”]
- Question: Does CPU 1 need to re-read [Addr A]?
- Answer: No, the “old” value is correct.
- Can we exploit TS to eliminate this read miss?
Outline
- Introduction to Temporal Silence
- Redefining Multiprocessor Sharing
- Multiprocessor Limit Study
- Characterizing Temporal Silence
- Exploiting Temporal Silence Non-Speculatively with Coherence Support
- Conclusions/Future Work
TSS MP Limit Study Setup
- Temporally Silent Sharing (TSS)
  - How often is a given CPU’s last fetched copy of a cache line current w.r.t. the global copy when accessed?
  - Indicates the potential reduction in data traffic by exploiting TS for shared cache lines
- Infinite/finite caches, unified, 64B lines
- Instant TS detection/propagation to remote processors
- PowerPC, AIX v4.3.1
- Scientific (SPLASH-2) and commercial workloads
- SimOS-PPC full system simulator (4 CPUs)
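The TSS question above can be sketched as a trace-driven counter. This is a toy model, not the authors' simulator: it assumes infinite caches, tracks single words instead of 64B lines, treats untouched memory as zero, and assumes silent stores cause no invalidation. The name `tss_limit` and the trace format are hypothetical.

```python
def tss_limit(trace, n_cpus):
    """Count communication misses and how many of them TSS could avoid.

    trace: iterable of (cpu, op, addr, value) with op in {"ld", "st"};
    value is ignored for loads.
    """
    memory = {}                                 # global copy: addr -> value
    copies = [dict() for _ in range(n_cpus)]    # each CPU's last fetched value
    stale = [set() for _ in range(n_cpus)]      # addrs invalidated remotely
    comm_misses = avoidable = 0
    for cpu, op, addr, value in trace:
        if op == "st":
            if memory.get(addr, 0) != value:    # non-silent store invalidates
                memory[addr] = value
                for other in range(n_cpus):
                    if other != cpu and addr in copies[other]:
                        stale[other].add(addr)
            copies[cpu][addr] = value
            stale[cpu].discard(addr)
        else:
            if addr in stale[cpu]:
                comm_misses += 1
                # Instant TS detection: the old copy is usable if it matches.
                if copies[cpu][addr] == memory.get(addr, 0):
                    avoidable += 1
                stale[cpu].discard(addr)
            copies[cpu][addr] = memory.get(addr, 0)
    return comm_misses, avoidable


A = 0x40
trace = [
    (0, "ld", A, None),   # CPU 0 fetches value 0
    (1, "st", A, 1),      # intermediate value store
    (1, "st", A, 0),      # reversion (TS) store
    (0, "ld", A, None),   # miss under MESI, avoidable with TSS
    (1, "st", A, 5),
    (0, "ld", A, None),   # genuine communication miss
]
print(tss_limit(trace, n_cpus=2))  # prints (2, 1)
```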
MP Limit Study--Comm. Misses
- Up to 45% reduction in communication misses for TSS; harmonic mean of 24%/42% for scientific/commercial workloads
MP Limit Study--Overall Miss Rate
- Up to 33% reduction in miss rate for infinite cache TSS; harmonic mean of 15%/25% for scientific/commercial workloads
Understanding TS/TSS
- Examine the contribution to TSS by kernel, library, and user functions
- Other benchmarks shown in paper
- Scientific: substantial activity within kernel
- Commercial: locking, JRE, process management
[Table, example: Spec-JBB; columns: TSS Misses, TSS Stores, Function, Description]
Understanding TSS
- Values of intermediate and TS stores
  - Most TS store values are integer zero
    - Greater than 5% in TPC-W and Spec-WEB are non-null pointers
  - In many cases, intermediate value is not one (user-level spin-locks)
    - Thread IDs: large fraction in commercial apps
    - Even in OCEAN: not user-level spin locks (40%)
    - Pointers to shared data structures (up to 40% in Spec-WEB)
- Contribution by atomic primitives (lines touched with store-conditionals)
TSS--Atomic Primitives
- Many True Sharing misses are due to atomic primitives
- Exception: Spec-JBB, where 55% of TSM are data
Attacking TSS Non-Speculatively
- Idea: Memory values revert to previous values
  - Detect when they revert to some previous version
  - Communicate reversion/version to other CPUs
- How can we win?
  - Improve remote read latency (cache misses -> hits)
  - Communicate versions only (not values)
Detecting Temporal Silence
- Inside the core
  - Augment LSQ to detect TS
  - Augment write buffer/write cache
- Outside the core
  - Exploit inclusive memory hierarchies
    - Ex: Modified L1 cache line has old version in L2
  - Augment L1/L2 with explicit storage
- This talk assumes we have enough storage to detect all cases of TSS which we can exploit
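A rough sketch of the "old version" idea at line granularity: keep a snapshot of the line as it stood before the first value-changing store, and compare after each later store. The class name `VersionedLine` and its fields are hypothetical; in hardware the snapshot might be the L2 copy of a modified L1 line or explicit extra storage, as the slide suggests.

```python
class VersionedLine:
    """One cache line plus a snapshot of its last globally visible version."""

    def __init__(self, words):
        self.data = list(words)
        self.old = None            # snapshot, taken lazily at the first change

    def store(self, offset, value):
        if self.data[offset] == value:
            return "silent"        # no change at all
        if self.old is None:
            self.old = list(self.data)   # remember the visible version
        self.data[offset] = value
        if self.data == self.old:
            return "temporally_silent"   # whole line is back to the old version
        return "intermediate"


line = VersionedLine([0, 0])
print(line.store(0, 1))   # prints intermediate
print(line.store(1, 2))   # prints intermediate
print(line.store(0, 0))   # prints intermediate (word 1 still differs)
print(line.store(1, 0))   # prints temporally_silent
```

Note that reversion is judged on the whole line: restoring one word is not enough while another word in the same line still differs from the snapshot.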
Exploiting Temporal Silence
- Add coherence support
  - Add a “temporally invalid” (T) state to MESI
    - Entered upon receipt of an invalidate
  - Add a “validate” transaction
    - Takes remote lines T->S
    - Broadcast when a processor detects the occurrence of temporal silence
    - May lead to an increase in address traffic
- New protocol -> MESTI
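The two transitions named on this slide (invalidate takes a line to T, validate takes it T->S) can be exercised in a toy model with one writer and one remote reader. This is a sketch under heavy simplification, not the full MESTI protocol: only the S and T states of the remote copy are modeled, a single value stands in for a cache line, and the names `RemoteCopy` and `run_writer` are hypothetical.

```python
class RemoteCopy:
    """A remote cache's copy of one line, restricted to states S and T."""

    def __init__(self, value):
        self.state = "S"
        self.value = value
        self.misses = 0

    def on_invalidate(self):
        if self.state == "S":
            self.state = "T"      # keep the stale data instead of dropping it

    def on_validate(self):
        if self.state == "T":
            self.state = "S"      # the stale data is current again

    def load(self, memory_value):
        if self.state != "S":     # T without a validate behaves like a miss
            self.misses += 1
            self.value = memory_value
            self.state = "S"
        return self.value


def run_writer(stores, initial=0):
    """Apply one writer's stores, then let the remote CPU load the line."""
    remote = RemoteCopy(initial)
    shared_version = initial      # value the remote copy was filled with
    current = initial
    for value in stores:
        if value == current:
            continue              # silent store: no coherence traffic
        if current == shared_version:
            remote.on_invalidate()    # writer takes ownership
        current = value
        if current == shared_version:
            remote.on_validate()      # temporal silence detected
    return remote.load(current), remote.misses


print(run_writer([1, 0]))  # prints (0, 0): validate turned the miss into a hit
print(run_writer([1, 2]))  # prints (2, 1): no reversion, ordinary miss
```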
MESTI Protocol Comm. Misses
- MESTI exploits most TSS; reduction in comm. misses of 21%/40% harmonic mean for scientific/commercial
MESTI Address Traffic--Ideal
- No address traffic increase with an oracle predictor of “useful” validates (validates that prevent remote read misses)
MESTI Address Traffic--Measured
- Actual increase up to 108% (infinite cache)
  - Decreases as cache size decreases
- What causes “useless” validates?
  - No remote CPU has a copy of the cache line
- Exploit snoop-aware validate
  - Detect if any remote CPU has a copy at the intermediate value store; if not, avoid the validate
  - Reduces useless validates by 0-20% for infinite caches, 7-50% for a finite 16MB cache
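One way to picture the snoop-aware validate is a per-line flag recorded at the intermediate value store and consulted at the reversion. This sketch assumes the invalidate's snoop response reports whether any remote cache held the line; that signal, the class name `SnoopAwareFilter`, and its methods are hypothetical.

```python
class SnoopAwareFilter:
    """Suppress validates for lines no remote CPU held when invalidated."""

    def __init__(self):
        self.remote_copy_seen = {}   # line addr -> snoop result at invalidate

    def on_intermediate_store(self, addr, remote_had_copy):
        # Called when the writer's invalidate for `addr` completes.
        self.remote_copy_seen[addr] = remote_had_copy

    def should_send_validate(self, addr):
        # Called when temporal silence is detected on `addr`.
        return self.remote_copy_seen.pop(addr, False)


f = SnoopAwareFilter()
f.on_intermediate_store(0x40, remote_had_copy=True)
f.on_intermediate_store(0x80, remote_had_copy=False)
print(f.should_send_validate(0x40))  # prints True
print(f.should_send_validate(0x80))  # prints False
```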
MESTI Address Traffic--Reduction
- What causes “useless” validates?
  - A remote access does not occur before the line is re-written (the TS write is not the last write)
- Place outbound validates into a delay queue
  - If a subsequent non-silent store occurs to the cache line, the validate is aborted
  - Filters many useless validates for lines re-written quickly
- Affects timeliness of validates
  - Some TSS misses will not be avoided if cache line has returned to old value but validate is delayed
- How do we determine effectiveness of such a queue?
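The delay queue described above might look roughly like the following. It is a minimal sketch: the class `ValidateDelayQueue`, its method names, and the cycle-based interface are assumptions made for illustration, and a real design would bound the queue size.

```python
class ValidateDelayQueue:
    """Hold outbound validates for `delay` cycles so re-writes can abort them."""

    def __init__(self, delay):
        self.delay = delay
        self.pending = {}     # line addr -> cycle at which the validate may go out
        self.sent = []        # (cycle, addr) of validates actually broadcast
        self.aborted = 0

    def enqueue(self, addr, now):
        # Temporal silence detected on `addr` at cycle `now`.
        self.pending[addr] = now + self.delay

    def on_nonsilent_store(self, addr):
        # The line was re-written before its validate left the queue.
        if self.pending.pop(addr, None) is not None:
            self.aborted += 1

    def tick(self, now):
        # Broadcast every validate whose delay has elapsed.
        ready = [a for a, due in self.pending.items() if due <= now]
        for addr in ready:
            del self.pending[addr]
            self.sent.append((now, addr))
        return ready


q = ValidateDelayQueue(delay=27)
q.enqueue(0x40, now=100)
q.on_nonsilent_store(0x40)       # re-written quickly: validate aborted
q.enqueue(0x80, now=200)
print(q.tick(226))               # prints []
print(q.tick(227))               # prints [128]
print(q.aborted)                 # prints 1
```

A longer delay filters more useless validates but holds back useful ones for longer, which is the timeliness trade-off the slide raises.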
MESTI--Delay Queue Approach
[Figure legend]
- Green: % of not-last-write TS stores detected within this distance
- Red: % of TS last writes exploited if propagated within this distance
MESTI--Delay Queue Approach
- Short queue (27 cycles) removes 30-35% of useless validates for Spec-JBB and TPC-W with 1% opportunity lost
Summary/Conclusions
- Storage locations revert to previous values
  - We call stores writing such values temporally silent
- Redefine MP sharing to consider this
  - Up to 45% of comm. misses eliminated
- Characterization reveals insight at the function level; not all TSS is due to atomic primitives
- Exploit non-speculatively with simple enhancements to the coherence protocol
  - Achieves the vast majority of possible benefit
  - Simple methods to reduce additional coherence txns
Current and Future Work
- How do we approach the limit study results with realistic implementations?
  - Further limit unnecessary address traffic
  - Detail efficient ways of detecting TS in the memory hierarchy
- Performance evaluation in OoO processor models, commercial workloads
- Comparison with speculative methods which can capture TS