
1 Temporally Silent Stores (Alternatively: Louder Silent Stores)
Kevin M. Lepak Mikko H. Lipasti University of Wisconsin—Madison PHARM Team

2 Introduction
Many stores do not update system state.
“Silent Stores” are writes which do not change the value at a memory location.
Our own prior work has shown (some might say ad nauseam) that silent stores are exploitable:
  Uniprocessors: memory write port reductions, reducing write-throughs, etc.
  Multiprocessors: reducing sharing misses and invalidate traffic.
Other researchers have found silent stores useful: Purser et al. (MICRO-2000), Yoaz et al. (HPCA-2001), Steffan et al. (HPCA-2002), and others.
October 7, 2002 Kevin Lepak and Mikko Lipasti--PHARM Team, University of Wisconsin ASPLOS-2002
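As a rough illustration of the definition above, the following Python sketch counts silent stores in a toy word-addressed memory. The class name `SilentStoreCounter` and the assumption that untouched locations read as zero are hypothetical choices made for this example, not something taken from the talk.

```python
class SilentStoreCounter:
    """Toy memory model that flags stores which do not change the stored value."""

    def __init__(self):
        self.cells = {}   # address -> current value
        self.silent = 0   # number of silent stores observed
        self.total = 0    # number of stores observed

    def store(self, addr, value):
        """Perform a store and return True if it was silent."""
        self.total += 1
        # Assumption for this sketch: untouched memory reads as 0.
        if self.cells.get(addr, 0) == value:
            self.silent += 1
            return True   # value unchanged: no write is actually needed
        self.cells[addr] = value
        return False


mem = SilentStoreCounter()
mem.store(0x10, 5)   # changes memory, not silent
mem.store(0x10, 5)   # rewrites the same value, silent
```

A real design would perform this comparison in hardware; the sketch only shows which stores the definition classifies as silent.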

3 What Is Temporal Silence?
The key idea behind silence is no observable change in system state.
What if we change the state but then change it back?
Examples:
  Adding/removing items from a shared work stack.
  Flags indicating the condition of a device/data structure.
  Lock variables (revert to the unheld value when released).
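The lock example can be sketched in Python as a classifier over a store trace. This is a simplified sketch under an assumption the slides do not state: a store counts as temporally silent when it restores the value the location held immediately before its most recent change. The function name `classify_stores` and the trace format are hypothetical.

```python
def classify_stores(trace, initial=0):
    """Label each (addr, value) store as silent, temporally silent, or non-silent.

    Simplifying assumption: only the value held just before the most recent
    change is remembered as the "previous version" of each address.
    """
    current = {}    # addr -> value now stored
    previous = {}   # addr -> value held before the most recent change
    labels = []
    for addr, value in trace:
        cur = current.get(addr, initial)
        if value == cur:
            labels.append("silent")              # no change at all
            continue
        if addr in previous and value == previous[addr]:
            labels.append("temporally silent")   # reverts an earlier change
        else:
            labels.append("non-silent")          # e.g. the intermediate value store
        previous[addr] = cur
        current[addr] = value
    return labels


# Lock variable: acquire writes 1, release reverts to the unheld value 0.
print(classify_stores([("lock", 1), ("lock", 0), ("lock", 0)]))
```

The acquire is the intermediate value store, the release is the reversion, and the final redundant write of 0 is an ordinary silent store.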

4 Temporal Silence (TS) In MPs
[Figure: an intermediate value store followed by a reversion (TS) store to [Addr A]; together they form a temporally silent pair.]
Question: Does CPU 1 need to re-read [Addr A]?
Answer: No, the “old” value is correct.
Can we exploit TS to eliminate this read miss?

5 Outline
Introduction to Temporal Silence
Redefining Multiprocessor Sharing
Multiprocessor Limit Study
Characterizing Temporal Silence
Exploiting Temporal Silence Non-Speculatively with Coherence Support
Conclusions/Future Work

6 TSS MP Limit Study Setup
Temporal Silent Sharing (TSS): how often is a given CPU’s last fetched copy of a cache line current w.r.t. the global copy when accessed?
Indicates the potential reduction in data traffic from exploiting TS for shared cache lines.
Setup:
  Infinite/finite caches, unified, 64B lines.
  Instant TS detection/propagation to remote processors.
  PowerPC, AIX v4.3.1.
  Scientific (SPLASH-2) and commercial workloads.
  SimOS-PPC full system simulator (4 CPUs).
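The TSS question above can be illustrated with a very small trace-driven model in Python. This is only a sketch under loud assumptions: it tracks single words rather than 64B lines, uses infinite caches, invalidates every remote copy on any write, and treats detection as instant. The function name `tss_limit_study` and the trace format are hypothetical and unrelated to the SimOS-PPC setup used in the talk.

```python
def tss_limit_study(trace, num_cpus, initial=0):
    """Count communication misses and how many a stale-but-correct copy would avoid.

    trace: list of (cpu, op, addr, value); op is "R" or "W", value is ignored for reads.
    Returns (communication_misses, avoidable_misses).
    """
    memory = {}                                   # global copy: addr -> value
    copies = [dict() for _ in range(num_cpus)]    # last fetched value per CPU
    valid = [set() for _ in range(num_cpus)]      # addresses each CPU may read locally
    comm_misses = 0
    avoidable = 0
    for cpu, op, addr, value in trace:
        if op == "W":
            memory[addr] = value
            copies[cpu][addr] = value
            valid[cpu].add(addr)
            for other in range(num_cpus):         # invalidate all remote copies
                if other != cpu:
                    valid[other].discard(addr)
        elif addr not in valid[cpu]:
            global_value = memory.get(addr, initial)
            if addr in copies[cpu]:               # held before, then invalidated
                comm_misses += 1
                if copies[cpu][addr] == global_value:
                    avoidable += 1                # last fetched copy is still current
            copies[cpu][addr] = global_value
            valid[cpu].add(addr)
    return comm_misses, avoidable


trace = [
    (1, "R", "A", None),   # CPU 1 fetches A (cold miss, not a communication miss)
    (0, "W", "A", 1),      # intermediate value store invalidates CPU 1
    (0, "W", "A", 0),      # reversion store
    (1, "R", "A", None),   # communication miss, but CPU 1's old copy is correct
    (0, "W", "A", 7),
    (1, "R", "A", None),   # communication miss with a genuinely new value
]
print(tss_limit_study(trace, num_cpus=2))
```

The ratio of the second number to the first is the kind of quantity the limit study reports, though the real study measures it per cache line on full workloads.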

7 MP Limit Study--Comm. Misses
TSS reduces communication misses by up to 45%, with harmonic means of 24% (scientific) and 42% (commercial).

8 MP Limit Study--Overall Miss Rate
TSS with an infinite cache reduces the overall miss rate by up to 33%, with harmonic means of 15% (scientific) and 25% (commercial).

9 Understanding TS/TSS
Examine the contribution to TSS by kernel, library, and user functions.
  Other benchmarks shown in paper.
  Scientific: substantial activity within the kernel.
  Commercial: locking, JRE, process management.
[Table, example: Spec-JBB. Columns: TSS Misses, TSS Stores, Function, Description.]

10 Understanding TSS
Values of intermediate and TS stores:
  Most TS store values are integer zero.
    More than 5% in TPC-W and Spec-WEB are non-null pointers.
  In many cases, the intermediate value is not one (user-level spin-locks).
    Thread IDs: a large fraction in commercial apps.
    Even in OCEAN: not user-level spin-locks (40%).
    Pointers to shared data structures (up to 40% in Spec-WEB).
Contribution by atomic primitives (lines touched with store-conditionals).

11 TSS--Atomic Primitives
Many true sharing misses are due to atomic primitives. Exception: Spec-JBB, where 55% of TSM are data.

12 Attacking TSS Non-Speculatively
Idea: memory values revert to previous values.
  Detect when they revert to some previous version.
  Communicate the reversion/version to other CPUs.
How can we win?
  Improve remote read latency (cache misses -> hits).
  Communicate versions only (not values).

13 Detecting Temporal Silence
Inside the core:
  Augment the LSQ to detect TS.
  Augment the write buffer/write cache.
Outside the core:
  Exploit inclusive memory hierarchies (e.g., a modified L1 cache line has its old version in L2).
  Augment L1/L2 with explicit storage.
This talk assumes we have enough storage to detect all cases of TSS that we can exploit.
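The inclusive-hierarchy option can be sketched in Python as a cache line that keeps a snapshot of its old version while it is modified. The snapshot stands in for the copy that an inclusive L2 would still hold; the class name `TrackedLine` and the whole-line comparison are assumptions made for this example rather than the mechanism described in the talk.

```python
class TrackedLine:
    """Toy cache line that detects when its contents revert to the old version."""

    def __init__(self, words):
        self.words = list(words)
        self.old_version = None   # stand-in for the unmodified copy held in L2

    def store(self, index, value):
        """Apply a store to one word and classify it at line granularity."""
        if self.words[index] == value:
            return "silent"
        if self.old_version is None:
            # First modification: remember the version other CPUs last saw.
            self.old_version = list(self.words)
        self.words[index] = value
        if self.words == self.old_version:
            self.old_version = None          # line matches the old version again
            return "temporally silent"
        return "non-silent"


line = TrackedLine([0, 0])
print([line.store(0, 1), line.store(1, 2), line.store(0, 0), line.store(1, 0)])
```

Note that the third store restores one word but not the whole line, so reversion is only reported once every word matches the old version.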

14 Exploiting Temporal Silence
Add coherence support:
  Add a “temporally invalid” (T) state to MESI.
    Entered upon receipt of an invalidate.
  Add a “validate” transaction.
    Takes remote lines T->S.
    Broadcast when a processor detects the occurrence of temporal silence.
    May lead to an increase in address traffic.
New protocol -> MESTI.
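The two additions can be sketched in Python as a transition table for a remote copy of one cache line. Only the S->T (on invalidate) and T->S (on validate) transitions come from the slide; the `fill` event, the decision to leave every other combination unchanged, and the names `TRANSITIONS`, `next_state`, `read_hits`, and `run` are assumptions made for this example, and the full MESTI protocol has more transitions than are shown.

```python
# Hypothetical remote-side view of one cache line; not the complete protocol.
TRANSITIONS = {
    ("S", "invalidate"): "T",   # from the slide: T is entered on an invalidate
    ("T", "validate"): "S",     # from the slide: a validate takes remote lines T->S
    ("T", "fill"): "S",         # assumption: a normal miss refetches the line
    ("I", "fill"): "S",         # assumption: a normal miss fetches the line
}


def next_state(state, event):
    """Return the next state; unlisted combinations leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def read_hits(state):
    """A read hits only when the line is in a valid state."""
    return state in ("M", "E", "S")


def run(events, state="S"):
    """Apply a sequence of events to a line that starts in the given state."""
    for event in events:
        state = next_state(state, event)
    return state


after_invalidate = run(["invalidate"])              # remote writer stores a new value
after_validate = run(["invalidate", "validate"])    # writer then reverts the value
print(after_invalidate, read_hits(after_invalidate))
print(after_validate, read_hits(after_validate))
```

In plain MESI the invalidate would take the line to I and the later read would miss; here the retained data in T is revalidated, so the read hits without a data transfer.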

15 MESTI Protocol Comm. Misses
MESTI exploits most TSS: communication misses fall by a harmonic mean of 21% (scientific) and 40% (commercial).

16 MESTI Address Traffic--Ideal
No address traffic increase with an oracle predictor of “useful” validates (“useful” validates are those that prevent remote read misses).

17 MESTI Address Traffic--Measured
Actual increase of up to 108% (infinite cache).
  Decreases as cache size decreases.
What causes “useless” validates? No remote CPU has a copy of the cache line.
Exploit snoop-aware validate:
  Detect whether any remote CPU has a copy at the intermediate value store; if not, avoid the validate.
  Reduces useless validates by 0-20% for infinite caches and by 7-50% for a finite 16MB cache.
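The snoop-aware filter can be sketched in Python as a small table kept by the writing processor. It assumes the intermediate value store produces a snoop response saying whether any remote CPU held a copy; the class `ValidateFilter`, its method names, and the boolean `snoop_shared` flag are hypothetical names for this example.

```python
class ValidateFilter:
    """Toy writer-side filter that suppresses validates nobody can use."""

    def __init__(self):
        self.remote_copy_seen = {}   # line address -> snoop result at the intermediate store

    def on_intermediate_store(self, line, snoop_shared):
        """Record whether any remote CPU held the line when it was invalidated."""
        self.remote_copy_seen[line] = snoop_shared

    def on_temporal_silence(self, line):
        """Return True if a validate should be broadcast for this line."""
        # With no record, or no remote copy, a validate would be useless.
        return self.remote_copy_seen.pop(line, False)


f = ValidateFilter()
f.on_intermediate_store("A", snoop_shared=True)
f.on_intermediate_store("B", snoop_shared=False)
print(f.on_temporal_silence("A"), f.on_temporal_silence("B"))
```

Line A had a remote sharer, so its validate is sent; line B had none, so the validate is skipped and no address traffic is added.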

18 MESTI Address Traffic--Reduction
What causes “useless” validates? A remote access does not occur before the line is re-written (the TS write is not the last write).
Place outbound validates into a delay queue:
  If a subsequent non-silent store occurs to the cache line, the validate is aborted.
  Filters many useless validates for lines re-written quickly.
  Affects timeliness of validates.
    Some TSS misses will not be avoided if the cache line has returned to its old value but the validate is delayed.
How do we determine the effectiveness of such a queue?
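The delay queue can be sketched in Python as follows. The class `ValidateDelayQueue`, its method names, and the cycle-counting interface are assumptions made for this example; the delay value used below is only a parameter of the sketch.

```python
from collections import OrderedDict


class ValidateDelayQueue:
    """Toy queue that holds outbound validates for a fixed number of cycles."""

    def __init__(self, delay_cycles):
        self.delay = delay_cycles
        self.pending = OrderedDict()   # line address -> cycle at which it was queued

    def enqueue(self, line, now):
        """Queue a validate when temporal silence is detected for a line."""
        self.pending[line] = now

    def on_non_silent_store(self, line):
        """Abort a pending validate for a re-written line; return True if one was dropped."""
        return self.pending.pop(line, None) is not None

    def release(self, now):
        """Return the lines whose validates have waited long enough to be broadcast."""
        ready = [line for line, queued in self.pending.items()
                 if now - queued >= self.delay]
        for line in ready:
            del self.pending[line]
        return ready


q = ValidateDelayQueue(delay_cycles=27)
q.enqueue("A", now=0)
q.enqueue("B", now=5)
q.on_non_silent_store("A")   # A is re-written quickly, so its validate is dropped
print(q.release(now=26), q.release(now=32))
```

A longer delay filters more useless validates but postpones the useful ones, which is the trade-off between filtering and timeliness described above.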

19 MESTI—Delay Queue Approach
Green: % of not-last-write TS stores detected within this distance.
Red: % of TS last writes exploited if propagated within this distance.

20 MESTI—Delay Queue Approach
A short queue (27 cycles) removes 30-35% of useless validates for Spec-JBB and TPC-W, with 1% of the opportunity lost.

21 Summary/Conclusions
Storage locations revert to previous values.
  We call stores writing such values temporally silent.
Redefine MP sharing to consider this.
  Up to 45% of comm. misses eliminated.
  Characterization reveals insight at the function level, and not all TSS is due to atomic primitives.
Exploit non-speculatively with simple enhancements to the coherence protocol.
  Achieves the vast majority of the possible benefit.
  Simple methods to reduce additional coherence transactions.

22 Current and Future Work
How do we approach the limit study results with realistic implementations?
  Further limit unnecessary address traffic.
  Detail efficient ways of detecting TS in the memory hierarchy.
Performance evaluation in OoO processor models, commercial workloads.
Comparison with speculative methods which can capture TS.

