Temporally Silent Stores (Alternatively: Louder Silent Stores)

Temporally Silent Stores (Alternatively: Louder Silent Stores)
Kevin M. Lepak, Mikko H. Lipasti
University of Wisconsin-Madison
PHARM Team, http://www.ece.wisc.edu/~pharm

Introduction
- Many stores do not update system state
- "Silent stores" are writes that do not change the value at a memory location
- Our own prior work has shown (some might say ad nauseam) that silent stores are exploitable
  - Uniprocessors: memory write port reductions, reducing write-throughs, etc.
  - Multiprocessors: reducing sharing misses and invalidate traffic
- Other researchers have found silent stores useful
  - Purser et al. MICRO-2000, Yoaz et al. HPCA-2001, Steffan et al. HPCA-2002, others
October 7, 2002 Kevin Lepak and Mikko Lipasti--PHARM Team, University of Wisconsin ASPLOS-2002

What Is Temporal Silence?
- The key idea behind silence is no observable change in system state
- What if we change the state but then change it back?
- Examples:
  - Adding/removing items from a shared work stack
  - Flags indicating condition of a device/data structure
  - Lock variables (revert to unheld value when released)
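
The lock example above can be made concrete with a small sketch. It is illustrative only and not taken from the talk: the function name `classify_stores` and the three labels are hypothetical, and it assumes a single memory location whose intermediate values are never observed by anyone else, so the "previous" value is simply the initial one.

```python
def classify_stores(initial_value, stored_values):
    """Label each store to one location (illustrative sketch).

    Assumption: no other observer fetches an intermediate value, so the
    reference value that a reversion is compared against never changes.
    """
    reference = initial_value   # value last visible to other observers
    current = initial_value     # value currently held at the location
    labels = []
    for value in stored_values:
        if value == current:
            labels.append("silent")             # writes the value already there
        elif value == reference:
            labels.append("temporally silent")  # reverts to the earlier value
        else:
            labels.append("non-silent")         # intermediate value store
        current = value
    return labels


# A lock word that starts unheld (0), is acquired (1), then released (0).
print(classify_stores(0, [1, 0, 1, 0]))
```

Each acquire is an intermediate value store and each release is the reversion that completes a temporally silent pair.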

Temporal Silence (TS) In MPs
[Figure: stores to Addr A; labels: Intermediate Value Store, Reversion (TS) Store, Temporally Silent Pair]
- Question: Does CPU 1 need to re-read [Addr A]?
- Answer: No, the "old" value is correct.
- Can we exploit TS to eliminate this read miss?

Outline
- Introduction to Temporal Silence
- Redefining Multiprocessor Sharing
- Multiprocessor Limit Study
- Characterizing Temporal Silence
- Exploiting Temporal Silence Non-Speculatively with Coherence Support
- Conclusions/Future Work

TSS MP Limit Study Setup
- Temporal Silent Sharing (TSS)
  - How often is a given CPU's last fetched copy of a cache line current w.r.t. the global copy when accessed?
  - Indicates the potential reduction in data traffic by exploiting TS for shared cache lines
- Infinite/finite caches, unified, 64B lines
- Instant TS detection/propagation to remote processors
- PowerPC, AIX v4.3.1
- Scientific (SPLASH-2) and commercial workloads
- SimOS-PPC full system simulator (4 CPUs)
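
The question the limit study asks can be sketched as a trace-driven count. This is a simplified illustration, not the SimOS-PPC setup: the function `tss_limit` and the trace format are hypothetical, caches are infinite, each "line" holds a single value, and only read misses caused by a remote write are counted.

```python
def tss_limit(trace, initial=0):
    """Count communication read misses and how many TSS could avoid.

    trace: iterable of (cpu, op, addr, value) with op "R" or "W".
    Assumptions (not from the talk): infinite caches, one value per line,
    instant detection of reversion, write misses ignored.
    """
    memory = {}   # addr -> current global value
    cached = {}   # (cpu, addr) -> [last fetched value, valid flag]
    comm_misses = 0
    avoidable = 0
    for cpu, op, addr, value in trace:
        if op == "W":
            memory[addr] = value
            # Invalidate remote copies, but remember their stale value.
            for (other_cpu, other_addr), entry in cached.items():
                if other_addr == addr and other_cpu != cpu:
                    entry[1] = False
            cached[(cpu, addr)] = [value, True]
        else:
            entry = cached.get((cpu, addr))
            current = memory.get(addr, initial)
            if entry is not None and not entry[1]:
                comm_misses += 1
                if entry[0] == current:
                    avoidable += 1   # last fetched copy is still current
            cached[(cpu, addr)] = [current, True]
    return comm_misses, avoidable


trace = [
    (1, "R", "A", None),  # CPU 1 fetches A (cold miss, value 0)
    (0, "W", "A", 1),     # intermediate value store invalidates CPU 1
    (0, "W", "A", 0),     # reversion store
    (1, "R", "A", None),  # communication miss, but the old copy was correct
    (0, "W", "A", 5),     # non-reverting store
    (1, "R", "A", None),  # communication miss that cannot be avoided
]
print(tss_limit(trace))
```

The ratio of the second count to the first corresponds loosely to the "reduction in communication misses" reported on the following slides.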

MP Limit Study--Comm. Misses
Up to 45% reduction in communication misses for TSS, 24%/42% harmonic mean for scientific/commercial

MP Limit Study--Overall Miss Rate
Up to 33% reduction in miss rate for infinite cache TSS, 15%/25% harmonic mean for scientific/commercial

Understanding TS/TSS
- Examine the contribution to TSS by kernel, library, and user functions
  - Other benchmarks shown in paper
  - Scientific--substantial activity within kernel
  - Commercial--locking, JRE, process management
- Example: Spec-JBB [table with columns: TSS Misses, TSS Stores, Function, Description]

Understanding TSS
- Values of intermediate and TS stores
  - Most TS store values are integer zero
  - Greater than 5% in TPC-W and Spec-WEB are non-null pointers
  - In many cases, intermediate value is not one (user-level spin-locks)
    - Thread IDs: large fraction in commercial apps
    - Even in OCEAN: not user-level spin locks (40%)
    - Pointers to shared data structures (up to 40% in Spec-WEB)
- Contribution by atomic primitives (lines touched with store-conditionals)

TSS--Atomic Primitives
Many True Sharing misses are due to atomic primitives; Exception: Spec-JBB, 55% of TSM are data

Attacking TSS Non-Speculatively
- Idea: Memory values revert to previous values
  - Detect when they revert to some previous version
  - Communicate reversion/version to other CPUs
- How can we win?
  - Improve remote read latency (cache misses -> hits)
  - Communicate versions only (not values)

Detecting Temporal Silence
- Inside the core
  - Augment LSQ to detect TS
  - Augment write buffer/write cache
- Outside the core
  - Exploit inclusive memory hierarchies
    - Ex: Modified L1 cache line has old version in L2
  - Augment L1/L2 with explicit storage
- This talk assumes we have enough storage to detect all cases of TSS which we can exploit
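
The inclusive-hierarchy option can be sketched as a comparison between the working L1 copy and the old version still held in L2. The class `LineWithOldVersion` is hypothetical and makes a strong assumption: the L2 copy is exactly the version other CPUs last saw and is not updated while the L1 copy is being modified.

```python
class LineWithOldVersion:
    """One cache line with a working L1 copy and an old L2 version (sketch)."""

    def __init__(self, data: bytes):
        self.l2 = bytes(data)       # assumed: last globally visible version
        self.l1 = bytearray(data)   # working copy that stores modify
        self.differs = False        # True while L1 holds an intermediate value

    def store(self, offset: int, byte_value: int) -> bool:
        """Apply a one-byte store; return True if it completes a reversion."""
        self.l1[offset] = byte_value
        differed_before = self.differs
        self.differs = bytes(self.l1) != self.l2
        # Temporal silence: the line was different and now matches again.
        return differed_before and not self.differs


line = LineWithOldVersion(bytes(64))   # 64B line, all zeros
print(line.store(0, 1))   # intermediate value store
print(line.store(0, 0))   # reversion store
```

A whole-line compare on every store is only a model of the idea; the slide's other options (LSQ or write-buffer augmentation) would detect the same condition with different hardware.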

Exploiting Temporal Silence
- Add coherence support
  - Add a "temporally invalid" (T) state to MESI
    - Entered upon receipt of an invalidate
  - Add a "validate" transaction
    - Takes remote lines T->S
    - Broadcast when a processor detects the occurrence of temporal silence
    - May lead to an increase in address traffic
- New protocol -> MESTI
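
The two additions to MESI can be sketched from the point of view of a remote cache that snoops the bus. This is a minimal sketch, not the full MESTI protocol: the names `State`, `on_remote_event`, and `is_read_hit` are hypothetical, any valid copy is assumed to move to T on an invalidate, and the writer is assumed to validate only the version that T copies still hold.

```python
from enum import Enum


class State(Enum):
    M = "modified"
    E = "exclusive"
    S = "shared"
    T = "temporally invalid"
    I = "invalid"


def on_remote_event(state, event):
    """Next state of a line in a remote cache after a snooped event."""
    if event == "invalidate":
        # Assumption: a valid (or already T) copy keeps its stale data in T.
        if state in (State.M, State.E, State.S, State.T):
            return State.T
        return State.I
    if event == "validate":
        # The writer detected temporal silence: stale data is current again.
        return State.S if state is State.T else state
    raise ValueError(f"unknown event: {event}")


def is_read_hit(state):
    """A read hits only in a valid state; T and I both miss."""
    return state in (State.M, State.E, State.S)


state = State.S
state = on_remote_event(state, "invalidate")   # intermediate value store
print(state, is_read_hit(state))
state = on_remote_event(state, "validate")     # reversion detected by writer
print(state, is_read_hit(state))
```

Without the validate, the read after the reversion would miss even though the retained data is correct, which is the miss the talk aims to eliminate.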

MESTI Protocol Comm. Misses
MESTI exploits most TSS, reduction in comm. misses 21%/40% harmonic mean for scientific/commercial

MESTI Address Traffic--Ideal
No address traffic increase with oracle predictor of "useful" validates -> validates prevent remote read misses

MESTI Address Traffic--Measured
- Actual increase up to 108% (infinite cache)
  - Decreases as cache size decreases
- What causes "useless" validates?
  - No remote CPU has a copy of the cache line
- Exploit snoop-aware validate
  - Detect if any remote CPU has a copy at intermediate value store; if not, avoid validate
  - Reduces useless validates from 0-20% for infinite caches, 7-50% for a finite 16MB cache
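
The snoop-aware validate can be sketched as a small per-line record kept by the writer. The class `SnoopAwareFilter` and its method names are hypothetical; the sketch assumes the snoop response to the intermediate value store tells the writer whether any remote cache held the line.

```python
class SnoopAwareFilter:
    """Remember, per line, whether a validate could help anyone (sketch)."""

    def __init__(self):
        self.remote_had_copy = {}   # addr -> snoop result at the intermediate store

    def on_intermediate_store(self, addr, snoop_hit):
        # Assumed input: snoop_hit is True if any remote CPU held the line
        # when this store's invalidate was broadcast.
        self.remote_had_copy[addr] = snoop_hit

    def should_validate(self, addr):
        """Called when the line reverts; consumes the recorded snoop result."""
        return self.remote_had_copy.pop(addr, False)


f = SnoopAwareFilter()
f.on_intermediate_store("A", snoop_hit=True)    # someone holds A in T now
f.on_intermediate_store("B", snoop_hit=False)   # nobody had a copy of B
print(f.should_validate("A"), f.should_validate("B"))
```

Only the validate for line A is broadcast; the one for B would have been useless because no remote copy is waiting in the T state.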

MESTI Address Traffic--Reduction
- What causes "useless" validates?
  - A remote access does not occur before the line is re-written (the TS write is not the last write)
- Place outbound validates into a delay queue
  - If a subsequent non-silent store occurs to the cache line, the validate is aborted
  - Filters many useless validates for lines re-written quickly
- Affects timeliness of validates
  - Some TSS misses will not be avoided if cache line has returned to old value but validate is delayed
- How do we determine effectiveness of such a queue?
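
The delay queue described above can be sketched as follows. The class `ValidateDelayQueue`, its method names, and the cycle-based interface are hypothetical; the sketch assumes a fixed delay and at most one pending validate per cache line.

```python
class ValidateDelayQueue:
    """Hold outbound validates for a fixed delay so they can be aborted (sketch)."""

    def __init__(self, delay_cycles):
        self.delay = delay_cycles
        self.pending = {}   # addr -> cycle at which the validate may be sent

    def on_temporally_silent_store(self, addr, now):
        # Queue (or re-queue) a validate instead of broadcasting it at once.
        self.pending[addr] = now + self.delay

    def on_non_silent_store(self, addr):
        """The line was re-written: abort its validate. True if one was pending."""
        return self.pending.pop(addr, None) is not None

    def drain(self, now):
        """Return the addresses whose validates are due at cycle `now`."""
        ready = [addr for addr, due in self.pending.items() if due <= now]
        for addr in ready:
            del self.pending[addr]
        return ready


q = ValidateDelayQueue(delay_cycles=27)
q.on_temporally_silent_store("A", now=0)
q.on_temporally_silent_store("B", now=5)
print(q.on_non_silent_store("A"))   # A is re-written quickly: validate aborted
print(q.drain(now=27))              # B is not due yet
print(q.drain(now=32))              # B's validate is sent
```

A longer delay aborts more useless validates but postpones the useful ones, which is the timeliness trade-off the slide raises.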

MESTI--Delay Queue Approach
[Figure legend]
- Green: % of not last write TS stores detected in this distance
- Red: % of TS last writes exploited if propagated within this distance

MESTI--Delay Queue Approach
Short queue (27 cycles) removes 30-35% of useless validates for Spec-JBB and TPC-W with 1% opportunity lost

Summary/Conclusions
- Storage locations revert to previous values
  - We call stores writing such values temporally silent
- Redefine MP sharing to consider this
  - Up to 45% of comm. misses eliminated
- Characterization reveals insight at function level, and not all due to atomic primitives
- Exploit non-speculatively with simple enhancements to the coherence protocol
  - Achieves the vast majority of possible benefit
  - Simple methods to reduce additional coherence txns

Current and Future Work
- How do we approach the limit study results with realistic implementations?
  - Further limit unnecessary address traffic
  - Detail efficient ways of detecting TS in the memory hierarchy
- Performance evaluation in OoO processor models, commercial workloads
- Comparison with speculative methods which can capture TS