On the Value Locality of Store Instructions Kevin M. Lepak Mikko H. Lipasti University of Wisconsin—Madison

Slides:



Advertisements
Similar presentations
Characterization of Silent Stores Gordon B.Bell Kevin M. Lepak Mikko H. Lipasti University of Wisconsin—Madison
Advertisements

Coherence Ordering for Ring-based Chip Multiprocessors Mike Marty and Mark D. Hill University of Wisconsin-Madison.
Hardware-based Devirtualization (VPC Prediction) Hyesoon Kim, Jose A. Joao, Onur Mutlu ++, Chang Joo Lee, Yale N. Patt, Robert Cohn* ++ *
Federation: Repurposing Scalar Cores for Out- of-Order Instruction Issue David Tarjan*, Michael Boyer, and Kevin Skadron* University of Virginia Department.
Managing Wire Delay in Large CMP Caches Bradford M. Beckmann David A. Wood Multifacet Project University of Wisconsin-Madison MICRO /8/04.
1 Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers By Sreemukha Kandlakunta Phani Shashank.
Our approach! 6.9% Perfect L2 cache (hit rate 100% ) 1MB L2 cache Cholesky 47% speedup BASE: All cores are used to execute the application-threads. PB-GS(PB-LS)
Combining Statistical and Symbolic Simulation Mark Oskin Fred Chong and Matthew Farrens Dept. of Computer Science University of California at Davis.
1 COMP 740: Computer Architecture and Implementation Montek Singh Tue, Feb 24, 2009 Topic: Instruction-Level Parallelism IV (Software Approaches/Compiler.
UPC Microarchitectural Techniques to Exploit Repetitive Computations and Values Carlos Molina Clemente LECTURA DE TESIS, (Barcelona,14 de Diciembre de.
How to Turn the Technological Constraint Problem into a Control Policy Problem Using Slack Brian FieldsRastislav BodíkMark D. Hill University of Wisconsin-Madison.
Sim-alpha: A Validated, Execution-Driven Alpha Simulator Rajagopalan Desikan, Doug Burger, Stephen Keckler, Todd Austin.
Speculative Sequential Consistency with Little Custom Storage Impetus Group Computer Architecture Lab (CALCM) Carnegie Mellon University
CS752 Decoupled Architecture for Data Prefetching Jichuan Chang Kai Xu.
Glenn Reinman, Brad Calder, Department of Computer Science and Engineering, University of California San Diego and Todd Austin Department of Electrical.
WCED: June 7, 2003 Matt Ramsay, Chris Feucht, & Mikko Lipasti University of Wisconsin-MadisonSlide 1 of 26 Exploring Efficient SMT Branch Predictor Design.
June 20 th 2004University of Utah1 Microarchitectural Techniques to Reduce Interconnect Power in Clustered Processors Karthik Ramani Naveen Muralimanohar.
UPC Reducing Misspeculation Penalty in Trace-Level Speculative Multithreaded Architectures Carlos Molina ψ, ф Jordi Tubella ф Antonio González λ,ф ISHPC-VI,
A Dynamic Binary Translation Approach to Architectural Simulation Harold “Trey” Cain, Kevin Lepak, and Mikko Lipasti Computer Sciences Department Department.
Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures Rajeev Balasubramonian School of Computing, University of Utah July 1.
CS Lecture 24 Exceeding the Dataflow Limit via Value Prediction M.H. Lipasti, J.P. Shen Proceedings of MICRO-29 December 1996.
Exploiting Load Latency Tolerance for Relaxing Cache Design Constraints Ramu Pyreddy, Gary Tyson Advanced Computer Architecture Laboratory University of.
Address-Value Delta (AVD) Prediction Onur Mutlu Hyesoon Kim Yale N. Patt.
(C) 2003 Milo Martin Using Destination-Set Prediction to Improve the Latency/Bandwidth Tradeoff in Shared-Memory Multiprocessors Milo Martin, Pacia Harper,
Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt.
1 Balanced Cache:Reducing Conflict Misses of Direct-Mapped Caches through Programmable Decoders ISCA 2006,IEEE. By Chuanjun Zhang Speaker: WeiZeng.
EECS 470 Memory Scheduling Lecture 11 Coverage: Chapter 3.
Predictor-Directed Stream Buffers Timothy Sherwood Suleyman Sair Brad Calder.
UPC Trace-Level Speculative Multithreaded Architecture Carlos Molina Universitat Rovira i Virgili – Tarragona, Spain Antonio González.
The Dirty-Block Index Vivek Seshadri Abhishek Bhowmick ∙ Onur Mutlu Phillip B. Gibbons ∙ Michael A. Kozuch ∙ Todd C. Mowry.
Revisiting Load Value Speculation:
1 Practical Selective Replay for Reduced-Tag Schedulers Dan Ernst and Todd Austin Advanced Computer Architecture Lab The University of Michigan June 8.
Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.
Dynamic Verification of Cache Coherence Protocols Jason F. Cantin Mikko H. Lipasti James E. Smith.
Revisiting Hardware-Assisted Page Walks for Virtualized Systems
Effects of wrong path mem. ref. in CC MP Systems Gökay Burak AKKUŞ Cmpe 511 – Computer Architecture.
Improving Cache Performance by Exploiting Read-Write Disparity Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez.
1 Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly Koushik Chakraborty Philip Wells Gurindar Sohi
Abdullah Aldahami ( ) March 23, Introduction 2. Background 3. Simulation Techniques a.Experimental Settings b.Model Description c.Methodology.
MadCache: A PC-aware Cache Insertion Policy Andrew Nere, Mitch Hayenga, and Mikko Lipasti PHARM Research Group University of Wisconsin – Madison June 20,
Advanced Computer Architecture Lab University of Michigan Compiler Controlled Value Prediction with Branch Predictor Based Confidence Eric Larson Compiler.
1 Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Speculation Amir Roth University of Pennsylvania.
Coherence Decoupling: Making Use of Incoherence J. Huh, J. Chang, D. Burger, G. Sohi ASPLOS 2004.
Caches Where is a block placed in a cache? –Three possible answers  three different types AnywhereFully associativeOnly into one block Direct mappedInto.
Dusty Caches for Reference Counting Garbage Collection Scott Friedman, Praveen Krishnamurthy, Roger Chamberlain, Ron K. Cytron, Jason Fritts Dept. of Computer.
RSIM: An Execution-Driven Simulator for ILP-Based Shared-Memory Multiprocessors and Uniprocessors.
Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan.
Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File Stephen Hines, Gary Tyson, and David Whalley Computer Science Dept. Florida.
Silent Stores for Free (or, Silent Stores Darn Cheap) Kevin M. Lepak Mikko H. Lipasti University of Wisconsin—Madison
Exploiting Value Locality in Physical Register Files Saisanthosh Balakrishnan Guri Sohi University of Wisconsin-Madison 36 th Annual International Symposium.
Eager Writeback — A Technique for Improving Bandwidth Utilization
OOO Pipelines - III Smruti R. Sarangi Computer Science and Engineering, IIT Delhi.
On the Importance of Optimizing the Configuration of Stream Prefetches Ilya Ganusov Martin Burtscher Computer Systems Laboratory Cornell University.
CS717 1 Hardware Fault Tolerance Through Simultaneous Multithreading (part 2) Jonathan Winter.
Cache Pipelining with Partial Operand Knowledge Erika Gunadi and Mikko H. Lipasti Department of Electrical and Computer Engineering University of Wisconsin—Madison.
Value Prediction Kyaw Kyaw, Min Pan Final Project.
Implementing Optimizations at Decode Time
Speculative Lock Elision
Jason F. Cantin, Mikko H. Lipasti, and James E. Smith
Half-Price Architecture
Energy-Efficient Address Translation
Temporally Silent Stores (Alternatively: Louder Silent Stores)
Module 3: Branch Prediction
Address-Value Delta (AVD) Prediction
Natalie Enright Jerger, Li Shiuan Peh, and Mikko Lipasti
Avoiding Initialization Misses to the Heap
Aliasing and Anti-Aliasing in Branch History Table Prediction
Lois Orosa, Rodolfo Azevedo and Onur Mutlu
Phase based adaptive Branch predictor: Seeing the forest for the trees
Project Guidelines Prof. Eric Rotenberg.
Presentation transcript:

On the Value Locality of Store Instructions Kevin M. Lepak Mikko H. Lipasti University of Wisconsin—Madison

June 13, 2000Kevin M. Lepak and Mikko H. Lipasti Motivation Value Locality (VL) is Real More than 30 VL/VP papers Patents granted for VL work (AMD, Gabbay) VL has been traditionally used to speed up the next state function (the “means” of computing) VL has not been explored (generally) for the output function (the “end” of computing) Is there any benefit to exploiting VL for outputs?

June 13, 2000Kevin M. Lepak and Mikko H. Lipasti VL of Memory Writes—Who Cares? Reduction of writebacks/dirty lines is desirable Less data traffic Fewer invalidates Reduced pressure on WB buffers Writes comparatively expensive Requires multi-porting/banking circuit tricks Removal of “unnecessary” value storage seems intuitively satisfying Do unnecessary writes imply unnecessary computation?

June 13, 2000Kevin M. Lepak and Mikko H. Lipasti Outline Motivation Introduction to Silent Stores Uniprocessor Results Multiprocessor True/False Sharing New Definition of False Sharing Multi-Processor Results Conclusions

June 13, 2000Kevin M. Lepak and Mikko H. Lipasti Terminology Silent Store: A memory write that does not change the system state

June 13, 2000Kevin M. Lepak and Mikko H. Lipasti Silent Stores—Is this for Real? Percentage of silent stores is non-trivial in all cases, 20%-68%

June 13, 2000Kevin M. Lepak and Mikko H. Lipasti Terminology Silent Store: A memory write that does not change the system state Store Verify: A load, compare, and subsequent store (if non-silent) operation Store Squashing: Removal of a silent store from program execution Dynamically Statically

June 13, 2000Kevin M. Lepak and Mikko H. Lipasti Uniprocessor Machine Model SimpleScalar simulator 64 entry RUU; 8 issue 64K Gshare branch predictor 64K each split I/D cache 1MB L2 unified cache 16 entry load/store queue 4 memory load ports; 1 store port (4-wide version of the 2-wide 21164)

June 13, 2000Kevin M. Lepak and Mikko H. Lipasti Squashing Mechanism Baseline Implementation of Store Squashing

June 13, 2000Kevin M. Lepak and Mikko H. Lipasti Writebacks Eliminated Substantial WB elimination by simplistic store verify/squash (14%-81% for cases with non-trivial WBs in the baseline case)

June 13, 2000Kevin M. Lepak and Mikko H. Lipasti IPC Effects 7.6%, 7.9%, 14%, and 6.3% speedup of squashing over no squashing for m88ksim, gcc, vortex, and HM

June 13, 2000Kevin M. Lepak and Mikko H. Lipasti IPC Effects Squashing provides more benefit than store forwarding

June 13, 2000Kevin M. Lepak and Mikko H. Lipasti Outline Motivation Introduction to Silent Stores Uniprocessor Results Multiprocessor True/False Sharing New Definition of False Sharing Multi-Processor Results Conclusions

June 13, 2000Kevin M. Lepak and Mikko H. Lipasti Multiprocessor True/False Sharing Dubois et. al: ISCA-1993 Address-based definition Considers all “side-effects” (non-critical words) brought in by a cache miss and future accesses to the line Does not consider silent stores or data value prediction

June 13, 2000Kevin M. Lepak and Mikko H. Lipasti Terminology Program Structure Store Value Locality (PSSVL): The value locality exhibited by a given static store (can write to many addresses) Message Passing Store Value Locality (MPSVL): The value locality exhibited for a specific memory location (can be written by many PCs) Stochastically Silent Store: A store value which is trivially predictable by any well known method

June 13, 2000Kevin M. Lepak and Mikko H. Lipasti MPSVL and PSSVL Percentage of stochastically silent (PSSVL, MPSVL) stores is non-trivial 27%-72% for PSSVL, 39%-70% for MPSVL

June 13, 2000Kevin M. Lepak and Mikko H. Lipasti New Definition of False Sharing Extend Dubois’ definition with store value locality: Update False Sharing (UFS) Consider (update) silent stores Stochastic False Sharing (SFS) Consider stochastically (predictable by any well known method) silent stores Message-passing Stochastic False Sharing (MSFS): Exploit value locality based on effective memory address Program-structure Stochastic False Sharing (PSFS): Exploit value locality based on instruction address (PC)

June 13, 2000Kevin M. Lepak and Mikko H. Lipasti Machine Model SimOS-PPC full system simulator 4 processors 1MB data cache 4K direct-mapped stride predictor for Program- structure and Message-passing store value locality Remove all silent and stochastically silent stores from the cache hierarchy Limit study—must have a mechanism to exploit (subject of current research)

June 13, 2000Kevin M. Lepak and Mikko H. Lipasti Multiprocessor Sharing

June 13, 2000Kevin M. Lepak and Mikko H. Lipasti Multiprocessor Sharing

June 13, 2000Kevin M. Lepak and Mikko H. Lipasti Multiprocessor Sharing

June 13, 2000Kevin M. Lepak and Mikko H. Lipasti Multiprocessor Sharing Measurable reduction in true/false sharing for simple update silent squashing (UFS) Substantial reductions by squashing update silent store hits and misses (UFS-P) and stochastically silent stores (SFS) Squashing store misses (UFS-P) can be substantially better than simple UFS Motivates silence confidence mechanism for store misses

June 13, 2000Kevin M. Lepak and Mikko H. Lipasti Multiprocessor Traffic Measurable reduction in invalidate traffic for simple update silent store squashing (UFS)— more effective than Exclusive state Substantial reduction for UFS-P and Stochastic False Sharing (SFS) Writeback data traffic reduction by squashing update silent store hits and misses (UFS-P) 5%-82% in oltp 16%-17% in ocean 5%-16% in barnes

June 13, 2000Kevin M. Lepak and Mikko H. Lipasti Conclusions Significant store value locality exists MPSVL (includes update silent stores) PSSVL Uniprocessor performance can be enhanced by squashing silent stores A new definition of sharing is given which accounts for update/stochastically silent stores We can exploit the new sharing definitions to reduce address and data bus traffic UFS: Implementation given, non-trivial results SFS: Limit study shows significant potential

June 13, 2000Kevin M. Lepak and Mikko H. Lipasti Future Work Characterization of silent stores Silence prediction and confidence mechanisms Implementation of SFS mechanism...

June 13, 2000Kevin M. Lepak and Mikko H. Lipasti Backup Slides

June 13, 2000Kevin M. Lepak and Mikko H. Lipasti IPC—Continued Squashing in an exclusive 4L/1S memory system equivalent to non-exclusive/no squash

June 13, 2000Kevin M. Lepak and Mikko H. Lipasti Previous Value Locality Works Lipasti, Shen: MICRO-1996, ASPLOS-1996 Load value prediction (VP), input register VP Mendelson, Gabbay: Technion TR-97 VP based on output register specifier Gonzalez, Gonzalez: PACT-1999 Improving branch prediction Calder et. al: ISCA-1999 Critical path optimizations Many Others

June 13, 2000Kevin M. Lepak and Mikko H. Lipasti Multiprocessor Invalidates OCEAN

June 13, 2000Kevin M. Lepak and Mikko H. Lipasti Multiprocessor Invalidates BARNES

June 13, 2000Kevin M. Lepak and Mikko H. Lipasti Multiprocessor Invalidates OLTP