Speculative Lock Elision

Slides:

Advertisements

Similar presentations

TRAMP Workshop Some Challenges Facing Transactional Memory Craig Zilles and Lee Baugh University of Illinois at Urbana-Champaign.

Advertisements

The University of Adelaide, School of Computer Science

CS 162 Memory Consistency Models. Memory operations are reordered to improve performance Hardware (e.g., store buffer, reorder buffer) Compiler (e.g.,

Monitoring Data Structures Using Hardware Transactional Memory Shakeel Butt 1, Vinod Ganapathy 1, Arati Baliga 2 and Mihai Christodorescu 3 1 Rutgers University,

D u k e S y s t e m s Time, clocks, and consistency and the JMM Jeff Chase Duke University.

UPC Microarchitectural Techniques to Exploit Repetitive Computations and Values Carlos Molina Clemente LECTURA DE TESIS, (Barcelona,14 de Diciembre de.

1 Lecture 20: Speculation Papers: Is SC+ILP=RC?, Purdue, ISCA’99 Coherence Decoupling: Making Use of Incoherence, Wisconsin, ASPLOS’04 Selective, Accurate,

Thread-Level Transactional Memory Decoupling Interface and Implementation UW Computer Architecture Affiliates Conference Kevin Moore October 21, 2004.

1 Lecture 21: Transactional Memory Topics: consistency model recap, introduction to transactional memory.

CS 7810 Lecture 19 Coherence Decoupling: Making Use of Incoherence J.Huh, J. Chang, D. Burger, G. Sohi Proceedings of ASPLOS-XI October 2004.

[ 1 ] Agenda Overview of transactional memory (now) Two talks on challenges of transactional memory Rebuttals/panel discussion.

1 Lecture 7: Consistency Models Topics: sequential consistency, requirements to implement sequential consistency, relaxed consistency models.

Supporting Nested Transactional Memory in LogTM Authors Michelle J Moravan Mark Hill Jayaram Bobba Ben Liblit Kevin Moore Michael Swift Luke Yen David.

Lecture 13: Consistency Models

Computer Architecture II 1 Computer architecture II Lecture 9.

Multiscalar processors

1 Lecture 15: Consistency Models Topics: sequential consistency, requirements to implement sequential consistency, relaxed consistency models.

1 Sharing Objects – Ch. 3 Visibility What is the source of the issue? Volatile Dekker’s algorithm Publication and Escape Thread Confinement Immutability.

CS510 Concurrent Systems Class 5 Threads Cannot Be Implemented As a Library.

UPC Trace-Level Speculative Multithreaded Architecture Carlos Molina Universitat Rovira i Virgili – Tarragona, Spain Antonio González.

Memory Consistency Models Some material borrowed from Sarita Adve’s (UIUC) tutorial on memory consistency models.

SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill,

Shared Memory Consistency Models. SMP systems support shared memory abstraction: all processors see the whole memory and can perform memory operations.

Transactional Coherence and Consistency Presenters: Muhammad Mohsin Butt. (g ) Coe-502 paper presentation 2.

Coherence Decoupling: Making Use of Incoherence J. Huh, J. Chang, D. Burger, G. Sohi ASPLOS 2004.

CSC Multiprocessor Programming, Spring, 2012 Chapter 11 – Performance and Scalability Dr. Dale E. Parson, week 12.

CS533 Concepts of Operating Systems Jonathan Walpole.

HARD: Hardware-Assisted lockset- based Race Detection P.Zhou, R.Teodorescu, Y.Zhou. HPCA’07 Shimin Chen LBA Reading Group Presentation.

1 Lecture 20: Speculation Papers: Is SC+ILP=RC?, Purdue, ISCA’99 Coherence Decoupling: Making Use of Incoherence, Wisconsin, ASPLOS’04.

An Evaluation of Memory Consistency Models for Shared- Memory Systems with ILP processors Vijay S. Pai, Parthsarthy Ranganathan, Sarita Adve and Tracy.

CS757 Lock Behaviour Characterization of Commercial Workloads Jichuan Chang Xidong Wang.

Transactional Memory Coherence and Consistency Lance Hammond, Vicky Wong, Mike Chen, Brian D. Carlstrom, John D. Davis, Ben Hertzberg, Manohar K. Prabhu,

Lecture 20: Consistency Models, TM

Minh, Trautmann, Chung, McDonald, Bronson, Casper, Kozyrakis, Olukotun

Transactional Memory : Hardware Proposals Overview

Prof. Onur Mutlu Carnegie Mellon University 10/12/2012

Part 2: Software-Based Approaches

Multiscalar Processors

Memory Consistency Models

Threads Cannot Be Implemented As a Library

The University of Adelaide, School of Computer Science

The University of Adelaide, School of Computer Science

Lecture 11: Consistency Models

Memory Consistency Models

Lecture 18: Coherence and Synchronization

Jason F. Cantin, Mikko H. Lipasti, and James E. Smith

The University of Adelaide, School of Computer Science

Two Ideas of This Paper Using Permissions-only Cache to deduce the rate at which less-efficient overflow handling mechanisms are invoked. When the overflow.

The University of Adelaide, School of Computer Science

Changing thread semantics

Symmetric Multiprocessors: Synchronization and Sequential Consistency

Lecture 6: Transactions

15-740/ Computer Architecture Lecture 5: Precise Exceptions

Transactional Memory An Overview of Hardware Alternatives

Lecture 22: Consistency Models, TM

Chapter 5 Exploiting Memory Hierarchy : Cache Memory in CMP

Mengjia Yan† , Jiho Choi† , Dimitrios Skarlatos,

Lecture 25: Multiprocessors

Lecture 10: Consistency Models

The University of Adelaide, School of Computer Science

Lecture 17 Multiprocessors and Thread-Level Parallelism

Lecture 24: Multiprocessors

Lecture 17 Multiprocessors and Thread-Level Parallelism

Lecture 21: Synchronization & Consistency

Lecture: Consistency Models, TM

Lecture 19: Coherence and Synchronization

The University of Adelaide, School of Computer Science

CSC Multiprocessor Programming, Spring, 2011

Lecture 17 Multiprocessors and Thread-Level Parallelism

Lecture 11: Consistency Models

Presentation transcript:

Speculative Lock Elision Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and James R. Goodman University of Wisconsin-Madison

Motivation Multithreaded Programming gains Importance SMPs, Multithreaded Architectures Used in non-traditional fields (Desktop) Synchronization Mechanisms required for exclusive access Serialize access to shared data structures Necessary to prevent conflicting updates

Dynamically Unnecessary Synchronization LOCK(locks->error_lock) if (local_error>multi->err_multi) multi->err_multi=local_err; UNLOCK(locks->error_lock); Require no lock, if no write access different fields accessed Examples from ocean SHORE (similar) Thread 1 LOCK(hash_tbl.lock) var=hash_tbl.lookup(X) if(!var) hash_tbl.add(X); UNLOCK(hash_tbl.lock) Thread 2 var=hash_tbl.lookup(Y) hash_tbl.add(Y);

Multithreaded Programming Conservative Locking Easier to show correctness Lock more often than necessary Locking Granularity Trade-Off between Performance and Complexity Thread-unsafe legacy libraries Require global locking

False Dependencies Locks introduce Control and Data Dependence Removal Key is Appearance of Instantaneous Change Lock can be elided, if memory operation remain atomic Data read is not modified by other threads Data written is not access by other threads Any instruction that violates these conditions must not be retired

Silent store pairs i16 and i6 are silent pair Not silent individually i16 undoes i6 Not silent individually No write to _lock_ Read _lock_ is ok SLE does not depend on semantic information Simply observe silent store pairs

Two SLE Predictions Silent pair prediction Atomicity prediction On a store Predict another store will undo changes Don’t perform stores Monitor memory location If validated, elide two stores Atomicity prediction Predict all memory access within silent pair occur atomically No partial updates visible to other threads

Initiating Speculation Detect candidate pairs Filter indexed by program counter Add confidence metric for each pair Predict, if lock held If yes, don’t speculate

Buffering Speculative State Register state Reorder Buffer Already used for branch prediction Register Checkpoint Memory state Augment write buffers Keep speculative writes in write buffer, until validated Allows merging of speculative writes

Misspeculation conditions Atomicity violation Detected by coherence protocol Sufficient for ROB register state For register checkpoint add speculative access bit to L1 cache Resource constraints Cache Size Write Buffer Size

Committing Speculative Memory State Commits must appear instantaneously Cache state and Cache data Can speculate on state Can’t speculate on data Speculative store Send GETX request Block already exclusive when speculative writes commit

Evaluation 3 Systems Run on SimpleMP Benchmarks CMP, SMP and DSM Total Store Order Single Register Checkpoint 32 entry lock predictor Run on SimpleMP Based on Simplescalar Benchmarks SPLASH mp3d, barnes, cholesky SPLASH-2 Radiosity, water-nsq, ocean Microbenchmark

Microbenchmark Results

Lock acquires/releases elided

SPLASH Performance

Reasons for Performance Gain Concurrent critical section execution Reduced observed memory latencies Locks can remain shared Reduced memory traffic No transfer of locks over the bus

Conclusions Remove unnecessary serialization Locks do not need to be acquired, but only observed Control dependence converted to data dependence No Programmer-based static analysis But Hardware-based dynamic analysis No coherence protocol changes Independent of consistency model Accesses are atomic in any model Permit programmers to used conservative synchronization

Questions Are programmer really off the hook? Will this work for OLTP? Are all critical sections short enough for SLE? What about the legacy library example? Will this work for OLTP? Do all MP papers come from Wisconsin?