Decomposing Hardware Lock Elision


Decomposing Hardware Lock Elision Stephan Diestelhorst, TU Dresden Christof Fetzer 26.04.2019

My View of Microprocessor and OS Complexity
Slide labels: OS, Compatibility, Arch, Uarch, Transactional Memory, } > 0 (!), Hardware Verification Cost
Notes: Short intro to the problem: processors are complex. I have been there; I cannot propose crazily complicated features.
OS: no changes, so careful architectural adaptation.
Uarch: small changes, so careful microarchitectural extensions.
Question: with all the cutting off, is there still interesting stuff left over? Does that leave room for innovation? => YES!
Image credits: AMD: extremetech.com, Intel: softpedia.com

Hardware Lock Elision Primer
Each thread runs:
    while ( !CAS(lock, FREE, TAKEN) )
    Critical Section
    lock := FREE
Notes: Introduce the HLE mechanisms quickly; recently proposed by Intel.
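The baseline that HLE starts from is an ordinary compare-and-swap spin lock. The following is a minimal sketch of that pattern in Python, not code from the talk: the `Word` class and `run_workers` are hypothetical names, and the internal `threading.Lock` only stands in for the atomicity that a hardware CAS instruction would provide.

```python
import threading

FREE, TAKEN = 0, 1

class Word:
    """A memory word with an emulated compare-and-swap (hypothetical helper)."""

    def __init__(self, value):
        self.value = value
        self._guard = threading.Lock()  # stands in for hardware atomicity

    def cas(self, expected, new):
        """Atomically set the word to `new` if it currently equals `expected`."""
        with self._guard:
            if self.value == expected:
                self.value = new
                return True
            return False

    def store(self, new):
        with self._guard:
            self.value = new

def run_workers(n_threads, n_iters):
    """Increment a shared counter under the spin lock; return (count, lock state)."""
    lock = Word(FREE)
    counter = [0]

    def worker():
        for _ in range(n_iters):
            while not lock.cas(FREE, TAKEN):  # while ( !CAS(lock, FREE, TAKEN) )
                pass
            counter[0] += 1                   # critical section
            lock.store(FREE)                  # lock := FREE

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0], lock.value
```

Every acquisition here writes the lock word, so threads serialize even when their critical sections touch unrelated data. That serialization is what lock elision tries to avoid.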

Hardware Lock Elision Primer
With the HLE prefixes, each thread runs:
    while ( ACQUIRE !CAS(lock, FREE, TAKEN) )
    Critical Section
    RELEASE lock := FREE

Hardware Lock Elision Primer
Each thread now runs:
    Acquire Magic
    Transaction
    Release Magic

Comparison: Transactions vs. Lock Elision (Abort Handling)
Transactions: TX.start(), Transaction, Abort Handler
Lock Elision: ACQUIRE, Transaction
Flexible contention management?
Notes:
* Lock elision does more than TM: special handling of the lock variable, hence more HW.
* Lock elision is less flexible than TM: no visible aborts, so the glibc / Linux kernel work currently uses TM.
* Result: people wanting flexibility have to work around the missing access to the advanced LE features.
=> I propose to expose the additional features one by one and make them available to programmers.
Current efforts in the Linux kernel and glibc to elide locks all use the transactional mode to work around the limitations of HLE.

Comparison: Transactions vs. Lock Elision (Lock Variable)
Transactions: TX.start(), lock := TAKEN, assert(lock == TAKEN);
Lock Elision: ACQUIRE, assert(lock == TAKEN);
Secret sauce of lock elision: special treatment of the lock variable, and prediction.
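The difference in how the lock variable is treated can be illustrated with a small software model. This is a sketch under assumptions, not the hardware design: `Region`, `hle_acquire`, and the read/write-set bookkeeping are hypothetical names chosen for illustration.

```python
FREE, TAKEN = "FREE", "TAKEN"

class Region:
    """Toy model of one thread's speculative view of memory (hypothetical)."""

    def __init__(self, memory):
        self.memory = memory      # globally visible state
        self.local = {}           # thread-private speculative values
        self.read_set = set()
        self.write_set = set()

    def load(self, addr):
        if addr in self.local:
            return self.local[addr]
        self.read_set.add(addr)
        return self.memory[addr]

    def store(self, addr, value):
        """Ordinary transactional store: the address joins the write set."""
        self.local[addr] = value
        self.write_set.add(addr)

    def hle_acquire(self, addr):
        """Elided acquire: read the lock globally, pretend locally it is TAKEN."""
        if self.load(addr) != FREE:
            raise RuntimeError("lock is held; elision is not possible")
        self.local[addr] = TAKEN  # private illusion, NOT added to the write set

memory = {"lock": FREE}

plain = Region(memory)               # TX.start() only
plain_view = plain.load("lock")      # code asserting lock == TAKEN would fail

writing = Region(memory)             # TX.start(); lock := TAKEN
writing.store("lock", TAKEN)         # assert holds, but lock is now in the write set

elided = Region(memory)              # ACQUIRE
elided.hle_acquire("lock")           # assert holds and lock is only in the read set
```

In this model, only the elided region satisfies `assert(lock == TAKEN)` while keeping the lock out of its write set. Two such regions therefore do not conflict on the lock word itself.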

Complications of Using Transactions for Lock Elision
Memcached: short transactions, low-overhead SW prediction [Transact 2010]
Hotspot JVM: advanced, multi-mode locks, assert(lock == TAKEN), TAKEN1 vs TAKEN2
Notes:
Memcached work: transparently replaces pthread mutex lock / unlock through LD_PRELOAD. The predictor has a significant impact on performance; prediction is semi-correctable.
Java lock elision (not yet published): the JVM rolls its own advanced locks. Transactions cannot acquire the lock for writing. Many code paths check whether the lock is held by the current thread; if not, some try to reacquire it, others fail on assert(lock == locked);. With multi-modal locks, where should the prediction stats be stored and updated?

Combining HLE Features and TM Flexibility
Combine low-overhead HW fast-path & flexible SW handler
No extra HW cost
Mechanisms: Prediction, Lazy Writes, Silent Store Chains
Notes: Not a HW cost: all the features are likely there for HLE already. Does not disrupt the incremental upgrade path: each of these has a trivial fall-back. Split out HLE's features and make them available to SW transactions separately.

Software-visible Generic Hardware Prediction
Diagram: CPU with Branch Predictor
Proposed instructions:
    branch_on_pred <target>, <id>
    pred_good <id>
    pred_bad <id>
Notes: Advantages: no additional memory traffic, can correlate with other branch events, no additional instructions, early in the pipeline and therefore no delay.
Image credit: Gaetan Lee - http://flickr.com/photos/43078695@N00/1931470865
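A software model can make the intended use of the three instructions concrete. The slide does not say how the predictor is organized. The sketch below assumes a table of 2-bit saturating counters indexed by `<id>`, and assumes that `branch_on_pred` branches to `<target>` when confidence is low. Both choices, and the `Predictor` class itself, are illustrative assumptions.

```python
class Predictor:
    """Table of saturating confidence counters, indexed by a software-chosen id.

    Assumption: n-bit saturating counters; the slide does not specify the scheme.
    """

    def __init__(self, bits=2):
        self.maximum = (1 << bits) - 1
        self.threshold = (self.maximum + 1) // 2
        self.counters = {}

    def _get(self, pred_id):
        # Unseen ids start at the threshold, i.e. weakly confident.
        return self.counters.get(pred_id, self.threshold)

    def branch_on_pred(self, pred_id):
        """Return True if the branch to <target> should be taken.

        Assumption: <target> is the fallback path, taken when confidence is low.
        """
        return self._get(pred_id) < self.threshold

    def pred_good(self, pred_id):
        """Software reports that the prediction worked out: raise confidence."""
        self.counters[pred_id] = min(self.maximum, self._get(pred_id) + 1)

    def pred_bad(self, pred_id):
        """Software reports a misprediction: lower confidence."""
        self.counters[pred_id] = max(0, self._get(pred_id) - 1)
```

In a hardware implementation this state would live in the branch predictor rather than in memory, which is where the advantages listed in the notes come from.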

Lazy Conflict Detection with Software Control
Diagram: concurrent transactions in which one executes foo := 1 and the others execute i := foo, shown once with a plain store (foo := 1) and once with the store marked lazy (LAZY foo := 1).
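The effect of marking a store as LAZY can be sketched with a small schedule simulator. This is a simplified model under stated assumptions, not the proposed hardware. It only models a writer conflicting with transactions that have already read the address, it assumes the writer wins every conflict, and `run` and the event tuples are hypothetical.

```python
def run(schedule, lazy):
    """Replay a schedule of transactional events; return the aborted transactions.

    Events are tuples: ("begin", tx), ("read", tx, addr),
    ("write", tx, addr), ("commit", tx).
    Eager mode checks for conflicts at the write; lazy mode defers the
    check until the writer commits.
    """
    reads, writes, aborted = {}, {}, set()

    def abort_readers(writer, addrs):
        for tx in list(reads):
            if tx != writer and reads[tx] & addrs:
                aborted.add(tx)
                del reads[tx], writes[tx]

    for event in schedule:
        op, tx = event[0], event[1]
        if tx in aborted:
            continue                      # an aborted transaction does nothing more
        if op == "begin":
            reads[tx], writes[tx] = set(), set()
        elif op == "read":
            reads[tx].add(event[2])
        elif op == "write":
            writes[tx].add(event[2])
            if not lazy:
                abort_readers(tx, {event[2]})   # eager: conflict visible now
        elif op == "commit":
            if lazy:
                abort_readers(tx, writes[tx])   # lazy: conflict visible at commit
            del reads[tx], writes[tx]
    return aborted

# R reads foo, W then writes foo, and R commits before W does.
schedule = [
    ("begin", "W"), ("begin", "R"),
    ("read", "R", "foo"),
    ("write", "W", "foo"),
    ("commit", "R"),
    ("commit", "W"),
]
```

With this schedule, eager detection aborts the reader at the time of the write. Lazy detection lets the reader commit first, so nothing is aborted.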

Globally Invisible Store Chains
Diagram: one transaction executes the chain foo := 1, foo := 2, foo := 3, shown once with plain stores and once with each store marked CYCLE (CYCLE foo := 1, CYCLE foo := 2, CYCLE foo := 3). A concurrent transaction executes if (foo != 3) TX.abort.
Notes: In a transaction, only the last store to a specific region will become visible. If the final store turns the value back to the one it had before the transaction, the global effects of these stores can be discarded.
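The rule in the notes can be written down as a small function. This is a sketch of the bookkeeping only. The function name is hypothetical, and the example assumes foo held 3 before the transaction, which the slide suggests but does not state.

```python
def globally_visible_writes(memory, stores):
    """Return the writes a committing transaction must actually publish.

    `memory` maps addresses to their values before the transaction.
    `stores` is the ordered list of (address, value) pairs the transaction made.
    """
    final = {}
    for addr, value in stores:
        final[addr] = value   # only the last store to each address can become visible
    # A chain that ends on the pre-transaction value has no global effect.
    return {addr: value for addr, value in final.items() if memory[addr] != value}

memory = {"foo": 3, "bar": 0}                       # assumed initial values
chain = [("foo", 1), ("foo", 2), ("foo", 3)]        # returns foo to 3
```

For the chain above nothing needs to be published, so a concurrent transaction that only checks `foo != 3` has no reason to conflict with it.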

Putting It All Together
ACQUIRE:
    branch_on_pred <sw_pred>, 17
    TX.start <abort_hnd>
    if (lock != FREE) jmp <abort_hnd>
    pred_good <id>
    LAZY CYCLE lock := TAKEN
RELEASE:
    CYCLE lock := FREE
    TX.commit
Notes: Show how the just-introduced primitives can be used to implement lock elision with flexible prediction logic.
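The ACQUIRE and RELEASE sequences can be traced in a single-threaded software model. Everything beyond the instruction sequence on the slide is an assumption made for illustration: the `ElidingLock` class, the 2-bit confidence counter, the placement of `pred_bad` in the abort handler, and the "busy" result used in place of spinning.

```python
FREE, TAKEN = "FREE", "TAKEN"

class ElidingLock:
    """Single-threaded toy model of the composed primitives (hypothetical)."""

    def __init__(self):
        self.memory = {"lock": FREE}   # globally visible state
        self.confidence = 2            # assumed 2-bit saturating counter
        self.local = None              # private writes while eliding, else None

    def acquire(self):
        """Return "elided", "locked", or "busy" (the caller would retry)."""
        if self.confidence < 2:                        # branch_on_pred <sw_pred>, 17
            return self._software_path()
        self.local = {}                                # TX.start <abort_hnd>
        if self.memory["lock"] != FREE:                # if (lock != FREE) jmp <abort_hnd>
            return self._abort_handler()
        self.confidence = min(3, self.confidence + 1)  # pred_good <id>
        self.local["lock"] = TAKEN                     # LAZY CYCLE lock := TAKEN
        return "elided"

    def lock_view(self):
        """Value of the lock as seen by the thread inside the region."""
        if self.local is not None and "lock" in self.local:
            return self.local["lock"]
        return self.memory["lock"]

    def release(self):
        if self.local is None:                         # lock was really taken
            self.memory["lock"] = FREE
            return
        self.local["lock"] = FREE                      # CYCLE lock := FREE
        for addr, value in self.local.items():         # TX.commit
            if self.memory.get(addr) != value:         # chains back to the old value vanish
                self.memory[addr] = value
        self.local = None

    def _abort_handler(self):
        self.local = None                              # discard speculative state
        self.confidence = max(0, self.confidence - 1)  # pred_bad (assumed placement)
        return self._software_path()

    def _software_path(self):
        if self.memory["lock"] != FREE:
            return "busy"
        self.memory["lock"] = TAKEN                    # real, globally visible acquire
        return "locked"
```

The model leaves out any policy for regaining confidence once the predictor has switched to the software path. Choosing that policy is the kind of contention management that a software-visible abort handler would make possible.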

Summary
HLE adds interesting HW capabilities. We propose to make these available to general (transactional) programming.

My Questions
Are decomposed lock elision primitives useful?
What are additional workloads?
Can we increase usability by small tweaks?
Notes: Which of these are useful to the (S)TM library, compiler, and SW developers? Can these features be effectively exposed to SW? What are other use cases for this, apart from emulating lock elision? I can already think of crazy use cases for the prediction feature. Are there architectural tweaks that would make their SW adoption easier?
Upcoming Things
Sneak Peek: Resurrecting Aborted Transactions without changing the OS-visible state or the microarchitecture
Notes: Transactional resurrection and Alert-On-Update (without OS changes, tiny HW adaptation).
stephan.diestelhorst@gmail.com