Decomposing Hardware Lock Elision Stephan Diestelhorst, TU Dresden Christof Fetzer 26.04.2019
My View of Microprocessor and OS Complexity
Constraints: OS Compatibility, Arch, Uarch, Hardware Verification Cost. Room left for Transactional Memory: > 0 (!)
Notes: Short intro to the problem: processors are complex. I have been there; I cannot propose crazily complicated features. OS: no changes -> careful architectural adaptation. Uarch: small changes -> careful microarchitectural extensions. Question: with all the cutting off, is there still interesting stuff left over? Does that leave room for innovation? => YES!
Image credits: AMD: extremetech.com; Intel: softpedia.com
Hardware Lock Elision Primer
Several threads contend for one lock; each runs:
  while ( !CAS(lock, FREE, TAKEN) ) ;
  Critical Section
  lock := FREE
Notes: quickly introduce the HLE mechanisms, recently proposed by Intel.
Hardware Lock Elision Primer
The same lock code with HLE prefixes; each thread runs:
  while ( ACQUIRE !CAS(lock, FREE, TAKEN) ) ;
  Critical Section
  RELEASE lock := FREE
Notes: quickly introduce the HLE mechanisms, recently proposed by Intel.
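The spin lock on the primer slides can be sketched as a small software model. This is only an illustration of the non-elided baseline: the `Word` class and `run_workers` helper are hypothetical names, the internal `threading.Lock` merely stands in for the atomicity that a hardware CAS provides, and the ACQUIRE / RELEASE hints themselves are not modelled.

```python
import threading

FREE, TAKEN = 0, 1

class Word:
    """One shared memory word with an atomic compare-and-swap (CAS)."""
    def __init__(self, value):
        self.value = value
        self._guard = threading.Lock()  # assumption: stands in for hardware atomicity

    def cas(self, expected, new):
        """Atomically set value to `new` if it equals `expected`; report success."""
        with self._guard:
            if self.value == expected:
                self.value = new
                return True
            return False

def run_workers(n_threads, n_iters):
    """Let n_threads increment a shared counter under the CAS spin lock."""
    lock = Word(FREE)
    counter = [0]

    def worker():
        for _ in range(n_iters):
            while not lock.cas(FREE, TAKEN):  # while ( !CAS(lock, FREE, TAKEN) )
                pass
            counter[0] += 1                   # critical section
            lock.value = FREE                 # lock := FREE

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0], lock.value
```

Every critical section here serializes on the lock word, even when the threads would not conflict on the data. That serialization is what elision tries to avoid.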
Hardware Lock Elision Primer
With elision, each thread effectively runs: Acquire Magic, Transaction, Release Magic.
Notes: quickly introduce the HLE mechanisms, recently proposed by Intel.
Comparison: Abort Handling
Transactions: TX.start(), Transaction, Abort Handler.
Lock Elision: ACQUIRE, Transaction.
Flexible contention management?
Notes:
* Lock elision does more than TM: special handling of the lock variable -> more HW.
* Lock elision is less flexible than TM: no visible aborts -> current efforts in the Linux kernel and glibc to elide locks all use the transactional mode to work around the limitations of HLE.
* Result: people wanting flexibility have to work around the missing access to the advanced LE features.
=> I propose to expose the additional features one by one and make them available to programmers.
Comparison: Lock Variable Secret Sauce
Transactions: TX.start(); lock := TAKEN; assert(lock == TAKEN);
Lock Elision: ACQUIRE lock := TAKEN; assert(lock == TAKEN);
What lock elision adds: special treatment of the lock variable; prediction.
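The special treatment of the lock variable can be pictured with a toy model: the eliding thread sees its own write of TAKEN, while the globally visible lock word stays FREE, so other threads can elide the same lock concurrently. The class and method names below are hypothetical, and real hardware tracks this in the cache or store buffer rather than in a dictionary.

```python
FREE, TAKEN = 0, 1

class ElidingThreadModel:
    """Toy model of one thread that elides a lock instead of writing it."""
    def __init__(self, shared):
        self.shared = shared      # globally visible memory: name -> value
        self.local_lock = None    # thread-private illusion of the lock value

    def acquire_elided(self):
        if self.shared["lock"] != FREE:
            return False          # lock is really held: elision is not possible
        self.local_lock = TAKEN   # the write stays local, it is not made global
        return True

    def read_lock(self):
        # The thread's own reads see the elided write, so a check such as
        # assert(lock == TAKEN) inside the critical section still holds.
        if self.local_lock is not None:
            return self.local_lock
        return self.shared["lock"]

    def release_elided(self):
        self.local_lock = None    # lock := FREE restores the original value
```

Two model threads can both call `acquire_elided()` on the same shared dictionary and both succeed, which is the point of keeping the write invisible.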
Complications of Using Transactions for Lock Elision
Memcached: short transactions, low overhead SW prediction [Transact 2010]
Hotspot JVM: advanced, multi-mode locks, assert(lock == TAKEN), TAKEN1 vs TAKEN2
Notes:
* Memcached work: transparently replace pthread mutex lock / unlock through LD_PRELOAD. The predictor has a significant impact on performance; semi-correctable prediction.
* Java lock elision (not yet published): the JVM rolls its own advanced locks. Transactions cannot acquire the lock for writing. Many code paths check whether the lock is held by the current thread; if not, some try to reacquire it, others check with assert(lock == locked).
* Multi-modal locks: where to put, update, etc. the prediction stats?
Combining HLE Features and TM Flexibility
Combine low-overhead HW fast-path & flexible SW handler
No extra HW cost
Mechanisms: Prediction, Lazy Writes, Silent Store Chains
Notes: Not a HW cost: all the features are likely there for HLE already. Does not disrupt the incremental upgrade path: each of these has a trivial fall-back. Split out HLE's features and make them available to SW transactions separately.
Software-visible Generic Hardware Prediction
Instructions that expose the CPU's branch predictor:
  branch_on_pred <target>, <id>
  pred_good <id>
  pred_bad <id>
Notes: advantages: no additional memory traffic; can correlate with other branch events; no additional instructions; early in the pipeline -> no delay.
Image credit: Gaetan Lee - http://flickr.com/photos/43078695@N00/1931470865
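One way to picture the three prediction instructions is as a table of counters that software trains and queries by id. The sketch below assumes 2-bit saturating counters, a common branch predictor building block; the slides do not say how the hardware keeps its state, nor which outcome makes branch_on_pred take the branch, so the `predicts_good` query and the initial counter value are assumptions.

```python
class PredictorModel:
    """Toy table of 2-bit saturating counters, one per prediction id."""
    def __init__(self, initial=2):
        self.initial = initial   # assumption: start weakly in the "good" half
        self.counters = {}       # prediction id -> counter in 0..3

    def predicts_good(self, pred_id):
        # What branch_on_pred <target>, <id> would consult before branching.
        return self.counters.get(pred_id, self.initial) >= 2

    def pred_good(self, pred_id):
        # Train towards "good", saturating at 3.
        current = self.counters.get(pred_id, self.initial)
        self.counters[pred_id] = min(3, current + 1)

    def pred_bad(self, pred_id):
        # Train towards "bad", saturating at 0.
        current = self.counters.get(pred_id, self.initial)
        self.counters[pred_id] = max(0, current - 1)
```

A lock elision library could, for example, call `pred_bad(17)` from its abort handler and `pred_good(17)` after a commit, then skip elision while `predicts_good(17)` is false.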
Lazy Conflict Detection with Software Control
Diagram: two concurrent transactions, one reading foo (i := foo) and one writing it (foo := 1), shown with a plain store (foo := 1) and with a store marked lazy (LAZY foo := 1).
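The difference between an eager and a lazy store can be sketched with a small conflict model. Everything here is an assumption made for illustration: the slide only shows the LAZY prefix, so the "writer wins" abort policy, the class name and the method names are hypothetical. In the model, a plain store aborts concurrent readers immediately, while a lazy store defers that check until the writer commits.

```python
class ConflictModel:
    """Toy conflict detector for transactions identified by name."""
    def __init__(self):
        self.readers = {}     # address -> set of running transactions that read it
        self.pending = {}     # transaction -> addresses written with LAZY
        self.aborted = set()

    def read(self, tx, addr):
        self.readers.setdefault(addr, set()).add(tx)

    def store(self, tx, addr, lazy=False):
        if lazy:
            self.pending.setdefault(tx, set()).add(addr)  # check deferred to commit
        else:
            self._abort_readers(tx, addr)                 # eager: check right away

    def _abort_readers(self, tx, addr):
        # Assumed policy: the writer wins, concurrent readers are aborted.
        for other in self.readers.get(addr, set()) - {tx}:
            self.aborted.add(other)

    def commit(self, tx):
        for readers in self.readers.values():
            readers.discard(tx)                           # tx stops running either way
        if tx in self.aborted:
            self.pending.pop(tx, None)
            return False
        for addr in self.pending.pop(tx, set()):
            self._abort_readers(tx, addr)                 # lazy stores conflict now
        return True
```

With a lazy store, a reader that commits before the writer does is never disturbed; with an eager store it is aborted as soon as the write happens.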
Globally Invisible Store Chains
Example store chain in one transaction:
  CYCLE foo := 1
  CYCLE foo := 2
  CYCLE foo := 3
A concurrent transaction checks: if (foo != 3) TX.abort
Notes: In a transaction, only the last store to a specific region will become visible. If the final store turns the value back to the one it had before the transaction, the global effects of these stores can be discarded.
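The rule in the notes can be sketched as a write buffer that, at commit, drops every location whose final value equals its value from before the transaction. This is a toy model under stated assumptions: `TxModel` and its methods are hypothetical names, memory is a plain dictionary, and the example assumes foo held 3 before the transaction started.

```python
class TxModel:
    """Toy transaction with a store buffer and silent store chain elimination."""
    def __init__(self, memory):
        self.memory = memory   # globally visible memory: name -> value
        self.before = {}       # value each written location had before the transaction
        self.buffer = {}       # speculative stores, last store per location wins

    def store(self, name, value):
        if name not in self.before:
            self.before[name] = self.memory[name]
        self.buffer[name] = value

    def load(self, name):
        # The transaction sees its own stores first.
        return self.buffer.get(name, self.memory[name])

    def commit(self):
        # Only chains whose final value differs from the old value become visible.
        visible = {n: v for n, v in self.buffer.items() if self.before[n] != v}
        self.memory.update(visible)
        self.before, self.buffer = {}, {}
        return visible
```

For the chain foo := 1, foo := 2, foo := 3 on memory where foo is already 3, `commit()` reports no visible write to foo, so a concurrent check of `foo != 3` would not need to be disturbed in this model.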
Putting It All Together
ACQUIRE:
  branch_on_pred <sw_pred>, 17
  TX.start <abort_hnd>
  if (lock != FREE) jmp <abort_hnd>
  LAZY CYCLE lock := TAKEN
RELEASE:
  CYCLE lock := FREE
  TX.commit
  pred_good <id>
Notes: show how the primitives just introduced can be used to implement lock elision with flexible prediction logic.
Summary
HLE adds interesting HW capabilities. We propose to make these available to general (transactional) programming.
My Questions
Are decomposed lock elision primitives useful? What are additional workloads? Can we increase usability by small tweaks?
Notes: What of this is useful to (S)TM library, compiler and SW developers? Can these features be effectively exposed to SW? What are other use cases (besides emulating lock elision)? I can already think of crazy use cases for the prediction feature. Are there architectural tweaks that would make SW adoption easier?
Upcoming Things
Sneak peek: resurrecting aborted transactions without changing the OS-visible state or the microarchitecture. Transactional resurrection and Alert-On-Update (without OS changes, tiny HW adaptation).
stephan.diestelhorst@gmail.com