Software Transactional Memory Should Not Be Obstruction-Free. Robert Ennals, Intel Research Cambridge, 15 JJ Thomson Avenue, Cambridge, CB3 0FD, UK. Presented by Ted Cooper for CS510 – Concurrent Systems (Spring 2014), Portland State University.

Grand Context (courtesy of Professor Walpole)
- Locking is slow and hard to get right. Clearly, non-blocking algorithms must be the answer!
- But non-blocking algorithms (harder to get right) might starve out threads. Thus, they should be wait-free.
- Wait-free algorithms must use "helping" to ensure all threads make progress, so they perform poorly, and are no simpler to reason about.
- Transactions look like lock-based and sequential programs, so maybe they're easier to reason about. Can we make them fast?
- But hardware transactional memory implementations have limits on transaction size and other problems, must coexist with locks in real systems, and don't seem to be faster than locks in practice.
- Can we at least get an STM that handles transactions of arbitrary size and length and performs reasonably?
- What properties do we really need in an STM? Does it need to be some flavor of non-blocking?

STM Context
- STM performance is not stellar compared to conventional locks.
- Processor speed is growing faster than memory bandwidth. Can we reduce memory accesses to improve STM performance?
- Do existing STM implementations maximize processor use? If not, can we improve processor use to improve performance?
- "Obstruction-freedom" has been borrowed by STM researchers from distributed systems (which have independent failure domains, so it's important that one node be able to continue progressing if another fails). Is this a useful property for STM? How does it affect performance?

Terminology
- Thread: programmer-level idea; a single parallelizable control flow. Think green threads or user-level threads. Transactions run on threads.
- Task: OS-level idea; one runs per available core. The runtime multiplexes threads onto tasks. Think OS threads.
- Non-blocking: at any given time, there is some thread whose progress is not blocked (e.g. by mutual exclusion).
- Obstruction-free: a property non-blocking algorithms can have. If all other threads are suspended (i.e. no contention), a thread can complete its operation in a finite number of its own steps. This may require retrying. It does not guarantee progress in the presence of conflicting operations, e.g. livelock is possible.
- Obstruction-freedom is the weakest additional "natural" property a non-blocking algorithm can have.

Livelock? Threads are doing work, but one's work prevents another from progressing. Just like deadlock, you can have 2-participant, 3-participant, or n-participant livelock. "A real-world example of livelock occurs when two people meet in a narrow corridor, and each tries to be polite by moving aside to let the other pass, but they end up swaying from side to side without making any progress because they both repeatedly move the same way at the same time." In this example, each person's "sway deterministically until there is no obstacle" algorithm is obstruction-free, since it can proceed if the other person holds still, but it is not guaranteed to make progress while the other person does the same thing.

Non-blocking algorithms: wait-free ⊂ lock-free ⊂ obstruction-free (nested, strongest to weakest)
- Wait-free: under contention, every thread makes progress, i.e. no starvation.
- Lock-free: under contention, some thread makes progress. If multiple threads try to operate on the same data, someone will win. A given thread may never win, so it could be starved, but the system as a whole will make progress, so no livelock.
- Obstruction-free: in isolation (all contenders suspended), a given thread makes progress. Under contention, this progress may not be useful, i.e. two threads could forever interfere and retry, livelocking.

Do we need obstruction-free STM?
- The STM common case: parallelizing existing sequential programs.
- Sequential programmers are used to blocking semantics, e.g. system calls(?)
- If we map tasks to cores 1-1, and run in-flight transactions to completion before scheduling new ones, it's unlikely that any thread will be suspended mid-transaction, and only suspended transactions can block other transactions.

There is no one thread use case to rule them all
- Threading for convenience: multiple threads track computations that proceed independently, e.g. compute and GUI threads. Blocking locks are fine here; we may need priority levels for locks to ensure low-priority threads don't block high-priority threads.
- Threading for performance: actual concurrent computation is possible. Blocking is fine in sequential code, so it is also fine in transactions.
- To STMify lock-based code, we can map lock-protected critical sections to transactions. This is no worse, since locks don't allow any concurrency in critical sections.

Why obstruction-freedom isn't as useful as it might seem

Obstruction-free misconception 1
- Misconception: obstruction-freedom prevents a long-running transaction from blocking others.
- Counterexample: a transaction t reads an object x, computes for a year, then writes to x. t can complete only if every other transaction that needs x blocks until t finishes. So either t blocks contending transactions or t never completes.
- Question: is it a problem for a transaction to block others of the same or lower priority?

Obstruction-free misconception 2
- Misconception: obstruction-freedom prevents the system from locking up if a thread t is switched out mid-transaction.
- Argument 1: the OS will always switch the task running t back in eventually (provided all tasks have the same OS scheduling priority), so you don't need obstruction-freedom to make progress as long as temporary interruptions are okay.
- Argument 2: the STM runtime can match the number of tasks to the number of available cores (dynamically). In this situation tasks (and the threads they run) will be switched out by the OS rarely, if ever.
- Argument 3: the STM runtime can start a new transaction on a given task only when that task's last transaction completes, i.e. the runtime never preempts an in-flight transaction. That is, we allow in-flight transactions to obstruct new ones :)

Obstruction-free misconception 3
- Misconception: obstruction-freedom prevents the system from locking up if a thread t fails, i.e. the system should continue to make progress as a whole if transactions fail silently.
- Argument 1: if it's a software failure, an equivalent lock-based or sequential program would also fail.
- Argument 2: if it's a hardware failure, then (a) node failures in distributed systems are common, while independent core failures in shared-memory multiprocessors that don't bork the whole system are exceedingly rare, and (b) again, a hardware failure would also break a lock-based or sequential program.

What does abandoning obstruction-freedom buy us?

Improved cache locality
- If object metadata lives in the same cache line as object data, only one memory access is needed to load a shared object.
- If the program is memory-bandwidth-limited, performance is directly proportional to the number of memory accesses.
- Any metadata we can't fit in the object data's cache line should live in memory that is private to a given transaction, so transactions don't fight over it and it stays in one cache.
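
As a concrete sketch of this layout (a minimal illustration assuming C11 and a GCC/Clang alignment attribute; the struct and field names are mine, not from the ltx source):

    /* Shared object with inline metadata: the handle word and the object
       data share one cache line, so a single memory access loads both. */
    #include <stdint.h>
    #include <stdatomic.h>

    #define CACHE_LINE 64

    struct shared_object {
        _Atomic uintptr_t handle;  /* version number, or write-lock pointer */
        uint8_t data[CACHE_LINE - sizeof(uintptr_t)];  /* payload, same line */
    } __attribute__((aligned(CACHE_LINE)));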

Improved cache locality cont'd
- What does this have to do with obstruction-freedom? No obstruction-free STM can store object metadata and data in the same cache line. They all require object data to sit behind a level of indirection to prevent the following situation: transaction t is writing to object x and is switched out; transaction s runs and needs x. What can s do?
- s could wait for t to finish with x, but that isn't obstruction-free.
- s could access x, but if t wakes up again it might overwrite x, invalidating s's transaction and leaving s in an undefined state.
- s could abort t, but we can't guarantee the abort has succeeded without an acknowledgement from t, and that isn't obstruction-free. Even if s could abort t, then t could restart and abort s, resulting in livelock.
- My question: could we avoid livelock with a total ordering of abort precedence, i.e. s can abort t but t can't abort s?
- This is the same reason we need pointers and copies in relativistic programming.

Optimal number of in-flight transactions
- Consider N in-flight transactions on N cores. A new transaction t tries to start before any of the N complete.
- While t exists but has not yet been scheduled to run, it can make no progress in isolation, so the system is not obstruction-free. So as soon as t exists, we have to switch out an in-flight transaction and share N cores among N+1 transactions.
- This introduces context-switching overhead, which was previously avoided and which wastes cycles. It also increases the number of concurrently running transactions, increasing the probability of conflicts among transactions.
- Why not just let each transaction complete without context-switching it out, and once it completes run the new transaction in its task? Then we'd always have N transactions running on N cores.

What does a non-obstruction-free STM that employs these optimizations look like, and how does it perform against existing obstruction-free STMs?

The Lightweight Transaction Library
- Ennals wrote a non-obstruction-free STM library (ltx) to test these ideas.
- In summary, it handily beats Fraser's STM and Fraser's C implementation of DSTM, both of which are obstruction-free.
- It is available at:

Memory Layout
- ltx designates a public memory region that all transactions can access, where shared objects (and only shared objects) live.
- It also allocates a private memory region to each transaction for the transaction state, which other transactions (usually) do not access.
- Each private region is allocated contiguously starting at an aligned address, once, and is reused by subsequent transactions that run on the same core, so it stays in that core's cache. This means cache misses on private memory are rare.
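
A minimal sketch of how such a region might be allocated (posix_memalign is my choice here; the region size, alignment, and function name are assumptions, not from the paper):

    /* One aligned private region per task, allocated once and reused by
       every transaction that runs on that task/core. */
    #include <stdlib.h>

    #define REGION_ALIGN (1 << 20)  /* alignment lets the high bits of any
                                       descriptor pointer find the region base */
    #define REGION_SIZE  (1 << 20)

    static void *alloc_private_region(void) {
        void *region = NULL;
        if (posix_memalign(&region, REGION_ALIGN, REGION_SIZE) != 0)
            return NULL;
        return region;
    }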

What lives in private memory?
- At the very beginning (i.e. the aligned base address), a descriptor for the transaction itself, from which its priority can be determined.
- Read and write descriptors, one for each shared object x the current transaction t has accessed.
- Read descriptors contain: x's version number as of the last time t read it, used to check whether t needs to restart because the data it read changed before t could commit; and a pointer to x, so t can read the data, check x's version, and check whether x has been locked for writing by another transaction.

What lives in private memory? cont'd
- Write descriptors contain: the object's version number as of the last time t read it, used to compute a new version number on a successful commit, or to roll x back to its previous version on abort; a pointer to x, so t knows where to write on commit or abort; and a copy of x's object data, where t stages changes to x before committing.
- Note that unlike in RP, where changes are made visible by replacing a public pointer to the old version with a public pointer to the new version, ltx copies this staged object data back to the public object data during commit, enforcing the public/private division. This is unavoidable, since object metadata and data are stored adjacently in the public region at a fixed location (to avoid the extra memory accesses imposed by indirection).
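
Putting the two descriptor shapes into code, a rough sketch (field names are mine; the paper specifies the contents but not the exact layout; shared_object is the inline-metadata struct sketched earlier):

    /* Read and write descriptors, kept in the transaction's private region. */
    #include <stdint.h>

    struct shared_object;  /* object with inline handle, sketched earlier */

    struct read_desc {
        uintptr_t version_seen;     /* x's version when t last read it */
        struct shared_object *obj;  /* pointer to x, for commit-time checks */
    };

    struct write_desc {
        uintptr_t version_seen;     /* basis for the new version on commit,
                                       or for rollback on abort */
        struct shared_object *obj;  /* where to copy the data on commit */
        uint8_t shadow[56];         /* private working copy of x's data
                                       (size assumed: cache line minus handle) */
    };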

Object handles
- Each public object has a handle (metadata) stored adjacent to the object data. The last bit of the handle signals whether a transaction is currently writing to the object x:
- If 1, no transaction is currently writing, and the rest of the handle represents x's current version number.
- If 0, a transaction t is currently writing to x, and the rest of the handle is a pointer to t's write descriptor (more on this later) for x. Some fixed number of higher-order bits in this pointer can also be used to locate t's transaction descriptor, since private regions are allocated in aligned contiguous blocks.
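
The bit-twiddling this implies might look like the following (helper names are mine; only the encoding itself comes from the paper):

    /* Handle encoding: low bit 1 => unlocked, remaining bits are the version;
       low bit 0 => locked, the word is a pointer to the writer's descriptor
       (descriptors are aligned, so real pointers always have a 0 low bit). */
    #include <stdint.h>
    #include <stdbool.h>

    struct write_desc;

    static inline bool handle_is_unlocked(uintptr_t h) { return (h & 1) != 0; }
    static inline uintptr_t handle_version(uintptr_t h) { return h >> 1; }
    static inline uintptr_t make_version_handle(uintptr_t v) { return (v << 1) | 1; }
    static inline struct write_desc *handle_writer(uintptr_t h) {
        return (struct write_desc *)h;  /* valid only when the low bit is 0 */
    }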

Is this figure correct? [Figure from the paper omitted: an object handle pointing to a write descriptor, with a "Version Seen" field and "0 or more" descriptors per transaction.] How could "Version Seen" be a pointer?

Writes
- Managed using revocable two-phase locking: a transaction locks every object to which it needs to write, but keeps enough information around to release the lock and restore the object to its previous state on abort. If two transactions deadlock on their write sets, one aborts.
- My question: how does deadlock detection work in this case? Does a transaction s that needs an object x locked by t use x's handle to find t's write descriptors and some record of the set of objects t intends ultimately to lock, compare that to its own write descriptors and pending locks, look for a cycle, and abort if it finds one?

Writes cont'd
- How does t lock x for writing? t reads x's handle. If it ends in a 1, then the rest is x's version number; t stores that and a pointer to x in a write descriptor d, then uses a compare-and-swap or other atomic operation to replace x's handle with a pointer to d (with a trailing 0). If the atomic operation succeeds, t has locked x. Otherwise some other transaction has concurrently updated x, and t must retry.
- If t successfully locks x, it makes a copy of x's object data in the write descriptor.
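
As a sketch of that sequence with C11 atomics (building on the handle helpers and descriptor types sketched above; the function name and error handling are mine):

    /* Revocable write lock: CAS the version handle for a descriptor pointer. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <string.h>

    static bool try_lock_for_write(struct shared_object *x, struct write_desc *d) {
        uintptr_t h = atomic_load(&x->handle);
        if (!handle_is_unlocked(h))
            return false;                 /* already locked by someone else */
        d->version_seen = handle_version(h);
        d->obj = x;
        /* The low bit of d is 0, so the new handle reads as "locked by d". */
        if (!atomic_compare_exchange_strong(&x->handle, &h, (uintptr_t)d))
            return false;                 /* concurrent update: caller retries */
        memcpy(d->shadow, x->data, sizeof d->shadow);  /* stage a private copy */
        return true;
    }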

Writes cont'd
- What if x is already locked by another transaction s? t waits (busy-waits?) for a bounded number of cycles for x to become available. If this time expires and x is still locked, t gets s's transaction descriptor (available via the pointer in the locked handle), checks whether s is of the same or lower priority, and if so requests that s abort itself.
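
A sketch of that policy (the spin bound, txn layout, and the txn_of and request_abort helpers are all my assumptions; the paper only says the wait is bounded and that abort requests go to same-or-lower-priority holders):

    /* Bounded wait on a locked object, then ask the holder to abort. */
    #include <stdatomic.h>
    #include <stdbool.h>

    #define MAX_SPINS 1000  /* assumed bound */

    struct txn { int priority; /* ... */ };

    struct txn *txn_of(struct write_desc *d);  /* assumed helper: recovers the
                                                  transaction descriptor via the
                                                  high bits of the aligned region */
    void request_abort(struct txn *victim);    /* assumed helper */

    static bool wait_for_unlock(struct shared_object *x, struct txn *self) {
        for (int i = 0; i < MAX_SPINS; i++)
            if (handle_is_unlocked(atomic_load(&x->handle)))
                return true;
        struct txn *holder = txn_of(handle_writer(atomic_load(&x->handle)));
        if (holder->priority <= self->priority)
            request_abort(holder);  /* same or lower priority: ask it to abort */
        return false;
    }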

Reads
- Managed using optimistic concurrency control: t reads x's handle. If x is not locked, t logs the version number from the handle in a read descriptor for x, along with a pointer to x. If x is locked, t waits in the same fashion as for writes.
- When t attempts to commit, it compares its logged copy of x's version number to the current value in x's handle, and the commit fails if they differ.
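
A sketch of that read path (the function name is mine; it assumes the descriptor types and handle helpers sketched earlier):

    /* Optimistic read: log version + pointer; validation happens at commit. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <string.h>

    static bool txn_read(struct shared_object *x, struct read_desc *r,
                         void *out, size_t n) {
        uintptr_t h = atomic_load(&x->handle);
        if (!handle_is_unlocked(h))
            return false;            /* locked: wait as in the write path */
        r->version_seen = handle_version(h);
        r->obj = x;
        memcpy(out, x->data, n);     /* copy the current data; the commit-time
                                        check catches any concurrent writer */
        return true;
    }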

Commits
- When t is ready to commit, it first checks whether it is still valid: if no other transaction has written to an object in t's read set (i.e. the version numbers in the read descriptors still match the handles), t is valid.
- If t is valid, it can commit. t must have locked all the objects in its write set, so we don't need to check those to determine validity.
- For each write descriptor d for an object x, t simply copies the updated object data in d (private memory) to the corresponding object data in public memory, then overwrites the lock in x's handle with an incremented version number for x, releasing the lock and publishing the new version of x in one fell swoop.
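
In code, the commit just described might look like this sketch (it ignores objects that appear in both the read and write sets, which a real implementation must treat as locked-by-self; array-based bookkeeping is my simplification):

    /* Commit: validate the read set, then publish and unlock the write set. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <string.h>

    static bool txn_commit(struct read_desc *reads, size_t nr,
                           struct write_desc *writes, size_t nw) {
        for (size_t i = 0; i < nr; i++) {   /* validate the read set */
            uintptr_t h = atomic_load(&reads[i].obj->handle);
            if (!handle_is_unlocked(h) ||
                handle_version(h) != reads[i].version_seen)
                return false;               /* x changed under us: abort */
        }
        for (size_t i = 0; i < nw; i++) {   /* publish the write set */
            struct write_desc *d = &writes[i];
            memcpy(d->obj->data, d->shadow, sizeof d->shadow);
            /* One store releases the lock and publishes the new version. */
            atomic_store(&d->obj->handle,
                         make_version_handle(d->version_seen + 1));
        }
        return true;
    }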

Commits cont'd
- What if t isn't valid? t may have read inconsistent data and gone into a weird state, e.g. an infinite loop or a segfault from reading an out-of-date or corrupted array index.
- Because we can't predict the behavior caused by inconsistent data, t may not retry properly, so the runtime has to periodically abort outstanding invalid transactions.

Performance Evaluation
- Benchmarks run on Fraser's testbed to ensure the comparison to Fraser's STM and C DSTM is fair.
- SunFire 15K server with 106 UltraSPARC III processors at 1.2 GHz.
- Benchmarks: red-black tree and skip list, both reading and writing random set elements; 75% reads, 25% writes.

(Graph: CPU time per operation in microseconds; lower is better.)
- Key space is varied to compare performance under contention.
- ltx takes 50-60% of Fraser's time and 35% of C DSTM's time.
- It probably wins because of the cache-locality optimization (fewer total memory accesses): ltx incurs 48% of Fraser's L2 misses, 58% of its L1 misses, and 22% of its TLB misses.

(Graph: CPU time per operation in microseconds; lower is better.)
- Key space is varied from 16 to 2^19 to compare performance under contention; the number of processors is fixed at 90.
- Under high contention (the left region of each graph), ltx takes ~20% of Fraser's time; C DSTM barely runs.
- Fraser's transactions help blockers, so it performs poorly for the same reason wait-free algorithms do.

(Graph: CPU time per operation in microseconds; lower is better.)
- Run on a 4-way SPARC machine, varying the number of tasks to measure the effect of OS context-switching.
- Unsurprisingly, performance degrades as the rate of context-switching increases.
- ltx is more affected by context-switching than Fraser, since switched-out transactions can block others in ltx, but ltx is still faster.
- Under a normal ltx deployment, the number of tasks is always upper-bounded by the available cores, so context-switching rarely occurs.

Conclusions
- Obstruction-freedom is not necessary for STM.
- Two non-obstruction-free STM optimizations, maximizing cache locality and minimizing context-switching, are demonstrated in an implementation that outperforms existing best-in-class obstruction-free STM implementations.
- Therefore, Ennals believes STM designers should abandon obstruction-freedom.
- But wait: ltx writers use locks. Weren't we trying to get away from locks?