Concurrent Programming Without Locks Keir Fraser & Tim Harris.

Presentation transcript:

Concurrent Programming Without Locks Keir Fraser & Tim Harris

Motivation
Locking introduces dependencies among threads.
Non-blocking solutions keep threads independent, but
  – they are complicated to program
  – or depend on unrealistic instructions (CAS2)
We need a practical and general non-blocking solution.

Solutions?
Only use data structures that can be implemented with CAS?
  – Limiting
Build MCAS in software using CAS
Build Transactional Memory in software using CAS

Goals
Concreteness
Linearizability
Non-blocking progress guarantee
Disjoint-access parallelism
Read parallelism
Dynamicity
Practicable space costs
Composability

Definitions
Obstruction-freedom: a thread makes progress as long as it does not contend with other threads for access to any location.
Lock-freedom: the system as a whole makes progress.
Wait-freedom: every thread makes progress.
The focus is on lock-free design.
Whole transactions are lock-free, not just the subcomponents.

The Basic Problem
How can we conditionally update multiple locations atomically using “real” instructions that can only update a single location atomically?
The trick
  – Introduce a level of indirection
  – Use descriptors to access values indirectly

How do we use indirection?
Memory locations involved in a transaction or MCAS have
  – old and new uncommitted values stored in a descriptor
  – a status field that determines which value to use
  – we must be careful how status is updated!
Memory locations not involved in a transaction can hold their value directly
  – requires tidying up after transactions commit

[Figure: Indirect Memory Access. A memory location refers to a descriptor holding Address, Old Value, New Value, and Status.]

Direct or Indirect?
How do we know if the value in a location should be used directly or indirectly?
  – we can reserve some low-order bits
  – interpret them on each access
  – but this limits the use of the approach to aligned pointers
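The low-order-bit trick can be sketched in C as follows. This is a minimal illustration, not the authors' code: the tag values and the helper names (`tag_ptr`, `untag_ptr`, `is_ccas_desc`, `is_mcas_desc`) are assumptions chosen for this example, and it relies on descriptors being at least 4-byte aligned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tag values; the slides only say "some low-order bits". */
enum { TAG_VALUE = 0, TAG_CCAS = 1, TAG_MCAS = 2, TAG_MASK = 3 };

/* Mark an aligned descriptor pointer by setting a tag in its two low bits. */
static inline uintptr_t tag_ptr(void *p, unsigned tag) {
    assert(((uintptr_t)p & TAG_MASK) == 0);   /* needs 4-byte alignment */
    return (uintptr_t)p | tag;
}

/* Recover the original pointer by clearing the tag bits. */
static inline void *untag_ptr(uintptr_t w) {
    return (void *)(w & ~(uintptr_t)TAG_MASK);
}

/* Interpret the tag on each access. */
static inline bool is_ccas_desc(uintptr_t w) { return (w & TAG_MASK) == TAG_CCAS; }
static inline bool is_mcas_desc(uintptr_t w) { return (w & TAG_MASK) == TAG_MCAS; }
```

Because the tag occupies bits that are always zero in an aligned pointer, ordinary values and descriptor references can share one word, at the cost of excluding unaligned or arbitrary integer values.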

Using Descriptors in TM
The commit operation atomically updates the status field
  – we have to do it with CAS to avoid races
Once a descriptor is made visible, only the status field changes
  – Why?
  – How?
Once a transaction’s outcome is decided, the status value doesn’t change
  – Retries use a new descriptor … why?
Descriptors are managed via garbage collection

Other requirements
Descriptors must be able to own locations
  – one transaction must not unlink another
  – why?
  – so what should be done on a conflict, wait?
But doesn’t this introduce blocking?
  – not necessarily: contending threads could help the owner complete

Uncontended Commits
To be obstruction-free, uncontended commits must succeed
The phases:
  – Prepare the transaction descriptor (use CCAS for each location accessed) to atomically link locations while the outcome is undecided
  – Decide the transaction’s outcome and update the status field (using CAS)
  – Update memory (using CAS) and mark the descriptor for collection

Contended Commits
Contended commits must make progress
  – If status is decided, but not complete
      Help the other thread complete
  – If status is undecided, either
      Abort contending transactions
        – needs contention management to prevent livelock
      Help contending transactions
        – need some way to ensure success of at least one transaction
  – Read-check, used in WSTM or OSTM to ensure read set is still current:
      Abort at least one contender
      Help, and ensure progress by ordering transactions

Three STM Implementations
MCAS: Multiple Compare And Swap
WSTM: Word Software Transactional Memory
OSTM: Object Software Transactional Memory

MCAS
    CAS(word *address,            // actual value
        word expected_value,
        word new_value);

    MCAS(int count,
         word *address[],         // actual values
         word expected_value[],
         word new_value[]);
… but an extra indirection is added because pointers must indirect through the descriptor
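The code on the later slides treats CAS as returning the value it found at the address, so a caller detects success by comparing the result with the expected value. A minimal sketch of that convention on top of C11 atomics is shown here; the name `cas_word` and the `word` typedef are assumptions for illustration only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef uintptr_t word;

/* CAS in the slides' style: returns the value observed at *addr.
 * The swap took place if and only if the result equals `expected`. */
static inline word cas_word(_Atomic word *addr, word expected, word new_value) {
    /* On failure, atomic_compare_exchange_strong stores the observed value
     * into `expected`; on success `expected` is left unchanged. Either way
     * `expected` ends up holding what was in *addr before the call. */
    atomic_compare_exchange_strong(addr, &expected, new_value);
    return expected;
}
```

Returning the observed value rather than a boolean matters for the descriptor-based algorithms: when the CAS fails, the caller needs to inspect what it found to decide whether to help another operation or give up.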

MCAS
Operates only on aligned pointers
  – enables use of 2 low-order bits to distinguish values from descriptors
Descriptors contain
  – status {Success, Failure, Undecided}
  – N
  – address[ ]
  – expected[ ]
  – new_value[ ]

[Figure: Data Access Examples. Two cases of a location referring to a descriptor (Address, Old Value, New Value): one with Status: SUCCESS and one with Status: UNDECIDED.]

The Prepare Phase
Create MCAS descriptor
Insert descriptor address in each location
  – don’t overwrite other concurrent attempts
  – don’t keep working if another thread has already helped you succeed or fail
      use CAS conditional on undecided status (CCAS)
  – MCAS descriptor must not become visible until it is fully initialized
      link CCAS descriptors in each location first
      then swap for MCAS descriptor using CCAS

CCAS
Conditional CAS built from CAS
  – takes effect only if condition == undecided
  – used to insert descriptor references in two phases
    CCAS(word *address, word expected_value, word new_value, word *condition);
Returns the original value of *address

CCAS
Create a new private CCAS descriptor
Copy CCAS parameter values into it
Try to link it into the target location (using CAS)
On failure, try to help whoever succeeded by using their CCAS descriptor
  – again using CAS
  – then retry your own

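    word *CCAS(word **a, word *e, word *n, word *cond) {
        ccas_descriptor *d = new ccas_descriptor();
        word *v;
        // copy the parameters into the private descriptor
        d->a = a; d->e = e; d->n = n; d->cond = cond;
        // try to link the descriptor into the target location
        while ( (v = CAS(d->a, d->e, d)) != d->e ) {
            if ( IsCCASDesc(v) )
                CCASHelp( (ccas_descriptor *)v );   // help the CCAS that beat us, then retry
            else
                return v;                           // location holds an unexpected value
        }
        CCASHelp(d);   // complete our own CCAS
        return v;
    }

    void CCASHelp(ccas_descriptor *d) {
        // install the new value only while the condition is still UNDECIDED
        bool success = (*d->cond == UNDECIDED);
        CAS(d->a, d, success ? d->n : d->e);
    }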

Cost in terms of CAS
CCAS takes at least 2 CAS to link the MCAS descriptor into each location
  – 2N CAS for N locations
But we still have not committed the MCAS
  – at least 1 CAS required to set MCAS status
  – at least N CAS required to update the memory locations with the new values from the MCAS descriptor

Reading
We can’t simply read values anymore!
CCASRead must be used for reading
It must be able to read values directly and indirectly through CCAS descriptors
  – detect which situation it is in
  – function correctly in the presence of concurrent updates

CCASRead
Copy the contents of the address to be read into a local
Test to see if it’s a value or a descriptor
If it’s a descriptor, help the thread whose descriptor it is to complete
  – requires more CAS

    word *CCASRead(word **a) {
        word *v = *a;
        while ( IsCCASDesc(v) ) {
            CCASHelp( (ccas_descriptor *)v );   // finish the pending CCAS
            v = *a;                             // then re-read the location
        }
        return v;
    }

    void CCASHelp(ccas_descriptor *d) {
        bool success = (*d->cond == UNDECIDED);
        CAS(d->a, d, success ? d->n : d->e);
    }

Reading
We also need an MCASRead to read locations subject to MCAS
MCASRead uses CCASRead to read the contents of the location
  – if it is an MCAS descriptor, it must find the address in the descriptor and determine whether to use the old or new value
  – this requires more CCAS

Putting it all together Example MCAS (3, {a,b,c}, {1,2,3}, {4,5,6})

[Figure sequence (9 frames): MCAS(3, {a,c,b}, {1,3,2}, {4,6,5}) applied to locations a, b, c holding 1, 2, 3. The frames show the MCAS descriptor (entries a: 1 -> 4, b: 2 -> 5, c: 3 -> 6; status UNDECIDED), a CCAS descriptor (a, 1, &MCAS_Descr, &mcas->status) used to link the MCAS descriptor into the locations, the status changing to SUCCESS, and a final frame with locations a, b, c and no descriptor.]

    bool MCAS(int N, word **a[], word *e[], word *n[]) {
        mcas_descriptor *d = new mcas_descriptor();
        d->N = N;
        d->status = UNDECIDED;
        for (int i = 0; i < N; i++) {
            d->a[i] = a[i];
            d->e[i] = e[i];
            d->n[i] = n[i];
        }
        address_sort(d);
        return mcas_help(d);
    }
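The slides call `address_sort(d)` without showing it. One plausible way to impose a single global order on the locations is to sort the (address, expected, new) triples by address, as sketched here. The `mcas_entry` layout and the name `address_sort_entries` are assumptions made for this example; the slides store the three arrays separately.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uintptr_t word;

/* Hypothetical layout: one triple per location, so the fields move together. */
typedef struct { word *addr; word expected, new_value; } mcas_entry;

/* qsort comparator: ascending by address, without overflow-prone subtraction. */
static int by_address(const void *x, const void *y) {
    uintptr_t a = (uintptr_t)((const mcas_entry *)x)->addr;
    uintptr_t b = (uintptr_t)((const mcas_entry *)y)->addr;
    return (a > b) - (a < b);
}

static void address_sort_entries(mcas_entry *entries, size_t n) {
    qsort(entries, n, sizeof entries[0], by_address);
}
```

With every MCAS acquiring its locations in the same order, two conflicting operations cannot each hold a location the other needs first, which is what lets helping terminate.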

    bool mcas_help(mcas_descriptor *d) {
        word *v, desired = FAILED;
        bool success;

        // PHASE 1: acquire
        for (int i = 0; i < d->N; i++) {
            while (TRUE) {
                v = CCAS(d->a[i], d->e[i], d, &d->status);
                if (v == d->e[i] || v == d) break;      // acquired, or already ours
                if ( IsMCASDesc(v) )
                    mcas_help( (mcas_descriptor *)v );  // help the conflicting MCAS, then retry
                else
                    goto decision_point;                // unexpected value: fail
            }
        }
        desired = SUCCESS;

        // PHASE 2: read (not used by MCAS)
    decision_point:
        CAS(&d->status, UNDECIDED, desired);

        // PHASE 3: clean up
        success = (d->status == SUCCESS);
        for (int i = 0; i < d->N; i++) {
            CAS(d->a[i], d, success ? d->n[i] : d->e[i]);
        }
        return success;
    }
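For readers who want to experiment, the following is a compact C11 sketch of CCAS and MCAS that follows the structure of the slides. It is an illustration under several assumptions, not the authors' implementation: descriptors are never reclaimed (the slides rely on garbage collection), the address sort is left to the caller, the tag values, `MAX_N`, and all lower-case names are invented here, and stored values must keep their two low-order bits clear.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef uintptr_t word;
enum { UNDECIDED, SUCCESS, FAILED };
enum { TAG_CCAS = 1, TAG_MCAS = 2, TAG_MASK = 3 };   /* assumed tag encoding */
#define MAX_N 8                                      /* assumed capacity */

typedef struct { int N; _Atomic word status;
                 _Atomic word *a[MAX_N]; word e[MAX_N], n[MAX_N]; } mcas_desc;
typedef struct { _Atomic word *a; word e, n; _Atomic word *cond; } ccas_desc;

/* CAS that returns the value observed at *a. */
static word cas(_Atomic word *a, word e, word n) {
    atomic_compare_exchange_strong(a, &e, n);
    return e;
}

/* Replace a linked CCAS descriptor by n if *cond is UNDECIDED, else by e. */
static void ccas_help(ccas_desc *d) {
    bool ok = atomic_load(d->cond) == UNDECIDED;
    cas(d->a, (word)d | TAG_CCAS, ok ? d->n : d->e);
}

static word ccas(_Atomic word *a, word e, word n, _Atomic word *cond) {
    ccas_desc *d = malloc(sizeof *d);
    *d = (ccas_desc){ a, e, n, cond };
    word v;
    while ((v = cas(a, e, (word)d | TAG_CCAS)) != e) {
        if ((v & TAG_MASK) != TAG_CCAS) { free(d); return v; }  /* d never published */
        ccas_help((ccas_desc *)(v & ~(word)TAG_MASK));          /* help, then retry */
    }
    ccas_help(d);   /* d was published and is deliberately leaked: no reclamation */
    return v;
}

static bool mcas_help(mcas_desc *d) {
    word self = (word)d | TAG_MCAS, desired = FAILED;
    for (int i = 0; i < d->N; i++) {                 /* phase 1: acquire */
        for (;;) {
            word v = ccas(d->a[i], d->e[i], self, &d->status);
            if (v == d->e[i] || v == self) break;
            if ((v & TAG_MASK) != TAG_MCAS) goto decision_point;
            mcas_help((mcas_desc *)(v & ~(word)TAG_MASK));
        }
    }
    desired = SUCCESS;
decision_point:
    cas(&d->status, UNDECIDED, desired);             /* decide exactly once */
    bool success = atomic_load(&d->status) == SUCCESS;
    for (int i = 0; i < d->N; i++)                   /* phase 3: clean up */
        cas(d->a[i], self, success ? d->n[i] : d->e[i]);
    return success;
}

/* Caller is assumed to pass addresses already sorted in a global order. */
static bool mcas(int N, _Atomic word *a[], const word e[], const word n[]) {
    assert(N <= MAX_N);
    mcas_desc *d = malloc(sizeof *d);                /* leaked, as above */
    d->N = N;
    atomic_init(&d->status, UNDECIDED);
    for (int i = 0; i < N; i++) { d->a[i] = a[i]; d->e[i] = e[i]; d->n[i] = n[i]; }
    return mcas_help(d);
}
```

A single-threaded call such as `mcas(3, addrs, expected, desired)` succeeds only if every location holds its expected value, and otherwise leaves all locations unchanged. Note that the small integers 1 to 6 used in the slides' example would collide with the tag bits in this encoding, so a test should use multiples of 4 or real aligned pointers.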

    word *MCASRead(word **addr) {
        word *v;
    retry_read:
        v = CCASRead(addr);
        if ( !IsMCASDesc(v) ) return v;
        mcas_descriptor *d = (mcas_descriptor *)v;
        for (int i = 0; i < d->N; i++) {
            if (d->a[i] == addr) {
                if (d->status == SUCCESS) {
                    if (CCASRead(addr) == v) return d->n[i];
                    else goto retry_read;
                } else {   // FAILED or UNDECIDED
                    if (CCASRead(addr) == v) return d->e[i];
                    else goto retry_read;
                }
            }
        }
        return v;
    }

[Figure: Conflicts. Two descriptors (Address, Old Value, New Value), both with Status: UNDECIDED.]

[Figure: Conflicts. One descriptor (Address, Old Value, New Value) with Status: UNDECIDED.]

    bool mcas_help(mcas_descriptor *d) {
        word *v, desired = FAILED;
        bool success;

        // Phase 1: acquire
        for (int i = 0; i < d->N; i++) {
            while (TRUE) {
                v = CCAS(d->a[i], d->e[i], d, &d->status);
                if (v == d->e[i] || v == d) break;
                if ( !IsMCASDesc(v) ) goto decision_point;
                mcas_help( (mcas_descriptor *)v );   // help the conflicting MCAS, then retry
            }
        }
        desired = SUCCESS;
    decision_point:

mcas_help continued
        // PHASE 2: read (not used by MCAS)
    decision_point:
        CAS(&d->status, UNDECIDED, desired);

        // PHASE 3: clean up
        success = (d->status == SUCCESS);
        for (int i = 0; i < d->N; i++) {
            CAS(d->a[i], d, success ? d->n[i] : d->e[i]);
        }
        return success;
    }

[Figure: Conflicts. Two descriptors (Address, Old Value, New Value): one with Status: SUCCESS, the other with Status: UNDECIDED.]

Failure Modes
Can fail during any of the CAS attempts
  – CCAS
  – CCASHelp

CCAS “failure modes”
Someone helped us with the CCAS
  – call CCASHelp with our own descriptor
  – next time around, return MCAS descriptor
  – MCAS continues
Someone else beat us to CCAS
  – help them with their CCAS
  – next time around, return their MCAS descriptor
  – help with their MCAS
  – our MCAS likely aborts
Source value changed
  – return new value
  – MCAS aborts

    word *CCAS(word **a, word *e, word *n, word *cond) {
        ccas_descriptor *d = new ccas_descriptor();
        word *v;
        d->a = a; d->e = e; d->n = n; d->cond = cond;
        while ( (v = CAS(d->a, d->e, d)) != d->e ) {
            if ( !IsCCASDesc(v) ) return v;      // source value changed
            CCASHelp( (ccas_descriptor *)v );    // someone else beat us: help them
        }
        CCASHelp(d);
        return v;
    }

    void CCASHelp(ccas_descriptor *d) {
        bool success = (*d->cond == UNDECIDED);
        CAS(d->a, d, success ? d->n : d->e);
    }

CCASHelp “failure modes”
MCAS aborted so status isn’t UNDECIDED
  – old value put back in place
MCAS aborted, CCASHelp doesn’t restore value
  – MCAS cleanup will put old value back in place
Race: status switches to SUCCESS between check and CAS
  – CAS will fail because CCAS descriptor already removed
  – CCAS return will not cause MCAS failure
Race: status switches to FAILURE between check and CAS
  – CAS will always fail because for MCAS to fail, someone must have read beyond us

Cost
Minimum of 3N + 1 CAS instructions for N locations
  – many more CAS under heavy contention!
With no contention the three batches of N CAS all act on the same N locations
“[improvements] may be useful if there are systems in which CAS operates substantially more slowly than an ordinary write.”
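The 3N + 1 minimum is simply the sum of the per-phase counts given on the earlier "Cost in terms of CAS" slide: two CAS per location to link the descriptor via CCAS, one CAS to decide the status, and one CAS per location to write back the final values. As a worked instance, the three-location example used in these slides needs at least ten CAS instructions:

```latex
\[
\underbrace{2N}_{\text{CCAS links}}
\;+\; \underbrace{1}_{\text{status CAS}}
\;+\; \underbrace{N}_{\text{memory updates}}
\;=\; 3N + 1,
\qquad
N = 3 \;\Rightarrow\; 3 \cdot 3 + 1 = 10 .
\]
```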

Deep Breath

WSTM
Removes the requirement for space reserved in values being updated
  – hash addr to find ownership record
Caller need not keep track of locations
  – read and write sets stored in transaction descriptor
Provides read parallelism
Obstruction-free, not lock-free or wait-free

[Figure: Data Structures. Shows the orecs (one holding version 52) and a transaction descriptor with Status: Undecided and entries a1: (100,15) -> (200,16) and a2: (200,52) -> (100,53).]

Logical contents
Orec contains a version number:
  – value comes direct from memory
Orec contains a descriptor reference
  – descriptor contains address
      value comes from descriptor based on status
  – descriptor does not contain address
      value comes direct from memory
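The three cases above can be written as one lookup function. The sketch below is single-threaded and purely illustrative: the struct layouts, the `owned` flag, and the name `logical_contents` are assumptions for this example, and a real WSTM would read the orec and status atomically.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t word;
enum status { UNDECIDED, SUCCESS, FAILED };

/* Hypothetical layouts, reduced to what the lookup needs. */
typedef struct { word *addr; word old_value, new_value; } wstm_entry;
typedef struct { enum status status; int n; wstm_entry entries[4]; } wstm_desc;
typedef struct { bool owned; unsigned version; wstm_desc *owner; } orec;

/* Return the logical contents of *addr given the orec that covers it. */
static word logical_contents(const orec *o, word *addr) {
    if (!o->owned)
        return *addr;                      /* orec holds a version number */
    for (int i = 0; i < o->owner->n; i++) {
        const wstm_entry *e = &o->owner->entries[i];
        if (e->addr == addr)               /* descriptor contains the address */
            return o->owner->status == SUCCESS ? e->new_value : e->old_value;
    }
    return *addr;                          /* aliased address not in descriptor */
}
```

The last case arises because several addresses hash to the same orec: an owning descriptor may cover an orec without mentioning every address that maps to it.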

Transaction Process
Call WSTMRead/WSTMWrite to gather/change data
  – Builds transaction data structure, but it’s NOT visible
WSTMCommitTransaction
  – Get ownership: update orecs
  – Read-check: check version numbers
  – Decide
  – Clean up

[Figure: Data Structures. Animation frames showing orec versions 52, 15, 53, and 16; descriptor entries a1: (100,15) -> (200,16), a2: (200,52) -> (200,52), and a2: (200,52) -> (100,53); and statuses UNKNOWN and SUCCESS.]

Complications
Fixed number of orecs
Hash collisions lead to false sharing
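A small sketch shows how a fixed orec table produces false sharing. The table size, the shift amount, and the names `orec_index` and `addr_to_orec` are assumptions for illustration; the slides do not specify the hash function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OREC_COUNT 1024u   /* assumed table size; must be a power of two */

typedef struct { uintptr_t version_or_desc; } orec_t;
static orec_t orec_table[OREC_COUNT];

/* Map a word address to a table slot: drop the low bits that are zero for
 * 8-byte-aligned words, then mask down to the table size. */
static inline size_t orec_index(const void *addr) {
    return (size_t)(((uintptr_t)addr >> 3) & (OREC_COUNT - 1));
}

static inline orec_t *addr_to_orec(const void *addr) {
    return &orec_table[orec_index(addr)];
}
```

With this hash, any two words that lie exactly `8 * OREC_COUNT` bytes apart share an orec, so transactions touching unrelated data can still conflict.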

Issues
Orec ownership acts like a lock, so the simple scheme is not even obstruction-free
Can’t help with “cleanup” because it might overwrite newer data
Can’t determine value during READ_CHECK, so we’re forced to shoot down
force_decision() might be circular, causing livelock
Helping requires stealing of transactions
Uncontended cost is N+2 CAS for N locations

OSTM
Objects are represented as opaque handles
  – can’t use pointers directly
  – must rewrite data structures
Get accessible pointers via OSTMOpenForReading/OSTMOpenForWriting
Eliminates need for orecs/aliasing

Evaluation
“We use … reference-counting garbage collection”
Evaluated with one thread/CPU
“Since we know the number of threads participating in our experiments…”

Uncontended Performance

Contended Locks

Data Contention

Data/Lock Contention

Spare Slides

    word WSTMRead(wstm_transaction *tx, word *addr) {
        if (entry_exists) return entry->new_value;
        if (orec->type != descriptor)
            create entry [current value, orec version]
        else {
            force_decision(descriptor);   // can’t be ours: not in commit
            if (descriptor contains our address)
                if (status == SUCCESS)
                    create entry [descr.new_val, descr.new_ver]
                else
                    create entry [descr.old_val, descr.old_ver]
            else
                create entry [current value, descr.aliased.new_ver]
        }
        if (aliased) {
            if (entry->old_version != aliased->old_version)
                status = FAILED;
            entry->old_version = aliased->old_version;
            entry->new_version = aliased->new_version;
        }
        return entry->new_value;
    }

    void WSTMWrite(wstm_transaction *tx, word *addr, word new_value) {
        get entry using WSTMRead logic
        entry->new_value = new_value;
        for each aliased entry {
            entry->new_version++;
        }
    }

    bool WSTMCommit(wstm_transaction *tx) {
        if (tx->status == FAILED) return false;
        sort descriptor entries
        desired_status = FAILED;
        for each update
            if (!acquire_orec) goto decision_point;
        CAS(status, UNDECIDED, READ_CHECK);
        for each read
            if (!read_check) goto decision_point;
        desired_status = SUCCESS;
    decision_point:
        status = tx->status;
        while (status != FAILED && status != SUCCESS) {
            CAS(tx->status, status, desired_status);
            status = tx->status;
        }
        if (tx->status == SUCCESS)
            for each update
                *addr = entry->new_value;
        for each update
            release_orec
        return (tx->status == SUCCESS);
    }

    bool read_check(wstm_transaction *tx, wstm_entry *entry) {
        if (orec is WSTM_descriptor) {
            force_decision()
            if (SUCCESS)
                version = new_version;
            else
                version = old_version;
        } else {
            version = orec_version;
        }
        return (version == entry->old_version);
    }

[Figure: Data Structures. Shows orecs for a1, a2, and a3 (one holding version 52) and a transaction descriptor with Status: Undecided and entries a1: (100,15) -> (200,16), a2: (200,52) -> (100,53), and a3: (300,15) -> (300,16).]

Caveats
“It remains possible for a thread to see a mutually inconsistent view of shared memory if it performs a series of [read] calls.”
In other words, there is not complete isolation between transactions
  – a thread may crash due to concurrency prior to having its transaction abort and retry