Advanced Operating Systems (CS 202) Transactional memory Jan, 27, 2016 slide credit: slides adapted from several presentations, including stanford TCC.

Presentation transcript:

Advanced Operating Systems (CS 202) Transactional memory Jan. 27, 2016 Slide credit: slides adapted from several presentations, including Stanford TCC group and MIT SuperTech group

2 Motivation Uniprocessor Systems – Frequency – Power consumption – Wire delay limits scalability – Design complexity vs. verification effort – Where is ILP? Support for multiprocessor or multicore systems – Replicate small, simple cores, design is scalable – Faster design turnaround time, Time to market – Exploit TLP, in addition to ILP within each core – But now we have new problems

3 Parallel Software Problems Parallel systems are often programmed with – Synchronization through barriers – Shared-object access control through locks Lock granularity and organization must balance performance and correctness – Coarse-grain locking: lock contention – Fine-grain locking: extra overhead – Must be careful to avoid deadlocks or data races – Must be careful not to leave anything unprotected Performance tuning is not intuitive – Performance bottlenecks are related to low-level events, e.g. false sharing, coherence misses – Feedback is often indirect (cache lines, rather than variables)

4 Parallel Hardware Complexity (TCC’s view) Cache coherence protocols are complex – Must track ownership of cache lines – Difficult to implement and verify all corner cases Consistency protocols are complex – Must provide rules to correctly order individual loads/stores – Difficult for both hardware and software Current protocols rely on low latency, not bandwidth – Critical short control messages on ownership transfers – Latency of short messages unlikely to scale well in the future – Bandwidth is likely to scale much better High speed interchip connections Multicore (CMP) = on-chip bandwidth

5 What do we want? A shared memory system with – A simple, easy programming model (unlike message passing) – A simple, low-complexity hardware implementation (unlike conventional cache-coherent shared memory) – Good performance

6 Why are locks bad? Common problems in conventional locking mechanisms in concurrent systems – Priority inversion/inefficiency: when a low-priority process is preempted while holding a lock needed by a high-priority process – Convoying: when a process holding a lock is de-scheduled (e.g. page fault, quantum expired), other processes that are able to run make no forward progress – Deadlock (or livelock): processes attempt to lock the same set of objects in different orders (could be bugs by programmers) – Error-prone

Lock-free A shared data structure is lock-free if its operations do not require mutual exclusion – Does not prevent multiple processes from operating on the same object – Avoids the lock problems – But existing lock-free techniques are implemented in software and do not perform well against their lock-based counterparts

Transactional Memory Use transaction-style operations to operate on lock-free data Allows users to define customized read-modify-write operations on multiple, independent words Easy to support in hardware: straightforward extensions to a conventional multiprocessor cache

Transaction Style A finite sequence of machine instructions with – A sequence of reads, – Computation, – A sequence of writes, and – Commit Formal properties – Atomicity, Serializability (~ACID)

Access Instructions Load-transactional (LT) – Reads from shared memory into a private register Load-transactional-exclusive (LTX) – LT plus a hint that a write to the same location is likely to follow Store-transactional (ST) – Tentatively writes from a private register to shared memory; the new value is not visible to other processors until commit

State Instructions Commit – Tries to make the tentative writes permanent – Succeeds if no other processor has written to its read set or write set, or read its write set – On failure, discards all updates to the write set – Returns whether it succeeded Abort – Discards all updates to the write set Validate – Returns the current transaction status – If the current status is false, discards all updates to the write set
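The instruction semantics on the two slides above can be mimicked in software. What follows is a minimal single-threaded Python sketch, not the hardware mechanism: `ToyTM`, `Transaction`, and the method names are hypothetical, and it assumes a committer-wins policy in which a successful commit marks every in-flight transaction whose read or write set overlaps the committed writes as failed.

```python
class ToyTM:
    """Toy model of shared memory plus the transactions in flight (assumption)."""

    def __init__(self, memory):
        self.memory = dict(memory)   # shared memory: address -> value
        self.active = []             # transactions that have not finished yet

    def begin(self):
        tx = Transaction(self)
        self.active.append(tx)
        return tx


class Transaction:
    def __init__(self, tm):
        self.tm = tm
        self.read_set = set()
        self.write_buf = {}          # tentative writes, hidden from others
        self.ok = True               # status reported by validate()

    def lt(self, addr):
        """Load-transactional: read memory (or this transaction's own write)."""
        self.read_set.add(addr)
        return self.write_buf.get(addr, self.tm.memory[addr])

    def ltx(self, addr):
        """LT plus a write hint; the hint has no effect in this toy model."""
        return self.lt(addr)

    def st(self, addr, value):
        """Store-transactional: buffer the write until commit."""
        self.write_buf[addr] = value

    def abort(self):
        """Discard all tentative writes and end the transaction."""
        self.write_buf.clear()
        self.read_set.clear()
        self.ok = False
        if self in self.tm.active:
            self.tm.active.remove(self)

    def validate(self):
        """Return the status; a failed transaction drops its writes."""
        if not self.ok:
            self.abort()
        return self.ok

    def commit(self):
        """Publish the tentative writes if no conflict was recorded."""
        if not self.ok:
            self.abort()
            return False
        written = set(self.write_buf)
        self.tm.memory.update(self.write_buf)
        self.tm.active.remove(self)
        for other in self.tm.active:   # committer wins: fail overlapping peers
            if written & (other.read_set | set(other.write_buf)):
                other.ok = False
        self.write_buf, self.read_set = {}, set()
        self.ok = False                # this transaction is over
        return True


tm = ToyTM({"x": 1, "y": 2})
t1, t2 = tm.begin(), tm.begin()
t1.st("x", t1.lt("x") + 10)
t2.st("y", t2.lt("x") + 100)        # t2 reads x, which t1 is about to change
print(t1.commit(), tm.memory)       # True {'x': 11, 'y': 2}
print(t2.validate(), t2.commit())   # False False
```

Real hardware detects the conflict through the coherence protocol as it happens; the sketch only records it at commit time to keep the model small.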

Transactional memory API Programmer specifies atomic code blocks
Lock version:
    Lock(X[a]);
    Lock(X[b]);
    Lock(X[c]);
    X[c] = X[a] + X[b];
    Unlock(X[c]);
    Unlock(X[b]);
    Unlock(X[a]);
TM version:
    atomic {
        X[c] = X[a] + X[b];
    }
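The lock version on this slide is only safe if every thread acquires the element locks in a compatible order. As a hedged illustration of that burden, here is a small Python sketch; `locked_add` and the per-element `locks` list are hypothetical names, and acquiring in sorted index order is an assumption added here, not something the slide specifies.

```python
import threading


def locked_add(X, locks, a, b, c):
    """Compute X[c] = X[a] + X[b] while holding the per-element locks."""
    # One global order (ascending index) avoids deadlock between threads;
    # the set removes duplicates so no lock is acquired twice.
    order = sorted({a, b, c})
    for i in order:
        locks[i].acquire()
    try:
        X[c] = X[a] + X[b]
    finally:
        for i in reversed(order):
            locks[i].release()


X = [1, 2, 0, 0]
locks = [threading.Lock() for _ in X]
locked_add(X, locks, 0, 1, 2)
print(X)   # [1, 2, 3, 0]
```

With an atomic block the ordering rule disappears from the program, which is the point the TM column makes.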

Typical Transaction
    /* keep trying */
    while ( true ) {
        /* read variables */
        v1 = LT ( V1 ); …; vn = LT ( Vn );
        /* check consistency */
        if ( ! VALIDATE () ) continue;
        /* compute new values */
        compute ( v1, …, vn );
        /* write tentative values */
        ST ( v1, V1 ); …; ST ( vn, Vn );
        /* try to commit */
        if ( COMMIT () ) return result;
        else backoff;
    }
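The read, compute, commit, back-off shape of this loop can be tried out with ordinary threads. The sketch below is a software stand-in under stated assumptions: `VersionedCell` and `atomic_update` are hypothetical names, the cell holds a single word, and an internal lock models the atomicity that the hardware commit would provide. Because only one word is read, the VALIDATE step is omitted.

```python
import random
import threading
import time


class VersionedCell:
    """One shared word with a version number (stand-in for TM hardware)."""

    def __init__(self, value):
        self._value = value
        self._version = 0
        self._guard = threading.Lock()   # models an atomic commit step

    def load(self):
        with self._guard:
            return self._value, self._version

    def try_commit(self, seen_version, new_value):
        """Install new_value only if nobody committed since seen_version."""
        with self._guard:
            if self._version != seen_version:
                return False
            self._value = new_value
            self._version += 1
            return True


def atomic_update(cell, compute, max_backoff=0.001):
    delay = 1e-6
    while True:                                   # keep trying
        value, version = cell.load()              # read variables
        new_value = compute(value)                # compute new value
        if cell.try_commit(version, new_value):   # try to commit
            return new_value
        time.sleep(random.uniform(0, delay))      # randomized backoff
        delay = min(delay * 2, max_backoff)


def worker(cell, n):
    for _ in range(n):
        atomic_update(cell, lambda v: v + 1)


counter = VersionedCell(0)
threads = [threading.Thread(target=worker, args=(counter, 500)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.load()[0])   # 2000
```

Every successful commit adds exactly one to the value it read, so no increment is lost however the threads interleave.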

Example (timeline figure) – Transaction A: ld 0xdddd, st 0xbeef, then Arbitrate and Commit – Transaction B: ld 0xdddd, ld 0xbbbb, then Arbitrate and Commit – Transaction C: ld 0xbeef; Violation! when the commit of 0xbeef arrives; re-execute ld 0xbeef with new data

Warning… Not the same as database transactions, and not intended for database use – Transactions are short in time – Transactions are small in dataset But similar in intent and semantics

Idea Behind Implementation Existing cache coherence protocols already detect accessibility conflicts Accessibility conflicts ~ transaction conflicts Can be implemented by extending cache coherence protocols – Includes both snoopy bus and directory protocols

Bus Snoopy Example (figure: processor with two caches attached to the bus) – Regular cache: byte lines, direct mapped – Transaction cache: 64 8-byte lines, fully associative Caches are exclusive Transaction cache contains tentative writes without propagating them to other processors

TM support for transactions
Buffering: Transactional cache
Conflict detection: Cache coherence protocol
Abort/Recovery: Invalidate transactional cache line
Commit: Validate transactional cache line

Transaction Cache Each cache line carries a separate transactional tag in addition to the coherence protocol tag – Transactional tag states: EMPTY, NORMAL, XCOMMIT, XABORT Two entries per transactionally written line – The modified value is written to the XABORT entry, which is set to EMPTY on abort – The XCOMMIT entry holds the original value and is set to EMPTY on commit Allocation policy, in decreasing order of preference – EMPTY entries, NORMAL entries, XCOMMIT entries Must guarantee a minimum transaction size
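The tag transitions on this slide can be written out as a small state machine. The Python sketch below is a toy model under assumptions: `TxCache` and its methods are hypothetical names, each line is a `[tag, address, value]` list, coherence states are left out, and an evicted NORMAL or XCOMMIT line is written back to a dictionary standing in for memory. It also shows the surviving entry of each pair becoming NORMAL, which is how the Herlihy and Moss design completes the pair.

```python
from enum import Enum


class Tag(Enum):
    EMPTY = "empty"
    NORMAL = "normal"
    XCOMMIT = "xcommit"   # old value, dropped on commit
    XABORT = "xabort"     # new value, dropped on abort


class TxCache:
    """Toy transactional cache; each line is [tag, addr, value]."""

    def __init__(self, n_lines, memory):
        self.memory = memory   # backing store: addr -> committed value
        self.lines = [[Tag.EMPTY, None, None] for _ in range(n_lines)]

    def _find(self, addr, tag):
        for line in self.lines:
            if line[0] is tag and line[1] == addr:
                return line
        return None

    def _allocate(self):
        # Preference order from the slide: EMPTY, then NORMAL, then XCOMMIT.
        for wanted in (Tag.EMPTY, Tag.NORMAL, Tag.XCOMMIT):
            for line in self.lines:
                if line[0] is wanted:
                    if wanted is not Tag.EMPTY:
                        self.memory[line[1]] = line[2]   # write back old data
                    return line
        raise RuntimeError("transaction exceeds cache capacity")

    def store(self, addr, value):
        """Transactional write: old value as XCOMMIT, new value as XABORT."""
        new = self._find(addr, Tag.XABORT)
        if new is None:
            old = self._find(addr, Tag.NORMAL)
            if old is None:
                old = self._allocate()
                old[1], old[2] = addr, self.memory[addr]
            old[0] = Tag.XCOMMIT
            new = self._allocate()
            new[0], new[1] = Tag.XABORT, addr
        new[2] = value

    def load(self, addr):
        for tag in (Tag.XABORT, Tag.NORMAL):
            line = self._find(addr, tag)
            if line is not None:
                return line[2]
        return self.memory[addr]

    def _retag(self, mapping):
        for line in self.lines:
            line[0] = mapping.get(line[0], line[0])

    def commit(self):
        self._retag({Tag.XCOMMIT: Tag.EMPTY, Tag.XABORT: Tag.NORMAL})

    def abort(self):
        self._retag({Tag.XABORT: Tag.EMPTY, Tag.XCOMMIT: Tag.NORMAL})


cache = TxCache(4, {"A": 1})
cache.store("A", 2)
cache.abort()
print(cache.load("A"))   # 1
cache.store("A", 3)
cache.commit()
print(cache.load("A"))   # 3
```

Keeping both the old and the new value in the cache is what lets commit and abort finish by retagging lines, without copying data.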

Transactional Cache Fully associative cache – Each cache line can be in only one of the transactional or regular caches Holds transactional writes – Transactional writes are hidden from other processors and memory Makes updated lines available for snooping on COMMIT Invalidates updated lines on ABORT

Herlihy and Moss, ISCA ‘93 (figure: CPU with a Cache and a Transaction Cache, connected to Memory; lines labeled M, S, S, XCommit, XAbort)

Sample Counter code

Exposing more concurrency Doubly linked list implementation of a queue – Head and tail pointers If the queue is not empty – Only the head pointer is used for dequeuing – Only the tail pointer is used for enqueuing Concurrent enqueuing/dequeuing – Possible in TM – Not possible with a single lock protecting the queue
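One way to see why the two operations can overlap is to record which fields each one reads and writes. The Python sketch below is illustrative only: `TracedQueue`, `footprint`, and `conflicts` are hypothetical names, footprints are taken from back-to-back sequential runs rather than from truly concurrent ones, and a conflict is assumed to mean that one operation writes a field the other reads or writes.

```python
class Node:
    def __init__(self, value):
        self.value, self.prev, self.next = value, None, None


class TracedQueue:
    """Doubly linked queue that logs the (object, field) pairs it touches."""

    def __init__(self):
        self.reads, self.writes = set(), set()
        self.head = self.tail = None

    def _read(self, obj, field):
        self.reads.add((obj, field))
        return getattr(obj, field)

    def _write(self, obj, field, value):
        self.writes.add((obj, field))
        setattr(obj, field, value)

    def enqueue(self, value):
        node = Node(value)
        tail = self._read(self, "tail")
        if tail is None:
            self._write(self, "head", node)
        else:
            self._write(node, "prev", tail)
            self._write(tail, "next", node)
        self._write(self, "tail", node)

    def dequeue(self):
        node = self._read(self, "head")
        nxt = self._read(node, "next")
        if nxt is None:
            self._write(self, "tail", None)
        else:
            self._write(nxt, "prev", None)
        self._write(self, "head", nxt)
        return self._read(node, "value")


def footprint(queue, op, *args):
    """Run one operation and return the (reads, writes) it performed."""
    queue.reads, queue.writes = set(), set()
    op(*args)
    return queue.reads, queue.writes


def conflicts(a, b):
    (ra, wa), (rb, wb) = a, b
    return bool(wa & (rb | wb) or wb & ra)


q = TracedQueue()
for v in (1, 2, 3):
    q.enqueue(v)
enq = footprint(q, q.enqueue, 4)
deq = footprint(q, q.dequeue)
print(conflicts(enq, deq))   # False
```

With several elements queued the two footprints are disjoint, so transactions wrapping them would not abort each other. When a single node is both head and tail the footprints overlap and one of the transactions would have to retry.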

Challenges of TM – Long transactions – I/O – Nested transactions – Interrupts

Other TM Ideas Speculative Lock Elision Software Transactional Memory – Requires no hardware changes – Allows composition of transactions Multiple improvements both for hardware and software TM Hybrid TMs

Speculative Lock Elision Rajwar and Goodman, MICRO ‘01 Speculatively removes lock acquire and release instructions Microarchitectural changes only – No changes to cache systems – No changes to ISA – Can work with existing lock-based code

SLE example

Compare TM and TLS (thread-level speculation) TM is optimistic synchronization TLS is optimistic parallelization Any other similarities or differences?

Simulation Proteus Simulator 32 processors Regular cache – Direct mapped, byte lines Transactional cache – Fully associative, 64 8-byte lines Single-cycle cache access 4-cycle memory access Both snoopy bus and directory are simulated 2-stage network with a switch delay of 1 cycle each

Benchmarks Counter – n processors, each increments a shared counter (2^16)/n times Producer/Consumer buffer – n/2 processors produce, n/2 processors consume through a shared FIFO – Ends when 2^16 items are consumed Doubly-linked list – n processors try to rotate the content from tail to head – Ends when 2^16 items are moved – Which variables are shared depends on the state of the list – Traditional locking methods can introduce deadlock

Comparisons Competitors – Transactional memory – Load-locked/store-conditional (Alpha) – Spin lock with backoff – Software queue – Hardware queue

Counter Result

Producer/Consumer Result

Doubly Linked List Result

Conclusion – Avoids the extra lock variable and lock problems – Trades deadlock for possible livelock/starvation – Comparable performance to lock techniques when the shared data structure is small – Relatively easy to implement

What has happened since this paper? Many other transactional memory proposals – Software TM (slower, but no hardware support needed, and no limit on size of data) – Hardware TM (many proposals with various degrees of improvement) Products! – Sun Rock in the mid-2000s; TxLinux used it – Intel/AMD: announced 2013, shipped 2014; supports SLE as well