Multiprocessor Programming

Slides:

Advertisements

Similar presentations

Transactional Memory Parag Dixit Bruno Vavala Computer Architecture Course, 2012.

Advertisements

Transactional Locking Nir Shavit Tel Aviv University (Joint work with Dave Dice and Ori Shalev)

Chapter 6: Process Synchronization

Hybrid Transactional Memory Nir Shavit MIT and Tel-Aviv University Joint work with Alex Matveev (and describing the work of many in this summer school)

Thread-Level Transactional Memory Decoupling Interface and Implementation UW Computer Architecture Affiliates Conference Kevin Moore October 21, 2004.

Transactional Memory (TM) Evan Jolley EE 6633 December 7, 2012.

The Future of Concurrency Theory Renaissance or Reformation? (Met dank aan Maurice Herlihy) Frits Vaandrager.

Nested Transactional Memory: Model and Preliminary Architecture Sketches J. Eliot B. Moss Antony L. Hosking.

CS510 Concurrent Systems Class 13 Software Transactional Memory Should Not be Obstruction-Free.

The Future of Distributed Computing Renaissance or Reformation? Maurice Herlihy Brown University.

Language Support for Lightweight transactions Tim Harris & Keir Fraser Presented by Narayanan Sundaram 04/28/2008.

1 New Architectures Need New Languages A triumph of optimism over experience! Ian Watson 3 rd July 2009.

Transactions and Reliability. File system components Disk management Naming Reliability  What are the reliability issues in file systems? Security.

An Introduction to Software Transactional Memory

Art of Multiprocessor Programming 1 Transactional Memory Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

Transactional Memory Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit TexPoint fonts used in EMF. Read the TexPoint.

Reduced Hardware NOrec: A Safe and Scalable Hybrid Transactional Memory Alexander Matveev Nir Shavit MIT.

Transactional Memory The Art of Multiprocessor Programming Herlihy and Shavit.

A Qualitative Survey of Modern Software Transactional Memory Systems Virendra J. Marathe Michael L. Scott.

Transactional Locking Nir Shavit Tel Aviv University Joint work with Dave Dice and Ori Shalev.

Software Transactional Memory Should Not Be Obstruction-Free Robert Ennals Presented by Abdulai Sei.

Parallel Data Structures. Story so far Wirth’s motto –Algorithm + Data structure = Program So far, we have studied –parallelism in regular and irregular.

Commutativity and Coarse-Grained Transactions Maurice Herlihy Brown University Joint work with Eric Koskinen and Matthew Parkinson (POPL 10)

Transactional Memory Companion slides for

Translation Lookaside Buffer

Lecture 20: Consistency Models, TM

Maurice Herlihy and J. Eliot B. Moss, ISCA '93

Jonathan Walpole Computer Science Portland State University

Algorithmic Improvements for Fast Concurrent Cuckoo Hashing

Transactions and Reliability

Minh, Trautmann, Chung, McDonald, Bronson, Casper, Kozyrakis, Olukotun

Virtualizing Transactional Memory

Transactional Memory : Hardware Proposals Overview

PHyTM: Persistent Hybrid Transactional Memory

Transactional Memory Companion slides for

Transactional Memory TexPoint fonts used in EMF.

Atomic Operations in Hardware

The University of Adelaide, School of Computer Science

The University of Adelaide, School of Computer Science

Atomic Operations in Hardware

Faster Data Structures in Transactional Memory using Three Paths

Challenges in Concurrent Computing

Anders Gidenstam Håkan Sundell Philippas Tsigas

Example Cache Coherence Problem

Changing thread semantics

Lecture 6: Transactions

Transactional Memory Companion slides for

Lecture 21: Transactional Memory

Transactional Memory An Overview of Hardware Alternatives

Yiannis Nikolakopoulos

Lecture 22: Consistency Models, TM

Does Hardware Transactional Memory Change Everything?

Hybrid Transactional Memory

Software Transactional Memory Should Not be Obstruction-Free

Kernel Synchronization II

The University of Adelaide, School of Computer Science

Lecture 17 Multiprocessors and Thread-Level Parallelism

Update : about 8~16% are writes

Lecture 17 Multiprocessors and Thread-Level Parallelism

Programming with Shared Memory Specifying parallelism

Lecture 23: Transactional Memory

Lecture 21: Transactional Memory

Lecture: Consistency Models, TM

Problems with Locks Andrew Whitaker CSE451.

Lecture: Transactional Memory

The University of Adelaide, School of Computer Science

Concurrency control (OCC and MVCC)

I, for one, Welcome our new Multicore Overlords …

Lecture 17 Multiprocessors and Thread-Level Parallelism

Parallel Data Structures

Presentation transcript:

Multiprocessor Programming Transactional Memory The Art of Multiprocessor Programming

Multicore Architetures “Learn how the multi-core processor architecture plays a central role in Intel's platform approach. ….” “AMD is leading the industry to multi-core technology for the x86 based computing market …” “Sun's multicore strategy centers around multi-threaded software. ... “ Art of Multiprocessor Programming

Why do we care? Time no longer cures software bloat When you double your path length You can’t just wait 6 months Your software must somehow exploit twice as much concurrency © 2006 Herlihy & Shavit

The Problem Cannot exploit cheap threads Today’s Software Non-scalable methodologies Today’s Hardware Poor support for scalable synchronization © 2006 Herlihy & Shavit

Locking © 2006 Herlihy & Shavit

Coarse-Grained Locking Easily made correct … But not scalable. © 2006 Herlihy & Shavit

Fine-Grained Locking Here comes trouble … © 2006 Herlihy & Shavit

Why Locking Doesn’t Scale Not Robust Relies on conventions Hard to Use Conservative Deadlocks Lost wake-ups Not Composable © 2006 Herlihy & Shavit

No one else can make progress Locks are not Robust If a thread holding a lock is delayed … No one else can make progress © 2006 Herlihy & Shavit

Why Locking Doesn’t Scale Not Robust Relies on conventions Hard to Use Conservative Deadlocks Lost wake-ups Not Composable © 2006 Herlihy & Shavit

Locking Relies on Conventions Relation between Lock bit and object bits Exists only in programmer’s mind Actual comment from Linux Kernel (hat tip: Bradley Kuszmaul) /* * When a locked buffer is visible to the I/O layer * BH_Launder is set. This means before unlocking * we must clear BH_Launder,mb() on alpha and then * clear BH_Lock, so no reader can see BH_Launder set * on an unlocked buffer and then risk to deadlock. */ © 2006 Herlihy & Shavit

Why Locking Doesn’t Scale Not Robust Relies on conventions Hard to Use Conservative Deadlocks Lost wake-ups Not Composable © 2006 Herlihy & Shavit

No interference if ends “far enough” apart Sadistic Homework enq(x) enq(y) Double-ended queue No interference if ends “far enough” apart © 2006 Herlihy & Shavit

Interference OK if ends “close enough” together Sadistic Homework enq(x) enq(y) Double-ended queue Interference OK if ends “close enough” together © 2006 Herlihy & Shavit

Make sure suspended dequeuers awake as needed Sadistic Homework deq() deq() Double-ended queue  Make sure suspended dequeuers awake as needed © 2006 Herlihy & Shavit

You Try It … One lock? Locks at each end? Waking blocked dequeuers? Too Conservative Locks at each end? Deadlock, too complicated, etc Waking blocked dequeuers? Harder than it looks © 2006 Herlihy & Shavit

Actual Solution Clean solution would be a publishable result [Michael & Scott, PODC 96] High performance fine-grained lock-based solutions are good for libraries… not general consumption by programmers © 2006 Herlihy & Shavit

Why Locking Doesn’t Scale Not Robust Relies on conventions Hard to Use Conservative Deadlocks Lost wake-ups Not Composable © 2006 Herlihy & Shavit

Exposing lock internals breaks abstraction Locks do not compose Hash Table lock T1 Must lock T1 before adding item add(T1, item) item Move from T1 to T2 lock T1 lock T2 delete(T1, item) add(T2, item) Must lock T2 before deleting from T1 item item Exposing lock internals breaks abstraction © 2006 Herlihy & Shavit

Monitor Wait and Signal Empty zzz buffer Yes! If buffer is empty, wait for item to show up © 2006 Herlihy & Shavit

Wait and Signal do not Compose empty Wait for either? empty zzz… © 2006 Herlihy & Shavit

The Transactional Manifesto What we do now is inadequate to meet the multicore challenge Research Agenda Replace locking with a transactional API Design languages to support this model Implement the run-time to be fast enough © 2006 Herlihy & Shavit

Transactions Atomic Linearizable Commit: takes effect Abort: effects rolled back Usually retried Linearizable Appear to happen in one-at-a-time order © 2006 Herlihy & Shavit

Sadistic Homework Revisited Public void LeftEnq(item x) { Qnode q = new Qnode(x); q.left = this.left; this.left.right = q; this.left = q; } Write sequential Code (1) © 2006 Herlihy & Shavit

Sadistic Homework Revisited Public void LeftEnq(item x) { atomic { Qnode q = new Qnode(x); q.left = this.left; this.left.right = q; this.left = q; } (1) © 2006 Herlihy & Shavit

Sadistic Homework Revisited Public void LeftEnq(item x) { atomic { Qnode q = new Qnode(x); q.left = this.left; this.left.right = q; this.left = q; } Enclose in atomic block (1) © 2006 Herlihy & Shavit

Warning Not always this simple But often it is Conditional waits Enhanced concurrency Overlapping locks But often it is Works for sadistic homework © 2006 Herlihy & Shavit

Composition Trivial or what? Public void Transfer(Queue q1, q2) { atomic { Item x = q1.deq(); q2.enq(x); } Trivial or what? (1) © 2006 Herlihy & Shavit

Wake-ups: lost and found Public item LeftDeq() { atomic { if (this.left == null) retry; … } Roll back transaction and restart when something changes (1) © 2006 Herlihy & Shavit

OrElse Composition Run 1st method. If it retries … atomic { x = q1.deq(); } orElse { x = q2.deq(); } Run 1st method. If it retries … Run 2nd method. If it retries … Entire statement retries © 2006 Herlihy & Shavit

Transactional Memory [HerlihyMoss93] © 2006 Herlihy & Shavit

Hardware Overview Exploit Cache coherence protocols Already do almost what we need Invalidation Consistency checking Exploit Speculative execution Branch prediction = optimistic synch! Original TM proposal by Herlihy and Moss 1993… © 2006 Herlihy & Shavit

HW Transactional Memory read active T caches Interconnect memory © 2006 Herlihy & Shavit

Transactional Memory read active active T T caches memory © 2006 Herlihy & Shavit

Transactional Memory active active committed T T caches memory © 2006 Herlihy & Shavit

Transactional Memory write active committed T D caches memory © 2006 Herlihy & Shavit

Rewind write aborted active active T T D caches memory © 2006 Herlihy & Shavit

Transaction Commit At commit point Mark transactional entries If no cache conflicts, we win. Mark transactional entries Read-only: valid Modified: dirty (eventually written back) That’s all, folks! Except for a few details … © 2006 Herlihy & Shavit

Not all Skittles and Beer Limits to Transactional cache size Scheduling quantum Transaction cannot commit if it is Too big Too slow Actual limits platform-dependent © 2006 Herlihy & Shavit

Some Approaches Trap to software if hardware fails “Hybrid” approach Use locks if speculation fails Lock elision “Virtualize” transactional memory VTM, UTM, etc… © 2006 Herlihy & Shavit

Hardware versus Software Do we need hardware at all? Like virtual memory, probably need HW for performance Do we need software? Policy issues don’t make sense for hardware © 2006 Herlihy & Shavit

Transactional Locking Build STM based on locks Atomic{} simple as course grained locking and it composes Under the covers (transparent to user) we deploy fine grained locks © 2006 Herlihy & Shavit

Locking STM Design Choices Map Array of Versioned- Write-Locks Application Memory V# PS = Lock per Stripe (separate array of locks) PO = Lock per Object (embedded in object) © 2006 Herlihy & Shavit

Lock-based STM Mem Locks V# 0 V#+1 0 X Y V# 0 V# 0 V# 0 To Read: load lock + location Location in write-set? (Bloom Filter) Check unlocked add to Read-Set To Write: add value to write set Acquire Locks Validate read/write v#’s unchanged Release each lock with v#+1 X V# 0 V# 1 V# 1 V# 0 V#+1 0 V# 0 Y V# 1 V# 0 V# 0 V# 0 V#+1 0 V#+1 0 V# 1 V# 0 V# 0 V# 0 V# 0 V# 0 © 2006 Herlihy & Shavit

Transactional Consistency Invariant x = 2y 8 4 4 X Transaction A: Write x Write y 2 Y Transaction B: Read x Read y Compute z = 1/(x-y) = 1/2 Application Memory

Transactional Inconsistency Invariant x = 2y 4 8 4 X Transaction B: Read x = 4 2 Y Transaction A: Write x Write y Zombie transactiosn are ones that are doomed to fail (here trans will fail validatuon at the end) but go on Computing in a way that can cause problems Transaction B: Read y = 4 {trans is zombie} Compute z = 1/(x-y) Application Memory DIV by 0 ERROR

Solution: The “Global Clock” Have one shared global clock Incremented by (small subset of) writing transactions Read by all transactions Used to validate that state worked on is always consistent

Solution: “Version Clock” Have one shared global version clock Incremented by (small subset of) writing transactions Read by all transactions Used to validate that state worked on is always consistent Later: how we learned not to worry about contention and love the clock © 2006 Herlihy & Shavit

Version Clock: Read-OnlyTrans Mem Locks 100 VClock 87 0 34 0 88 0 V# 0 44 0 99 0 50 0 87 0 87 0 87 0 RV  VClock On Read: read lock, read mem, read lock: check unlocked, unchanged, and v# <= RV Commit. 34 0 34 0 34 0 88 0 99 0 99 0 V# 0 99 0 44 0 Reads form a snapshot of memory. No read set! 50 0 V# 0 50 0 50 0 100 RV © 2006 Herlihy & Shavit

Version Clock 100 120 100 121 VClock Mem Locks 87 0 121 0 88 0 V# 0 87 0 121 0 88 0 V# 0 44 0 50 0 87 0 87 0 87 0 87 0 RV  VClock On Read/Write: check unlocked and v# <= RV then add to Read/Write-Set Acquire Locks WV = F&I(VClock) Validate each v# <= RV Release locks with v#  WV X X 34 0 34 1 34 0 34 0 121 0 88 0 Y Y 121 0 99 1 99 0 99 0 99 0 44 0 50 0 V# 0 50 0 50 0 50 0 Reads+Inc+Writes =Linearizable 100 RV Commit © 2006 Herlihy & Shavit

Performance Benchmarks Mechanically Transformed Sequential Red-Black Tree using TL2 Compare to STMs and hand-crafted fine-grained Red-Black implementation On a 16–way Sun Fire™ running Solaris™ 10 © 2006 Herlihy & Shavit

Uncontended Large Red-Black Tree Hand-crafted 5% Delete 5% Update 90% Lookup TL/PO TL2/P0 Ennals TL/PS TL2/PS FarserHarris Lock-free © 2006 Herlihy & Shavit

Uncontended Small RB-Tree 5% Delete 5% Update 90% Lookup TL/P0 TL2/P0 © 2006 Herlihy & Shavit

Contended Small RB-Tree TL/P0 30% Delete 30% Update 40% Lookup TL2/P0 Ennals © 2006 Herlihy & Shavit

What Next? On multicores like Niagara Global Clock not a big issue. But there is a lot of TM research work ahead Do you want to help…? © 2006 Herlihy & Shavit