Practical Reduction for Store Buffers
Ernie Cohen, Microsoft
Norbert Schirmer, DFKI

problem

practical reasoning about imperative code is based on state assertions and invariants
such reasoning tacitly assumes sequential consistency (SC) …
… but real MP hardware doesn't provide SC
needed: a programming discipline that
– guarantees SC
– is flexible enough to handle real software
– is practical to check

x86/x64 hardware model: TSO

FIFO store buffer (SB) between each processor (P) and the (shared, SC) memory
– P writes are queued onto its SB
– concurrently, writes leave SBs and are applied to memory
– a read by P reads from P's SB if possible; otherwise, it reads from memory ("SB forwarding")
– P can flush its own SB (expensive)
note: TSO != "load-acquire, store-release"
– a read can move backward past a write to the same location, turning into a read of a constant
note: UP (uniprocessor) TSO machines are SC, but …
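
the model above is mechanical enough to sketch directly; the following plain-C simulation of one processor's SB is illustrative only (all names and sizes are mine, not from the paper):

#include <assert.h>
#include <stdint.h>

#define SB_CAP   16    /* store-buffer capacity (illustrative) */
#define MEM_SIZE 256   /* size of the shared memory (illustrative) */

typedef struct { uint32_t addr, val; } SBEntry;

typedef struct {
    SBEntry buf[SB_CAP];   /* FIFO of pending writes */
    int head, count;
} StoreBuffer;

static uint32_t memory[MEM_SIZE];  /* the shared, SC memory */

/* a write by P is queued onto P's own SB */
static void sb_write(StoreBuffer *sb, uint32_t addr, uint32_t val) {
    assert(sb->count < SB_CAP);    /* a real SB would stall the processor instead */
    int tail = (sb->head + sb->count) % SB_CAP;
    sb->buf[tail] = (SBEntry){ addr, val };
    sb->count++;
}

/* concurrently, the oldest queued write leaves the SB and is applied to memory */
static void sb_drain_one(StoreBuffer *sb) {
    if (sb->count == 0) return;
    SBEntry e = sb->buf[sb->head];
    sb->head = (sb->head + 1) % SB_CAP;
    sb->count--;
    memory[e.addr] = e.val;
}

/* a read by P: newest matching SB entry if any ("SB forwarding"), else memory */
static uint32_t sb_read(const StoreBuffer *sb, uint32_t addr) {
    for (int i = sb->count - 1; i >= 0; i--) {
        const SBEntry *e = &sb->buf[(sb->head + i) % SB_CAP];
        if (e->addr == addr)
            return e->val;
    }
    return memory[addr];
}

/* a flush drains P's entire SB (the expensive operation) */
static void sb_flush(StoreBuffer *sb) {
    while (sb->count > 0)
        sb_drain_one(sb);
}

note how sb_read prefers the SB: a processor sees its own writes before anyone else does, which is the source of all the trouble on the next slide.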

TSO is not SC

TSO is not SC, because of the delay in writes becoming visible to other processors; e.g., in the two-processor program below, both Ps can complete under TSO, but not under SC (whichever thread writes second gets stuck)
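
the transcript drops the actual P0/P1 code; the classic store-buffering litmus test matches the slide's description, so here is a plausible reconstruction (the deck's exact code may differ):

/* shared locations, initially x == y == 0 */
volatile int x, y;

/* P0 */                         /* P1 */
x = 1;                           y = 1;
while (y != 0) { /* spin */ }    while (x != 0) { /* spin */ }

under TSO, both writes can still be sitting in their SBs when the reads execute; each read misses its own SB (it is to a different location) and reads 0 from memory, so both loops exit. under SC, whichever write hits memory second is followed by a read that sees the other flag already set, so that processor spins forever.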

a simple SC discipline

make sure that P reads only when P's SB is empty
– writes dirty the SB; flushes clean it
– read allowed only when the SB is clean
– (lazy caching uses a similar trick to achieve SC)
proof of SC:
– each P simulates a virtual P (that might fall behind)
– virtual P takes a write step when that write hits memory
– real and virtual P are in sync on read steps
but this discipline isn't practical
– disjoint concurrency shouldn't require any flushes!
idea: distinguish private and shared memory
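
a minimal sketch of what the simple discipline means operationally, using an x86 fence intrinsic to drain the SB (illustrative, not from the paper):

#include <emmintrin.h>   /* _mm_mfence: orders prior stores before later loads */

volatile int shared_data;

/* a write just dirties this processor's SB */
void simple_write(int v) {
    shared_data = v;
}

/* under the simple discipline, every read must find the SB clean,
   so the naive implementation flushes first -- on every single read */
int simple_read(void) {
    _mm_mfence();
    return shared_data;
}

flushing on every read is exactly what makes this discipline correct but impractical.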

ownership

each location can be either owned (by a unique processor) or unowned
each access is volatile or nonvolatile
modified discipline:
– nonvolatile access requires ownership of the location
– volatile writes dirty the SB
– volatile reads allowed only when SB is clean
simulation proof is similar, but nonvolatile accesses happen as soon as there are no volatile writes in front of them
– they're guaranteed to see the same values when they hit the SB, because other Ps don't modify owned locations
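
since conformance is checked against SC executions only (see "key points" below), the modified discipline can be phrased as a simple runtime checker; a hedged sketch, with every name invented for illustration:

#include <assert.h>
#include <stdbool.h>

#define NPROCS  8      /* illustrative */
#define NLOCS   256    /* illustrative */
#define UNOWNED (-1)

static int  owner[NLOCS];       /* owning processor, or UNOWNED */
static bool sb_dirty[NPROCS];   /* does P have buffered volatile writes? */

void checker_init(void) {
    for (int i = 0; i < NLOCS; i++) owner[i] = UNOWNED;
}

/* nonvolatile access requires ownership of the location */
void check_nonvolatile(int p, int loc) {
    assert(owner[loc] == p);
}

/* volatile writes are always allowed; they dirty P's SB */
void check_volatile_write(int p) {
    sb_dirty[p] = true;
}

/* volatile reads are allowed only when P's SB is clean */
void check_volatile_read(int p) {
    assert(!sb_dirty[p]);
}

/* a flush cleans P's SB */
void note_flush(int p) {
    sb_dirty[p] = false;
}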

moving ownership around

use ghost operations to take and release ownership
– P can take ownership of unowned locations
– P can release ownership of locations P owns
(this fits with ownership in VCC, where "unowned" means owned by a data object rather than a thread)
discipline in the paper also adds unowned read-only locations, which allows shared non-volatile reading

ex: spinlocks

typedef … struct _SPIN_LOCK {
    volatile int Lock;
    _(ghost \object prot_obj;)
    _(invariant !Lock ==> \mine(prot_obj))
} SPIN_LOCK;

void Acquire(SPIN_LOCK *SpinLock …) … {
    int stop;
    do {
        … { // atomic
            stop = (__interlockedcompareexchange(&SpinLock->Lock, 1, 0) == 0);
            _(if (stop) \giveup_closed_owner(SpinLock->prot_obj, SpinLock);)
        }
    } while (!stop);
}
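
for contrast, the matching release would be a single volatile write; a minimal sketch (not from the slides; the ghost transfer of prot_obj back to the lock is elided because the exact VCC operation isn't shown):

void Release(SPIN_LOCK *SpinLock) {
    // ghost: transfer prot_obj back to the lock (VCC operation elided)
    SpinLock->Lock = 0;   // volatile write: dirties the SB; no flush required here
}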

key points

discipline follows some basic VCC methodology
– discipline expressed in terms of ghost state
– ghost code "witnesses" conformance to the discipline (much as ghost code is used to witness simulations)
– by replacing proof obligations with programming obligations, we're more likely to get programmers to do it
when checking the discipline, we get to assume an SC execution, so we never have to think about the SBs

the only tricky part of the proof

key observation: ownership changes cannot race on their own
– if they do, there are executions that violate the discipline
therefore, we can pretend that ownership doesn't get released until the next volatile write

a note on ghosts

VCC requires lots of ghost code, including racy operations on volatile ghost state
why doesn't this introduce flushing?

    SC code follows discipline on real data
=>  { SC stripped code simulates SC code }
    SC stripped code follows discipline on real data
=>  { reduction theorem }
    stripped code simulates SC stripped code
=>  { SC stripped code simulates SC code }
    stripped code simulates SC code

how close is this to practice?

discipline followed almost everywhere in the Hv (Hyper-V) codebase
– even non-interlocked volatile writes are fairly rare
exceptions (outside of device ops) are writes where
– the write doesn't race with other writes
– racing reads can safely read the old value
– ex: releasing a spinlock, broadcasting signals
a solution: introduce a new kind of volatile
– one reader, multiple writers
– keep track of an upper and lower bound
– writes must be above the upper bound
– writes raise the upper bound
– a flush raises the lower bound to the upper bound
– reads by other processors raise the lower bound to the value read
– this works, but is kind of gross (sketched below)
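
a hedged sketch of the bound-tracking ghost state for this new kind of volatile (all names are mine, invented for illustration):

#include <assert.h>

/* ghost state for one single-reader, multiple-writer volatile location */
typedef struct {
    int lower;   /* every reader has seen at least this value */
    int upper;   /* the largest value written so far */
} Bounds;

/* writes must be above the upper bound, and they raise it */
void ghost_write(Bounds *b, int v) {
    assert(v > b->upper);
    b->upper = v;
}

/* a flush raises the lower bound to the upper bound */
void ghost_flush(Bounds *b) {
    b->lower = b->upper;
}

/* a read by another processor returns some value between the bounds,
   and raises the lower bound to the value it read */
void ghost_read(Bounds *b, int v_read) {
    assert(b->lower <= v_read && v_read <= b->upper);
    b->lower = v_read;
}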