The C++11 Memory Model CDP

Slides:



Advertisements
Similar presentations
Threads Cannot be Implemented As a Library Andrew Hobbs.
Advertisements

Memory Consistency Models Kevin Boos. Two Papers Shared Memory Consistency Models: A Tutorial – Sarita V. Adve & Kourosh Gharachorloo – September 1995.
Relaxed Consistency Models. Outline Lazy Release Consistency TreadMarks DSM system.
ECE 454 Computer Systems Programming Parallel Architectures and Performance Implications (II) Ding Yuan ECE Dept., University of Toronto
D u k e S y s t e m s Time, clocks, and consistency and the JMM Jeff Chase Duke University.
Slides 8d-1 Programming with Shared Memory Specifying parallelism Performance issues ITCS4145/5145, Parallel Programming B. Wilkinson Fall 2010.
“THREADS CANNOT BE IMPLEMENTED AS A LIBRARY” HANS-J. BOEHM, HP LABS Presented by Seema Saijpaul CS-510.
6/10/2015C++ for Java Programmers1 Pointers and References Timothy Budd.
Chapter 6. 2 Objectives You should be able to describe: Function and Parameter Declarations Returning a Single Value Pass by Reference Variable Scope.
1 Sharing Objects – Ch. 3 Visibility What is the source of the issue? Volatile Dekker’s algorithm Publication and Escape Thread Confinement Immutability.
CS510 Concurrent Systems Class 5 Threads Cannot Be Implemented As a Library.
Memory Consistency Models Some material borrowed from Sarita Adve’s (UIUC) tutorial on memory consistency models.
IT253: Computer Organization Lecture 4: Instruction Set Architecture Tonga Institute of Higher Education.
A First Book of C++: From Here To There, Third Edition2 Objectives You should be able to describe: Function and Parameter Declarations Returning a Single.
CDP 2012 Based on “C++ Concurrency In Action” by Anthony Williams and The C++11 Memory Model and GCC WikiThe C++11 Memory Model and GCC Created by Eran.
CDP 2013 Based on “C++ Concurrency In Action” by Anthony Williams, The C++11 Memory Model and GCCThe C++11 Memory Model and GCC Wiki and Herb Sutter’s.
Foundations of the C++ Concurrency Memory Model Hans-J. Boehm Sarita V. Adve HP Laboratories UIUC.
Shared Memory Consistency Models. SMP systems support shared memory abstraction: all processors see the whole memory and can perform memory operations.
Memory Consistency Models. Outline Review of multi-threaded program execution on uniprocessor Need for memory consistency models Sequential consistency.
Multiprocessor Cache Consistency (or, what does volatile mean?) Andrew Whitaker CSE451.
CHARLES UNIVERSITY IN PRAGUE faculty of mathematics and physics Advanced.NET Programming I 5 th Lecture Pavel Ježek
Thread basics. A computer process Every time a program is executed a process is created It is managed via a data structure that keeps all things memory.
1 Becoming More Effective with C++ … Day Two Stanley B. Lippman
ICOM 4035 – Data Structures Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 – August 23, 2001.
Week 9, Class 3: Java’s Happens-Before Memory Model (Slides used and skipped in class) SE-2811 Slide design: Dr. Mark L. Hornick Content: Dr. Hornick Errors:
The C++11 Memory Model CDP Based on “C++ Concurrency In Action” by Anthony Williams, The C++11 Memory Model and GCCThe C++11 Memory Model and GCC Wiki.
Announcements Assignment 2 Out Today Quiz today - so I need to shut up at 4:25 1.
1 Programming with Shared Memory - 3 Recognizing parallelism Performance issues ITCS4145/5145, Parallel Programming B. Wilkinson Jan 22, 2016.
Chapter 4: Threads Modified by Dr. Neerja Mhaskar for CS 3SH3.
Symmetric Multiprocessors: Synchronization and Sequential Consistency
Information and Computer Sciences University of Hawaii, Manoa
Efficient Software-Based Fault Isolation
EGR 2261 Unit 11 Pointers and Dynamic Variables
Threads & Multithreading
Course Contents KIIT UNIVERSITY Sr # Major and Detailed Coverage Area
Advanced .NET Programming I 11th Lecture
User-Written Functions
Distributed Shared Memory
Friend Class Friend Class A friend class can access private and protected members of other class in which it is declared as friend. It is sometimes useful.
Advanced OS Concepts (For OCR)
Memory Consistency Models
Threads Cannot Be Implemented As a Library
Atomic Operations in Hardware
Atomic Operations in Hardware
Memory Consistency Models
Run-time organization
Lecture 5: GPU Compute Architecture
Threads and Memory Models Hal Perkins Autumn 2011
Phil Tayco Slide version 1.0 Created Oct 2, 2017
Object Oriented Programming COP3330 / CGS5409
Variables In programming, we often need to have places to store data. These receptacles are called variables. They are called that because they can change.
Lecture 5: GPU Compute Architecture for the last time
The C++ Memory model Implementing synchronization)
Implementing synchronization
Threads and Memory Models Hal Perkins Autumn 2009
Functions Pass By Value Pass by Reference
Dr. Mustafa Cem Kasapbaşı
Java Concurrency 17-Jan-19.
Shared Memory Consistency Models: A Tutorial
Mastering Memory Modes
Programming with Shared Memory Specifying parallelism
Java Concurrency.
CSE 153 Design of Operating Systems Winter 19
CS333 Intro to Operating Systems
Java Concurrency.
Relaxed Consistency Finale
Programming with Shared Memory - 3 Recognizing parallelism
Programming with Shared Memory Specifying parallelism
Java Concurrency 29-May-19.
Problems with Locks Andrew Whitaker CSE451.
Presentation transcript:

The C++11 Memory Model CDP Based on “C++ Concurrency In Action” by Anthony Williams, The C++11 Memory Model and GCC Wiki and Herb Sutter’s atomic<> Weapons talk. Created by Eran Gilad

Reminder: what is a memory model? The guarantees provided by the runtime environment to a multithreaded program, regarding the order of memory operations Each level of the environment might have a different memory model CPU, virtual machine, compiler The correctness of parallel algorithms depends on the memory model

Why C++ needs a memory model Isn’t the CPU memory model enough? Until now, different code was required for every compiler, OS and CPU combination Threads are now part of the language (C++11) Their behavior must be fully defined The same code should run the same on different environments regardless of the CPU (Intel, AMD, ARM, …) or the OS Having a standard guarantees portability and simplifies the programmer’s work Dependencies examples: Compiler – Different optimizations in general. VS provides special semantics to volatile on some cases. OS – Different threading facilities, Win32 and pthreads for instance. CPU – Pretty strict memory model on x86, weak on ARM. Having a standard basically passes the burden from the programmer to the compiler.

C++ in the past and now Given the following data race: Pre-2011 C++ Thread 1: x = 10; y = 20; Thread 2: if (y == 20) assert(x == 10); Pre-2011 C++ What’s a “Thread”? Namely – behavior is unspecified (not even undefined) C++11 Undefined Behavior (don’t try the above code at home) But now, C++11 introduces a new set of tools to get it right Pre-C++11 had no mention of threads in the standard. Everything was implementation-dependent. Undefined behavior is a clear statement about situations the programmer should never reach. It’s not an “I have no clue if it’s ok”. It’s a “don’t you dare!”.

Why Undefined Behavior matters Can the compiler convert the switch to a quick jump table, based on x’s value? Optimizations that are safe on sequential programs might not be safe on parallel ones So, races are forbidden! At this point, on some other thread: x = 8; if (x >= 0 && x < 3) { switch (x) { case 0: do0(); break; case 1: do1(); break; case 2: do2(); break; } Switch jumps to unknown location. Your monitor catches fire. The compiler knows x is bound between 0 and 2, so instead of creating a series of if’s, it can simply jump to (switch base address + x * constant). But if x is shared and unprotected, it might be changed between the if and the switch, causing an unrestricted jump, which can cause unpredictable damage… Races are forbidden because the compiler can’t always recognize them (static analysis in not required by the standard), and to avoid them the programmer must help the compiler. In the example above, even if x were somehow protected, it could still be modified between the if and the switch. This would have caused the code to behave differently from the serial implementation. However, had the compiler known x might be modified by another thread, it would not have used the unsafe jump table optimization.

Dangerous optimizations Optimizations are crucial for increasing performance, but might be dangerous Compiler optimizations: Reordering to hide load latencies Using registers to avoid repeating loads and stores CPU optimizations: Loose cache coherence protocols Out Of Order Execution Thumb rule: a thread must appear to execute serially to other threads Optimizations must not be visible to other threads Single-threaded programs are fairly easy to optimize, because there’s no “external viewer” (other than the output). So most of the time they execute “atomically”. Shared variables, however, might be viewed by different threads at (almost) any given moment. So, their modifications by one thread must seem to be ordered to other threads. The listed optimizations are just an example – obviously there are many more. Notice that in c++11 some optimizations can be seen by other threads, which is fully defined by c++11’s memory model

Memory layout definitions Every variable of a scalar (simple) type occupies one memory location Bit fields are different One memory location must not be affected by writes to an adjacent memory location! Can’t read and write all 4 bytes when changing just one struct s { char c[4]; int i: 3, j: 4; struct in { double d; } id; }; A memory location does not depend on the size of the type, and can be part of an array or struct/class. Bit fields are different – several might occupy a single memory location. About the example: writing single bytes can be inefficient on some systems. It could be faster to read the whole array of chars, for instance, as a single 32-bit word, update one element, and write the whole word back. But that might introduce a race on the other elements! This is similar to having two variables on the same cache line. Updating one must not override an update of the other. This requires non-trivial synchronization, and a performance overhead. This wasn’t a problem on single threaded programs. In Java, update of a variable smaller or equal to a dword should not affect an adjacent variable (like C++11). However, longs and doubles are not guaranteed to be atomic (unless defined as volatile, in which case they are always atomic). This is due to the fact that it is a multi-step operation at the machine code level. Note: int i:3 means that the compiler will use 3 bits to store i. Same memory location Considered one memory locaction even if hardware has no 64-bit atomic ops

Threads in C++11 (std::mutex) #include <iostream> #include <thread> void func(mutex * m, int * x) { m->lock(); std::cout << "Inside thread " << ++(*x) << std::endl; m->unlock(); } int main() { mutex m; int x = 0; std::thread th(func, &m, &x); // or std::ref(m), std::ref(x) th.join(); std::cout << "Outside thread " << x << std::endl; return 0; std::thread and std:mutex is used as we know it thread_local (like static, extern, auto, register, etc…) – define the variable to be local to each thread. The object is allocated when the thread begins and deallocated when the thread ends.

Threads in C++11 (std::atomic) std::atomic provides both atomicity and ordering guarantees Default order is sequential consistency Other orders are also possible The language offers specializations for basic types Basic types: bool, numbers, pointers Custom wrappers can be created by the programmer (i.e. template) A mutex is sometimes required to ensure atomic reads and writes Most architectures provide atomic primitives for up to a dword In case the variable is too large, a mutex must be used

Threads in C++11 (std::atomic definition) template<typename T> struct atomic { void store(T val, memory_order m = memory_order_seq_cst); T load(memory_order m = memory_order_seq_cst); // etc... }; // and on some specializations of atomic: T operator++(); T operator+=(T val); bool compare_exchange_strong(T& expected, T desired, memory_order m = memory_order_seq_cst);

Order Definitions: Sequenced Before The order imposed by the appearance of (most) statements in the code, for a single thread Thread 1 read x, 1 write y, 2 Sequenced Before On some occasions, the sequence of execution is not specified (such as function argument evaluation). This cases are therefore unordered with respect to each other. For instance, in f(h(), g()) it is unspecified whether h() or g() will be executed first. Yet, both h() and g() are always sequenced before f().

Order Definitions: Synchronized with The order imposed by an atomic read of a value that has been atomically written by some other thread Thread 1 write y, 2 Synchronized with Thread 2 read y, 2 The said synchronization does not rely on any synchronization object. It means that when E1 synchronizes with E2, every write done before E1 should be visible to reads done after E2; and no write done after E1 should be visible to reads done before E2. This defines an order that crosses thread boundaries.

Order Definitions: Inter-thread happens-before A combination of sequenced before and synchronized with Thread 1 read x, 1 write y, 2 Thread 2 read y, 2 write z, 3 Sequenced Before Synchronized With Sequenced Before Here, the events in thread 1 inter-thread happen-before the events in thread 2.

Order Definitions: Happens Before An event A happens-before an event B if: A inter-thread happens-before B or A is sequenced-before B Thread 1 (1) read x, 1 (2) write y, 2 Thread 2 (3) read y, 2 (4) write z, 3 Java has similar order definitions (using slightly different wording): happens before is determined by (intra-thread) program order and synchronized-with actions. Happens before Happens before Happens before

Using atomics Memory Order: Sequential Consistency Strongest ordering, default memory order Atomic writes are globally ordered Compiler and processor optimizations are limited No action can be reordered across an atomic action Non-atomic actions can be safely reordered as long as the reorder doesn’t cross an atomic command Atomic writes flush non-atomic writes that are sequenced-before them They are externally visible after the atomic action If the hardware doesn’t already guarantee sequential consistency, a barrier is required after every write. Non-atomic actions take place in the happen-before relation due to the sequenced-before relation, and the inter-thread happens before relation imposed by adjacent atomic operations. Example in next slide.

Using atomics Memory Order: Sequential Consistency (2) atomic<int> x; int y; // not atomic void thread1() { y = 1; x.store(2); } void thread2() { if (x.load() == 2) assert(y == 1); } The blue arrows are sequenced-before. The green arrow is synchronized-with. Together they form an inter-thread happens-before relation, which guarantees the write to y happens-before reading it. Note that y isn’t atomic, and in general not safe to use as shared variable. However, in this particular scenario, its accesses are synchronized by the use of the atomic x. The assert is guaranteed not to fail

This was the part you have to know in order to write correct code Now, if you like to play with sharp tools…

Using atomics Memory Order: Acquire Release Synchronized-with relation exists only between the releasing thread and the acquiring thread Other threads might see updates in a different order All writes before a release are visible after an acquire Even if they were relaxed or non-atomic Similar to release consistency memory model

Using atomics Memory Order: Acquire Release (2) #define rls memory_order_release /* save slide space… */ #define acq memory_order_acquire /* save slide space… */ atomic<int> x, y; void thread1() { y.store(20, rls); } void thread3() { assert(y.load(acq) == 20 && x.load(acq) == 0); } The writes are on different threads. Each acquiring thread is synchronized with the thread that has done the release. Thus, the releases are unordered. If sequential consistency had been used instead of acquire-release, the results would have been different. There would have been a total order of writes, so if one of the asserts had succeeded, the other must have failed. void thread4() { assert(y.load(acq) == 0 && x.load(acq) == 10); } void thread2() { x.store(10, rls); } Both asserts might succeed No order between writes to x and y

Using atomics Memory Order: Relaxed Consistency Happens-before now only exists between actions to the same variables Similar to cache coherence But can be used on systems with incoherent caches Imposes fewer limitations on the compiler Atomic actions can now be reordered if they access different memory locations Less synchronization is required on the hardware side as well In this model, two processors might see writes to two variables (done by some third processor) in a different order. This is where C++ atomic differs from Java volatile: Java never relaxes program order between volatile variables, while C++ allows reordering of (and across) atomic variables.

Using atomics Memory Order: Relaxed Consistency (2) #define rlx memory_order_relaxed /* save slide space… */ atomic<int> x, y; void thread3() { if (y.load(rlx) == 10) assert(x.load(rlx) == 10); } void thread1() { y.store(20, rlx); x.store(10, rlx); } R y, 10 W y, 20 R x, 0 W x, 10 void thread2() { if (x.load(rlx) == 10) { assert(y.load(rlx) == 20); y.store(10, rlx); } Notice: the sequenced before was removed. Both asserts might fail because the stores on thread 1 are not ordered, and different threads might see them in a different order. One way for this to happen is if thread 3 has a speculative load of x before the if. When this is the case, x might be read before threads 1 and 2 started to execute. Important: the rlx define is used just to reduce text in the presentation. There’s absolutely no reason to use this in real code! R x, 10 R y, 20 W y, 10 Both asserts might fail No order between operations on x and y

Using atomics Memory Order: Relaxed Consistency (3) Let’s look at the trace of a run in which the assert failed T1: W y, 20; W x, 10; T2: R x, 10; R y, 20; W y, 10; T3: R y, 10; R x, 0;

Using atomics Memory Order: Relaxed Consistency (4) Let’s look at the trace of a run in which the assert failed T1: W y, 20; W x, 10; T2: R x, 10; R y, 20; W y, 10; T3: R y, 10; R x, 0;

Using atomics Memory Order: Relaxed Consistency (5) Let’s look at the trace of a run in which the assert failed T1: W y, 20; W x, 10; T2: R x, 10; R y, 20; W y, 10; T3: R y, 10; R x, 0;

volatile in multi-threaded C++ Predates C++ multi-threading Serves different purpose Intended for reading from memory that was written by a device Does not guarantee atomic Unspecified memory order However Java (and C#) volatile is very similar to C++ atomic volatile was actually the only option in pre-C++11 code, but other than avoiding the use of a register it has no multi-threading semantics, because it was introduced way before C++ supported multi-threading. It does usually do the trick, but not due to some standard guarantee. Some compilers (VS) add ordering semantics (release consistency) on certain CPUs (Itanium) as an extension – but we’re talking about standard C++ here.

Summary and recommendations (1) Lock-free code is hard to write Unless speed is crucial, better use mutexes Sequential consistency is the default It is also the memory model for function that does not have memory-model as an input parameter Use the default sequential consistency and avoid data races. Really. Acquire-release is hard Relaxed is much harder Mixing orders is insane (but allowed)

Summary and recommendations (2) The memory model can be passed as run-time parameter But it is better to use it as a compilation constant to allow optimizations We only covered main orders C++11’s basic features are fully supported by all common compilers Current (end of 2012) compiler support status of the presented C++11 threading features: VS2010 – no VS2012 – yes Cygwin official distro (4.7) – no MinGW official distro (4.7) – no MinGW nuwen distro (4.7) – yes Linux GCC (4.7) – yes In 2016, most common compilers support atomic operations, thread-local storage and etc. as stated here: http://en.cppreference.com/w/cpp/compiler_support