The C++11 Memory Model (CDP). Based on “C++ Concurrency In Action” by Anthony Williams and the “The C++11 Memory Model and GCC” Wiki.


The C++11 Memory Model (CDP). Based on “C++ Concurrency In Action” by Anthony Williams, the “The C++11 Memory Model and GCC” Wiki, and Herb Sutter’s “atomic<> Weapons” talk. Created by Eran Gilad.

Reminder: what is a memory model? The guarantees provided by the runtime environment to a multithreaded program regarding the order of memory operations. Each level of the environment (CPU, virtual machine, compiler) might have a different memory model. The correctness of parallel algorithms depends on the memory model.

Why C++ needs a memory model. Isn’t the CPU memory model enough? Until now, different code was required for every compiler, OS and CPU combination. Threads are now part of the language (C++11), so their behavior must be fully defined. The same code should run the same in different environments, regardless of the CPU (Intel, AMD, ARM, …) or the OS. Having a standard guarantees portability and simplifies the programmer’s work.

C++ in the past and now. Given the following data race:

    Thread 1: x = 10; y = 20;
    Thread 2: if (y == 20) assert(x == 10);

Pre-2011 C++: what’s a “thread”? The standard does not mention threads at all, so the behavior is not even classified as undefined; it is simply unspecified. C++11: undefined behavior (don’t try the above code at home). But now C++11 introduces a new set of tools to get it right.

Why undefined behavior matters. Can the compiler convert the switch to a quick jump table, based on x’s value? Optimizations that are safe on sequential programs might not be safe on parallel ones. So, races are forbidden!

    if (x >= 0 && x < 3) {
        // At this point, on some other thread: x = 8;
        switch (x) {
            case 0: do0(); break;
            case 1: do1(); break;
            case 2: do2(); break;
        }
    }

If x is read again for the jump table after the range check has passed, the switch jumps to an unknown location. Your monitor catches fire.
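One way to make the dispatch above race-free can be sketched as follows, assuming the selector is turned into a std::atomic and read exactly once into a local. The names `selector` and `dispatch`, and the return values standing in for do0()/do1()/do2(), are hypothetical and not from the slides.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical shared selector; in the slide this was the plain variable x.
std::atomic<int> selector{0};

// Reads the selector once, then range-checks and dispatches on the local copy.
// Another thread may store to `selector` at any time without affecting `local`.
int dispatch() {
    const int local = selector.load();
    if (local >= 0 && local < 3) {
        switch (local) {
            case 0: return 100;  // stands in for do0()
            case 1: return 101;  // stands in for do1()
            case 2: return 102;  // stands in for do2()
        }
    }
    return -1;  // out of range: nothing to do
}
```

Because the range check and the switch both use the same local copy, a jump table is safe again regardless of what other threads write.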

Dangerous optimizations. Optimizations are crucial for increasing performance, but might be dangerous. Compiler optimizations: reordering to hide load latencies; using registers to avoid repeating loads and stores. CPU optimizations: loose cache coherence protocols; out-of-order execution. A thread must appear to execute serially to other threads: optimizations must not be visible to other threads.

Memory layout definitions. Every variable of a scalar (simple) type occupies one memory location. Bit fields are different: adjacent bit fields share one memory location. One memory location must not be affected by writes to an adjacent memory location!

    struct s {
        char c[4];        // four memory locations: can’t read and write all 4 bytes when changing just one
        int i: 3, j: 4;   // same memory location
        struct in {
            double d;     // considered one memory location even if the hardware has no 64-bit atomic ops
        } id;
    };
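As a small illustration of the rule above, the following sketch has two threads write to adjacent array elements without any synchronization. Since c[0] and c[1] are distinct memory locations, this is not a data race. The function name `write_adjacent_bytes` is made up for this example, and the struct is a trimmed copy of the one on the slide.

```cpp
#include <cassert>
#include <thread>

struct s {
    char c[4];         // each element is its own memory location
    int i : 3, j : 4;  // adjacent bit fields: one shared memory location
};

// Two threads write different elements of c concurrently.
// Returns true if both writes are present after the threads are joined.
bool write_adjacent_bytes() {
    s v{};
    std::thread t1([&v] { v.c[0] = 'a'; });
    std::thread t2([&v] { v.c[1] = 'b'; });
    t1.join();
    t2.join();
    return v.c[0] == 'a' && v.c[1] == 'b';
}
```

Doing the same with the bit fields i and j would be a data race, because they occupy the same memory location.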

Threads in C++11 (std::mutex)

    #include <iostream>
    #include <mutex>
    #include <thread>

    void func(std::mutex * m, int * x) {
        m->lock();
        std::cout << "Inside thread " << ++(*x) << std::endl;
        m->unlock();
    }

    int main() {
        std::mutex m;
        int x = 0;
        std::thread th(&func, &m, &x);
        th.join();
        std::cout << "Outside thread " << x << std::endl;
        return 0;
    }
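The slide locks and unlocks by hand. A common variation, sketched here under the assumption that several threads share one counter, uses std::lock_guard so the mutex is released even on early exit. The function name `count_with_lock_guard` and its parameters are hypothetical.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Starts num_threads workers; each increments a shared counter `increments`
// times while holding the mutex. Returns the final counter value.
int count_with_lock_guard(int num_threads, int increments) {
    std::mutex m;
    int counter = 0;  // plain int: protected by m
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < increments; ++i) {
                std::lock_guard<std::mutex> guard(m);  // unlocks at end of scope
                ++counter;
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

For example, count_with_lock_guard(4, 1000) returns 4000, since every increment happens under the lock.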

Threads in C++11 (std::atomic). std::atomic provides both atomicity and ordering guarantees. The default order is sequential consistency; other orders are also possible. The language offers specializations for basic types (bool, numbers, pointers). Custom wrappers can be created by the programmer (i.e. through the template). A mutex is sometimes required to ensure atomic reads and writes: most architectures provide atomic primitives for up to a word, so when the variable is too large, a mutex must be used.
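A minimal sketch of the points above, assuming a shared counter of a basic type: the increment on std::atomic<int> is an atomic read-modify-write with the default sequentially consistent order, so no mutex is needed. The function name `count_with_atomic` is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Starts num_threads workers; each increments a shared atomic counter
// `increments` times. Returns the final counter value.
int count_with_atomic(int num_threads, int increments) {
    std::atomic<int> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < increments; ++i) {
                ++counter;  // atomic increment, memory_order_seq_cst by default
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter.load();
}
```

Whether a given std::atomic<T> avoids an internal lock can be queried at run time with its is_lock_free() member.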

Threads in C++11 (std::atomic definition)

    template <typename T>
    struct atomic {
        void store(T val, memory_order m = memory_order_seq_cst);
        T load(memory_order m = memory_order_seq_cst) const;
        // etc...

        // and on some specializations of atomic:
        T operator++();
        T operator+=(T val);
        bool compare_exchange_strong(T& expected, T desired,
                                     memory_order m = memory_order_seq_cst);
        // etc...
    };
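To show how compare_exchange_strong from the definition above is typically used, here is a sketch of a retry loop that raises an atomic value to a maximum. The helper name `atomic_store_max` is hypothetical; only the std::atomic member functions are part of the standard library.

```cpp
#include <atomic>
#include <cassert>

// Raises `target` to at least `candidate` using a compare-and-swap loop.
void atomic_store_max(std::atomic<int>& target, int candidate) {
    int observed = target.load();
    // Retry while the stored value is still smaller than the candidate.
    // On failure, compare_exchange_strong writes the current value into
    // `observed`, so the loop condition is re-evaluated against fresh data.
    while (observed < candidate &&
           !target.compare_exchange_strong(observed, candidate)) {
    }
}
```

The loop ends either when the swap succeeds or when another thread has already stored a value that is at least as large.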

Order definitions: sequenced-before. The order imposed by the appearance of (most) statements in the code, for a single thread. Diagram: in Thread 1, (read x, 1) is sequenced before (write y, 2).

Order definitions: synchronized-with. The order imposed by an atomic read of a value that has been atomically written by some other thread. Diagram: Thread 1 (write y, 2) is synchronized with Thread 2 (read y, 2).

Order definitions: inter-thread happens-before. A combination of sequenced-before and synchronized-with. Diagram: in Thread 1, (read x, 1) is sequenced before (write y, 2); (write y, 2) is synchronized with (read y, 2) in Thread 2; in Thread 2, (read y, 2) is sequenced before (write z, 3).

Order definitions: happens-before. An event A happens-before an event B if A inter-thread happens-before B, or A is sequenced-before B. Diagram: Thread 1 performs (1) read x, 1 and (2) write y, 2; Thread 2 performs (3) read y, 2 and (4) write z, 3; (1) happens before (4).
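The four numbered events from the diagram can be sketched as a small program. This is an illustration under assumptions: y is made a std::atomic<int>, thread 2 spins until it reads 2, and the extra plain variable `seen_x` (not on the slide) lets thread 2 observe the result of event (1). The function name is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the two threads from the diagram and returns the final value of z.
int run_happens_before_chain() {
    int x = 1;       // plain variable, written before the threads start
    int seen_x = 0;  // plain variable: written by thread 1, read by thread 2
    int z = 0;       // plain variable, written by thread 2
    std::atomic<int> y{0};

    std::thread t1([&] {
        seen_x = x;  // (1) read x, 1
        y.store(2);  // (2) write y, 2
    });
    std::thread t2([&] {
        while (y.load() != 2) {  // (3) read y, 2: synchronizes with (2)
            std::this_thread::yield();
        }
        assert(seen_x == 1);  // safe: (1) happens-before this read
        z = 3;                // (4) write z, 3
    });
    t1.join();
    t2.join();
    return z;
}
```

The plain accesses to seen_x are not a data race because (1) is sequenced before (2), (2) is synchronized with (3), and (3) is sequenced before the read in thread 2.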

Using atomics. Memory order: sequential consistency. Strongest ordering, and the default memory order. Atomic writes are globally ordered. Compiler and processor optimizations are limited: no action can be reordered across an atomic action. Non-atomic actions can be safely reordered as long as the reordering doesn’t cross an atomic command. Atomic writes flush non-atomic writes that are sequenced-before them, so those writes are externally visible after the atomic action.

Using atomics. Memory order: sequential consistency (2)

    atomic<int> x;
    int y; // not atomic
    void thread1() { y = 1; x.store(2); }
    void thread2() { if (x.load() == 2) assert(y == 1); }

The assert is guaranteed not to fail.
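The global order of sequentially consistent operations can also be illustrated with the classic store-then-load test, which is not on the slides: each thread stores to its own variable and then loads the other. With the default order, at least one thread must see the other’s store. The function name below is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// One round of the store/load test with the default (seq_cst) order.
// Returns true only if both threads read 0, which the single global
// order of seq_cst operations rules out.
bool both_read_zero_once() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;
    std::thread a([&] { x.store(1); r1 = y.load(); });
    std::thread b([&] { y.store(1); r2 = x.load(); });
    a.join();
    b.join();
    return r1 == 0 && r2 == 0;
}
```

With weaker orders on the stores and loads, the outcome r1 == 0 and r2 == 0 would be allowed.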

This was the part you have to know in order to write correct code. Now, if you like to play with sharp tools…

Using atomics. Memory order: acquire-release. The synchronized-with relation exists only between the releasing thread and the acquiring thread. Other threads might see updates in a different order. All writes before a release are visible after an acquire, even if they were relaxed or non-atomic. Similar to the release consistency memory model.

Using atomics. Memory order: acquire-release (2)

    #define rls memory_order_release /* save slide space… */
    #define acq memory_order_acquire /* save slide space… */
    atomic<int> x(0), y(0);
    void thread1() { y.store(20, rls); }
    void thread2() { x.store(10, rls); }
    void thread3() { assert(y.load(acq) == 20 && x.load(acq) == 0); }
    void thread4() { assert(y.load(acq) == 0 && x.load(acq) == 10); }

Both asserts might succeed: there is no order between the writes to x and y.
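The typical use of acquire-release is message passing between one releasing and one acquiring thread. The following is a sketch under assumptions, with hypothetical names (`pass_message`, `payload`, `ready`): a plain write is published by a release store and read after an acquire load.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// The producer writes a plain int and publishes it with a release store;
// the consumer waits with acquire loads and then reads the plain int.
int pass_message() {
    int payload = 0;                  // non-atomic data
    std::atomic<bool> ready{false};   // publication flag
    int received = -1;

    std::thread producer([&] {
        payload = 42;                                  // plain write
        ready.store(true, std::memory_order_release);  // publish
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        received = payload;  // visible: written before the release
    });
    producer.join();
    consumer.join();
    return received;
}
```

The guarantee covers only this pair of threads; a third thread watching other variables may still see updates in a different order.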

Using atomics. Memory order: relaxed consistency. Happens-before now exists only between actions on the same variable. Similar to cache coherence, but can be used on systems with incoherent caches. Imposes fewer limitations on the compiler: atomic actions can now be reordered if they access different memory locations. Less synchronization is required on the hardware side as well.

Using atomics. Memory order: relaxed consistency (2)

    #define rlx memory_order_relaxed /* save slide space… */
    atomic<int> x(0), y(0);
    void thread1() { y.store(20, rlx); x.store(10, rlx); }
    void thread2() { if (x.load(rlx) == 10) { assert(y.load(rlx) == 20); y.store(10, rlx); } }
    void thread3() { if (y.load(rlx) == 10) assert(x.load(rlx) == 10); }

Both asserts might fail: there is no order between operations on x and y. Example trace: T1: W y, 20; W x, 10. T2: R x, 10; R y, 20; W y, 10. T3: R y, 10; R x, 0.

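Using atomics. Memory order: relaxed consistency (4). Let’s look at the trace of a run in which the assert failed. T1: W y, 20; W x, 10. T2: R x, 10; R y, 20; W y, 10. T3: R y, 10; R x, 0. T3 reads y == 10, which T2 wrote only after reading x == 10, yet T3 still reads x == 0, because relaxed operations on different variables impose no order between threads.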
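Relaxed order is still useful when only the atomicity of a single variable matters. A minimal sketch, assuming an event counter whose total is read only after all threads are joined (the function name is hypothetical):

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Each worker bumps a shared counter with relaxed increments. No ordering
// with other variables is needed; join() provides the synchronization
// before the final read.
int count_events_relaxed(int num_threads, int events_per_thread) {
    std::atomic<int> events{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < events_per_thread; ++i) {
                events.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();
    return events.load(std::memory_order_relaxed);
}
```

No increment is lost because each fetch_add is an atomic read-modify-write on the same variable; what relaxed gives up is ordering relative to other memory locations.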


volatile in multi-threaded C++. Predates C++ multi-threading and serves a different purpose: it is intended for reading from memory that was written by a device. Does not guarantee atomicity, and the memory order is unspecified. However, Java (and C#) volatile is very similar to C++ atomic.
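Where Java code would use a volatile flag to stop a worker, the C++ counterpart is std::atomic<bool>, not volatile bool. A sketch under that assumption, with hypothetical names:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// A worker spins until another thread sets an atomic stop flag.
// Returns true if the worker observed the flag and exited its loop.
bool stop_worker_with_atomic_flag() {
    std::atomic<bool> stop{false};
    std::atomic<bool> started{false};
    bool saw_stop = false;  // plain: written by the worker, read after join()

    std::thread worker([&] {
        started.store(true);
        while (!stop.load()) {
            std::this_thread::yield();
        }
        saw_stop = true;
    });
    while (!started.load()) {
        std::this_thread::yield();
    }
    stop.store(true);  // request the worker to stop
    worker.join();
    return saw_stop;
}
```

With a plain or volatile bool flag, the concurrent write and read would be a data race and therefore undefined behavior in C++11.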

Summary and recommendations. Lock-free code is hard to write; unless speed is crucial, better use mutexes. Sequential consistency is the default. It is also the order used by functions that do not take a memory order as an input parameter. Use the default sequential consistency and avoid data races. Really. Acquire-release is hard. Relaxed is much harder. Mixing orders is insane (but allowed). The memory order can be passed as a run-time parameter, but it is better to use a compile-time constant to allow optimizations. We only covered the main orders. C++11 is new: not all features are supported by common compilers.
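The remark about run-time versus compile-time memory orders can be sketched as follows. Both helper names are hypothetical; the second one fixes the order as a template argument so the compiler always sees a constant.

```cpp
#include <atomic>
#include <cassert>

// Memory order chosen at run time: legal, but the compiler may not know
// which order is used and may have to assume a strong one.
int load_runtime(const std::atomic<int>& a, std::memory_order order) {
    return a.load(order);
}

// Memory order fixed at compile time through a template argument.
template <std::memory_order Order>
int load_fixed(const std::atomic<int>& a) {
    return a.load(Order);
}
```

For example, load_fixed<std::memory_order_acquire>(a) and load_runtime(a, std::memory_order_acquire) return the same value; only the first guarantees the order is a compile-time constant.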