Multicore programming

Multicore programming
Stacks, queues and more: Why order when you can be disordered?
Week 1 - Wednesday
Trevor Brown
http://tbrown.pro

Wrapping up the last lecture
Important definitions:
- Asynchronous shared memory model
- Shared object
- Execution of an algorithm
- Linearizability
Important system concepts:
- Cache coherence, exclusive vs. shared access, and cache invalidations
- False sharing and padding
- Locks and fetch-and-add

Recall: sharded & padded counter
- Each thread has its own sub-counter
- Without padding, false sharing dominates
- With padding, increments are fast
- But... reading is hard!
- Simplicity is valuable! Do we really need a complex, sharded solution?
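To make the trade-off concrete, here is a minimal sketch of the sharded, padded counter described above. The names (`ShardedCounter`, `kShards`) and the fixed shard count are my own illustrative choices, not from the lecture; real implementations size and assign shards more carefully.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Sketch: each thread increments its own cache-line-padded sub-counter,
// so increments never contend. Reading must sum every shard, which is
// O(kShards) and only approximate while increments are in flight.
struct ShardedCounter {
    static constexpr int kShards = 64;   // assumption: at most 64 threads

    struct alignas(64) Shard {           // pad each shard to one cache line
        std::atomic<long> v{0};
    };
    Shard shards[kShards];

    void inc(int tid) {
        shards[tid % kShards].v.fetch_add(1, std::memory_order_relaxed);
    }

    long read() const {                  // the "hard" part: slow, non-atomic
        long sum = 0;
        for (const Shard &s : shards)
            sum += s.v.load(std::memory_order_relaxed);
        return sum;
    }
};
```

The `alignas(64)` is what prevents two shards from sharing a cache line; without it, neighboring sub-counters would falsely share and invalidate each other on every increment.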

Considering use cases: an FAA counter might be good enough
- An FAA-based counter does not truly scale, but its absolute throughput (60 million increments/sec!) might be high enough for your application
- Real applications do more than just increment a single counter
- Avoid unnecessary optimization: figure out whether the counter is a bottleneck first

Warning: padding can hurt
- Union-find data structure: each 8-byte node was padded to 64 bytes
- Removing the padding made it up to 5x faster (throughput measured in operations per microsecond)!
- Why? With many nodes accessed uniformly, contention is rare, so false sharing is rare and padding can't help much
- Meanwhile, padding wastes space: only 1/8th as many nodes fit in cache!
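The space cost in this anecdote is easy to check at compile time. The struct names below are hypothetical stand-ins for the union-find node:

```cpp
#include <cstddef>

// An 8-byte node vs. the same node padded to a 64-byte cache line:
// padding leaves room for 8x fewer nodes per cache line (and in cache).
struct Node              { long parent; };  // 8 bytes
struct alignas(64) PaddedNode { long parent; };  // one full cache line

static_assert(sizeof(Node) == 8, "unpadded node is 8 bytes");
static_assert(sizeof(PaddedNode) == 64, "padded node fills a cache line");
static_assert(64 / sizeof(Node) == 8, "8 unpadded nodes fit per line");
```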

Modeling performance on a real system
- Processor caches largely determine performance; the last-level cache is most important (memory is 10-100x slower)
- Cache lines accessed by only one thread: cheap to access (even exclusive mode creates no contention)
- Read-only cache lines accessed by many threads: cheap to access (shared mode allows concurrency)
- Cache lines accessed by many threads and often modified: expensive to access (exclusive mode blocks other threads)
- Must also be aware of pitfalls like false sharing (what else?)

Next topic: (dis)ordered data structures

Stack object
Operations:
- Push(e): pushes an element e onto the stack
- Pop(): returns the last element pushed onto the stack, if the stack is not empty; otherwise, returns null
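As a reference point for the concurrent versions that follow, here is a sketch of these semantics as a plain sequential stack. `EMPTY` is an assumed sentinel (the slides' code returns `EMPTY` rather than null for an empty stack):

```cpp
#include <vector>

constexpr int EMPTY = -1;  // assumed sentinel for "stack is empty"

// Sequential reference semantics for the stack object.
struct SeqStack {
    std::vector<int> items;

    void push(int e) { items.push_back(e); }

    int pop() {                      // returns the last element pushed
        if (items.empty()) return EMPTY;
        int e = items.back();
        items.pop_back();
        return e;
    }
};
```

A linearizable concurrent stack must behave, in every execution, like some interleaving of operations on this sequential object.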

Naïve stack in C++
Data types:

    struct node {
        const int key;
        node * volatile next;
        node(int _key, node * _next) : key(_key), next(_next) {}
    };

    struct stack {
        node * volatile top;
        stack() : top(NULL) {}
        void push(int key);
        int pop();
    };

Operations:

    void stack::push(int key) {
        node * n = new node(key, top);
        top = n;
    }

    int stack::pop() {
        node * n = top;
        if (n == NULL) return EMPTY;
        top = n->next;
        return n->key;
    }

Example execution
- Push(17), then Push(52), then Push(24): each push reads top, creates a node, and changes top
- Pop(): reads top and sees node(24), changes top to node(52), and returns 24
- Final state: top -> node(52) -> node(17)

What can go wrong?
- After Push(17), thread p runs Push(24) while thread q runs Push(52)
- Thread p: reads top and sees node(17), creates node(24), changes top
- Thread q: reads top and sees node(17), creates node(52), changes top
- Whichever write to top happens second overwrites the other: one of the pushed items is lost!

Another example
- After Push(17) and Push(52), threads p and q both run Pop() concurrently
- Thread p: reads top and sees node(52), reads top->next and sees node(17), changes top to node(17), returns 52
- Thread q: performs exactly the same steps, so both pops return 52: the same item is popped twice!

What's the problem here?
- Algorithmic step 1: read the value(s) that determine what we will write
- Algorithmic step 2: perform the write
- Anything that happens in between is ignored / overwritten!
- Reads and writes alone are not enough

A more powerful primitive: compare-and-swap (CAS)
- Atomic instruction implemented in most modern architectures (even mobile/embedded)
- Idea: a write that succeeds only if a location contains an "expected" value exp
- Semantics (implemented atomically in hardware):

    CAS(addr, exp, new) {
        if (*addr == exp) {
            *addr = new;
            return true;
        }
        return false;
    }
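In portable C++, this primitive is available as `std::atomic<T>::compare_exchange_strong`. A small wrapper (the `cas` name is mine) shows the correspondence; note that the standard-library version additionally writes the observed value back into `exp` on failure, which the slide's pseudocode omits:

```cpp
#include <atomic>

// CAS(addr, exp, new) from the slide, expressed with std::atomic.
// compare_exchange_strong atomically compares *addr with exp and, if
// they match, stores `desired` and returns true; otherwise it returns
// false (and updates exp with the value it saw, which we discard here).
bool cas(std::atomic<int> &addr, int exp, int desired) {
    return addr.compare_exchange_strong(exp, desired);
}
```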

CAS-based stack [Treiber86]

    void stack::push(int key) {
        node * n = new node(key);
        while (true) {
            node * curr = top;
            n->next = curr;
            if (CAS(&top, curr, n)) return;   // change top from curr to n
        }
    }

    int stack::pop() {
        while (true) {
            node * curr = top;
            if (curr == NULL) return EMPTY;
            node * next = curr->next;
            if (CAS(&top, curr, next)) {      // change top from curr to next
                return curr->key;
            }
        }
    }
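The same algorithm can be written as compilable C++ using `std::atomic` in place of the slides' volatile pointer and hardware CAS. This is a sketch under the slides' simplifications (int keys, an assumed `EMPTY` sentinel, and no memory reclamation, so popped nodes are deliberately leaked, for reasons the next slides explain):

```cpp
#include <atomic>

constexpr int EMPTY = -1;  // assumed sentinel for "stack is empty"

struct node {
    int key;
    node *next;
    explicit node(int k) : key(k), next(nullptr) {}
};

struct stack {
    std::atomic<node*> top{nullptr};

    void push(int key) {
        node *n = new node(key);
        while (true) {
            node *curr = top.load();
            n->next = curr;
            if (top.compare_exchange_weak(curr, n)) return;  // CAS(&top, curr, n)
        }
    }

    int pop() {
        while (true) {
            node *curr = top.load();
            if (curr == nullptr) return EMPTY;
            node *next = curr->next;
            if (top.compare_exchange_weak(curr, next)) {
                // curr is intentionally NOT freed here: another thread may
                // still be reading it (see the memory reclamation slides)
                return curr->key;
            }
        }
    }
};
```

`compare_exchange_weak` may fail spuriously, but that is harmless inside a retry loop, and it is the idiomatic choice for loops like these.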

How does this avoid errors?
- After Push(17), thread p runs Push(24) while thread q runs Push(52)
- Thread p: currp = Read(top), np = new node(24)
- Thread q: currq = Read(top), nq = new node(52), then CAS(&top, currq, nq) succeeds
- Thread p: CAS(&top, currp, np) fails, because top no longer contains currp
- Thread p will reread top and try again: no push is lost

What about freeing memory?

    int stack::pop() {
        while (true) {
            node * curr = top;
            if (curr == NULL) return EMPTY;
            node * next = curr->next;
            if (CAS(&top, curr, next)) {
                auto retval = curr->key;   // read the key before freeing!
                free(curr);
                return retval;
            }
        }
    }

- We disconnected a node from the stack, so why not call free() on it?
- We are not locking before accessing nodes, so multiple threads might still be accessing curr!

Problems freeing memory
- After Push(17) and Push(52), threads p and q both run Pop()
- Thread p: currp = Read(top), and is about to read currp->next
- Thread q: currq = Read(top) (the same node), nextq = Read(currq->next), CAS(&top, currq, nextq) succeeds, free(currq)
- Thread p then reads currp->next: it is accessing freed memory!

Using free correctly
- Must delay free(curr) until it is safe!
- When is it safe? When no other thread has a pointer to curr
- There are memory reclamation algorithms designed to solve this: hazard pointers, epoch-based reclamation, garbage collection (Java, C#)
- These usually provide a delayedFree operation (plus some other operations)
- We won't worry about memory reclamation for now...
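To show the shape of the delayedFree interface mentioned above, here is a deliberately simplified sketch. The `RetireList` type and its methods are hypothetical stand-ins, not any real reclamation algorithm: a real scheme (hazard pointers, epochs) is exactly about how to establish the "no thread still holds a pointer" guarantee that this sketch simply assumes at the drain point.

```cpp
#include <vector>

// Hypothetical stand-in for a memory reclamation scheme: pop() would call
// delayedFree(curr) instead of free(curr), and nodes are reclaimed later.
struct RetireList {
    std::vector<void*> retired;

    void delayedFree(void *p) { retired.push_back(p); }

    // UNSAFE in general: the caller must somehow guarantee that no thread
    // still holds a pointer into the retired nodes. Providing that
    // guarantee is the whole job of hazard pointers / epoch-based schemes.
    void drainWhenQuiescent() {
        for (void *p : retired) ::operator delete(p);
        retired.clear();
    }
};
```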

What does the stack guarantee?
Correctness (safety): linearizable
- Every concurrent execution is equivalent to some valid execution of a sequential stack (as long as we don't screw up memory reclamation)

What does the stack guarantee?
Progress (liveness): lock-free
- Some thread will always make progress, even if some threads can crash
- A crashed thread stops taking steps forever; crashed threads are indistinguishable from very slow threads (because threads can be unboundedly slow)
- Lock-freedom allows some threads to starve, but not all threads will starve
- Algorithms are usually designed so starvation is rare in practice

Reasoning about progress (over the push and pop code above)
- What can prevent progress in the stack? The unbounded while loops
- When do we break out of a loop? When we do a successful CAS
- What can prevent a successful CAS? A concurrent successful CAS by another thread

Mechanics of proving progress: proof by contradiction
- Assume threads keep taking steps forever, but progress stops
- After some time t, no operation terminates, so every thread is stuck in a while loop
- To continue in their while loops, threads must continue to perform failed CASs forever after t
- But every failed CAS is concurrent with a successful CAS, which is performed by an operation that will not perform any more loop iterations
- Eventually, no running operation will perform loop iterations: contradiction!

Reasoning about correctness (creativity needed here)
Linearization points for the push and pop code above:
- Push: the successful CAS
- Pop: if the return value is EMPTY, the last read of top; otherwise, the successful CAS
Must prove: every operation returns the same value it would if it were performed atomically at its linearization point.

Performance
- Are stacks really suitable for multicore programming?
- [Graph: throughput of the Treiber stack we just saw vs. other stacks developed up to 2012; one thread is best!]

Recap
- Finishing up last class: "padding" fixes false sharing, but can be a double-edged sword; predicting performance on multicore machines revolves largely around the caches
- Important definition: lock-freedom
- Lock-free stack (implementation and proof sketch)