1
Multicore programming
Implementing a shared counter: What could go wrong?
Week 1 - Monday
Trevor Brown
2
A bit about me
Interests: fast concurrent data structures, memory reclamation, transactional memory, big systems, big numbers, …
Also: gamer, drummer
3
This course
Emphasis on both theory and practice, leaning towards practice.
Correctness is hard, so theory is needed to make things rigorous. But concurrency is about performance, so practice must inform theory!
Less focus on traditionally taught concurrent programming material: attempting to teach the unspoken/unwritten intuition of experts, and hoping to reach the bleeding edge.
4
Theoretical Model: Asynchronous shared memory
Many threads; each thread has private memory.
Shared memory is accessible by all threads (with reads & writes).
Reads and writes are atomic: a thread cannot see partially written values.
5
Definition: (Shared) object
An object offers one or more operations. An operation has an invocation (possibly taking some arguments) and a response (possibly returning a value). An object can be accessed simultaneously by many threads.
6
Example: Counter object
Stores an integer value, initially zero.
Operations:
Increment: increase the value by one and return the old value.
Read: return the current value.
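As a reference point, here is a minimal C++ sketch of this sequential specification (illustrative code, not from the slides). It defines correct behaviour before any concurrency is involved:

```cpp
#include <cstdint>

// Sequential counter specification: what a single thread expects.
struct Counter {
    int64_t value = 0;               // initially zero

    int64_t increment() {            // increase by one, return the OLD value
        int64_t old = value;
        value = old + 1;
        return old;
    }

    int64_t read() const {           // return the current value
        return value;
    }
};
```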
7
Motivation
Why might you want a counter? To keep track of the size of a concurrent hash table:
The counter is incremented whenever an element is inserted.
The counter is read to decide whether the table should be expanded.
Suppose accesses to the hash table are spread out and highly concurrent. Then the hash table is not a concurrency bottleneck, and the counter should not be a concurrency bottleneck either!
Why are we studying counters? As a vehicle to understand concurrency, correctness, performance and systems.
8
Naïve implementation of a counter
Using only reads and writes:
Increment(c)
    x = Read(c)
    Write(c, x + 1)
    Return x
Read(c)
    Return the stored value
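In C++, the naïve counter might look like the sketch below (illustrative, not the course's exact code). volatile stops the compiler from caching the value in a register, but the read-then-write is still two separate steps:

```cpp
#include <cstdint>

int64_t volatile c = 0;

int64_t increment() {
    int64_t x = c;        // Read(c)
    c = x + 1;            // Write(c, x + 1): another thread's write
                          // between these two lines is silently lost
    return x;
}

int64_t read() {
    return c;
}
```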
9
Execution of an algorithm
An execution E is a sequence of alternating configurations and steps: C₁ s₁ C₂ s₂ C₃ s₃ …
Configurations and steps capture how the system changes over time.
A configuration describes the state of the system: the contents of memory and the program counters of all threads.
A step is an invocation or response, or a read or write to shared memory.
10
Example execution 1: our naïve counter implementation
Suppose two threads p and q both access the same counter c. One possible execution (time flows downward):
p: Read(c) → 0
p: Write(c, 1), return 0
q: Read(c) → 1
q: Write(c, 2), return 1
After 2 increments, final counter value = 2.
11
Example execution 2
p: Read(c) → 0
q: Read(c) → 0
p: Write(c, 1), return 0
q: Write(c, 1), return 0
After 2 increments, final counter value = 1? Incorrect! Both increments return the same value (0)? Incorrect!
12
Example execution 3
Three threads p, q and r perform 4 increments in total:
p: Read(c) → 0
q: Read(c) → 0
q: Write(c, 1), return 0
r: Read(c) → 1
r: Write(c, 2), return 1
(fourth increment) Read(c) → 2, Write(c, 3), return 2
p: Write(c, 1), return 0
After 4 increments, final counter value = 1? Any changes made between p's Read and p's Write are overwritten!
13
How bad is this in practice?
Simple experiment: n threads share one counter, and each thread performs (10 billion / n) increments.
What should the final count be? (10 billion.) In this experiment, 93% of all increments are lost!
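A hedged sketch of this experiment in C++ (thread count and iteration count are scaled-down assumptions, not the course's actual harness):

```cpp
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

int64_t volatile c = 0;   // the shared naive counter

int main() {
    const int n = 8;                      // number of threads (assumption)
    const int64_t total = 100'000'000;    // scaled down from 10 billion
    std::vector<std::thread> threads;
    for (int i = 0; i < n; ++i) {
        threads.emplace_back([&] {
            for (int64_t j = 0; j < total / n; ++j) {
                int64_t x = c;            // Read(c)
                c = x + 1;                // Write(c, x + 1)
            }                             // (this data race is exactly the
        });                               //  bug being demonstrated)
    }
    for (auto& t : threads) t.join();
    // A correct counter would end at `total`; the naive counter typically
    // loses the vast majority of increments under contention.
    std::printf("expected %lld, got %lld\n", (long long)total, (long long)c);
}
```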
14
What does it mean to be correct?
Counter experiment:
Obvious: at the end, the counter should equal 10 billion.
Not obvious: during the experiment, what should each Increment return?
General correctness condition: linearizability.
High-level idea: every concurrent execution has an equivalent sequential execution, and cannot do anything that is impossible in a sequential execution. This lifts an object's sequential definition into the concurrent setting: concurrent behaviour is defined by the sequential behaviour.
15
Defining Linearizability
A concurrent execution is linearizable if each operation has a linearization point during the operation, and all operations appear to occur instantly at their linearization points, behaving as defined in the sequential object specification.
An object is linearizable if every possible execution is linearizable, i.e., every concurrent execution has an equivalent sequential one.
16
Counter execution 1: linearizable?
Yes: we can pick linearization points such that p's Increment (returning 0) linearizes before q's Increment (returning 1).
17
Counter execution 2: linearizable?
No! Since p's Increment returns 1 and q's returns 0, we would need to linearize p's increment after q's increment. But p's operation completes before q's begins: the latest possible linearization point for p's increment comes before the earliest possible linearization point for q's increment.
18
Counter execution 3: linearizable?
Yes: here the two increments overlap in time, so we can pick linearization points that order q's increment (returning 0) before p's (returning 1). Operations can be linearized in any order, as long as they are concurrent.
19
Implementing a linearizable counter
Increment must perform its Read and Write atomically, as a single indivisible step; otherwise a thread can always Write a stale value.
We need stronger tools! How about locks?
20
Locks
A lock guards an object: it allows only one thread at a time to access it.
Operations:
Acquire: blocks until the calling thread has acquired the lock, then returns. (The thread is then allowed to access the object.)
Release: releases the lock so other threads can acquire it. (The thread is no longer allowed to access the object.)
In Java: synchronized. In C: pthread_spin_lock.
21
Lock-based counter
Let L be the lock guarding the counter c.
Increment(c)
    Acquire(L)
    x = Read(c)
    Write(c, x + 1)
    Release(L)
    Return x
Read(c)
    Acquire(L)
    x = Read(c)
    Release(L)
    Return x
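A minimal sketch of this counter using pthread_spin_lock (compiles as C or C++; illustrative, not the course's exact code):

```cpp
#include <pthread.h>
#include <stdint.h>

static int64_t c = 0;
static pthread_spinlock_t L;

void counter_init(void) {
    pthread_spin_init(&L, PTHREAD_PROCESS_PRIVATE);
}

int64_t counter_increment(void) {
    pthread_spin_lock(&L);        // Acquire(L)
    int64_t x = c;                // Read(c)
    c = x + 1;                    // Write(c, x + 1)
    pthread_spin_unlock(&L);      // Release(L)
    return x;
}

int64_t counter_read(void) {
    pthread_spin_lock(&L);        // Acquire(L)
    int64_t x = c;                // Read(c)
    pthread_spin_unlock(&L);      // Release(L)
    return x;
}
```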
22
Can this problem happen now?
p: Acquire(L)
p: Read(c) → 0
q: Acquire(L) (blocks until p's Release(L))
p: Write(c, 1), return 0
p: Release(L)
q: acquires L, and will now Read and return 1!
23
Similar experiment, comparing the final counter value of the naïve and lock-based counters. The output is now correct! (Of course, this experiment is not a correctness proof.) What about performance?
24
Performance comparison
Simple timed experiment: each data point = average of 5 trials. In each trial, for 3 seconds, threads repeatedly perform Increment, and we measure increments/second.
What is the overhead of locking? 4.9x slower with 1 thread; 862x slower with 18 threads.
Is there a better tool than locking?
25
Fetch-and-add (FAA)
An instruction implemented on modern Intel and AMD systems (lock xadd). FAA(addr, val) does the following atomically (all at once):
x = Read(addr)
Write(addr, x + val)
Return x
26
Easy FAA-based counter
Increment(c)
    x = FAA(c, 1)
    Return x
Read(c)
    x = Read(c)
    Return x
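One plausible realization in C++ uses std::atomic, whose fetch_add compiles to lock xadd on x86-64 (a sketch, not necessarily the course's code):

```cpp
#include <atomic>
#include <cstdint>

std::atomic<int64_t> c{0};

int64_t increment() {
    return c.fetch_add(1);    // x = FAA(c, 1); returns the old value, atomically
}

int64_t read() {
    return c.load();          // an atomic read of the current value
}
```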
27
How does this perform in practice?
Same timed experiment, excluding Naïve from the graph (to zoom in on Lock and FAA).
Compared to Lock, FAA is up to 16.5x faster.
Compared to Naïve (incorrect), FAA is up to 64x slower (much better than Lock's 862x).
28
Problem: too much contention
Accessing a single counter creates a contention bottleneck. Any ideas?
What if we shard (partition) the counter into multiple sub counters?
Increment: pick one sub counter and increment it.
What about Read? The counter value is distributed over the sub counters.
Trade-off:
Single counter: slow Increment, fast Read.
Sharded counter: fast Increment, slower/more complex Read?
We will think about this complication later…
29
Naïve sharded counter
Array of sub counters: int64_t volatile c[numberOfThreads];
Thread p uses its own sub counter c[p].
Increment by p:
    x = Read(c[p])
    Write(c[p], x + 1)
    Return x
No data sharing, so in theory this should scale perfectly.
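A sketch of the naïve sharded counter in C++, with Read realized as a sum over the sub counters (one natural reading of "the counter value is distributed"; numberOfThreads and the per-thread ids are assumptions about the harness):

```cpp
#include <cstdint>

const int numberOfThreads = 36;          // assumption: matches the machine

int64_t volatile c[numberOfThreads];     // one sub counter per thread

int64_t increment(int p) {               // p = calling thread's id
    int64_t x = c[p];                    // Read(c[p]): only thread p writes c[p]
    c[p] = x + 1;                        // Write(c[p], x + 1)
    return x;
}

int64_t read() {                         // sum the sub counters
    int64_t sum = 0;
    for (int i = 0; i < numberOfThreads; ++i)
        sum += c[i];
    return sum;
}
```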
30
How does this perform? Same timed experiment.
Why is the scaling so poor? Speedup over single-threaded is only 8.4x at 18 threads and 9.4x at 36 threads, where ideal scaling would be 18x and 36x. There is no shared data, right?
Answer: cache coherence.
31
How cache coherence works
Memory moves between caches in 64-byte (8-word) cache lines; consider one line holding words w1…w8.
Thread 1 reads w2: the line enters Thread 1's cache in shared (S) mode.
Thread 2 reads w7: the same line also enters Thread 2's cache in shared (S) mode.
Thread 2 writes w7: Thread 2's copy is upgraded to exclusive (X) mode, and Thread 1's copy is invalidated.
32
Memory layout of sub counters
Sub counters c[0] … c[7] occupy one cache line, and c[8] … c[15] occupy the next.
Incrementing any one of them invalidates the whole line in every other thread's cache (causing huge contention). This is called false sharing.
33
Solution: padding
Add empty space to each sub counter to make it cache-line sized, so that c[0], c[1], and so on each occupy their own cache line.
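A sketch of the padded sharded counter in C++ (alignas is one way to get cache-line-sized slots; illustrative, not necessarily the course's exact code):

```cpp
#include <cstdint>

const int numberOfThreads = 36;          // assumption, as before

// Each sub counter is aligned and padded to a full 64-byte cache line,
// so no two threads' sub counters ever share a line.
struct alignas(64) PaddedCounter {
    int64_t volatile value;
};
static_assert(sizeof(PaddedCounter) == 64, "one cache line per sub counter");

PaddedCounter c[numberOfThreads];

int64_t increment(int p) {               // p = calling thread's id
    int64_t x = c[p].value;
    c[p].value = x + 1;
    return x;
}
```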
34
How does this perform? Same experiment, but comparing naïve sharding with the padded counter. Close to perfect scaling: speedup over single-threaded is 16.6x at 18 threads (ideal: 18x) and 33.6x at 36 threads (ideal: 36x).
35
Summary
Important definitions:
(Asynchronous) shared memory model
Shared object
Execution (of an algorithm)
Linearizability
Important system concepts:
Cache coherence
Exclusive vs. shared mode access, and cache invalidations
False sharing and padding
Locks and fetch-and-add
C++ code for all counters and experiments is available: