EE 193: Parallel Computing




1 EE 193: Parallel Computing
Fall 2017, Tufts University
Instructor: Joel Grodstein
Lecture 4: More concurrent programming

2 Goals
Primary goals:
Learn about mutexes, semaphores, and barriers
Learn about atomic operations

3 Next problem opportunity
Having too many threads can cause two different problems. Let's see them and find a solution.
EE 193 Joel Grodstein

4 More threads don't always help
Assume we've improved our code that computes π:
void th_func (int thread_numb, int stride) {
    do a lot of work that does not need to be in a critical section;
    while (lock != thread_numb) ;   // Busy wait until it's my turn
    do a little work in a critical section;
    lock = (lock+1) % N_THREADS;    // Yield the lock
}
Assume we have N cores, and that using N threads works well. What happens if we use 2N threads?
We finish faster than with N threads
We finish slower than with N threads
It all depends
Yes, it always all depends. But in this case…

5 Results from Norbert
Remember: Norbert has 32 cores. All examples add 500M terms.
1 thread = 3620 ms; 2 threads = 1840 ms; 4 threads = 920 ms; 32 threads = 160 ms
64 threads = 230 ms; 128 threads = 750 ms
Any explanations?
Assume that one thread can completely use all of a core's resources, so 32 threads can keep Norbert busy.
A thread that is spin waiting looks to the O/S like it's doing real work, and will be scheduled.
Adding a thread that just spin waits steals resources from a thread that's computing. I.e., thread #2 testing whether thread #1 is done actually steals CPU cycles from thread #1!
A watched pot never boils.

6 Another problem with spin loops
Our spin-waiting loop requires that the threads take their turns in order: first thread #0, then #1, then #2… and back to #0. What if that's not good enough?
Say that thread #10 finishes its partial sum way before threads #0-9, and then #2 finishes. It would be nice for the threads to get access to the critical section in order of who's ready first, regardless of thread id.
Can you think of any scheme that works?
Any thread can request access to the critical section, in any order
The first requester gets it
As usual, we must never allow two different threads to get access at once.

7 Mutexes
How do we get around these problems?
Just never use more threads than cores. Yes, but…
you would like your code to run on different machines, with different numbers of cores
OK, you can probably test for the number of cores
but anyway, spin loops burn power
Better answer: use a mutex

8 C++ threads mutex
mutex mut;
void th_func (int thread_numb, int stride) {
    do a lot of work that does not need to be in a critical section;
    mut.lock();
    do a little work in a critical section;
    mut.unlock();
}

9 Cool things about a mutex
mut.lock();    // Code waits here until we get the lock
do a little work in a critical section;
mut.unlock();  // Yield the lock
Just because we called a library routine doesn't make that routine able to do magic. What's in that routine that solved our problems?
In principle, mut.lock() is just
while (locked) ;
locked = true;
But there are two big tricks: efficiency and atomicity.

10 Cool things about a mutex
mut.lock();    // Code waits here until we get the lock
do a little work in a critical section;
mut.unlock();  // Yield the lock
Isn't this a busy wait? Won't it waste CPU cycles just like our previous
while (locked) ;
locked = true;
loop did?
No, because of secret sauce #1: internally, there is no while loop. mutex.lock() tests the locked flag just once. If the flag is already set, the thread goes to sleep until somebody else unlocks the mutex. Most CPUs/OSes provide a mechanism for threads to sleep and wake.

11 Uncool things about a mutex
mut.lock();    // Code waits here until we get the lock
do a little work in a critical section;
mut.unlock();  // Yield the lock
What if two threads request the lock at roughly the same time? With the naive implementation
while (locked) ;
locked = true;
we could get:
locked is false
thread #0: load R1=locked; locked=true;
thread #1: load R1=locked; locked=true;
Both threads think they have the lock!

12 Atomic operations
Consider the operation SWAP R1, MEM[R2]. It
loads temp=MEM[R2]
stores MEM[R2]=R1
copies R1=temp
But it
does not actually use a temporary register
does not let anybody else access MEM[R2] between our load and our store
This solves our mutex.lock() problem!

13 Atomic mutex.lock()
The atomic swap solves our problem:
R1 = 1;
while (R1==1) {
    swap R1, locked;    // atomically set locked=1 and grab its old value
    if (R1==1) sleep until woken;
}
I have the lock!
Different CPU architectures have different (but similar) mechanisms.

14 Results from Norbert
Remember: Norbert has 32 cores. All examples add 500M terms.
1 thread = 3620 ms; 2 threads = 1840 ms; 4 threads = 980 ms; 32 threads = 160 ms
64 threads = 175 ms; 128 threads = 170 ms; 256 threads = 165 ms
Any explanations?
As before, assume that one thread can completely use all of a core's resources, so 32 threads can keep Norbert busy.
A thread that is spin waiting looks to the O/S like it's doing real work, and will be scheduled; i.e., thread #2 testing whether thread #1 is done actually steals CPU cycles from thread #1! (A watched pot never boils.)
A thread that's waiting for a mutex, by contrast, is sleeping; the O/S will not schedule it, so the extra threads cost us almost nothing.

15 Remember this, from the first class?
This is an example of producers and consumers.
The consumers must know when the producers have data ready for them
The producers usually have to know when the consumers have read the data
Copyright © 2010, Elsevier Inc. All rights Reserved

16 Producer & consumer
Consider just two threads: a producer and a consumer. With our old calculating-π program, one thread might produce a bunch of partial sums and another thread might add them up:
thread 0: 1 − 1/3 + 1/5 − 1/7 ≈ .724, then 1/9 − 1/11 + 1/13 − 1/15 ≈ .030, …
thread 1: running sum = .724, then .754, …
Questions (to think about for 5 minutes):
how does thread #1 know that a new partial sum is ready?
how does thread #0 know that #1 has used its last partial sum, so it can overwrite it with a new one?

17 Spin-wait solution
Global data structure:
double partial_sum;    // written only by the producer
double sum;            // written only by the consumer
bool partial_is_rdy;   // says partial_sum has good data
bool data_taken;       // says that the data has been taken
thread 0:
while (not done) {
    sum a few terms → partial_sum;
    partial_is_rdy = true;
    while (!data_taken) ;
}
thread 1:
while (not done) {
    while (!partial_is_rdy) ;
    sum += partial_sum;
    data_taken = true;
}
Any issues with this code?
partial_is_rdy and data_taken are set, but never cleared!

18 Spin-wait solution
Global data structure:
double partial_sum;    // written only by the producer
double sum;            // written only by the consumer
bool partial_is_rdy;   // says partial_sum has good data
bool data_taken;       // says that the data has been taken
thread 0:
while (not done) {
    sum a few terms → partial_sum;
    partial_is_rdy = true;
    while (!data_taken) ;
    data_taken = false;
}
thread 1:
while (not done) {
    while (!partial_is_rdy) ;
    sum += partial_sum;
    partial_is_rdy = false;
    data_taken = true;
}

19 Can this have races?
thread 0:
while (not done) {
    sum a few terms → partial_sum;
    partial_is_rdy = true;
    while (!data_taken) ;
    data_taken = false;
}
thread 1:
while (not done) {
    while (!partial_is_rdy) ;
    sum += partial_sum;
    partial_is_rdy = false;
    data_taken = true;
}
Ordering enforced:
Thread #0 writes partial_sum and then sets partial_is_rdy=true
Thread #1 adds partial_sum into sum, and only then sets partial_is_rdy=false and data_taken=true
Thread #0 cannot overwrite partial_sum again until after thread #1 has set data_taken=true.

20 Can you do this with only one flag?
thread 0:
while (not done) {
    sum a few terms → partial_sum;
    partial_is_rdy = true;
    while (!data_taken) ;
    data_taken = false;
}
thread 1:
while (not done) {
    while (!partial_is_rdy) ;
    sum += partial_sum;
    partial_is_rdy = false;
    data_taken = true;
}
We used both partial_is_rdy and data_taken. Can you do the same thing with just one flag?

21 Can you do this with only one flag?
thread 0:
while (not done) {
    sum a few terms → partial_sum;
    partial_is_rdy = true;
    while (partial_is_rdy) ;
}
thread 1:
while (not done) {
    while (!partial_is_rdy) ;
    sum += partial_sum;
    partial_is_rdy = false;
}
This simpler version also works fine.

22 Can you do this with only one flag?
thread 0:
while (not done) {
    sum a few terms → partial_sum;
    partial_is_rdy = true;
    while (partial_is_rdy) ;
}
thread 1:
while (not done) {
    while (!partial_is_rdy) ;
    sum += partial_sum;
    partial_is_rdy = false;
}
Still, spin waiting is rarely a great solution. Can you figure out how to use a mutex instead?
Not so easy: thread #0 wants to set the flag to 1 and then wait for it to become 0, and thread #1 wants to do the reverse. That doesn't easily fit a mutex.
There's a better solution anyway.

23 Semaphores
A semaphore is an object that keeps an internal counter and has two methods:
.post(): increments the internal counter
.wait(): waits until the counter is >0, then decrements it and returns
Both of these are guaranteed atomic.
Intuition:
post() says that you've just created one unit of some resource
wait() says to wait until there's at least one unit of the resource, then consume it and return

24 Semaphores
A semaphore is an object that keeps an internal counter and has two methods:
.post(): increments the internal counter
.wait(): waits until the counter is >0, then decrements it and returns
Both of these are guaranteed atomic.
thread 0:
while (not done) {
    sum a few terms → partial_sum;
    data.post();
    slot.wait();
}
thread 1:
while (not done) {
    data.wait();
    sum += partial_sum;
    slot.post();
}

25 Semaphores
thread 0:
while (not done) {
    sum a few terms → partial_sum;
    data.post();
    slot.wait();
}
thread 1:
while (not done) {
    data.wait();
    sum += partial_sum;
    slot.post();
}
Do you believe that it works? Hopefully, yes 
Can you do it with just one semaphore? No. The producer would need a function called "wait until the semaphore == 0," which doesn't exist.

26 More semaphores
Our two semaphores only ever had the values 0 and 1, but
in general, a semaphore can count up to any integer (and back down to 0)
it can keep track of how much of a resource is available
Minor issue: C++ threads doesn't provide a semaphore
but it does provide a condition variable, which you can easily use to build one
we won't discuss condition variables here

27 Barriers
Typically, different threads all proceed independently. Due to stalls, memory misses, etc., they may proceed at very different rates. This is not always good:
it may make your program harder to debug
if threads share some common data, it may mean that data must live in cache longer (esp. for GPUs)
A barrier is a software construct such that:
a bunch of threads each include a barrier statement
each thread hits the barrier and stops
when the last thread has hit the barrier, they all go on
Analogy: hikers all wait for each other at a fork in the trail.
C++ threads does not have a barrier, but you can easily build one with the statements it has.

28 In-class exercise: can you build a barrier out of a semaphore?
thread() {
    some stuff;
    barrier();
    other stuff;
}
All threads do "some stuff" at their own pace, then wait for each other at the barrier. Assume there are exactly 4 threads.
mutex mut;
semaphore sem;
barrier () {
    what goes here?
}
You get to use post() and wait(). And it might help to use a mutex to protect a critical section.

29 In-class exercise, continued
thread() {
    some stuff;
    barrier();
    other stuff;
}
All threads do "some stuff" at their own pace, then wait for each other at the barrier. Assume there are exactly 4 threads.
mutex mut;
semaphore sem;
barrier () {
    what goes here?
}
Hints:
increment a counter that tracks how many threads have reached the barrier
if the counter == 4, do some semaphore posts
do a semaphore wait

30 Barrier solution
mutex mut;
semaphore sem;
barrier () {
    mut.lock();                       // critical section
    ++n_threads_at_barrier;           // count how many threads have reached the barrier
    if (n_threads_at_barrier==4) {    // when all four have reached it,
        sem.post(); sem.post();       // let everyone proceed
        sem.post(); sem.post();
    }
    mut.unlock();
    sem.wait();   // the first 3 threads wait here until the final thread posts 4 times
}

31 Summary
We learned about:
mutexes (simple control over a critical section)
semaphores (more advanced tracking of resources)
barriers (ensure that all threads have reached a particular point in the code)
atomic operations (what goes on under the hood to make all of the above work)
Now we know the basics of why parallel programs are hard, and some tricks.
Next up: learn enough about hardware to make our programs run really fast.

