1
Shared Memory Programming with Threads
High-Performance Grid Computing and Research Networking Shared Memory Programming with Threads Presented by Yuxin Zhuang Instructor: S. Masoud Sadjadi sadjadi At cs Dot fiu Dot edu
2
Acknowledgements The content of many of the slides in these lecture notes has been adapted from online resources prepared previously by the people listed below. Many thanks! Henri Casanova, Principles of High Performance Computing
3
Shared memory programming
The “easiest” form of parallel programming. Can be used to parallelize a sequential code in an incremental way: take a sequential code, parallelize a small section, check that it works, check that it speeds things up a bit, move on to another section. We will see that parallelizing a program for distributed memory is far from trivial and requires much more effort, but it is necessary to scale to large numbers of processors. Remember that almost everybody would prefer a shared-memory machine to a distributed-memory machine. The problem is that shared-memory machines do not scale well within reasonable cost (if at all possible).
4
Outline Multi-threading with pthreads Multi-threading with OpenMP
5
What is a thread? A thread is a stream of instructions that can be scheduled as an independent unit. A process is created by an operating system; it contains information about resources (process id, file descriptors, ...) and information on the execution state (program counter, stack, ...). The concept of a thread requires that we make a separation between these two kinds of information in a process: resources available to the entire process (program instructions, global data, working directory) and schedulable entities (program counters and stacks). A thread is an entity within a process which consists of the schedulable part of the process.
6
Parallelism with Threads
Create threads within a process Each thread does something (hopefully) useful Threads may be working truly concurrently Multi-processor Multi-core Or just pseudo-concurrently Single-proc, single-core
7
Example Say I want to compute the sum of two arrays
I can just create N threads, each of which sums 1/Nth of both arrays and then combine their results I can also create N threads that each increment some sum variable element-by-element, but then I’ve got to make sure they don’t step on each other’s toes The first version is a bit less “shared-memory”, but is probably more efficient We’ll see how to actually write code to do this
8
Multi-threading issues
There are really two main issues when writing multi-threaded code: Issue #1: Load Balancing Make sure that no processor/core is left idle when it could be doing useful work We will talk about this a lot throughout the semester as it arises in all forms of parallel computing Issue #2: Correct access to shared variables Implemented via mutual exclusion: create sections of code that only a single thread can be in at a time Called “critical sections” Classical variable update example Done via “locks” and “unlocks” Warning: locks are NOT on variables, but on sections of code
9
Threads in Practice
Pthreads: Popular C library; flexible; requires a fair amount of work
OpenMP: Standard for multi-threading for high-performance computing; more rigid than pthreads; requires very little work
Java Threads: Well integrated with the rest of the language
10
Pthreads A POSIX standard (IEEE POSIX 1003.1c) API for thread creation and synchronization The API specifies the standard behavior Implementation choices are up to developers And implementations vary from system to system, with some better than others Common in all UNIX operating systems Some people have written it for Win32 The most portable threading library out there What do threads look like in UNIX?
11
User-level / Kernel-level
User-level threads: Many-to-one thread mapping Implemented by user-level runtime libraries Create, schedule, synchronize threads at user-level OS is not aware of user-level threads OS thinks each process contains only a single thread of control Advantages Does not require OS support; Portable Can tune scheduling policy to meet application demands Lower overhead thread operations since no system calls Disadvantages Cannot leverage multiprocessors Entire process blocks when one thread blocks
12
User-level / Kernel-level
Kernel-level threads: One-to-one thread mapping OS provides each user-level thread with a kernel thread Each kernel thread scheduled independently Thread operations (creation, scheduling, synchronization) performed by OS Advantages Each kernel-level thread can run in parallel on a multiprocessor When one thread blocks, other threads from process can be scheduled Disadvantages Higher overhead for thread operations OS must scale well with increasing number of threads
13
Using the Pthread Library
Pthread library typically uses kernel-threads But a “cool” project would be to implement it in a user-level fashion Programs must include the file pthread.h Programs must be linked with the pthread library (-lpthread) Command line: "gcc -pthread program.c" The API contains functions to create threads control threads manage threads synchronize threads
14
pthread_self() Returns the thread identifier for the calling thread
At any point in its instruction stream a thread can figure out which thread it is Convenient to be able to write code that says: “If you’re thread 1 do this, otherwise do that” #include <pthread.h> pthread_t pthread_self(void);
15
pthread_create() Creates a new thread of control
#include <pthread.h> int pthread_create ( pthread_t *thread, pthread_attr_t *attr, void * (*start_routine) (void *), void *arg); Returns 0 to indicate success, otherwise returns error code thread: output argument that will contain the thread id of the new thread attr: input argument that specifies the attributes of the thread to be created (NULL = default attributes) start_routine: function to use as the start of the new thread must have prototype: void * foo(void*) arg: argument to pass to the new thread routine If the thread routine requires multiple arguments, they must be passed bundled up in an array or a structure
16
pthread_create() example
Want to create a thread to compute the sum of the elements of an array void *do_work(void *arg); Needs three arguments: the array, its size, where to store the sum we need to bundle them in a structure struct arguments { double *array; int size; double *sum; };
17
pthread_create() example
int main(int argc, char **argv) {
  double array[100];
  double sum;
  pthread_t worker_thread;
  struct arguments *arg;

  arg = (struct arguments *)calloc(1, sizeof(struct arguments));
  arg->array = array;
  arg->size = 100;
  arg->sum = &sum;

  if (pthread_create(&worker_thread, NULL, do_work, (void *)arg)) {
    fprintf(stderr, "Error while creating thread\n");
    exit(1);
  }
  ...
18
pthread_create() example
void *do_work(void *arg) {
  struct arguments *argument;
  int i, size;
  double *array;
  double *sum;

  argument = (struct arguments *)arg;
  size = argument->size;
  array = argument->array;
  sum = argument->sum;

  *sum = 0;
  for (i = 0; i < size; i++)
    *sum += array[i];
  return NULL;
}
19
Comments about the example
The “parent thread” continues its normal execution after creating the “child thread” Memory is shared by the parent and the child (the array, the location of the sum) nothing prevents the parent from doing something to it while the child is still executing, which may lead to a wrong computation The bundling and unbundling of arguments is a bit tedious, but nothing compared to what’s needed with shared memory segments and processes
20
pthread_exit() Terminates the calling thread
#include <pthread.h> void pthread_exit( void *retval); The return value is made available to another thread calling a pthread_join() (see later) My previous example had the thread just return from function do_work() In this case the call to pthread_exit() is implicit The return value of the function serves as the argument to the (implicitly called) pthread_exit().
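As a sketch (reusing the struct arguments from the earlier example), the explicit form of the thread routine would look like this:

void *do_work(void *arg)
{
    struct arguments *argument = (struct arguments *)arg;
    int i;

    *(argument->sum) = 0;
    for (i = 0; i < argument->size; i++)
        *(argument->sum) += argument->array[i];

    /* equivalent to "return (void *)argument->sum;" at the end of the routine */
    pthread_exit((void *)argument->sum);
}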
21
pthread_join() Causes the calling thread to wait for another thread to terminate #include <pthread.h> int pthread_join( pthread_t thread, void **value_ptr); thread: input parameter, id of the thread to wait on value_ptr: output parameter, value given to pthread_exit() by the terminating thread (which happens to always be a void *) returns 0 to indicate success, error code otherwise multiple simultaneous calls for the same thread are not allowed
22
pthread_kill() Sends a signal to a thread, typically used to cause its termination
#include <pthread.h> int pthread_kill( pthread_t thread, int sig); thread: input parameter, id of the thread to terminate sig: signal number returns 0 to indicate success, error code otherwise
23
pthread_join() example
int main(int argc, char **argv) {
  double array[100];
  double sum;
  pthread_t worker_thread;
  struct arguments *arg;
  void *return_value;

  arg = (struct arguments *)calloc(1, sizeof(struct arguments));
  arg->array = array;
  arg->size = 100;
  arg->sum = &sum;

  if (pthread_create(&worker_thread, NULL, do_work, (void *)arg)) {
    fprintf(stderr, "Error while creating thread\n");
    exit(1);
  }
  ...
  if (pthread_join(worker_thread, &return_value)) {
    fprintf(stderr, "Error while waiting for thread\n");
    exit(1);
  }
24
Synchronizing pthreads
As we’ve seen earlier, we need a system to implement locks to create mutual exclusion for variable access, via critical sections Lock creation int pthread_mutex_init( pthread_mutex_t *mutex, const pthread_mutexattr_t *attr); returns 0 on success, an error code otherwise mutex: output parameter, lock attr: input, lock attributes NULL: default There are functions to set the attribute (look at the man pages if you’re interested)
25
Synchronizing pthreads
Locking a lock If the lock is already locked, then the calling thread is blocked If the lock is not locked, then the calling thread acquires it int pthread_mutex_lock( pthread_mutex_t *mutex); returns 0 on success, an error code otherwise mutex: input parameter, lock Just checking (returns instead of blocking): int pthread_mutex_trylock( pthread_mutex_t *mutex); returns 0 on success, EBUSY if the lock is locked, an error code otherwise
26
Synchronizing pthreads
Releasing a lock int pthread_mutex_unlock( pthread_mutex_t *mutex); returns 0 on success, an error code otherwise mutex: input parameter, lock With locking, trylocking, and unlocking, one can avoid all race conditions and protect access to shared variables
27
Mutex Example:
...
pthread_mutex_t mutex;
pthread_mutex_init(&mutex, NULL);
...
pthread_mutex_lock(&mutex);
count++;                        /* critical section */
pthread_mutex_unlock(&mutex);
To “lock” variable count, just put a pthread_mutex_lock() and pthread_mutex_unlock() around all sections of the code that write to variable count Again, you’re really locking code, not variables
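Putting the pieces together, here is a minimal sketch (not from the slides) in which two threads increment a shared counter under a mutex; increment and NITER are made-up names for the illustration:

#include <pthread.h>
#include <stdio.h>

#define NITER 100000        /* arbitrary iteration count for the illustration */

int count = 0;              /* shared variable */
pthread_mutex_t mutex;      /* protects count */

void *increment(void *arg)
{
    int i;
    for (i = 0; i < NITER; i++) {
        pthread_mutex_lock(&mutex);    /* enter the critical section */
        count++;
        pthread_mutex_unlock(&mutex);  /* leave the critical section */
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_mutex_init(&mutex, NULL);
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    pthread_mutex_destroy(&mutex);
    printf("count = %d\n", count);     /* always 2 * NITER */
    return 0;
}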
28
Cleaning up memory Releasing memory for a mutex
int pthread_mutex_destroy( pthread_mutex_t *mutex); Releasing memory for a mutex attribute int pthread_mutexattr_destroy( pthread_mutexattr_t *mutex);
29
Signaling Allows a thread to wait until another thread signals that some condition is met provides a more sophisticated way to synchronize threads than just mutex locks Done with “condition variables” Example: You have to implement a server with a main thread and many threads that can be assigned work (e.g., an incoming request) You want to be able to “tell” a thread: “there is work for you to do” Inconvenient to do with mutex locks: the main thread must carefully manage a lock for each worker thread everybody must constantly be polling locks
30
Condition Variables Condition variables are used in conjunction with mutexes Create a condition variable Create an associated mutex We will see why it’s needed later Waiting on a condition lock the mutex wait on condition variable unlock the mutex Signaling Lock the mutex Signal on the condition variable Unlock mutex
31
pthread_cond_init() Creating a condition variable
int pthread_cond_init( pthread_cond_t *cond, const pthread_condattr_t *attr); returns 0 on success, an error code otherwise cond: output parameter, condition attr: input parameter, attributes (default = NULL)
32
pthread_cond_wait() Waiting on a condition variable
int pthread_cond_wait( pthread_cond_t *cond, pthread_mutex_t *mutex); returns 0 on success, an error code otherwise cond: input parameter, condition mutex: input parameter, associated mutex
33
pthread_cond_signal()
Signaling a condition variable int pthread_cond_signal( pthread_cond_t *cond); returns 0 on success, an error code otherwise cond: input parameter, condition “Wakes up” one thread out of the possibly many threads waiting for the condition The thread is chosen non-deterministically
34
pthread_cond_broadcast()
Signaling a condition variable int pthread_cond_broadcast( pthread_cond_t *cond); returns 0 on success, an error code otherwise cond: input parameter, condition “Wakes up” ALL threads waiting for the condition
35
Condition Variable: example
Say I want to have multiple threads wait until a counter reaches a maximum value and be awakened when it happens:
pthread_mutex_lock(&lock);
while (count < MAX_COUNT) {
  pthread_cond_wait(&cond, &lock);
}
pthread_mutex_unlock(&lock);
Locking the lock so that we can read the value of count without the possibility of a race condition Calling pthread_cond_wait() in a loop to avoid “spurious wake-ups” When going to sleep the pthread_cond_wait() function implicitly releases the lock When waking up the pthread_cond_wait() function implicitly re-acquires the lock (and may thus sleep) Unlocking the lock after exiting from the loop
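The slide only shows the waiting side; a minimal sketch of the updating thread, assuming the same count, lock, and cond variables, might be:

/* Thread that increments the counter and wakes the waiters when MAX_COUNT is reached */
pthread_mutex_lock(&lock);
count++;
if (count >= MAX_COUNT)
    pthread_cond_broadcast(&cond);   /* wake up all threads waiting on cond */
pthread_mutex_unlock(&lock);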
36
pthread_cond_timedwait()
Waiting on a condition variable with a timeout int pthread_cond_timedwait( pthread_cond_t *cond, pthread_mutex_t *mutex, const struct timespec *delay); returns 0 on success, an error code otherwise cond: input parameter, condition mutex: input parameter, associated mutex delay: input parameter, the absolute time at which to give up waiting, given as a struct timespec (seconds and nanoseconds)
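Note that the third argument is an absolute time, not a relative delay. A sketch (reusing the lock, cond, count, and MAX_COUNT from the previous example) of waiting with a 5-second deadline:

/* requires <time.h> for clock_gettime() and <errno.h> for ETIMEDOUT */
struct timespec deadline;
clock_gettime(CLOCK_REALTIME, &deadline);   /* current time */
deadline.tv_sec += 5;                       /* give up 5 seconds from now */

pthread_mutex_lock(&lock);
while (count < MAX_COUNT) {
    int rc = pthread_cond_timedwait(&cond, &lock, &deadline);
    if (rc == ETIMEDOUT)
        break;                              /* stopped waiting */
}
pthread_mutex_unlock(&lock);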
37
PThreads: Conclusion A popular way to write multi-threaded code
If you know pthreads, you’ll have no problem adapting to other multi-threading techniques Condition variables are a bit odd, but very useful For your project you may want to use pthreads More information Man pages PThread Tutorial: pthread_mutex.c & pthread_CondVars.c
38
Outline Multi-threading with pthreads Multi-threading with OpenMP
39
Simple OpenMP Program Goal: make shared memory programming easy (or at least easier than with pthreads) How? A library with some simple functions The definition of a few C pragmas pragmas are a way to make the language extensible and provide an easy way to give hints/information to a compiler A compiler (which generates pthread code!)
40
Fork-Join Model Program begins with a Master thread
Fork: Teams of threads created at times during execution Join: Threads in the team synchronize (barrier) and only the master thread continues execution
41
OpenMP and #pragma C / C++ Directives Format
#pragma omp directive-name [clause, ...] newline
42
#pragma omp parallel [clauses]
OpenMP and #pragma One needs to specify blocks of code that are executed in parallel For example, a parallel region: #pragma omp parallel [clauses] Defines a section of the code that will be executed in parallel The “clauses” specify many things including what happens to variables All threads in the section execute the same code
43
OpenMP Compiler There are several free OpenMP “compilers”
Really more like source-to-source translators The OpenMP compiler on our cluster is called ompicc Located in the directory /share/apps/ompi/bin You can use it just like gcc all the options should work but compilation error messages may be different (meaning worse)
44
First “Hello World” example
#include <omp.h>
int main() {
  printf("Start\n");
  #pragma omp parallel
  {                          // note the {
    printf("Hello World\n");
  }                          // note the }
  /* Resume Serial Code */
  printf("Done\n");
}

% my_program
Start
Hello World
Done
45
First “Hello World” example
#include <omp.h>
int main() {
  printf("Start\n");
  #pragma omp parallel
  {
    printf("Hello World\n");
  }
  /* Resume Serial Code */
  printf("Done\n");
}

% my_program
Start
Hello World
Done

Questions: How many threads? This is not useful because all threads do exactly the same thing. Conditional compilation?
46
How Many Threads? Set via an environment variable
setenv OMP_NUM_THREADS 8 Set via the OpenMP API void omp_set_num_threads(int number); int omp_get_num_threads(); Typically, a function of the number of processors available We often take the number of threads identical to the number of processors/cores
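A small sketch of the API route (the choice of 4 threads is arbitrary, and omp_get_thread_num() is introduced on the next slide):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_set_num_threads(4);   /* request 4 threads for subsequent parallel regions */
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)   /* print only once, from thread 0 */
            printf("Running with %d threads\n", omp_get_num_threads());
    }
    return 0;
}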
47
Threads Doing Different Things
#include <omp.h>
int main() {
  int iam = 0, np = 1;
  #pragma omp parallel private(iam, np)
  {
    np = omp_get_num_threads();
    iam = omp_get_thread_num();
    printf("Hello from thread %d out of %d threads\n", iam, np);
  }
}

% setenv OMP_NUM_THREADS 3
% my_program
Hello from thread 0 out of 3
Hello from thread 1 out of 3
Hello from thread 2 out of 3
48
Conditional Compilation
The _OPENMP macro is defined if the code is compiled with OpenMP
#ifdef _OPENMP
#include <omp.h>
#endif
int main() {
  int iam = 0, np = 1;
  #pragma omp parallel private(iam, np)
  {
    np = omp_get_num_threads();
    iam = omp_get_thread_num();
    printf("Hello from thread %d out of %d threads\n", iam, np);
  }
}
This code will work serially!
49
Data Scoping and Clauses
Shared: all threads access the single copy of the variable, created in the master thread it is the responsibility of the programmer to ensure that it is shared appropriately Private: a separate copy of the variable is created for each thread and discarded at the end of the parallel region (the original variable keeps its value) There are other variations firstprivate: initialization from the master’s copy lastprivate: the master gets the last value updated by the last thread to do an update and several others (Look in the on-line material if you’re interested)
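A sketch (not from the slides) illustrating firstprivate and lastprivate; the variable names are made up:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int offset = 10;   /* each thread's private copy starts at 10 (firstprivate) */
    int last = -1;     /* value from the sequentially last iteration is copied back (lastprivate) */
    int i;

    #pragma omp parallel for firstprivate(offset) lastprivate(last)
    for (i = 0; i < 100; i++) {
        last = i + offset;
    }

    printf("last = %d\n", last);   /* 109: value from iteration i == 99 */
    return 0;
}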
50
Work Sharing directives
We have seen the concept of a parallel region, which is a brute-force SPMD directive. Work Sharing directives make it possible to have threads “share work” within a parallel region: for loops, sections, and single.
51
For Loops Share iterations of the loop across threads
Represents a type of “data parallelism” do the same operation on pieces of the same big piece of data Program correctness must NOT depend on which thread executes which iteration No ordering!
52
For Loop Example
#include <omp.h>
#define N 1000
main () {
  int i, chunk;
  float a[N], b[N], c[N];
  for (i=0; i < N; i++)
    a[i] = b[i] = i * 1.0;
  #pragma omp parallel shared(a,b,c) private(i)
  {
    #pragma omp for schedule(dynamic)
    for (i=0; i < N; i++)
      c[i] = a[i] + b[i];
  } /* end of parallel section */
}
53
Sections Breaks work into separate sections
Each section is executed by a thread Can be used to implement “task parallelism” do different things on different pieces of data If more threads than sections, then some are idle If fewer threads than sections, then some sections are serialized
54
Section Example
#include <omp.h>
#define N 1000
main () {
  int i;
  float a[N], b[N], c[N];
  for (i=0; i < N; i++)
    a[i] = b[i] = i * 1.0;
  #pragma omp parallel shared(a,b,c) private(i)
  {
    #pragma omp sections
    {
      #pragma omp section        /* Section #1 */
      for (i=0; i < N/2; i++)
        c[i] = a[i] + b[i];
      #pragma omp section        /* Section #2 */
      for (i=N/2; i < N; i++)
        c[i] = a[i] + b[i];
    } /* end of sections */
  } /* end of parallel section */
}
55
Single Serializes a section of code within a parallel region
Sometimes more convenient than terminating a parallel region and starting it later especially because variables are already shared/private, etc. Typically used to serialize a small section of the code that’s not thread safe e.g., I/O
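A minimal sketch of such a serialized I/O section inside a parallel region (do_parallel_work() and do_more_parallel_work() are hypothetical functions):

#pragma omp parallel
{
    do_parallel_work();              /* executed by every thread */

    #pragma omp single
    {
        printf("Progress report\n"); /* executed by exactly one thread */
    }                                /* implicit barrier at the end of single */

    do_more_parallel_work();         /* executed by every thread */
}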
56
Combined Directives Sometimes it is cumbersome to create a parallel region and then create a parallel for loop, or sections, just to terminate the parallel region Therefore OpenMP provides a way to do both at the same time #pragma omp parallel for #pragma omp parallel sections
57
Synchronization and Sharing
When variables are shared among threads, OpenMP provides tools to make sure that the sharing is correct Why could things be unsafe?
int x = 0;
#pragma omp parallel sections shared(x)
{
  #pragma omp section
  x = x + 1;
  #pragma omp section
  x = x + 2;
}
58
Synchronization directive
#pragma omp master Creates a region that only the master executes #pragma omp critical Creates a critical section #pragma omp barrier Creates a “barrier” #pragma omp atomic Create a “mini” critical section
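Of the four, master is the only one not illustrated on the following slides; a minimal sketch (compute_local_part() is a hypothetical function):

#pragma omp parallel
{
    compute_local_part();          /* run by all threads */

    #pragma omp master
    printf("Only the master (thread 0) prints this\n");
    /* note: unlike single, master has NO implicit barrier */

    #pragma omp barrier            /* explicit barrier if all threads must wait here */
}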
59
Critical Section
#pragma omp parallel for \
    shared(sum)
for (i = 0; i < n; i++) {
  value = f(a[i]);
  #pragma omp critical
  {
    sum = sum + value;
  }
}
60
Barrier if (x == 2) { #pragma omp barrier }
All threads in the current parallel section will synchronize they will all wait for each other at this instruction Must appear within a basic block
61
Atomic #pragma omp atomic i++; Only for some expressions
x = expr (no mutual exclusion on expr evaluation), x++, ++x, x--, --x. Is about atomic access to a memory location Some implementations will just replace atomic by critical and create a basic block, but some may take advantage of cool hardware instructions that work atomically
62
Scheduling When I talked about the parallel for loops, I didn’t say how the iterations were shared among threads Question: I have 100 iterations. I have 5 threads. Which thread does which iteration? OpenMP provides many options to do this Choice #1: Chunk size a way to group iterations together e.g., chunk size = 2 means that iterations are grouped 2 by 2 allows one to avoid prohibitive overhead in some situations Choice #2: Scheduling Policy
63
Loop Scheduling in OpenMP
static: Iterations are divided into pieces of a size specified by chunk. The pieces are statically assigned to threads in the team in a round-robin fashion in the order of the thread number. dynamic: Iterations are broken into pieces of a size specified by chunk. As each thread finishes a piece of the iteration space, it dynamically obtains the next set of iterations. guided: The chunk size is reduced in an exponentially decreasing manner with each dispatched piece of the iteration space. chunk specifies the smallest piece (except possibly the last). Default schedule: implementation dependent.
64
Example
int chunk = 3;
#pragma omp parallel for \
    shared(a,b,c,chunk)  \
    private(i)           \
    schedule(static,chunk)
for (i=0; i < n; i++)
  c[i] = a[i] + b[i];
65
[Figure: OpenMP scheduling — iterations with chunk size = 2 distributed across Thread 1, Thread 2, and Thread 3]
66
[Figure: STATIC scheduling, chunk size = 2 — chunks of iterations assigned round-robin to Thread 1, Thread 2, and Thread 3; time on the vertical axis]
67
So, isn’t static optimal?
The problem is that in many cases the iterations are not identical Some iterations take longer to compute than others Example #1 Each iteration is a rendering of a movie’s frame More complex frames require more work Example #2 Each iteration is a “google search” Some searches are easy Some searches are hard In such cases, load imbalance arises, which we know is bad
68
[Figure: STATIC scheduling, chunk size = 2, when iterations have unequal cost — Thread 1, Thread 2, Thread 3; time on the vertical axis]
69
[Figure: DYNAMIC scheduling, chunk size = 2 — each thread grabs the next chunk as it finishes; Thread 1, Thread 2, Thread 3; time on the vertical axis]
70
So isn’t dynamic optimal?
One thing we haven’t talked much about is the overhead Dynamic scheduling with small chunks causes more overhead than static scheduling In the static case, one can compute what each thread does at the beginning of the loop and then let the threads proceed unhindered In the dynamic case, there needs to be some type of communication: “I am done with my 2 iterations, which ones do I do next?” Can be implemented in a variety of ways internally Using dynamic scheduling with a large chunk size leads to lower overhead, but defeats the purpose: with fewer chunks, load-balancing is harder Guided Scheduling: best of both worlds starts with large chunks, ends with small ones
71
[Figure: GUIDED scheduling, minimum chunk size = 2 — chunk sizes decrease over time (here 3 chunks of size 4, then 3 chunks of size 2); Thread 1, Thread 2, Thread 3; time on the vertical axis]
72
What should I do? Pick a reasonable chunk size
Use static if computation is evenly spread among iterations Otherwise probably use guided
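As a sketch, switching the earlier array-sum loop to guided scheduling only changes the schedule clause (g() stands for hypothetical per-iteration work of varying cost):

/* chunk = 2 is the smallest piece the runtime will hand out */
#pragma omp parallel for \
    shared(a,b,c)        \
    private(i)           \
    schedule(guided, 2)
for (i = 0; i < n; i++)
    c[i] = g(a[i], b[i]);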
73
How does OpenMP work? The pragmas allow OpenMP to build some notion of structure of the code And then, OpenMP generates pthread code!! You can see this by running the nm command on your executable OpenMP hides a lot of the complexity But it doesn’t have all the flexibility The two are used in different domains OpenMP: “scientific applications” Pthreads: “system” applications But this distinction is really arbitrary IMHO
74
More OpenMP Information
OpenMP Homepage: On-line OpenMP Tutorial:
75
Lessons Although we have only scratched the surface of parallel computing, we have already encountered a few fundamental concepts Load-balancing is good Overhead is bad The two often pose a difficult trade-off Achieving great load-balancing can often only be done with high overhead, which puts us back where we started We will see this trade-off over and over in many different contexts