1
Pthreads Topics
Introduction to Pthreads
Data Parallelism
Task Parallelism: Pipeline
Task Parallelism: Task queue
Examples
2
– 2 – Goal of next lectures Introduction to programming with Pthreads. Standard patterns of parallel programs: data parallelism, task parallelism. Examples of each.
3
– 3 – Intro to Pthreads for Shared Memory [Diagram: processors proc1, proc2, proc3, ..., procN all connected to one Shared Memory Address Space.] All threads access the same shared memory data space.
4
– 4 – Intro to Pthreads (continued) Concretely, it means that a variable x, a pointer p, or an array a[] refer to the same object, no matter what processor the reference originates from. We have more or less implicitly assumed this to be the case in earlier examples.
5
– 5 – Multithreading User has explicit control over threads. Good: that control can be used for performance benefit. Bad: the user has to deal with it.
6
– 6 – Pthreads POSIX standard shared-memory multithreading interface. Provides primitives for process management and synchronization.
7
– 7 – What does the user have to do? Decide how to decompose the computation into parallel parts. Create (and destroy) threads to support that decomposition. Add synchronization to make sure dependences are covered.
8
– 8 – General Thread Structure Typically, a thread is a concurrent execution of a function or a procedure. So, your program needs to be restructured such that parallel parts form separate procedures or functions.
9
– 9 – Example of Thread Creation (contd.) [Diagram: main() calls pthread_create(func); a new thread starts running func() while main() continues.]
10
– 10 – Thread Joining Example
void *func(void *arg) { ….. }

pthread_t id;
int X;
pthread_create(&id, NULL, func, &X);
…..
pthread_join(id, NULL);
…..
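To make the pattern above concrete, here is a minimal, self-contained sketch of create/join (the printed message and variable names are illustrative); compile with something like cc -pthread.

  #include <pthread.h>
  #include <stdio.h>

  void *func(void *arg) {
      int *x = (int *)arg;                 /* argument passed via pthread_create */
      printf("thread sees X = %d\n", *x);
      return NULL;
  }

  int main(void) {
      pthread_t id;
      int X = 42;
      pthread_create(&id, NULL, func, &X);
      /* ... main can do other work here ... */
      pthread_join(id, NULL);              /* wait for the thread to finish */
      return 0;
  }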
11
– 11 – Example of Thread Creation (contd.) [Diagram: main() calls pthread_create(func); the new thread runs func() and ends with pthread_exit(); main() calls pthread_join(id) and blocks until that thread has terminated.]
12
– 12 – Example: Matrix Multiply
for( i=0; i<n; i++ )
  for( j=0; j<n; j++ ) {
    c[i][j] = 0.0;
    for( k=0; k<n; k++ )
      c[i][j] += a[i][k]*b[k][j];
  }
13
– 13 – Parallel Matrix Multiply All i- or j-iterations can be run in parallel. If we have p processors, n/p rows to each processor. Corresponds to partitioning i-loop.
14
– 14 – Matrix Multiply: Parallel Part
void *mmult(void *s) {
  int slice = (int) s;
  int from = (slice*n)/p;
  int to = ((slice+1)*n)/p;
  for( i=from; i<to; i++ )
    for( j=0; j<n; j++ ) {
      c[i][j] = 0.0;
      for( k=0; k<n; k++ )
        c[i][j] += a[i][k]*b[k][j];
    }
  return NULL;
}
15
– 15 – Matrix Multiply: Main
int main() {
  pthread_t thrd[p];
  for( i=0; i<p; i++ )
    pthread_create(&thrd[i], NULL, mmult, (void*) i);
  for( i=0; i<p; i++ )
    pthread_join(thrd[i], NULL);
}
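For reference, here is a self-contained, compilable sketch that combines the two slides above. The matrix size and thread count are illustrative; n, p, and the matrices are globals, as the slides imply, and the slice index is passed through the void* argument (a long is used to round-trip the index through a pointer cleanly). Build with something like cc -O2 -pthread.

  #include <pthread.h>

  #define n 512
  #define p 4

  double a[n][n], b[n][n], c[n][n];

  void *mmult(void *s) {
      long slice = (long)s;                /* thread index passed by value */
      int from = (slice * n) / p;
      int to   = ((slice + 1) * n) / p;
      for (int i = from; i < to; i++)
          for (int j = 0; j < n; j++) {
              c[i][j] = 0.0;
              for (int k = 0; k < n; k++)
                  c[i][j] += a[i][k] * b[k][j];
          }
      return NULL;
  }

  int main(void) {
      pthread_t thrd[p];
      for (long i = 0; i < p; i++)
          pthread_create(&thrd[i], NULL, mmult, (void *)i);
      for (int i = 0; i < p; i++)
          pthread_join(thrd[i], NULL);
      return 0;
  }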
16
– 16 – Summary: Thread Management pthread_create(): creates a parallel thread executing a given function (and arguments), returns thread identifier. pthread_exit(): terminates thread. pthread_join(): waits for thread with particular thread identifier to terminate.
17
– 17 – Summary: Program Structure Encapsulate parallel parts in functions. Use function arguments to parameterize what a particular thread does. Call pthread_create() with the function and arguments, save thread identifier returned. Call pthread_join() with that thread identifier.
18
– 18 – Private Data in Pthreads To make a variable private in Pthreads, you need to make an array out of it. Index the array by a thread identifier that you keep yourself. You can also get the thread id maintained by the system by calling pthread_self(). Not very elegant or efficient.
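A small sketch of the "array indexed by thread id" idiom described above; the names (worker, my_sum) and the thread count are illustrative, and the index is the one passed at thread creation.

  #define P 4                      /* number of threads (illustrative) */
  double my_sum[P];                /* one "private" slot per thread */

  void *worker(void *s) {
      long me = (long)s;           /* the index we passed to pthread_create */
      my_sum[me] = 0.0;            /* each thread touches only its own slot */
      /* ... accumulate partial results into my_sum[me] ... */
      return NULL;
  }

One reason this is inefficient: neighboring array slots can share a cache line, so such arrays are often padded to one slot per cache line to avoid false sharing.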
19
– 19 – Pthreads Synchronization Need for fine-grain synchronization: mutex locks and condition variables.
20
– 20 – Use of Mutex Locks To implement critical sections. Pthreads provides only exclusive locks. Some other systems allow shared-read, exclusive-write locks.
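A minimal sketch of a critical section implemented with a Pthreads mutex (the shared counter is illustrative):

  #include <pthread.h>

  pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
  int counter = 0;

  void increment(void) {
      pthread_mutex_lock(&m);      /* enter critical section */
      counter++;                   /* shared data protected by m */
      pthread_mutex_unlock(&m);    /* leave critical section */
  }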
21
– 21 – Condition Variables (1 of 5) pthread_cond_init( pthread_cond_t *cond, pthread_condattr_t *attr) Initializes a new condition variable cond. Attribute: ignore for now (pass NULL).
22
– 22 – Condition Variables (2 of 5) pthread_cond_destroy( pthread_cond_t *cond) Destroys the condition variable cond.
23
– 23 – Condition Variables (3 of 5) pthread_cond_wait( pthread_cond_t *cond, pthread_mutex_t *mutex) Blocks the calling thread, waiting on cond. Unlocks the mutex while blocked and re-acquires it before returning.
24
– 24 – Condition Variables (4 of 5) pthread_cond_signal( pthread_cond_t *cond) Unblocks one thread waiting on cond. Which one is determined by scheduler. If no thread waiting, then signal is a no-op.
25
– 25 – Condition Variables (5 of 5) pthread_cond_broadcast( pthread_cond_t *cond) Unblocks all threads waiting on cond. If no thread waiting, then broadcast is a no-op.
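The three calls above are almost always used in the following pattern: the condition ("ready" here is an illustrative flag) is tested in a loop around the wait, under the mutex. A sketch:

  pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
  pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
  int ready = 0;                       /* the condition the waiter waits for */

  void wait_until_ready(void) {
      pthread_mutex_lock(&m);
      while (!ready)                   /* re-check: wakeups may be spurious */
          pthread_cond_wait(&cv, &m);  /* releases m while blocked, re-locks on return */
      pthread_mutex_unlock(&m);
  }

  void mark_ready(void) {
      pthread_mutex_lock(&m);
      ready = 1;
      pthread_cond_signal(&cv);
      pthread_mutex_unlock(&m);
  }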
26
– 26 – Use of Condition Variables To implement signal-wait synchronization discussed in earlier examples. Important note: a signal is “forgotten” if there is no corresponding wait that has already happened.
27
– 27 – Example (from a few lectures ago) for( i=1; i<100; i++ ) { a[i] = …; …; … = a[i-1]; } Loop-carried dependence, not parallelizable
28
– 28 – Example (continued) for( i=...; i<...; i++ ) { a[i] = …; signal(e_a[i]); …; wait(e_a[i-1]); … = a[i-1]; }
29
– 29 – How to Remember a Signal (1 of 2)
semaphore_signal(i) {
  pthread_mutex_lock(&mutex_rem[i]);
  arrived[i] = 1;
  pthread_cond_signal(&cond[i]);
  pthread_mutex_unlock(&mutex_rem[i]);
}
30
– 30 – How to Remember a Signal (2 of 2)
semaphore_wait(i) {
  pthread_mutex_lock(&mutex_rem[i]);
  while( arrived[i] == 0 ) {   /* while, not if: guards against spurious wakeups */
    pthread_cond_wait(&cond[i], &mutex_rem[i]);
  }
  arrived[i] = 0;
  pthread_mutex_unlock(&mutex_rem[i]);
}
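A sketch of the declarations and initialization the two routines above assume (the number of events is illustrative):

  #define N_EVENTS 100
  pthread_mutex_t mutex_rem[N_EVENTS];
  pthread_cond_t  cond[N_EVENTS];
  int             arrived[N_EVENTS];   /* 1 = signal happened, not yet consumed */

  void init_events(void) {
      for (int i = 0; i < N_EVENTS; i++) {
          pthread_mutex_init(&mutex_rem[i], NULL);
          pthread_cond_init(&cond[i], NULL);
          arrived[i] = 0;
      }
  }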
31
– 31 – Example (continued) for( i=...; i<...; i++ ) { a[i] = …; semaphore_signal(e_a[i]); …; semaphore_wait(e_a[i-1]); … = a[i-1]; }
32
– 32 – More Examples: SOR SOR implements a mathematical model for many natural phenomena, e.g., heat dissipation in a metal sheet. Model is a partial differential equation. Focus is on algorithm, not on derivation.
33
– 33 – Problem Statement [Figure: a rectangle in the (x, y) plane; F = 1 on one boundary, F = 0 on the others.] Solve ∇²F(x,y) = 0 inside the rectangle.
34
– 34 – Discretization Represent F in the continuous rectangle by a 2-dimensional discrete grid (array). The boundary conditions on the rectangle become the boundary values of the array. The internal values are found by the relaxation algorithm.
35
– 35 – Discretized Problem Statement [Figure: the grid, indexed by i (rows) and j (columns).]
36
– 36 – Relaxation Algorithm
For some number of iterations:
  for each internal grid point, compute the average of its four neighbors.
Termination condition: values at grid points change very little (we will ignore this part in our example).
37
– 37 – Discretized Problem Statement
for some number of timesteps/iterations {
  for( i=1; i<n; i++ )
    for( j=1; j<n; j++ )
      temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j] +
                            grid[i][j-1] + grid[i][j+1] );
  for( i=1; i<n; i++ )
    for( j=1; j<n; j++ )
      grid[i][j] = temp[i][j];
}
38
– 38 – Parallel SOR No dependences between iterations of first (i,j) loop nest. No dependences between iterations of second (i,j) loop nest. True-dependence between first and second loop nest in the same timestep. True dependence between second loop nest and first loop nest of next timestep.
39
– 39 – Parallel SOR (continued) First (i,j) loop nest can be parallelized. Second (i,j) loop nest can be parallelized. We must make processors wait at the end of each (i,j) loop nest. Natural synchronization: fork-join.
40
– 40 – Parallel SOR (continued) If we have P processors, we can give n/P rows or columns to each processor. Or, we can divide the array in P squares, and give each processor a square to compute.
41
– 41 – Pthreads SOR: main
for some number of timesteps {
  for( i=0; i<p; i++ )
    pthread_create(&thrd[i], NULL, sor_1, (void *)i);
  for( i=0; i<p; i++ )
    pthread_join(thrd[i], NULL);
  for( i=0; i<p; i++ )
    pthread_create(&thrd[i], NULL, sor_2, (void *)i);
  for( i=0; i<p; i++ )
    pthread_join(thrd[i], NULL);
}
42
– 42 – Pthreads SOR: Parallel parts (1)
void* sor_1(void *s) {
  int slice = (int) s;
  int from = (slice*n)/p;
  int to = ((slice+1)*n)/p;
  for( i=from; i<to; i++ )
    for( j=0; j<n; j++ )
      temp[i][j] = 0.25*(grid[i-1][j] + grid[i+1][j] +
                         grid[i][j-1] + grid[i][j+1]);
  return NULL;
}
43
– 43 – Pthreads SOR: Parallel parts (2)
void* sor_2(void *s) {
  int slice = (int) s;
  int from = (slice*n)/p;
  int to = ((slice+1)*n)/p;
  for( i=from; i<to; i++ )
    for( j=0; j<n; j++ )
      grid[i][j] = temp[i][j];
  return NULL;
}
44
– 44 – Reality bites... Create/exit/join is not so cheap. It would be more efficient to come up with a parallel program in which create/exit/join happens rarely (ideally once) and cheaper synchronization is used. We need something that makes all threads wait until all have arrived -- a barrier.
45
– 45 – Barrier Synchronization A wait at a barrier causes a thread to wait until all threads have performed a wait at the barrier. At that point, they all proceed.
46
– 46 – Implementing Barriers in Pthreads Count the number of arrivals at the barrier. Wait if this is not the last arrival. Make everyone unblock if this is the last arrival. Since the arrival count is a shared variable, enclose the whole operation in a mutex lock-unlock.
47
– 47 – Implementing Barriers in Pthreads
void barrier() {
  pthread_mutex_lock(&mutex_arr);
  arrived++;
  if (arrived < N) {
    pthread_cond_wait(&cond, &mutex_arr);
  } else {
    pthread_cond_broadcast(&cond);
    arrived = 0;   /* be prepared for next barrier */
  }
  pthread_mutex_unlock(&mutex_arr);
}
48
– 48 – Parallel SOR with Barriers (1 of 2)
void* sor(void* arg) {
  int slice = (int)arg;
  int from = (slice * (n-1))/p + 1;
  int to = ((slice+1) * (n-1))/p + 1;
  for some number of iterations {
    …
  }
}
49
– 49 – Parallel SOR with Barriers (2 of 2)
for (i=from; i<to; i++)
  for (j=1; j<n; j++)
    temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                         grid[i][j-1] + grid[i][j+1]);
barrier();
for (i=from; i<to; i++)
  for (j=1; j<n; j++)
    grid[i][j] = temp[i][j];
barrier();
50
– 50 – Parallel SOR with Barriers: main
int main(int argc, char *argv[]) {
  pthread_t thrd[p];
  /* Initialize mutex and condition variables (and the thread attribute attr) */
  for (i=0; i<p; i++)
    pthread_create(&thrd[i], &attr, sor, (void*)i);
  for (i=0; i<p; i++)
    pthread_join(thrd[i], NULL);
  /* Destroy mutex and condition variables */
}
51
– 51 – Note again Many shared memory programming systems (other than Pthreads) have barriers as a basic primitive. If they do, you should use them rather than constructing your own: the built-in implementation may be more efficient than what you can do yourself.
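POSIX itself in fact provides a barrier type (on systems that implement the Barriers option, e.g. Linux). A sketch of using it in place of the hand-rolled barrier(); the thread count P is illustrative:

  #include <pthread.h>

  #define P 4                          /* number of participating threads (illustrative) */
  pthread_barrier_t bar;

  void setup(void)    { pthread_barrier_init(&bar, NULL, P); }   /* before creating the threads */
  void barrier(void)  { pthread_barrier_wait(&bar); }            /* drop-in for the hand-rolled barrier() */
  void teardown(void) { pthread_barrier_destroy(&bar); }         /* after joining the threads */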
52
– 52 – Molecular Dynamics (MD) Simulation of a set of bodies under the influence of physical laws. Atoms, molecules, celestial bodies,... Have same basic structure.
53
– 53 – Molecular Dynamics (Skeleton)
for some number of timesteps {
  for all molecules i
    for all other molecules j
      force[i] += f( loc[i], loc[j] );
  for all molecules i
    loc[i] = g( loc[i], force[i] );
}
54
– 54 – Molecular Dynamics (continued) To reduce amount of computation, account for interaction only with nearby molecules.
55
– 55 – Molecular Dynamics (continued)
for some number of timesteps {
  for all molecules i
    for all nearby molecules j
      force[i] += f( loc[i], loc[j] );
  for all molecules i
    loc[i] = g( loc[i], force[i] );
}
56
– 56 – Molecular Dynamics (continued)
For each molecule i we keep:
  count[i] -- the number of nearby molecules
  index[j] -- an array of indices of the nearby molecules ( 0 <= j < count[i] )
57
– 57 – Molecular Dynamics (continued)
for some number of timesteps {
  for( i=0; i<num_mol; i++ )
    for( j=0; j<count[i]; j++ )
      force[i] += f( loc[i], loc[index[j]] );
  for( i=0; i<num_mol; i++ )
    loc[i] = g( loc[i], force[i] );
}
58
– 58 – Molecular Dynamics (continued) No loop-carried dependence in first i-loop. Loop-carried dependence (reduction) in j-loop. No loop-carried dependence in second i-loop. True dependence between first and second i-loop.
59
– 59 – Molecular Dynamics (continued) First i-loop can be parallelized. Second i-loop can be parallelized. Must make processors wait between loops. Natural synchronization: fork-join.
60
– 60 – Molecular Dynamics (continued)
for some number of timesteps {
  for( i=0; i<num_mol; i++ )
    for( j=0; j<count[i]; j++ )
      force[i] += f( loc[i], loc[index[j]] );
  for( i=0; i<num_mol; i++ )
    loc[i] = g( loc[i], force[i] );
}
Parallelize the two for loops (assume fork-join parallelism). I will use the notation "Parallel for" to denote fork-join parallelism for a for loop.
61
– 61 – Molecular Dynamics (simple)
for some number of timesteps {
  Parallel for
  for( i=0; i<num_mol; i++ )
    for( j=0; j<count[i]; j++ )
      force[i] += f( loc[i], loc[index[j]] );
  Parallel for
  for( i=0; i<num_mol; i++ )
    loc[i] = g( loc[i], force[i] );
}
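A sketch of what the first "Parallel for" could expand to, in the same fork-join style as the SOR example. The function name md_force is chosen here; num_mol, p, thrd, and the arrays count, index, loc, force (and the function f) are assumed to be globals declared as on the earlier slides.

  void *md_force(void *s) {
      long slice = (long)s;
      int from = (slice * num_mol) / p;
      int to   = ((slice + 1) * num_mol) / p;
      for (int i = from; i < to; i++)
          for (int j = 0; j < count[i]; j++)
              force[i] += f(loc[i], loc[index[j]]);
      return NULL;
  }

  /* inside the timestep loop, replacing the first "Parallel for": */
  for (long i = 0; i < p; i++)
      pthread_create(&thrd[i], NULL, md_force, (void *)i);
  for (int i = 0; i < p; i++)
      pthread_join(thrd[i], NULL);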
62
– 62 – Irregular vs. regular data parallel In SOR, all arrays are accessed through linear expressions of the loop indices, known at compile time [regular]. In MD, some arrays are accessed through non-linear expressions of the loop indices, some known only at runtime [irregular].
63
– 63 – Irregular vs. regular data parallel No real difference in terms of parallelization (based on dependences). It does lead to fundamental differences in how the parallelism is expressed: irregular access is difficult for parallelism based on data distribution, but not difficult for parallelism based on iteration distribution.
64
– 64 – Molecular Dynamics (continued) Parallelization of the first loop has a load balancing issue: some molecules have few neighbors and others have many, so more sophisticated loop partitioning is necessary (see the sketch below).
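One simple option, sketched under the same assumptions as the previous sketch (names chosen here, globals from the earlier slides): give each thread every p-th molecule (a cyclic distribution) instead of one contiguous block, so molecules with many neighbors tend to be spread across threads.

  void *md_force_cyclic(void *s) {
      long me = (long)s;
      for (int i = me; i < num_mol; i += p)      /* every p-th molecule, not a block */
          for (int j = 0; j < count[i]; j++)
              force[i] += f(loc[i], loc[index[j]]);
      return NULL;
  }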
65
– 65 – Flavors of Parallelism
Data parallelism: all processors do the same thing on different data.
  Regular
  Irregular
Task parallelism: processors do different tasks.
  Task queue
  Pipelines
66
– 66 – Task Parallelism Each process performs a different task. Two principal flavors: pipelines and task queues. Program Examples: PIPE (pipeline), TSP (task queue).
67
– 67 – Pipeline Often occurs with image processing applications, where a number of images undergo a sequence of transformations, e.g., rendering, clipping, compression, etc.
68
– 68 – Sequential Program
for( i=0; i<num_pic && read(in_pic[i]); i++ ) {
  int_pic_1[i] = trans1( in_pic[i] );
  int_pic_2[i] = trans2( int_pic_1[i] );
  int_pic_3[i] = trans3( int_pic_2[i] );
  out_pic[i]   = trans4( int_pic_3[i] );
}
69
– 69 – Parallelizing a Pipeline For simplicity, assume we have 4 processors (i.e., equal to the number of transformations). Furthermore, assume we have a very large number of pictures (>> 4).
70
– 70 – Parallelizing a Pipeline (part 1)
Processor 1:
for( i=0; i<num_pics && read(in_pic[i]); i++ ) {
  int_pic_1[i] = trans1( in_pic[i] );
  signal( event_1_2[i] );
}
71
– 71 – Parallelizing a Pipeline (part 2)
Processor 2:
for( i=0; i<num_pics; i++ ) {
  wait( event_1_2[i] );
  int_pic_2[i] = trans2( int_pic_1[i] );
  signal( event_2_3[i] );
}
Same for processor 3.
72
– 72 – Parallelizing a Pipeline (part 3)
Processor 4:
for( i=0; i<num_pics; i++ ) {
  wait( event_3_4[i] );
  out_pic[i] = trans4( int_pic_3[i] );
}
73
– 73 – Sequential vs. Parallel Execution [Timing diagram comparing sequential and pipelined execution; pattern = picture, horizontal line = processor.]
74
– 74 – Another Sequential Program
for( i=0; i<num_pic && read(in_pic); i++ ) {
  int_pic_1 = trans1( in_pic );
  int_pic_2 = trans2( int_pic_1 );
  int_pic_3 = trans3( int_pic_2 );
  out_pic   = trans4( int_pic_3 );
}
75
– 75 – Can we use same parallelization?
Processor 2:
for( i=0; i<num_pics; i++ ) {
  wait( event_1_2[i] );
  int_pic_2 = trans2( int_pic_1 );
  signal( event_2_3[i] );
}
Same for processor 3.
76
– 76 – Can we use same parallelization? No: because of the anti-dependences between stages (the single variables are reused for every picture), there is no parallelism. In the first version we used privatization (one array element per picture) to enable pipeline parallelism. Privatization is often used to avoid dependences (not only with pipelines), but it is costly in terms of memory.
77
– 77 – In-between Solution Use n>1 buffers between stages. Block when buffers are full or empty.
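A sketch of such an n-slot buffer between two stages, built from a mutex and two condition variables; the slot count, the item type, and the function names are illustrative. The earlier stage calls put(), the later stage calls get(), and each blocks when the buffer is full or empty.

  #include <pthread.h>

  #define BUFSLOTS 4
  typedef int picture_t;                    /* placeholder for one picture */

  picture_t buf[BUFSLOTS];
  int nfilled = 0, head = 0, tail = 0;
  pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
  pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;
  pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

  void put(picture_t pic) {                 /* called by the earlier stage */
      pthread_mutex_lock(&m);
      while (nfilled == BUFSLOTS)           /* block while the buffer is full */
          pthread_cond_wait(&not_full, &m);
      buf[tail] = pic;
      tail = (tail + 1) % BUFSLOTS;
      nfilled++;
      pthread_cond_signal(&not_empty);
      pthread_mutex_unlock(&m);
  }

  picture_t get(void) {                     /* called by the later stage */
      pthread_mutex_lock(&m);
      while (nfilled == 0)                  /* block while the buffer is empty */
          pthread_cond_wait(&not_empty, &m);
      picture_t pic = buf[head];
      head = (head + 1) % BUFSLOTS;
      nfilled--;
      pthread_cond_signal(&not_full);
      pthread_mutex_unlock(&m);
      return pic;
  }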
78
– 78 – Perfect Pipeline [Timing diagram: sequential vs. parallel execution of a perfectly balanced pipeline; pattern = picture, horizontal line = processor.]
79
– 79 – Things are often not that perfect One stage takes more time than others. Stages take a variable amount of time. Extra buffers provide some cushion against variability.
80
– 80 – PIPE Using Pthreads Remember: replacing the original wait/signal by a plain Pthreads condition-variable wait/signal will not work, because signals that arrive before a wait are forgotten. We need to remember a signal (the semaphore_wait and semaphore_signal shown earlier).
81
– 81 – PIPE with Pthreads
P1: for( i=0; i<num_pics && read(in_pic); i++ ) {
      int_pic_1[i] = trans1( in_pic );
      semaphore_signal( event_1_2[i] );
    }
P2: for( i=0; i<num_pics; i++ ) {
      semaphore_wait( event_1_2[i] );
      int_pic_2[i] = trans2( int_pic_1[i] );
      semaphore_signal( event_2_3[i] );
    }
82
– 82 – Note Many shared memory programming systems (other than Pthreads) have semaphores as a basic primitive. If they do, you should use them rather than constructing your own: the built-in implementation may be more efficient than what you can do yourself (see the sketch below).
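POSIX itself provides counting semaphores outside the pthread_* calls, in <semaphore.h>. A sketch of using them for the per-picture events above; the number of pictures is illustrative:

  #include <semaphore.h>

  #define NUM_PICS 1000                    /* illustrative */
  sem_t event_1_2[NUM_PICS];

  void init_events(void) {                 /* once, before starting the pipeline */
      for (int i = 0; i < NUM_PICS; i++)
          sem_init(&event_1_2[i], 0, 0);   /* pshared=0: threads of this process; initial value 0 */
  }

  /* stage 1, after producing int_pic_1[i]:   sem_post(&event_1_2[i]);   the "signal" is remembered */
  /* stage 2, before consuming int_pic_1[i]:  sem_wait(&event_1_2[i]);                              */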
83
– 83 – TSP (Traveling Salesman) Goal: given a list of cities, a matrix of distances between them, and a starting city, find the shortest tour in which all cities are visited exactly once. Example of an NP-hard search problem. Algorithm: branch-and-bound.
84
– 84 – Branching
85
– 85 – Branching
Initialization: go from the starting city to each of the remaining cities; put each resulting partial path into a priority queue, ordered by its current length.
Further (repeatedly): take the head element out of the priority queue, expand it by each one of the remaining cities, and put each resulting partial path back into the priority queue.
86
– 86 – Finding the Solution Eventually, a complete path will be found. Remember its length as the current shortest path. Every time a complete path is found, check if we need to update current best path. When priority queue becomes empty, best path is found.
87
– 87 – Using a Simple Bound Once a complete path is found, its length is a bound on the length of the shortest path. There is no use in exploring a partial path that is already longer than the current best complete path.
88
– 88 – Sequential TSP: Data Structures Priority queue of partial paths. Current best solution and its length. For simplicity, we will ignore bounding.
89
– 89 – Sequential TSP: Code Outline
init_q(); init_best();
while( (p = de_queue()) != NULL ) {
  for each expansion by one city {
    q = add_city(p);
    if( complete(q) )
      update_best(q);
    else
      en_queue(q);
  }
}
90
– 90 – Parallel TSP: Possibilities Have each process do one expansion. Have each process do expansion of one partial path. Have each process do expansion of multiple partial paths. Issue of granularity/performance, not an issue of correctness. Assume: process expands one partial path.
91
– 91 – Parallel TSP: Synchronization True dependence between process that puts partial path in queue and the one that takes it out. Dependences arise dynamically. Required synchronization: need to make process wait if q is empty.
92
– 92 – Parallel TSP: First cut (part 1)
process i:
while( (p = de_queue()) != NULL ) {
  for each expansion by one city {
    q = add_city(p);
    if( complete(q) )
      update_best(q);
    else
      en_queue(q);
  }
}
93
– 93 – Parallel TSP: First cut (part 2) In de_queue: wait if q is empty. In en_queue: signal that q is no longer empty.
94
– 94 – Parallel TSP: More synchronization All processes operate, potentially at the same time, on q and best. This must not be allowed to happen. Critical section: only one process can execute in critical section at once.
95
– 95 – Parallel TSP: Critical Sections All shared data must be protected by critical section. Update_best must be protected by a critical section. En_queue and de_queue must be protected by the same critical section.
96
– 96 – Termination condition How do we know when we are done? All processes are waiting inside de_queue. Count the number of waiting processes before waiting. If equal to total number of processes, we are done.
97
– 97 – Parallel TSP Complete parallel program will be provided on the Web. Includes wait/signal on empty q. Includes critical sections. Includes termination condition.
98
– 98 – Parallel TSP
process i:
while( (p = de_queue()) != NULL ) {
  for each expansion by one city {
    q = add_city(p);
    if( complete(q) )
      update_best(q);
    else
      en_queue(q);
  }
}
99
– 99 – Parallel TSP Need critical section in update_best, in en_queue/de_queue. In de_queue wait if q is empty, terminate if all processes are waiting. In en_queue: signal q is no longer empty.
100
– 100 – Parallel TSP: Mutual Exclusion
en_queue() / de_queue() {
  pthread_mutex_lock(&queue);
  …;
  pthread_mutex_unlock(&queue);
}
update_best() {
  pthread_mutex_lock(&best);
  …;
  pthread_mutex_unlock(&best);
}
101
– 101 – Parallel TSP: Condition Synchronization
de_queue() {
  while( (q is empty) and (not done) ) {
    waiting++;
    if( waiting == p ) {
      done = true;
      pthread_cond_broadcast(&empty);
    } else {
      pthread_cond_wait(&empty, &queue);
      waiting--;
    }
  }
  if( done )
    return null;
  else
    remove and return head of the queue;
}
102
– 102 – Other Primitives in Pthreads Set the attributes of a thread. Set the attributes of a mutex lock. Set scheduling parameters.
103
– 103 – Busy Waiting Not an explicit part of the API. Available in a general shared memory programming environment.
104
– 104 – Busy Waiting
initially: flag = 0;

P1: produce data;
    flag = 1;

P2: while( !flag ) ;
    consume data;
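As written, flag is an ordinary shared variable, so the compiler or hardware may reorder or cache the accesses. One way to make the same busy wait well-defined is C11 atomics; a sketch (function names are illustrative):

  #include <stdatomic.h>

  atomic_int flag = 0;

  void producer(void) {                /* P1 */
      /* produce data ... */
      atomic_store_explicit(&flag, 1, memory_order_release);
  }

  void consumer(void) {                /* P2 */
      while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
          ;                            /* spin until the flag is set */
      /* consume data ... */
  }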
105
– 105 – Use of Busy Waiting On the surface, simple and efficient. In general, not a recommended practice. Often leads to messy and unreadable code (blurs the data/synchronization distinction). May be inefficient.