ECE1747 Parallel Programming Shared Memory Multithreading Pthreads.

Shared Memory [Figure: proc1, proc2, proc3, ..., procN all connected to one shared memory address space.] All threads access the same shared memory data space.

Shared Memory (continued) Concretely, it means that a variable x, a pointer p, or an array a[] refers to the same object, no matter which processor the reference originates from. We have more or less implicitly assumed this to be the case in earlier examples.

Shared Memory [Figure: proc1, proc2, proc3, ..., procN all referring to the single copy of a in shared memory.]

Distributed Memory - Message Passing The alternative model to shared memory. [Figure: proc1, proc2, proc3, ..., procN, each with its own memory mem1, mem2, mem3, ..., memN holding its own copy of a, connected by a network.]

Shared Memory vs. Message Passing The same terminology is also used to distinguish hardware. Here it distinguishes programming models, not hardware.

Programming vs. Hardware One can implement a shared memory programming model on shared or distributed memory hardware (and the implementation itself can be in software or in hardware). One can likewise implement a message passing programming model on shared or distributed memory hardware.

Portability of programming models [Figure: shared memory programming and message passing programming, each related to both a shared memory machine and a distributed memory machine.]

Shared Memory Programming: Important Point to Remember No matter what the implementation, it conceptually looks like shared memory. There may be some (important) performance differences.

Multithreading The user has explicit control over threads. Good: that control can be used to improve performance. Bad: the user has to deal with it.

Pthreads POSIX standard shared-memory multithreading interface. Provides primitives for thread management and synchronization.

What does the user have to do? Decide how to decompose the computation into parallel parts. Create (and destroy) threads to support that decomposition. Add synchronization to make sure dependences are covered.

General Thread Structure Typically, a thread is a concurrent execution of a function or a procedure. So, your program needs to be restructured such that parallel parts form separate procedures or functions.

Example of Thread Creation (contd.) [Figure: main() calls pthread_create(func), which starts func() running concurrently with main().]

Thread Joining Example
    void *func(void *) { ….. }
    pthread_t id;
    int X;
    pthread_create(&id, NULL, func, &X);
    …..
    pthread_join(id, NULL);
    …..
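A minimal compilable sketch of the create/join pattern may make the fragment concrete. The function name run_create_join and the doubling done in func are illustrative assumptions, not part of the slides.

```c
#include <assert.h>
#include <pthread.h>

/* Thread body (hypothetical work): doubles the int its argument points to. */
static void *func(void *arg) {
    int *x = (int *)arg;
    *x *= 2;
    return NULL;
}

/* Creates one thread, waits for it, and returns the value it left in x. */
int run_create_join(int start) {
    pthread_t id;
    int x = start;
    int rc = pthread_create(&id, NULL, func, &x);
    assert(rc == 0);
    /* the creating thread is free to do other work here */
    rc = pthread_join(id, NULL);   /* after this, func's write to x is visible */
    assert(rc == 0);
    (void)rc;
    return x;
}
```

Calling run_create_join(21) from main() should give 42. On most systems the program is built with the compiler's -pthread option.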

Example of Thread Creation (contd.) [Figure: main() calls pthread_create(func); func() runs and ends with pthread_exit(); main() waits for it in pthread_join(id).]

Sequential SOR
    for some number of timesteps/iterations {
        /* rows and columns 0 and n are boundary; only interior points are updated */
        for( i=1; i<n; i++ )
            for( j=1; j<n; j++ )
                temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j] +
                                      grid[i][j-1] + grid[i][j+1] );
        for( i=1; i<n; i++ )
            for( j=1; j<n; j++ )
                grid[i][j] = temp[i][j];
    }

Parallel SOR First (i,j) loop nest can be parallelized. Second (i,j) loop nest can be parallelized. Must wait to start second loop nest until all processors have finished first. Must wait to start first loop nest of next iteration until all processors have finished second loop nest of previous iteration. Give n/p rows to each processor.

Pthreads SOR: Parallel parts (1)
    void* sor_1(void *s) {
        int slice = (int)(intptr_t) s;     /* thread index carried in the void* argument; needs <stdint.h> */
        int from = (slice*(n-1))/p + 1;    /* interior rows 1..n-1 are split among p threads */
        int to = ((slice+1)*(n-1))/p + 1;
        for( int i=from; i<to; i++ )
            for( int j=1; j<n; j++ )
                temp[i][j] = 0.25*(grid[i-1][j] + grid[i+1][j] + grid[i][j-1] + grid[i][j+1]);
        return NULL;
    }

Pthreads SOR: Parallel parts (2)
    void* sor_2(void *s) {
        int slice = (int)(intptr_t) s;
        int from = (slice*(n-1))/p + 1;
        int to = ((slice+1)*(n-1))/p + 1;
        for( int i=from; i<to; i++ )
            for( int j=1; j<n; j++ )
                grid[i][j] = temp[i][j];
        return NULL;
    }

Pthreads SOR: main
    for some number of timesteps {
        for( i=0; i<p; i++ )
            pthread_create(&thrd[i], NULL, sor_1, (void *)(intptr_t)i);
        for( i=0; i<p; i++ )
            pthread_join(thrd[i], NULL);     /* all of phase 1 done before phase 2 starts */
        for( i=0; i<p; i++ )
            pthread_create(&thrd[i], NULL, sor_2, (void *)(intptr_t)i);
        for( i=0; i<p; i++ )
            pthread_join(thrd[i], NULL);     /* all of phase 2 done before next timestep */
    }
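The create/join version of SOR can be put together as one small compilable sketch. The grid size N, thread count P, the boundary values, and the helper names bounds, run_phase, and run_sor are assumptions chosen for illustration; rows and columns 0 and N are treated as fixed boundary.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define N 8   /* assumed size: interior points are 1..N-1 */
#define P 4   /* assumed number of threads */

static double grid[N + 1][N + 1], temp[N + 1][N + 1];

/* Rows [from, to) of the interior owned by a given slice. */
static void bounds(intptr_t slice, int *from, int *to) {
    *from = (int)((slice * (N - 1)) / P) + 1;
    *to = (int)(((slice + 1) * (N - 1)) / P) + 1;
}

static void *sor_1(void *s) {          /* phase 1: average the neighbours */
    int from, to;
    bounds((intptr_t)s, &from, &to);
    for (int i = from; i < to; i++)
        for (int j = 1; j < N; j++)
            temp[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j] +
                                 grid[i][j - 1] + grid[i][j + 1]);
    return NULL;
}

static void *sor_2(void *s) {          /* phase 2: copy back */
    int from, to;
    bounds((intptr_t)s, &from, &to);
    for (int i = from; i < to; i++)
        for (int j = 1; j < N; j++)
            grid[i][j] = temp[i][j];
    return NULL;
}

/* Runs one phase on P threads; the joins act as the barrier. */
static void run_phase(void *(*phase)(void *)) {
    pthread_t thrd[P];
    for (intptr_t i = 0; i < P; i++) {
        int rc = pthread_create(&thrd[i], NULL, phase, (void *)i);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < P; i++)
        pthread_join(thrd[i], NULL);
}

/* Top boundary row held at 4.0, everything else 0; returns grid[1][1]. */
double run_sor(int steps) {
    for (int i = 0; i <= N; i++)
        for (int j = 0; j <= N; j++)
            grid[i][j] = (i == 0) ? 4.0 : 0.0;
    for (int t = 0; t < steps; t++) {
        run_phase(sor_1);
        run_phase(sor_2);
    }
    return grid[1][1];
}
```

With these assumed boundary values, one timestep sets grid[1][1] to 0.25 * 4.0 = 1.0. Note that every timestep creates and destroys 2*P threads, which is the overhead the barrier-based version avoids.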

Summary: Thread Management pthread_create(): creates a parallel thread executing a given function (and arguments), returns thread identifier. pthread_exit(): terminates thread. pthread_join(): waits for thread with particular thread identifier to terminate.

Summary: Program Structure Encapsulate parallel parts in functions. Use function arguments to parameterize what a particular thread does. Call pthread_create() with the function and arguments, save thread identifier returned. Call pthread_join() with that thread identifier.

Pthreads Synchronization Create/exit/join provide some form of synchronization, but only at a very coarse level, and they require thread creation/destruction. Finer-grain synchronization needs mutex locks and condition variables.

Use of Mutex Locks To implement critical sections. Pthreads mutex locks are exclusive locks. Shared-read, exclusive-write locking, which some other systems build into their locks, is provided separately in POSIX by pthread_rwlock_t.
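As an illustration of a critical section, the sketch below protects a shared counter with a mutex. The names counter, counter_lock, add_many, and run_counter, and the constants, are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

#define NTHREADS 4          /* assumed thread count */
#define INCREMENTS 100000   /* assumed work per thread */

static long counter;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

static void *add_many(void *unused) {
    (void)unused;
    for (int k = 0; k < INCREMENTS; k++) {
        pthread_mutex_lock(&counter_lock);    /* enter critical section */
        counter++;                            /* only one thread at a time here */
        pthread_mutex_unlock(&counter_lock);  /* leave critical section */
    }
    return NULL;
}

/* Runs NTHREADS incrementing threads and returns the final count. */
long run_counter(void) {
    pthread_t thrd[NTHREADS];
    counter = 0;
    for (int i = 0; i < NTHREADS; i++) {
        int rc = pthread_create(&thrd[i], NULL, add_many, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(thrd[i], NULL);
    return counter;
}
```

Without the lock/unlock pair the increments would race and the total could come out lower than NTHREADS * INCREMENTS.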

Barrier Synchronization A wait at a barrier causes a thread to wait until all threads have performed a wait at the barrier. At that point, they all proceed.

Implementing Barriers in Pthreads Count the number of arrivals at the barrier. Wait if this is not the last arrival. Make everyone unblock if this is the last arrival. Since the arrival count is a shared variable, enclose the whole operation in a mutex lock-unlock.

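Implementing Barriers in Pthreads
    /* generation: shared int, initially 0, added so that waiters can re-check a condition */
    void barrier() {
        pthread_mutex_lock(&mutex_arr);
        int my_generation = generation;
        arrived++;
        if (arrived<N) {
            /* loop because pthread_cond_wait() may wake up spuriously */
            while (my_generation == generation)
                pthread_cond_wait(&cond, &mutex_arr);
        } else {
            arrived=0;       /* be prepared for next barrier */
            generation++;    /* releases the waiters of this barrier use */
            pthread_cond_broadcast(&cond);
        }
        pthread_mutex_unlock(&mutex_arr);
    }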
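The counting barrier can be exercised with a small compilable sketch. It adds a generation counter (an assumption beyond the count-arrivals description) so that waiters loop on a predicate, since pthread_cond_wait() is allowed to return spuriously. The names generation, slot, errors, worker, and run_barrier_check are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define NTHREADS 4   /* assumed number of threads using the barrier */
#define ROUNDS 100   /* assumed number of barrier-separated rounds */

static pthread_mutex_t mutex_arr = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int arrived, generation;

static void barrier(void) {
    pthread_mutex_lock(&mutex_arr);
    int my_generation = generation;
    if (++arrived < NTHREADS) {
        while (my_generation == generation)      /* guards against spurious wakeups */
            pthread_cond_wait(&cond, &mutex_arr);
    } else {
        arrived = 0;                             /* ready for the next barrier */
        generation++;                            /* lets this round's waiters go */
        pthread_cond_broadcast(&cond);
    }
    pthread_mutex_unlock(&mutex_arr);
}

static int slot[NTHREADS];
static int errors[NTHREADS];

static void *worker(void *arg) {
    intptr_t me = (intptr_t)arg;
    for (int r = 1; r <= ROUNDS; r++) {
        slot[me] = r;
        barrier();                       /* every thread has written round r */
        for (int k = 0; k < NTHREADS; k++)
            if (slot[k] != r)
                errors[me]++;
        barrier();                       /* every thread has finished reading */
    }
    return NULL;
}

/* Returns how many times a thread saw a stale value; 0 if the barrier works. */
int run_barrier_check(void) {
    pthread_t thrd[NTHREADS];
    int total = 0;
    for (int k = 0; k < NTHREADS; k++)
        errors[k] = 0;
    for (intptr_t i = 0; i < NTHREADS; i++) {
        int rc = pthread_create(&thrd[i], NULL, worker, (void *)i);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(thrd[i], NULL);
    for (int k = 0; k < NTHREADS; k++)
        total += errors[k];
    return total;
}
```

The second barrier() in each round plays the same role as the second barrier in SOR: no thread may overwrite data for the next round while another is still reading the current one.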

Parallel SOR with Barriers (1 of 2)
    void* sor (void* arg) {
        int slice = (int)(intptr_t)arg;    /* needs <stdint.h> */
        int from = (slice * (n-1))/p + 1;
        int to = ((slice+1) * (n-1))/p + 1;
        for some number of iterations { … }
    }

Parallel SOR with Barriers (2 of 2)
    for (i=from; i<to; i++)
        for (j=1; j<n; j++)
            temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] + grid[i][j-1] + grid[i][j+1]);
    barrier();
    for (i=from; i<to; i++)
        for (j=1; j<n; j++)
            grid[i][j]=temp[i][j];
    barrier();

Parallel SOR with Barriers: main
    int main(int argc, char *argv[]) {
        pthread_t thrd[p];    /* array of thread identifiers, not of pointers */
        /* Initialize mutex and condition variables */
        for (i=0; i<p; i++)
            pthread_create (&thrd[i], NULL, sor, (void*)(intptr_t)i);    /* NULL = default attributes */
        for (i=0; i<p; i++)
            pthread_join (thrd[i], NULL);
        /* Destroy mutex and condition variables */
        return 0;
    }

Note again Many shared memory programming systems have barriers as a basic primitive (POSIX also defines pthread_barrier_t, though not every platform provides it). If a system has them, use them rather than constructing your own: the implementation may be more efficient than what you can write yourself.

Busy Waiting Not an explicit part of the API. Available in a general shared memory programming environment.

Busy Waiting
    initially: flag = 0;
    P1: produce data;
        flag = 1;
    P2: while( !flag ) ;
        consume data;

Use of Busy Waiting On the surface, simple and efficient. In general, not a recommended practice. Often leads to messy and unreadable code (blurs data/synchronization distinction). May be inefficient.
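In C, a plain int flag gives no guarantee that the consumer ever sees the store or sees the data written before it. The sketch below expresses the same flag hand-off with C11 atomics, which is an assumption beyond the slide; the names data, producer, consumer, and run_flag_handoff are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int data;           /* ordinary shared data */
static atomic_int flag;    /* synchronization flag, 0 = not ready */

static void *producer(void *unused) {
    (void)unused;
    data = 42;                                               /* produce data */
    atomic_store_explicit(&flag, 1, memory_order_release);   /* publish it */
    return NULL;
}

static void *consumer(void *out) {
    while (!atomic_load_explicit(&flag, memory_order_acquire))
        ;                                                    /* busy wait */
    *(int *)out = data;                                      /* consume data */
    return NULL;
}

/* Returns the value the consumer read after spinning on the flag. */
int run_flag_handoff(void) {
    pthread_t p1, p2;
    int seen = 0;
    data = 0;
    atomic_store(&flag, 0);
    int rc = pthread_create(&p2, NULL, consumer, &seen);
    assert(rc == 0);
    rc = pthread_create(&p1, NULL, producer, NULL);
    assert(rc == 0);
    (void)rc;
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    return seen;
}
```

The release store paired with the acquire load is what orders the write to data before the read of data. The spinning consumer still occupies a processor while it waits.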

Private Data in Pthreads Variables declared inside the thread function are private to each thread. To make a global variable private in Pthreads, you need to make an array out of it. Index the array by thread identifier, which you should keep track of. Not very elegant or efficient.
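A sketch of the array-per-thread idiom: each thread accumulates into its own element of a shared array, indexed by a thread number passed as the argument. The names partial, sum_slice, and run_private_sum, and the constants, are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define NTHREADS 4   /* assumed thread count */
#define ITEMS 1000   /* assumed problem size: sum of 1..ITEMS */

static long partial[NTHREADS];   /* one "private" slot per thread */

static void *sum_slice(void *arg) {
    intptr_t me = (intptr_t)arg;          /* thread number, tracked by the program */
    partial[me] = 0;
    for (long v = me + 1; v <= ITEMS; v += NTHREADS)
        partial[me] += v;                 /* touches only this thread's slot */
    return NULL;
}

/* Sums 1..ITEMS using per-thread partial sums combined after the joins. */
long run_private_sum(void) {
    pthread_t thrd[NTHREADS];
    long total = 0;
    for (intptr_t i = 0; i < NTHREADS; i++) {
        int rc = pthread_create(&thrd[i], NULL, sum_slice, (void *)i);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(thrd[i], NULL);
        total += partial[i];
    }
    return total;
}
```

One likely source of the inefficiency is that neighbouring array elements can share a cache line, so threads updating different slots may still contend in hardware.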

Other Primitives in Pthreads Set the attributes of a thread. Set the attributes of a mutex lock. Set scheduling parameters.

ECE 1747 Parallel Programming Machine-independent Performance Optimization Techniques

Returning to Sequential vs. Parallel Sequential execution time: t seconds. Startup overhead of parallel execution: t_st seconds (depends on architecture) (Ideal) parallel execution time: t/p + t_st. If t/p + t_st > t, no gain.
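The break-even condition can be rearranged to show how large the startup overhead may be before parallel execution stops paying off. The numbers in the example are hypothetical.

```latex
\[
\frac{t}{p} + t_{st} < t
\;\Longleftrightarrow\;
t_{st} < t\left(1 - \frac{1}{p}\right),
\qquad
\text{speedup} = \frac{t}{t/p + t_{st}} .
\]
% Hypothetical example: t = 10 s, p = 4, t_st = 1 s
\[
\frac{10}{4} + 1 = 3.5\ \text{s},
\qquad
\text{speedup} = \frac{10}{3.5} \approx 2.86 .
\]
```

In this example the ideal speedup of 4 is not reached because the startup overhead is paid regardless of p.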

General Idea Parallelism limited by dependences. Restructure code to eliminate or reduce dependences. Sometimes possible by compiler, but good to know how to do it by hand.

Optimizations: Example 16 for (i = 0; i < ; i++) a[i ] = a[i] + 1; Cannot be parallelized as is. May be parallelized by applying certain code transformations.

Summary Reorganize code such that dependences are removed or reduced, large pieces of parallel work emerge, loop bounds become known, … Code can become messy: there is a point of diminishing returns.

Factors that Determine Speedup Characteristics of parallel code –granularity –load balance –locality –communication and synchronization

Granularity Granularity = size of the program unit that is executed by a single processor. May be a single loop iteration, a set of loop iterations, etc. Fine granularity leads to: –(positive) ability to use lots of processors –(positive) finer-grain load balancing –(negative) increased overhead

Granularity and Critical Sections Small granularity => more processors => more critical section accesses => more contention.

Issues in Performance of Parallel Parts Granularity. Load balance. Locality. Synchronization and communication.

Load Balance Load imbalance = difference in execution time between processors between barriers. Whether execution time is predictable depends on the form of parallelism. –Regular data parallel: yes. –Irregular data parallel or pipeline: perhaps. –Task queue: no.

Static vs. Dynamic Static: done once, by the programmer –block, cyclic, etc. –fine for regular data parallel Dynamic: done at runtime –task queue –fine for unpredictable execution times –usually high overhead Semi-static: done once, at run-time

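Choice is not inherent The choice between static and dynamic is not inherent in the application. MM (matrix multiplication) or SOR could be done using task queues: put all iterations in a queue. This can make sense in a heterogeneous environment or in a multitasked environment.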

Static Load Balancing Block –best locality –possibly poor load balance Cyclic –better load balance –worse locality Block-cyclic –load balancing advantages of cyclic (mostly) –better locality
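The three static distributions differ only in how an iteration index maps to a processor. The sketch below shows one common form of each mapping; the function names and the ceiling-sized block are assumptions.

```c
#include <assert.h>

/* Block: n iterations in p contiguous chunks of ceil(n/p) iterations each. */
int block_owner(int i, int n, int p) {
    assert(p > 0 && i >= 0 && i < n);
    int chunk = (n + p - 1) / p;
    return i / chunk;
}

/* Cyclic: iteration i goes to processor i mod p. */
int cyclic_owner(int i, int p) {
    assert(p > 0 && i >= 0);
    return i % p;
}

/* Block-cyclic: blocks of b consecutive iterations are dealt out cyclically. */
int block_cyclic_owner(int i, int b, int p) {
    assert(p > 0 && b > 0 && i >= 0);
    return (i / b) % p;
}
```

For example, with 8 iterations on 4 processors, iteration 5 belongs to processor 2 under block, processor 1 under cyclic, and processor 2 under block-cyclic with blocks of 2.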

Dynamic Load Balancing (1 of 2) Centralized: single task queue. –Easy to program –Excellent load balance Distributed: task queue per processor. –Less communication/synchronization
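A centralized task queue can be as simple as a shared index protected by a mutex, as in the sketch below. The names next_task, done, get_task, worker, and run_task_queue are hypothetical, and incrementing done[t] stands in for real work.

```c
#include <assert.h>
#include <pthread.h>

#define NTHREADS 4    /* assumed thread count */
#define NTASKS 1000   /* assumed number of independent tasks */

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static int next_task;        /* the "queue": tasks next_task..NTASKS-1 remain */
static int done[NTASKS];     /* how many times each task was executed */

/* Removes one task from the shared queue; returns -1 when it is empty. */
static int get_task(void) {
    pthread_mutex_lock(&queue_lock);
    int t = (next_task < NTASKS) ? next_task++ : -1;
    pthread_mutex_unlock(&queue_lock);
    return t;
}

static void *worker(void *unused) {
    (void)unused;
    int t;
    while ((t = get_task()) >= 0)
        done[t]++;           /* stand-in for the real work of task t */
    return NULL;
}

/* Returns the number of tasks that were executed exactly once. */
int run_task_queue(void) {
    pthread_t thrd[NTHREADS];
    int exactly_once = 0;
    next_task = 0;
    for (int t = 0; t < NTASKS; t++)
        done[t] = 0;
    for (int i = 0; i < NTHREADS; i++) {
        int rc = pthread_create(&thrd[i], NULL, worker, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(thrd[i], NULL);
    for (int t = 0; t < NTASKS; t++)
        exactly_once += (done[t] == 1);
    return exactly_once;
}
```

Every task fetch goes through the single lock, which is the communication/synchronization cost that a queue per processor reduces.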

Dynamic Load Balancing (2 of 2) Task stealing: –Processes normally remove and insert tasks from their own queue. –When queue is empty, remove task(s) from other queues. Extra overhead and programming difficulty. Better load balancing.

Semi-static Load Balancing Measure the cost of program parts. Use the measurement to partition the computation. May be done once, every iteration, or every n iterations.

Molecular Dynamics (MD) Simulation of a set of bodies under the influence of physical laws. Atoms, molecules, celestial bodies,... Have same basic structure.

Molecular Dynamics (Skeleton)
    for some number of timesteps {
        for all molecules i
            for all other molecules j
                force[i] += f( loc[i], loc[j] );
        for all molecules i
            loc[i] = g( loc[i], force[i] );
    }

Molecular Dynamics To reduce amount of computation, account for interaction only with nearby molecules.

Molecular Dynamics (continued)
    for some number of timesteps {
        for all molecules i
            for all nearby molecules j
                force[i] += f( loc[i], loc[j] );
        for all molecules i
            loc[i] = g( loc[i], force[i] );
    }

Molecular Dynamics (continued) For each molecule i, keep the number of nearby molecules count[i] and the array of indices of nearby molecules index[i][j] (0 <= j < count[i]).

Molecular Dynamics (continued)
    for some number of timesteps {
        for( i=0; i<num_mol; i++ )
            for( j=0; j<count[i]; j++ )
                force[i] += f(loc[i],loc[index[i][j]]);    /* j-th neighbor of molecule i */
        for( i=0; i<num_mol; i++ )
            loc[i] = g( loc[i], force[i] );
    }

Molecular Dynamics (simple)
    for some number of timesteps {
        parallel for
        for( i=0; i<num_mol; i++ )
            for( j=0; j<count[i]; j++ )
                force[i] += f(loc[i],loc[index[i][j]]);
        parallel for
        for( i=0; i<num_mol; i++ )
            loc[i] = g( loc[i], force[i] );
    }

Molecular Dynamics (simple) Simple to program. Possibly poor load balance –block distribution of i iterations (molecules) –could lead to uneven neighbor distribution –cyclic does not help

Better Load Balance Assign iterations such that each processor has approximately the same number of neighbors. Keep an array of “assign records”, one per processor, each with two elements: the beginning i value (molecule) and the ending i value (molecule). Recompute the partition periodically.
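One way to fill the assign records is a single pass over count[], cutting a new range whenever the running neighbor total reaches the next processor's share. This is a sketch under assumptions: the struct layout, the half-open [begin, end) convention, and the function name partition_by_neighbors are all hypothetical.

```c
#include <assert.h>

/* Molecules begin..end-1 are assigned to one processor (half-open range). */
struct assign_record {
    int begin;
    int end;
};

/* Splits molecules 0..num_mol-1 into p contiguous ranges with roughly
   equal neighbor totals. A processor may get an empty range if a single
   molecule has more neighbors than its share. */
void partition_by_neighbors(const int *count, int num_mol, int p,
                            struct assign_record *rec) {
    assert(p > 0 && num_mol >= 0);
    long total = 0;
    for (int i = 0; i < num_mol; i++)
        total += count[i];

    int i = 0;
    long seen = 0;                                  /* neighbors assigned so far */
    for (int proc = 0; proc < p; proc++) {
        long target = (total * (proc + 1)) / p;     /* cumulative share */
        rec[proc].begin = i;
        while (i < num_mol &&
               (proc == p - 1 || seen + count[i] <= target)) {
            seen += count[i];
            i++;
        }
        rec[proc].end = i;                          /* last processor takes the rest */
    }
}
```

For count = {4, 1, 1, 1, 1} and two processors, the first processor gets molecule 0 alone (4 neighbors) and the second gets molecules 1 to 4 (4 neighbors), whereas a block distribution by molecule count would give 6 and 2.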

Frequency of Balancing Every time neighbor list is recomputed. –once during initialization. –every iteration. –every n iterations. Extra overhead vs. better approximation and better load balance.

Summary Parallel code optimization –Critical section accesses. –Granularity. –Load balance.