
1 Programming with Shared Memory
Chapter 8: Programming with Shared Memory
- Shared Memory Multiprocessors
- Programming Shared Memory Multiprocessors
- Process-Based Programming: UNIX Processes
- Parallel Programming Languages and Constructs
- Directives-Based Programming: OpenMP
- Performance Issues
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.

2 Shared Memory Multiprocessor System
- Any memory location is accessible by any of the processors.
- A single address space exists: each memory location is given a unique address within a single range of addresses.
- A common architecture is the single-bus architecture.

3 Address Space Organization: UMA & NUMA
UMA: Uniform Memory Access
- Each processor has equal access to all of the system memory.
- Each processor also maintains its own on-board data caches.
- Cache coherence and bus contention limit scalability.
NUMA: Non-Uniform Memory Access
- Model designed in the 1990s to increase scalability and assure memory coherency and integrity.
- Processors have direct access to a private area of main memory.
- Remote memory access is slower than local memory access.
- A processor accesses the private memory of others using the system bus.

4 Shared Memory vs Distributed Memory
Shared memory:
- Attractive to the programmer: convenience for sharing.
- Employs a single address space (the shared memory).
- Data sharing is mediated via the shared memory.
- Synchronization is achieved via shared variables (e.g., semaphores).
- Synchronization mechanisms (which can increase execution time of a parallel program) are necessary; concurrent access to data must be controlled.
Distributed memory:
- Each processor has its own local memory.
- Scales more easily than shared memory.
- Data sharing is achieved through message passing.
- Synchronization is achieved through message passing.
- Message passing is error-prone, makes programs difficult to debug, and data must be copied instead of shared.

5 Programming Shared Memory Multiprocessors
Using heavyweight processes:
- Typically assumes all data associated with a process to be private.
- Important for providing protection in multiuser systems; this protection is not necessary for processes cooperating to solve a problem.
- Example: UNIX processes.
Using threads:
- Example: Pthreads, Java threads.
Using a completely new programming language for parallel programming:
- Requires learning a new language from scratch; not popular these days.
- Example: Ada.

6 Programming Shared Memory (cont’d)
- Using library routines with an existing sequential programming language.
- Modifying the syntax of an existing sequential programming language to create a parallel programming language. Example: UPC.
- Using an existing sequential programming language supplemented with compiler directives for specifying parallelism. Example: OpenMP.

7 Using Heavyweight Processes
- Operating systems are often based upon the notion of a process.
- Processor time is shared between processes, switching from one process to another. This might occur at regular intervals or when an active process becomes delayed.
- Offers the opportunity to deschedule processes blocked from proceeding for some reason, e.g., waiting for an I/O operation to complete.
- The concept could be used for parallel programming. It is not much used because of the overhead, but the fork/join concepts are used elsewhere.

8 FORK-JOIN Construct

9 UNIX System Calls
SPMD model with different code for the master process and the forked slave process.
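A minimal sketch (not taken from the slides) of this SPMD fork pattern in C; the work functions master_work() and slave_work() are placeholders for the two code paths:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

void master_work(void) { printf("master: pid %d\n", (int) getpid()); }
void slave_work(void)  { printf("slave:  pid %d\n", (int) getpid()); }

int main(void) {
    pid_t pid = fork();              /* FORK: create the slave process */
    if (pid < 0) { perror("fork"); exit(1); }

    if (pid == 0)                    /* child executes the slave code */
        slave_work();
    else                             /* parent executes the master code */
        master_work();

    if (pid == 0)                    /* JOIN: slave exits, master waits */
        exit(0);
    else
        wait(NULL);
    return 0;
}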

10 Unix Processes: Example Program
To sum the elements of an array, a[1000]:

int sum, a[1000];
sum = 0;
for (i = 0; i < 1000; i++)
  sum = sum + a[i];

11 Unix Processes: Example (cont’d)
The calculation will be divided into two parts, one doing the even i and one doing the odd i; i.e.,

Process 1:
sum1 = 0;
for (i = 0; i < 1000; i = i + 2)
  sum1 = sum1 + a[i];

Process 2:
sum2 = 0;
for (i = 1; i < 1000; i = i + 2)
  sum2 = sum2 + a[i];

Each process will add its result (sum1 or sum2) to an accumulating result, sum:

sum = sum + sum1;     /* process 1 */
sum = sum + sum2;     /* process 2 */

sum will need to be shared and protected by a lock. A shared data structure is created:

12 Shared Memory Locations for UNIX Program Example
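A minimal sketch (not from the slides) of how the shared locations for sum and the array a might be created with System V shared memory; the key values, permissions and array size are illustrative assumptions:

#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define ARRAY_SIZE 1000                    /* illustrative size */

int main(void) {
    int *sum, *a;

    /* One segment for the accumulating sum, one for the array. */
    int sum_id = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0600);
    int a_id   = shmget(IPC_PRIVATE, ARRAY_SIZE * sizeof(int), IPC_CREAT | 0600);
    if (sum_id < 0 || a_id < 0) { perror("shmget"); exit(1); }

    /* Attach the segments; after fork() both processes see the same memory. */
    sum = (int *) shmat(sum_id, NULL, 0);
    a   = (int *) shmat(a_id, NULL, 0);

    /* ... fork(), compute partial sums as on the next slide ... */

    shmdt(sum); shmdt(a);                  /* detach and remove the segments */
    shmctl(sum_id, IPC_RMID, NULL);
    shmctl(a_id, IPC_RMID, NULL);
    return 0;
}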

13 Unix Processes: Example (cont’d)
*sum = 0;
for (i = 0; i < array_size; i++)
  *(a+i) = i+1;

pid = fork();
if (pid == 0) {                   /* child: sums the even-indexed elements */
  partial_sum = 0;
  for (i = 0; i < array_size; i = i + 2)
    partial_sum += *(a + i);
} else {                          /* parent: sums the odd-indexed elements */
  partial_sum = 0;
  for (i = 1; i < array_size; i = i + 2)
    partial_sum += *(a + i);
}

P(&s);                            /* lock the shared sum */
*sum += partial_sum;
V(&s);                            /* unlock */
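P() and V() are not standard UNIX calls; a minimal sketch (an assumption, not from the slides) of how they could be built on System V semaphores, where s holds a semaphore id previously obtained from semget():

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

void P(int *s) {                           /* wait: decrement, block if zero */
    struct sembuf op = { 0, -1, 0 };
    semop(*s, &op, 1);
}

void V(int *s) {                           /* signal: increment */
    struct sembuf op = { 0, +1, 0 };
    semop(*s, &op, 1);
}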

14 Language Constructs for Parallelism
A few specific language constructs for shared memory parallel programs:
- Shared data may be specified as: shared int x;
- Different concurrent statements may be introduced using the par statement:
  par { S1; S2; ...; Sn }
- Similar concurrent statements may be introduced using the forall statement:
  forall (i = 0; i < n; i++) { body }
  which generates n concurrent blocks, one for each value of i.

15 Dependency Analysis
- An automatic technique used by parallelizing compilers to detect which processes/statements may be executed in parallel.
- Sometimes it is easy to see whether statements can run in parallel; for example, in
  forall (i = 0; i < 5; i++)
    a[i] = 0;
  each iteration is independent of the others.
- In general, dependencies are not that obvious. An algorithmic way of recognizing dependencies is needed for automation by compilers.

16 Bernstein Conditions
A set of sufficient conditions for two processes to be executed simultaneously in a concurrent-read/exclusive-write situation. Define two sets of memory locations for each process Pi:
- Ii: the set of memory locations read by process Pi
- Oi: the set of memory locations to which process Pi writes
Two processes P1, P2 may be executed simultaneously if:
  I1 ∩ O2 = ∅,  I2 ∩ O1 = ∅,  O1 ∩ O2 = ∅
In words: the writing locations are disjoint, and no process reads from a location where the other process writes.

17 Bernstein Conditions: An Example
Consider the following pair of statements (in C):
  a = x + y;
  b = x + z;
From these, we have:
  I1 = {x, y}   O1 = {a}
  I2 = {x, z}   O2 = {b}
We see that the three conditions are satisfied:
  I1 ∩ O2 = ∅,  I2 ∩ O1 = ∅,  O1 ∩ O2 = ∅
Hence, the statements a = x + y; and b = x + z; can be executed simultaneously.

18 Directives Based Parallel Programming: OpenMP
- An API for programming shared memory multiprocessors. OpenMP stands for Open specifications for Multi Processing.
- The OpenMP API consists of three primary components: compiler directives, runtime library routines, and environment variables.
- OpenMP is standardized: jointly defined and endorsed by a group of major computer hardware and software vendors, and expected to become an ANSI standard later.
- The base languages for OpenMP are C, C++ and Fortran.

19 OpenMP Programming Model
Explicit parallelism:
- OpenMP is an explicit (not automatic) programming model, offering the programmer full control over parallelization.
Compiler directive based:
- Virtually all OpenMP parallelism is specified through the use of compiler directives which are embedded in C/C++ or Fortran source code.
Thread-based parallelism:
- A shared memory process can consist of multiple threads. OpenMP is based upon the existence of multiple threads in the shared memory programming paradigm.
- Makes it easy to create multi-threaded (MT) programs in Fortran, C and C++.

20 OpenMP Programming Model (cont’d)
OpenMP uses the fork-join model of parallel execution:
- Programs begin as a single process: the master thread. The master thread executes sequentially until the first parallel region construct is encountered.
- FORK: the master thread then creates a team of parallel threads. The statements in the program that are enclosed by the parallel region construct are then executed in parallel among the various team threads.
- JOIN: when the team threads complete the statements in the parallel region construct, they synchronize and terminate, leaving only the master thread.

21 OpenMP Program Structure
#include <omp.h>

int main () {
  int var1, var2, var3;
  // Serial code
  . . .
  // Beginning of parallel section. Fork a team of threads.
  // Specify variable scoping.
  #pragma omp parallel private(var1, var2) shared(var3)
  {
    // Parallel section executed by all threads
    . . .
    // All threads join master thread and disband
  }
  // Resume serial code
  . . .
}

22 OpenMP Programming
- For C/C++, the OpenMP directives are contained in #pragma statements of the form:
  #pragma omp directive_name ...
  where omp is an OpenMP keyword.
- There may be additional parameters (clauses) after the directive name for different options.
- Some directives require code to be specified in a structured block (a statement or statements) that follows the directive; the directive and structured block together form a "construct".

23 Anatomy of OpenMP
OpenMP’s constructs fall into 5 categories:
- Parallel regions
- Work-sharing
- Data environment
- Synchronization
- Runtime functions/environment variables

24 Parallel Regions: The parallel Directive
This is the fundamental OpenMP parallel construct:

#pragma omp parallel
  structured_block

- When a thread reaches a parallel directive, it creates a team of threads and becomes the master of the team. The master is a member of that team and has thread number 0 within it.
- Each thread executes the specified structured_block, which is either a single statement or a compound statement created with { ... } with a single entry point and a single exit point.
- There is an implied barrier at the end of a parallel section. Only the master thread continues execution past this point.

25 The parallel Directive: Detailed Form
#pragma omp parallel [clause ...] newline
          if (scalar_expression)
          private (list)
          shared (list)
          default (shared | none)
          firstprivate (list)
          reduction (operator: list)
          copyin (list)

  structured_block

26 Example 1: Parallel Salam Shabab
#include <omp.h>
#include <stdio.h>

main () {
  int nthreads, tid;
  #pragma omp parallel private(nthreads, tid)
  {
    /* Obtain and print thread id */
    tid = omp_get_thread_num();
    printf("Salaam Shabab from thread = %d\n", tid);
    if (tid == 0) {                      /* only the master thread does this */
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }
  } /* All threads join master thread and terminate */
}

27 Number of Threads in a Team
Established by, in order of precedence:
1. the num_threads clause after the parallel directive, or
2. the omp_set_num_threads() library routine being previously called, or
3. the environment variable OMP_NUM_THREADS, if defined, or
4. the implementation default (system dependent) if none of the above.
The number of threads available can also be altered automatically, to achieve the best use of system resources, by a "dynamic adjustment" mechanism.
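A minimal sketch (not from the slides) illustrating the precedence of the first two mechanisms; the team sizes 4 and 8 are arbitrary choices:

#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_set_num_threads(8);               /* library routine (item 2) */

    #pragma omp parallel num_threads(4)   /* clause (item 1) takes precedence */
    {
        #pragma omp master
        printf("first region:  %d threads\n", omp_get_num_threads());  /* 4 */
    }

    #pragma omp parallel                  /* falls back to the routine setting */
    {
        #pragma omp master
        printf("second region: %d threads\n", omp_get_num_threads());  /* 8 */
    }
    return 0;
}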

28 Example 2: Vector Addition
#pragma omp parallel
{
  int id, i, Nthrds, istart, iend;
  id = omp_get_thread_num();
  Nthrds = omp_get_num_threads();
  istart = id * N / Nthrds;
  iend = (id+1) * N / Nthrds;
  for (i = istart; i < iend; i++)
    a[i] = a[i] + b[i];
}

29 Work-Sharing
Three constructs in this classification, namely:
- sections
- for
- single
Notes:
- There is an implicit barrier at the end of each of these constructs unless a nowait clause is included.
- These constructs do not start a new team of threads; that is done by an enclosing parallel construct.

30 The sections Directive
The construct

#pragma omp sections
{
  #pragma omp section
    structured_block
  #pragma omp section
    structured_block
  . . .
}

causes the structured blocks to be shared among the threads in the team. Notes:
- #pragma omp sections precedes the set of structured blocks.
- #pragma omp section prefixes each structured block.
- The first section directive is optional.

31 The sections Directive: Detailed Form
#pragma omp sections [clause ...] newline
          private (list)
          firstprivate (list)
          lastprivate (list)
          reduction (operator: list)
          nowait
{
  #pragma omp section newline
    structured_block
}

32 Example 3: The sections Directive
#pragma omp parallel shared(a,b,c) private(i)
{
  #pragma omp sections nowait
  {
    #pragma omp section
    for (i=0; i < N/2; i++)
      c[i] = a[i] + b[i];

    #pragma omp section
    for (i=N/2; i < N; i++)
      c[i] = a[i] + b[i];

  } /* end of sections */
} /* end of parallel section */

33 The for Directive
The construct

#pragma omp for
  for_loop

causes the for_loop to be divided into parts and the parts shared among the threads in the team. The for loop must be of a simple form.
The way the for_loop is divided can be specified by an additional schedule clause. For example, the clause

schedule(static, chunk_size)

causes the for_loop to be divided into pieces of size chunk_size, allocated to the threads in a round-robin fashion.

34 The for Directive: Detailed Form
#pragma omp for [clause ...] newline
          schedule (type [,chunk])
          ordered
          private (list)
          firstprivate (list)
          lastprivate (list)
          shared (list)
          reduction (operator: list)
          nowait

  for_loop

35 The schedule Clause
The schedule clause affects how loop iterations are mapped onto threads.
schedule(static [,chunk])
- Deal out blocks of iterations of size chunk to each thread.
- The default for chunk is ceiling(N/p).
schedule(dynamic [,chunk])
- Each thread grabs chunk iterations off a queue until all iterations have been handled.
- The default for chunk is 1.
schedule(guided [,chunk])
- Threads dynamically grab blocks of iterations. The size of the block starts large and shrinks down to size chunk as the calculation proceeds.

36 The schedule Clause (cont’d)
schedule(runtime)
- Schedule and chunk size are taken from the OMP_SCHEDULE environment variable.
- A chunk size cannot be specified with runtime.
- Example of run-time specified scheduling (see the sketch below):
  setenv OMP_SCHEDULE "dynamic,2"
Which of the scheduling methods is better in terms of:
- Lower runtime overhead?
- Better data locality?
- Better load balancing?
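A minimal sketch (not from the slides) of a loop whose schedule is chosen at run time via OMP_SCHEDULE; the array names and size are illustrative assumptions:

#include <omp.h>
#define N 1000

double a[N], b[N], c[N];

void add(void) {
    int i;
    /* The schedule type and chunk size come from OMP_SCHEDULE, e.g.
         setenv OMP_SCHEDULE "dynamic,2"     (csh)
         export OMP_SCHEDULE="dynamic,2"     (sh/bash)                  */
    #pragma omp parallel for schedule(runtime)
    for (i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}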

37 Example 4: The for Directive
for (i=0; i < N; i++)
  a[i] = b[i] = i * 1.0;
chunk = CHUNKSIZE;

#pragma omp parallel shared(a,b,c,chunk) private(i)
{
  #pragma omp for schedule(dynamic,chunk) nowait
  for (i=0; i < N; i++)
    c[i] = a[i] + b[i];
} /* end of parallel section */

38 Combined Parallel Work-sharing Constructs
A parallel directive followed by a single for directive can be combined into:

#pragma omp parallel for
  for_loop

A parallel directive followed by a single sections directive can be combined into:

#pragma omp parallel sections
{
  #pragma omp section
    structured_block
  . . .
}

In both cases, the nowait clause is not allowed.

39 Example 5: Parallel Region and Work-sharing
#pragma omp parallel
#pragma omp for schedule(static)
for (i=0; i<N; i++)
  a[i] = a[i] + b[i];

What is the chunk size distributed? How is the distribution done?

40 The reduction Clause
reduction (op : list)
- The reduction clause performs a reduction on the variables that appear in its list.
- A private copy of each list variable is created for each thread.
- At the end of the reduction, the reduction operator is applied to all private copies of the shared variable, and the final result is written to the global shared variable.

41 Example 6: The reduction Clause
#pragma omp parallel for \
        default(shared) private(i) \
        schedule(static,chunk) \
        reduction(+:result)
for (i=0; i < n; i++)
  result = result + (a[i] * b[i]);
printf("Final result= %f\n", result);

How many threads execute the printf statement?

42 The master Directive
The master directive:

#pragma omp master
  structured_block

- Causes only the master thread to execute the structured block.
- Differs from the directives in the work-sharing group in that there is no implied barrier at the end of the construct (nor at the beginning).
- Other threads encountering this directive ignore it and the associated structured block, and move on.
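A minimal sketch (not from the slides) of the master directive; do_work() is a placeholder for whatever every thread computes:

#include <omp.h>
#include <stdio.h>

void do_work(int id) { (void) id; /* placeholder for per-thread work */ }

int main(void) {
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        do_work(id);                     /* all threads work */

        #pragma omp master
        printf("progress report from the master only (thread %d)\n", id);
        /* no implied barrier: the other threads continue without waiting */

        do_work(id);
    }
    return 0;
}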

43 Data Environment
- Most variables are shared by default. Examples: file scope variables and static variables in C.
- But not everything is shared: stack variables in sub-programs called from parallel regions are private, and automatic variables within a statement block are private.
- The default status can be modified with: default (private | shared | none)
- The defaults can be changed selectively with the clauses: shared, private, firstprivate, threadprivate
- The value of a private variable inside a parallel loop can be transmitted to a global value outside the loop with: lastprivate (see the sketch below)
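A minimal sketch (not from the slides) of firstprivate and lastprivate; the variable names are illustrative:

#include <stdio.h>

int main(void) {
    int offset = 100;      /* copied into each thread by firstprivate */
    int last_i = -1;       /* copied back from the last iteration by lastprivate */
    int i;

    #pragma omp parallel for firstprivate(offset) lastprivate(last_i)
    for (i = 0; i < 10; i++) {
        last_i = i + offset - 100;   /* each thread starts with offset == 100 */
    }

    printf("last_i = %d\n", last_i); /* value from iteration i == 9, i.e. 9 */
    return 0;
}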

44 Synchronization Constructs
Five constructs in this classification, namely:
- critical: defines a critical section
  #pragma omp critical name
    structured_block
- barrier: defines a barrier (that must be reachable by all threads)
  #pragma omp barrier
- atomic: implementation of a critical section that increments, decrements, or does some other simple arithmetic operation (see the sketch below)
  #pragma omp atomic
    expression_statement
- flush: a synchronization point which causes each thread to have a "consistent" view of the variable_list shared variables in memory
  #pragma omp flush (variable_list)
- ordered: used with for and parallel for to order the execution of a loop as in the equivalent sequential code
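A minimal sketch (not from the slides) of the atomic directive as a lighter-weight alternative to critical for a single simple update; the function and the test condition are illustrative assumptions:

/* Count, in parallel, how many elements of v are negative. */
int count_negatives(const double *v, int n) {
    int count = 0;
    int i;
    #pragma omp parallel for shared(count)
    for (i = 0; i < n; i++) {
        if (v[i] < 0.0) {
            #pragma omp atomic
            count++;              /* only this single update is protected */
        }
    }
    return count;
}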

45 Example 7: The critical Directive
#include <omp.h>
main() {
  int x;
  x = 0;
  #pragma omp parallel shared(x)
  {
    #pragma omp critical
    x = x + 1;
  } /* end of parallel section */
}

46 Example 8: The barrier Directive
#pragma omp parallel shared (A, B, C) private(id)
{
  id = omp_get_thread_num();
  A[id] = big_calc1(id);
  #pragma omp barrier
  #pragma omp for
  for (i=0; i<N; i++) { C[i] = big_calc3(i, A); }   // implicit barrier here
  #pragma omp for nowait
  for (i=0; i<N; i++) { B[i] = big_calc2(C, i); }   // no barrier (nowait)
  A[id] = big_calc4(id);
} // implicit barrier at the end of the parallel region

47 OpenMP: Library Routines
Lock routines:
- omp_init_lock(), omp_set_lock(), omp_unset_lock(), omp_test_lock()
Runtime environment routines:
- Modify/check the number of threads: omp_set_num_threads(), omp_get_num_threads(), omp_get_thread_num(), omp_get_max_threads()
- Turn on/off nesting and dynamic mode: omp_set_nested(), omp_set_dynamic(), omp_get_nested(), omp_get_dynamic()
- Are we in a parallel region? omp_in_parallel()
- How many processors in the system? omp_get_num_procs()

48 OpenMP: Environment Variables
- OMP_SCHEDULE "schedule[, chunk_size]"
  Controls how "omp for schedule(runtime)" loop iterations are scheduled.
- OMP_NUM_THREADS int_literal
  Sets the default number of threads to use.
- OMP_DYNAMIC TRUE || FALSE
  Can the program use a different number of threads in each parallel region?
- OMP_NESTED TRUE || FALSE
  Will nested parallel regions create new teams of threads, or will they be serialized?

49 Example 9: threadprivate Directive
#include <stdio.h>

int alpha[10], beta[10], i;
#pragma omp threadprivate(alpha)

main () {
  /* First parallel region */
  #pragma omp parallel private(i,beta)
  for (i=0; i < 10; i++)
    alpha[i] = beta[i] = i;

  /* Second parallel region: alpha (threadprivate) keeps its per-thread
     values, beta (private above) does not */
  #pragma omp parallel
  printf("alpha[3]= %d and beta[3]= %d\n", alpha[3], beta[3]);
}

50 Shared Memory Programming: Performance Issues
Regardless of the programming model, these include:
- Shared data access
- Shared memory synchronization
- Sequential consistency
1. Shared data access: design parallel algorithms to
- fully exploit caching characteristics, and
- avoid false sharing.
Exploiting caching characteristics involves:
- caching data to mitigate processor-memory data access speed differences, and
- modifying cached data.
Caches may need to be made coherent more frequently due to false sharing, which can lead to a serious reduction in performance.

51 Shared Memory Programming: Performance Issues
Cache coherence protocols: we need to assure that processes have coherent caches; two solutions:
- Update policy: copies in all caches are updated whenever one copy is modified.
- Invalidate policy: when a datum in one copy is modified, the same datum in the other caches is invalidated (by resetting a bit); the new data is fetched only if the processor needs it.
False sharing may appear when different processors alter different data within the same block.
Solution for false sharing: the compiler can alter the layout of the data stored in main memory, separating data only altered by one processor into different blocks.

52 False Sharing
Different parts of a block are required by different processors, but not the same bytes. If one processor writes to one part of the block, copies of the complete block in other caches must be updated or invalidated even though the actual data is not shared.
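A minimal sketch (not from the slides) of false sharing and the padding remedy described above; the cache-line size of 64 bytes and the array sizes are illustrative assumptions:

#include <omp.h>

#define NTHREADS   8
#define CACHE_LINE 64                  /* assumed cache line / block size in bytes */

/* Prone to false sharing: counters for different threads sit in the same
   block, so every increment invalidates the other processors' copies. */
long counts_bad[NTHREADS];

/* Padded layout: each counter occupies its own block. */
struct padded_count {
    long count;
    char pad[CACHE_LINE - sizeof(long)];
};
struct padded_count counts_good[NTHREADS];

void tally(long n) {
    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        long i;
        for (i = 0; i < n; i++)
            counts_good[id].count++;   /* no false sharing with the padded layout */
    }
}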

53 2. Shared Memory Synchronization
Too much use of synchronization primitives can be a major source of reduced performance in shared memory programs. Shared memory synchronization is used for three main purposes:
- Mutual exclusion synchronization: used to control access to critical sections. Have as few critical sections as possible (see the sketch below).
- Process/thread synchronization: used to make threads wait for each other (sometimes needlessly). Have as few barriers as possible.
- Event synchronization: used to tell a thread (e.g., through a shared flag) that a certain event, like updating a variable, has occurred. Reduce busy waiting and time spent in critical sections.
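A minimal sketch (not from the slides) of the "few critical sections" advice: the same sum written with a critical section entered on every iteration versus a reduction clause that synchronizes only once per thread at the end:

/* Heavy synchronization: the critical section is entered n times. */
double sum_with_critical(const double *a, int n) {
    double sum = 0.0;
    int i;
    #pragma omp parallel for shared(sum)
    for (i = 0; i < n; i++) {
        #pragma omp critical
        sum += a[i];
    }
    return sum;
}

/* Light synchronization: each thread keeps a private partial sum and the
   partial sums are combined once at the end of the loop. */
double sum_with_reduction(const double *a, int n) {
    double sum = 0.0;
    int i;
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;
}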

54 3. Sequential Consistency
Definition (Lamport, 1979): A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor occur in this sequence in the order specified by its program.
That is, the overall effect of a parallel program is not changed by any arbitrary interleaving of instruction execution in time.
- Is sequential consistency a property parallel programs should have?
- How does sequential consistency affect performance?

55 Sequential Consistency

56 Sequential Consistency (cont’d)
Programs in a sequentially consistent parallel system:
- must produce the results that the program is designed to produce, even though the individual requests from the processors can be interleaved in any order;
- enable us to reason about the result of the program.
Example:

Process P1:
data = new;
flag = TRUE;

Process P2:
while (flag != TRUE) { };
data_copy = data;

57 How Sequential Consistency Affects Performance
- Sequential consistency requires that the "operations of each individual processor ... occur in the order specified in its program", i.e., in program order.
- Clever compilers sometimes reorder statements to improve performance (see the example in the textbook).
- Similarly, modern high-performance processors typically reorder machine instructions to improve performance.
- Enforcing sequential consistency can therefore significantly limit compiler optimizations and processor performance.
- Some processors have special machine instructions to achieve sequential consistency while improving performance (see the examples in the textbook).

58 Additional References
- Introduction to OpenMP, Lawrence Livermore National Laboratory.
- Parallel Computing with OpenMP on distributed shared memory platforms, National Research Council Canada.

