1
Programming with Shared Memory Specifying parallelism
Performance issues
ITCS4145/5145, Parallel Programming. B. Wilkinson, Oct 2. slides 8d.ppt
2
With OpenMP, the programmer decides which parts of the code should be parallelized and inserts compiler directives (pragmas) accordingly. The issue for the programmer is deciding what can be done in parallel. This applies to any tool, especially lower-level tools such as thread APIs, as well as to OpenMP. Let us use generic language constructs for parallelism.
3
par Construct
For specifying concurrent statements:

    par {
        S1;
        S2;
        ...
        Sn;
    }

The par construct says that the statements S1 to Sn can all be executed simultaneously if resources are available, or executed in any order, and the result will still be correct.
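C has no par construct, but the same idea can be sketched with OpenMP sections. A minimal sketch; the functions compute_a, compute_b and compute_c are hypothetical stand-ins for S1, S2 and S3:

    #include <stdio.h>

    /* Hypothetical independent tasks standing in for S1, S2, S3. */
    void compute_a(void) { printf("S1\n"); }
    void compute_b(void) { printf("S2\n"); }
    void compute_c(void) { printf("S3\n"); }

    int main(void) {
        /* Each section may run on a different thread, in any order,
           mirroring par { S1; S2; S3; }. */
        #pragma omp parallel sections
        {
            #pragma omp section
            compute_a();
            #pragma omp section
            compute_b();
            #pragma omp section
            compute_c();
        }
        return 0;
    }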
4
forall Construct To start multiple similar processes together:
    forall (i = 0; i < n; i++) {
        S1;
        S2;
        ...
        Sm;
    }

Says each iteration of the body (the statements S1 to Sm) can be executed simultaneously if resources are available, or in any order, and the result will still be correct. The statements within each instance of the body are executed in the order given. Each instance of the body uses a different value of i.
5
Example forall (i = 0; i < 5; i++) a[i] = 0;
clears a[0], a[1], a[2], a[3], and a[4] to zero concurrently.
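The generic forall is not part of C, but this particular example can be sketched with an OpenMP parallel for, which likewise declares the iterations independent:

    #define N 5

    int main(void) {
        int a[N];

        /* Each iteration is independent, so the instances of the body
           may execute concurrently, as with forall. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 0;

        return 0;
    }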
6
Dependency Analysis
Used to identify which processes can be executed together. Example: one can see immediately in the code forall (i = 0; i < 5; i++) a[i] = 0; that every instance of the body is independent of the other instances, so all instances can be executed simultaneously. However, it may not always be that obvious. We need an algorithmic way of recognizing dependencies, which could be used by a parallelizing compiler.
9
Can use Bernstein's conditions at:
The machine instruction level inside the processor – the processor has logic to detect whether the conditions are satisfied (see a computer architecture course).
The program statement level – to detect whether multiple statements can be executed simultaneously.
The process/thread level – to detect whether two processes/threads can be executed simultaneously (using the inputs and outputs of the processes/threads).
The conditions can be extended to more than two processes, but the number of conditions rises – every input/output combination needs to be checked. For three statements, how many conditions need to be checked?
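As a concrete illustration, assuming the usual statement of Bernstein's conditions for two statements S1 and S2 with input sets I1, I2 and output sets O1, O2 (they may run in parallel only if I1 ∩ O2, I2 ∩ O1 and O1 ∩ O2 are all empty), here is a small sketch that checks the conditions with the variable sets encoded as bitmasks:

    #include <stdio.h>

    /* Input set I and output set O of a statement, one bit per variable
       (bit 0 = a, bit 1 = x, bit 2 = y, bit 3 = z, bit 4 = b). */
    typedef struct { unsigned in, out; } access_set;

    /* Bernstein's conditions for two statements:
       I1 and O2, I2 and O1, and O1 and O2 must not intersect. */
    int can_run_in_parallel(access_set s1, access_set s2) {
        return (s1.in  & s2.out) == 0 &&
               (s2.in  & s1.out) == 0 &&
               (s1.out & s2.out) == 0;
    }

    int main(void) {
        access_set s1 = { 0x06, 0x01 };  /* a = x + y: reads {x,y}, writes {a} */
        access_set s2 = { 0x0A, 0x10 };  /* b = x + z: reads {x,z}, writes {b} */

        printf("parallel? %s\n", can_run_in_parallel(s1, s2) ? "yes" : "no");
        return 0;
    }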
10
Performance issues with Threads
A program might actually go slower when parallelized. Too many threads can significantly reduce program performance.
11
Reasons
Thread creation overhead – if each thread is given too little work, the overhead of starting and terminating threads dominates the time.*
Too many concurrent threads – pending threads are scheduled to use the available resources, typically in a round-robin time-slice fashion. Each context change requires registers to be saved and has effects on cache memory, memory management, …
Threads waiting to acquire a lock – if a thread is suspended while holding a lock, all threads waiting for the lock will have to wait for that thread to restart.
* On a regular processor. For GPUs, see later.
Source: Multi-core Programming by S. Akhter and J. Roberts, Intel Press.
12
Some Strategies
Limit the number of runnable threads to the number of hardware threads. (We do not do this with GPUs; in fact we do the opposite, see later.) For an n-core machine have n runnable threads; if hyper-threaded (with 2 virtual threads per core), double this. There can be more threads in total, but the others may be blocked.
Separate I/O threads from compute threads. I/O threads wait for external events.
Never hard-code the number of threads – leave it as a tuning parameter (a sketch of this follows below).
Implement a thread pool.
Implement a work-stealing approach in which each thread has a work queue. Threads with no work take work from other threads or are given work by other threads.
(The textbook goes into which is best under different situations.)
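A minimal OpenMP sketch of the tuning-parameter advice; the environment variable name NUM_WORKERS is a hypothetical choice, and the default is one runnable thread per hardware thread:

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main(void) {
        /* Default: one runnable thread per hardware thread. */
        int nthreads = omp_get_num_procs();

        /* Leave the count as a tuning parameter rather than hard-coding it
           (here read from a hypothetical NUM_WORKERS environment variable). */
        const char *env = getenv("NUM_WORKERS");
        if (env != NULL && atoi(env) > 0)
            nthreads = atoi(env);

        omp_set_num_threads(nthreads);

        #pragma omp parallel
        {
            #pragma omp single
            printf("running with %d threads\n", omp_get_num_threads());
        }
        return 0;
    }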
13
Critical Sections Serializing Code
High-performance programs should have as few critical sections as possible, as their use can serialize code. Suppose all processes happen to reach their critical sections together: they will execute their critical sections one after the other, and in that situation the execution time becomes almost that of a single processor.
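A small OpenMP sketch of the effect: placing a critical section around every update serializes the threads, whereas keeping the shared update out of the loop (here with a reduction) lets them run almost entirely independently:

    #define N 1000000

    /* Serialized: threads queue to enter the critical section one at a time. */
    double sum_with_critical(const double *x) {
        double sum = 0.0;
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            #pragma omp critical
            sum += x[i];
        }
        return sum;
    }

    /* Mostly parallel: each thread accumulates privately and the partial
       sums are combined once at the end. */
    double sum_with_reduction(const double *x) {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += x[i];
        return sum;
    }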
14
Shared Data in Systems with Caches
All modern computer systems have cache memory: high-speed memory closely attached to each processor for holding recently referenced data and code. There are usually two or three levels of cache (L1, L2, ...) between the processors and main memory.
15
Some features of cache operation relevant to parallel programming
(More details in a computer architecture course.)
Information must be in cache memory for the processor to access it.
Temporal locality (locality in time) – individual locations, once referenced in a program, are likely to be referenced again in the near future – hence the usefulness of a cache.
Spatial locality (locality in space) – references are likely to be near the last reference. To take advantage of spatial locality, a series of consecutive locations, called a line or a block, is transferred into (and out of) the cache, preferably simultaneously.
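A small C sketch of spatial locality in action (C stores 2D arrays row-major): traversing row by row walks through consecutive locations and reuses each cache block, whereas traversing column by column touches a different block on almost every access:

    #define ROWS 1024
    #define COLS 1024

    static double a[ROWS][COLS];   /* stored row-major in C */

    /* Good spatial locality: consecutive accesses fall in the same block. */
    double sum_by_rows(void) {
        double sum = 0.0;
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                sum += a[i][j];
        return sum;
    }

    /* Poor spatial locality: successive accesses are COLS * sizeof(double)
       bytes apart, so almost every access brings a new block into the cache. */
    double sum_by_columns(void) {
        double sum = 0.0;
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                sum += a[i][j];
        return sum;
    }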
16
Multiprocessor cache coherence protocols
To ensure data in a cache and in memory is up to date when accessed:
Update policy – copies of the data in all caches are updated at the time one copy in a cache is altered, or
Invalidate policy – when one copy of the data is altered, the same data in any other cache is invalidated (by resetting a valid bit associated with each cache block). These copies are only updated when the associated processor next makes a reference to the data.
(Full details in a computer architecture course.)
17
False Sharing
Different parts of a block are required by different processors, but not the same bytes. If one processor writes to one part of the block, copies of the complete block in the other caches must be updated or invalidated even though the actual data is not shared.
18
Solutions for False Sharing
The compiler can alter the layout of data stored in main memory, separating data altered only by one processor into different blocks.
The programmer can try to separate data into different blocks by padding (as sketched below) – this requires knowing details of the cache organization and block size.
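A sketch of the padding approach, assuming a 64-byte cache block: each thread updates its own counter, and the padding keeps every counter in its own block so that the updates do not invalidate each other's cached copies:

    #include <omp.h>

    #define MAX_THREADS 64
    #define BLOCK_SIZE  64   /* assumed cache block (line) size in bytes */

    /* Without the pad, counters for different threads would share a block
       and every increment would invalidate the other threads' copies. */
    struct padded_counter {
        long value;
        char pad[BLOCK_SIZE - sizeof(long)];
    };

    static struct padded_counter counts[MAX_THREADS];

    void count_events(int n) {
        #pragma omp parallel
        {
            int id = omp_get_thread_num();
            for (int i = 0; i < n; i++)
                counts[id].value++;   /* each thread stays in its own block */
        }
    }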
19
Sequential Consistency
Formally defined by Lamport (1979): "A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." That is, the overall effect of a parallel program is not changed by any arbitrary interleaving of instruction execution in time.
20
Sequential Consistency
Writing a parallel program for a system known to be sequentially consistent enables us to reason about the result of the program.
21
Example

Process P1:
    data = new;
    flag = TRUE;

Process P2:
    while (flag != TRUE) { }
    data_copy = data;

We expect data_copy to be set to new, because we expect data = new to be executed before flag = TRUE, and while (flag != TRUE) { } to be executed before data_copy = data. This ensures that process 2 reads the new data from process 1; process 2 simply waits for the new data to be produced.
22
Program Order
Sequential consistency refers to the "operations of each individual processor ... appear in the order specified in its program", or program order. In the previous example, this order is that of the machine instructions as they are executed.
23
Compiler Optimizations
The order of execution is not necessarily the same as the order of the corresponding high-level statements in the source program, as a compiler may reorder statements for improved performance. Note: a regular compiler may reorder instructions during compilation without any regard for the fact that other, separately compiled programs may be designed to operate on shared data in a cooperative fashion.
24
High Performance Processors
Modern processors also usually reorder machine instructions internally during execution for increased performance. This does not stop a multiprocessor from being sequentially consistent, provided each processor only produces final results in program order (that is, retires values to registers in program order, which most processors do). All multiprocessors will have the option of operating under the sequential consistency model. However, it can severely limit compiler optimizations and processor performance.
25
Example of Processor Re-ordering
Process P1:
    new = a * b;
    data = new;
    flag = TRUE;

Process P2:
    while (flag != TRUE) { }
    data_copy = data;

The multiply machine instruction corresponding to new = a * b is issued for execution. The next instruction, corresponding to data = new, cannot be issued until the multiply has produced its result. However, the following statement, flag = TRUE, is completely independent, and a clever processor could start this operation before the multiply has completed, leading to the sequence:
26
Process P1:
    new = a * b;
    flag = TRUE;
    data = new;

Process P2:
    while (flag != TRUE) { }
    data_copy = data;

Now the while statement might complete before new is assigned to data, and the code would fail. All multiprocessors have the option of operating under the sequential consistency model, i.e. not reordering instructions, which forces the multiply instruction above to complete before any subsequent instruction that depends upon its result is started.
27
Relaxing Read/Write Orders
Processors may be able to relax the consistency in terms of the order of reads and writes of one processor with respect to those of another processor to obtain higher performance, and provide instructions to enforce consistency when it is needed. Examples of possible machine instructions:
Memory barrier (MB) instruction – waits for all previously issued memory access instructions to complete before issuing any new memory operations.
Write memory barrier (WMB) instruction – as MB, but applies only to memory write operations; it could proceed even if read operations are pending.
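MB and WMB are machine-level instructions; in portable C the same idea can be sketched for the earlier flag/data example with C11 atomics, where release/acquire ordering plays the role of the write barrier:

    #include <stdatomic.h>
    #include <stdbool.h>

    int data, data_copy;
    atomic_bool flag = false;

    /* Process P1: publish data, then set flag. The release store acts as a
       write barrier: the write to data becomes visible before flag does. */
    void producer(int new_value) {
        data = new_value;
        atomic_store_explicit(&flag, true, memory_order_release);
    }

    /* Process P2: spin until flag is set. The acquire load ensures the read
       of data cannot be reordered before the flag test. */
    void consumer(void) {
        while (!atomic_load_explicit(&flag, memory_order_acquire))
            ;   /* busy-wait */
        data_copy = data;
    }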
28
Questions