
1 Art of Multiprocessor Programming 1 This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License. You are free: –to Share — to copy, distribute and transmit the work –to Remix — to adapt the work Under the following conditions: –Attribution. You must attribute the work to “The Art of Multiprocessor Programming” (but not in any way that suggests that the authors endorse you or your use of the work). –Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license. For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to http://creativecommons.org/licenses/by-sa/3.0/. Any of the above conditions can be waived if you get permission from the copyright holder. Nothing in this license impairs or restricts the author's moral rights.

2 Parallel and distributed programming Adam Piotrowski Grzegorz Jabłoński Lecture I

3 3 Parallel and distributed programming lux.dmcs.pl/padp Grading policy Lecture slides and material Additional material

4 Course Overview 4 Introduction to multiprocessor programming Using OpenMP Distributed programming MPI, CORBA programming Theoretical approach to multiprocessor programming Fundamentals - Models, algorithms Real-World programming - Architectures, Techniques

5 Books 5 The Art of Multiprocessor Programming by Maurice Herlihy and Nir Shavit Using OpenMP Portable Shared Memory Parallel Programming by Barbara Chapman, Gabriele Jost and Ruud van der Pas

6 Art of Multiprocessor Programming6 Still on some of your desktops: The Uniprocessor (figure: one cpu connected to memory)

7 Art of Multiprocessor Programming7 In the Enterprise: The Shared Memory Multiprocessor (SMP) cache Bus shared memory cache

8 Art of Multiprocessor Programming8 Traditional Scaling Process User code Traditional Uniprocessor Speedup 1.8x 7x 3.6x Time: Moore’s law

9 Art of Multiprocessor Programming9 Multicore Scaling Process User code Multicore Speedup 1.8x 7x 3.6x Unfortunately, not so simple…

10 Art of Multiprocessor Programming10 Real-World Scaling Process 1.8x 2x 2.9x User code Multicore Speedup Parallelization and Synchronization require great care…

11 Art of Multiprocessor Programming 11 Sequential Computation memory object thread

12 Art of Multiprocessor Programming 12 Concurrent Computation memory object threads

13 Art of Multiprocessor Programming13 Asynchrony Sudden unpredictable delays –Cache misses (short) –Page faults (long) –Scheduling quantum used up (really long)

14 Art of Multiprocessor Programming14 Model Summary Multiple threads –Sometimes called processes Single shared memory Objects live in memory Unpredictable asynchronous delays

15 Art of Multiprocessor Programming15 Parallel Primality Testing Challenge –Print primes from 1 to 10^10 Given –Ten-processor multiprocessor –One thread per processor Goal –Get ten-fold speedup (or close)

16 Art of Multiprocessor Programming16 Load Balancing Split the work evenly Each thread tests a range of 10^9 numbers (figure: P0 tests 1…10^9, P1 tests 10^9+1…2·10^9, …, P9 tests the top range)

17 Art of Multiprocessor Programming 17 Procedure for Thread i

void primePrint() {
  int i = ThreadID.get(); // IDs in {0..9}
  for (long j = i*10^9 + 1; j < (i+1)*10^9; j++) {
    if (isPrime(j)) print(j);
  }
}

18–19 Art of Multiprocessor Programming Issues Higher ranges have fewer primes Yet larger numbers harder to test Thread workloads –Uneven –Hard to predict Need dynamic load balancing (the static split is rejected)

20 Art of Multiprocessor Programming 20 Shared Counter: each thread takes the next number (figure: counter handing out 17, 18, 19, …)

21–24 Art of Multiprocessor Programming Procedure for Thread i

Counter counter = new Counter(1);
void primePrint() {
  long j = 0;
  while (j < 10^10) {
    j = counter.getAndIncrement();
    if (isPrime(j)) print(j);
  }
}

(callouts: shared counter object; stop when every value taken; increment & return each new value)

25–26 Art of Multiprocessor Programming Counter Implementation

public class Counter {
  private long value;
  public long getAndIncrement() {
    return value++;
  }
}

OK for single thread, not for concurrent threads.

27–28 Art of Multiprocessor Programming What It Means: value++ is really three steps:

temp = value;
value = value + 1;
return temp;

29 Art of Multiprocessor Programming 29 Not so good… (timeline: value is 1; both threads read 1; one writes 2; the other reads 2 and writes 3; a delayed write of 2 then overwrites it; final value 2 instead of 3, so an increment is lost)
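This lost update is easy to reproduce outside Java; here is a minimal C sketch with POSIX threads (my addition, not from the slides) in which the final count typically falls well short of the expected total:

#include <pthread.h>
#include <stdio.h>

#define INCS 1000000

long value = 0;                /* shared, unprotected counter */

void *worker(void *arg) {
    for (int i = 0; i < INCS; i++)
        value++;               /* read-modify-write: not atomic */
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("value = %ld\n", value);   /* expected 2000000; usually less */
    return 0;
}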

30 Art of Multiprocessor Programming 30 Is this problem inherent? If we could only glue reads and writes… (figure: each read paired with its write as one indivisible unit)

31–33 Art of Multiprocessor Programming Challenge

public class Counter {
  private long value;
  public long getAndIncrement() {
    temp = value;
    value = temp + 1;
    return temp;
  }
}

Make these steps atomic (indivisible). The hardware solution: a ReadModifyWrite() instruction.

34–36 Art of Multiprocessor Programming An Aside: Java™

public class Counter {
  private long value;
  public long getAndIncrement() {
    long temp;
    synchronized (this) {   // synchronized block: mutual exclusion
      temp = value;
      value = temp + 1;
    }
    return temp;
  }
}

37 Art of Multiprocessor Programming37 Why do we care? We want as much of the code as possible to execute concurrently (in parallel) A larger sequential part implies reduced performance Amdahl’s law: this relation is not linear…

38–42 Art of Multiprocessor Programming Amdahl’s Law: the speedup of a computation given n CPUs instead of 1 is

  Speedup = 1 / ((1 - p) + p/n)

where p is the parallel fraction, 1 - p the sequential fraction, and n the number of processors.

43 Art of Multiprocessor Programming 43 Example Ten processors 60% concurrent, 40% sequential How close to 10-fold speedup?

44 Art of Multiprocessor Programming 44 Example Ten processors 60% concurrent, 40% sequential How close to 10-fold speedup? Speedup = 1 / (0.4 + 0.6/10) ≈ 2.17

45 Art of Multiprocessor Programming 45 Example Ten processors 80% concurrent, 20% sequential How close to 10-fold speedup?

46 Art of Multiprocessor Programming 46 Example Ten processors 80% concurrent, 20% sequential How close to 10-fold speedup? Speedup = 1 / (0.2 + 0.8/10) ≈ 3.57

47 Art of Multiprocessor Programming 47 Example Ten processors 90% concurrent, 10% sequential How close to 10-fold speedup?

48 Art of Multiprocessor Programming 48 Example Ten processors 90% concurrent, 10% sequential How close to 10-fold speedup? Speedup = 1 / (0.1 + 0.9/10) ≈ 5.26

49 Art of Multiprocessor Programming 49 Example Ten processors 99% concurrent, 1% sequential How close to 10-fold speedup?

50 Art of Multiprocessor Programming 50 Example Ten processors 99% concurrent, 1% sequential How close to 10-fold speedup? Speedup = 1 / (0.01 + 0.99/10) ≈ 9.17
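These numbers are easy to verify; a few lines of C applying the formula (my sketch, not part of the slides):

#include <stdio.h>

/* Amdahl's law: speedup with parallel fraction p on n processors */
double amdahl(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    double fractions[] = { 0.6, 0.8, 0.9, 0.99 };
    for (int i = 0; i < 4; i++)
        printf("p = %.2f -> speedup = %.2f\n",
               fractions[i], amdahl(fractions[i], 10));
    return 0;   /* prints 2.17, 3.57, 5.26, 9.17 */
}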

51 Art of Multiprocessor Programming51 The Moral Making good use of our multiple processors (cores) means –Finding ways to effectively parallelize our code Minimize sequential parts Reduce idle time in which threads wait without doing useful work

52 Art of Multiprocessor Programming 52 Implementing a Counter

public class Counter {
  private long value;
  public long getAndIncrement() {
    temp = value;
    value = temp + 1;
    return temp;
  }
}

Make these steps indivisible using locks.

53 Art of Multiprocessor Programming 53 Locks (Mutual Exclusion) public interface Lock { public void lock(); public void unlock(); }

54 Art of Multiprocessor Programming 54 Locks (Mutual Exclusion) public interface Lock { public void lock(); public void unlock(); } acquire lock

55 Art of Multiprocessor Programming 55 Locks (Mutual Exclusion) public interface Lock { public void lock(); public void unlock(); } release lock acquire lock

56–59 Art of Multiprocessor Programming Using Locks

public class Counter {
  private long value;
  private Lock lock;
  public long getAndIncrement() {
    long temp;
    lock.lock();              // acquire lock
    try {
      temp = value;           // critical section
      value = temp + 1;
    } finally {
      lock.unlock();          // release lock (no matter what)
    }
    return temp;
  }
}
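The same lock-protected counter pattern in C with POSIX mutexes (a sketch of mine; the counter_t name and helper functions are illustrative, not from the slides):

#include <pthread.h>

typedef struct {
    long value;
    pthread_mutex_t lock;
} counter_t;

void counter_init(counter_t *c) {
    c->value = 1;
    pthread_mutex_init(&c->lock, NULL);
}

/* the critical section: read and update under the lock */
long counter_get_and_increment(counter_t *c) {
    pthread_mutex_lock(&c->lock);
    long temp = c->value;
    c->value = temp + 1;
    pthread_mutex_unlock(&c->lock);   /* always release */
    return temp;
}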

60 Art of Multiprocessor Programming60 Today: Revisit Mutual Exclusion Think of performance, not just correctness and progress Begin to understand how performance depends on our software properly utilizing the multiprocessor machine’s hardware And get to know a collection of locking algorithms…

61–62 Art of Multiprocessor Programming What should you do if you can’t get a lock? Keep trying –“spin” or “busy-wait” –Good if delays are short (our focus) Give up the processor –Good if delays are long –Always good on uniprocessor

63–64 Art of Multiprocessor Programming Basic Spin-Lock: spin until the lock looks free, acquire it, run the critical section, reset the lock upon exit. …the lock introduces a sequential bottleneck.

65 Art of Multiprocessor Programming65 Review: Test-and-Set Boolean value Test-and-set (TAS) –Swap true with current value –Return value tells if prior value was true or false Can reset just by writing false

66–67 Art of Multiprocessor Programming Review: Test-and-Set

public class AtomicBoolean {
  boolean value;
  public synchronized boolean getAndSet(boolean newValue) {
    boolean prior = value;    // swap old and new values
    value = newValue;
    return prior;
  }
}

68–69 Art of Multiprocessor Programming Review: Test-and-Set

AtomicBoolean lock = new AtomicBoolean(false);
…
boolean prior = lock.getAndSet(true);

Swapping in true is called “test-and-set” or TAS.

70 Art of Multiprocessor Programming70 Test-and-Set Locks Locking –Lock is free: value is false –Lock is taken: value is true Acquire lock by calling TAS –If result is false, you win –If result is true, you lose Release lock by writing false

71–74 Art of Multiprocessor Programming Test-and-set Lock

class TASlock {
  AtomicBoolean state = new AtomicBoolean(false);  // lock state is AtomicBoolean
  void lock() {
    while (state.getAndSet(true)) {}   // keep trying until lock acquired
  }
  void unlock() {
    state.set(false);                  // release lock by resetting state to false
  }
}
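The same lock in C11, using the standard atomic_flag as the TAS primitive (a sketch, not from the slides):

#include <stdatomic.h>

typedef struct {
    atomic_flag state;   /* clear = free, set = taken */
} tas_lock_t;

void tas_init(tas_lock_t *l)   { atomic_flag_clear(&l->state); }

void tas_lock(tas_lock_t *l) {
    /* TAS returns the prior value; true means someone else holds it */
    while (atomic_flag_test_and_set(&l->state)) { }
}

void tas_unlock(tas_lock_t *l) { atomic_flag_clear(&l->state); }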

75 Art of Multiprocessor Programming75 Performance Experiment –n threads –Increment shared counter 1 million times How long should it take? How long does it take?

76 Art of Multiprocessor Programming76 Graph (time vs. threads): the ideal line stays flat; no speedup because of the sequential bottleneck.

77 Art of Multiprocessor Programming77 Mystery #1 (graph: time vs. threads; the TAS lock curve climbs far above the ideal) What is going on?

78–81 Art of Multiprocessor Programming Bus-Based Architectures (figure: processors with per-processor caches and memory on a shared bus). Random access memory: 10s of cycles. Shared Bus: broadcast medium, one broadcaster at a time, processors and memory all “snoop”. Per-Processor Caches: small, fast (1 or 2 cycles), hold address & state information.

82–83 Art of Multiprocessor Programming Jargon Watch Cache hit –“I found what I wanted in my cache” –Good Thing™ Cache miss –“I had to shlep all the way to memory for that data” –Bad Thing™

84 Art of Multiprocessor Programming84 Cave Canem This model is still a simplification –But not in any essential way –Illustrates basic principles Will discuss complexities later

85–87 Art of Multiprocessor Programming Processor Issues Load Request: a processor broadcasts “gimme data” on the bus; memory responds (“got your data right here”) and the data is installed in the requester’s cache.

88–92 Art of Multiprocessor Programming Another processor then issues a load request for the same data; the processor that already holds a copy snoops the request (“I got data”) and responds with the data itself.

93–96 Art of Multiprocessor Programming Modify Cached Data: a processor now updates the value in its own cache. What’s up with the other copies?

97 Art of Multiprocessor Programming97 Cache Coherence We have lots of copies of data –Original copy in memory –Cached copies at processors Some processor modifies its own copy –What do we do with the others? –How to avoid confusion?

98 Art of Multiprocessor Programming98 Write-Back Caches Accumulate changes in cache Write back when needed –Need the cache for something else –Another processor wants it On first modification –Invalidate other entries –Requires non-trivial protocol …

99 Art of Multiprocessor Programming99 Write-Back Caches Cache entry has three states –Invalid: contains raw seething bits –Valid: I can read but I can’t write –Dirty: Data has been modified Intercept other load requests Write back to memory before using cache

100–105 Art of Multiprocessor Programming Invalidate: the writer announces the modification on the bus (“mine, all mine!”); the other caches (“uh, oh”) lose read permission, while this cache acquires write permission. Memory is not updated now (that would be expensive): memory provides data only if it is not present in any cache.

106–107 Art of Multiprocessor Programming Another Processor Asks for Data: the owner snoops the request and responds with the modified data (“here it is!”).

108 Art of Multiprocessor Programming 108 End of the Day: both caches now hold the data with read permission only (reading OK, no writing).

109 Art of Multiprocessor Programming109 Simple TASLock TAS invalidates cache lines Spinners –Miss in cache –Go to bus Thread wants to release lock –delayed behind spinners

110 Art of Multiprocessor Programming110 Test-and-test-and-set Wait until lock “looks” free –Spin on local cache –No bus use while lock busy Problem: when lock is released –Invalidation storm …
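For concreteness, here is a test-and-test-and-set lock in C11 atomics (again my sketch, not the course’s code): spin reading the flag locally, and attempt the expensive TAS only when the lock looks free.

#include <stdatomic.h>

typedef struct {
    atomic_bool state;   /* false = free, true = taken */
} ttas_lock_t;

void ttas_init(ttas_lock_t *l) { atomic_store(&l->state, false); }

void ttas_lock(ttas_lock_t *l) {
    for (;;) {
        /* spin on the cached copy: no bus traffic while the lock is busy */
        while (atomic_load(&l->state)) { }
        /* lock looks free: now try the real TAS */
        if (!atomic_exchange(&l->state, true))
            return;
    }
}

void ttas_unlock(ttas_lock_t *l) { atomic_store(&l->state, false); }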

111 Art of Multiprocessor Programming 111 Local Spinning while Lock is Busy: every spinner reads “busy” from its own cache; the bus stays quiet.

112–114 Art of Multiprocessor Programming On Release: the owner writes “free”, invalidating every cached copy; everyone misses and rereads, and then everyone tries TAS at once.

115 Art of Multiprocessor Programming115 Mystery Explained (graph: time vs. threads; the TTAS lock curve lies between the TAS lock and the ideal) Better than TAS, but still not as good as ideal.

116 OpenMP (Shared Memory, Thread Based Parallelism) OpenMP is a shared-memory application programming interface (API) whose features are based on prior efforts to facilitate shared-memory parallel programming. (Compiler Directive Based) OpenMP provides directives, library functions and environment variables to create and control the execution of parallel programs. (Explicit Parallelism) OpenMP’s directives let the user tell the compiler which instructions to execute in parallel and how to distribute them among the threads that will run the code.

117 OpenMP example

Sequential version:

int main() {
  for (i = 0; i < n; i++) {
    a[i] = i * 0.5;
    b[i] = i * 2.0;
  }
  sum = 0;
  for (i = 1; i <= n; i++) {
    sum = sum + a[i]*b[i];
  }
  printf("sum = %f\n", sum);
}

Parallel version:

int main() {
  for (i = 0; i < n; i++) {
    a[i] = i * 0.5;
    b[i] = i * 2.0;
  }
  sum = 0;
  #pragma omp parallel private(t) shared(sum, a, b, n)
  {
    t = 0;
    #pragma omp for
    for (i = 1; i <= n; i++) {
      t = t + a[i]*b[i];
    }
    #pragma omp critical (update_sum)
    sum += t;
  }
  printf("sum = %f\n", sum);
}

118 PARALLEL Region Construct A parallel region is a block of code that will be executed by multiple threads. This is the fundamental OpenMP parallel construct. Starting from the beginning of this parallel region, the code is duplicated and all threads will execute that code. There is an implied barrier at the end of a parallel section. Only the master thread continues execution past this point. How Many Threads? –Setting of the NUM_THREADS clause –Use of the omp_set_num_threads() library function –Setting of the OMP_NUM_THREADS environment variable –Implementation default - usually the number of CPUs on a node A parallel region must be a structured block that does not span multiple routines or code files. It is illegal to branch into or out of a parallel region.

119 PARALLEL Region Construct

#pragma omp parallel [clause ...] newline
  if (scalar_expression)
  private (list)
  shared (list)
  default (shared | none)
  firstprivate (list)
  num_threads (integer-expression)

structured_block

120 PARALLEL Region Construct

#include <omp.h>
#include <stdio.h>
int main() {
  int nthreads, tid;
  /* Fork a team of threads with each thread having a private tid variable */
  #pragma omp parallel private(tid)
  {
    /* Obtain and print thread id */
    tid = omp_get_thread_num();
    printf("Hello World from thread = %d\n", tid);
    /* Only master thread does this */
    if (tid == 0) {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }
  } /* All threads join master thread and terminate */
}

121 Work-Sharing Constructs for Directive The for directive specifies that the iterations of the loop immediately following it must be executed in parallel by the team. This assumes a parallel region has already been initiated, otherwise it executes in serial on a single processor. 121

122 Work-Sharing Constructs for Directive

#pragma omp for [clause ...] newline
  schedule (type [,chunk])
  ordered
  private (list)
  firstprivate (list)
  lastprivate (list)
  shared (list)
  nowait

for_loop

123 Work-Sharing Constructs for Directive

#include <omp.h>
#define CHUNKSIZE 100
#define N 1000
int main() {
  int i, chunk;
  float a[N], b[N], c[N];
  /* Some initializations */
  for (i=0; i < N; i++)
    a[i] = b[i] = i * 1.0;
  chunk = CHUNKSIZE;
  #pragma omp parallel shared(a,b,c,chunk) private(i)
  {
    #pragma omp for schedule(dynamic,chunk) nowait
    for (i=0; i < N; i++)
      c[i] = a[i] + b[i];
  } /* end of parallel section */
}

124 Work-Sharing Constructs SECTIONS Directive The SECTIONS directive is a non- iterative work-sharing construct. It specifies that the enclosed section(s) of code are to be divided among the threads in the team. 124

125 Work-Sharing Constructs SECTIONS Directive

#pragma omp sections [clause ...] newline
  private (list)
  firstprivate (list)
  lastprivate (list)
  nowait
{
  #pragma omp section newline
  structured_block
}

126 Work-Sharing Constructs SECTIONS Directive

#include <omp.h>
#define N 1000
int main() {
  int i;
  float a[N], b[N], c[N], d[N];
  /* Some initializations */
  for (i=0; i < N; i++) {
    a[i] = i * 1.5;
    b[i] = i + 22.35;
  }
  #pragma omp parallel shared(a,b,c,d) private(i)
  {
    #pragma omp sections nowait
    {
      #pragma omp section
      for (i=0; i < N; i++)
        c[i] = a[i] + b[i];
      #pragma omp section
      for (i=0; i < N; i++)
        d[i] = a[i] * b[i];
    } /* end of sections */
  } /* end of parallel section */
}

127 Work-Sharing Constructs SINGLE Directive The SINGLE directive specifies that the enclosed code is to be executed by only one thread in the team. May be useful when dealing with sections of code that are not thread safe (such as I/O).

#pragma omp single [clause ...] newline
  private (list)
  firstprivate (list)
  nowait

structured_block
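A minimal illustration (mine, not from the slides): one thread does the initialization while the rest wait at the implicit barrier at the end of the single block.

#include <omp.h>
#include <stdio.h>

int main() {
    int n = 0;
    #pragma omp parallel shared(n)
    {
        #pragma omp single
        {
            /* executed by exactly one thread; the others wait at the
               implicit barrier at the end of the single block */
            n = 42;   /* e.g. read from a file */
            printf("initialized by thread %d\n", omp_get_thread_num());
        }
        printf("thread %d sees n = %d\n", omp_get_thread_num(), n);
    }
    return 0;
}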

128 Data Scope Attribute Clauses - PRIVATE The PRIVATE clause declares variables in its list to be private to each thread. A new object of the same type is declared once for each thread in the team. All references to the original object are replaced with references to the new object. Variables declared PRIVATE should be assumed to be uninitialized for each thread.
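A short sketch of mine (not from the slides) showing both points: each thread gets its own uninitialized copy, and the original object is untouched.

#include <omp.h>
#include <stdio.h>

int main() {
    int x = 100;
    #pragma omp parallel private(x)
    {
        /* x here is a fresh, uninitialized per-thread copy,
           unrelated to the x = 100 outside the region */
        x = omp_get_thread_num();
        printf("thread %d: x = %d\n", omp_get_thread_num(), x);
    }
    printf("after the region, x is still %d\n", x);   /* 100 */
    return 0;
}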

129 Data Scope Attribute Clauses - SHARED The SHARED clause declares variables in its list to be shared among all threads in the team. A shared variable exists in only one memory location and all threads can read or write to that address. It is the programmer's responsibility to ensure that multiple threads properly access SHARED variables (such as via CRITICAL sections).

130 Data Scope Attribute Clauses - FIRSTPRIVATE The FIRSTPRIVATE clause combines the behavior of the PRIVATE clause with automatic initialization of the variables in its list. Listed variables are initialized according to the value of their original objects prior to entry into the parallel or work-sharing construct. 130
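A small illustration (mine, not from the slides): unlike plain PRIVATE, each thread's copy starts with the value the variable had before the region.

#include <omp.h>
#include <stdio.h>

int main() {
    int offset = 10;
    #pragma omp parallel firstprivate(offset)
    {
        /* each thread's private copy is initialized to 10 */
        offset += omp_get_thread_num();
        printf("thread %d: offset = %d\n", omp_get_thread_num(), offset);
    }
    return 0;
}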

131 Data Scope Attribute Clauses - LASTPRIVATE The LASTPRIVATE clause combines the behavior of the PRIVATE clause with a copy from the last loop iteration or section to the original variable object. The value copied back into the original variable object is obtained from the sequentially last iteration or section of the enclosing construct: the team member which executes the final iteration of a loop, or the team member which executes the last SECTION of a SECTIONS context, performs the copy with its own values.
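A sketch of mine (not from the slides): whichever thread runs the final iteration copies its value back to the original variable.

#include <stdio.h>

int main() {
    int i, last;
    #pragma omp parallel for lastprivate(last)
    for (i = 0; i < 100; i++) {
        last = i;   /* the thread running the sequentially last
                       iteration (i == 99) copies its value back */
    }
    printf("last = %d\n", last);   /* prints 99 */
    return 0;
}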

132 Synchronization Constructs CRITICAL Directive The CRITICAL directive specifies a region of code that must be executed by only one thread at a time. If a thread is currently executing inside a CRITICAL region and another thread reaches that CRITICAL region and attempts to execute it, it will block until the first thread exits that CRITICAL region. 132

133 Synchronization Constructs CRITICAL Directive

#include <omp.h>
int main() {
  int x;
  x = 0;
  #pragma omp parallel shared(x)
  {
    #pragma omp critical
    x = x + 1;
  } /* end of parallel section */
}

134 Synchronization Constructs MASTER Directive –The MASTER directive specifies a region that is to be executed only by the master thread of the team. All other threads on the team skip this section of code BARRIER Directive –The BARRIER directive synchronizes all threads in the team. –When a BARRIER directive is reached, a thread will wait at that point until all other threads have reached that barrier. All threads then resume executing in parallel the code that follows the barrier. 134
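A sketch of mine (not from the slides) combining the two: note that MASTER has no implied barrier, so an explicit BARRIER is needed before other threads may depend on the master's work.

#include <omp.h>
#include <stdio.h>

int main() {
    #pragma omp parallel
    {
        #pragma omp master
        printf("setup done by the master thread only\n");
        /* master has no implied barrier: force one before
           anyone relies on the master's setup */
        #pragma omp barrier
        printf("thread %d past the barrier\n", omp_get_thread_num());
    }
    return 0;
}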

135 Synchronization Constructs ATOMIC Directive –The ATOMIC directive specifies that a specific memory location must be updated atomically, rather than letting multiple threads attempt to write to it. In essence, this directive provides a mini-CRITICAL section. –The directive applies only to a single, immediately following statement 135
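A minimal sketch of mine (not from the slides): the directive covers only the single update statement that follows it.

#include <stdio.h>

int main() {
    int x = 0;
    #pragma omp parallel
    {
        /* applies only to the single statement that follows */
        #pragma omp atomic
        x += 1;
    }
    printf("x = %d\n", x);   /* equals the number of threads */
    return 0;
}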

136 Find an error

int main() {
  for (i = 0; i < n; i++) {
    a[i] = i * 0.5;
    b[i] = i * 2.0;
  }
  sum = 0;
  t = 0;
  #pragma omp parallel shared(sum, a, b, n)
  {
    #pragma omp for private(t)
    for (i = 1; i <= n; i++) {
      t = t + a[i]*b[i];
    }
    #pragma omp critical (update_sum)
    sum += t;
  }
  printf("sum = %f \n", sum);
}

137 Find an error for (i=0; i<n-1; i++) a[i] = a[i] + b[i]; for (i=0; i<n-1; i++) a[i] = a[i+1] + b[i]; 137

138 Find an error #pragma omp parallel { int Xlocal = omp_get_thread_num(); Xshared = omp_get_thread_num(); printf("Xlocal = %d Xshared = %d\n",Xlocal,Xshared); } int i, j; #pragma omp parallel for for (i=0; i<n; i++) for (j=0; j<m; j++) { a[i][j] = compute(i,j); } 138

139 Find an error void compute(int n) { int i; double h, x, sum; h = 1.0/(double) n; sum = 0.0; #pragma omp for reduction(+:sum) shared(h) for (i=1; i <= n; i++) { x = h * ((double)i - 0.5); sum += (1.0 / (1.0 + x*x)); } pi = h * sum; } 139

140 Find an error void main () {............. #pragma omp parallel for private(i,a,b) for (i=0; i<n; i++) { b++; a = b+i; } /*-- End of parallel for --*/ c = a + b;............. } 140

141 Art of Multiprocessor Programming 141 (License slide repeated: Creative Commons Attribution-ShareAlike 2.5, as on slide 1.)

