
1 Improving Parallel Performance Intel Software College Introduction to Parallel Programming – Part 7

2 Objectives
At the end of this module, you should be able to:
Give two reasons why one sequential algorithm may be more suitable than another for parallelization
Use loop fusion, loop fission, and loop inversion to create or improve opportunities for parallel execution
Explain the pros and cons of static versus dynamic loop scheduling
Explain why it can be difficult both to optimize load balancing and to maximize locality

3 General Rules of Thumb
Start with the best sequential algorithm
Maximize locality

4 Start with the Best Sequential Algorithm
Don't confuse speedup with speed. Speedup is the ratio of a program's execution time on 1 processor to its execution time on p processors.
What if you start with an inferior sequential algorithm? Naïve, higher-complexity algorithms are easier to make parallel, but they usually don't lead to the fastest parallel algorithm.
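In symbols, with T(p) the program's execution time on p processors (a standard formulation of the definition above):

    \[ \mathrm{Speedup}(p) \;=\; \frac{T(1)}{T(p)} \]

Starting from a slower sequential algorithm inflates T(1), and with it the reported speedup, without making the parallel program any faster in absolute terms; that is why speedup must not be confused with speed.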

5 Example: Search for a Chess Move
Naïve minimax algorithm: exhaustive search of the game tree. Branching factor around 35, so a search of depth d evaluates about 35^d nodes.
Alpha-beta search algorithm: prunes useless subtrees. Effective branching factor around 6, so a search of depth d evaluates about 6^d nodes.
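A quick worked comparison at depth 6, using the branching factors quoted above (my arithmetic):

    \[ 35^{6} \approx 1.8 \times 10^{9} \qquad \text{vs.} \qquad 6^{6} = 46{,}656 \]

Alpha-beta evaluates roughly 40,000 times fewer nodes at this depth, a far larger win than any realistic speedup from parallelizing the naïve search.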

6 Minimax Search
[Figure: a game tree whose 16 leaves hold the values 7 3 1 5 4 6 1 2 8 0 3 -5 -2 4 7 6. Levels alternate between "My move: choose max" and "His move: choose min"; the backed-up value at the root is 3.]

7 Alpha-Beta Pruning
[Figure: the same game tree after alpha-beta pruning. Whole subtrees are cut off, so only a fraction of the leaves (7 3 1 4 6 8 0 3 -5 ...) are evaluated, yet the root value is still 3.]

8 How Deep the Search?
Search depth reachable for a given node budget:

    Nodes evaluated   Minimax on      Parallel minimax,   Alpha-beta pruning
                      one processor   speedup 35          on one processor
    100,000                 3               4                    6
    100 million             5               6                   10
    100 billion             7               8                   14

9 Maximize Locality
Temporal locality: if a processor accesses a memory location, there is a good chance it will revisit that location soon.
Data (spatial) locality: if a processor accesses a memory location, there is a good chance it will visit a nearby location soon.
Programs tend to exhibit locality because they tend to have loops indexing through arrays. The principle of locality is what makes cache memory worthwhile.
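A small illustration of spatial locality (mine, not from the slides): C stores two-dimensional arrays row-major, so the traversal order decides how well the cache is used:

    #include <stdio.h>

    #define ROWS 1024
    #define COLS 1024

    static double m[ROWS][COLS];

    int main(void)
    {
        double sum = 0.0;

        /* Good spatial locality: the inner loop walks consecutive
           addresses (row-major order), so each cache line fetched
           is fully used before it is evicted. */
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                sum += m[i][j];

        /* Poor spatial locality: the inner loop strides COLS*8 bytes
           between accesses, touching a new cache line every time. */
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                sum += m[i][j];

        printf("%f\n", sum);
        return 0;
    }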

10 Parallel Processing and Locality
Multiple processors mean multiple caches. When a processor writes a value, the system must ensure that no processor references an obsolete copy (the cache coherence problem). A write by one processor can invalidate a line in another processor's cache, leading to a cache miss there.
Rule of thumb: it is better to have different processors manipulating totally different chunks of arrays. We say a parallel program has good locality if processors' memory writes tend not to interfere with the work being done by other processors.
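The classic way one thread's writes interfere with another's is false sharing. A sketch of the usual remedy (not from the slides; the 64-byte line size and the thread bound are assumptions):

    #include <omp.h>

    #define CACHE_LINE 64       /* typical line size, an assumption */
    #define MAX_THREADS 32      /* assumed upper bound on threads */

    /* Bad: adjacent longs share cache lines, so increments by
       different threads ping-pong the line between caches even
       though no counter is logically shared (false sharing). */
    long counters_bad[MAX_THREADS];

    /* Better: pad each counter out to its own cache line. */
    struct padded {
        long value;
        char pad[CACHE_LINE - sizeof(long)];
    };
    struct padded counters_good[MAX_THREADS];

    void count_events(long n)
    {
        #pragma omp parallel
        {
            int t = omp_get_thread_num();   /* one counter per thread */
            #pragma omp for
            for (long i = 0; i < n; i++)
                counters_good[t].value++;   /* no line shared across threads */
        }
    }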

11 Example: Array Initialization

    for (i = 0; i < N; i++)
        a[i] = 0;

[Figure: two ways of assigning iterations to processors 0-3. Interleaving the iterations (0 1 2 3 0 1 2 3 0 1 2 3) is a terrible allocation of work to processors; giving each processor a contiguous block (0 0 0 ... 1 1 1 ... 2 2 2 ... 3 3 3) is a better allocation... unless the sub-arrays map to the same cache lines!]
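In OpenMP terms (a sketch of mine, not from the deck), the block allocation in the figure is what schedule(static) with no chunk size gives, while schedule(static, 1) produces the interleaved one:

    #include <omp.h>

    #define N 1000000
    float a[N];

    void init(void)
    {
        /* Default static schedule: each thread gets one contiguous
           block of roughly N/P iterations, the "better" allocation. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; i++)
            a[i] = 0.0f;

        /* schedule(static, 1) interleaves iterations across threads,
           the "terrible" allocation, since neighboring elements
           (same cache line) are written by different threads. */
        #pragma omp parallel for schedule(static, 1)
        for (int i = 0; i < N; i++)
            a[i] = 0.0f;
    }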

12 Loop Transformations
Loop fission
Loop fusion
Loop inversion

13 Loop Fission
Begin with a single loop that has a loop-carried dependence. Split it into two or more loops; the new loops can be executed in parallel.

14 Before Loop Fission

    float *a, *b;
    int i;
    for (i = 1; i < N; i++) {
        if (b[i] > 0.0)
            a[i] = 2.0 * b[i];       /* this part is perfectly parallel */
        else
            a[i] = 2.0 * fabs(b[i]);
        b[i] = a[i-1];               /* loop-carried dependence */
    }

15 After Loop Fission

    #pragma omp parallel
    {
        #pragma omp for
        for (i = 1; i < N; i++) {
            if (b[i] > 0.0)
                a[i] = 2.0 * b[i];
            else
                a[i] = 2.0 * fabs(b[i]);
        }
        #pragma omp for
        for (i = 1; i < N; i++) {
            b[i] = a[i-1];
        }
    }

This works because of the implicit barrier synchronization at the end of the first for construct: every a[i-1] has been written before the second loop reads it.

16 Loop Fission and Locality
Another use of loop fission is to increase data locality. Before fission, nested loops reference too many data values, leading to a poor cache hit rate. Breaking them into multiple nested loops that each touch a smaller working set yields a higher cache hit rate.

17 Before Fission

    for (i = 0; i < list_len; i++)
        for (j = prime[i]; j < N; j += prime[i])
            marked[j] = 1;

[Figure: each prime's pass sweeps the entire marked array from end to end, so the start of the array has been evicted from cache before the next pass begins.]

18 After Fission

    for (k = 0; k < N; k += CHUNK_SIZE)
        for (i = 0; i < list_len; i++) {
            start = f(prime[i], k);
            end = g(prime[i], k);
            for (j = start; j < end; j += prime[i])
                marked[j] = 1;
        }

[Figure: the marked array is now processed one CHUNK_SIZE-sized piece at a time, etc.; each chunk stays in cache while every prime strides over it.]
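The deck leaves f and g abstract. One plausible implementation, purely my assumption: f returns the first multiple of prime[i] inside the chunk (mirroring the original loop, which starts at prime[i]), and g clips the chunk's end to N:

    #define N 1000000L
    #define CHUNK_SIZE 65536L

    /* Hypothetical helper: first multiple of p at or beyond k,
       but never below p itself. These exact definitions are an
       assumption, not from the slides. */
    long f(long p, long k)
    {
        long first = ((k + p - 1) / p) * p;  /* round k up to a multiple of p */
        return first < p ? p : first;
    }

    /* Hypothetical helper: one past the last index of the chunk,
       clipped to the array bound N. */
    long g(long p, long k)
    {
        return (k + CHUNK_SIZE < N) ? k + CHUNK_SIZE : N;
    }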

19 Loop Fusion
The opposite of loop fission: combine loops to increase grain size.

20 Before Loop Fusion

    float *a, *b, x, y;
    int i;
    ...
    for (i = 0; i < N; i++)
        a[i] = foo(i);
    x = a[N-1] - a[0];
    for (i = 0; i < N; i++)
        b[i] = bar(a[i]);
    y = x * b[0] / b[N-1];

Functions foo and bar are side-effect free.

21 After Loop Fusion

    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        a[i] = foo(i);
        b[i] = bar(a[i]);
    }
    x = a[N-1] - a[0];
    y = x * b[0] / b[N-1];

Now one barrier instead of two.

22 Loop Inversion
Nested for loops may have data dependences that prevent parallelization. Inverting the nesting of the loops may expose a parallelizable loop, increase grain size, and improve the parallel program's locality.

23 Before Loop Inversion

    for (j = 1; j < n; j++)
        #pragma omp parallel for
        for (i = 0; i < m; i++)
            a[i][j] = 2 * a[i][j-1];

The inner loop can execute in parallel, but the grain size is small, and the parallel loop is launched n-1 times.

24 After Loop Inversion

    #pragma omp parallel for
    for (i = 0; i < m; i++)
        for (j = 1; j < n; j++)
            a[i][j] = 2 * a[i][j-1];

Now the outer loop executes in parallel: the loop-carried dependence on j stays entirely within a single thread's iterations.

25 Reducing Parallel Overhead
Loop scheduling
Conditionally executing in parallel
Replicating work

26 Loop Scheduling
Loop schedule: how loop iterations are assigned to threads.
Static schedule: iterations are assigned to threads before execution of the loop.
Dynamic schedule: iterations are assigned to threads during execution of the loop.

27 Loop Scheduling in OpenMP

    Name         Chunk     Chunk Size   # Chunks   Static/Dynamic
    Static       No        N/P          P          Static
    Interleaved  Yes       C            N/C        Static
    Dynamic      Optional  C            N/C        Dynamic
    Guided       Optional  Decreasing   < N/C      Dynamic
    Runtime      No        Varies       Varies     Varies

(N = number of iterations, P = number of threads, C = user-specified chunk size.)
From Parallel Programming in OpenMP by Chandra et al.
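These rows map onto the schedule clause of OpenMP's for construct. A minimal sketch of each (the loop body is just a placeholder):

    #include <omp.h>

    void schedules(int n, double *a)
    {
        /* Static, no chunk: about n/P contiguous iterations per thread. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++) a[i] *= 2.0;

        /* Interleaved: chunks of 4, dealt out round-robin before
           the loop runs. */
        #pragma omp parallel for schedule(static, 4)
        for (int i = 0; i < n; i++) a[i] *= 2.0;

        /* Dynamic: chunks of 4, claimed by threads as they finish. */
        #pragma omp parallel for schedule(dynamic, 4)
        for (int i = 0; i < n; i++) a[i] *= 2.0;

        /* Guided: chunk size starts large and shrinks (never below 4). */
        #pragma omp parallel for schedule(guided, 4)
        for (int i = 0; i < n; i++) a[i] *= 2.0;

        /* Runtime: schedule taken from the OMP_SCHEDULE environment
           variable, e.g. OMP_SCHEDULE="dynamic,4". */
        #pragma omp parallel for schedule(runtime)
        for (int i = 0; i < n; i++) a[i] *= 2.0;
    }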

28 Loop Scheduling Example

    #pragma omp parallel for
    for (i = 0; i < 12; i++)
        for (j = 0; j <= i; j++)
            a[i][j] = ...;

The work in row i grows with i, so the choice of schedule matters; see the sketch below.
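A plain static block schedule would hand the longest rows to one thread. One remedy (my sketch, not verbatim from the deck) is to interleave the rows; a dynamic schedule would balance similarly:

    #include <omp.h>

    #define ROWS 12
    double a[ROWS][ROWS];

    /* Placeholder for the per-element computation. */
    static double work(int i, int j) { return (double)(i + j); }

    void triangular(void)
    {
        /* schedule(static, 1) gives rows 0,4,8,... to one thread,
           1,5,9,... to the next, and so on (with 4 threads), so
           each thread gets a mix of short and long rows. */
        #pragma omp parallel for schedule(static, 1)
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j <= i; j++)
                a[i][j] = work(i, j);
    }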

29 [Figure: four panels labeled A, B, C, and D, showing the iterations of the triangular loop distributed across threads under different schedules.]

30 Locality v. Load Balance
[Figure: a trade-off spectrum with Locality at one end and Load Balance at the other, with markers indicating where smaller and larger data sets fall.]

31 Conditionally Enable Parallelism
Suppose the sequential loop has execution time jn (j = time per iteration, n = number of iterations). Suppose barrier synchronization time is kp (k = per-thread cost, p = threads). We should make the loop parallel only if the parallel time beats the sequential time, i.e. jn/p + kp < jn. OpenMP's if clause lets us conditionally enable parallelism.
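Solving that inequality for n gives the break-even loop length (a reconstruction from the variables defined above):

    \[ \frac{jn}{p} + kp \;<\; jn
       \;\Longrightarrow\;
       jn\Bigl(1 - \frac{1}{p}\Bigr) \;>\; kp
       \;\Longrightarrow\;
       n \;>\; \frac{kp^{2}}{j\,(p-1)} \]

so the if clause should compare n against a threshold of roughly kp^2 / (j(p-1)), determined in practice by benchmarking, as the n > 1250 example on the next slide does.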

32 Example of the if Clause
Suppose benchmarking shows that a loop executes faster in parallel only when n > 1250:

    #pragma omp parallel for if (n > 1250)
    for (i = 0; i < n; i++) {
        ...
    }

33 Replicate Work
Every thread interaction has a cost (example: barrier synchronization). Sometimes it's faster for threads to replicate work than to go through a barrier synchronization.

34 Before Work Replication

    for (i = 0; i < N; i++)
        a[i] = foo(i);
    x = a[0] / a[N-1];
    for (i = 0; i < N; i++)
        b[i] = x * a[i];

Both for loops are amenable to parallelization. Synchronization among threads is required if x is shared and one thread performs the assignment.

35 After Work Replication

    #pragma omp parallel private (x)
    {
        x = foo(0) / foo(N-1);
        #pragma omp for
        for (i = 0; i < N; i++) {
            a[i] = foo(i);
            b[i] = x * a[i];
        }
    }

Every thread computes x for itself, replicating two calls to foo instead of waiting at a barrier for one thread's assignment to become visible.

36 References
Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon, Parallel Programming in OpenMP, Morgan Kaufmann (2001).
Peter Denning, "The Locality Principle," Naval Postgraduate School (2005).
Michael J. Quinn, Parallel Programming in C with MPI and OpenMP, McGraw-Hill (2004).


