INTEL CONFIDENTIAL Improving Parallel Performance Introduction to Parallel Programming – Part 11.

Review & Objectives

Previously:
- Define speedup and efficiency
- Use Amdahl's Law to predict maximum speedup

At the end of this part you should be able to:
- Explain why it can be difficult both to optimize load balancing and to maximize locality
- Use loop fusion, loop fission, and loop inversion to create or improve opportunities for parallel execution

General Rules of Thumb
- Start with the best sequential algorithm
- Maximize locality

Start with the Best Sequential Algorithm

Don't confuse "speedup" with "speed". Speedup is the ratio of a program's execution time on 1 core to its execution time on p cores. What if you start with an inferior sequential algorithm? Naive, higher-complexity algorithms are easier to make parallel, and they can post impressive speedups relative to their own one-core versions, but they usually don't lead to the fastest parallel program.

Maximize Locality

- Temporal locality: if a processor accesses a memory location, there is a good chance it will revisit that location soon
- Spatial locality: if a processor accesses a memory location, there is a good chance it will visit a nearby location soon
- Programs tend to exhibit locality because they tend to have loops indexing through arrays
- The principle of locality is what makes cache memory worthwhile
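
For a minimal illustration of both kinds of locality at once, consider a plain reduction loop (a generic example, not from the deck):

    float a[N];       /* N contiguous elements */
    float sum = 0.0f;
    int i;
    for (i = 0; i < N; i++)
        sum += a[i];  /* sum is reused every iteration (temporal locality);
                         a[i], a[i+1], ... share cache lines (spatial locality) */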

Parallel Processing and Locality

- Multiple cores → multiple caches
- When a core writes a value, the system must ensure that no core references an obsolete value (the cache coherence problem)
- A write by one core can invalidate another core's copy of a cache line, causing a cache miss
- Rule of thumb: it is better to have different cores manipulating totally different chunks of arrays
- We say a parallel program has good locality if cores' memory writes tend not to interfere with the work being done by other cores

Example: Array Initialization

    for (i = 0; i < N; i++) a[i] = 0;

The slide's figures (not reproduced here) contrast two ways of dividing the iterations among processors: handing elements out one at a time in round-robin fashion is a terrible allocation of work, while giving each processor one contiguous sub-array is a better allocation... unless the sub-arrays map to the same cache lines!
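
A minimal OpenMP sketch of the two allocations; the schedule clauses are my reading of the slide's figures, not code from the deck:

    #include <omp.h>
    #define N 1000000
    static float a[N];

    int main(void) {
        int i;

        /* Round-robin (cyclic) allocation: thread t gets iterations
           t, t+T, t+2T, ...  Neighboring elements share cache lines,
           so different cores write the same lines (false sharing). */
        #pragma omp parallel for schedule(static, 1)
        for (i = 0; i < N; i++) a[i] = 0;

        /* Block allocation: each thread zeroes one contiguous
           sub-array; writes collide only near chunk boundaries. */
        #pragma omp parallel for schedule(static)
        for (i = 0; i < N; i++) a[i] = 0;

        return 0;
    }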

Loop Transformations
- Loop fission
- Loop fusion
- Loop coalescing
- Loop inversion

Loop Fission
- Begin with a single loop having a loop-carried dependence
- Split the loop into two or more loops
- The new loops can be executed in parallel

Before Loop Fission

    #include <math.h>   /* for fabs */

    float *a, *b;
    int i;
    for (i = 1; i < N; i++) {
        if (b[i] > 0.0)
            a[i] = 2.0 * b[i];        /* this if/else: perfectly parallel */
        else
            a[i] = 2.0 * fabs(b[i]);
        b[i] = a[i-1];                /* loop-carried dependence */
    }

After Loop Fission

    #pragma omp parallel
    {
        #pragma omp for
        for (i = 1; i < N; i++) {
            if (b[i] > 0.0)
                a[i] = 2.0 * b[i];
            else
                a[i] = 2.0 * fabs(b[i]);
        }
        #pragma omp for
        for (i = 1; i < N; i++)
            b[i] = a[i-1];
    }

Loop Fission and Locality

- Another use of loop fission is to increase data locality
- Before fission, nested loops reference too many data values, leading to a poor cache hit rate
- Break the nested loops into multiple nested loops
- The new nested loops have a higher cache hit rate

Before Fission

    /* Sieve-style marking: each pass over i sweeps the entire
       marked[] array, which is too large to stay in cache */
    for (i = 0; i < list_len; i++)
        for (j = prime[i]; j < N; j += prime[i])
            marked[j] = 1;

After Fission

    /* Process marked[] one cache-sized chunk at a time: every prime
       sweeps chunk k before the code moves on to the next chunk */
    for (k = 0; k < N; k += CHUNK_SIZE)
        for (i = 0; i < list_len; i++) {
            start = f(prime[i], k);
            end   = g(prime[i], k);
            for (j = start; j < end; j += prime[i])
                marked[j] = 1;
        }
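
The slide leaves f and g abstract. A minimal sketch of what they might compute; these helpers are an assumption, not code from the deck:

    /* Hypothetical helpers: f = first multiple of p at or after the
       chunk start k (but no earlier than p, where the original loop
       began); g = end of the current chunk, clipped to N. */
    int f(int p, int k) {
        int m = ((k + p - 1) / p) * p;   /* round k up to a multiple of p */
        return (m < p) ? p : m;
    }
    int g(int p, int k) {
        return (k + CHUNK_SIZE < N) ? (k + CHUNK_SIZE) : N;
    }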

Loop Fusion

- The opposite of loop fission
- Combine loops to increase grain size

Before Loop Fusion

    float *a, *b, x, y;
    int i;
    ...
    for (i = 0; i < N; i++) a[i] = foo(i);
    x = a[N-1] - a[0];
    for (i = 0; i < N; i++) b[i] = bar(a[i]);
    y = x * b[0] / b[N-1];

Functions foo and bar are side-effect free.

After Loop Fusion

    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        a[i] = foo(i);
        b[i] = bar(a[i]);
    }
    x = a[N-1] - a[0];
    y = x * b[0] / b[N-1];

Now one barrier instead of two. Note that the assignment to x had to move below the fused loop, since a[N-1] is not final until the loop completes; the move is legal because x is not used until y is computed.

Loop Coalescing Example

    #define N 23
    #define M ...   /* value lost in transcription */

    for (k = 0; k < N; k++)
        for (j = 0; j < M; j++)
            w_new[k][j] = DoSomeWork(w[k][j], k, j);

A prime number of outer-loop iterations (N = 23) will never be perfectly load balanced across a typical number of threads. Parallelize the inner loop instead? Are there enough iterations to overcome the overhead?

Loop Coalescing Example

    for (kj = 0; kj < N*M; kj++) {
        k = kj / M;
        j = kj % M;
        w_new[k][j] = DoSomeWork(w[k][j], k, j);
    }

The larger number of iterations (N*M) gives a better opportunity for load balance and for hiding overhead; the DIV and MOD used to recover k and j are the cost.
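
The slide stops short of showing the parallel directive. A sketch of how the coalesced loop might be parallelized; the pragma and its private clause are my addition, not from the deck:

    #pragma omp parallel for private(k, j)
    for (kj = 0; kj < N*M; kj++) {
        k = kj / M;   /* recover the row index (the DIV overhead) */
        j = kj % M;   /* recover the column index (the MOD overhead) */
        w_new[k][j] = DoSomeWork(w[k][j], k, j);
    }

Later OpenMP versions (3.0 onward) can perform this coalescing automatically with a collapse(2) clause on a perfectly nested loop pair.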

Loop Inversion

- Nested for loops may have data dependences that prevent parallelization
- Inverting the nesting of the for loops may:
  - Expose a parallelizable loop
  - Increase grain size
  - Improve the parallel program's locality

Loop Inversion Example

    for (j = 1; j < n; j++)
        for (i = 0; i < m; i++)
            a[i][j] = 2 * a[i][j-1];

Each column of a depends on the column to its left, so the outer j loop cannot be parallelized as written.

Before Loop Inversion

    for (j = 1; j < n; j++)
        #pragma omp parallel for
        for (i = 0; i < m; i++)
            a[i][j] = 2 * a[i][j-1];

Can execute the inner loop in parallel, but the grain size is small: a parallel region is entered once per iteration of the outer j loop.

After Loop Inversion

    #pragma omp parallel for
    for (i = 0; i < m; i++)
        for (j = 1; j < n; j++)
            a[i][j] = 2 * a[i][j-1];

Can execute the outer loop in parallel: the dependence runs along each row (a[i][j] reads only a[i][j-1]), so different values of i are independent. Each thread now handles whole rows, giving a larger grain size and a single parallel region.

References

- Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon, Parallel Programming in OpenMP, Morgan Kaufmann (2001).
- Peter Denning, "The Locality Principle," Naval Postgraduate School (2005).
- Michael J. Quinn, Parallel Programming in C with MPI and OpenMP, McGraw-Hill (2004).