INTEL CONFIDENTIAL
Reducing Parallel Overhead
Introduction to Parallel Programming – Part 12

Review & Objectives

Previously:
- Use loop fusion, loop fission, and loop inversion to create or improve opportunities for parallel execution
- Explain why it can be difficult both to optimize load balancing and to maximize locality

At the end of this part you should be able to:
- Explain the pros and cons of static versus dynamic loop scheduling
- Explain the different OpenMP schedule clauses and the situations each one is best suited for

Reducing Parallel Overhead
- Loop scheduling
- Replicating work

Loop Scheduling Example

for (i = 0; i < 12; i++)
    for (j = 0; j <= i; j++)
        a[i][j] = ...;

Loop Scheduling Example

#pragma omp parallel for
for (i = 0; i < 12; i++)
    for (j = 0; j <= i; j++)
        a[i][j] = ...;

How are the iterations divided among threads?

Loop Scheduling Example

#pragma omp parallel for
for (i = 0; i < 12; i++)
    for (j = 0; j <= i; j++)
        a[i][j] = ...;

Typically, the iterations are divided evenly among the threads, with each thread assigned one contiguous chunk.

Loop Scheduling

Loop schedule: how loop iterations are assigned to threads
- Static schedule: iterations assigned to threads before execution of the loop
- Dynamic schedule: iterations assigned to threads during execution of the loop

The OpenMP schedule clause affects how loop iterations are mapped onto threads.
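The schedule kind can also be deferred until run time. The following is a minimal sketch, not from the slides, assuming a C compiler with OpenMP support (e.g. gcc -fopenmp): with schedule(runtime), the OMP_SCHEDULE environment variable (for example OMP_SCHEDULE="static,2" or OMP_SCHEDULE="dynamic,2") selects the schedule, so the alternatives on the following slides can be compared without recompiling.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* The schedule is taken from the OMP_SCHEDULE environment variable. */
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < 12; i++)
        printf("iteration %2d run by thread %d\n", i, omp_get_thread_num());
    return 0;
}

Running the program twice, once with OMP_SCHEDULE="static,2" and once with OMP_SCHEDULE="dynamic,2", makes the different iteration-to-thread mappings directly observable.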

The schedule clause: schedule(static [, chunk])
- Blocks of iterations of size "chunk" are assigned to threads in round-robin order
- Low overhead, but may cause load imbalance
- Best used for predictable, similar work per iteration

Loop Scheduling Example

#pragma omp parallel for schedule(static, 2)
for (i = 0; i < 12; i++)
    for (j = 0; j <= i; j++)
        a[i][j] = ...;
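As a hedged illustration of the round-robin distribution (the loop body and the thread count below are assumptions, since the slide elides them), this runnable version records which thread computes each row: with 3 threads and schedule(static, 2), thread 0 gets rows 0-1 and 6-7, thread 1 gets rows 2-3 and 8-9, and thread 2 gets rows 4-5 and 10-11.

#include <stdio.h>
#include <omp.h>

#define N 12

int main(void)
{
    double a[N][N] = {{0}};

    /* Chunks of 2 consecutive rows are dealt out round-robin to 3 threads. */
    #pragma omp parallel for schedule(static, 2) num_threads(3)
    for (int i = 0; i < N; i++) {
        for (int j = 0; j <= i; j++)
            a[i][j] = (double)i + j;   /* stands in for the slide's a[i][j] = ...; */
        printf("row %2d computed by thread %d\n", i, omp_get_thread_num());
    }
    printf("a[%d][%d] = %.1f\n", N - 1, N - 1, a[N - 1][N - 1]);
    return 0;
}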

The schedule clause: schedule(dynamic [, chunk])
- Each thread grabs "chunk" iterations; when it is done with them, it requests the next set
- Higher threading overhead, but can reduce load imbalance
- Best used for unpredictable or highly variable work per iteration

Loop Scheduling Example

#pragma omp parallel for schedule(dynamic, 2)
for (i = 0; i < 12; i++)
    for (j = 0; j <= i; j++)
        a[i][j] = ...;
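To see why dynamic scheduling pays off when work per iteration varies, here is an illustrative timing sketch (the spin() helper and the iteration counts are assumptions, not from the slides). Iteration i does roughly i times the work of iteration 0, so under the default static schedule the thread holding the last block is overloaded; schedule(dynamic, 2) lets idle threads grab the next chunk instead, and on most machines the second loop finishes measurably faster.

#include <stdio.h>
#include <omp.h>

/* Simulate work proportional to 'units'; the result is accumulated so the
   compiler cannot optimize the loop away. */
static double spin(int units)
{
    double s = 0.0;
    for (long k = 0; k < (long)units * 100000L; k++)
        s += k * 1e-9;
    return s;
}

int main(void)
{
    double sink = 0.0;

    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static) reduction(+:sink)
    for (int i = 0; i < 64; i++)
        sink += spin(i);
    printf("static : %.3f s\n", omp_get_wtime() - t0);

    t0 = omp_get_wtime();
    #pragma omp parallel for schedule(dynamic, 2) reduction(+:sink)
    for (int i = 0; i < 64; i++)
        sink += spin(i);
    printf("dynamic: %.3f s (sink = %g)\n", omp_get_wtime() - t0, sink);
    return 0;
}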

The schedule clause: schedule(guided [, chunk])
- Dynamic schedule that starts with large blocks
- Block sizes shrink as the loop progresses, but never below "chunk"
- Best used as a special case of dynamic, to reduce scheduling overhead when the computation gets progressively more time consuming

Loop Scheduling Example

#pragma omp parallel for schedule(guided)
for (i = 0; i < 12; i++)
    for (j = 0; j <= i; j++)
        a[i][j] = ...;
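A small sketch (illustrative only; the loop bound and array are assumptions) that makes guided scheduling visible: each iteration records which thread ran it, and printing the runs of consecutive iterations owned by the same thread shows chunks that start large and shrink toward the end. Note that a thread may receive several consecutive chunks, so adjacent runs can appear merged.

#include <stdio.h>
#include <omp.h>

#define N 48

int main(void)
{
    int owner[N];

    #pragma omp parallel for schedule(guided)
    for (int i = 0; i < N; i++)
        owner[i] = omp_get_thread_num();

    /* Print runs of consecutive iterations executed by the same thread. */
    int start = 0;
    for (int i = 1; i <= N; i++) {
        if (i == N || owner[i] != owner[i - 1]) {
            printf("iterations %2d-%2d -> thread %d\n", start, i - 1, owner[start]);
            start = i;
        }
    }
    return 0;
}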

Replicate Work

Every thread interaction has a cost. Example: barrier synchronization.
Sometimes it is faster for threads to replicate work than to go through a barrier synchronization.
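The cost of a barrier can be measured directly. Below is a hedged micro-benchmark sketch (the repetition count is an assumption, and the measured time varies with machine and thread count) that times many explicit barriers and reports the average cost of one.

#include <stdio.h>
#include <omp.h>

#define REPS 100000

int main(void)
{
    double start = 0.0, elapsed = 0.0;
    int nthreads = 1;

    #pragma omp parallel
    {
        #pragma omp single
        {
            nthreads = omp_get_num_threads();
            start = omp_get_wtime();
        }   /* implicit barrier: all threads start timing together */

        /* Every thread passes through REPS explicit barriers. */
        for (int r = 0; r < REPS; r++) {
            #pragma omp barrier
        }

        #pragma omp single
        elapsed = omp_get_wtime() - start;
    }
    printf("%d threads, %d barriers: about %.0f ns per barrier\n",
           nthreads, REPS, elapsed / REPS * 1e9);
    return 0;
}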

Before Work Replication

for (i = 0; i < N; i++)
    a[i] = foo(i);
x = a[0] / a[N-1];
for (i = 0; i < N; i++)
    b[i] = x * a[i];

Both for loops are amenable to parallelization.

First OpenMP Attempt

#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < N; i++)
        a[i] = foo(i);       /* implicit barrier at end of the for construct */
    #pragma omp single
    x = a[0] / a[N-1];       /* implicit barrier at end of the single construct */
    #pragma omp for
    for (i = 0; i < N; i++)
        b[i] = x * a[i];
}

Synchronization among the threads is required if x is shared and one thread performs the assignment.

After Work Replication

#pragma omp parallel private(x)
{
    x = foo(0) / foo(N-1);
    #pragma omp for
    for (i = 0; i < N; i++) {
        a[i] = foo(i);
        b[i] = x * a[i];
    }
}
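For comparison, here is a runnable sketch that combines the two versions above in one program; foo() is an assumed stand-in, since the slides leave it undefined. The replicated version computes x redundantly in every thread, but it fuses the two loops and eliminates the two intermediate implicit barriers.

#include <stdio.h>
#include <omp.h>

#define N 1000000
double a[N], b[N];

static double foo(int i) { return i + 1.0; }   /* assumed definition */

int main(void)
{
    double x;

    /* Version 1: shared x, set by one thread; relies on the implicit barriers. */
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < N; i++)
            a[i] = foo(i);
        #pragma omp single
        x = a[0] / a[N - 1];
        #pragma omp for
        for (int i = 0; i < N; i++)
            b[i] = x * a[i];
    }

    /* Version 2: every thread computes its own private x; the two loops fuse
       and the barriers between them disappear. */
    #pragma omp parallel private(x)
    {
        x = foo(0) / foo(N - 1);
        #pragma omp for
        for (int i = 0; i < N; i++) {
            a[i] = foo(i);
            b[i] = x * a[i];
        }
    }
    printf("b[0] = %f, b[%d] = %f\n", b[0], N - 1, b[N - 1]);
    return 0;
}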

References

Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon, Parallel Programming in OpenMP, Morgan Kaufmann (2001).
Peter Denning, "The Locality Principle," Naval Postgraduate School (2005).
Michael J. Quinn, Parallel Programming in C with MPI and OpenMP, McGraw-Hill (2004).