Parallel Programming with OpenMP

Parallel Programming with OpenMP (CS240A, T. Yang)

A Programmer's View of OpenMP
What is OpenMP?
- An open specification for Multi-Processing and a "standard" API for defining multi-threaded shared-memory programs (openmp.org has talks, examples, forums, etc.)
- A portable, threaded, shared-memory programming specification with "light" syntax; the exact behavior depends on the OpenMP implementation
- Requires compiler support (C or Fortran)
OpenMP will:
- Allow a programmer to separate a program into serial regions and parallel regions, rather than writing it as T concurrently-executing threads
- Hide stack management
- Provide synchronization constructs
OpenMP will not:
- Parallelize automatically
- Guarantee speedup
- Provide freedom from data races

Motivation – OpenMP

int main() {
    // Do this part in parallel
    printf( "Hello, World!\n" );
    return 0;
}

Motivation – OpenMP
All OpenMP directives begin with #pragma.

#include <stdio.h>
#include <omp.h>

int main() {
    omp_set_num_threads(4);
    // Do this part in parallel
    #pragma omp parallel
    {
        printf( "Hello, World!\n" );
    }
    return 0;
}

With four threads, every thread executes the printf, so "Hello, World!" is printed four times.
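As a practical aside (not on the original slides), a program like this is typically built by handing the compiler an OpenMP flag; GCC and Clang use -fopenmp, other compilers differ. The file name below is illustrative:

gcc -fopenmp hello.c -o hello      # enable OpenMP directives and runtime
OMP_NUM_THREADS=4 ./hello          # environment variable sets the default team size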

OpenMP parallel region construct
- A block of code to be executed by multiple threads in parallel
- Each thread executes the same code redundantly (SPMD); work within work-sharing constructs is distributed among the threads in a team

C/C++ syntax:
#pragma omp parallel [ clause [ clause ] ... ] new-line
structured-block

clause can include the following:
private (list)
shared (list)

By default, variables in an OpenMP parallel region are shared. To make a variable private, declare it in the pragma:
#pragma omp parallel private (x)
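To make the scoping rules concrete, here is a minimal sketch (not from the slides; the variable names are illustrative) contrasting the shared default with an explicitly private variable. Each thread gets its own uninitialized copy of a private variable, and the original value is untouched after the region:

#include <stdio.h>
#include <omp.h>

int main() {
    int x = 42;                        /* shared by default */
    #pragma omp parallel private(x)    /* each thread gets its own, uninitialized x */
    {
        x = omp_get_thread_num();      /* must assign the private copy before use */
        printf("thread %d: private x = %d\n", omp_get_thread_num(), x);
    }
    printf("after the region, the original x is still %d\n", x);  /* prints 42 */
    return 0;
}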

OpenMP Programming Model - Review
Fork - Join Model:
- OpenMP programs begin as a single process (the master thread) and execute sequentially until the first parallel region construct is encountered
- FORK: the master thread then creates a team of parallel threads
- Statements in the program that are enclosed by the parallel region construct are executed in parallel among the various threads
- JOIN: when the team threads complete the statements in the parallel region construct, they synchronize and terminate, leaving only the master thread

parallel Pragma and Scope – More Examples

#pragma omp parallel num_threads(2)
{
    x = 1;
    y = 1 + x;
}

Both threads execute x=1; y=1+x;. x and y are shared variables, so there is a risk of a data race.

parallel Pragma and Scope - Review

#pragma omp parallel
{
    x = 1;
    y = 1 + x;
}

Assume the number of threads is 2. Thread 0 and thread 1 each execute x=1; y=1+x;. x and y are shared variables, so there is a risk of a data race.

parallel Pragma and Scope - Review

#pragma omp parallel num_threads(2)
{
    x = 1;
    y = 1 + x;
}

Each of the two threads executes x=1; y=x+1;. x and y are shared variables, so there is a risk of a data race.
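One way to remove this race (a sketch, assuming each thread only needs its own copies of the values) is to scope both variables as private, so every thread writes to its own storage:

#include <omp.h>

int x, y;                 /* the shared variables from the slides */

void no_race(void) {
    #pragma omp parallel num_threads(2) private(x, y)
    {
        x = 1;            /* each thread updates its own private copy */
        y = 1 + x;        /* no conflicting writes to the shared x and y */
    }
}

Declaring x and y inside the parallel block has the same effect, since variables declared inside the construct are private to each thread by default.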

Divide for-loop for parallel sections

for (int i=0; i<8; i++) x[i]=0;

// Run on 4 threads
#pragma omp parallel
{
    int numt = omp_get_num_threads();
    int id = omp_get_thread_num();    // id = 0, 1, 2, or 3
    for (int i=id; i<8; i += numt)
        x[i] = 0;
}

With 4 threads: thread 0 sets x[0] and x[4], thread 1 sets x[1] and x[5], thread 2 sets x[2] and x[6], and thread 3 sets x[3] and x[7].

Use pragma parallel for

for (int i=0; i<8; i++) x[i]=0;

#pragma omp parallel for
for (int i=0; i<8; i++)
    x[i] = 0;

The system divides the loop iterations among the threads; with 4 threads the assignment is the same as before (thread 0 handles x[0] and x[4], thread 1 handles x[1] and x[5], and so on). Note that parallel for must be followed directly by the loop, not by a brace-enclosed block.

OpenMP Data Parallel Construct: Parallel Loop
- The compiler calculates loop bounds for each thread directly from the serial source (computation decomposition)
- The compiler also manages data partitioning
- Synchronization is also automatic (barrier)

Programming Model – Parallel Loops
Requirement for parallel loops: no data dependencies (read/write or write/write pairs) between iterations!
The preprocessor calculates the loop bounds and divides the iterations among the parallel threads.

#pragma omp parallel for
for( i=0; i < 25; i++ ) {
    printf("Foo");
}
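For contrast, a sketch (not from the slides; a is a hypothetical array) of a loop that violates this requirement: iteration i reads a value written by iteration i-1, so putting it under #pragma omp parallel for would give wrong results.

/* loop-carried dependence: NOT safe to parallelize as written */
for( i=1; i < 25; i++ ) {
    a[i] = a[i-1] + 1;    /* reads the previous iteration's write */
}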

In general, don't jump outside of any pragma block.

Example:
for (i=0; i<max; i++) zero[i] = 0;

OpenMP breaks the for loop into chunks and allocates each chunk to a separate thread; e.g. if max = 100 with 2 threads, iterations 0-49 are assigned to thread 0 and iterations 50-99 to thread 1.
The loop must have a relatively simple "shape" for an OpenMP-aware compiler to be able to parallelize it; this is necessary for the run-time system to determine how many of the loop iterations to assign to each thread.
- No premature exits from the loop are allowed, i.e. no break, return, exit, or goto statements.
- "Shape" restrictions: the loop cannot be a do-while loop or a loop without loop control; the loop iteration variable must be an integer; and the loop control parameters must be the same for all threads.
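A sketch of the kind of loop these restrictions rule out: the early exit means the iteration count cannot be determined up front, so an OpenMP compiler will reject the combination of this loop with parallel for.

/* Not a legal target for "#pragma omp parallel for": premature exit */
for (i=0; i<max; i++) {
    if (zero[i] != 0) break;   /* break, return, exit, goto are not allowed */
    zero[i] = 0;
}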

Parallel Statement Shorthand

#pragma omp parallel
{
    #pragma omp for
    for(i=0; i<max; i++) { … }
}

can be shortened to:

#pragma omp parallel for
for(i=0; i<max; i++) { … }

This also works for sections. The shorthand applies when the work-sharing construct is the only directive in the parallel section.

Example: Calculating π

Sequential Calculation of π in C

#include <stdio.h>

/* Serial Code */
static long num_steps = 100000;
double step;

void main () {
    int i;
    double x, pi, sum = 0.0;
    step = 1.0/(double)num_steps;
    for (i = 1; i <= num_steps; i++) {
        x = (i - 0.5) * step;
        sum = sum + 4.0 / (1.0 + x*x);
    }
    pi = sum / num_steps;
    printf ("pi = %6.12f\n", pi);
}

Parallel OpenMP Version (1)

#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 4
static long num_steps = 100000;
double step;

void main () {
    int i;
    double x, pi, sum[NUM_THREADS];
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel private ( i, x )
    {
        int id = omp_get_thread_num();
        for (i=id, sum[id]=0.0; i < num_steps; i = i + NUM_THREADS) {
            x = (i+0.5)*step;
            sum[id] += 4.0/(1.0+x*x);
        }
    }
    for(i=1; i<NUM_THREADS; i++)
        sum[0] += sum[i];
    pi = sum[0] / num_steps;
    printf ("pi = %6.12f\n", pi);
}

Each thread handles every NUM_THREADS-th rectangle; with 2 threads, a thread handles every other rectangle.

OpenMP Reduction

double avg, sum=0.0, A[MAX];
int i;
#pragma omp parallel for private ( sum )
for (i = 0; i < MAX; i++)
    sum += A[i];
avg = sum/MAX; // bug

The problem is that we really want the sum over all threads; with private(sum) each thread accumulates into its own copy, and those copies are thrown away at the end of the loop.
Reduction: specifies that one or more variables that are private to each thread are the subject of a reduction operation at the end of the parallel region:
reduction(operation : var)
- operation: the operator to perform on the variables (var) at the end of the parallel region
- var: one or more variables on which to perform the scalar reduction

The corrected version:
#pragma omp parallel for reduction(+ : sum)
for (i = 0; i < MAX; i++)
    sum += A[i];
avg = sum/MAX;

Each thread accumulates its share of the elements (Sum+=A[0], Sum+=A[1], …) into a private copy, and the private copies are combined into sum at the end.

OpenMP: Parallel Loops with Reductions
OpenMP supports the reduce operation:

sum = 0;
#pragma omp parallel for reduction(+:sum)
for (i=0; i < 100; i++) {
    sum += array[i];
}

Reduction operators and their initial values (C and C++):
  +           0
  -           0
  *           1
  bitwise &   ~0
  bitwise |   0
  bitwise ^   0
  logical &&  1
  logical ||  0
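As a further sketch (not from the slides; array is the same hypothetical array as in the example above), several reduction clauses can be combined on one loop, and each thread's private copy starts from the initial value listed above:

double sum = 0.0, prod = 1.0;
int i;
#pragma omp parallel for reduction(+:sum) reduction(*:prod)
for (i=0; i < 100; i++) {
    sum  += array[i];          /* private copies start at 0, combined with + */
    prod *= 1.0 + array[i];    /* private copies start at 1, combined with * */
}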

Calculating π Version (1) - review

#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 4
static long num_steps = 100000;
double step;

void main () {
    int i;
    double x, pi, sum[NUM_THREADS];
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel private ( i, x )
    {
        int id = omp_get_thread_num();
        for (i=id, sum[id]=0.0; i < num_steps; i = i + NUM_THREADS) {
            x = (i+0.5)*step;
            sum[id] += 4.0/(1.0+x*x);
        }
    }
    for(i=1; i<NUM_THREADS; i++)
        sum[0] += sum[i];
    pi = sum[0] / num_steps;
    printf ("pi = %6.12f\n", pi);
}

Each thread handles every NUM_THREADS-th rectangle; with 2 threads, a thread handles every other rectangle.

Version 2: parallel for, reduction

#include <omp.h>
#include <stdio.h>
static long num_steps = 100000;
double step;

void main () {
    int i;
    double x, pi, sum = 0.0;
    step = 1.0/(double) num_steps;
    #pragma omp parallel for private(x) reduction(+:sum)
    for (i=1; i <= num_steps; i++){
        x = (i-0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    pi = sum / num_steps;
    printf ("pi = %6.8f\n", pi);
}

Loop Scheduling in the parallel for pragma

#pragma omp parallel for
for (i=0; i<max; i++) zero[i] = 0;

- The master thread creates additional threads, each with a separate execution context
- All variables declared outside the for loop are shared by default, except for the loop index, which is private per thread (why?)
- There is an implicit "barrier" synchronization at the end of the for loop
- Index ranges are divided sequentially per thread: thread 0 gets 0, 1, …, (max/n)-1; thread 1 gets max/n, max/n+1, …, 2*(max/n)-1; and so on (why?)

Impact of Scheduling Decision
- Load balance: is there the same amount of work in each iteration? Are the processors working at the same speed?
- Scheduling overhead: static decisions are cheap because they require no run-time coordination; dynamic decisions have overhead that depends on the complexity and frequency of the decisions
- Data locality: matters particularly within cache lines for small chunk sizes; scheduling also impacts data reuse on the same processor

OpenMP environment variables

OMP_NUM_THREADS
- Sets the number of threads to use during execution
- When dynamic adjustment of the number of threads is enabled, the value of this environment variable is the maximum number of threads to use
- For example:
  setenv OMP_NUM_THREADS 16        [csh, tcsh]
  export OMP_NUM_THREADS=16        [sh, ksh, bash]

OMP_SCHEDULE
- Applies only to do/for and parallel do/for directives that have the schedule type RUNTIME
- Sets the schedule type and chunk size for all such loops
- For example:
  setenv OMP_SCHEDULE "GUIDED,4"   [csh, tcsh]
  export OMP_SCHEDULE="GUIDED,4"   [sh, ksh, bash]
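OMP_SCHEDULE only affects loops that opt in with schedule(runtime); a minimal sketch, reusing the zero-ing loop from the earlier slides:

/* schedule kind and chunk size are taken from OMP_SCHEDULE, e.g. "GUIDED,4" */
#pragma omp parallel for schedule(runtime)
for (i=0; i<max; i++)
    zero[i] = 0;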

Programming Model – Loop Scheduling
The schedule clause determines how loop iterations are divided among the thread team:
- static([chunk]) divides iterations statically between threads; each thread receives [chunk] iterations, rounding as necessary to account for all iterations. The default [chunk] is ceil( # iterations / # threads ).
- dynamic([chunk]) allocates [chunk] iterations per thread, allocating an additional [chunk] iterations when a thread finishes; this forms a logical work queue consisting of all loop iterations. The default [chunk] is 1.
- guided([chunk]) allocates dynamically, but [chunk] is exponentially reduced with each allocation.
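A sketch of the three kinds applied to the zero-ing loop from earlier slides (the chunk sizes are arbitrary choices for illustration):

#pragma omp parallel for schedule(static)        /* contiguous blocks decided up front  */
for (i=0; i<max; i++) zero[i] = 0;

#pragma omp parallel for schedule(dynamic, 4)    /* threads grab 4 iterations at a time */
for (i=0; i<max; i++) zero[i] = 0;

#pragma omp parallel for schedule(guided, 4)     /* chunk size shrinks as work runs out */
for (i=0; i<max; i++) zero[i] = 0;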

Loop scheduling options (figure)

Programming Model – Data Sharing
Parallel programs often employ two types of data:
- Shared data, visible to all threads, similarly named
- Private data, visible to a single thread (often stack-allocated)

PThreads: global-scoped variables are shared; stack-allocated variables are private.

// shared, globals
int bigdata[1024];

void* foo(void* bar) {
    // private, stack
    int tid;
    /* Calculation goes here */
}

OpenMP: shared variables are shared; private variables are private.

int bigdata[1024];

void* foo(void* bar) {
    int tid;
    #pragma omp parallel \
        shared ( bigdata ) \
        private ( tid )
    {
        /* Calc. here */
    }
}

Programming Model - Synchronization
OpenMP synchronization constructs:

- Critical sections: named or unnamed; no explicit locks / mutexes
  #pragma omp critical
  { /* Critical code here */ }

- Barrier directives
  #pragma omp barrier

- Explicit lock functions: when all else fails; may require the flush directive
  omp_set_lock( &l );   /* l is an omp_lock_t */
  /* Code goes here */
  omp_unset_lock( &l );

- Single-thread regions within parallel regions: the master and single directives
  #pragma omp single
  { /* Only executed once */ }

master and single both let only one thread execute the indicated code: with single, any one thread can do it; with master, only the master thread (the one that spawns the other threads) does it.
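A sketch of the explicit lock routines; unlike the shorthand above, the full API also initializes and destroys the lock, and every call takes the lock's address:

#include <omp.h>

omp_lock_t l;

void locked_section(void) {
    omp_set_lock(&l);        /* blocks until the lock is acquired */
    /* Code goes here */
    omp_unset_lock(&l);      /* release so another thread may enter */
}

/* before the parallel region:  omp_init_lock(&l);
   after the parallel region:   omp_destroy_lock(&l); */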

omp critical vs. omp atomic

int sum = 0;
#pragma omp parallel for
for(int j=1; j < n; j++){
    int x = j*j;
    #pragma omp critical
    {
        sum = sum + x;  // One thread enters the critical section at a time.
    }
}

You may also use:
#pragma omp atomic
x += exper
atomic is faster, but supports only a limited set of arithmetic operations such as ++, --, +=, -=, *=, /=, &=, |=.
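For comparison, a sketch of the same sum written with atomic, which applies here because the protected statement is a simple +=:

int sum = 0;
#pragma omp parallel for
for(int j=1; j < n; j++){
    int x = j*j;
    #pragma omp atomic
    sum += x;           /* lightweight atomic update instead of a critical section */
}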

OpenMP Timing
Elapsed wall clock time:
double omp_get_wtime(void);
- Returns the elapsed wall clock time in seconds
- Time is measured per thread; no guarantee can be made that two distinct threads measure the same time
- Time is measured from "some time in the past," so subtract the results of two calls to omp_get_wtime to get elapsed time

Parallel Matrix Multiply: run tasks T_i in parallel on multiple threads

Matrix Multiply in OpenMP

// C[M][N] = A[M][P] × B[P][N]
start_time = omp_get_wtime();
#pragma omp parallel for private(tmp, j, k)
for (i=0; i<M; i++){
    for (j=0; j<N; j++){
        tmp = 0.0;
        for( k=0; k<P; k++){
            /* C(i,j) = sum(over k) A(i,k) * B(k,j) */
            tmp += A[i][k] * B[k][j];
        }
        C[i][j] = tmp;
    }
}
run_time = omp_get_wtime() - start_time;

The outer loop is spread across the threads; the inner loops run entirely within a single thread.

OpenMP Summary
- OpenMP is a compiler-based technique to create concurrent code from (mostly) serial code
- OpenMP can enable (easy) parallelization of loop-based code with fork-join parallelism:
  #pragma omp parallel
  #pragma omp parallel for
  #pragma omp parallel private ( i, x )
  #pragma omp atomic
  #pragma omp critical
  #pragma omp for reduction(+ : sum)
- OpenMP performs comparably to manually-coded threading
- Not a silver bullet for all applications