OpenMP Lab
Antonio Gómez-Iglesias, Texas Advanced Computing Center
Introduction

What you will learn:
- How to compile code (C and Fortran) with OpenMP
- How to parallelize code with OpenMP
  - Use the correct header declarations
  - Parallelize simple loops
- How to effectively hide OpenMP statements

What you will do:
- Modify the example code (READ the CODE COMMENTS)
- Compile and execute the examples
- Compare the run time of the serial codes and the OpenMP parallel codes with different scheduling methods

Accessing Lab Files

Log on to Stampede using your account. Untar the lab_OpenMP.tar file (in ~train00). The new directory (lab_openmp) contains sub-directories for exercises 1-3. cd into the appropriate subdirectory for each exercise.

ssh
tar -xvf ~train00/lab_OpenMP.tar
cd lab_openmp

Running Interactively on Compute Nodes

You will compile your code on the login node, but you will run it on the compute nodes. In one of your sessions, run idev:

idev -A TRAINING-HPC -t 1:00:00
idev -A TG-TRA -t 1:00:00

This will give you access to a compute node.

Compiling

All OpenMP statements are activated by the OpenMP flag:
- Intel compiler: icc/ifort -openmp -fpp source

Compilation with the OpenMP flag (-openmp):
- Activates the OpenMP comment directives:
  - Fortran: !$OMP ...
  - C: #pragma omp ...
- Defines the macro _OPENMP, so #ifdef _OPENMP evaluates to true (Fortran users: compile with -fpp)
- Enables "hidden" statements (Fortran only): !$ ...

Exercises – Lab 1

Exercise 1: Kernel check
- f_kernel.f90 / c_kernel.c
- Kernel of the calculation (see Exercise 2)
- Parallelize one loop

Exercise 2: Calculation of π
- f_pi.f90 / c_pi.c
- Parallelize one loop with a reduction

Exercise 3: daxpy (y = a*x + y)
- f_daxpy.f90 / c_daxpy.c
- Parallelize one loop

Exercise 1: π Integration Kernel Check

cd exercise_kernel
Codes: f_kernel.f90 / c_kernel.c
The number of intervals is varied (trial loop).

Parallelize the loop over i:
- Use omp parallel do/for
- Set the appropriate variables to private

Compile with:
ifort -openmp f_kernel.f90
icc -openmp c_kernel.c

Run with 1, 2, 4, 8, 12, and 16 threads, e.g.:
export OMP_NUM_THREADS=4
./a.out
Try also: export KMP_AFFINITY=compact

Compare the timings. Timings decrease with more threads. If you execute with more threads than cores, the timings will NOT decrease. Why?

Exercise 2: π Integration

cd exercise_pi
Codes: f_pi.f90 / c_pi.c
The number of intervals is varied (trial loop).

Parallelize the code and complete the OpenMP statements:
- Initialization
- omp_get_max_threads
- omp_get_thread_num

Parallelize the loop over i:
- Use omp parallel do/for with the default(none) clause

Compile with: make f_pi or make c_pi

Run with 1, 2, 4, 8, and 12 threads, e.g.:
export OMP_NUM_THREADS=4
./c_pi or ./f_pi

Compare the timings. Timings decrease with more threads. What is the speed-up at 12 threads?

Exercise 3: daxpy

cd exercise_daxpy
Codes: f_daxpy.f90 / c_daxpy.c
The number of intervals is varied (trial loop).

Parallelize the code and complete the OpenMP statements:
- Initialization
- omp_get_max_threads

Parallelize the loop over i:
- Use omp parallel do/for with the default(none) clause

Compile with: make f_daxpy or make c_daxpy

Run with 1 and 12 threads and compare the timings. Why is performance only doubled?
Hint: Parallel performance can be limited by memory bandwidth. What is happening for every daxpy operation? (Is there cache reuse?)

Exercises – Lab 2

Exercise 4: Update from neighboring cells (2 arrays)
- f_neighbor.f90 / c_neighbor.c
- Create a parallel region
- Use a single construct to initialize
- Use a critical construct to update
- Use dynamic or guided scheduling

Exercise 5: Update from neighboring cells (same array)
- f_red_black.f90 / c_red_black.c
- Parallelize 3 individual loops, use a reduction
- Create a parallel region
- Combine loops 1 and 2
- Use a single construct to initialize

Exercise 4: Neighbor Update, Part 1

cd exercise_neighbor
Codes: f_neighbor.f90 / c_neighbor.c
Compile with: make f_neighbor or make c_neighbor

- Parallelize the loop over i: use omp parallel do/for with the default(none) clause
- Use a single construct for the initialization. Would a master construct work, too?
- Use critical for the increment of j_update
- Try different schedules: static, dynamic, guided

Exercise 4: Neighbor Update, Part 2

Compile with: make f_neighbor or make c_neighbor

- Change the single construct to a master construct
- Run with 1 and 12 threads
- How does the value of j_update change?

Exercise 5: Red-Black Update, Part 1

cd exercise_redblack
Codes: f_red_black.f90 / c_red_black.c
Make a copy and create f_red_black_v1.f90 / c_red_black_v1.c
Compile with: make f_red_black_v1 or make c_red_black_v1

Part 1:
- Parallelize each loop separately
- Use omp parallel do/for for the update loops
- Use a reduction for the error calculation, with the default(none) clause
- Try static scheduling

Exercise 5: Red-Black Update, Part 2

cd exercise_redblack
Start from version 1.
Codes: f_red_black.f90 / c_red_black.c
Make a copy and create f_red_black_v2.f90 / c_red_black_v2.c
Compile with: make f_red_black_v2 or make c_red_black_v2

Part 2:
- Can the loops be combined? Why can the update loops be combined? Why can the error loop not be combined with the update loops?
- Task: Combine the update loops
- Try static scheduling

Solution 5: Red-Black Update; Part 2

Exercise 5: Red-Black Update, Part 3

cd exercise_redblack
Start from version 2.
Codes: f_red_black.f90 / c_red_black.c
Make a copy and create f_red_black_v3.f90 / c_red_black_v3.c
Compile with: make f_red_black_v3 or make c_red_black_v3

Part 3:
- Make one parallel region around both loops: update and error
- The initialization of error has to be done by one thread: use a single construct
- Would a master construct work?

Exercise 6: Orphaned Work-Sharing

cd exercise_print
Codes: f_print.f90 / c_print.c
Make a copy and create f_print_v1.f90 / c_print_v1.c
Compile with: make f_print or make c_print

- Inspect the code
- Run with 1, 2, ... threads
- Explain the output: how often are the 5 print statements executed? Why?