
OpenMP I
B. Estrade, Information Technology Services, Louisiana Optical Network, LSU

Objectives
● get exposure to the shared memory programming model
● get familiar with the basics of OpenMP's capabilities
● write and compile a very simple OpenMP program
● learn of sources for more information

Shared Memory
● In an ideal world, you don't want a network connecting your compute nodes (even the fastest networks are too slow!)
● 2 or more identical CPUs (or processing cores) that share main memory are called "symmetric multi-processors," or SMP for short.
● In contrast, "asymmetric multi-processor" systems share main memory among different, potentially special-purpose, processors (e.g., an SMP computer with Cell, GPGPU, or FPGA accelerators).
● Traditionally, "supercomputers" strive to provide as much shared memory among as many CPUs as possible.
● As the number of cores per processor increases, clusters are trending toward distributed shared memory, meaning each node is itself a shared memory machine.
● However, providing a single, large shared memory machine remains the goal for companies like Cray, IBM, Sun, and SGI.

Distributed Memory
● Distributed memory systems may consist of 2 or more compute nodes, which may themselves be SMP machines.
● Non-Uniform Memory Access (NUMA) systems provide for the cooperation of 2 or more processors, whereby not all processors have direct (i.e., local) access to all main memory; strictly speaking, NUMA does not provide cache coherency, it only facilitates access to memory.
● NUMA clusters (e.g., Beowulf clusters) are a fairly recent phenomenon (compared to "traditional" supercomputing techniques), fueled by the availability of inexpensive x86 hardware and networking products.
● More recently, there have been efforts to provide "virtualized" SMP environments using "ccNUMA" techniques, where "cc" stands for "cache coherency," a required mechanism if one wants to provide a true shared memory environment (e.g., 3Leaf Systems).
● Distributed memory systems always introduce latency through the methods connecting the disjoint memories; it gets even more inefficient when attempting to make the disjoint memories and CPU caches "coherent" in order to provide a virtualized global view.

Limitations
● the number of CPUs on a single compute node bounds the number of threads
● memory overhead bounds performance scaling, mainly due to "cache coherency" issues
● on distributed memory clusters (i.e., Beowulf clusters), shared memory among compute nodes must be facilitated by network communication (thus introducing latency/bandwidth overhead)
● "real" supercomputers address these issues by focusing on very high bandwidth interconnects among highly integrated shared memory nodes (Sun's Ranger, IBM's Blue Gene, etc.)

Local SMP Resources
● Your own desktop/laptop, if it has 2 or more "cores"!
● LSU HPC's "Santaka" (SGI Altix SSI) provides a true 30-CPU SMP environment (Itanium)
● LONI's Power5 systems provide 8 SMP processors per compute node
● LSU HPC's Power5 system, "Pelican," provides 16 SMP processors per compute node
● LONI's x86 cluster, "Queen Bee," provides 8 SMP cores per compute node
● LSU HPC's newest x86 cluster, "Philip," provides 8 cores via two 2.93 GHz quad-core Nehalem Xeon 64-bit processors on 37 compute nodes (2 nodes with up to 96 GB RAM, 3 with up to 48 GB RAM, 32 nodes with up to 24 GB RAM)
● ...there are more LONI and LSU HPC systems, but these are the highlights

Shared Memory Programming
● Uses the concept of "threads"
● Each thread is related to a single process and shares the same memory space on the SMP machine
● Threads do not communicate with explicit messages (like MPI); they communicate implicitly through shared variables
● OpenMP makes creating basic multi-threaded programs easy
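
As an illustration of implicit communication (a minimal sketch, not part of the original slides; the array size and names are arbitrary): each thread writes into its own slot of a shared array, and after the parallel region the master thread reads every slot. No messages are exchanged; the threads simply see the same memory.

  #include <stdio.h>
  #include <omp.h>

  #define MAX_THREADS 64              /* assumed upper bound for this sketch */

  int main (void) {
    int ids[MAX_THREADS];             /* shared among all threads            */
    int nthreads = 0;
    int i;
    #pragma omp parallel shared(ids, nthreads)
    {
      int id = omp_get_thread_num();  /* private: declared inside the region */
      ids[id] = id;                   /* "communicate" by writing to shared memory */
      if ( id == 0 )
        nthreads = omp_get_num_threads();
    }
    for ( i = 0; i < nthreads; i++ )  /* after the join, the master sees every write */
      printf("slot %d holds %d\n", i, ids[i]);
    return 0;
  }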

Alternatives to OpenMP
● Unified Parallel C
● Co-Array Fortran
● POSIX Threads (pthreads)
● GNU Portable Threads
● Java Threads
● Win32 Threads
● Netscape Portable Runtime
● ...and many more

OpenMP's Execution Model
● OpenMP is built around a shared memory space and related, concurrent threads; this is how the parallelism is facilitated.
● Each thread is typically run on its own processor, though it is becoming common for each CPU or "core" to handle more than one thread "simultaneously"; this is called "hyper-threading" in x86-land and "simultaneous multi-threading" (SMT) in POWER-architecture land (e.g., IBM/AIX).
● Thread (process) communication is implicit and uses variables pointing to shared memory locations; this is in contrast with MPI, which uses explicit messages passed among processes.
● OpenMP simply makes it easier to manage the traditional fork/join paradigm, using special "hints" or directives given to an OpenMP-enabled compiler.
● These days, most major compilers support OpenMP directives on most platforms (IBM's XL suite, Intel, even GCC ≥ 4.2).
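
Because the parallelism is expressed as compiler directives, the same source can also be built serially by a compiler that ignores the pragmas. A minimal illustrative sketch (not from the slides) using the standard _OPENMP macro, which OpenMP-enabled compilers define:

  #ifdef _OPENMP
  #include <omp.h>            /* only meaningful when compiling with OpenMP */
  #endif
  #include <stdio.h>

  int main (void) {
    #pragma omp parallel      /* ignored by non-OpenMP compilers */
    {
  #ifdef _OPENMP
      printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
  #else
      printf("compiled without OpenMP: a single 'thread'\n");
  #endif
    }
    return 0;
  }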

OpenMP's Execution Model (continued)
● Any parallel program may have its execution expressed as a directed acyclic graph.
● A fork is when a single thread is made into multiple, concurrently executing threads.
● A join is when the concurrently executing threads synchronize back into a single thread.
● During the overall execution, threads may communicate with one another through shared variables.
● OpenMP programs essentially consist of a series of forks and joins.
[Diagram: the master thread forks (F) into a team of threads, which later join (J) back into the master thread.]
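
A minimal sketch (illustrative, not from the slides) of the "series of forks and joins": each parallel directive is a fork, and the end of each region is a join back to the master thread.

  #include <stdio.h>
  #include <omp.h>

  int main (void) {
    printf("serial: master thread only\n");

    #pragma omp parallel          /* fork #1 */
    {
      printf("region 1: thread %d\n", omp_get_thread_num());
    }                             /* join #1: back to the master thread */

    printf("serial again: master thread only\n");

    #pragma omp parallel          /* fork #2 */
    {
      printf("region 2: thread %d\n", omp_get_thread_num());
    }                             /* join #2 */

    return 0;
  }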

A Simple OpenMP Example

C/C++:

  #include <stdio.h>
  #include <omp.h>

  int main (int argc, char *argv[]) {
    int id, nthreads;
    #pragma omp parallel private(id)
    {
      id = omp_get_thread_num();
      printf("hi from %d\n", id);
      #pragma omp barrier
      if ( id == 0 ) {
        nthreads = omp_get_num_threads();
        printf("%d threads say hi!\n", nthreads);
      }
    }
    return 0;
  }

F90:

  program hello90
    use omp_lib
    integer:: id, nthreads
    !$omp parallel private(id)
      id = omp_get_thread_num()
      write (*,*) 'hi from', id
      !$omp barrier
      if ( id == 0 ) then
        nthreads = omp_get_num_threads()
        write (*,*) nthreads, 'threads say hi!'
      end if
    !$omp end parallel
  end program

Output using 5 threads:

  hi from 0
  hi from 4
  hi from 2
  hi from 3
  hi from 1
  5 threads say hi!

Trace of Execution (5 threads)
Thread 0: prints "hi from 0", waits (w) at the barrier, then, since its id == 0, prints "5 threads say hi!"
Thread 1: prints "hi from 1", waits (w); 1 != 0, so it does nothing more
Thread 2: prints "hi from 2", waits (w); 2 != 0
Thread 3: prints "hi from 3", waits (w); 3 != 0
Thread 4: prints "hi from 4", waits (w); 4 != 0
w = wait for all threads before progressing.

Controlling Data Access
● Data access refers to the sharing of variables among threads.
● shared( var1, var2, ..., varN ): specifies variables that may safely be shared among threads
● private( var1, var2, ..., varN ): specifies uninitialized variables that are only accessible to a specific thread
● default( shared | private | none ): allows one to define the default scoping of the threads' variables

Data Access Examples

PRIVATE

C/C++:
  int id, nthreads;
  #pragma omp parallel private(id)
  {
    id = omp_get_thread_num();
    ...

Fortran:
  integer:: id, nthreads
  !$omp parallel private(id)
    id = omp_get_thread_num()
    ...

SHARED

C/C++ (clauses combined on a single parallel directive, with the variables scoped explicitly):
  int id, nthreads, A, B;
  A = getA();
  B = getB();
  #pragma omp parallel shared(A,B) private(id)
  {
    id = omp_get_thread_num();
    ...

Fortran:
  integer:: id, nthreads, A, B
  call getA(A)
  call getB(B)
  !$omp parallel default(private) shared(A,B)
    id = omp_get_thread_num()
    ...

Variable Initialization
● Variable initialization controls how private and shared variables get assigned an initial value.
● firstprivate( var1, ..., varN ): allows a variable to be globally initialized, but private to each thread once program execution has entered the parallel section
● threadprivate( /BLOCK1/, ..., /BLOCKN/ ): allows named, globally defined COMMON blocks (in Fortran) to be private to each thread, but global within the thread itself
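
A minimal sketch of firstprivate (illustrative; the variable name is arbitrary): each thread starts with its own copy of offset, initialized to the value it had before the fork, and changes made inside the region do not leak back out.

  #include <stdio.h>
  #include <omp.h>

  int main (void) {
    int offset = 100;                        /* initialized before the fork */
    #pragma omp parallel firstprivate(offset)
    {
      /* each thread's private copy starts at 100 */
      offset += omp_get_thread_num();        /* modifies the private copy only */
      printf("thread %d sees offset %d\n", omp_get_thread_num(), offset);
    }
    printf("after the join, offset is still %d\n", offset);   /* prints 100 */
    return 0;
  }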

Reductions
● Reductions allow a variable used privately by each thread to be aggregated into a single value.
● Listing a variable in a reduction clause implies that it is private (it is an error to specify it in both).
● Always initialize your reduced variables properly!
● Reduction operations (C/C++): arithmetic (+ - *), bitwise (& ^ |), logical (&& ||)
● Specified on the main omp parallel directive, or on a work-sharing construct:

  int i = 0; // important!
  #pragma omp parallel reduction(+:i)
  {
    // ... code
  }
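
A complete, illustrative version of the fragment above (not from the slides): each thread adds its thread number into its private copy of i, and at the join the copies are combined with + into the original i.

  #include <stdio.h>
  #include <omp.h>

  int main (void) {
    int i = 0;                        /* important: initialize before the fork */
    #pragma omp parallel reduction(+:i)
    {
      /* inside the region, i is a private copy (initialized to 0 for '+') */
      i += omp_get_thread_num();
    }
    /* after the join, i holds 0 + 1 + ... + (nthreads - 1) */
    printf("reduced i = %d\n", i);
    return 0;
  }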

Trace of Reduction Execution
Before the fork: i = 0, with reduction(+:i).
Each of the 5 threads sets its private copy to 1: i0 = 1, i1 = 1, i2 = 1, i3 = 1, i4 = 1.
At the join, the copies are combined: i = i0 + i1 + i2 + i3 + i4 = 5.

Reduction Limits & Traps!
● The reduced variable must be a scalar, i.e., a single value (no arrays, data structures, etc.).
● (trap!) The initial value does affect the outcome; for example, initializing a variable to 0 (zero) makes the outcome of an aggregate multiplication 0 (zero) as well:

  int i = 0;
  ... reduction(*:i)

  means 0 * i0 * i1 * ... * in, which is == 0
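
A minimal sketch of the fix (illustrative): for a multiplicative reduction, initialize the reduced variable to 1, the identity of *, so the original value does not zero out the product.

  #include <stdio.h>
  #include <omp.h>

  int main (void) {
    int i = 1;                        /* 1 is the identity for '*' */
    #pragma omp parallel reduction(*:i)
    {
      /* each private copy starts at 1; multiply in a per-thread factor of 2 */
      i *= 2;
    }
    /* with N threads the result is 2^N, not 0 */
    printf("product = %d\n", i);
    return 0;
  }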

Synchronization
● barrier: provides a place in the code for all threads to wait for the others; no thread may progress until all threads have made it to this point
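
A small illustrative sketch (not from the slides) of why a barrier matters: every thread must finish writing its slot of the shared array before thread 0 is allowed to read and sum the results.

  #include <stdio.h>
  #include <omp.h>

  #define MAX_THREADS 64               /* assumed upper bound for this sketch */

  int main (void) {
    int partial[MAX_THREADS];
    #pragma omp parallel
    {
      int id = omp_get_thread_num();
      partial[id] = id * id;           /* phase 1: every thread writes        */

      #pragma omp barrier              /* no one proceeds until all writes are done */

      if ( id == 0 ) {                 /* phase 2: thread 0 can now safely read */
        int n = omp_get_num_threads();
        int sum = 0, j;
        for ( j = 0; j < n; j++ )
          sum += partial[j];
        printf("sum of squares of thread ids = %d\n", sum);
      }
    }
    return 0;
  }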

Trace of Execution (revisited)
The earlier trace, repeated to show where the barrier takes effect: every thread prints its greeting and then waits (w); only after all five have reached the barrier does thread 0 print "5 threads say hi!", while threads 1 through 4 skip it because their id != 0.
w = wait for all threads before progressing.

(some) OpenMP Runtime Routines
● omp_get_num_threads
● omp_set_num_threads
● omp_in_parallel
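
A short illustrative sketch of the three routines (everything here is a plain example, not from the slides): request a team size, then query it and test whether execution is currently parallel.

  #include <stdio.h>
  #include <omp.h>

  int main (void) {
    omp_set_num_threads(4);            /* request 4 threads for later regions */

    printf("outside: in parallel? %d\n", omp_in_parallel());   /* prints 0 */

    #pragma omp parallel
    {
      if ( omp_get_thread_num() == 0 ) /* only the master reports */
        printf("inside: %d threads, in parallel? %d\n",
               omp_get_num_threads(), omp_in_parallel());       /* typically 4 and 1 */
    }
    return 0;
  }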

OpenMP Environment Variables
● OMP_NUM_THREADS: required; informs the execution of the number of threads to use
● OMP_SCHEDULE: applies to PARALLEL DO and work-sharing DO directives that have a schedule type of RUNTIME
● OMP_DYNAMIC: enables or disables dynamic adjustment of the number of threads available for the execution of parallel regions

Compiling and Executing
● IBM p5 575s: xlf_r, xlf90_r, xlf95_r, xlc_r, cc_r, c89_r, c99_r, xlc128_r, cc128_r, c89_128_r, c99_128_r
● x86 clusters: ifort, icc

bash:
  % xlc_r -qsmp=omp test.c && OMP_NUM_THREADS=5 ./a.out
  % icc -openmp test.c && OMP_NUM_THREADS=5 ./a.out

The "Hard" Part
● The difficult aspect of creating a shared memory program is translating what you want to do into a multi-threaded version.
● It is even more difficult to make this multi-threaded version optimally efficient.
● The most difficult part is verifying that your multi-threaded version is correct and that there are no issues with shared variables overwriting one another (race conditions, deadlocks, etc.).
● Program verification and detecting/debugging race conditions (and other run-time issues) are beyond the scope of this tutorial; they will be covered in future talks on advanced OpenMP.
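
As a small illustration (not from the slides) of the kind of bug referred to above: the unsynchronized update below is a race, because several threads may read, increment, and write counter at the same time and lose updates. Detecting and fixing such races is left to the advanced material.

  #include <stdio.h>
  #include <omp.h>

  int main (void) {
    int counter = 0;                  /* shared by default */
    #pragma omp parallel
    {
      /* RACE: read-modify-write of a shared variable with no synchronization; */
      /* the final value may be anything from 1 up to the number of threads.   */
      counter = counter + 1;
    }
    printf("counter = %d (may vary from run to run)\n", counter);
    return 0;
  }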

Lab 1
● Using the following description, design on paper (visually) how a multi-threaded version may look.
● Description:
  – for a threaded application with N threads, return the number N²
  – each thread may use only a single private (or firstprivate) variable
  – this variable must be reduced using addition, so all threads must contribute to the answer

Lab 1: Solution
● declare a private variable, int i
● in each thread, set i = #threads
● reduce i to the sum of all threads' i values

Lab 2
● Implement the solution to Lab 1 in either C or Fortran.
● Compile, then run the code using the examples provided in the presentation.

Lab 2: A Solution in C

  #include <stdio.h>
  #include <omp.h>

  int main (int argc, char *argv[]) {
    int c = 0; // required on AIX (xlc_r)
    #pragma omp parallel reduction(+:c)
    {
      c = omp_get_num_threads();
    }
    printf("%d\n", c);
    return 0;
  }

On AIX (p5 575):
  % xlc_r -qsmp=omp test.c && OMP_NUM_THREADS=5 ./a.out
  25

On Linux (x86):
  % icc -openmp test.c && OMP_NUM_THREADS=5 ./a.out
  test.c(5) : (col. 3) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
  25

Lab 2: A Solution in Fortran

  program numthreadssquared
    use omp_lib
    integer:: c = 0
    !$omp parallel default(private) reduction(+:c)
      c = omp_get_num_threads()
    !$omp end parallel
    WRITE (*,*) c
  end program

On Ducky (AIX, p5 575):
  % xlf90_r -qsmp=omp test.f90 && OMP_NUM_THREADS=5 ./a.out
  25

On Eric (Linux, x86):
  % ifort -openmp test.f90 && export OMP_NUM_THREADS=5 && ./a.out
  25
