Parallelize Codes Using Intel Software


Parallelize Codes Using Intel Software. Senior R&D Engineer, AnCAD (逸奇科技): 吳俊德 (James), james@ancad.com. Included below are copies of slides that were presented by the ARB during a tutorial on OpenMP programming at SC'99.

iThinker64 Tutorial: Agenda
- Setting the stage: parallel computing, hardware, software, etc.
- OpenMP: a quick overview
- OpenMP: a detailed introduction
- How Intel software supports OpenMP
- OpenMP case studies, including performance tuning
- Common bugs in OpenMP programs and how to avoid them

Parallel Computing: What is it? Parallel computing is when a program uses concurrency to either decrease the runtime for the solution to a problem, or increase the size of the problem that can be solved.

Parallel Computing: Writing a parallel application.

Parallel Computing: The Hardware is in great shape.

Parallel Computing: … but where is the software? Most ISVs have ignored parallel computing (other than coarse-grained multithreading for GUIs and systems programming). Why? The perceived difficulties of writing parallel software outweigh the benefits. The benefits are clear; to increase the amount of parallel software, we need to reduce the perceived difficulties.

Solution: Effective standards for parallel programming
- Thread libraries: Win32 API - a low-level approach; threads.h++ - a high-level class library.
- Compiler directives: OpenMP - portable shared-memory parallelism (our focus).
- Message-passing libraries: MPI - www.mpi-softtech.com
- High-level tools: www.apri.com, www.kai.com, www.pgroup.com

Simple Example: PI program (no threads)

    static long num_steps = 100000;
    double step;
    void main ()
    {
        int i;
        double x, pi, sum = 0.0;
        step = 1.0/(double) num_steps;
        for (i=1; i<= num_steps; i++){
            x = (i-0.5)*step;
            sum = sum + 4.0/(1.0+x*x);
        }
        pi = step * sum;
    }

Simple Example: PI program (Windows threads)

    #define NUM_THREADS 2
    HANDLE thread_handles[NUM_THREADS];
    CRITICAL_SECTION hUpdateMutex;
    double global_sum = 0.0;
    static long num_steps = 100000;
    double step;

    void Pi (void *arg)
    {
        int i, start;
        double x, sum = 0.0;
        start = *(int *) arg;
        step = 1.0/(double) num_steps;
        for (i=start; i<= num_steps; i=i+NUM_THREADS){
            x = (i-0.5)*step;
            sum = sum + 4.0/(1.0+x*x);
        }
        EnterCriticalSection(&hUpdateMutex);
        global_sum += sum;
        LeaveCriticalSection(&hUpdateMutex);
    }

    void main ()
    {
        double pi;
        int i;
        DWORD threadID;
        int threadArg[NUM_THREADS];
        for(i=0; i<NUM_THREADS; i++) threadArg[i] = i+1;
        InitializeCriticalSection(&hUpdateMutex);
        for (i=0; i<NUM_THREADS; i++){
            thread_handles[i] = CreateThread(0, 0,
                (LPTHREAD_START_ROUTINE) Pi, &threadArg[i], 0, &threadID);
        }
        WaitForMultipleObjects(NUM_THREADS, thread_handles, TRUE, INFINITE);
        pi = global_sum * step;
        printf(" pi is %f \n", pi);
    }

Double the code size!

Simple Example: PI program (OpenMP)

    #include <omp.h>
    static long num_steps = 100000;
    double step;
    void main ()
    {
        int i;
        double x, pi, sum = 0.0;
        step = 1.0/(double) num_steps;
        #pragma omp parallel for reduction(+:sum) private(x)
        for (i=1; i<= num_steps; i++){
            x = (i-0.5)*step;
            sum = sum + 4.0/(1.0+x*x);
        }
        pi = step * sum;
    }

OpenMP adds 2 lines of code.

OpenMP: Introduction

OpenMP: Supporters
- Hardware vendors: Intel, HP, SGI, IBM, SUN, Compaq
- Software tools vendors: KAI, PGI, PSR, APR, Absoft
- Applications vendors: ANSYS, Fluent, Oxford Molecular, NAG, DOE ASCI, Dash, Livermore Software, and many others

OpenMP: Programming Model Fork-Join Parallelism: Master thread spawns a team of threads as needed. Parallelism is added incrementally: i.e. the sequential program evolves into a parallel program.

OpenMP: How is OpenMP typically used? OpenMP is usually used to parallelize loops: find your most time-consuming loops and split them up between threads.

OpenMP: How do threads interact? OpenMP is a shared-memory model: threads communicate by sharing variables. Unintended sharing of data can lead to race conditions (a race condition is when the program’s outcome changes as the threads are scheduled differently). To control race conditions, use synchronization to protect data conflicts. Synchronization is expensive, so change how data is stored to minimize the need for synchronization.

OpenMP: Some syntax details to get us started
Most of the constructs in OpenMP are compiler directives or pragmas.
For C and C++, the pragmas take the form:
    #pragma omp construct [clause [clause]…]
For Fortran, the directives take one of the forms:
    C$OMP construct [clause [clause]…]
    !$OMP construct [clause [clause]…]
    *$OMP construct [clause [clause]…]
Since the constructs are directives, an OpenMP program can be compiled by compilers that don’t support OpenMP.

OpenMP: Structured blocks Most OpenMP constructs apply to structured blocks. Structured block: a block of code with one point of entry at the top and one point of exit at the bottom. The only other branches allowed are STOP statements in Fortran and exit() in C/C++.

OpenMP: Contents
OpenMP’s constructs fall into 5 categories:
- Parallel regions
- Worksharing
- Data environment
- Synchronization
- Runtime functions/environment variables
OpenMP is basically the same between Fortran and C/C++.

OpenMP: Parallel Regions You create threads in OpenMP with the “omp parallel” pragma. For example, to create a 4-thread parallel region:
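The slide’s code did not survive the transcript; a minimal sketch (my own hello-world style example, not the original) of a 4-thread parallel region:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        omp_set_num_threads(4);              /* request a team of 4 threads */
        #pragma omp parallel                 /* fork: the block runs on every thread */
        {
            int id = omp_get_thread_num();   /* each thread gets its own ID (0..3) */
            printf("hello from thread %d\n", id);
        }                                    /* join: threads synchronize, master continues */
        return 0;
    }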

OpenMP: Parallel Regions

OpenMP: Scope of OpenMP constructs OpenMP constructs can span multiple source files.

OpenMP: Work-Sharing Constructs The “for” Work-Sharing construct splits up loop iterations among the threads in a team

Work-Sharing Constructs: A motivating example.
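The example itself is missing from the transcript; a sketch of the usual motivating sequence (the same loop written sequentially, with manual decomposition across threads, and with the for work-sharing construct) under my own assumptions:

    #include <omp.h>
    #define N 1000
    double a[N], b[N];

    void add_sequential(void) {
        for (int i = 0; i < N; i++) a[i] = a[i] + b[i];
    }

    void add_manual_decomposition(void) {
        #pragma omp parallel
        {
            int id     = omp_get_thread_num();
            int nthrds = omp_get_num_threads();
            int istart = id * N / nthrds;          /* each thread picks its own block... */
            int iend   = (id + 1) * N / nthrds;    /* ...of iterations by hand           */
            for (int i = istart; i < iend; i++) a[i] = a[i] + b[i];
        }
    }

    void add_worksharing(void) {
        #pragma omp parallel for                   /* the "for" construct splits the iterations for us */
        for (int i = 0; i < N; i++) a[i] = a[i] + b[i];
    }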

OpenMP for construct: The schedule clause
The schedule clause affects how loop iterations are mapped onto threads:
- schedule(static [,chunk]): deal out blocks of iterations of size “chunk” to each thread.
- schedule(dynamic [,chunk]): each thread grabs “chunk” iterations off a queue until all iterations have been handled.
- schedule(guided [,chunk]): threads dynamically grab blocks of iterations; the block size starts large and shrinks down to “chunk” as the calculation proceeds.
- schedule(runtime): schedule and chunk size are taken from the OMP_SCHEDULE environment variable.
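A short sketch (mine, not from the slides) of attaching a schedule clause to a work-shared loop; dynamic scheduling with a chunk of 4 is a typical choice when iteration costs vary:

    #include <omp.h>

    void process_rows(double *work, int n)
    {
        /* Iterations have uneven cost, so hand them out 4 at a time from a queue. */
        #pragma omp parallel for schedule(dynamic, 4)
        for (int i = 0; i < n; i++) {
            work[i] = work[i] * work[i];   /* stand-in for a variable-cost loop body */
        }
    }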

OpenMP: Work-Sharing Constructs The Sections work-sharing construct gives a different structured block to each thread.
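The slide’s illustration is not in the transcript; a minimal sketch of the sections construct (the printf calls are stand-ins for independent tasks):

    #include <stdio.h>
    #include <omp.h>

    void do_work(void)
    {
        #pragma omp parallel sections
        {
            #pragma omp section
            { printf("X calculation on thread %d\n", omp_get_thread_num()); }
            #pragma omp section
            { printf("Y calculation on thread %d\n", omp_get_thread_num()); }
            #pragma omp section
            { printf("Z calculation on thread %d\n", omp_get_thread_num()); }
        }   /* implied barrier at the end of the sections construct */
    }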

OpenMP: Combined Parallel Work-Sharing Constructs A shorthand notation that combines the parallel and work-sharing constructs. There’s also a “parallel sections” construct.

    #pragma omp parallel for
    for (I=0; I<N; I++){
        NEAT_STUFF(I);
    }

Data Environment: Default storage attributes
Shared-memory programming model: most variables are shared by default.
Global variables are SHARED among threads:
- Fortran: COMMON blocks, SAVE variables, MODULE variables
- C: file-scope variables, static variables
But not everything is shared...
- Stack variables in sub-programs called from parallel regions are PRIVATE.
- Automatic variables within a statement block are PRIVATE.

Data Environment: Example storage attributes
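The slide’s example is not in the transcript; a small C sketch (assumptions mine) showing which variables end up shared and which end up private by default:

    #include <omp.h>

    double A[100];                           /* file scope -> SHARED */

    void work(int index)
    {
        double temp[10];                     /* stack variable in a routine called from
                                                the parallel region -> PRIVATE */
        temp[0] = index;
        (void)temp;
    }

    void caller(void)
    {
        static int count;                    /* static -> SHARED */
        #pragma omp parallel
        {
            int local = omp_get_thread_num();/* automatic inside the block -> PRIVATE */
            work(local);
            #pragma omp atomic
            count++;                         /* shared, so the update must be protected */
        }
        (void)A;
    }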

Data Environment: Changing storage attributes One can selectively change the storage attributes of variables within OpenMP constructs using the following clauses: SHARED, PRIVATE, FIRSTPRIVATE, THREADPRIVATE. The value of a private variable inside a parallel loop can be transmitted to a global value outside the loop with LASTPRIVATE. The default status can be modified with DEFAULT (PRIVATE | SHARED | NONE).

Private Clause private(var) creates a local copy of var for each thread. The value is uninitialized, and the private copy is not storage-associated with the original.

Firstprivate Clause Firstprivate is a special case of private. Initializes each private copy with the corresponding value from the master thread.

Lastprivate Clause Lastprivate passes the value of a private variable from the last iteration to a global variable.
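A small sketch (not from the slides) illustrating both clauses: each thread’s private copy of tmp starts from the master’s value because of firstprivate, and the value from the sequentially last iteration is copied back out because of lastprivate:

    #include <stdio.h>

    int main(void)
    {
        int tmp = 10;                 /* master's value, copied into each thread by firstprivate */
        int last = 0;

        #pragma omp parallel for firstprivate(tmp) lastprivate(last)
        for (int i = 0; i < 100; i++) {
            tmp += i;                 /* private running value, seeded with 10 */
            last = i;                 /* lastprivate: the value from iteration 99 survives the loop */
        }

        printf("last = %d\n", last);  /* prints 99 */
        printf("tmp  = %d\n", tmp);   /* still 10: tmp outside the loop was never touched */
        return 0;
    }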

OpenMP: Default Clause
Note that the default storage attribute is DEFAULT(SHARED), so there is no need to specify it.
To change the default:
- DEFAULT(PRIVATE): each variable in the static extent of the parallel region is made private as if specified in a private clause; mostly saves typing.
- DEFAULT(NONE): no default for variables in the static extent; you must list the storage attribute for each variable in the static extent.

OpenMP: Default Clause Example
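The slide’s example is missing from the transcript; a minimal sketch (assumptions mine) of default(none), which forces every variable’s sharing attribute to be stated explicitly:

    #include <omp.h>
    #define N 1000

    void scale(double *a, double factor)
    {
        /* default(none): any variable used in the region but not listed in a clause
           is rejected by the compiler, catching accidental sharing. */
        #pragma omp parallel for default(none) shared(a) firstprivate(factor)
        for (int i = 0; i < N; i++) {
            a[i] *= factor;
        }
    }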

Threadprivate
Makes global data private to a thread:
- Fortran: COMMON blocks, SAVE variables
- C: file-scope and static variables
This is different from making them PRIVATE: with PRIVATE, global variables are masked, while THREADPRIVATE preserves global scope within each thread. Threadprivate variables can be initialized using COPYIN or by using DATA statements. The directive needs to be placed just after the declarations of the variables and before the main part of the software unit.
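A minimal C sketch (mine, not the slide’s) of threadprivate together with copyin:

    #include <stdio.h>
    #include <omp.h>

    int counter = 100;                       /* file-scope variable...                 */
    #pragma omp threadprivate(counter)       /* ...but each thread keeps its own copy  */

    int main(void)
    {
        counter = 42;                        /* set the master's copy                  */
        #pragma omp parallel copyin(counter) /* copyin: broadcast the master's value   */
        {
            counter += omp_get_thread_num(); /* each thread updates only its own copy  */
            printf("thread %d: counter = %d\n", omp_get_thread_num(), counter);
        }
        return 0;
    }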

OpenMP: Reduction
Another clause that affects the way variables are shared: reduction (op : list). The variables in “list” must be shared in the enclosing parallel region. Inside a parallel or a work-sharing construct:
- A local copy of each list variable is made and initialized depending on the “op” (e.g. 0 for “+”).
- The local value is updated pairwise with “op”.
- Local copies are reduced into a single global copy at the end of the construct.

OpenMP: Reduction example
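The example itself is missing from the transcript; a sketch of a typical reduction (averaging an array — my choice of example):

    #include <stdio.h>
    #define N 1000

    int main(void)
    {
        double a[N], ave = 0.0;
        for (int i = 0; i < N; i++) a[i] = i;

        /* Each thread accumulates into its own private copy of ave (initialized to 0);
           the copies are combined with "+" when the loop finishes. */
        #pragma omp parallel for reduction(+:ave)
        for (int i = 0; i < N; i++) {
            ave += a[i];
        }
        ave = ave / N;
        printf("average = %f\n", ave);
        return 0;
    }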

OpenMP: Synchronization OpenMP has the following constructs to support synchronization: critical section, atomic, barrier, flush, ordered, single, and master.

OpenMP: Synchronization Only one thread at a time can enter a critical section.
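The slide’s code is not reproduced in the transcript; a minimal sketch of critical protecting a shared update:

    #include <omp.h>
    #define N 1000

    double consume(double x) { return x * 0.5; }   /* stand-in for real work */

    double sum_with_critical(const double *data)
    {
        double res = 0.0;
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            double b = consume(data[i]);
            #pragma omp critical        /* only one thread at a time updates res */
            res = res + b;
        }
        return res;
    }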

OpenMP: Synchronization Atomic is a special case of a critical section that can be used for certain simple statements. It applies only to the update of a memory location (the update of X in the sketch below).
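The original example is missing; a sketch that fits the description (only the update of X is made atomic, with hypothetical work functions):

    #include <omp.h>

    double work1(int i) { return i * 1.0; }   /* stand-ins for the original's work routines */
    double work2(int i) { return i * 2.0; }

    double accumulate(int n)
    {
        double X = 0.0;
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            double tmp = work1(i) + work2(i);
            #pragma omp atomic          /* protects only this single update of X */
            X += tmp;
        }
        return X;
    }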

OpenMP: Synchronization Barrier: Each thread waits until all threads arrive.

OpenMP: Synchronization The ordered construct enforces the sequential order for a block.

OpenMP: Synchronization The master construct denotes a structured block that is only executed by the master thread. The other threads just skip it (no implied barriers or flushes).

OpenMP: Synchronization The single construct denotes a block of code that is executed by only one thread. A barrier and a flush are implied at the end of the single block.
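A small sketch (mine, not the slide’s) of the single construct inside a parallel region, showing the implied barrier at its end:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel
        {
            int id = omp_get_thread_num();
            printf("thread %d: before\n", id);

            #pragma omp single          /* exactly one thread (whichever arrives first) runs this */
            printf("thread %d: read input / set up shared data\n", id);
            /* implied barrier and flush here: no thread proceeds until the single block is done */

            printf("thread %d: after\n", id);
        }
        return 0;
    }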

OpenMP: Synchronization The flush construct denotes a sequence point where a thread tries to create a consistent view of memory. All memory operations (both reads and writes) defined prior to the sequence point must complete. All memory operations (both reads and writes) defined after the sequence point must follow the flush. Variables in registers or write buffers must be updated in memory. Arguments to flush specify which variables are flushed. No arguments specifies that all thread visible variables are flushed.

OpenMP: A flush example This example shows how flush is used to implement pairwise synchronization (see the sketch below).
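The code from the slide is not in the transcript; a sketch of the classic producer/consumer handshake it described, using a shared flag and flush (a spin-wait like this is subtle and is shown only to illustrate flush):

    #include <omp.h>

    int main(void)
    {
        int flag = 0;
        double data = 0.0;

        #pragma omp parallel sections shared(flag, data)
        {
            #pragma omp section                 /* producer */
            {
                data = 42.0;                    /* fill in the payload             */
                #pragma omp flush(data)         /* make the payload visible...     */
                flag = 1;
                #pragma omp flush(flag)         /* ...before the flag is raised    */
            }
            #pragma omp section                 /* consumer */
            {
                int ready = 0;
                while (!ready) {                /* spin until the producer signals */
                    #pragma omp flush(flag)
                    ready = flag;
                }
                #pragma omp flush(data)         /* now it is safe to read data     */
                double local = data;
                (void)local;
            }
        }
        return 0;
    }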

OpenMP: Implicit synchronization Barriers are implied on the following OpenMP constructs: the end of a parallel region, and the end of the for, sections, and single work-sharing constructs (unless a nowait clause is used). Flush is implied on the following OpenMP constructs: barrier, entry to and exit from critical and ordered, and the end of parallel, for, sections, and single (again, except where nowait removes the barrier).

OpenMP: Library routines
- Lock routines: omp_init_lock(), omp_set_lock(), omp_unset_lock(), omp_test_lock()
- Runtime environment routines:
  - Modify/check the number of threads: omp_set_num_threads(), omp_get_num_threads(), omp_get_thread_num(), omp_get_max_threads()
  - Turn on/off nesting and dynamic mode: omp_set_nested(), omp_set_dynamic(), omp_get_nested(), omp_get_dynamic()
  - Are we in a parallel region? omp_in_parallel()
  - How many processors in the system? omp_get_num_procs()

OpenMP: Library Routines Protect resources with locks.
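The slide’s code is not reproduced here; a minimal sketch of protecting a resource with an OpenMP lock:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        omp_lock_t lck;
        omp_init_lock(&lck);             /* create the lock before the parallel region        */

        #pragma omp parallel
        {
            int id = omp_get_thread_num();
            omp_set_lock(&lck);          /* acquire: only one thread holds the lock at a time */
            printf("thread %d owns the resource\n", id);
            omp_unset_lock(&lck);        /* release so other threads can proceed              */
        }

        omp_destroy_lock(&lck);          /* clean up the lock                                  */
        return 0;
    }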

OpenMP: Library Routines To fix the number of threads used in a program, first turn off dynamic mode and then set the number of threads.
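A sketch of that recipe (not the slide’s exact code):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        omp_set_dynamic(0);          /* turn off dynamic adjustment of the team size */
        omp_set_num_threads(4);      /* then request exactly 4 threads               */

        #pragma omp parallel
        {
            #pragma omp master
            printf("team size = %d\n", omp_get_num_threads());   /* should print 4 */
        }
        return 0;
    }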

OpenMP: Environment Variables
- OMP_SCHEDULE “schedule[, chunk_size]”: controls how “omp for schedule(runtime)” loop iterations are scheduled.
- OMP_NUM_THREADS int_literal: sets the default number of threads to use.
- OMP_DYNAMIC TRUE || FALSE: can the program use a different number of threads in each parallel region?
- OMP_NESTED TRUE || FALSE: will nested parallel regions create new teams of threads, or will they be serialized?

SMP Programming errors Shared-memory parallel programming is a mixed bag: it saves the programmer from having to map data onto multiple processors, and in this sense it’s much easier. But it opens up a range of new errors coming from unanticipated shared-resource conflicts.

2 major SMP errors
- Race conditions: the outcome of a program depends on the detailed timing of the threads in the team.
- Deadlock: threads lock up waiting on a locked resource that will never become free.