Stommel Model with OpenMP
NPACI Parallel Computing Institute, August 19-23, 2002
San Diego Supercomputer Center
National Partnership for Advanced Computational Infrastructure

Slide 2: Stommel model with OpenMP

A good case for using loop-level parallelism:
  - Most of the calculation is done inside a loop
  - No dependency between iterations
  - A lot of work in each iteration

   subroutine do_jacobi(psi,new_psi,diff,i1,i2,j1,j2)
   ...
   do j=j1,j2
      do i=i1,i2
         new_psi(i,j) = a1*psi(i+1,j) + a2*psi(i-1,j) + &
                        a3*psi(i,j+1) + a4*psi(i,j-1) - a5*force(i,j)
         diff = diff + abs(new_psi(i,j)-psi(i,j))
      enddo
   enddo
   psi(i1:i2,j1:j2) = new_psi(i1:i2,j1:j2)
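For context, in the full program this kernel is called from an outer iteration loop that repeats the Jacobi sweep until the solution stops changing. Below is a minimal sketch of such a driver; the names steps and tol are illustrative, not taken from the course code:

   ! Repeat the Jacobi sweep until the residual diff is small enough.
   ! steps and tol are hypothetical names used only for this sketch.
   do iter = 1, steps
      call do_jacobi(psi, new_psi, diff, i1, i2, j1, j2)
      if (diff < tol) exit
   enddo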

Slide 3: Serial program

IBM: compile with
   xlf90 -O3 stf_00.f -o stf_00
Run using standard input.
Without the write_grid routine, running time on Blue Horizon, 1 node, 1 CPU: 83.5 sec

Slide 4: OpenMP parallelization, version 1

Put a directive around the outer loop:
  - diff is a reduction variable
  - We want i to be private

   !$OMP PARALLEL DO private(i) reduction(+:diff)
   do j=j1,j2
      do i=i1,i2
         new_psi(i,j) = a1*psi(i+1,j) + a2*psi(i-1,j) + &
                        a3*psi(i,j+1) + a4*psi(i,j-1) - &
                        a5*for(i,j)
         diff = diff + abs(new_psi(i,j)-psi(i,j))
      enddo
   enddo
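Because OpenMP defaults most variables to shared, it can be safer to spell every scope out with default(none). The sketch below is a hedged variant of the same directive; it assumes the coefficients a1..a5 and the forcing array for come from the constants module and are meant to be shared:

   !$OMP PARALLEL DO default(none) private(i,j) reduction(+:diff) &
   !$OMP    shared(psi,new_psi,for,a1,a2,a3,a4,a5,i1,i2,j1,j2)
   do j=j1,j2
      do i=i1,i2
         ! Same update as above; with default(none) the compiler now
         ! rejects any variable whose sharing we forgot to declare.
         new_psi(i,j) = a1*psi(i+1,j) + a2*psi(i-1,j) + &
                        a3*psi(i,j+1) + a4*psi(i,j-1) - a5*for(i,j)
         diff = diff + abs(new_psi(i,j)-psi(i,j))
      enddo
   enddo
   !$OMP END PARALLEL DO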

Slide 5: OpenMP parallelization, version 1, cont.

Compile on IBM:
   xlf90_r -O3 -qsmp=noauto:omp stf_00.f
Run on 8 threads:
   setenv OMP_NUM_THREADS 8
   poe a.out < st.in -nodes 1 -tasks_per_node 1 -rmpool 1
Runtime: sec (compare with the serial time of 83.4 sec)
WHAT HAPPENED???

Slide 6: OpenMP parallelization, version 1, cont.

Let's look at the code again:

   !$OMP PARALLEL DO private(i) reduction(+:diff)
   do j=j1,j2
      do i=i1,i2
         new_psi(i,j) = a1*psi(i+1,j) + a2*psi(i-1,j) + &
                        a3*psi(i,j+1) + a4*psi(i,j-1) - &
                        a5*for(i,j)
         diff = diff + abs(new_psi(i,j)-psi(i,j))
      enddo
   enddo
   psi(i1:i2,j1:j2) = new_psi(i1:i2,j1:j2)

The last line was not parallelized: array syntax is not yet recognized by the OpenMP compiler. The program also probably spent much of its time in synchronization after the do loops.
SOLUTION: another directive.
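As a side note, later OpenMP Fortran compilers added the WORKSHARE construct, which can divide an array assignment like this one among threads directly. A hedged sketch is shown below; it assumes a compiler that actually implements WORKSHARE, which was far from universal when these slides were written:

   ! Let the threads share the work of the array assignment itself.
   !$OMP PARALLEL WORKSHARE
   psi(i1:i2,j1:j2) = new_psi(i1:i2,j1:j2)
   !$OMP END PARALLEL WORKSHARE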

Slide 7: OpenMP parallelization, version 2

New version: the array-syntax copy is replaced by an explicit loop with its own directive:

   !$OMP PARALLEL DO private(i)
   do j=j1,j2
      do i=i1,i2
         psi(i,j) = new_psi(i,j)
      enddo
   enddo

Runtime on 8 CPUs: 17.3 sec
A much better result.
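Another way to attack the synchronization cost is to create the threads only once: put both loops inside a single parallel region with two worksharing DO constructs. This is a sketch of an alternative, not what the original slides did:

   !$OMP PARALLEL private(i,j)
   !$OMP DO reduction(+:diff)
   do j=j1,j2
      do i=i1,i2
         new_psi(i,j) = a1*psi(i+1,j) + a2*psi(i-1,j) + &
                        a3*psi(i,j+1) + a4*psi(i,j-1) - a5*for(i,j)
         diff = diff + abs(new_psi(i,j)-psi(i,j))
      enddo
   enddo
   !$OMP END DO
   ! The implicit barrier at END DO above guarantees every thread has
   ! finished writing new_psi before any thread copies it back to psi.
   !$OMP DO
   do j=j1,j2
      do i=i1,i2
         psi(i,j) = new_psi(i,j)
      enddo
   enddo
   !$OMP END DO
   !$OMP END PARALLEL

This avoids paying the fork/join cost twice per call, at the price of a slightly more verbose region.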

Slide 8: Our final subroutine: header

   subroutine do_jacobi(psi,new_psi,diff,i1,i2,j1,j2)
      use numz
      use constants
      implicit none
      integer, intent(in) :: i1,i2,j1,j2
      real(b8), dimension(i1-1:i2+1,j1-1:j2+1) :: psi
      real(b8), dimension(i1-1:i2+1,j1-1:j2+1) :: new_psi
      real(b8) diff
      integer i,j
      real(b8) y
      diff=0.0_b8
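The b8 kind used throughout comes from the numz module, which the slides do not show. The definition below is only a plausible guess at a minimal version, a parameter selecting a double-precision-like real kind:

   module numz
      ! Assumed minimal content: b8 picks a real kind with at least
      ! 14 decimal digits of precision (roughly double precision).
      integer, parameter :: b8 = selected_real_kind(14)
   end module numz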

Slide 9: Important part

   !$OMP PARALLEL DO private(i) reduction(+:diff)
   do j=j1,j2
      do i=i1,i2
         new_psi(i,j) = a1*psi(i+1,j) + a2*psi(i-1,j) + &
                        a3*psi(i,j+1) + a4*psi(i,j-1) - &
                        a5*for(i,j)
         diff = diff + abs(new_psi(i,j)-psi(i,j))
      enddo
   enddo
   !$OMP END PARALLEL DO

   ! psi(i1:i2,j1:j2)=new_psi(i1:i2,j1:j2)
   !$OMP PARALLEL DO private(i)
   do j=j1,j2
      do i=i1,i2
         psi(i,j) = new_psi(i,j)
      enddo
   enddo
   !$OMP END PARALLEL DO
   end subroutine do_jacobi
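To see where the time goes, the parallel section can be timed with the portable OpenMP wall-clock timer. A small hypothetical sketch follows; the outer iteration loop and the names iter, steps, t0 and t1 are illustrative, not from the course code:

   use omp_lib            ! provides omp_get_wtime()
   real(b8) :: t0, t1
   t0 = omp_get_wtime()
   do iter = 1, steps     ! hypothetical outer iteration loop
      call do_jacobi(psi, new_psi, diff, i1, i2, j1, j2)
   enddo
   t1 = omp_get_wtime()
   print *, 'Jacobi time (seconds):', t1 - t0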

Slide 10: Summary

  - Using two OpenMP directives allowed us to achieve reasonable scaling.
  - Need to be careful with variable scopes.
  - OpenMP compilers are still rather immature.