High Performance Computation --- A Practical Introduction
Chunlin Tian, NAOC, Beijing, 2011
Outline
– Parallelization techniques
  – OpenMP: do-loop based
  – MPI: communication
  – Auto-parallelization, CUDA
– Remarks:
  – This is an introductory-level overview.
  – It is NOT a comprehensive introduction.
Introduction
Speeding up a computation involves mathematics, physics, and computation.
Hardware:
– number of CPUs
– size of memory
– CPU: multi-processor vs. cluster; GPU
– memory: distributed vs. shared
Software:
– auto-parallelization by the compiler
– OpenMP
– MPI
– CUDA
Shared vs. Distributed
Hardware: desktop (shared memory) vs. supercomputer (distributed memory)
Software: code written for distributed memory also runs on shared-memory machines.
Auto-parallelization
Easy to employ:
– set an environment variable: setenv OMP_NUM_THREADS 2
– compiler options: pgf77 -mp -static ... or ifort -parallel ...
Not smart enough:
– only efficient for dual-core CPUs
– sometimes even slower than a single thread
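As a hedged illustration (not in the original slides), the auto-parallelizer looks for do-loops whose iterations are independent of one another; a simple loop like the following, compiled with e.g. ifort -parallel, may be distributed over threads automatically:

      program autopar
      real*8 a(1000000), b(1000000)
      integer i
      b = 1.0d0
      do i = 1, 1000000
        a(i) = 2.0d0*b(i) + 1.0d0   ! iterations are independent
      enddo
      write(*,*) a(1)
      end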
OpenMP: Introduction
Open Multi-Processing:
– an API supporting multi-platform shared-memory multiprocessing programming
– consists of a set of compiler directives, library routines, and environment variables
History:
– 1997, version 1.0 in Fortran
– 1998, version 1.0 in C, C++
– 2000, version 2.0 in Fortran
– 2002, version 2.0 in C, C++
– 2005, version 2.5 in Fortran, C, C++
– 2008, version 3.0 in Fortran, C, C++ ...
Compilers: GNU, Intel, IBM, PGI, MS, ...
Coding with OpenMP
Step 1: define the parallel region.
Step 2: declare the data-sharing types of the variables (shared or private).
Step 3: mark the do-loops to be parallelized.
Remarks:
– You can parallelize your code incrementally, part by part.
– The number of parallel regions should be as small as possible.
Example of OpenMP code

!$omp parallel
!$omp& default (shared)
!$omp& private (tmp)
!$omp do
      do i=1,nx
        tmp = a(i)**2 + b(i)**2
        tmp = sqrt(tmp)
        c(i) = a(i)/tmp
        d(i) = b(i)/tmp
      enddo
!$omp end do
!$omp single
      write(*,*) maxval(c), maxval(b)
!$omp end single
!$omp do
      do j=1,ny
        tmp = a(j)**2 + b(j)**2
        tmp = sqrt(tmp)
        c(j) = b(j)/tmp
        d(j) = a(j)/tmp
      enddo
!$omp end do
!$omp end parallel
Run the OpenMP code
Set the environment variable:
– setenv OMP_NUM_THREADS 4
Compile and run:
– ifort -openmp -intel-static *.f -o openbbs1.e
– ./openbbs1.e
Scalability of OpenMP code
Ideally the speed-up would be linear in the number of threads, but initializing, finalizing, and synchronizing the threads all take time, so the measured speed-up falls below the ideal.
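This behaviour can be quantified (a hedged note, not in the original slides) by Amdahl's law: if a fraction p of the runtime is parallelizable, the speed-up on N threads is at most S(N) = 1 / ((1 - p) + p/N). For example, with p = 0.9 the speed-up can never exceed 10, however many threads are used.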
MPI
Message Passing Interface:
– a specification for an API that allows many computers to communicate with one another
– a language-independent protocol, programmer interface, and semantic specification
History:
– 1994 May, version 1.0, the final report of the MPI Forum
– 1995 June, version 1.1
– 1997 July, version 1.2 (MPI-1) and version 2.0 (MPI-2)
– 2008 May, version 1.3
– 2008 June, version 2.1
– 2009 Sept., version 2.2
Remarks:
– Open MPI ≠ OpenMP
– implementations: MPICH, HP MPI, Intel MPI, MS MPI, ...
Coding with MPI
1: determine the number of blocks (domain decomposition)
2: define a virtual CPU topology
3: define the parallel region
4: assign tasks to the different processes
5: handle the communication between processes
6: manage the processes: master-slave or non-master
Example of MPI coding

      include 'mpif.h'
      integer ierr, myid, tag, status(MPI_STATUS_SIZE)
      nx = 100                          ! number of grid points
      ny = 100
      mx = 2                            ! number of blocks
      my = 5
      call MPI_INIT(ierr)               ! initialize the parallelization
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)   ! get the process id
      ! ... map myid to (myidx, myidy) and find the ids of the
      ! neighbouring processes (virtual topology) ...
      call MPI_SEND(vb, nx*2, MPI_REAL8, receiverid, tag,
     &              MPI_COMM_WORLD, ierr)              ! send data
      call MPI_RECV(va, nx*2, MPI_REAL8, senderid, tag,
     &              MPI_COMM_WORLD, status, ierr)      ! receive data
      call MPI_FINALIZE(ierr)           ! finalize the parallelization
CPU Virtual Topology
1. Each process has a unique ID.
2. Each process has one or more neighbours.
3. The CPUs can be arranged as a one- or multi-dimensional array.
4. The topology should be as simple as possible (see the sketch below).
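As an illustration (not from the original slides), MPI can build such a topology with MPI_CART_CREATE; the 2 x 5 decomposition and all variable names below are assumptions for the sketch:

      include 'mpif.h'
      integer ierr, myid, comm2d, dims(2), coords(2)
      integer left, right
      logical periods(2)
      call MPI_INIT(ierr)
      dims(1) = 2                   ! blocks in x
      dims(2) = 5                   ! blocks in y
      periods = .false.             ! non-periodic boundaries
      ! create a 2 x 5 Cartesian topology, get this process's coordinates
      call MPI_CART_CREATE(MPI_COMM_WORLD, 2, dims, periods, .true.,
     &                     comm2d, ierr)
      call MPI_COMM_RANK(comm2d, myid, ierr)
      call MPI_CART_COORDS(comm2d, myid, 2, coords, ierr)
      ! ranks of the x-direction neighbours (MPI_PROC_NULL at the edges)
      call MPI_CART_SHIFT(comm2d, 0, 1, left, right, ierr)
      call MPI_FINALIZE(ierr)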
MPI Communication
Point-to-point: one CPU to one CPU.
Collective: one to (or from) many: broadcast, scatter, gather, reduce, etc.
Blocking: the call returns only once the message buffer can safely be reused.
Non-blocking: the call returns immediately; completion must be checked later (see the sketch below).
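A minimal sketch of the blocking vs. non-blocking difference (not from the original slides; the buffer, destination rank, and tag are assumptions):

      include 'mpif.h'
      integer ierr, req, status(MPI_STATUS_SIZE)
      real*8 buf(100)
      call MPI_INIT(ierr)
      buf = 1.0d0
      ! blocking: returns once buf may safely be reused
      call MPI_SEND(buf, 100, MPI_REAL8, 1, 99, MPI_COMM_WORLD, ierr)
      ! non-blocking: returns immediately, so work can overlap communication
      call MPI_ISEND(buf, 100, MPI_REAL8, 1, 99, MPI_COMM_WORLD,
     &               req, ierr)
      ! ... do other work here, but do not modify buf ...
      call MPI_WAIT(req, status, ierr)   ! after this, buf may be reused
      call MPI_FINALIZE(ierr)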
Run the MPI code
Compile:
– mpif77 -O3 *.f -o mpimod4.e
Start the MPD daemon:
– mpdboot
Run the code:
– mpirun -n 7 ./mpimod4.e
CUDA: what's next? GPU supercomputing.
Like OpenMP, it is a do-loop based method: each compute-intensive do-loop is rewritten as a CUDA kernel subroutine that the GPU executes in parallel.
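A minimal sketch of this do-loop-to-kernel mapping (not from the original slides; it assumes CUDA Fortran as supported by the PGI compiler, and all names are illustrative):

      module kernels
      contains
        attributes(global) subroutine scale(a, c, n)
          real*8 :: a(*), c(*)
          integer, value :: n
          integer :: i
          i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
          if (i <= n) c(i) = 2.0d0*a(i)   ! the body of the old do-loop
        end subroutine
      end module

      program main
        use cudafor
        use kernels
        integer, parameter :: n = 1024
        real*8 :: a(n), c(n)
        real*8, device :: a_d(n), c_d(n)
        a = 1.0d0
        a_d = a                           ! copy host -> device
        ! launch one GPU thread per array element
        call scale<<<(n+255)/256, 256>>>(a_d, c_d, n)
        c = c_d                           ! copy device -> host
        write(*,*) c(1)
      end program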
Summary
Three levels of parallelization: compiler auto-parallelization, OpenMP, MPI.
– Employment: from easy (compiler) to difficult (MPI).
– Scalability: from inefficient (compiler) to efficient (MPI)?
Two principles:
– do-loop based parallelization
– message passing