High Performance Computation --- A Practical Introduction
Chunlin Tian, NAOC, Beijing, 2011
Outline
– Parallelization techniques
  – OpenMP: do-loop based
  – MPI: communication
  – Auto-parallelization, CUDA
– Remarks:
  – This is an introductory-level overview.
  – It is NOT a comprehensive introduction.
Introduction
Speeding up a computation involves mathematics, physics, and computation.
Hardware:
– number of CPUs
– size of memory
– CPU: multi-processor vs. cluster; GPU
– memory: distributed vs. shared
Software:
– auto-parallelization by the compiler
– OpenMP
– MPI
– CUDA
Shared vs. Distributed
Hardware: desktop (shared memory) vs. supercomputer (distributed memory)
Software: code written for distributed memory also runs on shared-memory machines.
Auto-parallelization
Easy to employ:
– set an environment variable: setenv OMP_NUM_THREADS 2
– compiler options: pgf77 -mp -static ... or ifort -parallel ...
Not smart enough:
– only efficient for dual-core CPUs
– sometimes even slower than a single thread
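As a hedged illustration (not in the original slides), the auto-parallelizer looks for do-loops whose iterations are independent of one another; a simple loop like the following, compiled with e.g. ifort -parallel, may be distributed over threads automatically:

      program autopar
      real*8 a(1000000), b(1000000)
      integer i
      b = 1.0d0
      do i = 1, 1000000
        a(i) = 2.0d0*b(i) + 1.0d0   ! iterations are independent
      enddo
      write(*,*) a(1)
      end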
OpenMP: Introduction
Open Multi-Processing:
– an API supporting multi-platform shared-memory multiprocessing programming
– consists of a set of compiler directives, library routines, and environment variables
History:
– 1997, version 1.0 in Fortran
– 1998, version 1.0 in C, C++
– 2000, version 2.0 in Fortran
– 2002, version 2.0 in C, C++
– 2005, version 2.5 in Fortran, C, C++
– 2008, version 3.0 in Fortran, C, C++ ...
Compilers: GNU, Intel, IBM, PGI, MS, ...
Coding with OpenMP
Step 1: define the parallel region.
Step 2: declare the data-sharing types of the variables (shared or private).
Step 3: mark the do-loops to be parallelized.
Remarks:
– You can parallelize your code incrementally, part by part.
– The number of parallel regions should be as small as possible.
Example of OpenMP code

!$omp parallel
!$omp& default (shared)
!$omp& private (tmp)
!$omp do
      do i=1,nx
        tmp = a(i)**2 + b(i)**2
        tmp = sqrt(tmp)
        c(i) = a(i)/tmp
        d(i) = b(i)/tmp
      enddo
!$omp end do
!$omp single
      write(*,*) maxval(c), maxval(b)
!$omp end single
!$omp do
      do j=1,ny
        tmp = a(j)**2 + b(j)**2
        tmp = sqrt(tmp)
        c(j) = b(j)/tmp
        d(j) = a(j)/tmp
      enddo
!$omp end do
!$omp end parallel
Run the OpenMP code
Set the environment variable:
– setenv OMP_NUM_THREADS 4
Compile and run:
– ifort -openmp -intel-static *.f -o openbbs1.e
– ./openbbs1.e
Scalability of OpenMP code
Ideally the speed-up would be linear in the number of threads, but initializing, finalizing, and synchronizing the threads all take time, so the measured speed-up falls below the ideal.
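This behaviour can be quantified (a hedged note, not in the original slides) by Amdahl's law: if a fraction p of the runtime is parallelizable, the speed-up on N threads is at most S(N) = 1 / ((1 - p) + p/N). For example, with p = 0.9 the speed-up can never exceed 10, however many threads are used.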
MPI
Message Passing Interface:
– a specification for an API that allows many computers to communicate with one another
– a language-independent protocol, programmer interface, and semantic specification
History:
– 1994 May, version 1.0, the final report of the MPI Forum
– 1995 June, version 1.1
– 1997 July, version 1.2 (MPI-1) and version 2.0 (MPI-2)
– 2008 May, version 1.3
– 2008 June, version 2.1
– 2009 Sept., version 2.2
Remarks:
– Open MPI ≠ OpenMP
– implementations: MPICH, HP MPI, Intel MPI, MS MPI, ...
Coding with MPI
1: determine the number of blocks (domain decomposition)
2: define a virtual CPU topology
3: define the parallel region
4: assign tasks to the different processes
5: handle the communication between processes
6: manage the processes: master-slave or non-master
Example of MPI coding

      include 'mpif.h'
      integer ierr, myid, tag, status(MPI_STATUS_SIZE)
      nx = 100                          ! number of grid points
      ny = 100
      mx = 2                            ! number of blocks
      my = 5
      call MPI_INIT(ierr)               ! initialize the parallelization
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)   ! get the process id
      ! ... map myid to (myidx, myidy) and find the ids of the
      ! neighbouring processes (virtual topology) ...
      call MPI_SEND(vb, nx*2, MPI_REAL8, receiverid, tag,
     &              MPI_COMM_WORLD, ierr)              ! send data
      call MPI_RECV(va, nx*2, MPI_REAL8, senderid, tag,
     &              MPI_COMM_WORLD, status, ierr)      ! receive data
      call MPI_FINALIZE(ierr)           ! finalize the parallelization
CPU Virtual Topology
1. Each process has a unique ID.
2. Each process has one or more neighbours.
3. The CPUs can be arranged as a one- or multi-dimensional array.
4. The topology should be as simple as possible (see the sketch below).
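As an illustration (not from the original slides), MPI can build such a topology with MPI_CART_CREATE; the 2 x 5 decomposition and all variable names below are assumptions for the sketch:

      include 'mpif.h'
      integer ierr, myid, comm2d, dims(2), coords(2)
      integer left, right
      logical periods(2)
      call MPI_INIT(ierr)
      dims(1) = 2                   ! blocks in x
      dims(2) = 5                   ! blocks in y
      periods = .false.             ! non-periodic boundaries
      ! create a 2 x 5 Cartesian topology, get this process's coordinates
      call MPI_CART_CREATE(MPI_COMM_WORLD, 2, dims, periods, .true.,
     &                     comm2d, ierr)
      call MPI_COMM_RANK(comm2d, myid, ierr)
      call MPI_CART_COORDS(comm2d, myid, 2, coords, ierr)
      ! ranks of the x-direction neighbours (MPI_PROC_NULL at the edges)
      call MPI_CART_SHIFT(comm2d, 0, 1, left, right, ierr)
      call MPI_FINALIZE(ierr)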
MPI Communication
Point-to-point: one CPU to one CPU.
Collective: one to (or from) many: broadcast, scatter, gather, reduce, etc.
Blocking: the call returns only once the message buffer can safely be reused.
Non-blocking: the call returns immediately; completion must be checked later (see the sketch below).
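A minimal sketch of the blocking vs. non-blocking difference (not from the original slides; the buffer, destination rank, and tag are assumptions):

      include 'mpif.h'
      integer ierr, req, status(MPI_STATUS_SIZE)
      real*8 buf(100)
      call MPI_INIT(ierr)
      buf = 1.0d0
      ! blocking: returns once buf may safely be reused
      call MPI_SEND(buf, 100, MPI_REAL8, 1, 99, MPI_COMM_WORLD, ierr)
      ! non-blocking: returns immediately, so work can overlap communication
      call MPI_ISEND(buf, 100, MPI_REAL8, 1, 99, MPI_COMM_WORLD,
     &               req, ierr)
      ! ... do other work here, but do not modify buf ...
      call MPI_WAIT(req, status, ierr)   ! after this, buf may be reused
      call MPI_FINALIZE(ierr)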
Run the MPI code
Compile:
– mpif77 -O3 *.f -o mpimod4.e
Start the MPD daemon:
– mpdboot
Run the code:
– mpirun -n 7 ./mpimod4.e
CUDA: what's next? GPU supercomputing.
Like OpenMP, it is a do-loop based method: each compute-intensive do-loop is rewritten as a CUDA kernel subroutine that the GPU executes in parallel.
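A minimal sketch of this do-loop-to-kernel mapping (not from the original slides; it assumes CUDA Fortran as supported by the PGI compiler, and all names are illustrative):

      module kernels
      contains
        attributes(global) subroutine scale(a, c, n)
          real*8 :: a(*), c(*)
          integer, value :: n
          integer :: i
          i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
          if (i <= n) c(i) = 2.0d0*a(i)   ! the body of the old do-loop
        end subroutine
      end module

      program main
        use cudafor
        use kernels
        integer, parameter :: n = 1024
        real*8 :: a(n), c(n)
        real*8, device :: a_d(n), c_d(n)
        a = 1.0d0
        a_d = a                           ! copy host -> device
        ! launch one GPU thread per array element
        call scale<<<(n+255)/256, 256>>>(a_d, c_d, n)
        c = c_d                           ! copy device -> host
        write(*,*) c(1)
      end program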
Summary
Three levels of parallelization: compiler auto-parallelization, OpenMP, MPI.
– Employment: from easy (compiler) to difficult (MPI).
– Scalability: from inefficient (compiler) to efficient (MPI)?
Two principles:
– do-loop based parallelization
– message passing