Porting of 64-bit MPI and Distributed Parallel Math/Numerical Libraries on the new SP
Shuxia Zhang, Amidu Oloso, Birali Runesha
University of Minnesota, Supercomputing Institute for Digital Simulation and Advanced Computation
October 7, 1999
Outline
- Introduction
  - What does 64-bit computing mean?
  - Why do we need the 64-bit libraries?
- Rebuild of the 64-bit libraries on the new SP
  - Layering structure
  - Tricky tips
- How to use these libraries
- Performance evaluation and existing problems
- Conclusions
- References
Introduction

64-bit computing means:
- Doubling the word size from 32 to 64 bits
- Operations on 64-bit data: long integers, precision extension (?)
- 64-bit addressing:
  - a 32-bit application is limited to 2 GB of virtual address space
  - a 64-bit application can address up to 16 EB (about 1.8x10**19 bytes) of virtual memory
  - very large buffers can be used to load a big (> 2 GB) data file
  - very large memory (>> 2 GB) can be used by a running executable
- Enhanced performance relative to 32-bit computing:
  - loading large data files (databases -- an industry driver)
  - 64-bit integer computation (see the sketch after this list)
These are the benefits of 64-bit computing that a programmer sees.
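As a concrete illustration of the long-integer point above, here is a short FORTRAN sketch (written for this summary, not taken from the original slides) that computes a byte count too large for a 32-bit integer:

      program longint
c     Illustrative sketch only: the size in bytes of a 4 GB array
c     overflows a 32-bit INTEGER*4 (maximum 2147483647) but fits
c     easily in a 64-bit INTEGER*8.
      integer*8 nbytes
      nbytes = 4
      nbytes = nbytes * 1024 * 1024 * 1024
      print *, 'bytes in a 4 GB array: ', nbytes
      end

It prints 4294967296, a value no INTEGER*4 variable can hold.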
Motivation

  64-bit computing requires:          The SP provides:
  64-bit hardware                     Yes
  64-bit operating system             Yes
  64-bit supporting software          No MPI yet *
  Large physical memory (> 2 GB)      Yes **

  *  IBM has not announced when it will implement 64-bit MPI.
  ** We will have 17 NightHawk nodes, each with 4 CPUs and 16 GB of memory.
The 64-bit distributed math/numerical parallel libraries

The User Support Staff has ported the following commonly used parallel libraries to the new SP:
- BLAS, LAPACK (now 64-bit; the 64-bit ESSL can also be linked)
- MPICH, BLACS, PBLAS, SCALAPACK, PETSc

Objectives:
- to support 64-bit distributed parallel computing on the new SP
- to use the memory resources effectively
- to conduct performance evaluation
When should 64-bit computing be used?

It should be used when the application needs more than 32 bits, i.e., 64-bit long integers, a large data file (> 2 GB), or a large memory image (> 2 GB).

Comments: the new 64-bit SP remains compatible with 32-bit computing. Applications that do not require the above properties should not be changed.
Basic functions

  Library     Capability description                               Computing mode
  MPICH       Message Passing Interface                            distributed parallel
  BLAS        Basic Linear Algebra Subprograms                     sequential
  LAPACK      Linear Algebra PACKage                               sequential
  BLACS       Basic Linear Algebra Communication Subprograms       MPI-based
  PBLAS       Parallel version of BLAS                             MPI-based
  SCALAPACK   Dense and band matrices, large sparse eigenvalue     MPI-based
              problems, sparse direct systems, preconditioners
              for large sparse iterative solvers
  PETSc       Data structures and routines for partial             MPI-based
              differential equations

Fortran and C, single and double precision, complex and complex16 versions were built.
Layering structure

MPICH: a portable implementation of MPI

  MPI                A set of function definitions in terms of user-callable MPI routines.
  ADI                Specifies messages to be sent or received, moves data between the API
                     and the message-passing hardware, manages lists of pending messages,
                     and provides basic functions about the execution environment.
  Channel Interface  Transfers data from one process's address space to another's;
                     implemented via p4 or shmem.
Layering structure: ScaLAPACK
ScaLAPACK is layered on top of PBLAS and LAPACK; PBLAS rests on the BLACS and BLAS; the BLACS in turn use the message-passing layer (here, the 64-bit MPICH).
Layering structure: PETSc
PETSc is layered on top of the message-passing layer (here, the 64-bit MPICH) and the BLAS/LAPACK routines.
Examples of using the 64-bit libraries
- BLACS: blacs_hello_world
- ScaLAPACK: example of solving Ax = b via the ScaLAPACK routine PDGESV
- PETSc: example of solving a Helmholtz equation
A minimal sketch of the BLACS case follows this list.
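As a reference point, the following is a minimal sketch of what the blacs_hello_world example might look like; it is an illustration written for this summary, and the actual example files installed with the 64-bit libraries may differ:

      program blacs_hello
c     Illustrative sketch of a BLACS "hello world": set up a 1 x nprocs
c     process grid and have every process report its grid coordinates.
      integer iam, nprocs, ictxt, nprow, npcol, myrow, mycol
      call blacs_pinfo( iam, nprocs )
      nprow = 1
      npcol = nprocs
      call blacs_get( -1, 0, ictxt )
      call blacs_gridinit( ictxt, 'Row-major', nprow, npcol )
      call blacs_gridinfo( ictxt, nprow, npcol, myrow, mycol )
      print *, 'Hello from process ', iam, ' of ', nprocs,
     &         ' at grid position (', myrow, ',', mycol, ')'
      call blacs_gridexit( ictxt )
      call blacs_exit( 0 )
      end

Such a program would be compiled with the mpif77 script described later and linked against the BLACS library (the exact library name, e.g. -lblacs, is an assumption here).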
Rebuilding on SP: configuration of 64-bit MPICH

Configure options:
  -arch=rs6000
  -device=ch_p4 (or -device=ch_shmem)
  -noromio
  -noc++
  -rshnol
  -cc="xlc -q64"
  -cflags="-qlanglvl=ansi"
  -clinker="xlc -q64"
  -fc="xlf -q64"
  -flinker="xlf -q64"
  AR="ar -X64"

Comments: the combination "-device=ch_p4 -comm=shared" did not work with the 64-bit mpich library.
Rebuilding on SP: rebuild of 64-bit PETSc

1. Specify the compiler options and the paths and names of the required libraries, i.e., modify the base, base.O, and base.site files.
2. To fix errors and warnings that appear during compilation, try the following:
   - Comment out:
       BS_INCLUDE = -I/home/petsc/BlockSolve95/include
       BS_LIB = -L/home/petsc/BlockSolve95/lib/libO/${PETSC_ARCH} -lBS95
   - Remove -DHAVE_BLOCKSOLVE from the line:
       PCONF = -DHAVE_ESSL -DHAVE_BLOCKSOLVE
   - Modify "petscconf.h": change "#define _XOPEN_SOURCE" to
       #if !defined(_XOPEN_SOURCE)
       #define _XOPEN_SOURCE
       #endif
     and add
       #define HAVE_64BITS
How to create 64-bit executables?

Load the "lib64" module:
  module add lib64
For a FORTRAN code (a minimal mpi_code.f is sketched below):
  mpif77 -qarch=pwr3 -O3 -qstrict mpi_code.f
For a C code:
  mpicc -q64 -qarch=pwr3 -O3 -qstrict mpi_code.c
If the code uses the SCALAPACK library:
  mpif77 -qarch=pwr3 -O3 -qstrict SCALAPACK.f -lscalapack
If the code uses the PETSc library:
  mpif77 -qarch=pwr3 -O3 -qstrict PETSc.f -lpetscmat -lpetsc ...
Comments: please note that the "mpif77" and "mpicc" scripts created for the 64-bit MPICH already contain the "-q64" option and the links to the mpich libraries.
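The mpi_code.f in the FORTRAN line above can be any MPI program; as a stand-in (written for this summary, not taken from the slides), the simplest case is:

      program mpi_code
c     Minimal MPI program used here only to illustrate the compile and
c     run steps: each task reports its rank.
      include 'mpif.h'
      integer ierr, rank, nprocs
      call mpi_init( ierr )
      call mpi_comm_rank( mpi_comm_world, rank, ierr )
      call mpi_comm_size( mpi_comm_world, nprocs, ierr )
      print *, 'Hello from task ', rank, ' of ', nprocs
      call mpi_finalize( ierr )
      end

Compiling it with the mpif77 command shown above and running it as described on the next slide should print one line per MPI task.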
How to run 64-bit executables?

MPI jobs:
  module add lib64
  mpirun.ch_p4 -p4pg host-list a.out
"host-list" in the above example is a file containing:
  sp71css0 0 /homes/sp7/szhang/a.out
  sp68css0 1 /homes/sp7/szhang/a.out
  sp69css0 1 /homes/sp7/szhang/a.out
Submit 64-bit MPI jobs via LoadLeveler

Example of a LoadLeveler script file for a 64-bit MPICH_shmem job:

  #!/bin/csh
  # @ initialdir = /homes/sp9/szhang
  # @ job_type = parallel
  # @ input = input_file
  # @ output = JOB$(jobid).out
  # @ error = JOB$(jobid).err
  # @ node = 1
  # @ node_usage = not_shared
  # @ class = 10_hour
  # @ wall_clock_limit = 10:00:00
  # @ checkpoint = no
  # @ queue
  module add mpich64_shmem
  mpirun -np 2 a.out
Submit 64-bit mpi_p4 jobs via LoadLeveler

A 64-bit MPICH job can be run in batch. An example script file is:

  #!/bin/csh
  # @ initialdir = /homes/sp9/szhang
  # @ job_type = parallel
  # @ input = input_file
  # @ output = JOB$(jobid).out
  # @ error = JOB$(jobid).err
  # @ node = 2
  # @ tasks_per_node = 1
  # @ node_usage = shared
  # @ network.MPI = switch,shared,ip
  # @ class = 10_hour
  # @ wall_clock_limit = 10:00:00
  # @ checkpoint = no
  # @ queue
  set mydir = "/homes/sp9/szhang"
  set host_name = "Host_file3"
  @ nodec = 0
  foreach nodei ($LOADL_PROCESSOR_LIST)
    @ nodec++
    set nodet = `echo $nodei`
    if ($nodec == 1) then
      echo $nodet 0 ${mydir}/a.out > $host_name
    else
      echo $nodet 1 ${mydir}/a.out >> $host_name
    endif
  end
  module add mpich64
  mpirun -p4pg $host_name a.out

Please note that only IP mode can be used on the HPS here.
Performance Evaluation

64-bit MPICH vs. IBM native MPI in IP mode:
- Case 1: measurement of bandwidth (MB/s), 32-bit vs. 64-bit
- Case 2: timing results (wall-clock seconds) of running a coarse-granularity MPI code for several sample sizes, 32-bit vs. 64-bit
- Case 3: timing results (wall-clock seconds) of running a fine-granularity MPI code (200 MB memory), 32-bit vs. 64-bit
[bandwidth and timing tables]
In cases 2 and 3 the compiling options "-O3 -qstrict -qarch=pwr3" were used.
Performance Evaluation

64-bit MPICH configured in shared-memory mode (via ch_shmem):
- Case 1: bandwidth (MB/s) of blocking send-receive, 32-bit (US mode) vs. 64-bit
- Case 2: bandwidth (MB/s) of asynchronous send-receive
- Case 3: timing results (wall-clock seconds) of running a CFD MPI code
[bandwidth and timing tables]
Performance Evaluation

Is 64-bit computing faster than 32-bit?

64-bit floating-point computation: a test case of running a FORTRAN CFD code compiled with "-qautodbl=dbl4 -O3 -qstrict -qarch=pwr3". Timing results (wall-clock time in seconds):

  memory size    -q32     -q64
  2 GB           260      280
  2.5 GB         --       290

64-bit integer computation: the same CFD code compiled with "-qautodbl=dbl4 -O3 -qstrict -qarch=pwr3 -qintsize=8" took 284 s for the 2 GB case.

Note: the computations were done on one Winterhawk node, but with different compiling options.
Existing Problems

- Memory addressing: unable to run full tests, since the NightHawk nodes are not yet available.
- Precision extension: Real*8 and Real*16 have long been available on 32-bit systems. On the new SP, the 64-bit build did not improve performance over the 32-bit build for Real*8 and Real*16 computation. Why?
- Exponent range of floating-point numbers: unable to represent a real number with magnitude below 10**(-400) or above 10**400. Why? The Cray C90 can handle real values larger than 10**600. (A small probe program is sketched after this list.)
- 64-bit debugging tools?
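One way to probe the precision-extension and exponent-range questions above is to ask the compiler what each real kind actually provides. The following sketch (written for this summary, not from the slides) uses the Fortran 90 inquiry intrinsics HUGE and RANGE:

      program realrange
c     Illustrative sketch: report the largest representable value and
c     the decimal exponent range of REAL*8 and REAL*16 on the SP.
      real*8  d
      real*16 q
      print *, 'REAL*8   huge = ', huge(d), '  range = ', range(d)
      print *, 'REAL*16  huge = ', huge(q), '  range = ', range(q)
      end

If RANGE reports roughly 308 for both kinds, values such as 10**400 simply fall outside the representable exponent range of either kind, which would explain the behaviour noted above.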
Conclusions

- The 64-bit MPICH, BLACS, PBLAS, SCALAPACK, and PETSc libraries have been ported to the new SP at the University of Minnesota Supercomputing Institute. MPICH has been configured as the 64-bit message-passing interface; BLACS, PBLAS, SCALAPACK, and PETSc were built on top of MPICH.
- Benchmark comparisons show an encouraging future for using this public-domain software in distributed 64-bit computing:
  - The performance of 64-bit MPICH configured in shared-memory mode can be better than that of the native MPI.
  - For coarse-granularity MPI applications, the 64-bit MPICH gives nearly the same performance as the IBM native MPI.
  - For very fine-granularity applications, the 64-bit MPICH can be slower than the native MPI by a factor of two.
References
- Online tutorial: www.msi.umn.edu/user_support/
- BLAS: www.netlib.org/blas
- BLACS: www.netlib.org/blacs
- LAPACK: www.netlib.org/lapack
- MPICH: www-unix.mcs.anl.gov/mpi/mpich
- PBLAS: www.netlib.org/scalapack (distributed with ScaLAPACK)
- PETSc: www-unix.mcs.anl.gov/petsc
- SCALAPACK: www.netlib.org/scalapack