Porting of 64-bit MPI and Distributed Parallel Math/Numerical Libraries on the new SP

Shuxia Zhang, Amidu Oloso, Birali Runesha
University of Minnesota, Supercomputing Institute for Digital Simulation and Advanced Computation
October 7, 1999

Outline
- Introduction
  - What does 64-bit computing mean?
  - Why do we need the 64-bit libraries?
- Rebuild of the 64-bit libraries on the new SP
  - Layering structure
  - Tricky tips
- How to use these libraries
- Performance evaluation and existing problems
- Conclusions
- References

Introduction
64-bit computing means:
- Doubling the number of bits from 32 to 64
- Operations on 64-bit data
  - long integers
  - precision extension (?)
- 64-bit addressing
  - A 32-bit application is limited to 2 GB of virtual addressable memory
  - A 64-bit application can address up to 16 EB (2**64 bytes) of virtual memory
  - Very large buffers may be used for loading a big (> 2 GB) data file
  - Very large memory (>> 2 GB) can be used when running an executable
- Enhanced performance relative to 32-bit computing
  - loading large data files (database -- an industry driver)
  - 64-bit integer computation
These are the benefits of 64-bit computing that a programmer sees.
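As an illustration (not part of the original slides), here is a minimal Fortran sketch of the two features listed above: a 64-bit integer and an array too large for a 32-bit address space. The program name, array size and kind selection are hypothetical; it assumes the compiler is invoked in 64-bit mode, e.g. xlf with -q64.

  program big64
    implicit none
    ! integer kind with at least 18 decimal digits (64 bits on this platform)
    integer, parameter :: i8 = selected_int_kind(18)
    integer(kind=i8) :: n
    ! about 3.2 GB of double precision data -- addressable only by a 64-bit executable
    double precision, allocatable :: a(:)
    integer :: ierr

    n = 400000000_i8                ! 4x10**8 elements, 8 bytes each
    allocate(a(n), stat=ierr)
    if (ierr /= 0) stop 'allocation failed (requires a 64-bit build)'
    a = 1.0d0
    print *, 'elements =', n, '  sum =', sum(a)
    deallocate(a)
  end program big64

A 32-bit build of the same code would fail at the allocate statement, since the array exceeds the 2 GB virtual address limit.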

Motivation

  64-bit computing requires:        SP provides:
  64-bit hardware                   Yes
  64-bit operating system           Yes
  64-bit supporting software        *No MPI yet
  Large physical memory (> 2 GB)    *Yes

*It is not known when IBM will provide a 64-bit MPI.
*We will have 17 NightHawk nodes, each with 4 CPUs and 16 GB of memory.

The 64-bit distributed math/numerical parallel libraries
The User Support staff have ported the following commonly used parallel libraries to the new SP:
- BLAS, LAPACK (the 64-bit ESSL can now also be linked)
- MPICH, BLACS, PBLAS, ScaLAPACK, PETSc
Objectives:
- to support 64-bit distributed parallel computing on the new SP
- to use the memory resources effectively
- to conduct performance evaluation

When should 64-bit computing be used?
It should be used when the application needs more than 32 bits, i.e.
- 64-bit long integers, or
- a large data file (> 2 GB), or
- a large memory (> 2 GB)
Comments: the new 64-bit SP remains compatible with 32-bit computing. Applications that do not require any of the above properties should not be changed.

Basic functions

  Library Name   Capability Description                             Computing Mode
  MPICH          Message Passing Interface                          distributed parallel
  BLAS           Basic Linear Algebra Subprograms                   sequential
  LAPACK         Linear Algebra PACKage                             sequential
  BLACS          Basic Linear Algebra Communication Subprograms     MPI-based
  PBLAS          Parallel version of BLAS                           MPI-based
  ScaLAPACK      dense and band matrices, large sparse eigenvalue   MPI-based
                 problems, sparse direct systems, preconditioners
                 for large sparse iterative solvers
  PETSc          data structures and routines for partial           MPI-based
                 differential equations

Fortran and C, single and double precision, complex and complex16 versions were built.

Layering structure
MPICH: a portable implementation of MPI
- MPI: a set of function definitions in terms of user-callable MPI routines
- ADI (abstract device interface): specifies the messages to be sent or received, moves data between the API and the message-passing hardware, manages lists of pending messages, and provides basic functions about the execution environment
- Channel interface: transfers data from one process's address space to another's; implemented via P4 or shmem
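The top layer is the one application code sees. As a reminder of what those user-callable routines look like, here is a minimal Fortran point-to-point sketch (an illustration, not taken from the slides; it assumes the job is started with at least two MPI processes):

  program mpi_sendrecv
    implicit none
    include 'mpif.h'
    integer :: rank, nprocs, ierr, status(MPI_STATUS_SIZE)
    double precision :: val

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

    if (rank == 0) then
      val = 3.14d0
      ! user-callable MPI routine; the ADI and channel layers move the bytes
      call MPI_Send(val, 1, MPI_DOUBLE_PRECISION, 1, 99, MPI_COMM_WORLD, ierr)
    else if (rank == 1) then
      call MPI_Recv(val, 1, MPI_DOUBLE_PRECISION, 0, 99, MPI_COMM_WORLD, status, ierr)
      print *, 'rank 1 received', val
    end if

    call MPI_Finalize(ierr)
  end program mpi_sendrecv

Whether the bytes travel through ch_p4 (the P4 device) or ch_shmem (shared memory) is decided when MPICH is configured, not in the application code.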

Layering structure: ScaLAPACK

Layering structure: PETSc

Examples of using the 64-bit libraries
- BLACS hello world: www.msi.umn.edu/sp/tutorials/examples/blacs.f
- ScaLAPACK example of solving Ax=b via the ScaLAPACK routine PDGESV: www.msi.umn.edu/sp/tutorials/examples/Scalapack.f
- PETSc example of solving a Helmholtz equation: www.msi.umn.edu/sp/tutorials/examples/PETSc.f
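For readers without access to the files above, the following is a minimal BLACS hello-world sketch in Fortran. It is not the blacs.f file referenced above, just an illustration built from standard BLACS calls:

  program blacs_hello
    implicit none
    integer :: iam, nprocs, ictxt, nprow, npcol, myrow, mycol

    ! query my process id and the total number of processes
    call blacs_pinfo(iam, nprocs)

    ! obtain the default system context and set up a 1 x nprocs process grid
    call blacs_get(-1, 0, ictxt)
    nprow = 1
    npcol = nprocs
    call blacs_gridinit(ictxt, 'Row-major', nprow, npcol)
    call blacs_gridinfo(ictxt, nprow, npcol, myrow, mycol)

    print *, 'Hello from process', iam, 'of', nprocs, &
             ' at grid position (', myrow, ',', mycol, ')'

    ! release the grid and shut down the BLACS
    call blacs_gridexit(ictxt)
    call blacs_exit(0)
  end program blacs_hello

On this system it would presumably be compiled with the mpif77 script and linked against the BLACS/ScaLAPACK libraries, as described in the "How to create 64-bit executables?" slide.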

Rebuilding on SP: configuration of 64-bit MPICH
Configure options:
  -arch=rs6000
  -device=ch_p4 or -device=ch_shmem
  -noromio
  -noc++
  -rshnol
  -cc="xlc -q64"
  -cflags="-qlanglvl=ansii"
  -clinker="xlc -q64"
  -fc="xlf -q64"
  -flinker="xlf -q64"
  -AR="ar -X64"
Comments: the combination "-device=ch_p4 -comm=shared" did not work with the 64-bit MPICH library.

Rebuilding on SP: rebuild of 64-bit PETSc
1. Specify the compiler options and the paths and names of the requested libraries, i.e. modify the base, base.O, and base.site files.
2. To fix errors and warnings that appear during compilation, you can try the following:
   - Comment out:
       BS_INCLUDE = -I/home/petsc/BlockSolve95/include
       BS_LIB = -L/home/petsc/BlockSolve95/lib/libO/${PETSC_ARCH} -lBS95
   - Remove -DHAVE_BLOCKSOLVE from the line:
       PCONF = -DHAVE_ESSL -DHAVE_BLOCKSOLVE
   - Modify "petscconf.h":
     change "#define _XOPEN_SOURCE" to
       #if !defined(_XOPEN_SOURCE)
       #define _XOPEN_SOURCE
       #endif
     and add
       #define HAVE_64BITS

How to create 64-bit executables?
Load the "lib64" module:
  module add lib64
For a Fortran code:
  mpif77 -qarch=pwr3 -O3 -qstrict mpi_code.f
For a C code:
  mpicc -q64 -qarch=pwr3 -O3 -qstrict mpi_code.c
If the code uses the ScaLAPACK library:
  mpif77 -qarch=pwr3 -O3 -qstrict SCALAPACK.f -lscalapack
If the code uses the PETSc library:
  mpif77 -qarch=pwr3 -O3 -qstrict PETSc.f -lpetscmat -lpetsc ...
Comments: note that the "mpif77" and "mpicc" scripts created for the 64-bit MPICH already contain the "-q64" option and the links to the MPICH libraries.

How to run 64-bit executables?
MPI jobs:
  module add lib64
  mpirun.ch_p4 -p4pg host-list a.out
"host-list" in the example above is a file containing:
  sp71css0 0 /homes/sp7/szhang/a.out
  sp68css0 1 /homes/sp7/szhang/a.out
  sp69css0 1 /homes/sp7/szhang/a.out

Submit 64-bit MPI jobs via LoadLeveler
Example of an LL script file for a 64-bit MPICH_shmem job:

  #!/bin/csh
  #@ initialdir = /homes/sp9/szhang
  #@ job_type = parallel
  #@ input = input_file
  #@ output = JOB$(jobid).out
  #@ error = JOB$(jobid).err
  #@ node = 1
  #@ node_usage = not_shared
  #@ class = 10_hour
  #@ wall_clock_limit = 10:00:00
  #@ checkpoint = no
  #@ queue

  module add mpich64_shmem
  mpirun -np 2 a.out

Submit a 64-bit mpi_p4 job via LoadLeveler
A 64-bit MPICH job can be run in batch. An example script_file is:

  #!/bin/csh
  #@ initialdir = /homes/sp9/szhang
  #@ job_type = parallel
  #@ input = input_file
  #@ output = JOB$(jobid).out
  #@ error = JOB$(jobid).err
  #@ node = 2
  #@ tasks_per_node = 1
  #@ node_usage = shared
  #@ network.MPI = switch,shared,ip
  #@ class = 10_hour
  #@ wall_clock_limit = 10:00:00
  #@ checkpoint = no
  #@ queue

  set mydir = "/homes/sp9/szhang"
  set host_name = "Host_file3"
  @ nodec = 0
  foreach nodei ($LOADL_PROCESSOR_LIST)
    @ nodec++
    set nodet = `echo $nodei`
    if ($nodec == 1) then
      echo $nodet 0 ${mydir}/a.out > $host_name
    else
      echo $nodet 1 ${mydir}/a.out >> $host_name
    endif
  end

  module add mpich64
  mpirun -p4pg $host_name a.out

Please note that only IP mode can be used on the HPS here.

Performance Evaluation: 64-bit MPICH vs. IBM native MPI in IP mode

Case 1: measured bandwidth (MB/s)
  32-bit: 70    64-bit: 42

Case 2: timing results (wall clock time) of running a coarse-granularity MPI code
  Sample size    32-bit     64-bit
  1.5x10**5      9:40       9:45
  10**6          1:03:20    1:00:12

Case 3: timing results (wall clock time) of running a fine-granularity MPI code
  Memory size    32-bit    64-bit
  200 MB         4:50      7:30

In cases 2 and 3, the compile options "-O3 -qstrict -qarch=pwr3" were used.

Performance Evaluation: 64-bit MPICH configured in shared-memory mode (via ch_shmem)

Case 1: bandwidth (MB/s) of blocking send-receive
  32-bit (US mode): 140/80    64-bit: 210

Case 2: bandwidth (MB/s) of asynchronous send-receive

Case 3: timing results (wall clock time) of running a CFD MPI code
  32-bit (US mode): 5:40    64-bit: 5:20

Performance Evaluation: is 64-bit computing faster than 32-bit?

64-bit floating-point computation: a test case of running a Fortran CFD code compiled with
  -qautodbl=dbl4 -O3 -qstrict -qarch=pwr3
Timing results (wall clock time in seconds):
  Memory size    -q32     -q64
  2 GB           260 s    280 s
  2.5 GB         --       290 s   (2.5 GB exceeds the 32-bit address limit)

64-bit integer computation: the same CFD code compiled with
  -qautodbl=dbl4 -O3 -qstrict -qarch=pwr3 -qintsize=8
  Memory size    -q32     -q64
  2 GB           428 s    284 s

Notes: the computations were done on one WinterHawk node, but with different compile options.

Existing Problems
- Memory addressing: unable to do full tests since the NightHawk nodes are not yet available.
- Precision extension: on 32-bit systems, Real*8 and Real*16 have been feasible. On the same new SP, the 64-bit build, compared to the 32-bit one, did not give improved performance for Real*8 and Real*16 computation. Why?
- Exponent range of floating-point numbers: unable to represent a real number with a value a < 10**(-400) or a > 10**400. Why? The Cray C90 can handle real variables > 10**600.
- 64-bit debugging tools?

Conclusions
- The 64-bit MPICH, BLACS, PBLAS, ScaLAPACK and PETSc libraries have been ported to the new SP at the UofM Supercomputing Institute. MPICH has been configured as the 64-bit message-passing interface; BLACS, PBLAS, ScaLAPACK and PETSc were built on top of MPICH.
- Benchmark comparisons show an encouraging future for using this public-domain software in distributed 64-bit computing:
  - The performance of 64-bit MPICH configured in shared-memory mode can be better than that of the native MPI.
  - For coarse-granularity MPI applications, the 64-bit MPICH gives nearly the same performance as the IBM native MPI.
  - For very fine-granularity applications, the 64-bit MPICH can be slower than the native MPI by a factor of 2.

References
- Online tutorials: www.msi.umn.edu/user_support/ and www.msi.umn.edu/sp/tutorials/64_bit_lib.html
- BLAS: www.netlib.org/blas
- BLACS: www.netlib.org/blacs and www.netlib.org/blacs/BLACS/Examples.html
- LAPACK: www.netlib.org/lapack
- MPICH: www-unix.mcs.anl.gov/mpi/mpich
- PBLAS: www.netlib.org/scalapack/faq.html#1.7
- PETSc: www-unix.mcs.anl.gov/petsc
- ScaLAPACK: www.netlib.org/scalapack