1 A Bioinformatics Introduction to Cluster Computing part I
By Andrew D. Boyd, MD Research Fellow Michigan Center for Biological Information Department of Psychiatry, University of Michigan Health System and Abhijit Bose, PhD Associate Director Michigan Grid Research and Infrastructure Development and Department of Electrical Engineering and Computer Science University of Michigan

2 Introduction What is parallel computing? Why go parallel?
What are some limits of parallel computing? Types of parallel computing: shared memory and distributed memory.

3 What is parallel computing?
Parallel computing: the use of multiple computers or processors working together on a common task. Each processor works on its section of the problem. Processors are allowed to exchange information with other processors. [Figure: a grid representing the problem to be solved, divided among CPUs 1-4, with neighboring CPUs exchanging boundary data in x and y.]

4 Why do parallel computing?
Limits of serial computing: available memory and performance. Parallel computing allows us to solve problems that don't fit on a single CPU and problems that can't be solved in a reasonable time, so we can run larger problems, run faster, and run more cases.

5 Types of parallelism Data parallel Task parallel
Data parallel: each processor performs the same task on different data (example: grid problems). Task parallel: each processor performs a different task (example: signal processing). Most applications fall somewhere on the continuum between these two extremes.

6 Basics of Data Parallel Programming
The same code will run on 2 CPUs. The program has an array of data to be operated on by 2 CPUs, so the array is split into two parts.

program.f:
if CPU=a then
  low_limit=1
  upper_limit=50
elseif CPU=b then
  low_limit=51
  upper_limit=100
end if
do I = low_limit, upper_limit
  work on A(I)
end do
...
end program

In effect, CPU A executes:
program.f:
low_limit=1
upper_limit=50
do I = low_limit, upper_limit
  work on A(I)
end do
end program

and CPU B executes:
program.f:
low_limit=51
upper_limit=100
do I = low_limit, upper_limit
  work on A(I)
end do
end program

7 Typical Task Parallel Application
Signal processing example: the data flows through a pipeline of tasks (FFT task, multiply task, inverse FFT task, normalize task). Use one processor for each task; more processors can be used if one task is overloaded.

8 Basics of Task Parallel Programming
The program has 2 tasks (a and b) to be done by 2 CPUs.

program.f:
initialize
...
if CPU=a then
  do task a
elseif CPU=b then
  do task b
end if
...
end program

In effect, CPU A executes:
program.f:
initialize
do task a
end program

and CPU B executes:
program.f:
initialize
do task b
end program

9 Limits of Parallel Computing
Theoretical upper limits (Amdahl's Law), practical limits, and other considerations such as the time to re-write code.

10 Theoretical upper limits
All parallel programs contain parallel sections and serial sections. The serial sections limit the parallel effectiveness; Amdahl's Law states this formally.

11 Amdahl's Law
Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors. With Fs the serial fraction of the code, Fp the parallel fraction (Fs + Fp = 1), and N the number of processors, the effect of multiple processors on run time is t(N) = (Fs + Fp/N) * t(1), and the effect on speedup is
S = 1 / (Fs + Fp/N)
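A quick worked example (the numbers are illustrative, not from the slides): with Fp = 0.9, Fs = 0.1, and N = 100 processors,
S = 1 / (0.1 + 0.9/100) = 1 / 0.109 ≈ 9.2,
so even 100 processors give less than a 10x speedup, and as N grows without bound the speedup can never exceed 1/Fs = 10.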

12 Illustration of Amdahl's Law
It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behaviour of your code before doing production runs using large numbers of processors. [Figure: speedup versus number of processors (up to 250) for fp = 1.000, 0.999, 0.990, and 0.900; the lower the parallel fraction, the sooner the curve flattens.]

13 Amdahl’s Law Vs. Reality
Amdahl's Law provides a theoretical upper limit on parallel speedup, assuming that there are no costs for communications. In reality, communications will result in a further degradation of performance. [Figure: speedup versus number of processors for fp = 0.99, comparing the Amdahl's Law curve with measured reality.]

14 Some other considerations
Writing an effective parallel application is difficult: load balance is important, communication can limit parallel efficiency, and serial time can dominate. Is it worth your time to rewrite your application? Do the CPU requirements justify parallelization? Will the code be used just once? Super-linear speedup is occasionally seen, from cache effects as the number of processors increases or from randomised algorithms.

15 Shared and Distributed memory
Distributed memory: each processor has its own local memory, and message passing must be done to exchange data between processors (examples: CRAY T3E, IBM SP). Shared memory: a single address space; all processors have access to a pool of shared memory (example: CRAY T90). Methods of memory access: bus or crossbar. [Figure: distributed-memory nodes, each a processor with local memory, connected by a network; shared-memory processors connected to a common memory over a bus.]

16 Pure Shared Memory Machines
T90, C90, YMP, XMP, SV1; SGI O2000 (sort of); HP Exemplar (sort of); VAX 780; various Suns; various Wintel boxes; BBN GP 1000 Butterfly.

17 Programming methodologies
Standard Fortran or C, letting the compiler do it for you. Directives can give hints to the compiler (OpenMP). Libraries with thread-like methods: explicitly start multiple tasks, each given its own section of memory, using shared variables for communication. Message passing can also be used, but is not common.

18 Example of Shared-Memory Programming
Program: calculate the value of pi.

program calc_pi
implicit none
integer n,i
double precision w,x,sum,pi,f,a
double precision start, finish, timef
f(a) = 4.0 / (1.0 + a*a)
n =
start = timef()
w = 1.0/n
sum = 0.0

19 Shared-Memory Portion:
!$OMP PARALLEL PRIVATE(x,i), SHARED(w,n), &
!$OMP REDUCTION(+:sum)
!$OMP DO
do i=1,n
  x = w * (i - 0.5)
  sum = sum + f(x)
end do
!$OMP END DO
!$OMP END PARALLEL
pi = w * sum
finish = timef()
print*,"value of pi, time taken:"
print*,pi,finish-start
end

20 Distributed shared memory (NUMA)
Consists of N processors and a global address space. All processors can see all memory, each processor has some amount of local memory, and access to the memory of other processors is slower: Non-Uniform Memory Access. [Figure: two processor/memory/bus nodes joined together.]

21 Programming methodologies
Same as shared memory: standard Fortran or C with the compiler doing it for you, directives giving hints to the compiler (OpenMP), or libraries with thread-like methods (explicitly start multiple tasks, each given its own section of memory, using shared variables for communication). Message passing can also be used.

22 DSM NUMA Machines SGI O2000 HP-Exemplar

23 Distributed Memory Each of N processors has its own memory
Memory is not shared; communication occurs using messages. [Figure: processor/memory pairs connected by a network.]

24 Communication networks
Custom: many manufacturers offer custom interconnects, e.g. Myrinet 2000. Off the shelf: Ethernet (Force10, Extreme Networks), ATM, HIPPI, Fibre Channel, FDDI, InfiniBand.

25 Programming methodology
Mostly message passing using MPI. Data distribution languages simulate a global name space; examples include High Performance Fortran, Split-C, and Co-array Fortran.

26 Hybrid machines SMP nodes (clumps) with interconnect between clumps
Examples: Origin 2000, Exemplar, SV1, IBM Nighthawk. Programming: SMP methods within clumps with message passing between clumps, or message passing between all processors.

27 MESSAGE PASSING INTERFACE
AN INTRODUCTION TO MESSAGE PASSING INTERFACE FOR CLUSTERS

28 A Brief Intro to MPI Background on MPI Documentation
Hello world in MPI Basic communications Simple send and receive program You can use MPI in both Clusters and Grid environments

29 Background on MPI MPI - Message Passing Interface
A library standard defined by a committee of vendors, implementers, and parallel programmers. Used to create parallel SPMD programs based on message passing. Available on almost all parallel machines in C and Fortran. Over 100 advanced routines, but only 6 basic ones are needed.

30 Documentation MPI home page http://www.mcs.anl.gov/mpi
Contains the library standard. Books: "MPI: The Complete Reference" by Snir, Otto, Huss-Lederman, Walker, and Dongarra, MIT Press (also available in Postscript and HTML); "Using MPI" by Gropp, Lusk, and Skjellum, MIT Press.

31 MPI Implementations Most parallel machine vendors have optimized versions. Others include MPICH-G2, which runs MPI programs in Globus grid environments.

32 Key Concepts of MPI
Used to create parallel SPMD programs based on message passing. Normally the same program is running on several different nodes, and the nodes communicate using message passing. Typical methodology:
start job on n processors
do i=1 to j
  each processor does some calculation
  pass messages between processors
end do
end job

33 Include files The MPI include file C: mpi.h
Fortran: mpif.h (an f90 module is a good place for this). The include file defines many constants used within MPI programs and, in C, defines the interfaces for the functions. Compilers know where to find the include files.

34 Communicators
A communicator is a parameter for most MPI calls: a collection of processors working on some part of a parallel job. MPI_COMM_WORLD is defined in the MPI include file as all of the processors in your job. Subsets of MPI_COMM_WORLD can be created (a sketch follows below). Processors within a communicator are assigned numbers 0 to n-1.
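As a minimal sketch of creating such a subset (not from the original slides; it relies only on the standard MPI_Comm_split call), the processes of MPI_COMM_WORLD can be split into even-rank and odd-rank communicators:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int world_rank, sub_rank, sub_size, color;
    MPI_Comm sub_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* color 0 = even world ranks, color 1 = odd world ranks */
    color = world_rank % 2;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &sub_comm);

    MPI_Comm_rank(sub_comm, &sub_rank);   /* rank 0..n-1 within the subset */
    MPI_Comm_size(sub_comm, &sub_size);
    printf("world rank %d is rank %d of %d in subset %d\n",
           world_rank, sub_rank, sub_size, color);

    MPI_Comm_free(&sub_comm);
    MPI_Finalize();
    return 0;
}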

35 Data types
When sending a message, it is given a data type. Predefined types correspond to "normal" types: MPI_REAL and MPI_FLOAT for Fortran and C real, MPI_DOUBLE_PRECISION and MPI_DOUBLE for Fortran and C double, MPI_INTEGER and MPI_INT for Fortran and C integer. User-defined types can also be created (a sketch follows below).
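A minimal sketch of a user-defined type (not from the original slides; it assumes at least two processes): MPI_Type_contiguous and MPI_Type_commit build a type that treats three contiguous doubles as one element, so a whole "point" can be sent with count = 1.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    double point[3] = {0.0, 0.0, 0.0};
    MPI_Datatype point_type;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* new datatype: 3 contiguous doubles treated as one element */
    MPI_Type_contiguous(3, MPI_DOUBLE, &point_type);
    MPI_Type_commit(&point_type);

    if (rank == 0) {
        point[0] = 1.0; point[1] = 2.0; point[2] = 3.0;
        MPI_Send(point, 1, point_type, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(point, 1, point_type, 0, 0, MPI_COMM_WORLD, &status);
        printf("got %f %f %f\n", point[0], point[1], point[2]);
    }

    MPI_Type_free(&point_type);
    MPI_Finalize();
    return 0;
}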

36 Minimal MPI Program Every MPI program needs these… C version
#include <mpi.h>   /* the MPI include file */
...
/* Initialize MPI */
ierr = MPI_Init(&argc, &argv);
/* How many total PEs are there? */
ierr = MPI_Comm_size(MPI_COMM_WORLD, &nPEs);
/* What node am I (what is my rank)? */
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &iam);
...
ierr = MPI_Finalize();

In C, MPI routines are functions and return an error value.

37 Minimal MPI Program Every MPI program needs these…
Fortran version:
include 'mpif.h'   ! MPI include file
c Initialize MPI
call MPI_Init(ierr)
c Find total number of PEs
call MPI_Comm_size(MPI_COMM_WORLD, nPEs, ierr)
c Find the rank of this node
call MPI_Comm_rank(MPI_COMM_WORLD, iam, ierr)
...
call MPI_Finalize(ierr)

In Fortran, MPI routines are subroutines, and the last parameter is an error value.

38 Basic Communication Data values are transferred from one processor to another: one process sends the data and another receives it. Synchronous: the call does not return until the message is sent or received. Asynchronous: the call indicates the start of a send or receive, and another call is made to determine whether it has finished.

39 Synchronous Send
The call blocks until the message is on the way.
C: MPI_Send(&buffer, count, datatype, destination, tag, communicator);
Fortran: Call MPI_Send(buffer, count, datatype, destination, tag, communicator, ierr)

40 Synchronous Send
MPI_Send sends data to another processor; use MPI_Recv to "get" the data.
C: MPI_Send(&buffer, count, datatype, destination, tag, communicator);
Fortran: Call MPI_Send(buffer, count, datatype, destination, tag, communicator, ierr)

41 Call MPI_Send(buffer, count, datatype, destination, tag, communicator, ierr)
Buffer: the data. Count: length of the source array (in elements, 1 for scalars). Datatype: type of data, for example MPI_DOUBLE_PRECISION, MPI_INT, etc. Destination: processor number of the destination processor in the communicator. Tag: message type (arbitrary integer). Communicator: your set of processors. Ierr: error return (Fortran only).

42 Synchronous Receive
The call blocks until a message is in the buffer.
C: MPI_Recv(&buffer, count, datatype, source, tag, communicator, &status);
Fortran: Call MPI_RECV(buffer, count, datatype, source, tag, communicator, status, ierr)
Status contains information about the incoming message:
C: MPI_Status status;
Fortran: Integer status(MPI_STATUS_SIZE)

43 Call MPI_Recv(buffer, count, datatype, source, tag, communicator, status, ierr)
Buffer: the data. Count: length of the source array (in elements, 1 for scalars). Datatype: type of data, for example MPI_DOUBLE_PRECISION, MPI_INT, etc. Source: processor number of the source processor in the communicator. Tag: message type (arbitrary integer). Communicator: your set of processors. Status: information about the message. Ierr: error return (Fortran only).

44 Six basic MPI calls
MPI_INIT: initialize MPI. MPI_COMM_RANK: get the processor rank. MPI_COMM_SIZE: get the number of processors. MPI_Send: send data to another processor. MPI_Recv: get data from another processor. MPI_FINALIZE: finish MPI.

45 Send and Receive Program Fortran
program send_receive include "mpif.h" integer myid,ierr,numprocs,tag,source,destination,count integer buffer integer status(MPI_STATUS_SIZE) call MPI_INIT( ierr ) call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr ) call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr ) tag=1234; source=0; destination=1; count=1 if(myid .eq. source)then buffer=5678 Call MPI_Send(buffer, count, MPI_INTEGER,destination,& tag, MPI_COMM_WORLD, ierr) write(*,*)"processor ",myid," sent ",buffer endif if(myid .eq. destination)then Call MPI_Recv(buffer, count, MPI_INTEGER,source,& tag, MPI_COMM_WORLD, status,ierr) write(*,*)"processor ",myid," got ",buffer call MPI_FINALIZE(ierr) stop end

46 Send and Receive Program C
#include <stdio.h> #include "mpi.h" int main(int argc,char *argv[]) { int myid, numprocs, tag,source,destination,count, buffer; MPI_Status status; MPI_Init(&argc,&argv); MPI_Comm_size(MPI_COMM_WORLD,&numprocs); MPI_Comm_rank(MPI_COMM_WORLD,&myid); tag=1234; source=0; destination=1; count=1; if(myid == source){ buffer=5678; MPI_Send(&buffer,count,MPI_INT,destination,tag,MPI_COMM_WORLD); printf("processor %d sent %d\n",myid,buffer); } if(myid == destination){ MPI_Recv(&buffer,count,MPI_INT,source,tag,MPI_COMM_WORLD,&status); printf("processor %d got %d\n",myid,buffer); MPI_Finalize();

47 MPI Types MPI has many different predefined data types
Can be used in any communication operation

48 Predefined types in C
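Representative entries (a subset of the standard's full table): MPI_CHAR, MPI_INT, MPI_LONG, MPI_UNSIGNED, MPI_FLOAT, MPI_DOUBLE, and MPI_BYTE, corresponding to char, int, long, unsigned int, float, double, and raw bytes.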

49 Predefined types in Fortran
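Representative entries (again a subset): MPI_INTEGER, MPI_REAL, MPI_DOUBLE_PRECISION, MPI_COMPLEX, MPI_LOGICAL, MPI_CHARACTER, and MPI_BYTE, corresponding to the Fortran INTEGER, REAL, DOUBLE PRECISION, COMPLEX, LOGICAL, and CHARACTER types plus raw bytes.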

50 Wildcards
Wildcards allow you to avoid specifying a particular tag or source: MPI_ANY_SOURCE and MPI_ANY_TAG are wild cards, and the status structure is used to get the actual values. Example:
MPI_Status status;
int buffer[5];
int error;
error = MPI_Recv(&buffer, 5, MPI_INT, MPI_ANY_SOURCE,
                 MPI_ANY_TAG, MPI_COMM_WORLD, &status);

51 Status The status parameter returns additional information for some MPI routines: error status information and, with wildcard parameters, the actual source and tag. C declaration (a predefined struct): MPI_Status status; Fortran declaration (an array is used instead): INTEGER STATUS(MPI_STATUS_SIZE)

52 Accessing status information
The tag of a received message C : status.MPI_TAG Fortran : STATUS(MPI_TAG) The source of a received message C : status.MPI_SOURCE Fortran : STATUS(MPI_SOURCE) The error code of the MPI call C : status.MPI_ERROR Fortran : STATUS(MPI_ERROR) Other uses...

53 MPI_Probe MPI_Probe allows incoming messages to be checked without actually receiving them; the user can then decide how to receive the data. Useful when a different action needs to be taken depending on the "who, what, and how much" information of the message.

54 MPI_Probe
C: int MPI_Probe(source, tag, comm, &status)
Fortran: MPI_PROBE(SOURCE, TAG, COMM, STATUS, IERROR)
Parameters: Source: source rank, or MPI_ANY_SOURCE. Tag: tag value, or MPI_ANY_TAG. Comm: communicator. Status: status object.

55 MPI_Probe example (part 1)
! How to use probe and get_count
! to find the size of an incoming message
program probe_it
include 'mpif.h'
integer myid, numprocs
integer status(MPI_STATUS_SIZE)
integer mytag, icount, ierr, iray(10)
call MPI_INIT( ierr )
call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
mytag=123; iray=0; icount=0
if(myid .eq. 0)then
! Process 0 sends a message of size 5
  icount=5
  iray(1:icount)=1
  call MPI_SEND(iray,icount,MPI_INTEGER, &
                1,mytag,MPI_COMM_WORLD,ierr)
endif

56 MPI_Probe example (part 2)
if(myid .eq. 1)then
! process 1 uses probe and get_count to find the size
  call mpi_probe(0,mytag,MPI_COMM_WORLD,status,ierr)
  call mpi_get_count(status,MPI_INTEGER,icount,ierr)
  write(*,*)"getting ",icount," values"
  call mpi_recv(iray,icount,MPI_INTEGER,0, &
                mytag,MPI_COMM_WORLD,status,ierr)
endif
write(*,*)iray
call mpi_finalize(ierr)
stop
end

57 MPI_BARRIER Blocks the caller until all members of the communicator have called it; used as a synchronization tool. C: MPI_Barrier(comm) Fortran: Call MPI_BARRIER(COMM, IERROR) Parameter: Comm, the communicator (e.g. MPI_COMM_WORLD).
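A minimal sketch (not from the slides) of the usual barrier idiom: synchronize all ranks with MPI_Barrier before and after a phase so that MPI_Wtime measures the slowest rank rather than whichever rank happened to start first.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);   /* everyone starts the clock together */
    t0 = MPI_Wtime();

    /* ... the work being timed would go here ... */

    MPI_Barrier(MPI_COMM_WORLD);   /* wait for the slowest rank to finish */
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("phase took %f seconds\n", t1 - t0);

    MPI_Finalize();
    return 0;
}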

58 Asynchronous Communication
Asynchronous send: the send call returns immediately; the send actually occurs later. Asynchronous receive: the receive call returns immediately, and when the received data is needed, a wait routine is called. Asynchronous communication is used in an attempt to overlap communication with computation. It can also help prevent deadlock (though relying on it for that is not advised).

59 Asynchronous Send with MPI_Isend
C: MPI_Request request;
   int MPI_Isend(&buffer, count, datatype, dest, tag, comm, &request)
Fortran: Integer REQUEST
   MPI_ISEND(BUFFER, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
Request is a new output parameter. Don't change the data until the communication is complete.

60 Asynchronous Receive with MPI_Irecv
C: MPI_Request request;
   int MPI_Irecv(&buf, count, datatype, source, tag, comm, &request)
Fortran: Integer request
   MPI_IRECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERROR)
Parameter changes: Request is a communication request, and the status parameter is missing. Don't use the data until the communication is complete.

61 MPI_Wait used to complete communication
Request from Isend or Irecv is input The completion of a send operation indicates that the sender is now free to update the data in the send buffer The completion of a receive operation indicates that the receive buffer contains the received message MPI_Wait blocks until message specified by "request" completes

62 MPI_Wait used to complete communication
C: MPI_Request request;
   MPI_Status status;
   MPI_Wait(&request, &status)
Fortran: Integer request
   Integer status(MPI_STATUS_SIZE)
   MPI_WAIT(REQUEST, STATUS, IERROR)
MPI_Wait blocks until the message specified by "request" completes.

63 MPI_Test
Similar to MPI_Wait, but does not block; the value of the flag signifies whether a message has been delivered.
C: int flag;
   int MPI_Test(&request, &flag, &status)
Fortran: LOGICAL FLAG
   MPI_TEST(REQUEST, FLAG, STATUS, IER)

64 Non blocking send example
call MPI_Isend(buffer, count, datatype, dest, tag, comm, request, ierr)
10 continue
! do other work ...
call MPI_Test(request, flag, status, ierr)
if (.not. flag) goto 10
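For comparison, a minimal C sketch (not from the slides) of the same non-blocking pattern using MPI_Isend, MPI_Irecv, and MPI_Wait; it assumes at least two processes, with rank 0 sending one integer to rank 1.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value = 0;
    MPI_Request request;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Isend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
        /* ... other work that does not change 'value' ... */
        MPI_Wait(&request, &status);   /* safe to reuse 'value' after this */
    } else if (rank == 1) {
        MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
        /* ... other work that does not read 'value' ... */
        MPI_Wait(&request, &status);   /* data is valid only after the wait */
        printf("rank 1 got %d\n", value);
    }

    MPI_Finalize();
    return 0;
}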

65 MPI Broadcast call: MPI_Bcast
All nodes call MPI_Bcast: one node (the root) sends the message and all others receive it.
C: MPI_Bcast(&buffer, count, datatype, root, communicator);
Fortran: call MPI_Bcast(buffer, count, datatype, root, communicator, ierr)
Root is the node that sends the message.
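A minimal sketch (not from the slides): rank 0 sets a parameter and broadcasts it, so every rank ends up with the same value.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, n = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        n = 1000;                  /* only the root knows the value initially */

    /* every rank calls MPI_Bcast; afterwards all ranks hold n = 1000 */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d has n = %d\n", rank, n);
    MPI_Finalize();
    return 0;
}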

66 Scatter Operation using MPI_Scatter
Similar to broadcast, but sends a different section of an array to each processor: element A(0) of the array on the root node goes to processor P0, A(1) to P1, ..., A(N-1) to P(N-1).

67 MPI_Scatter
C: int MPI_Scatter(&sendbuf, sendcnts, sendtype, &recvbuf, recvcnts, recvtype, root, comm);
Fortran: MPI_Scatter(sendbuf, sendcnts, sendtype, recvbuf, recvcnts, recvtype, root, comm, ierror)
Parameters: Sendbuf is an array of size (number of processors * sendcnts). Sendcnts is the number of elements sent to each processor. Recvcnts is the number of elements obtained from the root processor. Recvbuf holds the elements obtained from the root processor and may be an array.

68 Scatter Operation using MPI_Scatter
Scatter with Sendcnts = 2: elements A(0) and A(1) of the array on the root node go to B(0) and B(1) on processor P0, A(2) and A(3) to B(0) and B(1) on P1, and so on up to A(2N-2) and A(2N-1) on P(N-1).

69 Gather Operation using MPI_Gather
Used to collect data from all processors onto the root; the inverse of scatter. The data is collected into an array on the root processor: the value A from processor P0 goes to A(0), from P1 to A(1), ..., from P(N-1) to A(N-1).

70 MPI_Gather
C: int MPI_Gather(&sendbuf, sendcnts, sendtype, &recvbuf, recvcnts, recvtype, root, comm);
Fortran: MPI_Gather(sendbuf, sendcnts, sendtype, recvbuf, recvcnts, recvtype, root, comm, ierror)
Parameters: Sendcnts is the number of elements sent from each processor. Sendbuf is an array of size sendcnts. Recvcnts is the number of elements obtained from each processor. Recvbuf is of size recvcnts * number of processors.
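A minimal sketch (not from the slides) combining the two calls: the root scatters two elements to each rank, each rank doubles its piece, and the pieces are gathered back onto the root in rank order. It assumes exactly 4 processes so the fixed-size arrays line up.

#include <mpi.h>
#include <stdio.h>

#define NPROCS 4
#define CHUNK  2

int main(int argc, char *argv[])
{
    int rank, size, i;
    int sendbuf[NPROCS * CHUNK];   /* meaningful only on the root */
    int recvbuf[CHUNK];            /* each rank's piece */
    int result[NPROCS * CHUNK];    /* gathered back on the root */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != NPROCS)
        MPI_Abort(MPI_COMM_WORLD, 1);   /* the sketch assumes 4 ranks */

    if (rank == 0)
        for (i = 0; i < NPROCS * CHUNK; i++) sendbuf[i] = i;

    /* A(0),A(1) go to rank 0; A(2),A(3) to rank 1; and so on */
    MPI_Scatter(sendbuf, CHUNK, MPI_INT, recvbuf, CHUNK, MPI_INT, 0, MPI_COMM_WORLD);

    for (i = 0; i < CHUNK; i++)
        recvbuf[i] *= 2;               /* work on the local piece */

    /* collect the pieces back onto the root in rank order */
    MPI_Gather(recvbuf, CHUNK, MPI_INT, result, CHUNK, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0)
        for (i = 0; i < NPROCS * CHUNK; i++)
            printf("result[%d] = %d\n", i, result[i]);

    MPI_Finalize();
    return 0;
}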

71 Reduction Operations Used to combine partial results from all processors, with the result returned to the root processor. Several types of operations are available, and reductions work element-wise on 1-d or 2-d arrays.

72 MPI routine is MPI_Reduce
C: int MPI_Reduce(&sendbuf, &recvbuf, count, datatype, operation, root, communicator)
Fortran: call MPI_Reduce(sendbuf, recvbuf, count, datatype, operation, root, communicator, ierr)
Parameters: like MPI_Bcast, a root is specified. Operation is a type of mathematical operation.

73 Operations for MPI_Reduce
MPI_MAX: maximum. MPI_MIN: minimum. MPI_PROD: product. MPI_SUM: sum. MPI_LAND: logical and. MPI_LOR: logical or. MPI_LXOR: logical exclusive or. MPI_BAND: bitwise and. MPI_BOR: bitwise or. MPI_BXOR: bitwise exclusive or. MPI_MAXLOC: maximum value and location. MPI_MINLOC: minimum value and location.

74 Global Sum with MPI_Reduce
C:
double sum_partial, sum_global;
sum_partial = ...;
ierr = MPI_Reduce(&sum_partial, &sum_global, 1,
                  MPI_DOUBLE, MPI_SUM,
                  root, MPI_COMM_WORLD);
Fortran:
double precision sum_partial, sum_global
sum_partial = ...
call MPI_Reduce(sum_partial, sum_global, 1,
                MPI_DOUBLE_PRECISION, MPI_SUM,
                root, MPI_COMM_WORLD, ierr)

75 Global Sum with MPI_Reduce and 2d array
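A sketch of a 2-d global sum (not taken from the original slide): because a 2-d array is contiguous in memory, the whole array can be summed element-wise by passing the total element count, NROW*NCOL, to MPI_Reduce.

#include <mpi.h>
#include <stdio.h>

#define NROW 2
#define NCOL 3

int main(int argc, char *argv[])
{
    int rank, i, j;
    double part[NROW][NCOL], total[NROW][NCOL];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* each rank fills its own partial 2-d array */
    for (i = 0; i < NROW; i++)
        for (j = 0; j < NCOL; j++)
            part[i][j] = rank + 0.1 * (i * NCOL + j);

    /* element-wise sum of the whole array: count = NROW*NCOL */
    MPI_Reduce(part, total, NROW * NCOL, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        for (i = 0; i < NROW; i++)
            for (j = 0; j < NCOL; j++)
                printf("total[%d][%d] = %f\n", i, j, total[i][j]);

    MPI_Finalize();
    return 0;
}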

76 All Gather and All Reduce
Gather and Reduce come in an "ALL" variation: the results are returned to all processors and the root parameter is missing from the call. This is similar to a gather or reduce followed by a broadcast.

77 Global Sum with MPI_AllReduce and 2d array
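Assuming the same variable names as the MPI_Reduce sketch above, the all-reduce version simply drops the root argument, e.g. MPI_Allreduce(part, total, NROW*NCOL, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD), after which every rank holds the element-wise sum.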

78 All to All communication with MPI_Alltoall
Each processor sends data to and receives data from all of the others.
C: int MPI_Alltoall(&sendbuf, sendcnts, sendtype, &recvbuf, recvcnts, recvtype, comm);
Fortran: MPI_Alltoall(sendbuf, sendcnts, sendtype, recvbuf, recvcnts, recvtype, comm, ierror)
Unlike scatter and gather, there is no root argument.

79 All to All with MPI_Alltoall
Parameters: Sendcnts is the number of elements sent to each processor, and Recvcnts is the number of elements obtained from each processor. Note that both the send buffer and the receive buffer must be arrays whose size scales with the number of processors (sendcnts * number of processors and recvcnts * number of processors, respectively).

80 The dreaded “V” or variable operators
A collection of very powerful but difficult-to-set-up global communication routines (a Gatherv sketch follows below). MPI_Gatherv: gather different amounts of data from each processor to the root processor. MPI_Alltoallv: send and receive different amounts of data from all processors. MPI_Allgatherv: gather different amounts of data from each processor and send all the data to each. MPI_Scatterv: send different amounts of data to each processor from the root processor.
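A minimal MPI_Gatherv sketch (not from the slides): each rank contributes rank+1 elements, and the root supplies the per-rank counts and displacements that plain MPI_Gather cannot express. It assumes a small number of ranks (at most 16) so the fixed send buffer is large enough.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, size, i, sendcnt;
    int sendbuf[16];                     /* assumes at most 16 ranks */
    int *recvcnts = NULL, *displs = NULL, *recvbuf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendcnt = rank + 1;                  /* each rank sends a different amount */
    for (i = 0; i < sendcnt; i++) sendbuf[i] = rank;

    if (rank == 0) {                     /* only the root needs counts and displacements */
        int total = 0;
        recvcnts = malloc(size * sizeof(int));
        displs   = malloc(size * sizeof(int));
        for (i = 0; i < size; i++) {
            recvcnts[i] = i + 1;         /* how much rank i will send */
            displs[i]   = total;         /* where it lands in recvbuf */
            total += recvcnts[i];
        }
        recvbuf = malloc(total * sizeof(int));
    }

    MPI_Gatherv(sendbuf, sendcnt, MPI_INT,
                recvbuf, recvcnts, displs, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (i = 0; i < size * (size + 1) / 2; i++) printf("%d ", recvbuf[i]);
        printf("\n");
        free(recvcnts); free(displs); free(recvbuf);
    }

    MPI_Finalize();
    return 0;
}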

81 Summary MPI is used to create parallel programs based on message passing. Usually the same program is run on multiple processors. The 6 basic calls in MPI are:
MPI_INIT( ierr )
MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
MPI_Send(buffer, count, MPI_INTEGER, destination, tag, MPI_COMM_WORLD, ierr)
MPI_Recv(buffer, count, MPI_INTEGER, source, tag, MPI_COMM_WORLD, status, ierr)
MPI_FINALIZE(ierr)

82 Job Management in HPC Clusters
Interactive and Batch Jobs Common Resource Management Systems: PBS, LSF, Condor Queues and Job Specification Sample PBS and LSF scripts

83 Job Management in HPC Clusters
Interactive and Batch Jobs Interactive Mode is useful for debugging, performance tuning and profiling of applications Batch Mode is for production runs Most clusters are configured with a number of Queues for submitting and running batch jobs Batch jobs are submitted via scripts specific to the resource management system deployed

84 Example of Batch Queues
qstat -q
server: nyx.engin.umich.edu
[Table: one row per queue (short, vs, staff, atlas, powell, long, landau, debug, route, coe) with columns Memory, CPU Time, Walltime, Node, Run, Que, Lm, and State; each queue is in state E R (enabled, running).]
A maximum number of CPUs, maximum wall time, etc. can be assigned to specific queues.

85 An Example PBS Script
#PBS -N pbstrial                     # job name
#PBS -l nodes=2,walltime=24:00:00    # number of CPUs, walltime
#PBS -S /bin/sh
#PBS -q npaci                        # queue name
#PBS -M
#PBS -m abe                          # notification level request
#PBS -j oe
#
export GMPICONF=/home/abose/.gmpi/$PBS_JOBID
echo "I ran on `hostname`"
# cd to your execution directory first
cd ~
scp em
# use mpirun to run my MPI binary with 4 CPUs for 1 hour
mpirun -np 4 ~/a.out                 # run executable

86 An Example LSF Script
#!/bin/ksh
#
# LSF batch script to run the test MPI code
#BSUB -P                             # Project Number
#BSUB -a mpich_gm                    # select the mpich-gm elim
#BSUB -x                             # exclusive use of node (not_shared)
#BSUB -n                             # number of total tasks
#BSUB -R "span[ptile=1]"             # run 1 task per node
#BSUB -J mpilsf.test                 # job name
#BSUB -o mpilsf.out                  # output filename
#BSUB -e mpilsf.err                  # error filename
#BSUB -q regular                     # queue
# Fortran example
mpif90 -o mpi_samp_f mpisamp.f
mpirun.lsf ./mpi_samp_f
# C example
mpicc -o mpi_samp_c mpisamp.c
mpirun.lsf ./mpi_samp_c
# C++ example
mpicxx -o mpi_samp_cc mpisamp.cc
mpirun.lsf ./mpi_samp_cc             # run executable

87 Application Checkpointing
The ability of a code to restart from the point of last execution or interruption. Very useful for timeshared systems such as clusters and Grids. High-end systems provide checkpointing in the OS and hardware, but on clusters and desktops we have to checkpoint on our own. You should checkpoint when: (1) you have long-running jobs and you usually run out of allocated queue time; (2) you can restart the job from the last saved global state (not always possible, but quite common for time-dependent simulations); (3) you want to protect yourself from hardware faults and lost CPU cycles. How often should one checkpoint? It depends entirely on the application and the user.

88 Application Checkpointing
Find out how much time is left in a queue: use ctime/etime/dtime-type timer routines in the code periodically to check the elapsed time, and if close to the queue limit, write data to disk from all processes. An efficient way of writing checkpointed data is to keep two different files, A and B, where file A holds the checkpointed data from the last write: first write to file B, and only if that write succeeds, replace A with B (a sketch follows below). If you write over the same file again and again, a system crash during a write will lose all the checkpointed data so far. You can also combine files from multiple processes into a single file; in that case, name the process files with the process id attached (this simplifies management). We will talk about this in the MPI presentation. How do you decide on a common checkpointed state among all processes?
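A minimal sketch of the two-file (A/B) rotation just described, in C; the file names and the State structure are made up for illustration, and rename() is used so that file A is always a complete checkpoint.

#include <stdio.h>

/* hypothetical application state to be saved */
typedef struct { long step; double data[1024]; } State;

/* Write the state to checkpoint.B; only if that write succeeds,
   replace checkpoint.A with the new file in a single rename(). */
int checkpoint(const State *s)
{
    FILE *fp = fopen("checkpoint.B", "wb");
    if (fp == NULL) return -1;

    if (fwrite(s, sizeof(State), 1, fp) != 1) {   /* write failed: keep old A */
        fclose(fp);
        return -1;
    }
    if (fclose(fp) != 0) return -1;               /* flush error: keep old A */

    return rename("checkpoint.B", "checkpoint.A");
}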

89 Application Checkpointing
Distributed checkpointing: in the distributed/message-passing model, issue a broadcast to all processes to checkpoint the data at specific points in the code. Or, when the program logic does not lend itself to that, use a distributed checkpointing algorithm; see, for example, "A Survey of Checkpointing Algorithms for Parallel and Distributed Computers" by Kalaiselvi and Rajaraman. Simpler "quick-n-dirty" methods based on token passing also work, although they may not be very robust (they work most of the time). Contact me if you wish to know more.

90 A Bioinformatics Introduction to Cluster Computing part II
By Andrew D. Boyd, MD Research Fellow Michigan Center for Biological Information Department of Psychiatry, University of Michigan Health System And Abhijit Bose, PhD Associate Director Michigan Grid Research and Infrastructure Development

91 Computational Challenges
Most scientific computations have both parallel and serial aspects. Clusters tend to speed up the parallel aspect the most; the serial computations limit how much the overall computation can be sped up on a cluster.

92 Parallelize the computation of a large data set
One can take the data set and break it up into smaller pieces. Example: giving everyone a small piece of the radio-frequency data to compute on a home computer.

93 Serial Computations Parallelization is less cost efficient with large interrelated problems that are hard to uncouple, and less cost efficient when the cost of communicating between the nodes exceeds the savings from distributing the computation load. Example: molecular dynamics. CHARMM and XPLOR are software programs developed for serial machines. NAMD was designed to run efficiently on parallel machines for simulating large molecules; it used Charm++, a parallel C++ library, to parallelize the code.

94 Large Genome vs Genome searches
BLAT, BLAST, FASTA, and SSAHA all perform sequence matching, but they differ in speed and sensitivity, and some have higher memory requirements than others. They are, however, relatively easy to parallelize: divide the list of sequences you are searching into smaller pieces.

95 A simple Bioinformatics Example
Re-examine the labels of the Affymetrix probe set. The goal is to find matches with other genes within the organism, especially as UniGene changes every few months. There are 500,000 probes to search against the database.

96 Time Savings of a cluster
On a single processor at 40 seconds per sequence, it would take 231 days to process all 500,000 probes, or 57 days at 10 seconds per sequence. On biosys1 (an old cluster) the job took approximately 2 weeks; on morpheus (a new cluster) it took approximately 4 days.
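For reference, the arithmetic behind those estimates: 500,000 probes x 40 s = 2.0 x 10^7 s ≈ 231 days, and 500,000 probes x 10 s = 5.0 x 10^6 s ≈ 57.9 days.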

97 One Cluster Computing Method
Take the 0.5 million Affymetrix probe sequences and divide them into 90 files, submit multiple single BLAST executions with roughly 5,500 sequences each to the queue, and allow the scheduler to dynamically allocate the jobs to unused nodes. This is easy to code in Perl; the code is walked through later.

98 Another cluster computing method for sequences: mpiBLAST
One concern is the memory of the computers compared to the size of the database. One could break the database up into smaller pieces and have the 0.5 million Affymetrix probes search against the smaller pieces of the database on the nodes of a cluster.

99 mpiBLAST Uses the mpi libraries to pass messages between the nodes
There is one master node and many slave nodes; each slave node is assigned a part of the database. If there are more database parts than nodes, the master node dynamically allocates the database pieces. At the end, the statistics are recalculated for the complete database.

100 mpiBLAST After initial testing, the program did not scale beyond 10 nodes
Reliability in completing BLAST runs was poor. Early versions had very little error checking, so if one node had an error, the complete run was lost, and if a message between nodes was lost, the complete run was lost. Error exits were also poor: mpiBLAST would never finish if an error came up and had to be killed manually.

101 Scalability of mpiBLAST

102 Another Parallel bioinformatics example
InterProScan, developed by the European Bioinformatics Institute (EBI), performs 11 protein domain comparisons for an amino acid sequence and creates a single report. The code is all developed in Perl. Like BLAST, sequences can be submitted via the EBI web site, but for larger numbers of sequences the application software can be downloaded.

103 InterProScan Their approach to the solution was to take the input sequence and divide it into chunks. Then each of the 11 programs has a script configured to run on a specific chunk A separate script maintains how many processes are being submitted to the cluster and as one finishes another is submitted

104 InterProScan Assumptions built into the model
The number of nodes you will be running on is assumed to stay the same and to be dedicated to you for the whole run; in effect, InterProScan takes over the job of the cluster scheduler.

105 Sample Code to divide an input file and Execute multiple instances of BLAST
Three Perl scripts: Hgdivide.pl, demomaker.pl, and multiqsub.pl. They could be combined into a single script if truly desired, and more error checking could be built into the programs as well.

106 Hgdivide.pl
[adboyd@head input]$ cat Hgdivide.pl
#!/usr/bin/perl -w
# divide the affy probe set into files of 1000 probes per file
open(IN, "<demoinput") || die("can't open input file");    # open input file
$j = 1;                                # index for file number
open(OUT, ">demoin".$j."") || die("can't open out file");
$k = 101;                              # number of lines in each input file
$i = 1;                                # line number
while ($line = <IN>) {                 # read a line of the input file
    if ($i < $k) { print OUT $line }   # print the line if not yet at the end of the new input file
    if ($i >= $k) {                    # at the end of the new file: reset the index,
        $i = 1; ++$j;                  # add one to the file number, and open a new file
        open(OUT, ">demoin".$j."") || die("can't open out file");
        print OUT $line;
    }
    ++$i;                              # advance the line counter
}
close(OUT);
close(IN);

107 demomaker.pl
@num = (1..3);                                              # number of scripts you want to make
foreach $chrom (@num) {
    open(OUT, ">run".$chrom.".sh") || die "can't open file";  # script file name
    print OUT ("#!/bin/tcsh\n");                              # choose the shell to run
    print OUT ("cd affy\n");                                  # change to the working directory
    $input_file = "~/affy/input/demoin".$chrom."";            # input file and location
    $input = "demoin".$chrom."";                              # name of input file
    $output_file = "~/affy/output/demo".$chrom.".out";        # output file for BLAST results
    $output2_file = "~/affy/output/demo".$chrom.".out2";      # maintenance (log) output file
    $scratch_dir = "/scratch";                                # name of scratch directory on cluster
    print OUT ("date > ".$output2_file."\n");                 # print date to maintenance file
    print OUT ("hostname >> ".$output2_file."\n");            # print cluster node name to maintenance file
    print OUT ("/bin/cp -f Hs.seq.all.* ".$scratch_dir."\n"); # copy database to local node
    print OUT ("echo 'db copied into scratch' >> ".$output2_file."\n");   # log status of copy

108 demomaker.pl part 2
    print OUT ("/bin/cp -f ".$input_file." ".$scratch_dir."\n");           # copy input file to local node
    print OUT ("echo 'input copied into scratch' >> ".$output2_file."\n"); # log status of copy
    print OUT ("date >> ".$output2_file."\n");                             # log the time copying finished
    print OUT ("cd ".$scratch_dir."\n");                                   # move into the node's scratch directory
    $cmd_line = "~/affy/blastall -i ".$input." -o ".$output_file." -p blastn -d Hs.seq.all -e a 2 -U T ";   # command line to execute
    print OUT ("echo '$cmd_line' >> ".$output2_file."\n");                 # log the command line
    print OUT ($cmd_line."\n");                                            # run the command line
    print OUT ("echo Blastall finished >> ".$output2_file."\n");           # log status of the BLAST run
    print OUT ("/bin/rm -f Hs.seq.all.*\n");                               # remove database from node
    print OUT ("echo 'db removed from scratch' >> ".$output2_file."\n");   # log database removal
    print OUT ("/bin/rm -f ".$input."\n");                                 # remove input file from node
    print OUT ("echo 'input removed from scratch' >> ".$output2_file."\n"); # log input file removal
    print OUT ("date >> ".$output2_file."\n");                             # print time to maintenance file
}
close(OUT);

109 multiqsub.pl
#!/usr/bin/perl
# system call to submit each script to the cluster
$i = 1;                            # index
while ($i < 4) {
    system("qsub run".$i.".sh");   # system call for each script
    ++$i;
}

110 The joy of cluster computing
This is the stone wall you may encounter while working on a cluster. It is a "feature", not necessarily a bug.

111 How software is maintained on a cluster
Many clusters will allow you to install any software you desire. Some clusters pre-install common software such as BLAST or InterProScan, and even databases: if multiple people are using the same database, it is a waste of memory to have it installed multiple times.

112 Bring your own Software model
Find out what libraries and compilers are supported on the cluster: not everyone uses gcc, and the latest version of the compiler you need may not be installed. Find out the name of the scratch space on the nodes; scratch space is the area of the node you can copy files to, usually labeled /scratch.

113 Software previously installed
Find out what directory the software is located in, what version is running, and whether the software is on a drive that can be mounted on the compute nodes. Warning: not all drives and directories visible from the head node are mountable on the compute nodes.

114 The importance of benchmarking
Before submitting jobs to a cluster, first run a fraction of the job on the head node and try to figure out how long the complete job will run. The program will probably not scale linearly with each node added. If the individual executions will take longer than 24 hours, you will need to write a module to checkpoint your program; assume a direct relationship between how long the program will run and the likelihood of the cluster failing during the run.

115 The importance of troubleshooting
Some steps to take if the program does not work on a cluster. First, run a small job on the head node. Second, if allowed, consider running in interactive mode to see if errors are generated; interactive mode allows you to type command-line instructions on the individual node the program is executed on, and you can read the error output directly. Just because a program runs on the head node does not mean it will run on a compute node. If it is not performing normally, contact your friendly system administrator.

116 The importance of troubleshooting part II
Third, submit one job to the scheduler. The scheduler should not modify the code, but if the commands given to the scheduler are not correct, the job will not execute properly; just because a program runs on the head node and on a compute node does not mean it will run after being assigned by the scheduler. If it is not performing normally, contact your friendly system administrator. Fourth, submit multiple jobs to the scheduler. Just because a program runs on the head node, on a compute node, and through the scheduler does not mean it will run on every single node of the cluster: while every node is supposed to be identical, sometimes they are different.

117 Information to bring to a System Administrator when problems arise
What program/script were you trying to use? What command-line parameters? When did this happen? What node/computer did this happen on? What was the error message you received, if any? If you have thoughts or clues about the problem, pass them on to the sys admin.

118 Errors that I have seen on clusters
Software not properly installed by the systems administrators on drives that could be mounted on the nodes. Result: no output from the program. Troubleshooting: called the system administrator; the error was being reported to a node that was inaccessible to the user. Scratch space permissions were changed by a previous user, so a node would not function. Result: since that node finished first, all subsequent jobs were sent to that node and errored out; we found out Monday morning that only half of the output had been generated. Troubleshooting: tracked all of the uncompleted jobs to a single node from the error output files.

119 Errors that I have seen on clusters part II
Compiler version supported on the cluster was not the version the software was written for. Result: the program failed to compile. Troubleshooting: contacted the author of the program and tried to install the appropriate compiler; ended up using binaries for the operating system. Home storage disk space full (bioinformatics applications tend to take up more disk space than other applications). Result: output from the last twenty jobs was missing. Troubleshooting: checked whether the failed job runs came from a single node and looked at the disk space allocation; other users were complaining as well. Programs errored out. Result: output from jobs stopped midway through the run. Troubleshooting: looked at how long the program had been running (5 days), looked at the input file, and did a brief calculation: it would have taken 30 days to finish a single job. There had been no benchmarking, so the experiment was re-evaluated.

120 Additional Information
Vi short command reference, BLAST, and Sun Grid Engine sites.

121 Acknowledgements Brian Athey, Associate Professor
Abhijit Bose, Associate Director MGRID Georgi Kostov, System Administrator Chris Bliton, MCBI project manager Fan Meng, Assistant Research Scientist Joe Landman, Scalable Informatics Jeff Ogden, IT project manager Paul Trombley, graphic artist Tom Hacker, Associate Director, Indiana Univ. NPACI and SDSC

