COMPUTING - HW REQUIREMENTS
Enzo Papandrea, DFCI, 06/11/2003
GEOFIT - MTR
With Geofit, measurements from a full orbit are processed simultaneously. A Geofit where P, T and the VMRs of H2O and O3 are retrieved simultaneously (MTR) increases the computing time.
TIME OF SIMULATIONS
Computing time of the sequential algorithm (T_S = sequential time), measured on an AlphaServer ES45 with a 1 GHz CPU:
H2O: T_S = 1h 30m
O3: T_S = 4h 40m
PT: T_S = 9h 48m
MTR: T_S = 10h 30m
To reduce the simulation time we propose a parallel system.
PARALLELIZATION
The first step will be to parallelize the loop that computes the forward model, because:
1. It is the most time-consuming part of the code.
2. The computation of the forward model for one sequence is independent of the computation for another sequence, so the processors have to exchange data only at the beginning and at the end of the forward model.
A minimal sketch of how such a loop could be parallelized is given below.
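The sketch below only illustrates the idea; the routine name forward_model, the array layout and the sizes are hypothetical placeholders, not the actual Geofit code.

PROGRAM ForwardLoop
  IMPLICIT NONE
  INTEGER (KIND=4), PARAMETER :: n_seq = 72, n_pts = 1000   ! hypothetical sizes
  REAL (KIND=8) :: spectrum(n_pts, n_seq)
  INTEGER (KIND=4) :: iseq

  ! The sequences are independent, so the loop iterations can be
  ! distributed among the threads with a single OpenMP directive.
!$OMP PARALLEL DO PRIVATE(iseq) SHARED(spectrum)
  DO iseq = 1, n_seq
     CALL forward_model(n_pts, iseq, spectrum(:, iseq))
  ENDDO
!$OMP END PARALLEL DO
  PRINT *, "first point of last sequence:", spectrum(1, n_seq)

CONTAINS

  SUBROUTINE forward_model(npt, iseq, spec)
    INTEGER (KIND=4), INTENT(IN)  :: npt, iseq
    REAL (KIND=8),    INTENT(OUT) :: spec(npt)
    ! Dummy computation standing in for the real radiative-transfer calculation
    spec = REAL(iseq, KIND=8)
  END SUBROUTINE forward_model

END PROGRAM ForwardLoop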
PARALLEL TIME
The parallel time (T_P) is the sequential time divided by the number of CPUs. For example, on a system with 8 CPUs, if the algorithm is completely parallel:
T_P = T_S / 8 = 12.5% of the sequential time
This is the best improvement we can reach with 8 CPUs.
FORWARD MODEL PARALLELIZATION
If we parallelize only the forward model, we can estimate the simulation time with 8 CPUs.
H2O:
T_Forward model (3 iterations) = 45m (sum of the times needed to compute the forward model)
T_P = T_Forward model / #CPU = 45m / 8 = 6m (time of the parallelized code)
T = (T_S - T_Forward model) + T_P = (1h 30m - 45m) + 6m = 51m = 56% (total time: time of the code that remains sequential plus time of the parallelized code)
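A compact restatement of this estimate (the symbols T_FM and N_CPU are introduced here for clarity, they do not appear on the original slides):

\[
T = \bigl(T_S - T_{FM}\bigr) + \frac{T_{FM}}{N_{\mathrm{CPU}}},
\qquad
T_{\mathrm{H_2O}} = (90\,\mathrm{m} - 45\,\mathrm{m}) + \frac{45\,\mathrm{m}}{8} \simeq 51\,\mathrm{m} \simeq 56\%\ \mathrm{of}\ T_S .
\]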
FORWARD MODEL PARALLELIZATION/1
MTR:
T_Forward model (2 iterations) = 10h 30m, T_P = 1h 11m
T = 2h 11m = 20%
PT:
T_Forward model (2 iterations) = 9h 33m, T_P = 1h 12m
T = 1h 26m = 15%
O3:
T_Forward model (2 iterations) = 4h 10m, T_P = 30m
T = 60m = 21%
MEMORY CLASSIFICATION
In order to use a parallel code we need appropriate hardware, which can be classified by memory:
Shared memory: each processor (P) can see the whole memory (M).
Local memory: each processor can see only its own memory; to exchange data we need a network.
[Diagrams: one memory block shared by all processors vs. one memory block per processor, with the processors connected through a network.]
OPENMP VS MPI
With shared-memory systems OpenMP is used: compiler directives.
Parallelism is not visible to the programmer (the compiler is responsible for the parallelism).
Easy to do.
Small improvements in performance.
With local-memory systems MPI is used: calls to library routines (the header file mpif.h contains the definitions of the MPI constants, types and functions).
Parallelism is visible to the programmer.
Difficult to do.
Large improvements in performance.
OPENMP EXAMPLE

PROGRAM Matrix
  IMPLICIT NONE
  INTEGER (KIND=4) :: i, j
  INTEGER (KIND=4), PARAMETER :: n = 1000
  INTEGER (KIND=4) :: a(n,n)
!$OMP PARALLEL DO &
!$OMP PRIVATE(i,j) &
!$OMP SHARED(a)
  DO j = 1, n
     DO i = 1, n
        a(i,j) = i + j
     ENDDO
  ENDDO
!$OMP END PARALLEL DO
END PROGRAM Matrix

f90 name_program
If we compile in this way, the compiler treats the lines beginning with !$OMP as comments.

f90 -omp name_program
setenv OMP_NUM_THREADS 2
If we compile with the -omp flag, the compiler reads these directives; OMP_NUM_THREADS sets the number of threads.
MPI EXAMPLE
POINT-TO-POINT COMMUNICATION: SEND and RECEIVE

MPI_SEND(buf, count, type, dest, tag, comm, ierr)
MPI_RECV(buf, count, type, source, tag, comm, status, ierr)

BUF     array of type TYPE
COUNT   number of elements of BUF to be sent
TYPE    MPI type of BUF
DEST    rank of the destination process (SOURCE: rank of the sending process)
TAG     number identifying the message
COMM    communicator of the sender and the receiver
STATUS  array containing the communication status
IERR    error code (if IERR = 0 no error occurred)
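A minimal sketch of how two processes could exchange a value with these calls (the value and the message tag are illustrative, not taken from the slides):

PROGRAM SendRecv
  IMPLICIT NONE
  INCLUDE 'mpif.h'
  REAL (KIND=4) :: value
  INTEGER (KIND=4) :: err, rank, status(MPI_STATUS_SIZE)

  CALL MPI_INIT(err)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, err)

  IF (rank .EQ. 0) THEN
     value = 3.14
     ! Process 0 sends one real to process 1, message tag 10
     CALL MPI_SEND(value, 1, MPI_REAL, 1, 10, MPI_COMM_WORLD, err)
  ELSE IF (rank .EQ. 1) THEN
     ! Process 1 receives one real from process 0 with the same tag
     CALL MPI_RECV(value, 1, MPI_REAL, 0, 10, MPI_COMM_WORLD, status, err)
     PRINT *, "P:", rank, " received ", value
  ENDIF

  CALL MPI_FINALIZE(err)
END PROGRAM SendRecv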
MPI EXAMPLE/1
BROADCAST (ONE-TO-ALL COMMUNICATION): THE SAME DATA ARE SENT FROM THE ROOT PROCESS TO ALL THE OTHERS IN THE COMMUNICATOR
[Diagram: before the broadcast only the root process holds the data item A0; after the broadcast every process in the communicator holds A0.]
MPI COMMUNICATOR
IN MPI IT IS POSSIBLE TO DIVIDE THE TOTAL NUMBER OF PROCESSES INTO GROUPS, CALLED COMMUNICATORS.
THE COMMUNICATOR THAT INCLUDES ALL PROCESSES IS CALLED MPI_COMM_WORLD.
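As an illustration (not taken from the slides), the standard routine MPI_COMM_SPLIT can create such groups; here the processes are split into two communicators according to the parity of their rank:

PROGRAM SplitComm
  IMPLICIT NONE
  INCLUDE 'mpif.h'
  INTEGER (KIND=4) :: err, rank, color, newcomm, newrank

  CALL MPI_INIT(err)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, err)

  ! Processes with the same color end up in the same new communicator:
  ! even ranks form one group, odd ranks another.
  color = MOD(rank, 2)
  CALL MPI_COMM_SPLIT(MPI_COMM_WORLD, color, rank, newcomm, err)
  CALL MPI_COMM_RANK(newcomm, newrank, err)
  PRINT *, "P:", rank, " has rank ", newrank, " in its group"

  CALL MPI_FINALIZE(err)
END PROGRAM SplitComm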
BROADCAST EXAMPLE

PROGRAM Broadcast
  IMPLICIT NONE
  INCLUDE 'mpif.h'
  REAL (KIND=4) :: buffer
  INTEGER (KIND=4) :: err, rank, size

  CALL MPI_INIT(err)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, err)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, size, err)
  IF (rank .EQ. 5) buffer = 24.
  CALL MPI_BCAST(buffer, 1, MPI_REAL, 5, MPI_COMM_WORLD, err)
  PRINT *, "P:", rank, " after broadcast buffer is ", buffer
  CALL MPI_FINALIZE(err)
END PROGRAM Broadcast

Process 5 sends its real variable buffer to all the processes in the communicator MPI_COMM_WORLD. Output with 8 processes:
P:1 after broadcast buffer is 24.
P:3 after broadcast buffer is 24.
P:4 after broadcast buffer is 24.
P:0 after broadcast buffer is 24.
P:5 after broadcast buffer is 24.
P:6 after broadcast buffer is 24.
P:7 after broadcast buffer is 24.
P:2 after broadcast buffer is 24.
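(Not stated on the slide: with most MPI implementations such an example is compiled with an MPI-aware wrapper such as mpif90 and launched with mpirun -np 8; the exact command names depend on the MPI implementation installed.)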
OTHER COLLECTIVE COMMUNICATIONS
ALLGATHER: DIFFERENT DATA ARE SENT FROM EACH PROCESS TO ALL THE OTHERS IN THE COMMUNICATOR
SCATTER: DIFFERENT DATA ARE SENT FROM THE ROOT PROCESS TO ALL THE OTHERS IN THE COMMUNICATOR
GATHER: THE OPPOSITE OF SCATTER
[Diagrams: allgather replicates the items A0, B0, C0, D0 on every process; scatter distributes the root's items A0-A3 one per process; gather collects them back on the root.]
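A minimal sketch (not from the slides) of MPI_SCATTER and MPI_GATHER: the root distributes one element of an array to every process, each process works on its own piece, and the root collects the results.

PROGRAM ScatterGather
  IMPLICIT NONE
  INCLUDE 'mpif.h'
  INTEGER (KIND=4), PARAMETER :: root = 0
  INTEGER (KIND=4) :: err, rank, size, i
  REAL (KIND=4), ALLOCATABLE :: full(:)
  REAL (KIND=4) :: piece

  CALL MPI_INIT(err)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, err)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, size, err)

  ALLOCATE(full(size))
  IF (rank .EQ. root) full = (/ (REAL(i), i = 1, size) /)

  ! One element of full goes to each process...
  CALL MPI_SCATTER(full, 1, MPI_REAL, piece, 1, MPI_REAL, root, MPI_COMM_WORLD, err)
  piece = 2. * piece            ! ...each process works on its own piece...
  ! ...and the root gathers the results back, in rank order.
  CALL MPI_GATHER(piece, 1, MPI_REAL, full, 1, MPI_REAL, root, MPI_COMM_WORLD, err)

  IF (rank .EQ. root) PRINT *, "gathered: ", full
  CALL MPI_FINALIZE(err)
END PROGRAM ScatterGather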
LINUX CLUSTER
We have a Linux cluster with 8 nodes connected through a LAN switch (network); each node has:
CPU Intel P4, 2.8 GHz, Front Side Bus 800 MHz
2 GB RAM, 333 MHz
Hard disk 40 GB
CONCLUSIONS
AlphaServer with 2 CPUs (shared memory):
Very expensive (~ ,00 €)
Limited number of CPUs
Linux cluster (local memory):
Cheap (~900,00 € per node)
Unlimited number of CPUs
In the past only 32-bit architectures: 2^(32-1) = 2 Gbyte = 2 · 2^30 bytes of addressable memory.
Now 64-bit architectures: 2^(64-1) = 8 Exabyte = 8 · 2^60 bytes.
For readability and simplicity of the code we would like to use Fortran 90.