Parallel I/O Basics
Claudio Gheller, CINECA
Firenze, June 2003
Reading and writing data is a problem that is usually underestimated. However, it can become crucial for:
- Performance
- Porting data to different platforms
- Parallel implementation of I/O algorithms
Performance
Time to access disk: of the order of 10-100 Mbyte/sec
Time to access memory: approx 1-10 Gbyte/sec
THEREFORE a code that reads/writes on disk can be about 100 times slower than one that works in memory.
Optimization is platform dependent. In general: write large amounts of data in single shots.
Performance
For example, avoid looped read/write:

  do i=1,N
    write(10) A(i)
  enddo

is VERY slow.
Optimization is platform dependent. In general: write large amounts of data in single shots.
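For comparison, a minimal sketch of the single-shot alternative (the file name out.bin is our placeholder): one unformatted write moves the whole array in one record, so the per-record overhead is paid once instead of N times.

  program single_shot
    implicit none
    integer, parameter :: N = 1000000
    real*4 :: A(N)
    A = 1.0
    open(unit=10, file='out.bin', form='unformatted')
    write(10) A        ! the whole array in one record: one disk transaction
    close(10)
  end program single_shot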
Data portability
This is a subtle problem, which typically becomes crucial only at the end... when you try to use the data on a different platform. For example: unformatted data written by an IBM system cannot be read by an Alpha station or by a Linux/MS Windows PC.
There are two main problems:
- Data representation
- File structure
Data portability: number representation
There are two different representations of a 4-byte word Byte3 Byte2 Byte1 Byte0 in memory:

Big Endian (Unix: IBM, SGI, SUN...):
  Base Address+0: Byte3
  Base Address+1: Byte2
  Base Address+2: Byte1
  Base Address+3: Byte0

Little Endian (Alpha, PC):
  Base Address+0: Byte0
  Base Address+1: Byte1
  Base Address+2: Byte2
  Base Address+3: Byte3
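A quick run-time check of which representation the current platform uses; a minimal sketch based on the standard TRANSFER intrinsic (program and variable names are ours):

  program endian_check
    implicit none
    integer*4 :: i = 1
    character(len=4) :: bytes
    bytes = transfer(i, bytes)            ! reinterpret the integer as raw bytes
    if (ichar(bytes(1:1)) .eq. 1) then
      write(*,*) 'Little endian'          ! least significant byte comes first
    else
      write(*,*) 'Big endian'             ! most significant byte comes first
    endif
  end program endian_check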
Data portability: file structure
For performance reasons, Fortran organizes binary (sequential unformatted) files in BLOCKS. Each block is delimited by a proper byte sequence (typically a 4-byte record-length marker).
Unfortunately, each Fortran compiler has its own block size and separators!!!
Notice that this problem is typical of Fortran and does not affect C/C++.
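As an illustration (the exact layout is compiler dependent, which is precisely the problem), writing one record of 100 real*4 values with a sequential unformatted write often produces on disk something like:

  [ 4-byte marker: 400 ][ 400 bytes of data ][ 4-byte marker: 400 ]

A file written by compiler A can therefore be misread by compiler B, which expects different markers.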
Data portability: compiler solutions
Some compilers allow you to overcome these problems with specific options. However, this leads to:
- spending a lot of time re-configuring the compilation on each different system;
- a less portable code (the results depend on the compiler).
Data portability: compiler solutions
For example, the Alpha Fortran compiler allows you to use big-endian data through the -convert big_endian option. However, this option is not present in other compilers and, furthermore, data produced with this option are not in the native format of the system that wrote them!!!
Fortran offers a possible solution to both the performance and the portability problems with DIRECT ACCESS files:

  open(unit=10, file='datafile.bin', form='unformatted', access='direct', recl=N)

The result is a binary file with no blocks and no control characters. Any Fortran compiler writes (and can read) it in THE SAME WAY.
Notice however that the endianness problem is still present... but the file is portable between any platforms with the same endianness.
Direct access files
The keyword recl is the size of the basic quantum of written data, the record. It is usually expressed in bytes (except on Alpha, which expresses it in 4-byte words). INQUIRE(IOLENGTH=...) returns the length in whatever units the compiler expects for recl, so the code below stays portable.

Example 1

  real*4 x(100)
  inquire(iolength=IOL) x(1)
  open(unit=10, file='datafile.bin', form='unformatted', access='direct', recl=IOL)
  do i=1,100
    write(10, rec=i) x(i)
  enddo
  close(10)

Portable but not performing!!! (Notice that this is precisely the C fread/fwrite style of I/O.)
Direct access files
Example 2

  real*4 x(100)
  inquire(iolength=IOL) x
  open(unit=10, file='datafile.bin', form='unformatted', access='direct', recl=IOL)
  write(10, rec=1) x
  close(10)

Portable and performing!!!
Direct access files
Example 3

  real*4 x(100), y(100), z(100)
  open(unit=10, file='datafile.bin', form='unformatted', access='direct', recl=4*100)
  write(10, rec=1) x
  write(10, rec=2) y
  write(10, rec=3) z
  close(10)

The same result can be obtained with:

  real*4 x(100), y(100), z(100)
  open(unit=10, file='datafile.bin', form='unformatted', access='direct', recl=4*100)
  write(10, rec=2) y
  write(10, rec=3) z
  write(10, rec=1) x
  close(10)

Order is not important!!!
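Reading works the same way: any record can be fetched directly, in any order. A minimal sketch reading back only y (the second record) from the file written above:

  real*4 y(100)
  open(unit=10, file='datafile.bin', form='unformatted', access='direct', recl=4*100)
  read(10, rec=2) y      ! fetch the second record directly, no need to read x first
  close(10)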
Parallel I/O
I/O is not a trivial issue in parallel.

Example

  program Scrivi
    write(*,*) 'Hello World'
  end program Scrivi

Execute in parallel on 4 processors (Pe 0, Pe 1, Pe 2, Pe 3):

  $ ./Scrivi
  Hello World
  Hello World
  Hello World
  Hello World

Each processor writes to standard output: the message appears four times, in unpredictable order.
Parallel I/O
Goals:
- Improve the performance
- Ensure data consistency
- Avoid communication
- Usability
Parallel I/O
Solution 1: Master-Slave
Only 1 processor performs I/O: Pe 1, Pe 2, Pe 3 send their data to Pe 0, which writes the Data File. A sketch follows.
Goals:
- Improve the performance: NO
- Ensure data consistency: YES
- Avoid communication: NO
- Usability: YES (but in general not portable)
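A minimal sketch of the master-slave scheme with MPI point-to-point calls (variable names and the file name data.bin are ours; error handling omitted; MPI_REAL is assumed to match real*4, as on common platforms):

  program master_slave_io
    implicit none
    include 'mpif.h'
    integer, parameter :: N = 100          ! chunk size per processor
    real*4 :: chunk(N), buffer(N)
    integer :: ierr, mype, npes, ip, status(MPI_STATUS_SIZE)
    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, mype, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, npes, ierr)
    chunk = real(mype)
    if (mype .eq. 0) then
      open(unit=10, file='data.bin', form='unformatted', access='direct', recl=4*N)
      write(10, rec=1) chunk               ! processor 0 writes its own chunk...
      do ip = 1, npes-1                    ! ...then receives and writes the others
        call MPI_RECV(buffer, N, MPI_REAL, ip, 0, MPI_COMM_WORLD, status, ierr)
        write(10, rec=ip+1) buffer
      enddo
      close(10)
    else
      call MPI_SEND(chunk, N, MPI_REAL, 0, 0, MPI_COMM_WORLD, ierr)
    endif
    call MPI_FINALIZE(ierr)
  end program master_slave_io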
Parallel I/O
Solution 2: Distributed I/O
All the processors read/write their own file: Pe 0 -> Data File 0, Pe 1 -> Data File 1, Pe 2 -> Data File 2, Pe 3 -> Data File 3 (a sketch follows).
Goals:
- Improve the performance: YES (but be careful)
- Ensure data consistency: YES
- Avoid communication: YES
- Usability: NO
Warning: do not parametrize with the number of processors!!! (a file set written by N processors is awkward to re-read with a different number)
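A common way to implement it: build each file name from the processor rank. A minimal sketch (the name pattern data_NNN.bin is ours; in real code mype comes from MPI_COMM_RANK):

  program per_pe_file
    implicit none
    character(len=16) :: fname
    integer :: mype
    mype = 0                                             ! from MPI_COMM_RANK in real code
    write(fname, '(A,I3.3,A)') 'data_', mype, '.bin'     ! -> data_000.bin, data_001.bin, ...
    open(unit=10, file=fname, form='unformatted', access='direct', recl=400)
    close(10)
  end program per_pe_file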
Parallel I/O
Solution 3: Distributed I/O on a single file
All the processors read/write on a single ACCESS='DIRECT' file: Pe 0, Pe 1, Pe 2, Pe 3 -> Data File (a sketch follows).
Goals:
- Improve the performance: YES for read, NO for write
- Ensure data consistency: NO
- Avoid communication: YES
- Usability: YES (portable!!!)
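The idea, sketched in the same fragment style as the examples above (file name ours): the processor rank selects the record, so each processor touches a disjoint region of the shared file.

  real*4 chunk(N)
  open(unit=10, file='data.bin', form='unformatted', access='direct', recl=4*N)
  write(10, rec=mype+1) chunk     ! each processor writes the record given by its rank
  close(10)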
Parallel I/O
Solution 4: MPI-2 I/O
The I/O is performed by MPI functions (part of the MPI-2 standard, not of MPI-1). Asynchronous I/O is supported. Pe 0, Pe 1, Pe 2, Pe 3 -> MPI -> Data File (a sketch follows).
Goals:
- Improve the performance: YES (strongly!!!)
- Ensure data consistency: NO
- Avoid communication: YES
- Usability: YES
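A minimal MPI-2 I/O sketch (file name data.bin and variable names are ours; MPI_REAL is assumed to match real*4): each process writes its own chunk of a shared file at an explicit offset.

  program mpiio_sketch
    implicit none
    include 'mpif.h'
    integer, parameter :: N = 100
    real*4 :: chunk(N)
    integer :: ierr, mype, fh
    integer(kind=MPI_OFFSET_KIND) :: offset
    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, mype, ierr)
    chunk = real(mype)
    call MPI_FILE_OPEN(MPI_COMM_WORLD, 'data.bin', &
         MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, fh, ierr)
    offset = mype * N * 4                  ! 4 bytes per real*4 element
    call MPI_FILE_WRITE_AT(fh, offset, chunk, N, MPI_REAL, &
         MPI_STATUS_IGNORE, ierr)
    call MPI_FILE_CLOSE(fh, ierr)
    call MPI_FINALIZE(ierr)
  end program mpiio_sketch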
Case study: data analysis, case 1
How many clusters are there in the image???
Cluster-finding algorithm:
Input = the image
Output = a number
Case study
Case 1: parallel implementation
Parallel cluster-finding algorithm (the image is split between Pe 0 and Pe 1):
Input = a fraction of the image
Output = a number for each processor
All the parallelism is in the setup of the input. Then all processors work independently!!!!
Case study
Case 1: setup of the input
Each processor reads its own part of the input file (Pe 0 the first half, Pe 1 the second half):

  ! The image is NxN pixels, using 2 processors
  real*4 array(N, N/2)
  open(unit=10, file='image.bin', form='unformatted', access='direct', recl=4*N*N/2)
  startrecord = mype + 1
  read(10, rec=startrecord) array
  call Sequential_Find_Cluster(array, N_cluster)
  write(*,*) mype, ' found ', N_cluster, ' clusters'
Case study
Case 1: boundary conditions
Boundaries must be treated in a specific way: each processor pads its sub-image with one extra row/column, set to zero on the true image sides and read from the file on the side shared with the other processor.

  ! The image is NxN pixels, using 2 processors
  real*4 array(0:N+1, 0:N/2+1)
  ! Set the boundaries on the image sides
  array(0,:)   = 0.0
  array(N+1,:) = 0.0
  jside = mod(mype,2)*N/2 + mod(mype,2)   ! true image edge: j=0 on Pe 0, j=N/2+1 on Pe 1
  array(:,jside) = 0.0
  ! Read the local rows, one record (= one image row) at a time
  open(unit=10, file='image.bin', form='unformatted', access='direct', recl=4*N)
  do j=1,N/2
    record = mype*N/2 + j
    read(10, rec=record) array(1:N,j)
  enddo
  ! Read the shared boundary row from the neighbour's half of the file
  if (mype .eq. 0) then
    record = N/2 + 1
    read(10, rec=record) array(1:N, N/2+1)
  else
    record = N/2
    read(10, rec=record) array(1:N, 0)
  endif
  call Sequential_Find_Cluster(array, N_cluster)
  write(*,*) mype, ' found ', N_cluster, ' clusters'
Case study: data analysis, case 2
From the observed data... to the sky map.
Case study: data analysis, case 2
Each map pixel is measured N times. The final value of each pixel is the average of all the corresponding measurements.
[figure: the stream of measured values, each with its map pixel id, is accumulated onto the MAP]
Case study
Case 2: parallelization
- Values and ids are distributed between the processors in the data input phase (just like case 1).
- The calculation is performed independently by each processor.
- Each processor produces its own COMPLETE map (which is small and can be replicated).
- The final map is the SUM OF ALL THE MAPS calculated by the different processors.
Case study
Case 2: parallelization

  ! N data, M pixels, Npes processors (M << N)
  ! Define the basic arrays
  real*8 value(N/Npes)
  real*8 map(M)
  integer id(N/Npes)
  ! Read the data in parallel (boundaries are neglected)
  open(unit=10, file='data.bin', form='unformatted', access='direct', recl=8*N/Npes)
  open(unit=20, file='ids.bin',  form='unformatted', access='direct', recl=4*N/Npes)
  record = mype + 1
  read(10, rec=record) value
  read(20, rec=record) id
  ! Calculate the local map
  call Sequential_Calculate_Local_Map(value, id, map)
  ! Synchronize the processes
  call BARRIER
  ! Parallel calculation of the final map
  call Calculate_Final_Map(map)
  ! Print the final map
  call Print_Final_Map(map)
Case study
Case 2: calculation of the final map
Calculate the final map processor by processor:

  subroutine Calculate_Final_Map(map)
  real*8 map(M)
  real*8 map_aux(M)
  ! Processor 0 accumulates the local maps of all the other processors
  do i = 2, npes
    if (mype .eq. 0) then
      call RECV(map_aux, i-1)
      map = map + map_aux
    else if (mype .eq. i-1) then
      call SEND(map, 0)
    endif
    call BARRIER
  enddo
  return

However MPI offers a MUCH BETTER solution (we will see it tomorrow).
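For reference, a hedged sketch of what that better solution presumably looks like with MPI_REDUCE, which sums the local maps of all processors onto processor 0 in a single collective call (M, mype and the surrounding declarations are assumed to come from the enclosing code, as in the slides):

  subroutine Calculate_Final_Map(map)
  include 'mpif.h'
  real*8 map(M)
  real*8 map_tot(M)
  integer ierr
  ! Element-wise sum of all the local maps; the result lands on processor 0
  call MPI_REDUCE(map, map_tot, M, MPI_DOUBLE_PRECISION, MPI_SUM, 0, &
                  MPI_COMM_WORLD, ierr)
  if (mype .eq. 0) map = map_tot
  return
  end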
Case study
Case 2: print the final map
At this point ONLY processor 0 has the final map and can print it out:

  subroutine Print_Final_Map(map)
  real*8 map(M)
  ! Only one processor writes the result
  if (mype .eq. 0) then
    do i = 1, M
      write(*,*) i, map(i)
    enddo
  endif
  return
  end