High Performance Computing Course Notes 2007-2008 Parallel I/O.

Slide 1: High Performance Computing Course Notes – Parallel I/O

Slide 2: Aims
- To learn how to achieve higher I/O performance
- To use a concrete implementation (MPI-IO):
  - Some concepts, including etypes, displacements and views
  - Collective vs. non-collective I/O
  - Contiguous vs. non-contiguous I/O

Slide 3: Why are we looking at parallel I/O?
- I/O is a major bottleneck in many parallel applications
- I/O subsystems for parallel machines may be designed for high performance, yet many applications achieve less than one-tenth of the peak I/O bandwidth
- Parallel I/O systems are designed for large data transfers (megabytes of data)
- However, many parallel applications make many smaller I/O requests (less than a kilobyte)

Slide 4: Parallel I/O – version 1.0
Early solution: all processes send data to process 0, which then writes to the file.
Phase 1: all processes send their data blocks (d0, d1, d2, d3) to process 0.

Slide 5: Parallel I/O – version 1.0
Phase 1: all processes send their data blocks (d0, d1, d2, d3) to process 0.
Phase 2: process 0 writes the collected blocks to the file.

Slide 6: Parallel I/O – version 1.0
Bad things about version 1.0:
1. Single node bottleneck
2. Poor performance
3. Poor scalability
4. Single point of failure
Good things about version 1.0:
1. The parallel machine need only support I/O from one process
2. No specialized I/O library is needed
3. If you are converting from sequential code, this parallel version of the program stays close to the original
4. Results in a single file, which is easy to manage
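A minimal sketch of version 1.0 in MPI (not from the slides; the element count, datatype and file name are illustrative assumptions): every process sends its block to process 0, which alone writes the single output file with ordinary C I/O.

    /* Version 1.0 sketch: gather everything to rank 0, which writes the file.
       N_LOCAL, MPI_DOUBLE and "output.dat" are illustrative assumptions. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N_LOCAL 1024                 /* elements owned by each process */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        double local[N_LOCAL];
        double *global = NULL;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* ... fill local[] with this process's data ... */

        if (rank == 0)
            global = malloc((size_t)nprocs * N_LOCAL * sizeof(double));

        /* Phase 1: all processes send their data to process 0 */
        MPI_Gather(local, N_LOCAL, MPI_DOUBLE,
                   global, N_LOCAL, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* Phase 2: process 0 alone writes the whole dataset to one file */
        if (rank == 0) {
            FILE *fp = fopen("output.dat", "wb");
            fwrite(global, sizeof(double), (size_t)nprocs * N_LOCAL, fp);
            fclose(fp);
            free(global);
        }

        MPI_Finalize();
        return 0;
    }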

Slide 7: Parallel I/O – version 2.0
All processes can now write in one phase: each process writes its data block (d0, d1, d2, d3) to a separate file (File 1, File 2, File 3, File 4).

Slide 8: Parallel I/O – version 2.0
Good things about version 2.0:
1. Now we are doing things in parallel
2. High performance

Slide 9: Parallel I/O – version 2.0
Bad things about version 2.0:
1. We now have lots of small files to manage
2. How do we read the data back when the number of processes changes?
3. Does not interoperate well with other applications
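A minimal sketch of version 2.0 (again with illustrative names and sizes, not from the slides): every process writes its own file, so the writes proceed fully in parallel but leave one file per process to manage.

    /* Version 2.0 sketch: one output file per process (file names are illustrative). */
    #include <mpi.h>
    #include <stdio.h>

    #define N_LOCAL 1024                 /* elements owned by each process */

    int main(int argc, char **argv)
    {
        int rank;
        double local[N_LOCAL];
        char fname[64];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* ... fill local[] with this process's data ... */

        /* Each process writes its own, independent file */
        snprintf(fname, sizeof(fname), "output.%04d.dat", rank);
        FILE *fp = fopen(fname, "wb");
        fwrite(local, sizeof(double), N_LOCAL, fp);
        fclose(fp);

        MPI_Finalize();
        return 0;
    }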

Slide 10: Parallel I/O – version 3.0
All processes can now write in one phase, to one common file: multiple processes of the parallel program access (read/write) data in a single shared file.

Slide 11: Parallel I/O – version 3.0
Good things about version 3.0:
- Simultaneous I/O from any number of processes
- Maps well onto collective operations
- Excellent performance and scalability
- Results in a single file, which is easy to manage and interoperates well with other applications
Bad things about version 3.0:
- Requires more complex I/O library support
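A minimal sketch of version 3.0 using MPI-IO (illustrative sizes and file name, and explicit-offset MPI_File_write_at rather than the view-based calls introduced later): all processes open one common file collectively and each writes its block at a rank-dependent offset.

    /* Version 3.0 sketch: all processes write to a single shared file. */
    #include <mpi.h>

    #define N_LOCAL 1024                 /* elements owned by each process */

    int main(int argc, char **argv)
    {
        int rank;
        double local[N_LOCAL];
        MPI_File fh;
        MPI_Offset offset;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* ... fill local[] with this process's data ... */

        /* Collective open of one common file */
        MPI_File_open(MPI_COMM_WORLD, "output.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Each rank writes its contiguous block at its own byte offset */
        offset = (MPI_Offset)rank * N_LOCAL * sizeof(double);
        MPI_File_write_at(fh, offset, local, N_LOCAL, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }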

Slide 12: What is Parallel I/O?
Multiple processes of a parallel program accessing data (reading or writing) in a common file.
[Diagram: processes P0, P1, P2, ..., P(n-1) all accessing one FILE]

Slide 13: Why Parallel I/O?
Non-parallel I/O:
- Simple
- Poor performance, if a single process is writing to one file
- Hard to interoperate with other applications, if writing to more than one file
Parallel I/O:
- Provides high performance
- Provides a single file with which it is easy to interoperate with other tools (e.g. visualization systems)
- If designed well, can use existing features of parallel libraries such as collectives and derived datatypes

Slide 14: Why Parallel I/O?
We are going to look at parallel I/O in the context of MPI. Why?
- Because writing is like sending a message and reading is like receiving one
- Because collective-like operations are important in parallel I/O
- Because non-contiguous data layout is important (if we are using a single file), and it is supported by MPI datatypes
- Because parallel I/O is now an integral part of MPI-2

Slide 15: Parallel I/O example
- Consider a 2D array distributed among 16 processes (P0 to P15) in a 4 x 4 grid
- The array is stored in the file in row-major order
[Diagram: the 4 x 4 block decomposition of the array and the corresponding file layout, in which each file row interleaves pieces from four processes (P0 P1 P2 P3, repeated, then P4 P5 P6 P7, and so on)]

Slide 16: Access pattern 1: MPI_File_seek
Updates the individual file pointer.

    int MPI_File_seek(MPI_File mpi_fh, MPI_Offset offset, int whence);

Parameters:
- mpi_fh: [in] file handle (handle)
- offset: [in] file offset (integer)
- whence: [in] update mode (state)

MPI_FILE_SEEK updates the individual file pointer according to whence, which has the following possible values:
- MPI_SEEK_SET: the pointer is set to offset
- MPI_SEEK_CUR: the pointer is set to the current pointer position plus offset
- MPI_SEEK_END: the pointer is set to the end of file plus offset

Slide 17: Access pattern 1: MPI_File_read
Read using the individual file pointer.

    int MPI_File_read(MPI_File mpi_fh, void *buf, int count,
                      MPI_Datatype datatype, MPI_Status *status);

Parameters:
- mpi_fh: [in] file handle (handle)
- buf: [out] initial address of buffer
- count: [in] number of elements in buffer (nonnegative integer)
- datatype: [in] datatype of each buffer element (handle)
- status: [out] status object (Status)

Slide 18: Access pattern 1
- We could do a UNIX-style access pattern in MPI-IO
- One independent read request is issued for each row in the local array
- Many independent, contiguous requests

    MPI_File_open(…, "filename", …, &fh)
    for (i = 0; i < n_local_rows; i++) {
        MPI_File_seek(fh, offset, …)
        MPI_File_read(fh, row[i], …)
    }
    MPI_File_close(&fh)

- Individual file pointers, per process per file handle
- Each process sets the file pointer to some suitable offset
- The data is then read into the local array
- This is not a collective operation; each read is an independent blocking call
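A more concrete version of the loop above for the 16 x 16 example (a sketch, assuming MPI is already initialised and rank holds this process's rank; the float element type, the file name "datafile" and the process-grid arithmetic are assumptions based on slide 15):

    /* Access pattern 1 sketch: one independent seek + read per local row. */
    #define GROWS 16   /* global rows    */
    #define GCOLS 16   /* global columns */
    #define LROWS 4    /* local rows     */
    #define LCOLS 4    /* local columns  */

    float local_array[LROWS][LCOLS];
    MPI_File fh;
    int prow = rank / 4;      /* row of this process in the 4 x 4 grid    */
    int pcol = rank % 4;      /* column of this process in the 4 x 4 grid */

    MPI_File_open(MPI_COMM_WORLD, "datafile", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);

    for (int i = 0; i < LROWS; i++) {
        /* byte offset of local row i within the row-major global file */
        MPI_Offset offset =
            (MPI_Offset)((prow * LROWS + i) * GCOLS + pcol * LCOLS)
            * sizeof(float);
        MPI_File_seek(fh, offset, MPI_SEEK_SET);              /* independent */
        MPI_File_read(fh, local_array[i], LCOLS, MPI_FLOAT,
                      MPI_STATUS_IGNORE);                     /* independent */
    }

    MPI_File_close(&fh);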

Slide 19: Access pattern 2: MPI_File_read_all
Collective read using the individual file pointer.

    int MPI_File_read_all(MPI_File mpi_fh, void *buf, int count,
                          MPI_Datatype datatype, MPI_Status *status);

Parameters:
- mpi_fh: [in] file handle (handle)
- buf: [out] initial address of buffer (choice)
- count: [in] number of elements in buffer (nonnegative integer)
- datatype: [in] datatype of each buffer element (handle)
- status: [out] status object (Status)

MPI_FILE_READ_ALL is a collective version of the blocking MPI_FILE_READ interface.

Slide 20: Access pattern 2
- Similar to access pattern 1, but using collectives
- All processes that opened the file read data together (each with its own access information)
- Many collective, contiguous requests

    MPI_File_open(…, "filename", …, &fh)
    for (i = 0; i < n_local_rows; i++) {
        MPI_File_seek(fh, offset, …)
        MPI_File_read_all(fh, row[i], …)
    }
    MPI_File_close(&fh)

- read_all is a collective version of the read operation
- It is blocking
- Each process accesses the file at the same time
- This can be useful because independent I/O operations do not convey to the I/O library what other processes are doing at the same time

Slide 21: Access pattern 3: Definitions
File:
- Ordered collection of typed data items
- MPI supports random or sequential access
- Opened collectively by a group of processes
- All collective I/O calls on the file are made over this group
Displacement:
- Absolute byte position relative to the beginning of the file
- Defines the location where a view begins
etype (elementary datatype):
- Unit of data access and positioning
- Can be a predefined or derived datatype
- Offsets are expressed as multiples of etypes

Slide 22: Access pattern 3: Definitions (continued)
Filetype:
- Basis for partitioning the file among processes; defines a template for accessing the file (based on the etype)
View:
- The current set of data visible and accessible from an open file (as an ordered set of etypes)
- Each process has its own view, defined by a displacement, an etype and a filetype
- The pattern defined by the filetype is repeated (in units of etypes) beginning at the displacement

Slide 23: Access pattern 3: File Views
A view is specified by a triplet (displacement, etype, filetype) passed to MPI_File_set_view:
- displacement = number of bytes to be skipped from the start of the file
- etype = basic unit of data access (can be any basic or derived datatype)
- filetype = specifies which portion of the file is visible to the process

Slide 24: Access pattern 3: A Simple Noncontiguous File View Example
- etype = MPI_INT
- filetype = two MPI_INTs followed by a gap of four MPI_INTs
[Diagram: starting at the displacement from the head of the file, the filetype pattern (two visible MPI_INTs, then a four-MPI_INT gap) repeats along the file, and so on]
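One way to construct that filetype in code (a sketch, not from the slides; fh, disp and rank are assumed to already exist): build a contiguous block of two MPI_INTs and resize its extent to six MPI_INTs, so the four-int gap appears when the pattern tiles the file.

    /* Sketch: filetype = 2 MPI_INTs of data followed by a gap of 4 MPI_INTs,
       built by resizing the extent of a 2-int contiguous type to 6 ints. */
    MPI_Datatype contig, filetype;
    MPI_Type_contiguous(2, MPI_INT, &contig);
    MPI_Type_create_resized(contig, 0, 6 * sizeof(int), &filetype);
    MPI_Type_commit(&filetype);

    /* Complementary views (next slide): each process shifts the same pattern
       by its own amount, e.g. rank r starts r * 2 ints after the common
       displacement disp, so three such processes tile the whole file. */
    MPI_File_set_view(fh, disp + (MPI_Offset)rank * 2 * sizeof(int),
                      MPI_INT, filetype, "native", MPI_INFO_NULL);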

Slide 25: Access pattern 3: How do views relate to multiple processes?
- Partitioning a file among parallel processes
- A group of processes uses complementary views to achieve the global data distribution
[Diagram: starting at the common displacement, the filetypes of proc. 0, proc. 1 and proc. 2 interleave so that together they tile the whole file]

Slide 26: MPI_File_set_view
Describes the part of the file accessed by a single MPI process.

    int MPI_File_set_view(MPI_File mpi_fh, MPI_Offset disp, MPI_Datatype etype,
                          MPI_Datatype filetype, char *datarep, MPI_Info info);

Parameters:
- mpi_fh: [in] file handle (handle)
- disp: [in] displacement (nonnegative integer)
- etype: [in] elementary datatype (handle)
- filetype: [in] filetype (handle)
- datarep: [in] data representation (string)
- info: [in] info object (handle)

Slide 27: Access pattern 3: File View Example

    MPI_File thefile;
    /* buf is an int array of BUFSIZE elements; myrank is this process's rank */
    for (i = 0; i < BUFSIZE; i++)
        buf[i] = myrank * BUFSIZE + i;
    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &thefile);
    /* each rank's view starts at a rank-dependent byte displacement */
    MPI_File_set_view(thefile, myrank * BUFSIZE * sizeof(int),
                      MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
    MPI_File_write(thefile, buf, BUFSIZE, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&thefile);

Slide 28: MPI_Type_create_subarray
Creates a datatype for a subarray of a regular, multidimensional array.

    int MPI_Type_create_subarray(int ndims, int array_of_sizes[],
                                 int array_of_subsizes[], int array_of_starts[],
                                 int order, MPI_Datatype oldtype,
                                 MPI_Datatype *newtype);

Parameters:
- ndims: [in] number of array dimensions (positive integer)
- array_of_sizes: [in] number of elements of type oldtype in each dimension of the full array (array of positive integers)
- array_of_subsizes: [in] number of elements of type oldtype in each dimension of the subarray (array of positive integers)
- array_of_starts: [in] starting coordinates of the subarray in each dimension (array of nonnegative integers)
- order: [in] array storage order flag (state)
- oldtype: [in] array element datatype (handle)
- newtype: [out] new datatype (handle)

Slide 29: Using the Subarray Datatype

    gsizes[0] = 16;  /* no. of rows in global array */
    gsizes[1] = 16;  /* no. of columns in global array */
    psizes[0] = 4;   /* no. of procs. in vertical dimension */
    psizes[1] = 4;   /* no. of procs. in horizontal dimension */
    lsizes[0] = 16 / psizes[0];  /* no. of rows in local array */
    lsizes[1] = 16 / psizes[1];  /* no. of columns in local array */

    dims[0] = 4;
    dims[1] = 4;
    periods[0] = periods[1] = 1;

    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &comm);
    MPI_Comm_rank(comm, &rank);
    MPI_Cart_coords(comm, rank, 2, coords);

Slide 30: Using the Subarray Datatype (continued)

    /* global indices of the first element of the local array */
    start_indices[0] = coords[0] * lsizes[0];
    start_indices[1] = coords[1] * lsizes[1];

    MPI_Type_create_subarray(2, gsizes, lsizes, start_indices,
                             MPI_ORDER_C, MPI_FLOAT, &filetype);
    MPI_Type_commit(&filetype);

Slide 31: Access pattern 3
- Each process creates a derived datatype to describe its non-contiguous access pattern
- We thus have a file view and independent access
- A single independent, non-contiguous request

    MPI_Type_create_subarray(…, &subarray, …)
    MPI_Type_commit(&subarray)
    MPI_File_open(…, "filename", …, &fh)
    MPI_File_set_view(fh, …, subarray, …)
    MPI_File_read(fh, local_array, …)
    MPI_File_close(&fh)

- MPI_Type_create_subarray creates a datatype describing a subarray of a multi-dimensional array
- MPI_Type_commit commits the datatype (this must be done before it is used in communication or I/O)
- The system may compile an internal representation for the datatype at commit time

Slide 32: Access pattern 3 (continued)
- Each process creates a derived datatype to describe its non-contiguous access pattern
- We thus have a file view and independent access
- A single independent, non-contiguous request

    MPI_Type_create_subarray(…, &subarray, …)
    MPI_Type_commit(&subarray)
    MPI_File_open(…, "filename", …, &fh)
    MPI_File_set_view(fh, …, subarray, …)
    MPI_File_read(fh, local_array, …)
    MPI_File_close(&fh)

- The file is opened as before
- Each process then changes its view of the data in the file using MPI_File_set_view
- set_view is collective
- The reads themselves are still independent

Slide 33: Access pattern 3
- Note that each process reads its whole sub-array despite the non-contiguous storage in the file
- Processes {4,5,6,7}, {8,9,10,11} and {12,13,14,15} have file views based on the same filetypes as processes {0,1,2,3}, but with different displacements
[Diagram: the 4 x 4 process decomposition of the array, the corresponding row-major file, and the interleaved filetypes of proc. 0 to proc. 3]

Slide 34: Access pattern 4
- Each process creates a derived datatype to describe its non-contiguous access pattern
- We thus have a file view and collective access
- A single collective, non-contiguous request

    MPI_Type_create_subarray(…, &subarray, …)
    MPI_Type_commit(&subarray)
    MPI_File_open(…, "filename", …, &fh)
    MPI_File_set_view(fh, …, subarray, …)
    MPI_File_read_all(fh, local_array, …)
    MPI_File_close(&fh)

- The datatype is created and committed as before
- Each process then changes its view of the data in the file using MPI_File_set_view
- set_view is collective
- The reads are now collective too
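Putting the pieces together, a sketch of access pattern 4 for the 16 x 16 example (the subarray filetype is the one built on slides 29-30; the file name and the read-only mode are illustrative assumptions):

    /* Access pattern 4 sketch: subarray file view + one collective read. */
    float local_array[4][4];        /* lsizes[0] x lsizes[1] from slide 29 */
    MPI_File fh;

    /* filetype created with MPI_Type_create_subarray and committed as above */

    MPI_File_open(MPI_COMM_WORLD, "datafile", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);

    /* Collective: each process installs its own view of the common file */
    MPI_File_set_view(fh, 0, MPI_FLOAT, filetype, "native", MPI_INFO_NULL);

    /* Collective read: each process receives exactly its 4 x 4 sub-block */
    MPI_File_read_all(fh, local_array, 4 * 4, MPI_FLOAT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);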

Slide 35: Access patterns
- These access patterns express four different styles of parallel I/O
- You should choose your access pattern depending on the application
- The larger the I/O request, the better the performance
- Collective operations are going to do better than individual reads
- Pattern 4 therefore offers (potentially) the best performance

Slide 36: I/O optimization: Data Sieving
Data sieving combines lots of small accesses into a single larger one:
- Remote file systems (parallel or not) tend to have high latencies
- Reducing the number of operations is therefore important
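The idea in simplified code (a conceptual sketch, not the actual library implementation; off[], len, nreq, dest[] and fh are hypothetical variables, and the requests are assumed sorted by offset):

    /* Data sieving sketch: serve nreq small reads, each len bytes at offsets
       off[0..nreq-1], with ONE large contiguous read plus in-memory copies. */
    MPI_Offset lo = off[0];
    MPI_Offset hi = off[nreq - 1] + len;
    char *sieve = malloc((size_t)(hi - lo));

    MPI_File_read_at(fh, lo, sieve, (int)(hi - lo), MPI_BYTE,
                     MPI_STATUS_IGNORE);

    for (int i = 0; i < nreq; i++)          /* "sieve" out the wanted pieces */
        memcpy(dest[i], sieve + (off[i] - lo), (size_t)len);

    free(sieve);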

Slide 37: I/O optimization: Data Sieving Writes
Using data sieving for writes is more complicated:
- Must read the entire region first
- Then make our changes
- Then write the block back
Requires locking in the file system:
- Can result in false sharing

Slide 38: I/O optimization: Two-Phase Collective I/O
Problems with independent, noncontiguous access:
- Lots of small accesses
- Independent data sieving reads lots of extra data
Idea: reorganize the access to match the layout on the disks:
- A single process uses data sieving to fetch data on behalf of many
- Often reduces total I/O through sharing of common blocks
The second "phase" then moves the data to its final destinations.

Slide 39: I/O optimization: Collective I/O
Collective I/O is coordinated access to storage by a group of processes:
- Collective I/O functions must be called by all processes participating in the I/O
- This allows the I/O layers to know more about the access as a whole
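In practice these optimizations are usually steered through MPI-IO hints passed at open time. A sketch using hint keys commonly recognized by ROMIO-based MPI implementations (an assumption: support and exact behaviour vary between implementations, which are free to ignore unknown hints):

    /* Sketch: request data sieving and two-phase collective buffering via hints. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_ds_read",  "enable");   /* data sieving on reads  */
    MPI_Info_set(info, "romio_cb_write", "enable");   /* collective buffering   */
    MPI_Info_set(info, "cb_buffer_size", "16777216"); /* 16 MB two-phase buffer */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "datafile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Info_free(&info);
    /* ... set the view, write collectively, close ... */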