1 Outline l Performance Issues in I/O interface design l MPI Solutions to I/O performance issues l The ROMIO MPI-IO implementation

2 Semantics of I/O l Basic operations have requirements that are often not understood and can impact performance l Physical and logical operations may be quite different

3 Read and Write l Read and Write are atomic l No assumption on the number of processes (or their relationship to each other) that have a file open for reading and writing l Example interleaving: Process 1 reads a; Process 2 then writes b; Process 1 then reads b l Reading a large block containing both a and b (caching the data) and using that cached data to perform the second read without going back to the file is incorrect l This requirement of read/write results in over-specification of the interface for many application codes (the application does not require strong synchronization of read/write)
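
A minimal sketch (not from the slides) of why such client-side caching violates these semantics, written against plain POSIX I/O; the helper cached_read and the offsets OFF_A/OFF_B are illustrative only:

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define CACHE_SIZE 65536          /* illustrative cache-block size */
    static char  cache[CACHE_SIZE];
    static off_t cache_base = -1;     /* file offset covered by the cache */

    /* Naive cached read: fetch one large block, serve later reads from it. */
    ssize_t cached_read(int fd, void *buf, size_t len, off_t off)
    {
        if (cache_base < 0 || off < cache_base ||
            off + (off_t)len > cache_base + CACHE_SIZE) {
            cache_base = off;
            pread(fd, cache, CACHE_SIZE, cache_base);  /* one large physical read */
        }
        memcpy(buf, cache + (off - cache_base), len);  /* logical read from cache */
        return (ssize_t)len;
    }

    /* Process 1: cached_read(fd, &a, sizeof a, OFF_A)  -- caches the block holding a and b
     * Process 2: pwrite(fd, &b, sizeof b, OFF_B)       -- updates b in the file
     * Process 1: cached_read(fd, &b, sizeof b, OFF_B)  -- served from the stale cache: wrong
     * The atomicity/consistency rules require the second read to see the new b. */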

4 Open l User's model is that this gets a file descriptor and (perhaps) initializes local buffering l Problem: no Unix (or POSIX) interface for "exclusive access open" l One possible solution: »Make open keep track of how many processes have the file open »A second open succeeds only after the process that did the first open has changed its caching approach »Possible problems include a non-responsive (or dead) first process and the inability to work with parallel applications

5 Close l User's model is that this flushes the last data written to disk (if they think about that at all) and relinquishes the file descriptor l When is data written out to disk? »On close? »Never? l Example: »Unused physical memory pages are used as a disk cache »Combined with an Uninterruptible Power Supply, data may never appear on disk

6 Seek l User's model is that this assigns the given location to a variable and takes about 0.01 microseconds l Changes the position in the file for the "next" read l May interact with the implementation to cause data to be flushed to disk (clearing all caches) »Very expensive, particularly when multiple processes are seeking into the same file

7 Read/Fread l Users expect read (unbuffered) to be faster than fread (buffered) (rule of thumb: buffering is bad, particularly when done by the user) »The reverse is true for short data (often by several orders of magnitude) »The user thinks the reason is "system calls are expensive" »The real culprit is the atomic nature of read l Note: Fortran 77 requires unique open (Section , lines 44-45)

8 Tuning Parameters l I/O systems typically have a large range of tuning parameters l MPI-2 File hints include »MPI_MODE_UNIQUE_OPEN »File info –access style –collective buffering (and size, block size, nodes) –chunked (item, size) –striping –likely number of nodes (processors) –implementation-specific methods such as caching policy
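
As a hedged illustration of how such hints are supplied, this sketch creates an MPI_Info object, sets a few hint keys (the key names are standard MPI-2/ROMIO hints; the values are arbitrary examples), and passes it at open time:

    #include <mpi.h>

    void open_with_hints(const char *path, MPI_File *fh)
    {
        MPI_Info info;
        MPI_Info_create(&info);

        /* Example values only; good settings depend on the system. */
        MPI_Info_set(info, "access_style",         "write_once,sequential");
        MPI_Info_set(info, "collective_buffering", "true");
        MPI_Info_set(info, "cb_buffer_size",       "4194304");  /* 4 MB collective buffer */
        MPI_Info_set(info, "cb_nodes",             "8");        /* number of aggregators */
        MPI_Info_set(info, "striping_factor",      "16");       /* number of I/O devices */

        MPI_File_open(MPI_COMM_WORLD, (char *)path,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, fh);
        MPI_Info_free(&info);   /* the hints were copied at open time */
    }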

9 I/O Application Characterization l Data from Dan Reed’s Pablo project l Instrument both logical (API) and physical (OS code) interfaces to I/O system l Look at existing parallel applications

10 I/O Experiences (Prelude) l Application developers »do not know detailed application I/O patterns »do not understand file system behavior l File system designers »do not know how systems are used »do not know how systems perform

11 Input/Output Lessons l Access pattern categories »initialization »checkpointing »out-of-core »real-time »streaming l Within these categories »wide temporal and spatial variation »small requests are very common –but I/O often optimized for large requests…

12 Input/Output Lessons l Recurring themes »access pattern variability »extreme performance sensitivity »users avoid non-portable I/O interfaces l File system implications »wide variety of access patterns »unlikely that a single policy will suffice »standard parallel I/O APIs needed

13 Input/Output Lessons l Variability »request sizes »interaccess times »parallelism »access patterns »file multiplicity »file modes

14 Asking the Right Question l Do you want Unix or Fortran I/O? »Even with a significant performance penalty? l Do you want to change your program? »Even to another portable version with faster performance? »Not even for a factor of 40??? l User “requirements” can be misleading

15 Effect of user I/O choices (I/O model) l MPI-IO example using collective I/O »Addresses some synchronization issues l Parameter tuning significant

16 Importance of Correct User Model l Collective vs. Independent I/O model »Either will solve user’s functional problem l Same operation (in terms of bytes moved to/from user’s application), but slightly different program and assumptions »Different assumptions lead to very different performance

17 Why MPI is a Good Setting for Parallel I/O l Writing is like sending and reading is like receiving. l Any parallel I/O system will need: »collective operations »user-defined datatypes to describe both memory and file layout »communicators to separate application-level message passing from I/O-related message passing »non-blocking operations l Any parallel I/O system would like »method for describing application access pattern »implementation-specific parameters l I.e., lots of MPI-like machinery

18 Introduction to I/O in MPI l I/O in MPI can be considered as Unix I/O plus (lots of) other stuff. Basic operations: MPI_File_{open, close, read, write, seek} l Parameters to these operations (nearly) match Unix, aiding a straightforward port from Unix I/O to MPI I/O l However, to get performance and portability, more advanced features must be used
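
A minimal, hedged sketch of these basic operations (the file name "datafile" and buffer contents are arbitrary): every rank writes its own contiguous block of a shared file at an explicit offset.

    #include <mpi.h>

    #define N 1024   /* integers written per rank; arbitrary example size */

    int main(int argc, char *argv[])
    {
        int rank, i, buf[N];
        MPI_File fh;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (i = 0; i < N; i++) buf[i] = rank;

        /* Open (collectively), write this rank's block at an explicit offset, close. */
        MPI_File_open(MPI_COMM_WORLD, "datafile",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_write_at(fh, (MPI_Offset)rank * N * sizeof(int),
                          buf, N, MPI_INT, &status);
        MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }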

19 MPI I/O Features l Noncontiguous access in both memory and file l Use of explicit offset (faster seek) l Individual and shared file pointers l Nonblocking I/O l Collective I/O l Performance optimizations such as preallocation l File interoperability l Portable data representation l Mechanism for providing hints applicable to a particular implementation and I/O environment (e.g. number of disks, striping factor): info

20 “Two-Phase” I/O l Trade computation and communication for I/O l The interface describes the overall pattern at an abstract level l Data is written in large blocks to amortize the effect of high I/O latency l Message-passing (or other data interchange) among compute nodes is used to redistribute data as needed
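
A simplified, hedged sketch of the two-phase idea (not ROMIO's actual code): each rank holds a small piece, one aggregator rank collects the pieces with message passing, then issues a single large write.

    #include <mpi.h>
    #include <stdlib.h>

    /* Each rank owns local_n ints destined for a contiguous region of the file.
     * Phase 1: gather the pieces to rank 0.  Phase 2: rank 0 does one large write. */
    void two_phase_write(MPI_File fh, const int *local_buf, int local_n)
    {
        int rank, nprocs;
        int *big_buf = NULL;
        MPI_Status status;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        if (rank == 0)
            big_buf = malloc((size_t)nprocs * local_n * sizeof(int));

        /* Phase 1: redistribution via message passing (here a simple gather). */
        MPI_Gather((void *)local_buf, local_n, MPI_INT,
                   big_buf, local_n, MPI_INT, 0, MPI_COMM_WORLD);

        /* Phase 2: one large, contiguous write instead of nprocs small ones. */
        if (rank == 0) {
            MPI_File_write_at(fh, 0, big_buf, nprocs * local_n, MPI_INT, &status);
            free(big_buf);
        }
    }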

21 Noncontiguous Access l [Figure: noncontiguous layout in memory and in file; a displacement plus a repeated filetype maps each process's memory (proc 0 through proc 3) to its pieces of the shared parallel file]

22 Discontiguity l Noncontiguous data in both memory and file is specified using MPI datatypes, both predefined and derived. l Data layout in memory specified on each call, as in message-passing. l Data layout in file is defined by a file view. l A process can access data only within its view. l View can be changed; views can overlap.
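
A hedged sketch of defining a file view: each rank commits a derived datatype so that it sees only its own strided slice of the shared file (block and stride sizes are arbitrary examples).

    #include <mpi.h>

    #define BLOCK 64    /* ints per contiguous block owned by a rank (example) */
    #define NBLKS 16    /* number of such blocks per rank (example) */

    /* Rank r owns blocks r, r+nprocs, r+2*nprocs, ... of the file. */
    void set_strided_view(MPI_File fh)
    {
        int rank, nprocs;
        MPI_Datatype filetype;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* NBLKS blocks of BLOCK ints each, separated by nprocs*BLOCK ints in the file. */
        MPI_Type_vector(NBLKS, BLOCK, nprocs * BLOCK, MPI_INT, &filetype);
        MPI_Type_commit(&filetype);

        /* The displacement skips this rank's starting offset; subsequent reads and
         * writes on fh see only the data selected by the view. */
        MPI_File_set_view(fh, (MPI_Offset)rank * BLOCK * sizeof(int),
                          MPI_INT, filetype, "native", MPI_INFO_NULL);
        MPI_Type_free(&filetype);
    }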

23 Basic Data Access l Individual file pointer: MPI_File_read l Explicit file offset: MPI_File_read_at l Shared file pointer: MPI_File_read_shared l Nonblocking I/O: MPI_File_iread l Similarly for writes
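
A hedged sketch combining two of these forms: a nonblocking read at an explicit offset (MPI_File_iread_at) is overlapped with computation and completed with MPI_Wait; compute_something_else() is a placeholder for useful work.

    #include <mpi.h>

    #define COUNT 4096   /* ints read per call; arbitrary example size */

    extern void compute_something_else(void);   /* placeholder for useful work */

    /* Post a nonblocking read at an explicit offset and overlap it with computation. */
    void overlapped_read(MPI_File fh, MPI_Offset offset, int *buf)
    {
        MPI_Request request;
        MPI_Status  status;

        MPI_File_iread_at(fh, offset, buf, COUNT, MPI_INT, &request);
        compute_something_else();       /* work that does not need buf yet */
        MPI_Wait(&request, &status);    /* buf is valid only after completion */
    }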

24 Collective I/O in MPI l A critical optimization in parallel I/O l Allows communication of “big picture” to file system l Framework for 2-phase I/O, in which communication precedes I/O (can use MPI machinery) l Basic idea: build large blocks, so that reads/writes in I/O system will be large l [Figure: many small individual requests combined into one large collective access]

25 MPI Collective I/O Operations l Blocking: MPI_File_read_all( fh, buf, count, datatype, status ) l Non-blocking: MPI_File_read_all_begin( fh, buf, count, datatype ) followed by MPI_File_read_all_end( fh, buf, status )
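
A hedged sketch of collective access: every rank that opened the file calls the collective write, so the implementation is free to merge the per-rank requests into a few large accesses (the buffer size is an arbitrary example).

    #include <mpi.h>

    #define N 65536   /* ints written per rank; arbitrary example size */

    /* All ranks participate; combined with a file view, the MPI-IO layer can
     * perform two-phase optimization behind this single call. */
    void collective_write(MPI_File fh, int *buf)
    {
        MPI_Status status;
        MPI_File_write_all(fh, buf, N, MPI_INT, &status);

        /* Split-collective form, allowing other work between begin and end:
         *   MPI_File_write_all_begin(fh, buf, N, MPI_INT);
         *   ...
         *   MPI_File_write_all_end(fh, buf, &status);
         */
    }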

26 ROMIO - a Portable Implementation of MPI I/O l Rajeev Thakur, Argonne l Implementation strategy: an abstract device for I/O (ADIO) l Tested for low overhead l Can use any MPI implementation (MPICH, vendor) l [Figure: ROMIO architecture; MPI-IO implemented on top of the ADIO layer, with ADIO ports to PFS, PIOFS, Unix, HP HFS, and SGI XFS file systems, possibly over a network]

27 Current Status of ROMIO l ROMIO released on Oct. 1, 1997 l Beta version released Feb 1998 l A substantial portion of the standard has been implemented: »collective I/O »noncontiguous accesses in memory and file »asynchronous I/O l Supports large files (greater than 2 Gbytes) l Works with MPICH and vendor MPI implementations

28 ROMIO Users l Around 175 copies downloaded so far l All three ASCI labs. have installed and rigorously tested ROMIO and are now encouraging their users to use it l A number of users at various universities and labs. around the world l A group in Portugal ported ROMIO to Windows 95 and NT

29 Interaction with Vendors l HP/Convex is incorporating ROMIO into the next release of its MPI product l SGI has provided hooks for ROMIO to work with its MPI l DEC and IBM have downloaded the software for review l NEC plans to use ROMIO as a starting point for its own MPI-IO implementation l Pallas started with an early version of ROMIO for its MPI-IO implementation for Fujitsu

30 Hints Used in the ROMIO MPI-IO Implementation l MPI-2 predefined hints: cb_buffer_size, cb_nodes, striping_unit, striping_factor l New algorithm parameters: ind_rd_buffer_size, ind_wr_buffer_size l Platform-specific hints: start_iodevice, pfs_svr_buf
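
A hedged sketch of checking which hints are actually in effect on an open file; the 256-byte value buffer is an arbitrary choice.

    #include <mpi.h>
    #include <stdio.h>

    /* Print the hints the implementation reports for this file. */
    void print_file_hints(MPI_File fh)
    {
        MPI_Info info;
        int i, nkeys, flag;
        char key[MPI_MAX_INFO_KEY], value[256];

        MPI_File_get_info(fh, &info);         /* hints currently in effect */
        MPI_Info_get_nkeys(info, &nkeys);
        for (i = 0; i < nkeys; i++) {
            MPI_Info_get_nthkey(info, i, key);
            MPI_Info_get(info, key, sizeof(value) - 1, value, &flag);
            if (flag) printf("%s = %s\n", key, value);
        }
        MPI_Info_free(&info);
    }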

31 Performance l Astrophysics application template from U. of Chicago: read/write a three-dimensional matrix l Caltech Paragon: 512 compute nodes, 64 I/O nodes, PFS l ANL SP: 80 compute nodes, 4 I/O servers, PIOFS l Measure independent I/O, collective I/O, and independent I/O with data sieving

32 Benefits of Collective I/O l 512 x 512 x 512 matrix on 48 nodes of the SP l 512 x 512 x 1024 matrix on 256 nodes of the Paragon

33 Independent Writes l On Paragon l Lots of seeks and small writes l Time shown = 130 seconds

34 Collective Write l On Paragon l Communication and computation precede seek and write l Time shown = 2.75 seconds

35 Independent Writes with “Data Sieving” l On Paragon l Use large blocks, write multiple “real” blocks plus “gaps” l Requires lock, read, modify, write, unlock for writes l Paragon has file locking at block level l 4 MB blocks l Time = 16 seconds
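
A hedged sketch of the data-sieving write path described above, using plain POSIX calls; it assumes the update fits inside one aligned sieving block, and the 4 MB buffer matches the block size on the slide.

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define SIEVE_BUF (4 * 1024 * 1024)   /* 4 MB sieving buffer (as on the slide) */

    /* Write len bytes at offset off: lock the covering region, read it,
     * patch in the new data, write the whole region back, unlock. */
    void sieve_write(int fd, const void *data, size_t len, off_t off)
    {
        static char buf[SIEVE_BUF];
        off_t start = (off / SIEVE_BUF) * SIEVE_BUF;   /* align to the big block */
        struct flock lk = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                            .l_start = start, .l_len = SIEVE_BUF };

        fcntl(fd, F_SETLKW, &lk);                      /* lock   */
        pread(fd, buf, SIEVE_BUF, start);              /* read   */
        memcpy(buf + (off - start), data, len);        /* modify */
        pwrite(fd, buf, SIEVE_BUF, start);             /* write  */
        lk.l_type = F_UNLCK;
        fcntl(fd, F_SETLK, &lk);                       /* unlock */
    }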

36 Changing the Block Size l Smaller blocks mean less contention, therefore more parallelism l 512 KB blocks l Time = 10.2 seconds l Still 4 times the collective time

37 Data Sieving with Small Blocks l If the block size is too small, however, then the increased parallelism doesn’t make up for the many small writes l 64 KB blocks l Time = 21.5 seconds

38 Conclusions l OS-level I/O operations are overly restrictive for many HPC applications »You want those restrictions for I/O from your editor or word processor »The failure of NFS to implement these rules is a continuing source of trouble l Physical and logical (application) performance differ l Application "kernels" are often unrepresentative of actual operations »Use independent I/O when collective is intended l Vendors can compete on the quality of their MPI-IO implementations