Slide 1: Parallel and Grid I/O Infrastructure
Rob Ross, Argonne National Lab
Parallel Disk Access and Grid I/O (P4)
SDM All Hands Meeting, March 26, 2002
Slide 2: Participants
- Argonne National Laboratory: Bill Gropp, Rob Ross, Rajeev Thakur, Rob Latham, Anthony Chan
- Northwestern University: Alok Choudhary, Wei-keng Liao, Avery Ching, Kenin Coloma, Jianwei Li
- Collaborators
  - Lawrence Livermore National Laboratory: Ghaleb Abdulla, Tina Eliassi-Rad, Terence Critchlow
  - Application groups
Slide 3: Focus Areas in Project
- Parallel I/O on clusters
  - Parallel Virtual File System (PVFS)
- MPI-IO hints
  - ROMIO MPI-IO implementation
- Grid I/O
  - Linking PVFS and ROMIO with Grid I/O components
- Application interfaces
  - NetCDF and HDF5
- Everything is interconnected!
- Wei-keng Liao will drill down into specific tasks
Slide 4: Parallel Virtual File System
- Lead developer: R. Ross (ANL)
  - R. Latham (ANL), developer
  - A. Ching, K. Coloma (NWU), collaborators
- Open source, scalable parallel file system
  - Project began in the mid 1990s at Clemson University
  - Now a collaboration between Clemson and ANL
- Successes
  - In use on large Linux clusters (OSC, Utah, Clemson, ANL, Phillips Petroleum, ...)
  - 100+ unique downloads per month
  - 160+ users on the mailing list, 90+ on the developers list
  - Multiple gigabyte-per-second performance demonstrated
Slide 5: Keeping PVFS Relevant: PVFS2
- Scaling to thousands of clients and hundreds of servers requires some design changes
  - Distributed metadata
  - New storage formats
  - Improved fault tolerance
- New technology, new features
  - High-performance networking (e.g. InfiniBand, VIA)
  - Application metadata
- A new design and implementation is warranted (PVFS2)
Slide 6: PVFS1, PVFS2, and SDM
- Maintaining PVFS1 as a resource for the community
  - Providing support and bug fixes
  - Encouraging use by application groups
  - Adding functionality to improve performance (e.g. tiled display)
- Implementing a next-generation parallel file system
  - Basic infrastructure for future PFS work
  - New physical distributions (e.g. chunking)
  - Application metadata storage
- Ensuring that a working parallel file system continues to be available on clusters as they scale
Slide 7: Data Staging for Tiled Display
- Contact: Joe Insley (ANL)
  - Commodity components (projectors, PCs)
  - Provide very high resolution visualization
- Staging application preprocesses "frames" into a tile stream for each "visualization node"
  - Uses MPI-IO to access data from the PVFS file system
  - Streams of tiles are merged into movie files on the visualization nodes
  - End goal is to display frames directly from PVFS
  - Enhancing PVFS and ROMIO to improve performance
Slide 8: Example Tile Layout
- 3x2 display, 6 readers
- Frame size is 2532x1408 pixels
- Tile size is 1024x768 pixels (overlapped)
- Movies are broken into frames, with each frame stored in its own file in PVFS
- Readers pull data from PVFS and send it to the display
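The overlap implied by these numbers can be worked out directly: three 1024-pixel tiles spanning a 2532-pixel frame must share 270 pixels at each horizontal seam, and two 768-pixel tiles spanning 1408 pixels share 128 pixels vertically. A minimal sketch of that arithmetic, with variable names that are illustrative rather than taken from the actual staging code:

```c
/* Sketch: compute tile origins for the 3x2 overlapped layout on the slide.
 * Frame is 2532x1408 pixels, tiles are 1024x768; names are illustrative. */
#include <stdio.h>

int main(void)
{
    const int frame_w = 2532, frame_h = 1408;
    const int tile_w = 1024, tile_h = 768;
    const int cols = 3, rows = 2;

    /* With overlapped tiles, adjacent tiles share
     * (cols*tile_w - frame_w)/(cols-1) pixels horizontally, and similarly
     * vertically. Here that works out to 270 and 128 pixels. */
    int overlap_x = (cols * tile_w - frame_w) / (cols - 1);
    int overlap_y = (rows * tile_h - frame_h) / (rows - 1);

    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            printf("tile (%d,%d): origin (%d,%d)\n", r, c,
                   c * (tile_w - overlap_x), r * (tile_h - overlap_y));
    return 0;
}
```

The resulting origins step by 754 pixels horizontally and 640 pixels vertically, which is what the access patterns on the next slide have to pull out of each frame file.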
Slide 9: Tested Access Patterns
- Subtile
  - Each reader grabs a piece of a tile
  - Small noncontiguous accesses
  - Lots of accesses for a frame
- Tile
  - Each reader grabs a whole tile
  - Larger noncontiguous accesses
  - Six accesses for a frame
- Reading individual pieces is simply too slow
Slide 10: Noncontiguous Access in ROMIO
- ROMIO performs "data sieving" to cut down the number of I/O operations
- Uses large contiguous reads that grab multiple noncontiguous pieces at once
- Example: reading tile 1 (figure of the sieved read not reproduced here)
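A minimal sketch of the data sieving idea, assuming a POSIX-style read path: one large read covers the extent of all requested pieces, and the wanted bytes are then copied out in memory. ROMIO's actual implementation also bounds the sieve buffer size and handles writes with a read-modify-write cycle.

```c
/* Sketch of data sieving for reads: one large contiguous read covering the
 * extent of all requested pieces, followed by in-memory copies. */
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

void sieve_read(int fd, const off_t *off, const size_t *len, int n, char **out)
{
    /* find the extent [lo, hi) spanned by all requested pieces */
    off_t lo = off[0], hi = off[0] + (off_t)len[0];
    for (int i = 1; i < n; i++) {
        if (off[i] < lo) lo = off[i];
        if (off[i] + (off_t)len[i] > hi) hi = off[i] + (off_t)len[i];
    }

    char *buf = malloc((size_t)(hi - lo));
    pread(fd, buf, (size_t)(hi - lo), lo);   /* one large read; extra data included */

    for (int i = 0; i < n; i++)              /* extract only the wanted pieces */
        memcpy(out[i], buf + (off[i] - lo), len[i]);
    free(buf);
}
```

The trade-off is visible in the code: far fewer I/O operations, at the cost of transferring the unwanted gaps between pieces, which is exactly what the next slide quantifies for the tile pattern.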
Slide 11: Noncontiguous Access in PVFS
- ROMIO data sieving
  - Works for all file systems (uses only contiguous reads)
  - Reads extra data (three times the desired amount in this case)
- A noncontiguous access primitive allows requesting just the desired bytes (A. Ching, NWU)
- Support in ROMIO allows transparent use of the new optimization (K. Coloma, NWU)
- PVFS and ROMIO support implemented
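The "transparent" part is that the application keeps describing its access with standard MPI-IO datatypes and lets the library choose between data sieving and the new primitive. A sketch of what such an access might look like, with the tile described as a subarray file view and read in one call; the sizes follow the earlier tiled-display example, one byte per pixel is assumed for simplicity, and the function is illustrative rather than the reader's actual code:

```c
/* Sketch: describe one tile of the frame as a subarray file view so the
 * whole noncontiguous region is requested in a single MPI-IO call.
 * Error checking omitted. */
#include <mpi.h>

void read_tile(MPI_File fh, int tile_row, int tile_col, unsigned char *buf)
{
    int frame[2] = { 1408, 2532 };   /* rows, cols of the full frame */
    int tile[2]  = { 768, 1024 };    /* rows, cols of one tile */
    int start[2] = { tile_row * 640, tile_col * 754 };  /* overlapped origins */

    MPI_Datatype filetype;
    MPI_Type_create_subarray(2, frame, tile, start, MPI_ORDER_C,
                             MPI_BYTE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_set_view(fh, 0, MPI_BYTE, filetype, "native", MPI_INFO_NULL);
    MPI_File_read_all(fh, buf, tile[0] * tile[1], MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_Type_free(&filetype);
}
```

Underneath this call, ROMIO can either sieve with contiguous reads or, with the PVFS support described here, hand the file system a list of exactly the desired byte ranges.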
Slide 12: Metadata in File Systems
- Associative arrays of information related to a file
- Seen in other file systems (MacOS, BeOS, ReiserFS)
- Some potential uses:
  - Ancillary data (from applications)
    - Derived values
    - Thumbnail images
    - Execution parameters
  - I/O library metadata
    - Block layout information
    - Attributes on variables
    - Attributes of the dataset as a whole
  - Headers
    - Keeps the header out of the data stream
    - Eliminates the need for alignment in libraries
Slide 13: Metadata and PVFS2 Status
- Prototype metadata storage for PVFS2 implemented (R. Ross, ANL)
  - Uses Berkeley DB for storage of keyword/value pairs
  - Need to investigate how to interface to MPI-IO
- Other components of PVFS2 coming along
  - Networking in testing (P. Carns, Clemson)
  - Client-side API under development (Clemson)
- PVFS2 beta early in the fourth quarter?
Slide 14: ROMIO MPI-IO Implementation
- Written by R. Thakur (ANL)
  - R. Ross and R. Latham (ANL), developers
  - K. Coloma (NWU), collaborator
- Implementation of the MPI-2 I/O specification
  - Operates on a wide variety of platforms
  - Abstract Device Interface for I/O (ADIO) aids in porting to new file systems
- Successes
  - Adopted by industry (e.g. Compaq, HP, SGI)
  - Used at ASCI sites (e.g. LANL Blue Mountain)
Slide 15: ROMIO Current Directions
- Support for PVFS noncontiguous requests (K. Coloma, NWU)
- Hints: key to efficient use of hardware and software components
  - Collective I/O
    - Aggregation (synergy)
    - Performance portability
  - Controlling ROMIO optimizations
  - Access patterns
  - Grid I/O
- Scalability
  - Parallel I/O benchmarking
Slide 16: ROMIO Aggregation Hints
- Part of the ASCI Software Pathforward project
  - Contact: Gary Grider (LANL)
- Implementation by R. Ross and R. Latham (ANL)
- Hints control which processes do I/O in collective operations
- Examples:
  - All processes on the same node as the attached storage
  - One process per host
- Can additionally limit the number of processes that open the file
  - Good for systems without a shared file system (e.g. O2K clusters)
  - More scalable
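Hints like these travel through an MPI_Info object supplied at file-open time. A sketch using the ROMIO hint names cb_config_list and cb_nodes; the exact hint names, accepted values, and the example filename are assumptions that depend on the ROMIO version in use:

```c
/* Sketch: passing aggregation hints to ROMIO through MPI_Info at open time. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_config_list", "*:1");  /* one aggregator per host */
    MPI_Info_set(info, "cb_nodes", "4");          /* cap the number of aggregators */

    MPI_File_open(MPI_COMM_WORLD, "frames/frame0001.dat",
                  MPI_MODE_RDONLY, info, &fh);
    /* ... collective reads here funnel data through the chosen aggregators ... */
    MPI_File_close(&fh);
    MPI_Info_free(&info);

    MPI_Finalize();
    return 0;
}
```

Because the hints are strings in an info object, they can be changed per platform without touching the rest of the application, which is the portability argument behind this work.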
Slide 17: Aggregation Example
- Cluster of SMPs; only one SMP box has a connection to the disks
- Data is aggregated to processes on that single box
- Processes on that box perform I/O on behalf of the others
Slide 18: Optimization Hints
- MPI-IO calls should be chosen to best describe the I/O taking place
  - Use of file views
  - Collective calls for inherently collective operations
- Unfortunately, choosing the "right" calls can sometimes result in lower performance
- Allow application programmers to tune ROMIO with hints rather than by switching to different MPI-IO calls
- Avoid the misapplication of optimizations (aggregation, data sieving)
Slide 19: Optimization Problems
- ROMIO checks whether the two-phase optimization is applicable whenever collective I/O is used
- With the tiled display application using subtile access, this optimization is never applied
- Checking for applicability requires communication between processes
  - Results in a 33% drop in throughput (on the test system)
- A hint that tells ROMIO not to apply the optimization avoids this cost without changes to the rest of the application
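A sketch of how such a hint might be supplied, assuming ROMIO's romio_cb_read hint is used to switch off collective buffering (the two-phase optimization) for reads; the hint name and value are ROMIO-specific and version-dependent:

```c
/* Sketch: build an info object that asks ROMIO to skip the two-phase
 * optimization for reads; pass it to MPI_File_open or MPI_File_set_info. */
#include <mpi.h>

void make_no_two_phase_info(MPI_Info *info)
{
    MPI_Info_create(info);
    MPI_Info_set(*info, "romio_cb_read", "disable");
}
```

With this hint set, collective reads for the subtile pattern should fall through to the independent path, so the applicability check and its interprocess communication are skipped while the rest of the application code stays unchanged.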
Slide 20: Access Pattern Hints
- Collaboration between ANL and LLNL (and growing)
- Examining how access pattern information can be passed to the MPI-IO interface and through to the underlying file system
- Used as input to optimizations in the MPI-IO layer
- Used as input to optimizations in the FS layer as well
  - Prefetching
  - Caching
  - Writeback
Slide 21: Status of Hints
- Aggregation control finished
- Optimization hints
  - Collectives and data sieving read control finished
  - Data sieving write control in progress
  - PVFS noncontiguous I/O control in progress
- Access pattern hints
  - Exchanging log files and formats
  - Getting up to speed on respective tools
Slide 22: Parallel I/O Benchmarking
- No common parallel I/O benchmarks exist
- New effort (consortium) to:
  - Define some terminology
  - Define test methodology
  - Collect tests
- Goal: provide a meaningful test suite with consistent measurement techniques
- Interested parties at numerous sites (and growing): LLNL, Sandia, UIUC, ANL, UCAR, Clemson
- Still in its infancy
Slide 23: Grid I/O
- Looking at ways to connect our I/O work with components and APIs used in the Grid
  - New ways of getting data in and out of PVFS
  - Using MPI-IO to access data in the Grid
  - Alternative mechanisms for transporting data across the Grid (synergy)
- Working toward more seamless integration of the tools used in the Grid with those used on clusters and in parallel applications (specifically MPI applications)
- Facilitate moving between the Grid and cluster worlds
Slide 24: Local Access to GridFTP Data
- Grid I/O contact: B. Allcock (ANL)
- The GridFTP striped server provides a high-throughput mechanism for moving data across the Grid
- It relies on a proprietary storage format on the striped servers
  - Must manage metadata on stripe location
  - Data stored on the servers must be read back from the servers
  - No alternative or more direct way to access local data
  - The next version assumes a shared file system underneath
Slide 25: GridFTP Striped Servers
- Remote applications connect to multiple striped servers to quickly transfer data over the Grid
- Multiple TCP streams make better use of the WAN
- Local processes would need to use the same mechanism to get to data on the striped servers
Slide 26: PVFS under GridFTP
- With PVFS underneath, GridFTP servers would store data on PVFS I/O servers
- Stripe information is stored on the PVFS metadata server
Slide 27: Local Data Access
- Application tasks that are part of a local parallel job could access data directly off the PVFS file system
- Output from the application could be retrieved remotely via GridFTP
Slide 28: MPI-IO Access to GridFTP
- Applications such as the tiled display reader want remote access to GridFTP data
- Access through MPI-IO would allow this with no code changes
- The ROMIO ADIO interface provides the infrastructure necessary to do this
- MPI-IO hints provide a means for specifying the number of stripes, transfer sizes, etc.
Slide 29: WAN File Transfer Mechanism
- B. Gropp (ANL), P. Dickens (IIT)
- Applications: PPM and COMMAS (Paul Woodward, UMN)
- An alternative mechanism for moving data across the Grid using UDP
- Focuses on the requirements of file movement:
  - All data must arrive at the destination
  - Ordering doesn't matter
  - Lost blocks can be retransmitted when detected, but need not stop the remainder of the transfer
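A rough sketch of the receiver-side bookkeeping such a scheme implies: blocks carry sequence numbers, arrival order is ignored, and anything still missing can simply be requested again. This is an illustration of the idea only, not the actual WAN FT implementation; the message layout and block size are assumptions.

```c
/* Sketch: UDP block receiver that tolerates loss and reordering. */
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

#define BLOCK_SIZE 8192

struct block_msg {
    unsigned int seq;          /* which block of the file this is */
    char data[BLOCK_SIZE];
};

/* Receive into file_buf until every block has arrived at least once. */
void receive_file(int sock, char *file_buf, unsigned int nblocks)
{
    unsigned char *seen = calloc(nblocks, 1);
    unsigned int remaining = nblocks;
    struct block_msg msg;

    while (remaining > 0) {
        if (recv(sock, &msg, sizeof(msg), 0) != (ssize_t)sizeof(msg))
            continue;                      /* drop short or failed datagrams */
        if (msg.seq < nblocks && !seen[msg.seq]) {
            memcpy(file_buf + (size_t)msg.seq * BLOCK_SIZE, msg.data, BLOCK_SIZE);
            seen[msg.seq] = 1;
            remaining--;                   /* arrival order is irrelevant */
        }
    }
    /* a real protocol would periodically NACK the blocks still missing so the
     * sender can retransmit them without restarting the whole transfer */
    free(seen);
}
```

Because correctness only requires that every block eventually lands, the sender can stream at close to link rate instead of being throttled by TCP's in-order, congestion-controlled delivery.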
Slide 30: WAN File Transfer Performance
- Compares TCP utilization to the WAN FT technique
- A single TCP stream sees 10-12% utilization (8 streams are needed to approach maximum utilization)
- WAN FT obtains near 90% utilization, with more uniform performance
Slide 31: Grid I/O Status
- Planning with the Grid I/O group
  - Matching up components
  - Identifying useful hints
- Globus FTP client library is available
- Second-generation striped server being implemented
- XIO interface prototyped
  - Hooks for alternative local file systems
  - An obvious match for PVFS under GridFTP
Slide 32: NetCDF
- Applications in climate and fusion
  - PCM: John Drake (ORNL)
  - Weather Research and Forecast Model (WRF): John Michalakes (NCAR)
  - Center for Extended Magnetohydrodynamic Modeling: Steve Jardin (PPPL)
  - Plasma Microturbulence Project: Bill Nevins (LLNL)
- Maintained by the Unidata Program Center
- API and file format for storing multidimensional datasets and associated metadata (in a single file)
Slide 33: NetCDF Interface
- Strong points:
  - It's a standard!
  - I/O routines allow for subarray and strided access with single calls
  - Access is clearly split into two modes:
    - Defining the datasets (define mode)
    - Accessing and/or modifying the datasets (data mode)
- Weakness: no parallel writes and only limited parallel read capability
- This forces applications to ship data to a single node for writing, severely limiting usability in I/O-intensive applications
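For readers unfamiliar with the interface, a small example of the define-mode/data-mode split and subarray access in the serial netCDF C API; the file, dimension, and variable names are made up for illustration:

```c
/* Sketch: serial netCDF usage showing define mode, data mode, and a
 * subarray write in a single call. Error checking omitted. */
#include <netcdf.h>

void write_temperature(const float *temp, size_t nlat, size_t nlon)
{
    int ncid, lat_dim, lon_dim, varid, dims[2];

    nc_create("climate.nc", NC_CLOBBER, &ncid);      /* enter define mode */
    nc_def_dim(ncid, "lat", nlat, &lat_dim);
    nc_def_dim(ncid, "lon", nlon, &lon_dim);
    dims[0] = lat_dim;
    dims[1] = lon_dim;
    nc_def_var(ncid, "temperature", NC_FLOAT, 2, dims, &varid);
    nc_enddef(ncid);                                 /* switch to data mode */

    /* subarray access in one call: start/count select a region of the variable */
    size_t start[2] = { 0, 0 }, count[2] = { nlat, nlon };
    nc_put_vara_float(ncid, varid, start, count, temp);

    nc_close(ncid);
}
```

The structure is already close to what parallel I/O needs, which is the point of the next slide: the regions are described declaratively, and the define/data transition is a natural synchronization point.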
Slide 34: Parallel NetCDF
- Rich I/O routines and explicit define/data modes provide a good foundation
  - Existing applications are already describing noncontiguous regions
  - The modes allow for a synchronization point when the file layout changes
- Missing:
  - Semantics for parallel access
  - Collective routines
  - An option for using MPI datatypes
- Implement in terms of MPI-IO operations
- Retain the file format for interoperability
Slide 35: Parallel NetCDF Status
- Design document created (B. Gropp, R. Ross, and R. Thakur, ANL)
- Prototype in progress (J. Li, NWU)
- Focus is on write functions first
  - The biggest bottleneck for checkpointing applications
- Read functions follow
- Investigate alternative file formats in the future
  - Address differences in access modes between writing and reading
Slide 36: FLASH Astrophysics Code
- Developed at the ASCI Center at the University of Chicago
  - Contact: Mike Zingale
- Adaptive mesh refinement (AMR) code for simulating astrophysical thermonuclear flashes
- Written in Fortran 90; uses MPI for communication and HDF5 for checkpointing and visualization data
- Scales to thousands of processors, runs for weeks, and needs to checkpoint
- At the time, I/O was a bottleneck (half of the runtime on 1024 processors)
Slide 37: HDF5 Overhead Analysis
- Instrumented FLASH I/O to log calls to H5Dwrite and MPI_File_write_at
- (Timeline plot comparing time spent in H5Dwrite versus MPI_File_write_at not reproduced here)
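One common way to capture the MPI-IO side of such timings is to interpose on the MPI profiling interface. A sketch, not the actual FLASH instrumentation, that times MPI_File_write_at; the signature follows the modern MPI standard (const buffer), so older MPI-2 headers may need the const dropped:

```c
/* Sketch: PMPI wrapper that logs the duration of each MPI_File_write_at. */
#include <mpi.h>
#include <stdio.h>

int MPI_File_write_at(MPI_File fh, MPI_Offset offset, const void *buf,
                      int count, MPI_Datatype datatype, MPI_Status *status)
{
    double t0 = MPI_Wtime();
    int err = PMPI_File_write_at(fh, offset, buf, count, datatype, status);
    fprintf(stderr, "MPI_File_write_at: offset=%lld time=%g s\n",
            (long long)offset, MPI_Wtime() - t0);
    return err;
}
```

Comparing these times against the time spent inside H5Dwrite is what exposes the hyperslab overhead shown on the next slide.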
Slide 38: HDF5 Hyperslab Operations
- In the timeline plot (not reproduced here), the white region is the hyperslab "gather" (from memory)
- Cyan is the "scatter" (to file)
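For context, the hyperslab path being measured looks roughly like the following sketch: a block is selected in both the memory and file dataspaces, and H5Dwrite performs the gather from memory and the scatter to the file. Names and dimensions are illustrative, not taken from FLASH.

```c
/* Sketch: write one block of a 2D array through HDF5 hyperslab selections. */
#include <hdf5.h>

void write_block(hid_t dset, const double *buf,
                 hsize_t mem_dims[2], hsize_t start[2], hsize_t count[2])
{
    hid_t memspace  = H5Screate_simple(2, mem_dims, NULL);
    hid_t filespace = H5Dget_space(dset);

    /* select the same block in memory and in the file (stride/block of 1) */
    H5Sselect_hyperslab(memspace,  H5S_SELECT_SET, start, NULL, count, NULL);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);

    /* H5Dwrite gathers from the selected memory region and scatters to the file */
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, H5P_DEFAULT, buf);

    H5Sclose(memspace);
    H5Sclose(filespace);
}
```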
Slide 39: Hand-Coded Packing
- Packing time is in the black regions between bars of the timeline plot
- Nearly an order of magnitude improvement
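The hand-coded alternative copies the strided block into a contiguous buffer first and then issues a single contiguous write. A sketch of the packing step for a 2D block; it is illustrative of the technique, not the actual FLASH code:

```c
/* Sketch: pack a strided 2D block into a contiguous buffer before writing. */
#include <stddef.h>
#include <string.h>

void pack_block(const double *src, double *packed,
                size_t src_row_len, size_t block_rows, size_t block_cols,
                size_t row0, size_t col0)
{
    for (size_t r = 0; r < block_rows; r++)
        memcpy(packed + r * block_cols,
               src + (row0 + r) * src_row_len + col0,
               block_cols * sizeof(double));
    /* 'packed' can now be handed to one contiguous H5Dwrite or MPI-IO write */
}
```

The measured gain comes from replacing the per-element gather/scatter inside the library with a small number of bulk memory copies plus one large write.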
Slide 40: Wrap Up
- Progress being made on multiple fronts
  - The ANL/NWU collaboration is strong
  - Collaborations with other groups are maturing
- Balance of immediate payoff and medium-term infrastructure improvements
  - Providing expertise to application groups
  - Adding functionality targeted at specific applications
  - Building core infrastructure to scale and ensure availability
- Synergy with other projects
- On to Wei-keng!