1 Parallel and Grid I/O Infrastructure
Rob Ross, Argonne National Lab
Parallel Disk Access and Grid I/O (P4)
SDM All Hands Meeting, March 26, 2002

2 Participants
- Argonne National Laboratory: Bill Gropp, Rob Ross, Rajeev Thakur, Rob Latham, Anthony Chan
- Northwestern University: Alok Choudhary, Wei-keng Liao, Avery Ching, Kenin Coloma, Jianwei Li
- Collaborators
  - Lawrence Livermore National Laboratory: Ghaleb Abdulla, Tina Eliassi-Rad, Terence Critchlow
  - Application groups

3 Focus Areas in Project
- Parallel I/O on clusters
  - Parallel Virtual File System (PVFS)
- MPI-IO hints
  - ROMIO MPI-IO implementation
- Grid I/O
  - Linking PVFS and ROMIO with Grid I/O components
- Application interfaces
  - NetCDF and HDF5
- Everything is interconnected!
- Wei-keng Liao will drill down into specific tasks

4 Parallel Virtual File System
- Lead developer: R. Ross (ANL)
  - R. Latham (ANL), developer
  - A. Ching, K. Coloma (NWU), collaborators
- Open source, scalable parallel file system
  - Project began in the mid 1990s at Clemson University
  - Now a collaboration between Clemson and ANL
- Successes
  - In use on large Linux clusters (OSC, Utah, Clemson, ANL, Phillips Petroleum, ...)
  - Ongoing unique downloads each month and users on the mailing list, with 90+ on the developers list
  - Multiple gigabyte-per-second performance demonstrated

5 Keeping PVFS Relevant: PVFS2
- Scaling to thousands of clients and hundreds of servers requires design changes
  - Distributed metadata
  - New storage formats
  - Improved fault tolerance
- New technology, new features
  - High-performance networking (e.g., InfiniBand, VIA)
  - Application metadata
- A new design and implementation is warranted (PVFS2)

6 PVFS1, PVFS2, and SDM
- Maintaining PVFS1 as a resource to the community
  - Providing support and bug fixes
  - Encouraging use by application groups
  - Adding functionality to improve performance (e.g., tiled display)
- Implementing a next-generation parallel file system
  - Basic infrastructure for future parallel file system work
  - New physical distributions (e.g., chunking)
  - Application metadata storage
- Ensuring that a working parallel file system remains available on clusters as they scale

7 Data Staging for Tiled Display
- Contact: Joe Insley (ANL)
  - Commodity components: projectors, PCs
  - Provides very high resolution visualization
- Staging application preprocesses "frames" into a tile stream for each "visualization node"
  - Uses MPI-IO to access data from the PVFS file system
  - Streams of tiles are merged into movie files on visualization nodes
  - End goal is to display frames directly from PVFS
  - Enhancing PVFS and ROMIO to improve performance

8 Example Tile Layout
- 3x2 display, 6 readers
- Frame size is 2532x1408 pixels
- Tile size is 1024x768 pixels (overlapped)
- Movies are broken into frames, with each frame stored in its own file in PVFS
- Readers pull data from PVFS and send it to the display

9 Tested Access Patterns
- Subtile
  - Each reader grabs a piece of a tile
  - Small noncontiguous accesses
  - Many accesses per frame
- Tile (see the sketch below)
  - Each reader grabs a whole tile
  - Larger noncontiguous accesses
  - Six accesses per frame
- Reading individual pieces is simply too slow
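
To make the tile pattern concrete, here is a minimal sketch of how one reader might describe a whole tile as an MPI subarray and read it collectively. The frame and tile dimensions come from the slides; the file name, the overlapped tile origins, and the 3-bytes-per-pixel layout are illustrative assumptions.

/* Illustrative sketch of the "tile" pattern: one reader describes a whole
 * tile of an RGB frame as an MPI subarray and reads it collectively.
 * Frame (2532x1408) and tile (1024x768, overlapped) sizes come from the
 * slides; the file name, the tile origins, and the 3-bytes-per-pixel
 * layout are assumptions.  Assumes exactly 6 reader processes (ranks 0-5). */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Geometry in bytes: rows x row-length, assuming 3 bytes per pixel. */
    int frame[2] = { 1408, 2532 * 3 };
    int tile[2]  = {  768, 1024 * 3 };
    /* Hypothetical overlapped tile origins for a 3x2 layout. */
    int start[2] = { (rank / 3) * 640, (rank % 3) * 754 * 3 };

    MPI_Datatype filetype;
    MPI_Type_create_subarray(2, frame, tile, start, MPI_ORDER_C,
                             MPI_BYTE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "pvfs:/pvfs/frames/frame0001",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_BYTE, filetype, "native", MPI_INFO_NULL);

    unsigned char *buf = malloc((size_t)tile[0] * tile[1]);
    /* One collective call reads the whole (noncontiguous) tile. */
    MPI_File_read_all(fh, buf, tile[0] * tile[1], MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    free(buf);
    MPI_Finalize();
    return 0;
}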

10 Noncontiguous Access in ROMIO
- ROMIO performs "data sieving" to cut down the number of I/O operations
- Uses large reads that grab multiple noncontiguous pieces in one pass
- Example: reading tile 1 (figure omitted; the idea is sketched below)
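
As a rough illustration of the idea (not ROMIO's actual code), data sieving trades many small reads for one large contiguous read spanning all the requested pieces, then copies out only the wanted bytes:

/* Toy illustration of the data sieving idea (not ROMIO's implementation):
 * instead of one read() per noncontiguous piece, read the whole spanning
 * extent once and copy the wanted bytes out of a scratch buffer.
 * Error handling and short reads are glossed over. */
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

struct piece { off_t offset; size_t len; };   /* file offset and length */

int sieve_read(int fd, const struct piece *p, int npieces, char **out)
{
    /* Find the extent [lo, hi) spanning every requested piece. */
    off_t lo = p[0].offset, hi = p[0].offset + (off_t)p[0].len;
    for (int i = 1; i < npieces; i++) {
        if (p[i].offset < lo) lo = p[i].offset;
        if (p[i].offset + (off_t)p[i].len > hi) hi = p[i].offset + (off_t)p[i].len;
    }

    char *scratch = malloc((size_t)(hi - lo));
    if (scratch == NULL) return -1;

    /* One large contiguous read; it also pulls in unwanted "hole" bytes. */
    if (pread(fd, scratch, (size_t)(hi - lo), lo) < 0) { free(scratch); return -1; }

    /* Extract only the pieces the caller asked for. */
    for (int i = 0; i < npieces; i++)
        memcpy(out[i], scratch + (p[i].offset - lo), p[i].len);

    free(scratch);
    return 0;
}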

11 Noncontiguous Access in PVFS
- ROMIO data sieving
  - Works for all file systems (uses only contiguous reads)
  - Reads extra data (three times the desired amount)
- A noncontiguous access primitive allows requesting just the desired bytes (A. Ching, NWU)
- Support in ROMIO allows transparent use of the new optimization (K. Coloma, NWU)
- PVFS and ROMIO support implemented

12 Metadata in File Systems
- Associative arrays of information related to a file
- Seen in other file systems (Mac OS, BeOS, ReiserFS)
- Some potential uses:
  - Ancillary data from applications: derived values, thumbnail images, execution parameters
  - I/O library metadata: block layout information, attributes on variables, attributes of the dataset as a whole, headers
- Keeps headers out of the data stream
- Eliminates the need for alignment in libraries

13 Metadata and PVFS2 Status
- Prototype metadata storage for PVFS2 implemented (R. Ross, ANL)
  - Uses Berkeley DB for storage of keyword/value pairs (sketched below)
  - Need to investigate how to interface this to MPI-IO
- Other components of PVFS2 coming along
  - Networking in testing (P. Carns, Clemson)
  - Client-side API under development (Clemson)
- PVFS2 beta early in the fourth quarter?
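
A minimal sketch of keyword/value storage with Berkeley DB, in the spirit the slide describes; the database file name, the attribute, and the Berkeley DB 4.1-style open() signature are assumptions, not the prototype's actual code.

/* Hypothetical sketch of keyword/value metadata storage with Berkeley DB,
 * in the spirit of the PVFS2 prototype mentioned on the slide.  The
 * database file name and the attribute are made up; the 7-argument
 * open() call follows the Berkeley DB 4.1+ signature. */
#include <db.h>
#include <string.h>

int store_attribute(const char *dbfile, const char *keyword, const char *value)
{
    DB *dbp;
    if (db_create(&dbp, NULL, 0) != 0)
        return -1;
    if (dbp->open(dbp, NULL, dbfile, NULL, DB_BTREE, DB_CREATE, 0664) != 0) {
        dbp->close(dbp, 0);
        return -1;
    }

    DBT key, data;
    memset(&key, 0, sizeof key);
    memset(&data, 0, sizeof data);
    key.data  = (void *)keyword;  key.size  = (u_int32_t)strlen(keyword) + 1;
    data.data = (void *)value;    data.size = (u_int32_t)strlen(value) + 1;

    int ret = dbp->put(dbp, NULL, &key, &data, 0);   /* keyword -> value */
    dbp->close(dbp, 0);
    return ret;
}

A call such as store_attribute("meta.db", "units", "kelvin") would record one keyword/value pair alongside the file's other metadata.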

14 ROMIO MPI-IO Implementation
- Written by R. Thakur (ANL)
  - R. Ross and R. Latham (ANL), developers
  - K. Coloma (NWU), collaborator
- Implementation of the MPI-2 I/O specification
  - Operates on a wide variety of platforms
  - Abstract Device Interface for I/O (ADIO) aids in porting to new file systems
- Successes
  - Adopted by industry (e.g., Compaq, HP, SGI)
  - Used at ASCI sites (e.g., LANL Blue Mountain)

15 ROMIO Current Directions
- Support for PVFS noncontiguous requests (K. Coloma, NWU)
- Hints: the key to efficient use of hardware and software components
  - Collective I/O: aggregation (synergy)
  - Performance portability: controlling ROMIO optimizations
  - Access patterns
  - Grid I/O
- Scalability
  - Parallel I/O benchmarking

16 ROMIO Aggregation Hints
- Part of the ASCI Software Pathforward project (contact: Gary Grider, LANL)
- Implementation by R. Ross, R. Latham (ANL)
- Hints control which processes do I/O in collective operations
- Examples:
  - All processes on the same node as the attached storage
  - One process per host
- Additionally limit the number of processes that open the file
  - Good for systems without a shared file system (e.g., O2K clusters)
  - More scalable

17 Aggregation Example
- Cluster of SMPs
- Only one SMP box has a connection to the disks
- Data is aggregated to processes on that single box
- Processes on that box perform I/O on behalf of the others (see the sketch below)
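
A sketch of what steering aggregation toward that one box looks like with ROMIO's standard cb_config_list / cb_nodes hints; the host name "node0" and the file name are placeholders.

/* Sketch of steering collective I/O aggregation onto the one SMP box that
 * has the storage attached, using standard ROMIO hints.  The host name
 * "node0" and the file name are placeholders. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);

    /* Only processes running on node0 act as I/O aggregators for collectives. */
    MPI_Info_set(info, "cb_config_list", "node0:*");
    /* Alternatively, simply cap the number of aggregators:
       MPI_Info_set(info, "cb_nodes", "4"); */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "datafile",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);

    /* ... collective reads and writes now funnel through node0 ... */

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}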

18 Optimization Hints
- MPI-IO calls should be chosen to best describe the I/O taking place
  - Use of file views
  - Collective calls for inherently collective operations
- Unfortunately, choosing the "right" calls can sometimes result in lower performance
- Allow application programmers to tune ROMIO with hints rather than switching to different MPI-IO calls
- Avoid the misapplication of optimizations (aggregation, data sieving)

19 Optimization Problems
- ROMIO checks the applicability of the two-phase optimization whenever collective I/O is used
- With the tiled display application using subtile access, this optimization is never used
- Checking for applicability requires communication between processes
- This results in a 33% drop in throughput (on the test system)
- A hint that tells ROMIO not to apply the optimization avoids this cost without changes to the rest of the application (sketched below)
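
ROMIO's romio_cb_read hint (values "enable", "disable", "automatic") is one way to express such a control; a minimal sketch, with the surrounding application code assumed:

/* Minimal sketch: ask ROMIO not to attempt the two-phase (collective
 * buffering) optimization on reads.  "romio_cb_read" is a standard ROMIO
 * hint; everything else about the application stays unchanged. */
#include <mpi.h>

MPI_File open_frame_without_two_phase(MPI_Comm comm, const char *path)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_read", "disable");

    MPI_File fh;
    MPI_File_open(comm, path, MPI_MODE_RDONLY, info, &fh);
    MPI_Info_free(&info);
    return fh;
}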

20 Access Pattern Hints
- Collaboration between ANL and LLNL (and growing)
- Examining how access pattern information can be passed to the MPI-IO interface and through to the underlying file system
- Used as input to optimizations in the MPI-IO layer
- Used as input to optimizations in the file system layer as well
  - Prefetching
  - Caching
  - Writeback

21 Status of Hints
- Aggregation control finished
- Optimization hints
  - Collective and data sieving read controls finished
  - Data sieving write control in progress
  - PVFS noncontiguous I/O control in progress
- Access pattern hints
  - Exchanging log files and formats
  - Getting up to speed on each other's tools

22 Parallel I/O Benchmarking
- No common parallel I/O benchmarks exist
- New effort (consortium) to:
  - Define terminology
  - Define test methodology
  - Collect tests
- Goal: provide a meaningful test suite with consistent measurement techniques
- Interested parties at numerous sites (and growing): LLNL, Sandia, UIUC, ANL, UCAR, Clemson
- Still in its infancy

23 Grid I/O
- Looking at ways to connect our I/O work with components and APIs used in the Grid
  - New ways of getting data into and out of PVFS
  - Using MPI-IO to access data in the Grid
  - Alternative mechanisms for transporting data across the Grid (synergy)
- Working toward more seamless integration of the tools used in the Grid with those used on clusters and in parallel applications (specifically MPI applications)
- Facilitate moving between the Grid and cluster worlds

24 Local Access to GridFTP Data
- Grid I/O contact: B. Allcock (ANL)
- The GridFTP striped server provides a high-throughput mechanism for moving data across the Grid
- It relies on a proprietary storage format on the striped servers
  - Must manage metadata on stripe location
  - Data stored on the servers must be read back through the servers
  - No alternative or more direct way to access the data locally
  - The next version assumes a shared file system underneath

25 GridFTP Striped Servers
- Remote applications connect to multiple striped servers to transfer data quickly over the Grid
- Multiple TCP streams make better use of the WAN
- Local processes would need to use the same mechanism to get at data on the striped servers

26 PVFS under GridFTP
- With PVFS underneath, GridFTP servers would store data on PVFS I/O servers
- Stripe information would be stored on the PVFS metadata server

27 Local Data Access
- Application tasks that are part of a local parallel job could access data directly from the PVFS file system
- Output from the application could be retrieved remotely via GridFTP

28 MPI-IO Access to GridFTP
- Applications such as the tiled display reader want remote access to GridFTP data
- Access through MPI-IO would allow this with no code changes
- The ROMIO ADIO interface provides the infrastructure needed to do this
- MPI-IO hints provide a means for specifying the number of stripes, transfer sizes, etc. (a hypothetical sketch follows below)
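
This was a design direction rather than shipped code at the time; the sketch below only illustrates the idea, using a hypothetical "gridftp:" file-name prefix and made-up hint names for stripe count and transfer size.

/* Hypothetical sketch only: how an MPI application might reach GridFTP data
 * through a ROMIO ADIO driver.  The "gridftp:" file-name prefix and the
 * hint names used here ("num_stripes", "transfer_size") are illustrative
 * placeholders, not an existing, documented interface. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "num_stripes", "4");          /* hypothetical hint */
    MPI_Info_set(info, "transfer_size", "1048576");  /* hypothetical hint */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD,
                  "gridftp:ftp://server.example.org/frames/frame0001",
                  MPI_MODE_RDONLY, info, &fh);

    /* Existing MPI-IO code (file views, collective reads) would work
       unchanged against the remote data from here on. */

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}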

29 WAN File Transfer Mechanism
- B. Gropp (ANL), P. Dickens (IIT)
- Applications: PPM and COMMAS (Paul Woodward, UMN)
- An alternative mechanism for moving data across the Grid using UDP
- Focuses on the requirements of file movement
  - All data must arrive at the destination
  - Ordering doesn't matter
  - Lost blocks can be retransmitted when detected, but need not stop the remainder of the transfer

30 WAN File Transfer Performance
- Comparing TCP utilization to the WAN file transfer technique
- A single TCP stream sees 10-12% utilization (8 streams are needed to approach maximum utilization)
- The WAN file transfer technique obtains near 90% utilization, with more uniform performance

31 Grid I/O Status
- Planning with the Grid I/O group
  - Matching up components
  - Identifying useful hints
- The Globus FTP client library is available
- A second-generation striped server is being implemented
- The XIO interface has been prototyped
  - Hooks for alternative local file systems
  - An obvious match for PVFS under GridFTP

32 NetCDF
- Applications in climate and fusion
  - PCM: John Drake (ORNL)
  - Weather Research and Forecast Model (WRF): John Michalakes (NCAR)
  - Center for Extended Magnetohydrodynamic Modeling: Steve Jardin (PPPL)
  - Plasma Microturbulence Project: Bill Nevins (LLNL)
- Maintained by the Unidata Program Center
- An API and file format for storing multidimensional datasets and associated metadata in a single file

33 NetCDF Interface
- Strong points:
  - It's a standard!
  - I/O routines allow subarray and strided access with single calls
  - Access is clearly split into two modes (illustrated below)
    - Defining the datasets (define mode)
    - Accessing and/or modifying the datasets (data mode)
- Weakness: no parallel writes and limited parallel read capability
- This forces applications to ship data to a single node for writing, severely limiting usability in I/O-intensive applications
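
For reference, the define-mode/data-mode split looks like this in the standard serial netCDF-3 C interface; dimension, variable, and attribute names and sizes here are illustrative.

/* Serial netCDF-3 example of the two access modes the slide describes.
 * Dimension, variable, and attribute names and sizes are illustrative;
 * error checking is omitted for brevity. */
#include <netcdf.h>

int write_field(const float *temp /* 64 x 128 values */)
{
    int ncid, dimids[2], varid;

    /* Define mode: create the file and describe its contents. */
    nc_create("field.nc", NC_CLOBBER, &ncid);
    nc_def_dim(ncid, "lat", 64,  &dimids[0]);
    nc_def_dim(ncid, "lon", 128, &dimids[1]);
    nc_def_var(ncid, "temperature", NC_FLOAT, 2, dimids, &varid);
    nc_put_att_text(ncid, varid, "units", 6, "kelvin");
    nc_enddef(ncid);                    /* switch to data mode */

    /* Data mode: access and/or modify the variable's values. */
    nc_put_var_float(ncid, varid, temp);

    return nc_close(ncid);
}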

34 Parallel NetCDF
- Rich I/O routines and explicit define/data modes provide a good foundation
  - Existing applications already describe noncontiguous regions
  - The modes allow for a synchronization point when the file layout changes
- Missing:
  - Semantics for parallel access
  - Collective routines
  - An option for using MPI datatypes
- Implement in terms of MPI-IO operations
- Retain the file format for interoperability (a sketch of the intended style of interface follows below)
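
The prototype was still in progress at the time of this talk; the sketch below shows the flavor of the interface that grew out of this work: ncmpi_-prefixed analogues of the netCDF calls, an MPI communicator and MPI_Info at create time, and collective data-mode operations. Names and sizes are illustrative.

/* Sketch of a collective parallel write in the ncmpi_ style that grew out
 * of this work (the prototype was still in progress at the time of the
 * talk).  File, dimension, and variable names and sizes are illustrative. */
#include <mpi.h>
#include <pnetcdf.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Define mode, now driven by a communicator and an MPI_Info object. */
    int ncid, dimid, varid;
    ncmpi_create(MPI_COMM_WORLD, "field.nc", NC_CLOBBER, MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "x", (MPI_Offset)1024 * nprocs, &dimid);
    ncmpi_def_var(ncid, "temperature", NC_FLOAT, 1, &dimid, &varid);
    ncmpi_enddef(ncid);

    /* Data mode: every rank writes its own slice with one collective call. */
    float slice[1024];
    for (int i = 0; i < 1024; i++) slice[i] = (float)rank;
    MPI_Offset start = (MPI_Offset)rank * 1024, count = 1024;
    ncmpi_put_vara_float_all(ncid, varid, &start, &count, slice);

    ncmpi_close(ncid);
    MPI_Finalize();
    return 0;
}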

35 Parallel NetCDF Status
- Design document created (B. Gropp, R. Ross, and R. Thakur, ANL)
- Prototype in progress (J. Li, NWU)
- Focus is on write functions first
  - The biggest bottleneck for checkpointing applications
- Read functions will follow
- Investigate alternative file formats in the future
  - Address the differences in access modes between writing and reading

36 FLASH Astrophysics Code
- Developed at the ASCI Center at the University of Chicago (contact: Mike Zingale)
- Adaptive mesh refinement (AMR) code for simulating astrophysical thermonuclear flashes
- Written in Fortran 90; uses MPI for communication and HDF5 for checkpointing and visualization data
- Scales to thousands of processors, runs for weeks, and needs to checkpoint
- At the time, I/O was a bottleneck (half of the runtime on 1024 processors)

37 HDF5 Overhead Analysis
- Instrumented FLASH I/O to log calls to H5Dwrite and the underlying MPI_File_write_at (timing chart omitted)

38 HDF5 Hyperslab Operations
- White regions (in the timing chart, omitted here) are the hyperslab "gather" from memory
- Cyan regions are the "scatter" to file
- A sketch of the kind of hyperslab write being measured follows below
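
For orientation, a hyperslab write of the kind being profiled looks roughly like the sketch below; the dataset names, the block geometry, and the HDF5 1.6-era five-argument H5Dcreate signature are assumptions, not FLASH's actual code.

/* Sketch of the kind of HDF5 hyperslab write being profiled: HDF5 gathers
 * the selected region from the in-memory block and scatters it into the
 * file dataset.  Names and geometry are illustrative, and the 5-argument
 * H5Dcreate call assumes an HDF5 1.6-era API. */
#include <hdf5.h>

/* block: an 18x18x18 in-memory array whose 16x16x16 interior is written. */
int write_block(const double *block)
{
    hsize_t file_dims[3] = { 64, 64, 64 };
    hsize_t mem_dims[3]  = { 18, 18, 18 };   /* interior plus guard cells */
    hsize_t start[3]     = {  1,  1,  1 };   /* skip the guard cells */
    hsize_t count[3]     = { 16, 16, 16 };
    hsize_t fstart[3]    = {  0,  0,  0 };

    hid_t file   = H5Fcreate("checkpoint.h5", H5F_ACC_TRUNC,
                             H5P_DEFAULT, H5P_DEFAULT);
    hid_t fspace = H5Screate_simple(3, file_dims, NULL);
    hid_t dset   = H5Dcreate(file, "density", H5T_NATIVE_DOUBLE,
                             fspace, H5P_DEFAULT);

    /* Memory side: select only the interior of the block (the "gather"). */
    hid_t mspace = H5Screate_simple(3, mem_dims, NULL);
    H5Sselect_hyperslab(mspace, H5S_SELECT_SET, start, NULL, count, NULL);

    /* File side: place that region at the dataset origin (the "scatter"). */
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, fstart, NULL, count, NULL);

    herr_t status = H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace,
                             H5P_DEFAULT, block);

    H5Dclose(dset); H5Sclose(mspace); H5Sclose(fspace); H5Fclose(file);
    return (int)status;
}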

39 Hand-Coded Packing
- Packing time appears as the black regions between bars (in the timing chart, omitted here)
- Nearly an order of magnitude improvement
- A rough sketch of the hand-packing approach follows below
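
The hand-coded alternative amounts to copying the interior of each block into a contiguous scratch buffer so the library sees a trivial, contiguous selection; a rough sketch under the same illustrative geometry as above:

/* Rough sketch of the hand-packing alternative: copy the 16x16x16 interior
 * of an 18x18x18 block (guard cells included) into a contiguous buffer so
 * the subsequent write sees a simple contiguous selection.  Same
 * illustrative geometry as the hyperslab sketch above. */
#include <string.h>

void pack_interior(const double *block /* 18x18x18 */, double *packed /* 16x16x16 */)
{
    const int N = 18, M = 16;   /* full edge length, interior edge length */
    for (int i = 0; i < M; i++)
        for (int j = 0; j < M; j++)
            /* Copy one contiguous row of 16 interior values at a time. */
            memcpy(&packed[(i * M + j) * M],
                   &block[((i + 1) * N + (j + 1)) * N + 1],
                   M * sizeof(double));
}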

40 Wrap Up
- Progress is being made on multiple fronts
  - The ANL/NWU collaboration is strong
  - Collaborations with other groups are maturing
- A balance of immediate payoff and medium-term infrastructure improvements
  - Providing expertise to application groups
  - Adding functionality targeted at specific applications
  - Building core infrastructure to scale and ensure availability
- Synergy with other projects
- On to Wei-keng!