
PIO: The Parallel I/O Library
The 13th Annual CCSM Workshop, June 19, 2008
Raymond Loy
Leadership Computing Facility / Mathematics and Computer Science Division, Argonne National Laboratory
With John Dennis, National Center for Atmospheric Research; Jim Edwards, National Center for Atmospheric Research; Robert Jacob, Argonne National Laboratory

Trends in Climate Model Resolution
High-resolution configuration: 1/10th-degree ocean/ice with 0.5-degree atmosphere.
– Ocean: 3600 x 2400 x 42
– Sea ice: 3600 x 2400 x 20
– Atmosphere: 576 x 384 x 26
– Land: 576 x 384 x 17
Compared to CCSM3:
– Ocean: 73x larger
– Atmosphere: 7x larger

Trends in Climate Model Resolution
History output sizes for the high-resolution configuration, for one write of a single monthly average:
– Atmosphere: 0.8 GB
– Ocean: 24 GB (reduced; 100 GB if full)
– Sea ice: 4 GB
– Land: 0.3 GB
Restart output:
– Atmosphere: 0.9 GB
– Ocean: 29 GB (96 GB with extra tracers)
– Sea ice: 5 GB
– Land: 0.2 GB
– Coupler: 6.5 GB

Trends in High Performance Computing Systems
Moore's Law is still increasing transistor count, but the transistors are now grouped into multiple cores. Memory per core is nearly constant.
Power/cooling constraints promote designs for maximum flops/watt:
– BlueGene: low-power nodes; low memory
  BG/L node: PowerPC, 0.7 GHz; 512 MB (256 MB/core)
  BG/P node: PowerPC, 0.85 GHz; 2 GB (512 MB/core)
– SiCortex node: 6 MIPS64 cores, 0.5 GHz (600 mW each!)

Ye Olde Gather/Scatter with Serial Read/Write
As old as the first parallel program, and still state of the art.
Example: gather and write.
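A minimal sketch (not from the talk) of this gather-then-write pattern in Fortran with MPI, assuming equal-sized blocks per task; the file name and sizes are illustrative. The root task holds the entire field in memory and performs all of the I/O serially, which is exactly the memory and throughput bottleneck PIO is designed to remove.

program gather_write
  use mpi
  implicit none
  integer, parameter :: nlocal = 1000
  integer :: ierr, rank, nprocs
  real(kind=8) :: local(nlocal)
  real(kind=8), allocatable :: global(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  local = real(rank, kind=8)              ! stand-in for model data

  if (rank == 0) then
     allocate(global(nlocal*nprocs))      ! the whole field lives on one task
  else
     allocate(global(1))                  ! recv buffer only significant at the root
  end if

  call MPI_Gather(local, nlocal, MPI_REAL8, global, nlocal, MPI_REAL8, &
                  0, MPI_COMM_WORLD, ierr)

  if (rank == 0) then                     ! serial write from the root only
     open(unit=10, file='history.bin', form='unformatted')
     write(10) global
     close(10)
  end if

  call MPI_Finalize(ierr)
end program gather_write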

Solution: Parallel I/O!
Parallel I/O begins with hardware and low-level software forming a parallel file system:
– Many disks look like one big disk.
– Related: the old parallel I/O method of each processor writing its own file to local disk, with postprocessing needed to complete the output.
– Examples: PVFS, Lustre, GPFS.
Figure and following general parallel I/O overview provided by Rob Latham (Argonne)

MPI-IO
The Message Passing Interface (MPI) is an interface standard for writing message-passing programs.
– Most popular programming model on HPC systems
MPI-IO is an I/O interface specification for use in MPI applications.
Data model is the same as POSIX: a stream of bytes in a file.
Features:
– Collective I/O
– Noncontiguous I/O with MPI datatypes and file views
– Nonblocking I/O
– Fortran bindings (and additional languages)
Implementations available on most platforms.
I/O presentation from Rob Latham (Argonne National Lab)
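For contrast with the gather/scatter approach, a minimal sketch (not from the talk) of a collective MPI-IO write, assuming each task owns one contiguous block of real(kind=8) values; the file name and sizes are illustrative. No single task ever holds the global field.

program mpiio_write
  use mpi
  implicit none
  integer, parameter :: nlocal = 1000
  integer :: ierr, rank, fh
  integer :: status(MPI_STATUS_SIZE)
  integer(kind=MPI_OFFSET_KIND) :: offset
  real(kind=8) :: local(nlocal)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  local = real(rank, kind=8)              ! stand-in for model data

  ! All tasks open the same shared file.
  call MPI_File_open(MPI_COMM_WORLD, 'history.bin', &
                     MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, fh, ierr)

  ! Byte offset of this task's contiguous block of 8-byte reals.
  offset = int(rank, MPI_OFFSET_KIND) * int(nlocal, MPI_OFFSET_KIND) * 8

  ! Collective write: the MPI-IO layer can aggregate and optimize requests.
  call MPI_File_write_at_all(fh, offset, local, nlocal, MPI_REAL8, status, ierr)

  call MPI_File_close(fh, ierr)
  call MPI_Finalize(ierr)
end program mpiio_write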

NetCDF: Standard file format used in climate modeling
Data model:
– Collection of variables in a single file
– Typed, multidimensional array variables
– Attributes on the file and on variables
Features:
– C and Fortran interfaces
– Portable data format (data is always written in a big-endian format)
NetCDF files consist of three regions:
– Header
– Non-record variables (all dimensions specified)
– Record variables (ones with an unlimited dimension)
I/O presentation from Rob Latham (Argonne National Lab)
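A minimal serial sketch (not from the talk) of this data model using the standard netCDF Fortran-90 API; the variable name, dimensions, and attribute are illustrative.

program netcdf_write
  use netcdf
  implicit none
  integer, parameter :: nx = 576, ny = 384
  integer :: ierr, ncid, dimids(2), varid
  real(kind=8) :: field(nx, ny)

  field = 273.15d0                                 ! stand-in for model data

  ierr = nf90_create('history.nc', NF90_CLOBBER, ncid)
  ierr = nf90_def_dim(ncid, 'lon', nx, dimids(1))
  ierr = nf90_def_dim(ncid, 'lat', ny, dimids(2))
  ierr = nf90_def_var(ncid, 'tsurf', NF90_DOUBLE, dimids, varid)
  ierr = nf90_put_att(ncid, varid, 'units', 'K')   ! attribute on the variable
  ierr = nf90_enddef(ncid)                         ! leave define mode (header written)
  ierr = nf90_put_var(ncid, varid, field)          ! non-record variable data
  ierr = nf90_close(ncid)
end program netcdf_write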

Parallel NetCDF: NetCDF output with MPI-IO
Based on NetCDF:
– Derived from their source code
– API slightly modified
– Final output is indistinguishable from a serial NetCDF file
Additional features:
– Noncontiguous I/O in memory using MPI datatypes
– Noncontiguous I/O in the file using sub-arrays
– Collective I/O
Unrelated to the netCDF-4 work.
I/O presentation from Rob Latham (Argonne National Lab)
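A minimal sketch (not from the talk) of the Parallel-NetCDF Fortran interface, assuming a 1-D global variable decomposed into equal contiguous blocks; names and sizes are illustrative. Note the slightly modified API (nfmpi_ prefix, MPI communicator and info arguments, collective put call).

program pnetcdf_write
  use mpi
  implicit none
  include 'pnetcdf.inc'
  integer, parameter :: nlocal = 1000
  integer :: ierr, rank, nprocs, ncid, dimid, varid
  integer(kind=MPI_OFFSET_KIND) :: glen, start(1), cnt(1)
  double precision :: local(nlocal)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  local = dble(rank)                                  ! stand-in for model data

  ! Collectively create the file and define one global 1-D variable.
  glen = int(nlocal, MPI_OFFSET_KIND) * nprocs
  ierr = nfmpi_create(MPI_COMM_WORLD, 'history.nc', NF_CLOBBER, MPI_INFO_NULL, ncid)
  ierr = nfmpi_def_dim(ncid, 'n', glen, dimid)
  ierr = nfmpi_def_var(ncid, 'field', NF_DOUBLE, 1, (/dimid/), varid)
  ierr = nfmpi_enddef(ncid)

  ! Each task writes its own contiguous slab with one collective call.
  start(1) = int(rank, MPI_OFFSET_KIND) * nlocal + 1  ! 1-based start index
  cnt(1)   = nlocal
  ierr = nfmpi_put_vara_double_all(ncid, varid, start, cnt, local)
  ierr = nfmpi_close(ncid)
  call MPI_Finalize(ierr)
end program pnetcdf_write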

Goals for Parallel I/O in CCSM
Provide parallel I/O for all component models.
Encapsulate complexity into a library.
Simple interface for component developers to implement.
Extensible for future I/O technology.

Goals for Parallel I/O in CCSM
Backward compatible (node=0).
Support for multiple formats:
– {sequential, direct} binary
– netCDF
Preserve the format of input/output files.
Supports 1D, 2D and 3D arrays.

Climate model decompositions can be complex
Ocean decomposition with space-filling curve

PIO Terms and Concepts
I/O decomposition vs. physical model decomposition:
– I/O decomp == model decomp: MPI-IO + message aggregation
– I/O decomp != model decomp: need a Rearranger (MCT, custom)
No component-specific information in the library:
– Pair with existing communication technology
– 1-D arrays are the input to the library; the component must flatten 2-D and 3-D arrays (see the sketch below)
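A minimal sketch (not from the talk) of what the flattening step means for a component: pack the local 2-D block into a 1-D array and record the global 1-D index (degree of freedom) of every element. The subroutine and argument names are illustrative, not part of the PIO API.

subroutine flatten_block(nx_glob, i0, j0, local2d, local1d, dof)
  implicit none
  integer, intent(in)  :: nx_glob            ! global extent in the i-direction
  integer, intent(in)  :: i0, j0             ! global indices of the block's corner
  real(kind=8), intent(in)  :: local2d(:,:)  ! the component's 2-D local block
  real(kind=8), intent(out) :: local1d(:)    ! flattened data handed to the I/O library
  integer, intent(out) :: dof(:)             ! global 1-D index of each flattened element
  integer :: i, j, n

  n = 0
  do j = 1, size(local2d, 2)
     do i = 1, size(local2d, 1)
        n = n + 1
        local1d(n) = local2d(i, j)
        dof(n) = (j0 + j - 2) * nx_glob + (i0 + i - 1)
     end do
  end do
end subroutine flatten_block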

PIO Data Rearrangement
Goal: redistribute data from the computational layout of the model ("compdof") to a subset of processors designated for I/O ("iodof").
– Provides direct control of the number of procs reading/writing, to maximize performance on a platform
– This level of control is not possible with the pnetcdf API; also more portable than MPI-IO hints
– I/O decomposition matched to the actual read/write
Initial method: MCT
– Pro: the MCT Rearranger is general and allows an arbitrary pattern
– Con: setup is expensive (all-to-all matching); the description of the decompositions can be large due to poor compression of small runs of indices
Improved method: Box Rearranger
– NetCDF/pnetcdf reads/writes naturally operate on rectangular "box" subsets of output array variables

Data rearrangement

Box Rearranger

PIO Box Rearranger
Mapping defined by the extents of a box for each I/O node:
– Extremely compact representation, easily distributed
– Reverse mapping computed at runtime
Supports features needed for, e.g., ocean vs. land:
– "Holes" in the computational decomposition
– Fill values for I/O dofs not covered
Design evolved, driven by the performance of the CAM integration:
– The initial design conserved space by creating send/receive types on-the-fly; MPI was too slow.
– It is important for performance to cache MPI types and compute the reverse mapping up-front during Rearranger creation.

PIO API
subroutine PIO_init(comp_rank, comp_comm, num_iotasks, num_aggregator, stride, Rearranger, IOsystem, base)
  integer(i4), intent(in) :: comp_rank        ! (MPI rank)
  integer(i4), intent(in) :: comp_comm        ! (MPI communicator)
  integer(i4), intent(in) :: num_iotasks
  integer(i4), intent(in) :: num_aggregator
  integer(i4), intent(in) :: stride
  integer,     intent(in) :: Rearranger       ! defined in pio_types:
    ! PIO_rearr_none: pio does no data rearrangement; data is assumed to be in its final form when passed to pio
    ! PIO_rearr_mct:  pio uses MCT to rearrange the data from the computational layout to the I/O layout
    ! PIO_rearr_box:  pio uses an internal rearranger to rearrange the data from the computational layout to the I/O layout
  type(IOsystem_desc_t), intent(out) :: IOsystem  ! output; IOsystem stores the context
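A hedged usage sketch of the call above. The module name pio, the keyword-argument style, and the specific task counts are illustrative assumptions, and the base argument is omitted here on the assumption that it is optional.

subroutine setup_pio(comp_comm, iosystem)
  use pio                          ! assumed name of the PIO Fortran module
  use mpi
  implicit none
  integer, intent(in) :: comp_comm                 ! the component's MPI communicator
  type(IOsystem_desc_t), intent(out) :: iosystem   ! PIO context reused by later calls
  integer :: comp_rank, ierr

  call MPI_Comm_rank(comp_comm, comp_rank, ierr)

  ! Illustrative choice: 16 I/O tasks, one aggregator each, taking every 8th
  ! MPI task, with the internal box rearranger moving data from the
  ! computational layout to the I/O layout.
  call PIO_init(comp_rank, comp_comm, num_iotasks=16, num_aggregator=1, &
                stride=8, Rearranger=PIO_rearr_box, IOsystem=iosystem)
end subroutine setup_pio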

PIO API
subroutine PIO_initDecomp(IOsystem, baseTYPE, dims, compDOF, IOdesc)
  type(IOsystem_desc_t), intent(in) :: IOsystem
  integer(i4), intent(in) :: baseTYPE    ! type of array {int, real4, real8}
  integer(i4), intent(in) :: dims(:)     ! global dimensions of array
  integer(i4), intent(in) :: compDOF(:)  ! global degrees of freedom for comp decomposition
  type(IO_desc_t), pointer, intent(out) :: IOdesc
Automatically computes start(:) and cnt(:) to define the I/O mapping.
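A hedged sketch of building compDOF for a simple decomposition (each task owns a contiguous band of rows of an nx-by-ny global field, assumed to divide evenly) and creating the corresponding I/O decomposition. The module name pio and the base-type constant PIO_double are assumptions not shown on this slide; the dimensions are illustrative.

subroutine setup_decomp(iosystem, rank, nprocs, iodesc)
  use pio                          ! assumed name of the PIO Fortran module
  implicit none
  type(IOsystem_desc_t), intent(inout) :: iosystem
  integer, intent(in)  :: rank, nprocs
  type(io_desc_t), pointer :: iodesc
  integer, parameter :: nx = 3600, ny = 2400
  integer, allocatable :: compdof(:)
  integer :: j0, j1, i, j, n

  ! Each task owns a contiguous band of rows of the global nx x ny field.
  j0 = rank * (ny / nprocs) + 1
  j1 = (rank + 1) * (ny / nprocs)
  allocate(compdof(nx * (j1 - j0 + 1)))

  n = 0
  do j = j0, j1
     do i = 1, nx
        n = n + 1
        compdof(n) = (j - 1) * nx + i    ! global 1-D index of each local point
     end do
  end do

  ! PIO computes the start(:)/cnt(:) I/O mapping from this description.
  call PIO_initDecomp(iosystem, PIO_double, (/nx, ny/), compdof, iodesc)
end subroutine setup_decomp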

PIO API
subroutine PIO_initDecomp(IOsystem, baseTYPE, dims, lenBLOCKS, compDOF, ioDOFR, ioDOFW, start, cnt, IOdesc)
  type(IOsystem_desc_t), intent(in) :: IOsystem
  integer(i4), intent(in) :: baseTYPE      ! type of array {int, real4, real8}
  integer(i4), intent(in) :: dims(:)       ! global dimensions of array
  integer(i4), intent(in) :: lenBLOCKS
  integer(i4), intent(in) :: compDOF(:)    ! global degrees of freedom for comp decomposition
  integer(i4), intent(in) :: ioDofR(:)     ! global degrees of freedom for I/O decomp (read op)
  integer(i4), intent(in) :: ioDofW(:)     ! global degrees of freedom for I/O decomp (write op)
  integer(PIO_OFFSET), intent(in) :: start(:), cnt(:)  ! pNetCDF domain decomposition information
  type(IO_desc_t), pointer, intent(out) :: IOdesc
start(:) and cnt(:) define the I/O mapping.

PIO API
subroutine PIO_write_darray(data_file, varDesc, IOdesc, array, iostat, fillval)
  type(File_desc_t), intent(inout) :: data_file     ! file information (netcdf or binary)
  type(IOsystem_desc_t), intent(inout) :: iosystem  ! io subsystem information
  type(var_desc_t), intent(inout) :: varDesc        ! variable descriptor
  type(io_desc_t), intent(inout) :: iodesc          ! io descriptor defined in initdecomp
  intent(in) :: array                               ! array to be written (currently integer, real*4 and real*8 types are supported, 1 dimension)
  integer, intent(out) :: iostat                    ! error return code
  intent(in), optional :: fillval                   ! same type as array; a fill value for pio to use in the case of missing data
Cached I/O mapping and structures are reusable for multiple writes/reads (via IOdesc).
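A hedged usage sketch of the write call, assuming the file has already been created/opened and the variable defined elsewhere (those calls are not shown on this slide) and that the module is named pio; the fill value is illustrative.

subroutine write_field(file, vardesc, iodesc, local1d, iostat)
  use pio                            ! assumed name of the PIO Fortran module
  implicit none
  type(File_desc_t), intent(inout) :: file        ! opened/created elsewhere
  type(var_desc_t),  intent(inout) :: vardesc     ! defined elsewhere
  type(io_desc_t),   intent(inout) :: iodesc      ! from PIO_initDecomp
  real(kind=8),      intent(in)    :: local1d(:)  ! flattened local data (1-D)
  integer,           intent(out)   :: iostat      ! PIO error return code

  ! Collective across the component communicator: PIO rearranges local1d to
  ! the I/O tasks and performs the underlying netCDF/pnetcdf/binary write.
  ! The optional fill value covers holes in the computational decomposition.
  call PIO_write_darray(file, vardesc, iodesc, local1d, iostat, fillval=0.0_8)
end subroutine write_field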

PIO API
subroutine PIO_read_darray(data_file, varDesc, iodesc, array, iostat)
  type(File_desc_t), intent(inout) :: data_file   ! info about the data file
  type(var_desc_t), intent(inout) :: varDesc      ! variable descriptor
  type(io_desc_t), intent(inout) :: iodesc
  intent(in) :: array                             ! array to be read (currently integer, real*4 and real*8 types are supported, 1 dimension)
  integer, intent(out) :: iostat                  ! error return code
No fillval is needed in this direction (holes are not modified).
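A hedged sketch mirroring the write case: read a distributed variable back into the flattened local array. The module name pio is an assumption.

subroutine read_field(file, vardesc, iodesc, local1d, iostat)
  use pio                            ! assumed name of the PIO Fortran module
  implicit none
  type(File_desc_t), intent(inout) :: file          ! opened elsewhere
  type(var_desc_t),  intent(inout) :: vardesc       ! variable descriptor
  type(io_desc_t),   intent(inout) :: iodesc        ! from PIO_initDecomp
  real(kind=8),      intent(inout) :: local1d(:)    ! filled with the data read
  integer,           intent(out)   :: iostat        ! PIO error return code

  ! I/O tasks read their rectangular pieces and PIO rearranges them back to
  ! the computational decomposition; holes in the decomposition are untouched.
  call PIO_read_darray(file, vardesc, iodesc, local1d, iostat)
end subroutine read_field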

PIO in CAM

PIO Success Stories
PIO implementation in CCSM:
– Atmosphere: read and write history and restart; all dycores
– Ocean: read and write history and restart
– Land: write history
– Sea ice and coupler: in progress
PIO is being used in the high-resolution coupled model.
The backwards-compatible NetCDF mode has added value:
– Rearrangement to the I/O proc subset, followed by gather/write one piece at a time
– Avoids overflowing the memory of the root processor

PIO Success Stories
High-resolution atmosphere model test cases with the HOMME dynamical core.
Reading the input data is not possible without PIO!
Figure provided by Mark Taylor, Sandia National Lab

CAM-HOMME on BG/P
CAM-HOMME with full atmospheric physics and an aquaplanet surface.
Reading input data using PIO.
Figure provided by Mark Taylor, Sandia National Lab

PIO Deployment
Developed a configure system for portability across all CCSM platforms and sites.
– Supports a large set of options (enable/disable MCT, Parallel NetCDF, NetCDF, MPI-IO, serial compatibility, MPI-2, diagnostic modes, ...)
In current use on:
– Argonne BG/L, Intrepid (BG/P), Jazz (Intel, Linux)
– Blueice (Power5+, AIX), Bangkok (Intel, Linux)
– Jaguar (Opteron, XT4)
– Sandia cluster (Intel + Infiniband)
PIO is currently developed within the CCSM repository.
– Transitioning development to Google Code

Future Work
Clean up documentation.
More unit tests / system tests.
Understanding performance across the zoo of parallel I/O hardware/software.
Add to the rest of CCSM.
You will soon be able to download, use, and help develop PIO!