Parallel NetCDF Rob Latham Mathematics and Computer Science Division Argonne National Laboratory

2 I/O for Computational Science
Applications require more software than just a parallel file system. Break up support into multiple layers with distinct roles:
–Parallel file system maintains the logical space and provides efficient access to data (e.g. PVFS, GPFS, Lustre)
–Middleware layer deals with organizing access by many processes (e.g. MPI-IO, UPC-IO)
–High-level I/O library maps application abstractions to a structured, portable file format (e.g. HDF5, Parallel netCDF)
[Figure: the I/O software stack – Application, High-level I/O Library, I/O Middleware (MPI-IO), Parallel File System, I/O Hardware]

3 High Level Libraries
Match storage abstraction to domain:
–Multidimensional datasets
–Typed variables
–Attributes
Provide self-describing, structured files
Map to the middleware interface:
–Encourage collective I/O
Implement optimizations that middleware cannot, such as:
–Caching attributes of variables
–Chunking of datasets
[Figure: the I/O software stack – Application, High-level I/O Library, I/O Middleware (MPI-IO), Parallel File System, I/O Hardware]

4 Higher Level I/O Interfaces
Provide structure to files:
–Well-defined, portable formats
–Self-describing
–Organization of data in the file
–Interfaces for discovering contents
Present APIs more appropriate for computational science:
–Typed data
–Noncontiguous regions in memory and file
–Multidimensional arrays and I/O on subsets of these arrays
Both of our example interfaces are implemented on top of MPI-IO

5 PnetCDF Interface and File Format

6 Parallel netCDF (PnetCDF)
Based on the original “Network Common Data Format” (netCDF) work from Unidata:
–Derived from their source code
–Developed by Argonne, Northwestern, and the community
Data model:
–Collection of variables in a single file
–Typed, multidimensional array variables
–Attributes on the file and on variables
Features:
–C and Fortran interfaces
–Portable data format (identical to netCDF)
–Noncontiguous I/O in memory using MPI datatypes
–Noncontiguous I/O in file using sub-arrays
–Collective I/O
Unrelated to the netCDF-4 work (more later)

7 netCDF/PnetCDF Files
PnetCDF files consist of three regions:
–Header
–Non-record variables (all dimensions specified)
–Record variables (ones with an unlimited dimension)
Record variables are interleaved, so using more than one in a file is likely to result in poor performance due to noncontiguous accesses
Data is always written in a big-endian format
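As a small illustration of the record/non-record distinction (a sketch with made-up names, assuming an id ncid for a file currently in define mode): a variable becomes a record variable when its leading dimension is defined with length NC_UNLIMITED.

#include <pnetcdf.h>

static void define_fixed_and_record(int ncid)
{
    int time_dim, x_dim, dimids[2], fixed_var, record_var;

    ncmpi_def_dim(ncid, "x", 100, &x_dim);                /* fixed length */
    ncmpi_def_dim(ncid, "time", NC_UNLIMITED, &time_dim); /* unlimited    */

    /* non-record variable: all dimensions have specified lengths */
    ncmpi_def_var(ncid, "grid", NC_DOUBLE, 1, &x_dim, &fixed_var);

    /* record variable: the unlimited dimension comes first */
    dimids[0] = time_dim;
    dimids[1] = x_dim;
    ncmpi_def_var(ncid, "series", NC_DOUBLE, 2, dimids, &record_var);
}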

8 Storing Data in PnetCDF
Create a dataset (file):
–Puts the dataset in define mode
–Allows us to describe the contents
Define dimensions for variables
Define variables using those dimensions
Store attributes if desired (on a variable or the dataset)
Switch from define mode to data mode to write variables
Store variable data
Close the dataset
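A minimal end-to-end sketch of this sequence (illustrative only: the file name example.nc, the function write_example, and the variable and attribute names are made up, while the ncmpi_* calls are the standard PnetCDF API). Each process collectively writes its own slab of a 1-D variable:

#include <mpi.h>
#include <pnetcdf.h>

#define NPER 4                      /* elements written by each process */

static int write_example(MPI_Comm comm)
{
    int ncid, dimid, varid, rank, nprocs, ret;
    MPI_Offset start[1], count[1];
    double vals[NPER];

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);
    for (int i = 0; i < NPER; i++) vals[i] = rank * NPER + i;

    /* create: enters define mode */
    ret = ncmpi_create(comm, "example.nc", NC_CLOBBER, MPI_INFO_NULL, &ncid);
    if (ret != NC_NOERR) return ret;

    /* describe the contents: one dimension, one variable, one attribute */
    ncmpi_def_dim(ncid, "x", (MPI_Offset)nprocs * NPER, &dimid);
    ncmpi_def_var(ncid, "values", NC_DOUBLE, 1, &dimid, &varid);
    ncmpi_put_att_text(ncid, varid, "units", 7, "widgets");

    /* switch to data mode, then write each process's slab collectively */
    ncmpi_enddef(ncid);
    start[0] = (MPI_Offset)rank * NPER;
    count[0] = NPER;
    ncmpi_put_vara_double_all(ncid, varid, start, count, vals);

    return ncmpi_close(ncid);
}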

9 Simple PnetCDF Examples
Simplest possible PnetCDF version of “Hello World”:
–First program creates a dataset with a single attribute
–Second program reads the attribute and prints it
Shows very basic API use and error checking

10 Simple PnetCDF: Writing (1)
#include <mpi.h>
#include <pnetcdf.h>

int main(int argc, char **argv)
{
    int ncfile, ret, count;
    char buf[13] = "Hello World\n";

    MPI_Init(&argc, &argv);

    ret = ncmpi_create(MPI_COMM_WORLD, "myfile.nc", NC_CLOBBER,
                       MPI_INFO_NULL, &ncfile);
    if (ret != NC_NOERR) return 1;

    /* continues on next slide */

Integers are used for references to datasets, variables, etc.

11 Simple PnetCDF: Writing (2)
    ret = ncmpi_put_att_text(ncfile, NC_GLOBAL, "string", 13, buf);
    if (ret != NC_NOERR) return 1;

    ncmpi_enddef(ncfile);
    /* entered data mode – but nothing to do */

    ncmpi_close(ncfile);

    MPI_Finalize();
    return 0;
}

Storing the value while in define mode, as an attribute

12 Retrieving Data in PnetCDF
Open the dataset in read-only mode (NC_NOWRITE)
Obtain identifiers for dimensions
Obtain identifiers for variables
Read variable data
Close the dataset
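A minimal sketch of the same sequence in code (illustrative names: it reads back the example.nc file and "values" variable from the earlier write sketch, assuming the dimension length divides evenly among processes):

#include <mpi.h>
#include <pnetcdf.h>
#include <stdlib.h>

static int read_example(MPI_Comm comm)
{
    int ncid, dimid, varid, rank, nprocs, ret;
    MPI_Offset len, start[1], count[1];
    double *vals;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    ret = ncmpi_open(comm, "example.nc", NC_NOWRITE, MPI_INFO_NULL, &ncid);
    if (ret != NC_NOERR) return ret;

    /* discover the dimension length and the variable id from the header */
    ncmpi_inq_dimid(ncid, "x", &dimid);
    ncmpi_inq_dimlen(ncid, dimid, &len);
    ncmpi_inq_varid(ncid, "values", &varid);

    /* each process collectively reads its share of the variable */
    count[0] = len / nprocs;
    start[0] = rank * count[0];
    vals = malloc(count[0] * sizeof(double));
    ncmpi_get_vara_double_all(ncid, varid, start, count, vals);

    free(vals);
    return ncmpi_close(ncid);
}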

13 Simple PnetCDF: Reading (1)
#include <stdio.h>
#include <mpi.h>
#include <pnetcdf.h>

int main(int argc, char **argv)
{
    int ncfile, ret;
    MPI_Offset count;   /* attribute lengths are MPI_Offset in PnetCDF */
    char buf[13];

    MPI_Init(&argc, &argv);

    ret = ncmpi_open(MPI_COMM_WORLD, "myfile.nc", NC_NOWRITE,
                     MPI_INFO_NULL, &ncfile);
    if (ret != NC_NOERR) return 1;

    /* continues on next slide */

14 Simple PnetCDF: Reading (2)
    /* verify the attribute exists and is the expected size */
    ret = ncmpi_inq_attlen(ncfile, NC_GLOBAL, "string", &count);
    if (ret != NC_NOERR || count != 13) return 1;

    /* retrieve the stored attribute */
    ret = ncmpi_get_att_text(ncfile, NC_GLOBAL, "string", buf);
    if (ret != NC_NOERR) return 1;

    printf("%s", buf);

    ncmpi_close(ncfile);

    MPI_Finalize();
    return 0;
}

15 Compiling and Running
;mpicc pnetcdf-hello-write.c -I /usr/local/pnetcdf/include/ -L /usr/local/pnetcdf/lib -lpnetcdf -o pnetcdf-hello-write
;mpicc pnetcdf-hello-read.c -I /usr/local/pnetcdf/include/ -L /usr/local/pnetcdf/lib -lpnetcdf -o pnetcdf-hello-read
;mpiexec -n 1 pnetcdf-hello-write
;mpiexec -n 1 pnetcdf-hello-read
Hello World
;ls -l myfile.nc
-rw-r--r-- 1 rross rross 68 Mar 26 10:00 myfile.nc
;strings myfile.nc
string
Hello World

The file size is 68 bytes; the extra data (the header) is in the file.

16 Example: FLASH Astrophysics
FLASH is an astrophysics code for studying events such as supernovae:
–Adaptive-mesh hydrodynamics
–Scales to 1000s of processors
–MPI for communication
Frequently checkpoints:
–Large blocks of typed variables from all processes
–Portable format
–Canonical ordering (different than in memory)
–Skipping ghost cells
[Figure: a FLASH block showing ghost cells and stored elements, for variables 0 through 23]

17 Example: FLASH with PnetCDF
FLASH AMR structures do not map directly to netCDF multidimensional arrays
Must create a mapping of the in-memory FLASH data structures into a representation in netCDF multidimensional arrays
Chose to:
–Place all checkpoint data in a single file
–Impose a linear ordering on the AMR blocks (use 1D variables)
–Store each FLASH variable in its own netCDF variable (skip ghost cells)
–Record attributes describing run time, total blocks, etc.

18 Defining Dimensions
int status, ncid, dim_tot_blks, dim_nxb, dim_nyb, dim_nzb;
MPI_Info hints = MPI_INFO_NULL;   /* or an info object carrying hints */

/* create dataset (file) */
status = ncmpi_create(MPI_COMM_WORLD, filename, NC_CLOBBER,
                      hints, &ncid);

/* define dimensions */
status = ncmpi_def_dim(ncid, "dim_tot_blks", tot_blks, &dim_tot_blks);
status = ncmpi_def_dim(ncid, "dim_nxb", nzones_block[0], &dim_nxb);
status = ncmpi_def_dim(ncid, "dim_nyb", nzones_block[1], &dim_nyb);
status = ncmpi_def_dim(ncid, "dim_nzb", nzones_block[2], &dim_nzb);

Each dimension gets a unique reference

19 Creating Variables
int dims = 4, dimids[4];
int varids[NVARS];

/* define variables (X changes most quickly) */
dimids[0] = dim_tot_blks;
dimids[1] = dim_nzb;
dimids[2] = dim_nyb;
dimids[3] = dim_nxb;
for (i = 0; i < NVARS; i++) {
    status = ncmpi_def_var(ncid, unk_label[i], NC_DOUBLE,
                           dims, dimids, &varids[i]);
}

The same dimensions are used for all variables

20 Storing Attributes
/* store attributes of the checkpoint */
status = ncmpi_put_att_text(ncid, NC_GLOBAL, "file_creation_time",
                            string_size, file_creation_time);
status = ncmpi_put_att_int(ncid, NC_GLOBAL, "total_blocks", NC_INT,
                           1, &tot_blks);   /* attribute values passed by pointer */

status = ncmpi_enddef(ncid);
/* now in data mode … */

21 Writing Variables
double *unknowns;   /* unknowns[blk][nzb][nyb][nxb] */
MPI_Offset start_4d[4], count_4d[4];

start_4d[0] = global_offset;   /* different for each process */
start_4d[1] = start_4d[2] = start_4d[3] = 0;
count_4d[0] = local_blocks;
count_4d[1] = nzb; count_4d[2] = nyb; count_4d[3] = nxb;

for (i = 0; i < NVARS; i++) {
    /* ... build datatype "mpi_type" describing the values of a single
       variable ... */
    /* collectively write out all values of a single variable */
    ncmpi_put_vara_all(ncid, varids[i], start_4d, count_4d,
                       unknowns, 1, mpi_type);
}
status = ncmpi_close(ncid);

The flexible API takes a typical MPI buffer-count-type tuple
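One way the "mpi_type" above might be built is with MPI_Type_create_subarray, which is a natural fit for skipping ghost cells. This is a hedged sketch rather than FLASH's actual code: it assumes each locally owned block stores one variable as a C-ordered (nzb+2*ng) x (nyb+2*ng) x (nxb+2*ng) array with ng ghost layers on every side, and that those blocks sit one after another in the unknowns buffer:

#include <mpi.h>

static MPI_Datatype build_var_type(int nzb, int nyb, int nxb,
                                   int ng, int local_blocks)
{
    MPI_Datatype interior, mpi_type;
    int sizes[3]    = { nzb + 2*ng, nyb + 2*ng, nxb + 2*ng };
    int subsizes[3] = { nzb, nyb, nxb };
    int starts[3]   = { ng, ng, ng };

    /* the interior of one block, ghost cells excluded */
    MPI_Type_create_subarray(3, sizes, subsizes, starts, MPI_ORDER_C,
                             MPI_DOUBLE, &interior);
    /* one interior per locally owned AMR block */
    MPI_Type_contiguous(local_blocks, interior, &mpi_type);
    MPI_Type_commit(&mpi_type);
    MPI_Type_free(&interior);
    return mpi_type;   /* pass as the buffer type to ncmpi_put_vara_all */
}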

22 Inside PnetCDF: Define Mode
In define mode (collective):
–Uses MPI_File_open to create the file at create time
–Sets hints as appropriate (more later)
–Locally caches header information in memory; all changes are made to local copies at each process
At ncmpi_enddef:
–Process 0 writes the header with MPI_File_write_at
–MPI_Bcast the result to the others
–Everyone has the header data in memory and understands the placement of all variables
No need for any additional header I/O during data mode!
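The enddef step might look roughly like the following. This is a simplified sketch of the pattern described above, not PnetCDF's real implementation; the function name flush_header and its arguments are made up:

#include <mpi.h>

static int flush_header(MPI_Comm comm, MPI_File fh,
                        const void *header, int header_len)
{
    int rank, err = MPI_SUCCESS;
    MPI_Status status;

    MPI_Comm_rank(comm, &rank);
    /* every process already holds an identical cached copy of the header */
    if (rank == 0)
        err = MPI_File_write_at(fh, 0, header, header_len,
                                MPI_BYTE, &status);
    /* share the outcome; no further header I/O is needed in data mode */
    MPI_Bcast(&err, 1, MPI_INT, 0, comm);
    return err;
}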

23 Inside PnetCDF: Data Mode
Inside ncmpi_put_vara_all (once per variable):
–Each process performs data conversion into an internal buffer
–Uses MPI_File_set_view to define its file region (a contiguous region for each process in the FLASH case)
–MPI_File_write_all collectively writes the data
At ncmpi_close:
–MPI_File_close ensures data is written to storage
MPI-IO performs optimizations:
–Two-phase I/O possibly applied when writing variables
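The data-mode write boils down to the MPI-IO pattern below. Again a simplified sketch rather than PnetCDF's internals: var_offset is the variable's starting offset in the file (known to all processes from the cached header), and filetype/buftype are assumed to be already-built datatypes, with filetype constructed from MPI_BYTE:

#include <mpi.h>

static int write_variable_region(MPI_File fh, MPI_Offset var_offset,
                                 MPI_Datatype filetype,
                                 const void *buf, int bufcount,
                                 MPI_Datatype buftype)
{
    MPI_Status status;

    /* the view selects this process's (possibly noncontiguous) region */
    MPI_File_set_view(fh, var_offset, MPI_BYTE, filetype, "native",
                      MPI_INFO_NULL);
    /* all processes join one collective write */
    return MPI_File_write_all(fh, buf, bufcount, buftype, &status);
}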

24 Tuning PnetCDF: Hints
PnetCDF uses MPI_Info, so hint handling is identical to straight MPI-IO hints.
For example, turning off two-phase (collective buffering) writes, in case you're doing large contiguous collective I/O on Lustre:

MPI_Info info;

MPI_Info_create(&info);
MPI_Info_set(info, "romio_cb_write", "disable");

ncmpi_open(comm, filename, NC_NOWRITE, info, &ncfile);

MPI_Info_free(&info);
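The same info object can also be passed at file-creation time – ncmpi_create takes an MPI_Info argument, as in the FLASH example – so hints are in effect from the moment the file is created.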

25 Wrapping Up
PnetCDF gives us:
–A simple, portable, self-describing container for data
–Collective I/O
–Data structures closely mapping to the variables described
Easy – though not automatic – transition from serial netCDF
Datasets are interchangeable with serial netCDF
If PnetCDF meets application needs, it is likely to give good performance
–Type conversion to the portable format does add overhead
Complementary, not predatory:
–Research
–Friendly, healthy competition

26 References
PnetCDF (mailing list, SVN)
netCDF
ROMIO
MPI-IO
Shameless plug: Parallel I/O tutorial at SC2007

27 Acknowledgements
This work is supported in part by U.S. Department of Energy Grant DE-FC02-01ER25506, by National Science Foundation Grants EIA, CCR, and CCR, and by the U.S. Department of Energy under Contract W ENG-38. This work was performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory under Contract W-7405-Eng-48.