HDF5 A new file format & software for high performance scientific data management.

High performance data requirements
  larger datasets (> terabyte)
  bigger, faster machines and storage systems
  varied architectures and I/O paradigms
  parallel computing environments
  complex subsetting
  complex data

HDF5 based on lessons learned from:
  Existing standards
    – HDF, PDB, AIO, netCDF, MPI-IO and others
  Computer science
  ASCI physics applications and users
  Earth science applications and users
  Other users

… and ASCI Requirements
  Compatibility with vector bundle model
  Collective access
  MPI-IO
  Transform data between memory & storage
  Parallel file systems: PIOFS, HPSS, etc.

Data model
  Datatypes (array elements)
    – integer & float
    – strings & pointers
    – compound (record structures)
  Aggregate object types
    – “dataset”: multidimensional array
    – grouping structure
    – each object has a name & attributes

Basic data object: array of records
  Dimensionality: 5 x 3
  Number type: record (int8, int4, int16, float32)
  [Figure: a 5 x 3 array in which each element is one record]
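The array-of-records idea can be sketched in plain Python with the standard `struct` module. This is only an illustration of a packed record layout, not the HDF5 API. The field widths are assumptions: `int8`, `int16` and `float32` are read as 8-, 16- and 32-bit values, and `int4` is stored in a whole byte because `struct` has no 4-bit type. The helper names are hypothetical.

```python
import struct

# Hypothetical packed layout for one record: int8, int4, int16, float32.
# Assumption: "int4" occupies a full byte, since struct has no 4-bit type.
RECORD = struct.Struct("<bbhf")  # little-endian, no padding

ROWS, COLS = 5, 3  # dimensionality from the slide: 5 x 3


def pack_array(records):
    """Pack a ROWS x COLS nested list of 4-tuples in row-major order."""
    return b"".join(RECORD.pack(*rec) for row in records for rec in row)


def read_element(buf, i, j):
    """Unpack the single record stored at row i, column j."""
    offset = (i * COLS + j) * RECORD.size
    return RECORD.unpack_from(buf, offset)


# Build a small example array and pack it into one contiguous buffer.
records = [
    [(i, j, i * 100 + j, 0.5 * (i + j)) for j in range(COLS)]
    for i in range(ROWS)
]
buf = pack_array(records)
```

Because every record has the same size, the element at `(i, j)` can be located by offset arithmetic alone, without unpacking its neighbours.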

Storage Capacity
                                    HDF4                    HDF5
  Store large objects               Limit: 2 gigabytes      no limit
  Store large numbers of objects    Limit: 20,000 objects   no limit

Dataset components
  a multidimensional array of data elements
  header with metadata
    – datatype
    – dataspace
    – attributes
    – storage info
  [Figure: Dataset “Fred”, made up of a metadata header and data]
    Datatype: int16
    Dataspace: rank and dimensions (Dim_1=5, Dim_2=4, Dim_3=2)
    Attributes: time = 32.4, pressure = 987, temp = 56
    Storage info: chunked; compressed

Groups
  Group structure for organizing the file
  Every file starts with a root group
  Like directories in file system
  Groups have attributes
  [Figure: group hierarchy “/”, “/foo”, “/foo/bar”]
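A minimal sketch of the directory-like group structure, written in plain Python rather than against the HDF5 library. The `Group` class and its method names are hypothetical and exist only to show how a root group, nested groups, and per-group attributes fit together.

```python
class Group:
    """A named container holding child groups and attributes (illustrative only)."""

    def __init__(self, name=""):
        self.name = name
        self.children = {}
        self.attrs = {}

    def create_group(self, path):
        """Create every missing group along a path such as 'foo/bar'."""
        node = self
        for part in filter(None, path.split("/")):
            node = node.children.setdefault(part, Group(part))
        return node

    def lookup(self, path):
        """Resolve an absolute path such as '/foo/bar', starting at this group."""
        node = self
        for part in filter(None, path.split("/")):
            node = node.children[part]
        return node

    def paths(self, prefix=""):
        """Yield the absolute path of this group and of every descendant."""
        yield prefix or "/"
        for name, child in sorted(self.children.items()):
            yield from child.paths(prefix + "/" + name)


root = Group()                      # every file starts with a root group
bar = root.create_group("foo/bar")  # creates /foo and /foo/bar
bar.attrs["units"] = "m"            # groups carry attributes too
```

Listing `root.paths()` walks the hierarchy from the root, much as a recursive directory listing would.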

Special Storage Options
  chunked       Improves subsetting access time
  compressed    Improves storage efficiency
  extendable    Arrays can be extended individually
  split file    Metadata in one file. Raw data in another.
  [Figure: Dataset “Fred” split across two files: File A holds the metadata for Fred, File B holds the data for Fred]
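Why chunking helps subsetting can be seen from a little index arithmetic. The sketch below is plain Python, not HDF5 code; `chunks_touched` is a hypothetical helper that lists which chunks a rectangular selection overlaps, assuming a regular chunk grid and a selection of at least one element per dimension.

```python
from itertools import product


def chunks_touched(start, count, chunk_shape):
    """Return grid coordinates of every chunk overlapped by a rectangular selection.

    start, count: per-dimension offset and extent of the selection, in elements.
    chunk_shape:  per-dimension chunk size, in elements.
    """
    ranges = []
    for s, c, ch in zip(start, count, chunk_shape):
        first = s // ch            # chunk holding the first selected element
        last = (s + c - 1) // ch   # chunk holding the last selected element
        ranges.append(range(first, last + 1))
    return list(product(*ranges))


# A 10-column-wide strip through a 1000 x 1000 array stored as 100 x 100 chunks.
strip = chunks_touched(start=(0, 250), count=(1000, 10), chunk_shape=(100, 100))
```

Here the strip overlaps only the ten chunks in one column of the chunk grid, so a reader needs to fetch a small fraction of the hundred chunks in the array.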

The HDF5 Library
  New API and programming model
  Smaller, better, faster
  Able to support parallel I/O better
  OO compatible
  C & Fortran still primary, others considered
  I/O performance emphasized
  Current platforms
    – ASCI: IBM SP2, SGI Origin 2000, Intel Teraflop
    – Solaris, Linux, HPUX, IRIX, NT

Sub-selection Options
  Flexibility in mappings between data in memory and object in file
  Selection regions can be
    – points
    – hyperslabs
    – unions of hyperslabs
  Selection region in memory can be different shape from selection in file
  Supports I/O needs for parallel computation

Mappings between file dataspaces/selections and memory dataspaces/selections.
  (a) A hyperslab from a 2D array to the corner of a smaller 2D array
  (b) A regular series of blocks from a 2D array to a contiguous sequence at a certain offset in a 1D array
  (c) A sequence of points from a 2D array to a sequence of points in a 3D array.
  (d) Union of hyperslabs in file to union of hyperslabs in memory. Number of elements must be equal.
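Mapping (b) can be sketched in plain Python. This is an illustration of the selection idea, not a call into the HDF5 library. The parameter names `start`, `stride`, `count` and `block` follow the usual hyperslab vocabulary; the helper functions and the example array are hypothetical.

```python
from itertools import product


def hyperslab_indices(start, stride, count, block):
    """List the element coordinates of a regular hyperslab in row-major order.

    Per dimension: `count` blocks of `block` elements each, with consecutive
    blocks beginning `stride` elements apart and the first one at `start`.
    """
    per_dim = []
    for s, st, c, b in zip(start, stride, count, block):
        per_dim.append([s + k * st + e for k in range(c) for e in range(b)])
    return list(product(*per_dim))


def transfer(file_array, file_sel, memory, mem_sel):
    """Copy selected file elements into selected memory slots, pairwise in order."""
    if len(file_sel) != len(mem_sel):
        raise ValueError("number of elements must be equal")
    for (i, j), (m,) in zip(file_sel, mem_sel):
        memory[m] = file_array[i][j]


# "File": a 6 x 8 array whose value encodes its own position.
file_array = [[i * 10 + j for j in range(8)] for i in range(6)]

# Four 2 x 2 blocks from the 2D array ...
file_sel = hyperslab_indices(start=(0, 1), stride=(3, 4), count=(2, 2), block=(2, 2))
# ... mapped to 16 contiguous slots at offset 3 in a 1D buffer.
memory = [None] * 20
mem_sel = hyperslab_indices(start=(3,), stride=(1,), count=(16,), block=(1,))

transfer(file_array, file_sel, memory, mem_sel)
```

The two selections have different shapes (2D blocks versus a 1D run) and the transfer only requires that they contain the same number of elements.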

HDF5 Raw Data Pipeline
  Handles all aspects of data storage and transfer of data between file and application.
  Deals with multiple storage options
    – chunking, compression, number conversion, ...
  Optimized performance for common usage
  Hooks for new filters
    – compression schemes, encryption, checksum, ...
    – user-specified filters
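The filter-hook idea can be sketched as a list of encode/decode pairs applied in order on write and in reverse on read. The code is plain Python using the standard `zlib` module and is an assumption-laden illustration, not the HDF5 filter interface. `write_chunk`, `read_chunk` and the checksum helpers are hypothetical names.

```python
import zlib


def checksum_encode(data):
    """Append a 4-byte CRC-32 so corruption can be detected on read."""
    return data + zlib.crc32(data).to_bytes(4, "big")


def checksum_decode(data):
    """Verify and strip the trailing CRC-32."""
    payload, stored = data[:-4], int.from_bytes(data[-4:], "big")
    if zlib.crc32(payload) != stored:
        raise ValueError("checksum mismatch")
    return payload


# Each filter is an (encode, decode) pair; a user-specified filter
# would simply be another pair appended to this list.
PIPELINE = [
    (zlib.compress, zlib.decompress),
    (checksum_encode, checksum_decode),
]


def write_chunk(raw, pipeline=PIPELINE):
    """Apply every filter's encoder, first to last."""
    for encode, _ in pipeline:
        raw = encode(raw)
    return raw


def read_chunk(stored, pipeline=PIPELINE):
    """Apply every filter's decoder, last to first."""
    for _, decode in reversed(pipeline):
        stored = decode(stored)
    return stored


raw = bytes(range(256)) * 16
stored = write_chunk(raw)
```

Keeping filters as independent pairs means the application code that calls `write_chunk` and `read_chunk` does not change when a filter is added or removed.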

Performance tuning
  Facilities for performance measurement
    – timing tests in test suite
    – Pablo instrumentation
  Caching
    – app can set cache size for metadata & chunks
  Parallel optimizations
    – efficient metadata management
    – chunking
    – can control placement on physical media
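An application-sized chunk cache might look like the sketch below. The slide only says the cache size is settable, so the least-recently-used eviction policy, the `ChunkCache` class, and the `load_chunk` callback are all assumptions made for illustration.

```python
from collections import OrderedDict


class ChunkCache:
    """Least-recently-used cache of chunks, bounded by a byte budget."""

    def __init__(self, max_bytes, load_chunk):
        self.max_bytes = max_bytes
        self.load_chunk = load_chunk  # hypothetical callback: chunk id -> bytes
        self.chunks = OrderedDict()   # oldest entry first
        self.used = 0
        self.hits = 0
        self.misses = 0

    def get(self, chunk_id):
        if chunk_id in self.chunks:
            self.hits += 1
            self.chunks.move_to_end(chunk_id)  # mark as most recently used
            return self.chunks[chunk_id]
        self.misses += 1
        data = self.load_chunk(chunk_id)
        self.chunks[chunk_id] = data
        self.used += len(data)
        # Evict least recently used chunks until back under budget,
        # always keeping the chunk that was just loaded.
        while self.used > self.max_bytes and len(self.chunks) > 1:
            _, old = self.chunks.popitem(last=False)
            self.used -= len(old)
        return data


# Room for two 100-byte chunks; the access pattern touches three.
cache = ChunkCache(max_bytes=200, load_chunk=lambda chunk_id: bytes(100))
for chunk_id in [0, 1, 0, 2, 1]:
    cache.get(chunk_id)
```

Comparing `cache.hits` with `cache.misses` for a representative access pattern gives a rough way to judge whether a larger budget would pay off.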

HDF5 and ASCI Applications
  Multi-lab collaboration
    – DOE Tri-lab: Livermore, Sandia, Los Alamos
    – NCSA, Limit Point Systems
  Motivation:
    – Data sharability
    – Application interoperability
  Leverage experiences
    – EXODUS (SNL), SILO & PDB (LLNL)
    – HDF (NCSA), netCDF (UCAR)

ASCI DMF Data Abstraction
  Objectives
    – Sound data model with robust data abstractions
    – Computational mechanics data: meshes & fields
    – Based on mathematical field of fiber bundles
    – Common format allows common tools & sharing
    – Common API shields apps from model complexities
  [Figure: software layers, from the application down]
    APPLICATION
    Mesh APIs (SNL/LANL)
    Data Structure Layer (LLNL)
    Fiber Bundle Kernel (LLNL)
    HDF5 (NCSA)
    MPI IO (ANL)

HDF5 driver projects
  Project                                        Application                                  Types of data
  ASCI                                           Computational mechanics                      Fields on meshes: structured, unstructured, hierarchical
  CANIS: UIUC Digital Library Project            Concept space analysis of medical abstracts  Object store for large collection of small objects (noun phrases)
  TRAPPIST: non-destructive testing consortium   Non-destructive testing                      NDT experiment data: tomography and radiology
  NASA Earth Observing System                    Earth Science data management                Remote sensing: swath, grid and point data