The HDF Group: HDF5 Overview
Elena Pourmal, The HDF Group
10/17/15, ICALEPCS 2015

Outline
- The HDF Group company
- Products and services
- Overview of HDF5
- What is coming in HDF release?
- Future directions

THE HDF GROUP COMPANY

Champaign, Illinois, USA

The HDF Group
- Not-for-profit company (since 2006), ex-NCSA at University of Illinois
- Offices in 5 states
- About 40 employees (more than 50% growth in the past 9 years)
  - Core software developers
  - Domain specialists
  - Documentation team
  - Technical support
- Mission-driven

The HDF Group Mission
To ensure long-term accessibility of HDF data through sustainable development and support of HDF technologies.

The HDF Group philosophy
- Committed to Open Source
  - HDF software is free
  - BSD type of license
- Community involvement
  - Testing
  - Patches
  - New features (e.g., CMake support)
- Serving diverse user base
  - Remote sensing, HPC, non-destructive testing, medical records, scientific modeling, etc.

Revenue by Source
[Chart; labeled sources include NASA, NOAA]

Revenue by Project Type

PRODUCTS AND SERVICES

The HDF Group products
Main product: HDF Technology Suite
- For managing high-volume, complex, heterogeneous data
- Flagship: HDF5 data store
- Flexible and efficient storage and I/O
- Portable
- Highly customizable
- Misc. tools
- Specialized software and tools (e.g., JPSS)

HDF5 IN 5 MINUTES
Data challenges addressed by HDF5

HDF5 Technology Platform
- HDF5 Abstract Data Model
  - Defines the “building blocks” for data organization and specification
  - Files, Groups, Links, Datasets, Attributes, Datatypes, Dataspaces
- HDF5 Software
  - Tools
  - Language Interfaces (C, Fortran, C++, Java)
  - HDF5 Library
- HDF5 Binary File Format
  - Bit-level organization of HDF5 file
  - Defined by HDF5 File Format Specification
- HDF5 Ecosystem
  - Tools and services (h5py, MATLAB, IDL, OPeNDAP, etc.)
  - Communities (Earth Sciences, medical imaging, modeling and visualization)
  - Community standards (NeXus, HDF-EOS5, h5part, CGNS)
  - Institutional support and endorsement (NASA, NOAA, DOE)

Members of the HDF community

Success stories
- Petabytes of NASA remote sensing data in HDF4 and HDF5 file formats
- New NASA/JPSS missions chose HDF5 format for data archiving
- Need to organize complex collections of data
- Long-term data preservation
- Efficient, scalable storage and access
[Figure: sample data table with columns lat, lon, temp]

Success story: Trillion Particle Simulation
- Physics plasma simulation at NERSC Cray XE6
- Simulation ran on 120,000 cores using
  - 80% of computing resources
  - 90% of available memory
  - 50% of Lustre scratch system
- and writing 10 one-trillion particle dumps of TBs in HDF5 files; sustained ~27 GB/sec; total 350 TBs in HDF5

The HDF Group services
- Helpdesk and mailing lists
  - Open to all users of HDF
- HDF5 Documentation
- HDF Examples (C, Fortran, C++, Java, Python, MATLAB)

The HDF Group services
- Standard support
  - Assistance in general areas of HDF usage
- Premium support
  - Access to our consulting and training resources
  - Limited consulting hours are included
- Enterprise support
  - Help with developing common strategies for managing HDF data within organization
  - Organization shares consulting/troubleshooting services
- Training
- Consulting, custom development and support

HDF RELEASE
New Upcoming Features

PERSISTENT FILE FREE SPACE TRACKING
Reusing free file space in a file

Unused space in HDF5 file
- HDF5 library currently only tracks free space while file is open
  - Space from deleted objects
  - Space from resized compressed chunks
- Free space in the file is “lost” after file is closed
  - h5repack is used to remove “holes” in the file
- New function H5Pset_file_space
  - Sets a property to track free space in the file that can be reused when file is reopened
  - Allows fine tuning space tracking

SCALABLE CHUNK INDEXING
Improving performance and saving space

Optimizing chunking storage and performance
- HDF5 has an ability to add more data to existing datasets (data arrays)
  - Special storage mechanism: chunked storage
  - B-trees are used to index chunks in the file
  - O(log n) lookup time
- HDF5 takes advantage of the access pattern and properties of the datasets
  - O(1) lookup time
  - File space savings when storing HDF5 metadata

Optimizing chunking storage and performance
- B-tree implementation was reworked to use less space in the file
  - Used for datasets with more than one unlimited dimension
- New indexing structures were introduced to achieve O(1) performance and storage savings in special cases

Optimizing chunking storage and performance
Examples of O(1) lookup access:
- Fixed-size chunked dataset with no compression filters: algorithmic lookup
- Fixed-size chunked dataset with compression filters: array to index chunks
- Fixed-size dataset stored in one chunk (i.e., we now allow compression for contiguous dataset): no index
- Dataset with one unlimited dimension: extensible array to index chunks
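The "algorithmic lookup" case can be pictured with a little arithmetic. The sketch below is plain C written for illustration, not HDF5 library code: the function names are hypothetical, and it assumes chunks are numbered in row-major order and stored back to back with a fixed byte size, which is what makes a constant-time address calculation possible when no filter changes chunk sizes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an HDF5 API): map element coordinates to a
 * row-major chunk number for a dataset with fixed dimensions. */
size_t chunk_number(size_t rank, const size_t dims[], const size_t chunk[],
                    const size_t coord[])
{
    size_t number = 0;
    for (size_t d = 0; d < rank; d++) {
        assert(chunk[d] > 0 && coord[d] < dims[d]);
        /* Number of chunks along this dimension, rounded up. */
        size_t chunks_in_dim = (dims[d] + chunk[d] - 1) / chunk[d];
        number = number * chunks_in_dim + coord[d] / chunk[d];
    }
    return number;
}

/* Assumption: every chunk occupies chunk_bytes and chunks are laid out
 * consecutively from base, so no index structure has to be consulted. */
unsigned long long chunk_address(unsigned long long base, size_t number,
                                 unsigned long long chunk_bytes)
{
    return base + (unsigned long long)number * chunk_bytes;
}
```

For a 100 x 60 dataset with 10 x 20 chunks, element (25, 45) falls in chunk row 2, chunk column 2, which is chunk number 2 * 3 + 2 = 8. Once filters are involved chunk sizes vary, which is presumably why the slide lists an array index for that case instead.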

CONCURRENCY: SINGLE-WRITER/MULTIPLE-READER

Concurrent Access to Data
[Diagram labels: Writer, HDF5 File, Reader]
New data elements are added to a dataset in the file, which can be read by a reader, with no IPC necessary.

VIRTUAL DATASET (VDS)
Managing data stored across HDF5 files

VDS Use Case with NPP satellite data
4 granules in 9 GMODO-SVM07… files
Visualization with IDV

VDS Use Case with NPP satellite data
One virtual dataset with 36 granules stored in one file
Visualization with IDV

VDS use case: Percival detector
VDS has images A, B, C and D interleaved
[Diagram labels: time; series of images; writer; files a.h5, b.h5, c.h5, d.h5 holding Dataset A, Dataset B, Dataset C, Dataset D; time steps t1, t2, t3, t4, ..., t(1+4k), ..., t(3+4k); Virtual Dataset in VDS.h5; reader]
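The interleaving can be expressed as a simple index mapping. The sketch below is plain C for illustration and does not use the HDF5 VDS API; `vds_locate` and `struct vds_source` are hypothetical names, and the round-robin order A, B, C, D (so that frames t(1+4k) come from a.h5 and frames t(3+4k) from c.h5) is an assumption read off the diagram labels.

```c
#include <stddef.h>

/* Where one frame of the interleaved virtual dataset physically lives. */
struct vds_source {
    const char *file;  /* source file name as labeled on the slide */
    size_t      frame; /* 0-based frame index inside that file's dataset */
};

/* Hypothetical mapping, assuming strict round-robin interleaving:
 * 0-based VDS frame i lives in source i % 4 at local frame i / 4. */
struct vds_source vds_locate(size_t vds_frame)
{
    static const char *const files[4] = {"a.h5", "b.h5", "c.h5", "d.h5"};
    struct vds_source src = { files[vds_frame % 4], vds_frame / 4 };
    return src;
}
```

Under that assumption, 0-based VDS frame 6 (the seventh image in time) maps to c.h5, local frame 1. A reader of VDS.h5 never performs this calculation itself; the point of the virtual dataset is that the mapping is stored in the file.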

VDS: Conceptual View

METADATA CACHE IMAGE
Performance boost when opening and closing HDF5 files

Problem: Metadata Cache Image
- HDF5 metadata is typically small and scattered throughout the file.
  - The resulting many small I/Os are a major problem for parallel file systems.
- Metadata cache minimizes this during normal operation, but must still populate cache on file open, and flush it on file close.
  - This is a problem if files are opened and closed often.

Solution: Metadata Cache Image
- Store the contents of the metadata cache in a single block at file close, and then populate the cache with the stored entries on file open.
- If the access pattern is similar over close and reopen, this should save a significant number of small I/O operations.
- This solution is implemented in the metadata cache image feature.

Metadata Cache Image
To enable, set cache image FAPL property on file create or open:

    H5AC_cache_image_config_t cache_image_config = {H5AC__CURR_CACHE_IMAGE_CONFIG_VERSION, TRUE, 0};
    fapl_id = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_libver_bounds(fapl_id, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);
    H5Pset_mdc_image_config(fapl_id, &cache_image_config);

Then create or open file as usual.

Metadata Cache Image
- Metadata cache image is read and deleted automatically on file open.
  - Must set cache image FAPL property again if a new cache image is desired on file close.
- Earlier versions of HDF5 that don't understand the cache image will refuse to open the file.
  - One can use a lightweight utility to remove the caching info, making the file compatible with 1.8.
- Prototype implementation showed an order of magnitude speedup on parallel systems.

DATA AGGREGATION AND PAGE BUFFERING
Performance improvements

Page buffering / Data aggregation
Aggregate and align metadata and small data, perform I/O in aligned pages

Data and Metadata Aggregators
The new aggregators pack small raw data and metadata allocations into aligned blocks which work with the page buffer.
[Diagram labels: HDF5 File, Metadata, Data, Small allocations]

HDF5 Page Buffering
- Page buffer contains MD pages (L2 cache)
- Metadata blocks are multiples of 64K
- Metadata blocks are aligned
[Diagram label: HDF5 File]
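The idea of packing small allocations into aligned blocks that are multiples of 64K can be sketched in a few lines. This is plain C for illustration only: `MD_BLOCK_UNIT`, `struct aggregator`, and both functions are hypothetical names, not HDF5 structures or APIs, and the bump-pointer policy is an assumption chosen to keep the example short.

```c
#include <assert.h>
#include <stdint.h>

#define MD_BLOCK_UNIT 65536u /* 64K, the granularity named on the slide */

/* Round a requested size up to a whole number of 64K units. */
uint64_t round_up_to_block(uint64_t nbytes)
{
    return (nbytes + MD_BLOCK_UNIT - 1) / MD_BLOCK_UNIT * MD_BLOCK_UNIT;
}

/* Hypothetical aggregator state: one aligned block being filled. */
struct aggregator {
    uint64_t block_start; /* aligned file offset of the current block */
    uint64_t block_size;  /* a multiple of MD_BLOCK_UNIT */
    uint64_t used;        /* bytes already handed out from the block */
};

/* Hand out space for one small allocation from the current block.
 * Returns the file offset, or UINT64_MAX if the block cannot hold it. */
uint64_t aggregator_alloc(struct aggregator *agg, uint64_t nbytes)
{
    assert(agg->block_start % MD_BLOCK_UNIT == 0);
    assert(agg->block_size % MD_BLOCK_UNIT == 0);
    if (nbytes > agg->block_size - agg->used)
        return UINT64_MAX;
    uint64_t offset = agg->block_start + agg->used;
    agg->used += nbytes;
    return offset;
}
```

Because consecutive small allocations land in the same aligned block, a buffer that reads and writes whole blocks can serve many of them with one larger I/O, which is the effect the slides describe.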

IMPROVEMENTS FOR PARALLEL ACCESS
HDF5 Parallel

Problems We Solved for PHDF5
- Slowness on opening and closing HDF5 files
  - Metadata Cache Optimizations
    - Avoiding the Metadata Read Storm
    - Collective Metadata Writes
  - Avoid Truncate Feature
- Writing/reading multiple variables
  - Collective I/O on multiple datasets or Multi-Dataset I/O
- I/O on selections bigger than 2GB with MPICH
- Page Buffering
  - A layer under the VFD to capture small I/Os and cache them for larger paged size I/Os.

Metadata reads with CGNS and netCDF-4
[Plots: CGNS reads on Blue Gene, GPFS; netCDF-4 reads on Cray XE6, GPFS]

Collective I/O on multiple datasets
- Two new routines: H5Dread_multi() and H5Dwrite_multi()
- The plot shows the performance difference between calling H5Dwrite() multiple times and calling H5Dwrite_multi() once on 30 chunked datasets on Cray XE-6 with Lustre file system (hopper).

BACKWARD/FORWARD COMPATIBILITY ISSUES HDF

Backward/Forward compatibility issues
- HDF will always read files created by the earlier versions
- HDF by default will create files that can be read by HDF5 1.8.*
- HDF will create files incompatible with 1.8 version if new features are used
- Tools to “downgrade” the file created by HDF
  - h5format_convert (SWMR files; doesn’t rewrite raw data)
  - h5repack (VDS, SWMR and other; does rewrite data)

EXPLORING NEW DIRECTIONS
Examples

HDF5 ODBC Driver
- Open DataBase Connectivity (ODBC)
  - Industry standard middleware API for accessing database management systems
  - All analytics apps have an ODBC client
- HiFive: ODBC driver for HDF5
  - Windows, [Linux, MacOS X]
  - Client & Client/Server
  - Accessing HDF5 files from Excel & R
Thanks to Gerd Heber, THG

HDF5 for the Web
- Can I access HDF5 files remotely? API?
- My (mobile) client speaks HTTP!
- What is a file system? Who uses files anymore?
- Cloud computing w/ HDF
Thanks to John Readey, THG

Emerging Trends in I/O
- Increased computational power…
  - Huge expansion of simulation data volume & metadata complexity
  - Complex to manage and analyze
- …achieved through parallelism
  - 100,000s nodes with 10s millions cores
  - More frequent hardware & software failures
- …tiered storage architectures
  - High performance fabric & solid state storage on-cluster
  - Low performance, high capacity disk-based storage off-cluster
- …object-based storage
The HDF Group has been working with Intel and others on the Fast Forward Project to investigate and contribute to those trends

HDF5 role in the Fast Forward Storage Stack
- Object storage
  - Virtual Object Layer (VOL)
- Data Integrity / Fault Tolerance
  - Transaction
  - End-to-end checksums
- Data Analysis Extensions
  - Query/View/Index APIs
  - Analysis Shipping

HDF5 as an interface to non-HDF5 storage

HDF5 as an interface to non-HDF5 storage
Different File Formats plugins:

DATA INDEXING
Features we are investigating

Indexing and HDF5
- New APIs for indexing and querying of both structure and contents of HDF5 file
- H5Q API defines query to apply to a file
  - Create/combine queries (OR, AND)
  - Basic operators supported (≤, ≥, =, ≠) on either dataset/attribute values, link/attribute names
- HDF5V API retrieves data
- HDF5X API adds third-party indexing plugins

Example: Combined query
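A combined query of this kind can be pictured as a small expression tree whose leaves are comparisons and whose inner nodes are AND/OR. The sketch below is plain C written for illustration; it does not use or reproduce the H5Q API, and every type and function name in it is hypothetical. It only covers matching on numeric values, not on link or attribute names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The four basic operators plus the two combinators named on the slide. */
enum q_kind { Q_LE, Q_GE, Q_EQ, Q_NE, Q_AND, Q_OR };

struct query {
    enum q_kind kind;
    double value;                  /* operand for comparison nodes */
    const struct query *lhs, *rhs; /* children for AND/OR nodes */
};

/* Evaluate the query tree against a single data value. */
bool query_match(const struct query *q, double x)
{
    assert(q != NULL);
    switch (q->kind) {
    case Q_LE:  return x <= q->value;
    case Q_GE:  return x >= q->value;
    case Q_EQ:  return x == q->value;
    case Q_NE:  return x != q->value;
    case Q_AND: return query_match(q->lhs, x) && query_match(q->rhs, x);
    case Q_OR:  return query_match(q->lhs, x) || query_match(q->rhs, x);
    }
    return false;
}

/* Count the elements of an in-memory array that satisfy the query. */
size_t query_count(const struct query *q, const double *data, size_t n)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        if (query_match(q, data[i]))
            hits++;
    return hits;
}
```

A query such as "value ≥ 21 AND value ≤ 23" is then two leaf nodes joined by one `Q_AND` node. This version scans every element; an indexing plugin of the kind the slides mention would presumably exist to avoid exactly that full scan.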

Thank You! Questions?
The HDF Group