HDF5: State of the Union
Quincey Koziol, The HDF Group
November 13, 2012

What is HDF5?
- A versatile data model that can represent very complex data objects and a wide variety of metadata.
- A completely portable file format with no limit on the number or size of data objects stored.
- An open source software library that runs on a wide range of computational platforms, from cell phones to massively parallel systems, and implements a high-level API with C, C++, Fortran 90, and Java interfaces.
- A rich set of integrated performance features that allow for access-time and storage-space optimizations.
- Tools and applications for managing, manipulating, viewing, and analyzing the data in the collection.
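To make the library side of this concrete, here is a minimal sketch (not from the slides) that creates an HDF5 file and writes a small 2-D dataset with the C API; the file and dataset names are illustrative.

```c
#include "hdf5.h"

int main(void)
{
    int     data[2][3] = {{1, 2, 3}, {4, 5, 6}};
    hsize_t dims[2]    = {2, 3};

    /* Create a new file, truncating any existing file with the same name */
    hid_t file  = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* Describe the shape of the data, then create and write the dataset */
    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "/values", H5T_NATIVE_INT, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    /* Release all handles */
    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}
```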

HDF5 Technology Platform
- HDF5 Abstract Data Model: defines the "building blocks" for data organization and specification – Files, Groups, Links, Datasets, Attributes, Datatypes, Dataspaces.
- HDF5 Software: tools, language interfaces, and the HDF5 library.
- HDF5 Binary File Format: bit-level organization of an HDF5 file, defined by the HDF5 File Format Specification.

HDF5 Data Model
- Groups – provide structure among objects.
- Datasets – where the primary data goes: data arrays, a rich set of datatype options, flexible and efficient storage and I/O.
- Attributes – for metadata.
Essentially everything else is built from these parts.

Structures to organize objects (figure): a "/" (root) group containing a "/TestData" group; the groups hold datasets such as raster images, a palette, 2-D and 3-D arrays, and a lat/lon/temp table.
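A hedged sketch of how that kind of layout is built with the C API: create a group, put a dataset in it, and attach an attribute. The names ("/TestData", "temperature", "units") are illustrative, not taken from the slides.

```c
#include "hdf5.h"

int main(void)
{
    double  temps[4] = {23.0, 24.1, 21.5, 3.6};
    hsize_t dims[1]  = {4};
    hid_t   file     = H5Fcreate("organized.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* Groups provide the directory-like structure among objects */
    hid_t group = H5Gcreate2(file, "/TestData", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* A dataset inside the group holds the primary data */
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset  = H5Dcreate2(group, "temperature", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, temps);

    /* An attribute carries user metadata about the dataset */
    hid_t ascalar = H5Screate(H5S_SCALAR);
    hid_t atype   = H5Tcopy(H5T_C_S1);
    H5Tset_size(atype, 8);
    hid_t attr    = H5Acreate2(dset, "units", atype, ascalar, H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, atype, "celsius");

    H5Aclose(attr); H5Tclose(atype); H5Sclose(ascalar);
    H5Dclose(dset); H5Sclose(space); H5Gclose(group); H5Fclose(file);
    return 0;
}
```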

Why use HDF5?
- Challenging data: application data that pushes the limits of what can be addressed by traditional database systems, XML documents, or in-house data formats.
- Software solutions:
  - For very large datasets, very fast access requirements, or very complex datasets.
  - To easily share data across a wide variety of computational platforms using applications written in different programming languages.
  - That take advantage of the many open-source and commercial tools that understand HDF5.
- Enabling long-term preservation of data.

Who uses HDF5? Examples of HDF5 user communities:
- Astrophysics and astronomy
- NASA Earth Science Enterprise
- Dept. of Energy labs
- Supercomputing centers in the US, Europe, and Asia
- Financial institutions
- NOAA
- Manufacturing industries
- Many others
For a more detailed list, visit the HDF Group website.

Topics

Brief History of HDF
- 1987: At NCSA (University of Illinois), a task force formed to create an architecture-independent format and library: AEHOO (All Encompassing Hierarchical Object Oriented format), which became HDF.
- Early 1990's: NASA adopted HDF for the Earth Observing System project.
- 1996: DOE's ASC (Advanced Simulation and Computing) project began collaborating with the HDF group (NCSA) to create "Big HDF" (the increase in computing power of DOE systems at LLNL, LANL, and Sandia National Labs required bigger, more complex data files).
- "Big HDF" became HDF5, which was released with support from DOE labs, NASA, and NCSA.
- 2006: The HDF Group spun off from the University of Illinois as a non-profit corporation.

The HDF Group
- Began at the University of Illinois' National Center for Supercomputing Applications; 5 years as an independent non-profit company, "The HDF Group".
- The HDF Group owns HDF4 and HDF5.
- HDF4 & HDF5 formats, libraries, and tools are open source and freely available under a BSD-style license.
- Currently employs 37 FTEs.
- Looking for more developers now!

The HDF Group Mission: To ensure long-term accessibility of HDF data through sustainable development and support of HDF technologies.

Goals of The HDF Group
- Maintain and evolve HDF for sponsors and communities that depend on it.
- Provide support to the HDF communities through consulting, training, tuning, development, and research.
- Sustain the company for the long term to assure data access over time.

The HDF Group Services
- Helpdesk and mailing lists – available to all users as a first level of support.
- Priority support – rapid issue resolution and advice.
- Consulting – needs assessment, troubleshooting, design reviews, etc.
- Training – tutorials and hands-on practical experience.
- Enterprise support – coordinating HDF activities across departments.
- Special projects – adapting customer applications to HDF, new features and tools, research and development.

Members of the HDF support community
- NASA – Earth Observing System
- NOAA/NASA/Riverside Tech – NPOESS
- A large financial institution
- DOE – Exascale FastForward w/Intel & EMC
- DOE – projects w/LBNL & PNNL, ANL & ORNL
- Lawrence Livermore National Lab
- Sandia National Lab
- ITER – project with General Atomics
- A leading U.S. aerospace company
- University of Illinois/NCSA
- PSI/Dectris and DESY – European light sources
- Projects for the petroleum industry, vehicle testing, weapons research, and others
- "In kind" support

Income Profile – Total income: ~$3.7 million

New Directions We're Taking
- High-energy light source data storage – projects with DESY and PSI/Dectris to store data from European synchrotrons and particle accelerators.
- Applications of HDF5 in the bioinformatics field – working with researchers at IRRI & Brown U.
- Synthesis of HDF5 and database storage w/Oracle – exploring how to interact with HDF5 files through an SQL interface.

Cool recent application: trillion-particle simulation on NERSC's Hopper system
- VPIC running on roughly 100,000 cores of Hopper.
- Achieved a 27 GB/s sustained rate to each 32 TB HDF5 file (out of a 35 GB/s theoretical peak).

Topics

Where We've Been
Release 1.0:
- First "prototype" release in Oct. 1997.
- Incorporated the core data model: datatypes, dataspaces, datasets, and groups.
- Parallel support added in r1.0.1, in Jan. 1999.
Release 1.2 – Oct. 1999:
- Added support for bitfield, opaque, enumeration, variable-length, and reference datatypes.
- Added new 'h5toh4' tool.
- Lots of polishing; performance optimizations.

Where We've Been
Release 1.4 – Feb. 2001:
- Added Virtual File Driver (VFD) API layer, with many drivers.
- Added 'h4toh5' and h5cc tools, and XML output to h5dump.
- Added the array datatype.
- F90 & C++ API wrappers.
- Performance optimizations.
Release 1.6 – July 2003:
- Generic property API.
- Compact dataset storage.
- Added 'h5diff', 'h5repack', 'h5jam', and 'h5import' tools.
- Performance optimizations.

Where We're At Now
Release 1.8 – Feb. 2008:
- Features to support netCDF-4.
- Creation-order indexing on links and attributes.
- Integer-to-floating-point conversion support.
- NULL dataspace.
- More efficient group storage.
- External links.
- New H5L (links) and H5O (objects) APIs.
- Shared object header messages.
- UTF-8 support.
- Anonymous object creation.
- New tools: 'h5mkgrp', 'h5stat', 'h5copy'.
- CMake build support.
- Performance optimizations.
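As one illustration of the 1.8-era link features, here is a hedged sketch (file and object names are made up) that creates a group which tracks link creation order and adds an external link pointing into another file:

```c
#include "hdf5.h"

int main(void)
{
    hid_t file = H5Fcreate("master.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* Track (and index) creation order for links in this group */
    hid_t gcpl = H5Pcreate(H5P_GROUP_CREATE);
    H5Pset_link_creation_order(gcpl, H5P_CRT_ORDER_TRACKED | H5P_CRT_ORDER_INDEXED);
    hid_t group = H5Gcreate2(file, "/inputs", H5P_DEFAULT, gcpl, H5P_DEFAULT);

    /* External link: "/inputs/run1" points at object "/data" in another file */
    H5Lcreate_external("run1.h5", "/data", group, "run1", H5P_DEFAULT, H5P_DEFAULT);

    H5Gclose(group);
    H5Pclose(gcpl);
    H5Fclose(file);
    return 0;
}
```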

Repository Statistics (chart): HDF5 library source code – lines of code by date.

Repository Statistics (chart): HDF5 library source code – lines of code by release.

Software Engineering in HDF5
We spend considerable effort ensuring we produce very high quality code for HDF5. Current efforts:
- Correctness regression testing – nightly testing of >60 configurations on >20 machines.
- Performance regression testing.
- Static code analysis – Coverity, Klocwork.
- Memory leak detection – valgrind.
- Code coverage – coming soon.

Performance Regression Test Suite (chart).

HDF5 1.8.9 minor release (May '12)
Library:
- New API routine to validate object paths: H5LTpath_valid().
- New API routines to work with file images.
- New feature to merge committed datatypes when copying them.
Parallel I/O:
- New API routine to set MPI atomicity: H5Fset_mpi_atomicity().
Tools:
- New features added to h5repack and h5stat.
Bugs fixed:
- Many bugs fixed in library and tools.
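A small hedged example of the new path-validation routine from the high-level (H5LT) library; the file name and path are illustrative:

```c
#include "hdf5.h"
#include "hdf5_hl.h"
#include <stdio.h>

int main(void)
{
    hid_t file = H5Fopen("example.h5", H5F_ACC_RDONLY, H5P_DEFAULT);

    /* Check that every link in the path resolves and (last arg) that the
     * final object actually exists, not just the link to it. */
    htri_t ok = H5LTpath_valid(file, "/TestData/temperature", 1);
    printf("path %s\n", ok > 0 ? "is valid" : "is not valid");

    H5Fclose(file);
    return 0;
}
```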

HDF5 1.8.10 minor release (Nov '12)
Library:
- Reduced memory footprint for internal buffering.
- Improved behavior of collective chunk I/O.
Parallel I/O:
- New API routine to query why collective I/O was broken: H5Pget_mpio_no_collective_cause().
Tools:
- New features added to h5import.
- Retired some out-of-date performance tools.
Bugs fixed:
- Updated to the latest autotools release.
- Many fixes to tools, the high-level library, and Fortran.
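A hedged sketch of the new diagnostic call, for a parallel (MPI-enabled) build of HDF5; it assumes a dataset transfer property list dxpl that requested collective I/O elsewhere in the program:

```c
#include "hdf5.h"
#include <stdint.h>
#include <stdio.h>

/* After an H5Dwrite/H5Dread that requested collective I/O through 'dxpl',
 * report whether the library had to fall back to independent I/O and why. */
static void report_collective_cause(hid_t dxpl)
{
    uint32_t local_cause = 0, global_cause = 0;

    H5Pget_mpio_no_collective_cause(dxpl, &local_cause, &global_cause);

    if (global_cause == H5D_MPIO_COLLECTIVE)
        printf("collective I/O was performed\n");
    else
        printf("collective I/O was broken: local=0x%x global=0x%x\n",
               (unsigned)local_cause, (unsigned)global_cause);
}
```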

Where We'll Be Soon: Release 1.10.0 Overview
- Beta release expected soon.
- Stopped adding major features; fleshing out our current efforts now.
Major efforts:
- Improved scalability of chunked dataset access.
- Single-Writer/Multiple-Reader (SWMR) access.
- Improved fault tolerance.
- Initial support for asynchronous I/O.

Where We'll Be Soon: Release Details
- New chunked dataset indexing methods.
- Single-Writer/Multiple-Reader (SWMR) access (see the sketch after this list).
- Improved fault tolerance: journaled metadata writing, ordered updates.
- Page-aligned and buffered metadata access.
- Persistent file free-space tracking.
- Basic support for asynchronous I/O.
- Expanded Virtual File Driver (VFD) interface.
- Lazy metadata writes (in serial).
- Fortran 2003 support.
- Compressed group information.
- High-level "HPC" API: collective I/O on multiple datasets, metadata broadcast between processes.
- Performance optimizations.
- Avoid file truncation.
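For context on what SWMR access looks like for a writer, here is a hedged sketch using the interface that eventually shipped with this feature (H5Fstart_swmr_write and the latest-format setting); it postdates this talk and is shown only for illustration, with made-up file and dataset names:

```c
#include "hdf5.h"

int main(void)
{
    /* SWMR requires the latest file format features */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);

    hid_t file = H5Fcreate("swmr.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Create an extendible, chunked 1-D dataset before switching to SWMR mode */
    hsize_t dims = 0, maxdims = H5S_UNLIMITED, chunk = 1024;
    hid_t space = H5Screate_simple(1, &dims, &maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, &chunk);
    hid_t dset  = H5Dcreate2(file, "/samples", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* From this point on, readers may open the file concurrently */
    H5Fstart_swmr_write(file);

    /* ... append data with H5Dset_extent + H5Dwrite, then H5Dflush(dset) ... */

    H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space);
    H5Pclose(fapl); H5Fclose(file);
    return 0;
}
```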

Where We Might Get To (a later release, maybe)
- Full C99 type support (long double, complex, boolean types, etc.).
- Support for runtime file format limits.
- Improved variable-length datatype storage.
- Virtual Object Layer: an abstraction layer that allows HDF5 objects to be stored in any container, and allows applications to break the "all collective" metadata modification limit in current HDF5.

Where We're Not Going
- We're not changing multi-threaded concurrency support: the "global lock" on the library stays.
- Will use asynchronous I/O heavily.
- Will be using threads internally, though.

Topics

Tool activities in the works
- New tool – h5watch: display changes to a dataset, both metadata and raw data.
- New tool – h5compare: rewritten and improved version of h5diff, with improved code quality and testing.
- Tools library: general-purpose APIs for tools. Currently only for our developers; we want to make it public so that people can use it in their products.

Topics

HDF-Java 2.9 (Nov. '12)
- Maintenance release, no major new features.
- Built with HDF 4.2.8, HDF5, and Java 1.7.
- Many bug fixes and extended regression tests.
New HDFView features:
- Added feature to show groups/attributes in creation order.
- Exclude fill values in data calculations.
- Added a 'reload' option to quickly close and reopen a file.

Topics

HDF5 in the Future
"The best way to predict the future is to invent it." – Alan Kay

Plans, Guesses & Speculations: HPC
Improve parallel I/O performance (today's collective-I/O setup is sketched below):
- Continue to improve our use of MPI and parallel file system features.
- Reduce the number of I/O accesses for metadata.
- Integrate with in-situ/in-transit frameworks.
- Support asynchronous parallel I/O.
- Support Single-Writer/Multiple-Reader (SWMR) access in parallel.
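For readers new to parallel HDF5, the baseline these improvements build on is MPI-IO access through a file access property list plus collective dataset transfers. A minimal hedged sketch (file and dataset names are illustrative; requires an MPI-enabled HDF5 build):

```c
#include "hdf5.h"
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Open the file through the MPI-IO virtual file driver */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("parallel.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One row of the dataset per MPI rank */
    hsize_t dims[2]  = {(hsize_t)nprocs, 1024};
    hsize_t start[2] = {(hsize_t)rank, 0}, count[2] = {1, 1024};
    hid_t fspace = H5Screate_simple(2, dims, NULL);
    hid_t dset   = H5Dcreate2(file, "/field", H5T_NATIVE_DOUBLE, fspace,
                              H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    double buf[1024];
    for (int i = 0; i < 1024; i++) buf[i] = (double)rank;

    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t mspace = H5Screate_simple(2, count, NULL);

    /* Request collective I/O for the transfer */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, buf);

    H5Pclose(dxpl); H5Sclose(mspace); H5Dclose(dset); H5Sclose(fspace);
    H5Fclose(file); H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}
```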

Plans, Guesses & Speculations: Safety
Improve journaled HDF5 file access:
- Journal raw data operations.
- Allow multi-operation journal transactions to be created by applications.
- Support fully asynchronous journal operations.
- Enable journaling for parallel HDF5.

Plans, Guesses & Speculations: Thread Safety
Focus on asynchronous I/O, instead of improving multi-threaded concurrency of the HDF5 library:
- The library is currently thread-safe, but not concurrent.
- Instead of improving concurrency, focus on heavily leveraging asynchronous I/O.
- Use internal multi-threading where possible.

Plans, Guesses & Speculations: Grab Bag
- Improve the data model: shared dataspaces; attributes on dataspaces and datatypes.
- Improve the raw data chunk cache implementation.
- More efficient storage and I/O of variable-length data, including compression.
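For readers unfamiliar with HDF5 variable-length data (the last item above), a hedged sketch of how ragged rows are written today with the existing H5Tvlen_create API; the file and dataset names are illustrative:

```c
#include "hdf5.h"

int main(void)
{
    /* Two "rows" of different lengths */
    int     row0[]  = {1, 2, 3};
    int     row1[]  = {4, 5};
    hvl_t   data[2] = {{3, row0}, {2, row1}};
    hsize_t dims[1] = {2};

    hid_t file = H5Fcreate("ragged.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* A variable-length datatype whose elements are sequences of ints */
    hid_t vltype = H5Tvlen_create(H5T_NATIVE_INT);
    hid_t space  = H5Screate_simple(1, dims, NULL);
    hid_t dset   = H5Dcreate2(file, "/ragged", vltype, space,
                              H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, vltype, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset); H5Sclose(space); H5Tclose(vltype); H5Fclose(file);
    return 0;
}
```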

Topics

Exascale FastForward Research
- Whamcloud (now part of Intel), EMC & The HDF Group were recently awarded a contract for exascale storage research and prototyping (see the press coverage: .../doe_primes_pump_for_exascale_supercomputers.html).
- Using the HDF5 data model and interface as the top layer of a next-generation storage system for future exascale systems.
- Laundry list of new features to prototype in HDF5: pointer datatypes, asynchronous I/O, transactions, end-to-end consistency checking, query/index features, Python wrappers, etc.

Autotuning and Performance Tracing
Why? The dominant I/O support request at NERSC is poor I/O performance, much of which can be solved by enabling Lustre striping or tuning another I/O parameter. Scientists shouldn't have to figure this stuff out!
Two areas of focus:
- Evaluate techniques for autotuning HPC application I/O (file system, MPI, HDF5).
- Record and replay HDF5 I/O operations.

Autotuning HPC I/O
Goal: avoid tuning each application to each machine and file system.
- Create an I/O autotuner library that can inject "optimal" parameters for I/O operations on a given system.
- Applies to precompiled application binaries: the application is dynamically linked with the I/O autotuning library; no changes to the application or the HDF5 library (a conceptual sketch of this interposition follows).
- Tested with several HPC applications already: VPIC, GCRM, Vorpal.
- Up to 16x performance improvement compared to system default settings.
- See poster for more details!
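To illustrate the dynamic-linking idea (not the actual autotuner code), here is a hedged sketch of an LD_PRELOAD-style interposer that wraps H5Fcreate and injects a file-alignment setting before calling the real routine; the chosen values are placeholders:

```c
/* Build as a shared library and load with LD_PRELOAD (Linux).
 * Conceptual sketch only, not The HDF Group's autotuner. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include "hdf5.h"

hid_t H5Fcreate(const char *name, unsigned flags, hid_t fcpl_id, hid_t fapl_id)
{
    /* Locate the real H5Fcreate in the HDF5 library */
    static hid_t (*real_H5Fcreate)(const char *, unsigned, hid_t, hid_t) = NULL;
    if (!real_H5Fcreate)
        real_H5Fcreate = (hid_t (*)(const char *, unsigned, hid_t, hid_t))
                         dlsym(RTLD_NEXT, "H5Fcreate");

    /* Copy (or create) the file access property list and inject a tuned
     * alignment; a real autotuner would pick values for this system. */
    hid_t fapl = (fapl_id == H5P_DEFAULT) ? H5Pcreate(H5P_FILE_ACCESS)
                                          : H5Pcopy(fapl_id);
    H5Pset_alignment(fapl, 1, 1048576);   /* align objects to 1 MiB boundaries */

    hid_t file = real_H5Fcreate(name, flags, fcpl_id, fapl);
    H5Pclose(fapl);
    return file;
}
```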

Recording and Replaying HDF5
Goal: extract an "I/O kernel" from an application without examining the application code.
Method:
- Dynamically link a library at run time to record all HDF5 calls and parameters in a "replay file".
- Create a parallel replay tool that uses the recording to replay the HDF5 operations.
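A hedged sketch of the recording idea in the same interposition style (again, not the project's actual code): wrap H5Dwrite, log the call and a few of its handle values to a trace file, then forward to the real routine.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include "hdf5.h"

herr_t H5Dwrite(hid_t dset, hid_t mem_type, hid_t mem_space,
                hid_t file_space, hid_t dxpl, const void *buf)
{
    static herr_t (*real_H5Dwrite)(hid_t, hid_t, hid_t, hid_t, hid_t, const void *) = NULL;
    if (!real_H5Dwrite)
        real_H5Dwrite = (herr_t (*)(hid_t, hid_t, hid_t, hid_t, hid_t, const void *))
                        dlsym(RTLD_NEXT, "H5Dwrite");

    /* Append one record per call to a simple text "replay file"; a real
     * recorder would also serialize dataspace selections and property lists. */
    FILE *log = fopen("h5_replay.log", "a");
    if (log) {
        fprintf(log, "H5Dwrite dset=%lld mem_type=%lld dxpl=%lld\n",
                (long long)dset, (long long)mem_type, (long long)dxpl);
        fclose(log);
    }

    return real_H5Dwrite(dset, mem_type, mem_space, file_space, dxpl, buf);
}
```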

Recording and Replaying HDF5
Benefits:
- Easy to create an I/O kernel from any application, even one with no source code.
- Can use the replay file as an accurate application I/O benchmark.
- Can move the replay file to another system for comparison.
- Can autotune from the replay, instead of the application.
Challenges:
- Serializing complex parameters to HDF5.
- Replay files can be very large.
- Accurate parallel replay is difficult.

Community Outreach
Work with the HPC community to serve its needs:
- Focus on high-profile applications or "I/O kernels" and remove the HDF5 bottlenecks discovered.
- You tell us!

Thank You!
Questions & Comments?