The HDF Group: HDF5 Overview. Elena Pourmal, The HDF Group. 10/17/15, ICALEPCS 2015.

1 www.hdfgroup.org The HDF Group: HDF5 Overview. Elena Pourmal (epourmal@hdfgroup.org), The HDF Group. 10/17/15, ICALEPCS 2015

2 Outline
- The HDF Group company
- Products and services
- Overview of HDF5
- What is coming in the HDF5 1.10.0 release?
- Future directions

3 THE HDF GROUP COMPANY

4 Champaign, Illinois, USA

5 The HDF Group
- Not-for-profit company (since 2006), ex-NCSA at University of Illinois
- Offices in 5 states
- About 40 employees (more than 50% growth in the past 9 years)
  - Core software developers
  - Domain specialists
  - Documentation team
  - Technical support
- Mission-driven

6 The HDF Group Mission
To ensure long-term accessibility of HDF data through sustainable development and support of HDF technologies.

7 The HDF Group philosophy
- Committed to Open Source
  - HDF software is free
  - BSD type of license
- Community involvement
  - Testing
  - Patches
  - New features (e.g., CMake support)
- Serving diverse user base
  - Remote sensing, HPC, non-destructive testing, medical records, scientific modeling, etc.

8 Revenue by Source
Figure labels: NASA, NOAA

9 Revenue by Project Type

10 PRODUCTS AND SERVICES

11 The HDF Group products
Main product: HDF Technology Suite
- For managing high-volume, complex, heterogeneous data
- Flagship: HDF5 data store
- Flexible and efficient storage and I/O
- Portable
- Highly customizable
- Misc. tools
- Specialized software and tools (e.g., JPSS)

12 HDF5 IN 5 MINUTES
Data challenges addressed by HDF5

13 HDF5 Technology Platform
- HDF5 Abstract Data Model
  - Defines the “building blocks” for data organization and specification
  - Files, Groups, Links, Datasets, Attributes, Datatypes, Dataspaces
- HDF5 Software
  - Tools
  - Language Interfaces (C, Fortran, C++, Java)
  - HDF5 Library
- HDF5 Binary File Format
  - Bit-level organization of HDF5 file
  - Defined by HDF5 File Format Specification
- HDF5 Ecosystem
  - Tools and services (h5py, MATLAB, IDL, OPeNDAP, etc.)
  - Communities (Earth Sciences, medical imaging, modeling and visualization)
  - Community standards (NeXus, HDF-EOS5, h5part, CGNS)
  - Institutional support and endorsement (NASA, NOAA, DOE)

14 Members of the HDF community

15 Success stories
- Petabytes of NASA remote sensing data in HDF4 and HDF5 file formats
- New NASA/JPSS missions chose HDF5 format for data archiving
  - Need to organize complex collections of data
  - Long term data preservation
  - Efficient, scalable storage and access

lat | lon | temp
----|-----|-----
12  | 23  | 3.1
15  | 24  | 4.2
17  | 21  | 3.6

16 Success story: Trillion Particle Simulation
- Physics plasma simulation at NERSC Cray XE6
- Simulation ran on 120,000 cores using
  - 80% of computing resources
  - 90% of available memory
  - 50% of Lustre scratch system
- and writing 10 one-trillion particle dumps of 30-42 TB in HDF5 files; sustained ~27 GB/sec; total 350 TB in HDF5

17 The HDF Group services
- Helpdesk and mailing lists
  - help@hdfgroup.org
  - hdf-forum@hdfgroup.org
  - Open to all users of HDF
- HDF5 Documentation https://www.hdfgroup.org/HDF5/doc/index.html
- HDF Examples (C, Fortran, C++, Java, Python, MATLAB) https://www.hdfgroup.org/HDF5/examples/

18 The HDF Group services
- Standard support
  - Assistance in general areas of HDF usage
- Premium support
  - Access to our consulting and training resources
  - Limited consulting hours are included
- Enterprise support
  - Help with developing common strategies for managing HDF data within organization
  - Organization shares consulting/troubleshooting services
- Training
- Consulting, custom development and support

19 HDF5 1.10.0 RELEASE
New Upcoming Features

20 PERSISTENT FILE FREE SPACE TRACKING
Reusing free file space in a file

21 Unused space in HDF5 file
- HDF5 library currently only tracks free space while file is open
  - Space from deleted objects
  - Space from resized compressed chunks
- Free space in the file is “lost” after file is closed
  - h5repack is used to remove “holes” in the file
- New function H5Pset_file_space
  - Sets a property to track free space in the file that can be reused when file is reopened
  - Allows fine tuning space tracking
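The idea of reusing "holes" can be illustrated with a toy first-fit free-space list. This is only a sketch under stated assumptions: the types and functions below (`file_space_t`, `fs_alloc`, `fs_free`) are hypothetical and are not part of the HDF5 library, and the real free-space manager is considerably more elaborate. Persistent tracking would amount to saving such a list in the file at close and loading it on reopen.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_FREE 16  /* capacity of the toy free list (assumption) */

/* Hypothetical types for illustration; not HDF5 library types. */
typedef struct { uint64_t addr, size; } section_t;

typedef struct {
    uint64_t  eof;                  /* current end of file */
    section_t free_list[MAX_FREE];  /* tracked holes */
    int       nfree;                /* number of tracked holes */
} file_space_t;

/* First-fit allocation: reuse a tracked hole if one is big enough,
 * otherwise grow the file at its end. */
uint64_t fs_alloc(file_space_t *fs, uint64_t size)
{
    assert(size > 0);
    for (int i = 0; i < fs->nfree; i++) {
        if (fs->free_list[i].size >= size) {
            uint64_t addr = fs->free_list[i].addr;
            fs->free_list[i].addr += size;
            fs->free_list[i].size -= size;
            if (fs->free_list[i].size == 0)  /* hole fully consumed */
                fs->free_list[i] = fs->free_list[--fs->nfree];
            return addr;
        }
    }
    uint64_t addr = fs->eof;
    fs->eof += size;
    return addr;
}

/* Record a hole left by a deleted object (no coalescing in this sketch). */
void fs_free(file_space_t *fs, uint64_t addr, uint64_t size)
{
    assert(fs->nfree < MAX_FREE);
    fs->free_list[fs->nfree].addr = addr;
    fs->free_list[fs->nfree].size = size;
    fs->nfree++;
}
```

If the list is discarded when the file is closed, every later allocation extends the file, which is the "lost" space the slide describes.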

22 SCALABLE CHUNK INDEXING
Improving performance and saving space

23 Optimizing chunking storage and performance
- HDF5 has an ability to add more data to existing datasets (data arrays)
  - Special storage mechanism: chunked storage
  - B-trees are used to index chunks in the file
  - O(log n) lookup time
- HDF5 takes advantage of the access pattern and properties of the datasets
  - O(1) lookup time
  - File space savings when storing HDF5 metadata

24 Optimizing chunking storage and performance
- B-tree implementation was reworked to use less space in the file
  - Used for datasets with more than one unlimited dimension
- New indexing structures were introduced to achieve O(1) performance and storage savings in special cases

25 Optimizing chunking storage and performance
Examples of O(1) lookup access:
- Fixed-size chunked dataset with no compression filters
  - Algorithmic lookup
- Fixed-size chunked dataset with compression filters
  - Array to index chunks
- Fixed-size dataset stored in one chunk (i.e., we now allow compression for contiguous dataset)
  - No index
- Dataset with one unlimited dimension
  - Extensible array to index chunks
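The "algorithmic lookup" case can be sketched as plain address arithmetic. When every chunk has the same size on disk (no filters) and the dataset extent is fixed, a chunk's position follows from its coordinates, so no index structure is needed. The sketch below assumes chunks are laid out back to back in row-major order starting at a known base address. The `chunk_layout_t` type and both functions are hypothetical names for illustration, not HDF5 library API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout description; not an HDF5 library type. */
typedef struct {
    size_t   rank;       /* number of dimensions (at most 8 in this sketch) */
    uint64_t dims[8];    /* dataset extent per dimension, in elements */
    uint64_t chunk[8];   /* chunk extent per dimension, in elements */
    uint64_t elem_size;  /* bytes per element */
    uint64_t base_addr;  /* file address of chunk 0 (assumed contiguous chunks) */
} chunk_layout_t;

/* Row-major index of the chunk that contains element coordinates `coord`. */
uint64_t chunk_linear_index(const chunk_layout_t *l, const uint64_t *coord)
{
    uint64_t index = 0;
    assert(l->rank >= 1 && l->rank <= 8);
    for (size_t d = 0; d < l->rank; d++) {
        assert(coord[d] < l->dims[d]);
        /* Chunks along dimension d, rounding up for partial edge chunks. */
        uint64_t nchunks = (l->dims[d] + l->chunk[d] - 1) / l->chunk[d];
        index = index * nchunks + coord[d] / l->chunk[d];
    }
    return index;
}

/* File address of that chunk: constant-time arithmetic, no index lookup. */
uint64_t chunk_address(const chunk_layout_t *l, const uint64_t *coord)
{
    uint64_t chunk_bytes = l->elem_size;
    for (size_t d = 0; d < l->rank; d++)
        chunk_bytes *= l->chunk[d];
    return l->base_addr + chunk_linear_index(l, coord) * chunk_bytes;
}
```

With compression filters the chunk sizes differ, so this arithmetic no longer works and an array of chunk addresses is needed, as the slide notes.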

26 CONCURRENCY: SINGLE-WRITER/MULTIPLE-READER

27 Concurrent Access to Data
New data elements… are added to a dataset in the file… which can be read by a reader… with no IPC necessary.
Diagram labels: Writer, HDF5 File, Reader

28 VIRTUAL DATASET (VDS)
Managing data stored across HDF5 files

29 VDS Use Case with NPP satellite data
4 granules in 9 GMODO-SVM07… files
Visualization with IDV

30 VDS Use Case with NPP satellite data
One virtual dataset with 36 granules stored in one file
Visualization with IDV

31 VDS use case: Percival detector
Diagram: a time series of images (t1, t2, t3, t4, …) is stored in four source files, a.h5, b.h5, c.h5 and d.h5, holding Dataset A, Dataset B, Dataset C and Dataset D (labels t(1+4k) and t(3+4k) mark the frames of individual source datasets).
Virtual Dataset in VDS.h5: VDS has images A, B, C and D interleaved.
Diagram labels: writer, reader
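The slide's labels t(1+4k) and t(3+4k) suggest a round-robin interleave: every fourth frame of the virtual dataset comes from the same source dataset. The index arithmetic can be sketched as below. The assignment of sources in the order A, B, C, D is an assumption read off the diagram, and `vds_location_t` and `vds_locate` are hypothetical names. In the HDF5 library itself such a mapping is declared with `H5Pset_virtual` and strided selections, not computed by the application.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SOURCES 4  /* a.h5, b.h5, c.h5, d.h5 on the slide */

/* Hypothetical result type for illustration; not an HDF5 library type. */
typedef struct {
    int      source;  /* 0 = Dataset A, 1 = B, 2 = C, 3 = D (assumed order) */
    uint64_t frame;   /* 0-based frame index within that source dataset */
} vds_location_t;

/* Map a 1-based virtual time step t (t1, t2, ...) to its source dataset. */
vds_location_t vds_locate(uint64_t t)
{
    vds_location_t loc;
    assert(t >= 1);
    loc.source = (int)((t - 1) % NUM_SOURCES);
    loc.frame  = (t - 1) / NUM_SOURCES;
    return loc;
}
```

Under these assumptions, t1, t5, t9, … land in Dataset A and t3, t7, t11, … in Dataset C, matching the t(1+4k) and t(3+4k) labels.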

32 VDS: Conceptual View

33 METADATA CACHE IMAGE
Performance boost when opening and closing HDF5 files

34 Problem: Metadata Cache Image
- HDF5 metadata is typically small and scattered throughout the file.
  - The resulting many small I/Os are a major problem for parallel file systems.
- The metadata cache minimizes this during normal operation, but the library must still populate the cache on file open and flush it on file close.
  - This is a problem if files are opened and closed often.

35 Solution: Metadata Cache Image
- Store the contents of the metadata cache in a single block at file close, and then populate the cache with the stored entries on file open.
- If the access pattern is similar across close and reopen, this should save a significant number of small I/O operations.
- This solution is implemented in the metadata cache image feature.

36 Metadata Cache Image
To enable, set the cache image FAPL property on file create or open:

    /* Request that a cache image be written when the file is closed */
    H5AC_cache_image_config_t cache_image_config = {H5AC__CURR_CACHE_IMAGE_CONFIG_VERSION, TRUE, 0};
    fapl_id = H5Pcreate(H5P_FILE_ACCESS);
    /* Use the latest file format features */
    H5Pset_libver_bounds(fapl_id, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);
    H5Pset_mdc_image_config(fapl_id, &cache_image_config);

Then create or open the file as usual.

37 Metadata Cache Image
- The metadata cache image is read and deleted automatically on file open.
  - The cache image FAPL property must be set again if a new cache image is desired on file close.
- Earlier versions of HDF5 that don't understand the cache image will refuse to open the file.
  - A lightweight utility can remove the caching info, making the file compatible with 1.8.
- Prototype implementation showed an order of magnitude speedup on parallel systems.

38 DATA AGGREGATION AND PAGE BUFFERING
Performance improvements

39 Page buffering / Data aggregation
Aggregate and align metadata and small data, perform I/O in aligned pages

40 Data and Metadata Aggregators
The new aggregators pack small raw data and metadata allocations into aligned blocks which work with the page buffer.
Diagram labels: HDF5 File, Metadata, Data, Small allocations

41 HDF5 Page Buffering
- Page buffer contains MD pages (L2 cache)
- Metadata blocks are multiples of 64K
- Metadata blocks are aligned
Diagram label: HDF5 File
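The alignment arithmetic behind paged I/O can be sketched in a few lines. With blocks aligned to 64K pages, a small request is served from whole pages, and a request that fits inside one page costs a single page read. The macro and function names below are hypothetical and only illustrate the arithmetic. They are not the library's page buffer implementation.

```c
#include <assert.h>
#include <stdint.h>

/* 64K pages, matching the block size on the slide. */
#define SKETCH_PAGE_SIZE UINT64_C(65536)

/* Round an address up to the next page boundary. */
uint64_t page_align_up(uint64_t addr)
{
    return (addr + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE * SKETCH_PAGE_SIZE;
}

/* Number of whole pages covering a request of `len` bytes at `addr`. */
uint64_t pages_touched(uint64_t addr, uint64_t len)
{
    assert(len > 0);
    uint64_t first = addr / SKETCH_PAGE_SIZE;
    uint64_t last  = (addr + len - 1) / SKETCH_PAGE_SIZE;
    return last - first + 1;
}
```

An unaligned 1000-byte block that starts at offset 65000 straddles two pages, while the same block placed at an aligned address touches one. Avoiding that extra I/O is what aligned aggregation is for.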

42 IMPROVEMENTS FOR PARALLEL ACCESS
HDF5 Parallel

43 Problems We Solved for PHDF5
- Slowness on opening and closing HDF5 files
  - Metadata Cache Optimizations
    - Avoiding the Metadata Read Storm
    - Collective Metadata Writes
  - Avoid Truncate Feature
- Writing/reading multiple variables
  - Collective I/O on multiple datasets or Multi-Dataset I/O
- I/O on selections bigger than 2GB with MPICH 3.1.4
- Page Buffering
  - Page Buffering: a layer under the VFD to capture small I/Os and cache them for larger paged size I/Os.

44 Metadata reads with CGNS and netCDF-4
- CGNS reads on Blue Gene, GPFS
- netCDF-4 reads on Cray XE6, GPFS

45 Collective I/O on multiple datasets
- Two new routines: H5Dread_multi() and H5Dwrite_multi()
- The plot shows the performance difference between using a single H5Dwrite() multiple times and using H5Dwrite_multi() once on 30 chunked datasets on Cray XE6 with Lustre file system (hopper).

46 BACKWARD/FORWARD COMPATIBILITY ISSUES
HDF5 1.10.0

47 Backward/Forward compatibility issues
- HDF5 1.10.0 will always read files created by the earlier versions
- HDF5 1.10.0 by default will create files that can be read by HDF5 1.8.*
- HDF5 1.10.0 will create files incompatible with 1.8 version if new features are used
- Tools to “downgrade” the file created by HDF5 1.10.0
  - h5format_convert (SWMR files; doesn’t rewrite raw data)
  - h5repack (VDS, SWMR and other; does rewrite data)

48 EXPLORING NEW DIRECTIONS
Examples

49 HDF5 ODBC Driver
- Open DataBase Connectivity (ODBC)
  - Industry standard middleware API for accessing database management systems
  - All analytics apps have an ODBC client
- HiFive: ODBC driver for HDF5
  - Windows, [Linux, MacOS X]
  - Client & Client/Server
  - Accessing HDF5 files from Excel & R
Thanks to Gerd Heber, THG

50 HDF5 for the Web
- Can I access HDF5 files remotely? API?
- My (mobile) client speaks HTTP!
- What is a file system? Who uses files anymore?
- Cloud computing w/ HDF5
Thanks to John Readey, THG

51 Emerging Trends in I/O
Increased computational power…
- Huge expansion of simulation data volume & metadata complexity
- Complex to manage and analyze
…achieved through parallelism
- 100,000s of nodes with 10s of millions of cores
- More frequent hardware & software failures
…tiered storage architectures
- High performance fabric & solid state storage on-cluster
- Low performance, high capacity disk-based storage off-cluster
…object-based storage
The HDF Group has been working with Intel and others on the Fast Forward Project to investigate and contribute to those trends

52 HDF5 role in the Fast Forward Storage Stack
- Object storage
  - Virtual Object Layer (VOL)
- Data Integrity/Fault Tolerance
  - Transaction
  - End-to-end checksums
- Data Analysis Extensions
  - Query/View/Index APIs
  - Analysis Shipping

53 HDF5 as an interface to non-HDF5 storage
https://wiki.hpdd.intel.com/display/PUB/Fast+Forward+Storage+and+IO+Program+Documents

54 HDF5 as an interface to non-HDF5 storage
Different File Formats plugins:

55 DATA INDEXING
Features we are investigating

56 Indexing and HDF5
New APIs for indexing and querying of both structure and contents of HDF5 file
- H5Q API defines query to apply to a file
  - Create/combine queries (OR, AND)
  - Basic operators supported (≤, ≥, =, ≠) on either dataset/attribute values, link/attribute names
- HDF5V API retrieves data
- HDF5X API adds third-party indexing plugins
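Combining basic comparisons with AND and OR amounts to evaluating a small predicate tree against each value. The sketch below shows that idea for dataset values only. It is an assumption-laden illustration: the `query_t` type and the `query_match` and `query_count` functions are hypothetical and do not reproduce the H5Q API, which was still under investigation at the time of the talk. It also scans every element, whereas an index plugin would aim to avoid the full scan.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical query node kinds: four comparisons plus two combinators. */
typedef enum { Q_LE, Q_GE, Q_EQ, Q_NE, Q_AND, Q_OR } query_kind_t;

typedef struct query {
    query_kind_t        kind;
    double              value;         /* operand for comparison nodes */
    const struct query *left, *right;  /* children for AND / OR nodes */
} query_t;

/* Evaluate the query tree against a single value. */
bool query_match(const query_t *q, double x)
{
    assert(q != NULL);
    switch (q->kind) {
    case Q_LE:  return x <= q->value;
    case Q_GE:  return x >= q->value;
    case Q_EQ:  return x == q->value;
    case Q_NE:  return x != q->value;
    case Q_AND: return query_match(q->left, x) && query_match(q->right, x);
    case Q_OR:  return query_match(q->left, x) || query_match(q->right, x);
    }
    return false;
}

/* Count the elements of `data` selected by the query (full scan, no index). */
size_t query_count(const query_t *q, const double *data, size_t n)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        if (query_match(q, data[i]))
            hits++;
    return hits;
}
```

For example, a query built as (value ≥ 3.5) AND (value ≤ 4.0) selects exactly one of the three values 3.1, 4.2 and 3.6.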

57 Example: Combined query

58 The HDF Group
Thank You! Questions?

