DM_PPT_NP_v01 SESIP_0715_JP Indexing HDF5: A Survey Joel Plutchak The HDF Group Champaign Illinois USA This work was supported by NASA/GSFC under Raytheon.

Presentation transcript:

Indexing HDF5: A Survey
Joel Plutchak
The HDF Group
Champaign, Illinois, USA

This work was supported by NASA/GSFC under Raytheon Co. contract number NNG10HP02C.

The Technology
July 14, 2015

The HDF5 hierarchical data file format and API are flexible: they support self-describing, portable, and compact storage, as well as efficient I/O.

It is a well-described and well-supported format that is used in a wide variety of disciplines.

The Problem

The HDF5 API does not include mechanisms to efficiently find and access data based on data values, in the way one would query a relational database.

Members of the HDF community have developed this capability so that their applications can quickly access targeted pieces of data: rapidly searching for and selecting interesting portions of data based on ad hoc search criteria.

A Solution

Solutions to this problem are called indexing. Indexing adds a layer between the HDF5 API and the application. That layer builds an index on one or more parameters and saves enough information in the index to find and retrieve specific parts of one or more datasets in an HDF5 file more efficiently.

[Diagram labels: HDF5 File, Application, HDF5 API, Index, Query]
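
As a rough illustration of the idea (not taken from any of the packages surveyed here), the sketch below builds a sorted index over the values of an in-memory list that stands in for an HDF5 dataset, then answers a range query by binary search instead of scanning every element. The function names and the sample values are assumptions made for this example.

```python
import bisect

def build_index(values):
    """Return (sorted_values, positions): the values in ascending order,
    each paired with its position in the original array."""
    order = sorted(range(len(values)), key=values.__getitem__)
    return [values[i] for i in order], order

def query_range(index, lo, hi):
    """Positions of all values v with lo < v < hi, found by binary search."""
    sorted_values, positions = index
    start = bisect.bisect_right(sorted_values, lo)  # first value > lo
    stop = bisect.bisect_left(sorted_values, hi)    # first value >= hi
    return sorted(positions[start:stop])

# Stand-in for a dataset read from an HDF5 file (values are made up).
data = [41.0, 12.5, 39.1, 40.2, 57.3, 42.6, 39.5]
index = build_index(data)
selected = query_range(index, 39.1, 42.6)
print(selected)
```

The index costs extra storage and must be rebuilt when the data changes, which is the trade-off every package below has to manage in its own way.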

Implementations

Implementations exist for adding indexed access to HDF5 files. A few of them are:
- PyTables
- FastQuery / FastBit
- Alacrity
- HDF5 (prototype)
- Other experimental work in progress

PyTables

- Uses the Python programming language
- Built on top of the HDF5 library and the NumPy package
- Uses Optimized Partially Sorted Index (OPSI) technology, designed for fast access to very large (>100M rows) tables

PyTables Example

- Create a table:

    table = h5file.create_table(group, 'readout', Particle, "Readout example")

- Query a table:

    condition = '(name == "Particle: 5") | (name == "Particle: 7")'
    for record in table.where(condition):
        pass  # do something with "record"

PyTables

Limitations:
- No support for relationships between datasets

Future work:
- No specifics; a continuing effort that welcomes additional developers, testers, and users
- Future maintenance and extended development proposals underway
- The HDF Group is very interested in taking a significant role in this work as it moves forward.

Alacrity

- Analytics-Driven Lossless Data Compression for Rapid In-Situ Indexing, Storing, and Querying
- Exploits the representation of floating-point values by binning on significant bits, using an inverted index to map each bin to the positions of the values that fall in it
- The software is a research vehicle for a group at North Carolina State University
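
The binning idea can be sketched in a few lines of Python. This is only an illustration of binning on the high-order bits of an IEEE 754 double with an inverted index per bin; it is not Alacrity's implementation and leaves out its compression entirely. The choice of 16 bits, the function names, and the restriction to positive finite values (for which the bit pattern grows with the value) are assumptions of this sketch.

```python
import struct
from collections import defaultdict

def bin_key(x, bits=16):
    """Top `bits` bits (sign, exponent, leading mantissa bits) of a double."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    return raw >> (64 - bits)

def build_index(values, bits=16):
    """Inverted index: bin key -> list of positions whose value has that key."""
    index = defaultdict(list)
    for pos, v in enumerate(values):
        index[bin_key(v, bits)].append(pos)
    return index

def range_query(values, index, lo, hi, bits=16):
    """Positions i with lo <= values[i] <= hi (positive finite values only)."""
    k_lo, k_hi = bin_key(lo, bits), bin_key(hi, bits)
    hits = []
    for key, positions in index.items():
        if k_lo < key < k_hi:
            # Bin lies entirely inside the range: accept without reading values.
            hits.extend(positions)
        elif key in (k_lo, k_hi):
            # Boundary bin: check the actual values.
            hits.extend(p for p in positions if lo <= values[p] <= hi)
    return sorted(hits)

values = [0.5, 1.5, 3.0, 40.0, 41.5, 100.0, 39.1, 42.7]  # made-up data
index = build_index(values)
result = range_query(values, index, 39.1, 42.6)
print(result)
```

Only the two boundary bins require looking at the data itself; every bin strictly between them is answered from the index alone.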

FastQuery / FastBit

- FastQuery is an extension to HDF5 from the Visualization Group at Lawrence Berkeley National Laboratory (LBNL)
- Based on LBNL’s FastBit, an efficient searching technology that uses bitmap indexing for processing complex, multi-dimensional ad hoc queries on read-only numeric data
- Extends HDF5’s hyperslab selection mechanism to allow arbitrary range conditions on the data values contained in the datasets
- Compound queries can span multiple datasets
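
A toy version of a binned bitmap index shows why compound queries across datasets are cheap: each condition becomes a bitmap, and combining conditions is a bitwise AND. This sketch does not use FastBit or FastQuery. The variable names, bin edges, and data are invented, plain Python integers stand in for bitmaps, and the query bounds are assumed to line up with bin edges so that no per-value check is needed.

```python
import bisect

def build_bitmap_index(values, edges):
    """One bitmap per bin [edges[b], edges[b+1]); bit `row` is set in the
    bitmap of the bin that values[row] falls into.
    Assumes edges[0] <= v < edges[-1] for every value."""
    bitmaps = [0] * (len(edges) - 1)
    for row, v in enumerate(values):
        b = bisect.bisect_right(edges, v) - 1
        bitmaps[b] |= 1 << row
    return bitmaps

def select_bins(bitmaps, first, last):
    """OR together the bitmaps of bins first..last (inclusive)."""
    result = 0
    for bitmap in bitmaps[first:last + 1]:
        result |= bitmap
    return result

def rows(bitmap):
    """Row numbers whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Two made-up variables defined over the same rows.
temperature = [15.0, 25.0, 35.0, 45.0, 22.0, 38.0]
pressure = [150.0, 90.0, 120.0, 180.0, 250.0, 199.0]
t_index = build_bitmap_index(temperature, [0, 10, 20, 30, 40, 50])
p_index = build_bitmap_index(pressure, [0, 100, 200, 300])

# 20 <= temperature < 40 covers temperature bins 2 and 3;
# 100 <= pressure < 200 is pressure bin 1.
hits = select_bins(t_index, 2, 3) & select_bins(p_index, 1, 1)
matching_rows = rows(hits)
print(matching_rows)
```

Because the bitmaps are built once and only read afterwards, this style of index suits the read-only numeric data mentioned above.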

FastQuery / FastBit

Assumptions:
- Data is:
    - 0-3 dimensional, block-structured
    - Limited datatypes: float, double, int32, int64, byte
- Two-level hierarchical organization: TimeStep, VariableName

Future work:
- Arbitrary nesting
- More data schemas (unstructured, AMR, etc.)

HDF5 Data Analysis Extensions

The HDF Group is developing support for indexing and querying so that application developers can create complex, high-performance queries on both metadata and data elements within an HDF5 container. This support takes the form of objects and associated APIs:
- Query Objects: the H5Q API is used to define a query and apply it to an HDF5 container
- View Objects: the H5V API is used to generate a selection from a query
- Index Objects: the H5X API is used to build an index and attach it to data; it is plug-in based to leverage multiple technologies

Note: These extensions were developed under Intel’s subcontract with Lawrence Livermore National Security, LLC under U.S. Department of Energy contract DE-AC52-07NA27344.

HDF5 Data Analysis Extensions Example

Add index to existing dataset:

    dataset = H5Dopen(file, dataset_name, H5P_DEFAULT);
    /* Add indexing information */
    H5Xcreate(dataset, H5X_PLUGIN_FASTBIT, H5P_DEFAULT);
    H5Dclose(dataset);

Create and apply query:

    float query_lb = 39.1f, query_ub = 42.6f;
    hid_t query, query1, query2;

    /* Create a simple query: 39.1 < x */
    query1 = H5Qcreate(H5Q_TYPE_DATA_ELEM, H5Q_MATCH_GREATER_THAN,
                       H5T_NATIVE_FLOAT, &query_lb);
    /* Create a second simple query: x < 42.6 */
    query2 = H5Qcreate(H5Q_TYPE_DATA_ELEM, H5Q_MATCH_LESS_THAN,
                       H5T_NATIVE_FLOAT, &query_ub);
    /* Combine queries: 39.1 < x < 42.6 */
    query = H5Qcombine(query1, H5Q_COMBINE_AND, query2);

    /* Use query to get selection */
    dataset = H5Dopen(file, dataset_name, H5P_DEFAULT);
    H5Dquery(dataset, query, &dataspace);
    /* Read data here using dataspace */
    H5Dclose(dataset);
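
The create / combine / apply pattern in the C example can be mimicked in plain Python to make its shape easier to see. Everything below is a hypothetical analogy: the `Query` class and the `create`, `combine_and`, and `apply` functions are invented for this sketch and are not the H5Q API, and the query is evaluated by a full scan rather than through an index.

```python
import operator
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Query:
    """A predicate over a single data element (hypothetical, not H5Q)."""
    test: Callable[[float], bool]

def create(match_op, bound):
    """Simple query: keep x when match_op(x, bound) is true."""
    return Query(lambda x: match_op(x, bound))

def combine_and(q1, q2):
    """Compound query: keep x only when both sub-queries accept it."""
    return Query(lambda x: q1.test(x) and q2.test(x))

def apply(query, data):
    """Return the selection: indices of the elements that satisfy the query."""
    return [i for i, x in enumerate(data) if query.test(x)]

query1 = create(operator.gt, 39.1)    # 39.1 < x
query2 = create(operator.lt, 42.6)    # x < 42.6
query = combine_and(query1, query2)   # 39.1 < x < 42.6

data = [38.0, 39.1, 40.5, 42.0, 42.6, 50.0]  # made-up dataset values
selection = apply(query, data)
print(selection)
```

Keeping the query separate from the data it is applied to is what lets the same query object be reused across datasets and lets an index plug-in decide how to evaluate it.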

HDF5 Data Analysis Extensions Status

Phase I status (2014):
- Prototype implementations for H5Q, H5V, H5X APIs
- H5X API plugins for Alacrity and FastBit technologies
- Incremental update of data is not supported by indexing packages

Current work (started July 1):
- Views generated from queries to abstract selection results on multiple objects
- Support for indexing on chunked datasets
- Support for compound types
- Support for parallel indexing
- Query optimization
- Additional indexing plugins

Summary

- A variety of indexing methods exist that can be used to speed up targeted access to data in HDF5 files.
- Capabilities and underlying technologies differ, so choose the best fit for your application.
- Work is ongoing. Let developers know of your needs and experiences!

References & Sources

PyTables

Alacrity
- J. Jenkins, I. Arkatkar, S. Lakshminarasimhan, D. A. Boyuka II, E. Schendel, N. Shah, S. Ethier, C.-S. Chang, J. Chen, H. Kolla, R. Ross, S. Klasky, N. Samatova, “ALACRITY: Analytics-Driven Lossless Data Compression for Rapid In-Situ Indexing, Storing, and Querying,” Transactions on Large-Scale Data- and Knowledge-Centered Systems, Vol. 10 (2013).

FastQuery / FastBit
- K. Wu, “FastBit: an efficient indexing technology for accelerating data-intensive science,” Journal of Physics: Conference Series, vol. 16, no. 1 (2005).
- “HDF5-FastQuery: An API for Simplifying Access to Data Storage, Retrieval, Indexing and Querying,” Report Number LBNL/PUB-958 (2006).

HDF Data Analysis Extensions
- J. Soumagne, Q. Koziol, “RFC: Data Analysis Extensions,” RFC THG v4, The HDF Group (2014).

