A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets
Vignesh Santhanagopalan, Graduate Student, Department of CSE


Outline
- Motivation
- Challenges Involved
- Contributions
- Background
- Overview of the System Design
- Metadata Extraction and Handling
- Pre-Processing and Post-Processing Modules
- Parallelization of our System
- Experiments
- Related Work
- Conclusion

Motivation
- Scientific Data Management
  - Extremely large datasets
- Data Driven Applications
  - Scientific simulations
  - High precision data collection instruments
  - Sensors attached to a satellite

Challenges Involved
- Data exists in a variety of low-level formats
  - Hard for the user to extract a subset of the data
  - Significant effort is needed to understand the layout of the data
- More efficient access to scientific datasets is needed
  - Parallel Computing

Contributions
- Provides a virtual relational table view over HDF5 datasets
- Allows users to specify queries as SQL statements
- Supports queries based on the dimensions of the dataset
- Supports queries based on both the dimensions and the attributes of the dataset

Background: HDF5
- Hierarchical Data Format (HDF) is a set of file formats and libraries designed to store and organize large amounts of numerical data
- Stores the data in a tree-like structure
- Organizes the structure into groups, datasets, and attributes

Structure of HDF5 file
[Figure: structure of an HDF5 file]

Parallel HDF5
- Allows users to exploit parallelism to improve I/O performance
- Provides a standard parallel I/O interface that works with MPI programming
- Opens a file in parallel using an MPI communicator
- Collective parallel access to a file is coordinated by all processes

Our System
- Supports SQL-like data subsetting with a virtualized view of HDF5 datasets
  - Metadata Extraction and Handling
  - Pre-processing and Post-processing Modules
- Parallel I/O optimizations with Data Virtualization
  - MPI
  - Query Partition

Query Structure
- Supports an SQL-like abstraction with a virtualized view of HDF5 datasets
  - SELECT
  - FROM
  - WHERE
- Pre-Processing and Post-Processing Queries
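
To make the SELECT / FROM / WHERE structure concrete, here is a minimal sketch of how such a query could be split into its clauses. The slides do not give the grammar, so the query syntax, the file name `omi.he5`, the dimension name `dim1`, and the function `parse_query` are all assumptions for illustration, not the system's actual parser.

```python
import re

# Hypothetical query shape: the slides name only the SELECT, FROM and WHERE clauses.
QUERY_RE = re.compile(
    r"^\s*SELECT\s+(?P<select>.+?)\s+FROM\s+(?P<source>[^\s;]+)"
    r"(?:\s+WHERE\s+(?P<where>.+?))?\s*;?\s*$",
    re.IGNORECASE | re.DOTALL,
)
# One condition: <name> <operator> <number>. Assumed form, not from the slides.
CONDITION_RE = re.compile(r"^\s*(\w+)\s*(<=|>=|!=|=|<|>)\s*(-?\d+(?:\.\d+)?)\s*$")


def parse_query(sql):
    """Split an SQL-like subsetting query into select list, source and conditions."""
    match = QUERY_RE.match(sql)
    if match is None:
        raise ValueError(f"not a SELECT ... FROM ... [WHERE ...] query: {sql!r}")

    conditions = []
    where = match.group("where")
    if where:
        # Only AND-connected conditions are handled in this sketch.
        for clause in re.split(r"\s+AND\s+", where, flags=re.IGNORECASE):
            cond = CONDITION_RE.match(clause)
            if cond is None:
                raise ValueError(f"unsupported condition: {clause!r}")
            name, op, value = cond.groups()
            conditions.append((name, op, float(value)))

    return {
        "select": [col.strip() for col in match.group("select").split(",")],
        "source": match.group("source"),
        "where": conditions,
    }


query = parse_query(
    "SELECT SolarZenithAngle FROM omi.he5 "
    "WHERE dim1 >= 100 AND dim1 < 200 AND SolarZenithAngle > 45.5"
)
```

In this sketch, conditions on a dimension name (`dim1`) and conditions on a data variable (`SolarZenithAngle`) come back in the same list; telling them apart would need the dataset's metadata.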

System Design
[Figure: system architecture]
- Input: SQL query
- Master Process: SQL parser, Metadata descriptor, Pre-Processing Module, Query Partition, Post-Processing Module
- Slave Processes: Data Access Code, Parallel HDF5
- HDF5 Dataset

Main Steps of Our System (1/2)
- Input: SQL query
- Output: the subset of data requested by the user
- Process:
  - For every HDF5 dataset, a metadata descriptor is generated
  - The SQL parser parses the SQL query to retrieve the grammar information
  - Variables and dimensions are retrieved from the WHERE expression of the SQL query

Main Steps of Our System (2/2)
- A query request is generated by evaluating the parse tree and the metadata information
- The data size is computed from the generated query request
- The Query-Partitioning module divides the query request into several sub-requests
- Each node obtains its data results based on its sub-request
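
The data-size step can be illustrated with a short worked calculation. This sketch assumes a query request is described by the number of elements it reads along each dimension plus the element size in bytes. The function name `request_size_bytes` is hypothetical, and the 720 x 1440 dataset of 32-bit floats is the SolarZenithAngle example used elsewhere in this deck.

```python
from math import prod


def request_size_bytes(count, element_size):
    """Bytes needed for a request that reads count[i] elements along dimension i."""
    if any(c < 0 for c in count):
        raise ValueError("element counts must be non-negative")
    return prod(count) * element_size


# Whole 720 x 1440 dataset of 4-byte floats: 720 * 1440 * 4 = 4,147,200 bytes.
full_dataset = request_size_bytes((720, 1440), 4)

# A request restricted to 100 rows: 100 * 1440 * 4 = 576,000 bytes.
row_subset = request_size_bytes((100, 1440), 4)
```

Knowing the size before any data is read is what lets a partitioning step decide how much work each node should receive.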

HDF5 File Organization
- Organizes data as a collection of objects such as groups, datasets, and attributes
- Groups provide logical structuring to data
- Datasets contain a multi-dimensional array of data elements
  - Dataspace
  - Datatype
- Attributes

Metadata Extraction and Handling
- For every HDF5 dataset, a metadata descriptor is generated
- Metadata information for each dataset:
  - Information to interpret the data: Datatype
  - Information to describe the logical layout of the data: Dataspace
  - Information about the attributes attached to the dataset

Metadata Extraction Example
- Datatype: Integer
- Dataspace: number of dimensions and size of each dimension
  - Number of dimensions: 3
  - Size of dimension 1: 100
  - Size of dimension 2: 200
  - Size of dimension 3: 300
- Attributes
  - Temperature
  - Velocity
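
One way to picture the descriptor in this example is as a small record type. The class `MetadataDescriptor`, its field names, and the dataset name `"example"` are assumptions made for this sketch. The slides list what the descriptor holds but do not show how it is stored.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class MetadataDescriptor:
    """Hypothetical in-memory form of the per-dataset metadata described above."""
    name: str                # dataset name (placeholder; the slide gives none)
    datatype: str            # how to interpret each element
    dims: tuple              # dataspace: size of each dimension
    attributes: tuple = ()   # names of attributes attached to the dataset

    @property
    def rank(self):
        """Number of dimensions in the dataspace."""
        return len(self.dims)

    @property
    def num_elements(self):
        """Total number of elements in the dataset."""
        return prod(self.dims)


example = MetadataDescriptor(
    name="example",
    datatype="Integer",
    dims=(100, 200, 300),
    attributes=("Temperature", "Velocity"),
)
# example.rank is 3 and example.num_elements is 100 * 200 * 300 = 6,000,000.
```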

Metadata Extraction and Handling
- For each group, information about the datasets it contains must be extracted
- Can be imagined as a table
  - Row: group
  - Columns: all the datasets it contains
- Mapping between the dataset variables and the group
- Information about the attributes stored for each dataset

Example HDF5 File
GROUP "/" {
   GROUP "HDFEOS" {
      GROUP "GRIDS" {
         GROUP "ColumnAmount03" {
            GROUP "Data Fields" {
               DATASET "SolarZenithAngle" {
                  DATATYPE H5T_IEEE_F32LE
                  DATASPACE SIMPLE { ( 720, 1440 ) / ( 720, 1440 ) }
                  DATA { }
                  ATTRIBUTE "_FillValue" {
                     DATATYPE H5T_IEEE_F32LE
                     DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
                     DATA { }
                  }

Path Information
- For dataset SolarZenithAngle the path is /HDFEOS/GRIDS/ColumnAmount03/Data Fields/SolarZenithAngle
- For attribute _FillValue the path is /HDFEOS/GRIDS/ColumnAmount03/Data Fields/SolarZenithAngle/_FillValue
- Dataspace and datatype for SolarZenithAngle
  - Datatype: Float
  - Number of dimensions: 2
  - Dimension size: 720 x 1440
- Information about the attribute _FillValue
  - Datatype: Float
  - Array size: 1
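
The path information above can be derived by walking the group hierarchy from the root. The sketch below assumes the file structure has already been read into nested dictionaries. The `FILE_TREE` layout and the function `object_paths` are hypothetical, and no HDF5 library is called. Group names are taken from the example dump, including the space in "Data Fields".

```python
def object_paths(tree, prefix=""):
    """Yield (path, node) for every dataset and attribute below a group tree."""
    for name, node in tree.items():
        path = f"{prefix}/{name}"
        if node["kind"] == "group":
            # Descend into the group, extending the path with its name.
            yield from object_paths(node["members"], path)
        else:
            # A dataset: report it, then each attribute attached to it.
            yield path, node
            for attr_name, attr in node.get("attributes", {}).items():
                yield f"{path}/{attr_name}", attr


def group(members):
    """Helper that builds a group node (hypothetical representation)."""
    return {"kind": "group", "members": members}


FILE_TREE = {
    "HDFEOS": group({
        "GRIDS": group({
            "ColumnAmount03": group({
                "Data Fields": group({
                    "SolarZenithAngle": {
                        "kind": "dataset",
                        "datatype": "H5T_IEEE_F32LE",
                        "dims": (720, 1440),
                        "attributes": {
                            "_FillValue": {
                                "kind": "attribute",
                                "datatype": "H5T_IEEE_F32LE",
                                "dims": (1,),
                            },
                        },
                    },
                }),
            }),
        }),
    }),
}

paths = dict(object_paths(FILE_TREE))
for path, node in paths.items():
    print(path, node["datatype"], node["dims"])
```

Keeping the full path next to the datatype and dimensions gives a lookup table from a variable name in a query to the object that must be opened in the file.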

Pre-Processing and Post-Processing Modules
- Two different types of queries
  - Queries based on dimensions
  - Queries based on attributes as well
- The first type of query is supported by the HDF5 API, but requires
  - Complete understanding of the layout of the data
  - Separate programs to retrieve each subset of data
- The second type of query
  - Has no direct support
  - Requires detailed knowledge of the datasets and the HDF5 API, and complex programming

Pre-Processing and Post-Processing Modules
- Pre-Processing Module:
  - Inputs: SQL grammar, Metadata
  - Filtering is done based on the dimensions of the dataset
- Post-Processing Module:
  - Handles queries based on attributes
  - Data is filtered manually to retrieve the necessary subset
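
The split between the two modules can be sketched on an in-memory 2-D array. This is an illustration under stated assumptions, not the system's code: the nested list stands in for an HDF5 dataset, `read_block` stands in for a dimension-based read through the HDF5 API, and an attribute-based condition is assumed to be a predicate on the values that were read. All function names are hypothetical.

```python
def preprocess(dims, bounds):
    """Turn per-dimension [low, high) index bounds into a (start, count) request.

    dims   -- size of each dimension, taken from the metadata
    bounds -- {axis: (low, high)}; unconstrained axes are read in full
    """
    start, count = [], []
    for axis, size in enumerate(dims):
        low, high = bounds.get(axis, (0, size))
        low, high = max(low, 0), min(high, size)   # clamp to the dataspace
        start.append(low)
        count.append(max(high - low, 0))
    return tuple(start), tuple(count)


def read_block(data, start, count):
    """Stand-in for a dimension-based read of a 2-D dataset."""
    rows = data[start[0]:start[0] + count[0]]
    return [row[start[1]:start[1] + count[1]] for row in rows]


def postprocess(block, start, predicate):
    """Keep (row, column, value) triples whose value satisfies the predicate."""
    return [
        (start[0] + i, start[1] + j, value)
        for i, row in enumerate(block)
        for j, value in enumerate(row)
        if predicate(value)
    ]


# 3 x 4 toy dataset: [[0, 1, 2, 3], [10, 11, 12, 13], [20, 21, 22, 23]]
data = [[r * 10 + c for c in range(4)] for r in range(3)]

start, count = preprocess((3, 4), {0: (1, 3), 1: (0, 2)})  # rows 1-2, columns 0-1
block = read_block(data, start, count)
result = postprocess(block, start, lambda value: value > 10)
```

The dimension bounds shrink what is read before any I/O happens, while the value condition can only be checked after the data is in memory, which is why the two filters sit in separate modules.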

Parallelization
- Parallel HDF5 builds HDF5 on top of an MPI-I/O layer
- API support for parallel access through message passing
- Collective I/O calls for shared access to a file

Parallelization
- Master-Slave approach with Parallel HDF5 processing
- Master Process:
  - Parses the SQL query given by the user
  - Generates the data subsetting request
  - Partitions the request into several sub-requests
  - Also performs post-processing
- Slave Processes:
  - Receive sub-requests from the master process
  - Each queries a data chunk by accessing the HDF5 file in parallel and obtains the data results
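
The partitioning performed by the master process could look like the following sketch. The slides do not say how a request is divided, so splitting into contiguous blocks along one dimension is an assumption, as is the function name `partition_request`. No MPI calls are made here; in the described system each sub-request would be sent to a slave process.

```python
def partition_request(start, count, num_workers, axis=0):
    """Split one (start, count) request into contiguous sub-requests along `axis`.

    Returns at most `num_workers` sub-requests; block lengths differ by at most
    one element, and empty blocks are dropped.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")

    base, extra = divmod(count[axis], num_workers)
    sub_requests = []
    offset = start[axis]
    for worker in range(num_workers):
        # The first `extra` workers take one additional element each.
        length = base + (1 if worker < extra else 0)
        if length == 0:
            continue
        sub_start, sub_count = list(start), list(count)
        sub_start[axis] = offset
        sub_count[axis] = length
        sub_requests.append((tuple(sub_start), tuple(sub_count)))
        offset += length
    return sub_requests


# The full 720 x 1440 dataset divided among 4 processes: 180 rows each.
for sub_start, sub_count in partition_request((0, 0), (720, 1440), 4):
    print(sub_start, sub_count)
```

Because the blocks are disjoint and cover the original request exactly, the results returned by the slave processes can be concatenated in order before post-processing.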

Experiments
- Experimental Goals:
  - To evaluate our system with different types of queries on Parallel HDF5
  - To show the performance improvement of the parallel version over sequential subsetting
  - To show our system's capability on larger datasets
  - To show the parallel scalability of our system

Experimental Setup
- Dataset used: Ozone Monitoring Instrument data from the NASA website
  - Size available for download: 6.5 MB
  - Extended to 500 MB, 1 GB, 2 GB and 4 GB
- Execution environment: IBM Opteron cluster
  - Each compute node has dual-core 2.3 GHz Opterons and 8 GB memory

Performance Comparison of sequential and parallel version (4 processors)
[Figure panels: Dataset Size: 500 MB; Dataset Size: 1 GB]

Performance Comparison of sequential and parallel version (4 processors)
[Figure panels: Dataset Size: 2 GB; Dataset Size: 4 GB]

Parallel Scalability of our System
[Figure panels: Dataset Size: 500 MB; Dataset Size: 1 GB]

Parallel Scalability of our System
[Figure panels: Dataset Size: 2 GB; Dataset Size: 4 GB]

Related Work
- Li Weng et al. proposed the automatic data virtualization approach seven years earlier
- SciDB provides a scientific database in which arrays are the natural way of storing data
- Beomseok Nam et al. provide an indexing scheme for efficient retrieval of subsets of data, with no notion of data virtualization or use of parallel computing
- A lot of work exists on extending relational database technology to support scientific data

Conclusion
- Provides a data management approach for scientific datasets stored in HDF5
- Supports SQL queries over a virtual view of the data
- Parallelizes queries based on dimensions and also on attributes
- Significant performance improvement over sequential subsetting
- System scales well with varying numbers of nodes and different data sizes

Thank You!
