A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets
Vignesh Santhanagopalan, Graduate Student, Department of CSE

Outline
- Motivation
- Challenges Involved
- Contributions
- Background
- Overview of the System Design
- Metadata Extraction and Handling
- Pre-Processing and Post-Processing Modules
- Parallelization of our System
- Experiments
- Related Work
- Conclusion

Motivation
- Scientific Data Management
  - Extremely large datasets
- Data-Driven Applications
  - Scientific simulations
  - High-precision data collection instruments
  - Sensors attached to a satellite

Challenges Involved
- Data exists in a variety of low-level formats
  - Hard for the user to extract a subset of the data
  - Significant effort is needed to understand the layout of the data
- More efficient access to scientific datasets is needed
  - Parallel computing

Contributions
- Providing a virtual relational table view over HDF5 datasets
- Allowing users to specify queries using SQL statements
- Supporting queries based on the dimensions of the dataset
- Supporting queries based on both the dimensions and the attributes of the dataset

Background: HDF5
- Hierarchical Data Format (HDF) is the name of a set of file formats and libraries designed to store and organize large amounts of numerical data
- Stores data in a tree-like structure
- Organizes the structure into groups, datasets, and attributes

Structure of an HDF5 file (figure)

Parallel HDF5
- Allows users to exploit parallelism to improve I/O performance
- Provides a standard parallel I/O interface for use with MPI programming
- Opens a file in parallel using an MPI communicator
- Collective parallel access to a file is coordinated by all processes

Our System
- Supports SQL-like data subsetting with a virtualized view of HDF5 datasets
  - Metadata Extraction and Handling
  - Pre-Processing and Post-Processing Modules
- Parallel I/O optimizations with Data Virtualization
  - MPI
  - Query Partition

Query Structure
- Supports an SQL-like abstraction with a virtualized view of HDF5 datasets
  - SELECT
  - FROM
  - WHERE
- Pre-Processing and Post-Processing Queries
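
To make the SELECT / FROM / WHERE structure concrete, here is a minimal sketch of splitting such a query into its three clauses. The slides do not show the system's actual grammar or parser, so the query text, the table name `OMI`, the column names `dim1` and `dim2`, and the function `parse_query` are all hypothetical, and only AND-joined numeric comparisons are handled.

```python
import re

# Hypothetical query: dataset, table and dimension names are illustrative only.
QUERY = "SELECT SolarZenithAngle FROM OMI WHERE dim1 >= 100 AND dim1 < 200 AND dim2 < 50"

def parse_query(sql):
    """Split a single-table SELECT ... FROM ... [WHERE ...] query into clauses."""
    m = re.fullmatch(
        r"\s*SELECT\s+(?P<select>.+?)\s+FROM\s+(?P<table>\S+)"
        r"(?:\s+WHERE\s+(?P<where>.+?))?\s*;?\s*",
        sql, flags=re.IGNORECASE | re.DOTALL)
    if m is None:
        raise ValueError("unsupported query: " + sql)
    columns = [c.strip() for c in m.group("select").split(",")]
    conditions = []
    where = m.group("where")
    if where:
        # This sketch only understands conjunctions of simple comparisons.
        for term in re.split(r"\s+AND\s+", where, flags=re.IGNORECASE):
            c = re.fullmatch(
                r"\s*(\w+)\s*(<=|>=|!=|=|<|>)\s*(-?\d+(?:\.\d+)?)\s*", term)
            if c is None:
                raise ValueError("unsupported condition: " + term)
            conditions.append((c.group(1), c.group(2), float(c.group(3))))
    return {"select": columns, "from": m.group("table"), "where": conditions}

parsed = parse_query(QUERY)
print(parsed["select"], parsed["from"], parsed["where"])
```

The returned dictionary plays the role of the "grammar information" that later stages combine with the metadata; a real implementation would more likely use a full SQL parser than regular expressions.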

System Design (diagram components)
- Input: SQL query
- Master Process
  - SQL parser
  - Metadata descriptor
  - Pre-Processing Module
  - Query Partition
  - Post-Processing Module
- Slave Processes
  - Data Access Code
  - Parallel HDF5
- HDF5 Dataset

Main Steps of Our System (1/2)
- Input: SQL query
- Output: the subset of data requested by the user
- Process:
  - For every HDF5 dataset, a metadata descriptor is generated
  - An SQL parser parses the query to retrieve the grammar information
  - Variables and dimensions are retrieved from the WHERE expression of the SQL query

Main Steps of Our System (2/2)
- A query request is generated by evaluating the parse tree together with the metadata information
- The data size is computed from the generated query request
- The Query-Partitioning module divides the query request into several sub-requests
- Each node obtains its data results based on its sub-request
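
The three middle steps (request generation, size computation, partitioning) can be sketched as follows. This is an illustration under stated assumptions, not the system's actual code: the request is modeled as a per-dimension (start, count) pair, partitioning is done along the first dimension only, and the names `build_request`, `request_bytes`, and `partition` are hypothetical.

```python
from math import prod

def build_request(shape, ranges):
    """Turn per-dimension [lo, hi) ranges into a (start, count) request.

    shape  -- dimension sizes taken from the metadata descriptor
    ranges -- {dimension index: (lo, hi)}; unconstrained dimensions are read whole
    """
    start, count = [], []
    for axis, size in enumerate(shape):
        lo, hi = ranges.get(axis, (0, size))
        lo, hi = max(lo, 0), min(hi, size)   # clip to the dataset extent
        start.append(lo)
        count.append(max(hi - lo, 0))
    return tuple(start), tuple(count)

def request_bytes(count, itemsize):
    """Size in bytes of the region selected by the request."""
    return prod(count) * itemsize

def partition(start, count, parts):
    """Split a request into at most `parts` contiguous sub-requests along axis 0."""
    base, extra = divmod(count[0], parts)
    subs, offset = [], start[0]
    for i in range(parts):
        rows = base + (1 if i < extra else 0)
        if rows == 0:
            continue                          # fewer rows than workers
        subs.append(((offset,) + start[1:], (rows,) + count[1:]))
        offset += rows
    return subs

# Illustrative values: a 720 x 1440 dataset of 4-byte floats, 4 workers.
start, count = build_request((720, 1440), {0: (100, 200), 1: (0, 50)})
size = request_bytes(count, 4)
subs = partition(start, count, 4)
print(start, count, size, subs)
```

Splitting along a single dimension keeps every sub-request a contiguous block of rows; whether the original system partitions this way is not stated in the slides.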

HDF5 File Organization
- Organizes data as a collection of objects such as groups, datasets, and attributes
- Groups provide logical structuring of the data
- Datasets contain a multi-dimensional array of data elements
  - Dataspace
  - Datatype
- Attributes

Metadata Extraction and Handling
- For every HDF5 dataset, a metadata descriptor is generated
- Metadata information for each dataset:
  - Datatype: information needed to interpret the data
  - Dataspace: information describing the logical layout of the data
  - Information about the attributes attached to the dataset

Metadata Extraction Example
- Datatype: Integer
- Dataspace: number of dimensions and size of each dimension
  - Number of dimensions: 3
  - Size of dimension 1: 100
  - Size of dimension 2: 200
  - Size of dimension 3: 300
- Attributes
  - Temperature
  - Velocity
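
One way to hold the fields listed above is a small record type. The sketch below is only an assumption about what a descriptor might look like: the class name `MetadataDescriptor`, its field names, the dataset name, and the 4-byte integer width are not given in the slides.

```python
from dataclasses import dataclass, field
from math import prod

@dataclass
class MetadataDescriptor:
    """Hypothetical per-dataset descriptor: datatype, dataspace, attribute names."""
    name: str
    datatype: str
    itemsize: int                      # bytes per element
    shape: tuple                       # size of each dimension
    attributes: list = field(default_factory=list)

    @property
    def rank(self):
        """Number of dimensions."""
        return len(self.shape)

    @property
    def num_elements(self):
        """Total number of elements in the dataspace."""
        return prod(self.shape)

    @property
    def nbytes(self):
        """Size of the whole dataset in bytes."""
        return self.num_elements * self.itemsize

desc = MetadataDescriptor(
    name="example",                    # hypothetical dataset name
    datatype="Integer",
    itemsize=4,                        # assumed 32-bit integers; the slide gives no width
    shape=(100, 200, 300),
    attributes=["Temperature", "Velocity"],
)
print(desc.rank, desc.num_elements, desc.nbytes)
```

With the dimension sizes from the example, the dataspace holds 100 x 200 x 300 = 6,000,000 elements, which is the quantity a data-size computation would start from.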

Metadata Extraction and Handling (continued)
- For each group, information about the datasets it contains must be extracted
- Can be pictured as a table
  - Row: group
  - Columns: all the datasets it contains
- Mapping between the dataset variables and the group
- Information about the attributes stored for each dataset

Example HDF5 File
    GROUP "/" {
      GROUP "HDFEOS" {
        GROUP "GRIDS" {
          GROUP "ColumnAmount03" {
            GROUP "Data Fields" {
              DATASET "SolarZenithAngle" {
                DATATYPE H5T_IEEE_F32LE
                DATASPACE SIMPLE { ( 720, 1440 ) / ( 720, 1440 ) }
                DATA { }
                ATTRIBUTE "_FillValue" {
                  DATATYPE H5T_IEEE_F32LE
                  DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
                  DATA { }
                }
              }
            }
          }
        }
      }
    }

Path Information
- For dataset SolarZenithAngle the path is /HDFEOS/GRIDS/ColumnAmount03/Data Fields/SolarZenithAngle
- For attribute _FillValue the path is /HDFEOS/GRIDS/ColumnAmount03/Data Fields/SolarZenithAngle/_FillValue
- Dataspace and datatype for SolarZenithAngle
  - Datatype: Float
  - Number of dimensions: 2
  - Dimension size: 720 x 1440
- Information about the attribute _FillValue
  - Datatype: Float
  - Array size: 1
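
The path and group-to-dataset information can be derived by walking the file hierarchy. The sketch below uses nested Python dictionaries as a stand-in for an opened HDF5 file (no HDF5 library is called), with group names spelled as in the example dump; the `Dataset` class and the `walk` and `group_table` helpers are hypothetical names.

```python
class Dataset:
    """Stand-in for an HDF5 dataset: datatype, shape and attached attributes."""
    def __init__(self, datatype, shape, attributes=None):
        self.datatype = datatype
        self.shape = shape
        self.attributes = attributes or {}

# A dict is a group, a Dataset instance is a leaf.
FILE = {"HDFEOS": {"GRIDS": {"ColumnAmount03": {"Data Fields": {
    "SolarZenithAngle": Dataset(
        "H5T_IEEE_F32LE", (720, 1440),
        {"_FillValue": ("H5T_IEEE_F32LE", (1,))}),
}}}}}

def walk(group, prefix=""):
    """Yield (path, dataset) for every dataset below `group`."""
    for name, node in group.items():
        path = prefix + "/" + name
        if isinstance(node, dict):
            yield from walk(node, path)
        else:
            yield path, node

def group_table(root):
    """Map each group path to the names of the datasets it directly contains."""
    table = {}
    for path, _dataset in walk(root):
        parent, _sep, leaf = path.rpartition("/")
        table.setdefault(parent, []).append(leaf)
    return table

paths = dict(walk(FILE))
table = group_table(FILE)
for path, ds in paths.items():
    print(path, ds.datatype, ds.shape)
    for attr in ds.attributes:
        print(path + "/" + attr)
```

Each entry of `table` corresponds to one row of the group table described earlier: the key is the group and the value lists the datasets stored in it.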

Pre-Processing and Post-Processing Modules
- Two different types of queries
  - Queries based on dimensions
  - Queries based on attributes as well
- The first type of query is supported by the HDF5 API, but requires
  - Complete understanding of the layout of the data
  - Separate programs to retrieve each subset of data
- The second type of query
  - Has no direct support
  - Requires detailed knowledge of the datasets and the HDF5 API, plus complex programming

Pre-Processing and Post-Processing Modules (continued)
- Pre-Processing Module:
  - Inputs: SQL grammar and metadata
  - Filtering is done based on the dimensions of the dataset
- Post-Processing Module:
  - Handles queries based on attributes
  - Filters the retrieved data explicitly to obtain the necessary subset
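
The division of labor between the two modules can be illustrated on a small in-memory grid. This sketch assumes that an attribute-based condition is a predicate on the stored values, which the slides do not spell out; the function names `pre_process` and `post_process` are hypothetical, and a plain list of lists stands in for the HDF5 dataset.

```python
def pre_process(data, row_range, col_range):
    """Dimension-based filtering: keep only the requested index ranges.

    Returns (row, column, value) tuples so the original indices survive,
    which also gives the result a relational-table shape.
    """
    r0, r1 = row_range
    c0, c1 = col_range
    return [(i, j, data[i][j])
            for i in range(r0, min(r1, len(data)))
            for j in range(c0, min(c1, len(data[i])))]

def post_process(rows, predicate):
    """Value-based filtering applied after the data has been read."""
    return [row for row in rows if predicate(row[2])]

# Tiny stand-in dataset: 3 rows x 4 columns, value = 10 * row + column.
grid = [[10 * i + j for j in range(4)] for i in range(3)]

subset = pre_process(grid, (1, 3), (0, 2))       # rows 1..2, columns 0..1
result = post_process(subset, lambda v: v > 10)  # keep values above 10
print(subset)
print(result)
```

The first stage narrows what has to be read using index ranges alone, so only the second stage needs to inspect the data values.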

Parallelization
- Parallel HDF5 layers HDF5 on top of MPI-I/O
- API support for parallel access through message passing
- Collective I/O calls for shared access to a file

Parallelization (continued)
- Master-Slave approach with Parallel HDF5 processing
- Master Process:
  - Parses the SQL query given by the user
  - Generates the data subsetting request
  - Partitions the request into several sub-requests
  - Also performs post-processing
- Slave Processes:
  - Receive sub-requests from the master process
  - Query a data chunk by accessing the HDF5 file in parallel and obtain the data results
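
The master/slave flow can be mimicked on a single machine. In the sketch below, threads from Python's `concurrent.futures` stand in for MPI processes and a list of lists stands in for the Parallel HDF5 file, so it shows only the control flow, not the system's actual MPI code; `split_rows`, `worker`, and `master` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def split_rows(lo, hi, parts):
    """Master side: partition the row range [lo, hi) into contiguous sub-requests."""
    base, extra = divmod(hi - lo, parts)
    bounds, start = [], lo
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        if stop > start:
            bounds.append((start, stop))
        start = stop
    return bounds

def worker(data, sub_request):
    """Worker side: read the rows named by one sub-request."""
    lo, hi = sub_request
    return [(i, j, v) for i in range(lo, hi) for j, v in enumerate(data[i])]

def master(data, lo, hi, workers, predicate):
    """Partition the request, gather worker results, then post-process them."""
    sub_requests = split_rows(lo, hi, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in sub-request order.
        chunks = list(pool.map(lambda s: worker(data, s), sub_requests))
    rows = [row for chunk in chunks for row in chunk]
    return [row for row in rows if predicate(row[2])]

# Stand-in dataset: 10 rows x 3 columns, value = 10 * row + column.
data = [[i * 10 + j for j in range(3)] for i in range(10)]
result = master(data, 2, 9, 3, lambda v: v % 2 == 0)
print(len(result), result[:3])
```

In an MPI setting the sub-requests would be sent to other ranks and each rank would read its own block of the shared file; the thread pool here only reproduces the partition, read, gather, and filter sequence.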

Experiments
- Experimental Goals:
  - To evaluate our system with different types of queries on Parallel HDF5
  - To show the performance improvement of the parallel version over sequential subsetting
  - To show our system's capability on larger datasets
  - To show the parallel scalability of our system

Experimental Setup
- Dataset used: Ozone Monitoring Instrument data from the NASA website
  - Size available for download: 6.5 MB
  - Extended to 500 MB, 1 GB, 2 GB and 4 GB
- Execution environment: IBM Opteron Cluster
  - Each compute node has dual-core 2.3 GHz Opterons
  - 8 GB memory

Performance Comparison of Sequential and Parallel Versions (4 processors) (figure; panels: dataset size 500 MB, dataset size 1 GB)

Performance Comparison of Sequential and Parallel Versions (4 processors) (figure; panels: dataset size 2 GB, dataset size 4 GB)

Parallel Scalability of our System (figure; panels: dataset size 500 MB, dataset size 1 GB)

Parallel Scalability of our System (figure; panels: dataset size 2 GB, dataset size 4 GB)

Related Work
- Li Weng et al. proposed the automatic data virtualization approach seven years earlier
- SciDB provides a scientific database in which arrays are the natural way of storing data
- Beomseok Nam et al. provide an indexing scheme for efficient retrieval of a subset of data, with no notion of data virtualization or use of parallel computing
- Much work exists on extending relational database technology to support scientific data

Conclusion
- Provides a data management approach for scientific datasets stored in HDF5
- Supports SQL queries over a virtual view of the data
- Parallelizes queries based on dimensions and also on attributes
- Significant performance improvement over sequential subsetting
- System scales well with varying numbers of nodes and different data sizes

Thank You!