1 Projection Indexes in HDF5 Rishi Rakesh Sinha The HDF Group.

Presentation transcript:

1 Projection Indexes in HDF5 Rishi Rakesh Sinha The HDF Group

2 Science Produces Large Datasets
- Observation/experiment driven
- Simulation driven
- Information driven
Data volumes shown on the slide: 144 MB/hr, 200 GB/run, > 7 GB/expt

3 Why Not Commercial DBMSs?
- Proprietary format
- Lack of portability
- Low scalability
- Lack of desirable access modes
- Presence of expensive concurrency control and logging mechanisms
- Expensive parallel versions

4 State of the Art Not Enough
- Scientific file formats and associated I/O APIs
- Concentrating on HDF5
- Data recovery is navigational
- Subsetting only on a small set of attributes

5 Why Indexes? Easy Not So Easy

6 Previous Indexing Efforts
- Implicit indexing in HDF5
- JPL use of HDF Vdatas
- HDF-EOS point data
- PyTables
- HDF5 internal B-Tree structures

7 Why a Standard Indexing API?
- Avoid duplication of effort
- PyTables
- Standardize indexing in HDF5
- Standard API can be differently implemented
- Make indexes portable
- Store indexes in HDF5 files

8 H5IN API
- Create_index
  Parameters: location of index, location of data, binning information, memory limits
  Returns: location of the index
- Query
  Parameters: dataset to query, query string
  Returns: selection representing the subset of the data corresponding to the query
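A minimal Python sketch of how the two H5IN calls might fit together. It is not the H5IN implementation: the function names, the dictionary standing in for an HDF5 file, the `bin_edges` argument, and the numeric `low`/`high` arguments (used here in place of the query string) are all assumptions made for illustration, and the memory-limit parameter is omitted. The sample values are the Temp column from the projection index slide.

```python
from bisect import bisect_right


def create_index(store, index_path, data_path, bin_edges):
    """Hypothetical stand-in for Create_index: store one bin id per data row."""
    values = store[data_path]
    # Bin id b means bin_edges[b-1] <= value < bin_edges[b].
    store[index_path] = {
        "data_path": data_path,
        "bin_edges": list(bin_edges),
        "bins": [bisect_right(bin_edges, v) for v in values],
    }
    return index_path  # the slide says the call returns the index location


def query(store, index_path, low, high):
    """Hypothetical stand-in for Query: rows with low <= value <= high."""
    index = store[index_path]
    values = store[index["data_path"]]
    edges = index["bin_edges"]
    lo_bin = bisect_right(edges, low)
    hi_bin = bisect_right(edges, high)
    selection = []
    for row, b in enumerate(index["bins"]):
        if lo_bin < b < hi_bin:
            selection.append(row)  # bin lies entirely inside the range
        elif b in (lo_bin, hi_bin) and low <= values[row] <= high:
            selection.append(row)  # boundary bin: check the raw value
    return selection


# A dict stands in for the file; paths mirror the DAY3 example on the slides.
store = {"/DAY3/Temperature": [52, 42, 57, 22, 67]}
t_in = create_index(store, "/DAY3/T_IN", "/DAY3/Temperature", [30, 50, 60])
rows = query(store, t_in, 40, 57)
```

Only rows that fall in a boundary bin need the raw data to be read; rows in interior bins are accepted from the index alone.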

9 Design Decisions
- Limited scope of the prototype
- Index stored in a separate dataset
- Returns a selection
- Projection index
- Support for simple Boolean queries

10 Limited Scope
- First indexing prototype in HDF5
- Presence of implicit indexing
- Index on single datasets
- Query over single datasets
- Conditions should be over a single dataset
- Result could be mapped to a separate dataset

11 Index Storage
Diagram labels: Root Group: /; DAY1, DAY2, DAY3, DAY4; F3, F2, F1; Location Data

12 Index Storage
Diagram labels: Root Group: /; DAY3; F3, F2, F1; Location Data; LD_INDEX; F1, F2

13 Index Storage
Diagram labels: Root Group: /; DAY3; T_IN, P_IN; Pressure, Temperature

14 Returns a Selection
- Concise storage
- Efficient Boolean operations
Example: FIND PRESSURE WHERE TEMP IN [100, 200]
Diagram labels: Temperature, Pressure
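The example query on this slide can be sketched in Python as follows. This is an illustration only: a selection is modeled as a set of row positions, which is an assumption rather than the HDF5 selection type, and the temperature values are invented so that some rows fall inside [100, 200].

```python
def select_range(values, low, high):
    """Return the set of row positions whose value lies in [low, high]."""
    return {row for row, v in enumerate(values) if low <= v <= high}


def apply_selection(values, selection):
    """Read only the selected rows of another dataset, in row order."""
    return [values[row] for row in sorted(selection)]


# Illustrative data, not taken from the slides.
temperature = [90, 120, 250, 180, 100]
pressure = [32, 34, 21, 22, 27]

# FIND PRESSURE WHERE TEMP IN [100, 200]
hot = select_range(temperature, 100, 200)
result = apply_selection(pressure, hot)

# Selections combine with ordinary Boolean (set) operations.
low_pressure = select_range(pressure, 20, 30)
both = hot & low_pressure
either = hot | low_pressure
```

Because the query result is a selection rather than a copy of the data, it can be applied to a different dataset (here, pressure) and combined with other selections before any data is read.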

15 Projection Index

Temp  Category  Pressure
52    A         32
42    D         34
57    F         21
22    A         22
67    D         27

Projection index labels as shown on the slide: AD F AF D
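A small Python sketch of a projection index over the table on this slide. The helper names are hypothetical, and the index here is simply the Category column copied out in row order, which is the usual meaning of a projection index rather than a confirmed detail of the prototype.

```python
from typing import NamedTuple


class Row(NamedTuple):
    temp: int
    category: str
    pressure: int


# The five rows from the slide.
table = [
    Row(52, "A", 32),
    Row(42, "D", 34),
    Row(57, "F", 21),
    Row(22, "A", 22),
    Row(67, "D", 27),
]


def build_projection_index(rows, column):
    """Copy one column out of the table, keeping row order."""
    return [getattr(row, column) for row in rows]


def lookup(index, predicate):
    """Scan only the index and return the matching row positions."""
    return [pos for pos, value in enumerate(index) if predicate(value)]


category_index = build_projection_index(table, "category")
matches = lookup(category_index, lambda c: c == "A")
pressures = [table[pos].pressure for pos in matches]
```

The scan touches only the narrow index column; the full rows are read afterwards, and only at the matching positions.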

16 Binning

17 Projection Index
Diagram labels: Pressure, Temp

18 Why Projection Index?
- Data is read only
- A dataset, once written, is mostly not changed
- Index does not need to be updated
- Projection indexes are well suited
- Number of disk accesses is the same as for a B-Tree
- Multidimensional queries are not considered

19 Only Simple Boolean Queries
- Query format:
  SELECT SELECTION WHERE c11 < Attribute1 < c12 AND c21 < Attribute2 < c22 …
- Because results are selections, Boolean operations can be done inside the library
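The WHERE clause format above can be sketched in Python as a conjunction of strict range conditions whose per-attribute selections are intersected. The parser, the function names, and the dictionary of columns are assumptions for illustration, not the prototype's query engine. The data reuses the Temp and Pressure columns from the projection index slide.

```python
import re

# One condition of the form "c1 < Attribute < c2" with numeric bounds.
_COND = re.compile(
    r"^\s*(-?\d+(?:\.\d+)?)\s*<\s*(\w+)\s*<\s*(-?\d+(?:\.\d+)?)\s*$"
)


def parse_where(where):
    """Split a WHERE clause on AND into (attribute, low, high) triples."""
    conditions = []
    for part in re.split(r"\s+AND\s+", where.strip()):
        match = _COND.match(part)
        if match is None:
            raise ValueError(f"unsupported condition: {part!r}")
        low, name, high = match.groups()
        conditions.append((name, float(low), float(high)))
    return conditions


def evaluate(where, columns):
    """Return sorted row positions satisfying every condition."""
    selection = None
    for name, low, high in parse_where(where):
        rows = {i for i, v in enumerate(columns[name]) if low < v < high}
        # AND of conditions is the intersection of their selections.
        selection = rows if selection is None else selection & rows
    return sorted(selection)


columns = {
    "Temp": [52, 42, 57, 22, 67],
    "Pressure": [32, 34, 21, 22, 27],
}
rows = evaluate("40 < Temp < 60 AND 20 < Pressure < 33", columns)
```

Each condition is evaluated on a single attribute, and the only cross-attribute work is the set intersection, which fits the slide's point that Boolean operations can be done on selections inside the library.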

20 Conclusion
- Developing a standard indexing API in HDF5
- Creating a proof-of-concept prototype using projection indexes
- Taking a first step towards developing a query language for HDF5

21 Future Work
- Multi-dimensionality
- Multiple datasets in same file
- Multiple datasets across files
- Indexes on attributes
- Allow user to index subset of datasets