Published by Luke Goodley. Modified over 9 years ago.
1 Projection Indexes in HDF5
Rishi Rakesh Sinha, The HDF Group
2 Science Produces Large Datasets
- Observation/experiment driven
- Simulation driven
- Information driven
- Example data volumes: 144 MB/hr, 200 GB/run, > 7 GB/expt
3 Why Not Commercial DBMSs?
- Proprietary format
- Lack of portability
- Low scalability
- Lack of desirable access modes
- Presence of expensive concurrency control and logging mechanisms
- Expensive parallel versions
4 State of the Art Not Enough
- Scientific file formats and associated I/O APIs
- Concentrating on HDF5
- Data recovery is navigational
- Subsetting only on a small set of attributes
5 Why Indexes?
[Figure: an "Easy" case contrasted with a "Not So Easy" case]
6 Previous Indexing Efforts
- Implicit indexing in HDF5
- JPL use of HDF Vdatas
- HDF-EOS point data
- PyTables
- HDF5 internal B-Tree structures
7 Why a Standard Indexing API?
- Avoid duplication of effort
- PyTables
- Standardize indexing in HDF5
- A standard API can be implemented in different ways
- Make indexes portable
- Store indexes in HDF5 files
8 H5IN API
- Create_index
  - Parameters: location of index, location of data, binning information, memory limits
  - Returns: location of the index
- Query
  - Parameters: dataset to query, query string
  - Returns: selection representing the subset of the data corresponding to the query
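The two calls can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the slides do not give the real H5IN signatures, so `create_index`, `query`, and all parameter names here are hypothetical. A plain dict stands in for the HDF5 file, `max_elements` stands in for the memory limit, and a numeric range replaces the query string.

```python
# Hypothetical model: a dict stands in for the HDF5 file, mapping paths to lists.
# None of these names come from the real H5IN API.

def create_index(store, index_path, data_path, bin_width, max_elements=1024):
    """Build a binned index for store[data_path] and save it at index_path."""
    data = store[data_path]
    bins = {}
    # Scan the data in chunks so at most max_elements values are handled at once.
    for start in range(0, len(data), max_elements):
        chunk = data[start:start + max_elements]
        for offset, value in enumerate(chunk):
            bins.setdefault(int(value // bin_width), []).append(start + offset)
    store[index_path] = {"data": data_path, "bin_width": bin_width, "bins": bins}
    return index_path  # location of the index


def query(store, index_path, low, high):
    """Return the sorted positions whose value v satisfies low <= v <= high."""
    index = store[index_path]
    data = store[index["data"]]
    width = index["bin_width"]
    selected = []
    for b in range(int(low // width), int(high // width) + 1):
        for pos in index["bins"].get(b, []):
            # Edge bins may hold values outside the range, so re-check the data.
            if low <= data[pos] <= high:
                selected.append(pos)
    return sorted(selected)


store = {"/DAY3/Temperature": [52, 42, 57, 22, 67]}  # made-up values
index_location = create_index(store, "/DAY3/T_IN", "/DAY3/Temperature", bin_width=10)
selection = query(store, index_location, 40, 57)
```

The result of `query` is a list of positions rather than data values, mirroring the idea that a query returns a selection that can then be applied to a dataset.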
9 Design Decisions
- Limited scope of the prototype
- Index stored in a separate dataset
- Returns a selection
- Projection index
- Support for simple Boolean queries
10 Limited Scope
- First indexing prototype in HDF5
- Presence of implicit indexing
- Index on single datasets
- Query over single datasets
- Conditions should be over a single dataset
- Result could be mapped to a separate dataset
11 Index Storage
[Diagram: root group "/" with groups DAY1, DAY2, DAY3, DAY4; "Location Data" with labels F1, F2, F3]
12 Index Storage
[Diagram: root group "/", group DAY3; "Location Data" with labels F1, F2, F3; LD_INDEX with labels F1, F2]
13 Index Storage
[Diagram: root group "/", group DAY3, containing T_IN, P_IN, Pressure, and Temperature]
14 Returns a Selection
- Concise storage
- Efficient Boolean operations
- Example query: FIND PRESSURE WHERE TEMP IN [100, 200]
[Figure labels: Temperature, Pressure]
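A short sketch may help show why returning a selection gives concise storage and cheap Boolean operations. It assumes a selection can be modeled as a bitmap held in a Python integer, which is a choice made for this illustration and not necessarily how HDF5 represents selections. The data values and helper names are made up.

```python
def select(values, low, high):
    """Return a bitmap (Python int) with bit i set when low <= values[i] <= high."""
    mask = 0
    for i, v in enumerate(values):
        if low <= v <= high:
            mask |= 1 << i
    return mask


def apply_selection(values, mask):
    """Read only the elements whose bit is set in the selection."""
    return [v for i, v in enumerate(values) if (mask >> i) & 1]


temp = [90, 120, 150, 210, 180]   # made-up values
pressure = [32, 34, 21, 22, 27]   # made-up values

# FIND PRESSURE WHERE TEMP IN [100, 200]
sel = select(temp, 100, 200)
result = apply_selection(pressure, sel)

# Combining two conditions is a single bitwise AND on the selections.
both = sel & select(pressure, 25, 40)
```

The selection built from the Temperature values is applied to the Pressure dataset, which works because both datasets share the same element positions.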
15 Projection Index
Temp  Category  Pressure
52    A         32
42    D         34
57    F         21
22    A         22
67    D         27
[Diagram: Category values extracted from the table to form the projection index]
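The table above can be used for a minimal sketch of a projection index: a copy of one column's values kept in row order, so a condition on that column can be checked without reading the other columns. The helper name `lookup` is an assumption made for this illustration.

```python
# Rows from the slide: (Temp, Category, Pressure)
rows = [
    (52, "A", 32),
    (42, "D", 34),
    (57, "F", 21),
    (22, "A", 22),
    (67, "D", 27),
]

# A projection index on Category: that column's values, in row order.
category_index = [category for _, category, _ in rows]


def lookup(index, wanted):
    """Scan the projected column and return the matching row positions."""
    return [pos for pos, value in enumerate(index) if value == wanted]


positions = lookup(category_index, "D")
# Only the selected rows of the full table need to be read afterwards.
pressures = [rows[p][2] for p in positions]
```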
16 Binning
Values: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Bins: 1-3, 4-6, 7-9, 10-12, 13-15
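The binning on this slide can be reproduced with integer division. This is a sketch assuming equal-width bins of width 3 starting at 1, as in the slide's example. The function names are hypothetical.

```python
def bin_of(value, width=3, first=1):
    """Return the zero-based bin number for an integer value."""
    return (value - first) // width


def bin_range(bin_number, width=3, first=1):
    """Return the inclusive (low, high) range covered by a bin."""
    low = first + bin_number * width
    return low, low + width - 1


# Group the values 1..15 by the range of the bin they fall into.
bins = {}
for value in range(1, 16):
    bins.setdefault(bin_range(bin_of(value)), []).append(value)
```

Here `bins` maps each range, such as `(4, 6)`, to the values that fall in it, so a range condition only needs to inspect the bins that overlap the range.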
17 Projection Index
[Figure: values 60, 50, 40, 31, 30, 29; labels Pressure and Temp]
18 Why Projection Index?
- Data is read-only
- A dataset is mostly not changed once written
- Index does not need to be updated
- Projection indexes are well suited
- Number of disk accesses is the same as for a B-Tree
- Multidimensional queries are not considered
19 Only Simple Boolean Queries
- Query format: SELECT SELECTION WHERE c11 < Attribute1 < c12 AND c21 < Attribute2 < c22 …
- Because results are selections, Boolean operations can be done inside the library
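A conjunctive range query in this format can be evaluated by turning each condition into a selection and intersecting the results. The following is a sketch only: the parser accepts just numeric constants with strict `<` bounds, selections are modeled as sets of positions, and `parse_query` and `run_query` are hypothetical names, not part of any HDF5 API.

```python
import re

# One condition of the form: <number> < <attribute> < <number>
CONDITION = re.compile(
    r"^\s*(-?\d+(?:\.\d+)?)\s*<\s*(\w+)\s*<\s*(-?\d+(?:\.\d+)?)\s*$"
)


def parse_query(query):
    """Split 'SELECT SELECTION WHERE a < X < b AND ...' into (low, name, high)."""
    _, sep, where = query.partition("WHERE")
    if not sep:
        raise ValueError("query has no WHERE clause")
    conditions = []
    for part in where.split(" AND "):
        match = CONDITION.match(part)
        if match is None:
            raise ValueError(f"unsupported condition: {part!r}")
        low, name, high = match.groups()
        conditions.append((float(low), name, float(high)))
    return conditions


def run_query(columns, query):
    """Intersect the per-condition selections and return the sorted positions."""
    length = len(next(iter(columns.values())))
    selection = set(range(length))
    for low, name, high in parse_query(query):
        values = columns[name]
        selection &= {i for i, v in enumerate(values) if low < v < high}
    return sorted(selection)


columns = {
    "Temp": [52, 42, 57, 22, 67],       # values borrowed from slide 15
    "Pressure": [32, 34, 21, 22, 27],
}
hits = run_query(columns, "SELECT SELECTION WHERE 40 < Temp < 60 AND 30 < Pressure < 40")
```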
20 Conclusion
- Developing a standard indexing API in HDF5
- Creating a proof-of-concept prototype using projection indexes
- Taking a first step towards developing a query language for HDF5
21 Future Work
- Multi-dimensionality
- Multiple datasets in the same file
- Multiple datasets across files
- Indexes on attributes
- Allow the user to index a subset of datasets