Dynamic Indexing in SpatialHadoop Tin K. Vu, CSE Dept, UC Riverside Advisor: Prof. Ahmed Eldawy, CSE Dept, UC Riverside Project Full Presentation, course CS267, F16
Introduction
SpatialHadoop A framework for big spatial data: language, storage, MapReduce, operations. Works efficiently with static data.
Dynamic Indexing in SpatialHadoop Make SpatialHadoop able to work with dynamic data. Maintain the performance of spatial queries.
Key ideas Store new data in the master node. Repartition at low cost. How does it overcome the limitations of existing work? Keep the advantages of SpatialHadoop. Construct a good strategy for repartitioning.
Outline Introduction Related works Dynamic Indexing Experiments Conclusions
Related works
Big Spatial Indexes Hadoop-GIS: partitions based on density. Limitation: supports only a limited set of spatial data types and queries. SpatialHadoop: pre-indexing based on boundaries. Limitation: works only with static data.
Dynamic Indexes AsterixDB, HBase: support a high rate of data ingestion. Limitation: support only a limited set of spatial data types and queries.
Dynamic Spatial Indexes MD-HBase, GeoMesa: treat spatial data as key-value pairs. Limitation: support only a limited set of spatial data types and queries.
Dynamic Indexing
Approach Multi-level tree on HDFS: new data is stored in the master node, then flushed to the slave nodes. Cost model for finding a good repartitioning strategy.
Indexing System Prototype
Insertion Process Insert into the second-level internal node first. Flush the data to the corresponding partition when its size reaches a threshold.
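The buffer-then-flush behavior described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the prototype's code: the class `BufferedIndex`, the `Rect` type, and the record-count threshold are hypothetical (the real system may measure buffer size in bytes), and the `flushed` dictionary merely stands in for partition files on HDFS.

```python
from collections import defaultdict
from typing import NamedTuple


class Rect(NamedTuple):
    x1: float
    y1: float
    x2: float
    y2: float

    def contains(self, x, y):
        # Half-open on the upper edges so adjacent partitions do not overlap.
        return self.x1 <= x < self.x2 and self.y1 <= y < self.y2


class BufferedIndex:
    """Hypothetical sketch: buffer new records per partition, flush at a threshold."""

    def __init__(self, partitions, flush_threshold):
        self.partitions = partitions            # partition id -> Rect
        self.flush_threshold = flush_threshold  # assumed: number of records
        self.buffers = defaultdict(list)        # in-memory buffers (master node)
        self.flushed = defaultdict(list)        # stand-in for partition files

    def _locate(self, x, y):
        for pid, rect in self.partitions.items():
            if rect.contains(x, y):
                return pid
        raise ValueError(f"point ({x}, {y}) is outside every partition")

    def insert(self, x, y):
        pid = self._locate(x, y)
        self.buffers[pid].append((x, y))
        if len(self.buffers[pid]) >= self.flush_threshold:
            self.flush(pid)
        return pid

    def flush(self, pid):
        # Move the buffered records to the partition and empty the buffer.
        self.flushed[pid].extend(self.buffers.pop(pid, []))
```

With two partitions and `flush_threshold=2`, the first insert into a partition stays in the buffer and the second one triggers a flush of both records.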
How to repartition?
Similarity between partitions Sim(R1,R2) = Intersection(R1,R2) / Union(R1,R2). Repartition when Sim(R1,R2) < threshold (e.g., 95%). Quality = (G+U)/T * 100%, where G is the total area of the new partitions, U is the total area of the intersections between the unchanged partitions and their corresponding standard partitions, and T is the total area of the standard partitions.
Repartition strategy Step 1: Compute the boundaries of the standard partitions. Step 2: Compute the similarities between the old partitions and the standard partitions. Step 3: Split the partitions whose similarity is less than a configurable threshold.
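As a worked illustration of the two formulas, here is a minimal Python sketch for axis-aligned rectangles. Several points are assumptions rather than facts from the slides: Intersection and Union are read as areas, each old partition is paired with the standard partition that has the same id, and a rebuilt ("new") partition is taken to coincide with its standard partition, so it contributes its full area to G. All names are hypothetical.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    x1: float
    y1: float
    x2: float
    y2: float


def area(r):
    return max(0.0, r.x2 - r.x1) * max(0.0, r.y2 - r.y1)


def intersection_area(a, b):
    w = min(a.x2, b.x2) - max(a.x1, b.x1)
    h = min(a.y2, b.y2) - max(a.y1, b.y1)
    return w * h if w > 0 and h > 0 else 0.0


def similarity(a, b):
    """Sim(R1, R2) = area of intersection / area of union."""
    inter = intersection_area(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def quality(old, standard, rebuilt):
    """Quality = (G + U) / T * 100.

    old, standard: dict of partition id -> Rect (assumed to share ids).
    rebuilt: set of ids that were repartitioned (assumed to match `standard`).
    """
    g = sum(area(standard[i]) for i in rebuilt)
    u = sum(intersection_area(old[i], standard[i])
            for i in standard if i not in rebuilt)
    t = sum(area(r) for r in standard.values())
    return 100.0 * (g + u) / t
```

For example, with old partitions [0,4]x[0,10] and [4,10]x[0,10] and standard partitions [0,5]x[0,10] and [5,10]x[0,10], the first pair has similarity 40/50 = 0.8, and the quality is 90% if nothing is rebuilt and 100% if the first partition is rebuilt.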
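The three steps can be sketched as follows. The slides do not say how the standard partition boundaries are computed, so `standard_strips` below is only a hypothetical stand-in for Step 1 (equal-count vertical strips over the current points); a real deployment would presumably use one of SpatialHadoop's partitioners. Pairing old and standard partitions by list position is likewise an assumption.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    x1: float
    y1: float
    x2: float
    y2: float


def similarity(a, b):
    """Area of intersection divided by area of union of two rectangles."""
    w = min(a.x2, b.x2) - max(a.x1, b.x1)
    h = min(a.y2, b.y2) - max(a.y1, b.y1)
    inter = w * h if w > 0 and h > 0 else 0.0
    area_a = (a.x2 - a.x1) * (a.y2 - a.y1)
    area_b = (b.x2 - b.x1) * (b.y2 - b.y1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def standard_strips(points, k, space):
    """Step 1 (stand-in): cut `space` into k vertical strips holding
    roughly equal numbers of points. `points` is a list of (x, y) tuples."""
    xs = sorted(p[0] for p in points)
    cuts = [xs[len(xs) * i // k] for i in range(1, k)]
    edges = [space.x1, *cuts, space.x2]
    return [Rect(edges[i], space.y1, edges[i + 1], space.y2) for i in range(k)]


def partitions_to_split(old, standard, threshold=0.95):
    """Steps 2 and 3: indices of old partitions whose similarity to the
    corresponding standard partition falls below the threshold."""
    return [i for i, (o, s) in enumerate(zip(old, standard))
            if similarity(o, s) < threshold]
```

When the data stays evenly spread, the standard partitions match the old ones and nothing is selected; when new points pile up on one side, the standard boundary shifts and the affected partitions fall below the threshold.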
Algorithm: find partitions to split
Experiments
Experiment setup Datasets were randomly generated by SpatialHadoop (100MB, 200MB, 300MB). Single node, HDFS block size: 16MB.
Conclusions
Contributions Proposed an indexing prototype. Proposed a cost model to evaluate the cost of repartitioning. Proposed an algorithm to find a good strategy for repartitioning.
Future work Run experiments with more diverse data. Run experiments to compare spatial query performance between static and dynamic indexes.
Thank you!