Dynamic Indexing in SpatialHadoop


Dynamic Indexing in SpatialHadoop. Tin K. Vu, CSE Dept, UC Riverside. Advisor: Prof. Ahmed Eldawy, CSE Dept, UC Riverside. Project full presentation, course CS267, F16.

Introduction

SpatialHadoop: A framework for big spatial data, covering language, storage, MapReduce, and operations. Works efficiently with static data.

Dynamic Indexing in SpatialHadoop: Make SpatialHadoop able to work with dynamic data. Maintain the performance of spatial queries.

Key ideas: Store new data in the master node. Repartition at low cost. How does it overcome the limitations of existing work? Keep the advantages of SpatialHadoop. Construct a good strategy for repartitioning.

Outline: Introduction, Related works, Dynamic Indexing, Experiments, Conclusions.

Related works

Big Spatial Indexes: Hadoop-GIS partitions based on density. Limitation: supports only a limited set of spatial data types and queries. SpatialHadoop pre-indexes based on boundary. Limitation: works only with static data.

Dynamic Indexes: AsterixDB and HBase support a high rate of data ingestion. Limitation: support only a limited set of spatial data types and queries.

Dynamic Spatial Indexes: MD-HBase and GeoMesa treat spatial data as key-value data. Limitation: support only a limited set of spatial data types and queries.

Dynamic Indexing

Approach: Multi-level tree on HDFS: new data is stored in the master node, then flushed to the slave nodes. A cost model is used to find a good repartitioning strategy.

Indexing System Prototype

Insertion Process: Insert into the 2nd internal node first. Flush the data to the corresponding partition when its size reaches a threshold.
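The buffer-then-flush behavior described on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the prototype's code: the class `BufferedIndex`, its fields, and the use of a record count as the threshold (instead of a byte size) are all hypothetical, and the in-memory `flushed` lists merely stand in for partition files on HDFS.

```python
class BufferedIndex:
    """Hypothetical sketch: buffer new records per partition, flush on a threshold."""

    def __init__(self, partitions, flush_threshold):
        # partitions: dict mapping partition id -> (x1, y1, x2, y2) boundary
        self.partitions = partitions
        # Assumption: threshold counted in records, not bytes.
        self.flush_threshold = flush_threshold
        # Buffers play the role of the node held on the master.
        self.buffers = {pid: [] for pid in partitions}
        # Stand-in for the partition files stored on the slave nodes.
        self.flushed = {pid: [] for pid in partitions}

    def _locate(self, x, y):
        # Find the partition whose boundary contains the point.
        for pid, (x1, y1, x2, y2) in self.partitions.items():
            if x1 <= x < x2 and y1 <= y < y2:
                return pid
        raise ValueError(f"point ({x}, {y}) is outside every partition")

    def insert(self, x, y):
        # New data goes to the buffer first.
        pid = self._locate(x, y)
        buf = self.buffers[pid]
        buf.append((x, y))
        # Flush to the corresponding partition once the buffer is full.
        if len(buf) >= self.flush_threshold:
            self.flushed[pid].extend(buf)
            buf.clear()
        return pid


index = BufferedIndex({"left": (0, 0, 5, 10), "right": (5, 0, 10, 10)}, flush_threshold=2)
index.insert(1, 1)
index.insert(2, 2)   # second record for "left" triggers a flush
index.insert(7, 3)   # stays buffered for "right"
```

Buffering per partition keeps each flush a single append to one partition, which is the reason for routing the point before buffering it in this sketch.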

How to repartition?

Similarity between partitions:
Sim(R1,R2) = Intersection(R1,R2) / Union(R1,R2)
Repartition when Sim(R1,R2) < threshold (e.g., 95%).
Quality = (G+U)/T * 100%
G is the total area of the new partitions.
U is the total area of the intersections between the unchanged partitions and their corresponding standard partitions.
T is the total area of the standard partitions.
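As a worked illustration of the two formulas, here is a small sketch for axis-aligned rectangles. It assumes that Intersection and Union denote areas (so Sim is a Jaccard-style ratio) and that rectangles are given as `(x1, y1, x2, y2)` tuples; the function names are hypothetical and are not taken from SpatialHadoop.

```python
def area(rect):
    # Area of an axis-aligned rectangle; degenerate rectangles count as 0.
    x1, y1, x2, y2 = rect
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection_area(a, b):
    # Overlap of two rectangles, or 0 if they are disjoint.
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(overlap)


def similarity(a, b):
    # Sim(R1, R2) = Intersection(R1, R2) / Union(R1, R2), both taken as areas.
    inter = intersection_area(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def quality(g, u, t):
    # Quality = (G + U) / T * 100, expressed as a percentage.
    return 100.0 * (g + u) / t


old = (0, 0, 10, 10)
standard = (5, 0, 15, 10)
sim = similarity(old, standard)   # overlap 50, union 150
needs_repartition = sim < 0.95
```

With these two rectangles the overlap is 50 and the union is 150, so the similarity is one third, well below a 95% threshold.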

Repartition strategy: Step 1: Compute the boundaries of the standard partitions. Step 2: Compute the similarities between the old partitions and the standard partitions. Step 3: Split each partition whose similarity is less than a configurable threshold.
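Steps 2 and 3 above can be sketched as follows, taking the standard partition boundaries from Step 1 as given input. This is a hedged illustration, not the presentation's algorithm: the function `partitions_to_split` is hypothetical, and matching each old partition to the standard partition it is most similar to is an assumption, since the slides do not say how the pairing is made.

```python
def similarity(a, b):
    # Area of intersection over area of union for (x1, y1, x2, y2) rectangles.
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    inter = w * h if w > 0 and h > 0 else 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def partitions_to_split(old, standard, threshold=0.95):
    # old: dict of partition id -> rectangle; standard: list of rectangles (Step 1).
    to_split = []
    for pid, rect in old.items():
        # Step 2 (assumed pairing): compare against the best-matching standard partition.
        best = max((similarity(rect, s) for s in standard), default=0.0)
        # Step 3: mark partitions whose similarity falls below the threshold.
        if best < threshold:
            to_split.append(pid)
    return to_split


old_partitions = {"p0": (0, 0, 10, 10), "p1": (10, 0, 20, 10)}
standard_partitions = [(0, 0, 10, 10), (10, 0, 15, 10), (15, 0, 20, 10)]
split_ids = partitions_to_split(old_partitions, standard_partitions)
```

Here `p0` already matches a standard partition exactly and is left alone, while `p1` covers two standard partitions and is selected for splitting.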

Algorithm: find partitions to split

Experiments

Experiment setup: Datasets were randomly generated by SpatialHadoop (100 MB, 200 MB, 300 MB). Single node; HDFS block size: 16 MB.

Conclusions

Contributions: Proposed an indexing prototype. Proposed a cost model to evaluate the cost of repartitioning. Proposed an algorithm to find a good strategy for repartitioning.

Future works: Run experiments with more diverse data. Run experiments to compare spatial query performance between the static and dynamic indexes.

Thank you!