Voronoi-based Geospatial Query Processing with MapReduce


Voronoi-based Geospatial Query Processing with MapReduce
Afsin Akdogan, Ugur Demiryurek, Farnoush Banaei-Kashani, Cyrus Shahabi
Integrated Media Systems Center, University of Southern California, Los Angeles, CA 90089
Cloud Computing Technology and Science (CloudCom), 2010 IEEE Second International Conference on
2014.06.16 (MON) Regular seminar, TaeHoon Kim

Contents Introduction Background Constructing Voronoi Diagram with MapReduce Query processing Performance Evaluation Conclusions

1. Introduction With the recent advances in location-based services, the amount of geospatial data is rapidly growing. Geospatial queries are computationally complex problems that are time-consuming to solve, especially with large datasets. On the other hand, we observe that a large variety of geospatial queries are intrinsically parallelizable. Cloud computing enables a considerable reduction in operational expenses by providing flexible resources that can instantaneously scale up and down, and the big IT vendors, including Google, IBM, Microsoft and Amazon, are ramping up ("ramp up": to increase) parallel programming infrastructures.

1. Introduction Google's MapReduce programming model provides parallelization for processing large datasets, and can do so at a very large scale: Google processes 20 petabytes of data per day with MapReduce. A variety of geospatial queries can be efficiently modeled using MapReduce, for example the reverse nearest neighbor (RNN) query. RNN: given a query point q and a set of data points P, an RNN query retrieves all the data points that have q as their nearest neighbor. (Figure: example of an RNN query over points p1–p5, where RNN(p5) = {p2, p3, p4}.)

1. Introduction Parallel spatial query processing has been studied in the contexts of parallel databases, cluster systems, and peer-to-peer systems, as well as cloud platforms. However, the points in each partition are not locally indexed. Meanwhile, several approaches use distributed hierarchical index structures: the R-tree based P2PR-Tree and SD-Rtree, the kd-tree based k-RP, and the quad-tree based hQT. The problem with distributed hierarchical index structures is that they do not scale, due to the traditional top-down search that overloads the nodes near the tree root, and they fail to provide full decentralization.

1. Introduction Prior work combines an R-tree index with MapReduce without supporting specific query types; it motivates the problem of geospatial query processing without explaining how to process the queries. Hierarchical indices like the R-tree are structurally not suitable for MapReduce. In this paper, we propose a MapReduce-based approach that both constructs a flat spatial index, the Voronoi diagram (VD), and enables efficient parallel processing of a wide range of spatial queries: reverse nearest neighbor (RNN), maximum reverse nearest neighbor (MaxRNN), and k nearest neighbor (kNN) queries.

2. Background Voronoi diagrams: a Voronoi diagram decomposes a space into disjoint polygons based on a set of generators (i.e., data points). Given a set of generators S in the Euclidean space, the Voronoi diagram associates all locations in the plane with their closest generator. Each generator s has a Voronoi polygon consisting of all points closer to s than to any other generator. (Figure: a Voronoi diagram over a set of generators S1–S6.)
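As a toy illustration of this definition, the sketch below assigns a location to its closest generator, i.e. to the generator whose Voronoi polygon contains it. The generator set is hypothetical, not the figure's coordinates:

```python
import math

def nearest_generator(location, generators):
    """Return the generator whose Voronoi polygon contains `location`,
    i.e. the generator closest to it under Euclidean distance."""
    return min(generators, key=lambda g: math.dist(location, generators[g]))

# Hypothetical generator set (illustrative only).
generators = {"s1": (0, 0), "s2": (4, 0), "s3": (0, 4), "s4": (4, 4)}

# (1, 1) is closest to s1, so it falls inside s1's Voronoi polygon.
print(nearest_generator((1, 1), generators))  # → s1
```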

2. Background MapReduce (Hadoop): first, each mapper is assigned one or more splits, depending on the number of machines in the cluster. Second, each mapper reads its input one <key, value> pair at a time, applies the map function to it, and generates intermediate <key, value> pairs as output. Finally, the reducers fetch the output of the mappers, merge all of the intermediate values that share the same intermediate key, process them together, and generate the final output.
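The three-step flow above can be modeled in miniature. This is a single-process sketch of the semantics, not Hadoop's actual API:

```python
from collections import defaultdict

def run_mapreduce(splits, map_fn, reduce_fn):
    """Toy model of MapReduce: map each split's records, shuffle the
    intermediate pairs by key, then reduce every key group."""
    intermediate = []
    for split in splits:                  # each mapper reads one split
        for record in split:
            intermediate.extend(map_fn(record))
    groups = defaultdict(list)            # shuffle: group values by key
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce: merge each key's values into the final output.
    return {key: reduce_fn(key, vals) for key, vals in groups.items()}

# Word count, the canonical MapReduce example.
counts = run_mapreduce(
    [["a b a"], ["b b"]],
    map_fn=lambda line: [(w, 1) for w in line.split()],
    reduce_fn=lambda key, vals: sum(vals),
)
print(counts)  # → {'a': 2, 'b': 3}
```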

3. Constructing Voronoi Diagram with MapReduce Voronoi diagram construction is inherently suitable for MapReduce modeling, because a VD can be obtained by merging multiple partial Voronoi diagrams (PVDs). The VD is built with MapReduce using a divide and conquer method. The main idea behind the divide and conquer method:
1. Given a set of data points P as input in Euclidean space, the points are first sorted in increasing order by X coordinate.
2. P is separated into several subsets of equal size.
3. A PVD is generated for the points of each subset, and then all of the PVDs are merged to obtain the final VD for P.

3. Constructing Voronoi Diagram with MapReduce
Map phase: given a set of data points P = {p1, p2, …, p13, p14} as input, the points are first sorted, then divided into two splits of equal size. The first mapper reads the data points in Split 1, generates a PVD and emits <1, PVD1>; likewise, the second mapper emits <1, PVD2>.
Reduce phase: the reducer aggregates these PVDs and merges them into a single VD. The final output of the reducer is in the following <key, value> format: <pi, VNs(pi)>, where VNs(pi) represents the set of Voronoi neighbors of pi.

3. Voronoi Generation: Map phase Generate partial Voronoi diagrams (PVDs). Each mapper reads the data points in its split, generates a PVD and emits it under a single key. (Figure: points p1–p8 sorted and cut into Split 1 and Split 2; the first mapper emits <1, PVD(Split 1)> and the second emits <1, PVD(Split 2)>.)

3. Voronoi Generation: Reduce phase Remove superfluous edges and generate new edges. The reducer aggregates the PVDs and merges them into a single VD. Final output: <pi, VNs(pi)>. (Figure: the merged diagram; the reducer emits pairs such as <p1, {p2, p3}>, <p2, {p1, p3, p4}>, <p3, {p1, p2, p4, p5}>, …)
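Putting the map and reduce phases together, the construction can be sketched as below. The `are_voronoi_neighbors` bisector-sampling test and the coordinates are illustrative stand-ins: the real reducer merges PVD geometry and removes superfluous edges analytically rather than recomputing neighbors by sampling.

```python
import math

def are_voronoi_neighbors(p, q, points, steps=200, span=100.0):
    """Approximate test: p and q share a Voronoi edge iff some point on
    their perpendicular bisector is closer to them than to every other
    generator; we check sampled bisector points."""
    mx, my = (p[0] + q[0]) / 2, (p[1] + q[1]) / 2
    dx, dy = q[0] - p[0], q[1] - p[1]
    norm = math.hypot(dx, dy)
    ux, uy = -dy / norm, dx / norm        # unit vector along the bisector
    for i in range(-steps, steps + 1):
        x, y = mx + ux * span * i / steps, my + uy * span * i / steps
        d = (x - p[0]) ** 2 + (y - p[1]) ** 2
        if all((x - r[0]) ** 2 + (y - r[1]) ** 2 > d
               for r in points if r != p and r != q):
            return True
    return False

def build_vd(points, num_splits=2):
    """Map: sort by X, cut into equal splits, emit <1, split> as a
    stand-in for <1, PVD_i>.  Reduce: merge and emit <p, VNs(p)>."""
    pts = sorted(points)
    size = math.ceil(len(pts) / num_splits)
    emitted = [(1, pts[i:i + size]) for i in range(0, len(pts), size)]
    merged = [p for _, split in emitted for p in split]   # single reducer
    return {p: {q for q in merged
                if q != p and are_voronoi_neighbors(p, q, merged)}
            for p in merged}

vd = build_vd([(0, 0), (2, 0), (4, 0)])
print(vd[(2, 0)])  # the middle point neighbors both end points
```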

4. Query Processing : RNN Suppose the input to the map function is <point, Voronoi Neighbors (VNs)>, ∀p ∈ P.
Map phase: each point p finds its nearest neighbor. Emit: <NN(p), p>.
Reduce phase: the reducer aggregates all pairs with the same key pk. Emit: <point, Reverse Nearest Neighbors>.

4. RNN query processing (Figure: the mappers emit pairs such as <p5, p2> and <p5, p3>; the reducer groups them by key and emits <p5, {p2, p3, p4}>.)
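A minimal sketch of this round, with hypothetical coordinates and neighbor lists; `rnn_queries` folds the map-side emit and the reduce-side grouping into one function, and relies on the property that a point's NN is always one of its Voronoi neighbors:

```python
import math
from collections import defaultdict

def rnn_queries(vns, coords):
    """Map: each point p finds its NN among its Voronoi neighbors and
    emits <NN(p), p>.  Reduce: group by key, so each point pk receives
    exactly the points that have pk as nearest neighbor = RNN(pk)."""
    groups = defaultdict(set)
    for p, neighbors in vns.items():
        nn = min(neighbors, key=lambda q: math.dist(coords[p], coords[q]))
        groups[nn].add(p)            # emit <NN(p), p>
    return dict(groups)

# Hypothetical layout echoing the slide: p2, p3, p4 are all nearest to p5.
coords = {"p1": (0, 0), "p2": (5, 0), "p3": (6, 2), "p4": (7, 0), "p5": (6, 1)}
vns = {"p1": {"p2"}, "p2": {"p1", "p3", "p5"}, "p3": {"p2", "p5"},
       "p4": {"p5"}, "p5": {"p2", "p3", "p4"}}
print(sorted(rnn_queries(vns, coords)["p5"]))  # → ['p2', 'p3', 'p4']
```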

4. Query Processing : MaxRNN Nearest Neighbor Circle (NNC): given a point p and its nearest neighbor p', the NNC of p is the circle centered at p' with radius |p, p'|.
MapReduce: first find the NN of every point and compute the radii of the circles; then find the overlapping circles, and finally find the intersection points that represent the optimal region.

4. Max RNN query processing
Mapper, 1st step: p1 finds p3 as the nearest neighbor among its VNs {p2, p3} and sets |p1, p3| as the radius of its NNC; it emits <(p1, |p1, p3|)>, and similarly <(p2, r2)>, <(p3, r3)>.
Mapper, 2nd step: p1 checks its VNs {p2, p3}, finds that its NNC overlaps theirs, and computes the intersection points {i2, i3, i5, i6}. Each intersection point i then computes its weight by counting the NNCs covering i, and emits <1, (i, w(i))>: <1, (i2, 3)>, <1, (i3, 3)>, <1, (i5, 2)>, <1, (i6, 2)>.
Reducer: finds the intersection points with the highest weights, which are emitted as the final output: i1, i2, i3.
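The two mapper steps and the reducer can be sketched as follows. The circle-intersection formula is standard geometry; the two-point input is a hypothetical example, not the slide's configuration:

```python
import math

def circle_intersections(c1, r1, c2, r2):
    """Intersection points of two circles (empty if disjoint or nested)."""
    d = math.dist(c1, c2)
    if d == 0 or d > r1 + r2 or d < abs(r1 - r2):
        return []
    a = (r1 * r1 - r2 * r2 + d * d) / (2 * d)
    h = math.sqrt(max(r1 * r1 - a * a, 0.0))
    mx = c1[0] + a * (c2[0] - c1[0]) / d
    my = c1[1] + a * (c2[1] - c1[1]) / d
    ox, oy = h * (c2[1] - c1[1]) / d, h * (c2[0] - c1[0]) / d
    return [(mx + ox, my - oy), (mx - ox, my + oy)]

def max_rnn_region(coords, vns):
    """Round 1: each point's NNC = circle centered at its NN with radius
    |p, NN(p)|.  Round 2: intersection points of overlapping NNCs,
    weighted by how many NNCs cover them; the heaviest points survive."""
    circles = {}
    for p, nbrs in vns.items():
        nn = min(nbrs, key=lambda q: math.dist(coords[p], coords[q]))
        circles[p] = (coords[nn], math.dist(coords[p], coords[nn]))
    pts, names = [], list(circles)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            pts += circle_intersections(*circles[a], *circles[b])
    weights = [(pt, sum(1 for c, r in circles.values()
                        if math.dist(pt, c) <= r + 1e-9)) for pt in pts]
    best = max((w for _, w in weights), default=0)
    return [pt for pt, w in weights if w == best]

# Two mutually nearest points: their NNCs overlap in two points,
# each covered by both circles.
coords = {"p1": (0, 0), "p2": (2, 0)}
region = max_rnn_region(coords, {"p1": {"p2"}, "p2": {"p1"}})
print(len(region))  # → 2
```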

4. Query Processing : kNN k Nearest Neighbor query (kNN): given a query point q and a set of data points P, a kNN query finds the k closest data points pi ∈ P to q, where d(q, pi) ≤ d(q, p) for the remaining points p ∈ P. Suppose we answer a 3NN query with two mappers and one reducer for query point q1.
Map phase: each mapper reads its split, and simultaneously processes the query point q2 and other queries as well.
Reduce phase: receives the query point as key and its neighbors as value, and emits them as they are.

4. kNN query processing We answer a 3NN query with two mappers and one reducer for query point q1.
Map: the NN of q1 is found among the VNs of p1, {p2, p3, p4, p5, p6, p7, p8}. Map output: <q1, {p1, p2, p5}>.
Reduce: receives the query point as key and its neighbors as value, and emits them as they are: <q1, {p1, p2, …, pi}>, <q2, {p1, p2, …, pi}>.
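The map-side search can be sketched as a best-first expansion over the Voronoi index; this relies on the property that the j-th NN of q is a Voronoi neighbor of one of the first j-1 NNs. The coordinates and neighbor lists below are hypothetical:

```python
import heapq
import math

def knn(q, k, coords, vns):
    """Start at the generator whose cell contains q (q's NN) and grow
    outward through Voronoi neighbors, popping points in increasing
    distance to q until k results are collected."""
    start = min(coords, key=lambda p: math.dist(q, coords[p]))
    heap = [(math.dist(q, coords[start]), start)]
    seen, result = {start}, []
    while heap and len(result) < k:
        _, p = heapq.heappop(heap)
        result.append(p)
        for n in vns[p]:
            if n not in seen:
                seen.add(n)
                heapq.heappush(heap, (math.dist(q, coords[n]), n))
    return result

# Hypothetical 1-D layout: four points on a line with chained neighbors.
coords = {"p1": (0, 0), "p2": (2, 0), "p3": (4, 0), "p4": (6, 0)}
vns = {"p1": {"p2"}, "p2": {"p1", "p3"}, "p3": {"p2", "p4"}, "p4": {"p3"}}
print(knn((3.5, 0), 3, coords, vns))  # → ['p3', 'p2', 'p4']
```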

Performance Evaluation
Competitors: Voronoi diagram vs. R-tree.
Factors: index generation time, query response time.
Parameters: number of nodes; k (for kNN).
Datasets: BSN, all businesses in the entire U.S., containing approximately 1,300,000 data points; RES, all restaurants in the entire U.S., containing approximately 450,000 data points.
Environment: Hadoop 0.20.1 on Amazon EC2; 64-bit Fedora 8 Linux operating system with 15 GB memory, 4 virtual cores, and disks with 1,690 GB storage.

Performance Evaluation VD & R-tree construction time: construction of the R-tree index takes less time than that of the VD, because the mappers generating PVDs execute expensive computations and there is only one reducer merging the PVDs in the cluster.

Performance Evaluation NN query on VD & R-tree: our experiments show that the VD outperforms the R-tree in query response time. This is because with the VD, points can immediately find their VNs in the map phase.

Performance Evaluation RNN: the average response time slightly decreases with the increasing number of data points; hence the overall throughput increases. (Figure: single node without parallelism vs. multiple nodes with parallelism.)

Performance Evaluation MaxRNN: as shown, parallelization reduces the response time at a linear rate for both datasets. We observe a performance increase from 30% up to 50%. (Figure: effect of the number of nodes on MaxRNN.)

Performance Evaluation kNN: for a given number of data points P and a given number of query points Q, the MapReduce-based kNN search (MRK) always outputs P × Q <key, value> pairs in the map phase, independent of k. Due to the massive amount of data generated by the mappers, the response time is high.

Conclusions In this paper, we investigated the problem of handling and efficiently querying massive spatial data in a cloud architecture. We presented a distributed Voronoi index and techniques to answer three types of geospatial queries: RNN, MaxRNN, and kNN. Future work: supporting other types of queries, such as bichromatic reverse nearest neighbor and skyline reverse k nearest neighbor queries.