Efficient Processing of k Nearest Neighbor Joins using MapReduce.

INTRODUCTION
The k nearest neighbor join (kNN join) is a special type of join that combines each object in a dataset R with the k objects in another dataset S that are closest to it. As a combination of the k nearest neighbor (kNN) query and the join operation, the kNN join is an expensive operation. Most existing work relies on a centralized indexing structure such as the B+-tree or the R-tree, which cannot be accommodated directly in a distributed and parallel environment such as MapReduce.
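To make the definition concrete, here is a minimal Python sketch of the naive kNN join; the function names and the Euclidean metric are illustrative choices, not from the slides. It also makes explicit the O(|R|·|S|) distance computations that the paper sets out to avoid.

```python
import heapq
import math

def knn(r, S, k, dist):
    """Return the k objects of S closest to r under the metric dist."""
    return heapq.nsmallest(k, S, key=lambda s: dist(r, s))

def knn_join(R, S, k, dist=math.dist):
    """Naive kNN join: pair every r in R with its k nearest neighbors in S.
    Costs O(|R| * |S|) distance computations, the baseline being improved on."""
    return [(r, knn(r, S, k, dist)) for r in R]
```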

AN OVERVIEW OF KNN JOIN USING MAPREDUCE
Basic strategy: partition R = ∪_{1≤i≤N} R_i, where R_i ∩ R_j = ∅ for i ≠ j, and distribute each subset R_i to a reducer. The entire S has to be sent to every reducer to be joined with R_i; finally, R ⋉ S = ∪_{1≤i≤N} (R_i ⋉ S). The shuffling cost is |R| + N·|S|.
H-BRJ: splits both R and S into √n subsets, R = ∪_{1≤i≤√n} R_i and S = ∪_{1≤j≤√n} S_j, so that each of the n reducers joins one pair (R_i, S_j); a second job must then merge the partial results per object.
Better strategy: for each R_i, find a subset S_i ⊆ S such that R_i ⋉ S = R_i ⋉ S_i, so that R ⋉ S = ∪_{1≤i≤N} (R_i ⋉ S_i). The shuffling cost drops to |R| + α·|S|, where α is the replication factor of S, ideally far smaller than N.
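A minimal sketch of the basic strategy as map and reduce functions (plain Python, assumed names, no actual Hadoop API); it shows where the N·|S| shuffling term comes from:

```python
import random

N = 4  # number of reducers, i.e., partitions of R (assumed)

def map_basic(record, origin):
    """Emit (reducer_id, tagged_record) pairs; origin is 'R' or 'S'."""
    if origin == 'R':
        # each r goes to exactly one reducer: |R| records shuffled
        yield random.randrange(N), ('R', record)
    else:
        # each s is broadcast to all reducers: N * |S| records shuffled
        for i in range(N):
            yield i, ('S', record)

def reduce_basic(reducer_id, tagged_records, k, dist):
    """Each reducer holds one block R_i plus a full copy of S; run the naive join."""
    tagged_records = list(tagged_records)
    Ri = [rec for tag, rec in tagged_records if tag == 'R']
    S = [rec for tag, rec in tagged_records if tag == 'S']
    for r in Ri:
        yield r, sorted(S, key=lambda s: dist(r, s))[:k]
```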

AN OVERVIEW OF KNN JOIN USING MAPREDUCE
In summary, to minimize the join cost, we need to:
1. find a good partitioning of R;
2. find the minimal set S_i for each R_i ⊆ R, given a partitioning of R.
※ The minimal S_i is S_i = ∪_{1≤j≤|R_i|} KNN(r_j, S). However, it is impossible to determine the k nearest neighbors of every r_j a priori, so in practice we must settle for a small superset of this ideal S_i.

HANDLING KNN JOIN USING MAPREDUCE

DATA PREPROCESSING
A good partitioning of R for optimizing the kNN join should cluster objects based on their proximity. Partitions are induced by a set of pivot objects, with each object assigned to its nearest pivot, and three pivot-selection strategies are considered:
Random Selection: draw several random candidate sets and keep the one whose pivots are most spread out.
Farthest Selection: greedily pick each next pivot as the object farthest from the pivots already chosen.
k-means Selection: run k-means on a sample and use the converged centers as pivots.
※ It is not easy to find good pivots.
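As an illustration, a minimal sketch of the farthest-selection heuristic on a sample (names are my own; random and k-means selection would be drop-in alternatives):

```python
def farthest_selection(sample, num_pivots, dist):
    """Greedy farthest-point heuristic: each new pivot maximizes its
    distance to the closest pivot chosen so far."""
    pivots = [sample[0]]  # the starting object is arbitrary
    while len(pivots) < num_pivots:
        nxt = max(sample, key=lambda o: min(dist(o, p) for p in pivots))
        pivots.append(nxt)
    return pivots
```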

First MapReduce Job
Performs the data partitioning and collects statistics for each partition: every object of R and S is assigned to its nearest pivot, and per-partition summaries (such as the cardinality and the minimum and maximum object-to-pivot distances L(·) and U(·)) are recorded for use in the second job.
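A sketch of what the first job might look like (assumed record format; the paper's actual summary tables contain more fields than shown here):

```python
def map_partition(record, origin, pivots, dist):
    """Assign each object of R and S to its nearest pivot."""
    d, i = min((dist(record, p), i) for i, p in enumerate(pivots))
    yield (origin, i), (d, record)

def reduce_statistics(key, values):
    """Per-partition statistics used by the bounds in the second job:
    cardinality plus the max/min object-to-pivot distances U(.) and L(.)."""
    dists = [d for d, _ in values]
    yield key, {'count': len(dists), 'U': max(dists), 'L': min(dists)}
```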

Second MapReduce Job: Distance Bound of kNN
For any r ∈ P_i^R and s ∈ P_j^S, applying the triangle inequality twice gives the upper bound
  ub(s, P_i^R) = U(P_i^R) + |p_i, p_j| + |p_j, s| ①
Let KNN(P_i^R, S) denote the k objects of S with the smallest upper bounds; then
  θ_i = max_{∀s ∈ KNN(P_i^R, S)} ub(s, P_i^R)
upper-bounds the kNN distance of every object in P_i^R.
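A sketch of computing θ_i from the first job's outputs (assumed data layout: S_by_partition maps j to a list of (|s, p_j|, s) pairs, and stats_R holds the U(·) values):

```python
import heapq

def compute_theta(i, S_by_partition, stats_R, pivots, dist, k):
    """theta_i: the k-th smallest ub(s, P_i^R) over all s in S, an upper
    bound on the kNN distance of every r in P_i^R (assumes |S| >= k)."""
    ubs = []
    for j, Sj in S_by_partition.items():
        pivot_gap = dist(pivots[i], pivots[j])  # |p_i, p_j|
        for d_s_to_pj, _ in Sj:
            # ub(s, P_i^R) = U(P_i^R) + |p_i, p_j| + |p_j, s|
            ubs.append(stats_R[i]['U'] + pivot_gap + d_s_to_pj)
    return heapq.nsmallest(k, ubs)[-1]
```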

Second MapReduce Job: Finding S_i for R_i
Symmetrically, the distance from s ∈ P_j^S to any r ∈ P_i^R is lower-bounded by
  lb(s, P_i^R) = max{0, |p_i, p_j| − U(P_i^R) − |s, p_j|} ②
If lb(s, P_i^R) > θ_i, then s ∉ KNN(P_i^R, S) and s can be pruned. ③
Rearranging the pruning condition yields a bound that depends only on the partitions:
  LB(P_j^S, P_i^R) = |p_i, p_j| − U(P_i^R) − θ_i
so s may belong to KNN(P_i^R, S) only if |s, p_j| ≥ LB(P_j^S, P_i^R); that is, the candidates drawn from P_j^S are exactly those with |s, p_j| ∈ [LB(P_j^S, P_i^R), U(P_j^S)].
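The selection of S_i then follows directly from the bound; a sketch with the same assumed data layout as above:

```python
def build_Si(i, S_by_partition, stats_R, stats_S, pivots, dist, theta_i):
    """Collect S_i: all objects of S that may appear among the kNN of some
    r in P_i^R, using LB(P_j^S, P_i^R) = |p_i, p_j| - U(P_i^R) - theta_i."""
    Si = []
    for j, Sj in S_by_partition.items():
        LB = dist(pivots[i], pivots[j]) - stats_R[i]['U'] - theta_i
        if LB > stats_S[j]['U']:
            continue  # no object of P_j^S can qualify: prune the whole partition
        Si.extend((d, s) for d, s in Sj if d >= LB)
    return Si
```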

Second MapReduce Job
In this way, the objects in each partition of R and their potential k nearest neighbors are sent to the same reducer. By parsing the key-value pair (k2, v2), the reducer can derive the partition P_i^R and the subset S_i, which consists of P_{j1}^S, ..., P_{jM}^S. For every r ∈ P_i^R, in order to reduce the number of distance computations, we first sort the partitions of S_i by the distances from their pivots to the pivot p_i in ascending order, so that the most promising candidates are examined first.
※ Compute θ_i ← max_{∀s ∈ KNN(P_i^R, S)} ub(s, P_i^R).
※ θ_i can be refined during the scan, but I think this refinement is useless.
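A sketch of the reducer's join loop (it omits the per-object pruning described on the next slide and keeps only the pivot-distance ordering):

```python
import heapq

def reduce_knn(i, Ri, Si_partitions, pivots, dist, k):
    """Join P_i^R with S_i; partitions of S_i are scanned in ascending
    order of pivot distance so that close candidates are met early."""
    ordered = sorted(Si_partitions.items(),
                     key=lambda jv: dist(pivots[i], pivots[jv[0]]))
    for r in Ri:
        knn = []  # max-heap over the current k best, via negated distances
        for j, Sj in ordered:
            for _, s in Sj:
                d = dist(r, s)
                if len(knn) < k:
                    heapq.heappush(knn, (-d, s))
                elif d < -knn[0][0]:
                    heapq.heapreplace(knn, (-d, s))
        yield r, sorted((-nd, s) for nd, s in knn)
```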

Second MapReduce Job
Define d(o, HP(p_i, p_j)), the distance from an object o to the generalized hyperplane separating the partitions of p_i and p_j; in the Euclidean case, d(o, HP(p_i, p_j)) = (|o, p_j|² − |o, p_i|²) / (2·|p_i, p_j|).
If d(o, HP(p_i, p_j)) > θ, then |o, q| > θ for all q ∈ P_i^R, so o can be skipped for the entire partition.
For an individual q ∈ P_i^R, by the triangle inequality an object o ∈ P_j^S can satisfy |q, o| ≤ θ only if
  max{L(P_j^S), |p_j, q| − θ} ≤ |p_j, o| ≤ min{U(P_j^S), |p_j, q| + θ}.
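These two pruning rules translate directly into checks; a sketch (the Euclidean hyperplane-distance formula above is a standard reconstruction, since the slide leaves the right-hand side blank):

```python
def hp_distance(o, pi, pj, dist):
    """Distance from o to the generalized hyperplane between pivots pi and pj
    (Euclidean form)."""
    return (dist(o, pj) ** 2 - dist(o, pi) ** 2) / (2 * dist(pi, pj))

def prune_for_partition(o, pi, pj, theta, dist):
    """Rule 1: if o lies farther than theta from the bisecting hyperplane,
    no q in P_i^R can be within theta of o."""
    return hp_distance(o, pi, pj, dist) > theta

def is_candidate(d_pj_o, d_pj_q, L_Sj, U_Sj, theta):
    """Rule 2: o in P_j^S may satisfy |q, o| <= theta only if its distance
    to p_j falls in the admissible interval (triangle inequality)."""
    return max(L_Sj, d_pj_q - theta) <= d_pj_o <= min(U_Sj, d_pj_q + theta)
```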

MINIMIZING REPLICATION OF S
An object s is replicated to S_i whenever |s, p_j| ≥ LB(P_j^S, P_i^R), so a larger LB(P_j^S, P_i^R) keeps more objects with small |s, p_j| from being replicated. Splitting R into finer partitions tightens the kNN distance bound θ_i of each partition and thus enlarges the LB, but it also multiplies the number of reduce tasks. The fine partitions are therefore combined into groups: R = ∪_{1≤i≤N} G_i, with G_i ∩ G_j = ∅ for i ≠ j. An object s is assigned to S_i only if |s, p_j| ≥ LB(P_j^S, G_i), where
  LB(P_j^S, G_i) = min_{∀P_i^R ∈ G_i} LB(P_j^S, P_i^R).
The total replication of S is then
  RP(S) = Σ_{∀G_i} Σ_{∀P_j^S} |{s | s ∈ P_j^S ∧ |s, p_j| ≥ LB(P_j^S, G_i)}|.
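A sketch that evaluates RP(S) for a given grouping (pairwise_LB(j, Pi) stands for LB(P_j^S, P_i^R) and is assumed to be precomputed):

```python
def replication(groups, S_by_partition, pairwise_LB):
    """RP(S): total number of copies of S objects shipped under a grouping
    of R's partitions, following the formula above."""
    total = 0
    for Gi in groups:
        for j, Sj in S_by_partition.items():
            bound = min(pairwise_LB(j, Pi) for Pi in Gi)  # LB(P_j^S, G_i)
            total += sum(1 for d_s_to_pj, _ in Sj if d_s_to_pj >= bound)
    return total
```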

MINIMIZING REPLICATION OF S
Geometric Grouping: group partitions of R that are close to each other in space.
Greedy Grouping: grow each group G_i by adding the partition P_j^R that minimizes the marginal replication RP(S, G_i ∪ {P_j^R}) − RP(S, G_i). Evaluating this exactly is rather costly, so the replication is approximated by counting the partitions of S that still contribute objects (P_j^S contributes iff ∃s ∈ P_j^S with |s, p_j| ≥ LB(P_j^S, G_i), i.e., iff LB(P_j^S, G_i) ≤ U(P_j^S)):
  RP(S, G_i) ≈ |{P_j^S ⊆ S | LB(P_j^S, G_i) ≤ U(P_j^S)}|.
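A sketch of greedy grouping under that approximation; the seeding and the balance-by-size policy here are assumptions, not details given on the slide:

```python
def greedy_grouping(partitions, num_groups, approx_rp):
    """Greedy grouping: repeatedly extend the currently smallest group with
    the partition whose approximate marginal replication is lowest.
    approx_rp(group) stands in for the RP(S, G_i) approximation above."""
    groups = [[p] for p in partitions[:num_groups]]  # assumed seeding
    remaining = list(partitions[num_groups:])
    while remaining:
        gi = min(groups, key=len)  # keep group sizes balanced
        best = min(remaining, key=lambda p: approx_rp(gi + [p]) - approx_rp(gi))
        gi.append(best)
        remaining.remove(best)
    return groups
```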

EXPERIMENTAL EVALUATION

The End! Thanks