Efficient Processing of k Nearest Neighbor Joins using MapReduce.


1 Efficient Processing of k Nearest Neighbor Joins using MapReduce

2 INTRODUCTION The k nearest neighbor join (kNN join) is a special type of join that combines each object in a dataset R with the k objects in another dataset S that are closest to it. As a combination of the k nearest neighbor (kNN) query and the join operation, the kNN join is an expensive operation. Most existing work relies on centralized indexing structures such as the B+-tree and the R-tree, which cannot be directly accommodated in a distributed and parallel environment.

3 AN OVERVIEW OF KNN JOIN USING MAPREDUCE Basic strategy: R = ∪_{1≤i≤N} R_i, where R_i ∩ R_j = ∅ for i ≠ j; each subset R_i is distributed to one reducer, and the whole of S has to be sent to every reducer to be joined with R_i; finally R ⋉ S = ∪_{1≤i≤N} (R_i ⋉ S). Shuffling cost: |R| + N·|S|. H-BRJ: splits both R and S into √n partitions, R = ∪_{1≤i≤√n} R_i and S = ∪_{1≤i≤√n} S_i, and joins every pair of partitions. Better strategy: for each R_i find a subset S_i ⊆ S such that R_i ⋉ S = R_i ⋉ S_i, so that R ⋉ S = ∪_{1≤i≤N} (R_i ⋉ S_i). Shuffling cost: |R| + α·|S|, with replication factor α ≪ N.
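The two shuffling costs can be compared with a toy calculation (a sketch only; the dataset sizes and the replication factor α below are made-up illustrative numbers, not from the slides):

```python
def naive_cost(r_size, s_size, n_reducers):
    # Basic strategy: all of S is sent to every reducer -> |R| + N*|S|
    return r_size + n_reducers * s_size

def partition_aware_cost(r_size, s_size, alpha):
    # Better strategy: each Ri is joined only with Si -> |R| + alpha*|S|,
    # where alpha (<< N) is the average replication factor of S
    return r_size + alpha * s_size

# Hypothetical sizes: 1M objects each, 32 reducers, alpha = 3
print(naive_cost(1_000_000, 1_000_000, 32))       # 33,000,000 objects shuffled
print(partition_aware_cost(1_000_000, 1_000_000, 3))  # 4,000,000 objects shuffled
```

Even a modest α makes the partition-aware strategy shuffle an order of magnitude less data.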

4 AN OVERVIEW OF KNN JOIN USING MAPREDUCE In summary, to minimize the join cost, we need to 1. find a good partitioning of R; 2. find the minimal set S_i for each R_i ⊆ R, given a partitioning of R. ※ The minimal such set is S_i = ∪_{r ∈ R_i} KNN(r, S). However, it is impossible to find the k nearest neighbors of every r a priori.

5 HANDLING KNN JOIN USING MAPREDUCE

6 DATA PREPROCESSING A good partitioning of R for optimizing the kNN join should cluster objects based on their proximity. Pivot selection strategies: Random Selection; Farthest Selection; k-means Selection. ※ Good pivots are not easy to find.
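The slide only names the three pivot-selection strategies. As one illustration, farthest selection can be sketched as follows (the function name, the fixed seed, and the 1-D toy metric are my own, not the paper's code):

```python
import random

def farthest_pivots(points, t, dist, seed=0):
    """Farthest selection: pick a random first pivot, then repeatedly add
    the point whose distance to its nearest chosen pivot is largest."""
    rng = random.Random(seed)
    pivots = [rng.choice(points)]
    while len(pivots) < t:
        nxt = max(points, key=lambda p: min(dist(p, q) for q in pivots))
        pivots.append(nxt)
    return pivots

# Toy 1-D example: pivots spread toward the extremes of the data
print(farthest_pivots([0, 1, 2, 9, 10, 20], 3, lambda a, b: abs(a - b)))
```

Farthest selection spreads pivots out well but is sensitive to outliers, which is one reason the slide notes that good pivots are not easy to find.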

7 First MapReduce Job Performs the data partitioning and collects some statistics for each partition.
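A sketch of what the first job computes, assuming the per-partition statistics are the partition size and the minimum/maximum distance L, U from its objects to the pivot (all names and the sequential stand-in for map/reduce are illustrative):

```python
from collections import defaultdict

def first_job(objects, pivots, dist):
    """Map phase: assign each object to its closest pivot.
    Reduce phase: collect per-partition statistics (size, U, L),
    which the second job needs for its distance bounds."""
    partitions = defaultdict(list)
    for o in objects:  # map: emit (pivot_id, (object, distance-to-pivot))
        i = min(range(len(pivots)), key=lambda i: dist(o, pivots[i]))
        partitions[i].append((o, dist(o, pivots[i])))
    stats = {i: {"size": len(ps),
                 "U": max(d for _, d in ps),   # max distance to pivot p_i
                 "L": min(d for _, d in ps)}   # min distance to pivot p_i
             for i, ps in partitions.items()}
    return partitions, stats
```

On a real cluster the map output would be shuffled by pivot id; here the grouping is simulated with a dictionary.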

8 Second MapReduce Job Distance Bound of kNN
ub(s, P_i^R) = U(P_i^R) + |p_i, p_j| + |p_j, s| ①
θ_i = max_{∀s ∈ KNN(P_i^R, S)} ub(s, P_i^R)
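Bound ① can be sketched as follows (toy helpers, not the paper's implementation; θ_i is obtained as the k-th smallest upper bound, which equals the max over the k objects with the smallest ub values):

```python
import heapq

def upper_bound(U_Ri, d_pi_pj, d_pj_s):
    """ub(s, P_i^R) = U(P_i^R) + |p_i,p_j| + |p_j,s|: by two triangle
    inequalities, no r in P_i^R can be farther from s than this."""
    return U_Ri + d_pi_pj + d_pj_s

def knn_bound(k, U_Ri, candidates):
    """theta_i: every r in P_i^R has k neighbors within theta_i, so any
    object provably farther than theta_i can be pruned.
    candidates: one (|p_i,p_j|, |p_j,s|) pair per object s of S."""
    ubs = [upper_bound(U_Ri, dij, djs) for dij, djs in candidates]
    return max(heapq.nsmallest(k, ubs))
```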

9 Second MapReduce Job Finding S_i for R_i
lb(s, P_i^R) = max{0, |p_i, p_j| − U(P_i^R) − |s, p_j|} ②
If lb(s, P_i^R) > θ_i, then s ∉ KNN(P_i^R, S). ③
LB(P_j^S, P_i^R) = |p_i, p_j| − U(P_i^R) − θ_i
If |s, p_j| ≥ LB(P_j^S, P_i^R), then s may belong to KNN(P_i^R, S); the candidates of P_j^S satisfy |s, p_j| ∈ [LB(P_j^S, P_i^R), U(P_j^S)].
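The filter derived from ② and ③ can be sketched as follows (illustrative names; each object s of P_j^S is represented only by its distance |s, p_j| to its own pivot):

```python
def LB(d_pi_pj, U_Ri, theta_i):
    """LB(P_j^S, P_i^R) = |p_i,p_j| - U(P_i^R) - theta_i: objects of P_j^S
    closer to p_j than this can never be within theta_i of any r in P_i^R."""
    return d_pi_pj - U_Ri - theta_i

def build_Si(partition_S, d_pi_pj, U_Ri, theta_i):
    """Keep s in P_j^S only if |s,p_j| >= LB(P_j^S, P_i^R); the survivors
    lie in the ring [LB(P_j^S, P_i^R), U(P_j^S)] around pivot p_j."""
    lb = LB(d_pi_pj, U_Ri, theta_i)
    return [s for s, d_pj_s in partition_S if d_pj_s >= lb]
```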

10 Second MapReduce Job In this way, the objects in each partition of R and their potential k nearest neighbors are sent to the same reducer. By parsing the key-value pair (k2, v2), the reducer can derive the partition P_i^R and the subset S_i that consists of P_{j1}^S, ..., P_{jM}^S. For each r ∈ P_i^R, in order to reduce the number of distance computations, we first sort the partitions of S_i by the distances from their pivots to the pivot p_i, in ascending order. ※ Compute θ_i ← max_{∀s ∈ KNN(P_i^R, S)} ub(s, P_i^R). ※ θ_i is then refined, but I think this refinement is useless.
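A sketch of the reducer-side scan, assuming each partition of S_i arrives with its pivot distance |p_i, p_j| and radius U(P_j^S); the 1-D stand-in distance and all names are illustrative:

```python
import heapq

def knn_in_reducer(r, d_r_pi, partitions, k):
    """partitions: [(|p_i,p_j|, U(P_j^S), [s, ...]), ...].
    Scanning in ascending order of pivot distance tightens the pruning
    bound (the current k-th smallest distance) as early as possible."""
    heap = []  # max-heap of the k best distances seen so far (negated)
    for d_pi_pj, U_pj, objs in sorted(partitions, key=lambda t: t[0]):
        # Lower bound on |r,s| for any s in this partition, via the pivots:
        # |r,s| >= (|p_i,p_j| - |r,p_i|) - U(P_j^S)
        if len(heap) == k and (d_pi_pj - d_r_pi) - U_pj > -heap[0]:
            continue  # the whole partition is provably too far from r
        for s in objs:
            d = abs(r - s)  # stand-in for a real distance function
            if len(heap) < k:
                heapq.heappush(heap, -d)
            elif d < -heap[0]:
                heapq.heapreplace(heap, -d)
    return sorted(-x for x in heap)
```

The partition-level check shows why sorting matters: once the bound is tight, far-away partitions are skipped without a single object-level distance computation.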

11 Second MapReduce Job Define d(o, HP(p_i, p_j)) as the (signed) distance from o to the generalized hyperplane separating pivots p_i and p_j; in Euclidean space, d(o, HP(p_i, p_j)) = (|o, p_i|² − |o, p_j|²) / (2|p_i, p_j|).
If d(o, HP(p_i, p_j)) > θ, then ∀q ∈ P_i^R: |o, q| > θ, so o can be pruned.
Conversely, only if max{L(P_i^S), |p_i, q| − θ} ≤ |p_i, o| ≤ min{U(P_i^S), |p_i, q| + θ} can |q, o| ≤ θ hold.
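A sketch of the hyperplane pruning rule, assuming Euclidean space, where the signed hyperplane distance can be computed from pivot distances alone (helper names are mine):

```python
def hp_dist(d_o_pi, d_o_pj, d_pi_pj):
    """Signed distance from o to the bisecting hyperplane HP(p_i, p_j)
    in Euclidean space: (|o,p_i|^2 - |o,p_j|^2) / (2 |p_i,p_j|).
    Positive when o lies on p_j's side of the hyperplane."""
    return (d_o_pi ** 2 - d_o_pj ** 2) / (2 * d_pi_pj)

def may_reach_partition(theta, d_o_pi, d_o_pj, d_pi_pj):
    """If d(o, HP(p_i,p_j)) > theta, every q in P_i^R is farther than
    theta from o, so o can be pruned; otherwise o must be kept."""
    return hp_dist(d_o_pi, d_o_pj, d_pi_pj) <= theta
```

Toy 1-D check: with p_i = 0, p_j = 10, the hyperplane sits at 5; the object o = 9 (distances 9 and 1 to the pivots) is 4 away from it, so it is pruned for θ = 3 but kept for θ = 5.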

12 MINIMIZING REPLICATION OF S s is replicated when |s, p_j| ≥ LB(P_j^S, P_i^R) => a large LB(P_j^S, P_i^R) keeps the replication small => split the dataset into a finer granularity, so that the bound of the kNN distances for all objects in each partition of R becomes tighter. The partitions of R are then grouped: R = ∪_{1≤i≤N} G_i, G_i ∩ G_j = ∅ for i ≠ j. s is assigned to S_i only if |s, p_j| ≥ LB(P_j^S, G_i), where LB(P_j^S, G_i) = min_{∀P_i^R ∈ G_i} LB(P_j^S, P_i^R). The total replication is RP(S) = Σ_{∀G_i} Σ_{∀P_j^S} |{s | s ∈ P_j^S ∧ |s, p_j| ≥ LB(P_j^S, G_i)}|.
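The group-level bound and the replication count can be sketched as follows (the data layout is illustrative: each object of P_j^S is represented by its distance to p_j, and the LB values are assumed precomputed):

```python
def LB_group(partition_LBs):
    """LB(P_j^S, G_i) = min over all P_i^R in G_i of LB(P_j^S, P_i^R)."""
    return min(partition_LBs)

def replication(S_partitions, group_LBs):
    """RP(S): total number of (group, object) replications.
    S_partitions: {j: [|s,p_j|, ...]} per S-partition;
    group_LBs: {(i, j): LB(P_j^S, G_i)} per (group, S-partition) pair."""
    return sum(1
               for (i, j), lb in group_LBs.items()
               for d in S_partitions[j]
               if d >= lb)
```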

13 MINIMIZING REPLICATION OF S Geometric Grouping. Greedy Grouping: add to G_i the partition P_j^R that minimizes RP(S, G_i ∪ {P_j^R}) − RP(S, G_i); but evaluating this exactly is rather costly, so each partition's contribution is approximated all-or-nothing: RP(S, G_i) ≈ Σ_{P_j^S ⊆ S, LB(P_j^S, G_i) ≤ U(P_j^S)} |P_j^S|.
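The whole-partition approximation of RP(S, G_i) can be sketched as follows (illustrative tuple layout; the LB values are assumed precomputed for the group in question):

```python
def approx_RP(s_partitions):
    """s_partitions: [(|P_j^S|, U(P_j^S), LB(P_j^S, G_i)), ...].
    A partition is counted in full iff its candidate ring [LB, U] is
    non-empty, i.e. LB(P_j^S, G_i) <= U(P_j^S); otherwise it is assumed
    to contribute no replicated objects at all."""
    return sum(size for size, U, lb in s_partitions if lb <= U)
```

This trades accuracy for speed: the greedy step only needs to compare group candidates, so a coarse but cheap estimate of RP suffices.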

14 EXPERIMENTAL EVALUATION

15–18 Experimental results (figure-only slides; no transcript text)

19 The End! Thanks

