SIMILARITY JOIN COP6731 Advanced Database Systems


1 SIMILARITY JOIN COP6731 Advanced Database Systems

2 Basic Similarity Queries
• Range Query: find all items within a given distance of the query item
• k-Nearest-Neighbor (kNN) Query: find the k items most similar to the query item
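
The two basic queries can be sketched as follows. This is a minimal illustration that assumes points are coordinate tuples and that d is Euclidean distance; the function names `range_query` and `knn_query` are hypothetical and do not come from the slides.

```python
import heapq
import math

def range_query(points, q, eps):
    """Return every point whose Euclidean distance to q is at most eps."""
    return [p for p in points if math.dist(p, q) <= eps]

def knn_query(points, q, k):
    """Return the k points closest to q, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, q))

# Illustrative data only
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0), (0.5, 0.5)]
in_range = range_query(points, (0.0, 0.0), 1.0)
nearest_two = knn_query(points, (0.0, 0.0), 2)
```

A range query may return any number of items, while a kNN query always returns exactly k (when at least k items exist).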

3 Similarity Join
• Given two sets, R and S, of data points
• Find all pairs (r,s) ∈ R×S such that d(r,s) ≤ ε
• Applications: duplicate detection, similarity comparison, etc.

4 Similarity Join - SQL-like Notation
SELECT * FROM R, S WHERE d(R.r, S.s) ≤ ε
• ε too small: no results
• ε too large: very large result set
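
A brute-force reading of this notation, as a sketch: it assumes Euclidean distance and tuple-valued points, and `similarity_join` is a hypothetical name. The sample data shows how the choice of ε drives the result size.

```python
import math
from itertools import product

def similarity_join(R, S, eps, d=math.dist):
    """All pairs (r, s) from R x S with d(r, s) <= eps, by exhaustive comparison."""
    return [(r, s) for r, s in product(R, S) if d(r, s) <= eps]

# Illustrative data only
R = [(0, 0), (2, 0)]
S = [(0, 1), (2, 3)]

too_small = similarity_join(R, S, 0.5)   # no pair is this close
moderate = similarity_join(R, S, 1.0)    # only the closest pair qualifies
too_large = similarity_join(R, S, 10.0)  # degenerates to the full cross product
```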

5 k-Closest Pair Query
• Given two sets, R and S, of data points
• Find the k pairs (r,s) that yield the least distance
• r and s are NN of each other
• This is also called a distance join

6 k-Closest Pair Query
• SQL-like Notation: SELECT * FROM R, S ORDER BY d(R.r, S.s) UNTIL k
• Applications
  - Find the pairs of people who have the most similar interests
  - Find the music scores that are most similar to each other
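
The ORDER BY ... UNTIL k semantics can be sketched with a bounded selection over the cross product. This assumes Euclidean distance; `k_closest_pairs` is a hypothetical name, and the exhaustive enumeration is for illustration rather than efficiency.

```python
import heapq
import math
from itertools import product

def k_closest_pairs(R, S, k, d=math.dist):
    """The k pairs of R x S with the smallest distance, closest first."""
    return heapq.nsmallest(k, product(R, S), key=lambda pair: d(*pair))

# Illustrative data only
R = [(0, 0), (2, 0)]
S = [(0, 1), (2, 3)]
top2 = k_closest_pairs(R, S, 2)
```

Unlike the ε-join, the result size is fixed at k, so no distance threshold has to be guessed.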

7 k-Nearest Neighbor Join
• Combine each point with its k nearest neighbors from the other data set
• SQL-like Notation: SELECT * FROM R, S GROUP BY R.r GROUP SIZE k ORDER BY d(R.r, S.s)
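
As a sketch of the GROUP BY / GROUP SIZE k semantics, the join below returns one group per R-point. It assumes Euclidean distance and hashable (tuple) points; `knn_join` is a hypothetical name.

```python
import heapq
import math

def knn_join(R, S, k, d=math.dist):
    """Map each r in R to its k nearest neighbors in S, nearest first."""
    return {r: heapq.nsmallest(k, S, key=lambda s: d(r, s)) for r in R}

# Illustrative data only
R = [(0, 0), (2, 0)]
S = [(0, 1), (2, 3), (3, 0)]
groups = knn_join(R, S, 2)
```

The result always holds exactly k·|R| pairs (when |S| ≥ k), and the join is asymmetric: swapping R and S generally gives a different answer.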

8 k-Nearest Neighbor Join

9 k-NN Join Applications
• k-means clustering
  1. Select k initial centers at random
  2. Assign each database point to its nearest center
  3. Recompute the center of each cluster
  4. Repeat Steps 2 and 3 until convergence
• Classification: assign a new object to the majority class of its k nearest neighbors
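
The four k-means steps can be sketched as below. Step 2 is effectively a 1-NN join of the data points with the current centers. The function name, the `seed` and `max_iter` parameters, and the rule of keeping the old center for an empty cluster are assumptions made for this illustration.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)                      # Step 1: random initial centers
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                                 # Step 2: nearest center (1-NN join)
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        new_centers = [                                  # Step 3: recompute each center
            tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:                       # Step 4: stop at convergence
            break
        centers = new_centers
    return centers, clusters

# Illustrative data only: two well-separated groups
pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, clusters = kmeans(pts, 2)
```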

10 Nested Loop Join
• Simple nested loop
  - For each R-point, iterate over all S-points
  - Scans S |R| times, which is very expensive
• Nested block loop
  - For each page of R-points, iterate over all S-points
  - Scans S only |R|/|page| times, which is more cost-effective
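
A sketch of the nested block loop variant, which also counts how often S is scanned. Lists stand in for disk pages here, and `page_size` (points per page) and the function name are assumptions for illustration; with `page_size=1` it degenerates to the simple nested loop.

```python
import math

def block_nested_loop_join(R, S, eps, page_size, d=math.dist):
    """Join R and S page by page; return (result pairs, number of scans of S)."""
    result, scans_of_S = [], 0
    for start in range(0, len(R), page_size):
        r_page = R[start:start + page_size]   # one page of R held in memory
        scans_of_S += 1                       # S is read once per R-page
        for s in S:
            for r in r_page:
                if d(r, s) <= eps:
                    result.append((r, s))
    return result, scans_of_S

# Illustrative data only
R = [(float(i), 0.0) for i in range(10)]
S = [(float(i), 0.5) for i in range(10)]
simple_pairs, simple_scans = block_nested_loop_join(R, S, 0.6, page_size=1)
block_pairs, block_scans = block_nested_loop_join(R, S, 0.6, page_size=4)
```

Both calls produce the same pairs, but the blocked version scans S once per page of R instead of once per point.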

11 Indexed Nested Loop Join
• For each R-point, determine its matches in S using the index
• For a large number of dimensions and/or high selectivity (due to large ε), not as competitive as nested loop join

12 Spatial Join vs Similarity Join
• Represent each data point as a hypercube of edge length 0.71·ε
• Map the similarity join wrt ε to a spatial join on the hypercubes
  - If two hypercubes overlap, the corresponding points are within distance ε of each other
  - That is, they are neighbors wrt ε
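
The mapping can be sketched as follows, under the assumption that 0.71 stands for 1/√2, i.e. the two-dimensional case of an edge length ε/√dim (the slide does not state this). The helper names `cube` and `cubes_overlap` are hypothetical.

```python
import math

def cube(p, eps):
    """Axis-aligned hypercube centered at p with edge length eps / sqrt(dim)."""
    half = eps / (2 * math.sqrt(len(p)))
    return [(x - half, x + half) for x in p]

def cubes_overlap(a, b):
    """True if the two hypercubes intersect in every dimension."""
    return all(lo1 <= hi2 and lo2 <= hi1 for (lo1, hi1), (lo2, hi2) in zip(a, b))

eps = 1.0
p, q, far = (0.0, 0.0), (0.5, 0.5), (0.9, 0.0)
overlap_pq = cubes_overlap(cube(p, eps), cube(q, eps))
overlap_pfar = cubes_overlap(cube(p, eps), cube(far, eps))
```

With this edge length, overlapping cubes differ by at most ε/√dim per coordinate, so the points are at most ε apart. The converse does not hold: `p` and `far` are 0.9 apart, which is within ε, yet their cubes do not overlap.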

13 R-tree Spatial Join (RSJ)
• Assumption: index preconstructed on R and S with equal tree height
Procedure RSJ (R, S: page)
  for each r ∈ R.children do
    for each s ∈ S.children do
      if (r ∩ s ≠ ∅) then RSJ(r, s);

14 Adapt RSJ for Similarity Join
• Distance predicate rather than intersection
• mindist(R,S) computes the least possible distance between two points in (R,S)
Procedure RsimJ(R, S, ε)
  if IsDirPg(R) ∧ IsDirPg(S) then
    for each r ∈ R.children do
      for each s ∈ S.children do
        if mindist(r, s) ≤ ε then RsimJ(r, s, ε); /* recursive */
  else /* R & S are data pages */
    for each p ∈ R.points do
      for each q ∈ S.points do
        if d(p, q) ≤ ε then output(p, q);
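
A runnable sketch of RsimJ on a toy tree. The `Node` class and the `build` helper (one directory page over sorted, fixed-size data pages) are hypothetical stand-ins for a real R-tree, and Euclidean distance is assumed. For brevity the bounding rectangles are recomputed on every call, whereas a real index would store them in the page.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    points: list = field(default_factory=list)    # filled on data pages
    children: list = field(default_factory=list)  # filled on directory pages

    @property
    def is_dir(self):
        return bool(self.children)

    def mbr(self):
        """Minimum bounding rectangle as (lows, highs)."""
        if self.is_dir:
            lows, highs = zip(*(c.mbr() for c in self.children))
        else:
            lows = highs = self.points
        return tuple(map(min, zip(*lows))), tuple(map(max, zip(*highs)))

def mindist(a, b):
    """Least possible distance between a point in a's MBR and a point in b's MBR."""
    (alo, ahi), (blo, bhi) = a.mbr(), b.mbr()
    gaps = (max(bl - ah, al - bh, 0.0) for al, ah, bl, bh in zip(alo, ahi, blo, bhi))
    return math.sqrt(sum(g * g for g in gaps))

def rsimj(R, S, eps, out):
    if R.is_dir and S.is_dir:
        for r in R.children:
            for s in S.children:
                if mindist(r, s) <= eps:      # prune page pairs that are too far apart
                    rsimj(r, s, eps, out)
    else:                                     # both are data pages (equal tree height)
        for p in R.points:
            for q in S.points:
                if math.dist(p, q) <= eps:
                    out.append((p, q))

def build(points, page_size=2):
    """Hypothetical helper: a two-level tree over sorted points."""
    pts = sorted(points)
    return Node(children=[Node(points=pts[i:i + page_size])
                          for i in range(0, len(pts), page_size)])

# Illustrative data only
R_pts = [(0, 0), (1, 0), (5, 5), (6, 5)]
S_pts = [(0, 1), (5, 6), (9, 9), (10, 9)]
pairs = []
rsimj(build(R_pts), build(S_pts), 1.0, pairs)
```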

15 Performance Issues in R-tree Join
• Cost is dominated by point-distance computations (CPU-bound)
• Random page accesses can make performance worse than that of nested block loop join

16 Parallel Similarity Join
• A task corresponds to a pair of tree nodes (data pages or directory pages)
• Various task assignment strategies
  - Round robin
  - Static range assignment
  - Dynamic task assignment to achieve load balancing
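
To illustrate why dynamic assignment balances load better, the sketch below simulates two of the strategies on a list of task costs. The costs are made-up numbers standing in for the work of joining one node pair, and both function names are hypothetical; static range assignment is not shown.

```python
import heapq

def round_robin(costs, workers):
    """Task i goes to worker i mod workers, regardless of its cost."""
    loads = [0] * workers
    for i, c in enumerate(costs):
        loads[i % workers] += c
    return loads

def dynamic(costs, workers):
    """Each task goes to whichever worker becomes free first."""
    heap = [(0, w) for w in range(workers)]   # (accumulated load, worker id)
    for c in costs:
        load, w = heapq.heappop(heap)
        heapq.heappush(heap, (load + c, w))
    return [load for load, _ in sorted(heap, key=lambda t: t[1])]

# Hypothetical task costs: a few expensive node pairs among cheap ones
costs = [8, 1, 1, 1, 8, 1, 1, 1]
rr_loads = round_robin(costs, 2)
dyn_loads = dynamic(costs, 2)
```

The overall running time is determined by the most heavily loaded worker, so the smaller maximum load of the dynamic strategy translates directly into a shorter join.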

17 Breadth-First R-tree Join
• Shortcomings of RsimJ
  - Depth-first traversal is sequential in nature
  - No strategy for improving locality in the inner loop, resulting in an inefficient page access pattern
• Solution
  - Proceed level by level (i.e., breadth-first traversal)
  - Determine all relevant pairs for the next level
  - Access these relevant pairs in the order of their physical locations in storage

18 Reducing Random Access in Breadth-First Traversal
• Space is regularly tiled, with a space-filling curve (e.g., Hilbert curve) defined over the tiles
• Store the index tree level by level
• For each level, store the tree nodes in their space-filling-curve order

19 Without Preconstructed Index (1)
• Tree construction time is often much less than join time, so it can be amortized during the join
• Indexes can be constructed temporarily for the join
• Techniques include the Hilbert R-tree and the ε-kdB tree
  - Hilbert R-tree: sort points by space-filling-curve (SFC) order, and pack adjacent points into pages
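
The sort-and-pack step can be sketched in two dimensions as below. `hilbert_index` is the standard iterative mapping from grid cell to position on the Hilbert curve; it assumes integer coordinates in [0, n) with n a power of two. `pack_pages` and `page_size` are hypothetical names, and only the leaf level is built.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve over an n x n grid."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate/flip the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def pack_pages(points, n, page_size):
    """Sort points by Hilbert order and pack neighbors on the curve into pages."""
    ordered = sorted(points, key=lambda p: hilbert_index(n, *p))
    return [ordered[i:i + page_size] for i in range(0, len(ordered), page_size)]

# Illustrative data only: every cell of a 4 x 4 grid
cells = [(x, y) for x in range(4) for y in range(4)]
pages = pack_pages(cells, 4, 4)
```

Because consecutive positions on the Hilbert curve are always adjacent cells, each page covers a compact region, which keeps the bounding rectangles of the resulting data pages small.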

20 Without Preconstructed Index (2)
• ε-kdB tree
  - Space is partitioned into grid cells with grid line distance ε
  - Tree structure is specific to the given ε, and must be constructed for each join
[Figure labels: root, leaves, ε]
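
The benefit of an ε-wide grid can be illustrated with a flat hash grid. This is a simplification and not the ε-kdB tree itself: it keeps only the partitioning idea, namely that a point can match only points in its own or directly neighboring cells. Euclidean distance is assumed and `grid_join` is a hypothetical name.

```python
import math
import random
from collections import defaultdict
from itertools import product

def grid_join(R, S, eps):
    """Similarity join that compares each r only with S-points in adjacent grid cells."""
    def cell(p):
        return tuple(math.floor(x / eps) for x in p)

    grid = defaultdict(list)                  # built for this eps, for this join only
    for s in S:
        grid[cell(s)].append(s)

    out = []
    for r in R:
        c = cell(r)
        for offset in product((-1, 0, 1), repeat=len(c)):
            neighbor = tuple(ci + oi for ci, oi in zip(c, offset))
            for s in grid.get(neighbor, ()):
                if math.dist(r, s) <= eps:
                    out.append((r, s))
    return out

# Illustrative data only: seeded random points in a 10 x 10 square
rng = random.Random(1)
R = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(50)]
S = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(50)]
pairs = grid_join(R, S, 1.0)
```

As with the ε-kdB tree, the structure depends on ε, so it has to be rebuilt whenever the join threshold changes.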

