Published by Nelson Randall. Modified over 9 years ago.
1
SIMILARITY JOIN
COP6731 Advanced Database Systems
2
Basic Similarity Queries
Range Query: find the items within distance ε of a query object q
k-Nearest-Neighbor (kNN) Query: find the k items most similar to a query object q
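The two basic queries can be sketched with a linear scan. This is a minimal illustration, assuming points are coordinate tuples and similarity is Euclidean distance; the function names `range_query` and `knn_query` are made up for this example and do not come from the slides.

```python
import heapq
import math

def range_query(points, q, eps):
    """All points within distance eps of the query point q."""
    return [p for p in points if math.dist(p, q) <= eps]

def knn_query(points, q, k):
    """The k points closest to q, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, q))

points = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0), (0.5, 0.5)]
print(range_query(points, (0.0, 0.0), 1.0))
print(knn_query(points, (0.0, 0.0), 2))
```

A range query may return any number of items, depending on ε, while a kNN query always returns k items (or fewer if the data set is smaller than k).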
3
Similarity Join
Given two sets, R and S, of data points
Find all pairs (r, s) ∈ R × S such that d(r, s) ≤ ε
Applications: duplicate detection, similarity comparison, etc.
4
Similarity Join - SQL-like Notation
SELECT * FROM R, S WHERE d(R.r, S.s) ≤ ε
ε too small: no results
ε too large: very large result set
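The effect of ε on the result size can be seen with a brute-force join. This is a minimal sketch, assuming Euclidean distance over coordinate tuples; `similarity_join` and the sample data are illustrative only.

```python
import math

def similarity_join(R, S, eps, d=math.dist):
    """Return every pair (r, s) from R x S with d(r, s) <= eps."""
    return [(r, s) for r in R for s in S if d(r, s) <= eps]

R = [(0.0, 0.0), (5.0, 5.0)]
S = [(0.0, 1.0), (5.0, 5.5), (9.0, 9.0)]
for eps in (0.1, 1.0, 20.0):
    # A tiny eps yields nothing; a huge eps degenerates to the cross product.
    print(eps, len(similarity_join(R, S, eps)))
```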
5
k-Closest Pair Query
Given two sets, R and S, of data points
Find the k pairs (r, s) with the smallest distance d(r, s)
r and s are nearest neighbors (NN) of each other
This is also called a distance join
6
k-Closest Pair Query
SQL-like Notation: SELECT * FROM R, S ORDER BY d(R.r, S.s) UNTIL k
Applications
  Find the pairs of people who have the most similar interests
  Find the music scores that are most similar to each other
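The ORDER BY ... UNTIL k semantics can be sketched by ranking all pairs and keeping the first k. This is a brute-force illustration, assuming Euclidean distance; `k_closest_pairs` is a hypothetical name, and a real system would avoid enumerating the full cross product.

```python
import heapq
import math
from itertools import product

def k_closest_pairs(R, S, k):
    """The k pairs (r, s) from R x S with the smallest distance, closest first."""
    return heapq.nsmallest(k, product(R, S), key=lambda pair: math.dist(*pair))

R = [(0.0, 0.0), (5.0, 5.0)]
S = [(0.0, 1.0), (5.0, 5.5), (9.0, 9.0)]
print(k_closest_pairs(R, S, 2))
```

Unlike the similarity join, the result size is fixed by k, so no distance threshold ε has to be chosen in advance.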
7
k-Nearest Neighbor Join
Combine each point with its k nearest neighbors from the other data set
SQL-like Notation: SELECT * FROM R, S GROUP BY R.r GROUP SIZE k ORDER BY d(R.r, S.s)
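The GROUP BY R.r GROUP SIZE k notation corresponds to one kNN query per point of R. A minimal brute-force sketch, assuming Euclidean distance and hashable coordinate tuples; `knn_join` is an illustrative name.

```python
import heapq
import math

def knn_join(R, S, k):
    """Map each r in R to its k nearest neighbors in S, nearest first."""
    return {r: heapq.nsmallest(k, S, key=lambda s: math.dist(r, s)) for r in R}

R = [(0.0, 0.0), (10.0, 0.0)]
S = [(1.0, 0.0), (2.0, 0.0), (9.0, 0.0)]
print(knn_join(R, S, 2))
```

The join is asymmetric: every point of R gets exactly k partners, while a point of S may appear in many groups or in none.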
8
k-Nearest Neighbor Join
9
k-NN Join Applications
k-means clustering
  1. Select k initial centers at random
  2. Assign each database point to its nearest center
  3. Recompute the center of each cluster
  4. Repeat Steps 2 and 3 until convergence
k-NN classification
  Classify new objects according to the majority of their k nearest neighbors
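The four k-means steps can be sketched as follows, with Step 2 written as a 1-NN join of the points against the current centers. This is a simplified illustration, assuming Euclidean distance; `k_means`, `seed`, and `max_iter` are names chosen for this example.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # Step 1: random initial centers
    for _ in range(max_iter):
        # Step 2: a 1-NN join of the points with the current centers.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 3: recompute each center as the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:             # Step 4: stop at convergence
            break
        centers = new_centers
    return centers, clusters

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, clusters = k_means(points, 2)
print(sorted(centers))
```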
10
Nested Loop Join
Simple nested loop
  For each R-point, iterate over the S-points
  Scans S |R| times: very expensive
Nested block loop
  For each page of R-points, iterate over the S-points
  Scans S only |R|/|page| times: more cost-effective
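The difference in the number of S scans can be made concrete with a small in-memory sketch. It assumes a page is just a fixed-size slice of a list and counts one scan of S per R-page; `pages`, `nested_block_loop_join`, and `page_size` are illustrative names.

```python
import math

def pages(points, page_size):
    """Split a point list into consecutive pages of at most page_size points."""
    return [points[i:i + page_size] for i in range(0, len(points), page_size)]

def nested_block_loop_join(R, S, eps, page_size):
    """Join R and S, scanning S once per page of R. Returns (pairs, s_scans)."""
    result, s_scans = [], 0
    for r_page in pages(R, page_size):
        s_scans += 1                        # one full pass over S per R-page
        for s in S:
            for r in r_page:
                if math.dist(r, s) <= eps:
                    result.append((r, s))
    return result, s_scans

R = [(float(i), 0.0) for i in range(10)]
S = [(float(i), 0.5) for i in range(10)]
print(nested_block_loop_join(R, S, 0.6, page_size=1)[1])   # simple nested loop
print(nested_block_loop_join(R, S, 0.6, page_size=4)[1])   # nested block loop
```

With `page_size=1` the sketch behaves like the simple nested loop. Both variants perform the same number of distance computations; only the number of passes over S changes.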
11
Indexed Nested Loop Join
For each R-point, determine matches in S using the index
For a large number of dimensions and/or high selectivity (due to large ε), not as competitive as nested loop join
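The probe-then-refine pattern can be sketched with a very simple stand-in index. Instead of a multidimensional index, this example assumes a sorted array on the first coordinate searched with `bisect`; this is a substitution made for brevity, not the index structure the slides have in mind.

```python
import bisect
import math

def build_index(S):
    """Sort S by first coordinate; returns (sorted points, sorted keys)."""
    ordered = sorted(S)
    return ordered, [s[0] for s in ordered]

def indexed_nested_loop_join(R, S, eps):
    ordered, keys = build_index(S)
    result = []
    for r in R:
        # Index probe: only points with |s[0] - r[0]| <= eps can match.
        lo = bisect.bisect_left(keys, r[0] - eps)
        hi = bisect.bisect_right(keys, r[0] + eps)
        for s in ordered[lo:hi]:
            if math.dist(r, s) <= eps:      # refine the candidates
                result.append((r, s))
    return result

R = [(0.0, 0.0), (5.0, 5.0)]
S = [(0.0, 1.0), (5.0, 5.5), (9.0, 9.0)]
print(indexed_nested_loop_join(R, S, 1.0))
```

As ε grows, the candidate range `ordered[lo:hi]` approaches all of S, so each probe degenerates into a scan; this mirrors the slide's point about large ε.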
12
Spatial Join vs Similarity Join
Represent each data point as a hypercube of edge length 0.71·ε (0.71 ≈ 1/√2, which corresponds to the two-dimensional case)
Map the similarity join with respect to ε to a spatial join on the hypercubes
If two hypercubes overlap, the corresponding points are within distance ε of each other
That is, they are neighbors with respect to ε
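The overlap test can be checked numerically in two dimensions. This sketch assumes each cube is centered on its point, so two cubes of edge length e overlap exactly when the points differ by at most e in every coordinate; `cubes_overlap` is an illustrative name.

```python
import math

def cubes_overlap(p, q, edge):
    """True if axis-aligned cubes of the given edge length centered at p and q intersect."""
    return all(abs(a - b) <= edge for a, b in zip(p, q))

eps = 1.0
edge = eps / math.sqrt(2)        # about 0.71 * eps, the 2-D case
p, q = (0.0, 0.0), (0.5, 0.5)
print(cubes_overlap(p, q, edge), math.dist(p, q) <= eps)
```

Overlap is a sufficient condition here: a coordinate difference of at most ε/√2 in both dimensions bounds the Euclidean distance by ε. The converse does not hold in general under this assumption; for example, (0, 0) and (0.9, 0) are within ε = 1 of each other although their cubes do not overlap.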
13
R-tree Spatial Join (RSJ)
Assumption: index preconstructed on R and S with equal tree height
Procedure RSJ(R, S: page)
  for each r ∈ R.children do
    for each s ∈ S.children do
      if (r ∩ s ≠ ∅) then RSJ(r, s);
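The RSJ procedure can be turned into a runnable sketch. The `Node` class below is a hypothetical, heavily simplified R-tree node (a bounding box plus children), and the base case that reports a pair of data entries is an assumption added so the recursion produces output; the slide shows only the recursive step.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Hypothetical R-tree node: mbr = (lows, highs); no children means a data entry."""
    mbr: tuple
    children: list = field(default_factory=list)

def intersects(a, b):
    """True if two axis-aligned boxes (lows, highs) share at least one point."""
    (alo, ahi), (blo, bhi) = a, b
    return all(l1 <= h2 and l2 <= h1 for l1, h1, l2, h2 in zip(alo, ahi, blo, bhi))

def rsj(R, S, out):
    """Append every intersecting pair of data entries under R and S to out."""
    if not R.children and not S.children:   # assumed base case: two data entries
        out.append((R.mbr, S.mbr))
        return
    for r in R.children:
        for s in S.children:
            if intersects(r.mbr, s.mbr):    # prune non-overlapping subtrees
                rsj(r, s, out)

a1, a2 = Node(((0, 0), (1, 1))), Node(((5, 5), (6, 6)))
b1, b2 = Node(((0.5, 0.5), (2, 2))), Node(((8, 8), (9, 9)))
R = Node(((0, 0), (6, 6)), [a1, a2])
S = Node(((0.5, 0.5), (9, 9)), [b1, b2])
out = []
rsj(R, S, out)
print(out)
```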
14
Adapt RSJ for Similarity Join
Use a distance predicate rather than intersection
mindist(R, S) computes the least possible distance between a point in R and a point in S
Procedure RsimJ(R, S, ε)
  if IsDirPg(R) ∧ IsDirPg(S) then
    for each r ∈ R.children do
      for each s ∈ S.children do
        if mindist(r, s) ≤ ε then RsimJ(r, s, ε); /* recursive */
  else /* R & S are data pages */
    for each p ∈ R.points do
      for each q ∈ S.points do
        if d(p, q) ≤ ε then output(p, q);
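A runnable version of RsimJ needs a concrete mindist. The sketch below assumes pages are described by axis-aligned bounding boxes and that distance is Euclidean; the `Page` class and its fields are hypothetical stand-ins for real index pages.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Page:
    """Hypothetical tree page: a directory page has children, a data page has points."""
    mbr: tuple                       # (lows, highs) bounding box
    children: list = field(default_factory=list)
    points: list = field(default_factory=list)

def mindist(a, b):
    """Smallest possible distance between a point in box a and a point in box b."""
    (alo, ahi), (blo, bhi) = a, b
    gaps = (max(l2 - h1, l1 - h2, 0.0) for l1, h1, l2, h2 in zip(alo, ahi, blo, bhi))
    return math.sqrt(sum(g * g for g in gaps))

def rsimj(R, S, eps, out):
    if R.children and S.children:                    # both directory pages
        for r in R.children:
            for s in S.children:
                if mindist(r.mbr, s.mbr) <= eps:
                    rsimj(r, s, eps, out)            # recursive descent
    else:                                            # both data pages
        for p in R.points:
            for q in S.points:
                if math.dist(p, q) <= eps:
                    out.append((p, q))

leaf_a = Page(((0, 0), (1, 1)), points=[(0.0, 0.0), (1.0, 1.0)])
leaf_b = Page(((10, 10), (11, 11)), points=[(10.0, 10.0)])
leaf_c = Page(((1, 1.5), (2, 2)), points=[(1.0, 1.5), (2.0, 2.0)])
R = Page(((0, 0), (11, 11)), children=[leaf_a, leaf_b])
S = Page(((1, 1.5), (2, 2)), children=[leaf_c])
out = []
rsimj(R, S, 1.0, out)
print(out)
```

Because mindist is a lower bound on every point-to-point distance between the two pages, pruning on `mindist > eps` never discards a qualifying pair.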
15
Performance Issues in R-tree Join
Cost is dominated by point-distance computations, so the join is CPU-bound
Random page accesses can make it perform worse than nested block loop join
16
Parallel Similarity Join
A task corresponds to a pair of tree nodes (data pages or directory pages)
Various task assignment strategies
  Round robin
  Static range assignment
  Dynamic task assignment to achieve load balancing
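The load-balancing difference between round robin and dynamic assignment can be simulated without real threads. This sketch assumes each task (a node pair) has a known cost and models dynamic assignment as "the next task goes to whichever worker becomes idle first"; the cost values are invented for illustration.

```python
import heapq

def round_robin(costs, workers):
    """Assign task i to worker i mod workers; return per-worker total cost."""
    load = [0] * workers
    for i, c in enumerate(costs):
        load[i % workers] += c
    return load

def dynamic(costs, workers):
    """Give each next task to whichever worker becomes idle first."""
    heap = [(0, w) for w in range(workers)]     # (time the worker is free, id)
    for c in costs:
        free_at, w = heapq.heappop(heap)
        heapq.heappush(heap, (free_at + c, w))
    return [t for t, _ in sorted(heap, key=lambda e: e[1])]

costs = [9, 1, 9, 1, 9, 1]     # hypothetical cost of each node-pair task
print(round_robin(costs, 2), dynamic(costs, 2))
```

With these costs, round robin sends all the expensive tasks to the same worker, while the dynamic scheme spreads them out and finishes earlier.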
17
Breadth-First R-tree Join
Shortcomings of RsimJ
  Depth-first traversal is sequential in nature
  No strategy for improving locality in the inner loop, resulting in an inefficient page access pattern
Solution
  Proceed level by level (i.e., breadth-first traversal)
  Determine all relevant pairs for the next level
  Access these relevant pairs in the order of their physical locations in storage
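The level-by-level procedure can be sketched as follows. It assumes both trees have equal height, that each page carries a hypothetical `addr` field giving its physical page number, and that pages are described by bounding boxes; none of these names come from the slides.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Page:
    addr: int                        # hypothetical physical page number on disk
    mbr: tuple                       # (lows, highs) bounding box
    children: list = field(default_factory=list)
    points: list = field(default_factory=list)

def mindist(a, b):
    """Smallest possible distance between a point in box a and a point in box b."""
    (alo, ahi), (blo, bhi) = a, b
    gaps = (max(l2 - h1, l1 - h2, 0.0) for l1, h1, l2, h2 in zip(alo, ahi, blo, bhi))
    return math.sqrt(sum(g * g for g in gaps))

def breadth_first_join(root_r, root_s, eps):
    level = [(root_r, root_s)]
    while level and level[0][0].children:            # still at a directory level
        # Determine all relevant pairs for the next level.
        next_level = [(r, s) for R, S in level
                      for r in R.children for s in S.children
                      if mindist(r.mbr, s.mbr) <= eps]
        # Visit the qualifying pairs in order of their physical location.
        next_level.sort(key=lambda pair: (pair[0].addr, pair[1].addr))
        level = next_level
    return [(p, q) for R, S in level
            for p in R.points for q in S.points if math.dist(p, q) <= eps]

leaf_a = Page(7, ((0, 0), (1, 1)), points=[(0.0, 0.0), (1.0, 1.0)])
leaf_b = Page(3, ((10, 10), (11, 11)), points=[(10.0, 10.0)])
leaf_c = Page(5, ((1, 1.5), (2, 2)), points=[(1.0, 1.5), (2.0, 2.0)])
R = Page(1, ((0, 0), (11, 11)), children=[leaf_a, leaf_b])
S = Page(2, ((1, 1.5), (2, 2)), children=[leaf_c])
print(breadth_first_join(R, S, 1.0))
```

Because the full list of pairs for a level is known before any page of that level is read, the accesses can be sorted, which is not possible in the depth-first recursion.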
18
Reducing Random Access in Breadth-First Traversal
Space is regularly tiled, with a space-filling curve (e.g., Hilbert curve) defined on the tiles
Store the index tree level by level
For each level, store tree nodes according to their space-filling-curve order
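The storage layout can be sketched as a level-by-level walk that sorts each level by curve position. For brevity this example swaps in a Z-order (Morton) key rather than the Hilbert curve named on the slide, and it represents nodes as plain dictionaries with a hypothetical `cell` entry giving the grid tile of the node's center.

```python
def morton_key(x, y, bits=16):
    """Interleave the bits of integer grid coordinates x and y (Z-order position)."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return key

def layout(root):
    """Return node names in storage order: level by level, each level in Z-order."""
    order, level = [], [root]
    while level:
        level.sort(key=lambda n: morton_key(*n["cell"]))
        order.extend(n["name"] for n in level)
        level = [c for n in level for c in n.get("children", [])]
    return order

root = {"name": "root", "cell": (0, 0), "children": [
    {"name": "NE", "cell": (1, 1)},
    {"name": "SW", "cell": (0, 0)},
    {"name": "SE", "cell": (1, 0)},
    {"name": "NW", "cell": (0, 1)},
]}
print(layout(root))
```

Any space-filling curve key could be used in place of `morton_key`; the point of the layout is that nodes close in space on the same level end up close on disk.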
19
Without Preconstructed Index (1)
Tree construction time is often much less than join time, so it can be amortized during the join
Indexes can be constructed temporarily for the join
Techniques include the Hilbert R-tree and the ε-kdB tree
Hilbert R-tree: sort points by space-filling-curve (SFC) order, and pack adjacent points into pages
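The sort-and-pack step can be sketched for two-dimensional integer grid points. The Hilbert position is computed with the commonly used iterative bit-manipulation formulation; `hilbert_index`, `hilbert_pack`, and the grid size `n` are names and assumptions made for this example, and building the upper tree levels from the pages is omitted.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a Hilbert curve over an n x n grid (n a power of 2)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate or flip the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def hilbert_pack(points, page_size, n):
    """Sort integer grid points by Hilbert position and cut the run into pages."""
    ordered = sorted(points, key=lambda p: hilbert_index(n, *p))
    return [ordered[i:i + page_size] for i in range(0, len(ordered), page_size)]

grid = [(x, y) for x in range(4) for y in range(4)]
for page in hilbert_pack(grid, page_size=4, n=4):
    print(page)
```

On this 4 x 4 grid each page of four points covers one 2 x 2 quadrant, which illustrates why packing along the curve tends to give compact leaf pages.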
20
Without Preconstructed Index (2)
ε-kdB tree: space is partitioned into grid cells with grid line distance ε
The tree structure is specific to the given ε, and must be constructed for each join
[Figure: ε-kdB tree, showing the root, the leaves, and the grid spacing ε]
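The benefit of a grid with spacing ε can be sketched with a flat hash grid. This is a simplification that stands in for the ε-kdB tree, which organizes the same cells hierarchically; `grid_cells` and `grid_join` are illustrative names. Because cells have side ε, any pair within distance ε must lie in the same cell or in directly adjacent cells.

```python
import math
from collections import defaultdict
from itertools import product

def grid_cells(points, eps):
    """Hash each point into the grid cell of side eps that contains it."""
    cells = defaultdict(list)
    for p in points:
        cells[tuple(math.floor(c / eps) for c in p)].append(p)
    return cells

def grid_join(R, S, eps):
    """Similarity join that only compares points in the same or adjacent cells."""
    s_cells = grid_cells(S, eps)        # depends on eps: rebuilt for every join
    result = []
    for r in R:
        cell = tuple(math.floor(c / eps) for c in r)
        for offset in product((-1, 0, 1), repeat=len(cell)):
            neighbor = tuple(c + o for c, o in zip(cell, offset))
            for s in s_cells.get(neighbor, ()):
                if math.dist(r, s) <= eps:
                    result.append((r, s))
    return result

R = [(0.0, 0.0), (5.0, 5.0), (2.5, 2.5)]
S = [(0.0, 1.0), (5.0, 5.5), (9.0, 9.0), (3.0, 3.0)]
print(grid_join(R, S, 1.0))
```

The number of neighboring cells grows as 3^d with the dimension d, which is one reason a flat grid like this is only practical for low-dimensional data.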