Published by Trevor Garrett. Modified over 9 years ago.
1
Optimal Dimension Order: A Generic Technique for the Similarity Join
Christian Böhm (1), Florian Krebs (2), and Hans-Peter Kriegel (2)
(1) University for Health Informatics and Technology, Innsbruck
(2) University of Munich
2
Feature-Based Similarity
3
Simple Similarity Queries
Specify a query object q and
Find similar objects: range query
Find the k most similar objects: nearest neighbor query
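The two query types can be sketched with a brute-force scan. This is only an illustration under assumptions not stated on the slide: feature vectors are plain tuples, similarity is measured by Euclidean distance, and the function names `range_query` and `knn_query` are hypothetical.

```python
import heapq
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def range_query(data, query, eps):
    """Return all objects whose distance to the query is at most eps."""
    return [p for p in data if euclidean(p, query) <= eps]

def knn_query(data, query, k):
    """Return the k objects closest to the query, nearest first."""
    return heapq.nsmallest(k, data, key=lambda p: euclidean(p, query))

points = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0), (0.5, 0.5)]
q = (0.0, 0.0)
in_range = range_query(points, q, eps=1.0)
nearest = knn_query(points, q, k=2)
```

Both functions touch every object once; an index structure would replace the linear scan in practice.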
4
Join Applications: Catalogue Matching
Catalogue matching, e.g. of astronomy catalogues R and S
5
Join Applications: Clustering
Clustering (e.g. DBSCAN)
Similarity self-join
6
R-Tree Similarity Join
Depth-first traversal of two trees R and S
[Brinkhoff, Kriegel, Seeger: Efficient Processing of Spatial Joins Using R-trees, SIGMOD Conf. 1993]
7
The ε-kdB-Tree
[Shim, Srikant, Agrawal: High-dimensional Similarity Joins, ICDE 1997]
Assumption: 2 adjacent ε-stripes fit in main memory
Unrealistic for large data sets which are ... clustered, skewed, and high-dimensional
8
Epsilon Grid Order
[Böhm, Braunmüller, Krebs, Kriegel: Epsilon Grid Order. SIGMOD Conf. 2001]
9
Common Properties
Decomposition of data/space into regions
Regions described by hyper-rectangles
  for each pair (P,Q) of partitions having dist(P,Q) ≤ ε
    for each pair of points (p,q) on (P,Q)
      test dist(p,q) ≤ ε ;
Most CPU effort is spent in the distance test between vectors
Idea: speed up the distance test
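The shared loop structure can be sketched as follows. The sketch assumes that each partition is simply a list of points, that its region is the minimum bounding hyper-rectangle of those points, and that distances are Euclidean; `bounding_box`, `box_distance`, and `similarity_join` are hypothetical names, not taken from any of the cited algorithms.

```python
import math
from itertools import product

def bounding_box(points):
    """Per-dimension (low, high) intervals enclosing all points."""
    return [(min(c), max(c)) for c in zip(*points)]

def box_distance(box_p, box_q):
    """Smallest possible Euclidean distance between two hyper-rectangles."""
    total = 0.0
    for (lo_p, hi_p), (lo_q, hi_q) in zip(box_p, box_q):
        gap = max(lo_q - hi_p, lo_p - hi_q, 0.0)  # 0 if the intervals overlap
        total += gap * gap
    return math.sqrt(total)

def similarity_join(partitions_r, partitions_s, eps):
    """Report all pairs (p, q) with dist(p, q) <= eps."""
    result = []
    for part_p, part_q in product(partitions_r, partitions_s):
        if box_distance(bounding_box(part_p), bounding_box(part_q)) > eps:
            continue  # no point pair of these two partitions can match
        for p, q in product(part_p, part_q):
            if math.dist(p, q) <= eps:  # the CPU-intensive distance test
                result.append((p, q))
    return result

partitions_r = [[(0.0, 0.0), (0.2, 0.1)], [(5.0, 5.0)]]
partitions_s = [[(0.1, 0.0)], [(5.0, 5.3), (9.0, 9.0)]]
pairs = similarity_join(partitions_r, partitions_s, eps=0.5)
```

In this toy input, two of the four partition pairs are pruned by the bounding-box test, so the inner distance test runs only on the remaining two.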
10
Related Work: Plane Sweep for Polygons
[Shin, Moon, Lee: Adaptive Multi-Stage Distance Join Processing, SIGMOD Conf. 2000]
Observations:
More efficient to use the x-axis as sweep direction
Projection of the polygons onto the y-axis yields high overlap
Decide by projections of the bounding boxes (integrate a pdf)
11
Feature Vectors in the Similarity Join
Distance computation between feature vectors p, q:
  for (i = 0 ; i < d ; i++) {
    dist = dist + (p[i] - q[i])^2 ;
    if (dist > ε^2) break ;
  }
Order dimensions by mating probability (increasing)
[Figure: axes d0, d1]
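A minimal sketch of the distance test with early termination, extended by an explicit dimension order. The function name `within_epsilon`, the `order` parameter, and the returned counter of evaluated dimensions are illustrative assumptions; the counter is there only to make the effect of the order visible.

```python
def within_epsilon(p, q, eps, order=None):
    """Test dist(p, q) <= eps, stopping once the partial sum exceeds eps^2.

    Returns (matches, dimensions_evaluated).
    """
    if order is None:
        order = range(len(p))  # natural order d0, d1, ...
    limit = eps * eps
    dist_sq = 0.0
    evaluated = 0
    for i in order:
        diff = p[i] - q[i]
        dist_sq += diff * diff
        evaluated += 1
        if dist_sq > limit:
            return False, evaluated  # early termination
    return True, evaluated

p = (0.1, 0.2, 0.3, 9.0)
q = (0.1, 0.2, 0.3, 0.0)
natural = within_epsilon(p, q, eps=1.0)
reordered = within_epsilon(p, q, eps=1.0, order=[3, 0, 1, 2])
```

With the natural order the test needs all four dimensions to reject the pair; when the discriminating dimension comes first, it stops after one.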
12
Computation of the Mating Probability
To determine the mating probability for d_i:
Project the bounding boxes onto the d_i-axis
[Figure: axes d0, d1]
13
Computation of the Mating Probability
To determine the mating probability for d_i:
Project the bounding boxes onto the d_i-axis
Consider two projections in 2-dimensional space
[Figure: projections onto the d0-axis]
14
Computation of the Mating Probability
To determine the mating probability for d_i:
Project the bounding boxes onto the d_i-axis
Consider two projections in 2-dimensional space
The d0-projection of each point pair is located in this event space
[Figure: event space with axes d0[P], d0[Q]]
15
Computation of the Mating Probability
To determine the mating probability for d_i:
Project the bounding boxes onto the d_i-axis
Consider two projections in 2-dimensional space
The d0-projection of each point pair is located in this event space
Mating point pairs lie on the ε-stripe
[Figure: event space with axes d0[P], d0[Q]; ε-stripe between the lines y = x − ε and y = x + ε]
16
Computation of the Mating Probability
To determine the mating probability for d_i:
Project the bounding boxes onto the d_i-axis
Consider two projections in 2-dimensional space
Mating probability for d0
[Figure: event space with axes d0[P], d0[Q]]
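One way to turn the picture into a number is to take the share of the event space that is covered by the ε-stripe. The sketch below assumes that point projections are uniformly distributed within each bounding-box interval, which the slides do not state explicitly; [a, b] is the projection of P, [c, d] the projection of Q, and all function names are hypothetical.

```python
def _clamped_integral(u, h):
    """Antiderivative of clamp(u, 0, h), taken to be 0 for u <= 0."""
    if u <= 0.0:
        return 0.0
    if u <= h:
        return 0.5 * u * u
    return h * u - 0.5 * h * h

def _area_below(a, b, c, d, t):
    """Area of {(x, y) in [a, b] x [c, d] : y - x <= t}."""
    h = d - c
    return _clamped_integral(b + t - c, h) - _clamped_integral(a + t - c, h)

def mating_probability(a, b, c, d, eps):
    """P(|x - y| <= eps) for x uniform on [a, b] and y uniform on [c, d]."""
    if b <= a or d <= c:
        raise ValueError("both projections need a positive extent")
    # Stripe area = area below the upper boundary minus area below the lower one.
    stripe_area = _area_below(a, b, c, d, eps) - _area_below(a, b, c, d, -eps)
    return stripe_area / ((b - a) * (d - c))

overlapping = mating_probability(0.0, 1.0, 0.0, 1.0, eps=0.5)
adjacent = mating_probability(0.0, 1.0, 1.0, 2.0, eps=0.5)
far_apart = mating_probability(0.0, 1.0, 3.0, 4.0, eps=1.0)
```

The closed form avoids a case distinction by writing the stripe area as the difference of two half-plane areas, each of which is the integral of a clamped linear function.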
17
Optimal Dimension Order
For a given pair (P,Q) of partitions, the optimal dimension order (ODO) is the sequence of dimensions with increasing mating probability
Algorithm:
  for each pair (P,Q) of partitions having dist(P,Q) ≤ ε
    determine ODO ;
    for each pair of points (p,q) on (P,Q)
      test dist(p,q) ≤ ε using ODO ;
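The inner part of the algorithm (determine ODO, then test all point pairs using it) might look like the sketch below. It assumes bounding boxes given as lists of (low, high) intervals with positive extent in every dimension and a uniform distribution inside each interval; the helper `mating_probability` is a compact stand-in for the area ratio in the event space, and every name here is hypothetical.

```python
def mating_probability(lo_p, hi_p, lo_q, hi_q, eps):
    """Share of [lo_p, hi_p] x [lo_q, hi_q] with |x - y| <= eps (uniform model)."""
    h = hi_q - lo_q

    def g(u):  # antiderivative of clamp(u, 0, h)
        if u <= 0.0:
            return 0.0
        return 0.5 * u * u if u <= h else h * u - 0.5 * h * h

    def below(t):  # area of the part of the rectangle with y - x <= t
        return g(hi_p + t - lo_q) - g(lo_p + t - lo_q)

    return (below(eps) - below(-eps)) / ((hi_p - lo_p) * h)

def optimal_dimension_order(box_p, box_q, eps):
    """Dimensions sorted by increasing mating probability."""
    probs = [mating_probability(*ip, *iq, eps) for ip, iq in zip(box_p, box_q)]
    return sorted(range(len(probs)), key=probs.__getitem__)

def join_partition_pair(points_p, points_q, box_p, box_q, eps):
    """Join two partitions, testing dimensions in ODO with early termination."""
    order = optimal_dimension_order(box_p, box_q, eps)  # once per partition pair
    limit = eps * eps
    result = []
    for p in points_p:
        for q in points_q:
            dist_sq = 0.0
            for i in order:
                dist_sq += (p[i] - q[i]) ** 2
                if dist_sq > limit:
                    break  # pair rejected early
            else:
                result.append((p, q))  # loop finished: dist(p, q) <= eps
    return result

box_p = [(0.0, 1.0), (0.0, 1.0)]
box_q = [(0.0, 1.0), (0.9, 1.9)]
order = optimal_dimension_order(box_p, box_q, eps=0.2)
pairs = join_partition_pair(
    [(0.5, 0.95), (0.5, 0.1)], [(0.6, 1.0), (0.5, 1.8)], box_p, box_q, eps=0.2
)
```

In this example the boxes overlap fully in d0 but only marginally in d1, so d1 has the lower mating probability and is tested first. The order is computed once per partition pair, so its cost is amortized over all point pairs of that pair.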
18
Shape of the Intersection Area
20 different shapes are possible, e.g. 1223, 2233, 2223
Easy proof of completeness and efficient case distinction by assigning codes to the corners:
1: Corner is left of or above the ε-stripe
2: Corner is on the ε-stripe
3: Corner is right of or below the ε-stripe
Easy formulas (only 45° and 90° angles)
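The corner codes can be sketched as follows. The corner order (upper-left, lower-left, upper-right, lower-right) and the orientation (x horizontal, y vertical, stripe defined by |y − x| ≤ ε) are assumptions, since the slide text does not state them; `corner_code` and `shape_code` are hypothetical names. The enumeration only applies a necessary monotonicity condition, and its count agrees with the 20 shapes stated on the slide.

```python
from itertools import product

def corner_code(x, y, eps):
    """1: left of/above the stripe, 2: on the stripe, 3: right of/below it."""
    if y - x > eps:
        return 1
    if y - x < -eps:
        return 3
    return 2

def shape_code(lo_x, hi_x, lo_y, hi_y, eps):
    """Codes of the four corners in the assumed order UL, LL, UR, LR."""
    corners = [(lo_x, hi_y), (lo_x, lo_y), (hi_x, hi_y), (hi_x, lo_y)]
    return "".join(str(corner_code(x, y, eps)) for x, y in corners)

# Moving right or down can only increase a corner's code, hence
# UL <= LL <= LR and UL <= UR <= LR must hold for every rectangle.
possible_shapes = [
    f"{ul}{ll}{ur}{lr}"
    for ul, ll, ur, lr in product((1, 2, 3), repeat=4)
    if ul <= ll <= lr and ul <= ur <= lr
]

unit_square = shape_code(0.0, 1.0, 0.0, 1.0, eps=0.5)
```

A four-character code can then serve as the key of a lookup table that selects the area formula for the corresponding shape.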
19
Experimental Evaluation: R-tree Similarity Join
8-dimensional data, uniformly distributed
20
Experimental Evaluation: R-tree Similarity Join
16-dimensional data from CAD similarity search
21
Experimental Evaluation: Scalability
MuX, uniform data
Z-RSJ, uniform data
22
Experimental Evaluation: Scalability
EGO, CAD data
23
Conclusion
Conclusion:
The similarity join is an important database primitive for knowledge discovery in databases
Many different basic algorithms exist
Most of them can be accelerated by our optimal dimension order
Future Work:
New applications of the similarity join
Further optimization (multi-parameter) of the similarity join
Parallel and distributed environments