23 1 Optimal Dimension Order: A Generic Technique for the Similarity Join. Christian Böhm (1), Florian Krebs (2), and Hans-Peter Kriegel (2). (1) University for Health Informatics and Technology, Innsbruck; (2) University of Munich.

23 2 Feature Based Similarity

23 3 Simple Similarity Queries. Specify a query object and either find all similar objects (range query) or find the k most similar objects (nearest neighbor query).
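The two query types on this slide can be sketched as plain linear scans. This is only an illustration: the Euclidean metric and the function names `range_query` and `knn_query` are assumptions, not taken from the slides, and a real system would use an index instead of scanning.

```python
import heapq
import math


def range_query(data, query, eps):
    """Return all points whose (assumed Euclidean) distance to query is <= eps."""
    return [p for p in data if math.dist(p, query) <= eps]


def knn_query(data, query, k):
    """Return the k points closest to query, nearest first."""
    return heapq.nsmallest(k, data, key=lambda p: math.dist(p, query))


points = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0), (10.0, 10.0)]
in_range = range_query(points, (0.0, 0.0), 1.0)
nearest = knn_query(points, (0.0, 0.0), 3)
```

A range query fixes the radius and lets the result size vary; a nearest neighbor query fixes the result size and lets the radius vary.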

23 4 Join Applications: Catalogue Matching. E.g. matching two astronomy catalogues R and S.

23 5 Join Applications: Clustering. Clustering (e.g. DBSCAN) as a similarity self-join.

23 6 R-Tree Similarity Join. Depth-first traversal of the two trees over R and S [Brinkhoff, Kriegel, Seeger: Efficient Processing of Spatial Joins Using R-trees, SIGMOD Conf. 1993].

23 7 The ε-kdB-Tree [Shim, Srikant, Agrawal: High-dimensional Similarity Joins, ICDE 1997]. Assumption: 2 adjacent ε-stripes fit in main memory. This is unrealistic for large data sets which are clustered, skewed and high-dimensional.

23 8 Epsilon Grid Order [Böhm, Braunmüller, Krebs, Kriegel: Epsilon Grid Order. SIGMOD Conf. 2001]

23 9 Common Properties. Decomposition of data/space into regions. Regions described by hyper-rectangles.
    for each pair (P,Q) of partitions having dist(P,Q) ≤ ε
        for each pair of points (p,q) on (P,Q)
            test dist(p,q) ≤ ε ;
Most CPU effort goes into the distance test between vectors. Idea: speed up the distance test.
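The common scheme above can be sketched as follows. This is a minimal sketch under assumptions: partitions are given as plain lists of points, the metric is Euclidean, and the helper names (`bounding_box`, `mindist`, `similarity_join`) are hypothetical. The index-specific part of each algorithm (R-tree, ε-kdB-tree, Epsilon Grid Order) is replaced here by a loop over all partition pairs.

```python
import math
from itertools import product


def bounding_box(points):
    """Hyper-rectangle (lo, hi) enclosing a non-empty list of points."""
    lo = tuple(min(c) for c in zip(*points))
    hi = tuple(max(c) for c in zip(*points))
    return lo, hi


def mindist(box_p, box_q):
    """Minimum Euclidean distance between two hyper-rectangles."""
    (lo_p, hi_p), (lo_q, hi_q) = box_p, box_q
    total = 0.0
    for a_lo, a_hi, b_lo, b_hi in zip(lo_p, hi_p, lo_q, hi_q):
        gap = max(b_lo - a_hi, a_lo - b_hi, 0.0)  # 0 if the intervals overlap
        total += gap * gap
    return math.sqrt(total)


def similarity_join(parts_r, parts_s, eps):
    """Report all pairs (p, q) with dist(p, q) <= eps."""
    result = []
    for P, Q in product(parts_r, parts_s):
        # Partition-level filter: skip pairs of regions that are too far apart.
        if mindist(bounding_box(P), bounding_box(Q)) > eps:
            continue
        # Point-level distance test: this is where most CPU time is spent.
        for p, q in product(P, Q):
            if math.dist(p, q) <= eps:
                result.append((p, q))
    return result


parts_r = [[(0.0, 0.0), (0.5, 0.5)], [(10.0, 10.0)]]
parts_s = [[(0.4, 0.5)], [(10.0, 10.5), (20.0, 20.0)]]
pairs = similarity_join(parts_r, parts_s, 0.6)
```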

23 10 Related Work: Plane Sweep for Polygons [Shin, Moon, Lee: Adaptive Multi-Stage Distance Join Processing, SIGMOD Conf. 2000]. Observations: It is more efficient to use the x-axis as sweep direction. The projections of the polygons onto the y-axis yield high overlap. Decide by the projections of the bounding boxes (integrate a pdf).

23 11 Feature Vectors in the Similarity Join. Distance computation between feature vectors p,q:
    for (i = 0 ; i < d ; i++) {
        dist = dist + (p[i] - q[i])^2 ;
        if (dist > ε^2) break ;
    }
Order dimensions by mating probability (increasing). (Figure: axes d0 and d1.)
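The early-abort distance test can be sketched as below. The function name `within_eps_counted` and the returned step counter are illustrative additions, not part of the slides; the counter only makes visible how much work a given dimension order saves.

```python
def within_eps_counted(p, q, eps, order=None):
    """Early-abort test of dist(p, q) <= eps (squared Euclidean, assumed).

    Returns (is_within, number_of_dimensions_evaluated).
    `order` is the sequence of dimension indices to visit; default is 0..d-1.
    """
    eps_sq = eps * eps
    dims = order if order is not None else range(len(p))
    dist_sq = 0.0
    steps = 0
    for i in dims:
        diff = p[i] - q[i]
        dist_sq += diff * diff
        steps += 1
        if dist_sq > eps_sq:
            return False, steps  # abort: the remaining dimensions cannot help
    return True, steps


p = (0.0, 0.0, 5.0)
q = (0.0, 0.0, 0.0)
natural = within_eps_counted(p, q, 1.0)              # visits dimensions 0, 1, 2
reordered = within_eps_counted(p, q, 1.0, (2, 0, 1))  # visits dimension 2 first
```

Visiting the most selective dimension first lets the loop abort after one step instead of three, which is the effect the ordering by increasing mating probability aims for.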

23 12 Computation of the Mating Probability. To determine the mating probability for dimension d_i: Project the bounding boxes onto the d_i-axis. (Figure: axes d0 and d1.)

23 13 Computation of the Mating Probability. To determine the mating probability for dimension d_i: Project the bounding boxes onto the d_i-axis. Consider the two projections in a 2-dimensional space.

23 14 Computation of the Mating Probability. To determine the mating probability for dimension d_i: Project the bounding boxes onto the d_i-axis. Consider the two projections in a 2-dimensional space. The d_0-projection of each point pair is located in this event space (axes d0[P] and d0[Q]).

23 15 Computation of the Mating Probability. To determine the mating probability for dimension d_i: Project the bounding boxes onto the d_i-axis. Consider the two projections in a 2-dimensional space. The d_0-projection of each point pair is located in this event space (axes d0[P] and d0[Q]). Mating point pairs lie on the ε-stripe: y ≥ x − ε and y ≤ x + ε.

23 16 Computation of the Mating Probability. To determine the mating probability for dimension d_i: Project the bounding boxes onto the d_i-axis. Consider the two projections in a 2-dimensional space. Mating probability for d_0: the share of the event space (axes d0[P] and d0[Q]) that lies on the ε-stripe.
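One way to compute this quantity is sketched below, assuming that points are uniformly distributed within each projection, so that the mating probability equals the area of the event space covered by the ε-stripe divided by the total area of the event space. The uniformity assumption, the closed-form integration and the function names are illustrative choices; the slides themselves use a case distinction over the shape of the intersection area.

```python
def _clamp_integral(z, h):
    """Antiderivative of clamp(z, 0, h), with value 0 for z <= 0."""
    if z <= 0.0:
        return 0.0
    if z <= h:
        return 0.5 * z * z
    return h * z - 0.5 * h * h


def mating_probability(p_lo, p_hi, q_lo, q_hi, eps):
    """Share of the event space [p_lo, p_hi] x [q_lo, q_hi] with |y - x| <= eps.

    Assumes both projections have positive length.
    """
    w = p_hi - p_lo
    h = q_hi - q_lo
    if w <= 0 or h <= 0:
        raise ValueError("projections must have positive length")

    def area_below(t):
        # Area of {(x, y) in the event space : y <= x + t}.
        return (_clamp_integral(p_hi + t - q_lo, h)
                - _clamp_integral(p_lo + t - q_lo, h))

    return (area_below(eps) - area_below(-eps)) / (w * h)


same_interval = mating_probability(0.0, 1.0, 0.0, 1.0, 0.5)
adjacent = mating_probability(0.0, 1.0, 1.0, 2.0, 0.5)
far_apart = mating_probability(0.0, 1.0, 5.0, 6.0, 1.0)
```

For two identical unit intervals and ε = 0.5, only the two corner triangles of side 0.5 fall outside the stripe, so the result is 1 − 0.25 = 0.75.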

23 17 Optimal Dimension Order. For a given pair (P,Q) of partitions, the optimal dimension order ODO is the sequence of dimensions with increasing mating probability. Algorithm:
    for each pair (P,Q) of partitions having dist(P,Q) ≤ ε
        determine ODO ;
        for each pair of points (p,q) on (P,Q)
            test dist(p,q) ≤ ε using ODO ;
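The algorithm for a single partition pair can be sketched as follows. As before, this is a sketch under assumptions: the mating probability is computed as an area ratio under a uniform distribution, bounding boxes have positive extent in every dimension, the metric is Euclidean, and all function names are hypothetical.

```python
def _clamp_integral(z, h):
    """Antiderivative of clamp(z, 0, h), with value 0 for z <= 0."""
    if z <= 0.0:
        return 0.0
    return 0.5 * z * z if z <= h else h * z - 0.5 * h * h


def mating_probability(p_lo, p_hi, q_lo, q_hi, eps):
    """Share of [p_lo, p_hi] x [q_lo, q_hi] with |y - x| <= eps (uniform assumption)."""
    w, h = p_hi - p_lo, q_hi - q_lo

    def area_below(t):
        return (_clamp_integral(p_hi + t - q_lo, h)
                - _clamp_integral(p_lo + t - q_lo, h))

    return (area_below(eps) - area_below(-eps)) / (w * h)


def optimal_dimension_order(box_p, box_q, eps):
    """Dimension indices sorted by increasing mating probability."""
    (lo_p, hi_p), (lo_q, hi_q) = box_p, box_q
    probs = [mating_probability(lo_p[i], hi_p[i], lo_q[i], hi_q[i], eps)
             for i in range(len(lo_p))]
    return sorted(range(len(probs)), key=probs.__getitem__)


def join_partition_pair(P, Q, box_p, box_q, eps):
    """Join two partitions, running the early-abort test in ODO."""
    order = optimal_dimension_order(box_p, box_q, eps)  # once per partition pair
    eps_sq = eps * eps
    result = []
    for p in P:
        for q in Q:
            dist_sq = 0.0
            for i in order:
                diff = p[i] - q[i]
                dist_sq += diff * diff
                if dist_sq > eps_sq:
                    break  # early abort
            else:
                result.append((p, q))  # loop finished without exceeding eps
    return result


box_p = ((0.0, 0.0), (1.0, 1.0))
box_q = ((0.0, 1.2), (1.0, 2.2))
order = optimal_dimension_order(box_p, box_q, 0.5)
P = [(0.5, 0.9), (0.1, 0.1)]
Q = [(0.6, 1.2), (0.9, 2.0)]
pairs = join_partition_pair(P, Q, box_p, box_q, 0.5)
```

In this example the boxes coincide in dimension 0 but barely come within ε of each other in dimension 1, so dimension 1 has the lower mating probability and is tested first.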

23 18 Shape of the Intersection Area. 20 different shapes are possible, e.g. 1223, 2233, 2223. Easy proof of completeness and efficient case distinction by assigning codes to the corners. 1: Corner is left of or above the ε-stripe. 2: Corner is on the ε-stripe. 3: Corner is right of or below the ε-stripe. Easy formulas (only 45° and 90° angles).
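The corner coding can be sketched as below. The order in which the four corners are written into the code (upper left, lower left, upper right, lower right) is an assumption, since the slide does not state it; the function names are hypothetical as well. Under this assumed order, the first digit can never exceed the two middle digits and the last digit can never be smaller than them, and exactly 20 codes satisfy that constraint, which agrees with the count given on the slide.

```python
from itertools import product


def corner_code(x, y, eps):
    """1: left of/above the eps-stripe, 2: on the stripe, 3: right of/below it."""
    d = y - x
    if d > eps:
        return 1
    if d < -eps:
        return 3
    return 2


def shape_code(p_lo, p_hi, q_lo, q_hi, eps):
    """4-digit code of the event space [p_lo, p_hi] x [q_lo, q_hi].

    Assumed corner order: upper left, lower left, upper right, lower right.
    """
    corners = [(p_lo, q_hi), (p_lo, q_lo), (p_hi, q_hi), (p_hi, q_lo)]
    return "".join(str(corner_code(x, y, eps)) for x, y in corners)


def count_consistent_codes():
    """Number of codes compatible with the assumed corner order."""
    return sum(1 for a, b, c, d in product((1, 2, 3), repeat=4)
               if a <= min(b, c) and max(b, c) <= d)


code_a = shape_code(0.0, 1.0, 0.0, 1.0, 0.5)
code_b = shape_code(0.0, 1.0, -0.4, 0.4, 0.5)
n_codes = count_consistent_codes()
```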

23 19 Experimental Evaluation: R-tree Sim. Join  8-dimensional data, uniformly distributed

23 20 Experimental Evaluation: R-tree Sim. Join  16-dimensional data, from CAD-similarity search

23 21 Experimental Evaluation: Scalability. (Figure panels: MuX, uniform data; Z-RSJ, uniform data.)

23 22 Experimental Evaluation: Scalability EGO, CAD data

23 Conclusion. Conclusion: The similarity join is an important database primitive for knowledge discovery in databases. Many different basic algorithms exist. Most of them can be accelerated by our optimal dimension order. Future work: New applications of the similarity join. Further optimization (multi-parameter) of the similarity join. Parallel and distributed environments.
