Christian Böhm & Hans-Peter Kriegel, Ludwig Maximilians Universität München A Cost Model and Index Architecture for the Similarity Join.


1 Christian Böhm & Hans-Peter Kriegel, Ludwig Maximilians Universität München A Cost Model and Index Architecture for the Similarity Join

2 Feature Based Similarity

3 Simple Similarity Queries
• Specify a query object q and
  – find similar objects (range query), or
  – find the k most similar objects (nearest neighbor query)
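The two query types can be illustrated with a minimal brute-force sketch. It assumes feature vectors are plain tuples compared by Euclidean distance; the function names and sample points are hypothetical and not taken from the slides.

```python
import heapq
import math

def range_query(points, q, eps):
    """Return all points within Euclidean distance eps of the query object q."""
    return [p for p in points if math.dist(p, q) <= eps]

def knn_query(points, q, k):
    """Return the k points most similar to q, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, q))

# Hypothetical 2-D feature vectors.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
q = (0.0, 0.0)
in_range = range_query(points, q, 1.5)
nearest = knn_query(points, q, 2)
```

Both functions scan every point, which is exactly the cost an index structure tries to avoid.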

4 Join Applications: Catalogue Matching
• Catalogue matching, e.g. of astronomic catalogues R and S

5 Join Applications: Clustering
• Clustering (e.g. DBSCAN)
• Similarity self-join
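As a baseline for both applications, the similarity join and the similarity self-join can be sketched as plain nested loops. This assumes Euclidean distance and a join radius eps; the function names and sample data are hypothetical.

```python
import math

def similarity_join(R, S, eps):
    """All pairs (p, q) with p in R, q in S and distance at most eps."""
    return [(p, q) for p in R for q in S if math.dist(p, q) <= eps]

def similarity_self_join(R, eps):
    """All unordered pairs of distinct objects in R within distance eps."""
    pairs = []
    for i, p in enumerate(R):
        for q in R[i + 1:]:
            if math.dist(p, q) <= eps:
                pairs.append((p, q))
    return pairs

# Hypothetical catalogues.
R = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
S = [(0.0, 0.5), (5.0, 5.2)]
matches = similarity_join(R, S, 1.0)
neighbors = similarity_self_join(R, 1.0)
```

A density-based clustering method such as DBSCAN needs the eps-neighborhood of each point, which is the information the self-join result contains.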

6 R-tree Spatial Join (RSJ)
procedure r_tree_sim_join (R, S, ε)
  if IsDirpg (R) ∧ IsDirpg (S) then
    foreach r ∈ R.children do
      foreach s ∈ S.children do
        if mindist (r,s) ≤ ε then
          CacheLoad(r); CacheLoad(s);
          r_tree_sim_join (r,s,ε) ;
  else (* assume R,S both DataPg *)
    foreach p ∈ R.points do
      foreach q ∈ S.points do
        if |p − q| ≤ ε then report (p,q);
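The RSJ pseudocode can be turned into a small runnable sketch. The `Node` class, the helper `data_page`, and the sample data are assumptions made for illustration; cache loading is omitted, and, as in the pseudocode, both trees are assumed to have the same height.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    """A page with a bounding box; directory pages hold children, data pages hold points."""
    lo: tuple
    hi: tuple
    children: list = field(default_factory=list)
    points: list = field(default_factory=list)

def mindist(a, b):
    """Smallest possible Euclidean distance between two bounding boxes."""
    total = 0.0
    for alo, ahi, blo, bhi in zip(a.lo, a.hi, b.lo, b.hi):
        gap = max(blo - ahi, alo - bhi, 0.0)
        total += gap * gap
    return math.sqrt(total)

def r_tree_sim_join(R, S, eps, out):
    if R.children and S.children:            # both are directory pages
        for r in R.children:
            for s in S.children:
                if mindist(r, s) <= eps:     # only mating page pairs are followed
                    r_tree_sim_join(r, s, eps, out)
    else:                                    # assume both are data pages
        for p in R.points:
            for q in S.points:
                if math.dist(p, q) <= eps:
                    out.append((p, q))

def data_page(points):
    """Hypothetical helper: build a data page with a tight bounding box."""
    dims = range(len(points[0]))
    lo = tuple(min(p[i] for p in points) for i in dims)
    hi = tuple(max(p[i] for p in points) for i in dims)
    return Node(lo, hi, points=list(points))

# Two tiny one-level trees with hand-picked root boxes.
R = Node((0.0, 0.0), (11.0, 11.0), children=[
    data_page([(0.0, 0.0), (1.0, 1.0)]),
    data_page([(10.0, 10.0), (11.0, 11.0)]),
])
S = Node((1.5, 1.0), (21.0, 21.0), children=[
    data_page([(1.5, 1.0), (2.0, 2.0)]),
    data_page([(20.0, 20.0), (21.0, 21.0)]),
])
pairs = []
r_tree_sim_join(R, S, 1.0, pairs)
```

Only one of the four page pairs has a mindist within eps, so only that pair's points are compared.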

7 Cost Modeling
• Single similarity queries: the access probability of pages is modeled using the concept of the Minkowski sum

8 Cost Modeling
• Binomial formula: [formula not preserved in the transcript]
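Since the slide's formula did not survive, the following sketch uses the standard binomial expression for the volume of the Minkowski sum of a d-dimensional cube with side length a and a sphere with radius eps. It may differ in notation or detail from the formula on the original slide. Interpreting the volume as an access probability additionally assumes uniformly distributed queries in a unit data space and ignores boundary effects.

```python
import math

def sphere_volume(k, r):
    """Volume of a k-dimensional sphere with radius r (equals 1 for k = 0)."""
    return math.pi ** (k / 2) / math.gamma(k / 2 + 1) * r ** k

def minkowski_cube_sphere(d, a, eps):
    """Volume of the Minkowski sum of a d-dim cube (side a) and a sphere (radius eps)."""
    return sum(
        math.comb(d, k) * a ** (d - k) * sphere_volume(k, eps)
        for k in range(d + 1)
    )

def access_probability(d, a, eps):
    """Assumed model: probability that a range query with radius eps touches the page."""
    return min(1.0, minkowski_cube_sphere(d, a, eps))

# In 2-D the sum reduces to a^2 + 4*a*eps + pi*eps^2.
volume_2d = minkowski_cube_sphere(2, 1.0, 0.5)
```

The k-th term covers the parts of the enlarged region that extend the cube in exactly k dimensions, which is where the binomial coefficient comes from.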

9 Cost Modeling
• Mating probability of index pages:
  – Probability that the distance between two pages is ≤ ε
  – Two-fold application of the Minkowski sum
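One way to read the two-fold application is sketched below, under simplifying assumptions that are not from the slides: both pages are cubes with the same side length a, their positions are uniform in a unit data space, and boundary effects are ignored. The first Minkowski sum (cube with cube) yields a cube of side 2a; the second enlarges it by a sphere of radius eps.

```python
import math

def sphere_volume(k, r):
    """Volume of a k-dimensional sphere with radius r."""
    return math.pi ** (k / 2) / math.gamma(k / 2 + 1) * r ** k

def mating_probability(d, a, eps):
    """Assumed model: probability that two cubic pages of side a lie within distance eps."""
    side = 2 * a  # first Minkowski sum: cube (side a) with cube (side a)
    volume = sum(  # second Minkowski sum: enlarge the result by a sphere of radius eps
        math.comb(d, k) * side ** (d - k) * sphere_volume(k, eps)
        for k in range(d + 1)
    )
    return min(1.0, volume)

prob_2d = mating_probability(2, 0.1, 0.05)
```

Under these assumptions two pages mate exactly when the difference of their centers falls inside that enlarged region, so its volume serves as the probability.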

10 Page Capacity Optimization
• The cost model can determine the index selectivity, which depends on various parameters
• Page capacity (number of stored points) is an important parameter
• Known from similarity search: page capacity optimization yields considerable improvement

11 Analysis of the Index Overhead
• Assuming 100% selectivity (the index doesn't work): how much more expensive is index usage?
• CPU: the distance between boxes is more expensive to compute than the distance between points: [formula not preserved in the transcript]
• Smaller capacity ⇒ more box distance computations

12 Analysis of the Index Overhead
• Disk I/O:
  – High constant cost per page access (moving the disk head)
  – A page access is more expensive than continuous reading of a point by a factor of ≈ 10000 / d
  – Smaller capacity ⇒ more disk head movement
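The effect of capacity on I/O overhead can be illustrated with a back-of-the-envelope sketch. It assumes that a random page access costs the head movement (10000 / d point reads, as on the slide) plus the transfer of its points, and compares this with purely sequential reading. The function name and the example values are hypothetical.

```python
def io_overhead_factor(capacity, d, seek_factor=10000.0):
    """Cost of reading one page by random access, relative to reading its points sequentially."""
    page_access = seek_factor / d  # head movement, in units of one sequential point read
    return (page_access + capacity) / capacity

small_pages = io_overhead_factor(capacity=100, d=10)
large_pages = io_overhead_factor(capacity=10000, d=10)
```

With d = 10, pages of 100 points cost 11 times as much as a sequential scan under this model, while pages of 10,000 points cost only 1.1 times as much.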

13 Analysis of the Index Overhead
• What selectivity is needed for the index to pay off?

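14 Optimization
• I/O cost function: [formula not preserved in the transcript] is optimized by: [formula not preserved in the transcript]
• CPU cost function: [formula not preserved in the transcript] is optimized by: [formula not preserved in the transcript]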

15 Optimization
• I/O cost: large capacity optimum (typically several 10,000 points)
• CPU cost: small capacity optimum (typically < 100 points)
• No compromise achievable

16 Multipage Index (MuX)
• CPU performance like a CPU-optimized index
• I/O performance like an I/O-optimized index
• Separate optimization

17 Experimental Evaluation
[Figure panels: Uniform 4D; Uniform 8D]

18 Experimental Evaluation
[Figure panels: CAD Data 16D; Color Images 64D]

19 Conclusions
• Summary
  – High potential for performance gains of the similarity join by page capacity optimization
  – Necessary to separately optimize I/O and CPU
• Future research potential
  – Similarity join for metric index structures
  – Approximate similarity join
  – Parallel similarity join algorithms

20 Consequences
• Assume a given selectivity for I/O optimization
• Page accesses in a nested-block-loop-like style:
  fill cache with pages of R (1 page free) ;
  foreach S-page s do
    if s joins some of the cached R-pg then
      load (s) ;
      foreach joining R-page r in cache do
        if mindist(r,s) ≤ ε then join (r,s) ;
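The nested-block-loop access pattern can be sketched as follows. Pages are modeled as plain lists of points, and the cache is refilled block by block until all R-pages are processed; that outer loop, the helper names, and the sample data are assumptions not spelled out on the slide.

```python
import math

def mbr(page):
    """Bounding box (lo, hi) of a page given as a list of points."""
    dims = range(len(page[0]))
    lo = tuple(min(p[i] for p in page) for i in dims)
    hi = tuple(max(p[i] for p in page) for i in dims)
    return lo, hi

def mindist(a, b):
    """Smallest possible Euclidean distance between two bounding boxes."""
    (alo, ahi), (blo, bhi) = a, b
    return math.sqrt(sum(
        max(bl - ah, al - bh, 0.0) ** 2
        for al, ah, bl, bh in zip(alo, ahi, blo, bhi)
    ))

def block_loop_join(R_pages, S_pages, eps, cache_size):
    """Join R and S page-wise; returns the result pairs and the number of S-page loads."""
    if cache_size < 2:
        raise ValueError("cache must hold at least one R-page and one S-page")
    out, s_loads = [], 0
    block = cache_size - 1                       # one cache page stays free for S
    for start in range(0, len(R_pages), block):
        cached = R_pages[start:start + block]    # fill cache with pages of R
        boxes = [mbr(r) for r in cached]
        for s in S_pages:
            s_box = mbr(s)
            mates = [r for r, b in zip(cached, boxes) if mindist(b, s_box) <= eps]
            if mates:                            # s joins some cached R-page
                s_loads += 1                     # load(s)
                for r in mates:
                    out.extend((p, q) for p in r for q in s
                               if math.dist(p, q) <= eps)
    return out, s_loads

R_pages = [[(0.0, 0.0), (1.0, 1.0)], [(10.0, 10.0), (11.0, 11.0)]]
S_pages = [[(1.5, 1.0), (2.0, 2.0)], [(20.0, 20.0), (21.0, 21.0)]]
result, loads = block_loop_join(R_pages, S_pages, eps=1.0, cache_size=3)
```

Here both R-pages fit into the cache, and only the first S-page has a mate, so a single S-page load suffices.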


