Published by Lindsay Lindsey. Modified over 9 years ago.
1
A Cost Model and Index Architecture for the Similarity Join
Christian Böhm & Hans-Peter Kriegel, Ludwig-Maximilians-Universität München
2
Feature-Based Similarity
3
Simple Similarity Queries
Specify a query object q and:
Find similar objects: range query
Find the k most similar objects: nearest neighbor query
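The two query types can be illustrated with a brute-force sketch. It assumes feature vectors are tuples of floats compared by Euclidean distance; the function names `range_query` and `knn_query` are illustrative and do not come from the slides, which also do not fix a distance metric.

```python
import heapq
import math

def range_query(points, q, eps):
    """Return all points within Euclidean distance eps of the query object q."""
    return [p for p in points if math.dist(p, q) <= eps]

def knn_query(points, q, k):
    """Return the k points closest to q, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, q))

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
q = (0.0, 0.0)
in_range = range_query(points, q, 1.5)   # points at distance <= 1.5 from q
nearest = knn_query(points, q, 3)        # the three nearest points to q
```

Both functions scan every point, which is the baseline an index structure tries to beat.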
4
Join Applications: Catalogue Matching
E.g. astronomical catalogues R and S
5
Join Applications: Clustering
Clustering (e.g. DBSCAN)
Similarity self-join
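As a rough illustration of how a clustering algorithm such as DBSCAN can build on a similarity self-join, the sketch below derives core points from the join result. It is a quadratic brute-force join, not the index-based algorithm of the talk, and `similarity_self_join`, `core_points` and `min_pts` are names assumed here for illustration.

```python
import math
from itertools import combinations

def similarity_self_join(points, eps):
    """Return index pairs (i, j), i < j, whose points are at most eps apart."""
    return [(i, j)
            for (i, p), (j, q) in combinations(enumerate(points), 2)
            if math.dist(p, q) <= eps]

def core_points(points, eps, min_pts):
    """Indices of points with at least min_pts neighbors (itself included)."""
    counts = [1] * len(points)          # every point counts itself
    for i, j in similarity_self_join(points, eps):
        counts[i] += 1
        counts[j] += 1
    return [i for i, c in enumerate(counts) if c >= min_pts]

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
pairs = similarity_self_join(points, 0.2)
cores = core_points(points, 0.2, 3)
```

The join is computed once and reused for all neighborhood counts, instead of issuing one range query per point.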
6
R-tree Spatial Join (RSJ)
procedure r_tree_sim_join (R, S, ε)
  if IsDirpg (R) ∧ IsDirpg (S) then
    foreach r ∈ R.children do
      foreach s ∈ S.children do
        if mindist (r,s) ≤ ε then
          CacheLoad(r); CacheLoad(s);
          r_tree_sim_join (r,s,ε) ;
  else (* assume R,S both DataPg *)
    foreach p ∈ R.points do
      foreach q ∈ S.points do
        if |p − q| ≤ ε then report (p,q);
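A minimal runnable sketch of the recursive join follows. It assumes both trees have the same height, uses Euclidean distance, and omits the `CacheLoad` calls because everything is held in memory. The `Node` class and the helpers `leaf` and `directory` are hypothetical scaffolding for the example, not part of the original algorithm.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    lo: tuple                                      # lower corner of the bounding box
    hi: tuple                                      # upper corner of the bounding box
    children: list = field(default_factory=list)   # filled for directory pages
    points: list = field(default_factory=list)     # filled for data pages

def mindist(a, b):
    """Smallest Euclidean distance between two axis-parallel boxes."""
    gaps = (max(al - bh, bl - ah, 0.0)
            for al, ah, bl, bh in zip(a.lo, a.hi, b.lo, b.hi))
    return math.sqrt(sum(g * g for g in gaps))

def r_tree_sim_join(R, S, eps, out):
    if R.children and S.children:                  # both are directory pages
        for r in R.children:
            for s in S.children:
                if mindist(r, s) <= eps:           # prune distant page pairs
                    r_tree_sim_join(r, s, eps, out)
    else:                                          # assume both are data pages
        for p in R.points:
            for q in S.points:
                if math.dist(p, q) <= eps:
                    out.append((p, q))
    return out

def leaf(points):
    """Build a data page with a tight bounding box (example helper)."""
    lo = tuple(min(c) for c in zip(*points))
    hi = tuple(max(c) for c in zip(*points))
    return Node(lo, hi, points=list(points))

def directory(children):
    """Build a directory page covering its children (example helper)."""
    lo = tuple(min(c) for c in zip(*(ch.lo for ch in children)))
    hi = tuple(max(c) for c in zip(*(ch.hi for ch in children)))
    return Node(lo, hi, children=list(children))

R = directory([leaf([(0.0, 0.0), (1.0, 1.0)]), leaf([(10.0, 10.0), (11.0, 11.0)])])
S = directory([leaf([(1.5, 1.0), (2.0, 2.0)]), leaf([(20.0, 20.0), (21.0, 21.0)])])
result = r_tree_sim_join(R, S, 1.0, [])
```

Only one of the four data-page pairs passes the `mindist` test here, so point distances are computed for that pair alone.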
7
Cost Modeling
Single similarity queries: the access probability of pages is modeled using the concept of the Minkowski sum
8
Cost Modeling
Binomial formula:
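The formula itself is not preserved in this transcript. One standard closed form for the volume of the Minkowski sum of a d-dimensional cube with side length a and a sphere of radius r is a binomial sum over k-dimensional sphere volumes; whether the slide shows exactly this form is an assumption. The sketch below evaluates it, with `minkowski_volume` and `sphere_volume` as illustrative names.

```python
import math

def sphere_volume(k, r):
    """Volume of a k-dimensional sphere of radius r (equals 1 for k = 0)."""
    return math.pi ** (k / 2) / math.gamma(k / 2 + 1) * r ** k

def minkowski_volume(d, a, r):
    """Volume of a d-dimensional cube (side a) enlarged by a sphere (radius r).

    Computed as the sum over k of C(d, k) * a^(d-k) * V_k(r).
    """
    return sum(math.comb(d, k) * a ** (d - k) * sphere_volume(k, r)
               for k in range(d + 1))

# 2-D check: unit square plus four 1 x 1 side strips plus four quarter discs
v2 = minkowski_volume(2, 1.0, 1.0)      # a^2 + 4*a*r + pi*r^2
v_no_query = minkowski_volume(3, 2.0, 0.0)  # radius 0 leaves the plain cube
```

For uniformly distributed query points in a unit data space, and ignoring boundary effects, this volume (capped at 1) can be read as the probability that a range query of radius r touches the page.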
9
Cost Modeling
Mating probability of index pages: the probability that the distance between two pages is at most ε
Two-fold application of the Minkowski sum
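One way to read the two-fold application: enlarging a cubic page region of side a by a second page region of side a gives a cube of side 2a, and enlarging that by a sphere of radius ε gives the region in which the second page's center must lie for the pages to mate. The sketch below evaluates this under strong assumptions (cubic pages of equal size, uniformly distributed page centers in a unit data space, no boundary effects); the names are illustrative.

```python
import math

def minkowski_volume(d, a, r):
    """Volume of a d-dimensional cube (side a) enlarged by a sphere (radius r)."""
    return sum(
        math.comb(d, k) * a ** (d - k)
        * math.pi ** (k / 2) / math.gamma(k / 2 + 1) * r ** k
        for k in range(d + 1)
    )

def mating_probability(d, a, eps):
    """Estimated probability that two pages of side a are within distance eps."""
    # First Minkowski sum: cube + cube = cube of side 2a.
    # Second Minkowski sum: that cube + sphere of radius eps.
    return min(1.0, minkowski_volume(d, 2 * a, eps))

p_small = mating_probability(2, 0.1, 0.05)   # small pages, small eps
p_large = mating_probability(2, 0.5, 0.5)    # volume exceeds 1, so capped
```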
10
Page Capacity Optimization
The cost model can determine the index selectivity, which depends on various parameters
Page capacity (number of stored points) is an important parameter
Known from similarity search: page capacity optimization yields considerable improvement
11
Analysis of the Index Overhead
Assuming 100% selectivity (the index doesn't work): how much more expensive is index usage?
CPU: the distance between boxes is more expensive to compute than the distance between points
Smaller capacity ⇒ more box distance computations
12
Analysis of the Index Overhead
Disk I/O: high constant cost per page access (moving the disk head)
A page access is more expensive than the sequential reading of a point by a factor of 10000 / d
Smaller capacity ⇒ more disk head movement
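The effect of capacity on disk head movement can be made concrete with a toy cost function. It assumes that each page access costs a fixed overhead of 10000 / d point reads (the factor from the slide) and that every point is then transferred at unit cost; this additive model and the name `read_cost` are simplifications introduced here, not the cost model of the talk.

```python
import math

def read_cost(n_points, capacity, d):
    """Cost of reading n_points stored in pages of the given capacity.

    The unit is the time to read one point sequentially.
    """
    page_accesses = math.ceil(n_points / capacity)
    seek_overhead = 10000 / d            # per-page factor taken from the slide
    return page_accesses * seek_overhead + n_points

small_pages = read_cost(1_000_000, 100, 16)      # 10,000 page accesses
large_pages = read_cost(1_000_000, 10_000, 16)   # 100 page accesses
```

With d = 16, pages of 100 points make the read roughly seven times as expensive as pages of 10,000 points under this toy model.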
13
Analysis of the Index Overhead
What selectivity is needed for the index to pay off?
14
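Optimization
I/O cost function and the capacity that optimizes it (formulas not preserved in this transcript)
CPU cost function and the capacity that optimizes it (formulas not preserved in this transcript)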
15
Optimization
I/O cost: optimum at a large capacity (typically several tens of thousands of points)
CPU cost: optimum at a small capacity (typically < 100 points)
No compromise achievable
16
Multipage Index (MuX)
CPU performance like a CPU-optimized index
I/O performance like an I/O-optimized index
Separate optimization
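The idea of separate optimization can be sketched as a two-level page layout: a large page is the unit of I/O, and inside it points are grouped into small buckets that are the unit of CPU pruning. The slide gives no structural details, so the layout, the sorted chunking in `make_page`, and all names below are assumptions for illustration only.

```python
import math

def mbr(points):
    """Minimum bounding rectangle as a (lower corner, upper corner) pair."""
    return (tuple(min(c) for c in zip(*points)),
            tuple(max(c) for c in zip(*points)))

def mindist(a, b):
    """Smallest Euclidean distance between two boxes given as (lo, hi)."""
    gaps = (max(al - bh, bl - ah, 0.0)
            for al, ah, bl, bh in zip(a[0], a[1], b[0], b[1]))
    return math.sqrt(sum(g * g for g in gaps))

def make_page(points, bucket_capacity):
    """One large page, subdivided into small buckets of (box, points)."""
    pts = sorted(points)
    chunks = [pts[i:i + bucket_capacity]
              for i in range(0, len(pts), bucket_capacity)]
    return [(mbr(chunk), chunk) for chunk in chunks]

def join_pages(page_r, page_s, eps):
    """Join two loaded pages; return result pairs and point-distance count."""
    pairs, point_dists = [], 0
    for box_r, pts_r in page_r:
        for box_s, pts_s in page_s:
            if mindist(box_r, box_s) > eps:
                continue                      # whole bucket pair pruned
            for p in pts_r:
                for q in pts_s:
                    point_dists += 1
                    if math.dist(p, q) <= eps:
                        pairs.append((p, q))
    return pairs, point_dists

page_r = make_page([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)], 2)
page_s = make_page([(0.1, 0.1), (0.2, 0.0), (9.0, 9.0), (9.1, 9.0)], 2)
pairs, point_dists = join_pages(page_r, page_s, 0.15)
```

In this example both pages are read with one access each, while the bucket test cuts the point distance computations from 16 to 4.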
17
Experimental Evaluation
Charts: Uniform 4D, Uniform 8D
18
Experimental Evaluation
Charts: CAD Data 16D, Color Images 64D
19
Conclusions
Summary
High potential for performance gains of the similarity join through page capacity optimization
Necessary to optimize I/O and CPU separately
Future research potential
Similarity join for metric index structures
Approximate similarity join
Parallel similarity join algorithms
20
Consequences
Assume for I/O optimization selectivity
Page accesses in a nested-block-loop-like style:
fill cache with pages of R (1 page free) ;
foreach S-page s do
  if s joins some of the cached R-pg then
    load (s) ;
    foreach joining R-page r in cache do
      if mindist(r,s) ≤ ε then join (r,s) ;
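The access pattern can be simulated in memory to count how many S-pages are actually loaded. The sketch assumes that the bounding box of an S-page is known without loading it (for example from a directory), that the cache fill is repeated for successive blocks of R-pages, and that pages are plain lists of points; the outer loop and all names are assumptions, since the slide shows a single cache fill.

```python
import math

def mbr(points):
    """Minimum bounding rectangle as a (lower corner, upper corner) pair."""
    return (tuple(min(c) for c in zip(*points)),
            tuple(max(c) for c in zip(*points)))

def mindist(a, b):
    """Smallest Euclidean distance between two boxes given as (lo, hi)."""
    gaps = (max(al - bh, bl - ah, 0.0)
            for al, ah, bl, bh in zip(a[0], a[1], b[0], b[1]))
    return math.sqrt(sum(g * g for g in gaps))

def nested_block_loop_join(r_pages, s_pages, eps, cache_size):
    """Return the join result and the number of S-page loads."""
    pairs, s_loads = [], 0
    block = cache_size - 1                    # one cache page stays free for S
    for start in range(0, len(r_pages), block):
        cached = r_pages[start:start + block]  # fill cache with pages of R
        for s in s_pages:
            joining = [r for r in cached if mindist(mbr(r), mbr(s)) <= eps]
            if not joining:
                continue                      # s joins no cached page: skip it
            s_loads += 1                      # load (s)
            for r in joining:
                pairs.extend((p, q) for p in r for q in s
                             if math.dist(p, q) <= eps)
    return pairs, s_loads

r_pages = [[(0.0, 0.0), (1.0, 0.0)], [(10.0, 0.0), (11.0, 0.0)],
           [(20.0, 0.0), (21.0, 0.0)]]
s_pages = [[(1.5, 0.0), (2.0, 0.0)], [(50.0, 0.0), (51.0, 0.0)],
           [(20.5, 0.0), (22.0, 0.0)]]
pairs, s_loads = nested_block_loop_join(r_pages, s_pages, 0.6, 3)
```

With a cache of three pages, the R-pages are processed in two blocks and only two of the six possible S-page loads take place.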