Published by Noel Wells. Modified over 9 years ago.
1
Christian Böhm
University for Health Informatics and Technology, Innsbruck
Similarity Search and Data Mining: Database Techniques Supporting Next Decade's Applications
Keynote at iiWAS 2002
2
Part 1: Similarity Search
3
Feature-Based Similarity
4
Simple Similarity Queries
Specify a query object and either
  find all similar objects (range query), or
  find the k most similar objects (nearest neighbor query).
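The two query types can be illustrated with a plain linear scan. This is a minimal sketch, assuming objects are already represented as numeric feature vectors and that similarity means small Euclidean distance; the function names are illustrative and not taken from the talk.

```python
import heapq
import math


def range_query(points, q, eps):
    """Return all points within Euclidean distance eps of the query object q."""
    return [p for p in points if math.dist(p, q) <= eps]


def knn_query(points, q, k):
    """Return the k points closest to q, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, q))


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
in_range = range_query(points, (0.0, 0.0), 1.5)
nearest = knn_query(points, (0.0, 0.0), 2)
```

A scan like this is the baseline that the index structures on the following slides try to beat.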
5
Multidimensional Index Structure (R-tree)
Data Page:
  point 1: x11, x12, x13, ...
  point 2: x21, x22, x23, ...
  point 3: x31, x32, x33, ...
Directory Page:
  rectangle 1, address 1
  rectangle 2, address 2
  rectangle 3, address 3
  rectangle 4, address 4
6
Range Query with Depth-First Traversal
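A depth-first range query over such a hierarchy might look as follows. This is a sketch under assumptions: the `Node` class is a hypothetical in-memory stand-in for directory and data pages, and Euclidean distance is used. A subtree is skipped as soon as its bounding rectangle is farther than eps from the query point.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    lo: tuple                                      # lower corner of the bounding rectangle
    hi: tuple                                      # upper corner of the bounding rectangle
    children: list = field(default_factory=list)   # filled for directory pages
    points: list = field(default_factory=list)     # filled for data pages


def mindist(q, lo, hi):
    """Smallest Euclidean distance from point q to the rectangle [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))


def range_query(node, q, eps, result=None):
    """Depth-first traversal that prunes pages not intersecting the query range."""
    result = [] if result is None else result
    if mindist(q, node.lo, node.hi) > eps:
        return result                              # page cannot contain an answer
    for p in node.points:
        if math.dist(p, q) <= eps:
            result.append(p)
    for child in node.children:
        range_query(child, q, eps, result)
    return result


left = Node((0.0, 0.0), (1.0, 1.0), points=[(0.2, 0.2), (0.9, 0.9)])
right = Node((5.0, 5.0), (6.0, 6.0), points=[(5.5, 5.5)])
root = Node((0.0, 0.0), (6.0, 6.0), children=[left, right])
found = range_query(root, (0.0, 0.0), 0.5)
```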
7
Nearest Neighbor: Priority Algorithm
4 page accesses
[Hjaltason, Samet: Ranking in Spatial Databases, SSD 1995]
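The idea of the priority algorithm is to keep pages and points in one queue ordered by their minimum distance to the query and always to process the closest entry first. The sketch below assumes a hypothetical in-memory `Node` class and Euclidean distance; it is an illustration of the best-first principle, not the cited paper's code.

```python
import heapq
import itertools
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    lo: tuple
    hi: tuple
    children: list = field(default_factory=list)   # directory page entries
    points: list = field(default_factory=list)     # data page entries


def mindist(q, lo, hi):
    """Smallest Euclidean distance from point q to the rectangle [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))


def nearest_neighbor(root, q):
    """Return (point, distance, pages_read) using a best-first priority queue."""
    tie = itertools.count()                        # tie-breaker so heap never compares items
    heap = [(mindist(q, root.lo, root.hi), next(tie), root)]
    pages_read = 0
    while heap:
        dist, _, item = heapq.heappop(heap)
        if not isinstance(item, Node):
            return item, dist, pages_read          # closest entry is a point: done
        pages_read += 1
        for p in item.points:
            heapq.heappush(heap, (math.dist(p, q), next(tie), p))
        for child in item.children:
            heapq.heappush(heap, (mindist(q, child.lo, child.hi), next(tie), child))
    return None, math.inf, pages_read


left = Node((0.0, 0.0), (1.0, 1.0), points=[(0.2, 0.2), (0.9, 0.9)])
right = Node((5.0, 5.0), (6.0, 6.0), points=[(5.5, 5.5)])
root = Node((0.0, 0.0), (6.0, 6.0), children=[left, right])
point, distance, pages_read = nearest_neighbor(root, (4.0, 4.0))
```

In this small example the left page is never read, because the point found in the right page is already closer than the left page's rectangle.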
8
Problems of High-Dim. Index Structures
"Curse of dimensionality": the search performance of an index deteriorates in high-dimensional spaces, until the index is outperformed by a sequential scan.
Solution: optimize various parameters of the index structures.
Needed: a cost model for queries. How many pages are expected to be accessed for
  range queries (with given ε)?
  nearest neighbor queries (with given k)?
9
Cost Estimation (Uniformity/Independence)
Minkowski sum: estimation of the access probability of a page
Nearest neighbor: estimate the distance by the point density
[Böhm: A Cost Model for Query Processing in High-Dimensional Data Spaces, TODS 25(2), 2000]
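As a simplified illustration of the Minkowski-sum idea: assume a unit data space, uniformly distributed query points, and the maximum metric. The Minkowski sum of a page region with side lengths a_i and a query range of radius eps is then a box with side lengths a_i + 2·eps, and its volume is the probability that the page is accessed. The clipping to 1 below is only a crude stand-in for boundary effects and is an assumption of this sketch, not the cited model's treatment.

```python
import math


def access_probability(side_lengths, eps):
    """Volume of the page region enlarged by eps on every side (max metric)."""
    return math.prod(min(a + 2 * eps, 1.0) for a in side_lengths)


def expected_page_accesses(pages, eps):
    """Sum of access probabilities over all pages, each given by its side lengths."""
    return sum(access_probability(sides, eps) for sides in pages)


pages = [(0.5, 0.5), (0.2, 0.2)]
estimate = expected_page_accesses(pages, 0.1)   # about 0.49 + 0.16
```

For a Euclidean query range the enlarged region has rounded corners, so its volume is smaller than this box and needs a more involved formula.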
10
Cost Estimation
Boundary and saturation effects in high-dimensional space (considered by our model extension)
Correlation between attributes (considered by the concept of fractal dimension)
Cluster structure also has an impact on performance
  Currently neglected by our model
  Histograms and similar data descriptions are difficult in high-dimensional space (the number of histogram bins is exponential in the dimensionality)
  Other descriptions of cluster structure (dendrograms)
  Subject to future work
11
Optimization of Index Structures
To prevent index-based query processing from being outperformed by the sequential scan, optimize various parameters such as
  Logical block size of the index pages
  Indexed dimension
  I/O schedule optimization (fast index scan)
  Data quantization
12
Page Size Optimization
13
Page Size Optimization
[Böhm, Kriegel: Dynamically Optimizing High Dimensional Index Structures, EDBT 2000]
14
Optimized Dimension Assignment
[Figure labels: Matching; Hi-dim. Index (R-tree); Inverted List (B-tree)]
Hi-dim. index, problem in hi-dim: too few splits in each dimension
Inverted list, problem in hi-dim: too many results in each dimension
[Berchtold, Böhm, Keim, Kriegel, Xu: Optimal...Tree Striping, DaWaK 2000]
15
Optimized Dimension Assignment
[Figure labels: Matching; Hi-dim. Index (R-tree); Inverted List (B-tree)]
Compromise: a moderate number of R-trees, each indexing a few dimensions. OPTIMIZE!
[Berchtold, Böhm, Keim, Kriegel, Xu: Optimal...Tree Striping, DaWaK 2000]
16
Schedule Optimization (Fast Index Scan)
Range query: the required pages are known from the directory
17
Schedule Optimization (NN Queries)
Current expenses are traded for possible later savings
Start at the 100% page and extend forward and backward
Optimize the cumulated cost balance (CCB)
[Berchtold, Böhm, Jagadish, Kriegel, Sander: Independent Quantization..., ICDE 2000]
18
Quantization
Approximate the points by a quantization grid based on quantiles
Benefit: fewer bits for the representation
Cost: if a grid cell is only partially intersected by the query, the original point data must be accessed
How to choose the grid resolution?
[Weber, Schek, Blott: A Quantitative Analysis and Performance Study..., VLDB 1998]
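A quantile-based grid can be sketched as follows: in each dimension, cell boundaries are placed so that every slice holds roughly the same number of points, and a point is then stored as a tuple of small cell numbers. The function names and the `bits` parameter (bits per dimension) are assumptions of this sketch.

```python
import bisect


def quantile_boundaries(values, bits):
    """Return the 2**bits - 1 inner cell boundaries for one dimension."""
    cells = 2 ** bits
    ordered = sorted(values)
    return [ordered[(i * len(ordered)) // cells] for i in range(1, cells)]


def quantize(point, boundaries_per_dim):
    """Map a point to its grid cell, one cell number per dimension."""
    return tuple(bisect.bisect_right(b, x) for x, b in zip(point, boundaries_per_dim))


points = [(0.1, 8.0), (0.2, 6.0), (0.3, 4.0), (0.4, 2.0)]
bits = 1                                         # two cells per dimension
boundaries = [quantile_boundaries([p[d] for p in points], bits) for d in range(2)]
cells = [quantize(p, boundaries) for p in points]
```

More bits per dimension give tighter approximations, so fewer exact points must be fetched, but the approximation file itself grows; this is the trade-off behind the grid-resolution question.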
19
Independent Quantization (IQ tree)
Combines index, scan, and quantization
Grid resolution optimized by the cost model
[Berchtold, Böhm, Jagadish, Kriegel, Sander: Independent Quantization..., ICDE 2000]
20
Open Research Problems in Optimization
Multi-parameter optimization:
  How can parameters be optimized simultaneously?
  Are there conflicts between optimization goals?
Example:
  Uniform data: quantization
  Correlated data: tree striping
21
Open Research Problems in Optimization
Consider insert/delete/update:
  If the data set faces heavy updates, the constructed index should look different from an index for more static data sets
  Update-bound: construct a rather simple index
  Query-bound: spend more effort on organizing the data
  Can be considered as an optimization problem
22
Part 2: Data Mining
23
KDD Algorithms Based on Similarity Queries
DBSCAN, OPTICS, ...
LOF, Dist. Based Outliers, ...
Simultan. Nearest Neighbor Classific., ...
Spatial Trend Detect., Spatial Assoc. Rules
24
Similarity Join
Catalogue Matching
[Figure labels: R, S]
25
Clustering (e.g. DBSCAN)
[Ester, Kriegel, Sander, Xu: A Density Based Algorithm for Discovering Clusters, KDD 1996]
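To show how DBSCAN builds on range queries, here is a compact sketch. It answers each region query by linear scan rather than through an index, and the parameter names `eps` and `min_pts` follow common usage; treat it as an illustration, not the cited paper's implementation.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, p in enumerate(points) if math.dist(points[i], p) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks outliers."""
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE                     # may later become a border point
            continue
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id            # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:       # j is a core point: keep expanding
                seeds.extend(j_neighbors)
        cluster_id += 1
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

Every call to `region_query` is a range query, which is why the speed of similarity queries dominates the running time of the clustering.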
26
Cache Behavior
27
Clustering and Similarity Join
DBSCAN uses the similarity join as its basic operation
[Böhm, Braunmüller, Breunig, Kriegel: High Perf. Clustering based on the Sim. Join, CIKM 2000]
28
k-Nearest Neighbor Classification
Example (k = 3): objects with known class and new objects
[Figure legend: New objects, Known objects]
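The classification rule can be sketched in a few lines: a new object receives the majority class among its k nearest known objects. The data layout (pairs of feature vector and class label) and the function name are assumptions made for illustration.

```python
import heapq
import math
from collections import Counter


def knn_classify(known, new_object, k=3):
    """Assign the majority class among the k known objects closest to new_object."""
    nearest = heapq.nsmallest(k, known, key=lambda item: math.dist(item[0], new_object))
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


known = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"), ((5, 5), "b"), ((5, 6), "b")]
first = knn_classify(known, (0.5, 0.5), k=3)
second = knn_classify(known, (5, 5.5), k=3)
```

Classifying many new objects at once issues one nearest neighbor query per object, which is the workload the k-nearest neighbor join later in the talk addresses.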
29
Distance Range Join (ε-Join)
Most widespread and best evaluated join
Often also called the similarity join
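Semantically, the distance range join returns every pair from R × S whose distance does not exceed ε. A nested-loop sketch, assuming Euclidean distance and points given as tuples:

```python
import math


def range_join(R, S, eps):
    """All pairs (r, s) with r from R, s from S and distance at most eps."""
    return [(r, s) for r in R for s in S if math.dist(r, s) <= eps]


R = [(0.0, 0.0), (5.0, 5.0)]
S = [(0.5, 0.0), (5.0, 5.5), (9.0, 9.0)]
pairs = range_join(R, S, 1.0)
```

The nested loop compares all |R|·|S| pairs; index-based join algorithms aim to avoid most of these comparisons.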
30
k-Closest Pair Query
In SQL notation:
  SELECT * FROM R, S
  ORDER BY ||R.obj − S.obj||
  STOP AFTER k
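The same semantics expressed procedurally: order all pairs of R × S by distance and keep the first k. This brute-force sketch assumes Euclidean distance and is meant only to pin down the result, not to be efficient.

```python
import heapq
import itertools
import math


def k_closest_pairs(R, S, k):
    """The k pairs (r, s) from R x S with the smallest distance, closest first."""
    return heapq.nsmallest(k, itertools.product(R, S), key=lambda pair: math.dist(*pair))


R = [(0.0, 0.0), (5.0, 5.0)]
S = [(0.5, 0.0), (5.0, 6.0), (9.0, 9.0)]
closest = k_closest_pairs(R, S, 2)
```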
31
k-Nearest Neighbor Join
In SQL notation (limited to k = 1):
  SELECT * FROM R, S
  GROUP BY R.obj
  ORDER BY ||R.obj − S.obj||
  STOP AFTER K (* k *)
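Unlike the k-closest pair query, the k-nearest neighbor join returns k partners for every object of R. A brute-force sketch, assuming Euclidean distance and hashable point tuples:

```python
import heapq
import math


def knn_join(R, S, k):
    """Map each r in R to its k nearest neighbors in S, nearest first."""
    return {r: heapq.nsmallest(k, S, key=lambda s: math.dist(r, s)) for r in R}


R = [(0.0, 0.0), (5.0, 5.0)]
S = [(0.5, 0.0), (5.0, 6.0), (9.0, 9.0)]
joined = knn_join(R, S, 1)
```

The result always contains |R|·k pairs, whereas the size of a distance range join result depends on the choice of ε.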
32
R-tree Spatial Join (RSJ)
procedure r_tree_sim_join (R, S, ε)
  if IsDirpg (R) ∧ IsDirpg (S) then
    foreach r ∈ R.children do
      foreach s ∈ S.children do
        if mindist (r, s) ≤ ε then
          CacheLoad (r); CacheLoad (s);
          r_tree_sim_join (r, s, ε);
  else (* assume R, S both DataPg *)
    foreach p ∈ R.points do
      foreach q ∈ S.points do
        if |p − q| ≤ ε then report (p, q);
[Brinkhoff, Kriegel, Seeger: Efficient Processing of Spatial Joins using R-trees, SIGMOD 1993]
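A runnable rendering of this procedure might look as follows. It is a sketch under assumptions: pages are hypothetical in-memory `Node` objects, both trees have the same height (as the pseudocode also assumes), Euclidean distance is used, and the `CacheLoad` calls are omitted because nothing is read from disk.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    lo: tuple
    hi: tuple
    children: list = field(default_factory=list)   # directory page entries
    points: list = field(default_factory=list)     # data page entries


def mindist(a, b):
    """Smallest Euclidean distance between the rectangles of pages a and b."""
    gaps = (max(al - bh, bl - ah, 0.0) for al, ah, bl, bh in zip(a.lo, a.hi, b.lo, b.hi))
    return math.sqrt(sum(g ** 2 for g in gaps))


def r_tree_sim_join(R, S, eps, out=None):
    """Report all point pairs within eps, descending only into close page pairs."""
    out = [] if out is None else out
    if R.children and S.children:                  # both are directory pages
        for r in R.children:
            for s in S.children:
                if mindist(r, s) <= eps:
                    r_tree_sim_join(r, s, eps, out)
    else:                                          # assume both are data pages
        for p in R.points:
            for q in S.points:
                if math.dist(p, q) <= eps:
                    out.append((p, q))
    return out


r1 = Node((0.0, 0.0), (1.0, 1.0), points=[(0.0, 0.0), (1.0, 1.0)])
r2 = Node((8.0, 8.0), (9.0, 9.0), points=[(8.0, 8.0)])
s1 = Node((1.0, 1.0), (2.0, 2.0), points=[(1.5, 1.0)])
s2 = Node((20.0, 20.0), (21.0, 21.0), points=[(20.0, 20.0)])
tree_R = Node((0.0, 0.0), (9.0, 9.0), children=[r1, r2])
tree_S = Node((1.0, 1.0), (21.0, 21.0), children=[s1, s2])
result = r_tree_sim_join(tree_R, tree_S, 1.0)
```

In this example only the page pair (r1, s1) is examined at the point level; the other three pairs are pruned by `mindist`.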
33
Modeling and Optimization
Mating probability of index pages: probability that the distance between two pages ≤ ε
Two-fold application of the Minkowski sum
[Böhm, Kriegel: A Cost Model and Index Architecture for the Similarity Join, ICDE 2001]
34
Modeling and Optimization
I/O cost: high constant cost per page, so the optimum is at a large page capacity
CPU cost: low constant cost per page, so the optimum is at a low page capacity
CPU performance like a CPU-optimized index
I/O performance like an I/O-optimized index
35
Open Problems for Research (Sim. Join)
Modeling and Optimization:
  Dimension
  Quantization
  Page scheduling
  Caching strategies
Nearest Neighbor Join:
  Applications
  Algorithms
General:
  Integration into object-relational DBMS
36
Part 3: New Challenges
37
New Challenges
Uncertain Features:
  Application: biometric identification
  Particularities: features are individually associated with uncertainty (e.g. as Gaussian distributions)
  Queries: probability of match
    Find the objects with the highest probability of match
    Find the objects with a probability of match >= a given threshold
[Figure axes: feature a1, relative probability]
38
New Challenges
Support of e-commerce in all phases:
  Marketing: customer segmentation
  Sales and booking: advanced similarity search
  Add-on products: sales transaction analysis
Advanced similarity:
  Adaptable
  Multimodal models
  Relevance feedback
  Convex hull
39
New Challenges
Stock quotes: technical chart analysis
Known: database techniques for similarity search in time sequences (DFT, etc.)
40
New Challenges
Professional analyst tools use:
  Trading signals generated by indicators (e.g. MACD)
  Formations indicating trends in charts
  Relationships to the market and to derivatives
41
Conclusion
Database primitives: abstraction from the application
[Figure labels: Similarity Search, Clustering, Classification, Similarity Join, Outlier Detection, Range Queries, Nearest Neighbor Queries]
Advantages:
  General solution, reuse
  Separately optimizable