39 1 Christian Böhm University for Health Informatics and Technology, Innsbruck Similarity Search and Data Mining: Database Techniques Supporting Next Decade's Applications Keynote at iiWAS 2002

39 2 1 1 Similarity Search

39 3 Feature Based Similarity

39 4 Simple Similarity Queries  Specify a query object and: find similar objects (range query), or find the k most similar objects (nearest neighbor query).
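
The two query types can be illustrated without any index, as a plain sequential scan. This is a minimal sketch that assumes feature vectors are tuples of floats compared by Euclidean distance; the function names and sample data are hypothetical and not taken from the talk.

```python
import heapq
import math

def range_query(points, query, eps):
    """Return all points within Euclidean distance eps of the query object."""
    return [p for p in points if math.dist(p, query) <= eps]

def knn_query(points, query, k):
    """Return the k points closest to the query object, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))

# Hypothetical 2-d feature vectors.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
in_range = range_query(points, (0.0, 0.0), 1.0)
nearest = knn_query(points, (0.0, 0.0), 2)
```

Both functions touch every point, which is the sequential-scan baseline that an index structure has to beat.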

39 5 Multidimensional Index Structure (R-tree) Data Page: point 1: x11, x12, x13, ...; point 2: x21, x22, x23, ...; point 3: x31, x32, x33, ... Directory Page: rectangle 1, address 1; rectangle 2, address 2; rectangle 3, address 3; rectangle 4, address 4

39 6 Range Query with Depth-First Traversal
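
A depth-first range query over such a page hierarchy might look like the following sketch. It assumes a two-level toy tree, rectangles stored as (lows, highs) pairs, and Euclidean distance; the class and function names are illustrative only and do not come from the slides.

```python
import math
from dataclasses import dataclass

@dataclass
class DataPage:
    points: list      # list of coordinate tuples

@dataclass
class DirectoryPage:
    entries: list     # list of (rectangle, child page); rectangle = (lows, highs)

def mindist(query, rect):
    """Smallest Euclidean distance between a query point and a rectangle."""
    lows, highs = rect
    return math.sqrt(sum(max(lo - q, 0.0, q - hi) ** 2
                         for q, lo, hi in zip(query, lows, highs)))

def range_query(page, query, eps):
    """Depth-first traversal: descend only into rectangles that intersect the query ball."""
    if isinstance(page, DataPage):
        return [p for p in page.points if math.dist(p, query) <= eps]
    result = []
    for rect, child in page.entries:
        if mindist(query, rect) <= eps:
            result.extend(range_query(child, query, eps))
    return result

# Hypothetical two-page tree.
left = DataPage([(1.0, 1.0), (2.0, 2.0)])
right = DataPage([(8.0, 8.0), (9.0, 9.0)])
root = DirectoryPage([(((1.0, 1.0), (2.0, 2.0)), left),
                      (((8.0, 8.0), (9.0, 9.0)), right)])
hits = range_query(root, (2.5, 2.5), 1.0)
```

In this example the right page is never read, because its rectangle lies farther than eps from the query point.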

39 7 Nearest Neighbor: Priority Algorithm 4 page accesses [Hjaltason, Samet: Ranking in Spatial Databases, SSD 1995]
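
The priority idea can be sketched as a best-first search: pages and points share one priority queue ordered by their minimum distance to the query, so a page is only read when nothing closer is pending. The sketch below follows the general scheme of the cited ranking algorithm; the dictionary-based page layout and all names are assumptions made for illustration.

```python
import heapq
import itertools
import math

def mindist(query, rect):
    """Smallest Euclidean distance between a query point and a rectangle (lows, highs)."""
    lows, highs = rect
    return math.sqrt(sum(max(lo - q, 0.0, q - hi) ** 2
                         for q, lo, hi in zip(query, lows, highs)))

def nearest_neighbors(root, query, k):
    """Best-first search: always expand the closest pending page or report the closest point."""
    tie = itertools.count()   # tie-breaker so the heap never compares pages directly
    heap = [(0.0, next(tie), "page", root)]
    result = []
    while heap and len(result) < k:
        _, _, kind, item = heapq.heappop(heap)
        if kind == "point":
            result.append(item)
        elif item["type"] == "data":
            for p in item["points"]:
                heapq.heappush(heap, (math.dist(p, query), next(tie), "point", p))
        else:
            for rect, child in item["entries"]:
                heapq.heappush(heap, (mindist(query, rect), next(tie), "page", child))
    return result

# Hypothetical two-page tree.
left = {"type": "data", "points": [(1.0, 1.0), (2.0, 2.0)]}
right = {"type": "data", "points": [(8.0, 8.0), (9.0, 9.0)]}
root = {"type": "dir", "entries": [(((1.0, 1.0), (2.0, 2.0)), left),
                                   (((8.0, 8.0), (9.0, 9.0)), right)]}
two_nearest = nearest_neighbors(root, (2.5, 2.5), 2)
```

For k = 2 the right page is never expanded, since both answers are popped before that page reaches the front of the queue.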

39 8 Problems of High-Dim. Index Structures  "Curse of dimensionality": Search performance of the index deteriorates in high dimensions; the index is outperformed by the sequential scan  Solution: Optimize various parameters of index structures  Needed: Cost model for queries. How many pages are expected to be accessed for range queries (with given ε) and for nearest neighbor queries (with given k)?

39 9 Cost Estimation (Uniformity/Independence)  Minkowski sum: Estimation of the access probability of a page [Böhm: A Cost Model for Query Processing in High-Dimensional Data Spaces, TODS 25(2), 2000] Nearest neighbor: Estimate distance by point density
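
As a rough illustration of the Minkowski-sum idea: under the uniformity and independence assumption, with data in the unit hypercube, the probability that a range query touches a page equals the volume of the page region enlarged by the query. The sketch below assumes the maximum metric, so the enlarged region is again a rectangle with every side extended by 2·eps, and it ignores boundary effects apart from capping at 1. It is a simplified instance and not the full model of the cited paper; the function names are hypothetical.

```python
import math

def access_probability(side_lengths, eps):
    """Volume of the Minkowski sum of a page rectangle and a max-metric query of radius eps.

    Assumes uniform, independent data in the unit hypercube; the value is capped at 1.
    """
    volume = math.prod(a + 2.0 * eps for a in side_lengths)
    return min(volume, 1.0)

def expected_page_accesses(pages, eps):
    """Expected number of pages read: sum of the per-page access probabilities."""
    return sum(access_probability(sides, eps) for sides in pages)

# Hypothetical page regions: a small 2-d page and a 16-d page split once per dimension.
low_dim = access_probability((0.1, 0.1), 0.05)
high_dim = access_probability((0.5,) * 16, 0.25)
```

The 16-dimensional page is accessed with probability 1 even though the query radius is only a quarter of the data space extent, which is the effect that lets the sequential scan win in high dimensions.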

39 10 Cost Estimation  Boundary and saturation effects in high-dim. space (considered by our model extension)  Correlation between attributes (considered by the concept of fractal dimension)  Cluster structure also has an impact on performance: currently neglected by our model. Histograms and similar data descriptions are difficult in high-dimensional space (number of histogram bins is exponential in the dimensionality). Other descriptions of cluster structure (dendrograms)  Subject to future work

39 11 Optimization of Index Structures  To prevent the sequential scan from outperforming index-based query processing:  Optimize various parameters such as: logical block size of the index pages; indexed dimension; I/O schedule optimization (fast index scan); data quantization

39 12 Page Size Optimization

39 13 Page Size Optimization [Böhm, Kriegel: Dynamically Optimizing High Dimensional Index Structures, EDBT 2000]

39 14 Optimized Dimension Assignment Matching Hi-dim. Index R-tree Inverted List B-tree Problem in hi-dim: Too few splits in each dimension Problem in hi-dim: Too many results in each dimension [Berchtold, Böhm, Keim, Kriegel, Xu: Optimal...Tree Striping, DaWaK 2000]

39 15 Optimized Dimension Assignment Matching Hi-dim. Index R-tree Inverted List B-tree Compromise: A moderate number of R-trees each indexing a few dimensions OPTIMIZE! [Berchtold, Böhm, Keim, Kriegel, Xu: Optimal...Tree Striping, DaWaK 2000]

39 16 Schedule Optimization (Fast Index Scan) Range Query: Required Pages are known from the directory

39 17 Schedule Optimization (NN Queries)  Current expenses are traded for possible later savings  Start at 100% page and extend forward and backward  Optimize the cumulated cost balance (CCB): [Berchtold, Böhm, Jagadish, Kriegel, Sander: Independent Quantization..., ICDE 2000]

39 18 Quantization  Approximate the points by a quantization grid based on quantiles  Benefit: fewer bits for representation  Cost: grid cell partially intersected → access the original point data  How to choose the grid resolution? [Weber, Schek, Blott: A Quantitative Analysis and Performance Study..., VLDB 1998]
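
A quantile-based grid can be sketched per dimension: choose cell boundaries so that every cell holds roughly the same number of values, then store only the cell index of each coordinate. The code below is an illustrative sketch with hypothetical names and data, and it makes no claim about the exact encoding used in the cited work.

```python
import bisect

def quantile_boundaries(values, cells):
    """Inner boundaries that split the values into `cells` roughly equally filled cells."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // cells] for i in range(1, cells)]

def quantize(point, boundaries_per_dim):
    """Replace every coordinate by the index of the grid cell it falls into."""
    return tuple(bisect.bisect_right(bounds, x)
                 for x, bounds in zip(point, boundaries_per_dim))

# Hypothetical coordinate values of one dimension, quantized with 4 cells (2 bits).
xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
bounds = quantile_boundaries(xs, 4)
cell = quantize((0.45, 0.8), [bounds, bounds])
```

With 4 cells per dimension each coordinate needs 2 bits instead of a full floating-point value; a query that only partially overlaps a cell still has to fetch the exact point to decide.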

39 19 Independent Quantization (IQ tree) Combines index, scan, and quantization [Berchtold, Böhm, Jagadish, Kriegel, Sander: Independent Quantization..., ICDE 2000] Grid resolution optimized by cost model

39 20 Open Research Problems in Optimization  Multi-Parameter Optimization: How can parameters be optimized simultaneously? Are there conflicts between optimization goals? Example: Uniform data → Quantization; Correlated data → Tree Striping

39 21 Open Research Problems in Optimization  Consider Insert/Delete/Update:  If the data set faces heavy updates, the constructed index should look different from an index for a more static data set. Update-bound: construct a rather simple index. Query-bound: spend more effort on organizing the data  Can be considered as an optimization problem

39 22 2 2 Data Mining

39 23 KDD Algorithms Based on Similarity Queries DBSCAN OPTICS.... LOF Dist. Based Outliers.... Simultan. Nearest Neighbor Classific..... Spatial Trend Detect. Spatial Assoc. Rules

39 24 Similarity Join  Catalogue Matching R S

39 25 Clustering  Clustering (e.g. DBSCAN) [Ester, Kriegel, Sander, Xu: A Density Based Algorithm for Discovering Clusters, KDD 1996]
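
To show how density-based clustering rests on range queries, here is a compact sketch of the DBSCAN scheme: every point issues a range query with radius eps, points with at least min_pts neighbors are core points, and clusters grow from core points. The range query is a plain scan here; the function names and test data are hypothetical.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks outliers."""
    labels = {}

    def neighbors(i):
        # Range query around point i (includes i itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if i in labels:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE            # may later become a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels.get(j) == NOISE:
                labels[j] = cluster_id   # former noise point becomes a border point
            if j in labels:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
        cluster_id += 1
    return [labels[i] for i in range(len(points))]

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

Each call to `neighbors` is one similarity query, so the cost of the clustering is dominated by how efficiently those queries are answered.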

39 26 Cache Behavior

39 27 Clustering and Similarity Join  DBSCAN uses similarity join as basic operations [Böhm, Braunmüller, Breunig, Kriegel: High Perf. Clustering based on the Sim. Join, CIKM 2000]

39 28 k-Nearest Neighbor Classification  Example: Objects with known class New objects k = 3 New objects Known objects
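
The classification step with k = 3 can be sketched as a majority vote among the three nearest objects with known class. The sample objects, labels, and function name below are hypothetical.

```python
import math
from collections import Counter

def knn_classify(known, new_object, k=3):
    """Assign the majority class among the k known objects nearest to new_object."""
    neighbors = sorted(known, key=lambda item: math.dist(item[0], new_object))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical objects with known class.
known = [((1.0, 1.0), "A"), ((1.5, 1.0), "A"), ((1.0, 2.0), "A"),
         ((5.0, 5.0), "B"), ((5.5, 5.0), "B"), ((6.0, 6.0), "B")]
first = knn_classify(known, (1.2, 1.2))
second = knn_classify(known, (5.2, 5.4))
```

Classifying many new objects at once issues one nearest neighbor query per object, which is the access pattern a join-style operator can batch.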

39 29 Distance Range Join (ε-Join) Most widespread and best evaluated join Often also called the similarity join
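
In its simplest form the distance range join reports every pair from the two sets whose distance does not exceed the threshold. The nested-loop sketch below only fixes the semantics and uses hypothetical data; `eps` stands for the distance threshold.

```python
import math

def distance_range_join(R, S, eps):
    """All pairs (r, s) with Euclidean distance at most eps (nested-loop baseline)."""
    return [(r, s) for r in R for s in S if math.dist(r, s) <= eps]

R = [(0.0, 0.0), (5.0, 5.0)]
S = [(0.0, 1.0), (5.0, 6.0), (9.0, 9.0)]
pairs = distance_range_join(R, S, 1.0)
```

The baseline compares all |R|·|S| pairs; index-based join algorithms aim to prune most of these comparisons.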

39 30 k-Closest Pair Query In SQL notation: SELECT * FROM R, S ORDER BY ||R.obj − S.obj|| STOP AFTER k
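
The same query expressed as a brute-force sketch: order all pairs of R × S by distance and keep the first k. The names and data are hypothetical.

```python
import heapq
import itertools
import math

def k_closest_pairs(R, S, k):
    """The k pairs (r, s) of R x S with the smallest Euclidean distance, closest first."""
    return heapq.nsmallest(k, itertools.product(R, S),
                           key=lambda pair: math.dist(pair[0], pair[1]))

R = [(0.0, 0.0), (5.0, 5.0)]
S = [(0.0, 1.0), (5.0, 7.0), (9.0, 9.0)]
closest = k_closest_pairs(R, S, 2)
```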

39 31 k-Nearest Neighbor Join In SQL notation: (limited to k = 1) SELECT * FROM R, S GROUP BY R.obj ORDER BY ||R.obj − S.obj|| STOP AFTER K (*  k *)
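
Unlike the k-closest pair query, the k-nearest neighbor join returns k partners for every object of R. A brute-force sketch for general k, with hypothetical names and data:

```python
import heapq
import math

def knn_join(R, S, k):
    """Map every r in R to its k nearest neighbors in S, nearest first."""
    return {r: heapq.nsmallest(k, S, key=lambda s: math.dist(r, s)) for r in R}

R = [(0.0, 0.0), (5.0, 5.0)]
S = [(0.0, 1.0), (5.0, 7.0), (9.0, 9.0)]
joined = knn_join(R, S, 1)
```

The result always contains |R|·k pairs, independent of how the data are distributed, whereas the result size of a distance range join depends strongly on the chosen threshold.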

39 32 R-tree Spatial Join (RSJ) [Brinkhoff, Kriegel, Seeger: Efficient Processing of Spatial Joins using R-trees, SIGMOD 1993]
    procedure r_tree_sim_join (R, S, ε)
      if IsDirpg (R) ∧ IsDirpg (S) then
        foreach r ∈ R.children do
          foreach s ∈ S.children do
            if mindist (r,s) ≤ ε then
              CacheLoad(r); CacheLoad(s);
              r_tree_sim_join (r,s,ε);
      else (* assume R,S both DataPg *)
        foreach p ∈ R.points do
          foreach q ∈ S.points do
            if |p − q| ≤ ε then report (p,q);
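
A runnable rendering of this procedure is sketched below. It assumes both trees have the same height (as the pseudocode does), represents pages as dictionaries, and leaves out the CacheLoad calls because no buffer is modeled; the page constructors and all names other than `r_tree_sim_join` are hypothetical.

```python
import math

def rect_mindist(a, b):
    """Smallest Euclidean distance between two rectangles given as (lows, highs)."""
    (alo, ahi), (blo, bhi) = a, b
    return math.sqrt(sum(max(bl - ah, al - bh, 0.0) ** 2
                         for al, ah, bl, bh in zip(alo, ahi, blo, bhi)))

def r_tree_sim_join(R, S, eps, out=None):
    """Recursive join: follow only pairs of child pages whose rectangles are within eps."""
    if out is None:
        out = []
    if R["is_dir"] and S["is_dir"]:
        for r in R["children"]:
            for s in S["children"]:
                if rect_mindist(r["mbr"], s["mbr"]) <= eps:
                    r_tree_sim_join(r, s, eps, out)
    else:  # assume R and S are both data pages
        for p in R["points"]:
            for q in S["points"]:
                if math.dist(p, q) <= eps:
                    out.append((p, q))
    return out

def data_page(points):
    """Build a data page together with its minimum bounding rectangle."""
    dims = range(len(points[0]))
    lows = tuple(min(p[d] for p in points) for d in dims)
    highs = tuple(max(p[d] for p in points) for d in dims)
    return {"is_dir": False, "points": points, "mbr": (lows, highs)}

def dir_page(children):
    return {"is_dir": True, "children": children}

R_tree = dir_page([data_page([(0.0, 0.0), (1.0, 1.0)]),
                   data_page([(10.0, 10.0), (11.0, 11.0)])])
S_tree = dir_page([data_page([(1.0, 1.5), (2.0, 2.0)]),
                   data_page([(20.0, 20.0)])])
result = r_tree_sim_join(R_tree, S_tree, 1.0)
```

Of the four page pairs in this example only one is joined point by point; the other three are pruned by the rectangle test.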

39 33 Modeling and Optimization [Böhm, Kriegel: A Cost Model and Index Architecture for the Similarity Join, ICDE 2001]  Mating probability of index pages:  Probability that the distance between two pages is ≤ ε  Two-fold application of the Minkowski sum

39 34 Modeling and Optimization  I/O cost: High const. cost per page → large capacity optimum  CPU cost: Low const. cost per page → low capacity optimum  CPU performance like a CPU-optimized index  I/O performance like an I/O-optimized index

39 35 Open Problems for Research (Sim. Join)  Modeling and Optimization: Dimension Quantization Page scheduling Caching strategies  Nearest Neighbor Join Applications Algorithms  General Integration into object-relational DBMS

39 36 3 3 New Challenges

39 37 New Challenges Uncertain Features:  Application: Biometric Identification  Particularities: Features individually associated with uncertainty (e.g. as Gaussian distributions)  Queries: Probability of match. Find objects with the highest probability of match. Find objects with probability of match ≥ a given threshold. (Figure axes: feature a 1, relative probability)

39 38 New Challenges Support of e-commerce in all phases: Marketing → customer segmentation; Sales and booking → advanced similarity search; Add-on products → sales transaction analysis  Advanced Similarity: Adaptable; Multimodal models; Relevance-feedback; Convex hull

39 39 New Challenges Stock quotes: Technical chart analysis  Known: Database techniques for similarity search in time sequences (DFT, etc.)

39 40 New Challenges  Professional analyst tools use: Trading signals generated by indicators (e.g. MACD); formations indicating trends in charts; relationships to the market and to derivatives

39 41 Conclusion  Database primitives (abstraction from application): Similarity Search (Range Queries, Nearest Neighbor Queries) and Similarity Join, supporting Clustering, Classification, and Outlier Detection  Advantages: General solution, reuse; separately optimizable
