Christian Böhm, University for Health Informatics and Technology, Innsbruck. Similarity Search and Data Mining: Database Techniques Supporting Next Decade's Applications

Presentation transcript:

Christian Böhm, University for Health Informatics and Technology, Innsbruck. Similarity Search and Data Mining: Database Techniques Supporting Next Decade's Applications. Keynote at iiWAS 2002.

Similarity Search

Feature Based Similarity

Simple Similarity Queries: Specify a query object q and either find all objects within a given similarity range (range query) or find the k most similar objects (nearest neighbor query).
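The two query types can be sketched with a brute-force scan over feature vectors. This is a minimal illustration and not taken from the slides: the Euclidean distance, the function names `range_query` and `knn_query`, and the sample points are all assumptions.

```python
import heapq
import math

def euclidean(p, q):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def range_query(data, q, eps):
    """Return all objects whose distance to q is at most eps."""
    return [p for p in data if euclidean(p, q) <= eps]

def knn_query(data, q, k):
    """Return the k objects closest to q, nearest first."""
    return heapq.nsmallest(k, data, key=lambda p: euclidean(p, q))

# Hypothetical 2-dimensional feature vectors.
features = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
query = (0.0, 0.0)

in_range = range_query(features, query, 1.5)
nearest = knn_query(features, query, 2)
```

A scan like this touches every object; the index structures discussed next aim to avoid that.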

Multidimensional Index Structure (R-tree)
Data page: point 1: x11, x12, x13, ...; point 2: x21, x22, x23, ...; point 3: x31, x32, x33, ...
Directory page: rectangle 1, address 1; rectangle 2, address 2; rectangle 3, address 3; rectangle 4, address 4

Range Query with Depth-First Traversal
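A depth-first range query over an R-tree-like hierarchy might look as follows. This is a simplified sketch under stated assumptions: the `Node` class, the two-level example tree, and the page counter are hypothetical, and a real R-tree would read pages from disk rather than hold them in memory.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    lo: tuple                                      # lower corner of the bounding rectangle
    hi: tuple                                      # upper corner of the bounding rectangle
    children: list = field(default_factory=list)  # filled for directory pages
    points: list = field(default_factory=list)    # filled for data pages

def mindist(q, lo, hi):
    """Smallest Euclidean distance from point q to the rectangle [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))

def range_query(node, q, eps):
    """Depth-first traversal; returns (matching points, number of pages visited)."""
    results, pages = [], 1
    if node.children:
        for child in node.children:
            # Descend only into rectangles that intersect the query sphere.
            if mindist(q, child.lo, child.hi) <= eps:
                found, visited = range_query(child, q, eps)
                results.extend(found)
                pages += visited
    else:
        # For a point p, mindist(q, p, p) is simply the distance between q and p.
        results = [p for p in node.points if mindist(q, p, p) <= eps]
    return results, pages

leaf1 = Node((0.0, 0.0), (1.0, 1.0), points=[(0.2, 0.2), (0.9, 0.8)])
leaf2 = Node((4.0, 4.0), (5.0, 5.0), points=[(4.5, 4.5)])
root = Node((0.0, 0.0), (5.0, 5.0), children=[leaf1, leaf2])

found, pages = range_query(root, (0.0, 0.0), 0.5)
```

In this toy tree the second data page is pruned, so only the root and one data page are visited.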

Nearest Neighbor: Priority Algorithm. 4 page accesses. [Hjaltason, Samet: Ranking in Spatial Databases, SSD 1995]
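The priority (best-first) idea can be sketched with a single priority queue ordered by minimum distance. This is an illustrative reading of the approach, not the published algorithm verbatim: the `Page` class, the example tree, and the access counter are assumptions.

```python
import heapq
import itertools
import math

class Page:
    def __init__(self, lo, hi, children=(), points=()):
        self.lo, self.hi = lo, hi
        self.children = list(children)   # directory page entries
        self.points = list(points)       # data page entries

def mindist(q, lo, hi):
    """Smallest Euclidean distance from q to the rectangle [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))

def nearest_neighbors(root, q, k):
    """Return ([(distance, point), ...], number of pages accessed)."""
    tie = itertools.count()              # tie-breaker so the heap never compares pages
    heap = [(mindist(q, root.lo, root.hi), next(tie), root)]
    found, accessed = [], 0
    while heap and len(found) < k:
        d, _, item = heapq.heappop(heap)
        if isinstance(item, Page):
            accessed += 1                # the page is loaded only when it is dequeued
            for child in item.children:
                heapq.heappush(heap, (mindist(q, child.lo, child.hi), next(tie), child))
            for p in item.points:
                heapq.heappush(heap, (mindist(q, p, p), next(tie), p))
        else:
            # A point at the head of the queue is closer than every unread page.
            found.append((d, item))
    return found, accessed

leaf1 = Page((0.0, 0.0), (1.0, 1.0), points=[(0.2, 0.2), (0.9, 0.8)])
leaf2 = Page((4.0, 4.0), (5.0, 5.0), points=[(4.5, 4.5)])
root = Page((0.0, 0.0), (5.0, 5.0), children=[leaf1, leaf2])

result, accessed = nearest_neighbors(root, (0.0, 0.0), 1)
```

Pages are read in order of increasing minimum distance, so the search stops as soon as the k-th point is dequeued.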

Problems of High-Dim. Index Structures. "Curse of dimensionality": the search performance of an index deteriorates in high dimensions, to the point where it is outperformed by the sequential scan. Solution: optimize various parameters of the index structures. Needed: a cost model for queries, i.e. how many pages are expected to be accessed for range queries (with given ε) and for nearest neighbor queries (with given k).

Cost Estimation (Uniformity/Independence). Minkowski sum: estimation of the access probability of a page [Böhm: A Cost Model for Query Processing in High-Dimensional Data Spaces, TODS 25(2), 2000]. Nearest neighbor: estimate the distance by the point density.
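A rough numerical sketch of the Minkowski-sum idea follows. It makes several simplifying assumptions that are not from the slides: a unit data space, uniformly distributed query centers, the maximum metric (so the enlarged page region is again a box), and no boundary effects. The function names are hypothetical, and the cited model handles the Euclidean case and further effects.

```python
def access_probability(side_lengths, eps):
    """Volume of a page region enlarged by eps on every side (maximum metric).

    Under uniformity in the unit data space, this volume approximates the
    probability that a range query with radius eps touches the page.
    """
    volume = 1.0
    for a in side_lengths:
        volume *= a + 2.0 * eps
    return min(volume, 1.0)  # a probability cannot exceed 1

def expected_page_accesses(pages, eps):
    """Sum of the access probabilities over all pages."""
    return sum(access_probability(sides, eps) for sides in pages)

# Four hypothetical 2-dimensional pages with side length 0.5.
pages = [(0.5, 0.5)] * 4
p_single = access_probability((0.5, 0.5), 0.1)    # (0.5 + 0.2) ** 2
expected = expected_page_accesses(pages, 0.1)

# In 16 dimensions, a page of side length 0.5 with eps = 0.25 is always touched.
p_high = access_probability((0.5,) * 16, 0.25)
```

The last line hints at the curse of dimensionality: in high-dimensional spaces the enlarged regions quickly cover the whole data space.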

Cost Estimation. Boundary and saturation effects in high-dim. space (considered by our model extension). Correlation between attributes (considered by the concept of fractal dimension). Cluster structure also has an impact on performance: currently neglected by our model; histograms and similar data descriptions are difficult in high-dimensional space (the number of histogram bins is exponential in the dimensionality); other descriptions of cluster structure (dendrograms) are subject to future work.

Optimization of Index Structures. To prevent index-based query processing from being outperformed by the sequential scan, optimize various parameters such as: logical block size of the index pages; indexed dimensions; I/O schedule (fast index scan); data quantization.

Page Size Optimization

Page Size Optimization [Böhm, Kriegel: Dynamically Optimizing High Dimensional Index Structures, EDBT 2000]

Optimized Dimension Assignment. Hi-dim. index (R-tree): problem in hi-dim.: too few splits in each dimension. Inverted list (B-tree) with matching: problem in hi-dim.: too many results in each dimension. [Berchtold, Böhm, Keim, Kriegel, Xu: Optimal...Tree Striping, DaWaK 2000]

Optimized Dimension Assignment. Between the hi-dim. index (R-tree) and the inverted list (B-tree) with matching, a compromise: a moderate number of R-trees, each indexing a few dimensions. OPTIMIZE! [Berchtold, Böhm, Keim, Kriegel, Xu: Optimal...Tree Striping, DaWaK 2000]

Schedule Optimization (Fast Index Scan). Range query: the required pages are known from the directory.

Schedule Optimization (NN Queries). Current expenses are traded for possible later savings. Start at the page with 100% access probability and extend forward and backward. Optimize the cumulated cost balance (CCB). [Berchtold, Böhm, Jagadish, Kriegel, Sander: Independent Quantization..., ICDE 2000]

Quantization. Approximate the points by a quantization grid based on quantiles. Benefit: fewer bits for representation. Cost: if a grid cell is only partially intersected by the query, the original point data must be accessed. How should the grid resolution be chosen? [Weber, Schek, Blott: A Quantitative Analysis and Performance Study..., VLDB 1998]
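A quantile-based grid can be sketched as follows. This is a minimal illustration, not the method of the cited paper: the function names, the choice of boundary positions, and the sample data are assumptions. Each coordinate is replaced by a cell number that needs only `bits` bits.

```python
import bisect

def quantile_boundaries(values, bits):
    """Cell boundaries so that each of the 2**bits cells holds about equally many values."""
    cells = 2 ** bits
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // cells] for i in range(1, cells)]

def encode(value, boundaries):
    """Cell index of a value: the number of boundaries less than or equal to it."""
    return bisect.bisect_right(boundaries, value)

def quantize(points, bits):
    """Return per-dimension boundaries and the grid-cell code of every point."""
    dims = len(points[0])
    grids = [quantile_boundaries([p[d] for p in points], bits) for d in range(dims)]
    codes = [tuple(encode(p[d], grids[d]) for d in range(dims)) for p in points]
    return grids, codes

# Hypothetical 2-dimensional data set, quantized with 2 bits per dimension.
points = [(1, 80), (2, 70), (3, 60), (4, 50), (5, 40), (6, 30), (7, 20), (8, 10)]
grids, codes = quantize(points, bits=2)
```

A query is first evaluated on the cell codes; the exact coordinates are fetched only for points whose cells the query region intersects partially.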

Independent Quantization (IQ tree). Combines index, scan, and quantization. Grid resolution optimized by the cost model. [Berchtold, Böhm, Jagadish, Kriegel, Sander: Independent Quantization..., ICDE 2000]

Open Research Problems in Optimization. Multi-parameter optimization: How can parameters be optimized simultaneously? Are there conflicts between optimization goals? Example: uniform data → quantization; correlated data → tree striping.

Open Research Problems in Optimization. Consider insert/delete/update: if the data set faces heavy updates, the constructed index should look different from one built for more static data sets. Update-bound: construct a rather simple index. Query-bound: spend more effort to organize the data. This can be considered as an optimization problem.

Data Mining

KDD Algorithms Based on Similarity Queries: DBSCAN, OPTICS, ...; LOF, distance-based outliers, ...; simultaneous nearest neighbor classification, ...; spatial trend detection, spatial association rules.

Similarity Join. Catalogue matching (data sets R and S).

Clustering (e.g. DBSCAN) [Ester, Kriegel, Sander, Xu: A Density Based Algorithm for Discovering Clusters, KDD 1996]

Cache Behavior

Clustering and Similarity Join. DBSCAN uses the similarity join as its basic operation. [Böhm, Braunmüller, Breunig, Kriegel: High Perf. Clustering based on the Sim. Join, CIKM 2000]
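To illustrate how a density-based clustering can be driven by a similarity self-join, here is a simplified sketch. It is not the algorithm of the cited paper: the join is a plain nested loop, and the function name, parameters, and sample points are assumptions. `min_pts` counts the point itself, and points left with label `None` are noise.

```python
import math
from collections import deque

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def dbscan_via_join(points, eps, min_pts):
    """Cluster labels (0, 1, ...) per point; None marks noise."""
    n = len(points)
    # Step 1: similarity self-join, materialized as neighbor lists.
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    # Step 2: core points have at least min_pts join partners (including themselves).
    core = [len(nb) >= min_pts for nb in neighbors]
    # Step 3: grow clusters by following join pairs that start at core points.
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] is not None:
            continue
        labels[i] = cluster
        queue = deque([i])
        while queue:
            c = queue.popleft()
            for j in neighbors[c]:
                if labels[j] is None:
                    labels[j] = cluster
                    if core[j]:
                        queue.append(j)  # only core points expand the cluster
        cluster += 1
    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan_via_join(points, eps=1.5, min_pts=3)
```

The point of the reformulation is that step 1 can be delegated to any efficient similarity join algorithm, while steps 2 and 3 only consume the resulting pairs.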

k-Nearest Neighbor Classification. Example with k = 3: known objects (objects with known class) and new objects to be classified.
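A k-nearest neighbor classifier with majority vote can be sketched in a few lines. The function name, the Euclidean distance, and the sample objects are assumptions made for illustration.

```python
import heapq
import math
from collections import Counter

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(known, new_obj, k=3):
    """Assign the majority class among the k known objects nearest to new_obj.

    known is a list of (feature_vector, class_label) pairs.
    """
    nearest = heapq.nsmallest(k, known, key=lambda entry: dist(entry[0], new_obj))
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical objects with known class.
known = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "B"), ((5, 5), "B"), ((5, 6), "B")]

class_near_origin = knn_classify(known, (0.2, 0.2), k=3)
class_far_away = knn_classify(known, (5, 5.5), k=3)
```

Classifying many new objects at once amounts to a k-nearest neighbor join between the new objects and the known objects.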

Distance Range Join (ε-Join). Most widespread and best evaluated join. Often also called the similarity join.
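Semantically, the distance range join returns every pair from R × S whose distance does not exceed ε. A nested-loop sketch makes this definition concrete; the function name and the sample sets are assumptions, and practical algorithms avoid comparing all pairs.

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def distance_range_join(R, S, eps):
    """All pairs (r, s) with r in R, s in S and distance at most eps."""
    return [(r, s) for r in R for s in S if dist(r, s) <= eps]

R = [(0, 0), (5, 5)]
S = [(0, 1), (5, 6), (9, 9)]
pairs = distance_range_join(R, S, eps=1.0)
```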

k-Closest Pair Query. In SQL notation:
SELECT * FROM R, S
ORDER BY ||R.obj − S.obj||
STOP AFTER k
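The same query can be expressed as a brute-force sketch in Python: form all pairs, order them by distance, and stop after k. The function name and the sample sets are assumptions.

```python
import heapq
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_closest_pairs(R, S, k):
    """The k pairs of R x S with the smallest distance, as (distance, r, s) tuples."""
    all_pairs = ((dist(r, s), r, s) for r in R for s in S)
    return heapq.nsmallest(k, all_pairs)

R = [(0, 0), (5, 5)]
S = [(0, 1), (5, 7), (9, 9)]
closest = k_closest_pairs(R, S, k=2)
```

Note that k limits the total number of result pairs, regardless of how they are distributed over the objects of R.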

k-Nearest Neighbor Join. In SQL notation (limited to k = 1):
SELECT * FROM R, S
GROUP BY R.obj
ORDER BY ||R.obj − S.obj||
STOP AFTER K (* k *)
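In contrast to the k-closest pair query, the k-nearest neighbor join returns k partners for each object of R. A brute-force sketch, with assumed function name and sample data:

```python
import heapq
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_join(R, S, k):
    """Map every r in R to its k nearest neighbors in S, nearest first."""
    return {r: heapq.nsmallest(k, S, key=lambda s: dist(r, s)) for r in R}

R = [(0, 0), (5, 5)]
S = [(0, 1), (5, 7), (9, 9)]
joined = knn_join(R, S, k=1)
```

The result always contains |R| · k pairs (if S has at least k objects), so no distance threshold has to be chosen.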

R-tree Spatial Join (RSJ)
procedure r_tree_sim_join (R, S, ε)
  if IsDirpg (R) ∧ IsDirpg (S) then
    foreach r ∈ R.children do
      foreach s ∈ S.children do
        if mindist (r,s) ≤ ε then
          CacheLoad(r); CacheLoad(s);
          r_tree_sim_join (r,s,ε) ;
  else (* assume R,S both DataPg *)
    foreach p ∈ R.points do
      foreach q ∈ S.points do
        if |p − q| ≤ ε then report (p,q);
[Brinkhoff, Kriegel, Seeger: Efficient Processing of Spatial Joins using R-trees, SIGMOD 1993]
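The procedure translates almost line by line into Python. The sketch below makes some assumptions beyond the slide: the `Page` class and the example trees are hypothetical, pages are held in memory so the `CacheLoad` calls are omitted, and both trees are assumed to have the same height, as in the pseudocode.

```python
import math

class Page:
    def __init__(self, lo, hi, children=(), points=()):
        self.lo, self.hi = lo, hi
        self.children = list(children)   # non-empty for directory pages
        self.points = list(points)       # non-empty for data pages

    def is_dir_page(self):
        return bool(self.children)

def mindist(a, b):
    """Smallest Euclidean distance between the rectangles of pages a and b."""
    return math.sqrt(sum(max(al - bh, bl - ah, 0.0) ** 2
                         for al, ah, bl, bh in zip(a.lo, a.hi, b.lo, b.hi)))

def point_dist(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def r_tree_sim_join(R, S, eps, report):
    if R.is_dir_page() and S.is_dir_page():
        for r in R.children:
            for s in S.children:
                # Only page pairs that can contain join partners are followed.
                if mindist(r, s) <= eps:
                    r_tree_sim_join(r, s, eps, report)
    else:  # assume R and S are both data pages
        for p in R.points:
            for q in S.points:
                if point_dist(p, q) <= eps:
                    report((p, q))

Ra = Page((0.0, 0.0), (1.0, 1.0), points=[(0.0, 0.0), (1.0, 1.0)])
Rb = Page((8.0, 8.0), (9.0, 9.0), points=[(8.0, 8.0)])
Sa = Page((1.0, 1.0), (2.0, 2.0), points=[(1.5, 1.0), (2.0, 2.0)])
Sb = Page((20.0, 20.0), (21.0, 21.0), points=[(20.0, 20.0)])
R_root = Page((0.0, 0.0), (9.0, 9.0), children=[Ra, Rb])
S_root = Page((1.0, 1.0), (21.0, 21.0), children=[Sa, Sb])

pairs = []
r_tree_sim_join(R_root, S_root, 1.0, pairs.append)
```

In this example only the page pair (Ra, Sa) passes the mindist test, so three of the four data page combinations are never compared point by point.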

Modeling and Optimization. Mating probability of index pages: the probability that the distance between two pages is ≤ ε. Two-fold application of the Minkowski sum. [Böhm, Kriegel: A Cost Model and Index Architecture for the Similarity Join, ICDE 2001]

Modeling and Optimization. I/O cost: high constant cost per page → large capacity optimum. CPU cost: low constant cost per page → low capacity optimum. Goal: CPU performance like a CPU-optimized index and I/O performance like an I/O-optimized index.

Open Problems for Research (Sim. Join). Modeling and optimization: dimension, quantization, page scheduling, caching strategies. Nearest neighbor join: applications, algorithms. General: integration into object-relational DBMS.

New Challenges

New Challenges: Uncertain Features. Application: biometric identification. Particularities: features are individually associated with uncertainty (e.g. as Gaussian distributions). Queries: probability of match; find the objects with the highest probability of match; find the objects with a probability of match ≥ a given threshold. Figure axes: feature a1, relative probability.

New Challenges: Support of e-commerce in all phases. Marketing → customer segmentation. Sales and booking → advanced similarity search. Add-on products → sales transaction analysis. Advanced similarity: adaptable, multimodal models, relevance feedback, convex hull.

New Challenges: Stock quotes: technical chart analysis. Known: database techniques for similarity search in time sequences (DFT, etc.)

New Challenges. Professional analyst tools use: trading signals generated by indicators (e.g. MACD); formations indicating trends in charts; relationships to the market and to derivatives.

Conclusion. Database primitives as an abstraction from the application: similarity search (range queries, nearest neighbor queries) and similarity join, supporting clustering, classification, and outlier detection. Advantages: general solution, reuse; separately optimizable.