Christian Böhm, Bernhard Braunmüller, Florian Krebs, and Hans-Peter Kriegel, University of Munich Epsilon Grid Order: An Algorithm for the Similarity.

Christian Böhm, Bernhard Braunmüller, Florian Krebs, and Hans-Peter Kriegel, University of Munich Epsilon Grid Order: An Algorithm for the Similarity Join on Massive High-Dimensional Data

Feature Based Similarity

Simple Similarity Queries: Specify a query object and find similar objects (range query), or find the k most similar objects (nearest-neighbor query).
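The two query types can be illustrated with a brute-force sketch. It assumes feature vectors are plain tuples compared by Euclidean distance; the function names `range_query` and `knn_query` are hypothetical and do not come from the slides.

```python
import heapq
import math

def range_query(points, q, eps):
    """Return every point within Euclidean distance eps of the query q."""
    return [p for p in points if math.dist(p, q) <= eps]

def knn_query(points, q, k):
    """Return the k points closest to the query q, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, q))

# Hypothetical 2-d feature vectors, for illustration only.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
print(range_query(points, (0.0, 0.0), 1.0))
print(knn_query(points, (0.0, 0.0), 3))
```

Both functions scan all points, which is the baseline that index structures and join algorithms try to improve on.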

Join Applications: Catalogue Matching. E.g. astronomic catalogues R and S.

Join Applications: Clustering. Clustering (e.g. DBSCAN) via a similarity self-join.
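As a point of reference, a similarity self-join can be written as a nested loop over all pairs. This is a minimal sketch of the operation itself, assuming Euclidean distance and a threshold `eps`; it is the quadratic baseline, not the algorithm presented in these slides, and the name `similarity_self_join` is hypothetical.

```python
import math
from itertools import combinations

def similarity_self_join(points, eps):
    """Return index pairs (i, j), i < j, whose points lie within distance eps."""
    return [
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
        if math.dist(p, q) <= eps
    ]

# Hypothetical data: two tight pairs that are far from each other.
data = [(0.0, 0.0), (0.5, 0.0), (2.0, 2.0), (2.0, 2.4)]
print(similarity_self_join(data, 0.6))
```

A density-based clustering method such as DBSCAN needs exactly these neighbor pairs, which is why it can be built on top of the join.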

Grid partitioning. General idea: grid approximation where the grid line distance equals ε. A similar idea is used in the ε-kdB-tree [Shim, Srikant, Agrawal: High-dimensional Similarity Joins, ICDE 1997]. Disadvantage of any grid approach: the number of neighboring grid cells is 3^d − 1 in d dimensions.
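The growth of the neighbor count can be checked directly. The sketch below assumes a grid with cell width ε anchored at the origin, so that a point's cell is the component-wise floor of its coordinates divided by ε; `grid_cell` and `neighbor_cells` are hypothetical helper names.

```python
import math
from itertools import product

def grid_cell(p, eps):
    """Map a point to the integer coordinates of its grid cell (cell width eps)."""
    return tuple(math.floor(x / eps) for x in p)

def neighbor_cells(cell):
    """Enumerate all cells adjacent to `cell`, excluding the cell itself."""
    offsets = product((-1, 0, 1), repeat=len(cell))
    return [
        tuple(c + o for c, o in zip(cell, off))
        for off in offsets
        if any(off)  # skip the all-zero offset, which is the cell itself
    ]

for d in (2, 8, 16):
    print(d, 3 ** d - 1)
```

Already at d = 16 a cell has more than 43 million neighbors, so enumerating neighboring cells explicitly is not practical in high dimensions.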

Scalability of the ε-kdB-tree. Assumption: 2 adjacent ε-stripes fit in main memory. This is unrealistic for large data sets which are ... clustered, skewed and high-dimensional.

Epsilon Grid Order

ε-Grid-Order Is a Total Strict Order. Properties: irreflexivity, transitivity, asymmetry. The ε-grid-order can be used in any sorting algorithm.
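The transcript does not preserve the definition of the order, so the following sketch rests on an assumption: a point p precedes q if, in the first dimension where their grid cells differ, p lies in the cell with the smaller number (a lexicographic order on cells of width ε). The names `ego_less` and `ego_sort` are hypothetical.

```python
import math

def ego_less(p, q, eps):
    """Assumed epsilon grid order: compare grid cells lexicographically."""
    for pi, qi in zip(p, q):
        cp, cq = math.floor(pi / eps), math.floor(qi / eps)
        if cp != cq:
            return cp < cq
    return False  # same cell in every dimension: neither precedes the other

def ego_sort(points, eps):
    """Sort points by their grid cell; Python's sort keeps ties in input order."""
    return sorted(points, key=lambda p: tuple(math.floor(x / eps) for x in p))

pts = [(1.2, 0.1), (0.1, 0.7), (0.1, 0.2)]
print(ego_sort(pts, 0.5))
```

Because the comparison only needs the two points and ε, it can be plugged into any comparison-based sort, including an external sort for data that does not fit in memory.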

ε-Interval. Coarse approximation of join mates; used for I/O processing.
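The slide does not give the interval bounds. The sketch below assumes that the interval of a point p runs, in grid order, from the cell of p − (ε, ..., ε) to the cell of p + (ε, ..., ε). Every point within distance ε of p falls inside this range, but so do many points that are far away, which is what makes it a coarse approximation. The helper names are hypothetical.

```python
import math

def cell(p, eps):
    """Integer grid cell of a point for cell width eps."""
    return tuple(math.floor(x / eps) for x in p)

def in_eps_interval(p, q, eps):
    """True if q's cell lies between the cells of p - eps and p + eps in grid order."""
    lo = cell([x - eps for x in p], eps)
    hi = cell([x + eps for x in p], eps)
    return lo <= cell(q, eps) <= hi  # tuple comparison is lexicographic

p = (1.0, 1.0)
print(in_eps_interval(p, (1.3, 0.8), 0.5))  # a true join mate
print(in_eps_interval(p, (5.0, 1.0), 0.5))  # far away in the first dimension
print(in_eps_interval(p, (1.4, 9.0), 0.5))  # far away, yet inside the interval
```

The third point is not a join mate of p, yet it passes the test, so the interval only limits which part of the sorted file has to be read; exact distances still have to be checked afterwards.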

I/O Processing for the Self Join. Decompose the sorted file into I/O units.

Epsilon Grid Order

CPU Processing. I/O units are further decomposed before joining. Simple divide-and-conquer: no further sorting. Decomposition: maximize active dimensions.

CPU Processing. Point distance computations: order of dimensions. Dimension categories: neighboring inactive dimensions, unspecified dimensions, active dimension, aligned inactive dimensions.

Experimental Results: 8-dimensional uniformly distributed vectors.

Experimental Results (2): 16-d feature vectors from a CAD application.

Conclusions. Summary: high potential for performance gains of the similarity join by page capacity optimization; necessary to separately optimize I/O and CPU. Future research potential: similarity join for metric index structures; approximate similarity join; parallel similarity join algorithms.