High-Dimensional Similarity Search using Data-Sensitive Space Partitioning

Sachin Kulkarni (1) and Ratko Orlandic (2)
(1) Illinois Institute of Technology, Chicago
(2) University of Illinois at Springfield

Database and Expert Systems Applications (DEXA) 2006

Work supported by the NSF under grant no. IIS
Outline
– Problem Definition
– Existing Solutions
– Our Goal
– Design Principle
– Garden HD Clustering and Γ Partitioning
– System Architecture and Processes
– Results
– Conclusions
Problem Definition
Consider a database of addresses of clubs. Typical queries are:
– Find all the clubs within 35 miles of 10 West 31st Street, Chicago.
– Find the 5 nearest clubs.
Problem Definition
k-Nearest Neighbor (k-NN) Search:
– Given a database with N points and a query point q in some metric space, find the k ≥ 1 points closest to q [1]. (A brute-force baseline is sketched below.)
Applications:
– Computational geometry
– Geographic information systems (GIS)
– Multimedia databases
– Data mining
– Etc.
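The slides contain no code, so here is a minimal brute-force baseline for the k-NN definition above: the exhaustive sequential scan that the indexing methods in this talk aim to beat. The function name and sample data are illustrative.

```python
# Exact k-NN by exhaustive scan: the baseline any index must outperform.
import heapq
import math

def knn_scan(points, q, k):
    """Return the k points closest to query q under Euclidean distance."""
    def dist(p):
        return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
    # nsmallest keeps the scan at O(N log k) time and O(k) extra space.
    return heapq.nsmallest(k, points, key=dist)

# Example: 2-NN of the origin among three 2-D points.
data = [(0.1, 0.2), (0.9, 0.8), (0.4, 0.4)]
print(knn_scan(data, (0.0, 0.0), k=2))
```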
Challenge of k-NN Search
– In high-dimensional feature spaces, indexing structures face the problem of dead space (KDB-trees) or overlaps (R-trees).
– Volume and area grow exponentially with the number of dimensions (illustrated below), so finding the k-NN points is costly.
– Traditional access methods perform no better than a sequential scan – the "curse of dimensionality".
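The exponential-volume claim is not quantified on the slide; a standard illustration (not from the slides) is that the ball inscribed in the unit cube occupies a vanishing fraction of it as the dimensionality D grows, so almost all of the space is "empty" relative to any fixed-radius neighborhood.

```python
# Fraction of the unit cube covered by its inscribed D-ball (radius 0.5).
import math

def ball_to_cube_ratio(D, r=0.5):
    """Volume of a radius-r D-ball divided by the unit cube's volume (1)."""
    return (math.pi ** (D / 2) / math.gamma(D / 2 + 1)) * r ** D

for D in (2, 10, 50, 100):
    print(D, ball_to_cube_ratio(D))
# The ratio drops from ~0.785 at D=2 to ~1.9e-70 at D=100.
```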
Existing Solutions
– Approximation and dimensionality reduction.
– Exact nearest-neighbor solutions; significant effort in finding the exact nearest neighbors has yielded limited success:
  – VA-File
  – A-tree
  – iDistance
  – R-tree
  – SS-tree
  – SR-tree
Goal
Our goal:
– Scalability with respect to dimensionality
– Acceptable pre-processing (data-loading) time
– Ability to work on incremental loads of data
Our Solution
– Clustering
– Space partitioning
– Indexing
Design Principle
"Multi-dimensional data must be grouped on storage in a way that minimizes the extensions of storage clusters along all relevant dimensions and achieves high storage utilization."
What Does It Imply?
– Storage organization must maximize the densities of storage clusters.
– Reduce their internal empty space.
– Improve search performance even before the retrieval process hits persistent storage.
– For best results, employ a genuine clustering algorithm.
Achieving the Principles
Data space reduction:
– Detecting dense areas (dense cells) in the space with minimum amounts of empty space.
Data clustering:
– Detecting the largest areas with the above-mentioned property, called data clusters.
Garden HD Clustering
– Motivated by the stated principle.
– Efficiently and effectively separates disjoint areas containing points.
– A hybrid of cell- and density-based clustering that operates in two phases (a hedged sketch follows):
  – Recursive Γ partitioning of the space.
  – Merging of dense cells.
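The slides name the two phases but give no pseudocode. Below is a minimal sketch of the generic two-phase idea (grid-based density detection, then merging of adjacent dense cells); it is not the actual Garden HD algorithm, and the grid resolution, density threshold, and face-adjacency merge rule are all assumptions of this sketch. Coordinates are assumed normalized to [0,1], as in the experimental setup.

```python
# Illustrative two-phase cell/density clustering (NOT Garden HD itself):
# (1) find dense grid cells, (2) merge adjacent dense cells via BFS.
from collections import defaultdict, deque

def dense_cells(points, cells_per_dim=10, min_pts=5):
    """Phase 1: map each point to a grid cell; keep cells with >= min_pts."""
    counts = defaultdict(int)
    for p in points:
        cell = tuple(min(int(x * cells_per_dim), cells_per_dim - 1) for x in p)
        counts[cell] += 1
    return {c for c, n in counts.items() if n >= min_pts}

def merge_cells(cells):
    """Phase 2: BFS over face-adjacent dense cells to form clusters."""
    clusters, seen = [], set()
    for start in cells:
        if start in seen:
            continue
        cluster, frontier = set(), deque([start])
        while frontier:
            c = frontier.popleft()
            if c in seen:
                continue
            seen.add(c)
            cluster.add(c)
            for d in range(len(c)):          # face-adjacent neighbors only
                for step in (-1, 1):
                    nb = c[:d] + (c[d] + step,) + c[d + 1:]
                    if nb in cells and nb not in seen:
                        frontier.append(nb)
        clusters.append(cluster)
    return clusters
```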
Γ Partitioning
With G = number of generators and D = number of dimensions:
    number of regions = 1 + (G − 1) · D
The space partition is compactly represented by a filter (in memory).
[Figure: a subspace split by Γ partitioning into regions 0–3]
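To make the region-count formula concrete, a two-line helper; the formula is taken directly from the slide, and the sample values of G and D are arbitrary.

```python
# Number of regions produced by Γ partitioning with G generators in a
# D-dimensional space, per the formula on the slide.
def gamma_regions(G, D):
    return 1 + (G - 1) * D

for G, D in [(1, 2), (2, 3), (5, 100)]:
    print(f"G={G}, D={D}: {gamma_regions(G, D)} regions")
# G=1 leaves the space whole; each additional generator adds D regions.
```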
Data-Sensitive Gamma Partition (DSGP)
[Figure: a data-sensitive Γ partition with effective boundaries; regions indexed by KDB-trees]
System Architecture
– Data Clustering
– "Data-Sensitive" Space Partitioning
– Data Loading and Incremental Data Loading
– Data Retrieval: Region Search and Similarity Search
Basic Processes
Each region in space is represented by a separate KDB-tree (a sketch of this mapping follows):
– KDB-trees perform implicit slicing.
Initial and incremental loading of data:
– Dynamic assignment of multi-dimensional data to index pages.
Retrieval:
– Region and k-nearest-neighbor search.
– Several stages of refinement.
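The loading path described above can be made concrete with a small sketch. The in-memory filter is modeled as a `region_of` callable and each per-region KDB-tree as a plain list; both stand-ins are assumptions of this sketch, not the paper's structures.

```python
# Minimal sketch of the loading path implied by the slides: an in-memory
# filter maps each point to a region, and every region gets its own index
# (a real system would use a KDB-tree; a list stands in here).
class RegionIndex:
    """Placeholder for a per-region KDB-tree."""
    def __init__(self):
        self.points = []

    def insert(self, p):
        self.points.append(p)

class PartitionedStore:
    def __init__(self, region_of):
        # region_of: the in-memory filter, mapping a point to a region id.
        self.region_of = region_of
        self.indexes = {}

    def load(self, points):
        """Initial or incremental load: route each point to its region."""
        for p in points:
            rid = self.region_of(p)
            self.indexes.setdefault(rid, RegionIndex()).insert(p)
```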
Similarity Search – GammaNN
Nearest-neighbor search using GammaNN (a hedged sketch of the search loop follows).
[Figure: query point, region representatives, query hyper-sphere, and the clipped portions of other regions to be queried]
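The slide shows the search pictorially. The sketch below is one plausible reading of that picture, not the authors' exact GammaNN procedure: answer k-NN from the query's home region first, then probe only regions that the current query hyper-sphere could still clip. It reuses `PartitionedStore` from the earlier sketch, and `region_distance` (a lower bound on the distance from q to any point in a region, e.g., derived from the region representatives) is an assumption.

```python
# Hedged reading of the GammaNN picture (not the authors' exact algorithm):
# answer k-NN from the query's region, then probe only regions that the
# current query hyper-sphere could still clip.
import heapq, math

def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def gamma_nn(store, q, k, home_region, region_distance):
    # Candidate answer set from the query point's own region.
    best = heapq.nsmallest(k, store.indexes[home_region].points,
                           key=lambda p: euclid(p, q))
    radius = euclid(best[-1], q) if len(best) == k else float("inf")
    # Probe another region only if the sphere of radius `radius` clips it;
    # region_distance(q, rid) must lower-bound the true distance from q
    # to any point stored in region rid.
    for rid, idx in store.indexes.items():
        if rid == home_region or region_distance(q, rid) >= radius:
            continue
        best = heapq.nsmallest(k, best + idx.points,
                               key=lambda p: euclid(p, q))
        radius = euclid(best[-1], q) if len(best) == k else float("inf")
    return best
```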
Region Search
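The region-search slide is a figure with no surviving text. Under the same assumptions as the sketches above, a region (range) search could consult the in-memory filter to discard non-overlapping regions before touching their indexes; the `region_overlaps` predicate is an assumption of this sketch.

```python
# Minimal region-search sketch: consult the in-memory filter for regions
# overlapping the query hyper-rectangle, then scan only those indexes.
def region_search(store, low, high, region_overlaps):
    """Return points p with low[i] <= p[i] <= high[i] for all i."""
    hits = []
    for rid, idx in store.indexes.items():
        if not region_overlaps(rid, low, high):
            continue  # the filter rules this region out without any I/O
        for p in idx.points:
            if all(l <= x <= h for l, x, h in zip(low, p, high)):
                hits.append(p)
    return hits
```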
Experimental Setup
– PC with a 3.6 GHz CPU, 3 GB RAM, and a 280 GB disk; page size was 8 KB.
– Normalized D-dimensional space [0,1]^D.
– The GammaNN implementations with and without explicit clustering are referred to here as the 'data-aware' and 'data-blind' algorithms, respectively.
– Comparison with sequential scan and the VA-File.
Datasets
Synthetic data:
– Up to 100 dimensions, 100,000 points.
– Distributed across 11 clusters: one in the center and 10 in random corners of the space.
Real data:
– 54-dimensional, 580,900 points, forest cover type ("covtype") from the UCI Machine Learning Repository.
– Distributed across 11 different classes.
Metrics
Pre-processing time:
– Time of space partitioning, I/O, and the time for data loading (i.e., the construction of indices plus insertion of data).
– For the VA-File, only the time to generate the vector-approximation file.
Performance:
– Average page accesses for k-NN queries.
– Time to process k-NN queries.
Experimental Results
Performance: Synthetic Data
Performance: Real Data
Progress with k in k-NN
Incremental Load of Data
Conclusions
– Comparison of the data-sensitive and data-blind approaches clearly highlights the importance of clustering data on storage for efficient similarity search.
– Our approach can support exact similarity search while accessing only a small fraction of the data.
– The algorithm is very efficient in high dimensionalities and performs better than sequential scan and the VA-File technique.
– Performance remains good even after incremental loads of data without re-clustering.
Current and Future Work
– Incorporate R-trees or A-trees in place of KDB-trees.
– Provide a facility for handling data with missing values.
References
1. Fagin, R., Kumar, R., Sivakumar, D.: Efficient similarity search and classification via rank aggregation. Proc. ACM SIGMOD Conf. (2003)
2. Orlandic, R., Lukaszuk, J.: Efficient high-dimensional indexing by superimposing space-partitioning schemes. Proc. 8th International Database Engineering & Applications Symposium (IDEAS'04) (2004)
3. Orlandic, R., Lai, Y., Yee, W.G.: Clustering high-dimensional data using an efficient and effective data space reduction. Proc. ACM Conference on Information and Knowledge Management (CIKM'05) (2005)
4. Jagadish, H.V., Ooi, B.C., Tan, K.L., Yu, C., Zhang, R.: iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. ACM Transactions on Database Systems, Vol. 30, No. 2 (2005)
5. Weber, R., Schek, H.-J., Blott, S.: A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. Proc. 24th VLDB Conf. (1998)
6. Sakurai, Y., Yoshikawa, M., Uemura, S., Kojima, H.: The A-tree: An index structure for high-dimensional spaces using relative approximation. Proc. 26th VLDB Conf. (2000)
Questions?