High-Dimensional Similarity Search using Data-Sensitive Space Partitioning

Sachin Kulkarni (1) and Ratko Orlandic (2)
(1) Illinois Institute of Technology, Chicago
(2) University of Illinois at Springfield

Database and Expert Systems Applications (DEXA) 2006

Work supported by the NSF under grant no. IIS
Outline
– Problem Definition
– Existing Solutions
– Our Goal
– Design Principle
– Garden HD Clustering and Γ Partitioning
– System Architecture and Processes
– Results
– Conclusions
Problem Definition
Consider a database of addresses of clubs. Typical queries are:
– Find all the clubs within 35 miles of 10 West 31st Street, Chicago.
– Find the 5 nearest clubs.
Problem Definition
k-Nearest Neighbor (k-NN) Search:
– Given a database with N points and a query point q in some metric space, find the k ≥ 1 points closest to q [1]. (A brute-force baseline is sketched below.)
Applications:
– Computational geometry
– Geographic information systems (GIS)
– Multimedia databases
– Data mining
– Etc.
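The slides contain no code, so here is a minimal brute-force baseline for the k-NN definition above: the exhaustive sequential scan that the indexing methods in this talk aim to beat. The function name and sample data are illustrative.

```python
# Exact k-NN by exhaustive scan: the baseline any index must outperform.
import heapq
import math

def knn_scan(points, q, k):
    """Return the k points closest to query q under Euclidean distance."""
    def dist(p):
        return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
    # nsmallest keeps the scan at O(N log k) time and O(k) extra space.
    return heapq.nsmallest(k, points, key=dist)

# Example: 2-NN of the origin among three 2-D points.
data = [(0.1, 0.2), (0.9, 0.8), (0.4, 0.4)]
print(knn_scan(data, (0.0, 0.0), k=2))
```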
Challenge of k-NN Search
– In high-dimensional feature spaces, indexing structures face the problem of dead space (KDB-trees) or overlaps (R-trees).
– Volume and area grow exponentially with the number of dimensions (illustrated below), so finding the k-NN points is costly.
– Traditional access methods perform no better than a sequential scan – the "curse of dimensionality".
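The exponential-volume claim is not quantified on the slide; a standard illustration (not from the slides) is that the ball inscribed in the unit cube occupies a vanishing fraction of it as the dimensionality D grows, so almost all of the space is "empty" relative to any fixed-radius neighborhood.

```python
# Fraction of the unit cube covered by its inscribed D-ball (radius 0.5).
import math

def ball_to_cube_ratio(D, r=0.5):
    """Volume of a radius-r D-ball divided by the unit cube's volume (1)."""
    return (math.pi ** (D / 2) / math.gamma(D / 2 + 1)) * r ** D

for D in (2, 10, 50, 100):
    print(D, ball_to_cube_ratio(D))
# The ratio drops from ~0.785 at D=2 to ~1.9e-70 at D=100.
```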
Existing Solutions
– Approximation and dimensionality reduction.
– Exact nearest-neighbor solutions; significant effort in finding the exact nearest neighbors has yielded limited success:
  – VA-File
  – A-tree
  – iDistance
  – R-tree
  – SS-tree
  – SR-tree
Goal
Our goal:
– Scalability with respect to dimensionality
– Acceptable pre-processing (data-loading) time
– Ability to work on incremental loads of data
Our Solution
– Clustering
– Space partitioning
– Indexing
Design Principle
"Multi-dimensional data must be grouped on storage in a way that minimizes the extensions of storage clusters along all relevant dimensions and achieves high storage utilization."
What Does It Imply?
– Storage organization must maximize the densities of storage clusters.
– Reduce their internal empty space.
– Improve search performance even before the retrieval process hits persistent storage.
– For best results, employ a genuine clustering algorithm.
Achieving the Principles
Data space reduction:
– Detecting dense areas (dense cells) in the space with minimum amounts of empty space.
Data clustering:
– Detecting the largest areas with the above-mentioned property, called data clusters.
Garden HD Clustering
– Motivated by the stated principle.
– Efficiently and effectively separates disjoint areas containing points.
– A hybrid of cell- and density-based clustering that operates in two phases (a hedged sketch follows):
  – Recursive Γ partitioning of the space.
  – Merging of dense cells.
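The slides name the two phases but give no pseudocode. Below is a minimal sketch of the generic two-phase idea (grid-based density detection, then merging of adjacent dense cells); it is not the actual Garden HD algorithm, and the grid resolution, density threshold, and face-adjacency merge rule are all assumptions of this sketch. Coordinates are assumed normalized to [0,1], as in the experimental setup.

```python
# Illustrative two-phase cell/density clustering (NOT Garden HD itself):
# (1) find dense grid cells, (2) merge adjacent dense cells via BFS.
from collections import defaultdict, deque

def dense_cells(points, cells_per_dim=10, min_pts=5):
    """Phase 1: map each point to a grid cell; keep cells with >= min_pts."""
    counts = defaultdict(int)
    for p in points:
        cell = tuple(min(int(x * cells_per_dim), cells_per_dim - 1) for x in p)
        counts[cell] += 1
    return {c for c, n in counts.items() if n >= min_pts}

def merge_cells(cells):
    """Phase 2: BFS over face-adjacent dense cells to form clusters."""
    clusters, seen = [], set()
    for start in cells:
        if start in seen:
            continue
        cluster, frontier = set(), deque([start])
        while frontier:
            c = frontier.popleft()
            if c in seen:
                continue
            seen.add(c)
            cluster.add(c)
            for d in range(len(c)):          # face-adjacent neighbors only
                for step in (-1, 1):
                    nb = c[:d] + (c[d] + step,) + c[d + 1:]
                    if nb in cells and nb not in seen:
                        frontier.append(nb)
        clusters.append(cluster)
    return clusters
```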
Γ Partitioning
With G = number of generators and D = number of dimensions:
    number of regions = 1 + (G − 1) · D
The space partition is compactly represented by a filter (in memory).
[Figure: a subspace split by Γ partitioning into regions 0–3]
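To make the region-count formula concrete, a two-line helper; the formula is taken directly from the slide, and the sample values of G and D are arbitrary.

```python
# Number of regions produced by Γ partitioning with G generators in a
# D-dimensional space, per the formula on the slide.
def gamma_regions(G, D):
    return 1 + (G - 1) * D

for G, D in [(1, 2), (2, 3), (5, 100)]:
    print(f"G={G}, D={D}: {gamma_regions(G, D)} regions")
# G=1 leaves the space whole; each additional generator adds D regions.
```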
Data-Sensitive Gamma Partition (DSGP)
[Figure: a data-sensitive Γ partition with effective boundaries; regions indexed by KDB-trees]
System Architecture
– Data Clustering
– "Data-Sensitive" Space Partitioning
– Data Loading and Incremental Data Loading
– Data Retrieval: Region Search and Similarity Search
Basic Processes
Each region in space is represented by a separate KDB-tree (a sketch of this mapping follows):
– KDB-trees perform implicit slicing.
Initial and incremental loading of data:
– Dynamic assignment of multi-dimensional data to index pages.
Retrieval:
– Region and k-nearest-neighbor search.
– Several stages of refinement.
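The loading path described above can be made concrete with a small sketch. The in-memory filter is modeled as a `region_of` callable and each per-region KDB-tree as a plain list; both stand-ins are assumptions of this sketch, not the paper's structures.

```python
# Minimal sketch of the loading path implied by the slides: an in-memory
# filter maps each point to a region, and every region gets its own index
# (a real system would use a KDB-tree; a list stands in here).
class RegionIndex:
    """Placeholder for a per-region KDB-tree."""
    def __init__(self):
        self.points = []

    def insert(self, p):
        self.points.append(p)

class PartitionedStore:
    def __init__(self, region_of):
        # region_of: the in-memory filter, mapping a point to a region id.
        self.region_of = region_of
        self.indexes = {}

    def load(self, points):
        """Initial or incremental load: route each point to its region."""
        for p in points:
            rid = self.region_of(p)
            self.indexes.setdefault(rid, RegionIndex()).insert(p)
```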
Similarity Search – GammaNN
Nearest-neighbor search using GammaNN (a hedged sketch of the search loop follows).
[Figure: query point, region representatives, query hyper-sphere, and the clipped portions of other regions to be queried]
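The slide shows the search pictorially. The sketch below is one plausible reading of that picture, not the authors' exact GammaNN procedure: answer k-NN from the query's home region first, then probe only regions that the current query hyper-sphere could still clip. It reuses `PartitionedStore` from the earlier sketch, and `region_distance` (a lower bound on the distance from q to any point in a region, e.g., derived from the region representatives) is an assumption.

```python
# Hedged reading of the GammaNN picture (not the authors' exact algorithm):
# answer k-NN from the query's region, then probe only regions that the
# current query hyper-sphere could still clip.
import heapq, math

def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def gamma_nn(store, q, k, home_region, region_distance):
    # Candidate answer set from the query point's own region.
    best = heapq.nsmallest(k, store.indexes[home_region].points,
                           key=lambda p: euclid(p, q))
    radius = euclid(best[-1], q) if len(best) == k else float("inf")
    # Probe another region only if the sphere of radius `radius` clips it;
    # region_distance(q, rid) must lower-bound the true distance from q
    # to any point stored in region rid.
    for rid, idx in store.indexes.items():
        if rid == home_region or region_distance(q, rid) >= radius:
            continue
        best = heapq.nsmallest(k, best + idx.points,
                               key=lambda p: euclid(p, q))
        radius = euclid(best[-1], q) if len(best) == k else float("inf")
    return best
```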
Region Search
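The region-search slide is a figure with no surviving text. Under the same assumptions as the sketches above, a region (range) search could consult the in-memory filter to discard non-overlapping regions before touching their indexes; the `region_overlaps` predicate is an assumption of this sketch.

```python
# Minimal region-search sketch: consult the in-memory filter for regions
# overlapping the query hyper-rectangle, then scan only those indexes.
def region_search(store, low, high, region_overlaps):
    """Return points p with low[i] <= p[i] <= high[i] for all i."""
    hits = []
    for rid, idx in store.indexes.items():
        if not region_overlaps(rid, low, high):
            continue  # the filter rules this region out without any I/O
        for p in idx.points:
            if all(l <= x <= h for l, x, h in zip(low, p, high)):
                hits.append(p)
    return hits
```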
Experimental Setup
– PC with a 3.6 GHz CPU, 3 GB RAM, and a 280 GB disk; page size was 8 KB.
– Normalized D-dimensional space [0,1]^D.
– The GammaNN implementations with and without explicit clustering are referred to here as the 'data-aware' and 'data-blind' algorithms, respectively.
– Comparison with sequential scan and the VA-File.
Datasets
Synthetic data:
– Up to 100 dimensions, 100,000 points.
– Distributed across 11 clusters: one in the center and 10 in random corners of the space.
Real data:
– 54-dimensional, 580,900 points, forest cover type ("covtype") from the UCI Machine Learning Repository.
– Distributed across 11 different classes.
Metrics
Pre-processing time:
– Time of space partitioning, I/O, and the time for data loading (i.e., the construction of indices plus insertion of data).
– For the VA-File, only the time to generate the vector-approximation file.
Performance:
– Average page accesses for k-NN queries.
– Time to process k-NN queries.
Experimental Results
Performance: Synthetic Data
Performance: Real Data
Progress with k in k-NN
Incremental Load of Data
Conclusions
– Comparison of the data-sensitive and data-blind approaches clearly highlights the importance of clustering data on storage for efficient similarity search.
– Our approach can support exact similarity search while accessing only a small fraction of the data.
– The algorithm is very efficient in high dimensionalities and performs better than sequential scan and the VA-File technique.
– Performance remains good even after incremental loads of data without re-clustering.
Current and Future Work
– Incorporate R-trees or A-trees in place of KDB-trees.
– Provide a facility for handling data with missing values.
References
1. Fagin, R., Kumar, R., Sivakumar, D.: Efficient similarity search and classification via rank aggregation. Proc. ACM SIGMOD Conf. (2003)
2. Orlandic, R., Lukaszuk, J.: Efficient high-dimensional indexing by superimposing space-partitioning schemes. Proc. 8th International Database Engineering & Applications Symposium (IDEAS'04) (2004)
3. Orlandic, R., Lai, Y., Yee, W.G.: Clustering high-dimensional data using an efficient and effective data space reduction. Proc. ACM Conference on Information and Knowledge Management (CIKM'05) (2005)
4. Jagadish, H.V., Ooi, B.C., Tan, K.L., Yu, C., Zhang, R.: iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. ACM Transactions on Database Systems, Vol. 30, No. 2 (2005)
5. Weber, R., Schek, H.-J., Blott, S.: A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. Proc. 24th VLDB Conf. (1998)
6. Sakurai, Y., Yoshikawa, M., Uemura, S., Kojima, H.: The A-tree: An index structure for high-dimensional spaces using relative approximation. Proc. 26th VLDB Conf. (2000)
Questions?