High-Dimensional Similarity Search using Data-Sensitive Space Partitioning ┼ Sachin Kulkarni 1 and Ratko Orlandic 2 1 Illinois Institute of Technology,


1 High-Dimensional Similarity Search using Data-Sensitive Space Partitioning ┼
Sachin Kulkarni¹ and Ratko Orlandic²
¹ Illinois Institute of Technology, Chicago
² University of Illinois at Springfield
Database and Expert Systems Applications 2006
┼ Work supported by the NSF under grant no. IIS-0312266.

2 CS695 April 13, 2007
Outline
– Problem Definition
– Existing Solutions
– Our Goal
– Design Principle
– Garden HD Clustering and Γ Partitioning
– System Architecture and Processes
– Results
– Conclusions

3 Problem Definition
Consider a database of addresses of clubs. Typical queries are:
– Find all the clubs within 35 miles of 10 West 31st Street, Chicago.
– Find the 5 nearest clubs.

4 Problem Definition
K-Nearest Neighbor (k-NN) Search:
– Given a database with N points and a query point q in some metric space, find the k ≥ 1 points closest to q. [1]
Applications:
– Computational geometry
– Geographic information systems (GIS)
– Multimedia databases
– Data mining
– Etc.
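
The two query types above (range search and k-NN search) can be sketched as a plain sequential scan, the baseline that index structures try to beat. This is only an illustrative sketch: the function names, the Euclidean metric, and the toy coordinates are assumptions, not part of the presentation.

```python
import heapq
import math


def knn_scan(points, q, k):
    """Return the k points closest to q (Euclidean distance), nearest first."""
    if k < 1:
        raise ValueError("k must be >= 1")
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, q))


def range_scan(points, q, radius):
    """Return every point within `radius` of q, in input order."""
    return [p for p in points if math.dist(p, q) <= radius]


# Hypothetical toy data: four "clubs" in a 2-D plane.
clubs = [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0), (0.5, 0.0)]
query = (0.0, 0.0)

nearest_two = knn_scan(clubs, query, 2)
within_two = range_scan(clubs, query, 2.0)
print(nearest_two)
print(within_two)
```

Both functions touch every point, so their cost grows linearly with N regardless of the dimensionality.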

5 Challenge of k-NN Search
– In high-dimensional feature spaces, indexing structures face the problem of dead space (KDB-trees) or overlaps (R-tree).
– Volume and area grow exponentially with the number of dimensions.
– Finding the k-NN points is costly.
– Traditional access methods are on par with sequential scan: the “curse of dimensionality”.
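
One way to get a feel for this effect is to compute how much of the unit hypercube is covered by its inscribed hyper-sphere as the dimensionality grows. The sketch below uses the standard formula for the volume of a D-dimensional ball; the function name is hypothetical and the example is an illustration, not something taken from the slides.

```python
import math


def inscribed_ball_fraction(d):
    """Fraction of the unit hypercube [0,1]^d covered by the inscribed ball (radius 0.5)."""
    # Volume of a d-ball of radius r: pi^(d/2) * r^d / Gamma(d/2 + 1).
    return math.pi ** (d / 2) * 0.5 ** d / math.gamma(d / 2 + 1)


for d in (2, 10, 50, 100):
    print(d, inscribed_ball_fraction(d))
```

The fraction shrinks rapidly with d, so a fixed-radius query sphere covers a vanishing share of the space while still cutting across many index pages.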

6 Existing Solutions
– Approximation and dimensionality reduction.
– Exact nearest neighbor solutions: significant effort in finding the exact nearest neighbors has yielded limited success.
– VA-File, A-tree, iDistance, R-tree, SS-tree, SR-tree

7 Goal
Our goal:
– Scalability with respect to dimensionality
– Acceptable pre-processing (data-loading) time
– Ability to work on incremental loads of data

8 Our Solution
– Clustering
– Space partitioning
– Indexing

9 Design Principle
“multi-dimensional data must be grouped on storage in a way that minimizes the extensions of storage clusters along all relevant dimensions and achieves high storage utilization”.

10 What Does It Imply?
– Storage organization must maximize the densities of storage clusters.
– Reduce their internal empty space.
– Improve search performance even before the retrieval process hits persistent storage.
– For best results, employ a genuine clustering algorithm.

11 Achieving the Principles
Data space reduction:
– Detecting dense areas (dense cells) in the space with minimum amounts of empty space.
Data clustering:
– Detecting the largest areas with the above-mentioned property, called data clusters.
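
The idea of detecting dense cells can be illustrated with a uniform grid over the normalized space: count the points per cell and keep the cells that reach a density threshold. This is a simplified stand-in, not the authors' algorithm; the grid resolution, the threshold, and the function name are all assumptions made for the example.

```python
from collections import Counter


def dense_cells(points, cells_per_dim, min_points):
    """Map points in [0,1]^D to grid cells and return the cells holding >= min_points points."""
    counts = Counter()
    for p in points:
        # Clamp so that a coordinate equal to 1.0 falls into the last cell.
        cell = tuple(min(int(x * cells_per_dim), cells_per_dim - 1) for x in p)
        counts[cell] += 1
    return {cell: n for cell, n in counts.items() if n >= min_points}


# Hypothetical 2-D data: three points near the origin, two isolated points.
pts = [(0.05, 0.10), (0.10, 0.20), (0.15, 0.05), (0.90, 0.95), (0.50, 0.50)]
dense = dense_cells(pts, cells_per_dim=4, min_points=2)
print(dense)
```

Only occupied cells are stored, so memory use depends on the number of points rather than on the (exponentially large) number of possible cells.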

12 Garden HD Clustering
– Motivated by the stated principle.
– Efficiently and effectively separates disjoint areas with points.
– Hybrid of cell- and density-based clustering that operates in two phases:
– Recursive space partition (Γ partitioning).
– Merging of dense cells.

13 Γ partitioning
– G: number of generators; D: number of dimensions.
– Number of regions = 1 + (G – 1) · D
– The space partition is compactly represented by a Γ filter (in memory).
[Figure labels: Γ subspace; Γ regions 0 to 4]
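
As a quick check of the region-count formula, the snippet below evaluates it for a few settings. The function name is hypothetical, and the choice G = 3, D = 2 is only an assumption that happens to be consistent with the five regions (0 to 4) labeled on the slide.

```python
def gamma_region_count(generators, dimensions):
    """Number of regions produced by G generators in D dimensions: 1 + (G - 1) * D."""
    if generators < 1 or dimensions < 1:
        raise ValueError("generators and dimensions must be >= 1")
    return 1 + (generators - 1) * dimensions


print(gamma_region_count(3, 2))    # 2-D example with three generators
print(gamma_region_count(4, 100))  # 100-D example with four generators
```

The count grows linearly in D rather than exponentially, which is why the partition can be kept in a small in-memory structure even at high dimensionality.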

14 Data-Sensitive Gamma Partition
DSGP: Data-Sensitive Gamma Partition
[Figure labels: regions 1 to 4; effective boundaries; KDB-trees]

15 System Architecture
[Diagram labels: Data Clustering; “Data-Sensitive” Space Partitioning; Incremental Data Loading; Data Loading; Data Retrieval; Region Search; Similarity Search]

16 Basic Processes
Each region in space is represented by a separate KDB-tree
– KDB-trees perform implicit slicing
Initial and incremental loading of data
– Dynamic assignment of multi-dimensional data to index pages
Retrieval
– Region and k-nearest neighbor search
– Several stages of refinement

17 Similarity Search - GammaNN
Nearest neighbor search using GammaNN.
[Figure labels: query point; region representatives; query hyper-sphere; clipped portions to be queried]
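
The slide does not spell out how GammaNN decides which regions a query hyper-sphere touches. A common building block for this kind of pruning is the minimum distance from the query point to an axis-aligned box; the sketch below uses it under the assumption that each region can be bounded by such a box. The region names, box coordinates, and function names are hypothetical.

```python
import math


def min_dist_to_box(q, low, high):
    """Smallest Euclidean distance from point q to the axis-aligned box [low, high]."""
    return math.sqrt(
        sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, low, high))
    )


def regions_to_query(q, radius, regions):
    """Names of the regions whose bounding box intersects the query hyper-sphere."""
    return [
        name
        for name, (low, high) in regions.items()
        if min_dist_to_box(q, low, high) <= radius
    ]


# Hypothetical bounding boxes of three regions in the unit square.
regions = {
    "A": ((0.0, 0.0), (0.5, 0.5)),
    "B": ((0.5, 0.0), (1.0, 0.5)),
    "C": ((0.0, 0.5), (1.0, 1.0)),
}
hit = regions_to_query((0.45, 0.1), 0.1, regions)
print(hit)
```

Regions whose minimum distance exceeds the current search radius can be skipped without reading any of their pages.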

18 Region Search
[Figure: region search over regions 1 to 4]

19 Experimental Setup
– PC with 3.6 GHz CPU, 3 GB RAM, and 280 GB disk.
– Page size was 8 KB.
– Normalized D-dimensional space [0,1]^D.
– The GammaNN implementations with and without explicit clustering are referred to here as ‘data aware’ and ‘data blind’ algorithms, respectively.
– Comparison with Sequential Scan and VA-File.
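
The slides do not say how the data were mapped into [0,1]^D. One common choice is per-dimension min-max scaling, sketched below purely as an assumption; the function name and sample values are hypothetical.

```python
def normalize_unit_cube(points):
    """Rescale each dimension of `points` to [0, 1] using min-max scaling."""
    dims = len(points[0])
    lows = [min(p[i] for p in points) for i in range(dims)]
    highs = [max(p[i] for p in points) for i in range(dims)]

    def scale(x, lo, hi):
        # A constant dimension carries no information; map it to 0.0.
        return 0.0 if hi == lo else (x - lo) / (hi - lo)

    return [
        tuple(scale(p[i], lows[i], highs[i]) for i in range(dims))
        for p in points
    ]


raw = [(10.0, 200.0), (20.0, 400.0), (15.0, 300.0)]
unit = normalize_unit_cube(raw)
print(unit)
```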

20 Datasets
Synthetic data:
– Up to 100 dimensions, 100,000 points.
– Distributed across 11 clusters: one in the center and 10 in random corners of the space.
Real data:
– 54-dimensional, 580,900 points, forest cover type (“covtype”).
– Distributed across 11 different classes.
– UCI Machine Learning Repository.

21 Metrics
Pre-processing time
– Time of space partitioning, I/O and the time for data loading (i.e., the construction of indices plus insertion of data).
– For VA-File, only the time to generate the vector approximation file.
Performance
– Average page accesses for k-NN queries.
– Time to process k-NN queries.

22 Experimental Results

23 Performance: Synthetic Data

24 Performance: Real Data

25 Progress with k in k-NN

26 Incremental Load of Data

27 Conclusions
– Comparison of the data-sensitive and data-blind approaches clearly highlights the importance of clustering data on storage for efficient similarity search.
– Our approach can support exact similarity search while accessing only a small fraction of data.
– The algorithm is very efficient in high dimensionalities and performs better than sequential scan and the VA-File technique.
– The performance remains good even after incremental loads of data without re-clustering.

28 Current and Future Work
– Incorporate R-trees or A-trees in place of KDB-trees.
– Provide a facility for handling data with missing values.

29 References
1. Fagin, R., Kumar, R., Sivakumar, D.: Efficient similarity search and classification via rank aggregation, Proc. ACM SIGMOD Conf., (2003) 301-312
2. Orlandic, R., Lukaszuk, J.: Efficient high-dimensional indexing by superimposing space-partitioning schemes, Proc. 8th International Database Engineering & Applications Symposium IDEAS’04, (2004) 257-264
3. Orlandic, R., Lai, Y., Yee, W.G.: Clustering high-dimensional data using an efficient and effective data space reduction, Proc. ACM Conference on Information and Knowledge Management CIKM’05, (2005) 201-208
4. Jagadish, H.V., Ooi, B.C., Tan, K.L., Yu, C., Zhang, R.: iDistance: An adaptive B+-tree based indexing method for nearest neighbor search, ACM Transactions on Database Systems, Vol. 30, No. 2, (2005) 364-395
5. Weber, R., Schek, H.J., Blott, S.: A quantitative analysis and performance study for similarity search methods in high-dimensional spaces, Proc. 24th VLDB Conf., (1998) 194-205
6. Sakurai, Y., Yoshikawa, M., Uemura, S., Kojima, H.: The A-tree: An index structure for high-dimensional spaces using relative approximation, Proc. 26th VLDB Conf., (2000) 516-526

30 Questions?
kulksac@iit.edu
http://cs.iit.edu/~egalite

