Efficient k Nearest Neighbor Queries on Remote Spatial Databases Using Range Estimation Danzhou Liu, Ee-Peng Lim, Wee-Keong Ng Center for Advanced Information Systems, School of Computer Engineering, Nanyang Technological University, Nanyang Ave, Singapore 639798, Singapore
Outline Introduction Related work k-NN query algorithm based on range estimation Range estimation methods Experiments Conclusions SSDBM2002
Introduction A spatial database provides persistent storage for spatial objects (e.g., points, polylines, polygons). A spatial database supports: representation of spatial attributes; storage and indexing of spatial data values using spatial indices (e.g., R-tree and Quadtree); queries involving spatial attributes
k-Nearest Neighbor Queries Definition: a k-Nearest Neighbor (k-NN) query locates the k spatial objects nearest to a given query point. Wide range of applications: Geographic Information Systems (GIS), e.g., finding the nearest two hospitals; Computer Aided Design (CAD), e.g., finding the nearest three resistors on a circuit board
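To make the definition concrete, here is a minimal brute-force sketch that scans every object. It assumes point objects, Euclidean distance, and full local access to the data, which is exactly what a remote database does not offer; the function name and sample coordinates are illustrative only.

```python
import heapq
import math

def knn_brute_force(points, query, k):
    """Return the k points closest to query (Euclidean distance), nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))

# Hypothetical hospital locations; find the nearest two to the query point.
hospitals = [(0, 0), (1, 1), (5, 5), (2, 0), (9, 9)]
nearest_two = knn_brute_force(hospitals, (0.5, 0), 2)
print(nearest_two)
```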
Motivation Large volume of spatial data on the WWW: Geospatial Data Clearinghouse (a collection of over 250 spatial database servers); Yahoo, TIGER and other map services. Limited Web-based query interfaces: support only simple spatial queries (e.g., window queries); no support for remote index access
The Geospatial Data Clearinghouse Large amount of useful geospatial information on the WWW
The Geospatial Data Clearinghouse Limited Web-based query interface; supports only window queries
Objective Develop efficient algorithms to evaluate k-NN queries on remote spatial databases using window queries: propose a generic k-NN query processing algorithm that accommodates different range estimation methods; develop efficient range estimation methods; conduct experiments to evaluate the performance of the proposed range estimation methods; develop sampling methods to obtain the statistical knowledge of remote databases needed by the range estimation methods
Related Work Algorithms for simple k-NN queries may be divided into three major groups: partition-based algorithms, graph-based algorithms, and range-based algorithms
Partition-based Algorithms Retrieve k nearest neighbors from spatial indices by pruning away nodes that cannot lead to k nearest neighbors. Examples: branch-and-bound R-tree traversal algorithm; pipelined fashion algorithm. Not applicable to the Web environment: spatial indices are usually not available to non-local applications; creating local indices is infeasible due to the large amount of data
Graph-based Algorithms Pre-compute nearest neighbors of spatial objects; create new index structures over the pre-computed nearest neighbor information to support search. Example: Voronoi-based algorithm. Not applicable to the Web environment: retrieving all spatial objects from remote database servers is sometimes impractical; creating local indices is infeasible due to the large amount of data
Range-based Algorithms Use range queries to retrieve k nearest neighbors. Examples: use sampling for range estimation; use distance distributions for range estimation; use reference points for range estimation. Not applicable to the Web environment: determining the sample size and selecting samples of spatial objects properly remain a challenge; creating local indices is infeasible due to the large amount of data
Proposed k-NN Algorithm Based on range estimation. New strategies for k-NN query evaluation in the Web environment are required. Use window queries to probe the spatial database
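The slide does not spell out the algorithm, so the following is only a sketch of one way a range-estimation loop over window queries could look. It assumes point objects and a square window of side 2r centred on the query point; `window_query` is a local stand-in for the remote interface, and the `esti_range1` / `esti_range2` callables and their signatures are hypothetical placeholders for whatever range estimation method is plugged in.

```python
import math

def window_query(db, xmin, ymin, xmax, ymax):
    """Local stand-in for the remote interface: return points inside the window."""
    return [p for p in db if xmin <= p[0] <= xmax and ymin <= p[1] <= ymax]

def knn_by_window_queries(db, q, k, esti_range1, esti_range2, max_iters=50):
    """Return (k nearest points, number of window queries issued)."""
    r = esti_range1(q, k)  # initial range estimate
    for iteration in range(1, max_iters + 1):
        # The square of side 2r centred on q contains the circle of radius r.
        candidates = window_query(db, q[0] - r, q[1] - r, q[0] + r, q[1] + r)
        in_circle = sorted(
            (p for p in candidates if math.dist(p, q) <= r),
            key=lambda p: math.dist(p, q),
        )
        # If the circle holds at least k objects, the k nearest are among them.
        if len(in_circle) >= k:
            return in_circle[:k], iteration
        # Range was too small: ask the estimator for a larger one.
        r = esti_range2(q, k, r, len(in_circle))
    raise RuntimeError("range estimate did not grow enough to cover k objects")

# Toy data: a 10 x 10 grid of points, with deliberately naive estimators.
db = [(x, y) for x in range(10) for y in range(10)]
result, iterations = knn_by_window_queries(
    db, (4.2, 4.1), 3,
    esti_range1=lambda q, k: 0.5,
    esti_range2=lambda q, k, r, found: 2 * r,
)
print(result, iterations)
```

Objects in the corners of the window but outside the circle are discarded before counting, because only the circle guarantees that no closer object has been missed.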
Density-based Range Estimation Method Based on the assumption that spatial objects are uniformly distributed. Range estimated by the EstiRange1 function: [formula not recoverable from the source]. Ranges estimated by the EstiRange2 function: [formulas not recoverable from the source]
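The formulas themselves did not survive on this slide, so the sketch below is not the authors' EstiRange1 or EstiRange2. It only illustrates what the uniform-distribution assumption implies: with density D objects per unit area, a circle of radius r is expected to hold pi * r^2 * D objects, so setting that equal to k gives r = sqrt(k / (pi * D)). The retry rule, which rescales the radius using the density actually observed, is a hypothetical choice added for illustration.

```python
import math

def density_range(k, density):
    """Radius of the circle expected to hold k objects at a uniform density."""
    return math.sqrt(k / (math.pi * density))

def density_range_retry(k, prev_range, found):
    """Hypothetical re-estimate after a query that found fewer than k objects."""
    if found == 0:
        return 2 * prev_range  # nothing observed: fall back to doubling
    # Observed local density is found / (pi * prev_range^2); solving for k
    # objects at that density gives prev_range * sqrt(k / found).
    return prev_range * math.sqrt(k / found)

# Example: 10,000 objects spread over a 100 x 100 area, k = 10.
density = 10_000 / (100 * 100)
r = density_range(10, density)
print(round(r, 3))
```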
Bucket-based Range Estimation Method Use summary information about partitions (buckets) of spatial objects for range estimation. Summary information: bucket MBB and number of spatial objects in the bucket. Buckets are created using different strategies [1]. Sort the maximum distances between the buckets and the query point. The estimated range is the smallest of these maximum distances for which the buckets within it hold at least k objects in total. Uses one window query
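The bucket-based estimate can be sketched as follows, assuming each bucket is summarised by an axis-aligned bounding box and an object count; the `Bucket` type, the function names, and the sample buckets are illustrative, not taken from the slides. Every object in a bucket lies within the bucket's maximum distance from the query point, so accumulating counts in order of that distance yields a range that is certain to cover at least k objects.

```python
import math
from typing import NamedTuple

class Bucket(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float
    count: int  # number of spatial objects summarised by this bucket

def max_dist(q, b):
    """Largest distance from q to any point of the bucket's box (a corner)."""
    dx = max(abs(q[0] - b.xmin), abs(q[0] - b.xmax))
    dy = max(abs(q[1] - b.ymin), abs(q[1] - b.ymax))
    return math.hypot(dx, dy)

def bucket_range(q, k, buckets):
    """Smallest bucket max-distance within which at least k objects must lie."""
    total = 0
    for d, count in sorted((max_dist(q, b), b.count) for b in buckets):
        total += count
        if total >= k:
            return d
    raise ValueError("buckets hold fewer than k objects in total")

buckets = [
    Bucket(0, 0, 2, 2, 3),
    Bucket(2, 0, 4, 2, 4),
    Bucket(0, 2, 4, 4, 10),
]
estimated = bucket_range((1, 1), 5, buckets)
print(round(estimated, 3))
```

In this example the first bucket contributes 3 objects and the second brings the total to 7, so the range is the second bucket's maximum distance from the query point.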
Example: k = 5
Experiments New Jersey road dataset from TIGER [30]
Performance measures: number of iterations h; accuracy A
Experimental Results Minimum, maximum and upper bounds on the number of iterations of the density-based range estimation method
Number of iterations and accuracy of the density-based range estimation method
Experimental Results Efficiency of density-based and bucket-based range estimation methods
Conclusions A window query approach to evaluating k-NN queries on remote spatial databases, motivated by: large amount of spatial information on the Web; limited query interfaces. Proposed range estimation methods: performance increases with k; no clear winner
Types of Range Estimation Methods Tight estimation methods: the estimated range may not be large enough, i.e., both the EstiRange1 and EstiRange2 functions may be invoked; e.g., density-based method. Loose estimation methods: the estimated range is large enough, i.e., only the EstiRange1 function is invoked; e.g., bucket-based method
Future Work Extending range estimation methods with sampling techniques to determine data distribution: current range estimation methods depend on statistical knowledge provided by database owners; investigate how the statistical knowledge can be approximated through sampling. Developing strategies to select the appropriate range estimation method for evaluating k-NN queries. Developing Web applications of k-NN queries
Four Strategies to Create Buckets Equi-Count, Equi-Area, Min-Skew, and Min-Overlap partitioning strategies [1] [Figure panels: Charminar dataset; spatial densities in Charminar; Equi-Area, Equi-Count, Min-Skew, and Min-Overlap partitionings]
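As a rough illustration of how bucket summaries might be built, the sketch below uses a uniform grid, which is the simplest way to obtain equal-area cells. This is a simplification chosen for illustration: the Equi-Area strategy in [1] may differ in detail, and the other three strategies are not covered. The function name and sample points are hypothetical, and each bucket's box here is the grid cell itself rather than a tight bounding box of its objects.

```python
def equi_area_buckets(points, bounds, nx, ny):
    """Split bounds into an nx-by-ny grid; return (cell box, count) per non-empty cell."""
    xmin, ymin, xmax, ymax = bounds
    w, h = (xmax - xmin) / nx, (ymax - ymin) / ny
    cells = {}
    for x, y in points:
        # Clamp so points on the upper boundary fall into the last cell.
        i = min(int((x - xmin) / w), nx - 1)
        j = min(int((y - ymin) / h), ny - 1)
        cells[(i, j)] = cells.get((i, j), 0) + 1
    return [
        ((xmin + i * w, ymin + j * h, xmin + (i + 1) * w, ymin + (j + 1) * h), n)
        for (i, j), n in sorted(cells.items())
    ]

points = [(0.5, 0.5), (1.5, 0.5), (1.6, 0.4), (3.9, 3.9)]
grid = equi_area_buckets(points, (0, 0, 4, 4), 2, 2)
print(grid)
```

A grid like this ignores skew in the data, which is presumably why the slide lists alternative strategies.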