Influence sets based on Reverse Nearest Neighbor Queries



Influence sets based on Reverse Nearest Neighbor Queries. Presenter: Anoopkumar Hariyani

What are influence sets? Examples: decision support systems; maintaining a document repository. What is the naïve solution for influence sets, and what are the problems with that solution?

The asymmetric nature of the nearest neighbor relation: p can be the nearest neighbor of q without q being the nearest neighbor of p, which is why reverse nearest neighbor queries differ from nearest neighbor queries.

Reverse Nearest Neighbor Queries. Formal definitions: NN(q) = {r ∈ S | ∀p ∈ S: d(q, r) ≤ d(q, p)} and RNN(q) = {r ∈ S | ∀p ∈ S: d(r, q) ≤ d(r, p)}. Variants: monochromatic vs. bichromatic, and static vs. dynamic.
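A minimal brute-force sketch of these definitions in Python (illustrative names, not from the paper; Euclidean distance in the plane is assumed, and the query q is taken to be outside S):

    import math

    def nn_dist(r, S):
        # N(r): the distance from r to its nearest neighbor in S.
        return min(math.dist(r, p) for p in S if p != r)

    def rnn(q, S):
        # RNN(q): every r in S at least as close to q as to its own nearest neighbor.
        return [r for r in S if math.dist(r, q) <= nn_dist(r, S)]

Note where the asymmetry shows up: membership of r in RNN(q) is decided by r's own nearest neighbor distance, not by distances measured from q.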

Our Approach to RNN Queries. 1. Static case: Step 1: For each point p ∈ S, determine the distance to the nearest neighbor of p in S, denoted N(p); formally, N(p) = min { d(p, q) : q ∈ S, q ≠ p }. For each p ∈ S, generate the circle (p, N(p)) with center p and radius N(p). Step 2: For any query q, determine all the circles (p, N(p)) that contain q and return their centers p.
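A direct Python sketch of these two steps (illustrative names; the linear scan in step 2 is a placeholder for the indexed search described later):

    import math

    def build_circles(S):
        # Step 1: map each p in S to N(p), the radius of its circle (p, N(p)).
        return {p: min(math.dist(p, q) for q in S if q != p) for p in S}

    def rnn_query(q, circles):
        # Step 2: return the centers of all circles that contain q.
        return [p for p, radius in circles.items() if math.dist(p, q) <= radius]

    S = [(0, 0), (1, 0), (4, 0)]
    circles = build_circles(S)            # {(0, 0): 1.0, (1, 0): 1.0, (4, 0): 3.0}
    print(rnn_query((2, 0), circles))     # [(1, 0), (4, 0)]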

2. Dynamic case: Consider the insertion of a point q. Determine the reverse nearest neighbors p of q. For each such point p, replace the circle (p, N(p)) with (p, d(p, q)) and update N(p) to equal d(p, q). Then find N(q), the distance from q to its nearest neighbor, and add (q, N(q)) to the collection of circles.

Consider the deletion of a point q. We remove the circle (q, N(q)) from the collection of circles and determine all the reverse nearest neighbors p of q. For each such point p, we determine its new N(p) and replace its existing circle with (p, N(p)); both update routines are sketched below.
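A sketch of both update routines, building on the circles dictionary and rnn_query from the previous sketch (still brute force, with no index structure; it assumes S keeps at least two points):

    import math

    def insert_point(q, S, circles):
        for p in rnn_query(q, circles):            # reverse NNs of q shrink their circles
            circles[p] = math.dist(p, q)
        S.append(q)
        circles[q] = min(math.dist(q, p) for p in S if p != q)    # N(q)

    def delete_point(q, S, circles):
        affected = [p for p in rnn_query(q, circles) if p != q]   # RNNs of q, excluding q itself
        S.remove(q)
        del circles[q]
        for p in affected:                          # recompute N(p) without q
            circles[p] = min(math.dist(p, r) for r in S if r != p)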

Scalable RNN Queries: 1. Static case: The first step in answering RNN queries efficiently is to pre-compute the nearest neighbor of every point. Given a query point q, a straightforward but naïve approach for finding reverse nearest neighbors is to sequentially scan the entries (pi → pj) of a pre-computed all-NN list to determine which points pi are closer to q than to their current nearest neighbor pj. Ideally, one would like to avoid scanning the entire data set. As noted above, an RNN query reduces to a point enclosure query in a database of nearest-neighborhood objects (e.g., circles for the L2 distance in the plane); these objects can be obtained from the all-nearest-neighbor distances. We propose to store the objects explicitly in an R-tree, henceforth referred to as the RNN-tree. We can thus answer RNN queries by a simple search in the R-tree for the objects enclosing q.
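A sketch of this enclosure search using the third-party rtree package (a libspatialindex wrapper) as a stand-in for the paper's R-tree implementation; each circle is indexed by its bounding box, and exact containment is then checked on the candidates:

    import math
    from rtree import index

    def build_rnn_tree(circles):
        # circles: {center: radius}; insert each circle's bounding box.
        idx = index.Index()
        entries = list(circles.items())
        for i, ((x, y), r) in enumerate(entries):
            idx.insert(i, (x - r, y - r, x + r, y + r))
        return idx, entries

    def rnn_tree_query(q, idx, entries):
        qx, qy = q
        result = []
        for i in idx.intersection((qx, qy, qx, qy)):    # candidate boxes containing q
            center, radius = entries[i]
            if math.dist(center, q) <= radius:           # exact circle enclosure test
                result.append(center)
        return result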

2. Dynamic case: A sequential scan of a pre-computed all-NN list can be used to determine the reverse nearest neighbors of a given query point q, and insertions and deletions can be handled similarly. Instead, we incrementally maintain the RNN-tree in the presence of insertions and deletions. This requires a supporting access method that can find nearest neighbors of points efficiently. One may wonder whether a single R-tree suffices for finding both reverse nearest neighbors and nearest neighbors. This turns out not to be the case: the RNN-tree stores geometric objects rather than points, so its bounding boxes are not optimized for nearest neighbor search on points. We therefore use a separate R-tree for NN queries, henceforth referred to as the NN-tree.
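A matching NN-tree sketch with the same rtree package, storing points as degenerate boxes so that its nearest method answers NN queries; again an illustrative stand-in, not the paper's implementation:

    from rtree import index

    def build_nn_tree(S):
        idx = index.Index()
        for i, (x, y) in enumerate(S):
            idx.insert(i, (x, y, x, y))     # a point stored as a zero-area box
        return idx

    def nn_tree_query(q, idx, S, k=1):
        qx, qy = q
        # nearest() can return more than k ids when distances tie.
        return [S[i] for i in idx.nearest((qx, qy, qx, qy), k)]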


Experiments on RNN Queries: We compared the proposed algorithms to the basic scanning approach. Data sets: Our testbed includes two real data sets; the first is monochromatic and the second is bichromatic. 1. cities1 - Centers of 100K cities and small towns in the USA (chosen randomly from a larger data set of 132K cities), represented as latitude and longitude coordinates. 2. cities2 - Coordinates of 100K red cities (i.e., clients) and 400 black cities (i.e., servers). The red cities are disjoint from the black cities, and points of both colors were chosen at random from the same source.

Experiments on RNN Queries (continued): We chose 500 query points at random (without replacement) from the same source from which the data sets were drawn; note that these points are external to the data sets. For dynamic queries, we simulated a mixed workload by randomly choosing between insertions and deletions. For an insertion, one of the 500 query points was inserted; for a deletion, an existing point was chosen at random. We report the average I/O per query, that is, the cumulative number of page accesses divided by the number of queries.

Experiments on RNN Queries (continued): 1. Static case: We uniformly sampled the cities1 data set to get subsets of varying sizes, between 10K and 100K points. The figure shows the I/O performance of the proposed method compared to sequential scan.

2. Dynamic case: We used the cities1 data set and uniformly sampled it to get subsets of varying sizes, between 10K and 100K points. As shown in the figure, the I/O cost for an even workload of insertions and deletions appears to scale logarithmically, whereas the scanning method scales linearly. Interestingly, the average I/O is up to four times worse than in the static case, although this factor decreases for larger data sets.

Influence Sets: There are two potential problems with the effectiveness of any approach to finding influence sets: 1. the precision problem and 2. the recall problem. The first issue in finding influence sets is deciding what region to search. Two possibilities immediately present themselves: find the closest points (i.e., the k nearest neighbors) or all points within some radius (i.e., a range search). In the example, the black points represent servers and the white points represent clients, and we wish to find all the clients for which q is their closest server. The example illustrates that a range query (alternatively, a k-NN query) cannot find the desired information, regardless of which radius (or k) is chosen. Figure 8(a) shows a 'safe' radius l within which all points are reverse nearest neighbors of q; however, there exist reverse nearest neighbors of q outside l. Figure 8(b) shows a wider radius h that includes all of the reverse nearest neighbors of q but also points that are not.

Extended notion of Influence Sets: Reverse k-Nearest Neighbor: For static queries, the only difference in our solution is that we store the distance to the k-th nearest neighbor rather than to the nearest neighbor. When inserting or deleting q, we first find the set of affected points via the enclosure problem, as done when answering queries. For an insertion, we perform a range query to determine the k nearest neighbors of each affected point and make the necessary updates. For a deletion, the neighborhood radius of each affected point is expanded to the distance of its (k + 1)-th neighbor, which can be found by a modified NN search on R-trees.
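A brute-force sketch of the static reverse k-NN variant (illustrative names): the only change from the earlier circle construction is that the radius becomes the distance to the k-th nearest neighbor:

    import math

    def build_k_circles(S, k):
        circles = {}
        for p in S:
            dists = sorted(math.dist(p, q) for q in S if q != p)
            circles[p] = dists[k - 1]       # radius = distance to the k-th neighbor
        return circles

    def rknn_query(q, circles):
        return [p for p, radius in circles.items() if math.dist(p, q) <= radius]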

Extended notion of Influence Sets (continued): Reverse furthest neighbor: Define the influence set of a point q to be the set of all points r such that q is farther from r than any other point of the database is from r. Let S be the fixed set of points and let q denote a query point. For simplicity, we first describe our solution for the L∞ distance. Preprocessing: For each point p ∈ S, determine its furthest point, denoted f(p), and place a square Rp with center p and side length 2·d(p, f(p)). Query processing: The key observation is that for any query q, the reverse furthest neighbors are exactly those points r for which Rr does not contain q. The problem is thus a square non-enclosure problem.

Consider the intervals xr and yr obtained by projecting the square Rr onto the x and y axes, respectively. A point q = (x, y) is not contained in Rr if and only if either xr does not contain x or yr does not contain y. Therefore, if we return all the xr's that do not contain x as well as all the yr's that do not contain y, each square in the output is repeated at most twice. So the problem reduces to a one-dimensional problem on intervals without losing much efficiency: we are given a set of N intervals, each query is a one-dimensional point p, and the goal is to return all intervals that do not contain p. To solve this problem, we maintain two sorted arrays, one of the right endpoints of the intervals and one of their left endpoints. It then suffices to perform two binary searches with p in these arrays to determine the intervals that do not contain p; see the sketch below.
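A Python sketch of this one-dimensional subproblem using the two sorted endpoint arrays and the bisect module (illustrative; intervals are closed, so an interval fails to contain p exactly when its right endpoint is below p or its left endpoint is above p):

    import bisect

    def preprocess(intervals):
        # intervals: list of (left, right) pairs; build the two sorted arrays.
        lefts = sorted((l, i) for i, (l, r) in enumerate(intervals))
        rights = sorted((r, i) for i, (l, r) in enumerate(intervals))
        return lefts, rights

    def not_containing(p, lefts, rights):
        # Two binary searches; the rest of the cost is proportional to the output.
        ends_before = bisect.bisect_left(rights, (p, -1))             # right endpoint < p
        starts_after = bisect.bisect_right(lefts, (p, float('inf')))  # left endpoint > p
        return [i for _, i in rights[:ends_before]] + \
               [i for _, i in lefts[starts_after:]]

    iv = [(0, 3), (2, 5), (6, 9)]
    lefts, rights = preprocess(iv)
    print(sorted(not_containing(4, lefts, rights)))    # [0, 2]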

Conclusions: An RNN query can be reduced to a point enclosure problem on geometric objects. Nearest neighbor and range queries are inadequate for computing influence sets. The sequential scan approach scales linearly, whereas the algorithms proposed here scale logarithmically.

Questions?