Published by Rudolph Wilfred Sherman. Modified over 8 years ago.
1
Multimedia and Time-Series Data: When Is “Nearest Neighbor” Meaningful? Group members: Terry Chan, Edward Chu, Dominic Leung, David Mak, Henry Yeung, Jason Yeung. CSIS 7101 Advanced Database
2
Contribution: As dimensionality increases, the distance to the nearest neighbor approaches the distance to the farthest neighbor. The distinction between nearest and farthest neighbors may blur with as few as 15 dimensions. Linear scan almost always outperforms Nearest Neighbor processing techniques.
3
Part One: Introducing Nearest Neighbor
4
What is the nearest neighbor (NN) problem? Given a collection of data points and a query point in an m-dimensional metric space, find the data point that is closest to the query point.
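The problem statement above can be sketched as a brute-force search. This is only a minimal illustration, assuming Euclidean distance as the metric; the function name `nearest_neighbor` and the sample points are hypothetical and do not come from the slides.

```python
import math

def nearest_neighbor(points, query):
    """Return (index, distance) of the point closest to `query`.

    Assumes Euclidean distance; any other metric could be substituted.
    """
    best_index, best_dist = -1, math.inf
    for i, p in enumerate(points):
        d = math.dist(p, query)
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index, best_dist

# Hypothetical 2-D sample data.
points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
index, distance = nearest_neighbor(points, (0.9, 1.2))
print(index, distance)
```

Every point is visited exactly once, so this is also the simplest form of the linear scan discussed later in the deck.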
5
Nearest neighbor algorithm: 1. Make two sets of nodes, set A and set B, and put all nodes into set B. 2. Move your starting node into set A. 3. Pick the node in set B that is closest to the node most recently placed in set A, and move it into set A. 4. Repeat step 3 until all nodes are in set A.
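The four steps above can be sketched as a greedy chaining procedure. Note that it orders all nodes by repeatedly hopping to the closest unvisited one, which is different from answering a single NN query. The function name, the Euclidean metric, and the tie-breaking by index are assumptions made for this illustration.

```python
import math

def greedy_nearest_neighbor_order(nodes, start=0):
    """Return the order in which nodes move from set B into set A."""
    remaining = set(range(len(nodes)))   # set B: nodes not yet placed
    remaining.remove(start)
    visited = [start]                    # set A, in insertion order
    while remaining:
        last = nodes[visited[-1]]
        # Closest remaining node to the most recently placed node;
        # ties are broken by the smaller index (an assumption).
        nxt = min(remaining, key=lambda i: (math.dist(nodes[i], last), i))
        remaining.remove(nxt)
        visited.append(nxt)
    return visited

# Hypothetical nodes on a line.
nodes = [(0.0, 0.0), (5.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
print(greedy_nearest_neighbor_order(nodes))
```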
6
Query point and its nearest neighbor (figure: a query point and its NN)
7
Practical Applications of NN Search: Medical Imaging, Molecular Biology, Spatial and Multimedia Databases
8
Adaptable Similarity Approach: In a multimedia database, one may want to retrieve all images in an image database that are similar to a given query image. The data domain is high dimensional.
9
Color-oriented similarity of images: on an image database
10
Shape-oriented similarity of images: aims at the level of individual pixels
11
Shape-oriented similarity of 3-D objects: on a 3-D protein database
12
Approximation-based shape similarity of 3-D surface segments: measures the similarity of 3-D segments by using geometric approximation
13
Exception Case: The distance from the query point to its nearest neighbor differs little from its distance to any other point in the data set. We call this an unstable query.
14
Part Two: Nearest Neighbor in High Dimensions
15
Unstable query: A nearest neighbor query is unstable for a given ε if the distance from the query point to most data points is less than (1 + ε) times the distance from the query point to its nearest neighbor.
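The definition above can be turned into a small check. This is a sketch under stated assumptions: "most" is interpreted as "more than a given `fraction` of the data points" (default 0.5), the metric is Euclidean, and points exactly at (1 + ε) times the NN distance are counted as within the bound. The function name `is_unstable` and the sample data are hypothetical.

```python
import math

def is_unstable(points, query, eps, fraction=0.5):
    """True if more than `fraction` of the points lie within
    (1 + eps) times the nearest-neighbor distance of `query`."""
    dists = [math.dist(p, query) for p in points]
    d_min = min(dists)
    within = sum(1 for d in dists if d <= (1 + eps) * d_min)
    return within / len(dists) > fraction

# All points at nearly the same distance from the query: unstable.
ring = [(10.0, 0.0), (0.0, 10.5), (-10.2, 0.0), (0.0, -10.1)]
print(is_unstable(ring, (0.0, 0.0), eps=0.1))

# One point much closer than the rest: stable.
spread = [(1.0, 0.0), (10.0, 0.0), (20.0, 0.0), (30.0, 0.0)]
print(is_unstable(spread, (0.0, 0.0), eps=0.1))
```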
16
NN in High-Dimensional Space: The paper proves that the concept of NN becomes meaningless as dimensionality (m) increases, provided a pre-condition holds: as m increases, the difference in distance between the query point and all data points becomes negligible (i.e., the query becomes unstable).
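The effect described above can be observed empirically. The following sketch assumes IID uniform data and queries on [0, 1] in every dimension and Euclidean distance; it reports the ratio of the farthest to the nearest distance, which should shrink toward 1 as m grows. The point count, the seed, and the function name are arbitrary choices for illustration, not values from the paper.

```python
import math
import random

def max_min_distance_ratio(num_points, dims, rng):
    """Ratio of farthest to nearest distance from a random query
    to `num_points` random points, all IID uniform on [0, 1]^dims."""
    query = [rng.random() for _ in range(dims)]
    dists = [
        math.dist([rng.random() for _ in range(dims)], query)
        for _ in range(num_points)
    ]
    return max(dists) / min(dists)

rng = random.Random(0)  # fixed seed so runs are repeatable
for dims in (2, 10, 100, 1000):
    print(dims, round(max_min_distance_ratio(500, dims, rng), 2))
```

A ratio close to 1 means the nearest and farthest neighbors are almost equally far from the query, which is the unstable situation the slides describe.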
17
Assumptions for the pre-condition to hold: The data distribution and query distribution are IID in all dimensions. The dimensions are unique, with correlation between all dimensions.
18
What is IID? Independent and identically distributed. It means that the distribution of values in each dimension is identical (e.g., all values are uniformly distributed, or all dimensions have the same skew) and independent.
19
High-dimensional indexing can be meaningful: When the dimensions of both the query point and the data points follow identical distributions but are completely dependent (i.e., value in D1 = value in D2 = …), the result is a set of data points and a query point on a diagonal line. The underlying query can actually be converted to a 1-D NN problem.
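A small sketch of the diagonal case above: when every coordinate of a point holds the same value, the Euclidean distance between two points is sqrt(m) times the difference of their single values, so the nearest neighbor in m dimensions coincides with the nearest neighbor among the 1-D values. The helper `nn_index` and the sample values are hypothetical.

```python
import math

def nn_index(points, query):
    """Index of the point closest to `query` (Euclidean distance)."""
    return min(range(len(points)), key=lambda i: math.dist(points[i], query))

dims = 5
values = [0.3, 2.0, 0.9, 1.4]   # one value per data point
query_value = 1.0

# Completely dependent dimensions: every coordinate repeats the same value.
points = [(v,) * dims for v in values]
query = (query_value,) * dims

full = nn_index(points, query)                              # 5-D search
one_d = nn_index([(v,) for v in values], (query_value,))    # 1-D search
print(full, one_d)
```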
20
Graphical View (axes X, Y, Z): All dimensions have the same value, so all data points are on the diagonal.
21
High-dimensional indexing can be meaningful (Cont’d): When the underlying dimensionality is much lower than the actual dimensionality. E.g., it is a 3-D data set, but the data always have the same Z coordinate.
22
High-dimensional indexing can be meaningful (Cont’d): When the query point is within some small distance of a data point (instead of being required to be identical to a data point). The query then returns all points within the closest cluster, not just the nearest point.
23
NN query in clustered data (figure: a query point and its nearest cluster). E.g., data falls into discrete classes or clusters in some potentially high-dimensional feature space.
24
Distribution of distances in clustered data: Points that are close are in the same cluster (NN meaningful). Points in other clusters are all far away.
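The two groups of distances above can be illustrated with synthetic clustered data. This is only a sketch under assumptions not taken from the paper: cluster centers are uniform on [0, 1]^m, points scatter around their center with a small Gaussian spread, and the query is drawn from the first cluster. All names and parameter values are hypothetical.

```python
import math
import random

def clustered_distances(num_clusters, per_cluster, dims, spread, rng):
    """Distances from a query in cluster 0 to points in the same
    cluster (`same`) and to points in all other clusters (`other`)."""
    centers = [[rng.random() for _ in range(dims)] for _ in range(num_clusters)]

    def near(center):
        # A point scattered around `center` with Gaussian noise.
        return [c + rng.gauss(0.0, spread) for c in center]

    query = near(centers[0])
    same = [math.dist(near(centers[0]), query) for _ in range(per_cluster)]
    other = [
        math.dist(near(center), query)
        for center in centers[1:]
        for _ in range(per_cluster)
    ]
    return same, other

same, other = clustered_distances(5, 50, 50, 0.01, random.Random(0))
print("farthest point in the same cluster:", max(same))
print("nearest point in another cluster: ", min(other))
```

With well-separated clusters, every same-cluster distance is far smaller than every other-cluster distance, so returning the closest cluster remains meaningful even at 50 dimensions.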
25
Experimental studies of NN: The aim is to find the rate of convergence. Based on three synthetic workloads and one real data set. NN can become unstable with as few as 10-20 dimensions. The graph is exponential. In reality, the number of dimensions might be 1000.
26
Correlated Distributions: Recursive and uniform workloads (NN not meaningful). Two-degrees-of-freedom workload (NN meaningful).
27
Part Three Linear Scan is powerful …
28
NN indexing vs. linear scan: Linear scan can handily beat NN indexing. NN indexing is meaningful when the data consists of small, well-formed clusters and the query is guaranteed to land in or very near one of these clusters.
29
Why linear scan: Reading a set of sequentially arranged disk pages is much faster than unordered retrieval of the same pages. Fetching a large number of data pages through a multi-dimensional index usually results in unordered retrieval.
30
Linear scan outperforms: both the SS-tree and the R*-tree at 10 dimensions in all cases; the SR-tree in all cases on the 16-dimensional synthetic data set.
31
Justification: All the reported performance studies examined situations in which the distance between the query point and its NN differed little from the distance to other data points. In reality, it might be different.
32
Other related work: Dimensionality Curse, Fractal Dimensions
33
Dimensionality Curse: A vague indication that high dimensionality causes problems in some situations. Example: the NN problem. “Boundary effects” are not taken into account for NN queries in the high-dimensional case.
34
Fractal Dimensions: A measure of how "complicated" a self-similar figure (data set) is. NN queries become stable when the fractal dimensionality is low. In reality, real data sets do not exhibit fractal behavior.
35
Conclusion: The effect of dimensionality on NN queries. High-dimensional indexing can be meaningful. Evaluate the NN workload. Linear scan outperforms NN processing techniques on some meaningful workloads.
36
References: Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft. When Is “Nearest Neighbor” Meaningful? Thomas Seidl. Adaptable Similarity Search in 3-D Spatial Database System. http://www.dbs.informatik.uni-muenchen.de/Forschung/Similarity/