Published by Claribel Magdalene Briggs. Modified over 8 years ago.
1
Similarity Search without Tears: the OMNI-Family of All-Purpose Access Methods
Michael Kelleher, Kiyotaka Iwataki
Department of Computer and Information Science and Engineering, University of Florida
2
Outline
Problem/Solution
Background
The Omni-concept
Members of the Omni-family
Experimental Results
3
Problem
Diverse and complex data
How to search
Expensive distance calculations
4
Solution
Reduce the number of distance calculations
The Omni-Concept/Family
Select a set of foci
Gauge all other objects by their distances from this set
The foci increase the pruning of distance calculations
Scalable
5
Background: Metric Spaces
Given a set of objects S = {s_1, s_2, s_3, ..., s_n} of domain S, the distance function d() has the following properties:
Symmetry: d(s_1, s_2) = d(s_2, s_1)
Non-negativity: 0 < d(s_1, s_2) < ∞ for s_1 ≠ s_2, and d(s_1, s_1) = 0
Triangle inequality: d(s_1, s_3) ≤ d(s_1, s_2) + d(s_2, s_3)
A metric space is a pair M = <S, d()>.
Spatial datasets with an Lp distance function are special cases of metric spaces.
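The three properties above can be spot-checked on a finite sample. The following is a minimal sketch, not part of the original slides: the names `lp_distance` and `satisfies_metric_axioms` are hypothetical, and passing the check on a sample does not prove that a function is a metric in general.

```python
from itertools import product

def lp_distance(a, b, p=2.0):
    """Minkowski (Lp) distance between two equal-length numeric tuples, p >= 1."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def satisfies_metric_axioms(objects, d, tol=1e-9):
    """Check symmetry, non-negativity and the triangle inequality on a sample."""
    for s1, s2 in product(objects, repeat=2):
        if abs(d(s1, s2) - d(s2, s1)) > tol:      # symmetry
            return False
        if s1 == s2 and d(s1, s2) > tol:          # d(s1, s1) = 0
            return False
        if s1 != s2 and not d(s1, s2) > 0:        # 0 < d(s1, s2) for s1 != s2
            return False
    for s1, s2, s3 in product(objects, repeat=3):
        if d(s1, s3) > d(s1, s2) + d(s2, s3) + tol:  # triangle inequality
            return False
    return True

# Hypothetical sample of 2-D points under the L2 (Euclidean) distance.
points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (-2.0, 5.0)]
print(satisfies_metric_axioms(points, lp_distance))
```

Squared Euclidean distance is a handy counterexample: it is symmetric and non-negative but fails the triangle inequality, so the pruning used by the Omni-family would not be safe with it.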
6
Range and NN Queries
Range: given a query object s_q and a maximum search distance r_q: Rquery(s_q, r_q) = {s_i | s_i ∈ S: d(s_i, s_q) ≤ r_q}
NN: given a query object s_q ∈ S: NNquery(s_q) = {s_n ∈ S | ∀ s_i ∈ S: d(s_n, s_q) ≤ d(s_i, s_q)}
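As a baseline, both query types can be written as a linear scan that computes one distance per object. This is only a sketch of the definitions, with hypothetical function names; it is the cost that the Omni-family tries to reduce.

```python
def range_query(dataset, d, s_q, r_q):
    """Rquery(s_q, r_q): every object within distance r_q of the query object."""
    return [s_i for s_i in dataset if d(s_i, s_q) <= r_q]

def nn_query(dataset, d, s_q):
    """One nearest neighbor of s_q (the first one found if several are tied)."""
    return min(dataset, key=lambda s_i: d(s_i, s_q))

# Hypothetical toy data: numbers on a line, distance = absolute difference.
def abs_distance(a, b):
    return abs(a - b)

dataset = [1.0, 4.0, 6.0, 9.0]
print(range_query(dataset, abs_distance, 5.0, 1.5))
print(nn_query(dataset, abs_distance, 8.0))
```

Note that the slide defines the NN answer as a set, which may hold several tied objects; this sketch returns a single one for simplicity.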
7
Current solutions
Metric tree of Uhlmann
Vantage-point tree
Generalized hyper-plane tree
Multi-vantage point tree
Geometric Near-neighbor Access Tree
The M-tree
8
Intrinsic Dimensionality
Some approaches assume that the embedding dimensionality of a dataset defines its behavior on a query.
Datasets can inhabit only a small portion of the embedding space.
The intrinsic dimensionality gives better precision when estimating selectivity.
Use the correlation fractal dimension D2 as an approximation of the intrinsic dimension.
9
Omni-concepts
Omni-foci base (F): given a metric space M, F = {f_1, f_2, ..., f_l | f_k ∈ S, f_k ≠ f_j, l ≤ N}
Omni-coordinates (C_i): C_i = {<f_k, d(f_k, s_i)>, for all f_k ∈ F}
mbOr (minimum bounding Omni region): given F and a collection of objects A = {x_1, x_2, ..., x_n} ⊂ S, the mbOr is the intersection of the metric intervals R_A = ∩_{i=1}^{l} I_i, where I_i = [min_j d(x_j, f_i), max_j d(x_j, f_i)], 1 ≤ i ≤ l, 1 ≤ j ≤ n.
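The definitions above can be illustrated with a short sketch. It assumes Euclidean points and uses hypothetical helper names (`omni_coordinates`, `mbor`, `in_mbor`); the Omni-coordinates are stored as a plain list of distances in the same order as the foci.

```python
import math

def omni_coordinates(s_i, foci, d):
    """C_i: the distance from s_i to each focus f_k, in focus order."""
    return [d(f_k, s_i) for f_k in foci]

def mbor(objects, foci, d):
    """One interval [min, max] of distances per focus, covering all objects."""
    coords = [omni_coordinates(x_j, foci, d) for x_j in objects]
    return [(min(c[i] for c in coords), max(c[i] for c in coords))
            for i in range(len(foci))]

def in_mbor(coords, region):
    """True if the coordinates fall inside every interval of the region."""
    return all(lo <= c <= hi for c, (lo, hi) in zip(coords, region))

# Hypothetical toy data: two foci and two objects in the plane.
foci = [(0, 0), (10, 0)]
objects = [(3, 4), (6, 8)]
region = mbor(objects, foci, math.dist)
print(region)
print(in_mbor(omni_coordinates((0, 0), foci, math.dist), region))
```

An object whose Omni-coordinates fall outside any one interval cannot belong to the collection, which is what makes the region usable as a cheap filter.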
10
[Figure: diagram with distance labels df1a, df1b, df2a, df2b]
11
Cardinality of F
A good cardinality for F lies between ceil(D2) + 1 and 2*ceil(D2) + 1, where ceil(D2) is the intrinsic dimension rounded up to the next integer.
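The rule of thumb is simple arithmetic. A small sketch, with a hypothetical function name, and assuming D2 has already been estimated for the dataset:

```python
import math

def foci_count_range(d2):
    """Suggested (minimum, maximum) number of foci for intrinsic dimension d2."""
    base = math.ceil(d2)
    return base + 1, 2 * base + 1

# Example with a made-up intrinsic dimension of 2.3: ceil(2.3) = 3.
print(foci_count_range(2.3))  # (4, 7)
```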
12
How to choose foci: HF-Algorithm
[Figure: example with objects s1 to s6 and numeric labels 2, 6, 7, 3, 5, 3, 6, 5.5, 10]
13
HF-Algorithm
The HF-Algorithm is practical: O(N)
Requires l*N distance calculations
An algorithm that finds the best foci: O(N!/(N-l)!)
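The slides do not list the steps of the HF-Algorithm, so the following is only a sketch of one common reading of it, and every detail is an assumption: start from a random object, take the object farthest from it as the first focus, take the object farthest from the first focus as the second, then add foci whose distances to the existing foci are closest to the distance between the first two. The name `hf_choose_foci` and the `seed` parameter are hypothetical.

```python
import random

def hf_choose_foci(dataset, d, num_foci, seed=0):
    """Pick num_foci far-apart objects with a linear scan per focus (sketch)."""
    rng = random.Random(seed)
    start = rng.choice(dataset)
    f1 = max(dataset, key=lambda s: d(start, s))   # farthest from a random object
    if num_foci == 1:
        return [f1]
    f2 = max(dataset, key=lambda s: d(f1, s))      # farthest from the first focus
    foci = [f1, f2]
    edge = d(f1, f2)
    while len(foci) < num_foci:
        candidates = [s for s in dataset if s not in foci]
        if not candidates:
            break
        # Prefer the object whose distances to the chosen foci are closest to `edge`.
        best = min(candidates,
                   key=lambda s: sum(abs(edge - d(f, s)) for f in foci))
        foci.append(best)
    return foci

# Hypothetical toy data: numbers on a line, distance = absolute difference.
data = [0.0, 1.0, 2.0, 10.0]
print(hf_choose_foci(data, lambda a, b: abs(a - b), 2))
```

For brevity this sketch recomputes distances to earlier foci in every round; an implementation that caches them would stay close to the l*N distance calculations quoted on the slide.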
14
Omni-sequential
Calculate C_i for every object s_i.
Precede each distance calculation with a check: for each f_k ∈ F, if |df_k(s_i) - df_k(s_q)| > r_q, then skip the distance calculation.
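The skip test is safe because the triangle inequality gives |df_k(s_i) - df_k(s_q)| ≤ d(s_i, s_q), so a difference larger than r_q proves that s_i lies outside the query radius. A minimal sketch of the scan follows, assuming the Omni-coordinates were precomputed and stored as lists of distances; the function name and the toy data are hypothetical.

```python
import math

def omni_sequential_range(dataset, coords, foci, d, s_q, r_q):
    """Range query that checks stored Omni-coordinates before computing d(s_i, s_q).

    Returns (answers, number of distance calculations made to dataset objects).
    """
    q_coords = [d(f_k, s_q) for f_k in foci]  # df_k(s_q), one per focus
    answers, computed = [], 0
    for s_i, c_i in zip(dataset, coords):
        # Any focus with |df_k(s_i) - df_k(s_q)| > r_q rules s_i out.
        if any(abs(c - q) > r_q for c, q in zip(c_i, q_coords)):
            continue
        computed += 1
        if d(s_i, s_q) <= r_q:
            answers.append(s_i)
    return answers, computed

# Hypothetical toy data: 2-D points under Euclidean distance.
dataset = [(0, 0), (1, 1), (5, 5), (9, 9), (2, 0)]
foci = [(0, 0), (9, 9)]
coords = [[math.dist(f, s) for f in foci] for s in dataset]  # precomputed C_i
answers, computed = omni_sequential_range(dataset, coords, foci, math.dist, (1, 0), 1.5)
print(answers, computed)
```

In this toy run two of the five objects are discarded by the coordinate check alone, at the price of one extra distance calculation per focus for the query object.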
15
OmniB+-tree
Store C_i in l B+-trees, one for each focus
Subsets I_k ⊂ S are retrieved from the corresponding B+-tree and used to generate the mbOr.
I_k contains the objects whose distance to f_k lies between df_k(s_q) - r_q and df_k(s_q) + r_q
Calculate the distance from s_q to each object in the intersection.
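The same steps can be sketched in memory, with one sorted list per focus standing in for each B+-tree and `bisect` standing in for the tree's range scan. This substitution is an assumption made for brevity, and the class name `OmniSortedIndex` is hypothetical; the sketch also assumes at least one focus.

```python
import bisect
import math

class OmniSortedIndex:
    """One sorted list of (distance to focus, object id) per focus."""

    def __init__(self, dataset, foci, d):
        self.dataset, self.foci, self.d = list(dataset), list(foci), d
        self.indexes = [
            sorted((d(f_k, s_i), i) for i, s_i in enumerate(self.dataset))
            for f_k in self.foci
        ]

    def range_query(self, s_q, r_q):
        candidates = None
        for f_k, entries in zip(self.foci, self.indexes):
            dq = self.d(f_k, s_q)
            # I_k: ids whose distance to f_k lies in [dq - r_q, dq + r_q].
            lo = bisect.bisect_left(entries, (dq - r_q, -1))
            hi = bisect.bisect_right(entries, (dq + r_q, len(self.dataset)))
            ids = {i for _, i in entries[lo:hi]}
            candidates = ids if candidates is None else candidates & ids
        # Real distances are computed only for objects in the intersection.
        return [self.dataset[i] for i in sorted(candidates)
                if self.d(self.dataset[i], s_q) <= r_q]

# Hypothetical toy data: 2-D points under Euclidean distance.
dataset = [(0, 0), (1, 1), (5, 5), (9, 9), (2, 0)]
index = OmniSortedIndex(dataset, [(0, 0), (9, 9)], math.dist)
print(index.range_query((1, 0), 1.5))
```

Intersecting the per-focus candidate sets plays the role of the mbOr around the query: only objects that survive every focus reach the final distance check.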
16
OmniR-tree
The algorithms for insertion, node partitioning, and range queries are the same as those of the underlying R-tree.
kNN requires the NN algorithm used in metric trees. A depth-first search is performed first to find k candidates. The search then keeps reducing the radius whenever the furthest neighbor is replaced, until every entry that overlaps the query radius has been tested.
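The shrinking-radius idea can be shown without a tree. The sketch below is not the OmniR-tree traversal itself: it scans the objects sequentially, keeps the k best candidates in a heap, and uses the distance to the current furthest candidate as the pruning radius for the Omni-coordinate check. The function name and the toy data are hypothetical.

```python
import heapq
import math

def omni_knn(dataset, coords, foci, d, s_q, k):
    """k nearest neighbors of s_q, pruning with a radius that only shrinks."""
    q_coords = [d(f_k, s_q) for f_k in foci]
    heap = []                    # max-heap through negated distances: (-dist, id)
    radius = float("inf")        # no pruning until k candidates are known
    for i, (s_i, c_i) in enumerate(zip(dataset, coords)):
        if any(abs(c - q) > radius for c, q in zip(c_i, q_coords)):
            continue             # cannot be closer than the furthest candidate
        dist = d(s_i, s_q)
        if len(heap) < k:
            heapq.heappush(heap, (-dist, i))
        elif dist < -heap[0][0]:
            heapq.heapreplace(heap, (-dist, i))  # replace the furthest neighbor
        if len(heap) == k:
            radius = -heap[0][0]                 # distance to furthest candidate
    ordered = sorted(heap, key=lambda e: (-e[0], e[1]))  # nearest first
    return [dataset[i] for _, i in ordered]

# Hypothetical toy data: 2-D points under Euclidean distance.
dataset = [(0, 0), (1, 1), (5, 5), (9, 9), (2, 0)]
foci = [(0, 0), (9, 9)]
coords = [[math.dist(f, s) for f in foci] for s in dataset]
print(omni_knn(dataset, coords, foci, math.dist, (0.5, 0), 2))
```

A tree-based version would apply the same radius test to whole nodes instead of single objects, skipping every entry whose region does not overlap the current query radius.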
17
OmniR-tree
Requires an R-tree to store C_i
Requires a page direct access file to store the objects in the dataset.
When a leaf of the R-tree is retrieved and the C_i stored in this node qualify objects, the actual distances to those objects are calculated.
18
The graphs show that the intrinsic dimensionality of the data is a good reference for the number of foci.
20
Review
Reduce the number of distance calculations
The Omni-Family
Select a set of foci
Gauge all other objects by their distances from this set
The foci increase the pruning of distance calculations
Scalable