Published by Antonia Hodge. Modified over 8 years ago.
1
Efficient Metric Index For Similarity Search
Lu Chen, Yunjun Gao, Xinhan Li, Christian S. Jensen, Gang Chen
2
Metric space
A metric space is a tuple (M, d), where
M: the domain of objects
d: a distance function that defines the similarity between the objects in M
The function d has four properties:
1. symmetry: d(q,o) = d(o,q)
2. non-negativity: d(q,o) ≥ 0
3. identity: d(q,o) = 0 if and only if q = o
4. triangle inequality: d(q,o) ≤ d(q,p) + d(o,p)
3
Metric space
For example, with edit distance as the distance function, any set of English words forms a metric space. The edit distance between two words is the minimum number of single-character edits (i.e., insertions, deletions, or substitutions) required to change one word into the other.
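As a minimal sketch of the definition above, edit distance can be computed with the standard dynamic-programming recurrence. The function name `edit_distance` is chosen here for illustration and does not come from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the DP row as short as possible
    # prev[j] = distance between the processed prefix of `a` and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```

The function is symmetric, non-negative, zero only for identical words, and satisfies the triangle inequality, so it meets the four properties required of d.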
4
Problem formulation
5
MAM: metric access methods
Compact partitioning methods divide the space into compact regions and try to discard unqualified regions during search.
Pivot-based methods store pre-computed distances from each object in the database to a set of pivots.
6
SPB-tree: Space-filling curve and Pivot-based B+-tree
Generic: it does not rely on the detailed representations of objects, and it can support any distance notion that satisfies the triangle inequality.
SPB-tree: integrates compact partitioning with a pivot-based approach by utilizing a space-filling curve and a B+-tree.
Efficient similarity search algorithms, with effective pivots that significantly reduce the number of distance computations during the search.
7
Construction framework: pivot mapping, followed by space-filling curve mapping
8
Pivot mapping and triangle inequality
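The figure for this slide is not reproduced here. As a rough illustration of the general idea, the sketch below maps an object to its vector of distances to the pivots and uses the triangle inequality to derive a lower bound on d(q,o) without computing it. The names `pivot_map`, `lower_bound`, and `can_prune` are hypothetical, and Euclidean distance (`math.dist`) is only a stand-in for an arbitrary metric.

```python
import math
from typing import Callable, List, Sequence


def pivot_map(obj, pivots: Sequence, dist: Callable = math.dist) -> List[float]:
    """Map an object to the vector of its distances to each pivot."""
    return [dist(obj, p) for p in pivots]


def lower_bound(q_vec: Sequence[float], o_vec: Sequence[float]) -> float:
    """By the triangle inequality, |d(q,p) - d(o,p)| <= d(q,o) for every
    pivot p, so the largest such difference is a lower bound on d(q,o)."""
    return max(abs(a - b) for a, b in zip(q_vec, o_vec))


def can_prune(q_vec: Sequence[float], o_vec: Sequence[float], radius: float) -> bool:
    """An object cannot lie within `radius` of q if its lower bound exceeds it."""
    return lower_bound(q_vec, o_vec) > radius


if __name__ == "__main__":
    pivots = [(0.0, 0.0), (10.0, 0.0)]
    q, o = (1.0, 1.0), (9.0, 1.0)
    q_vec, o_vec = pivot_map(q, pivots), pivot_map(o, pivots)
    print(lower_bound(q_vec, o_vec), math.dist(q, o))
    print(can_prune(q_vec, o_vec, radius=5.0))
```

Because the pivot distances of database objects can be computed once and stored, the pruning test needs only the query's distances to the pivots at search time.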
9
Space-filling curve mapping
If the range of d( ) in the metric space consists of discrete integers, the SFC can map the pivot-distance vector directly to an integer.
Because the range of d( ) in a metric space may instead be continuous real numbers, an approximation is used to partition the real range into discrete integers. Each pivot distance is approximated by an integer, so the whole vector space is partitioned into cells.
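The following sketch shows one way such a mapping could look. It assumes a Z-order (Morton) curve, which is only one possible space-filling curve, and a hypothetical cell width `delta` with floor rounding; the slides do not fix either choice in the text preserved here.

```python
import math
from typing import List, Sequence


def discretize(vec: Sequence[float], delta: float) -> List[int]:
    """Approximate each real-valued pivot distance by an integer cell
    coordinate. `delta` (cell width) and floor rounding are assumptions."""
    return [math.floor(x / delta) for x in vec]


def z_order(cell: Sequence[int], bits: int) -> int:
    """Interleave the low `bits` bits of every coordinate into one integer."""
    if any(c < 0 or c >= (1 << bits) for c in cell):
        raise ValueError("coordinate does not fit in the given number of bits")
    key = 0
    dims = len(cell)
    for bit in range(bits):
        for dim, c in enumerate(cell):
            # bit `bit` of dimension `dim` goes to position bit*dims + dim
            key |= ((c >> bit) & 1) << (bit * dims + dim)
    return key


if __name__ == "__main__":
    cell = discretize([1.4, 9.05], delta=0.5)
    print(cell, z_order(cell, bits=5))
```

Cells that are close in the discretized pivot space tend to receive nearby keys, which is what lets a one-dimensional index such as a B+-tree keep them clustered.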
11
Pivot selection
The number of pivots: the appropriate number of pivots is related to the intrinsic dimensionality of the dataset.
Use the HF-based incremental pivot selection algorithm (HFI) to find outliers.
12
Indexing structure
A pivot table stores the selected objects (e.g., o1 and o6) used to map the metric space into a vector space.
A B+-tree indexes the SFC values of objects after the pivot mapping.
A random access file (RAF) keeps the objects separately and supports both random access and sequential scan.
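To make the three components concrete, here is a toy sketch, not the actual SPB-tree layout. A sorted Python list with `bisect` stands in for the B+-tree, a plain list stands in for the RAF, and the class name `TinyPivotIndex`, the Z-order key, and the parameters `delta` and `bits` are all assumptions made for illustration.

```python
import bisect
import math
from typing import Callable, List, Sequence


class TinyPivotIndex:
    """Toy stand-in: pivot table + sorted SFC keys + object file."""

    def __init__(self, objects: Sequence, pivots: Sequence,
                 dist: Callable = math.dist, delta: float = 1.0, bits: int = 8):
        self.pivots = list(pivots)   # pivot table
        self.dist = dist
        self.delta = delta
        self.bits = bits
        entries = sorted(((self._key(o), o) for o in objects),
                         key=lambda e: e[0])
        self.keys = [k for k, _ in entries]  # sorted SFC values ("B+-tree" leaves)
        self.raf = [o for _, o in entries]   # objects stored in key order ("RAF")

    def _key(self, obj) -> int:
        # Pivot mapping, then discretization, then Z-order interleaving.
        cell = [int(self.dist(obj, p) // self.delta) for p in self.pivots]
        if any(c >= (1 << self.bits) for c in cell):
            raise ValueError("distance too large for the configured bits")
        key = 0
        for bit in range(self.bits):
            for dim, c in enumerate(cell):
                key |= ((c >> bit) & 1) << (bit * len(cell) + dim)
        return key

    def lookup(self, obj) -> List[int]:
        """Positions in the RAF whose SFC key equals the key of `obj`."""
        k = self._key(obj)
        lo = bisect.bisect_left(self.keys, k)
        hi = bisect.bisect_right(self.keys, k)
        return list(range(lo, hi))


if __name__ == "__main__":
    data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (1.0, 1.0)]
    index = TinyPivotIndex(data, pivots=[(0.0, 0.0), (6.0, 8.0)])
    print([index.raf[i] for i in index.lookup((3.0, 4.0))])
```

Keeping the objects in a separate file means the key index stays small, while the file can still be read sequentially in key order or accessed by position.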
14
Bulk-loading Operation
15
Similarity search
16
kNN search
17
Cost models: estimated number of distance computations (EDC)
The overall distribution of distances from objects in O to a pivot p_i is defined as:
F_{p_i}(r) = Pr{d(o, p_i) ≤ r}
The joint distribution over all pivots is:
F(r_1, r_2, ..., r_{|P|}) = Pr{d(o, p_1) ≤ r_1, d(o, p_2) ≤ r_2, ..., d(o, p_{|P|}) ≤ r_{|P|}}
EDC = |P| + |O| * Pr(d(q, o) needs to be computed)
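A rough way to see what the EDC formula measures is to estimate the probability term empirically. The sketch below assumes a range query with radius `radius` and assumes that d(q, o) "needs to be computed" exactly when the pivot-based lower bound fails to prune o; both are simplifications introduced here, and the function name `estimate_edc` is hypothetical.

```python
import math
from typing import Callable, Sequence


def estimate_edc(objects: Sequence, pivots: Sequence, query, radius: float,
                 dist: Callable = math.dist) -> float:
    """Empirical EDC = |P| + |O| * Pr(d(q, o) needs to be computed)."""
    # |P| distance computations are always spent mapping the query.
    q_vec = [dist(query, p) for p in pivots]
    not_pruned = 0
    for o in objects:
        # In a real index these distances are precomputed; they are
        # recomputed here only to keep the sketch self-contained.
        o_vec = [dist(o, p) for p in pivots]
        bound = max(abs(a - b) for a, b in zip(q_vec, o_vec))
        if bound <= radius:
            not_pruned += 1
    prob = not_pruned / len(objects)
    return len(pivots) + len(objects) * prob


if __name__ == "__main__":
    data = [(0.0,), (1.0,), (2.0,), (10.0,)]
    print(estimate_edc(data, pivots=[(0.0,)], query=(1.0,), radius=1.5))
```

In this example three of the four objects survive the pivot filter, so the estimate is one pivot computation plus three object computations.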
18
Cost models: expected number of page accesses (EPA)
The expected number of page accesses (EPA) of a similarity query can be calculated as
19
Experiments: effect of parameters (number of pivots)