Efficient Metric Index For Similarity Search Lu Chen, Yunjun Gao, Xinhan Li, Christian S. Jensen, Gang Chen.

Metric space  A metric space is a tuple (M, d), where M is the domain of objects and d is a distance function that defines the similarity between the objects in M.  The function d has four properties: 1. symmetry: d(q,o) = d(o,q); 2. non-negativity: d(q,o) ≥ 0; 3. identity: d(q,o) = 0 if and only if q = o; 4. triangle inequality: d(q,o) ≤ d(q,p) + d(p,o)

Metric space  For example, with edit distance as the distance function, any set of English words forms a metric space.  The edit distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other.
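The edit distance described on this slide can be sketched with the standard dynamic-programming recurrence. The function name `edit_distance` is chosen here for illustration and does not come from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            substitution_cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,                      # delete ca
                curr[j - 1] + 1,                  # insert cb
                prev[j - 1] + substitution_cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    words = ["kitten", "sitting", "mitten"]
    for w in words[1:]:
        print(words[0], w, edit_distance(words[0], w))
```

Because each edit is reversible at the same cost, the function is symmetric, and chaining two edit sequences gives the triangle inequality, which is why a word set with this distance qualifies as a metric space.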

Problem formulation

Metric access methods (MAMs)  Compact partitioning methods divide the space into compact regions and try to discard unqualified regions during search.  Pivot-based methods store precomputed distances from each object in the database to a set of pivots.

SPB-tree: Space-filling curve and Pivot-based B+-tree  Generic: it does not rely on the detailed representations of objects, and it supports any distance notion that satisfies the triangle inequality.  SPB-tree: integrates compact partitioning with a pivot-based approach by using a space-filling curve and a B+-tree.  Efficient similarity search algorithms: effective pivots are used to reduce significantly the number of distance computations during search.

Construction framework  Pivot mapping  Space-filling curve mapping

Pivot mapping and triangle inequality
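As a minimal sketch of how pivot mapping and the triangle inequality work together: each object o is mapped to the vector of its distances to the pivots, and for any pivot p the triangle inequality gives |d(q,p) − d(o,p)| ≤ d(q,o). An object whose lower bound exceeds the query radius can therefore be discarded without computing d(q,o). The helper names below (`pivot_map`, `lower_bound`, `range_query`) are illustrative and are not taken from the slides; the linear scan stands in for the index.

```python
from typing import Callable, List, Sequence, Tuple


def pivot_map(obj, pivots: Sequence, dist: Callable) -> List[float]:
    """Map an object to the vector of its distances to the pivots."""
    return [dist(obj, p) for p in pivots]


def lower_bound(q_vec: Sequence[float], o_vec: Sequence[float]) -> float:
    """Largest |d(q,p_i) - d(o,p_i)| over all pivots; never exceeds d(q,o)."""
    return max(abs(a - b) for a, b in zip(q_vec, o_vec))


def range_query(query, radius: float, objects: Sequence, pivots: Sequence,
                dist: Callable) -> Tuple[list, int]:
    """Return objects within `radius` of `query` and the number of
    query-to-object distances that actually had to be computed."""
    mapped = [pivot_map(o, pivots, dist) for o in objects]  # done once, offline
    q_vec = pivot_map(query, pivots, dist)
    results, verified = [], 0
    for obj, o_vec in zip(objects, mapped):
        if lower_bound(q_vec, o_vec) > radius:
            continue  # pruned by the triangle inequality
        verified += 1
        if dist(query, obj) <= radius:
            results.append(obj)
    return results, verified


if __name__ == "__main__":
    d = lambda x, y: abs(x - y)  # a simple metric on numbers
    print(range_query(7, 2, [1, 4, 8, 15, 20], pivots=[1, 20], dist=d))
```

The pruning is safe for any distance function that satisfies the four metric properties, since the bound uses nothing beyond symmetry and the triangle inequality.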

Space-filling curve mapping  If the range of d( ) in the metric space consists of discrete integers, the space-filling curve (SFC) can map the pivot-mapped vectors directly to integers.  Because the range of d( ) in a metric space may instead consist of continuous real numbers, an approximation (its symbol and formula are lost in this transcript) is used to partition the real range into discrete integers, so that the whole vector space is partitioned into cells.
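The two steps on this slide can be sketched as follows. Both choices here are assumptions made for illustration, since the transcript does not preserve the formula: distances are discretized by flooring d/`delta` for a hypothetical cell width `delta`, and a Z-order (Morton) curve is used as the space-filling curve.

```python
import math
from typing import List, Sequence


def discretize(vector: Sequence[float], delta: float) -> List[int]:
    """Assumed discretization: cell index floor(d / delta) per dimension."""
    return [math.floor(x / delta) for x in vector]


def z_order(coords: Sequence[int], bits: int) -> int:
    """Interleave the lowest `bits` bits of each coordinate into one integer
    (Z-order / Morton code). Coordinate 0 supplies the least significant bit
    of every group of len(coords) bits."""
    key = 0
    n = len(coords)
    for bit in range(bits):
        for dim, c in enumerate(coords):
            key |= ((c >> bit) & 1) << (bit * n + dim)
    return key


if __name__ == "__main__":
    mapped = [2.6, 0.4]             # distances from one object to two pivots
    cell = discretize(mapped, 0.5)  # cell coordinates in the pivot space
    print(cell, z_order(cell, bits=4))
```

The resulting integer is the kind of one-dimensional key that a B+-tree can index, with nearby cells tending to receive nearby keys.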

Pivot selection  Number of pivots: the appropriate number of pivots is related to the intrinsic dimensionality of the dataset.  The HF-based Incremental pivot selection algorithm (HFI) is used to find outliers.
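The slide does not say how intrinsic dimensionality is measured. One commonly used estimate for metric spaces is μ² / (2σ²), where μ and σ² are the mean and variance of the pairwise distances. The sketch below assumes that estimate; the function name is illustrative.

```python
from itertools import combinations
from statistics import mean, pvariance
from typing import Callable, Sequence


def intrinsic_dimensionality(objects: Sequence, dist: Callable) -> float:
    """Estimate mu^2 / (2 * sigma^2) from all pairwise distances.
    Assumes at least two distinct pairwise distance values."""
    distances = [dist(a, b) for a, b in combinations(objects, 2)]
    mu = mean(distances)
    sigma2 = pvariance(distances)  # population variance of the distances
    return mu * mu / (2 * sigma2)


if __name__ == "__main__":
    sample = [0, 1, 3]
    print(intrinsic_dimensionality(sample, lambda x, y: abs(x - y)))
```

For large datasets the pairwise distances would normally be taken from a random sample rather than from all objects.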

Indexing structure  A pivot table stores the selected objects (e.g., o1 and o6) used to map a metric space into a vector space.  A B+-tree indexes the SFC values of the objects after pivot mapping.  A random access file (RAF) keeps the objects separately and supports both random access and sequential scans.

Bulk-loading Operation

Similarity search

kNN search

Cost models: estimated number of distance computations (EDC)  The overall distribution of distances from objects in O to a pivot p_i is defined as F_pi(r) = Pr{d(o, p_i) ≤ r}, and the joint distribution over all pivots as F(r_1, r_2, ..., r_|P|) = Pr{d(o, p_1) ≤ r_1, d(o, p_2) ≤ r_2, ..., d(o, p_|P|) ≤ r_|P|}.  EDC = |P| + |O| · Pr(d(q, o) needs to be computed)
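To make the EDC formula concrete, the sketch below estimates Pr(d(q, o) needs to be computed) empirically for a range query of radius r, rather than from the distribution F. It assumes that a distance must be computed exactly when the object cannot be pruned by any pivot, that is, when |d(q,p_i) − d(o,p_i)| ≤ r for every pivot. This is an illustration under that assumption, not the closed-form model from the presentation, and the function name is hypothetical.

```python
from typing import Sequence


def estimate_edc(query_dists: Sequence[float],
                 mapped_sample: Sequence[Sequence[float]],
                 n_objects: int,
                 radius: float) -> float:
    """EDC = |P| + |O| * Pr(d(q, o) needs to be computed), with the
    probability estimated as the fraction of sampled objects that no
    pivot can prune."""
    n_pivots = len(query_dists)
    survivors = sum(
        all(abs(qd - od) <= radius for qd, od in zip(query_dists, o_vec))
        for o_vec in mapped_sample
    )
    probability = survivors / len(mapped_sample)
    return n_pivots + n_objects * probability


if __name__ == "__main__":
    sample = [(1, 1), (5, 5), (2, 2), (9, 9)]  # pivot-mapped sample objects
    print(estimate_edc((1.5, 1.5), sample, n_objects=100, radius=1))
```

The |P| term accounts for the distances from the query to each pivot, which are always computed before any pruning can take place.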

Cost models: expected number of page accesses (EPA)  The expected number of page accesses (EPA) of a similarity query can be calculated as

Experiments: effect of parameters  Number of pivots
