Distance-based Indexing for metric space & almost-metric space Donghui Zhang Northeastern University.


1 Distance-based Indexing for metric space & almost-metric space Donghui Zhang Northeastern University

2 Problem Statement Given a set S of objects and a metric distance function d(), the similarity search problem is defined as: for an arbitrary object q and a threshold ε, find { o | o ∈ S ∧ d(o, q) < ε }. Solution without an index: for every o ∈ S, compute d(q, o). Not efficient!
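As a point of reference, the index-free solution can be sketched in a few lines of Python. The function name `linear_scan`, the sample data, and the choice of absolute difference as the distance are illustrative assumptions, not part of the slides.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")

def linear_scan(objects: Iterable[T], q: T, eps: float,
                d: Callable[[T, T], float]) -> List[T]:
    """Return every object o with d(o, q) < eps, calling d once per object."""
    return [o for o in objects if d(o, q) < eps]

# Hypothetical data: integers under absolute difference, which is a metric.
S = [1, 4, 9, 15, 22, 30]
result = linear_scan(S, q=10, eps=6, d=lambda a, b: abs(a - b))
print(result)  # [9, 15]
```

Every query costs |S| distance computations, which is what the index is meant to avoid.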

3 Metric Function 1. d(x,x) = 0; 2. d(x,y) > 0, where x ≠ y; 3. d(x,y) = d(y,x); 4. d(x,y) ≤ d(x,z) + d(y,z).
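The four conditions can be checked by brute force on a small sample, which is a quick sanity test for a candidate distance function (passing on a sample does not prove the function is a metric). The helper name `is_metric_on` and the two example distances are assumptions made for illustration.

```python
from itertools import product

def is_metric_on(sample, d):
    """Check the four metric conditions on all pairs and triples of a sample."""
    for x, y in product(sample, repeat=2):
        if x == y and d(x, y) != 0:       # condition 1
            return False
        if x != y and d(x, y) <= 0:       # condition 2
            return False
        if d(x, y) != d(y, x):            # condition 3
            return False
    for x, y, z in product(sample, repeat=3):
        if d(x, y) > d(x, z) + d(y, z):   # condition 4 (triangle inequality)
            return False
    return True

points = [0, 1, 2, 5, 9]
abs_ok = is_metric_on(points, lambda a, b: abs(a - b))    # True
sq_ok = is_metric_on(points, lambda a, b: (a - b) ** 2)   # False: 4 > 1 + 1 for x=0, y=2, z=1
```

Squared difference fails only the triangle inequality, which makes it a convenient test case for the relaxed condition discussed later in the deck.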

4 Spatial-Index Approach If every object can be mapped to a location in space (e.g. 2-D point), there are existing solutions. R-tree, Quad-tree, X-tree, … Idea: break space hierarchically into partitions and store objects that are close to each other in the same partition; at query time, prune whole partitions if possible.

5 Spatial Indexes Do Not Apply In our problem, objects can be arbitrary and we only know the distance function. E.g. objects can be pictures, dogs, and so on. How would we map a dog to a multi-dimensional point? Not clear. But suppose we are given a “magical” distance function.

6 VP-tree Vantage point tree, by Peter N. Yianilos, “Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces”, Proc. ACM-SIAM Symposium on Discrete Algorithms, 1993. Idea: build a binary search tree in which each node corresponds to an object; the root is picked at random; the n/2 objects closest to it go in the left subtree and the rest in the right subtree.

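7 An Example S = {o_1, …, o_10}. Randomly pick o_1 as root. Compute the distance between o_1 and each o_i, sort in increasing order of distance, then build the tree recursively.
Object:   o_3  o_7  o_6  o_9  o_10  o_2  o_8  o_5  o_4
Distance:   5    6   18   34    96  102  111  300  401
Root: o_1, annotated with 34 and 96 (the largest distance in the left subtree and the smallest distance in the right subtree).
Left subtree: o_3, o_7, o_6, o_9. Right subtree: o_10, o_2, o_8, o_5, o_4.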
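The first split of this example can be reproduced directly. The sketch below assumes the figure's distances read as 5, 6, 18, 34, 96, 102, 111, 300, 401 for o_3, o_7, o_6, o_9, o_10, o_2, o_8, o_5, o_4 (an interpretation of the slide's figure); the variable names are illustrative.

```python
# Distances d(o1, oi), as read from the example figure (assumed parsing).
dist_to_root = {"o3": 5, "o7": 6, "o6": 18, "o9": 34, "o10": 96,
                "o2": 102, "o8": 111, "o5": 300, "o4": 401}

# Sort the remaining objects by distance to the root and split in half.
ordered = sorted(dist_to_root, key=dist_to_root.get)
half = len(ordered) // 2
left, right = ordered[:half], ordered[half:]

# Boundary values kept at the root for pruning at query time.
max_dl = max(dist_to_root[o] for o in left)    # 34
min_dr = min(dist_to_root[o] for o in right)   # 96
```

The two boundary values match the 34 and 96 shown at the root. Building the deeper levels would need the pairwise distances among the other objects, which the example does not give.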

8 Query Processing Given object q, compute d(q, root). Intuitively, if it’s small, search the left sub-tree; otherwise, search the right sub-tree. Let maxDL = max{ d(root, o_i) | o_i ∈ left sub-tree } (stored in the index). Under what circumstances can we prune the left sub-tree?

9 To Prune the Left Sub-Tree…
1. We need: ∀ o_i ∈ left sub-tree, d(q, o_i) ≥ ε.
2. We know: d(q, o_i) + d(o_i, root) ≥ d(q, root), or d(q, o_i) ≥ d(q, root) - d(o_i, root), which implies: d(q, o_i) ≥ d(q, root) - maxDL.
3. To guarantee (1), it’s sufficient to have: d(q, root) - maxDL ≥ ε.
Summary: given q, compare with the tree root. If d(q, root) is so large that (3) is true, we know (1) is true and we can prune the left sub-tree.

10 To Prune the Right Sub-Tree…
1. Similarly, we define minDR = min{ d(root, o_i) | o_i ∈ right sub-tree }.
2. Given q, compare with the tree root. If d(q, root) is so small that minDR - d(q, root) ≥ ε is true, we can prune the right sub-tree.
Note: these prunings are done at each level of the tree.

11 Can we always prune? No. If d(q, root) - maxDL < ε, we cannot prune left; if minDR - d(q, root) < ε, we cannot prune right. Combined: if minDR - ε < d(q, root) < maxDL + ε, we have to examine both sub-trees.
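Putting the construction and both pruning rules together, a minimal sketch of a VP-tree range search might look as follows. The class and function names (`Node`, `build`, `range_search`), the random test data, and the absolute-difference metric are assumptions for illustration; this is not Yianilos's original implementation.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Node:
    obj: object
    max_dl: float = 0.0               # largest d(root, o) in the left sub-tree
    min_dr: float = 0.0               # smallest d(root, o) in the right sub-tree
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def build(objects: List, d: Callable, rng: random.Random) -> Optional[Node]:
    """Pick a random root, put the closer half on the left, recurse."""
    if not objects:
        return None
    objects = list(objects)
    root = objects.pop(rng.randrange(len(objects)))
    node = Node(root)
    objects.sort(key=lambda o: d(root, o))
    half = len(objects) // 2
    near, far = objects[:half], objects[half:]
    if near:
        node.max_dl = d(root, near[-1])
    if far:
        node.min_dr = d(root, far[0])
    node.left = build(near, d, rng)
    node.right = build(far, d, rng)
    return node

def range_search(node, q, eps, d, out):
    """Collect every object o with d(q, o) < eps, pruning where allowed."""
    if node is None:
        return out
    dq = d(q, node.obj)
    if dq < eps:
        out.append(node.obj)
    # Left sub-tree is skipped only when d(q, root) - maxDL >= eps.
    if node.left is not None and dq - node.max_dl < eps:
        range_search(node.left, q, eps, d, out)
    # Right sub-tree is skipped only when minDR - d(q, root) >= eps.
    if node.right is not None and node.min_dr - dq < eps:
        range_search(node.right, q, eps, d, out)
    return out

# Hypothetical data: 500 random integers under absolute difference.
rng = random.Random(0)
data = [rng.randrange(10_000) for _ in range(500)]
dist = lambda a, b: abs(a - b)
tree = build(data, dist, rng)
found = sorted(range_search(tree, 5000, 50, dist, []))
expected = sorted(o for o in data if dist(o, 5000) < 50)
```

Comparing `found` with the linear-scan result `expected` is a simple way to confirm that the pruning never discards a qualifying object.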

12 Almost Metric The almost metric was introduced in the paper “Distance Based Indexing for String Proximity Search”, ICDE’03. It is similar to a metric, except that the condition d(x,y) ≤ d(x,z) + d(y,z) is relaxed to d(x,y) ≤ f * ( d(x,z) + d(y,z) ) for some constant f. Can the VP-tree be used in an almost-metric space?

13 A Thought on f We must have f ≥ 1. Why?
d(x,y) ≤ f * ( d(x,z) + d(y,z) )
d(x,y) ≤ f * ( d(x,y) + d(y,y) )   (let z = y)
d(x,y) ≤ f * d(x,y)   (since d(y,y) = 0)
f ≥ 1   (dividing by d(x,y), which is > 0 for x ≠ y)
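For a concrete distance function, the smallest f that works on a finite sample can be estimated by brute force. The helper `smallest_f` and the two example distances are illustrative assumptions; squared difference is used because it satisfies the relaxed inequality with f = 2 while violating the ordinary triangle inequality.

```python
from itertools import product

def smallest_f(sample, d):
    """Largest ratio d(x,y) / (d(x,z) + d(y,z)) over all triples in the sample."""
    best = 0.0
    for x, y, z in product(sample, repeat=3):
        denom = d(x, z) + d(y, z)
        if denom > 0:                      # denom == 0 only when x == y == z
            best = max(best, d(x, y) / denom)
    return best

points = list(range(6))
f_abs = smallest_f(points, lambda a, b: abs(a - b))     # 1.0, reached when z = y
f_sq = smallest_f(points, lambda a, b: (a - b) ** 2)    # 2.0, reached at x=0, y=2, z=1
```

In both cases the estimate is at least 1, as the argument above requires, because taking z = y always yields a ratio of exactly 1.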

14 To Prune the Right Sub-Tree…
1. We need: ∀ o_i ∈ right sub-tree, d(q, o_i) ≥ ε.
2. We know: d(o_i, root) ≤ f * ( d(q, o_i) + d(q, root) ), or d(q, o_i) ≥ d(o_i, root)/f - d(q, root), which implies: d(q, o_i) ≥ minDR/f - d(q, root).
3. To guarantee (1), it’s sufficient to have: minDR/f - d(q, root) ≥ ε.
Summary: given q, compare with the tree root. If d(q, root) is so small that (3) is true, we know (1) is true and we can prune the right sub-tree.
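Dividing the relaxed triangle inequality by f gives d(q, o_i) ≥ d(o_i, root)/f - d(q, root), so a safe right-pruning test is minDR/f - d(q, root) ≥ ε. The sketch below encodes that test and checks it by brute force on squared difference with f = 2. The left-pruning test is derived here by the same argument and is not stated in the slides; all names and data are assumptions.

```python
def can_prune_right(dq_root, min_dr, eps, f=1.0):
    """True when minDR / f - d(q, root) >= eps."""
    return min_dr / f - dq_root >= eps

def can_prune_left(dq_root, max_dl, eps, f=1.0):
    """True when d(q, root) / f - maxDL >= eps (assumed analogue, not from the slides)."""
    return dq_root / f - max_dl >= eps

# Hypothetical almost-metric: squared difference, which satisfies f = 2.
d = lambda a, b: (a - b) ** 2
root = 0
objs = sorted(range(1, 41), key=lambda o: d(root, o))
left, right = objs[:20], objs[20:]
max_dl = max(d(root, o) for o in left)     # 400
min_dr = min(d(root, o) for o in right)    # 441

pruned = violations = 0
for q in range(-60, 61):
    for eps in (1, 10, 100):
        dq = d(q, root)
        if can_prune_right(dq, min_dr, eps, f=2.0):
            pruned += 1
            violations += sum(d(q, o) < eps for o in right)
        if can_prune_left(dq, max_dl, eps, f=2.0):
            pruned += 1
            violations += sum(d(q, o) < eps for o in left)

# With f = 1 the metric rule would prune the right side for q = 20, eps = 10,
# even though object 21 is at distance 1 from q.
unsafe_with_f1 = can_prune_right(d(20, root), min_dr, 10, f=1.0) and d(20, 21) < 10
```

Here `violations` counts qualifying objects that a prune would have discarded, so it should stay at zero for f = 2, while `unsafe_with_f1` shows why the unmodified metric rule cannot be reused as is.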

