
Pivoting M-tree: A Metric Access Method for Efficient Similarity Search
Tomáš Skopal
Department of Computer Science, VŠB-Technical University of Ostrava

DATESO 2004

Presentation Outline
- Similarity search in metric spaces
- M-tree
- PM-tree
  - structure
  - range queries
  - hyper-ring storage
- Experimental results

Similarity search in Metric Spaces
- Similarity search: methods for content-based retrieval in multimedia databases (or in Information Retrieval).
- Similarity is modelled by a metric d.
- Restricting similarity to a metric conflicts with several similarity theories; nevertheless, the triangular inequality is the basic tool for constructing metric regions, which is what makes similarity search efficient.
- Metric queries:
  - range query (specified by a query object Q and a query radius r_Q)
  - k-NN query (specified by a query object Q and the number of nearest neighbours k)
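The two query types can be illustrated with a minimal sketch that scans the whole dataset sequentially, which is the baseline a metric access method tries to beat. The function names `range_query` and `knn_query`, the sample points, and the choice of the Euclidean (L2) distance are assumptions made for illustration; they are not taken from the slides.

```python
import heapq
import math


def range_query(data, q, r_q, d=math.dist):
    """Return all objects whose distance to the query object q is at most r_q."""
    return [o for o in data if d(q, o) <= r_q]


def knn_query(data, q, k, d=math.dist):
    """Return the k objects closest to the query object q, nearest first."""
    return heapq.nsmallest(k, data, key=lambda o: d(q, o))


# Hypothetical 2D dataset, searched under the L2 distance.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
in_range = range_query(points, (0.0, 0.0), 1.0)
nearest = knn_query(points, (0.0, 0.0), 2)
```

Both functions compute one distance per object, so the search cost grows linearly with the dataset size.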

Metric Access Methods
- Designed to search metric datasets while keeping the search costs (the number of distance computations) minimal.
- When searching large multimedia databases, the I/O search costs have to be minimized as well.
- Many MAMs have been developed so far: M-tree, GH-tree, GNAT, LAESA, D-index, VP-tree, MVP-tree, SAT, ...
- Most MAMs are not suitable for similarity search in large datasets (they are either static or incur high I/O search costs), so only the M-tree and (recently) the D-index are suitable candidates.

M-tree
- dynamic, balanced, and paged metric tree (like, e.g., the B+-tree or the R-tree)
- the leaves are clusters of objects
- routing entries in the inner nodes represent metric regions, recursively bounding the object clusters in the leaves
- during query evaluation, the triangular inequality allows irrelevant M-tree branches (i.e. metric regions) to be discarded
[Figure: range query in an M-tree (Euclidean 2D space)]
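The discarding rule can be sketched as follows: a hyper-spherical region with centre O and covering radius r_O cannot contain any answer to a range query (Q, r_Q) when d(Q, O) > r_O + r_Q, which follows from the triangular inequality. The `Node` class and `range_search` function below are a simplified, in-memory illustration and an assumption of this sketch; a real M-tree is paged and also stores distances to parent entries to save further distance computations, which is omitted here.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical simplified tree node: one hyper-spherical metric region."""
    center: tuple
    radius: float
    children: list = field(default_factory=list)  # sub-regions (inner node)
    objects: list = field(default_factory=list)   # data objects (leaf)


def range_search(node, q, r_q, d=math.dist, out=None):
    """Collect all objects within r_q of q, skipping irrelevant subtrees."""
    if out is None:
        out = []
    # Triangular inequality: if the two spheres do not intersect,
    # no object inside this region can lie within r_q of q.
    if d(q, node.center) > node.radius + r_q:
        return out
    for o in node.objects:
        if d(q, o) <= r_q:
            out.append(o)
    for child in node.children:
        range_search(child, q, r_q, d, out)
    return out


# Two leaf clusters under one root; the query only touches the first cluster.
leaf_a = Node(center=(0.0, 0.0), radius=1.0, objects=[(0.0, 0.0), (1.0, 0.0)])
leaf_b = Node(center=(10.0, 10.0), radius=1.0, objects=[(10.0, 10.0), (10.0, 11.0)])
root = Node(center=(5.0, 5.0), radius=8.0, children=[leaf_a, leaf_b])
hits = range_search(root, (0.5, 0.0), 1.0)
```

In this example the subtree `leaf_b` is discarded after a single distance computation to its centre.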

PM-tree, motivation
- metric regions in the M-tree are unnecessarily large:
  - large portions of empty space (the "dead" space) are indexed
  - higher probability of intersection with the query region
  - less efficient search
- reducing the "volume" of a metric region should lead to more effective discarding of irrelevant subtrees
- the way to achieve this is to specify a metric region that bounds all the objects more "tightly"

PM-tree, structure
- Pivoting M-tree (PM-tree): a combination of the M-tree with pivot-based methods (LAESA-like).
- Given a fixed set of p pivots P_i (selected from the dataset), a PM-tree region is additionally defined by p hyper-ring regions (P_i, HR[i]):
  - each routing entry contains an array HR of p intervals
  - each interval HR[i] bounds the distances of the objects to the respective pivot P_i
- The intersection of the hyper-sphere and the hyper-rings forms a smaller region bounding all the objects.
- The more pivots, the more tightly bounded the region.
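As a rough illustration of the HR array, the sketch below computes, for a given set of objects, the interval of distances to each pivot. The function name `build_hr`, the tuple representation of an interval as (min, max), and the sample data are assumptions of this sketch; how the PM-tree maintains these intervals during insertions and node splits is not covered by the slides.

```python
import math


def build_hr(objects, pivots, d=math.dist):
    """Return one (min, max) distance interval per pivot for the given objects."""
    hr = []
    for p in pivots:
        dists = [d(o, p) for o in objects]
        hr.append((min(dists), max(dists)))
    return hr


# Hypothetical cluster of three 2D objects and two pivots.
objects = [(3.0, 0.0), (4.0, 0.0), (0.0, 5.0)]
pivots = [(0.0, 0.0), (4.0, 0.0)]
hr = build_hr(objects, pivots)
```

Each interval describes a ring centred at its pivot, so every object of the cluster lies in the intersection of all p rings.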

PM-tree, query processing
- Prior to processing a query (Q, r_Q), the distances d(Q, P_i) for all i ≤ p must be computed.
- A metric region is relevant to a range query if and only if all the hyper-rings and the hyper-sphere intersect the query region:
  - the more hyper-rings, the lower the probability of intersection with the query
  - no additional distance computations are needed for the intersection test
[Figure: a query region intersecting an M-tree region and the corresponding PM-tree region]
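The intersection test can be sketched as a pure comparison of already known numbers. The query ball intersects the hyper-ring (P_i, HR[i]) exactly when d(Q, P_i) - r_Q ≤ HR[i].max and d(Q, P_i) + r_Q ≥ HR[i].min. The function name `region_is_relevant` and its parameter layout are assumptions of this sketch.

```python
def region_is_relevant(d_q_o, r_o, hr, d_q_pivots, r_q):
    """Decide whether a PM-tree region may contain answers to a range query.

    d_q_o      -- distance from the query object Q to the region centre O
    r_o        -- covering radius of the region
    hr         -- list of (min, max) intervals, one per pivot
    d_q_pivots -- precomputed distances d(Q, P_i), one per pivot
    r_q        -- query radius
    """
    # Hyper-sphere test, as in the plain M-tree.
    if d_q_o > r_o + r_q:
        return False
    # Hyper-ring tests: only comparisons, no new distance computations.
    for (lo, hi), d_qp in zip(hr, d_q_pivots):
        if d_qp - r_q > hi or d_qp + r_q < lo:
            return False
    return True


# The sphere test alone would keep this region, but the ring [3, 5] rules it out.
kept_by_sphere_only = region_is_relevant(2.0, 2.0, [], [], 1.0)
kept_with_ring = region_is_relevant(2.0, 2.0, [(3.0, 5.0)], [1.0], 1.0)
```

The distances d(Q, P_i) are computed once per query, so the per-region cost of the extra filtering is p pairs of comparisons.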

PM-tree, hyper-ring storage
- The routing entries of PM-tree nodes are enlarged by the additional pivot-based information stored in the HR arrays.
- To keep the space overhead minimal, a compact storage of the HR[i] intervals is necessary.
- A distance histogram is created for each pivot P_i, and an interval is chosen such that, e.g., 90% of the distances in the histogram fall into it.
- Each of the values HR[i].min and HR[i].max is scaled to that interval and stored in a single byte, i.e. each hyper-ring HR[i] takes 2 bytes.
[Figure: storage of the HR array in a routing entry: O_i, r, ptr(T), ... | HR[1], HR[2], ..., HR[p]]
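One possible way to realise the 2-byte storage is sketched below. The slides do not specify rounding or the treatment of values outside the chosen interval [a, b], so the following choices are assumptions of this sketch: the minimum is rounded down and the maximum up, so that the decoded ring never becomes smaller than the exact one, and the extreme codes are decoded as 0 and infinity so that out-of-interval values stay safely bounded. The names `encode_hr` and `decode_hr` are hypothetical.

```python
import math


def _code(value, a, b, rounding):
    """Scale value from [a, b] to an integer code 0..255 (clamping outliers)."""
    clamped = min(max(value, a), b)
    return max(0, min(255, rounding((clamped - a) * 255.0 / (b - a))))


def encode_hr(lo, hi, a, b):
    """Store one hyper-ring interval [lo, hi] in two bytes."""
    # Assumption: round outwards so the stored ring still bounds all objects.
    return bytes([_code(lo, a, b, math.floor), _code(hi, a, b, math.ceil)])


def decode_hr(code, a, b):
    """Recover a conservative (lo, hi) interval from its two-byte form."""
    step = (b - a) / 255.0
    # Assumption: extreme codes may stand for clamped values, so widen them.
    lo = 0.0 if code[0] == 0 else a + code[0] * step
    hi = math.inf if code[1] == 255 else a + code[1] * step
    return lo, hi


# Exact interval [3, 5], quantised over the hypothetical interval [0, 10].
code = encode_hr(3.0, 5.0, 0.0, 10.0)
dec_lo, dec_hi = decode_hr(code, 0.0, 10.0)
```

The price of the compact form is a slightly wider ring: each bound can move outwards by at most one quantisation step of (b - a) / 255.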

Experimental results (synthetic)
- synthetic dataset of 100,000 30-dimensional tuples distributed within 1,000 clusters, L2 distance, query selectivity 50 objects

Experimental results (images)
- collection of 10,000 images represented by 256-dimensional vectors (gray-level histograms), L2 distance, query selectivity 50 objects

Recent results (not included in proceedings)
- Cost models for range queries in the PM-tree (→ ADBIS'04)
- Experiments on the image dataset (→ ADBIS'04)
- Optimal k-NN query algorithm for the PM-tree + cost models (to be published...)

References
[1] Skopal T., Pokorný J., Snášel V.: PM-tree: Pivoting Metric Tree for Similarity Search in Multimedia Databases, submitted to ADBIS 2004, Budapest, Hungary
[2] Skopal T.: Pivoting M-tree: A Metric Access Method for Efficient Similarity Search, DATESO 2004, Desná
[3] Skopal T., Pokorný J., Krátký M., Snášel V.: Revisiting M-tree Building Principles. ADBIS 2003, LNCS 2798, Springer-Verlag, Dresden, Germany

