Pivoting M-tree: A Metric Access Method for Efficient Similarity Search
Tomáš Skopal, tomas.skopal@vsb.cz
Department of Computer Science, VŠB-Technical University of Ostrava
DATESO 2004

Slide 2: Presentation Outline
- Similarity search in metric spaces
- M-tree
- PM-tree
  - structure
  - range queries
  - hyper-ring storage
- Experimental results

Slide 3: Similarity search in metric spaces
- Similarity search: methods for content-based retrieval in multimedia databases (and in Information Retrieval).
- Similarity is modelled by a metric d.
- Restricting similarity to a metric is at odds with several similarity theories; nevertheless, the triangle inequality is the basic tool for constructing metric regions, which leads to an efficient similarity search.
- Metric queries:
  - range query (specified by a query object Q and a covering radius r_Q)
  - k-NN query (specified by a query object Q and the number of nearest neighbours k)
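
To make the two query types concrete, here is a minimal Python sketch that answers them by a sequential scan, i.e. without any index. The function names (`l2`, `range_query`, `knn_query`) and the sample points are illustrative assumptions, not taken from the presentation; the L2 distance is used only because the experiments later in the talk use it.

```python
import heapq
import math

def l2(a, b):
    """Euclidean (L2) distance; any function satisfying the metric axioms
    (non-negativity, identity, symmetry, triangle inequality) would do."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_query(data, q, r_q, d=l2):
    """Return all objects within distance r_q of the query object q."""
    return [o for o in data if d(q, o) <= r_q]

def knn_query(data, q, k, d=l2):
    """Return the k objects closest to the query object q."""
    return heapq.nsmallest(k, data, key=lambda o: d(q, o))

# Hypothetical 2D sample data.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0), (0.5, 0.5)]
in_range = range_query(points, (0.0, 0.0), 1.0)
nearest = knn_query(points, (0.0, 0.0), 2)
```

A sequential scan computes one distance per object for every query; this is the baseline cost that a metric access method tries to undercut.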

Slide 4: Metric Access Methods
- Metric access methods (MAMs) are designed to search metric datasets while keeping the search costs (the number of distance computations) minimal.
- When searching large multimedia databases, the I/O search costs have to be minimized as well.
- Many MAMs have been developed so far: M-tree, GH-tree, GNAT, LAESA, D-index, VP-tree, MVP-tree, SAT, ...
- The majority of MAMs are not suitable for similarity search in large datasets (they are either static or incur high I/O search costs).
  - Only the M-tree and (recently) the D-index are suitable candidates.

Slide 5: M-tree
- a dynamic, balanced, and paged metric tree (like, e.g., the B+-tree or the R-tree)
- the leaves are clusters of objects
- routing entries in the inner nodes represent metric regions, recursively bounding the object clusters in the leaves
- during query evaluation, the triangle inequality allows irrelevant M-tree branches (i.e. metric regions) to be discarded
[Figure: range query in a 2D Euclidean space]
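
The discarding step can be sketched as follows. This is a simplified illustration under stated assumptions, not the M-tree implementation: the `Node` class and the hand-built two-leaf tree are hypothetical, and the parent-distance optimization that the M-tree uses to save further distance computations is left out. A subtree with routing object O and covering radius r_O can be skipped whenever d(Q, O) > r_Q + r_O, because by the triangle inequality no object inside that region can lie within r_Q of Q.

```python
import math
from dataclasses import dataclass, field

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

@dataclass
class Node:
    center: tuple                                  # routing object O
    radius: float                                  # covering radius r_O
    children: list = field(default_factory=list)   # sub-nodes (inner node)
    objects: list = field(default_factory=list)    # data objects (leaf)

def range_search(node, q, r_q, d=l2, visited=None):
    """Collect objects within r_q of q, skipping regions that cannot intersect the query."""
    if d(q, node.center) > r_q + node.radius:
        return []                                  # region and query ball are disjoint
    if visited is not None:
        visited.append(node.center)
    hits = [o for o in node.objects if d(q, o) <= r_q]
    for child in node.children:
        hits.extend(range_search(child, q, r_q, d, visited))
    return hits

# Hypothetical tree: one root with two leaf regions.
leaf_a = Node((0.0, 0.0), 1.5, objects=[(0.0, 0.0), (1.0, 1.0)])
leaf_b = Node((10.0, 10.0), 1.0, objects=[(10.0, 10.0), (10.5, 10.5)])
root = Node((5.0, 5.0), 8.0, children=[leaf_a, leaf_b])

visited = []
result = range_search(root, (0.5, 0.5), 1.0, visited=visited)
```

In this example the region around (10, 10) is never entered, so none of its objects are compared with the query.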

Slide 6: PM-tree, motivation
- metric regions in the M-tree are unnecessarily large
  - large portions of empty space (the "dead" space) are indexed
  - higher probability of intersection with the query region
  - less efficient search
- reducing the "volume" of the metric regions should lead to more effective discarding of irrelevant subtrees
- the way to achieve this is to specify a metric region that bounds all the objects more "tightly"

Slide 7: PM-tree, structure
- Pivoting M-tree (PM-tree): a combination of the M-tree with pivot-based methods (LAESA-like)
- given a fixed set of p pivots P_i (selected from the dataset), a PM-tree region is additionally defined by p hyper-ring regions (P_i, HR[i])
  - each routing entry contains an array HR of p intervals
  - each interval HR[i] bounds the distances of the objects to the respective pivot P_i
- the intersection of the hyper-sphere and the hyper-rings forms a smaller region bounding all the objects
- the more pivots, the more tightly bounded the region
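
A small Python sketch of how the HR array could be derived for one region and how the hyper-rings shrink it. The function names `hyper_rings` and `in_region` and the sample data are assumptions made for illustration; for an inner node, each HR[i] would have to cover the intervals of all entries in the subtree, which this sketch does not show.

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hyper_rings(objects, pivots, d=l2):
    """HR[i] = (min, max) of the distances from the objects to pivot P_i."""
    rings = []
    for p in pivots:
        dists = [d(o, p) for o in objects]
        rings.append((min(dists), max(dists)))
    return rings

def in_region(x, center, radius, pivots, rings, d=l2):
    """True if x lies in the hyper-sphere AND in every hyper-ring."""
    if d(x, center) > radius:
        return False
    return all(lo <= d(x, p) <= hi for p, (lo, hi) in zip(pivots, rings))

# Hypothetical leaf region with three objects and p = 2 pivots.
objects = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
pivots = [(0.0, 0.0), (3.0, 4.0)]
center, radius = (2.0, 0.0), 1.0
rings = hyper_rings(objects, pivots)

inside = in_region((2.0, 0.0), center, radius, pivots, rings)
dead_space = in_region((2.0, 1.0), center, radius, pivots, rings)
```

The point (2, 1) lies inside the hyper-sphere but outside the second hyper-ring, so it belongs to the "dead" space that the PM-tree region excludes.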

Slide 8: PM-tree, query processing
- before a query (Q, r_Q) is processed, the distances d(Q, P_i) must be computed for all i ≤ p
- a metric region is relevant to a range query only if all the hyper-rings and the hyper-sphere intersect the range query region
  - the more hyper-rings, the lower the probability of intersection with the query
  - no additional distance computations are needed for the intersection test
[Figure: a range query against an M-tree region and a PM-tree region]
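
The relevance test can be sketched in Python as below. The function names and the numbers are illustrative assumptions. The hyper-ring part uses only the distances d(Q, P_i), computed once per query, and the stored intervals; a query ball intersects hyper-ring i exactly when d(Q, P_i) - r_Q ≤ HR[i].max and d(Q, P_i) + r_Q ≥ HR[i].min.

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sphere_intersects(d_q_o, r_o, r_q):
    """M-tree test: the query ball meets the hyper-sphere (O, r_O)."""
    return d_q_o <= r_o + r_q

def rings_intersect(q_pivot_dists, rings, r_q):
    """The query ball meets every hyper-ring (P_i, HR[i])."""
    return all(dq - r_q <= hi and dq + r_q >= lo
               for dq, (lo, hi) in zip(q_pivot_dists, rings))

def pm_region_relevant(d_q_o, r_o, q_pivot_dists, rings, r_q):
    return sphere_intersects(d_q_o, r_o, r_q) and rings_intersect(q_pivot_dists, rings, r_q)

# Hypothetical region: routing object O = (2, 0), r_O = 1, two hyper-rings.
pivots = [(0.0, 0.0), (3.0, 4.0)]
center, r_o = (2.0, 0.0), 1.0
rings = [(1.0, 3.0), (4.0, 4.5)]

q, r_q = (2.0, 1.5), 0.6
q_pivot_dists = [l2(q, p) for p in pivots]   # computed once per query
d_q_o = l2(q, center)

m_tree_visits = sphere_intersects(d_q_o, r_o, r_q)
pm_tree_visits = pm_region_relevant(d_q_o, r_o, q_pivot_dists, rings, r_q)
```

For this query the hyper-sphere alone would force a visit of the subtree, while the second hyper-ring rules the region out.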

Slide 9: PM-tree, hyper-ring storage
- The routing entries of PM-tree nodes are enlarged by the additional pivot-based information stored in the HR arrays.
- To keep the space overhead minimal, a compact storage of the HR[i] intervals is necessary.
- A distance histogram is created for each pivot P_i, and an interval is chosen such that, e.g., 90% of the distances in the histogram fall into that interval.
- Each value HR[i].min and HR[i].max is scaled to that interval and stored in a single byte, i.e. each hyper-ring HR[i] takes 2 bytes.
[Figure: storage of the HR array in a routing entry: O_i, r, ptr(T), ..., HR[1], HR[2], ..., HR[p]]
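
One possible way to do the one-byte scaling is sketched below. The presentation does not say how values are rounded or how distances outside the chosen interval are handled, so everything here is an assumption: `encode_ring` and `decode_ring` are hypothetical names, the interval [lo, hi] is assumed to have been picked from the pivot's distance histogram already, the minimum is rounded down and the maximum up, and the extreme codes 0 and 255 are decoded as 0 and infinity. These choices keep the decoded ring at least as wide as the exact one, so the quantization cannot cause an object to be missed.

```python
import math

def encode_ring(hr_min, hr_max, lo, hi):
    """Quantize (hr_min, hr_max) to two byte codes relative to [lo, hi]."""
    step = (hi - lo) / 255.0
    cmin = math.floor((hr_min - lo) / step)   # round the lower bound down
    cmax = math.ceil((hr_max - lo) / step)    # round the upper bound up
    clamp = lambda c: max(0, min(255, c))
    return clamp(cmin), clamp(cmax)

def decode_ring(cmin, cmax, lo, hi):
    """Recover a conservative (never narrower) interval from the byte codes."""
    step = (hi - lo) / 255.0
    dmin = 0.0 if cmin == 0 else lo + cmin * step        # code 0: no lower bound
    dmax = math.inf if cmax == 255 else lo + cmax * step # code 255: no upper bound
    return dmin, dmax

# Hypothetical pivot whose chosen histogram interval is [1.0, 3.0].
lo, hi = 1.0, 3.0
codes = encode_ring(1.5, 2.5, lo, hi)
approx = decode_ring(*codes, lo, hi)
outlier = decode_ring(*encode_ring(0.2, 7.0, lo, hi), lo, hi)
```

With p pivots, the HR array of a routing entry then occupies 2p bytes, matching the 2 bytes per hyper-ring stated above.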

Slide 10: Experimental results (synthetic)
Synthetic dataset of 100,000 30-dimensional tuples distributed within 1,000 clusters; L2 distance; query selectivity 50 objects.

Slide 11: Experimental results (images)
Collection of 10,000 images represented by 256-dimensional vectors (gray histograms); L2 distance; query selectivity 50 objects.

Slide 12: Recent results (not included in proceedings)
- Cost models for range queries in the PM-tree (→ ADBIS'04)
- Experiments on an image dataset (→ ADBIS'04)
- Optimal k-NN query algorithm for the PM-tree + cost models (to be published...)

Slide 13: References
[1] Skopal T., Pokorný J., Snášel V.: PM-tree: Pivoting Metric Tree for Similarity Search in Multimedia Databases. Submitted to ADBIS 2004, Budapest, Hungary.
[2] Skopal T.: Pivoting M-tree: A Metric Access Method for Efficient Similarity Search. DATESO 2004, Desná.
[3] Skopal T., Pokorný J., Krátký M., Snášel V.: Revisiting M-tree Building Principles. ADBIS 2003, LNCS 2798, Springer-Verlag, Dresden, Germany.
