NM-Tree: Flexible Approximate Similarity Search in Metric and Non-metric Spaces Tomáš Skopal Jakub Lokoč Charles University in Prague Department of Software Engineering Czech Republic DEXA 2008, Turin, Italy, Sep 1-5
Presentation outline Metric similarity search Semimetric tuning M-Tree NM-Tree Experimental results Discussion
How to search in multimedia databases (MDB)? textual annotation is expensive and ambiguous MDB objects are not structured (we cannot use a structured query language, like SQL) solution: content based similarity searching; similarity between a query and DB object is interpreted as a relevance similarity is often modelled by a distance function d satisfying metric properties -> metric searching -> metric access methods (MAMs) the distance function d is supposed to be computationally expensive -> sequential search is unfeasible -> external indexing structures metric properties are advantageous for indexing, but may be unsuitable for domain experts mainly the triangle inequality -> let's d be relaxed to a semimetric (i.e., reflexive, non-negative, symmetric distance) we use MAMs also with semimetric distances, but we have to take more or less incorrect behavior into account false dismissals & false positives for semimetrics Metric similarity search query object
semimetric brings less limitations for domain experts, but… semimetric doesn’t guarantee triangle inequality for every triplet of objects in the database -> a lot of non-triangle triplets causes a lot of false dismissals in MAMs and vice versa exact metric searching = all distance triplets must be triangle triplets non-triangle triplets cause only approximate but faster search semimetric tuning = changing the proportion of non-triangle triplets (generated by d s ) by applying modifier f (real function) on original semimetric, e.g. d s* = f (d s ) d s* should - satisfy triangle inequality to some extent (controlled precision of searching) - generate lower-dimensional distance distributions (faster searching) Semimetric tuning Triangle triplet: a + b >= c Non-triangle triplet: a + b < c c a b c a b
Properties of the modifier f 1.f is increasing – preserves original query orderings 2.triangle-generating f = concave f = turns more distance triplets into triangle ones = slow but precise searching 3.triangle-violating f = convex f = turns more distance triplets into non-triangle ones = fast but approximate searching 4.parametric T-bases 5.TriGEN – an algorithm for finding an f that satisfies the user-requested retrieval error e and maximizes the search efficiency (lowest intrinsic dimensionality), see [2] Modified distances may form the triangle triplet guarantees the fish is always more similar to sea-maid than to girl f dsds d s* f dsds
M-tree dynamic, balanced, and paged tree structure (like e.g. B + -tree, R-tree) the leaves are clusters of indexed objects O j (ground objects) routing entries in the inner nodes represent hyper-spherical metric regions (O i, r Oi ), recursively bounding the object clusters in leaves the triangle inequality allows to discard irrelevant M-tree branches (metric regions resp.) during query evaluation (euclidean 2D space) range query Q
M-tree filtering a) basic filtering (expensive) b) parent filtering (cheap) d(R, Q) > r R + r Q |d(P, Q) – d(P, R)| > r R + r Q
NM-tree motivation separated usage of TriGen and M-tree - limitations M-tree hierarchy depends on the topology of d s (specific to f used) For another measure d s* the database must be re-indexed To provide user with choice of precision/efficiency tradeoff we need to maintain more M-trees - each for particular dissimilarity measure !!! T-bases (returned by TriGEN) natively support inverse modification (as proposed in [2]) d s = f e ( f e (d s, w), -w) notation : f e -1 ~ f e (d s, -w) We can mimic multiple M-trees with just one M-tree and an appropriate set of modifiers f e -> NM-tree d s*
NM-tree setbacks Native combination of M-tree and TriGEN brings some problems… determining modifiers for an empty index (no data received) solution : gather first k objects into a sequential file then find modifiers guarantee of exact searching solution : use metric d m = f m (d s ) for inserting -> it is necessary to find f m modifier in the initial phase aggregated distances stored in M-tree as covering radii cannot be correctly remodified by f solution : approximate search only at the preleaf & leaf level
NM-tree structure the same structure as for the M-tree maintains modifiers f e for distance modification construction Initial phase inserting into the sequential file until sufficient number of objects is gathered then finding modifiers for requested retrieval errors including f m for error = 0 all objects from the sequential file are inserted into the NM-tree Second phase inserting into the NM-tree under d m = f m (d s ) -> this makes possible exact searching querying additional query parameter – an error threshold e (such that f e is available in NM-tree) Exact search – NM-tree querying under d m (+ result distances remodified by f m -1 ) Approximate search – for stored values (in the index) it is necessary to make conversion from d m to d s using f m -1 and subsequently conversion from d s to desired d s* using f e
NM-tree approximate querying For upper levels the metric search is performed For the preleaf and leaf level all the distances used for pruning are modified to the desired semimetric depending on the user- defined error threshold e entry modification d2p* = f e (f m -1 (d2p)) e_radius* = f e (f m -1 (e_radius)) entry modification d2p* = f e (f m -1 (d2p)) e_radius* = f e (f m -1 (e_radius)) query modification q_radius* = f e (q_radius) e2qd* = f e (e2qd) query modification q_radius* = f e (q_radius) e2qd* = f e (e2qd) dmdm d s*
NM-tree querying example
Experiments We have compared multiple M-trees (each for semimetric determined by user defined error) with NM-tree We have performed our tests on two databases 68, dimensional Corel features 250,000 synthetic 2D polygons, each consisting of 5 to 15 vertices We have tested one semimetric and one metric on each database COREL – L0.75 and L2 Polygons – DTW and Hausdorff For TriGEN we have selected 10 modifiers for each dissimilarity measure and T-errors (values correlated with retrieval error) within [0 – 0,32]
Experimental results
References [1] Ciaccia P., Patella M., Zezula P. M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. In: VLDB 1997, pp. 426–435 (1997) [2] Skopal T. Unified framework for fast exact and approximate search in dissimilarity spaces. ACM Transactions on Database Systems 32(4), 1–46 (2007) [3] Chávez E., Navarro G. A Probabilistic Spell for the Curse of Dimensionality. In: Buchsbaum, A.L., Snoeyink, J. (eds.) ALENEX LNCS, vol. 2153, pp. 147–160. Springer, Heidelberg (2001)
... Thank you for attention Questions ??