NM-Tree: Flexible Approximate Similarity Search in Metric and Non-metric Spaces Tomáš Skopal Jakub Lokoč Charles University in Prague Department of Software.


NM-Tree: Flexible Approximate Similarity Search in Metric and Non-metric Spaces Tomáš Skopal Jakub Lokoč Charles University in Prague Department of Software Engineering Czech Republic DEXA 2008, Turin, Italy, Sep 1-5

Presentation outline
- Metric similarity search
- Semimetric tuning
- M-tree
- NM-tree
- Experimental results
- Discussion

Metric similarity search
- How to search in multimedia databases (MDB)?
  - Textual annotation is expensive and ambiguous.
  - MDB objects are not structured, so a structured query language such as SQL cannot be used.
- Solution: content-based similarity searching, where the similarity between a query and a database object is interpreted as relevance.
- Similarity is often modelled by a distance function d satisfying the metric properties -> metric searching -> metric access methods (MAMs).
- The distance function d is assumed to be computationally expensive -> sequential search is infeasible -> external indexing structures.
- The metric properties are advantageous for indexing but may be unsuitable for domain experts, mainly the triangle inequality -> let d be relaxed to a semimetric (i.e., a reflexive, non-negative, symmetric distance).
- MAMs can also be used with semimetric distances, but more or less incorrect behavior must then be taken into account: false dismissals and false positives.
[Figure: query object]

Semimetric tuning
- A semimetric places fewer restrictions on domain experts, but it does not guarantee the triangle inequality for every triplet of objects in the database.
- Many non-triangle triplets cause many false dismissals in MAMs, and vice versa.
  - Exact metric searching = all distance triplets must be triangle triplets.
  - Non-triangle triplets make the search only approximate, but faster.
- Semimetric tuning = changing the proportion of non-triangle triplets (generated by $d_s$) by applying a modifier f (a real function) to the original semimetric, e.g. $d_s^* = f(d_s)$.
- $d_s^*$ should
  - satisfy the triangle inequality to some extent (controlled precision of searching),
  - generate lower-dimensional distance distributions (faster searching).
- Triangle triplet: $a + b \geq c$. Non-triangle triplet: $a + b < c$.
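The proportion of non-triangle triplets can be measured directly on a small sample. The sketch below is only an illustration under stated assumptions: the squared difference on numbers stands in for a semimetric $d_s$, the square root stands in for a concave modifier f, and the names `squared_difference` and `non_triangle_ratio` are hypothetical, not taken from the presentation.

```python
import math
from itertools import combinations


def squared_difference(x, y):
    """Toy semimetric on numbers: reflexive, non-negative, symmetric, not a metric."""
    return (x - y) ** 2


def non_triangle_ratio(objects, dist, modifier=lambda d: d):
    """Fraction of object triplets whose (modified) distances satisfy a + b < c."""
    total = 0
    broken = 0
    for x, y, z in combinations(objects, 3):
        # Sort the three pairwise distances so that c is the largest one.
        a, b, c = sorted(modifier(dist(p, q)) for p, q in ((x, y), (y, z), (x, z)))
        total += 1
        if a + b < c:
            broken += 1
    return broken / total if total else 0.0


points = [0.0, 1.0, 2.0, 3.5, 5.0]
raw_ratio = non_triangle_ratio(points, squared_difference)
tuned_ratio = non_triangle_ratio(points, squared_difference, modifier=math.sqrt)
print(raw_ratio, tuned_ratio)
```

On these sample points every triplet is a non-triangle triplet under the raw squared difference, and none is after the square-root modifier, because the modified distance is then the ordinary absolute difference.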

Properties of the modifier f
1. f is increasing: it preserves the original query orderings.
2. Triangle-generating f = concave f: turns more distance triplets into triangle ones = slow but precise searching.
3. Triangle-violating f = convex f: turns more distance triplets into non-triangle ones = fast but approximate searching.
4. Parametric T-bases.
5. TriGen: an algorithm for finding an f that satisfies the user-requested retrieval error e and maximizes the search efficiency (lowest intrinsic dimensionality), see [2].
[Figure: the modifier f maps $d_s$ to $d_s^*$. The modified distances may form a triangle triplet, and the preserved ordering guarantees that the fish is always more similar to the sea-maid than to the girl.]
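To make the trade-off between concavity and error concrete, here is a deliberately simplified search for the least concave power modifier that keeps the share of non-triangle triplets within a tolerance. It is not the TriGen algorithm from [2]; the power function `fp`, the bisection, and the sample distance triplets are all assumptions made for illustration, with distances taken as normalized to [0, 1].

```python
def fp(d, w):
    """Hypothetical concave power modifier for w >= 0 (identity at w = 0)."""
    return d ** (1.0 / (1.0 + w))


def t_error(triplets, w):
    """Share of distance triplets that are non-triangle after modification."""
    broken = 0
    for triplet in triplets:
        a, b, c = sorted(fp(d, w) for d in triplet)
        if a + b < c:
            broken += 1
    return broken / len(triplets)


def smallest_weight(triplets, tolerance, w_max=64.0, iterations=40):
    """Bisect for the smallest w in [0, w_max] with t_error <= tolerance."""
    if t_error(triplets, 0.0) <= tolerance:
        return 0.0
    if t_error(triplets, w_max) > tolerance:
        raise ValueError("tolerance not reachable within w_max")
    lo, hi = 0.0, w_max
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if t_error(triplets, mid) <= tolerance:
            hi = mid  # mid is concave enough, try a less concave modifier
        else:
            lo = mid
    return hi


sample = [(0.1, 0.1, 0.9), (0.2, 0.3, 0.4), (0.25, 0.25, 1.0)]
w_exact = smallest_weight(sample, tolerance=0.0)
w_loose = smallest_weight(sample, tolerance=0.34)
```

The bisection relies on the error being non-increasing in w for this particular family: a more concave power modifier is a concave function of a less concave one, so triangle triplets stay triangle triplets. Allowing a non-zero tolerance yields a smaller weight, which corresponds to the faster but approximate end of the trade-off.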

M-tree
- A dynamic, balanced, and paged tree structure (like the B+-tree or R-tree).
- The leaves are clusters of indexed objects $O_j$ (ground objects).
- Routing entries in the inner nodes represent hyper-spherical metric regions $(O_i, r_{O_i})$, recursively bounding the object clusters in the leaves.
- The triangle inequality makes it possible to discard irrelevant M-tree branches (i.e., metric regions) during query evaluation.
[Figure: range query Q in Euclidean 2D space]

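M-tree filtering
a) Basic filtering (expensive, requires computing $d(R, Q)$): the region is discarded if $d(R, Q) > r_R + r_Q$.
b) Parent filtering (cheap, uses the stored distance $d(P, R)$ and the already computed $d(P, Q)$): the region is discarded if $|d(P, Q) - d(P, R)| > r_R + r_Q$.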
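The two filters can be sketched as follows for a range query. This is a minimal illustration, assuming Euclidean 2D points and `math.dist` (Python 3.8+) as the metric d; the function names and the sample regions are hypothetical. P is the parent routing object, R the routing object of the tested region, and Q the query object.

```python
import math


def parent_filter(d_PQ, d_PR, r_R, r_Q):
    """Cheap test: uses d(P, Q), already computed, and d(P, R), stored in the index."""
    return abs(d_PQ - d_PR) > r_R + r_Q


def basic_filter(d_RQ, r_R, r_Q):
    """Expensive test: needs the distance d(R, Q) to be computed."""
    return d_RQ > r_R + r_Q


def can_discard(dist, Q, r_Q, d_PQ, R, d_PR, r_R):
    """Return (discard, distance_computations) for the region (R, r_R)."""
    if parent_filter(d_PQ, d_PR, r_R, r_Q):
        return True, 0
    return basic_filter(dist(R, Q), r_R, r_Q), 1


P, Q, r_Q = (0.0, 0.0), (10.0, 0.0), 2.0
d_PQ = math.dist(P, Q)
regions = {"near_parent": (1.0, 0.0), "far_side": (0.0, 9.0), "near_query": (9.0, 0.0)}
results = {
    name: can_discard(math.dist, Q, r_Q, d_PQ, R, math.dist(P, R), 1.0)
    for name, R in regions.items()
}
```

Trying the parent filter first means that a region such as `near_parent` is discarded without any new distance computation, while `far_side` needs one computation before it is discarded and `near_query` must be visited.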

NM-tree motivation
- Separate usage of TriGen and the M-tree has limitations:
  - The M-tree hierarchy depends on the topology of $d_s$ (specific to the f used).
  - For another measure $d_s^*$ the database must be re-indexed.
  - To offer the user a choice of precision/efficiency trade-off, several M-trees must be maintained, one for each dissimilarity measure.
- T-bases (returned by TriGen) natively support inverse modification (as proposed in [2]): $d_s = f_e(f_e(d_s, w), -w)$; notation: $f_e^{-1} \sim f_e(d_s, -w)$.
- Multiple M-trees can be mimicked with just one M-tree and an appropriate set of modifiers $f_e$ -> NM-tree.
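The inverse-modification property can be checked numerically on a simple parametric family. The power modifier below is an assumed stand-in for a T-base, chosen because negating the weight inverts it; the exact T-bases are defined in [2], and `fp_modifier` and `fp_inverse` are hypothetical names. Distances are assumed to be normalized to [0, 1].

```python
import math


def fp_modifier(d, w):
    """Hypothetical power modifier for distances normalized to [0, 1].

    w > 0 gives a concave (triangle-generating) function, w < 0 a convex
    (triangle-violating) one, and w = 0 the identity.
    """
    if w >= 0:
        return d ** (1.0 / (1.0 + w))
    return d ** (1.0 - w)


def fp_inverse(d_modified, w):
    """Inverse modification expressed as the same base with weight -w."""
    return fp_modifier(d_modified, -w)


d_s = 0.3
weights = (-2.0, -0.5, 0.0, 1.0, 4.0)
round_trips = [fp_inverse(fp_modifier(d_s, w), w) for w in weights]
```

Each round trip returns the original distance up to floating-point error, which is the property that lets a single index store one modified distance and still recover $d_s$ when another modifier is requested.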

NM-tree setbacks
The native combination of the M-tree and TriGen brings some problems:
- Determining modifiers for an empty index (no data received yet).
  Solution: gather the first k objects into a sequential file, then find the modifiers.
- Guaranteeing exact searching.
  Solution: use the metric $d_m = f_m(d_s)$ for inserting, so the modifier $f_m$ must be found in the initial phase.
- Aggregated distances stored in the M-tree as covering radii cannot be correctly remodified by f.
  Solution: approximate search only at the pre-leaf and leaf levels.

NM-tree structure
- The same structure as the M-tree, plus the maintained modifiers $f_e$ for distance modification.
- Construction
  - Initial phase: objects are inserted into a sequential file until a sufficient number is gathered; then the modifiers are found for the requested retrieval errors, including $f_m$ for error = 0; finally, all objects from the sequential file are inserted into the NM-tree.
  - Second phase: objects are inserted into the NM-tree under $d_m = f_m(d_s)$, which makes exact searching possible.
- Querying
  - Additional query parameter: an error threshold e (such that $f_e$ is available in the NM-tree).
  - Exact search: NM-tree querying under $d_m$ (with the result distances remodified by $f_m^{-1}$).
  - Approximate search: values stored in the index must be converted from $d_m$ to $d_s$ using $f_m^{-1}$, and then from $d_s$ to the desired $d_s^*$ using $f_e$.

NM-tree approximate querying
- At the upper levels, the metric search is performed.
- At the pre-leaf and leaf levels, all distances used for pruning are modified to the desired semimetric, depending on the user-defined error threshold e.
- Entry modification: $\text{d2p}^* = f_e(f_m^{-1}(\text{d2p}))$, $\text{e\_radius}^* = f_e(f_m^{-1}(\text{e\_radius}))$
- Query modification: $\text{q\_radius}^* = f_e(\text{q\_radius})$, $\text{e2qd}^* = f_e(\text{e2qd})$
[Figure: upper levels searched under $d_m$, pre-leaf and leaf levels under $d_s^*$]
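The entry and query modifications amount to two small conversions. The sketch below assumes a power modifier as a stand-in for the modifiers $f_m$ and $f_e$, and the weights `w_m` and `w_e` are made-up values; only the composition $f_e(f_m^{-1}(\cdot))$ for stored values and $f_e(\cdot)$ for query values comes from the slide.

```python
import math


def power_modifier(d, w):
    """Hypothetical modifier on [0, 1]; weight -w inverts weight w."""
    if w >= 0:
        return d ** (1.0 / (1.0 + w))
    return d ** (1.0 - w)


def entry_to_query_space(stored_dm, w_m, w_e):
    """Stored value under d_m -> d_s via f_m^-1, then d_s -> d_s* via f_e."""
    d_s = power_modifier(stored_dm, -w_m)
    return power_modifier(d_s, w_e)


def query_to_query_space(value_ds, w_e):
    """Query-side value given under d_s -> d_s* via f_e."""
    return power_modifier(value_ds, w_e)


w_m, w_e = 1.0, 0.5                      # assumed weights for f_m and f_e
d2p = power_modifier(0.25, w_m)          # distance to parent as stored at insertion
d2p_star = entry_to_query_space(d2p, w_m, w_e)
q_radius_star = query_to_query_space(0.1, w_e)
```

After the conversion, the stored value and the query radius live in the same modified space $d_s^*$, so they can be compared in the pruning conditions at the pre-leaf and leaf levels.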

NM-tree querying example

Experiments
- We compared multiple M-trees (one for each semimetric determined by a user-defined error) with the NM-tree.
- The tests were performed on two databases:
  - 68, dimensional Corel features
  - 250,000 synthetic 2D polygons, each consisting of 5 to 15 vertices
- One semimetric and one metric were tested on each database:
  - COREL: $L_{0.75}$ and $L_2$
  - Polygons: DTW and Hausdorff
- For TriGen, we selected 10 modifiers for each dissimilarity measure, with T-errors (values correlated with the retrieval error) within [0, 0.32].
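A small check shows why a fractional $L_p$ distance such as $L_{0.75}$ is only a semimetric while $L_2$ is a metric. The three sample points and the helper name `lp_distance` are chosen here for illustration and are not part of the experiments.

```python
def lp_distance(x, y, p):
    """Minkowski-style distance; for p < 1 the triangle inequality can fail."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


x, y, z = (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)
for p in (0.75, 2.0):
    direct = lp_distance(x, z, p)
    detour = lp_distance(x, y, p) + lp_distance(y, z, p)
    print(p, direct <= detour)
```

For p = 0.75 the direct distance from x to z exceeds the detour through y, so the three points form a non-triangle triplet; for p = 2 the triangle inequality holds.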

Experimental results

References
[1] Ciaccia, P., Patella, M., Zezula, P.: M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. In: VLDB 1997, pp. 426–435 (1997)
[2] Skopal, T.: Unified framework for fast exact and approximate search in dissimilarity spaces. ACM Transactions on Database Systems 32(4), 1–46 (2007)
[3] Chávez, E., Navarro, G.: A Probabilistic Spell for the Curse of Dimensionality. In: Buchsbaum, A.L., Snoeyink, J. (eds.) ALENEX. LNCS, vol. 2153, pp. 147–160. Springer, Heidelberg (2001)

Thank you for your attention. Questions?