A Metric Cache for Similarity Search. Fabrizio Falchi, Claudio Lucchese, Salvatore Orlando, Fausto Rabitti, Raffaele Perego.

Presentation transcript:

Similarity Search in Databases
o Objects are "unknown"; only distances are "well known".
o Metric space assumption: identity, symmetry, triangular inequality.
o Distance functions include Minkowski distances, edit distance, Jaccard distance, ...
o Applications include images, 3D shapes, medical data, text, DNA sequences, graphs, etc.
o Metric space indexing works better than multidimensional indexing.
[Figure: Content-Based Image Retrieval query]
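The metric space assumption above can be illustrated with a small Python sketch. It is not taken from the presentation: the function names `minkowski`, `jaccard_distance`, and `looks_like_metric` are hypothetical, and the check only samples the three properties on a handful of objects, so it can refute a candidate distance but cannot prove that it is a metric.

```python
from itertools import permutations


def minkowski(x, y, p=2):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def looks_like_metric(objects, dist, tol=1e-9):
    """Check identity, symmetry and the triangular inequality on a sample.

    Returns False as soon as one property is violated on the sample.
    """
    for x in objects:
        if dist(x, x) > tol:                          # identity
            return False
    for x, y, z in permutations(objects, 3):
        if abs(dist(x, y) - dist(y, x)) > tol:        # symmetry
            return False
        if dist(x, z) > dist(x, y) + dist(y, z) + tol:  # triangular inequality
            return False
    return True
```

As a usage note, the squared Euclidean distance is a handy counterexample: on the points 0, 1, 2 of a line it gives 4 for the outer pair but only 1 + 1 through the middle point, so the check rejects it.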

Distributed Similarity Search System
[Diagram labels: Top-K queries; Front-end of the CBIR System; Parallel & Distributed CBIR System; Units 1, 2, ..., n, each with an Index of MM objects]
Search cost is close to O(|DB|)!

Distributed Similarity Search System
[Diagram labels: Top-K queries & Cached; Metric Cache; Front-end of the CBIR System; Parallel & Distributed CBIR System; Units 1, 2, ..., n, each with an Index of MM objects]
What's different in a Metric Cache?

What's different in a Metric Cache?
o The cache stores result objects, not only result pointers (e.g., documents vs. document ids).
o The cache is a peculiar sample of the whole dataset: the set of objects most recently seen by the users (= most interesting!?).
o Claim: an interesting object may be used to answer a query approximately if it is sufficiently similar to the query.

What's different in a Metric Cache? ...and...
o Queries may be approximate!
o [Zobel et al. CIVR 07] At least 8% of the images on the web are near-duplicates. Most of them are due to cropping, contrast adjustment, etc.
o Requirement: the system must be robust w.r.t. near-duplicate queries.

What's different in a Metric Cache?
[Diagram labels: queries q1, q2, q3; Metric Cache; Front-end of the CBIR System; Parallel & Distributed CBIR System; answers marked Exact answer, Approx. answer, Exact answer]

Two algorithms: RCache vs. QCache
Possible outcomes: Cache Hit, Approximate Hit, Cache Miss
RCache(q,k)
    If q ∈ Cache return R
    R = Cache.knn(q,k)
    If quality(R) >  return R
    else
        R = DB.knn(q,k)
        Cache.add(q,R)
        return R
QCache(q,k)
    If q ∈ Cache return R
    Q* = Cache.knn(q,  )
    R* = {results in Q*}
    R = R*.knn(q,k)
    If quality(R) >  return R
    else
        R = DB.knn(q,k)
        Cache.add(q,R)
        return R
In case of an approximate hit, the cached query q', being the closest to q, is marked as used. The Least Recently Used query and its results are evicted.
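The RCache flow above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `knn`, `RCache`, `search`, `capacity`, and `threshold` are hypothetical names; brute-force scans stand in for the real database index; the capacity is counted in cached queries; and quality is measured as the safe range of the new query with respect to the cached queries, which is one possible reading of the quality function.

```python
import math
from collections import OrderedDict


def knn(objects, q, k, dist=math.dist):
    """Brute-force k nearest neighbours of q (stand-in for a real index)."""
    return sorted(objects, key=lambda o: dist(q, o))[:k]


class RCache:
    """Sketch of a result cache with exact hits, approximate hits and LRU eviction."""

    def __init__(self, db, capacity, threshold=0.0, dist=math.dist):
        self.db = list(db)            # assumed non-empty
        self.capacity = capacity      # assumption: counted in cached queries
        self.threshold = threshold    # assumption: quality must exceed this value
        self.dist = dist
        self.entries = OrderedDict()  # query -> its results, least recently used first

    def quality(self, q):
        """Assumed quality measure: best safe range r* - d(q, q*) over cached q*."""
        best = -math.inf
        for cached_q, results in self.entries.items():
            radius = self.dist(cached_q, results[-1])  # results are sorted by distance
            best = max(best, radius - self.dist(q, cached_q))
        return best

    def search(self, q, k):
        if q in self.entries:                           # cache hit
            self.entries.move_to_end(q)
            return self.entries[q], "hit"
        if self.entries:
            cached = {o for res in self.entries.values() for o in res}
            results = knn(cached, q, k, self.dist)      # search the cached result objects
            if self.quality(q) > self.threshold:        # approximate hit
                closest = min(self.entries, key=lambda c: self.dist(q, c))
                self.entries.move_to_end(closest)       # mark closest query as used
                return results, "approximate"
        results = knn(self.db, q, k, self.dist)         # cache miss: query the database
        self.entries[q] = results
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)            # evict the least recently used
        return results, "miss"
```

Queries and objects are tuples here so they can serve as dictionary keys. A QCache variant would differ only in the approximate branch: it would first select the closest cached queries and then search among their results only.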

Costs of RCache vs. QCache
Possible outcomes: Cache Hit, Approximate Hit, Cache Miss
RCache(q,k)
o Hash table access: O(1)
o Search among all the result objects: O(|Cache|)
o Search among all the objects in the database: O(|DB|)
QCache(q,k)
o Search among the query objects: O(|Cache|/k), supposing k results are stored for each query.
|Cache| is the number of cached objects, and |DB| is the size of the database.

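Approximation & Guarantees
o Let the safe range be: s = r* - d(q, q*), where q* is a cached query and r* is the radius of its cached result set.
o The cached k* objects within distance s of the new query q are the true top-k* of q.
o Every cached query may provide some additional guarantee.
[Figure: cached query q* with radius r*; new query q with safe range s]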
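The safe range guarantee follows from the triangular inequality: any object o with d(q, o) < s satisfies d(q*, o) <= d(q*, q) + d(q, o) < r*, so it must already be among the cached results of q*. The Python sketch below illustrates this for a single cached query. The names `guaranteed_results` and `knn` are hypothetical, the brute-force scan stands in for a real index, and the strict inequality is a choice made here so that ties at distance r* cannot break the guarantee.

```python
import math


def knn(objects, q, k, dist=math.dist):
    """Brute-force k nearest neighbours of q (stand-in for a real index)."""
    return sorted(objects, key=lambda o: dist(q, o))[:k]


def guaranteed_results(q, cached_query, cached_results, dist=math.dist):
    """Return the safe range s and the cached objects guaranteed to be exact.

    r* is taken as the distance from the cached query to its farthest result.
    Cached objects strictly closer to q than s are true nearest neighbours of q.
    """
    r_star = max(dist(cached_query, o) for o in cached_results)
    s = r_star - dist(q, cached_query)
    safe = [o for o in cached_results if dist(q, o) < s]
    safe.sort(key=lambda o: dist(q, o))
    return s, safe
```

When s is zero or negative, the returned list is empty and this cached query offers no guarantee for q; another cached query may still provide one.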

Experimental setup
o A collection of 1,000,000 images downloaded from Flickr: we extracted 5 MPEG-7 descriptors, which were used to measure similarity.
o A query log of 100,000 images: a random sample with replacement, using image views; 20% training, 80% testing.
o k = 20,  = 10
o Quality function is: Safe range  0

Hit ratio

Throughput

Approximation quality I

Approximation quality II

What we also did...
o Take queries from a different collection.
o Inject duplicates in the query log.
o Use an expectation of RES as quality measure.

Approximation quality III
o Bug: RES=0.07
o Portrait: RES=0.09
o Sunset: RES=0.12

Acknowledgements
o The European Project SAPIR
o P. Zezula and his colleagues for the M-Tree implementation
o The dataset used is available at cophir.isti.cnr.it

Thank you.

Backup slides