Metric Inverted - An efficient inverted indexing method for metric spaces Benjamin Sznajder Jonathan Mamou Yosi Mass Michal Shmueli-Scheuer IBM Research - Haifa Presented by: Shai Erera
Outline Motivation Problem Definition Metric Inverted Index Retrieval Experiments Conclusions
Motivation
Web 2.0 enables mass multimedia production
Still, search is limited to manually added metadata
State-of-the-art solutions for CBIR (Content-Based Image Retrieval) do not scale
– They scale linearly with collection size due to the large number of distance computations
Can we use text IR methods to scale up CBIR?
Problem definition
Low-level image features can be generalized to metric spaces
Metric space: an ordered pair (S, d), where S is a domain and d a distance function d: S × S → ℝ such that
– d satisfies non-negativity, reflexivity, symmetry and the triangle inequality
The best-k results for a query in a metric space are the k objects with the smallest distance to the query
– Convert distances to scores in [0,1] (small distance → high score)
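As a toy illustration of the setup above, a distance function and a distance-to-score conversion might look like this; the names and the 1/(1+d) mapping are illustrative assumptions, not the slides' actual choice:

```python
import math

def euclidean(a, b):
    # A distance function d: non-negative, reflexive, symmetric,
    # and satisfying the triangle inequality.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def to_score(d):
    # Convert a distance to a score in (0, 1]:
    # small distance -> high score (1/(1+d) is one common choice).
    return 1.0 / (1.0 + d)
```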
Problem definition
Top-k problem:
– Assume m metric spaces, a query Q, an aggregate function f and score functions sd_i()
– Retrieve the k objects D with the highest f(sd_1(Q,D), sd_2(Q,D), …, sd_m(Q,D))
[Figure: a query point q and its k=5 nearest neighbors]
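The top-k problem can be sketched as a brute-force baseline (hypothetical names; `f` defaults to sum as one plausible aggregate, not necessarily the paper's):

```python
def top_k(objects, query, score_fns, k, f=sum):
    # score_fns: one score function per metric space, sd_i(Q, D).
    # Rank every object by the aggregate f of its per-space scores
    # and return the k best (brute force; no index involved).
    scored = [(f(sd(query, d) for sd in score_fns), d) for d in objects]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [d for _, d in scored[:k]]
```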
Metric Inverted Index
Assume a collection of objects, each having m features
– Object D = {F_1:v_1, F_2:v_2, …, F_m:v_m}
– m metric spaces
Indexing steps
– Lexicon creation (select candidates)
– Invert objects (canonization to lexicon terms)
Metric inverted indexing – lexicon creation
The number of distinct feature values is too large to index directly
Need to select candidates
– Naive solution: lexicon of fixed size l
Select l/m documents at random and extract their features
These l features form our lexicon
– Improvement
Replace the random choice with clustering (K-means, etc.)
Keep the lexicon in an M-tree structure
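The naive lexicon-creation step could be sketched as follows (all names are hypothetical; the clustering refinement and M-tree storage are omitted):

```python
import random

def build_lexicon(collection, l, m, extract_features):
    # Naive lexicon creation: sample l/m objects at random and take
    # their m feature values each, yielding l lexicon terms in total.
    sample = random.sample(collection, l // m)
    lexicon = []
    for obj in sample:
        lexicon.extend(extract_features(obj))  # m features per object
    return lexicon
```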
Metric inverted indexing – invert objects
Given object D = {F_1:v_1, F_2:v_2, …, F_m:v_m}
Canonization – map each feature (F_i:v_i) to lexicon entries
– For each feature, select the n nearest lexicon terms
– D' = {F_1:v_11, F_1:v_12, …, F_1:v_1n, F_2:v_21, F_2:v_22, …, F_2:v_2n, …, F_m:v_m1, F_m:v_m2, …, F_m:v_mn}
Index D' in the relevant posting lists
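Canonization and posting-list insertion might be sketched as follows (hypothetical names; `lexicon[i]` is assumed to hold the lexicon terms of the i-th metric space):

```python
def canonize(features, lexicon, dist, n):
    # Map the i-th feature value to its n nearest terms in the
    # lexicon of the i-th metric space.
    terms = []
    for i, v in enumerate(features):
        nearest = sorted(lexicon[i], key=lambda t: dist(t, v))[:n]
        terms.extend((i, t) for t in nearest)
    return terms

def index_object(doc_id, terms, postings):
    # Add the document to the posting list of each of its terms.
    for term in terms:
        postings.setdefault(term, set()).add(doc_id)
```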
Retrieval stage – term selection
Given query Q = {F_1:qv_1, F_2:qv_2, …, F_m:qv_m}
Canonization
– For each feature, select the n nearest lexicon terms
– Q' = {F_1:qv_11, F_1:qv_12, …, F_1:qv_1n, F_2:qv_21, F_2:qv_22, …, F_2:qv_2n, …, F_m:qv_m1, F_m:qv_m2, …, F_m:qv_mn}
Retrieval stage – Boolean filtering
These m·n posting lists are queried via a Boolean query
Two possible modes:
– Strict-query-mode: a document must match at least one query term in every feature (AND across features)
– Fuzzy-query-mode: a document must match at least one query term overall (OR across all terms)
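Under this reading (strict = AND across features, fuzzy = OR over all terms), candidate filtering could be sketched as an in-memory approximation (hypothetical names, not the actual inverted-index implementation):

```python
def candidates(query_terms, postings, strict):
    # query_terms: (feature_index, term) pairs from query canonization.
    # strict mode: keep documents matching >= 1 term of EVERY feature.
    # fuzzy mode:  keep documents matching >= 1 term overall.
    by_feature = {}
    for f_idx, t in query_terms:
        docs = postings.get((f_idx, t), set())
        by_feature.setdefault(f_idx, set()).update(docs)
    if not by_feature:
        return set()
    if strict:
        result = None
        for docs in by_feature.values():
            result = set(docs) if result is None else result & docs
        return result
    return set().union(*by_feature.values())
```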
Retrieval stage – scoring
Documents retrieved by the Boolean query are fully scored
Return the k objects with the highest aggregate score f(sd_1(Q,D), sd_2(Q,D), …, sd_m(Q,D))
Experiments
Focus on:
– Efficiency
– Effectiveness
Collection of 160,000 images from Flickr
3 features extracted from each image
– EdgeHistogram, ScalableColor and ColorLayout
180 queries
– Fuzzy-query-mode
– Sampled from the collection of images
Compared against an M-tree data structure
Experiments – measures used
Effectiveness: MAP is a natural candidate measure
– Problem: in image retrieval, no document is irrelevant
– Solution: we define as relevant the k highest-scored documents in the collection (according to the M-tree computation)
– MAP is computed on relevant and retrieved lists of size k
Experiments – measures used (contd.)
Efficiency: we count the number of computations per query
– A computation unit (cu) is one distance computation between two feature values
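Counting computation units can be sketched with a simple wrapper around the distance function (a hypothetical helper, not from the paper):

```python
def counted(dist):
    # Wrap a distance function so every call is tallied as one
    # computation unit (cu).
    def wrapper(a, b):
        wrapper.cu += 1
        return dist(a, b)
    wrapper.cu = 0
    return wrapper
```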
Effectiveness
[Plot: MAP vs. number of nearest terms; lexicon size = 12,000]
Effectiveness
[Plot: MAP vs. lexicon size; number of nearest terms = 30]
Effectiveness vs. efficiency
[Plot: MAP vs. number of comparisons; number of nearest terms = 30]
M-tree vs. Metric Inverted
[Plot: number of comparisons vs. top-k; number of nearest terms = 30]
Conclusions
We reduce the gap between text IR and multimedia retrieval
Our method achieves a very good approximation (MAP = 98%)
Our method drastically improves efficiency (by 90%) over state-of-the-art methods