Gravitation-Based Model for Information Retrieval
Shuming Shi, Ji-Rong Wen, Qing Yu, Ruihua Song, Wei-Ying Ma
Microsoft Research Asia
SIGIR 2005

Presentation transcript:

Background
A core problem in Information Retrieval (IR): determine the relevance of a document to a query.
Given a document and a query: Relevant? How relevant?

Background: IR Models & Perspectives
- IR models define the representation of documents, queries, and the relevance relationship between them.
- Behind every IR model lies a primary perspective on information retrieval.

Model                  Perspective
Boolean model          Set theory and Boolean algebra
Vector space model     Vector and linear algebra
Probabilistic model    Probabilistic
Language model         …

Background: Hard questions
- What is the essence of information retrieval?
- What is the right perspective on it?
- So far, we have learned more about IR each time a new perspective was adopted.
- It would therefore be helpful to view IR problems from further new perspectives.
- We try to view IR from the perspective of physics.

Background (1687 AD)

Background
- We live in a physical world governed by fundamental laws of physics.
- Can we get help from “the God” in acquiring a deeper understanding of information retrieval?
- Simply start from Newton’s Universal Law of Gravitation…
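
For reference, Newton's law of universal gravitation in its standard textbook form. The symbols F, G, m_1, m_2 and r are the usual physics notation, not notation taken from the slides:

```latex
F = G \, \frac{m_1 m_2}{r^2}
```

Here F is the attractive force between two point masses m_1 and m_2 separated by a distance r, and G is the gravitational constant. The force grows with either mass and falls off with the square of the distance.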

Preliminary Achievements
It is encouraging that we can really benefit from nature. With the new perspective, we obtain the following preliminary achievements:
- We build a new IR model, GBM, from which many effective ranking functions can be derived.
- The BM25 formula can be derived from our model, which gives an intuitive physical interpretation of this powerful and robust function.
- A more reasonable approach to structured document retrieval can be obtained directly from the model. This approach is not only highly effective but also robust under various conditions.
Notes on BM25:
- It was first discovered by Robertson et al., inspired by the shape of a complex formula derived from a probabilistic model under the 2-Poisson assumption.
- Amati and van Rijsbergen proposed a probabilistic framework with which the BM25 function with certain parameter settings (k1 = 1.2, b = 0.75; or k1 = 2, b = 0.75) can be approximated numerically.
- A complete theoretical derivation of the BM25 formula is still lacking.

Outline
  Background
> Gravitation-based Model
    Notations & Basic Concepts
    Discrete GBM Model
    Continuous GBM Model
    Model analysis
  GBM Model for Structured Document Retrieval
  Summary

GBM: Initial Idea
A mapping needs to be built from the concepts of information retrieval to those of physics.
Document, Query: Relevance score <-> Attractive force
IR concepts & notations:
  |D|      Document length
  df(t)    Document frequency of t
  avdl     Average document length in a collection
  N        Total number of documents
  c(t,D)   Number of occurrences of t in D (also written tf(t,D))
Physics concepts: mass, distance, …

GBM: Notations & Basic Concepts
- Particle (= atom): the basic element of any object
- A particle has two attributes: mass and type
- Type: determined by the term object it composes

GBM: Notations & Basic Concepts
- A term object has 4 attributes: type, shape, mass, and diameter
- H(D): hidden terms in document D
- Two natural assumptions: [formulas not captured in transcript]

Notation List
[table not captured in transcript]

Outline (current section: Gravitation-based Model, Discrete GBM Model)

Discrete GBM Model
Key points:
1. Under the attraction of the query terms, the structure of each document is adjusted to an optimized-term-placement state.
2. The relevance between a document and a query is defined as the attractive force between them when the document is in its optimized-term-placement state.
Optimized-term-placement state: a state in which the aggregated force between the document and the query is maximized.

Term Weighting Formula (discrete GBM)
- The maximal (optimized) gravitational force between t and D: [formula not captured in transcript]
- The force between query term t and its i-th nearest occurrence in D: [formula not captured in transcript]
- The attractive force between D and Q: [formula not captured in transcript]
- Unknown expressions: m(t,Q), m(t,D), and d_i(t,D)
- Needed: mass and diameter estimation
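
A minimal sketch of how a sum of inverse-square forces over term occurrences behaves. It rests on assumptions that are not taken from the slides: the i-th nearest occurrence of t is assumed to sit at distance i * diameter from the query term, every occurrence has the same mass, and G is set to 1. The function names `pair_force` and `term_force` are hypothetical.

```python
def pair_force(m_query: float, m_term: float, distance: float, g: float = 1.0) -> float:
    """Newton's inverse-square law between two point masses."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return g * m_query * m_term / distance ** 2


def term_force(tf: int, m_query: float = 1.0, m_term: float = 1.0,
               diameter: float = 1.0) -> float:
    """Total force between a query term and its tf occurrences in a document.

    ASSUMPTION (illustrative only, not from the slides): the i-th nearest
    occurrence lies at distance i * diameter from the query term.
    """
    return sum(pair_force(m_query, m_term, i * diameter) for i in range(1, tf + 1))


if __name__ == "__main__":
    # Each extra occurrence adds less force than the previous one.
    for tf in (1, 2, 5, 50):
        print(tf, round(term_force(tf), 4))
```

Under these assumptions the total force grows with tf but with shrinking increments, and it stays below pi^2 / 6 (about 1.645) times the single-pair force. This is the kind of term-frequency saturation that ranking functions such as BM25 also show.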

Mass and Diameter Estimation
Assumption-1 to Assumption-4 [formulas not captured in transcript], with the following notes:
- Define a document-independent mass for each (type of) term. It denotes the average mass of term t in the whole collection.
- For any two terms, their mass ratio in any document is equal to the ratio of their average masses in the whole collection.
- Assume that all terms in the same document have equal diameters.

Ultimate Discrete GBM Formula
The ultimate term-weighting function: [formula not captured in transcript]
Notes on the symbols:
- The average (document-independent) mass of term t in the collection.
- The mass of a document is a measure of its quality, which depends on how informative and important it is. Relationship with PageRank?

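Ultimate Discrete GBM Formula
If m(D) = const, d_i(D) = const, and [condition not captured in transcript], then a special case of the term-weighting function is obtained: [formula not captured in transcript]
Two parameters: [not captured in transcript]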

Outline (current section: Gravitation-based Model, Continuous GBM Model)

Continuous GBM Model
- Document D is now in its optimized-term-placement state
- Term shape: ideal cylinder

Term Weighting Formula (continuous GBM)
- The maximal (optimized) gravitational force between t and D: [formula not captured in transcript]
- The force between query term t and its i-th nearest occurrence in D: [formula not captured in transcript]

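Ultimate Continuous GBM Formula
By doing mass and diameter estimation, we obtain the ultimate term-weighting function: [formula not captured in transcript]
If m(D) = const, d_i(D) = const, and [condition not captured in transcript], then a special case of the above term-weighting function is obtained: [formula not captured in transcript]
Two parameters: [not captured in transcript]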

Outline (current section: Gravitation-based Model, Model analysis)

Continuous GBM Formula vs. BM25
- A special case of the continuous GBM term-weighting function: [formula not captured in transcript]
- BM25 term-weighting function: [formula not captured in transcript]
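
For comparison, a sketch of the BM25 term weight in a commonly cited textbook form. The exact variant shown on the slide is not preserved, so the IDF component used here (the Robertson/Sparck Jones form) and the function name `bm25_term_weight` are assumptions. The arguments follow the notation listed earlier in the deck: tf = c(t,D), doc_len = |D|, avdl, df = df(t), and n_docs = N.

```python
import math


def bm25_term_weight(tf: float, doc_len: float, avdl: float,
                     df: int, n_docs: int,
                     k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 weight of one query term in one document (textbook form)."""
    if tf <= 0:
        return 0.0
    # Robertson/Sparck Jones IDF (an assumed variant); it turns negative
    # when the term occurs in more than half of the documents.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    # Length normalisation: b = 0 ignores length, b = 1 normalises fully.
    norm = k1 * (1.0 - b + b * doc_len / avdl)
    # Saturating TF component, bounded above by (k1 + 1).
    return idf * tf * (k1 + 1.0) / (tf + norm)


if __name__ == "__main__":
    for tf in (1, 2, 5, 20):
        print(tf, round(bm25_term_weight(tf, doc_len=100, avdl=100, df=10, n_docs=1000), 4))
```

With k1 = 1.2 and b = 0.75 (one of the parameter settings mentioned earlier in the deck), the weight rises with tf, flattens out as tf grows, and drops as the document gets longer than average.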

Other Ranking Formulas Derived
Ranking formulas (highly simplified versions) derived from the continuous GBM model with various gravitational-field functions: [table not captured in transcript]

Check with Heuristic Constraints
[Fang et al., SIGIR’04]: heuristic constraints related to TF, IDF, and document length that all reasonable ranking formulas should satisfy:
- TFC1, TFC2
- TDC
- M-TDC
- LNC1, LNC2
- TF-LNC
All of our derived term-weighting functions satisfy all of the above constraints.
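
A rough numeric spot check of three of these constraints, offered as an illustration only. It uses a BM25-style TF and length component as a stand-in, not the GBM functions themselves, and the constraint statements in the comments are paraphrases that should be checked against Fang et al. The names `tf_component`, `check_tfc1`, `check_tfc2` and `check_lnc1` are hypothetical, and passing a finite check like this is not a proof.

```python
def tf_component(tf: float, doc_len: float, avdl: float = 100.0,
                 k1: float = 1.2, b: float = 0.75) -> float:
    """BM25-style TF and length component (IDF omitted); a stand-in scorer."""
    norm = k1 * (1.0 - b + b * doc_len / avdl)
    return tf * (k1 + 1.0) / (tf + norm)


def check_tfc1(score, doc_len: int = 100, max_tf: int = 20) -> bool:
    # Paraphrase of TFC1: at equal length, more occurrences of the
    # query term give a strictly higher score.
    return all(score(tf + 1, doc_len) > score(tf, doc_len) for tf in range(max_tf))


def check_tfc2(score, doc_len: int = 100, max_tf: int = 20) -> bool:
    # Paraphrase of TFC2: the gain from one more occurrence shrinks as tf grows.
    return all(
        score(tf + 1, doc_len) - score(tf, doc_len)
        > score(tf + 2, doc_len) - score(tf + 1, doc_len)
        for tf in range(1, max_tf)
    )


def check_lnc1(score, tf: int = 3, max_len: int = 200) -> bool:
    # Paraphrase of LNC1: adding one non-query term (length + 1, same tf)
    # must not raise the score.
    return all(score(tf, n) >= score(tf, n + 1) for n in range(tf, max_len))


if __name__ == "__main__":
    print(check_tfc1(tf_component), check_tfc2(tf_component), check_lnc1(tf_component))
```

A scorer that is linear in tf, for example `lambda tf, n: float(tf)`, passes the TFC1 check but fails the TFC2 check, which shows what the diminishing-returns constraint rules out.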

Preliminary Experiments: Experimental Setup
- Corpora characteristics: [table not captured in transcript]
- Query sets used in the experiments: [table not captured in transcript]

Preliminary Experiments: Experimental Results
Optimal performance comparison among several formulas over various corpora and tasks (measure: mean average precision): [table not captured in transcript]

Outline (current section: GBM Model for Structured Document Retrieval)

Structured Document Retrieval
A document is said to be structured here when it contains multiple fields.
Current approaches for structured document retrieval:
- Score combination
  - The most commonly used and well-studied approach
  - Rank combination is a special case of score combination
- Term-frequency combination
  - [Robertson et al., CIKM’04]: an extension of BM25
  - [Ogilvie et al., SIGIR’03]: linearly combining language models
Each approach works moderately well, but…

Score Combination Issues
For a multi-term query, a document that matches a single query term in many fields can receive an unreasonably higher score than another document that matches all the query terms in a few fields (see the discussion in [Robertson et al., CIKM’04]).
  score(d1) = s + s + s + … + s = 8s
  score(d2) = 2s + 2s + … + 0 = 4s
  score(d1) > score(d2): unreasonable
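
The arithmetic of this example can be reproduced with a small sketch. The setup is an assumption made for illustration: eight fields with uniform weights, a two-term query, and a per-field score of s for each matched query term. The function name `combined_score` is hypothetical.

```python
def combined_score(field_scores, weights=None):
    """Linear score combination: weighted sum of per-field scores."""
    if weights is None:
        weights = [1.0] * len(field_scores)
    if len(weights) != len(field_scores):
        raise ValueError("need exactly one weight per field")
    return sum(w * s for w, s in zip(weights, field_scores))


if __name__ == "__main__":
    s = 1.0
    # d1 matches ONE of the two query terms in each of eight fields.
    d1 = [s] * 8
    # d2 matches BOTH query terms, but only in two fields.
    d2 = [2 * s, 2 * s] + [0.0] * 6
    print(combined_score(d1), combined_score(d2))
```

Because each field score is computed independently and then summed, repeated matches of the same term across fields add up without any saturation, so d1 outranks d2 even though d2 covers the whole query.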

TF Combination Issues
Consider a single-term query Q = t and some documents with two fields (F1, F2).
Assume w1 = weight(F1) = 5 and w2 = weight(F2) = 1.
Example-1 (assuming |d1| = |d2|):
  tf(t,d1) = w1 * 1 + w2 * 0 = 5
  tf(t,d2) = w1 * 0 + w2 * 6 = 6
  score(d1) < score(d2): reasonable
Example-2 (assuming |d3| = |d4|):
  tf(t,d3) = w1 * 1 + w2 * 8 = 13
  tf(t,d4) = w1 * 0 + w2 * 14 = 14
  score(d3) < score(d4): unreasonable
Would a larger w1 help? It cannot remove this issue, and it risks making the case of Example-1 unreasonable.
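
A small sketch of the weighted term-frequency combination in these two examples, using the per-field occurrence counts that appear in the tf expressions above. Two things are assumptions made for illustration: documents of equal length are ranked by their combined tf, and the desired outcome in Example-2 is that d3 outranks d4. The names `combined_tf` and `orderings` are hypothetical.

```python
# Occurrences of t per field, read off the tf expressions in the examples.
FIELD_COUNTS = {
    "d1": {"F1": 1, "F2": 0},
    "d2": {"F1": 0, "F2": 6},
    "d3": {"F1": 1, "F2": 8},
    "d4": {"F1": 0, "F2": 14},
}


def combined_tf(counts, weights):
    """Weighted sum of per-field term frequencies."""
    return sum(weights[field] * count for field, count in counts.items())


def orderings(w1, w2=1):
    """Return (example_1_ok, example_2_ok) for the given field weights."""
    weights = {"F1": w1, "F2": w2}
    tf = {doc: combined_tf(c, weights) for doc, c in FIELD_COUNTS.items()}
    example_1_ok = tf["d1"] < tf["d2"]  # ordering the slide calls reasonable
    example_2_ok = tf["d3"] > tf["d4"]  # ASSUMED desired ordering
    return example_1_ok, example_2_ok


if __name__ == "__main__":
    for w1 in (5, 6, 7, 10):
        print(w1, orderings(w1))
```

For these counts, Example-1 stays reasonable only while w1 < 6, whereas Example-2 needs w1 > 6, so no single value of w1 gives both orderings. That is the trade-off the last line of the slide points to.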

Structured Document Retrieval by GBM

Experimental Results
Performance comparison of different approaches for the combination of body and title fields: [table not captured in transcript]

Outline (current section: Summary)

Summary
- Viewing IR from a different viewpoint is as important as going deeper along traditional perspectives.
- This paper may be a first step toward taking a physics viewpoint.
- It is encouraging that we can really benefit from nature:
  - A family of effective ranking functions is derived
  - BM25 is given a physical interpretation
  - A more reasonable approach to structured document retrieval is obtained

Sorry, Sir Isaac Newton. Hope I am not abusing your laws.

The End
Gravitation-Based Model for Information Retrieval
Please send your comments to: