SIGIR’2005
Gravitation-Based Model for Information Retrieval
Shuming Shi, Ji-Rong Wen, Qing Yu, Ruihua Song, Wei-Ying Ma
Microsoft Research Asia
Background
A core problem in Information Retrieval (IR): determine the relevance of a document to a query.
Given a document and a query: is the document relevant? How relevant?
Background: IR Models & Perspectives
IR models define the representation of documents, queries, and the relevance relationship between them.
Behind every IR model is a primary perspective on information retrieval:
Model | Perspective
Boolean model | Set theory and Boolean algebra
Vector space model | Vectors and linear algebra
Probabilistic model, language model | Probability theory
…… | ……
Background: Hard questions
What is the essence of information retrieval? What is the right perspective on it?
So far, each newly adopted perspective has taught us more about IR.
Viewing IR problems from further new perspectives should help as well.
We try to view IR from the perspective of physics.
Background
[Image, captioned "1687 AD"]
Background
We live in a physical world governed by fundamental laws of physics.
Can we get help from “God” in acquiring a deeper understanding of information retrieval?
Simply start from Newton’s law of universal gravitation…
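For reference, Newton's law of universal gravitation states that two point masses m_1 and m_2 separated by a distance r attract each other with a force:

```latex
F = G \, \frac{m_1 m_2}{r^2}
```

Here G is the gravitational constant. This is the standard textbook form; how masses and distances are assigned to terms, documents, and queries is the subject of the model itself.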
Preliminary Achievements
It is encouraging that we can really benefit from nature. With the new perspective, we obtain the following preliminary results:
- We build a new IR model, GBM, from which many effective ranking functions can be derived.
- The BM25 formula can be derived from our model, which gives an intuitive physical interpretation of this powerful and robust function.
- A more reasonable approach to structured document retrieval can be obtained directly from the model. This approach is not only highly effective but also robust under various conditions.
Note on BM25: it was first discovered by Robertson et al., inspired by the shape of a complex formula derived from a probabilistic model under the 2-Poisson assumption. Amati and van Rijsbergen proposed a probabilistic framework with which the BM25 function with some special parameters (k1 = 1.2, b = 0.75; or k1 = 2, b = 0.75) can be approximated numerically. A complete theoretical derivation of the BM25 formula is still lacking.
Outline
Background
Gravitation-based Model
  Notations & Basic Concepts
  Discrete GBM Model
  Continuous GBM Model
  Model analysis
GBM Model for Structured Document Retrieval
Summary
GBM: Initial Idea
A mapping needs to be built from the concepts of information retrieval to those of physics.
Document and query: relevance score <-> attractive force
IR concepts & notations:
|D|: document length
df(t): document frequency of t
avdl: average document length in a collection
N: total number of documents
c(t,D): number of occurrences of t in D (also written tf(t,D))
Physics concepts: mass, distance, ……
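The IR quantities in the notation list can be computed directly from a tokenized collection. The following is a minimal sketch; the function name `collection_stats`, the input format (a dict from document id to token list), and the toy documents are assumptions made for illustration and do not come from the talk.

```python
from collections import Counter

def collection_stats(docs):
    """Compute N, |D|, avdl, c(t, D) and df(t) for a tokenized collection.

    `docs` maps a document id to its list of tokens (hypothetical format).
    """
    n_docs = len(docs)                                        # N
    doc_len = {d: len(toks) for d, toks in docs.items()}      # |D|
    avdl = sum(doc_len.values()) / n_docs                     # avdl
    counts = {d: Counter(toks) for d, toks in docs.items()}   # c(t, D)
    df = Counter()                                            # df(t)
    for term_counts in counts.values():
        df.update(term_counts.keys())   # each document counts once per term
    return n_docs, doc_len, avdl, counts, df

# Toy collection, invented for this example.
docs = {
    "d1": ["gravity", "model", "gravity"],
    "d2": ["retrieval", "model"],
    "d3": ["newton"],
}
N, doc_len, avdl, counts, df = collection_stats(docs)
print(N, avdl, df["model"], counts["d1"]["gravity"])  # 3 2.0 2 2
```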
GBM: Notations & Basic Concepts
Particle (= atom): the basic element of any object.
A particle has two attributes: mass and type.
Type: determined by the term object the particle composes.
GBM: Notations & Basic Concepts
Two natural assumptions:
H(D): hidden terms in document D.
A term object has 4 attributes: type, shape, mass, and diameter.
Notation List
Discrete GBM Model
Key points:
1. Under the attraction of query terms, the structure of each document is adjusted to an optimized-term-placement state.
2. The relevance between a document and a query is defined as the attractive force between them when the document is in its optimized-term-placement state.
Optimized-term-placement state: a state in which the aggregated force between the document and the query is maximized.
Term Weighting Formula
The maximal (optimized) gravitational force between t and D: [equation]
The force between query term t and its i-th nearest occurrence in D: [equation]
The attractive force between D and Q: [equation]
Unknown expressions: m(t,Q), m(t,D), and di(t,D)
Need: mass and diameter estimation
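To make the idea of summing per-occurrence forces concrete, here is an illustrative sketch only. It applies the textbook inverse-square law to each occurrence and assumes that the i-th nearest occurrence sits at distance i times a fixed diameter. That placement rule, the function name `discrete_force`, and all parameter values are assumptions for this example; the talk's own expressions for m(t,Q), m(t,D), and di(t,D) are not reproduced here.

```python
import math

def discrete_force(tf, m_query=1.0, m_occ=1.0, diameter=1.0, G=1.0):
    """Sum inverse-square attractions between a query term and tf occurrences.

    Assumption (not from the talk): the i-th nearest occurrence lies at
    distance i * diameter from the query term.
    """
    return sum(G * m_query * m_occ / (i * diameter) ** 2
               for i in range(1, tf + 1))

for tf in (1, 2, 5, 50):
    print(tf, round(discrete_force(tf), 4))

# Under this placement assumption the sum of 1/i^2 is bounded by pi^2 / 6,
# so each additional occurrence contributes less force than the previous one.
print(round(math.pi ** 2 / 6, 4))
```

Under these assumptions the force grows with term frequency but saturates, which is the kind of behaviour a term-weighting function is usually expected to show.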
Mass and Diameter Estimation
Assumption-1 to Assumption-4 are used for the estimation; the statements given in text are:
- For any two terms, their mass ratio in any document equals the ratio of their average masses in the whole collection.
- All terms in the same document have equal diameters.
Definition: a document-independent mass is defined for each (type of) term. It denotes the average mass of term t in the whole collection.
Ultimate Discrete GBM Formula
The ultimate term-weighting function: [equation], where [equation] and [equation]
The formula uses the average (document-independent) mass of term t in the collection.
The mass of a document is a measure of its quality, which depends on how informative and important it is. Relationship with PageRank?
Ultimate Discrete GBM Formula (special case)
If m(D) = const, di(D) = const, and [equation], then a special case of the term-weighting function is: [equation], where [equation]
Two parameters: [equation]
Continuous GBM Model
Document D is now in its optimized-term-placement state.
Term shape: ideal cylinder.
Term Weighting Formula
The maximal (optimized) gravitational force between t and D: [equation]
The force between query term t and its i-th nearest occurrence in D: [equation]
Ultimate Continuous GBM Formula
By doing mass and diameter estimation, we have the ultimate term-weighting function: [equation], where [equation] and [equation]
If m(D) = const, di(D) = const, and [equation], then a special case of the above term-weighting function is: [equation]
(Two parameters: [equation])
Continuous GBM Formula vs. BM25
A special case of the continuous GBM term-weighting function: [equation], where [equation]
BM25 term-weighting function: [equation]
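For comparison, the commonly cited textbook form of the BM25 term weight can be sketched as follows. The exact variant shown in the talk is not known here, so the IDF form and the function name `bm25_term_weight` are assumptions; the default parameters k1 = 1.2 and b = 0.75 are the values quoted earlier in the talk.

```python
import math

def bm25_term_weight(tf, df, n_docs, doc_len, avdl, k1=1.2, b=0.75):
    """Textbook BM25 weight of one query term in one document."""
    # Robertson/Sparck Jones style IDF (one common variant; an assumption here).
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    # Document-length normalization: longer-than-average documents are penalized.
    norm = k1 * (1.0 - b + b * doc_len / avdl)
    return idf * tf * (k1 + 1.0) / (tf + norm)

# Example values invented for illustration.
w = bm25_term_weight(tf=3, df=10, n_docs=1000, doc_len=120, avdl=100)
print(round(w, 4))
```

The weight is zero when tf is zero, increases with tf, and approaches idf * (k1 + 1) as tf grows, so repeated occurrences saturate.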
Other Ranking Formulas Derived
[Table: ranking formulas (highly simplified versions) derived from the continuous GBM model with various gravitational-field functions]
Check with Heuristic Constraints
[Fang et al., SIGIR’04]: heuristic constraints related to TF, IDF, and document length that all reasonable ranking formulas should satisfy:
- TFC1, TFC2
- TDC
- M-TDC
- LNC1, LNC2
- TF-LNC
All our derived term-weighting functions satisfy all of the above constraints.
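A numerical spot check of this kind can be sketched as below. The constraint statements in the comments are informal paraphrases, only three of the constraints are covered, and the function under test is the TF and length-normalization factor of textbook BM25 rather than one of the GBM-derived functions, whose formulas are not reproduced here. The names `bm25_tf_part` and `check_constraints` are hypothetical.

```python
def bm25_tf_part(tf, doc_len, avdl=100.0, k1=1.2, b=0.75):
    """TF and length-normalization factor of textbook BM25 (IDF omitted)."""
    return tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * doc_len / avdl))

def check_constraints(score, max_tf=20, doc_len=100):
    """Spot-check three constraints on score(tf, doc_len), paraphrased informally."""
    vals = [score(tf, doc_len) for tf in range(1, max_tf + 1)]
    gains = [hi - lo for lo, hi in zip(vals, vals[1:])]
    return {
        # TFC1: more occurrences of a query term should give a higher score.
        "TFC1": all(g > 0 for g in gains),
        # TFC2: each extra occurrence should add less than the previous one.
        "TFC2": all(g2 < g1 for g1, g2 in zip(gains, gains[1:])),
        # LNC1: a longer document with the same tf should not score higher.
        "LNC1": score(3, doc_len + 1) <= score(3, doc_len),
    }

print(check_constraints(bm25_tf_part))
```

A check like this only samples a few values; it can reveal a violation but does not prove that a constraint holds in general.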
Preliminary Experiments: Experimental Setup
[Table: corpora characteristics]
[Table: query sets used in the experiments]
Preliminary Experiments: Experimental Results
[Table: optimal performance comparison among some formulas over various corpora and tasks (measure: mean average precision)]
Structured Document Retrieval
A document is said to be structured here when it contains multiple fields.
Current approaches for structured document retrieval:
- Score combination: the most commonly used and well-studied approach. Rank combination is a special case of score combination.
- Term-frequency combination:
  [Robertson et al., CIKM’04]: an extension of BM25
  [Ogilvie et al., SIGIR’03]: linearly combining language models
Each approach works moderately well, but…
Score Combination Issues
For a multi-term query, a document matching a single query term over many fields can get an unreasonably higher score than another document that matches all the query terms in a few fields. (See the discussion in [Robertson et al., CIKM’04].)
score(d1) = s + s + s + … + s = 8s
score(d2) = 2s + 2s + … + 0 = 4s
score(d1) > score(d2): unreasonable
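The arithmetic of this example can be reproduced with a small sketch of linear score combination. The assumptions are: eight fields with equal weight 1, a per-field score of s for each matched query term, and a two-term query. The function names `field_score` and `combined_score` are hypothetical.

```python
def field_score(matched_terms, s=1.0):
    """Hypothetical per-field score: s for every query term the field matches."""
    return s * matched_terms

def combined_score(matches_per_field, weights=None, s=1.0):
    """Linear score combination: weighted sum of the per-field scores."""
    if weights is None:
        weights = [1.0] * len(matches_per_field)  # equal field weights (assumption)
    return sum(w * field_score(m, s) for w, m in zip(weights, matches_per_field))

# Two-term query, eight fields; each entry is the number of query terms matched.
d1 = [1, 1, 1, 1, 1, 1, 1, 1]  # one query term matched in every field
d2 = [2, 2, 0, 0, 0, 0, 0, 0]  # both query terms matched in two fields
print(combined_score(d1), combined_score(d2))  # 8.0 4.0
```

Because the per-field scores are simply added, d1 outranks d2 even though it never matches the second query term.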
TF Combination Issues
Consider a single-term query Q = t and some documents with two fields (F1, F2). Assume w1 = weight(F1) = 5 and w2 = weight(F2) = 1.
Example-1 (assuming |d1| = |d2|):
tf(t,d1) = w1 * 1 + w2 * 0 = 5
tf(t,d2) = w1 * 0 + w2 * 6 = 6
score(d1) < score(d2): reasonable
Example-2 (assuming |d3| = |d4|):
tf(t,d3) = w1 * 1 + w2 * 8 = 13
tf(t,d4) = w1 * 0 + w2 * 14 = 14
score(d3) < score(d4): unreasonable
Would a larger w1 help? It cannot remove this issue, and it risks making the case of Example-1 unreasonable.
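The weighted term frequencies in the two examples can be recomputed with a short sketch. The function names `combined_tf` and `tf_table` are hypothetical, and the second weight setting (w1 = 7) is an invented value used only to show the trade-off. With the field counts as given, the product for d4 evaluates to 14.

```python
def combined_tf(field_tfs, weights):
    """Weighted term frequency: sum of field weight times per-field tf."""
    return sum(w * tf for w, tf in zip(weights, field_tfs))

# Per-field term frequencies (F1, F2) taken from the two examples.
field_tfs = {
    "d1": (1, 0), "d2": (0, 6),    # Example-1
    "d3": (1, 8), "d4": (0, 14),   # Example-2
}

def tf_table(weights):
    """Combined tf of every example document under the given field weights."""
    return {name: combined_tf(tfs, weights) for name, tfs in field_tfs.items()}

print(tf_table((5, 1)))  # weights from the example: w1 = 5, w2 = 1
print(tf_table((7, 1)))  # hypothetical larger w1
```

With w1 = 5 the combined tf values are 5, 6, 13, and 14. Raising w1 to 7 puts d3 (15) ahead of d4 (14), but it also puts d1 (7) ahead of d2 (6), which reverses the ordering that was reasonable in Example-1.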
Structured Document Retrieval by GBM
Experimental Results
[Table: performance comparison of different approaches for the combination of body and title fields]
Summary
Viewing IR from a different viewpoint is as important as going deeper within traditional perspectives.
This paper may be a first step toward taking a physics viewpoint.
It is encouraging that we can really benefit from nature:
- A family of effective ranking functions is derived.
- BM25 is given a physical interpretation.
- A more reasonable approach to structured document retrieval is obtained.
Sorry, Sir Isaac Newton. Hope I am not abusing your laws.
The End
Gravitation-Based Model for Information Retrieval
Please send your comments to: