Probabilistic IR Models

- Based on probability theory
- Basic idea: Given a document d and a query q, estimate the likelihood of d being relevant for the information need represented by q, i.e. P(R|q,d)
- Compared to previous models:
  - Boolean and Vector Models: Ranking based on a relevance value which is interpreted as a similarity measure between q and d
  - Probabilistic Models: Ranking based on the estimated likelihood of d being relevant for query q
Probabilistic IR Models

- Again: Documents and queries are represented as vectors with binary weights (i.e. w_ij = 0 or 1)
- Relevance is seen as a relationship between an information need (expressed as query q) and a document. A document d is relevant if and only if a user with information need q "wants" d.
- Relevance is a function of various parameters, is subjective, and cannot always be specified exactly.
- Hence: probabilistic description of relevance, i.e. instead of a vector space we operate in an event space Q × D (Q = set of possible queries, D = set of all docs in the collection)
- Interpretation: If a user with info need q draws a random document d from the collection, how big is its likelihood of being relevant, i.e. P(R|q,d)?
The Probability Ranking Principle

- Probability Ranking Principle (Robertson, 1977): Optimal retrieval performance can be achieved when documents are ranked according to their probabilities of being judged relevant to a query. (Informal definition)
- Involves two assumptions:
  1. Dependencies between docs are ignored
  2. It is assumed that the probabilities can be estimated in the best possible way
- Main task: Estimation of the probability P(R|q,d) for every document d in the document collection D

Reference: F. Crestani et al., "Is This Document Relevant? ... Probably: A Survey of Probabilistic Models in IR" [3]
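The ranking prescribed by the principle can be sketched as a plain sort by estimated probability. A minimal illustration (the document IDs and probability estimates below are made-up values, not from the slides):

```python
# Minimal sketch of the Probability Ranking Principle: return the
# collection sorted by estimated P(R|q,d), highest first.

def rank(docs_with_prob):
    """docs_with_prob: list of (doc_id, estimated P(R|q,d)) pairs."""
    return [d for d, _ in sorted(docs_with_prob,
                                 key=lambda pair: pair[1],
                                 reverse=True)]

print(rank([("d1", 0.2), ("d2", 0.9), ("d3", 0.5)]))  # ['d2', 'd3', 'd1']
```

Everything else in the probabilistic models is about how the estimates fed into this sort are obtained.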
Probabilistic Modeling

- Given: documents d_j = (t_1, t_2, ..., t_n), queries q_i (n = number of index terms in the collection)
- We assume a similar dependence between d and q as before, i.e. relevance depends on the term distribution (Note: slightly different notation here than before!)
- Estimating P(R|d,q) directly is often impossible in practice. Instead: use Bayes' theorem, i.e.

  P(R|d,q) = P(d|R,q) · P(R|q) / P(d|q)

  or, in odds form,

  O(R|d,q) = P(R|d,q) / P(NR|d,q) = [P(d|R,q) / P(d|NR,q)] · [P(R|q) / P(NR|q)]
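The Bayes inversion can be made concrete in a few lines. A sketch with illustrative probability values (none of the numbers come from the slides):

```python
# Hypothetical sketch: computing P(R|d,q) via Bayes' theorem from
# the quantities that are easier to estimate.

def posterior_relevance(p_d_given_rel, p_rel, p_d):
    """Bayes' theorem: P(R|d,q) = P(d|R,q) * P(R|q) / P(d|q)."""
    return p_d_given_rel * p_rel / p_d

# Illustrative values: P(d|R,q) = 0.4, P(R|q) = 0.1, P(d|q) = 0.05
print(posterior_relevance(0.4, 0.1, 0.05))  # 0.8
```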
Probab. Modeling as Decision Strategy

- Decision about which docs should be returned is based on a threshold calculated with a cost function C_j
- Example:

  C_j(R, dec)   | Retrieved | Not retrieved
  Relevant Doc. |     0     |       1
  Non-Rel. Doc. |     2     |       0

- Decision based on a risk function that minimizes costs: retrieve d_j iff the expected cost of retrieving it does not exceed the expected cost of not retrieving it, i.e.

  C(R, retr) · P(R|q,d_j) + C(NR, retr) · P(NR|q,d_j) ≤ C(R, not retr) · P(R|q,d_j) + C(NR, not retr) · P(NR|q,d_j)

  With the example costs this reduces to 2 · P(NR|q,d_j) ≤ P(R|q,d_j).
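The cost-based decision can be sketched directly from the example cost table; only the probability values passed in at the end are invented for illustration:

```python
# Sketch of the cost-based retrieval decision with the example costs:
# C(rel, retrieved) = 0, C(rel, not retrieved) = 1,
# C(nonrel, retrieved) = 2, C(nonrel, not retrieved) = 0.

C_RETRIEVE = {"rel": 0, "nonrel": 2}
C_SKIP = {"rel": 1, "nonrel": 0}

def retrieve(p_rel):
    """Retrieve d iff the expected cost of retrieving it is not larger
    than the expected cost of not retrieving it."""
    p_nonrel = 1.0 - p_rel
    cost_retrieve = C_RETRIEVE["rel"] * p_rel + C_RETRIEVE["nonrel"] * p_nonrel
    cost_skip = C_SKIP["rel"] * p_rel + C_SKIP["nonrel"] * p_nonrel
    return cost_retrieve <= cost_skip

# With these costs, retrieve iff 2*(1 - P(R)) <= P(R), i.e. P(R) >= 2/3.
print(retrieve(0.7))  # True
print(retrieve(0.5))  # False
```

Changing the cost table shifts the retrieval threshold, which is exactly the role of the cost function in the decision strategy.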
Probabilistic Modeling as Decision Strategy (Cont.)
Probability Estimation

- Different approaches to estimate P(d|R) exist:
  - Binary Independence Retrieval Model (BIR)
  - Binary Independence Indexing Model (BII)
  - Darmstadt Indexing Approach (DIA)
- Generally we assume stochastic independence between the terms of one document, i.e.

  P(d|R) = ∏_i P(t_i|R)
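The independence assumption turns the document probability into a product of per-term probabilities. A small sketch with illustrative per-term values (summing logs, as is common in practice, to avoid underflow for long documents):

```python
import math

# Sketch of the term-independence assumption:
# P(d|R) = product over i of P(t_i|R) for the terms of d.

def doc_likelihood(term_probs):
    """Multiply the (assumed independent) per-term probabilities.
    Computed in log space to avoid floating-point underflow."""
    return math.exp(sum(math.log(p) for p in term_probs))

# Illustrative values for P(t_1|R), P(t_2|R), P(t_3|R):
print(round(doc_likelihood([0.5, 0.2, 0.1]), 3))  # 0.01
```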
Binary Independence Retr. Model (BIR)

- Learning: Estimation of the probability distribution based on
  - a query q_k
  - a set of documents d_j
  - respective relevance judgments
- Application: Generalization to different documents from the collection (but restricted to the same query and the terms from training)

[Figure: learning and application of the BIR model in the docs / terms / queries space]
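The BIR learning step can be sketched as estimating, for one fixed query, how often each term occurs in the judged relevant vs. non-relevant documents. The toy judgments and the 0.5/1 smoothing constants below are illustrative assumptions, not from the slides:

```python
# Hypothetical BIR learning sketch for a single query: estimate
# P(t|R) and P(t|NR) from binary term occurrences in judged documents.

def learn_bir(judged_docs):
    """judged_docs: list of (set_of_terms, is_relevant) pairs."""
    rel = [terms for terms, r in judged_docs if r]
    nonrel = [terms for terms, r in judged_docs if not r]
    vocab = set().union(*(terms for terms, _ in judged_docs))
    # Simple smoothing so unseen terms do not get zero probability.
    p_t_rel = {t: (sum(t in d for d in rel) + 0.5) / (len(rel) + 1)
               for t in vocab}
    p_t_nonrel = {t: (sum(t in d for d in nonrel) + 0.5) / (len(nonrel) + 1)
                  for t in vocab}
    return p_t_rel, p_t_nonrel

# Toy relevance judgments for one query:
judgments = [({"ir", "model"}, True), ({"ir", "prob"}, True),
             ({"model", "cat"}, False)]
p_rel, p_non = learn_bir(judgments)
print(round(p_rel["ir"], 3))  # 0.833  ("ir" occurs in both relevant docs)
```

These per-term estimates are exactly what the independence assumption multiplies together when the model is applied to new documents for the same query.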
Binary Indep. Indexing Model (BII)

- Learning: Estimation of the probability distribution based on
  - a document d_j
  - a set of queries q_k
  - respective relevance judgments
- Application: Generalization to different queries (but restricted to the same document and the terms from training)

[Figure: learning and application of the BII model in the docs / terms / queries space]
Darmstadt Indexing Approach (DIA)

- Learning: Estimation of the probability distribution based on
  - a set of queries q_k
  - an abstract description of a set of documents d_j
  - respective relevance judgments
- Application: Generalization to different queries and documents

[Figure: learning and application of the DIA model in the docs / terms / queries space]
DIA - Description Step

- Basic idea: Instead of term-document pairs, consider relevance descriptions x(t_i, d_m)
- These contain the values of certain attributes of term t_i, document d_m, and their relation to each other
- Examples:
  - Dictionary information about t_i (e.g. IDF)
  - Parameters describing d_m (e.g. length or number of unique terms)
  - Information about the appearance of t_i in d_m (e.g. in title or abstract), its frequency, the distance between two query terms, etc.

Reference: Fuhr, Buckley [4]
DIA - Decision Step

- Estimation of the probability P(R | x(t_i, d_m))
- P(R | x(t_i, d_m)) is the probability of a document d_m being relevant to an arbitrary query, given that a term common to both document and query has the relevance description x(t_i, d_m).
- Advantages:
  - Abstraction from specific term-doc pairs and thus generalization to random docs and queries
  - Enables individual, application-specific relevance descriptions
DIA - (Very) Simple Example

Relevance description: x(t_i, d_m) = (x_1, x_2) with

  x_1 = 1 if t_i occurs in the title of d_m, 0 otherwise
  x_2 = 1 if t_i occurs in d_m once, 2 if t_i occurs in d_m at least twice

Training set: q_1, q_2, d_1, d_2, d_3

Event space:

  Query | Doc. | Rel.     | Terms: x
  q_1   | d_1  | REL.     | t_1: (1,1), t_2: (0,1), t_3: (1,2)
  q_1   | d_2  | NOT REL. | t_1: (0,2), t_3: (1,1), t_4: (0,1)
  q_2   | d_1  | REL.     | t_2: (0,2), t_5: (1,1), t_6: (1,2), t_7: (0,2)
  q_2   | d_3  | NOT REL. | t_5: (0,1), t_7: (0,1)

Estimated probabilities (fraction of relevant events per description x):

  x     | P(R|x)
  (0,1) | 1/4
  (0,2) | 2/3
  (1,1) | 2/3
  (1,2) | 1
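The decision-step estimation on such a training set can be sketched as simple counting: for each relevance description x, the fraction of relevant (query, document, term) events among all events with that x. The event list below is a toy set chosen to be consistent with the slide's estimates of 1/4, 2/3, 2/3, and 1:

```python
from collections import defaultdict

# Sketch of the DIA decision step: estimate P(R|x) as the fraction of
# relevant (query, doc, term) events among all events with description x.

def estimate_p_rel(events):
    """events: list of (x, is_relevant) with x = (x1, x2)."""
    counts = defaultdict(lambda: [0, 0])  # x -> [relevant count, total count]
    for x, rel in events:
        counts[x][0] += int(rel)
        counts[x][1] += 1
    return {x: r / n for x, (r, n) in counts.items()}

# Toy event space (one entry per query-document-term event):
events = [((1, 1), True), ((0, 1), True), ((1, 2), True),              # q1, d1
          ((0, 2), False), ((1, 1), False), ((0, 1), False),           # q1, d2
          ((0, 2), True), ((1, 1), True), ((1, 2), True), ((0, 2), True),  # q2, d1
          ((0, 1), False), ((0, 1), False)]                            # q2, d3
print(estimate_p_rel(events)[(0, 1)])  # 0.25
```

Because the estimates depend only on x, they generalize to any term-document pair that produces the same relevance description, which is the point of the DIA abstraction.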