
Probabilistic IR Models

Based on probability theory.

Basic idea: Given a document d and a query q, estimate the probability of d being relevant to the information need represented by q, i.e. P(R|q,d).

Compared to previous models:
- Boolean and vector models: ranking is based on a relevance value that is interpreted as a similarity measure between q and d.
- Probabilistic models: ranking is based on the estimated probability of d being relevant to query q.

Probabilistic IR Models (cont.)

Again, documents and queries are represented as vectors with binary weights (i.e. w_ij = 0 or 1).

Relevance is seen as a relationship between an information need (expressed as a query q) and a document: a document d is relevant if and only if a user with information need q "wants" d. Relevance is a function of various parameters, is subjective, and cannot always be specified exactly.

Hence: a probabilistic description of relevance, i.e. instead of a vector space we operate in an event space Q x D (Q = set of possible queries, D = set of all documents in the collection).

Interpretation: If a user with information need q draws a random document d from the collection, how large is the probability that d is relevant, i.e. P(R|q,d)?

The Probability Ranking Principle

Probability Ranking Principle (Robertson, 1977): Optimal retrieval performance can be achieved when documents are ranked according to their probabilities of being judged relevant to a query. (Informal definition)

Involves two assumptions:
1. Dependencies between documents are ignored.
2. It is assumed that the probabilities can be estimated in the best possible way.

Main task: estimating the probability P(R|q,d) for every document d in the document collection D.

Reference: F. Crestani et al., "Is This Document Relevant? ... Probably": A Survey of Probabilistic Models in Information Retrieval [3]
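The principle reduces to a one-line ranking rule once the probabilities are available. A minimal sketch in Python, using made-up probability estimates (a real system would estimate P(R|q,d), e.g. with the models on the following slides):

```python
# Probability Ranking Principle: return documents ordered by their
# estimated probability of relevance, P(R|q,d).  The probabilities
# below are invented toy values for illustration only.

def rank_by_relevance_probability(prob_relevant):
    """prob_relevant: dict mapping doc id -> estimated P(R|q,d)."""
    return sorted(prob_relevant, key=prob_relevant.get, reverse=True)

estimates = {"d1": 0.8, "d2": 0.1, "d3": 0.55}
print(rank_by_relevance_probability(estimates))  # ['d1', 'd3', 'd2']
```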

Probabilistic Modeling

Given: documents d_j = (t_1, t_2, ..., t_n) and queries q_i (n = number of index terms).

We assume a similar dependence between d and q as before, i.e. relevance depends on the term distribution. (Note: slightly different notation here than before!)

Estimating P(R|d,q) directly is often impossible in practice. Instead, use Bayes' theorem, i.e.

  P(R|d,q) = P(d|R,q) * P(R|q) / P(d|q)

or

  P(R|d,q) = P(d,q|R) * P(R) / P(d,q)
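Numerically, Bayes' theorem turns the hard-to-estimate P(R|d,q) into quantities that can be estimated per class. A toy calculation (all probabilities are invented for illustration; q is held fixed and dropped from the notation):

```python
# Bayes' theorem for relevance: P(R|d) = P(d|R) * P(R) / P(d).
# All numbers are assumed toy values, with the query fixed and omitted.

p_d_given_rel = 0.20      # P(d|R): prob. of observing d among relevant docs
p_d_given_nonrel = 0.05   # P(d|NR): same, among non-relevant docs
p_rel = 0.10              # P(R): prior probability of relevance

# Law of total probability gives the denominator P(d)
p_d = p_d_given_rel * p_rel + p_d_given_nonrel * (1 - p_rel)

# Bayes' theorem
p_rel_given_d = p_d_given_rel * p_rel / p_d
print(round(p_rel_given_d, 3))  # 0.308
```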

Probabilistic Modeling as Decision Strategy

The decision about which documents should be returned is based on a threshold calculated with a cost function C_j. Example:

  C_j(R, dec)     Retrieved   Not retrieved
  Relevant doc.       0             1
  Non-rel. doc.       2             0

Decision based on a risk function that minimizes expected costs: with the costs above, retrieve d if

  2 * P(NR|d,q) < 1 * P(R|d,q),

i.e. if the expected cost of retrieving d is smaller than the expected cost of not retrieving it.
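The cost table yields a simple retrieval rule; a sketch with the cost values taken from the example table (costs: 1 for missing a relevant document, 2 for retrieving a non-relevant one):

```python
# Decision-theoretic retrieval: retrieve a document iff the expected
# cost of retrieving it is lower than the expected cost of skipping it.

COST_RETRIEVE_NONREL = 2   # cost of retrieving a non-relevant doc
COST_MISS_REL = 1          # cost of not retrieving a relevant doc

def should_retrieve(p_rel):
    """p_rel: estimated P(R|d,q) for the document."""
    expected_cost_retrieve = COST_RETRIEVE_NONREL * (1 - p_rel)
    expected_cost_skip = COST_MISS_REL * p_rel
    return expected_cost_retrieve < expected_cost_skip

# With these costs the implied threshold is P(R|d,q) > 2/3:
print(should_retrieve(0.7), should_retrieve(0.5))  # True False
```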

Probabilistic Modeling as Decision Strategy (cont.)

Probability Estimation

Different approaches to estimating P(d|R) exist:
- Binary Independence Retrieval Model (BIR)
- Binary Independence Indexing Model (BII)
- Darmstadt Indexing Approach (DIA)

Generally we assume stochastic independence between the terms of a document, i.e.

  P(d|R) = P(t_1|R) * P(t_2|R) * ... * P(t_n|R)
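Under the independence assumption, P(d|R) is just a product of per-term probabilities. A minimal sketch with toy values (real per-term probabilities would be estimated from relevance judgments):

```python
import math

# Term-independence assumption: P(d|R) factorizes into a product of
# per-term probabilities P(t_i|R).  The values below are assumed.

def doc_prob_given_relevance(term_probs):
    """term_probs: list of P(t_i|R) for the terms of document d."""
    return math.prod(term_probs)

print(round(doc_prob_given_relevance([0.5, 0.4, 0.1]), 3))  # 0.02
```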

Binary Independence Retrieval Model (BIR)

Learning: estimation of the probability distribution based on
- a query q_k
- a set of documents d_j
- the respective relevance judgments

Application: generalization to different documents from the collection (but restricted to the same query and the terms from training).

(Diagram: docs/queries grid showing the learning and application regions of the BIR model)
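A sketch of BIR learning and application for a single query. The slide names the model but not its formulas; the log-odds term weights with add-0.5 smoothing below are the standard Robertson/Spärck Jones estimates, used here as an assumption:

```python
import math

# BIR learning for ONE query: from relevance judgments, estimate per
# query term p = P(term present | relevant) and s = P(term present |
# non-relevant) with add-0.5 smoothing, then score unseen documents
# by summing the log-odds weights of their matching terms.

def learn_bir_weights(judged_docs, query_terms):
    """judged_docs: list of (set_of_terms, is_relevant) pairs."""
    n_rel = sum(1 for _, rel in judged_docs if rel)
    n_nonrel = len(judged_docs) - n_rel
    weights = {}
    for t in query_terms:
        r = sum(1 for terms, rel in judged_docs if rel and t in terms)
        nr = sum(1 for terms, rel in judged_docs if not rel and t in terms)
        p = (r + 0.5) / (n_rel + 1)        # smoothed P(t present | R)
        s = (nr + 0.5) / (n_nonrel + 1)    # smoothed P(t present | NR)
        weights[t] = math.log(p * (1 - s) / (s * (1 - p)))
    return weights

def bir_score(doc_terms, weights):
    return sum(w for t, w in weights.items() if t in doc_terms)

judged = [({"a", "b"}, True), ({"a", "c"}, True), ({"c", "d"}, False)]
w = learn_bir_weights(judged, {"a", "b"})
print(bir_score({"a", "b"}, w) > bir_score({"c", "d"}, w))  # True
```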

Binary Independence Indexing Model (BII)

Learning: estimation of the probability distribution based on
- a document d_j
- a set of queries q_k
- the respective relevance judgments

Application: generalization to different queries (but restricted to the same document and the terms from training).

(Diagram: docs/queries grid showing the learning and application regions of the BII model)

Darmstadt Indexing Approach (DIA)

Learning: estimation of the probability distribution based on
- a set of queries q_k
- an abstract description of a set of documents d_j
- the respective relevance judgments

Application: generalization to different queries and documents.

(Diagram: docs/queries grid showing the learning and application regions of the DIA)

DIA - Description Step

Basic idea: Instead of term-document pairs, consider relevance descriptions x(t_i, d_m). These contain the values of certain attributes of term t_i, of document d_m, and of their relation to each other.

Examples:
- dictionary information about t_i (e.g. IDF)
- parameters describing d_m (e.g. length or number of unique terms)
- information about the occurrence of t_i in d_m (e.g. in the title or abstract), its frequency, the distance between two query terms, etc.

Reference: Fuhr, Buckley [4]
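A hypothetical relevance-description extractor: the two attributes (title occurrence, capped frequency) mirror the simple example on the next slide, while the function name and document layout are invented for illustration:

```python
# Sketch of a DIA relevance description x(t, d) with two attributes:
# x1 = 1 if the term occurs in the title, 0 otherwise;
# x2 = 1 if the term occurs exactly once, 2 if at least twice.
# Assumes the term occurs in the document at all.

def relevance_description(term, doc):
    """doc: dict with 'title' and 'body' given as lists of terms."""
    x1 = 1 if term in doc["title"] else 0
    count = doc["title"].count(term) + doc["body"].count(term)
    x2 = 1 if count == 1 else 2
    return (x1, x2)

doc = {"title": ["probabilistic", "retrieval"],
       "body": ["retrieval", "models", "based", "on", "probability"]}
print(relevance_description("retrieval", doc))  # (1, 2)
```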

DIA - Decision Step

Estimation of the probability P(R | x(t_i, d_m)): the probability of a document d_m being relevant to an arbitrary query, given that a term common to both document and query has the relevance description x(t_i, d_m).

Advantages:
- abstraction from specific term-document pairs, and thus generalization to arbitrary documents and queries
- enables individual, application-specific relevance descriptions

DIA - (Very) Simple Example

Relevance description: x(t_i, d_m) = (x_1, x_2) with

  x_1 = 1 if t_i occurs in the title of d_m, 0 otherwise
  x_2 = 1 if t_i occurs in d_m exactly once, 2 if at least twice

Training set: q_1, q_2, d_1, d_2, d_3. Event space:

  Query  Doc   Rel.       Terms                x values
  q_1    d_1   rel.       t_1, t_2, t_3        (1,1) (0,1) (1,2)
  q_1    d_2   not rel.   t_1, t_3, t_4        (0,2) (1,1) (0,1)
  q_2    d_1   rel.       t_2, t_5, t_6, t_7   (0,2) (0,2) (1,1) (1,2)
  q_2    d_3   not rel.   t_5, t_7             (0,1) (0,1)

Resulting estimates (relative frequency of relevant events per description):

  x      P(R|x)
  (0,1)  1/4
  (0,2)  2/3
  (1,1)  2/3
  (1,2)  1
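The decision step can be reproduced directly: for each description x, count how often it stems from a relevant (query, document) pair. The training tuples below are an assumption chosen to reproduce the P(R|x) values of the example:

```python
from collections import defaultdict
from fractions import Fraction

# Estimate P(R|x) as the relative frequency of relevance among all
# (query, doc, term) events whose relevance description equals x.

def estimate_p_rel(training):
    """training: list of (x, is_relevant) pairs, one per query-doc-term event."""
    totals, relevant = defaultdict(int), defaultdict(int)
    for x, rel in training:
        totals[x] += 1
        relevant[x] += rel
    return {x: Fraction(relevant[x], totals[x]) for x in totals}

training = (
      [(x, True)  for x in [(1, 1), (0, 1), (1, 2)]]          # q1, d1 (rel.)
    + [(x, False) for x in [(0, 2), (1, 1), (0, 1)]]          # q1, d2 (not rel.)
    + [(x, True)  for x in [(0, 2), (0, 2), (1, 1), (1, 2)]]  # q2, d1 (rel.)
    + [(x, False) for x in [(0, 1), (0, 1)]]                  # q2, d3 (not rel.)
)

p = estimate_p_rel(training)
print(p[(0, 1)], p[(0, 2)], p[(1, 1)], p[(1, 2)])  # 1/4 2/3 2/3 1
```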