The Probabilistic Model

Probabilistic Model
- Objective: to capture the IR problem using a probabilistic framework.
- Given a user query, there is an ideal answer set.
- Querying is the specification of the properties of this ideal answer set (clustering).
- But what are these properties?
- Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set).
- Improve by iteration.

Probabilistic Model
- An initial set of documents is retrieved somehow.
- The user inspects these docs looking for the relevant ones (only the top ones need to be inspected).
- The IR system uses this information to refine the description of the ideal answer set.
- By repeating this process, the description of the ideal answer set is expected to improve.
- Always keep in mind the need to guess the description of the ideal answer set at the very beginning.
- The description of the ideal answer set is modeled in probabilistic terms.

Probabilistic Ranking Principle
- Given a user query $q$ and a document $d_j$, the probabilistic model tries to estimate the probability that the user will find the document $d_j$ interesting (i.e., relevant).
- The model assumes that this probability of relevance depends on the query and the document representations only.
- The ideal answer set is referred to as $R$ and should maximize the probability of relevance.
- Documents in the set $R$ are predicted to be relevant.

Probabilistic Ranking Principle
- But:
  - How should the probabilities be computed?
  - What is the sample space?

The Ranking
- Probabilistic ranking is computed as:
  - $sim(d_j, q) = \dfrac{P(d_j \text{ relevant to } q)}{P(d_j \text{ non-relevant to } q)}$
  - This is the odds of the document $d_j$ being relevant.
  - Ranking by the odds minimizes the probability of an erroneous judgment.
- Definitions:
  - $w_{ij} \in \{0, 1\}$ (index term weights are binary);
  - $P(R \mid d_j)$: probability that the given doc is relevant;
  - $P(\bar{R} \mid d_j)$: probability that the doc is not relevant.
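A small sketch of why ranking by the odds is compatible with ranking by probability: the odds $p / (1 - p)$ grow monotonically with $p$, so both orderings agree. The relevance probabilities below are hypothetical values chosen only for illustration.

```python
def odds(p_relevant: float) -> float:
    """Odds of relevance: P(R | d_j) / P(not R | d_j), where P(not R | d_j) = 1 - P(R | d_j)."""
    if not 0.0 <= p_relevant < 1.0:
        raise ValueError("probability must lie in [0, 1)")
    return p_relevant / (1.0 - p_relevant)


# Hypothetical relevance probabilities for three documents (illustrative only).
probs = {"d1": 0.2, "d2": 0.5, "d3": 0.8}

# Rank once by probability and once by odds; the two orderings coincide.
by_prob = sorted(probs, key=probs.get, reverse=True)
by_odds = sorted(probs, key=lambda d: odds(probs[d]), reverse=True)
```

Because the two orderings coincide, the model is free to work with the odds, which are easier to decompose term by term.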

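The Ranking
- By Bayes' rule:

  $$sim(d_j, q) = \frac{P(R \mid d_j)}{P(\bar{R} \mid d_j)} = \frac{P(d_j \mid R)\, P(R)}{P(d_j \mid \bar{R})\, P(\bar{R})} \sim \frac{P(d_j \mid R)}{P(d_j \mid \bar{R})}$$

  The last step drops $P(R)$ and $P(\bar{R})$, which are the same for every document in the collection.
- $P(d_j \mid R)$: probability of randomly selecting the document $d_j$ from the set $R$ of relevant documents.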

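The Ranking
- Assuming independence of index terms:

  $$sim(d_j, q) \sim \frac{P(d_j \mid R)}{P(d_j \mid \bar{R})} \sim \frac{\left[\prod_{i:\, w_{ij}=1} P(k_i \mid R)\right] \left[\prod_{i:\, w_{ij}=0} P(\bar{k}_i \mid R)\right]}{\left[\prod_{i:\, w_{ij}=1} P(k_i \mid \bar{R})\right] \left[\prod_{i:\, w_{ij}=0} P(\bar{k}_i \mid \bar{R})\right]}$$

- $P(k_i \mid R)$: probability that the index term $k_i$ is present in a document randomly selected from the set $R$ of relevant documents.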

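The Ranking

  $$sim(d_j, q) \sim \sum_i w_{iq} \cdot w_{ij} \cdot \left( \log \frac{P(k_i \mid R)}{P(\bar{k}_i \mid R)} + \log \frac{P(\bar{k}_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)$$

  where:
  - $P(\bar{k}_i \mid R) = 1 - P(k_i \mid R)$
  - $P(\bar{k}_i \mid \bar{R}) = 1 - P(k_i \mid \bar{R})$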
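The step from the product form to the sum above can be sketched as follows, writing $\bar{R}$ for the non-relevant set and $\bar{k}_i$ for the absence of $k_i$. The sketch relies on one assumption that the slides do not state: for terms that do not occur in the query, $P(k_i \mid R) = P(k_i \mid \bar{R})$, so those terms contribute nothing.

```latex
\begin{align*}
\log sim(d_j, q)
  &\sim \sum_{i:\, w_{ij}=1} \log \frac{P(k_i \mid R)}{P(k_i \mid \bar{R})}
      + \sum_{i:\, w_{ij}=0} \log \frac{P(\bar{k}_i \mid R)}{P(\bar{k}_i \mid \bar{R})} \\
  % Rewrite the sum over absent terms as (sum over all terms) - (sum over present terms).
  &= \sum_{i:\, w_{ij}=1} \left( \log \frac{P(k_i \mid R)}{P(k_i \mid \bar{R})}
      - \log \frac{P(\bar{k}_i \mid R)}{P(\bar{k}_i \mid \bar{R})} \right)
      + \underbrace{\sum_{i} \log \frac{P(\bar{k}_i \mid R)}{P(\bar{k}_i \mid \bar{R})}}_{\text{same for every } d_j} \\
  % Drop the document-independent constant and regroup the four probabilities.
  &\sim \sum_{i:\, w_{ij}=1} \left( \log \frac{P(k_i \mid R)}{P(\bar{k}_i \mid R)}
      + \log \frac{P(\bar{k}_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right) \\
  % Terms outside the query contribute log(p/(1-p)) + log((1-p)/p) = 0 under the assumption above.
  &= \sum_{i} w_{iq} \cdot w_{ij} \cdot \left( \log \frac{P(k_i \mid R)}{P(\bar{k}_i \mid R)}
      + \log \frac{P(\bar{k}_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)
\end{align*}
```

Since the logarithm is monotonic, ranking by $\log sim(d_j, q)$ gives the same order as ranking by $sim(d_j, q)$.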

The Initial Ranking
- How can we compute the probabilities $P(k_i \mid R)$ and $P(k_i \mid \bar{R})$?
- Estimation based on assumptions:
  - $P(k_i \mid R) = 0.5$
  - $P(k_i \mid \bar{R}) = n_i / N$, where $n_i$ is the number of docs that contain $k_i$ and $N$ is the total number of docs in the collection.
  - Use this initial guess to retrieve an initial ranking.
  - Improve upon this initial ranking.
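A minimal sketch of the initial ranking under these two guesses. With $P(k_i \mid R) = 0.5$ the first logarithm is zero, so each matching term contributes $\log((N - n_i)/n_i)$. The toy collection, the query, and the helper names `initial_term_weight` and `score` are hypothetical; the sketch also assumes every query term occurs in at least one but not all docs, since otherwise the logarithm is undefined.

```python
import math


def initial_term_weight(n_i: int, N: int) -> float:
    """Term weight under the initial guesses P(k_i|R) = 0.5 and P(k_i|not R) = n_i / N."""
    p_rel = 0.5
    p_non = n_i / N
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_non) / p_non)


def score(query_terms: set, doc_terms: set, doc_freq: dict, N: int) -> float:
    """Sum the weights of terms present in both query and document (binary w_iq * w_ij)."""
    return sum(initial_term_weight(doc_freq[k], N) for k in query_terms & doc_terms)


# Toy collection: each document is a set of index terms (hypothetical data).
docs = {
    "d1": {"probabilistic", "model", "retrieval"},
    "d2": {"vector", "model", "retrieval"},
    "d3": {"boolean", "model"},
    "d4": {"fuzzy", "sets"},
    "d5": {"neural", "networks"},
    "d6": {"latent", "semantic"},
}
N = len(docs)

# n_i: number of documents that contain each term.
doc_freq = {}
for terms in docs.values():
    for k in terms:
        doc_freq[k] = doc_freq.get(k, 0) + 1

query = {"probabilistic", "retrieval"}
ranking = sorted(docs, key=lambda d: score(query, docs[d], doc_freq, N), reverse=True)
```

Here `d1` matches both query terms and ranks first, `d2` matches only "retrieval" and ranks second, and the remaining docs score zero.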

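Improving the Initial Ranking
- Let:
  - $V$: set of docs initially retrieved;
  - $V_i$: subset of the retrieved docs that contain $k_i$.
  (In the formulas below, $V$ and $V_i$ also denote the sizes of these sets.)
- Reevaluate the estimates:
  - $P(k_i \mid R) = \dfrac{V_i}{V}$
  - $P(k_i \mid \bar{R}) = \dfrac{n_i - V_i}{N - V}$
- Repeat recursively.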

Improving the Initial Ranking
- To avoid problems with small values such as $V = 1$ and $V_i = 0$:
  - $P(k_i \mid R) = \dfrac{V_i + 0.5}{V + 1}$
  - $P(k_i \mid \bar{R}) = \dfrac{n_i - V_i + 0.5}{N - V + 1}$
- Alternatively:
  - $P(k_i \mid R) = \dfrac{V_i + n_i/N}{V + 1}$
  - $P(k_i \mid \bar{R}) = \dfrac{n_i - V_i + n_i/N}{N - V + 1}$
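The adjusted estimates can be sketched in a few lines. The function names `reestimate` and `term_weight` and all numbers below are hypothetical, chosen only to show one feedback step; `adjust=0.5` gives the first pair of formulas and `adjust=n_i / N` the alternative pair.

```python
import math


def reestimate(V: int, V_i: int, n_i: int, N: int, adjust: float = 0.5):
    """Smoothed estimates after retrieving V docs, V_i of which contain the term k_i."""
    p_rel = (V_i + adjust) / (V + 1)            # P(k_i | R)
    p_non = (n_i - V_i + adjust) / (N - V + 1)  # P(k_i | not R)
    return p_rel, p_non


def term_weight(p_rel: float, p_non: float) -> float:
    """Contribution of one matching term to the ranking formula (sum of two log odds)."""
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_non) / p_non)


# Hypothetical collection statistics: 1000 docs, 50 of them contain k_i.
N, n_i = 1000, 50
# Hypothetical feedback: 10 docs retrieved, 8 of them contain k_i.
V, V_i = 10, 8

initial_weight = term_weight(0.5, n_i / N)                       # initial guess
p_rel, p_non = reestimate(V, V_i, n_i, N)                        # 0.5 adjustment
updated_weight = term_weight(p_rel, p_non)
p_rel_alt, p_non_alt = reestimate(V, V_i, n_i, N, adjust=n_i / N)  # n_i/N adjustment
```

In this example the term occurs in most of the retrieved docs, so its weight rises after the update. Without the adjustment, $V_i = 0$ would give $P(k_i \mid R) = 0$ and an undefined logarithm; with it, `reestimate(1, 0, n_i, N)` yields 0.25.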

Pluses and Minuses
- Advantages:
  - Docs are ranked in decreasing order of probability of relevance.
- Disadvantages:
  - Need to guess initial estimates for $P(k_i \mid R)$.
  - The method does not take into account tf and idf factors.

Brief Comparison of Classic Models
- The Boolean model does not provide for partial matches and is considered to be the weakest classic model.
- Salton and Buckley did a series of experiments indicating that, in general, the vector model outperforms the probabilistic model with general collections.
- This also seems to be the view of the research community.

Alternative Models
- Models based on fuzzy sets;
- Extensions of the Boolean model: continuous weights belonging to the [0, 1] interval;
- Models based on latent semantic analysis (LSA);
- Models based on neural networks;
- Models based on Bayesian networks;
- Models for structured documents;
- Models for browsing;
- …