IR Models: Review Vector Model and Probabilistic.

IR Model Taxonomy
User task:
–Retrieval: ad hoc, filtering
–Browsing
Classic Models: Boolean, Vector, Probabilistic
Set Theoretic: Fuzzy, Extended Boolean
Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
Probabilistic: Inference Network, Belief Network
Structured Models: Non-Overlapping Lists, Proximal Nodes
Browsing: Flat, Structure Guided, Hypertext

Classic IR Models - Basic Concepts
Each document is represented by a set of representative keywords or index terms
An index term is a document word useful for capturing the document's main themes
The importance of the index terms is represented by weights associated with them
Let
–k_i be an index term
–d_j be a document
–w_ij be a weight associated with the pair (k_i, d_j)
The weight w_ij quantifies the importance of the index term for describing the document's contents
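The weighted index-term representation above can be sketched as a simple mapping; the document names, terms, and weight values below are hypothetical, chosen only to illustrate the w_ij idea:

```python
# Each document d_j is represented by weights w_ij over its index terms k_i.
# Document names, terms, and weight values here are hypothetical.
doc_weights = {
    "d1": {"retrieval": 0.8, "vector": 0.5},
    "d2": {"retrieval": 0.3, "probabilistic": 0.9},
}

def weight(term: str, doc: str) -> float:
    """w_ij: importance of index term k_i for document d_j (0 if absent)."""
    return doc_weights.get(doc, {}).get(term, 0.0)
```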

Vector Model Similarity
sim(q, d_j) = cos(θ) = (d_j · q) / (|d_j| × |q|) = Σ_i (w_ij × w_iq) / (|d_j| × |q|)
TF-IDF term-weighting scheme
–w_ij = [freq(i,j) / max_l freq(l,j)] × log(N / n_i)
Default query term weights
–w_iq = (0.5 + 0.5 × freq(i,q) / max_l freq(l,q)) × log(N / n_i)
where N is the total number of documents and n_i is the number of documents that contain k_i
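The cosine similarity and TF-IDF weighting above can be sketched in a few lines; this is a minimal illustration that assumes documents and queries are already reduced to aligned weight vectors:

```python
import math

def tf_idf(freq, max_freq, N, n_i):
    """w_ij = (freq(i,j) / max_l freq(l,j)) * log(N / n_i), per the slide."""
    if freq == 0 or n_i == 0:
        return 0.0
    return (freq / max_freq) * math.log(N / n_i)

def cosine_sim(d, q):
    """sim(q, d_j) = sum_i(w_ij * w_iq) / (|d_j| * |q|)."""
    num = sum(wd * wq for wd, wq in zip(d, q))
    den = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return num / den if den else 0.0
```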

Example 1: no weights

      k1  k2  k3 | q · d_j
 d1    1   0   1 |   2
 d2    1   0   0 |   1
 d3    0   1   1 |   2
 d4    1   0   0 |   1
 d5    1   1   1 |   3
 d6    1   1   0 |   2
 d7    0   1   0 |   1
 q     1   1   1 |

Example 2: query weights

      k1  k2  k3 | q · d_j
 d1    1   0   1 |   4
 d2    1   0   0 |   1
 d3    0   1   1 |   5
 d4    1   0   0 |   1
 d5    1   1   1 |   6
 d6    1   1   0 |   3
 d7    0   1   0 |   2
 q     1   2   3 |

Example 3: query and document weights

      k1  k2  k3 | q · d_j
 d1    2   0   1 |   5
 d2    1   0   0 |   1
 d3    …   …   … |   …
 d4    2   0   0 |   2
 d5    …   …   … |   …
 d6    1   2   0 |   5
 d7    …   …   … |   …
 q     1   2   3 |
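The scores in Examples 1 and 2 can be reproduced in a few lines; here sim is just the dot product q · d_j, with no length normalization, matching the example tables:

```python
# Binary document weights over (k1, k2, k3) from Examples 1 and 2.
docs = {
    "d1": [1, 0, 1], "d2": [1, 0, 0], "d3": [0, 1, 1], "d4": [1, 0, 0],
    "d5": [1, 1, 1], "d6": [1, 1, 0], "d7": [0, 1, 0],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q1 = [1, 1, 1]  # Example 1: unweighted query
q2 = [1, 2, 3]  # Example 2: weighted query
print([dot(d, q1) for d in docs.values()])  # [2, 1, 2, 1, 3, 2, 1]
print([dot(d, q2) for d in docs.values()])  # [4, 1, 5, 1, 6, 3, 2]
```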

Summary of Vector Space Model
Advantages:
–term weighting improves the quality of the answer set
–partial matching allows retrieval of documents that approximate the query conditions
–the cosine ranking formula sorts documents according to their degree of similarity to the query
Disadvantages:
–assumes index terms are mutually independent, though it is not clear that this is a bad assumption in practice

Probabilistic Model
Objective: capture the IR problem in a probabilistic framework
Given a user query, there is an ideal answer set
Querying is a specification of the properties of this ideal answer set (clustering)
–But what are these properties?
Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set)
Improve the description by iteration

Probabilistic Model
An initial set of documents is retrieved somehow
The user inspects these documents looking for the relevant ones (in practice, only the top-ranked ones need to be inspected)
The IR system uses this information to refine its description of the ideal answer set
By repeating this process, the description of the ideal answer set is expected to improve
The description of the ideal answer set is modeled in probabilistic terms

Probabilistic Ranking Principle
Given a user query q and a document d_j, the probabilistic model tries to estimate the probability that the user will find document d_j relevant
The model assumes that this probability of relevance depends only on the query and document representations
The ideal answer set is referred to as R and should maximize the probability of relevance
Documents in the set R are predicted to be relevant
But how do we compute these probabilities?

The Ranking
Probabilistic ranking is computed as:
–sim(q, d_j) = P(d_j relevant to q) / P(d_j not relevant to q)
–This is the odds of document d_j being relevant
–Ranking by the odds minimizes the probability of an erroneous judgement
Definitions:
–w_ij ∈ {0, 1}, w_iq ∈ {0, 1} (binary index-term weights)
–P(R | d_j): probability that the given document is relevant
–P(¬R | d_j): probability that the document is not relevant

The Ranking
sim(d_j, q) = P(R | d_j) / P(¬R | d_j)
            = [P(d_j | R) × P(R)] / [P(d_j | ¬R) × P(¬R)]   (Bayes' rule)
            ~ P(d_j | R) / P(d_j | ¬R)
P(d_j | R): probability of randomly selecting document d_j from the set R of relevant documents

The Ranking
sim(d_j, q) ~ P(d_j | R) / P(d_j | ¬R)
            ~ [Π_{k_i ∈ d_j} P(k_i | R) × Π_{k_i ∉ d_j} P(¬k_i | R)] /
              [Π_{k_i ∈ d_j} P(k_i | ¬R) × Π_{k_i ∉ d_j} P(¬k_i | ¬R)]
P(k_i | R): probability that index term k_i is present in a document randomly selected from the set R of relevant documents
This step assumes the independence of index terms.

The Ranking
sim(d_j, q) ~ [Π_{k_i ∈ d_j} P(k_i | R) × Π_{k_i ∉ d_j} P(¬k_i | R)] /
              [Π_{k_i ∈ d_j} P(k_i | ¬R) × Π_{k_i ∉ d_j} P(¬k_i | ¬R)]
math happens (take logs and drop factors that are constant for a given query)...
~ Σ_i w_iq × w_ij × ( log [P(k_i | R) / P(¬k_i | R)] + log [P(¬k_i | ¬R) / P(k_i | ¬R)] )
where
P(¬k_i | R) = 1 − P(k_i | R)
P(¬k_i | ¬R) = 1 − P(k_i | ¬R)
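With binary weights, the log-odds sum above reduces to a sum over the terms shared by the query and the document. A minimal sketch, assuming the per-term probabilities are already given:

```python
import math

def prob_score(doc_terms, query_terms, p_rel, p_nonrel):
    """sim(d_j, q) ~ sum over shared terms of
       log[P(k_i|R) / P(~k_i|R)] + log[P(~k_i|~R) / P(k_i|~R)],
       with w_ij, w_iq restricted to {0, 1} as on the slide."""
    score = 0.0
    for term in query_terms & doc_terms:
        p = p_rel[term]      # P(k_i | R)
        u = p_nonrel[term]   # P(k_i | ~R)
        score += math.log(p / (1 - p)) + math.log((1 - u) / u)
    return score
```

Terms that are frequent in relevant documents but rare elsewhere (p high, u low) contribute the largest positive weights.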

The Initial Ranking
sim(d_j, q) ~ Σ_i w_iq × w_ij × ( log [P(k_i | R) / P(¬k_i | R)] + log [P(¬k_i | ¬R) / P(k_i | ¬R)] )
How do we get the probabilities P(k_i | R) and P(k_i | ¬R)?
Estimates based on assumptions:
–P(k_i | R) = 0.5
–P(k_i | ¬R) = n_i / N, where n_i is the number of documents that contain k_i
–Use this initial guess to retrieve an initial ranking
–Improve upon this initial ranking

Improving the Initial Ranking
sim(d_j, q) ~ Σ_i w_iq × w_ij × ( log [P(k_i | R) / P(¬k_i | R)] + log [P(¬k_i | ¬R) / P(k_i | ¬R)] )
Let
–V: set of documents initially retrieved (these are treated as relevant even if we are not sure)
–V_i: subset of the retrieved documents that contain k_i
Reevaluate the estimates:
–P(k_i | R) = V_i / V
–P(k_i | ¬R) = (n_i − V_i) / (N − V)
Repeat recursively

Improving the Initial Ranking
sim(d_j, q) ~ Σ_i w_iq × w_ij × ( log [P(k_i | R) / P(¬k_i | R)] + log [P(¬k_i | ¬R) / P(k_i | ¬R)] )
We need to avoid problems with small retrieved sets (e.g., when V = 1 and V_i = 0), so a small adjustment factor is added:
–P(k_i | R) = (V_i + 0.5) / (V + 1)
–P(k_i | ¬R) = (n_i − V_i + 0.5) / (N − V + 1)
But we can use the term's frequency in the corpus as the adjustment instead:
–P(k_i | R) = (V_i + n_i/N) / (V + 1)
–P(k_i | ¬R) = (n_i − V_i + n_i/N) / (N − V + 1)
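The re-estimation step with the adjustment factor can be sketched as follows; V and V_i are counts here, and the 0.5 constant is an assumption following the standard smoothing adjustment, with n_i/N as the corpus-frequency alternative:

```python
def reestimate(V, V_i, N, n_i, use_corpus_freq=False):
    """Return smoothed (P(k_i|R), P(k_i|~R)) after inspecting V retrieved
       documents, V_i of which contain k_i; n_i of the N corpus documents
       contain k_i. The adjustment avoids zero or undefined probabilities
       when V is tiny (e.g. V = 1, V_i = 0). The 0.5 constant is the
       standard choice assumed here, not stated explicitly in the slide."""
    adj = n_i / N if use_corpus_freq else 0.5
    p_rel = (V_i + adj) / (V + 1)
    p_nonrel = (n_i - V_i + adj) / (N - V + 1)
    return p_rel, p_nonrel
```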

Probabilistic Model
Advantages:
–documents are ranked in decreasing order of their probability of relevance
Disadvantages:
–need to guess the initial estimates for P(k_i | R)
–the method does not take tf and idf factors into account
New use for the model:
–meta search: use probabilities to model the value of different search engines for different topics

Brief Comparison of Classic Models
The Boolean model does not provide for partial matches and is considered the weakest classic model
Salton and Buckley carried out a series of experiments indicating that, in general, the vector model outperforms the probabilistic model on general collections
This also seems to be the view of the research community