Modeling Modern Information Retrieval


Modeling
Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto, Addison-Wesley, 1999 (Chapter 2)

Introduction
- Ranking algorithms: the central problem regarding IR systems is predicting which documents are relevant and which are not.
- Taxonomy of IR models
  - Boolean: set theoretic
  - Vector: algebraic
  - Probabilistic

Retrieval
- Ad hoc: the documents in the collection remain relatively static while new queries are submitted to the system.
- Filtering (Routing): the queries remain relatively static while new documents come into the system; this requires construction of a user profile.

Basic Concepts
- In the classic models, each document is described by a set of representative keywords called index terms.
- Index terms are mainly nouns.
- Distinct index terms have varying relevance.
- Index term weights are usually assumed to be mutually independent.

Boolean Model
- Binary decision criterion
- Data retrieval model
- A query is a Boolean expression, which can be represented as a disjunction of conjunctive vectors.
- Advantages: clean formalism, simplicity.
- Disadvantage: exact matching may lead to retrieval of too few or too many documents.
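The matching rule of the Boolean model can be sketched in a few lines. This is a minimal illustration, not an implementation from the slides: the function names, the toy collection, and the encoding of each conjunctive vector as a dict from term to required presence are all assumptions. The query used is ka AND (kb OR NOT kc), written out as a disjunction of three conjunctive vectors.

```python
def matches(doc_terms, dnf_query):
    """Return True if the document satisfies at least one conjunctive vector."""
    for conj in dnf_query:
        # Every term must be present (True) or absent (False) exactly as required.
        if all((term in doc_terms) == required for term, required in conj.items()):
            return True
    return False


def boolean_retrieve(docs, dnf_query):
    """Return the ids of all matching documents (binary decision, no ranking)."""
    return sorted(doc_id for doc_id, terms in docs.items() if matches(terms, dnf_query))


# Hypothetical toy collection: document id -> set of index terms.
docs = {
    "d1": {"ka", "kb"},
    "d2": {"ka", "kc"},
    "d3": {"kb", "kc"},
    "d4": {"ka"},
}

# ka AND (kb OR NOT kc) as a disjunction of conjunctive vectors over (ka, kb, kc):
# (1,1,1) OR (1,1,0) OR (1,0,0)
query = [
    {"ka": True, "kb": True, "kc": True},
    {"ka": True, "kb": True, "kc": False},
    {"ka": True, "kb": False, "kc": False},
]

result = boolean_retrieve(docs, query)
print(result)
```

Because the decision is binary, d2 is rejected outright even though it contains ka; there is no notion of a partial match.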

Vector Model (1/4)
- Index terms are assigned non-binary weights.
- Term weights are used to compute the degree of similarity between each document and the user query.
- Retrieved documents are then sorted in decreasing order of this similarity.
- Definition: in the vector model, the weight $w_{i,j}$ is associated with term $k_i$ and document $d_j$.

Vector Model (2/4) Degree of similarity
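The degree of similarity in the vector model is usually taken to be the cosine of the angle between the document vector and the query vector. A sketch of that formula, assuming the query weights are written $w_{i,q}$ and $t$ is the number of index terms (both symbols are assumptions, not taken from the slide text):

```latex
\[
sim(d_j, q)
  = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \, |\vec{q}|}
  = \frac{\sum_{i=1}^{t} w_{i,j} \, w_{i,q}}
         {\sqrt{\sum_{i=1}^{t} w_{i,j}^{2}} \; \sqrt{\sum_{i=1}^{t} w_{i,q}^{2}}}
\]
```

With non-negative weights the value lies between 0 and 1, so a document can match the query partially.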

Vector Model (3/4)
- Salton: IR vs. clustering
  - intra-cluster similarity: tf factor (term frequency)
  - inter-cluster dissimilarity: idf factor (inverse document frequency)
- Definitions
  - normalized frequency
  - inverse document frequency
  - term-weighting schemes
  - query-term weights
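The tf and idf factors can be combined into a term weight as in the sketch below. The exact formulas are an assumption here, since the slide text names the quantities without giving them: tf is taken as the raw frequency divided by the largest frequency in the same document, and idf as log(N / n_i), where N is the number of documents and n_i the number of documents containing the term. The function name and the toy collection are hypothetical.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Compute w[d][t] = (freq / max_freq_in_d) * log(N / n_t) for non-empty documents."""
    n_docs = len(docs)
    counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())

    weights = {}
    for doc_id, term_counts in counts.items():
        max_freq = max(term_counts.values())  # normalizes tf into (0, 1]
        weights[doc_id] = {
            term: (freq / max_freq) * math.log(n_docs / doc_freq[term])
            for term, freq in term_counts.items()
        }
    return weights


# Hypothetical toy collection: document id -> list of index terms.
docs = {
    "d1": ["ir", "ir", "model"],
    "d2": ["ir", "vector"],
    "d3": ["ir", "vector", "vector", "space"],
}
w = tf_idf_weights(docs)
print(w["d1"])
```

A term that occurs in every document, such as "ir" here, gets weight 0: it does nothing to tell documents apart, which is the inter-cluster dissimilarity idea behind idf.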

Vector Model (4/4)
- Advantages
  - Its term-weighting scheme improves retrieval performance.
  - Its partial matching strategy allows retrieval of documents that approximate the query conditions.
  - Its cosine ranking formula sorts the documents according to their degree of similarity to the query.
- Disadvantage
  - The assumption of mutual independence between index terms.
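Partial matching and cosine ranking can be illustrated together. The following is a minimal sketch under stated assumptions: vectors are stored as sparse dicts from term to weight, the weights are arbitrary illustrative values rather than computed tf-idf weights, and the names `cosine` and `rank` are hypothetical.

```python
import math


def cosine(doc_vec, query_vec):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * query_vec.get(term, 0.0) for term, weight in doc_vec.items())
    norm_d = math.sqrt(sum(weight * weight for weight in doc_vec.values()))
    norm_q = math.sqrt(sum(weight * weight for weight in query_vec.values()))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_d * norm_q)


def rank(doc_vecs, query_vec):
    """Return ids of documents with non-zero similarity, most similar first."""
    scored = [(cosine(vec, query_vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0.0]


# Hypothetical weights chosen for illustration only.
doc_vecs = {
    "d1": {"vector": 1.0, "model": 1.0},
    "d2": {"vector": 1.0},
    "d3": {"boolean": 1.0},
}
query_vec = {"vector": 1.0, "model": 1.0}

ranking = rank(doc_vecs, query_vec)
print(ranking)
```

Here d2 contains only one of the two query terms, yet it is still retrieved, ranked below d1. A Boolean conjunction of the same two terms would have rejected it.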

Probabilistic Model (1/7)
- Introduced by Robertson and Sparck Jones, 1976.
- Also called the binary independence retrieval (BIR) model.
- Idea: given a user query q and the ideal answer set of relevant documents, the problem is to specify the properties of this set.
- That is, the probabilistic model tries to estimate the probability that the user will find the document $d_j$ relevant, using the ratio P($d_j$ relevant to q) / P($d_j$ nonrelevant to q).

Probabilistic Model (2/7)
Definition
- All index term weights are binary, i.e., $w_{i,j} \in \{0,1\}$.
- Let $R$ be the set of documents known to be relevant to query q.
- Let $\bar{R}$ be the complement of $R$.
- Let $P(R \mid d_j)$ be the probability that the document $d_j$ is relevant to the query q.
- Let $P(\bar{R} \mid d_j)$ be the probability that the document $d_j$ is nonrelevant to the query q.

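Probabilistic Model (3/7)
- The similarity $sim(d_j, q)$ of the document $d_j$ to the query q is defined as the ratio
  $sim(d_j, q) = \frac{P(R \mid d_j)}{P(\bar{R} \mid d_j)}$
- Using Bayes' rule,
  $sim(d_j, q) = \frac{P(d_j \mid R) \, P(R)}{P(d_j \mid \bar{R}) \, P(\bar{R})}$
- $P(R)$ stands for the probability that a document randomly selected from the entire collection is relevant.
- $P(d_j \mid R)$ stands for the probability of randomly selecting the document $d_j$ from the set $R$ of relevant documents.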

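Probabilistic Model (4/7)
- Assuming independence of index terms and given q = ($d_1$, $d_2$, …, $d_t$), the probabilities $P(d_j \mid R)$ and $P(d_j \mid \bar{R})$ factor into products of per-term probabilities.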

Probabilistic Model (5/7)
- $\Pr(k_i \mid R)$ stands for the probability that the index term $k_i$ is present in a document randomly selected from the set $R$.
- $\Pr(\bar{k}_i \mid R)$ stands for the probability that the index term $k_i$ is not present in a document randomly selected from the set $R$.
- Let $\Pr(k_i \mid R) = p_i$.
- $d_i$ is either 0 or 1:
  - 0: $k_i$ is absent from q
  - 1: $k_i$ is present in q

Probabilistic Model (6/7)
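Under the independence assumption, the ranking ratio can be reduced to a sum of per-term weights. The following is a sketch of that standard derivation, not a reproduction of the slide. It assumes that $d_i \in \{0,1\}$ marks whether $k_i$ occurs in the document being scored, and it writes $q_i = \Pr(k_i \mid \bar{R})$; the symbol $q_i$ is an assumption, chosen to match the estimate $q_j = df_j / N$ given on the last slide.

```latex
\begin{align*}
P(d_j \mid R) &= \prod_{i=1}^{t} p_i^{\,d_i} (1 - p_i)^{1 - d_i},
\qquad
P(d_j \mid \bar{R}) = \prod_{i=1}^{t} q_i^{\,d_i} (1 - q_i)^{1 - d_i} \\[4pt]
\log \frac{P(d_j \mid R)}{P(d_j \mid \bar{R})}
  &= \sum_{i=1}^{t} \left[ d_i \log \frac{p_i}{q_i}
     + (1 - d_i) \log \frac{1 - p_i}{1 - q_i} \right] \\[4pt]
  &= \sum_{i=1}^{t} d_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}
     + \sum_{i=1}^{t} \log \frac{1 - p_i}{1 - q_i}
\end{align*}
```

The second sum and the prior ratio $P(R) / P(\bar{R})$ are the same for every document, so only the first sum affects the ranking.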

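Probabilistic Model (7/7)
- The retrieval value of each $k_i$ present in a document (i.e., $d_i = 1$) is its term relevance weight.
- Initial estimates, before any relevance information is available: $p_j = 0.5$, $q_j = df_j / N$, where $df_j$ is the number of documents containing the term and $N$ is the number of documents in the collection.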
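With these initial estimates the term relevance weight can be computed directly. The sketch below assumes the usual form of the weight, log(p(1 - q) / (q(1 - p))), which is not spelled out in the slide text, and takes q to be the probability that the term occurs in a nonrelevant document. The function names and the sample numbers are hypothetical.

```python
import math


def term_relevance_weight(p, q):
    """Assumed BIR term weight: log odds ratio of p and q, both strictly in (0, 1)."""
    return math.log((p * (1.0 - q)) / (q * (1.0 - p)))


def initial_weight(df, n_docs):
    """Weight under the initial estimates p = 0.5 and q = df / N (needs 0 < df < N)."""
    return term_relevance_weight(0.5, df / n_docs)


def bir_score(doc_terms, query_terms, doc_freq, n_docs):
    """Sum the weights of the query terms that are present in the document."""
    return sum(
        initial_weight(doc_freq[term], n_docs)
        for term in query_terms
        if term in doc_terms
    )


# Hypothetical numbers: a collection of 1000 documents.
doc_freq = {"retrieval": 10, "model": 500}
score = bir_score({"retrieval", "model"}, ["retrieval", "model"], doc_freq, 1000)
print(round(score, 3))
```

With p = 0.5 the weight reduces to log((N - df) / df), so a rare term contributes a large weight while a term found in half of the collection contributes nothing.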

Estimation of Term Relevance