Information Retrieval Models School of Informatics Dept. of Library and Information Studies Dr. Miguel E. Ruiz.


What is Information Retrieval? Information Retrieval deals with information items in terms of: Representation, Storage, Organization, and Access. An IR system should provide access to the information that the user is interested in.

Example of an Information Need: “Find all documents containing information on college tennis teams which: (1) are maintained by a university in the USA and (2) participate in the NCAA tennis tournament. To be relevant, the document must include information on the national ranking of the team in the last three years and the email or phone number of the team coach.”

The user must translate his/her information needs into a query. Most commonly a query is a set of keywords that summarizes the information needs.

Information Retrieval vs. Data Retrieval Data retrieval consists of determining which documents of the collection contain the keywords in the user query. Information retrieval should “interpret” the contents of the documents in the collection and retrieve all the documents that are relevant to the user query while retrieving as few nonrelevant documents as possible.

Basic Concepts Effective retrieval of relevant information is affected by: the user task the logical view of the documents

The User Task [Figure: the user works with the database through one of two tasks, Retrieval or Browsing.]

Logical View of a Document Documents can be represented as: A set of keywords or indexing terms Full text

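Logical View of the Document [Figure: a document (text + structure) is reduced step by step by text operations: structure recognition; accents, spacing, etc.; stopwords; noun groups; stemming; automatic or manual indexing. The resulting logical view ranges from full text plus structure, through full text, down to a set of index terms.]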

The Retrieval Process [Figure: components are the User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, the DB Manager Module, the Index (an inverted file), and the Text Database. Flow labels are: user's need, logical view of the text, query, retrieved docs, ranked docs, and user's feedback.]

IR Models An IR model is a quadruple $[D, Q, F, R(q_i, d_j)]$, where:
D: set of logical representations of the documents.
Q: set of logical representations of the queries.
F: framework for modeling document representations, queries, and their relationships.
$R(q_i, d_j)$: ranking function that defines an association between the query and the documents. This ranking defines an ordering among the documents with regard to the query.

IR Model Taxonomy
User Task: Retrieval (Ad hoc, Filtering); Browsing
Classic Models: Boolean, Vector Space, Probabilistic
Set Theoretic: Fuzzy, Extended Boolean
Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
Probabilistic: Inference Network, Belief Network
Structured Models: Non-Overlapping Lists, Proximal Nodes
Browsing: Flat, Structure Guided, Hypertext

Retrieval Models and Logical View of Documents
User task | Index Terms | Full Text | Full Text + Structure
Retrieval | Classic, Set Theoretic, Algebraic, Probabilistic | Classic, Set Theoretic, Algebraic, Probabilistic | Structured
Browsing | Flat | Flat, Hypertext | Structure Guided, Hypertext

IR Models Basic concepts: Each document is described as a set of representative keywords called index terms. An index term is a word (which can be in the document) that helps in remembering the document’s main themes. Index terms are used to index and summarize the document contents

IR Models Basic concepts (cont.) Index terms have varying relevance when used to describe the document contents. This effect is captured by assigning numerical weights to each index term in the document. A weight is a positive value associated with each index term in the document.

IR Models The Boolean Model is a simple retrieval model based on set theory and Boolean algebra. Documents are represented by the index terms assigned to them. Nothing indicates which terms are more important than others (weights are binary, either 0 or 1).

IR Models Boolean Model (cont.) The Boolean operators used are: Conjunction (AND, ∧), Disjunction (OR, ∨), and Negation (NOT, ¬). Queries are specified as conventional Boolean expressions, which can be represented as a disjunction of conjunctive vectors (disjunctive normal form - DNF).

IR Models Boolean model Example: Q = Safety ∧ (Car ∨ ¬Industry). Over the term order (Safety, Car, Industry), Q_dnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0).
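The example can be checked mechanically. The sketch below assumes the query reads Safety AND (Car OR NOT Industry), which is the reading consistent with the three DNF vectors on the slide, and that vector components are ordered (Safety, Car, Industry). The names `TERMS`, `query`, `dnf_vectors`, and `matches` are illustrative, not from the slides.

```python
from itertools import product

# Assumed component order of the binary term vectors.
TERMS = ("safety", "car", "industry")

def query(safety, car, industry):
    """Safety AND (Car OR NOT Industry), the assumed reading of the slide's Q."""
    return safety and (car or not industry)

def dnf_vectors():
    """Enumerate every binary term vector for which the query is true."""
    return [bits for bits in product((1, 0), repeat=len(TERMS))
            if query(*map(bool, bits))]

def matches(doc_terms):
    """Retrieve a document iff its binary vector is one of the DNF vectors."""
    vector = tuple(int(term in doc_terms) for term in TERMS)
    return vector in dnf_vectors()

print(dnf_vectors())
```

A document indexed with {safety, car} is retrieved, while one indexed with {safety, industry} is not. There is no middle ground, which is the exact-matching behaviour the model is known for.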

IR Models Boolean Model
Advantages: Clean formalism. Easy to implement.
Disadvantages: Exact matching may retrieve too few or too many documents. Expressing an information need as a Boolean expression might be challenging.

IR Models Vector Space Model Documents and queries are expressed as vectors with one component for each of the t possible index terms. Each index term has an associated weight that indicates the importance of the index term in the document (or query).

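IR Models In other words, the document $d_j$ and the query $q$ are represented as t-dimensional vectors: $\vec{d_j} = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$ and $\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$. [Figure: the vectors $\vec{d_j}$ and $\vec{q}$ separated by an angle $\theta$.]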

IR Models The vector space model proposes to evaluate the degree of similarity of document $d_j$ with regard to the query $q$ as the correlation between the two vectors $\vec{d_j}$ and $\vec{q}$.

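IR Models This correlation can be quantified in different ways, for example by the cosine of the angle between these two vectors: $sim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \, |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \, \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$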

IR Models Since $w_{i,j} \ge 0$ and $w_{i,q} \ge 0$, $sim(d_j, q)$ varies from 0 to +1. The vector space model assumes that the similarity value is an indication of the relevance of the document to the given query. Thus the vector space model ranks the retrieved documents by their similarity value.
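A minimal sketch of cosine ranking, assuming documents and the query are already given as equal-length lists of non-negative term weights. The function names `cosine_similarity` and `rank` and the sample vectors are illustrative only.

```python
import math

def cosine_similarity(doc, query):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(w_d * w_q for w_d, w_q in zip(doc, query))
    norm_doc = math.sqrt(sum(w * w for w in doc))
    norm_query = math.sqrt(sum(w * w for w in query))
    if norm_doc == 0 or norm_query == 0:
        return 0.0  # an all-zero vector shares no terms with anything
    return dot / (norm_doc * norm_query)

def rank(docs, query):
    """Sort (name, similarity) pairs from most to least similar."""
    scored = [(name, cosine_similarity(vec, query)) for name, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical weights over three index terms.
docs = {"d1": [1.0, 1.0, 0.0], "d2": [0.0, 0.0, 2.0], "d3": [3.0, 0.0, 0.0]}
print(rank(docs, [1.0, 0.0, 0.0]))
```

Because the score depends only on the angle, `d3` ranks first even though its weights are larger than the query's. A document that matches only some of the query terms still receives a non-zero score, which is the partial matching the Boolean model lacks.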

IR Models How can we compute the values of the weights $w_{i,j}$? One of the most popular methods is based on combining two factors: the importance of each index term in the document, and the importance of the index term in the collection.

IR Models Importance of the index term in the document: This can be measured by the number of times that the term appears in the document. The more often a term is mentioned in the document, the better it describes the document's content. This is called the term frequency, denoted by the symbol tf.

IR Models The importance of the index term in the collection: An index term that appears in every document in the collection is not very useful, but a term that occurs in only a few documents may indicate that these few documents could be relevant to a query that uses this term.

IR Models In other words, the importance of an index term in the collection is quantified by the inverse of the frequency of this term among the documents in the collection. This factor is usually called the inverse document frequency, or the idf factor.

IR Models Mathematically this can be expressed as: $idf_i = \log \frac{N}{n_i}$ Where: N = number of documents in the collection, and $n_i$ = number of documents that contain the term i.

IR Models Combining these two factors we can obtain the weight of an index term i in document j as: $w_{i,j} = tf_{i,j} \times idf_i = tf_{i,j} \times \log \frac{N}{n_i}$ This is also called the tf-idf weighting scheme.
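A short sketch of the weighting scheme, assuming the common form in which the weight is the raw term count multiplied by log(N / n_i). The natural logarithm is an arbitrary choice here, since a different base only rescales all weights. The function name `tf_idf` and the toy collection are illustrative.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists. tf is the raw count of the term in the
    document; idf is log(N / n_i), with N documents and n_i containing the term.
    """
    n_docs = len(docs)
    # n_i: count each term once per document that contains it.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        term_freq = Counter(doc)
        weights.append({term: count * math.log(n_docs / doc_freq[term])
                        for term, count in term_freq.items()})
    return weights

# Toy collection: "car" occurs in every document.
collection = [["car", "safety", "car"], ["car", "industry"], ["car"]]
for doc_weights in tf_idf(collection):
    print(doc_weights)
```

The term "car" gets weight 0 in every document because log(3/3) = 0, which matches the observation that a term appearing in every document is not useful for discrimination.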

IR Models Vector Space Model
Advantages: Its term weighting scheme can improve retrieval performance. Allows partial matching. Retrieved documents are sorted according to their degree of similarity.
Disadvantages: Terms are assumed to be mutually independent. In some cases this might hurt performance.

IR Models Probabilistic model Assumption: given a query q and a document d j in the collection, this model tries to estimate the probability that the user will find the document d j relevant. The model assumes that there exists an ideal subset R of the document collection that contains only relevant documents.

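IR Models Similarity in the probabilistic model is computed as: $sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \, w_{i,j} \left( \log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right)$ Where: $P(k_i|R)$ is the probability that index term $k_i$ is present in a document randomly selected from the set R, and $P(k_i|\bar{R})$ is the corresponding probability for the complement of R.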

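Estimation of $P(k_i|R)$ and $P(k_i|\bar{R})$: Initial constant values: $P(k_i|R) = 0.5$ and $P(k_i|\bar{R}) = n_i / N$. Iterative process to improve estimates: $P(k_i|R) = |V_i| / |V|$ and $P(k_i|\bar{R}) = (n_i - |V_i|) / (N - |V|)$, where V is the set of documents retrieved and $V_i$ is the set of documents in V that contain $k_i$.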
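The estimation loop can be sketched as follows. This assumes the standard binary independence form: starting values of 0.5 and n_i / N, then re-estimation from the retrieved set V. The 0.5 and 1 smoothing terms in `refined_estimates` are an added assumption, commonly used to keep the logarithms defined when |V_i| is 0 or equals |V|; the slide does not show them. All function names are illustrative.

```python
import math

def term_weight(p_rel, p_nonrel):
    """Contribution of one matching term: log-odds in R plus inverse log-odds in not-R."""
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)

def initial_estimates(n_i, n_docs):
    """Starting guess; assumes 0 < n_i < n_docs so the logarithms are defined."""
    return 0.5, n_i / n_docs

def refined_estimates(v_i, v, n_i, n_docs):
    """Re-estimate from the retrieved set V (sizes v and v_i), with smoothing."""
    p_rel = (v_i + 0.5) / (v + 1)
    p_nonrel = (n_i - v_i + 0.5) / (n_docs - v + 1)
    return p_rel, p_nonrel

def score(doc_terms, query_terms, estimates):
    """Sum term weights over query terms present in the document (binary weights)."""
    return sum(term_weight(*estimates[term])
               for term in query_terms if term in doc_terms)

# Hypothetical term: in 5 of 20 documents, and in 3 of the top 4 retrieved.
print(term_weight(*initial_estimates(5, 20)))
print(term_weight(*refined_estimates(3, 4, 5, 20)))
```

In this made-up case the term's weight rises after one iteration, because it is concentrated in the retrieved set that stands in for R.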

IR Models Probabilistic Model
Advantages: Retrieved documents are ranked according to their probability of being relevant.
Disadvantages: The need to guess the initial separation of docs into R and non-R. The method does not take into account the frequency of index terms. Independence assumption.


Retrieval Evaluation [Figure: within the collection, the set of relevant docs (size |R|) and the answer set (size |A|) overlap in the relevant docs in the answer set (size |Ra|).]
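The three set sizes on this slide are the ones conventionally used to define precision (|Ra| / |A|) and recall (|Ra| / |R|). The transcript does not show those formulas, so the sketch below rests on that standard definition. The function name and the document ids are illustrative.

```python
def precision_recall(relevant, answer):
    """Return (precision, recall) for a set of relevant ids and an answer set."""
    relevant_in_answer = relevant & answer          # Ra
    precision = len(relevant_in_answer) / len(answer) if answer else 0.0
    recall = len(relevant_in_answer) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ids: 4 relevant documents, 3 retrieved, 2 in common.
print(precision_recall({1, 2, 3, 4}, {3, 4, 5}))
```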

Retrieval Evaluation Pooling: the top K documents from each run (System 1, System 2, System 3, …) are merged into a pool of combined results, which is then passed to user evaluation.
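A minimal sketch of building such a pool, assuming each run is a ranked list of document ids and that duplicates across runs are judged only once. The name `build_pool` and the sample runs are illustrative.

```python
def build_pool(runs, k):
    """Merge the top-k documents of every run, keeping first-seen order."""
    pool = []
    seen = set()
    for run in runs:
        for doc_id in run[:k]:
            if doc_id not in seen:      # judge each document only once
                seen.add(doc_id)
                pool.append(doc_id)
    return pool

# Three hypothetical system runs.
runs = [["d1", "d2", "d3"], ["d2", "d4", "d1"], ["d5", "d1", "d6"]]
print(build_pool(runs, k=2))
```

Documents outside every system's top K are never judged, so they are usually treated as non-relevant. That is the main trade-off of pooling.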

Retrieval Evaluation Relevance: Strength: it involves people (users) as judges of effectiveness of performance. Weakness: It involves judgments by people, with all the associated problems of subjectivity and variability.

Retrieval Evaluation Types of relevance (Saracevic, 1999):
System or algorithmic relevance: relation between the query and the information objects retrieved.
Topical or subject relevance: relation between the topic expressed in the query and the topic covered in the retrieved documents.
Cognitive relevance or pertinence: relation between the state of knowledge and cognitive information need of a user and the documents retrieved.
Situational relevance or utility: relation between the task or problem at hand and the documents retrieved.
Motivational or affective relevance: relation between the intents, goals, and motivations of a user and the documents retrieved.

Retrieval Evaluation T. Joachims. Evaluating retrieval performance using clickthrough data. In Text Mining: Theoretical Aspects and Applications. Physica-Verlag, 79-96. Available online: blications/joachims_02b.pdf

Clickthrough for IR Performance Main idea: Use a unified interface to submit queries to two search engines. Estimate relevance information from the links that the user visits among the results returned by a search engine.

Clickthrough for IR Performance Regular clickthrough data: The user types a query into a unified interface. The query is sent to both search engines A and B. One list of results is selected at random and presented to the user. The ranks of the links clicked by the user are recorded.

Clickthrough for IR Performance Unbiased clickthrough data for comparing retrieval functions: Blind test: the interface should hide the random variables underlying the hypothesis test to avoid biasing the user response (placebo effect). Click ⇒ Preference: design the interface so that interaction with the system demonstrates a particular judgement by the user. Low usability impact.

Clickthrough for IR Performance
Input: ranking A = (a_1, a_2, …), ranking B = (b_1, b_2, …)
Call: combine(A, B, 0, 0, ∅)
Output: combined ranking D
combine(A, B, ka, kb, D) {
    if (ka = kb) {
        if (A[ka+1] ∉ D) { D := D + A[ka+1]; }
        combine(A, B, ka+1, kb, D);
    } else {
        if (B[kb+1] ∉ D) { D := D + B[kb+1]; }
        combine(A, B, ka, kb+1, D);
    }
}
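An iterative Python sketch of the same interleaving. The slide's pseudocode has no explicit stopping condition, so this version assumes that combination stops as soon as the ranking whose turn it is has been exhausted. Lists are 0-indexed here, so the pseudocode's `A[ka+1]` becomes `a[ka]`.

```python
def combine(a, b):
    """Interleave rankings a and b into one list without duplicates."""
    combined = []
    ka = kb = 0                      # how many entries of a and b were consumed
    while True:
        if ka == kb:                 # a's turn
            if ka >= len(a):
                break                # assumed stopping rule: nothing left in a
            if a[ka] not in combined:
                combined.append(a[ka])
            ka += 1
        else:                        # b's turn
            if kb >= len(b):
                break                # assumed stopping rule: nothing left in b
            if b[kb] not in combined:
                combined.append(b[kb])
            kb += 1
    return combined

# Hypothetical rankings from two engines.
print(combine(["x", "y", "z"], ["y", "x", "w"]))
```

Because the two counters never differ by more than one, the top of the combined list always contains the top links of both rankings in nearly equal numbers. Clicks on the combined list can therefore be read as a preference between the two engines.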

Clickthrough for IR Performance Hypothesis test: Use a binomial sign test (i.e., McNemar’s test) to detect significant deviations from the median
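A two-sided binomial sign test can be sketched with the standard library alone. The sketch assumes each query has already been reduced to a single outcome, either "the user clicked more links from A" or "more from B", with ties discarded. That reduction and the function name are assumptions for illustration.

```python
from math import comb

def sign_test_p_value(wins_a, wins_b):
    """Two-sided p-value under H0: each non-tied query favours A or B with probability 0.5."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0                   # no evidence either way
    k = min(wins_a, wins_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)        # double for two-sided, cap at 1

# Hypothetical outcome: A preferred on 8 queries, B on 2.
print(sign_test_p_value(8, 2))
```

With only ten non-tied queries, an 8 to 2 split gives a p-value of about 0.11, so it would not be significant at the usual 0.05 level.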

Clickthrough for IR Performance Experimental design: an interface that sends each query to two systems. The pairings are: Google and MSNsearch; Google and Default (50 links from MSNsearch in reverse order); MSNsearch and Default.

Clickthrough for IR Performance Does the clickthrough evaluation agree with the relevance judgements? Their conclusion from the small experiment is that there is a strong correlation between relevance judgements and clickthrough data. Can this conclusion be generalized?