Ranking in IR and WWW
Modern Information Retrieval: A Brief Overview - Amit Singhal
Presented by Parin Sangoi
February 15, 2005
Outline
- Introduction
- History
- Models and Implementation
  - Vector Space Model
  - Probabilistic Models
  - Inference Network Model
  - Implementation
- Evaluation
Outline (contd.)
- Key Techniques
  - Term Weighting
  - Query Modification
- Other Techniques and Applications
- References
Introduction
What is Information Retrieval?
"An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request." - F.W. Lancaster
Introduction (contd.)
[Figure: a typical IR system]
History
- Vannevar Bush (1945): "As We May Think" gave birth to the idea of automatic access to stored knowledge.
- H.P. Luhn (1957): proposed indexing units for documents and measuring word overlap as a criterion for retrieval.
- Gerard Salton and his students: the SMART system, aimed at improving search quality.
- Cyril Cleverdon: the Cranfield evaluation methodology, which is still in use in IR evaluation.
History (contd.)
- 1970s and 1980s: many developments built on the advances of the 1960s. Many models were developed and shown to be effective on small text collections.
- 1992: the Text Retrieval Conference (TREC) began, aiming to encourage IR research on large text collections. Old techniques were modified and new techniques developed.
Models and Implementation
Boolean systems
- Queries are built from ANDs, ORs, and NOTs.
- No ranking, and it is difficult for a user to form a good search request.
- Many users still use Boolean systems because they feel more in control of the retrieval process.
Models and Implementation (contd.)
Vector Space Model
- Text is represented by a vector of terms; typically words are chosen as terms.
- A component of the text vector gets a non-zero value if the corresponding term occurs in the text.
- Text vectors are very sparse, as a vocabulary can contain millions of terms.
- The similarity between the query vector and a document vector is used to assign a numeric score to each document for a query.
Vector Space Model (contd.)
- The angle between two vectors is used as a measure of their divergence, and the cosine of the angle is used as the numeric similarity (the cosine is 1 for identical vectors and 0 for orthogonal vectors).
- Alternatively, the dot product between the two vectors can be used to measure similarity. If all vectors are of unit length, the cosine of the angle equals the dot product.
- If Vq is the query vector and Vd is the document vector, the similarity of document D to query Q is:
  Sim(Vd, Vq) = ∑i Wti(Vq) · Wti(Vd)
  where Wti(Vq) is the ith component of the query vector Vq and Wti(Vd) is the ith component of the document vector Vd.
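The similarity measure above can be sketched in a few lines of Python. This is a minimal illustration, not from the slides: sparse term vectors are represented as {term: weight} dicts, and the function names are made up for this example.

```python
import math

def cosine_similarity(query_vec, doc_vec):
    """Cosine of the angle between two sparse term-weight vectors,
    represented as {term: weight} dicts."""
    # Dot product runs only over terms shared by both vectors
    dot = sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    norm_d = math.sqrt(sum(w * w for w in doc_vec.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    # Dividing by the norms is equivalent to normalizing to unit length
    return dot / (norm_q * norm_d)

# Vectors pointing in the same direction score 1.0;
# vectors with no shared terms score 0.0
q = {"information": 1.0, "retrieval": 1.0}
d = {"information": 2.0, "retrieval": 2.0}
```

Note that when the vectors are pre-normalized to unit length, the plain dot product gives the same ranking, which is why the slide treats the two interchangeably.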
Probabilistic Models
- One of the main principles of an IR system is that its output should be ranked. The probabilistic model ranks documents by decreasing probability of their relevance to a query.
- Let P(R|D) be the probability that document D is relevant, and P(R̄|D) the probability that it is non-relevant. Since the ranking criterion is monotonic under the log-odds transformation, we can rank documents by:
  log( P(R|D) / P(R̄|D) )
- Applying Bayes' theorem to this ratio gives:
  log( (P(D|R) · P(R)) / (P(D|R̄) · P(R̄)) )
Probabilistic Model (contd.)
- P(R) is independent of the document under consideration, so P(R) and P(R̄) are constant scaling factors and can be eliminated. The formula simplifies to:
  log( P(D|R) / P(D|R̄) )
- Independence assumption: terms occur in documents independently of one another.
- If pi denotes P(ti|R) and qi denotes P(ti|R̄), the log formula reduces (up to a document-independent constant) to:
  ∑ti∈D log( pi(1 − qi) / (qi(1 − pi)) )
Probabilistic Model (contd.)
- Croft and Harper assume that pi is the same for every term, so pi/(1 − pi) is a constant and can be ignored. They also treat nearly all documents in a collection as non-relevant to a query (since collections are very large) and estimate qi by ni/N, where N is the collection size and ni is the number of documents containing term i.
- This yields the scoring function:
  ∑ti∈D log( (N − ni) / ni )
  which behaves like the IDF function.
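The Croft-Harper scoring function can be computed directly from document-frequency statistics. A minimal sketch (the function and variable names are invented for this example):

```python
import math

def croft_harper_score(query_terms, doc_terms, doc_freq, N):
    """Sum log((N - n_i) / n_i) over query terms present in the
    document, where n_i is the number of documents containing
    term i and N is the collection size."""
    score = 0.0
    for t in query_terms:
        n_i = doc_freq.get(t, 0)
        if t in doc_terms and 0 < n_i < N:
            # Rare terms (small n_i) contribute large, IDF-like weights
            score += math.log((N - n_i) / n_i)
    return score
```

A term appearing in 10 of 1000 documents contributes log(990/10) ≈ 4.6, while a term appearing in 500 contributes log(500/500) = 0, matching the intuition that common terms carry little evidence.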
Inference Network Model
- In the simplest implementation, a document instantiates a term with a certain strength, and given a query the credit from multiple terms is accumulated to compute the equivalent of a numeric score for the document.
- If the strength of instantiation is taken to be the weight of the term in the document, the ranking is similar to that of the Vector Space Model or the Probabilistic Model.
- Any formula can be used to define the strength of instantiation, which makes the model very general.
Implementation
- Inverted list data structure: fast access to the list of documents that contain a term, along with additional information (weight, relative position, etc.) — the inverted index.
- Stop words are ignored: the, in, of, a, ...
- Stemming conflates word forms: retrieval, retrieve, retrieved, retrieving, retriever, ...
  - Poor stemming returns wrong documents. Is it good enough?
- Multi-word phrases
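The inverted index described above can be sketched as follows. This is an illustrative toy (names invented for the example); real systems would drop stop words and stem terms before indexing, which is omitted here for brevity.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a postings list of (doc_id, positions).
    docs: {doc_id: text}. Positions support phrase queries."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        positions = defaultdict(list)
        for pos, term in enumerate(text.lower().split()):
            positions[term].append(pos)
        # One posting per (document, term) pair, with all positions
        for term, pos_list in positions.items():
            index[term].append((doc_id, pos_list))
    return index

docs = {1: "information retrieval systems", 2: "retrieval of information"}
index = build_inverted_index(docs)
# index["retrieval"] lists every document containing "retrieval",
# giving the fast term-to-documents access the slide describes
```

Keeping positions in the postings is what makes multi-word phrase matching possible at query time.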
Evaluation
Measurable quantities:
- The coverage of the collection: the extent to which the system includes relevant matter.
- The time lag: the average interval between the time the search request is made and the time an answer is given.
- The form of presentation of the output.
- The effort involved on the part of the user in obtaining answers to his search requests.
- The recall of the system: the proportion of relevant material actually retrieved in answer to a search request.
- The precision of the system: the proportion of retrieved material that is actually relevant.
Of these, the last two measure the effectiveness of the retrieval system.
Precision and Recall
Contingency table:

                   Relevant                  Non-relevant
  Retrieved        relevant retrieved        non-relevant retrieved
  Not retrieved    relevant not retrieved    non-relevant not retrieved
Precision and Recall (contd.)
- Recall: the proportion of relevant documents that are retrieved by the system.
- Precision: the proportion of retrieved documents that are relevant.
- Fallout: the proportion of non-relevant documents that are retrieved by the system.
- A good IR system should have high recall (retrieve as many relevant documents as possible) and high precision (retrieve very few non-relevant documents).
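The three measures follow directly from the contingency-table cells. A minimal sketch, with invented names, treating result sets as Python sets:

```python
def effectiveness(retrieved, relevant, collection_size):
    """Recall, precision, and fallout for one query.
    retrieved / relevant are collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant AND retrieved
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    non_relevant = collection_size - len(relevant)
    fallout = (len(retrieved - relevant) / non_relevant
               if non_relevant else 0.0)
    return recall, precision, fallout

# 4 retrieved, 3 relevant in a 10-document collection
r, p, f = effectiveness([1, 2, 3, 4], [1, 2, 5], 10)
```

Here 2 of the 3 relevant documents were found (recall 2/3), 2 of the 4 retrieved are relevant (precision 1/2), and 2 of the 7 non-relevant documents leaked in (fallout 2/7).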
Precision and Recall (contd.)
- Unfortunately, the two goals are quite contradictory: retrieving more documents tends to raise recall but lower precision.
- Average precision: the precision values at the ranks where relevant documents are retrieved, averaged into a single number that summarizes a ranked result list.
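Average precision for a single ranked list can be sketched as follows (an illustrative implementation, names invented for the example):

```python
def average_precision(ranked_docs, relevant):
    """Mean of the precision values at each rank where a relevant
    document appears, averaged over all relevant documents."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    # Relevant documents never retrieved contribute precision 0
    return total / len(relevant) if relevant else 0.0

# Relevant docs 1 and 3 appear at ranks 1 and 3:
# precisions 1/1 and 2/3, so AP = (1 + 2/3) / 2 = 5/6
ap = average_precision([1, 2, 3, 4], {1, 3})
```

Because it rewards placing relevant documents early, average precision captures the precision/recall trade-off for ranked output in one number.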
Key Techniques
Term Weighting
Both the Probabilistic Model and the Vector Space Model need a weighting function to determine ranked relevance. Three main factors affect the weight formulation:
- Term frequency (tf): words that repeat multiple times in a document are considered salient.
- Document frequency: words that appear in many documents are considered common and not very indicative of document content. A weighting method based on this is called inverse document frequency (idf) weighting.
- Document length: when collections have documents of varying lengths, longer documents tend to score higher since they contain more words and word repetitions. This effect is usually compensated for by normalizing for document length in the term weighting method.
Term Weighting (contd.)
After the first TREC, researchers realized that raw tf is non-optimal and that a dampened frequency (e.g., a logarithmic tf function such as 1 + log(tf)) is a better weighting metric.
Query Modification
- Synonyms: earlier systems relied on a hand-built thesaurus; newer ones build their own thesauri by analyzing word co-occurrence.
- Relevance feedback: users are the best judges of whether a retrieved document is relevant or non-relevant, and their judgments are used to reformulate the query.
- Pseudo-feedback: relevance feedback applied to the top few retrieved documents, assumed to be relevant, to generate a new query.
Other Techniques and Applications
Techniques
- Cluster hypothesis: documents that are very similar to each other will have a similar relevance profile for a given query.
  - Limited success in improving retrieval, but it aided the development of browsing and searching interfaces.
- Natural language processing
Applications
- Information filtering
- Topic Detection and Tracking (TDT)
- Speech retrieval
- Cross-language retrieval
- Question answering
- ...
References
- Amit Singhal. Modern Information Retrieval: A Brief Overview.
- C.J. van Rijsbergen. Information Retrieval.
- Dr. Gautam Das's lecture notes.