Ranking in IR and WWW
Modern Information Retrieval: A Brief Overview - Amit Singhal
Presented by Parin Sangoi
February 15, 2005
Outline
- Introduction
- History
- Models and Implementation
  - Vector Space Model
  - Probabilistic Models
  - Inference Network Model
  - Implementation
- Evaluation
Outline (contd.)
- Key Techniques
  - Term Weighting
  - Query Modification
- Other Techniques and Applications
- References
Introduction
What is Information Retrieval?
"An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request." - F.W. Lancaster
Introduction (contd.)
[Figure: a typical IR system]
History
- Vannevar Bush (1945): "As We May Think" gave birth to the idea of automatic access to stored knowledge.
- H.P. Luhn (1957): proposed indexing units for documents and measuring word overlap as a criterion for retrieval.
- Gerard Salton and his students: the SMART system, aimed at improving search quality.
- Cyril Cleverdon: the Cranfield evaluation methodology, which is still in use in IR evaluation.
History (contd.)
- 1970s and 1980s: many developments built on the advances of the 1960s. Many models were developed and shown to be effective on small text collections.
- 1992: the Text Retrieval Conference (TREC) began, aiming to encourage IR research on large text collections. Old techniques were modified and new techniques developed.
Models and Implementation
Boolean systems
- Queries are built from ANDs, ORs, and NOTs.
- No ranking, and it is difficult for a user to form a good search request.
- Many users still use Boolean systems because they feel more in control of the retrieval process.
Models and Implementation (contd.)
Vector Space Model
- Text is represented by a vector of terms; typically words are chosen as terms.
- A component of the text vector gets a non-zero value if the corresponding term occurs in the text.
- Text vectors are very sparse, as a vocabulary can contain millions of terms.
- The similarity between the query vector and a document vector is used to assign a numeric score to each document for a query.
Vector Space Model (contd.)
- The angle between two vectors is used as a measure of their divergence, and the cosine of the angle is used as the numeric similarity (the cosine is 1 for identical vectors and 0 for orthogonal vectors).
- Alternatively, the dot product between the two vectors can be used to measure similarity. If all vectors are of unit length, the cosine of the angle equals the dot product.
- If Vq is the query vector and Vd is the document vector, the similarity of document D to query Q is:
  Sim(Vd, Vq) = ∑i Wti(Vq) · Wti(Vd)
  where Wti(Vq) is the ith component of the query vector Vq and Wti(Vd) is the ith component of the document vector Vd.
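The similarity measure above can be sketched in a few lines of Python. This is a minimal illustration, not from the slides: sparse term vectors are represented as {term: weight} dicts, and the function names are made up for this example.

```python
import math

def cosine_similarity(query_vec, doc_vec):
    """Cosine of the angle between two sparse term-weight vectors,
    represented as {term: weight} dicts."""
    # Dot product runs only over terms shared by both vectors
    dot = sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    norm_d = math.sqrt(sum(w * w for w in doc_vec.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    # Dividing by the norms is equivalent to normalizing to unit length
    return dot / (norm_q * norm_d)

# Vectors pointing in the same direction score 1.0;
# vectors with no shared terms score 0.0
q = {"information": 1.0, "retrieval": 1.0}
d = {"information": 2.0, "retrieval": 2.0}
```

Note that when the vectors are pre-normalized to unit length, the plain dot product gives the same ranking, which is why the slide treats the two interchangeably.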
Probabilistic Models
- One of the main principles of an IR system is that its output should be ranked. The probabilistic model ranks documents by decreasing probability of their relevance to a query.
- Let P(R|D) be the probability that document D is relevant, and P(R̄|D) the probability that it is non-relevant. Since the ranking criterion is monotonic under the log-odds transformation, we can rank documents by:
  log( P(R|D) / P(R̄|D) )
- Applying Bayes' theorem to this ratio gives:
  log( (P(D|R) · P(R)) / (P(D|R̄) · P(R̄)) )
Probabilistic Model (contd.)
- P(R) is independent of the document under consideration, so P(R) and P(R̄) are constant scaling factors and can be eliminated. The formula simplifies to:
  log( P(D|R) / P(D|R̄) )
- Independence assumption: terms occur in documents independently of one another.
- If pi denotes P(ti|R) and qi denotes P(ti|R̄), the log formula reduces (up to a document-independent constant) to:
  ∑ti∈D log( pi(1 − qi) / (qi(1 − pi)) )
Probabilistic Model (contd.)
- Croft and Harper assume that pi is the same for every term, so pi/(1 − pi) is a constant and can be ignored. They also treat nearly all documents in a collection as non-relevant to a query (since collections are very large) and estimate qi by ni/N, where N is the collection size and ni is the number of documents containing term i.
- This yields the scoring function:
  ∑ti∈D log( (N − ni) / ni )
  which behaves like the IDF function.
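The Croft-Harper scoring function can be computed directly from document-frequency statistics. A minimal sketch (the function and variable names are invented for this example):

```python
import math

def croft_harper_score(query_terms, doc_terms, doc_freq, N):
    """Sum log((N - n_i) / n_i) over query terms present in the
    document, where n_i is the number of documents containing
    term i and N is the collection size."""
    score = 0.0
    for t in query_terms:
        n_i = doc_freq.get(t, 0)
        if t in doc_terms and 0 < n_i < N:
            # Rare terms (small n_i) contribute large, IDF-like weights
            score += math.log((N - n_i) / n_i)
    return score
```

A term appearing in 10 of 1000 documents contributes log(990/10) ≈ 4.6, while a term appearing in 500 contributes log(500/500) = 0, matching the intuition that common terms carry little evidence.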
Inference Network Model
- In the simplest implementation, a document instantiates a term with a certain strength, and given a query the credit from multiple terms is accumulated to compute the equivalent of a numeric score for the document.
- If the strength of instantiation is taken to be the weight of the term in the document, the ranking is similar to that of the Vector Space Model or the Probabilistic Model.
- Any formula can be used to define the strength of instantiation, which makes the model very general.
Implementation
- Inverted list data structure: fast access to the list of documents that contain a term, along with additional information (weight, relative position, etc.) — the inverted index.
- Stop words are ignored: the, in, of, a, ...
- Stemming conflates word forms: retrieval, retrieve, retrieved, retrieving, retriever, ...
  - Poor stemming returns wrong documents. Is it good enough?
- Multi-word phrases
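The inverted index described above can be sketched as follows. This is an illustrative toy (names invented for the example); real systems would drop stop words and stem terms before indexing, which is omitted here for brevity.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a postings list of (doc_id, positions).
    docs: {doc_id: text}. Positions support phrase queries."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        positions = defaultdict(list)
        for pos, term in enumerate(text.lower().split()):
            positions[term].append(pos)
        # One posting per (document, term) pair, with all positions
        for term, pos_list in positions.items():
            index[term].append((doc_id, pos_list))
    return index

docs = {1: "information retrieval systems", 2: "retrieval of information"}
index = build_inverted_index(docs)
# index["retrieval"] lists every document containing "retrieval",
# giving the fast term-to-documents access the slide describes
```

Keeping positions in the postings is what makes multi-word phrase matching possible at query time.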
Evaluation
Measurable quantities:
- The coverage of the collection: the extent to which the system includes relevant matter.
- The time lag: the average interval between the time the search request is made and the time an answer is given.
- The form of presentation of the output.
- The effort involved on the part of the user in obtaining answers to his search requests.
- The recall of the system: the proportion of relevant material actually retrieved in answer to a search request.
- The precision of the system: the proportion of retrieved material that is actually relevant.
Of these, the last two measure the effectiveness of the retrieval system.
Precision and Recall
Contingency table:

                   Relevant                  Non-relevant
  Retrieved        relevant retrieved        non-relevant retrieved
  Not retrieved    relevant not retrieved    non-relevant not retrieved
Precision and Recall (contd.)
- Recall: the proportion of relevant documents that are retrieved by the system.
- Precision: the proportion of retrieved documents that are relevant.
- Fallout: the proportion of non-relevant documents that are retrieved by the system.
- A good IR system should have high recall (retrieve as many relevant documents as possible) and high precision (retrieve very few non-relevant documents).
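The three measures follow directly from the contingency-table cells. A minimal sketch, with invented names, treating result sets as Python sets:

```python
def effectiveness(retrieved, relevant, collection_size):
    """Recall, precision, and fallout for one query.
    retrieved / relevant are collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant AND retrieved
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    non_relevant = collection_size - len(relevant)
    fallout = (len(retrieved - relevant) / non_relevant
               if non_relevant else 0.0)
    return recall, precision, fallout

# 4 retrieved, 3 relevant in a 10-document collection
r, p, f = effectiveness([1, 2, 3, 4], [1, 2, 5], 10)
```

Here 2 of the 3 relevant documents were found (recall 2/3), 2 of the 4 retrieved are relevant (precision 1/2), and 2 of the 7 non-relevant documents leaked in (fallout 2/7).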
Precision and Recall (contd.)
- Unfortunately, the two goals are quite contradictory: retrieving more documents tends to raise recall but lower precision.
- Average precision: the precision values at the ranks where relevant documents are retrieved, averaged into a single number that summarizes a ranked result list.
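Average precision for a single ranked list can be sketched as follows (an illustrative implementation, names invented for the example):

```python
def average_precision(ranked_docs, relevant):
    """Mean of the precision values at each rank where a relevant
    document appears, averaged over all relevant documents."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    # Relevant documents never retrieved contribute precision 0
    return total / len(relevant) if relevant else 0.0

# Relevant docs 1 and 3 appear at ranks 1 and 3:
# precisions 1/1 and 2/3, so AP = (1 + 2/3) / 2 = 5/6
ap = average_precision([1, 2, 3, 4], {1, 3})
```

Because it rewards placing relevant documents early, average precision captures the precision/recall trade-off for ranked output in one number.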
Key Techniques
Term Weighting
Both the Probabilistic Model and the Vector Space Model need a weighting function to determine ranked relevance. Three main factors affect the weight formulation:
- Term frequency (tf): words that repeat multiple times in a document are considered salient.
- Document frequency: words that appear in many documents are considered common and not very indicative of document content. A weighting method based on this is called inverse document frequency (idf) weighting.
- Document length: when collections have documents of varying lengths, longer documents tend to score higher since they contain more words and word repetitions. This effect is usually compensated for by normalizing for document length in the term weighting method.
Term Weighting (contd.)
After the first TREC, researchers realized that raw tf is non-optimal and that a dampened frequency (e.g., a logarithmic tf function such as 1 + log(tf)) is a better weighting metric.
Query Modification
- Synonyms: earlier systems relied on a hand-built thesaurus; newer ones build their own thesauri by analyzing word co-occurrence.
- Relevance feedback: users are the best judges of whether a retrieved document is relevant or non-relevant, and their judgments are used to reformulate the query.
- Pseudo-feedback: relevance feedback applied to the top few retrieved documents, assumed to be relevant, to generate a new query.
Other Techniques and Applications
Techniques
- Cluster hypothesis: documents that are very similar to each other will have a similar relevance profile for a given query.
  - Limited success in improving retrieval, but it aided the development of browsing and searching interfaces.
- Natural language processing
Applications
- Information filtering
- Topic Detection and Tracking (TDT)
- Speech retrieval
- Cross-language retrieval
- Question answering
- ...
References
- Amit Singhal. Modern Information Retrieval: A Brief Overview.
- C.J. van Rijsbergen. Information Retrieval.
- Dr. Gautam Das's lecture notes.