1
Special Topics on Information Retrieval
Manuel Montes y Gómez
2
Introduction
3
Content of the section
- Definition of the task
- The vector space model
- Performance evaluation
- Main problems and basic solutions
- Query expansion
- Relevance feedback
- Clustering (documents or results)
4
Initial questions
- What is an information retrieval system?
- What is its goal?
- What is inside it? Are there any sub-processes?
- How can its performance be evaluated?
- Why are results not always relevant?
5
General scheme of the IR process
[Diagram of the IR process. Labels: Task, Conception, Info Need, Formulation, User, Query, Search, Corpus, Refinement, Results]
6
More definitions
- IR deals with the representation, storage, organization of, and access to information items. (R. Baeza-Yates and B. Ribeiro-Neto, 1999)
- The task of an IR system is to retrieve documents or texts with information content that is relevant to a user's information need. (Sparck Jones & Willett, 1997)
7
Typical IR system
[Diagram of a typical IR system. Labels: Indexing, Retrieval, IR Model, Document Collection, Preprocessing, Storing, Index, Query, Querying, Results, Retrieving]
8
Vector space model
- Documents are represented as vectors in an N-dimensional space.
  - N is the number of terms in the collection.
  - A term is not necessarily the same as a word.
- The query is treated like any other document.
- Relevance is measured by similarity: a document is relevant to the query if its vector is similar to the query's vector.
9
Preprocessing
- Eliminate information about style, such as HTML or XML tags.
  - For some applications this information may be useful, for instance to index only some document sections.
- Remove stop words.
  - Function words such as articles, prepositions and conjunctions are not useful (they carry no meaning of their own).
- Perform stemming or lemmatization.
  - The goal is to reduce inflectional forms, and sometimes derivationally related forms, to a common base form.
  - am, are, is → be
  - car, cars, car's → car
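The three preprocessing steps can be sketched in a few lines of Python. This is only a minimal illustration: the stop list and the lemma table are tiny hand-made assumptions, lowercasing is an extra assumption not stated on the slide, and a real system would use a full stop list and a proper stemmer or lemmatizer.

```python
import re

# Tiny illustrative resources (assumptions, not a real stop list or lemmatizer).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "to"}
LEMMAS = {"am": "be", "are": "be", "is": "be", "cars": "car", "car's": "car"}

def preprocess(text):
    """Return the list of index terms for a raw document string."""
    text = re.sub(r"<[^>]+>", " ", text)                      # drop HTML/XML tags
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())  # lowercase and tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]       # remove stop words
    return [LEMMAS.get(t, t) for t in tokens]                 # toy lemmatization

print(preprocess("<p>The cars are in the car's garage</p>"))
```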
10
Representation
[Matrix diagram: the rows d1, d2, ..., dm are all the documents (one vector per document); the columns t1, ..., tn are the whole vocabulary of the collection (all different terms); each cell wi,j is a weight indicating the contribution of term j in document i.]
11
Term weighting - two main ideas
- The importance of a term increases proportionally to the number of times it appears in the document.
  - It helps to describe the document's content.
- The general importance of a term decreases proportionally to its occurrences in the entire collection.
  - Common terms are not good at distinguishing relevant from non-relevant documents.
12
Term weighting – main approaches
- Binary weights: wi,j = 1 iff document di contains term tj, otherwise 0.
- Term frequency (tf): wi,j = number of occurrences of tj in di.
- tf × idf weighting scheme: wi,j = tf(tj, di) × idf(tj), where:
  - tf(tj, di) is the number of occurrences of tj in document di;
  - idf(tj) = log [N/df(tj)], where df(tj) is the number of documents that contain the term tj and N is, in this formula, the number of documents in the collection.
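A minimal sketch of the tf × idf scheme, assuming documents are already preprocessed into token lists. The function name `tf_idf` is hypothetical, and the natural logarithm is an assumption, since the slide does not fix the base of the log.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns (vocabulary, document-term weight matrix)."""
    n_docs = len(docs)
    vocab = sorted({t for d in docs for t in d})
    # df(t): number of documents that contain term t
    df = Counter(t for d in docs for t in set(d))
    # idf(t) = log(N / df(t)); natural log is an assumption
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    matrix = []
    for d in docs:
        tf = Counter(d)  # raw occurrences of each term in this document
        matrix.append([tf[t] * idf[t] for t in vocab])
    return vocab, matrix

vocab, weights = tf_idf([["car", "car", "road"], ["road", "map"]])
print(vocab)
print(weights)
```

Note that a term occurring in every document ("road" in this toy collection) gets idf = log(1) = 0, so it contributes nothing to any vector.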
13
Similarity measure
- Relevance is the similarity between a document's vector and the query's vector.
- It is measured by means of the cosine measure.
- The closer the vectors (the smaller the angle between them), the greater the similarity.
[Figure: document vectors d1 and d2 and query vector q drawn on term axes i and j, with angles a1 and a2]
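As a sketch of the cosine measure, the snippet below scores documents against a query and sorts them by decreasing similarity. The helper names `cosine` and `rank` are hypothetical, and returning 0 for an all-zero vector is a convention chosen here, not something the slide specifies.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for empty vectors (assumption)
    return dot / (norm_u * norm_v)

def rank(query_vec, doc_vecs):
    """Return (score, document index) pairs, most similar first."""
    scores = [(cosine(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    return sorted(scores, reverse=True)

print(rank([1, 0], [[0, 1], [1, 1], [2, 0]]))
```

Because the cosine depends only on the angle, the document [2, 0] matches the query [1, 0] perfectly even though the two vectors differ in length.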
14
Vector space model − Pros & Cons
Pros
- Easy to explain
- Mathematically sound
- Approximate query matching
Cons
- Needs term weighting
- Hard to model structured queries
- Normalization increases computational costs
Most commonly used IR model; it is often considered superior to the others because of its simplicity and elegance.
15
Other IR models
- Boolean model (±1950)
- Document similarity (±1957)
- Probabilistic indexing (±1960)
- Vector space model (±1970)
- Probabilistic retrieval (±1976)
- Fuzzy set models (±1980)
- Inference networks (±1992)
- Language models (±1998)
16
IR evaluation
- Why is evaluation important?
- Which characteristics do we need to evaluate?
- How can we evaluate the performance of IR systems?
- Given several systems, which one is the best?
- What things (resources) are necessary to evaluate an IR system?
- Is IR evaluation subjective or objective?
17
Several perspectives
In order to answer "How well does the system work?", we can investigate several options:
- Processing: time and space efficiency
- Search: effectiveness of results
- System: satisfaction of the user
We will focus on evaluating retrieval effectiveness. How can the other aspects be measured?
18
Difficulties in evaluating an IR system
- Effectiveness is related to the relevancy of retrieved items.
- Relevancy is not typically binary but continuous.
- Even if relevancy is binary, it can be a difficult judgment to make.
- Relevancy, from a human standpoint, is:
  - Subjective: depends upon a specific user's judgment.
  - Situational: relates to the user's current needs.
  - Cognitive: depends on human perception and behavior.
  - Dynamic: changes over time.
19
Main requirements
- It is necessary to have a test collection:
  - A lot of documents (the bigger the better)
  - Several queries
  - Relevance judgments for all queries: a binary assessment of either relevant or non-relevant for each query-document pair
- Methods/systems must be evaluated using the same evaluation measure.
- Constructing a test collection requires considerable human effort.
20
Standard test collections
- TREC (Text Retrieval Conference)
  - National Institute of Standards and Technology (NIST)
  - In total, 1.89 million documents and relevance judgments for 450 information needs
- CLEF (Cross Language Evaluation Forum)
  - This evaluation series has concentrated on European languages and cross-language information retrieval.
  - Last ad hoc English monolingual IR task: 169,477 documents and 50 queries.
21
Retrieval effectiveness
- In response to a query, an IR system searches a document collection and returns an ordered list of responses.
- We measure the quality of a set/list of responses:
  - A better search strategy yields a better result list.
  - Better result lists help users fill their information need.
- Two kinds of measures: set-based and ranked-list-based.
22
Set based measures
[Venn diagram: within the collection, the set of relevant documents and the set of documents retrieved for the query (the results) overlap; the intersection contains the relevant retrieved documents.]
23
Precision, recall and F-measure
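- Precision (P): the ability to retrieve top-ranked documents that are mostly relevant, i.e. the fraction of retrieved documents that are relevant: $P = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{retrieved}|}$
- Recall (R): the ability of the search to find all of the relevant items in the corpus, i.e. the fraction of relevant documents that are retrieved: $R = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{relevant}|}$
- F-measure (F): harmonic mean of recall and precision: $F = \frac{2PR}{P + R}$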
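A small sketch of the three set-based measures, assuming documents are identified by hashable ids. The function name is hypothetical, and returning 0 when a denominator is empty is a convention chosen for this example.

```python
def precision_recall_f(retrieved, relevant):
    """Set-based precision, recall and balanced F-measure."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant retrieved documents
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean of P and R
    return p, r, f

# 4 documents retrieved, 3 relevant in the collection, 2 in common.
print(precision_recall_f([1, 2, 3, 4], [2, 4, 5]))
```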
24
Ranked-list based measures
- Average recall/precision curve: plots average precision at each standard recall level across all queries.
- MAP (mean average precision): provides a single-figure measure of quality across recall levels.
- R-prec: precision at the R-th position in the ranking of results for a query that has R relevant documents.
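R-prec is the simplest of these to compute. A minimal sketch, assuming the ranking is a list of document ids and the relevant documents are given as a set (the function name is hypothetical):

```python
def r_precision(ranking, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    relevant = set(relevant)
    r = len(relevant)
    if r == 0:
        return 0.0  # convention when the query has no relevant documents
    top_r = ranking[:r]
    return sum(1 for doc in top_r if doc in relevant) / r

# R = 3; two of the first three results are relevant.
print(r_precision(["a", "b", "c", "d"], {"a", "c", "x"}))
```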
25
Recall/Precision Curve
[Figure: recall/precision curve, from Mooney's IR course at the University of Texas at Austin]
What is the curve of an ideal system?
26
MAP
- Average precision (AP) is the average of the precision scores at the rank locations of each relevant document:
  $AP = \frac{\sum_{i=1}^{N} P(i)\, rel(i)}{\text{number of relevant documents}}$
  where N is the number of retrieved documents, P(i) is the precision of the first i documents, and rel(i) is a binary function indicating whether the document at position i is relevant or not.
- Mean Average Precision (MAP) is the mean of the average precision scores for a group of queries.
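The definition in terms of P(i) and rel(i) translates directly into code. This sketch assumes that the sum is divided by the total number of relevant documents for the query, so relevant documents that were never retrieved count as zero; the function names are hypothetical.

```python
def average_precision(ranking, relevant):
    """Sum of P(i) * rel(i) over the ranking, divided by the number of relevant documents."""
    relevant = set(relevant)
    hits = 0
    total = 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:       # rel(i) = 1
            hits += 1
            total += hits / i     # P(i): precision of the first i documents
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranking, relevant set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

runs = [(["a", "b", "c", "d"], {"a", "c"}), (["x", "a"], {"a"})]
print(mean_average_precision(runs))
```

For the first query the relevant documents sit at ranks 1 and 3, so AP = (1/1 + 2/3) / 2; for the second, the only relevant document is at rank 2, so AP = 1/2.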
27
Illustrative example
[Figure: two example rankings, from the IR course of Northeastern University, College of Computer and Information Science]
MAP1 = 0.622
MAP2 = 0.52
28
Common problems
- Why are not all retrieved documents relevant? Why is it so difficult to reach 100% precision?
  - Consider the query "jaguar".
- Why is it hard to retrieve all relevant documents (to reach 100% recall)?
  - Consider the query "religion".
- What can be done to tackle these problems?
29
Query expansion
- It is the process of adding terms to a user's (weighted) query.
- Its goal is to improve precision and/or recall.
- Example:
  - User query: "car"
  - Expanded query: "car cars automobile automobiles auto", etc.
- How to do it? Ideas?
30
Main approaches
- By means of a thesaurus
  - Thesauri may be manually or automatically constructed.
- By means of (user) relevance feedback
- Automatic query expansion
  - Local query expansion (blind feedback)
  - Global query expansion (using word associations)
31
Thesaurus-based query expansion
- A thesaurus provides information on synonyms and semantically related words.
- Expansion procedure: for each term t in a query, expand the query with synonyms and related words of t.
- Generally increases recall.
- May significantly decrease precision, particularly with ambiguous terms.
  - "interest rate" → "interest rate fascinate evaluate"
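The expansion procedure can be sketched with a dictionary standing in for the thesaurus. The entries below are a toy assumption built from the slide's own examples, and `expand_query` is a hypothetical helper name.

```python
# Toy thesaurus (assumption): term -> synonyms and related words.
THESAURUS = {
    "car": ["automobile", "auto"],
    "interest": ["fascinate"],
    "rate": ["evaluate"],
}

def expand_query(terms, thesaurus):
    """Append the related words of every query term, keeping the original order."""
    expanded = list(terms)
    for t in terms:
        for related in thesaurus.get(t, []):
            if related not in expanded:
                expanded.append(related)
    return expanded

print(expand_query(["interest", "rate"], THESAURUS))
```

The output reproduces the ambiguity problem: each word of "interest rate" is expanded independently of its sense in the phrase, which is how precision can drop.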
32
Relevance feedback
Basic procedure:
- The user creates an initial query, which returns an initial result set.
- The user selects the documents in that set that are relevant to the search.
- The system then re-weights and/or expands the query based on the terms in those documents.
Significant improvement in recall and precision over early query expansion work.
33
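Standard Rocchio Method
The idea is to move the query closer to the relevant documents and farther away from the irrelevant ones:
$\vec{q}_m = \alpha\,\vec{q}_0 + \frac{\beta}{|D_r|}\sum_{\vec{d} \in D_r}\vec{d} - \frac{\gamma}{|D_n|}\sum_{\vec{d} \in D_n}\vec{d}$
where $\vec{q}_0$ is the initial query, $D_r$ is the set of known relevant documents and $D_n$ the set of known irrelevant documents.
- $\alpha$: tunable weight for the initial query
- $\beta$: tunable weight for the relevant documents
- $\gamma$: tunable weight for the irrelevant documents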
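A minimal sketch of the Rocchio update on plain Python lists, where alpha, beta and gamma are the tunable weights for the initial query, the relevant documents and the irrelevant documents. The default values are only commonly quoted illustrative settings, not values given in the slides, and the function name is hypothetical.

```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward the relevant centroid and away from the irrelevant one."""
    def centroid(vectors):
        if not vectors:
            return [0.0] * len(query)  # an empty set contributes nothing
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    rel_c = centroid(relevant)
    irr_c = centroid(irrelevant)
    return [alpha * q + beta * r - gamma * n
            for q, r, n in zip(query, rel_c, irr_c)]

# Two relevant documents, one irrelevant document, two-term vocabulary.
print(rocchio([1, 0], [[0, 2], [0, 4]], [[2, 0]], alpha=1.0, beta=0.5, gamma=0.5))
```

In this toy run the weight of the first term falls to 0 while the second rises to 1.5, so the reformulated query now points toward the relevant documents.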
34
Pseudo relevance feedback
- Users do not like to give manual feedback to the system.
- Use relevance feedback methods without explicit user input.
- Just assume the top m retrieved documents are relevant, and use them to reformulate the query.
- Relies largely on the system's ability to initially retrieve relevant documents.
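One simple way to reformulate the query from the top m documents is to add their most frequent new terms. This is only one possible sketch: the slides do not prescribe how the top documents are used, and the function name and the parameters m and k are illustrative assumptions.

```python
from collections import Counter

def pseudo_feedback_expand(query_terms, ranked_docs, m=2, k=2):
    """Assume the top m documents are relevant and add their k most frequent new terms."""
    counts = Counter()
    for doc in ranked_docs[:m]:  # documents below rank m are ignored
        counts.update(t for t in doc if t not in query_terms)
    new_terms = [t for t, _ in counts.most_common(k)]
    return list(query_terms) + new_terms

ranked = [
    ["jaguar", "car", "speed", "car"],
    ["jaguar", "car", "engine", "engine"],
    ["jaguar", "cat"],
]
print(pseudo_feedback_expand(["jaguar"], ranked))
```

If the initial ranking had put the animal document first, the query would have drifted toward "cat" instead, which illustrates the dependence on the quality of the initial retrieval.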
35
Automatic global analysis
- Determine term similarity through a pre-computed statistical analysis of the complete corpus.
- Compute association matrices which quantify term correlations in terms of how frequently they co-occur.
- Expand queries with the statistically most similar terms.
- The same information is used for all queries.
- It is an offline process.
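A sketch of an association matrix based on raw document-level co-occurrence counts. Counting co-occurrence per document, without normalization, is a simplifying assumption, and both function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence(docs):
    """assoc[a][b] = number of documents in which terms a and b both occur."""
    assoc = defaultdict(lambda: defaultdict(int))
    for doc in docs:
        for a, b in combinations(sorted(set(doc)), 2):
            assoc[a][b] += 1
            assoc[b][a] += 1
    return assoc

def most_associated(term, assoc, k=1):
    """The k terms that co-occur most often with `term` (ties broken alphabetically)."""
    neighbours = assoc.get(term, {})
    return sorted(neighbours, key=lambda t: (-neighbours[t], t))[:k]

docs = [["car", "engine", "road"], ["car", "engine"], ["car", "map"]]
assoc = cooccurrence(docs)  # computed once, offline, for the whole corpus
print(most_associated("car", assoc, k=2))
```

The matrix is built once for the whole collection, so the same associations serve every query at search time.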
36
Clustering in information retrieval
- Cluster hypothesis: documents in the same cluster behave similarly with respect to relevance to information needs.
  - If a document from a cluster is relevant to a search request, then it is likely that other documents from the same cluster are also relevant.
- Two main uses:
  - Collection clustering
    - Higher efficiency: faster search
    - Tends to improve recall
  - Search results clustering
    - More effective information presentation to the user