1
Computer communication B: Information retrieval
2
Information retrieval: introduction 1
This topic addresses the question of how relevant information can be found in large numbers of documents. These documents have to be processed by computers. Often there are too many hits: 1,530,000 Dutch pages for the query “insurance”.
3
Information retrieval: introduction 1
Sometimes there are ambiguities that have to be resolved. For example, the acronym LSA can stand for Linguistic Society of America, and what else? Let’s google it. Information retrieval (IR) searches a large number of documents for those relevant to a specific topic.
4
Information retrieval: introduction 2
Search engines are one kind of IR system. Two characteristics differentiate IR from simply searching in databases. Vagueness: users cannot express and formalize their information needs precisely. Uncertainty: the system has no knowledge of the content of the documents. Difference from Information Extraction (IE): IE extracts the relevant information on a specific topic from a large number of documents. The authors of the documents and their users are very often separate groups.
5
Information retrieval: introduction 3
The search does not go directly through the documents; it looks for index terms (or descriptors). An index term captures the essence of the topic of a document (it is a sort of keyword that is used in the search). Steps for the preparation (building the search index): determine the relevant terms and their occurrences in the documents, then save them in an index. Terms are not only groups of characters between spaces (otherwise string search would be enough). Both branches are quite developed.
6
Information retrieval: introduction 3
Search queries are translated into index terms. They are evaluated against the index (not the documents). An index optimizes the search, which is what makes answering a query efficient.
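The two steps described above (build an index once, then answer queries from the index alone) can be sketched as a small inverted index. This is only an illustration under simplifying assumptions: the function names `tokenize`, `build_index` and `search` and the three sample documents are hypothetical, and the tokenizer is exactly the naive "characters between spaces" approach that a real system would improve on.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately naive term extractor)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents containing every query term (Boolean AND).

    Only the index is consulted; the documents themselves are never read.
    """
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical sample collection.
docs = {
    1: "Car insurance and home insurance",
    2: "Health insurance for students",
    3: "Linguistic Society of America",
}
index = build_index(docs)
print(search(index, "insurance"))
print(search(index, "health insurance"))
```

Because the index is built once and not updated here, adding a document to `docs` afterwards would not change the search results until `build_index` is run again.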
7
Information retrieval: introduction 4
An index is static: it does not change automatically when documents are added or removed (or disappear). The results of a search are ordered by relevance. The search procedure (formalized in an algorithm) has to evaluate the relevance of each document to a search. The ranking algorithms can be “misused” to push web pages to the top of the results (“search engine optimization”, SEO). The higher a page's position in the results, the more often it will be visited. Advantage! An example: insurance pages.
8
Information retrieval: Vector space models 1
Documents are characterized and evaluated according to their index terms. Each document is identified with a vector. The dimensions of the vector are the index terms, so a document vector can have many dimensions. The value in each dimension is the number of times the corresponding term appears in the document (sometimes the value is 0). A metric for the similarity between two documents is the cosine of the angle between their vectors. Search queries are likewise interpreted as vectors.
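As an illustration of the cosine measure, the sketch below represents each document as a sparse vector of raw term counts and compares two vectors. The helper names `term_vector` and `cosine` and the sample texts are assumptions made for this example; splitting on whitespace stands in for a real index-term extractor.

```python
import math
from collections import Counter

def term_vector(text):
    """Represent a text as a sparse vector: term -> number of occurrences."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical documents and a query, all treated as vectors.
d1 = term_vector("insurance car insurance")
d2 = term_vector("car insurance")
d3 = term_vector("linguistic society")
query = term_vector("insurance")

print(round(cosine(d1, d2), 3))  # shared terms, so the value is close to 1
print(cosine(d1, d3))            # no shared terms, so the value is 0.0
print(round(cosine(query, d1), 3))
```

A query is handled exactly like a document here, which is what makes ranking by cosine similarity possible.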
9
Vector space models 2
An example of a vector space model with only 2 index terms. Boolean search methods have a more macroscopic perspective (documents are compared, not their index terms).
10
Vector space models 3
The more often a term appears in a document, the more important it is for that document. But raw term weights (term frequency, tf_{t,d}) treat all terms as equally important (i.e. as having the same weight). This can introduce a bias, because terms differ in how frequent they are overall. Therefore we count how many documents in the whole collection D contain a certain term t (document frequency, df_t). From df_t we can calculate the inverse document frequency idf_t. The weight of a term in a document is then calculated with the tf-idf formula.
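The exact formulas are not preserved in this transcript, so the sketch below assumes one common textbook definition: idf_t = log(N / df_t), where N is the number of documents in the collection, and weight = tf_{t,d} × idf_t. Other variants (different logarithm base, smoothing) exist. The function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting: tf * log(N / df), with N the number of documents.
    """
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term frequency in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical collection: "insurance" occurs in every document.
docs = ["insurance car insurance", "insurance health", "insurance travel car"]
weights = tf_idf(docs)
for w in weights:
    print({t: round(v, 3) for t, v in w.items()})
```

Under this definition a term that occurs in every document, such as "insurance" here, gets weight 0 no matter how often it appears, while a term found in only one document gets the highest idf.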
11
Information retrieval: evaluation 1
The success of IR has several components. Precision: how many of the found documents are relevant to the search?
Formula: P = |found ∩ relevant| / |found|
12
Information retrieval: evaluation 1
Recall: how many of the relevant documents are found by the search?
Formula: R = |found ∩ relevant| / |relevant|
13
Information retrieval: evaluation 1
Fall-out: how many of the irrelevant documents are found by the search?
Formula: F = |found ∩ irrelevant| / |irrelevant|
There is an inverse correlation between precision and recall.
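The three formulas translate directly into set operations. The sketch below is one possible way to write them; the function names and the ten-document sample collection are assumptions for illustration, and returning 0.0 for an empty denominator is a convention chosen here, not something the slides specify.

```python
def precision(found, relevant):
    """Share of the found documents that are relevant."""
    return len(found & relevant) / len(found) if found else 0.0

def recall(found, relevant):
    """Share of the relevant documents that were found."""
    return len(found & relevant) / len(relevant) if relevant else 0.0

def fall_out(found, relevant, collection):
    """Share of the irrelevant documents that were (wrongly) found."""
    irrelevant = collection - relevant
    return len(found & irrelevant) / len(irrelevant) if irrelevant else 0.0

# Hypothetical collection of ten document ids.
collection = set(range(1, 11))
relevant = {1, 2, 3, 4}
found = {1, 2, 3, 7, 8}

print(precision(found, relevant))                       # 3 of 5 found are relevant
print(recall(found, relevant))                          # 3 of 4 relevant are found
print(round(fall_out(found, relevant, collection), 3))  # 2 of 6 irrelevant are found
```

Returning more documents can only keep recall the same or raise it, but it usually lowers precision, which is the trade-off between the two measures.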
14
Information retrieval: evaluation 2
Example: 20 documents are found, 18 of them relevant; 3 relevant documents are not found, and 27 irrelevant documents are likewise not found. Precision: 18/20 = 90%. Recall: 18/21 = 85.7%. Fall-out: 2/29 = 6.9%. First attempt at a metric that combines precision and recall: accuracy. How many documents are correctly classified (relevant and found, or irrelevant and not found)? In our example: (18+27)/50 = 90%. But because the large majority of documents are irrelevant and not found (in real systems above 99%), accuracy is not a good evaluation measure.
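The example can be checked from its four counts, and the weakness of accuracy can be made concrete. In the sketch below the first set of counts comes from the example above; the second scenario (a collection of 10,000 documents and a system that returns nothing) uses made-up numbers purely for illustration, and the function name `accuracy` is an assumption.

```python
def accuracy(found_rel, found_irrel, missed_rel, missed_irrel):
    """Share of documents classified correctly (relevant and found, or irrelevant and not found)."""
    total = found_rel + found_irrel + missed_rel + missed_irrel
    return (found_rel + missed_irrel) / total

# Counts from the example: 20 found (18 relevant, 2 irrelevant),
# 3 relevant not found, 27 irrelevant not found.
found_rel, found_irrel, missed_rel, missed_irrel = 18, 2, 3, 27

print(f"precision = {found_rel / (found_rel + found_irrel):.1%}")      # 90.0%
print(f"recall    = {found_rel / (found_rel + missed_rel):.1%}")       # 85.7%
print(f"fall-out  = {found_irrel / (found_irrel + missed_irrel):.1%}") # 6.9%
print(f"accuracy  = {accuracy(found_rel, found_irrel, missed_rel, missed_irrel):.1%}")  # 90.0%

# Hypothetical large collection: 10,000 documents, 21 relevant,
# and a useless system that returns no documents at all.
useless = accuracy(0, 0, 21, 9979)
print(f"accuracy of a system that finds nothing = {useless:.1%}")  # 99.8%
```

The second result shows why accuracy is misleading here: the system has a recall of 0, yet it scores higher than the system in the example because almost every document is irrelevant and not found.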
15
Information retrieval: evaluation 3
Second attempt: F-value. It weights precision and recall equally as their harmonic mean. Formula: F = 2PR/(P+R). In our example: F = [2(18/20 × 18/21)]/(18/20 + 18/21) ≈ 0.878 (87.8%). Another metric looks at the order of the found documents: are the most important documents listed first?
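A short sketch of the F-value computation, using the precision and recall from the example (18/20 and 18/21). The function name `f_value` is an assumption, and the second pair of values (P = 1.0, R = 0.1) is invented to show how F differs from a plain arithmetic mean.

```python
def f_value(p, r):
    """F = 2PR / (P + R), the harmonic mean of precision and recall."""
    if p + r == 0:
        return 0.0  # convention chosen here for the undefined case
    return 2 * p * r / (p + r)

# Values from the example.
p, r = 18 / 20, 18 / 21
print(round(f_value(p, r), 3))  # 0.878

# Hypothetical unbalanced system: perfect precision, very poor recall.
print(round(f_value(1.0, 0.1), 3))  # 0.182
print((1.0 + 0.1) / 2)              # arithmetic mean: 0.55
```

The harmonic mean stays close to the smaller of the two values, so a system cannot reach a high F-value by doing well on only one of precision and recall.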