Presentation on theme: "Computer communication B: Information retrieval"— Presentation transcript:

1 Computer communication B Information retrieval

2 Information retrieval: introduction 1 This topic addresses the question of how to find relevant information in large collections of documents  These documents have to be processed by computers  Often there are too many hits: 1,530,000 Dutch pages for the query "insurance"

3 Information retrieval: introduction 1 Sometimes ambiguities have to be resolved: for example, the acronym LSA can stand for  Linguistic Society of America, and what else? Let's google it  Information retrieval (IR) is the search for documents relevant to a specific topic in a large collection of documents

4 Information retrieval: introduction 2 Search engines are a kind of IR system  Two characteristics distinguish IR from plain database search:  Vagueness: the user cannot express and formalize her/his information need in a precise way  Uncertainty: the system has no knowledge of the content of the documents  Contrast with Information Extraction (IE): the extraction of relevant information on a specific topic from a large collection of documents  The authors of the documents and their users are very often separate groups

5 Information retrieval: introduction 3 The search does not go through the documents directly; instead, it looks for index terms (or descriptors)  An index term captures the essence of the topic of a document; it is a sort of keyword used in the search  Preparation steps: building the search index  Determine the relevant terms and their occurrences in the documents  Terms are not simply groups of characters between spaces (otherwise string search would be enough)  Save this information in an index  Both steps are well developed
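The index-building step can be sketched as a minimal inverted index. The function name, toy documents, and whitespace tokenization below are illustrative assumptions; as the slide notes, real indexers treat terms as more than character groups between spaces (stemming, stop-word removal, etc.):

```python
# Minimal inverted-index sketch: map each term to the set of
# documents it occurs in (illustrative only; real systems also
# normalize terms via tokenization, stemming and stop words).
def build_index(docs):
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index

docs = {
    1: "insurance for cars",
    2: "car insurance quotes",
    3: "linguistics society",
}
index = build_index(docs)
print(index["insurance"])  # set of documents containing the term
```

A query is then evaluated against this index rather than against the documents themselves, which is what makes answering efficient.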

6 Information retrieval: introduction 3 Search instructions are translated into index terms  They are evaluated against the index (not against the documents)  An index optimizes the search and therefore makes answering efficient

7 Information retrieval: introduction 4 An index is static: it does not change automatically when documents are added or removed (or disappear)  The results of a search are ranked according to their relevance  The search procedure (formalized in an algorithm) has to evaluate the relevance of a document for a query  Ranking algorithms can be "misused" to push web pages to the top of the results ("search engine optimization", SEO)  The higher the position of a page in the results, the more often it is visited. Advantage!  An example: insurance pages

8 Information retrieval: Vector space models 1 Documents are characterized/evaluated according to their index terms  Each document is identified with a vector  The dimensions of the vector are the index terms, so a document vector can have many dimensions  The value for an index term is the number of times that term appears in the document (sometimes the value is 0)  A metric for the similarity between two documents is the cosine of the angle between their vectors  Queries are interpreted as vectors as well

9 Vector space models 2 An example of a vector space model with only 2 index terms  Boolean search methods have a more macroscopic perspective (documents are compared, not their index terms)

10 Vector space models 3 Therefore, the more often a term appears in a document, the more important it is for that document  But raw term counts (term frequency: tf t,d) treat all terms as equally important (i.e. as having the same weight)  This introduces a bias due to the differences in frequency among terms  Therefore we also count how many documents in the whole collection D contain a given term t (df t: document frequency)  With df we can calculate the inverse document frequency: idf t = log(|D| / df t)  The weight of a term in a document is then calculated with the tf-idf formula: tf-idf t,d = tf t,d × idf t
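The tf-idf weighting can be sketched directly from the two formulas above (the toy collection is an illustrative assumption; the natural logarithm is used here, though base 10 or 2 is also common):

```python
import math

# tf-idf weight of a term in a document:
#   tf_{t,d} * log(|D| / df_t)
def tf_idf(term, doc, collection):
    tf = doc.count(term)                     # term frequency in this doc
    df = sum(term in d for d in collection)  # document frequency in D
    idf = math.log(len(collection) / df)     # inverse document frequency
    return tf * idf

collection = [
    ["car", "insurance", "insurance"],
    ["car", "repair"],
    ["car", "rental"],
]
# "car" occurs in every document, so its idf (and weight) is 0:
print(tf_idf("car", collection[0], collection))        # 0.0
# "insurance" is rarer, so it gets a positive weight:
print(tf_idf("insurance", collection[0], collection))
```

This shows the intended correction of the bias: frequent-everywhere terms are weighted down, while terms concentrated in few documents are weighted up.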

11 Information retrieval: evaluation 1 The success of IR has several measures  Precision: how many of the found documents are relevant to the search?  Formula: P = |found ∩ relevant| / |found|

12 Information retrieval: evaluation 1 Recall  How many of the relevant documents are found by the search?  Formula: R = |found ∩ relevant| / |relevant|

13 Information retrieval: evaluation 1 Fall-out  How many of the irrelevant documents are found by the search?  Formula: F = |found ∩ irrelevant| / |irrelevant|  There is an inverse correlation between precision and recall

14 Information retrieval: evaluation 2 Example: 20 documents are found, 18 of them relevant; 3 relevant documents are not found, and 27 irrelevant documents are also not found (50 documents in total)  Precision: 18/20 = 90%  Recall: 18/21 ≈ 85.7%  Fall-out: 2/29 ≈ 6.9%  A first attempt at a metric that combines precision and recall: accuracy  How many documents are correctly classified (relevant and found / irrelevant and not found)?  In our example: (18+27)/50 = 90%  But the large majority of not-found irrelevant documents (in real systems above 99%) means that accuracy is not a good evaluation measure
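The worked example can be checked directly in code (variable names are illustrative; the counts are the slide's own: 20 found, 18 of them relevant, 3 relevant and 27 irrelevant documents not found):

```python
# Evaluation metrics on the worked example.
found_relevant = 18
found_irrelevant = 2    # 20 found - 18 relevant among them
relevant = 21           # 18 found + 3 missed
irrelevant = 29         # 2 found + 27 not found
total = 50

precision = found_relevant / (found_relevant + found_irrelevant)
recall = found_relevant / relevant
fallout = found_irrelevant / irrelevant
# Accuracy: correctly classified documents, i.e. relevant-and-found
# plus irrelevant-and-not-found:
accuracy = (found_relevant + 27) / total

print(f"precision = {precision:.3f}")  # 0.900
print(f"recall    = {recall:.3f}")     # 0.857
print(f"fall-out  = {fallout:.3f}")    # 0.069
print(f"accuracy  = {accuracy:.3f}")   # 0.900
```

Note how accuracy equals precision here only by coincidence; inflating the pool of not-found irrelevant documents would push accuracy toward 100% without the system getting any better, which is the slide's point.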

15 Information retrieval: evaluation 3 A second attempt: the F-value  It balances precision and recall: the harmonic mean of the two  Formula: F = 2PR/(P+R)  In our example: F = [2(18/20 × 18/21)]/(18/20 + 18/21) ≈ 0.88  Another metric looks at the order of the found documents: are the most relevant documents listed first?
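The F-value for the same example is a one-line check of the formula F = 2PR/(P+R):

```python
# F-value: harmonic mean of precision and recall,
# computed with the example values from the previous slide.
precision = 18 / 20   # 0.9
recall = 18 / 21      # ~0.857

f_value = 2 * precision * recall / (precision + recall)
print(round(f_value, 3))  # 0.878
```

Being a harmonic mean, the F-value is dragged toward the smaller of the two inputs, so a system cannot score well by maximizing only one of precision or recall.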

