Computer communication B: Information retrieval


Computer communication B: Information retrieval

Information retrieval: introduction 1
This topic addresses the question of how it is possible to find relevant information in large numbers of documents.
- These documents have to be processed by computers.
Often there are too many hits: for example, the Dutch pages for the query "insurance".

Information retrieval: introduction 1
Sometimes there are ambiguities that have to be resolved: for example, the acronym LSA can stand for:
- Linguistic Society of America, and what else? Let's google it (in the IR context, also Latent Semantic Analysis).
Information retrieval (IR) searches for documents relevant to a specific topic in a large collection of documents.

Information retrieval: introduction 2
Search engines are a kind of IR system.
Two characteristics differentiate IR from simple database search:
- Vagueness: the user cannot express and formalize his/her information need precisely.
- Uncertainty: the system does not have any knowledge about the content of the documents.
Difference from Information Extraction (IE): IE extracts the relevant information on a specific topic from a large number of documents.
The authors of the documents and their users are very often separate groups.

Information retrieval: introduction 3
The search does not go directly through the documents; instead it looks for index terms (or descriptors):
- something that captures the essence of the topic of a document (a sort of keyword that is used in the search).
Steps of the preparation: building the search index:
- Determine the relevant terms and their occurrences in the documents.
- Terms are not only groups of characters between spaces (otherwise string search would be enough).
- Save this in an index.
- Both branches are quite well developed.

Information retrieval: introduction 3
Search instructions are translated into index terms.
They are evaluated on the basis of the index (not the documents).
An index is used to optimize the search; it is what makes the answer efficient.
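The idea above, that queries are evaluated against an index rather than against the raw documents, can be sketched as a minimal inverted index. This is an illustrative sketch, not code from the lecture; the function names and the toy documents are invented for the example.

```python
# Minimal inverted-index sketch: each index term maps to the set of
# documents it occurs in, and queries are evaluated against this mapping,
# never against the document texts themselves.

def build_index(docs):
    """docs: dict mapping doc id -> text. Returns term -> set of doc ids."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents containing every query term."""
    result = None
    for term in query.lower().split():
        postings = index.get(term, set())
        result = postings if result is None else result & postings
    return result or set()

docs = {1: "car insurance rates", 2: "dutch insurance pages", 3: "car repair"}
index = build_index(docs)
print(search(index, "car insurance"))  # {1}
```

Note that `search` never touches `docs` after indexing: this is what makes answering a query efficient.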

Information retrieval: introduction 4
An index is static: it does not change automatically when documents are added or removed (or disappear).
The results of a search are ranked according to their relevance.
The search procedure (formalized in an algorithm) has to evaluate the relevance of a document for a search.
- The algorithms that create the ranking can be "misused" to push web pages to the top of the results ("search engine optimization", SEO).
- The higher the position of a page in the results, the more often it will be visited. Advantage!
- An example: insurance pages.

Information retrieval: vector space models 1
Documents are characterized/evaluated according to their index terms.
Each document is identified with a vector.
The dimensions of the vector are the index terms; a document can therefore have many dimensions.
The value for an index term is the number of times that term appears in the document (sometimes the value is 0).
A metric for the similarity between two documents is the cosine of the angle between their vectors.
Queries are interpreted as vectors as well.
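The cosine similarity described above can be sketched in a few lines; the vectors below are invented toy term-count vectors, not data from the lecture.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-count vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Two documents over the index terms ("insurance", "car"):
d1 = [2, 1]
d2 = [4, 2]   # same direction as d1, so the similarity is maximal
q  = [1, 0]   # a query is treated as a vector too
print(round(cosine(d1, d2), 3))  # 1.0
print(round(cosine(d1, q), 3))
```

Because the cosine only measures the angle, d2 (which just repeats d1's term counts twice) is maximally similar to d1: document length does not matter, only term proportions.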

Vector space models 2
An example of a vector space model with only 2 index terms.
Boolean search methods have a stronger macroscopic perspective (whole documents are compared, not their index terms).

Vector space models 3
The more often a term appears in a document, the more important it is for that document.
But raw term counts (term frequency: tf_t,d) treat all terms as equally important (i.e., as having the same weight):
- This can introduce a bias due to differences in frequency between terms.
- Therefore we also count how many documents in the whole collection D contain a given term t (df_t: document frequency).
- From df we can calculate the inverse document frequency idf_t with the formula idf_t = log(N / df_t), where N is the number of documents in D.
- The weight of a term in a document is then calculated with the tf-idf formula: tf-idf_t,d = tf_t,d × idf_t.
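The tf-idf weighting can be sketched as follows. This assumes the standard formulation idf_t = log(N / df_t) (the slide leaves the exact formula open, and several variants with smoothing exist); the toy documents are invented for the example.

```python
import math

def tf_idf(docs):
    """docs: list of token lists. Returns, per document, a dict mapping
    term -> tf-idf weight, assuming idf_t = log(N / df_t)."""
    N = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):          # count each document at most once
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        w = {}
        for term in set(doc):
            tf = doc.count(term)       # raw term frequency in this document
            w[term] = tf * math.log(N / df[term])
        weights.append(w)
    return weights

docs = [["insurance", "car", "insurance"],
        ["car", "repair"],
        ["dutch", "insurance"]]
w = tf_idf(docs)
```

A term that occurs in every document gets idf = log(1) = 0 and thus weight 0, which is exactly the bias correction the slide describes: ubiquitous terms carry no discriminating power.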

Information retrieval: evaluation 1
The success of IR has several components:
- Precision: how many of the found documents are relevant to the search?
- Formula: P = |found ∩ relevant| / |found|

Information retrieval: evaluation 1
Recall: how many of the relevant documents are found by the search?
- Formula: R = |found ∩ relevant| / |relevant|

Information retrieval: evaluation 1
Fall-out: how many of the irrelevant documents are found by the search?
- Formula: F = |found ∩ irrelevant| / |irrelevant|
There is an inverse correlation between precision and recall.

Information retrieval: evaluation 2
Example: 20 found documents, 18 of them relevant; 3 relevant documents are not found; 27 irrelevant documents are likewise not found.
- Precision: 18/20 = 90%
- Recall: 18/21 = 85.7%
- Fall-out: 2/29 = 6.9%
First attempt at a metric that combines precision and recall: accuracy.
- How many documents are correctly classified (relevant and found / irrelevant and not found)?
- In our example: (18+27)/50 = 90%
- But the large majority of documents are irrelevant and not found (in real systems above 99%), which means that accuracy is not a good evaluation measure.
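The worked example above can be reproduced over sets of document ids; the concrete ids below are invented, only the set sizes (20 found, 21 relevant, 50 in total) match the example.

```python
def evaluate(found, relevant, collection):
    """Precision, recall and fall-out over sets of document ids."""
    irrelevant = collection - relevant
    precision = len(found & relevant) / len(found)
    recall = len(found & relevant) / len(relevant)
    fallout = len(found & irrelevant) / len(irrelevant)
    return precision, recall, fallout

# The slide's example: 50 documents in total, 20 found, 18 of them
# relevant, 3 relevant missed, 27 irrelevant correctly not found.
collection = set(range(50))
relevant = set(range(21))          # 21 relevant in total (18 found + 3 missed)
found = set(range(18)) | {21, 22}  # 18 relevant hits + 2 irrelevant hits
p, r, f = evaluate(found, relevant, collection)
print(p, round(r, 3), round(f, 3))  # 0.9 0.857 0.069
```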

Information retrieval: evaluation 3
Second attempt: the F-value.
- Balances precision and recall: the harmonic mean of the two.
- Formula: F = 2PR / (P + R)
- In our example: F = 2(18/20 × 18/21) / (18/20 + 18/21) ≈ 0.878
Other metrics look at the order of the found documents: are the most important documents ranked first?
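The F-value formula above is a one-liner; plugging in the precision and recall from the slide's example (P = 18/20, R = 18/21) gives roughly 0.878.

```python
def f_value(precision, recall):
    """Harmonic mean of precision and recall: F = 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The slide's example: P = 18/20, R = 18/21
print(round(f_value(18 / 20, 18 / 21), 3))  # 0.878
```

Being a harmonic mean, the F-value is dragged toward the smaller of the two inputs, so a system cannot score well by maximizing only one of precision or recall.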