Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.

Slides:



Advertisements
Similar presentations
Improvements and extras Paul Thomas CSIRO. Overview of the lectures 1.Introduction to information retrieval (IR) 2.Ranked retrieval 3.Probabilistic retrieval.
Advertisements

Chapter 5: Introduction to Information Retrieval
INFO624 - Week 2 Models of Information Retrieval Dr. Xia Lin Associate Professor College of Information Science and Technology Drexel University.
Multimedia Database Systems
Lecture 11 Search, Corpora Characteristics, & Lucene Introduction.
Introduction to Information Retrieval (Part 2) By Evren Ermis.
Ranking models in IR Key idea: We wish to return in order the documents most likely to be useful to the searcher To do this, we want to know which documents.
Information Retrieval Ling573 NLP Systems and Applications April 26, 2011.
T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) IR Queries.
Modern Information Retrieval Chapter 2 Modeling. Probabilistic model the appearance or absent of an index term in a document is interpreted either as.
Ch 4: Information Retrieval and Text Mining
Modern Information Retrieval Chapter 2 Modeling. Can keywords be used to represent a document or a query? keywords as query and matching as query processing.
Hinrich Schütze and Christina Lioma
Evaluating the Performance of IR Sytems
1 CS 430 / INFO 430 Information Retrieval Lecture 3 Vector Methods 1.
Vector Space Model CS 652 Information Extraction and Integration.
Information retrieval Finding relevant data using irrelevant keys Example: database of photographic images sorted by number, date. DBMS: Well structured.
The Vector Space Model …and applications in Information Retrieval.
Automatic Indexing (Term Selection) Automatic Text Processing by G. Salton, Chap 9, Addison-Wesley, 1989.
IR Models: Review Vector Model and Probabilistic.
Information Retrieval
WMES3103 : INFORMATION RETRIEVAL INDEXING AND SEARCHING.
Recuperação de Informação. IR: representation, storage, organization of, and access to information items Emphasis is on the retrieval of information (not.
1 CS 502: Computing Methods for Digital Libraries Lecture 11 Information Retrieval I.
CS246 Basic Information Retrieval. Today’s Topic  Basic Information Retrieval (IR)  Bag of words assumption  Boolean Model  Inverted index  Vector-space.
Chapter 5: Information Retrieval and Web Search
1 Text Categorization  Assigning documents to a fixed set of categories  Applications:  Web pages  Recommending pages  Yahoo-like classification hierarchies.
Modeling (Chap. 2) Modern Information Retrieval Spring 2000.
Web search basics (Recap) The Web Web crawler Indexer Search User Indexes Query Engine 1 Ad indexes.
1 Searching through the Internet Dr. Eslam Al Maghayreh Computer Science Department Yarmouk University.
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Thanks to Bill Arms, Marti Hearst Documents. Last time Size of information –Continues to grow IR an old field, goes back to the ‘40s IR iterative process.
Modern Information Retrieval: A Brief Overview By Amit Singhal Ranjan Dash.
Google & Document Retrieval Qing Li School of Computing and Informatics Arizona State University.
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Term Frequency. Term frequency Two factors: – A term that appears just once in a document is probably not as significant as a term that appears a number.
Chapter 6: Information Retrieval and Web Search
CSE3201/CSE4500 Term Weighting.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Web Search. Crawling Start from some root site e.g., Yahoo directories. Traverse the HREF links. Search(initialLink) fringe.Insert( initialLink ); loop.
University of Malta CSA3080: Lecture 6 © Chris Staff 1 of 20 CSA3080: Adaptive Hypertext Systems I Dr. Christopher Staff Department.
IR Homework #2 By J. H. Wang Mar. 31, Programming Exercise #2: Query Processing and Searching Goal: to search relevant documents for a given query.
Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.
Evaluation of Agent Building Tools and Implementation of a Prototype for Information Gathering Leif M. Koch University of Waterloo August 2001.
Boolean Model Hongning Wang Abstraction of search engine architecture User Ranker Indexer Doc Analyzer Index results Crawler Doc Representation.
1 Patrick Lambrix Department of Computer and Information Science Linköpings universitet Information Retrieval.
Vector Space Models.
1 Data Mining: Text Mining. 2 Information Retrieval Techniques Index Terms (Attribute) Selection: Stop list Word stem Index terms weighting methods Terms.
CIS 530 Lecture 2 From frequency to meaning: vector space models of semantics.
Natural Language Processing Topics in Information Retrieval August, 2002.
1 CS 430: Information Discovery Lecture 8 Collection-Level Metadata Vector Methods.
Search Engines Session 5 INST 301 Introduction to Information Science.
Information Retrieval Models School of Informatics Dept. of Library and Information Studies Dr. Miguel E. Ruiz.
IR Homework #2 By J. H. Wang Apr. 13, Programming Exercise #2: Query Processing and Searching Goal: to search for relevant documents Input: a query.
Information Retrieval and Web Search IR models: Vector Space Model Term Weighting Approaches Instructor: Rada Mihalcea.
1 Text Categorization  Assigning documents to a fixed set of categories  Applications:  Web pages  Recommending pages  Yahoo-like classification hierarchies.
INFORMATION RETRIEVAL Pabitra Mitra Computer Science and Engineering IIT Kharagpur
3: Search & retrieval: Structures. The dog stopped attacking the cat, that lived in U.S.A. collection corpus database web d1…..d n docs processed term-doc.
IR 6 Scoring, term weighting and the vector space model.
Automated Information Retrieval
Information Retrieval and Web Search
IST 516 Fall 2011 Dongwon Lee, Ph.D.
Basic Information Retrieval
Representation of documents and queries
موضوع پروژه : بازیابی اطلاعات Information Retrieval
Text Categorization Assigning documents to a fixed set of categories
CS 430: Information Discovery
Introduction to Information Retrieval
Chapter 5: Information Retrieval and Web Search
Information Retrieval and Web Design
Presentation transcript:

Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval

Menu Information Retrieval (IR) basics An example Evaluation: Precision and Recall

Information Retrieval (IR) The field of information retrieval deals with the representation, storage, organization of, access to information items. Search Engines: Search a large collection of documents to find the ones that satisfy an information need, e,g, find relevant documents Indexing Document representation Comparison with query Evaluation/feedback

IR basics Indexing Manual indexing: using controlled vocabularies, e,g, libraries, early version of Yahoo Automatic indexing: indexing program assigns keywords, phrases or other features, e.g. words from text of document Popular Retrieval Models Boolean: exact match Vector Space: best match Citation analysis models: best match Probabilistic models: best match COMP 307 1:4

Vector space Model Any text object can be represented by a term vector. Similarity is determined by distance in a vector space. Example Doc1: 0.3, 0.1, 0.4 Doc2: 0.8, 0.5, 0.6 Query: 0.0, 0.2,0.0 Vector Space Similarity: Cosine of the angle between the two vectors COMP 307 1:5

Text representation Term weights Term weights reflect the estimated importance of each term The more often a word occurs in a document, the better that term is in describing what the document is about But terms that appear in many documents in the collection are not very useful for distinguishing a relevant document from a non-relevant one COMP 307 1:6

Term weights: TF.IDF Term weight X i = TF * IDF TF: Term Frequency IDF: Inverse Document Frequency

An example Document1: cat eat mouse, mouse eat cocolate Document 2: cat eat mouse Document 3: mouse eat chocolate mouse Indexing terms: Vector representation for each document Query: cat Vector representation of query Which document is more relevant?

example

Evaluation Test collection TREC Parameters Recall: Percentage of all relevant documents that are found by a search Precision: Percentage of retrieved documents that are relevant

Discussion How to compare two IR systems Which is more important in Web search: precision or recall? What leads to Google’s success for IR