
1 CS 430: Information Discovery Lecture 2 Introduction to Text Based Information Retrieval

2 Course Administration Please send all questions about the course to: The message will be sent to (Bill Arms), (Manpreet Singh), (Sid Anand), and (Martin Guerrero)

3 Course Administration Programming in Perl Assignments 2, 3 and 4 require programs to be written in Perl. An introduction to programming in Perl will be given at 7:30 p.m. on Wednesdays September 19 and October 3. These classes are optional. There will not be regular discussion classes on these dates. Materials about Perl and further information about these classes will be posted on the course web site.

4 Course Administration Discussion class, Wednesday, September 4 Read and be prepared to discuss: Harman, D., Fox, E., Baeza-Yates, R.A., Inverted files. (Frakes and Baeza-Yates, Chapter 3) Phillips Hall 101, 7:30 to 8:30 p.m.

5 Classical Information Retrieval [Diagram: media type (text versus image, video, audio, etc.) against access method (searching, browsing, linking); approaches shown include statistical methods, user-in-the-loop, catalogs and indexes (metadata) (CS 502), and natural language processing (CS 474).]

6 Recall and Precision If information retrieval were perfect... Every hit would be relevant to the original query, and every relevant item in the body of information would be found. Precision: percentage of the hits that are relevant, the extent to which the set of hits retrieved by a query satisfies the requirement that generated the query. Recall: percentage of the relevant items that are found by the query, the extent to which the query found all the items that satisfy the requirement.

7 Recall and Precision: Example Collection of 10,000 documents, 50 on a specific topic. Ideal search finds these 50 documents and rejects all others. Actual search identifies 25 documents; 20 are relevant but 5 are on other topics. Precision: 20/25 = 0.8 Recall: 20/50 = 0.4
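The same arithmetic can be expressed directly in code. The sketch below is illustrative only and is not part of the lecture; the document identifiers are made up so that the numbers match the example above.

```python
# A minimal sketch that reproduces the precision and recall arithmetic above
# from sets of document identifiers.

def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of retrieved and relevant doc IDs."""
    hits = retrieved & relevant                  # relevant documents actually found
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical IDs matching the example: 50 relevant documents, 25 retrieved,
# of which 20 are relevant.
relevant = {f"doc{i}" for i in range(50)}
retrieved = {f"doc{i}" for i in range(20)} | {f"other{i}" for i in range(5)}

p, r = precision_recall(retrieved, relevant)
print(p, r)   # 0.8 0.4
```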

8 Measuring Precision and Recall Precision is easy to measure: A knowledgeable person looks at each document that is identified and decides whether it is relevant. In the example, only the 25 documents that are found need to be examined. Recall is difficult to measure: To know all relevant items, a knowledgeable person must go through the entire collection, looking at every object to decide if it fits the criteria. In the example, all 10,000 documents must be examined.

9 Relevance and Ranking Precision and recall assume that a document is either relevant to a query or not relevant. Often a user will consider a document to be partially relevant. Ranking methods measure the degree of similarity between a query and a document. [Diagram: requests and documents connected by a "Similar" mechanism.] Similar: how similar is a document to a request?
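One common way to make "degree of similarity" concrete is cosine similarity between term-count vectors, which the vector-method lectures develop properly. The sketch below is an illustration, not the course's definition of similarity.

```python
# Illustrative similarity measure: cosine similarity between term-count vectors.
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(cosine_similarity("boolean retrieval of text", "ranked retrieval of full text"))
```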

10 Documents A textual document is a digital object consisting of a sequence of words and other symbols, e.g., punctuation. The individual words and other symbols are known as tokens or terms. A textual document can be: Free text, also known as unstructured text, which is a continuous sequence of tokens. Fielded text, also known as structured text, in which the text is broken into sections that are distinguished by tags or other markup. [Methods of markup, e.g., XML, are covered in CS 502.]
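As a small illustration (not from the slides), the distinction between free text and fielded text might look like this in code; the tag names in the fielded example are hypothetical.

```python
import re

# Free text: a continuous sequence of tokens.
free_text = "Human-computer interaction, 2nd edition."
tokens = re.findall(r"[A-Za-z0-9]+", free_text)   # crude tokenizer: runs of letters and digits
print(tokens)   # ['Human', 'computer', 'interaction', '2nd', 'edition']

# Fielded (structured) text: sections distinguished by tags or other markup.
# The tag names below are made up for illustration only.
fielded_text = "<title>Human-computer interaction</title><edition>2nd</edition>"
fields = dict(re.findall(r"<(\w+)>(.*?)</\1>", fielded_text))
print(fields)   # {'title': 'Human-computer interaction', 'edition': '2nd'}
```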

11 Word Frequency Observation: Some words are more common than others. Statistics: Most large collections of text documents have similar statistical characteristics. These statistics influence the effectiveness and efficiency of the data structures used to index documents, and many retrieval models rely on them. The following example is taken from: Jamie Callan, Characteristics of Text.

12 Rank Frequency Distribution For all the words in a collection of documents, for each word w: f(w) is the frequency with which w appears; r(w) is the rank of w in order of frequency, e.g., the most commonly occurring word has rank 1. [Plot of frequency f against rank r, with a point marking a word w that has rank r and frequency f.]

13 [Table: raw frequencies f of the most common words in the example collection, in three columns of (word, f): the, of, to, a, in, and, that, for, is, said, it, on, by, as, at, mr, with; from, he, million, year, its, be, was, company, an, has, are, have, but, will, say, new, share; or, about, market, they, this, would, you, which, bank, stock, trade, his, more, who, one, their. Example frequencies: from 96,900; or 54,958; company 83,070; which 48,273; share 63,925.]

14 Zipf's Law If the words, w, in a collection are ranked, r(w), by their frequency, f(w), they roughly fit the relation: r(w) * f(w) = c Different collections have different constants c. In English text, c tends to be about n / 10, where n is the number of distinct words in the collection. For a weird but wonderful discussion of this and many other examples of naturally occurring rank frequency distributions, see: Zipf, G. K., Human Behaviour and the Principle of Least Effort. Addison-Wesley, 1949
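A rough check of this relation can be scripted in a few lines. The sketch below is not from the lecture; it ranks words by frequency and prints r(w) * f(w) at a handful of ranks so the roughly constant product can be inspected. The corpus path is a placeholder.

```python
# Rough empirical check of Zipf's law: rank words by frequency and inspect
# the product r(w) * f(w), which should be roughly constant for a large text.
from collections import Counter

def zipf_products(text, ranks=(1, 10, 100, 1000)):
    counts = Counter(text.lower().split())
    ordered = counts.most_common()              # [(word, frequency), ...] by descending frequency
    products = {}
    for r in ranks:
        if r <= len(ordered):
            word, freq = ordered[r - 1]
            products[word] = r * freq
    return products

# Any large plain-text file will do; "corpus.txt" is a placeholder name.
# with open("corpus.txt") as fh:
#     print(zipf_products(fh.read()))
```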

15 1000*rf/n for the words in the example collection (three columns of word, 1000*rf/n):

the 59        from 92         or 101
of 58         he 95           about 102
to 82         million 98      market 101
a 98          year 100        they 103
in 103        its 100         this 105
and 122       be 104          would 107
that 75       was 105         you 106
for 84        company 109     which 107
is 72         an 105          bank 109
said 78       has 106         stock 110
it 78         are 109         trade 112
on 77         have 112        his 114
by 81         but 114         more 114
as 80         will 117        who 106
at 80         say 113         one 107
mr 86         new 112         their 108
with 91       share 114

For most words the product 1000*rf/n stays close to 100, consistent with c being about n/10.

16 Luhn's Proposal "It is here proposed that the frequency of word occurrence in an article furnishes a useful measurement of word significance. It is further proposed that the relative position within a sentence of words having given values of significance furnish a useful measurement for determining the significance of sentences. The significance factor of a sentence will therefore be based on a combination of these two measurements." Luhn, H.P., The automatic creation of literature abstracts, IBM Journal of Research and Development, 2, (1958)
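To make Luhn's two measurements concrete, here is a minimal sketch in the spirit of his proposal. It is an illustration, not Luhn's exact method and not code from the course: words whose frequency falls between two assumed cut-offs are treated as significant, and a sentence is scored by the fraction of its words that are significant.

```python
# Illustrative sketch of Luhn-style sentence scoring. The cut-off values and
# the whitespace tokenization (punctuation is not stripped) are simplifications.
from collections import Counter

def significant_words(sentences, lower=2, upper=50):
    """Words whose collection frequency lies between the two cut-offs."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    return {w for w, f in counts.items() if lower <= f <= upper}

def sentence_significance(sentence, significant):
    words = sentence.lower().split()
    return sum(w in significant for w in words) / len(words) if words else 0.0

sentences = ["Automatic abstracting selects significant sentences.",
             "Significant words tend to cluster in significant sentences."]
sig = significant_words(sentences, lower=2, upper=50)
print([round(sentence_significance(s, sig), 2) for s in sentences])
```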

17 Methods that Build on Zipf's Law Term weighting: give differing weights to terms based on their frequency, with the most frequent words weighted less. Stop lists: ignore the most frequent words (upper cut-off). Significant words: ignore both the most frequent and the least frequent words (upper and lower cut-off).
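As a small illustration of the stop-list idea (a sketch of one possible approach, not code from the course), a stop list can be derived from the k most frequent words in a collection and then used to filter a token stream; k = 2 below is an arbitrary choice for the toy example.

```python
# Derive a stop list from the k most frequent words and filter tokens with it.
from collections import Counter

def build_stop_list(documents, k=20):
    counts = Counter(w for doc in documents for w in doc.lower().split())
    return {word for word, _ in counts.most_common(k)}

def remove_stop_words(tokens, stop_list):
    return [t for t in tokens if t.lower() not in stop_list]

docs = ["the cat sat on the mat", "the dog chased the cat"]
stops = build_stop_list(docs, k=2)   # here: {'the', 'cat'}
print(remove_stop_words("the cat chased the dog".split(), stops))   # ['chased', 'dog']
```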

18 Cut-off Levels for Significant Words [Plot of frequency f against rank r: an upper cut-off and a lower cut-off bracket the mid-frequency region, where the resolving power of significant words peaks. From Van Rijsbergen, Ch. 2.]

19 Approaches to Weighting Boolean information retrieval: the weight of term i in document j is w(i, j) = 1 if term i occurs in document j, and w(i, j) = 0 otherwise. Vector space methods: the weight of term i in document j satisfies 0 < w(i, j) <= 1 if term i occurs in document j, and w(i, j) = 0 otherwise.
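A minimal sketch of the two schemes follows. The normalized term frequency used for the vector-space weight is an illustrative choice, not the weighting the course develops in later lectures.

```python
# Two toy weighting schemes for term i in document j.
from collections import Counter

def boolean_weight(term, document_tokens):
    """1 if the term occurs in the document, 0 otherwise."""
    return 1 if term in document_tokens else 0

def tf_weight(term, document_tokens):
    """A simple vector-space style weight in (0, 1]: term frequency divided by
    the frequency of the most common term in the document."""
    counts = Counter(document_tokens)
    return counts[term] / max(counts.values()) if term in counts else 0.0

doc = "the cat sat on the mat".split()
print(boolean_weight("cat", doc), tf_weight("cat", doc))   # 1 0.5
```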

20 Functional View of Information Retrieval [Diagram: requests and documents both flow into an index database and are compared by a "Similar" component.] Similar: mechanism for determining the similarity of the request representation to the information item representation.

21 Major Subsystems Indexing subsystem: Receives incoming documents, converts them to the form required for the index and adds them to the index database. Search subsystem: Receives incoming requests, converts them to the form required for searching the index and searches the database for matching documents. The index database is the central hub of the system.

22 Example: Indexing Subsystem [Flow diagram, from Frakes, page 7: documents → assign document IDs → break into words (text to words) → stoplist (words to non-stoplist words) → stemming* (to stemmed words) → term weighting* (to terms with weights) → index database; document numbers and *field numbers also flow from ID assignment into the index database. *Indicates optional operation.]
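The flow above can be sketched in a few lines of code. This is an illustration of the pipeline, not an implementation from the course; the stop list is a small made-up set and the stemmer is a deliberately crude stand-in for an algorithm such as Porter's.

```python
# Illustrative indexing pipeline: break documents into words, remove stop
# words, stem (crudely), weight terms, and add them to an inverted-index-like
# structure keyed by term.
import re
from collections import Counter, defaultdict

STOP_LIST = {"the", "of", "to", "a", "in", "and", "that", "is", "it", "on"}

def crude_stem(word):
    # Placeholder stemmer: strip a few common suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_document(doc_id, text, index):
    words = [w.lower() for w in re.findall(r"[a-zA-Z]+", text)]
    terms = [crude_stem(w) for w in words if w not in STOP_LIST]
    counts = Counter(terms)
    max_count = max(counts.values()) if counts else 1
    for term, count in counts.items():
        index[term][doc_id] = count / max_count      # simple term weight in (0, 1]
    return index

index = defaultdict(dict)
index_document(1, "Searching the full text of documents", index)
index_document(2, "Inverted files and Boolean operations", index)
print(dict(index))
```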

23 Example: Search Subsystem [Flow diagram: query → parse query → stoplist (to non-stoplist words) → stemming* (to stemmed words) → query terms → Boolean operations against the index database → retrieved document set → ranking* (to ranked document set) → relevance judgments* (to relevant document set). *Indicates optional operation.]
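A matching sketch of the search side, again an illustration rather than course code, reuses STOP_LIST, crude_stem, and the index built in the indexing sketch above.

```python
# Illustrative search pipeline: parse the query, remove stop words, stem, then
# score documents by summing the weights of matching query terms.
import re

def search(query, index, stop_list, stem):
    words = [w.lower() for w in re.findall(r"[a-zA-Z]+", query)]
    query_terms = [stem(w) for w in words if w not in stop_list]
    scores = {}
    for term in query_terms:
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    # Ranked document set: highest-scoring documents first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(search("searching full text", index, STOP_LIST, crude_stem))
```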