CS 430 / INFO 430 Information Retrieval, Lecture 2: Text Based Information Retrieval


1 CS 430 / INFO 430 Information Retrieval Lecture 2 Text Based Information Retrieval

2 Course Administration Web site: Notices: See the course web site Sign-up sheet: If you did not sign up at the first class, please sign up now.

3 Course Administration Please send all questions about the course to: … The message will be sent to: William Arms, All Teaching Assistants.

4 Course Administration Discussion class, Wednesday, September 1 Upson B17, 7:30 to 8:30 p.m. Prepare for the class as instructed on the course Web site. Participation in the discussion classes is one third of the grade, but tomorrow's class will not be included in the grade calculation.

5 Discussion Classes Format: Questions. Ask a member of the class to answer. Provide opportunity for others to comment. When answering: Stand up. Give your name. Make sure that the TA hears it. Speak clearly so that all the class can hear. Suggestions: Do not be shy about presenting partial answers. Differing viewpoints are welcome.

6 Information Retrieval from Collections of Textual Documents Major Categories of Methods: 1. Exact matching (Boolean) 2. Ranking by similarity to query (vector space model) 3. Ranking of matches by importance of documents (PageRank) 4. Combination methods. The course begins with Boolean matching, then similarity methods, then importance methods.

7 Text Based Information Retrieval Most matching methods are based on Boolean operators. Most ranking methods are based on the vector space model. Web search methods combine vector space model with ranking based on importance of documents. Many practical systems combine features of several approaches. In the basic form, all approaches treat words as separate tokens with minimal attempt to interpret them linguistically.

8 Documents A textual document is a digital object consisting of a sequence of words and other symbols, e.g., punctuation. The individual words and other symbols are known as tokens or terms. A textual document can be: Free text, also known as unstructured text, which is a continuous sequence of tokens. Fielded text, also known as structured text, in which the text is broken into sections that are distinguished by tags or other markup. [Methods of markup, e.g., XML, are covered in CS 431.]

9 Word Frequency Observation: Some words are more common than others. Statistics: Most large collections of text documents have similar statistical characteristics. These statistics influence the effectiveness and efficiency of the data structures used to index documents, and many retrieval models rely on them.

10 Word Frequency Example The following example is taken from: Jamie Callan, Characteristics of Text, 1997. Sample of 19 million words. The next slide shows the 50 commonest words in rank order (r), with their frequency (f).

11 The 50 commonest words, in rank order (the ranks run down each column of the original table, with frequency f shown where it survives): ranks 1-17: the, of, to, a, in, and, that, for, is, said, it, on, by, as, at, mr, with; ranks 18-34: from 96,900, he, million, year, its, be, was, company 83,070, an, has, are, have, but, will, say, new, share 63,925; ranks 35-50: or 54,958, about, market, they, this, would, you, which 48,273, bank, stock, trade, his, more, who, one, their.

12 Rank Frequency Distribution For all the words in a collection of documents, for each word w: f is the frequency with which w appears; r is the rank of w in order of frequency (the most commonly occurring word has rank 1, etc.). [Plot: frequency f against rank r, with a word w marked at rank r and frequency f.]

13 Rank Frequency Example The next slide shows the words in Callan's data normalized. In this example: r is the rank of word w in the sample. f is the frequency of word w in the sample. n is the total number of word occurrences in the sample.

14 For each word, the value r * f * 1000 / n. Ranks 1-17: the 59, of 58, to 82, a 98, in 103, and 122, that 75, for 84, is 72, said 78, it 78, on 77, by 81, as 80, at 80, mr 86, with 91. Ranks 18-34: from 92, he 95, million 98, year 100, its 100, be 104, was 105, company 109, an 105, has 106, are 109, have 112, but 114, will 117, say 113, new 112, share 114. Ranks 35-50: or 101, about 102, market 101, they 103, this 105, would 107, you 106, which 107, bank 109, stock 110, trade 112, his 114, more 114, who 106, one 107, their 108.

15 Zipf's Law If the words, w, in a collection are ranked by their frequency, f, with rank r, they roughly fit the relation: r * f = c. Different collections have different constants c. In English text, c tends to be about n / 10, where n is the number of word occurrences in the collection. For a weird but wonderful discussion of this and many other examples of naturally occurring rank frequency distributions, see: Zipf, G. K., Human Behavior and the Principle of Least Effort. Addison-Wesley, 1949.
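
A rough way to check this relation is sketched below in Python; the variable text is a stand-in for any large text sample (not Callan's data), and the tokenization is deliberately crude.

from collections import Counter

def zipf_check(text, top=10):
    # Crude tokenization: lowercase and split on whitespace.
    tokens = text.lower().split()
    n = len(tokens)                       # total word occurrences
    ranked = Counter(tokens).most_common(top)
    # For each of the commonest words, r * f should be roughly constant
    # (about n / 10 in English text, per the slide above).
    return n, [(r, word, f, r * f) for r, (word, f) in enumerate(ranked, start=1)]

# Example use: n, rows = zipf_check(open("sample.txt").read())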

16 Methods that Build on Zipf's Law Stop lists: Ignore the most frequent words (upper cut-off). Used by almost all systems. Significant words: Ignore the most frequent and least frequent words (upper and lower cut-off). Rarely used. Term weighting: Give differing weights to terms based on their frequency, with the most frequent words weighted less. Used by almost all ranking methods.
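
A minimal sketch of the upper and lower cut-offs, with illustrative thresholds (the fraction 0.01 and the count 2 are assumptions for the example, not values from the lecture):

from collections import Counter

def apply_cutoffs(tokens, upper_fraction=0.01, lower_count=2):
    counts = Counter(tokens)
    n = len(tokens)
    significant = {}
    for word, f in counts.items():
        if f > upper_fraction * n:   # upper cut-off: drop the most frequent words (stop-list effect)
            continue
        if f < lower_count:          # lower cut-off: drop the rarest words
            continue
        significant[word] = f
    return significant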

17 1. Exact Matching (Boolean Model) [Diagram: the query and the set of documents feed an index database; a matching mechanism determines whether a document matches the query and returns the set of hits.]

18 Evaluation of Matching: Recall and Precision If information retrieval were perfect... Every hit would be relevant to the original query, and every relevant item in the body of information would be found. Precision: percentage (or fraction) of the hits that are relevant, i.e., the extent to which the set of hits retrieved by a query satisfies the requirement that generated the query. Recall: percentage (or fraction) of the relevant items that are found by the query, i.e., the extent to which the query found all the items that satisfy the requirement.

19 Recall and Precision with Exact Matching: Example Collection of 10,000 documents, 50 on a specific topic. The ideal search finds these 50 documents and rejects all others. The actual search identifies 25 documents; 20 are relevant but 5 are on other topics. Precision: 20/25 = 0.8 (80% of the hits are relevant). Recall: 20/50 = 0.4 (40% of the relevant documents are found).
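
The same arithmetic, written as a short sketch; the document ids are invented for the example, since only the set sizes matter:

relevant = set(range(1, 51))                                # the 50 documents on the topic
hits = set(range(1, 21)) | {9001, 9002, 9003, 9004, 9005}   # 20 relevant hits + 5 off-topic hits

precision = len(hits & relevant) / len(hits)       # 20 / 25 = 0.8
recall = len(hits & relevant) / len(relevant)      # 20 / 50 = 0.4
print(precision, recall)                           # 0.8 0.4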

20 Measuring Precision and Recall Precision is easy to measure: A knowledgeable person looks at each document that is identified and decides whether it is relevant. In the example, only the 25 documents that are found need to be examined. Recall is difficult to measure: To know all relevant items, a knowledgeable person must go through the entire collection, looking at every object to decide if it fits the criteria. In the example, all 10,000 documents must be examined.

21 Query A query is a string to match against entries in an index. The string may contain: search terms, e.g., computation; operators, e.g., computation and parallel; fields, e.g., author = Newton; metacharacters, e.g., b[aeiou]n*g. (Metacharacters can be used to build regular expressions, which will be covered later in the course.)
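
For illustration, the metacharacter pattern above can be read as a regular expression; a small sketch using Python's re module, with a word list invented for the example:

import re

pattern = re.compile(r"^b[aeiou]n*g$")   # b, one vowel, zero or more n, then g
words = ["bag", "bing", "banng", "bong", "brag", "bg"]
print([w for w in words if pattern.match(w)])   # ['bag', 'bing', 'banng', 'bong']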

22 Boolean Queries Boolean query: two or more search terms, related by the logical operators and, or, not. Examples: abacus and actor; abacus or actor; (abacus and actor) or (abacus and atoll); not actor.

23 Boolean Diagram [Venn diagram of two sets A and B, showing the regions A and B, A or B, and not (A or B).]

24 Adjacent and Near Operators abacus adj actor: the terms abacus and actor are adjacent to each other, as in the string "abacus actor". abacus near 4 actor: the terms abacus and actor are near to each other, as in the string "the actor has an abacus". Some systems support other operators, such as with (two terms in the same sentence) or same (two terms in the same paragraph).

25 Evaluation of Boolean Operators Precedence of operators must be defined: adj and near (highest); and and not; or (lowest). Example: A and B or C and B is evaluated as (A and B) or (C and B).
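
The rule can be illustrated with Python sets of matching document ids, since the set operator & (and) binds more tightly than | (or); the ids below are invented for the example:

A = {1, 2, 3}
B = {2, 3, 4}
C = {3, 4, 5}
# A and B or C and B, evaluated with 'and' before 'or':
assert (A & B | C & B) == ((A & B) | (C & B)) == {2, 3, 4}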

26 Inverted File Inverted file: A list of search terms that are used to index a set of documents. The inverted file is organized for associative look-up, i.e., to answer the question, "In which documents does a specified search term appear?" In practical applications, the inverted file contains related information, such as the location within the document where the search terms appear.

27 Inverted File -- Basic Concept [Table: word → documents in which it appears. abacus → three documents, including 19; actor → documents including 19; aspen → 5; atoll → 11, 34.] Stop words are removed before building the index.
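
A minimal sketch of building a document-level inverted file; the stop list and the docs dictionary are illustrative assumptions, not the lecture's data:

STOP_WORDS = {"the", "a", "an", "and", "or", "not", "of", "to", "in"}

def build_inverted_file(docs):
    # docs: dictionary mapping document id -> text
    index = {}
    for doc_id, text in docs.items():
        for token in set(text.lower().split()):
            if token in STOP_WORDS:          # stop words are removed before building the index
                continue
            index.setdefault(token, []).append(doc_id)
    for postings in index.values():
        postings.sort()                      # keep each inverted list sorted by document id
    return index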

28 Inverted List -- Concept Inverted List: all the entries in an inverted file that apply to a specific word, e.g., abacus. Posting: an entry in an inverted list; e.g., there are three postings for "abacus".

29 Evaluating a Boolean Query To evaluate the and operator, merge the two inverted lists with a logical AND operation. Example: abacus and actor. [Figure: the postings for abacus are merged with the postings for actor.] Document 19 is the only document that contains both terms, "abacus" and "actor".
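
A sketch of the merge, assuming both inverted lists are sorted by document id (a single linear walk over the two lists):

def and_merge(postings1, postings2):
    i = j = 0
    result = []
    while i < len(postings1) and j < len(postings2):
        if postings1[i] == postings2[j]:
            result.append(postings1[i])
            i += 1
            j += 1
        elif postings1[i] < postings2[j]:
            i += 1
        else:
            j += 1
    return result

# If document 19 is the only id common to the two lists, and_merge returns [19].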

30 Enhancements to Inverted Files -- Concept Location: the inverted file can hold information about the location of each term within the document. Uses: adjacency and near operators; user interface design (highlight the location of search terms). Frequency: the inverted file includes the number of postings for each term. Uses: term weighting; query processing optimization.

31 Inverted File -- Concept (Enhanced) [Table: for each word (abacus, actor, aspen, atoll), the postings record pairs of document and location within the document.]
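
A sketch of the enhanced structure as a positional index, mapping each word to (document, location) postings; the tokenization is again deliberately crude:

def build_positional_index(docs):
    # docs: dictionary mapping document id -> text
    index = {}
    for doc_id, text in docs.items():
        for location, token in enumerate(text.lower().split()):
            index.setdefault(token, []).append((doc_id, location))
    return index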

32 Evaluating an Adjacency Operation Example: abacus adj actor. [Figure: the positional postings for abacus and actor are merged.] Document 19, locations 212 and 213, is the only place where the terms "abacus" and "actor" occur adjacent to each other.
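
Using positional postings of the kind sketched above, adjacency can be checked by looking for the second term one location after the first; with abacus at (19, 212) and actor at (19, 213), as on the slide, this returns the single match:

def adjacent(postings1, postings2):
    # postings are lists of (document, location) pairs
    positions2 = set(postings2)
    return [(doc, loc) for doc, loc in postings1 if (doc, loc + 1) in positions2]

# adjacent([(19, 212)], [(19, 213)])  ->  [(19, 212)]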