1 CS 502: Computing Methods for Digital Libraries Lecture 11 Information Retrieval I.

Presentation transcript:

1 CS 502: Computing Methods for Digital Libraries Lecture 11 Information Retrieval I

2 Administration Change to schedule: Office hours -- Wednesday Lecture on March 14 Mid-term examination: Wednesday, March 8, 7:30 to 8:30 Laptops?

3 Information Discovery
People have many reasons to look for information:
- Known item
- Facts
- Introduction or overview
- Related information
- Comprehensive search (classical problem of information retrieval)

4 Definitions Searching: seeking specific information within a body of information. The result of a search is a set of hits. Information retrieval: searching within a body of text. Browsing: unstructured exploration of a body of information.

5 Recall and Precision If information retrieval were perfect... Every hit would be relevant to the original query, and every relevant item in the body of information would be found. Precision: percentage of the hits that are relevant, the extent to which the set of hits retrieved by a query satisfies the requirement that generated the query. Recall: percentage of the relevant items that are found by the query, the extent to which the query found all the items that satisfy the requirement.

6 Recall and Precision: Example
Collection of 10,000 documents, 50 on a specific topic.
Ideal search finds these 50 documents and rejects all others.
Actual search identifies 25 documents; 20 are relevant but 5 are on other topics.
Precision: 20/25 = 0.8
Recall: 20/50 = 0.4
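
The arithmetic in this example can be checked with a short Python sketch. The function name `precision_recall` and the document ids are illustrative assumptions; only the counts (50 relevant, 25 retrieved, 20 in common) come from the slide.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two sets of document ids."""
    hits = retrieved & relevant  # relevant documents that were actually found
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ids: documents 0..49 are the 50 relevant ones.
relevant = set(range(50))
# The search returns 20 relevant documents plus 5 on other topics.
retrieved = set(range(20)) | set(range(1000, 1005))

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Note that computing recall here requires the full `relevant` set, which is exactly the quantity that is hard to obtain for a real collection.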

7 Measuring Precision and Recall Precision is easy to measure: A knowledgeable person looks at each document that is identified and decides whether it is relevant. In the example, only the 25 documents that are found need to be examined. Recall is difficult to measure: To know all relevant items, a knowledgeable person must go through the entire collection, looking at every object to decide if it fits the criteria. In the example, all 10,000 documents must be examined.

8 The Basics of Information Retrieval Query: A string of text, describing the information that the user is seeking. Each word of the query is called a search term. A query can be a single search term, a string of terms, a phrase in natural language, or a stylized expression using special symbols. Full text searching: Methods that compare the query with every word in the text, without distinguishing the function of the various words. Fielded searching: Methods that search on specific bibliographic or structural fields, such as author or heading.

9 Inverted File
Inverted file: a list of the words in a set of documents and their locations within those documents.
Table columns: Word, Document, Location. Example words: abacus, actor, aspen (document 5, location 43), atoll.
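
As a rough illustration of the data structure, the sketch below builds an inverted file in Python. The toy documents, the whitespace tokenization, and the name `build_inverted_file` are assumptions made for this example and are not taken from the lecture.

```python
from collections import defaultdict

def build_inverted_file(documents):
    """Map each word to a list of (document id, word position) pairs."""
    inverted = defaultdict(list)
    for doc_id in sorted(documents):
        words = documents[doc_id].lower().split()
        for position, word in enumerate(words, start=1):
            inverted[word].append((doc_id, position))
    # Keep the words in alphabetical order, as in a printed inverted file.
    return dict(sorted(inverted.items()))

# Hypothetical toy collection (not the lecture's data).
documents = {
    1: "the actor bought an abacus",
    2: "an aspen grows on the atoll",
    3: "the abacus and the actor",
}

inverted_file = build_inverted_file(documents)
for word in ("abacus", "actor", "aspen", "atoll"):
    print(word, inverted_file[word])
```

Each entry records both the document and the position within it, which is what later makes adjacency searches possible.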

10 Inverted List and Stop List Inverted list: All the entries in an inverted file that apply to a specific word, e.g. abacus Stop list: Set of common words that are ignored for searching, e.g., "a", "the", "and", "be", "of",... How do you decide which words to include?
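
A stop list can be applied when the text is split into words, before anything is added to the inverted file. The following is a minimal sketch: `STOP_LIST` holds only the five words quoted on the slide, and `index_terms` is a hypothetical helper name.

```python
# Only the examples given on the slide; a real stop list would be longer.
STOP_LIST = {"a", "the", "and", "be", "of"}

def index_terms(text, stop_list=STOP_LIST):
    """Split text into lowercase words and drop the words on the stop list."""
    return [word for word in text.lower().split() if word not in stop_list]

terms = index_terms("The actor and the abacus of Ithaca")
print(terms)
```

The same filter would normally be applied to queries, so that a stop word in a query is not searched for.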

11 Boolean Search (Keyword)
Boolean query: two or more search terms, related by logical operators, e.g., and, or, not, adjacent, near.
Example: "abacus and actor"
Process:
- inverted list for "abacus": documents 3 and 19
- inverted list for "actor": documents 2, 19, and 29
- intersection of these two lists: document 19
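
The process on this slide maps directly onto set operations. In the sketch below the two inverted lists are the ones from the example; the collection size (documents 1 to 30) and the function names are assumptions added so that the not operator has something to complement against.

```python
# Inverted lists from the slide, reduced to sets of document numbers.
inverted_lists = {
    "abacus": {3, 19},
    "actor": {2, 19, 29},
}
# Assumption: the collection consists of documents 1..30.
all_documents = set(range(1, 31))

def search_and(a, b):
    """Documents containing both terms: intersection of the two lists."""
    return inverted_lists.get(a, set()) & inverted_lists.get(b, set())

def search_or(a, b):
    """Documents containing either term: union of the two lists."""
    return inverted_lists.get(a, set()) | inverted_lists.get(b, set())

def search_not(a):
    """Documents that do not contain the term: needs the whole collection."""
    return all_documents - inverted_lists.get(a, set())

print(search_and("abacus", "actor"))
print(sorted(search_or("abacus", "actor")))
print(len(search_not("abacus")))
```

The not operator is the only one that has to touch the full set of document numbers rather than just the inverted lists of the query terms.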

12 Boolean Diagram
[Diagram of two sets, A and B, showing the regions: A and B; A or B; not (A or B)]

13 Performance
Create Inverted Index
- File size is large (perhaps 50% of document collection)
- Building and updating: sort by word of entire collection -- O(n log n)
Query Processing
- Find a specific inverted list: requires a search -- O(log2 n) (fast if in memory, slow if disk search)
- Read inverted lists: requires disk I/O (slow)
- Merge inverted lists: within memory (fast), but...
  very large lists are computationally intensive
  not operator is potentially expensive
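
The two in-memory query steps can be sketched as follows: a binary search over the sorted word list, and a single linear pass that merges two sorted inverted lists. The function names are illustrative, and the document numbers for "atoll" are made up; the other lists follow the earlier slides.

```python
from bisect import bisect_left

def find_inverted_list(words, lists, term):
    """Binary search over the sorted word list: about log2(n) comparisons."""
    i = bisect_left(words, term)
    if i < len(words) and words[i] == term:
        return lists[i]
    return []

def merge_and(list_a, list_b):
    """Intersect two sorted inverted lists in one pass over both."""
    result, i, j = [], 0, 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return result

# Sorted words with their sorted document lists ("atoll" ids are hypothetical).
words = ["abacus", "actor", "aspen", "atoll"]
lists = [[3, 19], [2, 19, 29], [5], [11, 34]]

abacus = find_inverted_list(words, lists, "abacus")
actor = find_inverted_list(words, lists, "actor")
print(merge_and(abacus, actor))
```

The merge costs time proportional to the combined length of the two lists, which is why very long lists become the dominant cost.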

14 Special Techniques
Adjacency: digital adj libraries
- Searches for the phrase "digital libraries"
- Fast operation on inverted lists (near is an extension of adj)
Truncation: comp?
- Searches for words that begin "comp..."
- Finds "computer", "computers", "computing",... but also "compete", "company", "complete", etc.
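
Both techniques can be sketched over simple in-memory structures. Everything below is an illustrative assumption: the positional index `positions`, the toy vocabulary, and the treatment of near as "within `max_gap` words, in order" are choices made for this example, not the lecture's definitions.

```python
from bisect import bisect_left

# Toy positional index: word -> {document id: sorted word positions}.
positions = {
    "digital": {1: [4, 10], 2: [7]},
    "libraries": {1: [5], 2: [2]},
}

def adjacent(first, second, max_gap=1):
    """Documents where `second` follows `first` within max_gap words.

    max_gap=1 is adjacency; a larger value is one possible reading of near.
    """
    hits = set()
    for doc_id, first_positions in positions.get(first, {}).items():
        second_positions = set(positions.get(second, {}).get(doc_id, []))
        if any(p + gap in second_positions
               for p in first_positions
               for gap in range(1, max_gap + 1)):
            hits.add(doc_id)
    return hits

def truncate(vocabulary, prefix):
    """All words in a sorted vocabulary that begin with the prefix."""
    matches = []
    for word in vocabulary[bisect_left(vocabulary, prefix):]:
        if not word.startswith(prefix):
            break  # sorted order: no later word can match
        matches.append(word)
    return matches

vocabulary = sorted(["abacus", "company", "compete", "complete",
                     "computer", "computers", "computing", "concert"])

print(adjacent("digital", "libraries"))
print(truncate(vocabulary, "comp"))
```

Because the vocabulary is sorted, all words sharing a prefix sit next to each other, so truncation needs one binary search followed by a short scan.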

15 Weaknesses of Boolean Searching
Only finds exact matches:
- library does not match libraries
- J. Smith does not match John Smith
- oak does not match tree
Long queries usually get no hits (because of and operators).
An abstract does not match the documents that it applies to!
Specialized search techniques:
- Require trained specialists
- Untrained users fail to find what they want
- Specialists have difficulty adapting to modern search systems

16 Vector Space Methods
Problem: Given two sections of text, how similar are they? (One text may be a query.)
- Encourages long queries, which are rich in information.
- An abstract should be very similar to its source document.
- Accepts probabilistic aspects of writing and searching. Different words will be used if an author writes the same document twice.
[Gerard Salton, Cornell Department of Computer Science.]

17 Vector Space Methods: Concept n-dimensional space, where n is the total number of different words in the set of documents. Each document is represented by a vector, with magnitude in each dimension equal to the number of times that the corresponding word appears in the document. Similarity between two documents is measured by the angle between their vectors, usually expressed as the cosine of that angle.

18 Example
D1 -> ant ant bee
D2 -> bee hog ant dog
D3 -> cat gnu dog eel fox
      ant  bee  cat  dog  eel  fox  gnu  hog  length
D1     2    1    0    0    0    0    0    0    √5
D2     1    1    0    1    0    0    0    1    √4
D3     0    0    1    1    1    1    1    0    √5
d(D1, D2) = (2×1 + 1×1) / (√5 × √4) = 3 / (2√5) ≈ 0.67

19 Example (continued)
Similarity of documents in example:
      D1    D2    D3
D1    1     0.67  0
D2    0.67  1     0.22
D3    0     0.22  1
Similarity measures the number of occurrences of words, but not other characteristics of the documents.
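
The similarities for D1, D2, and D3 can be recomputed with a short Python sketch. The three documents are the ones from the example; the helper name `cosine` and the use of `collections.Counter` for term vectors are assumptions of this illustration.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())  # missing words count 0
    length_a = sqrt(sum(count * count for count in a.values()))
    length_b = sqrt(sum(count * count for count in b.values()))
    return dot / (length_a * length_b)

# The three documents from the example.
docs = {
    "D1": "ant ant bee",
    "D2": "bee hog ant dog",
    "D3": "cat gnu dog eel fox",
}
vectors = {name: Counter(text.split()) for name, text in docs.items()}

similarity = {
    (p, q): round(cosine(vectors[p], vectors[q]), 2)
    for p in vectors for q in vectors
}
for p in vectors:
    print(p, [similarity[(p, q)] for q in vectors])
```

D1 and D3 share no words, so their similarity is 0, while D2 and D3 share only "dog".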