CS 430 / INFO 430 Information Retrieval

Slides:



Advertisements
Similar presentations
Chapter 5: Introduction to Information Retrieval
Advertisements

C O N T E X T - F R E E LANGUAGES ( use a grammar to describe a language) 1.
Properties of Text CS336 Lecture 3:. 2 Generating Document Representations Want to automatically generate with little human intervention Use significant.
Lecture 11 Search, Corpora Characteristics, & Lucene Introduction.
Intelligent Information Retrieval CS 336 –Lecture 3: Text Operations Xiaoyan Li Spring 2006.
CS 430 / INFO 430 Information Retrieval
1 Discussion Class 3 The Porter Stemmer. 2 Course Administration No class on Thursday.
CS 430 / INFO 430 Information Retrieval
Web Search – Summer Term 2006 I. General Introduction (c) Wolfgang Hürst, Albert-Ludwigs-University.
Text Operations: Preprocessing. Introduction Document preprocessing –to improve the precision of documents retrieved –lexical analysis, stopwords elimination,
Information Retrieval in Practice
Search Engines and Information Retrieval
Query Operations: Automatic Local Analysis. Introduction Difficulty of formulating user queries –Insufficient knowledge of the collection –Insufficient.
1 CS 430 / INFO 430 Information Retrieval Lecture 5 Searching Full Text 5.
1 CS 430 / INFO 430 Information Retrieval Lecture 8 Query Refinement: Relevance Feedback Information Filtering.
The College of Saint Rose CIS 460 – Search and Information Retrieval David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice, 1st.
1 CS 430: Information Discovery Lecture 10 Cranfield and TREC.
WMES3103 : INFORMATION RETRIEVAL
1 CS 430: Information Discovery Lecture 4 Data Structures for Information Retrieval.
Properties of Text CS336 Lecture 3:. 2 Information Retrieval Searching unstructured documents Typically text –Newspaper articles –Web pages Other documents.
CS 430 / INFO 430 Information Retrieval
What is a document? Information need: From where did the metaphor, doing X is like “herding cats”, arise? quotation? “Managing senior programmers is like.
CSE 730 Information Retrieval of Biomedical Data The use of medical lexicon in biomedical IR.
1 CS 430 / INFO 430 Information Retrieval Lecture 3 Searching Full Text 3.
1 CS 430: Information Discovery Lecture 2 Introduction to Text Based Information Retrieval.
CS 430 / INFO 430 Information Retrieval
1 CS 502: Computing Methods for Digital Libraries Lecture 11 Information Retrieval I.
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
Search Engines and Information Retrieval Chapter 1.
IR Systems and Web Search By Sri Harsha Tumuluri (UNI: st2653)
1 CS 430 / INFO 430 Information Retrieval Lecture 2 Text Based Information Retrieval.
Thanks to Bill Arms, Marti Hearst Documents. Last time Size of information –Continues to grow IR an old field, goes back to the ‘40s IR iterative process.
1 CS 430: Information Discovery Lecture 9 Term Weighting and Ranking.
1 Information Retrieval Acknowledgements: Dr Mounia Lalmas (QMW) Dr Joemon Jose (Glasgow)
CS 430: Information Discovery
1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester
Information Retrieval CSE 8337 Spring 2007 Query Languages & Matching Material for these slides obtained from: Modern Information Retrieval by Ricardo.
1 CS 430: Information Discovery Lecture 3 Inverted Files.
Chapter 6: Information Retrieval and Web Search
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
1 CS 430 / INFO 430 Information Retrieval Lecture 5 Searching Full Text 5.
LIS618 lecture 3 Thomas Krichel Structure of talk Document Preprocessing Basic ingredients of query languages Retrieval performance evaluation.
1 CS 430: Information Discovery Lecture 4 Files Structures for Inverted Files.
Basic Implementation and Evaluations Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.
1 Language Specific Crawler for Myanmar Web Pages Pann Yu Mon Management and Information System Engineering Department Nagaoka University of Technology,
Information Retrieval
1 CS 430 / INFO 430 Information Retrieval Lecture 8 Evaluation of Retrieval Effectiveness 1.
1Computer Sciences Department. Book: INTRODUCTION TO THE THEORY OF COMPUTATION, SECOND EDITION, by: MICHAEL SIPSER Reference 3Computer Sciences Department.
1 CS 430 / INFO 430 Information Retrieval Lecture 3 Searching Full Text 3.
1 CS 430: Information Discovery Lecture 8 Automatic Term Extraction and Weighting.
1 CS 430: Information Discovery Lecture 5 Ranking.
1 CS 430: Information Discovery Lecture 11 Cranfield and TREC.
1 CS 430: Information Discovery Lecture 8 Collection-Level Metadata Vector Methods.
1 CS 430 / INFO 430 Information Retrieval Lecture 12 Query Refinement and Relevance Feedback.
Selecting Relevant Documents Assume: –we already have a corpus of documents defined. –goal is to return a subset of those documents. –Individual documents.
Modern Information Retrieval Chapter 7: Text Operations Ricardo Baeza-Yates Berthier Ribeiro-Neto.
Information Retrieval in Practice
Text Based Information Retrieval
CS 430: Information Discovery
CS 430: Information Discovery
CS 430: Information Discovery
Thanks to Bill Arms, Marti Hearst
Representation of documents and queries
CS 430: Information Discovery
CS 430: Information Discovery
Chapter 7 Lexical Analysis and Stoplists
Information Retrieval and Web Design
Information Retrieval and Web Design
CS 430: Information Discovery
Presentation transcript:

CS 430 / INFO 430 Information Retrieval Lecture 5 Text Processing Methods 1

Course Administration Discussion classes In preparing for the discussion classes, you might look at last year's web site: http://www.cs.cornell.edu/Courses/cs430/2003fa/index.html Often the reading was set last year and the questions will be similar.

Books on Information Retrieval Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999. Covers most of the standard topics. Chapters vary in quality. William B. Frakes and Ricardo Baeza-Yates, Information Retrieval Data Structures and Algorithms.  Prentice Hall, 1992. Good coverage of algorithms, but out of date. Several good chapters. G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, 1983. The classic description of the underlying methods.

SMART System An experimental system for automatic information retrieval • automatic indexing to assign terms to documents and queries • collect related documents into common subject classes • identify documents to be retrieved by calculating similarities between documents and queries • procedures for producing an improved search query based on information obtained from earlier searches Gerald Salton and colleagues Harvard 1964-1968 Cornell 1968-1988

Information Retrieval Family Trees Cyril Cleverdon Cranfield Karen Sparck Jones Cambridge Gerald Salton Cornell Keith Van Rijsbergen Glasgow Donna Harman NIST Michael Lesk Bell Labs, etc. Bruce Croft University of Massachusetts, Amherst

Indexing Subsystem documents Documents assign document IDs text document numbers and *field numbers break into tokens tokens stop list* non-stoplist tokens stemming* *Indicates optional operation. stemmed terms term weighting* terms with weights Index database

Search Subsystem query parse query query tokens ranked document set stop list* non-stoplist tokens ranking* stemming* stemmed terms Boolean operations* retrieved document set *Indicates optional operation. Index database relevant document set

Oxford English Dictionary

Lexical Analysis: Tokens What is a token? Free text indexing A token is a group of characters, extracted from the input string, that has some collective significance, e.g., a complete word. Usually, tokens are strings of letters, digits or other specified characters, separated by punctuation, spaces, etc.

Lexical Analysis: Choices Punctuation: In technical contexts, punctuation may be used as a character within a term, e.g., wordlist.txt. Case: Case of letters is usually not significant. Hyphens: (a) Treat as separators: state-of-art is treated as state of art. (b) Ignore: on-line is treated as online. (c) Retain: Knuth-Morris-Pratt Algorithm is unchanged. Digits: Most numbers do not make good tokens, but some are parts of proper nouns or technical terms: CS430, Opus 22.

Lexical Analysis: Choices The modern tendency, for free text searching, is to map upper and lower case letters together in index terms, but otherwise to minimize the changes made at the lexical analysis stage.

Lexical Analysis Example: Query Analyzer A token is a letter followed by a sequence of letters and digits. Upper case letters are mapped into the lower case equivalents. The following characters have significance as operators: ( ) & |

Lexical Analysis: Transition Diagram letter, digit 1 2 space letter ( 3 ) & 4 | 5 other 6 end-of-string 7

Lexical Analysis: Transition Table State space letter ( ) & | other end-of digit string 0 0 1 2 3 4 5 6 7 6 1 1 1 1 1 1 1 1 1 1 States in red are final states.

Changing the Lexical Analyzer This use of a transition table allows the system administrator to establish differ lexical choices for different collections of documents. Example: To change the lexical analyzer to accept tokens that begin with a digit, change the top right element of the table to 1.

Stop Lists Very common words, such as of, and, the, are rarely of use in information retrieval. A stop list is a list of such words that are removed during lexical analysis. A long stop list saves space in indexes, speeds processing, and eliminates many false hits. However, common words are sometimes significant in information retrieval, which is an argument for a short stop list. (Consider the query, "To be or not to be?")

Suggestions for Including Words in a Stop List • Include the most common words in the English language (perhaps 50 to 250 words). • Do not include words that might be important for retrieval (Among the 200 most frequently occurring words in general literature in English are time, war, home, life, water, and world). • In addition, include words that are very common in context (e.g., computer, information, system in a set of computing documents).

Example: Stop List for Assignment 1 a about an and are as at be but by for from has have he his in is it its more new of on one or said say that the their they this to was who which will with you

Example: the WAIS stop list (first 84 of 363 multi-letter words) about above according across actually adj after afterwards again against all almost alone along already also although always among amongst an another any anyhow anyone anything anywhere are aren't around at be became because become becomes becoming been before beforehand begin beginning behind being below beside besides between beyond billion both but by can can't cannot caption co could couldn't did didn't do does doesn't don't down during each eg eight eighty either else elsewhere end ending enough etc even ever every everyone everything

Stop list policies How many words should be in the stop list? • Long list lowers recall Which words should be in list? • Some common words may have retrieval importance: -- war, home, life, water, world • In certain domains, some words are very common: -- computer, program, source, machine, language There is very little systematic evidence to use in selecting a stop list.

Stop Lists in Practice The modern tendency is: have very short stop lists for broad-ranging or multi-lingual document collections, especially when the users are not trained. have longer stop lists for document collections in well-defined fields, especially when the users are trained professional.

Stemming Morphological variants of a word (morphemes). Similar terms derived from a common stem: engineer, engineered, engineering use, user, users, used, using Stemming in Information Retrieval. Grouping words with a common stem together. For example, a search on reads, also finds read, reading, and readable Stemming consists of removing suffixes and conflating the resulting morphemes. Occasionally, prefixes are also removed.

Categories of Stemmer The following diagram illustrate the various categories of stemmer. Porter's algorithm is shown by the red path. Conflation methods Manual Automatic (stemmers) Affix Successor Table n-gram removal variety lookup Longest Simple match removal

Stemming in Practice Evaluation studies have found that stemming can affect retrieval performance, usually for the better, but the results are mixed. • Effectiveness is dependent on the vocabulary. Fine distinctions may be lost through stemming. • Automatic stemming is as effective as manual conflation. • Performance of various algorithms is similar. Porter's Algorithm is entirely empirical, but has proved to be an effective algorithm for stemming English text with trained users.

Selection of tokens, weights, stop lists and stemming Special purpose collections (e.g., law, medicine, monographs) Best results are obtained by tuning the search engine for the characteristics of the collections and the expected queries. It is valuable to use a training set of queries, with lists of relevant documents, to tune the system for each application. General purpose collections (e.g., web search) The modern practice is to use a basic weighting scheme (e.g., tf.idf), a simple definition of token, a short stop list and no stemming except for plurals, with minimal conflation. Web searching combine similarity ranking with ranking based on document importance.