1 CS 430 / INFO 430 Information Retrieval Lecture 5 Searching Full Text 5.

Slides:



Advertisements
Similar presentations
Chapter 5: Introduction to Information Retrieval
Advertisements

C O N T E X T - F R E E LANGUAGES ( use a grammar to describe a language) 1.
Binary Trees CSC 220. Your Observations (so far data structures) Array –Unordered Add, delete, search –Ordered Linked List –??
Chapter 4: Trees Part II - AVL Tree
Properties of Text CS336 Lecture 3:. 2 Generating Document Representations Want to automatically generate with little human intervention Use significant.
Intelligent Information Retrieval CS 336 –Lecture 3: Text Operations Xiaoyan Li Spring 2006.
CS 430 / INFO 430 Information Retrieval
1 Discussion Class 3 The Porter Stemmer. 2 Course Administration No class on Thursday.
CS 430 / INFO 430 Information Retrieval
Text Operations: Preprocessing. Introduction Document preprocessing –to improve the precision of documents retrieved –lexical analysis, stopwords elimination,
1 CS 430 / INFO 430 Information Retrieval Lecture 5 Searching Full Text 5.
WMES3103 : INFORMATION RETRIEVAL
1 CS 430 / INFO 430 Information Retrieval Lecture 4 Searching Full Text 4.
Inverted Indices. Inverted Files Definition: an inverted file is a word-oriented mechanism for indexing a text collection in order to speed up the searching.
1 CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations.
1 CS 430: Information Discovery Lecture 4 Data Structures for Information Retrieval.
CS/Info 430: Information Retrieval
Hashing COMP171 Fall Hashing 2 Hash table * Support the following operations n Find n Insert n Delete. (deletions may be unnecessary in some applications)
1 Basic Text Processing and Indexing. 2 Document Processing Steps Lexical analysis (tokenizing) Stopwords removal Stemming Selection of indexing terms.
1 Database Tuning Rasmus Pagh and S. Srinivasa Rao IT University of Copenhagen Spring 2007 February 8, 2007 Tree Indexes Lecture based on [RG, Chapter.
1 CS 430: Information Discovery Lecture 2 Introduction to Text Based Information Retrieval.
1 CS 430 / INFO 430 Information Retrieval Lecture 4 Searching Full Text 4.
CS 430 / INFO 430 Information Retrieval
1 CS 502: Computing Methods for Digital Libraries Lecture 11 Information Retrieval I.
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
Indexing. Goals: Store large files Support multiple search keys Support efficient insert, delete, and range queries.
Comp 249 Programming Methodology Chapter 15 Linked Data Structure - Part B Dr. Aiman Hanna Department of Computer Science & Software Engineering Concordia.
1 B Trees - Motivation Recall our discussion on AVL-trees –The maximum height of an AVL-tree with n-nodes is log 2 (n) since the branching factor (degree,
1 CS 430 / INFO 430 Information Retrieval Lecture 2 Text Based Information Retrieval.
CHAPTER 09 Compiled by: Dr. Mohammad Omar Alhawarat Sorting & Searching.
March 16 & 21, Csci 2111: Data and File Structures Week 9, Lectures 1 & 2 Indexed Sequential File Access and Prefix B+ Trees.
CS 430: Information Discovery
Binary Trees, Binary Search Trees RIZWAN REHMAN CENTRE FOR COMPUTER STUDIES DIBRUGARH UNIVERSITY.
1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester
Data Structure. Two segments of data structure –Storage –Retrieval.
1 CS 430: Information Discovery Lecture 3 Inverted Files.
Chapter 6: Information Retrieval and Web Search
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
LIS618 lecture 3 Thomas Krichel Structure of talk Document Preprocessing Basic ingredients of query languages Retrieval performance evaluation.
1 CS 430: Information Discovery Lecture 4 Files Structures for Inverted Files.
Basic Implementation and Evaluations Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.
Web- and Multimedia-based Information Systems Lecture 2.
1 CS 430: Information Discovery Sample Midterm Examination Notes on the Solutions.
Sets of Digital Data CSCI 2720 Fall 2005 Kraemer.
Marwan Al-Namari Hassan Al-Mathami. Indexing What is Indexing? Indexing is a mechanisms. Why we need to use Indexing? We used indexing to speed up access.
1 Language Specific Crawler for Myanmar Web Pages Pann Yu Mon Management and Information System Engineering Department Nagaoka University of Technology,
Physical Database Design Purpose- translate the logical description of data into the technical specifications for storing and retrieving data Goal - create.
Evidence from Content INST 734 Module 2 Doug Oard.
1 CS 430 / INFO 430 Information Retrieval Lecture 3 Searching Full Text 3.
1 CS 430: Information Discovery Lecture 8 Automatic Term Extraction and Weighting.
1 CS 430: Information Discovery Lecture 5 Ranking.
1 CS 430: Information Discovery Lecture 8 Collection-Level Metadata Vector Methods.
1 Discussion Class 3 Stemming Algorithms. 2 Discussion Classes Format: Question Ask a member of the class to answer Provide opportunity for others to.
1 CS 430: Information Discovery Lecture 3 Inverted Files.
Selecting Relevant Documents Assume: –we already have a corpus of documents defined. –goal is to return a subset of those documents. –Individual documents.
Information Retrieval Inverted Files.. Document Vectors as Points on a Surface Normalize all document vectors to be of length 1 Define d' = Then the ends.
Automated Information Retrieval
Why indexing? For efficient searching of a document
COMP261 Lecture 23 B Trees.
Text Based Information Retrieval
CS 430: Information Discovery
Lecture 22 Binary Search Trees Chapter 10 of textbook
CS 430: Information Discovery
CS 430: Information Discovery
CS 430: Information Discovery
Information Retrieval and Web Design
Discussion Class 3 Stemming Algorithms.
Information Retrieval and Web Design
CS 430: Information Discovery
Presentation transcript:

1 CS 430 / INFO 430 Information Retrieval Lecture 5 Searching Full Text 5

2 Course Administration

3 CS 430 / INFO 430 Information Retrieval Completion of Lecture 4

4 Word List On disk If a word list is held on disk, search time is dominated by the number of disk accesses. In memory Suppose that a word list has 1,000,000 distinct terms. Each index entry consists of the term, some basic statistics and a pointer to the inverted list, average 100 characters. Size of index is 100 megabytes, which can easily be held in memory of a dedicated computer.

5 File Structures for Inverted Files: Linear Index Advantages Can be searched quickly, e.g., by binary search, O(log n) Good for lexicographic processing, e.g., comp* Convenient for batch updating Economical use of storage Disadvantages Index must be rebuilt if an extra term is added

6 File Structures for Inverted Files: Binary Tree elk beehog cat dog fox ant gnu Input: elk, hog, bee, fox, cat, gnu, ant, dog

7 File Structures for Inverted Files: Binary Tree Advantages Can be searched quickly Convenient for batch updating Easy to add an extra term Economical use of storage Disadvantages Less good for lexicographic processing, e.g., comp* Tree tends to become unbalanced If the index is held on disk, important to optimize the number of disk accesses

8 File Structures for Inverted Files: Binary Tree Calculation of maximum depth of tree. Illustrates importance of balanced trees. Worst case: depth = n O(n) Ideal case: depth = log(n + 1)/log 2 O(log n)

9 File Structures for Inverted Files: Right Threaded Binary Tree Threaded tree: A binary search tree in which each node uses an otherwise-empty left child link to refer to the node's in- order predecessor and an empty right child link to refer to its in-order successor. Right-threaded tree: A variant of a threaded tree in which only the right thread, i.e. link to the successor, of each node is maintained. Can be used for lexicographic processing. A good data structure when held in memory Knuth vol 1, 2.3.1, page 325.

10 File Structures for Inverted Files: Right Threaded Binary Tree dog bee ant cat gnu elk fox hog NULL

11 File Structures for Inverted Files: B-trees B-tree of order m: A balanced, multiway search tree: Each node stores many keys Root has between 2 and 2m keys. All other internal nodes have between m and 2m keys. If k i is the i th key in a given internal node -> all keys in the (i-1) th child are smaller than k i -> all keys in the i th child are bigger than k i All leaves are at the same depth

12 File Structures for Inverted Files: B-trees B-tree example (order 2) Every arrow points to a node containing between 2 and 4 keys. A node with k keys has k + 1 pointers

13 File Structures for Inverted Files: B + -tree A B-tree is used as an index Data is stored in the leaves of the tree, known as buckets D 9 D D 54 D D Example: B + -tree of order 2, bucket size 4 (Implementation of B + -trees is covered in CS 432.)

14 CS 430 / INFO 430 Information Retrieval Lecture 5 Searching Full Text 5

15 SMART System An experimental system for automatic information retrieval automatic indexing to assign terms to documents and queries collect related documents into common subject classes identify documents to be retrieved by calculating similarities between documents and queries procedures for producing an improved search query based on information obtained from earlier searches Gerald Salton and colleagues Harvard Cornell

16 Indexing Subsystem Documents break into tokens stop list* stemming* term weighting* Index database text non-stoplist tokens tokens stemmed terms terms with weights *Indicates optional operation. assign document IDs documents document numbers and *field numbers

17 Search Subsystem Index database query parse query stemming* stemmed terms stop list* non-stoplist tokens query tokens Boolean operations* ranking* relevant document set ranked document set retrieved document set *Indicates optional operation.

18 Decisions in Building the Word List: What is a Term? Underlying character set, e.g., printable ASCII, Unicode, UTF8. Is there a controlled vocabulary? If so, what words are included? List of stopwords. Rules to decide the beginning and end of words, e.g., spaces or punctuation. Character sequences not to be indexed, e.g., sequences of numbers.

19 Lexical Analysis: Term What is a term? Free text indexing A term is a group of characters, extracted from the input string, that has some collective significance, e.g., a complete word. Usually, terms are strings of letters, digits or other specified characters, separated by punctuation, spaces, etc.

20 Oxford English Dictionary

21 Lexical Analysis: Choices Punctuation: In technical contexts, punctuation may be used as a character within a term, e.g., wordlist.txt. Case: Case of letters is usually not significant. Hyphens: (a) Treat as separators: state-of-art is treated as state of art. (b) Ignore: on-line is treated as online. (c) Retain: Knuth-Morris-Pratt Algorithm is unchanged. Digits: Most numbers do not make good terms, but some are parts of proper nouns or technical terms: CS430, Opus 22.

22 Lexical Analysis: Choices The modern tendency, for free text searching, is to map upper and lower case letters together in index terms, but otherwise to minimize the changes made at the lexical analysis stage.

23 Lexical Analysis Example: Query Analyzer A term is a letter followed by a sequence of letters and digits. Upper case letters are mapped into the lower case equivalents. The following characters have significance as operators: ( ) & |

24 Lexical Analysis: Transition Diagram letter ( ) & | other end-of-string space letter, digit

25 Lexical Analysis: Transition Table Statespaceletter()&| other end-of digit string States in red are final states.

26 This use of a transition table allows the system administrator to establish differ lexical choices for different collections of documents. Example: To change the lexical analyzer to accept tokens that begin with a digit, change the top right element of the table to 1. Changing the Lexical Analyzer

27 Very common words, such as of, and, the, are rarely of use in information retrieval. A stop list is a list of such words that are removed during lexical analysis. A long stop list saves space in indexes, speeds processing, and eliminates many false hits. However, common words are sometimes significant in information retrieval, which is an argument for a short stop list. (Consider the query, "To be or not to be?") Stop Lists

28 Suggestions for Including Words in a Stop List Include the most common words in the English language (perhaps 50 to 250 words). Do not include words that might be important for retrieval (Among the 200 most frequently occurring words in general literature in English are time, war, home, life, water, and world). In addition, include words that are very common in context (e.g., computer, information, system in a set of computing documents).

29 Example: Stop List for Assignment 1 aaboutanand areasatbe butbyforfrom hashavehehis inisitits morenewofon oneorsaidsay thatthetheirthey thistowaswho whichwillwithyou

30 Example: the WAIS stop list (first 84 of 363 multi-letter words) aboutaboveaccordingacrossactually adj after afterwardsagain against all almost alone along alreadyalsoalthough always among amongst an another any anyhow anyone anything anywhere are aren't around at be became because become becomes becoming been before beforehand begin beginning behind being below beside besides between beyond billion both butby can can't cannot caption co could couldn't did didn't do does doesn't don't down during each eg eight eighty either else elsewhere end ending enough etc even ever every everyone everything

31 Stop list policies How many words should be in the stop list? Long list lowers recall Which words should be in list? Some common words may have retrieval importance: -- home, life, water, war, world In certain domains, some words are very common: -- computer, program, source, machine, language There is very little systematic evidence to use in selecting a stop list.

32 Stop Lists in Practice The modern tendency is: (a)have very short stop lists for broad-ranging or multi-lingual document collections, especially when the users are not trained. (b)have longer stop lists for document collections in well-defined fields, especially when the users are trained professional.

33 Stemming Morphological variants of a word (morphemes). Similar terms derived from a common stem: engineer, engineered, engineering use, user, users, used, using Stemming in Information Retrieval. Grouping words with a common stem together. For example, a search on reads, also finds read, reading, and readable Stemming consists of removing suffixes and conflating the resulting morphemes. Occasionally, prefixes are also removed.

34 Categories of Stemmer The following diagram illustrate the various categories of stemmer. Porter's algorithm is shown by the red path. Conflation methods Manual Automatic (stemmers) Affix Successor Table n-gram removal variety lookup Longest Simple match removal

35 Porter Stemmer A multi-step, longest-match stemmer. M. F. Porter, An algorithm for suffix stripping. (Originally published in Program, 14 no. 3, pp , July 1980.) Notation vvowel cconstant (vc) m (vowel followed by a constant) repeated m times Any word can be written: [c](vc) m [v] m is called the measure of the word

36 Porter's Stemmer Porter Stemming Algorithm Complex suffixes Complex suffixes are removed bit by bit in the different steps. Thus: GENERALIZATIONS becomes GENERALIZATION (Step 1) becomes GENERALIZE (Step 2) becomes GENERAL (Step 3) becomes GENER (Step 4).

37 Porter Stemmer: Step 1a Suffix Replacement Examples sses ss caresses -> caress ies i ponies -> poni ties -> ti ss ss caress -> caress s cats -> cat

38 Porter Stemmer: Step 1b ConditionsSuffixReplacementExamples (m > 0)eedeefeed -> feed agreed -> agree (*v*)ednullplastered -> plaster bled -> bled (*v*)ingnullmotoring -> motor sing -> sing *v* - the stem contains a vowel

39 Porter Stemmer: Step 5a Step 5a is defined as follows. (m>1) E -> probate -> probat rate -> rate (m=1 and not *o) E -> cease -> ceas *o - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP).

40 Stemming in Practice Evaluation studies have found that stemming can affect retrieval performance, usually for the better, but the results are mixed. Effectiveness is dependent on the vocabulary. Fine distinctions may be lost through stemming. Automatic stemming is as effective as manual conflation. Performance of various algorithms is similar. Porter's Algorithm is entirely empirical, but has proved to be an effective algorithm for stemming English text with trained users.

41 Selection of tokens, weights, stop lists and stemming Special purpose collections (e.g., law, medicine, monographs) Best results are obtained by tuning the search engine for the characteristics of the collections and the expected queries. It is valuable to use a training set of queries, with lists of relevant documents, to tune the system for each application. General purpose collections (e.g., news articles) The modern practice is to use a basic weighting scheme (e.g., tf.idf), a simple definition of token, a short stop list and little stemming except for plurals, with minimal conflation. Web searching combine similarity ranking with ranking based on document importance.