CS 430: Information Discovery
Lecture 2: Introduction to Text-Based Information Retrieval
Course Administration
• The campus store has run out of textbooks; more are on order. The reading for next week will be changed so that the textbook is not required.
• New Teaching Assistant: Yukiko Yamashita.
• Please send all questions about the course to:
Classical Information Retrieval
[Diagram: classical information retrieval positioned by media type (text vs. image, video, audio, etc.) and access method (linking, searching, browsing). Related approaches shown: natural language processing, catalogs and indexes (metadata) with the user in the loop, and statistical methods; related courses CS 502 and CS 474.]
Documents

A textual document is a digital object consisting of a sequence of words and other symbols, e.g., punctuation. The individual words and other symbols are known as tokens.

A textual document can be:
• Free text, also known as unstructured text, which is a continuous sequence of tokens.
• Fielded text, also known as structured text, in which the text is broken into sections that are distinguished by tags or other markup. [Methods of markup, e.g., XML, are covered in CS 502.]
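The distinction between words and punctuation symbols as tokens can be sketched with a minimal tokenizer (an illustrative sketch, not a method from the lecture):

```python
import re

# Minimal tokenizer sketch: split free text into tokens, treating
# punctuation symbols as tokens in their own right.
def tokenize(text):
    # \w+ matches runs of word characters; [^\w\s] matches a single
    # non-word, non-space symbol such as a comma or period.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("Free text, also known as unstructured text.")
# tokens: ['Free', 'text', ',', 'also', 'known', 'as', 'unstructured', 'text', '.']
```

Real systems use more careful rules (hyphens, apostrophes, numbers), but the idea of a document as a token sequence is the same.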
Word Frequency

Observation: Some words are more common than others.

Statistics: Most large collections of text documents have similar statistical characteristics. These statistics:
• influence the effectiveness and efficiency of the data structures used to index documents
• are the basis of many retrieval models

The following example is taken from: Jamie Callan, Characteristics of Text,
Rank Frequency Distribution
For all the words in a collection of documents, for each word w:
• f(w) is the frequency with which w appears
• r(w) is the rank of w in order of frequency, e.g., the most commonly occurring word has rank 1

[Figure: plot of frequency f against rank r; a word w with rank r has frequency f.]
[Table: the most frequent words in the example collection in three columns of decreasing frequency (the, of, to, a, in, and, that, for, is, it, on, by, as, at, with, from, he, million, year, its, be, was, company, an, said, are, have, his, but, will, say, new, their, or, about, market, they, this, would, you, which, bank, stock, trade, more, who, one, mr, share), with counts such as 54,958 and 48,273.]
Zipf's Law

If the words w in a collection are ranked, r(w), by their frequency, f(w), they roughly fit the relation:

r(w) * f(w) = c

Different collections have different constants c. In English text, c tends to be about n / 10, where n is the number of words in the collection.

For a weird but wonderful discussion of this and many other examples of naturally occurring rank frequency distributions, see: Zipf, G. K., Human Behavior and the Principle of Least Effort. Addison-Wesley, 1949.
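The relation r(w) * f(w) = c is easy to check empirically. A minimal sketch, assuming any list of tokens as input:

```python
from collections import Counter

# Compute r(w) * f(w) for every word in a token list.
# Zipf's law predicts these products are roughly constant (about n/10
# for English text, where n is the total number of tokens).
def zipf_products(tokens):
    counts = Counter(tokens)
    # Sort frequencies in descending order; rank 1 = most common word.
    freqs = sorted(counts.values(), reverse=True)
    return [rank * f for rank, f in enumerate(freqs, start=1)]
```

On a toy input such as `["the"]*4 + ["of"]*2 + ["a"]` the products are [4, 4, 3]; on a real corpus they flatten out across the middle ranks, which is what the next slide's table shows.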
[Table: values of 1000 * r * f / n for the same words; nearly all lie between about 59 and 114, clustering near 100, consistent with c being about n / 10.]
Methods that Build on Zipf's Law
• Term weighting: give different weights to terms based on their frequency, with the most frequent words weighted less.
• Stop lists: ignore the most frequent words (upper cut-off).
• Significant words: ignore both the most frequent and the least frequent words (upper and lower cut-off).
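The upper and lower cut-offs can be sketched as a frequency filter. The thresholds below are illustrative assumptions, not values from the lecture:

```python
from collections import Counter

# Keep only "significant" words: drop words above an upper cut-off
# (very common, stop-list-like) and at or below a lower cut-off
# (very rare). Thresholds are assumed for illustration.
def significant_words(tokens, upper=0.1, lower=1):
    counts = Counter(tokens)
    n = len(tokens)
    return {w for w, f in counts.items()
            if f > lower           # lower cut-off: drop very rare words
            and f / n <= upper}    # upper cut-off: drop very common words
```

For example, in a token list where "the" makes up 89% of the tokens and "zipf" appears once, only the mid-frequency words survive both cut-offs.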
Luhn's Proposal

"It is here proposed that the frequency of word occurrence in an article furnishes a useful measurement of word significance. It is further proposed that the relative position within a sentence of words having given values of significance furnish a useful measurement for determining the significance of sentences. The significance factor of a sentence will therefore be based on a combination of these two measurements."

Luhn, H. P., "The automatic creation of literature abstracts," IBM Journal of Research and Development, 2 (1958).
Cut-off Levels for Significance Words
[Figure: word frequency plotted against rank r, with upper and lower cut-off lines; the resolving power of significant words peaks between the two cut-offs. From: Van Rijsbergen, Ch. 2.]
Information Retrieval Overview
[Diagram: requests and documents flow into a matching process labeled "Similar".]

Similar: mechanism for determining which information items meet the requirements of a given request.
Functional View of Information Retrieval
[Diagram: documents are indexed into an index database; requests are matched against it.]

Similar: mechanism for determining the similarity of the request representation to the information item representation.
Major Subsystems Indexing subsystem: Receives incoming documents, converts them to the form required for the index and adds them to the index database. Search subsystem: Receives incoming requests, converts them to the form required for searching the index and searches the database for matching documents. The index database is the central hub of the system.
Example: Indexing Subsystem for Boolean Searching
[Flowchart, from Frakes, page 7:
documents → assign document IDs → documents with document numbers and *field numbers → break into words → words → stoplist → non-stoplist words → stemming* → stemmed words → term weighting* → terms with weights → Index database
*Indicates optional operation.]
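The document-ID assignment, word breaking, and stop-list steps of the indexing subsystem can be sketched as a pipeline feeding an inverted index. The stop list and documents here are illustrative assumptions, not Frakes's:

```python
# Illustrative stop list (assumed, not from the lecture).
STOPLIST = {"a", "an", "and", "of", "the", "to"}

# Sketch of the indexing subsystem: assign document IDs, break each
# document into words, drop stop words, and record each remaining
# term in an inverted index mapping term -> set of document IDs.
# Stemming and term weighting (the optional steps) are omitted.
def build_index(documents):
    index = {}
    for doc_id, text in enumerate(documents):    # assign document IDs
        for word in text.lower().split():        # break into words
            if word not in STOPLIST:             # apply stop list
                index.setdefault(word, set()).add(doc_id)
    return index

index = build_index(["the automatic creation of abstracts",
                     "automatic text retrieval"])
# index["automatic"] == {0, 1}; "the" and "of" never reach the index
```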
Example: Search Subsystem for Boolean Searching
[Flowchart:
query → parse query → query terms → stoplist → non-stoplist words → stemming* → stemmed words → Boolean operations → retrieved document set → ranking* → ranked document set → relevance judgments* → relevant document set
*Indicates optional operation.]
Index database
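The search side mirrors the indexing side: parse the query, apply the same stop list, and run Boolean operations against the index. A minimal sketch for AND-only queries, assuming the term → set-of-IDs index layout above (ranking, stemming, and relevance judgments omitted):

```python
# Same illustrative stop list as on the indexing side (assumed).
STOPLIST = {"a", "an", "and", "of", "the", "to"}

# Sketch of the search subsystem for conjunctive (AND) queries:
# parse the query into terms, drop stop words (which also absorbs
# the literal "and" operator), look up each term's posting set,
# and intersect them to get the retrieved document set.
def boolean_and_search(index, query):
    terms = [t for t in query.lower().split() if t not in STOPLIST]
    postings = [index.get(t, set()) for t in terms]
    # Intersect all posting sets; an empty query matches nothing.
    return set.intersection(*postings) if postings else set()

index = {"automatic": {0, 1}, "retrieval": {1}}
# boolean_and_search(index, "automatic AND retrieval") == {1}
```

A full Boolean engine would also handle OR and NOT; AND is the intersection case and shows the shape of the matching step.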