Text Based Information Retrieval

Information Retrieval from Collections of Textual Documents
Major Categories of Methods
• Exact matching (Boolean)
• Ranking by similarity to query (vector space model)
• Ranking of matches by importance of documents (PageRank)
• Combination methods
The course begins with Boolean methods, then similarity methods, then importance methods.

Text Based Information Retrieval
Most matching methods are based on Boolean operators. Most ranking methods are based on the vector space model. Web search methods combine vector space model with ranking based on importance of documents. Many practical systems combine features of several approaches. In the basic form, all approaches treat words as separate tokens with minimal attempt to interpret them linguistically.

Documents
A textual document is a digital object consisting of a sequence of words and other symbols, e.g., punctuation. The individual words and other symbols are known as tokens or terms.
A textual document can be:
• Free text, also known as unstructured text, which is a continuous sequence of tokens.
• Fielded text, also known as structured text, in which the text is broken into sections that are distinguished by tags or other markup.

Word Frequency
Observation: Some words are more common than others.
Statistics: Most large collections of text documents have similar statistical characteristics. These statistics:
• influence the effectiveness and efficiency of the data structures used to index documents
• are relied on by many retrieval models

Word Frequency Example
The following example is taken from: Jamie Callan, Characteristics of Text, 1997.
Sample of 19 million words.
The next slide shows the 50 commonest words in rank order (r), with their frequency (f).

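word     f        word     f        word     f
the      ...      from     ...      or       54958
of       ...      he       ...      about    ...
to       ...      million  ...      market   ...
a        ...      year     ...      they     ...
in       ...      its      ...      this     ...
and      ...      be       ...      would    ...
that     ...      was      ...      you      ...
for      ...      company  ...      which    48273
is       ...      an       ...      bank     ...
said     ...      has      ...      stock    ...
it       ...      are      ...      trade    ...
on       ...      have     ...      his      ...
by       ...      but      ...      more     ...
as       ...      will     ...      who      ...
at       ...      say      ...      one      ...
mr       ...      new      ...      their    ...
with     ...      share    ...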

Rank Frequency Distribution
For all the words in a collection of documents, for each word w:
f is the frequency with which w appears
r is the rank of w in order of frequency (the most commonly occurring word has rank 1, etc.)
[Figure: plot of frequency f against rank r; word w has rank r and frequency f.]

Rank Frequency Example
The next slide shows the words in Callan's data normalized. In this example:
r is the rank of word w in the sample.
f is the frequency of word w in the sample.
n is the total number of word occurrences in the sample.

word     r*f*1000/n   word     r*f*1000/n   word     r*f*1000/n
the      59           from     92           or       101
of       ...          he       ...          about    102
to       ...          million  ...          market   101
a        ...          year     ...          they     103
in       ...          its      ...          this     105
and      ...          be       ...          would    107
that     ...          was      ...          you      106
for      ...          company  ...          which    107
is       72           an       ...          bank     109
said     ...          has      ...          stock    110
it       ...          are      ...          trade    112
on       ...          have     ...          his      114
by       ...          but      ...          more     114
as       ...          will     ...          who      106
at       ...          say      ...          one      107
mr       ...          new      ...          their    108
with     ...          share    114

Zipf's Law
If the words, w, in a collection are ranked, r, by their frequency, f, they roughly fit the relation:
r * f = c
Different collections have different constants c. In English text, c tends to be about n / 10, where n is the number of word occurrences in the collection.
For a weird but wonderful discussion of this and many other examples of naturally occurring rank frequency distributions, see: Zipf, G. K., Human Behavior and the Principle of Least Effort. Addison-Wesley, 1949.
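The relation can be checked on any token list. The following is a minimal sketch, not part of the original slides: the function name rank_frequency and the toy sentence are illustrative assumptions, and the last line assumes that "or" holds rank 35 in Callan's sample, which is what its position in the three-column table suggests.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency, r*f/n) rows, most frequent word first."""
    counts = Counter(tokens)
    n = sum(counts.values())  # total number of word occurrences
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, rank * freq / n))
    return rows


# Toy text; far too small for Zipf's law to show, but it exercises the code.
text = "the cat and the dog and the bird saw the cat"
rows = rank_frequency(text.split())
for row in rows:
    print(row)

# Normalized value for "or" in Callan's data (f = 54958, n = 19 million),
# ASSUMING its rank is 35. The result is close to the 101 shown in the table.
callan_or = 35 * 54958 * 1000 / 19_000_000
print(round(callan_or))
```

On a large collection the last column should hover around a constant, about 0.1 for English text according to the slide.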

Methods that Build on Zipf's Law
• Stop lists: Ignore the most frequent words (upper cut-off). Used by almost all systems.
• Significant words: Ignore the most frequent and least frequent words (upper and lower cut-off). Rarely used.
• Term weighting: Give differing weights to terms based on their frequency, with the most frequent words weighted less. Used by almost all ranking methods.
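The two cut-offs can be sketched as a filter on raw counts. This is only an illustration under stated assumptions: the function name significant_words and the threshold values are hypothetical, and practical systems often use a fixed, hand-built stop list rather than a frequency threshold.

```python
from collections import Counter


def significant_words(tokens, upper, lower=1):
    """Keep words whose frequency f satisfies lower <= f <= upper.

    With lower=1 this acts like a stop list (upper cut-off only);
    raising lower also drops the rarest words (lower cut-off).
    """
    counts = Counter(tokens)
    return {word for word, freq in counts.items() if lower <= freq <= upper}


tokens = "the cat and the dog and the bird saw the cat".split()
stop_filtered = significant_words(tokens, upper=3)           # drops "the"
both_cutoffs = significant_words(tokens, upper=3, lower=2)   # also drops words seen once
print(sorted(stop_filtered))
print(sorted(both_cutoffs))
```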

1. Exact Matching (Boolean Model)
[Diagram: documents are indexed into an index database; a query is applied to it through a mechanism for determining whether a document matches a query, which produces a set of hits.]

Evaluation of Matching: Recall and Precision
If information retrieval were perfect, every hit would be relevant to the original query, and every relevant item in the body of information would be found.
Precision: percentage (or fraction) of the hits that are relevant, i.e., the extent to which the set of hits retrieved by a query satisfies the requirement that generated the query.
Recall: percentage (or fraction) of the relevant items that are found by the query, i.e., the extent to which the query found all the items that satisfy the requirement.

Recall and Precision with Exact Matching: Example
Collection of 10,000 documents, 50 on a specific topic.
An ideal search finds these 50 documents and rejects all others.
The actual search identifies 25 documents; 20 are relevant but 5 are on other topics.
Precision: 20/25 = 0.8 (80% of hits were relevant)
Recall: 20/50 = 0.4 (40% of relevant were found)
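The arithmetic of this example can be reproduced with sets of document identifiers. The sketch below is illustrative only: the function name precision_recall and the particular document numbers are assumptions chosen so that the counts match the example.

```python
def precision_recall(hits, relevant):
    """Return (precision, recall) for a set of hits and a set of relevant items."""
    hits, relevant = set(hits), set(relevant)
    relevant_hits = hits & relevant
    precision = len(relevant_hits) / len(hits) if hits else 0.0
    recall = len(relevant_hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document numbers: 0..49 are the 50 relevant documents.
relevant = set(range(50))
# The search returns 20 relevant documents plus 5 on other topics.
hits = set(range(20)) | set(range(100, 105))

precision, recall = precision_recall(hits, relevant)
print(precision, recall)  # 0.8 0.4
```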

Measuring Precision and Recall
Precision is easy to measure: A knowledgeable person looks at each document that is identified and decides whether it is relevant. In the example, only the 25 documents that are found need to be examined.
Recall is difficult to measure: To know all relevant items, a knowledgeable person must go through the entire collection, looking at every object to decide if it fits the criteria. In the example, all 10,000 documents must be examined.

Query
A query is a string to match against entries in an index. The string may contain:
• search terms, e.g., computation
• operators, e.g., computation and parallel
• fields, e.g., author = Newton
• metacharacters, e.g., b[aeiou]n*g
(Metacharacters can be used to build regular expressions, which will be covered later in the course.)
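As a preview of the metacharacter example, the sketch below reads b[aeiou]n*g as a regular expression in Python's re module: "b", then one vowel, then zero or more "n", then "g". The word list is made up for illustration, and a given retrieval system may interpret metacharacters differently.

```python
import re

# "b", one vowel, zero or more "n", then "g"
pattern = re.compile(r"b[aeiou]n*g")

# Hypothetical index entries to test against the pattern.
words = ["bag", "bang", "bing", "bunng", "bng", "boing"]

# fullmatch requires the whole entry to fit the pattern.
matches = [word for word in words if pattern.fullmatch(word)]
print(matches)  # ['bag', 'bang', 'bing', 'bunng']
```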

Boolean Queries
Boolean query: two or more search terms, related by logical operators, e.g., and, or, not
Examples:
abacus and actor
abacus or actor
(abacus and actor) or (abacus and atoll)
not actor

Boolean Diagram
[Venn diagram of two overlapping sets A and B, with the regions A and B, A or B, and not (A or B) labeled.]
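The regions of the diagram correspond to set operations on the documents that contain each term. A minimal sketch follows, assuming a hypothetical collection numbered 1 to 40 and the document numbers used in the inverted file example later in these slides; the variable names are illustrative.

```python
# Hypothetical collection: documents numbered 1..40.
all_docs = set(range(1, 41))

# Documents containing each term (numbers follow the inverted file example).
abacus = {3, 19, 22}
actor = {2, 19, 29}
atoll = {11, 34}

and_result = abacus & actor                      # abacus and actor
or_result = abacus | actor                       # abacus or actor
combined = (abacus & actor) | (abacus & atoll)   # (abacus and actor) or (abacus and atoll)
not_actor = all_docs - actor                     # not actor
neither = all_docs - (abacus | actor)            # not (abacus or actor)

print(sorted(and_result))  # [19]
print(sorted(or_result))   # [2, 3, 19, 22, 29]
```

Note that not needs the whole collection as its universe, which is why all_docs has to be known.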

Adjacent and Near Operators
abacus adj actor
Terms abacus and actor are adjacent to each other, as in the string "abacus actor".
abacus near 4 actor
Terms abacus and actor are near to each other, as in the string "the actor has an abacus".
Some systems support other operators, such as with (two terms in the same sentence) or same (two terms in the same paragraph).

Evaluation of Boolean Operators
Precedence of operators must be defined:
adj, near    high
and, not
or           low
Example: A and B or C and B is evaluated as (A and B) or (C and B)

Inverted File
Inverted file: A list of search terms that are used to index a set of documents. The inverted file is organized for associative look-up, i.e., to answer the question, "In which documents does a specified search term appear?"
In practical applications, the inverted file contains related information, such as the location within the document where the search terms appear.

Inverted File -- Basic Concept
Word     Document
abacus   3, 19, 22
actor    2, 19, 29
aspen    5
atoll    11, 34
Stop words are removed before building the index.
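Building such a file amounts to mapping each word to the sorted list of documents that contain it. The sketch below is illustrative: the document texts, the small STOP_WORDS set, and the function name build_inverted_file are all assumptions, chosen so that the result matches the postings used in these slides.

```python
from collections import defaultdict

# Hypothetical stop list; real systems use a much longer one.
STOP_WORDS = {"the", "a", "an", "has", "of"}


def build_inverted_file(docs):
    """Map each non-stop word to the sorted list of document ids containing it."""
    inverted = defaultdict(list)
    for doc_id in sorted(docs):
        words = set(docs[doc_id].lower().split()) - STOP_WORDS
        for word in words:
            inverted[word].append(doc_id)  # doc ids arrive in ascending order
    return dict(inverted)


# Made-up document texts keyed by document number.
docs = {
    2: "the actor", 3: "an abacus", 5: "aspen", 11: "atoll",
    19: "the abacus actor", 22: "abacus", 29: "actor", 34: "an atoll",
}
inverted_file = build_inverted_file(docs)
for word in sorted(inverted_file):
    print(word, inverted_file[word])
```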

Inverted List -- Concept
Inverted List: All the entries in an inverted file that apply to a specific word, e.g., abacus 3 19 22
Posting: Entry in an inverted list, e.g., there are three postings for "abacus".

Evaluating a Boolean Query
Example: abacus and actor
Postings for abacus: 3 19 22
Postings for actor: 2 19 29
To evaluate the and operator, merge the two inverted lists with a logical AND operation.
Document 19 is the only document that contains both terms, "abacus" and "actor".
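One common way to carry out this merge, assuming both inverted lists are kept sorted by document number, is a two-pointer walk. The slides do not prescribe an algorithm, so the sketch below (with the hypothetical name intersect_postings) shows just one possibility.

```python
def intersect_postings(list_a, list_b):
    """Return the document ids present in both sorted postings lists."""
    i = j = 0
    result = []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])  # document contains both terms
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1  # advance the list with the smaller document id
        else:
            j += 1
    return result


abacus_postings = [3, 19, 22]
actor_postings = [2, 19, 29]
both = intersect_postings(abacus_postings, actor_postings)
print(both)  # [19]
```

Each list is scanned once, so the cost is proportional to the combined length of the two lists.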

Enhancements to Inverted Files -- Concept
Location: The inverted file can hold information about the location of each term within the document.
Uses:
• adjacency and near operators
• user interface design -- highlight location of search term
Frequency: The inverted file includes the number of postings for each term.
Uses:
• term weighting
• query processing optimization

Inverted File -- Concept (Enhanced)
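[Table with columns Word, Postings, Document, Location and rows for abacus, actor, aspen, and atoll. Surviving entries: abacus in document 22 at location 56; atoll in document 11 at location 70.]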

Evaluating an Adjacency Operation
Example: abacus adj actor
[Postings for abacus and postings for actor, with locations.]
Document 19, locations 212 and 213, is the only occurrence of the terms "abacus" and "actor" adjacent.
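With locations stored in the postings, adjacency and nearness become comparisons of positions within the same document. The sketch below rests on several assumptions: adj is read as "the second term immediately follows the first", near k as "within k positions, in either order", and apart from document 19 (locations 212 and 213) and document 22 (location 56) the locations are invented for illustration.

```python
def adjacent(postings_a, postings_b):
    """Return (doc, loc_a, loc_b) where term b immediately follows term a."""
    hits = []
    for doc in sorted(postings_a.keys() & postings_b.keys()):
        locations_b = set(postings_b[doc])
        for loc in postings_a[doc]:
            if loc + 1 in locations_b:
                hits.append((doc, loc, loc + 1))
    return hits


def near(postings_a, postings_b, k):
    """Return (doc, loc_a, loc_b) where the terms are within k positions."""
    hits = []
    for doc in sorted(postings_a.keys() & postings_b.keys()):
        for loc_a in postings_a[doc]:
            for loc_b in postings_b[doc]:
                if abs(loc_a - loc_b) <= k:
                    hits.append((doc, loc_a, loc_b))
    return hits


# Positional postings: document id -> locations (mostly hypothetical values).
abacus = {3: [94], 19: [7, 212], 22: [56]}
actor = {2: [66], 19: [213], 29: [45]}

adj_hits = adjacent(abacus, actor)
near_hits = near(abacus, actor, 4)
print(adj_hits)   # [(19, 212, 213)]
print(near_hits)  # [(19, 212, 213)]
```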
