Information Retrieval and Web Design


1 Information Retrieval and Web Design
Lecture (9) Prepared by Dr. Dunia Hamid Hameed

2 Text and Web Page Pre-Processing
For traditional text documents (no HTML tags), the pre-processing tasks are stopword removal, stemming, and the handling of digits, hyphens, punctuation, and letter case. For Web pages, additional tasks such as HTML tag removal and identification of the main content blocks also require careful consideration.
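As an illustration of the HTML tag removal step, here is a minimal sketch using Python's standard `html.parser` module. The class name `TextExtractor` and the function `strip_tags` are hypothetical names chosen for this example, and skipping the contents of `<script>` and `<style>` is an assumption about what counts as non-content. Identifying main content blocks is a harder problem and is not attempted here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}  # assumption: these never hold page content

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_tags(html):
    """Return the visible text of an HTML string with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join text nodes with a space, then collapse runs of whitespace.
    # Caveat: a word split by inline markup ("<b>H</b>ello") becomes two tokens.
    return " ".join(" ".join(parser.parts).split())
```

The resulting plain text can then go through the same steps as a traditional text document.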

3 Stopword Removal
Stopwords are frequently occurring, insignificant words in a language that help construct sentences but do not represent any content of the documents. Articles, prepositions, conjunctions, and some pronouns are natural candidates.

4 Common stopwords in English include:
a, about, an, are, as, at, be, by, for, from, how, in, is, of, on, or, that, the, these, this, to, was, what, when, where, who, will, with
Such words should be removed before documents are indexed and stored. Stopwords in the query are also removed before retrieval is performed.
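A minimal sketch of stopword removal, using the list from this slide. The function name `remove_stopwords` is hypothetical, and splitting on whitespace after lowercasing is a simplifying assumption in place of a real tokenizer.

```python
# Stopword list taken from the slide above.
STOPWORDS = {
    "a", "about", "an", "are", "as", "at", "be", "by", "for", "from",
    "how", "in", "is", "of", "on", "or", "that", "the", "these", "this",
    "to", "was", "what", "when", "where", "who", "will", "with",
}


def remove_stopwords(text):
    """Lowercase the text, split on whitespace, and drop stopwords."""
    return [token for token in text.lower().split() if token not in STOPWORDS]
```

The same function would be applied to both documents and queries, so that a query term is never looked up in a form the index does not contain.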

5 Stemming
Stemming refers to the process of reducing words to their stems or roots. A stem is the portion of a word that is left after removing its prefixes and suffixes. Stemming in English usually means suffix removal, or stripping.
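To make suffix stripping concrete, here is a deliberately naive sketch. It is not the Porter stemmer or any other published algorithm. The suffix list, the minimum stem length, and the name `strip_suffix` are all assumptions made for illustration.

```python
# Hypothetical suffix list, checked in order (longer suffixes first).
SUFFIXES = ("ing", "ed", "es", "s")


def strip_suffix(word, min_stem=3):
    """Remove the first matching suffix if at least min_stem letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word
```

With these rules, "walking", "walked", and "walks" all reduce to "walk", while short words such as "sing" and "is" are left unchanged by the length check. Real stemmers use many more rules and conditions.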

6 Advantages and Disadvantages of Stemming
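Advantages: Stemming increases recall, because different forms of the same word are matched to one stem, and it reduces the size of the indexing structure.
Disadvantages: Stemming can hurt precision, because unrelated words may be reduced to the same stem, so many irrelevant documents may be considered relevant.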

7 Digits Pre-processing
Digits: In traditional IR systems, numbers and terms that contain digits are removed, except for some specific types such as dates, times, and other prespecified types expressed with regular expressions. In search engines, however, they are usually indexed.
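The traditional IR behaviour can be sketched as follows. The two regular expressions (an ISO-style date and an `h:mm` time) and the name `filter_digit_terms` are assumptions chosen for this example. A real system would define its own prespecified patterns.

```python
import re

# Hypothetical "prespecified types" that are kept even though they contain digits.
KEEP_PATTERNS = [
    re.compile(r"\d{4}-\d{2}-\d{2}"),  # dates such as 2024-01-31
    re.compile(r"\d{1,2}:\d{2}"),      # times such as 9:30 or 14:05
]


def filter_digit_terms(tokens):
    """Drop tokens containing digits unless they fully match a kept pattern."""
    kept = []
    for token in tokens:
        has_digit = any(ch.isdigit() for ch in token)
        if not has_digit or any(p.fullmatch(token) for p in KEEP_PATTERNS):
            kept.append(token)
    return kept
```

A search engine that indexes all digit-bearing terms would simply skip this filter.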

8 Hyphens Pre-processing
Breaking hyphens is usually applied to deal with inconsistent usage. For example, some people write “state-of-the-art”, while others write “state of the art”.

9 Two Types of Hyphen Removal
Note that there are two types of removal: (1) each hyphen is replaced with a space; (2) each hyphen is simply removed without leaving a space.
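The two types of removal can be sketched as follows. The function names are hypothetical, and only the ASCII hyphen `-` is handled.

```python
def hyphen_to_space(text):
    """Type (1): replace each hyphen with a space."""
    return text.replace("-", " ")


def hyphen_removed(text):
    """Type (2): delete each hyphen without leaving a space."""
    return text.replace("-", "")
```

Type (1) turns "state-of-the-art" into "state of the art", while type (2) turns "pre-processing" into "preprocessing". Which one gives the better match depends on the term, which is why the choice needs care.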

10 Punctuation Marks Pre-processing
Punctuation can be handled in the same way as hyphens.
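A brief sketch of the same two options applied to punctuation. Treating Python's `string.punctuation` (ASCII punctuation only) as the set of marks is an assumption, and the function names are hypothetical.

```python
import string

# Translation tables: one maps each punctuation mark to a space, one deletes it.
_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})
_TO_NOTHING = str.maketrans("", "", string.punctuation)


def punctuation_to_space(text):
    """Replace each punctuation mark with a space, then collapse whitespace."""
    return " ".join(text.translate(_TO_SPACE).split())


def punctuation_removed(text):
    """Delete each punctuation mark without leaving a space."""
    return text.translate(_TO_NOTHING)
```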

11 Case of Letters
All letters are usually converted to either upper or lower case.
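A minimal sketch of case conversion, assuming lower case is the chosen target. The function name `normalize_case` is hypothetical.

```python
def normalize_case(tokens):
    """Convert every token to lower case."""
    return [token.lower() for token in tokens]
```

Whichever case is chosen, documents and queries must be converted the same way, or otherwise identical terms will fail to match.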

