University of Palestine, Topics in CIS (ITBS 3202), Ms. Eman Alajrami, 2nd Semester 2008-2009


2 Chapter 7: Text Operations, Part 2

3 Elimination of Stopwords
Words that occur with high frequency are not good discriminators. In fact, a word that occurs in 80% of the documents in the collection is useless for purposes of retrieval. Such words are frequently referred to as stopwords and are normally filtered out. Examples:
- Articles: a, an, the, …
- Prepositions: on, in, over, …
- Conjunctions: and, or.
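The 80% rule of thumb above can be sketched as a document-frequency check. This is only an illustration: the function name `high_frequency_terms`, the whitespace tokenization, and the five sample documents are assumptions, not part of the slides.

```python
from collections import Counter

def high_frequency_terms(documents, threshold=0.8):
    """Return terms that occur in at least `threshold` of the documents."""
    if not documents:
        return set()
    doc_freq = Counter()
    for text in documents:
        # Count each term once per document (document frequency,
        # not the number of occurrences inside the document).
        doc_freq.update(set(text.lower().split()))
    n_docs = len(documents)
    return {term for term, df in doc_freq.items() if df / n_docs >= threshold}

# Hypothetical toy collection, for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang in the tree",
    "the stock market fell",
    "rain is expected in the north",
]
candidates = high_frequency_terms(docs)
print(candidates)  # {'the'}
```

In this toy collection only "the" appears in at least 80% of the documents, so it is the only stopword candidate.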

4 Elimination of Stopwords
A stopword list derived from the Brown corpus (425 words): a, about, above, across, after, again, against, all, almost, alone, along, already, also, although, always, among, an, and, another, any, anybody, anyone, anything, anywhere, are, area, areas, around, as, ask, asked, asking, asks, at, away, b, back, backed, backing, backs, be, because, became, ...

5 Elimination of Stopwords
Elimination of stopwords reduces the size of the indexing structure (about 40% compression in the size of the indexing structure).
Unfortunately, elimination of stopwords can sometimes remove words that have a profound impact on the retrieved documents. Example: for the phrase 'to be or not to be', only 'be' is left.
Solution: use a full-text index.
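A minimal sketch of stopword filtering that reproduces the 'to be or not to be' problem. The `STOPWORDS` set here is a small illustrative subset chosen for this example (the entries "to" and "not" are assumptions, and "be" is deliberately left out to match the slide's example); it is not the 425-word Brown list.

```python
# Illustrative subset only; not the full Brown-corpus list.
STOPWORDS = {"a", "an", "the", "on", "in", "over", "and", "or", "to", "not"}

def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase, split on whitespace, and drop every stopword."""
    return [token for token in text.lower().split() if token not in stopwords]

remaining = remove_stopwords("to be or not to be")
print(remaining)  # ['be', 'be']
```

With only "be" left, a phrase query can no longer be matched reliably, which is why a full-text index keeps every word.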

6 Stemming
A stem is the portion of a word that is left after the removal of its affixes (prefixes and suffixes).
Stems are thought to be useful for improving retrieval performance because they reduce the variants of the same root to a common concept.
Example: connected, connecting, connection, and connections all reduce to the stem connect.

7 Stemming
Frakes distinguishes four types of stemming strategies:
- Affix removal
- Table lookup: simple, but needs data
- Successor variety
- N-gram

8 Stemming
- Affix removal: intuitive, simple, and can be implemented efficiently.
- Table lookup: looks up the stem of a word in a table (simple, but needs data for the whole language and considerable storage space).
- Successor variety: based on the determination of morpheme boundaries; uses knowledge from structural linguistics (complex, expensive to maintain).
- N-gram: based on the identification of digrams and trigrams; it is more of a clustering procedure than a stemming one (needs no data, but is imprecise).
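To make the n-gram strategy concrete, here is a hedged sketch that compares two words by their shared unique digrams. The Dice coefficient used as the similarity measure is an assumption (a common choice for this approach); the slides do not name a specific measure, and the function names are illustrative.

```python
def digrams(word):
    """Set of unique adjacent letter pairs in `word`."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice_similarity(a, b):
    """Dice coefficient over unique digrams: 2 * shared / (count_a + count_b)."""
    da, db = digrams(a.lower()), digrams(b.lower())
    if not da and not db:
        return 0.0
    return 2 * len(da & db) / (len(da) + len(db))

# "statistics" has 7 unique digrams, "statistical" has 8, and 6 are shared,
# so the similarity is 2 * 6 / (7 + 8) = 0.8.
similarity = dice_similarity("statistics", "statistical")
```

Words whose similarity exceeds some chosen cutoff would then be clustered together, which is why this method behaves like term clustering rather than true stemming.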

9 Stemming
Table lookup
Term          Stem
engineering   engineer
engineered    engineer
engineer      engineer
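The table above maps directly onto a dictionary lookup. In this sketch the name `lookup_stem` and the choice to return unknown terms unchanged are assumptions made for illustration.

```python
# The three rows of the slide's table.
STEM_TABLE = {
    "engineering": "engineer",
    "engineered": "engineer",
    "engineer": "engineer",
}

def lookup_stem(term, table=STEM_TABLE):
    """Return the stem stored for `term`, or the term itself if it is not in the table."""
    return table.get(term.lower(), term)

stems = [lookup_stem(t) for t in ["engineering", "engineered", "engineer"]]
print(stems)  # ['engineer', 'engineer', 'engineer']
```

The lookup is simple and fast, but every word form of the language has to be stored in the table, which is the storage cost noted for this strategy.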

10 Successor Variety
Definition (successor variety of a string): the number of different characters that follow it in words in some body of text.

11 Successor Variety (Continued)
Idea: the successor variety of substrings of a term decreases as more characters are added until a segment boundary is reached, at which point the successor variety sharply increases.
Example
Test word: READABLE
Corpus: ABLE, BEATABLE, FIXABLE, READ, READS, READABLE, READING, RED, ROPE, RIPE
Prefix      Successor Variety   Letters
R           3                   E, O, I
RE          2                   A, D
REA         1                   D
READ        3                   A, I, S
READA       1                   B
READAB      1                   L
READABL     1                   E
READABLE    1                   blank
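The table can be reproduced with a few lines of code. This is a sketch under stated assumptions: it counts only letters as successors (so the full word READABLE yields an empty set rather than "blank"), and the local-peak cutting rule in `segment_by_peaks` is one simple way to detect the "sharp increase"; the slides do not prescribe a particular rule.

```python
CORPUS = ["ABLE", "BEATABLE", "FIXABLE", "READ", "READS",
          "READABLE", "READING", "RED", "ROPE", "RIPE"]

def successors(prefix, corpus=CORPUS):
    """Distinct letters that directly follow `prefix` in the corpus words."""
    return {w[len(prefix)] for w in corpus
            if w.startswith(prefix) and len(w) > len(prefix)}

def segment_by_peaks(word, corpus=CORPUS):
    """Cut `word` after every prefix whose successor variety is a local peak."""
    # variety[i] is the successor variety of the prefix of length i + 1.
    variety = [len(successors(word[:n], corpus)) for n in range(1, len(word))]
    segments, start = [], 0
    for i in range(1, len(variety) - 1):
        if variety[i] > variety[i - 1] and variety[i] > variety[i + 1]:
            segments.append(word[start:i + 1])
            start = i + 1
    segments.append(word[start:])
    return segments

print(sorted(successors("READ")))    # ['A', 'I', 'S']
print(segment_by_peaks("READABLE"))  # ['READ', 'ABLE']
```

The successor variety drops from 3 to 1 along R, RE, REA, jumps back to 3 at READ, and falls to 1 again, so the word is cut into READ and ABLE.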

12 Affix Removal Stemmers
Procedure: remove suffixes and/or prefixes from terms, leaving a stem, and transform the resultant stem.
- Example: Porter's algorithm (for English), by Martin Porter. Ready-made code is available on the web.
- Substitution rules: sses → ss, s → NULL. Example: stresses → stress.
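The substitution rules can be sketched as an ordered suffix table. The block below covers only step 1a of Porter's algorithm; the `ies → i` and `ss → ss` rules come from Porter's published rule set rather than from the slide, and the function name is illustrative. The full algorithm has several more steps that are not shown here.

```python
# Step 1a rules, longest suffix first, so "sses" is tried before "s".
STEP_1A_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]

def porter_step_1a(word):
    """Apply the first (longest) matching suffix rule; leave other words unchanged."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word

print(porter_step_1a("stresses"))  # stress
print(porter_step_1a("cats"))      # cat
```

Trying the longest suffix first matters: without the `ss → ss` rule in the table, "caress" would match the plain `s` rule and lose its final letter.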

13 Affix Removal Stemmers
Example: plural forms
- If a word ends in “ies” but not “eies” or “aies”, then “ies” → “y” (surgeries → surgery).
- If a word ends in “es” but not “aes”, “ees”, or “oes”, then “es” → “e” (houses → house).
- If a word ends in “s” but not “us” or “ss”, then “s” → NULL (doors → door).
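The three plural rules above can be written as a small rule table. One assumption in this sketch: the rules are tried in the order listed and only the first rule whose full condition holds (suffix present, exceptions absent) is applied. The name `plural_stem` is illustrative.

```python
# (suffix, exception endings, replacement), in the order given on the slide.
PLURAL_RULES = [
    ("ies", ("eies", "aies"), "y"),
    ("es", ("aes", "ees", "oes"), "e"),
    ("s", ("us", "ss"), ""),
]

def plural_stem(word):
    """Apply the first plural rule whose suffix matches and whose exceptions do not."""
    word = word.lower()
    for suffix, exceptions, replacement in PLURAL_RULES:
        if word.endswith(suffix) and not word.endswith(exceptions):
            return word[:-len(suffix)] + replacement
    return word

for w in ["surgeries", "houses", "doors", "corpus", "stress"]:
    print(w, "->", plural_stem(w))
```

The exception lists keep words such as "corpus" and "stress" intact, since their final "s" is not a plural marker.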

14 Index Terms Selection
A sentence in natural language text is usually composed of nouns, pronouns, articles, verbs, adjectives, adverbs, and connectives.
Most of the semantics is carried by the nouns, so a promising strategy is to use the nouns in the text as index terms (eliminating the other words, such as verbs).

15 Index Terms Selection
Since it is common to combine two or three nouns in a single component (e.g., computer science), it makes sense to cluster nouns that appear near each other in the text into a single indexing component (concept).
Thus, instead of simply using nouns as index terms, we adopt noun groups. A noun group is a set of nouns whose syntactic distance in the text (measured as the number of words between two nouns) does not exceed a predefined threshold (for instance, 3).
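A rough sketch of noun grouping follows. It assumes the text has already been tokenized and tagged, so each token arrives as a `(word, is_noun)` pair (the tagging step and the sample sentence are hypothetical), and it takes the distance between two nouns to be the number of words between them.

```python
def noun_groups(tagged_tokens, threshold=3):
    """Group consecutive nouns separated by at most `threshold` intervening words."""
    groups, current, last_pos = [], [], None
    for pos, (token, is_noun) in enumerate(tagged_tokens):
        if not is_noun:
            continue
        # Words strictly between this noun and the previous noun.
        if last_pos is not None and pos - last_pos - 1 > threshold:
            groups.append(current)
            current = []
        current.append(token)
        last_pos = pos
    if current:
        groups.append(current)
    return groups

# Hypothetical pre-tagged sentence: True marks a noun.
sentence = [
    ("computer", True), ("science", True), ("students", True),
    ("often", False), ("study", False), ("many", False),
    ("different", False), ("algorithms", True),
]
groups = noun_groups(sentence)
print(groups)  # [['computer', 'science', 'students'], ['algorithms']]
```

Four words separate "students" from "algorithms", which exceeds the threshold of 3, so "algorithms" starts a new group.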

