1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester
2 Chapter 7 Text Operations Part2
3 Elimination of Stopwords Words with high frequency are not good discriminators. (in fact a word which occurs in 80% of the documents in the collection is useless for purposes of retrieval ). They are frequently referred to as stopwords filtered out. Examples: Articles: a, an, the,… Prepositions: on, in,over,… Conjunctions: and,or.
4 Elimination of Stopwords (derived from Brown corpus): 425 words: a, about, above, across, after, again, against, all, almost, alone, along, already, also, although, always, among, an, and, another, any, anybody, anyone, anything, anywhere, are, area, areas, around, as, ask, asked, asking, asks, at, away, b, back, backed, backing, backs, be, because, became,...
5 Elimination of Stopwords Elimination of stopwords reduces the size of the indexing structure (a bout 40% compression in the size of the indexing structure ). Unfortunately, sometimes elimination of stopwords could eliminate words that have a profound impact on the retrieved documents. Ex:’to be or not to be’ be (is only left). Solution full text index.
6 Stemming A stem is the portion of the word which is left after the removal of the affixes ( prefixes and suffixes). They are thought to be useful for improving retrieval performance ( reduce the variants of the same root to a common concept ) connect connected, connecting, connection, connections.
7 Stemming Frakes distinguish 4 types of stemming strategies: Affix removal Table lookup: simple, but needs data Successor variety. N-gram.
8 Stemming Affix removal: intuitive, simple and can be implemented efficiently Table lookup: looking for the stem of a word in a table( simple,but needs data for the whole language and considerable storage space). Successor variety: based on the determination of the morpheme boundaries, uses knowledge from structural Linguistic (complex, expensive maintenance). N-gram: based on the identification of digrams and trigrams, and it is more clustering procedure than a stemming one.(no data, but imprecise).
9 Stemming TermStem engineeringengineer engineeredengineerengineer Table lookup
10 Successor Variety Definition (successor variety of a string) the number of different characters that follow it in words in some body of text.
11 Successor Variety (Continued) Idea The successor variety of substrings of a term will decrease as more characters are added until a segment boundary is reached, i.e., the successor variety will sharply increase. Example Test word: READABLE Corpus: ABLE, BEATABLE, FIXABLE, READ,READS READABLE, READING, RED, ROPE, RIPE PrefixSuccessor VarietyLetters R3E, O, I RE2A, D REA 1D READ3A, I, S READA1B READAB1L READABL1E READABLE 1blank
12 Affix Removal Stemmers procedure Remove suffixes and/or prefixes from terms leaving a stem, and transform the resultant stem. E.g., Porter’s algorithm (Eng Lang.) Porter algorithm. Martin Porter. Ready code in the web. Substitution rules: sses s, s stresses stress.
13 Affix Removal Stemmers Example: plural forms If a word ends in “ies” but not “eies” or “aies”then “ies” --> “y” ( surgeries--> surgery ). If a word ends in “es” but not “aes”, “ees”, or “oes” then “es” --> “e” ( houses house ). If a word ends in “s”, but not “us” or “ss” then “s” --> NULL. ( doors--> door ).
14 Index Terms Selection A sentence in natural language text is usually composed of nouns, pronouns, articles, verbs, adjectives, adverbs, and connectives. Most of the semantics meaning is carried by the noun words.so it is a promising strategy to use the nouns in the text (by eliminating the others like verbs,etc,..).
15 Index Terms Selection Since it is common to combine two or three nouns in a single component (ex. Computer science) it makes sense to cluster nouns which appear nearby in the text into a single indexing component (concept). Thus instead of simply using nouns as index terms, we adopt noun groups)a set of nouns whose syntactic distance in the text and does not exceed a predefined threshold (for instance,3)(