Published by Cuthbert Wilson. Modified over 9 years ago.
1
Text Preprocessing
2
Preprocessing step
Aims to create a correct text representation, according to the adopted model.
Steps:
– Lexical analysis;
– Case folding, numbers;
– Stop-word elimination;
– Stemming;
– (other preprocessing procedures...)
3
Generating index terms
[Figure: logical view of the documents. Labels: Docs; Structure; Spaces and Signals; Stopwords; Nominal groups; Stemming; Manual indexing; Full text; Index terms.]
– Stop-word elimination;
– Nominal group detection;
– Stemming;
– Index term generation;
– Other preprocessing procedures: synonyms, co-occurrences, latent semantic indexing...
4
Text preprocessing
Most common procedures:
– “Tokenization”: identification of the words in a text;
  Words are defined as “strings with continuous alphanumeric characters with no spaces, possibly including hyphens and apostrophes, but no end-of-sentence”;
  The most commonly used word separators are the blank, the tab, or the newline.
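The word definition above can be sketched with a regular expression. This is a minimal illustration, not the tokenizer used by the author: the pattern `TOKEN_RE` and the function `tokenize` are hypothetical names, and the pattern assumes ASCII letters and digits only.

```python
import re

# Assumed pattern: a run of alphanumerics, optionally continued by
# hyphen- or apostrophe-joined runs (straight or curly apostrophe).
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:[-'’][A-Za-z0-9]+)*")


def tokenize(text):
    """Return the word tokens of text; blanks, tabs and newlines separate words."""
    return TOKEN_RE.findall(text)


print(tokenize("I'll send an e-mail to New York."))
# ["I'll", 'send', 'an', 'e-mail', 'to', 'New', 'York']
```

Note that this sketch keeps "New York" as two tokens and drops the final period, so it does not distinguish abbreviations from sentence ends.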
5
Text preprocessing
Problems:
a) End-of-sentence marks vs. abbreviations; e.g. Wash.
b) Apostrophes ( ‘ ): “magic words” vs. contractions; e.g. I’ll.
c) Hyphens: single words vs. hyphenated words; e.g. e-mail.
d) Blanks: a blank does not always indicate word separation; e.g. database and data base; New York and San Francisco.
e) Numbers: 9365 1873.
6
Text preprocessing
– Case-Folding: the THE The => THE;
– http://www.delorie.com/gnu/docs/diffutils/diff_6.html
– http://curry.edschool.virginia.edu/aace/conf/webnet/html/invwitt.htm
– http://www.dlib.org/dlib/november96/newzealand/11witten.html
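Case folding amounts to mapping every token to a single case. The sketch below folds to upper case, as in the slide's example; `case_fold` is a hypothetical helper name.

```python
def case_fold(tokens):
    """Map every token to upper case so that case variants collapse."""
    return [token.upper() for token in tokens]


variants = ["the", "THE", "The"]
print(case_fold(variants))       # ['THE', 'THE', 'THE']
print(set(case_fold(variants)))  # {'THE'}
```

Folding to lower case works equally well; what matters is that the same choice is applied to both documents and queries.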
7
Text preprocessing
– Stop-word removal: an, the, is, are, and, or, so, because, ...; a list (524 words) is available on the Web in the BOW library, CMU.
– http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
– http://searchenginewatch.internet.com/facts/stopwords.html
– http://pen2.ci.santa-monica.ca.us/city/municode/stopwords.html
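Stop-word removal can be sketched as a set lookup. The set below contains only the eight words named on the slide, not the 524-word list; `STOP_WORDS` and `remove_stop_words` are hypothetical names.

```python
# Only the examples from the slide; a real list would be much longer.
STOP_WORDS = {"an", "the", "is", "are", "and", "or", "so", "because"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lower-cased form is in the stop-word set."""
    return [token for token in tokens if token.lower() not in stop_words]


print(remove_stop_words(["The", "cat", "is", "on", "an", "old", "mat"]))
# ['cat', 'on', 'old', 'mat']
```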
8
Text preprocessing
– Stemming: compressed, compression => compress; Porter's algorithm:
– http://maya.cs.depaul.edu/~mobasher/classes/ds599/porter.html
– http://ils.unc.edu/keyes/java/porter
9
Stemming algorithm [Porter 86]
Steps:
1. Plural removal, including special cases such as "sses" and "ies";
2. Matching and replacement of suffix patterns such as: "ational" -> "ate", "tional" -> "tion", "enci" -> "ence", "anci" -> "ance", "iser" -> "ize", "abli" -> "able", "alli" -> "al", "entli" -> "ent", "eli" -> "e", "ousli" -> "ous", "ization" -> "ize", "isation" -> "ize", "ation" -> "ate", "ator" -> "ate", "alism" -> "al", "iveness" -> "ive", "fulness" -> "ful", "ousness" -> "ous", "aliti" -> "al", "iviti" -> "ive", "biliti" -> "ble";
10
Stemming algorithm [Porter 86]
Steps:
3. Handling of special transformations such as: "icate" -> "ic", "ative" -> "", "alize" -> "al", "alise" -> "al", "iciti" -> "ic", "ical" -> "ic", "ful" -> "", "ness" -> "";
4. Verification of composite words, including the suffixes: "al", "ance", "ence", "er", "ic", "able", "ible", "ant", "ement", "ment", "ent", "sion", "tion", "ou", "ism", "ate", "iti", "ous", "ive", "ize", "ise";
5. Verification of whether the word ends with a vowel: "kilo", "micro", "milli", "intra", "ultra", "mega", "nano", "pico", and "pseudo".
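The suffix mappings of step 2 can be sketched as a rule table applied longest suffix first. This is a simplified illustration, not the full Porter algorithm: it uses a subset of the rules listed for step 2, omits the stem-length conditions that the real algorithm checks before rewriting, and the names `STEP2_RULES` and `apply_suffix_rules` are hypothetical.

```python
# A subset of the step 2 mappings from the slides: (suffix, replacement).
STEP2_RULES = [
    ("ational", "ate"), ("tional", "tion"), ("enci", "ence"), ("anci", "ance"),
    ("abli", "able"), ("alli", "al"), ("entli", "ent"), ("eli", "e"),
    ("ousli", "ous"), ("ization", "ize"), ("ation", "ate"), ("ator", "ate"),
    ("alism", "al"), ("iveness", "ive"), ("fulness", "ful"),
    ("ousness", "ous"), ("aliti", "al"), ("iviti", "ive"), ("biliti", "ble"),
]


def apply_suffix_rules(word, rules=STEP2_RULES):
    """Rewrite the longest matching suffix; return word unchanged if none match."""
    # Longest suffix first, so "ational" wins over "tional" and "ation".
    for suffix, replacement in sorted(rules, key=lambda r: len(r[0]), reverse=True):
        # Assumption: require a non-empty stem before the suffix.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["relational", "conditional", "organization", "sensibiliti"]:
    print(w, "->", apply_suffix_rules(w))
# relational -> relate
# conditional -> condition
# organization -> organize
# sensibiliti -> sensible
```

Inputs such as "sensibiliti" end in "i" because an earlier step of the algorithm has already turned a final "y" into "i".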
11
Text preprocessing
– N-Grams: APPLE => _APP, APPL, PPLE, PLE_
– http://www.cs.umbc.edu/ngram
– http://citeseer.nj.nec.com/miller99hidden.html
– http://citeseer.nj.nec.com/5655.html
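The APPLE example corresponds to character 4-grams over the word padded with one underscore on each side. A minimal sketch, assuming that padding scheme (`char_ngrams` is a hypothetical name):

```python
def char_ngrams(word, n=4, pad="_"):
    """Return the character n-grams of word, padded with one pad char per side."""
    padded = pad + word + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(char_ngrams("APPLE"))
# ['_APP', 'APPL', 'PPLE', 'PLE_']
```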
12
Text preprocessing
Other techniques: Part-of-Speech tagger (Eric Brill, www.cs.jhu.edu/~brill/):
– Separation of a sentence into its syntactic or grammatical components (POS tags);
– Mainly used for the categories that carry information content: nouns, verbs, adjectives.
13
Brill POS Tagger Output
Input: Mr. Red have a red ball
Output: Mr/NNP ./. Red/NNP have/VBP a/DT red/JJ ball/NN
Part of Speech Tags
DT    Determiner
JJ    Adjective
NN    Noun, singular or mass
NNP   Proper noun, singular
VBP   Verb, non-3rd ps. sing. present
.     Sentence-final punctuation
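Tagger output in the word/TAG format shown above is easy to post-process. The sketch below does not run the Brill tagger; it only parses an already tagged line and keeps nouns, verbs and adjectives. The helper names `parse_tagged` and `content_words` are hypothetical.

```python
def parse_tagged(line):
    """Split 'word/TAG' items into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits on the last slash, so './.' becomes ('.', '.').
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs


def content_words(pairs):
    """Keep words tagged as nouns (NN*), verbs (VB*) or adjectives (JJ*)."""
    return [word for word, tag in pairs if tag.startswith(("NN", "VB", "JJ"))]


tagged = "Mr/NNP ./. Red/NNP have/VBP a/DT red/JJ ball/NN"
print(content_words(parse_tagged(tagged)))
# ['Mr', 'Red', 'have', 'red', 'ball']
```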
14
POS - nouns
– Nouns in general indicate generic entities (dog, tree);
– for English, consider only the plural noun variation;
– the plural is usually marked by the suffix -s (dogs, trees);
– the plural has exceptions: “es” (speeches) and irregular terms (woman: women);
– in addition there is the possessive case (woman’s house), called a clitic.
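The noun variations listed above (plural -s, the "es" exception, irregular plurals, and the possessive clitic) can be sketched as a small normalizer. This is a rough illustration under simplifying assumptions: the irregular table holds only the slide's example, and the suffix rules will mishandle many English nouns (for instance "bus"). `IRREGULAR` and `singularize` are hypothetical names.

```python
# Only the irregular pair mentioned on the slide.
IRREGULAR = {"women": "woman"}


def singularize(noun):
    """Strip the possessive clitic, then undo a regular or known irregular plural."""
    word = noun.lower()
    for clitic in ("'s", "’s"):
        if word.endswith(clitic):
            word = word[: -len(clitic)]
    if word in IRREGULAR:
        return IRREGULAR[word]
    # "es" plurals after sibilant endings, e.g. speeches -> speech.
    if word.endswith(("ches", "shes", "sses", "xes")):
        return word[:-2]
    # Regular -s plural, e.g. dogs -> dog (but leave "glass" alone).
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


for w in ["dogs", "trees", "speeches", "women", "woman's"]:
    print(w, "->", singularize(w))
# dogs -> dog
# trees -> tree
# speeches -> speech
# women -> woman
# woman's -> woman
```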
15
Text preprocessing
– WordNet (Princeton University): http://www.cogsci.princeton.edu/wordnet/current/
  Is a database of lexemes [Miller 98];
  Contains information about composite expressions (phrasal verbs, collocations, idiomatic phrases, etc.);
  Separates its entries according to their syntactic categories: nouns, verbs, adjectives, …;
  Within a category, several semantic relations among words are stored.
16
WordNet
[Figure: WordNet search interface. Labels: Search for; return of Noun, of Verb, of Adjective, of Adverb.]
17
WordNet Composition:
18
WordNet
WordNet contains the relations hyponym, hypernym, meronym and holonym:
– a hyponym is a more specific word: cat is a hyponym of animal;
– a hypernym is a more generic word: animal is a hypernym of cat;
– a meronym is a part of a whole: leaf is a meronym of tree;
– a holonym is the whole that corresponds to a part: tree is a holonym of leaf.
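The four relations come in inverse pairs: hypernym and hyponym, holonym and meronym. The toy sketch below stores one direction of each pair and derives the other. It is a hand-built dictionary for illustration only, not the WordNet database or its API, and all names are hypothetical.

```python
# Toy data using the slide's examples: specific -> generic, part -> whole.
HYPERNYM = {"cat": "animal", "dog": "animal"}
HOLONYM = {"leaf": "tree"}


def invert(relation):
    """Build the inverse relation: each value maps to the list of its keys."""
    inverse = {}
    for key, value in relation.items():
        inverse.setdefault(value, []).append(key)
    return inverse


HYPONYMS = invert(HYPERNYM)  # generic -> more specific words
MERONYMS = invert(HOLONYM)   # whole -> its parts

print(HYPONYMS["animal"])  # ['cat', 'dog']
print(MERONYMS["tree"])    # ['leaf']
```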