Text Preprocessing

Preprocessing step

Aims to create a correct text representation, according to the adopted model.

Steps:
–Lexical analysis;
–Case folding, numbers;
–Stop-words elimination;
–Stemming;
–(other preprocessing procedures...)

Generating index terms

[Figure: logical view of the documents, from full text to index terms. Stages shown: docs, structure, spaces and signals, stopwords, nominal groups, stemming, manual indexing.]

–Stop-words elimination;
–Nominal groups detection;
–Stemming;
–Index terms generation;
–Other preprocessing procedures: synonyms, co-occurrences, latent semantic indexing, ...

Text preprocessing

Most common procedures:

–“Tokenization”: identification of the words in the text. Words are defined as “strings with continuous alphanumeric characters with no spaces, possibly including hyphens and apostrophes, but no end-of-sentence”. The most commonly used word separators are the blank, the tab, or the new-line.

Problems:
a) End of sentence vs. abbreviations; e.g. Wash.
b) Apostrophes (‘): “magic words” vs. contractions; e.g. I’ll.
c) Hyphens: single words vs. hyphenated words; e.g. e-mail.
d) Blank: sometimes does not indicate word separation; e.g. database and data base; New York and San Francisco.
e) Numbers: 9365 1873.
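The word definition above can be sketched as a regular expression. This is only a minimal illustration: the pattern `TOKEN_RE` and the function `tokenize` are hypothetical names, not part of the original slides, and the sketch does not try to solve problems a) to e) (for example, it drops the period of "Wash." and splits "New York" into two tokens).

```python
import re

# Hypothetical pattern: a run of alphanumerics, optionally continued by
# internal hyphens or apostrophes (straight or curly) followed by more
# alphanumerics. Blanks, tabs, new-lines and sentence punctuation separate tokens.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:[-'’][A-Za-z0-9]+)*")


def tokenize(text):
    """Return the word tokens found in text, in order of appearance."""
    return TOKEN_RE.findall(text)


print(tokenize("I'll e-mail Wash. about the data base in New York."))
```

Keeping internal hyphens and apostrophes inside a token is one possible choice; a system that prefers "e" and "mail" as separate terms would simply drop them from the pattern.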

–Case folding: the, THE, The => THE;
–http://www.delorie.com/gnu/docs/diffutils/diff_6.html
–http://curry.edschool.virginia.edu/aace/conf/webnet/html/invwitt.htm
–http://www.dlib.org/dlib/november96/newzealand/11witten.html
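A minimal sketch of case folding, assuming tokens are already available as a list of strings. The function name `fold_case` and its `upper` flag are illustrative assumptions; the slide folds to upper case, while many systems fold to lower case instead.

```python
def fold_case(tokens, upper=True):
    """Map every token to a single case so that variants compare equal."""
    if upper:
        return [t.upper() for t in tokens]
    # str.casefold() is a more aggressive lower-casing meant for caseless matching.
    return [t.casefold() for t in tokens]


print(fold_case(["the", "THE", "The"]))
```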

–Stop-words removal: an, the, is, are, and, or, so, because, ...; a list (524 words) is available on the Web in the BOW library, CMU.
–http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
–http://searchenginewatch.internet.com/facts/stopwords.html
–http://pen2.ci.santa-monica.ca.us/city/municode/stopwords.html
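Stop-word removal can be sketched as a set-membership filter. The set below contains only the eight words listed on the slide; it stands in for a full list such as the 524-word one, and the names `STOP_WORDS` and `remove_stop_words` are assumptions made for this example.

```python
# Only the examples from the slide; a real list would be much longer.
STOP_WORDS = frozenset({"an", "the", "is", "are", "and", "or", "so", "because"})


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lower-cased form appears in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]


print(remove_stop_words(["The", "cat", "is", "on", "an", "old", "mat"]))
```

Comparing the lower-cased token keeps the filter independent of whether case folding has already been applied.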

–Stemming: compressed, compression => compress; Porter's algorithm:
–http://maya.cs.depaul.edu/~mobasher/classes/ds599/porter.html
–http://ils.unc.edu/keyes/java/porter

Stemming algorithm [Porter 86]

Steps:
1. Plural removal, including special cases such as "sses" and "ies";
2. Mapping of suffix patterns to shorter suffixes, such as: "ational" -> "ate", "tional" -> "tion", "enci" -> "ence", "anci" -> "ance", "iser" -> "ize", "abli" -> "able", "alli" -> "al", "entli" -> "ent", "eli" -> "e", "ousli" -> "ous", "ization" -> "ize", "isation" -> "ize", "ation" -> "ate", "ator" -> "ate", "alism" -> "al", "iveness" -> "ive", "fulness" -> "ful", "ousness" -> "ous", "aliti" -> "al", "iviti" -> "ive", "biliti" -> "ble";

3. Handling of special transformations such as: "icate" -> "ic", "ative" -> "", "alize" -> "al", "alise" -> "al", "iciti" -> "ic", "ical" -> "ic", "ful" -> "", "ness" -> "";
4. Verification of composite words, including the suffixes: "al", "ance", "ence", "er", "ic", "able", "ible", "ant", "ement", "ment", "ent", "sion", "tion", "ou", "ism", "ate", "iti", "ous", "ive", "ize", "ise";
5. Verification of whether the word ends with a vowel: "kilo", "micro", "milli", "intra", "ultra", "mega", "nano", "pico", and "pseudo".
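As an illustration of step 2 only, the suffix table can be applied with a longest-match rule. This is a partial sketch, not a full Porter stemmer: it assumes the usual Porter condition that the remaining stem must contain at least one vowel-consonant sequence (measure m > 0), and it assumes the earlier steps have already run (which is why the table contains forms such as "enci" and "abli"). The names `STEP2`, `measure` and `apply_step2` are hypothetical.

```python
# Suffix table copied from step 2 of the slide.
STEP2 = {
    "ational": "ate", "tional": "tion", "enci": "ence", "anci": "ance",
    "iser": "ize", "abli": "able", "alli": "al", "entli": "ent",
    "eli": "e", "ousli": "ous", "ization": "ize", "isation": "ize",
    "ation": "ate", "ator": "ate", "alism": "al", "iveness": "ive",
    "fulness": "ful", "ousness": "ous", "aliti": "al", "iviti": "ive",
    "biliti": "ble",
}


def is_consonant(word, i):
    """Porter's definition: 'y' counts as a vowel when it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def apply_step2(word):
    """Replace the longest matching suffix if the remaining stem has m > 0."""
    for suffix in sorted(STEP2, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if measure(stem) > 0:
                return stem + STEP2[suffix]
            return word  # longest suffix matched but the condition failed
    return word


for w in ("relational", "conditional", "rational", "hopefulness"):
    print(w, "->", apply_step2(w))
```

Trying the longest suffix first matters because several entries overlap: "ational" must be tested before "tional" and "ation".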

–N-grams: APPLE => _APP, APPL, PPLE, PLE_
–http://www.cs.umbc.edu/ngram
–http://citeseer.nj.nec.com/miller99hidden.html
–http://citeseer.nj.nec.com/5655.html
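The APPLE example corresponds to character 4-grams over the word padded with one underscore on each side. A minimal sketch follows; the function name `char_ngrams` and its default parameters are assumptions chosen to reproduce that example.

```python
def char_ngrams(word, n=4, pad="_"):
    """Return all character n-grams of the word padded once on each side."""
    padded = pad + word + pad
    # A sliding window of width n; yields an empty list if the word is too short.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(char_ngrams("APPLE"))
```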

Other techniques: Part-of-Speech tagger (Eric Brill, www.cs.jhu.edu/~brill/):
–Separation of a sentence into its syntactic or grammatical components (POS tags);
–Main use in terms of information content: nouns, verbs, adjectives.

Brill POS Tagger Output

Input: Mr. Red have a red ball
Output: Mr/NNP ./. Red/NNP have/VBP a/DT red/JJ ball/NN

Part of Speech Tags
DT    Determiner
NNP   Proper noun, singular
JJ    Adjective
VBP   Verb, non-3rd ps. sing. present
NN    Noun, singular or mass
.     Sentence-final punctuation
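The word/TAG output format can be consumed with a few lines of code. The sketch below does not run the Brill tagger itself; it only parses a tagged line like the one above and keeps nouns, verbs and adjectives. The names `parse_tagged`, `content_words` and `CONTENT_PREFIXES` are hypothetical, and the example assumes tokens are separated by blanks.

```python
# Tag prefixes treated as content-bearing: nouns, verbs, adjectives.
CONTENT_PREFIXES = ("NN", "VB", "JJ")


def parse_tagged(line):
    """Split 'word/TAG' tokens into (word, tag) pairs, splitting on the last slash."""
    return [tuple(token.rsplit("/", 1)) for token in line.split()]


def content_words(pairs):
    """Keep only the words whose tag marks a noun, verb or adjective."""
    return [word for word, tag in pairs if tag.startswith(CONTENT_PREFIXES)]


tagged = "Mr/NNP ./. Red/NNP have/VBP a/DT red/JJ ball/NN"
pairs = parse_tagged(tagged)
print(pairs)
print(content_words(pairs))
```

Splitting on the last slash keeps tokens such as "./." intact, since the word part may itself be punctuation.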

POS - nouns

–in general indicate generic entities (dog, tree);
–for English, consider only the plural noun variation;
–the plural is usually characterized by the suffix -s (dogs, trees);
–the plural has exceptions: "es" (speeches) and irregular terms (woman: women);
–in addition there is the possessive case (woman's house), called clitic.
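The noun rules above can be turned into a small normalizer. This is a deliberately naive sketch under stated assumptions: the irregular table holds only the example from the slide, the "es" rule is limited to a few common endings, and words such as "bus" would be handled incorrectly. The names `IRREGULAR` and `singularize` are made up for this example.

```python
# Irregular plurals; only the example from the slide is included.
IRREGULAR = {"women": "woman"}


def singularize(noun):
    """Strip the possessive clitic, then undo the most common plural forms."""
    word = noun.lower()
    for clitic in ("'s", "’s"):
        if word.endswith(clitic):
            word = word[: -len(clitic)]
    if word in IRREGULAR:
        return IRREGULAR[word]
    # "es" plurals after sibilant endings, e.g. speeches -> speech.
    if word.endswith(("ches", "shes", "sses", "xes", "zes")):
        return word[:-2]
    # Regular "-s" plural, e.g. dogs -> dog; leave words like "glass" alone.
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


for w in ("dogs", "trees", "speeches", "women", "woman's"):
    print(w, "->", singularize(w))
```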

–WordNet (Princeton University): http://www.cogsci.princeton.edu/wordnet/current/
–Is a database of lexemes [Miller 98];
–Contains information about composite expressions (phrasal verbs, collocations, idiomatic phrases, etc.);
–Separates its entries according to their syntactic categories: nouns, verbs, adjectives, ...;
–Within a category, several semantic relations among words are stored.

WordNet

Search for and return of: nouns, verbs, adjectives, adverbs.

WordNet Composition:

WordNet

WordNet contains the relations hyponym, hypernym, meronym and holonym:
–a hyponym is a more specific word: cat is a hyponym of animal;
–a hypernym is a more generic word: animal is a hypernym of cat;
–a meronym is a part of the whole: leaf is a meronym of tree;
–a holonym is the whole which corresponds to a part: tree is a holonym of leaf.
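The four relations come in inverse pairs, which a toy data structure makes explicit. The sketch below does not query the real WordNet database; the dictionaries hold only the slide's examples plus one extra entry ("dog") added for illustration, and all names are hypothetical.

```python
# word -> its more generic word (hypernym); "dog" is an added illustration.
HYPERNYM = {"cat": "animal", "dog": "animal"}
# part -> the whole it belongs to (holonym).
HOLONYM = {"leaf": "tree"}


def invert(relation):
    """Turn a child -> parent mapping into a parent -> [children] mapping."""
    inverse = {}
    for child, parent in relation.items():
        inverse.setdefault(parent, []).append(child)
    return inverse


HYPONYMS = invert(HYPERNYM)  # animal -> [cat, dog]
MERONYMS = invert(HOLONYM)   # tree -> [leaf]

print(HYPONYMS["animal"])
print(MERONYMS["tree"])
```

Storing one direction and deriving the other keeps the two views consistent: hyponym is the inverse of hypernym, and meronym is the inverse of holonym.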
