Published by Phoebe Hicks. Modified over 9 years ago.
1 COMP 791A: Statistical Language Processing Corpus-Based Work Chap. 4
2 Using a Corpus
- To approximate the probability distribution of language events, we use a training corpus.
- Statistical NLP seeks to automatically learn lexical and structural preferences from corpora.
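As a rough illustration of approximating a probability distribution from a training corpus, the sketch below estimates word probabilities by relative frequency (count divided by corpus size). The toy corpus and the function name `relative_frequencies` are assumptions made for this example; a real training corpus would be far larger.

```python
from collections import Counter

def relative_frequencies(tokens):
    """Estimate P(w) as count(w) / N for every word w in the token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Hypothetical toy "training corpus", already split on whitespace.
toy_corpus = "the dog saw the cat and the cat saw the dog".split()
probs = relative_frequencies(toy_corpus)

print(probs["the"])  # 4 occurrences out of 11 tokens
```

Words that never occur in the corpus get no estimate at all here, which is one reason the corpus needs to be large and representative.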
3 Corpus
- A large database of text and speech
- Many types of text corpora exist: plain text, domain-specific, tagged, parsed, parallel bilingual, …
- Major suppliers:
  - Linguistic Data Consortium (LDC): www.ldc.upenn.edu
  - European Language Resources Association (ELRA): www.icp.grnet.fr/ELRA
- To derive the needed probabilities, a corpus needs to be:
  - large
  - a representative sample of the population of interest
4 Low-Level Formatting Issues
- Junk formatting and content
  - Remove typesetter codes (ex. HTML tags), diagrams, tables, foreign words, etc.
  - Data retrieved through OCR brings other problems (unrecognized words)
- Uppercase and lowercase: should we keep the case or not?
  - Should “the”, “The” and “THE” all be treated the same?
  - But “Brown” in “George Brown” and “brown” in “brown dog” should be treated separately…
5 Finding Tokens and Sentences
- Tokenization
  - Divide the input text into units (called tokens)
  - Each token is either a word or something else (ex. a number or a punctuation mark)
- Mark sentence boundaries
  - Most sentences end with ‘.’, ‘?’ or ‘!’
  - Boundary detection can be confused by abbreviations
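The tokenization step described on this slide can be sketched with a single regular expression. This is a minimal illustration, not a production tokenizer: the pattern `TOKEN_RE` and the function name `tokenize` are assumptions for this example, and the pattern deliberately treats every punctuation mark as its own token.

```python
import re

# Alternatives are tried left to right:
#   1. a number, optionally with a decimal part (e.g. "3.5")
#   2. a word, possibly with internal hyphens or apostrophes (e.g. "e-mail")
#   3. any single character that is neither a word character nor whitespace
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|\w+(?:[-']\w+)*|[^\w\s]")

def tokenize(text):
    """Split text into word, number and punctuation tokens."""
    return TOKEN_RE.findall(text)

print(tokenize("Is it 3.5 km? Yes!"))
# ['Is', 'it', '3.5', 'km', '?', 'Yes', '!']
```

Because every period becomes a separate token, this sketch does not yet address the abbreviation problem: "etc." comes out as "etc" followed by ".".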
6 Tokenization -- What is a word?
- Graphic word (Kučera and Francis, 1967): “A string of contiguous alphanumeric characters with white spaces on either side; may include hyphens and apostrophes, but no other punctuation marks”
- But what about: “$22.50”, “C++”, “:-)”
- Main problems:
  - Periods
    - Abbreviation or end of sentence? “etc.”, “Calif.”, “Wash.”
    - Is the period part of the word or not? For “Wash.” (Washington) we need to keep the period to distinguish it from “wash” (the verb)
  - Single apostrophes
    - Part of the word or not? “Peter’s sick” --> 1 word or 2 words?
    - If 1 word, then parsing becomes a problem: the rule S --> NP VP expects the subject and the verb to be separate tokens
    - If 2 words, then should the possessive in “Peter’s house” also be split into 2 words?
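To see where the graphic-word definition breaks down, the following sketch tests whether a whitespace-delimited string satisfies it. The function name `is_graphic_word` is an assumption, and so is the extra requirement of at least one alphanumeric character, which is added here so that a bare hyphen does not count as a word.

```python
def is_graphic_word(s):
    """Check a whitespace-delimited string against the graphic-word definition:
    alphanumeric characters, optionally with hyphens and apostrophes,
    and no other punctuation marks."""
    allowed_punct = {"-", "'", "\u2019"}  # hyphen, straight and curly apostrophe
    only_allowed = all(ch.isalnum() or ch in allowed_punct for ch in s)
    has_alnum = any(ch.isalnum() for ch in s)
    return only_allowed and has_alnum

for candidate in ["e-mail", "Peter's", "$22.50", "C++", ":-)"]:
    print(candidate, is_graphic_word(candidate))
# e-mail True
# Peter's True
# $22.50 False
# C++ False
# :-) False
```

The last three strings are rejected even though most applications would want to keep them as single tokens, which is exactly the objection raised on the slide.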
7 Tokenization -- What is a word? (con’t)
- Hyphens
  - Is the hyphen a line-break hyphen (inserted to improve justification of text) or part of the word?
  - Ex: “e-mail”, “pro-life”, “data-base” / “database” / “data base”
- Diacritics
  - Remove them?
- Homographs
  - Should we distinguish 2 words that have the same spelling but unrelated senses?
  - “bow”: part of a ship / a knot of ribbon
  - “saw”: instrument / past tense of “to see”
- Word segmentation in other languages
  - Some languages use no whitespace between words (ex: East-Asian languages)
  - German writes compounds as a single word: “life insurance company employee” = “Lebensversicherungsgesellschaftsangestellter”
8 Tokenization -- What is a word? (con’t)
- Whitespace does not always indicate a word break
  - Ex: do we really want to separate the phrases “in spite of”, “as a matter of fact”, “work out”?
  - If not, then what do we do with non-adjacent phrasal verbs? “I could not work the answer out”
- Variant forms of some semantic types
  - Ex: telephone numbers: (514) 848-3074, +1 514 848 3074, +1 (514) 848 3074
- Speech corpora
  - More contractions and fillers (ex. “um”, “well”, “euh”)
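One way to handle variant forms such as the telephone numbers above is to map them to a single canonical form before counting. The sketch below assumes 10-digit North American numbers with country code 1; the function name `normalize_phone` and the `default_country` parameter are made up for this example.

```python
import re

def normalize_phone(text, default_country="1"):
    """Map a phone number written in various styles to '+<digits>'.

    Assumption: a number with exactly 10 digits is missing its country
    code, and `default_country` is prepended.
    """
    digits = re.sub(r"\D", "", text)  # drop everything that is not a digit
    if len(digits) == 10:
        digits = default_country + digits
    return "+" + digits

variants = ["(514) 848-3074", "+1 514 848 3074", "+1 (514) 848 3074"]
print({normalize_phone(v) for v in variants})
# {'+15148483074'}
```

All three surface forms collapse to one token, so the corpus statistics treat them as the same event.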
9 Tokenization -- Lemmatizer
- What about morphological variants? Should “give”, “gives”, “given”, “giver”… be considered different words?
- Goal: “normalize” similar words
- Two main approaches:
  - Stemming
  - Morphological analysis
10 Stemming
- Very “dumb” rules work well (for English and Romance languages)
- Ex: the Porter stemmer
  - Strips off affixes and leaves the stem: give --> give, gives --> give + s, given --> give + en, …
  - Uses simple rules; only the first applicable rule is applied:
    - IF word ends with “ies” but not with “eies” or “aies” THEN replace “ies” by “y”
    - IF word ends with “es” but not with “aes”, “ees” or “oes” THEN replace “es” by “e”
    - IF word ends with “s” but not with “us” or “ss” THEN remove “s”
- Advantage: fast
- Disadvantages:
  - Rules depend on the language
  - Unreadable results, ex: “computers”, “computation”, “computational” --> “comput”
  - May reduce different words to the same stem although they are actually distinct:
    - stocks --> stock, stockings --> stock
    - arm --> arm, army --> arm
    - organization --> organ
    - university --> universe
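The three IF/THEN rules on this slide can be written down almost literally. The sketch below covers only those three rules, which handle plural-like endings; a full stemmer such as Porter's has many more steps. It assumes that a rule counts as "applicable" only when both its suffix test and its exception test pass, so a word blocked by an exception falls through to the next rule. The names `RULES` and `strip_plural` are assumptions for this example.

```python
# (suffix, endings that block the rule, replacement), in the slide's order.
RULES = [
    ("ies", ("eies", "aies"), "y"),
    ("es", ("aes", "ees", "oes"), "e"),
    ("s", ("us", "ss"), ""),
]

def strip_plural(word):
    """Apply the first applicable rule; return the word unchanged if none applies."""
    for suffix, exceptions, replacement in RULES:
        # str.endswith accepts a tuple and is True if any element matches.
        if word.endswith(suffix) and not word.endswith(exceptions):
            return word[: -len(suffix)] + replacement
    return word

for w in ["ponies", "gives", "stocks", "glass", "corpus", "boxes"]:
    print(w, "-->", strip_plural(w))
# ponies --> pony
# gives --> give
# stocks --> stock
# glass --> glass
# corpus --> corpus
# boxes --> boxe
```

The last result shows the kind of unreadable output the slide lists as a disadvantage: the rules are fast, but they never check whether the stem is a real word.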
11 Morphological Analyzer
- Apply morphological rules
  - (XXXes, V) --> (XXXe, V)
  - (XXXes, N) --> (XXXe, N)
  - Ex: files --> (file, N), (file, V)
- Check that (file, N) and (file, V) are in the dictionary
- Advantages:
  - Identifies the root, which is an actual word
  - Fewer errors than stemming
- Disadvantage: more complex and slower
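The rule-then-dictionary-check idea can be sketched as follows. The toy lexicon, the rule table and the function name `analyze` are all hypothetical; only the two rules shown on the slide are included, and a real analyzer would use a much larger rule set and dictionary.

```python
# Hypothetical toy dictionary: lemma -> set of parts of speech.
LEXICON = {
    "file": {"N", "V"},
    "give": {"V"},
    "box": {"N", "V"},
}

# (suffix to strip, replacement, part of speech), following the slide's rules
# (XXXes, N) --> (XXXe, N) and (XXXes, V) --> (XXXe, V).
RULES = [
    ("es", "e", "N"),
    ("es", "e", "V"),
]

def analyze(word, lexicon=LEXICON, rules=RULES):
    """Return every (lemma, pos) analysis that the dictionary confirms."""
    analyses = []
    for suffix, replacement, pos in rules:
        if word.endswith(suffix):
            lemma = word[: -len(suffix)] + replacement
            # Keep the candidate only if the dictionary lists it with this POS.
            if pos in lexicon.get(lemma, set()):
                analyses.append((lemma, pos))
    return analyses

print(analyze("files"))  # [('file', 'N'), ('file', 'V')]
print(analyze("gives"))  # [('give', 'V')]
print(analyze("boxes"))  # [] because "boxe" is not in the dictionary
```

The dictionary lookup is what rules out non-words such as "boxe", at the cost of extra work per word and the need to maintain the dictionary.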
12 Sentences: What is a sentence?
- Something ending with a ‘.’, ‘?’ or ‘!’
  - True in 90% of the cases
- But sentences can be split up by other punctuation marks or quotes
  - Ex: nested phrases: “You remind me,” she remarked, “of your mother.”
- We usually use heuristic methods
  - But hand-coded heuristics…
- Some effort has gone into statistical methods for sentence-boundary detection
  - A typical classification problem: classify a period as an end-of-sentence marker or not
  - Use features such as case, length, POS tag of the words preceding the period, …
  - Use decision trees, neural networks, …
  - Some techniques reach 98-99% correct classification
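A hand-coded heuristic of the kind mentioned on this slide might look like the sketch below. It is not a statistical classifier: it treats ‘.’, ‘?’ or ‘!’ as a boundary when followed by whitespace and an uppercase letter or opening quote, unless the preceding token is a known abbreviation. The abbreviation list and the function name `split_sentences` are assumptions for this example.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (lowercased, with period).
ABBREVIATIONS = {"etc.", "calif.", "wash.", "dr.", "mr.", "mrs."}

# Candidate boundary: end punctuation followed by whitespace and then
# an uppercase letter or an opening quotation mark.
BOUNDARY_RE = re.compile(r'[.?!]+(?=\s+[A-Z"“])')

def split_sentences(text):
    """Split text into sentences using the heuristic described above."""
    sentences = []
    start = 0
    for match in BOUNDARY_RE.finditer(text):
        end = match.end()
        last_token = text[start:end].split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, not a boundary
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("He moved to Wash. State in May. Was it cold? Yes!"))
# ['He moved to Wash. State in May.', 'Was it cold?', 'Yes!']
```

The weakness is visible in the abbreviation list itself: every new domain needs new entries, and an abbreviation that really does end a sentence is never split. That is the motivation for learning the decision from features instead.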
13 Marked-Up Data: Mark-up Schemes
- Schemes developed to mark up the structure of text
- Different mark-up schemes:
  - COCOA format: older, and rather ad hoc
  - SGML, and other related encodings: HTML, XML
14 Example
Input text:
Bogota, 9 jan 90 (EFE) -- [text] Ricardo Alfonso Castellar, mayor of Achi, in the northern department of Bolivar, who was kidnapped on 5 January, apparently by army of national liberation (ELN) guerrillas, was found dead today, according to authorities. Castellar was kidnapped on 5 january on the outskirts of Achi, about 850 km north of Bogota, by a group of armed men, who forced him to accompany them to an undisclosed location. Police sources in Cartagena reported that Castellar's body showed signs of torture and several bullet wounds. Castellar was attacked by ELN guerrillas while he was traveling in a boat down the Cauca river to the tenche area, a region within his jurisdiction. In Cartagena it was reported that Castellar faced a “revolutionary trial” by the ELN and that he was found guilty and executed.
15 Example (con’t)
Text with named entity tags:
Bogota, 9 jan 90 ( EFE ) -- [text] Ricardo Alfonso Castellar, mayor of Achi, in the northern department of Bolivar, who was kidnapped on 5 January, apparently by army of national liberation (ELN) guerrillas, was found dead today, according to authorities. Castellar was kidnapped on 5 january on the outskirts of Achi, about 850 km north of Bogota, by a group of armed men, who forced him to accompany them to an undisclosed location....
16 Example (con’t)
Text with coreference tags:
Bogota, 9 jan 90 (EFE) -- [text] Ricardo Alfonso Castellar, mayor of Achi, in the northern department of Bolivar, who was kidnapped on 5 January, apparently by army of national liberation (ELN) guerrillas, was found dead today, according to authorities. Castellar was kidnapped on 5 january on the outskirts of Achi, about 850 km north of Bogota, by a group of armed men, who forced him...
17 Example (con’t)
Interpretation of coreference tags:
Bogota, 9 jan 90 (EFE) -- [text] Ricardo Alfonso Castellar, mayor of Achi, in the northern department of Bolivar, who was kidnapped on 5 January, apparently by army of national liberation (ELN) guerrillas, was found dead today, according to authorities. Castellar was kidnapped on 5 january on the outskirts of Achi, about 850 km north of Bogota, by a group of armed men, who forced him to accompany them to an undisclosed location. Police sources in Cartagena reported that Castellar's body showed signs of torture and several bullet wounds. Castellar was attacked by ELN guerrillas while he was traveling in a boat down the Cauca river to the tenche area, a region within his jurisdiction. In Cartagena it was reported that Castellar faced a “revolutionary trial” by the ELN and that he was found guilty and executed.
18 Marked-Up Data: Grammatical Coding
- Tags indicate the various parts of speech of tokens
- Different tag sets have been used:
  - Brown tag set: 87/179 tags
  - Penn Treebank (most used): 45 tags
  - London-Lund: 197 tags
  - CLAWS1: 132 tags
  - CLAWS2: 166 tags
  - CLAWS c5: 62 tags
19 The design of a Tag Set
- Target feature (classification): tags tell the user useful information about the grammatical class of a word
- Predictive feature: tags are used by the system to predict the behavior of other words in the context
- Example, in the Brown tag set:
  - VBG: verb, present participle
  - But words tagged VBG can be used as a gerund or as a noun
    - Gerund: “While purchasing/VBG a gift, I noticed that I was out of money.”
    - Noun: “Concordia’s purchasing/VBG? Department is closed.”
- 2 conflicting goals: splitting a tag improves prediction but makes classification harder
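The examples above use the common word/TAG notation for tagged text. As a small sketch of how such data might be read, the function below splits each whitespace-delimited item at its last slash; the name `parse_tagged` is an assumption, and items without a slash are returned with a tag of `None`.

```python
def parse_tagged(text):
    """Turn 'word/TAG' items into (word, tag) pairs; untagged items get tag None."""
    pairs = []
    for item in text.split():
        # rpartition splits at the LAST slash, so words containing '/' survive.
        word, sep, tag = item.rpartition("/")
        if sep:
            pairs.append((word, tag))
        else:
            pairs.append((item, None))
    return pairs

print(parse_tagged("While purchasing/VBG a gift"))
# [('While', None), ('purchasing', 'VBG'), ('a', None), ('gift', None)]
```

With the tags in this form, a single VBG label cannot tell the gerund use apart from the noun use; splitting the tag would expose that distinction to the system, at the price of a harder tagging decision.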