Foundations of Statistical NLP Chapter 4. Corpus-Based Work 박 태 원

1 Foundations of Statistical NLP Chapter 4. Corpus-Based Work 박 태 원

2 Abstract
• Getting Set Up
  – Computers, Corpora, Software
• Looking at Text
  – Low-level formatting issues
  – Tokenization: What is a word?
  – Morphology
  – Sentences
• Mark-up Data
  – Markup schemes
  – Grammatical tagging

3 Getting Set Up (1/2)
• Text corpora are usually big.
  – Their size was a major limitation on the use of corpora
  – Overcome by advances in computers
• Corpora
  – Use text corpora distributed by the main organizations
  – Corpus: a special collection of textual material
  – A general issue is whether the corpus is a representative sample of the population of interest.

4 Getting Set Up (2/2)
• Software
  – Text editors: show the text fairly literally
  – Regular expressions: find certain patterns in text
  – Programming languages: C, C++, Perl
  – Programming techniques
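
As a minimal sketch of using a regular expression to find a pattern in text, the Python snippet below pulls dollar amounts out of a made-up sample string. The sample text and the pattern are illustrative assumptions, not taken from the slides.

```python
import re

# Hypothetical sample text (not from any real corpus).
text = "The book costs $22.50, the pen $3 and the bag $105.99."

# A dollar sign, one or more digits, and an optional two-digit cents part.
price_pattern = re.compile(r"\$\d+(?:\.\d{2})?")

prices = price_pattern.findall(text)
print(prices)
```

The same approach extends to other surface patterns (dates, abbreviations, markup codes) by changing the pattern.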

6 Looking at Text
• Text comes in a raw format or marked up.
• Markup
  – A term used for putting codes of some sort into a computer file
  – Commercial word processing: WYSIWYG
• Features of text in human languages
  – Make it difficult to process automatically

7 Low-level formatting issues
• Junk formatting/content
  – Junk: document headers, separators, tables, diagrams, etc.
  – OCR: a program meant to deal only with English text should remove junk (other content)
• Uppercase and lowercase
  – The original Brown corpus: * was used to mark a capital letter
  – Should we treat brown in Richard Brown and brown paint as the same?
  – Proper name detection: a difficult problem

8 Tokenization: What is a word? (1)
• Tokenization
  – Dividing the input text into units called tokens
  – What is a word?
    Graphic word (Kucera and Francis, 1967): "a string of contiguous alphanumeric characters with space on either side; may include hyphens and apostrophes, but no other punctuation marks"
    -> a workable definition, but it does not cover items such as $22.50, Micro$oft, C|net
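
The graphic-word definition can be approximated with a regular expression. The sketch below is one possible reading of it, assuming ASCII letters and digits, hyphens and straight apostrophes only between alphanumeric runs, and whitespace (or the text boundary) on either side. The name `graphic_words` and the sample sentence are made up for illustration.

```python
import re

# One reading of the "graphic word": alphanumeric runs, optionally joined by
# internal hyphens or apostrophes, with whitespace or the text edge on both sides.
GRAPHIC_WORD = re.compile(r"(?<!\S)[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*(?!\S)")

def graphic_words(text):
    """Return the substrings of text that fit the pattern above."""
    return GRAPHIC_WORD.findall(text)

# Hypothetical sample sentence.
sample = "the 26-year-old isn't paying $22.50 to Micro$oft"
tokens = graphic_words(sample)
print(tokens)
```

Under these assumptions the strings $22.50 and Micro$oft are skipped entirely, which illustrates why the strict definition is too narrow for real text.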

9 Tokenization: What is a word? (2)
• Period
  – Distinguishing end-of-sentence punctuation marks from abbreviation marks, as in etc. or Wash.
• Single apostrophes
  – English contractions: I’ll or isn’t
  – dog’s: dog is, dog has, or the genitive case
• Hyphenation
  – A line-breaking hyphen may be present in a typeset source
  – e-mail, 26-year-old, co-operate
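
The line-breaking hyphen problem can be sketched as follows. This is a rough heuristic, not a method from the slides: word halves split across a line break are rejoined without the hyphen unless the hyphenated form appears in a small list of words known to keep it. The list `KEEP_HYPHEN` and the function name are hypothetical.

```python
import re

# Hypothetical list of words that legitimately contain a hyphen.
KEEP_HYPHEN = {"e-mail", "co-operate"}

def rejoin_line_breaks(text):
    """Rejoin words that were split by a hyphen at the end of a line."""
    def join(match):
        left, right = match.group(1), match.group(2)
        hyphenated = f"{left}-{right}"
        # Keep the hyphen only for known hyphenated words.
        return hyphenated if hyphenated.lower() in KEEP_HYPHEN else left + right
    return re.sub(r"(\w+)-\s*\n\s*(\w+)", join, text)

repaired = rejoin_line_breaks("hyphens and apo-\nstrophes sent by e-\nmail")
print(repaired)
```

A fixed list cannot settle every case, since the same word may be written with or without a hyphen by different authors.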

10 Tokenization: What is a word? (3)
• The same form representing multiple “words”
  – Homographs: ‘saw’ has two lexemes (chap. 7)
• Word segmentation in other languages
  – Many languages do not put spaces between words
• Whitespace not indicating a word break
  – the New York-New Haven railroad
• Variant coding of information of a certain semantic type

11 Morphology
• Stemming
  – A process that strips off affixes and leaves you with a stem
• Lemmatization
  – Attempting to find the lemma or lexeme of which one is looking at an inflected form
• The IR community has shown that stemming does not help performance
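
To make "strips off affixes" concrete, here is a deliberately naive suffix stripper. It is only an illustration: the suffix list and the minimum stem length of three characters are assumptions, and it is far cruder than any real stemming algorithm.

```python
# Hypothetical suffix list, checked in this order (longest first).
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["walking", "walked", "walks", "boxes", "sing"]]
print(stems)
```

Lemmatization differs in that it would map an irregular form such as "saw" to its lexeme, which needs a lexicon and context rather than suffix rules.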

12 Sentences
• What is a sentence?
  – Something ending with a ‘.’, ‘?’ or ‘!’
  – Text before or after a colon, semicolon, or dash may also be regarded as a sentence
• Recent research on sentence boundary detection
  – Riley (1989): statistical classification trees
  – Palmer and Hearst (1994; 1997): a neural network to predict sentence boundaries
  – Mikheev (1998): Maximum Entropy approach to the problem
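
A rule-based baseline for the ‘.’, ‘?’, ‘!’ heuristic might look like the sketch below. It is not one of the statistical methods cited above. It simply ends a sentence at any whitespace-delimited token that finishes with one of those marks, unless the token is in a tiny, hypothetical abbreviation list.

```python
# Hypothetical, deliberately tiny abbreviation list (lowercased).
ABBREVIATIONS = {"etc.", "wash.", "dr.", "mr."}

def split_sentences(text):
    """Split text at tokens ending in . ? or ! that are not known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".?!" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing words without final punctuation
        sentences.append(" ".join(current))
    return sentences

result = split_sentences("Dr. Brown moved to Wash. last year. Did he like it? Yes!")
print(result)
```

The baseline fails when an abbreviation really does end a sentence, which is the kind of ambiguity the learned classifiers are meant to resolve.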

13 Mark-up Schemes
• Early markup schemes
  – Included header information in texts (giving author, date, title, etc.)
• SGML
  – A general language that lets one define a grammar for texts
• XML
  – A subset of SGML designed particularly for the web
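
As an illustration of header information carried in markup, the sketch below parses a small XML document with Python's standard `xml.etree.ElementTree` module. The tag names (`text`, `header`, `body`, `s`) and the contents are invented for this example and do not follow any particular corpus scheme.

```python
import xml.etree.ElementTree as ET

# Hypothetical marked-up text; tag names are made up for illustration.
doc = """<text>
  <header><author>A. Writer</author><date>1999</date><title>Sample</title></header>
  <body><s>First sentence.</s><s>Second sentence.</s></body>
</text>"""

root = ET.fromstring(doc)

# Collect header fields into a dictionary, and sentence elements into a list.
header = {child.tag: child.text for child in root.find("header")}
sentences = [s.text for s in root.iter("s")]

print(header)
print(sentences)
```

Because the structure is explicit, the header can be separated from the running text without guessing at layout.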

14 Grammatical tagging
• First step of analysis
  – Automatic grammatical tagging for categories
  – Distinguishing, for example, comparative and superlative forms
• Tag sets (Table 4.5)
  – Incorporate morphological distinctions of a particular language
• The design of a tag set
  – Target feature of classification: useful information about the grammatical class of a word
  – Predictive feature: predicting the behavior of other words in the context

