Published by Cecil Garrett. Modified over 8 years ago.
1
Foundations of Statistical NLP
Chapter 4. Corpus-Based Work
박 태 원
2
Abstract
Getting Set Up
–Computers, Corpora, Software
Looking at Text
–Low-level formatting issues
–Tokenization: What is a word?
–Morphology
–Sentences
Marked-up Data
–Markup schemes
–Grammatical tagging
3
Getting Set Up (1/2)
Text corpora are usually big.
–Their size used to be a major limitation on the use of corpora.
–Advances in computer hardware have largely overcome this limitation.
Corpora
–Use text corpora distributed by the main organizations.
–A corpus is a special collection of textual material.
–The general issue is whether the corpus is a representative sample of the population of interest.
4
Getting Set Up (2/2)
Software
–Text editors: show the contents of a file fairly literally
–Regular expressions: find certain patterns in text
–Programming languages: C, C++, Perl
–Programming techniques
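To make the regular-expression bullet concrete, here is a minimal sketch in Python (not one of the languages named on the slide). The pattern, the function name `find_capitalized`, and the sample sentence are all illustrative assumptions.

```python
import re

# Illustrative pattern: one uppercase letter followed by lowercase letters,
# delimited by word boundaries (\b).
CAPITALIZED = re.compile(r"\b[A-Z][a-z]+\b")

def find_capitalized(text):
    """Return every capitalized word in the text, in order of appearance."""
    return CAPITALIZED.findall(text)

# Hypothetical sample sentence.
sample = "The Brown corpus and the Penn Treebank are distributed as text files."
print(find_capitalized(sample))  # ['The', 'Brown', 'Penn', 'Treebank']
```

Note that the sentence-initial "The" is matched as well, so a pattern like this is only a starting point for finding names.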
6
Looking at Text
Text comes in a raw format or marked up.
Markup
–a term used for putting codes of some sort into a computer file
–commercial word processing: WYSIWYG
Features of text in human languages
–make it difficult to process automatically
7
Low-level formatting issues
Junk formatting/content
–junk: document headers, separators, tables, diagrams, etc.
–OCR output: if the program deals only with connected English text, remove the junk (all other content)
Uppercase and lowercase
–The original Brown corpus: a * was used to indicate a capital letter
–Should we treat brown in Richard Brown and brown paint as the same?
–proper name detection: a difficult problem
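The uppercase/lowercase question can be illustrated with a small sketch. Both functions below are assumptions for illustration: `fold_all` lowercases everything, while `fold_sentence_initial` applies one common heuristic (lowercase only the first token of a sentence). Neither performs proper name detection.

```python
def fold_all(tokens):
    """Lowercase every token; 'Brown' (name) and 'brown' (color) become identical."""
    return [t.lower() for t in tokens]

def fold_sentence_initial(tokens):
    """Lowercase only the first token, whose capital letter is positional."""
    if not tokens:
        return []
    return [tokens[0].lower()] + list(tokens[1:])

# Hypothetical sentence, already split into tokens.
tokens = ["Brown", "paint", "was", "bought", "by", "Richard", "Brown"]
print(fold_all(tokens))
print(fold_sentence_initial(tokens))
```

The second heuristic keeps "Richard Brown" intact, but it would wrongly lowercase a sentence that starts with a proper name, which is one reason the slide calls proper name detection difficult.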
8
Tokenization: What is a word? (1)
Tokenization
–divides the input text into units called tokens
–What is a word?
Graphic word (Kucera and Francis, 1967)
–“a string of contiguous alphanumeric characters with space on either side; may include hyphens and apostrophes, but no other punctuation marks”
–a workable definition, but it does not cover items such as $22.50, Micro$oft, C|net
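The graphic-word definition can be approximated with a regular expression. This is a sketch under two assumptions: only ASCII letters and digits count as alphanumeric, and hyphens and straight apostrophes are allowed only between alphanumeric runs. The names `is_graphic_word` and `split_graphic_words` are hypothetical.

```python
import re

# Alphanumeric runs joined by single hyphens or apostrophes, nothing else.
GRAPHIC_WORD = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")

def is_graphic_word(token):
    """True if the whole token fits the (approximated) graphic-word definition."""
    return GRAPHIC_WORD.fullmatch(token) is not None

def split_graphic_words(text):
    """Split on whitespace, then separate graphic words from everything else."""
    words, others = [], []
    for token in text.split():
        (words if is_graphic_word(token) else others).append(token)
    return words, others

words, others = split_graphic_words(
    "The 26-year-old didn't pay $22.50 to Micro$oft or C|net"
)
print(words)   # ['The', '26-year-old', "didn't", 'pay', 'to', 'or']
print(others)  # ['$22.50', 'Micro$oft', 'C|net']
```

The second list contains exactly the items from the slide that the definition fails to cover.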
9
Tokenization: What is a word? (2)
Period
–must distinguish end-of-sentence punctuation marks from abbreviation marks, as in etc. or Wash.
Single apostrophes
–English contractions: I’ll or isn’t
–dog’s: dog is, dog has, or the genitive case
Hyphenation
–line-breaking hyphens are present in typographical sources
–e-mail, 26-year-old, co-operate
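The hyphenation problem can be shown with a naive rejoining rule. The sketch below assumes that every hyphen directly followed by a line break is a line-breaking hyphen, which is exactly the assumption that fails for words such as co-operate. The function name is hypothetical.

```python
import re

# A word character, a hyphen, a newline (plus optional indentation), a word character.
LINE_BREAK_HYPHEN = re.compile(r"(\w)-\n\s*(\w)")

def join_line_break_hyphens(text):
    """Remove hyphen + newline pairs, gluing the two word halves together."""
    return LINE_BREAK_HYPHEN.sub(r"\1\2", text)

print(join_line_break_hyphens("apo-\nstrophes"))  # apostrophes
print(join_line_break_hyphens("co-\noperate"))    # cooperate (the real hyphen is lost)
print(join_line_break_hyphens("e-mail"))          # e-mail (no line break, unchanged)
```

A more careful approach would consult a lexicon before deleting the hyphen; that step is left out here.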
10
Tokenization: What is a word? (3)
The same form representing multiple “words”
–homographs: ‘saw’ has two lexemes (Chapter 7)
Word segmentation in other languages
–Many languages do not put spaces between words
Whitespace not indicating a word break
–the New York-New Haven railroad
Variant coding of information of a certain semantic type
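For text written without spaces, one simple baseline is greedy longest-match segmentation against a word list. The slide does not name this technique; it is swapped in here purely as an illustration, and the lexicon and the English example with its spaces removed are hypothetical stand-ins.

```python
def max_match(text, lexicon):
    """Greedy left-to-right segmentation: always take the longest lexicon entry."""
    max_len = max(map(len, lexicon)) if lexicon else 1
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                break
        else:
            j = i + 1  # nothing matched: emit a single character
        tokens.append(text[i:j])
        i = j
    return tokens

# Toy lexicon (an assumption for this example).
lexicon = {"the", "table", "down", "there", "theta", "bled", "own"}
print(max_match("thetabledownthere", lexicon))
# ['theta', 'bled', 'own', 'there']
```

The greedy choice of "theta" blocks the intended reading "the table down there", which shows why word segmentation is treated as a real problem rather than a lookup.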
11
Morphology
Stemming
–a process that strips off affixes and leaves you with a stem
Lemmatization
–attempts to find the lemma or lexeme of which an inflected form is an instance
The IR community has found that stemming generally does not improve retrieval performance.
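The difference between the two processes can be sketched as follows. `crude_stem` is a deliberately simplistic suffix stripper (not the Porter stemmer or any published algorithm), and `LEMMAS` is a tiny hypothetical lookup table standing in for a real morphological lexicon.

```python
# Suffixes are tried in this order; the list is an illustrative assumption.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, provided at least min_stem letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

# Hypothetical lemma table: irregular forms cannot be handled by stripping affixes.
LEMMAS = {"went": "go", "mice": "mouse", "better": "good"}

def lemmatize(word):
    """Look the word up in the lemma table; fall back to the word itself."""
    return LEMMAS.get(word, word)

print([crude_stem(w) for w in ["walking", "walked", "walks", "sing"]])
# ['walk', 'walk', 'walk', 'sing']
print([lemmatize(w) for w in ["went", "mice", "walks"]])
# ['go', 'mouse', 'walks']
```

Stemming conflates regular forms cheaply but says nothing about "went", while the lemma table handles irregular forms only if they are listed.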
12
Sentences
What is a sentence?
–something ending with a ‘.’, ‘?’ or ‘!’
–text set off by a colon, semicolon, or dash may also be regarded as a sentence
Recent research on sentence boundary detection
–Riley (1989): statistical classification trees
–Palmer and Hearst (1994; 1997): a neural network to predict sentence boundaries
–Mikheev (1998): Maximum Entropy approaches to the problem
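Before turning to the learned approaches listed above, it helps to see how far a hand-written rule gets. The sketch below is a heuristic baseline, not any of the cited methods: it ends a sentence at ‘?’, ‘!’ or ‘.’ unless the token is in a small abbreviation list, which is a hypothetical assumption.

```python
# Hypothetical abbreviation list; a real system would need a much larger one.
ABBREVIATIONS = {"etc.", "Wash.", "Dr.", "Mr.", "Prof."}

def split_sentences(text):
    """Split whitespace-separated tokens into sentences with a simple rule."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        is_boundary = token.endswith(("?", "!")) or (
            token.endswith(".") and token not in ABBREVIATIONS
        )
        if is_boundary:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith lives in Wash. He likes corpora. Really?"))
# ['Dr. Smith lives in Wash. He likes corpora.', 'Really?']
```

The rule handles "Dr." correctly but misses the boundary after "Wash.", because an abbreviation can also end a sentence. Ambiguities of this kind are what the statistical classifiers aim to resolve.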
13
Mark-up Schemes
Early markup schemes
–mainly included header information in texts (giving author, date, title, etc.)
SGML
–a general language that lets one define a grammar for texts
XML
–a subset of SGML designed particularly for the web
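As an illustration of header information plus marked-up text, the sketch below parses a small XML document with Python's standard `xml.etree.ElementTree` module. The element names (`text`, `header`, `body`, `s`) and the sample content are hypothetical and do not follow any particular corpus encoding standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: a header block followed by sentence-level markup.
SAMPLE = """<text>
  <header>
    <author>A. Writer</author>
    <date>1999-01-01</date>
    <title>A Sample Document</title>
  </header>
  <body><s>First sentence.</s><s>Second sentence.</s></body>
</text>"""

def read_document(xml_string):
    """Return the header fields as a dict and the sentences as a list."""
    root = ET.fromstring(xml_string)
    header = {child.tag: child.text for child in root.find("header")}
    sentences = [s.text for s in root.iter("s")]
    return header, sentences

header, sentences = read_document(SAMPLE)
print(header["author"], header["date"])  # A. Writer 1999-01-01
print(sentences)  # ['First sentence.', 'Second sentence.']
```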
14
Grammatical tagging
First step of analysis
–automatic grammatical tagging for categories
–often more detailed than conventional parts of speech, e.g. distinguishing comparative and superlative forms
Tag sets (Table 4.5)
–incorporate the morphological distinctions of a particular language
The design of a tag set
–target feature of classification: gives useful information about the grammatical class of a word
–predictive feature: predicts the behavior of other words in the context
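The trade-off between detailed and coarse tags can be sketched with a few Penn Treebank tags, where JJ, JJR and JJS mark base, comparative and superlative adjectives. The coarse labels and the mapping function are assumptions for illustration, and the tagged sentence is hand-written, not tagger output.

```python
# Detailed Penn Treebank tags mapped to hypothetical coarse categories.
COARSE = {
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",  # base, comparative, superlative
    "NN": "NOUN", "NNS": "NOUN",              # singular, plural
}

def coarsen(tagged_tokens):
    """Replace each detailed tag by its coarse category when one is defined."""
    return [(word, COARSE.get(tag, tag)) for word, tag in tagged_tokens]

# Hand-tagged example phrase.
tagged = [("bigger", "JJR"), ("corpora", "NNS"), ("biggest", "JJS"), ("corpus", "NN")]
print(coarsen(tagged))
# [('bigger', 'ADJ'), ('corpora', 'NOUN'), ('biggest', 'ADJ'), ('corpus', 'NOUN')]
```

Collapsing the tags makes the classification easier, but it discards the comparative/superlative distinction that a detailed tag set is designed to keep.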