Corpus-Based Work
Chapter 4, Foundations of Statistical Natural Language Processing
Introduction
– Requirements of NLP work: computers, corpora, application software
– This chapter covers issues concerning the formats of, and problems encountered in dealing with, raw data
– Low-level processing needed before the actual work: word and sentence extraction
Getting Set Up
– Computers: memory requirements for large corpora; statistical NLP methods involve counts that must be accessed quickly
– Corpora: “a corpus is a special collection of textual material collected according to a certain set of criteria”
– Licensing
– Most of the time, free sources are not linguistically marked up
Corpora
– Representative sample: what we find for the sample also holds for the general population
– Balanced corpus: each subtype of text matches some predetermined criterion of importance
– In statistical NLP a representative corpus matters, and the type/domain of the corpus should be reported along with results
Software
– Text editors: TextPad, Emacs, BBEdit
– Regular expressions: patterns describing a regular language
– Programming languages:
  C/C++: widely used, efficient
  Perl: text preparation and formatting
  Prolog: a built-in database and easy handling of complicated structures make it useful
  Java: purely object-oriented, with automatic memory management
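As a small illustration of the regular-expression searching mentioned above (the pattern and example text here are my own, not from the chapter):

```python
import re

# Find tokens that begin with a capital letter (an illustrative pattern).
text = "The IBM deal, announced in Oct. 1993, surprised Wall St. analysts."
capitalized = re.findall(r"\b[A-Z][A-Za-z]*\b", text)
print(capitalized)  # ['The', 'IBM', 'Oct', 'Wall', 'St']
```

Note that the word-boundary `\b` stops at the periods, so the abbreviation dots in Oct. and St. are not captured; exactly the kind of punctuation issue the later slides discuss.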
Looking at Text
– Text comes either raw or marked up; ‘markup’ means codes inserted into the data file that give information about the text
– Issues in automatic processing:
  Junk formatting/content (corpus cleaning)
  Case: should everything be folded to one case?
    1. What about proper nouns?
    2. Stress expressed through capitalization
    Folding case loses contextual information
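A crude sketch of the trade-off above: lower-casing only the sentence-initial token keeps mid-sentence capitals, which are likely proper nouns. The heuristic is mine, not the book's, and it still mishandles sentence-initial proper nouns.

```python
def fold_case(tokens):
    """Lower-case only the sentence-initial token; mid-sentence
    capitals (likely proper nouns) are kept.  A crude heuristic:
    a sentence starting with a name like 'Richard' is still
    wrongly lower-cased."""
    return ([tokens[0].lower()] + tokens[1:]) if tokens else []

print(fold_case(["The", "company", "sued", "Microsoft", "."]))
# ['the', 'company', 'sued', 'Microsoft', '.']
```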
Tokenization
– The text is divided into units called ‘tokens’
– How should punctuation marks be treated?
What is a word?
– Graphic word (Kucera and Francis 1967): a string of contiguous alphanumeric characters with white space on either side
– This definition is not practical, even for Latin-alphabet text
– News corpora in particular contain odd entries, e.g. Micro$oft, C|net
– Apart from such oddities there are further issues
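The graphic-word definition amounts to little more than splitting on white space, which keeps the odd entries intact but leaves punctuation attached (the example text is illustrative):

```python
def graphic_words(text):
    """Approximate Kucera and Francis's 'graphic word': maximal
    runs of non-whitespace characters."""
    return text.split()

print(graphic_words("Micro$oft posted record C|net traffic."))
# ['Micro$oft', 'posted', 'record', 'C|net', 'traffic.']
```

Note that traffic. keeps its period, which is exactly the problem the next slide takes up.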
Periods
– Words are not always bounded by white space: commas, semicolons and periods attach to them
– Periods occur at the end of sentences but also at the end of abbreviations
– In abbreviations the period should stay attached to the word (Wash. ≠ wash)
– When an abbreviation ends a sentence, only one period appears, performing both functions; borrowing a term from morphology, this phenomenon is referred to as ‘haplology’
Single Apostrophes
– Difficulties in dealing with constructions such as I’ll or isn’t
– By the graphic-word definition the count is 1, but these should count as 2 words:
  1. Syntactically, S → NP VP, and I’ll spans both the subject NP and part of the VP
  2. But if we split, some funny “words” (e.g. n’t) appear in the collection
– A single apostrophe may also close a quotation
– Possessive forms of words ending in ‘s’ or ‘z’: Charles’ law, Muaz’ book
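One way to split such clitics, loosely following the Penn Treebank convention of turning isn't into is + n't (a partial sketch; the real conventions cover more cases):

```python
import re

def split_clitics(word):
    """Split common English clitics off a graphic word, roughly
    Penn Treebank style.  A partial sketch, not the full rule set."""
    m = re.match(r"(?i)^(.*\w)(n't)$", word)
    if m:
        return [m.group(1), m.group(2)]
    m = re.match(r"(?i)^(.+)('ll|'re|'ve|'s|'m|'d)$", word)
    if m:
        return [m.group(1), m.group(2)]
    return [word]

print(split_clitics("isn't"))  # ['is', "n't"]
print(split_clitics("I'll"))   # ['I', "'ll"]
```

The output shows the "funny words" problem directly: n't and 'll now appear as tokens in the collection.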
Hyphenation
– Does a sequence of letters with a hyphen in between count as one word or two?
– Line-ending hyphens: remove the hyphen at the end of the line and join the two parts
– But what if the hyphen at the end of the line is a genuine lexical hyphen, as in text-based? Blindly joining would wrongly delete it
– In electronic text, line-breaking hyphens are mostly absent, but there are other issues…
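A sketch of line-break dehyphenation with the caveat above: keep the hyphen only when the joined hyphenated form is a known lexical item. The lexicon here is a stand-in; real systems would also consult corpus frequencies of both variants.

```python
def rejoin(first_half, second_half, hyphenated_lexicon):
    """Rejoin a word broken across a line break.  If the hyphenated
    form is a known lexical item (e.g. 'text-based'), keep the
    hyphen; otherwise drop it.  A sketch with a toy lexicon."""
    with_hyphen = first_half + "-" + second_half
    if with_hyphen in hyphenated_lexicon:
        return with_hyphen
    return first_half + second_half

lexicon = {"text-based", "so-called"}
print(rejoin("text", "based", lexicon))  # 'text-based'
print(rejoin("lan", "guage", lexicon))   # 'language'
```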
Some things with hyphens are clearly treated as one word
– e-mail, A-1-plus, co-operate
Other cases are arguable
– non-lawyer, pro-Arab, so-called
– The hyphens here are called lexical hyphens, inserted before or after short word formatives, in some cases to split up vowel sequences
A third class of hyphens is inserted to indicate correct grouping
– a text-based medium
– a final take-it-or-leave-it offer
Inconsistencies in hyphenation
– cooperate vs. co-operate
– So multiple forms of the same item may be treated as either one word or two
Lexemes
– A single dictionary entry with a single meaning
Homographs
– Two lexemes with overlapping written forms, e.g. saw (the tool vs. the past tense of see)
Word segmentation in other languages
– Some writing systems (e.g. Chinese) put no white space between words, so word segmentation itself is the problem
The opposite issue
– White space that is not a word boundary:
  “the New York–New Haven railroad”
  “I couldn’t work the answer out”
  in spite of, in order to, because of
– Variant coding of information of a certain semantic type, e.g. phone numbers (42-111-128-128): a problem for information extraction
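Items like in spite of can be re-merged after whitespace tokenization using a small list of known multiword expressions. This is a sketch: the list and the handling of ambiguous cases would need to be much richer in practice.

```python
def merge_multiwords(tokens, multiwords):
    """Greedily merge known multiword expressions into single
    tokens.  Each entry in `multiwords` is a tuple of words."""
    out, i = [], 0
    while i < len(tokens):
        for mw in multiwords:
            if tokens[i:i + len(mw)] == list(mw):
                out.append(" ".join(mw))   # one token for the whole phrase
                i += len(mw)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

mws = [("in", "spite", "of"), ("because", "of")]
print(merge_multiwords(["he", "came", "in", "spite", "of", "the", "rain"], mws))
# ['he', 'came', 'in spite of', 'the', 'rain']
```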
Speech Corpora Issues
– More contractions
– Various phonetic representations
– Pronunciation variants
– Sentence fragments
– Filler words
Morphology
– Keep the various forms separately, or collapse them? e.g. sit, sits, sat
– Grouping them together and working with lexemes initially looks easier
Stemming
– Strips off affixes
Lemmatization
– Extracts the lemma (lexeme) from an inflected form
Empirical research within IR shows that stemming often does not help performance:
1. Information loss (operating → operate)
2. Closely related tokens grouped into chunks are more useful
3. It is not good enough for morphologically rich languages
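The information-loss point can be seen with even a toy suffix-stripping stemmer (my own crude sketch, not the Porter stemmer):

```python
def crude_stem(word):
    """Strip one of a few common English suffixes.  A deliberately
    crude sketch; real stemmers such as Porter's apply ordered
    rule sets with length and measure conditions."""
    for suffix in ("ational", "ating", "ation", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# 'operating' and 'operation' collapse to the same stem, losing
# the distinction the slide mentions.
print(crude_stem("operating"))  # 'oper'
print(crude_stem("operation"))  # 'oper'
```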
Sentences
– What is a sentence? In English, something ending with ‘.’, ‘?’ or ‘!’
– Abbreviation periods complicate this
Other issues
– “You reminded me,” she remarked, “of your mother.”
– Such nested units are classified as ‘clauses’
– A quotation mark may follow the punctuation; the ‘.’ is then not the sentence boundary, which lies after the quotation mark
Sentence boundary (SB) detection
– Place a tentative SB after all occurrences of . ? !
– Move the boundary after a following quotation mark, if any
– Disqualify a period boundary if it is:
  preceded by a known abbreviation that is usually followed by a capitalized word and is not sentence-final, e.g. Prof., Dr.
  or preceded by an abbreviation not usually followed by a capitalized word, e.g. etc., jr.
– Disqualify a ? or ! boundary if followed by a lower-case letter
– Regard all remaining candidates as correct SBs
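The heuristic above can be sketched directly in code; the abbreviation list and the exact quote handling are illustrative stand-ins, not the book's algorithm verbatim.

```python
import re

TITLES = {"Prof.", "Dr.", "Mr.", "Mrs."}   # illustrative abbreviation list

def sentence_boundaries(text):
    """Heuristic detector following the slide: tentative boundary
    after . ? !, moved past a following quotation mark, then
    disqualified after title abbreviations (Prof., Dr.) or when
    the next word is not capitalized (covers etc., jr.).
    A sketch, not a full implementation."""
    sentences, start = [], 0
    for m in re.finditer(r'[.?!]["\']?', text):
        end = m.end()
        parts = text[start:end].split()
        word = parts[-1] if parts else ""      # token carrying the punctuation
        nxt = text[end:].lstrip(' \t"\'')[:1]  # next letter, skipping quotes
        if word in TITLES:
            continue                           # Dr. Smith: not a boundary
        if m.group()[0] == "." and nxt and not nxt.isupper():
            continue                           # etc. and co.: not a boundary
        if m.group()[0] in "?!" and nxt.islower():
            continue                           # ?/! inside a quotation
        sentences.append(text[start:end].strip())
        start = end
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(sentence_boundaries('Dr. Smith arrived. "Hello!" she said.'))
# ['Dr. Smith arrived.', '"Hello!" she said.']
```

Splitting at every period and then walking back with disqualifying rules keeps the logic simple, but, as the next slide notes, trained classifiers do noticeably better than such hand-written heuristics.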
Riley (1989) used classification trees for SB detection
– Features included the case and length of the words preceding and following a period, and the probabilities of those words occurring before and after a sentence boundary
– It required a large quantity of labeled data
Palmer and Hearst used the parts of speech of the surrounding words, implemented with neural networks (98–99% accurate)
What about other languages?
Marked-up Data
– Some coding scheme provides information about the text (mostly SGML or XML)
– Markup can be applied automatically, manually, or as a mixture of both (semi-automatically)
– Some texts mark up just sentence and paragraph boundaries
– Others mark up much more than this basic information, e.g. the Penn Treebank (full syntactic structure)
– The most common markup is POS tagging
Grammatical Tagging
– Generally done with conventional POS categories such as nouns, verbs, etc.
– Tags may also encode information about the nature of the word, such as plurality of nouns or superlative forms of adjectives
Tag sets
– The most influential tag sets have been the ones used to tag the American Brown corpus and the Lancaster-Oslo/Bergen (LOB) corpus
Size of tag sets
– Brown: 87 basic tags (179 in total)
– Penn: 45
– CLAWS1: 132
The Penn tag set is the most widely used in computational work
Tags differ between tag sets
– Larger tag sets obviously make finer-grained distinctions
– The level of detail depends on the domain of the corpora
The design of a tag set
– Encodes the grammatical class of the word
– Plus features that predict the behavior of the word
Part of speech can be assigned on
– semantic grounds
– syntactic distributional grounds
– morphological grounds
Splitting tags into finer categories gives more information but makes classification harder
There is no simple relationship between tag-set size and tagger performance