CSA2050: Introduction to Computational Linguistics
Lecture 3: Examples
Mar MRCSA Lecture III: Examples
Course Contents
  1 (MR) Overview
  2 (RF) Chomsky Hierarchy
  3 (MR) Examples
  4 (RF) Grammatical Categories
  5, 6 (MR) Tagging
  7 (RF) Morphology
  8, 9, 10 (MR) Comp Morphology
  11 (RF) Syntax
  12, 13, 14 (MR) Grammar Formalism
Outline
  Examples in the areas of:
    Tokenisation
    Morphological Analysis
    Tagging
    Syntactic Analysis
Information Extraction
  Pipeline stages: raw text, tokenisation, morphological analysis, named entity recognition, tagged text, syntactic analysis
Tokenisation
  The basic idea of tokenisation is to identify the basic tokens that are present in a text.
  Mostly, tokens are the same as words, but not always.
  Why should this be a problem?
  John’s car cost €10, “And it’s worth every penny”, he exclaimed.
Tokenisation Problems
  Punctuation
    novel forms: .net, Micro$oft, :-)
    hyphenation:
      line breaks vs word-internal
      multi-word: the 90-cent-an-hour raise
      confusion with dash
    apostrophes in contractions: we'll
    periods:
      part of names: Amazon.com
      numerical expressions: $1.99
      abbreviations, end of sentence, haplology
    commas: 1,000,000
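A few of these punctuation cases can be illustrated with a small regular-expression tokeniser. This is only a sketch under stated assumptions: the pattern set and the name `tokenise` are invented for illustration and are not the tokeniser used in the demo. It keeps contractions, currency amounts, comma-grouped numbers, dotted names and hyphenated multi-word forms together, and splits off all other punctuation.

```python
import re

# Illustrative patterns only; earlier alternatives take priority.
TOKEN_RE = re.compile(
    r"""
    [$€]\d+(?:[.,]\d+)*          # currency amounts such as $1.99 or €10
  | \d+(?:[.,]\d+)*(?![-\w])     # numbers with internal commas or periods: 1,000,000
  | \w+(?:[-.'’]\w+)*            # words with an internal hyphen, period or apostrophe
  | \S                           # any other non-space character on its own
    """,
    re.VERBOSE,
)


def tokenise(text):
    """Return the list of tokens found in text, scanning left to right."""
    return TOKEN_RE.findall(text)


print(tokenise("we'll pay $1.99 at Amazon.com, or 1,000,000."))
print(tokenise("the 90-cent-an-hour raise"))
```

The sketch does nothing about line-break hyphenation, abbreviations, sentence-final periods that double as abbreviation periods (haplology), or emoticons such as :-), which it would split into three tokens.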
Other Problems
  Token-internal whitespace
  Interaction: the New York-New Haven railroad
  Mixed language tokens
    Automated language guesser
  Token equivalence (when are two tokens the same?)
  Case normalisation
  Sentence boundary detection
  Inconsistency: database, data-base, data base
  Demo: Xerox tokeniser
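Token equivalence can be made concrete with a normalisation key: two tokens count as the same if they map to the same key. The function below is a hypothetical sketch (the name `equivalence_key` and the choice of what to strip are assumptions), covering case normalisation and the database / data-base / data base inconsistency.

```python
import re


def equivalence_key(token):
    """Map a token to a key; tokens with equal keys are treated as equivalent."""
    # casefold() is a more aggressive lower(); then drop hyphens and whitespace.
    return re.sub(r"[-\s]+", "", token.casefold())


variants = ["database", "data-base", "data base", "Database"]
print({v: equivalence_key(v) for v in variants})
```

A key this coarse also merges forms that should stay distinct (for example re-cover and recover), so a real system would need a more careful policy.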
Morphology
  Simple versus complex words: dog, dogs
  Complex words are formed by concatenation of morphemes.
  Morpheme: the smallest unit in a word that bears some meaning, such as dog and s.
Morphological Analysis
  Morphological analysis of a word involves a segmentation problem.
  Segmentation: discovery of the component morphemes
    dogs → dog + s
    enlargement → en + large + ment
  Possible ambiguities:
    enlargement → enlarge + ment
    enlargement → en + largement
  Role of lexicon
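The role of the lexicon in segmentation can be sketched as follows. The morpheme inventory `MORPHEMES` and the function `segmentations` are hypothetical; the function enumerates every way of cutting a word into pieces that all appear in the inventory. Because largement is not listed, the split en + largement is never proposed, while the two remaining analyses of enlargement are both returned.

```python
# Hypothetical toy inventory; a real lexicon would also record morpheme types.
MORPHEMES = {"dog", "s", "en", "large", "enlarge", "ment"}


def segmentations(word, morphemes=MORPHEMES):
    """Return every split of word into a sequence of known morphemes."""
    if not word:
        return [[]]  # one way to segment the empty string: no morphemes
    results = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in morphemes:
            for rest in segmentations(word[i:], morphemes):
                results.append([head] + rest)
    return results


print(segmentations("dogs"))
print(segmentations("enlargement"))
```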
Morphological Analysis
  John has a couple of rabbits
  rabbits → rabbit + s
  s indicates plural of noun rabbit
  Is this the only possibility?
Morphological Analysis
  John rabbits on and on
  rabbits → rabbit + s
  s indicates 3rd person singular of verb rabbit
  The suffix “s” is a realisation of two entirely different morphemes.
  The morpheme is something more abstract than the string which realises it.
Morphological Analysis
  Morpheme world: +PL, +3S
  Suffix world: -s, -a
Morphological Analysis
  Morphological Parser
    Input: word
      rabbits
    Output: analysis
      rabbit N PL
      rabbit V 3S
  Output is a string of morphemes.
  Morpheme is employed in a loose sense that is useful for further processing.
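The input/output behaviour of such a parser can be imitated with a toy lookup. Everything here is an assumption made for illustration: the tables `STEMS` and `SUFFIXES` and the function `analyse` are not part of ENGTWOL or the Xerox tools. The point is that one surface suffix s yields a different abstract morpheme depending on the category of the stem.

```python
# Hypothetical toy lexicon: stem -> categories it can belong to.
STEMS = {"rabbit": ["N", "V"], "dog": ["N"]}
# Hypothetical suffix table: surface suffix -> morpheme realised per category.
SUFFIXES = {"s": {"N": "PL", "V": "3S"}}


def analyse(word):
    """Return all 'stem CATEGORY MORPHEME' analyses of a suffixed word."""
    analyses = []
    for suffix, features in SUFFIXES.items():
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            for category in STEMS.get(stem, []):
                if category in features:
                    analyses.append(f"{stem} {category} {features[category]}")
    return analyses


print(analyse("rabbits"))
print(analyse("dogs"))
```

Unsuffixed forms and spelling changes at morpheme boundaries are deliberately left out of this sketch.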
Morphological Analysis: ENGTWOL & Xerox
  Atro Voutilainen, Juha Heikkilä, Timo Järvinen and Lingsoft, Inc
  ENGTWOL demo
  Xerox morphological analysis
Morphological Synthesis
  Morphological Parser
    Output: word
      rabbits
    Input: analysis
      rabbit N PL
      rabbit V 3S
  Input is a string of morphemes.
  Output is a word.
Reversibility
  Lookup
    APPLY UP> left
    left    leave+Verb+PastBoth+123SP
    left    left+Adv
    left    left+Adj
    left    left+Noun+Sg
  Lookdown
    APPLY DOWN> leave+Adj
    left
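Reversibility means that analysis and synthesis are two directions of one and the same relation between surface forms and analyses. The sketch below stores that relation as an explicit list of pairs taken from the lookup output above; the names `PAIRS`, `apply_up` and `apply_down` are hypothetical, and a real system would encode the relation compactly rather than enumerate it.

```python
# (surface form, analysis) pairs copied from the lookup output for "left".
PAIRS = [
    ("left", "leave+Verb+PastBoth+123SP"),
    ("left", "left+Adv"),
    ("left", "left+Adj"),
    ("left", "left+Noun+Sg"),
]


def apply_up(surface):
    """Analysis direction: surface form -> all analyses."""
    return [analysis for form, analysis in PAIRS if form == surface]


def apply_down(analysis):
    """Synthesis direction: analysis -> all surface forms."""
    return [form for form, a in PAIRS if a == analysis]


print(apply_up("left"))
print(apply_down("left+Adj"))
```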
POS Tagging
  In POS tagging, the task is to assign the most appropriate morphosyntactic label from amongst those listed in the lexicon, given the context.
  John leaves presents.
  Proper Names
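One way to picture "given the context" is to let the lexicon propose candidate labels and then keep only the label sequences whose adjacent pairs are permitted. The tag names, the `LEXICON` entries and the `ALLOWED` transitions below are all invented for this example; practical taggers typically use statistical or hand-written contextual rules rather than a hard whitelist like this.

```python
from itertools import product

# Hypothetical lexicon: each word lists every label it could carry.
LEXICON = {
    "John": ["PropN"],
    "leaves": ["N-PL", "V-3S"],
    "presents": ["N-PL", "V-3S"],
}
# Hypothetical context constraints: permitted adjacent label pairs.
ALLOWED = {("<s>", "PropN"), ("PropN", "V-3S"), ("V-3S", "N-PL"), ("N-PL", "</s>")}


def tag(words):
    """Return every labelling of words whose adjacent label pairs are all allowed."""
    results = []
    for tags in product(*(LEXICON[w] for w in words)):
        path = ("<s>",) + tags + ("</s>",)
        if all(pair in ALLOWED for pair in zip(path, path[1:])):
            results.append(list(zip(words, tags)))
    return results


print(tag(["John", "leaves", "presents"]))
```

Of the four candidate sequences for the example sentence, only the one reading leaves as a verb and presents as a noun survives.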
Semantic Tagging
  Named Entity Recognition
  The basic idea is to recognise and tag named entities and classify them by type:
    Persons
    Locations
    Organisations
  Named Entity Recognition - Demo
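The simplest possible version of this idea is lookup in a list of known names (a gazetteer), preferring the longest match so that multi-word names stay together. The `GAZETTEER` entries and the function `tag_entities` are assumptions for illustration only, and the demo system may work quite differently.

```python
# Hypothetical gazetteer: token sequence -> entity type.
GAZETTEER = {
    ("John",): "Person",
    ("New", "York"): "Location",
    ("New", "Haven"): "Location",
    ("Xerox",): "Organisation",
}
MAX_LEN = max(len(name) for name in GAZETTEER)


def tag_entities(tokens):
    """Greedy longest-match lookup; tokens outside any entity get the label 'O'."""
    tagged = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            label = GAZETTEER.get(tuple(tokens[i:i + n]))
            if label:
                tagged.append((" ".join(tokens[i:i + n]), label))
                i += n
                break
        else:
            # No gazetteer entry starts at this position.
            tagged.append((tokens[i], "O"))
            i += 1
    return tagged


print(tag_entities("John left New York for Xerox".split()))
```

Pure lookup cannot handle unseen names or names that are ambiguous between types, which is why context is normally taken into account as well.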
Syntactic Analysis
  Problem: given a sentence and a grammar/lexicon, discover the assigned tree structure.
  XIP Parser Demo
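The problem statement can be tried out on a very small scale. The grammar, the lexicon and the function `parse` below are hypothetical and much simpler than anything in the XIP parser: rules are binary, the lexicon assigns NP directly to nouns, and the parser simply tries every split point. Trees are returned as nested tuples.

```python
# Hypothetical toy grammar with binary rules only.
GRAMMAR = {"S": [("NP", "VP")], "VP": [("V", "NP")]}
# Hypothetical lexicon: word -> categories (nouns are listed as NP for brevity).
LEXICON = {"John": ["NP"], "leaves": ["V", "NP"], "presents": ["V", "NP"]}


def parse(symbol, words):
    """Return all trees rooted in symbol that cover exactly the given words."""
    trees = []
    if len(words) == 1 and symbol in LEXICON.get(words[0], []):
        trees.append((symbol, words[0]))
    for left, right in GRAMMAR.get(symbol, []):
        for split in range(1, len(words)):
            for left_tree in parse(left, words[:split]):
                for right_tree in parse(right, words[split:]):
                    trees.append((symbol, left_tree, right_tree))
    return trees


print(parse("S", ["John", "leaves", "presents"]))
```

Although leaves and presents are each ambiguous in the lexicon, the grammar licenses only one tree for the sentence, so parsing also resolves the category ambiguity here.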