LING 388: Language and Computers Sandiway Fong Lecture 23: 11/15
Part-of-Speech (POS) Tagging Basic Idea: –assign the right part-of-speech tag, e.g. noun, verb, conjunction, to a word –useful for shallow parsing –or as first stage of a deeper/more sophisticated system Question: –Is it a hard task? i.e. can’t we just look the words up in a dictionary? Answer: –Yes. Ambiguity. –No. POS tagging programs typically claim 95%+ accuracy
POS Tagging Task: –assign the right part-of-speech tag to a word in context –not always easy Example: walk –the walk : noun I took … –I walk : verb 2 miles every day Example: still: noun, adjective, adverb, verb –the still of the night, a glass still –still waters –stand still –still struggling –Still, I didn’t give way –still your fear of the dark (transitive) –the bubbling waters stilled (intransitive)
POS Tagging Issues/Questions: –What are the parts of speech and subclasses that we might want to tag? –What does a typical tagset look like? –What methods can we use to assign tags?
Parts-of-Speech Divide words into classes based on grammatical function –nouns (open-class: unlimited set) referential items (denoting objects/concepts etc.) –proper nouns: John –pronouns: he, him, she, her, it –anaphors: himself, herself (reflexives) –common nouns: dog, dogs, water »number: dog (singular), dogs (plural) »count-mass distinction: many dogs, *many waters –eventive nouns: dismissal, concert, playback, destruction (deverbal) nonreferential items –it as in it is important to study –there as in there seems to be a problem –some languages don’t have these: e.g. Japanese –open-class: factoid, bush-ism
Parts-of-Speech Pronouns: 1. it 2. I 3. he 4. you 5. his 6. they 7. this 8. that 9. she 10. her 11. we 12. all 13. which 14. their 15. what
Parts-of-Speech Divide words into classes based on grammatical function –verbs (closed-class: fixed set) auxiliaries –be (passive, progressive) –have (perfect) –do (What did John buy?, Did Mary win?) –modals: can, could, would, will, may Irregular forms: –is, was, were, does, did
Parts-of-Speech Divide words into classes based on grammatical function –verbs (open-class: unlimited set) Intransitive –unaccusatives: arrive (achievement) –unergatives: run, jog (activities) Transitive –actions: hit (semelfactive: hit the ball for an hour) –actions: eat, destroy (accomplishment) –psych verbs: frighten (x frightens y), fear (y fears x) Ditransitive –put (x put y on z, *x put y) –give (x gave y z, *x gave y, x gave z to y) –load (x loaded y (on z), x loaded z (with y)) –open-class: reaganize, fax
Parts-of-Speech Divide words into classes based on grammatical function –adjectives (open-class: unlimited set) modify nouns black, white, open, closed, sick, well attributive: black (black car, car is black), main (main street, *street is main), atomic predicative: afraid (*afraid child, the child is afraid) stage-level: drunk (there is a man drunk in the pub) individual-level: clever, short, tall (*there is a man tall in the bar) object-taking: proud (proud of him,*well of him) intersective: red (red car: intersection of the set of red things and the set of cars) non-intersective: former (former architect), atomic (atomic scientist) comparative, superlative: blacker, blackest, *opener, *openest –open-class: hackable, spammable
Parts-of-Speech Divide words into classes based on grammatical function –adverbs (open-class: unlimited set) modify verbs (adjectives and other adverbs) manner: slowly (moved slowly) degree: slightly, more (more clearly), very (very bad), almost sentential: unfortunately, suddenly question: how temporal: when, soon, yesterday (noun?) location: sideways, here (John is here) –open-class: spam-wise
Parts-of-Speech Divide words into classes based on grammatical function –prepositions (closed-class: fixed set) –come before an object and assign it a semantic function (from Mars, *Mars from) head-final languages: postpositions (Japanese: amerika-kara) –location: on, in, by –temporal: by, until
POS Tagging Task: –assign the right part-of-speech tag, e.g. noun, verb, conjunction, to a word in context POS taggers –need to be fast in order to process large corpora: should take no more than time linear in the size of the corpus –full parsing is slow, e.g. context-free grammar parsing takes O(n^3) time, where n is the length of the sentence –POS taggers try to assign the correct tag without actually parsing the sentence
POS Tagging Components: –Dictionary of words Exhaustive list of closed-class items –Examples: »the, a, an: determiner »from, to, of, by: preposition »and, or: coordinating conjunction Large set of open-class items (e.g. nouns, verbs, adjectives) with frequency information
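The dictionary component described above can be pictured as two lookup tables. The sketch below is illustrative only: the closed-class entries come from the slide, while the open-class words, their tags, and all counts are invented for the example, and the names `CLOSED_CLASS`, `OPEN_CLASS`, and `lookup` are hypothetical.

```python
from collections import Counter

# Closed-class items: a small, fixed set that can be listed exhaustively.
# These entries are the examples given on the slide.
CLOSED_CLASS = {
    "the": "determiner", "a": "determiner", "an": "determiner",
    "from": "preposition", "to": "preposition",
    "of": "preposition", "by": "preposition",
    "and": "coordinating conjunction", "or": "coordinating conjunction",
}

# Open-class items: each word maps to per-tag frequency counts.
# The counts below are made up for illustration, not corpus figures.
OPEN_CLASS = {
    "walk": Counter({"verb": 60, "noun": 40}),
    "still": Counter({"adverb": 80, "adjective": 12, "noun": 5, "verb": 3}),
    "dog": Counter({"noun": 99, "verb": 1}),
}


def lookup(word):
    """Return candidate tags, most frequent first; [] for unknown words."""
    w = word.lower()
    if w in CLOSED_CLASS:
        return [CLOSED_CLASS[w]]
    if w in OPEN_CLASS:
        return [tag for tag, _ in OPEN_CLASS[w].most_common()]
    return []
```

A word that returns an empty list is outside the dictionary and has to be handled by a separate unknown-word mechanism.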
POS Tagging Components: –Mechanism to assign tags Context-free: by frequency Context: bigram, trigram, HMM, hand-coded rules –Example: »Det Noun/*Verb the walk… –Mechanism to handle unknown words (extra-dictionary) Capitalization Morphology: -ed, -tion
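The two mechanisms on this slide, a hand-coded context rule and unknown-word handling, can be sketched in a few lines. This is a minimal illustration, not how any particular tagger does it: the function names, the tag labels, and the fallback to "noun" are assumptions made for the example.

```python
def guess_unknown(word, sentence_initial=False):
    """Guess a tag for an out-of-dictionary word from its surface shape."""
    # Capitalization cue: a capitalized word that is not sentence-initial
    # is guessed to be a proper noun.
    if word[:1].isupper() and not sentence_initial:
        return "proper noun"
    # Morphology cues from the slide: -ed and -tion.
    if word.endswith("ed"):
        return "verb (past)"
    if word.endswith("tion"):
        return "noun"
    # Assumed fallback for this sketch only.
    return "noun"


def apply_det_rule(prev_tag, candidates):
    """Context rule from the slide (Det Noun/*Verb): after a determiner,
    drop the verb reading when a noun reading is available."""
    if prev_tag == "determiner" and "noun" in candidates:
        return [t for t in candidates if t != "verb"]
    return candidates
```

For example, `apply_det_rule("determiner", ["verb", "noun"])` keeps only the noun reading for walk in "the walk", and `guess_unknown("salivation")` falls through to the -tion cue.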
How Hard is Tagging? Brown Corpus (Francis & Kucera, 1982): –1 million words –39K distinct words –35K words with only 1 tag –4K with multiple tags (DeRose, 1988)
How Hard is Tagging? Easy task to do well on: –naïve algorithm: assign each word its most frequent tag –90% accuracy (Charniak et al., 1993)
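The naïve frequency-based algorithm is short enough to write out. The sketch below assumes a hand-tagged corpus given as (word, tag) pairs; the tiny corpus, the function names `train` and `tag`, and the default tag `NN` for unseen words are all choices made for this illustration.

```python
from collections import Counter, defaultdict


def train(tagged_corpus):
    """Map each word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for word, t in tagged_corpus:
        counts[word.lower()][t] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}


def tag(words, model, default="NN"):
    """Tag each word independently of its context."""
    return [(w, model.get(w.lower(), default)) for w in words]


# Toy training data (invented for the example), using Penn Treebank tags.
corpus = [
    ("the", "DT"), ("walk", "NN"),
    ("I", "PRP"), ("walk", "VBP"),
    ("they", "PRP"), ("walk", "VBP"),
    ("the", "DT"), ("dog", "NN"),
]
model = train(corpus)
print(tag(["the", "walk"], model))
```

With this toy data, walk is tagged VBP even after the determiner the, because the verb reading is more frequent in training. That is exactly the kind of error a context-free method cannot avoid, and it is where the remaining accuracy has to come from.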
Penn TreeBank Tagset 48-tag simplification of Brown Corpus tagset Examples:
1. CC Coordinating conjunction
3. DT Determiner
7. JJ Adjective
11. MD Modal
12. NN Noun (singular, mass)
13. NNS Noun (plural)
27. VB Verb (base form)
28. VBD Verb (past)
Penn TreeBank Tagset How many tags? –Tag criterion Distinctness with respect to grammatical behavior? –Make tagging easier? Punctuation tags –Penn Treebank numbers Trivial computational task
Penn TreeBank Tagset Simplifications: –Tag TO: covers both the infinitival marker and the preposition: I want to win (infinitival marker), I went to the store (preposition) –Tag IN: preposition or subordinating conjunction: that, when, although: I know that I should have stopped, although… I stopped when I saw Bill
Penn TreeBank Tagset Simplifications: –Tag DT: determiner: any, some, these, those (any man, these *man/men) –Tag VBP: verb, non-3rd-person singular present: am, are, walk (Am I here? *Walked I here?/Did I walk here?)
Hard to Tag Items Syntactic Function –Example: resultative: I saw the man tired from running Examples (from Brown Corpus Manual) –Hyphenation: long-range, high-energy, shirt-sleeved, signal-to-noise –Foreign words: mens sana in corpore sano
Rule-Based POS Tagging Example Systems –ENGCG (1,100 rules) –ENGCG-2 (4,000 rules) Core Components –English morphological analyzer based on two-level morphology (see last lecture) –56K word stems –processing: apply morphological engine; get all possible tags for each word; apply rules
Rule-Based POS Tagging Example: –Pavlov had shown that salivation can be a conditioned reflex
Rule-Based POS Tagging Examples of tags: –PCP2 past participle –SV subject verb –SVOO subject verb object object
Rule-Based POS Tagging Example: –it isn’t that:adv odd Rule: –given input “that” –if (+1 A/ADV/QUANT) (+2 SENT-LIM) (NOT -1 SVOC/A) –then eliminate non-ADV tags –else eliminate ADV tag
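The rule for that can be read as a function over the candidate tag sets produced by the morphological analyzer. The sketch below is one interpretation under stated assumptions, not the ENGCG implementation: the tags A, ADV, QUANT, SENT-LIM, and SVOC/A come from the slide, while PRON, DET, CS, and V are placeholder tags invented here. Two further assumptions: a position past the end of the sentence counts as SENT-LIM, and a rule never removes a word's last remaining reading.

```python
SENT_LIM = "SENT-LIM"


def disambiguate_that(readings, i):
    """Apply the slide's rule to position i.

    readings is a list of (word, set_of_candidate_tags) pairs.
    Returns the reduced tag set for the word at position i.
    """
    def tags_at(j):
        # Assumption: positions outside the sentence act as sentence limits.
        if j < 0 or j >= len(readings):
            return {SENT_LIM}
        return readings[j][1]

    word, tags = readings[i]
    if word.lower() != "that":
        return tags

    next_ok = bool(tags_at(i + 1) & {"A", "ADV", "QUANT"})  # (+1 A/ADV/QUANT)
    limit_ok = SENT_LIM in tags_at(i + 2)                   # (+2 SENT-LIM)
    prev_ok = "SVOC/A" not in tags_at(i - 1)                # (NOT -1 SVOC/A)

    if next_ok and limit_ok and prev_ok:
        kept = tags & {"ADV"}   # then: eliminate non-ADV tags
    else:
        kept = tags - {"ADV"}   # else: eliminate ADV tag
    # Assumption: keep the original set rather than return no reading at all.
    return kept or tags


# "it isn't that odd": candidate tags for "that" are illustrative.
sentence = [
    ("it", {"PRON"}),
    ("isn't", {"V"}),
    ("that", {"ADV", "PRON", "DET", "CS"}),
    ("odd", {"A"}),
]
print(disambiguate_that(sentence, 2))
```

For the slide's example the three conditions hold, so only the ADV reading of that survives. In a sentence such as "I know that he left", the next word is not A/ADV/QUANT, so the ADV reading is removed and the other readings stay.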
Rule-Based POS Tagging Now ENGCG-2 (4,000 rules)
Rule-Based POS Tagging Best performance of all systems: 99.7%
Next Time Look at statistical techniques …