Part-of-Speech Tagging

Part-of-Speech Tagging: A Canonical Finite-State Task
600.465 - Intro to NLP - J. Eisner

The Tagging Task
Input:  the lead paint is unsafe
Output: the/Det lead/N paint/N is/V unsafe/Adj
Uses:
- text-to-speech (how do we pronounce "lead"?)
- can write regexps like (Det) Adj* N+ over the output (see the sketch below)
- preprocessing to speed up parser (but a little dangerous)
- if you know the tag, you can back off to it in other tasks
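
For instance, here is a minimal sketch (not from the slides) of what such a regexp over the tagged output might look like; the toy tagged string and the word/tag separator are illustrative assumptions:

```python
import re

# Toy tagged output in the word/Tag format used above (an assumption).
tagged = "the/Det lead/N paint/N is/V unsafe/Adj"

# Extract just the tag sequence and look for the pattern (Det) Adj* N+ .
tags = " ".join(tok.split("/")[1] for tok in tagged.split()) + " "
match = re.search(r"\b(Det )?(Adj )*(N )+", tags)
print(match.group(0).strip() if match else "no match")  # -> "Det N N"
```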

Why Do We Care?
Input:  the lead paint is unsafe
Output: the/Det lead/N paint/N is/V unsafe/Adj
- The first statistical NLP task
- Been done to death by different methods
- Easy to evaluate (how many tags are correct?)
- Canonical finite-state task
- Can be done well with methods that look at local context
- Though should "really" do it by parsing!

Degree of Supervision
- Supervised: training corpus is tagged by humans
- Unsupervised: training corpus isn't tagged
- Partly supervised: training corpus isn't tagged, but you have a dictionary giving possible tags for each word
We'll start with the supervised case and move to decreasing levels of supervision.

Current Performance
Input:  the lead paint is unsafe
Output: the/Det lead/N paint/N is/V unsafe/Adj
How many tags are correct?
- About 97% currently
- But the baseline is already 90%
- The baseline is the performance of the stupidest possible method:
  - tag every word with its most frequent tag
  - tag unknown words as nouns
(A sketch of this baseline follows below.)
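
A minimal sketch of that baseline, assuming a toy tagged training corpus (the corpus and the "Noun" default are illustrative assumptions, not the course's data):

```python
from collections import Counter, defaultdict

# Toy supervised training data: (word, tag) pairs.
train = [("the", "Det"), ("lead", "N"), ("paint", "N"),
         ("lead", "V"), ("is", "V"), ("unsafe", "Adj")]

# Count how often each word receives each tag.
counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1

def baseline_tag(word):
    """Most frequent tag for known words; default unknown words to Noun."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return "Noun"

print([(w, baseline_tag(w)) for w in "the lead gizmo".split()])
# e.g. [('the', 'Det'), ('lead', 'N'), ('gizmo', 'Noun')]
```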

What Should We Look At?
correct tags:  PN Verb Det Noun Prep Noun Prep Det Noun
               Bill directed a cortege of autos through the dunes
Below each word, some possible tags (maybe more): a first row of candidates PN Adj Det Noun Prep Noun Prep Det Noun, plus further candidates (Verb, Noun, Adj, Prep, ...) for several of the words ...?
Each unknown tag is constrained by its word and by the tags to its immediate left and right. But those tags are unknown too ...

Three Finite-State Approaches
Noisy Channel Model (statistical):
  real language Y: part-of-speech tags (n-gram model)
  noisy channel Y → X: replace tags with words
  observed string X: text
Want to recover Y from X.

Three Finite-State Approaches
1. Noisy Channel Model (statistical)
2. Deterministic baseline tagger composed with a cascade of fixup transducers
3. Nondeterministic tagger composed with a cascade of finite-state automata that act as filters

Review: Noisy Channel
  real language Y: p(Y)
  noisy channel Y → X: p(X | Y)
  observed string X: p(X, Y) = p(Y) * p(X | Y)
Want to recover y ∈ Y from x ∈ X: choose the y that maximizes p(y | x), or equivalently p(x, y).

Review: Noisy Channel
p(Y) .o. p(X | Y) = p(X, Y)
  p(Y):      a:a/0.7   b:b/0.3
  p(X | Y):  a:C/0.1   a:D/0.9   b:C/0.8   b:D/0.2
  p(X, Y):   a:C/0.07  a:D/0.63  b:C/0.24  b:D/0.06
Note p(x, y) sums to 1. Suppose x = "C"; what is the best "y"?

Review: Noisy Channel
p(Y) .o. p(X | Y) .o. (X = x)? = p(x, Y)
  p(Y):      a:a/0.7   b:b/0.3
  p(X | Y):  a:C/0.1   a:D/0.9   b:C/0.8   b:D/0.2
  (X = x)?:  C:C/1   (restrict just to paths compatible with output "C")
  p(x, Y):   a:C/0.07  b:C/0.24  → best path
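
A small numeric check of this composition (a sketch; the dictionaries simply mirror the arc weights on the slide rather than using a real FST library):

```python
# Arc weights from the slide.
p_y = {"a": 0.7, "b": 0.3}                        # p(Y)
p_x_given_y = {("a", "C"): 0.1, ("a", "D"): 0.9,  # p(X | Y)
               ("b", "C"): 0.8, ("b", "D"): 0.2}

# Compose, then restrict to paths whose output is the observed x = "C".
x = "C"
p_joint = {y: p_y[y] * p_x_given_y[(y, x)] for y in p_y}
print(p_joint)                         # approx {'a': 0.07, 'b': 0.24}
print(max(p_joint, key=p_joint.get))   # best path: y = 'b'
```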

Noisy Channel for Tagging
  p(Y): acceptor over tag sequences, the "Markov Model"  (toy version above: a:a/0.7, b:b/0.3)
.o.
  p(X | Y): transducer from tags to words, "Unigram Replacement"  (toy: a:C/0.1, a:D/0.9, b:C/0.8, b:D/0.2)
.o.
  (X = x)?: acceptor for the observed words, a "straight line"  (toy: C:C/1)
=
  p(x, Y): transducer that scores candidate tag sequences on their joint probability with the observed words; pick the best path  (toy: a:C/0.07, b:C/0.24)

Markov Model (bigrams)
[Transition diagram over the tag states Start, Det, Adj, Noun, Verb, Prep, Stop, with transition probabilities (0.7, 0.3, 0.4, 0.5, 0.1, then 0.8 and 0.2) added to the arcs step by step.]

Markov Model
p(tag seq): p(Start Det Adj Adj Noun Stop) = 0.8 * 0.3 * 0.4 * 0.5 * 0.2
(Start→Det 0.8, Det→Adj 0.3, Adj→Adj 0.4, Adj→Noun 0.5, Noun→Stop 0.2)

Markov Model as an FSA
The same model drawn as a weighted finite-state acceptor over tags; Stop is reached via ε arcs (ε/0.2 from Noun, ε/0.1 from Adj).
p(Start Det Adj Adj Noun Stop) = 0.8 * 0.3 * 0.4 * 0.5 * 0.2

Markov Model (tag bigrams)
Arcs are labeled with the tag they generate: Start --Det/0.8--> Det --Adj/0.3--> Adj --Adj/0.4--> Adj --Noun/0.5--> Noun --ε/0.2--> Stop
p(Start Det Adj Adj Noun Stop) = 0.8 * 0.3 * 0.4 * 0.5 * 0.2
(A scoring sketch follows below.)
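
A minimal sketch of scoring a tag sequence under this tag-bigram model. The table copies the arc weights shown on the slides; reading the 0.7 arc as Det → Noun and the 0.1 arc as Adj → Stop (so probabilities out of each state sum to one) is an assumption:

```python
# Tag-bigram transition probabilities taken from the slides.
bigram = {("Start", "Det"): 0.8, ("Det", "Adj"): 0.3, ("Det", "Noun"): 0.7,
          ("Adj", "Adj"): 0.4, ("Adj", "Noun"): 0.5, ("Adj", "Stop"): 0.1,
          ("Noun", "Stop"): 0.2}

def p_tag_seq(tags):
    """Product of transition probabilities along the tag sequence."""
    p = 1.0
    for prev, cur in zip(tags, tags[1:]):
        p *= bigram.get((prev, cur), 0.0)
    return p

print(p_tag_seq(["Start", "Det", "Adj", "Adj", "Noun", "Stop"]))
# 0.8 * 0.3 * 0.4 * 0.5 * 0.2 ≈ 0.0096
```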

Noisy Channel for Tagging
  p(Y): automaton over tag sequences, the "Markov Model"
.o.
  p(X | Y): transducer from tags to words, "Unigram Replacement"
.o.
  the observed words: a "straight line" automaton
=
  p(x, Y): transducer that scores candidate tag sequences on their joint probability with the observed words; pick the best path

Noisy Channel for Tagging
  p(Y): the tag-bigram FSA above (states Start, Det, Adj, Noun, Verb, Prep, Stop; arcs such as Det/0.8, Adj/0.3, Noun/0.7, Adj/0.4, Noun/0.5, ε/0.1, ε/0.2)
.o.
  p(X | Y): unigram replacement arcs, e.g. Det:the/0.4, Det:a/0.6, Adj:cool/0.003, Adj:directed/0.0005, Adj:cortege/0.000001, Noun:Bill/0.002, Noun:autos/0.001, Noun:cortege/0.000001, ...
.o.
  the observed words: the cool directed autos
=
  p(x, Y): transducer that scores candidate tag sequences on their joint probability with the observed words; we should pick the best path

Unigram Replacement Model
p(word seq | tag seq):
  Noun:Bill/0.002, Noun:autos/0.001, Noun:cortege/0.000001, ...   (sums to 1)
  Det:the/0.4, Det:a/0.6   (sums to 1)
  Adj:cool/0.003, Adj:directed/0.0005, Adj:cortege/0.000001, ...

Compose
Compose the tag-bigram FSA p(tag seq) (Start --Det/0.8--> Det, Det --Adj/0.3--> Adj, Det --Noun/0.7--> Noun, Adj --Adj/0.4--> Adj, Adj --Noun/0.5--> Noun, Adj --ε/0.1--> Stop, Noun --ε/0.2--> Stop; plus Verb and Prep states) with the unigram replacement transducer (Det:the/0.4, Det:a/0.6, Adj:cool/0.003, Adj:directed/0.0005, Adj:cortege/0.000001, Noun:Bill/0.002, Noun:autos/0.001, Noun:cortege/0.000001, ...).

Compose
p(word seq, tag seq) = p(tag seq) * p(word seq | tag seq)
Each arc of the composed transducer multiplies a tag-bigram probability by a word-given-tag probability, e.g.:
  from Start to Det:  Det:the/0.32, Det:a/0.48
  from Det to Adj:    Adj:cool/0.0009, Adj:directed/0.00015, Adj:cortege/0.000003
  from Adj to Adj:    Adj:cool/0.0012, Adj:directed/0.00020, Adj:cortege/0.000004
  plus Noun arcs (N:autos, N:cortege, ...) and an ε arc into Stop
(See the sketch below.)
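
A quick sketch verifying how the composed arc weights above arise as transition × emission products (all numbers copied from the slides; the plain dictionaries stand in for the FSTs):

```python
# Transition probabilities (tag bigrams) and emission probabilities
# (unigram replacement), both taken from the slides.
trans = {("Start", "Det"): 0.8, ("Det", "Adj"): 0.3, ("Adj", "Adj"): 0.4}
emit = {("Det", "the"): 0.4, ("Det", "a"): 0.6,
        ("Adj", "cool"): 0.003, ("Adj", "directed"): 0.0005}

# Composed arc weight = p(next tag | prev tag) * p(word | next tag).
print(trans[("Start", "Det")] * emit[("Det", "the")])   # ≈ 0.32
print(trans[("Start", "Det")] * emit[("Det", "a")])     # ≈ 0.48
print(trans[("Det", "Adj")] * emit[("Adj", "cool")])    # ≈ 0.0009
print(trans[("Adj", "Adj")] * emit[("Adj", "cool")])    # ≈ 0.0012
```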

Observed Words as Straight-Line FSA
word seq: the  cool  directed  autos

Compose with the observed words: the cool directed autos
p(word seq, tag seq) = p(tag seq) * p(word seq | tag seq)
(the composed machine from the previous slide, with arcs such as Det:the/0.32, Det:a/0.48, Adj:cool/0.0009, Adj:directed/0.00015, Adj:cool/0.0012, Adj:directed/0.00020, N:autos, N:cortege, ..., before restricting to the observed word sequence)

Compose with the observed words: the cool directed autos
p(word seq, tag seq) = p(tag seq) * p(word seq | tag seq)
After restriction, only the matching arcs survive: Det:the/0.32, Adj:cool/0.0009, Adj:directed/0.00020, N:autos, ..., ε into Stop. Why did this loop go away?

p(word seq, tag seq) = p(tag seq) * p(word seq | tag seq)
The best path: Start Det Adj Adj Noun Stop = 0.32 * 0.0009 * ...
(the cool directed autos)
(A sketch of the full product follows below.)
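
A sketch of the full product for that best path, multiplying each tag-bigram probability by the matching unigram-replacement probability. The slide only shows "0.32 * 0.0009 ..."; the remaining factors are filled in here purely as an illustration from the transition and emission numbers shown earlier:

```python
# Best path Start Det Adj Adj Noun Stop over "the cool directed autos".
# Each factor is (tag-bigram prob) * (word-given-tag prob); the final 0.2
# is the Noun -> Stop transition. Numbers come from the earlier slides.
factors = [
    0.8 * 0.4,      # Start->Det, Det:the       ≈ 0.32
    0.3 * 0.003,    # Det->Adj,   Adj:cool      ≈ 0.0009
    0.4 * 0.0005,   # Adj->Adj,   Adj:directed  ≈ 0.0002
    0.5 * 0.001,    # Adj->Noun,  Noun:autos    ≈ 0.0005
    0.2,            # Noun->Stop (epsilon arc)
]
p = 1.0
for f in factors:
    p *= f
print(p)   # roughly 5.76e-12
```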

In Fact, Paths Form a "Trellis"
p(word seq, tag seq): one column of states (Det, Adj, Noun) per word position, with arcs such as Det:the/0.32, Adj:cool/0.0009, Noun:cool/0.007, Adj:directed/..., Noun:autos/..., and ε/0.2 into Stop.
The best path: Start Det Adj Adj Noun Stop = 0.32 * 0.0009 * ...   (the cool directed autos)

The Trellis Shape Emerges from the Cross-Product Construction for Finite-State Composition
Composing (.o.) an automaton in which all paths are 4 words long with another machine gives states that are pairs of the two machines' states: 0,0; 1,1; 1,2; 1,3; 1,4; 2,1; ...; 4,4. So all paths in the result must have 4 words on the output side too.

Actually, Trellis Isn't Complete
p(word seq, tag seq): the trellis has no Det → Det or Det → Stop arcs; why? It is also missing some other arcs, and even some states; why?
The best path: Start Det Adj Adj Noun Stop = 0.32 * 0.0009 * ...   (the cool directed autos)

Find best path from Start to Stop
(Trellis arcs as before: Det:the/0.32, Adj:cool/0.0009, Noun:cool/0.007, Adj:directed/..., Noun:autos/..., ε/0.2.)
Use dynamic programming, like probabilistic parsing:
- What is the best path from Start to each node?
- Work from left to right.
- Each node stores its best path from Start (as a probability plus one backpointer).
- Special acyclic case of Dijkstra's shortest-path algorithm.
- Faster if some arcs/states are absent.
(A Viterbi-style sketch follows below.)
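
A Viterbi-style sketch of that left-to-right dynamic program, storing for each trellis node its best probability plus one backpointer. The toy transition/emission tables are the ones assumed in the earlier sketches, not the course's data:

```python
# Toy HMM tables (same assumed numbers as in the earlier sketches).
trans = {("Start", "Det"): 0.8, ("Det", "Adj"): 0.3, ("Det", "Noun"): 0.7,
         ("Adj", "Adj"): 0.4, ("Adj", "Noun"): 0.5,
         ("Noun", "Stop"): 0.2, ("Adj", "Stop"): 0.1}
emit = {("Det", "the"): 0.4, ("Adj", "cool"): 0.003, ("Noun", "cool"): 0.007,
        ("Adj", "directed"): 0.0005, ("Noun", "autos"): 0.001}
tags = ["Det", "Adj", "Noun"]

def viterbi(words):
    # best[t] = (prob of best path from Start ending in tag t, backpointer)
    best = {"Start": (1.0, None)}
    history = []
    for w in words:
        column = {}
        for t in tags:
            candidates = [(p * trans.get((prev, t), 0.0) * emit.get((t, w), 0.0), prev)
                          for prev, (p, _) in best.items()]
            column[t] = max(candidates)          # keep best prob + one backpointer
        history.append(column)
        best = column
    # Take the transition into Stop, then follow backpointers right to left.
    prob, last = max((p * trans.get((t, "Stop"), 0.0), t) for t, (p, _) in best.items())
    path = [last]
    for column in reversed(history):
        path.append(column[path[-1]][1])
    return prob, list(reversed(path))[1:]        # drop the initial Start

print(viterbi("the cool directed autos".split()))
# -> (approx 5.76e-12, ['Det', 'Adj', 'Adj', 'Noun'])
```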

In Summary
We are modeling p(word seq, tag seq). The tags are hidden, but we see the words. Is tag sequence X likely with these words?
The noisy channel model is a "Hidden Markov Model":
  tags:  Start PN Verb Det Noun Prep Noun Prep Det Noun Stop
  words: Bill directed a cortege of autos through the dunes
with probabilities from the tag bigram model (tag-to-tag arcs) and from unigram replacement (tag-to-word arcs), e.g. 0.4, 0.6, 0.001.
Find the X that maximizes the probability product.

Another Viewpoint
We are modeling p(word seq, tag seq). Why not use the chain rule + some kind of backoff? Actually, we are!
p(Start PN Verb Det Noun Prep Noun Prep Det Noun Stop, Bill directed a cortege of autos through the dunes)
  = p(Start) * p(PN | Start) * p(Verb | Start PN) * p(Det | Start PN Verb) * ...
  * p(Bill | Start PN Verb ...) * p(directed | Bill, Start PN Verb Det ...) * p(a | Bill directed, Start PN Verb Det ...) * ...

Three Finite-State Approaches
1. Noisy Channel Model (statistical)
2. Deterministic baseline tagger composed with a cascade of fixup transducers
3. Nondeterministic tagger composed with a cascade of finite-state automata that act as filters

Another FST Paradigm: Successive Fixups
Like successive markups, but the stages alter the annotation. Used for morphology, phonology, part-of-speech tagging, ...
Pipeline: input → Initial annotation → Fixup 1 → Fixup 2 → Fixup 3 → output

Transformation-Based Tagging (Brill 1995)
[figure from Brill's thesis]

Transformations Learned  (figure from Brill's thesis)
  BaselineTag*
  NN → VB / TO _
  VBP → VB / ... _
  etc.
Compose this cascade of FSTs. This gets one big FST that does the initial tagging and the sequence of fixups "all at once."
(A sketch of applying one such rule follows below.)
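
A minimal sketch of applying one such learned fixup rule, retagging NN as VB when the previous tag is TO, over a baseline-tagged sequence. The example sentence, the assumed baseline output, and the plain-Python rule application are illustrative; Brill's system compiles the learned rules into FSTs:

```python
# One learned transformation: NN -> VB when the preceding tag is TO.
def fixup_nn_to_vb(tags):
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == "NN" and out[i - 1] == "TO":
            out[i] = "VB"
    return out

# Baseline tagging of "to run" might call "run" a noun; the fixup repairs it.
words = ["I", "want", "to", "run"]
baseline = ["PRP", "VBP", "TO", "NN"]     # assumed baseline output
print(list(zip(words, fixup_nn_to_vb(baseline))))
# [('I', 'PRP'), ('want', 'VBP'), ('to', 'TO'), ('run', 'VB')]
```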

Initial Tagging of OOV Words
[figure from Brill's thesis]

Three Finite-State Approaches
1. Noisy Channel Model (statistical)
2. Deterministic baseline tagger composed with a cascade of fixup transducers
3. Nondeterministic tagger composed with a cascade of finite-state automata that act as filters

Variations
- Multiple tags per word
- Transformations to knock some of them out
- How to encode multiple tags and knockouts?
Use the above for partly supervised learning:
- Supervised: you have a tagged training corpus
- Unsupervised: you have an untagged training corpus
- Here: you have an untagged training corpus and a dictionary giving possible tags for each word
(A sketch of the dictionary idea follows below.)
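
For the partly supervised setting, a tiny sketch of the dictionary idea: the dictionary restricts which tags each word may receive, and a later model or knockout transformations must then choose among them. The words, tag sets, and the fallback tagset for unknown words are illustrative assumptions:

```python
# A tag dictionary: possible tags for each word (no tagged corpus needed).
tag_dict = {"the": {"Det"}, "lead": {"N", "V"}, "paint": {"N", "V"},
            "is": {"V"}, "unsafe": {"Adj"}}

def candidate_tags(word):
    """Unknown words are left open to every tag in the (toy) tagset."""
    return tag_dict.get(word, {"Det", "N", "V", "Adj"})

sentence = "the lead paint is unsafe".split()
print({w: sorted(candidate_tags(w)) for w in sentence})
# e.g. 'lead' is narrowed to ['N', 'V']; learning must pick one of those.
```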