600.465 - Intro to NLP - J. Eisner
Part-of-Speech Tagging: A Canonical Finite-State Task

Slide 2: The Tagging Task
Input: the lead paint is unsafe
Output: the/Det lead/N paint/N is/V unsafe/Adj
Uses:
- text-to-speech (how do we pronounce "lead"?)
- can write regexps like (Det) Adj* N+ over the output
- preprocessing to speed up parser (but a little dangerous)
- if you know the tag, you can back off to it in other tasks
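For instance, the "(Det) Adj* N+" regexp can be run directly over the tag side of the tagged output. A minimal Python sketch using only the slide's example sentence (the helper names are just for illustration):

```python
import re

# Tagged output from the slide: word/TAG tokens.
tagged = "the/Det lead/N paint/N is/V unsafe/Adj"

# Pull out the tag sequence alone, space-separated: "Det N N V Adj".
tags = " ".join(tok.split("/")[1] for tok in tagged.split())

# (Det) Adj* N+ : optional determiner, any adjectives, one or more nouns.
np_pattern = re.compile(r"(Det )?(Adj )*(N( |$))+")

match = np_pattern.search(tags)
print(match.group(0).strip())  # "Det N N", covering "the lead paint"
```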

Slide 3: Why Do We Care?
- The first statistical NLP task
- Been done to death by different methods
- Easy to evaluate (how many tags are correct?)
- Canonical finite-state task
- Can be done well with methods that look at local context
- Though should "really" do it by parsing!
Input: the lead paint is unsafe
Output: the/Det lead/N paint/N is/V unsafe/Adj

Slide 4: Degree of Supervision
- Supervised: training corpus is tagged by humans
- Unsupervised: training corpus isn't tagged
- Partly supervised: training corpus isn't tagged, but you have a dictionary giving possible tags for each word
We'll start with the supervised case and move to decreasing levels of supervision.

Slide 5: Current Performance
- How many tags are correct? About 97% currently
- But baseline is already 90%
- Baseline is the performance of the stupidest possible method:
  - tag every word with its most frequent tag
  - tag unknown words as nouns
Input: the lead paint is unsafe
Output: the/Det lead/N paint/N is/V unsafe/Adj
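The 90% baseline is easy to make concrete. A minimal sketch, assuming a tiny hand-built list of (word, tag) training pairs in place of a real tagged corpus:

```python
from collections import Counter, defaultdict

# Toy training data: (word, tag) pairs; a real tagged corpus would be far larger.
training = [("the", "Det"), ("lead", "N"), ("lead", "V"), ("lead", "N"),
            ("paint", "N"), ("is", "V"), ("unsafe", "Adj")]

# Count how often each word receives each tag.
counts = defaultdict(Counter)
for word, tag in training:
    counts[word][tag] += 1

def baseline_tag(word):
    """Most-frequent-tag baseline; unknown words are tagged as nouns."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return "N"

print([f"{w}/{baseline_tag(w)}" for w in "the lead paint is unsafe".split()])
# ['the/Det', 'lead/N', 'paint/N', 'is/V', 'unsafe/Adj']
```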

Slide 6: What Should We Look At?
Bill directed a cortege of autos through the dunes
Correct tags: PN Verb Det Noun Prep Noun Prep Det Noun
[Figure: under each word, a column of some possible tags for that word (maybe more), drawn from PN, Verb, Noun, Adj, Det, Prep, ...]
Each unknown tag is constrained by its word and by the tags to its immediate left and right. But those tags are unknown too ...

Slide 9: Three Finite-State Approaches
Noisy Channel Model (statistical): real language X goes through a noisy channel to yield yucky language Y (X -> Y); we want to recover X from Y.
For tagging, X = part-of-speech tags (an n-gram model), the channel replaces tags with words, and Y = the text.

Slide 10: Three Finite-State Approaches
1. Noisy Channel Model (statistical)
2. Deterministic baseline tagger composed with a cascade of fixup transducers
3. Nondeterministic tagger composed with a cascade of finite-state automata that act as filters

Slide 11: Review: Noisy Channel
Real language X passes through the noisy channel to give yucky language Y (X -> Y).
p(X) * p(Y | X) = p(X, Y)
We want to recover x ∈ X from y ∈ Y: choose the x that maximizes p(x | y), or equivalently p(x, y).

Slide 12: Review: Noisy Channel
p(X) * p(Y | X) = p(X, Y)
p(X):     a:a/0.7  b:b/0.3
p(Y | X): a:D/0.9  a:C/0.1  b:C/0.8  b:D/0.2
Composed (.o.), these give p(X, Y): a:D/0.63  a:C/0.07  b:C/0.24  b:D/0.06
Note p(x, y) sums to 1. Suppose y = "C"; what is the best "x"?

Slide 14: Review: Noisy Channel
p(X) * p(Y | X) = p(X, y)
p(X):     a:a/0.7  b:b/0.3
p(Y | X): a:D/0.9  a:C/0.1  b:C/0.8  b:D/0.2
Composing further with C:C/1 (the constraint Y = y) restricts to just the paths compatible with output "C": a:C/0.07  b:C/0.24
Then take the best path.
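The arithmetic of slides 12-14 fits in a few lines of plain Python (no FST library); a sketch using only the toy weights shown on the slides:

```python
# Source model p(X) and channel model p(Y | X), as on the slides.
p_x = {"a": 0.7, "b": 0.3}
p_y_given_x = {("a", "D"): 0.9, ("a", "C"): 0.1,
               ("b", "C"): 0.8, ("b", "D"): 0.2}

# "Composition": joint weights p(x, y) = p(x) * p(y | x).
p_xy = {(x, y): p_x[x] * p for (x, y), p in p_y_given_x.items()}
# p(a,D)=0.63, p(a,C)=0.07, p(b,C)=0.24, p(b,D)=0.06 (up to float rounding)

# Restrict to paths compatible with the observation y = "C", then take the best path.
y_obs = "C"
compatible = {x: p for (x, y), p in p_xy.items() if y == y_obs}
print(compatible)                           # {'a': 0.07, 'b': 0.24} up to rounding
print(max(compatible, key=compatible.get))  # 'b' is the best x
```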

Slide 15: Noisy Channel for Tagging
The same composition, relabeled for tagging: p(X) .o. p(Y | X) .o. (Y = y) = p(X, y).
- p(X): an acceptor for p(tag sequence), the "Markov Model"
- p(Y | X): a transducer from tags to words, "Unigram Replacement"
- (Y = y): an acceptor for the observed words, a "straight line"
The result is a transducer that scores candidate tag sequences on their joint probability with the observed words; pick the best path.

Slide 16: Markov Model (bigrams)
[Figure: a state diagram over the tags, with states Start, Det, Adj, Noun, Verb, Prep, Stop.]

Slide 19: Markov Model
[Figure: the same state diagram, now with probabilities on its arcs.]
p(tag seq): Start Det Adj Adj Noun Stop = 0.8 * 0.3 * 0.4 * 0.5 * 0.2

Slide 20: Markov Model as an FSA
[Figure: the same weighted diagram, read as a finite-state acceptor.]
p(tag seq): Start Det Adj Adj Noun Stop = 0.8 * 0.3 * 0.4 * 0.5 * 0.2

Slide 21: Markov Model as an FSA
[Figure: states Start, Det, Adj, Noun, Verb, Prep, Stop with weighted arcs: Start -Det 0.8-> Det; Det -Noun 0.7-> Noun; Det -Adj 0.3-> Adj; Adj -Adj 0.4-> Adj; Adj -Noun 0.5-> Noun; Adj -ε 0.1-> Stop; Noun -ε 0.2-> Stop.]
p(tag seq): Start Det Adj Adj Noun Stop = 0.8 * 0.3 * 0.4 * 0.5 * 0.2

Slide 22: Markov Model (tag bigrams)
[Figure: just the states and arcs on the example path: Start -Det 0.8-> Det -Adj 0.3-> Adj -Adj 0.4-> Adj -Noun 0.5-> Noun -ε 0.2-> Stop.]
p(tag seq): Start Det Adj Adj Noun Stop = 0.8 * 0.3 * 0.4 * 0.5 * 0.2
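The path probability on slides 19-22 is just a product of tag-bigram probabilities. A small sketch using the arc weights shown in the figure (only the labeled arcs are included, so the table is partial; the ε arcs into Stop are written as ordinary Stop transitions):

```python
# Tag-bigram (transition) probabilities read off the slides' FSA.
trans = {("Start", "Det"): 0.8, ("Det", "Adj"): 0.3, ("Det", "Noun"): 0.7,
         ("Adj", "Adj"): 0.4, ("Adj", "Noun"): 0.5,
         ("Adj", "Stop"): 0.1, ("Noun", "Stop"): 0.2}

def p_tag_seq(tags):
    """p(tag seq) = product of the bigram probabilities along the path."""
    p = 1.0
    for prev, cur in zip(tags, tags[1:]):
        p *= trans[(prev, cur)]
    return p

print(p_tag_seq(["Start", "Det", "Adj", "Adj", "Noun", "Stop"]))
# 0.8 * 0.3 * 0.4 * 0.5 * 0.2 = 0.0096
```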

Slide 23: Noisy Channel for Tagging
p(X) .o. p(Y | X) .o. (Y = y) = p(X, y), where the last factor contributes p(y | Y).
- automaton for p(tag sequence): the "Markov Model"
- transducer from tags to words: "Unigram Replacement"
- automaton for the observed words: the "straight line"
The composed transducer scores candidate tag sequences on their joint probability with the observed words; pick the best path.

Slide 24: Noisy Channel for Tagging
The composed transducer scores candidate tag sequences on their joint probability with the observed words; we should pick the best path.
[Figure: the three machines instantiated:
- the Markov Model FSA over Start, Det, Adj, Noun, Verb, Prep, Stop (arc weights Det 0.8, Noun 0.7, Adj 0.3, Adj 0.4, Noun 0.5, ε 0.1, ε 0.2);
- Unigram Replacement arcs such as Det:the/0.4, Det:a/0.6, Adj:cool/0.003, Adj:directed/..., Adj:cortege/..., Noun:Bill/0.002, Noun:autos/0.001, Noun:cortege/...;
- the straight-line acceptor for the observed words "the cool directed autos".]

Slide 25: Unigram Replacement Model
p(word seq | tag seq)
Arcs: Det:the/0.4, Det:a/0.6; Adj:cool/0.003, Adj:directed/..., Adj:cortege/..., ...; Noun:Bill/0.002, Noun:autos/0.001, Noun:cortege/..., ...
For each tag, the replacement probabilities sum to 1 (e.g. Det:the 0.4 + Det:a 0.6 = 1).
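The unigram replacement model is just a p(word | tag) table. A sketch that keeps only the probabilities the slide states explicitly (the elided values stay elided):

```python
# p(word | tag) for the arcs whose weights the slide gives.
emit = {("Det", "the"): 0.4, ("Det", "a"): 0.6,
        ("Adj", "cool"): 0.003,
        ("Noun", "Bill"): 0.002, ("Noun", "autos"): 0.001}

def p_words_given_tags(words, tags):
    """p(word seq | tag seq) = product over positions of p(word_i | tag_i)."""
    p = 1.0
    for w, t in zip(words, tags):
        p *= emit[(t, w)]
    return p

print(p_words_given_tags(["the", "autos"], ["Det", "Noun"]))  # 0.4 * 0.001 = 0.0004
```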

Slide 26: Compose
[Figure: the Markov Model FSA for p(tag seq) (arcs Det 0.8, Adj 0.3, Adj 0.4, Noun 0.5, Noun 0.7, ε 0.1, ε 0.2 over states Start, Det, Adj, Noun, Verb, Prep, Stop) is composed with the Unigram Replacement transducer (Det:the/0.4, Det:a/0.6, Adj:cool/0.003, Adj:directed/..., Adj:cortege/..., Noun:Bill/0.002, Noun:autos/0.001, Noun:cortege/...).]

Slide 27: Compose
p(word seq, tag seq) = p(tag seq) * p(word seq | tag seq)
[Figure: the composed machine over states Start, Det, Adj, Noun, Verb, Prep, Stop. Each tag arc is replaced by word-emitting arcs whose weights multiply the transition and emission probabilities: the Start -> Det arc (0.8) becomes Det:the 0.32 (= 0.8 * 0.4) and Det:a 0.48 (= 0.8 * 0.6); the Adj arcs become Adj:cool, Adj:directed, Adj:cortege, ...; the Noun arcs become N:cortege, N:autos, ...; plus an ε arc into Stop.]
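The composed arc weights on slide 27 are just transition times emission. A tiny self-contained check of the two weights the slide shows:

```python
# Composed arc weight = p(next tag | prev tag) * p(word | next tag).
# Start -> Det has probability 0.8; Det emits "the" with 0.4 and "a" with 0.6.
trans_start_det = 0.8
emit_det = {"the": 0.4, "a": 0.6}

for word, p_emit in emit_det.items():
    print(f"Det:{word} {trans_start_det * p_emit:.2f}")
# Det:the 0.32
# Det:a 0.48
```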

Slide 28: Observed Words as a Straight-Line FSA
[Figure: a straight-line automaton that accepts exactly the word sequence "the cool directed autos".]

Slide 29: Compose with "the cool directed autos"
p(word seq, tag seq) = p(tag seq) * p(word seq | tag seq)
[Figure: the composed tagging machine from slide 27 (arcs Det:the 0.32, Det:a 0.48, Adj:cool, Adj:directed, Adj:cortege, N:cortege, N:autos, ...) is now composed with the straight-line acceptor for "the cool directed autos".]

Slide 30: Compose with "the cool directed autos"
p(word seq, tag seq) = p(tag seq) * p(word seq | tag seq)
[Figure: after composition, only arcs compatible with the observed words remain, e.g. Det:the 0.32, Adj:cool, Adj:directed, N:autos. The Adj self-loop is gone: why did this loop go away?]

Slide 31: [Figure: the restricted machine again, with arcs Det:the 0.32, Adj:cool, Adj:directed, N:autos, ...]
p(word seq, tag seq) = p(tag seq) * p(word seq | tag seq)
The best path: Start Det Adj Adj Noun Stop (the cool directed autos) = 0.32 * ...

Slide 32: In Fact, Paths Form a "Trellis"
p(word seq, tag seq)
[Figure: a trellis with a column of states (Det, Adj, Noun) for each word position between Start and Stop; arcs include Det:the 0.32, Adj:cool, Noun:cool 0.007, Adj:directed..., Noun:autos..., and an ε 0.2 arc into Stop.]
The best path: Start Det Adj Adj Noun Stop (the cool directed autos) = 0.32 * ...

Slide 33: The Trellis Shape Emerges from the Cross-Product Construction for Finite-State Composition
[Figure: composing the tagging machine with the 4-word straight-line acceptor (.o.). The composed states are pairs: 0,0 at the start; then columns 1,1 2,1 3,1; 1,2 2,2 3,2; 1,3 2,3 3,3; 1,4 2,4 3,4; and a final state 4,...]
All paths in the straight-line acceptor are 4 words, so all paths in the composition must have 4 words on the output side.
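The cross-product construction is why composition with a straight-line acceptor yields a trellis: each composed state pairs a word position with a tag state. A small sketch; the tag-FSA fragment below is a simplified, hypothetical skeleton, not the full machine from the slides:

```python
from itertools import product

# Hypothetical unweighted fragment of the tag FSA: state -> [(tag, next_state), ...].
tag_arcs = {"Start": [("Det", "Det")],
            "Det":   [("Adj", "Adj"), ("Noun", "Noun")],
            "Adj":   [("Adj", "Adj"), ("Noun", "Noun")],
            "Noun":  []}

words = ["the", "cool", "directed", "autos"]  # straight line with positions 0..4

# Cross-product construction: a composed state is (position i, tag state q), and
# there is an arc (i, q) -> (i+1, q') whenever the tag FSA has an arc q -> q'.
trellis_arcs = [((i, q), f"{tag}:{words[i]}", (i + 1, q2))
                for i, q in product(range(len(words)), tag_arcs)
                for tag, q2 in tag_arcs[q]]

for arc in trellis_arcs[:5]:
    print(arc)   # e.g. ((0, 'Start'), 'Det:the', (1, 'Det'))
```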

Slide 34: Actually, the Trellis Isn't Complete
p(word seq, tag seq)
[Figure: the trellis from slide 32.]
The best path: Start Det Adj Adj Noun Stop (the cool directed autos) = 0.32 * ...
The trellis has no Det -> Det or Det -> Stop arcs; why?

Slide 35: Actually, the Trellis Isn't Complete
p(word seq, tag seq)
[Figure: the same trellis.]
The best path: Start Det Adj Adj Noun Stop (the cool directed autos) = 0.32 * ...
The lattice is missing some other arcs; why?

Slide 36: Actually, the Trellis Isn't Complete
p(word seq, tag seq)
[Figure: the trellis again, now with fewer states per column.]
The best path: Start Det Adj Adj Noun Stop (the cool directed autos) = 0.32 * ...
The lattice is missing some states; why?

Slide 37: Find the Best Path from Start to Stop
- Use dynamic programming, as in probabilistic parsing:
  - What is the best path from Start to each node?
  - Work from left to right
  - Each node stores its best path from Start (as a probability plus one backpointer)
- Special acyclic case of Dijkstra's shortest-path algorithm
- Faster if some arcs/states are absent
[Figure: the trellis again, with arcs such as Det:the 0.32, Adj:cool, Noun:cool 0.007, Adj:directed..., Noun:autos..., ε 0.2.]
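This left-to-right dynamic program is the Viterbi algorithm. A minimal, self-contained sketch: the transition weights repeat the slides' arcs, while emission values the slides elide (e.g. for "directed") are hypothetical placeholders, so only the shape of the answer, not its probability, matches the slides:

```python
# best[i][tag] = (prob, backpointer): best path from Start that tags the
# first i words and ends in state `tag`.
trans = {("Start", "Det"): 0.8, ("Det", "Adj"): 0.3, ("Det", "Noun"): 0.7,
         ("Adj", "Adj"): 0.4, ("Adj", "Noun"): 0.5,
         ("Adj", "Stop"): 0.1, ("Noun", "Stop"): 0.2}
emit = {("Det", "the"): 0.4,                                      # from the slides
        ("Adj", "cool"): 0.003, ("Noun", "cool"): 0.007,          # from the slides
        ("Adj", "directed"): 0.001, ("Noun", "directed"): 0.001,  # placeholders
        ("Noun", "autos"): 0.001}                                 # from the slides

def viterbi(words):
    best = [{"Start": (1.0, None)}]
    for i, w in enumerate(words):
        col = {}
        for (prev, tag), p_trans in trans.items():
            if prev in best[i] and (tag, w) in emit:
                p = best[i][prev][0] * p_trans * emit[(tag, w)]
                if p > col.get(tag, (0.0, None))[0]:
                    col[tag] = (p, prev)   # keep the best way into `tag`
        best.append(col)
    # Transition into Stop, then follow backpointers right to left.
    p_best, tag = max((best[-1][t][0] * trans[(t, "Stop")], t)
                      for t in best[-1] if (t, "Stop") in trans)
    path = []
    for col in reversed(best[1:]):
        path.append(tag)
        tag = col[tag][1]
    return p_best, ["Start"] + path[::-1] + ["Stop"]

print(viterbi("the cool directed autos".split()))
# best tag sequence: Start Det Adj Adj Noun Stop
```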

Slide 38: In Summary
- We are modeling p(word seq, tag seq)
- The tags are hidden, but we see the words
- Is tag sequence X likely with these words?
- The noisy channel model is a "Hidden Markov Model":
  Start PN Verb Det Noun Prep Noun Prep Det Noun Stop
        Bill directed a cortege of autos through the dunes
  with probs from the tag bigram model between the tags and probs from unigram replacement from each tag to its word
- Find the X that maximizes the probability product

Slide 39: Another Viewpoint
- We are modeling p(word seq, tag seq)
- Why not use the chain rule + some kind of backoff?
- Actually, we are!
p(Start PN Verb Det ..., Bill directed a ...)
  = p(Start) * p(PN | Start) * p(Verb | Start PN) * p(Det | Start PN Verb) * ...
    * p(Bill | Start PN Verb ...) * p(directed | Bill, Start PN Verb Det ...)
    * p(a | Bill directed, Start PN Verb Det ...) * ...

Slide 40: Another Viewpoint
- We are modeling p(word seq, tag seq)
- Why not use the chain rule + some kind of backoff?
- Actually, we are!
p(Start PN Verb Det ..., Bill directed a ...)
  = p(Start) * p(PN | Start) * p(Verb | Start PN) * p(Det | Start PN Verb) * ...
    * p(Bill | Start PN Verb ...) * p(directed | Bill, Start PN Verb Det ...)
    * p(a | Bill directed, Start PN Verb Det ...) * ...
Start PN Verb Det Noun Prep Noun Prep Det Noun Stop
      Bill directed a cortege of autos through the dunes

Slide 41: Three Finite-State Approaches
1. Noisy Channel Model (statistical)
2. Deterministic baseline tagger composed with a cascade of fixup transducers
3. Nondeterministic tagger composed with a cascade of finite-state automata that act as filters

Slide 42: Another FST Paradigm: Successive Fixups
- Like successive markups, but they alter the annotation
- Examples: morphology, phonology, part-of-speech tagging, ...
[Figure: input -> Initial annotation -> Fixup 1 -> Fixup 2 -> Fixup 3 -> output]

Slide 43: Transformation-Based Tagging (Brill 1995)
[Figure from Brill's thesis.]

Slide 44: Transformations Learned
[Figure from Brill's thesis.]
Compose this cascade of FSTs. This gets a big FST that does the initial tagging and the sequence of fixups "all at once":
Baseline, then Tag* -> VB // TO _, then -> VB // ... _, etc.
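A sketch of what one such learned fixup does when applied to the baseline tagger's output. The rule encoding below is a simplified assumption, not Brill's actual rule format; the rule itself (retag as VB when the previous tag is TO) is Brill's classic first transformation:

```python
# Apply one Brill-style transformation left to right over a tag sequence:
# change `from_tag` to `to_tag` whenever the previous tag is `prev_tag`.
def apply_fixup(tags, from_tag, to_tag, prev_tag):
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == from_tag and out[i - 1] == prev_tag:
            out[i] = to_tag
    return out

# "she wanted to increase sales": the baseline tags "increase" as a noun.
baseline_tags = ["PRP", "VBD", "TO", "NN", "NNS"]
print(apply_fixup(baseline_tags, "NN", "VB", "TO"))
# ['PRP', 'VBD', 'TO', 'VB', 'NNS']  -- "increase" is fixed up to a verb
```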

Slide 45: Initial Tagging of OOV Words
[Figure from Brill's thesis.]

Slide 46: Three Finite-State Approaches
1. Noisy Channel Model (statistical)
2. Deterministic baseline tagger composed with a cascade of fixup transducers
3. Nondeterministic tagger composed with a cascade of finite-state automata that act as filters

Slide 47: Variations
- Multiple tags per word
  - Transformations to knock some of them out
  - How to encode multiple tags and knockouts?
- Use the above for partly supervised learning (a small sketch of the setup follows below):
  - Supervised: you have a tagged training corpus
  - Unsupervised: you have an untagged training corpus
  - Here: you have an untagged training corpus and a dictionary giving possible tags for each word
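A small sketch of the partly supervised setting: the dictionary restricts which tags each word may receive, and that restricted set of candidates is what the learner (or a nondeterministic tagger) starts from. The dictionary entries here are hypothetical examples:

```python
# Hypothetical tag dictionary: each word -> its set of possible tags.
tag_dict = {"the": {"Det"}, "lead": {"N", "V", "Adj"},
            "paint": {"N", "V"}, "is": {"V"}, "unsafe": {"Adj"}}

def candidate_tags(words, default=("N",)):
    """One column of candidate tags per word; unknown words get the default set."""
    return [sorted(tag_dict.get(w, default)) for w in words]

print(candidate_tags("the lead paint is unsafe".split()))
# [['Det'], ['Adj', 'N', 'V'], ['N', 'V'], ['V'], ['Adj']]
```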