POS Tagging: HMM Taggers (continued)

Today
–Walk through the guts of an HMM Tagger
–Address problems with HMM Taggers, specifically unknown words

HMM Tagger
What is the goal of a Markov tagger? To maximize the following expression:
P(w_i | t_i) x P(t_i | t_1,…,t_{i-1})
or
P(word | tag) x P(tag | previous n tags)
which simplifies, by the Markov assumption, to:
P(w_i | t_i) x P(t_i | t_{i-1})
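
Below is a minimal sketch, not from the slides, of the quantity a bigram HMM tagger scores for one candidate tag sequence. The dictionaries emit and trans (probabilities keyed by (tag, word) and (previous tag, tag)) and the start marker "<s>" are assumptions about how the model is stored.

    def score_sequence(words, tags, emit, trans, start="<s>"):
        """Product over i of P(w_i | t_i) * P(t_i | t_{i-1}) for one candidate tagging."""
        prob = 1.0
        prev = start
        for word, tag in zip(words, tags):
            prob *= emit.get((tag, word), 0.0) * trans.get((prev, tag), 0.0)
            prev = tag
        return prob

The tagger's job is then to find the tag sequence that maximizes this score over all candidate sequences.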

HMM Tagger
P(word | tag) x P(tag | previous n tags)
P(word | tag)
–The probability of the word given a tag (not vice versa)
–We model this with a word-tag matrix (the emission probabilities, i.e. the lexical model)
–Familiar? HW 4 (3)
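
A minimal sketch of building the word-tag (emission) matrix by relative-frequency estimation, assuming the training data is a flat list of (word, tag) pairs (the function name is an assumption, not from the slides):

    from collections import Counter

    def emission_probs(tagged_corpus):
        """P(word | tag) = Count(tag, word) / Count(tag), as a dict keyed by (tag, word)."""
        tag_counts = Counter(tag for _, tag in tagged_corpus)
        pair_counts = Counter((tag, word) for word, tag in tagged_corpus)
        return {(tag, word): count / tag_counts[tag]
                for (tag, word), count in pair_counts.items()}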

HMM Tagger
P(word | tag) x P(tag | previous n tags)
P(tag | previous n tags)
–How likely a tag is given the n preceding tags
–Simplified to just the previous tag (bigram)
–Modeled with a tag-tag matrix (the transition probabilities)
–Familiar? HW 4 (2)
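
And a matching sketch of the tag-tag (transition) matrix, assuming the training data is grouped into sentences of (word, tag) pairs; the "<s>" start marker is an assumption, not something the slides specify.

    from collections import Counter

    def transition_probs(tagged_sentences, start="<s>"):
        """P(t_i | t_{i-1}) = Count(t_{i-1}, t_i) / Count(t_{i-1}), keyed by (prev_tag, tag)."""
        prev_counts, bigram_counts = Counter(), Counter()
        for sentence in tagged_sentences:
            prev = start
            for _, tag in sentence:
                prev_counts[prev] += 1
                bigram_counts[(prev, tag)] += 1
                prev = tag
        return {(prev, tag): count / prev_counts[prev]
                for (prev, tag), count in bigram_counts.items()}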

HMM Tagger
But why is it P(word | tag) and not P(tag | word)? Take the following examples (from J&M):
1. Secretariat/NNP is/VBZ expected/VBN to/TO race/?? tomorrow/NN
2. People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/?? for/IN outer/JJ space/NN

HMM Tagger
Secretariat/NNP is/VBZ expected/VBN to/TO race/?? tomorrow/NN
Maximize: P(w_i | t_i) x P(t_i | t_{i-1})
We can choose between:
a. P(race | VB) P(VB | TO)
b. P(race | NN) P(NN | TO)

The Good HMM Tagger
From the Brown/Switchboard corpus:
–P(VB|TO) = .34
–P(NN|TO) = .021
–P(race|VB) = .00003
–P(race|NN) = .00041
a. P(VB|TO) x P(race|VB) = .34 x .00003 = .00001
b. P(NN|TO) x P(race|NN) = .021 x .00041 = .0000086
→ a. TO followed by VB is more probable in the context of race ('race' itself has little effect here).
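
As a quick check of the arithmetic, using the J&M figures quoted above (the exact numbers come from the textbook, not from this deck):

    p_vb = 0.34 * 0.00003    # P(VB|TO) * P(race|VB)
    p_nn = 0.021 * 0.00041   # P(NN|TO) * P(race|NN)
    print(p_vb, p_nn, "VB" if p_vb > p_nn else "NN")   # ~1.0e-05 vs ~8.6e-06 -> VB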

The No-Go HMM Tagger
Invert word and tag, P(tag | word) instead of P(word | tag):
1. P(VB|race) = .02
2. P(NN|race) = .98
This would always pick NN for 'race', regardless of context, and so gets example 1 wrong.

HMM Tagger
But don't we really want to maximize the probability of the best sequence of tags for a given sequence of words? Not just the best tag for a given word?
Thus, we really want to maximize (and implement):
P(t_1,…,t_n | w_1,…,w_n), or
T^ = argmax_T P(T|W)

HMM Tagger
By Bayes' Rule:
P(T|W) = P(T) P(W|T) / P(W)
Since P(W) is the same for every candidate tag sequence (why?), we can drop it and simply maximize:
P(T) P(W|T)

HMM Tagger
P(T) P(W|T) = P(t_1,…,t_n) P(w_1,…,w_n | t_1,…,t_n)
By the chain rule (which rewrites a joint probability as a product of conditional probabilities):
= P(t_n | t_1,…,t_{n-1}) x P(t_{n-1} | t_1,…,t_{n-2}) x … x P(t_1)
  x P(w_1 | t_1,…,t_n) x P(w_2 | w_1, t_1,…,t_n) x … x P(w_n | w_1,…,w_{n-1}, t_1,…,t_n)
Equivalently, applying the chain rule to the joint P(T, W) term by term:
= Π_{i=1..n} P(t_i | w_1 t_1 … w_{i-1} t_{i-1}) P(w_i | w_1 t_1 … w_{i-1} t_{i-1} t_i)

HMM Tagger
P(T) P(W|T) = Π_{i=1..n} P(t_i | w_1 t_1 … w_{i-1} t_{i-1}) P(w_i | w_1 t_1 … w_{i-1} t_{i-1} t_i)
Simplifying assumption: the probability of a word depends only on its own tag:
P(w_i | w_1 t_1 … w_{i-1} t_{i-1} t_i) = P(w_i | t_i)
And the Markov assumption (for a bigram tagger): a tag depends only on the previous tag:
P(t_i | w_1 t_1 … w_{i-1} t_{i-1}) = P(t_i | t_{i-1})
The best tag sequence is then:
T^ = argmax_T Π_{i=1..n} P(t_i | t_{i-1}) P(w_i | t_i)

Implementation
So the best tag sequence is found by maximizing:
T^ = argmax_T Π_{i=1..n} P(t_i | t_{i-1}) P(w_i | t_i)
Training: learn the probabilities from a tagged corpus
–State transition probabilities
–Emission probabilities
–Smoothing may be necessary

Training
An HMM needs to be trained on the following:
1. The initial state probabilities
2. The state transition probabilities
 –The tag-tag matrix
3. The emission probabilities
 –The tag-word matrix
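
The emission and transition counts were sketched earlier; here is a complementary sketch for the remaining pieces: initial-state probabilities estimated from sentence-initial tags, and add-one smoothing of a transition matrix over a known tagset. Function and parameter names are assumptions, not from the slides.

    from collections import Counter

    def initial_probs(tagged_sentences):
        """P(tag starts a sentence), estimated from the tag of each sentence's first word."""
        counts = Counter(sentence[0][1] for sentence in tagged_sentences)
        total = sum(counts.values())
        return {tag: count / total for tag, count in counts.items()}

    def smooth_transitions(bigram_counts, prev_counts, tagset):
        """Add-one smoothing: P(t | p) = (C(p, t) + 1) / (C(p) + |tagset|)."""
        return {(p, t): (bigram_counts.get((p, t), 0) + 1) / (prev_counts.get(p, 0) + len(tagset))
                for p in tagset for t in tagset}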

Implementation
Once trained, how do we implement such a maximization function?
T^ = argmax_T Π_{i=1..n} P(t_i | t_{i-1}) P(w_i | t_i)
Can't we just walk through every path, calculate all the probabilities, and choose the path with the highest probability (the max)?
Yes, if we have a lot of time. (Why?)
–The number of possible tag paths is exponential in the sentence length
–Better to use a dynamic programming algorithm, such as Viterbi
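
A minimal Viterbi sketch for the bigram model above; init_p, trans_p, and emit_p are assumed to be probability dictionaries like those sketched earlier, and unseen pairs simply get probability 0 here (smoothing and unknown words are discussed next).

    def viterbi(words, tagset, init_p, trans_p, emit_p):
        """Return the most probable tag sequence under the bigram HMM, via dynamic programming."""
        # best[i][t] = probability of the best tag path for words[:i+1] that ends in tag t
        best = [{t: init_p.get(t, 0.0) * emit_p.get((t, words[0]), 0.0) for t in tagset}]
        back = [{}]
        for i in range(1, len(words)):
            best.append({})
            back.append({})
            for t in tagset:
                prev, score = max(((p, best[i - 1][p] * trans_p.get((p, t), 0.0)) for p in tagset),
                                  key=lambda pair: pair[1])
                best[i][t] = score * emit_p.get((t, words[i]), 0.0)
                back[i][t] = prev
        # follow the backpointers from the best final tag
        tag = max(best[-1], key=best[-1].get)
        path = [tag]
        for i in range(len(words) - 1, 0, -1):
            path.append(back[i][path[-1]])
        return list(reversed(path))

Each position only needs the best score per tag from the previous position, so the work is O(n x |tagset|^2) rather than exponential in n.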

Unknown Words
The tagger just described will do poorly on unknown words. Why? Because P(w_i | t_i) = 0 for a word it has not seen (or, more precisely, for an unseen word-tag pair).
How do we resolve this problem?
–A dictionary with the most common tag (the "stupid tagger"); this still doesn't solve the problem for completely novel words
–Morphological/typographical analysis (suffixes, capitalization, hyphenation, digits)
–Estimating the probability of a tag generating an unknown word; secondary training required
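
One common way to give unknown words a non-zero score, sketched below as one possible version of the "probability of a tag generating an unknown word" idea: estimate tag probabilities from the suffixes of rare training words, then back off to those when a word was never seen. The suffix length, rarity threshold, and floor value are assumed parameters, not from the slides.

    from collections import Counter

    def suffix_tag_probs(tagged_corpus, max_suffix=3, rare_threshold=5):
        """P(tag | suffix), estimated from rare training words as stand-ins for unknown words."""
        word_freq = Counter(word for word, _ in tagged_corpus)
        suffix_tag, suffix_total = Counter(), Counter()
        for word, tag in tagged_corpus:
            if word_freq[word] <= rare_threshold:
                for k in range(1, max_suffix + 1):
                    suffix_tag[(word[-k:], tag)] += 1
                    suffix_total[word[-k:]] += 1
        return {(suffix, tag): count / suffix_total[suffix]
                for (suffix, tag), count in suffix_tag.items()}

    def unknown_word_score(word, tag, suffix_probs, max_suffix=3, floor=1e-6):
        """Back off from the longest matching suffix; used in place of P(word | tag) for unseen words."""
        for k in range(max_suffix, 0, -1):
            p = suffix_probs.get((word[-k:], tag))
            if p is not None:
                return p
        return floor

Strictly this estimates P(tag | suffix) rather than P(word | tag), so it is a rough stand-in score; capitalization and digit features could be folded in the same way.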