Lecture 6: Part of Speech Tagging (II), October 14, 2004, Neal Snider


LING 138/238 / SYMBSYS 138, Autumn 2004: Intro to Computer Speech and Language Processing
Lecture 6: Part of Speech Tagging (II), October 14, 2004, Neal Snider
Thanks to Dan Jurafsky, Jim Martin, Dekang Lin, and Bonnie Dorr for some of the examples and details in these slides!

Week 3: Part of Speech Tagging
- Parts of speech
- What's POS tagging good for, anyhow?
- Tag sets
- Rule-based tagging
- Statistical tagging
- "TBL" tagging

Rule-based tagging
- Start with a dictionary
- Assign all possible tags to words from the dictionary
- Write rules by hand to selectively remove tags, leaving the correct tag for each word

3 methods for POS tagging
- Rule-based tagging (ENGTWOL)
- Stochastic (= probabilistic) tagging: HMM (Hidden Markov Model) tagging
- Transformation-based tagging: the Brill tagger

Statistical Tagging
- Based on probability theory
- Today we'll go over a few basic ideas of probability theory
- Then we'll do HMM and TBL tagging

Probability and part of speech tags
- What's the probability of drawing a 2 from a deck of 52 cards with four 2s?
- What's the probability of a random word (from a random dictionary page) being a verb?

Probability and part of speech tags
- What's the probability of a random word (from a random dictionary page) being a verb?
- How to compute each of these:
  - All words: just count all the words in the dictionary
  - Number of ways to get a verb: the number of words which are verbs
- If a dictionary has 50,000 entries and 10,000 are verbs, then P(V) = 10000/50000 = 1/5 = .20

Probability and Independent Events
- What's the probability of picking two verbs randomly from the dictionary?
- Events are independent, so multiply the probabilities: P(w1=V, w2=V) = P(V) * P(V) = 1/5 * 1/5 = 0.04
- What if events are not independent?

Conditional Probability
- Written P(A|B)
- Let's say A is "it's raining", and P(A) in drought-stricken California is .01
- Let's say B is "it was sunny ten minutes ago"
- P(A|B) means "what is the probability of it raining now, given that it was sunny 10 minutes ago?"
- P(A|B) is probably way less than P(A); let's say P(A|B) is .0001

Conditional Probability and Tags
- P(Verb) is the probability of a randomly selected word being a verb
- P(Verb|race) is "what's the probability of a word being a verb, given that it's the word 'race'?"
- Race can be a noun or a verb; it's more likely to be a noun
- P(tag|race) can be estimated by looking at some corpus and asking "out of all the times we saw 'race', how many were nouns (or verbs)?"
- In the Brown corpus, P(Noun|race) = 96/98 = .98, so P(Verb|race) = .02
- How do we calculate this for a tag sequence, say P(NN|DT)?
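To make the estimation concrete, here is a minimal Python sketch that computes P(tag | word) by counting over a tiny, invented tagged corpus. The toy counts are assumptions for illustration only; the .98 figure quoted above comes from the real Brown corpus.

```python
from collections import Counter

# Toy tagged corpus of (word, tag) pairs; counts are made up for illustration.
tagged_corpus = [
    ("the", "DT"), ("race", "NN"), ("was", "VBD"), ("long", "JJ"),
    ("they", "PRP"), ("race", "VB"), ("home", "NN"),
    ("the", "DT"), ("race", "NN"), ("ended", "VBD"),
]

word_tag_counts = Counter(tagged_corpus)             # c(word, tag)
word_counts = Counter(w for w, _ in tagged_corpus)   # c(word)

def p_tag_given_word(tag, word):
    """Estimate P(tag | word) = c(word, tag) / c(word)."""
    return word_tag_counts[(word, tag)] / word_counts[word]

print(p_tag_given_word("NN", "race"))  # 2/3 in this toy corpus
print(p_tag_given_word("VB", "race"))  # 1/3
```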

Stochastic Tagging
- Based on the probability of a certain tag occurring, given various possibilities
- Necessitates a training corpus
- No probabilities for words not in the corpus
- The training corpus may be too different from the test corpus

Stochastic Tagging (cont.)
- Simple method: choose the most frequent tag in the training text for each word!
- Result: about 90% accuracy
- Why bother with anything more? This is the baseline: other taggers must do better
- The HMM tagger is one example
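A minimal sketch of this most-frequent-tag baseline, assuming a tiny invented training set: it memorizes each word's commonest training tag and falls back to NN for unseen words.

```python
from collections import Counter, defaultdict

# Training data: (word, tag) pairs (invented for illustration).
train = [("the", "DT"), ("race", "NN"), ("is", "VBZ"), ("on", "IN"),
         ("they", "PRP"), ("race", "VB"), ("the", "DT"), ("race", "NN")]

counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1

# For each word, keep only its single most frequent tag.
most_frequent_tag = {w: c.most_common(1)[0][0] for w, c in counts.items()}

def baseline_tag(sentence, default="NN"):
    """Tag known words with their most frequent training tag; unknown words get a default."""
    return [(w, most_frequent_tag.get(w, default)) for w in sentence]

print(baseline_tag(["the", "race", "is", "on"]))
```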

HMM Tagger
- Intuition: pick the most likely tag for this word
- HMM taggers choose the tag sequence that maximizes this formula: P(word|tag) × P(tag|previous n tags)
- Let T = t1,t2,…,tn and W = w1,w2,…,wn
- Find the POS tags that generate the sequence of words, i.e., look for the most probable sequence of tags T underlying the observed words W

HMM Tagger
- argmaxT P(T|W)
- = argmaxT P(W|T) P(T) / P(W)   (Bayes' rule)
- = argmaxT P(W|T) P(T)   since P(W) is the same for every candidate tag sequence
- = argmaxT P(w1…wn|t1…tn) P(t1…tn)
- Remember, we are trying to find the sequence T that maximizes P(T|W), so this equation is calculated over the whole sentence

HMM Tagger
- Assume each word depends only on its own POS tag: it is independent of the other words around it
- argmaxT [P(w1|t1) P(w2|t2) … P(wn|tn)] [P(t1) P(t2|t1) … P(tn|tn-1)]

Bigram HMM Tagger
- Also assume that each tag depends only on the previous tag
- For each word wi and each possible tag ti, compute the local score P(wi|ti) P(ti|ti-1)
- Multiplying these scores over every position in the sentence gives the probability of a complete tag sequence
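A small sketch of that bookkeeping, assuming the emission and transition probabilities are already stored in dictionaries; the toy tables below mix the Brown-corpus race figures quoted later with invented values.

```python
def sequence_score(words, tags, emis, trans, start="<s>"):
    """Product of P(w_i | t_i) * P(t_i | t_{i-1}) over the sentence.
    emis[(w, t)] and trans[(t_prev, t)] are probability tables."""
    score = 1.0
    prev = start
    for w, t in zip(words, tags):
        score *= emis.get((w, t), 0.0) * trans.get((prev, t), 0.0)
        prev = t
    return score

# Tiny, partly invented tables, just to show the bookkeeping.
emis = {("race", "NN"): 0.00041, ("race", "VB"): 0.00003, ("to", "TO"): 0.9}
trans = {("<s>", "TO"): 0.1, ("TO", "NN"): 0.021, ("TO", "VB"): 0.34}

print(sequence_score(["to", "race"], ["TO", "NN"], emis, trans))
print(sequence_score(["to", "race"], ["TO", "VB"], emis, trans))  # the VB reading scores higher
```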

Bigram HMM Tagger
- How do we compute P(ti|ti-1)? c(ti-1, ti) / c(ti-1)
- How do we compute P(wi|ti)? c(wi, ti) / c(ti)
- How do we compute the most probable tag sequence? The Viterbi algorithm
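A minimal sketch of these count-based (maximum likelihood) estimates over a tiny invented tagged corpus; a `<s>` pseudo-tag for the start of each sentence is an assumption of the sketch, not something from the slides.

```python
from collections import Counter

# Tiny illustrative tagged corpus (list of sentences of (word, tag) pairs).
corpus = [
    [("the", "DT"), ("race", "NN"), ("is", "VBZ"), ("on", "IN")],
    [("the", "DT"), ("big", "JJ"), ("race", "NN")],
    [("people", "NNS"), ("race", "VBP"), ("to", "TO"), ("work", "VB")],
]

tag_bigrams = Counter()   # c(t_{i-1}, t_i)
tag_counts = Counter()    # c(t)
word_tag = Counter()      # c(w, t)

for sent in corpus:
    prev = "<s>"          # sentence-start pseudo-tag
    tag_counts[prev] += 1
    for word, tag in sent:
        tag_bigrams[(prev, tag)] += 1
        tag_counts[tag] += 1
        word_tag[(word, tag)] += 1
        prev = tag

def p_transition(tag, prev_tag):
    """P(t_i | t_{i-1}) = c(t_{i-1}, t_i) / c(t_{i-1})"""
    return tag_bigrams[(prev_tag, tag)] / tag_counts[prev_tag]

def p_emission(word, tag):
    """P(w_i | t_i) = c(w_i, t_i) / c(t_i)"""
    return word_tag[(word, tag)] / tag_counts[tag]

print(p_transition("NN", "DT"))   # 0.5: half of the DT tags are followed by NN
print(p_emission("race", "NN"))   # 1.0: both NN tokens are "race" in this toy corpus
```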

An Example
- Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
- People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
- to/TO race/???   the/DT race/???
- ti = argmaxj P(tj|ti-1) P(wi|tj), where i is the position of the word in the sequence and j ranges over that word's possible tags
- Compare max[ P(VB|TO) P(race|VB), P(NN|TO) P(race|NN) ]
- Brown corpus estimates: P(NN|TO) × P(race|NN) = .021 × .00041 = .000007; P(VB|TO) × P(race|VB) = .34 × .00003 = .00001, so VB wins
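For illustration, the comparison can be reproduced by multiplying the quoted Brown-corpus estimates directly; this toy check is not part of the original slides.

```python
# Multiply the quoted Brown-corpus estimates for "to/TO race/???".
p_vb = 0.34 * 0.00003     # P(VB|TO) * P(race|VB)
p_nn = 0.021 * 0.00041    # P(NN|TO) * P(race|NN)
best = "VB" if p_vb > p_nn else "NN"
print(f"VB path: {p_vb:.1e}, NN path: {p_nn:.1e}, choose {best}")  # VB wins
```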

Viterbi Algorithm
[Trellis diagram over states S1 S2 S3 S4 S5]
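Since the trellis diagram does not survive in the transcript, here is a minimal Viterbi sketch for a bigram HMM. The probability tables and the toy "to race" model are assumptions for illustration; a real tagger would add smoothing and work in log space.

```python
def viterbi(words, tags, trans, emis, start="<s>"):
    """Most probable tag sequence under a bigram HMM.
    trans[(t_prev, t)] = P(t | t_prev), emis[(w, t)] = P(w | t)."""
    # best[i][t] = (probability of the best path ending in tag t at position i, backpointer)
    best = [{t: (trans.get((start, t), 0.0) * emis.get((words[0], t), 0.0), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            prob, prev = max(
                (best[i - 1][tp][0] * trans.get((tp, t), 0.0) * emis.get((words[i], t), 0.0), tp)
                for tp in tags)
            column[t] = (prob, prev)
        best.append(column)
    # Trace back from the best final tag.
    last = max(best[-1], key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))

# Toy model: with these (partly invented) numbers, "to race" comes out TO VB.
tags = ["TO", "VB", "NN"]
trans = {("<s>", "TO"): 0.2, ("TO", "VB"): 0.34, ("TO", "NN"): 0.021}
emis = {("to", "TO"): 0.9, ("race", "VB"): 0.00003, ("race", "NN"): 0.00041}
print(viterbi(["to", "race"], tags, trans, emis))   # ['TO', 'VB']
```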

Transformation-Based Tagging (Brill Tagging)
- A combination of rule-based and stochastic tagging methodologies
- Like the rule-based approach, because rules are used to specify tags in a certain environment
- Like the stochastic approach, because machine learning is used, with a tagged corpus as input
- Input: a tagged corpus and a dictionary (with most frequent tags)

Transformation-Based Tagging (cont.)
- Basic idea: set the most probable tag for each word as a start value, then change tags according to rules of the type "if word-1 is a determiner and word is a verb, then change the tag to noun", applied in a specific order
- Training is done on a tagged corpus:
  1. Write a set of rule templates
  2. Among the rules instantiated from the templates, find the one with the highest score
  3. Continue from 2 until the best score falls below some threshold
  4. Keep the ordered set of rules
- Rules make errors that are corrected by later rules

TBL Rule Application
- The tagger labels every word with its most-likely tag; for example, race has the following probabilities in the Brown corpus: P(NN|race) = .98, P(VB|race) = .02
- Transformation rules then make changes to tags, e.g., "Change NN to VB when the previous tag is TO"
- … is/VBZ expected/VBN to/TO race/NN tomorrow/NN
- becomes
- … is/VBZ expected/VBN to/TO race/VB tomorrow/NN
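A minimal sketch of applying one such transformation ("change NN to VB when the previous tag is TO") to a tagged sentence; the function and rule format are assumptions for illustration, not Brill's actual implementation.

```python
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag wherever the previous tag is prev_tag."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out

sentence = [("is", "VBZ"), ("expected", "VBN"), ("to", "TO"),
            ("race", "NN"), ("tomorrow", "NN")]
print(apply_rule(sentence, "NN", "VB", "TO"))
# Only the NN right after TO changes: ... to/TO race/VB tomorrow/NN
```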

TBL: Rule Learning
- Two parts to a rule: a triggering environment and a rewrite rule
- The range of triggering environments of the templates is given in Manning & Schütze (1999:363)
- [Table: nine schemas, each marking which of the surrounding tag positions ti-3, ti-2, ti-1, ti, ti+1, ti+2, ti+3 serves as the triggering environment]

TBL: The Tagging Algorithm
- Step 1: Label every word with its most likely tag (from the dictionary)
- Step 2: Check every possible transformation and select the one that most improves the tagging
- Step 3: Re-tag the corpus applying the rule
- Repeat steps 2-3 until some criterion is reached, e.g., X% correct with respect to the training corpus
- Result: an ordered sequence of transformation rules
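A sketch of this greedy learning loop under simplifying assumptions: candidate rules are triples (from_tag, to_tag, previous_tag), the "corpus" is one short sentence, and a rule's score is just the reduction in errors against the gold tags.

```python
def apply_rule(rule, tagged):
    """rule = (from_tag, to_tag, prev_tag): change from_tag to to_tag after prev_tag."""
    from_tag, to_tag, prev_tag = rule
    out = list(tagged)
    for i in range(1, len(out)):
        if out[i][1] == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (out[i][0], to_tag)
    return out

def errors(tagging, gold):
    return sum(1 for (_, t), (_, g) in zip(tagging, gold) if t != g)

def tbl_learn(tagging, gold, candidate_rules, min_gain=1):
    """Greedy TBL training: repeat steps 2-3 until no rule helps enough."""
    learned = []
    while True:
        current = errors(tagging, gold)
        # Step 2: score every candidate rule by how many errors it removes.
        scored = [(current - errors(apply_rule(r, tagging), gold), r)
                  for r in candidate_rules]
        best_gain, best_rule = max(scored)
        if best_gain < min_gain:        # stopping criterion
            break
        # Step 3: re-tag with the chosen rule and keep it, in order.
        tagging = apply_rule(best_rule, tagging)
        learned.append(best_rule)
    return learned

# Toy demo: initial most-likely tags vs. a gold standard.
initial = [("to", "TO"), ("race", "NN"), ("the", "DT"), ("race", "NN")]
gold    = [("to", "TO"), ("race", "VB"), ("the", "DT"), ("race", "NN")]
rules   = [("NN", "VB", "TO"), ("NN", "VB", "DT")]
print(tbl_learn(initial, gold, rules))   # learns [('NN', 'VB', 'TO')]
```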

TBL: Rule Learning (cont.)
- Problem: we could apply transformations ad infinitum!
- Constrain the set of transformations with "templates": replace tag X with tag Y, provided tag Z or word Z' appears in some position
- Rules are learned in an ordered sequence
- Rules may interact
- Rules are compact and can be inspected by humans

Templates for TBL

TBL: Problems
- The first 100 rules achieve 96.8% accuracy; the first 200 rules achieve 97.0%
- Execution speed: the TBL tagger is slower than the HMM approach
- Learning speed: Brill's implementation takes over a day (600k tokens)
- BUT: (1) it learns a small number of simple, non-stochastic rules; (2) it can be made to work faster with FSTs (finite-state transducers); (3) it is the best performing algorithm on unknown words

Tagging Unknown Words
- New words are added to (newspaper) language at 20+ per month, plus many proper names …
- Unknown words increase error rates by 1-2%
- Method 1: assume they are nouns
- Method 2: assume unknown words have a probability distribution similar to words occurring only once in the training set
- Method 3: use morphological information, e.g., words ending in -ed tend to be tagged VBN
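A minimal sketch of Method 3, guessing an unknown word's tag from its suffix; the training list and the choice of suffix lengths are invented for illustration, and the fallback to NN corresponds to Method 1.

```python
from collections import Counter, defaultdict

# Estimate P(tag | suffix) from training words (toy data, invented).
train = [("walked", "VBN"), ("jumped", "VBN"), ("painted", "VBN"),
         ("running", "VBG"), ("eating", "VBG"),
         ("quickly", "RB"), ("happily", "RB"),
         ("dog", "NN"), ("table", "NN")]

suffix_tags = defaultdict(Counter)
for word, tag in train:
    for n in (3, 2, 1):                 # collect a few suffix lengths
        if len(word) > n:
            suffix_tags[word[-n:]][tag] += 1

def guess_tag(word, default="NN"):
    """Back off from longer to shorter suffixes; default to NN for no match."""
    for n in (3, 2, 1):
        counts = suffix_tags.get(word[-n:])
        if counts:
            return counts.most_common(1)[0][0]
    return default

print(guess_tag("galloped"))   # VBN (matches the -ped suffix from "jumped")
print(guess_tag("zorp"))       # NN, the default
```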

Using Morphological Information

Evaluation
- The result is compared with a manually coded "Gold Standard"
- Accuracy typically reaches 96-97%
- This may be compared with the result for a baseline tagger (one that uses no context)
- Important: 100% is impossible even for human annotators
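A minimal sketch of this evaluation step: per-token accuracy of a tagger's output against a manually tagged gold standard (the tiny sentences below are invented).

```python
def accuracy(predicted, gold):
    """Per-token tagging accuracy against a gold-standard tagging."""
    assert len(predicted) == len(gold)
    correct = sum(1 for (_, p), (_, g) in zip(predicted, gold) if p == g)
    return correct / len(gold)

gold = [("the", "DT"), ("race", "NN"), ("is", "VBZ"), ("on", "IN")]
pred = [("the", "DT"), ("race", "VB"), ("is", "VBZ"), ("on", "IN")]
print(f"{accuracy(pred, gold):.2%}")   # 75.00%
```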