Albert Gatt Corpora and Statistical Methods Lecture 9.

Part 2: POS tagging overview; HMM taggers; TBL tagging

The task
Assign each word in continuous text a tag indicating its part of speech. Essentially a classification problem.
Current state of the art: taggers typically achieve 96-97% accuracy, evaluated on a per-word basis.
In a corpus whose sentences average 20 words, 96% per-word accuracy can still mean one tagging error per sentence.

Sources of difficulty in POS tagging
Mostly due to ambiguity: many words have more than one possible tag.
We need context to make a good guess about the POS, but context alone won't suffice.
A simple approach that assigns each word only its most common tag already performs at about 90% accuracy!

The information sources
1. Syntagmatic information: the tags of the other words in the context of w.
Not sufficient on its own: e.g. Greene and Rubin (1971) describe a context-only tagger with only 77% accuracy.
2. Lexical information ("dictionary"): the most common tag(s) for a given word.
E.g. in English, many nouns can be used as verbs (flour the pan, wax the car…); however, their most likely tag remains NN.
The distribution of a word's usages across different POSs is uneven: usually one tag is highly likely and the others much less so.

Tagging in languages other than English
In English, heavy reliance on context is a good idea because of the relatively fixed word order.
Free word order languages make this assumption harder to exploit.
Compensation: these languages typically have rich morphology, which is a good source of clues for a tagger.

Evaluation and error analysis
Training a statistical POS tagger requires splitting the corpus into training and test data.
Often we also need a development set to tune parameters.
Using (n-fold) cross-validation is a good idea to save data (a minimal sketch follows below):
randomly divide the data into train + test
train, then evaluate on the test portion
repeat n times and take the average
NB: cross-validation requires the whole corpus to be treated as blind data. If we want to examine the training data, it is best to have fixed training and test sets, perform cross-validation on the training data only, and run the final evaluation on the test set.
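
A minimal sketch of n-fold cross-validation for a tagger. The names here (train_tagger, a tagger object with a .tag() method) are illustrative assumptions, not anything defined in the lecture.

```python
import random

def cross_validate(tagged_sents, train_tagger, n_folds=10, seed=0):
    """Estimate per-word tagging accuracy by n-fold cross-validation.

    tagged_sents: list of sentences, each a list of (word, tag) pairs.
    train_tagger: function mapping a list of tagged sentences to a tagger
                  object with a .tag(words) -> list-of-tags method.
    """
    data = list(tagged_sents)
    random.Random(seed).shuffle(data)          # randomly divide the data
    fold_size = len(data) // n_folds
    accuracies = []
    for i in range(n_folds):
        test = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        tagger = train_tagger(train)           # train on the remaining folds
        correct = total = 0
        for sent in test:                      # evaluate on the held-out fold
            words = [w for w, _ in sent]
            gold = [t for _, t in sent]
            predicted = tagger.tag(words)
            correct += sum(p == g for p, g in zip(predicted, gold))
            total += len(gold)
        accuracies.append(correct / total)
    return sum(accuracies) / n_folds           # average over the n folds
```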

Evaluation
Typically carried out against a gold standard, based on accuracy (% correct).
It is ideal to compare the accuracy of our tagger with:
a baseline (lower bound): the standard choice is the unigram most likely tag for each word
a ceiling (upper bound): e.g. see how well humans do at the same task
Humans apparently agree on only 96-97% of tags, which makes it highly suspect for a tagger to claim 100% accuracy.

HMM taggers

Using Markov models
Basic idea: sequences of tags form a Markov chain.
Limited horizon assumption: it is sufficient to look at the previous tag for information about the current tag.
Time invariance: the transition probabilities do not change over time, i.e. they are the same at every position in the sequence.
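
Written out in standard Markov-chain notation (the slide's own equations are not preserved in the transcript), with t_i the tag at position i:

\[
P(t_{i+1} \mid t_1, \ldots, t_i) = P(t_{i+1} \mid t_i) \quad \text{(limited horizon)}
\]
\[
P(t_{i+1} = t^j \mid t_i = t^k) = P(t_2 = t^j \mid t_1 = t^k) \quad \text{for all } i \quad \text{(time invariance)}
\]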

Implications/limitations
The limited horizon ignores long-distance dependencies, e.g. it cannot deal with WH-constructions.
Chomsky (1957): this was one of the reasons cited against probabilistic approaches.
Time invariance: e.g. P(finite verb | pronoun) is constant, but we may be more likely to find a finite verb following a pronoun at the start of a sentence than in the middle!

Notation
We let t_i range over tags and w_i range over words; subscripts denote position in a sequence.
We use superscripts to denote types: w^j is an instance of word type j in the lexicon, and t^j is the tag assigned to word type w^j.
The limited horizon property then becomes: P(t_{i+1} | t_1, …, t_i) = P(t_{i+1} | t_i).

Basic strategy
From a training set of manually tagged text, extract the probabilities of tag sequences: e.g. using the Brown Corpus, P(NN | JJ) = 0.45, whereas P(VBP | JJ) is far lower.
Next step: estimate the word/tag probabilities; these are basically the symbol emission probabilities.
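
Both kinds of probabilities are estimated from corpus counts by maximum likelihood. The formulas below are the standard ones (they are not shown on the slide itself):

\[
P(t^j \mid t^k) = \frac{C(t^k, t^j)}{C(t^k)}
\qquad
P(w^l \mid t^j) = \frac{C(w^l : t^j)}{C(t^j)}
\]

where C(t^k, t^j) is the number of times tag t^j follows tag t^k in the training data, and C(w^l : t^j) is the number of times word w^l is tagged as t^j.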

Training the tagger: basic algorithm
1. Estimate the probability of all possible sequences of two tags in the tagset from the training data.
2. For each tag t^j and for each word w^l, estimate P(w^l | t^j).
3. Apply smoothing.
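
A minimal sketch of these three steps in Python (illustrative, not the lecture's own code). It assumes the training data is a list of sentences of (word, tag) pairs, and uses add-one smoothing purely as one example of step 3:

```python
from collections import defaultdict

def train_hmm(tagged_sents, smoothing=1.0):
    """Estimate tag-transition and word-emission probabilities by counting.

    tagged_sents: list of sentences, each a list of (word, tag) pairs.
    Returns (trans, emit): trans[t_prev][t] = P(t | t_prev),
                           emit[t][w] = P(w | t), with add-one smoothing.
    """
    trans_counts = defaultdict(lambda: defaultdict(float))
    emit_counts = defaultdict(lambda: defaultdict(float))
    tags, vocab = set(), set()
    # Steps 1 and 2: collect bigram tag counts and word/tag counts.
    for sent in tagged_sents:
        prev = "<s>"                       # sentence-initial pseudo-tag
        for word, tag in sent:
            trans_counts[prev][tag] += 1
            emit_counts[tag][word] += 1
            tags.add(tag)
            vocab.add(word)
            prev = tag
    # Step 3: add-one (Laplace) smoothing over the tagset / vocabulary.
    trans, emit = {}, {}
    for prev in set(trans_counts) | tags:
        total = sum(trans_counts[prev].values()) + smoothing * len(tags)
        trans[prev] = {t: (trans_counts[prev][t] + smoothing) / total for t in tags}
    for tag in tags:
        total = sum(emit_counts[tag].values()) + smoothing * len(vocab)
        emit[tag] = {w: (emit_counts[tag][w] + smoothing) / total for w in vocab}
    return trans, emit
```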

Finding the best tag sequence
Given: a sentence of n words.
Find: t_{1,n}, the best sequence of n tags.
This is an application of Bayes' rule; the denominator can be eliminated, as it is the same for all tag sequences.
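
The equation itself is missing from the transcript; in the standard formulation it reads:

\[
\hat{t}_{1,n}
= \arg\max_{t_{1,n}} P(t_{1,n} \mid w_{1,n})
= \arg\max_{t_{1,n}} \frac{P(w_{1,n} \mid t_{1,n})\, P(t_{1,n})}{P(w_{1,n})}
= \arg\max_{t_{1,n}} P(w_{1,n} \mid t_{1,n})\, P(t_{1,n})
\]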

Finding the best tag sequence
The expression needs to be reduced to parameters that can be estimated from the training corpus, so we make some simplifying assumptions:
1. words are independent of each other
2. a word's identity depends only on its tag

The independence assumption
The probability of a sequence of words given a sequence of tags is computed as a product over the words, each treated independently.
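
In symbols (reconstructing the missing equation in the standard way):

\[
P(w_{1,n} \mid t_{1,n}) = \prod_{i=1}^{n} P(w_i \mid t_{1,n})
\]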

The identity assumption
The probability of a word given the whole tag sequence equals the probability of the word given its own tag.
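
In symbols (again reconstructing the missing equation):

\[
P(w_i \mid t_{1,n}) = P(w_i \mid t_i)
\]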

Applying these assumptions
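
Combining the two assumptions with the Bayes formulation above gives the usual HMM tagging objective (the slide's equation is not preserved in the transcript):

\[
\hat{t}_{1,n} = \arg\max_{t_{1,n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
\]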

Tagging with the Markov model
We can use the Viterbi algorithm to find the best sequence of tags given a sequence of words (a sentence).
Reminder: we keep track of (i) the probability of being in state (tag) j at word i on the best path, and (ii) the most probable state (tag) at word i given that we are in state j at word i+1.
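
In standard Viterbi notation (assuming this is what the slide's missing equations expressed):

\[
\delta_i(j) = \max_{t_1, \ldots, t_{i-1}} P(t_1, \ldots, t_{i-1},\ t_i = t^j,\ w_1, \ldots, w_i)
\]

and \(\psi_{i+1}(j)\) records the tag at position i on that best path, i.e. the argmax in the induction step below.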

The algorithm: initialisation
Assume that P(PERIOD) = 1 at the sentence boundary (the end of the previous sentence, which is our starting point).
Set the probabilities of all other tags to 0.

Algorithm: induction step
for i = 1 to n, step 1:
for all tags t^j do:
compute the probability of tag t^j at i+1 on the best path through i
record the most probable tag leading to t^j at i+1
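
The two update equations (not preserved in the transcript; this is the standard Viterbi recurrence for tagging, where T is the number of tags):

\[
\delta_{i+1}(t^j) = \max_{1 \le k \le T} \big[\, \delta_i(t^k)\, P(w_{i+1} \mid t^j)\, P(t^j \mid t^k) \,\big]
\]
\[
\psi_{i+1}(t^j) = \arg\max_{1 \le k \le T} \big[\, \delta_i(t^k)\, P(w_{i+1} \mid t^j)\, P(t^j \mid t^k) \,\big]
\]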

Algorithm: backtrace
Start from the most probable state at n+1.
for j = n down to 1 do: retrieve the most probable tag at each point in the sequence.
Finally, calculate the probability of the sequence of tags selected.
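
Putting initialisation, induction and backtrace together: a compact Viterbi sketch in Python, reusing the (hypothetical) trans and emit tables from the training sketch above. This is an illustration under those assumptions, not the lecture's own code; unknown words simply get a tiny constant emission probability as a placeholder.

```python
def viterbi_tag(words, tags, trans, emit, start="<s>"):
    """Find the most probable tag sequence for a sentence.

    trans[t_prev][t] = P(t | t_prev), emit[t][w] = P(w | t),
    as produced by train_hmm above.
    """
    unk = 1e-10
    # Initialisation: score each tag for the first word, starting
    # from the sentence-boundary pseudo-tag.
    delta = [{t: trans[start].get(t, unk) * emit[t].get(words[0], unk)
              for t in tags}]
    psi = [{}]
    # Induction: extend the best path one word at a time.
    for i in range(1, len(words)):
        delta.append({})
        psi.append({})
        for t in tags:
            best_prev, best_score = None, 0.0
            for prev in tags:
                score = delta[i - 1][prev] * trans[prev].get(t, unk)
                if score > best_score:
                    best_prev, best_score = prev, score
            delta[i][t] = best_score * emit[t].get(words[i], unk)
            psi[i][t] = best_prev
    # Backtrace: pick the best final tag, then follow psi backwards.
    last = max(tags, key=lambda t: delta[-1][t])
    sequence = [last]
    for i in range(len(words) - 1, 0, -1):
        sequence.append(psi[i][sequence[-1]])
    return list(reversed(sequence))
```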

Some observations
The model is a hidden Markov model: when we tag, we only observe the words.
During training, however, we actually have a visible Markov model, because the training corpus provides both words and tags.

"True" HMM taggers
Applied in cases where we do not have a large tagged training corpus; we maintain the usual Markov-model assumptions.
Initialisation: use a dictionary, setting the emission probability of a word/tag pair to 0 if the pairing is not in the dictionary.
Training: apply the model to the data and re-estimate the parameters with the forward-backward algorithm.
Tagging: exactly as before.