Ling 570 Day 6: HMM POS Taggers
Overview
– Open Questions
– HMM POS Tagging Review
– Viterbi Algorithm
– Training and Smoothing
– HMM Implementation Details
HMM POS TAGGING
HMM Tagger
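A compact statement of the bigram HMM tagging objective that the following slides work through, in its standard formulation:

```latex
\hat{t}_{1}^{n}
  = \operatorname*{argmax}_{t_{1}^{n}} P(t_{1}^{n} \mid w_{1}^{n})
  \approx \operatorname*{argmax}_{t_{1}^{n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```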
The good HMM Tagger
From the Brown/Switchboard corpus:
– P(VB|TO) = .34
– P(NN|TO) = .021
– P(race|VB) = .00003
– P(race|NN) = .00041
a. P(VB|TO) x P(race|VB) = .34 x .00003 = .00001
b. P(NN|TO) x P(race|NN) = .021 x .00041 = .0000086
So (a) wins: TO followed by VB is more probable in this context ('race' itself has little effect here; the transition probability dominates).
HMM Philosophy
Imagine: the author, when creating this sentence, also had in mind the parts of speech of each of these words. After the fact, we're now trying to recover those parts of speech. They're the hidden part of the Markov model.
What happens when we do it the wrong way?
Invert word and tag, using P(t|w) instead of P(w|t):
1. P(VB|race) = .02
2. P(NN|race) = .98
Probability 2 would drown out virtually any other probability: we'd always tag 'race' with NN!
N-gram POS tagging
JJ         JJ      NNS     VB      RB
colorless  green   ideas   sleep   furiously
– Predict the current tag conditioned on the prior n-1 tags
– Predict the current word conditioned on the current tag
HMM bigram tagger
JJ         JJ      NNS     VB      RB
colorless  green   ideas   sleep   furiously
HMM trigram tagger
JJ         JJ      NNS     VB      RB
colorless  green   ideas   sleep   furiously
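A minimal sketch of how the bigram tagger scores one candidate tag sequence for this sentence, adding one transition log probability and one emission log probability per word (the log-probability values below are hypothetical placeholders, not estimates from a real corpus):

```python
# Hypothetical (made-up) log-probability tables, for illustration only.
log_trans = {("<s>", "JJ"): -1.0, ("JJ", "JJ"): -1.5, ("JJ", "NNS"): -1.2,
             ("NNS", "VB"): -2.0, ("VB", "RB"): -1.8}
log_emit = {("JJ", "colorless"): -8.0, ("JJ", "green"): -6.5,
            ("NNS", "ideas"): -7.0, ("VB", "sleep"): -6.0,
            ("RB", "furiously"): -7.5}

def bigram_score(words, tags):
    """Bigram HMM score: sum of log P(tag | prev tag) + log P(word | tag)."""
    total, prev = 0.0, "<s>"          # "<s>" marks the start of the sentence
    for w, t in zip(words, tags):
        total += log_trans[(prev, t)] + log_emit[(t, w)]
        prev = t
    return total

print(bigram_score("colorless green ideas sleep furiously".split(),
                   ["JJ", "JJ", "NNS", "VB", "RB"]))
```

A trigram tagger differs only in the transition term: each tag is conditioned on the previous two tags rather than just one.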
Training
An HMM needs to be trained on the following:
1. The initial state probabilities
2. The state transition probabilities
 – The tag-tag matrix
3. The emission probabilities
 – The tag-word matrix
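As a sketch, all three distributions can be estimated from a tag-annotated corpus by counting and normalizing (maximum likelihood); the corpus format and variable names here are assumptions:

```python
from collections import Counter, defaultdict

def train_hmm(tagged_sents):
    """tagged_sents: a list of sentences, each a list of (word, tag) pairs."""
    init, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sents:
        prev = None
        for i, (word, tag) in enumerate(sent):
            emit[tag][word] += 1            # tag-word counts
            if i == 0:
                init[tag] += 1              # initial state counts
            else:
                trans[prev][tag] += 1       # tag-tag counts
            prev = tag
    # Normalize counts into (unsmoothed) maximum-likelihood probabilities.
    n_sents = sum(init.values())
    pi = {t: c / n_sents for t, c in init.items()}
    a = {t1: {t2: c / sum(row.values()) for t2, c in row.items()}
         for t1, row in trans.items()}
    b = {t: {w: c / sum(row.values()) for w, c in row.items()}
         for t, row in emit.items()}
    return pi, a, b
```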
Implementation
– Transition distribution
– Emission distribution
REVIEW: VITERBI ALGORITHM
Consider two examples
Mariners hit a home run
Mariners hit made the news
Candidate tags for each word are drawn from {N, V, DT}; 'hit' in particular is ambiguous between N and V, and the two sentences resolve it differently.
Parameters
As probabilities, the parameters get very small: the transition table over {N, V, DT} and the emission table over the words a, hit, home, made, Mariners, news, run, the contain values on the order of 10^-5 and smaller.
As log probabilities, they won't underflow, and we can just add them instead of multiplying.
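A tiny illustration of why log probabilities are preferred, using hypothetical values of the same magnitude as the tables above:

```python
import math

# Hypothetical small probabilities, like the entries in the tag/word tables.
probs = [3e-5, 2e-4, 1e-6, 5e-3]

product = 1.0
for p in probs:
    product *= p                    # shrinks toward 0; long sentences underflow

log_sum = sum(math.log10(p) for p in probs)   # safe: just add the log probs

print(product)                      # 3e-17
print(log_sum)                      # about -16.5
```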
Worked example: 'Mariners hit a home run', candidate states N, V, DT per word, scored with the log-probability tables.
Worked example: 'Mariners hit made the news', candidate states N, V, DT per word, scored with the log-probability tables.
Viterbi
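A restatement of the standard Viterbi recurrence in the π/a/b notation used in the implementation slides later on:

```latex
v_1(j) = \pi_j \, b_j(o_1)
\qquad
v_t(j) = \max_{i} \; v_{t-1}(i)\, a_{ij}\, b_j(o_t)
\qquad
\hat{q}_T = \operatorname*{argmax}_{j} v_T(j),\ \text{then follow backpointers}
```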
Pseudocode
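A minimal Viterbi sketch in Python, using log probabilities and the pi/a/b naming that appears in the implementation slides below; the dictionary-of-dictionaries representation and variable names are assumptions, not necessarily what the pseudocode slide shows:

```python
import math

def viterbi(obs, states, pi, a, b):
    """obs: list of symbols; pi[s], a[s1][s2], b[s][o] hold probabilities."""
    NEG_INF = float("-inf")

    def lg(p):                       # log of a probability, treating 0 as -inf
        return math.log(p) if p > 0 else NEG_INF

    # delta[t][s]: best log score of any tag path ending in state s at time t
    delta = [{s: lg(pi.get(s, 0.0)) + lg(b[s].get(obs[0], 0.0)) for s in states}]
    back = [{}]                      # back[t][s]: best predecessor of s at time t
    for t in range(1, len(obs)):
        delta.append({})
        back.append({})
        for s in states:
            best_prev, best_score = None, NEG_INF
            for prev in states:
                score = delta[t - 1][prev] + lg(a.get(prev, {}).get(s, 0.0))
                if best_prev is None or score > best_score:
                    best_prev, best_score = prev, score
            delta[t][s] = best_score + lg(b[s].get(obs[t], 0.0))
            back[t][s] = best_prev

    # Pick the best final state and follow backpointers to recover the path.
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), delta[-1][last]
```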
SMOOTHING
Training
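The maximum-likelihood estimates in question are the standard count ratios, stated here for reference:

```latex
\hat{P}(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}
\qquad
\hat{P}(w_i \mid t_i) = \frac{C(t_i, w_i)}{C(t_i)}
```

Both ratios are zero whenever the corresponding pair was never observed in training, which is what the next slides address.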
Why Smoothing?
Zero counts:
– Handle missing tag sequences: smooth the transition probabilities
– Handle unseen words: smooth the observation probabilities
– Handle unseen (word, tag) pairs where both the word and the tag are known
Smoothing Tag Sequences
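One common way to smooth the transition distribution is add-λ smoothing over the tag-tag counts; this is a sketch of that general idea, not necessarily the specific scheme on the original slides:

```python
def smoothed_transitions(trans_counts, tagset, lam=1.0):
    """Add-lambda smoothing: P(t2 | t1) = (C(t1, t2) + lam) / (C(t1) + lam * |T|)."""
    a = {}
    for t1 in tagset:
        row = trans_counts.get(t1, {})
        total = sum(row.values())
        a[t1] = {t2: (row.get(t2, 0) + lam) / (total + lam * len(tagset))
                 for t2 in tagset}
    return a
```

With lam = 1 this is add-one (Laplace) smoothing; unseen tag bigrams get a small non-zero probability instead of zero.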
Smoothing Emission Probabilities
Preprocessing the training corpus:
– Count occurrences of all words
– Replace singleton words with a magic unknown-word token (e.g. <UNK>)
– Gather counts on the modified data, estimate parameters
Preprocessing the test set:
– For each test-set word: if it was seen at least twice in the training set, leave it alone; otherwise replace it with the unknown-word token
– Run Viterbi on this modified input
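A sketch of that preprocessing, assuming <UNK> as the name of the magic token:

```python
from collections import Counter

UNK = "<UNK>"   # assumed spelling of the magic unknown-word token

def replace_rare(train_sents):
    """Replace words seen only once in training with UNK; return vocab and data."""
    counts = Counter(w for sent in train_sents for (w, _t) in sent)
    vocab = {w for w, c in counts.items() if c >= 2}
    train = [[(w if w in vocab else UNK, t) for (w, t) in sent]
             for sent in train_sents]
    return vocab, train

def map_test_words(words, vocab):
    """Map unseen test words to UNK before running Viterbi."""
    return [w if w in vocab else UNK for w in words]
```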
Unknown Words
Is there other information we could use for P(w|t)?
– Information in the words themselves?
Morphology:
– -able: JJ
– -tion: NN
– -ly: RB
– Case: John is NP, etc.
Augment the models:
– Add to the 'context' of tags
– Include as features in classifier models
– We'll come back to this idea!
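As a rough sketch of how such surface cues could supply a fallback tag for an unknown word, with hypothetical suffix preferences mirroring the bullets above (a real tagger would estimate P(tag | suffix) from training data rather than hard-code it):

```python
# Hypothetical suffix-to-tag preferences, mirroring the slide's examples.
SUFFIX_TAGS = {"able": "JJ", "tion": "NN", "ly": "RB"}

def guess_tag(word):
    """Very rough fallback tag for an unknown word, based on surface cues only."""
    for suffix, tag in SUFFIX_TAGS.items():
        if word.lower().endswith(suffix):
            return tag
    if word[:1].isupper():
        return "NP"        # capitalized unknown words are often proper nouns
    return "NN"            # default guess for everything else
```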
HMM IMPLEMENTATION
HMM Implementation: Storing an HMM
Approach #1: hash table (direct):
– π_i: pi{state_str}
– a_ij: a{from_state_str}{to_state_str}
– b_i(o_t): b{state_str}{symbol}
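In Python terms, Approach #1 amounts to dictionaries keyed directly by state and symbol strings; a minimal sketch (the concrete key layout, using tuples for the two-key tables, is an assumption):

```python
# Direct hash-table storage: keys are the state / symbol strings themselves.
pi = {"DT": 0.4, "NN": 0.3, "VB": 0.3}                 # pi{state_str}
a  = {("DT", "NN"): 0.7, ("NN", "VB"): 0.5}            # a{from_state_str}{to_state_str}
b  = {("NN", "dog"): 0.01, ("VB", "runs"): 0.005}      # b{state_str}{symbol}

# Lookup during decoding: a missing key means zero probability.
p = a.get(("DT", "NN"), 0.0) * b.get(("NN", "dog"), 0.0)
```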
HMM Implementation: Storing an HMM
Approach #2: hash tables + arrays:
– state2idx{state_str} = state_idx
– symbol2idx{symbol} = symbol_idx
– idx2symbol[symbol_idx] = symbol
– idx2state[state_idx] = state_str
– π_i: pi[state_idx]
– a_ij: a[from_state_idx][to_state_idx]
– b_i(o_t): b[state_idx][symbol_idx]
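A sketch of Approach #2: map strings to integer indices once, then keep the parameters in dense arrays indexed by those integers (the toy tag set and vocabulary are placeholders):

```python
states  = ["DT", "NN", "VB"]
symbols = ["the", "dog", "runs"]

# String <-> index maps, built once.
state2idx  = {s: i for i, s in enumerate(states)}     # state2idx{state_str}
symbol2idx = {w: i for i, w in enumerate(symbols)}    # symbol2idx{symbol}
idx2state, idx2symbol = states, symbols               # index -> string

# Dense parameter arrays indexed by integers.
N, V = len(states), len(symbols)
pi = [0.0] * N                                        # pi[state_idx]
a  = [[0.0] * N for _ in range(N)]                    # a[from_state_idx][to_state_idx]
b  = [[0.0] * V for _ in range(N)]                    # b[state_idx][symbol_idx]

# Example lookup: P(dog | NN)
p = b[state2idx["NN"]][symbol2idx["dog"]]
```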
HMM Matrix Representations
Issue:
– Many matrix entries are 0, especially in b[i][o]
Approach #3: sparse matrix representation:
– a[i] = "j1 p1 j2 p2 …"
– a[j] = "i1 p1 i2 p2 …"
– b[i] = "o1 p1 o2 p2 …"
– b[o] = "i1 p1 i2 p2 …"
Could be:
– Array of hashes
– Array of lists of non-empty values
– The latter is often quite fast, because the lists are short and fit into cache lines
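A sketch of the "array of lists of non-empty values" option: for each row index, store only the (column index, probability) pairs that are actually non-zero, so the decoder iterates over just those entries:

```python
# Sparse transition matrix: row i holds only the non-zero (to_state_idx, prob) pairs.
a_sparse = [
    [(1, 0.7), (2, 0.3)],     # transitions out of state 0
    [(2, 0.5)],               # transitions out of state 1
    [(0, 0.1), (1, 0.9)],     # transitions out of state 2
]

def successors(from_idx):
    """Iterate only over states actually reachable from from_idx."""
    return a_sparse[from_idx]

for to_idx, p in successors(0):
    print(to_idx, p)          # the Viterbi inner loop would update scores here
```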