Text Models Continued HMM and PCFGs

Recap
So far we have discussed 2 different models for text:
– Bag of Words (BOW), where we introduced TF-IDF. The location of words is not captured.
– Collocation (n-grams). The location of words is captured: relations between words in a "sliding window" of size n.
Problem: a small n is not realistic enough; for a large n there are not enough statistics.

Today
Two models that allow us to capture language syntax, with probabilities for words in different parts of speech.
The first is based on a Markov Chain (a probabilistic finite state machine); the second on a Probabilistic Context Free Grammar.

Example Application: Part-of-Speech Tagging
Goal: for every word in a sentence, infer its part of speech (verb, noun, adjective, …).
Perhaps we are interested (more) in documents for which W is often the sentence subject?
Part-of-speech tagging is useful
– for ranking
– for machine translation
– for word-sense disambiguation
– …

Part-of-Speech Tagging
Tag this word. This word is a tag.
He dogs like a flea.
The can is in the fridge.
The sailor dogs me every day.

A Learning Problem
Training set: a tagged corpus
– The most famous is the Brown Corpus, with about 1M words.
– The goal is to learn a model from the training set, and then perform tagging of untagged text.
– Performance is tested on a test set.

Different Learning Approaches
Supervised: the training corpus is tagged by humans.
Unsupervised: the training corpus isn't tagged.
Partly supervised: the training corpus isn't tagged, but we have a dictionary giving possible tags for each word.

Training set and test set
In a learning problem, the common practice is to split the annotated examples into a training set and a test set.
The training set is used to "train" the model, e.g. learn its parameters.
The test set is used to test how well the model is doing.

Simple Algorithm
Assign to each word its most popular tag in the training set (a bag-of-words approach).
Problem: it ignores context. "Dogs" and "tag" will always be tagged as nouns; "can" will always be tagged as a verb.
Still, it achieves around 80% correctness on real-life test sets. Can we do better?
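A minimal sketch of this baseline in Python (the corpus format, the function names, and the 'NN' fallback tag are illustrative assumptions, not something fixed by the slides):

    from collections import Counter, defaultdict

    def train_baseline(tagged_corpus):
        # tagged_corpus: a list of (word, tag) pairs, e.g. read off the Brown corpus
        counts = defaultdict(Counter)
        for word, tag in tagged_corpus:
            counts[word.lower()][tag] += 1
        # keep, for every word, its most popular tag in the training set
        return {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def tag_baseline(words, word2tag, default='NN'):
        # unseen words fall back to a default tag (here: noun), since context is ignored
        return [(w, word2tag.get(w.lower(), default)) for w in words]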

Hidden Markov Model (HMM)
Model: sentences are generated by a probabilistic process.
– In particular, a Markov Chain whose states correspond to parts of speech.
– Transitions between states are probabilistic.
– In each state a word is emitted; the output word is again chosen probabilistically, based on the state.

HMM
An HMM consists of:
– A set of N states
– A set of M symbols (words)
– An N×N matrix of transition probabilities, Ptrans
– A vector of size N of initial state probabilities, Pstart
– An N×M matrix of emission probabilities, Pout
"Hidden" because we see only the outputs, not the sequence of states traversed.
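As a concrete picture of these five ingredients, here is one possible container in Python (a sketch; the field names just mirror the slide, they are not from any particular library):

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class HMM:
        states: list        # the N states (tags)
        symbols: list       # the M symbols (words)
        Ptrans: np.ndarray  # N x N transition probabilities
        Pstart: np.ndarray  # N-vector of initial state probabilities
        Pout: np.ndarray    # N x M emission probabilities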

Example

3 Fundamental Problems
1) Given an observation sequence, find the most likely hidden state sequence. This is tagging.
2) Compute the probability of a given observation sequence (= sentence).
3) Given a training set, find the model that would make the observations most likely.

Tagging
Find the most likely sequence of states that led to an observed output sequence.
– We know what the output (= sentence) is, and want to trace back its generation, i.e. the sequence of states taken in the hidden model.
– Each state is a tag.
How many possible sequences are there?

Viterbi Algorithm (Dynamic Programming)
V_{t,k} is the probability of the most probable state sequence
– generating the first t+1 observations (X_0, …, X_t)
– and terminating at state k.
V_{0,k} = Pstart(k) · Pout(k, X_0)
V_{t,k} = Pout(k, X_t) · max_{k'} { V_{t-1,k'} · Ptrans(k', k) }
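A sketch of this recurrence in Python, assuming the observations are given as symbol indices and Pstart, Ptrans, Pout are NumPy arrays laid out as in the HMM sketch above:

    import numpy as np

    def viterbi(obs, Pstart, Ptrans, Pout):
        # obs: indices of the observed words X_0 ... X_T
        N, T = len(Pstart), len(obs)
        V = np.zeros((T, N))                # V[t, k]: prob of the best sequence ending at state k
        back = np.zeros((T, N), dtype=int)  # argmax pointers for recovering the path
        V[0] = Pstart * Pout[:, obs[0]]
        for t in range(1, T):
            for k in range(N):
                scores = V[t - 1] * Ptrans[:, k]
                back[t, k] = np.argmax(scores)
                V[t, k] = Pout[k, obs[t]] * scores[back[t, k]]
        # follow the argmax pointers backwards to obtain the most likely tag sequence
        path = [int(np.argmax(V[-1]))]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return list(reversed(path)), V[-1].max()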

Finding the path
Note that we are interested in the most likely path, not only in its probability.
So we need to keep track of the argmax at each point, and combine them to form a sequence.
What about top-k?

Complexity
O(T · |S|²), where T is the sequence (= sentence) length and |S| is the number of states (= the number of possible tags).

Forward Algorithm
α_t(k) is the probability of seeing the sequence X_0 … X_t and terminating at state k.

Computing the probabilities
α_0(k) = Pstart(k) · Pout(k, X_0)
α_t(k) = Pout(k, X_t) · Σ_{k'} α_{t-1}(k') · Ptrans(k', k)
P(X_0, …, X_n) = Σ_k α_n(k)
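The same recurrence as a short NumPy sketch (array conventions as in the Viterbi sketch above; the max is simply replaced by a sum):

    import numpy as np

    def forward(obs, Pstart, Ptrans, Pout):
        # alpha[t, k] = P(X_0 ... X_t, state at time t is k)
        N, T = len(Pstart), len(obs)
        alpha = np.zeros((T, N))
        alpha[0] = Pstart * Pout[:, obs[0]]
        for t in range(1, T):
            alpha[t] = Pout[:, obs[t]] * (alpha[t - 1] @ Ptrans)  # sum over previous states k'
        return alpha  # P(sentence) = alpha[-1].sum()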

Learning the HMM probabilities
Expectation-Maximization Algorithm (Baum-Welch)
1. Choose initial probabilities.
2. Compute E_ij, the expected number of transitions from i to j while generating the training sequences, for each i, j (see next).
3. Set the probability of transition from i to j to E_ij / (Σ_k E_ik).
4. Similarly for the emission probabilities.
5. Repeat 2-4 using the new model, until convergence.

Forward-backward
Forward probabilities: α_t(k) is the probability of seeing the sequence X_0 … X_t and terminating at state k.
Backward probabilities: β_t(k) is the probability of seeing the sequence X_{t+1} … X_n given that the Markov process is at state k at time t.

Computing the probabilities
Forward algorithm:
α_0(k) = Pstart(k) · Pout(k, X_0)
α_t(k) = Pout(k, X_t) · Σ_{k'} α_{t-1}(k') · Ptrans(k', k)
P(X_0, …, X_n) = Σ_k α_n(k)
Backward algorithm:
β_t(k) = P(X_{t+1} … X_n | state at time t is k)
β_t(k) = Σ_{k'} Ptrans(k, k') · Pout(k', X_{t+1}) · β_{t+1}(k')
β_n(k) = 1 for all k
P(X_0, …, X_n) = Σ_k β_0(k) · Pstart(k) · Pout(k, X_0)
Also: P(X_0, …, X_n) = Σ_k α_t(k) · β_t(k), for any t
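A matching sketch of the backward pass (same array conventions as before; the last identity on the slide doubles as a sanity check):

    import numpy as np

    def backward(obs, Ptrans, Pout):
        # beta[t, k] = P(X_{t+1} ... X_n | state at time t is k)
        N, T = Ptrans.shape[0], len(obs)
        beta = np.ones((T, N))  # base case: beta at the last position is 1 for every state
        for t in range(T - 2, -1, -1):
            beta[t] = Ptrans @ (Pout[:, obs[t + 1]] * beta[t + 1])
        return beta

    # sanity check: (forward(...)[t] * backward(...)[t]).sum() gives the same
    # sentence probability for every t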

Estimating Parameters

Expected number of transitions
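Steps 2-3 of the Baum-Welch slide, sketched on top of the forward/backward tables from the earlier sketches (the helper name and the argument layout are assumptions):

    import numpy as np

    def expected_transitions(obs, Ptrans, Pout, alpha, beta):
        # E[i, j]: expected number of i -> j transitions while generating obs
        N, T = Ptrans.shape[0], len(obs)
        P = alpha[-1].sum()  # probability of the whole observation sequence
        E = np.zeros((N, N))
        for t in range(T - 1):
            # joint prob of being in state i at time t and state j at time t+1
            E += np.outer(alpha[t], Pout[:, obs[t + 1]] * beta[t + 1]) * Ptrans / P
        return E

    # step 3 of the slide: new_Ptrans[i, j] = E[i, j] / E[i, :].sum()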

Accuracy
Tested experimentally: reaches 96-98% on the Brown corpus
– trained on one half and tested on the other half.
Compare with the 80% achieved by the trivial algorithm.
The hard cases are few, but they are very hard.

NLTK
Natural Language Toolkit: open-source Python modules for NLP tasks
– including stemming, POS tagging and much more.
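For example, tagging one of the sentences from earlier with NLTK's off-the-shelf tagger (the resource names below are the usual ones, but they can differ between NLTK versions):

    import nltk

    # one-time downloads: the tokenizer model and the pretrained POS tagger
    nltk.download('punkt')
    nltk.download('averaged_perceptron_tagger')

    tokens = nltk.word_tokenize("The sailor dogs me every day")
    print(nltk.pos_tag(tokens))  # a list of (word, Penn Treebank tag) pairs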

PCFG
A PCFG is a tuple:
– N is a set of non-terminals
– Σ is a set of terminals
– N_1 is the start symbol
– R is a set of rules
– P is the set of probabilities on rules
We assume the PCFG is in Chomsky Normal Form.
Parsing algorithms:
– Earley (top-down)
– CYK (bottom-up)
– …
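To make the later inside/outside sketches concrete, a CNF grammar can be stored as two dictionaries of rule probabilities; the toy grammar below is made up for illustration (probabilities for each left-hand side sum to 1):

    # binary rules A -> B C and lexical rules A -> w of a toy PCFG in Chomsky Normal Form
    binary = {
        ('S', ('NP', 'VP')): 1.0,
        ('NP', ('DT', 'NN')): 1.0,
        ('VP', ('VBZ', 'NP')): 1.0,
    }
    unary = {
        ('DT', 'the'): 1.0,
        ('NN', 'sailor'): 0.5,
        ('NN', 'dog'): 0.5,
        ('VBZ', 'dogs'): 1.0,
    }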

PFSA vs. PCFG
A PFSA can be seen as a special case of a PCFG:
– state → non-terminal
– output symbol → terminal
– arc → context-free rule
– path → parse tree (only right-branching binary trees)
(Figure: a three-state PFSA S1 → S2 → S3 with outputs a, b, and the equivalent right-branching rules S1 → a S2, S2 → b S3, S3 → ε.)

PFSA and HMM
To turn an HMM into a PFSA:
– add a "Start" state and a transition from "Start" to any state in the HMM;
– add a "Finish" state and a transition from any state in the HMM to "Finish".

The connection between the two algorithms
An HMM can (almost) be converted to a PFSA. A PFSA is a special case of a PCFG. Inside-outside is an algorithm for PCFGs ⇒ the inside-outside algorithm will work for HMMs.
Forward-backward is an algorithm for HMMs. In fact, the inside-outside algorithm is the same as forward-backward when the PCFG is a PFSA.

Forward and backward probabilities
(Figure: an HMM state sequence X_1 … X_t … X_{n+1} emitting outputs O_1 … O_n, split at time t into the part covered by the forward probability and the part covered by the backward probability.)

Backward/forward prob vs. Inside/outside prob
(Figure: the sequence X_1 … X_t = N^i with outputs O_1 … O_n, drawn once for the PFSA and once for the PCFG; the forward probability plays the role of the outside probability, and the backward probability the role of the inside probability.)

Notation
(Figure: a parse tree rooted at the start symbol N^1 over the sentence w_1 … w_m, with a non-terminal N^j dominating the span w_p … w_q; the words w_1 … w_{p-1} and w_{q+1} … w_m lie outside the span.)

Inside and outside probabilities

Definitions
Inside probability: the total probability of generating the words w_p … w_q from the non-terminal N^j.
Outside probability: the total probability of beginning with the start symbol N^1 and generating N^j over the span w_p … w_q and all the words outside w_p … w_q.
When p > q, the span is empty and the probability is defined to be 0.

Calculating inside probability (CYK algorithm)
(Figure: N^j expands by a binary rule N^j → N^r N^s, with N^r covering w_p … w_d and N^s covering w_{d+1} … w_q; summing over rules and split points d gives the inside probability of N^j over w_p … w_q.)
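A sketch of the CYK-style computation over the dictionary grammar introduced above (the chart is indexed by non-terminal and span, as in the slides):

    from collections import defaultdict

    def inside_probs(words, unary, binary, start='S'):
        # unary: {(A, word): prob} rules A -> word;  binary: {(A, (B, C)): prob} rules A -> B C
        n = len(words)
        beta = defaultdict(float)      # beta[(A, p, q)]: inside prob of A over w_p ... w_q
        for p, w in enumerate(words):  # base case: spans of length 1
            for (A, word), prob in unary.items():
                if word == w:
                    beta[(A, p, p)] += prob
        for span in range(2, n + 1):   # longer spans, bottom-up
            for p in range(n - span + 1):
                q = p + span - 1
                for (A, (B, C)), prob in binary.items():
                    for d in range(p, q):  # split point between the two children
                        beta[(A, p, q)] += prob * beta[(B, p, d)] * beta[(C, d + 1, q)]
        return beta                    # P(sentence) = beta[(start, 0, n - 1)]

    # with the toy grammar above:
    # inside_probs("the sailor dogs the dog".split(), unary, binary)[('S', 0, 4)] == 0.25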

Calculating outside probability (case 1)
(Figure: N^j is the left child of a parent N^f via a rule N^f → N^j N^g; the sibling N^g covers w_{q+1} … w_e, inside the full tree rooted at N^1 over w_1 … w_m.)

Calculating outside probability (case 2)
(Figure: N^j is the right child of a parent N^f via a rule N^f → N^g N^j; the sibling N^g covers w_e … w_{p-1}.)

Outside probability
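The two cases from the preceding slides, sketched on top of the inside chart (again a sketch over the toy dictionary grammar, not a reference implementation):

    from collections import defaultdict

    def outside_probs(words, binary, beta, start='S'):
        # beta: the inside chart from the previous sketch; binary: {(A, (B, C)): prob}
        n = len(words)
        alpha = defaultdict(float)
        alpha[(start, 0, n - 1)] = 1.0    # base case: the whole sentence under the start symbol
        for span in range(n - 1, 0, -1):  # top-down, from large spans to small ones
            for p in range(n - span + 1):
                q = p + span - 1
                for (F, (B, C)), prob in binary.items():
                    # case 1: the span is the left child B, the sibling C covers w_{q+1} ... w_e
                    for e in range(q + 1, n):
                        alpha[(B, p, q)] += prob * alpha[(F, p, e)] * beta[(C, q + 1, e)]
                    # case 2: the span is the right child C, the sibling B covers w_e ... w_{p-1}
                    for e in range(p):
                        alpha[(C, p, q)] += prob * alpha[(F, e, q)] * beta[(B, e, p - 1)]
        return alpha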

Probability of a sentence

Recap so far
Inside probability: computed bottom-up.
Outside probability: computed top-down, using the same chart.
The probability of a sentence can be calculated in many ways.

Expected counts and update formulae

The probability of a binary rule is used (1)

The probability of N j is used (2)

The probability of a unary rule is used (3)

Multiple training sentences

Inner loop of the Inside-outside algorithm
Given an input sequence and the current parameters:
1. Calculate the inside probabilities: base case and recursive case.
2. Calculate the outside probabilities: base case and recursive case.

Inside-outside algorithm (cont.)
3. Collect the expected counts.
4. Normalize and update the parameters.
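Putting the pieces together, a rough sketch of the EM loop, reusing inside_probs and outside_probs from the sketches above (all names and the count bookkeeping are illustrative):

    from collections import defaultdict

    def inside_outside(corpus, binary, unary, iterations=10, start='S'):
        # corpus: a list of tokenized sentences; binary/unary: rule-probability dicts (CNF)
        for _ in range(iterations):
            num_b, num_u, den = defaultdict(float), defaultdict(float), defaultdict(float)
            for words in corpus:
                beta = inside_probs(words, unary, binary, start)
                alpha = outside_probs(words, binary, beta, start)
                n, Z = len(words), beta[(start, 0, len(words) - 1)]
                if Z == 0:
                    continue
                # expected counts for binary rules A -> B C
                for (A, (B, C)), prob in binary.items():
                    for p in range(n):
                        for q in range(p + 1, n):
                            for d in range(p, q):
                                c = prob * alpha[(A, p, q)] * beta[(B, p, d)] * beta[(C, d + 1, q)] / Z
                                num_b[(A, (B, C))] += c
                                den[A] += c
                # expected counts for lexical rules A -> w
                for (A, w), prob in unary.items():
                    for p, word in enumerate(words):
                        if word == w:
                            c = prob * alpha[(A, p, p)] / Z
                            num_u[(A, w)] += c
                            den[A] += c
            # M-step: renormalize the rule probabilities for each left-hand side
            binary = {r: num_b[r] / den[r[0]] for r in binary if den[r[0]] > 0}
            unary = {r: num_u[r] / den[r[0]] for r in unary if den[r[0]] > 0}
        return binary, unary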

Relation to EM

PCFG is a PM (Product of Multinomials) Model
The inside-outside algorithm is a special case of the EM algorithm for PM models.
X (observed data): each data point is a sentence w_{1m}.
Y (hidden data): the parse tree Tr.
Θ (parameters): the rule probabilities.

Relation to EM (cont)

Summary
(Figure: on the HMM side, a transition from X_t to X_{t+1} emitting O_t; on the PCFG side, a non-terminal N^j below the root N^1 expanding as N^r N^s, with N^r over w_p … w_d and N^s over w_{d+1} … w_q.)

Summary (cont.)
The topology is known:
– (states, arcs, output symbols) in an HMM
– (non-terminals, rules, terminals) in a PCFG
The probabilities of arcs/rules are unknown.
We estimate the probabilities using EM (introducing hidden data Y).

Converting an HMM to a PCFG
Given an HMM = (S, Σ, π, A, B), create a PCFG = (S1, Σ1, S0, R, P) as follows:
– S1 =
– Σ1 =
– S0 = Start
– R =
– P:

Path → Parse tree
(Figure: an HMM path Start, X_1, X_2, …, X_T, X_{T+1} emitting o_1, o_2, …, o_T, and the corresponding right-branching parse tree built from the states, dummy non-terminals D_0, D_12, …, D_{T,T+1}, and the boundary symbols BOS and EOS.)

Outside probability
(Equation slide: the outside probability for N^j, with q = T and renaming (j,i), (p,t), and for D_ij, with q = p and renaming (p,t).)

Inside probability
(Equation slide: the inside probability for N^j, with q = T and renaming (j,i), (p,t), and for D_ij, with q = p and renaming (p,t).)

Estimating
(Three equation slides, each rewriting an estimation formula under the renaming (j,i), (s,j), (p,t), (m,T).)

Calculating
(Equation slide, under the renaming (j,i), (s,j), (w,o), (m,T).)

Renaming (j,i_j), (s,j),(p,t),(h,t), (m,T),(w,O), (N,D)