CS626: NLP, Speech and the Web

CS626: NLP, Speech and the Web
Pushpak Bhattacharyya, CSE Dept., IIT Bombay
Lecture 10, 11: EM ("how to know when you do not completely know")
16th and 20th August, 2012

The NLP Trinity [figure]: three axes — Problem (Morph Analysis, Part of Speech Tagging, Parsing, Semantics), Language (Marathi, French, Hindi, English), and Algorithm (HMM, MEMM, CRF).

Maximum Likelihood considerations

Problem: given the number of heads obtained out of N trials, what is the probability of obtaining a head? In the case of one coin, let the probability of obtaining a head be P_H. This implies that the probability of obtaining exactly N_H successes (heads) out of N trials (tosses) is the binomial expression sketched below.
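A sketch of the binomial likelihood this slide refers to (the equation itself is not in the transcript; this is its standard form in the notation above):

```latex
P(N_H \text{ heads in } N \text{ tosses}) = \binom{N}{N_H}\, P_H^{\,N_H}\,(1 - P_H)^{\,N - N_H}
```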

Most "likely" value of P_H: to obtain the most likely value of P_H, we take the ln of the above equation, differentiate with respect to P_H, and set the derivative to zero, which yields P_H = N_H / N (see the sketch below).
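A minimal sketch of that differentiation step, using the binomial likelihood above (standard MLE algebra, not the slide's own image):

```latex
\ln L(P_H) = \ln\binom{N}{N_H} + N_H \ln P_H + (N - N_H)\ln(1 - P_H)

\frac{d\,\ln L}{d\,P_H} = \frac{N_H}{P_H} - \frac{N - N_H}{1 - P_H} = 0
\quad\Longrightarrow\quad P_H = \frac{N_H}{N}
```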

Value of P_H in the absence of any information: suppose we know nothing about the properties of a coin; then what can we say about the probability of a head? We have to use the entropy E. Let P_H be the probability of head and P_T the probability of tail, subject to P_H + P_T = 1 ----(1)

Entropy: entropy is defined as the sum, over outcomes, of probability multiplied by the log of probability, with a minus sign: E = -Σ_i p_i log p_i. It is the instrument for dealing with uncertainty. So the best we can do is to maximize the entropy: maximize E subject to eq (1) and obtain the value of P_H.

Finding P_H and P_T: maximize E = -(P_H ln P_H + P_T ln P_T) subject to the constraint (1). Differentiating the Lagrangian with respect to P_H and P_T gives the stationarity conditions (2) and (3); from (2) and (3) we get (4) P_H = P_T, and from (4) and (1) we get P_H = P_T = 1/2 (the steps are spelled out below).
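A sketch of the four numbered steps, written with a Lagrange multiplier; the exact slide equations are not in the transcript, but this is the derivation the step labels describe:

```latex
\text{(1)}\quad P_H + P_T = 1, \qquad \mathcal{L} = -P_H\ln P_H - P_T\ln P_T + \lambda\,(P_H + P_T - 1)

\text{(2)}\quad \frac{\partial\mathcal{L}}{\partial P_H} = -\ln P_H - 1 + \lambda = 0
\qquad
\text{(3)}\quad \frac{\partial\mathcal{L}}{\partial P_T} = -\ln P_T - 1 + \lambda = 0

\text{From (2) and (3):}\quad \text{(4)}\ P_H = P_T
\qquad
\text{From (4) and (1):}\quad P_H = P_T = \tfrac{1}{2}
```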

A deeper look at EM. Problem: two coins are tossed, randomly picking a coin at a time. The number of trials is N, the number of heads is N_H and the number of tails is N_T. How can one estimate the following probabilities: p, the prob. of choosing coin 1; p1, the prob. of head from coin 1; p2, the prob. of head from coin 2?

Expectation Maximization (1 coin toss): toss 1 coin. K = number of heads, N = number of trials. X = observation of tosses = <x1>, <x2>, <x3> … <xn>, where each xi can take the value 0 or 1. p = probability of head = K/N (as per MLE, which maximizes the probability of the observed data).

Expectation Maximization (1 coin toss): Y = <x1, z1>, <x2, z2>, <x3, z3> … <xi, zi> … <xn, zn>, where xi = 1 for head and 0 for tail, and zi is an indicator function: zi = 1 if the observation comes from the coin. In this case there is only one coin, so zi = 1 for every i.

Expectation Maximization (2 coin toss): X = <x1>, <x2>, <x3> … <xi> … <xn>; Y = <x1, z11, z12>, <x2, z21, z22>, <x3, z31, z32> … <xi, zi1, zi2> … <xn, zn1, zn2>. Here xi = 1 for head and 0 for tail; zi1 = 1 if the observation comes from coin 1, else 0; zi2 = 1 if the observation comes from coin 2, else 0. Only one of zi1 and zi2 can be 1. xi is observed, while zi1 and zi2 are unobserved.

Expectation Maximization (2 coin toss). Parameters of the setting: p1 = probability of head for coin 1; p2 = probability of head for coin 2; p = probability of choosing coin 1 for a toss. Express p, p1 and p2 in terms of the observed and unobserved data (see the sketch below).
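One standard way to write these complete-data estimates in the notation above (the slide's own equations are not in the transcript):

```latex
p = \frac{1}{n}\sum_{i=1}^{n} z_{i1},
\qquad
p_1 = \frac{\sum_{i=1}^{n} z_{i1}\,x_i}{\sum_{i=1}^{n} z_{i1}},
\qquad
p_2 = \frac{\sum_{i=1}^{n} z_{i2}\,x_i}{\sum_{i=1}^{n} z_{i2}}
```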

Expectation Maximization trick: replace zi1 and zi2 in p, p1, p2 with E(zi1) and E(zi2). zi1 is the indicator of the event that observation xi came from coin 1, and E(zi1) is the expectation of zi1, i.e. the posterior probability that xi came from coin 1 (a sketch follows).
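A sketch of that expectation under the two-coin model above (assuming Bernoulli emissions; the slide's own equation is not in the transcript):

```latex
E(z_{i1}) = P(\text{coin 1} \mid x_i)
= \frac{p\,p_1^{x_i}(1 - p_1)^{1 - x_i}}
       {p\,p_1^{x_i}(1 - p_1)^{1 - x_i} + (1 - p)\,p_2^{x_i}(1 - p_2)^{1 - x_i}},
\qquad
E(z_{i2}) = 1 - E(z_{i1})
```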

Summary: X = <x1>, <x2>, <x3> … <xi> … <xn>; Y = <x1, z11, z12>, <x2, z21, z22>, <x3, z31, z32> … <xi, zi1, zi2> … <xn, zn1, zn2>. The E-step computes E(zi1) and E(zi2) from the current p, p1, p2; the M-step re-estimates p, p1, p2 from those expectations. The two steps alternate until convergence.
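A runnable sketch of the EM loop summarized above, in Python. The names p, p1, p2 and the 0/1 observations follow the slides; the initial values, tolerance, and example data are illustrative assumptions:

```python
# EM for the two-coin problem: each toss uses coin 1 with prob. p (else coin 2);
# coin 1 shows head with prob. p1, coin 2 with prob. p2. Only outcomes x_i are observed.

def two_coin_em(x, p=0.6, p1=0.7, p2=0.3, iters=100, tol=1e-8):
    """x: list of 0/1 outcomes (1 = head). Returns estimates (p, p1, p2)."""
    for _ in range(iters):
        # E-step: E(z_i1) = posterior probability that toss i came from coin 1
        e1 = []
        for xi in x:
            l1 = p * (p1 if xi == 1 else 1 - p1)        # joint prob. of (coin 1, x_i)
            l2 = (1 - p) * (p2 if xi == 1 else 1 - p2)  # joint prob. of (coin 2, x_i)
            e1.append(l1 / (l1 + l2))
        # M-step: plug E(z_i1) and E(z_i2) = 1 - E(z_i1) into the complete-data formulas
        n1 = sum(e1)
        n2 = len(x) - n1
        new_p = n1 / len(x)
        new_p1 = sum(e * xi for e, xi in zip(e1, x)) / n1
        new_p2 = sum((1 - e) * xi for e, xi in zip(e1, x)) / n2
        delta = max(abs(new_p - p), abs(new_p1 - p1), abs(new_p2 - p2))
        p, p1, p2 = new_p, new_p1, new_p2
        if delta < tol:
            break
    return p, p1, p2

if __name__ == "__main__":
    # Hypothetical observations: a mix of heads-heavy and tails-heavy stretches
    data = [1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
    print(two_coin_em(data))
```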

Observations: any EM problem has observed and unobserved data. Nature of the distribution: the two coins follow two different binomial distributions. EM oscillates between the E-step and the M-step; it is a greedy algorithm, and convergence to a local (not necessarily global) optimum of the likelihood is guaranteed.

EM: the Baum-Welch algorithm: counts. [Figure: a two-state HMM with states q and r, where every transition can emit a or b.] String = abb aaa bbb aaa. The figure aligns the output sequence (o/p seq) with the corresponding sequence of states (state seq) for the input symbols.

Calculating probabilities from the table of counts (T = #states, A = #alphabet symbols). If the transitions are non-deterministic, then multiple state sequences are possible for a given output sequence (refer to the previous slide's figure); our aim is then to find the expected counts. The table has the columns Src, Dest, O/P and Count; for example, the transition from q to r emitting a has count 5, and the remaining rows (with counts such as 3 and 2) are read the same way.
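A small sketch of how probabilities are read off such a table, assuming counts keyed by (source, destination, output); only the q → r on a count of 5 is from the slide, the other entries are placeholders:

```python
from collections import defaultdict

# Hypothetical table of counts keyed by (src, dest, output symbol).
counts = {("q", "r", "a"): 5, ("q", "r", "b"): 3, ("q", "q", "a"): 2}

# P(dest, output | src) = count(src, dest, output) / total count of transitions out of src
totals = defaultdict(float)
for (src, dest, out), c in counts.items():
    totals[src] += c

probs = {(src, dest, out): c / totals[src] for (src, dest, out), c in counts.items()}
print(probs)  # e.g. P(r, 'a' | q) = 5 / 10 = 0.5
```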

Interplay between two equations: the transition probability P(si → sj on symbol wk) is obtained by normalizing the expected count of that transition, while the expected count is itself computed from the current probabilities by summing, over all state sequences consistent with the string, the probability of each sequence times the number of times the transition si → sj on wk occurs in it (a sketch of the two equations follows).
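A reconstruction of the two equations in the slide's notation (the originals are not in the transcript; this is the standard Baum-Welch count/probability interplay), where n(si → sj on wk; S) is the number of times the transition si → sj on wk occurs in the state sequence S:

```latex
P\!\left(s_i \xrightarrow{w_k} s_j\right)
  = \frac{\mathrm{ExpCount}\!\left(s_i \xrightarrow{w_k} s_j\right)}
         {\sum_{w}\sum_{s} \mathrm{ExpCount}\!\left(s_i \xrightarrow{w} s\right)}

\mathrm{ExpCount}\!\left(s_i \xrightarrow{w_k} s_j\right)
  = \sum_{S} P\!\left(S \mid W_{1..n}\right)\;
    n\!\left(s_i \xrightarrow{w_k} s_j;\, S\right)
```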

Illustration. [Figure: two versions of the two-state HMM over q and r — the actual (desired) HMM and an initial guess — with edge labels of the form symbol:probability, e.g. a:0.67 and b:0.17.]

One run of the Baum-Welch algorithm on the string ababb. [Table: each candidate state sequence over q and r is listed with its path probability P(path), with values such as 0.00077, 0.00154, 0.00442, 0.00884, 0.02548, 0.05096 and 0.07644; these are rounded and totalled per transition (e.g. 0.035, 0.01, 0.06, 0.095), and the new probabilities are obtained by normalizing the totals, e.g. 0.01/(0.01 + 0.06 + 0.095).] ε is considered as the starting and ending symbol of the input string. Through multiple iterations the probability values will converge.
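A runnable sketch of the same computation: enumerate every state path for the string, weight each transition count by the normalized path probability, and re-normalize per source state. The start state and the initial probabilities below are illustrative assumptions, not the slide's initial-guess HMM:

```python
from collections import defaultdict
from itertools import product

STATES = ["q", "r"]

def baum_welch_by_enumeration(string, probs, start="q"):
    """probs maps (src, dest, symbol) -> P(dest, symbol | src).
    Returns re-estimated probabilities after one EM iteration
    (brute-force path enumeration, feasible only for tiny strings)."""
    # E-step: expected transition counts, summing over every possible state path.
    path_probs = []
    for path in product(STATES, repeat=len(string)):
        seq = (start,) + path
        prob = 1.0
        for (src, dest), sym in zip(zip(seq, seq[1:]), string):
            prob *= probs.get((src, dest, sym), 0.0)
        path_probs.append((seq, prob))
    total = sum(p for _, p in path_probs)
    if total == 0.0:
        return dict(probs)
    exp_counts = defaultdict(float)
    for seq, prob in path_probs:
        weight = prob / total  # posterior probability of this path
        for (src, dest), sym in zip(zip(seq, seq[1:]), string):
            exp_counts[(src, dest, sym)] += weight
    # M-step: normalize the expected counts per source state.
    out_totals = defaultdict(float)
    for (src, dest, sym), c in exp_counts.items():
        out_totals[src] += c
    return {key: c / out_totals[key[0]] for key, c in exp_counts.items()}

if __name__ == "__main__":
    # Placeholder initial guess: uniform over (destination, symbol) pairs from each state.
    init = {(s, d, sym): 0.25 for s in STATES for d in STATES for sym in "ab"}
    print(baum_welch_by_enumeration("ababb", init))
```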

Quiz 1 Solutions

Quiz: Question 1 (with bigram assumption)

Quiz: Question 1 (with trigram assumption)

Quiz: Question 2
Role of nouns as adjectives: in "Sri Lanka skipper", Sri Lanka is a noun but takes the role of an adjective here.
Inter-annotator agreement: there is often disagreement between human annotators, e.g. over words that can be either a noun or a function word.
Handling of multiwords and named entities: Sri Lanka and Aravinda de Silva are named entities and are not expected to be present in the training corpus; in addition, they are multiwords and thus should be treated as single entities.
talk: "let" takes a verb, hence "talk" should be tagged as V, but since "let" and "talk" are separated, the HMM cannot relate them; a larger k would take care of this but would lead to data sparsity.