1 Homotopy-based Semi-Supervised Hidden Markov Models for Sequence Labeling. Gholamreza Haffari, Anoop Sarkar. Presenter: Milan Tofiloski. Natural Language Lab, Simon Fraser University.

2 Outline: Motivation & Contributions, Experiments, Homotopy method, More experiments.

3 Maximum Likelihood Principle. Find the parameter setting θ of the joint input-output probability that maximizes the probability of the given data, θ* = argmax_θ [ Σ_(x,y)∈L log P_θ(x, y) + Σ_x∈U log P_θ(x) ], where L is the labeled data and U is the unlabeled data.

4 Deficiency of MLE. Usually |U| >> |L|, so the unlabeled term dominates the objective, which means the input-output relationship is largely ignored when estimating the parameters!
– MLE ends up modeling the input distribution P(x)
– But we are interested in modeling the joint distribution P(x, y)

5 Remedy for the Deficiency. Balance the effect of labeled and unlabeled data by weighting the two terms:
θ*_λ = argmax_θ [ (1 − λ) Σ_(x,y)∈L log P_θ(x, y) + λ Σ_x∈U log P_θ(x) ]
Find the λ that maximally takes advantage of both labeled and unlabeled data; λ = 0 is purely supervised training, while plain MLE weights every example equally and so corresponds to a large λ when |U| >> |L|.
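A minimal sketch of how this weighted objective can be evaluated for a toy discrete HMM (the parameters and the function names such as weighted_objective are my own illustrative choices, not the authors' code): the labeled term uses the joint probability P(x, y), the unlabeled term uses the forward algorithm for P(x).

```python
import numpy as np

# Toy discrete HMM: 2 states, 3 output symbols (illustrative numbers only).
pi = np.array([0.6, 0.4])                          # initial state probabilities
A  = np.array([[0.7, 0.3], [0.4, 0.6]])            # transition probabilities
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])  # emission probabilities

def log_joint(x, y):
    """log P(x, y) for an observation sequence x with known states y."""
    lp = np.log(pi[y[0]]) + np.log(B[y[0], x[0]])
    for t in range(1, len(x)):
        lp += np.log(A[y[t - 1], y[t]]) + np.log(B[y[t], x[t]])
    return lp

def log_marginal(x):
    """log P(x) via the scaled forward algorithm."""
    alpha = pi * B[:, x[0]]
    lp = np.log(alpha.sum()); alpha /= alpha.sum()
    for t in range(1, len(x)):
        alpha = (alpha @ A) * B[:, x[t]]
        lp += np.log(alpha.sum()); alpha /= alpha.sum()
    return lp

def weighted_objective(labeled, unlabeled, lam):
    """(1 - lam) * labeled log-likelihood + lam * unlabeled log-likelihood."""
    ll_lab = sum(log_joint(x, y) for x, y in labeled)
    ll_unl = sum(log_marginal(x) for x in unlabeled)
    return (1.0 - lam) * ll_lab + lam * ll_unl

labeled   = [([0, 1, 2], [0, 0, 1])]
unlabeled = [[2, 2, 1], [0, 1]]
print(weighted_objective(labeled, unlabeled, lam=0.1))
```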

6 An experiment with HMM. [Plot: performance vs. λ, lower is better, with the MLE point marked.] MLE can hurt performance; balancing the labeled- and unlabeled-data terms is beneficial.

7 Our Contributions
1. A principled way to choose λ for HMMs in sequence labeling (tagging) tasks
2. An efficient dynamic programming algorithm to compute second-order statistics in an HMM

8 Outline: Motivation & Contributions, Experiments, Homotopy method, More experiments.

9 Task. Field segmentation in information extraction, with 13 tag fields: AUTHOR, TITLE, EDITOR, PUB, DATE, … Example segmentation of a citation (each span labeled with its field): "A. Elmagarmid, editor." → EDITOR; "Transaction Models for Advanced Database Applications," → TITLE; "Morgan Kaufmann," → PUB; "1992." → DATE.

10 Experimental Setup. Use an HMM with 13 states:
– Freeze the transition (state → state) probabilities to what has been observed in the labeled data
– Use the homotopy method to learn only the emission (state → alphabet) probabilities
– Apply additive smoothing to the initial values of the emission and transition probabilities (a sketch of this initialization follows below)
Data statistics:
– Average sequence length: 36.7
– Average number of segments per sequence: 5.4
– Size of labeled/unlabeled data: 300/700
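As a rough illustration of that setup (not the authors' code), the sketch below estimates initial transition and emission tables from labeled sequences with a small additive-smoothing constant; the smoothing value and function name are assumptions.

```python
import numpy as np

def init_from_labeled(labeled, K, V, smooth=1e-3):
    """Count-based HMM initialization with additive smoothing.
    labeled: list of (x, y) pairs, x = observation ids, y = state ids.
    Returns (pi, A, B); A would then be kept frozen during homotopy training."""
    pi = np.full(K, smooth)
    A  = np.full((K, K), smooth)
    B  = np.full((K, V), smooth)
    for x, y in labeled:
        pi[y[0]] += 1.0
        for t in range(1, len(y)):
            A[y[t - 1], y[t]] += 1.0          # observed transition counts
        for xt, yt in zip(x, y):
            B[yt, xt] += 1.0                  # observed emission counts
    pi /= pi.sum()
    A  /= A.sum(axis=1, keepdims=True)
    B  /= B.sum(axis=1, keepdims=True)
    return pi, A, B
```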

11 Baselines
– Held-out: set aside part of the labeled data as a held-out set and use it to choose λ
– Oracle: choose λ based on the test data, using per-position accuracy
– Supervised: ignore the unlabeled data and use only the labeled data (λ = 0)

12 Homotopy vs. Baselines. [Accuracy plot; higher is better; sequence-of-most-probable-states decoding; see the paper for more results.] Even very small values of λ can be useful: homotopy selects λ = 0.004, while supervised corresponds to λ = 0.

13 Outline: Motivation & Contributions, Experiments, Homotopy method, More experiments.

14 Path of Solutions. Look at θ_λ as λ changes from 0 to 1, and choose the best λ based on the path. [Path diagram annotated with a discontinuity and a bifurcation point.]

15 EM for HMM. Let e be a state → state or state → observation event in our HMM. To find the parameter values θ that (locally) maximize the objective function for a fixed λ, repeat until convergence: θ ← EM_λ(θ).
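Here is a minimal sketch, under my own assumptions about the update, of one such EM_λ iteration for a discrete HMM: observed emission counts from the labeled data are mixed with expected counts from the unlabeled data, weighted (1 − λ) and λ, and the transitions stay frozen as in the experimental setup above. This is an illustration, not the paper's implementation.

```python
import numpy as np

def forward_backward(pi, A, B, x):
    """Posterior state probabilities gamma[t, k] for observation sequence x."""
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K)); beta = np.zeros((T, K)); scale = np.zeros(T)
    alpha[0] = pi * B[:, x[0]]; scale[0] = alpha[0].sum(); alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
        scale[t] = alpha[t].sum(); alpha[t] /= scale[t]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, x[t + 1]] * beta[t + 1])) / scale[t + 1]
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

def em_step_emissions(pi, A, B, labeled, unlabeled, lam):
    """One EM_lambda update of the emission matrix B (transitions frozen)."""
    K, V = B.shape
    counts_L = np.zeros((K, V)); counts_U = np.zeros((K, V))
    for x, y in labeled:                       # observed counts from labeled data
        for xt, yt in zip(x, y):
            counts_L[yt, xt] += 1.0
    for x in unlabeled:                        # expected counts from unlabeled data
        gamma = forward_backward(pi, A, B, x)
        for t, xt in enumerate(x):
            counts_U[:, xt] += gamma[t]
    mixed = (1.0 - lam) * counts_L + lam * counts_U + 1e-12
    return mixed / mixed.sum(axis=1, keepdims=True)
```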

16 Fixed Points of EM. Useful fact: at a fixed point, θ_λ = EM_λ(θ_λ), i.e. θ_λ − EM_λ(θ_λ) = 0. This is similar to using homotopy for root finding, so the same numerical techniques should be applicable here.

17 Homotopy for Root Finding. To find a root of G(θ):
– start from a root of a simple problem F(θ)
– trace the roots of the intermediate problems while morphing F into G
To find the θ_λ that satisfy the fixed-point condition: set the derivative (with respect to λ) to zero, which gives a differential equation, then numerically solve the resulting differential equation.
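A hedged reconstruction of that derivation (the standard continuation argument, which may differ in detail from the paper): define the homotopy map as the fixed-point residual and differentiate it along the path.

```latex
H(\theta, \lambda) \;=\; \theta - \mathrm{EM}_{\lambda}(\theta) \;=\; 0
\quad\Longrightarrow\quad
\frac{d}{d\lambda} H(\theta_\lambda, \lambda)
  \;=\; \Big(I - \frac{\partial \mathrm{EM}_{\lambda}}{\partial \theta}\Big)\,
        \frac{d\theta_\lambda}{d\lambda}
        \;-\; \frac{\partial \mathrm{EM}_{\lambda}}{\partial \lambda} \;=\; 0 .
```

Writing M for the augmented Jacobian [ I − ∂EM_λ/∂θ , −∂EM_λ/∂λ ] and v = (dθ_λ/dλ, 1)ᵀ, the condition becomes M · v = 0, which is the system used on the next slide.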

18 Solving the Differential Equation. The path satisfies M · v = 0, where M is the Jacobian of the EM fixed-point map. Repeat until λ reaches 1:
– Update (θ, λ) in a proper direction parallel to v = Kernel(M)
– Update M
A generic sketch of this path-following loop is given below.

19 Jacobian of EM. To build the Jacobian we need the covariance matrix of the events: the entry in row i and column j is the covariance between the counts of events e_i and e_j (see the paper for details). Computing it is challenging for an HMM and requires a forward-backward-style computation.
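A hedged rendering of that entry, written in terms of posterior expectations over the hidden state sequence given an observation sequence x (the exact normalization in the paper may differ):

```latex
\big[\Sigma_\theta(x)\big]_{ij}
  \;=\; \mathrm{Cov}_\theta\!\big(c_i, c_j \mid x\big)
  \;=\; \mathbb{E}_\theta\!\big[c_i\, c_j \mid x\big]
        \;-\; \mathbb{E}_\theta\!\big[c_i \mid x\big]\,
              \mathbb{E}_\theta\!\big[c_j \mid x\big],
```

where c_i is the count of event e_i along the hidden path. The linear terms come from standard forward-backward; the quadratic term E[c_i c_j | x] is the hard part, handled by the dynamic program on the next slide.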

20 Expected Quadratic Counts for HMM. A dynamic programming algorithm efficiently computes the expected quadratic counts. Pre-compute a table Z_x for each sequence x; given Z_x, the EQC can be computed efficiently. The time complexity is given in the paper in terms of K, the number of states in the HMM (see the paper for more details). [Trellis figure: states k1 and k2 at positions i and j, with observations x_i, x_{i+1}, …, x_j in between.]
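One way to see why such a table helps (a reconstruction of the idea, not necessarily the paper's exact recursion): the quadratic count decomposes over pairs of positions, and the "middle" factor between the two positions can be cached.

```latex
\mathbb{E}_\theta\big[c_i\, c_j \mid x\big]
  \;=\; \sum_{t}\sum_{t'} P_\theta\big(e_i \text{ at } t,\; e_j \text{ at } t' \mid x\big),
\qquad
Z_x(k_1, t, k_2, t') \;=\; P_\theta\big(x_{t+1:t'},\, s_{t'} = k_2 \mid s_t = k_1\big),
```

so for t < t' the joint probability factors into a forward term up to position t, the cached middle factor Z_x, and a backward term from position t'; summing over all position pairs gives the expected quadratic count.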

21 How to Choose λ Based on the Path
– monotone: the first point at which the monotonicity of the path changes
– maxEnt: choose the λ for which the model has maximum entropy on the unlabeled data (see the sketch below)
– minEig: when solving the differential equation, consider the minimum singular value of the matrix M; across rounds, choose the λ for which the minimum singular value is smallest
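A small sketch of one plausible reading of the maxEnt rule (the function names and the choice of total posterior entropy are my assumptions, not the paper's definition): among candidate (λ, model) pairs on the solution path, keep the one whose predictions on the unlabeled data are most entropic.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Total entropy of an array of probabilities (e.g., per-position posteriors)."""
    return -np.sum(p * np.log(p + eps))

def pick_lambda_maxent(candidates, unlabeled, posteriors_fn):
    """candidates: list of (lam, model) pairs along the path.
    posteriors_fn(model, x): assumed callable returning state posteriors for x."""
    best_lam, best_h = None, -np.inf
    for lam, model in candidates:
        h = np.mean([entropy(posteriors_fn(model, x)) for x in unlabeled])
        if h > best_h:
            best_lam, best_h = lam, h
    return best_lam
```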

22 Outline: Motivation & Contributions, Experiments, Homotopy method, More experiments.

23 Varying the Size of Unlabeled Data. Size of the labeled data: 100. The three homotopy-based methods outperform EM; maxEnt outperforms minEig and monotone; minEig and monotone perform similarly.

24 Picked λ Values

25 Picked λ Values. EM gives higher weight to unlabeled data than the homotopy-based methods.
– The λ values selected by maxEnt are much smaller than those selected by minEig and monotone
– minEig and monotone are close

26 Conclusion and Future Work. Using EM can hurt performance when |L| << |U|; we proposed a method to alleviate this problem for HMMs on sequence labeling tasks. To speed up the method:
– Use sampling to approximate the covariance matrix
– Use faster methods for recovering the solution path, e.g. predictor-corrector

27 Questions?

28 Is Oracle outperformed by Homotopy? No!
– The performance measure used to select λ in the oracle method may differ from the measure used to compare homotopy and oracle
– The decoding algorithm used in the oracle method may differ from the one used to compare homotopy and oracle

29 Why not simply fix λ in advance? This ad hoc way of setting λ has two drawbacks:
– It may still hurt performance, since the proper λ may be much smaller than the preset value
– In some situations the right choice of λ may be large; a conservative preset value does not fully take advantage of the available unlabeled data

30 Homotopy vs. Baselines
– Viterbi decoding: most-probable-sequence-of-states decoding
– SMS decoding: sequence-of-most-probable-states decoding
[Accuracy plot for our method; higher is better; see the paper for more results.]