Discriminative Training and Machine Learning Approaches
Machine Learning Lab, Dept. of CSIE, NCKU
Chih-Pin Liao


Discriminative Training

Our Concerns
- Feature extraction and HMM modeling should be jointly performed.
- A common objective function should be considered.
- To alleviate model confusion and improve recognition performance, we should estimate the HMM using a discriminative criterion built from statistical theory.
- Model parameters should be calculated rapidly, without applying a descent algorithm.

Minimum Classification Error (MCE)
- MCE is a popular discriminative training algorithm developed for speech recognition and extended to other pattern recognition applications.
- Rather than maximizing the likelihood of the observed data, MCE aims to directly minimize classification errors.
- A gradient descent algorithm is used to estimate the HMM parameters.

MCE Training Procedure
- Procedure for training discriminative models from observations X, built from three quantities (sketched below):
  - Discriminant function
  - Anti-discriminant function
  - Misclassification measure
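The formulas for these three quantities were figures in the original slides; the sketch below uses the standard MCE definitions (following Juang and Chou's formulation), with class models λ_j collected in the set Λ, M classes, and a smoothing constant η — notation assumed rather than taken from the deck:

```latex
% Discriminant function of class j (e.g., the HMM log-likelihood of X)
g_j(X;\Lambda) = \log p(X \mid \lambda_j)

% Anti-discriminant function: a soft maximum over the competing classes
G_j(X;\Lambda) = \log \Big[ \frac{1}{M-1} \sum_{k \ne j} \exp\big(\eta\, g_k(X;\Lambda)\big) \Big]^{1/\eta}

% Misclassification measure: positive when X tends to be misclassified
d_j(X;\Lambda) = -\,g_j(X;\Lambda) + G_j(X;\Lambda)
```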

Expected Loss
- The loss function is obtained by mapping the misclassification measure into the range between zero and one through a sigmoid function.
- The discriminative model is found by minimizing the expected loss, i.e., the expected classification error.
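A minimal sketch of these quantities, with sigmoid slope γ and offset θ (the constants actually used in the slides are not shown and are assumed here):

```latex
% Smoothed 0-1 loss: the misclassification measure passed through a sigmoid
\ell_j(X;\Lambda) = \frac{1}{1 + \exp\big(-\gamma\, d_j(X;\Lambda) + \theta\big)}

% Expected loss (empirical classification risk) minimized over \Lambda
L(\Lambda) = E\big[\ell_j(X;\Lambda)\big]
\approx \frac{1}{N} \sum_{n=1}^{N} \sum_{j=1}^{M} \ell_j(X_n;\Lambda)\, \mathbf{1}\!\left[X_n \in \text{class } j\right]
```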

Hypothesis Test

Likelihood Ratio Test
- A new training criterion is derived from hypothesis testing theory.
- We test a null hypothesis against an alternative hypothesis.
- The optimal test is a likelihood ratio test, according to the Neyman-Pearson lemma.
- A higher likelihood ratio implies stronger confidence in accepting the null hypothesis.
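In symbols, with τ denoting a decision threshold (notation assumed, not taken from the slides):

```latex
% Likelihood ratio test of the null hypothesis H_0 against the alternative H_1
\mathrm{LR}(X) = \frac{p(X \mid H_0)}{p(X \mid H_1)}
\quad\Longrightarrow\quad
\text{accept } H_0 \text{ if } \mathrm{LR}(X) \ge \tau, \text{ otherwise accept } H_1
```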

Hypotheses in HMM Training
- Null hypothesis H0: observations X are generated by the target HMM state j.
- Alternative hypothesis H1: observations X are not generated by the target HMM state j.
- We estimate discriminative HMM parameters for the target state against the non-target states.
- The problem then amounts to verifying the goodness of the alignment of the data to the corresponding HMM states.

Maximum Confidence Hidden Markov Model

Maximum Confidence HMM
- The MCHMM is estimated by maximizing the log likelihood ratio, interpreted as a confidence measure.
- The parameter set consists of the HMM parameters and a feature transformation matrix.
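One plausible form of this objective, consistent with the text (λ denotes the HMM parameters and W the transformation matrix; the slide's exact notation is not preserved):

```latex
\hat{\Lambda} = \arg\max_{\Lambda}\, \log \frac{p(X \mid H_0, \Lambda)}{p(X \mid H_1, \Lambda)},
\qquad \Lambda = \{\lambda, W\}
```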

Hybrid Parameter Estimation
- The expectation-maximization (EM) algorithm is applied to handle the missing-data problem in maximum confidence estimation.
- E-step: compute the expectation (auxiliary) function of the confidence objective.
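The E-step formula was a figure in the original deck; as a rough sketch only, treating the hidden state sequences S as the missing data, the auxiliary function of the confidence objective could take the general EM shape below (this is an assumption about its form, not the slide's equation):

```latex
Q(\Lambda; \Lambda') = \sum_{S} p(S \mid X, \Lambda')\,
\log \frac{p(X, S \mid H_0, \Lambda)}{p(X, S \mid H_1, \Lambda)}
```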

Expectation Function

MC Estimates of HMM Parameters

MC Estimates of HMM Parameters (cont.)

MC Estimate of Transformation Matrix

MC Classification Rule
- Let Y denote an input test image. We apply the same maximum confidence criterion to identify the most likely category for Y.
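A sketch of the rule, reusing the confidence measure with class-dependent parameter sets Λ_c (notation assumed):

```latex
\hat{c} = \arg\max_{c}\, \log \frac{p(Y \mid H_0, \Lambda_c)}{p(Y \mid H_1, \Lambda_c)}
```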

Summary
- A new maximum confidence HMM framework was proposed.
- The hypothesis testing principle was used to build the training criterion.
- Discriminative feature extraction and HMM modeling were performed under the same criterion.
- Reference: Chien, Jen-Tzung; Liao, Chih-Pin, "Maximum Confidence Hidden Markov Modeling for Face Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 4, pp. 606-616, April 2008.

Machine Learning Approaches

Introduction
- Conditional Random Fields (CRFs):
  - relax the usual conditional independence assumption of the likelihood model
  - enforce the homogeneity of the labeling variables conditioned on the observation
- Thanks to the weak assumptions of the CRF model and its discriminative nature, it:
  - allows arbitrary relationships among the data
  - may require fewer resources to train its parameters

- CRF models achieve better performance than the Hidden Markov Model (HMM) and Maximum Entropy Markov Models (MEMMs) on:
  - language and text processing problems
  - object recognition problems
  - image and video segmentation
  - tracking problems in video sequences

Generative & Discriminative Models

Two Classes of Models
- Generative model (HMM): models the joint distribution of states and observations.
- Direct model (MEMM and CRF): models the posterior probability directly.
(Figure: MEMM and CRF model structures)

Comparison of the Two Kinds of Model
- Generative model (HMM):
  - uses the Bayes rule approximation
  - assumes that observations are independent
  - multiple overlapping features are not modeled
  - the best state sequence is estimated through the recursive Viterbi algorithm

- Direct model (MEMM and CRF):
  - models the posterior probability directly
  - dependencies among observations are flexibly modeled
  - the best state sequence is likewise estimated through the recursive Viterbi algorithm

Hidden Markov Model & Maximum Entropy Markov Model

HMM for Human Motion Recognition
- An HMM is defined by:
  - the transition probability
  - the observation probability
(see the sketch below)
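The defining probabilities were shown as equations in the deck; the standard factorization over a state sequence S = (s_1, ..., s_T) and observations X = (x_1, ..., x_T), with notation assumed, is:

```latex
% Transition probability p(s_t | s_{t-1}) and observation probability p(x_t | s_t)
p(X, S) = \prod_{t=1}^{T} p(s_t \mid s_{t-1})\, p(x_t \mid s_t)
```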

Maximum Entropy Markov Model
- An MEMM is defined by a single conditional distribution that replaces the transition and observation probabilities of the HMM model (sketched below).
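A sketch of the standard MEMM form (the slide's own notation was a figure and is not preserved):

```latex
% A single conditional distribution replaces the HMM transition and observation models
p(S \mid X) = \prod_{t=1}^{T} p(s_t \mid s_{t-1}, x_t),
\qquad
p(s \mid s', x) = \frac{1}{Z(x, s')} \exp\Big( \sum_{k} \lambda_k f_k(x, s) \Big)
```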

Maximum Entropy Criterion
- Feature functions are defined over observation-state pairs.
- Training is posed as a constrained optimization problem: maximize the conditional entropy subject to the model expectation of each feature matching its empirical expectation (see below).
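In standard maximum entropy notation, with empirical distribution \tilde{p} (the slide's symbols were figures and are assumed here):

```latex
\max_{p}\; H(p) = -\sum_{x, s} \tilde{p}(x)\, p(s \mid x) \log p(s \mid x)
\quad \text{s.t.} \quad
\underbrace{\sum_{x, s} \tilde{p}(x, s)\, f_k(x, s)}_{\text{empirical expectation}}
= \underbrace{\sum_{x, s} \tilde{p}(x)\, p(s \mid x)\, f_k(x, s)}_{\text{model expectation}}
\quad \forall k
```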

Solution of MEMM
- Lagrange multipliers, which become the model parameters, are used to solve the constrained optimization.
- The solution has the exponential form sketched below.
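Sketch of the resulting exponential-family solution (the standard maximum entropy result; notation assumed):

```latex
% Setting the Lagrangian's derivative to zero yields
p_{\lambda}(s \mid x) = \frac{1}{Z_{\lambda}(x)} \exp\Big( \sum_{k} \lambda_k f_k(x, s) \Big),
\qquad
Z_{\lambda}(x) = \sum_{s} \exp\Big( \sum_{k} \lambda_k f_k(x, s) \Big)
```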

GIS Algorithm
- Optimizes the maximum mutual information (MMI) criterion.
- Step 1: calculate the empirical expectation of each feature.
- Step 2: start from an initial parameter value.
- Step 3: calculate the model expectation of each feature.
- Step 4: update the model parameters.
- Repeat steps 3 and 4 until convergence (a sketch follows).
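A minimal NumPy sketch of these steps for a conditional maximum-entropy classifier; the array layout F[i, y] = f(x_i, y), the function name gis_train, and the numerical smoothing constants are assumptions for illustration, not part of the slides:

```python
import numpy as np

def gis_train(F, labels, n_iter=100):
    """Generalized Iterative Scaling sketch.

    F      : array of shape (N, Y, K), F[i, y] = feature vector f(x_i, y)
    labels : array of shape (N,), observed class of each example
    Returns the weight vector lam of length K.
    """
    N, Y, K = F.shape
    C = F.sum(axis=2).max()                    # GIS constant (max total feature count)
    lam = np.zeros(K)

    # Step 1: empirical expectation E_emp[f_k] from the training data
    emp = F[np.arange(N), labels].sum(axis=0) / N

    # Step 2: start from an initial value (all-zero weights)
    for _ in range(n_iter):
        # Step 3: model expectation E_model[f_k] under the current weights
        scores = F @ lam                       # (N, Y) unnormalized log-probabilities
        scores -= scores.max(axis=1, keepdims=True)
        p = np.exp(scores)
        p /= p.sum(axis=1, keepdims=True)      # p(y | x_i)
        model = np.einsum('ny,nyk->k', p, F) / N

        # Step 4: GIS update (additive in the log domain)
        lam += np.log((emp + 1e-12) / (model + 1e-12)) / C

    return lam
```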

Conditional Random Field

Conditional Random Field
- Definition: let G = (V, E) be a graph such that Y = (Y_v), v in V, so that Y is indexed by the vertices of G. When conditioned on X, the random variables Y_v obey the Markov property with respect to the graph: p(Y_v | X, Y_w, w != v) = p(Y_v | X, Y_w, w ~ v), where w ~ v means w and v are neighbors in G. Then (X, Y) is a conditional random field.

CRF Model Parameters
- The undirected graphical structure can be used to factorize p(y | x) into a normalized product of potential functions.
- Consider the graph as a linear-chain structure.
- Model parameter set: the weights lambda_k; feature function set: {f_k} (see below).
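For the linear-chain case, the factorization takes the standard form (notation assumed):

```latex
p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
\exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\Big( \sum_{t,k} \lambda_k f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)
```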

CRF Parameter Estimation
- We rewrite and maximize the posterior probability of the training label sequences.
- The log posterior probability takes the form sketched below.
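A sketch of the log posterior over training pairs (x^{(i)}, y^{(i)}); any prior or regularization term the slides may have included is omitted:

```latex
\mathcal{L}(\lambda) = \sum_{i} \Big[ \sum_{t,k} \lambda_k f_k\big(y^{(i)}_{t-1}, y^{(i)}_t, \mathbf{x}^{(i)}, t\big)
- \log Z\big(\mathbf{x}^{(i)}\big) \Big]
```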

Parameter Updating by GIS Algorithm
- Differentiate the log posterior probability with respect to each parameter.
- Setting this derivative to zero recovers the constraint of the maximum entropy model.
- The estimation has no closed-form solution, so the GIS algorithm is used (derivative sketched below).
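The derivative described here has the familiar "empirical minus model expectation" form (notation assumed):

```latex
\frac{\partial \mathcal{L}}{\partial \lambda_k}
= \sum_{i} \sum_{t} f_k\big(y^{(i)}_{t-1}, y^{(i)}_t, \mathbf{x}^{(i)}, t\big)
- \sum_{i} \sum_{t} E_{p(\mathbf{y} \mid \mathbf{x}^{(i)})}\big[ f_k(y_{t-1}, y_t, \mathbf{x}^{(i)}, t) \big]
```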

CRF vs. MEMM
Differences:
- Objective function: CRF maximizes the posterior probability under a Gibbs distribution; MEMM maximizes entropy under constraints.
- Normalization term and inference: CRF requires full dynamic programming over the label sequence (N-best available); MEMM keeps only the top one.
Similarities:
- Feature functions over state-observation and state-state pairs
- Parameters are the weights of the feature functions
- Both use a Gibbs (exponential) distribution

Summary and Future Works
- We construct a complex CRF with cycles for better modeling of contextual dependencies; a graphical model algorithm is applied.
- In the future, a variational inference algorithm will be developed to improve the calculation of the conditional probability.
- The posterior probability can then be calculated directly by an approximating approach.
- Reference: Liao, Chih-Pin; Chien, Jen-Tzung, "Graphical modeling of conditional random fields for human motion recognition," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Thanks for your attention and discussion.