1 Minimum Classification Error (MCE) Approach in Pattern Recognition. Wu Chou, Avaya Labs Research, Avaya Inc., USA. Presented by: Fang-Hui Chu

2 Outline (1/2)
Introduction
Optimal Classifier from Bayes Decision Theory
Discriminant Function Approach to Classifier Design
Speech Recognition and Hidden Markov Modeling
–Hidden Markov Modeling of Speech
MCE Classifier Design Using Discriminant Functions
–MCE Classifier Design Strategy
–Optimization Methods
–Other Optimization Methods
–HMM as a Discriminant Function
–Relation Between MCE and MMI
–Discussions and Comments

3 Outline (2/2)
MCE TRAINING BASED ON EMBEDDED STRING MODEL
–String-Model-Based MCE Approach
–Combined String-Model-Based MCE Approach
–Discriminative Language Model Estimation
SUMMARY

4 Introduction
The advent of powerful computing devices and the success of statistical approaches have led to a renewed pursuit of more powerful methods to reduce the recognition error rate.
Although MCE-based discriminative methods are rooted in classical Bayes decision theory, they take a discriminant-function based statistical pattern classification approach instead of converting the classification task into a distribution estimation problem.
For a given family of discriminant functions, optimal classifier/recognizer design involves finding a set of parameters that minimizes the empirical pattern recognition error rate.

5 Introduction
Why do we take this approach to classifier design?
–We lack complete knowledge of the form of the distribution
–Training data are inadequate
How is it done?
–Formulate the self-learning problem as a classification problem that consists of optimally partitioning the observation space into regions X_k for which the expected risk R is minimized
–Then apply the generalized probabilistic descent (GPD) algorithm to achieve this goal

6 Optimal Classifier from Bayes Decision Theory
Classes C_1, C_2, ..., C_M; a random observation x is to be classified.
We do not know with certainty that x belongs to C_i, only the probability with which it should be assigned to C_i; the ground-truth answer is unknown.

7 Optimal Classifier from Bayes Decision Theory
Define a loss function: it can be viewed as a distance between Class i and Class j, i.e., the cost of assigning an observation from Class i to Class j.
Assuming Class i is the correct class, the expectation of the cost incurred by misclassifying x is given in (1).
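A standard way to write this conditional risk (a hedged reconstruction of (1), assuming the usual Bayes decision-theory notation, with ℓ_{ij} the cost of deciding class C_i when C_j is the true class):

```latex
R(C_i \mid x) = \sum_{j=1}^{M} \ell_{ij}\, P(C_j \mid x)
```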

8 Optimal Classifier from Bayes Decision Theory
When we make a decision, we do not know the correct answer, but we can compute the cost incurred by making that decision (2).
How do we make a more nearly correct decision? Even without knowing the correct answer, the smaller the expected cost we pay, the more likely the decision is correct.
[Decision Rule] (3)

9 Optimal Classifier from Bayes Decision Theory
In speech recognition and many other applications, the commonly used loss function is given in (5).
The [Decision Rule] can therefore be rewritten as (6): the MAP decision.
Bayes' risk; posterior probability
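A hedged reconstruction of (5) and (6), assuming the usual zero-one loss, under which the minimum-risk decision reduces to the MAP decision:

```latex
\ell_{ij} =
\begin{cases}
0, & i = j \\
1, & i \neq j
\end{cases}
\quad\Rightarrow\quad
R(C_i \mid x) = 1 - P(C_i \mid x),
\qquad
C(x) = \arg\max_{i} P(C_i \mid x)
```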

10 Optimal Classifier from Bayes Decision Theory
If the posterior probability is known, the problem is solved.
In general, however, the posterior probability has to be estimated from labeled training data with known class identities, which are not easy to obtain.
What started as a classifier design problem thus becomes a distribution estimation problem.
By Bayes' theorem (7), the evidence term that does not depend on the class can be omitted.
We estimate the a posteriori probabilities for any observation to implement the maximum a posteriori decision for minimum Bayes risk.
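Equation (7) is not reproduced in the transcript; the standard Bayes-theorem rewriting it refers to is presumably:

```latex
P(C_i \mid x) = \frac{p(x \mid C_i)\, P(C_i)}{p(x)}
\quad\Rightarrow\quad
\arg\max_{i} P(C_i \mid x) = \arg\max_{i}\, p(x \mid C_i)\, P(C_i)
```

Since the evidence p(x) is common to all classes, it drops out of the maximization, which is what allows it to be omitted.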

11 Optimal Classifier from Bayes Decision Theory
Three issues:
–The distribution form is often limited by the mathematical tractability of the particular distribution functions and is very likely to be inconsistent with the actual distribution
–The estimation method has to be able to produce consistent parameter values when the size of the training set varies
–A training data set of sufficient size is required in order to obtain reliable parameter estimates
But in practice, and for speech and language processing in particular, training data are always sparse.

12 Optimal Classifier from Bayes Decision Theory
Despite the conceptual optimality of Bayes decision theory and its applications to pattern recognition, it cannot always be realized in practice.
Most practical "MAP" decisions in speech and language processing are not true MAP decisions.

13 Discriminant Function Approach to Classifier Design
Consider first the two-class case. Define a discriminant function for classification.
One well-studied family of discriminant functions is the linear discriminant function, which has computational advantages (9).
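A hedged reconstruction of the two-class linear discriminant rule (9), assuming the standard form:

```latex
g(x) = w^{\top} x + w_0,
\qquad
\text{decide } C_1 \text{ if } g(x) > 0, \text{ otherwise } C_2
```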

14 Discriminant Function Approach to Classifier Design
More generally (10) (11)

15 Discriminant Function Approach to Classifier Design
Now consider the M-class case (12).
That is, we want a set of "optimal discriminant functions" (13).
When the loss function is specified
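The M-class decision rule referred to in (12)-(13) is standardly written as follows (a hedged reconstruction, since the slide's equations are not reproduced):

```latex
C(x) = C_i \quad \text{iff} \quad g_i(x; \Lambda) = \max_{j} g_j(x; \Lambda)
```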

16 Discriminant Function Approach to Classifier Design
This is quite different from the distribution-estimation-based approach to pattern classification.

17 Speech Recognition and Hidden Markov Modeling
A decoder performs a maximum a posteriori decision: it finds the best word sequence for the given acoustic features, combining the score from the acoustic model with the score from the language model.
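In the usual notation (the equation itself is not reproduced in the transcript, but this is the standard MAP decoding rule the slide describes), with p(X | W) the acoustic-model score and P(W) the language-model score:

```latex
\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W}\, p(X \mid W)\, P(W)
```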

18 Speech Recognition and Hidden Markov Modeling
Basic components:
Acoustic feature extraction:
–Used to extract features from the waveform
–We use X to represent the acoustic observation feature vector sequence
Acoustic modeling:
–Provides statistical modeling for the acoustic observation X
–The hidden Markov model is the prevalent choice
Language modeling:
–Provides linguistic constraints on the text sequence W
–Based on statistical N-gram language models

19 Speech Recognition and Hidden Markov Modeling
Decoding engine:
–Search for the best word sequence given the features and the models
–This is achieved through Viterbi decoding
Word string, state sequence, discrete observation probability, continuous-density HMMs
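As a sketch of the Viterbi decoding idea, here is a minimal implementation for a discrete-observation HMM (the function name and array layout are illustrative assumptions, not from the original slides; a real recognizer works over continuous-density HMM states composed from the lexicon and language model):

```python
import numpy as np

def viterbi(obs, log_A, log_B, log_pi):
    """Most likely state sequence for a discrete-observation HMM.

    obs    : sequence of observation symbol indices, length T
    log_A  : (N, N) log transition probs, log_A[i, j] = log P(q_t = j | q_{t-1} = i)
    log_B  : (N, K) log emission probs,   log_B[j, k] = log P(o_t = k | q_t = j)
    log_pi : (N,)   log initial state probabilities
    """
    T, N = len(obs), log_A.shape[0]
    delta = np.full((T, N), -np.inf)   # best log score ending in state j at time t
    psi = np.zeros((T, N), dtype=int)  # back-pointers for the best predecessor

    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1] + log_A[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] + log_B[j, obs[t]]

    # Backtrack the best path from the final time step
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(delta[-1]))
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path, float(delta[-1].max())
```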

20 Speech Recognition and Hidden Markov Modeling
Hidden Markov modeling is a powerful statistical framework for time-varying, quasi-stationary processes and a popular choice for statistical modeling of the speech signal.

21 Speech Recognition and Hidden Markov Modeling
Three basic problems have to be resolved:
The evaluation problem
–estimate the probability of the observation sequence given the model
The decoding problem
–find the best state sequence q
The estimation problem
–estimate HMM parameters from a given set of training samples (ML-based algorithms such as Baum-Welch)
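For the evaluation problem, the standard forward recursion (not reproduced on the slide; written here in the usual HMM notation with transitions a_ij, emissions b_j, and initial probabilities π_j) computes the likelihood:

```latex
\alpha_1(j) = \pi_j\, b_j(o_1), \qquad
\alpha_t(j) = \Big[\sum_{i} \alpha_{t-1}(i)\, a_{ij}\Big]\, b_j(o_t), \qquad
P(O \mid \lambda) = \sum_{j} \alpha_T(j)
```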

22 MCE Classifier Design Using Discriminant Functions
MCE classifier design is based on three steps (19).

23 MCE Classifier Design Using Discriminant Functions
Misclassification measure (20)
Generally, the competing-class term is taken as a smoothed (soft) maximum over the discriminant functions of the competing classes.
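A commonly cited form of the MCE misclassification measure, which (20) most likely corresponds to (hedged reconstruction; η controls the smoothness of the soft maximum over the M−1 competing classes):

```latex
d_k(x; \Lambda) = -g_k(x; \Lambda) +
\log \Bigg[ \frac{1}{M-1} \sum_{j \neq k} \exp\big(\eta\, g_j(x; \Lambda)\big) \Bigg]^{1/\eta}
```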

24 MCE Classifier Design Using Discriminant Functions

25 MCE Classifier Design Using Discriminant Functions
Loss function (21) (22)
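The smoothed loss in (21)-(22) is typically a sigmoid of the misclassification measure (hedged reconstruction; γ and θ are the slope and offset of the sigmoid):

```latex
\ell_k(x; \Lambda) = \frac{1}{1 + \exp\big(-\gamma\, d_k(x; \Lambda) + \theta\big)},
\qquad \gamma > 0
```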

26 MCE Classifier Design Using Discriminant Functions
Classifier performance measure (23) (24)
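The overall classifier performance measure in (23)-(24) is usually written as the expected loss accumulated over all classes (hedged reconstruction):

```latex
L(\Lambda) = \sum_{k=1}^{M} E_x\big[\, \ell_k(x; \Lambda)\, \mathbf{1}(x \in C_k) \,\big]
```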

27 MCE Classifier Design Using Discriminant Functions
If the posterior probability is used as the discriminant function, then the Bayes minimum risk is (25).
If the probability of x under Class k is not the largest, the sample is misclassified, i.e., a loss is incurred.

28 MCE Classifier Design Using Discriminant Functions
If the posterior probability is used, then the Bayes minimum risk is (26).

29 Optimization Methods
Expected loss (27)
We use a GPD-based minimization algorithm to minimize it (28).
U_t is the learning bias matrix, which imposes a different learning rate for the correct model vs. the competing models.
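The GPD update in (28) is commonly written as follows (a hedged reconstruction consistent with the slide's description of the learning bias matrix U_t and step size ε_t):

```latex
\Lambda_{t+1} = \Lambda_t - \varepsilon_t\, U_t\, \nabla \ell(x_t; \Lambda_t)
```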

30 Optimization Methods
If the following three properties are satisfied, then the algorithm converges.
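The three properties are not reproduced in the transcript; in GPD they are usually the stochastic-approximation (Robbins-Monro) conditions on the step-size sequence, roughly:

```latex
\varepsilon_t > 0, \qquad
\sum_{t=1}^{\infty} \varepsilon_t = \infty, \qquad
\sum_{t=1}^{\infty} \varepsilon_t^2 < \infty
```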

31 Optimization Methods
Empirical loss (31) (32)
If the training samples are obtained by independent sampling from a space with a fixed probability distribution P, the empirical loss converges to the expected loss as the number of training samples grows.
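As an illustration of GPD-based minimization of the empirical loss, here is a minimal sketch for a toy classifier with linear discriminant functions g_k(x) = w_k·x + b_k. The function name, the implicit identity learning-bias matrix, and all hyperparameter values are illustrative assumptions; the actual MCE training described in the talk operates on HMM parameters, not on linear weights.

```python
import numpy as np

def mce_gpd_train(X, y, num_classes, epochs=20, eps0=0.1, gamma=2.0, eta=4.0):
    """Toy MCE/GPD training of linear discriminants (U_t taken as identity)."""
    N, D = X.shape
    Xa = np.hstack([X, np.ones((N, 1))])        # augment with a bias term
    W = np.zeros((num_classes, D + 1))          # one weight vector per class
    rng = np.random.default_rng(0)

    for t in range(epochs):
        eps = eps0 / (1.0 + t)                  # decreasing GPD step size
        for n in rng.permutation(N):
            x, k = Xa[n], y[n]
            g = W @ x                           # discriminant scores g_j(x)
            mask = np.ones(num_classes, dtype=bool)
            mask[k] = False
            # soft-max over competing classes (misclassification measure d_k)
            gm = eta * g[mask]
            gm_max = gm.max()
            comp = (gm_max + np.log(np.mean(np.exp(gm - gm_max)))) / eta
            d = -g[k] + comp
            ell = 1.0 / (1.0 + np.exp(-gamma * d))       # sigmoid loss
            dell_dd = gamma * ell * (1.0 - ell)
            # gradient of d with respect to each class weight vector
            w_comp = np.exp(gm - gm_max)
            w_comp /= w_comp.sum()
            grad = np.zeros_like(W)
            grad[k] = -x
            grad[mask] = np.outer(w_comp, x)
            W -= eps * dell_dd * grad            # GPD update
    return W

# Toy usage (illustrative): two Gaussian clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
W = mce_gpd_train(X, y, num_classes=2)
pred = np.argmax(np.hstack([X, np.ones((100, 1))]) @ W.T, axis=1)
print("training accuracy:", (pred == y).mean())
```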

32 HMM as a Discriminant Function
Using the HMM as the discriminant function (34) (35).
There are three ways to construct the discriminant function from the HMM (36) (37).

33 HMM as a Discriminant Function

34 HMM as a Discriminant Function
Assumption: the original constraints of the HMM must be maintained (e.g., transition probabilities and mixture weights sum to one, variances remain positive).

35 HMM as a Discriminant Function
Therefore, we use parameter transformations to preserve these constraints.
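The transformations used for Gaussian-mixture HMMs in the MCE/GPD literature are typically of the following form (a hedged reconstruction; the slide's exact equations are not reproduced), where the gradient updates are applied to the transformed (tilde) parameters so that positivity and sum-to-one constraints hold automatically after inverting the transformation:

```latex
\tilde{\mu}_{jk,d} = \frac{\mu_{jk,d}}{\sigma_{jk,d}}, \qquad
\tilde{\sigma}_{jk,d} = \log \sigma_{jk,d}, \qquad
c_{jk} = \frac{\exp(\tilde{c}_{jk})}{\sum_{l} \exp(\tilde{c}_{jl})}
```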

36 HMM as a Discriminant Function: discriminant adjustment of the mean vector

37 HMM as a Discriminant Function

38 HMM as a Discriminant Function

39 HMM as a Discriminant Function: discriminant adjustment of the variance

40 HMM as a Discriminant Function

41 HMM as a Discriminant Function: discriminant adjustment of the mixture weight

42 HMM as a Discriminant Function

43 HMM as a Discriminant Function
How should the step size be designed?
–If the step size is too large, the classifier will be degraded at the start and sequential learning cannot succeed
–If the step size is too small, the convergence of the algorithm is too slow to be practically useful
Designing the step size is difficult, and a general solution is still lacking.

44 HMM as a Discriminant Function
Why do we normalize the mean vector?
–The magnitude of the variances can vary over a range from 100 to 10^-5
–If a constant step size were used for all mean vectors, the algorithm would either not converge or be too slow to be practically useful
Normalizing the mean by the standard deviation removes this dependence on the variance variations.

45 Relation between MCE and MMI

46 Relation between MCE and MMI

47 Relation between MCE and MMI

48 Relation between MCE and MMI

49 Relation between MCE and MMI

50 Relation between MCE and MMI

51 Relation between MCE and MMI
The objective function in MMI is not bounded for d_c(x) > 0.
This behavior may have adverse effects on MMI-based parameter estimation, since it is based on the mutual information I(W_c, X) averaged over the entire training set.
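One way to see this relation (a hedged sketch, assuming η = 1, the averaging factor over competitors dropped, and g_j(X) = log p(X | W_j)P(W_j)):

```latex
d_c(X) = -g_c(X) + \log \sum_{j \neq c} e^{g_j(X)}
\quad\Rightarrow\quad
P(W_c \mid X) = \frac{e^{g_c(X)}}{\sum_{j} e^{g_j(X)}} = \frac{1}{1 + e^{\,d_c(X)}},
\qquad
-\log P(W_c \mid X) = \log\big(1 + e^{\,d_c(X)}\big)
```

The MMI-style criterion grows without bound as d_c(X) increases, whereas the MCE sigmoid loss saturates at 1.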

52 Relation between MCE and MMI
The MCE approach has several advantages in classifier design:
–It is meaningful in the sense of minimizing the empirical recognition error rate of the classifier
–If the true class posterior distributions are used as discriminant functions, the asymptotic behavior of the classifier approximates the minimum Bayes risk

53 SUMMARY
We examined the classical Bayes decision theory approach to the problem of pattern classification.
Since we do not know the actual probability distribution, we minimize the expected (empirical) loss instead and obtain a set of classifier parameters.
We have seen what MCE is and how to use it to solve classification problems.

