HMM vs. Maximum Entropy for SU Detection Yang Liu 04/27/2004.


1 HMM vs. Maximum Entropy for SU Detection Yang Liu 04/27/2004

2 Outline
- SU Detection Problem
- Two Modeling Approaches
- Experimental Results
- Conclusions & Future Work

3 SU Detection Problem
Find the sentence-like unit (SU) boundaries given the word sequence (human transcripts or speech recognition output) and the speech signal.
Why?
- Easier for human comprehension
- Needed by downstream NLP modules
- May help speech recognition accuracy

4 SU Detection Using HMM
- Sequence decoding: Viterbi algorithm
- For a classification task, it is better to find the most likely event at each interword boundary than to decode the single most likely sequence:
  E_i = argmax_e P(E_i = e | W, F), computed with the forward-backward algorithm
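The forward-backward computation behind the per-boundary decision can be sketched as follows. This is a minimal toy model, not the paper's system: two hidden events per boundary (no-SU, SU), transition probabilities standing in for the hidden-event LM, and emission probabilities standing in for the prosody model; all numbers are made up.

```python
def forward_backward(init, trans, emit):
    """Return P(E_i = e | all observations) for each boundary i."""
    T, S = len(emit), len(init)
    alpha = [[0.0] * S for _ in range(T)]
    beta = [[0.0] * S for _ in range(T)]
    # Forward pass: alpha[t][s] = P(obs_1..t, state_t = s)
    for s in range(S):
        alpha[0][s] = init[s] * emit[0][s]
    for t in range(1, T):
        for s in range(S):
            alpha[t][s] = emit[t][s] * sum(
                alpha[t - 1][p] * trans[p][s] for p in range(S))
    # Backward pass: beta[t][s] = P(obs_t+1..T | state_t = s)
    for s in range(S):
        beta[T - 1][s] = 1.0
    for t in range(T - 2, -1, -1):
        for s in range(S):
            beta[t][s] = sum(
                trans[s][n] * emit[t + 1][n] * beta[t + 1][n]
                for n in range(S))
    # Posterior at each boundary: normalize alpha * beta
    post = []
    for t in range(T):
        g = [alpha[t][s] * beta[t][s] for s in range(S)]
        z = sum(g)
        post.append([x / z for x in g])
    return post

init = [0.8, 0.2]                   # P(no-SU), P(SU) at first boundary (toy)
trans = [[0.7, 0.3], [0.9, 0.1]]    # hidden-event LM transitions (toy)
emit = [[0.6, 0.4], [0.2, 0.8], [0.5, 0.5]]  # prosody likelihoods (toy)
posteriors = forward_backward(init, trans, emit)
# Per-boundary classification: pick the most likely event at each boundary
labels = [max(range(2), key=lambda s: p[s]) for p in posteriors]
```

Unlike Viterbi, which returns the single best event sequence, this picks the most likely event independently at each boundary, which is what the argmax on the slide describes.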

5 Terms in the HMM Approach
- Transition probabilities: hidden-event LM, P(W, E) = P(W_1 E_1 W_2 ...); maximum likelihood parameter estimation
- Emission probabilities: P(F_i | E_i) ∝ P(E_i | F_i) / P(E_i); decision trees estimate the posterior probabilities given the prosodic features
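The conversion from the decision tree's posterior to an HMM emission score follows Bayes' rule: P(F_i | E_i) is proportional to P(E_i | F_i) / P(E_i), and the constant P(F_i) cancels in decoding. A minimal sketch with hypothetical numbers:

```python
def posterior_to_likelihood(posterior, prior):
    """Scaled emission likelihoods P(F|E) ~ P(E|F) / P(E).
    The common factor P(F) is dropped; it cancels during decoding."""
    return {e: posterior[e] / prior[e] for e in posterior}

tree_posterior = {'SU': 0.30, 'no-SU': 0.70}   # P(E_i | F_i) from the tree (toy)
class_prior = {'SU': 0.15, 'no-SU': 0.85}      # P(E_i) on training data (toy)
likelihood = posterior_to_likelihood(tree_posterior, class_prior)
```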

6 SU Detection Using Maxent
- Represent SU detection as a classification task; each sample has an associated feature set: model P(E_i | O_i), where O_i = (c(w_i), F_i)
- Parameter estimation:
  - Maximize the joint likelihood P(E, O): ML parameter estimation (e.g., naive Bayes)
  - Maximize the conditional likelihood P(E | O): i.e., maximum entropy

7 Maxent Introduction
- Model what is known; assume nothing about what is unknown
- Estimate p(y|x) such that it:
  - Satisfies the constraints: the empirical expectation of each feature equals its expected value with respect to the model p(y|x)
  - Has maximum entropy among all distributions satisfying those constraints
- The solution has an exponential form

8 Features in Maxent
- Features are indicator functions, e.g. f(x, y) = 1 if x = 'uhhuh' and y = SU, 0 otherwise
- The lambda weights are estimated iteratively
- We use a variety of features:
  - Words (different n-grams, different positional information)
  - POS tags
  - Chunks
  - Automatically induced classes
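The indicator feature and the lambdas above fit together in the standard maxent exponential form, p(y|x) = exp(sum_k lambda_k f_k(x, y)) / Z(x). A minimal sketch with a single hypothetical feature and a made-up weight:

```python
import math

def maxent_prob(lambdas, features, x, labels):
    """p(y|x) = exp(sum_k lambda_k * f_k(x, y)) / Z(x)."""
    scores = {y: math.exp(sum(l * f(x, y) for l, f in zip(lambdas, features)))
              for y in labels}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

# Hypothetical indicator: fires when the word is 'uhhuh' and the label is SU
def f_uhhuh(x, y):
    return 1.0 if x == 'uhhuh' and y == 'SU' else 0.0

p = maxent_prob([1.2], [f_uhhuh], 'uhhuh', ['SU', 'no-SU'])
```

When no feature fires (any word other than 'uhhuh' here), the model falls back to a uniform distribution over the labels, which is the "assume nothing unknown" property from the previous slide.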

9 Features in Maxent (cont.)
- Prosodic posteriors P(E_i | F_i): it is convenient to use binary features in the maxent approach, so the posterior probabilities from the decision trees are encoded as binary features using cumulative binning
- Decisions from other classifiers (e.g., LMs)
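Cumulative binning can be sketched as follows: the real-valued posterior is mapped to a set of binary indicators, one per threshold, where feature k fires iff the posterior reaches threshold k. The thresholds here are hypothetical, not the ones used in the experiments.

```python
def cumulative_bin(posterior, thresholds=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Binary feature k fires iff the posterior reaches threshold k."""
    return [1 if posterior >= t else 0 for t in thresholds]

features = cumulative_bin(0.62)   # -> [1, 1, 1, 0, 0]
```

Note the lossy step: all posteriors between adjacent thresholds map to the same feature vector, which is the information loss mentioned on the next slide.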

10 Differences between HMM and Maxent
- Both use word context; for prosody, maxent uses only F_i, while the HMM uses all of F via the forward-backward algorithm
- Maxent bins the posterior probabilities, thus losing some information
- Maxent maximizes the conditional likelihood P(E|O); the HMM maximizes the joint likelihood P(W, E)
- When combining LMs, the HMM linearly interpolates posterior probabilities under an independence assumption; maxent integrates overlapping features more tightly
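The HMM-style combination of multiple LMs is plain linear interpolation of their posteriors, which implicitly treats the models' evidence as independent. A minimal sketch with toy posteriors and hypothetical interpolation weights:

```python
def interpolate_posteriors(posteriors, weights):
    """Weighted average of per-model posteriors (weights sum to 1)."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return {e: sum(w * p[e] for w, p in zip(weights, posteriors))
            for e in posteriors[0]}

word_lm = {'SU': 0.8, 'no-SU': 0.2}   # posterior from the word LM (toy)
pos_lm = {'SU': 0.6, 'no-SU': 0.4}    # posterior from the POS LM (toy)
combined = interpolate_posteriors([word_lm, pos_lm], [0.7, 0.3])
```

Maxent instead puts all the overlapping features into one model and learns their weights jointly, so correlated knowledge sources are not double-counted the way they can be under interpolation.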

11 BN & CTS SU Detection

             HMM    Maxent  Combined
  BN   Ref   48.72  48.61   46.79
       STT   55.37  56.51   54.35
  CTS  Ref   31.51  30.66   29.30
       STT   42.94  43.02   41.88

RT-03 dev and eval sets
Evaluation: Error = (# missed SUs + # false alarms) / # reference SUs
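The evaluation metric above can be sketched directly. Boundary positions here are hypothetical indices; an SU boundary is counted as a miss if it is in the reference but not the hypothesis, and as a false alarm in the opposite case.

```python
def su_error_rate(ref_sus, hyp_sus):
    """Error (%) = (# missed SUs + # false alarms) / # reference SUs."""
    misses = len(ref_sus - hyp_sus)
    false_alarms = len(hyp_sus - ref_sus)
    return 100.0 * (misses + false_alarms) / len(ref_sus)

ref = {1, 4, 7, 9}   # boundaries labeled SU in the reference (toy)
hyp = {1, 4, 8}      # boundaries labeled SU by the system (toy)
error = su_error_rate(ref, hyp)   # 2 misses + 1 false alarm over 4 refs -> 75.0
```

Because both misses and false alarms are divided by the reference SU count, the error can exceed 100%.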

12 Some Findings
- Error rates increase in the face of recognition errors
  - Maxent degrades more, possibly because it relies more heavily on textual information and less on prosodic information
- Maxent yields more gain on CTS than on BN
  - More training data for the CTS task? Prosody more important for BN? Different genres?
- It is easy to combine highly related knowledge sources in maxent; in the HMM, interpolation makes an independence assumption

13 Error Type
The two approaches show different error patterns: the HMM makes more insertion errors (the prosody model tends to add false alarms), so the two can be effectively combined.

              Del    Ins    All
  BN   HMM    28.48  20.24  48.72
       Maxent 32.06  16.54  48.61
  CTS  HMM    17.19  14.32  31.51
       Maxent 19.97  10.69  30.66

14 Effect of LM & Prosody

                           BN     CTS
  HMM     Textual          67.48  38.92
          Textual+prosody  48.72  31.51
  Maxent  Textual          63.56  36.32
          Textual+prosody  48.61  30.66

15 Findings
- Using textual information only, maxent performs much better than the HMM
  - The HMM makes an independence assumption when combining multiple LMs
  - Maxent better integrates the different textual knowledge sources
- When prosody is included, the gain from maxent is lost
  - The posterior probabilities are encoded in a lossy way (binning)
  - Maxent uses only F_i, while the HMM uses all of F

16 Conclusions
- The combination of HMM and maxent achieves the best performance
- Both approaches make inaccurate assumptions, and each has its advantages:
  - Optimization metric vs. performance measure: conditional likelihood is better than joint likelihood, though still not a perfect match
  - Independence assumptions: loose interpolation in the HMM; maxent uses only F_i
- Maxent is more computationally demanding than the HMM

17 Future Work
- HMM
  - Maximize the conditional likelihood
  - Joint word and class LMs
- Maxent
  - Use numerical features, not just binary ones
  - Preserve the prosodic probabilities
  - May be able to use confidence measures in the STT output
- MCE discriminative training

