
1 Automatic Phoneme Alignment
Hung-Yi Lo, Jen-Wei Kuo and Hsin-Min Wang
Spoken Language Group, Chinese Information Processing Lab., Institute of Information Science, Academia Sinica, Taipei, Taiwan
http://sovideo.iis.sinica.edu.tw/SLG/index.htm
NGASR Summer Meeting 2011, July 12, 2011

2 The Automatic Phoneme Alignment Problem
Given an acoustic signal and its phonetic transcription, locate the time boundary of each phone.
[Figure: acoustic signal, spectrogram, and context for the word "attitude" (ae dx ax tcl t ux dcl) in "A good attitude is unbeatable"]

3 Outline
 A Minimum Boundary Error Framework for Automatic Phoneme Alignment
 Phonetic Boundary Refinement Using Support Vector Machine
 Phoneme Boundary Refinement Using Ranking Methods

4 A Minimum Boundary Error Framework for Automatic Phoneme Alignment

5 MBE Discriminative Training
Let Ψ be a set of possible phonetic alignments for a given observation utterance O, and let S be the true phonetic alignment in the canonical transcription. The expected boundary error is then expressed as

  E = Σ_{S' ∈ Ψ} P(S' | O) · ER(S', S)

where ER(S', S) is the boundary error of S' compared to S, the summation runs over all possible alignments, and P(S' | O) is the posterior probability of S'.

6 MBE Forced Alignment (N-best Rescoring)
A set Ψ of hypothesized alignments is generated by a conventional HMM-based forced alignment approach. We want to select from Ψ the alignment with the minimal expected boundary error, so we examine each hypothesized alignment one by one.

7 MBE Forced Alignment
If S_i is taken to be the true alignment, the boundary error of a hypothesized alignment S_j is ER(S_j, S_i). Since the true alignment is unknown at alignment time, each hypothesized alignment S_i ∈ Ψ is treated as the truth in turn, weighted by its posterior probability.
9 MBE Forced Alignment
The expected boundary error of a hypothesized alignment S_j can be expressed as

  E(S_j) = Σ_{S_i ∈ Ψ} P(S_i | O) · ER(S_j, S_i)

The best alignment is found by minimizing the expected error:

  Ŝ = argmin_{S_j ∈ Ψ} E(S_j)
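The N-best rescoring above can be sketched in a few lines (a minimal illustration: representing an alignment as a list of boundary times in milliseconds, and the toy posteriors, are our assumptions, not the original setup):

```python
# Minimal sketch of MBE N-best rescoring: pick the hypothesized
# alignment with the lowest expected boundary error.
# An alignment is represented here as a list of boundary times (ms).

def boundary_error(s_a, s_b):
    """Total absolute distance between corresponding boundaries."""
    return sum(abs(a - b) for a, b in zip(s_a, s_b))

def mbe_select(hypotheses, posteriors):
    """Return the hypothesis minimizing the expected boundary error
    E(S_j) = sum_i P(S_i|O) * ER(S_j, S_i)."""
    def expected_error(s_j):
        return sum(p * boundary_error(s_j, s_i)
                   for s_i, p in zip(hypotheses, posteriors))
    return min(hypotheses, key=expected_error)

# Toy example: three hypothesized alignments with posteriors.
hyps = [[100, 200, 310], [105, 205, 300], [90, 195, 305]]
post = [0.5, 0.3, 0.2]
best = mbe_select(hyps, post)
```

In practice Ψ comes from N-best HMM forced alignment and the posteriors from the acoustic model.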

10 Experimental Setup
TIMIT corpus
 Training set: 4546 utterances (3.87 hrs)
 Test set: 1646 utterances (1.41 hrs)
Context-independent HMMs
 50 phone models (left-to-right topology)
 3 states per HMM
 5265 Gaussian mixtures in total
 Frame window size and shift: 20 ms and 5 ms, respectively
 12 MFCCs and log energy, plus their first and second differences
 CMS and CVN applied
Evaluation metric
 Percentage of boundaries placed within a given tolerance
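The evaluation metric above (percentage of boundaries placed within a tolerance) is simple to state in code; the function and variable names here are illustrative:

```python
# Sketch of the evaluation metric: percentage of predicted boundaries
# placed within a given tolerance of the reference boundaries.
# Boundary times are in milliseconds.

def percent_correct(predicted, reference, tolerance_ms):
    hits = sum(1 for p, r in zip(predicted, reference)
               if abs(p - r) <= tolerance_ms)
    return 100.0 * hits / len(reference)

pred = [102, 198, 315, 400]
ref  = [100, 200, 300, 405]
acc20 = percent_correct(pred, ref, 20)   # offsets 2, 2, 15, 5 ms: all within 20 ms
acc10 = percent_correct(pred, ref, 10)   # the 15 ms offset misses: 3 of 4
```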

11 Experimental Results: Boundary Accuracy

Training | Segmentation | Mean Boundary Distance | ≤5ms | ≤10ms | ≤15ms | ≤20ms | ≤25ms | ≤30ms
ML | ML | 9.83 ms | 46.69 | 71.10 | 83.14 | 88.94 | 92.32 | 94.52
ML+MBE | ML | 7.82 ms | 58.48 | 79.75 | 88.16 | 92.11 | 94.49 | 96.11
ML | MBE | 8.95 ms | 49.86 | 74.25 | 85.38 | 90.61 | 93.75 | 95.67
ML+MBE | MBE | 7.49 ms | 58.73 | 80.53 | 88.97 | 92.85 | 95.16 | 96.64
Absolute improvement, (ML+MBE, MBE) vs. (ML, ML) | 2.34 ms | 12.04 | 9.43 | 5.83 | 3.91 | 2.84 | 2.12

%Correct = percentage of boundaries with distance ≤ tolerance.

12 Phonetic Boundary Refinement Using Support Vector Machine

13 Boundary Refinement Using SVM
[Figure: acoustic signal and spectrogram; boundaries detected by HMM-based forced alignment serve as hypothesized boundaries, which are input to an SVM boundary classifier that outputs the refined boundaries]

14 Feature Extraction
Each frame is represented by a 45-dimensional vector:
 39 MFCC-based coefficients
 Zero-crossing rate
 Bisector frequency
 Burst degree
 Spectral entropy
 General weighted entropy
 Subband energy
Two features are extracted for each hypothesized boundary:
 Symmetrical Kullback-Leibler distance
 Spectral feature transition rate
Each hypothesized boundary is represented by a 92-dimensional augmented vector consisting of the feature vectors of the left and right frames adjacent to it plus the two boundary features.
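Two of the ingredients above can be sketched as follows. The assumption that each frame is summarized by a nonnegative 45-dim vector (treated as a distribution for the KL computation) is ours, and the spectral feature transition rate is passed in as a precomputed number:

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-10):
    """Symmetrical Kullback-Leibler distance between two frame
    spectra, normalized to probability distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def augmented_vector(left_frame, right_frame, skl, sftr):
    """92-dim boundary representation: the 45-dim left frame, the
    45-dim right frame, plus the two boundary features."""
    return np.concatenate([left_frame, right_frame, [skl, sftr]])

left = np.random.rand(45)
right = np.random.rand(45)
vec = augmented_vector(left, right, symmetric_kl(left, right), 0.5)
```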

15 Phone-Transition-Dependent Classifiers
Training data is always limited, so we cannot train a separate classifier for every type of phone transition. Since many phone transitions have similar acoustic characteristics, we can partition them into clusters; phone transitions with little training data are then covered by the classifiers of the clusters they belong to.
Two methods for phone transition clustering:
 K-means-based clustering (KM)
 Decision-tree-based clustering (DT)
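A minimal sketch of the K-means-based clustering (KM): each phone-transition type is summarized by one feature vector (here 2-D toy points), and transitions whose summaries fall in the same cluster share a boundary classifier. The deterministic farthest-point initialization is our simplification:

```python
import numpy as np

def kmeans(points, k, iters=20):
    """Plain K-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(k - 1):
        # Next seed: the point farthest from all chosen centers.
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers],
                   axis=0)
        centers.append(points[int(d.argmax())])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        # Assign each transition summary to its nearest center.
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster is empty.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

# Two obvious groups of phone-transition summaries (toy data).
feats = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 4.9]])
labels, _ = kmeans(feats, k=2)
```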

16 Training Data for the Boundary Classifier
Positive samples: feature vectors associated with the true phone boundaries.
Negative samples: randomly selected feature vectors at least 20 ms away from the true boundaries.
[Figure: classification model separating positive and negative instances]
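The sampling scheme above can be sketched as follows, using the 5 ms frame shift from the experimental setup; the utterance length and the number of negatives per utterance are illustrative:

```python
import random

# Sketch of the training-data construction: positive samples at the
# true boundaries, negative samples drawn at random from frames at
# least 20 ms away from any true boundary. Times are in milliseconds.

def sample_training_frames(true_boundaries_ms, utterance_ms,
                           shift_ms=5, margin_ms=20, n_neg=10, seed=0):
    positives = list(true_boundaries_ms)
    candidates = [t for t in range(0, utterance_ms, shift_ms)
                  if all(abs(t - b) >= margin_ms for b in true_boundaries_ms)]
    rng = random.Random(seed)
    negatives = rng.sample(candidates, min(n_neg, len(candidates)))
    return positives, negatives

pos, neg = sample_training_frames([100, 250, 400], utterance_ms=500)
```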

17 Experimental Results

Method | Mean Boundary Distance | <5ms | <10ms | <20ms | <30ms | <40ms
HMM (ML) | 9.73 | 46.85 | 71.53 | 89.17 | 94.62 | 97.16
HMM (ML+MBE) | 7.79 | 58.73 | 80.15 | 92.09 | 95.93 | 97.89
HMM (ML) + SVM | 7.82 | 58.18 | 81.19 | 92.47 | 96.05 | 97.78
HMM (ML+MBE) + SVM | 7.44 | 58.91 | 82.19 | 93.16 | 96.52 | 98.05

18 Phonetic Boundary Refinement Using Ranking Methods

19 Training Data for the Boundary Classifier
Positive samples: feature vectors associated with the true phone boundaries.
Negative samples: randomly selected feature vectors at least 20 ms away from the true boundaries.
However, this classification-based method has two drawbacks.

20 First Drawback: Losing Information (1/2)
Only information about the boundary and the far-away non-boundary signal characteristics is used.
 What about the information near the boundary?

21 First Drawback: Losing Information (2/2)
Preference ranking uses it:
 Instances extracted at the true boundaries: high preference
 Nearby instances: medium preference
 Far-away instances: low preference

22 Second Drawback: Imbalanced Training
There are many negative instances but only a limited number of positive instances. General classification algorithms will be biased toward predicting all instances as negative, since they are trained to minimize the number of incorrectly classified instances.

23 Boundary Refinement as a Ranking Problem
Learn a function H: X → R, where H(x_i) > H(x_j) means that instance x_i is preferred to x_j. A hypothesized boundary close to the true boundary should receive a higher score. We only care about relative order:
 Correct order: A-B-C-D
 OK: {A: -100, B: -10, C: 0, D: 1000}
 OK: {A: 0.1, B: 0.3, C: 0.4, D: 0.41}
We exploit two learning-to-rank methods:
 Ranking SVM
 RankBoost
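The "only relative order matters" point can be checked directly; the two score assignments from the slide induce the same ranking:

```python
# Two very different score assignments that induce the same ordering.

def ranking(scores):
    """Return items sorted by ascending score."""
    return sorted(scores, key=scores.get)

s1 = {"A": -100, "B": -10, "C": 0, "D": 1000}
s2 = {"A": 0.1, "B": 0.3, "C": 0.4, "D": 0.41}
```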

24 Ranking SVM
The training instances are given as ordered pairs (x_i, x_j), meaning that x_i should be ranked higher than x_j. Optimization problem:

  min_{w, ξ}  (1/2)‖w‖² + C Σ ξ_ij
  s.t.  w·x_i ≥ w·x_j + 1 − ξ_ij,  ξ_ij ≥ 0,  for every ordered pair (x_i, x_j)
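A toy illustration of the pairwise constraints: the subgradient loop below is our stand-in for a real Ranking SVM solver, trained on ordered pairs where the first vector should outrank the second:

```python
import numpy as np

# Subgradient descent on the pairwise hinge loss: for each ordered
# pair (xi preferred to xj) we want w.xi >= w.xj + 1 - slack.
# The optimizer, step size, and toy 2-D data are illustrative.

def train_rank_svm(pairs, dim, C=1.0, lr=0.05, epochs=200):
    w = np.zeros(dim)
    for _ in range(epochs):
        grad = w.copy()              # gradient of (1/2)||w||^2
        for xi, xj in pairs:         # xi should outrank xj
            if w @ (xi - xj) < 1.0:  # margin violated: hinge subgradient
                grad -= C * (xi - xj)
        w -= lr * grad
    return w

# One pair: `hi` should score higher than `lo`.
hi = np.array([2.0, 0.0])
lo = np.array([0.0, 2.0])
w = train_rank_svm([(hi, lo)], dim=2)
```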

25 RankBoost
Weak rankers h_1(x), h_2(x), h_3(x), ..., h_T(x) are learned round by round from training sample pairs, with a weight D_t maintained on instance pairs. Each weak ranker is chosen to minimize the ranking loss, and the pair weights are then updated:

  D_{t+1}(x_i, x_j) = D_t(x_i, x_j) · exp(α_t (h_t(x_j) − h_t(x_i))) / Z_t

26 RankBoost
Final ranker:

  H(x) = Σ_{t=1}^{T} α_t h_t(x)
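A compact RankBoost sketch with threshold weak rankers on scalar features (the toy data and the brute-force weak-learner search are our assumptions; real boundary refinement uses the 92-dim vectors described earlier):

```python
import math

# Pairs (a, b) mean "a should be ranked above b". Weak rankers are
# thresholds h(x) = 1 if x > theta else 0.

def rank_boost(pairs, rounds=5):
    n = len(pairs)
    D = [1.0 / n] * n                # weight on each instance pair
    alphas, thresholds = [], []
    for _ in range(rounds):
        # Pick the threshold maximizing r = sum_D D * (h(a) - h(b)).
        cands = sorted({x for p in pairs for x in p})
        best_r, best_t = 0.0, cands[0]
        for t in cands:
            r = sum(d * ((a > t) - (b > t)) for d, (a, b) in zip(D, pairs))
            if abs(r) > abs(best_r):
                best_r, best_t = r, t
        if abs(best_r) >= 1.0:       # perfect ranker: cap to avoid log(0)
            best_r = math.copysign(0.999, best_r)
        alpha = 0.5 * math.log((1 + best_r) / (1 - best_r))
        alphas.append(alpha)
        thresholds.append(best_t)
        # Update rule: increase weight on pairs the ranker got wrong.
        D = [d * math.exp(alpha * ((b > best_t) - (a > best_t)))
             for d, (a, b) in zip(D, pairs)]
        z = sum(D)
        D = [d / z for d in D]
    def H(x):                        # final ranker: weighted vote
        return sum(al * (x > t) for al, t in zip(alphas, thresholds))
    return H

H = rank_boost([(3.0, 1.0), (2.5, 0.5), (4.0, 2.0)])
```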

27 Generation of Training Pairs
Four ordered ranking lists, (1)-(4), are generated from each true boundary.
[Figure: frames around a true boundary forming four ordered lists]
29 Generation of Training Pairs
The preferred instances are coupled with each remaining instance to form training pairs.
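Under the assumption that frames closer to the true boundary are preferred, pair generation can be sketched as follows (the slides' four-list construction is simplified here to a single distance-ordered list):

```python
# Frames closer to the true boundary should be ranked above frames
# farther away; couple each preferred frame with every frame that
# follows it in the distance ordering. Times are in milliseconds.

def generate_pairs(frame_times, true_boundary):
    # Order frames by distance to the true boundary (closest first).
    ordered = sorted(frame_times, key=lambda t: abs(t - true_boundary))
    # Couple each preferred (closer) frame with every later frame.
    return [(a, b) for i, a in enumerate(ordered) for b in ordered[i + 1:]]

pairs = generate_pairs([90, 95, 100, 105, 110], true_boundary=100)
```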

30 Experimental Results

Method | Mean Boundary Distance (ms) | <10ms | <20ms
HMM | 7.14 | 81.57 | 93.73
Linear SVM (KM) | 6.84 | 83.51 | 93.85
Linear SVM (DT) | 6.89 | 83.44 | 93.79
RBF SVM (KM) | 6.75 | 84.00 | 94.33
RBF SVM (DT) | 6.83 | 83.70 | 94.12
Linear RankSVM (KM) | 6.62 | 83.89 | 94.17
Linear RankSVM (DT) | 6.76 | 83.90 | 94.01
RankBoost (KM) | 6.66 | 84.20 | 94.14
RankBoost (DT) | 6.66 | 84.13 | 94.11

31 References
Jen-Wei Kuo and Hsin-Min Wang, "Minimum Boundary Error Training for Automatic Phonetic Segmentation," ICSLP 2006.
Jen-Wei Kuo and Hsin-Min Wang, "A Minimum Boundary Error Framework for Automatic Phonetic Segmentation," ISCSLP 2006.
Hung-Yi Lo and Hsin-Min Wang, "Phonetic Boundary Refinement Using Support Vector Machine," ICASSP 2007.
Hung-Yi Lo and Hsin-Min Wang, "Phone Boundary Refinement Using Ranking Methods," ISCSLP 2010.

32 Thank you!

