Published by Loren Quinn. Modified over 9 years ago.
1
Phone Boundary Detection using Sample-based Acoustic Parameters
Yih-Ru Wang
Institute of Communication Engineering, National Chiao Tung University, Hsinchu, Taiwan, ROC
2011/7/12, NGASR Workshop
2
Outline
Motivation, background
Why sample-based?
Sample-based acoustic parameters and phone boundary detector
Experimental results
Conclusions and future work
3
Motivation
Find the synchronous "clock" for detection-based ASR and Computer Aided Language Learning (CALL) systems
[Block diagram labels: speech signal; speech attribute detectors; phone boundary detector; synchronous "clock" for the system; segment-based system; detection-based ASR, CALL system]
4
Background
Tasks of phonetic segmentation
– Phone alignment: 87% inclusion rate at a 10 ms tolerance for experts
– Phone boundary detection
Phone alignment: model-based methods
– HMM, MBE-HMM (Minimum Boundary Error HMM), HMM + fine tuning using SVM, …
Phone boundary detection: metric-based methods
– A measure of speech signal change
– Norm of the delta MFCC feature vector (Rabiner, 2006)
– KL distance or BIC of the speech signal
Frame-based features, such as MFCC, were used
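The metric-based idea on this slide (a change measure followed by peak picking) can be illustrated with a minimal sketch. It is not the cited method: the two-sided difference, the half-width `k`, the threshold, and the names `delta_norm` and `pick_peaks` are assumptions chosen for illustration, and the feature vectors are toy numbers rather than real MFCCs.

```python
import math

def delta_norm(frames, k=2):
    """Euclidean norm of a two-sided difference of feature vectors.

    frames: list of equal-length feature vectors, one per frame.
    k: half-width of the difference in frames (hypothetical default).
    """
    n = len(frames)
    scores = []
    for t in range(n):
        lo = frames[max(t - k, 0)]        # clamp at the start of the utterance
        hi = frames[min(t + k, n - 1)]    # clamp at the end of the utterance
        scores.append(math.sqrt(sum((h - l) ** 2 for h, l in zip(hi, lo))))
    return scores

def pick_peaks(scores, threshold):
    """Indices that are local maxima of the change measure and reach the threshold."""
    return [t for t in range(1, len(scores) - 1)
            if scores[t] >= threshold
            and scores[t] > scores[t - 1]
            and scores[t] >= scores[t + 1]]

# Toy input: two steady "phones" with a jump between frame 4 and frame 5.
frames = [[0.0, 0.0]] * 5 + [[3.0, 4.0]] * 5
scores = delta_norm(frames, k=1)
boundaries = pick_peaks(scores, threshold=1.0)
print(boundaries)
```

Because the measure is computed once per frame, the detected boundary can only be placed on the frame grid, which is the resolution limit the following slides argue against.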
5
Why sample-based?
Transient vs. stationary
Accuracy and precision
– Especially for 'short' phones, e.g. plosives
Acoustic features with high frequency resolution, like MFCC, are used to 'recognize' phones in speech
To detect changes of pronunciation manner/position (acoustics) in the speech signal
– Increase the time resolution and decrease the frequency resolution of the features
6
To find useful measures of speech signal change in a sample-based system
– Sample-based acoustic parameters were proposed
Pros of the sample-based method
– Better accuracy and precision
– Properly detects the boundaries of short phones
Cons of the sample-based method
– Complexity of the system?
– Higher false alarm rate?
7
Sample-based Acoustic Parameters & Phone Boundary Detector
Sub-band signal envelope
– Six sub-bands used for landmark detection (Liu, 1996)
ROR (rate of rise) of the sub-band signal envelope
– The delta term of a feature
Band-pass frequencies
5.0 – 8.0 kHz
3.5 – 5.0 kHz
2.0 – 3.5 kHz
1.5 – 2.0 kHz
0.8 – 1.2 kHz
0.0 – 0.4 kHz
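A rough sketch of how an envelope and its rate of rise might be computed per sample for one sub-band. The slides do not say how the envelope is extracted, so the rectify-and-smooth envelope, the window length, the lag, and the dB scale are all assumptions; band-pass filtering is left out and the input is taken to be an already filtered sub-band signal.

```python
import math

# Band edges in Hz as listed on the slide.
BANDS_HZ = [(0, 400), (800, 1200), (1500, 2000),
            (2000, 3500), (3500, 5000), (5000, 8000)]

def envelope(subband, win):
    """Assumed envelope: centered moving average of |x| over `win` samples."""
    n = len(subband)
    half = win // 2
    rect = [abs(x) for x in subband]
    env = []
    for i in range(n):
        seg = rect[max(i - half, 0):min(i + half + 1, n)]
        env.append(sum(seg) / len(seg))
    return env

def rate_of_rise(env, lag, floor=1e-6):
    """Delta term in dB: log-envelope at n+lag minus log-envelope at n-lag."""
    n = len(env)
    db = [20.0 * math.log10(max(e, floor)) for e in env]
    return [db[min(i + lag, n - 1)] - db[max(i - lag, 0)] for i in range(n)]

# Toy sub-band signal: quiet, then 40 dB louder from sample 100 onward.
x = [0.01 * (-1) ** i for i in range(100)] + [1.0 * (-1) ** i for i in range(100)]
env = envelope(x, win=9)
ror = rate_of_rise(env, lag=10)
onset = max(range(len(ror)), key=lambda i: ror[i])
print(onset, round(ror[onset], 1))
```

The ROR is defined at every sample index, so an abrupt onset shows up as a peak located to within a few samples of the change rather than to the nearest frame.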
8
[Figure: waveform, envelope, and sub-band signal envelopes for 5.0 – 8.0 kHz and 0.0 – 0.4 kHz, with manner labels (stop, glide, vowel, fricative, silence, nasal) marked along the time axis. TIMIT utterance FDRW0/sx293: "Please take this dirty table cloth to the cleaners for me"]
9
ROR of sub-band signal envelope
[Figure: ROR of the signal envelope for "Please take this dirty cloth…"; annotation: ~20 ms]
10
Norm of sub-band signal envelopes can be a useful measure of signal change
Sample-based spectral entropy can be defined as
[equation not preserved in this transcript]
where [symbol not preserved] is the i-th normalized sub-band signal envelope
Sample-based spectral KL distance between the speech signals at two adjacent times [n, n+1] can be defined as
[equation not preserved in this transcript]
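Since the defining equations are not preserved, the following is only a sketch using the textbook forms: the envelope values at one sample are normalized to sum to one, the entropy is H[n] = -sum_i p_i[n] log p_i[n], and the distance between samples n and n+1 is taken as the symmetric KL divergence. Whether the slides use the symmetric or the one-sided form is not known, so that choice, the flooring constant, and the function names are assumptions.

```python
import math

def normalize(envs, floor=1e-12):
    """Turn the sub-band envelope values at one sample into a distribution."""
    vals = [max(e, floor) for e in envs]   # floor avoids log(0)
    total = sum(vals)
    return [v / total for v in vals]

def spectral_entropy(p):
    """Shannon entropy (natural log) of the normalized envelopes."""
    return -sum(pi * math.log(pi) for pi in p)

def symmetric_kl(p, q):
    """KL(p||q) + KL(q||p), written as a single sum."""
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

# envs[n] holds six sub-band envelope values at sample n (toy numbers).
envs = [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [6, 1, 1, 1, 1, 0.5]]
dists = [normalize(e) for e in envs]
entropy = [spectral_entropy(p) for p in dists]
kl = [symmetric_kl(dists[n], dists[n + 1]) for n in range(len(dists) - 1)]
print(entropy, kl)
```

In the toy data the flat spectrum gives the maximum entropy log 6, the distance between the two identical samples is zero, and the change at the third sample produces a positive distance.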
11
An example of sample-based spectral entropy and its ROR
[Figure panels: sample-based spectral entropy; ROR of spectral entropy]
12
An example of sample-based spectral KL distance
[Figure: sample-based spectral KL distance]
It can be used to find the signal change points more accurately and precisely.
13
An MLP was used as the phone boundary detector
[Figure: block diagram of the proposed training/test procedure]
14
Candidate pre-selection
– Find all the speech samples, with index n, that satisfy
[condition not preserved in this transcript]
Pre-selection can be used to reduce the complexity and the false alarms (FA) of the sample-based system.
After candidate pre-selection, an MLP was used as the boundary detector.
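The actual pre-selection condition is not preserved, so the sketch below only illustrates the general idea of thinning the sample stream before classification. The rule used here (keep a sample if its change measure reaches a threshold and is the first maximum within a neighborhood) is a hypothetical stand-in, as are the threshold, the neighborhood width, and the name `preselect`.

```python
def preselect(measure, threshold, half_width):
    """Keep index n if measure[n] >= threshold and n is the first maximum
    of the measure within +/- half_width samples (hypothetical rule)."""
    n_total = len(measure)
    keep = []
    for n in range(n_total):
        if measure[n] < threshold:
            continue
        lo = max(n - half_width, 0)
        hi = min(n + half_width + 1, n_total)
        peak = max(measure[lo:hi])
        # On a plateau, only the first sample holding the peak value is kept.
        if measure[n] == peak and measure.index(peak, lo, hi) == n:
            keep.append(n)
    return keep

# Toy change measure, one value per sample.
measure = [0.0, 0.1, 0.9, 0.2, 0.0, 0.0, 0.3, 0.7, 0.7, 0.1, 0.05, 0.0]
candidates = preselect(measure, threshold=0.5, half_width=2)
reduction = len(measure) / len(candidates)
print(candidates, reduction)
```

Only the surviving candidates would be passed to the classifier, which is how pre-selection lowers both the computation and the number of opportunities for a false alarm.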
15
The AP features used for the MLP detector
[Figure not preserved in this transcript]
16
Iterative training procedure
[Figure not preserved in this transcript]
17
2nd stage:
– Use a similarity measure of segmental acoustic signals
– Use a GMM to model the pdf of a speech segment
– The KL1 distance of CCGMM (Wang, 2004)
Using a common GMM to represent the pdfs of two segments
18
Similarity measure of two speech segments:
– Discrete KL-1 distance of CCGMM coefficients
– Discrete KL-2 distance using CCGMM coefficients
19
– Discrete KL-1 distance is the mean of the log-likelihood of two pdfs
– The similarity of two pdfs
– Find higher-order statistics of the log-likelihood pdfs (Wang, 2008)
– Variance and skewness of the log-likelihood pdfs
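One way to read the second-stage measures is sketched below, under explicit assumptions: each segment is summarized by a weight vector over the components of a shared GMM, KL-1 is the one-sided discrete KL divergence between two weight vectors (the mean of the log ratio under the first vector), KL-2 is its symmetrized form, and the higher-order statistics are the weighted variance and skewness of the same log ratio. The slides' exact formulas are not preserved, so none of this should be taken as the authors' definition.

```python
import math

def log_ratio_stats(a, b, floor=1e-12):
    """Mean, variance and skewness of log(a_k / b_k), weighted by a."""
    llr = [math.log(max(x, floor) / max(y, floor)) for x, y in zip(a, b)]
    mean = sum(w * l for w, l in zip(a, llr))
    var = sum(w * (l - mean) ** 2 for w, l in zip(a, llr))
    skew = 0.0
    if var > 0:
        skew = sum(w * (l - mean) ** 3 for w, l in zip(a, llr)) / var ** 1.5
    return mean, var, skew

def kl1(a, b):
    """One-sided discrete KL divergence between two weight vectors."""
    return log_ratio_stats(a, b)[0]

def kl2(a, b):
    """Symmetrized divergence: KL(a||b) + KL(b||a)."""
    return kl1(a, b) + kl1(b, a)

# Toy mixture-weight vectors for three segments over a 3-component shared GMM.
left = [0.7, 0.2, 0.1]
right = [0.1, 0.2, 0.7]
same = [0.7, 0.2, 0.1]
print(kl1(left, right), kl2(left, right), kl1(left, same))
```

A large distance between the segments on either side of a candidate suggests a true boundary, while a small one suggests that the candidate splits a single phone.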
20
– Use segmental similarity
21
Experimental Results
Database: TIMIT

TIMIT corpus    Samples      Phone boundaries
Training set    226727341    172461
Test set        82786737     62466

After candidate pre-selection:
– 1 in 116 samples was selected
– 0.9% missed detections (MD) due to candidate pre-selection
Performance of the MLP boundary detector:
22
Performance of the sample-based boundary detector
[Figure not preserved in this transcript]
23
An example of the proposed phone boundary detector
[Figure not preserved in this transcript]
24
Accuracy of the sample-based boundary detector
[Figure not preserved in this transcript]
25
Comparison with Dr. Rabiner's work [2006]:

Absolute error              5 ms     10 ms    15 ms    Same frame   ±1 frame
Inclusion rate (1-stage)    41.5%    69.7%    81.1%    37.3%        77.0%
Inclusion rate (2-stage)    42.1%    70.3%    81.9%    37.8%        77.8%

Dr. Rabiner's result: (22.8%, 59.2%)
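For readers unfamiliar with the metric, an inclusion rate at a given tolerance can be computed along the following lines. This is a simplified sketch: it counts a reference boundary as included when any detection lies within the tolerance, without enforcing one-to-one matching, and the boundary times are invented. How the slides' figures were scored is not stated.

```python
def inclusion_rate(reference, detected, tolerance_ms):
    """Fraction of reference boundaries with a detection within the tolerance."""
    hits = sum(1 for r in reference
               if any(abs(r - d) <= tolerance_ms for d in detected))
    return hits / len(reference)

# Invented boundary times in milliseconds.
reference = [100.0, 250.0, 400.0, 560.0]
detected = [103.0, 262.0, 399.0, 580.0]
rates = {tol: inclusion_rate(reference, detected, tol) for tol in (5, 10, 15)}
print(rates)
```

The detection errors here are 3, 12, 1 and 20 ms, so two of four boundaries are included at 5 ms and 10 ms, and three of four at 15 ms.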
26
Error analysis: MAE of the detected boundary
Each cell: sample-based / HMM system (unit: ms)

            Affricate   Fricative   Stop         Glide        Vowel       Nasal       Silence
Affricate   -           6.4/6.5*    10.1/6.9*    7.3/10.0*    6.8/13.7    4.9/15.3*   6.1/12.8
Fricative   2.3/17.0    7.2/7.0     13.6/13.1*   9.5/14.9     7.9/13.3    7.1/12.5    6.5/11.7
Stop        -           6.1/7.3     12.4/12.0*   11.2/15.0    7.5/13.1    7.6/9.6     7.1/14.4
Glide       -           7.0/9.5     10.4/12.8    11.0/21.2    7.9/13.6    6.4/11.2    6.3/12.7
Vowel       -           6.3/9.8     7.9/11.8     9.9/15.9     8.8/17.6    6.8/11.5    6.9/13.6
Nasal       7.6/11.3*   6.2/8.2     11.1/13.2    11.6/15.3    7.2/13.3    5.6/11.2*   6.9/12.1
Silence     6.3/12.5    6.0/7.5     7.3/8.2      11.7/14.1    7.4/12.1    5.2/9.9     7.0/18.9

Overall: 7.6/12.4
* number of samples less than 100
27
Accuracy of the proposed method

System          MAE/RMSE (ms)   MAE/RMSE (frame)   MAE/RMSE (normalized to phone duration)
HMM             12.4/17.0       1.22/1.84          0.204/0.322
1-stage (RNN)   7.6/11.5        0.96/1.82          0.127/0.197
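The two error summaries in this table follow the usual definitions, sketched here with invented numbers. Normalizing each error by a phone duration is shown as one plausible reading of the third column; which phone's duration the slides use is not stated, so the `durations_ms` input is an assumption.

```python
import math

def mae_rmse(errors):
    """Mean absolute error and root-mean-square error of signed errors."""
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return mae, rmse

# Invented signed boundary errors (detected minus reference), in ms.
errors_ms = [3.0, -4.0, 0.0, 5.0]
# Hypothetical phone durations paired with each boundary, in ms.
durations_ms = [60.0, 80.0, 50.0, 100.0]

mae_ms, rmse_ms = mae_rmse(errors_ms)
normalized = [e / d for e, d in zip(errors_ms, durations_ms)]
mae_norm, rmse_norm = mae_rmse(normalized)
print(mae_ms, rmse_ms, mae_norm, rmse_norm)
```

RMSE weights large errors more heavily than MAE, so a wide gap between the two indicates a few badly misplaced boundaries rather than a uniform offset.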
28
Error analysis (1-stage): MDR and FAR
Deletion rate by next phone (columns Affricate to Silence) and insertion rate, per pronunciation manner

Manner      System   Affricate   Fricative   Stop      Glide    Vowel   Nasal    Silence   Insertion
Affricate   HMM      -           0.0%        25.0%     11.8%    4.4%    0.0%     2.9%      9.2%
Affricate   RNN      -           0.0%        (blank)   17.6%    5.7%    7.7%     8.0%      11.1%
Fricative   HMM      0.0%        2.3%        13.3%     10.6%    4.8%    5.0%     6.2%      7.5%
Fricative   RNN      0.0%        3.3%        16.3%     20.5%    8.2%    7.0%     7.4%      10.5%
Stop        HMM      -           1.6%        12.6%     14.1%    5.7%    4.1%     2.0%      3.8%
Stop        RNN      -           2.3%        14.9%     22.7%    8.2%    10.3%    2.6%      9.0%
Glide       HMM      -           2.8%        16.1%     28.2%    5.6%    4.5%     5.2%      3.8%
Glide       RNN      -           5.8%        6.3%      6.5%     6.6%    8.4%     7.3%      9.4%
Vowel       HMM      -           2.9%        6.7%      6.6%     6.5%    10.3%    4.4%      7.8%
Vowel       RNN      -           6.2%        9.2%      6.9%     6.4%    10.0%    7.3%      10.1%
Nasal       HMM      7.1%        3.5%        17.7%     7.8%     5.4%    2.5%     18.4%     5.7%
Nasal       RNN      7.1%        9.8%        16.1%     18.8%    8.3%    2.5%     8.7%      9.4%
Silence     HMM      2.1%        0.9%        6.2%      6.6%     4.3%    4.8%     3.0%      5.7%
Silence     RNN      5.0%        3.9%        10.2%     10.0%    7.6%    5.0%     3.0%      6.5%

Overall: HMM 6.4% (EER); sample-based 8.7% (EER)
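Deletion (missed detection) and insertion (false alarm) counts can be obtained by matching detected boundaries to reference boundaries, as in the sketch below. The greedy one-to-one matching, the 20 ms tolerance, and the choice to express both rates relative to the number of reference boundaries are assumptions; the slides do not give their scoring rules.

```python
def match_boundaries(reference, detected, tolerance_ms):
    """Greedy one-to-one matching of boundary times within a tolerance.

    Returns (hits, deletions, insertions).
    """
    reference, detected = sorted(reference), sorted(detected)
    i = j = hits = 0
    while i < len(reference) and j < len(detected):
        diff = detected[j] - reference[i]
        if abs(diff) <= tolerance_ms:
            hits += 1
            i += 1
            j += 1
        elif diff < 0:
            j += 1    # detection precedes every remaining reference: insertion
        else:
            i += 1    # reference has no detection close enough: deletion
    return hits, len(reference) - hits, len(detected) - hits

# Invented boundary times in milliseconds.
reference = [100.0, 250.0, 400.0, 560.0]
detected = [103.0, 180.0, 262.0, 399.0]
hits, deletions, insertions = match_boundaries(reference, detected, 20.0)
mdr = deletions / len(reference)
far = insertions / len(reference)   # assumed normalization
print(hits, mdr, far)
```

Sweeping the detector threshold trades one rate against the other; the equal error rate (EER) quoted on the slide is the operating point where the two coincide.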
29
Conclusions & Future Work
Several sample-based acoustic parameters, which could properly model the speech signal change, were proposed
Using the sample-based APs in the phone boundary detector, better precision and accuracy were achieved
Future work: segment-based speech attribute detectors
30
Segment-based Attribute Detector
[Block diagram label: segment-based attribute recognizer]
Coding each contour using Legendre polynomials
Operation point: 3% MDR, 20% FAR
31
– Set the operation point to a low MD rate and a high FA rate: 80123 segments / 62465 phones
– Feature extraction using the Legendre coefficients of the AP contours
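Coding a contour with Legendre polynomials gives a fixed-length feature regardless of segment duration. The sketch below shows one simple way to obtain such coefficients, by mapping the segment onto [-1, 1] and approximating the projection integrals with the midpoint rule. The polynomial order, the projection method, and the function names are assumptions; the slides do not say how the coefficients were estimated.

```python
def legendre_values(x, order):
    """P_0(x) .. P_order(x) via the recurrence
    (k+1) P_{k+1}(x) = (2k+1) x P_k(x) - k P_{k-1}(x)."""
    vals = [1.0, x]
    for k in range(1, order):
        vals.append(((2 * k + 1) * x * vals[k] - k * vals[k - 1]) / (k + 1))
    return vals[:order + 1]

def legendre_coefficients(contour, order=3):
    """Project a sampled contour onto P_0 .. P_order.

    Approximates c_k = (2k+1)/2 * integral of f(x) P_k(x) over [-1, 1]
    with the midpoint rule on len(contour) equal cells.
    """
    n = len(contour)
    coeffs = [0.0] * (order + 1)
    for i, y in enumerate(contour):
        x = -1.0 + (2 * i + 1) / n                   # midpoint of cell i
        for k, p in enumerate(legendre_values(x, order)):
            coeffs[k] += y * p * (2 * k + 1) / n     # (2k+1)/2 times dx = 2/n
    return coeffs

# Toy contour: a straight line 2 + 3x sampled at 200 points of one segment.
n = 200
contour = [2.0 + 3.0 * (-1.0 + (2 * i + 1) / n) for i in range(n)]
coeffs = legendre_coefficients(contour, order=3)
print([round(c, 3) for c in coeffs])
```

The first coefficient tracks the mean level of the contour and the second its overall slope, so a short and a long segment with the same shape map to nearly the same feature vector.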
32
– Preliminary result
– The frame-based system uses a 9-frame feature
– Converted into accuracy over time: 81.2%
– Only 6 band-pass envelopes were used
– Phone alignment

Pronunciation manner   Segment-based recog. rate (%)   Frame-based recog. rate (%)
Fricative              75.6                            85.2
Stop                   76.7                            72.5
Glide                  64.3                            56.5
Vowel                  90.3                            89.0
Nasal                  73.6                            77.5
Silence                89.1                            92.2
                       81.9                            82.1