Published by Tobias Richards. Modified over 7 years ago.
1
Automatic Recognition of Pathological Phoneme Pronunciation Using HMM and DTW methods
Robert Wielgat, Daniel Król: Department of Technology, Higher State Vocational School, Tarnów, POLAND
Tomasz P. Zieliński: Department of Telecommunications, AGH University of Science and Technology, Kraków, POLAND
Tomasz Woźniak, Stanisław Grabias: Division of Logopedics and Applied Linguistics, Maria Curie-Skłodowska University, Lublin, POLAND
The work was sponsored by KBN grant no. 1 H01F 046 28.
Mr. Chairman, Ladies and Gentlemen, my name is Robert Wielgat and I represent the Higher State Vocational School in Tarnów, Poland. I would like to present the results of preliminary research on HFCC-based pathological speech recognition. This research was carried out in cooperation with Professor Tomasz Zieliński from the AGH University of Science and Technology in Kraków, Poland, with Łukasz Hołda and Daniel Król from my alma mater, and with Professors Tomasz Woźniak and Stanisław Grabias from Maria Curie-Skłodowska University in Lublin, Poland. The work was sponsored by the Polish State Committee for Scientific Research (KBN).
2
Investigated speech disorders among children
sz a f a: correct pronunciation
a f a: elision
s a f a: substitution
s’’ a f a: deformation
Let us start with the types of incorrect pronunciation of a particular phoneme in a word. These speech disorders can be classified as elisions, substitutions, and deformations. To define each type, the Polish word ‘szafa’ is used as an example. Elision is the lack of realisation of the phoneme in the word. In substitution, the correct phoneme is replaced by another correct phoneme, which makes the pronunciation of the whole word incorrect. Deformation is the replacement of the correct phoneme by its deformed version. In this research we investigated substitutions only.
3
Diagnosis and therapy of speech disorders
Diagnosis: a difficult recognition task because of the many substituted and deformed phonemes in the vocabulary, which are often acoustically similar; speaker-independent recognition. Diagram: ‘sz’ in ‘sz a f a’ may be realised as ø, s, c or s’’.
Therapy: the recognition task is limited to recognizing usually two phonemes, the correct phoneme and the substituted one; the substituted phoneme can be recognized speaker-dependently. Diagram: ‘sz a f a’ versus ‘s a f a’.
Now I will present the diagnosis and therapy of speech disorders from the automatic speech recognition point of view. Diagnosis is a rather difficult recognition task because the speech recognizer must recognize all possible phoneme substitutions or deformations, which are often acoustically similar. Another complication is that phonemes must be recognized in a speaker-independent manner. The therapy case is a much simpler task for the speech recognizer, because recognition is usually limited to two phonemes: the correct phoneme and the substituted one. The type of substituted phoneme depends on the earlier diagnosis. What is more, the substituted phoneme can be recognized in a speaker-dependent manner.
4
Speech recognizer applied to the therapy case
FEATURE EXTRACTION:
MFCC (Mel-Frequency Cepstral Coefficients)
HFCC (Human Factor Cepstral Coefficients)
CLASSIFICATION METHODS:
Word-based Dynamic Time Warping
Phoneme-based Dynamic Time Warping
Hidden Markov Models (HMM) of whole words and of phonemes
In this work we considered the application of the speech recognizer to the therapy case only, where one phoneme is substituted by another one. The speech recognizer can be modeled as two consecutive processes: feature extraction followed by classification. At the feature extraction stage three methods were investigated: CC and MFCC, the standard methods commonly used in state-of-the-art speech recognition, and HFCC, a recently proposed method. At the classification stage we tested standard word-based Dynamic Time Warping as well as phoneme-based Dynamic Time Warping, proposed by our team. We also carried out an initial experiment using Hidden Markov Models.
5
Mel-Frequency Cepstral Coefficients (MFCC)
1) blocking signal into frames, windowing by Hamming window
2) performing FFT on windowed frames
3) addition of the FFT power in some frequency band
4) calculation of log of accumulated spectral coefficients
5) performing DCT on them (n = 0, 1, 2, ..., q-1)
6) calculating first and second derivatives of the DCT coefficients with respect to time, the so-called delta and delta-delta coefficients
Now I will try to explain how MFCC are calculated. The first stage is blocking the signal into frames and windowing them with a Hamming or another window. Next, a Fast Fourier Transform is performed on the windowed frames. Afterwards, the FFT power within each frequency band is summed.
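Steps 4 to 6 can be sketched as below. This is only an illustration under stated assumptions: the slide's DCT formula is not reproduced in this transcript, so the common DCT-II form is assumed, and the delta coefficients use a simple central difference rather than whatever regression window the authors applied. The function names are hypothetical.

```python
import math

def cepstrum_from_band_energies(energies, q):
    """Steps 4-5: log of the accumulated band energies, then a DCT-II.

    Assumed form: c_n = sum_k log(E_k) * cos(pi * n * (k + 0.5) / K),
    for n = 0 .. q-1, where K is the number of frequency bands.
    """
    K = len(energies)
    log_e = [math.log(e) for e in energies]
    return [
        sum(log_e[k] * math.cos(math.pi * n * (k + 0.5) / K) for k in range(K))
        for n in range(q)
    ]

def delta(frames):
    """Step 6 (simplified): central difference over time, edges replicated.

    Applying it twice gives delta-delta coefficients.
    """
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        prev_f = frames[max(t - 1, 0)]
        next_f = frames[min(t + 1, last)]
        out.append([(b - a) / 2.0 for a, b in zip(prev_f, next_f)])
    return out
```

A flat spectrum puts all of its energy into the zeroth coefficient, which is a quick sanity check for the DCT step.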
6
Mel-Frequency Cepstral Coefficients (MFCC)
“Addition of the FFT power in some frequency bands”
Center frequencies of these bands are equally spaced on the mel frequency scale.
Filter bandwidth is coupled with filter spacing (50% overlap).
The most important issue in mel-frequency cepstral coefficients is the spacing of the center frequencies of the frequency bands. These center frequencies are equally spaced on the mel frequency scale.
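The coupling between spacing and bandwidth can be illustrated with a short sketch. It assumes the widely used mel mapping mel(f) = 2595 log10(1 + f/700), which the slides do not state, and the band limits (0 Hz to 24 kHz) are hypothetical values chosen only for the example.

```python
import math

def hz_to_mel(f_hz):
    # Assumed mel mapping; the presentation does not give its formula.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filter_edges(n_filters, f_low, f_high):
    """Return (lower edge, center, upper edge) in Hz for each filter.

    Centers are equally spaced in mel. Each filter reaches from the previous
    center to the next one, so bandwidth is fixed by the spacing and
    neighbouring filters overlap by 50%.
    """
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (n_filters + 1)
    points = [mel_to_hz(lo + i * step) for i in range(n_filters + 2)]
    return [(points[i], points[i + 1], points[i + 2]) for i in range(n_filters)]

# Hypothetical band limits: 0 Hz up to the Nyquist frequency of 48 kHz audio.
edges = mel_filter_edges(30, 0.0, 24000.0)
```

In this scheme the bandwidth cannot be chosen independently of the number of filters, which is the limitation HFCC removes.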
7
Human-Factor Cepstral Coefficients (HFCC)
In HFCC, filter center frequencies are equally spaced on the mel frequency scale, as in the MFCC method, but filter bandwidth is a design parameter, measured in equivalent rectangular bandwidth (ERB) as an approximation of critical bands (Moore & Glasberg 1983):
ERB = 6.23 fc^2 + 93.39 fc + 28.52 [Hz]
where the filter center frequency fc is expressed in kHz. When a filter bandwidth wider than the ERB is used (the ERB is scaled by some factor > 1), HFCC-based speech recognition can, under some circumstances, be more resistant to noise.
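A minimal sketch of the bandwidth rule follows. It assumes the Moore and Glasberg (1983) polynomial ERB = 6.23 fc^2 + 93.39 fc + 28.52 Hz with fc in kHz; the function name and the `scale` argument (standing in for the ERB scale factor) are hypothetical.

```python
def erb_hz(fc_hz, scale=1.0):
    """Equivalent rectangular bandwidth in Hz for a center frequency in Hz.

    Assumes the Moore & Glasberg (1983) polynomial, with fc converted to kHz.
    `scale` plays the role of the ERB scale factor (values > 1 widen filters).
    """
    fc_khz = fc_hz / 1000.0
    return scale * (6.23 * fc_khz ** 2 + 93.39 * fc_khz + 28.52)

# Bandwidth grows with center frequency and with the scale factor.
narrow = erb_hz(100.0)           # low-frequency filter
wide = erb_hz(1000.0)            # filter centered at 1 kHz
scaled = erb_hz(1000.0, scale=4.0)
```

Because the bandwidth depends only on the center frequency and the scale factor, the number of filters can be changed without changing how wide each filter is.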
8
Human-Factor Cepstral Coefficients (HFCC)
Center frequencies are equally spaced on the mel frequency scale. Filter bandwidth is a design parameter.
This picture shows the spacing of the center frequencies and the bandwidths of the frequency bands in HFCC. The spacing is exactly the same as in the MFCC case, but the bandwidth is changed: it is narrow at the beginning and becomes wider towards the end of the mel frequency scale.
9
DTW speech recognition
[Figure: DTW grids aligning WORD X (frame index iX) with WORD Y (frame index iY), with frame indices running from 1 to N and M. Panel "Word-based DTW": silence, phoneme „sz”, phoneme „a”, silence. Panel "Phoneme-based DTW": phoneme „sz”, phoneme „a”. Annotations: "We start searching from these local distances" and "From these accumulated distances we find DTW distance between phonemes".]
In our research we made a modification to the DTW method. We introduce the so-called phoneme-based DTW, in which the DTW method is applied only to a particular phoneme.
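The idea can be sketched with a textbook DTW. This is an illustration, not the authors' implementation: it assumes a Euclidean local distance and the three basic step directions, and it approximates phoneme-based DTW by simply running DTW on the frames of one phoneme, with hypothetical, externally supplied segment boundaries.

```python
import math

def dtw_distance(x, y):
    """Accumulated DTW distance between two sequences of feature vectors."""
    n, m = len(x), len(y)
    inf = float("inf")
    # acc[i][j] = best accumulated distance aligning x[:i] with y[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(x[i - 1], y[j - 1])  # Euclidean local distance
            acc[i][j] = local + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]

def phoneme_dtw_distance(x, y, x_seg, y_seg):
    """DTW restricted to one phoneme; segments are (start, end) frame indices."""
    return dtw_distance(x[x_seg[0]:x_seg[1]], y[y_seg[0]:y_seg[1]])

# Toy one-dimensional "features": the second utterance stretches the middle frame.
word_x = [[0.0], [1.0], [2.0]]
word_y = [[0.0], [1.0], [1.0], [2.0]]
whole = dtw_distance(word_x, word_y)
first_phoneme = phoneme_dtw_distance(word_x, word_y, (0, 2), (0, 3))
```

Restricting the comparison to the phoneme of interest keeps the rest of the word, which is identical in both pronunciations, from diluting the distance.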
10
HMM-based speech recognition
Whole-word hidden Markov models:
Training: re-estimation by the Viterbi algorithm and the Baum-Welch algorithm.
Recognition: Viterbi algorithm.
Phoneme hidden Markov models:
Training: re-estimation by embedded training (a modified Baum-Welch algorithm).
Recognition: Token Passing Model.
Detailed algorithm of training phoneme hidden Markov models:
1) Initialization of the HMM parameters: AT, μj, Σj
2) Embedded Training x 3
3) Fixing the Silence Models
4) Embedded Training x 2
5) Making Triphones from Monophones
6) Making Tied-State Triphones
7) Finally estimated HMM parameters: AT, μj, Σj
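The recognition step for whole-word models can be sketched as a log-domain Viterbi search. This is a generic illustration under simplifying assumptions (only emitting states, the path starts in the first state and ends in the last one); it is not the toolkit implementation used in the experiments, and all names are hypothetical.

```python
import math

NEG_INF = float("-inf")

def viterbi_log_score(log_a, log_b):
    """Best-path log-likelihood of an observation sequence.

    log_a[i][j]: log transition probability from state i to state j.
    log_b[t][j]: log probability of observation t in state j.
    """
    n_states = len(log_a)
    score = [NEG_INF] * n_states
    score[0] = log_b[0][0]  # the path must start in the first state
    for t in range(1, len(log_b)):
        score = [
            max(score[i] + log_a[i][j] for i in range(n_states)) + log_b[t][j]
            for j in range(n_states)
        ]
    return score[-1]  # the path must end in the last state

# Toy two-state left-to-right model with two observations.
log_a = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
log_b = [[math.log(0.9), math.log(0.1)], [math.log(0.2), math.log(0.8)]]
best = viterbi_log_score(log_a, log_b)
```

A recognizer would compute this score for each competing word model, for example the correct and the substituted pronunciation, and choose the model with the higher score.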
11
Hidden Markov Models used in experiments
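[Figure: Structure of the hidden Markov model for a whole word and for a phoneme. States 1 to 5 in a left-to-right topology with transitions a12, a22, a23, a33, a34, a44, a45; observations o1 to o6 emitted with probabilities b2(o1), b2(o2), b2(o3), b3(o4), b4(o5), b4(o6).]
Observation probability:
$b_j(o) = \frac{1}{\sqrt{(2\pi)^n |\Sigma_j|}} \exp\left(-\frac{1}{2}(o - \mu_j)^T \Sigma_j^{-1} (o - \mu_j)\right)$
where: Σj is the covariance matrix, μj is the mean observation, and n is the dimension of the observation.
[Figure: word network START → {[ʂa:fa], [sa:fa]} → STOP.]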
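The observation probability of a single Gaussian state can be evaluated as in the sketch below. It assumes a diagonal covariance matrix, a common simplification that the slide does not confirm, and works in the log domain to avoid numerical underflow; the function name is hypothetical.

```python
import math

def log_gaussian_diag(o, mean, var):
    """Log of a multivariate Gaussian density with diagonal covariance.

    o, mean, var are equal-length lists; var holds the diagonal of the
    covariance matrix (assumption: off-diagonal terms are zero).
    """
    n = len(o)
    log_det = sum(math.log(v) for v in var)
    mahalanobis = sum((x - m) ** 2 / v for x, m, v in zip(o, mean, var))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + mahalanobis)

# A two-dimensional observation exactly at the mean of a unit-variance state.
at_mean = log_gaussian_diag([0.0, 0.0], [0.0, 0.0], [1.0, 1.0])
off_mean = log_gaussian_diag([1.0, 0.0], [0.0, 0.0], [1.0, 1.0])
```

These log values are what a Viterbi or token-passing decoder adds to the transition log-probabilities along each path.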
12
Experiments – vocabulary
Pairs of Polish phonemes recognized in the experiments:
{s, ʂ}, {ɕ, ʂ} embedded in the word szafa [ʂa:fa] and its deformed versions: safa [sa:fa], siafa [ɕa:fa]
{ts, t͡ʂ}, {ʨ, t͡ʂ} embedded in the word czapka [t͡ʂa:pka] and its deformed versions: capka [tsa:pka], ciapka [ʨa:pka]
{dz, d͡ʐ}, {ʥ, d͡ʐ} embedded in the word drzewo [d͡ʐe:vo] and its deformed versions: dzewo [dze:vo], dziewo [ʥe:vo]
Utterances were recorded with a 48 kHz sampling frequency and 16-bit resolution.
13
Experiments - training and testing sets
Experiment number Word pronunciation Type of set Training set Testing set Female Male 1 szafa 4 8 safa 4 (1) 7 9 (5) 2 3 6 siafa czapka capka 7 (1) ciapka 5 drzewo dzewo dziewo
14
Experiments - parameters of feature extraction
Parameter            Features: MFCC, HFCC_1
Pre-emphasis         H(z) = 1 - 0.9375 z^-1
Frame length         30 ms
Frame shift          10 ms
Window               Hamming
No. of filters       30
No. of coefficients  15
DFT length           4096
ERBscaleFactor       MFCC: --; HFCC: 1, 4
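The front-end settings from the table can be illustrated with a short sketch: pre-emphasis with H(z) = 1 - 0.9375 z^-1, then 30 ms Hamming-windowed frames taken every 10 ms at a 48 kHz sampling rate. The function names are hypothetical and the handling of the first sample and of trailing samples is an assumption.

```python
import math

def pre_emphasize(signal, coeff=0.9375):
    """Apply y[n] = x[n] - coeff * x[n-1]; the first sample is passed through."""
    if not signal:
        return []
    return [signal[0]] + [signal[i] - coeff * signal[i - 1] for i in range(1, len(signal))]

def frame_signal(signal, fs=48000, frame_ms=30, shift_ms=10):
    """Cut the signal into Hamming-windowed frames; trailing samples are dropped."""
    length = fs * frame_ms // 1000   # 1440 samples at 48 kHz
    shift = fs * shift_ms // 1000    # 480 samples at 48 kHz
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (length - 1)) for k in range(length)]
    return [
        [signal[start + k] * window[k] for k in range(length)]
        for start in range(0, len(signal) - length + 1, shift)
    ]

# 100 ms of a constant toy signal.
frames = frame_signal(pre_emphasize([1.0] * 4800))
```

Each 1440-sample frame would then be zero-padded to the DFT length of 4096 before the spectrum is computed.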
15
Results – Comparison of DTW methods
The average recognition accuracy of phoneme-based DTW was usually higher than that of the standard DTW method.
16
Results – Comparison of HMMs for whole word and HMMs for phoneme methods
17
Results – Comparison of HMMs for HFCC and MFCC
18
Results – Comparison of HMM and DTW methods
19
Results – Preliminary recognition with phoneme-based word network
Word network, 1st version: START → {[ʂa:fa], [sa:fa]} → STOP
Word network, 2nd version: START → {ʂ, s} → a: → f → a → STOP
Very initial results indicate comparable speech recognition accuracy for the phoneme-based word network (2nd version),  %, and for the standard word network (1st version), 93.55%.
20
Conclusions Future research
A comparative analysis of the DTW and HMM methods in pathological speech recognition has been presented. The therapy case was considered. Although the results are somewhat ambiguous, some trends can be observed:
HFCC-based speech recognition gave better recognition results than standard MFCC features. After proper selection of parameters, the HFCC method can be used for the pathological speech recognition task.
At the present stage of research it can be stated that the DTW method outperforms the HMM method in the considered recognition task; however, further investigation of the problem is necessary.
The phoneme-based DTW method gave better results than the standard DTW method.
The results obtained allow the recognition methods to be implemented in a real-world application for the therapy of the substitutions sz-s, cz-c, drz-dz. For the substitutions sz-si, cz-ci, drz-dzi more efficient methods have to be found.
Future research
More phonemes will be tested, and larger testing and training sets will be used in the experiments.
More sophisticated Markov models will be examined, e.g. with more Gaussian mixtures.
Other probable research directions are PCA and discriminant analysis.
21
Thank you for your attention