Published by Scot Ball. Modified over 9 years ago.
1
Dual-domain Hierarchical Classification of Phonetic Time Series
Hossein Hamooni, Abdullah Mueen
University of New Mexico, Department of Computer Science
2
What is a Phoneme?
Phonemes are very small units of intelligible sound (usually less than 200 ms).
Phonetic spelling is the sequence of phonemes that a word comprises.
Examples:
  coat ([kōt] /K OW T/)
  from ([frəm] /F R AH M/)
  impressive ([imˈpresiv] /IH M P R EH S IH V/)
3
Phoneme Classification
What is phoneme classification?
  Input: a short segment of audio signal.
  Output: which phoneme it is.
Phoneme classification is a complex task:
  More than 100 classes (based on the International Phonetic Alphabet)
  Variation in speakers, dialects, accents, noise in the environment, etc.
Phoneme classification can be used in:
  Robust speech recognition
  Accent/dialect detection
  Speech quality scoring
4
Related Work
Different methods for phoneme classification have been used in the literature:
  Hidden Markov model [Lee, 1989]
  Neural network [Schwarz, 2009]
  Deep belief network [Mohamed, 2012]
  Support vector machine [Salomon, 2001]
  Hierarchical methods [Dekel, 2005]
  Boltzmann machine [Mohamed, 2010]
Although the data mining community has shown that k-NN classifiers can work well on time series data, they have not yet been tried on phonemes.
[C. Lopes, F. Perdigao, 2011]
5
Our Dual-domain Approach
Time Domain:
  Using k-NN
  Dynamic Time Warping (DTW)
  Expensive
  Speed up by lower bounding techniques
Frequency Domain:
  Using k-NN
  Euclidean distance between Mel-frequency cepstral coefficients (MFCC)
  Fast
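Both domains rely on the same k-NN rule and differ only in the distance function. The sketch below illustrates that idea under stated assumptions: `knn_classify` and `euclidean` are illustrative names, the two-element vectors are toy stand-ins for MFCC feature vectors, and the labels are made up. It is not the authors' implementation, and it does not show how the two domains are combined.

```python
import math
from collections import Counter

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_classify(query, train, distance, k=1):
    """Return the majority label among the k training items closest to query.

    train is a list of (features, label) pairs; distance is any function
    taking two feature objects and returning a non-negative number.
    """
    nearest = sorted(train, key=lambda item: distance(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: hypothetical 2-D "feature vectors" with made-up phoneme labels.
train = [([0.0, 0.0], "AH"), ([1.0, 1.0], "AH"), ([10.0, 10.0], "IY")]
print(knn_classify([9.0, 9.0], train, euclidean, k=1))
```

Passing a DTW function as `distance` instead of `euclidean` would give the time-domain variant of the same classifier.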
6
Real Example
7
Challenge
DTW is expensive (quadratic in time and space complexity).
We need to apply a speed-up technique.
Solution: lower bounding techniques with a warping window w.
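To make the cost concrete, here is a minimal sketch of DTW restricted to a warping window w. The function name `dtw_distance`, the squared-error point cost, and the final square root are assumptions for illustration, not details taken from the slides.

```python
import math

def dtw_distance(x, y, w):
    """DTW distance between sequences x and y with warping window w.

    Only cells (i, j) with |i - j| <= w are filled in. Returns math.inf
    when the window is too narrow to connect the two end points.
    """
    n, m = len(x), len(y)
    if abs(n - m) > w:
        return math.inf
    # cost[i][j] = best cumulative cost of aligning x[:i] with y[:j].
    # The full (n+1) x (m+1) table is allocated, so space stays quadratic.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # x[i-1] repeats a match
                                 cost[i][j - 1],      # y[j-1] repeats a match
                                 cost[i - 1][j - 1])  # both advance
    return math.sqrt(cost[n][m])

print(dtw_distance([5, 5, 0], [5, 5, 5, 0], w=1))
```

With a window as wide as the longer signal, the inner loop visits every cell and the running time is quadratic; a cheap lower bound lets a k-NN search skip this computation for candidates that cannot be the nearest neighbor.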
8
DTW Lower bounding
Resampling to equal length doesn't always work!
9
DTW Lower bounding
We use the prefix of the longer signal (Prefixed LB_Keogh).
We show that Prefixed LB_Keogh is a lower bound if w > the difference between the lengths of the two signals.
We set w = c * length of the longer signal.
We ignore all pairs of signals that don't satisfy the above condition.
[Plot: Speedup vs. Training Set Size (x10^4)]
[Plot: Accuracy (%) vs. Window Size (c%)]
c = 30%
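The slide does not spell out how the envelope is built, so the following is only one possible reading of Prefixed LB_Keogh. Assumptions: the names `prefixed_lb_keogh` and `window_size` are illustrative, the point cost is squared error, the bound is summed over the positions of the shorter signal (that is, over a prefix of the longer one), the envelope at position i spans samples i-w..i+w of the longer signal, and w is obtained by truncating c * length to an integer.

```python
import math

def window_size(a, b, c=0.30):
    """w = c * length of the longer signal (integer truncation is an assumption)."""
    return int(c * max(len(a), len(b)))

def prefixed_lb_keogh(a, b, w):
    """Lower bound on window-w DTW between a and b, or None if the pair is skipped.

    Each sample of the shorter signal must be aligned with some sample of the
    longer signal at most w positions away, so its distance to the min/max
    envelope over that window can never exceed its true alignment cost.
    """
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    if w <= len(long_) - len(short):
        return None  # condition from the slide (w > length difference) not met
    total = 0.0
    for i, v in enumerate(short):
        window = long_[max(0, i - w):min(len(long_), i + w + 1)]
        upper, lower = max(window), min(window)
        if v > upper:
            total += (v - upper) ** 2
        elif v < lower:
            total += (lower - v) ** 2
    return math.sqrt(total)

print(prefixed_lb_keogh([0, 0, 0], [5, 5, 5, 5], w=2))
```

In a k-NN search, a candidate whose bound already exceeds the best distance found so far can be discarded without running the full DTW computation.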
10
Data Collection
370,000 phonemes are segmented from: [sources shown in a figure not preserved here]
The data is publicly available.
11
Phoneme Segmentation
The Penn Phonetics Lab Forced Aligner (p2fa):
  Takes a signal and a transcript
  Produces timing segmentations (word level and phoneme level)
12
Accuracy (All layers)
10-fold cross validation
100 random phonemes in each fold
13
Accented Phoneme Classification
British vs. American accent
Using Oxford test set
2-class classification problem
No hierarchy
[Plot: Accuracy vs. Training Set Size (x10^4) for MFCC and DTW]
14
Conclusion
We present a dual-domain hierarchical method for phoneme classification.
We generate a novel dataset of 370,000 phonemes.
We achieve up to 73% accuracy for 39 classes.
Our lower bounding technique gives us up to 3x speedup.
15
Thank You
Data and code available at: http://cs.unm.edu/~hamooni/papers/Dual_2014