Published by Julian Ross. Modified over 8 years ago.
1
1 A speech recognition system for Swedish running on Android Simon Lindholm LTH May 7, 2010
2
2 Signal Processing - Windowing ● Split raw PCM data into windows ● 25 ms long ● 10 ms shift ● Apply window function ● Hamming: w(n) = 0.54 – 0.46 * cos(2πn / N)
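The windowing step above can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the 16 kHz sample rate, the function names `hamming` and `frames`, and the choice to drop a trailing partial window are all assumptions.

```python
import math

def hamming(N):
    """Window from the slide: w(n) = 0.54 - 0.46 * cos(2*pi*n / N), n = 0..N-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / N) for n in range(N)]

def frames(samples, sample_rate, length_ms=25, shift_ms=10):
    """Split PCM samples into overlapping windows and apply the Hamming window."""
    length = int(sample_rate * length_ms / 1000)   # samples per window
    shift = int(sample_rate * shift_ms / 1000)     # samples between window starts
    w = hamming(length)
    out = []
    # Assumption: only completely filled windows are kept; trailing samples are dropped.
    for start in range(0, len(samples) - length + 1, shift):
        out.append([samples[start + n] * w[n] for n in range(length)])
    return out

pcm = [1.0] * 16000              # one second of dummy audio at an assumed 16 kHz
windows = frames(pcm, 16000)     # 400-sample windows, 160-sample shift
```

With these assumed settings each window holds 400 samples and consecutive windows overlap by 240 samples.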
3
3 Signal Processing - MFCC ● Mel scale – non-linear scale of human perception of pitch ● X = DFT(x) ● Triangular filters( |X|^2 ) ● DCT ● Output is 13 element long feature vector
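A compact sketch of the MFCC chain listed above (DFT, triangular mel filters on the power spectrum, DCT) might look like this. Several details are assumptions that the slide does not state: the mel formula 2595 * log10(1 + f / 700), the logarithm taken before the DCT, the 16 kHz sample rate, and every function name. The 29 filters and 13 coefficients match the values quoted elsewhere in the deck.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)   # one common mel formula (assumed)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """|X|^2 for bins 0..N/2, using a direct (slow) DFT."""
    N = len(frame)
    return [abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N)
                    for n in range(N))) ** 2
            for k in range(N // 2 + 1)]

def mel_filterbank(num_filters, num_bins, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    edges_hz = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    edges = [f / (sample_rate / 2) * (num_bins - 1) for f in edges_hz]  # fractional bins
    bank = []
    for j in range(1, num_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for k in range(num_bins):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))    # rising slope
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))    # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mfcc(frame, sample_rate=16000, num_filters=29, num_coeff=13):
    spec = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(spec), sample_rate)
    # Assumption: log of the filter energies before the DCT; the floor avoids log(0).
    energies = [math.log(max(sum(w * p for w, p in zip(row, spec)), 1e-10))
                for row in bank]
    M = num_filters
    # DCT-II of the log energies, keeping the first num_coeff coefficients.
    return [sum(energies[m] * math.cos(math.pi * c * (m + 0.5) / M) for m in range(M))
            for c in range(num_coeff)]

frame = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(400)]
features = mfcc(frame)   # 13 values for one 25 ms window
```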
4
4 Signal Processing... PCM DataSequence of vectors
5
5 Hidden Markov Models 1 ● N - Number of states ● Π - Initial state distribution ● A - Transitions probabilities ● Ω - Termination probabilities ● B - Probability density functions 32 a 13 a 12 a 23 Ω3Ω3 π1π1
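The five parameters above could be held in a small structure like the one below. It is only a sketch: the class name, the field names, the example numbers and the normalisation rule (each state's outgoing transitions plus its termination probability sum to 1) are assumptions, not taken from the slides.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class HMM:
    pi: List[float]                                # Π: initial state distribution
    A: List[List[float]]                           # A[i][j]: transition i -> j
    omega: List[float]                             # Ω: termination probabilities
    B: List[Callable[[Sequence[float]], float]]    # one density function per state

    @property
    def N(self):
        """Number of states."""
        return len(self.pi)

    def is_normalised(self, tol=1e-9):
        # Assumption: from state i the model either moves to some state j or terminates.
        ok_pi = abs(sum(self.pi) - 1.0) < tol
        ok_rows = all(abs(sum(self.A[i]) + self.omega[i] - 1.0) < tol
                      for i in range(self.N))
        return ok_pi and ok_rows

flat = lambda x: 1.0   # placeholder density, for illustration only

# Hypothetical three-state left-right model; self-loop values are made up.
model = HMM(
    pi=[1.0, 0.0, 0.0],
    A=[[0.5, 0.4, 0.1],
       [0.0, 0.6, 0.4],
       [0.0, 0.0, 0.7]],
    omega=[0.0, 0.0, 0.3],
    B=[flat, flat, flat],
)
```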
6
6 Hidden Markov Models Observation Probability Density Functions ● Multi-variate Gaussian mixtures ● C – mixture weights ● μ - mean vectors ● Σ - covariance matrices ● Gives probabilities for the feature vectors
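As an illustration of how a Gaussian mixture scores a feature vector, the sketch below evaluates p(x) = sum over k of c_k * N(x; μ_k, Σ_k). It assumes diagonal covariance matrices (only per-dimension variances), which is a simplification chosen here; the function names and the example numbers are hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (assumed form)."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_density(x, weights, means, variances):
    """Weighted sum of the component densities."""
    return sum(c * math.exp(log_gaussian_diag(x, m, v))
               for c, m, v in zip(weights, means, variances))

# Two equally weighted components in two dimensions (made-up values).
p = gmm_density([0.0, 0.0],
                weights=[0.5, 0.5],
                means=[[0.0, 0.0], [1.0, 1.0]],
                variances=[[1.0, 1.0], [1.0, 1.0]])
```

Strictly speaking the result is a density value rather than a probability, so it can exceed 1 for narrow components.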
7
7 Hidden Markov Models Modelling Phonemes 1 ● Phonemes are trained as left- right HMMs with 3 states ● Bees-clustering for initial approximation ● Baum-Welch re-estimation for refinement 32 a 13 a 12 a 23 Ω3Ω3 π1π1 ● 46 phonemes ● Potentially 46 3 = 97,000 triphones ● Only about 300 triphones actually used for a dictionary of ~100 words
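The gap between the theoretical and the actually used number of triphones can be illustrated with a short sketch. The two-word dictionary, its transcriptions and the use of "#" as a word-boundary context are made up for the example.

```python
def triphones(phonemes):
    """Each phoneme with its left and right neighbour; '#' marks a word boundary."""
    padded = ["#"] + list(phonemes) + ["#"]
    return [tuple(padded[i - 1:i + 2]) for i in range(1, len(padded) - 1)]

upper_bound = 46 ** 3   # 97336, the slide's "about 97,000"

# Hypothetical dictionary with invented transcriptions.
dictionary = {"sal": ["s", "a:", "l"], "sak": ["s", "a:", "k"]}
used = {t for phones in dictionary.values() for t in triphones(phones)}
# Both words start with the same triphone, so only 5 distinct triphones are needed.
```

Only triphones that occur in some dictionary word need a trained model, which is why a small dictionary uses a tiny fraction of the upper bound.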
8
8 Hidden Markov Models Building Word Graphs ● Triphone HMMs combined into a larger HMM, representing the words in a dictionary. ● Phoneme HMMs are dense - implemented using matrices ● Word graphs are sparse – implemented using graph, node and edge objects ● TRIE – Words share common prefixes
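The prefix sharing mentioned in the last bullet can be sketched with a plain trie over phoneme sequences. This is a simplified stand-in for the word graph: it stores phonemes rather than triphone HMMs, and the `Node` class, the helper names and the dictionary entries are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}   # phoneme -> Node
        self.word = None     # set on the node that ends a word

def build_trie(dictionary):
    root = Node()
    for word, phones in dictionary.items():
        node = root
        for ph in phones:
            # Reuse the existing child when another word shares this prefix.
            node = node.children.setdefault(ph, Node())
        node.word = word
    return root

def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.children.values())

# Invented transcriptions, for illustration only.
dictionary = {"sal": ["s", "a:", "l"], "sak": ["s", "a:", "k"], "sol": ["s", "u:", "l"]}
root = build_trie(dictionary)
total = count_nodes(root)   # 7 nodes instead of 10 without prefix sharing
```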
9
9 Word Recognition Viterbi Algorithm ● Finds the most probable path through an HMM for a sequence of observations. ● N T possible paths ● Dynamic Programming, optimal path must include optimal subpath ● For every node i at time t, discards every path leading up to it except the one with the highest probability ● O(TN 2 ) instead of O(N T )
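A minimal log-domain sketch of the Viterbi recursion described above follows. It takes precomputed observation probabilities instead of evaluating density functions and leaves out termination probabilities; both are simplifications made for the example, and the two-state model is a made-up toy.

```python
import math

NEG_INF = float("-inf")

def log(p):
    return math.log(p) if p > 0 else NEG_INF

def viterbi(pi, A, emit):
    """emit[t][i] = probability of observation t in state i.
    Returns (log probability of the best path, best path as state indices)."""
    N, T = len(pi), len(emit)
    score = [log(pi[i]) + log(emit[0][i]) for i in range(N)]
    back = []
    for t in range(1, T):
        new_score, pointers = [], []
        for j in range(N):
            # Keep only the best path into state j at time t: N candidates per state.
            best_i = max(range(N), key=lambda i: score[i] + log(A[i][j]))
            new_score.append(score[best_i] + log(A[best_i][j]) + log(emit[t][j]))
            pointers.append(best_i)
        score = new_score
        back.append(pointers)
    last = max(range(N), key=lambda i: score[i])
    path = [last]
    for pointers in reversed(back):      # follow back-pointers from the end
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path

# Toy two-state model with three observations (values invented for illustration).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]]
logp, path = viterbi(pi, A, emit)
```

The two nested loops over states inside the loop over time steps give the O(TN^2) cost quoted on the slide.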
10
10 Word Recognition Viterbi Beam Search ● For large HMMs (word graphs) the N 2 term in O(TN 2 ) may become too large. ● Exactly the same algorithm as Viterbi, except only the K best hypotheses are explored. ● Implemented using lists state objects instead of matrices.
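The beam variant can be sketched as below: the same recursion, but over a sparse edge list, with the active hypotheses pruned to the K best before each expansion. The data layout (dictionaries keyed by state name), the function name and the toy model are assumptions for illustration, not the implementation described in the deck.

```python
import heapq
import math

def beam_search(start, edges, emit_logp, observations, K):
    """start: {state: log prob}; edges: {state: [(next_state, log prob), ...]};
    emit_logp(state, obs): log probability of obs in state.
    Returns the best surviving (log prob, path)."""
    active = {s: (lp + emit_logp(s, observations[0]), [s]) for s, lp in start.items()}
    for obs in observations[1:]:
        # Prune: keep only the K highest-scoring hypotheses.
        active = dict(heapq.nlargest(K, active.items(), key=lambda kv: kv[1][0]))
        nxt = {}
        for s, (score, path) in active.items():
            for s2, lp in edges.get(s, []):
                cand = score + lp + emit_logp(s2, obs)
                if s2 not in nxt or cand > nxt[s2][0]:   # Viterbi step per target state
                    nxt[s2] = (cand, path + [s2])
        active = nxt
    return max(active.values(), key=lambda v: v[0])

# Toy two-state model (invented values).
L = math.log
start = {"H": L(0.6), "F": L(0.4)}
edges = {"H": [("H", L(0.7)), ("F", L(0.3))], "F": [("H", L(0.4)), ("F", L(0.6))]}
table = {"H": [0.5, 0.4, 0.1], "F": [0.1, 0.3, 0.6]}
emit = lambda s, o: L(table[s][o])
best_logp, best_path = beam_search(start, edges, emit, [0, 1, 2], K=2)
```

With K at least as large as the number of states this reduces to plain Viterbi; a smaller K trades the guarantee of finding the best path for less work per time step.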
11
11 Implementation ● ~22000 lines C++, 4500 lines Perl ● C++ mostly for computational heavy lifting ● Perl mostly for text manipulation ● Unix style: one program does one thing ● Common configuration file
12
12 Implementation Waxholm Corpus ● Corpus from the Waxholm Dialog project at KTH ● 3 hours 30 minutes spoken sentences ● Mostly Swedish, some sentences in English ● Sentence, Word and Phoneme level annotations ● Very complex and irregular annotation file format ● Common configuration file
13
13 Implementation Configuration File math { random_seed = 326 } efx { type = mfcc delta1 = (2,-2) delta2 = (1,-1) train_ratio = 0.7 window { function = cosine length_ms = 25 ; shift_ms = 10 min_filled = 0.8 } mfcc { num_coeff = 13 ; num_filters = 29 } } hmm { num_states = 3 num_mixtures = 2 use_covariance = false statemodel = bakis-x train_order = bees, baumwelch...
14
14 Implementation Overview Waxholm Corpus parse_cor pus.pl waxholm.txt waxholm.wav efx_gen efx_split ph_all.efx hmm_test hmm_train phonemes.hmm ph_train.efxph_test.efx Phoneme test results mk_cfmatrix Confusion Matrix mk_gx.pl word_graph.gx word_test Word test results file1.wav file2.wav file3.wav... dictionary.txt ph_list.txt mk_phlist.pl
15
15 Implementation Results Tests run with ● 25 ms window, 10 ms shift ● Cosine window function ● 13 MFCC coefficients, 29 filters ● Level-1 and Level-2 MFCC deltas ● Triphones modelled as left-right HMMs ● 2 Gaussian mixture components ● 3 Baum-Welch iterations 97 * 5 = 485 words recorded with 3 different speakers ● 51.8% correct matches ● 85% within top 10
16
16 Implementation Results – Varying Parameters Varying parameters from the template configuration ● Number of Gaussian mixture components ● Number of MFCC coefficients ● MFCC delta levels ● Window functions: Rectangular, Cosine, Hamming, Hann
17 Implementation Results – Varying Parameters

18 Implementation Results – Varying Parameters
Varying parameters from the template configuration
● Best percentage achieved: 57.3%
● In general, more mixtures, more MFCC coefficients and more MFCC delta levels gave better results
● Hamming window slightly better than the other window functions
● When run with 4 mixtures, 20 MFCC coefficients, level-2 delta and Hamming window: 56.5%
● Most likely this crossed the threshold of too many parameters for too little training data
19
19 Android Port ● Simple recognition algorithm ported to Android as a proof of concept ● ~1700 lines Java ● Minimal interaction with the Android environment ● Does not run in real time ● Only tested on emulator – no actual phone ● Signal processing currently biggest bottleneck – very slow DFT implemented – could be drastically improved
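One standard way to speed up a slow DFT is a fast Fourier transform. The sketch below compares a direct O(N^2) DFT with a recursive radix-2 FFT. It is shown in Python for illustration only, not in the port's Java, and it assumes the window is zero-padded to a power-of-two length (for example 512 samples for a 400-sample window); neither choice comes from the slides.

```python
import cmath
import math

def dft(x):
    """Direct DFT: N output bins, each a sum over N samples."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    N = len(x)
    if N == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(-2j * math.pi * k / N) * odd[k]   # twiddle factor times odd part
        out[k] = even[k] + t
        out[k + N // 2] = even[k] - t
    return out

signal = [math.sin(2 * math.pi * 5 * n / 64) for n in range(64)]
slow, fast = dft(signal), fft(signal)
max_diff = max(abs(a - b) for a, b in zip(slow, fast))   # agreement up to rounding
```

The FFT needs on the order of N log N operations instead of N^2, which for a 512-point transform is roughly a fiftyfold reduction in multiplications.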
20
20 Android Port Screenshot
21
21 Thank you for you time!