1
Automatic speech recognition using an echo state network Mark D. Skowronski Computational Neuro-Engineering Lab Electrical and Computer Engineering University of Florida, Gainesville, FL, USA May 10, 2006
2
CNEL Seminar History
– Ratio spectrum, Oct. 2000
– HFCC, Sept. 2002
– Bats, Dec. 2004
– Electrohysterography, Aug. 2005
– Echo state network, May 2006
3
Overview
– ASR motivations
– Intro to the echo state network
– Multiple readout filters
– ASR experiments
– Conclusions
4
ASR Motivations
– Speech is the most natural form of communication among humans.
– Human-machine interaction lags behind, still relying on tactile interfaces.
– The bottleneck in machine understanding is signal-to-symbol translation.
– Human speech is a "tough" signal:
  – Nonstationary
  – Non-Gaussian
  – Nonlinear systems for production/perception
How do we handle the "non"-ness of speech?
5
ASR State of the Art
Feature extraction: HFCC
– bio-inspired frequency analysis
– tailored for statistical models
Acoustic pattern recognition: HMM
– piecewise-stationary stochastic model
– efficient training/testing algorithms
– ...but several simplistic assumptions
Language models
– use knowledge of language and grammar
– HMM implementations
– machine language understanding still elusive (e.g., spam blockers)
6
Hidden Markov Model
The premier stochastic model of non-stationary time series used for decision making.
Assumptions:
1) Speech is a piecewise-stationary process.
2) Features are independent.
3) State duration is exponential.
4) State transition probability is a function of the previous and next state only.
Can we devise a better pattern recognition model?
7
Echo State Network
A partially trained recurrent neural network (Herbert Jaeger, 2001).
Unique characteristics:
– Recurrent "reservoir" of processing elements, interconnected with random, untrained weights.
– Linear readout weights trained with simple regression provide a closed-form, stable, unique solution.
8
ESN Diagram & Equations
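The equations referenced here follow the standard ESN formulation (assuming f is the reservoir activation, e.g., tanh, and the matrices defined on the next slide):

\begin{aligned}
\mathbf{x}(n+1) &= f\big(\mathbf{W}\,\mathbf{x}(n) + \mathbf{W}_{\mathrm{in}}\,\mathbf{u}(n+1)\big)\\
\mathbf{y}(n)   &= \mathbf{W}_{\mathrm{out}}\,\mathbf{x}(n)
\end{aligned}

Some formulations also feed the input (and a bias) into the readout; the sketches that follow use the state-only readout shown here.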
9
ESN Matrices
W_in: untrained, M x M_in matrix
– zero-mean, unit-variance, normally distributed entries
– scaled by r_in
W: untrained, M x M matrix
– zero-mean, unit-variance, normally distributed entries
– scaled such that spectral radius r < 1
W_out: trained by linear regression, M_out x M matrix
– closed-form, stable, unique regression solution
– O(M^2) complexity per data point
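A minimal numpy sketch of this untrained initialization under the stated scalings (the function name and seed are illustrative, not from the talk):

import numpy as np

def init_esn(M, M_in, r=0.9, r_in=0.3, seed=0):
    # Untrained ESN weights: zero-mean, unit-variance normal entries.
    rng = np.random.default_rng(seed)
    W_in = r_in * rng.standard_normal((M, M_in))       # input weights scaled by r_in
    W = rng.standard_normal((M, M))
    W *= r / np.max(np.abs(np.linalg.eigvals(W)))      # rescale to spectral radius r
    return W_in, W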
10
Echo States Conditions
The network has echo states if x(n) is uniquely determined by the left-infinite input sequence ..., u(n-1), u(n): x(n) is an "echo" of all previous inputs.
If f is the tanh activation function:
– σ_max(W) = ||W|| < 1 guarantees echo states.
– r = |λ_max(W)| > 1 guarantees no echo states.
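A quick numeric check of both conditions for a given reservoir matrix (the random W below is illustrative only); note that the two bounds leave a gap where neither applies:

import numpy as np

rng = np.random.default_rng(0)
W = 0.9 * rng.standard_normal((60, 60)) / np.sqrt(60)   # example reservoir matrix
sigma_max = np.linalg.norm(W, 2)                 # largest singular value, ||W||
rho = np.max(np.abs(np.linalg.eigvals(W)))       # spectral radius, |lambda_max(W)|
print("echo states guaranteed:", sigma_max < 1)  # sufficient condition
print("echo states impossible:", rho > 1)        # necessary condition violated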
11
ESN Training
Minimize the mean-squared error between y(n) and the desired signal d(n). Wiener solution:
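In standard Wiener-filter notation (assuming R denotes the reservoir-state autocorrelation matrix and P the state-desired cross-correlation, consistent with the regression described on the ESN Matrices slide):

\mathbf{W}_{\mathrm{out}} = \mathbf{P}^{\mathsf{T}}\,\mathbf{R}^{-1},
\qquad \mathbf{R} = E\!\left[\mathbf{x}(n)\,\mathbf{x}(n)^{\mathsf{T}}\right],
\qquad \mathbf{P} = E\!\left[\mathbf{x}(n)\,\mathbf{d}(n)^{\mathsf{T}}\right]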
12
ESN Example: Mackey-Glass
M = 60 PEs, r = 0.9, r_in = 0.3
u(n): Mackey-Glass series, 10000 samples; d(n) = u(n+1)
Prediction gain (var(u)/var(e)):
– Input: 16.3 dB
– Wiener: 45.1 dB
– ESN: 62.6 dB
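A self-contained sketch of the same one-step prediction pipeline (the washout length and function name are assumptions; u would be the Mackey-Glass series):

import numpy as np

def esn_prediction_gain(u, M=60, r=0.9, r_in=0.3, washout=100, seed=0):
    # Drive a random reservoir with u(n) and train a Wiener (least-squares)
    # readout to predict d(n) = u(n+1); return the prediction gain in dB.
    u = np.asarray(u, dtype=float)
    rng = np.random.default_rng(seed)
    W_in = r_in * rng.standard_normal((M, 1))
    W = rng.standard_normal((M, M))
    W *= r / np.max(np.abs(np.linalg.eigvals(W)))       # spectral radius r
    x = np.zeros(M)
    states = np.zeros((len(u) - 1, M))
    for n in range(len(u) - 1):
        x = np.tanh(W @ x + W_in[:, 0] * u[n])          # reservoir update
        states[n] = x
    X, d = states[washout:], u[washout + 1:]            # discard transient
    w_out = np.linalg.lstsq(X, d, rcond=None)[0]        # closed-form readout
    e = d - X @ w_out
    return 10.0 * np.log10(np.var(d) / np.var(e))       # var(u)/var(e) in dB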
13
Multiple Readout Filters
W_out projects the reservoir space to the output space.
Question: how to divide the reservoir space and use multiple readout filters?
Answer: a competitive network of filters.
Question: how to train/test a competitive network of K filters?
Answer: mimic the HMM.
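One way to realize the competitive combination (a minimal sketch; the array shapes and names are assumptions):

import numpy as np

def winner_take_all(X, D, W_list):
    # X: (n_frames, M) reservoir states, D: (n_frames, M_out) desired outputs,
    # W_list: K readout matrices of shape (M, M_out).
    errs = np.stack([np.sum((X @ W - D) ** 2, axis=1) for W in W_list], axis=1)
    winners = errs.argmin(axis=1)        # index of the winning filter per frame
    return errs.min(axis=1), winners     # per-frame error of the winner, and who won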
14
HMM vs. ESN Classifier

                     HMM                             ESN classifier
Output               Likelihood                      MSE
Architecture         States, left-to-right           States, left-to-right
Minimum element      Gaussian kernel                 Readout filter
Elements combined    GMM                             Winner-take-all
Transitions          State transition matrix         Binary switching matrix
Training             Segmental K-means (Baum-Welch)  Segmental K-means
Discriminatory       No                              Maybe, depends on desired signal
15
Segmental K-means: Initialization
For each input x_i(n) and desired d_i(n) of sequence i:
– Divide x, d into equal-sized chunks X_η, D_η (one per state).
– For each n, select k(n) ∈ [1, K] uniformly at random.
After initialization with all sequences, compute the Wiener solution for each readout filter.
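A sketch of this initialization for one sequence (the function name and seed are illustrative):

import numpy as np

def init_assignments(n_frames, n_states, K, seed=0):
    # Equal-sized chunks: frame n belongs to state floor(n * n_states / n_frames);
    # each frame draws its readout filter k(n) uniformly from {0, ..., K-1}.
    rng = np.random.default_rng(seed)
    state = np.minimum(np.arange(n_frames) * n_states // n_frames, n_states - 1)
    k = rng.integers(0, K, size=n_frames)
    return state, k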
16
Segmental K-means: Training
For each utterance:
– Produce the MSE for each readout filter.
– Find the Viterbi path through the MSE matrix.
– Use the features from each state to update the auto- and cross-correlation matrices.
After all utterances: compute the Wiener solution.
Guaranteed to converge to a local minimum in MSE over the training set.
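A minimal left-to-right alignment sketch for the second step (assuming the MSE matrix holds a per-frame cost for each state and transitions are binary stay/advance, as in the comparison table above):

import numpy as np

def align_states(mse):
    # mse: (n_frames, n_states) per-frame cost of each state's readout filter.
    # Viterbi-style dynamic programming with stay/advance transitions only.
    n, S = mse.shape
    cost = np.full((n, S), np.inf)
    back = np.zeros((n, S), dtype=int)
    cost[0, 0] = mse[0, 0]                      # must start in the first state
    for t in range(1, n):
        for s in range(S):
            stay = cost[t - 1, s]
            advance = cost[t - 1, s - 1] if s > 0 else np.inf
            back[t, s] = s if stay <= advance else s - 1
            cost[t, s] = mse[t, s] + min(stay, advance)
    path = np.empty(n, dtype=int)
    path[-1] = S - 1                            # must end in the last state
    for t in range(n - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path                                 # state index for each frame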
17
ASR Example 1
Isolated English digits "zero" through "nine" from the TI46 corpus: 8 male and 8 female speakers, 26 utterances each, 12.5 kHz sampling rate.
ESN: M = 60 PEs, r = 2.0, r_in = 0.1, 10 word models, various numbers of states and filters per state.
Features: 13 HFCC, 100 fps, Hamming window, pre-emphasis (α = 0.95), CMS, Δ + ΔΔ (±4 frames).
Pre-processing: zero-mean and whitening transform.
M1/F1: testing; M2/F2: validation; M3-M8/F3-F8: training.
Two to six training epochs for all models.
Desired: next frame of 39-dimensional features.
Test: corrupted by additive noise from "real" sources (subway, babble, car, exhibition hall, restaurant, street, airport terminal, train station).
Baseline: HMM with identical input features.
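One common way to implement the zero-mean and whitening pre-processing step (a sketch; the exact transform used in the experiments is not specified here):

import numpy as np

def zero_mean_whiten(X, eps=1e-8):
    # X: (n_frames, n_features). Remove the mean, then decorrelate and rescale
    # the features via the eigendecomposition of the covariance (PCA whitening).
    X = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
    return (X @ vecs) / np.sqrt(vals + eps)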
18
ASR Results, noise free
Number of classification errors out of 518, shown as ESN (HMM); smaller is better. Columns: K = 1, 2, 3, 4, 5, 10 filters per state; rows: N_st states.

N_st \ K   1        2        3        4        5        10
1          7 (171)  6 (136)  3 (65)   2 (33)   3 (4)    2 (2)
2          1 (83)   1 (46)   0 (4)    1 (3)    2 (2)    1 (0)
3          0 (126)  1 (4)    0 (2)             0 (1)    2 (0)
5          1 (11)   1 (2)    0 (0)             1 (0)    0 (0)
10         1 (2)    1 (0)             0 (0)
15         0 (1)    0 (0)             1 (0)
20         0        0        0        0        0        1
19
ASR Results, noisy
Average accuracy (%) over all noise sources, 0-20 dB SNR, shown as ESN (HMM); larger is better. Columns: K = 1, 2, 3, 4, 5, 10 filters per state; rows: N_st states.

N_st \ K   1            2            3            4            5            10
1          70.9 (22.4)  70.0 (29.7)  74.6 (45.6)  74.3 (46.0)  74.3 (36.2)  75.8 (50.9)
2          76.3 (41.5)  77.6 (47.6)  78.3 (50.1)  77.7 (53.8)  77.1 (50.2)  75.8 (64.5)
3          78.8 (29.2)  79.2 (44.6)  79.3 (51.7)  79.2 (58.6)  79.1 (58.6)  78.8 (55.6)
5          81.4 (51.6)  81.1 (56.4)  81.6 (59.7)  81.9 (59.2)  81.3 (59.2)  81.3 (53.5)
10         84.6 (57.2)  84.4 (61.1)  84.4 (58.7)  83.6 (55.7)  83.5 (56.2)  81.0 (52.2)
15         85.4 (64.0)  85.1 (62.0)  85.0 (59.2)  83.8 (56.4)  82.8 (52.9)  78.4 (52.2)
20         85.8         85.6         84.0         83.5         82.5         72.3
20
ASR Results, noisy Single mixture per state (K=1): ESN classifier
21
ASR Results, noisy Single mixture per state (K=1): HMM baseline
22
ASR Example 2
Same experimental setup as Example 1.
ESN: M = 600 PEs, 10 states, 1 filter per state, r_in = 0.1, various r.
Desired: one-of-many encoding of class, ±1, with a tanh output activation function AFTER the linear readout filter.
Test: corrupted by additive speech-shaped noise.
Baseline: HMM with identical input features.
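A sketch of the ±1 one-of-many target encoding and the post-readout tanh (the class count and names are illustrative):

import numpy as np

def one_of_many_targets(label, n_classes, n_frames):
    # Desired signal for one utterance: +1 for the true class, -1 elsewhere,
    # repeated for every frame of the utterance.
    D = -np.ones((n_frames, n_classes))
    D[:, label] = 1.0
    return D

# tanh applied AFTER the trained linear readout, as described on this slide:
# Y = np.tanh(X_states @ W_out)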
23
ASR Results, noisy
24
Discussion
What gives the ESN classifier its noise-robust characteristics?
Theory: the ESN reservoir provides context for the noisy input, allowing the reservoir to reduce the effects of noise by averaging.
Theory: the nonlinearity and high dimensionality of the network increase the linear separability of classes in reservoir space.
25
Future Work
– Replace winner-take-all with mixture-of-experts.
– Replace segmental K-means with a Baum-Welch-type training algorithm.
– "Grow" the network during training.
– Consider nonlinear activation functions (e.g., tanh, softmax) AFTER the linear readout filter.
26
Conclusions
ESN classifier using inspiration from the HMM:
– Multiple readout filters per state, multiple states.
– Trained as a competitive network of filters.
– Segmental K-means guaranteed to converge to a local minimum of total MSE over the training set.
ESN classifier is noise robust compared to the HMM:
– Averaged over all noise sources, 0-20 dB SNR: +21 percentage points.
– Averaged over all noise sources: +9 dB SNR.