1
Automatic speech recognition using an echo state network Mark D. Skowronski Computational Neuro-Engineering Lab Electrical and Computer Engineering University of Florida, Gainesville, FL, USA May 10, 2006
2
CNEL Seminar History
– Ratio spectrum, Oct. 2000
– HFCC, Sept. 2002
– Bats, Dec. 2004
– Electrohysterography, Aug. 2005
– Echo state network, May 2006
3
Overview
– ASR motivations
– Intro to the echo state network
– Multiple readout filters
– ASR experiments
– Conclusions
4
ASR Motivations
– Speech is the most natural form of communication among humans.
– Human-machine interaction lags behind, still relying on tactile interfaces.
– The bottleneck in machine understanding is signal-to-symbol translation.
– Human speech is a "tough" signal:
  – Nonstationary
  – Non-Gaussian
  – Nonlinear systems for production/perception
How do we handle the "non"-ness of speech?
5
ASR State of the Art
Feature extraction: HFCC
– bio-inspired frequency analysis
– tailored for statistical models
Acoustic pattern recognition: HMM
– piecewise-stationary stochastic model
– efficient training/testing algorithms
– ...but several simplistic assumptions
Language models
– use knowledge of language and grammar
– HMM implementations
– machine language understanding still elusive (e.g., spam blockers)
6
Hidden Markov Model
The premier stochastic model of non-stationary time series used for decision making.
Assumptions:
1) Speech is a piecewise-stationary process.
2) Features are independent.
3) State duration is exponential.
4) State transition probability is a function of the previous and next state only.
Can we devise a better pattern recognition model?
7
Echo State Network
A partially trained recurrent neural network (Herbert Jaeger, 2001).
Unique characteristics:
– Recurrent "reservoir" of processing elements, interconnected with random, untrained weights.
– Linear readout weights trained with simple regression provide a closed-form, stable, unique solution.
8
ESN Diagram & Equations
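The equations referenced here follow the standard ESN formulation (assuming f is the reservoir activation, e.g., tanh, and the matrices defined on the next slide):

\begin{aligned}
\mathbf{x}(n+1) &= f\big(\mathbf{W}\,\mathbf{x}(n) + \mathbf{W}_{\mathrm{in}}\,\mathbf{u}(n+1)\big)\\
\mathbf{y}(n)   &= \mathbf{W}_{\mathrm{out}}\,\mathbf{x}(n)
\end{aligned}

Some formulations also feed the input (and a bias) into the readout; the sketches that follow use the state-only readout shown here.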
9
ESN Matrices
W_in: untrained, M x M_in matrix
– zero-mean, unit-variance, normally distributed entries
– scaled by r_in
W: untrained, M x M matrix
– zero-mean, unit-variance, normally distributed entries
– scaled such that spectral radius r < 1
W_out: trained by linear regression, M_out x M matrix
– closed-form, stable, unique regression solution
– O(M^2) complexity per data point
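A minimal numpy sketch of this untrained initialization under the stated scalings (the function name and seed are illustrative, not from the talk):

import numpy as np

def init_esn(M, M_in, r=0.9, r_in=0.3, seed=0):
    # Untrained ESN weights: zero-mean, unit-variance normal entries.
    rng = np.random.default_rng(seed)
    W_in = r_in * rng.standard_normal((M, M_in))       # input weights scaled by r_in
    W = rng.standard_normal((M, M))
    W *= r / np.max(np.abs(np.linalg.eigvals(W)))      # rescale to spectral radius r
    return W_in, W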
10
Echo States Conditions
The network has echo states if x(n) is uniquely determined by the left-infinite input sequence ..., u(n-1), u(n): x(n) is an "echo" of all previous inputs.
If f is the tanh activation function:
– σ_max(W) = ||W|| < 1 guarantees echo states.
– r = |λ_max(W)| > 1 guarantees no echo states.
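A quick numeric check of both conditions for a given reservoir matrix (the random W below is illustrative only); note that the two bounds leave a gap where neither applies:

import numpy as np

rng = np.random.default_rng(0)
W = 0.9 * rng.standard_normal((60, 60)) / np.sqrt(60)   # example reservoir matrix
sigma_max = np.linalg.norm(W, 2)                 # largest singular value, ||W||
rho = np.max(np.abs(np.linalg.eigvals(W)))       # spectral radius, |lambda_max(W)|
print("echo states guaranteed:", sigma_max < 1)  # sufficient condition
print("echo states impossible:", rho > 1)        # necessary condition violated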
11
ESN Training
Minimize the mean-squared error between y(n) and the desired signal d(n). Wiener solution:
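In standard Wiener-filter notation (assuming R denotes the reservoir-state autocorrelation matrix and P the state-desired cross-correlation, consistent with the regression described on the ESN Matrices slide):

\mathbf{W}_{\mathrm{out}} = \mathbf{P}^{\mathsf{T}}\,\mathbf{R}^{-1},
\qquad \mathbf{R} = E\!\left[\mathbf{x}(n)\,\mathbf{x}(n)^{\mathsf{T}}\right],
\qquad \mathbf{P} = E\!\left[\mathbf{x}(n)\,\mathbf{d}(n)^{\mathsf{T}}\right]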
12
ESN Example: Mackey-Glass
M = 60 PEs, r = 0.9, r_in = 0.3
u(n): Mackey-Glass series, 10000 samples; d(n) = u(n+1)
Prediction gain (var(u)/var(e)):
– Input: 16.3 dB
– Wiener: 45.1 dB
– ESN: 62.6 dB
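A self-contained sketch of the same one-step prediction pipeline (the washout length and function name are assumptions; u would be the Mackey-Glass series):

import numpy as np

def esn_prediction_gain(u, M=60, r=0.9, r_in=0.3, washout=100, seed=0):
    # Drive a random reservoir with u(n) and train a Wiener (least-squares)
    # readout to predict d(n) = u(n+1); return the prediction gain in dB.
    u = np.asarray(u, dtype=float)
    rng = np.random.default_rng(seed)
    W_in = r_in * rng.standard_normal((M, 1))
    W = rng.standard_normal((M, M))
    W *= r / np.max(np.abs(np.linalg.eigvals(W)))       # spectral radius r
    x = np.zeros(M)
    states = np.zeros((len(u) - 1, M))
    for n in range(len(u) - 1):
        x = np.tanh(W @ x + W_in[:, 0] * u[n])          # reservoir update
        states[n] = x
    X, d = states[washout:], u[washout + 1:]            # discard transient
    w_out = np.linalg.lstsq(X, d, rcond=None)[0]        # closed-form readout
    e = d - X @ w_out
    return 10.0 * np.log10(np.var(d) / np.var(e))       # var(u)/var(e) in dB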
13
Multiple Readout Filters
W_out projects the reservoir space to the output space.
Question: how to divide the reservoir space and use multiple readout filters?
Answer: a competitive network of filters.
Question: how to train/test a competitive network of K filters?
Answer: mimic the HMM.
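One way to realize the competitive combination (a minimal sketch; the array shapes and names are assumptions):

import numpy as np

def winner_take_all(X, D, W_list):
    # X: (n_frames, M) reservoir states, D: (n_frames, M_out) desired outputs,
    # W_list: K readout matrices of shape (M, M_out).
    errs = np.stack([np.sum((X @ W - D) ** 2, axis=1) for W in W_list], axis=1)
    winners = errs.argmin(axis=1)        # index of the winning filter per frame
    return errs.min(axis=1), winners     # per-frame error of the winner, and who won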
14
HMM vs. ESN Classifier

                     HMM                             ESN classifier
Output               Likelihood                      MSE
Architecture         States, left-to-right           States, left-to-right
Minimum element      Gaussian kernel                 Readout filter
Elements combined    GMM                             Winner-take-all
Transitions          State transition matrix         Binary switching matrix
Training             Segmental K-means (Baum-Welch)  Segmental K-means
Discriminatory       No                              Maybe, depends on desired signal
15
Segmental K-means: Initialization
For each input x_i(n) and desired d_i(n) of sequence i:
– Divide x, d into equal-sized chunks X_η, D_η (one per state).
– For each n, select k(n) ∈ [1, K] uniformly at random.
After initialization with all sequences, compute the Wiener solution for each readout filter.
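A sketch of this initialization for one sequence (the function name and seed are illustrative):

import numpy as np

def init_assignments(n_frames, n_states, K, seed=0):
    # Equal-sized chunks: frame n belongs to state floor(n * n_states / n_frames);
    # each frame draws its readout filter k(n) uniformly from {0, ..., K-1}.
    rng = np.random.default_rng(seed)
    state = np.minimum(np.arange(n_frames) * n_states // n_frames, n_states - 1)
    k = rng.integers(0, K, size=n_frames)
    return state, k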
16
Segmental K-means: Training
For each utterance:
– Produce the MSE for each readout filter.
– Find the Viterbi path through the MSE matrix.
– Use the features from each state to update the auto- and cross-correlation matrices.
After all utterances: compute the Wiener solution.
Guaranteed to converge to a local minimum in MSE over the training set.
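A minimal left-to-right alignment sketch for the second step (assuming the MSE matrix holds a per-frame cost for each state and transitions are binary stay/advance, as in the comparison table above):

import numpy as np

def align_states(mse):
    # mse: (n_frames, n_states) per-frame cost of each state's readout filter.
    # Viterbi-style dynamic programming with stay/advance transitions only.
    n, S = mse.shape
    cost = np.full((n, S), np.inf)
    back = np.zeros((n, S), dtype=int)
    cost[0, 0] = mse[0, 0]                      # must start in the first state
    for t in range(1, n):
        for s in range(S):
            stay = cost[t - 1, s]
            advance = cost[t - 1, s - 1] if s > 0 else np.inf
            back[t, s] = s if stay <= advance else s - 1
            cost[t, s] = mse[t, s] + min(stay, advance)
    path = np.empty(n, dtype=int)
    path[-1] = S - 1                            # must end in the last state
    for t in range(n - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path                                 # state index for each frame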
17
ASR Example 1
Isolated English digits "zero" through "nine" from the TI46 corpus: 8 male and 8 female speakers, 26 utterances each, 12.5 kHz sampling rate.
ESN: M = 60 PEs, r = 2.0, r_in = 0.1, 10 word models, various numbers of states and filters per state.
Features: 13 HFCC, 100 fps, Hamming window, pre-emphasis (α = 0.95), CMS, Δ + ΔΔ (±4 frames).
Pre-processing: zero-mean and whitening transform.
M1/F1: testing; M2/F2: validation; M3-M8/F3-F8: training.
Two to six training epochs for all models.
Desired: next frame of 39-dimensional features.
Test: corrupted by additive noise from "real" sources (subway, babble, car, exhibition hall, restaurant, street, airport terminal, train station).
Baseline: HMM with identical input features.
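One common way to implement the zero-mean and whitening pre-processing step (a sketch; the exact transform used in the experiments is not specified here):

import numpy as np

def zero_mean_whiten(X, eps=1e-8):
    # X: (n_frames, n_features). Remove the mean, then decorrelate and rescale
    # the features via the eigendecomposition of the covariance (PCA whitening).
    X = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
    return (X @ vecs) / np.sqrt(vals + eps)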
18
ASR Results, noise free
Number of classification errors out of 518, shown as ESN (HMM); smaller is better. Columns: K = 1, 2, 3, 4, 5, 10 filters per state; rows: N_st states.

N_st \ K   1        2        3        4        5        10
1          7 (171)  6 (136)  3 (65)   2 (33)   3 (4)    2 (2)
2          1 (83)   1 (46)   0 (4)    1 (3)    2 (2)    1 (0)
3          0 (126)  1 (4)    0 (2)             0 (1)    2 (0)
5          1 (11)   1 (2)    0 (0)             1 (0)    0 (0)
10         1 (2)    1 (0)             0 (0)
15         0 (1)    0 (0)             1 (0)
20         0        0        0        0        0        1
19
ASR Results, noisy
Average accuracy (%) over all noise sources, 0-20 dB SNR, shown as ESN (HMM); larger is better. Columns: K = 1, 2, 3, 4, 5, 10 filters per state; rows: N_st states.

N_st \ K   1            2            3            4            5            10
1          70.9 (22.4)  70.0 (29.7)  74.6 (45.6)  74.3 (46.0)  74.3 (36.2)  75.8 (50.9)
2          76.3 (41.5)  77.6 (47.6)  78.3 (50.1)  77.7 (53.8)  77.1 (50.2)  75.8 (64.5)
3          78.8 (29.2)  79.2 (44.6)  79.3 (51.7)  79.2 (58.6)  79.1 (58.6)  78.8 (55.6)
5          81.4 (51.6)  81.1 (56.4)  81.6 (59.7)  81.9 (59.2)  81.3 (59.2)  81.3 (53.5)
10         84.6 (57.2)  84.4 (61.1)  84.4 (58.7)  83.6 (55.7)  83.5 (56.2)  81.0 (52.2)
15         85.4 (64.0)  85.1 (62.0)  85.0 (59.2)  83.8 (56.4)  82.8 (52.9)  78.4 (52.2)
20         85.8         85.6         84.0         83.5         82.5         72.3
20
ASR Results, noisy Single mixture per state (K=1): ESN classifier
21
ASR Results, noisy Single mixture per state (K=1): HMM baseline
22
ASR Example 2
Same experimental setup as Example 1.
ESN: M = 600 PEs, 10 states, 1 filter per state, r_in = 0.1, various r.
Desired: one-of-many encoding of class, ±1, with a tanh output activation function AFTER the linear readout filter.
Test: corrupted by additive speech-shaped noise.
Baseline: HMM with identical input features.
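A sketch of the ±1 one-of-many target encoding and the post-readout tanh (the class count and names are illustrative):

import numpy as np

def one_of_many_targets(label, n_classes, n_frames):
    # Desired signal for one utterance: +1 for the true class, -1 elsewhere,
    # repeated for every frame of the utterance.
    D = -np.ones((n_frames, n_classes))
    D[:, label] = 1.0
    return D

# tanh applied AFTER the trained linear readout, as described on this slide:
# Y = np.tanh(X_states @ W_out)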
23
ASR Results, noisy
24
Discussion
What gives the ESN classifier its noise-robust characteristics?
Theory: the ESN reservoir provides context for the noisy input, allowing the reservoir to reduce the effects of noise by averaging.
Theory: the nonlinearity and high dimensionality of the network increase the linear separability of classes in reservoir space.
25
Future Work
– Replace winner-take-all with mixture-of-experts.
– Replace segmental K-means with a Baum-Welch-type training algorithm.
– "Grow" the network during training.
– Consider nonlinear activation functions (e.g., tanh, softmax) AFTER the linear readout filter.
26
Conclusions
ESN classifier using inspiration from the HMM:
– Multiple readout filters per state, multiple states.
– Trained as a competitive network of filters.
– Segmental K-means guaranteed to converge to a local minimum of total MSE over the training set.
ESN classifier is noise robust compared to the HMM:
– Averaged over all noise sources, 0-20 dB SNR: +21 percentage points.
– Averaged over all noise sources: +9 dB SNR.