Published by Leona Richards. Modified over 9 years ago.
1
Biologically Inspired Noise-Robust Speech Recognition for Both Man and Machine Mark D. Skowronski Ph.D. Proposal University of Florida Gainesville, FL, USA
2
Outline Introduction Biologically inspired algorithms –Speech: Energy Redistribution –Features: Human Factor Cepstral Coefficients –Classifier: Nonlinear dynamic systems Future work
3
Introduction Biologically inspired algorithms –Speech: Energy Redistribution –Features: Human Factor Cepstral Coefficients –Classifier: Nonlinear dynamic systems Future work
4
Biological Inspiration Wall Street Journal/Broadcast news readings Untrained human listeners vs Cambridge HTK LVCSR system AWGN: 10 dB SNR Example of Read Speech:
5
Introduction Biologically inspired algorithms –Speech: Energy Redistribution –Features: Human Factor Cepstral Coefficients –Classifier: Nonlinear dynamic systems Future work
6
Speech Enhancement Motivations: Noisy cell phone conversations Power-constrained transducers Public address systems in noisy environments What can you do when turning up the volume is not an option? Biology: Lombard Effect
7
The Lombard Effect Amplitude increases. Duration increases. Pitch increases. Formant frequencies increase. High-freq to low-freq energy ratio increases. Intelligibility increases. Lombard Effect: changes in vocal characteristics, produced by a speaker in the presence of background noise.
8
Psychoacoustic Experiments Fletcher (1953): LPF or HPF phonemes varied in robustness to the filtering process, with vowels being the most robust. Miller and Nicely (1955): AWGN to speech affects place of articulation and frication most, less so for voicing and nasality. Furui (1986): Truncated vowels in consonant-vowel pairs dramatically decreased in intelligibility beyond a certain point of truncation. These points correspond to spectrally dynamic regions. Speech contains regions of relatively high information content, and emphasis of these regions increases perceived intelligibility.
9
Solution: Energy Redistribution We redistribute energy from regions of low information content to regions of high information content while conserving overall energy across words. We partition speech into voiced/unvoiced regions using the Spectral Flatness Measure (SFM), where X_j(k) is the magnitude of the short-term Fourier transform of the jth speech window of length N. [Figure: SFM of the word “clarification”.]
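The partition-and-redistribute idea can be sketched in a few lines. This is a minimal illustration, not the ERVU algorithm itself. It assumes the standard SFM definition (geometric mean over arithmetic mean of X_j(k)), non-overlapping windows, and hypothetical values for `threshold` and `unvoiced_gain`, none of which appear on the slide.

```python
import numpy as np

def spectral_flatness(frame, eps=1e-12):
    """Geometric mean over arithmetic mean of the magnitude spectrum X_j(k)."""
    mag = np.abs(np.fft.fft(frame)) + eps  # eps avoids log(0)
    return np.exp(np.mean(np.log(mag))) / np.mean(mag)

def redistribute_energy(signal, frame_len=256, threshold=0.5, unvoiced_gain=2.0):
    """Boost noise-like (high-SFM) frames, then rescale to conserve total energy.

    threshold and unvoiced_gain are illustrative assumptions.
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_len
    out = signal.copy()
    unvoiced = np.zeros(n_frames, dtype=bool)
    for j in range(n_frames):
        window = slice(j * frame_len, (j + 1) * frame_len)
        # Flat spectrum -> noise-like -> treated as unvoiced.
        unvoiced[j] = spectral_flatness(signal[window]) > threshold
        if unvoiced[j]:
            out[window] *= unvoiced_gain
    # Global rescale so the output carries the same energy as the input.
    out *= np.sqrt(np.sum(signal ** 2) / np.sum(out ** 2))
    return out, unvoiced
```

Because of the final rescale, boosting the unvoiced frames necessarily attenuates the voiced ones, which is the sense in which energy is moved rather than added.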
10
Listening Test Confusable set test, from Junqua. Set I: f, s, x, yes. Set II: a, h, k, 8. Set III: b, c, d, e, g, p, t, v, z, 3. Set IV: m, n. 500 trials, forced decision. 3 algorithms (control, ERVU, HPF). 0 dB and -10 dB SNR, AWGN. Unlimited playback over headphones. 26 participants, 30-45 minutes.
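For readers who want to reproduce the noise conditions, one common way to add white Gaussian noise at a target SNR is sketched below. The function name `add_awgn` is hypothetical, and the SNR here is computed over the whole utterance, which may differ from how the test stimuli were actually prepared.

```python
import numpy as np

def add_awgn(signal, snr_db, rng=None):
    """Return (noisy_signal, noise) with the requested whole-signal SNR in dB."""
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=float)
    signal_power = np.mean(signal ** 2)
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.standard_normal(signal.shape)
    # Scale the noise so its sample power hits the target exactly.
    noise *= np.sqrt(target_noise_power / np.mean(noise ** 2))
    return signal + noise, noise
```

At -10 dB SNR the noise carries ten times the power of the speech, which is why turning up the volume alone does not help the listener.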
11
Listening Test Results -10 dB SNR, white noise. Errors decreased 20% compared to control. [Figure panels: “M”, “E”, “A”, “S”.]
12
Energy Redistribution Summary Biologically inspired –Lombard Effect says how to modify. –Psychoacoustic experiments say where to modify. Increases intelligibility while maintaining naturalness and conserving energy. Naturalness elegantly preserved by retaining spectral and temporal cues. Effective because everyday speech is not clearly enunciated.
13
Introduction Biologically inspired algorithms –Speech: Energy Redistribution –Features: Human Factor Cepstral Coefficients –Classifier: Nonlinear dynamic systems Future work
14
ASR Introduction Automatic Speech Recognition is the extraction of linguistic information from an utterance of speech (speech-to-text). Design dimensions: isolated/continuous speech; speaker-dependent/speaker-independent operation; word/phoneme recognition unit; vocabulary size and perplexity. Pipeline: Input → Feature Extraction → Classification.
15
Input Information: phonetic, gender, age, emotion, pitch, accent, physical state, additive/channel noise “seven”
16
Feature Extraction Goal: emphasize phonetic information over other characteristics. Provides dimensionality reduction on quasi-stationary windows. Acoustic: formant frequencies, bandwidths. Model based: linear prediction. Filter-bank based: mel frequency cepstral coefficients (mfcc). [Figure: features over time for “seven”.]
17
Hidden Markov Model [Figure: the word “one” shown in the time domain, state space, and feature space.]
18
MFCC Algorithm MFCC is the most widely used speech feature extractor. Processing chain: x(t) → Fourier transform (F) → mel-scaled filter bank → log energy → DCT → cepstral domain. [Figure: each stage for “seven”; axes: time, filter #.]
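The processing chain on this slide can be sketched for a single frame as follows. This is a minimal NumPy illustration under several assumptions not stated in the deck: the mel mapping 2595·log10(1 + f/700), a Hamming window, magnitude (not power) spectra, and the default values for `n_filters` and `n_ceps`.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, fs, f_lo, f_hi):
    """Triangular filters with centers equally spaced in mel frequency."""
    # n_filters + 2 points: each triangle spans the two neighboring centers.
    pts = mel_to_hz(np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_filters + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    bank = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, mid, hi = pts[i], pts[i + 1], pts[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mfcc(frame, fs, n_filters=24, n_ceps=13, f_lo=0.0, f_hi=None):
    """x(t) -> |FFT| -> mel filter bank -> log energy -> DCT-II."""
    f_hi = fs / 2.0 if f_hi is None else f_hi
    n_fft = len(frame)
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(n_fft)))
    energies = mel_filter_bank(n_filters, n_fft, fs, f_lo, f_hi) @ spectrum
    log_energies = np.log(energies + 1e-12)  # small offset avoids log(0)
    # Unnormalized DCT-II basis: one row per cepstral coefficient.
    k = np.arange(n_ceps)[:, None]
    n = np.arange(n_filters)[None, :]
    basis = np.cos(np.pi * k * (n + 0.5) / n_filters)
    return basis @ log_energies
```

Calling `mfcc(frame, fs=8000)` on a 256-sample frame returns 13 cepstral coefficients; a full front end would repeat this over overlapping frames.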
19
DCT vs Eigenvectors Average spectral difference < 15%. [Figure: spectra of DCT basis vectors compared with spectra of eigenvectors from log energy of filtered speech; axes: frequency, basis #.]
20
MFCC Filter Bank Design parameters: FB freq range, number of filters. Center freqs equally-spaced in mel frequency. Triangle endpoints set by center freqs of adjacent filters. Although filter spacing is determined by perceptual mel frequency scale, bandwidth is set more for convenience than by biological motivation.
21
Human Factor Cepstral Coefficients Decouple filter bandwidth from filter bank design parameters. Set filter width according to the critical bandwidth of the human auditory system. Use Moore and Glasberg approximation of critical bandwidth, defined in Equivalent Rectangular Bandwidth (ERB). f_c is critical band center frequency (kHz).
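The ERB equation itself did not survive in this transcript. The sketch below assumes it is the commonly cited Moore and Glasberg quadratic, ERB = 6.23 f_c^2 + 93.39 f_c + 28.52 Hz with f_c in kHz, and contrasts it with the triangle width an MFCC filter bank implies. The helper names, the mel mapping, and the 0 to 4000 Hz range are illustrative assumptions.

```python
import math

def erb_hz(fc_hz, e_factor=1.0):
    """ERB in Hz at center frequency fc_hz (assumed Moore and Glasberg quadratic).

    e_factor is a linear scale factor applied to the bandwidth.
    """
    fc_khz = fc_hz / 1000.0
    return e_factor * (6.23 * fc_khz ** 2 + 93.39 * fc_khz + 28.52)

def mel_triangle_width_hz(fc_hz, n_filters, f_lo=0.0, f_hi=4000.0):
    """Base width of an MFCC triangle at fc_hz when centers are mel-spaced.

    The triangle endpoints sit on the adjacent centers, so the width
    depends on how many filters share the frequency range.
    """
    def hz_to_mel(f):
        return 2595.0 * math.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    step = (hz_to_mel(f_hi) - hz_to_mel(f_lo)) / (n_filters + 1)
    center = hz_to_mel(fc_hz)
    return mel_to_hz(center + step) - mel_to_hz(center - step)
```

Doubling `n_filters` narrows `mel_triangle_width_hz` at a given center frequency, while `erb_hz` is unchanged. That independence from the filter count is the decoupling the slide describes.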
22
ASR Experiments Review Isolated English digits “zero” through “nine” from TI-46 corpus, 8 male speakers, HMM word models, 8 states per model, diagonal covariance matrix, Three mfcc versions (different filter banks), Several degrees of freedom, Linear ERB scale factor.
23
ASR Results White noise (local SNR), hfcc vs D&M
24
ASR Results White noise (global SNR), hfcc vs D&M, Linear ERB scale factor (E-factor).
25
HFCC Conclusions Added biologically inspired bandwidth to filter bank of popular speech feature extractor. Decoupled bandwidth from other filter bank design parameters. Demonstrated superior noise-robust performance of new feature extractor. Demonstrated advantages of wider filters.
26
Introduction Biologically inspired algorithms –Speech: Energy Redistribution –Features: Human Factor Cepstral Coefficients –Classifier: Nonlinear dynamic systems Future work
27
HMM Limitations HMMs are piecewise-stationary, while speech is continuous and nonstationary. HMMs assume frames of speech are i.i.d. State pdf estimates are data-driven. HMMs make no claim of modeling biology.
28
Novel Classifiers Deng's trended HMM. Rabiner's autoregression HMM. Morgan's HMM/neural network hybrid. Robinson's recurrent neural network. Wismüller's self-organizing map. Herrmann's transient attractor network. Maass' dynamic synapse MLP. Berger's dynamic synapse RNN.
29
Freeman's Chaotic Model Biologically inspired nonlinear dynamic model of cortical signal processing, from rabbit olfactory neo-cortex experiments. A hierarchical network of oscillators that are locally stable and globally chaotic. Demonstrated as classifier of static patterns. Represents a radical departure from current classifier paradigms.
30
KI Model Smallest element in network hierarchy. Symbols: a, b are constants; x_i(t) is the state variable; N is the number of states; W_ij is the weight from state i to state j; Q is an asymmetric sigmoid; I_i(t) is the input to state i.
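The governing equation is missing from this transcript. A form commonly given for Freeman's KI element, consistent with the symbols listed on the slide, is shown below as an assumption rather than as the slide's exact equation. The index order on the weight follows the slide's convention that W_ij runs from state i to state j, so the input arriving at state i from state j uses W_ji.

```latex
\frac{1}{ab}\left[\ddot{x}_i(t) + (a+b)\,\dot{x}_i(t) + ab\,x_i(t)\right]
  = \sum_{\substack{j=1 \\ j \neq i}}^{N} W_{ji}\, Q\big(x_j(t)\big) + I_i(t)
```

The left side is a stable second-order linear filter with unit DC gain, and the only nonlinearity enters through the sigmoid Q on the right.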
31
Reduced KII Network Locally stable element is KII network. m(t): excitatory mitral cell. g(t): inhibitory granule cell. Weights: K_mg > 0, K_gm < 0. N pairs in parallel. Mitral cells fully connected. Granule cells fully connected. Input I(t) into excitatory cell.
32
KII Simulations Reduced KII reaches a steady-state point attractor or a limit cycle, depending on |K_mg · K_gm|. [Figure: m(t) and g(t) traces.]
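A single mitral/granule pair can be simulated with a few lines of forward-Euler integration. This is a rough sketch, not the simulation behind the slide. The rate constants (a = 0.22, b = 0.72 per ms), the sigmoid form with saturation q_m = 5, the step size, and the reading of K_mg as the mitral-to-granule weight are all assumptions taken from common descriptions of Freeman's model.

```python
import math

def sigmoid_q(x, q_m=5.0):
    """Asymmetric sigmoid: saturates at q_m above and is clipped at -1 below."""
    x0 = math.log(1.0 - q_m * math.log(1.0 + 1.0 / q_m))
    if x <= x0:
        return -1.0
    return q_m * (1.0 - math.exp(-(math.exp(x) - 1.0) / q_m))

def simulate_reduced_kii(k_mg, k_gm, drive, t_end=200.0, dt=0.01, a=0.22, b=0.72):
    """Forward-Euler simulation of one mitral (m) / granule (g) pair.

    k_mg > 0 couples m into g (excitatory); k_gm < 0 couples g into m
    (inhibitory). drive is a constant input I(t) to the mitral cell.
    Time is in ms. Returns the m(t) and g(t) traces as lists.
    """
    m = dm = g = dg = 0.0
    m_trace, g_trace = [], []
    for _ in range(int(round(t_end / dt))):
        # Second-order dynamics rearranged for the acceleration terms.
        ddm = a * b * (k_gm * sigmoid_q(g) + drive - m) - (a + b) * dm
        ddg = a * b * (k_mg * sigmoid_q(m) - g) - (a + b) * dg
        m, dm = m + dt * dm, dm + dt * ddm
        g, dg = g + dt * dg, dg + dt * ddg
        m_trace.append(m)
        g_trace.append(g)
    return m_trace, g_trace
```

With both weights set to zero the pair decouples and m(t) simply settles to the drive level. Sweeping the product of the two weights and inspecting the tail of `m_trace` is one way to look for the transition between a point attractor and a limit cycle.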
33
Introduction Biologically inspired algorithms –Speech: Energy Redistribution –Features: Human Factor Cepstral Coefficients –Classifier: Nonlinear dynamic systems Future work
34
Work Completed 1. Developed biologically inspired algorithms: Energy redistribution: combines Lombard Effect (how) with psychoacoustic experimental results (where) to increase speech intelligibility. Human factor cepstral coefficients: combines existing speech front end (mfcc) with critical bandwidth information (ERB). 2. Published 3 papers, and submitted 3 more, on novel algorithms. 3. Literature survey on novel speech classifiers, and simulations of nonlinear Freeman model.
35
Work Proposed 1. Compare hfcc to human speech recognition using rhyming test in ASR experiments. 2. Measure effects of ERVU in ASR experiments. 3. Analyze hfcc algorithm, accounting for nonlinear log(·) function. 4. Experiment with other bandwidth functions besides ERB or scaled ERB. 5. Quantify tradeoff between spectral resolution and noise smoothing for hfcc using synthetic data.
36
Work Proposed, Cont'd 6. Build on the reduced KII network results recently reported by CNEL suggesting the network can operate as a content-addressable memory (CAM). 7. Investigate alternative information storage strategies to CAM, focusing on inherent time-varying nature of dynamic system (coupling theory is intriguing). 8. Expand literature search to areas outside speech recognition experiments that use nonlinear dynamic (chaotic) systems for information processing/storage, with emphasis on applications with time-varying signals.
37
Work Proposed, Cont'd 9. Consider alternative roles for nonlinear dynamics: embedded extracted features for hfcc/HMM system, trajectory tracking in the spirit of Deng’s trended HMM. 10. Demonstrate classification of static vowel patterns (vowel phonemes) with novel classifier, in presence of noise. 11. Demonstrate classification of time-varying signals (isolated English digits, rhyming test corpus), in noisy environments.