Slide 1: "Pushing the Envelope": A Six-Month Report
By the Novel Approaches team, with site leaders:
- Nelson Morgan, ICSI
- Hynek Hermansky, OGI
- Dan Ellis, Columbia
- Kemal Sönmez, SRI
- Mari Ostendorf, UW
- Hervé Bourlard, IDIAP/EPFL
- George Doddington, NA-sayer
Slide 2: Overview (Nelson Morgan, ICSI)
Slide 3: The Current Cast of Characters
- ICSI: N. Morgan, Q. Zhu, B. Chen, G. Doddington
- UW: M. Ostendorf, Ö. Çetin
- OGI: H. Hermansky, S. Sivadas, P. Jain
- Columbia: D. Ellis, M. Athineos
- SRI: K. Sönmez
- IDIAP: H. Bourlard, J. Ajmera, V. Tyagi
Slide 4: Rethinking Acoustic Processing for ASR
- Escape dependence on the spectral envelope
- Use multiple front-ends across time/frequency
- Modify statistical models to accommodate new front-ends
- Design optimal combination schemes for multiple models
Slide 5: Task 1: Pushing the Envelope
- Problem: the spectral envelope is a fragile information carrier (old approach: one estimate of sound identity from each 10 ms frame)
- Solution: probabilities from multiple time-frequency patches spanning up to 1 s; the i-th, k-th, ..., n-th estimates are fused into a single estimate of sound identity
[Figure: schematic contrasting the OLD single-frame estimate with the PROPOSED multi-patch information fusion.]
Slide 6: Task 2: Beyond Frames
- Problem: features and models interact; new features may require different models
- Old: conventional HMM with short-term features
- Proposed: advanced features with a multi-rate, dynamic-scale classifier; advanced features require advanced models, free of the fixed-frame-rate paradigm
Slide 7: Today's Presentation
- Infrastructure: training, testing, software
- Initial experiments: pilot studies
- Directions: where we're headed
Slide 8: Infrastructure (Kemal Sönmez, SRI; an SRI/UW/ICSI effort)
Slide 9: Initial Experimental Paradigm
- Focus on a small task to facilitate exploratory work (later move to CTS)
- Choose a task where the LM is fixed and plays a minor role (to focus on acoustics)
- Use mismatched train/test data: to avoid tuning to the task, and to facilitate the later move to CTS
- Task: OGI Numbers; training: Switchboard + Macrophone
Slide 10: Hub5 "Short" Training Set
- Composition: ~60 hours total
- A subset of SWB-1, hand-checked at SRI for accuracy of transcriptions and segmentations
- WER is 2-4% higher than with the full 250+ hour training set
Slide 11: Reduced UW Training Set
- A reduced training set to shorten experiment turn-around time
- Choose training utterances with per-frame likelihood scores close to the training-set average
- 1/4 of the original training set
- Statistics (gender, data-set constituencies) are similar to those of the full training set
- For OGI Numbers, no significant WER sacrifice in the baseline HMM system (worse for Hub5)

Data-set constituencies (one cell of the "short" row was lost in capture):
Data set      Macrophone  CallHome/Credit-Card  Other  Switchboard  Male/Female
Hub5 "short"  32%         12%                   24%                 45/55%
Reduced (UW)  38%         28%                   12%    22%          48/52%
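The selection rule above can be sketched as follows; the utterance IDs, scores, and the exact ranking rule are illustrative assumptions, not the UW implementation.

```python
# Hypothetical sketch of the reduced-training-set selection: keep the
# quarter of utterances whose average per-frame log-likelihood is
# closest to the full training set's average. IDs and scores are
# made-up illustrations, not real data.

def select_reduced_set(scores, fraction=0.25):
    """scores: dict mapping utterance id -> avg per-frame log-likelihood."""
    mean = sum(scores.values()) / len(scores)
    # rank utterances by how close their score is to the global mean
    ranked = sorted(scores, key=lambda u: abs(scores[u] - mean))
    keep = max(1, int(len(ranked) * fraction))
    return set(ranked[:keep])

scores = {"utt1": -7.1, "utt2": -5.0, "utt3": -6.9, "utt4": -9.5}
print(select_reduced_set(scores))  # {'utt1'}: closest to the mean of -7.125
```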
Slide 12: Development Test Sets
- A "Core-Subset" of OGI's Numbers95 corpus: telephone speech of people reciting addresses, telephone numbers, zip codes, or other miscellaneous items
- The "Core-Subset" (CS) consists of utterances that were phonetically hand-transcribed, intelligible, and contained only numbers
- Vocabulary size: 32 words (digits plus eleven, twelve, ..., twenty, ..., hundred, thousand, etc.)

Data set                       Utterances  Words  Duration (hours)
Numbers95-CS Cross-Validation  357         1353   ~0.2
Numbers95-CS Development       1206        4673   ~0.6
Numbers95-CS Test              1227        4757   ~0.6
Slide 13: Statistical Modeling Tools
- HTK (Hidden Markov Model Toolkit) for establishing an HMM baseline and for debugging
- GMTK (Graphical Models Toolkit) for implementing advanced models with multiple feature/state streams: allows direct dependencies across streams; not limited by the single-rate, single-stream paradigm; rapid model specification, training, and testing
- SRI Decipher system for providing lattices to rescore (later, in CTS experiments)
- Neural-network tools from ICSI for posterior probability estimation; other statistical software from IDIAP
Slide 14: Baseline SRI Recognizer for the Numbers Task
- Bottom-up state-clustered Gaussian-mixture HMMs for acoustic modeling
- Acoustic adaptation to speakers using affine mean and variance transforms [not used for Numbers]
- Vocal-tract length normalization using maximum-likelihood estimation [not helpful for Numbers]
- Progressive search with lattice recognition and N-best rescoring [to be used in later work]
- Bigram LM
Slide 15: Initial Experiments
Barry Chen, ICSI; Hynek Hermansky, OHSU (OGI); Özgür Çetin, UW
Slide 16: Goals of Initial Experiments
- Establish performance baselines: HMM + standard features (MFCC, PLP); HMM + current best features from ICSI/OGI
- Develop infrastructure for new models: GMTK for multi-stream and multi-rate features; novel features based on large time spans; novel features based on temporal fine structure
- Provide fodder for future error analysis
Slide 17: ICSI Baseline Experiments
- PLP-based SRI system
- "Tandem": PLP-based ANN + SRI system
- Initial combination approach
Slide 18: Development Baseline: Gender-Independent PLP System

Training set                                   Word, Sentence Error Rate (Numbers95-CS test set)
Full "Short" Hub5 (85k utterances, ~64.9 hrs)  3.4%, 10.2%
UW Reduced Hub5 (20k utterances, ~18.8 hrs)    3.8%, 11.4%
Slide 19: Phonetically Trained Neural Net
- Multi-layer perceptron (input, hidden, and output layers), trained with error backpropagation; outputs interpreted as posterior probabilities of the target classes
- Training targets: 47 monophone targets from forced alignment using the SRI Eval 2002 system
- Training utterances: UW Reduced Hub5 set
- Training features: PLP12 + energy + deltas + double-deltas, mean- and variance-normalized per conversation side
- MLP topology: 9-frame context window (4 past frames + current frame + 4 future frames); 351 input units, 1500 hidden units, 47 output units
- Total number of parameters: ~600k
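As a sanity check on the topology above, the parameter count can be reproduced; the 39-dimensional per-frame breakdown (13 coefficients tripled by deltas and double-deltas) is our reading of "PLP12+e+d+dd", not stated explicitly on the slide.

```python
# Sanity check of the slide's MLP size. Assumption: PLP12 + energy = 13
# coefficients per frame, tripled by deltas and double-deltas (39), and
# a 9-frame context window gives 9 * 39 = 351 inputs.

def mlp_params(n_in, n_hidden, n_out):
    # weights plus biases for the hidden and output layers
    return (n_in * n_hidden + n_hidden) + (n_hidden * n_out + n_out)

n_in = 9 * (13 * 3)
print(n_in, mlp_params(n_in, 1500, 47))  # 351 598547, i.e. ~600k
```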
Slide 20: Baseline ICSI Tandem
- Outputs of the neural net before the final softmax nonlinearity are used as inputs to PCA
- PCA without dimensionality reduction
- 4.1% word and 11.7% sentence error rate on the Numbers95-CS test set
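A small sketch of why the pre-softmax outputs are a reasonable feature: they equal the log-posteriors up to a per-frame constant (the log-normalizer), so the discriminative information is kept while the skewed, bounded [0, 1] range of the posteriors is avoided. This rationale is our gloss, not a claim made on the slide.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

z = [2.0, 1.0, -0.5]  # hypothetical pre-softmax activations
p = softmax(z)
# z_i - log p_i is the same constant (the log-normalizer) for every class i
d = [zi - math.log(pi) for zi, pi in zip(z, p)]
print(d)
```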
Slide 21: Baseline ICSI Tandem + PLP
- PLP stream concatenated with the neural-net posterior stream
- PCA reduces the dimensionality of the posterior stream to 16 (keeping 95% of the overall variance)
- 3.3% word and 9.5% sentence error rate on the Numbers95-CS test set
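Choosing the output dimensionality from an explained-variance target, as in the 16-dimensional, 95%-variance reduction above, can be sketched as follows; the eigenvalue spectrum is made up for illustration.

```python
def n_components_for_variance(eigvals, target=0.95):
    """Smallest k such that the top-k PCA eigenvalues explain at least
    `target` of the total variance; eigvals must be sorted descending."""
    total = sum(eigvals)
    acc = 0.0
    for k, v in enumerate(eigvals, start=1):
        acc += v
        if acc / total >= target:
            return k
    return len(eigvals)

# hypothetical eigenvalue spectrum of a posterior stream
print(n_components_for_variance([5.0, 3.0, 1.0, 0.5, 0.3, 0.2]))  # 4
```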
Slide 22: Word and String Error Rates on the Numbers95-CS Test Set
[Chart summarizing the error rates of the baseline systems.]
Slide 23: OGI Experiments: New Features in EARS
- Develop on a home-grown ASR system (phoneme-based HTK)
- Pass the most promising features to ICSI for runs in the SRI LVCSR system
- So far, new features match the performance of the baseline PLP features but do not exceed it; an advantage is seen in combination with the baseline
Slide 24: Looking to the Human Auditory System for Design Inspiration
- Psychophysics: components within a certain frequency range (several critical bands) interact [e.g., frequency masking]; components within a certain time span (a few hundred ms) interact [e.g., temporal masking]
- Physiology: 2-D (time-frequency) matched filters for activity in the auditory cortex [cortical receptive fields]
Slide 25: TRAP-Based HMM/NN Hybrid ASR
[Figure: each critical band's mean- and variance-normalized, Hamming-windowed, 101-point trajectory feeds a band-specific multilayer perceptron (MLP); a merging MLP combines the band outputs into posterior probabilities of phonemes, followed by the search for the best match.]
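The per-band preprocessing described in the figure might look like the sketch below; the normalization and window follow the slide, while everything else (ordering, guards, inputs) is an illustrative assumption.

```python
import math

def trap_input(trajectory):
    """Prepare one critical-band energy trajectory (101 frames, ~1 s at
    10 ms per frame) for the band MLP: mean/variance normalize, then
    apply a Hamming window."""
    n = len(trajectory)
    mean = sum(trajectory) / n
    var = sum((x - mean) ** 2 for x in trajectory) / n
    std = math.sqrt(var) or 1.0  # guard against a constant trajectory
    normed = [(x - mean) / std for x in trajectory]
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, x in enumerate(normed)]

windowed = trap_input([float(i % 7) for i in range(101)])
print(len(windowed))  # 101
```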
Slide 26: Feature Estimation from Linearly Transformed Temporal Patterns
[Figure: temporal patterns pass through a linear transform (to be determined), then an MLP, then TANDEM processing into the HMM ASR system.]
Slide 27: Preliminary TANDEM/TRAP Results (OGI-HTK)
WER% on OGI Numbers, training on the UW reduced training set, monophone models:

System            WER%
Baseline          4.5
TANDEM            4.1
TANDEM with TRAP  3.9
Slide 28: Features from More Than One Critical-Band Temporal Trajectory
Studying KLT-derived basis functions, we observe that they resemble a cosine transform in time plus an average and a derivative across frequency. [Figure]
Slide 29: UW Baseline Experiments
- Constructed an HTK-based HMM system that is competitive with the SRI system
- Replicated the HMM system in GMTK
- Moving on to models that integrate information from multiple sources in a principled manner: multiple feature streams (multi-stream models); different time scales (multi-rate models)
- Focus on statistical models, not on feature extraction
Slide 30: HTK HMM Baseline
- An HTK-based standard HMM system: 3-state triphones with decision-tree clustering; mixtures of diagonal Gaussians as state output distributions; no adaptation; fixed LM
- Dimensions explored: front-end (PLP vs. MFCC, VTLN); gender-dependent vs. gender-independent modeling
- Conclusions: no significant performance differences; decided on PLPs, no VTLN, and gender-independent models for simplicity
Slide 31: HMM Baselines (cont.)
- Replicated the HTK baseline with equivalent results in GMTK
- To reduce experiment turn-around time, we wanted to reduce the training set
- For HMMs and Numbers95, 3/4 of the training data can be safely ignored

WER% by tool:
Tool   Dev  Test
HTK    3.7  3.2
GMTK   3.7  3.0

WER% by training set (one cell was lost in capture):
Training set     Dev  Test
Full "short"     3.7  3.2
1/4 ("reduced")  3.4
Slide 32: Multi-stream Models
- Information fusion from multiple streams of features
- Partially asynchronous state sequences
[Figure: state topology for streams X and Y, and the corresponding graphical model linking feature streams X and Y to their state sequences.]

Model                    Dev WER%  Test WER%
HMM (PLP)                3.9       4.2
Multi-stream (PLP+MFCC)  (results pending)
Slide 33: Temporal Envelope Features (Columbia)
- Temporal fine structure is (deliberately) lost in STFT features computed over 10 ms windows
- We need a compact, parametric description of the temporal envelope
Slide 34: Frequency-Domain Linear Prediction (FDLP)
- Extend LPC with an LP model of the spectrum:
  TD-LP: y[n] = Σ_i a_i y[n-i]; after a DFT, FD-LP: Y[k] = Σ_i b_i Y[k-i]
- "Poles" represent temporal peaks; features ~ pole bandwidth and "frequency"
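A minimal pure-Python sketch of the FDLP idea: run the standard autocorrelation/Levinson-Durbin LP recursion, but on a real spectral transform of the signal (a DCT here) instead of on the waveform, so the resulting poles model the temporal envelope. The transform choice, model order, and all other details are illustrative assumptions, not the Columbia implementation.

```python
import math

def autocorr(x, order):
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(order + 1)]

def levinson(r, order):
    """Levinson-Durbin recursion; returns the prediction-error filter
    coefficients [1, a1, ..., ap] and the final prediction error."""
    a = [0.0] * (order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err

def fdlp(signal, order=4):
    """LP on a DCT of the signal: 'frequency-domain' linear prediction."""
    n = len(signal)
    # DCT-II of the signal stands in for the real spectrum Y[k]
    spec = [sum(signal[t] * math.cos(math.pi * k * (2 * t + 1) / (2 * n))
                for t in range(n)) for k in range(n)]
    return levinson(autocorr(spec, order), order)

coeffs, err = fdlp([((i * 37) % 11) - 5.0 for i in range(32)], order=4)
print(len(coeffs))  # 5
```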
Slide 35: Preliminary FDLP Results
- Distribution of pole magnitudes for different phone classes (in 4 bands) [figure]
- NN classifier frame accuracies:

Features         Frame accuracy
plp12N           57.0%
plp12N + FDLP4   58.2%
Slide 36: Directions
Dan Ellis, Columbia (SRI/UW/Columbia work); Nelson Morgan, ICSI (OGI/IDIAP/ICSI work + summary)
Slide 37: Multi-rate Models (UW)
- Integrate acoustic information from different time scales (long-term and short-term features)
- Account for dependencies across scales (example: a coarse state chain coupled to a fine state chain)
- Better robustness against time- and/or frequency-localized interference
- Reduced redundancy gives better confidence estimates
Slide 38: SRI Directions
- Task 1: signal-adaptive weighting of time-frequency patches; basis-entropy-based representation; matching-pursuit search for the optimal weighting of patches; optimality based on a minimum-entropy criterion
- Task 2: graphical models of patch combinations; tiling-driven dependency modeling; the graphical model combines across patch selections; optimality based on the information in the representation
Slide 39: Data-Derived Phonetic Features (Columbia)
- Find a set of independent attributes to account for phonetic (lexical) distinctions: phones replaced by feature streams
- Will require new pronunciation models: asynchronous feature transitions (no phones); mapping from phonetics (for unseen words)
- Joint work with Eric Fosler-Lussier
Slide 40: ICA for Feature Bases
- PCA finds decorrelated bases; ICA finds independent bases
- Is there a lexically sufficient ICA basis set?
Slide 41: OGI Directions: Targets in Sub-bands
- Initially: context-independent, band-specific phonemes
- Gradually shifted to band-specific broad phonetic classes (6: stops, fricatives, nasals, vowels, silence, flaps)
- Moving towards band-independent speech classes (vocalic-like, fricative-like, plosive-like, ...)
Slide 42: More Than One Temporal Pattern?
[Figure: a mean- and variance-normalized, Hamming-windowed, 101-dimensional critical-band trajectory is projected onto KLT bases 1 through n, whose outputs feed an MLP.]
Slide 43: Pre-processing by 2-D Operators with Subsequent TRAP-TANDEM
[Figure: 3x3 time-frequency masks that (a) differentiate along frequency while averaging along time, (b) differentiate along time while averaging along frequency, and (c, d) differentiate along one diagonal while averaging along the other.]
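The masks in the figure can be applied as small 2-D convolutions over the time-frequency plane. The sketch below uses the standard Sobel pair as a stand-in; the slide's exact mask coefficients are not reproduced here.

```python
def conv2d_valid(img, mask):
    """'Valid' 2-D correlation of a small mask with a spectrogram patch
    (img: list of rows, e.g. frequency x time)."""
    mh, mw = len(mask), len(mask[0])
    ih, iw = len(img), len(img[0])
    return [[sum(mask[i][j] * img[r + i][c + j]
                 for i in range(mh) for j in range(mw))
             for c in range(iw - mw + 1)]
            for r in range(ih - mh + 1)]

# Sobel-style stand-in: differentiate along frequency (rows) while
# averaging along time (columns)
diff_f_avg_t = [[-1, -2, -1],
                [ 0,  0,  0],
                [ 1,  2,  1]]

# a patch whose energy rises with frequency yields a positive response
print(conv2d_valid([[float(r)] * 3 for r in range(3)], diff_f_avg_t))  # [[8.0]]
```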
Slide 44: IDIAP Directions: Phase AutoCorrelation Features
- Traditional features are autocorrelation-based and very sensitive to additive noise and other variations
- Phase AutoCorrelation (PAC) features are instead derived from the autocorrelation coefficients of a fixed-length frame [defining equations shown on slide]
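The PAC equations did not survive the slide capture. As commonly defined in the PAC literature, the feature replaces each autocorrelation value with the angle between the frame and its circularly shifted copy; the sketch below is our reconstruction under that definition, not the slide's own formula.

```python
import math

def pac_coeffs(frame, order):
    """Phase AutoCorrelation sketch: with circular autocorrelation
    R(k) = <x, x_k> and ||x_k|| = ||x||, return the angles
    arccos(R(k) / R(0)) for k = 1..order."""
    n = len(frame)
    r0 = sum(v * v for v in frame)
    out = []
    for k in range(1, order + 1):
        rk = sum(frame[i] * frame[(i + k) % n] for i in range(n))
        # clamp for floating-point safety before taking the angle
        out.append(math.acos(max(-1.0, min(1.0, rk / r0))))
    return out

# a sampled cosine: a quarter-period shift is orthogonal (angle pi/2),
# a half-period shift is anti-parallel (angle pi)
print(pac_coeffs([1.0, 0.0, -1.0, 0.0], 2))
```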
Slide 45: Entropy-Based Multi-stream Combination
- Combine evidence from more than one expert to improve performance
- Entropy as a measure of confidence: experts with low entropy are more reliable than experts with high entropy
- Inverse-entropy weighting criterion
- Relationship between the entropy of the resulting (recombined) classifier and the recognition rate
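The inverse-entropy weighting criterion can be sketched as follows; the two expert posterior vectors are invented for illustration.

```python
import math

def entropy(p):
    # Shannon entropy of a posterior vector (natural log)
    return -sum(x * math.log(x) for x in p if x > 0)

def inverse_entropy_combine(streams):
    """Weight each expert's posterior vector by the inverse of its
    entropy, so confident (low-entropy) experts dominate."""
    w = [1.0 / max(entropy(p), 1e-8) for p in streams]
    z = sum(w)
    w = [wi / z for wi in w]
    return [sum(wi * p[i] for wi, p in zip(w, streams))
            for i in range(len(streams[0]))]

sharp = [0.9, 0.05, 0.05]  # confident expert
flat = [0.4, 0.3, 0.3]     # uncertain expert
combined = inverse_entropy_combine([sharp, flat])
print(combined[0])  # > 0.65: pulled toward the confident expert
```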
Slide 46: ICSI Directions: Posterior Combination Framework
- Combination of several discriminative probability streams [diagram]
Slide 47: Improvement of the Combo Infrastructure
- Improve basic features: add prosodic features (voicing level, energy continuity); improve PLP by further removing pitch differences among speakers
- Tandem: different targets and different training features (e.g., word boundary)
- Improve TRAP (OGI)
- Combination: entropy-based or accuracy-based stream weighting or stream selection
Slide 48: New Types of Tandem Features
- Pipeline: input features -> NN processing -> target posteriors (e.g., possible word/syllable boundary)
- Input features: traditional or improved PLP; spectral continuity; voicing and voicing continuity; formant-continuity features; and more
- Targets: phonemes; word/syllable boundaries; broad phoneme classes; manner/place of articulation; etc.
Slide 49: Data-Driven Subword Unit Generation (IDIAP/ICSI)
- Motivation: phoneme-based units may not be optimal for ASR
- Approach (based on a speaker-segmentation method): start from an initial segmentation with a large number of clusters; while a thresholdless BIC-like merging criterion is met, merge, re-segment, and re-estimate; stop when no merge is accepted
Slide 50: Summary
- Staff and tools are in place to proceed with core experiments
- Pilot experiments provided a coherent substrate for cooperation among the 6 sites
- Future directions for the individual sites are all over the map, which is what we want
- Possible exploration of collaborations w/MS in this meeting