EARS Kickoff Meeting: “Pushing the Envelope”
By the Novel Approaches team:
- Nelson Morgan, ICSI
- Hynek Hermansky, OGI
- Dan Ellis, Columbia
- Kemal Sonmez, SRI
- Mari Ostendorf, UW
- Hervé Bourlard, IDIAP/EPFL
- George Doddington, NA-sayer

Modern ASR Systems
- From 50,000 ft, all ASR systems the same:
  - compute local spectral envelope
  - determine likelihoods of speech sounds
  - search for most likely HMMs
- Spectral envelope distorted by many things
  - Alternatives often are bad fits to the statistical models

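A minimal sketch of the third step, the search, assuming per-frame log-likelihoods are already available from the first two steps. It is a textbook Viterbi pass over one toy HMM, not a description of any particular recognizer; the function name and argument layout are illustrative assumptions.

```python
import math

def viterbi(log_likes, log_trans, log_init):
    """Return the most likely state path and its log score.

    log_likes[t][s]: log-likelihood of frame t under state s (assumed given).
    log_trans[i][j]: log-probability of moving from state i to state j.
    log_init[s]:     log-probability of starting in state s.
    """
    n_states = len(log_init)
    # Best score of any path ending in each state at the first frame.
    score = [log_init[s] + log_likes[0][s] for s in range(n_states)]
    backptrs = []
    for frame in log_likes[1:]:
        new_score, ptr = [], []
        for j in range(n_states):
            # Best predecessor of state j at this frame.
            best_i = max(range(n_states), key=lambda i: score[i] + log_trans[i][j])
            ptr.append(best_i)
            new_score.append(score[best_i] + log_trans[best_i][j] + frame[j])
        score = new_score
        backptrs.append(ptr)
    # Trace back from the best final state.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[last]
```

In a real system the states would belong to many concatenated word and phone models; the recursion stays the same.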
ASR is half-deaf
- Phonetic classification very poor
- Success due to constraints (domain, speaker, noise-canceling mic, etc.)
- These constraints can mask the underlying weakness of the technology

“Y'see, they just find out who complains the loudest about the cooking, and he gets to be the cook.” - Utah Phillips
Who gets to try to fix it?

Rethinking Acoustic Processing for ASR
- Escape dependence on spectral envelope
- Use multiple front ends across time/freq
- Modify statistical models to accommodate new front ends
- Design optimal combination schemes for multiple models

The Two EARS-NA Tasks
- Signal processing: Replacing the spectral envelope by long-time and short-time (multirate) probabilistic functions of the spectro-temporal plane.
- Statistical modeling: Modifying the statistical models, both to incorporate these new multirate front ends and to explicitly handle areas of missing information.

Task 1: Pushing the Envelope (aside)
- Problem: Spectral envelope is a fragile information carrier
- Solution: Probabilities from multiple time-frequency patches
[Figure: OLD vs. PROPOSED. Labels: time; 10 ms; up to 1 s; i-th, k-th, n-th estimates; information fusion; estimate of sound identity]

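One way the information fusion step could look, as a sketch only: the slide does not say which fusion rule is intended, so a weighted geometric mean of the per-patch posteriors is assumed here. The function name and the uniform default weights are hypothetical.

```python
import math

def fuse_posteriors(estimates, weights=None):
    """Fuse per-patch class posteriors with a weighted geometric mean.

    estimates: list of posterior vectors, one per time-frequency patch.
    weights:   optional per-patch weights (assumption: uniform if omitted).
    """
    n_classes = len(estimates[0])
    if weights is None:
        weights = [1.0 / len(estimates)] * len(estimates)
    floor = 1e-12  # keeps log() finite when a patch assigns zero probability
    log_fused = [
        sum(w * math.log(max(p[c], floor)) for w, p in zip(weights, estimates))
        for c in range(n_classes)
    ]
    # Normalize in the log domain for numerical stability.
    peak = max(log_fused)
    unnorm = [math.exp(v - peak) for v in log_fused]
    total = sum(unnorm)
    return [u / total for u in unnorm]
```

Other rules (sum, product, a trained merging classifier) fit the same interface.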
Multiple time-frequency tradeoffs
- Temporal trajectories of narrow subbands
- Optimal search for more general patches
- Data-driven broad class probabilities
[Figure: time axis; i-th, k-th, n-th estimates; up to 1 s]

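A sketch of the first item, a temporal trajectory of one narrow subband, assuming 10 ms frames so that 101 frames span roughly 1 s. The input layout `energies[t][b]` and the edge handling (repeating the first and last frame) are assumptions made for illustration.

```python
def subband_trajectory(energies, band, center, half_span=50):
    """Return the trajectory of one subband around a center frame.

    energies[t][b]: (log) energy of subband b at frame t.
    half_span:      frames on each side; 50 gives 101 frames, about 1 s
                    at an assumed 10 ms frame step.
    """
    last = len(energies) - 1
    return [
        energies[min(max(t, 0), last)][band]  # clamp at utterance edges
        for t in range(center - half_span, center + half_span + 1)
    ]
```

Each such trajectory would feed its own per-band estimator, whose outputs are then combined.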
Pitch-related features
- Current recognizers have no use for pitch
- Listeners benefit from pitch
- Correlogram estimates spectrum of pitch

Principled multistream
- Not just different, but useful in combination
  - minimizing relative entropy between error signals
  - minimizing conditional information of posterior signals
- Choosing categories for per-stream probabilistic functions (e.g., broad classes)

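For reference, relative entropy (Kullback-Leibler divergence) between two discrete distributions can be computed as below. How the error signals of two streams are turned into distributions is not stated on the slide; this sketch simply assumes they are given as probability vectors over the same categories.

```python
import math

def relative_entropy(p, q, floor=1e-12):
    """Relative entropy D(p || q) in bits for discrete distributions.

    p, q:  probability vectors over the same categories (assumed input).
    floor: lower bound on q to avoid division by zero.
    """
    return sum(
        pi * math.log2(pi / max(qi, floor))
        for pi, qi in zip(p, q)
        if pi > 0.0  # terms with p = 0 contribute nothing
    )
```

The value is zero when the two distributions coincide and grows as they diverge.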
Task 2: Beyond Frames…
- Problem: Features & models interact, new features may require different models
- Solution: Advanced features require advanced models, not limited by fixed-frame-rate paradigm
[Figure: OLD: short-term features, conventional HMM. PROPOSED: advanced features, multi-rate / dynamic scale classifier.]

Multirate Models
- Goal: Model features that span different time scales and dependence across scales/streams
[Figure: advanced features, multirate classifier]

Multirate Models (ctd)
- Why multirate vs. redundant features?
  - Redundant features violate independence assumptions, lead to poor confidence (posterior) estimates
  - Redundancy adds unnecessary computation
- Important research issues:
  - Acoustically driven rate mixing and/or variable alignment
  - Discriminative learning of dependence across streams

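A small worked example of the first point, with made-up numbers: if the same evidence is counted several times under an independence assumption, the posterior becomes overconfident. The function and its values are illustrative only.

```python
def posterior_with_copies(like_a, like_b, copies, prior_a=0.5):
    """Posterior of class A when one feature is counted `copies` times.

    like_a, like_b: likelihood of the observed feature under classes A and B.
    copies:         number of redundant copies treated as independent.
    """
    num = prior_a * like_a ** copies
    den = num + (1.0 - prior_a) * like_b ** copies
    return num / den

single = posterior_with_copies(0.6, 0.4, copies=1)  # one copy of the evidence
triple = posterior_with_copies(0.6, 0.4, copies=3)  # same evidence, counted 3 times
```

With likelihoods 0.6 and 0.4, one copy gives a posterior of 0.6; three copies push it to about 0.77 although no new information was added.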
Partial information techniques
- Can integrate across unknown dimensions
  - particularly simple for diagonal Gaussians
  - e.g. Spectral masks: Skip missing dimensions
- Hard part is identifying the bad data

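Why diagonal Gaussians make this simple: the marginal over the missing dimensions is the product of the remaining one-dimensional Gaussians, so the missing dimensions are just left out of the sum. A minimal sketch, assuming the reliability mask is already known (the hard part noted above); names are illustrative.

```python
import math

def masked_log_likelihood(x, mean, var, mask):
    """Log-likelihood of x under a diagonal Gaussian, skipping masked dims.

    mask[d] is True where dimension d is considered reliable.
    Unreliable dimensions are marginalized out, which for a diagonal
    covariance means dropping their terms from the sum.
    """
    total = 0.0
    for xi, mi, vi, ok in zip(x, mean, var, mask):
        if not ok:
            continue  # missing dimension: integrates to 1, contributes 0
        total += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return total
```

A corrupted channel that would otherwise dominate the score no longer affects it once it is masked.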
Multistream statistics
- All possible combinations of individual streams

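To give a sense of scale, the sketch below enumerates every non-empty combination of a set of streams; with N streams there are 2^N - 1 of them. The stream names are placeholders, and whether the empty combination is also modeled is left open here.

```python
from itertools import combinations

def stream_subsets(streams):
    """Return all non-empty combinations of the given streams."""
    return [
        combo
        for size in range(1, len(streams) + 1)
        for combo in combinations(streams, size)
    ]

# Hypothetical stream names, for illustration only.
subsets = stream_subsets(["band1", "band2", "band3"])
```

Three streams already give seven combinations, so the number of experts to train or approximate grows quickly.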
Multistream statistics (ctd)
- Statistical modeling in both frequency and time: HMM2

Evaluation
- For greatest and most reliable progress, need frequent internal evaluations
- Most importantly, need to define helpful evaluation tasks to guide the research
- Other considerations beyond the task:
  - definition of performance measures
  - choice of corpora
  - establishment of an evaluation process

Task and corpus, initial plan
- Evaluation tasks: recognition of words and syllables
- Cross-corpus testing
  - training on Hub 5, Macrophone
  - testing on OGI numbers for quick turnaround, debugging
- Testing on Hub 5 in due course
- Rescoring SRI decoder output (N-best or lattice)

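A sketch of N-best rescoring in its simplest form, assuming each hypothesis comes with a decoder log score and that the new models supply one additional log score per hypothesis. The data layout and the interpolation weight are assumptions, not the SRI decoder's actual interface.

```python
def rescore_nbest(hypotheses, new_scores, weight=0.5):
    """Re-rank an N-best list with scores from a new model.

    hypotheses: list of (words, decoder_log_score) pairs (assumed format).
    new_scores: one log score per hypothesis from the new front end/model.
    weight:     hypothetical scaling of the new score; would be tuned on
                a development set.
    Returns (words, combined_score) pairs, best first.
    """
    combined = [
        (dec + weight * new, words)
        for (words, dec), new in zip(hypotheses, new_scores)
    ]
    combined.sort(key=lambda item: item[0], reverse=True)
    return [(words, score) for score, words in combined]
```

Lattice rescoring follows the same idea but applies the new scores to arcs rather than to whole hypotheses.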
Metrics and diagnostics
- Word and syllable error statistics
- Detection statistics and error distribution across speakers (and other conditions that are deemed to be important)
- Comparison to human performance
- Running scores on dev sets within group, held-out evals at least annually (NA-sayer wants weekly)

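For the word error statistic, the usual definition is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. A minimal sketch, assuming whitespace-tokenized strings and a non-empty reference; the same routine works for syllables if the inputs are syllable sequences.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate from the Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j]: edit distance between the reference prefix so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, the rate can exceed 1.0 for very poor hypotheses.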
Connection to RT evals
- Rescore output of SRI system
- In later years work more closely with RT team to transfer most successful ideas
- Feedback from RT experience (error diagnostics) is also important

Summary
- An alternative view of acoustic processing for ASR for features+models
- Pushing the envelope … aside
- Matching new front end characteristics with appropriate statistical models
- Diagnostic evaluations a key feature

Closing Thought
“When you come to a fork in the road, take it.” - Yogi Berra