By the Novel Approaches team: Nelson Morgan (ICSI), Hynek Hermansky (OGI), Dan Ellis (Columbia), Kemal Sönmez (SRI), Mari Ostendorf (UW), Hervé Bourlard (IDIAP/EPFL).

EARS Kickoff Meeting: “Pushing the Envelope”
By the Novel Approaches team: Nelson Morgan (ICSI), Hynek Hermansky (OGI), Dan Ellis (Columbia), Kemal Sönmez (SRI), Mari Ostendorf (UW), Hervé Bourlard (IDIAP/EPFL), George Doddington (NA-sayer)

Modern ASR Systems
From 50,000 ft, all ASR systems are the same:
- compute the local spectral envelope
- determine likelihoods of speech sounds
- search for the most likely HMMs
The spectral envelope is distorted by many things
- Alternatives are often bad fits to the statistical models
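
Below is a minimal sketch of the three generic stages named above, using toy diagonal-Gaussian state models and a basic Viterbi search; everything here (names, shapes, the cepstral-smoothing shortcut) is an illustrative assumption, not this project's code.
```python
import numpy as np

def spectral_envelope(frame, n_coeffs=13):
    """Stage 1: crude local spectral envelope via cepstral smoothing of the log spectrum."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    cepstrum = np.fft.irfft(np.log(spectrum + 1e-8))
    return cepstrum[:n_coeffs]  # low-order cepstra approximate the smooth envelope

def state_log_likelihoods(features, means, variances):
    """Stage 2: per-frame log-likelihood of each speech-sound state under diagonal Gaussians."""
    diff = features[:, None, :] - means[None, :, :]                       # (T, S, D)
    return -0.5 * np.sum(diff**2 / variances + np.log(2 * np.pi * variances), axis=2)

def viterbi(loglik, log_trans, log_init):
    """Stage 3: search for the most likely HMM state sequence."""
    T, S = loglik.shape
    delta = log_init + loglik[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans                               # rows: previous state
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(S)] + loglik[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```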

ASR is half-deaf
- Phonetic classification is very poor
- Success is due to constraints (domain, speaker, noise-canceling mic, etc.)
- These constraints can mask the underlying weakness of the technology

“Y'see, they just find out who complains the loudest about the cooking, and he gets to be the cook.” - Utah Phillips
Who gets to try to fix it?

Rethinking Acoustic Processing for ASR
- Escape dependence on the spectral envelope
- Use multiple front ends across time/freq
- Modify statistical models to accommodate new front ends
- Design optimal combination schemes for multiple models

The Two EARS-NA Tasks
- Signal processing: Replacing the spectral envelope by long-time and short-time (multirate) probabilistic functions of the spectro-temporal plane.
- Statistical modeling: Modifying the statistical models, both to incorporate these new multirate front ends and to explicitly handle areas of missing information.

Task 1: Pushing the Envelope (aside)
- Problem: The spectral envelope is a fragile information carrier.
- Solution: Probabilities from multiple time-frequency patches.
[Figure: OLD - a single 10 ms frame yields one estimate of sound identity; PROPOSED - multiple time-frequency patches spanning up to 1 s yield the i-th, k-th, and n-th estimates, combined by information fusion into an estimate of sound identity.]
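
As an illustration of the "information fusion" box above, one simple way to combine the per-patch probability estimates is a weighted log-domain average; the fusion rule and names below are assumptions for the sketch, not the rule being proposed.
```python
import numpy as np

def fuse_patch_posteriors(posteriors, weights=None):
    """posteriors: list of (T, n_classes) arrays, one per time-frequency patch estimator."""
    P = np.stack(posteriors)                                      # (n_estimators, T, n_classes)
    if weights is None:
        weights = np.ones(len(posteriors)) / len(posteriors)
    log_fused = np.tensordot(weights, np.log(P + 1e-12), axes=1)  # weighted geometric mean, log domain
    fused = np.exp(log_fused)
    return fused / fused.sum(axis=1, keepdims=True)               # renormalize per frame
```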

Multiple time-frequency tradeoffs
- Temporal trajectories of narrow subbands
- Optimal search for more general patches
- Data-driven broad class probabilities
[Figure: i-th, k-th, and n-th estimates taken from time-frequency patches spanning up to 1 s.]
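
A hedged sketch of the first bullet, "temporal trajectories of narrow subbands": take an up-to-1 s window of each critical-band energy track around the current frame as the input to a per-band probability estimator (in the spirit of TRAP-style features); the windowing and normalization details are assumptions.
```python
import numpy as np

def subband_trajectories(band_energies, frame, context=50):
    """band_energies: (T, n_bands) log energies at 10 ms frames; context=50 gives ~0.5 s each side."""
    T, n_bands = band_energies.shape
    lo, hi = max(0, frame - context), min(T, frame + context + 1)
    window = band_energies[lo:hi]  # up-to-1 s patch of the spectro-temporal plane
    # one mean-normalized trajectory per narrow subband
    return [window[:, b] - window[:, b].mean() for b in range(n_bands)]
```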

Pitch-related features
- Current recognizers have no use for pitch
- Listeners benefit from pitch
- Correlogram estimates spectrum of pitch
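
The sketch below is a single-band simplification of the correlogram idea: estimate pitch from the strongest short-time autocorrelation peak in a plausible lag range. The sampling rate, lag range, and voicing threshold are illustrative assumptions.
```python
import numpy as np

def frame_pitch(frame, sr=8000, fmin=60.0, fmax=400.0):
    """Return a pitch estimate in Hz for one frame, or 0.0 if it looks unvoiced."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]  # autocorrelation, lags >= 0
    lo, hi = int(sr / fmax), int(sr / fmin)                        # plausible pitch-period lags
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag if ac[lag] > 0.3 * ac[0] else 0.0              # voicing threshold is an assumption
```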

Principled multistream
- Not just different, but useful in combination:
  - minimizing relative entropy between error signals
  - minimizing conditional information of posterior signals
- Choosing categories for per-stream probabilistic functions (e.g., broad classes)
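
For concreteness, the first criterion involves the relative entropy (KL divergence) between two streams' per-frame distributions, averaged over frames; how the distributions are formed from error signals and how the criterion drives stream selection follow the slide, while the numerics below are only an assumed illustration.
```python
import numpy as np

def mean_relative_entropy(dist_a, dist_b, eps=1e-12):
    """dist_a, dist_b: (T, n_classes) per-frame distributions over the same frames."""
    pa, pb = dist_a + eps, dist_b + eps
    kl = np.sum(pa * np.log(pa / pb), axis=1)  # per-frame D(p_a || p_b)
    return float(kl.mean())                    # quantity to be minimized over stream choices
```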

Task 2: Beyond Frames…
- Problem: Features & models interact; new features may require different models.
- Solution: Advanced features require advanced models, not limited by the fixed-frame-rate paradigm.
[Figure: OLD - short-term features feeding a conventional HMM; PROPOSED - advanced features feeding a multi-rate / dynamic-scale classifier.]

Multirate Models
- Goal: Model features that span different time scales and dependence across scales/streams.
[Figure: advanced features feeding a multirate classifier.]

Multirate Models (ctd)
Why multirate vs. redundant features?
- Redundant features violate independence assumptions, lead to poor confidence (posterior) estimates
- Redundancy adds unnecessary computation
Important research issues:
- Acoustically driven rate mixing and/or variable alignment
- Discriminative learning of dependence across streams
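
A minimal sketch of one naive way to let streams at different frame rates feed a single classifier: hold the slow stream's features constant across the fast frames they span, then concatenate. This is only an assumed baseline; the multirate models above are meant to go well beyond it.
```python
import numpy as np

def align_multirate(fast_feats, slow_feats, rate_ratio):
    """fast_feats: (T, Df) at e.g. 10 ms; slow_feats at e.g. 100 ms; rate_ratio = slow/fast frame period."""
    T = fast_feats.shape[0]
    idx = np.minimum(np.arange(T) // rate_ratio, slow_feats.shape[0] - 1)  # slow frame covering each fast frame
    return np.concatenate([fast_feats, slow_feats[idx]], axis=1)           # (T, Df + Ds)
```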

Partial information techniques
- Can integrate across unknown dimensions; particularly simple for diagonal Gaussians
- e.g. spectral masks: skip missing dimensions
- Hard part is identifying the bad data
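
The "particularly simple for diagonal Gaussians" point, as a sketch: with a per-frame spectral mask marking reliable dimensions, marginalizing out the missing dimensions just means skipping them in the log-likelihood sum. Names and shapes are illustrative, not taken from the project.
```python
import numpy as np

def masked_gaussian_loglik(x, mask, mean, var):
    """x, mask, mean, var: (D,) arrays; mask is 1 for reliable dimensions, 0 for missing ones."""
    keep = mask.astype(bool)
    diff = x[keep] - mean[keep]
    return -0.5 * np.sum(diff**2 / var[keep] + np.log(2 * np.pi * var[keep]))
```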

Multistream statistics
- All possible combinations of individual streams
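
A hedged sketch of "all possible combinations of individual streams": enumerate every nonempty subset of the stream posteriors and form a combined estimate per subset. The product-and-renormalize combination rule is an assumption for illustration.
```python
import numpy as np
from itertools import combinations

def all_stream_combinations(stream_posteriors):
    """stream_posteriors: list of (T, n_classes) arrays; returns {subset: combined posterior}."""
    combined = {}
    n = len(stream_posteriors)
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            p = np.prod([stream_posteriors[i] for i in subset], axis=0)
            combined[subset] = p / p.sum(axis=1, keepdims=True)
    return combined
```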

Multistream statistics (ctd)
- Statistical modeling in both frequency and time: HMM2

Evaluation
- For greatest and most reliable progress, need frequent internal evaluations
- Most importantly, need to define helpful evaluation tasks to guide the research
- Other considerations beyond the task:
  - definition of performance measures
  - choice of corpora
  - establishment of an evaluation process

Task and corpus, initial plan
- Evaluation tasks: recognition of words and syllables
- Cross-corpus testing:
  - training on Hub 5, Macrophone
  - testing on OGI Numbers for quick turnaround, debugging
- Testing on Hub 5 in due course
- Rescoring SRI decoder output (N-best or lattice)
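
As a sketch of the last item, N-best rescoring can be as simple as adding a new score to each hypothesis from the SRI decoder and re-ranking by a weighted sum; the field names, scorer interface, and weight are assumptions.
```python
def rescore_nbest(nbest, new_scorer, weight=0.5):
    """nbest: list of dicts with 'words' and 'old_score'; new_scorer maps a word list to a log score."""
    for hyp in nbest:
        hyp['new_score'] = new_scorer(hyp['words'])
        hyp['combined'] = (1 - weight) * hyp['old_score'] + weight * hyp['new_score']
    return max(nbest, key=lambda h: h['combined'])  # best hypothesis after rescoring
```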

Metrics and diagnostics
- Word and syllable error statistics
- Detection statistics and error distribution across speakers (and other conditions that are deemed important)
- Comparison to human performance
- Running scores on dev sets within the group; held-out evals at least annually (NA-sayer wants weekly)
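
The basic metric behind the word error statistics is the standard word error rate from a Levenshtein alignment of hypothesis against reference; this is the textbook definition, not anything project-specific.
```python
def word_error_rate(ref_words, hyp_words):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    R, H = len(ref_words), len(hyp_words)
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i
    for j in range(H + 1):
        d[0][j] = j
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = d[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # substitution, deletion, insertion
    return d[R][H] / max(R, 1)
```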

Connection to RT evals
- Rescore output of SRI system
- In later years, work more closely with RT team to transfer most successful ideas
- Feedback from RT experience (error diagnostics) is also important

Summary
- An alternative view of acoustic processing for ASR, covering both features and models
- Pushing the envelope … aside
- Matching new front-end characteristics with appropriate statistical models
- Diagnostic evaluations are a key feature

Closing Thought
“When you come to a fork in the road, take it.” - Yogi Berra