SCALING UP: LEARNING LARGE-SCALE RECOGNITION METHODS FROM SMALL-SCALE RECOGNITION TASKS
Nelson Morgan, Barry Y. Chen, Qifeng Zhu, Andreas Stolcke
International Computer Science Institute, Berkeley, CA, USA
Presenter: Chen Hung-Bin
2004 Special Workshop in Maui (SWIM)

Outline
– Introduction
– Conventional Features
– Multi-Layered Perceptrons (MLPs)
– Three different temporal resolutions
– Experiments
– Conclusion

Introduction
In this paper, we describe a three-stage process of scaling up to the larger conversational telephone speech (CTS) task. One goal was to improve CTS recognition by modifying the acoustic front end.
– We found that approaches developed for the recognition of natural numbers scaled quite well to two different levels of CTS complexity: recognition of utterances primarily consisting of the 500 most frequent words in Switchboard, and large vocabulary recognition of Switchboard conversations.

Conventional Features
– Mel Frequency Cepstral Coefficients (MFCC)
– Perceptual Linear Prediction (PLP)
– Hidden Activation TRAPS (HATS)
– Modulation-filtered spectrogram (MSG)
– Relative Spectral Perceptual Linear Prediction (RASTA-PLP)

Perceptual Linear Prediction
Equal loudness preemphasis (equal-loudness curve pre-emphasis)
– The human ear is most sensitive around 4 kHz

Perceptual Linear Prediction
Intensity-loudness power law
– Cube-root amplitude compression of the equal-loudness-weighted spectrum, $\Phi(\omega) = \Xi(\omega)^{0.33}$, approximating the nonlinear relation between sound intensity and perceived loudness

Perceptual Linear Prediction
HTK implementation:
– Fill filterbank channels
– Apply the equal-loudness curve
– Do an IDFT to get autocorrelation values
– Convert the LPC coefficients to cepstral coefficients

    // Equal-loudness curve (cf[] holds the filterbank centre frequencies on the Mel scale)
    for (i = 1; i <= pOrder; i++) {
        f_hz_mid = 700 * (exp(cf[i] / 1127) - 1);   // Mel to Hz conversion
        fsq  = f_hz_mid * f_hz_mid;
        fsub = fsq / (fsq + 1.6e5);
        EQL[i] = fsub * fsub * ((fsq + 1.44e6) / (fsq + 9.61e6));
    }
    // Weight each filterbank channel by the equal-loudness curve
    for (i = 1; i <= pOrder; i++) {
        p[i+1] = bins[i] * EQL[i];
    }

RASTA-PLP modulation filtering
(Figure from “Perceptually Inspired Signal-processing Strategies for Robust Speech Recognition in Reverberant Environments,” 1998)

Multi-Layered Perceptrons (MLPs)
A multilayer perceptron is a feedforward neural network with one or more hidden layers. The signals are propagated in a forward direction on a layer-by-layer basis. The network consists of:
– an input layer of source neurons
– at least one hidden layer of computational neurons
– an output layer of computational neurons
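
To make the layer-by-layer forward pass concrete, here is a minimal sketch in C. It is not the authors' code: the layer sizes, the sigmoid hidden units, and the softmax output layer are illustrative assumptions (though typical of MLP feature nets of this kind).

    #include <math.h>

    /* Minimal single-hidden-layer MLP forward pass (illustrative sizes). */
    #define N_IN   351   /* e.g. 9 frames x 39 PLP features (assumption) */
    #define N_HID  500   /* hidden units (assumption) */
    #define N_OUT  46    /* phone classes (assumption) */

    void mlp_forward(const float x[N_IN],
                     const float w1[N_HID][N_IN], const float b1[N_HID],
                     const float w2[N_OUT][N_HID], const float b2[N_OUT],
                     float post[N_OUT])
    {
        float h[N_HID];

        /* input -> hidden: affine transform followed by a sigmoid */
        for (int j = 0; j < N_HID; j++) {
            float a = b1[j];
            for (int i = 0; i < N_IN; i++)
                a += w1[j][i] * x[i];
            h[j] = 1.0f / (1.0f + expf(-a));
        }

        /* hidden -> output: affine transform followed by a softmax,
           so the outputs can be read as phone posterior estimates */
        float amax = -INFINITY;
        for (int k = 0; k < N_OUT; k++) {
            float a = b2[k];
            for (int j = 0; j < N_HID; j++)
                a += w2[k][j] * h[j];
            post[k] = a;
            if (a > amax) amax = a;
        }
        float denom = 0.0f;
        for (int k = 0; k < N_OUT; k++) {
            post[k] = expf(post[k] - amax);   /* shift by the max for stability */
            denom += post[k];
        }
        for (int k = 0; k < N_OUT; k++)
            post[k] /= denom;
    }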

TempoRAl Patterns (TRAPs)
A spectral-energy-based vector at time t:
– Based on posterior probabilities of speech categories for long and short time functions of the time-frequency plane
– These features may be represented as multiple streams of probabilistic information
Working with narrow spectral subbands and long temporal windows:
– Naive One Stage Approach
– Two Stage Linear Approaches
– Two Stage Non-Linear Approaches: Hidden Activation TRAPS (HATS)

Naive One Stage Approach
Baseline approach:
– 51 frames of all 15 bands of log critical band energies (LCBEs) as inputs to an MLP.
– These inputs are built by stacking the 25 frames before and the 25 frames after the current frame onto the current frame; the target phoneme comes from the current frame (see the sketch below).
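
A rough sketch of how such a 51-frame, 15-band input vector can be assembled for frame t. The names and the clamping of the context window at utterance boundaries are assumptions for illustration, not the authors' code.

    #define N_BANDS   15                     /* log critical band energies per frame */
    #define CONTEXT   25                     /* frames on each side of the current frame */
    #define N_FRAMES  (2 * CONTEXT + 1)      /* 51 frames */
    #define N_INPUTS  (N_BANDS * N_FRAMES)   /* 765 MLP inputs */

    /* lcbe[t][b]: log critical band energy of band b at frame t */
    void stack_input(const float (*lcbe)[N_BANDS], int n_utt_frames,
                     int t, float x[N_INPUTS])
    {
        int n = 0;
        for (int dt = -CONTEXT; dt <= CONTEXT; dt++) {
            int tt = t + dt;
            if (tt < 0) tt = 0;                             /* clamp at utterance start */
            if (tt >= n_utt_frames) tt = n_utt_frames - 1;  /* ... and at the end */
            for (int b = 0; b < N_BANDS; b++)
                x[n++] = lcbe[tt][b];
        }
        /* the MLP target is the phone label of the centre frame t */
    }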

Two Stage Linear Approaches
15 bands × 51 frames:
– First, calculate a principal component analysis (PCA) transform for each critical band's temporal trajectory
– Second, combine what was learned at each critical band into posteriors

Two Stage Non-Linear Approaches
– Replace the per-band linear (PCA) transform with a small MLP trained on each critical band's 51-frame trajectory.
– In Hidden Activation TRAPS (HATS), the hidden-unit activations of these band-level MLPs, rather than their output posteriors, feed a second merger MLP (see the sketch below).
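
A rough sketch of the HATS first stage in C, reusing the sigmoid hidden layer from the MLP sketch above. All sizes and names are illustrative assumptions, not the authors' implementation.

    #include <math.h>

    #define N_BANDS   15
    #define N_FRAMES  51    /* temporal trajectory per band */
    #define N_HID     40    /* hidden units per band-level MLP (assumption) */

    /* For each critical band, push its 51-frame trajectory through that
       band's trained MLP, but keep the hidden activations instead of the
       output posteriors; their concatenation is the merger MLP's input. */
    void hats_first_stage(const float traj[N_BANDS][N_FRAMES],
                          const float w1[N_BANDS][N_HID][N_FRAMES],
                          const float b1[N_BANDS][N_HID],
                          float merger_in[N_BANDS * N_HID])
    {
        for (int b = 0; b < N_BANDS; b++) {
            for (int j = 0; j < N_HID; j++) {
                float a = b1[b][j];
                for (int t = 0; t < N_FRAMES; t++)
                    a += w1[b][j][t] * traj[b][t];
                merger_in[b * N_HID + j] = 1.0f / (1.0f + expf(-a));  /* sigmoid */
            }
        }
    }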

Augmenting PLP Front End Features
We used three different temporal resolutions:
– The original PLP features were derived from short-term spectral analysis
– the PLP/MLP features used 9 frames of PLP features
– and the TRAPS features used 51 frames of log critical band energies
[Figure: the feature streams are dimensionality-reduced and concatenated; the 56- and 42-dimensional labels from the original figure are only partly recoverable]
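
The augmentation itself can be sketched as follows, assuming the recipe described in the companion ICSI/SRI papers (log of the combined MLP posteriors, PCA reduction, concatenation with the conventional PLP vector); the dimensions and names here are illustrative assumptions.

    #include <math.h>

    #define N_POST  46   /* MLP phone posteriors (assumption) */
    #define N_PCA   25   /* retained principal components (assumption) */
    #define N_PLP   39   /* PLP features plus deltas (assumption) */

    /* Append PCA-reduced log MLP posteriors to the PLP feature vector. */
    void augment_plp(const float plp[N_PLP], const float post[N_POST],
                     const float pca[N_PCA][N_POST],   /* precomputed transform */
                     float out[N_PLP + N_PCA])
    {
        float logp[N_POST];
        for (int i = 0; i < N_POST; i++)
            logp[i] = logf(post[i] + 1e-10f);   /* log, floored to avoid -inf */

        for (int i = 0; i < N_PLP; i++)         /* conventional features first */
            out[i] = plp[i];

        for (int k = 0; k < N_PCA; k++) {       /* then the decorrelated MLP features */
            float a = 0.0f;
            for (int i = 0; i < N_POST; i++)
                a += pca[k][i] * logp[i];
            out[N_PLP + k] = a;
        }
    }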

Inverse entropy weighted combination (INVENT)
The combined output posterior probability is a weighted sum over the K MLP feature streams, with each stream's weight inversely proportional to the entropy of its posterior distribution:

    h_k = -\sum_{i=1}^{N} p_k(q_i \mid x) \log p_k(q_i \mid x), \quad
    w_k = \frac{1/h_k}{\sum_{j=1}^{K} 1/h_j}, \quad
    p(q_i \mid x) = \sum_{k=1}^{K} w_k \, p_k(q_i \mid x)

– the MLP feature stream with lower entropy is weighted more heavily than an MLP feature stream with high entropy

softmax
Therefore we cannot use the entropy-based weighting directly.
– We convert the spectrum into a probability mass function (PMF) using the softmax equation:

    p_i = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}

Average of the posteriors combination (AVG)
For the average combination, all K streams are weighted equally:

    p(q_i \mid x) = \frac{1}{K} \sum_{k=1}^{K} p_k(q_i \mid x)
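
A small sketch of both combination rules in C, for K posterior streams over the same N classes; the function names and the entropy floor are illustrative assumptions.

    #include <math.h>

    /* Average (AVG) combination: equal weight for each of the K streams. */
    void combine_avg(int K, int N, const float p[K][N], float out[N])
    {
        for (int i = 0; i < N; i++) {
            out[i] = 0.0f;
            for (int k = 0; k < K; k++)
                out[i] += p[k][i];
            out[i] /= (float)K;
        }
    }

    /* Inverse-entropy (INVENT) combination: each stream's weight is the
       inverse of its posterior entropy, normalised to sum to one. */
    void combine_invent(int K, int N, const float p[K][N], float out[N])
    {
        float w[K];                         /* C99 variable-length array */
        float wsum = 0.0f;

        for (int k = 0; k < K; k++) {
            float h = 0.0f;                 /* entropy of stream k */
            for (int i = 0; i < N; i++)
                if (p[k][i] > 0.0f)
                    h -= p[k][i] * logf(p[k][i]);
            w[k] = 1.0f / (h + 1e-10f);     /* floor avoids division by zero */
            wsum += w[k];
        }

        for (int i = 0; i < N; i++) {
            out[i] = 0.0f;
            for (int k = 0; k < K; k++)
                out[i] += (w[k] / wsum) * p[k][i];
        }
    }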

Experiments goal
The PLP/MLP and the TRAPS features, developed for a very small task, were then applied to successively larger problems. Our methods worked on the small-vocabulary continuous numbers task even when we did not train explicitly on continuous numbers.
There were several advantages to using the 500-word task:
– First, since the recognition vocabulary consisted of common words, it was likely that error rate reductions would carry over to the larger task as well
– Second, there were many examples of these 500 words in the training data, so less training data was required than would be needed for the full task

THE 500-WORD CTS TASK
The 500-word test set was a subset of the 2001 Hub-5 evaluation data.
– Given the 500 most common words in Switchboard I, we chose utterances from the 2001 evaluation data in which 90% or more of the words were among those 500.
The training set:
– consisted of 217 female and 205 male speakers
– contained one third of the total number of utterances
– The female speech consisted of 0.92 hours from English CallHome, … hours from Switchboard I with transcriptions, and 0.69 hours from the Switchboard Cellular Database.
– The male speech consisted of 0.19 hours from English CallHome, … hours from Switchboard I, 0.59 hours from Switchboard Cellular, and 0.06 hours from the Switchboard Credit Card Corpus.

THE 500-WORD CTS TASK
We used the tuning set to tune system parameters such as the word transition weight and the language model scaling factor, and we determined word error rates on the test set.
Tuning set:
– 0.97 hours
– 8242 total word tokens
Test set:
– 1.42 hours
– 11845 total word tokens
Recognition setup:
– Gender-independent triphone HMMs trained with the SRI speech recognition system, using a simple bigram language model

Results on Top 500 Words Task
Baseline PLP features:
– we trained gender-dependent triphone HMMs on the 23-hour RUSH training set
– and then tested this system on the 500-word test set, achieving a 43.8% word error rate
[Table: word error rate (WER) and relative reduction of WER on the top-500-word test set for the different systems]

OGI NUMBERS TASK
The training set for this stage was an 18.7-hour subset of the old “short” SRI Hub training set:
– 48% of the training data was male and 52% female
– 4.4 hours of this training set came from English CallHome
– 2.7 hours from hand-transcribed Switchboard
– 2.0 hours from the Switchboard Credit Card Corpus
– 9.6 hours from Macrophone (read speech)
Tuning set: ?
Testing set:
– 1.3 hours of speech
– 2519 utterances and 9699 word tokens
Recognition setup:
– Gender-independent triphone HMMs trained with the SRI speech recognition system, using a simple bigram language model

Results on Numbers Task
The testing dictionary contained thirty number words and two hesitation words.
[Table: word error rate (WER) and relative reduction of WER on Numbers using different combination approaches]

FULL CTS VOCABULARY
The training set was chosen as in the 500-word CTS task. This set contained a total of … hours of CTS data:
– Female speakers: 2.75 hours of English CallHome, … hours from Mississippi State transcribed Switchboard I, and 2.03 hours of Switchboard Cellular
– Male speakers: 0.56 hours of English CallHome, … hours from Switchboard I, 1.83 hours from Switchboard Cellular, and 0.20 hours of the Switchboard Credit Card Corpus

FULL CTS VOCABULARY
Tuning set: ?
Testing set:
– 6.33 hours of speech
– 62890 total word tokens
Recognition setup:
– Gender-independent triphone HMMs trained with the SRI speech recognition system, using a simple bigram language model

Results on Full CTS Task
Test set: the 2001 Hub-5 evaluation set.
[Table: word error rate (WER) and relative reduction of WER on the full CTS task using different combination approaches]

CONCLUSION
– Word error rate was significantly reduced for the larger tasks as well.
– The combination methods, which gave equivalent performance for the smaller task, were also comparable on the larger tasks.