Post-RT04 Work on Mandarin
Jeff Ma and Spyros Matsoukas
EARS STT Meeting, March 26, 2005, Philadelphia

2 Outline
• A new phoneme set
• Using long-span features
• Pitch features
• Automatic segmentation
• Data Gaussianization
• Silence chopping
• Updated system results

3 A new phoneme set
• Problem
  – Hundreds of single-phoneme words dramatically increased lattice sizes after cross-word quinphone expansion, causing out-of-memory failures
• Solution (see the sketch below)
  – Added 5 dummy phonemes to turn all single-phoneme words into double-phoneme words ("a" → "Da-a", "i" → "Di-i", "e" → "De-e", "o" → "Do-o" and "er" → "Der-er")
  – The number of cross-word quinphones was reduced by 40%

Phoneme-set size | Mix. Exp. | Un-adapted CER | Adapted CER
77               | Yes       | 40.9           | n/a
82               | Yes       | 40.8           | n/a
77               | No        |                |
                 | No        |                |
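Below is a minimal sketch of the lexicon rewrite described above. Only the five dummy-phoneme mappings come from the slide; the lexicon format and the example words are hypothetical.

```python
# Sketch of the single-phoneme word fix: every single-phoneme pronunciation
# is rewritten as a two-phoneme sequence with a dummy onset phoneme.
# The lexicon format (word -> list of phonemes) is an assumption.

DUMMY_ONSET = {"a": "Da", "i": "Di", "e": "De", "o": "Do", "er": "Der"}

def fix_single_phoneme_words(lexicon):
    """Turn every single-phoneme pronunciation into a double-phoneme one."""
    fixed = {}
    for word, phones in lexicon.items():
        if len(phones) == 1 and phones[0] in DUMMY_ONSET:
            # e.g. "a" -> ["Da", "a"]: cross-word quinphone expansion no
            # longer has to fan out across a one-phoneme word.
            fixed[word] = [DUMMY_ONSET[phones[0]], phones[0]]
        else:
            fixed[word] = list(phones)
    return fixed

# Hypothetical usage:
lexicon = {"a": ["a"], "er": ["er"], "zhong": ["zh", "ong"]}
print(fix_single_phoneme_words(lexicon))
```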

4 Long-span features (ML-SI)
• Frame concatenation (see the sketch below)
  – The basic frame consists of normalized energy, 14 PLP coefficients, pitch (F0) and probability of voicing (PV)
  – Use LDA+MLLT to project the concatenated frames down to a lower dimensionality

Input feature               | LDA+MLLT output dimension | Un-adapted CER
basic frame + derivatives   |                           |
concatenate 9 basic frames  |                           |
concatenate 13 basic frames | 46                        | 40.6
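As a rough illustration of the frame concatenation above, here is a numpy sketch. The 17-dimensional basic frame matches the slide (energy + 14 PLP + F0 + PV), while the projection matrix is random for illustration; a real LDA+MLLT transform would be estimated from training data.

```python
import numpy as np

def concat_frames(frames, context=4):
    """Stack each frame with its +/-context neighbours (9 frames total).

    frames: (T, 17) array of basic frames (normalized energy, 14 PLP
    coefficients, pitch and probability of voicing). Edges are padded
    by repeating the first/last frame. Returns (T, 17 * (2*context+1)).
    """
    T = frames.shape[0]
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

frames = np.random.randn(100, 17)      # stand-in for real basic frames
stacked = concat_frames(frames)        # (100, 153)

# Project down to 46 dimensions; a trained 46x153 LDA+MLLT matrix would
# be used in practice, a random matrix stands in here.
A = np.random.randn(46, stacked.shape[1])
print((stacked @ A.T).shape)           # (100, 46)
```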

5 Long-span features (ML-SAT)
• Modified HLDA-SAT (see the sketch below)
  – Do the first CMLLR on the original basic frames rather than on the concatenated frames, to avoid the high-dimensionality problem

Feature Set                      | HLDA-SAT                   | Adapted CER
basic frame + 1st/2nd/3rd deriv. | old                        | 36.8
concatenate 9 basic frames       | modified                   | 36.0
concatenate 9 basic frames       | modified (state re-clust.) | 35.6
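To make the order of operations concrete, here is a sketch of the modified scheme: the per-speaker CMLLR, an affine feature transform, is applied to the 17-dimensional basic frames first, and concatenation happens afterwards, so the transform is never estimated in the 153-dimensional space. The identity transform below is a stand-in; real CMLLR parameters would be estimated per speaker.

```python
import numpy as np

def apply_cmllr(frames, A, b):
    """Apply an affine CMLLR feature transform x -> A x + b to each frame."""
    return frames @ A.T + b

d = 17                                 # basic-frame dimensionality
frames = np.random.randn(200, d)

# Stand-in per-speaker transform; in practice A (d x d) and b (d,) are
# estimated with constrained MLLR for each speaker.
A, b = np.eye(d), np.zeros(d)

transformed = apply_cmllr(frames, A, b)
# Only now concatenate into the 153-dim long-span feature, e.g. with the
# concat_frames() sketch shown on the previous slide.
```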

6 Long-span features (MPE-SAT)
• MPE model training
  – Trained on top of the modified HLDA-SAT model
  – Lattices were generated directly from the backward decoding pass (instead of using N-best lattices)

Feature Set                      | Model                     | Adapted CER
basic frame + 1st/2nd/3rd deriv. | ML-SAT                    | 36.8
basic frame + 1st/2nd/3rd deriv. | MPE-SAT (N-best lattices) | 34.8
concatenate 9 basic frames       | ML-SAT                    | 35.6
concatenate 9 basic frames       | MPE-SAT (deeper lattices) | 33.3

7 Pitch features: algorithms
• Two algorithms (old vs. new)
  – Old algorithm: poor smoothing on unvoiced speech, not in the log domain, normalized to zero mean and unit variance
  – New algorithm: similar to IBM's, with good smoothing on unvoiced speech, in the log10 domain, normalized to zero mean and unit variance (see the sketch below)
  – The new one was used in RT04, since an initial comparison showed it outperformed the old one in terms of CER

Pitch | Un-adapted CER
old   | 43.8
new   | 43.6
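A rough sketch of the new pitch processing as described above (log10 domain, smoothing across unvoiced speech, zero-mean unit-variance normalization). The linear interpolation over unvoiced frames is an assumption, since the slide does not say how the smoothing is done.

```python
import numpy as np

def new_pitch_feature(f0, voiced):
    """f0: raw F0 per frame (Hz, 0 when unvoiced); voiced: boolean mask."""
    logf0 = np.zeros_like(f0, dtype=float)
    logf0[voiced] = np.log10(f0[voiced])          # log10 domain

    # Smooth over unvoiced regions by interpolating between neighbouring
    # voiced frames (one plausible choice of smoothing).
    idx = np.arange(len(f0))
    logf0 = np.interp(idx, idx[voiced], logf0[voiced])

    # Normalize to zero mean and unit variance per utterance.
    return (logf0 - logf0.mean()) / (logf0.std() + 1e-8)

f0 = np.array([0.0, 120.0, 125.0, 0.0, 0.0, 130.0, 128.0, 0.0])
print(new_pitch_feature(f0, f0 > 0))
```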

8 Pitch features: a thorough comparison

Pitch | Feature Set                      | Un-adapted CER | Adapted CER
old   | basic frame + 1st/2nd/3rd deriv. |                |
new   | basic frame + 1st/2nd/3rd deriv. |                |
old   | concatenate 9 basic frames       |                |
new   | concatenate 9 basic frames       |                |

• The old pitch is better in terms of CER

9 Automatic segmentation: problem
• Our RT04 system lost much more on Eval04 than on Dev04 due to automatic segmentation
  – Found that two conversations have severe channel leakage (cross-talk) and another has strong background noise
  – Our auto-segmentation algorithm misclassified the cross-talk as clean speech on both channels, causing many insertion errors
  – The misclassification was caused by errors in the "SS" (speech on both channels) class labeling of the training data

Test set | Manu-seg | Auto-seg | %CER increase
Dev04    |          |          |
Eval04   |          |          |

10 Automatic segmentation: solution
• Retrain the 4-class GMM with corrected labels (see the sketch below)
  – Re-assign SS (speech on both channels) class data to either the SN or the NS class (one channel is speech and the other noise) if the channel correlation is higher than a threshold (0.27)

Test set | Manu-seg | New auto-seg | %CER increase
Dev04    |          |              |
Eval04   |          |              |
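A sketch of the relabeling rule: SS segments whose two channels are strongly correlated are treated as cross-talk and reassigned to SN or NS. Only the class names and the 0.27 threshold come from the slide; the correlation measure and the energy-based choice of which channel keeps the speech label are assumptions.

```python
import numpy as np

def relabel_ss(chanA, chanB, labels, threshold=0.27):
    """Reassign SS segments to SN or NS when channel correlation is high.

    chanA, chanB: per-segment sample arrays; labels: 4-class labels
    ("SS", "SN", "NS", ...) from the GMM labeler.
    """
    fixed = []
    for a, b, lab in zip(chanA, chanB, labels):
        if lab == "SS" and np.corrcoef(a, b)[0, 1] > threshold:
            # High correlation suggests leakage: keep speech on the louder
            # channel and call the other one noise (an assumed rule).
            lab = "SN" if np.sum(a**2) >= np.sum(b**2) else "NS"
        fixed.append(lab)
    return fixed

# Hypothetical usage with one leaked segment:
a = np.random.randn(1000)
b = 0.4 * a + 0.1 * np.random.randn(1000)   # channel B is mostly leakage
print(relabel_ss([a], [b], ["SS"]))          # -> ['SN']
```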

11 Automatic segmentation: more testing
• Tested on adapted decoding passes (on Eval04)
  – Using cross-model adaptation
  – The loss was reduced to 0.4% after adaptation

Segmentation  | Un-adapted CER | 1st adapted CER | 2nd adapted CER
manual        |                |                 |
RT04 auto-seg |                |                 |
New auto-seg  |                |                 |

12 Gaussianization – after HLDA
• Inspired by the gain CU reported
• Gaussianization applied to the HLDA features (see the sketch below)
  – Tested various numbers of mixture components in the GMMs used to gaussianize the data
  – Best case: a 4-mixture GMM gives a 0.5% gain

# GMM components       | Un-adapted CER
0 (no gaussianization) |
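For reference, a minimal sketch of per-dimension GMM-based gaussianization: each value is pushed through the CDF of a 1-D GMM fitted to that dimension and then through the inverse standard-normal CDF. The 4-component setting matches the best case on the slide; everything else is a generic implementation of the technique using scipy and scikit-learn.

```python
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

def gaussianize(x, n_components=4):
    """Map a 1-D feature so its distribution is close to standard normal."""
    gmm = GaussianMixture(n_components=n_components).fit(x.reshape(-1, 1))
    w = gmm.weights_
    mu = gmm.means_.ravel()
    sd = np.sqrt(gmm.covariances_.ravel())
    # GMM CDF = weighted sum of the component normal CDFs.
    cdf = np.sum(w * norm.cdf((x[:, None] - mu) / sd), axis=1)
    cdf = np.clip(cdf, 1e-6, 1.0 - 1e-6)   # keep the probit finite
    return norm.ppf(cdf)                    # inverse standard-normal CDF

# Hypothetical usage on one (bimodal) HLDA dimension:
x = np.concatenate([np.random.randn(500) - 3.0, 2.0 * np.random.randn(500) + 4.0])
print(gaussianize(x).std())                 # close to 1
```

Applied to the static cepstra and energy instead of the HLDA outputs, the same routine would give the "before HLDA" variant described on the next slide.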

13 Gaussianization – before HLDA
• Gaussianization applied to the static cepstra and energy
  – Easy to combine with HLDA-SAT training
  – Best case: a 4-mixture GMM gives a 0.6% gain, similar to the gain in the "after-HLDA" case

# GMM components       | Un-adapted CER
0 (no gaussianization) |

14 Gaussianization – test on HLDA-SAT
• Trained HLDA-SAT with the gaussianization done before HLDA
  – 0.3% gain
  – No gain compared to the HLDA-SAT that re-clusters the quinphone states in the transformed space (35.6%)

# GMM components       | Adapted CER
0 (no gaussianization) |

15 Silence chopping
• Long silences in the new HKUST data
  – Did a quick chopping of endpoint silences during RT04
• Set up an automatic silence-chopping procedure (see the sketch below)
  – Chops segments at long silences
  – 0.2% gain

Silence chop | Un-adapted CER | 1st adapted CER (HLDA-SAT) | 1st adapted CER (MPE)
RT04 chop    |                |                            |
New chop     |                |                            |
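A sketch of the chopping rule: split a segment wherever a frame-level silence labeling (e.g. from forced alignment or a VAD) shows a sufficiently long run of silence. The frame representation and the 100-frame threshold are illustrative assumptions; the slide only says segments are chopped on long silences.

```python
def chop_on_silence(is_silence, max_sil_frames=100):
    """Split a segment at silence runs longer than max_sil_frames.

    is_silence: per-frame booleans; returns (start, end) frame ranges
    of the resulting sub-segments.
    """
    segments, start, sil_run = [], None, 0
    for t, sil in enumerate(is_silence):
        if sil:
            sil_run += 1
            if sil_run == max_sil_frames and start is not None:
                segments.append((start, t - sil_run + 1))   # chop here
                start = None
        else:
            if start is None:
                start = t                                   # segment begins
            sil_run = 0
    if start is not None:
        segments.append((start, len(is_silence)))
    return segments

# Hypothetical usage: 50 speech frames, 120 silent frames, 30 speech frames.
frames = [False] * 50 + [True] * 120 + [False] * 30
print(chop_on_silence(frames))   # -> [(0, 50), (170, 200)]
```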

16 An updated system
[System combination diagram: an un-adapted pass with M14r (CER 36.3) feeds adapted first- and second-pass decodings of the individual models; the recoverable adapted scores are M10r 29.0, M15r 29.3 and M16r 29.4, and the final ROVER combination reaches 27.2]
Note: CER measured on Eval04

17 Model characteristics

Model | Feature       | MixExp | Phn-set | SAT | Pitch
M2    | PLP, deriv.   | yes    | 77      | yes | new
M4    | PLP, deriv.   | yes    | 147     | yes | new
M10r  | PLP           | no     | 82      | yes | old
M13   | PLP, MPE-HLDA | no     | 82      | yes | new
M15r  | MFCC          | no     | 82      | yes | old
M16r  | PLP           | no     | 147     | yes | old
M14r  | PLP           | no     | 82      | no  | old

• All models trained with MPE
• M2 and M4 use derivatives and HLDA to project to 46 dimensions; all other models concatenate 9 frames and project to 46 dimensions with LDA+MLLT or MPE-HLDA
• M2, M4 and M13 have not been updated yet