Speech Group Lunch Talk, May 30th, 2006
Features for Improved Speech Activity Detection for Recognition of Multiparty Meetings
Kofi A. Boakye, International Computer Science Institute

Overview
● Background
● Previous work and proposed changes
● HMM segmenter and ASR system
● Features investigated
● Experimental results
● Conclusions

Background
● Segmentation of audio into speech/nonspeech is a critical first step in ASR
● Especially true for the Individual Headset Microphone (IHM) condition in meetings
– Issues: 1) crosstalk 2) breath/contact noise
– Single-channel energy-based methods ineffective

Background
● Initiatives such as AMI, IM2, and the NIST RT evaluations show interest in recognition and understanding of multispeaker meetings

Background
● Major source of error for IHM recognition: speech activity detection errors

Previous Work
● Previous approach: time-based intersection of two distinct segmenters
1) HMM-based segmenter with standard cepstral features (sketched below)
– 12 MFCCs
– Log-energy
– First and second differences
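As a concrete illustration, here is a minimal Python sketch (using numpy and librosa) of this 39-dimensional front end: 12 MFCCs plus log-energy, with first and second differences. The 25 ms / 10 ms framing is a conventional assumption, not stated on the slide.

```python
import numpy as np
import librosa

def cepstral_features(y, sr):
    """12 MFCCs + log-energy, with first and second differences (39 dims)."""
    n_fft, hop = int(0.025 * sr), int(0.010 * sr)  # assumed 25 ms / 10 ms framing
    # 13 coefficients; drop c0 to keep 12 MFCCs
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop)[1:]
    # short-time log-energy per frame
    frames = librosa.util.frame(y, frame_length=n_fft, hop_length=hop)
    log_e = np.log(np.maximum((frames ** 2).sum(axis=0), 1e-10))
    n = min(mfcc.shape[1], log_e.shape[0])             # align frame counts
    base = np.vstack([mfcc[:, :n], log_e[None, :n]])   # 13 x n
    d1 = librosa.feature.delta(base, order=1)          # first differences
    d2 = librosa.feature.delta(base, order=2)          # second differences
    return np.vstack([base, d1, d2]).T                 # n x 39
```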

Previous Work
● Previous approach: time-based intersection of two distinct segmenters
2) Local-energy detector
– Generates segments by zero-thresholding a “crosstalk-compensated” energy-like signal (see the sketch below)
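The slide does not spell out the compensation, so the following numpy sketch is only one plausible reading: subtract a scaled version of the loudest non-target channel's energy from the target channel's energy, then threshold at zero. The scaling factor alpha is a hypothetical parameter.

```python
import numpy as np

def local_energy_speech(energies, target, alpha=1.0):
    """energies: (n_channels, n_frames) short-time energies.
    Compensate the target channel by the loudest non-target channel
    (alpha is a hypothetical scaling), then zero-threshold.
    Returns a boolean speech/nonspeech decision per frame."""
    others = np.delete(energies, target, axis=0)
    compensated = energies[target] - alpha * others.max(axis=0)
    return compensated > 0
```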

Proposed Changes
● Though the intersection approach was effective, it was believed to be limited
– Cross-channel analysis disjoint from speech activity modeling
– Fixed threshold potentially lacks robustness
– Fails to incorporate other acoustically derived features (e.g., cross-correlation)
● New approach: integrate features directly into the HMM segmenter
– Append features to the cepstral feature vector

HMM Segmenter
● Derived from an HMM-based speech recognition system
● Two-class HMM with a three-state phone model
● Multivariate GMM with 256 components
● Segmentation proceeds by repeatedly decoding the waveform with decreasing transition penalties (see the sketch below)
– Results in segments shorter than 60 s
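A schematic of that decoding loop in Python. decode() is a hypothetical stand-in for the HMM decoder, and the penalty schedule is illustrative; only the 60 s segment cap comes from the slide.

```python
def segment(waveform, decode, penalties=(100, 50, 25, 10, 5)):
    """Repeatedly decode with smaller speech/nonspeech transition penalties
    until no segment exceeds 60 s. decode() is a hypothetical stand-in for
    the HMM decoder; it returns (start, end) pairs in seconds."""
    for penalty in penalties:
        segments = decode(waveform, transition_penalty=penalty)
        if all(end - start <= 60.0 for start, end in segments):
            break
    return segments
```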

HMM Segmenter
● Post-processing (sketched below)
– Pad segments by a fixed amount (40 ms) to prevent “clipping” effects
– Merge segments with small separation (< 0.4 s) to “smooth” the segmentation
– Constraints optimized for recognition accuracy and runtime using the segmenter with standard cepstral features
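The padding and merging steps translate directly into code. This sketch uses the 40 ms and 0.4 s values from the slide and assumes segments arrive as time-ordered (start, end) pairs in seconds.

```python
def postprocess(segments, pad=0.040, gap=0.400):
    """Pad each segment by 40 ms, then merge segments separated by
    less than 0.4 s (values from the slide)."""
    if not segments:
        return []
    padded = [(max(0.0, s - pad), e + pad) for s, e in segments]
    merged = [list(padded[0])]
    for s, e in padded[1:]:
        if s - merged[-1][1] < gap:                # small separation: merge
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    return [tuple(seg) for seg in merged]
```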

ASR System
● The ICSI-SRI RT-05S system was used for development and validation experiments
– Multiple decoding passes and front ends for cross-adaptation and hypothesis refinement
– PLP and MFCC+MLP features
– Features transformed with VTLN and HLDA, along with feature-level constrained MLLR
– Models trained on 2000 hours of telephone data and MAP-adapted to 100 hours of meeting data
– 4-gram LM trained on telephone, meeting-transcript, broadcast, and Web data

Features: Cross-channel
● Log-Energy Differences (LEDs)
– Log of the ratio of short-time energy between the target and each non-target channel
● Normalized Log-Energy Differences (NLEDs)
– Subtract the minimum frame energy of a channel from all energy values in that channel
– Addresses significant gain differences
– Largely independent of the amount of speech in the channel
(Both variants are sketched below.)
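A numpy sketch of both variants. The slide leaves open whether the minimum is subtracted in the linear or log domain; subtracting the per-channel minimum log-energy here is an assumption.

```python
import numpy as np

def log_energy_differences(log_e, target, normalize=True):
    """log_e: (n_channels, n_frames) short-time log-energies.
    LED: target-channel log-energy minus each non-target channel's,
    i.e. the log of the energy ratio. NLED first subtracts each
    channel's minimum frame value (done here in the log domain, an
    assumption). Returns (n_channels - 1, n_frames)."""
    if normalize:
        log_e = log_e - log_e.min(axis=1, keepdims=True)
    others = np.delete(log_e, target, axis=0)
    return log_e[target] - others
```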

Features: Cross-channel
● Normalized Maximum Cross-Correlation (NMXC) (see the sketch below)
– Serves as an indicator of crosstalk
– A more common cross-channel feature than energy differences
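A per-frame numpy sketch. The slide does not give the normalization, so dividing by the geometric mean of the two frame energies is an assumption, as is the 160-sample lag limit.

```python
import numpy as np

def nmxc(target_frame, other_frame, max_lag=160):
    """Normalized maximum cross-correlation between one target-channel
    frame and one non-target frame. Normalizing by the geometric mean
    of the frame energies, and the lag limit, are assumptions."""
    full = np.correlate(target_frame, other_frame, mode="full")
    mid = len(other_frame) - 1                     # index of zero lag
    lo, hi = max(mid - max_lag, 0), mid + max_lag + 1
    norm = np.sqrt((target_frame ** 2).sum() * (other_frame ** 2).sum())
    return np.abs(full[lo:hi]).max() / max(norm, 1e-10)
```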

Features: Cross-channel
● Feature vector length standardization
– For cross-channel features, the number of channels may vary, but the feature vector length must be fixed
– Proposed solution: use order statistics (maximum and minimum) of the feature values generated on the different channels, as sketched below
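The order-statistics trick is two lines of numpy: whatever the channel count, only the per-frame maximum and minimum are kept.

```python
import numpy as np

def fixed_length(values):
    """values: (n_other_channels, n_frames) cross-channel feature values.
    Keep only the per-frame maximum and minimum so the appended feature
    count stays fixed no matter how many channels a meeting has."""
    return np.vstack([values.max(axis=0), values.min(axis=0)])
```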

Experiments: AMI devtest
● Performance of the features initially investigated on the AMI development set
● Testing
– 12-minute excerpts from 4 meetings
● Training
– First 10 minutes from each of 35 meetings
● “Fast” (two-decoding-pass) version of the recognition system used for quick turnaround

Experiments: AMI devtest
● Results
– New features give a significant improvement over the baseline (reduced insertions)
– NLEDs give a ~1% reduction over LEDs

Experiments: Eval04
● Having established the effectiveness of the features, systems were evaluated on the RT-04S set
● Meetings vary in style, number of participants, and room acoustics
● Testing
– 11-minute excerpts from 8 meetings, 2 from each of CMU, ICSI, NIST, and LDC
● Training
– First 10 minutes from each of 15 NIST meetings and 73 ICSI meetings

Experiments: Eval04
● Results
– Features give improvement over the baseline and the previous system
– NMXC features not as robust; removed from consideration for the final SAD system

System Validation: Eval05 (and 06)
● Finalized system: HMM segmenter with baseline and NLED features*
● Training
– Union of previous training sets: AMI (35 mtgs), NIST (15 mtgs), ICSI (73 mtgs)
– Baseline and intersection systems used two models (ICSI+NIST and AMI)
– New systems used a single model with pooled data
* Eval06 official submission used LEDs

System Validation: Eval05 (and 06)
[Table: WER by meeting source (AMI, CMU, ICSI, NIST, VT, ALL) for the Reference, LEDs, LEDs+SDM, and NLEDs+SDM segmentation methods; values not preserved in the transcript]
● Using the SDM signal
– Eval05 included a meeting with an unmiked participant
– The SDM served as a “stand-in” mic for that participant
– Including the SDM signal (and energy normalization) improved results by >12% on NIST meetings!
– The SDM signal was not used for eval06 since there were no unmiked speakers

System Validation: Eval05 (and 06)
[Table: WER, Sub, Del, and Ins on eval05 plus WER on eval06 for the Reference, baseline, intersection, LEDs, and NLEDs systems; values not preserved in the transcript]
● 1.2% gain over last year’s segmenter on eval05
● Energy normalization gave an extra 1.2% gain on eval06 and 2.0% on eval05 (due to the unmiked speaker in a NIST meeting)

Additional Experiments: MLP Features
● Use the features as inputs to a Multi-Layer Perceptron (MLP) to see if additional gains can be made (see the sketch below)
● Training
– Inputs consist of baseline and either LED or NLED features (41 components)
– Input context window of 11 frames and 400 hidden units
– 90/10 split for cross-validation
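A PyTorch sketch matching the stated configuration (41 features × 11 frames of context, 400 hidden units). The sigmoid hidden layer and the two-class speech/nonspeech output are assumptions, not from the slide.

```python
import torch.nn as nn

class SADMLP(nn.Module):
    """MLP with 41 features per frame and an 11-frame context window
    (41 * 11 = 451 inputs), 400 hidden units. Hidden nonlinearity and
    the 2-way speech/nonspeech output are assumptions."""
    def __init__(self, n_feats=41, context=11, n_hidden=400, n_out=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_feats * context, n_hidden),
            nn.Sigmoid(),
            nn.Linear(n_hidden, n_out),
            nn.LogSoftmax(dim=-1),
        )

    def forward(self, x):          # x: (batch, 451) stacked context frames
        return self.net(x)
```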

Additional Experiments: MLP Features
● AMI devtest results
– MLP with LEDs better than with NLEDs
– Addition of baseline features degrades performance
– No combination outperforms the NLED features

Conclusions
● Integrating cross-channel analysis with speech activity modeling yields large WER reductions
● Simple cross-channel energy-based features perform very well and are more robust than cross-correlation-based features
● Minimum-energy subtraction produces still further gains
● Inclusion of an omnidirectional mic allows crosstalk suppression even for speakers without dedicated microphones
● Still room for improvement, as a significant gap (>2%) exists between automatic and ideal segmentation

Fin