Mixed Signals: Speech Activity Detection and Crosstalk in the Meetings Domain. Kofi A. Boakye, International Computer Science Institute. Speech Group Lunch Talk, June 14th, 2005.

Presentation transcript:

Mixed Signals: Speech Activity Detection and Crosstalk in the Meetings Domain
Kofi A. Boakye
International Computer Science Institute
Speech Group Lunch Talk, June 14th, 2005

Overview
Motivation
Techniques
Meetings Domain
Crosstalk compensation
Initial Results and Modifications
Subsequent results
  –Development
  –Evaluation
Conclusions

Motivation
Audio signal contains isolated non-speech phenomena
  I. Externally produced
     Examples: car honking, door slamming, telephone ringing
  II. Speaker produced
     Examples: breathing, laughing, coughing
  III. Non-production
     Examples: pause, silence

Motivation
Some of these can be dealt with by the recognizer
  –Explicit modeling
  –“Junk” model
Many cannot
  –The set of non-speaker-produced phenomena is too large, and each phenomenon too rare, for good modeling
Desire: prevent non-speech regions from being processed by the recognizer → Speech Activity Detection (SAD)

Techniques
Two main approaches
  I. Threshold based
     -Decision performed according to one or more (possibly adaptive) thresholds
     -Method very sensitive to variations
  II. Classifier based
     -Examples: Viterbi decoder, ANN, GMM
     -Method relies on general statistics rather than local information
     -Requires fairly intensive training

Techniques
Both threshold and classifier approaches typically make use of certain acoustic features
  I. Energy
     -Fundamental component of many SADs
     -Generally lacks robustness to noise and impulsive interference
  II. Zero-crossing rate
     -Successful as a correction term in energy-based systems
  III. Harmonicity (e.g., via autocorrelation)
     -Relates to voicing
     -Performs poorly in unvoiced speech regions
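
To make the three features concrete, here is a minimal pure-Python sketch that computes each one for a single frame of samples. The function names, the frame length, the 8 kHz sampling rate, and the lag range are illustrative assumptions, not details from the talk.

```python
import math
import random

def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def harmonicity(frame, min_lag, max_lag):
    """Peak of the autocorrelation, normalized by the lag-0 value,
    over a range of candidate pitch lags (in samples)."""
    r0 = sum(x * x for x in frame)
    if r0 == 0:
        return 0.0
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        r = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        best = max(best, r / r0)
    return best

# Toy frames (hypothetical): a 100 Hz sine at 8 kHz stands in for voiced
# speech, and uniform random noise stands in for an unvoiced region.
sr = 8000
voiced = [math.sin(2 * math.pi * 100 * n / sr) for n in range(400)]
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(400)]

for name, frame in (("voiced", voiced), ("noise", noise)):
    print(name,
          round(frame_energy(frame), 3),
          round(zero_crossing_rate(frame), 3),
          round(harmonicity(frame, 20, 160), 3))
```

On these toy frames the sine has a low zero-crossing rate and a high autocorrelation peak, while the noise shows the opposite pattern, which is in line with the slide's point that harmonicity tracks voicing.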

Meetings Domain
With initiatives such as M4, AMI, and our own ICSI meeting recorder project, ASR in meetings is of strong interest
Objective: Determine who said what, when, using information from multiple sensors (mics)

Meetings Domain
Sensors of interest: personal mics
  –Come as either headset or lapel units
  –Should be able to obtain fairly high-quality transcripts from these channels
Domain has certain complexities that make the task challenging, namely variability in
  1) Number of speakers
  2) Number, type, and location of sensors
  3) Acoustic conditions

Crosstalk
As a preprocessing step to ASR, SAD is also affected by these to varying levels
Key culprit in poor SAD performance: crosstalk
Example: [figure with regions labeled “target speech” and “crosstalk”]

Crosstalk compensation
Generate energy signals for each audio channel and subtract minimum energy from each
  –Minimum energy serves as “noise floor”

Crosstalk compensation
Compute mean energy of non-target channels

Crosstalk compensation
Subtract mean from target channel
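
The three compensation steps (noise-floor subtraction, mean of the non-target channels, subtraction from the target) can be sketched as follows. This is a minimal illustration under stated assumptions: the slides do not say whether energies are linear or logarithmic, how frames are defined, or what the function is called, so `compensate_crosstalk` and the toy numbers are hypothetical.

```python
def compensate_crosstalk(energies, target):
    """energies: list of per-channel energy sequences of equal length.
    target: index of the channel to compensate.
    Assumes at least two channels, time-aligned frame by frame."""
    # Step 1: subtract each channel's own minimum energy (its "noise floor").
    floored = [[e - min(ch) for e in ch] for ch in energies]
    others = [ch for i, ch in enumerate(floored) if i != target]
    compensated = []
    for t, e in enumerate(floored[target]):
        # Step 2: mean energy of the non-target channels at frame t.
        mean_other = sum(ch[t] for ch in others) / len(others)
        # Step 3: subtract that mean from the target channel.
        compensated.append(e - mean_other)
    return compensated

# Toy example (made-up numbers): channel 0 is the target. Frame 1 is the
# target speaker talking; frame 3 is someone else, picked up more strongly
# on the other channels.
energies = [
    [1, 5, 1, 3],
    [1, 1, 1, 5],
    [2, 2, 2, 6],
]
print(compensate_crosstalk(energies, target=0))
```

In the toy example the compensated value stays high where only the target channel is active and drops below zero where the other channels dominate, which is the behavior a later threshold can exploit.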

Crosstalk compensation
Apply thresholds using Schmitt trigger
Merge segments with inter-segment pauses less than a set number
Suppress segments of duration less than a set number
Apply head and tail collars to avoid “clipping” segments
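
A rough sketch of this post-processing chain is given below. The threshold values, the frame-based units, and all function and parameter names are assumptions made for illustration; the talk does not give the actual settings.

```python
def schmitt_trigger(values, high, low):
    """Hysteresis thresholding: switch on above `high`, off below `low`.
    Returns half-open (start, end) frame segments."""
    segments, start, on = [], None, False
    for t, v in enumerate(values):
        if not on and v > high:
            on, start = True, t
        elif on and v < low:
            on = False
            segments.append((start, t))
    if on:
        segments.append((start, len(values)))
    return segments

def merge_close(segments, max_gap):
    """Merge neighbouring segments separated by a pause shorter than max_gap."""
    merged = []
    for s, e in segments:
        if merged and s - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged

def drop_short(segments, min_dur):
    """Suppress segments shorter than min_dur frames."""
    return [(s, e) for s, e in segments if e - s >= min_dur]

def add_collars(segments, head, tail, n_frames):
    """Pad each segment at both ends, clamped to the signal length."""
    return [(max(0, s - head), min(n_frames, e + tail)) for s, e in segments]

# Toy compensated-energy track (made-up numbers).
values = [0, 3, 3, 1.5, 3, 0, 0, 0, 0, 3, 0, 0]
segs = schmitt_trigger(values, high=2, low=1)
segs = merge_close(segs, max_gap=3)
segs = drop_short(segs, min_dur=2)
segs = add_collars(segs, head=1, tail=1, n_frames=len(values))
print(segs)
```

The dip to 1.5 at frame 3 does not end the first segment because it stays above the lower threshold; that hysteresis is what distinguishes a Schmitt trigger from a single threshold. Collars can make neighbouring segments overlap, so a real pipeline might merge once more afterwards.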

Initial Results
Performance was examined for RT-04 Meetings development data
10-minute excerpts from 8 meetings, 2 from each of
  1) ICSI
  2) CMU
  3) LDC
  4) NIST
Note: CMU and LDC data obtained from lapel mics

Initial Results
[Tables: SUM/AVG rows for SRI Baseline and My SAD; columns # Sent, # Words, Corr, Sub, Del, Ins, WER, Sent. Err; values not preserved]
Verdict: Sad results
Possible reason: sensitivity of thresholds
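
As a reminder of how the Sub, Del, Ins, and WER columns in such scoring tables usually relate (this is the standard definition, not something stated on the slides), with $S$, $D$, $I$ the substitution, deletion, and insertion counts, $C$ the correct words, and $N$ the number of reference words:

```latex
\mathrm{WER} = \frac{S + D + I}{N} \times 100\%, \qquad N = C + S + D
```

Because insertions appear in the numerator but not in $N$, recognizing crosstalk as if it were the target speaker's words raises WER directly.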

Modification: Segment Intersection
Idea: System ideally should be generating segments from the target speaker only. By intersecting these segments with another SAD, we can filter out crosstalk and reduce insertion errors
Modified SAD to have zero threshold
  –Sensitivity needed to address deletions
  –False alarms addressed by intersection
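
The intersection itself is a simple interval operation. The sketch below assumes each SAD outputs a sorted list of non-overlapping (start, end) segments in seconds; the function name and the example times are hypothetical.

```python
def intersect_segments(a, b):
    """Intersect two sorted lists of non-overlapping (start, end) segments."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            # The two current segments overlap; keep the common part.
            out.append((start, end))
        # Advance whichever segment finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

# Made-up example: the crosstalk-compensated SAD and a second SAD
# agree only on parts of two regions.
crosstalk_sad = [(0.0, 2.0), (5.0, 7.0)]
other_sad = [(1.0, 3.0), (4.0, 4.5), (6.0, 8.0)]
print(intersect_segments(crosstalk_sad, other_sad))
```

Only regions that both detectors mark as speech survive. Each detector's false alarms are filtered out wherever the other one disagrees, which is why one of them can afford a very sensitive threshold.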

Modification: Segment Intersection
SRI SAD
  –Two-class HMM using GMMs for speech and non-speech
  –Regions merged and padded to satisfy constraints (min duration and min pause)
  –Constraints optimized for recognition accuracy
[figure: states labeled S and NS]
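
For readers unfamiliar with two-class HMM segmentation, the following is a generic Viterbi sketch, not the SRI implementation. It assumes per-frame log-likelihoods for speech and non-speech are already available (in the real system they would come from the GMMs), and it replaces the transition probabilities with a single hypothetical `switch_penalty`.

```python
def viterbi_two_state(loglik_speech, loglik_nonspeech, switch_penalty):
    """Return the best state path (1 = speech, 0 = non-speech).
    Staying in a state is free; changing state costs `switch_penalty`."""
    emis = list(zip(loglik_nonspeech, loglik_speech))  # emis[t][state]
    score = [emis[0][0], emis[0][1]]
    back = []
    for t in range(1, len(emis)):
        new_score, ptr = [], []
        for s in (0, 1):
            stay = score[s]
            switch = score[1 - s] - switch_penalty
            if stay >= switch:
                new_score.append(stay + emis[t][s])
                ptr.append(s)
            else:
                new_score.append(switch + emis[t][s])
                ptr.append(1 - s)
        score = new_score
        back.append(ptr)
    # Trace back from the better final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Made-up log-likelihoods: frame 3 weakly favours non-speech.
ll_speech = [-5, -1, -1, -3, -1, -5]
ll_nonspeech = [-1, -5, -5, -2, -5, -1]
print(viterbi_two_state(ll_speech, ll_nonspeech, switch_penalty=2))
```

With a non-zero penalty the decoder keeps frame 3 inside the surrounding speech region instead of flipping states for a single frame. That is the sense in which a classifier of this kind relies on general statistics rather than purely local information.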

New Results
[Tables: SUM/AVG rows for SRI Baseline and Intersection SAD; columns # Sent, # Words, Corr, Sub, Del, Ins, WER, Sent. Err; values not preserved]
Verdict: Happy results

New Results
[Tables: SUM/AVG rows for SRI Baseline and Intersection SAD; columns # Sent, # Words, Corr, Sub, Del, Ins, WER, Sent. Err; values not preserved]
Note that improvement comes largely from reduced insertions

New Results
[Tables: SUM/AVG rows for SRI Baseline, Intersection SAD, and Hand segmentation; columns # Sent, # Words, Corr, Sub, Del, Ins, WER, Sent. Err; values not preserved]

New Results
Site-level breakdown
[Tables: WERs and Insertions; columns All, ICSI, CMU, LDC, NIST; rows SRI SAD, Intersection SAD, Hand Segments; values not preserved]

Graphical Example
[figure: segmentations labeled SRI SAD, My SAD, Intersection, Hand Segs]

Results: Eval04
Applied 2004 Eval system to Eval04 data
11-minute excerpts from 8 meetings, 2 from each of
  1) ICSI
  2) CMU
  3) LDC
  4) NIST
Note: No lapel mics (with exception of 1 ICSI channel)

Results: Eval04
Applied 2004 Eval system to Eval04 data
[Tables: SUM/AVG rows for SRI Baseline, Intersection SAD, and Hand segmentation; columns # Sent, # Words, Corr, Sub, Del, Ins, WER, Sent. Err; values not preserved]

Results: AMI Dev Data
Applied 2005 CTS (not meetings) system with AMI-adapted LM to AMI development data
[Tables: SUM/AVG rows for SRI Baseline, Intersection SAD, and Hand segmentation; columns # Sent, # Words, Corr, Sub, Del, Ins, WER, Sent. Err; values not preserved]

Moment of Truth: Eval05
ICSI System
  –SRI SAD
     -GMMs trained on 2004 training data for non-AMI meetings and 2005 AMI data for AMI meetings
  –Recognizer
     -Based on models from SRI’s RT-04F CTS system w/ Tandem/HATS MLP features
     -Adapted to meetings using ICSI, NIST, and AMI data
     -LMs trained on conversational speech, broadcast news, and web texts and adapted to meetings
     -Vocab consisted of 54K+ words, from CTS system and ICSI, CMU, NIST, and AMI training transcripts

Moment of Truth: Eval05
[Tables: SUM/AVG rows for SRI Baseline, Intersection SAD, and Hand segmentation; columns # Sent, # Words, Corr, Sub, Del, Ins, WER, Sent. Err; values not preserved]
Cf. AMI entry: 30.6 WER !!!

Moment of Truth: Eval05
Site-level breakdown
[Tables: WERs and Insertions; columns All, ICSI, CMU, AMI, NIST, VT; rows SRI SAD, Intersection SAD, Hand Segments; values not preserved]

Moment of Truth: Eval05
One culprit: 3 NIST channels with no speech
Example (un-mic’d speaker?)
[figure: segmentations labeled SRI SAD, My SAD, Intersection, Hand Segs]

Conclusions
Crosstalk compensation is successful at reducing insertions while not adversely affecting deletions, resulting in lower WER
  –Demonstrates power of combining information sources
For 2005 Meeting Eval, gap between automatic and hand segments quite large
  –Initial analysis identifies zero-speech channels
  –Further analysis necessary

Acknowledgments
Andreas Stolcke
Chuck Wooters
Adam Janin

Fin