June 14th, 2005, Speech Group Lunch Talk
Kofi A. Boakye
International Computer Science Institute
Mixed Signals: Speech Activity Detection and Crosstalk in the Meetings Domain

Overview
Motivation
Techniques
Meetings Domain
Crosstalk compensation
Initial Results and Modifications
Subsequent results
–Development
–Evaluation
Conclusions

Motivation
Audio signal contains isolated non-speech phenomena
I. Externally produced
   Examples: car honking, door slamming, telephone ringing
II. Speaker produced
   Examples: breathing, laughing, coughing
III. Non-production
   Examples: pause, silence

Motivation
Some of these can be dealt with by the recognizer
–Explicit modeling
–“Junk” model
Many cannot
–The set of non-speaker-produced phenomena is too large, and each one too rare, for good modeling
Desire: prevent non-speech regions from being processed by the recognizer
→ Speech Activity Detection (SAD)

Techniques
Two Main Approaches
I. Threshold based
   -Decision performed according to one or more (possibly adaptive) thresholds
   -Method very sensitive to variations
II. Classifier based
   -Examples: Viterbi decoder, ANN, GMM
   -Method relies on general statistics rather than local information
   -Requires fairly intensive training

Techniques
Both threshold and classifier approaches typically make use of certain acoustic features
I. Energy
   -Fundamental component of many SADs
   -Generally lacks robustness to noise and impulsive interference
II. Zero-crossing rate
   -Successful as a correction term in energy-based systems
III. Harmonicity (e.g., via autocorrelation)
   -Relates to voicing
   -Performs poorly in unvoiced speech regions
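
The first two features can be sketched in a few lines of Python. This is a minimal illustration, not the front end of any system discussed in this talk: the function name `frame_features`, the frame length and hop (25 ms and 10 ms at an assumed 16 kHz sampling rate), and the small constant added before the logarithm are all assumptions made for the example.

```python
import math

def frame_features(samples, frame_len=400, hop=160):
    """Return one (log_energy_db, zero_crossing_rate) pair per frame.

    samples: list of float audio samples (assumed 16 kHz, so the
    default frame is 25 ms with a 10 ms hop; both values are assumptions).
    """
    feats = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        # Mean squared amplitude, converted to dB; the 1e-12 offset
        # only guards against log(0) on digital silence.
        energy = sum(x * x for x in frame) / frame_len
        log_energy = 10.0 * math.log10(energy + 1e-12)
        # Fraction of adjacent sample pairs whose signs differ.
        crossings = sum(1 for a, b in zip(frame, frame[1:])
                        if (a >= 0) != (b >= 0))
        zcr = crossings / (frame_len - 1)
        feats.append((log_energy, zcr))
    return feats
```

A threshold-based detector would compare `log_energy` against one or more thresholds per frame, optionally using `zcr` as a correction term.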

Meetings Domain
With initiatives such as M4, AMI, and our own ICSI meeting recorder project, ASR in meetings is of strong interest
Objective: Determine who said what, when, using information from multiple sensors (mics)

Meetings Domain
Sensors of interest: personal mics
–Come as either headset or lapel units
–Should be able to obtain fairly high-quality transcripts from these channels
Domain has certain complexities that make the task challenging, namely variability in
1) Number of speakers
2) Number, type, and location of sensors
3) Acoustic conditions

Crosstalk
As a preprocessing step to ASR, SAD is also affected by these complexities to varying degrees
Key culprit in poor SAD performance: crosstalk
Example: [figure with regions labeled “target speech” and “crosstalk”]

Crosstalk compensation
Generate energy signals for each audio channel and subtract minimum energy from each
–Minimum energy serves as “noise floor”
Compute mean energy of non-target channels
Subtract mean from target channel
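
A minimal sketch of these three steps, assuming each channel has already been reduced to a list of per-frame energies of equal length. The function name `compensate_crosstalk` is hypothetical, and the slides do not say whether the energies are linear or logarithmic, so the code simply operates on whatever values it is given.

```python
def compensate_crosstalk(energies, target):
    """Return the crosstalk-compensated energy track for one channel.

    energies: list of per-channel lists of frame energies (equal lengths).
    target:   index of the target channel in `energies`.
    """
    # Step 1: subtract each channel's minimum energy (its "noise floor").
    floored = []
    for ch in energies:
        floor = min(ch)
        floored.append([e - floor for e in ch])

    others = [ch for i, ch in enumerate(floored) if i != target]
    if not others:
        # Only one channel: nothing to compensate against.
        return list(floored[target])

    n_frames = len(floored[target])
    # Step 2: frame-by-frame mean energy of the non-target channels.
    mean_other = [sum(ch[t] for ch in others) / len(others)
                  for t in range(n_frames)]
    # Step 3: subtract that mean from the target channel.
    return [floored[target][t] - mean_other[t] for t in range(n_frames)]
```

Frames where the target speaker dominates stay positive, while frames dominated by other channels are pushed toward or below zero.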

Crosstalk compensation
Apply thresholds using Schmitt trigger
Merge segments separated by pauses shorter than a set duration
Suppress segments shorter than a set duration
Apply head and tail collars to avoid “clipping” segments
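
The segment-generation steps can be sketched as follows. This is an illustrative reading of the slide, not the actual implementation: the function names, the use of frame indices with exclusive end points, and all parameter names (`high`, `low`, `min_pause`, `min_dur`, `collar`) are assumptions.

```python
def schmitt_segments(track, high, low):
    """Hysteresis (Schmitt trigger) thresholding of an energy track.

    A segment opens when the track reaches `high` and closes only when
    it falls below `low`. Returns (start, end) frame pairs, end exclusive.
    """
    segments, start, active = [], 0, False
    for t, value in enumerate(track):
        if not active and value >= high:
            active, start = True, t
        elif active and value < low:
            active = False
            segments.append((start, t))
    if active:
        segments.append((start, len(track)))
    return segments

def postprocess(segments, n_frames, min_pause, min_dur, collar):
    """Merge close segments, drop short ones, then pad with collars."""
    merged = []
    for s, e in segments:
        # Merge with the previous segment if the pause is too short.
        if merged and s - merged[-1][1] < min_pause:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    # Suppress segments shorter than the minimum duration.
    kept = [(s, e) for s, e in merged if e - s >= min_dur]
    # Head and tail collars, clamped to the recording boundaries.
    # Padded segments can overlap if min_pause < 2 * collar.
    return [(max(0, s - collar), min(n_frames, e + collar)) for s, e in kept]
```

Using two thresholds rather than one keeps the decision from flickering when the energy hovers near a single threshold.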

Initial Results
Performance was examined for RT-04 Meetings development data
10-minute excerpts from 8 meetings, 2 from each of
1) ICSI
2) CMU
3) LDC
4) NIST
Note: CMU and LDC data obtained from lapel mics

Initial Results
SRI Baseline: [table: # Sent, # Words, Corr, Sub, Del, Ins, WER, Sent. Err; SUM/AVG row]
My SAD: [table: # Sent, # Words, Corr, Sub, Del, Ins, WER, Sent. Err; SUM/AVG row]
Verdict: Sad results
Possible reason: sensitivity of thresholds

Modification: Segment Intersection
Idea: System ideally should be generating segments from the target speaker only. By intersecting these segments with those of another SAD, we can filter out crosstalk and reduce insertion errors
Modified SAD to have zero threshold
–Sensitivity needed to address deletions
–False alarms addressed by intersection
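
A minimal sketch of the intersection step, assuming each SAD outputs a time-sorted list of non-overlapping `(start, end)` segments. The function name `intersect_segments` and the segment representation are assumptions for illustration.

```python
def intersect_segments(a, b):
    """Intersect two sorted lists of non-overlapping (start, end) segments.

    Only regions marked as speech by BOTH detectors survive.
    """
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever segment finishes first.
        if a[i][1] <= b[j][1]:
            i += 1
        else:
            j += 1
    return out
```

Because a region must be accepted by both detectors, the crosstalk-compensated SAD can be run with a very permissive threshold: its false alarms are removed wherever the second SAD disagrees.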

Modification: Segment Intersection
SRI SAD
–Two-class HMM using GMMs for speech and non-speech
–Regions merged and padded to satisfy constraints (min duration and min pause)
Constraints optimized for recognition accuracy
[Figure: HMM states labeled S and NS]

New Results
SRI Baseline: [table: # Sent, # Words, Corr, Sub, Del, Ins, WER, Sent. Err; SUM/AVG row]
Intersection SAD: [table: # Sent, # Words, Corr, Sub, Del, Ins, WER, Sent. Err; SUM/AVG row]
Hand segmentation: [table: # Sent, # Words, Corr, Sub, Del, Ins, WER, Sent. Err; SUM/AVG row]
Verdict: Happy results
Note that improvement comes largely from reduced insertions
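
For reference, the WER column relates to the Sub, Del, and Ins columns through the standard word error rate definition, which is why fewer insertions lower WER directly. The sketch below uses that standard definition; the function name and the counts in the usage note are made up for illustration and are not results from this talk.

```python
def wer_percent(sub, dele, ins, n_ref_words):
    """Word error rate in percent.

    sub, dele, ins: substitution, deletion, and insertion counts.
    n_ref_words:    number of words in the reference transcript
                    (equal to correct + substitutions + deletions).
    """
    if n_ref_words <= 0:
        raise ValueError("reference must contain at least one word")
    return 100.0 * (sub + dele + ins) / n_ref_words
```

With hypothetical counts of 10 substitutions, 5 deletions, and 5 insertions against 100 reference words, `wer_percent(10, 5, 5, 100)` gives 20.0; removing all 5 insertions alone would bring it to 15.0.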

New Results
Site-level breakdown:
WERs: [table: columns All, ICSI, CMU, LDC, NIST; rows SRI SAD, Intersection SAD, Hand Segments]
Insertions: [table: columns All, ICSI, CMU, LDC, NIST; rows SRI SAD, Intersection SAD, Hand Segments]

Graphical Example
[Figure: segmentation tracks for SRI SAD, My SAD, Intersection, and Hand Segs]

Results: Eval04
Applied 2004 Eval system to Eval04 data
11-minute excerpts from 8 meetings, 2 from each of
1) ICSI
2) CMU
3) LDC
4) NIST
Note: No lapel mics (with exception of 1 ICSI channel)

Results: Eval04
Applied 2004 Eval system to Eval04 data
SRI Baseline: [table: # Sent, # Words, Corr, Sub, Del, Ins, WER, Sent. Err; SUM/AVG row]
Intersection SAD: [table: # Sent, # Words, Corr, Sub, Del, Ins, WER, Sent. Err; SUM/AVG row]
Hand segmentation: [table: # Sent, # Words, Corr, Sub, Del, Ins, WER, Sent. Err; SUM/AVG row]

Results: AMI Dev Data
Applied 2005 CTS (not meetings) system with AMI-adapted LM to AMI development data
SRI Baseline: [table: # Sent, # Words, Corr, Sub, Del, Ins, WER, Sent. Err; SUM/AVG row]
Intersection SAD: [table: # Sent, # Words, Corr, Sub, Del, Ins, WER, Sent. Err; SUM/AVG row]
Hand segmentation: [table: # Sent, # Words, Corr, Sub, Del, Ins, WER, Sent. Err; SUM/AVG row]

Moment of Truth: Eval05
ICSI System
–SRI SAD
   GMMs trained on 2004 training data for non-AMI meetings and 2005 AMI data for AMI meetings
–Recognizer
   Based on models from SRI’s RT-04F CTS system w/ Tandem/HATS MLP features
   –Adapted to meetings using ICSI, NIST, and AMI data
   LMs trained on conversational speech, broadcast news, and web texts and adapted to meetings
   Vocab consisted of 54K+ words, from CTS system and ICSI, CMU, NIST, and AMI training transcripts

Moment of Truth: Eval05
SRI Baseline: [table: # Sent, # Words, Corr, Sub, Del, Ins, WER, Sent. Err; SUM/AVG row]
Intersection SAD: [table: # Sent, # Words, Corr, Sub, Del, Ins, WER, Sent. Err; SUM/AVG row]
Hand segmentation: [table: # Sent, # Words, Corr, Sub, Del, Ins, WER, Sent. Err; SUM/AVG row]
Cf. AMI entry: 30.6 WER !!!

Moment of Truth: Eval05
Site-level breakdown:
WERs: [table: columns All, ICSI, CMU, AMI, NIST, VT; rows SRI SAD, Intersection SAD, Hand Segments]
Insertions: [table: columns All, ICSI, CMU, AMI, NIST, VT; rows SRI SAD, Intersection SAD, Hand Segments]

Moment of Truth: Eval05
One culprit: 3 NIST channels with no speech
Example (un-mic’d speaker?): [figure: segmentation tracks for SRI SAD, My SAD, Intersection, and Hand Segs]

Conclusions
Crosstalk compensation is successful at reducing insertions while not adversely affecting deletions, resulting in lower WER
–Demonstrates power of combining information sources
For 2005 Meeting Eval, the gap between automatic and hand segments is quite large
–Initial analysis identifies zero-speech channels
–Further analysis necessary

Acknowledgments
Andreas Stolcke
Chuck Wooters
Adam Janin

Fin