Slide 1: Mixed Signals: Speech Activity Detection and Crosstalk in the Meetings Domain
Kofi A. Boakye
International Computer Science Institute
Speech Group Lunch Talk, June 14th, 2005
Slide 2: Overview
- Motivation
- Techniques
- Meetings Domain
- Crosstalk compensation
- Initial Results and Modifications
- Subsequent Results
  - Development
  - Evaluation
- Conclusions
Slide 3: Motivation
The audio signal contains isolated non-speech phenomena:
I. Externally produced (e.g., car honking, door slamming, telephone ringing)
II. Speaker produced (e.g., breathing, laughing, coughing)
III. Non-production (e.g., pause, silence)
Slide 4: Motivation (cont.)
- Some of these can be dealt with by the recognizer:
  - Explicit modeling
  - A "junk" model
- Many cannot: the class of non-speaker-produced phenomena is too broad, and individual events are too rare, for good modeling
- Desire: prevent non-speech regions from being processed by the recognizer → Speech Activity Detection (SAD)
Slide 5: Techniques
Two main approaches:
I. Threshold based
  - Decision performed according to one or more (possibly adaptive) thresholds
  - Method very sensitive to variations
II. Classifier based
  - Examples: Viterbi decoder, ANN, GMM
  - Method relies on general statistics rather than local information
  - Requires fairly intensive training
Slide 6: Techniques (cont.)
Both threshold- and classifier-based approaches typically make use of certain acoustic features (see the sketch after this list):
I. Energy
  - Fundamental component of many SADs
  - Generally lacks robustness to noise and impulsive interference
II. Zero-crossing rate
  - Successful as a correction term in energy-based systems
III. Harmonicity (e.g., via autocorrelation)
  - Relates to voicing
  - Performs poorly in unvoiced speech regions
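To make these features concrete, here is a minimal per-frame sketch of all three (energy, zero-crossing rate, and autocorrelation-based harmonicity). The function names and parameter values (25 ms / 10 ms frames at 16 kHz, a roughly 50-400 Hz pitch-lag range) are illustrative assumptions, not values taken from the talk.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz).
    Assumes len(x) >= frame_len."""
    n = (len(x) - frame_len) // hop + 1
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
    return x[idx]

def log_energy(frames, eps=1e-10):
    """Per-frame log energy: the fundamental component of many SADs."""
    return np.log(np.sum(frames.astype(float) ** 2, axis=1) + eps)

def zero_crossing_rate(frames):
    """Fraction of adjacent-sample sign changes per frame; often used as a
    correction term alongside energy."""
    signs = np.sign(frames)
    signs[signs == 0] = 1
    return np.mean(np.diff(signs, axis=1) != 0, axis=1)

def harmonicity(frames, min_lag=40, max_lag=320):
    """Peak of the normalized autocorrelation over a plausible pitch-lag range
    (about 50-400 Hz at 16 kHz); high for voiced speech, low for unvoiced."""
    frames = frames - frames.mean(axis=1, keepdims=True)
    out = np.zeros(len(frames))
    for i, f in enumerate(frames):
        ac = np.correlate(f, f, mode="full")[len(f) - 1:]
        if ac[0] > 0:
            out[i] = np.max(ac[min_lag:max_lag] / ac[0])
    return out
```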
Slide 7: Meetings Domain
- With initiatives such as M4, AMI, and our own ICSI Meeting Recorder project, ASR in meetings is of strong interest
- Objective: determine who said what, and when, using information from multiple sensors (mics)
Slide 8: Meetings Domain (cont.)
- Sensors of interest: personal mics
  - Come as either headset or lapel units
  - Should be able to obtain fairly high-quality transcripts from these channels
- The domain has certain complexities that make the task challenging, namely variability in:
  1) Number of speakers
  2) Number, type, and location of sensors
  3) Acoustic conditions
Slide 9: Crosstalk
- As a preprocessing step for ASR, SAD is also affected by these sources of variability, to varying degrees
- Key culprit in poor SAD performance: crosstalk
- Example: [Figure: waveform with "target speech" and "crosstalk" regions labeled]
Slide 10: Crosstalk Compensation
- Generate energy signals for each audio channel and subtract the minimum energy from each
  - The minimum energy serves as a "noise floor"
Slide 11: Crosstalk Compensation (cont.)
- Compute the mean energy of the non-target channels
Slide 12: Crosstalk Compensation (cont.)
- Subtract this mean from the target channel
Slide 13: Crosstalk Compensation (cont.)
- Apply thresholds using a Schmitt trigger
- Merge segments whose inter-segment pauses are shorter than a set duration
- Suppress segments with duration less than a set minimum
- Apply head and tail collars to avoid "clipping" segments
(The full pipeline is sketched in code below.)
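Slides 10-13 describe the compensation pipeline step by step; the following is a minimal sketch of one way those steps could fit together, assuming `signals` is a list of equal-length, time-aligned personal-mic waveforms. The function names, frame sizes, thresholds, and duration/collar constants are all illustrative placeholders, not values from the talk.

```python
import numpy as np

def channel_log_energies(signals, frame_len=400, hop=160, eps=1e-10):
    """Per-frame log energy for each channel, with each channel's minimum
    (its "noise floor") subtracted."""
    feats = []
    for x in signals:
        n = (len(x) - frame_len) // hop + 1
        idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
        e = np.log(np.sum(x[idx].astype(float) ** 2, axis=1) + eps)
        feats.append(e - e.min())
    return np.stack(feats)                     # shape: (n_channels, n_frames)

def compensate(energies, target):
    """Subtract the mean energy of the non-target channels from the target."""
    others = np.delete(energies, target, axis=0)
    return energies[target] - others.mean(axis=0)

def schmitt_trigger(score, on_thresh=2.0, off_thresh=1.0):
    """Hysteresis thresholding: turn on above on_thresh, off below off_thresh."""
    active, mask = False, np.zeros(len(score), dtype=bool)
    for i, s in enumerate(score):
        active = s > (off_thresh if active else on_thresh)
        mask[i] = active
    return mask

def mask_to_segments(mask, min_pause=30, min_dur=20, collar=5):
    """Frame mask -> (start, end) frame segments: merge pauses shorter than
    min_pause, drop segments shorter than min_dur, pad with head/tail collars."""
    segs, start = [], None
    for i, m in enumerate(list(mask) + [False]):
        if m and start is None:
            start = i
        elif not m and start is not None:
            segs.append([start, i])
            start = None
    merged = []
    for s in segs:                             # merge across short pauses
        if merged and s[0] - merged[-1][1] < min_pause:
            merged[-1][1] = s[1]
        else:
            merged.append(s)
    return [(max(0, s - collar), e + collar)   # collars avoid clipping segments
            for s, e in merged if e - s >= min_dur]

def crosstalk_sad(signals, target):
    """Full pipeline for one target channel."""
    energies = channel_log_energies(signals)
    score = compensate(energies, target)
    return mask_to_segments(schmitt_trigger(score))
```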
Slide 14: Initial Results
- Performance was examined on the RT-04 Meetings development data
- 10-minute excerpts from 8 meetings, 2 each from:
  1) ICSI
  2) CMU
  3) LDC
  4) NIST
- Note: CMU and LDC data were obtained from lapel mics
Slide 15: Initial Results (cont.)
Scoring summary (SUM/AVG rows):

                     # Sent   # Words   Corr   Sub    Del    Ins   WER    Sent. Err
  SRI Baseline         2002     18264   70.0   18.1   11.9   8.5   38.5   70.4
  My SAD               2006     18264   60.2   16.2   23.5   2.5   42.3   72.7

Verdict: Sad results
Possible reason: sensitivity of thresholds
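As a reading aid for this and the following tables (this is the standard NIST scoring convention, not something stated on the slide): the WER column is the sum of the substitution, deletion, and insertion rates, e.g. for the SRI baseline row above, WER = Sub + Del + Ins = 18.1 + 11.9 + 8.5 = 38.5.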
Slide 16: Modification: Segment Intersection
- Idea: the system should ideally be generating segments from the target speaker only; by intersecting these segments with those of another SAD, we can filter out crosstalk and reduce insertion errors (see the sketch after this list)
- Modified the SAD to have a zero threshold
  - Sensitivity needed to address deletions
  - False alarms addressed by the intersection
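The intersection itself is simple to state in code. A minimal sketch, assuming each SAD's output is a sorted list of non-overlapping (start, end) pairs; the function name and the example times are made up for illustration.

```python
def intersect_segments(a, b):
    """Intersect two sorted lists of (start, end) segments: a region is kept
    as speech only if both SADs marked it as speech."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        if a[i][1] < b[j][1]:          # advance whichever segment ends first
            i += 1
        else:
            j += 1
    return out

# Hypothetical example: an over-generating energy-based segmentation intersected
# with a second SAD's output; only regions both systems accept survive.
print(intersect_segments([(0.0, 4.0), (6.0, 9.0)], [(1.0, 2.5), (3.5, 7.0)]))
# -> [(1.0, 2.5), (3.5, 4.0), (6.0, 7.0)]
```

Because crosstalk regions tend to be accepted by only one of the two systems, the intersection trims false alarms, while the zero-threshold energy-based SAD keeps deletions low.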
Slide 17: Modification: Segment Intersection (cont.)
- SRI SAD
  - Two-class HMM using GMMs for speech and non-speech
  - Regions merged and padded to satisfy constraints (min duration and min pause)
  - Constraints optimized for recognition accuracy
- [Figure: speech (S) / non-speech (NS) state diagram]
(A generic sketch of this kind of detector follows below.)
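The slide describes the SRI SAD only at a high level. As a generic illustration of this kind of two-class GMM/HMM detector (not SRI's actual implementation), per-frame GMM log-likelihoods can be smoothed with a two-state Viterbi pass whose sticky self-loops play a role similar to the min-duration and min-pause constraints. All names and parameters below are placeholders; scikit-learn's GaussianMixture is used only for convenience.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_sad_gmms(nonspeech_feats, speech_feats, n_mix=8):
    """One diagonal-covariance GMM per class; state 0 = non-speech, 1 = speech."""
    return [GaussianMixture(n_components=n_mix, covariance_type="diag").fit(f)
            for f in (nonspeech_feats, speech_feats)]

def viterbi_sad(feats, gmms, p_stay=0.98):
    """Two-state Viterbi over per-frame GMM log-likelihoods. The high self-loop
    probability discourages rapid switching between speech and non-speech."""
    stay, switch = np.log(p_stay), np.log(1.0 - p_stay)
    ll = np.stack([g.score_samples(feats) for g in gmms], axis=1)   # (T, 2)
    T = ll.shape[0]
    delta = ll[0].copy()
    back = np.zeros((T, 2), dtype=int)
    for t in range(1, T):
        new = np.empty(2)
        for s in (0, 1):
            trans = np.array([stay if prev == s else switch for prev in (0, 1)])
            scores = delta + trans
            back[t, s] = int(np.argmax(scores))
            new[s] = scores[back[t, s]] + ll[t, s]
        delta = new
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 2, -1, -1):                                   # backtrace
        path[t] = back[t + 1, path[t + 1]]
    return path        # 1 marks speech frames, 0 non-speech
```

In the actual system, the decoded regions are additionally merged and padded to satisfy the constraints, which are optimized for recognition accuracy, as the slide notes.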
Slides 18-20: New Results
Scoring summary (SUM/AVG rows):

                     # Sent   # Words   Corr   Sub    Del    Ins   WER    Sent. Err
  SRI Baseline         2002     18264   70.0   18.1   11.9   8.5   38.5   70.4
  Intersection SAD     2002     18264   70.1   17.8   12.2   4.6   34.5   68.8
  Hand segmentation    2002     18264   72.3   18.6    9.1   3.2   30.9   61.8

Verdict: Happy results
Note that the improvement comes largely from reduced insertions.
Slide 21: New Results (cont.)
Site-level breakdown:

WERs:
                     All    ICSI   CMU    LDC    NIST
  SRI SAD            38.5   21.4   52.7   50.4   29.8
  Intersection SAD   34.5   19.0   47.9   40.9   30.9
  Hand Segments      30.9   17.8   43.3   34.5   28.8

Insertions:
                     All    ICSI   CMU    LDC    NIST
  SRI SAD             8.5    5.0    7.3   17.5    3.1
  Intersection SAD    4.6    2.2    3.9    8.6    3.2
  Hand Segments       3.2    2.0    4.0    3.7    -
Slide 22: Graphical Example
[Figure: segmentations of an example channel from the SRI SAD, my SAD, their intersection, and the hand segments]
Slide 23: Results: Eval04
- Applied the 2004 Eval system to the Eval04 data
- 11-minute excerpts from 8 meetings, 2 each from:
  1) ICSI
  2) CMU
  3) LDC
  4) NIST
- Note: no lapel mics (with the exception of 1 ICSI channel)
Slides 24-25: Results: Eval04 (cont.)
Applied the 2004 Eval system to the Eval04 data. Scoring summary (SUM/AVG rows):

                     # Sent   # Words   Corr   Sub    Del    Ins   WER    Sent. Err
  SRI Baseline         5897     20785   67.8   16.5   15.6   3.3   35.5   34.6
  Intersection SAD     5813     20781   67.8   16.2   16.0   2.1   34.3   -
  Hand segmentation    5897     20785   71.3   17.5   11.2   3.4   32.1   31.4
Slides 26-27: Results: AMI Dev Data
Applied the 2005 CTS (not meetings) system with an AMI-adapted LM to the AMI development data. Scoring summary (SUM/AVG rows):

                     # Sent   # Words   Corr   Sub    Del    Ins   WER    Sent. Err
  SRI Baseline         2887     40187   72.2   17.4   10.4   7.0   34.8   79.0
  Intersection SAD     2887     40188   72.6   16.8   10.6   3.9   31.3   77.8
  Hand segmentation    2887     40188   74.4   17.7    7.9   3.7   29.3   63.8
Slide 28: Moment of Truth: Eval05
ICSI system:
- SRI SAD: GMMs trained on 2004 training data for non-AMI meetings and on 2005 AMI data for AMI meetings
- Recognizer:
  - Based on models from SRI's RT-04F CTS system with Tandem/HATS MLP features
  - Adapted to meetings using ICSI, NIST, and AMI data
  - LMs trained on conversational speech, broadcast news, and web texts, and adapted to meetings
  - Vocabulary of 54K+ words, drawn from the CTS system and from ICSI, CMU, NIST, and AMI training transcripts
Slides 29-30: Moment of Truth: Eval05 (cont.)
Scoring summary (SUM/AVG rows):

                     # Sent   # Words   Corr   Sub    Del    Ins   WER    Sent. Err
  SRI Baseline         3329     25121   78.7   11.2   10.1   7.7   29.0   65.1
  Intersection SAD     3340     25121   77.5   11.1   11.4   3.3   25.8   66.0
  Hand segmentation    3333     25121   82.0   11.2    6.7   1.6   19.5   52.3

Cf. AMI entry: 30.6 WER !!!
Slide 31: Moment of Truth: Eval05 (cont.)
Site-level breakdown:

WERs:
                     All    ICSI   CMU    AMI    NIST   VT
  SRI SAD            29.0   20.6   23.3   22.0   44.8   35.3
  Intersection SAD   25.8   24.5   23.3   -      34.1   23.4
  Hand Segments      19.5   16.9   19.9   19.2   21.2   20.6

Insertions:
                     All    ICSI   CMU    AMI    NIST   VT
  SRI SAD             7.7    1.1    2.8    1.4   20.7   13.5
  Intersection SAD    3.3    0.9    2.6    1.5    1.3    1.6
  Hand Segments       1.6    1.0    2.6    1.4    1.1    1.5
Slide 32: Moment of Truth: Eval05 (cont.)
- One culprit: 3 NIST channels with no speech
- Example (un-mic'd speaker?):
[Figure: segmentations of such a channel from the SRI SAD, my SAD, their intersection, and the hand segments]
Slide 33: Conclusions
- Crosstalk compensation is successful at reducing insertions while not adversely affecting deletions, resulting in lower WER
  - Demonstrates the power of combining information sources
- For the 2005 Meeting Eval, the gap between automatic and hand segments is quite large
  - Initial analysis identifies zero-speech channels
  - Further analysis is necessary
Slide 34: Acknowledgments
- Andreas Stolcke
- Chuck Wooters
- Adam Janin
Slide 35: Fin