Detection of Target Speakers in Audio Databases
Ivan Magrin-Chagnolleau*, Aaron E. Rosenberg**, and S. Parthasarathy**
* Rice University, Houston, Texas - ** AT&T Labs Research, Florham Park, New Jersey
ivan@ieee.org - aer@research.att.com - sps@research.att.com
Note 1: This work was done while the first author was with AT&T Labs Research.
Note 2: The first author would like to thank Rice University for funding his conference participation.

Problem and Definitions:
Data - broadcast-band audio from television news programs, containing speech segments from a variety of speakers plus segments of mixed speech and music (typically commercials) and of music only. Speech segments may be of variable quality and may be contaminated by music, speech, and/or noise backgrounds.
Speaker detection task - locate and label the segments of designated speakers (target speakers) in the data.
Overall goal - aid information retrieval from large multimedia databases.
Assumption - segmented and labeled training data exist for the target speakers, other speakers, and other audio material.

Database:
One-target-speaker detection: subset aABC_NLI of the HUB4 database (ABC Nightline); target speaker: Ted Koppel; 3 broadcasts for training the target model; 12 broadcasts for testing (26 to 35 minutes each).
Two-target-speaker detection: subset bABC_WNN of the HUB4 database (ABC World News Now); target speakers: Mark Mullen (T1) and Thalia Assuras (T2); 3 broadcasts for training the target models; 16 broadcasts for testing (29 to 31 minutes each).
Data quality categories:
high-fidelity: high-fidelity speech with no background.
clean: all quality categories with no background.
allspeech: all quality categories, with or without background.
alldata: the previous category plus all the untranscribed portions.

Modeling:
Feature vectors: 20 cepstral coefficients + 20 Δ-cepstral coefficients.
Gaussian mixture models: 64 mixture components with diagonal covariance matrices.
Target speaker models: three 90 s segments of high-fidelity speech, extracted from 3 broadcasts and concatenated.
First background model (B1): eight 60 s segments of high-fidelity speech (4 females, 4 males), concatenated (from aABC_NLI).
Second background model (B2): three 90 s segments of non-speech data (10% music only, 10% noise only, 80% commercials), extracted from 3 broadcasts and concatenated (from aABC_NLI).
Third background model (B3): 29 segments (293.5 s) of high-fidelity speech (10 females, 10 males), concatenated (from cABC_WNT).
Fourth background model (B4): 23 segments (561.2 s) of non-speech data (commercials + theme music), extracted from 2 broadcasts and concatenated (from bABC_WNN).
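As a concrete illustration of this modeling recipe, here is a minimal training sketch. It follows the poster's settings (20 cepstra + 20 delta cepstra per frame, 64-component diagonal-covariance GMMs), but the file names are hypothetical and the library calls (librosa, scikit-learn) are modern stand-ins, not the authors' original AT&T front end.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_features(wav_path, sr=16000):
    """20 cepstral + 20 delta-cepstral coefficients per frame, as in the
    poster; librosa MFCCs stand in for the original cepstral analysis."""
    y, _ = librosa.load(wav_path, sr=sr)
    cep = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # shape (20, T)
    dcep = librosa.feature.delta(cep)                   # shape (20, T)
    return np.vstack([cep, dcep]).T                     # shape (T, 40)

def train_gmm(features, n_mix=64):
    """64-mixture GMM with diagonal covariance matrices, as specified."""
    gmm = GaussianMixture(n_components=n_mix, covariance_type="diag",
                          max_iter=100, random_state=0)
    gmm.fit(features)
    return gmm

# Hypothetical file names: each model is trained on the concatenation
# of its training segments (e.g., three 90 s target segments).
target_gmm = train_gmm(extract_features("target_train.wav"))
background_gmm = train_gmm(extract_features("background_b1.wav"))
```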
Detection algorithm:
Log-likelihood ratio for each feature vector x_t, given the target model λ_T and a background model λ_B:
R(t) = log p(x_t | λ_T) - log p(x_t | λ_B).
Smoothed log-likelihood ratio, computed every δ vectors as the average of R over a window of N vectors:
S(t) = (1/N) Σ_{k=0}^{N-1} R(t-k), with N spanning 1 s and δ spanning 0.2 s.
Segmentation algorithm: the smoothed score is compared with a decision threshold, and runs of above-threshold scores are grouped sequentially into hypothesized target segments (shown as a flow diagram in the original poster, not reproduced in this transcript).
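A minimal sketch of this scoring, smoothing, and thresholding pipeline, reusing the target_gmm and background_gmm objects from the previous sketch. The 1 s window and 0.2 s shift come from the poster; the frame rate of 100 frames/s and the threshold value are illustrative assumptions.

```python
import numpy as np

FRAMES_PER_SEC = 100          # assumed feature frame rate (10 ms hop)
N = 1 * FRAMES_PER_SEC        # smoothing window: 1 s of frames
DELTA = FRAMES_PER_SEC // 5   # compute a smoothed score every 0.2 s

def smoothed_llr(features, target_gmm, background_gmm):
    """Per-frame log-likelihood ratios, averaged over the last N frames
    and sampled every DELTA frames."""
    llr = (target_gmm.score_samples(features)
           - background_gmm.score_samples(features))
    return np.array([llr[t - N:t].mean()
                     for t in range(N, len(llr), DELTA)])

def segment(scores, threshold=0.0):
    """Group consecutive above-threshold scores into (start, end) runs,
    in units of the 0.2 s scoring step; the threshold is illustrative."""
    segments, start = [], None
    for i, s in enumerate(scores):
        if s > threshold and start is None:
            start = i
        elif s <= threshold and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(scores)))
    return segments
```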
Evaluation:
Frame-level Miss Rate (FMIR) = (# labeled target frames not estimated as target frames) / (total # labeled target frames).
Frame-level False Alarm Rate (FFAR) = (# estimated target frames labeled as non-target frames) / (total # labeled non-target frames).
Frame-level Confusion Rate (FCOR) = (# labeled target frames estimated as target frames of another speaker) / (total # labeled target frames). FCOR is a component of FMIR.
Segment-level Miss Rate (SMIR) = (# missed segments) / (total # target segments).
Segment-level False Alarm Rate (SFAR) = (# false alarm segments) / (total duration of the broadcast).
Segment-level Confusion Rate (SCOR) = (# confusion segments) / (total duration of the broadcast).
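The frame-level rates are simple ratios once reference and estimated labels are aligned frame by frame. A minimal sketch, assuming one integer label per frame where 0 means non-target and positive IDs identify target speakers (an encoding chosen here for illustration):

```python
import numpy as np

def frame_metrics(ref, hyp):
    """FMIR, FFAR, FCOR from per-frame labels.
    ref, hyp: integer arrays, one entry per frame; 0 = non-target,
    positive IDs = target speakers (assumed encoding)."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    target = ref > 0
    # Target frames not assigned to their own target (includes confusions).
    fmir = np.mean(hyp[target] != ref[target])
    # Non-target frames estimated as some target.
    ffar = np.mean(hyp[~target] > 0)
    # Target frames assigned to a different target (a component of FMIR).
    fcor = np.mean((hyp[target] > 0) & (hyp[target] != ref[target]))
    return fmir, ffar, fcor
```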
Results:
Results of the one-target-speaker detection experiments [table in the original poster, not reproduced in this transcript].
Results of the two-target-speaker detection experiments for the alldata category, using B3 and B4 as the background models [table in the original poster, not reproduced in this transcript].

Conclusion:
A method for estimating target speaker segments in multi-speaker audio data using a simple sequential decision technique has been developed. The method does not require segregating speech from other audio data, and does not require explicit models of the other speakers in the data. It works best for uniform-quality speaker segments longer than 2 seconds: approximately 70% of target speaker segments of 2 seconds or more are detected correctly, accompanied by approximately 5 false alarm segments per hour.

Future directions:
use more than one model for each target speaker.
use more background models.
study performance as a function of the smoothing parameters and the segmentation algorithm parameters.
use a new post-processor to find the best path through a speaker lattice.