Detection of Target Speakers in Audio Databases
Ivan Magrin-Chagnolleau*, Aaron E. Rosenberg**, and S. Parthasarathy**
* Rice University, Houston, Texas - ** AT&T Labs Research, Florham Park, New Jersey
ivan@ieee.org - aer@research.att.com - sps@research.att.com
Note 1: This work was done while the first author was with AT&T Labs Research.
Note 2: The first author would like to thank Rice University for funding his conference participation.

Problem and Definitions:
Data - broadcast-band audio from television news programs, containing speech segments from a variety of speakers plus segments of mixed speech and music (typically commercials) and of music only. Speech segments may be of variable quality and may be contaminated by music, speech, and/or noise backgrounds.
Speaker detection task - locate and label the segments of designated speakers (target speakers) in the data.
Overall goal - aid information retrieval from large multimedia databases.
Assumption - segmented and labeled training data exist for the target speakers, other speakers, and other audio material.

Database:
One-target-speaker detection: subset aABC_NLI of the HUB4 database (ABC Nightline); target speaker: Ted Koppel; 3 broadcasts for training the target model; 12 broadcasts for testing (26 to 35 minutes each).
Two-target-speaker detection: subset bABC_WNN of the HUB4 database (ABC World News Now); target speakers: Mark Mullen (T1) and Thalia Assuras (T2); 3 broadcasts for training the target models; 16 broadcasts for testing (29 to 31 minutes each).
Data quality categories:
high-fidelity: high-fidelity speech with no background.
clean: all quality categories with no background.
allspeech: all quality categories, with or without background.
alldata: the previous category plus all the untranscribed portions.

Modeling:
Feature vectors: 20 cepstral coefficients + 20 Δ-cepstral coefficients.
Gaussian mixture models: 64 mixture components with diagonal covariance matrices.
Target speaker models: three 90 s segments of high-fidelity speech, extracted from 3 broadcasts and concatenated.
First background model (B1): eight 60 s segments of high-fidelity speech (4 females, 4 males), concatenated (from aABC_NLI).
Second background model (B2): three 90 s segments of non-speech data (10% music only, 10% noise only, 80% commercials), extracted from 3 broadcasts and concatenated (from aABC_NLI).
Third background model (B3): 29 segments (293.5 s) of high-fidelity speech (10 females, 10 males), concatenated (from cABC_WNT).
Fourth background model (B4): 23 segments (561.2 s) of non-speech data (commercials + theme music), extracted from 2 broadcasts and concatenated (from bABC_WNN).
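As a concrete illustration of this modeling recipe, here is a minimal training sketch. It follows the poster's settings (20 cepstra + 20 delta cepstra per frame, 64-component diagonal-covariance GMMs), but the file names are hypothetical and the library calls (librosa, scikit-learn) are modern stand-ins, not the authors' original AT&T front end.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_features(wav_path, sr=16000):
    """20 cepstral + 20 delta-cepstral coefficients per frame, as in the
    poster; librosa MFCCs stand in for the original cepstral analysis."""
    y, _ = librosa.load(wav_path, sr=sr)
    cep = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # shape (20, T)
    dcep = librosa.feature.delta(cep)                   # shape (20, T)
    return np.vstack([cep, dcep]).T                     # shape (T, 40)

def train_gmm(features, n_mix=64):
    """64-mixture GMM with diagonal covariance matrices, as specified."""
    gmm = GaussianMixture(n_components=n_mix, covariance_type="diag",
                          max_iter=100, random_state=0)
    gmm.fit(features)
    return gmm

# Hypothetical file names: each model is trained on the concatenation
# of its training segments (e.g., three 90 s target segments).
target_gmm = train_gmm(extract_features("target_train.wav"))
background_gmm = train_gmm(extract_features("background_b1.wav"))
```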
Detection algorithm:
Log-likelihood ratio for each feature vector x_t, given the target model λ_T and a background model λ_B:
R(t) = log p(x_t | λ_T) - log p(x_t | λ_B).
Smoothed log-likelihood ratio, computed every δ vectors as the average of R over a window of N vectors:
S(t) = (1/N) Σ_{k=0}^{N-1} R(t-k), with N spanning 1 s and δ spanning 0.2 s.
Segmentation algorithm: the smoothed score is compared with a decision threshold, and runs of above-threshold scores are grouped sequentially into hypothesized target segments (shown as a flow diagram in the original poster, not reproduced in this transcript).
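A minimal sketch of this scoring, smoothing, and thresholding pipeline, reusing the target_gmm and background_gmm objects from the previous sketch. The 1 s window and 0.2 s shift come from the poster; the frame rate of 100 frames/s and the threshold value are illustrative assumptions.

```python
import numpy as np

FRAMES_PER_SEC = 100          # assumed feature frame rate (10 ms hop)
N = 1 * FRAMES_PER_SEC        # smoothing window: 1 s of frames
DELTA = FRAMES_PER_SEC // 5   # compute a smoothed score every 0.2 s

def smoothed_llr(features, target_gmm, background_gmm):
    """Per-frame log-likelihood ratios, averaged over the last N frames
    and sampled every DELTA frames."""
    llr = (target_gmm.score_samples(features)
           - background_gmm.score_samples(features))
    return np.array([llr[t - N:t].mean()
                     for t in range(N, len(llr), DELTA)])

def segment(scores, threshold=0.0):
    """Group consecutive above-threshold scores into (start, end) runs,
    in units of the 0.2 s scoring step; the threshold is illustrative."""
    segments, start = [], None
    for i, s in enumerate(scores):
        if s > threshold and start is None:
            start = i
        elif s <= threshold and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(scores)))
    return segments
```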
Evaluation:
Frame-level Miss Rate (FMIR) = (# labeled target frames not estimated as target frames) / (total # labeled target frames).
Frame-level False Alarm Rate (FFAR) = (# estimated target frames labeled as non-target frames) / (total # labeled non-target frames).
Frame-level Confusion Rate (FCOR) = (# labeled target frames estimated as target frames of another speaker) / (total # labeled target frames). FCOR is a component of FMIR.
Segment-level Miss Rate (SMIR) = (# missed segments) / (total # target segments).
Segment-level False Alarm Rate (SFAR) = (# false alarm segments) / (total duration of the broadcast).
Segment-level Confusion Rate (SCOR) = (# confusion segments) / (total duration of the broadcast).
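The frame-level rates are simple ratios once reference and estimated labels are aligned frame by frame. A minimal sketch, assuming one integer label per frame where 0 means non-target and positive IDs identify target speakers (an encoding chosen here for illustration):

```python
import numpy as np

def frame_metrics(ref, hyp):
    """FMIR, FFAR, FCOR from per-frame labels.
    ref, hyp: integer arrays, one entry per frame; 0 = non-target,
    positive IDs = target speakers (assumed encoding)."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    target = ref > 0
    # Target frames not assigned to their own target (includes confusions).
    fmir = np.mean(hyp[target] != ref[target])
    # Non-target frames estimated as some target.
    ffar = np.mean(hyp[~target] > 0)
    # Target frames assigned to a different target (a component of FMIR).
    fcor = np.mean((hyp[target] > 0) & (hyp[target] != ref[target]))
    return fmir, ffar, fcor
```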
Results:
Results of the one-target-speaker detection experiments [table in the original poster, not reproduced in this transcript].
Results of the two-target-speaker detection experiments for the alldata category, using B3 and B4 as the background models [table in the original poster, not reproduced in this transcript].

Conclusion:
A method for estimating target speaker segments in multi-speaker audio data using a simple sequential decision technique has been developed. The method does not require segregating speech from other audio data, and does not require explicit models of the other speakers in the data. It works best for uniform-quality speaker segments longer than 2 seconds: approximately 70% of target speaker segments of 2 seconds or more are detected correctly, accompanied by approximately 5 false alarm segments per hour.

Future directions:
use more than one model for each target speaker.
use more background models.
study performance as a function of the smoothing parameters and the segmentation algorithm parameters.
use a new post-processor to find the best path through a speaker lattice.