
1 Speaker-Targeted Audio-Visual Models for Speech Recognition in Cocktail-Party Environments
Good morning. My name is Guan-Lin Chao, from Carnegie Mellon University. This is joint work with William Chan and Prof. Ian Lane. Guan-Lin Chao (1), William Chan (1), Ian Lane (1,2); (1) Electrical and Computer Engineering, (2) Language Technology Institute, Carnegie Mellon University.

2 Outline: Introduction, Previous Work, Approach, Experiments, Results
This is the outline of this talk.

3 Introduction

4 Motivation Social Robot in a Noisy Party Mixed Speech Video Identity
ASR in Cocktail-Party Environments: Mixed Speech, Video, Identity. Say you have a social robot at a noisy party. For example, this robot wants to join the conversation with the two ladies. The first step of the interaction can be automatic speech recognition, followed by speech understanding. The problem we address in this work is automatic speech recognition in cocktail-party environments. What input signals does the robot have? Of course, it perceives the mixed speech signal. In most cases, robots are equipped with cameras, so it also has video input. Moreover, the robot may even know the identity of the speakers.

5 Scenario of Interest Cocktail-party problem with overlapping speech from two speakers: a target speaker and a background speaker. Target speaker: the speaker whose speech is to be recognized. Multimodal inputs: single-channel overlapping speech of the target speaker and the background speaker, the target speaker's mouth region of interest (ROI) image, and the target speaker's identity embedding. Our focused scenario is cocktail-party ASR with overlapping speech from two speakers: a target speaker and a background speaker. The target speaker is the speaker whose speech we want to recognize. What multimodal inputs do we have? The single-channel overlapping speech of the target speaker and the background speaker, the target speaker's mouth region of interest (ROI) image, and the target speaker's identity embedding.

6 Previous Work

7 Previous Approaches to Cocktail Party ASR
Blind Signal Separation: NMF-based (Sun '13) (Xu '15), masking-based (Reddy '07) (Grais '13); the training goal is not necessarily aligned with ASR. Multimodal Robust Features: Audio + Speaker ID (Saon '13) (Abdel-Hamid '13) (Karanasou '14) (Peddinti '15); Audio + Visual (Ngiam '11) (Mroueh '15) (Noda '15). Hybrid of both. Let's review some of the previous approaches to the cocktail-party ASR problem. They can be grouped into Blind Signal Separation, Multimodal Robust Features, and a hybrid of both. Blind Signal Separation approaches fall into Nonnegative Matrix Factorization (NMF) based and masking based. Among the NMF-based approaches, Sun et al. ("Universal speech models for speaker independent single channel source separation") learned a universal pitch model, and Xu et al. ("Single-channel speech separation using sequential discriminative dictionary learning") proposed a sparse-coding-based dictionary learning algorithm. Among the masking-based approaches, Reddy et al. ("Soft mask methods for single-channel speaker separation") proposed using soft masks to estimate the weights of the mixed signals, and Grais et al. ("Deep neural networks for single channel source separation") proposed using deep neural networks to estimate the soft masks. For Audio + Speaker ID, Saon et al. ("Speaker adaptation of neural network acoustic models using i-vectors") proposed supplying speaker identity vectors (i-vectors) along with the acoustic features; Abdel-Hamid et al. ("Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code") proposed an adaptation method using a speaker-specific code as input; Karanasou et al. ("Adaptation of deep neural network acoustic models using factorised i-vectors") used factorized i-vectors. For Audio + Visual, Ngiam et al. ("Multimodal deep learning"), Mroueh et al. ("Deep multimodal learning for audio-visual speech recognition"), and Noda et al. ("Audio-visual speech recognition using deep learning") proposed different models for learning multimodal features and different fusion techniques. In summary, NMF-based approaches learn a dictionary of universal time-frequency components to represent the mixed signal as combinations of sources, masking-based approaches learn a binary or soft mask to estimate the weights of the sources, Audio + Speaker ID approaches use i-vectors or factorized i-vectors with different model architectures for speaker adaptation, and Audio + Visual approaches use different models to learn multimodal features and fusion techniques.

8 Approach

9 Feed-forward DNN Acoustic Models with Different Combinations of Additional Modalities
Speaker-targeted model: a speaker-independent model with speaker identity information as input. We propose to use a feed-forward DNN for acoustic modeling in a hybrid DNN-HMM architecture with combinations of additional modalities: with or without visual features, and with or without speaker identity information. This gives four combinations of input, so we trained four DNN models, one for each case. Here we call a speaker-independent model that takes speaker identity information as input a speaker-targeted model.
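To make the four input combinations concrete, here is a minimal Python sketch of the per-frame feature concatenation. This is not the authors' code; the function name and the dimensionalities mentioned in the comments are illustrative assumptions.

```python
import numpy as np

def build_dnn_input(audio_feats, visual_feats=None, speaker_onehot=None):
    """Concatenate per-frame modalities into a single DNN input vector.

    audio_feats:    40 log-mel filterbanks stacked over an 11-frame context
    visual_feats:   flattened grayscale mouth-ROI pixels, or None
    speaker_onehot: one-hot target-speaker ID embedding, or None
    """
    parts = [audio_feats]
    if visual_feats is not None:       # audio-visual models (AV, AVI)
        parts.append(visual_feats)
    if speaker_onehot is not None:     # speaker-targeted models (AI, AVI)
        parts.append(speaker_onehot)
    return np.concatenate(parts)

# The four trained models differ only in which modalities are concatenated:
# A   : audio only
# AV  : audio + visual
# AI  : audio + speaker identity (speaker-targeted, audio-only)
# AVI : audio + visual + speaker identity (speaker-targeted, audio-visual)
```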

10 Three Variants of Speaker-Targeted Models: Fusing Audiovisual and Speaker Identity
We also investigate three ways to fuse the audio-visual features with the speaker identity information. There are three variants: A, B, and C. In (A), the speaker identity is concatenated directly with the audio-only or audio-visual features. In (B), the speaker identity is first mapped into a compact but presumably more discriminative embedding, and the compact embedding is then concatenated with the audio-only or audio-visual features. In (C), the speaker identity is connected to a later layer of the network than the audio-only or audio-visual features. (A) Concatenating the speaker identity directly with the audio-only and audio-visual features. (B) Mapping the speaker identity into a compact embedding, which is concatenated with the audio-only and audio-visual features. (C) Connecting the speaker identity to a later layer than the audio-only and audio-visual features.
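Below is a minimal PyTorch sketch of the three fusion variants, assuming a simple stack of fully connected layers; the class name, the embedding size, the reduced depth, and the exact layer at which the identity joins in variant (C) are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class FusionVariant(nn.Module):
    """Variant A: concatenate the raw one-hot identity with the (audio-)visual features.
    Variant B: map the identity to a compact embedding first, then concatenate.
    Variant C: inject the identity at a later hidden layer instead of the input."""

    def __init__(self, av_dim, n_speakers, n_phones=2371, hidden=2048, emb_dim=64,
                 variant="B"):
        super().__init__()
        self.variant = variant
        self.spk_embed = nn.Linear(n_speakers, emb_dim)    # used only by variant B
        if variant == "A":
            in_dim = av_dim + n_speakers
        elif variant == "B":
            in_dim = av_dim + emb_dim
        else:                                              # variant C
            in_dim = av_dim
        # depth is reduced here for brevity; the real models use 4-5 hidden layers
        self.lower = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        upper_in = hidden + (n_speakers if variant == "C" else 0)
        self.upper = nn.Sequential(nn.Linear(upper_in, hidden), nn.ReLU(),
                                   nn.Linear(hidden, n_phones))

    def forward(self, av_feats, spk_onehot):
        if self.variant == "A":
            x = torch.cat([av_feats, spk_onehot], dim=-1)
        elif self.variant == "B":
            x = torch.cat([av_feats, self.spk_embed(spk_onehot)], dim=-1)
        else:
            x = av_feats
        h = self.lower(x)
        if self.variant == "C":
            h = torch.cat([h, spk_onehot], dim=-1)         # identity enters at a later layer
        return self.upper(h)                               # phoneme-label logits

# e.g. variant B on audio-visual features (440-dim audio context + 1800-dim mouth ROI)
model_b = FusionVariant(av_dim=440 + 1800, n_speakers=34, variant="B")
```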

11 Experiments

12 Dataset Simulated speech mixtures from GRID corpus
GRID corpus (Cooke '06): low-noise audio and video recordings; 34 speakers, 1000 sentences per speaker; sentence syntax: $command $color $preposition $letter $digit $adverb. Data separation follows the 1st CHiME Challenge convention (Barker '13): 15395 training, 548 development, and 540 testing utterances. We simulated speech mixtures from the GRID corpus; the target speaker's and a background speaker's utterances are mixed with equal weights using the SoX software. The GRID corpus is a multi-speaker audio-visual corpus. It consists of low-noise audio and video recordings from 34 speakers, each reading 1000 sentences. The sentences follow the six-word syntax above. The data separation followed the convention of the 1st CHiME Challenge. We excluded some utterances for which mouth ROI images were unavailable.
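The paper performs the equal-weight mixing with SoX; the sketch below reproduces that step in plain Python (numpy and soundfile) instead, assuming mono recordings at a shared sample rate. The 0.5/0.5 weighting and the file names are illustrative.

```python
import numpy as np
import soundfile as sf

def mix_equal_weight(target_wav, background_wav, out_wav):
    """Mix a target and a background utterance with equal weights."""
    target, sr_t = sf.read(target_wav)
    background, sr_b = sf.read(background_wav)
    assert sr_t == sr_b, "both utterances are assumed to share one sample rate"
    # zero-pad the shorter signal so both have the same length
    n = max(len(target), len(background))
    target = np.pad(target, (0, n - len(target)))
    background = np.pad(background, (0, n - len(background)))
    mixture = 0.5 * target + 0.5 * background   # equal weights, scaled to avoid clipping
    sf.write(out_wav, mixture, sr_t)

mix_equal_weight("target.wav", "background.wav", "overlapping.wav")
```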

13 Dataset Clean speech sample Overlapping speech sample
Let's watch some sample videos. This is a clean speech sample, and this is a simulated overlapping speech sample.

14 Features Audio Features Visual Features Speaker Identity Information
Audio features: 40 log-mel filterbanks with a context of 11 frames. Visual features: the target speaker's grayscale mouth ROI, cropped based on facial landmarks extracted by IntraFace (de la Torre '15). Speaker identity information: the target speaker's ID embedding [0, …, 0, 1, 0, …, 0]. For the audio features, we extracted 40-bin log-mel filterbanks with a context of 11 frames. For the visual features, we first used the IntraFace software to extract facial landmarks, cropped the images to 60x30 pixels according to the mouth landmarks, and then used the target speaker's grayscale mouth-ROI pixel values as the visual features. The speaker identity information is represented by the target speaker's ID embedding, which is a one-hot vector.
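A minimal sketch of the three feature streams, assuming librosa for the log-mel filterbanks; the helper names and default window settings are assumptions, and the IntraFace landmark detection that produces the 60x30 mouth crop is not reproduced here.

```python
import numpy as np
import librosa

def audio_features(wav_path, n_mels=40, context=11):
    """40 log-mel filterbanks stacked over a symmetric 11-frame context."""
    y, sr = librosa.load(wav_path, sr=None)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    logmel = np.log(mel + 1e-10).T                       # (frames, 40)
    half = context // 2
    padded = np.pad(logmel, ((half, half), (0, 0)), mode="edge")
    return np.concatenate([padded[i:i + len(logmel)] for i in range(context)], axis=1)

def visual_features(mouth_roi_gray):
    """Flatten the 60x30 grayscale mouth-ROI crop into pixel-value features."""
    return mouth_roi_gray.astype(np.float32).reshape(-1) / 255.0

def speaker_id_feature(speaker_index, n_speakers=34):
    """One-hot target-speaker ID embedding [0, ..., 0, 1, 0, ..., 0]."""
    return np.eye(n_speakers, dtype=np.float32)[speaker_index]
```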

15 Acoustic Model Number of hidden layers
4 for the audio-only models and the speaker-independent audio-visual models; 5 for the speaker-targeted and speaker-dependent audio-visual models. 2048 nodes per hidden layer with ReLU activation. 2371 output phoneme labels. SGD with minibatch size 128 and learning rate 0.01. This is the architecture of our DNN acoustic model. The number of hidden layers is 4 for the audio-only models and the speaker-independent audio-visual models, and 5 for the speaker-targeted and speaker-dependent audio-visual models. Each hidden layer contains 2048 nodes with ReLU activation. The output layer is a softmax over 2371 phoneme labels. We use stochastic gradient descent with a minibatch size of 128 frames and a learning rate of 0.01.
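A minimal PyTorch sketch of this configuration, not the authors' code; the synthetic dataset and the 440-dimensional audio-only input are placeholders.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def make_acoustic_model(input_dim, n_hidden_layers=4, hidden=2048, n_phones=2371):
    layers, prev = [], input_dim
    for _ in range(n_hidden_layers):            # 4 or 5 hidden layers of 2048 ReLU units
        layers += [nn.Linear(prev, hidden), nn.ReLU()]
        prev = hidden
    layers.append(nn.Linear(prev, n_phones))    # logits over 2371 phoneme labels
    return nn.Sequential(*layers)

model = make_acoustic_model(input_dim=440)      # e.g. audio-only: 40 log-mels x 11-frame context
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)        # SGD, learning rate 0.01
criterion = nn.CrossEntropyLoss()               # applies the softmax over phoneme labels

# synthetic stand-in for the real frame-level training set (placeholder only)
train_frames = TensorDataset(torch.randn(1024, 440), torch.randint(0, 2371, (1024,)))
loader = DataLoader(train_frames, batch_size=128, shuffle=True)  # minibatch size 128

for feats, labels in loader:
    optimizer.zero_grad()
    loss = criterion(model(feats), labels)
    loss.backward()
    optimizer.step()
```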

16 Results

17 WER Comparisons of Models
The audio-only baseline for the two-speaker cocktail-party problem is 26.3% WER. Introducing visual information in addition to the acoustic features reduces the WER significantly in cocktail-party environments, improving it to 4.4%. Using speaker identity information in conjunction with the acoustic features achieves a better improvement, reducing the WER to as low as 3.6%. This is the WER table of the different models. The audio-only baseline for the two-speaker cocktail-party problem is 26.3%. The table also shows that introducing visual information to the acoustic features can reduce the WER significantly in cocktail-party environments, improving the WER to 4.4%, and that using speaker identity information in conjunction with the acoustic features achieves a better improvement, reducing the WER to as low as 3.6%.

18 WER Comparisons of Models
There is a weak tendency that providing the speaker information in earlier layers of the network is advantageous. A speaker-dependent ASR system performs better than a speaker-targeted ASR system. Introducing visual information improves the WER of the speaker-dependent acoustic models, while it does not improve the speaker-targeted acoustic models. Comparing the three variants of the speaker-targeted models, we see a weak tendency that providing the speaker information in earlier layers of the network is advantageous. Comparing the speaker-dependent and speaker-targeted models, an intuitive result is that a speaker-dependent ASR system, which is optimized for one specific speaker, performs better than a speaker-targeted ASR system, which is optimized for multiple speakers simultaneously. Introducing visual information improves the WER of the speaker-dependent acoustic models but not of the speaker-targeted acoustic models. We ascribe this finding to the limited modeling capacity of the neural network used for both models: in a speaker-dependent model, the capacity is sufficient to make use of one specific speaker's visual information, but a single speaker-targeted model is not powerful enough to learn a unified optimization over all 31 speakers' visual information.

19 WER Comparisons of Speakers
WER of the individual speakers. The different models show a similar trend in their performance on the individual speakers.

20 Summary

21 Summary Proposed a speaker-targeted audio-visual DNN-HMM model for speech recognition in cocktail-party environments. Used different combinations of acoustic and visual features and speaker identity information as DNN inputs. Experiments suggested the performance ordering DNN_AVI ≈ DNN_AI > DNN_AV > DNN_A. Prospective work: investigate better representations of the multimodal data space to incorporate the audio, visual, and speaker identity information; explore more model architectures to achieve a better unified optimization for the speaker-targeted audio-visual models.

22 Thank you! Questions?

