
Speaker-Targeted Audio-Visual Models for Speech Recognition in Cocktail-Party Environments
Guan-Lin Chao¹, William Chan¹, Ian Lane¹,²
¹Electrical and Computer Engineering, ²Language Technologies Institute, Carnegie Mellon University

Good morning. My name is Guan-Lin Chao, from Carnegie Mellon University. This is joint work with William Chan and Prof. Ian Lane.

Outline
Introduction
Previous Work
Approach
Experiments
Results

This is the outline of the talk.

Introduction

Motivation: Social Robot in a Noisy Party
ASR in cocktail-party environments
Inputs: mixed speech, video, identity

Say you have a social robot at a noisy party. For example, this robot wants to join the conversation with the two ladies. The first step of the interaction can be automatic speech recognition, followed by speech understanding. The problem we address in this work is automatic speech recognition in cocktail-party environments. What input signals does the robot have? Of course it perceives the mixed speech signal. In most cases, robots are equipped with cameras, so it also has video input. Moreover, the robot may even know the identities of the speakers.

Scenario of Interest
Cocktail-party problem with overlapping speech from two speakers: a target speaker and a background speaker
Target speaker: the speaker whose speech is to be recognized
Multimodal inputs:
- single-channel overlapping speech of the target speaker and the background speaker
- target speaker's mouth region of interest (ROI) image
- target speaker's identity embedding

Our focused scenario is cocktail-party ASR with overlapping speech from two speakers: a target speaker and a background speaker. The target speaker is the speaker whose speech we want to recognize. The multimodal inputs we have are the single-channel overlapping speech of the two speakers, the target speaker's mouth region of interest (ROI) image, and the target speaker's identity embedding.

Previous Work

Previous Approaches to Cocktail-Party ASR
Blind Signal Separation
- NMF-based (Sun '13) (Xu '15), masking-based (Reddy '07) (Grais '13)
- Training goal not necessarily aligned with ASR
Multimodal Robust Features
- Audio + Speaker ID: (Saon '13) (Abdel-Hamid '13) (Karanasou '14) (Peddinti '15)
- Audio + Visual: (Ngiam '11) (Mroueh '15) (Noda '15)
Hybrid of both

Let's review some of the previous approaches to the cocktail-party ASR problem. They can be grouped into two main categories, blind signal separation and multimodal robust features, plus a hybrid of both.

Blind signal separation approaches divide into nonnegative-matrix-factorization-based and masking-based methods. NMF-based approaches learn a dictionary of universal time-frequency components to represent the mixed signal as a combination of sources: Sun et al. ("Universal speech models for speaker independent single channel source separation") learned a universal pitch model, and Xu et al. ("Single-channel speech separation using sequential discriminative dictionary learning") proposed a sparse-coding-based dictionary learning algorithm. Masking-based approaches learn a binary or soft mask to estimate the weights of the sources: Reddy et al. ("Soft mask methods for single-channel speaker separation") proposed soft masks to estimate the weights of the mixed signals, and Grais et al. ("Deep neural networks for single channel source separation") proposed deep neural networks to estimate the soft masks. A drawback is that the separation training goal is not necessarily aligned with ASR.

Among the multimodal robust-feature approaches, the audio + speaker ID methods use i-vectors or factorized i-vectors with different model architectures for speaker adaptation: Saon et al. ("Speaker adaptation of neural network acoustic models using i-vectors") proposed supplying speaker identity vectors, i-vectors, along with the acoustic features; Abdel-Hamid et al. ("Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code") proposed an adaptation method using a speaker-specific code as input; and Karanasou et al. ("Adaptation of deep neural network acoustic models using factorised i-vectors") used factorized i-vectors. The audio + visual methods propose different models for learning multimodal features and fusion techniques: Ngiam et al. ("Multimodal deep learning"), Mroueh et al. ("Deep multimodal learning for audio-visual speech recognition"), and Noda et al. ("Audio-visual speech recognition using deep learning").

Approach

Feed-forward DNN Acoustic Models with Different Combinations of Additional Modalities
Speaker-targeted model: a speaker-independent model with speaker identity information as input

We propose to use a feed-forward DNN for acoustic modeling in a hybrid DNN-HMM architecture, with different combinations of additional modalities: with or without visual features, and with or without speaker identity information. There are four input combinations, so we trained four DNN models, one for each case. We term a speaker-independent model that takes speaker identity information as input a speaker-targeted model.
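
To make the four input combinations concrete, here is a minimal sketch (mine, not the authors' code) of how the DNN input vector could be assembled for each model. The feature shapes follow the Features and Acoustic Model slides later in the talk, and all names are illustrative.

    import numpy as np

    def build_input(audio_feat, visual_feat=None, speaker_onehot=None):
        """Concatenate whichever modalities a given model variant uses."""
        parts = [audio_feat.ravel()]            # audio-only baseline (DNN_A)
        if visual_feat is not None:             # + mouth ROI pixels (DNN_AV)
            parts.append(visual_feat.ravel())
        if speaker_onehot is not None:          # + speaker identity (DNN_AI / DNN_AVI)
            parts.append(speaker_onehot)
        return np.concatenate(parts)

    audio = np.zeros((11, 40))      # 11-frame context of 40 log-mel filterbanks
    mouth = np.zeros((30, 60))      # grayscale mouth ROI pixels
    spk = np.eye(34)[3]             # one-hot ID, e.g. speaker 3 of 34
    x = build_input(audio, mouth, spk)   # input for the audio-visual speaker-targeted model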

Three Variants of Speaker-Targeted Models: Fusing Audio-Visual Features and Speaker Identity
(A) Concatenating the speaker identity directly with the audio-only or audio-visual features
(B) Mapping the speaker identity into a compact embedding, which is then concatenated with the audio-only or audio-visual features
(C) Connecting the speaker identity to a later layer than the audio-only or audio-visual features

We also investigate three ways to fuse the audio-visual features with the speaker identity information; there are three variants, A, B, and C. In (A), the speaker identity is concatenated directly with the audio or audio-visual features. In (B), the speaker identity is first mapped into a compact but presumably more discriminative embedding, and the compact embedding is then concatenated with the audio or audio-visual features. In (C), the speaker identity is connected to a later layer of the network than the audio or audio-visual features.
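
Below is an illustrative PyTorch sketch of the three fusion variants. The layer widths follow the Acoustic Model slide, but the embedding size and the exact layer at which the identity joins in variant C are my assumptions, not the paper's specification.

    import torch
    import torch.nn as nn

    class VariantA(nn.Module):
        """(A) speaker one-hot concatenated directly with the audio(-visual) features."""
        def __init__(self, av_dim, spk_dim, hidden=2048, out=2371):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(av_dim + spk_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, out))
        def forward(self, av, spk):
            return self.net(torch.cat([av, spk], dim=-1))

    class VariantB(VariantA):
        """(B) speaker one-hot first mapped to a compact embedding (size assumed)."""
        def __init__(self, av_dim, spk_dim, emb_dim=50, **kw):
            super().__init__(av_dim, emb_dim, **kw)
            self.embed = nn.Linear(spk_dim, emb_dim)
        def forward(self, av, spk):
            return self.net(torch.cat([av, self.embed(spk)], dim=-1))

    class VariantC(nn.Module):
        """(C) speaker identity joins the network at a later hidden layer."""
        def __init__(self, av_dim, spk_dim, hidden=2048, out=2371):
            super().__init__()
            self.lower = nn.Sequential(nn.Linear(av_dim, hidden), nn.ReLU())
            self.upper = nn.Sequential(
                nn.Linear(hidden + spk_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, out))
        def forward(self, av, spk):
            return self.upper(torch.cat([self.lower(av), spk], dim=-1))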

Experiments

Dataset
Simulated speech mixtures from the GRID corpus
GRID corpus (Cooke '06):
- low-noise audio and video recordings
- 34 speakers, 1000 sentences per speaker
- sentence syntax: $command $color $preposition $letter $digit $adverb
Data split following the 1st CHiME Challenge convention (Barker '13): 15395 training, 548 development, and 540 test utterances

We simulated speech mixtures from the GRID corpus: the target speaker's and a background speaker's utterances are mixed with equal weights using the SoX software. The GRID corpus is a multi-speaker audio-visual corpus consisting of low-noise audio and video recordings from 34 speakers, each reading 1000 sentences that follow the six-word syntax above. The data split followed the convention of the 1st CHiME Challenge. We excluded utterances for which the mouth ROI images were unavailable.
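
The talk states that utterances were mixed with equal weights using SoX; the snippet below is a rough Python approximation for illustration only. The use of the soundfile library and the 0.5/0.5 scaling are my assumptions about the exact procedure.

    import numpy as np
    import soundfile as sf   # assumed I/O library; the original pipeline used SoX

    def mix_equal(target_wav, background_wav, out_wav):
        x, sr_x = sf.read(target_wav)
        y, sr_y = sf.read(background_wav)
        assert sr_x == sr_y, "both utterances are expected to share one sample rate"
        n = min(len(x), len(y))              # truncate to the shorter utterance
        mixed = 0.5 * (x[:n] + y[:n])        # equal weights; the 0.5 scaling is an assumption
        sf.write(out_wav, mixed, sr_x)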

Dataset
Clean speech sample, overlapping speech sample

Let's watch some sample videos. This is a clean speech sample, and this is a simulated overlapping speech sample.

Features
Audio features: 40 log-mel filterbanks with a context of 11 frames
Visual features: the target speaker's grayscale mouth ROI, based on facial landmarks extracted by IntraFace (de la Torre '15)
Speaker identity information: the target speaker's ID embedding, a one-hot vector [0, ..., 0, 1, 0, ..., 0]

For the audio features, we extracted 40-bin log-mel filterbanks with a context of 11 frames. For the visual features, we first used the IntraFace software to extract facial landmarks, cropped the images to 60x30 pixels around the mouth landmarks, and then used the target speaker's grayscale mouth ROI pixel values as visual features. The speaker identity information is represented by the target speaker's ID embedding, which is a one-hot vector.
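
As a concrete illustration of the audio features, here is a sketch of the 11-frame context splicing over 40 log-mel filterbanks. The use of librosa and the 25 ms / 10 ms framing are my assumptions, since the talk only specifies the filterbank and context sizes.

    import numpy as np
    import librosa

    def logmel_with_context(wav_path, n_mels=40, context=11, sr=16000):
        y, _ = librosa.load(wav_path, sr=sr)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                             n_fft=400, hop_length=160)  # 25 ms window, 10 ms hop (assumed)
        logmel = np.log(mel + 1e-6).T                       # (num_frames, 40)
        half = context // 2
        padded = np.pad(logmel, ((half, half), (0, 0)), mode="edge")
        # each output row is the concatenation of 11 consecutive frames centered on frame i
        return np.stack([padded[i:i + context].ravel()
                         for i in range(len(logmel))])      # (num_frames, 440)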

Acoustic Model
Number of hidden layers: 4 for audio-only models and speaker-independent audio-visual models; 5 for speaker-targeted and speaker-dependent audio-visual models
2048 nodes per hidden layer with ReLU activation
2371 output phoneme labels
SGD with minibatch size 128 and learning rate 0.01

This is the architecture of our DNN acoustic model. The number of hidden layers is 4 for the audio-only models and the speaker-independent audio-visual models, and 5 for the speaker-targeted and speaker-dependent audio-visual models. Each hidden layer contains 2048 nodes with ReLU activation. The output layer is a softmax over 2371 phoneme labels. We use stochastic gradient descent with a minibatch size of 128 frames and a learning rate of 0.01.
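
For reference, the hyperparameters on this slide restated as a small PyTorch sketch; this is an illustrative reconstruction, not the original training code.

    import torch
    import torch.nn as nn

    def make_acoustic_model(in_dim, n_hidden=4, width=2048, n_phones=2371):
        # n_hidden = 4 for audio-only and speaker-independent audio-visual models,
        # 5 for speaker-targeted and speaker-dependent audio-visual models
        layers, prev = [], in_dim
        for _ in range(n_hidden):
            layers += [nn.Linear(prev, width), nn.ReLU()]
            prev = width
        layers.append(nn.Linear(prev, n_phones))   # 2371 phoneme labels (softmax via the loss)
        return nn.Sequential(*layers)

    model = make_acoustic_model(in_dim=440)        # e.g. the 11 x 40 audio-only input
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()              # trained on minibatches of 128 frames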

Results

WER Comparisons of Models
The audio-only baseline for the two-speaker cocktail-party problem is 26.3% WER
Introducing visual information alongside the acoustic features reduces WER significantly in cocktail-party environments, improving the WER to 4.4%
Using speaker identity information in conjunction with the acoustic features achieves an even larger improvement, reducing the WER to as low as 3.6%

This table compares the WER of the different models. The audio-only baseline for the two-speaker cocktail-party problem is 26.3%. Introducing visual information to the acoustic features reduces WER significantly in cocktail-party environments, to 4.4%, and using speaker identity information in conjunction with the acoustic features achieves an even larger improvement, reducing the WER to as low as 3.6%.

WER Comparisons of Models
A weak tendency that providing the speaker information in earlier layers of the network is advantageous
A speaker-dependent ASR system performs better than a speaker-targeted ASR system
Introducing visual information improves the WER of the speaker-dependent acoustic models but does not improve the speaker-targeted acoustic models

Comparing the three variants of the speaker-targeted models, we see a weak tendency that providing the speaker information in earlier layers of the network is advantageous. Comparing the speaker-dependent and speaker-targeted models, an intuitive result is that a speaker-dependent ASR system, which is optimized for one specific speaker, performs better than a speaker-targeted ASR system, which is optimized for multiple speakers simultaneously. Introducing visual information improves the WER of the speaker-dependent acoustic models but does not improve the speaker-targeted acoustic models. We attribute this finding to the limited modeling capacity of the neural network used for both models: a speaker-dependent model has enough capacity to exploit one specific speaker's visual information, whereas a single speaker-targeted model is not powerful enough to learn a unified optimization over all 31 speakers' visual information.

WER Comparisons of Speakers
This slide shows the WER for the individual speakers. The different models' performance follows a similar trend across the individual speakers.

Summary

Summary
Proposed a speaker-targeted audio-visual DNN-HMM model for speech recognition in cocktail-party environments
Used different combinations of acoustic features, visual features, and speaker identity information as DNN inputs
Experiments suggested the performance ranking DNN_AVI ≈ DNN_AI > DNN_AV > DNN_A
Future Work
Investigate better representations in the multimodal data space to incorporate audio, visual, and speaker identity information
Explore more model architectures to achieve a better unified optimization for the speaker-targeted audio-visual models

Thank you! Questions?