End-to-End Speech-Driven Facial Animation with Temporal GANs

Presentation transcript:

End-to-End Speech-Driven Facial Animation with Temporal GANs Patrick Groot Koerkamp (6628478)

High-level overview: Generating videos of a talking head with a temporal GAN, with audio-synchronized lip movements and natural facial expressions (blinks and eyebrow movements).

Generative Adversarial Networks (GAN): a generator that synthesizes samples and a discriminator that learns to tell real samples from generated ones.
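A minimal PyTorch sketch of these two roles, unrelated to the paper's actual networks: the discriminator is trained to separate real from generated samples while the generator is trained to fool it. All layer sizes here are placeholders.

```python
import torch
import torch.nn as nn

# Illustrative generator/discriminator pair; sizes are placeholders, not the paper's.
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def gan_step(real):                          # real: (batch, 784)
    z = torch.randn(real.size(0), 100)
    fake = G(z)

    # Discriminator: push real samples towards 1, generated samples towards 0.
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(real.size(0), 1)) + \
             bce(D(fake.detach()), torch.zeros(real.size(0), 1))
    loss_d.backward()
    opt_d.step()

    # Generator: make the discriminator label generated samples as real.
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(real.size(0), 1))
    loss_g.backward()
    opt_g.step()
```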

Motivation: Simplify the film animation process, improve lip-syncing, generate occluded parts of faces, and improve band-limited visual telecommunications.

Background: Existing work generates realistic faces by mapping audio features (MFCCs) to animation parameters, but the computer-graphics pipeline adds overhead. Approaches that transform audio features directly into video frames neglect facial expressions: each frame is generated only from the information present at that moment, so there are no facial dynamics, which makes the task challenging.

Proposal / Contributions: A GAN capable of generating videos from an audio signal and a single still image. The method is subject independent, relies on no handcrafted audio or visual features, and needs no post-processing. A comprehensive assessment of the method's performance covers image quality, lip-reading verification, identity preservation, and realism (Turing test).

Related work: Speech-driven facial animation models acoustics, the vocal tract, and facial motion using Hidden Markov Models (HMMs), deep neural networks, and convolutional neural networks. GAN-based video synthesis covers image/video generation, MoCoGAN, and cross-modal applications.

End-to-End Speech-Driven Facial Synthesis: one generator (ReLU activations, TanH output) and two discriminators (ReLU activations, Sigmoid output).

Generator: Identity Encoder, Audio Encoder, Context Encoder (RNN), Frame Decoder, and Noise Generator.

Audio Encoder & Context Encoder: The audio encoder is a 7-layer CNN that extracts 256-dimensional features, which are passed to the RNN. The context encoder is a 2-layer GRU (Gated Recurrent Unit).
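A hedged PyTorch sketch of what such encoders could look like; only the 7 convolutional layers, the 256-dimensional features, and the 2-layer GRU come from the slide, while kernel sizes, strides, and channel counts are assumptions.

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """7-layer 1D CNN turning an audio chunk into a 256-dim feature (layer sizes assumed)."""
    def __init__(self):
        super().__init__()
        chans = [1, 16, 32, 32, 64, 128, 256, 256]           # 7 conv layers
        self.convs = nn.Sequential(*[
            nn.Sequential(nn.Conv1d(chans[i], chans[i + 1], kernel_size=4, stride=2, padding=1),
                          nn.ReLU())
            for i in range(7)
        ])
        self.pool = nn.AdaptiveAvgPool1d(1)                   # collapse the time axis

    def forward(self, audio):                                 # audio: (batch, 1, samples)
        return self.pool(self.convs(audio)).squeeze(-1)       # (batch, 256)

class ContextEncoder(nn.Module):
    """2-layer GRU summarising the audio features across the frame sequence."""
    def __init__(self):
        super().__init__()
        self.gru = nn.GRU(input_size=256, hidden_size=256, num_layers=2, batch_first=True)

    def forward(self, feats):                                 # feats: (batch, frames, 256)
        out, _ = self.gru(feats)
        return out                                            # one context vector per frame
```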

Identity Encoder & Frame Decoder: The identity encoder is a 6-layer CNN that produces the identity encoding. The frame decoder generates one frame of the sequence from the combined encodings.
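A simplified sketch, under assumed layer sizes, noise dimension, and a 64x64 output resolution, of an identity encoder and a frame decoder that combines identity, audio context, and noise into a single frame; the paper's actual decoder is not reproduced here.

```python
import torch
import torch.nn as nn

class IdentityEncoder(nn.Module):
    """6-layer CNN encoding the still image into an identity vector (sizes assumed)."""
    def __init__(self):
        super().__init__()
        chans = [3, 32, 64, 128, 256, 256, 128]               # 6 conv layers
        self.convs = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(chans[i], chans[i + 1], 4, stride=2, padding=1), nn.ReLU())
            for i in range(6)
        ])

    def forward(self, image):                                 # image: (batch, 3, 64, 64)
        return self.convs(image).flatten(1)                   # (batch, 128)

class FrameDecoder(nn.Module):
    """Decodes identity + audio context + noise into one video frame."""
    def __init__(self, latent_dim=128 + 256 + 10):            # identity + context + noise (assumed)
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 256, 4), nn.ReLU(),                # 1x1 -> 4x4
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),  # 4x4 -> 8x8
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 8x8 -> 16x16
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),    # 16x16 -> 32x32
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh(),     # 32x32 -> 64x64
        )

    def forward(self, identity, context, noise):
        z = torch.cat([identity, context, noise], dim=1)      # (batch, latent_dim)
        return self.net(z[:, :, None, None])                  # (batch, 3, 64, 64)
```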

Discriminators: The frame discriminator is a 6-layer CNN that decides whether a single frame is real or not. The sequence discriminator judges the generated sequence as a whole.
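A sketch of what the frame discriminator could look like as a 6-layer CNN ending in a sigmoid real/fake score; channel counts and the 64x64 input size are assumptions, and the sequence discriminator (which would additionally look at the whole frame sequence) is omitted.

```python
import torch.nn as nn

class FrameDiscriminator(nn.Module):
    """6-layer CNN scoring a single frame as real (1) or generated (0); sizes assumed."""
    def __init__(self):
        super().__init__()
        chans = [3, 32, 64, 128, 256, 256]
        layers = []
        for i in range(5):
            layers += [nn.Conv2d(chans[i], chans[i + 1], 4, stride=2, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(256, 1, 2), nn.Sigmoid()]        # 6th layer: 2x2 -> 1x1 score
        self.net = nn.Sequential(*layers)

    def forward(self, frame):                                 # frame: (batch, 3, 64, 64)
        return self.net(frame).view(-1, 1)                    # probability the frame is real
```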

Training: The loss combines an adversarial term with an L1 reconstruction term, and training seeks the optimal generator G* under this combined objective. Optimization uses Adam with learning rates of 2 * 10^-4 for the generator, 10^-3 for the frame discriminator (decayed at a 10% rate after epoch 20), and 5 * 10^-5 for the sequence discriminator.
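A sketch of the corresponding optimizer setup and combined objective in PyTorch. The learning rates come from the slide; the L1 weight lambda_l1, the interpretation of the decay schedule, and the placeholder modules are assumptions.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the generator and the two discriminators
# sketched above; only the learning rates below come from the slide.
generator     = nn.Linear(10, 10)
frame_disc    = nn.Linear(10, 1)
sequence_disc = nn.Linear(10, 1)

opt_g  = torch.optim.Adam(generator.parameters(),     lr=2e-4)
opt_df = torch.optim.Adam(frame_disc.parameters(),    lr=1e-3)
opt_ds = torch.optim.Adam(sequence_disc.parameters(), lr=5e-5)

# One reading of "decay after epoch 20 (10% rate)": multiply the frame-discriminator
# learning rate by 0.1 every 20 epochs (call sched_df.step() once per epoch).
sched_df = torch.optim.lr_scheduler.StepLR(opt_df, step_size=20, gamma=0.1)

bce, l1 = nn.BCELoss(), nn.L1Loss()
lambda_l1 = 100.0   # assumed weighting of the L1 reconstruction term

def generator_loss(fake_frames, real_frames, frame_scores, seq_score):
    """Adversarial terms (fool both discriminators) plus L1 reconstruction,
    i.e. roughly G* = argmin_G max_D (L_adv + lambda * L_1)."""
    adv = bce(frame_scores, torch.ones_like(frame_scores)) + \
          bce(seq_score, torch.ones_like(seq_score))
    return adv + lambda_l1 * l1(fake_frames, real_frames)
```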

Experiments: Implemented in PyTorch and trained on an Nvidia GTX 1080 Ti; training takes about a week. On the GPU, the average generation time is 7 ms and 75 sequential frames are synthesized in 0.5 s; on the CPU, the average generation time is 1 s and 75 sequential frames take 15 s.
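For reference, per-frame generation time can be estimated with a simple timing loop like the one below; this is a generic benchmarking sketch, not the setup used to obtain the numbers above.

```python
import time
import torch

@torch.no_grad()
def time_generation(generator, inputs, n_frames=75):
    """Rough timing of sequential frame synthesis; returns (per-frame, total) seconds."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()                  # make GPU timings meaningful
    start = time.perf_counter()
    for _ in range(n_frames):
        generator(*inputs)                        # one generated frame per call in this sketch
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return elapsed / n_frames, elapsed
```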

Experiments (2): Datasets: GRID and TCD; the training data was increased by mirroring. Metrics: generated video quality is measured with PSNR & SSIM, frame sharpness with FDBM & CPBD, content with ACD, and the accuracy of the spoken message with WER.
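As an illustration of the video-quality metrics, PSNR and SSIM can be computed per frame and averaged over a video with scikit-image; this generic sketch is not the paper's evaluation code.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def video_psnr_ssim(real_frames, generated_frames):
    """Average PSNR/SSIM over corresponding frames; frames are uint8 HxWx3 arrays."""
    psnrs, ssims = [], []
    for real, fake in zip(real_frames, generated_frames):
        psnrs.append(peak_signal_noise_ratio(real, fake, data_range=255))
        ssims.append(structural_similarity(real, fake, channel_axis=-1, data_range=255))
    return float(np.mean(psnrs)), float(np.mean(ssims))
```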

Qualitative Results: Produces realistic videos, also works on previously unseen faces, and shows characteristic human expressions such as frowns and blinks.

Qualitative Results (2): The GAN-based method uses an L1 loss and an adversarial loss; a static baseline serves as the reference for quantitative assessment. Failures of the static baseline include opening the mouth when the audio is silent and neglecting the previously generated face.

Quantitative Results: Performance is measured on the GRID & TCD datasets and compared to the static baseline. A 30-person survey served as a Turing test: 10 videos and 153 responses, with on average 63% judged correctly.

Quiz

Future work: Explore different architectures and more natural sequences, since expressions are currently generated randomly. A natural extension is to capture mood and reflect it in the facial expressions.

Questions