End-to-End Speech-Driven Facial Animation with Temporal GANs

Presentation transcript:

End-to-End Speech-Driven Facial Animation with Temporal GANs Patrick Groot Koerkamp (6628478)

High-level overview: Generating videos of a talking head with a temporal GAN, with audio-synchronized lip movements and natural facial expressions (blinks and eyebrow movements).

Generative Adversarial Networks (GAN): a generator that synthesizes samples and a discriminator that learns to tell real samples from generated ones.
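A minimal PyTorch sketch of these two roles, unrelated to the paper's actual networks: the discriminator is trained to separate real from generated samples while the generator is trained to fool it. All layer sizes here are placeholders.

```python
import torch
import torch.nn as nn

# Illustrative generator/discriminator pair; sizes are placeholders, not the paper's.
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def gan_step(real):                          # real: (batch, 784)
    z = torch.randn(real.size(0), 100)
    fake = G(z)

    # Discriminator: push real samples towards 1, generated samples towards 0.
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(real.size(0), 1)) + \
             bce(D(fake.detach()), torch.zeros(real.size(0), 1))
    loss_d.backward()
    opt_d.step()

    # Generator: make the discriminator label generated samples as real.
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(real.size(0), 1))
    loss_g.backward()
    opt_g.step()
```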

Motivation: Simplify the film animation process, improve lip-syncing, generate occluded parts of faces, and improve band-limited visual telecommunications.

Background: Existing work generates realistic faces by mapping audio features (MFCCs) to animation parameters, but the computer-graphics pipeline adds overhead. Approaches that transform audio features directly into video frames neglect facial expressions: each frame is generated only from the information present at that moment, so there are no facial dynamics, which makes the task challenging.

Proposal / Contributions: A GAN capable of generating videos from an audio signal and a single still image. The method is subject independent, relies on no handcrafted audio or visual features, and needs no post-processing. A comprehensive assessment of the method's performance covers image quality, lip-reading verification, identity preservation, and realism (Turing test).

Related work: Speech-driven facial animation models acoustics, the vocal tract, and facial motion using Hidden Markov Models (HMMs), deep neural networks, and convolutional neural networks. GAN-based video synthesis covers image/video generation, MoCoGAN, and cross-modal applications.

End-to-End Speech-Driven Facial Synthesis: one generator (ReLU activations, TanH output) and two discriminators (ReLU activations, Sigmoid output).

Generator: Identity Encoder, Audio Encoder, Context Encoder (RNN), Frame Decoder, and Noise Generator.

Audio Encoder & Context Encoder: The audio encoder is a 7-layer CNN that extracts 256-dimensional features, which are passed to the RNN. The context encoder is a 2-layer GRU (Gated Recurrent Unit).
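A hedged PyTorch sketch of what such encoders could look like; only the 7 convolutional layers, the 256-dimensional features, and the 2-layer GRU come from the slide, while kernel sizes, strides, and channel counts are assumptions.

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """7-layer 1D CNN turning an audio chunk into a 256-dim feature (layer sizes assumed)."""
    def __init__(self):
        super().__init__()
        chans = [1, 16, 32, 32, 64, 128, 256, 256]           # 7 conv layers
        self.convs = nn.Sequential(*[
            nn.Sequential(nn.Conv1d(chans[i], chans[i + 1], kernel_size=4, stride=2, padding=1),
                          nn.ReLU())
            for i in range(7)
        ])
        self.pool = nn.AdaptiveAvgPool1d(1)                   # collapse the time axis

    def forward(self, audio):                                 # audio: (batch, 1, samples)
        return self.pool(self.convs(audio)).squeeze(-1)       # (batch, 256)

class ContextEncoder(nn.Module):
    """2-layer GRU summarising the audio features across the frame sequence."""
    def __init__(self):
        super().__init__()
        self.gru = nn.GRU(input_size=256, hidden_size=256, num_layers=2, batch_first=True)

    def forward(self, feats):                                 # feats: (batch, frames, 256)
        out, _ = self.gru(feats)
        return out                                            # one context vector per frame
```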

Identity Encoder & Frame Decoder: The identity encoder is a 6-layer CNN that produces the identity encoding. The frame decoder generates one frame of the sequence from the combined encodings.
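A simplified sketch, under assumed layer sizes, noise dimension, and a 64x64 output resolution, of an identity encoder and a frame decoder that combines identity, audio context, and noise into a single frame; the paper's actual decoder is not reproduced here.

```python
import torch
import torch.nn as nn

class IdentityEncoder(nn.Module):
    """6-layer CNN encoding the still image into an identity vector (sizes assumed)."""
    def __init__(self):
        super().__init__()
        chans = [3, 32, 64, 128, 256, 256, 128]               # 6 conv layers
        self.convs = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(chans[i], chans[i + 1], 4, stride=2, padding=1), nn.ReLU())
            for i in range(6)
        ])

    def forward(self, image):                                 # image: (batch, 3, 64, 64)
        return self.convs(image).flatten(1)                   # (batch, 128)

class FrameDecoder(nn.Module):
    """Decodes identity + audio context + noise into one video frame."""
    def __init__(self, latent_dim=128 + 256 + 10):            # identity + context + noise (assumed)
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 256, 4), nn.ReLU(),                # 1x1 -> 4x4
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),  # 4x4 -> 8x8
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 8x8 -> 16x16
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),    # 16x16 -> 32x32
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh(),     # 32x32 -> 64x64
        )

    def forward(self, identity, context, noise):
        z = torch.cat([identity, context, noise], dim=1)      # (batch, latent_dim)
        return self.net(z[:, :, None, None])                  # (batch, 3, 64, 64)
```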

Discriminators: The frame discriminator is a 6-layer CNN that decides whether a single frame is real or not. The sequence discriminator judges the generated sequence as a whole.
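A sketch of what the frame discriminator could look like as a 6-layer CNN ending in a sigmoid real/fake score; channel counts and the 64x64 input size are assumptions, and the sequence discriminator (which would additionally look at the whole frame sequence) is omitted.

```python
import torch.nn as nn

class FrameDiscriminator(nn.Module):
    """6-layer CNN scoring a single frame as real (1) or generated (0); sizes assumed."""
    def __init__(self):
        super().__init__()
        chans = [3, 32, 64, 128, 256, 256]
        layers = []
        for i in range(5):
            layers += [nn.Conv2d(chans[i], chans[i + 1], 4, stride=2, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(256, 1, 2), nn.Sigmoid()]        # 6th layer: 2x2 -> 1x1 score
        self.net = nn.Sequential(*layers)

    def forward(self, frame):                                 # frame: (batch, 3, 64, 64)
        return self.net(frame).view(-1, 1)                    # probability the frame is real
```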

Training: The loss combines an adversarial term with an L1 reconstruction term, and training seeks the optimal generator G* under this combined objective. Optimization uses Adam with learning rates of 2 * 10^-4 for the generator, 10^-3 for the frame discriminator (decayed at a 10% rate after epoch 20), and 5 * 10^-5 for the sequence discriminator.
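A sketch of the corresponding optimizer setup and combined objective in PyTorch. The learning rates come from the slide; the L1 weight lambda_l1, the interpretation of the decay schedule, and the placeholder modules are assumptions.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the generator and the two discriminators
# sketched above; only the learning rates below come from the slide.
generator     = nn.Linear(10, 10)
frame_disc    = nn.Linear(10, 1)
sequence_disc = nn.Linear(10, 1)

opt_g  = torch.optim.Adam(generator.parameters(),     lr=2e-4)
opt_df = torch.optim.Adam(frame_disc.parameters(),    lr=1e-3)
opt_ds = torch.optim.Adam(sequence_disc.parameters(), lr=5e-5)

# One reading of "decay after epoch 20 (10% rate)": multiply the frame-discriminator
# learning rate by 0.1 every 20 epochs (call sched_df.step() once per epoch).
sched_df = torch.optim.lr_scheduler.StepLR(opt_df, step_size=20, gamma=0.1)

bce, l1 = nn.BCELoss(), nn.L1Loss()
lambda_l1 = 100.0   # assumed weighting of the L1 reconstruction term

def generator_loss(fake_frames, real_frames, frame_scores, seq_score):
    """Adversarial terms (fool both discriminators) plus L1 reconstruction,
    i.e. roughly G* = argmin_G max_D (L_adv + lambda * L_1)."""
    adv = bce(frame_scores, torch.ones_like(frame_scores)) + \
          bce(seq_score, torch.ones_like(seq_score))
    return adv + lambda_l1 * l1(fake_frames, real_frames)
```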

Experiments: Implemented in PyTorch and trained on an Nvidia GTX 1080 Ti; training takes about a week. On the GPU, the average generation time is 7 ms and 75 sequential frames are synthesized in 0.5 s; on the CPU, the average generation time is 1 s and 75 sequential frames take 15 s.
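For reference, per-frame generation time can be estimated with a simple timing loop like the one below; this is a generic benchmarking sketch, not the setup used to obtain the numbers above.

```python
import time
import torch

@torch.no_grad()
def time_generation(generator, inputs, n_frames=75):
    """Rough timing of sequential frame synthesis; returns (per-frame, total) seconds."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()                  # make GPU timings meaningful
    start = time.perf_counter()
    for _ in range(n_frames):
        generator(*inputs)                        # one generated frame per call in this sketch
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return elapsed / n_frames, elapsed
```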

Experiments (2): Datasets: GRID and TCD; the training data was increased by mirroring. Metrics: generated video quality is measured with PSNR & SSIM, frame sharpness with FDBM & CPBD, content with ACD, and the accuracy of the spoken message with WER.
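As an illustration of the video-quality metrics, PSNR and SSIM can be computed per frame and averaged over a video with scikit-image; this generic sketch is not the paper's evaluation code.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def video_psnr_ssim(real_frames, generated_frames):
    """Average PSNR/SSIM over corresponding frames; frames are uint8 HxWx3 arrays."""
    psnrs, ssims = [], []
    for real, fake in zip(real_frames, generated_frames):
        psnrs.append(peak_signal_noise_ratio(real, fake, data_range=255))
        ssims.append(structural_similarity(real, fake, channel_axis=-1, data_range=255))
    return float(np.mean(psnrs)), float(np.mean(ssims))
```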

Qualitative Results: Produces realistic videos, also works on previously unseen faces, and shows characteristic human expressions such as frowns and blinks.

Qualitative Results (2): The GAN-based method uses an L1 loss and an adversarial loss; a static baseline serves as the reference for quantitative assessment. Failures of the static baseline include opening the mouth when the audio is silent and neglecting the previously generated face.

Quantitative Results: Performance is measured on the GRID & TCD datasets and compared to the static baseline. A 30-person survey served as a Turing test: 10 videos and 153 responses, with on average 63% judged correctly.

Quiz

Future work: Explore different architectures and more natural sequences, since expressions are currently generated randomly. A natural extension is to capture mood and reflect it in the facial expressions.

Questions