End-to-End Speech-Driven Facial Animation with Temporal GANs

1 End-to-End Speech-Driven Facial Animation with Temporal GANs
Patrick Groot Koerkamp

2 High-level overview
- Generating videos of a talking head
- Audio-synchronized lip movements
- Natural facial expressions (blinks and eyebrow movements)
- Temporal GAN

3 Generative Adversarial Networks (GAN)
- Generator: produces candidate samples
- Discriminator: judges whether samples are real or generated
- The two are trained against each other (minimal training step sketched below)
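
As a refresher on the adversarial setup, here is a minimal, generic GAN training step in PyTorch; `G`, `D`, the optimizers, and the noise dimension are placeholders, not the paper's networks:

```python
import torch
import torch.nn as nn

def gan_step(G, D, real, opt_g, opt_d, noise_dim=100):
    """One generic GAN update. D is assumed to end in a sigmoid
    so its output is a probability in (0, 1)."""
    bce = nn.BCELoss()
    batch = real.size(0)
    ones = torch.ones(batch, 1)
    zeros = torch.zeros(batch, 1)

    # Discriminator: push D(real) -> 1 and D(fake) -> 0.
    z = torch.randn(batch, noise_dim)
    fake = G(z).detach()                      # no gradient into G here
    loss_d = bce(D(real), ones) + bce(D(fake), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: fool D so that D(G(z)) -> 1.
    z = torch.randn(batch, noise_dim)
    loss_g = bce(D(G(z)), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```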

4 Motivation
- Simplify the film animation process
- Better lip-syncing
- Generate occluded parts of faces
- Improve band-limited visual telecommunications

5 Background
- Generating realistic talking faces is challenging
- Mapping audio features (MFCCs) to a face model: computer-graphics overhead
- Transforming audio features directly to video frames: neglects facial expressions
- Generating from only the present information: no facial dynamics

6 Proposal / Contributions
- A GAN capable of generating videos of a talking head from an audio signal and a single still image
- Subject independent
- No handcrafted audio or visual features
- No post-processing
- Comprehensive assessment of the method's performance: image quality, lip-reading verification, identity preservation, realism (Turing test)

7 Related work
Speech-Driven Facial Animation
- Acoustics, vocal tract, facial motion
- Hidden Markov Models (HMMs)
- Deep neural networks
- Convolutional neural networks
GAN-Based Video Synthesis
- Image/video generation
- MoCoGAN
- Cross-modal applications

8 End-to-End Speech-Driven Facial Synthesis
- 1 generator: ReLU activations with a TanH output
- 2 discriminators: ReLU activations with a sigmoid output

9 Generator
- Identity Encoder
- Audio Encoder
- Context Encoder (RNN)
- Frame Decoder
- Noise Generator

10 Audio Encoder & Context Encoder
Audio Encoder
- 7-layer CNN
- Extracts 256-dimensional features, passed to the RNN context encoder
Context Encoder
- 2-layer GRU (Gated Recurrent Unit); both modules sketched below
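
A minimal sketch of what the audio and context encoders could look like in PyTorch. The 7-layer depth, the 256-dimensional features, and the 2-layer GRU come from the slide; kernel sizes, strides, channel widths, and the raw-audio input format are assumptions:

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """7-layer 1-D CNN mapping an audio chunk to a 256-d feature.
    Only the depth and output size follow the slide; the rest is assumed."""
    def __init__(self, out_dim=256):
        super().__init__()
        layers, ch = [], 1
        for out_ch in (16, 32, 64, 128, 256, 256, out_dim):  # 7 conv layers
            layers += [nn.Conv1d(ch, out_ch, kernel_size=4, stride=2, padding=1),
                       nn.ReLU()]
            ch = out_ch
        self.conv = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool1d(1)   # collapse the time axis

    def forward(self, audio):                 # audio: (B, 1, samples)
        return self.pool(self.conv(audio)).squeeze(-1)   # (B, 256)

class ContextEncoder(nn.Module):
    """2-layer GRU over the sequence of per-frame audio features."""
    def __init__(self, dim=256):
        super().__init__()
        self.gru = nn.GRU(dim, dim, num_layers=2, batch_first=True)

    def forward(self, feats):                 # feats: (B, T, 256)
        out, _ = self.gru(feats)
        return out                            # one context vector per frame
```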

11 Identity Encoder & Frame Decoder
Identity Encoder
- 6-layer CNN
- Produces the identity encoding
Frame Decoder
- Generates one frame of the sequence (both modules sketched below)
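
A matching sketch of the identity encoder and frame decoder. Only the 6-layer depth (and the TanH output from slide 8) comes from the slides; the 64x64 resolution, channel widths, and the simple concatenation of identity and context codes are assumptions, and any encoder-to-decoder skip connections are omitted:

```python
import torch
import torch.nn as nn

class IdentityEncoder(nn.Module):
    """6-layer CNN encoding the single still image into an identity code."""
    def __init__(self, code_dim=128):
        super().__init__()
        layers, ch = [], 3
        for out_ch in (32, 64, 128, 256, 256, code_dim):  # 6 conv layers
            layers += [nn.Conv2d(ch, out_ch, 4, stride=2, padding=1),
                       nn.ReLU()]
            ch = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, img):                  # img: (B, 3, 64, 64)
        return self.net(img).flatten(1)      # 64 / 2^6 = 1 -> (B, code_dim)

class FrameDecoder(nn.Module):
    """Upsamples the concatenated identity + context code into one frame;
    TanH output as stated on slide 8."""
    def __init__(self, in_dim=128 + 256, img_ch=3):
        super().__init__()
        layers, ch = [], in_dim
        for out_ch in (256, 128, 64, 32, 16):
            layers += [nn.ConvTranspose2d(ch, out_ch, 4, stride=2, padding=1),
                       nn.ReLU()]
            ch = out_ch
        layers += [nn.ConvTranspose2d(ch, img_ch, 4, stride=2, padding=1),
                   nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, code):                 # code: (B, in_dim)
        return self.net(code[..., None, None])   # (B, 3, 64, 64)
```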

12 Discriminators
Frame Discriminator
- 6-layer CNN
- Is a single frame real or not?
Sequence Discriminator
- Is the whole sequence real or not? (both discriminators sketched below)
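
A sketch of the two discriminators. The frame discriminator's 6-layer CNN is from the slide; the sequence discriminator shown here (per-frame CNN features aggregated by a GRU) is an assumption about its exact form:

```python
import torch
import torch.nn as nn

class FrameDiscriminator(nn.Module):
    """6-layer CNN scoring a single frame as real/fake (sigmoid output)."""
    def __init__(self):
        super().__init__()
        layers, ch = [], 3
        for out_ch in (32, 64, 128, 256, 256):
            layers += [nn.Conv2d(ch, out_ch, 4, stride=2, padding=1), nn.ReLU()]
            ch = out_ch
        layers += [nn.Conv2d(ch, 1, 2), nn.Sigmoid()]  # 6th layer -> 1x1 score
        self.net = nn.Sequential(*layers)

    def forward(self, frame):                # frame: (B, 3, 64, 64)
        return self.net(frame).flatten(1)    # (B, 1)

class SequenceDiscriminator(nn.Module):
    """Scores a whole video: per-frame CNN features fed through a GRU.
    The exact architecture is an assumption."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.frame_feat = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.gru = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.head = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())

    def forward(self, video):                # video: (B, T, 3, 64, 64)
        b, t = video.shape[:2]
        f = self.frame_feat(video.flatten(0, 1)).view(b, t, -1)
        _, h = self.gru(f)
        return self.head(h[-1])              # (B, 1)
```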

13 Training
- Loss: adversarial loss combined with an L1 reconstruction loss; obtain the optimal generator G* (reconstructed formulation below)
- Adam learning rates:
  - Generator: 2 * 10^-4
  - Frame discriminator: 10^-3, decayed after epoch 20 (10% rate)
  - Sequence discriminator: 5 * 10^-5
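
The loss formulas on the slide were images; the following is a reconstruction of the standard conditional-GAN formulation the bullets suggest, written as an assumption rather than the paper's exact equations. Here F_p and G_p denote pixel p of a real and a generated frame, and λ weights the reconstruction term:

```latex
% Adversarial loss: D scores real data high, generated data low.
\mathcal{L}_{adv}(D, G) =
    \mathbb{E}_{x \sim p_{data}}[\log D(x)] +
    \mathbb{E}_{z \sim p_{z}}[\log(1 - D(G(z)))]

% L1 reconstruction loss between real and generated frames.
\mathcal{L}_{L1} = \sum_{p} \lvert F_p - G_p \rvert

% Optimal generator: minimax game plus the weighted L1 term.
G^{*} = \arg\min_{G} \max_{D} \;
        \mathcal{L}_{adv}(D, G) + \lambda \, \mathcal{L}_{L1}
```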

14 Experiments
- Implemented in PyTorch; takes a week to train
- GPU (Nvidia GTX 1080 Ti):
  - Avg. generation time: 7 ms
  - 75 sequential frames synthesized in 0.5 s
- CPU:
  - Avg. generation time: 1 s
  - 75 sequential frames synthesized in 15 s

15 Experiments (2)
Datasets
- GRID
- TCD
- Training data increased by mirroring
Metrics
- Generated video quality: PSNR & SSIM (PSNR sketched below)
- Frame sharpness: FDBM & CPBD
- Content: ACD (average content distance)
- Accuracy of the spoken message: WER (word error rate)
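
For concreteness, a small sketch of PSNR, one of the listed metrics, for a pair of frames; the NumPy implementation and the 8-bit `max_val` default are assumptions of convenience:

```python
import numpy as np

def psnr(real, fake, max_val=255.0):
    """Peak signal-to-noise ratio between two frames, in dB (higher is better).
    Assumes 8-bit images by default."""
    mse = np.mean((real.astype(np.float64) - fake.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")   # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)
```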

16 Qualitative Results
- Produces realistic videos
- Also works on previously unseen faces
- Characteristic human expressions: frowns, blinks

17 Qualitative Results (2)
- GAN-based method: L1 loss and adversarial loss
- Static baseline used for quantitative assessment
- Failures of the static baseline:
  - Opening the mouth when silent
  - Neglecting the previous face

18 Quantitative Results
- Performance measured on the GRID & TCD datasets
- Compared to the static baseline
- 30-person survey (Turing test): 10 videos, 153 responses, avg. 63% correct

19 Quiz

20 Future work
- Different architectures for more natural sequences
- Expressions are currently generated randomly; a natural extension is to capture mood and reflect it in the facial expressions

21 Questions

