End-to-End Speech-Driven Facial Animation with Temporal GANs
Patrick Groot Koerkamp (6628478)
High-level overview
- Generating videos of a talking head
- Temporal GAN
- Audio-synchronized lip movements
- Natural facial expressions (blinks and eyebrow movements)
Generative Adversarial Networks (GANs)
- Generator
- Discriminator
- Trained adversarially (generic training step sketched below)
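To make the generator/discriminator roles concrete, here is a minimal PyTorch sketch of one adversarial training step for a generic GAN; G, D, the optimizers, and the noise dimension are placeholders rather than the networks used in this work.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def gan_step(G, D, real, opt_G, opt_D, noise_dim=100):
    """One adversarial update: first the discriminator, then the generator.

    Assumes D outputs a sigmoid probability of shape (batch, 1).
    """
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator step: push real samples towards 1, generated samples towards 0.
    z = torch.randn(batch, noise_dim)
    fake = G(z).detach()                      # do not backprop into G here
    loss_D = bce(D(real), ones) + bce(D(fake), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator step: try to make D label generated samples as real.
    z = torch.randn(batch, noise_dim)
    loss_G = bce(D(G(z)), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()

    return loss_D.item(), loss_G.item()
```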
Motivation
- Simplify the film animation process
- Better lip-syncing
- Generate occluded parts of faces
- Improve band-limited visual telecommunications
Background
- Generating realistic talking faces is challenging
- Prior work maps audio features (MFCCs) to face models, with computer graphics overhead
- Other work transforms audio features directly into video frames
- These approaches neglect facial expressions
- They generate only from the presently available information, so there are no facial dynamics
Proposal / Contributions
- A GAN capable of generating talking-head videos from an audio signal and a single still image
- Subject independent
- No reliance on handcrafted audio or visual features, no post-processing
- Comprehensive assessment of the method's performance: image quality, lip-reading verification, identity preservation, realism (Turing test)
Related work
- Speech-driven facial animation: relationship between acoustics, the vocal tract, and facial motion; Hidden Markov Models (HMMs); deep neural networks; convolutional neural networks
- GAN-based video synthesis: image/video generation (e.g. MoCoGAN), cross-modal applications
End-to-End Speech-Driven Facial Synthesis
- 1 generator: ReLU activations, TanH output
- 2 discriminators: ReLU activations, Sigmoid output
Generator
- Identity Encoder
- Audio Encoder
- Context Encoder (RNN)
- Frame Decoder
- Noise Generator
Audio Encoder & Context Encoder
- Audio Encoder: 7-layer CNN, extracts 256-dimensional features, passed to the RNN
- Context Encoder: 2-layer GRU (Gated Recurrent Unit); see the sketch below
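A minimal PyTorch sketch of these two modules, assuming raw-waveform audio chunks as input; the layer widths, kernel sizes, and temporal pooling are illustrative assumptions, while the 7-layer CNN, the 256-dimensional feature, and the 2-layer GRU follow the slide.

```python
import torch.nn as nn

class AudioEncoder(nn.Module):
    """7-layer 1D CNN that maps a raw audio chunk to a 256-dim feature vector."""
    def __init__(self, feat_dim=256):
        super().__init__()
        layers, ch = [], 1
        for out_ch in (16, 32, 64, 128, 256, 256, feat_dim):    # 7 conv layers
            layers += [nn.Conv1d(ch, out_ch, kernel_size=4, stride=2, padding=1),
                       nn.ReLU()]
            ch = out_ch
        self.cnn = nn.Sequential(*layers)

    def forward(self, audio_chunk):             # (batch, 1, samples)
        h = self.cnn(audio_chunk)               # (batch, 256, time')
        return h.mean(dim=2)                    # pool over time -> (batch, 256)

class ContextEncoder(nn.Module):
    """2-layer GRU over the per-chunk audio features, one context vector per video frame."""
    def __init__(self, feat_dim=256, hidden=256):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, num_layers=2, batch_first=True)

    def forward(self, audio_feats):             # (batch, frames, 256)
        context, _ = self.gru(audio_feats)
        return context                          # (batch, frames, hidden)
```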
Identity Encoder & Frame Decoder
- Identity Encoder: 6-layer CNN, produces the identity encoding of the still image
- Frame Decoder: generates a frame of the sequence (sketch below)
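A matching sketch of the identity encoder and frame decoder, assuming 64x64 RGB frames; the channel widths and the sizes of the identity, context, and noise vectors are assumptions, while the 6-layer CNN and the TanH output follow the slides.

```python
import torch.nn as nn

class IdentityEncoder(nn.Module):
    """6-layer CNN that encodes the still reference image of the speaker."""
    def __init__(self, id_dim=128):
        super().__init__()
        layers, ch = [], 3
        for out_ch in (32, 64, 128, 256, 512, id_dim):           # 6 conv layers
            layers += [nn.Conv2d(ch, out_ch, kernel_size=4, stride=2, padding=1),
                       nn.ReLU()]
            ch = out_ch
        self.cnn = nn.Sequential(*layers)

    def forward(self, still_image):              # (batch, 3, 64, 64)
        return self.cnn(still_image).flatten(1)  # (batch, id_dim)

class FrameDecoder(nn.Module):
    """Upsamples the concatenated identity, context, and noise vectors into one frame."""
    def __init__(self, latent_dim=128 + 256 + 10):   # identity + context + noise (assumed sizes)
        super().__init__()
        layers, ch = [], latent_dim
        for out_ch in (512, 256, 128, 64, 32):
            layers += [nn.ConvTranspose2d(ch, out_ch, kernel_size=4, stride=2, padding=1),
                       nn.ReLU()]
            ch = out_ch
        layers += [nn.ConvTranspose2d(ch, 3, kernel_size=4, stride=2, padding=1),
                   nn.Tanh()]                    # TanH output as on the architecture slide
        self.net = nn.Sequential(*layers)

    def forward(self, latent):                   # (batch, latent_dim)
        return self.net(latent[:, :, None, None])   # (batch, 3, 64, 64)
```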
Discriminators
- Frame Discriminator: 6-layer CNN, decides whether an individual frame is real or not (sketch below)
- Sequence Discriminator: judges whether the sequence of frames as a whole is real or not
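A sketch of the frame discriminator under the same 64x64 assumption; only the 6-layer CNN and the Sigmoid output come from the slides. The sequence discriminator is omitted because the slides give no detail about its inputs.

```python
import torch.nn as nn

class FrameDiscriminator(nn.Module):
    """6-layer CNN that scores a single frame as real (1) or generated (0)."""
    def __init__(self):
        super().__init__()
        layers, ch = [], 3
        for out_ch in (32, 64, 128, 256, 512):
            layers += [nn.Conv2d(ch, out_ch, kernel_size=4, stride=2, padding=1),
                       nn.ReLU()]
            ch = out_ch
        layers += [nn.Conv2d(ch, 1, kernel_size=2),   # 6th layer collapses to one score
                   nn.Sigmoid()]                       # Sigmoid output as on the architecture slide
        self.net = nn.Sequential(*layers)

    def forward(self, frame):                 # (batch, 3, 64, 64)
        return self.net(frame).view(-1)       # (batch,) realness scores
```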
Training
- Loss: adversarial loss combined with an L1 reconstruction loss, minimised to obtain the optimal generator G* (formulas reconstructed below)
- Optimizer: Adam
- Learning rates: Generator 2 × 10^-4, Frame Discriminator 10^-3 (decayed after epoch 20, 10% rate), Sequence Discriminator 5 × 10^-5
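A hedged reconstruction of the two formulas the slide only names: the standard adversarial objective, a pixel-wise L1 reconstruction term, and the combined objective yielding the optimal generator G*. The weight λ and the exact set of pixels the L1 term is computed over are assumptions, not taken from the slide.

```latex
% Standard adversarial objective (generator G, discriminator D):
\mathcal{L}_{\mathrm{adv}}(D, G) =
    \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log D(x)\right] +
    \mathbb{E}_{z \sim p_z}\!\left[\log\bigl(1 - D(G(z))\bigr)\right]

% Pixel-wise L1 reconstruction loss between ground-truth and generated frames:
\mathcal{L}_{L1} = \sum_{p} \bigl| x_p - \hat{x}_p \bigr|

% Combined objective, weighted by a hyperparameter lambda:
G^{*} = \arg\min_{G} \max_{D} \; \mathcal{L}_{\mathrm{adv}} + \lambda\, \mathcal{L}_{L1}
```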
Experiments
- Implemented in PyTorch, trained on an Nvidia GTX 1080 Ti; training takes a week
- GPU: avg. generation time 7 ms; 75 sequential frames synthesized in 0.5 s
- CPU: avg. generation time 1 s; 75 sequential frames synthesized in 15 s
Experiments (2)
- Datasets: GRID, TCD
- Increased training data by mirroring
- Metrics (PSNR/SSIM computation sketched below):
  - Generated video quality: PSNR & SSIM
  - Frame sharpness: FDBM & CPBD
  - Content: ACD
  - Accuracy of the spoken message: WER
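A small sketch of how the per-frame video-quality metrics could be computed with scikit-image (0.19+); PSNR and SSIM are standard functions, while FDBM, CPBD, ACD, and WER require specialised tooling and are not shown here.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_quality(real_frame: np.ndarray, generated_frame: np.ndarray):
    """PSNR and SSIM between a ground-truth and a generated RGB frame.

    Both inputs are uint8 arrays of shape (H, W, 3) in the range [0, 255].
    """
    psnr = peak_signal_noise_ratio(real_frame, generated_frame, data_range=255)
    ssim = structural_similarity(real_frame, generated_frame,
                                 channel_axis=-1, data_range=255)
    return psnr, ssim

def video_quality(real_frames, generated_frames):
    """Average PSNR/SSIM over aligned lists of frames."""
    scores = [frame_quality(r, g) for r, g in zip(real_frames, generated_frames)]
    psnrs, ssims = zip(*scores)
    return float(np.mean(psnrs)), float(np.mean(ssims))
```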
Qualitative Results
- Produces realistic videos
- Also works on previously unseen faces
- Characteristic human expressions: frowns, blinks
Qualitative Results (2)
- Static baseline: a GAN-based method trained with an L1 loss and an adversarial loss, used as the baseline for quantitative assessment
- Failures of the static baseline: opening the mouth when silent, neglecting the previously generated face
Quantitative Results
- Performance measured on the GRID & TCD datasets, compared to the static baseline
- 30-person survey (Turing test): 10 videos, 153 responses, avg. 63% classified correctly
Quiz
Future work
- Different architectures for more natural sequences
- Expressions are currently generated randomly
- Natural extension: capture the mood of the speech and reflect it in the facial expressions
Questions