1
End-to-End Speech-Driven Facial Animation with Temporal GANs
Patrick Groot Koerkamp ( )
2
High-level overview
Generating videos of a talking head with a temporal GAN
Audio-synchronized lip movements
Natural facial expressions (blinks and eyebrow movements)
3
Generative Adversarial Networks (GAN)
Two competing networks: the Generator produces fake samples, the Discriminator tries to tell them apart from real ones (see the sketch below)
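A minimal sketch of the two roles in code (a hypothetical fully-connected pair for illustration only, not the paper's architecture; all layer sizes are made up):

import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps a noise vector to a fake image (illustrative sizes)."""
    def __init__(self, noise_dim=100, img_dim=64 * 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, 256), nn.ReLU(),
            nn.Linear(256, img_dim), nn.Tanh(),    # outputs in [-1, 1]
        )

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    """Scores how likely an image is to be real rather than generated."""
    def __init__(self, img_dim=64 * 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),       # probability of "real"
        )

    def forward(self, x):
        return self.net(x)

During training the discriminator is rewarded for separating real from generated images, while the generator is rewarded for fooling it.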
4
Motivation
Simplify the film animation process
Better lip-syncing
Generate parts of occluded faces
Improve band-limited visual telecommunications
5
Background
Generating realistic faces is challenging
Mapping audio features (MFCC): computer graphics overhead
Transforming audio features to video frames: neglects facial expressions, generates only from present information, no facial dynamics
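For reference, MFCC features of the kind used by these earlier mapping approaches can be extracted with librosa; the file name and parameter values below are placeholders, not taken from the paper:

import librosa

# Illustrative MFCC extraction for a speech clip (file path is hypothetical).
audio, sr = librosa.load("speech.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
print(mfcc.shape)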
6
Proposal / Contributions
A GAN capable of generating videos of a talking head from an audio signal and a single still image
Subject independent
No handcrafted audio or visual features, no post-processing
Comprehensive assessment of the method's performance: image quality, lip-reading verification, identity preservation, realism (Turing test)
7
Related work
Speech-driven facial animation: acoustics, vocal tract and facial motion; Hidden Markov Models (HMMs); deep neural networks; convolutional neural networks
GAN-based video synthesis: image/video generation; MoCoGAN; cross-modal applications
8
End-to-End Speech-Driven Facial Synthesis
1 generator: ReLU activations, Tanh output
2 discriminators: ReLU activations, Sigmoid output
9
Generator
Identity Encoder
Audio Encoder
Context Encoder (RNN)
Noise Generator
Frame Decoder
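A rough sketch of how these components could be composed per output frame, assuming the encoder/decoder modules already exist; the function name, tensor shapes, and the per-frame noise draw are illustrative assumptions, not the authors' exact implementation:

import torch

def generate_sequence(identity_enc, audio_enc, context_rnn, frame_dec,
                      still_image, audio_chunks, noise_dim=10):
    """Illustrative composition: one still image + audio -> a sequence of frames."""
    identity = identity_enc(still_image)              # (B, d_id), reused for every frame
    audio_feats = audio_enc(audio_chunks)             # (B, T, 256) per-chunk features
    context, _ = context_rnn(audio_feats)             # (B, T, d_ctx) temporal context
    frames = []
    for t in range(context.size(1)):
        z = torch.randn(identity.size(0), noise_dim)  # per-frame noise (noise generator)
        latent = torch.cat([identity, context[:, t], z], dim=1)
        frames.append(frame_dec(latent))              # decode one video frame
    return torch.stack(frames, dim=1)                 # (B, T, C, H, W)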
10
Audio Encoder & Context Encoder
Audio Encoder: 7-layer CNN that extracts 256-dimensional features, passed on to the RNN
Context Encoder: 2-layer GRU (Gated Recurrent Unit); see the sketch below
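A hedged sketch of such an audio encoder and context encoder in PyTorch; only the 256-dimensional output and the 2-layer GRU come from the slide, while kernel sizes, strides, channel counts, and the pooling step are assumptions:

import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """7-layer 1D CNN over a raw audio chunk; emits one 256-d feature per chunk."""
    def __init__(self, out_dim=256):
        super().__init__()
        layers, ch = [], 1
        for out_ch in (16, 32, 64, 128, 128, 256, out_dim):   # 7 conv layers
            layers += [nn.Conv1d(ch, out_ch, kernel_size=4, stride=2, padding=1),
                       nn.ReLU()]
            ch = out_ch
        self.cnn = nn.Sequential(*layers)

    def forward(self, audio):              # audio: (B, 1, samples)
        feats = self.cnn(audio)            # (B, 256, L)
        return feats.mean(dim=2)           # pool over time -> (B, 256)

# 2-layer GRU that turns the per-chunk features into temporal context.
context_rnn = nn.GRU(input_size=256, hidden_size=256, num_layers=2, batch_first=True)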
11
Identity Encoder & Frame Decoder
Identity Encoder: 6-layer CNN that produces the identity encoding
Frame Decoder: generates one frame of the sequence
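In the same spirit, a sketch of a 6-layer CNN identity encoder and an upsampling frame decoder; all channel counts, the 64x64 resolution, and the latent size are illustrative assumptions:

import torch
import torch.nn as nn

class IdentityEncoder(nn.Module):
    """6-layer CNN that turns the single still image into an identity encoding."""
    def __init__(self):
        super().__init__()
        chans = (3, 32, 64, 128, 256, 256, 128)          # 6 conv layers
        layers = []
        for i in range(6):
            layers += [nn.Conv2d(chans[i], chans[i + 1], 4, stride=2, padding=1), nn.ReLU()]
        self.cnn = nn.Sequential(*layers)

    def forward(self, img):                  # img: (B, 3, 64, 64)
        return self.cnn(img).flatten(1)      # identity encoding, here (B, 128)

class FrameDecoder(nn.Module):
    """Upsamples the combined latent (identity + context + noise) into one frame."""
    def __init__(self, latent_dim=394):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 256, 4), nn.ReLU(),               # 1x1 -> 4x4
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(), # 8x8
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),  # 16x16
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),   # 32x32
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh(),    # 64x64
        )

    def forward(self, latent):               # latent: (B, latent_dim)
        return self.net(latent[:, :, None, None])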
12
Discriminators
Frame Discriminator: 6-layer CNN; is an individual frame real or not?
Sequence Discriminator: judges the video as a whole sequence
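A hedged sketch of the two-discriminator idea: one CNN scores single frames, a second network scores whole sequences. The 6-layer frame discriminator follows the slide; everything else (resolution, the GRU-based sequence model, feature sizes) is an assumption:

import torch
import torch.nn as nn

class FrameDiscriminator(nn.Module):
    """6-layer CNN that answers: is this single frame real or generated?"""
    def __init__(self):
        super().__init__()
        chans = (3, 32, 64, 128, 256, 256, 1)            # 6 conv layers
        layers = []
        for i in range(6):
            layers.append(nn.Conv2d(chans[i], chans[i + 1], 4, stride=2, padding=1))
            layers.append(nn.ReLU() if i < 5 else nn.Sigmoid())
        self.cnn = nn.Sequential(*layers)

    def forward(self, frame):                # frame: (B, 3, 64, 64)
        return self.cnn(frame).flatten(1)    # realism probability per frame

class SequenceDiscriminator(nn.Module):
    """Scores an entire video, so the generator is also penalised for
    temporally inconsistent sequences."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.frame_feat = nn.Sequential(nn.Linear(3 * 64 * 64, feat_dim), nn.ReLU())
        self.rnn = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.head = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())

    def forward(self, video):                # video: (B, T, 3, 64, 64)
        b, t = video.shape[:2]
        feats = self.frame_feat(video.reshape(b * t, -1)).reshape(b, t, -1)
        _, h = self.rnn(feats)
        return self.head(h[-1])              # one realism probability per sequence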
13
Training
Loss: adversarial loss combined with an L1 reconstruction loss; training seeks the optimal generator G* against the maximizing discriminators (optimizer setup sketched below)
Adam learning rates:
Generator: 2 * 10^-4
Frame Discriminator: 10^-3, decayed after epoch 20 (10% rate)
Sequence Discriminator: 5 * 10^-5
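A sketch of the optimizer setup using the learning rates from the slide; the placeholder modules, the interpretation of the decay schedule, and the use of nn.L1Loss for the reconstruction term are assumptions:

import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

# Placeholder modules stand in for the generator and the two discriminators.
generator     = nn.Linear(8, 8)
frame_disc    = nn.Linear(8, 1)
sequence_disc = nn.Linear(8, 1)

# Learning rates as given on the slide.
opt_g = Adam(generator.parameters(),     lr=2e-4)
opt_f = Adam(frame_disc.parameters(),    lr=1e-3)
opt_s = Adam(sequence_disc.parameters(), lr=5e-5)

# One reading of "decay after epoch 20 (10% rate)": multiply the frame
# discriminator's learning rate by 0.1 once epoch 20 is reached.
sched_f = StepLR(opt_f, step_size=20, gamma=0.1)

# L1 reconstruction term that is added to the adversarial loss.
l1_loss = nn.L1Loss()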
14
Experiments
Implemented in PyTorch; training on an Nvidia GTX 1080 Ti takes about a week
GPU: avg. generation time 7 ms; 75 sequential frames synthesized in 0.5 s
CPU: avg. generation time 1 s; 75 sequential frames synthesized in 15 s
15
Experiments (2)
Datasets: GRID and TCD; training data increased by mirroring
Metrics:
Generated video quality: PSNR & SSIM
Frame sharpness: FDBM & CPBD
Content: ACD
Accuracy of the spoken message: WER
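As an illustration, PSNR and SSIM can be computed with scikit-image; the two arrays below are dummy frames, not data from the paper:

import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Dummy 64x64 grayscale frames stand in for a real and a generated frame.
real = np.random.rand(64, 64)
fake = np.clip(real + 0.05 * np.random.randn(64, 64), 0, 1)

psnr = peak_signal_noise_ratio(real, fake, data_range=1.0)
ssim = structural_similarity(real, fake, data_range=1.0)
print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.3f}")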
16
Qualitative Results
Produces realistic videos
Also works on previously unseen faces
Characteristic human expressions: frowns, blinks
17
Qualitative Results (2)
Static baseline: a GAN-based method with L1 and adversarial loss, used for quantitative assessment
Failures of the static baseline: opening the mouth when silent, neglecting the previous face
18
Quantitative Results
Performance measured on the GRID & TCD datasets, compared against the static baseline
30-person survey (Turing test): 10 videos, 153 responses, avg. 63% correct
19
Quiz
20
Future work
Different architectures for more natural sequences
Expressions are currently generated randomly; a natural extension is to capture mood and reflect it in the facial expressions
21
Questions