Published by Monika Niemiec. Modified over 5 years ago.
1
Audio-Driven Facial Animation by Joint End-to-End Learning of Pose and Emotion
Presented by: Iwan Boksebeld and Marijn Suijten
2
Authors
Tero Karras (NVIDIA)
Timo Aila (NVIDIA)
Samuli Laine (NVIDIA)
Antti Herva (Remedy Entertainment)
Jaakko Lehtinen (NVIDIA & Aalto University)
3
Goals of the paper
Create 3D mesh from just audio
With the use of CNNs
While keeping low latency
And factoring in emotions
Notes: Quickly mention the drawbacks of vision-based performance capture (PC). It is used in games, with audio-based animation filling the gaps, but the quality of the latter leaves much to be desired.
4
The Problem
Create a full face mesh for added realism
Use emotions in the animation
Dealing with ambiguity of audio
Creating a CNN and training this
5
Related Work
6
Linguistic based animation
Input audio often with transcript
Animation results from language based rules
The strength of this method is the high level of control
The weakness is the complexity of the system
An example of such a model is the Dominance model (DM)
7
Machine learning techniques
Mostly in 2D
Learn the rules given in linguistic based animation
Blend and/or concatenate images to produce results
Not really useful for the application here
8
Capturing emotions
Mostly based on user parameters
Some work with neural networks
Creates mapping from emotion parameters to facial expression
9
The technical side
10
Audio processing
16 kHz mono; normalized volume
260 ms of past and future samples, total of 520 ms
Value empirically chosen
Take 64 audio frames of 16 ms
2x overlap: every 8 ms used twice
Hann window: reduce temporal aliasing effects
Notes: The window size is empirically chosen: enough to capture phoneme articulation, without providing too much data for overfitting. “i.e., 260ms of past and future samples” (64 * ) / 2 = 520
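The frame arithmetic on this slide can be checked with a small sketch. It assumes the 64 frames tile the window as 63 hops of 8 ms plus one full 16 ms frame; that tiling and the helper names are assumptions for illustration, not something the slides state.

```python
SAMPLE_RATE_HZ = 16_000   # from the slide: 16 kHz mono
FRAME_MS = 16             # from the slide: audio frames of 16 ms
NUM_FRAMES = 64           # from the slide: 64 audio frames
HOP_MS = FRAME_MS // 2    # 2x overlap, so consecutive frames start 8 ms apart


def window_length_ms(num_frames=NUM_FRAMES, frame_ms=FRAME_MS, hop_ms=HOP_MS):
    """Span covered by overlapping frames: (n - 1) hops plus one full frame."""
    return (num_frames - 1) * hop_ms + frame_ms


def ms_to_samples(ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Convert a duration in milliseconds to a whole number of samples."""
    return ms * sample_rate_hz // 1000


total_ms = window_length_ms()
print(total_ms, ms_to_samples(total_ms), ms_to_samples(FRAME_MS), ms_to_samples(HOP_MS))
# 520 8320 256 128
```

Under this assumption the 520 ms window is 8320 samples, each frame is 256 samples, and frames start 128 samples apart.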
11
Autocorrelation
Calculate K=32 autocorrelation coefficients
12 enough for identifying individual phonemes
Need more to identify pitch
No special techniques for linear separation of phonemes
Tests indicate this process is clearly superior
Notes: Resonance frequencies are entirely determined by the autocorrelation coefficients. Early tests indicate that using the autocorrelation coefficients directly is clearly superior.
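A minimal sketch of this step: window each frame with a Hann window and keep the first K=32 autocorrelation coefficients. This is a plain autocorrelation in pure Python. Any further preprocessing the authors apply is not on the slides and is not modeled here, and the 256-sample frame and 128-sample hop are assumed from the 16 ms / 8 ms figures at 16 kHz.

```python
import math


def hann_window(n):
    """Symmetric Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def autocorrelation(frame, k=32):
    """First k autocorrelation coefficients r[0..k-1] of one windowed frame."""
    n = len(frame)
    return [sum(frame[t] * frame[t + lag] for t in range(n - lag)) for lag in range(k)]


def frame_features(signal, frame_len=256, hop=128, k=32):
    """Split the signal into overlapping frames and return one row of k coefficients per frame."""
    window = hann_window(frame_len)
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        features.append(autocorrelation(frame, k))
    return features


# Example: a 520 ms (8320-sample) test tone yields a 64 x 32 feature grid.
tone = [math.sin(2.0 * math.pi * 440.0 * n / 16000.0) for n in range(8320)]
grid = frame_features(tone)
print(len(grid), len(grid[0]))
```

The resulting 64 x 32 grid matches the time axis and coefficient count that the network slide lists as its input.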
12
CNN Layout
Formant analysis network:
First layer is audio processing and autocorrelation
Time axis of 64 frames
32 autocorrelation coefficients
Followed by 5 convolution layers
Convert formant audio features to 256 abstract feature maps
Notes: Formant = “Resonance Frequency of linear filters, which carry information of a phoneme”. The layers learn to extract short-term features that are relevant for facial animation, such as intonation, emphasis, and specific phonemes.
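To see how five convolution layers could reduce the 32 coefficients per frame, the sketch below traces the size of the coefficient axis through the layers. The kernel size 3, stride 2, and padding 1 are assumptions chosen for illustration; the slides only give the layer count, the 64 x 32 input, and the 256 output feature maps.

```python
def conv_out_len(length, kernel, stride, pad):
    """Standard output length of a 1D convolution along one axis."""
    return (length + 2 * pad - kernel) // stride + 1


def trace_formant_axis(coeffs=32, layers=5, kernel=3, stride=2, pad=1):
    """Sizes of the autocorrelation axis before and after each assumed layer."""
    sizes = [coeffs]
    for _ in range(layers):
        sizes.append(conv_out_len(sizes[-1], kernel, stride, pad))
    return sizes


print(trace_formant_axis())
# [32, 16, 8, 4, 2, 1]
```

With these assumed settings the coefficient axis collapses to a single value after the fifth layer while the 64-frame time axis is left untouched, so each frame ends up described by the 256 feature maps alone.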
13
CNN Layout
Articulation network:
Analyze temporal evolution
5 layers as well
Emotion vector concatenated
14
Working with emotions
Speech is highly ambiguous
Consider silence: what does it look like?
15
Representing Emotions
Emotional state stored as "meaningless" E-dimensional vector
Learns with the network
Vector concatenated to convolution layers in articulation network
Concatenated to every layer: significantly better result
Support early layers with nuanced control over details such as coarticulation
Later layers have more control over the output pose
Notes: Explain "meaningless": these values are changed during the learning phase (just like a regular NN layer) and carry no semantic meaning otherwise. This layer tries to "learn" all information that is not inferable from the audio. Don't forget to mention that the bottom two points are assumptions from the authors.
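The concatenation can be pictured as appending the same E values to the channel list at every time step of a layer's output. The sketch below uses plain lists; the layout (a list of time steps, each a list of channels) and the value E = 16 are assumptions for illustration only.

```python
def concat_emotion(feature_maps, emotion):
    """Append the emotion vector to the channels of every time step.

    feature_maps: list over time steps, each a list of C channel values.
    emotion: list of E values shared by all time steps of this sample.
    """
    return [channels + list(emotion) for channels in feature_maps]


# Hypothetical sizes: 4 time steps, 256 channels, E = 16.
layer_output = [[0.0] * 256 for _ in range(4)]
emotion_vector = [0.1] * 16
augmented = concat_emotion(layer_output, emotion_vector)
print(len(augmented), len(augmented[0]))
# 4 272
```

Repeating this after every articulation layer means each layer sees the emotion vector directly, which is what the slide credits for the significantly better result.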
16
Training
17
Training Target
Use 9 cameras to get unstructured mesh and optical flow
Project template mesh onto unstructured mesh
Link optical flow to template
Template mesh is then tracked across performance
Use some vertices to stabilize head
Limitation: no tongue
18
Training Data
Pangrams and in-character material
3-5 minutes per actor (trade-off quality vs. time/cost)
Pangrams: designed sentences with as many sounds of a language as possible
In-character: capture emotions based on character narrative
Time-shifting data augmentation
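The slide only names time-shifting augmentation, so the following is a hedged sketch of one way it could work: shift the audio window by a random fraction of a video frame and linearly interpolate the target pose between neighboring frames by the same amount. The function names, the half-frame shift limit, and the data layout are all assumptions.

```python
import random


def lerp_pose(pose_a, pose_b, t):
    """Linear interpolation between two poses (flat coordinate lists)."""
    return [(1.0 - t) * a + t * b for a, b in zip(pose_a, pose_b)]


def time_shifted_sample(audio, poses, frame_idx, samples_per_frame, window, rng, max_shift=0.5):
    """Return (audio_window, target_pose) shifted by a random fraction of a video frame.

    Assumes 1 <= frame_idx <= len(poses) - 2 and that the shifted window
    stays inside the audio buffer; the caller is responsible for both.
    """
    shift = rng.uniform(-max_shift, max_shift)  # measured in video frames
    center = int(round((frame_idx + shift) * samples_per_frame))
    half = window // 2
    audio_window = audio[center - half:center + half]
    if shift >= 0:
        target = lerp_pose(poses[frame_idx], poses[frame_idx + 1], shift)
    else:
        target = lerp_pose(poses[frame_idx], poses[frame_idx - 1], -shift)
    return audio_window, target


# Toy data: audio[i] == i and poses[i] == [i], so the shift is easy to read off.
rng = random.Random(0)
audio = [float(i) for i in range(100_000)]
poses = [[float(i)] for i in range(100)]
window, target = time_shifted_sample(audio, poses, 50, 533, 8320, rng)
print(len(window), round(target[0], 3))
```

Shifting audio and target together keeps them aligned while presenting the network with slightly different inputs each epoch, which is useful given only 3-5 minutes of data per actor.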
19
Loss Function
Loss function in 3 terms:
Position term
Motion term
Regularization term
Use normalization scheme to balance these terms
20
Position Term
Ensure correct vertex location
V: # of vertices
y: desired position
ŷ: actual position
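The formula itself did not survive in the transcript. A minimal sketch, assuming the term is a mean squared error over all 3V vertex coordinates (the exact scaling in the paper may differ):

```python
def position_term(desired, predicted):
    """Mean squared error over the 3V coordinates of one frame.

    desired, predicted: flat lists [x0, y0, z0, x1, ...] of equal length 3V.
    """
    if len(desired) != len(predicted):
        raise ValueError("desired and predicted must have the same length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(desired, predicted)) / len(desired)


# One vertex (V = 1): errors of 1, 2 and 2 along x, y and z.
print(position_term([0.0, 0.0, 0.0], [1.0, 2.0, 2.0]))
# 3.0
```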
21
Motion Term
Ensure correct motion
Comparing paired frames
m(·): difference between paired frames
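A sketch of the idea, assuming m is the finite difference between the two frames of a pair and the term is the mean squared mismatch between desired and predicted differences (any constant scale factor used in the paper is omitted):

```python
def frame_difference(frame_a, frame_b):
    """m(.): finite difference between the two frames of a temporal pair."""
    return [b - a for a, b in zip(frame_a, frame_b)]


def motion_term(desired_pair, predicted_pair):
    """Mean squared mismatch between desired and predicted frame-to-frame motion."""
    m_desired = frame_difference(*desired_pair)
    m_predicted = frame_difference(*predicted_pair)
    return sum((d - p) ** 2 for d, p in zip(m_desired, m_predicted)) / len(m_desired)


desired = ([0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
offset_but_same_motion = ([5.0, 5.0, 5.0], [6.0, 6.0, 6.0])
frozen = ([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
print(motion_term(desired, offset_but_same_motion), motion_term(desired, frozen))
# 0.0 1.0
```

The example shows why this term complements the position term: a constant positional offset costs nothing here, while a frozen prediction is penalized for missing the motion.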
22
Regularization Term
Ensure no erratic emotion
Normalized to prevent becoming ineffective
E: # of emotion components
e(i): i-th component of the emotion vector for sample x
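A hedged sketch of why the normalization matters: if the penalty were only the squared change of the emotion vector between paired frames, the network could shrink all emotion values to make it vanish. Dividing by the mean squared magnitude makes the term scale-invariant. The batch-level normalization below is a simplification and an assumption; the slides do not give the exact scheme.

```python
def regularization_term(emotion_pairs, eps=1e-8):
    """Squared change of the emotion vector across paired frames, normalized by its magnitude.

    emotion_pairs: list of (e_a, e_b), the emotion vectors of two adjacent frames.
    """
    change = 0.0
    magnitude = 0.0
    count = 0
    for e_a, e_b in emotion_pairs:
        for a, b in zip(e_a, e_b):
            change += (b - a) ** 2
            magnitude += 0.5 * (a * a + b * b)
            count += 1
    return (change / count) / (magnitude / count + eps)


pairs = [([1.0, 2.0], [2.0, 0.0])]
scaled = [([10.0, 20.0], [20.0, 0.0])]
steady = [([1.0, 2.0], [1.0, 2.0])]
print(regularization_term(pairs), regularization_term(scaled), regularization_term(steady))
```

The first two values agree up to the small eps, showing that rescaling the emotion vector does not reduce the penalty, and a steady emotion vector scores zero.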
23
Inference
Quickly mention: audio in, vertex deltas out
24
Inferring emotion
Step 1: Cull “wrong” emotion vectors
Bilabials -> closed mouth
Vowels -> opened mouth
Step 2: Visually inspect animation
Remove short-term effects
Step 3: Use voice from different actor
Unnatural -> lack of generalization
Manually assign semantic meaning
Interpolate emotion vectors for transition/complex emotion
Notes: Step 1: ~100 left. Step 2: ~86 left after removing subdued, spurious or unnatural motion. Step 3: ~33 left. Lack of generalization means the vector will go wrong at some point. It is possible to blend/interpolate emotion vectors.
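Blending between two of the surviving emotion vectors can be sketched as plain linear interpolation. The slides do not say which interpolation scheme the authors use, so the linear blend and the function names here are assumptions.

```python
def interpolate_emotion(e_from, e_to, t):
    """Linear blend between two emotion vectors, with t clamped to [0, 1]."""
    t = min(1.0, max(0.0, t))
    return [(1.0 - t) * a + t * b for a, b in zip(e_from, e_to)]


def transition(e_from, e_to, steps):
    """Evenly spaced blend from e_from to e_to over the given number of steps (>= 2)."""
    return [interpolate_emotion(e_from, e_to, i / (steps - 1)) for i in range(steps)]


print(transition([0.0, 2.0], [1.0, 0.0], 3))
# [[0.0, 2.0], [0.5, 1.0], [1.0, 0.0]]
```

Feeding a different blended vector at each frame gives a gradual transition, and holding an intermediate blend gives a mixed emotion.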
25
Results
26
Results
Video timestamps:
2:19 (general & emotions)
0:06 (vs PC)
0:53 (vs DM)
3:20 (different languages; columns represent emotional states)
27
User Study Setup
Blind user study with 20 participants
Users were asked to choose which of 2 clips was more realistic
Two sets of experiments
Comparing against other methods
DM vs PC vs Ours
Audio from validation set not used in training
13 clips of 3-8 seconds long
Generalization over language and gender
14 clips from several languages
From online database without checking output
28
User Study Results
29
User Study Results
30
User Study Results
Clearly better than DM
Still quite a bit worse than PC
Generalizes quite well over languages
Even compared to linguistic method
31
Critical Review
32
Drawbacks of solution
No residual motion
No blinking
No head movement
Assumes a higher-level system handles these
Problems with similar looking sounds
E.g. confuses B and G
Fast languages are a problem
Novel data needs to be somewhat similar to training data
Misses detail compared to PC
Emotions have no defined meaning
Notes: According to NVIDIA, the penultimate point is the main drawback.
33
Questions?
34
Discussion
35
Discussion
Is it useful to gather emotions like this paper describes?
Why not just tell what the emotion is?
36
Discussion
Should they have used more participants?
37
Discussion
Why do you think blinking and eye/head motion are not covered by the network?