Designing Facial Animation for Spoken Persian
Hadi Rahimzadeh
June 2005
System Description
- Input: speech signal (a continuous speech stream)
- Output: facial animation of a generic 3D face in the MPEG-4 standard
Agenda
- MPEG-4 standard
- Speech processing
- Different approaches
- Learning phase
- Face feature extraction
- Training neural networks
- Experimental results
- Conclusion
MPEG-4 Standard
- Multimedia communication standard, 1999, by the Moving Picture Experts Group
- High quality at a low bit rate
- Interaction of users with media
- Object oriented: object properties, scalable quality
- SNHC (Synthetic/Natural Hybrid Coding): synthetic faces and bodies
Facial Animation in MPEG-4
- FDP (Face Definition Parameters): shape (84 feature points) and texture
- FAP (Face Animation Parameters): 68 parameters for animating the feature points; high-level and low-level, global and local; expressed in FAP units (FAPUs)
Face Definition Parameters
Face Animation Parameter Units
Speech Processing
Phases:
- Noise reduction (simple noise)
- Framing
- Feature extraction
Speech features: LPC, MFCC, Delta MFCC, Delta-Delta MFCC.
Each frame yields one feature vector (frame 1 -> feature vector X1, frame 2 -> feature vector X2, ...); a sketch follows.
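For illustration, a minimal sketch of the framing and MFCC-based feature-extraction step using librosa; the file name, hop size, and number of coefficients are assumptions, and the original work may have used entirely different tooling.

```python
import numpy as np
import librosa

# Load the speech track and extract MFCC-based features frame by frame.
y, sr = librosa.load("speech.wav", sr=None)   # "speech.wav" is a placeholder

# A 20 ms hop gives 50 feature vectors per second, matching the 50 fps
# speech rate mentioned in the training slides.
hop = int(0.020 * sr)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
d1 = librosa.feature.delta(mfcc)              # Delta MFCC
d2 = librosa.feature.delta(mfcc, order=2)     # Delta-Delta MFCC

# One feature vector X_k per frame; rows are frames, columns are features.
features = np.vstack([mfcc, d1, d2]).T
```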
Two Approaches
- Phoneme-viseme mapping: transitions among visemes; discrete phonetic units; highly stylized; language dependent
- Acoustic-visual mapping: learns the relation between speech features and facial movements as a function approximation; language independent; neural networks and HMMs serve as the learning machines for the mapping
Learning Phase
- From the speaker video, the speech stream goes through feature extraction
- From the same video, FAPs are extracted (and can be played back with a FAP player)
- The speech feature vectors and the FAPs are then used to train the neural network
Face Feature Extraction
- Deformable-template-based approach, semi-automatic
- Candide model: a wireframe model used for model-based coding; parameterized; 113 vertices, 168 faces
Candide Model
Parameters of the wireframe model (WFM):
- Global: 3D rotation, 2D translation, scale
- Shape units: lip width, eye distance, ...
- Action units: lip shape, eyebrow movement, ...
- Each parameter value is a real number
- Texture
A sketch of how these parameters deform the mesh follows.
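A minimal sketch of how such wireframe parameters typically combine, following the standard Candide formulation g = s * R * (g_bar + S*sigma + A*alpha) + t; the array names and shapes here are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def deform_candide(g_bar, S, A, sigma, alpha, R, t, scale):
    """Apply shape units (sigma), action units (alpha), and the global
    rotation/translation/scale to the base wireframe geometry.

    g_bar : (3N,) base vertex coordinates of the wireframe model
    S     : (3N, num_shape_units) shape-unit basis
    A     : (3N, num_action_units) action-unit basis
    R     : (3, 3) rotation matrix, t : (3,) translation, scale : float
    """
    g = g_bar + S @ sigma + A @ alpha          # static shape + dynamic actions
    g = g.reshape(-1, 3) @ R.T                 # global 3D rotation
    return scale * g + t                       # global scale and translation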
New Face Generation
Transformation
Three point correspondences between the source image and the target model: (a1, b1) <-> (x1, y1), (a2, b2) <-> (x2, y2), (a3, b3) <-> (x3, y3).
Transformation (cont.)
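The three correspondences shown above are enough to determine an affine warp from the source face image onto the model. A minimal sketch of solving for that warp; the sample coordinates and function names are illustrative only.

```python
import numpy as np

def affine_from_correspondences(src, dst):
    """Solve for the 2D affine map [x, y]^T = M[:, :2] @ [a, b]^T + M[:, 2]
    from three point correspondences src[i] -> dst[i]."""
    A, b = [], []
    for (sa, sb), (x, y) in zip(src, dst):
        # Each correspondence contributes two linear equations.
        A.append([sa, sb, 1, 0, 0, 0])
        A.append([0, 0, 0, sa, sb, 1])
        b.extend([x, y])
    p = np.linalg.solve(np.array(A, float), np.array(b, float))
    return p.reshape(2, 3)                      # [[m11, m12, tx], [m21, m22, ty]]

def apply_affine(M, pts):
    pts = np.asarray(pts, float)
    return pts @ M[:, :2].T + M[:, 2]

# Example: map three source-image points onto three model points.
src = [(0, 0), (1, 0), (0, 1)]
dst = [(10, 20), (30, 22), (12, 45)]
M = affine_from_correspondences(src, dst)
print(apply_affine(M, src))                     # reproduces dst
```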
New Face Generation
Model Adaptation
Selecting the optimal parameters:
- Global parameters: 3D rotation, 2D translation, scale
- Lip parameters: upper lip, jaw opening, lip width, vertical movement of the lip corners
Search strategy:
- A full search over the parameter space is expensive
- Instead, use the previous frame's parameters as a starting point (see the sketch below)
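The slides do not spell out how previous-frame information is exploited, so the following is only an illustrative sketch of a local, coordinate-wise search around the previous frame's parameter values; the fit_error function, the parameter dictionary, and the step sizes are assumptions.

```python
def adapt_parameters(prev_params, fit_error, deltas, steps=2):
    """Refine model parameters for the current frame by searching a small
    neighbourhood around the previous frame's values instead of a full search.

    prev_params : dict  parameter name -> value from the previous frame
    fit_error   : callable(dict) -> float, e.g. an image/model matching error
    deltas      : dict  parameter name -> step size to try in each direction
    """
    best = dict(prev_params)
    best_err = fit_error(best)
    for name, step in deltas.items():
        for k in range(-steps, steps + 1):
            cand = dict(best)
            cand[name] = prev_params[name] + k * step   # stay near last frame
            err = fit_error(cand)
            if err < best_err:
                best, best_err = cand, err
    return best
```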
Lip Reading
- Use colour data to locate the lip area
- Use the extracted lip area to estimate the lip model parameters: upper lip, jaw opening, mouth width, lip corners
- Use the related vertices of the Candide model
- Two regions are marked in the first frame: a lip region and a non-lip region
Lip Area Classification
- Fisher Linear Discriminant (FLD): simple and fast
- Two point sets X and Y in n dimensions
- m1 is the mean of the projection of X onto the unit vector alpha; m2 is the mean of the projection of Y onto alpha
- Find the alpha that maximizes the Fisher criterion J(alpha) = (m1 - m2)^2 / (s1^2 + s2^2), where s1^2 and s2^2 are the scatters of the two projected sets
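A minimal FLD sketch using the usual closed-form solution alpha proportional to Sw^-1 (m_X - m_Y); the feature vectors would be the per-pixel colour values described on the following slides, and the threshold choice is an assumption.

```python
import numpy as np

def fisher_direction(X, Y):
    """Fisher Linear Discriminant: direction maximizing the ratio of
    projected between-class scatter to within-class scatter.

    X : (n_x, d) samples of class 1 (e.g. lip pixels)
    Y : (n_y, d) samples of class 2 (e.g. non-lip pixels)
    """
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Sw = (np.cov(X, rowvar=False) * (len(X) - 1)
          + np.cov(Y, rowvar=False) * (len(Y) - 1))   # within-class scatter
    alpha = np.linalg.solve(Sw, mx - my)              # Sw^-1 (m_x - m_y)
    return alpha / np.linalg.norm(alpha)

def classify(alpha, threshold, pixels):
    """Label a pixel as lip when its projection onto alpha exceeds threshold."""
    return pixels @ alpha > threshold
```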
Estimating Lip Parameters
- The FLD is trained on pixels from the first frame
- The colour values of the pixels serve as features; HSV works better than RGB
- HSV is more robust under varying brightness conditions
Lip Area Classification
- A simple approach for estimating the lip parameters from the classified lip pixels:
- Column scanning
- Row scanning
A sketch follows.
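A sketch of what column and row scanning over the binary classification result could look like; interpreting the horizontal extent as lip width and the vertical extent as mouth opening is an assumption, not stated on the slide.

```python
import numpy as np

def lip_extent(lip_mask):
    """Estimate mouth width and opening from a binary lip mask by scanning
    columns and rows for the extent of lip-labelled pixels."""
    cols = np.where(lip_mask.any(axis=0))[0]   # columns containing lip pixels
    rows = np.where(lip_mask.any(axis=1))[0]   # rows containing lip pixels
    if len(cols) == 0 or len(rows) == 0:
        return 0, 0
    width = cols[-1] - cols[0] + 1             # horizontal extent -> lip width
    height = rows[-1] - rows[0] + 1            # vertical extent -> mouth opening
    return width, height
```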
Generating FAPs from the Model
- Generate a FAP file from the model parameters
- The FAP file format was worked out by trial and error
- Open-source FAP players take a FAP file and a wave file as input
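A rough sketch of converting the estimated lip parameters into low-level FAP amplitudes; the FAP numbers and the FAPU scaling values are illustrative assumptions, and the real MPEG-4 FAP file additionally needs the header and per-frame mask lines required by the format (which the original work reverse-engineered by trial and error).

```python
def lip_params_to_faps(jaw_open, lip_width, upper_lip, corner_y,
                       mns=40.0, mw=60.0):
    """Map the four estimated lip parameters to low-level FAP amplitudes.
    FAP numbers and FAPU values (mns, mw) are illustrative assumptions."""
    return {
        3: jaw_open * mns,       # open_jaw
        4: upper_lip * mns,      # upper-lip movement
        6: lip_width * mw,       # stretch left lip corner
        7: lip_width * mw,       # stretch right lip corner
        12: corner_y * mns,      # vertical lip-corner movement
    }

def write_fap_frames(frames, path):
    """Write one line per frame: 'frame_no fap_no=value ...'.
    Header and mask lines of a real FAP file are omitted here."""
    with open(path, "w") as f:
        for k, faps in enumerate(frames):
            vals = " ".join(f"{n}={v:.1f}" for n, v in sorted(faps.items()))
            f.write(f"{k} {vals}\n")
```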
Training Neural Networks
- Data set: 60 videos; 45 sentences for training, 15 sentences for testing
- Multilayer perceptrons: one input layer, one hidden layer, one output layer
- Trained with the backpropagation algorithm
- Nine neurons in the output layer: five global parameters and four lip parameters
Training Neural Networks (cont.)
- Four speech features: LPC, MFCC, Delta MFCC, Delta-Delta MFCC
- Six networks per speech feature:
  - one feature vector as input, with 30, 60, or 90 neurons in the hidden layer
  - three feature vectors as input, with 90, 120, or 150 neurons in the hidden layer
- Frame rates: video 25 fps, speech 50 fps
A training sketch follows.
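A minimal sketch of training one such network, using scikit-learn's MLPRegressor as a stand-in for the hand-trained backpropagation network; the random arrays are placeholders for the aligned speech features and face parameters, and stacking three consecutive feature vectors corresponds to one of the tested input variants.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def stack_context(X, context=3):
    """Concatenate `context` consecutive feature vectors into one input row."""
    return np.hstack([X[i:len(X) - context + i + 1] for i in range(context)])

# Placeholders: speech feature vectors and the nine face parameters
# (5 global + 4 lip) extracted from the corresponding video frames.
X = np.random.randn(500, 13)
y = np.random.randn(498, 9)          # aligned with the 3-frame context windows

net = MLPRegressor(hidden_layer_sizes=(90,),   # one hidden layer, 90 neurons
                   max_iter=2000)
net.fit(stack_context(X, 3), y)
pred = net.predict(stack_context(X, 3))        # predicted face parameters
```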
Generating Results from the NNs
- The trained networks generate the four lip parameters for each frame
Assessment Criterion
- A performance metric to measure the prediction accuracy of the audio-visual mapping: correlation coefficients
- G equals one when the true and predicted trajectories are identical
- k is the frame index; N is the number of frames in the test set
A sketch of the computation follows.
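The slide does not show the exact formula, so the sketch below uses the standard Pearson correlation over the N test frames as an assumed but conventional instance of this metric; it equals one when the two trajectories coincide.

```python
import numpy as np

def correlation_coefficient(true_traj, pred_traj):
    """Correlation G between the true and predicted parameter trajectories
    over the N test frames (standard Pearson correlation)."""
    t = np.asarray(true_traj, float) - np.mean(true_traj)
    p = np.asarray(pred_traj, float) - np.mean(pred_traj)
    return float(np.sum(t * p) / np.sqrt(np.sum(t * t) * np.sum(p * p)))
```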
Results For LPC Networks
Results For MFCC Networks
Results For Delta MFCC Networks
Results For Delta Delta MFCC Networks
Conclusion
- Speech-driven facial animation is feasible
- Delta-Delta MFCC gives the best performance
- Using previous and next speech frames improves performance
- Combining different speech features is a promising direction
Future Work
- More training data
- Speaker-independent training data
- Multiple languages
- Other speech features and combinations of speech features
- Facial emotions
- HMMs for storing the mappings
Thanks…