Create Photo-Realistic Talking Face. Changbo Hu*. (* This work was done during a visit to Microsoft Research China, with Baining Guo and Bo Zhang.)
Outline: Introduction of talking face; Motivations; System overview; Techniques; Conclusions
Introduction: What is a talking face? Face (lip) animation, driven by voice. Applications. The process of talking face: face model; motion capture; mapping between audio and video; rendering. Photo-realistic?
Literature: Waters, 93, DECface, 2D wireframe model; Terzopoulos, 95, skin and muscle model; Bregler, 97, Video Rewrite, sample-image based; T. S. Huang, 98, mesh model from range data; Poggio, 98, MikeTalk, viseme morphing; Guenter, 99, Making Faces, 3D from multiple cameras; Zhengyou Zhang, 00, 3D face modeling from video through epipolar constraint; Cosatto, 00, planar quads model
Some Face models
Motivations: Aim: a graphics interface for a conversation agent. Photo-realistic. Driven by Chinese. Smooth connection between sentences. Extended from "Video Rewrite".
System overview: Pipeline of the system(1)
System overview: Pipeline of the system (2). Diagram labels: New text; TTS system; Wav sound; Segmentation; Triphone sequence; Train database; Synthesized triphone sequence; Lip motion sequence; Background sequence; Rewrite to faces
Techniques: Analysis: audio processing; image processing. Synthesis: lip image; background image; stitch together.
Audio part: Sound Segmentation: Given the wav file and the script. Use an HMM to train the segmentation system. Segment the wav file into a phoneme sequence. Example of the segmentation result (label, start, end): SILOPEN 0 23; SILOPEN 24 42; s 43 61; if4 62 74; j 75 80; ia1 81 97; sh 98 109; ang y e y in h ang
Annotation with Phoneme: Use phonemes to annotate the video frames. Each phoneme in a sentence corresponds to a short stretch of the video sequence. Diagram: a training sentence has video frames (frames for Phoneme1, frames for Phoneme2, …), audio frames (frames for Phoneme1, frames for Phoneme2, …) and a phoneme sequence (Phoneme1, Phoneme2, …).
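The annotation step above can be sketched as follows. This is a minimal illustration, not the system's actual code: it assumes segment boundaries are given in seconds and that the video has a fixed frame rate; the function name `annotate_frames`, the 25 fps rate, and the sample segments are all hypothetical.

```python
def annotate_frames(segments, fps, num_frames):
    """Label each video frame with the phoneme whose interval covers the frame midpoint.

    segments: list of (label, start_sec, end_sec), assumed non-overlapping.
    """
    labels = [None] * num_frames
    for i in range(num_frames):
        mid = (i + 0.5) / fps  # midpoint time of frame i, in seconds
        for label, start, end in segments:
            if start <= mid < end:
                labels[i] = label
                break
    return labels


# Hypothetical segmentation of a short utterance.
segments = [("SIL", 0.0, 0.2), ("s", 0.2, 0.36), ("i", 0.36, 0.6)]
labels = annotate_frames(segments, fps=25, num_frames=15)
print(labels)
```

Using the frame midpoint is one simple way to resolve frames that straddle a phoneme boundary; other conventions (majority overlap, for instance) would work equally well.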
Phoneme Distance Analysis: Phoneme & triphone basics; Chinese phonemes vs. English phonemes; Distance metric definitions; Results
Phoneme Basics: Phonemes represent the basic elements of speech. All possible speech can be represented by a combination of phonemes: CH, JH, S, EH, EY, OY, AE, SIL… A triphone is three consecutive phonemes. It not only represents pronunciation characteristics but also contains context information: T-IY-P, IY-P-AA, P-AA-T…
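As a small illustration of the triphone definition above, overlapping triphones can be read off a phoneme sequence with a sliding window of length three. The function name `triphones` is a hypothetical helper, and the input sequence is chosen to reproduce the examples on the slide.

```python
def triphones(phonemes):
    """Return every run of three consecutive phonemes, joined with '-'."""
    return ["-".join(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)]


result = triphones(["T", "IY", "P", "AA", "T"])
print(result)  # ['T-IY-P', 'IY-P-AA', 'P-AA-T']
```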
Chinese Phoneme vs. English: Chinese phonemes fall into two basic groups: Initials and Finals. Initials: B, P, M, F, … Finals: a3, o1, e2, eng3, iang4, ue5, … Each Chinese final has 5 tones: 1, 2, 3, 4, 5. Different tones: a1, a2, a3, a4, a5. Chinese finals are not actually basic elements of speech. For example: iang1, iao1, uang1, iong1… The Chinese phoneme set is much larger than the English one.
Phoneme Distance Analysis: Define the distance between any two phonemes. Since we synthesize only video, not sound, tone is ignored. Lip shape motion is the core element of the distance metric.
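Ignoring tone can be sketched as dropping the trailing tone digit, so that a1 to a5 collapse into one class. This assumes the labelling convention shown on the earlier slide (finals end in a tone digit 1 to 5, initials carry no digit); `strip_tone` is a hypothetical helper name.

```python
def strip_tone(phoneme):
    """Remove a trailing tone digit (1-5) from a phoneme label, if present."""
    return phoneme.rstrip("12345")


toneless = sorted({strip_tone(p) for p in ["a1", "a2", "a3", "iang4", "B", "SIL"]})
print(toneless)  # ['B', 'SIL', 'a', 'iang']
```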
Phoneme Distance Analysis: Collect the videos of each phoneme (Phoneme 1, Phoneme 2). Time-align them to a uniform length. Average the aligned videos to get an average video per phoneme. By comparing the two aligned average videos, we generate the distance matrix of the whole phoneme set.
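The align, average, compare procedure above might look like the sketch below. Several points are assumptions not stated on the slide: each video frame is represented by a lip-shape feature vector, time alignment is done by linear interpolation, and the comparison is a mean Euclidean distance per frame. All function names are hypothetical.

```python
def resample(frames, length):
    """Linearly interpolate a list of feature vectors to `length` frames."""
    if len(frames) == 1 or length == 1:
        return [list(frames[0]) for _ in range(length)]
    out = []
    for k in range(length):
        pos = k * (len(frames) - 1) / (length - 1)
        lo = int(pos)
        hi = min(lo + 1, len(frames) - 1)
        t = pos - lo
        out.append([(1 - t) * a + t * b for a, b in zip(frames[lo], frames[hi])])
    return out


def average_video(videos, length):
    """Time-align all videos of one phoneme and average them frame by frame."""
    aligned = [resample(v, length) for v in videos]
    dim = len(aligned[0][0])
    return [[sum(v[k][d] for v in aligned) / len(aligned) for d in range(dim)]
            for k in range(length)]


def phoneme_distance(videos_a, videos_b, length=5):
    """Mean per-frame Euclidean distance between two phonemes' average videos."""
    avg_a = average_video(videos_a, length)
    avg_b = average_video(videos_b, length)
    total = sum(sum((x - y) ** 2 for x, y in zip(fa, fb)) ** 0.5
                for fa, fb in zip(avg_a, avg_b))
    return total / length


# Toy data: one-dimensional "lip opening" per frame, clips of different lengths.
open_lips = [[[1.0], [1.0]], [[1.0], [1.0], [1.0]]]
closed_lips = [[[0.0], [0.0], [0.0], [0.0]]]
d = phoneme_distance(open_lips, closed_lips)
print(d)
```

Evaluating `phoneme_distance` for every pair of phonemes would fill the distance matrix.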
Image part: Pose Tracking: Assume a planar model for the face. Use a standard minimization method to find the transform matrix (affine transform) [Black,95]. A mask restricts the computation to the part of the face of interest. Figures: template picture; mask image.
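Under the planar assumption, the pose of the face in each frame reduces to a 2D affine transform with six parameters. The sketch below only shows how such a transform maps a template point into a frame; the estimation itself (the minimization of [Black,95]) is not reproduced here. The parameter layout `(a, b, c, d, tx, ty)` and the name `apply_affine` are assumptions made for illustration.

```python
def apply_affine(params, point):
    """Map (x, y) to (a*x + b*y + tx, c*x + d*y + ty)."""
    a, b, c, d, tx, ty = params
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)


# Hypothetical pose: pure translation by (5, -2).
moved = apply_affine((1.0, 0.0, 0.0, 1.0, 5.0, -2.0), (10.0, 20.0))
print(moved)  # (15.0, 18.0)
```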
Pose tracking Motion prediction using parameters with physical meaning
Pose Tracking Some tracking results:
Lip Motion Tracking: Uses Eigen Points (Covell, 91). Feature points include the jaw, lips and teeth. The training database is labeled manually. Automatic tracking runs through all pose-tracked images.
Lip motion tracking
Lip Motion Tracking Train Database (hand-labeled) Auto Tracking Results
Synthesizing new sentences: New text is converted to wav by the TTS system. The wav is segmented into a phoneme sequence. DP is used to find an optimal video sequence from the training database. The triphone videos are time-aligned and stitched together. The lip sequence is transformed and pasted onto the background faces.
Lip sequence synthesis. Diagram labels: new phoneme sequences; candidate triphones (Triphone 1 to Triphone 9, Triphone A to Triphone C); optimal phoneme sequences
Dynamic Programming. Graph nodes: Begin; Triphone1; Triphone2; Triphone3; Triphone4; Triphone5; End
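The search from Begin to End can be sketched as a standard shortest-path dynamic program over candidate clips. This is a generic Viterbi-style formulation, not necessarily the exact one used in the system; `best_path`, `node_cost`, and `edge_cost` are hypothetical names, and the toy costs below simply prefer clips that were adjacent in the training database.

```python
def best_path(candidates, node_cost, edge_cost):
    """Pick one candidate per target position so the summed cost is minimal.

    candidates[i] lists the database clips usable at target position i.
    """
    # best[i][j] = (cheapest cost of a path ending at candidates[i][j], back-pointer)
    best = [[(node_cost(0, c), None) for c in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for c in candidates[i]:
            cost, back = min(
                (best[i - 1][j][0] + edge_cost(p, c), j)
                for j, p in enumerate(candidates[i - 1])
            )
            row.append((cost + node_cost(i, c), back))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1], total


# Toy example: clips are database positions; consecutive positions join for free.
candidates = [[10, 40], [11, 70], [12, 41]]
path, total = best_path(
    candidates,
    node_cost=lambda i, c: 0.0,
    edge_cost=lambda p, c: 0.0 if c == p + 1 else 1.0,
)
print(path, total)  # [10, 11, 12] 0.0
```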
Edge Cost Definition: Two parts: 1. Phoneme distance: the distances of the 3 phonemes added together. 2. Lip shape distance over the overlapping portion of the triphone videos. The two parts are combined as a weighted sum.
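One possible reading of this two-part cost is sketched below. It assumes the phoneme term compares a target triphone with a candidate triphone position by position, that lip shapes are feature vectors compared with Euclidean distance, and that the weights are free parameters; none of these details are given on the slide, and all names are hypothetical.

```python
def triphone_distance(target, candidate, phoneme_dist):
    """Sum of the three per-position phoneme distances (0 for identical phonemes)."""
    return sum(0.0 if t == c else phoneme_dist[frozenset((t, c))]
               for t, c in zip(target, candidate))


def overlap_lip_distance(tail, head):
    """Mean Euclidean distance between lip feature vectors of the overlapping frames."""
    n = min(len(tail), len(head))
    if n == 0:
        return 0.0
    return sum(sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
               for a, b in zip(tail, head)) / n


def edge_cost(target, candidate, tail, head, phoneme_dist, w_phoneme=1.0, w_lip=1.0):
    """Weighted sum of the phoneme term and the lip-shape term."""
    return (w_phoneme * triphone_distance(target, candidate, phoneme_dist)
            + w_lip * overlap_lip_distance(tail, head))


# Toy values: only the middle phoneme differs, and one overlapping frame.
phoneme_dist = {frozenset(("b", "p")): 0.1}
cost = edge_cost(("a", "b", "a"), ("a", "p", "a"),
                 tail=[[0.0, 0.0]], head=[[3.0, 4.0]],
                 phoneme_dist=phoneme_dist, w_phoneme=1.0, w_lip=0.5)
print(cost)
```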
Background video generation: The background is a video sequence in which the virtual character is saying something else. Similarity measurement of background frames. Select a "standard frame": the frame with the maximal number of frames similar to it. Filter out the frames with jerkiness.
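The standard-frame selection can be sketched as counting, for each frame, how many frames lie within a similarity threshold of it. The sketch assumes each background frame is summarized by a feature vector (pose parameters, for instance) and uses Euclidean distance with a hand-picked threshold. Treating frames far from the standard frame as the ones to filter out is also an assumption; the slide does not say how jerkiness is detected.

```python
def select_standard_frame(frames, threshold):
    """Return (index of the standard frame, indices of frames similar to it)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # For every frame, count how many frames are within the threshold.
    counts = [sum(1 for g in frames if dist(f, g) <= threshold) for f in frames]
    standard = max(range(len(frames)), key=counts.__getitem__)
    keep = [i for i, f in enumerate(frames)
            if dist(f, frames[standard]) <= threshold]
    return standard, keep


# Toy one-dimensional pose values; the last frame is an outlier.
frames = [[0.0], [0.1], [0.2], [5.0]]
standard, keep = select_standard_frame(frames, threshold=0.15)
print(standard, keep)  # 1 [0, 1, 2]
```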
Stitch the time-aligned result onto the background faces: Transform the synthesized lip to the background face. Write back with a mask.
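A masked write-back can be sketched as a per-pixel blend between the synthesized lip image and the background frame. The sketch assumes the lip image has already been transformed into the background frame's coordinates, that images are grayscale grids of floats, and that the mask holds weights in [0, 1]; `write_back` is a hypothetical name.

```python
def write_back(background, lip, mask):
    """Blend per pixel: mask * lip + (1 - mask) * background."""
    return [[m * l + (1.0 - m) * b for b, l, m in zip(brow, lrow, mrow)]
            for brow, lrow, mrow in zip(background, lip, mask)]


# One-row toy images: mask 0 keeps the background, mask 1 takes the lip pixel.
background = [[10.0, 10.0, 10.0]]
lip = [[50.0, 50.0, 50.0]]
mask = [[0.0, 0.5, 1.0]]
blended = write_back(background, lip, mask)
print(blended)  # [[10.0, 30.0, 50.0]]
```

A mask that falls off gradually toward its border avoids a visible seam between the pasted lip region and the surrounding face.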
Figures: mask image for the write-back operation; original background frame; write-back result of the same frame
More video results
Conclusion and Future Work: Pose tracking and lip motion tracking. Size of the training database. Talking face with expression. Real-time generation? Fast modeling for different people.
Animation
Thank you