Create Photo-Realistic Talking Face Changbo Hu 2001.11.26 * This work was done while visiting Microsoft Research China, with Baining Guo and Bo Zhang.



Outline
- Introduction to talking faces
- Motivations
- System overview
- Techniques
- Conclusions

Introduction
What is a talking face?
- Face (lip) animation driven by voice
- Applications
The process of building a talking face:
- Face model
- Motion capture
- Mapping between audio and video
- Rendering: photo-realistic?

Literature
- Waters, 93, DECface, 2D wireframe model
- Terzopoulos, 95, skin and muscle model
- Bregler, 97, Video Rewrite, sample-image based
- T. S. Huang, 98, mesh model from range data
- Poggio, 98, MikeTalk, viseme morphing
- Guenter, 99, Making Faces, 3D from multiple cameras
- Zhengyou Zhang, 00, 3D face modeling from video through the epipolar constraint
- Cosatto, 00, planar quads model

Some face models

Motivations
Aim: a graphics interface for a conversation agent
- Photo-realistic
- Driven by Chinese speech
- Smooth connection between sentences
Extended from "Video Rewrite"

System overview: Pipeline of the system (1)

System overview: Pipeline of the system (2)
[Diagram] New text → TTS system → wav sound → segmentation → triphone sequence; the training database supplies a synthesized triphone sequence → lip motion sequence → rewrite to faces over the background sequence

Techniques
Analysis:
- Audio processing
- Image processing
Synthesis:
- Lip images
- Background images
- Stitching together

Audio part: Sound segmentation
- Given the wav file and its script
- An HMM-based system is trained to do the segmentation
- The wav file is segmented into a phoneme sequence
Example of a segmentation result: the phoneme sequence SIL, s, i, j, ia, sh, ang, y, e, y, in, h, ang, each phoneme with its start and end frame
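The segmenter's output can be turned into triphone units with a small helper. A minimal sketch in Python, assuming the aligner emits one `phone start end` line per segment (the `Segment` class and frame-index convention are illustrative, not part of the original system):

```python
from dataclasses import dataclass

@dataclass
class Segment:
    phone: str
    start: int  # first audio frame of the phoneme
    end: int    # last audio frame of the phoneme

def parse_alignment(lines):
    """Parse 'phone start end' alignment lines into Segment objects."""
    segments = []
    for line in lines:
        phone, start, end = line.split()
        segments.append(Segment(phone, int(start), int(end)))
    return segments

def to_triphones(segments):
    """Slide a window of three consecutive phonemes to form triphone units."""
    return [(segments[i].phone, segments[i + 1].phone, segments[i + 2].phone)
            for i in range(len(segments) - 2)]
```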

Annotation with phonemes
- Phonemes are used to annotate video frames
- Each phoneme in a sentence corresponds to a short stretch of the video sequence
[Diagram] A training sentence is split in parallel into video frames, audio frames, and the phoneme sequence; the frames for Phoneme 1, Phoneme 2, … are aligned across all three tracks
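The audio-to-video annotation above is a frame-rate conversion. A minimal sketch, assuming 100 audio frames per second and 25 video frames per second (both rates are illustrative assumptions, not stated on the slide):

```python
def audio_to_video_frames(start, end, audio_rate=100.0, video_fps=25.0):
    """Map a phoneme's audio-frame interval to the corresponding video-frame interval."""
    scale = video_fps / audio_rate
    return int(start * scale), int(end * scale)
```

For example, under these assumed rates a phoneme spanning audio frames 43 to 61 lands on video frames 10 to 15.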

Phoneme distance analysis
- Phoneme & triphone basics
- Chinese phonemes vs. English phonemes
- Distance metric definitions
- Results

Phoneme basics
Phonemes are the basic elements of speech; all possible speech can be represented by combinations of phonemes. CH, JH, S, EH, EY, OY, AE, SIL…
Triphones are three consecutive phonemes; a triphone represents not only pronunciation characteristics but also context information. T-IY-P, IY-P-AA, P-AA-T…

Chinese phonemes vs. English
- Chinese phonemes fall into two basic groups: initials and finals
- Initials: B, P, M, F, …
- Finals: a3, o1, e2, eng3, iang4, ue5, …
- Each Chinese final carries one of five tones: 1, 2, 3, 4, 5 (e.g. a1, a2, a3, a4, a5)
- Chinese finals are actually not basic elements of speech; compound finals such as iang1, iao1, uang1, iong1 combine simpler sounds
- The Chinese phoneme set is therefore much larger than the English one

Phoneme distance analysis
- Define the distance between any two phonemes
- Since we synthesize only video, not sound, tone is ignored
- Lip-shape motion is the core element of the distance metric
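Ignoring tone amounts to stripping the trailing tone digit from a final before any distance lookup. A minimal sketch (the function name is mine, not the original system's):

```python
def strip_tone(phone):
    """Drop the trailing tone digit of a Chinese final: 'iang4' -> 'iang'.
    Initials and untoned symbols pass through unchanged."""
    return phone[:-1] if phone and phone[-1].isdigit() else phone
```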

Phoneme distance analysis
[Diagram] For each phoneme, collect all of its video clips (Video 1, Video 2, …), time-align them to a uniform length, and average them into one average video per phoneme. By comparing the aligned average videos of two phonemes, we generate the distance matrix for the whole phoneme set.
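The align-average-compare procedure can be sketched as follows, treating each video clip as a list of frames and each frame as a list of lip-feature values. Nearest-neighbor time alignment and a per-frame Euclidean distance are simplifying assumptions; the slide does not specify either choice:

```python
def time_align(clip, length):
    """Resample a clip (list of frames) to a uniform length (nearest neighbor)."""
    n = len(clip)
    return [clip[min(n - 1, int(i * n / length))] for i in range(length)]

def average_clips(clips, length):
    """Time-align all clips of one phoneme and average them frame by frame."""
    aligned = [time_align(c, length) for c in clips]
    dims = len(aligned[0][0])
    return [[sum(c[t][k] for c in aligned) / len(aligned) for k in range(dims)]
            for t in range(length)]

def phoneme_distance(avg_a, avg_b):
    """Sum of per-frame Euclidean distances between two aligned average clips."""
    return sum(sum((x - y) ** 2 for x, y in zip(fa, fb)) ** 0.5
               for fa, fb in zip(avg_a, avg_b))
```

Running `phoneme_distance` over every pair of averaged clips fills the distance matrix.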

Image part: Pose tracking
- Assume a planar model for the face
- A standard minimization method finds the transform matrix (an affine transform) [Black, 95]
- A mask constrains the regions of interest on the face
[Figure] Template picture and mask image
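Three point correspondences determine the six affine parameters exactly. A minimal sketch using Cramer's rule; the original system instead estimates the transform by iterative minimization over the masked face region [Black, 95], which this does not reproduce:

```python
def det3(m):
    """Determinant of a 3x3 matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def affine_from_points(src, dst):
    """2D affine transform [[a, b, tx], [c, d, ty]] mapping three src points to dst."""
    A = [[src[i][0], src[i][1], 1.0] for i in range(3)]
    dA = det3(A)
    params = []
    for coord in (0, 1):  # solve for the x row, then the y row
        row = []
        for col in range(3):
            M = [r[:] for r in A]
            for i in range(3):
                M[i][col] = dst[i][coord]
            row.append(det3(M) / dA)
        params.append(row)
    return params

def apply_affine(p, pt):
    """Apply the affine parameters to a point."""
    x, y = pt
    return (p[0][0] * x + p[0][1] * y + p[0][2],
            p[1][0] * x + p[1][1] * y + p[1][2])
```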

Pose tracking
Motion prediction using parameters with physical meaning

Pose tracking
Some tracking results:

Lip motion tracking
- Using eigen-points (Covell, 91)
- Feature points include the jaw, lips, and teeth
- The training database is specified manually
- Automatic tracking through all pose-tracked images
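Eigen-points couples image appearance with control-point positions in a single linear model learned from the hand-labeled training set. As a much-simplified stand-in (not Covell's actual formulation), a least-squares map from appearance vectors to point coordinates illustrates the appearance-to-points idea:

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_linear_map(appearances, points):
    """Least-squares linear map from appearance vectors to point coordinates,
    via the normal equations (X^T X) w = X^T y, one weight vector per coordinate."""
    d = len(appearances[0])
    XtX = [[sum(a[i] * a[j] for a in appearances) for j in range(d)]
           for i in range(d)]
    maps = []
    for k in range(len(points[0])):
        Xty = [sum(a[i] * p[k] for a, p in zip(appearances, points))
               for i in range(d)]
        maps.append(solve(XtX, Xty))
    return maps

def predict_points(maps, appearance):
    """Predict point coordinates for a new appearance vector."""
    return [sum(w[i] * appearance[i] for i in range(len(w))) for w in maps]
```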

Lip motion tracking

Lip motion tracking
[Figure] Training database (hand-labeled); automatic tracking results

Synthesizing new sentences
- New text is converted to a wav by the TTS system
- The wav is segmented into a phoneme sequence
- Dynamic programming finds an optimal video sequence from the training database
- The triphone videos are time-aligned and stitched together
- The lip sequence is transformed and pasted onto background faces

Lip sequence synthesis
[Diagram] The new phoneme sequence is covered by an optimal sequence of overlapping triphone clips (Triphone 1, Triphone 2, …) selected from the database

Dynamic programming
[Diagram] A graph from Begin to End whose layers hold the candidate triphone clips (Triphone 1 … Triphone 5); the optimal path selects one candidate per position

Edge cost definition
Two parts:
1. Phoneme distance: the three phonemes' distances added together
2. Lip-shape distance over the overlapping portion of the triphone videos
The two parts are combined as a weighted sum
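The triphone selection is a layered shortest-path problem solved by dynamic programming. A minimal sketch, where `edge_cost` stands in for the weighted sum of phoneme distance and overlap lip-shape distance described above (the toy cost used in the check is illustrative):

```python
def best_path(layers, edge_cost):
    """Pick one candidate per layer minimizing summed edge costs (Viterbi-style).
    layers: list of layers, each a list of candidate clip ids.
    edge_cost(prev_clip, next_clip): transition cost between adjacent clips."""
    prev = {c: 0.0 for c in layers[0]}   # best cost of a path ending at c
    back = [{} for _ in layers]          # back-pointers per layer
    for t in range(1, len(layers)):
        cur = {}
        for c in layers[t]:
            best = min(prev, key=lambda p: prev[p] + edge_cost(p, c))
            cur[c] = prev[best] + edge_cost(best, c)
            back[t][c] = best
        prev = cur
    end = min(prev, key=prev.get)
    path = [end]
    for t in range(len(layers) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, prev[end]
```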

Background video generation
- The background is a video sequence in which the virtual character speaks something else
- A similarity measurement over background frames
- Select a "standard frame": the frame with the maximal number of frames similar to it
- Filter out frames with jerkiness
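Choosing the "standard frame" can be sketched as picking the frame with the most neighbors within a similarity threshold (the frame representation, distance function, and threshold are assumptions, not given on the slide):

```python
def standard_frame(frames, dist, threshold):
    """Index of the frame with the maximal number of other frames similar to it."""
    def support(i):
        return sum(1 for j in range(len(frames))
                   if j != i and dist(frames[i], frames[j]) <= threshold)
    return max(range(len(frames)), key=support)
```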

Stitch the time-aligned result onto background faces
- Write back with a mask
- Transform the synthesized lips onto the background face
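The masked write-back is alpha compositing of the transformed lip region over the background frame. A minimal sketch on grayscale images stored as 2D lists, with the mask holding blend weights in [0, 1] (the image representation is an assumption):

```python
def write_back(background, lip, mask):
    """Blend lip pixels into the background: mask 1 keeps the lip, 0 the background."""
    out = [row[:] for row in background]
    for y in range(len(lip)):
        for x in range(len(lip[0])):
            m = mask[y][x]
            out[y][x] = m * lip[y][x] + (1.0 - m) * background[y][x]
    return out
```

A soft (feathered) mask hides the seam between the pasted lips and the background face.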

[Figure] Mask image for the write-back operation; original background frame; write-back result of the same frame

More video results

Conclusion and future work
- Pose tracking and lip-motion tracking
- Size of the training database
- Talking faces with expression
- Real-time generation?
- Fast modeling for different people

Animation

Thank you