Video Rewrite: Driving Visual Speech with Audio. Christoph Bregler, Michele Covell, Malcolm Slaney. Presenter: Jack Jeryes, 3/3/2008.

What is Video Rewrite? It uses existing footage to create new video of a person mouthing words that they did not speak in the original footage.

Example:

Why Video Rewrite? Movie dubbing (to sync the actors' lip motions to the new soundtrack), teleconferencing, and special effects.

Approach: Learn from example footage how a person's face changes during speech (dynamics and idiosyncrasies).

Stages: Video Rewrite has two stages: an analysis stage and a synthesis stage.

Analysis stage: use the audio track to segment the video into triphones. Vision techniques find the head orientation and the mouth and chin shape and position in each image.

Synthesis stage: segments the new audio and uses it to select triphones from the video model. Based on labels from the analysis stage, the new mouth images are morphed into a new background face.

Analysis for video modeling: the analysis stage creates an annotated database of example video clips (the video model), derived from unconstrained footage. Annotation is done using image analysis and audio analysis.

Annotation Using Image Analysis: As the face moves within the frame, we need to know the mouth position and the lip shapes at all times. Eigenpoints are used (they work well at low resolution).

Eigenpoints: A small set of hand-labeled facial images is used to train subspace models. Given a new image, the eigenpoint models tell us the positions of points on the lips and jaw.

Eigenpoints (cont.): 54 eigenpoints per image: 34 on the mouth and 20 on the chin and jaw line. Only 26 images were hand labeled (26 of 14,218, about 0.2%); the hand-annotated dataset was extended by morphing pairs of labeled images to form intermediate images.
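A minimal sketch of the eigenpoints idea, assuming a simple coupled PCA over normalized mouth-region pixels and the labeled point coordinates (the actual eigenpoints method uses more careful coupled subspace estimation; all function names here are hypothetical):

```python
import numpy as np

def train_eigenpoints(images, points, n_components=10):
    """Fit a coupled PCA subspace over vectorized mouth-region pixels and the
    hand-labeled (x, y) control points of the same training images.
    images: (N, P) array of normalized pixel vectors
    points: (N, 2K) array of K labeled points per image, flattened as x1,y1,x2,y2,...
    """
    X = np.hstack([images, points])            # couple appearance and geometry
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]             # (P+2K,), (n_components, P+2K)

def estimate_points(image, mean, basis, n_pixels):
    """Given a new pixel vector, infer the control-point positions by solving for
    subspace coefficients from the image part and reading off the point part."""
    img_mean, pts_mean = mean[:n_pixels], mean[n_pixels:]
    img_basis, pts_basis = basis[:, :n_pixels], basis[:, n_pixels:]
    coeffs, *_ = np.linalg.lstsq(img_basis.T, image - img_mean, rcond=None)
    return pts_mean + coeffs @ pts_basis        # flattened (x, y) estimates
```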

Eigenpoints (cont.): The eigenpoint models do not allow a wide variety of head motions, so each face image is warped into a standard reference plane prior to eigenpoint labeling. An affine transform is used that minimizes the mean-squared error between a large portion of the face image and a facial template.

Mask to estimate the global warp: Each image is warped to account for changes in the head's position, size, and rotation. The transform minimizes the difference between the transformed image and the face template. The mask (left) forces the minimization to consider only the upper face (right).

Global mapping (cont.): Once the best global mapping is found, it is inverted and applied to the image, putting that face into the standard coordinate frame. We then perform eigenpoints analysis on this pre-warped image to find the fiduciary points. Finally, we back-project the fiduciary points through the global warp to place them on the original face image.
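A sketch of that warp-and-back-project loop, using OpenCV's ECC alignment (cv2.findTransformECC, OpenCV 4.x) as a stand-in for the paper's own mean-squared-error minimization against the template; the masked upper-face region drives the fit, and the function names are hypothetical:

```python
import cv2
import numpy as np

def normalize_face(face_img, template, mask):
    """Estimate the affine warp registering a grayscale face image to the facial
    template, with the intensity error evaluated only where mask (uint8) is nonzero
    (the upper face), then apply the inverse warp to reach the standard frame."""
    warp = np.eye(2, 3, dtype=np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 100, 1e-6)
    _, warp = cv2.findTransformECC(template, face_img, warp,
                                   cv2.MOTION_AFFINE, criteria, mask, 5)
    h, w = template.shape
    # WARP_INVERSE_MAP applies the inverted mapping: original face -> standard frame
    normalized = cv2.warpAffine(face_img, warp, (w, h),
                                flags=cv2.INTER_LINEAR + cv2.WARP_INVERSE_MAP)
    return normalized, warp

def back_project(points_std, warp):
    """Map fiduciary points found in the standard frame back onto the original image."""
    pts = np.asarray(points_std, dtype=np.float32).reshape(-1, 1, 2)
    return cv2.transform(pts, warp).reshape(-1, 2)
```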

Annotation Using Audio Analysis: All the speech is segmented into sequences of phonemes. Because of coarticulation, the /T/ in “beet” looks different from the /T/ in “boot,” so phoneme context must be considered.

Annotation Using Audio Analysis (cont.): Use triphones, collections of three sequential phonemes. “Teapot” is split into /SIL-T-IY/, /T-IY-P/, /IY-P-AA/, /P-AA-T/, and /AA-T-SIL/.
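A small sketch of that segmentation, assuming a phoneme list is already available from the audio aligner (the helper name is hypothetical):

```python
def to_triphones(phonemes, silence="SIL"):
    """Split a phoneme sequence into overlapping triphones, padded with silence."""
    padded = [silence] + list(phonemes) + [silence]
    return ["-".join(padded[i:i + 3]) for i in range(len(padded) - 2)]

# to_triphones(["T", "IY", "P", "AA", "T"])
# -> ['SIL-T-IY', 'T-IY-P', 'IY-P-AA', 'P-AA-T', 'AA-T-SIL']
```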

Annotation Using Audio Analysis (cont.): When synthesizing a video, emphasize the middle of each triphone and cross-fade the overlapping regions of neighboring triphones.
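A minimal sketch of such a cross-fade, assuming clips are lists of image frames (NumPy arrays) and using a simple linear ramp; the paper's actual weighting emphasizes the central phoneme more carefully, and the names here are hypothetical:

```python
import numpy as np

def crossfade(clip_a, clip_b, overlap):
    """Blend the last `overlap` frames of clip_a with the first `overlap` frames of
    clip_b using a linear ramp, so each clip contributes most near its own center."""
    ramp = np.linspace(1.0, 0.0, overlap)
    blended = [w * fa + (1.0 - w) * fb
               for w, fa, fb in zip(ramp, clip_a[-overlap:], clip_b[:overlap])]
    return list(clip_a[:-overlap]) + blended + list(clip_b[overlap:])
```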

Synthesis using a video model: the synthesis stage segments the new audio and uses it to select triphones from the video model. Based on labels from the analysis stage, the new mouth images are morphed into a new background face.

Synthesis using a video model (cont.): The background, head tilts, and eye blinks are taken from the source footage in the same order as they were shot. The triphone images include the mouth, chin, and part of the cheeks; illumination-matching techniques are used to avoid visible seams.

Selection of Triphone Videos: choose a sequence of clips that approximates the desired phoneme transitions while preserving lip-shape continuity.

Selection of Triphone Videos (cont.): Given a triphone in the new speech utterance, we compute a matching distance to each triphone in the video database. Dp = phoneme-context distance; Ds = lip-shape distance.
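The slides do not state how the two distances are combined; one plausible reading, with a hypothetical weighting factor $\alpha$, is a weighted sum:

$$ D = \alpha\, D_p + (1 - \alpha)\, D_s $$

A smaller $\alpha$ would favor lip-shape continuity over phonemic context.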

Dp = phoneme-context distance: Dp is based on categorical distances between phonemic categories and between viseme classes. Dp is a weighted sum of the viseme distance and the phonemic distance.

26 viseme classes, for example: (1) /CH/ /JH/ /SH/ /ZH/; (2) /K/ /G/ /N/ /L/ /T/ /D/; (3) /P/ /B/ /M/; ...

Dp = phoneme-context distance (cont.): distance(/P/, /P/) = 0 (same phonemic category); distance(/P/, /IY/) = 1 (different viseme classes); distance(/P/, /B/) is between 0 and 1 (same viseme class, different phonemic category).
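A sketch of that categorical distance, with a partial, illustrative viseme table and hypothetical position weights (the paper's full set of 26 classes and its exact weighting are not reproduced here):

```python
# Partial, illustrative viseme table (the paper groups phonemes into 26 classes).
VISEME_CLASS = {
    "CH": 1, "JH": 1, "SH": 1, "ZH": 1,
    "K": 2, "G": 2, "N": 2, "L": 2, "T": 2, "D": 2,
    "P": 3, "B": 3, "M": 3,
    # ... remaining classes omitted
}

def phoneme_distance(p1, p2, same_viseme_penalty=0.5):
    """Categorical distance: 0 for the same phoneme, an intermediate value for
    different phonemes in the same viseme class, 1 for different viseme classes."""
    if p1 == p2:
        return 0.0
    if p1 in VISEME_CLASS and p2 in VISEME_CLASS and VISEME_CLASS[p1] == VISEME_CLASS[p2]:
        return same_viseme_penalty
    return 1.0

def context_distance(tri_a, tri_b, weights=(0.25, 0.5, 0.25)):
    """Dp: weighted sum of phoneme distances over the three triphone positions,
    emphasizing the central phoneme (the weights here are illustrative)."""
    return sum(w * phoneme_distance(a, b) for w, a, b in zip(weights, tri_a, tri_b))
```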

Ds = lip-shape distance: Ds measures how closely the mouth contours match in overlapping segments of adjacent triphone videos. In “teapot,” the contours for /IY/ and /P/ in /T-IY-P/ should match the contours for /IY/ and /P/ in /IY-P-AA/.

Ds = lip-shape distance (cont.): the Euclidean distance, computed frame by frame, between 4-element feature vectors (overall lip width, overall lip height, inner lip height, and height of visible teeth).
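A minimal sketch of Ds over the overlapping frames of two adjacent triphone clips, assuming the 4-element feature vectors have already been measured from the eigenpoint contours (function name hypothetical):

```python
import numpy as np

def lip_shape_distance(features_a, features_b):
    """Ds: mean Euclidean distance, frame by frame, between 4-element lip feature
    vectors (overall lip width, overall lip height, inner lip height, visible-teeth
    height) over the overlapping frames of two adjacent triphone clips."""
    a = np.asarray(features_a, dtype=float)    # shape (n_frames, 4)
    b = np.asarray(features_b, dtype=float)
    n = min(len(a), len(b))                    # compare only the overlapping frames
    return float(np.linalg.norm(a[:n] - b[:n], axis=1).mean())
```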

Stitching It All Together: The remaining task is to stitch the triphone videos into the background sequence.