MikeTalk: An Adaptive Man-Machine Interface

Tony Ezzat Volker Blanz Tomaso Poggio

TTVS Overview
Input: Text
Output: Photo-realistic talking face uttering text

Desktop Agents
“You have received 1 email from Tommy Poggio.”

Customer Support
“You have bought 20 shares of SONY at $40 each.”

Advertisements
“Hi Tony, would you be interested in a ticket from Boston to New York for $50.00?”

Modules

Phoneme Corpus
Step 1: Collect a visual corpus from a subject.
The corpus contains 44 words, one word for each American English phoneme.

6 Consonantal Visemes
Step 2: Extract one image per phoneme: a viseme.
Group visemes together by visual similarity.

9 Vocalic Visemes (+ 1 Silence Viseme)

Problem 1: Need to interpolate!

Solution: Morphing!
Simultaneous interpolation of shape and texture (Beier & Neely 1992).
Problem 2: Too tedious to specify correspondence by hand across many images!

Solution: Optical Flow
(Horn & Schunck 1986) (Lucas & Kanade 1988)
To interpolate between two visemes, optical flow is first computed.
A 2D motion vector field is produced: dx(x,y), dy(x,y)
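
As a rough illustration of how a motion vector can be estimated from two images, here is a minimal sketch in the Lucas-Kanade style. It is an assumption for illustration, not the method used in MikeTalk: it estimates a single (dx, dy) for a whole window under small motion and constant brightness, whereas a dense field dx(x,y), dy(x,y) would repeat the estimate for a neighborhood around every pixel. The function name lucas_kanade_window is hypothetical.

```python
import numpy as np

def lucas_kanade_window(img_a, img_b):
    """Estimate one (dx, dy) vector that moves img_a toward img_b.

    Assumes small motion and constant brightness over the whole window.
    """
    a = img_a.astype(float)
    b = img_b.astype(float)
    iy, ix = np.gradient(a)   # spatial derivatives (axis 0 = y, axis 1 = x)
    it = b - a                # temporal derivative between the two images
    # Normal equations of the least-squares problem ix*dx + iy*dy = -it
    ata = np.array([[np.sum(ix * ix), np.sum(ix * iy)],
                    [np.sum(ix * iy), np.sum(iy * iy)]])
    atb = -np.array([np.sum(ix * it), np.sum(iy * it)])
    dx, dy = np.linalg.solve(ata, atb)
    return dx, dy
```

Calling lucas_kanade_window on a smooth blob and a copy of it shifted one pixel to the right should give dx close to 1 and dy close to 0.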

Morphing
Forward warping A to B
Forward warping B to A
Hole filling
Blending
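
The morphing steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the MikeTalk implementation: images are grayscale NumPy arrays, each flow is a (dx, dy) pair of arrays, warping rounds to the nearest pixel, and holes are filled from the nearest assigned pixel in the same row. The names forward_warp, fill_holes, and morph are hypothetical.

```python
import numpy as np

def forward_warp(img, dx, dy, alpha):
    """Push every pixel of img along alpha * (dx, dy).

    Returns the warped image and a mask of pixels that received a value.
    If several pixels land on the same target, the last one written wins.
    """
    h, w = img.shape
    out = np.zeros_like(img, dtype=float)
    filled = np.zeros((h, w), dtype=bool)
    ys, xs = np.mgrid[0:h, 0:w]
    tx = np.rint(xs + alpha * dx).astype(int)
    ty = np.rint(ys + alpha * dy).astype(int)
    ok = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)
    out[ty[ok], tx[ok]] = img[ys[ok], xs[ok]]
    filled[ty[ok], tx[ok]] = True
    return out, filled

def fill_holes(img, filled):
    """Copy the nearest assigned pixel in the same row into each hole."""
    out = img.copy()
    for y in range(out.shape[0]):
        cols = np.flatnonzero(filled[y])
        if cols.size == 0:
            continue  # nothing to copy from in this row
        for x in np.flatnonzero(~filled[y]):
            nearest = cols[np.argmin(np.abs(cols - x))]
            out[y, x] = img[y, nearest]
    return out

def morph(a, b, flow_ab, flow_ba, alpha):
    """Blend A warped toward B with B warped toward A (alpha in [0, 1])."""
    wa, fa = forward_warp(a, flow_ab[0], flow_ab[1], alpha)
    wb, fb = forward_warp(b, flow_ba[0], flow_ba[1], 1.0 - alpha)
    wa = fill_holes(wa, fa)
    wb = fill_holes(wb, fb)
    return (1.0 - alpha) * wa + alpha * wb
```

With alpha = 0 the result is image A, with alpha = 1 it is image B, and intermediate values move both shape and texture between the two.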

Synthesis Database
16 visemes total
256 optical flow vector fields total (16 × 16), from every viseme to every other viseme

Concatenation and Lip Sync
Load the correct viseme transitions
Concatenate viseme transitions
Sample the viseme transitions using audio durations
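
One way to picture the sampling step is sketched below. It assumes that each viseme-to-viseme transition is assigned a duration in seconds taken from the audio, and that video is rendered at a fixed frame rate; the slides do not specify either detail. The function sample_transitions and the viseme labels in the usage note are hypothetical.

```python
def sample_transitions(visemes, durations, fps=30.0):
    """Return one (source, target, alpha) triple per video frame.

    visemes:   sequence of viseme labels, in utterance order
    durations: seconds allotted to each transition, len(visemes) - 1 entries
    alpha:     morph position in [0, 1) along the source -> target transition
    """
    frames = []
    pairs = zip(visemes, visemes[1:])
    for (src, dst), dur in zip(pairs, durations):
        n = max(1, round(dur * fps))      # frames covering this transition
        for k in range(n):
            frames.append((src, dst, k / n))
    # Hold the final viseme for one closing frame.
    frames.append((visemes[-1], visemes[-1], 0.0))
    return frames
```

For example, sample_transitions(["sil", "k", "ae", "t"], [0.1, 0.2, 0.1], fps=10) yields five frames, two of which sample the longer "k" to "ae" transition at alpha 0.0 and 0.5.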

Examples
“1, 2, 3, 4, 5”
“you have received 10 email messages.”
“cat, dog, pig, cow, moose, horse, sheep”

Current Work
Coarticulation
Eye + head movements
Emotion
3D instead of 2D
Psychophysics

3D With Volker Blanz

The End

Coarticulation
Problem: The current method does not handle coarticulation, so speech looks overly articulated.
We can record all possible triphones/quadriphones, but this approach requires a lot of data!
The best method is to learn a model for coarticulation, but what is the representation for the lips?

Principal Components Analysis
Each image is a vector in a high-dimensional space.
Using PCA, find the optimal set of vectors that span the space.
Project the entire corpus onto those basis vectors.
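
The three steps above can be sketched with a singular value decomposition. This is a generic PCA sketch, assuming the corpus is a NumPy array of equally sized grayscale images; the function name pca_images and the choice of SVD are assumptions, since the slides do not say how the decomposition was computed.

```python
import numpy as np

def pca_images(images, k):
    """Project a stack of images onto its top-k principal components.

    images: array of shape (N, H, W)
    Returns the mean vector, a (k, H*W) orthonormal basis, and (N, k) coefficients.
    """
    # Each image becomes one row vector in a high-dimensional space.
    x = images.reshape(len(images), -1).astype(float)
    mean = x.mean(axis=0)
    centered = x - mean
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]
    # Project the entire corpus onto the basis vectors.
    coeffs = centered @ basis.T
    return mean, basis, coeffs
```

Plotting the two columns of coeffs against each other for the frames of one word gives the kind of trajectory the "Top 2 PCA Bases" slides refer to.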

Top 2 PCA Bases for /buut/

Top 2 PCA Bases for /get/
Problem: Too nonlinear!

Flow Component Analysis
Compute optical flow from a reference lip image to all other images in the corpus.
Compute PCA on all the flows.
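
A minimal sketch of PCA on flows follows, assuming the flows from the reference lip image have already been computed and stacked into an array of shape (N, 2, H, W) holding the dx and dy fields. The names flow_pca and flow_from_coeffs are hypothetical, and the reconstruction helper is added only to show how a small set of coefficients maps back to a full flow field.

```python
import numpy as np

def flow_pca(flows, k):
    """PCA over optical flow fields.

    flows: array of shape (N, 2, H, W), dx and dy from the reference image
           to each of the N corpus images.
    Returns the mean flow vector, a (k, 2*H*W) basis, and (N, k) coefficients.
    """
    x = flows.reshape(flows.shape[0], -1).astype(float)
    mean = x.mean(axis=0)
    centered = x - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]
    coeffs = centered @ basis.T
    return mean, basis, coeffs

def flow_from_coeffs(mean, basis, c, shape):
    """Rebuild a (2, H, W) flow field from k coefficients."""
    return (mean + c @ basis).reshape(shape)
```

Each mouth image is then described by k numbers, and a flow rebuilt from those numbers can be used to warp the reference image.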

Top 2 FPCA Bases for /buut/

Top 2 FPCA Bases for /get/
Much more linear behavior!

Current Work
Now that we have parameterized the mouth, what is the model for mouth synthesis?
How is that model fit to the PCA data?
