
1 Learning and Vision for Multimodal Conversational Interfaces Trevor Darrell Vision Interface Group MIT CSAIL Lab

2 Natural Interfaces Conversation would improve many interactions, but current conversational interfaces break down in most situations with more than one user or with references to the real world. Visual context is missing…

3 Visual Context for Conversation
-Who is there? (presence, identity)
-Which person said that? (audiovisual grouping)
-Where are they? (location)
-What are they looking / pointing at? (pose, gaze)
-What are they doing? (activity)

4 Learning Visual conversational context cues are hard to model analytically, so learning methods are appropriate. Different techniques suit different cues, levels of representation, input modes, … (at least for now…)

5 Today
Speaker segregation using audio-visual mutual information
-discard background sounds
-separate multiple conversational streams
Head pose detection and tracking with multi-view appearance models
-attention
-agreement
Articulated pose tracking by learning model constraints, or example-based inference…
-gesture
-“body language”

7 “blah blah blah blah … computer, show me the NIPS presentation” Is that you talking?

8 Audio-visual synchrony Can we find a relationship between audio and visual events (e.g., speech)?

9 Audio-visual synchrony Can we find a relationship between audio and visual events (e.g., speech)? Model-free?

10 Audio-visual synchrony Yes, by learning a model of audio-visual synchrony. Three approaches:
-Pixel-wise correlation with video [Hershey and Movellan]
-Correlation of optimal projection [Slaney and Covell]
-Non-parametric mutual information analysis on optimal projection [Fisher et al.]

11 Audio-based image localization E.g., locate visual sources given audio information. [Video: original sequence]

12 Audio-based image localization Image variance (ignoring audio) will find all motion in the sequence. [Video: image variance]

13 Audio-based image localization Estimate mutual information between audio and video. [Video: pixels with high mutual information w.r.t. the audio track]

14 Pixel-wise correlation: AudioVision, Hershey and Movellan (NIPS 1999)
[Figure: audio A(t) and video V(x,y,t) over a time window yield a pixel-wise statistic I(x,y)]
-Assumes jointly Gaussian audio and video
-Recursively estimates statistics over a window of time (~0.5 sec)
-Calculates pixel-wise mutual information / correlation (m = n = 1)
-Determines the speaker by finding the “centroid” of high-MI pixels
-A threshold and Gaussian influence function reduce the contribution of spuriously high MI values away from the centroid (shown as a + in the video)
-Video obtained from http://mplab.ucsd.edu/~jhershey/
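
Under the jointly Gaussian assumption, the per-pixel mutual information reduces to a function of the correlation coefficient between the audio track and each pixel over a time window. Here is a minimal sketch of that computation, not the authors' code; the per-frame audio-energy feature and all names are assumptions:

```python
import numpy as np

def pixelwise_audio_mi(video, audio, eps=1e-8):
    """video: (T, H, W) grayscale frames; audio: (T,) per-frame audio energy."""
    v = video - video.mean(axis=0)
    a = audio - audio.mean()
    cov = np.tensordot(a, v, axes=(0, 0)) / len(a)                    # (H, W)
    rho = cov / (np.sqrt((v ** 2).mean(axis=0) * (a ** 2).mean()) + eps)
    rho = np.clip(rho, -0.999, 0.999)
    return -0.5 * np.log(1.0 - rho ** 2)          # per-pixel MI in nats

# The speaker location can then be taken as the centroid of the high-MI pixels.
```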

15 Canonical correlation projection: FaceSync, Slaney and Covell (NIPS 2001)
-Uses canonical correlation to find the best projections of audio and video: define a(t) = α·A(t) (projection of audio) and v(t) = β·V(t) (projection of video), and find the α, β that maximize corr(a, v)
-Uses a face detector to locate and align faces in video
-Training finds α and β; testing evaluates the correlation between a(t) and v(t) for new audio and video data
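
For concreteness, a minimal canonical-correlation sketch in the spirit of FaceSync, not the authors' implementation; the regularization, the whitening route, and all names are assumptions:

```python
import numpy as np

def cca_first_pair(A, V, reg=1e-6):
    """A: (T, da) audio features; V: (T, dv) flattened, aligned face pixels."""
    A = A - A.mean(axis=0)
    V = V - V.mean(axis=0)
    Saa = A.T @ A / len(A) + reg * np.eye(A.shape[1])
    Svv = V.T @ V / len(V) + reg * np.eye(V.shape[1])
    Sav = A.T @ V / len(A)
    # Whiten each side, then take the top singular pair of the cross-covariance.
    Wa = np.linalg.inv(np.linalg.cholesky(Saa)).T
    Wv = np.linalg.inv(np.linalg.cholesky(Svv)).T
    U, s, Vt = np.linalg.svd(Wa.T @ Sav @ Wv)
    alpha, beta = Wa @ U[:, 0], Wv @ Vt[0]
    return alpha, beta, s[0]        # s[0] is the canonical correlation
```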

16 Non-parametric Mutual Information
-Match audio to video using an adaptive feature basis
-Exploit joint statistics of the image and audio signals
-Efficient non-parametric density estimation

17 Maximally Informative Subspace
-Treat each image/audio frame in the sequence as a sample of a random variable
-Projections optimize the joint audio/video statistics in the lower-dimensional feature space
-Approximate the joint density with a Parzen window non-parametric model
-The gradient of the approximate entropy can be computed efficiently [Fisher 97]
-Current work uses a single projection; extending to multidimensional projections…
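
A hedged sketch of a nonparametric MI estimate on the 1-D projections; this mirrors the Parzen-window idea but is not the paper's estimator, and scipy's default bandwidth is an assumption:

```python
import numpy as np
from scipy.stats import gaussian_kde

def parzen_mi(a, v):
    """a, v: (N,) samples of the projected audio and video features.
    Estimates I(a; v) = H(a) + H(v) - H(a, v) by resubstitution averages
    of Parzen-window (KDE) log densities."""
    joint = np.vstack([a, v])
    log_pav = gaussian_kde(joint).logpdf(joint)
    log_pa = gaussian_kde(a).logpdf(a)
    log_pv = gaussian_kde(v).logpdf(v)
    return np.mean(log_pav - log_pa - log_pv)     # MI estimate in nats
```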

18 Audio-visual synchrony detection Compute the similarity matrix for 8 subjects: no errors, and no training! [Matrix figure; example MI values: 0.68, 0.61, 0.19, 0.20] Can also be used for audio/visual temporal alignment [Fisher and Darrell, ECCV 2002]

19 Today
Speaker segregation using audio-visual mutual information
-discard background sounds
-separate multiple conversational streams
Head pose detection and tracking with multi-view appearance models
-attention
-agreement
Articulated pose tracking by learning model constraints, or example-based inference…
-gesture
-“body language”

20 Head pose tracking

21 Lots of Work on Face Pose Tracking…
-Cylindrical approx. [LaCascia & Sclaroff]
-3D mesh approx. [Essa]
-3D morphable model [Blanz & Vetter]
-Multi-view keyframes from 3D model [Vacchetti et al.]
-View-based eigenspaces [Srinivasan & Boyer] [Pentland et al.]
-…

22 Pose Estimation [Diagram: 3D pose estimation from a model, via ICP, optic flow, feature alignment, …]

23 User-Dependent Keyframes [Diagram: 3D pose estimation using user-dependent keyframes; where do they come from?]

24 User-Independent Prior Model [Diagram: 3D pose estimation using a prior model obtained by multi-view reconstruction]

25 3D View-based Eigenspaces [Diagram: 3D view-based eigenspaces feed the multi-view reconstruction used for 3D pose estimation]

26 View-based Eigenspaces [Diagram: PCA applied to the images of each view]

27 3D View-based Eigenspaces [Diagram: the corresponding depth images for each view are unknown]

28 Transfer weights to depth images: the SVD decomposition of the intensity image yields eigenvector weights; applying the same weights to the depth eigenvectors reconstructs the corresponding depth image.
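
A minimal sketch of the weight-transfer step in my notation, not the paper's code; it assumes orthonormal per-view eigenbases:

```python
import numpy as np

def reconstruct_depth(I, E_I, mu_I, E_Z, mu_Z):
    """I: flattened intensity subwindow; E_I, E_Z: (d, k) intensity/depth
    eigenvectors for one view; mu_I, mu_Z: the corresponding mean images."""
    w = E_I.T @ (I - mu_I)    # least-squares weights (E_I orthonormal)
    return mu_Z + E_Z @ w     # depth reconstructed with the transferred weights
```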

29 Reconstruction 1. For each subwindow {I_t, Z_t} and view i: minimize the reconstruction error to obtain the least-squares optimal eigenvector weights

30 Reconstruction 1. For each subwindow {I_t, Z_t} and view i: minimize the reconstruction error … and compute the normalized cross-correlation c_i. 2. Select the view i and the subwindow {I_t, Z_t} that maximize c_i

31 Reconstruction 3. Reconstruct all views [Figure: input subwindow vs. ground-truth subwindow reconstructions]

32 Pose Estimation View registration [ICPR 2002]:
1. Search the new frame for the best subwindow using correlation
2. Select the k best keyframes
3. Compute rigid motion using ICP + normal flow

33 Pose Estimation Kalman filter framework [CVPR 2003]. [Equation: observation model]
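
For orientation, a generic constant-velocity Kalman filter sketch for smoothing 6-DOF pose estimates; the paper's [CVPR 2003] observation model is not reproduced here, and the state layout is an assumption:

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One predict/update cycle: x state, P covariance, z measured pose."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    S = H @ P_pred @ H.T + R                   # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)        # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Assumed state: 6 pose parameters + 6 velocities; measurements: pose only.
n = 12
F = np.eye(n); F[:6, 6:] = np.eye(6)           # constant-velocity dynamics
H = np.hstack([np.eye(6), np.zeros((6, 6))])   # observe the pose block
```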

34 Experiments
-Image sequences from stereo cameras
-Prior model: 14 subjects in 28 orientations
-Ground truth from an InertiaCube inertial sensor
-Comparison with the OSU pose estimator [Srinivasan & Boyer ’02], using the same training set for the eigenspaces

35 Results

36 Exploiting cascades for speed But the correlation search step is very slow! Using a cascade detection paradigm [Viola, Jones], many patterns can be rejected quickly:
-set the false negative rate to be very low (e.g., 1%) per stage
-each stage may have a low hit rate (30-40%), but the overall architecture is efficient and accurate
Multi-view cascade detection yields a coarse initial pose estimate
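
A toy sketch of the rejection-cascade control flow; this is illustrative only, and the stage scores and thresholds are placeholders:

```python
def cascade_detect(window, stages):
    """stages: list of (score_fn, threshold) pairs, cheapest first.
    Each stage is tuned for a very low false-negative rate, so most
    non-face subwindows exit early and the full chain stays fast."""
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False      # rejected early; most windows stop here
    return True               # survived every stage: candidate detection

# Usage with placeholder stage functions on a numeric window:
stages = [(lambda w: sum(w) / len(w), 0.1), (lambda w: max(w), 0.5)]
print(cascade_detect([0.2, 0.6, 0.4], stages))
```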

37 Pose-aware interfaces Interface agent responds to the gaze of the user:
-the agent should know when it is being attended to
-turn-taking pragmatics
-eventually, anaphora + object reference
Prototype:
-smart-room interface “SAM”
-early experiments with the face tracker on a meeting room table…

38 [Screenshot: subject not looking at SAM, per the pose tracker; ASR turned off]

39 [Screenshot: subject looking at SAM, per the pose tracker; ASR turned on]

40 Head nod detection
-Track the 6-DOF motion of head nod and shake gestures
-Experiment with a simple motion energy ratio test
-Initial results are promising
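
One plausible reading of a “motion energy ratio test”; the exact statistic is not given on the slide, so the angle inputs and thresholds below are assumptions:

```python
import numpy as np

def nod_or_shake(pitch, yaw, ratio_thresh=2.0, energy_thresh=1e-4):
    """pitch, yaw: (T,) head rotation angles over a short window (radians)."""
    e_pitch = np.sum(np.diff(pitch) ** 2)     # vertical motion energy
    e_yaw = np.sum(np.diff(yaw) ** 2)         # horizontal motion energy
    if e_pitch + e_yaw < energy_thresh:
        return "still"
    if e_pitch > ratio_thresh * e_yaw:
        return "nod"                          # mostly up-down motion
    if e_yaw > ratio_thresh * e_pitch:
        return "shake"                        # mostly side-to-side motion
    return "other"
```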

41 Today
Speaker segregation using audio-visual mutual information
-discard background sounds
-separate multiple conversational streams
Head pose detection and tracking with multi-view appearance models
-attention
-agreement
Articulated pose tracking by learning model constraints, or example-based inference…
-gesture
-“body language”

42 Articulated pose sensing

43 Learning Articulated Tracking
-A model-based approach works for 3-D data and pure articulation constraints…
-Joint limits and other behavioral constraints need to be learned (and combined with a classic model-based tracker)
-Without direct 3-D data, example-based techniques are most promising…

44 Model-based Approach [Figure: depth image]

45 Model-based Approach [Figure: depth image fit by ICP with an articulation constraint model]

46 Model-based Approach ICP with an articulation constraint model: 1. Find closest points 2. Update poses 3. Constrain…

47 ICP with articulated motion constraint Minimize the distance between the 3-D data and the 3-D articulated model:
-Apply ICP to each object in the articulated model to find a motion (twist) ξ_k with covariance Σ_k for each limb
-Enforce joint constraints: find a set of motions ξ_k′ close to the original motions that satisfy the joint constraints
Pure articulation can be expressed as a linear projection on the stacked rigid motions
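
The linear-projection view suggests a generalized least-squares projection of the stacked per-limb twists onto an articulated-motion subspace. A sketch in my notation; the basis A and the block-diagonal covariance are assumptions, not the paper's exact formulation:

```python
import numpy as np

def project_twists(xi, Sigma, A):
    """xi: (6k,) stacked per-limb twists; Sigma: (6k, 6k) block-diagonal
    covariance of the ICP estimates; A: (6k, m) basis whose columns span
    the subspace of motions consistent with pure articulation."""
    W = np.linalg.inv(Sigma)
    # Covariance-weighted (Mahalanobis) least squares: coefficients of the
    # closest articulated motion to the unconstrained per-limb estimates.
    theta = np.linalg.solve(A.T @ W @ A, A.T @ W @ xi)
    return A @ theta          # constrained twists xi' satisfying the joints
```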

48 Non-linear constraints Limitations of pure articulation constraints:
-cannot capture the limits on the range of motion of human joints
-cannot capture behavioral limits of body pose
Learning approach: learn a discriminative model of valid / invalid pose, and train an SVM for use as a Lagrangian constraint:
-valid body poses extracted from mocap data (150,000 poses)
-invalid body poses generated randomly
-cross-validation classification error rates around 0.061%
[Figure: support vectors]
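
A hedged sketch of such a valid/invalid pose classifier; toy data stands in for the mocap poses, and the kernel and parameters are assumptions:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_valid = rng.normal(0.0, 0.3, size=(1000, 20))         # stand-in for mocap poses
X_invalid = rng.uniform(-np.pi, np.pi, size=(1000, 20)) # random joint angles

X = np.vstack([X_valid, X_invalid])
y = np.hstack([np.ones(len(X_valid)), -np.ones(len(X_invalid))])
clf = SVC(kernel="rbf", C=10.0, gamma="scale").fit(X, y)

# During tracking, the signed decision value can act as a soft validity
# constraint on candidate poses (> 0 suggests a plausible pose).
pose = rng.normal(0.0, 0.3, size=(1, 20))
score = clf.decision_function(pose)[0]
```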

49 Video

50 Multimodal gestures

51 Learning pose without 3-D observations A model-based approach is difficult with more impoverished observations, e.g., contour or edge features. Example-based learning approach:
-generate a corpus of training data with a model (Poser)
-find nearest neighbors using fast hashing techniques (LSH)
-optionally use local regression on the nearest neighbors
With segmented contours:
-shape context features
-bipartite graph matching via Earth Mover's Distance
With unsegmented edge features:
-feature selection using a paired classification problem
-extend LSH to “Parameter-Sensitive Hashing”
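
For background, a minimal random-hyperplane LSH index; this is plain LSH, illustrative only, not the parameter-sensitive variant described on the next slide:

```python
import numpy as np
from collections import defaultdict

class LSHIndex:
    """Buckets vectors by the sign pattern of random projections, so
    near neighbors tend to share a bucket and can be found in O(1)."""
    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, dim))
        self.buckets = defaultdict(list)

    def _key(self, x):
        return tuple((self.planes @ x > 0).astype(int))

    def add(self, x, label):
        self.buckets[self._key(x)].append((x, label))

    def query(self, x):
        return self.buckets.get(self._key(x), [])  # candidate neighbors
```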

52 Parameter sensitive hashing
-When an explicit feature (shape context) is not available, feature selection is needed
-Features for an optimal distance can be found by training a classifier on an equivalence task
-LSH + classifier-based feature selection = PSH, i.e., hashing functions sensitive to distance in parameter space, not feature space: “Parameter-Sensitive Hashing” [Shakhnarovich et al.]

53 Parameter sensitive hashing (Details tomorrow…!)

54 Saturday Workshop

55 Schedule
-5:30pm-5:50pm, Talk: Fast Example-based Estimation with Parameter-Sensitive Hashing, Greg Shakhnarovich
-10:30am, Poster: Contour Matching Using Approximate Earth Mover's Distance, Kristen Grauman

56 Today Learning methods are critical for robust estimation of synchrony, pose, and other conversational context cues:
-Speaker segregation using audiovisual mutual information
-Head pose estimation using multi-view manifolds and detection cascade trees
-Real-time articulated tracking from stereo data with SVM-based joint constraints
-Monocular tracking using example-based inference with fast nearest-neighbor methods

57 Acknowledgements Greg Shakhnarovich, Kristen Grauman, Neal Checka, David Demirdjian, Theresa Ko, John Fisher, Louis-Philippe Morency, Mike Siracusa, …

