Learning and Vision for Multimodal Conversational Interfaces Trevor Darrell Vision Interface Group MIT CSAIL Lab

Natural Interfaces Conversation would improve many interactions. Currently, conversational interfaces break down in most situations with more than one user, or with references to real-world objects. Visual context is missing…

Visual Context for Conversation Who is there? (presence, identity) Which person said that? (audiovisual grouping) Where are they? (location) What are they looking / pointing at? (pose, gaze) What are they doing? (activity)

Learning Visual conversational context cues are hard to model analytically, so learning methods are appropriate. Different techniques for different cues, levels of representation, input modes, … (at least for now…)

Today Speaker segregation using audio-visual mutual information -discard background sounds -separate multiple conversational streams Head pose detection and tracking with multi-view appearance models -attention -agreement Articulated pose tracking by learning model constraints, or example-based inference… -gesture -“body language”

blah blah blah blah… computer, show me the NIPS presentation. Is that you talking?

Audio-visual synchrony Can we find a relationship between audio and visual events (e.g., speech)? Model-free?

Audio-visual synchrony Yes, by learning a model of audio-visual synchrony. Three approaches: Pixel-wise correlation of audio with video [Hershey and Movellan] Correlation of optimal projections [Slaney and Covell] Non-parametric mutual information analysis on an optimal projection [Fisher et al.]

Audio-based Image localization E.g., locate visual sources given audio information. (Video: original sequence.)

Audio-based Image localization Image variance (ignoring audio) will find all motion in the sequence. (Video: image variance.)

Audio-based Image localization Estimate the mutual information between audio and video. (Video: pixels with high mutual information w.r.t. the audio track.)

Pixel-wise correlation AudioVision: Hershey and Movellan (NIPS 1999). Assumes jointly Gaussian audio A(t) and video V(x,y,t); recursively estimates the statistics over a window of time (~0.5 sec) and calculates the pixel-wise mutual information / correlation I(x,y) (m = n = 1). The speaker is determined by finding the “centroid” of the high-MI pixels; a threshold and Gaussian influence function reduce the contribution of spuriously high MI values away from the centroid (shown as a + in the video).
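
As a concrete illustration of the pixel-wise statistic, here is a minimal sketch (not the authors' implementation; the window length and the per-frame audio-energy feature are assumptions) that correlates each pixel's intensity with the audio energy over a sliding window. Under the joint-Gaussian assumption, MI is a monotonic function of the squared correlation, so the correlation map can stand in for the MI map.

```python
import numpy as np

def pixelwise_av_correlation(video, audio, win=15):
    """Sliding-window correlation between each pixel and the audio energy.

    video : (T, H, W) grayscale frames
    audio : (T,) per-frame audio energy (e.g., RMS over each frame's samples)
    win   : window length in frames (~0.5 s at 30 fps)

    Returns a (T, H, W) map of squared correlation. For jointly Gaussian
    signals, MI = -0.5 * log(1 - rho^2), so high rho^2 marks high-MI pixels.
    """
    T, H, W = video.shape
    out = np.zeros((T, H, W))
    for t in range(win, T):
        V = video[t - win:t].reshape(win, -1).astype(float)  # (win, H*W)
        A = audio[t - win:t].astype(float)
        Vc = V - V.mean(axis=0)                              # center over time
        Ac = A - A.mean()
        num = Vc.T @ Ac                                      # per-pixel covariance
        den = np.sqrt((Vc ** 2).sum(axis=0) * (Ac ** 2).sum()) + 1e-8
        out[t] = ((num / den) ** 2).reshape(H, W)
    return out
```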

Canonical correlation projection FaceSync: Slaney and Covell (NIPS 2001). Uses canonical correlation to find the best projections of the audio and the video: define a projection of the audio and a projection of the video, and find the pair that maximizes their correlation. Uses a face detector to locate and align faces in the video. The training step finds the two projection vectors; testing evaluates the correlation between the projected audio and the projected video for new audio and video data.
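
For comparison, a minimal sketch of the canonical-correlation step using scikit-learn's CCA; the random arrays below stand in for real audio features and face-aligned image features, and are not FaceSync's actual representation.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Training: paired audio features and video features, one row per frame.
rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(500, 20))   # placeholder audio features
video_feats = rng.normal(size=(500, 50))   # placeholder face-patch features

cca = CCA(n_components=1)
cca.fit(audio_feats, video_feats)          # learns the two projection vectors

# Testing: project paired data and measure the correlation of the
# projections; with real synchronized data, a high value indicates
# that the audio and the face are in sync.
a_proj, v_proj = cca.transform(audio_feats, video_feats)
rho = np.corrcoef(a_proj[:, 0], v_proj[:, 0])[0, 1]
print(f"canonical correlation on this data: {rho:.3f}")
```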

Non-parametric Mutual Information Match audio to video using adaptive feature basis Exploit joint statistics of image and audio signal Efficient non-parametric density estimation

Maximally Informative Subspace Treat each image/audio frame in the sequence as a sample of a random variable. Projections optimize the joint audio/video statistics in the lower-dimensional feature space. Approximate the joint density with a Parzen-window non-parametric model; the gradient of the approximate entropy can be computed efficiently [Fisher 97]. Current work uses a single projection; extending to multidimensional projections…
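
A sketch of the Parzen-window MI evaluation for already-projected 1-D audio and video signals (the method itself also optimizes the projections by gradient ascent on this quantity; that optimization step is omitted here):

```python
import numpy as np
from scipy.stats import gaussian_kde

def parzen_mi(a, v, n_eval=500, seed=0):
    """Parzen-window estimate of MI between two projected 1-D signals.

    MI(a, v) ~ mean over samples of [log p(a, v) - log p(a) - log p(v)],
    with each density approximated by a Gaussian-kernel Parzen estimate.
    """
    joint = gaussian_kde(np.vstack([a, v]))    # 2-D joint density
    pa, pv = gaussian_kde(a), gaussian_kde(v)  # marginal densities
    idx = np.random.default_rng(seed).choice(
        len(a), size=min(n_eval, len(a)), replace=False)
    log_joint = joint.logpdf(np.vstack([a[idx], v[idx]]))
    return float(np.mean(log_joint - pa.logpdf(a[idx]) - pv.logpdf(v[idx])))
```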

Audio-visual synchrony detection Compute the MI-based similarity matrix for 8 subjects: no errors, and no training! Also usable for audio/visual temporal alignment [Fisher and Darrell, ECCV 2002].

Today Speaker segregation using audio-visual mutual information -discard background sounds -separate multiple conversational streams Head pose detection and tracking with multi-view appearance models -attention -agreement Articulated pose tracking by learning model constraints, or example-based inference… -gesture -“body language”

Head pose tracking

Lots of Work on Face Pose Tracking… Cylindrical approx. [LaCascia & Sclaroff] 3D Mesh approx. [Essa] 3D Morphable model [Blanz & Vetter] Multi-view keyframes from 3D model [Vachetti et al.] View-based eigenspaces [Srinivasan & Boyer] [Pentland et al.] …

Pose Estimation (Diagram: 3D pose estimation from a model via ICP, optic flow, feature alignment, …)

User-Dependent Keyframes? (Diagram: 3D pose estimation driven by user-dependent keyframes.)

User-Independent Prior Model (Diagram: 3D pose estimation using a prior model built by multi-view reconstruction.)

3D View-based Eigenspaces (Diagram: 3D view-based eigenspaces serve as the multi-view reconstruction prior for 3D pose estimation.)

View-based Eigenspaces (Figure: PCA applied to face images from each view.)

3D View-based Eigenspaces (Figure: intensity eigenspaces for each view; the corresponding depth eigenspaces are unknown.)

Transfer weights to depth images: the SVD decomposition of the intensity image yields eigenvector weights, which are transferred to the depth eigenvectors.
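
Under the natural reading of this slide, the weight transfer might look like the following sketch (names and shapes are illustrative): project the intensity subwindow onto the intensity eigenvectors, then reuse the same weights with the paired depth eigenvectors.

```python
import numpy as np

def transfer_weights(I, intensity_basis, depth_basis, I_mean, Z_mean):
    """Reconstruct a depth image from an intensity image via shared weights.

    I               : (D,) vectorized intensity subwindow
    intensity_basis : (D, k) intensity eigenvectors (from SVD/PCA of training set)
    depth_basis     : (D, k) depth eigenvectors, columns paired with intensity ones
    I_mean, Z_mean  : (D,) mean intensity and depth images
    """
    w = intensity_basis.T @ (I - I_mean)   # least-squares optimal weights
    I_rec = I_mean + intensity_basis @ w   # intensity reconstruction
    Z_rec = Z_mean + depth_basis @ w       # depth reconstruction, same weights
    return w, I_rec, Z_rec
```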

Reconstruction 1. For each subwindow {I_t, Z_t} and view i: minimize the reconstruction error using the least-squares optimal eigenvector weights, and compute the normalized cross-correlation c_i. 2. Select the view i and the subwindow {I_t, Z_t} that optimize c_i. 3. Reconstruct all views. (Figure: input subwindow, reconstruction, ground-truth subwindow.)
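
A sketch of the selection step (steps 1-2 above) under stated assumptions: candidates are vectorized subwindows, and a caller-supplied reconstruct function applies each view's eigenspace.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation between two vectorized subwindows."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def select_view(subwindows, reconstruct, n_views):
    """Pick the (subwindow, view) pair whose eigenspace reconstruction
    best matches the input, scored by NCC over intensity and depth.

    subwindows  : list of (I, Z) vectorized intensity/depth candidates
    reconstruct : callable (I, Z, view) -> (I_rec, Z_rec) using that view's basis
    """
    best_score, best = -np.inf, None
    for k, (I, Z) in enumerate(subwindows):
        for view in range(n_views):
            I_rec, Z_rec = reconstruct(I, Z, view)
            score = ncc(I, I_rec) + ncc(Z, Z_rec)
            if score > best_score:
                best_score, best = score, (k, view)
    return best, best_score
```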

Pose Estimation View registration [ICPR 2002]: 1. Search the new frame for the best subwindow using correlation. 2. Select the k best keyframes. 3. Compute the rigid motion using ICP + normal flow.

Pose Estimation Observation Model: Kalman filter framework [CVPR 2003]
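
For reference, a textbook linear Kalman predict/update cycle of the kind such a framework builds on; the state layout (pose plus velocities) and the noise matrices are assumptions, not the CVPR 2003 model itself.

```python
import numpy as np

def kalman_step(x, P, z, F, Q, H, R):
    """One predict/update cycle of a linear Kalman filter.

    x, P : state mean and covariance (e.g., 6-DOF pose + velocities)
    z    : pose observation from the view-based registration step
    F, Q : state transition model and process noise
    H, R : observation model and measurement noise
    """
    # Predict
    x = F @ x
    P = F @ P @ F.T + Q
    # Update
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
    x = x + K @ (z - H @ x)
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P
```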

Experiments Image sequences from stereo cameras Prior model: 14 subjects in 28 orientations Ground truth with Inertia Cube sensor Compare with OSU pose estimator [Srinivasan & Boyer ’02] -Use same training set for eigenspaces

Results

Exploiting cascades for speed But the correlation search step is very slow! Using a cascade detection paradigm [Viola, Jones], many patterns can be quickly rejected (sketch below). -Set the false negative rate to be very low (e.g., 1%) per stage -Each stage alone may pass many non-faces (30-40%), but the overall architecture is efficient and accurate Multi-view cascade detection yields a coarse initial pose estimate.
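
A minimal sketch of the cascade control flow (the stage functions and thresholds are placeholders, not Viola-Jones features):

```python
def cascade_detect(window, stages):
    """Run a detection cascade: accept only if every stage passes.

    stages : list of (score_fn, threshold) pairs; each threshold is tuned so
             the stage keeps nearly all true faces (~1% false negatives)
             while discarding a large share of the remaining non-faces.
    """
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False  # early rejection: most candidate windows exit here
    return True           # survived all stages: report a detection
```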

Pose-aware interfaces Interface agent responds to the user's gaze -Agent should know when it's being attended to -Turn-taking pragmatics -Eventually, anaphora + object reference Prototype -Smart-room interface “SAM” -Early experiments with the face tracker on a meeting room table…

Subject not looking at SAM: ASR turned off. (Figure: SAM and the pose tracker.)

Subject looking at SAM: ASR turned on. (Figure: SAM and the pose tracker.)

Head nod detection Track the 6-DOF motion of head nod and shake gestures; experiment with a simple motion-energy ratio test. Initial results are promising.
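
One plausible form of such a motion-energy ratio test, as an illustrative sketch; the threshold and the pitch/yaw statistic are assumptions, not the deck's exact test.

```python
import numpy as np

def nod_or_shake(rot_vel, ratio=2.0):
    """Classify a window of head rotation as a nod, a shake, or neither.

    rot_vel : (T, 3) per-frame angular velocities (pitch, yaw, roll)
              from the 6-DOF head tracker
    ratio   : how dominant one axis must be to trigger a detection
    """
    e_pitch = float(np.sum(rot_vel[:, 0] ** 2))  # nodding energy
    e_yaw = float(np.sum(rot_vel[:, 1] ** 2))    # shaking energy
    if e_pitch > ratio * e_yaw:
        return "nod"
    if e_yaw > ratio * e_pitch:
        return "shake"
    return "none"
```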

Today Speaker segregation using audio-visual mutual information -discard background sounds -separate multiple conversational streams Head pose detection and tracking with multi-view appearance models -attention -agreement Articulated pose tracking by learning model constraints, or example-based inference… -gesture -“body language”

Articulated pose sensing

Learning Articulated Tracking Model-based approach works for 3-D data and pure articulation constraints… Need to learn joint limits and other behavioral constraints (with a classic model-based tracker) Without direct 3-D data, example-based techniques are most promising…

Model-based Approach ICP on the depth image with an articulation constraint model: 1. Find closest points 2. Update poses 3. Constrain…

ICP with articulated motion constraint Minimize the distance between the 3-D data and the 3-D articulated model -Apply ICP to each object in the articulated model to find a motion (twist) ξ_k(t) with covariance Σ_k for each limb -Enforce joint constraints: find a set of motions ξ_k′ close to the original motions that satisfy the joint constraints Pure articulation can be expressed as a linear projection on the stacked rigid motions.
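
A sketch of that projection, assuming the joint constraints are given as a basis matrix A whose columns span the articulation-consistent motions; the generalized-least-squares form is an assumption consistent with the slide, not the paper's exact derivation.

```python
import numpy as np

def project_onto_articulation(xi, Sigma, A):
    """Project stacked per-limb twists onto the articulated-motion subspace.

    xi    : (6K,) stacked rigid motions (one twist per limb, from ICP)
    Sigma : (6K, 6K) covariance of the ICP estimates (block diagonal)
    A     : (6K, m) basis whose columns span joint-consistent motions

    Solves min_q (xi - A q)^T Sigma^{-1} (xi - A q), i.e., finds the
    constrained motions xi' = A q closest to the ICP output in the
    Mahalanobis sense.
    """
    W = np.linalg.inv(Sigma)
    q = np.linalg.solve(A.T @ W @ A, A.T @ W @ xi)  # generalized least squares
    return A @ q
```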

Non-linear constraints Limitations of pure articulation constraints -Cannot capture the limits on the range of motion of human joints -Cannot capture behavioral limits of body pose Learning approach: learn a discriminative model of valid / invalid pose. Train an SVM for use as a Lagrangian constraint -Valid body poses extracted from mocap data (150,000 poses) -Invalid body poses generated randomly -Cross-validation classification error rates around 0.061% (Figure: support vectors.)
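
A toy sketch of that training setup with scikit-learn; the synthetic pose vectors stand in for the mocap positives and random negatives, and the kernel and parameters are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Stand-ins: real positives are mocap joint-angle vectors, and real
# negatives are joint-angle vectors sampled at random.
valid_poses = rng.normal(0.0, 0.3, size=(2000, 20)).clip(-1.0, 1.0)
invalid_poses = rng.uniform(-np.pi, np.pi, size=(2000, 20))

X = np.vstack([valid_poses, invalid_poses])
y = np.hstack([np.ones(len(valid_poses)), -np.ones(len(invalid_poses))])

svm = SVC(kernel="rbf", C=10.0).fit(X, y)
# decision_function(pose) > 0 flags a valid pose; the signed margin can act
# as a soft (Lagrangian-style) constraint inside the tracker's optimization.
print("support vectors per class:", svm.n_support_)
```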

Video

Multimodal gestures

Learning pose without 3-D observations Model-based approaches are difficult with more impoverished observations… e.g., contour or edge features. Example-based learning approach -Generate a corpus of training data with a model (Poser) -Find nearest neighbors using fast hashing techniques (LSH); see the sketch below -Optionally use local regression on the nearest neighbors With segmented contours -Shape context features -Bipartite graph matching via the Earth Mover's Distance With unsegmented edge features -Feature selection using a paired classification problem -Extend LSH to “Parameter Sensitive Hashing”
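
To illustrate the fast nearest-neighbor lookup, here is a minimal random-hyperplane LSH index (illustrative only; the actual work uses LSH over selected features and, per the next slide, parameter-sensitive hash functions):

```python
import numpy as np

class HyperplaneLSH:
    """Minimal random-hyperplane LSH index for approximate nearest neighbors."""

    def __init__(self, feats, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, feats.shape[1]))
        self.feats = feats
        self.buckets = {}
        for i, key in enumerate(self._keys(feats)):
            self.buckets.setdefault(key, []).append(i)

    def _keys(self, X):
        bits = (X @ self.planes.T) > 0  # one sign bit per hyperplane
        return [tuple(row) for row in bits]

    def query(self, x):
        """Indices of training items hashing to the same bucket (may be empty;
        practical systems use several tables to raise the hit probability)."""
        return self.buckets.get(self._keys(x[None, :])[0], [])

# Usage: index image features of rendered (Poser) training poses; at run
# time, rank the bucket's candidates by exact distance and read the pose
# off the nearest neighbors, optionally refining with local regression.
```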

Parameter sensitive hashing When an explicit feature (shape context) is not available, feature selection is needed. Features for an optimal distance can be found by training a classifier on an equivalence task. LSH + classifier-based feature selection = PSH: hashing functions sensitive to distance in parameter space, not feature space. “Parameter Sensitive Hashing” [Shakhnarovich et al.]

Parameter sensitive hashing (Details tomorrow…!)

Saturday Workshop

Schedule 5:30pm-5:50pm: Talk, “Fast Example-based Estimation with Parameter-Sensitive Hashing” (Greg Shakhnarovich). 10:30am: Poster, “Contour Matching Using Approximate Earth Mover's Distance” (Kristen Grauman).

Today Learning methods are critical for robust estimation of synchrony, pose and other conversational context cues: Speaker segregation using audiovisual mutual information Head pose estimation using multi-view manifolds and detection cascade trees Real-time articulated tracking from stereo data with SVM- based joint constraints Monocular tracking using example-based inference with fast nearest neighbor methods

Acknowledgements Greg Shakhnarovich Kristen Grauman Neal Checka David Demirdjian Theresa Ko John Fisher Louis-Philippe Morency Mike Siracusa …