1
CALO VISUAL INTERFACE RESEARCH PROGRESS
David Demirdjian Trevor Darrell MIT CSAIL
2
pTablet (or pLaptop!)
Goal: visual cues to conversation or interaction state: presence, attention, turn-taking, agreement and grounding gestures, emotion and expression cues, visual speech features.
3
Functional Capabilities
Help CALO infer: whether the user is still participating in a conversation or interaction, and is focused on the interface or listening to another person; when the user is speaking, plus further features pertaining to visual speech; by non-verbal means, whether a user is confirming understanding of or agreement with the current topic or question, or is confused or irritated. Both for meeting understanding and CALO UI…
4
Machine Learning Research Challenges
Focusing on learning methods which capture personalized interaction: articulatory models of visual speech; sample-based methods for body tracking; hidden-state conditional random fields; context-based gesture recognition. (Not all are yet in the deployed demo…)
5
Articulatory models of visual speech:
Traditional models of visual speech presume synchronous units based on visemes, the visual correlate of phonemes. Audiovisual speech production is often asynchronous. The model is instead formed from a series of loosely coupled streams of articulatory features. (See Saenko and Darrell, ICMI 2004, and Saenko et al., ICCV 2005, for more information.)
6
Sample-based methods for body tracking
Tracking human bodies requires exploration of a high-dimensional state space. Estimated posteriors are often sharp and multimodal. New tracking techniques, based on a novel approximate nearest neighbor hashing method, have comprehensive pose coverage and optimally integrate information over time. These techniques are suitable for real-time markerless motion capture, and for tracking the human body to infer attention and gesture. (See Demirdjian et al., ICCV 2005, and Taycher et al., CVPR 2006, for more information.)
7
Hidden-state conditional random fields
Discriminative techniques are efficient and accurate, and learn to represent only the portion of a state necessary for a specific task. Conditional random fields are effective at recognizing visual gestures, but lack the ability of generative models to capture gesture substructure through hidden state. We have developed a hidden-state conditional random field formulation. (See Wang et al. CVPR 2006.)
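For reference, a hidden-state conditional random field is usually written by summing a CRF-style potential over the hidden states. The expression below is a sketch of that general form, not necessarily the exact notation of the cited paper; the symbols (class label y, observation sequence x, hidden-state sequence h, parameters θ, potential Ψ) are notation assumed here.

```latex
P(y \mid \mathbf{x}; \theta)
  = \sum_{\mathbf{h}} P(y, \mathbf{h} \mid \mathbf{x}; \theta)
  = \frac{\sum_{\mathbf{h}} \exp \Psi(y, \mathbf{h}, \mathbf{x}; \theta)}
         {\sum_{y'} \sum_{\mathbf{h}} \exp \Psi(y', \mathbf{h}, \mathbf{x}; \theta)}
```

The sum over h is what lets the model capture gesture substructure while still being trained discriminatively on the class label.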
8
Hidden Conditional Random Fields for Head Gesture Recognition
3 classes: Nods, Shakes, Junk
Models                      Accuracy (%)
HMM, W = 0                  46.33
CRF, W = 0                  38.42
HCRF (multiclass), W = 0    45.37
HCRF (multiclass), W = 1    64.44
Note: how about HCRF one-vs-all? Confusion matrices or ROC curves (e.g. ROC curves of one-vs-all) might be better.
Challenging data, because it shows people interacting with a robot.
9
Context-based gesture recognition
Recognition of a user’s gesture should be done in the context of the current interaction. Visual recognition can be augmented with context cues from the interaction state: conversational dialog with an embodied agent; interaction with a conventional windows and mouse interface. (See Morency, Sidner and Darrell, ICMI 2005, and Morency and Darrell, IUI 2006.)
10
User Adaptive Agreement Recognition
Person’s idiolect. User agreement from recognized speech and head gestures. Multimodal co-training. Challenges: asynchrony between modalities; “missing data” problem.
Just as people within a community speak a dialect of a language, each individual speaker has their own unique understanding and use of language, called their idiolect. We are interested in the application and development of machine learning algorithms that will enable human-computer interfaces to adapt to their users’ idiolects, to create a more natural and efficient interaction.
The specific problem that we are addressing is agreement recognition in conversational agents, where the agent uses keywords in the recognized speech and recognized head gestures to determine user agreement (show video). Our goal is to apply multimodal co-training between the linguistic and visual classifiers to learn new, user-specific agreement keywords and head gestures, thus enabling the user to interact more naturally with the agent.
The main challenges in applying multimodal co-training in this setting are due to the asynchrony between the user’s speech and head gestures: in general an agreement keyword does not occur at the same time as a head gesture; instead, both fall within a time window of one another. Also, an agreement keyword need not co-occur with a head gesture at all, so there needs to be a mechanism for “missing data” detection. We are currently looking into using data clustering techniques in combination with co-training to design a multimodal co-training algorithm that will overcome these challenges.
Video notes: This is an example sequence from a dataset collected at MERL that consists of a set of subjects interacting with an embodied conversational agent (Mel). We are using these sequences in our experiments on user adaptive agreement recognition. The subjects use speech and head gesture to interact with Mel. In this sequence Mel is explaining a MERL invention called iGlassware. Notice the asynchrony between the first 'yes' keyword and the associated head nod; also, the second 'yes' keyword has no corresponding head nod.
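The asynchrony and “missing data” issues can be illustrated with a small sketch. It pairs each agreement keyword with the nearest head gesture inside a time window and marks keywords that have no gesture. This greedy pairing is an illustrative assumption, not the clustering and co-training algorithm under development; the function name, the timestamps, and the window length are all hypothetical.

```python
def pair_keywords_with_gestures(keyword_times, gesture_times, window):
    """Pair each keyword with the nearest unused gesture within `window` seconds.

    Returns a list of (keyword_time, gesture_time_or_None). None marks a
    keyword with no co-occurring gesture (the "missing data" case).
    """
    unused = sorted(gesture_times)
    pairs = []
    for k in sorted(keyword_times):
        # Gestures close enough in time to count as co-occurring with k.
        candidates = [g for g in unused if abs(g - k) <= window]
        if candidates:
            nearest = min(candidates, key=lambda g: abs(g - k))
            unused.remove(nearest)  # each gesture supports at most one keyword
            pairs.append((k, nearest))
        else:
            pairs.append((k, None))
    return pairs


# Hypothetical timestamps (seconds): two 'yes' keywords, one head nod.
pairs = pair_keywords_with_gestures([1.0, 5.0], [1.8], window=1.0)
```

With these made-up timestamps the first keyword is paired with the delayed nod and the second is flagged as having no gesture, mirroring the situation described in the video notes.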
11
Status
New pTablet functionalities: face/gaze tracking; head gesture recognition (nod/shake) + gaze; lip/mouth motion detection; user enrollment/recognition (ongoing work).
A/V Integration: audio-visual sync./calibration; meeting visualization/understanding.
12
pTablet system
[Diagram labels: pTablet camera; audio; VTracker (user model, frontal view; pose, 6D); Head Gesture Recognition; Speech; OAA messages: person ID, head pose, gesture, lips moving.]
14
Speaking activity detection
Face tracking as: Rigid pose
15
Speaking activity detection
Face tracking as: Rigid pose + Non-rigid facial deformations
16
Speaking activity detection
Speaking activity ~ high motion energy in the mouth/lips region. This is a weak assumption (e.g. a hand moving in front of the mouth will trigger speaking activity detection), but it complements audio-based speaker detection well.
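A minimal sketch of the motion-energy cue, assuming grayscale frames stored as nested lists and a mouth bounding box supplied by the face tracker. The function names, the box format, and the threshold are assumptions made for illustration, not part of the pTablet code.

```python
def mouth_motion_energy(prev_frame, frame, mouth_box):
    """Mean squared intensity change inside the mouth region.

    mouth_box is (top, bottom, left, right) in pixel coordinates and is
    assumed to come from the face tracker.
    """
    top, bottom, left, right = mouth_box
    total, count = 0.0, 0
    for r in range(top, bottom):
        for c in range(left, right):
            diff = float(frame[r][c]) - float(prev_frame[r][c])
            total += diff * diff
            count += 1
    return total / count if count else 0.0


def is_speaking(energies, threshold, min_active_fraction=0.5):
    """Flag speaking when enough recent frames exceed the energy threshold."""
    if not energies:
        return False
    active = sum(1 for e in energies if e > threshold)
    return active / len(energies) >= min_active_fraction


# Two tiny 4x4 frames: only the lower rows (the assumed mouth box) change.
still = [[0] * 4 for _ in range(4)]
moved = [[0] * 4, [0] * 4, [10] * 4, [10] * 4]
energy = mouth_motion_energy(still, moved, (2, 4, 0, 4))
```

Any motion inside the box raises the energy, which is exactly why an occluding hand would also trigger the detector.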
17
Speaking activity detection
18
User enrollment/recognition
Idea: at start-up the user is automatically identified and logged in by the pTablet. If the user is not recognized or is misrecognized, they will have to log in manually. Face recognition is based on a feature set matching algorithm (Grauman and Darrell, “The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features,” ICCV 2005).
19
Audio-Visual Calibration
Temporal calibration: aligning audio with visual data. How? By aligning lip motion energy in images with audio energy. Geometric calibration: estimating camera location/orientation in the world coordinate system.
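One plausible way to do the temporal alignment is to pick the lag that maximizes the cross-correlation between the two energy signals. The sketch below assumes both signals have already been resampled to a common rate; the function name and the toy signals are hypothetical, and the slides do not state which alignment method is actually used.

```python
def best_lag(visual_energy, audio_energy, max_lag):
    """Return the lag (in samples) that best aligns audio with video.

    A positive result means the audio energy trails the lip motion energy
    by that many samples. Both sequences are mean-centered first.
    """
    def centered(xs):
        mean = sum(xs) / len(xs)
        return [x - mean for x in xs]

    v = centered(visual_energy)
    a = centered(audio_energy)
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Correlate v[i] with a[i + lag] over the overlapping samples.
        score = 0.0
        for i, v_i in enumerate(v):
            j = i + lag
            if 0 <= j < len(a):
                score += v_i * a[j]
        if score > best_score:
            best, best_score = lag, score
    return best


# Toy signals: the audio burst arrives two samples after the lip motion.
lip = [0, 0, 1, 5, 1, 0, 0, 0, 0, 0]
audio = [0, 0, 0, 0, 1, 5, 1, 0, 0, 0]
lag = best_lag(lip, audio, max_lag=4)
```

The estimated lag can then be applied as a fixed offset when combining audio-based and visual speaker detection.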
20
Audio-Visual Integration
CAMEO pTablets
21
Calibration
Alternative approach to estimate the position/orientation of the pTablets, with or without a global view (e.g. from CAMEO). Idea: use discourse information (e.g. who is talking to whom, dialog between two people) and local head pose to find the location of the pTablets…
22
A/V Integration
AVIntegrator: same functionalities as Year 2 (e.g. includes activity recognition, etc.), modified to accept calibration data estimated externally.
23
Integration and activity estimation
A/V integration. Activity estimation: Who’s in the room? Who is looking at whom? Who is talking to whom? …
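As an illustration of the “who is looking at whom” question, the sketch below assumes that geometric calibration has already placed every participant in a shared 2D world frame and that each head pose has been reduced to a yaw angle in that frame. The function name, the data layout, and the 15-degree tolerance are assumptions, not details of the AVIntegrator.

```python
import math


def gaze_target(positions, yaws, person, max_angle_deg=15.0):
    """Return who `person` is most plausibly looking at, or None.

    positions maps a name to (x, y) in the world frame; yaws maps a name to
    the head direction in radians (0 = +x axis, counter-clockwise positive).
    """
    px, py = positions[person]
    best, best_angle = None, max_angle_deg
    for other, (ox, oy) in positions.items():
        if other == person:
            continue
        bearing = math.atan2(oy - py, ox - px)
        delta = bearing - yaws[person]
        # Wrap the difference to [-pi, pi] and take its magnitude in degrees.
        diff = math.degrees(abs(math.atan2(math.sin(delta), math.cos(delta))))
        if diff <= best_angle:
            best, best_angle = other, diff
    return best


# Hypothetical layout: A and B face each other, C looks away from both.
positions = {"A": (0, 0), "B": (1, 0), "C": (0, 1)}
yaws = {"A": 0.0, "B": math.pi, "C": math.pi / 4}
looking = {name: gaze_target(positions, yaws, name) for name in positions}
```

Combining this with the current-speaker estimate gives a simple cue for “who is talking to whom”.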
24
A/V Integrator system
[Diagram labels: calibration information; VTracker/pTablet units; CAMEO (e.g. current speaker); A/V Integrator; Discourse/Dialog; Speech recognition (?); OAA/MOKB messages: user list, speaker, agrees/disagrees, who to whom.]
25
A/V Integration
26
Demonstration? Real-time meeting understanding
Use of the pTablet suite for interaction with a personal CALO, e.g.: use of head pose/lip motion for speaking activity detection; Yes/No answers by head nods/shakes; visual login.