CALO VISUAL INTERFACE RESEARCH PROGRESS

David Demirdjian and Trevor Darrell, MIT CSAIL

pTablet (or pLaptop!)
Goal: visual cues to conversation or interaction state:
- presence
- attention
- turn-taking
- agreement and grounding
- gestures
- emotion and expression cues
- visual speech features

Functional Capabilities
Help CALO infer:
- whether the user is still participating in a conversation or interaction, is focused on the interface, or is listening to another person
- when the user is speaking, plus further features pertaining to visual speech
- non-verbal cues indicating whether the user is confirming understanding of or agreement with the current topic or question, or is confused or irritated
Useful both for meeting understanding and for the CALO UI.

Machine Learning Research Challenges
Focusing on learning methods which capture personalized interaction:
- Articulatory models of visual speech
- Sample-based methods for body tracking
- Hidden-state conditional random fields
- Context-based gesture recognition
(Not all are yet in the deployed demo.)

Articulatory models of visual speech
Traditional models of visual speech presume synchronous units based on visemes, the visual correlates of phonemes. Audiovisual speech production, however, is often asynchronous. We instead model speech with a series of loosely coupled streams of articulatory features. (See Saenko and Darrell, ICMI 2004, and Saenko et al., ICCV 2005, for more information.)
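To make the idea concrete, here is a minimal, hypothetical sketch (not the Saenko and Darrell model) of scoring visual speech as loosely coupled articulatory-feature streams: each stream is decoded on its own, and a soft penalty discourages, but does not forbid, asynchronous state changes across streams. The stream names and the penalty weight are illustrative assumptions.

```python
import numpy as np

STREAMS = ["lip_opening", "lip_rounding", "labiodental"]  # hypothetical articulatory features

def decode_stream(frame_scores):
    """Pick the best state per frame for one stream.

    frame_scores: (T, K) log-likelihoods of K states over T frames.
    A full model would use Viterbi decoding with transition costs.
    """
    return frame_scores.argmax(axis=1)

def coupled_score(stream_scores, async_penalty=0.1):
    """Sum per-stream scores, softly penalizing asynchronous state changes.

    A synchronous viseme model would force all streams to change state together;
    here streams may drift apart at a small cost.
    """
    total, change_points = 0.0, []
    for scores in stream_scores.values():
        path = decode_stream(scores)
        total += float(scores[np.arange(len(path)), path].sum())
        change_points.append(set(np.flatnonzero(np.diff(path)).tolist()))
    shared = set.intersection(*change_points) if change_points else set()
    unshared = sum(len(c - shared) for c in change_points)  # changes not shared by all streams
    return total - async_penalty * unshared

# usage: random per-frame log-likelihoods (T=30 frames, K=4 states per stream)
rng = np.random.default_rng(0)
scores = {s: rng.normal(size=(30, 4)) for s in STREAMS}
print(coupled_score(scores))
```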

Sample-based methods for body tracking
Tracking human bodies requires exploring a high-dimensional state space, and the estimated posteriors are often sharp and multimodal. We have developed new tracking techniques based on a novel approximate nearest-neighbor hashing method that have comprehensive pose coverage and optimally integrate information over time. These techniques are suitable for real-time markerless motion capture, and for tracking the human body to infer attention and gesture. (See Demirdjian et al., ICCV 2005, and Taycher et al., CVPR 2006, for more information.)
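A rough sketch of the sample-based idea, under assumptions of my own (a generic random-hyperplane LSH stands in for whatever hashing scheme the papers use): a database of (image descriptor, pose) examples is hashed once, and at run time only the query's bucket is searched rather than the whole database.

```python
import numpy as np

class PoseHashTable:
    """Toy approximate-nearest-neighbor lookup from image descriptors to body poses."""

    def __init__(self, dim, n_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, dim))  # random hyperplanes define hash bits
        self.buckets = {}                             # hash key -> list of (descriptor, pose)

    def _key(self, x):
        return tuple(((self.planes @ x) > 0).astype(int))

    def add(self, descriptor, pose):
        self.buckets.setdefault(self._key(descriptor), []).append((descriptor, pose))

    def query(self, descriptor):
        """Return the pose of the nearest stored descriptor in the query's bucket."""
        candidates = self.buckets.get(self._key(descriptor), [])
        if not candidates:
            return None  # a real system would probe neighboring buckets
        dists = [np.linalg.norm(descriptor - d) for d, _ in candidates]
        return candidates[int(np.argmin(dists))][1]

# usage: synthetic (descriptor, pose) pairs, then a near-duplicate lookup
rng = np.random.default_rng(1)
table = PoseHashTable(dim=64)
for _ in range(10000):
    d = rng.normal(size=64)
    table.add(d, pose=d[:6])              # pretend the first 6 dims are a 6-DOF pose
d0 = rng.normal(size=64)
table.add(d0, pose=d0[:6])
print(table.query(d0 + 0.01 * rng.normal(size=64)))
```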

Hidden-state conditional random fields
Discriminative techniques are efficient and accurate, and learn to represent only the portion of the state necessary for a specific task. Conditional random fields are effective at recognizing visual gestures, but lack the ability of generative models to capture gesture substructure through hidden state. We have therefore developed a hidden-state conditional random field (HCRF) formulation. (See Wang et al., CVPR 2006.)
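For reference, the standard HCRF class posterior (the feature function Ψ below is generic, not the specific feature set of this work) marginalizes a hidden state sequence h out of a CRF-style log-linear model:

\[
P(y \mid \mathbf{x};\, \theta)
  \;=\; \sum_{\mathbf{h}} P(y, \mathbf{h} \mid \mathbf{x};\, \theta)
  \;=\; \frac{\sum_{\mathbf{h}} \exp \Psi(y, \mathbf{h}, \mathbf{x};\, \theta)}
             {\sum_{y'} \sum_{\mathbf{h}} \exp \Psi(y', \mathbf{h}, \mathbf{x};\, \theta)}
\]

Training maximizes the log of this posterior over labeled gesture sequences (typically with an L2 regularizer); the hidden states h let the model represent gesture substructure that a plain CRF cannot.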

Hidden Conditional Random Fields for Head Gesture Recognition
3 classes: nods, shakes, junk

Model                W    Accuracy (%)
HMM                  0    46.33
CRF                  0    38.42
HCRF (multiclass)    0    45.37
HCRF (multiclass)    1    64.44

[Presenter notes: How about HCRF one-vs-all? Confusion matrices, or ROC curves, might be better (ROC curves of one-vs-all). This is challenging data because it involves people interacting with a robot.]

Context-based gesture recognition
Recognition of the user's gestures should be done in the context of the current interaction. Visual recognition can be augmented with context cues from the interaction state:
- conversational dialog with an embodied agent
- interaction with a conventional windows-and-mouse interface
(See Morency, Sidner and Darrell, ICMI 2005, and Morency and Darrell, IUI 2006.)
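As a toy illustration of the principle (my own assumption of how context could be fused, not the actual model of the cited papers), a visual gesture likelihood can be reweighted by a prior conditioned on the interaction state; for example, the prior on nods and shakes rises right after the agent asks a yes/no question.

```python
import numpy as np

GESTURES = ["nod", "shake", "other"]

# hypothetical context priors indexed by dialog state
CONTEXT_PRIOR = {
    "yes_no_question": np.array([0.45, 0.45, 0.10]),
    "agent_speaking":  np.array([0.25, 0.15, 0.60]),
    "idle":            np.array([0.10, 0.10, 0.80]),
}

def recognize(visual_likelihood, dialog_state):
    """Combine p(observation | gesture) with p(gesture | context) and pick the best gesture."""
    prior = CONTEXT_PRIOR.get(dialog_state, CONTEXT_PRIOR["idle"])
    posterior = visual_likelihood * prior
    posterior /= posterior.sum()
    return GESTURES[int(posterior.argmax())]

# usage: weak visual evidence for a nod becomes decisive after a yes/no question
print(recognize(np.array([0.4, 0.3, 0.3]), "yes_no_question"))
```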

User Adaptive Agreement Recognition
- Person's idiolect
- User agreement from recognized speech and head gestures
- Multimodal co-training
- Challenges: asynchrony between modalities; the "missing data" problem

Just as communities of people speak a dialect of a language, each individual speaker has their own unique understanding and use of language, called their idiolect. We are interested in developing machine learning algorithms that enable human-computer interfaces to adapt to their users' idiolects, creating a more natural and efficient interaction. The specific problem we are addressing is agreement recognition in conversational agents, where the agent uses keywords in the recognized speech, together with recognized head gestures, to determine user agreement (show video). Our goal is to apply multimodal co-training between the linguistic and visual classifiers to learn new, user-specific agreement keywords and head gestures, enabling the user to interact more naturally with the agent. The main challenges in applying multimodal co-training in this setting come from the asynchrony between the user's speech and head gestures: in general an agreement keyword does not occur at the same time as a head gesture; instead both fall within a time window of one another. Also, an agreement keyword need not co-occur with a head gesture at all, so there must be a mechanism for "missing data" detection. We are currently looking into combining data clustering techniques with co-training to design a multimodal co-training algorithm that overcomes these challenges.

Video notes: this is an example sequence from a dataset collected at MERL, consisting of subjects interacting with an embodied conversational agent (Mel). We use these sequences in our experiments on user-adaptive agreement recognition. The subjects use speech and head gestures to interact with Mel. In this sequence Mel is explaining a MERL invention called iGlassware. Notice the asynchrony between the first 'yes' keyword and the associated head nod; also, the second 'yes' keyword has no corresponding head nod.
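A minimal co-training sketch under my own assumptions (scikit-learn classifiers and synchronous speech/gesture feature pairs; the asynchrony and missing-data issues described above are exactly what this naive version ignores): each view labels the unlabeled examples it is most confident about, and those examples grow the shared training set used by both views.

```python
from sklearn.linear_model import LogisticRegression
import numpy as np

def cotrain(speech_X, gesture_X, y, unlabeled_speech_X, unlabeled_gesture_X,
            rounds=5, per_round=2):
    """Return the speech and gesture classifiers after a few co-training rounds."""
    s_clf, g_clf = LogisticRegression(), LogisticRegression()
    sX, gX, labels = speech_X.copy(), gesture_X.copy(), y.copy()
    pool = list(range(len(unlabeled_speech_X)))
    for _ in range(rounds):
        if not pool:
            break
        s_clf.fit(sX, labels)
        g_clf.fit(gX, labels)
        # each view proposes labels for the remaining unlabeled pool
        s_prob = s_clf.predict_proba(unlabeled_speech_X[pool])
        g_prob = g_clf.predict_proba(unlabeled_gesture_X[pool])
        # take the examples where either view is most confident
        conf = np.maximum(s_prob.max(axis=1), g_prob.max(axis=1))
        picked = np.argsort(-conf)[:per_round]
        new_y = np.where(s_prob.max(axis=1)[picked] >= g_prob.max(axis=1)[picked],
                         s_clf.classes_[s_prob[picked].argmax(axis=1)],
                         g_clf.classes_[g_prob[picked].argmax(axis=1)])
        idx = [pool[i] for i in picked]
        sX = np.vstack([sX, unlabeled_speech_X[idx]])
        gX = np.vstack([gX, unlabeled_gesture_X[idx]])
        labels = np.concatenate([labels, new_y])
        pool = [p for p in pool if p not in idx]
    return s_clf, g_clf
```

In the actual setting, the speech and gesture features for a candidate agreement event would be paired within a time window rather than assumed synchronous, and events with no accompanying gesture would be routed through a "missing data" path instead of being force-labeled.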

Status
New pTablet functionalities:
- Face/gaze tracking
- Head gesture recognition (nod/shake) + gaze
- Lip/mouth motion detection
- User enrollment/recognition (ongoing work)
A/V Integration:
- Audio-visual sync./calibration
- Meeting visualization/understanding

pTablet system
[System diagram: the pTablet camera feeds VTracker (using a frontal-view user model and 6D pose) and Head Gesture Recognition; audio feeds Speech; the system outputs OAA messages carrying person ID, head pose, gesture, and lips-moving status.]

Speaking activity detection Face tracking as: Rigid pose

Speaking activity detection Face tracking as: Rigid pose + Non-rigid facial deformations

Speaking activity detection ≈ high motion energy in the mouth/lips region. This is a weak assumption (e.g. a hand moving in front of the mouth will also trigger speaking activity detection), but it complements audio-based speaker detection well.
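A minimal sketch of this kind of detector (assumed, not the deployed pTablet code): threshold the mean frame-difference energy inside the mouth region reported by the face tracker. The fixed ROI and threshold below are placeholders.

```python
import cv2
import numpy as np

MOUTH_ROI = (220, 260, 120, 60)   # hypothetical x, y, w, h; would come from the face tracker
ENERGY_THRESHOLD = 8.0            # tuned empirically

def is_speaking(prev_frame, frame):
    """Flag speaking activity from motion energy in the mouth region."""
    x, y, w, h = MOUTH_ROI
    prev_roi = cv2.cvtColor(prev_frame[y:y+h, x:x+w], cv2.COLOR_BGR2GRAY)
    roi = cv2.cvtColor(frame[y:y+h, x:x+w], cv2.COLOR_BGR2GRAY)
    energy = float(np.mean(cv2.absdiff(roi, prev_roi)))  # mean absolute frame difference
    return energy > ENERGY_THRESHOLD

# usage with a webcam stream
cap = cv2.VideoCapture(0)
ok, prev = cap.read()
while ok:
    ok, frame = cap.read()
    if not ok:
        break
    if is_speaking(prev, frame):
        print("speaking (visual cue)")
    prev = frame
```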

Speaking activity detection

User enrollment/recognition
Idea: at start-up the user is automatically identified and logged in by the pTablet. If the user is not recognized, or is misrecognized, they log in manually.
Face recognition is based on a feature-set matching algorithm ("The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features", Grauman and Darrell, ICCV 2005).
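For intuition, here is a deliberately simplified, 1-D pyramid-match-style similarity between two unordered feature sets (the real kernel operates on multi-dimensional image features): histogram intersections are computed at increasingly coarse bins, and matches that only form at coarser levels are weighted less.

```python
import numpy as np

def pyramid_match(set_a, set_b, levels=4, lo=0.0, hi=1.0):
    """Pyramid-match-style similarity for 1-D feature values in [lo, hi)."""
    prev_matches, score = 0.0, 0.0
    for level in range(levels):
        bins = 2 ** (levels - 1 - level)          # finest bins first, one bin at the top
        edges = np.linspace(lo, hi, bins + 1)
        ha, _ = np.histogram(set_a, bins=edges)
        hb, _ = np.histogram(set_b, bins=edges)
        matches = float(np.minimum(ha, hb).sum())  # histogram intersection
        new_matches = matches - prev_matches       # matches first formed at this level
        score += new_matches / (2 ** level)        # coarser level => lower weight
        prev_matches = matches
    return score

# usage: feature sets from two face images (hypothetical descriptors)
rng = np.random.default_rng(1)
print(pyramid_match(rng.random(40), rng.random(55)))
```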

Audio-Visual Calibration
Temporal calibration: aligning audio with visual data. How? By aligning lip-motion energy in the images with audio energy.
Geometric calibration: estimating camera location/orientation in the world coordinate system.
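A minimal sketch of the temporal-alignment step under my own assumptions (both signals resampled to the video frame rate; the actual calibration procedure may differ): cross-correlate the lip-motion energy with the audio energy envelope and take the lag with maximum correlation.

```python
import numpy as np

def estimate_offset(lip_energy, audio_energy, fps=30.0):
    """Return the offset in seconds by which the audio trails the video."""
    a = (lip_energy - lip_energy.mean()) / (lip_energy.std() + 1e-9)
    b = (audio_energy - audio_energy.mean()) / (audio_energy.std() + 1e-9)
    corr = np.correlate(b, a, mode="full")       # audio shifted against video
    lag = int(corr.argmax()) - (len(a) - 1)      # frames by which audio trails video
    return lag / fps

# usage: the audio envelope is a delayed, noisy copy of the lip-motion energy
rng = np.random.default_rng(2)
lips = rng.random(300)
audio = np.roll(lips, 6) + 0.05 * rng.random(300)   # audio delayed by 6 frames
print(estimate_offset(lips, audio))                  # ~0.2 s at 30 fps
```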

Audio-Visual Integration
[Figure: CAMEO and the pTablets]

Calibration
Alternative approach to estimating the position/orientation of the pTablets, with or without a global view (e.g. from CAMEO).
Idea: use discourse information (e.g. who is talking to whom, a dialog between two people) and local head pose to find the location of the pTablets…

A/V Integration
AVIntegrator: same functionalities as Year 2 (e.g. includes activity recognition, etc.), modified to accept calibration data estimated externally.

Integration and activity estimation
A/V integration
Activity estimation:
- Who is in the room?
- Who is looking at whom?
- Who is talking to whom?
- …

A/V Integrator system
[System diagram: the pTablet VTrackers and CAMEO feed the A/V Integrator, together with calibration information and discourse/dialog and speech recognition input (e.g. the current speaker); the A/V Integrator emits OAA/MOKB messages carrying the user list, the speaker, agrees/disagrees, and who is speaking to whom.]

A/V Integration

Demonstration?
Real-time meeting understanding.
Use of the pTablet suite for interaction with a personal CALO, e.g.:
- use of head pose/lip motion for speaking activity detection
- yes/no answers by head nods/shakes
- visual login