Modeling Individual and Group Actions in Meetings with Layered HMMs. Dong Zhang, Daniel Gatica-Perez, Samy Bengio, Iain McCowan, Guillaume Lathoud. IDIAP.

Presentation transcript:

Modeling Individual and Group Actions in Meetings with Layered HMMs
Dong Zhang, Daniel Gatica-Perez, Samy Bengio, Iain McCowan, Guillaume Lathoud
IDIAP Research Institute, Martigny, Switzerland

Meetings as sequences of actions
– Human interaction: similar/complementary roles; individuals constrained by the group
– Agenda: a prior sequence (discussion points, presentations, decisions to be made)
– Minutes: a posterior sequence (key phases, summarized discussions, decisions made)

The goal: recognizing sequences of meeting actions
[Figure: a meeting timeline annotated under several parallel views, e.g. discussion phase (presentation, group discussion), topic, group interest level (high, neutral), and group task (information sharing, decision making).]
Meeting views: group-level actions = meeting actions.

Our work: two-layer HMMs
Decompose the recognition problem; both layers use HMMs.
– Individual action layer (I-HMM): various models
– Group action layer (G-HMM)

Our work in detail
1. Definition of meeting actions
2. Audio-visual observations
3. Action recognition
4. Results
References: D. Zhang et al., “Modeling Individual and Group Actions in Meetings with Layered HMMs”, IEEE CVPR Workshop on Event Mining, 2004; N. Oliver et al., ICMI; I. McCowan et al., ICASSP 2003 and PAMI 2005.

1. Defining meeting actions
Multiple parallel views:
– Tech-based: what can we recognize?
– Application-based: respond to user needs
– Psychology-based: coding schemes from social psychology
Each view defines a set of actions A = {A_1, A_2, A_3, A_4, …, A_N}.
Actions in a set are:
– consistent: one view, answering one question
– mutually exclusive
– exhaustive

Multi-modal turn-taking: describes the group discussion state
A = {‘discussion’, ‘monologue’ (×4), ‘white-board’, ‘presentation’, ‘note-taking’, ‘monologue + note-taking’ (×4), ‘white-board + note-taking’, ‘presentation + note-taking’}
Individual actions: I = {‘speaking’, ‘writing’, ‘idle’}
Actions are multi-modal in nature. The two vocabularies are written out in the sketch below.
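To make the vocabularies concrete, they can be written down directly; a minimal Python sketch (the per-person monologue naming is ours, not the paper's):

```python
# Group-action vocabulary from the multi-modal turn-taking view.
# "monologue" and "monologue + note-taking" occur once per participant (x4).
PARTICIPANTS = range(1, 5)

GROUP_ACTIONS = (
    ["discussion", "white-board", "presentation", "note-taking",
     "white-board + note-taking", "presentation + note-taking"]
    + [f"monologue-{p}" for p in PARTICIPANTS]
    + [f"monologue-{p} + note-taking" for p in PARTICIPANTS]
)  # 14 group actions in total

INDIVIDUAL_ACTIONS = ["speaking", "writing", "idle"]
```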

Example
[Figure: a timeline of one meeting. Per-person tracks alternate between speaking (S) and writing (W); separate tracks mark when the presentation screen and whiteboard are in use. The resulting group-action sequence reads: Discussion, Monologue1 + Note-taking, Presentation + Note-taking, Whiteboard + Note-taking.]

2. Audio-visual observations
Audio: 12 channels at 48 kHz (4 lapel microphones, 1 microphone array)
Video: 3 CCTV cameras
All streams synchronized.

Multimodal feature extraction: audio
Microphone array:
– speech activity (SRP-PHAT) at the seats and the presentation/whiteboard area
– speech/silence segmentation
Lapel microphones:
– speech pitch
– speech energy
– speaking rate
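SRP-PHAT localization and pitch tracking need dedicated machinery; as a minimal illustration of the frame-level acoustic features, here is a log-energy and speech/silence sketch in plain NumPy (window length, hop, and threshold are our assumptions, not values from the paper):

```python
import numpy as np

def frame_energy(signal, rate, win=0.040, hop=0.2):
    """Short-time log energy of a 1-D NumPy signal;
    hop = 0.2 s matches the paper's 5 frames/s feature rate."""
    w, h = int(win * rate), int(hop * rate)
    frames = [signal[i:i + w] for i in range(0, len(signal) - w, h)]
    return np.array([10 * np.log10(np.mean(f ** 2) + 1e-10) for f in frames])

def speech_silence(energy_db, margin=10.0):
    """Crude segmentation: frames more than `margin` dB above the quietest
    frame are labelled speech (1), else silence (0)."""
    return (energy_db > energy_db.min() + margin).astype(int)
```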

Multimodal feature extraction: video
Head and hands blobs:
– skin-colour models (GMM)
– head position
– hands position + features (eccentricity, size, orientation)
– head + hands blob motion
Moving blobs obtained from background subtraction.
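The skin-colour model is a GMM over pixel colours; a sketch using scikit-learn (component count, colour space, and threshold are our assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Fit a skin-colour GMM on labelled skin pixels (rows of RGB values).
skin_gmm = GaussianMixture(n_components=5, covariance_type="full")
skin_pixels = np.random.rand(1000, 3)   # placeholder for real skin samples
skin_gmm.fit(skin_pixels)

def skin_mask(frame, threshold=-6.0):
    """frame: (H, W, 3) float array; boolean mask of likely skin pixels."""
    h, w, _ = frame.shape
    log_lik = skin_gmm.score_samples(frame.reshape(-1, 3))
    return (log_lik > threshold).reshape(h, w)
```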

3. Recognition with the two-layer HMM
Each layer trained independently, as in ASR (Torch); simultaneous segmentation and recognition.
Compared with a single-layer HMM:
– smaller observation spaces
– I-HMMs are trained with much more data
– the G-HMM is less sensitive to feature variations
– combinations of models can be explored
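The paper trains both layers with Torch; to illustrate the layered structure itself, here is a sketch using the hmmlearn package (state counts, feature dimensions, and the random training data are placeholders):

```python
import numpy as np
from hmmlearn import hmm

# Layer 1: one I-HMM per individual action, each trained only on
# audio-visual frames of that action (random placeholders below).
i_hmms = {}
for action in ["speaking", "writing", "idle"]:
    model = hmm.GaussianHMM(n_components=3, covariance_type="diag")
    model.fit(np.random.rand(500, 10))     # (frames, feature dim)
    i_hmms[action] = model

def i_layer_output(window):
    """Log-likelihood of one feature window under each I-HMM; the G-HMM
    observes these per-person outputs instead of the raw features."""
    return np.array([i_hmms[a].score(window) for a in i_hmms])
```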

Models for the I-HMM
Early integration:
– all observations concatenated into one vector
– captures correlation between streams
– assumes frame-synchronous streams
Asynchronous HMM (Bengio, NIPS 2002):
– audio and visual streams share a single state sequence
– states emit on one or both streams, given a synchronization variable
– models inter-stream asynchrony
Multi-stream HMM (Dupont, TMM 2000):
– one HMM per stream (audio or visual), trained independently
– decoding: weighted likelihoods combined at each frame
– allows little inter-stream asynchrony
– used in multi-band and audio-visual ASR
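Early integration and multi-stream combination reduce to two one-liners; a sketch (the (0.8, 0.2) weights echo the values on the results slide, though their assignment to audio vs. video here is our assumption):

```python
import numpy as np

def early_integration(audio_feats, visual_feats):
    """Concatenate frame-synchronous audio and visual feature vectors."""
    return np.concatenate([audio_feats, visual_feats], axis=1)

def multi_stream_ll(ll_audio, ll_video, w_audio=0.8, w_video=0.2):
    """Combine per-stream log-likelihoods with stream weights per frame."""
    return w_audio * ll_audio + w_video * ll_video
```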

Linking the two layers
Hard decision: the I-HMM with the highest probability outputs 1; all other models output 0.
Soft decision: output the probability of each individual action model.
Example, for the audio-visual features of one frame: HD: (1, 0, 0); SD: (0.9, 0.05, 0.05).
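Both linking schemes can be expressed over the I-HMM log-likelihoods; a minimal sketch (the uniform prior in the soft case is our assumption):

```python
import numpy as np

def link_layers(log_likelihoods, soft=True):
    """Turn per-I-HMM log-likelihoods into a G-HMM observation vector.
    Soft decision: likelihoods normalized to posteriors (uniform prior);
    hard decision: one-hot vector for the winning model."""
    if soft:
        p = np.exp(log_likelihoods - log_likelihoods.max())
        return p / p.sum()
    hard = np.zeros_like(log_likelihoods)
    hard[np.argmax(log_likelihoods)] = 1.0
    return hard

# 'speaking' clearly wins for this frame:
print(link_layers(np.array([-10.0, -13.0, -13.0])))              # ~ (0.9, 0.05, 0.05)
print(link_layers(np.array([-10.0, -13.0, -13.0]), soft=False))  # (1, 0, 0)
```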

4. Experiments: data and setup
59 meetings (30/29 train/test); four people, five minutes each
Scripted at the level of the action schedule, with natural behaviour within actions
Features extracted at 5 frames/s
Corpus: mmm.idiap.ch

Performance measures
Individual actions: frame error rate (FER)
Group actions: action error rate (AER)
AER = (Subs + Del + Ins) / (total actions), where Subs, Del, and Ins count substituted, deleted, and inserted actions against the target sequence.
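AER is computed from a minimum-edit-distance alignment between the recognized and target action sequences, exactly as word error rate is in ASR; a sketch (not the authors' scorer):

```python
def action_error_rate(reference, hypothesis):
    """(Subs + Del + Ins) / len(reference) via standard DP alignment."""
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # i deletions
    for j in range(m + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[n][m] / n

# One substitution in four target actions -> AER = 0.25
print(action_error_rate(
    ["discussion", "monologue-1", "presentation", "discussion"],
    ["discussion", "monologue-2", "presentation", "discussion"]))
```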

Results: individual actions
[Figure: frame error rate over time (frames, s) for visual-only, audio-only, and audio-visual I-HMMs, with multi-stream weights (0.8, 0.2).]
Audio-visual models capture asynchronous effects between the modalities.
Accuracy: speaking 96.6%, writing 90.8%, idle 81.5%.

Results: group actions
– Multi-modality outperforms single modalities
– The two-layer HMM outperforms the single-layer HMM for audio-only, visual-only, and audio-visual features
– Best model: the asynchronous HMM (A-HMM)
– Soft decision slightly better than hard decision
– 8% improvement, significant at the 96% level

Action-based meeting structuring

Conclusions
Structuring meetings as sequences of meeting actions:
– layered HMMs successful for recognition
– turn-taking patterns useful for browsing
– public dataset, standard evaluation procedures
Open issues:
– learning from less training data (unsupervised; ACM MM 2004)
– other relevant actions (interest level; ICASSP 2005)
– other features (words, emotions)
– efficient models for many interacting streams

Linking Two Layers (1)

Linking Two Layers (2): normalization
Please refer to: D. Zhang et al., “Modeling Individual and Group Actions in Meetings: a Two-Layer HMM Framework”, IEEE Workshop on Event Mining, CVPR, 2004.