1 Detecting Group Interest-level in Meetings
Daniel Gatica-Perez, Iain McCowan, Dong Zhang, and Samy Bengio
IDIAP Research Institute, Martigny, Switzerland

2 Outline
- The Goal
- Our Approach
- Meeting Corpus
- Audio-Visual Features
- Experiments: Performance Measures, Feature Selection, Results
- Conclusions

3 The Goal
- Extract relevant segments from meetings
- Relevant segments are defined by group interest-level, i.e. the degree of engagement in the participants' interactions

4 Our Approach
[Block diagram: microphones and cameras yield low-level AV features per participant (Person 1 ... Person N); these features feed the statistical models, an early-integration HMM and a multi-stream HMM]

5 Our Approach
- Early-integration HMM: audio and visual features are concatenated to form a single observation vector
- Multi-stream HMM: audio and visual streams are trained independently; their outputs are merged at the state level during decoding
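
The two schemes differ only in where the modalities are combined. A minimal sketch in Python/NumPy (the feature dimensions and the stream weight are illustrative assumptions, not values from the slides):

```python
import numpy as np

n_frames = 100
audio = np.random.randn(n_frames, 4)   # stand-in audio features (e.g. energy, pitch)
visual = np.random.randn(n_frames, 5)  # stand-in visual features (e.g. blob motion)

# Early integration: concatenate the modalities into one observation
# vector per frame and train a single HMM on the joint stream.
early_obs = np.concatenate([audio, visual], axis=1)  # shape (n_frames, 9)

def combined_state_loglik(ll_audio, ll_visual, w_audio=0.5):
    """Multi-stream fusion: per-state emission log-likelihoods from
    independently trained audio and visual HMMs are merged with a
    stream weight before Viterbi decoding. w_audio is an assumption."""
    return w_audio * ll_audio + (1.0 - w_audio) * ll_visual
```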

6 Meeting Corpus (mmm.idiap.ch)
- 50 meetings: 30 for training, 20 for testing
- Each meeting: 5 minutes, 4 participants
- Recorded from topic and action scripts
- The participants' behavior and emotions are natural

7 Annotating Group Interest-level
Interval coding scheme:
- (a) discrete scale: 1-5
- (b) 15-second interval unit
- (c) 2 independent annotators
Post-processing:
- (a) normalization (to correct for annotator bias)
- (b) analysis of inter-annotator agreement
- (c) averaging of the two annotators' ratings
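
The slides do not state which normalization is used; a common choice is to z-score each annotator's ratings before averaging, as in this hypothetical sketch:

```python
import numpy as np

def normalize_and_merge(scores_a, scores_b):
    """Remove per-annotator bias by z-scoring each annotator's interval
    ratings, then average the two normalized sequences."""
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    za = (scores_a - scores_a.mean()) / scores_a.std()
    zb = (scores_b - scores_b.mean()) / scores_b.std()
    return (za + zb) / 2.0
```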

8 Annotating Group Interest-level
[Timeline example: successive 15-second intervals rated 1 (NEUTRAL), 4 (HIGH), 3 (NEUTRAL), 5 (HIGH)]
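
The timeline suggests the averaged 1-5 ratings are binarized into NEUTRAL and HIGH segments. A hypothetical mapping consistent with the examples shown (the exact threshold is an assumption):

```python
def interval_label(avg_score, high_threshold=3.5):
    """Map an averaged 1-5 interest rating for an interval to a binary
    label. The threshold is illustrative; the slide only shows examples
    (1, 3 -> NEUTRAL; 4, 5 -> HIGH)."""
    return "HIGH" if avg_score >= high_threshold else "NEUTRAL"
```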

9 Audio-Visual Features
Visual (from skin-color blobs):
- head orientation
- right-hand orientation
- right-hand eccentricity
- head and hand motion
Audio:
- SRP-PHAT from the microphone array
- speech relative pitch from lapel microphones
- speech energy from lapel microphones
- speech rate from lapel microphones
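
SRP-PHAT localizes speakers by summing PHAT-weighted generalized cross-correlations (GCC-PHAT) over microphone pairs and candidate source positions. A sketch of the GCC-PHAT building block (a generic implementation, not IDIAP's):

```python
import numpy as np

def gcc_phat(x, y, fs):
    """PHAT-weighted generalized cross-correlation of two microphone
    signals; the peak location gives the time delay of arrival."""
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12          # PHAT: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)
    shift = np.argmax(np.abs(np.fft.fftshift(cc))) - n // 2
    return shift / fs                        # delay in seconds
```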

10 Performance Measures
- Nc: high interest-level frames correctly detected
- Nf: high interest-level frames falsely accepted
- Nd: high interest-level frames falsely rejected
precision = Nc / (Nc + Nf)
recall = Nc / (Nc + Nd)
- Expected Performance Curve (EPC): ep = alpha * precision + (1 - alpha) * recall
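
These definitions translate directly into code; a small sketch of the EPC operating point:

```python
def epc_point(n_correct, n_false_accept, n_false_reject, alpha):
    """Precision, recall, and the EPC operating point
    ep = alpha * precision + (1 - alpha) * recall."""
    precision = n_correct / (n_correct + n_false_accept)
    recall = n_correct / (n_correct + n_false_reject)
    return alpha * precision + (1.0 - alpha) * recall
```

At alpha = 0 the measure reduces to recall, at alpha = 1 to precision, and alpha = 0.5 weights the two equally, matching the three columns of the overall results table below.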

11 Feature Selection
Selected AV features (3 audio + 2 visual):
- Audio: speech energy, speaking rate, speech pitch
- Visual: person motion, head angle

12 Results (Single-modal vs. Multimodal)

13 Results (Single-stream vs. Multi-stream)

14 Overall Results
[Table: precision (pr) and recall (rc) at alpha = 0, alpha = 0.5, and alpha = 1 for four methods: Audio-only, Audio-only (feature fusion), MS-HMM, and MS-HMM (feature fusion); numeric values not transcribed]

15 Conclusions
- The audio modality is dominant
- Combining modalities improves performance in some operating regions
- Multi-stream HMMs outperform early integration
- Feature fusion at the group level is beneficial