Modelling and Analyzing Multimodal Dyadic Interactions Using Social Networks Sergio Escalera, Petia Radeva, Jordi Vitrià, Xavier Baró and Bogdan Raducanu.

Presentation transcript:

Modelling and Analyzing Multimodal Dyadic Interactions Using Social Networks Sergio Escalera, Petia Radeva, Jordi Vitrià, Xavier Baró and Bogdan Raducanu

Outline
1. Introduction
2. Audio-visual cue extraction and fusion
3. Social Network extraction and analysis
4. Experimental results
5. Conclusions and future work

1. Introduction
- Social interactions play a very important role in people's daily lives.
- Present trend: analysis of human behavior based on electronic communications: SMS, e-mails, chat.
- New trend: analysis of human behavior based on nonverbal communication: social signals.
- The quantification of social signals represents a powerful cue to characterize human behavior: facial expressions, hand and body gestures, focus of attention, voice prosody, etc.

Social Network Analysis (SNA) has been developed as a tool to model social interactions in terms of a graph-based structure:
- 'Nodes' represent the 'actors': persons, communities, institutions, etc.
- 'Links' represent a specific type of interdependency: friendship, familiarity, business transactions, etc.
A common way to characterize the information 'encoded' in a social network is to use several centrality measures.

Our contribution:
- In this work, we propose an integrated framework for the extraction and analysis of a social network from multimodal (audio-visual) dyadic interactions.*
- Its main advantage is that it is based on a completely non-intrusive technology.
- First, we perform speech segmentation through an audio-visual fusion scheme:
  - In the audio domain, speech is detected through clustering of audio features.
  - In the visual domain, speech is detected through differential-based feature extraction from the segmented mouth region.
  - The fusion scheme is based on stacked sequential learning.

* We used a set of videos belonging to the New York Times' Bloggingheads opinion blog. The videos depict two persons talking about different subjects in front of a webcam.

Block-diagram representation of our integrated framework
- Second, to quantify the dyadic interaction, we used the 'Influence Model', whose states encode the previously integrated audio-visual data.
- Third, the social network is extracted based on the estimated influences* and its properties are characterized using several centrality measures.

* The use of the term 'influence' is inspired by the previous work of Choudhury: T. Choudhury, "Sensing and Modelling Human Networks", Ph.D. Thesis, MIT Media Lab.

2. Audio-visual cue extraction and fusion

Audio cue – Description
- First 12 MFCC coefficients
- Signal energy
- Temporal cepstral derivatives (Δ and Δ²)
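As an illustrative sketch only (not the authors' exact implementation), these features could be computed with librosa; the 25 ms window with 50% overlap matches the settings given later in the experimental results, while the sampling rate and the numerical epsilon are assumptions.

```python
import numpy as np
import librosa

def audio_features(wav_path, sr=16000):
    """Sketch: 12 MFCCs + log-energy per 25 ms frame (50% overlap), plus Δ and Δ²."""
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft = int(0.025 * sr)                 # 25 ms analysis window
    hop = n_fft // 2                        # 50% overlap
    # Compute 13 MFCCs and drop the 0th coefficient to keep the first 12
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop)[1:]
    rms = librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop)
    log_energy = np.log(rms + 1e-8)         # signal energy on a log scale
    base = np.vstack([mfcc, log_energy])    # 13 static features per frame
    delta = librosa.feature.delta(base)             # Δ
    delta2 = librosa.feature.delta(base, order=2)   # Δ²
    return np.vstack([base, delta, delta2]).T       # frames × 39 feature matrix
```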

Audio cue – Diarization process
- Segmentation: coarse segmentation according to the Generalized Likelihood Ratio (GLR) between consecutive windows.
- Clustering: agglomerative hierarchical clustering with a BIC-based stopping criterion.
- Segment boundaries are adjusted at the end.
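The slides do not give the merging criterion in closed form; the following is a minimal sketch of the ΔBIC test commonly used in this kind of agglomerative speaker clustering, assuming full-covariance Gaussian models for each segment. The penalty weight is an assumed tuning parameter.

```python
import numpy as np

def delta_bic(x, y, penalty=1.0):
    """ΔBIC between modelling two segments separately vs. as one merged Gaussian.
    x, y: (n_frames, n_features) feature matrices; positive values suggest
    different speakers (do not merge)."""
    z = np.vstack([x, y])
    n_x, n_y, n_z = len(x), len(y), len(z)
    d = z.shape[1]

    def logdet_cov(a):
        # Log-determinant of the full covariance of a segment
        return np.linalg.slogdet(np.cov(a, rowvar=False))[1]

    # Generalized-likelihood-ratio term
    glr = 0.5 * (n_z * logdet_cov(z) - n_x * logdet_cov(x) - n_y * logdet_cov(y))
    # Model-complexity penalty (mean + full covariance parameters)
    p = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n_z)
    return glr - penalty * p
```

Agglomerative clustering would then repeatedly merge the pair of clusters with the lowest ΔBIC and stop once no pair has a ΔBIC below zero.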

Visual cue – Description
- Face segmentation based on the Viola-Jones detector
- Mouth region segmentation
- Vector of HOG descriptors for the mouth region
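A hedged sketch of this pipeline, assuming OpenCV's bundled Haar cascade for the Viola-Jones detector and scikit-image's HOG implementation; the lower-third mouth crop, the 64×32 resize and the HOG grid (chosen so the descriptor has 32 components, matching the count reported later) are illustrative assumptions rather than the authors' exact settings.

```python
import cv2
from skimage.feature import hog

# Viola-Jones frontal-face detector shipped with OpenCV
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def mouth_hog(frame_bgr):
    """Sketch: detect the face, crop the lower third as the mouth region,
    and describe it with a compact HOG vector (32 oriented features)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    mouth = gray[y + 2 * h // 3 : y + h, x : x + w]   # lower third of the face box
    mouth = cv2.resize(mouth, (64, 32))
    return hog(mouth, orientations=4,
               pixels_per_cell=(16, 16), cells_per_block=(1, 1))  # 8 cells × 4 bins = 32
```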

Visual cue – Classification
- Non-speech class modelling
- One-class Dynamic Time Warping (DTW), based on the dynamic-programming recursion sketched below
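The slide's equation was embedded as an image and is not reproduced in the transcript; for reference, the standard DTW dynamic-programming recursion it refers to is usually written as follows, where d(q_i, c_j) is a local distance between frame i of the query sequence and frame j of the non-speech model sequence:

```latex
% Standard DTW recursion (generic form, not necessarily the slide's exact notation)
D(i,j) = d(q_i, c_j) + \min\bigl\{\, D(i-1,j),\; D(i,j-1),\; D(i-1,j-1) \,\bigr\}
```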

Fusion scheme – Stacked sequential learning (suitable for problems characterized by long runs of identical labels)
- Fusion of the audio-visual modalities
- The temporal relations between both feature sets are captured by learning a two-stage classifier (based on AdaBoost)
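A minimal sketch of how such a two-stage scheme can look, assuming scikit-learn's AdaBoost as the base classifier and a symmetric window of 2w+1 first-stage predictions appended to each frame; the window size and estimator counts are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def window_predictions(pred, w=5):
    """Stack, for every frame, the 2*w+1 surrounding first-stage predictions."""
    padded = np.pad(pred, w, mode="edge")
    return np.array([padded[i:i + 2 * w + 1] for i in range(len(pred))])

def stacked_sequential_fit(X, y, w=5):
    """Sketch of stacked sequential learning with AdaBoost.
    X: (n_frames, n_features) fused audio-visual features; y: speech/non-speech labels."""
    h1 = AdaBoostClassifier(n_estimators=100).fit(X, y)
    z = h1.predict(X)                                   # first-stage frame labels
    X_ext = np.hstack([X, window_predictions(z, w)])    # append temporal label context
    h2 = AdaBoostClassifier(n_estimators=100).fit(X_ext, y)
    return h1, h2
```

In practice, the first-stage predictions used to extend the training set are usually obtained via cross-validation so that the second stage does not simply memorize them.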

3. Social Network extraction and analysis
- The Influence Model (IM) is a tool introduced for the quantification of interacting processes using a coupled Hidden Markov Model (HMM).
- In the case of social interaction, the states of the IM encode the automatically extracted audio-visual features.

Influence Model architecture: the coupling parameters represent the 'influences'.
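In the generic influence-model formulation (as in the work of Choudhury cited above), the next state of each interacting chain is governed by a convex combination of pairwise transition models; this is the standard form, not necessarily the slide's exact notation:

```latex
% Influence Model: the transition distribution of chain i mixes pairwise transition
% models, weighted by the 'influence' \alpha_{ij} that chain j exerts on chain i.
P\bigl(S_i^{t} \mid S_1^{t-1}, \dots, S_C^{t-1}\bigr)
  = \sum_{j=1}^{C} \alpha_{ij}\, P\bigl(S_i^{t} \mid S_j^{t-1}\bigr),
\qquad \alpha_{ij} \ge 0, \quad \sum_{j=1}^{C} \alpha_{ij} = 1 .
```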

- The construction of the social network is based on the 'influence' values.
- A directed link between two nodes A and B (denoted A → B) implies that 'A has influence over B'.
- The SNA is based on several centrality measures:
  - Degree centrality (indegree and outdegree): the number of direct connections with other persons.
  - Closeness centrality: how easily a person can communicate with (reach) the others.
  - Betweenness centrality: the relevance of a person as a 'bridge' between two sub-groups of the network.
  - Eigenvector centrality: the importance of a person in the network.
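As an illustration of how these measures can be obtained from the influence values, here is a sketch using networkx on a made-up directed graph; the participants and influence weights are purely hypothetical.

```python
import networkx as nx

# Hypothetical influence values: (source, target) -> "source has influence over target"
influences = {("A", "B"): 0.7, ("B", "A"): 0.3, ("A", "C"): 0.6, ("C", "B"): 0.5}

G = nx.DiGraph()
for (src, dst), w in influences.items():
    G.add_edge(src, dst, weight=w)

centrality = {
    "indegree":    nx.in_degree_centrality(G),
    "outdegree":   nx.out_degree_centrality(G),
    "closeness":   nx.closeness_centrality(G),
    "betweenness": nx.betweenness_centrality(G),
    "eigenvector": nx.eigenvector_centrality(G, max_iter=1000),
}
for name, values in centrality.items():
    print(name, values)
```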

4. Experimental results
- We collected a subset of videos from the New York Times' Bloggingheads opinion blog.
- We used 17 videos featuring 15 persons.
- The videos depict two persons having a conversation in front of their webcams on different topics (politics, economy, ...).
- The conversations have an informal character and frequent interruptions sometimes occur.

Snapshot from a video

- Audio features
  - The audio stream was analyzed using sliding windows of 25 ms with an overlap of 50%.
  - Each window is characterized by 13 features (12 MFCCs + energy), complemented with Δ and Δ².
  - The shortest length of a valid audio segment was set to 2.5 ms.
- Video features
  - 32 oriented features (corresponding to the mouth region) were extracted using the HOG descriptor.
  - The length of the DTW sequences was set to 18 frames (which corresponds to 1.5 s).
- Fusion process
  - Stacked sequential learning was used to fuse the audio-visual features.
  - AdaBoost was chosen as the classifier.

Visual and audio-visual speaker segmentation accuracy

The extracted social network, showing the participants' labels and influence directions

Centrality measures table

5. Conclusions and future work
- We presented an integrated framework for the automatic extraction and analysis of a social network from implicit input (multimodal dyadic interactions), based on the integration of audio-visual features.
- In the future, we plan to extend the current work to study social interactions at a larger scale and in different scenarios.
- Starting from the premise that people's lives are more structured than they might seem a priori, we plan to study long-term interactions between persons, with the aim of discovering underlying behavioral patterns present in our day-to-day existence.