Perceptual Analysis of Talking Avatar Head Movements: A Quantitative Perspective
Xiaohan Ma, Binh H. Le, and Zhigang Deng
Department of Computer Science, University of Houston

Motivation
Avatars have been increasingly used in human-computer interfaces: teleconferencing, computer-mediated communication, distance education, online virtual worlds, etc.
Human-like avatar gestures significantly influence human perception:
–Facial expressions
–Hand gestures
–Lip movements
–Head movements: one of the crucial visual cues to facilitate engaging social interaction and communication

How do talking head movements affect perception?

Our Quantitative Perspective
Uncover how talking avatar head movements affect human perception:
–User-rated naturalness of head animations
–Joint features extracted from the head animations (with audio): acoustic speech features and head motion patterns
–Quantitative analysis of the association between the extracted joint features and the user ratings
(Pipeline diagram: talking avatar head animations go through feature extraction and user evaluation; the resulting joint features and perception ratings are then analyzed for their association.)

Data Acquisition and Processing
Acquisition of the audio-head motion dataset:
–Head motion and speech were recorded simultaneously
–Head motion: optical motion capture system (120 Hz)
–Speech: microphone (48 kHz)
Processing of the captured audio-head motion dataset:
–Head motion: 3 Euler rotation angles per frame
–Speech: pitch and RMS energy
–Head and speech data aligned to the same frame rate (24 FPS); see the sketch below
(Figure: head rotation about the X, Y, and Z axes.)
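A minimal sketch of this feature extraction and alignment step, assuming librosa for pitch (pYIN) and RMS energy and a NumPy array of captured Euler angles; the function name, pitch range, and helper structure are illustrative assumptions, not taken from the slides.

```python
import numpy as np
import librosa

AUDIO_SR = 48_000   # microphone sample rate (from the slides)
MOCAP_FPS = 120     # optical motion capture rate (from the slides)
TARGET_FPS = 24     # common frame rate after alignment (from the slides)

def extract_aligned_features(wav_path, euler_angles_120hz):
    """Pitch, RMS energy, and head rotation angles aligned at 24 FPS."""
    hop = AUDIO_SR // TARGET_FPS                       # 2000 samples -> 24 frames/s
    y, _ = librosa.load(wav_path, sr=AUDIO_SR)

    # Acoustic features: fundamental frequency (pitch) and RMS energy per frame.
    f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=AUDIO_SR, hop_length=hop)
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]

    # Head motion: down-sample 120 Hz Euler angles to 24 FPS (keep every 5th frame).
    step = MOCAP_FPS // TARGET_FPS
    head = np.asarray(euler_angles_120hz)[::step]

    # Truncate all streams to the shortest length so the frames line up.
    n = min(len(f0), len(rms), len(head))
    return f0[:n], rms[:n], head[:n]
```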

Subjective Evaluation
Using the captured dataset, we generated 60 head animation clips:
–Based on 15 recorded speech clips
–4 different audio-head motion generation techniques
–Mosaic applied over the mouth region
User study:
–18 participants, ages 23~28
–Gender: female (16.67%), male (83.33%)
–Language: fluent English speakers
–User rating scale: 1~5
Head motion generation techniques:
–Original data: play back the captured motion
–HMMs [Busso et al. 05]
–Mood-Swings [Chuang et al. 05]
–Random: randomly generated

Speech-Head Motion Features and Perception
Measure the correlation between head motion and speech features using Canonical Correlation Analysis (CCA); a sketch follows below.
Pitch-head motion and human perception:
–Computed the Pearson coefficient between the per-clip CCA values and the user ratings
Energy-head motion and human perception:
–The relationship appears random, and is definitely not linear
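A minimal sketch of this correlation analysis, assuming scikit-learn's CCA and SciPy's pearsonr; per-clip inputs are assumed to come from the hypothetical extract_aligned_features() helper in the earlier sketch.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.cross_decomposition import CCA

def cca_coupling(speech_feat, head_angles):
    """First canonical correlation between one speech feature and head motion."""
    X = np.asarray(speech_feat, dtype=float).reshape(-1, 1)   # e.g. pitch or RMS energy
    Y = np.asarray(head_angles, dtype=float)                   # (frames, 3) Euler angles
    mask = ~np.isnan(X[:, 0])                                  # drop unvoiced/NaN pitch frames
    X, Y = X[mask], Y[mask]
    cca = CCA(n_components=1)
    Xc, Yc = cca.fit_transform(X, Y)
    return pearsonr(Xc[:, 0], Yc[:, 0])[0]

# Across all clips: correlate the per-clip CCA value with the mean user rating.
# pitch_cca = [cca_coupling(p, h) for p, h in zip(clip_pitches, clip_head_motion)]
# r, p_value = pearsonr(pitch_cca, clip_mean_ratings)
```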

Speech-Head Motion Features and Perception
Implications for CHI:
–Validates the tight coordination between speech and head motion: precise timing in generation is required, and delayed head movement generation may significantly degrade human perception.
–An approximately linear correlation between user ratings and the CCA for pitch-head motion: prosody-driven head motion synthesis could be fundamentally sound.
–No simple linear correlation between user ratings and the CCA for RMS energy-head motion: RMS energy may vary among sentences.

Frequency-Domain Analysis of Head Motion
–Head motion: rotation angles
–Frequency spectrum: FFT applied to the head rotation angle vector (see the sketch below)
Association between the head motion spectrum and human perception, examined for components with squared magnitude less than 5 degrees.
(Figure axes: X, average user rating (2.1 ~ 4.2); Y, squared magnitude of the three Euler rotation angles (0 ~ 5 degrees); Z, frequency spectrum (0 ~ 19 Hz).)
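A minimal sketch of this frequency-domain step: an FFT of the per-axis head rotation angles, here assumed to be taken at the 120 Hz capture rate, returning the squared magnitude per frequency bin. Function and variable names are illustrative.

```python
import numpy as np

def head_motion_spectrum(euler_angles, fps=120):
    """Return frequency bins and squared FFT magnitude for each rotation axis."""
    angles = np.asarray(euler_angles, dtype=float)      # shape (frames, 3)
    spectrum = np.fft.rfft(angles, axis=0)              # one-sided FFT per axis
    freqs = np.fft.rfftfreq(angles.shape[0], d=1.0 / fps)
    power = np.abs(spectrum) ** 2                        # squared magnitude
    return freqs, power
```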

Frequency-Domain Analysis of Head Motion
Key observations:
–Highly rated clips: low-frequency head motion; natural head motion lies below 10 Hz
–Lowly rated clips: high-frequency components, typically larger than 12 Hz, combined with a small range of head movements
Implications for HCI:
–The comfortable head motion frequency zone: 0~12 Hz
–Smooth post-processing for head motion generation of talking avatars: post-process the synthesized head motions by simply cropping the high-frequency part (a sketch follows below)
(Figure: example low-frequency vs. high-frequency head motion patterns.)
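A minimal sketch of the suggested smoothing post-process: low-pass filter synthesized head rotation angles by zeroing FFT bins above a cutoff, with 12 Hz assumed from the "comfortable zone" observation above. Names and the 120 Hz default rate are illustrative assumptions.

```python
import numpy as np

def crop_high_frequencies(euler_angles, fps=120, cutoff_hz=12.0):
    """Remove head-motion frequency components above cutoff_hz."""
    angles = np.asarray(euler_angles, dtype=float)      # shape (frames, 3)
    spectrum = np.fft.rfft(angles, axis=0)
    freqs = np.fft.rfftfreq(angles.shape[0], d=1.0 / fps)
    spectrum[freqs > cutoff_hz, :] = 0.0                 # crop the high-frequency part
    return np.fft.irfft(spectrum, n=angles.shape[0], axis=0)
```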

Conclusion and Future Work
Summary of our findings:
–The coupling between pitch and head motion has a strong linear correlation with human perception.
–Perceived-natural head motions consist mainly of low-frequency motion components; high-frequency components (>12 Hz) damage human perception significantly.
Future work:
–Multi-party conversation scenarios
–Analysis of other fundamental speech features: pauses, repetitions, etc.
Acknowledgments: This work is in part supported by NSF IIS, Texas Norman Hackerman Advanced Research, and research gifts from Google and Nokia.