Database Construction for Speech to Lip-readable Animation Conversion
Gyorgy Takacs, Attila Tihanyi, Tamas Bardi, Gergo Feldhoffer, Balint Srancsik
Peter Pazmany Catholic University, Faculty of Information Technology, 1083 Budapest, Práter u. 50/a., Hungary
Main concept of the project: a communication aid for deaf people
Deaf people have remarkable abilities in speech understanding based purely on lip reading. The gap between hearing and deaf persons can be bridged by such everyday equipment as a high-end second- or third-generation mobile phone. In our system, part of an animated human face is shown on the display as output for the deaf user; the animation is computed directly from the speech signal, with no human sign-language interpreter involved. A rather promising feature of our system is its potentially language-independent operation.
The system
System components
The input speech is sampled at 16 bit / 48 kHz, and acoustic feature vectors based on Mel-Frequency Cepstrum Coefficients (MFCC) are extracted from the signal. The feature vectors are sent to the neural network (NN), which computes a weighting vector [w1, …, w6] that is a compressed representation of the target animation frame. The coordinates of the selected feature point (FP) set used to drive the animation are obtained as a linear combination of our component vectors, with the weights coming from the neural network. The FP positions are computed in this way for 25 frames per second. The final component of the system is a modified LUCIA talking-head model.
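A minimal sketch of this processing chain is given below, assuming librosa for the MFCC front-end; the trained network nn, the PCA basis pca_basis, and the neutral-face vector neutral_face are hypothetical placeholders for the models described in the following sections, not the project's actual data.

```python
# Sketch of the speech-to-FP pipeline (assumptions noted in the comments).
import numpy as np
import librosa  # assumed here for MFCC extraction; any MFCC front-end would do

FPS = 25        # animation frame rate -> 40 ms per frame
SR = 48000      # input sampling rate (16 bit / 48 kHz)
N_MFCC = 16     # 16 cepstral coefficients per audio frame
CONTEXT = 5     # 5 consecutive frames fed to the network (80 inputs)

def speech_to_fp(signal, nn, pca_basis, neutral_face):
    """Convert a speech signal to feature point coordinates, frame by frame.

    nn           -- trained mapping from 80 MFCC values to 6 PCA weights
    pca_basis    -- (6, 30) matrix: the first six principal components
    neutral_face -- (30,) vector: FP coordinates of the neutral face
    All three stand in for the trained models of the paper.
    """
    # One MFCC vector per 40 ms animation frame.
    mfcc = librosa.feature.mfcc(y=signal, sr=SR, n_mfcc=N_MFCC,
                                hop_length=SR // FPS).T          # (frames, 16)
    frames = []
    for t in range(CONTEXT // 2, len(mfcc) - CONTEXT // 2):
        context = mfcc[t - CONTEXT // 2 : t + CONTEXT // 2 + 1].ravel()  # 80 values
        w = nn(context)                        # 6 PCA weights
        fp = neutral_face + w @ pca_basis      # linear combination -> 30 coordinates
        frames.append(fp.reshape(15, 2))       # 15 FPs, x-y each
    return np.array(frames)
```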
Database
We have decided to employ interpreters to record our audiovisual database. The database contains two-digit numbers, names of months, and names of days. The MPEG-4 standard describes the face with 86 feature points (FP); we selected 15 FPs around the mouth according to our preliminary results. We used commercially available video cameras with 720x576 resolution, 25 fps PAL format video, which means 40 ms per audio and video frame.

Marker tracking is automatic. Frames are binarized according to a statistically determined threshold value. A dilation step is performed to fuse each marker into a single patch, and step-by-step erosion then reduces each patch to a single pixel, which is taken as the centre of the FP.
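The tracking step can be sketched with scipy.ndimage as below; the threshold rule (mean plus two standard deviations) and the use of connected-component centres in place of the erosion-to-one-pixel step are illustrative simplifications, not the exact procedure used for the database.

```python
# Minimal marker-tracking sketch, assuming bright markers on a darker face.
import numpy as np
from scipy import ndimage

def track_markers(frame_gray):
    """Return approximate centres of the painted markers in one greyscale frame."""
    # Binarize with a statistically determined threshold (here: mean + 2*std).
    thr = frame_gray.mean() + 2 * frame_gray.std()
    binary = frame_gray > thr

    # Dilation fuses the pixels of each marker into one connected patch.
    fused = ndimage.binary_dilation(binary, iterations=2)

    # Label the patches and take their centres; iterative erosion down to a
    # single pixel, as described above, yields essentially the same points.
    labels, n_found = ndimage.label(fused)
    centres = ndimage.center_of_mass(fused, labels, range(1, n_found + 1))
    return np.array(centres)   # (n_found, 2) row-column coordinates
```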
Since each video frame has 15 FPs, one frame is a vector in a 30-dimensional space. Training of the system is much more efficient after compressing the 30 coordinate parameters into 6 weight parameters. The first 6 PCA vectors were selected as a basis, and the database was encoded in 6 dimensions.
The encoding is W = P·B − c, where P is the 30x30 matrix of PCA vectors ordered by eigenvalue, B is the set of 30-dimensional FP vectors, and c is the chosen origin, namely the PCA weights of the neutral face; only the first 6 components of W are kept. This data reduction causes a 1-3% error (1-2 pixel variance in the FP x-y coordinates), which is an acceptable approximation.
Figure: the FP positions expressed by the 1st, 2nd, 3rd and 4th principal components.
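The compression and its reconstruction can be checked with a few lines of numpy; the data below is synthetic and the variable names are illustrative, not the project's database.

```python
# PCA compression of 30-dimensional FP vectors into 6 weights, and the inverse
# reconstruction used for animation (synthetic data for illustration only).
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(5450, 30))      # stand-in for the FP database (frames x 30)
neutral = B[0]                       # stand-in for the neutral-face frame

# PCA basis: eigenvectors of the covariance matrix, ordered by decreasing eigenvalue.
eigval, eigvec = np.linalg.eigh(np.cov(B, rowvar=False))
P = eigvec[:, np.argsort(eigval)[::-1]].T      # 30x30, rows are principal components

c = neutral @ P.T                    # origin c: PCA weights of the neutral face
W = (B @ P.T - c)[:, :6]             # encoding W = P*B - c, first six weights kept

# Decoding: expand the six weights with the first six PCA vectors around the neutral face.
B_hat = neutral + W @ P[:6]
print("RMS reconstruction error:", np.sqrt(np.mean((B - B_hat) ** 2)))
```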
NN
A Matrix Back-Propagation algorithm (elaborated by David Anguita) was used to train the network [8]. The NN has 3 layers: the input layer has 80 elements, receiving 16 MFCCs from 5 frames; the hidden layer consists of 40 nodes; the output layer has 6 nodes, providing the 6 centered PCA weights that reproduce the FPs. The training file contained 5450 frames, and the network was trained by epochs. This NN model uses the interval [-1, 1] for both inputs and outputs, so the MFCC and PCA vectors are scaled linearly by the same value to fit the appropriate range, except for the energy coefficient of the MFCC.
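A small numpy stand-in for this network is sketched below; it uses plain batch back-propagation instead of the Matrix Back-Propagation tool of [8] and synthetic training data, so only the layer sizes and the [-1, 1] scaling follow the text.

```python
# 80-40-6 multilayer perceptron trained with plain batch back-propagation.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5450, 80))   # 5 frames x 16 MFCC, scaled to [-1, 1]
Y = rng.uniform(-1, 1, size=(5450, 6))    # 6 centered PCA weights, scaled to [-1, 1]

# Weight matrices of the 3-layer network (tanh keeps activations in [-1, 1]).
W1 = rng.normal(scale=0.1, size=(80, 40))
W2 = rng.normal(scale=0.1, size=(40, 6))
lr = 0.01

for epoch in range(100):                  # trained by epochs over the whole file
    H = np.tanh(X @ W1)                   # hidden layer, 40 nodes
    out = np.tanh(H @ W2)                 # output layer, 6 nodes
    err = out - Y

    # Back-propagate the error through both layers and update the weights.
    d_out = err * (1 - out ** 2)
    d_hid = (d_out @ W2.T) * (1 - H ** 2)
    W2 -= lr * H.T @ d_out / len(X)
    W1 -= lr * X.T @ d_hid / len(X)

print("final MSE:", np.mean(err ** 2))
```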
LUCIA
Conclusion
Visual word recognition, even in the case of a natural and professional speaker, has an error rate of about 3%. The animated face model controlled by 15 FP parameters that accurately follow the FP parameters of the interpreter/lip-speaker's face resulted in about 42% errors. Test discussions clarified that the visible parts of the tongue, and the movement of parts of the face other than the mouth, convey additional information that helps correct recognition; the face model itself probably also needs further improvement. The fundamental result of our system is that correct recognition decreases only by about 7% when the face-model control is changed completely from natural parameters to calculated parameters.