Acoustic to Articulatory Speech Inversion by Dynamic Time Warping


Acoustic to Articulatory Speech Inversion by Dynamic Time Warping

Robert Wielgat, Polytechnic Institute, State Higher Vocational School in Tarnów, Tarnów, POLAND
Anita Lorenc, Department of Speech Pathology and Applied Linguistics, Maria Curie-Sklodowska University, Lublin, POLAND

Presentation Plan

The presentation is divided into four parts:
- About electromagnetic articulography (EMA)
- Speech inversion by dynamic time warping
- Preliminary results of the speech inversion
- Conclusion and future research


Electromagnetic Articulography (EMA)

Electromagnetic articulography (EMA) is a technique for 3D imaging of the movements of the speech articulators (tongue, lips, palate, jaw). The movement is captured by small sensors (coils) fixed to the articulators, e.g. the tongue. The sensors move in an electromagnetic field produced by 6 transmitters, each of which generates an alternating magnetic field at a different frequency. The alternating magnetic fields induce alternating currents in the sensors, from which the distance of each sensor to each of the six transmitters is obtained. It is then possible to calculate, in real time, the XYZ coordinates of each sensor as well as two orientation angles.
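To make the last step concrete, here is a minimal sketch of how a sensor's XYZ position could be recovered from its six transmitter distances by nonlinear least squares. It is an illustration only, not the AG 500's actual algorithm: the transmitter layout and example distances are invented, and the real device additionally estimates the two orientation angles per sensor.

```python
import numpy as np
from scipy.optimize import least_squares

# Hypothetical transmitter positions (metres); the real AG 500 geometry differs.
TRANSMITTERS = np.array([
    [0.3, 0.0, 0.0], [-0.3, 0.0, 0.0],
    [0.0, 0.3, 0.0], [0.0, -0.3, 0.0],
    [0.0, 0.0, 0.3], [0.0, 0.0, -0.3],
])

def locate_sensor(distances, x0=np.zeros(3)):
    """Estimate a sensor's XYZ position from its distances to the six
    transmitters by solving the over-determined trilateration problem
    (6 equations, 3 unknowns) with nonlinear least squares."""
    def residuals(p):
        return np.linalg.norm(TRANSMITTERS - p, axis=1) - distances
    return least_squares(residuals, x0).x

# Example: distances to a sensor placed at (0.01, 0.02, 0.03)
true_pos = np.array([0.01, 0.02, 0.03])
d = np.linalg.norm(TRANSMITTERS - true_pos, axis=1)
print(locate_sensor(d))  # ~ [0.01 0.02 0.03]
```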

EMA Sensor Placement

REFERENCE SENSORS
- RE – behind right ear
- LE – behind left ear
- N – between nose and forehead

TONGUE SENSORS
- TT – tongue tip
- TF – tongue front
- TD – tongue dorsum
- TB – tongue back
- TLS – tongue left side

LIP and JAW SENSORS
- J – jaw
- LL – lower lip
- UL – upper lip

PALATE SENSOR
- P – sensor used for tracing the palate contour

The reference sensors were necessary to obtain the XYZ coordinates of each sensor in a Cartesian coordinate system associated with the speaker's head:
- X direction: front-back
- Y direction: left-right
- Z direction: up-down
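As an illustration of how the reference sensors can be used, the sketch below builds a head-centred Cartesian frame from the RE, LE and N positions and expresses an articulator sensor in it. The axis construction is a plausible choice under the convention above, not necessarily the one used in the study.

```python
import numpy as np

def head_frame(re, le, n):
    """Build a head-centred coordinate frame from the three reference
    sensors RE/LE (behind the ears) and N (between nose and forehead).

    Returns (origin, R) such that head_coords = R @ (p - origin).
    Axis convention (approximate): X front-back, Y left-right, Z up-down.
    """
    origin = (re + le) / 2.0                 # midpoint between the ears
    y = (le - re) / np.linalg.norm(le - re)  # left-right axis
    fwd = n - origin                         # roughly forward (and slightly up)
    x = fwd - (fwd @ y) * y                  # remove the left-right component
    x /= np.linalg.norm(x)                   # front-back axis (midsagittal)
    z = np.cross(x, y)                       # completes the orthonormal triad
    return origin, np.vstack([x, y, z])

def to_head_coords(p, origin, R):
    """Express a raw sensor position p in the head-centred frame,
    compensating for head movement."""
    return R @ (p - origin)
```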

EMA Sensor Placement in the Research

- 5 sensors on the tongue
- 2 sensors on the lips
- 1 sensor on the border of the lower incisors and gums
- 1 sensor for tracing the palate contour
- 3 reference sensors (placed on the forehead and on the bones behind the ears)

Block Diagram of the EMA Acquisition System

[Block diagram: the microphone array feeds a digital multichannel audio recorder, and the EMA sensors feed the Electromagnetic Articulograph AG 500; a synchronizer ties the two streams together, and a computer with a screen collects the data.]

Data Analysis – Feature Vectors

The acoustic parameters of speech are feature vectors of mel-frequency cepstral coefficients (MFCC). These vectors are necessary for acoustic-to-articulatory speech inversion. Each vector is accompanied by articulatory features: the XYZ coordinates of the EMA sensors.

[Diagram: MFCC vectors (Vector 1, Vector 2, ...) are computed from 21.3 ms analysis frames taken every 5 ms, so that each MFCC vector is paired with one set of XYZ coordinates of the EMA sensors.]
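A minimal sketch of how such MFCC vectors could be extracted is shown below, using librosa and a hypothetical recording `andrzej.wav`. The 21.3 ms window and 5 ms step follow the slide, so each MFCC vector lines up with one 200 Hz EMA sample (cf. the articulograph slide); the 16 kHz analysis rate and 13 coefficients are assumptions, not stated in the deck.

```python
import numpy as np
import librosa

# Hypothetical file; the recorder captures at 96 kHz, here downsampled to 16 kHz.
y, sr = librosa.load("andrzej.wav", sr=16000)
win = int(round(0.0213 * sr))   # 21.3 ms window -> 341 samples at 16 kHz
hop = int(round(0.005 * sr))    # 5 ms step      -> 80 samples

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=512, win_length=win, hop_length=hop)
mfcc = mfcc.T                   # one row per 5 ms frame

# ema: (n_frames, n_sensors * 3) array of XYZ coordinates sampled at 200 Hz;
# trim both streams to a common length so rows correspond one-to-one:
# n = min(len(mfcc), len(ema)); mfcc, ema = mfcc[:n], ema[:n]
```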

Speech Inversion by DTW Method
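The slide's diagram is not reproduced here. In outline, DTW-based inversion uses a reference utterance for which both MFCCs and EMA trajectories were recorded: the test utterance's MFCC sequence is aligned to the reference MFCCs by dynamic time warping, and the reference articulatory frames are transferred along the warping path. The sketch below is a minimal illustration under those assumptions, not the authors' exact procedure.

```python
import numpy as np

def dtw_path(test_feats, ref_feats):
    """Classic DTW between two MFCC sequences; returns the warping path
    as a list of (test_frame, ref_frame) index pairs."""
    n, m = len(test_feats), len(ref_feats)
    # Local distances between every pair of frames (Euclidean).
    dist = np.linalg.norm(test_feats[:, None, :] - ref_feats[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i-1, j-1] + min(acc[i-1, j], acc[i, j-1], acc[i-1, j-1])
    # Backtrack from (n, m) to (1, 1).
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i-1, j-1], acc[i-1, j], acc[i, j-1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def invert(test_mfcc, ref_mfcc, ref_ema):
    """Estimate articulatory trajectories for the test utterance by copying
    the reference EMA frames along the DTW alignment."""
    est = np.zeros((len(test_mfcc), ref_ema.shape[1]))
    for ti, rj in dtw_path(test_mfcc, ref_mfcc):
        est[ti] = ref_ema[rj]   # if several rj map to one ti, the last wins
    return est
```

A simple refinement, when several reference frames map to one test frame, is to average the corresponding EMA frames instead of keeping the last one.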

Articulograph

For the research, the AG-500 EMA articulograph was used:
- The positions of the sensors were recorded every 5 ms in the XYZ coordinate system.
- The accuracy of the measurement is 0.5 mm.
- The length of the vocal tract is about 16.7 cm.

Microphone Array

An array of 16 omnidirectional Panasonic WM-61 electret microphones was used for speech signal acquisition. For the present research, the signal from only one channel was used.

[Figure: frequency response of a single WM-61 microphone.]

16-channel Audio Recorder

[Photo: DSP board.]

16-channel Audio Recorder

Features:
- Simultaneous recording in 16 channels
- Resolution: 16 bits
- Sampling frequency: 96 kHz
- 32-bit floating-point digital signal processor with a Cortex-M4F core from NXP Semiconductors
- Two fast, linear 8-channel, 16-bit SAR A/D converters

[Photo: control board.]

DTW Speech Inversion – Preliminary Results

Results of acoustic-to-articulatory speech inversion for the Polish word "Andrzej". Errors are given in mm per sensor and direction: min – minimal absolute error, mean – root mean square error, max – maximal absolute error. Cells marked "–" are not recoverable from the transcript.

Sensor  Dir  min     mean    max    [mm]
UL      X    0.00    0.12    0.31
UL      Y    0.01    0.16    0.36
UL      Z    –       0.95    1.44
LL      –    –       0.78    2.31
LL      –    –       0.28    0.81
JAW     –    –       0.68    1.89
JAW     –    –       1.29    3.42
TB      X    –       2.90    11.24
TB      Y    0.02    1.82    3.17
TB      Z    –       1.97    3.43
TT      X    –       0.92    2.04
TT      Y    –       0.71    2.01
TT      Z    –       2.72    7.07
TD      –    –       1.06    2.36
TD      –    –       2.43    4.57
TF      X    0.03    2.50    6.11
TF      Y    0.00    6.33    13.16
TF      Z    –       2.66    11.04
TL      –    –       2.56    9.84
TL      Z    12.33   290.43  344.38
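The three error measures in the table can be reproduced per sensor and per direction with a short sketch like the following, where `est` and `ref` are hypothetical names for the estimated and measured trajectories, aligned frame by frame:

```python
import numpy as np

def inversion_errors(est, ref):
    """min/mean/max errors as defined on the slide, in the units of the
    input trajectories (here mm): minimal absolute error, root mean
    square error, and maximal absolute error."""
    err = np.abs(est - ref)
    return err.min(), np.sqrt(np.mean((est - ref) ** 2)), err.max()

# Example for one sensor/direction, e.g. the UL sensor in X:
# est_x, ref_x : 1-D arrays of estimated and measured positions in mm
# e_min, e_rms, e_max = inversion_errors(est_x, ref_x)
```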

DTW Speech Inversion – Main Sources of Errors

The errors were probably caused by:
- the measurement accuracy of the AG 500, reported as 0.5 mm (it can be worse if the EMA sensors are not firmly fixed to the speaker's head and articulators),
- intra-speaker variability of tongue position – the authors' unpublished research indicates that it can reach up to 5.8 mm between two realizations of the same phoneme by one speaker,
- errors in speech alignment by the DTW method.

The very high errors for sensor TL in the Z direction are almost certainly caused by noise during the acquisition of the EMA sensor signals or by sensor damage, because these errors exceed the size of the oral cavity in the Z direction, which is at most about 40–50 mm.

Future Research

Acoustic-to-articulatory speech inversion by HMM (Hidden Markov Models).

Summary

What has been done so far?
- Preliminary research on acoustic-to-articulatory speech inversion by dynamic time warping for the Polish word "Andrzej".

What is to be done?
- DTW speech inversion on a larger test set.
- Speech inversion by HMMs and Bayesian networks, using other features, for example video markers or images obtained from an acoustic camera.

Acknowledgment

This research was supported by grant No. 2012/05/E/HS2/03770, "Polish Language Pronunciation. Analysis Using 3-dimensional Articulography", with A. Lorenc as the principal investigator. The project is financed by the Polish National Science Centre on the basis of decision No. DEC-2012/05/E/HS2/03770.

Thank you for your attention