Published by Evangeline Matthews. Modified over 6 years ago.
1
Acoustic to Articulatory Speech Inversion by Dynamic Time Warping
Robert Wielgat
Polytechnic Institute, State Higher Vocational School in Tarnów, Tarnów, POLAND
Anita Lorenc
Department of Speech Pathology and Applied Linguistics, Maria Curie-Sklodowska University, Lublin, POLAND
2
Presentation Plan
The presentation is divided into four parts:
About electromagnetic articulography (EMA)
Speech inversion by dynamic time warping
Preliminary results of the speech inversion
Conclusion and future research
Sections: Introduction, Speech Inversion, Results and Discussion, Conclusion
6
Electromagnetic Articulography (EMA) Research
Electromagnetic Articulography (EMA) is 3D imaging of speech articulator movements (tongue, lips, palate, jaw). The movement of the articulators is visualised by means of small sensors (coils) fixed to the articulators, e.g. the tongue. The sensors move in an electromagnetic field produced by 6 transmitters. Each transmitter produces an alternating magnetic field at a different frequency. The alternating magnetic field induces an alternating current in the sensors, which makes it possible to obtain the distance of each sensor from each of the six transmitters. It is then possible to calculate, in real time, the XYZ coordinates as well as two angles of each sensor.
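The step from six transmitter distances to a sensor position can be illustrated with generic least-squares multilateration. This is only a sketch of the geometric idea and not the algorithm used by the articulograph: the transmitter coordinates are hypothetical, the function names are invented for illustration, and the two orientation angles are not estimated.

```python
def solve3(m, v):
    """Solve a 3x3 linear system m x = v by Gaussian elimination with pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            f = a[r][col] / a[col][col]
            for c in range(col, 4):
                a[r][c] -= f * a[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        s = sum(a[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = (a[r][3] - s) / a[r][r]
    return x


def locate_sensor(transmitters, distances):
    """Estimate a 3D position from distances to known transmitter positions.

    Each equation |p - t_i|^2 = d_i^2 is linearised by subtracting the
    equation of the first transmitter:
        2 (t_i - t_0) . p = |t_i|^2 - |t_0|^2 - d_i^2 + d_0^2
    """
    t0, d0 = transmitters[0], distances[0]
    rows, rhs = [], []
    for t, d in zip(transmitters[1:], distances[1:]):
        rows.append([2.0 * (t[k] - t0[k]) for k in range(3)])
        rhs.append(sum(c * c for c in t) - sum(c * c for c in t0)
                   - d * d + d0 * d0)
    # Normal equations (A^T A) p = A^T b give the least-squares solution.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * b for r, b in zip(rows, rhs)) for i in range(3)]
    return solve3(ata, atb)


# Hypothetical transmitter layout in mm (not the real device geometry).
TRANSMITTERS = [(300, 0, 0), (-300, 0, 0), (0, 300, 0),
                (0, -300, 0), (0, 0, 300), (0, 0, -300)]
```

With six distances and three unknowns the system is overdetermined, so measurement noise is averaged out by the least-squares fit rather than propagated from any single transmitter.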
7
EMA Sensor Placement
REFERENCE SENSORS
RE – behind right ear
LE – behind left ear
N – between nose and forehead
TONGUE SENSORS
TT – tongue tip
TF – tongue front
TD – tongue dorsum
TB – tongue back
TLS – tongue left side
LIP and JAW SENSORS
J – jaw
LL – lower lip
UL – upper lip
PALATE SENSOR
P – sensor used for making palate contour
The reference sensors were necessary in order to obtain the XYZ coordinates of each sensor in the Cartesian coordinate system associated with the speaker’s head.
X direction: front-back
Y direction: left-right
Z direction: up-down
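How three reference sensors can define a head-fixed Cartesian frame may be sketched as follows. The construction is an assumption made for illustration (origin midway between the ear sensors, Y from RE to LE, X towards N, Z completing a right-handed frame); the slide does not state which convention the authors used, and all function names are hypothetical.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(a):
    n = math.sqrt(dot(a, a))
    return tuple(x / n for x in a)


def head_frame(n, re, le):
    """Build a head-fixed frame from the N, RE and LE reference sensors."""
    origin = tuple((r + l) / 2.0 for r, l in zip(re, le))
    y = unit(sub(le, re))                      # left-right axis
    f = sub(n, origin)
    # Remove the left-right component so that X is orthogonal to Y.
    x = unit(tuple(fc - dot(f, y) * yc for fc, yc in zip(f, y)))
    z = cross(x, y)                            # up-down axis
    return origin, (x, y, z)


def to_head_coords(p, origin, axes):
    """Express a sensor position p in the head-fixed frame."""
    r = sub(p, origin)
    return tuple(dot(r, a) for a in axes)
```

Because every sensor position is re-expressed relative to the reference sensors, a rigid movement of the whole head leaves the head-frame coordinates of the articulators unchanged.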
8
EMA Sensor Placement in the Research
5 sensors on the tongue
2 sensors on the lips
1 sensor on the border of the lower incisors and gums
1 sensor for making the palate contour
3 reference sensors (placed on the forehead and on the bones behind the ears)
9
Block Diagram of the EMA Acquisition System
[Block diagram components: screen, microphone array, EMA sensors, computer, digital multichannel audio recorder, electromagnetic articulograph AG 500, synchronizer]
10
Data Analysis – Feature Vectors
The acoustic parameters of speech are feature vectors of mel-frequency cepstral coefficients (MFCC). These vectors are necessary for acoustic to articulatory speech inversion. Each vector is accompanied by articulatory features, namely the XYZ coordinates of the EMA sensors.
[Figure: time axis t [ms] showing MFCC vectors 1 to 5 (acoustic parameters) above the articulatory features (XYZ coordinates of EMA sensors); labeled intervals: 21.3 ms and 5 ms]
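The pairing of acoustic and articulatory features can be sketched in terms of frame timing. The sketch assumes that the 21.3 ms label denotes the MFCC analysis window and the 5 ms label the frame shift, and that each frame is paired with the EMA sample nearest to the window centre; these are assumptions, and the function names are hypothetical.

```python
import math

WINDOW_MS = 21.3   # assumed MFCC analysis window length
SHIFT_MS = 5.0     # assumed frame shift, equal to the EMA sampling interval


def count_frames(duration_ms, window_ms=WINDOW_MS, shift_ms=SHIFT_MS):
    """Number of complete analysis windows that fit into the recording."""
    if duration_ms < window_ms:
        return 0
    return math.floor((duration_ms - window_ms) / shift_ms) + 1


def ema_index_for_frame(k, window_ms=WINDOW_MS, shift_ms=SHIFT_MS):
    """Index of the EMA sample closest to the centre of MFCC frame k."""
    centre_ms = k * shift_ms + window_ms / 2.0
    return round(centre_ms / shift_ms)
```

Under these assumptions consecutive MFCC frames map to consecutive EMA samples, so each acoustic vector gets exactly one articulatory vector.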
11
Speech Inversion by DTW Method
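One common way to use dynamic time warping for inversion is sketched below: the MFCC sequence of a test utterance is aligned to the MFCC sequence of a reference recording with known EMA data, and the reference articulatory vectors are transferred along the warping path. This is a minimal illustration under that assumption, not necessarily the authors' exact procedure; the local distance, the step pattern and the function names are illustrative choices.

```python
import math


def dtw_path(test, ref):
    """Return the optimal warping path as a list of (test_idx, ref_idx)."""
    n, m = len(test), len(ref)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(test[i - 1], ref[j - 1])   # Euclidean local distance
            cost[i][j] = d + min(cost[i - 1][j],      # test frame advances
                                 cost[i][j - 1],      # reference frame advances
                                 cost[i - 1][j - 1])  # both advance
    # Backtrack from the end of both sequences to their first frames.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
        path.append((i - 1, j - 1))
    path.reverse()
    return path


def invert(test_mfcc, ref_mfcc, ref_artic):
    """Estimate articulatory vectors for each test frame via the DTW path.

    If several reference frames are aligned to one test frame, their
    articulatory vectors are averaged.
    """
    matched = [[] for _ in test_mfcc]
    for i, j in dtw_path(test_mfcc, ref_mfcc):
        matched[i].append(ref_artic[j])
    return [[sum(col) / len(rows) for col in zip(*rows)] for rows in matched]
```

The quality of the estimate depends directly on the alignment: a misaligned frame receives the articulatory vector of the wrong reference frame.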
12
Articulograph
The AG-500 EMA articulograph was used for the research.
The positions of the sensors were recorded every 5 ms in the XYZ coordinate system.
The accuracy of the measurement is 0.5 mm.
The length of the vocal tract is about 16.7 cm.
13
Microphone Array
An array of 16 omnidirectional Panasonic WM-61 electret microphones was used for speech signal acquisition. For the present research, the signal from only one channel was used.
[Figure: frequency response of a single WM-61 microphone]
14
16-channel Audio Recorder
DSP Board
15
16-channel Audio Recorder
Features:
Simultaneous recording in 16 channels
Resolution: 16 bits
Sampling frequency: 96 kHz
32-bit floating-point digital signal processor with a Cortex-M4F core, from NXP Semiconductors
Two 8-channel, 16-bit fast and linear SAR A/D converters
Control Board
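The listed figures imply a raw data rate that can be checked with simple arithmetic. The sketch below assumes uncompressed samples with no framing overhead; the function names are hypothetical, and the 5 ms interval is the EMA recording interval stated for the articulograph.

```python
def raw_data_rate(channels=16, bits=16, fs_hz=96_000):
    """Return the raw audio data rate as (bits per second, bytes per second)."""
    bits_per_second = channels * bits * fs_hz
    return bits_per_second, bits_per_second // 8


def audio_samples_per_ema_step(fs_hz=96_000, ema_step_ms=5):
    """Audio samples per channel recorded during one EMA sampling interval."""
    return fs_hz * ema_step_ms // 1000


print(raw_data_rate())               # (24576000, 3072000)
print(audio_samples_per_ema_step())  # 480
```

At full channel count the recorder therefore produces about 3 MB of raw audio per second, and each EMA position sample corresponds to 480 audio samples per channel.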
16
DTW Speech Inversion -Preliminary Results
Results of acoustic to articulatory speech inversion for the Polish word „Andrzej”
Errors [mm] by sensor and direction, given as min, mean, max (see explanation below):
UL: X 0,00 0,12 0,31; Y 0,01 0,16 0,36; Z 0,95 1,44
LL: 0,78 2,31 0,28 0,81
JAW: 0,68 1,89 1,29 3,42
TB: 2,90 11,24 0,02 1,82 3,17 1,97 3,43
TT: 0,92 2,04 0,71 2,01 2,72 7,07
TD: 1,06 2,36 2,43 4,57
TF: 0,03 2,50 6,11 0,0 6,33 13,16 2,66 11,04
TL: 2,56 9,84 12,33 290,43 344,38
Speech Inversion Errors
min: minimal absolute error
mean: root mean square error
max: maximal absolute error
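The three error measures defined on this slide can be computed for one sensor coordinate as in the sketch below. The function name and the input layout (two equal-length sequences of positions in mm) are assumptions made for illustration.

```python
import math


def inversion_errors(estimated_mm, measured_mm):
    """Return (min, mean, max) errors as defined on the slide.

    min  - minimal absolute error
    mean - root mean square error
    max  - maximal absolute error
    """
    abs_err = [abs(e - m) for e, m in zip(estimated_mm, measured_mm)]
    rms = math.sqrt(sum(a * a for a in abs_err) / len(abs_err))
    return min(abs_err), rms, max(abs_err)
```

Note that the "mean" column is a root mean square, so it weights large deviations more heavily than an arithmetic mean of the absolute errors would.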
17
DTW Speech Inversion – Main Sources of Errors
The errors were probably caused by:
the accuracy of the AG 500 measurement, reported as 0.5 mm (the error can be higher if the EMA sensors are not firmly fixed to the speaker’s head and articulators)
intraspeaker variability of tongue position: the authors' unpublished research indicates that it can reach up to 5.8 mm for two realizations of the same phoneme by one speaker
errors in speech alignment by DTW methods.
The very high errors for sensor TL in the Z direction are definitely caused by noise during acquisition of the EMA sensor signals or by EMA sensor damage, because the values of these errors exceed the size of the oral cavity in the Z direction, which is at most about 40 to 50 mm.
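The plausibility argument above amounts to a simple threshold check, sketched here. The 50 mm limit is the upper bound quoted on the slide; the function name is hypothetical, and the example values are taken from the results table.

```python
ORAL_CAVITY_MAX_MM = 50.0  # upper bound of the 40 to 50 mm range quoted above


def implausible_errors(errors_mm, limit_mm=ORAL_CAVITY_MAX_MM):
    """Return the errors that exceed the physical size of the oral cavity."""
    return [e for e in errors_mm if e > limit_mm]


print(implausible_errors([11.24, 13.16, 290.43, 344.38]))  # [290.43, 344.38]
```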
18
Future research
Acoustic to articulatory speech inversion by HMM (Hidden Markov Models)
19
Summary
What has been done so far?
Preliminary research on acoustic to articulatory speech inversion by dynamic time warping for the Polish word „Andrzej”
What is to be done?
DTW speech inversion for a larger testing set
Speech inversion by HMM and Bayesian nets
Using other features for speech inversion, for example video markers or images obtained from an acoustic camera
20
Acknowledgment
Research was supported by grant No. 2012/05/E/HS2/03770 titled “Polish Language Pronunciation. Analysis Using 3-dimensional Articulography”, with A. Lorenc as the principal investigator. The project is financed by the Polish National Science Centre on the basis of decision No. DEC-2012/05/E/HS2/03770.
21
Thank you for your attention