Acoustic and Linguistic Characterization of Spontaneous Speech

Masanobu Nakamura, Koji Iwano, and Sadaoki Furui
Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan

2 Introduction (1/2)

Background
- Present speech recognition technology achieves high recognition accuracy for read speech but rather poor accuracy for spontaneous speech.
- Improving recognition accuracy for spontaneous speech is therefore necessary.
- What are the differences between spontaneous and read speech?
- Why is the recognition accuracy for spontaneous speech low?

3 Introduction (2/2)

Goals
- Statistical and quantitative analysis of the acoustic and linguistic differences between spontaneous and read speech.
- Investigation of the acoustic and linguistic characteristics that affect speech recognition performance on spontaneous speech.

4 Corpus of Spontaneous Japanese (CSJ)

A large-scale spontaneous speech corpus
- Roughly 7M words (morphemes), with a total speech length of 650 hours
- Orthographic and phonetic transcriptions are given manually.

Speaking styles
- Academic presentations (AP): live recordings of academic presentations in the fields of engineering, social science, and humanities
- Extemporaneous presentations (EP): studio recordings of paid layman speakers' speech, given to a small audience in a relatively relaxed atmosphere; more informal than AP
- Dialogue speech (D): interviews, task-oriented dialogues, and free dialogues
- Read speech (R): readings of the AP or EP transcriptions by the same speakers

5 Disfluency ratio

Disfluencies comprise filled pauses (F), word fragments (W), and reduced articulation or mispronunciation (M). Approximately one-tenth of the words in the spontaneous speech in the CSJ are disfluencies, and the ratio of F is significantly higher than that of W or M.
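
A minimal sketch of how such per-category ratios could be computed, assuming a hypothetical transcript format in which each disfluent word carries an (F), (W), or (M) prefix tag (the tag format and example words are illustrative assumptions, not the CSJ's actual annotation scheme):

    from collections import Counter

    # Hypothetical tagged transcript: each disfluent word is annotated
    # with its category, e.g. "(F)eeto" for a filled pause.
    words = ["(F)eeto", "kyou", "wa", "(W)ha", "hanashi", "o", "shimasu"]

    counts = Counter()
    for w in words:
        for tag in ("F", "W", "M"):
            if w.startswith("(" + tag + ")"):
                counts[tag] += 1

    total = len(words)
    for tag in ("F", "W", "M"):
        print(tag, "ratio:", round(counts[tag] / total, 3))
    print("overall disfluency ratio:", round(sum(counts.values()) / total, 3))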

Acoustic characteristics

7 Acoustic feature extraction

1. 39-dimensional feature vectors
- 12-dimensional MFCCs, log-energy, and their first and second derivatives
- 25 ms window shifted every 10 ms
- CMS is applied to each utterance.
2. HMMs
- Monophone HMMs with a single Gaussian mixture
- Left-to-right topology with three self-loops
- Trained separately for every combination of phoneme, speaker, and utterance style
3. Acoustic features for each phoneme
- Mean and variance vectors of the 12-dimensional MFCCs at the 2nd state of the HMM

Target phonemes: 31 Japanese phonemes (10 vowels and 21 consonants)
- Vowels: /a, i, u, e, o, a:, i:, u:, e:, o:/
- Consonants: /w, y, r, p, t, k, b, d, g, j, ts, ch, z, s, sh, h, f, N, N:, m, n/
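
A minimal sketch of this 39-dimensional front-end in Python with librosa; the file name, sampling rate, and the choice of librosa itself are illustrative assumptions, since the slides do not specify the tooling:

    import numpy as np
    import librosa

    # 12 MFCCs + log-energy, 25 ms window / 10 ms shift, deltas and
    # delta-deltas, and per-utterance cepstral mean subtraction (CMS).
    y, sr = librosa.load("utterance.wav", sr=16000)  # illustrative file
    win, hop = int(0.025 * sr), int(0.010 * sr)      # 25 ms window, 10 ms shift

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=win, hop_length=hop, win_length=win)
    # Replace c0 with a log-energy term so we get 12 MFCCs + log-energy.
    rms = librosa.feature.rms(y=y, frame_length=win, hop_length=hop)
    feats = np.vstack([mfcc[1:], np.log(rms + 1e-10)])

    # First and second derivatives -> 39 dimensions in total.
    feats = np.vstack([feats,
                       librosa.feature.delta(feats, order=1),
                       librosa.feature.delta(feats, order=2)])

    # Cepstral mean subtraction over the utterance.
    feats -= feats.mean(axis=1, keepdims=True)
    print(feats.shape)  # (39, n_frames)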

8 Reduction ratio

Quantitative analysis of the spectral space reduction for spontaneous speech.

Definition:

    red_p(X) = || mu_p(X) - Av[mu_p(X)] || / || mu_p(R) - Av[mu_p(R)] ||

- mu_p(X): the mean vector of a phoneme p uttered with a speaking style X
- mu_p(R): the mean vector of the same phoneme in read speech
- Av: average over all phonemes (i.e., the center of the distribution of all phonemes)
- || . ||: Euclidean norm (distance)

[Figure: the mean vector of phoneme p and the center of the distribution of all phonemes, shown for a speaking style X and for read speech]
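
A minimal sketch of this computation, assuming per-phoneme MFCC mean vectors are already available as NumPy arrays (array names and the toy data are illustrative assumptions):

    import numpy as np

    # means_X, means_R: (n_phonemes, 12) arrays of per-phoneme MFCC mean
    # vectors for a spontaneous style X and for read speech R.
    def reduction_ratio(means_X, means_R):
        center_X = means_X.mean(axis=0)   # center of style X's distribution
        center_R = means_R.mean(axis=0)   # center of read speech's distribution
        num = np.linalg.norm(means_X - center_X, axis=1)
        den = np.linalg.norm(means_R - center_R, axis=1)
        return num / den                  # red_p(X) per phoneme; 1 = no reduction

    rng = np.random.default_rng(0)
    means_R = rng.normal(size=(31, 12))   # 31 phonemes, 12-dim MFCC means
    means_D = 0.7 * means_R               # toy "reduced" dialogue space
    print(reduction_ratio(means_D, means_R).mean())  # ~0.7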

9 Reduction ratio averaged over 10 speakers

The MFCC space is reduced for almost all the phonemes, and the reduction is most significant for dialogue utterances. (red_p(X) = 1 corresponds to no reduction relative to read speech.)

10 Reduction ratio averaged over vowels and consonants

A reduction of the distribution of spontaneous speech in comparison with read speech is observed for all the speaking styles, and it is most significant for dialogue speech.

11 Between-phoneme distances

The reduction of the distance between each phoneme pair in the MFCC space is measured using the Mahalanobis distance.

[Figure: phonemes such as /a, r, n, u, m, k/ plotted in the MFCC space, with the Mahalanobis distance measured between each phoneme pair]

12 Mahalanobis distance

Mahalanobis distance D_ij(X) between phonemes i and j:

    D_ij(X) = sqrt( sum_{k=1..K} (mu_ik - mu_jk)^2 / (sigma_ik * sigma_jk) )

- K: dimension of the MFCC vector (K = 12)
- mu_ik and sigma_ik^2: the kth elements of the mean and variance vectors of the MFCCs for phoneme i uttered with speaking style X
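
A minimal sketch of this distance evaluated over all phoneme pairs, assuming per-phoneme mean and variance vectors are available as NumPy arrays (names and toy data are illustrative assumptions):

    import numpy as np

    # mu, var: (n_phonemes, 12) arrays of per-phoneme MFCC mean and
    # variance vectors for one speaking style.
    def mahalanobis_matrix(mu, var):
        n = len(mu)
        D = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                # Per-dimension squared difference, normalized by
                # sigma_ik * sigma_jk (sigma = sqrt of variance).
                d2 = (mu[i] - mu[j]) ** 2 / (np.sqrt(var[i]) * np.sqrt(var[j]))
                D[i, j] = np.sqrt(d2.sum())
        return D  # symmetric: D[i, j] == D[j, i]

    rng = np.random.default_rng(0)
    mu = rng.normal(size=(31, 12))
    var = rng.uniform(0.5, 1.5, size=(31, 12))
    D = mahalanobis_matrix(mu, var)
    print(D.shape, np.allclose(D, D.T))  # (31, 31) True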

13 Cumulative frequency distribution of Mahalanobis distances

Mahalanobis distances between every phoneme pair are computed for each speaking style. The Mahalanobis distance between phonemes decreases as the spontaneity of the utterances increases: the more spontaneous the utterances become, the more the cepstrum space is reduced.

[Figure: cumulative frequency curves shifting toward smaller distances as spontaneity increases]

14 Relationship between phoneme distances and phoneme accuracy (1/2)

Investigation of the relationship between mean phoneme distances and phoneme recognition accuracy.

Acoustic model
- A common model for all speaking styles
- Trained on data from 100 males and 100 females for AP, and 150 males and 150 females for EP (about 2M phoneme samples each)

Language model
- A phoneme network constrained by phoneme-class probabilities

15 Relationship between phoneme distances and phoneme accuracy (2/2)

There is a strong correlation between the mean phoneme distance and phoneme accuracy (correlation coefficient: 0.97). The reduction of the distances between phonemes is therefore a major factor contributing to the degradation of spontaneous speech recognition accuracy.
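
As a toy illustration of the reported measure (the numbers below are made up, not the paper's data), such a correlation coefficient can be computed directly:

    import numpy as np

    # Made-up mean phoneme distances and phoneme accuracies for four
    # styles (R, AP, EP, D), for illustration only.
    mean_distance = np.array([4.1, 3.6, 3.4, 2.9])
    accuracy      = np.array([80.0, 71.0, 68.0, 57.0])
    r = np.corrcoef(mean_distance, accuracy)[0, 1]
    print("correlation coefficient:", round(r, 2))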

Linguistic characteristics

17 Written text and spontaneous speech corpora

- Mainichi newspaper (NP): written text corpus
- News commentary (NC): transcriptions of utterances spoken from prepared text
- Academic presentations (AP), in the CSJ
- Extemporaneous presentations (EP), in the CSJ
- Dialogue (D), in the CSJ

18 Part-of-speech observation frequency

The frequency of nouns is much higher in the newspaper corpus than in the spontaneous speech, while the frequency of fillers is much higher in the dialogue than in the news commentary and presentations.

[Figure: observation frequencies of nouns and fillers for each corpus]

19 Perplexity matrix

Trigram language models are built for each speaking style, and test-set perplexity is measured for every combination of the styles. The matched-style test-set perplexity (the diagonal elements of the matrix) for spontaneous speech is roughly five times larger than that for written newspaper text.
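
A minimal sketch of the test-set perplexity computation, assuming a hypothetical logprob(w, h) function that returns a trained trigram model's log-probability of word w given history h (the function name and the toy model are illustrative assumptions, not the authors' tooling):

    import math

    def test_set_perplexity(test_words, logprob):
        # PP = exp( -(1/N) * sum_i log P(w_i | w_{i-2}, w_{i-1}) )
        total, n = 0.0, 0
        for i, w in enumerate(test_words):
            h = tuple(test_words[max(0, i - 2):i])  # up to two words of history
            total += logprob(w, h)
            n += 1
        return math.exp(-total / n)

    # Toy uniform model over a 1000-word vocabulary: PP is exactly 1000.
    uniform = lambda w, h: math.log(1.0 / 1000)
    print(test_set_perplexity(["a", "b", "c", "d"], uniform))  # 1000.0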

20 Distance matrix for visualization

To visualize the relationships between the language models, the perplexity matrix is symmetrized into a distance matrix:

    PP matrix (PP(a_ij)) -> symmetrization -> distance matrix D(d_ij) -> visualization

[Symmetrization equation shown on slide]

21 Correction

Equation (3) in the paper is wrong. The correct equation (3) is as follows:

[Corrected equation shown on slide]

22 Differences between the language models

The relationships between the language models are projected onto a two-dimensional space derived from the distance matrix using multidimensional scaling (MDS). Newspaper text and dialogue are situated at the two extreme positions, with the presentations and news commentary in between.
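
A minimal sketch of this projection step using scikit-learn's MDS on a precomputed distance matrix; the style labels follow the slides, but the distance values below are made up for illustration:

    import numpy as np
    from sklearn.manifold import MDS

    styles = ["NP", "NC", "AP", "EP", "D"]
    # Illustrative symmetric 5x5 distance matrix over the five styles.
    D = np.array([[0.0, 1.0, 1.5, 1.8, 2.5],
                  [1.0, 0.0, 0.8, 1.1, 1.9],
                  [1.5, 0.8, 0.0, 0.5, 1.4],
                  [1.8, 1.1, 0.5, 0.0, 1.0],
                  [2.5, 1.9, 1.4, 1.0, 0.0]])

    # Project the distance matrix onto two dimensions.
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    coords = mds.fit_transform(D)
    for s, (x, y) in zip(styles, coords):
        print(s, round(x, 2), round(y, 2))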

23 Relationship between perplexity and word accuracy (1/2)

Investigation of the relationship between test-set perplexity and word accuracy.

Acoustic model
- A common model for all speaking styles
- Trained on data from 10 males and 10 females for each speaking style (about 750K phoneme samples)

Language models
- Separate models for each speaking style

24 Relationship between perplexity and word accuracy (2/2)

Comparing the test-set perplexity (the diagonal elements of the PP matrix) with word accuracy, the experimental results show a high correlation of -0.98 between test-set perplexity and recognition accuracy across the different speaking styles.

25 Conclusion (1/2)

Clarified the differences in acoustic and linguistic characteristics between spontaneous and read speech.

Acoustic characteristics
- The spectral distribution of spontaneous speech is reduced in comparison with that of read speech.
- The more spontaneous the speech, the smaller the distances between phonemes.
- There is a high correlation between the mean phoneme distance and the phoneme recognition accuracy.
- Spontaneous speech can be characterized by the reduction of its spectral space in comparison with read speech, and this is one of the major factors contributing to the decrease in phoneme recognition accuracy.

26 Conclusion (2/2)

Linguistic characteristics
- The perplexity of language models for spontaneous speech is significantly higher than that for written text; spontaneous speech frequently includes ungrammatical phenomena and linguistic variations such as repetitions and repairs.
- There is a high correlation between the test-set perplexity and the word recognition accuracy.
- The increase in test-set perplexity for spontaneous speech is one of the major factors contributing to the decrease in word recognition accuracy.

27 Future research

Analysis over wider ranges of spontaneous speech, using utterances other than those included in the CSJ
- Is the relationship between phoneme distances and phoneme recognition accuracy general?
- Is the relationship between test-set perplexity and word recognition accuracy general?

How to handle filled pauses, repairs, hesitations, repetitions, partial words, and other disfluencies in spontaneous speech recognition

Investigation of how the results obtained in this paper can be used to improve recognition performance for spontaneous speech
- Creating methods for adapting acoustic and language models to spontaneous speech

Thank you very much for your kind attention!