-1- Stress Detection
J.-S. Roger Jang (張智星)
MIR Lab, CSIE Dept., National Taiwan Univ.
2015/9/13


-2- Intro to Stress Detection
- Stress detection (SD) for English
  - Given an English word and its pronunciation
  - Detect the stress position of the pronunciation
- Applications
  - Computer-assisted pronunciation training (CAPT)
- Similar to...
  - Tone recognition in Mandarin Chinese
  - Intonation scoring

-3- Examples of Stress in English Words
- Every multisyllabic English word has a stressed syllable
- Examples
  - Dictionary: stressed at syllable 1
  - Tomorrow: stressed at syllable 2
  - International: stressed at syllable 3

-4- Steps in Stress Detection
- Preprocessing
  - Use forced alignment to find vowel locations
- Feature extraction
  - Extract features for each vowel
- Model construction
  - Build a classifier for vowel-based stress detection
- Post-processing
  - Combine the vowel-level results into a word-level stress decision

-5- Forced Alignment (1/2)
- A process that aligns an utterance with its corresponding canonical phonetic transcription
- Example: International

-6- Forced Alignment (2/2)
- Applications of forced alignment
  - Speech scoring (based on timbre only)
  - Utterance verification
- Our forced alignment engine
  - ASRA (Automatic Speech Recognition & Assessment): for voice command recognition and speech assessment (scoring)

-7- Corpora for Stress Detection
- Merriam-Webster dictionary
  - Website
- Some statistics
  - # pronunciations:
  - Usable files:
    - No. of syllables > 1
    - Available in our dictionary
    - Valid output from ASRA
- In-house recordings
  - Recordings from MSAR collected over several years
  - Available upon request

-8- Speech Corpus for Lexical Stress Detection
- Merriam-Webster Online Dictionary's lexical pronunciations
  - All utterances are pronounced by native speakers
- (Table: utterance counts by stress position and number of syllables, with totals; cell values not recoverable)
- Corpus statistics
  - Total utterances: 14992
  - Total syllables: 43212
  - Stressed syllables: 14992
  - Unstressed syllables: 28220
  - Stressed : Unstressed = 1 : 1.9
- Audio format
  - Sample rate: 16000 Hz
  - Resolution: 16 bits
  - Channel: mono
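The corpus counts above can be cross-checked with a few lines of arithmetic. The sketch below only reuses the numbers on the slide. The "majority baseline" (always answering "unstressed" for every vowel) is not from the slides; it is added here as an assumed reference point for reading vowel-level accuracies.

```python
# Counts copied from the corpus slide.
total_utterances = 14992
stressed = 14992      # one stressed syllable per utterance
unstressed = 28220

total_syllables = stressed + unstressed

# Class ratio reported on the slide as "1 : 1.9".
unstressed_per_stressed = unstressed / stressed

# Assumed reference point (not in the slides): accuracy of a vowel-level
# classifier that always predicts "unstressed".
majority_baseline = unstressed / total_syllables

# Average word length in syllables.
mean_syllables_per_word = total_syllables / total_utterances

print(f"total syllables: {total_syllables}")
print(f"stressed : unstressed = 1 : {unstressed_per_stressed:.2f}")
print(f"majority baseline: {majority_baseline:.1%}")
print(f"mean syllables per word: {mean_syllables_per_word:.2f}")
```

Because the two classes are imbalanced, per-class results are more informative than a single vowel-level accuracy.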

-9- Stress Detection based on Vowel Classification
- SD is based on vowel classification due to the following observations
  - Each word has a stressed syllable
  - Each syllable is usually composed of a consonant and a vowel
  - Vowels are always voiced (have pitch)
- Therefore
  - Each vowel is classified as "stressed" or "unstressed"
  - To determine the stressed syllable in an utterance, pick the vowel with
    - Max likelihood of the class "stressed", or
    - Min likelihood of the class "unstressed", or
    - Max difference between the two
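A minimal sketch of the three word-level decision rules listed above, assuming the vowel classifier returns one log-likelihood per class for each vowel. The function name, the rule names, and the example scores are hypothetical and chosen only for illustration.

```python
def pick_stressed_syllable(ll_stressed, ll_unstressed, rule="difference"):
    """Return the 0-based index of the vowel chosen as stressed.

    ll_stressed[i] and ll_unstressed[i] are the (hypothetical) log-likelihoods
    of vowel i under the "stressed" and "unstressed" classes.
    """
    if not ll_stressed or len(ll_stressed) != len(ll_unstressed):
        raise ValueError("need one score per class for every vowel")

    if rule == "max_stressed":
        scores = list(ll_stressed)
    elif rule == "min_unstressed":
        # Negate so that the smallest "unstressed" likelihood wins.
        scores = [-u for u in ll_unstressed]
    elif rule == "difference":
        scores = [s - u for s, u in zip(ll_stressed, ll_unstressed)]
    else:
        raise ValueError(f"unknown rule: {rule}")

    return max(range(len(scores)), key=scores.__getitem__)


# Made-up scores for the three vowels of "tomorrow" (stress on syllable 2).
ll_s = [-4.1, -2.0, -3.5]
ll_u = [-1.5, -3.2, -1.9]
for rule in ("max_stressed", "min_unstressed", "difference"):
    print(rule, "-> syllable", pick_stressed_syllable(ll_s, ll_u, rule) + 1)
```

Choosing exactly one vowel per word builds in the first observation on the slide (each word has one stressed syllable), which independent per-vowel decisions would not guarantee.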

-10- Features for Vowels
- Vowel-based features
  - Pitch: min, mean, max, range, std, slope, etc.
  - Volume: min, mean, max, range, std, slope, etc.
  - Duration (normalized by speech rate)
  - Legendre polynomial fitting for pitch & volume
  - Spectrally emphasized versions of the above
  - ...
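The contour features above can be sketched in pure Python. This is an illustration under stated assumptions, not the feature extractor used in the slides: the contour is assumed to be a list of per-frame values inside one vowel, the slope is a least-squares line over frame indices, and the Legendre coefficients come from an ordinary least-squares fit on frame positions rescaled to [-1, 1]. The function names and the fit order are hypothetical.

```python
import statistics


def contour_stats(values):
    """Basic statistics of a per-frame contour (pitch or volume), len >= 2."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = statistics.fmean(values)
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    return {
        "min": min(values),
        "mean": mean_y,
        "max": max(values),
        "range": max(values) - min(values),
        "std": statistics.pstdev(values),
        "slope": sxy / sxx,  # change per frame
    }


def legendre_basis(t, order):
    """Values P_0(t)..P_order(t) via the three-term recurrence."""
    p = [1.0, t]
    for k in range(1, order):
        p.append(((2 * k + 1) * t * p[k] - k * p[k - 1]) / (k + 1))
    return p[: order + 1]


def legendre_fit(values, order=3):
    """Least-squares Legendre coefficients of a contour."""
    n = len(values)
    if n < 2 or n <= order:
        raise ValueError("contour too short for the requested order")
    rows = [legendre_basis(-1 + 2 * i / (n - 1), order) for i in range(n)]
    m = order + 1
    # Augmented normal equations [A | b] with A = B^T B and b = B^T y.
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(m)]
        + [sum(r[i] * y for r, y in zip(rows, values))]
        for i in range(m)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(m):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [a[i][m] / a[i][i] for i in range(m)]


# Made-up pitch contour: a straight rise from 90 to 110 over 11 frames.
pitch = [100 + 10 * (-1 + 2 * i / 10) for i in range(11)]
print(contour_stats(pitch))
print([round(c, 3) for c in legendre_fit(pitch, order=3)])
```

The low-order coefficients summarize contour level, tilt, and curvature in a fixed-length vector regardless of vowel duration, which is convenient for a classifier that expects fixed-size input.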

-11- Lexical Stress Detection – Experiment 1
- Feature set
  - E: Root mean square energy
  - D: Duration
  - P: Pitch
  - S: Root mean square spectral emphasis energy
  - PS: Pitch slope
  - CE: Legendre coefficients of root mean square energy contour
  - CP: Legendre coefficients of pitch contour
  - CS: Legendre coefficients of spectral emphasis energy contour
- 10-fold cross validation
- Classifier: SVM
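The 10-fold cross validation named above can be sketched as follows. The slides use an SVM; to keep the example dependency-free, a toy one-feature threshold classifier stands in for it, so only the fold logic should be read as the technique. All function names and the data are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def cross_validated_accuracy(features, labels, fit, k=10):
    """fit(train_x, train_y) must return a predict(x) function."""
    correct = 0
    for train, test in k_fold_indices(len(labels), k):
        predict = fit([features[i] for i in train], [labels[i] for i in train])
        correct += sum(predict(features[i]) == labels[i] for i in test)
    return correct / len(labels)


def fit_threshold(train_x, train_y):
    """Toy stand-in for the SVM: threshold halfway between class means."""
    mean0 = sum(x[0] for x, y in zip(train_x, train_y) if y == 0) / train_y.count(0)
    mean1 = sum(x[0] for x, y in zip(train_x, train_y) if y == 1) / train_y.count(1)
    threshold = (mean0 + mean1) / 2
    return lambda x: int(x[0] > threshold)


# Made-up, well-separated data: label 1 ("stressed") has larger feature values.
features = [[float(i)] for i in range(10)] + [[100.0 + i] for i in range(10)]
labels = [0] * 10 + [1] * 10
print(cross_validated_accuracy(features, labels, fit_threshold, k=10))
```

One design point worth noting: vowels from the same word are correlated, so an implementation may prefer to assign whole words to folds rather than individual vowels. The slides do not say which was done.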

-12- Confusion matrices by number of syllables (each row sums to about 100%)

3-syllable words
        1st      2nd      3rd
1st     96.08%   3.10%    0.83%
2nd     8.28%    86.58%   5.14%
3rd     31.90%   5.75%    62.36%

4-syllable words
        1st      2nd      3rd      4th
1st     96.13%   2.37%    1.51%    0%
2nd     8.91%    87.76%   2.34%    0.98%
3rd     21.62%   2.46%    73.95%   0.97%
4th     38.24%   5.88%    2.94%    52.94%

5-syllable words
        1st      2nd      3rd      4th      5th
1st     100% 0% (remaining cells not recoverable)
2nd     8.16%    88.44%   2.72%    0.68%    0%
3rd     19.33%   1.78%    76.67%   1.78%    0.44%
4th     13.64%   13.22%   2.48%    70.66%   0%
5th     100% 0% (remaining cells not recoverable)

2-syllable words
        1st      2nd
1st     95.13%   4.87%
2nd     25.67%   74.33%
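Tables like the ones above are typically produced by counting (true position, detected position) pairs and normalizing each row. The sketch below assumes rows are the ground-truth stress position and columns the detected position. The slides do not label the axes, so this is an assumption, and the sample data is made up.

```python
def row_normalized_confusion(true_pos, detected_pos, n_syllables):
    """Return an n x n matrix of row percentages.

    Positions are 1-based syllable indices. Assumed layout (not stated in
    the slides): rows = ground truth, columns = detected position.
    """
    counts = [[0] * n_syllables for _ in range(n_syllables)]
    for t, d in zip(true_pos, detected_pos):
        counts[t - 1][d - 1] += 1

    percent = []
    for row in counts:
        total = sum(row)
        # A row with no samples is reported as all zeros.
        percent.append([100.0 * c / total if total else 0.0 for c in row])
    return percent


# Made-up results for eight 2-syllable words.
truth = [1, 1, 1, 1, 2, 2, 2, 2]
detected = [1, 1, 1, 2, 1, 2, 2, 2]
for label, row in zip(("1st", "2nd"), row_normalized_confusion(truth, detected, 2)):
    print(label, " ".join(f"{p:6.2f}%" for p in row))
```

Row percentages hide how many words each row is based on, so a row showing exactly 100% may rest on very few samples.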

-13- Lexical Stress Detection – Experiment 2
- 10-fold cross validation
- Classifier: SVM
- Syllable-number-independent classifier vs. syllable-number-dependent classifier
- Feature set
  - Max. root mean square energy
  - Mean root mean square energy
  - Max. pitch
  - Median pitch
  - Duration
  - Max. spectral emphasis root mean square energy
  - Mean spectral emphasis root mean square energy
  - Pseudo-slope of pitch contour
  - Legendre polynomial coefficients of pitch contour
  - Legendre polynomial coefficients of RMS energy contour
  - Legendre polynomial coefficients of spectral emphasis RMS energy

-14- Lexical Stress Detection – Experiment 3
- Classifiers
  - GMMC: Gaussian mixture model classifier
  - NBC: Naïve Bayes classifier
  - QC: Quadratic classifier
  - SVMC: Support vector machine classifier
- Feature set
  - Max. root mean square energy
  - Mean root mean square energy
  - Max. pitch
  - Median pitch
  - Duration
  - Max. spectral emphasis root mean square energy
  - Mean spectral emphasis root mean square energy
  - Pseudo-slope of pitch contour
  - Legendre polynomial coefficients of pitch contour
  - Legendre polynomial coefficients of RMS energy contour
  - Legendre polynomial coefficients of spectral emphasis RMS energy
- 10-fold cross validation

-15- Lexical Stress Detection – Error Analysis
- Error types (each example lists the word, its number of syllables, and its pronunciation):
  1. Wrong ground truth / more than one pronunciation of the word
     - conduct 2 [kənˋdʌkt] / [ˋkɑndʌkt]
  2. Complex word with 2 primary stressed syllables
     - worldwide 2 [`wɝld`waɪd]
     - histochemistry 5 [ˋhɪstəˋkɛmɪstrɪ]
  3. Word with a primary stressed and a secondary stressed syllable
     - deposition 4 [͵dɛpəˋzɪʃən]
     - cafeteria 5 [͵kæfəˋtɪrɪə]

-16- Lexical Stress Detection – Error Analysis (continued)
- Error types:
  4. Wrong result from pitch tracking
     - elegant 3 [ˋɛləgənt]
  5. Wrong result from forced alignment
     - peremptory 4 [pəˋrɛmptərɪ]

-17- More on Stress Detection
- ASRA
  - Chapter 20 of the online tutorial on Audio Signal Processing
  - Demo
    - Recognition: goDemoVc.m in ASR Web
    - Assessment: goDemoSa.m in ASR Web
- Stress detection
  - Application note
  - Demo