國立交通大學 電信工程研究所 (National Chiao Tung University, Institute of Communication Engineering): Phone Boundary Detection using Sample-based Acoustic Parameters

Presentation transcript:

國立交通大學 電信工程研究所 (National Chiao Tung University, Institute of Communication Engineering)
Phone Boundary Detection using Sample-based Acoustic Parameters
Yih-Ru Wang, Institute of Communication Engineering, National Chiao Tung University, Hsinchu, Taiwan, ROC
2011/7/12, NGASR Workshop (NGASR 研討會)

Outline
Motivation, Background
Why sample-based?
Sample-based Acoustic Parameters & Phone Boundary Detector
Experimental results
Conclusions and Future works

Motivation
Find the synchronous "clock" for detection-based ASR and Computer-Aided Language Learning (CALL) systems.
[Block diagram labels: Speech signal; Speech Attribute Detectors; Phone Boundary Detector; Synchronous "clock" for the system; Segment-based system; Detection-based ASR, CALL system]

Background
Tasks of phonetic segmentation
  – Phone alignment: 87% inclusion rate at a 10 ms tolerance for experts
  – Phone boundary detection
Phone alignment: model-based methods
  – HMM, MBE-HMM (Minimum Boundary Error HMM), HMM + fine tuning using SVM, …
Phone boundary detection: metric-based methods
  – a measure of speech signal change
  – norm of the delta MFCC feature vector (Rabiner, 2006)
  – KL distance or BIC of the speech signal
Frame-based features, such as MFCC, were used.

Why sample-based?
Transient vs. stationary
Accuracy and precision, especially for 'short' phones, e.g. plosives
Acoustic features with high frequency resolution, like MFCC → used to 'recognize' phones in speech
To detect changes of pronunciation manner/position (acoustics) in the speech signal → increase the time resolution and decrease the frequency resolution of the features

To find useful measures of speech signal change in a sample-based system
  – Sample-based Acoustic Parameters were proposed
PROs of the sample-based method
  – Better accuracy and precision
  – Properly detects the boundaries of short phones
CONs of the sample-based method
  – Complexity of the system?
  – Higher false alarm rate?

Sample-based Acoustic Parameters & Phone Boundary Detector
Sub-band signal envelope
  – Six sub-bands used for landmark detection (Liu, 1996)
ROR (rate of rise) of the sub-band signal envelope
  – The delta term of a feature
Band-pass frequencies: 5.0 – 8.0 kHz, 3.5 – 5.0 kHz, 2.0 – 3.5 kHz, 1.5 – 2.0 kHz, 0.8 – 1.2 kHz, 0.0 – 0.4 kHz
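
A minimal sketch of how per-sample sub-band envelopes and their rate of rise might be computed. The slides do not say how the envelopes are extracted, so the FFT-mask analytic-signal approach, the names `subband_envelopes` and `rate_of_rise`, the dB scale, and the 10 ms half-window are assumptions made purely for illustration; only the six band edges are taken from the slide.

```python
import numpy as np

# Band edges in Hz, copied from the slide (highest band first).
BANDS_HZ = [(5000, 8000), (3500, 5000), (2000, 3500),
            (1500, 2000), (800, 1200), (0, 400)]

def subband_envelopes(x, fs, bands=BANDS_HZ):
    """Return an array of shape (len(bands), len(x)) with one envelope
    value per sample and per band (assumed method: analytic signal)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    spectrum = np.fft.fft(x)
    freqs = np.fft.fftfreq(n, d=1.0 / fs)
    # Double the positive-frequency bins, keep DC as is, drop the rest.
    weights = np.where(freqs > 0, 2.0, 1.0)
    envelopes = np.empty((len(bands), n))
    for i, (lo, hi) in enumerate(bands):
        mask = (freqs >= lo) & (freqs <= hi)   # non-negative bins in the band
        analytic = np.fft.ifft(np.where(mask, weights * spectrum, 0.0))
        envelopes[i] = np.abs(analytic)
    return envelopes

def rate_of_rise(env, fs, lag_ms=10.0, floor=1e-8):
    """Difference of the log envelope (dB) across +/- lag_ms around each
    sample; edge samples without a full window are left at zero."""
    lag = max(1, int(round(lag_ms * 1e-3 * fs)))
    log_env = 20.0 * np.log10(np.maximum(env, floor))
    ror = np.zeros_like(log_env)
    ror[..., lag:-lag] = log_env[..., 2 * lag:] - log_env[..., :-2 * lag]
    return ror
```

Because every output has one value per input sample, the time resolution is that of the waveform itself, while the frequency resolution is limited to six coarse bands, which is the trade-off the slides argue for.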

[Figure: waveform, envelope, and sub-band signal envelopes (panels labelled 5.0 – 8.0 kHz and 0.0 – 0.4 kHz) for the TIMIT utterance FDRW0/sx293, "Please take this dirty table cloth to the cleaners for me", with manner-class segment labels (Stop, Glide, Vowel, Fricative, Silence, Nasal)]

ROR of sub-band signal envelope
[Figure: ROR of the signal envelope, with a span of ~20 ms marked, for "Please take this dirty cloth…"]

The norm of the sub-band signal envelopes can be a useful measure of signal change.
Sample-based spectral entropy can be defined as [equation not preserved in the transcript], where [symbol not preserved] is the i-th normalized sub-band signal envelope.
Sample-based spectral KL distance between the speech signals at two adjacent times [n, n+1] can be defined as [equation not preserved in the transcript].
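
The defining equations are missing from this transcript, so the following is only a sketch under stated assumptions: the envelopes are normalized to sum to one across bands at each sample, the entropy is the ordinary Shannon entropy of that distribution, and the KL distance is taken in its symmetric form. The function names and the `eps` guard are hypothetical, and the author's exact definitions may differ.

```python
import numpy as np

def normalize_envelopes(env, eps=1e-12):
    """Turn sub-band envelopes (bands x samples) into a per-sample
    distribution over bands; eps avoids division by zero and log(0)."""
    env = np.asarray(env, dtype=float) + eps
    return env / env.sum(axis=0, keepdims=True)

def spectral_entropy(p):
    """Shannon entropy of the band distribution at every sample."""
    return -(p * np.log(p)).sum(axis=0)

def adjacent_kl(p):
    """Symmetric KL distance between samples n and n+1 (assumed form);
    the result has one value fewer than the number of samples."""
    a, b = p[:, :-1], p[:, 1:]
    return ((a - b) * np.log(a / b)).sum(axis=0)
```

A flat band distribution gives the maximum entropy, and the distance is zero wherever two neighbouring samples have the same band distribution, so peaks in `adjacent_kl` mark candidate change points.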

An example of sample-based spectral entropy and its ROR
[Figure panels: sample-based spectral entropy; ROR of spectral entropy]

An example of sample-based spectral KL distance
[Figure: sample-based spectral KL distance]
It can be used to find the signal change points more accurately and precisely.

An MLP was used as the phone boundary detector.
[Figure: block diagram of the proposed training/test procedure]

Candidate pre-selection
  – find all speech samples, with index n, that satisfy [condition not preserved in the transcript]
Pre-selection can be used to reduce the complexity and the false alarms (FA) of the sample-based system.
After candidate pre-selection, an MLP was used as the boundary detector.
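
The actual selection condition did not survive in this transcript. As a stand-in only, the sketch below keeps samples where some change measure (for example a spectral KL distance) is a local peak above a threshold. This rule, the function name `preselect_candidates`, and the threshold are assumptions and should not be read as the author's criterion.

```python
import numpy as np

def preselect_candidates(change, threshold):
    """Return indices n where change[n] is a local peak above threshold.

    Generic peak picking used as a placeholder for the pre-selection
    rule; on a plateau only the first sample is kept.
    """
    change = np.asarray(change, dtype=float)
    mid = change[1:-1]
    is_peak = (mid > change[:-2]) & (mid >= change[2:]) & (mid > threshold)
    return np.nonzero(is_peak)[0] + 1   # +1 restores the original indexing
```

Only the surviving indices would then be passed to the MLP, which is how pre-selection cuts both the computation and the number of chances to raise a false alarm.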

The AP features used for the MLP detector

Iterative training procedure

2nd stage:
  – use a similarity measure of segmental acoustic signals
  – use a GMM to model the pdf of a speech segment
  – the KL1 distance of CCGMM (Wang, 2004)
A common GMM is used to represent the pdfs of two segments.

Similarity measure of two speech segments:
  – Discrete KL-1 distance of the CCGMM coefficients [equation not preserved in the transcript]
  – Discrete KL-2 distance using the CCGMM coefficients [equation not preserved in the transcript]
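
Since the two equations are missing, here is a hedged sketch of one plausible reading: when two segments share the same Gaussian components, each segment is summarized by its vector of mixture weights, and the distances are discrete KL divergences between those weight vectors. Treating KL-1 as the one-directional divergence and KL-2 as the symmetric sum is an assumption, as are the names `kl1` and `kl2`.

```python
import math

def kl1(w_a, w_b, eps=1e-12):
    """One-directional discrete KL divergence between two weight vectors
    defined over the same (shared) mixture components."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(w_a, w_b))

def kl2(w_a, w_b, eps=1e-12):
    """Symmetric variant: the sum of the two one-directional divergences."""
    return kl1(w_a, w_b, eps) + kl1(w_b, w_a, eps)
```

With shared components no integral over the feature space is needed; the comparison reduces to a sum over the mixture indices, which is what makes a segment-level distance cheap to evaluate at every candidate boundary.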

  – The discrete KL-1 distance is the mean of the log-likelihood of two pdfs
  – The similarity of two pdfs
  – Find high-order statistics of the log-likelihood pdfs (Wang, 2008)
  – Variance and skewness of the log-likelihood pdfs
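
One way to read these bullets is that the KL-1 distance is the weighted mean of the per-component log ratio of the two weight vectors, and that the variance and skewness of the same quantity serve as extra features. The sketch below follows that reading; it is an interpretation, not the formulation from (Wang, 2008), and `log_ratio_moments` is a hypothetical name.

```python
import math

def log_ratio_moments(w_a, w_b, eps=1e-12):
    """Mean, variance and skewness of log(w_a[k] / w_b[k]), each taken
    with w_a as the weighting distribution. The mean equals the
    one-directional discrete KL divergence from w_a to w_b."""
    ratios = [math.log((a + eps) / (b + eps)) for a, b in zip(w_a, w_b)]
    mean = sum(a * r for a, r in zip(w_a, ratios))
    var = sum(a * (r - mean) ** 2 for a, r in zip(w_a, ratios))
    if var > 0.0:
        skew = sum(a * (r - mean) ** 3 for a, r in zip(w_a, ratios)) / var ** 1.5
    else:
        skew = 0.0   # all ratios identical: no spread, no asymmetry
    return mean, var, skew
```

Two pairs of segments can share the same mean log ratio yet differ in how that ratio is spread over the components, which is the extra information the higher-order statistics would expose.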

  – use segmental similarity

Experimental Results
Database: TIMIT.
After candidate pre-selection,
  – 1 out of 116 samples was selected
  – 0.9% missed detections (MD) were due to candidate pre-selection
Performance of the MLP boundary detector:
[Table: TIMIT corpus, number of samples and phone boundaries in the training set and test set; values not preserved in the transcript]

Performance of the sample-based boundary detector

An example of the proposed phone boundary detector

Accuracy of the sample-based boundary detector

Comparison with Dr. Rabiner's work [2006]:

Absolute error             ≤ 5 ms   ≤ 10 ms   ≤ 15 ms   same frame   ±1 frame
Inclusion rate (1-stage)   41.5%    69.7%     81.1%     37.3%        77.0%
Inclusion rate (2-stage)   42.1%    70.3%     81.9%     37.8%        77.8%

Dr. Rabiner's result: (22.8%, 59.2%)
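
For readers unfamiliar with the metric, an inclusion rate is the share of boundaries whose error stays within a tolerance. The sketch below assumes each detected boundary has already been paired with its reference boundary and that a frame is 10 ms long; the function names and the frame length are illustrative assumptions, not values given in the slides.

```python
def inclusion_rate(detected_ms, reference_ms, tolerance_ms):
    """Fraction of paired boundaries with |error| <= tolerance_ms."""
    if len(detected_ms) != len(reference_ms) or not reference_ms:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(1 for d, r in zip(detected_ms, reference_ms)
               if abs(d - r) <= tolerance_ms)
    return hits / len(reference_ms)

def frame_inclusion_rate(detected_ms, reference_ms, frame_ms=10.0, slack=0):
    """Fraction of paired boundaries that fall in the same frame
    (slack=0) or within +/- slack frames of the reference frame."""
    if len(detected_ms) != len(reference_ms) or not reference_ms:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(1 for d, r in zip(detected_ms, reference_ms)
               if abs(int(d // frame_ms) - int(r // frame_ms)) <= slack)
    return hits / len(reference_ms)
```

The two measures are not interchangeable: a 3 ms error can still cross a frame edge, so a same-frame rate can be lower than the rate at a 5 ms tolerance, as in the table.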

Error analysis: MAE of detected boundary
Each cell gives sample-based/HMM system MAE (unit: ms); * marks cells where the number of samples is less than … Rows and columns are the manner classes Affricate, Fricative, Stop, Glide, Vowel, Nasal, Silence. Cell values are partly missing; the surviving row fragments are:
Affricate: -6.4/6.5*10.1/6.9*7.3/10.0*6.8/ /15.3*6.1/12.8
Fricative: 2.3/ / /13.1*9.5/ / / /11.7
Stop: -6.1/ /12.0*11.2/ / /9.67.1/14.4
Glide: -7.0/ / / / / /12.7
Vowel: -6.3/9.87.9/ / / / /13.6
Nasal: 7.6/11.3*6.2/ / / / /11.2*6.9/12.1
Silence: 6.3/ /7.57.3/ / / /9.97.0/18.9
Overall: 7.6/12.4

Accuracy of the proposed method

Systems         MAE/RMSE (ms)   MAE/RMSE (frame)   MAE/RMSE (normalized to phone duration)
HMM             12.4/…          …/…                …/…
… stage (RNN)   7.6/…           …/…                …/0.197
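
The three column groups differ only in the unit of the boundary error. As a sketch with hypothetical helper names: MAE and RMSE are computed from signed errors in ms, the frame version divides by a frame length (10 ms is assumed here), and the normalized version divides each error by a phone duration. Which phone's duration the author used is not stated.

```python
import math

def mae_rmse(errors):
    """Mean absolute error and root-mean-square error of a list of errors."""
    if not errors:
        raise ValueError("errors must be non-empty")
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return mae, rmse

def in_frames(errors_ms, frame_ms=10.0):
    """Express errors in frames instead of milliseconds."""
    return [e / frame_ms for e in errors_ms]

def normalized(errors_ms, durations_ms):
    """Express each error as a fraction of an associated phone duration."""
    if len(errors_ms) != len(durations_ms):
        raise ValueError("one duration per error is required")
    return [e / d for e, d in zip(errors_ms, durations_ms)]
```

RMSE is never smaller than MAE and grows faster with outliers, so a large gap between the two in any column points to a few large boundary errors.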

Error analysis (1 stage): MDR and FAR

                     Deletion (by next phone)                                          Insertion
Manner      System   Affricate  Fricative  Stop    Glide   Vowel   Nasal   Silence
Affricate   HMM      -          0.0%       25.0%   11.8%   4.4%    0.0%    2.9%      9.2%
Affricate   RNN      -          0.0%       …       17.6%   5.7%    7.7%    8.0%      11.1%
Fricative   HMM      0.0%       2.3%       13.3%   10.6%   4.8%    5.0%    6.2%      7.5%
Fricative   RNN      0.0%       3.3%       16.3%   20.5%   8.2%    7.0%    7.4%      10.5%
Stop        HMM      -          1.6%       12.6%   14.1%   5.7%    4.1%    2.0%      3.8%
Stop        RNN      -          2.3%       14.9%   22.7%   8.2%    10.3%   2.6%      9.0%
Glide       HMM      -          2.8%       16.1%   28.2%   5.6%    4.5%    5.2%      3.8%
Glide       RNN      -          5.8%       6.3%    6.5%    6.6%    8.4%    7.3%      9.4%
Vowel       HMM      -          2.9%       6.7%    6.6%    6.5%    10.3%   4.4%      7.8%
Vowel       RNN      -          6.2%       9.2%    6.9%    6.4%    10.0%   7.3%      10.1%
Nasal       HMM      7.1%       3.5%       17.7%   7.8%    5.4%    2.5%    18.4%     5.7%
Nasal       RNN      7.1%       9.8%       16.1%   18.8%   8.3%    2.5%    8.7%      9.4%
Silence     HMM      2.1%       0.9%       6.2%    6.6%    4.3%    4.8%    3.0%      5.7%
Silence     RNN      5.0%       3.9%       10.2%   10.0%   7.6%    5.0%    3.0%      6.5%

Overall: HMM 6.4% (EER), sample-based 8.7% (EER)
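
A sketch of how a missed-detection rate (MDR) and a false-alarm rate (FAR) can be obtained from two boundary lists. The slides give neither the matching tolerance nor the exact denominators, so the greedy one-to-one matching, the 20 ms tolerance in the usage note, and the choice to divide false alarms by the number of detections are all assumptions.

```python
def mdr_far(reference_ms, detected_ms, tolerance_ms):
    """Greedy one-to-one matching of sorted boundary times.

    Returns (MDR, FAR): the share of reference boundaries left without
    a match, and the share of detections that match no reference.
    """
    ref, det = sorted(reference_ms), sorted(detected_ms)
    if not ref or not det:
        raise ValueError("both boundary lists must be non-empty")
    i = j = hits = 0
    while i < len(ref) and j < len(det):
        if abs(det[j] - ref[i]) <= tolerance_ms:
            hits += 1          # matched pair: consume both boundaries
            i += 1
            j += 1
        elif det[j] < ref[i]:
            j += 1             # detection too early: counts as a false alarm
        else:
            i += 1             # reference passed unmatched: counts as a miss
    return 1.0 - hits / len(ref), (len(det) - hits) / len(det)
```

For example, `mdr_far([100, 200, 300], [105, 150, 310, 400], 20)` matches two of the three reference boundaries and leaves two of the four detections unmatched. Sweeping the detector threshold until the two rates coincide gives an equal error rate of the kind quoted as EER in the table.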

Conclusions & Future works
Several sample-based acoustic parameters, which can properly model speech signal change, were proposed.
Using the sample-based APs in the phone boundary detector, better precision and accuracy were achieved.
Segment-based speech attribute detectors

Segment-based Attribute Detector
[Diagram label: Segment-based Attribute Recognizer]
Coding each contour using Legendre polynomials
Operating point: 3% MDR, 20% FAR

  – Set the operating point to a low MD, high FA rate (segments / phones).
  – Feature extraction using the Legendre coefficients of the AP contours
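
A small sketch of coding a variable-length contour with a fixed number of Legendre coefficients, using NumPy's `numpy.polynomial.legendre` module. The polynomial order of 3 and the function names are assumptions; the slides do not state the order that was used.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_code(contour, order=3):
    """Least-squares Legendre coefficients (order + 1 values) of a contour
    whose time axis is mapped onto [-1, 1]."""
    contour = np.asarray(contour, dtype=float)
    t = np.linspace(-1.0, 1.0, len(contour))
    return legendre.legfit(t, contour, order)

def legendre_decode(coeffs, length):
    """Rebuild a smoothed contour of the requested length from coefficients."""
    t = np.linspace(-1.0, 1.0, length)
    return legendre.legval(t, coeffs)
```

Because the time axis is rescaled to [-1, 1], segments of different durations all yield a feature vector of the same size, which suits a segment-based recognizer fed by boundaries of varying spacing.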

  – Preliminary result; the frame-based system uses a 9-frame feature.
  – Converted into accuracy over time: 81.2%
Only 6 band-pass envelopes were used → phone alignment
[Table: segment-based vs. frame-based recognition rate (%) by pronunciation manner (Fricative, Stop, Glide, Vowel, Nasal, Silence); values not preserved in the transcript]