PROSODY MODELING AND EIGEN-PROSODY ANALYSIS FOR ROBUST SPEAKER RECOGNITION
Zi-He Chen, Yuan-Fu Liao, and Yau-Tarng Juang
ICASSP 2005
Presenter: Fang-Hui CHU

2005/10/24 Speech Lab., NTNU

References
- Combination of Acoustic and Prosodic Information for Robust Speaker Identification, Yuan-Fu Liao, Zhi-Xian Zhuang, Zi-He Chen, and Yau-Tarng Juang, ROCLING 2005
- Eigen-Prosody Analysis for Robust Speaker Recognition under Mismatch Handset Environment, Zi-He Chen, Yuan-Fu Liao, and Yau-Tarng Juang, ICSLP 2004

Outline
- Introduction
- VQ-Based Prosodic Modeling
- AKI Unseen Handset Estimation
- Eigen-Prosody Analysis
- Speaker Identification Experiments
- Conclusions

Introduction
A speaker identification system in a telecommunication network environment needs to be robust against the distortion of mismatched handsets. Prosodic features are known to be less sensitive to handset mismatch.
Several successful techniques:
- GMMs: the per-frame pitch and energy values are extracted and modeled using traditional distribution models. This may not adequately capture the temporal dynamics of the prosodic feature contours.

Introduction (cont.)
Several successful techniques (cont.):
- N-gram: the dynamics of the pitch and energy trajectories are described by sequences of symbols and modeled by n-gram statistics.
- DHMM: the sequences of prosody symbols are further modeled by state observation and transition probabilities.
These approaches usually require a large amount of training/test data to reach reasonable performance.
In this paper, a VQ-based prosody modeling and an eigen-prosody analysis (EPA) approach are integrated to add robustness to a conventional cepstral-feature-based GMM closed-set speaker identification system under mismatched unseen handsets and limited training/test data.

Introduction (cont.)
A speaker's enrollment speech is converted by VQ-based prosodic modeling into sequences of prosody states, i.e., a text document that records the detailed prosody/speaking style of the speaker. EPA then applies latent semantic analysis (LSA) to build an eigen-prosody space representing the constellation of speakers. In this way, the speaker identification problem is transformed into a task similar to full-text document retrieval.

Introduction (cont.)
Since EPA utilizes prosody-level information, it can be further fused with acoustic-level information so that the two complement each other. The a priori knowledge interpolation (AKI) approach is chosen; it utilizes maximum likelihood linear regression (MLLR) to compensate for handset mismatch.

VQ-Based Prosodic Modeling
In this study, syllables are chosen as the basic processing units. Five types of prosodic features are used:
- the slope of the pitch contour of a vowel segment
- the lengthening factor of a vowel segment
- the average log-energy difference between two vowels
- the value of the pitch jump between two vowels
- the pause duration between two syllables
The prosodic features are normalized according to the underlying vowel class to remove non-prosodic (context) effects:
x̃ = (x − μ_v) / σ_v
where μ_v and σ_v² are the mean and variance of vowel class v over the whole corpus.
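The per-vowel-class normalization above is a z-score. A minimal sketch, assuming the features arrive as one row per syllable with a vowel-class label per row (function and variable names are my own):

```python
import numpy as np

def normalize_prosodic_features(features, vowel_classes):
    """Z-score each prosodic feature vector by its vowel class's corpus statistics."""
    features = np.asarray(features, dtype=float)
    normalized = np.empty_like(features)
    for v in set(vowel_classes):
        idx = [i for i, c in enumerate(vowel_classes) if c == v]
        mu = features[idx].mean(axis=0)           # class mean over the whole corpus
        sigma = features[idx].std(axis=0) + 1e-8  # class std; epsilon avoids divide-by-zero
        normalized[idx] = (features[idx] - mu) / sigma
    return normalized
```

After this step, each vowel class contributes zero-mean, unit-variance features, so what remains reflects the speaker's prosody rather than the segmental context.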

Prosodic Modeling (figure)

Prosodic Modeling (cont.)
To build the prosodic model, the prosodic features are vector quantized into M codewords using the expectation-maximization (EM) algorithm.

Prosodic Modeling (cont.)
For example, state 6 is the phrase-start; states 3 and 4 are the major breaks.

Prosodic Labeling (figure)

AKI Unseen Handset Estimation
The concept of AKI is to first collect a set of characteristics of seen handsets as the a priori knowledge, in order to construct a space of handsets. Model-space AKI using the MLLR model transformation is chosen to compensate for the handset mismatch. Each Gaussian mixture component is adapted by
μ̂ = A_n μ + b_n
Σ̂ = Bᵀ T_n B
where H = {h_n = (A_n, b_n, T_n), n = 1..N} is the set of a priori knowledge; μ and μ̂ are the original and adapted mixture mean vectors; Σ and Σ̂ are the original and adapted mixture variance matrices; A_n and T_n are the MLLR mean and variance transformation matrices; b_n is the bias of the mixture component; and B is the inverse Choleski factor of the original inverse variance matrix Σ⁻¹. The characteristic of an unseen test handset is then estimated from this set of seen-handset transforms.
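Applying one seen-handset transform h_n = (A_n, b_n, T_n) to a Gaussian component can be sketched as follows (a Gales-style mean/variance MLLR transform; the AKI interpolation across the N seen handsets is not shown, and names are my own):

```python
import numpy as np

def apply_mllr(mu, Sigma, A, b, T):
    """Adapt one Gaussian component's mean and covariance with an MLLR transform."""
    mu_hat = A @ mu + b                       # mean transform: mu_hat = A mu + b
    # B is the inverse Choleski factor of the original precision Sigma^{-1}
    C = np.linalg.cholesky(np.linalg.inv(Sigma))
    B = np.linalg.inv(C)
    Sigma_hat = B.T @ T @ B                   # variance transform: B^T T B
    return mu_hat, Sigma_hat
```

Note that with A = I, b = 0, and T = I the transform is the identity: Bᵀ B = (C Cᵀ)⁻¹ = Σ, which is a quick sanity check on the Choleski convention used here.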

AKI Unseen Handset Estimation (cont.) (figure)

Eigen-Prosody Analysis
The EPA procedure includes:
1. VQ-based prosodic modeling and labeling
2. Segmenting the sequences of prosody states to extract important prosody keywords
3. Calculating the occurrence statistics of these prosody keywords for each speaker to form a prosody keyword-speaker occurrence matrix
4. Applying singular value decomposition (SVD) to the matrix to build an eigen-prosody space
5. Measuring the speaker distance using the cosine of the angle between two speaker vectors

Eigen-Prosody Analysis (cont.) (figure)

Prosody keyword extraction
After prosodic state labeling, the prosody text documents of all speakers are searched for important prosody keywords in order to establish a prosody keyword dictionary. First, all possible combinations of the prosody words, including single words and word pairs (uni-grams and bi-grams), are listed and their frequency statistics are computed. After calculating the histogram of all prosody words, frequency thresholds are set to keep only the high-frequency ones.
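The uni-gram/bi-gram counting and thresholding described above can be sketched as (the threshold value and function name are illustrative, not from the paper):

```python
from collections import Counter

def extract_prosody_keywords(state_sequences, min_count=2):
    """Collect high-frequency prosody 'words' (uni-grams and bi-grams of
    prosody states) across all speakers' prosody documents."""
    counts = Counter()
    for seq in state_sequences:
        counts.update((s,) for s in seq)   # uni-grams
        counts.update(zip(seq, seq[1:]))   # bi-grams (adjacent state pairs)
    # keep only the words above the frequency threshold
    return {w for w, c in counts.items() if c >= min_count}
```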

Prosody keyword-speaker occurrence matrix statistics
The prosody text document of each speaker is then parsed using the generated prosody keyword dictionary, simply giving higher priority to longer words. The occurrence counts of a speaker's keywords are recorded in a prosody keyword list vector that represents the long-term prosodic behavior of that speaker. The prosody keyword-speaker occurrence matrix A is the collection of all speakers' prosody keyword list vectors. To emphasize uncommon keywords and de-emphasize very common ones, inverse document frequency (IDF) weighting is applied.
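Building the keyword-speaker matrix and applying IDF weighting might look like this (a standard log-IDF form is assumed, since the slide does not give the exact weighting formula; names are my own):

```python
import numpy as np

def build_weighted_matrix(docs, keywords):
    """Keyword-by-speaker occurrence matrix A with IDF weighting applied."""
    kw = sorted(keywords)
    A = np.zeros((len(kw), len(docs)))
    for j, doc in enumerate(docs):          # one parsed keyword list per speaker
        for i, w in enumerate(kw):
            A[i, j] = doc.count(w)          # raw occurrence count
    df = (A > 0).sum(axis=1)                # document frequency of each keyword
    idf = np.log(len(docs) / np.maximum(df, 1))
    return A * idf[:, None], kw
```

A keyword that appears in every speaker's document gets IDF 0 and thus carries no discriminative weight, which matches the stated goal of de-emphasizing very common keywords.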

Eigen-prosody analysis
To reduce the dimension of the prosody space, the sparse prosody keyword-speaker occurrence matrix A is further analyzed using SVD to find a compact eigen-prosody space. Given an m-by-n (m >> n) matrix A of rank R, A is decomposed and approximated using only the largest K singular values:
A ≈ A_K = U_K Σ_K V_Kᵀ
where A_K, U_K, Σ_K, and V_K are the rank-reduced versions of the respective matrices.
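The rank-K truncation is a direct use of the SVD; a minimal sketch:

```python
import numpy as np

def rank_k_approx(A, K):
    """Return U_K, the K largest singular values, and V_K^T such that
    A_K = U_K diag(s_K) V_K^T is the best rank-K approximation of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :K], s[:K], Vt[:K, :]
```

The columns of V_K (rows of V_Kᵀ) then give each speaker's coordinates in the K-dimensional eigen-prosody space.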

Eigen-Prosody Analysis (figure)

Score measurement
The test utterances of a test speaker are first labeled and parsed to form the pseudo query document y_Q, which is then transformed into the query vector v_Q in the eigen-prosody speaker space by the standard LSA fold-in:
v_Q = Σ_K⁻¹ U_Kᵀ y_Q
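Combining the fold-in with the cosine distance from the EPA procedure gives a complete scoring sketch (the standard LSA fold-in formula is assumed, as the slide's equation was lost in transcription; names are my own):

```python
import numpy as np

def fold_in_query(y_q, U_K, s_K):
    """Project a query keyword-count vector into the eigen-prosody space:
    v_Q = Sigma_K^{-1} U_K^T y_Q."""
    return (U_K.T @ y_q) / s_K

def cosine(u, v):
    """Cosine of the angle between two speaker vectors (higher = more similar)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
```

The identified speaker would then be the enrolled speaker whose eigen-prosody vector has the largest cosine with v_Q.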

Speaker Identification Experiments
HTIMIT database and experimental conditions:
- 384 speakers in total, each giving ten utterances.
- The set of 384 × 10 utterances was then played back and recorded through nine other handsets, including four carbon-button handsets, four electret handsets, and one portable cordless phone.
- All experiments were performed on the 302 speakers (151 female, 151 male) who have all ten utterances.
- For training the speaker models, the first seven utterances of each speaker from the senh handset were used as the enrollment speech.
- The ten three-utterance sessions of each speaker from the ten handsets were used as the evaluation data.

Speaker Identification Experiments (cont.)
- To construct the speaker models, a 256-mixture universal background model (UBM) was first built from the enrollment speech of all 302 speakers.
- For each speaker, a MAP-adapted GMM (MAP-GMM) was adapted from the UBM using that speaker's own enrollment speech.
- A 32-state prosodic model was trained; 367 prosody keywords were extracted to form a sparse 367 × 302 matrix A; the dimension of the eigen-prosody space is 3.
- Acoustic features: 38 mel-frequency cepstral coefficients (MFCCs), with a window size of 30 ms and a frame shift of 10 ms.

Speaker Identification Experiments (cont.) (figure)

Conclusions
This paper presents an EPA + AKI + MAP-GMM/CMS fusion approach. Unlike conventional fusion approaches, the proposed method requires only a few training/test utterances and can alleviate the distortion of unseen mismatched handsets. It is therefore a promising method for robust speaker identification under mismatched environments with limited available data.
