Classification of place of articulation in unvoiced stops with spectro-temporal surface modeling. V. Karjigi, P. Rao, Dept. of Electrical Engineering, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India. Received 8 December 2011; received in revised form 12 March 2012; accepted 23 April 2012; available online 1 June 2012. Chairman: Hung-Chi Yang. Presenter: Yue-Fong Guo. Advisor: Dr. Yeou-Jiunn Chen. Date: 2013.3.20
Outline. Introduction, MFCC, 2D-DCT, Polynomial surface.
Outline. GMM, Results, Conclusion.
Introduction. Automatic speech recognition (ASR) system: the goal is to convert the lexical content of human speech into computer-readable input. Speaker recognition, by contrast, attempts to identify or verify the speaker who produced the speech rather than the words it contains. Put simply, ASR lets a machine understand what a person says.
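Introduction. Components of an automatic speech recognition (ASR) system. Acoustic features: signal processing and feature extraction. Extracting and selecting acoustic features is an important step in speech recognition. Mel frequency cepstral coefficients (MFCC) take into account the human ear's sensitivity to different frequencies, which makes them especially suitable for speech recognition. Acoustic model: built as a statistical model of speech. A Gaussian mixture model (GMM) smoothly combines several Gaussian components.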
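MFCC. Mel frequency cepstral coefficients (MFCC) take the sensitivity of human perception to different frequencies into account and are therefore well suited to speech and speaker recognition, which is why they are commonly used today. Studies of human hearing show that when two tones of similar frequency sound at the same time, a listener hears only one tone. The critical bandwidth is the bandwidth boundary at which this subjective perception changes abruptly: when the frequency difference between two tones is smaller than the critical bandwidth, they are heard as a single tone, which is known as the masking effect. The Mel scale is one way of measuring this critical bandwidth.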
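The slide does not state a formula for the Mel scale. A minimal sketch, assuming the widely used mapping mel = 2595 log10(1 + f/700) (an assumption, not taken from the paper), shows how the scale compresses high frequencies:

```python
import math

def hz_to_mel(f_hz):
    """Map a frequency in Hz to mels (assumed formula: 2595 * log10(1 + f / 700))."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

if __name__ == "__main__":
    # Equal steps in Hz shrink in mels as frequency grows.
    for f in (500.0, 1000.0, 2000.0, 4000.0):
        print(f, round(hz_to_mel(f), 1))
```

With this mapping, 1000 Hz lands at roughly 1000 mel, and a 1000 Hz step higher up the spectrum covers fewer mels than the same step at low frequencies.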
MFCC. Pre-emphasis: the speech signal s(n) is passed through a high-pass filter. Frame blocking: the signal is split into short frames. Hamming windowing: each frame is multiplied by a Hamming window to improve the continuity between the first and last points of the frame.
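The three steps above can be sketched as follows. The pre-emphasis coefficient 0.97, the 400-sample frame length and the 160-sample hop are illustrative assumptions (typical for 16 kHz speech), not values given on the slide, and the function names are hypothetical.

```python
import math

def pre_emphasize(signal, alpha=0.97):
    """First-order high-pass filter y[n] = s[n] - alpha * s[n-1] (alpha is assumed)."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_blocks(signal, frame_len, hop):
    """Split into overlapping frames; trailing samples that do not fill a frame are dropped."""
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]

def hamming(frame_len):
    """Hamming window w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1)), for N >= 2."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (frame_len - 1))
            for n in range(frame_len)]

def windowed_frames(signal, frame_len=400, hop=160, alpha=0.97):
    """Pre-emphasize, block into frames, and apply the Hamming window to each frame."""
    emphasized = pre_emphasize(signal, alpha)
    window = hamming(frame_len)
    return [[x * w for x, w in zip(frame, window)]
            for frame in frame_blocks(emphasized, frame_len, hop)]
```

The window tapers each frame toward its edges, so the abrupt cut introduced by frame blocking has less effect on the spectrum computed in the next step.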
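MFCC. Fast Fourier transform (FFT): converts the time-domain signal into the frequency domain. Triangular bandpass filters: smooth the magnitude spectrum so that the harmonics are flattened, which yields the spectral envelope and highlights the formants of the original speech. Discrete cosine transform (DCT). In summary, MFCC computation first uses the FFT to convert the time-domain signal into the frequency domain, then passes the spectrum through a bank of triangular filters spaced on the Mel scale and takes the log energy of each filter output, and finally applies the DCT to the vector of filter outputs, keeping the first N coefficients. PLP still uses Durbin's method to compute the LPC parameters, but it also obtains the autocorrelation parameters by applying a DCT to the log energy spectrum of the auditory excitation.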
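A minimal sketch of the spectrum, filter bank and DCT steps for one windowed frame. It uses a naive DFT instead of an FFT to stay dependency-free, and the Mel formula, filter count and bin-rounding rule are common choices assumed here rather than details from the slides; all function names are hypothetical.

```python
import cmath
import math

def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)  # assumed Mel formula

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def power_spectrum(frame):
    """Naive O(N^2) DFT power spectrum for bins 0..N/2; an FFT gives the same values faster."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))) ** 2
            for k in range(n // 2 + 1)]

def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    mels = [top * i / (num_filters + 1) for i in range(num_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mels]
    bank = []
    for j in range(1, num_filters + 1):
        left, centre, right = bins[j - 1], bins[j], bins[j + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(left, centre):       # rising edge
            row[k] = (k - left) / (centre - left)
        for k in range(centre, right):      # falling edge, peak at centre
            row[k] = (right - k) / (right - centre)
        bank.append(row)
    return bank

def dct_ii(values, num_coeffs):
    """Unnormalized DCT-II, keeping the first num_coeffs coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(num_coeffs)]

def mfcc_frame(frame, sample_rate, num_filters=20, num_coeffs=12):
    """Power spectrum -> Mel filter bank -> log energies -> DCT."""
    spectrum = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    log_energies = [math.log(max(sum(w * p for w, p in zip(row, spectrum)), 1e-10))
                    for row in bank]  # floor avoids log(0) for empty filters
    return dct_ii(log_energies, num_coeffs)
```

The logarithm is taken after the filter bank, so each coefficient summarizes the smoothed spectral envelope rather than individual harmonics.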
MFCC. Log energy: the energy (loudness) within a frame is also an important feature of speech and is very easy to compute. Delta cepstrum: in practical speech recognition, delta (differential) cepstral parameters are usually appended to show how the cepstral parameters change over time.
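The slide names these two features without giving formulas. The sketch below assumes a frame log energy of 10 log10 of the sum of squared samples and the common regression form of the delta over a window of plus or minus two frames; both are assumptions for illustration, and the function names are hypothetical.

```python
import math

def log_energy(frame, floor=1e-10):
    """Frame energy in dB: 10 * log10(sum of squared samples), floored to avoid log(0)."""
    return 10.0 * math.log10(max(sum(x * x for x in frame), floor))

def delta(cepstra, span=2):
    """Regression deltas: sum_m m * (c[t+m] - c[t-m]) / (2 * sum_m m^2).

    cepstra is a list of per-frame coefficient lists; edge frames are
    repeated as padding so the output has one row per input frame.
    """
    num_frames = len(cepstra)
    denom = 2.0 * sum(m * m for m in range(1, span + 1))
    deltas = []
    for t in range(num_frames):
        row = []
        for d in range(len(cepstra[t])):
            acc = 0.0
            for m in range(1, span + 1):
                ahead = cepstra[min(t + m, num_frames - 1)][d]
                behind = cepstra[max(t - m, 0)][d]
                acc += m * (ahead - behind)
            row.append(acc / denom)
        deltas.append(row)
    return deltas
```

Appending the delta rows to the static coefficients gives each frame a description of both the spectral shape and its local rate of change.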
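2D-DCT. 2D-DCT modeling: the two-dimensional discrete cosine transform. Whereas the one-dimensional DCT in MFCC is applied along the frequency axis of a single frame, the 2D-DCT is applied along both the frequency and time axes of a spectro-temporal patch.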
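A minimal sketch of a separable 2D-DCT over a small time-frequency patch, keeping only the low-order coefficients as features. The patch layout (rows are frequency bands, columns are time frames), the unnormalized DCT-II and the function names are assumptions for illustration; the slide does not specify them.

```python
import math

def dct_ii(values):
    """Unnormalized one-dimensional DCT-II."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(n)]

def dct_2d(patch):
    """Separable 2D DCT: transform every row (time axis), then every column (frequency axis)."""
    rows = [dct_ii(row) for row in patch]
    cols = [dct_ii(list(col)) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # transpose back to [frequency][time]

def low_order_features(patch, keep_freq, keep_time):
    """Flatten the keep_freq x keep_time block of lowest-order coefficients."""
    coeffs = dct_2d(patch)
    return [coeffs[i][j] for i in range(keep_freq) for j in range(keep_time)]
```

Keeping only the low-order block retains the smooth shape of the spectro-temporal surface and discards fine detail in both directions.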
Polynomial surface. Polynomial surface modeling.
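The equations for this slide did not survive, so the following is only a sketch of one plausible reading: fit a bivariate polynomial in time t and frequency f to a spectro-temporal patch by least squares and use the fitted coefficients as features. The polynomial order, the scaling of coordinates to [-1, 1], the term ordering and all function names are assumptions, not details from the paper.

```python
def monomials(t, f, order):
    """Terms t^p * f^q with p + q <= order, ordered by p and then q."""
    return [(t ** p) * (f ** q)
            for p in range(order + 1) for q in range(order + 1 - p)]

def solve(a, b):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def fit_surface(patch, order=2):
    """Least-squares polynomial surface fit via the normal equations.

    patch[i][j] is the value at frequency band i and time frame j; the
    patch needs more than `order` points along each axis.
    """
    n_f, n_t = len(patch), len(patch[0])
    rows, targets = [], []
    for i, band in enumerate(patch):
        f = 2.0 * i / (n_f - 1) - 1.0   # scale frequency index to [-1, 1]
        for j, value in enumerate(band):
            t = 2.0 * j / (n_t - 1) - 1.0   # scale time index to [-1, 1]
            rows.append(monomials(t, f, order))
            targets.append(value)
    k = len(rows[0])
    ata = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    atz = [sum(r[a] * z for r, z in zip(rows, targets)) for a in range(k)]
    return solve(ata, atz)
```

For order 2 the result has six coefficients, for the terms 1, f, f^2, t, t*f and t^2 in that order.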
GMM. A Gaussian mixture model (GMM) is an effective tool for data modeling and pattern classification. Used as the acoustic model, it clusters the speaker's acoustic characteristics into groups and describes each group with a Gaussian density. In other words, it smoothly combines several Gaussian components into a single distribution.
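A minimal sketch of how a trained GMM per class could be used for classification: score the feature vector under each class model and pick the most likely class. Diagonal covariances, the class labels and the toy parameters in the usage example are assumptions for illustration; training (for example with EM) is not shown.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_log_likelihood(x, weights, means, variances):
    """log sum_k w_k * N(x | mean_k, diag(var_k)), using the log-sum-exp trick."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def classify(x, class_models):
    """Return the label whose GMM gives x the highest log-likelihood."""
    return max(class_models,
               key=lambda label: gmm_log_likelihood(x, *class_models[label]))

if __name__ == "__main__":
    # Hypothetical two-dimensional models: (weights, means, variances) per class.
    models = {
        "labial": ([1.0], [[0.0, 0.0]], [[1.0, 1.0]]),
        "velar": ([0.5, 0.5], [[5.0, 5.0], [6.0, 6.0]], [[1.0, 1.0], [1.0, 1.0]]),
    }
    print(classify([0.1, -0.2], models))
```

Each mixture component plays the role of one cluster of acoustic characteristics, and the weights set how much each cluster contributes to the class density.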
Databases. The systems were evaluated on two distinct datasets: American English continuous speech from the TIMIT database, and a Marathi words database created specially for this purpose.
Results
Conclusion. A comparison with published results on the same task showed that the spectro-temporal feature systems tested in this work improve on the classification accuracies of the best previous systems on the specified datasets.
The End