Slide 1. Speech Processing: Intel® Integrated Performance Primitives vs. Speech Libraries & Toolkits
Math Inside & Outside
by Vitaly Horban (vgorban@unicyb.kiev.ua)
Seminar, NNSU Lab, Summer 2003
Slide 2. Agenda
- Comparison of Intel® IPP 3.0 with speech libraries & toolkits
- Overview of mathematical methods for speech processing
- General assessment of Intel® IPP 3.0
- Summary
Slide 3. Acronyms
- LP: Linear Prediction
- RELP: Residual-Excited Linear Prediction
- PLP: Perceptual Linear Prediction
- AR: Area Ratios, or Autoregressive
- LSP: Line Spectrum Pairs
- LSF: Line Spectral Frequencies
- MFCC: Mel-Frequency Cepstrum Coefficients
- MLSA: Mel Log Spectral Approximation
- DCT: Discrete Cosine Transform
- DTW: Dynamic Time Warping
- SVD: Singular Value Decomposition
- VQ: Vector Quantization
- RFC: Rise/Fall/Connection
- HMM: Hidden Markov Model
- ANN: Artificial Neural Network
- EM: Expectation-Maximization
Slide 4. Acronyms (continued)
- CMS: Cepstral Mean Subtraction
- MLP: Multi-Layer Perceptron
- LDA: Linear Discriminant Analysis
- QDA: Quadratic Discriminant Analysis
- NLDA: Non-Linear Discriminant Analysis
- SVM: Support Vector Machine
- DWT: Discrete Wavelet Transform
- LAR: Log Area Ratio
- PLAR: Pseudo Log Area Ratio
- GMM: Gaussian Mixture Model
- WFST: Weighted Finite-State Transducer
- CART: Classification and Regression Trees
- HNM: Harmonic plus Noise Modeling
- MBR: Minimum Bayes Risk
- SR: Speech Recognition
- TTS: Text-To-Speech synthesis
Slide 5. IPP vs. CMU Sphinx

CMU Sphinx:
- Feature processing: LP; spectrum; cepstrum; MEL (filter, cepstrum, filter bank); PLP (filter, cepstrum, filter bank)
- Language model: context-free grammar; N-gram model
- Acoustic model based on HMM: each HMM state is a set of Gaussian mixtures; HMM order; HMM position; HMM transition matrix; Baum-Welch training

Intel® IPP 3.0:
- Feature processing: LP; power spectrum; cepstrum; LSP; mel-scale values; mel-frequency filter bank; mel-cepstrum; linear-scale values
- Acoustic & language models: Gaussian mixture; likelihood of an HMM state cluster; HMM transition matrix
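Both columns of the Sphinx comparison lean on HMM decoding over a transition matrix and per-state likelihoods. As a reminder of the math behind those primitives, here is a minimal Viterbi sketch in Python; the function name and the toy log-probabilities are illustrative only and are not the IPP or Sphinx APIs.

```python
def viterbi(log_init, log_trans, log_obs):
    """Most-likely HMM state path given per-frame log observation scores.

    log_init[s]     -- log prior of starting in state s
    log_trans[s][t] -- log transition probability s -> t
    log_obs[f][s]   -- log likelihood of frame f under state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_obs[0][s] for s in range(n_states)]
    back = []
    for frame in log_obs[1:]:
        prev, new = [], []
        for t in range(n_states):
            best_s = max(range(n_states), key=lambda s: score[s] + log_trans[s][t])
            new.append(score[best_s] + log_trans[best_s][t] + frame[t])
            prev.append(best_s)
        score, back = new, back + [prev]
    # backtrack from the best final state
    path = [max(range(n_states), key=lambda s: score[s])]
    for prev in reversed(back):
        path.append(prev[path[-1]])
    return list(reversed(path))
```

Real recognizers add beam pruning and mixture evaluation on top of this core recurrence, but the dynamic program is the same.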
Slide 6. IPP vs. CSLU Toolkit

CSLU Toolkit:
- Feature processing: power spectral analysis (FFT); linear predictive analysis (LPC); PLP; mel-scale cepstral analysis (MEL); relative spectral filtering of log-domain coefficients (RASTA); first-order derivative (DELTA); energy normalization
- Language model: word pronunciation; lexical trees; grammars
- Acoustic model based on HMM/ANN: VQ initialisation; EM training; Viterbi decoding

Intel® IPP 3.0:
- Feature processing: power spectral analysis (FFT); linear predictive analysis (LPC); LP reflection coefficients; LSP; DCT; RFC; cross-correlation coefficients; covariance matrix; mel-scale cepstral analysis; derivative functions; energy normalization
- Acoustic & language model: VQ; weights, means and variances; EM re-estimation; Viterbi decoding
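The first-order derivative (DELTA) features in the CSLU column are usually computed as a regression over neighbouring frames rather than a simple frame difference. This sketch uses the HTK-style regression weighting as an assumption; the CSLU Toolkit's exact window and weights may differ.

```python
def delta(features, window=2):
    """First-order regression deltas over a sequence of feature vectors.

    Uses the common regression formula
        d[t] = sum_k k * (c[t+k] - c[t-k]) / (2 * sum_k k^2),
    replicating edge frames where the window runs off either end.
    """
    norm = 2 * sum(k * k for k in range(1, window + 1))
    n = len(features)
    out = []
    for t in range(n):
        vec = [0.0] * len(features[0])
        for k in range(1, window + 1):
            plus = features[min(t + k, n - 1)]
            minus = features[max(t - k, 0)]
            for i in range(len(vec)):
                vec[i] += k * (plus[i] - minus[i]) / norm
        out.append(vec)
    return out
```

On a linear ramp of frames the interior deltas come out equal to the per-frame slope, which is a quick sanity check.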
Slide 7. IPP vs. Festival

Festival:
- Feature processing: power spectrum; Tilt to RFC, RFC to Tilt, RFC to F0; LPC; MEL; LSF; LP reflection coefficients; fundamental frequency (pitch); root-mean-square energy
- Language model: N-gram model; context-free grammar; regular expressions; CART trees; WFST
- Acoustic model: Viterbi decoding

Intel® IPP 3.0:
- Feature processing: power spectrum; Reflection to Tilt, PitchmarkToF0, Unit Curve (RFC); LPC; MEL; LSF; LP reflection coefficients; energy normalization
- Acoustic & language model: Viterbi decoding
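Among Festival's prosodic features, root-mean-square energy is simple enough to state exactly. The helper below is an illustrative sketch, not Festival or IPP code.

```python
import math

def rms_energy(frame):
    """Root-mean-square energy of one frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def log_energy_db(frame, floor=1e-10):
    """The same quantity on a dB scale, floored to avoid log(0)."""
    mean_sq = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(max(mean_sq, floor))
```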
Slide 8. IPP vs. ISIP

ISIP:
- Feature processing: derivative functions; spectrum; cepstrum; cross-correlation; covariance matrix; covariance (Cholesky); energy (log, dB, RMS, power); filter bank; log area ratio (Kelly-Lochbaum); autocorrelation (Durbin recursion, LeRoux-Gueguen); lattice (Burg); reflection coefficients; Gaussian probability
- Acoustic & language model (HMM): N-gram model; Viterbi decoding; Baum-Welch training

Intel® IPP 3.0:
- Feature processing: derivative functions; spectrum; cepstrum; cross-correlation; covariance matrix; energy normalization; filter bank; area ratio; Durbin's recursion; reflection coefficients (Schur); Gaussian probability
- Acoustic & language model: Viterbi decoding
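Durbin's recursion, listed on both sides of the ISIP comparison, solves the Toeplitz normal equations of linear prediction in O(p^2), producing the LP coefficients and reflection coefficients together. A sketch, assuming the predictor convention x[n] = sum over k of a[k]*x[n-k] (the LeRoux-Gueguen and Schur variants reorganise the same computation for fixed-point arithmetic):

```python
def levinson_durbin(r):
    """Durbin's recursion: autocorrelation r[0..p] -> LP coefficients a[1..p],
    reflection coefficients k[1..p], and final prediction error."""
    p = len(r) - 1
    a = [0.0] * (p + 1)
    k = [0.0] * (p + 1)
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k[i] = acc / err                      # i-th reflection coefficient
        a_new = a[:]
        a_new[i] = k[i]
        for j in range(1, i):                 # update lower-order coefficients
            a_new[j] = a[j] - k[i] * a[i - j]
        a, err = a_new, err * (1.0 - k[i] * k[i])
    return a[1:], k[1:], err
```

For an AR(1) signal with coefficient 0.5 (autocorrelation 1, 0.5, 0.25) the recursion recovers a = [0.5, 0] exactly.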
Slide 9. IPP vs. MATLAB

MATLAB:
- Frequency scale conversion: mel scale; equivalent rectangular bandwidths (ERB)
- Transforms: FFT (real data); DCT (real data); Hartley (real data); diagonalisation of two Hermitian matrices (LDA, IMELDA)
- Vector distances: Euclidean; squared Euclidean; Mahalanobis; Itakura (AR, power spectra); Itakura-Saito (AR, power spectra); COSH (AR, power spectra)
- Speech enhancement: Martin spectral subtraction algorithm

Intel® IPP 3.0:
- Frequency scale conversion: mel scale; linear scale
- Transforms: DFT; FFT; DCT
- Distances: Euclidean; Mahalanobis; DTW (observation and reference vector sequences); Bhattacharyya
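The mel scale appears in both columns. One common closed form (several slightly different variants exist in the literature, so take this as one choice, not the one IPP or MATLAB necessarily uses):

```python
import math

def hz_to_mel(f):
    """Hz -> mel, using the common 2595 * log10(1 + f/700) variant."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping, mel -> Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

The mapping is designed so that equal mel steps approximate equal perceived pitch steps; it is roughly linear below 1 kHz and logarithmic above.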
Slide 10. IPP vs. MATLAB (continued)

MATLAB:
- LPC analysis and transforms: area ratios; autoregressive (AR); power spectrum; cepstrum; DCT; impulse response (IR); LSP; LSF; reflection coefficients; unit-triangular matrix containing the AR coefficients; autocorrelation coefficients; expansion of the formant bandwidths of an LPC filter; warped cepstrum (mel, linear)

Intel® IPP 3.0:
- Feature processing: LPC; area ratio; spectrum; cepstrum; RFC; DCT; LSP; LSF; reflection coefficients; autocorrelation coefficients; cross-correlation coefficients; covariance matrix; mel-scale cepstral analysis; derivative functions; energy normalization
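The LPC-to-cepstrum transform implied by both columns has a well-known recursion. The sketch below assumes the minimum-phase LP synthesis filter 1/A(z) with A(z) = 1 - sum over k of a[k]*z^-k, and omits the gain term c0; with the opposite sign convention for A(z) the signs flip accordingly.

```python
def lpc_to_cepstrum(a, n_ceps):
    """LP coefficients a[1..p] -> cepstral coefficients c[1..n_ceps]
    via the standard recursion c_n = a_n + sum_{k<n} (k/n) * c_k * a_{n-k}."""
    p = len(a)
    c = []
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k - 1] * a[n - k - 1]
        c.append(acc)
    return c
```

For a first-order predictor with a1 = 0.5 the closed form is c_n = 0.5^n / n, which the recursion reproduces.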
Slide 11. IPP vs. MATLAB (continued)

MATLAB:
- Speech synthesis: Rosenberg glottal model; Liljencrants-Fant glottal model
- Speech recognition: mel-cepstrum; mel filter bank; cepstral means & variances to power domain; Gaussian mixture
- Speech coding (ITU G.711): linear PCM; A-law; Mu-law; VQ using the K-means algorithm; VQ using the Linde-Buzo-Gray algorithm

Intel® IPP 3.0:
- Speech recognition: feature processing; model evaluation; model estimation; model adaptation; vector quantization
- Speech coding (ITU G.711, G.723.1, G.729): linear PCM; A-law; Mu-law; VQ given a codebook
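G.711 companding appears in both columns. The continuous mu-law curve is easy to state; note that the actual G.711 codec quantises with a piecewise-linear 8-bit segment approximation of this curve, so the sketch below is the textbook companding formula, not the G.711 bit format.

```python
import math

def mulaw_compress(x, mu=255.0):
    """Continuous mu-law companding for x in [-1, 1]."""
    sign = 1.0 if x >= 0 else -1.0
    return sign * math.log(1.0 + mu * abs(x)) / math.log(1.0 + mu)

def mulaw_expand(y, mu=255.0):
    """Inverse mu-law: recover x from the companded value."""
    sign = 1.0 if y >= 0 else -1.0
    return sign * ((1.0 + mu) ** abs(y) - 1.0) / mu
```

Companding spends more of the 8-bit range on small amplitudes, where speech energy and perceptual sensitivity are concentrated; A-law differs only in the exact curve.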
Slide 12. IPP vs. HTK

HTK:
- Feature processing: LPC; spectral coefficients; cepstral coefficients; reflection coefficients; Gaussian distribution; K-means procedure; PLP; autocorrelation coefficients; covariance matrix; mel-scale filter bank; MFCC; third differential; energy; VQ codebook

Intel® IPP 3.0:
- Feature processing: LPC; area ratio; spectrum; cepstrum; RFC; DCT; LSP; LSF; reflection coefficients; autocorrelation coefficients; cross-correlation coefficients; covariance matrix; mel-scale cepstral analysis; derivative functions; energy normalization; VQ
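HTK's K-means procedure for building a VQ codebook is Lloyd's alternation of nearest-codeword assignment and centroid update. A generic sketch (not the HTK implementation; function names are illustrative):

```python
def kmeans_codebook(vectors, codebook, iters=10):
    """Lloyd's k-means: refine an initial VQ codebook over training vectors."""
    def dist2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    for _ in range(iters):
        # assignment step: nearest codeword for every training vector
        cells = [[] for _ in codebook]
        for v in vectors:
            i = min(range(len(codebook)), key=lambda c: dist2(v, codebook[c]))
            cells[i].append(v)
        # update step: move each codeword to its cell centroid
        codebook = [
            [sum(col) / len(cell) for col in zip(*cell)] if cell else cw
            for cell, cw in zip(cells, codebook)
        ]
    return codebook

def quantize(v, codebook):
    """Index of the nearest codeword ("VQ given a codebook" in the IPP column)."""
    return min(range(len(codebook)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(v, codebook[c])))
```

The Linde-Buzo-Gray algorithm mentioned earlier wraps the same refinement in a codebook-splitting loop.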
Slide 13. IPP vs. HTK (continued)

HTK:
- Model adaptation: maximum likelihood linear regression (MLLR); EM technique; Bayesian adaptation, i.e. the maximum a posteriori (MAP) approach
- Acoustic & language model based on HMM: grammar; N-gram model; Viterbi training; Baum-Welch training
- Speech coding: linear PCM; A-law; Mu-law

Intel® IPP 3.0:
- Model adaptation: EM training algorithm
- Acoustic & language model: Viterbi decoding; likelihood of an HMM state cluster; HMM transition matrix
- Speech coding (ITU G.711, G.723.1, G.729): linear PCM; A-law; Mu-law; VQ given a codebook
Slide 14. Possible extensions to IPP 3.0

- Model adaptation: maximum likelihood linear regression (MLLR); Bayesian adaptation, i.e. the maximum a posteriori (MAP) approach
- Model evaluation: Itakura (AR, power spectra); Itakura-Saito (AR, power spectra); COSH (AR, power spectra)
- Speech synthesis: Rosenberg glottal model; Liljencrants-Fant glottal model
- Speech enhancement: Martin spectral subtraction
- Speech coding: VQ using the K-means algorithm; VQ using the Linde-Buzo-Gray algorithm
- Acoustic model based on HMM: Baum-Welch training
- Feature processing: PLP (filter, cepstrum, filter bank); RASTA filtering of log-domain coefficients; fundamental frequency (pitch); RMS energy; covariance (Cholesky); energy (log, dB, RMS, power); LAR (Kelly-Lochbaum); autocorrelation (LeRoux-Gueguen); lattice (Burg); equivalent rectangular bandwidths (ERB); unit-triangular matrix (AR coefficients); formant bandwidth expansion (LP); third differential; Hartley transform; diagonalisation of two Hermitian matrices (LDA, IMELDA)
Slide 15. Speaker Characteristics

- Feature processing: preemphasis; cepstral energy; cepstral mean subtraction (CMS); MFCC, LPCC, LFCC; LPC (to cepstral, to LSF); residual prediction; mel-cepstrum; fundamental frequency (F0); LSF (Bark scale); RMS energy; Levinson-Durbin recursion; covariance (Cholesky); delta cepstral (Milner, high order); pseudo log area ratio (PLAR); DWT; VQ
- Acoustic model: distances (Bhattacharyya, Euclidean); DTW; Viterbi decoding; EM (Lloyd); K-means (Lloyd); PLP; MLP; twin-output MLP; LDA; NLDA
- Generative models: GMM; HMM (Baum-Welch)
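DTW, listed here among the acoustic-model distance methods, aligns an observation sequence against a reference of a different length by dynamic programming. A minimal sketch (names illustrative; practical systems add path constraints and normalisation):

```python
def dtw(seq_a, seq_b, dist):
    """Dynamic time warping cost between two vector sequences,
    with the classic (insertion, deletion, match) step pattern."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]
```

Because the warp can absorb repeated frames, a sequence matched against a time-stretched copy of itself costs zero.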
Slide 16. Speech Processing

- Feature processing: LPC; LSP; F0; Levinson-Durbin recursion; Tilt; Gaussian
- Acoustic & language model: Baum-Welch training; Viterbi decoding; CART; statistical language modeling
- Speech enhancement
- Speech analysis: discrete Wigner distribution; DWT; pitch determination; code-excited linear prediction (CELP)
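Pitch determination, listed under speech analysis, is most simply illustrated by picking the autocorrelation peak inside the plausible F0 range. Real pitch trackers add windowing, normalisation, voicing decisions, and octave-error handling, so treat this as a toy sketch with illustrative names.

```python
import math

def pitch_autocorr(frame, fs, f0_min=60.0, f0_max=400.0):
    """Crude F0 estimate: lag of the autocorrelation maximum within
    the lag range corresponding to [f0_min, f0_max]."""
    lag_min = int(fs / f0_max)
    lag_max = min(int(fs / f0_min), len(frame) - 1)

    def r(lag):
        return sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))

    best = max(range(lag_min, lag_max + 1), key=r)
    return fs / best

# demo signal: a 100 Hz sine sampled at 8 kHz, 400 samples (50 ms)
frame = [math.sin(2 * math.pi * 100 * i / 8000) for i in range(400)]
```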
Slide 17. Speech Recognition

- Feature processing: DCT; MFCC; mel-frequency log energy coefficients (MFLEC); subband (SB-MFCC); CMS; within-vector filtered (WVF-MFCC); robust formant (RF) algorithm; split Levinson algorithm (SLA)
- Vector quantization: VQ correlation; single VQ; joint VQ
- Acoustic & language model: Viterbi decoding; LDA; QDA; MLP; PLP; EM re-estimation; minimum Bayes risk (MBR); maximum likelihood estimation (MLE); NN (Elman, predictive); HMM (Baum-Welch); GMM; buried Markov model; decision-tree state clustering; WFST; dynamic Bayesian networks
Slide 18. Speech Synthesis

- Feature processing: MFCC; log area ratio (LAR); Bark frequency scale; FFT; power spectrum; LPC; LSF; F0; likelihood ratio; residual LP; mel log spectral approximation (MLSA); MLSA filter; covariance; energy; delta, delta-delta
- Acoustic & language model: Viterbi decoding; HMM (Baum-Welch); EM training; WFST; CART; harmonic plus noise modeling (HNM)
- Distances: Euclidean; Kullback-Leibler; mean squared log spectral distance (MS-LSD); Mahalanobis; Itakura-Saito (symmetrised); Itakura; RMS (root-mean-squared log spectral)
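Among the spectral distances on this slide, Itakura-Saito has a compact definition over power spectra, sketched below (the symmetrised variant averages the distortion in both directions).

```python
import math

def itakura_saito(p_spec, q_spec):
    """Itakura-Saito distortion between two power spectra.

    d(P, Q) = mean over bins of (P/Q - ln(P/Q) - 1); asymmetric in its
    arguments and zero iff the spectra are identical.
    """
    total = 0.0
    for p, q in zip(p_spec, q_spec):
        ratio = p / q
        total += ratio - math.log(ratio) - 1.0
    return total / len(p_spec)
```

The asymmetry is deliberate: the measure penalises underestimating spectral peaks more than overestimating them, which matches how LP spectral matching behaves.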
Slide 19. New Speech Functionality

- Feature processing: Bark scale; fundamental frequency; likelihood ratio; covariance (Cholesky); MLSA; CMS; SB-MFCC; WVF-MFCC; robust formant algorithm; split Levinson algorithm; LPCC; LFCC; RMS energy; delta cepstral (Milner, high order); pseudo LAR; PLP
- Acoustic & language model: HMM (Baum-Welch); HNM; MLP; WFST; CART; LDA, NLDA, QDA; minimum Bayes risk (MBR); maximum likelihood estimation; NN (Elman, predictive); discrete Wigner distribution; code-excited linear prediction
- Distances: Kullback-Leibler; mean squared log spectral distance (MS-LSD); Itakura-Saito (symmetrised); Itakura; RMS
Slide 20. Summary

- Intel® IPP 3.0 now covers most of the useful primitives for speech processing
- Speech-enabled applications still require more primitives
- Developers and researchers need more samples
Slide 21. Thank You!

Vitaly Horban (vgorban@unicyb.kiev.ua)