Speech Processing: Intel® Integrated Performance Primitives vs. Speech Libraries & Toolkits. Math Inside & Outside.
By Vitaly Horban. Seminar, NNSU Lab, Summer 2003.

Agenda
- Comparison of Intel® IPP 3.0 with speech libraries & toolkits
- Overview of mathematical methods for speech processing
- General assessment of Intel® IPP 3.0
- Summary

Acronyms
- LP: Linear Prediction
- RELP: Residual Linear Prediction
- PLP: Perceptual Linear Prediction
- AR: Area Ratios or Autoregressive
- LSP: Line Spectrum Pairs
- LSF: Line Spectral Frequencies
- MFCC: Mel-Frequency Cepstrum Coefficients
- MLSA: Mel Log Spectral Approximation
- DCT: Discrete Cosine Transform
- DTW: Dynamic Time Warping
- SVD: Singular Value Decomposition
- VQ: Vector Quantization
- RFC: Rise/Fall/Connections
- HMM: Hidden Markov Model
- ANN: Artificial Neural Network
- EM: Expectation-Maximization

Acronyms (continued)
- CMS: Cepstral Mean Subtraction
- MLP: Multi-Layer Perceptron
- LDA: Linear Discriminant Analysis
- QDA: Quadratic Discriminant Analysis
- NLDA: Non-Linear Discriminant Analysis
- SVM: Support Vector Machine
- DWT: Discrete Wavelet Transform
- LAR: Log Area Ratio
- PLAR: Pseudo Log Area Ratio
- GMM: Gaussian Mixture Model
- WFST: Weighted Finite State Transducer
- CART: Classification and Regression Trees
- HNM: Harmonic plus Noise Modeling
- MBR: Minimum Bayes Risk
- SR: Speech Recognition
- TTS: Text-To-Speech synthesis

IPP vs. CMU Sphinx

CMU Sphinx:
- Feature processing
  - LP
  - Spectrum
  - Cepstrum
  - MEL: filter, cepstrum, filter bank (see the mel-scale sketch below)
  - PLP: filter, cepstrum, filter bank
- Language model
  - Context-free grammar
  - N-gram model
- Acoustic model based on HMM
  - Each HMM state: a set of Gaussian mixtures
  - HMM order
  - HMM position
  - HMM transition matrix
  - Baum-Welch training

Intel® IPP 3.0:
- Feature processing
  - LP
  - Power spectrum
  - Cepstrum
  - LSP
  - Mel-scale values
  - Mel-frequency filter bank
  - Mel-cepstrum
  - Linear scale values
- Acoustic & language models
  - Gaussian mixture
  - Likelihood of an HMM state cluster
  - HMM transition matrix
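Both columns above include mel-scale computation. As a point of reference, here is a minimal Python sketch (illustrative, not IPP or Sphinx code) of the common 2595·log10(1 + f/700) mel-frequency mapping and its inverse; the 8 kHz band edge and filter count are arbitrary example values:

```python
import math

def hz_to_mel(f_hz):
    """Map frequency in Hz to the mel scale (a common variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping: mel back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Example: center frequencies of a small mel-spaced filter bank
low, high, n_filters = hz_to_mel(0.0), hz_to_mel(8000.0), 10
centers = [mel_to_hz(low + i * (high - low) / (n_filters + 1))
           for i in range(1, n_filters + 1)]
print([round(c) for c in centers])
```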

IPP vs. CSLU Toolkit

CSLU Toolkit:
- Feature processing
  - Power spectral analysis (FFT)
  - Linear predictive analysis (LPC)
  - PLP
  - Mel-scale cepstral analysis (MEL)
  - Relative spectra filtering of log-domain coefficients (RASTA)
  - First-order derivative (DELTA; see the sketch below)
  - Energy normalization
- Language model
  - Word pronunciation
  - Lexical trees
  - Grammars
- Acoustic model based on HMM/ANN
  - VQ initialisation
  - EM training
  - Viterbi decoding

Intel® IPP 3.0:
- Feature processing
  - Power spectral analysis (FFT)
  - Linear predictive analysis (LPC)
  - LP reflection coefficients
  - LSP
  - DCT
  - RFC
  - Cross-correlation coefficients
  - Covariance matrix
  - Mel-scale cepstral analysis
  - Derivative functions
  - Energy normalization
- Acoustic & language model
  - VQ
  - EM re-estimation of weights, means, and variances
  - Viterbi decoding
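The first-order derivative (DELTA) feature in the CSLU column is easy to make concrete. A minimal sketch of the standard regression-based delta computation; the window half-width N is an illustrative choice, not a CSLU default:

```python
def delta(frames, N=2):
    """Delta coefficients: regression over +/-N neighboring frames.
    frames: list of feature vectors (lists of floats), one per frame."""
    T, dim = len(frames), len(frames[0])
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    out = []
    for t in range(T):
        d = []
        for k in range(dim):
            num = 0.0
            for n in range(1, N + 1):
                # clamp indices at the edges of the utterance
                plus = frames[min(t + n, T - 1)][k]
                minus = frames[max(t - n, 0)][k]
                num += n * (plus - minus)
            d.append(num / denom)
        out.append(d)
    return out

print(delta([[0.0], [1.0], [2.0], [3.0]], N=1))
# [[0.5], [1.0], [1.0], [0.5]]
```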

IPP vs. Festival

Festival:
- Feature processing
  - Power spectrum
  - Tilt to RFC, RFC to Tilt, RFC to F0
  - LPC
  - MEL
  - LSF
  - LP reflection coefficients
  - Fundamental frequency (pitch)
  - Root-mean-square energy (see the sketch below)
- Language model
  - N-gram model
  - Context-free grammar
  - Regular expressions
  - CART trees
  - WFST
- Acoustic model
  - Viterbi decoding

Intel® IPP 3.0:
- Feature processing
  - Power spectrum
  - Reflection to Tilt, PitchmarkToF0, Unit Curve (RFC)
  - LPC
  - MEL
  - LSF
  - LP reflection coefficients
  - Energy normalization
- Acoustic & language model
  - Viterbi decoding
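Root-mean-square energy, on the Festival side, is simple enough to show in full. A minimal sketch; the frame length and hop size are arbitrary example values, not Festival parameters:

```python
import math

def rms_energy(samples):
    """Root-mean-square energy of one analysis frame."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def frame_energies(signal, frame_len=160, hop=80):
    """Per-frame RMS energy over a signal (illustrative frame/hop sizes)."""
    return [rms_energy(signal[i:i + frame_len])
            for i in range(0, len(signal) - frame_len + 1, hop)]

print(rms_energy([1.0, -1.0, 1.0, -1.0]))  # 1.0
```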

IPP vs. ISIP

ISIP:
- Feature processing
  - Derivative functions
  - Spectrum
  - Cepstrum
  - Cross-correlation
  - Covariance matrix
  - Covariance (Cholesky)
  - Energy (log, dB, RMS, power)
  - Filter bank
  - Log Area Ratio (Kelly-Lochbaum)
  - Autocorrelation (Durbin recursion, Leroux-Guegen; see the sketch below)
  - Lattice (Burg)
  - Reflection coefficients
  - Gaussian probability
- Acoustic & language model (HMM)
  - N-gram model
  - Viterbi decoding
  - Baum-Welch training

Intel® IPP 3.0:
- Feature processing
  - Derivative functions
  - Spectrum
  - Cepstrum
  - Cross-correlation
  - Covariance matrix
  - Energy normalization
  - Filter bank
  - Area ratio
  - Durbin's recursion
  - Reflection coefficients (Schur)
  - Gaussian probability
- Acoustic & language model
  - Viterbi decoding
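Both columns rely on the Levinson-Durbin recursion to turn autocorrelation values into LP and reflection coefficients. A minimal sketch of the textbook recursion (an illustration, not the ISIP or IPP implementation):

```python
def levinson_durbin(r, order):
    """Levinson-Durbin recursion on autocorrelation values r[0..order].
    Returns (a, k, err): polynomial coefficients a[0..order] with a[0] = 1
    (A(z) = 1 + a[1] z^-1 + ...), reflection coefficients k, and the
    final prediction-error energy."""
    a = [1.0] + [0.0] * order
    err = r[0]
    k_all = []
    for i in range(1, order + 1):
        # reflection (PARCOR) coefficient for stage i
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        k_all.append(k)
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, k_all, err

a, k, e = levinson_durbin([1.0, 0.5, 0.25], 2)
print(a, k, e)  # second coefficient ~0: the data fit an AR(1) model
```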

IPP vs. MATLAB

MATLAB:
- Frequency scale conversion
  - Mel scale
  - Equivalent rectangular bandwidths (ERB)
- Transforms
  - FFT (real data)
  - DCT (real data)
  - Hartley (real data)
  - Diagonalisation of two Hermitian matrices (LDA, IMELDA)
- Vector distance
  - Euclidean
  - Squared Euclidean
  - Mahalanobis
  - Itakura (AR, power spectra)
  - Itakura-Saito (AR, power spectra; see the sketch below)
  - COSH (AR, power spectra)
- Speech enhancement
  - Martin spectral subtraction algorithm

Intel® IPP 3.0:
- Frequency scale conversion
  - Mel scale
  - Linear scale
- Transforms
  - DFT
  - FFT
  - DCT
- Distance
  - Euclidean
  - Mahalanobis
  - DTW (observation and reference vector sequences)
  - Bhattacharya
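The Itakura-Saito distortion in the MATLAB column (absent from IPP's distance list) has a short closed form over sampled power spectra. A minimal sketch of one common formulation:

```python
import math

def itakura_saito(p, q):
    """Itakura-Saito distortion between two sampled power spectra p, q.
    Non-negative, zero only for identical spectra, and asymmetric:
    d(p, q) != d(q, p) in general."""
    assert len(p) == len(q)
    total = 0.0
    for pi, qi in zip(p, q):
        r = pi / qi
        total += r - math.log(r) - 1.0
    return total / len(p)

print(itakura_saito([1.0, 2.0], [1.0, 2.0]))  # 0.0 for identical spectra
print(itakura_saito([2.0, 1.0], [1.0, 2.0]))  # > 0
```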

IPP vs. MATLAB (continued)

MATLAB:
- LPC analysis and transforms
  - Area ratios
  - Autoregressive (AR)
  - Power spectrum
  - Cepstrum (see the LPC-to-cepstrum sketch below)
  - DCT
  - Impulse response (IR)
  - LSP
  - LSF
  - Reflection coefficients
  - Unit-triangular matrix containing the AR coefficients
  - Autocorrelation coefficients
  - Expand formant bandwidths of LPC filter
  - Warp cepstral (Mel, Linear)

Intel® IPP 3.0:
- Feature processing
  - LPC
  - Area ratio
  - Spectrum
  - Cepstrum
  - RFC
  - DCT
  - LSP
  - LSF
  - Reflection coefficients
  - Autocorrelation coefficients
  - Cross-correlation coefficients
  - Covariance matrix
  - Mel-scale cepstral analysis
  - Derivative functions
  - Energy normalization
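One transform above, LPC to cepstrum, follows a well-known recursion. A minimal sketch using the Rabiner & Juang sign convention; this is an illustration only, since MATLAB toolboxes and IPP each have their own conventions:

```python
def lpc_to_cepstrum(a, n_ceps):
    """Convert LPC predictor coefficients to LPC cepstrum.
    Convention: x[n] is predicted as sum(a[k] * x[n-k], k=1..p),
    i.e. a = [a1, ..., ap] with the plus sign (Rabiner & Juang).
    The gain term c0 is omitted. Returns [c1, ..., c_{n_ceps}]."""
    p = len(a)
    c = []
    for m in range(1, n_ceps + 1):
        acc = a[m - 1] if m <= p else 0.0
        for k in range(max(1, m - p), m):
            acc += (k / m) * c[k - 1] * a[m - k - 1]
        c.append(acc)
    return c

# Single pole a1 = 0.5: c_m should equal 0.5**m / m
print(lpc_to_cepstrum([0.5], 4))  # [0.5, 0.125, ~0.0417, ~0.0156]
```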

IPP vs. MATLAB (continued)

MATLAB:
- Speech synthesis
  - Rosenberg glottal model
  - Liljencrants-Fant glottal model
- Speech recognition
  - Mel-cepstrum
  - Mel filter bank
  - Cepstral means & variances to power domain
  - Gaussian mixture
- Speech coding (ITU G.711)
  - Linear PCM
  - A-law
  - Mu-law (see the companding sketch below)
  - VQ using the K-means algorithm
  - VQ using the Linde-Buzo-Gray algorithm

Intel® IPP 3.0:
- Speech recognition
  - Feature processing
  - Model evaluation
  - Model estimation
  - Model adaptation
  - Vector quantization
- Speech coding (ITU G.711, G.723.1, G.729)
  - Linear PCM
  - A-law
  - Mu-law
  - VQ given a codebook
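Both columns include G.711 mu-law coding. The sketch below shows the continuous mu-law companding characteristic; note that the real G.711 codec uses a piecewise-linear 8-bit approximation of this curve rather than the formula itself:

```python
import math

MU = 255.0  # standard companding constant for mu-law

def mulaw_compress(x):
    """Continuous mu-law characteristic for x in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mulaw_expand(y):
    """Inverse of the continuous mu-law characteristic."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

x = 0.1
y = mulaw_compress(x)
print(y, mulaw_expand(y))  # ~0.591, then back to 0.1
```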

IPP vs. HTK

HTK:
- Feature processing
  - LPC
  - Spectral coefficients
  - Cepstral coefficients
  - Reflection coefficients
  - Gaussian distribution
  - K-means procedure (see the VQ sketch below)
  - PLP
  - Autocorrelation coefficients
  - Covariance matrix
  - Mel-scale filter bank
  - MFCC
  - Third differential
  - Energy
  - VQ codebook

Intel® IPP 3.0:
- Feature processing
  - LPC
  - Area ratio
  - Spectrum
  - Cepstrum
  - RFC
  - DCT
  - LSP
  - LSF
  - Reflection coefficients
  - Autocorrelation coefficients
  - Cross-correlation coefficients
  - Covariance matrix
  - Mel-scale cepstral analysis
  - Derivative functions
  - Energy normalization
  - VQ
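HTK's K-means procedure for building a VQ codebook reduces to plain Lloyd iterations. A minimal sketch, with the data, codebook size, and iteration count as toy values:

```python
import random

def kmeans_codebook(vectors, k, iters=20, seed=0):
    """Train a VQ codebook with plain Lloyd/k-means iterations.
    vectors: list of equal-length float lists. Returns k centroids."""
    rng = random.Random(seed)
    code = rng.sample(vectors, k)            # initialize from the data
    for _ in range(iters):
        cells = [[] for _ in range(k)]
        for v in vectors:                    # assign to nearest centroid
            j = min(range(k), key=lambda i: sum((a - b) ** 2
                                                for a, b in zip(v, code[i])))
            cells[j].append(v)
        for i, cell in enumerate(cells):     # re-estimate centroids
            if cell:
                code[i] = [sum(col) / len(cell) for col in zip(*cell)]
    return code

data = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]]
print(kmeans_codebook(data, 2))  # centroids near (0.05, 0) and (0.95, 1.05)
```

The Linde-Buzo-Gray variant mentioned elsewhere in this deck grows the codebook progressively by splitting centroids instead of sampling an initial set.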

IPP vs. HTK (continued)

HTK:
- Model adaptation
  - Maximum Likelihood Linear Regression (MLLR)
  - EM technique
  - Bayesian adaptation, or Maximum A Posteriori approach (MAP)
- Acoustic & language model based on HMM
  - Grammar
  - N-gram model
  - Viterbi training
  - Baum-Welch training (see the forward-recursion sketch below)
- Speech coding
  - Linear PCM
  - A-law
  - Mu-law

Intel® IPP 3.0:
- Model adaptation
  - EM training algorithm
- Acoustic & language model
  - Viterbi decoding
  - Likelihood of an HMM state cluster
  - HMM transition matrix
- Speech coding (ITU G.711, G.723.1, G.729)
  - Linear PCM
  - A-law
  - Mu-law
  - VQ given a codebook
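Baum-Welch training appears on the HTK side but not in IPP 3.0. Its central ingredient is the forward recursion, sketched below for a discrete-observation HMM (a toy model with made-up probabilities, not HTK code):

```python
def forward(pi, A, B, obs):
    """Forward algorithm: total likelihood P(obs | HMM).
    pi[i]: initial state probs; A[i][j]: transition probs;
    B[i][o]: emission prob of symbol o in state i; obs: symbol indices."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [B[j][o] * sum(alpha[i] * A[i][j] for i in range(n))
                 for j in range(n)]
    return sum(alpha)

# Toy 2-state example
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
print(forward(pi, A, B, [0, 1, 0]))
```

A production implementation would scale the alphas or work in log space to avoid underflow on long observation sequences.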

Possible extensions to IPP 3.0
- Model adaptation
  - Maximum Likelihood Linear Regression (MLLR)
  - Bayesian adaptation, or Maximum A Posteriori approach (MAP)
- Model evaluation
  - Itakura (AR, power spectra)
  - Itakura-Saito (AR, power spectra)
  - COSH (AR, power spectra)
- Speech synthesis
  - Rosenberg glottal model
  - Liljencrants-Fant glottal model
- Speech enhancement
  - Martin spectral subtraction
- Speech coding
  - VQ using the K-means algorithm
  - VQ using the Linde-Buzo-Gray algorithm
- Acoustic model based on HMM
  - Baum-Welch training
- Feature processing
  - PLP: filter, cepstrum, filter bank
  - Relative spectra filtering of log-domain coefficients (RASTA)
  - Fundamental frequency (pitch)
  - RMS energy
  - Covariance (Cholesky)
  - Energy (log, dB, RMS, power)
  - LAR (Kelly-Lochbaum)
  - Autocorrelation (Leroux-Guegen)
  - Lattice (Burg)
  - Equivalent rectangular bandwidths (ERB)
  - Unit-triangular matrix (AR coefficients)
  - Expand formant bandwidths (LP)
  - Third differential
  - Hartley transform
  - Diagonalisation of two Hermitian matrices (LDA, IMELDA)

Speaker Characteristics
- Feature processing
  - Pre-emphasis
  - Cepstral
  - Energy
  - Cepstral Mean Subtraction (CMS; see the sketch below)
  - MFCC, LPCC, LFCC
  - LPC (to cepstral, to LSF)
  - Residual prediction
  - Mel-cepstral
  - Fundamental frequency (F0)
  - LSF (Bark scale)
  - RMS energy
  - Levinson-Durbin recursion
  - Covariance (Cholesky)
  - Delta cepstral (Milner, high-order)
  - Pseudo Log Area Ratio (PLAR)
  - DWT
  - VQ
- Acoustic model
  - Distance: Bhattacharya, Euclidean
  - DTW
  - Viterbi decoding
  - EM (Lloyd)
  - K-means (Lloyd)
  - PLP
  - MLP
  - Twin-output MLP
  - LDA
  - NLDA
- Generative models
  - GMM
  - HMM (Baum-Welch)
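Cepstral Mean Subtraction (CMS), listed above, is the simplest of these channel-normalization steps: subtract the utterance-level mean from every cepstral vector. A minimal sketch:

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-utterance mean from each cepstral coefficient.
    frames: list of cepstral vectors; returns normalized copies."""
    T, dim = len(frames), len(frames[0])
    mean = [sum(f[k] for f in frames) / T for k in range(dim)]
    return [[f[k] - mean[k] for k in range(dim)] for f in frames]

print(cepstral_mean_subtraction([[1.0, 2.0], [3.0, 4.0]]))
# [[-1.0, -1.0], [1.0, 1.0]]
```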

Speech Processing
- Feature processing
  - LPC
  - LSP
  - F0
  - Levinson-Durbin recursion
  - Tilt
  - Gaussian
- Acoustic & language model
  - Baum-Welch training
  - Viterbi decoding
  - CART
  - Statistical language modeling
- Speech enhancement
- Speech analysis
  - Discrete Wigner Distribution
  - DWT
  - Pitch determination (see the sketch below)
  - Code Excited Linear Predictor (CELP)
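Pitch determination, listed under speech analysis above, is often introduced via the autocorrelation method. A deliberately crude sketch (peak-picking only, no voicing decision or smoothing; the F0 search range is an example choice):

```python
import math

def autocorr_pitch(frame, fs, f0_min=60.0, f0_max=400.0):
    """Crude pitch estimate: lag of the normalized autocorrelation peak
    within a plausible F0 range. Returns F0 in Hz (toy method)."""
    lag_min = int(fs / f0_max)
    lag_max = min(int(fs / f0_min), len(frame) - 1)

    def r(lag):
        s = sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
        return s / (len(frame) - lag)   # normalize for the shorter overlap

    best = max(range(lag_min, lag_max + 1), key=r)
    return fs / best

fs = 8000
frame = [math.sin(2 * math.pi * 100 * n / fs) for n in range(400)]
print(round(autocorr_pitch(frame, fs)))  # ~100
```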

Speech Recognition
- Feature processing
  - DCT
  - MFCC
  - Mel-frequency log energy coefficients (MFLEC)
  - Subband MFCC (SB-MFCC)
  - CMS
  - Within-Vector Filtered MFCC (WVF-MFCC)
  - Robust Formant (RF) algorithm
  - Split Levinson Algorithm (SLA)
- Vector quantization
  - VQ correlation
  - Single VQ
  - Joint VQ
- Acoustic & language model
  - Viterbi decoding (see the sketch below)
  - LDA
  - QDA
  - MLP
  - PLP
  - EM re-estimation
  - Minimum Bayes Risk (MBR)
  - Maximum Likelihood Estimation (MLE)
  - NN (Elman predictive)
  - HMM (Baum-Welch)
  - GMM
  - Buried Markov Model
  - Decision-tree state clustering
  - WFST
  - Dynamic Bayesian Networks
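Viterbi decoding recurs in nearly every acoustic-model column in this deck. A minimal sketch for a discrete-output HMM, with toy probabilities (illustrative only; real decoders work in log space and prune the search):

```python
def viterbi(pi, A, B, obs):
    """Viterbi decoding: most likely state path for a discrete-output HMM."""
    n = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    back = []
    for o in obs[1:]:
        psi, new = [], []
        for j in range(n):
            i_best = max(range(n), key=lambda i: delta[i] * A[i][j])
            psi.append(i_best)
            new.append(delta[i_best] * A[i_best][j] * B[j][o])
        back.append(psi)
        delta = new
    path = [max(range(n), key=lambda i: delta[i])]
    for psi in reversed(back):          # trace best predecessors backward
        path.append(psi[path[-1]])
    return list(reversed(path))

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
print(viterbi(pi, A, B, [0, 0, 1, 1]))  # [0, 0, 1, 1]
```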

Speech Synthesis
- Feature processing
  - MFCC
  - Log Area Ratio (LAR)
  - Bark frequency scale
  - FFT
  - Power spectrum
  - LPC
  - LSF
  - F0
  - Likelihood ratio
  - Residual LP
  - Mel Log Spectral Approximation (MLSA)
  - MLSA filter
  - Covariance
  - Energy
  - Delta, Delta-Delta
- Acoustic & language model
  - Viterbi decoding
  - HMM (Baum-Welch)
  - EM training
  - WFST
  - CART
  - Harmonic plus Noise Modeling (HNM)
- Distance
  - Euclidean
  - Kullback-Leibler
  - Mean Squared Log Spectral Distance (MS-LSD)
  - Mahalanobis
  - Itakura-Saito
  - Symmetrised Itakura
  - RMS (root-mean-squared log spectral)

New Speech Functionality
- Feature processing
  - Bark scale
  - Fundamental frequency
  - Likelihood ratio
  - Covariance (Cholesky)
  - MLSA
  - CMS
  - SB-MFCC
  - WVF-MFCC
  - Robust formant algorithm
  - Split Levinson algorithm
  - LPCC
  - LFCC
  - RMS energy
  - Delta cepstral (Milner, high-order)
  - Pseudo LAR
  - PLP
- Acoustic & language model
  - HMM (Baum-Welch)
  - HNM
  - MLP
  - WFST
  - CART
  - LDA, NLDA, QDA
  - Minimum Bayes Risk (MBR)
  - Maximum Likelihood Estimation
  - NN (Elman predictive)
  - Discrete Wigner Distribution
  - Code Excited Linear Predictor
- Distance
  - Kullback-Leibler
  - Mean Squared Log Spectral Distance (MS-LSD)
  - Itakura-Saito
  - Symmetrised Itakura
  - RMS

Summary
- Intel® IPP 3.0 now covers the most useful primitives for speech processing
- Speech-enabled applications still require more primitives
- Developers and researchers need more samples

Thank You!
Vitaly Horban