Feature Selection, Acoustic Modeling and Adaptation: SDSG Review of Recent Work
Technical University of Crete, Speech Processing and Dialog Systems Group
Presenter: Alex Potamianos

TUC - SDSG Outline
Prior work:
– Adaptation
– Acoustic modeling
– Robust feature selection
Bridge over to the HIWIRE work-plan:
– Robust features, acoustic modeling, adaptation
– New areas: audio-visual ASR, microphone arrays

TUC - SDSG Adaptation
Transformation-based adaptation
MAP adaptation (Bayesian learning approximation)
Speaker clustering / speaker-space models
Robust feature selection
Combinations

TUC - SDSG Acoustic Model Adaptation: SDSG Selected Work
Constrained Estimation Adaptation
Maximum Likelihood Stochastic Transformations (MLST)
Combined Transformation-MAP adaptation
MLST basis vectors
Incremental adaptation
Dependency modeling of biases
Vocal Tract Normalization with linear transformation

TUC - SDSG Constrained Estimation Adaptation (Digalakis 1995)
Hypothesize a sequence of feature-space linear transformations:
$\hat{x}_t = A x_t + b$
The adapted models are then Gaussians with
$\hat{\mu} = A \mu + b$, $\hat{\Sigma} = A \Sigma A^T$, with $A$ diagonal.
Adaptation is equivalent to estimating the state-dependent transformation parameters $(A, b)$.
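A minimal numpy sketch of the constrained model-space update, assuming a single diagonal transform per state cluster; the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def constrained_adapt(mu, Sigma, a, b):
    """Apply a constrained (feature-space) transform y = A x + b to a
    Gaussian N(mu, Sigma). With diagonal A (vector a), the adapted
    model is N(A mu + b, A Sigma A^T)."""
    A = np.diag(a)
    mu_hat = A @ mu + b
    Sigma_hat = A @ Sigma @ A.T
    return mu_hat, Sigma_hat

# Illustrative 3-dimensional example.
mu = np.array([0.5, -1.0, 2.0])
Sigma = np.diag([1.0, 0.5, 2.0])
a = np.array([1.1, 0.9, 1.05])   # diagonal scaling terms
b = np.array([0.1, 0.0, -0.2])   # bias terms
mu_hat, Sigma_hat = constrained_adapt(mu, Sigma, a, b)
print(mu_hat, np.diag(Sigma_hat))
```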

TUC - SDSG Compared to MLLR (Leggetter 1995)
Both were published at about the same time.
MLLR is model adaptation only, and it transforms only the model means.
The transformation matrix in MLLR is block diagonal.
Constrained estimation is more generic.

TUC - SDSG Limitations of the Linear Assumption
The linear assumption may be too restrictive in modeling the dependency between training and testing conditions.
Goal: try a more complex transformation.
All Gaussians in a class are restricted to be transformed identically, using the same transformation.
Goal: let each Gaussian in a class decide on its own transformation.
Which transformation transforms each Gaussian is predefined.
Goal: let the system automatically choose the transformation-Gaussian pairings.

TUC - SDSG ML Stochastic Transformations (MLST) (Diakoloukas Digalakis 1997)
Hypothesize a sequence of feature-space stochastic transformations of the form:
$y_t = A_j x_t + b_j$ with probability $\lambda_j$, $j = 1, \dots, N_c$.

TUC - SDSG MLST: model-space
Use a set of MLSTs instead of linear transformations.
The adapted observation densities become mixtures over the component transforms:
$\hat{f}(y) = \sum_k p_k \sum_j \lambda_j \, N(y;\, A_j \mu_k + b_j,\, A_j \Sigma_k A_j^T)$
– MLST-Method I: $A_j$ is diagonal
– MLST-Method II: $A_j$ is block diagonal
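A sketch of evaluating the MLST-adapted density for one SI Gaussian, assuming fixed component transforms and weights; `mlst_density` and its arguments are illustrative names:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mlst_density(x, mu, Sigma, transforms, weights):
    """Evaluate an MLST-adapted density: a mixture over component
    transforms (A_j, b_j) with weights lambda_j, applied to a single
    SI Gaussian N(mu, Sigma). Real systems apply this per mixture
    component and per regression class."""
    p = 0.0
    for (A, b), lam in zip(transforms, weights):
        p += lam * multivariate_normal.pdf(x, mean=A @ mu + b, cov=A @ Sigma @ A.T)
    return p

mu = np.zeros(2)
Sigma = np.eye(2)
transforms = [(np.eye(2), np.zeros(2)),                  # identity component
              (1.2 * np.eye(2), np.array([0.3, -0.1]))]  # scaled + shifted
weights = [0.6, 0.4]                                     # lambda_j, sum to 1
print(mlst_density(np.array([0.2, 0.1]), mu, Sigma, transforms, weights))
```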

TUC - SDSG MLST: Reduce the number of mixture components
The adapted mixture densities consist of (number of SI Gaussians) x (number of component transforms) Gaussians.
Reduce the Gaussians back to their SI number:
– HPT: Apply the component transformation with the highest probability to each Gaussian.
– LCT: Apply a linear combination of all component transforms.
– MTG: Merge the transformed Gaussians.
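Hedged sketches of the HPT and LCT reductions, assuming the per-Gaussian transform weights are already estimated; the MTG moment-matching formulas are omitted:

```python
import numpy as np

def reduce_hpt(mu, Sigma, transforms, weights):
    """HPT: keep only the component transform with the highest weight."""
    j = int(np.argmax(weights))
    A, b = transforms[j]
    return A @ mu + b, A @ Sigma @ A.T

def reduce_lct(mu, Sigma, transforms, weights):
    """LCT: apply the weight-averaged (linearly combined) transform."""
    A = sum(lam * A_j for (A_j, _), lam in zip(transforms, weights))
    b = sum(lam * b_j for (_, b_j), lam in zip(transforms, weights))
    return A @ mu + b, A @ Sigma @ A.T

mu, Sigma = np.zeros(2), np.eye(2)
transforms = [(np.eye(2), np.zeros(2)), (1.2 * np.eye(2), np.ones(2))]
weights = [0.7, 0.3]
print(reduce_hpt(mu, Sigma, transforms, weights)[0])  # mean under top transform
print(reduce_lct(mu, Sigma, transforms, weights)[0])  # mean under combined transform
```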

TUC - SDSG Schematic representation of MLST adaptation

TUC - SDSG MLST properties
$A_{sj}, b_{sj}$ are shared at a state or state-cluster level.
Transformation weights $\lambda_j$ are estimated at a Gaussian level.
MLST combines transformed Gaussians.
MLST is flexible in how it selects a transformation for each Gaussian.
MLST can choose an arbitrary number of transformations per class.

TUC - SDSG MLST compared to ML Linear Transforms
Hard versus soft decision:
– The linear component is chosen based on the training samples.
Adaptation resolution:
– Linear components are common to a transformation class.
– The transformation is chosen at the Gaussian level.
– Increased adaptation resolution combined with robust estimation.

TUC - SDSG MLST basis transforms (Boulis Diakoloukas Digalakis 2000)
Algorithm steps:
– Cluster the training-speaker space into classes.
– Train MLST component transforms using data from each training-speaker class.
– Use the adaptation data to estimate only the transformation weights.
This is like injecting a priori knowledge into the estimation process.
Results in rapid speaker adaptation, with significant gains for medium and small adaptation sets. A sketch of the weight-estimation step appears below.
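A minimal EM sketch for that weight-estimation step, assuming a single SI Gaussian and fixed, pre-trained basis transforms; only the weights $\lambda_j$ are updated, and all names are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def estimate_mlst_weights(frames, mu, Sigma, transforms, n_iter=20):
    """EM re-estimation of MLST weights lambda_j with the basis
    transforms (A_j, b_j) held fixed (trained offline on speaker
    clusters). frames: (T, d) adaptation data aligned to one Gaussian."""
    J = len(transforms)
    lam = np.full(J, 1.0 / J)
    # Precompute per-transform likelihoods of each frame: shape (T, J).
    liks = np.stack([
        multivariate_normal.pdf(frames, mean=A @ mu + b, cov=A @ Sigma @ A.T)
        for A, b in transforms
    ], axis=1)
    for _ in range(n_iter):
        post = lam * liks                        # E-step: unnormalized posteriors
        post /= post.sum(axis=1, keepdims=True)
        lam = post.mean(axis=0)                  # M-step: average responsibility
    return lam

rng = np.random.default_rng(0)
frames = rng.normal(0.3, 1.0, size=(50, 2))
transforms = [(np.eye(2), np.zeros(2)), (np.eye(2), 0.3 * np.ones(2))]
print(estimate_mlst_weights(frames, np.zeros(2), np.eye(2), transforms))
```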

TUC - SDSG Combined Transformation Bayesian (Digalakis Neumeyer 1996)
Use the transformation-adapted parameters as priors for MAP. The MAP estimate of a Gaussian mean can then be expressed as:
$\hat{\mu}_{MAP} = \frac{\tau \, \tilde{\mu} + \sum_t \gamma_t x_t}{\tau + \sum_t \gamma_t}$
where $\tilde{\mu} = A \mu + b$ is the transformed prior mean, $\gamma_t$ the state occupancy, and $\tau$ the prior weight.
Retains the asymptotic properties of MAP.
Retains the fast adaptation rates of transformations.
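A sketch of the combined update for a single Gaussian mean, assuming the prior mean has already been moved by the transformation; `tau` is the usual MAP prior weight, and the values are illustrative:

```python
import numpy as np

def map_mean_update(mu_prior, frames, gammas, tau=10.0):
    """MAP estimate of a Gaussian mean with prior weight tau. In the
    combined scheme the prior mean is the transformation-adapted mean
    (A mu + b) rather than the SI mean. gammas: per-frame state
    occupancies from the E-step."""
    num = tau * mu_prior + (gammas[:, None] * frames).sum(axis=0)
    den = tau + gammas.sum()
    return num / den

# Illustrative: prior already transformed, 100 adaptation frames.
rng = np.random.default_rng(1)
mu_transformed = np.array([0.4, -0.2])
frames = rng.normal([0.5, -0.1], 1.0, size=(100, 2))
gammas = np.ones(100)  # pretend hard alignment to this state
print(map_mean_update(mu_transformed, frames, gammas))
```

With little data the estimate stays near the transformed prior (fast adaptation); as the occupancy grows it converges to the ML estimate (MAP asymptotics).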

TUC - SDSG Rapid Speech Recognizer Adaptation (Digalakis et al. 2000)
Dependence models of the bias components of cascaded transforms.
Techniques:
– Gaussian multiscale process
– Hierarchical tree-structured prior
– Explicit correlation models
– Markov random fields

TUC - SDSG VTN with Linear Transformation (Potamianos and Rose 1997, Potamianos and Narayanan 1998)
Vocal Tract Normalization: select the optimal warping factor $a$ according to
$\hat{a} = \arg\max_a P(X^a \mid a, \lambda, H)$
where $H$ is the transcription and $X^a$ is the observation sequence frequency-warped by factor $a$.
VTN with linear transformation:
$\{\hat{a}, \hat{\theta}\} = \arg\max_{a, \theta} P(h_\theta(X^a) \mid a, \theta, \lambda, H)$
where $h_\theta(\cdot)$ is a parametric linear transformation with parameter $\theta$.
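A toy sketch of the warping-factor search; the grid range and the stand-in likelihood are illustrative, not values from the papers:

```python
import numpy as np

def select_warp_factor(score_fn, warp_grid=np.arange(0.88, 1.13, 0.02)):
    """Grid search for the ML warping factor: score each candidate a by
    the likelihood of the warped features given the transcription, and
    keep the argmax. score_fn(a) stands in for P(X^a | a, model, H)."""
    scores = [score_fn(a) for a in warp_grid]
    return warp_grid[int(np.argmax(scores))]

# Toy stand-in for the recognizer log-likelihood, peaked near a = 1.04.
toy_loglik = lambda a: -((a - 1.04) ** 2)
print(select_warp_factor(toy_loglik))   # -> closest grid point to 1.04
```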

TUC - SDSG Acoustic Modeling: SDSG Selected Work
Genones: a generalized Gaussian mixture tying scheme
Stochastic Segment Models (SSMs)

TUC - SDSG Genones: Generalized Mixture Tying (Digalakis Monaco Murveit 1996)
Algorithm steps:
– Clustering of HMM states based on the similarity of their distributions.
– Splitting: construct seed codebooks for each state cluster, either by identifying the most likely mixture-component subset or by clustering down the original codebook.
– Re-estimation of the parameters using Baum-Welch.
Better trade-off between modelling resolution and robustness.
Genones are used in Decipher and Nuance. A sketch of the clustering step appears below.
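An illustrative sketch of the state-clustering step, using a symmetric KL divergence between diagonal Gaussians as a stand-in for the paper's own likelihood-based distance:

```python
import numpy as np
from itertools import combinations

def sym_kl_gauss(m1, v1, m2, v2):
    """Symmetric KL divergence between diagonal Gaussians (illustrative
    distance; the Genones paper defines its own similarity measure)."""
    kl = lambda ma, va, mb, vb: 0.5 * np.sum(
        va / vb + (mb - ma) ** 2 / vb - 1 + np.log(vb / va))
    return kl(m1, v1, m2, v2) + kl(m2, v2, m1, v1)

def cluster_states(states, n_clusters):
    """Greedy agglomerative clustering of HMM state output
    distributions, merging the closest pair until n_clusters remain.
    Each state is (mean, var); a merged cluster is the average."""
    clusters = [([i], m, v) for i, (m, v) in enumerate(states)]
    while len(clusters) > n_clusters:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: sym_kl_gauss(clusters[ij[0]][1], clusters[ij[0]][2],
                                               clusters[ij[1]][1], clusters[ij[1]][2]))
        ids = clusters[a][0] + clusters[b][0]
        m = (clusters[a][1] + clusters[b][1]) / 2
        v = (clusters[a][2] + clusters[b][2]) / 2
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [(ids, m, v)]
    return [c[0] for c in clusters]

states = [(np.array([0.0]), np.array([1.0])),
          (np.array([0.1]), np.array([1.0])),
          (np.array([5.0]), np.array([1.0]))]
print(cluster_states(states, 2))   # states 0 and 1 merge
```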

TUC - SDSG Segment Models
HMM limitations:
– Weak duration modelling.
– The conditional-independence-of-observations assumption.
– Restrictions on feature extraction imposed by frame-based observations.
Segment-model motivation:
– Larger number of degrees of freedom in the model.
– Use of segmental features.
– Modelling of the correlation of frame-based features.
– Powerful modelling of transitions and longer-range speech dynamics.
– Less distortion for segmental coding, hence more efficient segmental recognition.

TUC - SDSG General Stochastic Segment Models
A segment $s$ in an utterance of $N$ frames is $s \in \{(\tau_a, \tau_b) : 1 \le \tau_a \le \tau_b \le N\}$.
Segment model density, for label $\alpha$ and duration $l = \tau_b - \tau_a + 1$:
$p(y_{\tau_a}, \dots, y_{\tau_b} \mid \alpha) = p(y_{\tau_a}, \dots, y_{\tau_b} \mid l, \alpha)\, p(l \mid \alpha)$
Segment models generate a variable-length sequence of frames.

TUC - SDSG Stochastic Segment Model (Ostendorf Digalakis 1992)
Problem: model the time correlation within a segment.
Solution: Gaussian model variations based on assumptions about the form of the statistical dependency:
– Gauss-Markov model
– Dynamical system model
– Target state model
A sketch of the Gauss-Markov case appears below.
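A minimal sketch of the Gauss-Markov variant, assuming a first-order linear-Gaussian dependence between successive frames within the segment; the matrices are illustrative rather than trained parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def gauss_markov_loglik(frames, mu0, Sigma0, F, Q):
    """Log-likelihood of a segment under a first-order Gauss-Markov
    model: y_1 ~ N(mu0, Sigma0), y_t | y_{t-1} ~ N(F y_{t-1}, Q).
    This captures the within-segment time correlation that a standard
    HMM's conditional-independence assumption discards."""
    ll = mvn.logpdf(frames[0], mean=mu0, cov=Sigma0)
    for t in range(1, len(frames)):
        ll += mvn.logpdf(frames[t], mean=F @ frames[t - 1], cov=Q)
    return ll

rng = np.random.default_rng(2)
frames = rng.normal(size=(10, 2))          # a 10-frame segment
print(gauss_markov_loglik(frames, np.zeros(2), np.eye(2),
                          0.8 * np.eye(2), 0.5 * np.eye(2)))
```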

TUC - SDSG SSM Viterbi Decoding (Ostendorf Digalakis Kimball 1996)
HMM Viterbi recognition finds the most likely state sequence,
$\hat{s}_1^T = \arg\max_{s_1^T} p(y_1^T, s_1^T)$,
and maps the state sequence to the word sequence.
The analogous SSM solution searches jointly over segmentations and segment labels,
$\{\hat{s}, \hat{\alpha}\} = \arg\max_{s, \alpha} p(y_1^N, s, \alpha)$,
and maps the segment-label sequence to the appropriate word sequence.
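A compact dynamic-programming sketch of segmental Viterbi, assuming a generic segment scorer and a maximum segment duration; the interface is illustrative:

```python
import numpy as np

def segmental_viterbi(N, labels, seg_score, max_dur=10):
    """Segmental Viterbi: best[t] is the score of the best segmentation
    of frames 0..t-1. Unlike frame-level Viterbi, each DP step extends
    the path by a whole segment (start, end, label) rather than one
    frame. seg_score(start, end, label) is the segment log-likelihood."""
    best = np.full(N + 1, -np.inf)
    best[0] = 0.0
    back = [None] * (N + 1)
    for t in range(1, N + 1):
        for d in range(1, min(max_dur, t) + 1):
            for lab in labels:
                score = best[t - d] + seg_score(t - d, t, lab)
                if score > best[t]:
                    best[t] = score
                    back[t] = (t - d, lab)
    # Trace back the segment-label sequence.
    segs, t = [], N
    while t > 0:
        start, lab = back[t]
        segs.append((start, t, lab))
        t = start
    return best[N], segs[::-1]

# Toy scorer: label "a" favors segments ending by frame 5, "b" the rest.
score = lambda s, e, lab: (e - s) * (1.0 if (lab == "a") == (e <= 5) else -1.0)
print(segmental_viterbi(10, ["a", "b"], score))
```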

TUC - SDSG From HMMs to Segment Models (Ostendorf Digalakis 1996)
A unified view of stochastic modeling: a general stochastic model that encompasses most segment-model variants.
Similarities in terms of correlation and parameter-tying assumptions.
Analogies between segment models and HMMs.

TUC - SDSG Robust Feature Selection
Time-frequency representations for ASR (Potamianos and Maragos 1999)
Confidence-measure estimation for ASR features sent over wireless channels ("missing features") (Potamianos and Weerackody 2001)
AM-FM model-based features (Dimitriadis et al. 2002)

TUC - SDSG Other Work
Multiple-source separation using microphone arrays (Sidiropoulos et al. 2001)

TUC - SDSG Prior Work Overview
MLST
Constrained Estimation Adaptation
MAP (Bayesian) Adaptation
Genones
Segment Models
VTLN
Combinations
Robust Features

TUC - SDSG HIWIRE Work Proposal
Adaptation: Bayes optimal classification
Audio-visual ASR: baseline experiments
Microphone arrays: speech/noise separation
Feature selection: AM-FM features
Acoustic modeling: segment models

TUC - SDSG Bayes optimal classification (HIWIRE proposal)
Classifier decision for a test data vector $x_{test}$: choose the class $c$ with the highest value of
$P(c \mid x_{test}, X) = \int P(c \mid x_{test}, \theta)\, p(\theta \mid X)\, d\theta$
where $X$ is the training data and $\theta$ the model parameters.

TUC - SDSG Bayes optimal versus MAP
Assumption: the posterior $p(\theta \mid X)$ is sufficiently peaked around the most probable point.
MAP approximation: $P(c \mid x_{test}, X) \approx P(c \mid x_{test}, \theta_{MAP})$, where $\theta_{MAP}$ is the set of parameters that maximizes
$p(\theta \mid X) \propto p(X \mid \theta)\, p(\theta)$.
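An illustrative Monte Carlo sketch contrasting the two decisions on a toy two-class Gaussian problem with one unknown mean; the Bayes-optimal prediction averages over posterior samples while MAP plugs in a point estimate:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# Toy model: class 0 ~ N(0, 1); class 1 ~ N(m, 1) with unknown mean m.
# Prior m ~ N(1, 1); observe a few class-1 training points.
train = np.array([1.4, 2.1, 0.9])
n, prior_mu, prior_var = len(train), 1.0, 1.0
post_var = 1.0 / (1.0 / prior_var + n)            # conjugate posterior of m
post_mu = post_var * (prior_mu / prior_var + train.sum())

x_test = 0.8
m_map = post_mu                                    # Gaussian case: MAP = posterior mean

def class_posteriors(m):
    """P(c | x_test, m) with equal class priors."""
    lik = np.array([norm.pdf(x_test, 0.0, 1.0), norm.pdf(x_test, m, 1.0)])
    return lik / lik.sum()

# MAP decision: plug in the single most probable parameter value.
p_map = class_posteriors(m_map)
# Bayes-optimal decision: average the prediction over the posterior of m.
samples = rng.normal(post_mu, np.sqrt(post_var), size=5000)
p_bayes = np.mean([class_posteriors(m) for m in samples], axis=0)
print("MAP:", p_map, "Bayes optimal:", p_bayes)
```

With sharp posteriors the two decisions coincide, which is exactly the peakedness assumption behind the MAP approximation above.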

TUC - SDSG Why Bayes optimal classification
It is the optimal classification criterion: the predictions of all parameter hypotheses are combined.
Better discrimination, less training data needed, faster asymptotic convergence to the ML estimate.
However:
– Computationally more expensive.
– Difficult to find analytical solutions.
– ...hence some approximations must still be considered.

TUC - SDSG Segment Models
Phone-transition modeling
– New features
Combine with HMMs
Parametric modeling of feature trajectories

TUC - SDSG AM-FM Features See NTUA presentation

TUC - SDSG Audio-Visual ASR Baseline

TUC - SDSG Microphone Array
Speech/noise source separation algorithms