Segment-Based Speech Recognition


Segment-Based Speech Recognition

Outline

- Introduction to segment-based speech recognition
- Searching graph-based observation spaces
- Anti-phone modeling
- Near-miss modeling
- Modeling landmarks
- Phonological modeling

9 December 2018, Veton Këpuska

Segment-Based Speech Recognition

- Acoustic modeling is performed over an entire segment
- Segments typically correspond to phonetic-like units
- Potential advantages:
  - Improved joint modeling of temporal and spectral structure
  - Segment- or landmark-based acoustic measurements
- Potential disadvantages:
  - Significant increase in model and search computation
  - Difficulty in robustly training model parameters

Hierarchical Acoustic-Phonetic Modeling

- Homogeneous measurements can compromise performance:
  - Nasal consonants are classified better with a longer analysis window
  - Stop consonants are classified better with a shorter analysis window
- Class-specific information extraction can reduce error

Committee-Based Phonetic Classification

- The choice of temporal basis affects within-class error:
  - A smoothly varying cosine basis is better for vowels and nasals
  - A piecewise-constant basis is better for fricatives and stops
- Combining information sources can reduce error
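As a rough sketch of how a committee of classifiers can combine information sources, the snippet below averages the per-classifier log posteriors (a geometric-mean rule) and renormalizes. The class names and the combination rule are illustrative assumptions, not Halberstadt's exact schemes:

```python
import math

def combine_posteriors(classifier_posteriors):
    """Committee combination: average each class's log posteriors across
    classifiers (geometric mean), then renormalize to a distribution."""
    classes = classifier_posteriors[0].keys()
    combined = {}
    for c in classes:
        mean_log = sum(math.log(p[c]) for p in classifier_posteriors) / len(classifier_posteriors)
        combined[c] = math.exp(mean_log)
    z = sum(combined.values())
    return {c: v / z for c, v in combined.items()}
```

With two classifiers that both favor the same class, the committee posterior favors it as well; disagreements are softened rather than decided by a single vote.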

Phonetic Classification Experiments (A. Halberstadt, 1998)

- TIMIT acoustic-phonetic corpus
  - Context-independent classification only
  - 462-speaker training corpus, 24-speaker core test set
  - Standard evaluation methodology, 39 common phonetic classes
- Several different acoustic representations incorporated:
  - Various time-frequency resolutions (Hamming windows of 10-30 ms)
  - Different spectral representations (MFCCs, PLPCCs, etc.)
  - Cosine transform vs. piecewise-constant basis functions
- Evaluated MAP hierarchy and committee-based methods

Phonetic Classification Experiments (A. Halberstadt, 1998)

Evaluation results:

  Method                      % Error
  Baseline                    21.6
  MAP Hierarchy               21.0
  Committee of Classifiers    18.5
  Committee with Hierarchy    18.3

Statistical Approach to ASR

- Given acoustic observations, A, choose the word sequence, W*, that maximizes the a posteriori probability, P(W|A)

[Block diagram: Speech -> Signal Processor -> A -> Linguistic Decoder -> Words W'; the Acoustic Model supplies P(A|W) and the Language Model supplies P(W)]

Statistical Approach to ASR

- Bayes' rule is typically used to decompose P(W|A) into acoustic and linguistic terms:

  W* = argmax_W P(W|A) = argmax_W P(A|W) P(W) / P(A)

- Since P(A) does not depend on W, it can be dropped from the maximization:

  W* = argmax_W P(A|W) P(W)
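The decomposition can be exercised with a toy decoder that picks W* = argmax_W P(A|W) P(W) in log space, so the product becomes a sum. The score tables here are hypothetical stand-ins for real acoustic and language models:

```python
def map_decode(hypotheses, acoustic_loglik, lm_logprob):
    """MAP word-sequence choice: maximize log P(A|W) + log P(W).
    acoustic_loglik[w] = log P(A|w); lm_logprob[w] = log P(w).
    Toy tables for illustration, not a real decoder."""
    return max(hypotheses, key=lambda w: acoustic_loglik[w] + lm_logprob[w])
```

Even when the acoustic model slightly prefers an implausible hypothesis, the language-model term can tip the decision toward the linguistically likely one.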

ASR Search Considerations

- A full search considers all possible segmentations, S, and units, U, for each hypothesized word sequence, W:

  W* = argmax_W Σ_{S,U} P(W, S, U | A)

- The sum can be approximated by the best path to simplify the search, using dynamic programming (e.g., Viterbi) or graph search (e.g., A*):

  W* ≈ argmax_W max_{S,U} P(W, S, U | A)
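The best-path approximation can be sketched with a small Viterbi dynamic program over precomputed log scores. This illustrates the dynamic-programming idea only, not SUMMIT's actual search:

```python
import math

def viterbi(obs_loglik, log_trans, log_init):
    """Best state path by dynamic programming.
    obs_loglik[t][s] = log P(a_t | state s)
    log_trans[p][s]  = log P(state s | state p); log_init[s] = log P(s at t=0).
    Returns (best path, best log score)."""
    n = len(log_init)
    score = [log_init[s] + obs_loglik[0][s] for s in range(n)]
    back = []
    for t in range(1, len(obs_loglik)):
        ptr, new = [], []
        for s in range(n):
            p = max(range(n), key=lambda q: score[q] + log_trans[q][s])
            ptr.append(p)
            new.append(score[p] + log_trans[p][s] + obs_loglik[t][s])
        score, back = new, back + [ptr]
    path = [max(range(n), key=lambda s: score[s])]
    for ptr in reversed(back):  # backtrace from the best final state
        path.append(ptr[path[-1]])
    path.reverse()
    return path, max(score)
```

The backtrace pointers are exactly what the second-pass A* search (described later) reuses as its future-score estimate.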

ASR Search Considerations

- The modified Bayes decomposition has four terms:

  P(W, S, U | A) ∝ P(A | S, U, W) P(S | U, W) P(U | W) P(W)

- In HMMs:
  - P(A|S,U,W) corresponds to the acoustic model likelihood
  - P(S|U,W) P(U|W) corresponds to the state model likelihood
  - P(W) corresponds to the language model probability

Examples of Segment-Based Approaches

- HMMs:
  - Variable frame rate (Ponting et al., 1991; Alwan et al., 2000)
  - Segment-based HMM (Marcus, 1993)
  - Segmental HMM (Russell et al., 1993)
- Trajectory modeling:
  - Stochastic segment models (Ostendorf et al., 1989)
  - Parametric trajectory models (Ng, 1993)
  - Statistical trajectory models (Goldenthal, 1994)
- Feature-based:
  - FEATURE (Cole et al., 1983)
  - SUMMIT (Zue et al., 1989)
  - LAFF (Stevens et al., 1992)

Segment-Based Modeling at MIT

- Baseline segment-based modeling incorporates:
  - Averages and derivatives of spectral coefficients (e.g., MFCCs)
  - Dimensionality normalization via principal component analysis
  - PDF estimation via Gaussian mixtures
- Example acoustic-phonetic modeling investigations:
  - Alternative probabilistic classifiers (e.g., Leung, Meng)
  - Automatically learned feature measurements (e.g., Phillips, Muzumdar)
  - Statistical trajectory models (Goldenthal)
  - Hierarchical probabilistic features (e.g., Chun, Halberstadt)
  - Near-miss modeling (Chang)
  - Probabilistic segmentation (Chang, Lee)
  - Committee-based classifiers (Halberstadt)
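As a small illustration of the Gaussian-mixture PDF estimation mentioned above, the following evaluates the log-likelihood of a scalar feature under a one-dimensional mixture. The real models operate on the PCA-reduced feature vectors; this scalar version only shows the form of the density:

```python
import math

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of scalar x under a 1-D Gaussian mixture:
    log sum_k w_k * N(x; m_k, v_k). Toy stand-in for the mixture
    PDFs used in segment and landmark modeling."""
    total = 0.0
    for w, m, v in zip(weights, means, variances):
        total += w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
    return math.log(total)
```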

SUMMIT Segment-Based ASR

- SUMMIT speech recognition is based on phonetic segments
- Explicit phone start and end times are hypothesized during search
- Differs from conventional frame-based methods (e.g., HMMs)
- Enables segment-based acoustic-phonetic modeling
- Measurements can be extracted over landmarks and segments

SUMMIT Segment-Based ASR

- Recognition is achieved by searching a phonetic graph
- The graph can be computed via acoustic criteria or probabilistic models
- Competing segmentations make use of different observation spaces
- Probabilistic decoding must account for the graph-based observation space

“Frame-based” Speech Recognition

- The observation space, A, corresponds to a temporal sequence of acoustic frames (e.g., spectral slices)
- Each hypothesized segment, s_i, is represented by the series of frames computed between the segment start and end times

[Diagram: observation sequence A = {a1, a2, a3}; competing segmentations are built from the same frames]

“Frame-based” Speech Recognition

- The acoustic likelihood, P(A|S,W), is derived from the same observation space for all word hypotheses: every hypothesis scores the same frame sequence a1 a2 a3, so the competing values of P(a1 a2 a3 | S, W) are directly comparable

“Feature-based” Speech Recognition

- Each segment, s_i, is represented by a single feature vector, a_i
- Given a particular segmentation, S, A consists of X, the feature vectors associated with S, as well as Y, the feature vectors associated with segments not in S: A = X ∪ Y

[Diagram: A = {a1, a2, a3, a4, a5}; one segmentation has X = {a1, a3, a5}, Y = {a2, a4}; a competing segmentation has X = {a1, a2, a4, a5}, Y = {a3}]

“Feature-based” Speech Recognition

- To compare different segmentations it is necessary to predict the likelihood of both X and Y: P(A|S,W) = P(X,Y|S,W)
- For example, P(a1 a3 a5, a2 a4 | S, W) must be comparable to P(a1 a2 a4 a5, a3 | S', W), even though the two hypotheses split A into different X and Y sets

Searching Graph-Based Observation Spaces: The Anti-Phone Model

- Create a unit, the anti-phone ᾱ, to model segments that are not phones
- For a segmentation, S, assign the anti-phone to the extra segments
- All segments are then accounted for in the phonetic graph
- Alternative paths through the graph can be legitimately compared

Searching Graph-Based Observation Spaces: The Anti-Phone Model

- Path likelihoods can be decomposed into two terms:
  - The likelihood of all segments produced by the anti-phone (a constant)
  - The ratio of phone to anti-phone likelihoods for all path segments
- MAP formulation for the most likely word sequence, W*:

  W* = argmax_{W,S,U} [ Π_i P(x_i|u_i) / P(x_i|ᾱ) ] P(S|U,W) P(U|W) P(W)

Modeling Non-lexical Units: The Anti-phone

- Given a particular segmentation, S, A consists of X, the segments associated with S, as well as Y, the segments not associated with S: P(A|S,U) = P(X,Y|S,U)
- Given segmentation S, assign the feature vectors in X to valid units, and all others in Y to the anti-phone
- Since P(X,Y|ᾱ) is a constant, K, we can write, assuming independence between X and Y:

  P(X,Y|S,U) = P(X|S,U) P(Y|ᾱ) = P(X|S,U) K / P(X|ᾱ)

Modeling Non-lexical Units: The Anti-phone

- We need consider only the segments in S during search:

  P(A|S,U) ∝ Π_{x_i ∈ X} P(x_i|u_i) / P(x_i|ᾱ)
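The per-segment likelihood-ratio scoring can be sketched with precomputed log-likelihoods in a lookup table. The segment ids, unit names, and score values are hypothetical; 'anti' denotes the anti-phone:

```python
def antiphone_path_score(path, loglik):
    """Score a path through the segment graph as the sum of log-likelihood
    ratios between each segment's hypothesized unit and the anti-phone.
    path: list of (segment_id, unit) pairs on the path.
    loglik[(segment_id, unit)]: precomputed log P(x_i | unit)."""
    return sum(loglik[(s, u)] - loglik[(s, 'anti')] for s, u in path)
```

Because every path is normalized by the same anti-phone constant, two paths over different segment sets can be compared directly, and a path whose segments score worse than the anti-phone comes out negative (useful for pruning or rejection, as noted below).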

SUMMIT Segment-Based ASR

Anti-Phone Framework Properties

- Models the entire observation space, using both positive and negative examples
- Log-likelihood scores are normalized by the anti-phone:
  - Good scores are positive, bad scores are negative
  - Poor segments all have negative scores
  - Useful for pruning and/or rejection
- The anti-phone is not used for lexical access
- No prior or posterior probabilities are used during search:
  - Allows computation on demand and/or fast-match
  - Subsets of data can be used for training
- Context-independent or context-dependent models can be used
- Useful for general pattern-matching problems with graph-based observation spaces

Beyond Anti-Phones: Near-Miss Modeling

- Anti-phone modeling partitions the observation space into two parts (i.e., on or not on a hypothesized segmentation)
- Near-miss modeling partitions the observation space into a set of mutually exclusive, collectively exhaustive subsets:
  - One near-miss subset is pre-computed for each segment in a graph
  - A temporal criterion can guarantee proper near-miss subset generation (e.g., segment A is a near-miss of B iff A's mid-point is spanned by B)
- During recognition, observations in a near-miss subset are mapped to the near-miss model of the hypothesized phone
- Near-miss models can be just an anti-phone, but can potentially be more sophisticated (e.g., phone-dependent)

Creating Near-miss Subsets

- The near-miss subsets, A_i, associated with any segmentation, S, must be mutually exclusive and exhaustive:

  A = ∪_{s_i ∈ S} A_i

- The temporal criterion guarantees proper near-miss subsets:
  - Abutting segments in S account for all times exactly once
  - Finding all segments spanning a time creates the near-miss subsets
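The temporal mid-point criterion above can be sketched directly. Segments are (start, end) pairs as described on the slide; everything else (the toy graph in the usage example) is illustrative:

```python
def near_miss_subsets(all_segments, segmentation):
    """Partition all hypothesized segments into near-miss subsets using the
    temporal mid-point criterion: segment a is a near-miss of b iff a's
    mid-point falls within b's span. Because the abutting segments in the
    segmentation cover every time exactly once, each mid-point lands in
    exactly one subset, so the subsets are mutually exclusive and exhaustive."""
    subsets = {b: [] for b in segmentation}
    for a in all_segments:
        mid = (a[0] + a[1]) / 2.0
        for b in segmentation:
            if b[0] <= mid < b[1]:
                subsets[b].append(a)
                break
    return subsets
```

For a graph with segments (0,2), (2,4), (0,4), (1,3) and the segmentation [(0,2), (2,4)], the segments (0,4) and (1,3) both have mid-point 2.0 and so fall into the near-miss subset of (2,4).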

Modeling Landmarks

- We can also incorporate additional feature vectors computed at hypothesized landmarks or phone boundaries
- Every segmentation accounts for every landmark:
  - Some landmarks will be transitions between lexical units
  - Other landmarks will be considered internal to a unit
- Both context-independent and context-dependent units are possible
- Effectively models transitions between phones (i.e., diphones)
- Frame-based models can be used to generate the segment graph

Modeling Landmarks

- Frame-based measurements:
  - Computed every 5 milliseconds
  - Feature vector of 14 Mel-scale cepstral coefficients (MFCCs)
- Landmark-based measurements:
  - Compute averages of the MFCCs over 8 regions around each landmark
  - 8 regions x 14 MFCC averages = 112-dimensional vector
  - 112 dimensions reduced to 50 using principal component analysis
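A sketch of the landmark measurement pipeline under stated assumptions: the region boundaries and the PCA matrix are placeholders (the real projection is trained on data, and the real region widths are not given on the slide), but the shapes follow the slide (8 regions x 14 MFCC averages = 112 dimensions, reduced to 50):

```python
import numpy as np

def landmark_features(mfcc_frames, landmark_idx, region_edges, projection):
    """Average MFCC frames over regions around a landmark, concatenate,
    then reduce dimensionality with a precomputed PCA projection.
    mfcc_frames: (T, 14) array of MFCCs at a 5 ms frame rate.
    region_edges: 9 frame offsets relative to the landmark, delimiting
      8 regions (hypothetical widths; assumes the landmark is not too
      close to the signal edges, so no region is empty).
    projection: (112, 50) PCA matrix, assumed trained elsewhere."""
    pieces = []
    for lo, hi in zip(region_edges[:-1], region_edges[1:]):
        region = mfcc_frames[max(0, landmark_idx + lo): landmark_idx + hi]
        pieces.append(region.mean(axis=0))   # 14 MFCC averages per region
    v = np.concatenate(pieces)               # 8 regions x 14 = 112 dims
    return projection.T @ v                  # project 112 -> 50 dims
```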

Probabilistic Segmentation

- Uses a forward Viterbi search in the first pass to find the best path
- Relative and absolute thresholds are used to speed up the search

Probabilistic Segmentation (cont.)

- The second pass uses a backwards A* search to find the N-best paths
- The Viterbi backtrace is used as the future estimate for path scores
- Block processing enables pipelined computation

Phonetic Recognition Experiments

- TIMIT acoustic-phonetic corpus
  - 462-speaker training corpus, 24-speaker core test set
  - Standard evaluation methodology, 39 common phonetic classes
- Segment and landmark representations based on averages and derivatives of 14 MFCCs, energy, and duration
- PCA used for data normalization and reduction
- Acoustic models based on aggregated Gaussian mixtures
- Language model based on a phone bigram
- Probabilistic segmentation computed from diphone models

Phonetic Recognition Experiments

  Method                                   % Error
  Triphone CDHMM                           27.1
  Recurrent Neural Network                 26.1
  Bayesian Triphone HMM                    25.6
  Anti-phone, heterogeneous classifiers    24.4

Phonological Modeling

- Words are described by phonemic baseforms
- Phonological rules expand the baseforms into a graph, e.g.:
  - Deletion of stop bursts in syllable coda (e.g., laptop)
  - Deletion of /t/ in various environments (e.g., intersection, destination, crafts)
  - Gemination of fricatives and nasals (e.g., this side, in Nome)
  - Place assimilation (e.g., did you (/d ih jh uw/))
- Arc probabilities, P(U|W), can be trained
- Most HMMs do not have a phonological component

Phonological Example

- Example of "what you" expanded in the SUMMIT recognizer
- The final /t/ in "what" can be realized as released, unreleased, palatalized, a glottal stop, or a flap

Word Recognition Experiments

- Jupiter telephone-based weather-query corpus
  - 50,000-utterance training set, 1,806 "in-domain" utterance test set
- Acoustic models based on Gaussian mixtures
  - Segment and landmark representations based on averages and derivatives of 14 MFCCs, energy, and duration
  - PCA used for data normalization and reduction
  - 715 context-dependent boundary classes
  - 935 triphone and 1,160 diphone context-dependent segment classes
- Pronunciation graph incorporates pronunciation probabilities
- Language model based on class bigram and trigram
- Best performance achieved by combining models

Word Recognition Experiments

  Method             % Error
  Boundary models    7.6
  Segment models     9.6
  Combined           6.1

Summary

- Some segment-based speech recognition techniques transform the observation space from frames to graphs
- Graph-based observation spaces allow for a wide variety of alternative modeling methods relative to frame-based approaches
- The anti-phone and near-miss modeling frameworks provide a mechanism for searching graph-based observation spaces
- Good results have been achieved for phonetic recognition
- Much work remains to be done!

References

- J. Glass, "A Probabilistic Framework for Segment-Based Speech Recognition," to appear in Computer, Speech & Language, 2003.
- A. Halberstadt, "Heterogeneous Acoustic Measurements and Multiple Classifiers for Speech Recognition," Ph.D. thesis, MIT, 1998.
- M. Ostendorf et al., "From HMMs to Segment Models: A Unified View of Stochastic Modeling for Speech Recognition," IEEE Trans. Speech & Audio Processing, 4(5), 1996.