Improving Speech Modelling Viktoria Maier Supervised by Prof. Hynek Hermansky.



Outline
1. State-of-the-art
2. Modelling phoneme duration
3. Suggestions from human perception results for speech modelling
4. Conclusion

1. Current State-of-the-Art HMM Technology

State-of-the-Art: Overview of Speech Modelling

1. Feature Extraction
For speech recognition, extract features that enable us to discriminate between different classes (phonemes).
The more discriminative the features, the easier the classification.
Usually the frequencies contained in each frame are extracted (e.g. MFCCs).
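The MFCC pipeline described above can be sketched as follows. This is a minimal illustration, not a production front end: frame the signal, window it, take the power spectrum, pool it with a triangular mel filterbank, and decorrelate the log energies with a DCT. All parameter values (sample rate, frame length, filter counts) are illustrative defaults.

```python
import numpy as np
from scipy.fft import dct

def mfcc_like(signal, sr=16000, frame_len=400, hop=160, n_filters=26, n_ceps=13):
    """Sketch of MFCC-style feature extraction; all defaults illustrative."""
    # Split the signal into overlapping frames and apply a Hamming window
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i*hop : i*hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2 / frame_len
    # Triangular mel filterbank (mel scale: 2595*log10(1 + f/700))
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_filters + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((frame_len + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_filters, power.shape[1]))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log mel energies, then DCT -> cepstral coefficients
    log_energies = np.log(power @ fbank.T + 1e-10)
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :n_ceps]
```

For one second of 16 kHz audio with these settings, the result is a (98, 13) matrix: 98 frames of 13 cepstral coefficients each.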

2. Speech Modelling
Usually uses Hidden Markov Models. Characteristics:
- Number of states
- Transition probabilities
- Model to estimate emission likelihoods (GMMs) or posterior probabilities (ANNs)
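The three ingredients above (states, transitions, emission scores) come together in the forward algorithm, which scores a frame sequence under an HMM. A minimal sketch in log space, assuming the per-frame emission scores have already been computed by a GMM (likelihoods) or an ANN (scaled posteriors):

```python
import numpy as np

def forward_loglik(log_trans, log_emit, log_init):
    """Forward algorithm: log p(frames | HMM).
    log_trans[i, j] = log p(state j | state i)
    log_emit[t, s]  = log p(frame t | state s)  (from a GMM or ANN)
    log_init[s]     = log p(state s at t = 0)"""
    T, S = log_emit.shape
    alpha = log_init + log_emit[0]          # log p(frame_0, state)
    for t in range(1, T):
        # Sum over predecessor states (logsumexp), then add emission score
        alpha = log_emit[t] + np.logaddexp.reduce(
            alpha[:, None] + log_trans, axis=0)
    return np.logaddexp.reduce(alpha)       # sum over final states
```

As a sanity check: with uniform two-state transitions, initial probabilities, and emission probability 0.5 everywhere, three frames give a total log-likelihood of 3·log(0.5).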

2. Modelling Phoneme Duration

Problem (1)
Phonemes in reality have different durations.
If the model's minimum duration is longer than the phoneme, some states have to model context instead.

Problem (2)
Generally, the fewer the states, the worse the performance.
[Table: %WER for Baseline TIMIT (S6G32p4), TIMIT S4G32p, and TIMIT S4G64p; values not recoverable from the transcript]

Possible Solutions
Hypothesis: choose HMMs with a shorter minimum duration for shorter phonemes (prior knowledge)
1. Other topology (jump states)
2. Fewer states
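The minimum duration of a left-to-right HMM is simply the fewest emitting states on any path through the model, so both proposed fixes can be checked with a shortest-path count. A small sketch (the topology encoding is illustrative):

```python
from collections import deque

def min_duration(n_states, skips=()):
    """Minimum frames a left-to-right HMM must emit. Each emitting state
    consumes at least one frame; 'skips' lists extra (from, to) jump
    transitions beyond the usual self-loop and next-state arcs."""
    # Successor lists: state i can always advance to i+1, plus any jumps
    succ = {i: {i + 1} for i in range(n_states)}
    for a, b in skips:
        succ[a].add(b)
    # BFS for the fewest emitting states on a path through the model
    frontier, seen = deque([(0, 1)]), set()
    while frontier:
        state, emitted = frontier.popleft()
        if state == n_states - 1:
            return emitted
        for nxt in succ[state]:
            if nxt < n_states and nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, emitted + 1))
    return None

# A strict 6-state chain needs 6 frames; adding a jump over two middle
# states (a "jump model") lowers the minimum to 4 frames.
assert min_duration(6) == 6
assert min_duration(6, skips=[(1, 4)]) == 4
```

This makes the two options on the slide concrete: jump states lower the minimum duration while keeping six states' worth of parameters, whereas removing states lowers it by shortening the chain itself.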

Test Setup – TIMIT
- TIMIT database: 8 dialect regions, 630 speakers
- Without the "sa" utterances
- 3693 sentences for training
- 1344 sentences for testing
- Number of model parameters is kept constant (fewer states => more Gaussians per state)

Modelling Phoneme Duration: Results (1)
[Table: %WER for Baseline TIMIT (S6G32p4); TIMIT var. no. of states S6G32–S4G32p; TIMIT var. no. of states S6G32–S4G64, no penalty; TIMIT jump model for all phonemes (S6G32p2); TIMIT var. topology: jump model (S6G32p1), 39.90; other values not recoverable from the transcript]

Modelling Phoneme Duration: Results Analysis
- If the number of states is decreased, an increase in Gaussians per state is necessary to ensure comparable model complexity.
- The insertion penalty becomes less important.
- Decreasing the model's minimum duration for short phonemes helps correct recognition.
- Better results with a variable number of states.
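The "comparable model complexity" point can be made concrete with a rough parameter count. A sketch, assuming 39-dimensional features (standard MFCC + deltas) and diagonal-covariance Gaussians; the counting conventions are illustrative:

```python
def hmm_params(n_states, n_gauss, dim=39):
    """Rough parameter count for one phoneme HMM: each diagonal-covariance
    Gaussian has a mean and a variance vector (2*dim) plus a mixture
    weight; transitions add a self-loop and a forward arc per state."""
    emission = n_states * n_gauss * (2 * dim + 1)
    transition = 2 * n_states
    return emission + transition

# Dropping from 6 to 4 states while raising 32 to 48 Gaussians per state
# keeps the total Gaussian count (192) and parameter count nearly equal,
# so any performance difference is due to topology, not model size.
print(hmm_params(6, 32))  # 15180
print(hmm_params(4, 48))  # 15176
```

This is why a fair comparison between S6G32 and a 4-state system must scale the mixture sizes rather than hold them fixed.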

3. Suggestions from Human Perception Results for Speech Modelling

Human Perception Tests: Motivation
- Speech is created to be perceived by humans.
- We know that human performance is very good and robust.
- Simulating human perception may lead to improvements.
- Testing on nonsense phoneme sequences (no language model) isolates the "acoustic model".

Human Stop-Consonant Perception (1)
Tested stop-consonant perception: identical noise burst in variable context (Liberman, Cooper, Delattre, 1952)
- Are the test results still valid?
- Implications for state-of-the-art technology

Human Stop-Consonant Perception: Test Setup
- Synthetic sounds (generated in Matlab)
- 40 test subjects, 2 tests each: 17 English, 11 French, 12 other native speakers
- Tests on different days
- Technics RP-F880 headphones
- Quiet room

Human Stop-Consonant Perception: Test Setup (2)
- 12 different noise-burst frequencies
- 7 different two-formant vowels
- No transitions

Human Stop-Consonant Perception: Selected Results
[Figures: identification results for bursts in front of /a/ and in front of /o/]

Suggestions from Human Perception Results for Speech Modelling
- Suggests that speech data has to be analyzed in context => consistent with the well-known result that context-dependent phoneme models improve performance
- Suggests the necessity of multiple Gaussians per state

4. Conclusions

Conclusions (1)
- Performance can be improved by introducing variable-state HMMs.
- Context-independent phoneme models are inadequate with short-term spectral features.

Conclusions (2)
- Could new features (such as TRAPs) capture relative dependencies better?
- Preference for context-dependent phoneme models with multiple Gaussians.

Thank you!

Human Stop-Consonant Perception: Results (2004, EN vs. FR)
[Figure: comparison of English and French listeners' results; not recoverable from the transcript]