Informing Multisource Decoding for Robust Speech Recognition Ning Ma and Phil Green Speech and Hearing Research Group The University of Sheffield 22/04/2005
Overview of the Talk Introduction to Multisource Decoding Context-dependent Word Duration Modelling Measure the “Speechiness” of Fragments Summary and Plans
Multisource Decoding A framework which integrates bottom-up and top-down processes in sound understanding Easier to find a spectro-temporal region that belongs to a single source (a fragment) than to find a speech fragment (“missing data” techniques)
Modelling Durations, Why? Unrealistic duration information encoded in HMMs No hard limits on word durations Decoder may produce word matches with unusual durations Worse with multisource decoding Decide segregation hypotheses on incomplete data Need more temporal constraints [Figure: HMM state with self-transition probability a_ii and exit probability 1 – a_ii]
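A one-line aside on why the built-in durations are unrealistic: a single HMM state with self-transition probability a_ii implies a geometric state-duration distribution,
    P(d) = a_ii^(d-1) · (1 − a_ii),
which decays monotonically with d and places no upper bound on how long the decoder may stay in a word.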
Factors that Determine Word Durations Lexical Stress: Humans tend to lengthen a word when emphasising it Surrounding Words: The neighbouring words can affect the duration of a word Speaking Rate: Fast speech vs. slow speech Pause Context: Words followed by a long pause have relatively longer durations: the “Pre-pausal Lengthening” effect 1 1. T. Crystal, “Segmental durations in connected-speech signals: Syllabic stress,” JASA, 1988.
Word Duration Model Investigation Different words have different durational statistics Skewed distribution shape Discrete distribution more attractive Word duration histograms for digits ‘oh’ and ‘six’
Context-Dependent Duration Modelling In a connected digits domain High-level linguistic cues are minimised The effect of lexical stress is not obvious Surrounding words do not affect duration statistics This work only models the ‘pre-pausal’ lengthening effect
The “Pre-Pausal Lengthening” Effect Word duration histograms obtained by forced alignment Distributions (solid lines) have a wide variance A clear second peak around 600 ms for ‘six’ Word duration examples divided into two parts Non-terminating word vs pre-pausal word duration examples Determine histograms for the two parts Smoothed word duration histograms for digits ‘oh’ and ‘six’
Compute Word Duration Penalty Estimate P(d|w,u), the probability of word w having duration d, if followed by u Word duration histograms (bin width 10 ms) obtained by forced alignment Smoothed and normalised to evaluate P(d|w,u) u can only be pause or non-pause in our case, thus two histograms per digit Scaling factors to control the impact of the word duration penalties
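A minimal sketch of how P(d|w,u) could be estimated and turned into a penalty, assuming forced-alignment durations are available. The 10 ms bin width follows the slide; the smoothing window, probability floor and scaling factor alpha are illustrative choices, not the authors' settings.

    # Sketch (not the authors' code): histogram -> smoothed pdf -> scaled log penalty.
    import numpy as np

    BIN_MS = 10          # histogram bin width from the slide
    MAX_MS = 1500        # assumed upper bound on digit duration

    def duration_pdf(durations_ms, smooth_win=5):
        """Histogram the observed durations, smooth, and normalise to a pdf."""
        bins = np.arange(0, MAX_MS + BIN_MS, BIN_MS)
        counts, _ = np.histogram(durations_ms, bins=bins)
        kernel = np.ones(smooth_win) / smooth_win        # simple moving-average smoothing
        smoothed = np.convolve(counts, kernel, mode="same")
        smoothed += 1e-6                                  # floor so log() stays finite
        return smoothed / smoothed.sum()

    def word_duration_penalty(d_ms, pdf, alpha=1.0):
        """Scaled log probability added to a hypothesis score (alpha = scaling factor)."""
        idx = min(int(d_ms // BIN_MS), len(pdf) - 1)
        return alpha * np.log(pdf[idx])

    # Two histograms per digit, one per context u:
    # pdf_six = {"pause": duration_pdf(six_prepausal), "non-pause": duration_pdf(six_other)}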
Decoding with Word Duration Modelling In Viterbi decoding Time-synchronous algorithm Apply word duration penalties as paths leave final state But within a template paths with different histories have different durations! Multi-stack decoding Idea from the NOWAY 1 decoder Time-asynchronous, but start-synchronous Have knowledge of each hypothesis’s future 1. S. Renals and M. Hochberg (1999), “Start-synchronous search for large vocabulary continuous speech recognition.”
Multi-stack Decoding Partial word sequence hypotheses H(t,W(t),P(t)) stored on each stack The reference time t at which the hypothesis ends The word sequence W(t)=w(1)w(2)…w(n) covering the time from 1 to t Its overall likelihood P(t) The most likely hypothesis on each stack is extended further The Viterbi algorithm is used to find one-word extensions The final result: the best hypothesis on the stack at time T [Figure: one-word Viterbi extensions from stacks at times t1–t6; final result read from the stack at time T]
Applying Word Duration Penalties When placing a hypothesis onto stacks Compute the WD penalty based on the one-word extension Apply the penalty to the hypothesis’s likelihood score Setting a search range: WD_min and WD_max Reduces computational cost A typical duration range for a digit is between … ms [Figure: hypotheses placed on stacks at times t1–t5 along the time axis]
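A sketch of the start-synchronous stack search with duration penalties applied as hypotheses are placed on later stacks. Everything here is an assumption for illustration: one_word_extension stands in for the within-word Viterbi match, duration_penalty(w, d, u) for the scaled log P(d|w,u) sketched earlier, and the WD_MIN/WD_MAX values are placeholders rather than the figures missing from the slide.

    # Illustrative stack decoder (not the authors' implementation).
    WD_MIN, WD_MAX = 10, 120   # duration search range in frames (placeholder values)

    def decode(T, vocab, one_word_extension, duration_penalty):
        """Start-synchronous search: hypotheses are (log score, words, last-word duration)."""
        stacks = {0: [(0.0, (), 0)]}                     # stack index = reference end time t
        for t in range(T):
            if t not in stacks:
                continue
            score, words, last_dur = max(stacks[t])      # extend only the best hypothesis at t
            for w in vocab:
                # The following word u decides which histogram penalises the previous word.
                u = "pause" if w in ("sil", "sp") else "non-pause"
                penalty = duration_penalty(words[-1], last_dur, u) if words else 0.0
                for d in range(WD_MIN, min(WD_MAX, T - t) + 1):
                    acoustic = one_word_extension(w, t, t + d)     # one-word Viterbi match
                    hyp = (score + penalty + acoustic, words + (w,), d)
                    stacks.setdefault(t + d, []).append(hyp)
        best = max(stacks.get(T, [(float("-inf"), (), 0)]))
        return best[1]                                   # best word sequence ending at time T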
Recognition Experiments “Soft mask” missing data system, Spectral domain features 16 states per HMM, 7 Gaussians per state Silence model and short pause model in use Aurora 2 connected digits recognition task, clean training
Experiment Results Four recognition systems: 1. Baseline system, no duration model 2. + uniform duration model 3. + context-independent duration model 4. + context-dependent duration model
Discussion The context-dependent word duration model can offer a significant improvement With duration constraints the decoder can produce more reasonable duration patterns Assumes the duration pattern in clean conditions is the same as in noise Needs normalisation by speaking rate
Overview of the Talk Introduction to Multisource Decoding Context-dependent Word Duration Modelling “Speechiness” Measures of Fragments Discussion
Motivation for Measuring “Speechiness” The multisource decoder assumes each fragment has an equal probability of being speech or not We can measure the “speechiness” of each fragment These measures can be used to bias the decoder towards including the fragments that are more likely to be speech.
A Noisy Speech Corpus Aurora 2 connected digits mixed with either violins or drums A set of a priori fragments has been generated, but unlabelled Allows us to study the integration problem in isolation from the problem of fragment construction
A Priori Fragments
Recognition Results “Correct”: a priori fragments with correct labels “Fragments”: a priori fragments with no labels Results demonstrate that the top-down information in our HMMs is insufficient Word accuracy – Violins Correct: 93.04% Violins Fragments: 50.75% Drums Correct: 91.36% Drums Fragments: 33.76%
Approach to Measuring “Speechiness” Extract features that represent speech characteristics Use statistical models such as GMMs to fit the features Need a background model which fits everything Take the speech-model to background-model likelihood ratio as the confidence measure
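A minimal sketch of this likelihood-ratio measure using scikit-learn GMMs; the feature extraction, number of mixture components and choice of training material are assumptions for illustration rather than the configuration used in the talk.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_models(speech_feats, background_feats, n_components=2):
        """Fit a speech GMM and a background ("fits everything") GMM on frame-level features."""
        speech_gmm = GaussianMixture(n_components, covariance_type="full").fit(speech_feats)
        background_gmm = GaussianMixture(n_components, covariance_type="full").fit(background_feats)
        return speech_gmm, background_gmm

    def speechiness(fragment_feats, speech_gmm, background_gmm):
        """Mean log-likelihood ratio over a fragment: larger values = more speech-like."""
        return float(np.mean(speech_gmm.score_samples(fragment_feats)
                             - background_gmm.score_samples(fragment_feats)))

In the decoder this score would bias the weight given to segregation hypotheses that treat the fragment as speech, rather than forcing a hard speech/non-speech decision.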
Preliminary Experiment 1 – F0 Estimation Speech and other sounds differ in F0, and also in delta F0 Measure the F0 of each fragment rather than the full-band signal Compute the correlogram over all the frequency channels Only sum those channels within the fragment For each frame, find the peak to estimate its F0 Smooth F0 across the fragment
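A sketch of the fragment-constrained F0 estimate, assuming the correlogram A[frame, channel, lag] has already been computed from the auditory filterbank and fragment_mask[frame, channel] marks the fragment's region; the F0 search range and the median smoothing are illustrative choices.

    import numpy as np
    from scipy.ndimage import median_filter

    def fragment_f0(A, fragment_mask, fs, f0_range=(60.0, 400.0)):
        """Summary correlogram restricted to the fragment's channels, peak-picked per frame."""
        lag_min = int(fs / f0_range[1])
        lag_max = int(fs / f0_range[0])
        f0 = np.zeros(A.shape[0])
        for t in range(A.shape[0]):
            chans = np.where(fragment_mask[t])[0]
            if len(chans) == 0:
                continue                                  # fragment absent in this frame
            summary = A[t, chans].sum(axis=0)             # sum only channels in the fragment
            lag = lag_min + np.argmax(summary[lag_min:lag_max])
            f0[t] = fs / lag
        return median_filter(f0, size=5)                  # smooth F0 across the fragment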
Preliminary Experiment 1 – F0 Estimation Accuracies using different features – Pitch: 74.3% Delta pitch: 77.4% Both: 88.8% GMMs with full covariance and two Gaussians Speech fragments vs violin fragments Background model trained on violin fragments Log likelihood ratio threshold is 0
Preliminary Experiment 2 – Energy Ratios Speech has more energy around formants Divide the spectral features into frequency bands Compute the amount of fragment energy within each band, normalised by the full-band energy Two-band case: Channel Centre Frequency (CF) = 50 – 1000 – 3750 Hz Four-band case: CF = 50 – 282 – 707 – 1214 – 3850 Hz
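A sketch of the band energy-ratio features, assuming a matrix of spectral energies E[frame, channel], the channel centre frequencies cfs (Hz), and a boolean fragment mask; the default band edges below are the four-band values from the slide, and the helper names are hypothetical.

    import numpy as np

    def energy_ratios(E, fragment_mask, cfs, band_edges=(50, 282, 707, 1214, 3850)):
        """Fragment energy per band, normalised by the fragment's full-band energy."""
        frag_energy = np.where(fragment_mask, E, 0.0)
        total = frag_energy.sum() + 1e-12                 # avoid division by zero
        ratios = []
        for lo, hi in zip(band_edges[:-1], band_edges[1:]):
            in_band = (cfs >= lo) & (cfs < hi)
            ratios.append(frag_energy[:, in_band].sum() / total)
        return np.array(ratios)                           # e.g. 4 values in the four-band case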
Preliminary Experiment 2 – Energy Ratios Speech fragments vs music fragments (violins & drums) Full covariance GMMs with 4 Gaussians Background model trained on all types of fragments Accuracies using different features – Two bands: 79.7% Four bands: 93.2%
Summary and Plans Don’t need any hard classification; leave the confidence measures to the multisource decoder Assumes the background model is accessible; in practice a garbage model is needed Combine different features More speech features, e.g. syllabic rate
Thanks! Any questions?