Informing Multisource Decoding for Robust Speech Recognition Ning Ma and Phil Green Speech and Hearing Research Group The University of Sheffield 22/04/2005
Overview of the Talk Introduction to Multisource Decoding Context-dependent Word Duration Modelling Measure the “Speechiness” of Fragments Summary and Plans
Multisource Decoding A framework which integrates bottom-up and top-down processes in sound understanding Easier to find a spectro-temporal region that belongs to a single source (a fragment) than to find a speech fragment (“missing data” techniques)
Modelling Durations, Why? Unrealistic duration information encoded in HMMs No hard limits on word durations Decoder may produce word matches with unusual durations Worse with multisource decoding Decide segregation hypotheses on incomplete data Need more temporal constraints [Figure: HMM state with self-transition probability a_ii and exit probability 1 – a_ii]
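A one-line aside on why the built-in durations are unrealistic: a single HMM state with self-transition probability a_ii implies a geometric state-duration distribution,
    P(d) = a_ii^(d-1) · (1 − a_ii),
which decays monotonically with d and places no upper bound on how long the decoder may stay in a word.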
Factors that Determine Word Durations Lexical Stress: Humans tend to lengthen a word when emphasising it Surrounding Words: The neighbouring words can affect the duration of a word Speaking Rate: Fast speech vs. slow speech Pause Context: Words followed by a long pause have relatively longer durations: the “Pre-pausal Lengthening” effect 1 1. T. Crystal, “Segmental durations in connected-speech signals: Syllabic stress,” JASA, 1988.
Word Duration Model Investigation Different words have different durational statistics Skewed distribution shape Discrete distribution more attractive Word duration histograms for digits ‘oh’ and ‘six’
Context-Dependent Duration Modelling In a connected digits domain High-level linguistic cues are minimised The effect of lexical stress is not obvious Surrounding words do not affect duration statistics This work only models the ‘pre-pausal’ lengthening effect
The “Pre-Pausal Lengthening” Effect Word duration histograms obtained by forced alignment Distributions (solid lines) have a wide variance A clear second peak around 600 ms for ‘six’ Word duration examples divided into two parts Non-terminating word vs pre-pausal word duration examples Determine histograms for the two parts Smoothed word duration histograms for digits ‘oh’ and ‘six’
Compute Word Duration Penalty Estimate P(d|w,u), the probability of word w having duration d, if followed by u Word duration histograms (bin width 10 ms) obtained by forced alignment Smoothed and normalised to evaluate P(d|w,u) u can only be pause or non-pause in our case, thus two histograms per digit Scaling factors to control the impact of the word duration penalties
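A minimal sketch of how P(d|w,u) could be estimated and turned into a penalty, assuming forced-alignment durations are available. The 10 ms bin width follows the slide; the smoothing window, probability floor and scaling factor alpha are illustrative choices, not the authors' settings.

    # Sketch (not the authors' code): histogram -> smoothed pdf -> scaled log penalty.
    import numpy as np

    BIN_MS = 10          # histogram bin width from the slide
    MAX_MS = 1500        # assumed upper bound on digit duration

    def duration_pdf(durations_ms, smooth_win=5):
        """Histogram the observed durations, smooth, and normalise to a pdf."""
        bins = np.arange(0, MAX_MS + BIN_MS, BIN_MS)
        counts, _ = np.histogram(durations_ms, bins=bins)
        kernel = np.ones(smooth_win) / smooth_win        # simple moving-average smoothing
        smoothed = np.convolve(counts, kernel, mode="same")
        smoothed += 1e-6                                  # floor so log() stays finite
        return smoothed / smoothed.sum()

    def word_duration_penalty(d_ms, pdf, alpha=1.0):
        """Scaled log probability added to a hypothesis score (alpha = scaling factor)."""
        idx = min(int(d_ms // BIN_MS), len(pdf) - 1)
        return alpha * np.log(pdf[idx])

    # Two histograms per digit, one per context u:
    # pdf_six = {"pause": duration_pdf(six_prepausal), "non-pause": duration_pdf(six_other)}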
Decoding with Word Duration Modelling In Viterbi decoding Time-synchronous algorithm Apply word duration penalties as paths leave final state But within a template paths with different histories have different durations! Multi-stack decoding Idea from the NOWAY 1 decoder Time-asynchronous, but start-synchronous Have knowledge of each hypothesis’s future 1. S. Renals and M. Hochberg (1999), “Start-synchronous search for large vocabulary continuous speech recognition.”
Multi-stack Decoding Partial word sequence hypotheses H(t,W(t),P(t)) stored on each stack The reference time t at which the hypothesis ends The word sequence W(t)=w(1)w(2)…w(n) covering the time from 1 to t Its overall likelihood P(t) The most likely hypothesis on each stack is extended further The Viterbi algorithm is used to find one-word extensions The final result: the best hypothesis on the stack at time T [Figure: one-word Viterbi extensions from stacks at times t1–t6; final result read from the stack at time T]
Applying Word Duration Penalties When placing a hypothesis onto stacks Compute the WD penalty based on the one-word extension Apply the penalty to the hypothesis’s likelihood score Setting a search range: WD_min and WD_max Reduces computational cost A typical duration range for a digit is between … ms [Figure: hypotheses placed on stacks at times t1–t5 along the time axis]
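A sketch of the start-synchronous stack search with duration penalties applied as hypotheses are placed on later stacks. Everything here is an assumption for illustration: one_word_extension stands in for the within-word Viterbi match, duration_penalty(w, d, u) for the scaled log P(d|w,u) sketched earlier, and the WD_MIN/WD_MAX values are placeholders rather than the figures missing from the slide.

    # Illustrative stack decoder (not the authors' implementation).
    WD_MIN, WD_MAX = 10, 120   # duration search range in frames (placeholder values)

    def decode(T, vocab, one_word_extension, duration_penalty):
        """Start-synchronous search: hypotheses are (log score, words, last-word duration)."""
        stacks = {0: [(0.0, (), 0)]}                     # stack index = reference end time t
        for t in range(T):
            if t not in stacks:
                continue
            score, words, last_dur = max(stacks[t])      # extend only the best hypothesis at t
            for w in vocab:
                # The following word u decides which histogram penalises the previous word.
                u = "pause" if w in ("sil", "sp") else "non-pause"
                penalty = duration_penalty(words[-1], last_dur, u) if words else 0.0
                for d in range(WD_MIN, min(WD_MAX, T - t) + 1):
                    acoustic = one_word_extension(w, t, t + d)     # one-word Viterbi match
                    hyp = (score + penalty + acoustic, words + (w,), d)
                    stacks.setdefault(t + d, []).append(hyp)
        best = max(stacks.get(T, [(float("-inf"), (), 0)]))
        return best[1]                                   # best word sequence ending at time T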
Recognition Experiments “Soft mask” missing data system, Spectral domain features 16 states per HMM, 7 Gaussians per state Silence model and short pause model in use Aurora 2 connected digits recognition task, clean training
Experiment Results Four recognition systems: 1. Baseline system, no duration model 2. + uniform duration model 3. + context-independent duration model 4. + context-dependent duration model
Discussion The context-dependent word duration model can offer a significant improvement With duration constraints the decoder can produce more reasonable duration patterns Assumes the duration pattern in clean conditions is the same as in noise Needs normalisation by speaking rate
Overview of the Talk Introduction to Multisource Decoding Context-dependent Word Duration Modelling “Speechiness” Measures of Fragments Discussion
Motivation for Measuring “Speechiness” The multisource decoder assumes each fragment has an equal probability of being speech or not We can measure the “speechiness” of each fragment These measures can be used to bias the decoder towards including the fragments that are more likely to be speech.
A Noisy Speech Corpus Aurora 2 connected digits mixed with either violins or drums A set of a priori fragments has been generated, but unlabelled Allows us to study the integration problem in isolation from the problem of fragment construction
A Priori Fragments
Recognition Results “Correct”: a priori fragments with correct labels “Fragments”: a priori fragments with no labels Results demonstrate that the top-down information in our HMMs is insufficient Word accuracy – Violins Correct: 93.04% Violins Fragments: 50.75% Drums Correct: 91.36% Drums Fragments: 33.76%
Approach to Measuring “Speechiness” Extract features that represent speech characteristics Use statistical models such as GMMs to fit the features Need a background model which fits everything Take the speech-model to background-model likelihood ratio as the confidence measure
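A minimal sketch of this likelihood-ratio measure using scikit-learn GMMs; the feature extraction, number of mixture components and choice of training material are assumptions for illustration rather than the configuration used in the talk.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_models(speech_feats, background_feats, n_components=2):
        """Fit a speech GMM and a background ("fits everything") GMM on frame-level features."""
        speech_gmm = GaussianMixture(n_components, covariance_type="full").fit(speech_feats)
        background_gmm = GaussianMixture(n_components, covariance_type="full").fit(background_feats)
        return speech_gmm, background_gmm

    def speechiness(fragment_feats, speech_gmm, background_gmm):
        """Mean log-likelihood ratio over a fragment: larger values = more speech-like."""
        return float(np.mean(speech_gmm.score_samples(fragment_feats)
                             - background_gmm.score_samples(fragment_feats)))

In the decoder this score would bias the weight given to segregation hypotheses that treat the fragment as speech, rather than forcing a hard speech/non-speech decision.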
Preliminary Experiment 1 – F0 Estimation Speech and other sounds differ in F0, and also in delta F0 Measure the F0 of each fragment rather than the full-band signal Compute the correlogram over all the frequency channels Only sum those channels within the fragment For each frame, find the peak to estimate its F0 Smooth F0 across the fragment
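A sketch of the fragment-constrained F0 estimate, assuming the correlogram A[frame, channel, lag] has already been computed from the auditory filterbank and fragment_mask[frame, channel] marks the fragment's region; the F0 search range and the median smoothing are illustrative choices.

    import numpy as np
    from scipy.ndimage import median_filter

    def fragment_f0(A, fragment_mask, fs, f0_range=(60.0, 400.0)):
        """Summary correlogram restricted to the fragment's channels, peak-picked per frame."""
        lag_min = int(fs / f0_range[1])
        lag_max = int(fs / f0_range[0])
        f0 = np.zeros(A.shape[0])
        for t in range(A.shape[0]):
            chans = np.where(fragment_mask[t])[0]
            if len(chans) == 0:
                continue                                  # fragment absent in this frame
            summary = A[t, chans].sum(axis=0)             # sum only channels in the fragment
            lag = lag_min + np.argmax(summary[lag_min:lag_max])
            f0[t] = fs / lag
        return median_filter(f0, size=5)                  # smooth F0 across the fragment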
Preliminary Experiment 1 – F0 Estimation Accuracies using different features – Pitch: 74.3% Delta pitch: 77.4% Both: 88.8% GMMs with full covariance and two Gaussians Speech fragments vs violin fragments Background model trained on violin fragments Log likelihood ratio threshold is 0
Preliminary Experiment 2 – Energy Ratios Speech has more energy around formants Divide the spectral features into frequency bands Compute the amount of fragment energy within each band, normalised by the full-band energy Two-band case: Channel Centre Frequency (CF) = 50 – 1000 – 3750 Hz Four-band case: CF = 50 – 282 – 707 – 1214 – 3850 Hz
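A sketch of the band energy-ratio features, assuming a matrix of spectral energies E[frame, channel], the channel centre frequencies cfs (Hz), and a boolean fragment mask; the default band edges below are the four-band values from the slide, and the helper names are hypothetical.

    import numpy as np

    def energy_ratios(E, fragment_mask, cfs, band_edges=(50, 282, 707, 1214, 3850)):
        """Fragment energy per band, normalised by the fragment's full-band energy."""
        frag_energy = np.where(fragment_mask, E, 0.0)
        total = frag_energy.sum() + 1e-12                 # avoid division by zero
        ratios = []
        for lo, hi in zip(band_edges[:-1], band_edges[1:]):
            in_band = (cfs >= lo) & (cfs < hi)
            ratios.append(frag_energy[:, in_band].sum() / total)
        return np.array(ratios)                           # e.g. 4 values in the four-band case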
Preliminary Experiment 2 – Energy Ratios Speech fragments vs music fragments (violins & drums) Full covariance GMMs with 4 Gaussians Background model trained on all types of fragments Accuracies using different features – Two bands: 79.7% Four bands: 93.2%
Summary and Plans Don’t need any hard classification; leave the confidence measures to the multisource decoder Assumes the background model is accessible; in practice a garbage model is needed Combine different features More speech features, e.g. syllabic rate
Thanks! Any questions?