UNIT-IV
Introduction Speech signal is generated from a system. Generation is via excitation of the system. Speech travels through various media. Nature of speech may deviate due to issues associated with the system, the excitation or the media. Speech signal characteristics change, and hence performance is poor.
Different Sources of Degradation Internally or externally induced stress. Excitation source or vocal tract system disorders. Sensor and channel mismatches. Background noise present in the media. Reverberation present in the media. Other speakers’ speech present in the media. Speech signal degrades due to one or more of the above.
Scope of Speech Enhancement Degradations present in media: background noise, reverberation and other speakers’ speech. Nature of degradation is different in each case. Issues to be addressed are different. Noisy speech enhancement. Enhancement of reverberant speech. Processing multispeaker speech.
Approaches for Speech Enhancement Two schools of thought! Estimate degradation and minimize the same from degraded speech. Use speech specific knowledge and enhance speech components from degraded speech. Both have their own merits. Intelligent combination takes benefit of both.
Human way of Speech Enhancement Two ears, binaural mechanism. Selective attention. Cognitive processing. Focus on speech components. Repeat request. We experience it, but do not fully understand it!
Noisy Speech Enhancement sd(n) = s(n) + d(n), where s(n) is clean speech, d(n) is background noise and sd(n) is noisy speech. Note: d(n) combines additively with s(n), hence the name additive background noise. This is the model assumed for noisy speech. Examples of additive background noise: white noise, colored noise, factory noise, babble noise... How to achieve enhancement in case of noisy speech?
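The additive model can be tried out numerically. The following is a minimal sketch, assuming NumPy, a sine tone as a stand-in for clean speech, and white Gaussian noise; the helper name add_noise_at_snr and the 10 dB target are illustrative choices, not part of the slides.

```python
import numpy as np

def add_noise_at_snr(s, d, snr_db):
    """Return sd(n) = s(n) + g*d(n), with g chosen so that the SNR equals snr_db.

    Assumes d has at least as many samples as s.
    """
    s = np.asarray(s, dtype=float)
    d = np.asarray(d, dtype=float)[: len(s)]
    p_s = np.mean(s ** 2)                      # clean-speech power
    p_d = np.mean(d ** 2)                      # noise power before scaling
    g = np.sqrt(p_s / (p_d * 10.0 ** (snr_db / 10.0)))
    return s + g * d, g * d

rng = np.random.default_rng(0)
n = np.arange(8000)
s = np.sin(2 * np.pi * 200 * n / 8000)         # stand-in for clean speech s(n)
d = rng.standard_normal(len(n))                # white noise d(n)
sd, d_scaled = add_noise_at_snr(s, d, snr_db=10.0)
```

Swapping d for a recorded factory or babble noise segment would give the other noise types listed above under the same additive model.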
Enhancement by Estimating Degradation
Spectral Subtraction based Noisy Speech Enhancement We assume that the noise is additive and uncorrelated with the desired signal.
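One common form of spectral subtraction can be sketched as follows. This is an illustration under stated assumptions, not necessarily the variant intended in the slides: magnitude subtraction, a periodic Hann window with 50% overlap, a noise estimate taken from the first few frames (assumed to contain noise only), a small spectral floor, and reuse of the noisy phase. All parameter names and values are hypothetical.

```python
import numpy as np

def spectral_subtraction(sd, frame_len=256, hop=128, noise_frames=5, floor=0.01):
    """Magnitude spectral subtraction with overlap-add resynthesis."""
    # Periodic Hann window: shifted copies sum to 1 when hop = frame_len / 2
    win = np.hanning(frame_len + 1)[:-1]
    n_frames = 1 + (len(sd) - frame_len) // hop
    frames = np.stack([sd[i * hop : i * hop + frame_len] * win
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)

    # Assumption: the first `noise_frames` frames contain noise only
    noise_mag = mag[:noise_frames].mean(axis=0)

    # Subtract the noise estimate; clamp to a small floor instead of going negative
    clean_mag = np.maximum(mag - noise_mag, floor * noise_mag)

    # Resynthesize with the noisy phase and overlap-add the frames
    clean_frames = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    out = np.zeros(len(sd))
    for i in range(n_frames):
        out[i * hop : i * hop + frame_len] += clean_frames[i]
    return out

rng = np.random.default_rng(0)
fs = 8000
n = np.arange(fs)
s = np.where(n >= 2000, np.sin(2 * np.pi * 440 * n / fs), 0.0)  # silence, then a tone
sd = s + 0.3 * rng.standard_normal(len(n))                       # noisy signal
enhanced = spectral_subtraction(sd)
```

The residual that survives the subtraction is the familiar "musical noise"; the floor parameter trades residual noise against that artifact.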
Enhancement using Speech-Specific Knowledge Noisy speech model: sd(n) = s(n) + d(n). Let w(n) be a weight function estimated using speech-specific knowledge. w(n) gives more emphasis to speech-specific high SNR regions. Enhanced speech: s̃(n) = sd(n) × w(n) = (s(n) + d(n)) × w(n).
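The slides do not say how w(n) is estimated. As a rough illustration only, the sketch below uses short-time energy as a simple stand-in for speech-specific knowledge: w(n) approaches 1 where the local energy is well above an assumed noise floor and falls to w_min elsewhere. The function name energy_weight, the 10th-percentile noise-floor rule and all parameter values are hypothetical.

```python
import numpy as np

def energy_weight(sd, frame_len=200, w_min=0.1):
    """Hypothetical w(n): close to 1 in high-energy regions, w_min elsewhere."""
    # Short-time energy as a moving average of sd(n)^2
    kernel = np.ones(frame_len) / frame_len
    energy = np.convolve(sd ** 2, kernel, mode="same")
    # Assumption: the quietest 10% of the signal reflects the noise floor
    noise_floor = np.percentile(energy, 10) + 1e-12
    snr_est = np.maximum(energy / noise_floor - 1.0, 0.0)   # crude local SNR estimate
    gain = snr_est / (1.0 + snr_est)                        # maps 0 -> 0, large -> 1
    return w_min + (1.0 - w_min) * gain

rng = np.random.default_rng(0)
fs = 8000
n = np.arange(fs)
s = np.where(n >= 2000, np.sin(2 * np.pi * 440 * n / fs), 0.0)  # silence, then a tone
sd = s + 0.3 * rng.standard_normal(len(n))
w = energy_weight(sd)
s_tilde = sd * w          # enhanced speech: sd(n) x w(n)
```

Because w(n) scales speech and noise alike at each instant, the gain comes from de-emphasizing low SNR regions rather than from removing noise inside high SNR regions.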
Enhancement by Comb Filter There are many algorithms for enhancement of noise-corrupted speech. Specific properties of voiced speech signals, which can be considered as quasi-harmonic signals, are exploited here. The voiced speech signal x(t) can be considered as a sum of sine waves whose frequencies are integral multiples of the fundamental frequency F0: x(t) = Σ_{k=1..N} A_k sin(2π k F0 t + φ_k), where A_k and φ_k denote the amplitude and phase of the k-th harmonic.
Cont’d The number N is the assumed number of harmonics of the voiced speech signal. A comb filter is a filter with multiple pass bands and stop bands. For transmitting only the harmonic components of the speech signal, the pass bands must be centered at multiples of the speech fundamental frequency, i.e. the frequency response of the comb filter has to be a periodic function with period equal to the fundamental frequency.
Cont’d Because voiced speech signals have a time-varying fundamental frequency, the comb filter for the enhancement of voiced speech has to be an adaptive filter tuned by the instantaneous fundamental frequency of the speech. This means that the comb filter varies from frame to frame. A comb filter can be constructed by frequency transformation of an FIR or IIR prototype filter. Because almost all processing in a speech enhancement algorithm based on spectral subtraction is performed in the spectral domain, it is appropriate to design and apply the comb filter in the spectral domain too.
Cont’d Comb filtering by itself is not sufficient to suppress the noisy background in a noise-degraded speech signal. Further, it is difficult to estimate the fundamental frequency of noisy speech. Therefore we use comb filtering as a post-processing operation after speech enhancement, e.g. by spectral subtraction. The comb filter is constructed and applied in the frequency domain, only for voiced frames, after the classical speech enhancement by spectral subtraction. For the construction of the comb filter we have to know the actual value of the speech fundamental frequency F0. For its estimation it is appropriate to use a pitch determination algorithm that also works in the spectral domain. If the spectrum after the classical speech enhancement is identified as unvoiced, comb filtering is not applied.
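A spectral-domain comb filter for a single voiced frame might look like the sketch below. It assumes F0 is already known for the frame (pitch determination and the voiced/unvoiced decision are not shown) and uses a raised-cosine gain as one possible periodic shape with period F0; the function names, g_min and sharpness are hypothetical choices.

```python
import numpy as np

def comb_gain(freqs, f0, g_min=0.1, sharpness=4):
    """Periodic gain with period f0: 1 at k*f0, g_min halfway between harmonics."""
    ripple = (0.5 * (1.0 + np.cos(2.0 * np.pi * freqs / f0))) ** sharpness
    return g_min + (1.0 - g_min) * ripple

def comb_filter_frame(frame, fs, f0):
    """Apply the comb gain to one voiced frame in the spectral domain."""
    spec = np.fft.rfft(frame)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    # Real, zero-phase gain: only the magnitudes are reshaped
    return np.fft.irfft(spec * comb_gain(freqs, f0), n=len(frame))

rng = np.random.default_rng(0)
fs, f0, n_harm = 8000, 200.0, 5
t = np.arange(400) / fs                                   # one 50 ms frame
voiced = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, n_harm + 1))
noisy = voiced + 0.5 * rng.standard_normal(len(t))
filtered = comb_filter_frame(noisy, fs, f0)
```

In an adaptive setting the gain would be rebuilt for every voiced frame from that frame's F0 estimate, and unvoiced frames would bypass comb_filter_frame entirely. Note that a gain that is periodic in F0 also has a pass band at 0 Hz.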
Enhancement by Wiener Filter Additive noise: Let y[n] be a discrete-time noisy sequence, y[n] = x[n] + b[n], where x[n] is the desired signal and b[n] is the unwanted background noise. An alternative to spectral subtraction for recovering a signal corrupted by additive noise is to find a linear filter h[n] such that the sequence x̂[n] = y[n] ∗ h[n], where x̂[n] is the estimate of x[n], minimizes the mean-squared error E{(x̂[n] − x[n])²} (the MMSE criterion).
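For a causal FIR filter h[n], minimizing the mean-squared error leads to a set of linear normal equations, R_yy h = r_xy, which can be solved directly. The sketch below is an idealized illustration: it estimates the correlations from the clean signal x[n], which is not available in practice, where the statistics would have to be estimated (for example, the noise statistics from speech pauses). The function name fir_wiener and the filter order are hypothetical.

```python
import numpy as np

def fir_wiener(y, x, order=32):
    """Solve the normal equations R_yy h = r_xy for a causal FIR filter h[n]."""
    n = len(y)
    # Biased sample correlations for lags 0 .. order-1
    r_yy = np.array([np.dot(y[: n - k], y[k:]) for k in range(order)]) / n
    r_xy = np.array([np.dot(x[k:], y[: n - k]) for k in range(order)]) / n
    # Symmetric Toeplitz autocorrelation matrix: entry (j, k) is r_yy[|j - k|]
    idx = np.abs(np.arange(order)[:, None] - np.arange(order)[None, :])
    return np.linalg.solve(r_yy[idx], r_xy)

rng = np.random.default_rng(0)
n = np.arange(4000)
x = np.sin(2 * np.pi * 200 * n / 8000)         # desired signal x[n]
y = x + 0.5 * rng.standard_normal(len(n))      # noisy sequence y[n] = x[n] + b[n]
h = fir_wiener(y, x)                           # oracle statistics: uses clean x
x_hat = np.convolve(y, h)[: len(y)]            # estimate: y[n] * h[n]
```

When x[n] and b[n] are uncorrelated, r_xy reduces to the autocorrelation of x[n], so only the signal and noise statistics are needed rather than the clean signal itself.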