Time-Domain Methods for Speech Processing 虞台文
Contents: Introduction; Time-Dependent Processing of Speech; Short-Time Energy and Average Magnitude; Short-Time Average Zero-Crossing Rate; Speech vs. Silence Discrimination Using Energy and Zero-Crossing; The Short-Time Autocorrelation Function; The Short-Time Average Magnitude Difference Function
Time-Domain Methods for Speech Processing Introduction
Speech Processing Methods Time-Domain Method: – Involving the waveform of speech signal directly. Frequency-Domain Method: – Involving some form of spectrum representation.
Time-Domain Measurements Average zero-crossing rate, energy, and the autocorrelation function. Very simple to implement. Provide a useful basis for estimating important features of the speech signal, e.g., – Voiced/unvoiced classification – Pitch estimation
Time-Domain Methods for Speech Processing Time-Dependent Processing of Speech
Time-Dependent Nature of Speech Example utterance: "This is a test."
Short-Time Behavior of Speech Assumption – The properties of the speech signal change slowly with time. Analysis Frames – Short segments of the speech signal. – Frames usually overlap one another.
Time-Dependent Analyses Analyzing each frame may produce either a single number, or a set of numbers, e.g., – Energy (a single number) – Vocal tract parameters (a set of numbers) This will produce a new time-dependent sequence.
General Form $Q_n = \sum_{m=-\infty}^{\infty} T[x(m)]\, w(n-m)$, where n: frame index; x(m): speech signal; T[ ]: a linear or nonlinear transformation; w(m): a window function (finite or infinite).
General Form $Q_n$ is a sequence of locally weighted average values of the sequence $T[x(m)]$.
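A minimal sketch of the general form, assuming it is the convolution sum $Q_n = \sum_m T[x(m)]\, w(n-m)$ with a finite causal window $w(0), \ldots, w(N-1)$. The function name `short_time` and the sample values are hypothetical, chosen only for illustration.

```python
def short_time(x, transform, window):
    """Compute Q_n = sum_m T[x(m)] * w(n - m) for every sample index n.

    `window` holds w(0) .. w(N-1); samples before the start of x count as zero.
    """
    t = [transform(v) for v in x]
    q = []
    for n in range(len(x)):
        acc = 0.0
        for j, wj in enumerate(window):  # j = n - m
            m = n - j
            if m >= 0:
                acc += t[m] * wj
        q.append(acc)
    return q

# Short-time energy as a special case: T[x] = x^2 with a rectangular window, N = 3.
x = [1.0, -2.0, 3.0, -1.0, 0.5]
energy = short_time(x, lambda v: v * v, [1.0, 1.0, 1.0])
```

Swapping the transformation for `abs` gives the average magnitude, and swapping the window changes only the smoothing filter.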
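Example Energy: $E = \sum_{m=-\infty}^{\infty} x^2(m)$. Short-time energy: $E_n = \sum_{m=n-N+1}^{n} x^2(m)$.
Example Short-time energy in the general form: $T[x(m)] = x^2(m)$, with $w(m) = 1$ for $0 \le m \le N-1$ and $0$ otherwise.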
Short-Time Energy Example
General Short-Time-Analysis Scheme $x(n)$ → $T[\,]$ → linear filter $w(n)$ → $Q_n$. Depending on the choice of window, the linear filter acts as a lowpass filter.
Time-Domain Methods for Speech Processing Short-Time Energy and Average Magnitude
Applications Silence Detection Segmentation Lip Sync …
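Short-Time Energy $E_n = \sum_{m=-\infty}^{\infty} [x(m)\, w(n-m)]^2 = \sum_{m=-\infty}^{\infty} x^2(m)\, h(n-m)$, where $h(n) = w^2(n)$.
Short-Time Average Magnitude $M_n = \sum_{m=-\infty}^{\infty} |x(m)|\, w(n-m)$.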
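The two measurements can be sketched side by side, assuming the definitions $E_n = \sum_m [x(m)\, w(n-m)]^2$ and $M_n = \sum_m |x(m)|\, w(n-m)$ with a finite causal window. The function name and test signal are hypothetical.

```python
def energy_and_magnitude(x, n, w):
    """Return (E_n, M_n) at sample index n for a causal window w(0) .. w(N-1)."""
    e = 0.0
    mag = 0.0
    for j, wj in enumerate(w):  # j = n - m
        m = n - j
        if 0 <= m < len(x):
            e += (x[m] * wj) ** 2   # same as x(m)^2 * h(n - m) with h = w^2
            mag += abs(x[m]) * wj
    return e, mag

x = [3.0, -4.0, 0.0, 1.0]
e1, m1 = energy_and_magnitude(x, 1, [1.0, 1.0])  # rectangular window, N = 2
```

Squaring makes the large samples dominate $E_n$, whereas $M_n$ weights all samples linearly.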
Block Diagram Representation $x(n)$ → $[\,]^2$ → $x^2(n)$ → $h(n)$ → $E_n$; $x(n)$ → $|\,|$ → $|x(n)|$ → $w(n)$ → $M_n$. What is the effect of the window?
The Effects of Windows Window length Window function
Rectangular Window $w(n) = 1$ for $0 \le n \le N-1$, and $0$ otherwise.
Rectangular Window [Figure: frequency response for N = 8, annotated with the mainlobe width and the peak sidelobe.] Questions: What does the plot show? Discuss the effect of window duration. Discuss the effect of mainlobe width and peak sidelobe.
Commonly Used Windows Rectangular, Bartlett (triangular), Hanning, Hamming, Blackman.
Commonly Used Windows For a given length N, the rectangular window has the narrowest mainlobe.
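For reference, the five windows can be generated as follows. This is a sketch using the common textbook definitions on $0 \le n \le N-1$ with $N > 1$; the exact formulas and normalization used on the slides are an assumption.

```python
import math

def rectangular(N):
    return [1.0] * N

def bartlett(N):
    # Triangular: rises linearly from 0 to 1 at the centre, then falls back to 0.
    return [1.0 - abs(2.0 * n / (N - 1) - 1.0) for n in range(N)]

def hanning(N):
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def hamming(N):
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def blackman(N):
    return [0.42
            - 0.5 * math.cos(2 * math.pi * n / (N - 1))
            + 0.08 * math.cos(4 * math.pi * n / (N - 1))
            for n in range(N)]

tri = bartlett(5)
ham = hamming(5)
```

The tapered windows trade a wider mainlobe for lower sidelobes compared with the rectangular window.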
Examples: Short-Time Energy [Figure: rectangular window vs. Hamming window.]
Examples: Average Magnitude [Figure: rectangular window vs. Hamming window.]
The Effects of Window Length Increasing the window length N decreases the bandwidth. If N is too small, e.g., less than one pitch period, $E_n$ and $M_n$ will fluctuate very rapidly. If N is too large, e.g., on the order of several pitch periods, $E_n$ and $M_n$ will change very slowly and will not follow the changing properties of the speech signal.
The Choice of Window Length No single value of N is entirely satisfactory, because the duration of a pitch period varies from about 2 ms for a high-pitched female or child voice up to 25 ms for a very low-pitched male voice.
Sampling Rate The bandwidth of both $E_n$ and $M_n$ is just that of the lowpass filter, so they need not be sampled as frequently as the speech signal. For example – Frame size = 20 ms – Sample period = 10 ms
Main Applications of $E_n$ and $M_n$ – Providing a basis for distinguishing voiced speech segments from unvoiced segments. – Silence detection.
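As a worked example of the 20 ms frame with a 10 ms sample period, the sketch below converts both to sample counts and lists the frame start positions. The 10 kHz sampling rate and the function name are assumptions made for illustration.

```python
def frame_starts(num_samples, fs, frame_ms=20.0, hop_ms=10.0):
    """Return (frame length, hop, start indices) in samples for overlapping frames."""
    frame_len = int(round(fs * frame_ms / 1000.0))
    hop = int(round(fs * hop_ms / 1000.0))
    starts = list(range(0, num_samples - frame_len + 1, hop))
    return frame_len, hop, starts

# One second of speech at an assumed sampling rate of 10 kHz.
frame_len, hop, starts = frame_starts(10000, 10000)
```

Under these assumptions each frame holds 200 samples, consecutive frames overlap by half, and one second of speech yields 99 values of $E_n$ or $M_n$ instead of 10000 samples.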
Differences Between $E_n$ and $M_n$ $E_n$ emphasizes large sample-to-sample variations in x(n), because the samples are squared. The dynamic range (max/min) of $M_n$ is approximately the square root of that of $E_n$. With $M_n$, the differences in level between voiced and unvoiced regions are not as pronounced as with $E_n$.
FIR and IIR All the windows discussed so far are FIR filters, and each of them is a lowpass filter. The window can also be an IIR filter.
IIR Example Recursive formulas: Short-Time Energy: Short-Time Average magnitude:
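A minimal sketch of recursive computation, assuming an exponential window $h(n) = a^n$ for $n \ge 0$ with $0 < a < 1$. This window, and the resulting recursions $E_n = a E_{n-1} + x^2(n)$ and $M_n = a M_{n-1} + |x(n)|$, are assumptions for illustration and may differ from the formulas on the original slide.

```python
def recursive_energy_magnitude(x, a=0.9):
    """Run E_n = a*E_{n-1} + x(n)^2 and M_n = a*M_{n-1} + |x(n)| over the signal."""
    e, m = 0.0, 0.0
    energies, magnitudes = [], []
    for v in x:
        e = a * e + v * v
        m = a * m + abs(v)
        energies.append(e)
        magnitudes.append(m)
    return energies, magnitudes

# A single unit impulse shows the exponential decay of the window.
E, M = recursive_energy_magnitude([1.0, 0.0, 0.0], a=0.5)
```

Each output needs only one multiplication and one addition per sample, regardless of the effective window length.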
Time-Domain Methods for Speech Processing Short-Time Average Zero-Crossing Rate
Voiced and Unvoiced Signals Voiced example: the /i/ in "this". Unvoiced example: the /s/ in "this".
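The Short-Time Average Zero-Crossing Rate $Z_n = \sum_{m=-\infty}^{\infty} |\mathrm{sgn}[x(m)] - \mathrm{sgn}[x(m-1)]|\, w(n-m)$. Block diagram: $x(n)$ → sgn[ ] → first difference → $|\,|$ → lowpass filter → $Z_n$.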
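The zero-crossing measurement can be sketched as below, assuming the usual definition with $\mathrm{sgn}[x] = 1$ for $x \ge 0$ and $-1$ otherwise, and a rectangular window scaled by $1/(2N)$ so that the result is the average number of crossings per sample. The function names are hypothetical.

```python
def sgn(v):
    return 1 if v >= 0 else -1

def zero_crossing_rate(x, n, N):
    """Average zero crossings per sample over the N samples ending at index n."""
    total = 0
    for m in range(max(1, n - N + 1), n + 1):
        total += abs(sgn(x[m]) - sgn(x[m - 1]))  # 2 at a sign change, else 0
    return total / (2.0 * N)

noisy = [1, -1, 1, -1, 1, -1, 1, -1]   # changes sign at every sample
smooth = [1, 1, 1, 1, -1, -1, -1, -1]  # changes sign once
z_noisy = zero_crossing_rate(noisy, 7, 4)
z_smooth = zero_crossing_rate(smooth, 7, 4)
```

Unvoiced sounds such as /s/ tend to give high values, while voiced sounds give low values.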
Distribution of Zero-Crossings
Example
Time-Domain Methods for Speech Processing Speech vs. Silence Discrimination Using Energy and Zero-Crossing
Speech vs. Silence Discrimination Locating the beginning and end of a speech utterance in an environment with background noise. Applications: – Segmentation of isolated words – Automatic speech recognition – Saving bandwidth in speech transmission
Examples: In some cases, we can locate the beginning and end of a speech utterance using energy alone.
Examples: In other cases, we can locate the beginning and end of a speech utterance using zero-crossing rate alone.
Examples: Sometimes, we cannot do it using one criterion alone. [Figure annotation: actual beginning of the utterance.]
Difficulties In general, it is difficult to locate the boundaries if we encounter the following cases: – Weak fricatives (/f/, /th/, /h/) at the beginning or end. – Weak plosive bursts (/p/, /t/, /k/) at the beginning or end. – Nasals at the end. – Voiced fricatives which become devoiced at the end of words. – Trailing off of vowel sounds at the end of an utterance.
Rabiner and Sambur A 10 ms frame is used, and the measurements are computed 100 times per second. The algorithm assumes that the first 100 ms of the interval contains no speech. The means and standard deviations of the average magnitude and zero-crossing rate over this interval are computed to characterize the background noise.
The Algorithm
[Figure: steps 1, 2, and 3 of the algorithm; the search covers no more than 25 frames.]
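The noise-characterization step can be sketched as follows. This is a simplified illustration, not the full Rabiner and Sambur procedure: it estimates the background statistics from the leading frames and then marks the first frame whose magnitude exceeds a threshold. The threshold rule (mean plus `k` standard deviations), the value of `k`, the function names, and the toy signal are all assumptions.

```python
import math

def frame_features(x, frame_len):
    """Per-frame (sum of magnitudes, zero-crossing count), non-overlapping frames."""
    feats = []
    for s in range(0, len(x) - frame_len + 1, frame_len):
        f = x[s:s + frame_len]
        mag = sum(abs(v) for v in f)
        zc = sum(1 for i in range(1, len(f)) if (f[i] >= 0) != (f[i - 1] >= 0))
        feats.append((mag, zc))
    return feats

def mean_std(values):
    mu = sum(values) / len(values)
    return mu, math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))

def noise_statistics(feats, noise_frames):
    """Mean and standard deviation of both features over the assumed-silent frames."""
    mags = [m for m, _ in feats[:noise_frames]]
    zcs = [z for _, z in feats[:noise_frames]]
    return mean_std(mags), mean_std(zcs)

def first_frame_above(feats, noise_frames, k=3.0):
    (mu, sd), _ = noise_statistics(feats, noise_frames)
    threshold = mu + k * sd  # k is a hypothetical tuning constant
    for i in range(noise_frames, len(feats)):
        if feats[i][0] > threshold:
            return i
    return None

# Toy signal: 10 quiet frames followed by 4 loud frames (4 samples per frame).
x = [0.01, -0.01] * 20 + [1.0, -1.0] * 8
feats = frame_features(x, 4)
start = first_frame_above(feats, noise_frames=10)
```

With 10 ms frames, `noise_frames=10` corresponds to the 100 ms assumed to contain no speech. A practical detector would also use the zero-crossing statistics to refine the boundaries.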
Examples
Time-Domain Methods for Speech Processing The Short-Time Autocorrelation Function
Autocorrelation Functions $\phi(k) = \sum_{m=-\infty}^{\infty} x(m)\, x(m+k)$
Properties 1. Even: $\phi(k) = \phi(-k)$. 2. $\phi(k) \le \phi(0)$ for all k. 3. $\phi(0)$ is equal to the energy of x(m).
Properties 4. If x(m) has period P, i.e., $x(m) = x(m+P)$, then $\phi(k) = \phi(k+P)$. This motivates us to use autocorrelation for pitch detection.
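The properties can be checked numerically on a finite signal, assuming $\phi(k) = \sum_m x(m)\, x(m+k)$ with samples outside the record taken as zero. The sine test signal, the lag search range, and the function names are hypothetical.

```python
import math

def autocorr(x, k):
    """phi(k) for a finite record; using |k| reflects the even symmetry."""
    k = abs(k)
    return sum(x[m] * x[m + k] for m in range(len(x) - k))

def estimate_period(x, min_lag, max_lag):
    """Lag of the largest autocorrelation value within the search range."""
    return max(range(min_lag, max_lag + 1), key=lambda k: autocorr(x, k))

P = 20
x = [math.sin(2 * math.pi * m / P) for m in range(200)]
period = estimate_period(x, 5, 30)
```

For a finite record the peak at lag P is slightly lower than $\phi(0)$ because fewer products are summed, but it still stands out clearly from the neighbouring lags.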
Short-Time Version $R_n(k) = \sum_{m=-\infty}^{\infty} x(m)\, w(n-m)\, x(m+k)\, w(n-k-m)$
Property $R_n(k) = R_n(-k)$, where $R_n(-k) = \sum_{m=-\infty}^{\infty} x(m)\, w(n-m)\, x(m-k)\, w(n+k-m)$.
Property $R_n(k) = \sum_{m=-\infty}^{\infty} y_k(m)\, h_k(n-m)$, where $y_k(m) = x(m)\, x(m-k)$ and $h_k(n) = w(n)\, w(n+k)$.
Block diagram: $x(n)$ is multiplied by its delayed copy $x(n-k)$ (delay $z^{-k}$), and the product is filtered by $h_k(n)$ to give $R_n(k)$.
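The equivalence between the direct definition and the filter formulation can be checked numerically. The sketch assumes $R_n(k) = \sum_m x(m)\, w(n-m)\, x(m+k)\, w(n-k-m)$ for the direct form, and $y_k(m) = x(m)\, x(m-k)$ filtered by $h_k(n) = w(n)\, w(n+k)$ for the filter form. The helper names and sample values are hypothetical.

```python
def at(seq, i):
    """Value of a finite sequence at index i, zero outside its support."""
    return seq[i] if 0 <= i < len(seq) else 0.0

def r_direct(x, w, n, k):
    # Sum over m of x(m) w(n-m) x(m+k) w(n-k-m); x is zero outside 0..len(x)-1.
    return sum(at(x, m) * at(w, n - m) * at(x, m + k) * at(w, n - k - m)
               for m in range(len(x)))

def r_filter(x, w, n, k):
    # Sum over m of y_k(m) h_k(n-m), with y_k(m) = x(m) x(m-k), h_k(n) = w(n) w(n+k).
    return sum(at(x, m) * at(x, m - k) * at(w, n - m) * at(w, n - m + k)
               for m in range(len(x)))

x = [1.0, 2.0, -1.0, 3.0, 0.5, -2.0, 1.5, 0.0]
w = [1.0, 0.5, 0.25, 0.125]
direct = r_direct(x, w, 5, 2)
filtered = r_filter(x, w, 5, 2)
```

The two sums contain the same products in a different order, so they agree up to floating-point rounding.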
Another Formulation
A noncausal formulation
Examples [Figure: short-time autocorrelation of voiced and unvoiced speech with a rectangular window and a Hamming window, N = 401.]
Examples Fewer samples are involved for larger lag k. [Figure: N = 401, N = 251, N = 125.]
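The shrinking number of terms can be seen in a finite-window form of the short-time autocorrelation. The sketch assumes $R_n(k) = \sum_{m=0}^{N-1-k} [x(n+m)\, w(m)]\,[x(n+m+k)\, w(m+k)]$, one common way of writing it for a window of length N starting at sample n. The function name and the constant test signal are hypothetical.

```python
def short_time_autocorr(x, n, k, w):
    """R_n(k) over the N = len(w) windowed samples starting at index n."""
    N = len(w)
    total = 0.0
    for m in range(N - k):  # only N - k products fit inside the window
        total += (x[n + m] * w[m]) * (x[n + m + k] * w[m + k])
    return total

# With a constant signal and a rectangular window, R_n(k) counts the terms.
x = [1.0] * 20
values = [short_time_autocorr(x, 5, k, [1.0] * 8) for k in range(4)]
```

Here the values fall linearly with k, which is why the peaks of a periodic signal taper off at larger lags and why N must span at least two pitch periods.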
Modified Short-Time Autocorrelation Function Original Version: Modified Version:
Modified Short-Time Autocorrelation Function K: maximum lag.
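A sketch of the modified version, assuming rectangular windows of different lengths: N samples for the first factor and N + K samples for the lagged factor, so that $\hat{R}_n(k) = \sum_{m=0}^{N-1} x(n+m)\, x(n+m+k)$ for $0 \le k \le K$. The function name and test signal are hypothetical.

```python
def modified_autocorr(x, n, k, N):
    """Sum of N products for every lag k; needs samples up to index n + N - 1 + k."""
    return sum(x[n + m] * x[n + m + k] for m in range(N))

# With a constant signal the value no longer decreases with the lag.
x = [1.0] * 20
values = [modified_autocorr(x, 5, k, 8) for k in range(4)]
```

Every lag now uses the same number of products, at the cost of needing K extra samples beyond the frame. Strictly speaking the result is a cross-correlation of two different segments, so it is no longer guaranteed to be even in k.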
Examples [Figure: voiced and unvoiced speech, rectangular window vs. modified version, N = 401; the results are similar.]
Examples [Figure: rectangular window vs. modified version, N = 401, N = 251, N = 125.]
Time-Domain Methods for Speech Processing The Short-Time Average Magnitude Difference Function
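The AMDF If x(n) is periodic with period P, then the difference $d(n) = x(n) - x(n-k)$ is zero for $k = 0, \pm P, \pm 2P, \ldots$ The AMDF is computationally more efficient than the autocorrelation, because it needs only subtractions and absolute values, not multiplications.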
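A minimal AMDF sketch, assuming rectangular windows and a forward lag, i.e. $\gamma_n(k) = \sum_{m=0}^{N-1} |x(n+m) - x(n+m+k)|$ without normalization. The function names, the sine test signal, and the lag search range are hypothetical.

```python
import math

def amdf(x, n, k, N):
    """Sum of magnitude differences between the frame and its copy shifted by k."""
    return sum(abs(x[n + m] - x[n + m + k]) for m in range(N))

def estimate_period_amdf(x, n, N, min_lag, max_lag):
    """Lag of the deepest AMDF null within the search range."""
    return min(range(min_lag, max_lag + 1), key=lambda k: amdf(x, n, k, N))

P = 20
x = [math.sin(2 * math.pi * m / P) for m in range(200)]
period = estimate_period_amdf(x, 0, 100, 5, 30)
```

Where the autocorrelation shows peaks at multiples of the period, the AMDF shows nulls, so pitch estimation looks for a minimum instead of a maximum.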
Example [Figure: AMDF of voiced and unvoiced speech.]
Exercise Record a sample of your own speech and perform voiced/unvoiced segmentation on it. Design an efficient algorithm to compute the autocorrelation.