1 Chapter 3 Time Domain Analysis of Speech Signal

2 3.1 Short-time windowing signal (1)
Three types of windows:
– Rectangular window
– $h_r[n] = u[n] - u[n-N]$
– $H_r(e^{j\omega}) = \dfrac{\sin(\omega N/2)}{\sin(\omega/2)}\, e^{-j\omega(N-1)/2}$
– Generalized Hamming window
– $h_h[n] = (1-\alpha) - \alpha\cos(2\pi n/N)$, $0 \le n < N$
– $\phantom{h_h[n]} = (1-\alpha)\,h_r[n] - \alpha\,h_r[n]\cos(2\pi n/N)$
– $H_h(e^{j\omega}) = (1-\alpha)\,H_r(e^{j\omega}) - \dfrac{\alpha}{2}\,H_r(e^{j(\omega - 2\pi/N)}) - \dfrac{\alpha}{2}\,H_r(e^{j(\omega + 2\pi/N)})$
– $\alpha = 0.5$ gives the Hanning window, $\alpha = 0.46$ the Hamming window
– The windowed signal is $x_w(n) = x(n)\,w(n)$
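As a rough illustration of the window definitions above, the following Python sketch evaluates the generalized Hamming window directly from its formula. The function names and the test signal are illustrative assumptions, not part of the slides.

```python
import numpy as np

def rectangular(N):
    """h_r[n] = u[n] - u[n - N]: ones for 0 <= n < N."""
    return np.ones(N)

def generalized_hamming(N, alpha):
    """h_h[n] = (1 - alpha) - alpha * cos(2*pi*n/N) for 0 <= n < N."""
    n = np.arange(N)
    return (1.0 - alpha) * rectangular(N) - alpha * rectangular(N) * np.cos(2.0 * np.pi * n / N)

N = 8
hanning = generalized_hamming(N, 0.5)    # alpha = 0.5
hamming = generalized_hamming(N, 0.46)   # alpha = 0.46

# Windowed signal x_w(n) = x(n) w(n), using an arbitrary test signal.
x = np.sin(2.0 * np.pi * np.arange(N) / N)
x_w = x * hamming
```

Note that the slide's formula divides by N rather than N - 1, so the window is the periodic form and does not return to its starting value at n = N - 1.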

3 Short-time windowing signal (2)
$Q_n = \sum_{m=n-N+1}^{n} T[x(m)]\,w(n-m)$
This is another representation for the analysis. Because the window length is finite, $Q_n$ is a sequence of local weighted averages of the sequence $T[x(m)]$. $T[\cdot]$ is a linear or nonlinear transformation. $Q_n$ describes the short-time properties of the speech signal.
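The sum for $Q_n$ is a convolution of the transformed signal with the window, so it can be sketched with `np.convolve`. This is a minimal sketch assuming $x(m) = 0$ for $m < 0$; the function name `short_time` is hypothetical.

```python
import numpy as np

def short_time(x, w, T):
    """Q_n = sum_{m=n-N+1}^{n} T[x(m)] w(n-m), with x(m) taken as 0 for m < 0.

    x: input samples, w: window of length N (w[0..N-1]), T: elementwise transform.
    """
    tx = T(np.asarray(x, dtype=float))
    # Full convolution has len(x) + N - 1 terms; keep the first len(x) (n = 0..len(x)-1).
    return np.convolve(tx, w)[:len(tx)]

# With T = squaring and a rectangular window, Q_n is a running short-time energy.
Q = short_time([1, 2, 3, 4], np.ones(2), np.square)
```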

4 3.2 Time domain parameters (1)
3.2.1 Short-time energy and short-time average amplitude
– $E_n = \sum_{m=n}^{n+N-1} x_w^2(m)$ (using a rectangular window)
– The summation runs from $n$ to $n+N-1$
– For a voiced segment (or frame) $E_n$ is large; for an unvoiced segment it is small
– $E_n$ is very sensitive to large signal levels, because each sample is squared
– $M_n = \frac{1}{N}\sum_{m=n}^{n+N-1} |x_w(m)|$
– $M_n$ also describes the average intensity of the signal
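A minimal sketch of $E_n$ and $M_n$ for one frame, assuming a rectangular window and that the frame $n \ldots n+N-1$ lies entirely inside the signal. The function names are illustrative.

```python
import numpy as np

def short_time_energy(x, n, N):
    """E_n = sum of x(m)^2 for m = n .. n+N-1 (rectangular window)."""
    frame = np.asarray(x, dtype=float)[n:n + N]
    return float(np.sum(frame ** 2))

def short_time_magnitude(x, n, N):
    """M_n = (1/N) * sum of |x(m)| for m = n .. n+N-1."""
    frame = np.asarray(x, dtype=float)[n:n + N]
    return float(np.sum(np.abs(frame)) / N)

x = [1, -2, 3, -4, 5]
E = short_time_energy(x, n=1, N=3)      # frame is [-2, 3, -4]
M = short_time_magnitude(x, n=1, N=3)
```

Doubling the amplitude of a frame quadruples $E_n$ but only doubles $M_n$, which is the sensitivity difference the slide refers to.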

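5 Time domain parameters (2)
3.2.2 Short-time average zero-crossing rate
– $Z_n = \sum_{m=n}^{n+N-1} \left|\,\mathrm{sgn}[x_w(m)] - \mathrm{sgn}[x_w(m-1)]\,\right|$
– where $\mathrm{sgn}(x) = 1$ for $x \ge 0$ and $\mathrm{sgn}(x) = -1$ for $x < 0$, so each sign change contributes 2 to the sum
– $Z_n$ gives a rough estimate of the frequency of the signal
– Multiple thresholds for zero-crossing:
– $Z_{ni} = \sum_{m=n}^{n+N-1} \left\{ \left|\,\mathrm{sgn}[x_w(m) - T_i] - \mathrm{sgn}[x_w(m-1) - T_i]\,\right| + \left|\,\mathrm{sgn}[x_w(m) + T_i] - \mathrm{sgn}[x_w(m-1) + T_i]\,\right| \right\}$, $i = 1, 2, 3, \ldots$
– This gives some ability to avoid low-frequency interference. Random noise will not contribute to $Z_{ni}$.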
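The two counts above can be sketched as follows. This is a minimal sketch that assumes $n \ge 1$ so that $x(m-1)$ exists, and it uses the slide's convention $\mathrm{sgn}(0) = 1$; the function names are illustrative.

```python
import numpy as np

def sgn(x):
    """sgn(x) = 1 for x >= 0, -1 for x < 0 (the slide's definition)."""
    return np.where(np.asarray(x, dtype=float) >= 0, 1, -1)

def _sign_change_sum(s, n, N):
    # Sum of |s(m) - s(m-1)| for m = n .. n+N-1.
    return int(np.sum(np.abs(s[n:n + N] - s[n - 1:n + N - 1])))

def zero_crossing_sum(x, n, N):
    """Z_n: each sign change between consecutive samples adds 2."""
    if n < 1:
        raise ValueError("n must be >= 1 so that x(m-1) exists")
    return _sign_change_sum(sgn(x), n, N)

def threshold_crossing_sum(x, n, N, T):
    """Z_ni: crossings of the levels +T and -T instead of zero."""
    if n < 1:
        raise ValueError("n must be >= 1 so that x(m-1) exists")
    x = np.asarray(x, dtype=float)
    return _sign_change_sum(sgn(x - T), n, N) + _sign_change_sum(sgn(x + T), n, N)

strong = [1, -1, 1, -1, 1, 1]
weak = [0.1, -0.1, 0.1, -0.1, 0.1, 0.1]
```

In this example the low-amplitude signal `weak` has the same zero-crossing sum as `strong`, but it never crosses the levels ±0.5, so its thresholded count is zero.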

6 Time domain parameters (3)
3.2.3 Short-time auto-correlation function
– $R_w(k) = \sum_{m=0}^{N-k-1} x_w(m)\,x_w(m+k)$
– $R_w(k) = R_w(-k) = \sum_{m=k}^{N-1} x_w(m)\,x_w(m-k)$
– $R_w(k) = 0$ for $|k| > N-1$
– $R_w(0) = \sum_{m=0}^{N-1} x_w^2(m) \ge R_w(k)$
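A short sketch of $R_w(k)$ for a single windowed frame, written to mirror the properties listed above (symmetry in $k$, zero beyond lag $N-1$, maximum at lag 0). The function name is illustrative.

```python
import numpy as np

def short_time_autocorr(xw, k):
    """R_w(k) = sum_{m=0}^{N-|k|-1} x_w(m) x_w(m+|k|); zero for |k| > N-1."""
    xw = np.asarray(xw, dtype=float)
    N = len(xw)
    k = abs(k)                      # R_w(k) = R_w(-k)
    if k > N - 1:
        return 0.0
    return float(np.dot(xw[:N - k], xw[k:]))

xw = [1.0, 2.0, 3.0]
R = [short_time_autocorr(xw, k) for k in range(4)]   # lags 0, 1, 2, 3
```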

7 Time domain parameters (4)
3.2.4 Short-time frequency and power spectrum
– $X_w(e^{j\omega}) = \sum_{n=0}^{N-1} x_w(n)\,e^{-j\omega n}$ is the short-time frequency spectrum
– $|X_w(e^{j\omega})|^2$ is called the short-time power spectral density
– $|X_w(e^{j\omega})|^2 = \sum_{n=-N+1}^{N-1} R_w(n)\,e^{-j\omega n}$
– The short-time auto-correlation function and the short-time power spectrum form an important pair of parameters
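The relation between the power spectrum and the auto-correlation function can be checked numerically. The sketch below evaluates both sides at a few arbitrary frequencies for an arbitrary test frame; the frame values and frequency grid are assumptions chosen only for illustration.

```python
import numpy as np

xw = np.array([1.0, -2.0, 0.5, 3.0])        # arbitrary windowed frame
N = len(xw)
omega = np.linspace(0.0, np.pi, 5)          # arbitrary test frequencies

# Left side: |X_w(e^{jw})|^2 from the short-time spectrum.
n = np.arange(N)
X = np.array([np.sum(xw * np.exp(-1j * w * n)) for w in omega])
power = np.abs(X) ** 2

# Right side: sum over lags -(N-1) .. N-1 of R_w(lag) e^{-jw lag}.
lags = np.arange(-(N - 1), N)
R = np.correlate(xw, xw, mode="full")       # R_w at lags -(N-1) .. N-1
power_from_R = np.array([np.sum(R * np.exp(-1j * w * lags)) for w in omega]).real

print(np.allclose(power, power_from_R))
```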

8 Time domain parameters (5)
3.2.5 Short-time Average Magnitude Difference Function (AMDF)
– $r_w(k) = \sum_{m=0}^{N-k-1} |x_w(m+k) - x_w(m)|$
– The AMDF is computed with subtraction, addition and absolute-value operations, whereas the auto-correlation function requires addition and multiplication.
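A minimal sketch of the AMDF for one frame. The periodic test signal is an assumption used to show that the function drops to zero when the lag equals the period; the function name is illustrative.

```python
import numpy as np

def amdf(xw, k):
    """r_w(k) = sum_{m=0}^{N-k-1} |x_w(m+k) - x_w(m)| for 0 <= k <= N-1."""
    xw = np.asarray(xw, dtype=float)
    N = len(xw)
    if k < 0 or k > N - 1:
        raise ValueError("lag k must satisfy 0 <= k <= N-1")
    return float(np.sum(np.abs(xw[k:] - xw[:N - k])))

# Three periods of a signal with period 4 samples.
xw = [0, 1, 0, -1] * 3
r_half_period = amdf(xw, 2)   # lag of half a period: large value
r_period = amdf(xw, 4)        # lag of one period: zero
```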

9 3.3 S/U/V detection
S (silence), U (unvoiced) and V (voiced) are the three basic speech states.
S, U and V are random; they have different distributions (each close to a normal distribution).
For voiced, M is maximal and Z is minimal (20/160).
For unvoiced, Z is maximal (70/160) and M is intermediate.
For silence, M is minimal and Z is intermediate.

10 3.4 Endpoint detection
3.4.1 Double-threshold beginning detection
– Set two thresholds $T_h$ and $T_l$ for $E_n$ or $M_n$ to find the true starting and ending points; for unvoiced speech, $Z_n$ is used to distinguish the starting point from silence.
3.4.2 Multi-threshold zero-crossing beginning detection
– Set $T_1 < T_2 < T_3$; for every frame find the corresponding $Z_1$, $Z_2$, $Z_3$ and compute $Z = W_1 Z_1 + W_2 Z_2 + W_3 Z_3$
– If $Z > Z_0$ the frame is voiced, otherwise unvoiced

11 3.5 Pitch period ($T_p$) estimation (1)
3.5.1 Preprocessing
– 1. Center clipping:
  $y(n) = C[x(n)] = \begin{cases} x(n) - C_L, & x(n) > C_L \\ 0, & |x(n)| \le C_L \\ x(n) + C_L, & x(n) < -C_L \end{cases}$
– 2. Low-pass filter (900 Hz) with linear phase
– 3. Three-level clipping:
  $y'(n) = C'[y(n)] = \begin{cases} 1, & y(n) > 0 \\ 0, & y(n) = 0 \\ -1, & y(n) < 0 \end{cases}$
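The two clipping operations can be sketched as follows. The 900 Hz low-pass filter is deliberately left out, since the slides do not specify its design; the function names are illustrative.

```python
import numpy as np

def center_clip(x, CL):
    """C[x]: subtract CL above +CL, add CL below -CL, zero in between."""
    x = np.asarray(x, dtype=float)
    return np.where(x > CL, x - CL, np.where(x < -CL, x + CL, 0.0))

def three_level_clip(y):
    """C'[y]: 1 for y > 0, 0 for y = 0, -1 for y < 0."""
    return np.sign(np.asarray(y, dtype=float)).astype(int)

x = [-3.0, -1.0, 0.0, 0.5, 2.0]
y = center_clip(x, CL=1.0)
y3 = three_level_clip(y)
```

Here the samples with $|x(n)| \le C_L$ are set to zero, so only the peaks at -3.0 and 2.0 survive in `y` and `y3`.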

12 Pitch period ($T_p$) estimation (2)
3.5.2 Pitch detection by auto-correlation function
– 1. Apply 900 Hz low-pass filtering and discard the first 20 samples: $\{x(n)\} \rightarrow \{x'(n)\}$
– 2. $C_L = 0.68 \max\{x'(n)\}$
– 3. $y(n) = C[x'(n)]$, $20 < n < 300$; $y'(n) = C'[y(n)]$, $20 < n < 300$
– 4. $R(k) = \sum_n y(n)\,y'(n+k)$, $k = 0, 20, 21, \ldots, 150$

13 Pitch period ($T_p$) estimation (3)
– 5. $R_{\max} = \max\{R(20), \ldots, R(150)\}$
– 6. If $R_{\max} < 0.25\,R(0)$ then $T_p = 0$ (unvoiced), else $T_p = \arg\max_{20<k<150} R(k) \times T$ (voiced)
3.5.3 Pitch detection by average magnitude difference
– 1. Same as above: 900 Hz filtering
– 2. $r(k) = \sum_m |x'(n+m) - x'(n+m-k)| / 140$, $k = 21, 22, \ldots, 140$

14 Pitch period ($T_p$) estimation (3)
– 3. $T_p' = \arg\min_k r(k)$, $r_{\min} = \min_k r(k)$
– 4. Check: if $r_{\min} > a_1$, $T_p = 0$ (unvoiced); if $r_{\min}/M < a_1$, voiced ($M = \sum_n |x'(n)| / 280$)
– 5. Check: if $r_{\min}(T_p'/2)/M < a_2$, $T_p = T_p'/2$; else if $r_{\min}(T_p'/3)/M < a_2$, $T_p = T_p'/3$
– $a_i$ is determined by experimental statistics
– If there are $I$ frames and $p_i$ is the correct pitch estimate of frame $i$, then $a_2 = \min_i r_i(p_i)/M_i$
– $a_2 < a_1$ (for unvoiced)
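The decision logic of 3.5.3 might be sketched as below. Several points are assumptions rather than facts from the slides: the frame is taken to be 280 samples with each $r(k)$ averaged over its last 140 samples (inferred from the normalizers 140 and 280), the voicing test uses the normalized ratio $r_{\min}/M$ in both directions, the thresholds `a1` and `a2` are placeholder values (the slides say they come from experimental statistics), and the low-pass filter is omitted.

```python
import numpy as np

FRAME = 280   # assumed frame length, inferred from the /280 normalizer
HALF = 140    # assumed number of terms per lag, inferred from the /140 normalizer

def amdf_lag(x, k):
    """r(k): mean of |x(m) - x(m-k)| over the last HALF samples of the frame."""
    m = np.arange(FRAME - HALF, FRAME)
    return float(np.mean(np.abs(x[m] - x[m - k])))

def amdf_pitch(x, a1=0.5, a2=0.3):
    """Return the pitch period in samples, or 0 for an unvoiced frame.

    a1 and a2 are hypothetical thresholds, not values from the slides.
    """
    x = np.asarray(x, dtype=float)[:FRAME]
    if len(x) < FRAME:
        raise ValueError("frame must contain at least FRAME samples")
    M = float(np.mean(np.abs(x)))            # M = sum |x'(n)| / 280
    if M == 0.0:
        return 0                             # all-zero frame: treat as unvoiced
    lags = np.arange(21, 141)                # k = 21 .. 140
    r = np.array([amdf_lag(x, k) for k in lags])
    tp = int(lags[np.argmin(r)])             # T_p' = argmin_k r(k)
    if r.min() / M > a1:                     # assumed normalized voicing test
        return 0
    for divisor in (2, 3):                   # check T_p'/2, then T_p'/3
        cand = tp // divisor
        if amdf_lag(x, cand) / M < a2:
            return cand
    return tp

n = np.arange(FRAME)
voiced = np.sin(2.0 * np.pi * n / 50.0)                     # period of 50 samples
noise = np.random.default_rng(0).standard_normal(FRAME)     # noise-like frame
```

The sub-multiple check matters for the sinusoid because lags 50 and 100 both give a near-zero $r(k)$; if the minimum happens to fall on 100, the test at $T_p'/2$ brings the estimate back to 50.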

15 Pitch period ($T_p$) estimation (4)
3.5.4 Post-processing of pitch detection
– Smoothing by median filtering: $y(n) = \mathrm{median}\{x(n-L), \ldots, x(n+L)\}$
– Linear smoothing: $y(n) = \sum_{m=-L}^{L} x(n-m)\,w(m)$, with $\sum_{m=-L}^{L} w(m) = 1$
– Smoothing by dynamic programming: the pitch sequence $P_1, P_2, \ldots, P_N$ is to be smoothed
– Define the cost function ($B > 0$):
  $C(i,j) = \begin{cases} \left|\dfrac{P_i - P_j}{i - j}\right| - B, & i \ne j \\ -B, & i = j \end{cases}$

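16 Pitch period ($T_p$) estimation (5)
– $D(i)$ is the cost of the $i$-th step. The tracking steps are:
– 1. Set $i = 1$ and $D(j) = 0$ for $j = 1, \ldots, N$
– 2. Calculate $C(i,j)$ for $j = 1, \ldots, i$
– 3. $d(i,j) = D(j) + C(i,j)$ for $j = 1, \ldots, i$
– 4. Find the optimal path: $D(i) = \min_{j=1,\ldots,i} d(i,j)$, $J(i) = \arg\min_{j=1,\ldots,i} d(i,j)$
– 5. If $i < N$, set $i = i + 1$ and go to step 2
– 6. Smoothed result: $P_i = P_{J(i)}$, $i = 1, \ldots, N$
– $C(i,j)$ is the cost of replacing $P_i$ with $P_j$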
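The tracking steps above can be sketched as a direct, literal transcription. Two points are assumptions: the final replacement $P_i = P_{J(i)}$ is read as using the original (unsmoothed) values, and ties in the minimum are broken in favor of the smallest $j$. The function name and the example values of $B$ are hypothetical.

```python
import numpy as np

def dp_smooth(P, B):
    """Literal transcription of steps 1-6 using 0-based indices.

    P: sequence of pitch estimates, B > 0: constant in the cost C(i, j).
    Returns the smoothed sequence and the index map J.
    """
    P = np.asarray(P, dtype=float)
    N = len(P)
    D = np.zeros(N)                      # step 1: D(j) = 0 for all j
    J = np.zeros(N, dtype=int)
    for i in range(N):                   # steps 2-5, looping over i
        d = np.empty(i + 1)
        for j in range(i + 1):
            if i == j:
                C = -B                   # cost of keeping P_i
            else:
                C = abs((P[i] - P[j]) / (i - j)) - B
            d[j] = D[j] + C              # step 3
        J[i] = int(np.argmin(d))         # step 4 (first minimum on ties)
        D[i] = d[J[i]]
    return P[J], J                       # step 6: P_i = P_{J(i)}

pitch = [100, 100, 200, 100, 100]        # one outlier at index 2
smoothed_large_B, _ = dp_smooth(pitch, B=60)
smoothed_small_B, _ = dp_smooth(pitch, B=1)
```

In this example a large $B$ replaces the outlier with a neighboring value, while a small $B$ leaves the sequence unchanged, which suggests that $B$ controls how large a jump is tolerated.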

