1 Pitch Estimation. By Chih-Ti Shih, 12/11/2006

2 Objective
Determine the fundamental frequency of a speech waveform automatically.

3 Automatic Extraction of Fundamental Frequency: Methods
Cepstrum-based F0 determinator (CFD)
Harmonic product spectrum (HPS)
Feature-based F0 tracker (FBFT)
Parallel processing method (PP)
Integrated F0 tracking algorithm (IFTA)
Super resolution F0 determinator (SRFD)
CFD and HPS make use of frequency-domain representations of the speech signal. FBFT and PP produce fundamental frequency estimates by analyzing the waveform in the time domain. IFTA and SRFD use a waveform similarity metric based on a normalized cross-correlation coefficient.

4 eSRFD
eSRFD: enhanced super resolution F0 determinator. The SRFD uses a waveform similarity metric based on a normalized cross-correlation coefficient, exploiting the fact that the correlation of two adjacent segments is very high when they are spaced apart by a fundamental period or a multiple of it. The method quantifies the degree of similarity between two adjacent, non-overlapping intervals with (in principle) infinite time resolution by linear interpolation (Equation 1).
1. Pass the samples through a low-pass filter to simplify the temporal structure of the waveform.
2. Pass the sample frames through a silence detector; no further analysis is done for silent frames. A frame is classified as silent if the peak absolute amplitudes of its two segments sum to less than the threshold: max(|xN,min|, |xN,max|) + max(|yN,min|, |yN,max|) < Tsrfd.
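A minimal sketch of these two preprocessing steps, assuming Python with NumPy and SciPy; the filter order, cutoff frequency, and the silence threshold value are illustrative assumptions, not values given on the slides.

```python
import numpy as np
from scipy.signal import butter, lfilter

def lowpass(signal, fs, cutoff_hz=1000.0):
    """Step 1: simplify the temporal structure by removing energy above cutoff_hz."""
    b, a = butter(4, cutoff_hz / (fs / 2.0), btype="low")
    return lfilter(b, a, signal)

def is_silent(x_seg, y_seg, threshold):
    """Step 2: a frame is silent when the sum of the peak absolute amplitudes
    of its x and y segments stays below the threshold."""
    return np.max(np.abs(x_seg)) + np.max(np.abs(y_seg)) < threshold
```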

5 eSRFD
Each frame is subdivided into three consecutive segments, xn, yn, and zn. In the first frame of the sample xn is not fully defined, so that frame is classified as 'silent'. In the last frame of the sample yn and zn are not fully defined, so that frame is also classified as 'silent'.
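A small sketch of this segmentation: for a candidate period of n samples, three consecutive non-overlapping segments are taken around the analysis point t. The function name and the boundary handling are assumptions.

```python
def segments(signal, t, n):
    """Split the waveform around analysis point t into three consecutive,
    non-overlapping segments of n samples each: x (before t), y (from t),
    and z (after y). Returns None when a segment is not fully defined,
    mirroring the 'silent' classification of the first and last frames."""
    if t - n < 0 or t + 2 * n > len(signal):
        return None
    return signal[t - n:t], signal[t:t + n], signal[t + n:t + 2 * n]
```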

6 eSRFD

7 eSRFD
3. For each frame that passed the silence detector, the first normalized cross-correlation coefficient Px,y(n) is determined (Equation 2). Cross-correlation is used to measure the similarity of two signals; the normalized form preferred for feature-matching applications does not have a simple frequency-domain expression. The section over which Px,y is computed must contain at least two oscillations, i.e. four zero-crossings. L is a decimation factor used to reduce the computational load of the algorithm; if L is set too low, the calculation of the normalized cross-correlation coefficient becomes computationally expensive and time consuming.
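The normalized cross-correlation can be sketched as below; the decimation simply evaluates the sums of Equation 2 on every L-th sample. The default value of L is an assumption.

```python
import numpy as np

def rho(x, y, L=4):
    """Normalized cross-correlation of two equal-length segments, evaluated
    on every L-th sample (the decimation factor) to reduce computation."""
    xd, yd = np.asarray(x[::L], float), np.asarray(y[::L], float)
    denom = np.sqrt(np.dot(xd, xd) * np.dot(yd, yd))
    return float(np.dot(xd, yd) / denom) if denom > 0 else 0.0
```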

8 eSRFD
4. Candidate values of the fundamental period are obtained by locating the peaks of the normalized cross-correlation coefficient for which Px,y(n) exceeds a threshold Tsrfd. Px,y(n) measures the similarity of the two adjacent segments. If no candidates are found in the frame, the frame is classified as 'unvoiced'.
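A sketch of the candidate search, reusing segments() and rho() from the earlier sketches; the period range, the threshold value, and the simple local-maximum test are assumptions.

```python
def find_candidates(signal, t, n_min, n_max, threshold, L=4):
    """Step 4: evaluate Px,y(n) over the allowed period range and keep the
    local maxima that exceed the threshold Tsrfd as period candidates."""
    coeffs = {}
    for n in range(n_min, n_max + 1):
        seg = segments(signal, t, n)          # from the earlier sketch
        if seg is None:
            continue
        x, y, _ = seg
        coeffs[n] = rho(x, y, L)              # from the earlier sketch
    candidates = [n for n in coeffs
                  if coeffs[n] > threshold
                  and coeffs[n] >= coeffs.get(n - 1, -1.0)
                  and coeffs[n] >= coeffs.get(n + 1, -1.0)]
    return candidates, coeffs                 # empty list => frame is 'unvoiced'
```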

9 eSRFD
5. For each candidate in a voiced frame (Px,y(n) > Tsrfd), the second normalized cross-correlation coefficient Py,z(n) is determined; it measures the similarity between the current frame and the next frame.

10 eSRFD
6. Candidates for which both Px,y(n) and Py,z(n) exceed the threshold Tsrfd are given a score of 2; the others are given a score of 1. Note: if there are one or more candidates with a score of 2, all those with a score of 1 are removed from the list of candidates.
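A sketch of the scoring in steps 5 and 6, again reusing segments() and rho(); it computes Py,z(n) for each surviving candidate, assigns the scores, and drops score-1 candidates whenever a score-2 candidate exists.

```python
def score_candidates(signal, t, candidates, coeffs_xy, threshold, L=4):
    """Steps 5 and 6: compute Py,z(n) for each candidate, score 2 when both
    coefficients exceed Tsrfd (else 1), and discard score-1 candidates
    whenever at least one score-2 candidate exists."""
    scored = []
    for n in candidates:
        _, y, z = segments(signal, t, n)      # from the earlier sketch
        p_yz = rho(y, z, L)                   # similarity with the next frame
        score = 2 if (coeffs_xy[n] > threshold and p_yz > threshold) else 1
        scored.append((n, score))
    if any(s == 2 for _, s in scored):
        scored = [(n, s) for n, s in scored if s == 2]
    return scored
```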

11 eSRFD
If there is only one candidate, with a score of 1 or 2, that candidate is assumed to be the best estimate of the fundamental period of the frame. Otherwise, an optimal fundamental period is sought from the set of remaining candidates. The candidate at the end of this list represents a fundamental period nM, and the m-th candidate represents a period nm.

12 eSRFD
7. For each remaining candidate, calculate q(nm): a normalized cross-correlation coefficient between two sections of length nM spaced nm apart. q(nm) therefore measures the similarity between two sections of length nM that are nm samples apart.
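A sketch of q(nm), reusing rho() from above; it assumes the signal extends far enough past the analysis point for both sections of length nM.

```python
def q_coefficient(signal, t, n_m, n_M, L=4):
    """q(n_m): normalized cross-correlation between two sections of length
    n_M that start n_m samples apart (assumes the signal extends at least
    n_m + n_M samples past t)."""
    a = signal[t:t + n_M]
    b = signal[t + n_m:t + n_m + n_M]
    return rho(a, b, L)                       # rho() from the earlier sketch
```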

13 eSRFD

14 eSRFD
The first coefficient q(n1) is assumed to be the optimal value. If a subsequent q(nm) multiplied by 0.77 exceeds the current optimal value, that q(nm) becomes the new optimal value.
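A sketch of this selection rule, built on the q_coefficient() sketch above; the 0.77 factor is from the slide, everything else is an assumed scaffold.

```python
def select_period(signal, t, candidates, L=4):
    """Pick the optimal candidate: start from q(n_1) and let a later
    candidate take over only if 0.77 * q(n_m) beats the current optimum."""
    candidates = sorted(candidates)
    n_M = candidates[-1]                      # the longest candidate period
    best_n = candidates[0]
    best_q = q_coefficient(signal, t, best_n, n_M, L)
    for n_m in candidates[1:]:
        q_m = q_coefficient(signal, t, n_m, n_M, L)
        if q_m * 0.77 > best_q:
            best_n, best_q = n_m, q_m
    return best_n
```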

15 eSRFD
If only one candidate has a score of 1 and no candidate has a score of 2, the probability that the candidate correctly represents the true fundamental period of the frame is low. If the previous frame is 'unvoiced', the current F0 value is held and the decision depends on the next frame: if the next frame is also unvoiced, the current frame is classified as 'unvoiced'; otherwise the current frame is classified as 'voiced' and the held F0 value is taken as a good estimate for the frame.
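One way to express this deferred decision as code; the function and argument names are assumptions, and only the unvoiced-predecessor case is specified on the slide.

```python
def resolve_lone_candidate(prev_frame_voiced, next_frame_voiced):
    """Decision for a frame whose only candidate scored 1. When the previous
    frame is unvoiced, the held F0 is kept only if the next frame turns out
    to be voiced; a voiced predecessor is assumed to accept the candidate
    (the slide specifies only the unvoiced-predecessor case)."""
    if not prev_frame_voiced:
        return "voiced" if next_frame_voiced else "unvoiced"
    return "voiced"
```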

16 eSRFD
These changes reduce the occurrence of doubling and halving errors in the F0 contour; however, they increase the chance that voiced regions are misclassified as unvoiced.

17 eSRFD
8. Biasing is applied to the coefficients Px,y(n) and Py,z(n) for the values of n where the fundamental period of the new frame is expected to lie, if:
1. The two previous frames were 'voiced';
2. The F0 value of the previous frame is not being temporarily held;
3. The F0 of the previous frame is less than 7/4 of the current frame's F0 and greater than 5/8 of the current frame's F0.
However, biasing tends to increase the percentage of unvoiced regions of speech incorrectly classified as 'voiced'. To counteract this side effect, if the unbiased coefficient Px,y(n) does not exceed Tsrfd for the candidate believed to be the best estimate of the frame's fundamental period, the F0 value for that frame is held until the state of the next frame is known; if the next frame is classified as 'silent', the former frame is reclassified as 'silent'.
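A sketch of the biasing step. The three conditions follow the slide; the bias magnitude and the width of the lag window around the previous period are pure assumptions, since the slides do not state them.

```python
def should_bias(prev_two_frames_voiced, prev_f0_held, prev_f0, cur_f0):
    """The three conditions from the slide for applying the bias."""
    return (all(prev_two_frames_voiced)
            and not prev_f0_held
            and 5.0 / 8.0 * cur_f0 < prev_f0 < 7.0 / 4.0 * cur_f0)

def apply_bias(coeffs, prev_period, bias=1.1, halfwidth=2):
    """Scale Px,y(n) (and likewise Py,z(n)) at lags near the previous
    frame's fundamental period, where the new period is expected to lie.
    The bias factor and window halfwidth are assumed values."""
    biased = dict(coeffs)
    for n in range(prev_period - halfwidth, prev_period + halfwidth + 1):
        if n in biased:
            biased[n] *= bias
    return biased
```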

18 eSRFD
9. The fundamental period for the frame is refined by calculating rx,y(n) for n in the region -L < n < L about the chosen candidate. The maximum within this range corresponds to a more accurate value of the fundamental period.
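A sketch of this refinement step, reusing segments() and rho(); it recomputes the coefficient without decimation at each lag within -L < n < L of the chosen candidate and keeps the best one. The slides do not show a final interpolation, so none is attempted here.

```python
def refine_period(signal, t, n0, L=4):
    """Step 9: recompute the undecimated coefficient rx,y(n) for every lag
    within -L < n - n0 < L and keep the lag that maximizes it."""
    best_n, best_r = n0, float("-inf")
    for offset in range(-L + 1, L):
        seg = segments(signal, t, n0 + offset)   # from the earlier sketch
        if seg is None:
            continue
        x, y, _ = seg
        r = rho(x, y, L=1)                       # L=1: full resolution here
        if r > best_r:
            best_n, best_r = n0 + offset, r
    return best_n
```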

19 Comparison of asynchronous frequency contours
Compare Fx, generated from the laryngograph, with the F0 contours generated by eSRFD. Fx,reference refers to the reference value from the laryngograph; F0 refers to the value from eSRFD. The contour generated from the laryngograph may be inaccurate at the ends of voiced speech segments, where the area of vocal-fold contact is too small for glottal activity to be distinguished from noise in the laryngograph signal even though the speech is periodic and low in energy. Such errors extend over only two or three Fx cycles and are therefore deemed negligible in this study.

20 Comparison of asynchronous frequency contours
Fx,reference and F0 are both zero: both describe a silent or unvoiced region of the utterance, and no error results.
F0 is non-zero but Fx,reference is zero: the region is incorrectly classified as voiced by eSRFD.
Fx,reference is non-zero but F0 is zero: the voiced region is incorrectly classified as unvoiced by eSRFD.
Fx,reference and F0 are both non-zero: both correctly classify the region as voiced; in this case the ratio of F0 to Fx,reference is calculated.
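The four cases can be sketched as a simple per-frame comparison; a frequency of zero marks a silent or unvoiced frame in either contour.

```python
def compare_frame(fx_ref, f0_est):
    """Per-frame comparison of the laryngograph reference and the eSRFD
    estimate; zero means silent/unvoiced."""
    if fx_ref == 0 and f0_est == 0:
        return ("agree_unvoiced", None)       # no error results
    if fx_ref == 0:
        return ("false_voiced", None)         # wrongly classified as voiced
    if f0_est == 0:
        return ("false_unvoiced", None)       # wrongly classified as unvoiced
    return ("both_voiced", f0_est / fx_ref)   # ratio used for accuracy analysis
```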

21 Gross errors (halving and doubling) and acceptable accuracy
The 20% threshold of acceptance is chosen because all FDAs are expected to produce an F0 value within this range, with due consideration of time-quantization errors and the finite frequency resolution of the analysis technique.
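A sketch of how each voiced-frame ratio might be categorized. The 20% acceptance band is from the slide; exactly how the halving and doubling bands are drawn here is an assumption.

```python
def classify_ratio(ratio, tolerance=0.20):
    """Categorize the F0 / Fx ratio of a frame that both contours call voiced."""
    if abs(ratio - 1.0) <= tolerance:
        return "acceptable"                   # within 20% of the reference
    if abs(ratio - 0.5) <= 0.5 * tolerance:   # near half the reference
        return "halving_error"
    if abs(ratio - 2.0) <= 2.0 * tolerance:   # near twice the reference
        return "doubling_error"
    return "other_gross_error"
```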

22 Comparison of asynchronous frequency contours
Female (contour figure)

23 Comparison of asynchronous frequency contours
Female: "Where can I park my car" (contour figure)

24 Comparison of asynchronous frequency contours
Male (contour figure)

25 Comparison of asynchronous frequency contours
Male: "Where can I park my car" (contour figure)

26 Comparison of asynchronous frequency contours
Laryngograph vs. eSRFD: the durations of unvoiced or silent regions classified in error, and the durations of voiced sections incorrectly classified as unvoiced or silent by the FDA, are accumulated over all the utterances in the database for each speaker.

27 Comparison of asynchronous frequency contours
Female (results)

28 Comparison of asynchronous frequency contours
Male (results)

29 Questions?

