1 Pitch Estimation. By Chih-Ti Shih, 12/11/2006

2 Objective
Determine the fundamental frequency of a speech waveform automatically.

3 Automatic Extraction of Fundamental Frequency: Methods
Cepstrum-based F0 determinator (CFD)
Harmonic product spectrum (HPS)
Feature-based F0 tracker (FBFT)
Parallel processing method (PP)
Integrated F0 tracking algorithm (IFTA)
Super resolution F0 determinator (SRFD)
CFD and HPS make use of frequency-domain representations of the speech signal. FBFT and PP produce fundamental frequency estimates by analyzing the waveform in the time domain. IFTA and SRFD use a waveform similarity metric based on a normalized cross-correlation coefficient.

4 eSRFD
eSRFD: enhanced super resolution F0 determinator. The SRFD uses a waveform similarity metric based on a normalized cross-correlation coefficient, exploiting the fact that the correlation of two adjacent segments is very high when they are spaced apart by a fundamental period or a multiple of it. The method quantifies the degree of similarity between two adjacent, non-overlapping intervals with (in principle) infinite time resolution by linear interpolation (Equation 1).
1. Pass the samples through a low-pass filter to simplify the temporal structure of the waveform.
2. Pass the sample frames through a silence detector; no further analysis is done for silent frames. A frame is classified as silent if the peak absolute amplitudes of its two segments sum to less than the threshold: max(|xN,min|, |xN,max|) + max(|yN,min|, |yN,max|) < Tsrfd.
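A minimal sketch of these two preprocessing steps, assuming Python with NumPy and SciPy; the filter order, cutoff frequency, and the silence threshold value are illustrative assumptions, not values given on the slides.

```python
import numpy as np
from scipy.signal import butter, lfilter

def lowpass(signal, fs, cutoff_hz=1000.0):
    """Step 1: simplify the temporal structure by removing energy above cutoff_hz."""
    b, a = butter(4, cutoff_hz / (fs / 2.0), btype="low")
    return lfilter(b, a, signal)

def is_silent(x_seg, y_seg, threshold):
    """Step 2: a frame is silent when the sum of the peak absolute amplitudes
    of its x and y segments stays below the threshold."""
    return np.max(np.abs(x_seg)) + np.max(np.abs(y_seg)) < threshold
```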

5 eSRFD
Each frame is subdivided into three consecutive segments, xn, yn, and zn. In the first frame of the sample xn is not fully defined, so that frame is classified as 'silent'. In the last frame of the sample yn and zn are not fully defined, so that frame is also classified as 'silent'.
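A small sketch of this segmentation: for a candidate period of n samples, three consecutive non-overlapping segments are taken around the analysis point t. The function name and the boundary handling are assumptions.

```python
def segments(signal, t, n):
    """Split the waveform around analysis point t into three consecutive,
    non-overlapping segments of n samples each: x (before t), y (from t),
    and z (after y). Returns None when a segment is not fully defined,
    mirroring the 'silent' classification of the first and last frames."""
    if t - n < 0 or t + 2 * n > len(signal):
        return None
    return signal[t - n:t], signal[t:t + n], signal[t + n:t + 2 * n]
```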

6 eSRFD

7 eSRFD
3. For each frame that passed the silence detector, the first normalized cross-correlation coefficient Px,y(n) is determined (Equation 2). Cross-correlation is used to measure the similarity of two signals; the normalized form preferred for feature-matching applications does not have a simple frequency-domain expression. The section over which Px,y is computed must contain at least two oscillations, i.e. four zero-crossings. L is a decimation factor used to reduce the computational load of the algorithm; if L is set too low, the calculation of the normalized cross-correlation coefficient becomes computationally expensive and time consuming.
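The normalized cross-correlation can be sketched as below; the decimation simply evaluates the sums of Equation 2 on every L-th sample. The default value of L is an assumption.

```python
import numpy as np

def rho(x, y, L=4):
    """Normalized cross-correlation of two equal-length segments, evaluated
    on every L-th sample (the decimation factor) to reduce computation."""
    xd, yd = np.asarray(x[::L], float), np.asarray(y[::L], float)
    denom = np.sqrt(np.dot(xd, xd) * np.dot(yd, yd))
    return float(np.dot(xd, yd) / denom) if denom > 0 else 0.0
```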

8 eSRFD
4. Candidate values of the fundamental period are obtained by locating the peaks of the normalized cross-correlation coefficient for which Px,y(n) exceeds a threshold Tsrfd. Px,y(n) measures the similarity of the two adjacent segments. If no candidates are found in the frame, the frame is classified as 'unvoiced'.
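A sketch of the candidate search, reusing segments() and rho() from the earlier sketches; the period range, the threshold value, and the simple local-maximum test are assumptions.

```python
def find_candidates(signal, t, n_min, n_max, threshold, L=4):
    """Step 4: evaluate Px,y(n) over the allowed period range and keep the
    local maxima that exceed the threshold Tsrfd as period candidates."""
    coeffs = {}
    for n in range(n_min, n_max + 1):
        seg = segments(signal, t, n)          # from the earlier sketch
        if seg is None:
            continue
        x, y, _ = seg
        coeffs[n] = rho(x, y, L)              # from the earlier sketch
    candidates = [n for n in coeffs
                  if coeffs[n] > threshold
                  and coeffs[n] >= coeffs.get(n - 1, -1.0)
                  and coeffs[n] >= coeffs.get(n + 1, -1.0)]
    return candidates, coeffs                 # empty list => frame is 'unvoiced'
```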

9 eSRFD
5. For each candidate in a voiced frame (Px,y(n) > Tsrfd), the second normalized cross-correlation coefficient Py,z(n) is determined; it measures the similarity between the current frame and the next frame.

10 eSRFD
6. Candidates for which both Px,y(n) and Py,z(n) exceed the threshold Tsrfd are given a score of 2; the others are given a score of 1. Note: if there are one or more candidates with a score of 2, all those with a score of 1 are removed from the list of candidates.
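A sketch of the scoring in steps 5 and 6, again reusing segments() and rho(); it computes Py,z(n) for each surviving candidate, assigns the scores, and drops score-1 candidates whenever a score-2 candidate exists.

```python
def score_candidates(signal, t, candidates, coeffs_xy, threshold, L=4):
    """Steps 5 and 6: compute Py,z(n) for each candidate, score 2 when both
    coefficients exceed Tsrfd (else 1), and discard score-1 candidates
    whenever at least one score-2 candidate exists."""
    scored = []
    for n in candidates:
        _, y, z = segments(signal, t, n)      # from the earlier sketch
        p_yz = rho(y, z, L)                   # similarity with the next frame
        score = 2 if (coeffs_xy[n] > threshold and p_yz > threshold) else 1
        scored.append((n, score))
    if any(s == 2 for _, s in scored):
        scored = [(n, s) for n, s in scored if s == 2]
    return scored
```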

11 eSRFD
If there is only one candidate, with a score of 1 or 2, that candidate is assumed to be the best estimate of the fundamental period of the frame. Otherwise, an optimal fundamental period is sought from the set of remaining candidates. The candidate at the end of this list represents a fundamental period nM, and the m-th candidate represents a period nm.

12 eSRFD
7. For each remaining candidate, calculate q(nm): a normalized cross-correlation coefficient between two sections of length nM spaced nm apart. q(nm) therefore measures the similarity between two sections of length nM that are nm samples apart.
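A sketch of q(nm), reusing rho() from above; it assumes the signal extends far enough past the analysis point for both sections of length nM.

```python
def q_coefficient(signal, t, n_m, n_M, L=4):
    """q(n_m): normalized cross-correlation between two sections of length
    n_M that start n_m samples apart (assumes the signal extends at least
    n_m + n_M samples past t)."""
    a = signal[t:t + n_M]
    b = signal[t + n_m:t + n_m + n_M]
    return rho(a, b, L)                       # rho() from the earlier sketch
```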

13 eSRFD

14 eSRFD
The first coefficient q(n1) is assumed to be the optimal value. If a subsequent q(nm) multiplied by 0.77 exceeds the current optimal value, that q(nm) becomes the new optimal value.
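A sketch of this selection rule, built on the q_coefficient() sketch above; the 0.77 factor is from the slide, everything else is an assumed scaffold.

```python
def select_period(signal, t, candidates, L=4):
    """Pick the optimal candidate: start from q(n_1) and let a later
    candidate take over only if 0.77 * q(n_m) beats the current optimum."""
    candidates = sorted(candidates)
    n_M = candidates[-1]                      # the longest candidate period
    best_n = candidates[0]
    best_q = q_coefficient(signal, t, best_n, n_M, L)
    for n_m in candidates[1:]:
        q_m = q_coefficient(signal, t, n_m, n_M, L)
        if q_m * 0.77 > best_q:
            best_n, best_q = n_m, q_m
    return best_n
```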

15 eSRFD
If only one candidate has a score of 1 and no candidate has a score of 2, the probability that the candidate correctly represents the true fundamental period of the frame is low. If the previous frame is 'unvoiced', the current F0 value is held and the decision depends on the next frame: if the next frame is also unvoiced, the current frame is classified as 'unvoiced'; otherwise the current frame is classified as 'voiced' and the held F0 value is taken as a good estimate for the frame.
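One way to express this deferred decision as code; the function and argument names are assumptions, and only the unvoiced-predecessor case is specified on the slide.

```python
def resolve_lone_candidate(prev_frame_voiced, next_frame_voiced):
    """Decision for a frame whose only candidate scored 1. When the previous
    frame is unvoiced, the held F0 is kept only if the next frame turns out
    to be voiced; a voiced predecessor is assumed to accept the candidate
    (the slide specifies only the unvoiced-predecessor case)."""
    if not prev_frame_voiced:
        return "voiced" if next_frame_voiced else "unvoiced"
    return "voiced"
```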

16 eSRFD
These changes reduce the occurrence of doubling and halving errors in the F0 contour; however, they increase the chance that voiced regions are misclassified as unvoiced.

17 eSRFD
8. Biasing is applied to the coefficients Px,y(n) and Py,z(n) for the values of n where the fundamental period of the new frame is expected to lie, if:
1. The two previous frames were 'voiced';
2. The F0 value of the previous frame is not being temporarily held;
3. The F0 of the previous frame is less than 7/4 of the current frame's F0 and greater than 5/8 of the current frame's F0.
However, biasing tends to increase the percentage of unvoiced regions of speech incorrectly classified as 'voiced'. To counteract this side effect, if the unbiased coefficient Px,y(n) does not exceed Tsrfd for the candidate believed to be the best estimate of the frame's fundamental period, the F0 value for that frame is held until the state of the next frame is known; if the next frame is classified as 'silent', the former frame is reclassified as 'silent'.
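A sketch of the biasing step. The three conditions follow the slide; the bias magnitude and the width of the lag window around the previous period are pure assumptions, since the slides do not state them.

```python
def should_bias(prev_two_frames_voiced, prev_f0_held, prev_f0, cur_f0):
    """The three conditions from the slide for applying the bias."""
    return (all(prev_two_frames_voiced)
            and not prev_f0_held
            and 5.0 / 8.0 * cur_f0 < prev_f0 < 7.0 / 4.0 * cur_f0)

def apply_bias(coeffs, prev_period, bias=1.1, halfwidth=2):
    """Scale Px,y(n) (and likewise Py,z(n)) at lags near the previous
    frame's fundamental period, where the new period is expected to lie.
    The bias factor and window halfwidth are assumed values."""
    biased = dict(coeffs)
    for n in range(prev_period - halfwidth, prev_period + halfwidth + 1):
        if n in biased:
            biased[n] *= bias
    return biased
```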

18 eSRFD
9. The fundamental period for the frame is refined by calculating rx,y(n) for n in the region -L < n < L about the chosen candidate. The maximum within this range corresponds to a more accurate value of the fundamental period.
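A sketch of this refinement step, reusing segments() and rho(); it recomputes the coefficient without decimation at each lag within -L < n < L of the chosen candidate and keeps the best one. The slides do not show a final interpolation, so none is attempted here.

```python
def refine_period(signal, t, n0, L=4):
    """Step 9: recompute the undecimated coefficient rx,y(n) for every lag
    within -L < n - n0 < L and keep the lag that maximizes it."""
    best_n, best_r = n0, float("-inf")
    for offset in range(-L + 1, L):
        seg = segments(signal, t, n0 + offset)   # from the earlier sketch
        if seg is None:
            continue
        x, y, _ = seg
        r = rho(x, y, L=1)                       # L=1: full resolution here
        if r > best_r:
            best_n, best_r = n0 + offset, r
    return best_n
```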

19 Comparison of asynchronous frequency contours
Compare Fx, generated from the laryngograph, with the F0 contours generated by eSRFD. Fx,reference refers to the reference value from the laryngograph; F0 refers to the value from eSRFD. The contour generated from the laryngograph may be inaccurate at the ends of voiced speech segments, where the area of vocal-fold contact is too small for glottal activity to be distinguished from noise in the laryngograph signal even though the speech is periodic and low in energy. Such errors extend over only two or three Fx cycles and are therefore deemed negligible in this study.

20 Comparison of asynchronous frequency contours
Fx,reference and F0 are both zero: both describe a silent or unvoiced region of the utterance, and no error results.
F0 is non-zero but Fx,reference is zero: the region is incorrectly classified as voiced by eSRFD.
Fx,reference is non-zero but F0 is zero: the voiced region is incorrectly classified as unvoiced by eSRFD.
Fx,reference and F0 are both non-zero: both correctly classify the region as voiced; in this case the ratio of F0 to Fx,reference is calculated.
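The four cases can be sketched as a simple per-frame comparison; a frequency of zero marks a silent or unvoiced frame in either contour.

```python
def compare_frame(fx_ref, f0_est):
    """Per-frame comparison of the laryngograph reference and the eSRFD
    estimate; zero means silent/unvoiced."""
    if fx_ref == 0 and f0_est == 0:
        return ("agree_unvoiced", None)       # no error results
    if fx_ref == 0:
        return ("false_voiced", None)         # wrongly classified as voiced
    if f0_est == 0:
        return ("false_unvoiced", None)       # wrongly classified as unvoiced
    return ("both_voiced", f0_est / fx_ref)   # ratio used for accuracy analysis
```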

21 Gross errors (halving and doubling) and acceptable accuracy
The 20% threshold of acceptance is chosen because all FDAs are expected to produce an F0 value within this range, with due consideration of time-quantization errors and the finite frequency resolution of the analysis technique.
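A sketch of how each voiced-frame ratio might be categorized. The 20% acceptance band is from the slide; exactly how the halving and doubling bands are drawn here is an assumption.

```python
def classify_ratio(ratio, tolerance=0.20):
    """Categorize the F0 / Fx ratio of a frame that both contours call voiced."""
    if abs(ratio - 1.0) <= tolerance:
        return "acceptable"                   # within 20% of the reference
    if abs(ratio - 0.5) <= 0.5 * tolerance:   # near half the reference
        return "halving_error"
    if abs(ratio - 2.0) <= 2.0 * tolerance:   # near twice the reference
        return "doubling_error"
    return "other_gross_error"
```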

22 Comparison of asynchronous frequency contours
Female (contour figure)

23 Comparison of asynchronous frequency contours
Female: "Where can I park my car" (contour figure)

24 Comparison of asynchronous frequency contours
Male (contour figure)

25 Comparison of asynchronous frequency contours
Male: "Where can I park my car" (contour figure)

26 Comparison of asynchronous frequency contours
Laryngograph vs. eSRFD: the durations of unvoiced or silent regions classified in error, and the durations of voiced sections incorrectly classified as unvoiced or silent by the FDA, are accumulated over all the utterances in the database for each speaker.

27 Comparison of asynchronous frequency contours
Female (results)

28 Comparison of asynchronous frequency contours
Male (results)

29 Questions?

