Pitch Estimation
Chih-Ti Shih, 12/11/2006

Objective
Determine the fundamental frequency (F0) of a speech waveform automatically.

Automatic Extraction of Fundamental Frequency: Methods
Cepstrum-based F0 determinator (CFD)
Harmonic product spectrum (HPS)
Feature-based F0 tracker (FBFT)
Parallel processing method (PP)
Integrated F0 tracking algorithm (IFTA)
Super resolution F0 determinator (SRFD)
CFD and HPS make use of frequency-domain representations of the speech signal. FBFT and PP produce fundamental frequency estimates by analyzing the waveform in the time domain. IFTA and SRFD use a waveform similarity metric based on a normalized cross-correlation coefficient.

eSRFD: Enhanced Super Resolution F0 Determinator
The SRFD uses a waveform similarity metric based on a normalized cross-correlation coefficient. It exploits the observation that the correlation between two adjacent segments is very high when they are spaced one fundamental period, or a multiple of it, apart. The method quantifies the degree of similarity between two adjacent, non-overlapping intervals, obtaining effectively infinite time resolution by linear interpolation (Equation 1).
1. Pass the samples through a low-pass filter to simplify the temporal structure of the waveform.
2. Pass the sample frames through a silence detector to identify unvoiced frames; no analysis is performed on unvoiced frames. A frame is classified as silent if the peak magnitudes of the segments x_n and y_n (defined on the next slide) are jointly too small:
max(|x_min|, |x_max|) + max(|y_min|, |y_max|) < T_srfd

eSRFD
Each frame is subdivided into three consecutive segments: x_n, y_n, and z_n. In the first frame of the sample, x_n is not fully defined, so the frame is classified as 'silent'. In the last frame of the sample, y_n and z_n are not fully defined, so that frame is also classified as 'silent'.
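As a rough illustration of steps 1-2 and the segmentation above, here is a minimal Python sketch. The segment length, the threshold value, and the exact silence measure are not specified on the slides, so equal-length segments and a caller-supplied T_srfd are assumptions:

```python
import numpy as np

def split_segments(frame):
    """Split a frame into three consecutive, equal-length segments x, y, z
    (equal lengths are an assumption; the slides do not give the lengths)."""
    n = len(frame) // 3
    return frame[:n], frame[n:2 * n], frame[2 * n:3 * n]

def is_silent(x, y, t_srfd):
    """Silence test from step 2: peak magnitude of x plus peak magnitude
    of y must fall below the threshold T_srfd."""
    return np.max(np.abs(x)) + np.max(np.abs(y)) < t_srfd
```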


eSRFD
3. For each 'voiced' frame, the first normalized cross-correlation coefficient P_{x,y}(n) of the frame is determined (Equation 2). The section being correlated must contain at least two full oscillations, i.e. at least four zero-crossings. L is a decimation factor used to reduce the computational load of the algorithm; if L is set too low, the calculation of the normalized cross-correlation coefficient becomes computationally expensive and time consuming. Cross-correlation measures the similarity of two signals; the normalized form preferred for feature-matching applications does not have a simple frequency-domain expression, so it is computed directly in the time domain.
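The slide references Equation 2 without reproducing it; a common form of the normalized cross-correlation between two adjacent, non-overlapping segments, with the decimation factor L applied to the lag grid, might be sketched as follows (the default L = 4 is illustrative only):

```python
import numpy as np

def norm_xcorr(s, n):
    """Normalized cross-correlation between two adjacent, non-overlapping
    segments of length n at the start of s (one possible reading of Eq. 2)."""
    if len(s) < 2 * n:
        return 0.0
    x = np.asarray(s[:n], dtype=float)
    y = np.asarray(s[n:2 * n], dtype=float)
    denom = np.sqrt(np.sum(x * x) * np.sum(y * y))
    return float(np.sum(x * y) / denom) if denom > 0 else 0.0

def p_xy(s, n_min, n_max, L=4):
    """P_{x,y}(n) evaluated on a lag grid decimated by the factor L."""
    return {n: norm_xcorr(s, n) for n in range(n_min, n_max + 1, L)}
```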

eSRFD
4. Candidate values of the fundamental period are obtained by locating the peaks of the normalized cross-correlation coefficient for which the value P_{x,y}(n) exceeds a threshold T_srfd. P_{x,y}(n) measures the similarity of the two adjacent segments. If no candidates are found in the frame, the frame is classified as 'unvoiced'.
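A sketch of the candidate search in step 4, assuming P_{x,y}(n) is available as a dictionary of lag/coefficient pairs as in the previous sketch:

```python
def pick_candidates(p, t_srfd):
    """Return lags n that are local maxima of P_{x,y}(n) and exceed T_srfd."""
    lags = sorted(p)
    cands = []
    for i in range(1, len(lags) - 1):
        n = lags[i]
        if p[n] > t_srfd and p[n] >= p[lags[i - 1]] and p[n] >= p[lags[i + 1]]:
            cands.append(n)
    return cands  # empty list => frame classified 'unvoiced'
```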

eSRFD
5. For voiced frames (P_{x,y}(n) > T_srfd), the second normalized cross-correlation coefficient P_{y,z}(n) is determined. P_{y,z}(n) measures the similarity between the current frame and the next frame.

eSRFD
6. Candidates for which both P_{x,y}(n) and P_{y,z}(n) exceed the threshold T_srfd are given a score of 2; all others are given a score of 1. Note: if there are one or more candidates with a score of 2, then all those with a score of 1 are removed from the list of candidates.
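The scoring rule in step 6 could be sketched as follows, assuming both coefficient sets are evaluated on the same lag grid:

```python
def score_candidates(cands, p_xy, p_yz, t_srfd):
    """Score 2 if both coefficients exceed T_srfd, else 1; keep only the
    score-2 candidates whenever at least one exists."""
    scored = [(n, 2 if p_xy[n] > t_srfd and p_yz[n] > t_srfd else 1)
              for n in cands]
    if any(score == 2 for _, score in scored):
        scored = [(n, score) for n, score in scored if score == 2]
    return scored
```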

eSRFD
If there is only one candidate, with a score of 1 or 2, that candidate is assumed to be the best estimate of the fundamental period of the frame. Otherwise, an optimal fundamental period is sought from the set of remaining candidates. The candidate at the end of this list represents a fundamental period n_M, and the m'th candidate represents a period n_m.

eSRFD
7. Then calculate q(n_m), a normalized cross-correlation coefficient between two sections of length n_M spaced n_m apart; q(n_m) measures the similarity of two sections of length n_M at a lag of n_m.
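The defining equation for q is not preserved in the transcript; under the usual normalized cross-correlation form it might be sketched as:

```python
import numpy as np

def q_coeff(s, n_m, n_M):
    """Normalized cross-correlation between two sections of length n_M
    spaced n_m apart (the exact formula is not reproduced on the slide)."""
    a = np.asarray(s[:n_M], dtype=float)
    b = np.asarray(s[n_m:n_m + n_M], dtype=float)
    denom = np.sqrt(np.sum(a * a) * np.sum(b * b))
    return float(np.sum(a * b) / denom) if denom > 0 else 0.0
```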


eSRFD
The first coefficient, q(n_1), is initially assumed to be the optimal value. If a subsequent coefficient satisfies q(n_m) * 0.77 > the current optimal value, that q(n_m) becomes the new optimal value; the 0.77 factor means a later candidate must be markedly better before it replaces the current choice.
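A sketch of the 0.77 selection rule, reusing q_coeff from the previous sketch and assuming the candidate list is ordered from shortest to longest period:

```python
def select_period(candidates, s):
    """Apply the 0.77 rule: start from q(n_1) and replace the running
    optimum only when a later candidate is markedly better."""
    n_M = candidates[-1]                  # longest candidate period
    best_n = candidates[0]
    best_q = q_coeff(s, best_n, n_M)
    for n_m in candidates[1:]:
        q = q_coeff(s, n_m, n_M)
        if q * 0.77 > best_q:             # must beat the optimum by ~30%
            best_n, best_q = n_m, q
    return best_n
```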

eSRFD
If only one candidate scores 1 and no candidate scores 2, the probability that the candidate correctly represents the true fundamental period of the frame is low. In that case:
If the previous frame is 'unvoiced', the current value is held and the decision depends on the next frame.
If the next frame is also unvoiced, the current frame is considered 'unvoiced'.
Otherwise, the current frame is considered 'voiced' and the currently held F0 is taken as the estimate for the current frame.
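One possible reading of this hold-and-defer rule as code (the exact state handling is an assumption):

```python
def resolve_held_frame(next_frame_voiced, held_f0):
    """Deferred decision for a frame whose single candidate scored only 1
    and whose previous frame was unvoiced."""
    if next_frame_voiced:
        return 'voiced', held_f0   # the held F0 becomes the estimate
    return 'unvoiced', 0.0
```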

eSRFD
These changes reduce the occurrence of doubling and halving in the F0 contour. However, they increase the chance that a voiced region is misclassified as unvoiced.

eSRFD
8. Biasing is applied to the coefficients P_{x,y}(n) and P_{y,z}(n) for values of n where the fundamental period of the new frame is expected to lie, provided that:
1. The two previous frames were 'voiced'.
2. The F0 value of the previous frame is not being temporarily held.
3. The F0 of the previous frame is less than 7/4 of the current frame's F0 and greater than 5/8 of it.
However, the biasing tends to increase the percentage of unvoiced regions of speech being incorrectly classified as 'voiced'. To reduce this side effect, if the unbiased coefficient P_{x,y}(n) does not exceed T_srfd for the candidate believed to be the best estimate of the frame's fundamental period, the F0 value for that frame is held until the state of the next frame is known; if the next frame is classified as 'silent', the current frame is re-classified as 'silent'.
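A sketch of the biasing step; the slides give the three gating conditions but not the bias shape, width, or gain, so those are purely illustrative here:

```python
def apply_bias(p, n_expected, prev_two_voiced, f0_held, f0_prev, f0_curr,
               width=2, gain=1.1):
    """Boost coefficients near the expected fundamental period when the
    three conditions above hold. Width and gain are NOT from the slides."""
    ok = (prev_two_voiced and not f0_held
          and 5 / 8 * f0_curr < f0_prev < 7 / 4 * f0_curr)
    if not ok:
        return p
    return {n: v * gain if abs(n - n_expected) <= width else v
            for n, v in p.items()}
```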

eSRFD
9. The fundamental period for the frame is refined by calculating r_{x,y}(n) for n in the region n_best - L < n < n_best + L, i.e. at full resolution within one decimation step of the selected candidate. The maximum within this range corresponds to a more accurate value of the fundamental period.
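A sketch of this final refinement, reusing norm_xcorr from the earlier sketch as a stand-in for r_{x,y}(n):

```python
def refine_period(s, n_best, L):
    """Full-resolution search within one decimation step of the selected
    candidate; the lag with the highest coefficient is the refined period."""
    lags = range(max(2, n_best - L + 1), n_best + L)
    return max(lags, key=lambda n: norm_xcorr(s, n))
```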

Comparison of asynchronous frequency contours
Compare Fx, the contour generated from the laryngograph, with the F0 contours generated by eSRFD. Fx_reference refers to the reference value from the laryngograph; F0 refers to the value from eSRFD. The laryngograph contour may be inaccurate at the ends of voiced speech segments, where a small area of vocal-fold contact is insufficient for glottal activity to be distinguished from noise in the laryngograph signal even though the speech is periodic and low in energy. Such errors extend over only two or three Fx cycles and are thus deemed negligible in this study.

Comparison of asynchronous frequency contours
Fx_reference and F0 are both zero: both describe a silent or unvoiced region of the utterance, and no error results.
F0 is non-zero but Fx_reference is zero: the region is incorrectly classified as voiced by eSRFD.
Fx_reference is non-zero but F0 is zero: the voiced region is incorrectly classified as unvoiced by eSRFD.
Fx_reference and F0 are both non-zero: both correctly classify the region as voiced. In this case, calculate the ratio of F0 to Fx_reference.
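The four cases can be summarized in a small per-frame comparison function:

```python
def compare_frame(fx_ref, f0):
    """Per-frame comparison of the laryngograph reference and eSRFD output.
    Returns a label for the three disagreement/agreement cases, or the
    F0/Fx_reference ratio when both are voiced."""
    if fx_ref == 0 and f0 == 0:
        return 'agree_unvoiced'
    if fx_ref == 0:
        return 'unvoiced_marked_voiced'
    if f0 == 0:
        return 'voiced_marked_unvoiced'
    return f0 / fx_ref   # both voiced: ratio is scored on the next slide
```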

Gross Error
[Figure: ratio regions labelled 'halving error', 'doubling error', and 'acceptable accuracy'.]
The 20% threshold of acceptance is chosen because all FDAs are expected to produce an F0 value within this range, with due consideration of time-quantization errors and the finite frequency resolution of the analysis technique.
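The slides state the 20% acceptance threshold but not the exact boundaries used for halving and doubling errors; one plausible convention is:

```python
def classify_ratio(ratio, tol=0.20):
    """Classify the F0/Fx_reference ratio. The 20% band around 1.0 is from
    the slide; the bands around 0.5 and 2.0 are assumed conventions."""
    if abs(ratio - 1.0) <= tol:
        return 'acceptable'
    if abs(ratio - 0.5) <= tol * 0.5:   # 20% relative band around 1/2
        return 'halving_error'
    if abs(ratio - 2.0) <= tol * 2.0:   # 20% relative band around 2
        return 'doubling_error'
    return 'gross_error'
```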

Comparison of asynchronous frequency contours
[Figure: F0 contours, female speaker]

Comparison of asynchronous frequency contours
[Figure: "Where can I park my car" (female)]

Comparison of asynchronous frequency contours
[Figure: F0 contours, male speaker]

Comparison of asynchronous frequency contours
[Figure: "Where can I park my car" (male)]

Comparison of asynchronous frequency contours
[Figure: laryngograph vs. eSRFD contours]
The durations of unvoiced or silent regions classified in error, and the durations of voiced sections incorrectly classified as unvoiced or silent by the FDA, are accumulated over all the utterances in the database for each speaker.

Comparison of asynchronous frequency contours
[Figure: results for the female speaker]

Comparison of asynchronous frequency contours
[Figure: results for the male speaker]

Questions?