Multipitch Tracking for Noisy Speech

Multipitch Tracking for Noisy Speech DeLiang Wang The Ohio State University, U.S.A. Joint work with Mingyang Wu (The Ohio State University) and Guy Brown (University of Sheffield, U.K.)

What is Pitch? “The attribute of auditory sensation in terms of which sounds may be ordered on a musical scale.” (American Standards Association) Periodic sound: musical tone, vowel, voiced speech. Aperiodic sound with pitch sensation: e.g. comb-filtered noise

Pitch of a Periodic Sound. Figure labels: fundamental frequency (period); pitch frequency (period) d.

Applications of Pitch Tracking: Computational Auditory Scene Analysis (CASA); automatic music transcription; speech coding, analysis, speaker verification, and language identification.

Categories of Pitch Determination Algorithms (PDAs): time-domain algorithms, frequency-domain algorithms, and time-frequency domain algorithms.

Time-domain PDAs

Frequency-domain PDAs

Time-frequency Domain PDAs. Diagram: acoustic input -> filterbank -> periodicity analysis in each channel -> integration across channels -> pitch estimates.

Pitch Determination Algorithms. Numerous PDAs have been proposed; see, for example, Hess (1983), Hermes (1992), and de Cheveigné & Kawahara (2002). Many PDAs are designed to detect a single pitch in noisy speech. Some PDAs can track more than one pitch contour, but their performance is limited when tracking speech mixed with broadband interference.

PDAs for Multipitch in Noisy Environments. Diagram: speech, speech, and noise enter the PDA, which outputs pitch tracks.

Diagram of the Proposed Model: Speech/Interference -> Cochlear Filtering -> Normalized Correlogram -> Channel/Peak Selection -> Channel Integration -> Pitch Tracking Using HMM -> Continuous Pitch Tracks.

Gammatone Filterbank to Model Cochlear Filtering
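As a rough illustration of what one channel of a gammatone filterbank looks like, the sketch below samples the standard gammatone impulse response, using the widely used Glasberg and Moore ERB approximation for the bandwidth. The center frequency, sample rate, duration, and filter order are assumptions chosen for the example and are not taken from the talk.

```python
import math

def erb_bandwidth(center_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg & Moore approximation)."""
    return 24.7 * (4.37 * center_hz / 1000.0 + 1.0)

def gammatone_impulse_response(center_hz, sample_rate, duration_s=0.025, order=4):
    """Sample g(t) = t^(order-1) * exp(-2*pi*b*t) * cos(2*pi*f*t)."""
    bandwidth = 1.019 * erb_bandwidth(center_hz)
    n_samples = int(round(duration_s * sample_rate))
    response = []
    for n in range(n_samples):
        t = n / sample_rate
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * bandwidth * t)
        response.append(envelope * math.cos(2.0 * math.pi * center_hz * t))
    return response

# Assumed example values: one channel centered at 1 kHz, 16 kHz sampling.
ir = gammatone_impulse_response(center_hz=1000.0, sample_rate=16000)
```

A full filterbank would repeat this for a set of center frequencies and convolve the input with each response.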

Multi-channel Front-end. Diagram: Speech/Interference -> gammatone filterbank -> channels separated at 800 Hz into low-frequency channels and high-frequency channels, with envelope extraction applied to the high-frequency channels.

Periodicity Extraction. Figure: normalized correlogram of the response to clean speech (axes: frequency channels vs. delay).
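A minimal sketch of periodicity extraction in a single channel, assuming one common form of normalization (dividing by the geometric mean of the two window energies). The talk does not spell out its exact normalization, and the window length, lag range, and synthetic input are hypothetical.

```python
import math

def normalized_autocorrelation(signal, start, window, max_lag):
    """Normalized autocorrelation of one channel for lags 0..max_lag."""
    values = []
    for lag in range(max_lag + 1):
        num = energy_a = energy_b = 0.0
        for n in range(window):
            a = signal[start + n]
            b = signal[start + n + lag]
            num += a * b
            energy_a += a * a
            energy_b += b * b
        denom = math.sqrt(energy_a * energy_b)
        values.append(num / denom if denom > 0.0 else 0.0)
    return values

# Synthetic stand-in for a channel response: a sinusoid with a 40-sample period.
period = 40
signal = [math.sin(2.0 * math.pi * n / period) for n in range(400)]
acf = normalized_autocorrelation(signal, start=0, window=160, max_lag=100)

# The strongest peak in a plausible lag range falls at the period.
best_lag = max(range(20, 61), key=lambda lag: acf[lag])
```

Stacking such curves for every frequency channel gives one time frame of a correlogram.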

Second Stage of the Model: Channel/Peak Selection (model diagram repeated).

Channel and Peak Selection for Reducing Noise Interference Some channels are masked by interference and provide corrupting information on periodicity. These corrupted channels are excluded from pitch determination. Different strategies are used for selecting valid channels in low- and high-frequency ranges.

Selection of a Low-frequency Channel. In a clean channel, peaks at non-zero delays are close to one, whereas in a corrupted channel these peaks are relatively low. Figure panels: clean channel; corrupted channel (x-axis: lag in delay steps).

Selection of a High-frequency Channel. In a clean channel, the normalized correlogram within the original time window and that within a longer time window have similar patterns, whereas in a corrupted channel the patterns are dissimilar. Further peak selection is performed in a high-frequency channel. Figure panels: clean channel; corrupted channel (x-axis: lag in delay steps).

Summary Correlogram of Selected Channels. Figure panels: all channels; only selected channels (x-axis: lag in delay steps).

Summary Correlogram of Selected Channels with Selected Peaks. Figure panels: without peak selection; with peak selection (x-axis: lag in delay steps).
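The low-frequency criterion can be sketched as a simple peak test on a channel's normalized autocorrelation. The threshold of 0.9 and the toy autocorrelation values below are assumptions for illustration; the talk only says that clean-channel peaks are "close to one".

```python
def nonzero_lag_peaks(acf):
    """Return (lag, value) pairs for local maxima at lags greater than zero."""
    return [(lag, acf[lag]) for lag in range(1, len(acf) - 1)
            if acf[lag] > acf[lag - 1] and acf[lag] >= acf[lag + 1]]

def is_clean_channel(acf, threshold=0.9):
    """Keep a channel only if its highest non-zero-lag peak exceeds the threshold."""
    peaks = nonzero_lag_peaks(acf)
    return bool(peaks) and max(value for _, value in peaks) > threshold

# Hypothetical normalized autocorrelations (index = lag in delay steps).
clean = [1.0, 0.6, 0.1, 0.5, 0.97, 0.5, 0.1]
corrupted = [1.0, 0.3, 0.1, 0.2, 0.45, 0.2, 0.1]

keep_clean = is_clean_channel(clean)
keep_corrupted = is_clean_channel(corrupted)
```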

Third Stage of the Model: Channel Integration (model diagram repeated).

Integration of Periodicity Information Across Channels How does a frequency channel contribute to a pitch-period hypothesis? How to integrate the contributions from different channels?

Peaks and Pitch Delay. Figure labels: ideal pitch delay; peak delay; relative time lag.

Relative Time Lag Statistics Histogram of relative time lags from natural speech

Relative Time Lag Statistics Estimated probability distribution of relative time lags (sum of Laplacian and uniform distributions)
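A sketch of a density built as a weighted sum of a Laplacian and a uniform distribution, which is the form named on the slide. The scale, mixing weight, and uniform support are hypothetical placeholders, not the values estimated in the talk.

```python
import math

def relative_lag_pdf(delta, scale=0.5, uniform_weight=0.1, half_width=10.0):
    """Mixture density: (1 - w) * Laplace(0, scale) + w * Uniform(-half_width, half_width)."""
    laplace = math.exp(-abs(delta) / scale) / (2.0 * scale)
    uniform = 1.0 / (2.0 * half_width) if abs(delta) <= half_width else 0.0
    return (1.0 - uniform_weight) * laplace + uniform_weight * uniform

# The Laplacian term gives a sharp peak at zero lag; the uniform term keeps
# a small floor of probability for peaks far from the ideal pitch delay.
at_peak = relative_lag_pdf(0.0)
in_tail = relative_lag_pdf(5.0)
```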

Observation Probability in One Channel. Figure: normalized correlogram of channel 29 and the corresponding p(channel | pitch delay).

Channel Combination. Step 1: take the product of the observation probabilities of all channels in a time frame. Step 2: flatten the product probability. The responses of different channels are usually correlated, and this step corrects the resulting probability overshoot.

Integrated Observation Probability Distribution (1 Pitch). Figure axes: log(probability) vs. pitch delay.

Integrated Observation Probability Distribution (2 Pitches). Figure axes: pitch delay 1 vs. pitch delay 2. The colors indicate the likelihood (log(probability)) of pitch hypotheses. Big red spots represent the most likely pitch hypotheses. The identified pitch periods for this time frame are 52 and 123.
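The two steps can be sketched in the log domain. The flattening used here, a fractional power (root) of the product, is an assumption: the slide does not give the formula, and the root value and per-channel probabilities are made up for the example.

```python
import math

def combine_channels(channel_probs, flatten_root=5.0):
    """Step 1: product over channels (as a sum of logs). Step 2: flatten by a root."""
    log_product = sum(math.log(p) for p in channel_probs)
    return math.exp(log_product / flatten_root)

# Hypothetical per-channel observation probabilities for two pitch hypotheses.
hypothesis_a = [0.9] * 10
hypothesis_b = [0.5] * 10

# Without flattening, ten correlated channels exaggerate the preference for A.
raw_ratio = (combine_channels(hypothesis_a, flatten_root=1.0)
             / combine_channels(hypothesis_b, flatten_root=1.0))
flat_ratio = combine_channels(hypothesis_a) / combine_channels(hypothesis_b)
```

Under these assumed numbers the raw product favors hypothesis A by a factor of several hundred, while the flattened version favors it by a factor of about three, which is the kind of overshoot correction the slide describes.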

Fourth Stage of the Model: Pitch Tracking Using HMM (model diagram repeated).

Prediction and Posterior Probabilities. Assuming pitch period d for time frame t-1, the figure shows the prior probabilities for time frame t, the observation probabilities for time frame t, and the posterior probabilities for time frame t (x-axis: pitch period d).

Pitch Change Statistics in Consecutive Time Frames. The statistics are consistent with the pitch declination phenomenon in natural speech.

Hidden Markov Model as Tracking Mechanism. Diagram: for each time frame, a hidden state in the pitch state space generates the observed signal through the observation probability, and pitch dynamics link the states of consecutive frames. The Viterbi algorithm is used to find the optimal sequence of pitch states.
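One way to read this slide is as a standard predict-and-update step: the prior for frame t follows from the pitch dynamics applied to frame t-1, and the posterior is the normalized product of prior and observation probability. The three-state pitch space and all probabilities below are invented for illustration.

```python
def predict_and_update(prev_posterior, transition, observation):
    """Return (prior, posterior) for frame t given the posterior at frame t-1."""
    n_states = len(observation)
    prior = [sum(prev_posterior[i] * transition[i][j]
                 for i in range(len(prev_posterior)))
             for j in range(n_states)]
    unnormalized = [prior[j] * observation[j] for j in range(n_states)]
    total = sum(unnormalized)
    return prior, [u / total for u in unnormalized]

# Hypothetical pitch dynamics: small pitch changes are much more likely.
transition = [[0.8, 0.2, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]]
prev_posterior = [0.0, 1.0, 0.0]   # pitch period known at frame t-1
observation = [0.1, 0.3, 0.6]      # observation favors the third state

prior, posterior = predict_and_update(prev_posterior, transition, observation)
most_likely_state = max(range(3), key=lambda s: posterior[s])
```

Here the observation alone would pick the third state, but the prior from the previous frame keeps the second state most likely.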
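A compact Viterbi sketch over a toy pitch state space, written in the log domain. The three states, the transition matrix, and the observation probabilities are hypothetical; the model in the talk uses its own state space and trained statistics.

```python
import math

def viterbi(log_initial, log_transition, log_observation):
    """Return the most likely state sequence for a sequence of frames."""
    n_states = len(log_initial)
    score = [log_initial[s] + log_observation[0][s] for s in range(n_states)]
    backpointers = []
    for frame in log_observation[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log_transition[p][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_transition[best_prev][s] + frame[s])
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

def logs(row):
    return [math.log(x) for x in row]

# Assumed toy model; transitions use small non-zero values to keep logs finite.
initial = logs([1 / 3, 1 / 3, 1 / 3])
transition = [logs(r) for r in [[0.90, 0.09, 0.01],
                                [0.05, 0.90, 0.05],
                                [0.01, 0.09, 0.90]]]
observation = [logs(r) for r in [[0.8, 0.1, 0.1],
                                 [0.7, 0.2, 0.1],
                                 [0.3, 0.3, 0.4],   # noisy frame
                                 [0.8, 0.1, 0.1]]]

path = viterbi(initial, transition, observation)
framewise = [max(range(3), key=lambda s: frame[s]) for frame in observation]
```

Picking the best state independently in each frame jumps to another state in the noisy third frame, whereas the Viterbi path stays on a continuous track.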

Results. The system is tested on mixtures of 10 speech utterances and 10 interferences (Cooke, 1993). The interferences are a 1 kHz tone, white noise, noise bursts, "cocktail party" noise, rock music, a siren, a trill telephone, and two female and one male utterances of speech.

A Male Utterance and White Noise (SNR = –2 dB). Figure panels: Tolonen & Karjalainen (2000); our algorithm (axes: pitch period (ms) vs. time (s)).

A Male Utterance and White Noise (cont.). Figure panels: Gu & Bokhoven (1991); revised Gu & Bokhoven (1991) (axes: pitch period (ms) vs. time (s)).

A Male Utterance and White Noise (cont.). Figure: a single pitch tracker by Rouat, Liu & Morissette (1997) (axes: pitch period (ms) vs. time (s)).

Simultaneous Utterances of a Male and a Female Speaker. Figure panels: our algorithm; Tolonen & Karjalainen (2000) (axes: pitch period (ms) vs. time (s)).

Simultaneous Utterances of a Male and a Female Speaker (cont.). Figure panels: Gu & Bokhoven (1991); revised Gu & Bokhoven (1991) (axes: pitch period (ms) vs. time (s)).

Categorization of Interference Signals

Error Rates (in Percentage) for Category 1 Interference

Error Rates (in Percentage) for Category 2 Interference

Error Rates (in Percentage) for Category 3 Interference

A CASA Application Demo. Audio examples: original mixture; segregated male utterance using a correlogram-based pitch tracker (Wang & Brown, 1999); segregated utterance using our algorithm.

Conclusion. An improved channel/peak selection method reduces noise interference. A statistical integration method effectively uses the periodicity information across all channels. An HMM models continuous pitch tracks. Our algorithm performs reliably when tracking single and double pitch tracks in noisy acoustic environments, and it outperforms the other algorithms by a substantial margin.