A Hidden Markov Model Framework for Multi-target Tracking DeLiang Wang Perception & Neurodynamics Lab Ohio State University.

2 Outline
- Problem statement
- Multipitch tracking in noisy speech
- Multipitch tracking in reverberant environments
- Binaural tracking of moving sound sources
- Discussion & conclusion

3 Multi-target tracking problem
- Multi-target tracking is the problem of detecting multiple targets of interest over time, with each target being dynamic (time-varying) in nature
- The input to a multi-target tracking system is a sequence of observations, often noisy
- Multi-target tracking occurs in many domains, including radar/sonar applications, surveillance, and acoustic analysis

4 Approaches to the problem
- Statistical signal processing has been heavily employed for the multi-target tracking problem
- In a very broad sense, statistical methods can be viewed as Bayesian tracking or filtering:
  - A prior distribution describing the state of the dynamic targets
  - A likelihood (observation) function describing state-dependent sensor measurements, or observations
  - A posterior distribution describing the state given the observations; this is the output of the tracker, computed by combining the prior and the likelihood

5 Kalman filter
- Perhaps the most widely used approach for tracking is the Kalman filter
- For linear state and observation models with Gaussian perturbations, the Kalman filter gives a recursive estimate of the state sequence that is optimal in the least-squares sense
- The Kalman filter can be viewed as a Bayesian tracker
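The recursive estimate mentioned above consists of a predict step and an update step. The following is a minimal generic sketch of one such cycle, not the specific trackers discussed later in the talk; all matrices are supplied by the caller.

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One predict-update cycle of a linear Kalman filter.

    x, P : previous state estimate and its covariance
    z    : new observation
    F, H : state-transition and observation matrices
    Q, R : process and observation noise covariances
    """
    # Predict: propagate the state and its uncertainty forward in time
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: correct the prediction with the new observation
    S = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```

Calling `kalman_step` once per observation yields the recursive least-squares-optimal estimate under the linear-Gaussian assumptions stated above.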

6 General Bayesian tracking
- When the assumptions of the Kalman filter are not satisfied, a more general framework is needed
- For multiple targets, multiple-hypothesis tracking or unified tracking can be formulated in the Bayesian framework (Stone et al.'99)
- Such general formulations, however, require an exponential number of evaluations and are hence computationally infeasible
- Approximation and hypothesis-pruning techniques are necessary in order to make use of these methods

7 Domain of acoustic signal processing
- Domain knowledge can provide powerful constraints on the general problem of multi-target tracking
- We consider the domain of acoustic/auditory signal processing, in particular:
  - Multipitch tracking in noisy environments
  - Multiple moving-source tracking
- In this domain, the hidden Markov model (HMM) is a dominant framework, thanks to its remarkable success in automatic speech recognition

8 HMM for multi-target tracking
- We have explored and developed a novel HMM framework for multi-target tracking, applied to the problems of pitch and moving-sound tracking (Wu et al., IEEE T-SAP'03; Roman & Wang, IEEE T-ASLP'08; Jin & Wang, OSU Tech. Rep.'09)
- Let us first consider the problem of multipitch tracking

What is pitch?
- "The attribute of auditory sensation in terms of which sounds may be ordered on a musical scale." (American Standards Association)
- Periodic sounds: pure tones, voiced speech (vowels, voiced consonants), music
- Aperiodic sounds with a pitch sensation, e.g. comb-filtered noise

Pitch of a periodic signal
(Figure: a periodic waveform with fundamental period d, illustrating the fundamental frequency/period and the corresponding pitch frequency/period)

Applications of pitch tracking
- Computational auditory scene analysis (CASA)
- Source separation in general
- Automatic music transcription
- Speech coding and analysis, speaker recognition, and language identification

Existing pitch tracking algorithms
- Numerous pitch tracking, or pitch determination, algorithms (PDAs) have been proposed (Hess'83; de Cheveigne'06):
  - Time-domain
  - Frequency-domain
  - Time-frequency domain
- Most PDAs are designed to detect a single pitch in noisy speech
- Some PDAs are able to track two simultaneous pitch contours; however, their performance is limited in the presence of broadband interference

Multipitch tracking in noisy environments
(Diagram: voiced signals plus background noise enter the multipitch tracker, which produces the output pitch tracks)

Diagram of Wu et al.'03
(Block diagram: speech/interference → cochlear filtering → normalized correlogram → channel selection and channel integration → HMM-based multipitch tracking → continuous pitch tracks)

Periodicity extraction using the correlogram
(Figure: normalized correlogram in response to clean speech; frequency channels, from low to high frequency, plotted against delay)

Channel selection
- Some frequency channels are masked by interference and provide corrupting information about periodicity; these corrupted channels are excluded from pitch determination (Rouat et al.'97)
- Different strategies are used for selecting valid channels in the low- and high-frequency ranges

HMM formulation
(Block diagram repeated from the Wu et al.'03 slide, now focusing on the HMM-based multipitch tracking stage)

18 Pitch state space
- The state space of pitch is neither a discrete nor a continuous space in the traditional sense, but a mix of the two (Tokuda et al.'99)
- Considering up to two simultaneous pitch contours, we model the pitch state space as a union of three subspaces:
  - The zero-pitch subspace, which is an empty set
  - The one-pitch subspace
  - The two-pitch subspace
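In a discrete implementation, the union of the three subspaces described above can simply be enumerated once pitch periods are quantized to time lags. The sketch below is illustrative only: the lag range (roughly 80-320 Hz at a 16 kHz sampling rate) is an assumption, not the grid used in the talk.

```python
from itertools import combinations

# Candidate pitch periods quantized to discrete time lags (an assumed range).
PITCH_LAGS = list(range(50, 201))

def enumerate_pitch_states():
    """Enumerate the union of the three pitch subspaces.

    Each state is a sorted tuple of pitch lags:
    () is the zero-pitch state, (d1,) a one-pitch state,
    and (d1, d2) a two-pitch state.
    """
    states = [()]                                  # zero-pitch subspace
    states += [(d,) for d in PITCH_LAGS]           # one-pitch subspace
    states += list(combinations(PITCH_LAGS, 2))    # two-pitch subspace
    return states
```

Representing states as variable-length tuples keeps the mixed discrete/continuous structure explicit: the subspace is identified by the tuple length.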

19 How to interpret the correlogram probabilistically?
- The correlogram dominates the modeling of pitch perception (Licklider'51), and is commonly used in pitch detection
- We examine the relative time lag between the true pitch period and the lag of the closest correlogram peak
(Figure: true pitch delay d and the delay l of the closest peak)

Relative time-lag statistics
(Figure: histogram of relative time lags from natural speech for one channel)

21 Modeling relative time lags
- From the histogram data, we find that a mixture of a Laplacian and a uniform distribution is appropriate, with a partition coefficient q
- The Laplacian models a pitch event, and the uniform models "background noise"
- The parameters are estimated by maximum likelihood from a small corpus of clean speech utterances
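The mixture described above can be written as q times a Laplacian plus (1 - q) times a uniform density. The following sketch evaluates that density; the parameter values used in the test are illustrative, not the ML estimates from the talk's clean-speech corpus.

```python
import numpy as np

def relative_lag_density(delta, q, lam, support):
    """Mixture density for the relative time lag delta.

    q       : partition coefficient (weight of the Laplacian)
    lam     : Laplacian scale, modeling a pitch event
    support : width of the uniform component, modeling background noise
    """
    laplacian = np.exp(-np.abs(delta) / lam) / (2.0 * lam)
    uniform = 1.0 / support
    return q * laplacian + (1.0 - q) * uniform
```

Given a histogram of relative lags, q and lam could be fitted by maximum likelihood, matching the estimation procedure described on the slide.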

Modeling relative time-lag statistics
(Figure: estimated probability distribution of the relative time lag, a Laplacian plus a uniform distribution)

23 One-pitch hypothesis
- First consider the one-pitch state subspace
- For a given channel c, consider the set of correlogram peaks
- If c is not selected, the probability of background noise is assigned

24 One-channel observation probability
(Figure: normalized correlogram for one channel)

Integration of channel observation probabilities
- How do we integrate the observation probabilities of individual channels into a frame-level probability? Modeling the joint probability is computationally prohibitive. Instead:
  - First, we assume channel independence and take the product of the observation probabilities of all channels
  - Then, we flatten (smooth) the product probability to account for the correlated responses of different channels, i.e. to correct the probability-overshoot phenomenon (Hand & Hu'01)
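The two steps above, product under an independence assumption followed by flattening, can be sketched as follows; the flattening exponent alpha is an illustrative value, not the one tuned in the actual system.

```python
import numpy as np

def integrate_channels(channel_probs, alpha=0.3):
    """Combine per-channel observation probabilities into a frame-level one.

    Channels are first treated as independent (product of probabilities),
    then the product is flattened by an exponent alpha < 1 to compensate
    for correlated channel responses (the probability-overshoot correction).
    """
    log_product = np.sum(np.log(channel_probs))  # independence assumption
    return np.exp(alpha * log_product)           # flattening / smoothing
```

Working in the log domain avoids underflow when many channels are multiplied, and makes the flattening a simple scaling of the log probability.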

26 Two-pitch hypothesis
- Next consider the two-pitch state subspace
- If the channel energy is dominated by one source, with pitch period d₁, the channel is evaluated as in the one-pitch case for d₁
- The relative time-lag distribution used here is estimated from two-pitch frames

27 Two-pitch hypothesis (cont.)
- By a similar channel-integration scheme, we finally obtain the two-pitch observation probability
- This gives the larger of the two one-pitch likelihoods, assuming that either d₁ or d₂ dominates the channel
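The dominance assumption above reduces, per channel, to taking the larger of two one-pitch evaluations. This is a simplified sketch of that idea; `channel_prob` stands in for any one-pitch channel likelihood function and is a hypothetical placeholder, not the talk's actual formula.

```python
def two_pitch_channel_prob(channel_prob, d1, d2):
    """Channel likelihood under the two-pitch hypothesis.

    Evaluate the one-pitch likelihood assuming the channel is dominated
    by the source with period d1, then by the one with period d2, and
    keep the larger of the two values.
    """
    return max(channel_prob(d1), channel_prob(d2))
```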

28 Two-pitch integrated observation probability
(Figure: integrated observation probability as a function of pitch delay 1 and pitch delay 2)

29 Zero-pitch hypothesis
- Finally, consider the zero-pitch state subspace
- We simply assign it a constant likelihood

30 HMM tracking
(Diagram: the pitch state space unfolded over time; within one time frame, the observed signal yields the observation probability, and pitch dynamics link consecutive frames)

31 Prior (prediction) and posterior probabilities
(Figure: assuming pitch period d at time frame m-1, the prior probability for frame m is combined with the observation probability for frame m to yield the posterior probability for frame m)
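One prediction-update step over a discretized pitch state space can be sketched as the generic forward recursion below. This illustrates how prior, observation, and posterior combine at one frame; the actual system decodes whole state sequences with the Viterbi algorithm rather than filtering frame by frame.

```python
import numpy as np

def forward_update(posterior_prev, transition, observation):
    """One step of recursive Bayesian tracking over discrete states.

    posterior_prev : posterior over states at frame m-1
    transition     : transition[i, j] = P(state j at m | state i at m-1)
    observation    : observation likelihood of each state at frame m
    Returns the normalized posterior over states at frame m.
    """
    prior = transition.T @ posterior_prev   # prediction from frame m-1
    posterior = prior * observation         # combine prior with likelihood
    return posterior / posterior.sum()      # normalize
```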

Transition probabilities
- Transition probabilities consist of two parts:
  - Jump probabilities between pitch subspaces
  - Pitch dynamics within the same subspace
- Jump probabilities are again estimated from the same small corpus of speech utterances
- They need not be accurate, as long as the diagonal values are high

33 Pitch dynamics in consecutive time frames
- Pitch continuity is best modeled by a Laplacian distribution
- The derived distribution is consistent with the pitch-declination phenomenon in natural speech (Nooteboom'97)
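A Laplacian over the change in pitch period, centered slightly above the previous period, captures both continuity and declination: falling pitch means a slowly rising period. The scale and drift values below are illustrative assumptions, not the parameters fitted in the talk.

```python
import numpy as np

def pitch_transition_prob(d_prev, d_next, lam=2.4, drift=0.4):
    """Transition density for consecutive pitch periods (same subspace).

    Modeled as a Laplacian centered at d_prev + drift, so a small increase
    in period (i.e. a small drop in pitch) is most likely, consistent with
    pitch declination. lam is the Laplacian scale; both are in lag samples.
    """
    return np.exp(-np.abs(d_next - (d_prev + drift)) / lam) / (2.0 * lam)
```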

Search and efficient implementation
- The Viterbi algorithm is used to find the optimal sequence of pitch states
- To further improve computational efficiency, we employ:
  - Pruning: search only in a neighborhood of the previous pitch point
  - Beam search: keep only a limited number of the most probable state sequences
  - Searching for pitch periods only near local peaks
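The search described above can be sketched as a beam-pruned Viterbi pass over a discrete state grid. This is a simplified illustration of the beam-search idea only; the neighborhood pruning and peak-restricted search from the slide are omitted, and the beam width is arbitrary.

```python
import numpy as np

def viterbi_beam(log_obs, log_trans, beam=5):
    """Viterbi search over T frames and N discrete states with beam pruning.

    log_obs   : (T, N) log observation probabilities
    log_trans : (N, N) log transition probabilities
    Only the `beam` most probable states of the previous frame are kept
    as candidate predecessors at each step.
    """
    T, N = log_obs.shape
    scores = log_obs[0].copy()
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        keep = np.argsort(scores)[-beam:]   # prune to the beam
        cand = np.full(N, -np.inf)
        for j in range(N):
            prev = keep[np.argmax(scores[keep] + log_trans[keep, j])]
            back[t, j] = prev
            cand[j] = scores[prev] + log_trans[prev, j] + log_obs[t, j]
        scores = cand
    # Backtrack the best state sequence from the final frame
    path = [int(np.argmax(scores))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```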

Evaluation results
- The Wu et al. algorithm was originally evaluated on mixtures of 10 speech utterances and 10 interferences (Cooke'93), the latter spanning a variety of broadband noise, speech, music, and environmental sounds
- The system generates good results, substantially better than alternative systems
- The performance has been confirmed by subsequent evaluations by others using different corpora

Example 1: Speech and white noise
(Figure: pitch period (ms) versus time (s), comparing Tolonen & Karjalainen'00 and Wu et al.'03)

Example 2: Two utterances
(Figure: pitch period (ms) versus time (s), comparing Tolonen & Karjalainen'00 and Wu et al.'03)

38 Outline
- Problem statement
- Multipitch tracking in noisy speech
- Multipitch tracking in reverberant environments
- Binaural tracking of moving sound sources
- Discussion & conclusion

Multipitch tracking for reverberant speech
- Room reverberation degrades harmonic structure, making pitch tracking harder
(Figure: mixture of two anechoic utterances and the corresponding reverberant mixture)

What is the pitch of a reverberant speech signal?
- A laryngograph provides ground-truth pitch for anechoic speech; however, it does not account for the fundamental alteration of the signal by room reverberation
- True to the definition of signal periodicity, and considering the use of pitch for speech segregation, we suggest tracking the fundamental frequency of the quasi-periodic reverberant signal itself, rather than that of the corresponding anechoic signal (Jin & Wang'09)
- We use a semi-automatic pitch-labeling technique (McGonegal et al.'75) to generate reference pitch by examining the waveform, autocorrelation, and cepstrum

41 HMM for multipitch tracking in reverberation
- We have recently applied the HMM framework of Wu et al.'03 to reverberant environments (Jin & Wang'09)
- The following changes are made to account for reverberation effects:
  - A new channel-selection method based on cross-channel correlation
  - The observation probability is formulated from a pitch-saliency measure, rather than the relative time-lag distribution, which is very sensitive to reverberation
- These changes result in a simpler HMM model!
- Evaluation and comparison with Wu et al.'03 and Klapuri'08 show that this system is robust to reverberation and gives better performance

Two-utterance example
(Figure: upper row, Wu et al.'03; lower row, Jin & Wang'09; reverberation time 0.0 s (left), 0.3 s (middle), 0.6 s (right))

43 Outline
- Problem statement
- Multipitch tracking in noisy speech
- Multipitch tracking in reverberant environments
- Binaural tracking of moving sound sources
- Discussion & conclusion

44 HMM for binaural tracking of moving sources
- Binaural cues (observations) are ITD (interaural time difference) and IID (interaural intensity difference)
- The HMM framework is similar to that of Wu et al.'03 (Roman & Wang, 2008)

45 Likelihood in one-source subspace
- Joint distribution of ITD-IID deviations for one channel
(Figure: actual ITD versus reference ITD)

46 Three-source illustration and comparison
(Figure: tracking of three moving sources, compared with the Kalman filter output)

47 Summary of moving-source tracking
- The HMM framework automatically provides the number of active sources at any given time
- Compared to a Kalman-filter approach, the HMM approach produces more accurate tracking
- Localization of multiple stationary sources is a special case
- The proposed HMM model represents the first CASA study addressing moving sound sources

48 General discussion
- The HMM framework for multi-target tracking is a form of Bayesian inference (tracking) broader than Kalman filtering:
  - It permits nonlinearity and non-Gaussianity
  - It yields the number of active targets at all times
  - Corpus-based training for parameter estimation
  - Efficient search
- Our work has investigated up to two (pitch) or three (moving-source) target tracks in the presence of noise
- Extension to more than three targets is theoretically straightforward, but complexity increasingly becomes an issue
- However, in the domain of auditory processing there is little need to track more than two or three targets, due to limited perceptual capacity

49 Conclusion
- We have proposed an HMM framework for multi-target tracking
- The state space consists of a discrete set of subspaces, each of which is continuous
- Observations (likelihoods) are derived in the time-frequency domain: the correlogram for pitch and the cross-correlogram for azimuth
- We have applied this framework to tracking multiple pitch contours and multiple moving sources
- The resulting algorithms perform reliably and outperform related systems
- The proposed framework appears to have general utility for acoustic (auditory) signal processing

50 Collaborators Mingyang Wu, Guy Brown Nicoleta Roman Zhaozhang Jin

51 A monotonic relationship
- The relationship of the distribution spread λ to reverberation time (estimated from detected pitch) yields a blind estimate of the room reverberation time up to 0.6 s (Wu & Wang'06)

52 A byproduct: reverberation-time estimation
- The relative time-lag distribution is sensitive to room reverberation, which increases the distribution spread
(Figure: relative time-lag distributions for clean speech and reverberant speech)