Brian King, advised by Les Atlas
Electrical Engineering, University of Washington
This research was funded by the Air Force Office of Scientific Research.

Problem Statement
- Develop a theoretical framework for complex probabilistic latent semantic analysis (CPLSA) and its application in single-channel source separation.

Outline
- Introduction
- Background
- My current contributions
- Proposed work

Nonnegative Matrix Factorization (NMF)
$X_{f,t} \approx B_{f,k} W_{k,t}$, where $X$ is the magnitude spectrogram indexed by frequency (f) and time (t), $B$ holds the bases indexed by basis index (k), and $W$ holds the time-varying weights.
[1] D.D. Lee and H.S. Seung, "Algorithms for Non-Negative Matrix Factorization," Neural Information Processing Systems, 2001.
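
To make the factorization concrete, here is a minimal NumPy sketch of NMF using the multiplicative updates of Lee and Seung [1]. The rank K, iteration count, and random initialization are illustrative choices, not values from this work.

```python
import numpy as np

def nmf(X, K, n_iter=200, eps=1e-9):
    """Factor a nonnegative matrix X (F x T) into B (F x K) @ W (K x T)
    by alternating multiplicative updates for the Frobenius cost."""
    F, T = X.shape
    rng = np.random.default_rng(0)
    B = rng.random((F, K)) + eps
    W = rng.random((K, T)) + eps
    for _ in range(n_iter):
        W *= (B.T @ X) / (B.T @ B @ W + eps)   # update weights, bases fixed
        B *= (X @ W.T) / (B @ W @ W.T + eps)   # update bases, weights fixed
    return B, W
```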

Using Matrix Factorization for Source Separation
[Diagram: individual training signals x_indiv pass through the Short-Time Fourier Transform (STFT) to give X_indiv, from which the bases B are found; the mixed signal x_mixed passes through the STFT to give X_mixed, for which the weights W are found; the separated spectrograms Y_1 and Y_2 are returned to the time domain with the inverse STFT (ISTFT) to give y_1 and y_2.]

Using Matrix Factorization for Synthesis / Source Separation
[Diagram: matrix factorization splits X (f,t) into bases B_1, B_2 (f,k) and weights W_1, W_2 (k,t). For synthesis, Y = BW gives the synthesized signal (f,t); for source separation, Y_1 = B_1 W_1 and Y_2 = B_2 W_2 give the separated signals.]
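
A hedged sketch of the separation step in the diagram above: bases learned from each source's training data are concatenated, weights are fit on the mixture, and each source is reconstructed. The soft (Wiener-style) mask at the end is one common resynthesis choice, not necessarily the one used in this work.

```python
import numpy as np

def separate(X_mixed, B1, B2, W, eps=1e-9):
    """X_mixed: mixture magnitude spectrogram (F x T).
    B1, B2: bases trained on each source; W: weights fit on the mixture,
    rows ordered [source-1 weights; source-2 weights]."""
    K1 = B1.shape[1]
    Y1 = B1 @ W[:K1]                  # estimate of source 1's spectrogram
    Y2 = B2 @ W[K1:]                  # estimate of source 2's spectrogram
    mask = Y1 / (Y1 + Y2 + eps)       # soft mask from the two estimates
    return mask * X_mixed, (1.0 - mask) * X_mixed
```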

NMF Cost Function: Frobenius Norm with Sparsity
$\min_{B,W \ge 0} \; \tfrac{1}{2}\,\lVert X_{f,t} - B_{f,k} W_{k,t} \rVert_F^2 + \lambda \lVert W \rVert_1$
where the first term is the squared Frobenius norm of the reconstruction error and the second is an $L_1$ sparsity penalty on the weights.
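
For reference, the cost above evaluates as follows (a sketch; the sparsity weight lam is problem-dependent):

```python
import numpy as np

def nmf_cost(X, B, W, lam):
    """Squared Frobenius reconstruction error plus L1 sparsity on W."""
    return 0.5 * np.linalg.norm(X - B @ W, 'fro') ** 2 + lam * np.abs(W).sum()
```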

Probabilistic Latent Semantic Analysis (PLSA)
- Views the magnitude spectrogram as a joint probability distribution.
[2] M. Shashanka, B. Raj, and P. Smaragdis, "Probabilistic Latent Variable Models as Nonnegative Factorizations," Computational Intelligence and Neuroscience, vol. 2008, 2008, pp. 1-9.

Probabilistic Latent Semantic Analysis (PLSA)
- Uses the following generative model:
  1. Pick a time, P(t)
  2. Pick a base from that time, P(k|t)
  3. Pick a frequency of that base, P(f|k)
  4. Increment the chosen (f,t) bin by one
  5. Repeat
- The model can be written as $P(f,t) = P(t) \sum_k P(f|k)\,P(k|t)$.
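
The generative model can be simulated directly; this sketch draws a magnitude spectrogram one quantum at a time (the input distributions and draw count are hypothetical):

```python
import numpy as np

def sample_spectrogram(P_t, P_k_given_t, P_f_given_k, n_draws=10000, seed=0):
    """P_t: (T,); P_k_given_t: (K, T), columns sum to 1;
    P_f_given_k: (F, K), columns sum to 1."""
    rng = np.random.default_rng(seed)
    T = P_t.size
    F, K = P_f_given_k.shape
    X = np.zeros((F, T))
    for _ in range(n_draws):
        t = rng.choice(T, p=P_t)                 # pick a time
        k = rng.choice(K, p=P_k_given_t[:, t])   # pick a base at that time
        f = rng.choice(F, p=P_f_given_k[:, k])   # pick a frequency of that base
        X[f, t] += 1                             # increment the chosen bin
    return X
```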

Probabilistic Latent Semantic Analysis (PLSA)
- Relationship to NMF:
  - P(t) is the sum of all magnitude at time t
  - P(k|t) is similar to the weight matrix $W_{k,t}$
  - P(f|k) is similar to the basis matrix $B_{f,k}$
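
The correspondence gives a direct conversion between the two parameterizations; a sketch (total is the summed magnitude of the modeled spectrogram):

```python
import numpy as np

def plsa_to_nmf(P_t, P_k_given_t, P_f_given_k, total):
    """B[:, k] = P(f|k) and W[k, t] = total * P(k|t) * P(t),
    so that B @ W = total * P(f, t) approximates the spectrogram X."""
    B = P_f_given_k                          # (F, K)
    W = total * P_k_given_t * P_t[None, :]   # (K, T)
    return B, W
```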

Probabilistic Latent Semantic Analysis
- Advantage of PLSA over NMF: extensibility. There is a tremendous amount of applicable literature on generative models:
  - Entropic priors [2]
  - HMMs with state-dependent dictionaries [6]
[2] M. Shashanka, B. Raj, and P. Smaragdis, "Probabilistic Latent Variable Models as Nonnegative Factorizations," Computational Intelligence and Neuroscience, vol. 2008, 2008, pp. 1-9.
[6] G.J. Mysore, "A Non-Negative Framework for Joint Modeling of Spectral Structures and Temporal Dynamics in Sound Mixtures," PhD Thesis, Stanford University.

… but superposition?
Magnitude spectrograms do not superimpose: in general $|x_1 + x_2| \ne |x_1| + |x_2|$, so a magnitude-only model mismodels time-frequency bins where sources overlap.
[Figure: original sources #1 and #2, their mixture, a proper separation, and the NMF separation.]
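
A two-line numeric illustration of the failure mode (values chosen arbitrarily): two complex STFT bins of magnitude 1 can partially cancel, while a magnitude-only model assumes they add.

```python
# Two unit-magnitude complex bins with different phases:
x1, x2 = 1.0 + 0.0j, -0.8 + 0.6j
print(abs(x1 + x2))       # ~0.63: the mixture's actual magnitude
print(abs(x1) + abs(x2))  # 2.0: what a magnitude-only model assumes
```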

CMF Cost Function: Frobenius Norm with Sparsity
$\min \; \tfrac{1}{2}\,\big\lVert X_{f,t} - \sum_k B_{f,k} W_{k,t}\, e^{j\phi_{f,k,t}} \big\rVert_F^2 + \lambda \lVert W \rVert_1$
where $B, W \ge 0$ and $\phi_{f,k,t}$ is an individual phase term for each frequency, basis, and time.
[3] H. Kameoka, N. Ono, K. Kashino, and S. Sagayama, "Complex NMF: A New Sparse Representation for Acoustic Signals," International Conference on Acoustics, Speech, and Signal Processing, 2009.
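
A sketch of the objective only (the multiplicative update rules of [3] are more involved); the array shapes and einsum contraction are my notation:

```python
import numpy as np

def cmf_cost(X, B, W, phi, lam):
    """X: complex (F, T); B: (F, K) >= 0; W: (K, T) >= 0;
    phi: (F, K, T) per-basis phases. Returns the CMF cost."""
    X_hat = np.einsum('fk,kt,fkt->ft', B, W, np.exp(1j * phi))
    return 0.5 * np.linalg.norm(X - X_hat) ** 2 + lam * np.abs(W).sum()
```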

Comparing NMF and CMF via ASR: Introduction
- Data
  - Boston University news corpus [7]
  - 150 utterances (72 minutes)
  - Two talkers synthetically mixed at 0 dB target/masker ratio
  - 1 minute each of clean speech used for training
- Recognizers
  - Sphinx-3 (CMU)
  - SRI
[7] M. Ostendorf, "The Boston University Radio Corpus."

Comparing NMF and CMF via ASR: Results
[Bar chart: word accuracy (%) for the unprocessed, non-negative, and complex conditions; error bars mark the 95% confidence level; higher is better.]

Comparing NMF and CMF via ASR: Conclusion
- Incorporating phase estimates into matrix factorization can improve source separation performance.
- Complex matrix factorization is worth further research.
[4] B. King and L. Atlas, "Single-Channel Source Separation Using Complex Matrix Factorization," IEEE Transactions on Audio, Speech, and Language Processing (submitted).
[5] B. King and L. Atlas, "Single-channel Source Separation using Simplified-training Complex Matrix Factorization," International Conference on Acoustics, Speech, and Signal Processing, Dallas, TX, 2010.

… but overparameterization?
- CMF's individual phase terms can result in a potentially infinite number of solutions… which isn't a good thing!
- Example: estimating an observation with 3 bases.
[Figure: example decompositions #1, #2, and #3 that all reconstruct the same observation.]

Review of Current Methods
[Table: NMF, PLSA, and CMF rated on four properties: extendible, unique, additive, superposition. NMF and PLSA do not model superposition; CMF models superposition but is overparameterized (not unique) and difficult to extend. No current method has all four properties.]

Proposed Solution: Complex Probabilistic Latent Semantic Analysis (CPLSA)
- Goal: incorporate phase observation and estimation into the current nonnegative PLSA framework.
- Implicitly solves:
  - Extensibility
  - Superposition
- Proposal to solve:
  - Overparameterization

Proposed Solution: Outline
- Transform complex to nonnegative data
- 3 CPLSA variants
- Phase constraints for STFT consistency
  - Unique solution

Transform Complex to Nonnegative Data
- Why is this important?
  - The observed data $X_{f,t}$ is modeled as a probability mass function (PMF).
  - PMFs are nonnegative and real.
  - So if $X_{f,t}$ is to be modeled as a PMF, then the observation must be nonnegative and real.

Transform Complex to Nonnegative Data
- Starting point: Shashanka [8], N real → N+1 nonnegative.
  - Algorithm: form N+1-length orthogonal vectors ($A_{N+1,N}$); affine transform (for nonnegativity); normalize.
- My new, proposed method: N complex → 2N real, then 2N real → 2N+1 nonnegative (see the sketch below).
[8] M. Shashanka, "Simplex Decompositions for Real-Valued Datasets," IEEE International Workshop on Machine Learning for Signal Processing, 2009.
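
As referenced above, a minimal sketch of the proposed two-step mapping. The construction of the orthogonal vectors and the affine shift are my reading of the algorithm; the exact offset and normalization in [8] may differ.

```python
import numpy as np

def complex_to_nonnegative(z):
    """Map N complex values to 2N+1 nonnegative values summing to one."""
    # Step 1: N complex -> 2N real (stack real and imaginary parts).
    y = np.concatenate([z.real, z.imag])
    n = y.size
    # Step 2: 2N real -> 2N+1 nonnegative. Build an (n+1) x n matrix A
    # whose orthonormal columns are all orthogonal to the ones vector,
    # so the components of A @ y sum to zero.
    M = np.eye(n + 1)
    M[:, 0] = 1.0                      # put the ones direction first
    Q, _ = np.linalg.qr(M)             # orthonormalize the columns
    A = Q[:, 1:]                       # keep directions orthogonal to ones
    x = A @ y
    x = x - x.min()                    # affine shift for nonnegativity
    return x / x.sum()                 # normalize to a PMF
```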

Transform Complex to Nonnegative Data
[Figure: illustration of the complex-to-nonnegative transform.]

3 Variants of CPLSA
- #1: Complex bases
  - Phase is associated with the bases
  - Not a good model for the STFT
- #2: Nonnegative bases + base-dependent phases
  - A good model for audio, but overparameterized

3 Variants of CPLSA
- #3: Nonnegative bases + source-dependent phases
  - Additive source model
  - A good model for audio
  - Fewer parameters
  - Simplifies to NMF in the single-source case
- Compare with CPLSA #2

Phase Constraints for STFT Consistency
- A spectrogram $Y$ is consistent when $\mathrm{STFT}(\mathrm{ISTFT}(Y)) = Y$.
- Incorporate STFT consistency [9] into the phase estimation step for the separated sources.
- Unique solution!
[9] J. Le Roux, N. Ono, and S. Sagayama, "Explicit Consistency Constraints for STFT Spectrograms and Their Application to Phase Reconstruction."
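
One way to check the condition numerically (a sketch; the sampling rate and window length are illustrative, not this work's settings):

```python
import numpy as np
from scipy.signal import stft, istft

def consistency_error(Y, fs=16000, nperseg=512):
    """Relative distance of a complex spectrogram Y from consistency,
    i.e. from satisfying STFT(ISTFT(Y)) = Y."""
    _, y = istft(Y, fs=fs, nperseg=nperseg)
    _, _, Y2 = stft(y, fs=fs, nperseg=nperseg)
    T = min(Y.shape[1], Y2.shape[1])   # frame counts can differ slightly
    return np.linalg.norm(Y[:, :T] - Y2[:, :T]) / np.linalg.norm(Y[:, :T])
```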

Summary of Proposed Theory
- Goal: incorporate phase observation and estimation into the current nonnegative PLSA framework (extensible, additive, unique).
- Theory:
  - Transform complex to nonnegative data
  - 3 CPLSA variants
  - Phase constraints for STFT consistency

Proposed Experiments
- Separating speech in structured, nonstationary noise
- Methods: CPLSA, PLSA, CMF
- Noise: babble noise, automotive noise
- Measurements: objective, perceptual, ASR

Objective Measurement Tests
- Goal: explore the parameter space
  - How parameters affect performance in CPLSA
  - Find the best-performing parameters
  - Compare performance of CPLSA with PLSA and CMF
- Data: TIMIT corpus [10]
- Measurements (see the example below):
  - Blind Source Separation Evaluation Toolbox [11]
  - Perceptual Evaluation of Speech Quality (PESQ) [12]
[10] J.S. Garofolo, L.F. Lamel, W.M. Fisher, J.G. Fiscus, D.S. Pallett, and N.L. Dahlgren, DARPA TIMIT Acoustic Phonetic Continuous Speech Corpus, NIST.
[11] E. Vincent, R. Gribonval, and C. Fevotte, "Performance Measurement in Blind Audio Source Separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, 2006.
[12] A. Rix, J. Beerends, M. Hollier, and A. Hekstra, "Perceptual Evaluation of Speech Quality (PESQ) - A New Method for Speech Quality Assessment of Telephone Networks and Codecs," ICASSP, 2001.
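
As referenced above, the BSS Eval measures can also be computed in Python with the mir_eval package (my substitution; the toolbox in [11] is the MATLAB original). A runnable sketch on synthetic signals:

```python
import numpy as np
import mir_eval  # pip install mir_eval

rng = np.random.default_rng(0)
reference = rng.standard_normal((2, 16000))                    # two "true" sources
estimated = reference + 0.1 * rng.standard_normal((2, 16000))  # noisy estimates
sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(reference, estimated)
print(sdr, sir, sar)  # distortion/interference/artifact ratios; higher is better
```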

Automatic Speech Recognition Tests
- Goal: test the robustness of the parameters
  - Use the best-performing parameters from the objective measurements
  - Compare performance of CPLSA with PLSA and CMF
- Data: Wall Street Journal corpus [13]
- ASR system: Sphinx-3 (CMU)
[13] D.B. Paul and J.M. Baker, "The Design for the Wall Street Journal-Based CSR Corpus," Proceedings of the Workshop on Speech and Natural Language, Stroudsburg, PA, USA: Association for Computational Linguistics, 1992.

Examples

Subway Noise, NMF: 4.3 dB improvement
[Spectrograms: frequency (Hz) vs. time (s), before and after separation.]

Subway Noise, NMF: 4.2 dB improvement
[Spectrograms: frequency (Hz) vs. time (s), before and after separation.]

Fountain Noise Example #1
- Target speaker synthetically added at -3 dB SNR
- Speaker model trained on 60 seconds of clean speech

Fountain Noise Example #2
- No clean speech available for training the target talker
- A generic speaker model was used instead

Mixed Speech (0 dB, no reverb)

Mixed Speech (0 dB, reverb)

Thank you!


Why not encode phase into bases? With an individual phase term, $X = (BW) \odot e^{j\theta}$:
$X = \begin{bmatrix} 1e^{j\pi/1} & 1e^{j\pi/5} \\ 2e^{j\pi/2} & 2e^{j\pi/6} \\ 3e^{j\pi/3} & 3e^{j\pi/7} \\ 4e^{j\pi/4} & 4e^{j\pi/8} \end{bmatrix} = \left( \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} \begin{bmatrix} 1 & 1 \end{bmatrix} \right) \odot \begin{bmatrix} e^{j\pi/1} & e^{j\pi/5} \\ e^{j\pi/2} & e^{j\pi/6} \\ e^{j\pi/3} & e^{j\pi/7} \\ e^{j\pi/4} & e^{j\pi/8} \end{bmatrix}$
With an individual phase term $e^{j\theta}$ for every entry, the magnitudes factor exactly as the rank-one product $BW$.
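
A quick NumPy check of the example (my reconstruction of the backup slide's matrix): the magnitudes of X factor exactly as a rank-one product when phase is carried separately.

```python
import numpy as np

mags = np.arange(1.0, 5.0)[:, None]               # magnitudes 1..4
phases = np.pi / np.arange(1, 9).reshape(2, 4).T  # pi/1..pi/4 | pi/5..pi/8
X = mags * np.exp(1j * phases)                    # the 4x2 example matrix
B, W = mags, np.ones((1, 2))                      # rank-one magnitude model
print(np.allclose(np.abs(X), B @ W))              # True: |X| = B @ W
```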

Why not encode phase into bases? With complex $B$ and $W$ (phase attached to the bases and weights):
$X = \begin{bmatrix} 1e^{j\pi/1} & 1e^{j\pi/5} \\ 2e^{j\pi/2} & 2e^{j\pi/6} \\ 3e^{j\pi/3} & 3e^{j\pi/7} \\ 4e^{j\pi/4} & 4e^{j\pi/8} \end{bmatrix} \ne \begin{bmatrix} 1e^{j?} \\ 2e^{j?} \\ 3e^{j?} \\ 4e^{j?} \end{bmatrix} \begin{bmatrix} e^{j?} & e^{j?} \end{bmatrix}$
No choice of basis phase $e^{j?}$ can reproduce the different phase at every entry, so the same $X$ has no exact complex rank-one factorization $BW$.

BSS Evaluation Measures

… but superposition?