Approaches of Interest in Blind Source Separation of Speech
Julien Bourgeois
DAIMLERCHRYSLER AG Research and Technology, RIC/AD
Background
- Need for a speech-based human-machine interface in cars.
- Road noise and passengers' speech create adverse conditions for automatic speech recognition.
4 Approaches to the Cocktail Party Problem
1 - Computational Auditory Scene Analysis (CASA)
2 - Sparse Decomposition Approach
3 - Statistical Blind Source Separation
4 - Beamforming
Conclusion & Future plans
Computational Auditory Scene Analysis (CASA) - Generalities
- Aim: obtain an algorithmic description of higher auditory functions.
- Strong biological inspiration.
- One or two sensors (microphones) are considered.
- The microphone signal is filtered as in the human ear.
- Variations on a segmentation-grouping scheme.
CASA - Segmentation
- Segmentation is based on temporal continuity.
- [Figure: time-frequency map; axes: time, frequency index]
CASA - Grouping
- Grouping rules are (1) harmonicity and (2) synchronous start or end.
- These rules agree with certain psychoacoustical phenomena.
- [Figure: time-frequency map; axes: time, frequency index]
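The harmonicity rule can be illustrated with a toy example. This is only a sketch under strong assumptions: spectral peaks and candidate fundamental frequencies are taken as already known, and the function name `group_by_harmonicity` and the tolerance `tol` are hypothetical. Actual CASA systems work on auditory-filterbank representations rather than on a list of peaks.

```python
import numpy as np

def group_by_harmonicity(peaks_hz, f0_candidates_hz, tol=0.03):
    """Assign each spectral peak to the candidate fundamental it is most harmonic with.

    Returns one index into f0_candidates_hz per peak, or -1 if no candidate
    explains the peak within the relative mistuning tolerance `tol`.
    """
    peaks = np.asarray(peaks_hz, dtype=float)[:, None]       # shape (P, 1)
    f0s = np.asarray(f0_candidates_hz, dtype=float)[None]    # shape (1, K)
    ratio = peaks / f0s
    harmonic = np.maximum(np.round(ratio), 1.0)              # nearest harmonic number, at least 1
    deviation = np.abs(ratio - harmonic) / harmonic          # relative mistuning
    best = np.argmin(deviation, axis=1)
    mistuning = deviation[np.arange(len(best)), best]
    return np.where(mistuning <= tol, best, -1)

# Two hypothetical voices with fundamentals 100 Hz and 130 Hz, plus one stray peak.
peaks = [100.0, 130.0, 200.0, 260.0, 390.0, 400.0, 555.0]
labels = group_by_harmonicity(peaks, f0_candidates_hz=[100.0, 130.0])
```

Peaks that are close to an integer multiple of a fundamental end up in that fundamental's group; the stray peak at 555 Hz matches neither and stays ungrouped.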
CASA - Audio example
- [Audio clips: mixture, separated]
Sparse Decomposition - Generalities
- Two sensor signals x1 and x2, each a mixture of N acoustic sources s_i, are given.
- Aim: find an invertible transform T such that the N sources are disjoint in the transformed domain.
- DUET: T = STFT (windowed Short-Time Fourier Transform) works.
- Indeed, statistically, the product S1(w,t) S2(w,t) is small.
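A small numerical check of the disjointness idea, as a sketch only. It uses two synthetic sinusoids rather than speech, and the helper names `stft` and `overlap`, the frame length and the hop size are assumptions made for the illustration. The two signals overlap completely in time but barely at all in the STFT domain.

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Hann-windowed short-time Fourier transform, shape (frequency, time)."""
    window = np.hanning(n_fft)
    starts = range(0, len(x) - n_fft + 1, hop)
    frames = np.array([x[i:i + n_fft] * window for i in starts])
    return np.fft.rfft(frames, axis=1).T

def overlap(a, b):
    """Normalized overlap sum|a b| / (||a|| ||b||): 1 for identical support, 0 for disjoint."""
    return float(np.sum(np.abs(a * b)) / (np.linalg.norm(a) * np.linalg.norm(b)))

n = np.arange(8192)
s1 = np.sin(2 * np.pi * 0.05 * n)   # frequencies in cycles per sample
s2 = np.sin(2 * np.pi * 0.13 * n)

overlap_time = overlap(s1, s2)              # large: both signals are active at all times
overlap_tf = overlap(stft(s1), stft(s2))    # small: the signals occupy different bins
```

Speech is sparse in the time-frequency plane for a different reason (pauses, harmonics, formants), so the numbers obtained here say nothing about real speech mixtures; the sketch only shows how the product S1(w,t) S2(w,t) can be measured.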
Sparse Decomposition - DUET
- Assumption: “At each point (w,t) of the spectrogram, only one source is active.”
- Which source S_i is active at (w,t)? Look at the phase between X1(w,t) and X2(w,t): Angle(X1(w,t)/X2(w,t))/w [group delay].
- [Figure: time-frequency map (axes: time, frequency index) with regions labelled "Group delay 1" and "Group delay 2"]
- Then, for the active source i, set S_i(w,t) = X1(w,t).
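The masking step can be sketched as follows. This is a simplified illustration, not the DUET algorithm itself: the candidate delays are assumed known here, the sources are synthetic sinusoids, and `stft` and `duet_separate` are hypothetical helper names.

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Hann-windowed short-time Fourier transform, shape (frequency, time)."""
    window = np.hanning(n_fft)
    starts = range(0, len(x) - n_fft + 1, hop)
    frames = np.array([x[i:i + n_fft] * window for i in starts])
    return np.fft.rfft(frames, axis=1).T

def duet_separate(X1, X2, delays, n_fft=256):
    """Binary-mask X1 by assigning each (w,t) point to the closest candidate delay."""
    w = 2 * np.pi * np.arange(X1.shape[0]) / n_fft        # bin frequency in rad/sample
    w_safe = np.where(w == 0, 1.0, w)[:, None]            # avoid dividing by zero at DC
    tau = np.angle(X1 * np.conj(X2)) / w_safe             # Angle(X1/X2)/w
    dist = np.abs(tau[None] - np.asarray(delays, float)[:, None, None])
    labels = np.argmin(dist, axis=0)                      # index of the active source
    estimates = np.array([np.where(labels == k, X1, 0) for k in range(len(delays))])
    return estimates, labels

# Two sinusoids reaching sensor 2 with delays of +1 and -1 sample (assumed values).
n = np.arange(4096)
f1, f2, d1, d2 = 20 / 256, 60 / 256, 1, -1
x1 = np.sin(2 * np.pi * f1 * n) + np.sin(2 * np.pi * f2 * n)
x2 = np.sin(2 * np.pi * f1 * (n - d1)) + np.sin(2 * np.pi * f2 * (n - d2))

X1, X2 = stft(x1), stft(x2)
estimates, labels = duet_separate(X1, X2, delays=[d1, d2])
```

The delays are kept below one sample so that w times the delay stays within (-pi, pi) and the phase does not wrap; larger sensor spacings would need a different treatment.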
Sparse Decomposition - Audio Example
- [Audio clips: Mix 1, Mix 2, Out 1, Out 2]
Statistical Blind Source Separation
- Assumption: “The sources are decorrelated.” or “The sources are independent.”
- ICA = Independent Component Analysis.
- Generally needs at least as many sensors as sources.
- Permutation and scale ambiguities: if s1 and s2 are independent, so are s2 and b s1 for any nonzero scalar b.
Statistical Blind Source Separation
- Mixture model: x(n) = A(0) s(n) + ... + A(K) s(n-K) = (A * s)(n), where * denotes convolution.
- In the time-frequency (TF) domain: X(w,t) = A(w) S(w,t).
- Separation filters W: find W(w) such that the components of Y(w,t) = W(w) X(w,t) are independent or decorrelated (Y estimates the sources S).
- For a decorrelation criterion, the output Y is decorrelated at each t: one can find W by minimizing the off-diagonal terms of R_YY(w,t) = E[Y(w,t) Y^H(w,t)] jointly for all t.
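The decorrelation criterion can be written down as a cost function. The sketch below works at a single frequency bin and assumes exact covariance matrices R_XX(w,t) = A(w) R_SS(w,t) A^H(w); the mixing matrix, source powers and the name `offdiag_cost` are made up for the illustration. It evaluates the cost only and does not perform the iterative optimization over W.

```python
import numpy as np

def offdiag_cost(W, R_xx):
    """Sum over time blocks of the squared off-diagonal magnitudes of R_YY = W R_XX W^H.

    W: (N, N) separation matrix at one frequency bin.
    R_xx: (T, N, N) covariance matrices of X at that bin, one per time block.
    """
    R_yy = W @ R_xx @ W.conj().T                   # broadcasts over the T blocks
    off = R_yy * (1.0 - np.eye(W.shape[0]))        # zero out the diagonal
    return float(np.sum(np.abs(off) ** 2))

# Hypothetical mixing matrix and non-stationary source powers (three time blocks).
A = np.array([[1.0, 0.5], [0.3j, 1.0]])
powers = np.array([[1.0, 2.0], [3.0, 0.5], [0.2, 1.0]])
R_ss = np.array([np.diag(p) for p in powers])      # diagonal: decorrelated sources
R_xx = A @ R_ss @ A.conj().T

W_true = np.linalg.inv(A)
W_ambiguous = np.array([[0, 2.0], [-1j, 0]]) @ W_true   # rows permuted and rescaled

cost_identity = offdiag_cost(np.eye(2), R_xx)      # no separation: cost is large
cost_true = offdiag_cost(W_true, R_xx)             # exact inverse: cost is ~0
cost_ambiguous = offdiag_cost(W_ambiguous, R_xx)   # also ~0
```

The permuted and rescaled matrix reaches the same minimum as the exact inverse, which is the permutation and scale ambiguity in numerical form.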
Statistical Blind Source Separation
- Very few assumptions on the sources.
- But: in the frequency domain, the ambiguities occur independently at each frequency bin w.
- Can be CPU-expensive because of iterative optimization.
Statistical Blind Source Separation - Audio example
- [Audio clips: Mix 1, Mix 2, Out 1, Out 2]
Beamforming - Array signal processing
- Spatial locations of the sources (direction of arrival, D.O.A.) are mapped to delays between sensors.
- Array signal processing addresses 3 estimation problems: 1) number of sources, 2) their spatial locations, 3) spatial filtering.
- Can require more sensors than sources, depending on the spatial resolution.
- [Figure: sources s1, s2 and sensors x1, ..., x_i, ..., xN]
- x_i(t) = s1(t - d_{1,i}) + s2(t - d_{2,i})
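A minimal simulation of this delay model, under assumptions that the slide does not state: a far-field source, a uniform linear array, a speed of sound of 343 m/s, and delays rounded to whole samples with circular shifts. The function names `ula_delays` and `mix` are hypothetical.

```python
import numpy as np

def ula_delays(doa_deg, n_mics, spacing_m, fs_hz, c=343.0):
    """Delay in samples at each microphone of a uniform linear array, relative to
    microphone 0, for a far-field source at angle doa_deg from broadside."""
    positions = spacing_m * np.arange(n_mics)
    return positions * np.sin(np.deg2rad(doa_deg)) / c * fs_hz

def mix(sources, delays):
    """x_i(t) = sum_k s_k(t - d_{k,i}), with delays rounded to whole samples
    and applied as circular shifts (a simplification)."""
    shifts = np.rint(delays).astype(int)                 # shape (K sources, N mics)
    return np.array([
        sum(np.roll(s, shifts[k, i]) for k, s in enumerate(sources))
        for i in range(shifts.shape[1])
    ])

rng = np.random.default_rng(0)
s = rng.standard_normal((2, 1000))                       # two synthetic sources
d = np.array([ula_delays(doa, n_mics=3, spacing_m=0.2, fs_hz=16000)
              for doa in (30.0, -60.0)])                 # d[k, i] = d_{k,i}
x = mix(s, d)                                            # shape (3 mics, 1000 samples)
```

Real recordings need fractional delays and room reflections, which this sketch leaves out.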
Beamforming - Source Location
- 1/ Energy-based: search for the delays d_i that maximize the output power sigma_y^2, where y(t) = x1(t + d1) + ... + xN(t + dN) is the output of a delay-sum beamformer.
- 2/ Correlation-based: search for the delay d that maximizes E[x_i(t) x_j(t - d)], for some pairs (i,j).
- 3/ High resolution: X(w,t) = A(w) S(w,t). The eigendecomposition of R_XX = A R_SS A^H provides information on A, i.e. on the source location. R_SS is diagonal if the sources are decorrelated.
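The first two location methods can be sketched for a single source and two sensors. The assumptions are integer sample delays, circular shifts, and a white-noise source; the names `correlation_delay` and `energy_delay` are hypothetical, and expectations are replaced by sample means.

```python
import numpy as np

def correlation_delay(x_i, x_j, max_lag):
    """Lag d maximizing the sample estimate of E[x_i(t) x_j(t - d)]."""
    lags = np.arange(-max_lag, max_lag + 1)
    corr = [np.mean(x_i * np.roll(x_j, d)) for d in lags]        # roll(x, d)[t] = x[t - d]
    return int(lags[np.argmax(corr)])

def energy_delay(x_1, x_2, max_lag):
    """Steering delay d_2 (with d_1 = 0) maximizing the power of
    y(t) = x_1(t) + x_2(t + d_2), the delay-sum beamformer output."""
    lags = np.arange(-max_lag, max_lag + 1)
    power = [np.mean((x_1 + np.roll(x_2, -d)) ** 2) for d in lags]  # roll(x, -d)[t] = x[t + d]
    return int(lags[np.argmax(power)])

rng = np.random.default_rng(1)
s = rng.standard_normal(4096)
x1 = s + 0.1 * rng.standard_normal(4096)                 # sensor 1: no delay
x2 = np.roll(s, 5) + 0.1 * rng.standard_normal(4096)     # sensor 2: s delayed by 5 samples

d_corr = correlation_delay(x2, x1, max_lag=10)
d_energy = energy_delay(x1, x2, max_lag=10)
```

Both searches recover the 5-sample delay in this setting; with several simultaneous sources or reverberation the peaks become ambiguous, which is where the high-resolution methods come in.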
Beamforming - Spatial Filtering
- [Figure: beamformer steered to the direction of interest; sensor signals x1, ..., x_i, ..., xN are delayed by d1, ..., d_i, ..., dN, filtered by F1, ..., F_i, ..., FN, and summed]
- 1/ Data-independent: e.g. delay-sum beamforming.
- 2/ Statistically optimal: constrain the response in the direction of interest and minimize the output power.
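Both kinds of spatial filter can be compared at a single frequency. The sketch assumes a narrowband far-field model with a half-wavelength-spaced uniform linear array, and takes the constrained minimum-output-power solution in its usual closed form w = R^-1 a / (a^H R^-1 a), commonly called the MVDR or Capon beamformer (the slide does not name a specific algorithm). Directions, powers and function names are made up for the illustration.

```python
import numpy as np

def steering_vector(doa_deg, n_mics):
    """Narrowband far-field steering vector, half-wavelength-spaced uniform linear array."""
    phase = np.pi * np.arange(n_mics) * np.sin(np.deg2rad(doa_deg))
    return np.exp(-1j * phase)

def delay_sum_weights(a):
    """Data-independent weights: align the sensors on direction a and average."""
    return a / len(a)

def mvdr_weights(R, a):
    """Minimize the output power w^H R w subject to the constraint w^H a = 1."""
    Rinv_a = np.linalg.solve(R, a)
    return Rinv_a / np.vdot(a, Rinv_a)       # vdot conjugates its first argument

def gain(w, a):
    """Magnitude of the beamformer response w^H a towards direction a."""
    return abs(np.vdot(w, a))

n_mics = 4
a_target = steering_vector(0.0, n_mics)      # direction of interest
a_interf = steering_vector(40.0, n_mics)     # interfering talker

# Sensor covariance: target (power 1), interferer (power 10), sensor noise (power 0.01).
R = (np.outer(a_target, a_target.conj())
     + 10.0 * np.outer(a_interf, a_interf.conj())
     + 0.01 * np.eye(n_mics))

w_ds = delay_sum_weights(a_target)
w_mvdr = mvdr_weights(R, a_target)
```

Both filters pass the direction of interest with unit gain, but `gain(w_mvdr, a_interf)` is far smaller than `gain(w_ds, a_interf)` because the statistically optimal filter places a null on the interferer.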
Beamforming - Audio example
- [Audio clips: Mix 1, Mix 2, Out 1, Out 2]
Conclusion & Questions
- Different definitions of “source”.
- Perceptual, topological, statistical, spatial: complementary approaches.
- No perfect solution to the cocktail party problem.
Future plans in Hoarse
- Combination of existing methods: DUET if the sources are disjoint; ICA or beamforming if they overlap.
- Investigation of specific open questions:
- Estimation of the number of sources at each (w,t) point.
- Sparse decomposition: optimal transform T? Extension to more than 2 mics? Theoretical boundaries?
- Equivalences between these approaches (e.g. second-order BSS and beamforming)?
Short Bibliography - CASA
- Guy J. Brown and Martin Cooke, "Computational Auditory Scene Analysis", Computer Speech and Language, vol. 8, no. 4, pp. 297-336, 1994.
- A. S. Bregman, "Auditory Scene Analysis", MIT Press, Cambridge, MA, 1990.
- Guoning Hu and DeLiang Wang, "Monaural speech separation", NIPS 2002.
Short Bibliography - Sparse Decomposition - DUET
- M. Zibulevsky, B. A. Pearlmutter, P. Bofill, and P. Kisilev, "Blind Source Separation by Sparse Decomposition", chapter in: S. J. Roberts and R. M. Everson (Eds.), Independent Component Analysis: Principles and Practice, Cambridge, 2001.
- O. Yilmaz and S. Rickard, "Blind Separation of Speech Mixtures via Time-Frequency Masking", submitted to the IEEE Transactions on Signal Processing, November 4, 2002.
- A. Jourjine, S. Rickard, and O. Yilmaz, "Blind Separation of Disjoint Orthogonal Signals: Demixing N Sources from 2 Mixtures", Proceedings of the 2000 IEEE Conference on Acoustics, Speech, and Signal Processing (ICASSP2000), vol. 5, pp. 2985-2988, Istanbul, Turkey, June 2000.
Short Bibliography - Statistical Blind Source Separation - ICA
- Lucas Parra and Clay Spence, "Convolutive blind source separation of non-stationary sources", IEEE Trans. on Speech and Audio Processing, pp. 320-327, May 2000.
- Te-Won Lee, Independent Component Analysis: Theory and Applications, Kluwer Academic Publishers, September 1998.
Short Bibliography - Beamforming
- B. D. van Veen and K. M. Buckley, "Beamforming: A Versatile Approach to Spatial Filtering", IEEE ASSP Magazine, vol. 5, pp. 4-24, Apr. 1988.
- M. Brandstein and H. Silverman, "A practical methodology for speech source localization with microphone arrays", Computer, Speech and Language, vol. 11, no. 2, pp. 91-126, 1997.
- D. Ward and M. Brandstein (Eds.), "Microphone Arrays: Techniques and Applications", Springer, Berlin, 2001, pp. 231-256.