Structure-Based Speech Classification Using State-Space Embedding
Uchechukwu Ofoegbu
Advisor: Dr. Robert E. Yantorno
Committee: Dr. Saroj K. Biswas, Dr. Henry M. Sendaula
Monday, November 29, 2004

Speaker notes: My thesis research is related to speaker identification (SID) system enhancement techniques using the SID-usable speech concept. Speech has been modeled in this research using the sinusoidal model. The idea of SID-usable speech was introduced by a previous master's student, Ananth Iyer, who proposed a system in which the SID system itself is used as ground truth to identify what is usable and unusable to it.
Acknowledgment
Dr. Robert Yantorno
Dr. Saroj Biswas
Dr. Henry Sendaula
Speech Lab Members
Air Force Research Laboratory, Rome, NY
Overview
Usable Speech
Voiced Speech
State-Space Embedding
Research Goals
Usable Speech Detection
Voiced Speech Detection
Conclusion

Speaker notes: I first give a brief introduction to what 'usable speech' is and how the usable speech detection system is designed, then introduce my research goals. I talk about the TIR-based usable speech measure that I have developed, then the SID-usable speech approach and some of the proposed SID enhancement techniques.
Usable Speech
TIR-Based Usable Speech
TIR-Based Results
Next-Generation Co-Channel Speech Processing System
[Block diagram: co-channel speech → usable speech extraction → usable segments from Speaker 1 and Speaker 2 → sub-unit reconstruction → reconstructed speech from Speaker 1 and from Speaker 2.]
Voiced Speech
Voiced/Unvoiced Characteristics
Voiced
Quasi-periodic excitation
Modulation by vocal tract
Production of vowels, voiced fricatives and plosives
Unvoiced
No periodic vibration of vocal cords
Noise-like nature
Production of unvoiced fricatives and plosives
State-Space Embedding
Nonlinearities in Speech
Glottal waveform changes
Shape varies with amplitude
Physical observations
Flow in vocal tract is non-laminar
Coupling between vocal tract and folds
When glottis is open, prominent changes are observed in formant characteristics
State-Space Embedding
Nonlinear Systems
Point moving along some trajectory in an abstract state space
Coordinates of the point are independent degrees of freedom of the system
State space could be reconstructed from a scalar signal
State-Space Embedding
Takens' Method of Delays
A state-space representation topologically equivalent to the original state space of a system can be reconstructed from a single observable dimension
Vectors in m-dimensional state space are formed from time-delayed values of a signal
State-Space Embedding (cont’d)
m = embedding dimension
d = delay value

Speaker notes: Talk about usable speech and the use of TIR as the ground truth for usable speech detection. Then talk about how the extracted usable speech is used to determine the speaker's identity from the speaker identification system.
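For reference, in terms of the m and d defined above, the standard Takens delay-vector construction for a speech signal s(n) is:

x(n) = [ s(n), s(n + d), s(n + 2d), ..., s(n + (m - 1)d) ]

Each embedded point therefore gathers m samples of the frame spaced d samples apart, tracing out the state-space trajectory used in the following slides.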
State-Space Embedding
Delay value, d:
Dependent on sampling rate and signal properties
Large enough such that nonlinearities are taken into account by the reconstructed trajectory
Small enough to retain reasonable time resolution
State-Space Embedding
Dimension, m:
Generation of voiced speech constitutes a low-dimensional system
Generation of unvoiced speech constitutes a relatively high-dimensional system
Using a low dimension (such as m = 3) sufficiently reconstructs voiced but not unvoiced speech
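As an illustrative sketch only (not the implementation used in this work), the following Python snippet builds such a low-dimensional delay embedding of a single speech frame; the function name embed, the synthetic frame, and the choices m = 3 and d = 10 are placeholder assumptions.

import numpy as np

def embed(signal, m=3, d=10):
    # Build delay vectors [s(n), s(n + d), ..., s(n + (m - 1) * d)].
    # Returns an array of shape (len(signal) - (m - 1) * d, m).
    signal = np.asarray(signal, dtype=float)
    n_vectors = len(signal) - (m - 1) * d
    if n_vectors <= 0:
        raise ValueError("frame too short for this choice of m and d")
    return np.column_stack([signal[i * d : i * d + n_vectors] for i in range(m)])

# Example: embed one 256-sample frame of a synthetic periodic ("voiced-like") signal.
t = np.arange(256)
frame = np.sin(2 * np.pi * 100 * t / 8000)   # 100 Hz tone sampled at 8 kHz
trajectory = embed(frame, m=3, d=10)
print(trajectory.shape)                      # -> (236, 3)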
Recent Applications of State-Space Embedding in Speech Processing
Pitch detection: Terez, D. E., "Robust Pitch Determination Using Nonlinear State-Space Embedding", ICASSP, 2002.
Automatic speech segmentation using curvature: Smolenski, B. Y., "A Filterless Approach to Processing Speech in Degraded Environments", Dissertation Proposal, 2004.
Research Goals
[Diagram: speech → state-space embedding → structure observation; structured frames correspond to usable and voiced speech, unstructured frames to unusable and unvoiced speech.]
Usable Speech Detection
Usable and Unusable Speech
Embedded Usable and Unusable Speech
Observable Difference
Usable speech signal is less dense than unusable
Measure
Nodal Density (ND) Measure
Nodal Density (ND) Measure
Nodal Density
Smallest cube which encloses the signal is determined
This cube is divided into N smaller cubes
Edges of the smaller cubes are defined as nodes
Number of nodes spanned by the signal is determined
Ratio of number of nodes spanned to total number of nodes is defined as nodal density
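A rough Python sketch of this counting procedure is given below. It treats the N smaller cubes as grid cells and counts the fraction of cells the embedded trajectory passes through; whether the original work counts cells or their corner nodes is an assumption here, and the 7-per-axis grid (N = 7³ = 343) follows the detection-procedure slide later in the deck. The function name nodal_density is illustrative.

import numpy as np

def nodal_density(trajectory, cubes_per_axis=7):
    # trajectory: (n_points, 3) array of embedded state-space points.
    # 1. Find the smallest cube enclosing the trajectory.
    lo, hi = trajectory.min(), trajectory.max()
    edge = max(hi - lo, np.finfo(float).eps)
    # 2. Divide it into cubes_per_axis**3 smaller cubes and map each point
    #    to the index of the small cube containing it.
    idx = np.floor((trajectory - lo) / edge * cubes_per_axis).astype(int)
    idx = np.clip(idx, 0, cubes_per_axis - 1)
    # 3. Count the distinct small cubes spanned by the trajectory.
    spanned = {tuple(row) for row in idx}
    # 4. Nodal density = cubes spanned / total cubes.
    return len(spanned) / cubes_per_axis ** 3

A frame dominated by a single talker traces a thin, well-organized trajectory and spans relatively few cubes, whereas overlapped (unusable) co-channel speech fills the enclosing cube more densely.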
Embedded Usable and Unusable Speech Frames with Grids
Nodes Spanned by Embedded Usable and Unusable Speech Frames
[Figure: nodes spanned by embedded co-channel speech at 30 dB TIR.]
ND Distribution
Usable Speech Detection Procedure
[Block diagram: target + interferer (co-channel speech) → voiced speech extractor → framing → nonlinear embedding → cubing (N = 7³ = 343) → compute nodal density → usable / unusable decision.]
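Continuing from the embed() and nodal_density() sketches above, a hypothetical per-frame decision along the lines of this procedure might look as follows; the function name classify_usability and the threshold value are placeholders, not the operating point used in the thesis.

ND_THRESHOLD = 0.15   # placeholder; the actual threshold comes from training data

def classify_usability(frame):
    # Embed the voiced co-channel frame and threshold its nodal density.
    trajectory = embed(frame, m=3, d=10)                    # embed() from the earlier sketch
    density = nodal_density(trajectory, cubes_per_axis=7)   # N = 7**3 = 343 cubes
    return "usable" if density < ND_THRESHOLD else "unusable"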
ND-Based Usable Speech Detection Results
Result Comparison
Voiced Speech Detection
Voiced and Unvoiced Speech
Embedded Voiced and Unvoiced Speech
Observable Differences
Rate of change of the unvoiced signal is faster than that of the voiced signal
Voiced signal is less dense than unvoiced
Measures
Difference-Mean Comparison (DMC) Measure
Nodal Density (ND) Measure
Difference-Mean Comparison (DMC) Measure
Difference-Mean Comparison
3rd-order difference computation along the first non-singleton dimension
1st-order difference of an N×N matrix X given by ΔX(i, j) = X(i+1, j) − X(i, j)
Length(3rd-order diff. > mean) observed
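A minimal sketch of this computation is shown below. It assumes the embedded frame is an (n_points × m) matrix, that the 3rd-order difference is the ordinary difference operator applied three times along the time axis, and that the comparison mean is taken over the first embedding dimension (the procedure slide that follows suggests this, but it is an assumption); the function name dmc is illustrative.

import numpy as np

def dmc(trajectory):
    # trajectory: (n_points, m) delay-embedded speech frame.
    # 3rd-order difference along the first (non-singleton) dimension.
    d3 = np.diff(trajectory, n=3, axis=0)
    # Mean of the first embedding dimension, used as the comparison level.
    reference = trajectory[:, 0].mean()
    # "Length(3rd-order diff. > mean)": count of entries exceeding the mean.
    return int(np.count_nonzero(d3 > reference))

Because unvoiced speech changes much faster from sample to sample, this count tends to be larger for unvoiced than for voiced frames, which is why the procedure on the next slide labels a frame voiced when the value falls below a threshold.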
DMC Procedure
[Block diagram: speech → lowpass filtering → state-space embedding → 3rd-order difference computation → comparison with the mean of the first dimension → voiced (< threshold) / unvoiced (> threshold).]
DMC Results
Result Comparison
Nodal Density (ND) Measure
Nodal Density
Smallest cube which encloses the signal is determined
This cube is divided into N smaller cubes
Edges of the smaller cubes are defined as nodes
Number of nodes spanned by the signal is determined
Ratio of number of nodes spanned to total number of nodes is defined as nodal density
Embedded Voiced and Unvoiced Speech Frames with Grids
Nodes Spanned by Embedded Voiced and Unvoiced Speech Frames
Nodal Density Procedure
[Block diagram: speech → lowpass filtering → state-space embedding → estimation of the largest cube spanned → cubing (N = 1000) → computation of nodal density → voiced (< threshold) / unvoiced (> threshold).]
Nodal Density Results
Result Comparison
Comparison of ND and DMC Measures
Fusion of Voiced Speech Detection Measures
Why fusion?
Different features can provide complementary information.
Different classifiers can produce different decisions.
The best classifier can make an error on a frame that an inferior classifier classifies correctly.
Levels of Fusion
Data-level fusion
Feature-level fusion
Decision-level fusion
Mutual Information
p(c,y) = joint probability mass function of C and Y
p(c) and p(y) = marginal probability mass functions
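Given these definitions, the mutual information between the class C and a measure Y takes the standard discrete form:

I(C; Y) = Σ_c Σ_y p(c, y) · log( p(c, y) / ( p(c) · p(y) ) )

It quantifies how much observing the measure Y reduces the uncertainty about the class C; a value of zero means the two are statistically independent.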
Mutual Information
[Table of mutual information values for the measures E, ZC, FR, RE, DMC and ND: 0.18, 0.31, 0.05, 1.21, 0.20, 0.22, 0.28, 0.10, 0.09, 1.78.]
Result Comparison - DMC
Result Comparison - ND
Summary
Usable Speech Detection
Nonlinear reconstruction of co-channel speech enhances discrimination between usable and unusable speech.
The nodal density measure outperforms existing TIR-based usable speech detection measures.
Voiced Speech Detection
Two structure-based measures have been developed, which show an improvement over traditional measures in voiced speech detection under high-noise conditions.
Fusion of voiced speech detection measures further increases voiced speech detection accuracy.
Further Research
Usable Speech Detection
Evaluate performance of usable speech detection with noisy co-channel speech.
Fuse the ND measure with existing usable speech detection measures such as APPC and SAPVR.
Voiced Speech Detection
Employ more advanced fusion techniques such as independent component analysis.
Further enhance voiced speech detection under very high-noise conditions by performing adaptive filtering of noisy signals.
Publications
U. Ofoegbu, B. Smolenski and R. Yantorno, "Structure-Based Voiced/Usable Speech Detection Using State-Space Embedding", IEEE International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), 2004.
B. Smolenski, U. Ofoegbu and R. Yantorno, "Nonlinear State-Space Embedding Features and Their Application to Robust Speech Segmentation", IEEE International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), 2004.
Puzzled? Perplexed?? Baffled??? Mystified???? Please feel FREE to ask QUESTIONS !!!
EXTRA SLIDES
Experimental Set-Up
41 speech utterances (TIMIT database)
nCr = n!/(r!(n-r)!) = 861 co-channel combinations
Scaled and combined at 0 dB TIR
Broken down into frames of 256 samples
Voiced frames extracted
Training – 430 co-channel combinations
Testing – 861 co-channel combinations
DMC Distributions
DMC Distributions with Filtering
Experimental Set-Up
25 speech utterances (TIMIT database)
12 male files and 13 female files
Lowpass filter used as pre-processing block
Each file broken down into frames of 128 samples
Results
Results
Result Comparison
ND Distributions with Filtering
DMC Distributions with Filtering
Varying N
Experimental Set-Up
25 speech utterances (TIMIT database)
12 male files and 13 female files
Lowpass filter used as pre-processing block
Each file broken down into frames of 128 samples
Results
Results
Result Comparison
Comparison of ND and DMC Measures
Fusion
New measures fused with residual energy (RE) measure.
Decision-level fusion performed:
If ((measure1 < threshold1) & (measure2 < threshold2))
    Speech frame = voiced
Else
    Speech frame != voiced
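A direct Python rendering of this AND rule might look as follows; the function name fused_voicing_decision and the example thresholds are illustrative placeholders, not the trained operating points.

def fused_voicing_decision(measure1, measure2, threshold1, threshold2):
    # Decision-level fusion: both measures must indicate "voiced".
    return measure1 < threshold1 and measure2 < threshold2

# Example: fuse a structure-based measure (e.g. ND) with the RE measure.
is_voiced = fused_voicing_decision(0.08, 0.30, threshold1=0.15, threshold2=0.50)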
Summary
[Diagram: speech → state-space embedding → Difference-Mean Comparison and Nodal Density measures → usable speech detection and voiced speech detection.]