Structure-Based Speech Classification Using State-Space Embedding
Uchechukwu Ofoegbu
Advisor: Dr. Robert E. Yantorno
Committee: Dr. Saroj K. Biswas, Dr. Henry M. Sendaula
Monday, November 29, 2004

Speaker notes: My thesis research is related to speaker identification (SID) system enhancement techniques using the SID-usable speech concept. Speech has been modeled in this research using the sinusoidal model. The idea of SID-usable speech was introduced by a previous master's student, Ananth Iyer, who proposed a system in which the SID system itself is used as ground truth to identify what is usable and unusable to it.
Acknowledgment
Dr. Robert Yantorno
Dr. Saroj Biswas
Dr. Henry Sendaula
Speech Lab Members
Air Force Research Laboratory, Rome, NY
Overview
Usable Speech
Voiced Speech
State-Space Embedding
Research Goals
Usable Speech Detection
Voiced Speech Detection
Conclusion

Speaker notes: I first give a brief introduction to what 'usable speech' is and how the usable speech detection system is designed, then introduce my research goals. I talk about the TIR-based usable speech measure that I have developed, then the SID-usable speech approach and some of the proposed SID enhancement techniques.
Usable Speech
TIR-Based Usable Speech
TIR-Based Results
Next-Generation Co-Channel Speech Processing System
[Block diagram: co-channel speech → usable speech extraction → usable segments from Speaker 1 and Speaker 2 → sub-unit reconstruction → reconstructed speech from Speaker 1 and from Speaker 2.]
Voiced Speech
Voiced/Unvoiced Characteristics
Voiced
Quasi-periodic excitation
Modulation by vocal tract
Production of vowels, voiced fricatives and plosives
Unvoiced
No periodic vibration of vocal cords
Noise-like nature
Production of unvoiced fricatives and plosives
State-Space Embedding
Nonlinearities in Speech
Glottal waveform changes
Shape varies with amplitude
Physical observations
Flow in vocal tract is non-laminar
Coupling between vocal tract and folds
When glottis is open, prominent changes are observed in formant characteristics
State-Space Embedding
Nonlinear Systems
Point moving along some trajectory in an abstract state space
Coordinates of the point are independent degrees of freedom of the system
State space could be reconstructed from a scalar signal
State-Space Embedding
Takens' Method of Delays
A state-space representation topologically equivalent to the original state space of a system can be reconstructed from a single observable dimension
Vectors in m-dimensional state space are formed from time-delayed values of a signal
State-Space Embedding (cont’d)
m = embedding dimension
d = delay value

Speaker notes: Talk about usable speech and the use of TIR as the ground truth for usable speech detection. Then talk about how the extracted usable speech is used to determine the speaker's identity from the speaker identification system.
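For reference, in terms of the m and d defined above, the standard Takens delay-vector construction for a speech signal s(n) is:

x(n) = [ s(n), s(n + d), s(n + 2d), ..., s(n + (m - 1)d) ]

Each embedded point therefore gathers m samples of the frame spaced d samples apart, tracing out the state-space trajectory used in the following slides.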
State-Space Embedding
Delay value, d:
Dependent on sampling rate and signal properties
Large enough such that nonlinearities are taken into account by the reconstructed trajectory
Small enough to retain reasonable time resolution
State-Space Embedding
Dimension, m:
Generation of voiced speech constitutes a low-dimensional system
Generation of unvoiced speech constitutes a relatively high-dimensional system
Using a low dimension (such as m = 3) sufficiently reconstructs voiced but not unvoiced speech
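As an illustrative sketch only (not the implementation used in this work), the following Python snippet builds such a low-dimensional delay embedding of a single speech frame; the function name embed, the synthetic frame, and the choices m = 3 and d = 10 are placeholder assumptions.

import numpy as np

def embed(signal, m=3, d=10):
    # Build delay vectors [s(n), s(n + d), ..., s(n + (m - 1) * d)].
    # Returns an array of shape (len(signal) - (m - 1) * d, m).
    signal = np.asarray(signal, dtype=float)
    n_vectors = len(signal) - (m - 1) * d
    if n_vectors <= 0:
        raise ValueError("frame too short for this choice of m and d")
    return np.column_stack([signal[i * d : i * d + n_vectors] for i in range(m)])

# Example: embed one 256-sample frame of a synthetic periodic ("voiced-like") signal.
t = np.arange(256)
frame = np.sin(2 * np.pi * 100 * t / 8000)   # 100 Hz tone sampled at 8 kHz
trajectory = embed(frame, m=3, d=10)
print(trajectory.shape)                      # -> (236, 3)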
Recent Applications of State-Space Embedding in Speech Processing
Pitch detection: Terez, D. E., "Robust Pitch Determination Using Nonlinear State-Space Embedding", ICASSP, 2002.
Automatic speech segmentation using curvature: Smolenski, B. Y., "A Filterless Approach to Processing Speech in Degraded Environments", Dissertation Proposal, 2004.
Research Goals
[Diagram: speech → state-space embedding → structure observation; structured frames correspond to usable and voiced speech, unstructured frames to unusable and unvoiced speech.]
Usable Speech Detection
Usable and Unusable Speech
Embedded Usable and Unusable Speech
Observable Difference
Usable speech signal is less dense than unusable
Measure
Nodal Density (ND) Measure
Nodal Density (ND) Measure
Nodal Density
Smallest cube which encloses the signal is determined
This cube is divided into N smaller cubes
Edges of the smaller cubes are defined as nodes
Number of nodes spanned by the signal is determined
Ratio of number of nodes spanned to total number of nodes is defined as nodal density
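A rough Python sketch of this counting procedure is given below. It treats the N smaller cubes as grid cells and counts the fraction of cells the embedded trajectory passes through; whether the original work counts cells or their corner nodes is an assumption here, and the 7-per-axis grid (N = 7³ = 343) follows the detection-procedure slide later in the deck. The function name nodal_density is illustrative.

import numpy as np

def nodal_density(trajectory, cubes_per_axis=7):
    # trajectory: (n_points, 3) array of embedded state-space points.
    # 1. Find the smallest cube enclosing the trajectory.
    lo, hi = trajectory.min(), trajectory.max()
    edge = max(hi - lo, np.finfo(float).eps)
    # 2. Divide it into cubes_per_axis**3 smaller cubes and map each point
    #    to the index of the small cube containing it.
    idx = np.floor((trajectory - lo) / edge * cubes_per_axis).astype(int)
    idx = np.clip(idx, 0, cubes_per_axis - 1)
    # 3. Count the distinct small cubes spanned by the trajectory.
    spanned = {tuple(row) for row in idx}
    # 4. Nodal density = cubes spanned / total cubes.
    return len(spanned) / cubes_per_axis ** 3

A frame dominated by a single talker traces a thin, well-organized trajectory and spans relatively few cubes, whereas overlapped (unusable) co-channel speech fills the enclosing cube more densely.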
Embedded Usable and Unusable Speech Frames with Grids
Nodes Spanned by Embedded Usable and Unusable Speech Frames
[Figure: nodes spanned by embedded co-channel speech at 30 dB TIR.]
ND Distribution
Usable Speech Detection Procedure
[Block diagram: target + interferer (co-channel speech) → voiced speech extractor → framing → nonlinear embedding → cubing (N = 7³ = 343) → compute nodal density → usable / unusable decision.]
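Continuing from the embed() and nodal_density() sketches above, a hypothetical per-frame decision along the lines of this procedure might look as follows; the function name classify_usability and the threshold value are placeholders, not the operating point used in the thesis.

ND_THRESHOLD = 0.15   # placeholder; the actual threshold comes from training data

def classify_usability(frame):
    # Embed the voiced co-channel frame and threshold its nodal density.
    trajectory = embed(frame, m=3, d=10)                    # embed() from the earlier sketch
    density = nodal_density(trajectory, cubes_per_axis=7)   # N = 7**3 = 343 cubes
    return "usable" if density < ND_THRESHOLD else "unusable"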
ND-Based Usable Speech Detection Results
Result Comparison
Voiced Speech Detection
Voiced and Unvoiced Speech
Embedded Voiced and Unvoiced Speech
Observable Differences
Rate of change of the unvoiced signal is faster than that of the voiced signal
Voiced signal is less dense than unvoiced
Measures
Difference-Mean Comparison (DMC) Measure
Nodal Density (ND) Measure
Difference-Mean Comparison (DMC) Measure
Difference-Mean Comparison
3rd-order difference computation along the first non-singleton dimension
1st-order difference of an N×N matrix X given by ΔX(i, j) = X(i+1, j) − X(i, j)
Length(3rd-order diff. > mean) observed
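A minimal sketch of this computation is shown below. It assumes the embedded frame is an (n_points × m) matrix, that the 3rd-order difference is the ordinary difference operator applied three times along the time axis, and that the comparison mean is taken over the first embedding dimension (the procedure slide that follows suggests this, but it is an assumption); the function name dmc is illustrative.

import numpy as np

def dmc(trajectory):
    # trajectory: (n_points, m) delay-embedded speech frame.
    # 3rd-order difference along the first (non-singleton) dimension.
    d3 = np.diff(trajectory, n=3, axis=0)
    # Mean of the first embedding dimension, used as the comparison level.
    reference = trajectory[:, 0].mean()
    # "Length(3rd-order diff. > mean)": count of entries exceeding the mean.
    return int(np.count_nonzero(d3 > reference))

Because unvoiced speech changes much faster from sample to sample, this count tends to be larger for unvoiced than for voiced frames, which is why the procedure on the next slide labels a frame voiced when the value falls below a threshold.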
DMC Procedure
[Block diagram: speech → lowpass filtering → state-space embedding → 3rd-order difference computation → comparison with the mean of the first dimension → voiced (< threshold) / unvoiced (> threshold).]
DMC Results
Result Comparison
Nodal Density (ND) Measure
Nodal Density
Smallest cube which encloses the signal is determined
This cube is divided into N smaller cubes
Edges of the smaller cubes are defined as nodes
Number of nodes spanned by the signal is determined
Ratio of number of nodes spanned to total number of nodes is defined as nodal density
Embedded Voiced and Unvoiced Speech Frames with Grids
Nodes Spanned by Embedded Voiced and Unvoiced Speech Frames
Nodal Density Procedure
[Block diagram: speech → lowpass filtering → state-space embedding → estimation of the largest cube spanned → cubing (N = 1000) → computation of nodal density → voiced (< threshold) / unvoiced (> threshold).]
Nodal Density Results
Result Comparison
Comparison of ND and DMC Measures
Fusion of Voiced Speech Detection Measures
Why fusion?
Different features can provide complementary information.
Different classifiers can produce different decisions.
The best classifier can make an error on a frame that an inferior classifier classifies correctly.
Levels of Fusion
Data-level fusion
Feature-level fusion
Decision-level fusion
Mutual Information
p(c,y) = joint probability mass function of C and Y
p(c) and p(y) = marginal probability mass functions
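Given these definitions, the mutual information between the class C and a measure Y takes the standard discrete form:

I(C; Y) = Σ_c Σ_y p(c, y) · log( p(c, y) / ( p(c) · p(y) ) )

It quantifies how much observing the measure Y reduces the uncertainty about the class C; a value of zero means the two are statistically independent.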
Mutual Information
[Table of mutual information values for the measures E, ZC, FR, RE, DMC and ND: 0.18, 0.31, 0.05, 1.21, 0.20, 0.22, 0.28, 0.10, 0.09, 1.78.]
Result Comparison - DMC
Result Comparison - ND
Summary
Usable Speech Detection
Nonlinear reconstruction of co-channel speech enhances discrimination between usable and unusable speech.
The nodal density measure outperforms existing TIR-based usable speech detection measures.
Voiced Speech Detection
Two structure-based measures have been developed, which show an improvement over traditional measures in voiced speech detection under high-noise conditions.
Fusion of voiced speech detection measures further increases voiced speech detection accuracy.
Further Research
Usable Speech Detection
Evaluate performance of usable speech detection with noisy co-channel speech.
Fuse the ND measure with existing usable speech detection measures such as APPC and SAPVR.
Voiced Speech Detection
Employ more advanced fusion techniques such as independent component analysis.
Further enhance voiced speech detection under very high-noise conditions by performing adaptive filtering of noisy signals.
Publications
U. Ofoegbu, B. Smolenski and R. Yantorno, "Structure-Based Voiced/Usable Speech Detection Using State-Space Embedding", IEEE International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), 2004.
B. Smolenski, U. Ofoegbu and R. Yantorno, "Nonlinear State-Space Embedding Features and Their Application to Robust Speech Segmentation", IEEE International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), 2004.
Puzzled? Perplexed?? Baffled??? Mystified???? Please feel FREE to ask QUESTIONS !!!
EXTRA SLIDES
Experimental Set-Up
41 speech utterances (TIMIT database)
nCr = n!/(r!(n-r)!) = 861 co-channel combinations
Scaled and combined at 0 dB TIR
Broken down into frames of 256 samples
Voiced frames extracted
Training – 430 co-channel combinations
Testing – 861 co-channel combinations
DMC Distributions
DMC Distributions with Filtering
Experimental Set-Up
25 speech utterances (TIMIT database)
12 male files and 13 female files
Lowpass filter used as pre-processing block
Each file broken down into frames of 128 samples
Results
Results
Result Comparison
ND Distributions with Filtering
DMC Distributions with Filtering
Varying N
Experimental Set-Up
25 speech utterances (TIMIT database)
12 male files and 13 female files
Lowpass filter used as pre-processing block
Each file broken down into frames of 128 samples
Results
Results
Result Comparison
Comparison of ND and DMC Measures
Fusion
New measures fused with residual energy (RE) measure.
Decision-level fusion performed:
If ((measure1 < threshold1) & (measure2 < threshold2))
    Speech frame = voiced
Else
    Speech frame != voiced
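A direct Python rendering of this AND rule might look as follows; the function name fused_voicing_decision and the example thresholds are illustrative placeholders, not the trained operating points.

def fused_voicing_decision(measure1, measure2, threshold1, threshold2):
    # Decision-level fusion: both measures must indicate "voiced".
    return measure1 < threshold1 and measure2 < threshold2

# Example: fuse a structure-based measure (e.g. ND) with the RE measure.
is_voiced = fused_voicing_decision(0.08, 0.30, threshold1=0.15, threshold2=0.50)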
Summary
[Diagram: speech → state-space embedding → Difference-Mean Comparison and Nodal Density measures → usable speech detection and voiced speech detection.]