1
Speaker Classification through Deep Learning
Jacob Morris, Alex Douglass, Luke Woodbury
2
Overview
Goals: Learn more about deep learning! Create a neural network that will classify voice recordings based on gender, age, natural language, etc.
Potential Applications: Research, Security
3
Software Dependencies
Python 2.7
Keras 1.2.2
Theano
Matplotlib
4
Hardware
GeForce Titan X (Pascal), 12 GB memory
5
Speech Accent Archive
WAV files: 2300+ different speakers, all recorded speaking the same paragraph
Categorizations: Age, Gender, English Residence, Natural Language, Country, Learning Style, etc.
6
The Essence of Deep Learning
7
Artificial Neural Networks (ANN)
8
Recurrent Networks
Layer "remembers" data
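The idea of a layer that "remembers" can be sketched with a single recurrent unit whose state feeds back into its next step. This is a minimal illustration in plain Python, not the network used in the project; the weights `w_x`, `w_h`, `b` and the function names are made up for the example.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One step of a minimal recurrent unit.

    The new state mixes the current input with the previous state,
    which is how earlier inputs keep influencing later outputs.
    """
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(sequence, w_x=0.5, w_h=0.9, b=0.0):
    """Feed a sequence through the unit and return the state after each step."""
    h = 0.0  # the state starts empty
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

# Illustrative weights only: an input of 1.0 followed by silence.
states = run_rnn([1.0, 0.0, 0.0])
```

Even though the second and third inputs are zero, the state stays positive and fades gradually, because each step still sees the previous state.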
9
LSTM
Long Short-Term Memory
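For readers unfamiliar with LSTM, one step of the standard cell can be sketched as below. This uses scalar values for readability (real layers use vectors and weight matrices), and the parameter dictionary `p` and its gate names are hypothetical choices for this example, not something taken from the project.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell.

    p maps a gate name to a (w_x, w_h, b) tuple (hypothetical layout).
    Returns the new hidden state h and the new cell state c.
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))        # how much new information to write
    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    g = math.tanh(pre("candidate"))  # the new information itself
    c = f * c_prev + i * g           # long-term memory update
    h = o * math.tanh(c)             # short-term output
    return h, c

# Illustrative parameters: forget gate wide open, input gate closed,
# so the cell state is carried forward almost unchanged.
params = {
    "input": (0.0, 0.0, -20.0),
    "forget": (0.0, 0.0, 20.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = lstm_step(1.0, 0.0, 0.7, params)
```

The gates are what let the cell hold a value over many steps or overwrite it, rather than always blending old and new state as a plain recurrent unit does.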
10
Problem Type
Sequence Classification: Assign classification label(s) to input sequences
Supervised Learning: Each training sample includes the correct output for that sample
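A supervised training pair for sequence classification can be pictured as an input sequence plus a one-hot target for its category. The sketch below is illustrative only: the class list `GENDERS` and the amplitude values are made up, and the slides do not show how labels were actually encoded.

```python
def one_hot(label, classes):
    """Return a list with 1.0 at the position of `label` and 0.0 elsewhere."""
    if label not in classes:
        raise ValueError("unknown label: %r" % (label,))
    return [1.0 if c == label else 0.0 for c in classes]

# Hypothetical category values, for illustration only.
GENDERS = ["female", "male"]

# One supervised training pair: the input sequence and its correct output.
sequence = [0.0, 0.12, -0.05, 0.3]  # made-up amplitudes
target = one_hot("male", GENDERS)
```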
11
Variations of Model Topologies
Inputs: Sequence of amplitudes; Discrete Fourier transform of the segment
Hidden Layers: Variable
Outputs: Any subset of data categories
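The second input type, the discrete Fourier transform of a segment, can be sketched as follows. The slides do not say how the transform was computed or which part of the spectrum was fed to the network; this version assumes magnitudes of the bins up to the Nyquist bin and uses a naive O(n^2) loop for clarity, where a real pipeline would more likely call an FFT routine.

```python
import cmath
import math

def dft_magnitudes(segment):
    """Magnitude of each DFT bin from 0 up to the Nyquist bin of a real-valued segment."""
    n = len(segment)
    mags = []
    for k in range(n // 2 + 1):
        total = 0j
        for t, x in enumerate(segment):
            total += x * cmath.exp(-2j * math.pi * k * t / n)
        mags.append(abs(total))
    return mags

# A made-up segment: two full sine cycles across 16 points.
n = 16
segment = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
mags = dft_magnitudes(segment)
peak_bin = max(range(len(mags)), key=lambda k: mags[k])
```

For this test signal the energy lands in bin 2, matching the two cycles in the segment, which is the kind of frequency information the raw amplitudes only carry implicitly.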
12
Training Challenges
Process of exploration
Many parameters to tune
Results vague, must be interpreted
Days required to train a new model
13
Terminology
Sample: Base unit of training data; 1/100 of a second of audio
Batch: Group of samples; 4 seconds of consecutive samples
Epoch: Number of batches required to train on entire training data set; in our case, 2310 batches
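The sizes implied by these definitions can be worked out directly. The sample length, batch length, and batches per epoch come from the slide; the audio sampling rate `ASSUMED_RATE_HZ` is an assumption made only for this example, since the slides do not state it, and `split_into_samples` is a hypothetical helper.

```python
SAMPLES_PER_SECOND = 100    # from the slide: a sample is 1/100 of a second
BATCH_SECONDS = 4           # from the slide
BATCHES_PER_EPOCH = 2310    # from the slide
ASSUMED_RATE_HZ = 44100     # assumption: not stated in the slides

samples_per_batch = SAMPLES_PER_SECOND * BATCH_SECONDS
amplitudes_per_sample = ASSUMED_RATE_HZ // SAMPLES_PER_SECOND
audio_seconds_per_epoch = BATCHES_PER_EPOCH * BATCH_SECONDS

def split_into_samples(waveform, size):
    """Cut a waveform into consecutive, non-overlapping samples of `size` amplitudes.

    Any leftover tail shorter than `size` is dropped.
    """
    count = len(waveform) // size
    return [waveform[i * size:(i + 1) * size] for i in range(count)]

# Half a second of made-up silence at the assumed rate.
samples = split_into_samples([0.0] * (ASSUMED_RATE_HZ // 2), amplitudes_per_sample)
```

Under these numbers a batch holds 400 consecutive samples, and one epoch covers 9240 seconds (154 minutes) of audio.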
15
Loss
Measure of how close an output signal is to its expected value
Categorical Cross Entropy: Emphasizes correct answer
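Categorical cross entropy can be written out in a few lines, which also shows why it "emphasizes the correct answer": with a one-hot target, only the probability assigned to the correct class enters the loss. This is a plain-Python sketch of the standard formula, not the project's code; the `eps` clamp is an added safeguard against taking the log of zero.

```python
import math

def softmax(logits):
    """Turn raw output values into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(target, predicted, eps=1e-12):
    """-sum(t * log(p)) over classes, for a one-hot target and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))

# The correct class is index 1 in every case below.
confident = categorical_cross_entropy([0.0, 1.0], [0.1, 0.9])
unsure = categorical_cross_entropy([0.0, 1.0], [0.5, 0.5])
wrong = categorical_cross_entropy([0.0, 1.0], [0.9, 0.1])
```

The loss grows as the probability on the correct class shrinks, regardless of how the remaining probability is spread over the wrong classes.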
16
Learning Rate
Determines how large an adjustment to make to the weights for a given loss value
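The effect of the learning rate is easiest to see on a toy problem. The sketch below applies plain gradient descent to the loss (w - 3)^2, which has nothing to do with the speaker model itself; the function names and the three rates are chosen purely for illustration.

```python
def sgd_step(weight, gradient, learning_rate):
    """Move the weight against the gradient, scaled by the learning rate."""
    return weight - learning_rate * gradient

def minimise(learning_rate, steps=20, start=0.0):
    """Minimise the toy loss (w - 3)^2, whose gradient is 2 * (w - 3)."""
    w = start
    for _ in range(steps):
        w = sgd_step(w, 2.0 * (w - 3.0), learning_rate)
    return w

too_small = minimise(0.01)  # still far from 3 after 20 steps
reasonable = minimise(0.1)  # close to 3
too_large = minimise(1.1)   # overshoots further on every step
```

A rate that is too small wastes training time, while one that is too large makes each adjustment overshoot, which is one reason the rate is among the parameters that need tuning.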
17
Accuracy
A prediction is considered correct if the expected output neuron's activation value is the greatest among all neurons for that category
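This accuracy rule amounts to an argmax comparison within one category. A minimal sketch, with hypothetical function names and made-up activation values; how ties were handled in the project is not stated, and here a tie goes to the lowest index.

```python
def is_correct(activations, expected_index):
    """True when the expected neuron has the largest activation in its category."""
    best = max(range(len(activations)), key=lambda i: activations[i])
    return best == expected_index

def accuracy(batch_activations, expected_indices):
    """Fraction of outputs whose strongest neuron is the expected one."""
    hits = sum(1 for a, e in zip(batch_activations, expected_indices)
               if is_correct(a, e))
    return hits / float(len(expected_indices))

# Three made-up outputs for a two-class category.
acc = accuracy([[0.1, 0.9], [0.8, 0.2], [0.4, 0.6]], [1, 0, 0])
```

Note that this measure ignores how confident the network was: an activation of 0.51 against 0.49 counts the same as 0.99 against 0.01, which is why accuracy and loss can move differently during training.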
18
Initial Attempts
Features: Short sample lengths; WAV inputs only; Trained on training set of only 2 speakers
19
Results
20
False Hope
Features: Short sample lengths; Trained on training set of only 2 speakers
Changes: Both input types
21
Results
22
Hope
Features: Short sample lengths; Both input types; Trained on training set of only 2 speakers
Changes: Train on single batch per speaker per pass through training set; Reduced learning rate
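Training on a single batch per speaker per pass could look roughly like the generator below. The slides do not say how the batch was chosen for each speaker, so the random choice, the shuffled speaker order, and the `batches_by_speaker` layout are all assumptions made for this sketch.

```python
import random

def one_batch_per_speaker(batches_by_speaker, rng):
    """Yield (speaker, batch) pairs for one pass through the training set.

    Each speaker contributes exactly one batch per pass, picked at random
    (an assumption), and speakers are visited in shuffled order.
    """
    speakers = sorted(batches_by_speaker)
    rng.shuffle(speakers)
    for speaker in speakers:
        yield speaker, rng.choice(batches_by_speaker[speaker])

# Made-up data: speaker "a" has three batches of audio, speaker "b" only one.
data = {"a": ["a0", "a1", "a2"], "b": ["b0"]}
picks = list(one_batch_per_speaker(data, random.Random(0)))
```

One plausible effect of this scheme is that every speaker is seen equally often per pass, however long each recording is.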
23
Results
24
Confirmation
Features: Short sample lengths; Both input types; Train on single batch per speaker per pass through training set
Changes: Trained on full training set of speakers
25
Results
26
Refinement
Features: Short sample lengths; Both input types; Train on single batch per speaker per pass through training set
Changes: True validation; Decaying learning rate; Epoch duration increased
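Two of these changes can be sketched in code. The slides do not specify the decay schedule or how validation data was chosen, so both functions are assumptions: a time-based decay of the form rate / (1 + decay * step), and a validation set made of speakers held out of training entirely.

```python
import random

def decayed_learning_rate(initial_rate, decay, step):
    """Time-based decay: the rate shrinks as the step count grows (assumed schedule)."""
    return initial_rate / (1.0 + decay * step)

def split_speakers(speaker_ids, validation_fraction, rng):
    """Hold out a fraction of speakers for validation (assumed approach).

    Returns (training_ids, validation_ids) with no speaker in both.
    """
    ids = sorted(speaker_ids)
    rng.shuffle(ids)
    n_val = int(len(ids) * validation_fraction)
    return ids[n_val:], ids[:n_val]

# Illustrative values only.
rates = [decayed_learning_rate(0.01, 0.01, s) for s in (0, 100, 300)]
train_ids, val_ids = split_speakers(range(10), 0.2, random.Random(1))
```

Holding out whole speakers means validation accuracy reflects voices the network has never heard, rather than unseen audio from familiar voices.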
27
Results
28
UI
29
Conclusion
30
Works Cited
Weinberger, Steven. (2015). Speech Accent Archive. George Mason University. Retrieved from