Speaker Classification through Deep Learning
Jacob Morris, Alex Douglass, Luke Woodbury
Overview
- Goals
  - Learn more about deep learning!
  - Create a neural network that classifies voice recordings by gender, age, native language, etc.
- Potential Applications
  - Research
  - Security
Software Dependencies
- Python 2.7
- Keras 1.2.2
- Theano
- Matplotlib
Hardware
- GeForce Titan X (Pascal), 12 GB memory
- Image: https://6lli539m39y3hpkelqsm3c2fg-wpengine.netdna-ssl.com/wp-content/uploads/2016/08/Natoli-CPUvGPU-peak-DP-600x.png
Speech Accent Archive
- WAV files
  - 2300+ different speakers
  - All recorded speaking the same paragraph
- Categorizations
  - Age
  - Gender
  - English residence
  - Native language
  - Country
  - Learning style
  - Etc.
The Essence of Deep Learning
- Image: http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/img/spiral.1-2.2-2-2-2-2-2.gif
Artificial Neural Networks (ANN)
- Image: http://cs231n.github.io/assets/nn1/neural_net2.jpeg
Recurrent Networks
- Layer "remembers" data
- Image: http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png
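As a rough illustration (not the project's code), a recurrent layer "remembers" by feeding its previous hidden state back into the computation at each time step. A minimal NumPy sketch, with made-up dimensions:

```python
import numpy as np

# Hypothetical dimensions: 1 input feature per step, 8 hidden units.
n_in, n_hidden = 1, 8
W_x = np.random.randn(n_hidden, n_in) * 0.1      # input-to-hidden weights
W_h = np.random.randn(n_hidden, n_hidden) * 0.1  # hidden-to-hidden ("memory") weights
b = np.zeros(n_hidden)

def rnn_step(x_t, h_prev):
    """One unrolled step: the new state mixes the current input with
    the previous state, which is how the layer 'remembers' data."""
    return np.tanh(W_x.dot(x_t) + W_h.dot(h_prev) + b)

h = np.zeros(n_hidden)
for x_t in np.random.randn(400, n_in):  # e.g. 4 s of audio at 100 samples/s
    h = rnn_step(x_t, h)
```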
LSTM
- Long Short-Term Memory
- Image source: http://deephash.com/2016/10/16/lstm-journey-tensorflow/
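A minimal sketch of the kind of LSTM classifier this implies, written against the Keras 1.x API listed under the dependencies; the layer sizes, input shape, and class count are assumptions for illustration, not our actual topology:

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense

# Assumed shapes: 400 time steps (4 s at 100 samples/s), 1 amplitude
# value per step, 2 output classes (e.g. gender).
model = Sequential()
model.add(LSTM(64, input_shape=(400, 1)))   # LSTM layer carries long- and short-term state
model.add(Dense(2, activation='softmax'))   # one output neuron per class
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
```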
Problem Type
- Sequence Classification
  - Assign classification label(s) to input sequences
- Supervised Learning
  - Each training sample includes the correct output for that sample
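Concretely, each supervised training pair couples an input sequence with its known correct output; a hypothetical example with a one-hot label:

```python
import numpy as np

# One supervised training pair: input sequence + correct output.
x = np.random.randn(400, 1)   # stand-in for 4 s of audio amplitudes
y = np.array([1.0, 0.0])      # one-hot label, e.g. ['male', 'female']
```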
Variations of Model Topologies
- Inputs
  - Sequence of amplitudes
  - Discrete Fourier transform of the segment
- Hidden Layers
  - Variable
- Outputs
  - Any subset of data categories
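A sketch of the two input variants, assuming raw amplitudes read from a WAV file; `scipy.io.wavfile` (not in our dependency list) is one common way to read them, and NumPy's FFT gives the Fourier-transformed alternative:

```python
import numpy as np
from scipy.io import wavfile

rate, amplitudes = wavfile.read('speaker.wav')  # hypothetical file name

segment = amplitudes[:rate // 25]               # one short segment of the recording
# Input variant 1: the raw amplitude sequence.
raw_input = segment.astype('float32')
# Input variant 2: magnitudes of the discrete Fourier transform of the segment.
dft_input = np.abs(np.fft.rfft(raw_input))
```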
Training Challenges
- A process of exploration
- Many parameters to tune
- Results are vague and must be interpreted
- Days required to train a new model
Terminology
- Sample: the base unit of training data; 1/100 of a second of audio
- Batch: a group of samples; 4 seconds of consecutive samples
- Epoch: the number of batches required to train on the entire training data set; in our case, 2310 batches
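Under those definitions, and assuming 44.1 kHz WAV data, the slicing might look like the following sketch; the frame counts are derived from the slide, not taken from our code:

```python
import numpy as np

rate = 44100                     # assumed WAV sampling rate
frames_per_sample = rate // 100  # a sample is 1/100 s of audio
samples_per_batch = 400          # a batch is 4 s of consecutive samples

def to_batches(amplitudes):
    """Cut a recording into batches of consecutive 1/100 s samples."""
    frames_per_batch = frames_per_sample * samples_per_batch
    n = len(amplitudes) // frames_per_batch
    return amplitudes[:n * frames_per_batch].reshape(
        n, samples_per_batch, frames_per_sample)

batches = to_batches(np.zeros(rate * 20))  # e.g. a 20 s recording -> 5 batches
# One epoch = enough batches to cover the whole training set
# (2310 batches in our case).
```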
Loss
- A measure of how close an output signal is to its expected value
- Categorical cross-entropy
  - Emphasizes the correct answer
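For reference, categorical cross-entropy only scores the probability assigned to the true class, which is why it "emphasizes the correct answer"; a NumPy sketch:

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred):
    """y_true is one-hot, y_pred is a softmax output; only the predicted
    probability of the true class contributes to the loss."""
    return -np.sum(y_true * np.log(y_pred + 1e-8))

# A confident, correct prediction costs little...
print(categorical_cross_entropy(np.array([1, 0]), np.array([0.9, 0.1])))  # ~0.105
# ...while a confident, wrong one costs a lot.
print(categorical_cross_entropy(np.array([1, 0]), np.array([0.1, 0.9])))  # ~2.303
```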
Learning Rate
- Determines how large an adjustment to make for a given loss value
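In gradient-descent terms, the learning rate scales the weight update computed from the loss gradient; a small illustration with made-up values (not our optimizer):

```python
import numpy as np

weights = np.array([0.5, -0.3])
loss_gradient = np.array([0.2, -0.1])  # hypothetical gradient of the loss
learning_rate = 0.01                   # smaller rate -> smaller adjustment

# Plain gradient-descent update: the rate scales the step taken.
weights -= learning_rate * loss_gradient
```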
Accuracy
- An output is considered correct if the expected output neuron's activation value is the greatest among all neurons for that category
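That criterion is just an argmax comparison per category; a sketch:

```python
import numpy as np

def is_correct(expected_one_hot, activations):
    """Correct when the expected neuron has the highest activation
    among all output neurons for that category."""
    return np.argmax(activations) == np.argmax(expected_one_hot)

print(is_correct(np.array([0, 1]), np.array([0.4, 0.6])))  # True
```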
Initial Attempts
- Features
  - Short sample lengths
  - WAV inputs only
  - Trained on a training set of only 2 speakers
Results
False Hope
- Features
  - Short sample lengths
  - Trained on a training set of only 2 speakers
- Changes
  - Both input types
Results
Hope
- Features
  - Short sample lengths
  - Both input types
  - Trained on a training set of only 2 speakers
- Changes
  - Train on a single batch per speaker per pass through the training set (see the sketch below)
  - Reduced learning rate
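A hedged sketch of what "one batch per speaker per pass" could look like using the Keras 1.x `train_on_batch` call; `model`, `batches_by_speaker`, and the random choice of batch are assumptions for illustration:

```python
import random

def training_pass(model, batches_by_speaker):
    """One pass through the training set: train on exactly one
    (randomly chosen) batch per speaker before moving on."""
    for batches, label in batches_by_speaker.values():
        x = random.choice(batches)      # one 4 s batch, shape (1, 400, 1)
        model.train_on_batch(x, label)  # label shape (1, n_classes)
```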
Results
Confirmation
- Features
  - Short sample lengths
  - Both input types
  - Train on a single batch per speaker per pass through the training set
- Changes
  - Trained on the full training set of 2300+ speakers
Results
Refinement
- Features
  - Short sample lengths
  - Both input types
  - Train on a single batch per speaker per pass through the training set
- Changes
  - True validation
  - Decaying learning rate (see the sketch below)
  - Increased epoch duration
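Keras 1.x supports a decaying learning rate either through an optimizer's `decay` argument or a `LearningRateScheduler` callback; a sketch of the latter, with a made-up decay curve:

```python
from keras.callbacks import LearningRateScheduler

def decayed_lr(epoch):
    """Hypothetical schedule: halve the learning rate every 10 epochs."""
    return 0.001 * (0.5 ** (epoch // 10))

scheduler = LearningRateScheduler(decayed_lr)
# Passed to model.fit(..., callbacks=[scheduler]) during training.
```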
Results
UI
Conclusion
Works Cited
Weinberger, Steven. (2015). Speech Accent Archive. George Mason University. Retrieved from http://accent.gmu.edu