Urban Sound Classification with a Convolutional Neural Network

Urban Sound Classification with a Convolutional Neural Network Presenter: Dhara Rana Project Members: Joseph Chiou

Recap: Overall Goal
Create a way to classify environmental sounds given an audio clip.
Data cleaning/processing: convert each sound signal into a log-scaled spectrogram.
Training data: spectrogram patches of 128 frequency bands by 128 frames by 2 channels.
Important libraries/packages: keras, librosa, sklearn
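As a rough sketch of this preprocessing step (the exact script is not part of the slides), a single 128 x 128 x 2 patch can be computed with librosa roughly as follows; treating the second channel as the delta of the log-mel spectrogram is an assumption, and the hop size of 512 follows the appendix:

```python
import librosa
import numpy as np

def clip_to_patch(path, n_mels=128, frames=128, hop_length=512):
    """Turn the start of one audio clip into a 128 x 128 x 2 log-mel patch (sketch)."""
    y, sr = librosa.load(path)                      # librosa resamples to 22050 Hz by default
    window = hop_length * (frames - 1)              # 65024 samples -> exactly 128 frames
    segment = y[:window]
    mel = librosa.feature.melspectrogram(y=segment, sr=sr,
                                         n_mels=n_mels, hop_length=hop_length)
    log_mel = librosa.power_to_db(mel)              # log-scaled spectrogram, shape (128, 128)
    delta = librosa.feature.delta(log_mel)          # assumed second channel: delta of log-mel
    return np.stack([log_mel, delta], axis=-1)      # shape (128, 128, 2)
```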

Feature Selection for Baseline Models
Domain-specific features: 5 sound feature sets per clip: (1) Chromagram, 12 features; (2) Mel-frequency cepstral coefficients (MFCCs), 40 features; (3) Mel spectrogram, 128 features; (4) Tonnetz, 6 features; (5) Spectral contrast, 7 features. Total features: 193.
Pearson correlation: a basic statistical tool for measuring the relation between two variables; the 193 features with the highest |r| values were taken.
Chromagram: shows the pitches present in a sound.
Mel-frequency cepstral coefficients: describe the spectral properties of a sound; these features give a good representation of the vocal tract filter, cleanly separated from information about the glottal source.
Mel spectrogram: a mel-scaled spectrogram of 128 components, approximating the mapping of frequencies to patches of nerves in the cochlea.
Tonnetz: arranges sounds according to pitch relationships into independent spatial and temporal structures.
Spectral contrast: the decibel difference between peaks and valleys in the spectrum.
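A sketch of how the 193 domain-specific features could be extracted with librosa; averaging each feature over time is an assumption (the slides do not show the extraction code), and the function name is illustrative:

```python
import librosa
import numpy as np

def extract_193_features(path):
    """Chromagram (12) + MFCC (40) + mel spectrogram (128) + tonnetz (6) + contrast (7)."""
    y, sr = librosa.load(path)
    stft = np.abs(librosa.stft(y))
    chroma   = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr), axis=1)            # 12
    mfccs    = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)            # 40
    mel      = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)             # 128
    tonnetz  = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr),
                       axis=1)                                                          # 6
    contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sr), axis=1)        # 7
    return np.hstack([chroma, mfccs, mel, tonnetz, contrast])   # 193 values per clip
```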

Baseline Model: Random Forest
Parameters: number of trees: 500; max depth: 20.
Validation method: 10-fold cross-validation.
Average runtime: ~5 min with the 193 domain-specific features; ~4 min 30 sec with the 193 Pearson-correlation-selected features.
Average accuracy: 58.3% with the 193 domain-specific features; 20.6% with the 193 Pearson-correlation-selected features.
The domain-specific feature set performed better, so it is used to compare against the CNN results.
10-fold cross-validation: the dataset is already split into 10 folds with a similar distribution of each class, so for cross-validation I train on folds 1-9 and test on fold 10, then train on folds 2-10 and test on fold 1, and so on. The same procedure was used to validate the SVM.
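A minimal sketch of this baseline with scikit-learn, assuming X (a clips x 193 feature matrix), y (class labels), and folds (the predefined fold number of each clip) have already been built:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# X: (n_clips, 193) features, y: class labels, folds: fold id (1-10) per clip -- assumed arrays
accuracies = []
for test_fold in range(1, 11):
    train, test = folds != test_fold, folds == test_fold
    clf = RandomForestClassifier(n_estimators=500, max_depth=20, n_jobs=-1, random_state=0)
    clf.fit(X[train], y[train])
    accuracies.append(accuracy_score(y[test], clf.predict(X[test])))
print("Average accuracy: %.1f%%" % (100 * np.mean(accuracies)))
```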

Baseline Model: SVM
Parameters: C = 0.01; max iterations: 3000.
Validation method: 10-fold cross-validation.
Average runtime: ~1.54 min with the 193 domain-specific features; < 1 sec with the 193 Pearson-correlation-selected features.
Average accuracy: 55.4% with the 193 domain-specific features; 16.3% with the 193 Pearson-correlation-selected features.
The domain-specific feature set performed better, so it is used to compare against the CNN results.
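The SVM baseline can be run the same way; a sketch using scikit-learn's LinearSVC with the parameters above (X, y, and folds are the assumed arrays from the previous example, and feature scaling is an added assumption):

```python
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

accuracies = []
for test_fold in range(1, 11):
    train, test = folds != test_fold, folds == test_fold
    # Standardizing features is an assumption; the slides only specify C and the iteration limit.
    svm = make_pipeline(StandardScaler(), LinearSVC(C=0.01, max_iter=3000))
    svm.fit(X[train], y[train])
    accuracies.append(accuracy_score(y[test], svm.predict(X[test])))
print("Average accuracy: %.1f%%" % (100 * sum(accuracies) / len(accuracies)))
```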

Convolutional Neural Network: Architecture
Layer 1: convolutional layer, 24 filters with receptive field (5, 5); max pooling (4, 2); rectified linear unit (ReLU) activation, h(x) = max(x, 0).
Layer 2: convolutional layer, 48 filters with receptive field (5, 5).
Layer 3: convolutional layer, 48 filters with receptive field (5, 5); ReLU activation.
Layer 4: fully connected layer with 64 hidden units; ReLU activation; dropout with 50% probability; L2 regularizer: 0.001.
Layer 5: fully connected layer with 10 units; softmax activation (outputs between 0 and 1).
This architecture is similar to the one in the Salamon and Bello paper [1].
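A sketch of this architecture in Keras, following the layer description above; the optimizer and loss are assumptions, since the slides do not specify them:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from keras.regularizers import l2

model = Sequential([
    # Layer 1: 24 filters (5x5), max pooling (4,2), ReLU
    Conv2D(24, (5, 5), activation='relu', input_shape=(128, 128, 2)),
    MaxPooling2D(pool_size=(4, 2)),
    # Layer 2: 48 filters (5x5)
    Conv2D(48, (5, 5)),
    # Layer 3: 48 filters (5x5), ReLU
    Conv2D(48, (5, 5), activation='relu'),
    Flatten(),
    # Layer 4: fully connected, 64 units, ReLU, 50% dropout, L2 = 0.001
    Dense(64, activation='relu', kernel_regularizer=l2(0.001)),
    Dropout(0.5),
    # Layer 5: fully connected, 10 classes, softmax
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',                      # optimizer and loss are assumptions
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```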

Convolutional Neural Network: Results
Parameters: epochs: 40; batch size: 30.
Validation method: 10-fold cross-validation.
Average runtime: ~1 hr 50 min. Average accuracy: 66.10%.
10-fold cross-validation example: train on folds 1-8, validate on fold 9, test on fold 10, so the reported test accuracy is measured on a fold the CNN never saw during training. The training and test accuracy curves for testing on fold 10, which showed the highest accuracy, appear toward the end of the presentation.
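A sketch of one cross-validation round (train on folds 1-8, validate on fold 9, test on fold 10) using the model above, assuming X_train/y_train, X_val/y_val, and X_test/y_test have been built from those folds with one-hot labels:

```python
# Assumed arrays: X_* with shape (n, 128, 128, 2), y_* one-hot with shape (n, 10)
history = model.fit(X_train, y_train,
                    epochs=40, batch_size=30,
                    validation_data=(X_val, y_val))

test_loss, test_acc = model.evaluate(X_test, y_test)
print("Test accuracy on the held-out fold: %.2f%%" % (100 * test_acc))
```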

Result Comparison
For comparison, the Salamon and Bello paper reports a mean accuracy of 73% [1]. The dense CNN performs better than LinearSVC and the random forest. Why? (1) The small receptive fields of the convolutional kernels (filters) allow better learning and identification of the different sound classes [1]. (2) When applied to spectrogram-like inputs, CNNs are capable of capturing energy modulation patterns across time and frequency [1].

Confusion Matrices for Test Fold 10
[Figures: confusion matrices for SVM (acc: 62.49%), random forest (acc: 61.29%), and CNN (acc: 70.85%), from the cross-validation round that gave the best classification: testing on fold 10.]
Classification accuracy when testing on fold 10: dense CNN 70.85%, random forest 61.29%, SVM 62.49%. (The class distribution of test fold 10 is shown toward the end of the presentation.)
The random forest is better at distinguishing engine idling from air conditioner. All models had a hard time distinguishing children playing from siren. The CNN is much better at identifying noisy urban sounds such as street music, and is better at distinguishing jackhammer from drilling.
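The confusion matrices and per-class accuracies can be computed with scikit-learn; a sketch for the CNN predictions on the held-out fold (variable names are assumptions carried over from the earlier sketches):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# y_test is one-hot (n, 10); convert predictions and labels to class indices
y_pred = np.argmax(model.predict(X_test), axis=1)
y_true = np.argmax(y_test, axis=1)

cm = confusion_matrix(y_true, y_pred)
per_class_acc = cm.diagonal() / cm.sum(axis=1)   # per-class accuracy (recall)
print(cm)
print(per_class_acc)
```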

References
[1] Salamon, J., & Bello, J. P. (2017). Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Processing Letters, 24(3), 279-283.

Appendix: Test Fold 10 Class Distribution
[Figures: class distribution of test fold 10 for the 193 domain-specific features and for the 128 x 128 x 2 spectrogram features.]
The 128 x 128 x 2 dataset contains fewer sounds per class because producing 128 frames requires a window of 65024 samples (window_size = 512 * (frames - 1), where 512 is the hop size); clips shorter than this window are not considered.
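A small worked check of this window-size arithmetic; the patch-counting rule is an assumption that mirrors the explanation above (clips shorter than one full window yield no 128-frame patch):

```python
hop_length, frames = 512, 128
window_size = hop_length * (frames - 1)      # 512 * 127 = 65024 samples (~2.95 s at 22050 Hz)

def n_patches(clip_length_in_samples):
    """Number of non-overlapping 65024-sample windows that fit in a clip (assumed rule)."""
    return max(0, (clip_length_in_samples - window_size) // window_size + 1)

print(window_size)        # 65024
print(n_patches(22050))   # 1-second clip at 22050 Hz -> 0 patches, so the clip is dropped
print(n_patches(88200))   # 4-second clip -> 1 patch
```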

Appendix: Per-Class Accuracy, Test Fold 10

Sound Type       | Dense CNN | Random Forest | SVM
Air Conditioner  | 0.72      | 0.77          | 0.61
Car Horn         | 0.57      | 0.70          |
Children Playing | 0.69      | 0.82          | 0.53
Dog Bark         | 0.86      | 0.55          | 0.62
Drilling         | 0.33      | 0.46          | 0.54
Engine Idling    | 0.49      | 0.73          |
Gun Shot         | 0.5       | 0.85          |
Jackhammer       | 0.76      | 0.47          | 0.64
Siren            | 0.88      | 0.37          | 0.75
Street Music     | 0.79      | 0.66          |

Appendix: CNN Test and Validation Accuracy and Loss for Test Fold 10