Urban Sound Classification with a Convolutional Neural Network

Urban Sound Classification with a Convolutional Neural Network Presenter: Dhara Rana Project Members: Joseph Chiou

Recap: Overall Goal
Create a way to classify environmental sounds given an audio clip.
Data cleaning/processing: convert each sound signal into a log-scaled spectrogram.
Training data: spectrogram patches of 128 frequency bands by 128 frames by 2 channels.
Important libraries/packages: keras, librosa, sklearn
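As a rough sketch of this preprocessing step (the exact script is not part of the slides), a single 128 x 128 x 2 patch can be computed with librosa roughly as follows; treating the second channel as the delta of the log-mel spectrogram is an assumption, and the hop size of 512 follows the appendix:

```python
import librosa
import numpy as np

def clip_to_patch(path, n_mels=128, frames=128, hop_length=512):
    """Turn the start of one audio clip into a 128 x 128 x 2 log-mel patch (sketch)."""
    y, sr = librosa.load(path)                      # librosa resamples to 22050 Hz by default
    window = hop_length * (frames - 1)              # 65024 samples -> exactly 128 frames
    segment = y[:window]
    mel = librosa.feature.melspectrogram(y=segment, sr=sr,
                                         n_mels=n_mels, hop_length=hop_length)
    log_mel = librosa.power_to_db(mel)              # log-scaled spectrogram, shape (128, 128)
    delta = librosa.feature.delta(log_mel)          # assumed second channel: delta of log-mel
    return np.stack([log_mel, delta], axis=-1)      # shape (128, 128, 2)
```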

Feature Selection for Baseline Models
Domain-specific features: 5 sound feature sets per clip: (1) Chromagram, 12 features; (2) Mel-frequency cepstral coefficients (MFCCs), 40 features; (3) Mel spectrogram, 128 features; (4) Tonnetz, 6 features; (5) Spectral contrast, 7 features. Total features: 193.
Pearson correlation: a basic statistical tool for measuring the relation between two variables; the 193 features with the highest |r| values were taken.
Chromagram: shows the pitches present in a sound.
Mel-frequency cepstral coefficients: describe the spectral properties of a sound; these features give a good representation of the vocal tract filter, cleanly separated from information about the glottal source.
Mel spectrogram: a mel-scaled spectrogram of 128 components, approximating the mapping of frequencies to patches of nerves in the cochlea.
Tonnetz: arranges sounds according to pitch relationships into independent spatial and temporal structures.
Spectral contrast: the decibel difference between peaks and valleys in the spectrum.
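A sketch of how the 193 domain-specific features could be extracted with librosa; averaging each feature over time is an assumption (the slides do not show the extraction code), and the function name is illustrative:

```python
import librosa
import numpy as np

def extract_193_features(path):
    """Chromagram (12) + MFCC (40) + mel spectrogram (128) + tonnetz (6) + contrast (7)."""
    y, sr = librosa.load(path)
    stft = np.abs(librosa.stft(y))
    chroma   = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr), axis=1)            # 12
    mfccs    = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)            # 40
    mel      = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)             # 128
    tonnetz  = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr),
                       axis=1)                                                          # 6
    contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sr), axis=1)        # 7
    return np.hstack([chroma, mfccs, mel, tonnetz, contrast])   # 193 values per clip
```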

Baseline Model: Random Forest
Parameters: number of trees: 500; max depth: 20.
Validation method: 10-fold cross-validation.
Average runtime: ~5 min with the 193 domain-specific features; ~4 min 30 sec with the 193 Pearson-correlation-selected features.
Average accuracy: 58.3% with the 193 domain-specific features; 20.6% with the 193 Pearson-correlation-selected features.
The domain-specific feature set performed better, so it is used to compare against the CNN results.
10-fold cross-validation: the dataset is already split into 10 folds with a similar distribution of each class, so for cross-validation I train on folds 1-9 and test on fold 10, then train on folds 2-10 and test on fold 1, and so on. The same procedure was used to validate the SVM.
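A minimal sketch of this baseline with scikit-learn, assuming X (a clips x 193 feature matrix), y (class labels), and folds (the predefined fold number of each clip) have already been built:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# X: (n_clips, 193) features, y: class labels, folds: fold id (1-10) per clip -- assumed arrays
accuracies = []
for test_fold in range(1, 11):
    train, test = folds != test_fold, folds == test_fold
    clf = RandomForestClassifier(n_estimators=500, max_depth=20, n_jobs=-1, random_state=0)
    clf.fit(X[train], y[train])
    accuracies.append(accuracy_score(y[test], clf.predict(X[test])))
print("Average accuracy: %.1f%%" % (100 * np.mean(accuracies)))
```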

Baseline Model: SVM
Parameters: C = 0.01; max iterations: 3000.
Validation method: 10-fold cross-validation.
Average runtime: ~1.54 min with the 193 domain-specific features; < 1 sec with the 193 Pearson-correlation-selected features.
Average accuracy: 55.4% with the 193 domain-specific features; 16.3% with the 193 Pearson-correlation-selected features.
The domain-specific feature set performed better, so it is used to compare against the CNN results.
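The SVM baseline can be run the same way; a sketch using scikit-learn's LinearSVC with the parameters above (X, y, and folds are the assumed arrays from the previous example, and feature scaling is an added assumption):

```python
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

accuracies = []
for test_fold in range(1, 11):
    train, test = folds != test_fold, folds == test_fold
    # Standardizing features is an assumption; the slides only specify C and the iteration limit.
    svm = make_pipeline(StandardScaler(), LinearSVC(C=0.01, max_iter=3000))
    svm.fit(X[train], y[train])
    accuracies.append(accuracy_score(y[test], svm.predict(X[test])))
print("Average accuracy: %.1f%%" % (100 * sum(accuracies) / len(accuracies)))
```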

Convolutional Neural Network: Architecture
Layer 1: convolutional layer, 24 filters with receptive field (5, 5); max pooling (4, 2); rectified linear unit (ReLU) activation, h(x) = max(x, 0).
Layer 2: convolutional layer, 48 filters with receptive field (5, 5).
Layer 3: convolutional layer, 48 filters with receptive field (5, 5); ReLU activation.
Layer 4: fully connected layer with 64 hidden units; ReLU activation; dropout with 50% probability; L2 regularizer: 0.001.
Layer 5: fully connected layer with 10 units; softmax activation (outputs between 0 and 1).
This architecture is similar to the one in the Salamon and Bello paper [1].
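A sketch of this architecture in Keras, following the layer description above; the optimizer and loss are assumptions, since the slides do not specify them:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from keras.regularizers import l2

model = Sequential([
    # Layer 1: 24 filters (5x5), max pooling (4,2), ReLU
    Conv2D(24, (5, 5), activation='relu', input_shape=(128, 128, 2)),
    MaxPooling2D(pool_size=(4, 2)),
    # Layer 2: 48 filters (5x5)
    Conv2D(48, (5, 5)),
    # Layer 3: 48 filters (5x5), ReLU
    Conv2D(48, (5, 5), activation='relu'),
    Flatten(),
    # Layer 4: fully connected, 64 units, ReLU, 50% dropout, L2 = 0.001
    Dense(64, activation='relu', kernel_regularizer=l2(0.001)),
    Dropout(0.5),
    # Layer 5: fully connected, 10 classes, softmax
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',                      # optimizer and loss are assumptions
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```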

Convolutional Neural Network: Results
Parameters: epochs: 40; batch size: 30.
Validation method: 10-fold cross-validation.
Average runtime: ~1 hr 50 min. Average accuracy: 66.10%.
10-fold cross-validation example: train on folds 1-8, validate on fold 9, test on fold 10, so the reported test accuracy is measured on a fold the CNN never saw during training. The training and test accuracy curves for testing on fold 10, which showed the highest accuracy, appear toward the end of the presentation.
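A sketch of one cross-validation round (train on folds 1-8, validate on fold 9, test on fold 10) using the model above, assuming X_train/y_train, X_val/y_val, and X_test/y_test have been built from those folds with one-hot labels:

```python
# Assumed arrays: X_* with shape (n, 128, 128, 2), y_* one-hot with shape (n, 10)
history = model.fit(X_train, y_train,
                    epochs=40, batch_size=30,
                    validation_data=(X_val, y_val))

test_loss, test_acc = model.evaluate(X_test, y_test)
print("Test accuracy on the held-out fold: %.2f%%" % (100 * test_acc))
```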

Result Comparison
For comparison, the Salamon and Bello paper reports a mean accuracy of 73% [1]. The dense CNN performs better than LinearSVC and the random forest. Why? (1) The small receptive fields of the convolutional kernels (filters) allow better learning and identification of the different sound classes [1]. (2) When applied to spectrogram-like inputs, CNNs are capable of capturing energy modulation patterns across time and frequency [1].

Confusion Matrices for Test Fold 10
[Figures: confusion matrices for SVM (acc: 62.49%), random forest (acc: 61.29%), and CNN (acc: 70.85%), from the cross-validation round that gave the best classification: testing on fold 10.]
Classification accuracy when testing on fold 10: dense CNN 70.85%, random forest 61.29%, SVM 62.49%. (The class distribution of test fold 10 is shown toward the end of the presentation.)
The random forest is better at distinguishing engine idling from air conditioner. All models had a hard time distinguishing children playing from siren. The CNN is much better at identifying noisy urban sounds such as street music, and is better at distinguishing jackhammer from drilling.
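The confusion matrices and per-class accuracies can be computed with scikit-learn; a sketch for the CNN predictions on the held-out fold (variable names are assumptions carried over from the earlier sketches):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# y_test is one-hot (n, 10); convert predictions and labels to class indices
y_pred = np.argmax(model.predict(X_test), axis=1)
y_true = np.argmax(y_test, axis=1)

cm = confusion_matrix(y_true, y_pred)
per_class_acc = cm.diagonal() / cm.sum(axis=1)   # per-class accuracy (recall)
print(cm)
print(per_class_acc)
```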

References
[1] Salamon, J., & Bello, J. P. (2017). Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Processing Letters, 24(3), 279-283.

Appendix: Test Fold 10 Class Distribution
[Figures: class distribution of test fold 10 for the 193 domain-specific features and for the 128 x 128 x 2 spectrogram features.]
The 128 x 128 x 2 dataset contains fewer sounds per class because producing 128 frames requires a window of 65024 samples (window_size = 512 * (frames - 1), where 512 is the hop size); clips shorter than this window are not considered.
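A small worked check of this window-size arithmetic; the patch-counting rule is an assumption that mirrors the explanation above (clips shorter than one full window yield no 128-frame patch):

```python
hop_length, frames = 512, 128
window_size = hop_length * (frames - 1)      # 512 * 127 = 65024 samples (~2.95 s at 22050 Hz)

def n_patches(clip_length_in_samples):
    """Number of non-overlapping 65024-sample windows that fit in a clip (assumed rule)."""
    return max(0, (clip_length_in_samples - window_size) // window_size + 1)

print(window_size)        # 65024
print(n_patches(22050))   # 1-second clip at 22050 Hz -> 0 patches, so the clip is dropped
print(n_patches(88200))   # 4-second clip -> 1 patch
```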

Appendix: Per-Class Accuracy, Test Fold 10

Sound Type       | Dense CNN | Random Forest | SVM
Air Conditioner  | 0.72      | 0.77          | 0.61
Car Horn         | 0.57      | 0.70          |
Children Playing | 0.69      | 0.82          | 0.53
Dog Bark         | 0.86      | 0.55          | 0.62
Drilling         | 0.33      | 0.46          | 0.54
Engine Idling    | 0.49      | 0.73          |
Gun Shot         | 0.5       | 0.85          |
Jackhammer       | 0.76      | 0.47          | 0.64
Siren            | 0.88      | 0.37          | 0.75
Street Music     | 0.79      | 0.66          |

Appendix: CNN Test and Validation Accuracy and Loss for Test Fold 10