Tenacious Deep Learning
Matthew Farkas, Sai Haran, Michael Kaesler, Aditya Venkataraman | MSiA 490-30 Deep Learning | Spring 2017 | Northwestern University

Problem Statement
With the rise in popularity of music streaming services, making users' lives easier via machine learning has grown in importance. Recommendation has been at the forefront of most companies' efforts, but automated metadata assignment is just as important for users and engineers alike. Genre identification via deep learning has been done successfully before, so rather than try to improve another classifier's accuracy, we set out to understand how a machine identifies a genre by opening up the model.

Technical Approach
- Converted each 30-second audio clip into a Mel-spectrogram and split it into five pieces, recasting the problem as image classification (see the spectrogram sketch below)
- Ran a convolutional neural network (CNN) based on the VGG framework, with several adjustments: three VGG layers and only one fully connected layer with 128 nodes (see the model sketch below)
- Regularized with 0.4 dropout, 0.1 L1 regularization, and batch normalization
- Used a learning rate of 0.00001
- Rather than the standard 224x224x3 VGG input, we chopped each spectrogram into five pieces to push the model to learn features over short spans of time; the resulting spectrogram clips were 200x90x3 in size
- Used a 3x3 kernel convolving in two dimensions

Results
- Our model achieved up to 72% test accuracy after only 1500 epochs of training
- We believe the model made accurate decisions because it focused on defining frequencies of each class, such as the heavy bass in hip-hop and electronic
- Baseline human accuracy was around 65% on a small sample, and the model was able to surpass it. This shows that a machine can pick up the features that define a genre, even from short 30-second samples
- After training, we ran spectrogram clips through the model to produce classifications, then stitched the clips and their heatmaps back together to get a holistic view of how the model classifies the spectrograms
- These heatmaps visualize which parts of the image the model is looking at during classification (see the heatmap sketch below)
- When looking at low frequencies and ignoring middle frequencies, the model is 50.6% sure the clip is electronic
- When looking at all frequency ranges, the model is 23.7% sure the clip is hip-hop

(Figure: stitched spectrogram clips with classification heatmaps for hip-hop and electronic examples)

Dataset
- 60,000 30-second clips of music from hip-hop, country, electronic, and metal artists, obtained from Spotify's developer API
- Each clip was converted into a Mel-spectrogram and divided into five pieces
- Unrepresentative samples, such as remixes and commentary tracks, were removed with a cleaning script
- Genre classification can be fuzzy even for a human, so picking representative artists and samples was of the utmost importance
- The dataset was initially too large, so we cut it down to keep the classes homogeneous in training
- Shifted spectrograms horizontally by up to 50% to improve the model's ability to generalize (see the augmentation sketch below)

Conclusion
- Genre classification can be performed by focusing mainly on frequency ranges
- The model can change its classification based on which part of the song it is looking at; this is why separating each song into short clips can help create a more robust classifier
- In the future, we could augment the model with a voting system for each song: the class with the most "votes" across a song's clips becomes the final classification (see the voting sketch below)
- We would also like to experiment with using raw audio rather than spectrograms; this approach has been shown to work in Google DeepMind's WaveNet paper, which uses raw audio as the input to a CNN
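The spectrogram step described under Technical Approach could look roughly like the sketch below, assuming librosa is used for audio processing. The sample rate, Mel-band count, and hop length are illustrative choices not stated on the poster, picked so that a 30-second clip becomes five pieces of roughly 90 Mel bands by 200 time frames.

```python
# Minimal sketch of the audio-to-spectrogram step (librosa and all parameters
# below are assumptions; the poster does not name the library or settings).
import numpy as np
import librosa

def clip_to_pieces(path, n_pieces=5, sr=22050, n_mels=90, hop_length=660):
    """Load a 30-second preview, build a Mel-spectrogram, and cut it into
    n_pieces equal-width clips along the time axis."""
    y, _ = librosa.load(path, sr=sr, duration=30.0)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         hop_length=hop_length)
    mel_db = librosa.power_to_db(mel, ref=np.max)   # log scale for contrast
    width = mel_db.shape[1] // n_pieces             # time frames per piece
    return [mel_db[:, i * width:(i + 1) * width] for i in range(n_pieces)]
```

The 200x90x3 input shape on the poster suggests the spectrograms were rendered as three-channel images before training; the single-channel arrays above would need to be rendered or tiled into three channels to match that shape.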
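A minimal Keras sketch of the network described under Technical Approach, assuming TensorFlow/Keras (the framework is not named on the poster): three VGG-style 3x3 convolution blocks with batch normalization, one 128-node fully connected layer with 0.1 L1 regularization, 0.4 dropout, and a learning rate of 0.00001. The filter counts, pooling sizes, optimizer, and input orientation (90 Mel bands x 200 frames x 3 channels) are assumptions.

```python
from tensorflow.keras import layers, models, regularizers, optimizers

def build_model(input_shape=(90, 200, 3), n_classes=4):
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    for filters in (32, 64, 128):                   # three VGG-style blocks (assumed filter counts)
        model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
        model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dense(128, activation="relu",
                           kernel_regularizer=regularizers.l1(0.1)))  # 0.1 L1 on the dense layer
    model.add(layers.Dropout(0.4))
    model.add(layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer=optimizers.Adam(learning_rate=1e-5),      # optimizer choice is an assumption
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

With four genre classes and one-hot labels, training is a standard `model.fit(X_train, y_train, epochs=1500, ...)` call, matching the 1500 epochs reported in the Results.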
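The horizontal-shift augmentation mentioned in the Dataset section could be implemented as below. Wrapping the shifted-off frames around with np.roll is an assumption; the poster does not say how the vacated region was filled.

```python
import numpy as np

def random_time_shift(spec, max_fraction=0.5, rng=None):
    """Shift a (mel_bands, time, channels) spectrogram clip along the time
    axis by a random amount of up to max_fraction of its width."""
    rng = rng or np.random.default_rng()
    max_shift = int(spec.shape[1] * max_fraction)
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(spec, shift, axis=1)             # wrap-around shift along time
```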
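The poster does not say how the heatmaps were generated; Grad-CAM is one common technique for this kind of class-attention visualization, so the sketch below uses it purely as an illustration. `model` is assumed to be the network from the model sketch above, and `last_conv_name` the name of its final convolutional layer.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, last_conv_name, class_index=None):
    """Return a small 2-D heatmap of class evidence for one spectrogram clip."""
    grad_model = tf.keras.Model(model.inputs,
                                [model.get_layer(last_conv_name).output, model.output])
    x = tf.convert_to_tensor(image[np.newaxis, ...], dtype=tf.float32)
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(x)
        if class_index is None:
            class_index = int(tf.argmax(preds[0]))  # explain the predicted class
        score = preds[:, class_index]
    grads = tape.gradient(score, conv_out)          # d(score) / d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))    # global-average-pool the gradients
    cam = tf.reduce_sum(conv_out[0] * weights[0], axis=-1)
    cam = tf.nn.relu(cam) / (tf.reduce_max(cam) + 1e-8)  # keep positive evidence, normalize
    return cam.numpy()                              # upsample and overlay on the spectrogram to plot
```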
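The per-song voting system proposed in the Conclusion, sketched under the assumption that `clips` holds the five preprocessed pieces of one song, each shaped to match the model input.

```python
from collections import Counter
import numpy as np

GENRES = ("hip-hop", "country", "electronic", "metal")

def classify_song(model, clips):
    """Each clip votes for its predicted genre; the majority wins the song."""
    preds = model.predict(np.stack(clips), verbose=0)   # one softmax row per clip
    votes = [GENRES[i] for i in preds.argmax(axis=1)]
    return Counter(votes).most_common(1)[0][0]
```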
References and Related Work
- http://benanne.github.io/2014/08/05/spotify-cnns.html
- https://chatbotslife.com/finding-the-genre-of-a-song-with-deep-learning-da8f59a61194
- https://deepmind.com/blog/wavenet-generative-model-raw-audio/