Jordi Pons, Olga Slizovskaia, Rong Gong, Emilia Gómez, Xavier Serra

Presentation transcript:

Timbral Analysis of Music Audio Signals with Convolutional Neural Networks Jordi Pons, Olga Slizovskaia, Rong Gong, Emilia Gómez, Xavier Serra Hello everyone, I am Rong Gong, a PhD student in the Music Technology Group, Universitat Pompeu Fabra, Barcelona, and the third author of the paper being presented, Timbral analysis of music audio signals with convolutional neural networks. The first author, my colleague Jordi Pons, is currently doing an internship in the US, so he is not able to come and present the paper.

Research goal To discover novel deep learning architectures which can efficiently learn timbre representations Previous work on learning temporal features: Jordi Pons, Xavier Serra, Designing efficient architectures for modeling temporal features with convolutional neural networks, ICASSP 2017 Our goal is to discover novel deep learning architectures which can efficiently learn timbre representations. Previously, my colleague designed efficient convolutional neural networks to model temporal features such as tempo and rhythm; here is the reference. In this work, we focus on learning timbre representations.

Presentation structure Motivation State of the art Architecture design strategy Three experiments and results We will follow this presentation structure. We will introduce the motivation and explain why we use convolutional neural networks to learn timbral representations. Then I will review the state-of-the-art CNN design strategies for timbral representation learning. After that, I will explain the proposed design strategy. Finally, we will present three experiments and their results to show that our strategy can learn timbral representations efficiently.

Motivation

Timbral descriptions – traditional approaches Bag-of-features Statistics of frame-based features Spectral centroid, flatness, MFCC, etc. Do NOT consider the temporal evolution Temporal modeling Hidden Markov models Time-frequency patterns learned by NMD basis Descriptors and temporal models NOT jointly designed. Traditionally, bag-of-features approaches were used to describe musical timbre. These are statistics of frame-based features such as spectral centroid, spectral flatness, or MFCCs. Their drawback is that they ignore the temporal evolution of timbre. On the other hand, there are temporal modeling methods for these frame-based features, for example hidden Markov models or the time-frequency patterns learned by an NMD (non-negative matrix deconvolution) basis. Their drawback is that the features and the temporal models are not jointly learned.
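To make the bag-of-features idea concrete, here is a minimal sketch of such a descriptor, assuming librosa and a hypothetical input file; it summarizes frame-level MFCCs with clip-level statistics, which is exactly where the temporal evolution gets lost.

```python
import librosa
import numpy as np

# Bag-of-features sketch: frame-level MFCCs summarized by clip-level
# statistics. Averaging over frames discards the temporal evolution.
y, sr = librosa.load("example.wav", sr=22050, mono=True)     # hypothetical file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)           # shape: (13, n_frames)
bag = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])  # 26-dim clip descriptor
```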

Deep learning and convolutional neural networks Advantages of deep learning No strong assumptions over the input descriptor—log-mel magnitude spectrogram Able to learn spectral-temporal descriptors—input patch > 1 frame Convolutional neural networks (CNNs) Able to learn spectral-temporal filters Can exploit invariance by sharing parameters We consider using deep learning, and convolutional neural networks in particular, to represent timbre. Firstly, it does not require much feature engineering: we can use a perceptually motivated log-mel magnitude spectrogram as the input to the network. Secondly, it can learn spectral and temporal descriptors, provided the input patch is longer than one frame. Moreover, CNNs offer additional benefits: they can learn spectral-temporal filters, and they can exploit invariances, such as time or frequency invariance in the mel spectrogram, by sharing parameters.
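As an illustration, here is one way the log-mel input could be computed with librosa; the sample rate, number of mel bands, and compression constant are illustrative assumptions, not necessarily the paper's exact settings.

```python
import librosa
import numpy as np

# Log-mel magnitude spectrogram as network input (illustrative parameters).
y, sr = librosa.load("example.wav", sr=16000, mono=True)  # mono down-mix
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=96)
log_S = np.log10(1.0 + 10000.0 * S)                       # log compression
# log_S has shape (96, n_frames); feeding the network a patch wider than
# one frame lets it learn spectral-temporal descriptors.
```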

State of the art

CNNs filter design Small-rectangular filters Ex. 3x3, 5x5 filters In the first layer, NOT able to learn spectral-temporal patterns with a large frequency spread Ex. cymbals, snare drum High filters A lot of weights—prone to overfit or fit noise Having decided to use CNNs for timbral representation learning, let us look at the state of the art. We have identified two general trends in designing CNN architectures, particularly in designing the filter shapes. The first trend uses small rectangular filters, such as 3x3 or 5x5. The drawback of this strategy is that the first layer cannot learn spectral-temporal patterns with a large frequency spread, such as cymbals or snare drums. The other trend uses high filters with a large frequency spread. Although high filters can successfully learn most timbral patterns, they carry many weights and are therefore prone to overfitting or fitting noise.

Fit noise example - high filters 12x8 filters learn redundant information Left: onset, redundancy along the frequency axis, a small 1x3 filter is enough for an onset Right: harmonics, redundancy along the temporal axis, only three harmonics captured Here is a noise-fitting example for high filters. Below are two learned filters of size 12x8. The left one shows an onset pattern, but it contains much redundancy along the frequency axis; a small 1x3 filter, like the red box shown, would be enough in this case. The right one learned a harmonic pattern, but with much redundancy along the temporal axis. Moreover, this filter is too short in frequency and captured only three harmonics.

Architecture design strategy

Timbre definition Timbre is defined by what it is NOT: a set of auditory attributes of sound events IN ADDITION TO pitch, loudness, duration and spatial position. After presenting the drawbacks of previous architecture design strategies, let us introduce the proposed one. Firstly, we look at the definition of timbre. In the literature, timbre is defined by what it is not: a set of auditory attributes of sound events in addition to pitch, loudness, duration and spatial position. This means timbre is invariant to these attributes, so we can design CNN architectures to capture these invariances.

Invariance Pitch invariance: convolve and max-pool over the frequency axis Loudness invariance: L2-normalization of the filter weights Duration invariance: m x n filters learn fixed-duration patterns Spatial position invariance: use monaural down-mixed input We argue that convolving over the frequency axis of a mel spectrogram learns pitch-independent timbre representations, and that max-pooling over the frequency axis removes the frequency resolution of the feature map. We use L2-norm regularization to keep the filter weights at low energy, which achieves loudness invariance. We use m x n filters to learn representations of a fixed duration. And we use monaural down-mixed input to remove the spatial trait.
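A minimal PyTorch sketch of these mechanisms; the exact filter and pooling sizes here are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Pitch invariance: a tall filter convolved over frequency, then max-pooling
# that removes the remaining frequency resolution of the feature map.
x = torch.randn(4, 1, 80, 21)                  # mono log-mel patch: 80 bands x 21 frames
conv = nn.Conv2d(1, 32, kernel_size=(50, 1))   # convolve over the frequency axis
pool = nn.MaxPool2d(kernel_size=(31, 1))       # pool away the 31 frequency positions
h = pool(F.elu(conv(x)))                       # shape: (4, 32, 1, 21)

# Loudness invariance: an L2 penalty keeps the filter weights at low energy
# (expressed here as weight decay in the optimizer).
opt = torch.optim.Adam(conv.parameters(), weight_decay=1e-4)
```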

Use domain knowledge and different filter shapes Ex. to capture phoneme patterns, we use 70x10 filters: unvoiced consonants 50x1 filters: low-pitch harmonics 50x5 filters: voiced consonants 70x1 filters: high-pitch harmonics The first core element of our design strategy is to use musical domain knowledge to design the filter shapes; here is an example of using different filter shapes to capture phoneme patterns. The second core element is to use several different filter shapes in the first layer, which the same example illustrates: different filter shapes in the first layer capture different timbre characteristics.
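A sketch of such a multi-shape first layer in PyTorch, using the four phoneme-pattern shapes from the example; the channel counts and the crop-to-shortest merge are my own illustrative choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiShapeFrontEnd(nn.Module):
    """First CNN layer with several filter shapes (channel counts hypothetical)."""
    def __init__(self):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(1, 32, (70, 10)),  # unvoiced consonants
            nn.Conv2d(1, 32, (50, 1)),   # low-pitch harmonics
            nn.Conv2d(1, 32, (50, 5)),   # voiced consonants
            nn.Conv2d(1, 32, (70, 1)),   # high-pitch harmonics
        ])

    def forward(self, x):                          # x: (batch, 1, 80, n_frames)
        outs = []
        for conv in self.branches:
            h = F.elu(conv(x))
            h = h.max(dim=2, keepdim=True).values  # max-pool over frequency
            outs.append(h)
        t = min(o.shape[-1] for o in outs)          # branches differ in time length
        return torch.cat([o[..., :t] for o in outs], dim=1)

print(MultiShapeFrontEnd()(torch.randn(4, 1, 80, 21)).shape)  # (4, 128, 1, 12)
```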

Experiments

Experiments Three experiments to validate the design strategy Singing voice phoneme classification Musical instrument recognition Music auto-tagging We assess the proposed design strategy with three timbre modeling experiments. With them, we will show that a shallow network architecture with multiple filter shapes in the first layer can be highly expressive.

Common configuration Input: monaural log-mel magnitude spectrogram Activation function: exponential linear units (ELUs) Regularization: L2-norm of the weights Loss function: cross-entropy Here are the common configurations used in all three experiments; a minimal sketch follows.
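The sketch below ties this shared setup into one PyTorch training step; the tiny model and tensor sizes are placeholders rather than any of the actual experiment architectures:

```python
import torch
import torch.nn as nn

model = nn.Sequential(                       # placeholder model
    nn.Conv2d(1, 16, (50, 5)), nn.ELU(),     # ELU activations
    nn.AdaptiveMaxPool2d(1), nn.Flatten(),
    nn.Linear(16, 32),                       # e.g. 32 output classes
)
loss_fn = nn.CrossEntropyLoss()              # cross-entropy loss
opt = torch.optim.Adam(model.parameters(), weight_decay=1e-4)  # L2 on weights

x = torch.randn(8, 1, 80, 21)                # batch of mono log-mel patches
y = torch.randint(0, 32, (8,))
opt.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
opt.step()
```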

Singing voice phoneme classification Music style: Beijing opera singing (Chinese) Small dataset: 2 hours of audio, split into train, validation and test sets Two role-types: dan (young woman) and laosheng (old man) Input patch size: 80x21 32 phoneme classes Single layer: 128 filters of 50x1 and 70x1; 64 of 50x5 and 70x5; 32 of 50x10 and 70x10 Max-pooling coefficient 2 over the frequency axis The goal of this experiment is to classify Beijing opera singing excerpts into 32 phoneme classes. As you can see, this is a small-data problem: only 2 hours of audio are used for the training, validation and test sets. The data contains two role-types, which can be understood as two different speakers, so we divide the dataset into two parts, one per role-type, which further reduces the amount of training data. We use a single-layer architecture whose only CNN layer contains 6 different filter shapes to capture different phoneme patterns, and we max-pool over the frequency axis.

Singing voice phoneme classification Small-rectangular: 3x3 filters, 5-layer VGG-net GMMs: 40 components, MFCC input MLP: 2 layers For comparison, we take a small-rectangular filter architecture, a 5-layer VGG-net; GMMs with 40 components; and a 2-layer multilayer perceptron. As the table shows, the proposed multi-filter architecture achieves the best performance for both role-types. And the number of parameters matters: with fewer parameters, we are less prone to overfitting given the same amount of training data.

Musical instrument recognition Dataset: IRMAS, 6705 training samples, 3s length each Input patch size: 96x128 Batch normalization after each convolutional layer 11 instrument classes Single layer: 128 filters of 5x1 and 80x1; 64 of 5x3 and 80x3; 32 of 5x5 and 80x5; max-pooling over the frequency axis Multi layer: the same multi-filter layer; max-pooling 12,16; two layers of 128 3x3 filters; max-pooling 2,2; a 256-node dense layer The goal of this experiment is to recognize the predominant musical instrument. The training set contains 6705 samples, each labeled with a single instrument; the samples in the test set have multiple labels. We experiment with two architectures: the first has one single CNN layer with 6 different filter shapes; the second, which we call multi-layer, contains the same multi-filter layer plus two small-rectangular filter layers and one dense layer.

Musical instrument recognition Bosch: bag-of-features + SVM Han: 9-layer, 3x3 filters, VGG-net For comparison, we take two baselines: the first uses bag-of-features plus an SVM classifier; the second is the state-of-the-art 9-layer small-rectangular VGG-net by Han. We use two types of evaluation metrics. The micro metric takes the class sample support into account; on this metric, the proposed method performs almost on par with the state-of-the-art Han method, but with only half the parameters. This means the proposed architecture is as powerful as the state of the art while being much less prone to overfitting. If we look at the macro metric, which does not take the class support into account, the proposed architecture performs best, again with only half the parameters.
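To clarify the micro/macro distinction, here is a toy scikit-learn example with entirely made-up labels: micro-F1 pools all decisions, so frequent classes dominate, while macro-F1 averages per-class scores and exposes errors on rare classes.

```python
import numpy as np
from sklearn.metrics import f1_score

# Four samples, three labels; the predictor only ever outputs the frequent class.
y_true = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 0, 0]])
y_pred = np.array([[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0]])
print(f1_score(y_true, y_pred, average="micro"))  # 0.8: dominated by class 0
print(f1_score(y_true, y_pred, average="macro"))  # ~0.33: rare classes missed
```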

Music auto-tagging Dataset: MagnaTagATune, 26856 clips, 30s each Predicting the top-50 tags: instruments, genres and others Input patch size: 96x187 Batch normalization after each convolutional layer Multi layer: 10 filters of 100x1, 6 of 100x3, 3 of 100x5, 3 of 100x7; 15 of 75x1, 10 of 75x3, 5 of 75x5, 5 of 75x7; 15 of 25x1, 10 of 25x3, 5 of 25x5, 5 of 25x7; max-pooling over the frequency axis; a 100-node dense layer The goal of the last experiment is to predict the top-50 tags for the MagnaTagATune dataset. Each sample in this dataset has multiple tags, which makes this a multi-label auto-tagging problem. We use one CNN layer with 12 different filter shapes plus one dense layer.
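Since each clip carries several tags, the output layer would typically be one independent sigmoid per tag trained with binary cross-entropy; below is a minimal sketch with placeholder tensors (this is a common choice for multi-label tagging, not a detail confirmed by the slides):

```python
import torch
import torch.nn as nn

n_tags = 50
logits = torch.randn(8, n_tags)                     # placeholder model outputs
targets = torch.randint(0, 2, (8, n_tags)).float()  # placeholder tag matrix
loss = nn.BCEWithLogitsLoss()(logits, targets)      # per-tag binary cross-entropy
```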

Music auto-tagging Choi: 5 CNN layers, 3x3 filters Small-rectangular: adaptation of Choi, fewer parameters Dieleman: 2 CNN layers, high filters We run two experiments with different parameter counts. On the left of the table, we fix the parameter count of all three architectures to 75k; there, the proposed architecture performs best. Then we increase the number of filters in the proposed architecture and compare it with Choi's original architecture. With twice the number of filters, its performance is almost equivalent to Choi's, but with far fewer parameters. Increasing the filter count further degrades performance.

Conclusion The proposed architecture uses Different filter shapes in the first layer Filter shapes designed with domain knowledge Achieved the state-of-the-art result on a small dataset Achieved results equivalent to the state of the art on the larger datasets, with fewer parameters.

Thank you! Any questions?