1 Machine learning for note onset detection. Alexandre Lacoste & Douglas Eck.


2 Outline
What is note onset detection, and why is it useful?
A short review of the field
The details of the algorithm
Results of the MIREX 2005 contest
Results on the custom dataset

3 What are note onsets?
Percussive instruments are modeled as shown in the figure.
[Figure: amplitude envelope of a percussive note versus time]
Basic definition: the note onset is the time, during the attack, at which the slope of the amplitude envelope is highest.
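The basic definition can be made concrete in a few lines of Python. This is a minimal sketch, assuming the amplitude envelope is already available as a sampled array; the function name `max_slope_onset` and the synthetic envelope (a 20 ms raised-cosine attack starting at 0.1 s, followed by an exponential decay) are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def max_slope_onset(envelope: np.ndarray, sr: float) -> float:
    """Time (in seconds) at which the amplitude envelope rises fastest."""
    slope = np.diff(envelope) * sr   # finite-difference derivative, units per second
    k = int(np.argmax(slope))        # steepest rise lies between samples k and k+1
    return (k + 0.5) / sr

# Synthetic percussive note (illustrative values only): silence, a 20 ms
# raised-cosine attack starting at 0.1 s, then an exponential decay.
sr = 1000.0
t = np.arange(500) / sr
t0, attack_len, tau = 0.1, 0.02, 0.1
env = np.zeros_like(t)
in_attack = (t >= t0) & (t < t0 + attack_len)
env[in_attack] = 0.5 - 0.5 * np.cos(np.pi * (t[in_attack] - t0) / attack_len)
after = t >= t0 + attack_len
env[after] = np.exp(-(t[after] - t0 - attack_len) / tau)

onset = max_slope_onset(env, sr)
print(f"onset at {onset:.4f} s")  # close to 0.11 s, the middle of the attack
```

Because the slope is a finite difference, the estimate is only as precise as the envelope's sample rate.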

4 More general definition
What happens if the sounds are not percussive (pitch changes, singing, vibrato, ...)?
Then we define onsets as unpredictable events: if the information from the recent past does not let us predict what comes next, a new event has just arrived.
This is the definition used to label the onsets.
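One way to read this definition is as a prediction error. The sketch below is purely illustrative: it assumes a hypothetical one-dimensional feature track and uses a linear extrapolation from the two previous frames as the predictor, with an arbitrary threshold. The slides do not specify any predictor, and the onsets in the dataset were labeled by hand.

```python
import numpy as np

def prediction_error(x: np.ndarray) -> np.ndarray:
    """|x[n] - (2 x[n-1] - x[n-2])|: error of a linear extrapolation from the recent past."""
    err = np.zeros_like(x)
    err[2:] = np.abs(x[2:] - (2 * x[1:-1] - x[:-2]))
    return err

n = np.arange(120)
feature = np.sin(2 * np.pi * n / 50)   # smooth, vibrato-like variation: predictable
feature[60:] += 1.0                    # abrupt change at frame 60: unpredictable
err = prediction_error(feature)
events = np.flatnonzero(err > 0.5)     # illustrative threshold
print(events)
```

The slow vibrato-like oscillation produces only small errors, while the abrupt change makes the extrapolation fail for the frames right at the change, which is exactly the "new event" in the definition.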

5 Onset detection is not trivial
Detecting percussive note onsets in monophonic songs is easy. Making it work for complex polyphonic music with singing is another story.

6 What can we do with a good note onset detector?
An onset detector is rarely useful on its own, but it is a component of many music algorithms:
Music transcription (from audio to MIDI)
Music editing (song segmentation)
Tempo tracking (once the onsets are known, finding the tempo is much easier)
Musical fingerprinting (the onset trace can serve as a robust identifier)

7 Scheirer’s Psychoacoustic Experiment
Scheirer showed that only the amplitude envelopes of a few frequency bands matter for the rhythmic information.
By using these envelopes to modulate a noise source, the song can be rebuilt with almost no loss of its rhythmic content.
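The noise-modulation idea can be sketched as a simple band-wise vocoder. This is a minimal sketch under stated assumptions: brick-wall band splitting in the FFT domain, an envelope follower made of full-wave rectification plus a 10 ms moving average, and illustrative band edges. The function name `noise_vocode` is hypothetical, and Scheirer's actual filterbank and smoothing may differ.

```python
import numpy as np

def noise_vocode(x, sr, edges, smooth_ms=10.0, seed=0):
    """Replace each band of x by band-limited noise carrying that band's envelope."""
    n = len(x)
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)
    spectrum = np.fft.rfft(x)
    noise_spectrum = np.fft.rfft(np.random.default_rng(seed).standard_normal(n))
    win = max(1, int(sr * smooth_ms / 1000))
    kernel = np.ones(win) / win                       # moving-average low-pass filter
    out = np.zeros(n)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (freqs >= lo) & (freqs < hi)           # brick-wall band-pass (assumption)
        band = np.fft.irfft(spectrum * mask, n=n)
        envelope = np.convolve(np.abs(band), kernel, mode="same")  # rectify, then smooth
        carrier = np.fft.irfft(noise_spectrum * mask, n=n)
        carrier /= np.sqrt(np.mean(carrier ** 2)) + 1e-12           # unit-RMS noise band
        out += envelope * carrier
    return out

sr = 8000
t = np.arange(sr) / sr                                # 1 s of audio
x = np.zeros(sr)
for onset in (0.25, 0.5, 0.75):                       # three decaying 440 Hz notes
    x += np.sin(2 * np.pi * 440 * t) * (t >= onset) * np.exp(-np.maximum(t - onset, 0) / 0.02)
edges = [0, 200, 400, 800, 1600, 3200, 4000]          # illustrative band edges in Hz
y = noise_vocode(x, sr, edges)
```

The output keeps the timing of the three attacks (its energy rises at each onset) while the tonal content is replaced by noise.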

8 The Pre-Lacoste Model
Most onset detection algorithms use Scheirer’s model and apply a filter that detects positive slopes in the band envelopes.
Then they use a peak-picking algorithm to find the onset positions.
This method is fast and simple, and it works well for monophonic percussive songs.
However, it performs very poorly on complex polyphonic music with singing.
It is also very sensitive to parameter tuning.
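A minimal sketch of this kind of baseline, assuming the simplest positive-slope filter (a half-wave rectified first difference of an envelope) and a fixed-threshold local-maximum peak picker. The specific filter shown on the original slide is not reproduced here, and the function names and the toy envelope are hypothetical.

```python
import numpy as np

def positive_slope_odf(env: np.ndarray) -> np.ndarray:
    """Half-wave rectified first difference: keeps only the rising parts of the envelope."""
    d = np.diff(env, prepend=env[0])
    return np.maximum(d, 0.0)

def pick_peaks(odf: np.ndarray, threshold: float) -> np.ndarray:
    """Indices that are local maxima of the detection function and exceed `threshold`."""
    inner = odf[1:-1]
    is_peak = (inner > odf[:-2]) & (inner >= odf[2:]) & (inner > threshold)
    return np.flatnonzero(is_peak) + 1

# Toy envelope: three notes starting at frames 20, 50 and 80, each decaying exponentially.
env = np.zeros(100)
for o in (20, 50, 80):
    env[o:] += np.exp(-np.arange(100 - o) / 5.0)

peaks = pick_peaks(positive_slope_odf(env), threshold=0.5)
print(peaks)
```

The single hand-set `threshold` (together with any smoothing in the filter) is the kind of parameter the slide describes as hard to tune on polyphonic material.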

9 The information is mainly local in time
So why not apply a simple feed-forward neural network (FNN) directly to all the inputs in a window, and just ask whether there is an onset at that position?
We then repeat this for every time step.

10 The algorithm can be split into 3 main steps
1. Compute the spectrogram of the song.
2. Convolve a feed-forward neural network across the spectrogram.
3. Find the onset locations.
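The three steps can be sketched as a small pipeline. This is a sketch under assumptions: the spectrogram is a toy array, `toy_net` is a hypothetical hand-written stand-in for the trained network (energy after the centre frame minus energy before it), and step 3 is simplified to local maxima above zero rather than the threshold-based peak picking described later.

```python
import numpy as np

def sliding_scores(spec: np.ndarray, net, half: int) -> np.ndarray:
    """Step 2: apply the same window classifier `net` at every time frame."""
    n_frames = spec.shape[1]
    padded = np.pad(spec, ((0, 0), (half, half)))     # zero-pad the time axis
    width = 2 * half + 1
    return np.array([net(padded[:, t:t + width]) for t in range(n_frames)])

def toy_net(window: np.ndarray) -> float:
    """Hypothetical stand-in for the trained network: energy from the centre frame
    onwards minus energy in the frames before it."""
    half = window.shape[1] // 2
    return float(window[:, half:2 * half].sum() - window[:, :half].sum())

# Step 1 (stand-in): a toy magnitude spectrogram with notes starting at frames 10 and 30.
spec = np.zeros((8, 40))
spec[2, 10:] = 1.0
spec[5, 30:] = 1.0

scores = sliding_scores(spec, toy_net, half=2)
# Step 3 (simplified): frames whose score is a strict local maximum above zero.
onsets = [t for t in range(1, len(scores) - 1)
          if scores[t] > scores[t - 1] and scores[t] >= scores[t + 1] and scores[t] > 0]
print(onsets)
```

Because the same function is evaluated at every frame, the network only ever sees a fixed-size window, which is what makes the "convolution" view work.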

11 SPECTROGRAMS
Many different time-frequency representations might be useful for this task. Let’s explore some of them:
1. Short-time Fourier transform (STFT)
2. Constant-Q transform
3. Phase plane of the STFT

12 Short-time Fourier Transform
[Figure: STFT spectrogram; the yellow curve represents the onset times]
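A magnitude STFT can be computed with plain NumPy. This is a minimal sketch: the hop is chosen to give the 200 frames per second quoted on the "Net Inputs" slide, while the 16 kHz sample rate, the 512-sample Hann window and the test tone are assumptions for illustration.

```python
import numpy as np

def stft_magnitude(x: np.ndarray, n_fft: int, hop: int) -> np.ndarray:
    """Magnitude STFT with Hann-windowed frames, shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)], axis=1)
    return np.abs(np.fft.rfft(frames, axis=0))

sr = 16000
hop = sr // 200          # 80 samples -> 200 frames per second
n_fft = 512              # assumption: a short window (32 ms at 16 kHz)
t = np.arange(sr) / sr
x = np.where(t >= 0.5, np.sin(2 * np.pi * 1000 * t), 0.0)   # 1 kHz tone starting at 0.5 s
S = stft_magnitude(x, n_fft, hop)
print(S.shape)
```

A shorter `n_fft` sharpens the onsets in time at the cost of frequency resolution, which is relevant to the later observation that results depend on the spectral window width.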

13 Constant-Q Transform
The constant-Q transform has a logarithmic frequency scale, which provides:
a much better frequency resolution at low frequencies;
a better time resolution at high frequencies.
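The trade-off comes from using a window whose length is inversely proportional to each bin's centre frequency. Below is a direct, unoptimized evaluation of one constant-Q frame following the textbook definition; the Hann window, 12 bins per octave, 110 Hz lowest bin and normalization by window length are assumptions, since the slides do not give the parameters used.

```python
import numpy as np

def cqt_frame(x, sr, f_min, bins_per_octave, n_bins, center):
    """Magnitudes of one constant-Q frame centred at sample `center` (direct evaluation)."""
    q = 1.0 / (2 ** (1.0 / bins_per_octave) - 1)      # constant ratio f_k / bandwidth
    mags = np.zeros(n_bins)
    lengths = np.zeros(n_bins, dtype=int)
    for k in range(n_bins):
        f_k = f_min * 2 ** (k / bins_per_octave)        # logarithmically spaced centre freqs
        n_k = int(round(q * sr / f_k))                  # long window at low f, short at high f
        start = center - n_k // 2
        seg = x[start:start + n_k]
        kernel = np.hanning(n_k) * np.exp(-2j * np.pi * f_k * np.arange(n_k) / sr)
        mags[k] = abs(seg @ kernel) / n_k               # normalize so bins are comparable
        lengths[k] = n_k
    return mags, lengths

sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 220 * t)                          # A3, one octave above f_min
mags, lengths = cqt_frame(x, sr, f_min=110.0, bins_per_octave=12, n_bins=36, center=4000)
print(int(np.argmax(mags)), lengths[0], lengths[-1])
```

The window lengths printed for the lowest and highest bins show the resolution trade-off directly: the lowest bin integrates over far more samples than the highest one.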

14 Can we do something with the phase plane?
Without any manipulation, the phase plane does not seem to contain any information.

15 Phase Acceleration
Bello and Sandler [1] found a way to use phase information for onset detection.
They take the principal argument of the phase acceleration.
However, the resulting patterns are not evident enough!
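Phase acceleration is the second difference of the phase along time, wrapped to the principal interval. A minimal sketch, assuming `phase` is an array of shape (frequency, frames) such as `np.angle` of a complex STFT; the example uses a synthetic phase track (a steady partial whose phase jumps by 2 rad at frame 20) instead of real audio.

```python
import numpy as np

def princarg(phase):
    """Wrap phase values to the principal interval [-pi, pi)."""
    return np.mod(phase + np.pi, 2 * np.pi) - np.pi

def phase_acceleration(phase: np.ndarray) -> np.ndarray:
    """princarg of the second difference of phase along time (axis 1).
    `phase` has shape (n_freq, n_frames), e.g. np.angle of a complex STFT."""
    acc = np.zeros_like(phase)
    acc[:, 2:] = princarg(phase[:, 2:] - 2 * phase[:, 1:-1] + phase[:, :-2])
    return acc

# One frequency bin of a steady partial: the phase advances by a constant step per
# frame, then jumps by 2 rad at frame 20 (a synthetic stand-in for a new note).
phase = princarg(0.7 * np.arange(40))[None, :]
phase[:, 20:] = princarg(phase[:, 20:] + 2.0)
acc = phase_acceleration(phase)
print(np.round(acc[0, 18:24], 3))
```

For a stationary partial the acceleration stays near zero, and a phase discontinuity shows up as a pair of opposite spikes.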

16 Phase frequency difference
Instead, if we simply take the difference along the frequency axis, we get interesting patterns.
Results show performance equivalent to that of the magnitude plane, using only the phase.
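One way to see why the phase difference along frequency can carry timing information: for a single click inside the analysis frame, the DFT phase is linear in frequency, so the wrapped difference between adjacent bins is constant and proportional to the click's position in the frame. The sketch below assumes a rectangular window and a single click; it illustrates the principle and is not necessarily how the authors computed their representation.

```python
import numpy as np

def princarg(phase):
    """Wrap phase values to the principal interval [-pi, pi)."""
    return np.mod(phase + np.pi, 2 * np.pi) - np.pi

def phase_freq_difference(phase: np.ndarray) -> np.ndarray:
    """princarg of the phase difference between adjacent frequency bins (axis 0)."""
    return princarg(np.diff(phase, axis=0))

n_fft, m = 64, 10
frame = np.zeros(n_fft)
frame[m] = 1.0                                   # a click 10 samples into the frame
phase = np.angle(np.fft.rfft(frame))[:, None]    # shape (33, 1): one STFT column
d = phase_freq_difference(phase)
print(np.round(d[:4, 0], 4), -2 * np.pi * m / n_fft)
```

Moving the click changes the constant value, so a transient that sweeps through successive frames produces a regular pattern in this representation.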

17 Feed Forward Neural Network
Remember, the algorithm is simply the FNN convolved across time and frequency.
The target is a mixture of thin Gaussians that represents the expectation of having an onset at time t.
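The training target can be built from the hand-labeled onset times. A minimal sketch, assuming a frame rate of 200 frames per second, a Gaussian width of 10 ms and clipping to [0, 1]; the slides give neither the width nor how overlapping Gaussians are combined, and `onset_target` is a hypothetical name.

```python
import numpy as np

def onset_target(onset_times, n_frames, frame_rate, sigma=0.01):
    """Sum of thin Gaussians centred on the labeled onsets, clipped to [0, 1]."""
    t = np.arange(n_frames) / frame_rate
    target = np.zeros(n_frames)
    for o in onset_times:
        target += np.exp(-0.5 * ((t - o) / sigma) ** 2)
    return np.minimum(target, 1.0)   # clipping is an assumption, for close onsets

target = onset_target([0.5, 1.2], n_frames=400, frame_rate=200)
print(target[98:103].round(3))
```

A soft target like this tolerates small labeling errors: a prediction one frame away from a labeled onset is still rewarded, instead of being counted as completely wrong.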

18 Net Inputs
For a decent spectrogram resolution:
Time: 200 bins/s
Frequency: 200 bins
and a window width of 50 ms, we have 10 time bins × 200 frequency bins = 2000 input variables. This is too many!
We randomly sample 200 variables inside the window:
uniform distribution across frequency;
Gaussian distribution across time (more variables near the center).
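The input arithmetic and the sampling scheme can be sketched as follows. The resolution and window values come from the slide; the Gaussian width of 2 time bins, the use of one fixed set of positions drawn before training, and the helper name `sample_positions` are assumptions. Duplicate positions are possible with this simple scheme.

```python
import numpy as np

FRAME_RATE = 200      # time bins per second (slide)
N_FREQ = 200          # frequency bins (slide)
WINDOW_S = 0.050      # window width in seconds (slide)
N_TIME = int(round(FRAME_RATE * WINDOW_S))   # 10 time bins per window
N_FULL = N_TIME * N_FREQ                      # 2000 candidate input variables

def sample_positions(n_samples, n_time, n_freq, time_sigma, rng):
    """Draw fixed (frequency, time) positions inside the window, once, before training."""
    freq_idx = rng.integers(0, n_freq, size=n_samples)        # uniform in frequency
    centre = (n_time - 1) / 2
    offsets = rng.normal(0.0, time_sigma, size=n_samples)     # denser near the centre
    time_idx = np.clip(np.round(centre + offsets), 0, n_time - 1).astype(int)
    return freq_idx, time_idx

rng = np.random.default_rng(0)
freq_idx, time_idx = sample_positions(200, N_TIME, N_FREQ, time_sigma=2.0, rng=rng)

window = rng.random((N_FREQ, N_TIME))      # stand-in for one spectrogram window
net_input = window[freq_idx, time_idx]     # the 200 values fed to the network
print(N_FULL, net_input.shape)
```

The same positions are reused for every window, so each network input always refers to the same relative time and frequency.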

19 Net Structure and Training
Two hidden layers:
20 units in the first layer
15 units in the second layer
1 output neuron
Learning algorithm: Polak-Ribière version of conjugate gradient
K-fold cross-validation for performance estimation
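A forward pass through a network of this size (200 inputs, then 20, 15 and 1 units) takes only a few lines. This is a sketch: the layer sizes come from the slides, while the tanh hidden units, sigmoid output and scaled Gaussian initialization are assumptions. Training is not shown.

```python
import numpy as np

LAYER_SIZES = [200, 20, 15, 1]   # inputs, hidden layer 1, hidden layer 2, output

def init_params(sizes, rng):
    """One (weights, biases) pair per layer, with small random weights."""
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        w = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out))
        params.append((w, np.zeros(n_out)))
    return params

def forward(params, x):
    """x: (batch, 200) -> (batch,) onset scores in (0, 1)."""
    h = x
    for w, b in params[:-1]:
        h = np.tanh(h @ w + b)              # hidden layers: tanh (assumption)
    w, b = params[-1]
    z = h @ w + b
    return 1.0 / (1.0 + np.exp(-z[:, 0]))   # sigmoid output (assumption)

rng = np.random.default_rng(0)
params = init_params(LAYER_SIZES, rng)
n_params = sum(w.size + b.size for w, b in params)
y = forward(params, rng.random((5, 200)))
print(n_params, y.shape)   # 4351 weights and biases, 5 scores
```

With only 4351 parameters the model is small. SciPy's `scipy.optimize.minimize(..., method="CG")` implements a Polak-Ribière-type nonlinear conjugate gradient and could minimize a loss over the flattened parameters. The slides do not say which implementation the authors used.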

20 Net Output
Most peaks are really sharp, and there is very little background noise.
Some peaks are smaller but can still be detected.
The precision is also very good.

21 Peak-Picking
The neural network only emphasizes the onsets; we still have to find the location of each onset.
We simply apply a threshold:
a positive crossing marks the beginning of a peak;
a negative crossing marks its end;
the onset location is the center of mass between them.
The value of the threshold is learned by exhaustive search.
[Figure: network output crossing the threshold, with the beginning and end of a peak marked]
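The threshold-crossing rule can be sketched directly. A minimal sketch, assuming the network output is an array of per-frame scores at a known frame rate; `pick_onsets` is a hypothetical name, and the exhaustive search for the threshold is not shown.

```python
import numpy as np

def pick_onsets(scores, threshold, frame_rate):
    """Onset times (s): centre of mass of each run of frames above `threshold`."""
    above = scores > threshold
    edges = np.diff(above.astype(int), prepend=0, append=0)
    starts = np.flatnonzero(edges == 1)     # positive crossing: region begins
    ends = np.flatnonzero(edges == -1)      # negative crossing: region ends (exclusive)
    onsets = []
    for s, e in zip(starts, ends):
        idx = np.arange(s, e)
        w = scores[s:e]
        onsets.append(float((idx * w).sum() / w.sum()) / frame_rate)
    return onsets

frame_rate = 200
scores = np.zeros(100)
scores[18:23] = [0.2, 0.6, 1.0, 0.6, 0.2]   # a symmetric peak centred on frame 20
scores[60:63] = [0.4, 0.9, 0.4]             # a smaller peak centred on frame 61
onsets = pick_onsets(scores, threshold=0.3, frame_rate=frame_rate)
print(onsets)
```

Taking the center of mass rather than the highest frame lets the estimated onset fall between frames, which helps when the peak is asymmetric.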

22 F-measure
To maximize performance, we want to find as many of the true onsets as possible (recall),
while minimizing the number of spurious onsets (precision).
The F-measure, the harmonic mean of precision and recall, balances the two.
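Precision, recall and F-measure need a rule for deciding when a detected onset matches a labeled one. This sketch assumes a ±50 ms tolerance, a common choice in onset evaluation, and a simple greedy one-to-one matching. The slides do not state the matching rule, and a greedy match is not guaranteed to be optimal in every case.

```python
def onset_f_measure(detected, reference, tol=0.05):
    """Precision, recall and F-measure with one-to-one matching within +/- tol seconds."""
    unmatched = sorted(reference)
    hits = 0
    for d in sorted(detected):
        # closest reference onset that has not been matched yet
        best = min(unmatched, key=lambda r: abs(r - d), default=None)
        if best is not None and abs(best - d) <= tol:
            unmatched.remove(best)
            hits += 1
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(reference) if reference else 0.0
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f

reference = [0.50, 1.00, 1.50, 2.00]
detected = [0.51, 1.02, 1.30, 2.04, 2.50]
p, r, f = onset_f_measure(detected, reference)
print(p, r, round(f, 3))
```

Here three of the five detections match a labeled onset, giving precision 3/5 and recall 3/4. Because the F-measure is a harmonic mean, it stays low unless both are high.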

23 MIREX 2005 Results
No other participant used machine learning.
With a simple FNN, we get a huge performance boost.
We also have the best balance between precision and recall.

24 Custom Dataset
For better testing, we built a custom dataset.
It is composed only of complex polyphonic songs with singing.
There are 60 segments of 10 seconds in total (10 minutes of audio).
All onsets were hand-labeled using a graphical user interface.

25 Results for Different Spectrograms

26 Combining Phase and Magnitude Does Not Help.

27 Deceptively simple
A complex network structure does not help.
A very simple structure still gets good performance.
A single neuron can achieve most of the performance.
[Table: validation F-measure (mean ± deviation) for different numbers of units in the 1st and 2nd hidden layers]

28 Conclusion
Applying machine learning to the onset detection problem is simple and very effective.
It yields an algorithm that is accurate and robust across a wide variety of songs.
It is not sensitive to hyper-parameter tuning.

29 Onset labeling GUI

30 Results for Different Spectrograms
Phase acceleration (Bello and Sandler’s method) is only slightly better than noise.
Phase frequency difference is almost as good as the magnitude plane, but it depends strongly on the spectral window width.
Constant-Q and STFT give the best results, provided the spectral window width is small enough.