A Supervised Approach for Detecting Boundaries in Music using Difference Features and Boosting Douglas Turnbull Computer Audition Lab UC San Diego, USA.


A Supervised Approach for Detecting Boundaries in Music using Difference Features and Boosting Douglas Turnbull, Computer Audition Lab, UC San Diego, USA Gert Lanckriet, UC San Diego, USA Elias Pampalk, AIST, Japan Masataka Goto, AIST, Japan ISMIR September 24, 2007

1 Music has structure
Consider the structure of a pop/rock song:
– Musical Segments: Introduction, Bridges, Verses, Choruses, Outro
– Musical Boundary: the point between two musical segments, e.g., the end of a verse and the beginning of a chorus

2 Automatic Music Segmentation
Our goal is to automatically segment music:
1. Represent a song as a digital signal
2. Extract useful features from that signal
3. Automatically detect 'musical boundaries'
Purpose:
1. More efficiently scan through songs, e.g., the Smart Music Kiosk [Goto06]
2. Generate better musical thumbnails
3. Develop novel music information retrieval applications: temporal dynamics of music, structural comparison

3 Related Work
Two approaches:
1. Self-Similarity: identify similar audio content within a song (the traditional, unsupervised approach)
 – Cluster short-term features of a song [Abdallah 06, Lu 04]
 – Find repetitions within a song [Goto 06, Foote 02, …]
2. Edge Detection: find 'changes' in the audio content
 – Difference features, designed to reflect acoustic changes
 – Supervised approach: learn a model for musical boundaries from human segmentations; the user defines 'boundary' through the training data

4 Outline
– Difference Features
– Supervised Music Boundary Detection
  – Feature Generation
  – Boosted Decision Stumps
– Concluding Remarks

5 Outline
– Difference Features
– Supervised Music Boundary Detection
  – Feature Generation
  – Boosted Decision Stumps
– Concluding Remarks

6 Difference Features
Auditory cues indicate the end of one segment and the beginning of the next. These cues are related to high-level musical notions:
1. Timbre: new instrumentation
2. Harmony: key change
3. Melody: decreased intensity in the singer's voice
4. Rhythm: drum fill
We design 'difference features' that attempt to model these cues.

7 Difference Features
1. Use two adjacent windows: a 'past' window and a 'future' window
2. Calculate a feature within each window:
– Scalar: RMS or BPM
– Matrix: Fluctuation Pattern
– Time series: MFCC or Chroma
3. Calculate the dissimilarity between the features of the two windows:
– Scalar, matrix: Euclidean distance
– Time series: KL divergence between distributions of samples
[Figure: adjacent 'past' and 'future' windows]
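To make step 3 concrete for the time-series case, one can fit a Gaussian to the frames in each window and take a symmetrized KL divergence. The sketch below is illustrative only: the diagonal-Gaussian choice and the small variance floor are assumptions, not the talk's exact recipe.

```python
import numpy as np

def gaussian_kl(mu0, var0, mu1, var1):
    """KL divergence between two diagonal Gaussians."""
    return 0.5 * np.sum(np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def difference_feature(past, future):
    """Dissimilarity between a 'past' and a 'future' window of frame vectors.

    past, future: arrays of shape (n_frames, n_dims), e.g. MFCC frames.
    Fits a diagonal Gaussian to each window and returns a symmetrized KL
    divergence, one common way to compare two bags of frames.
    """
    mu0, var0 = past.mean(axis=0), past.var(axis=0) + 1e-8
    mu1, var1 = future.mean(axis=0), future.var(axis=0) + 1e-8
    return gaussian_kl(mu0, var0, mu1, var1) + gaussian_kl(mu1, var1, mu0, var0)
```

Identical windows score zero; windows drawn from acoustically different material score higher.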

8 Difference Features
Slide the adjacent windows over a song to generate a time series of difference features:
– Hop size = the resolution of boundary detection
– Peaks represent changes in the audio content
[Figure: dissimilarity plotted against time (min) for a song]
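The sliding-window loop can be sketched as follows. The default dissimilarity (Euclidean distance between window means) and the window size are placeholder choices, not the paper's settings; any per-window dissimilarity, such as the KL-based one above, could be passed in.

```python
import numpy as np

def difference_time_series(frames, win=64, hop=1, dist=None):
    """Slide adjacent past/future windows over the frames of a song.

    frames: (n_frames, n_dims) array of per-frame features.
    Returns one dissimilarity value per hop; peaks suggest boundaries.
    dist: a dissimilarity between two windows; defaults to the Euclidean
    distance between the window means (a simple placeholder choice).
    """
    if dist is None:
        dist = lambda p, f: float(np.linalg.norm(p.mean(0) - f.mean(0)))
    out = []
    for t in range(win, len(frames) - win + 1, hop):
        past, future = frames[t - win:t], frames[t:t + win]
        out.append(dist(past, future))
    return np.array(out)
```

On a signal with an abrupt change, the peak of the resulting series lands at the change point.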

9 Outline
– Difference Features
– Supervised Music Boundary Detection
  – Feature Generation
  – Boosted Decision Stumps
– Concluding Remarks

10 Feature Generation
1. Start with 52 time series: 37 difference features and 15 additional features, each a time series sampled every 0.1 seconds
2. Normalize each time series so that it has mean 0 and variance 1
3. Generate multiple smoothed versions using Gaussian kernels of width 1.6 sec, 6.4 sec, and 25.6 sec
4. Calculate 1st and 2nd derivatives, and include the absolute values of the derivatives
The result is a set of 780 time series: 52 features × 3 smoothings × 5 versions (smoothed, two derivatives, and their absolute values). Each sample is represented by a 780-dimensional feature vector.
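Steps 2-4 for a single feature can be sketched as below. Mapping each kernel width directly to a sigma in samples, and the numpy-only smoothing, are assumptions for illustration.

```python
import numpy as np

def gaussian_smooth(s, sigma):
    """Gaussian smoothing by direct convolution (numpy-only stand-in)."""
    r = int(4 * sigma) + 1
    k = np.exp(-0.5 * (np.arange(-r, r + 1) / sigma) ** 2)
    k /= k.sum()
    return np.convolve(np.pad(s, r, mode="edge"), k, mode="valid")

def expand_feature(series, rate=10.0, widths_sec=(1.6, 6.4, 25.6)):
    """Expand one time series into the 15 versions used per feature.

    series: 1-D array sampled at `rate` Hz (10 Hz in the talk).
    The series is normalized to mean 0 / variance 1, smoothed with three
    Gaussian kernels, and each smoothed version contributes its 1st and
    2nd derivatives plus their absolute values (5 versions x 3 widths).
    """
    s = (series - series.mean()) / (series.std() + 1e-8)
    versions = []
    for w in widths_sec:
        sm = gaussian_smooth(s, sigma=w * rate)
        d1 = np.gradient(sm)
        d2 = np.gradient(d1)
        versions.extend([sm, d1, d2, np.abs(d1), np.abs(d2)])
    return np.stack(versions)  # shape: (15, len(series))
```

Stacking all 52 expanded features yields the 780 rows; each column is then one 780-dimensional sample.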

11 Training Data

12 Supervised Framework
We extract hundreds of 780-dimensional feature vectors per song for 100 human-segmented songs:
– 600 vectors per minute of audio content
– Positive examples: 7-15 vectors per song are labeled as 'boundary' vectors
– Negative examples: vectors picked at random, far from any boundary
We learn a boosted decision stump (BDS) classifier:
– A popular classifier for object boundary detection in images [Dollar et al. 06, Viola & Jones 02]
– A powerful discriminative classifier based on the AdaBoost algorithm [Freund 95]
– Useful for efficient and effective feature selection
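The positive/negative labeling scheme can be sketched as follows. The radius around each boundary, the margin defining "far", and the number of negatives are illustrative parameters, not values from the talk.

```python
import numpy as np

def label_frames(n_frames, boundary_idx, pos_radius=2, neg_margin=30,
                 n_neg=100, seed=0):
    """Pick positive/negative training indices for boundary classification.

    Frames within `pos_radius` hops of an annotated boundary are positives;
    negatives are sampled uniformly from frames at least `neg_margin` hops
    from every boundary.
    """
    boundary_idx = np.asarray(boundary_idx)
    pos = set()
    for b in boundary_idx:
        for i in range(max(0, b - pos_radius), min(n_frames, b + pos_radius + 1)):
            pos.add(i)
    far = [i for i in range(n_frames)
           if np.abs(boundary_idx - i).min() >= neg_margin]
    rng = np.random.default_rng(seed)
    neg = rng.choice(far, size=min(n_neg, len(far)), replace=False)
    return sorted(pos), sorted(neg.tolist())
```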

13 Boosted Decision Stump (BDS) Classifier
Decision stump: a simple classifier that predicts one class if the value of an individual feature is above a threshold.
Boosted decision stumps: an ensemble of decision stumps with associated weights.
– Learning: at each iteration, the decision stump (feature and threshold) that most reduces the training-set error is added to the growing ensemble. The weight given to the stump is determined by the boosting algorithm (e.g., AdaBoost).
– Inference: the prediction is a weighted 'vote' from the ensemble.
– Feature selection: the order in which features are added to the ensemble can be interpreted as a ranking of the features.
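A minimal AdaBoost-with-stumps sketch of this learner, assuming labels in {-1, +1} and an exhaustive threshold search (a brute-force simplification of whatever the authors used in practice):

```python
import numpy as np

def train_bds(X, y, n_stumps=20):
    """Minimal AdaBoost with decision stumps.

    X: (n, d) feature vectors; y: labels in {-1, +1}.
    Each round greedily picks the (feature, threshold, polarity) stump that
    minimizes weighted error; the order of chosen features is a ranking.
    """
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(n_stumps):
        best = None
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for pol in (1, -1):
                    pred = pol * np.where(X[:, j] > thr, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol)
        err, j, thr, pol = best
        err = min(max(err, 1e-10), 1 - 1e-10)      # avoid log(0)
        alpha = 0.5 * np.log((1 - err) / err)       # stump weight
        pred = pol * np.where(X[:, j] > thr, 1, -1)
        w *= np.exp(-alpha * y * pred)              # reweight examples
        w /= w.sum()
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def predict_bds(ensemble, X):
    """Weighted vote of the ensemble; sign is the class, magnitude the score."""
    score = np.zeros(len(X))
    for alpha, j, thr, pol in ensemble:
        score += alpha * pol * np.where(X[:, j] > thr, 1, -1)
    return score
```

The real-valued score is exactly what the next slide thresholds and peak-picks to produce boundary estimates.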

14 Evaluation - Hit Rate
The output of the BDS classifier is a time series of scores:
– Each score reflects the confidence that a sample is a 'boundary'
– Boundary estimates: smooth the time series, then pick the 10 highest peaks
Hit rate: an estimate is a hit if it falls within half a second of a true boundary
– Precision: % of estimates that hit a true boundary
– Recall: % of true boundaries that are hit by an estimate
– F: harmonic mean of precision and recall

Framework | Classifier | Precision | Recall | F
Baseline | Uniform Placement | | |
Unsupervised | Peak Picking (MFCC-Diff) | | |
Supervised | BDS | | |
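The hit-rate metric can be computed as below. The greedy one-to-one matching is an assumption about how double-counting is avoided; the talk does not spell this out.

```python
def hit_rate(estimates, truths, tol=0.5):
    """Precision/recall/F for boundary estimates (times in seconds).

    An estimate 'hits' if it lies within `tol` seconds of a still-unmatched
    true boundary; each true boundary can be matched at most once.
    """
    unmatched = list(truths)
    hits = 0
    for e in estimates:
        near = [t for t in unmatched if abs(e - t) <= tol]
        if near:
            unmatched.remove(min(near, key=lambda t: abs(e - t)))
            hits += 1
    precision = hits / len(estimates) if estimates else 0.0
    recall = hits / len(truths) if truths else 0.0
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f
```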

15 Evaluation - Directional Hamming Distance
Directional Hamming Distance (F-score):
– Measures the 'goodness of overlap' between two segmentations
– Between 0 and 1, with 1 being perfect
– Rhodes et al. report a DHD-F of 0.78 on a set of 14 songs using an unsupervised spectral clustering approach [ICASSP 06]; results are hard to compare across different corpora with different segmentations

Framework | Classifier | DHD-F
Baseline | Uniform Placement | 0.71
Unsupervised | Peak Picking (MFCC-Diff) | 0.80
Supervised | BDS | 0.82

16 Outline
– Difference Features
– Supervised Music Boundary Detection
  – Feature Generation
  – Boosted Decision Stumps
– Concluding Remarks

17 Summary
1. Difference features attempt to model the acoustic cues that indicate boundaries within a song.
2. A supervised approach allows a user to explicitly define their notion of 'musical segment' through their training segmentations.
3. Boosted decision stumps quickly identify music boundaries, produce good music segmentations, and implicitly perform feature selection.

18 Future Work
Address problems with dissimilarity measures:
– Euclidean distance assumes a Euclidean vector space
– Reducing a time series to a bag of features ignores temporal information
Use additional features:
– Information-theoretic features
– Beat/onset features, e.g., a drum-fill detector
Learn 'segment-specific' classifiers, e.g., a 'chorus-onset' classifier
Explore new applications: 'chorus-based' music similarity and retrieval

20 Summary of Difference Features
We create 37 difference features. Each feature is a time series of scalars with one sample every 0.1 seconds.

Cue | Feature | Difference
Timbre | Spectral (6): RMS, ZCR, harm, sc, perc, loud | Subtraction
Timbre | MFCC (4x3): coefficients 1-5, 1-20, 2-5, 2-20; original, delta, delta2 | KL distance
Harmony | Chroma (3x3): 12, 24, 36 pitch classes; original, delta, delta2 | KL distance
Melody | F0 (2): F0 and F0-Power | KL distance
Rhythm | Fluctuation Pattern (8): G, f, max, sum, base, non aggr, LFD | Frobenius norm, subtraction

21 Evaluation - Median Time
Median time between estimates and true boundaries:
– Measured in seconds; lower is better

Framework | Classifier | Estimate-to-True | True-to-Estimate
Baseline | Uniform Placement | |
Unsupervised | Peak Picking (MFCC-Diff) | |
Supervised | BDS | 4.3 | 1.8