Modeling Music with Words: a multi-class naïve Bayes approach
Douglas Turnbull, Luke Barrington, Gert Lanckriet
Computer Audition Laboratory, UC San Diego
ISMIR 2006, October 11, 2006
People use words to describe music
How would one describe "I'm a Believer" by The Monkees? We might use words related to:
- Genre: 'Pop', 'Rock', '60s'
- Instrumentation: 'tambourine', 'male vocals', 'electric piano'
- Adjectives: 'catchy', 'happy', 'energetic'
- Usage: 'getting ready to go out'
- Related Sounds: 'The Beatles', 'The Turtles', 'Lovin' Spoonful'
We learn to associate certain words with the music we hear.
Modeling music and words
Our goal is to design a statistical system that learns a relationship between music and words. Given such a system, we can perform:
- Annotation: given the audio content of a song, 'annotate' the song with semantically meaningful words (song → words)
- Retrieval: given a text-based query, 'retrieve' relevant songs based on their audio content (words → songs)
Modeling images and words
Content-based image annotation and retrieval has been a hot topic in recent years [CV05, FLM04, BJ03, BDF+02, ...]. This application has benefited from, and inspired, recent developments in machine learning.
[Example images from [CV05]: retrieval for the query string 'jet', and automatic image annotation]
How can MIR benefit from and inspire new developments in machine learning?
Related work
Modeling music and words is at the heart of MIR research:
- jointly modeling semantic labels and audio content
- genre, emotion, style, and usage classification
- music similarity analysis
Whitman et al. have produced a large body of work that is closely related to ours [Whi05, WE04, WR05]. Others have looked at joint models of words and sound effects; most focus on non-parametric (kNN) models [SAR-Sla02, AudioClas-CK04].
Representing music and words
Consider a vocabulary and a heterogeneous data set of song-caption pairs:
- Vocabulary: a predefined set of words
- Song: a set of audio feature vectors, X = {x1, ..., xT}
- Caption: a binary document vector, y
Example: "I'm a Believer" by The Monkees is a happy pop song that features tambourine. Given the vocabulary {pop, jazz, tambourine, saxophone, happy, sad}:
- X = the set of MFCC vectors extracted from the audio track
- y = [1, 0, 1, 0, 1, 0]
A sketch of this representation follows below.
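To make the representation concrete, here is a minimal Python sketch. The toy vocabulary, the helper name caption_to_binary_vector, and the random stand-in for MFCC extraction are illustrative assumptions, not part of the original system.

```python
# Minimal sketch of the song-caption representation described above.
import numpy as np

VOCAB = ["pop", "jazz", "tambourine", "saxophone", "happy", "sad"]

def caption_to_binary_vector(caption_words, vocab=VOCAB):
    """Binary document vector y: y[i] = 1 iff vocab[i] appears in the caption."""
    caption = set(caption_words)
    return np.array([1 if w in caption else 0 for w in vocab])

# "I'm a Believer" is a happy pop song featuring tambourine:
y = caption_to_binary_vector(["pop", "happy", "tambourine"])
print(y)  # -> [1 0 1 0 1 0]

# The song itself is a set of T audio feature vectors, e.g. 13-d MFCCs;
# random vectors stand in here for a real feature-extraction pipeline.
T, d = 500, 13
X = np.random.randn(T, d)
```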
Overview of our system: Representation
[System diagram: training data (vocabulary, songs, captions) → features: caption document vectors (y) and audio-feature extraction (X)]
Probabilistic model for music and words
Consider, as before, a vocabulary and a set of song-caption pairs. For the i-th word in the vocabulary, we estimate a 'word' distribution P(x|i):
- a probability distribution over the audio feature vector space
- modeled with a Gaussian Mixture Model (GMM)
- estimated using the Expectation-Maximization (EM) algorithm
Key idea: the training data for each 'word' distribution is the set of all feature vectors from all songs that are labeled with that word. This training data is both:
- multiple-instance: it includes some irrelevant feature vectors
- weakly labeled: it excludes some relevant feature vectors
Our probabilistic model is the resulting set of 'word' distributions (GMMs). A training sketch follows below.
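The slides do not name an implementation, but per-word GMM training can be sketched with scikit-learn, whose GaussianMixture is fit by EM. The function name train_word_models and the choice of 8 diagonal-covariance components are assumptions for illustration.

```python
# Sketch of per-word GMM training; `songs` is a list of (X, y) pairs
# as in the representation sketch above. Assumes every vocabulary word
# labels at least one training song.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_word_models(songs, vocab_size, n_components=8):
    word_models = []
    for i in range(vocab_size):
        # Pool feature vectors from every song whose caption contains word i.
        # This pooled set is multiple-instance (some vectors are irrelevant
        # to the word) and weakly labeled (some relevant vectors are missed).
        pooled = np.vstack([X for X, y in songs if y[i] == 1])
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit(pooled)  # EM estimation of P(x | word i)
        word_models.append(gmm)
    return word_models
```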
Overview of our system: Modeling
[System diagram: features (y, X) → parameter estimation via the EM algorithm → parametric model: a set of GMMs]
Overview of our system: Annotation
[System diagram: the trained set of GMMs performs inference on a novel song, producing a caption (annotation)]
Inference: Annotation
Given the 'word' distributions P(x|i) and a query song (x1, ..., xT), we annotate with the word
    i* = argmax_i P(i | x1, ..., xT).
Naïve Bayes assumption: we assume the feature vectors xs and xt are conditionally independent given the word i, so that
    P(x1, ..., xT | i) = ∏_{t=1}^{T} P(xt | i).
Assuming a uniform prior over words and taking a log transform, we have
    i* = argmax_i Σ_{t=1}^{T} log P(xt | i).
Using this equation, we annotate the query song with the top N words. A minimal sketch follows below.
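A minimal annotation sketch under the same assumptions, reusing word_models from the training sketch; the function name annotate and its arguments are hypothetical.

```python
# Annotation sketch: score each word by the summed log-likelihood of the
# query song's feature vectors (naive Bayes + uniform word prior), then
# keep the top N words.
import numpy as np

def annotate(X, word_models, vocab, n_words=10):
    # score_samples returns log P(x_t | word i) for each frame x_t
    scores = [gmm.score_samples(X).sum() for gmm in word_models]
    top = np.argsort(scores)[::-1][:n_words]
    return [vocab[i] for i in top]
```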
Overview of our system: Retrieval
[System diagram: a text query is scored against the trained set of GMMs to rank songs (retrieval)]
Inference: Retrieval
We would like to rank test songs for a query word q by the likelihood P(x1, ..., xT | q). Problem: this results in almost the same ranking for all query words. There are two reasons:
1. Length bias: longer songs have proportionately lower likelihood, since the naïve Bayes assumption of conditional independence between audio feature vectors [RQD00] contributes an additional log term for every feature vector.
Inference: Retrieval (continued)
2. Song bias: many conditional word distributions P(x|q) are similar to the generic song distribution P(x), so generic songs with high probability under P(x) often also have high probability under P(x|q).
Solution: rank instead by the posterior P(q | x1, ..., xT), i.e., normalize P(x1, ..., xT | q) by P(x1, ..., xT). A retrieval sketch under stated assumptions follows below.
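A retrieval sketch under stated assumptions: the slides say to normalize by P(x1, ..., xT) but do not specify how P(x) is estimated, so this sketch approximates it by averaging the word models. That choice, and all function names, are illustrative.

```python
# Retrieval sketch: rank songs by the posterior P(q | x_1..x_T) rather than
# the likelihood P(x_1..x_T | q). With a uniform word prior this amounts to
# normalizing each song's log-likelihood under the query-word model by its
# log-likelihood under a generic song model P(x).
import numpy as np
from scipy.special import logsumexp

def retrieval_scores(test_songs, word_models, q):
    scores = []
    for X in test_songs:  # X: (T, d) array of feature vectors
        log_pq = word_models[q].score_samples(X)  # log P(x_t | q), shape (T,)
        # log P(x_t) approximated as log mean_i P(x_t | i), stably in log space
        all_ll = np.stack([g.score_samples(X) for g in word_models])
        log_px = logsumexp(all_ll, axis=0) - np.log(len(word_models))
        # The normalization cancels the length bias (the per-frame sums grow
        # with T in both terms) and the song bias (generically likely songs
        # score high under P(x) as well as under P(x|q)).
        scores.append((log_pq - log_px).sum())
    return scores  # rank songs in decreasing order of these scores
```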
Overview of our system: Inference
[System diagram: the full pipeline, with inference supporting both annotation of a novel song and retrieval from a text query]
Overview of our system: Evaluation
[System diagram: the full pipeline with an evaluation stage for the annotation and retrieval outputs]
Experimental Setup
Data: 2,131 song-review pairs.
Audio:
- popular western music from the last 60 years
- DMFCC feature vectors [MB03]
- each feature vector summarizes 3/4 of a second of audio content
- each song is represented by between 320 and 1,920 feature vectors
Text:
- song reviews from the AMG Allmusic database
- a review is a natural-language document written by a music expert
- we create a vocabulary of 317 'musically relevant' unigrams and bigrams
- each review is converted into a binary document vector
Split: 80% training set (parameter estimation), 20% test set (model evaluation).
Experimental Setup
Tasks:
- Annotation: annotate each test song with 10 words
- Retrieval: rank-order all test songs given a query word
Metrics: we adopt evaluation metrics developed for image annotation and retrieval [CV05].
- Annotation: mean per-word precision and recall
- Retrieval: mean average precision; mean area under the ROC curve
A sketch of the retrieval metrics follows below.
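For the retrieval metrics, scikit-learn provides ready-made implementations; the toy labels and scores below are made up purely to show the calls. Mean average precision and mean AROC are then obtained by averaging these per-word values over all query words.

```python
# Per-query-word retrieval metrics: `relevant` marks test songs whose
# caption contains the query word; `scores` are retrieval scores as in
# the sketch above (here, toy values).
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

relevant = np.array([1, 0, 1, 1, 0])           # toy ground-truth labels
scores = np.array([2.3, -0.5, 1.1, 0.7, 0.2])  # toy retrieval scores

ap = average_precision_score(relevant, scores)
auc = roc_auc_score(relevant, scores)
print(f"average precision: {ap:.3f}, area under ROC: {auc:.3f}")
```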
Quantitative Results

                 Annotation            Retrieval
                 Recall   Precision    maPrec   AROC
    Our Model    .072     .119         .109     0.61
    Baseline     .032     .060         -        0.50

Our model performs significantly better than random on all metrics (one-sided paired t-test with α = 0.1).
- recall and precision are bounded above by a value less than 1
- AROC is perhaps the most intuitive metric
Discussion
1. Music is inherently subjective: different people will use different words to describe the same song.
2. We are learning and evaluating with a very noisy text corpus:
- reviewers do not make explicit decisions about the relationships between individual words when reviewing a song ("This song does not rock.")
- mining the web may not suffice; a solution is to manually label data (e.g., MoodLogic, Pandora)
Discussion (continued)
3. Our system performs much better when we annotate and retrieve sound effects:
- BBC sound effects library
- a more objective task, with a cleaner text corpus
- area under the ROC = 0.80 (compare with 0.61 for music)
4. The best results for content-based image annotation and retrieval are comparable to our sound-effect results.
"Talking about music is like dancing about architecture" (origins unknown)
Please send your questions and comments to Douglas Turnbull: dturnbul@cs.ucsd.edu
References