Modeling Music with Words: a multi-class naïve Bayes approach
Douglas Turnbull, Luke Barrington, Gert Lanckriet
Computer Audition Laboratory, UC San Diego
ISMIR 2006, October 11, 2006
People use words to describe music
How would one describe "I'm a Believer" by The Monkees? We might use words related to:
- Genre: 'Pop', 'Rock', '60s'
- Instrumentation: 'tambourine', 'male vocals', 'electric piano'
- Adjectives: 'catchy', 'happy', 'energetic'
- Usage: 'getting ready to go out'
- Related Sounds: 'The Beatles', 'The Turtles', 'Lovin' Spoonful'
We learn to associate certain words with the music we hear.
Modeling music and words
Our goal is to design a statistical system that learns a relationship between music and words. Given such a system, we can perform:
- Annotation: given the audio content of a song, 'annotate' the song with semantically meaningful words (song → words)
- Retrieval: given a text-based query, 'retrieve' relevant songs based on their audio content (words → songs)
Modeling images and words
Content-based image annotation and retrieval has been a hot topic in recent years [CV05, FLM04, BJ03, BDF+02, ...]. This application has benefited from, and inspired, recent developments in machine learning.
[Example images from [CV05]: retrieval for the query string 'jet', and automatic image annotation]
How can MIR benefit from and inspire new developments in machine learning?
Related work
Modeling music and words is at the heart of MIR research:
- jointly modeling semantic labels and audio content
- genre, emotion, style, and usage classification
- music similarity analysis
Whitman et al. have produced a large body of work that is closely related to ours [Whi05, WE04, WR05]. Others have looked at joint models of words and sound effects; most focus on non-parametric (kNN) models [SAR-Sla02, AudioClas-CK04].
Representing music and words
Consider a vocabulary and a heterogeneous data set of song-caption pairs:
- Vocabulary: a predefined set of words
- Song: a set of audio feature vectors, X = {x1, ..., xT}
- Caption: a binary document vector, y
Example: "I'm a Believer" by The Monkees is a happy pop song that features tambourine. Given the vocabulary {pop, jazz, tambourine, saxophone, happy, sad}:
- X = the set of MFCC vectors extracted from the audio track
- y = [1, 0, 1, 0, 1, 0]
A sketch of this representation follows below.
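To make the representation concrete, here is a minimal Python sketch. The toy vocabulary, the helper name caption_to_binary_vector, and the random stand-in for MFCC extraction are illustrative assumptions, not part of the original system.

```python
# Minimal sketch of the song-caption representation described above.
import numpy as np

VOCAB = ["pop", "jazz", "tambourine", "saxophone", "happy", "sad"]

def caption_to_binary_vector(caption_words, vocab=VOCAB):
    """Binary document vector y: y[i] = 1 iff vocab[i] appears in the caption."""
    caption = set(caption_words)
    return np.array([1 if w in caption else 0 for w in vocab])

# "I'm a Believer" is a happy pop song featuring tambourine:
y = caption_to_binary_vector(["pop", "happy", "tambourine"])
print(y)  # -> [1 0 1 0 1 0]

# The song itself is a set of T audio feature vectors, e.g. 13-d MFCCs;
# random vectors stand in here for a real feature-extraction pipeline.
T, d = 500, 13
X = np.random.randn(T, d)
```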
Overview of our system: Representation
[System diagram: training data (vocabulary, songs, captions) → features: caption document vectors (y) and audio-feature extraction (X)]
Probabilistic model for music and words
Consider, as before, a vocabulary and a set of song-caption pairs. For the i-th word in the vocabulary, we estimate a 'word' distribution P(x|i):
- a probability distribution over the audio feature vector space
- modeled with a Gaussian Mixture Model (GMM)
- estimated using the Expectation-Maximization (EM) algorithm
Key idea: the training data for each 'word' distribution is the set of all feature vectors from all songs that are labeled with that word. This training data is both:
- multiple-instance: it includes some irrelevant feature vectors
- weakly labeled: it excludes some relevant feature vectors
Our probabilistic model is the resulting set of 'word' distributions (GMMs). A training sketch follows below.
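The slides do not name an implementation, but per-word GMM training can be sketched with scikit-learn, whose GaussianMixture is fit by EM. The function name train_word_models and the choice of 8 diagonal-covariance components are assumptions for illustration.

```python
# Sketch of per-word GMM training; `songs` is a list of (X, y) pairs
# as in the representation sketch above. Assumes every vocabulary word
# labels at least one training song.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_word_models(songs, vocab_size, n_components=8):
    word_models = []
    for i in range(vocab_size):
        # Pool feature vectors from every song whose caption contains word i.
        # This pooled set is multiple-instance (some vectors are irrelevant
        # to the word) and weakly labeled (some relevant vectors are missed).
        pooled = np.vstack([X for X, y in songs if y[i] == 1])
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit(pooled)  # EM estimation of P(x | word i)
        word_models.append(gmm)
    return word_models
```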
Overview of our system: Modeling
[System diagram: features (y, X) → parameter estimation via the EM algorithm → parametric model: a set of GMMs]
Overview of our system: Annotation
[System diagram: the trained set of GMMs performs inference on a novel song, producing a caption (annotation)]
Inference: Annotation
Given the 'word' distributions P(x|i) and a query song (x1, ..., xT), we annotate with the word
    i* = argmax_i P(i | x1, ..., xT).
Naïve Bayes assumption: we assume the feature vectors xs and xt are conditionally independent given the word i, so that
    P(x1, ..., xT | i) = ∏_{t=1}^{T} P(xt | i).
Assuming a uniform prior over words and taking a log transform, we have
    i* = argmax_i Σ_{t=1}^{T} log P(xt | i).
Using this equation, we annotate the query song with the top N words. A minimal sketch follows below.
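A minimal annotation sketch under the same assumptions, reusing word_models from the training sketch; the function name annotate and its arguments are hypothetical.

```python
# Annotation sketch: score each word by the summed log-likelihood of the
# query song's feature vectors (naive Bayes + uniform word prior), then
# keep the top N words.
import numpy as np

def annotate(X, word_models, vocab, n_words=10):
    # score_samples returns log P(x_t | word i) for each frame x_t
    scores = [gmm.score_samples(X).sum() for gmm in word_models]
    top = np.argsort(scores)[::-1][:n_words]
    return [vocab[i] for i in top]
```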
Overview of our system: Retrieval
[System diagram: a text query is scored against the trained set of GMMs to rank songs (retrieval)]
Inference: Retrieval
We would like to rank test songs for a query word q by the likelihood P(x1, ..., xT | q). Problem: this results in almost the same ranking for all query words. There are two reasons:
1. Length bias: longer songs have proportionately lower likelihood, since the naïve Bayes assumption of conditional independence between audio feature vectors [RQD00] contributes an additional log term for every feature vector.
Inference: Retrieval (continued)
2. Song bias: many conditional word distributions P(x|q) are similar to the generic song distribution P(x), so generic songs with high probability under P(x) often also have high probability under P(x|q).
Solution: rank instead by the posterior P(q | x1, ..., xT), i.e., normalize P(x1, ..., xT | q) by P(x1, ..., xT). A retrieval sketch under stated assumptions follows below.
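A retrieval sketch under stated assumptions: the slides say to normalize by P(x1, ..., xT) but do not specify how P(x) is estimated, so this sketch approximates it by averaging the word models. That choice, and all function names, are illustrative.

```python
# Retrieval sketch: rank songs by the posterior P(q | x_1..x_T) rather than
# the likelihood P(x_1..x_T | q). With a uniform word prior this amounts to
# normalizing each song's log-likelihood under the query-word model by its
# log-likelihood under a generic song model P(x).
import numpy as np
from scipy.special import logsumexp

def retrieval_scores(test_songs, word_models, q):
    scores = []
    for X in test_songs:  # X: (T, d) array of feature vectors
        log_pq = word_models[q].score_samples(X)  # log P(x_t | q), shape (T,)
        # log P(x_t) approximated as log mean_i P(x_t | i), stably in log space
        all_ll = np.stack([g.score_samples(X) for g in word_models])
        log_px = logsumexp(all_ll, axis=0) - np.log(len(word_models))
        # The normalization cancels the length bias (the per-frame sums grow
        # with T in both terms) and the song bias (generically likely songs
        # score high under P(x) as well as under P(x|q)).
        scores.append((log_pq - log_px).sum())
    return scores  # rank songs in decreasing order of these scores
```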
Overview of our system: Inference
[System diagram: the full pipeline, with inference supporting both annotation of a novel song and retrieval from a text query]
Overview of our system: Evaluation
[System diagram: the full pipeline with an evaluation stage for the annotation and retrieval outputs]
Experimental Setup
Data: 2,131 song-review pairs.
Audio:
- popular western music from the last 60 years
- DMFCC feature vectors [MB03]
- each feature vector summarizes 3/4 of a second of audio content
- each song is represented by between 320 and 1,920 feature vectors
Text:
- song reviews from the AMG Allmusic database
- a review is a natural-language document written by a music expert
- we create a vocabulary of 317 'musically relevant' unigrams and bigrams
- each review is converted into a binary document vector
Split: 80% training set (parameter estimation), 20% test set (model evaluation).
Experimental Setup
Tasks:
- Annotation: annotate each test song with 10 words
- Retrieval: rank-order all test songs given a query word
Metrics: we adopt evaluation metrics developed for image annotation and retrieval [CV05].
- Annotation: mean per-word precision and recall
- Retrieval: mean average precision; mean area under the ROC curve
A sketch of the retrieval metrics follows below.
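For the retrieval metrics, scikit-learn provides ready-made implementations; the toy labels and scores below are made up purely to show the calls. Mean average precision and mean AROC are then obtained by averaging these per-word values over all query words.

```python
# Per-query-word retrieval metrics: `relevant` marks test songs whose
# caption contains the query word; `scores` are retrieval scores as in
# the sketch above (here, toy values).
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

relevant = np.array([1, 0, 1, 1, 0])           # toy ground-truth labels
scores = np.array([2.3, -0.5, 1.1, 0.7, 0.2])  # toy retrieval scores

ap = average_precision_score(relevant, scores)
auc = roc_auc_score(relevant, scores)
print(f"average precision: {ap:.3f}, area under ROC: {auc:.3f}")
```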
Quantitative Results

                 Annotation            Retrieval
                 Recall   Precision    maPrec   AROC
    Our Model    .072     .119         .109     0.61
    Baseline     .032     .060         -        0.50

Our model performs significantly better than random on all metrics (one-sided paired t-test with α = 0.1).
- recall and precision are bounded above by a value less than 1
- AROC is perhaps the most intuitive metric
Discussion
1. Music is inherently subjective: different people will use different words to describe the same song.
2. We are learning and evaluating with a very noisy text corpus:
- reviewers do not make explicit decisions about the relationships between individual words when reviewing a song ("This song does not rock.")
- mining the web may not suffice; a solution is to manually label data (e.g., MoodLogic, Pandora)
Discussion (continued)
3. Our system performs much better when we annotate and retrieve sound effects:
- BBC sound effects library
- a more objective task, with a cleaner text corpus
- area under the ROC = 0.80 (compare with 0.61 for music)
4. The best results for content-based image annotation and retrieval are comparable to our sound-effect results.
"Talking about music is like dancing about architecture" (origins unknown)
Please send your questions and comments to Douglas Turnbull: dturnbul@cs.ucsd.edu
References