Identifying Words that are Musically Meaningful
David Torres, Douglas Turnbull, Luke Barrington, Gert Lanckriet
Computer Audition Lab, UC San Diego
ISMIR 2007


Identifying Words that are Musically Meaningful David Torres, Douglas Turnbull, Luke Barrington, Gert Lanckriet Computer Audition Lab UC San Diego ISMIR September 25, 2007

1 Introduction
Our Goal: Create a content-based music search engine for natural-language queries.
– CAL Music Search Engine [SIGIR07]
Problem: How do we pick a vocabulary of musically meaningful words?
– A word is present ⇒ a pattern in the audio content
Solution: Find words that are correlated with a set of acoustic signals.

2 Two-View Representation
Consider a set of annotated songs. Each song is represented by:
1. An annotation vector in a Semantic Space
2. Audio feature vector(s) in an Acoustic Space
[Figure: a 2-D Semantic Space (axes 'funky' and 'Ireland') and a 2-D Acoustic Space (axes x and y), each plotting the songs 'Mustang Sally' (The Commitments), 'Riverdance' (Bill Whelan), and 'Hot Pants' (James Brown).]

3 Semantic Representation
Vocabulary of words:
1. CAL500: 174 phrases from a human survey – instrumentation, genre, emotion, usages, vocal characteristics
2. LastFM: ~15,000 tags from a social music site
3. Web Mining: 100,000+ words mined from text documents
Annotation vector, denoted s:
– Each element represents the 'semantic association' between a word and the song.
– Dimension (D_S) = size of the vocabulary
– Example: Frank Sinatra's 'Fly Me to the Moon'
Vocabulary = {funk, jazz, guitar, female vocals, sad, passionate}
Annotation s_i = [0/4, 3/4, 4/4, 0/4, 2/4, 1/4]
The data set is represented by an N x D_S matrix S whose rows are the annotation vectors s_1, …, s_i, …, s_N.
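The averaging that produces an annotation vector can be sketched in a few lines of Python. The vocabulary matches the slide's example, but the per-listener votes below are toy stand-ins, not the real CAL500 survey data:

```python
import numpy as np

# Toy vocabulary from the slide's example (D_S = 6).
vocabulary = ["funk", "jazz", "guitar", "female vocals", "sad", "passionate"]

# Hypothetical survey: one row of binary votes per listener, one column per word.
votes = {
    "Fly Me to the Moon": np.array([
        [0, 1, 1, 0, 1, 0],
        [0, 1, 1, 0, 0, 1],
        [0, 1, 1, 0, 1, 0],
        [0, 0, 1, 0, 0, 0],
    ]),
}

def annotation_vector(song_votes):
    """Average the listener votes: each element becomes the fraction of
    listeners who applied that word to the song."""
    return song_votes.mean(axis=0)

# Stack one annotation vector per song into the N x D_S matrix S.
S = np.vstack([annotation_vector(v) for v in votes.values()])
```

With these votes, the single row of `S` comes out to [0/4, 3/4, 4/4, 0/4, 2/4, 1/4], matching the slide's example annotation.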

4 Acoustic Representation
Each song is represented by an audio feature vector a that is automatically extracted from the audio content.
The data set is represented by an N x D_A matrix A whose rows are the audio feature vectors a_1, …, a_i, …, a_N.

5 Canonical Correlation Analysis (CCA)
CCA is a technique for exploring dependencies between two related spaces.
– Generalization of PCA to multiple spaces
– Constrained optimization problem
Find weight vectors w_s and w_a:
– 1-D projection of the data in the semantic space: S w_s
– 1-D projection of the data in the acoustic space: A w_a
Maximize the correlation of the projections, constraining w_s and w_a to prevent infinite correlation:
max_{w_a, w_s} (S w_s)^T (A w_a)
subject to: (S w_s)^T (S w_s) = 1, (A w_a)^T (A w_a) = 1
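This constrained maximization has a closed-form solution via an SVD of the whitened cross-covariance. A minimal NumPy sketch of the first canonical pair (an illustration of standard CCA, not the paper's implementation):

```python
import numpy as np

def cca_first_pair(S, A, reg=1e-8):
    """First canonical weight pair (w_s, w_a) maximizing (S w_s)^T (A w_a)
    subject to (S w_s)^T (S w_s) = 1 and (A w_a)^T (A w_a) = 1."""
    S = S - S.mean(axis=0)
    A = A - A.mean(axis=0)
    Css = S.T @ S + reg * np.eye(S.shape[1])   # small ridge for stability
    Caa = A.T @ A + reg * np.eye(A.shape[1])
    Csa = S.T @ A
    Ls = np.linalg.cholesky(Css)               # Css = Ls Ls^T
    La = np.linalg.cholesky(Caa)
    # Whitened cross term M = Ls^{-1} Csa La^{-T}; its top singular pair
    # gives the maximizing directions in the whitened coordinates.
    M = np.linalg.solve(Ls, np.linalg.solve(La, Csa.T).T)
    U, sing, Vt = np.linalg.svd(M)
    ws = np.linalg.solve(Ls.T, U[:, 0])        # ws = Ls^{-T} u_1
    wa = np.linalg.solve(La.T, Vt[0])          # wa = La^{-T} v_1
    return ws, wa, sing[0]                     # sing[0] = achieved correlation
```

The change of variables u = Ls^T w_s, v = La^T w_a turns both constraints into unit-norm constraints, which is why the top singular pair of M solves the problem.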

6 CCA Visualization
[Figure: toy example with four songs (a–d) plotted in the semantic space (axes 'funky' and 'Ireland') and the audio feature space (axes x and y). A sparse solution w_s that ignores the 'Ireland' dimension achieves (S w_s)^T (A w_a) = 4.]

7 What Sparsity Means
In the previous example:
w_{s,'funky'} ≠ 0 ⇒ 'funky' is correlated with audio signals ⇒ a musically meaningful word
w_{s,'Ireland'} = 0 ⇒ 'Ireland' is not correlated ⇒ no linear relationship with the acoustic representation
In practice, w_s is dense even if most words are uncorrelated:
– 'dense' means many non-zero values
– due to random variability in the data
Key Idea: reformulate CCA to produce a sparse solution.

8 Introducing Sparse CCA [ICML07]
Plan: penalize the objective function for each non-zero semantic dimension.
Pick a penalty function f(w_s) that penalizes each non-zero dimension:
Take 1: cardinality of w_s: f(w_s) = |w_s|_0 – combinatorial problem, NP-hard
Take 2: L1 relaxation: f(w_s) = |w_s|_1 – non-convex problem, not a very tight approximation
Take 3: SDP relaxation – prohibitively expensive for large problems
Solution: f(w_s) = Σ_i log |w_{s,i}| – non-convex, but a tight approximation that can be solved efficiently with a DC (difference of convex functions) program

9 Introducing Sparse CCA [ICML07]
Plan: penalize the objective function for each non-zero semantic dimension, using the penalty f(w_s) = Σ_i log |w_{s,i}|.
A tuning parameter γ controls the importance of sparsity; increasing γ yields a smaller set of 'musically relevant' words:
max_{w_a, w_s} (S w_s)^T (A w_a) - γ f(w_s)
subject to: (S w_s)^T (S w_s) = 1, (A w_a)^T (A w_a) = 1
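The paper optimizes this log-penalized objective with a DC program. The pruning effect itself can be demonstrated with a much simpler substitute: alternating power iterations on the cross-covariance with soft-thresholding on the semantic weights (an L1-style stand-in in the spirit of penalized matrix decomposition, not the paper's algorithm; unit-norm weights replace the unit-variance constraints for simplicity):

```python
import numpy as np

def soft_threshold(x, t):
    """Elementwise soft-thresholding: shrink toward zero, clip at zero."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def sparse_cca_l1(S, A, gamma, n_iter=100):
    """Toy sparse CCA: alternate rank-1 power steps on the cross-covariance,
    soft-thresholding the semantic weights so weakly correlated words are
    pruned to exactly zero. Larger gamma prunes more words."""
    S = S - S.mean(axis=0)
    A = A - A.mean(axis=0)
    Csa = S.T @ A / len(S)                    # cross-covariance, D_S x D_A
    ws = np.ones(S.shape[1])
    wa = np.ones(A.shape[1])
    for _ in range(n_iter):
        ws = soft_threshold(Csa @ wa, gamma)  # prune weak semantic dims
        norm = np.linalg.norm(ws)
        if norm == 0:                         # gamma pruned everything
            break
        ws /= norm
        wa = Csa.T @ ws
        wa /= np.linalg.norm(wa)
    return ws, wa
```

On synthetic data where only one word tracks the audio, the noise words' weights land at exactly zero while the meaningful word's weight survives, which is the behavior the slide's γ knob controls.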

10 Experimental Setup
CAL500 Data Set [SIGIR07]
– 500 songs by 500 artists
– Semantic Representation:
173 words – genre, instrumentation, usages, emotions, vocals, etc.
The annotation vector is the average over 3+ listeners.
Word agreement score: measures how consistently listeners apply a word to songs.
– Acoustic Representation:
Bag of Dynamic MFCC vectors [McKinney03] – 52-D vectors of spectral modulation intensities
160 vectors per minute of audio content
The annotation vector is duplicated for each Dynamic MFCC vector.
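The row alignment implied by the last bullet (one copy of the song's annotation vector per Dynamic MFCC vector) can be sketched with placeholder features; the random arrays below stand in for real annotations and Dynamic MFCCs:

```python
import numpy as np

# Sketch of the two-view data layout: each song contributes many Dynamic
# MFCC vectors, and its annotation vector is duplicated once per audio
# vector so that the rows of S and A stay aligned song-by-song.
rng = np.random.default_rng(7)
D_S, D_A = 173, 52   # vocabulary size and Dynamic MFCC dimension from the slide

songs = [
    # Hypothetical songs: one minute of audio (160 vectors) and half a minute.
    {"annotation": rng.random(D_S), "mfcc": rng.normal(size=(160, D_A))},
    {"annotation": rng.random(D_S), "mfcc": rng.normal(size=(80, D_A))},
]

A = np.vstack([song["mfcc"] for song in songs])                 # N x D_A
S = np.vstack([np.tile(song["annotation"], (len(song["mfcc"]), 1))
               for song in songs])                              # N x D_S
```

Here N = 160 + 80 = 240 rows, and every row of `S` belonging to the same song is identical.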

11 Experiment 1: Qualitative Results
Words with high acoustic correlation: hip-hop, arousing, sad, drum machine, heavy beat, at a party, rapping
Words with no acoustic correlation: classic rock, normal, constant energy, going to sleep, falsetto

12 Experiment 2: Vocabulary Pruning
AMG2131 Text Corpus [ISMIR06]
– AMG Allmusic song reviews for most of the CAL500 songs
– 315-word vocabulary
– Annotation vector based on the presence or absence of a word in the review
– Noisier word-song relationships than CAL500
Experimental Design:
1. Merge the vocabularies: 488 words
2. Prune noisy words as we increase the amount of sparsity in CCA
Hypothesis:
– AMG words will be pruned before CAL500 words.

13 Experiment 2: Vocabulary Pruning
Experimental Design:
1. Merge the vocabularies: 488 words
2. Prune noisy words as we increase the amount of sparsity in CCA
Result: as Sparse CCA becomes more aggressive, more AMG words are pruned.
[Figure: number of CAL500 and AMG2131 words remaining as the vocabulary size shrinks.]

14 Experiment 3: Vocabulary Selection
Experimental Design:
1. Rank words by (a) how aggressive Sparse CCA must be before the word gets pruned, and (b) how consistently humans use the word across the CAL500 corpus.
2. As we decrease the vocabulary size, calculate the average AROC.
Result: Sparse CCA does predict words that have better AROC.
[Figure: average AROC vs. vocabulary size.]

15 Recap
Constructing a 'meaningful vocabulary' is the first step in building a content-based, natural-language search engine for music.
Given a semantic representation and an acoustic representation, Sparse CCA can be used to find 'musically meaningful' words
– i.e., semantic dimensions linearly correlated with audio features.
Automatically pruning words is important when using noisy sources of semantic information
– e.g., LastFM tags or web documents.

16 Future Work
Theory: moving beyond linear correlation with kernel methods
Application: Sparse CCA can be used to find 'musically meaningful' audio features by imposing sparsity in the acoustic space
Practice: handling large, noisy, semantically annotated music corpora

Identifying Words that are Musically Meaningful David Torres, Douglas Turnbull, Luke Barrington, Gert Lanckriet Computer Audition Lab UC San Diego ISMIR September 25, 2007

18 Experiment 3: Vocabulary Selection
Our content-based music search engine rank-orders songs given a text-based query [SIGIR07].
– Area under the ROC curve (AROC) measures the quality of each ranking:
0.5 is random, 1.0 is perfect; 0.68 is the average AROC over all 1-word queries.
Can Sparse CCA pick words that will have higher AROC?
– Idea: words with high correlation have more signal in the audio representation and will be easier to model.
– How does this compare to picking words that humans consistently use to label songs?
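AROC for a 1-word query has a direct rank-statistic reading: the probability that a randomly chosen relevant song is ranked above a randomly chosen irrelevant one. A small sketch of that computation (illustrative, not the paper's evaluation code):

```python
import numpy as np

def aroc(scores, relevant):
    """Area under the ROC curve for a ranking: the fraction of
    (relevant, irrelevant) song pairs where the relevant song scores
    higher, counting ties as half. 0.5 is random, 1.0 is perfect."""
    scores = np.asarray(scores, dtype=float)
    relevant = np.asarray(relevant, dtype=bool)
    pos, neg = scores[relevant], scores[~relevant]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

For example, a ranking that places both relevant songs above both irrelevant ones gets AROC 1.0, and one that misplaces one relevant song below both irrelevant ones gets 0.5.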