LYRIC-BASED ARTIST NETWORK METHODOLOGY Derek Gossi CS 765 Fall 2014.


Topics
- Review: The Music Recommendation Problem
- Existing Research: A Quick Review
- The Dataset
- Methodology

REVIEW The Music Recommendation Problem

Approaches to Recommendation
- Collaborative filtering
  - Users that liked this artist/song also liked that artist/song
  - Amazon, iTunes Store, Spotify
- Tagging
  - Categorization based on user-generated or pre-defined tags
  - Calm, sad, romantic, cheerful, anxious, depressed
  - Last.fm
- Content-based
  - Look at the audio signal
  - Not widely used in industry yet
  - Pandora, Spotify (in progress)
- What can the lyrics tell us?

The Issue
- Collaborative filtering works well when there is a lot of user feedback, but it doesn’t scale well
- Content-based methods scale very well, because no user data is required
  - However, their results are mixed, and they are not widely used in practice
- What would a lyric-based recommendation network look like, and how would it compare to existing collaborative filtering networks?

EXISTING RESEARCH A Quick Review

Network Topology
- P. Cano, O. Celma, and M. Koppenberger. “The topology of music recommendation networks,” Feb
- Analyzes four music recommendation systems from a network perspective
- Directed edges
- Nodes: n = 16,302 (Yahoo) to 51,616 (MSN)
- Edges: m = 158,866 (AMG) to 511,539 (Yahoo)
- Small-world properties in all networks
  - Average shortest path < 8
  - Clustering coefficient from 0.14 (Amazon) to 0.54 (MSN)

THE DATASET The Million Song Dataset (MSD)

Million Song Dataset
- Open source dataset released in Feb 2011
- Metadata and audio features for a million contemporary audio tracks
- Linked to separate datasets with lyrical data, Last.fm tags, and actual user data

METHODOLOGY

Main Idea
- What can lyrics tell us about music recommendation?
- Build a network of musical artists where edges represent lyrical similarity
- Build a comparable network based on actual user play count data, using a collaborative filtering methodology
- Analyze and compare the topology of both networks
- Use a clustering algorithm on both networks to obtain clusters with strongly connected neighbors
- Figure out which tag categories, if any, are associated with these clusters
  - What are users saying?

Data Cleaning
- Lyrics provided in bag-of-words format
- Stop words removed
- Words stemmed with the Porter2 stemming algorithm
- “Dictionary” limited to top 5,000 words
  - Covers 92% of the complete set
  - Terms outside the limited list are generally noisy and unusable
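The dictionary-limiting step can be sketched as follows. This is a minimal illustration, not the author's code: it assumes each song is already a bag of stemmed words stored as a dict, and the function name `limit_vocabulary` and the toy songs are hypothetical.

```python
from collections import Counter

def limit_vocabulary(songs, top_n):
    """Keep only the top_n most frequent terms across all songs.

    songs: list of dicts mapping an (already stemmed) term to its count.
    Returns the filtered songs and the fraction of word occurrences kept.
    """
    totals = Counter()
    for bag in songs:
        totals.update(bag)  # adds each term's count to the running total
    vocab = {term for term, _ in totals.most_common(top_n)}
    filtered = [{t: c for t, c in bag.items() if t in vocab} for bag in songs]
    kept = sum(totals[t] for t in vocab)
    coverage = kept / sum(totals.values()) if totals else 0.0
    return filtered, coverage

# Toy data (hypothetical): "zxq" stands in for a noisy, rare term.
songs = [{"love": 4, "night": 1}, {"love": 2, "rain": 1}, {"zxq": 1, "night": 1}]
filtered, coverage = limit_vocabulary(songs, top_n=2)
```

With a real corpus, `top_n=5000` would correspond to the limit on the slide, and `coverage` is the kind of figure the 92% refers to.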

Term Frequency Matrix
- Vector Space Model (VSM)
- Represent songs as sparse vectors in n-dimensional space
- n = 5,000
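One way to picture the sparse representation: give every dictionary term a fixed dimension index and store only the nonzero counts. The sketch assumes a dict-based sparse vector; the helper names are hypothetical.

```python
def build_term_index(vocabulary):
    """Assign each dictionary term a fixed dimension (0 .. n-1)."""
    return {term: i for i, term in enumerate(sorted(vocabulary))}

def to_sparse_vector(bag, term_index):
    """Map a bag of words to {dimension: count}, dropping unknown terms."""
    return {term_index[t]: c for t, c in bag.items() if t in term_index}

term_index = build_term_index({"love", "night", "rain"})
vector = to_sparse_vector({"love": 4, "rain": 1, "zxq": 2}, term_index)
```

Only two of the three dimensions are stored for this song; with n = 5,000 most entries of a song vector are zero, which is why a sparse layout pays off.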

Term Frequency Matrix
- Could sum song vectors together for a particular artist to get term frequencies of an entire artist’s catalog
- However, this would lose important information about the variance of individual song vectors for a given artist
- Reduce to artist level after song links are formed

TF-IDF Weighting
- Term frequency matrix can be used to directly compare similarity of two songs
- But what about frequently used and statistically unimportant words?
  - The, at, which, on, by
- Idea: make words used frequently across all songs in the dataset less important
- Multiply term frequency (TF) by inverse document frequency (IDF)
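A minimal sketch of the weighting, assuming the common IDF form log(N / df), where N is the number of songs and df is the number of songs containing the term. The slides do not say which IDF variant was used, so the formula and the function name are assumptions.

```python
import math
from collections import Counter

def tfidf_weight(songs):
    """Reweight term counts by inverse document frequency.

    songs: list of dicts mapping term -> raw count (TF).
    Assumed IDF variant: log(N / df).
    """
    n_songs = len(songs)
    doc_freq = Counter()
    for bag in songs:
        doc_freq.update(bag.keys())  # each song counts a term at most once
    idf = {t: math.log(n_songs / df) for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in bag.items()} for bag in songs]

songs = [{"the": 5, "love": 2}, {"the": 3, "rain": 1}]
weighted = tfidf_weight(songs)
```

In this toy example "the" appears in every song, so its weight drops to zero, while "love" and "rain" keep a positive weight.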

Pairwise Similarity Matrix
- As songs are vectors, we can calculate similarity in terms of the angle between them in the vector space
- Cosine similarity is often used in document classification
- 0 implies two songs are orthogonal (they share no terms) and 1 implies two songs point in the same direction (identical weighted term proportions)
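Cosine similarity on sparse song vectors can be sketched like this. The dict-based vector layout is an assumption carried over for illustration; returning 0.0 for an all-zero vector is a convention chosen here, not something the slides specify.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention: an empty song is similar to nothing
    return dot / (norm_a * norm_b)

no_overlap = cosine_similarity({"love": 1.0}, {"rain": 1.0})
same_direction = cosine_similarity({"love": 1.0, "rain": 2.0},
                                   {"love": 2.0, "rain": 4.0})
```

Because term weights are non-negative, the result always falls in [0, 1]; note that the second pair scores 1 even though one song is "longer" than the other.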

Threshold Selection
- We now have a similarity score in [0,1] between each song in the dataset and every other song
- Simplest idea: pick some threshold in [0,1] and create an edge if the threshold is exceeded
- Issues:
  - Network topology can be significantly impacted by threshold selection
  - The threshold has no intuitive explanation
  - Some songs are left out of the network (scale issues)
- Instead, fix outdegree by using a weighted k shared neighbors approach
  - Given a user liked a given song, which k songs would be recommended based on lyrics?
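The fixed-outdegree idea can be sketched as a top-k selection per song: each song gets weighted, directed edges to its k most similar songs. This is one reading of the slide, not the author's implementation; the precomputed `similarity` table and the function name are hypothetical.

```python
import heapq

def top_k_edges(similarity, k):
    """For each song, keep weighted edges to its k most similar songs.

    similarity: dict song -> dict other_song -> score in [0, 1].
    Returns dict song -> list of (other_song, score), best first.
    """
    edges = {}
    for song, scores in similarity.items():
        candidates = [(s, other) for other, s in scores.items() if other != song]
        best = heapq.nlargest(k, candidates)
        edges[song] = [(other, s) for s, other in best]
    return edges

similarity = {
    "a": {"b": 0.9, "c": 0.2, "d": 0.5},
    "b": {"a": 0.9, "c": 0.1, "d": 0.3},
    "c": {"a": 0.2, "b": 0.1, "d": 0.05},
    "d": {"a": 0.5, "b": 0.3, "c": 0.05},
}
edges = top_k_edges(similarity, k=2)
```

Song "c" is only weakly similar to everything else, yet it still receives k outgoing edges. A global threshold of, say, 0.25 would have left it out of the network entirely.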

Threshold Selection

Reduction to Artist Level
- How many song-level edges run from songs by artist i to songs by artist j?
- Create an artist-level edge if there is at least one; if there is more than one, weight the edge by summing the existing weights
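A sketch of the reduction step, assuming song-level edges are (source song, target song, weight) triples and that a song-to-artist lookup is available. Dropping links between two songs by the same artist is an assumption made here; the slides do not say how such links were handled.

```python
from collections import defaultdict

def reduce_to_artists(song_edges, song_artist):
    """Collapse directed song edges into directed, weighted artist edges.

    song_edges: iterable of (src_song, dst_song, weight).
    song_artist: dict song -> artist.
    """
    artist_edges = defaultdict(float)
    for src, dst, weight in song_edges:
        a, b = song_artist[src], song_artist[dst]
        if a != b:  # assumption: links within one artist's catalog are dropped
            artist_edges[(a, b)] += weight  # sum weights of parallel edges
    return dict(artist_edges)

song_artist = {"s1": "X", "s2": "X", "s3": "Y", "s4": "Z"}
song_edges = [("s1", "s3", 0.5), ("s2", "s3", 0.25),
              ("s1", "s2", 0.9), ("s3", "s4", 0.1)]
artist_edges = reduce_to_artists(song_edges, song_artist)
```

Two songs by X link to a song by Y, so the X to Y edge carries the summed weight, while the single Y to Z link keeps its original weight.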

Collaborative Filtering Network
- Echo Nest Taste Profile Subset
  - Play counts for 1,019,318 users and 48,373,586 user/song/count triples
- Item-based collaborative filtering
- Once again, songs are represented as vectors
  - For a given song, the i-th component of the song vector is the number of plays by user i
- Pairwise similarity computed and network generated to match the lyric network
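The item-based representation can be sketched by pivoting (user, song, count) triples into one sparse vector per song, indexed by user. The triple layout follows the slide's description; the identifiers in the toy data and the function name are hypothetical.

```python
from collections import defaultdict

def song_vectors_from_triples(triples):
    """Build sparse play-count vectors: song -> {user: play count}."""
    vectors = defaultdict(dict)
    for user, song, count in triples:
        # Accumulate in case the same user/song pair appears more than once.
        vectors[song][user] = vectors[song].get(user, 0) + count
    return dict(vectors)

triples = [("u1", "s1", 3), ("u2", "s1", 1), ("u1", "s2", 5)]
vectors = song_vectors_from_triples(triples)
```

These vectors have the same shape as the lyric vectors (a dict of nonzero components), so the same cosine similarity and top-k edge selection can be applied to build a comparable network.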

Tag Data
- Last.fm tags are linked to the nodes in both networks
- Restricted to high-frequency tags representing:
  - Genre
  - Mood
  - Musical style

Topology Comparison
- We now have two networks with fixed outdegree of k and indegree of unknown distribution
- What are the most important nodes, and what are the tags associated with these nodes?
- Degree distribution
- How clustered is the network by genre, style, or mood?
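Since outdegree is fixed, the indegree distribution is the quantity that varies between networks. A minimal sketch of computing it from an adjacency list, with hypothetical names and toy data:

```python
from collections import Counter

def indegree_distribution(edges):
    """Count how many nodes have each indegree.

    edges: dict node -> list of target nodes (directed edges).
    Returns a Counter mapping indegree -> number of nodes.
    """
    indegree = Counter({node: 0 for node in edges})  # include indegree-0 nodes
    for targets in edges.values():
        indegree.update(targets)
    return Counter(indegree.values())

edges = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"], "d": ["a", "c"]}
distribution = indegree_distribution(edges)
```

Every node here has outdegree 2, but indegree ranges from 0 (node "d") to 3 (nodes "a" and "c"), which is the kind of spread the comparison looks at.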

Clustering
- We want to see what the strongest communities are in each of the two networks
- L. Ertöz, M. Steinbach, V. Kumar. “Finding topics in collections of documents: a shared nearest neighbors approach”
- Main idea: avoid clustering two nodes together that are in different classes by ensuring every node shares strongly connected neighbors

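Clustering
- Strong link threshold: minimum number of shared neighbors for a link to count as strong
- Topic threshold: a node with enough strong links is chosen to represent its neighborhood
- Merge threshold: nodes linked above this strength appear together in a cluster
- Noise threshold: nodes with too few strong links are not included
- Labeling threshold: scan the clusters and add remaining nodes through their strong links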
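The strong-link step can be illustrated with a simplified sketch: link strength is the number of neighbors two nodes share, and a link counts as strong when that number reaches the threshold. This covers only the first threshold, not the full shared-nearest-neighbor algorithm. Requiring the two nodes to be in each other's neighbor lists is an assumption, and the names and toy data are hypothetical.

```python
def strong_links(neighbors, strong_link_threshold):
    """Find strong links between mutual neighbors.

    neighbors: dict node -> set of its k nearest neighbors.
    Returns dict (u, v) -> shared-neighbor count, with u < v.
    Node identifiers must be mutually comparable (e.g. all strings).
    """
    links = {}
    for u, nbrs in neighbors.items():
        for v in nbrs:
            # Only consider mutual neighbors, and visit each pair once.
            if u < v and u in neighbors.get(v, set()):
                shared = len(nbrs & neighbors[v])
                if shared >= strong_link_threshold:
                    links[(u, v)] = shared
    return links

neighbors = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "e"},
    "e": {"d", "a", "b"},
}
links = strong_links(neighbors, strong_link_threshold=2)
```

Nodes "a" and "d" are mutual neighbors but share only one neighbor, so their link is not strong; the surviving links separate {a, b, c} from {d, e}.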

Clustering
- Which tags are associated with these clusters?
- How do the two networks compare to each other?
- What can this tell us about music recommendation?
  - Collaborative filtering works well in practice

Conclusion
- How do we recommend music without user data?
  - Content-based methods may work
- Do the lyrics correspond to what people say about music?
- What does this tell us about lyrical expression in general?

QUESTIONS?