PREDICTING SONG HOTNESS

Presentation transcript:

MILLION SONGS: PREDICTING SONG HOTNESS
MICHAEL BALL, NISHOK CHETTY, ROHAN ROY CHOUDHURY, ALPER VURAL

OVERVIEW
Outline: Overview; Problem Statement; Methods (Data Exploration, Data Preparation, Visualization, Feature Engineering, Models); Results (Accuracy, ROC/AUC); Tools; Learnings
The music industry makes a lot of money from popular music.
It is highly invested in identifying trending features.
It is especially interested in an algorithmic way to evaluate the potential popularity of a new song.
Can we predict whether a song is going to be popular?
Can we determine what factors make a song popular?
Presenter: Chetty

PROBLEM STATEMENT
Using machine learning, predict whether a song is going to be popular.
Use feature importance metrics to explore what makes certain songs popular.
Quality metrics: classification accuracy, ROC/AUC.
Presenter: Chetty

METHODS: DATA EXPLORATION
Dataset name: Million songs
Dataset size: 1 million song records
Stored as compressed HDF5 files
Features include key, duration, energy, tempo, artist details, and more (50+ features)
Class label: song hotness (popularity metric)
Presenter: Chetty

METHODS: DATA PREPARATION
Data cleaning/imputation:
Dropped records with missing hotness data
Dropped records with missing year
Imputed longitude, latitude, location
Checked for duplicate keys (song_id as our unique record identifier)
Checked for statistical anomalies using the basic statistics described previously. The only anomalies were the energy and danceability columns, which we dropped.
Presenter: Michael
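The cleaning steps above can be sketched roughly as follows. This is a library-free illustration, not the authors' code: the field names other than song_id are hypothetical, mean imputation is an assumed strategy (the slides do not say how coordinates were imputed), and duplicates are dropped here although the slides only say they were checked.

```python
from statistics import mean

def clean_records(records):
    """Drop incomplete rows, impute coordinates, and de-duplicate on song_id."""
    # 1. Drop records with missing hotness or missing year.
    kept = [r for r in records if r["hotness"] is not None and r["year"] is not None]

    # 2. Impute missing coordinates with the column mean (assumed strategy).
    for col in ("latitude", "longitude"):
        known = [r[col] for r in kept if r[col] is not None]
        fill = mean(known) if known else 0.0
        kept = [dict(r, **{col: fill if r[col] is None else r[col]}) for r in kept]

    # 3. Keep only the first record seen for each song_id.
    seen, unique = set(), []
    for r in kept:
        if r["song_id"] not in seen:
            seen.add(r["song_id"])
            unique.append(r)
    return unique

# Tiny made-up sample, for illustration only.
songs = [
    {"song_id": "S1", "hotness": 0.8, "year": 2005, "latitude": 40.0, "longitude": -74.0},
    {"song_id": "S2", "hotness": None, "year": 1999, "latitude": None, "longitude": None},
    {"song_id": "S3", "hotness": 0.3, "year": 1987, "latitude": None, "longitude": None},
    {"song_id": "S1", "hotness": 0.8, "year": 2005, "latitude": 40.0, "longitude": -74.0},
    {"song_id": "S4", "hotness": 0.5, "year": None, "latitude": 10.0, "longitude": 20.0},
]
cleaned = clean_records(songs)
print([r["song_id"] for r in cleaned])
```

On a dataset of this size the same steps would more likely be done with a data-frame library such as Pandas, which the deck lists among its tools.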

METHODS: VISUALIZATION
[Figures: plots with axes Hotness, Artist Familiarity, Loudness, and Year]
The data set is highly similar, which matches what we know about pop music.
Many unrated songs, with a bias towards more recent music.
Presenter: Michael

METHODS: FEATURE ENGINEERING
TF-IDF on song_title
Create a decade feature: we know that music patterns can be described by decades, so we binned years → decades.
Genre: bag of words on artist_terms. MSDS (surprisingly!) does not have a column for the genre of a song. We categorized songs into appropriate genres based on the content of artist_terms.
Ablation to determine the optimal feature set
Presenter: Michael
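A minimal sketch of two of the features named above: the decade bin and TF-IDF weights over song titles. The helper names and sample titles are made up, and the unsmoothed tf * log(N / df) weighting is one common textbook variant, not necessarily the one the authors used (library implementations often apply smoothing).

```python
import math
from collections import Counter

def decade(year):
    """Bin a release year into its decade, e.g. 1987 -> 1980."""
    return (year // 10) * 10

def tfidf(titles):
    """Return one {term: weight} dict per title using tf * log(N / df)."""
    docs = [title.lower().split() for title in titles]
    n_docs = len(docs)
    # Document frequency: number of titles that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights

titles = ["love song", "love me do", "blue moon"]  # hypothetical titles
weights = tfidf(titles)
print(decade(1987), weights[0])
```

A term that appears in fewer titles ("song") receives a higher weight than one shared across titles ("love"), which is the property that makes TF-IDF useful for short text such as titles.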

METHODS: MODELS
Tuned using 5-fold cross-validation with grid search
SVM (baseline): kernel: RBF, C: 256
Random Forest: max depth: 40, min samples for split: 10, num trees: 10
Logistic Regression: C: 512
Decision Tree: depth: 5, min samples for split: 10
Adaboost: num trees: 200, learning rate: 0.01
K-Nearest Neighbors: k = 1
Neural Network (Multi-Layer Perceptron): algorithm: l-bfgs, learning rate: 0.0001
Presenter: Alper
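To illustrate the tuning procedure, here is a small library-free sketch of grid search with 5-fold cross-validation. It swaps in a toy one-feature kNN classifier on synthetic data as a stand-in for the real models; the data, the candidate values of k, and all function names are assumptions made for the example.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (one feature)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def cv_accuracy(data, k, n_folds=5):
    """Mean validation accuracy over n_folds train/validation splits."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, val in enumerate(folds):
        # Train on every fold except the one held out for validation.
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        correct = sum(knn_predict(train, x, k) == y for x, y in val)
        scores.append(correct / len(val))
    return sum(scores) / n_folds

def grid_search(data, k_values):
    """Score every candidate k and return the best one with all scores."""
    results = {k: cv_accuracy(data, k) for k in k_values}
    best_k = max(results, key=results.get)
    return best_k, results

rng = random.Random(0)
# Synthetic data: a noisy feature whose label is 1 when the clean value exceeds 0.5.
data = []
for _ in range(100):
    x = rng.random()
    data.append((x + rng.gauss(0, 0.1), int(x > 0.5)))

best_k, results = grid_search(data, k_values=[1, 3, 5, 9])
print(best_k, results)
```

With scikit-learn, which the deck lists among its tools, the same loop is typically expressed with `GridSearchCV` and `cv=5`.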

RESULTS: ACCURACY
Model                        Accuracy (%)
Baseline (frequency-based)   55.2
Baseline (SVM)               56.3
Neural Network (MLP)         67.7
kNN                          71.2
Logistic Regression          73.4
Decision Tree                72
Random Forest                77.8
Adaboost                     74.6
Significant improvement over the baseline
Used a simple SVM for the baseline
Baseline: 56.3
Best model: Random Forest, with almost 80% accuracy
Presenter: Rohan
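The slides do not define the frequency-based baseline. One plausible reading, assumed here, is a classifier that always predicts the most frequent training class; its accuracy can be computed as in this sketch (labels are made up).

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(y == majority for y in test_labels) / len(test_labels)

train = [1, 1, 1, 0, 0]                 # hypothetical "hot" (1) / "not hot" (0) labels
test = [1, 1, 1, 0, 0, 1, 0, 1, 0, 1]
baseline = majority_baseline_accuracy(train, test)
print(baseline)
```

Any trained model should beat this number before its accuracy is considered meaningful.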

RESULTS: ROC/AUC
[Figures: ROC curves for the Random Forest, SVM (baseline), Adaboost, and Neural Network models]
Presenter: Rohan
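For readers unfamiliar with the metric, this is a minimal sketch of how an ROC curve and its area (AUC) are computed from predicted scores. The labels and scores are invented for the example and the function names are hypothetical.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by lowering the decision threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    i = 0
    while i < len(ranked):
        # Samples with tied scores cross the threshold together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]               # 1 = popular, 0 = not popular
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # model's predicted probability of "popular"
area = auc(roc_points(labels, scores))
print(round(area, 3))
```

A model that gives every song the same score yields an AUC of 0.5, which is why ROC/AUC exposes uninformative classifiers that accuracy alone can hide.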

TOOLS
Spark to process the 270 GB dataset into a 1 GB CSV; also for ML models (with sparkit-learn)
h5py library to read the dataset (stored in HDF5 binary format)
Pandas for efficient data handling, cleaning and imputation
NumPy and SciPy for data exploration and analysis
Scikit-learn for machine learning models
Sparkit-learn for machine learning models on EC2
Matplotlib for data visualization
Presenter: Alper

LEARNINGS
Dataset learnings:
  Able to predict song popularity with ~80% accuracy
  Random Forest model performed best
  Feature importance (from the information gain metric of the RF model): artist familiarity, artist popularity, loudness, tempo
  Keywords: pop, jazz, classic, guitar, hop, metal, new, power, world
Data science learnings:
  Feature engineering (BoW on artist_terms, TF-IDF on song_title) significantly improved results
  Accuracy isn't enough: need to look at ROC
Interesting question: How do you break into music?
Presenter: Michael