MILLION SONGS
PREDICTING SONG HOTNESS
MICHAEL BALL, NISHOK CHETTY, ROHAN ROY CHOUDHURY, ALPER VURAL
OVERVIEW PROBLEM STATEMENT METHODS DATA EXPLORATION DATA PREPARATION VISUALIZATION FEATURE ENGINEERING MODELS RESULTS ACCURACY ROC/AUC TOOLS LEARNINGS
Music industry makes a lot of money from popular music
Highly invested in identifying trending features
Especially interested in an algorithmic way to evaluate potential popularity of a new song
Can we predict whether a song is going to be popular?
Can we determine what factors make a song popular?
CHETTY
Using machine learning, predict whether a song is going to be popular
Use feature importance metrics to explore what makes certain songs popular
Quality metrics: classification accuracy, ROC/AUC
CHETTY
Dataset name: Million Song Dataset
Dataset size: 1 million song records
Stored as compressed HDF5 files
Features include: key, duration, energy, tempo, artist details, and more (50+ features)
Class label: song hotness (popularity metric)
CHETTY
Data cleaning/imputation:
    Dropped records with missing hotness data
    Dropped records with missing year
    Imputed longitude, latitude, location
    Checked for duplicate keys (song_id as our unique record identifier)
    Checked for statistical anomalies using the basic statistics described previously
    Only anomalies: energy and danceability columns, which we dropped
MICHAEL
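The cleaning steps above can be sketched in plain Python. The slides name Pandas for this work; the version below avoids dependencies so it runs anywhere. The toy records, every field name other than song_id, and the choice of mean imputation are illustrative assumptions: the slides do not say how longitude, latitude, or location were imputed.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy records; only song_id is a field name taken from the slides.
records = [
    {"song_id": "S1", "hotness": 0.8, "year": 2005, "latitude": 40.7},
    {"song_id": "S2", "hotness": None, "year": 1999, "latitude": 51.5},
    {"song_id": "S3", "hotness": 0.3, "year": None, "latitude": 34.0},
    {"song_id": "S4", "hotness": 0.5, "year": 1987, "latitude": None},
    {"song_id": "S1", "hotness": 0.8, "year": 2005, "latitude": 40.7},
]


def clean(rows):
    """Drop rows missing hotness or year, then impute missing latitude."""
    kept = [dict(r) for r in rows
            if r["hotness"] is not None and r["year"] is not None]
    # Assumption: fill missing latitude with the mean of the observed values.
    observed = [r["latitude"] for r in kept if r["latitude"] is not None]
    fill = mean(observed) if observed else None
    for r in kept:
        if r["latitude"] is None:
            r["latitude"] = fill
    return kept


def duplicate_ids(rows):
    """Return the song_id values that appear more than once."""
    counts = Counter(r["song_id"] for r in rows)
    return sorted(sid for sid, n in counts.items() if n > 1)


cleaned = clean(records)
dupes = duplicate_ids(cleaned)
```

Here clean copies each row before imputing, so the input list is left unchanged, and duplicate_ids only reports repeated keys rather than removing them, matching the "checked for duplicate keys" wording.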
[Figures: plots with axes labeled Hotness, Artist Familiarity, Loudness, and Year]
Data set is highly similar, which we know about pop music
Many unrated songs, with a bias towards more recent music
MICHAEL
Create a decade feature
    We know that music patterns can be described by decades: binned years → decades.
Genre
    MSDS (surprisingly!) does not have a column for the genre of a song.
    We categorized songs into appropriate genres based on the content of artist_terms.
Bag of words on artist_terms
TF-IDF on song_title
Ablation to determine optimal feature set
MICHAEL
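Two of the features above, the decade bin and TF-IDF on song_title, can be illustrated with a dependency-free sketch. The tokenizer (lowercase, split on whitespace), the sample titles, and the unsmoothed textbook formula tf * log(N / df) are assumptions made for illustration; the slides do not give the exact weighting, and scikit-learn's TfidfVectorizer uses a smoothed variant by default.

```python
import math
from collections import Counter


def decade(year):
    """Bin a year into its decade, e.g. 1987 -> 1980."""
    return (year // 10) * 10


def tfidf(titles):
    """Return one {term: weight} dict per title using tf * log(N / df)."""
    docs = [title.lower().split() for title in titles]
    n_docs = len(docs)
    # Document frequency: number of titles that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        term_counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return weights


# Hypothetical titles, chosen so that "love" is shared and "me" is rare.
titles = ["Love Me Do", "Love Song", "Blue Moon"]
weights = tfidf(titles)
```

In the first title, "me" receives a higher weight than "love" because "love" also appears in another title, which is the behavior TF-IDF is meant to provide.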
Tuned using 5-fold cross-validation with Grid Search
SVM (baseline)
    Kernel: RBF, C: 256
Random Forest
    Max depth: 40, Min samples for split: 10, Num trees: 10
Logistic Regression
    C: 512
Decision Tree
    Depth: 5, Min samples for split: 10
AdaBoost
    Num trees: 200, Learning rate: 0.01
K-Nearest Neighbors
    k = 1
Neural Network (Multi-Layer Perceptron)
    Algorithm: l-bfgs, learning rate: 0.0001
ALPER
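The tuning procedure, a grid search scored by 5-fold cross-validation, can be sketched without any libraries. The slides do not show the tuning code, so everything here is an assumption for illustration: the tiny one-dimensional k-nearest-neighbors classifier, the toy data, the candidate grid, and the contiguous (unshuffled) folds.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most 1."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k nearest training points (1-D feature, 0/1 labels)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > k else 0


def cv_accuracy(xs, ys, k_neighbors, n_folds=5):
    """Mean held-out accuracy over the folds for one hyperparameter value."""
    correct = 0
    for fold in k_fold_indices(len(xs), n_folds):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        correct += sum(
            knn_predict(train_x, train_y, xs[i], k_neighbors) == ys[i]
            for i in fold
        )
    return correct / len(xs)


def grid_search(xs, ys, grid):
    """Score every candidate and return the best one plus all scores."""
    scores = {k: cv_accuracy(xs, ys, k) for k in grid}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical, cleanly separable toy data: low values are class 0, high are class 1.
xs = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.15, 0.85, 0.25, 0.75]
ys = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
best_k, scores = grid_search(xs, ys, grid=[1, 3, 5])
```

On real data the folds would normally be shuffled or stratified first; contiguous folds are used here only to keep the sketch short and deterministic.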
Model                        Accuracy (%)
Baseline (frequency-based)   55.2
Baseline (SVM)               56.3
Neural Network (MLP)         67.7
kNN                          71.2
Logistic Regression          73.4
Decision Tree                72
Random Forest                77.8
Adaboost                     74.6

Significant improvement over the baseline
Used a simple SVM for baseline
Baseline: 56.3
Best model: random forest, almost 80% accuracy
ROHAN
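To make the two quantities in the table concrete, here is a minimal sketch of classification accuracy and of a frequency-based baseline. It assumes "frequency-based" means always predicting the most frequent class, which the slides do not spell out; the labels and predictions are toy values, not the project's data.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels and model predictions.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0]
baseline_acc = majority_baseline_accuracy(y_true)
model_acc = accuracy(y_true, y_pred)
```

Read against the table, the random forest's 77.8% is 22.6 percentage points above the frequency-based baseline (55.2%) and 21.5 points above the SVM baseline (56.3%).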
ROC Curve for Random Forest model
[Figures: ROC curves; additional panels labeled SVM (baseline), Adaboost, Neural Network]
ROHAN
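The area under the ROC curve can be computed directly from classifier scores, without plotting. The sketch below uses the standard pairwise interpretation: AUC is the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as half. The labels and scores are toy values chosen for illustration, not results from the project.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted scores.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

This pairwise form is quadratic in the number of examples, so it suits small checks; for large data a rank-based computation or a library routine is the usual choice.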
Spark to process 270 GB dataset into 1 GB CSV; also for ML models (with sparkit-learn)
h5py library to read the dataset (dataset stored in HDF5 binary format)
Pandas for efficient data handling, cleaning and imputation
NumPy and SciPy for data exploration and analysis
Scikit-learn for machine learning models
Sparkit-learn for machine learning models on EC2
Matplotlib for data visualization
ALPER
Dataset Learnings:
    Able to predict song popularity with ~80% accuracy
    Random Forest model performed best
    Feature importance (from information gain metric of RF model): Artist familiarity, Artist popularity, Loudness, Tempo
    Keywords: pop, jazz, classic, guitar, hop, metal, new, power, world
Data Science Learnings:
    Feature engineering (BoW on artist_terms, TF-IDF on song_title) significantly improved results
    Accuracy isn't enough: need to look at ROC
Interesting question: How do you break into music?
MICHAEL
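The information gain metric behind the feature importance ranking can be shown on a single threshold split. This is a minimal sketch of the textbook definition (parent entropy minus the size-weighted entropy of the two children), not the random forest's own importance calculation, which aggregates such gains over many trees. The feature values, labels, and threshold are hypothetical.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        result -= p * math.log2(p)
    return result


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on values <= threshold."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted


# Hypothetical feature that separates the classes perfectly ...
gain_informative = information_gain([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1], 0.5)
# ... and one that carries no information about the label.
gain_uninformative = information_gain([0.1, 0.8, 0.2, 0.9], [0, 0, 1, 1], 0.5)
```

A feature such as artist familiarity would rank highly under this metric if splits on it tend to produce children that are much purer than their parent node.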