PREDICTING SONG HOTNESS

Presentation transcript:

PREDICTING SONG HOTNESS
MILLION SONGS
MICHAEL BALL, NISHOK CHETTY, ROHAN ROY CHOUDHURY, ALPER VURAL

OVERVIEW
Music industry makes a lot of money from popular music
Highly invested in identifying trending features
Especially interested in an algorithmic way to evaluate potential popularity of a new song
Can we predict whether a song is going to be popular?
Can we determine what factors make a song popular?
Outline: Overview; Problem statement; Methods (data exploration, data preparation, visualization, feature engineering, models); Results (accuracy, ROC/AUC); Tools; Learnings
(CHETTY)

PROBLEM STATEMENT
Using machine learning, predict whether a song is going to be popular
Use feature importance metrics to explore what makes certain songs popular
Quality metrics: classification accuracy, ROC/AUC
(CHETTY)

DATA EXPLORATION
Dataset name: Million Song Dataset
Dataset size: 1 million song records
Stored as compressed HDF5 files
Features include: key, duration, energy, tempo, artist details, and more (50+ features)
Class label: song hotness (popularity metric)
(CHETTY)

DATA PREPARATION
Data cleaning/imputation:
  Dropped records with missing hotness data
  Dropped records with missing year
  Imputed longitude, latitude, location
  Checked for duplicate keys (song_id as our unique record identifier)
  Checked for statistical anomalies using the basic statistics described previously
  Only anomalies: energy and danceability columns, which we dropped
(MICHAEL)
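The cleaning steps above could be sketched with pandas roughly as follows. This is a minimal illustration, not the authors' code: the column names (`hotness`, `year`, `latitude`, `longitude`), the toy table, the median imputation, and the choice to drop duplicate `song_id` rows (the slide only says duplicates were checked) are all assumptions. Imputing the free-text `location` field is omitted because the slide does not say how it was done.

```python
import numpy as np
import pandas as pd


def clean_songs(df: pd.DataFrame) -> pd.DataFrame:
    """Apply slide-style cleaning steps to a song table (hypothetical column names)."""
    # Drop records with missing hotness or missing year.
    out = df.dropna(subset=["hotness", "year"])
    # Keep one record per song_id (assumption: duplicates are dropped, not just checked).
    out = out.drop_duplicates(subset="song_id").copy()
    # Impute coordinates; the median is an assumed strategy, the slide names none.
    for col in ("latitude", "longitude"):
        out[col] = out[col].fillna(out[col].median())
    # Drop the two columns the slide reports as anomalous.
    out = out.drop(columns=["energy", "danceability"], errors="ignore")
    return out.reset_index(drop=True)


# Toy data, invented for illustration only.
songs = pd.DataFrame({
    "song_id": ["a", "b", "c", "a", "e"],
    "hotness": [0.5, np.nan, 0.7, 0.5, 0.2],
    "year": [2001, 1999, np.nan, 2001, 1985],
    "latitude": [10.0, 11.0, 12.0, 10.0, np.nan],
    "longitude": [20.0, 21.0, 22.0, 20.0, np.nan],
    "energy": [0.0] * 5,
    "danceability": [0.0] * 5,
})
cleaned = clean_songs(songs)
print(cleaned)
```

Rows "b" and "c" are removed for missing values, the repeated "a" is collapsed, and the missing coordinates of "e" are filled from the remaining rows.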

VISUALIZATION
[Figures not reproduced: plots with axes labeled Hotness, Artist Familiarity, Loudness, and Year]
Data set is highly similar, which we know about pop music
Many unrated songs, with a bias towards more recent music
(MICHAEL)

FEATURE ENGINEERING
TF-IDF on song_title
Create a decade feature
  We know that music patterns can be described by decades: binned years → decades.
Genre: bag of words on artist_terms
  MSDS (surprisingly!) does not have a column for the genre of a song. We categorized songs into appropriate genres based on the content of artist_terms.
Ablation to determine optimal feature set
(MICHAEL)
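As a rough sketch of these three features, assuming scikit-learn's text vectorizers (the slides list scikit-learn as a tool but do not show the actual calls): `to_decade` is a hypothetical helper, and the years, titles, and artist terms below are invented toy values.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


def to_decade(year: int) -> int:
    """Bin a release year into its decade, e.g. 1984 -> 1980."""
    return (year // 10) * 10


# Decade feature from toy years.
years = [1967, 1984, 1999, 2008]
decades = [to_decade(y) for y in years]

# TF-IDF on song titles (toy titles, invented for illustration).
titles = ["summer love song", "midnight love", "rainy day blues"]
tfidf = TfidfVectorizer()
title_features = tfidf.fit_transform(titles)  # sparse matrix, one row per title

# Binary bag of words on artist terms (toy terms, invented for illustration).
artist_terms = ["pop rock", "jazz", "heavy metal rock"]
bow = CountVectorizer(binary=True)
term_features = bow.fit_transform(artist_terms)  # 1 if the term occurs, else 0

print(decades)
print(title_features.shape, term_features.shape)
```

In the TF-IDF matrix a word shared by several titles ("love") receives a lower weight than a word unique to one title ("summer"), which is the property that makes the title words usable as discriminating features.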

MODELS
Tuned using 5-fold cross-validation with Grid Search
SVM (baseline): Kernel: RBF, C: 256
Random Forest: Max depth: 40, Min samples for split: 10, Num trees: 10
Logistic Regression: C: 512
Decision Tree: Depth: 5, Min samples for split: 10
Adaboost: Num trees: 200, Learning rate: 0.01
K-Nearest Neighbors: k = 1
Neural Network (Multi-Layer Perceptron): Algorithm: l-bfgs, learning rate: 0.0001
(ALPER)
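A grid search with 5-fold cross-validation might look like the sketch below, using scikit-learn's `GridSearchCV` on a random forest. The synthetic data from `make_classification` and the candidate grid are assumptions chosen for illustration; only the values 40, 10, and 10 come from the slide, and the grid the authors actually searched is not given.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the song features and the hot / not-hot label.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical candidate grid; it includes the values reported on the slide.
param_grid = {
    "max_depth": [5, 40],
    "min_samples_split": [2, 10],
    "n_estimators": [10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))  # mean cross-validated accuracy of the best setting
```

Every parameter combination is scored on the same five folds, and `best_params_` holds the combination with the highest mean accuracy.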

ACCURACY
Model                         Accuracy (%)
Baseline (frequency-based)    55.2
Baseline (SVM)                56.3
Neural Network (MLP)          67.7
kNN                           71.2
Logistic Regression           73.4
Decision Tree                 72
Random Forest                 77.8
Adaboost                      74.6

Significant improvement over the baseline
Used a simple SVM for baseline
Baseline: 56.3
Best model: random forest, almost 80% accuracy
(ROHAN)
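The slides do not define the frequency-based baseline. One common reading, assumed here, is a classifier that always predicts the most frequent training label; the helper name `majority_baseline_accuracy` and the toy labels are invented for illustration.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Predict the most frequent training label for every test record."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / len(test_labels)


# Toy labels: 1 = hot, 0 = not hot.
train = [1, 1, 1, 0, 0, 1, 0, 1]
test = [1, 0, 1, 1, 0]
label, acc = majority_baseline_accuracy(train, test)
print(label, acc)
```

Under this reading, a baseline accuracy near 55% would mean the two classes are close to balanced, so any model scoring well above that figure is learning something beyond the class frequencies.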

ROC/AUC
ROC curve for Random Forest model
[Figures not reproduced: ROC curve plots; additional panel labels: SVM (baseline), Adaboost, Neural Network]
(ROHAN)
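For readers without the plots, the area under the ROC curve can be computed directly as the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as half. The pure-Python sketch below illustrates that definition; it is not the authors' evaluation code, and the labels and scores are toy values.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy labels and predicted scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, independent of any particular decision threshold.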

TOOLS
Spark to process 270 GB dataset into 1 GB CSV; also for ML models (with sparkit-learn)
h5py library to read the dataset (dataset stored in HDF5 binary format)
Pandas for efficient data handling, cleaning and imputation
Numpy and Scipy for data exploration and analysis
Scikit-learn for machine learning models
Sparkit-learn for machine learning models on EC2
Matplotlib for data visualization
(ALPER)

LEARNINGS
Dataset Learnings:
  Able to predict song popularity with ~80% accuracy
  Random Forest model performed best
  Feature importance (from information gain metric of RF model): Artist familiarity, Artist popularity, Loudness, Tempo
  Keywords: pop, jazz, classic, guitar, hop, metal, new, power, world
Data Science Learnings:
  Feature engineering (BoW on artist_terms, TF-IDF on song_title) significantly improved results
  Accuracy isn't enough: need to look at ROC
Interesting question: How do you break into music?
(MICHAEL)
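Feature importances of the kind listed above can be read from a fitted random forest. The sketch below assumes scikit-learn's `RandomForestClassifier` with `criterion="entropy"`, so that splits are scored by information gain; the feature names are borrowed from the slides, but the data is synthetic and the label is constructed to depend only on the first column. The resulting ranking therefore illustrates the mechanism and is not a finding about real songs.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: four random features, label driven by the first one only.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] > 0).astype(int)
feature_names = ["artist_familiarity", "loudness", "tempo", "duration"]

forest = RandomForestClassifier(
    n_estimators=50,
    criterion="entropy",  # information-gain style split scoring
    random_state=0,
)
forest.fit(X, y)

# Impurity-based importances, normalized to sum to 1.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Impurity-based importances can favor features with many distinct values, so they are usually treated as a first look rather than a definitive ranking.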