Demographics and Weblog Hackathon – Case Study. 5.3% of Motley Fool visitors are subscribers. Design a classification model for insight into which variables matter for strategies to increase the subscription rate. Learn by Doing
http://www.meetup.com/HandsOnProgrammingEvents/
Data Mining Hackathon
Funded by Rapleaf, with Motley Fool's data. Serves as an app note for Rapleaf/Motley Fool and a template for other hackathons. Did not use AWS; R on individual PCs. Logistics: Rapleaf funded prizes and food for 2 weekends for ~20-50 people. Venue was free.
Getting more subscribers
Headline Data, Weblog
Demographics
Cleaning Data: training.csv (201,000), headlines.tsv (811 MB), entry.tsv (100k), demographics.tsv. Feature Engineering. Github:
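A minimal sketch of loading the four files in R, assuming standard CSV/TSV layouts; the shared user-id column "uid" used for the join is an assumption, not something the slides specify:

    # Load the hackathon files; read.delim handles tab-separated data
    train <- read.csv("training.csv")        # ~201,000 rows
    heads <- read.delim("headlines.tsv")     # 811 MB of headline data
    entry <- read.delim("entry.tsv")         # ~100k weblog entry records
    demo  <- read.delim("demographics.tsv")
    # Join demographics onto the training labels; "uid" is hypothetical
    merged <- merge(train, demo, by = "uid", all.x = TRUE)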
Ensemble Methods: bagging, boosting, randomForests. Issues: overfitting and stability (small changes in the data cause large changes in predictions). Previously none of these worked at scale; small-scale results use R, while large-scale versions exist only in proprietary implementations (Google, Amazon, etc.).
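A small-scale example in R, assuming the merged table from above and a 0/1 label column named "subscriber" (an assumption about training.csv):

    library(randomForest)
    # One ensemble method at small scale: a random forest on the merged table
    rf <- randomForest(factor(subscriber) ~ . - uid, data = na.omit(merged),
                       ntree = 500, importance = TRUE)
    print(rf)   # out-of-bag error estimate: a built-in check on overfitting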
ROC Curves: binary classifiers only!
Paid Subscriber ROC curve, ~61%
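A sketch of computing this kind of ROC curve and its area with the ROCR package; the held-out data frame "holdout" is hypothetical:

    library(ROCR)
    # Score a held-out split and plot the ROC curve (binary labels only)
    p    <- predict(rf, newdata = holdout, type = "prob")[, 2]
    pred <- prediction(p, holdout$subscriber)
    plot(performance(pred, "tpr", "fpr"))
    performance(pred, "auc")@y.values[[1]]   # AUC, e.g. ~0.61 for this curve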
Boosted Regression Trees Performance: training ROC score = 0.745; cv ROC score = 0.737 (se = 0.002). 5.5% below the winning score without doing any data processing. Random is 50%, or 0.50; at 0.737 − 0.50 we are 23.7 points better than random.
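A hedged sketch of fitting boosted regression trees with dismo's gbm.step, whose tree.complexity/learning.rate arguments match the "tc"/"lr" names used later in these slides; the parameter values and column names are illustrative assumptions:

    library(dismo)   # gbm.step wraps gbm with cross-validated stopping
    brt <- gbm.step(data = merged,
                    gbm.x = which(!names(merged) %in% c("subscriber", "uid")),
                    gbm.y = which(names(merged) == "subscriber"),
                    family = "bernoulli",
                    tree.complexity = 5,    # "tc"
                    learning.rate = 0.01,   # "lr"
                    bag.fraction = 0.5)
    brt$cv.statistics$discrimination.mean   # cv ROC score (0.737 on these slides)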
Contribution of predictor variables
Predictive Importance (Friedman): the number of times a variable is selected for splitting, weighted by the squared-error improvement to the model. Also a measure of sparsity in the data. Fitted-function plots show each variable's effect with the averages of the other model variables removed.

     1  pageV     74.0567852
     2  loc       11.0801383
     3  income     4.1565597
     4  age        3.1426519
     5  residlen   3.0813927
     6  home       2.3308287
     7  marital    0.6560258
     8  sex        0.6476549
     9  prop       0.3817017
    10  child      0.2632598
    11  own        0.2030012
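With a fitted gbm/gbm.step model, this ranking is one call; a sketch assuming the brt object from above:

    # Friedman's relative influence, the same statistic as the table above
    summary(brt, plotit = FALSE)   # variables ranked by rel.inf
    brt$contributions              # dismo's copy of the same ranking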
Behavioral vs. Demographics: demographics are sparse; behavioral weblogs are the best source. Most sites aren't using this information correctly. There is no single correct answer: trial and error on features. The features are more important than the algorithm. Linear vs. nonlinear.
Fitted Values (Crappy)
Fitted Values (Better)
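A sketch of producing these fitted-value plots with dismo, assuming the brt model from earlier:

    library(dismo)
    # "Fitted value" (partial dependence) plots for the top predictors:
    # each variable's effect with the other variables' averages removed
    gbm.plot(brt, n.plots = 6, write.title = FALSE)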
Predictor Variable Interaction Adjusting variable interactions
Variable Interactions
Plot Interactions: age, loc
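A sketch of quantifying and plotting pairwise interactions with dismo; the column indices for age and loc are assumptions:

    library(dismo)
    ints <- gbm.interactions(brt)   # interaction strengths for all pairs
    ints$rank.list                  # strongest pairwise interactions
    # 3-D surface for one pair, e.g. age x loc
    gbm.perspec(brt, x = 1, y = 2)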
Trees vs. other methods: the fitted functions show multiple levels (step functions), which trees capture well. Do other variables match this? Simplify the model or add more features, and iterate to a better model. No math required; analyst-driven.
Number of Trees
[Plot: performance on the data set vs. number of trees]
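gbm.step picks the tree count by cross-validation automatically; with plain gbm you read the optimum off the holdout-deviance curve. A sketch assuming the same merged table and label column as above:

    library(gbm)
    g <- gbm(subscriber ~ . - uid, data = merged, distribution = "bernoulli",
             n.trees = 5000, shrinkage = 0.01, interaction.depth = 5,
             cv.folds = 5)
    best.n <- gbm.perf(g, method = "cv")   # plots deviance vs. number of trees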
Hackathon Results
Weblogs only: 68.15%, 18 points better than random
Demographics add 1%
AWS Advantages: running multiple instances with different algorithms and parameters using R. To do: add tutorial, install Screen, fix R GUI bugs. http://amazonlabs.pbworks.com/w/page/28036646/FrontPage
Conclusion: Data mining at scale requires more development in visualization, MR algorithms, and MR data preprocessing. Tuning is done using visualization. Tune 3 parameters: tc (tree complexity), lr (learning rate), and the number of trees; we didn't cover two of the three. This isn't reproducible in Hadoop/Mahout or any open-source code I know of. Other use cases: predicting which item will sell (eBay), search engine ranking. Careful with MR paradigms: Hadoop MR != Couchbase MR.
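A hypothetical sketch of tuning the three parameters named above with a small grid; gbm.step chooses the number of trees itself, and the grid values are illustrative only:

    for (tc in c(2, 5, 9)) {
      for (lr in c(0.05, 0.01, 0.005)) {
        m <- gbm.step(data = merged,
                      gbm.x = which(!names(merged) %in% c("subscriber", "uid")),
                      gbm.y = which(names(merged) == "subscriber"),
                      family = "bernoulli",
                      tree.complexity = tc, learning.rate = lr)
        cat(tc, lr, m$cv.statistics$discrimination.mean, "\n")   # cv ROC
      }
    }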