1
Demographics and Weblog Hackathon – Case Study
5.3% of Motley Fool visitors are subscribers. Design a classification model to gain insight into which variables matter for strategies to increase the subscription rate. Learn by doing.
3
Data Mining Hackathon
4
Funded by Rapleaf, with Motley Fool’s data
App note for Rapleaf/Motley Fool. Template for other hackathons. Did not use AWS; R ran on individual PCs. Logistics: Rapleaf funded prizes and food for 2 weekends for ~. Venue was free.
5
Getting more subscribers
6
Headline Data, Weblog
7
Demographics
8
Cleaning Data: training.csv (201,000), headlines.tsv (811MB), entry.tsv (100k), demographics.tsv. Feature Engineering. Github:
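A minimal sketch of one feature-engineering step of this kind: collapsing weblog rows into a per-user page-view count and attaching it to the training rows. The column names `user_id` and `url`, the sample rows, and the idea that `pageV` is a page-view count are all assumptions made for illustration; the slides do not give the real schema or how the features were derived.

```python
import csv
import io
from collections import Counter

# Hypothetical weblog sample. The real column names in headlines.tsv and
# entry.tsv are not given in the slides.
SAMPLE_TSV = "user_id\turl\nu1\t/a\nu2\t/b\nu1\t/c\nu1\t/a\n"


def page_views_per_user(tsv_file):
    """Count weblog rows per user; one row is treated as one page view."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    counts = Counter()
    for row in reader:
        counts[row["user_id"]] += 1
    return counts


def add_page_views(training_rows, counts):
    """Attach a pageV feature to each row; users with no weblog rows get 0."""
    return [dict(row, pageV=counts.get(row["user_id"], 0)) for row in training_rows]


counts = page_views_per_user(io.StringIO(SAMPLE_TSV))
training = [{"user_id": "u1", "subscriber": 1}, {"user_id": "u3", "subscriber": 0}]
features = add_page_views(training, counts)
print(features)
```

Streaming the file row by row keeps memory use flat, which matters for a file the size of headlines.tsv.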
9
Ensemble Methods: Bagging, Boosting, randomForests. Overfitting.
Stability (small changes in the data make large changes in predictions). Previously none of these worked at scale. Small-scale results use R; large-scale versions exist in proprietary implementations (Google, Amazon, etc.).
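To make the bagging idea concrete, here is a rough sketch: fit one very simple model (a one-feature threshold "stump") on each bootstrap resample and let the models vote. The toy data, the stump learner, and the function names are illustrative assumptions, not the hackathon's code, which used R.

```python
import random


def fit_stump(xs, ys):
    """Pick the threshold t that minimizes training errors for 'predict 1 when x > t'."""
    best_t, best_err = None, None
    for t in sorted(set(xs)):
        err = sum((x > t) != bool(y) for x, y in zip(xs, ys))
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t


def bagged_stumps(xs, ys, n_models, rng):
    """Fit one stump per bootstrap resample (sampling rows with replacement)."""
    n = len(xs)
    thresholds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        thresholds.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds


def predict(thresholds, x):
    """Majority vote over the stumps."""
    votes = sum(x > t for t in thresholds)
    return int(2 * votes > len(thresholds))


# Toy data: the label flips from 0 to 1 between x = 4 and x = 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
thresholds = bagged_stumps(xs, ys, 25, random.Random(0))
print(sorted(set(thresholds)), predict(thresholds, 0), predict(thresholds, 100))
```

Each resample can move an individual stump's threshold, which is the instability the slide refers to; the vote across resamples is what dampens it.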
10
ROC Curves: Binary Classifier Only!
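For reference, the area under the ROC curve can be computed directly from labels and scores as the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as half. The sketch below uses that pairwise definition; the labels and scores are made up, and this is not the scoring code used in the hackathon.

```python
def roc_auc(labels, scores):
    """AUC via the pairwise (Mann-Whitney) definition; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up example: 2 subscribers (label 1) and 3 non-subscribers (label 0).
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 4))

# A model that gives everyone the same score is no better than random.
constant_auc = roc_auc(labels, [0.5] * 5)
print(constant_auc)
```

The pairwise loop is quadratic in the number of rows, so it is only meant for small illustrations; a rank-based formula gives the same value on large data.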
11
Paid Subscriber ROC curve, ~61%
12
Boosted Regression Trees Performance
Training data ROC score = 0.745; cv ROC score = ; se = 0.002. 5.5% below the winning score, without doing any data processing. Random is 50%, or 0.50. We are better than random by 23.7 percentage points.
13
Contribution of predictor variables
14
Predictive Importance
Friedman: relative importance is the number of times a variable is selected for splitting, weighted by the squared-error improvement to the model from each split. Measure of sparsity in data. Fit plots remove averages of model variables. Variables by number: 1 pageV, 2 loc, 3 income, 4 age, 5 residlen, 6 home, 7 marital, 8 sex, 9 prop, 10 child, 11 own.
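A small sketch of how an importance score of this kind can be tallied: sum the squared-error improvement of every split made on a variable, average over trees, and rescale so the scores add up to 100. The split records below are invented for illustration, and the function is a simplified stand-in, not the R implementation used for the slide.

```python
from collections import defaultdict


def relative_influence(trees):
    """trees: list of trees, each a list of (variable, squared_error_improvement) splits."""
    if not trees:
        raise ValueError("need at least one tree")
    totals = defaultdict(float)
    for splits in trees:
        for variable, improvement in splits:
            totals[variable] += improvement
    # Average over trees, then rescale so all influences sum to 100.
    averaged = {v: t / len(trees) for v, t in totals.items()}
    scale = sum(averaged.values())
    return {v: 100.0 * a / scale for v, a in averaged.items()}


# Invented split records for two small trees.
trees = [
    [("pageV", 6.0), ("loc", 2.0)],
    [("pageV", 4.0), ("income", 2.0), ("loc", 2.0)],
]
influence = relative_influence(trees)
for name, value in sorted(influence.items(), key=lambda kv: -kv[1]):
    print(name, value)
```

A variable that is split on often but with tiny improvements can rank below one that is split on rarely with large improvements, which is why the count is weighted.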
15
Behavioral vs. Demographics
Demographics are sparse. Behavioral weblogs are the best source. Most sites aren’t using this information correctly. There is no single correct answer. Trial and error on features. The features are more important than the algorithm. Linear vs. nonlinear.
16
Fitted Values (Crappy)
17
Fitted Values Better
18
Predictor Variable Interaction
Adjusting variable interactions
19
Variable Interactions
20
Plot Interactions age, loc
21
Trees vs. other methods. Can see multiple levels, which is good for trees. Do other variables match this? Simplify the model or add more features. Iterate to a better model. No math. Analyst.
22
Number of Trees
23
Data Set Number of Trees
24
Hackathon Results
25
Weblogs only 68.15%, 18% better than random
26
Demographics add 1%
27
AWS Advantages: running multiple instances with different algorithms and parameters using R. Add tutorial, install Screen, R GUI bugs.
28
Conclusion: Data mining at scale requires more development in visualization, MR algorithms, and MR data preprocessing. Tuning using visualization. Tune 3 parameters: tc, lr, #trees. Didn’t cover 2/3. This isn’t reproducible in Hadoop/Mahout or any open source code I know of. Other use cases, e.g. predicting which item will sell (eBay), search engine ranking. Careful with MR paradigms: Hadoop MR != Couchbase MR.
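One way to organize tuning of the three parameters across several machines is to enumerate the grid once and deal the combinations out to workers. In the sketch below, tc and lr are read as tree complexity and learning rate, which is an assumption, and the candidate values and the number of workers are placeholders rather than settings from the hackathon.

```python
from itertools import product

# Placeholder candidate values; the slides do not list the values tried.
tree_complexities = [1, 3, 5]
learning_rates = [0.1, 0.01, 0.001]
tree_counts = [500, 1000, 2000]

# Every combination of the three parameters becomes one training job.
grid = [
    {"tc": tc, "lr": lr, "n_trees": n}
    for tc, lr, n in product(tree_complexities, learning_rates, tree_counts)
]


def shard(jobs, n_workers):
    """Deal jobs out round-robin so each worker gets a similar share."""
    return [jobs[i::n_workers] for i in range(n_workers)]


shards = shard(grid, 4)
for worker, jobs in enumerate(shards):
    print(worker, len(jobs))
```

Each worker would then fit one model per job and report its cross-validated ROC score, so the best combination can be picked afterwards.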