1
Talking Data Click Fraud Detection
Andrew Cudworth 04/23/18
2
Introduction
TalkingData: Chinese data service company (70% of Chinese mobile devices); builds IP blacklists
Objective: Does Click = Download?
Kaggle data: 184M training rows, 100k sample for modeling
All data is anonymized
Evaluation: ROC_AUC score
FAKE! “3 billion clicks per day, 90% potentially fraudulent”
3
EDA – The Data!
Data: 100k Sample, 187M Full Data, 18.8M Predictions
Steps: Model, Apply, Predict, Submit, Score + Rank
4
EDA – What is Unique?
Column    Unique Count
ip        34857
app       161
device    100
OS        130
Channel
2 OS make up 45% of traffic (iOS? Android?)
***100k training sample represented
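The unique counts and the "2 OS make up 45% of traffic" observation can be reproduced with a few lines of standard-library Python. This is only a sketch: the rows below are a hypothetical five-row stand-in for the 100k sample, and the lower-case column names and the helper names `unique_counts` and `top_share` are assumptions, not taken from the slides.

```python
from collections import Counter

# Hypothetical mini-sample standing in for the 100k training sample.
rows = [
    {"ip": 1, "app": 3, "device": 1, "os": 13, "channel": 379},
    {"ip": 2, "app": 3, "device": 1, "os": 19, "channel": 379},
    {"ip": 1, "app": 9, "device": 1, "os": 13, "channel": 134},
    {"ip": 3, "app": 3, "device": 2, "os": 19, "channel": 134},
    {"ip": 4, "app": 12, "device": 1, "os": 22, "channel": 245},
]


def unique_counts(rows):
    """Return the number of distinct values seen in each column."""
    columns = rows[0].keys()
    return {col: len({row[col] for row in rows}) for col in columns}


def top_share(rows, column, k=2):
    """Return the fraction of rows covered by the k most common values."""
    counts = Counter(row[column] for row in rows)
    top = sum(n for _, n in counts.most_common(k))
    return top / len(rows)


counts = unique_counts(rows)
os_share = top_share(rows, "os", k=2)
print(counts)
print(f"Top 2 OS values cover {os_share:.0%} of rows")
```

Because the data is anonymized, the OS codes cannot be mapped back to iOS or Android from the counts alone, which is why the slide leaves that as a question.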
5
EDA – Unique Continued
6
EDA – Data Imbalance
Very unbalanced data: 227 attributed values, 100k total records
Null accuracy: hard to improve
.778 null ROC_AUC with Logistic Regression: room to improve
.5000 Kaggle score if you submit all 0
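A short sketch of why accuracy is a poor yardstick here while ROC_AUC leaves room to improve. It uses the slide's figures (227 attributed records out of 100k) and assumes "100k" means exactly 100,000 rows. The `roc_auc` helper is a hypothetical, standard-library implementation of the rank-based (Mann-Whitney) AUC with ties given average ranks; it is not the author's code.

```python
n_total = 100_000      # assumed exact size of the "100k" sample
n_attributed = 227     # attributed (positive) records from the slide

labels = [1] * n_attributed + [0] * (n_total - n_attributed)

# Null accuracy: always predict the majority class (not attributed).
null_accuracy = (n_total - n_attributed) / n_total


def roc_auc(labels, scores):
    """Rank-based ROC AUC; tied scores share their average rank."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[order[k]] = avg_rank
        i = j
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


all_zero_auc = roc_auc(labels, [0.0] * n_total)
print(f"null accuracy: {null_accuracy:.5f}")
print(f"ROC AUC when every prediction is 0: {all_zero_auc}")
```

A constant prediction ranks no positive above any negative, so its AUC sits at chance level even though its accuracy is close to 1.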
7
Modeling Process
Models: KNN, Decision Tree, Logistic Regression
Features/Transformations: Time Included, Up Sample, Down Sample
Review
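The transformations named above (a time feature, upsampling, downsampling) could look roughly like the following. This is a minimal standard-library sketch under stated assumptions: the column names `is_attributed` and `click_time`, the timestamp format, and all function names are hypothetical and are not confirmed by the slides.

```python
import random
from datetime import datetime


def split_by_label(rows):
    """Split rows into (minority, majority) by the assumed label column."""
    pos = [r for r in rows if r["is_attributed"] == 1]
    neg = [r for r in rows if r["is_attributed"] == 0]
    return (pos, neg) if len(pos) < len(neg) else (neg, pos)


def upsample(rows, seed=0):
    """Repeat minority rows (sampling with replacement) until classes match."""
    rng = random.Random(seed)
    minority, majority = split_by_label(rows)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra


def downsample(rows, seed=0):
    """Keep all minority rows and an equally sized random majority subset."""
    rng = random.Random(seed)
    minority, majority = split_by_label(rows)
    return minority + rng.sample(majority, len(minority))


def click_hour(timestamp):
    """Extract the hour of day; the timestamp format is an assumption."""
    return datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").hour


# Hypothetical data: 3 attributed rows out of 100.
rows = [
    {"is_attributed": 1 if i < 3 else 0, "click_time": "2017-11-07 09:30:38"}
    for i in range(100)
]
up = upsample(rows)
down = downsample(rows)
print(len(up), sum(r["is_attributed"] for r in up))
print(len(down), sum(r["is_attributed"] for r in down))
print(click_hour(rows[0]["click_time"]))
```

Resampling is usually applied to the training split only, so that the evaluation split keeps the original class ratio.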
8
Modeling Results – Lots of Choices, Lots of Overfitting
9
Conclusions, Further Work
Null score on Kaggle is .500
Selected model (Random Forest GS) score: .5122
Leaderboard 1st place: .9827
Further investigation:
Overfitting appears to be a problem
Spend more time tuning parameters
Minimize train/test split delta
Explore attribution time vs click time relationships
IP addresses in test data not in sample data
Scale to full data
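One of the further-investigation items, IP addresses in the test data that never appear in the sample, can be quantified with a set difference. The sketch below uses made-up IP codes, and the function name `unseen_ip_share` is hypothetical.

```python
def unseen_ip_share(sample_ips, test_ips):
    """Return (share of distinct test IPs, share of test rows) not seen in the sample."""
    sample_set = set(sample_ips)
    test_set = set(test_ips)
    unseen = test_set - sample_set
    unseen_rows = sum(1 for ip in test_ips if ip not in sample_set)
    return len(unseen) / len(test_set), unseen_rows / len(test_ips)


# Hypothetical anonymized IP codes.
sample_ips = [1, 2, 2, 3]
test_ips = [2, 3, 4, 4, 5, 6]

ip_share, row_share = unseen_ip_share(sample_ips, test_ips)
print(f"{ip_share:.0%} of distinct test IPs are unseen")
print(f"{row_share:.0%} of test rows come from unseen IPs")
```

A high unseen share would suggest that any feature keyed directly on the IP value learned from the sample carries little information at prediction time.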