TalkingData Click Fraud Detection
Andrew Cudworth
04/23/18
Introduction
FAKE!
TalkingData: Chinese data service company (70% of Chinese mobile devices); builds IP blacklists
Objective: Does Click = Download?
Kaggle data: 184M training rows; 100k sample used for modeling
All data is anonymized
Metric: ROC_AUC score
“3 billion clicks per day, 90% potentially fraudulent”
EDA – The Data!
Data: 100k sample; 184M full data; 18.8M predictions
Steps: model, predict, apply, submit
Result: score + rank
EDA – What is Unique?
Unique count
ip       34857
app      161
device   100
os       130
channel  (count missing)
2 OS values make up 45% of traffic: iOS? Android?
***Based on the 100k training sample
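The unique-count table can be reproduced with a per-column distinct count. The following is a minimal sketch, assuming the sample is loaded into a pandas DataFrame that uses the Kaggle column names (ip, app, device, os, channel). The rows are made up for illustration and are not the competition data.

```python
import pandas as pd

# Hypothetical toy rows standing in for the 100k training sample.
clicks = pd.DataFrame({
    "ip":      [101, 102, 101, 103, 104, 102],
    "app":     [3, 3, 12, 3, 9, 12],
    "device":  [1, 1, 1, 2, 1, 1],
    "os":      [19, 13, 19, 19, 13, 22],
    "channel": [280, 280, 245, 107, 280, 245],
})

# Number of distinct values per column (the "Unique count" table).
unique_counts = clicks.nunique()

# Share of traffic per OS code, largest first; summing the first two
# rows gives the share carried by the two most common OS values.
os_share = clicks["os"].value_counts(normalize=True)
top2_os_share = os_share.head(2).sum()

print(unique_counts)
print(f"top-2 OS share: {top2_os_share:.2f}")
```

Because the data is anonymized, the OS codes are only integers, so this kind of share calculation can show that two values dominate but cannot say which platforms they are.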
EDA – Unique Continued
EDA – Data Imbalance
Very unbalanced data: 227 attributed values in 100k total records
.99760003495 null accuracy: hard to improve
.778 null ROC_AUC with logistic regression: room to improve
.5000 Kaggle score if you submit all 0
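The two trivial baselines can be checked by hand. The sketch below uses only the counts on the slide plus a small hypothetical label vector. The helper names `null_accuracy` and `roc_auc` are illustrative, not taken from the original notebook. On the full 100k sample the arithmetic gives about .9977; the slide's slightly lower figure was presumably measured on a held-out split, which is an assumption here.

```python
def null_accuracy(n_positive, n_total):
    """Accuracy of a model that always predicts the majority class (0)."""
    return 1 - n_positive / n_total


def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive outranks a
    random negative, counting ties as half a win (pairwise definition)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Counts from the slide: 227 attributed clicks in 100k records.
acc = null_accuracy(227, 100_000)

# Tiny hypothetical label vector; an all-zero submission ties every pair.
labels = [0, 0, 0, 1, 0, 0, 1, 0]
all_zero_auc = roc_auc(labels, [0.0] * len(labels))

print(f"null accuracy: {acc:.5f}")
print(f"AUC of all-zero scores: {all_zero_auc:.4f}")
```

This is why accuracy is hard to improve while ROC_AUC leaves room: a constant prediction is almost always "right" but ranks no positive above any negative.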
Modeling Process
Models: KNN, Decision Tree, Logistic Regression
Features/Transformations: time included, upsample, downsample
Review
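The three transformations listed above can be sketched with pandas. This is an illustration under stated assumptions, not the original notebook: the column names click_time and is_attributed follow the Kaggle data, the hour-of-day feature is one possible reading of "time included", and the rows are invented.

```python
import pandas as pd

# Hypothetical toy clicks: 2 attributed (1) rows among 10.
clicks = pd.DataFrame({
    "click_time": pd.date_range("2017-11-07 09:00", periods=10, freq="37min"),
    "is_attributed": [0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
})

# "Time included": derive an hour-of-day feature from the click timestamp.
clicks["click_hour"] = clicks["click_time"].dt.hour

minority = clicks[clicks["is_attributed"] == 1]
majority = clicks[clicks["is_attributed"] == 0]

# Upsample: draw minority rows with replacement until classes are even.
upsampled = pd.concat([
    majority,
    minority.sample(n=len(majority), replace=True, random_state=0),
])

# Downsample: keep only as many majority rows as there are minority rows.
downsampled = pd.concat([
    majority.sample(n=len(minority), random_state=0),
    minority,
])

print(upsampled["is_attributed"].value_counts().to_dict())
print(downsampled["is_attributed"].value_counts().to_dict())
```

Resampling is normally applied to the training split only, so that the test split keeps the real class ratio and the reported score stays honest.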
Modeling Results – Lots of Choices
Lots of overfitting
Conclusions
Null score on Kaggle is .500
Selected model (Random Forest GS) score: .5122
Leaderboard 1st place: .9827
Further Work
Further investigation: overfitting appears to be a problem
Spend more time tuning parameters
Minimize train/test split delta
Explore attribution time vs. click time relationships
IP addresses in test data not in sample data
Scale to full data
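Two of the follow-up items, parameter tuning and the train/test delta, can be tracked together. The sketch below reads "GS" as grid search, which is an assumption, and uses scikit-learn on synthetic imbalanced data rather than the competition data. The parameter grid is hypothetical and chosen only to show settings that restrain tree growth.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data (about 5% positives) as a stand-in for the sample.
X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.95], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grid: shallower trees and larger leaves limit overfitting.
param_grid = {"max_depth": [3, 6, None], "min_samples_leaf": [1, 20]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X_train, y_train)

# Train/test delta: a large gap signals overfitting.
train_auc = roc_auc_score(y_train, search.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
delta = train_auc - test_auc

print(search.best_params_)
print(f"train AUC {train_auc:.3f}, test AUC {test_auc:.3f}, delta {delta:.3f}")
```

Scoring the search with ROC_AUC keeps the tuning objective aligned with the Kaggle metric, and reporting the delta next to the test score makes overfitting visible before a submission is made.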