Talking Data Click Fraud Detection

Presentation transcript:

Talking Data Click Fraud Detection Andrew Cudworth 04/23/18

Introduction. TalkingData is a Chinese data service company (70% of Chinese mobile devices) that builds IP blacklists. “3 billion clicks per day, 90% potentially fraudulent.” Objective: does click = download? The data comes from Kaggle (184M training rows; a 100k sample is used for modeling), all data is anonymized, and the score is ROC_AUC.

EDA – The Data! Data sizes: 100k sample, 187M full data, 18.8M predictions. Workflow: MODEL, Predict, Apply, Submit, then Score + Rank.

EDA – What is Unique? (100k training sample represented.) Columns: ip, app, device, OS, Channel. Unique counts: ip 34857, app 161, device 100, OS 130. Two OS values make up 45% of traffic (iOS? Android?).
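The unique-count table on this slide can be reproduced with a few lines of code. The sketch below is only an illustration, assuming the sample is available as a list of dictionaries with the Kaggle column names; the toy rows and the helper names `unique_counts` and `top_share` are made up for the example and are not from the presentation.

```python
from collections import Counter

# Hypothetical toy rows standing in for the 100k sample; all values are invented.
rows = [
    {"ip": 101, "app": 3, "device": 1, "os": 13, "channel": 379},
    {"ip": 102, "app": 3, "device": 1, "os": 19, "channel": 379},
    {"ip": 103, "app": 12, "device": 1, "os": 13, "channel": 178},
    {"ip": 101, "app": 9, "device": 2, "os": 19, "channel": 134},
    {"ip": 104, "app": 3, "device": 1, "os": 22, "channel": 379},
]


def unique_counts(rows):
    """Number of distinct values in each column."""
    return {col: len({row[col] for row in rows}) for col in rows[0]}


def top_share(rows, column, k=2):
    """Share of rows covered by the k most frequent values of a column."""
    counts = Counter(row[column] for row in rows)
    covered = sum(n for _, n in counts.most_common(k))
    return covered / len(rows)


print(unique_counts(rows))
print(top_share(rows, "os", k=2))
```

Applied to the real sample, `top_share(rows, "os", k=2)` is the kind of figure behind the "two OS values make up 45% of traffic" observation.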

EDA – Unique Continued

EDA – Data Imbalance. Very unbalanced data: 227 attributed values in 100k total records. Null accuracy is .99760003495 (hard to improve). Null ROC_AUC with logistic regression is .778 (room to improve). Kaggle score is .5000 if you submit all 0.
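The two baselines on this slide can be checked by hand. The sketch below is a minimal illustration, not the presenter's code: it builds labels with the slide's counts (227 positives in 100k records), computes the accuracy of always predicting the majority class, and computes ROC AUC with the rank-sum formula, where tied scores share their average rank. The function names are made up for the example, and the exact null accuracy depends on which split it is computed on, so it need not match the slide's value to the last digit.

```python
def null_accuracy(labels):
    """Accuracy obtained by always predicting the majority class."""
    positives = sum(labels)
    return max(positives, len(labels) - positives) / len(labels)


def roc_auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) formula with average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Labels with the counts from the slide: 227 attributed clicks in 100k records.
labels = [1] * 227 + [0] * (100_000 - 227)
all_zero = [0.0] * len(labels)  # the "submit all 0" baseline

print(null_accuracy(labels))
print(roc_auc(labels, all_zero))
```

A constant prediction ranks every positive and negative identically, which is why accuracy looks nearly perfect while ROC AUC sits at chance level.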

Modeling Process. Models: KNN, Decision Tree, Logistic Regression. Features/Transformations: time included, up-sample, down-sample. Review.
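Up-sampling and down-sampling are named on this slide without detail, so the following is only a sketch of the usual random versions, assuming label 1 is the minority class. The helper names are hypothetical, and the presentation may have used a library routine instead.

```python
import random


def split_by_label(rows, labels):
    """Separate rows into minority (label 1) and majority (label 0)."""
    minority = [r for r, y in zip(rows, labels) if y == 1]
    majority = [r for r, y in zip(rows, labels) if y == 0]
    return minority, majority


def up_sample(rows, labels, seed=0):
    """Add minority rows drawn with replacement until both classes are equal in size."""
    rng = random.Random(seed)
    minority, majority = split_by_label(rows, labels)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    new_rows = majority + minority + extra
    new_labels = [0] * len(majority) + [1] * (len(minority) + len(extra))
    return new_rows, new_labels


def down_sample(rows, labels, seed=0):
    """Keep a random subset of majority rows equal in size to the minority class."""
    rng = random.Random(seed)
    minority, majority = split_by_label(rows, labels)
    kept = rng.sample(majority, k=len(minority))
    return kept + minority, [0] * len(kept) + [1] * len(minority)


# Toy data: 10 positives among 1000 rows (invented, not the Kaggle sample).
rows = list(range(1000))
labels = [1 if i < 10 else 0 for i in rows]

up_rows, up_labels = up_sample(rows, labels)
down_rows, down_labels = down_sample(rows, labels)
print(len(up_rows), sum(up_labels))
print(len(down_rows), sum(down_labels))
```

Resampling is normally applied to the training split only; if duplicated minority rows also land in the evaluation split, the measured score is inflated.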

Modeling Results – Lots of choices, lots of overfitting

Conclusions and further work. Null score on Kaggle is .500. Selected model (Random Forest GS) score: .5122. Leaderboard 1st place: .9827. Further investigation: overfitting appears to be a problem; spend more time tuning parameters; minimize the train/test split delta; explore attribution time vs click time relationships; IP addresses in the test data are not in the sample data; scale to the full data.
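One of the follow-up items, IP addresses in the test data that do not appear in the sample, can be quantified with simple set arithmetic. The sketch below is an assumption-laden illustration: `unseen_share` is a made-up helper and the IP lists are toy values, not figures from the competition data.

```python
def unseen_share(sample_ips, test_ips):
    """Fraction of distinct test IPs that never occur in the sample."""
    seen = set(sample_ips)
    test_unique = set(test_ips)
    return len(test_unique - seen) / len(test_unique)


# Toy values for illustration only.
sample_ips = [1, 2, 3, 3]
test_ips = [2, 3, 4, 5, 5]

print(unseen_share(sample_ips, test_ips))
```

A high share would suggest that a model leaning on the raw `ip` value learned from a 100k sample has little to go on at prediction time, which fits the overfitting concern listed above.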