Data Science: Statistics in the Wild Jin Kim

Hi, I’m Jin!
My professional mission is to help people and organizations understand and improve themselves using the power of data.
- Ph.D. in Computer Science (Information Retrieval), 2012
- Applied Scientist at Microsoft Bing

Agenda
- Talk about myths & truths about data science
- Introduce a few projects representative of data science in industry
- Compare experience in academia (CS Ph.D.) vs. industry

Myths & Truths about Data Science in Industry
1. You need big data to do anything interesting
2. You spend most of your time analyzing data & building models
3. You need to be a hard-core programmer to be successful
4. You can communicate results after the analysis is done
Can you guess which one is true?

Myth: You need ‘big’ data to do anything interesting
Truth: You won’t need big data most of the time

Burdens of Big Data
- Big data is costly to collect and store
- Big data slows down iteration
- Big data is useful only if:
  - You’re trying to build a data product (e.g., a search engine)
  - You’re dealing with very noisy measurements (e.g., A/B testing)
  - You’re interested in identifying the exceptions (outliers)
- Even then, start with small data!

Determining how much data you need
- Exploratory analysis: Do we have enough coverage for all edge cases (e.g., outliers)?
- Statistical inference: Is our confidence interval narrow enough? Do we have enough statistical power to validate our hypotheses?
- Predictive analysis: Do we have enough data to train and evaluate our model?
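The statistical-inference questions above can be made concrete with a back-of-the-envelope calculation. The sketch below uses the textbook normal approximation for a two-sided, two-sample comparison of means; the function names and the example values of `sigma` and `delta` are illustrative assumptions, not part of the talk.

```python
import math
from statistics import NormalDist


def samples_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Approximate users needed per group to detect a mean difference `delta`.

    Normal approximation for a two-sided, two-sample test, assuming both
    groups share the same standard deviation `sigma`.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return math.ceil(n)


def ci_half_width(sigma, n, alpha=0.05):
    """Half-width of the normal-approximation confidence interval for a mean."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * sigma / math.sqrt(n)


# Hypothetical inputs: metric standard deviation 1.0, smallest effect we care about 0.5.
n_needed = samples_per_group(sigma=1.0, delta=0.5)
width_at_100 = ci_half_width(sigma=1.0, n=100)
print(n_needed, round(width_at_100, 3))
```

Halving `delta` roughly quadruples the required sample size, which is one reason noisy measurements such as A/B tests are a case where large data does help.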

Myth: You need to be a hard-core programmer
Truth: Basic skills (e.g., SQL) get you pretty far

Data Science Tool Usage Survey (2014, O’Reilly)
Still dominated by simple tools…

Choosing Tools for Data Science
[Chart: tools placed along two axes, Small Data vs. Big Data and End-user vs. Developer. Tools shown: Excel, RDBMS / SQL, Hadoop, R, Python]

Chaining Tools for Data Science
[Diagram: pipeline stages Data Preparation, Exploratory Analysis, Inference / Prediction, Solution Implementation, Results Communication. Tools shown across the stages: Excel, Hadoop, RDBMS / SQL, Python, R, Custom Code]
Use the right toolset at each stage

Modern R is no more difficult than SQL
Enter the Hadleyverse: http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

What matters more is the ability to choose and learn the right tools and methods…

Myth: You spend most of your time analyzing data
Truth: You spend most of your time preparing data

Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.

Things can go wrong at many different levels…
- Inherent noise / bias in the data
- The process of collecting the data (instrumentation)
- The process of processing the data
- Interpretation of the processed data
- …
How to share data with a statistician: https://github.com/jtleek/datasharing

Make sure you check for quality issues!
- Completeness: Is the data representative of the problem space? Any missing observations / attributes?
- Fidelity: Do the measurements capture reality? Any issues of bias or variance?
- Consistency: Do the values follow the specified data types? Do different attributes agree with each other?
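A few of these checks are easy to automate. The sketch below covers missing values, type mismatches, and one cross-attribute rule using plain Python; the column names, the sample rows, and the rule that clicks cannot exceed impressions are all hypothetical and only illustrate the idea.

```python
def quality_report(rows, schema):
    """Count missing values and type mismatches per column.

    `rows` is a list of dicts; `schema` maps column name -> expected Python type.
    """
    missing = {col: 0 for col in schema}
    wrong_type = {col: 0 for col in schema}
    for row in rows:
        for col, expected in schema.items():
            value = row.get(col)
            if value is None:
                missing[col] += 1          # completeness problem
            elif not isinstance(value, expected):
                wrong_type[col] += 1       # consistency problem (data type)
    return {"missing": missing, "wrong_type": wrong_type}


def inconsistent_rows(rows):
    """Return indices of rows that break a hypothetical rule: clicks <= impressions."""
    bad = []
    for i, row in enumerate(rows):
        clicks, impressions = row.get("clicks"), row.get("impressions")
        if isinstance(clicks, int) and isinstance(impressions, int):
            if clicks > impressions:
                bad.append(i)
    return bad


# Hypothetical log records with one problem of each kind.
rows = [
    {"user": "u1", "impressions": 10, "clicks": 2},
    {"user": "u2", "impressions": 5, "clicks": 7},     # attributes disagree
    {"user": "u3", "impressions": None, "clicks": 1},  # missing attribute
    {"user": "u4", "impressions": "12", "clicks": 3},  # wrong data type
]
schema = {"user": str, "impressions": int, "clicks": int}

report = quality_report(rows, schema)
print(report)
print(inconsistent_rows(rows))
```

Fidelity and representativeness are harder to check mechanically and usually need domain knowledge about how the data was collected.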

Myth: You can communicate results after the analysis is done
Truth: You need to communicate throughout the process

Imagine you’re in a jungle with complete strangers

Why is communication so critical for solving a data problem?
- You are seldom given a clear-cut problem (hence the data problem)
- The team is composed of people with different expertise / styles
- No one has complete information about the problem / solution space
- You often need to change course multiple times along the way

Myths & Truths about Data Science in Industry
1. You need big data to do anything interesting
2. You spend most of your time analyzing data & building models
3. You need to be a hard-core programmer to be successful
4. You can communicate results after the analysis is done
All of these are myths!

Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data
Alex Deng, Ya Xu, Ron Kohavi, Toby Walker

Background: Online (A/B) Experiment
- Randomly split the traffic into two groups
- Make sure your split is actually random (pre-A/A test)
- Ideally, no difference in the metric values
- Apply treatment to one group (A/B test)
- Now, the difference is purely due to the treatment
- Use the two-sample t-test for comparison
Group1: U1: 0.5, U2: 0.4, U3: 0.1
Group2: U4: 0.2, U5: 0.7, U6: 0.4
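To make the comparison step concrete, here is a minimal sketch that computes a two-sample t statistic for the six per-user values on the slide. It uses the unequal-variance (Welch) form of the standard error; with equal group sizes, as here, this gives the same statistic as the pooled-variance form. The function name is an illustrative choice.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t(a, b):
    """t statistic for the difference in means, mean(b) - mean(a)."""
    # Standard error of the difference, using each group's sample variance.
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(b) - mean(a)) / se


group1 = [0.5, 0.4, 0.1]  # U1..U3 from the slide
group2 = [0.2, 0.7, 0.4]  # U4..U6 from the slide

t_stat = two_sample_t(group1, group2)
print(f"difference in means = {mean(group2) - mean(group1):.3f}, t = {t_stat:.3f}")
```

With only three users per group the statistic is small relative to any usual critical value, which fits the idea that a pre-A/A split should show no significant difference.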

Goal: Higher Sensitivity in A/B Experiments!

Main idea Variance explained by X

Stratification for Variance Reduction Assuming we can find reasonable strata of size K:
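The formula from this slide did not survive in the transcript. As a rough illustration only, the sketch below shows the textbook stratified estimator of a mean: average within each stratum, then combine the stratum means using known stratum weights. The strata, weights, and values are hypothetical, and the notation is not necessarily the one used in the paper.

```python
from statistics import mean, variance


def stratified_mean(values_by_stratum, weights):
    """Weighted sum of per-stratum means; `weights` should sum to 1."""
    return sum(weights[k] * mean(v) for k, v in values_by_stratum.items())


def stratified_mean_variance(values_by_stratum, weights):
    """Estimated variance of the stratified mean: sum of w_k^2 * s_k^2 / n_k."""
    return sum(
        weights[k] ** 2 * variance(v) / len(v)
        for k, v in values_by_stratum.items()
    )


# Hypothetical metric values for two strata (for example, light vs. heavy users).
values_by_stratum = {"light": [1.0, 2.0, 3.0], "heavy": [10.0, 12.0]}
weights = {"light": 0.5, "heavy": 0.5}  # assumed known population shares

all_values = [v for vs in values_by_stratum.values() for v in vs]
plain_var = variance(all_values) / len(all_values)   # variance of the plain mean
strat_var = stratified_mean_variance(values_by_stratum, weights)

print(stratified_mean(values_by_stratum, weights), plain_var, strat_var)
```

When the strata differ a lot from each other, the between-stratum variation drops out of the stratified estimate, which is where the variance reduction comes from.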

Beyond Discrete Strata: Control Covariate

Practical Issues
Pre-A/A test (no difference!):
Group1: U1: 0.5, U2: 0.4, U3: 0.1
Group2: U4: 0.2, U5: 0.7, U6: 0.4
A/B test (treatment effect):
Group1: U1: 0.7, U2: 0.5, U3: 0.2
Group2: U4: 0.2, U5: 0.6, U6: 0.5
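The equations for the control-covariate idea are missing from the transcript, so the following is a sketch of the usual CUPED adjustment rather than a reproduction of the slides: each user's experiment-period metric Y is adjusted by theta * (X - mean(X)), where X is the same user's pre-experiment metric and theta = Cov(X, Y) / Var(X). It reads the first set of values on this slide as the pre-experiment period and the second as the A/B period, and assumes Group1 received the treatment.

```python
from statistics import mean, variance


def cuped_adjust(y, x):
    """Return CUPED-adjusted metrics and theta, using x as the control covariate."""
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)
    adjusted = [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]
    return adjusted, theta


# Users U1..U6 in order; the first three are Group1, the last three Group2.
pre = [0.5, 0.4, 0.1, 0.2, 0.7, 0.4]   # pre-A/A period
post = [0.7, 0.5, 0.2, 0.2, 0.6, 0.5]  # A/B period

adjusted, theta = cuped_adjust(post, pre)

# Assumption: Group1 is the treatment group.
raw_effect = mean(post[:3]) - mean(post[3:])
cuped_effect = mean(adjusted[:3]) - mean(adjusted[3:])

print(f"theta = {theta:.3f}")
print(f"variance: raw = {variance(post):.4f}, adjusted = {variance(adjusted):.4f}")
print(f"effect:   raw = {raw_effect:.4f}, adjusted = {cuped_effect:.4f}")
```

The adjustment leaves the overall mean unchanged but removes the part of the variance that the pre-experiment values already explain, so the same experiment can reach significance with fewer users or in less time.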

Empirical Results with CUPED
Delay-page-load experiment in Bing (Metric: CTR)
- Faster with CUPED: statistically significant from day 1!
- Fewer users with CUPED: better results with only half the users!

Impact of pre-experiment period length

Evaluating Online Ad Campaigns in a Pipeline: Causal Models At Scale
David Chan, Rong Ge, Ori Gershony, Tim Hesterberg, Diane Lambert

Problem Setting

Controls in Natural Experiment
Controls: those who could have been exposed, but weren’t
- visited the publisher site
- saw other display ads
- met targeting conditions

Solution: Match Control & Treatment Users But exact matching wouldn’t work because there are so many dimensions…

Removing the bias in the controls by re-weighting
- Each exposed user gets weight 1
- Each control gets weight p(x) / (1 - p(x)), where p(x) is the estimated probability of exposure for a user with features x

Now estimate the campaign effect:
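The estimator itself is missing from the transcript. One common form that matches the weights described on the previous slide is the mean outcome of exposed users minus the weighted mean outcome of controls, sketched below. The outcomes and propensity values are made up, and in practice p(x) would come from a fitted model rather than a hard-coded list.

```python
def control_weight(p):
    """Weight for a control user with exposure probability p (0 < p < 1)."""
    return p / (1.0 - p)


def campaign_effect(exposed_outcomes, control_outcomes, control_propensities):
    """Exposed mean minus propensity-weighted control mean."""
    weights = [control_weight(p) for p in control_propensities]
    weighted_control_mean = (
        sum(w * y for w, y in zip(weights, control_outcomes)) / sum(weights)
    )
    exposed_mean = sum(exposed_outcomes) / len(exposed_outcomes)  # each weight is 1
    return exposed_mean - weighted_control_mean


# Hypothetical data: 1 = user converted, 0 = did not.
exposed = [1, 0, 1, 1]
controls = [0, 1, 0, 0, 1]
propensities = [0.2, 0.5, 0.1, 0.3, 0.6]  # assumed output of a propensity model

naive = sum(exposed) / len(exposed) - sum(controls) / len(controls)
weighted = campaign_effect(exposed, controls, propensities)
print(f"naive effect = {naive:.3f}, re-weighted effect = {weighted:.3f}")
```

Controls who look like exposed users (high p(x)) count for more, so the weighted control group resembles the exposed group on the modeled features. In this toy data the apparent lift largely disappears after re-weighting.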

Summary: Data Science in Industry
- Need to be aware of inherent variability / bias in data
- Statistical techniques are useful to mitigate these issues
- There is nothing more practical than a good theory!

Working in Academia vs. Industry

Other lessons I learned in industry…
- You need to learn new things all the time
- Multitasking is not a choice, but a must
- Need to ‘sell’ your ideas and results
- Hard science + Soft skill = Rockstar

Optional

Doubly Robust Estimate

Results
