Published by Heather Waters. Modified over 9 years ago.
1
Data Science: Statistics in the Wild
Jin Kim
2
Hi, I’m Jin!
My professional mission is to help people and organizations understand and improve themselves using the power of data.
Ph.D. in Computer Science (Information Retrieval), 2012
Applied Scientist at Microsoft Bing
3
Agenda
Talk about myths & truths about data science
Introduce a few projects representative of data science in industry
Compare experience in academia (CS Ph.D.) vs. industry
4
Myths & Truths about Data Science in Industry
1. You need big data to do anything interesting
2. You spend most of your time analyzing data & building models
3. You need to be a hard-core programmer to be successful
4. You can communicate results after the analysis is done
Can you guess which one is true?
5
Myth: You need ‘big’ data to do anything interesting
Truth: You won’t need big data most of the time
6
Burdens of Big Data
Big data is costly to collect and store
Big data slows down iteration
Big data is useful only if:
  You’re trying to build a data product (e.g., a search engine)
  You’re dealing with very noisy measurements (e.g., A/B testing)
  You’re interested in identifying the exceptions (outliers)
Even then, start with small data!
7
Determining how much data you need
Exploratory analysis
  Do we have enough coverage for all edge cases (e.g., outliers)?
Statistical inference
  Is our confidence interval narrow enough?
  Do we have enough statistical power to validate our hypotheses?
Predictive analysis
  Do we have enough data to train and evaluate our model?
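The two statistical-inference questions above can be turned into rough numbers before any data is collected. The sketch below is a minimal illustration, not something shown in the talk: it assumes a two-sided, two-sample z-test with a known, common standard deviation, and the function names `samples_per_group` and `ci_half_width` are made up for this example.

```python
from math import ceil, sqrt
from statistics import NormalDist


def samples_per_group(sigma: float, delta: float,
                      alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per group to detect a mean difference `delta`.

    Assumes a two-sided two-sample z-test with known standard deviation `sigma`
    (a normal approximation, chosen here only for illustration).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = z.inv_cdf(power)          # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)


def ci_half_width(sigma: float, n: int, alpha: float = 0.05) -> float:
    """Half-width of the normal-approximation confidence interval for a mean."""
    return NormalDist().inv_cdf(1 - alpha / 2) * sigma / sqrt(n)


if __name__ == "__main__":
    # Detecting half a standard deviation needs far fewer users than a quarter.
    print(samples_per_group(sigma=1.0, delta=0.5))
    print(samples_per_group(sigma=1.0, delta=0.25))
    print(ci_half_width(sigma=1.0, n=100))
```

Because the required sample size grows with the square of `sigma / delta`, halving the effect you want to detect roughly quadruples the data you need, which is one reason noisy measurements are a case where larger data helps.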
8
Myth: You need to be a hard-core programmer
Truth: Basic skills (e.g., SQL) get you pretty far
9
Data Science Tool Usage Survey (2014/O’Reilly)
Still dominated by simple tools…
10
Choosing Tools for Data Science
[Figure: Excel, RDBMS / SQL, Hadoop, R, and Python placed on two axes, Small Data vs. Big Data and End-user vs. Developer]
11
Chaining Tools for Data Science
Stages: Data Preparation, Exploratory Analysis, Inference / Prediction, Solution Implementation, Results Communication
[Figure: tools mapped to the stages above; labels in order: Excel, Hadoop, RDBMS / SQL, Python, Excel, R, Python, Custom Code, R]
Use the right toolset at each stage
12
Modern R is no more difficult than SQL
Enter the Hadleyverse: http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
13
What matters more is the ability to choose and learn the right tools and methods…
14
Myth: You spend most of your time analyzing data
Truth: You spend most of your time preparing data
15
Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.
16
Things can go wrong at many different levels…
Inherent noise / bias in the data
The process of collecting the data (instrumentation)
The process of processing the data
Interpretation of the processed data
…
How to share data with a statistician: https://github.com/jtleek/datasharing
17
Make sure you check for quality issues!
Completeness
  Is the data representative of the problem space?
  Are any observations / attributes missing?
Fidelity
  Do the measurements capture reality?
  Are there any issues of bias or variance?
Consistency
  Do values follow the specified data types?
  Do different attributes agree with each other?
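Some of these checks are easy to automate. The sketch below covers only the mechanical ones (missing attributes, wrong data types, attributes that contradict each other); representativeness and fidelity still need domain judgment. The schema, the field names, and the rule that clicks cannot exceed impressions are all hypothetical, invented for this illustration.

```python
from typing import Any

# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"user_id": str, "impressions": int, "clicks": int}


def quality_report(records: list[dict[str, Any]]) -> dict[str, int]:
    """Count missing values, type violations, and cross-field contradictions."""
    report = {"missing": 0, "wrong_type": 0, "inconsistent": 0}
    for row in records:
        for field, expected in SCHEMA.items():
            value = row.get(field)
            if value is None:
                report["missing"] += 1       # completeness
            elif not isinstance(value, expected):
                report["wrong_type"] += 1    # consistency with declared types
        clicks, impressions = row.get("clicks"), row.get("impressions")
        # Assumed rule: a user cannot click more often than they were shown something.
        if isinstance(clicks, int) and isinstance(impressions, int) and clicks > impressions:
            report["inconsistent"] += 1      # attributes disagree with each other
    return report


# Made-up records, one per kind of problem plus one clean row.
records = [
    {"user_id": "u1", "impressions": 10, "clicks": 2},
    {"user_id": "u2", "impressions": 5, "clicks": 9},    # more clicks than impressions
    {"user_id": "u3", "impressions": "7", "clicks": 1},  # impressions stored as text
    {"user_id": "u4", "clicks": 0},                      # impressions missing
]

if __name__ == "__main__":
    print(quality_report(records))
```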
18
Myth: You can communicate results after the analysis is done
Truth: You need to communicate throughout the process
19
Imagine you’re in a jungle with complete strangers
20
Why is communication so critical for solving a data problem?
You are seldom given a clear-cut problem (hence the data problem)
The team is composed of people with different expertise / styles
No one has complete information about the problem / solution space
You often need to change course multiple times along the way
21
Myths & Truths about Data Science in Industry
1. You need big data to do anything interesting
2. You spend most of your time analyzing data & building models
3. You need to be a hard-core programmer to be successful
4. You can communicate results after the analysis is done
All of these are myths!
22
Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data
Alex Deng, Ya Xu, Ron Kohavi, Toby Walker
23
Background: Online (A/B) Experiment
Randomly split the traffic into two groups
Make sure your split is actually random (pre-A/A test)
  Ideally, no difference in the metric values
Apply treatment to one group (A/B test)
  Now, the difference is purely due to the treatment
Use the two-sample t-test for comparison
Group1: U1: 0.5, U2: 0.4, U3: 0.1
Group2: U4: 0.2, U5: 0.7, U6: 0.4
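As a worked instance of the comparison step, the sketch below computes the two-sample t statistic for the six per-user values listed on this slide. It uses only the Python standard library and stops at the statistic, since a p-value needs a t-distribution table or a statistics package. The function name `two_sample_t` is chosen for this example.

```python
from math import sqrt
from statistics import mean, variance

# Per-user metric values as listed on the slide.
group1 = [0.5, 0.4, 0.1]  # U1, U2, U3
group2 = [0.2, 0.7, 0.4]  # U4, U5, U6


def two_sample_t(a: list[float], b: list[float]) -> float:
    """t statistic for the difference in means, using per-group sample variances.

    With equal group sizes, as here, this equals the pooled-variance statistic.
    """
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


t_stat = two_sample_t(group1, group2)

if __name__ == "__main__":
    print(round(t_stat, 2))  # roughly -0.53
```

With 4 degrees of freedom the two-sided 5% critical value is about 2.78, so a statistic this small gives no evidence of a difference between the groups, which is what a healthy pre-A/A split should look like.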
24
Goal: Higher Sensitivity in A/B Experiments!
25
Main idea Variance explained by X
26
Stratification for Variance Reduction Assuming we can find reasonable strata of size K:
27
Beyond Discrete Strata: Control Covariate
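The slide's formulas did not survive in this transcript. For reference, the standard control-covariate adjustment, as commonly stated for CUPED, is given below. Here Y is the experiment metric, X is the same user's pre-experiment covariate, and ρ is the correlation between X and Y; the notation is an assumption and may differ from the original slide.

```latex
\hat{Y}_{cv} = \bar{Y} - \theta\,\bigl(\bar{X} - \mathbb{E}[X]\bigr),
\qquad
\theta = \frac{\operatorname{Cov}(Y, X)}{\operatorname{Var}(X)},
\qquad
\operatorname{Var}\bigl(\hat{Y}_{cv}\bigr) = \operatorname{Var}\bigl(\bar{Y}\bigr)\,\bigl(1 - \rho^{2}\bigr)
```

The adjusted estimate keeps the same expectation as the plain mean, and the variance shrinks by the fraction ρ², that is, by the share of variance explained by X.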
28
Practical Issues
Pre-AA Test (No Difference!)
  Group1: U1: 0.5, U2: 0.4, U3: 0.1
  Group2: U4: 0.2, U5: 0.7, U6: 0.4
A/B Test (Treatment Effect)
  Group1: U1: 0.7, U2: 0.5, U3: 0.2
  Group2: U4: 0.2, U5: 0.6, U6: 0.5
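To make the adjustment concrete, here is a minimal sketch that applies a control-covariate (CUPED-style) correction to the six users on this slide. It assumes the first set of values is the pre-experiment measurement and the second set is the A/B measurement for the same users, and it estimates θ on the pooled data. The function name `cuped_adjust` is made up for this example, and six users are far too few for a real analysis.

```python
from statistics import mean, variance

# Values from the slide, in user order U1..U6 (first three users are Group1).
pre_metric = [0.5, 0.4, 0.1, 0.2, 0.7, 0.4]  # assumed: pre-A/A period
exp_metric = [0.7, 0.5, 0.2, 0.2, 0.6, 0.5]  # assumed: A/B period


def cuped_adjust(y: list[float], x: list[float]) -> list[float]:
    """Return y adjusted by the pre-experiment covariate x.

    theta = Cov(y, x) / Var(x); each value becomes y_i - theta * (x_i - mean(x)).
    """
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]


adjusted = cuped_adjust(exp_metric, pre_metric)

if __name__ == "__main__":
    # The overall mean is unchanged, but the per-user variance drops.
    print(variance(exp_metric), variance(adjusted))
    # Difference between groups before and after the adjustment.
    print(mean(exp_metric[:3]) - mean(exp_metric[3:]))
    print(mean(adjusted[:3]) - mean(adjusted[3:]))
```

The design point is that users who already scored high before the experiment are expected to score high during it, so subtracting that predictable part leaves a less noisy signal for the treatment comparison.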
29
Empirical Results with CUPED
Delay-page-load experiment in Bing with half the users (Metric: CTR)
Faster with CUPED: statistically significant from day 1!
Fewer users with CUPED: better results with only half the users!
30
Impact of pre-experiment period length
31
Evaluating Online Ad Campaigns in a Pipeline: Causal Models At Scale
David Chan, Rong Ge, Ori Gershony, Tim Hesterberg, Diane Lambert
33
Problem Setting
34
Controls in a Natural Experiment
Controls: those who could have been exposed, but weren’t
  visited the publisher site, saw other display ads, met targeting conditions
35
Solution: Match Control & Treatment Users
But exact matching wouldn’t work because there are so many dimensions…
36
Removing the bias in the controls by re-weighting
Each exposed user gets weight 1
Each control user gets weight p(x) / (1 - p(x))
37
Now estimate the campaign effect:
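The estimator itself is not in this transcript. One common form that is consistent with the weights on the previous slide is the exposed users' mean outcome minus the weighted mean outcome of the controls. The sketch below illustrates that form only. It assumes p(x) is the estimated probability that a user with features x is exposed (a propensity score), and the function name and all data in it are hypothetical.

```python
def campaign_effect(exposed_outcomes: list[float],
                    control_outcomes: list[float],
                    control_propensities: list[float]) -> float:
    """Exposed mean minus propensity-weighted control mean.

    Each exposed user has weight 1; each control has weight p / (1 - p),
    where p is that control's assumed probability of exposure.
    """
    weights = [p / (1 - p) for p in control_propensities]
    weighted_control_mean = (
        sum(w * y for w, y in zip(weights, control_outcomes)) / sum(weights)
    )
    exposed_mean = sum(exposed_outcomes) / len(exposed_outcomes)
    return exposed_mean - weighted_control_mean


if __name__ == "__main__":
    # Made-up binary outcomes (e.g., converted or not), purely for illustration.
    exposed = [1, 0, 1, 1]
    controls = [0, 1, 0, 0]
    # Equal propensities: re-weighting changes nothing.
    print(campaign_effect(exposed, controls, [0.5, 0.5, 0.5, 0.5]))
    # A control that resembles exposed users (p = 0.8) dominates the comparison.
    print(campaign_effect(exposed, controls, [0.2, 0.8, 0.2, 0.2]))
```

Controls that look like exposed users count more, so the weighted control group approximates what the exposed users would have done without the campaign, without needing an exact match on every dimension.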
38
Summary: Data Science in Industry
Need to be aware of inherent variability / bias in data
Statistical techniques are useful to mitigate these issues
There is nothing more practical than a good theory!
39
Working in Academia vs. Industry
40
Other lessons I learned in industry…
You need to learn new things all the time
Multitasking is not a choice, but a must
Need to ‘sell’ your ideas and results
Hard science + Soft skill = Rockstar
41
Optional
42
Doubly Robust Estimate
43
Results