DATA INTEGRATION AND ERROR: BIG DATA FROM THE 1930s TO NOW

Copyright ©2012 The Nielsen Company. Confidential and proprietary.

CONTENTS
- Big Data in the 1930s and why that matters now
- TV measurement and Return Path Data (STB)
- Interesting questions for understanding error

BIG DATA, 1930s STYLE

PROBABILITY SAMPLING, 1930s STYLE

EVOLUTION OF STATISTICAL CONCEPTS IN RESEARCH
- Early days: novel, non-scientific
- 1930s: scientific sampling
- Since the 1950s: weighting, probability models, imputation techniques, data fusion, time-series analyses, hybrid (Big Data/sample) integration

NIELSEN AND AUDIENCE MEASUREMENT
- 1923: Nielsen founded
- 1950: Introduces TV audience measurement
- Current technology: the People Meter (electronic measurement, probability samples, all people and sets in the home measured)
- Nielsen Ratings are the currency for US TV advertising

THE CHANGING TV ENVIRONMENT
- Fragmentation of viewing choices
- Proliferation of devices
- Increasing population diversity

RESEARCH DATA: STATISTICAL TOOLS
From: Sample / Measure / Project (panel data)
To: Sample / Measure / Project + Integrate, using:
- Data fusion
- Probability modeling
- Calibration
- Predictive modeling
drawing on multiple panels, census data, and surveys
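A minimal sketch of one of the tools named above, data fusion by statistical matching: donors from a panel that measured a behavior are matched to recipients from another source on a shared linking variable (age, in this invented example), and the behavior is imputed from the nearest donor. All names and data here are illustrative assumptions, not Nielsen's actual method.

```python
# Invented donor panel: (age, watched_show). The donors measured the
# behavior we want to impute.
donors = [(21, True), (34, False), (45, True), (60, False)]
# Recipients from another source, measured only on the linking variable.
recipients = [23, 44, 58]

def fuse(recipients, donors):
    """Impute each recipient's behavior from the nearest-aged donor."""
    fused = []
    for age in recipients:
        nearest = min(donors, key=lambda d: abs(d[0] - age))
        fused.append((age, nearest[1]))
    return fused

result = fuse(recipients, donors)
# 23 matches the 21-year-old donor, 44 the 45-year-old, 58 the 60-year-old.
```

Real fusions match on many linking variables at once and control the match to preserve marginal distributions; a single variable keeps the mechanics visible.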

WHAT STB AND PANELS CAN GIVE US
- STB data: large convenience samples, stable results
- Panels: completeness of audience measurement
- In combination, STB + panels offer the possibility of stable, unbiased research products

STB GAPS AND BIAS
1. Data quality / coverage / timeliness / representativeness
2. Set activity (on/off/other source)
3. Household characteristics
4. Persons viewing (including visitors in the home)
5. Other viewing activity
(Diagram: total survey error decomposed into bias and standard error for STB alone, for the People Meter alone, and for a possible STB + People Meter combination)
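The bias/standard-error decomposition on this slide can be made concrete with a small simulation. A large convenience source (like STB data) can have tiny variance but a systematic bias; a probability panel has little bias but more variance. Total survey error in the mean-squared-error sense is exactly bias squared plus variance. The rating level, bias, and spreads below are invented illustration values.

```python
import random

random.seed(42)
TRUE_RATING = 0.20  # assumed true audience proportion

def mse(estimates, truth):
    """Mean squared error of the estimates around the truth."""
    return sum((e - truth) ** 2 for e in estimates) / len(estimates)

def bias_sq_plus_var(estimates, truth):
    """Squared bias plus variance; algebraically identical to the MSE."""
    mean_e = sum(estimates) / len(estimates)
    bias_sq = (mean_e - truth) ** 2
    var = sum((e - mean_e) ** 2 for e in estimates) / len(estimates)
    return bias_sq + var

# Biased but very stable "big data" source vs unbiased but noisier panel.
stb = [TRUE_RATING + 0.03 + random.gauss(0, 0.005) for _ in range(1000)]
panel = [TRUE_RATING + random.gauss(0, 0.02) for _ in range(1000)]
# With these numbers the stable-but-biased source has the larger total error.
```

The decomposition is why "stable" is not the same as "accurate": the STB source's small standard error cannot compensate for its bias.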

STB DATA QUALITY: EXAMPLE ANALYSES
Examples range from good to not so good: machine reboot activity; program-junction spikes.

ARE WE IMPROVING THE MEASUREMENT?
1. Transparency and validation at each step and overall
2. Total survey error (bias and standard error)

ASSESSING INTEGRATION ERROR
- Input error (GIGO)
- Matching error
- Statistical error
- Validity levels
- Error compounding across multiple databases

ASSESSING INTEGRATION ERRORS
Input Error (GIGO)
- Coverage gaps, definitional problems, input errors, etc.
- But possible improvement through integration weighting effects
- Most problems remain, but some can be mitigated through integration

ASSESSING INTEGRATION ERRORS
Matching Error (e.g. address matching)
- Good: correct match. Bad: no match. Ugly: incorrect match.
- There is a trade-off between match rates and error rates.
- Multiple databases may have correlated errors; that may be preferable to random errors, since the overall effect is restricted to a smaller group (e.g. new householders in some address lists).

STATISTICAL ERROR (SAMPLE-BASED IMPUTATION)
- Model bias leads to attenuation (regression to the mean)
- Bias in individual data points can be undetectable due to sampling error
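The attenuation point can be illustrated with the textbook case: noise in a predictor shrinks an estimated regression slope toward zero by the reliability ratio var(x) / (var(x) + var(noise)). The slope and noise levels below are illustrative assumptions, not figures from the deck.

```python
import random
import statistics

random.seed(11)
n = 20_000
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 * xi + random.gauss(0, 1) for xi in x]   # true slope is 2.0
x_noisy = [xi + random.gauss(0, 1) for xi in x]   # measurement error added

def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / len(xs)
    var = sum((a - mx) ** 2 for a in xs) / len(xs)
    return cov / var

# Reliability ratio here is 1 / (1 + 1) = 0.5, so the noisy slope
# attenuates from 2.0 toward about 1.0: regression to the mean.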

SEPARATING MODEL BIAS AND SAMPLING ERROR
- Z-tests on each comparison, and evaluation of the Z-score distributions
- Deviation from the expected distribution gives a bias estimate
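A hedged sketch of the idea on this slide: standardize each model-vs-benchmark comparison into a z-score. With no model bias, the z-scores should be approximately N(0, 1); a shift in the mean of the observed z distribution estimates the systematic bias even when no single comparison is individually significant. The SE and BIAS values are invented for illustration.

```python
import random
import statistics

random.seed(7)
SE = 2.0    # assumed standard error of each individual comparison
BIAS = 1.5  # hypothetical systematic model bias

# Each observed difference = systematic bias + sampling noise.
diffs = [BIAS + random.gauss(0, SE) for _ in range(5000)]
z_scores = [d / SE for d in diffs]

mean_z = statistics.fmean(z_scores)   # expected 0 under no bias
sd_z = statistics.pstdev(z_scores)    # expected 1 under the model
bias_estimate = mean_z * SE           # shift of the z distribution, rescaled
```

Note that BIAS (0.75 standard errors here) is small enough that most individual z-tests fail to reject, yet the pooled distribution recovers it cleanly, which is exactly the "undetectable in individual points" problem from the previous slide.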

STATISTICAL ERROR: MULTIPLE DATA SETS
(Diagram: TV, Buy, and Web data sets combined in hub-and-spoke and sequential fusion designs)
Comparison with single-source data: Nielsen National People Meter TV and internet, matched with credit card purchase data

ACCURACY TEST
Correlation of 8 product categories with 14 TV networks and 60 websites
(Diagram: hub-and-spoke vs sequential fusion of the TV, Buy, and Web data sets, with correlations of R = 0.4, 0.5, 0.67, and 0.44 across the designs)

SEQUENTIAL VS HUB AND SPOKE
- Unless the hub has all the relevant linking information, a sequential approach gives better results.
- In our example, the sequential fusion captured interactions between web and purchase behavior.
- However, sequential fusions can fall down with too many data sets, as error compounds.

VALIDITY LEVELS: INDIVIDUAL VS AGGREGATED
Individual prediction
- Ideal scenario: you can predict every individual's behavior.
- Reality: with most imputation methods we can do better than random, but we can rarely get close to 100% accuracy. E.g. a ~40% improvement on random when predicting product users from cookies, i.e. 14% of online ad impressions delivered to product users rather than 10%.
Aggregate prediction
- Imputation methods can reliably predict aggregate-level behavior given good predictive variables. E.g. 90% accuracy (10% regression to the mean) for TV audience estimates by product users.
- Errors compound with multiple sources, but the extent varies by case.
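The slide's "~40% improvement on random" worked through in two lines, using its own illustrative figures: product users are 10% of the population, and cookie-based imputation delivers 14% of impressions to product users.

```python
base_rate = 0.10      # random targeting reaches product users 10% of the time
targeted_rate = 0.14  # targeting driven by imputed product usage (slide's figure)

# Relative improvement over random targeting.
lift = (targeted_rate - base_rate) / base_rate
```

So "40% better than random" still means 86% of impressions miss product users, which is the individual-vs-aggregate gap the slide is describing.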

CONCLUSION
- Data everywhere!
- Data quality and relevance are essential.
- Integration brings insights, and error.
- Statistical integrity is as important now as it was in the 1930s.

APPENDIX

AD EFFECTIVENESS: MORE COMPLICATED
Imagine a data set of 10,000 people for whom you have tracked exposure to a brand's website and subsequent purchase of that brand. In our initial thought experiment, 76% converted.
(Diagram: matching-information hub with spokes for website visits, purchase, and further data sets marked "TBD")

A BASIC EXPERIMENT
Now imagine that you have measurement error in 10% of your cases. We ran a simulation of 1,000 datasets with incorrect site-visit data in 10% of cases. The difference between the original conversion rate and that in the 1,000 error-ridden test cases is about 8.5%. SD is xx.
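A hedged re-creation of this simulation. The 76% converter rate comes from the earlier thought experiment; the 50% visit rate and 40% conversion rate among non-visitors are our own assumptions, so the size of the shift will not match the slide's 8.5% exactly, but the direction of the effect is the point.

```python
import random

random.seed(1)
N = 10_000
people = []
for _ in range(N):
    visited = random.random() < 0.5
    converted = random.random() < (0.76 if visited else 0.40)
    people.append((visited, converted))

def conversion_rate(data):
    """Conversion rate among records flagged as site visitors."""
    flagged = [c for v, c in data if v]
    return sum(flagged) / len(flagged)

true_rate = conversion_rate(people)

def corrupted_rate():
    """Flip the visit flag in a random 10% of cases, then remeasure."""
    noisy = [((not v) if random.random() < 0.10 else v, c) for v, c in people]
    return conversion_rate(noisy)

# Repeat over many corrupted datasets, as the slide describes.
rates = [corrupted_rate() for _ in range(200)]
mean_rate = sum(rates) / len(rates)
# Measurement error drags the measured rate toward the overall mean.
```

With these assumptions the "visitor" group is contaminated with flipped non-visitors, so the measured conversion rate is pulled below the true 76%, the same attenuation mechanism discussed in the main deck.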

A BASIC EXPERIMENT
What happens when we add another data set?
(Diagram: matching-information hub with spokes for website visits, a TV-ad exposure data set, purchase, and further data sets marked "TBD")

MORE DATA, SAME ERROR
Given two types of ad exposure data to measure, the impact of error in a single data source should be less. Imagine that you have measurement error in 10% of your cases for one data source, the same error as in the previous experiment. As expected, conversion values are closer to our error-free data set. SD = xx.

MORE DATA, MORE ERROR
Next, we introduced error into the TV data set as well. Performance worsens; SD is xx. But the effect looks more additive than exponential.
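A back-of-envelope sketch of why compounding error can look additive rather than exponential: with k independent sources each corrupting a fraction p of records, the share of records touched by at least one error is 1 - (1 - p)^k, which for small p grows roughly like k * p (and in fact slightly slower, since some records are hit more than once).

```python
def share_corrupted(k, p=0.10):
    """Fraction of records touched by at least one of k independent
    error sources, each corrupting fraction p of records."""
    return 1 - (1 - p) ** k

# One source: 10%. Two sources: 19%, not 20%. Six sources: ~46.9%,
# comfortably below the additive 60%.
```

This is a simplification under an independence assumption; correlated errors across sources (as noted on the matching-error slide) would behave differently.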

MORE DATA, EVEN MORE ERROR
Next, we imagined combining 6 data sets, each with 10% error. What do we see?

MATCHING ERROR
In any data combination there is an additional source of error: mismatches to the hub or identity variable. Misspelled names can lead to false negatives; non-deterministic matching can lead to false positives. Introducing 10% matching error (to the first data set only, to both, and to the second only) suggests that the impact is negligible relative to conversion in error-free data. This suggests the quality of the data is more important than the quality of the matching.
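A hedged sketch of one kind of matching error, the false positive: a mismatch pairs one person's visit record with a different person's purchase record. Here we corrupt the identity match for 10% of records and compare the measured conversion rate with the error-free one; all rates are invented for illustration.

```python
import random

random.seed(3)
N = 10_000
visits = [random.random() < 0.5 for _ in range(N)]
buys = [random.random() < (0.76 if v else 0.40) for v in visits]

def conv(v_flags, b_flags):
    """Conversion rate among records flagged as visitors."""
    hits = sum(1 for v, b in zip(v_flags, b_flags) if v and b)
    return hits / sum(v_flags)

true_rate = conv(visits, buys)

# False matches: 10% of records take a random other person's visit flag.
mismatched = [visits[random.randrange(N)] if random.random() < 0.10 else v
              for v in visits]
noisy_rate = conv(mismatched, buys)
# The shift is modest: mismatches dilute the signal, but about half of
# the false matches happen to carry the correct flag anyway.
```

This is one mechanism behind the slide's observation that matching error mattered less than data error in these experiments.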

ASIDE: THE IMPORTANCE OF WEIGHT
Here, the TV data was heavily weighted toward exposure. That overwhelmed any error from the website-visit data; indeed, it appeared to counterbalance it.

ASIDE: THE IMPORTANCE OF CORRELATION
The greater the correlation between the dependent and independent variables, the greater the impact of error. Compare a weaker correlation between web visit and purchase (xx) with a strong correlation between web visit and purchase (xx).

WHAT DO WE KNOW THUS FAR?
There is certainly still more work to do, but we have formed certain hypotheses:
- When combining multiple data sets, the error appears additive.
- Error rates being equal, the underlying aspects of the data are more likely to affect the outcome than the combination itself.
- It is important, however, to qualify the basic relatedness between each independent variable and the dependent outcome. This argues for a hub-and-spoke approach to data combination.
So how did these hypotheses fare in a quick test using real-world data? (next slide on your recent error work)

COMBINING DATA SETS
There are two basic paths to integrating data. The first is a serial integration: (A+B)+C. Each data set resulting from an integration is smaller than either original source, due to non-matches.
(Diagram: Data Source A + Data Source B = Data Source A+B; Data Source A+B + Data Source C = Data Source A+B+C)
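The non-match shrinkage in a serial integration compounds multiplicatively through each step. A two-line sketch, assuming an illustrative 80% match rate at every integration:

```python
match_rate = 0.80   # assumed match rate at each integration step
n_a = 10_000        # records in Data Source A

n_ab = round(n_a * match_rate)     # records surviving A+B
n_abc = round(n_ab * match_rate)   # records surviving (A+B)+C
# Two 80% steps leave only 64% of the original records matched.
```

After k such steps only match_rate**k of the records survive, which is why the slide warns that each integrated set is smaller than either source.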

COMBINING DATA SETS
Another approach is a hub-and-spoke model: (A+B)+(A+C), etc. While the final integrated set is still reduced due to non-matches, the error from each match to the hub is known.
(Diagram: matching-information hub with spokes to each data set, marked "TBD")

AD EFFECTIVENESS: MORE COMPLICATED
Ad effectiveness captures the correlation between exposure to advertising and subsequent purchase of a product. When someone who sees an ad buys the product, we say they have CONVERTED.
(Diagram: matching-information hub with spokes for purchase and further data sets marked "TBD")