There be dragons Avoid stats pitfalls Jennifer LaFleur, ProPublica.

Presentation transcript:

There be dragons Avoid stats pitfalls Jennifer LaFleur, ProPublica

Counting can be fun

But it’s not always enough

Let there be no willy-nilly analyzing (or interpreting)

Don’t let this happen to you

The Gaydar study: Used the R values to test the correlation of students’ predictions versus actual sexual orientation. Correlation is a first step to see if things go up and down together, but in journalism, we usually take it one step further and run a regression.

Don’t be a Wallenda with your analysis

TWERP: Transparency in your methodology. Wash, rinse, repeat. Experts, experts, experts. Run it by your targets. Prove-yourself-wrong attitude.

Most commonly used stats tools: linear regression; logistic regression; ANOVA; tests (chi-square, t-test, etc.).

Stats in Practice

Why regression is cool: Context

Why regression is cool: Reality checks. JENNIFER LAFLEUR, Staff Writer. When the votes were counted, some clear patterns emerged. Majority black precincts overwhelmingly voted against the strong-mayor proposition in early May. Majority white precincts tended to support it. And where were Hispanics, who some experts describe as an emerging political force? In Dallas, evidence suggests that, at least in the strong-mayor election, they had little impact on the outcome. A Dallas Morning News analysis of voting results found that turnout in predominantly Hispanic precincts was roughly half that of predominantly white and black precincts, and no distinct voting patterns emerged.

Why regression is cool: Reality checks

Why regression is cool: Evens the playing field

Regression basics: The line

The line is the one that minimizes the sum of the squared distances between the actual points and the “prediction” line based on the input (independent) variable.

And gives you this equation: y = mx + b, where x is the independent variable, y is the dependent variable (your outcome), m is the slope of the line and b is the place where the line crosses the y-axis (also known as the y-intercept).
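To make the line concrete, here is a minimal sketch of a least-squares fit in plain Python. The function name `fit_line` and the five data points are made up for illustration; a stats program does the same arithmetic for you.

```python
def fit_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx              # slope
    b = mean_y - m * mean_x    # y-intercept
    return m, b

# Hypothetical data: percent poor (x) and test score (y) for five schools
percent_poor = [10, 20, 30, 40, 50]
test_score = [232, 225, 215, 208, 200]

m, b = fit_line(percent_poor, test_score)
print(f"slope m = {m:.2f}, intercept b = {b:.1f}")
```

With these invented numbers the slope is negative: as percent poor goes up, the predicted score goes down.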

Regression: When you run a regression, you get a result called an R-square. That tells you how well the independent variable predicts the dependent variable. Model Summary (columns: Model, R, R Square, Adjusted R Square, Std. Error of the Estimate): Model 1, R = .911(a). The R Square is the percent of the variation in test scores that is explained by change in poverty.

So in this case, percent poor explains 80 percent of the variation in test scores. (That’s really good.)
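For a regression with one independent variable, the R-square is simply the correlation (r) squared. A rough sketch, with a hypothetical function name `r_square` and invented data for six schools:

```python
def r_square(xs, ys):
    """Share of the variation in ys explained by a straight-line fit on xs.

    For one independent variable this equals the squared correlation r.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)

# Hypothetical data, not the schools from the slides
percent_poor = [10, 20, 30, 40, 50, 60]
test_score = [230, 210, 222, 200, 205, 190]

r2 = r_square(percent_poor, test_score)
print(f"R-square = {r2:.2f}")  # read as "percent of variation explained"
```

A value near 1 means the points hug the line; a value near 0 means the line tells you almost nothing.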

Regression in real life

But wait, there’s more. You also need to know if the result is significant. You want the significance value (the p-value) to be less than .05.

We get excited by the R Square, but the slope of the line also can be useful in the story.

In this case, we can say that for every 10-point increase in poverty, test scores go down by 8 points.

We can then use the formula for a line to figure out the predicted values (and residuals). So Wydown Middle School, with an actual score of 228.5, was predicted to score lower given its poverty. It scored better than expected.

But how much better really is better? The last column (standardized residual) tells us how many standard deviations above or below its predicted score that school fell. Experts can give you standards for how many standard deviations to use.
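Predicted values, residuals and standardized residuals can be sketched in a few lines. The schools and numbers below are invented, and the standardized residual here is the residual divided by the standard error of the estimate, which is one common definition; your stats program may use a slightly different one.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = m*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    return m, mean_y - m * mean_x

def residual_table(xs, ys):
    """Return predicted values, residuals and standardized residuals."""
    m, b = fit_line(xs, ys)
    predicted = [m * x + b for x in xs]
    residuals = [y - p for y, p in zip(ys, predicted)]  # actual minus predicted
    # Standard error of the estimate for a one-variable regression
    std_error = math.sqrt(sum(r * r for r in residuals) / (len(xs) - 2))
    standardized = [r / std_error for r in residuals]
    return predicted, residuals, standardized

# Hypothetical schools: percent poor and actual test score
schools = ["A", "B", "C", "D", "E", "F"]
percent_poor = [10, 20, 30, 40, 50, 60]
test_score = [230, 210, 222, 200, 205, 190]

predicted, residuals, standardized = residual_table(percent_poor, test_score)
for name, actual, pred, res, z in zip(schools, test_score, predicted,
                                      residuals, standardized):
    print(f"School {name}: actual {actual}, predicted {pred:.1f}, "
          f"residual {res:+.1f}, standardized {z:+.2f}")
```

A positive residual means the school scored better than the line predicted; a negative one means it scored worse.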

Interpreting your results: You’re looking for a high R square, but how high it needs to be depends on subject area (school test scores versus medical studies). Don’t forget the slope: you can have a strong R square with a fairly flat line. TWERP it!

One variable may not be enough: You may have to use more than one independent variable, but be careful: they may be explaining each other more than your outcome variable. You may need to create “dummy” variables to control for categorical values.
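A “dummy” variable is just a 0/1 column for each category, with one category left out as the baseline. A small sketch, assuming a made-up `make_dummies` helper and a made-up school-type field:

```python
def make_dummies(values, baseline):
    """Turn a categorical column into 0/1 columns, one per non-baseline category."""
    categories = sorted(set(values) - {baseline})
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

# Hypothetical categorical field
school_type = ["urban", "rural", "suburban", "urban"]

dummies = make_dummies(school_type, baseline="urban")
for name, column in dummies.items():
    print(name, column)
```

Leaving out the baseline matters: if you include a column for every category, the columns explain each other perfectly.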

The rules (known as “assumptions”): Your variables need to be continuous. Your variables should be fairly normally distributed. The standard deviation should be less than the mean. There are tests you can run in your stats program to check for these.

Beware: The spurious correlation (babies and storks). Heteroskedasticity (smaller n = bigger error). Multicollinearity (relationships between your independent variables; again, there are tests to see if that’s a problem). Do all of your data checks first: one extreme value can throw off the whole analysis.
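One quick multicollinearity check is to correlate your independent variables with each other. The sketch below does that for two invented variables and also computes the variance inflation factor (VIF), which for exactly two predictors is 1 / (1 - r squared). The function names are hypothetical, and a common rule of thumb (not a hard law) treats a VIF above roughly 10 as a warning sign.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(xs, ys):
    """VIF when the model has exactly two independent variables."""
    r = pearson_r(xs, ys)
    return 1.0 / (1.0 - r * r)

# Hypothetical independent variables that measure nearly the same thing
percent_poor = [10, 20, 30, 40, 50]
percent_free_lunch = [12, 19, 33, 41, 48]

r = pearson_r(percent_poor, percent_free_lunch)
vif = vif_two_predictors(percent_poor, percent_free_lunch)
print(f"r = {r:.3f}, VIF = {vif:.1f}")
```

Here the two variables move almost in lockstep, so putting both in the same model would mostly have them explaining each other.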

What if your outcome variable is NOT continuous? Categorical or dichotomous variables: tickets issued in traffic stops (issued, not issued); loan denials; jury selection; deaths from taking a drug.

Another tool: logistic regression. Minorities are 8 times as likely as non-minorities to get a ticket.

But what about: How fast they were going? What was their gender? What was their age? Logistic regression lets you control for all of those things. You should test every variable you have in your “model.”
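The “times as likely” figure from a logistic regression is an odds ratio. As a rough illustration, the sketch below computes an odds ratio straight from a two-by-two table of invented counts; with a single yes/no independent variable and no controls, a logistic regression gives the same number (its coefficient is the log of this ratio). It does not control for speed, gender or age; for that you need to run the actual model in a stats program.

```python
import math

def odds_ratio(group_yes, group_no, other_yes, other_no):
    """Odds of the outcome in one group divided by the odds in the other."""
    group_odds = group_yes / group_no
    other_odds = other_yes / other_no
    return group_odds / other_odds

# Hypothetical traffic-stop counts, chosen only to make the arithmetic easy
minority_ticketed, minority_not = 80, 20
non_minority_ticketed, non_minority_not = 50, 100

ratio = odds_ratio(minority_ticketed, minority_not,
                   non_minority_ticketed, non_minority_not)
print(f"odds ratio = {ratio:.1f}")
print(f"log odds ratio = {math.log(ratio):.2f}")  # the uncontrolled logistic coefficient
```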

Reporting the results: “Blacks were struck at more than twice the rate of whites…even when they gave similar answers to key questions”

Use descriptives for everything else

Don’t run with scissors: Make sure you know how many records you should have and that you have them all. Double-check totals or counts. Check for studies or summary reports. Consistency-check all fields.

Don’t run with scissors: Other basic checks: make sure all states are included, all cities/counties are included, the range of fields is possible (for example, check for DOBs that would make people too old or too young). Check for missing data or blank fields. Check your methodology (if necessary) against other similar research.
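A few of these checks are easy to script. The sketch below is one possible shape, with a hypothetical `check_records` function, made-up field names and a made-up age limit; adapt all of them to your own data.

```python
from datetime import date

def check_records(records, expected_count, required_fields, as_of, max_age=110):
    """Return a list of plain-language problems found in the records."""
    problems = []
    # Do we have as many records as we should?
    if len(records) != expected_count:
        problems.append(f"expected {expected_count} records, found {len(records)}")
    for i, rec in enumerate(records):
        # Missing data or blank fields
        for field in required_fields:
            value = rec.get(field)
            if value is None or str(value).strip() == "":
                problems.append(f"record {i}: missing {field}")
        # DOBs that would make people too old or not yet born
        dob = rec.get("dob")
        if isinstance(dob, date):
            age = (as_of - dob).days / 365.25
            if age < 0 or age > max_age:
                problems.append(f"record {i}: implausible dob {dob}")
    return problems

# Hypothetical records
records = [
    {"name": "A", "state": "TX", "dob": date(1980, 5, 1)},
    {"name": "B", "state": "", "dob": date(1975, 3, 9)},
    {"name": "C", "state": "MO", "dob": date(1850, 1, 1)},
]

for problem in check_records(records, expected_count=4,
                             required_fields=["name", "state", "dob"],
                             as_of=date(2012, 1, 1)):
    print(problem)
```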

Vetting studies by others: Ask for the methodology/report. Ask how respondents were selected, and how many. Who paid for it? Talk to researchers with specialties in the subject that you are reporting on.

Resources: The New Precision Journalism, by Philip Meyer; Numbers in the Newsroom, by Sarah Cohen for IRE; How to Lie with Statistics, by Darrell Huff; A Mathematician Reads the Newspaper, by John Allen Paulos.

Paranoia is your best friend when it comes to data