Presentation is loading. Please wait.

Presentation is loading. Please wait.

There be dragons Avoid stats pitfalls Jennifer LaFleur, ProPublica.

Similar presentations


Presentation on theme: "There be dragons Avoid stats pitfalls Jennifer LaFleur, ProPublica."— Presentation transcript:

1 There be dragons Avoid stats pitfalls Jennifer LaFleur, ProPublica

2 Counting can be fun

3 But it’s not always enough

4

5 Let there be no willy-nilly analyzing (or interpreting)

6 Don’t let this happen to you http://www.slate.com/id/2299300/

7 The Gaydar study Used the R values to test the correlation of students’ predictions versus actual sexual orientation. Correlation is a first step to see if things go up and down together, but in journalism, we usually take it one step further and run a regression.

8 Don’t be a Wallenda with your analysis

9 TWERP Transparency in your methodology Wash, rinse, repeat Experts, experts, experts Run it by your targets Prove yourself wrong attitude

10 Linear regression Logistic regression ANOVA Tests: Chi-Square, T-test, etc… Most commonly used stats tools

11 Stats in Practice

12 Why regression is cool: Context

13 Why regression is cool: Reality checks JENNIFER LAFLEUR, Staff Writer When the votes were counted, some clear patterns emerged. Majority black precincts overwhelmingly voted against the strong-mayor proposition in early May. Majority white precincts tended to support it. And where were Hispanics, who some experts describe as an emerging political force? In Dallas, evidence suggests that, at least in the strong-mayor election, they had little impact on the outcome. A Dallas Morning News analysis of voting results found that turnout in predominantly Hispanic precincts was roughly half that of predominantly white and black precincts - and no distinct voting patterns emerged.

14 Why regression is cool: Reality checks

15 Why regression is cool: Evens the playing field

16 Regression basics: The line

17 Based on the sum of the squared distances from the “prediction” line based on the input (independent) variable

18 And gives you this equation: y=mx+b Where x is the independent variable and y is the dependent variable for your outcome, m is the slope of the line and b is the place where the line crosses the y axis (also known as the y intercept)

19 Regression When you run a regression, you get a result called an R- square. That tells you how much the independent variable predicts the dependent variable. Model Summary ModelRR Square Adjusted R Square Std. Error of the Estimate 1.911(a).830.82910.8330 83 percent of the variation in test scores is explained by change in poverty

20 So in this case, percent poor explains 80 percent of the variation in test scores. (That’s really good.)

21 Regression in real life

22

23

24

25

26 But wait, There’s more. You also need to know if the result is significant. You want this to be <.05

27 We get excited by the R Square, but the slope of the line and also can be useful in the story.

28 In this case, we can say that for every 10 point increase in poverty, test scores go down by 8 points.

29 We can then use the formula for a line to figure out the predicted values (and residuals) So Wydown Middle School, with a score of 228.5, should have only scored 208.7 given its poverty. It scored better than expected.

30 But how much better really is better? The last column (standardized residual) tells us how many standard deviations above or below that school fell. Experts can give you standards for how many stdev to use.

31

32

33 You’re looking for a high R square – but how high it needs to be depends on subject area (school test scores versus medical studies) Don’t forget the slope – you can have a strong R square with a fairly flat line TWERP it! Interpreting your results

34 You may have to use more than one independent variable – but be careful – they may be explaining each other more than your outcome variable. You may need to create “dummy” variables to control for categorical values. One variable may not be enough

35 Your variables need to be continuous Your variables should be fairly normally distributed The standard deviation should be less than the mean There are tests you can run in your stats program to check for these. The rules (known as “assumptions”)

36 The spurious correlation (babies and storks) Heteroskedasticity (smaller n=bigger error) Multicollinarity (relationships between your independent variables – again, there are tests to see if that’s a problem) Do all of your data checks first – one extreme value can throw off the whole analysis. Beware

37 Tickets issued in traffic stops: Issued, not issued Loan denials Jury selection Deaths from taking a drug Categorical or dichotomous variables What if your outcome variable is NOT continuous

38 Another tool: logistic regression Minorities are 8 times as likely as non- minorities to get a ticket

39 How fast they were going? What was their gender? What was their age? Logistic regression lets you control for all of those things. You should test every variable you have in your “model” But what about

40

41

42

43

44

45

46 Reporting the results “Blacks were struck at more than twice the rate of blacks…even when they gave similar answers to key questions”

47 Use descriptives for everything else

48 Don’t run with scissors Make sure you know how many records you should have and that you have them all. Double-check totals or counts. Check for studies or summary reports. Consistency-checked all fields.

49 Don’t run with scissors Other basic checks: make sure all states are included, all cities/counties are included, the range of fields is possible (for example, check for DOBs that would make people too old or too young.) Check for missing data or blank fields Check your methodology (if necessary) against other similar research

50 Vetting studies by others Ask for the methodology/report Ask how respondents were selected – and how many Who paid for it? Talk to researchers – with specialties on the subject that you are reporting on

51 Resources: The New Precision Journalism, by Philip Meyer Numbers in the Newsroom, by Sarah Cohen for IRE How to Lie with Statistics, by Darrell Huff A Mathematician Reads the Newspaper, by John Allen Paulos ( www.math.temple.edu/~paulos/) www.math.temple.edu/~paulos/

52 Paranoia is best your friend when it comes to data


Download ppt "There be dragons Avoid stats pitfalls Jennifer LaFleur, ProPublica."

Similar presentations


Ads by Google