STEVE DOIG CRONKITE SCHOOL OF JOURNALISM Statistics for Science Journalists.


Journalists hate math Definition of journalist: A do-gooder who hates math. “Word person, not a numbers person.” 1936 JQ article noting habitual numerical errors in newspapers Japanese 6th graders more accurate on math test than applicants to Columbia’s Graduate School of Journalism 20% of journalists got more than half wrong on 25-question “math competency test” (Maier) 18% of 5,100 stories examined by Phil Meyer had math errors

Bad examples abound Paulos: 300% decrease in murders Detroit Free Press (2006): Compared ACS to Census data to get false drop in median income KC Star (2000): Priests dying of AIDS at 4 times the rate of all Americans Delaware ZIP Code of infant death NYT: 51% of women without spouses

Common problems Numbers that don’t add up Making the reader do the math Failure to ask “Does this make sense?” Over-precision Ignoring sampling error margins Implying that correlation equals causation

Dangers of journalistic innumeracy Misleads math-challenged readers/viewers Hurts credibility among math-capable readers/viewers Leads to charges of bias, even when cause is ignorance Makes reporters vulnerable to being used for the agendas of others

Common Research Methods Randomized experiments: Measure deliberate manipulation of the environment Observational studies: Measure the differences that occur naturally Meta-analyses: Quantitative review of multiple studies Case Study: Descriptive in-depth examination of one or a few individuals

Simple Measures......don’t exist!

Measurement Variability Variable measurements include unpredictable errors or discrepancies that aren’t easily explained. Natural variability is the result of the fact that individuals and other things are different.

Reasons for variable measures Measurement error Natural variability between individuals Natural variability over time in a single individual

Some Pitfalls in Studies

Deliberate Bias? If you found a wallet with $20, would you: “Keep it?” (23% would keep it) “Do the honest thing and return it?” (13% would keep it)

Unintentional Bias? “Do you use drugs?” “Are you religious?”

Desire to Please? People routinely say they have voted when they actually haven’t, that they don’t smoke when they do, and that they aren’t prejudiced. One study six months after an election: 96% of actual voters said they voted. 40% of non-voters said they voted.

Asking the uninformed? Washington Post poll: “Some people say the 1975 Public Affairs Act should be repealed. Do you agree or disagree that it should be repealed?” 24% said yes 19% said no rest had no opinion

Asking the uninformed? Later Washington Post poll: “President Clinton says the 1975 Public Affairs Act should be repealed. Do you agree or disagree that it should be repealed?” 36% of Democrats agreed 16% of Republicans agreed rest had no opinion

Unnecessary Complexity? “Do you support our soldiers in Iraq so that terrorists won’t strike the U.S. again?”

Question Order “About how many times a month do you normally go out on a date?” “How happy are you with life in general?”

Sampling

Margin of Error 95% of the time, a random sample’s characteristics will differ from the population’s by no more than about 1/√N, where N = sample size

Two Important Concepts about Error Margin The larger the sample, the smaller the margin of sampling error. The size of the population being surveyed doesn’t matter.* *Unless the sample is a significant fraction of the population.

Sampling realities Bigger sample means more cost (money and/or time) Diminishing return on error margin improvement as sample increases. N=100: +/- 10 percentage points N=400: +/- 5 percentage points N=900: +/- 3.3 percentage points Sample needs only to be large enough to give a reasonable answer. Sampling error affects subsamples, too.
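The three figures above fit the rule of thumb that the 95% margin of sampling error is roughly 1/√N. A minimal sketch of that arithmetic follows; the function name `margin_of_error` is a hypothetical label, and the rule is only an approximation for simple random samples.

```python
import math


def margin_of_error(sample_size: int) -> float:
    """Approximate 95% margin of sampling error as a proportion (0.10 = 10 points).

    Uses the 1/sqrt(N) rule of thumb, which assumes a simple random sample.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return 1 / math.sqrt(sample_size)


# Quadrupling the sample only halves the margin: the diminishing return.
for n in (100, 400, 900, 1600):
    print(f"N={n}: +/- {margin_of_error(n) * 100:.1f} percentage points")
```

The same function applies to subsamples: a subgroup of 100 respondents inside a poll of 900 carries the wider margin for N=100, not the one for the full sample.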

Describing data sets

Three Useful Features of a Set of Data The Center The Variability The Shape

The Center Mean (average): Total of the values, divided by the number of values Median: The middle value of an ordered list of values Mode: The most common value Outliers: Atypical values far from the center

Yankees’ Baseball Salaries Average: $7,404,762 Median: $2,500,000 Mode: $500,000 (also the minimum) Outlier: $27.5 million (Alex Rodriguez)
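To see why one very large value pulls the mean away from the median, here is a minimal sketch using Python's standard `statistics` module. The salary list is made up for illustration; it is not the actual Yankees payroll.

```python
import statistics

# Hypothetical salaries (made-up figures): many small ones, one very large one.
salaries = [
    500_000, 500_000, 500_000, 1_200_000,
    2_500_000, 4_000_000, 9_000_000, 27_500_000,
]

mean_salary = statistics.mean(salaries)      # total divided by count
median_salary = statistics.median(salaries)  # middle of the ordered list
mode_salary = statistics.mode(salaries)      # most common value

print(f"Mean:   ${mean_salary:,.0f}")
print(f"Median: ${median_salary:,.0f}")
print(f"Mode:   ${mode_salary:,.0f}")
```

With skewed data like this, the median is usually the fairer description of a "typical" value.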

The Variability Some measures of variability: Maximum and minimum: Largest and smallest values Range: The distance between the largest and smallest values Quartiles: The medians of each half of the ordered list of values Standard deviation: Think of it as the average distance of all the values from the mean.
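The measures above can be sketched in a few lines. The data are made up, and `quartiles_by_halves` is a hypothetical helper that follows the slide's definition (the median of each half); statistical packages use several slightly different quartile conventions. Note that the standard deviation is, strictly, a root-mean-square distance from the mean, which is why "average distance" is only a way to think about it.

```python
import statistics


def quartiles_by_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    When the count is odd, the middle value is left out of both halves.
    """
    if len(values) < 2:
        raise ValueError("need at least two values")
    ordered = sorted(values)
    half = len(ordered) // 2
    lower, upper = ordered[:half], ordered[-half:]
    return statistics.median(lower), statistics.median(upper)


scores = [2, 4, 4, 5, 7, 9, 10, 12]  # hypothetical data

lowest, highest = min(scores), max(scores)
value_range = highest - lowest
q1, q3 = quartiles_by_halves(scores)
spread = statistics.stdev(scores)  # sample standard deviation

print(f"Min {lowest}, max {highest}, range {value_range}")
print(f"Quartiles: Q1={q1}, Q3={q3}")
print(f"Standard deviation: {spread:.2f}")
```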

What is “normal”? Don’t consider the average to be “normal” Variability is normal Anything within about 3 standard deviations of the mean is “normal”

Bell-Shaped “Normal” Curve

Some Characteristics of a Normal Distribution Symmetrical (not skewed) One peak in the middle, at the mean The wider the curve, the greater the standard deviation Area under the curve is 1 (or 100%)

Percentiles Your percentile for a particular measure (like height or IQ) is the percentage of the population that falls below you. Compared to other American males: My height (5’ 11”): 75th percentile My weight (230 lbs.): 85th percentile My age (66): 88th percentile Therefore, I am older and heavier than I am tall.
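The definition above translates directly into code. This is a minimal sketch; `percentile_rank` is a hypothetical helper name and the heights are made-up values, not real population data.

```python
def percentile_rank(population, value):
    """Percentage of the population that falls strictly below `value`."""
    if not population:
        raise ValueError("population must not be empty")
    below = sum(1 for member in population if member < value)
    return 100 * below / len(population)


# Hypothetical heights in inches for a group of ten people.
heights = [64, 66, 67, 68, 69, 70, 71, 72, 73, 75]

print(f"A height of 71 inches is at percentile {percentile_rank(heights, 71):.0f}")
```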

Standardized Scores A standardized score (also called the z-score) is simply the number of standard deviations a particular value is either above or below the mean. The standardized score is: Positive if above the mean Negative if below the mean Useful for defining data points as outliers.

The Empirical Rule For any normal curve, approximately: 68% of values within one StdDev of the mean 95% of values within two StdDevs of the mean 99.7% of values within three StdDevs of the mean
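The three percentages can be checked against the normal curve itself. A minimal sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later); `share_within` is a hypothetical helper name.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k: float) -> float:
    """Share of values within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"Within {k} StdDev: {share_within(k) * 100:.1f}%")
```

Because every normal curve is a stretched or shifted copy of the standard one, the same shares hold for any mean and standard deviation.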

Outlier A value that is more than three standard deviations above or below the mean.
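Putting the standardized score and the three-standard-deviation rule together, a minimal sketch of flagging outliers might look like this. The readings are made up, and `z_scores` and `find_outliers` are hypothetical helper names. One caveat: in very small data sets no value can reach a z-score of 3, so the rule only makes sense with a reasonable number of values.

```python
import statistics


def z_scores(values):
    """Number of standard deviations each value sits above (+) or below (-) the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [(value - mean) / sd for value in values]


def find_outliers(values, threshold=3.0):
    """Values more than `threshold` standard deviations from the mean."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]


# Hypothetical readings: twenty values near 50 and one far away.
readings = [48, 50, 52, 49, 51, 50, 47, 53, 50, 49,
            51, 50, 52, 48, 50, 51, 49, 50, 52, 48, 120]

print("Outliers:", find_outliers(readings))
```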

Correlation

Strength of Relationship Correlation (also called the correlation coefficient or Pearson’s r) is the measure of strength of the linear relationship between two variables. Think of strength as how closely the data points come to falling on a line drawn through the data.

Features of Correlation Correlation can range from +1 to -1 Positive correlation: As one variable increases, the other increases Negative correlation: As one variable increases, the other decreases Zero correlation means the best line through the data is horizontal Correlation isn’t affected by the units of measurement
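For readers who want to see the number being computed, here is a minimal sketch of Pearson's r written out by hand. The study-hours data are made up, and `pearson_r` is a hypothetical helper name. The last lines illustrate the point that changing units leaves r unchanged.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired values."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # How the two variables move together, and how much each moves alone.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when a variable never changes")
    return co_movement / math.sqrt(spread_x * spread_y)


# Hypothetical data: hours studied and test score for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]

r_hours = pearson_r(hours, scores)
r_minutes = pearson_r([h * 60 for h in hours], scores)  # same data, new units

print(f"r using hours:   {r_hours:.3f}")
print(f"r using minutes: {r_minutes:.3f}")
```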

Positive Correlations r = +.1 r = +.4 r = +.8 r = +1

Negative Correlations r = -.1 r = -.4 r = -.8 r = -1

Zero correlation r = 0

Number of Points Doesn’t Matter r = .8

Important! Correlation does not imply causation.

Correlation of variables When considering relationships between measurement variables, there are two kinds: Explanatory (or independent) variable: The variable that attempts to explain or is purported to cause (at least partially) differences in the… Response (or dependent or outcome) variable Often, chronology is a guide to distinguishing them (examples: baldness and heart attacks, poverty and test scores)

Some reasons why two variables could be related The explanatory variable is the direct cause of the response variable Example: pollen counts and percent of population suffering allergies, intercourse and babies

Some reasons two variables could be related The response variable is causing a change in the explanatory variable Example: hotel occupancy and advertising spending, divorce and alcohol abuse

Some reasons two variables could be related The explanatory variable is a contributing -- but not sole -- cause Example: birth complications and violence, gun in home and homicide, hours studied and grade, diet and cancer

Some reasons two variables could be related Both variables may result from a common cause Example: SAT score and GPA, hot chocolate and tissues, storks and babies, fire losses and firefighters, WWII fighter opposition and bombing accuracy

Some reasons two variables could be related Both variables are changing over time Example: divorces and drug offenses, divorces and suicides

Some reasons two variables could be related The association may be nothing more than coincidence Example: clusters of disease, brain cancer from cell phones

So how can we confirm causation? The only way to confirm is with a designed (randomized double-blind) experiment. But non-statistical evidence of a possible connection may include: A reasonable explanation of cause and effect. A connection that happens under varying conditions. Potential confounding variables ruled out.

Regression

Linear Regression In addition to figuring the strength of the relationship, we can create a simple equation that describes the best-fit line (also called the “least-squares” line) through the data. This equation will help us predict one variable, given the other.

Best-fit (“least-squares”) Line

Best-fit Line??? (much variance)

Best-fit Line! (least variance)

Remember 9th Grade Algebra? x = horizontal axis y = vertical axis Equation for a line: y = slope * x + intercept or as it often is stated: y = mx + b
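The slope and intercept of the least-squares line can be computed directly from the data. A minimal sketch, with made-up data and hypothetical helper names `best_fit_line` and `predict`:

```python
def best_fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    if spread_x == 0:
        raise ValueError("x never changes, so no line can be fitted")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / spread_x
    intercept = mean_y - slope * mean_x  # the line passes through the two means
    return slope, intercept


def predict(x, slope, intercept):
    """Predicted y for a given x: y = mx + b."""
    return slope * x + intercept


# Hypothetical data: hours studied (x) and test score (y).
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]

m, b = best_fit_line(hours, scores)
print(f"score = {m:.1f} * hours + {b:.1f}")
print(f"Predicted score for 6 hours: {predict(6, m, b):.1f}")
```

Predictions far outside the range of the original data (say, 40 hours here) should be treated with suspicion, since the line was only fitted to what was observed.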

Regression in data journalism Public school test scores Cheating in school test scores Tenure of white vs. black coaches in NBA Racial bias in picking jurors Racial profiling in traffic stops

Confusion of the inverse

Confusion of the Inverse Confusing these two: Probability of actually having a condition, given a positive test for it Probability of having a positive test, given actually having the condition When the incidence of some disease or condition is very low, and the test for it is not perfect, there will be a high probability that a positive test result is a false positive.

Definitions Base rate: The probability that someone has a disease or condition, without knowing any test results. Test Sensitivity: Proportion of people who correctly test positive when they have the disease or condition (true positive) Test Specificity: Proportion of people who correctly test negative when they don’t have the disease or condition (true negative)

Drug Tests Consider this scenario: Base rate: 1% of population to be tested uses dangerous drugs You use a test that’s 99% accurate in both sensitivity and specificity 10,000 people are tested

Drug Tests

          Test Positive   Test Negative    Total
Users                99               1      100
Not                  99           9,801    9,900
Total               198           9,802   10,000

(50% of positives are FALSE!)
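The scenario above (1% base rate, 99% sensitivity and specificity, 10,000 people tested) can be worked through in a few lines. This is a minimal sketch; `drug_test_table` and its dictionary keys are hypothetical names chosen for illustration.

```python
def drug_test_table(population, base_rate, sensitivity, specificity):
    """Expected counts in each cell of the test-result table."""
    users = round(population * base_rate)
    non_users = population - users
    true_positives = round(users * sensitivity)       # users who test positive
    false_negatives = users - true_positives          # users the test misses
    true_negatives = round(non_users * specificity)   # non-users who test negative
    false_positives = non_users - true_negatives      # non-users wrongly flagged
    return {
        "true_positives": true_positives,
        "false_negatives": false_negatives,
        "false_positives": false_positives,
        "true_negatives": true_negatives,
    }


table = drug_test_table(10_000, base_rate=0.01, sensitivity=0.99, specificity=0.99)
all_positives = table["true_positives"] + table["false_positives"]
false_share = table["false_positives"] / all_positives

print(table)
print(f"Share of positive tests that are false: {false_share:.0%}")
```

Raising the base rate in the call (for example to 0.20) shows how quickly the false share shrinks when the condition is more common.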

Confidence intervals and p-values

Confidence Intervals Like the error margin around poll results A confidence interval is a tradeoff between certainty and precision, like shooting at targets of different sizes The bigger the sample, the smaller the confidence interval at the 95% level When comparing results, if confidence intervals overlap, the difference may NOT be statistically significant
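A minimal sketch of a 95% confidence interval for a poll percentage, assuming a simple random sample and the usual normal approximation (1.96 standard errors on either side). The poll result and the helper name `proportion_ci` are made up for illustration.

```python
import math


def proportion_ci(proportion, sample_size, z=1.96):
    """Approximate confidence interval (low, high) for a sample proportion.

    z=1.96 corresponds to the 95% level under the normal approximation.
    """
    if not 0 <= proportion <= 1 or sample_size <= 0:
        raise ValueError("proportion must be in [0, 1] and sample_size positive")
    standard_error = math.sqrt(proportion * (1 - proportion) / sample_size)
    return proportion - z * standard_error, proportion + z * standard_error


# Hypothetical poll: 50% support, measured with two different sample sizes.
for n in (100, 400):
    low, high = proportion_ci(0.50, n)
    print(f"N={n}: {low * 100:.1f}% to {high * 100:.1f}%")
```

The interval for N=400 is half as wide as the one for N=100, which is the "bigger sample, smaller interval" point in numbers.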

P-values P-value is the probability of getting a sample result at least as extreme as the one observed if chance alone were at work (i.e., if there were no real effect) 95% confidence level (p < 0.05) is the most commonly used threshold in social science research Hard science, particularly medicine, often needs tighter confidence intervals and smaller p-values, like p<0.01 At p < 0.05, about 5% of tests of effects that don’t exist will still come up significant (and you won’t know which ones) On the other hand, they probably won’t be very wrong.
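One standard textbook reading of a p-value is "how often would chance alone produce a result at least this lopsided?" A minimal sketch under an assumed fair-coin scenario (the coin example and the helper name `two_sided_p_value` are illustrations, not from the slides):

```python
from math import comb


def two_sided_p_value(heads, flips):
    """Exact chance that a fair coin lands at least as far from 50/50 as observed."""
    if not 0 <= heads <= flips or flips <= 0:
        raise ValueError("heads must be between 0 and flips, and flips positive")
    expected = flips / 2
    observed_gap = abs(heads - expected)
    # Count every outcome at least as far from the expected split, in either direction.
    extreme_outcomes = sum(
        comb(flips, k) for k in range(flips + 1) if abs(k - expected) >= observed_gap
    )
    return extreme_outcomes / 2 ** flips


for heads in (55, 60, 61, 70):
    print(f"{heads} heads in 100 flips: p = {two_sided_p_value(heads, 100):.4f}")
```

Under this sketch, 60 heads in 100 flips lands just above the p < 0.05 threshold and 61 heads lands just below it, a reminder that the cutoff is a convention rather than a bright line.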

How to read a research study Pay attention to the method: Observational, randomized double-blind experiment, meta-analysis, case study Note the sample size Don’t ignore the confidence intervals Consider the p-value as a gauge of how easily chance alone could have produced the result Remember correlation doesn’t necessarily mean causation. Consider the quality of the journal (peer reviewed?) Who paid for the research?

Newsroom math bibliography “Numbers in the Newsroom,” by Sarah Cohen, IRE “News and Numbers,” by Victor Cohn and Lewis Cope “Precision Journalism (4th edition),” by Phil Meyer “Innumeracy,” by John Allen Paulos “A Mathematician Reads the Newspaper,” by John Allen Paulos “Damned Lies and Statistics,” by Joel Best

Questions?