Describing Data September 14, 2016
Updates This week – Lab sections begin Wed: 2-4pm (Today!) Wed: 4-6pm (Today!) Mon: 4-6pm Next week Eric Glass, guest speaker from DSSC (part of class) The following week, another speaker talking about Zotero.
Updates to assignments Updated LiPS assignment Still have to do seven write-ups One must be either Fulong Wu (Monday evening, Nov. 14th) or Malo Hutson (Tuesday evening, Sept. 20th) Assignment 2 posted to CourseWorks Due at the start of your lab in 2 weeks. Hand in a paper copy to your TA and also post it to CourseWorks.
Today: Statistics Descriptive Describe and summarize our data to give insights Inferential Use statistics to make generalizations about a broader population
Types of Variables Categorical Nominal (not ranked) College major, type of property, color of car Ordinal (ordered or ranked) Useful for preferences, though no value assigned Dichotomous (two categories, not ranked) Yes/no Numerical Discrete (values are counts) Continuous (values are measures)
Variables Nominal Exclusive but not ordered or ranked Ordinal Ranked Interval Equally spaced variables
Nominal Examples Think of nominal scales as “labels” No quantitative value
Nominal Examples Think of nominal scales as “labels” No quantitative value
Color            Count
Blue             10
Black            8
Red              6
blue             5
Purple           3
Green            2
Purple           2
White            2
BLUE             1
Brown            1
Burgundy         1
Gray             1
Pink             1
Red              1
Yellow           1
nav              1
orange           1
purple           1
red              1
seafoam green    1
turquoise        1
white            1
Nominal Examples Think of nominal scales as “labels” No quantitative value Other Examples: Gender Hair color Neighborhood When there are only two categories, we call this “dichotomous.” Examples – Heads/Tails, On/Off, Rural/Urban, In poverty / Not in poverty Q: What about gender? Is that a dichotomous variable?
Ordinal Ranked in order of values, but the difference between values is not always known Example: Educational attainment
Ordinal example: educational attainment
Interval Numerical scales where both the order of and the differences between values are known Examples: Money or income Height Weight
Likert items Allow people to respond according to some scale
Likert items Allow people to respond according to some scale Examples: Question: How frequently do you think you need to come to class to get a high pass? o Always o Often o Occasionally o Rarely o Never
Likert items Allow people to respond according to some scale Examples: Question: I already know everything there is to know about “Planning Techniques” o Agree Strongly o Agree Slightly o Neutral o Disagree Slightly o Disagree Strongly
Likert items Allow people to respond according to some scale Examples – four point scale Question: I read s from Nick Klein o Most of the time o Some of the time o Seldom o Never
Likert items Allow people to respond according to some scale Examples – four point scale Question: I read s from Nick Klein o Most of the time – ALL OF THE TIME o Some of the time o Seldom o Never
Likert Scales What types of variables are these? How can we interpret them?
Descriptive stats
We need some data to describe
Lucky us!
What year were you born? 50 responses: 1993, 1991, 1960, 1993, 1994, 1992, 1989, 1992, 1993, 1993, 1994, 1991, 1990, 1992, 1987, 1989, 1994, 1992, 1989, 1992, 1994, 1985, 1994, 1991, 1991, 1992, 1993, 1993, 1993, 1992, 1991, 1985, 1992, 1992, 1992, 1985, 1994, 1993, 1995, 1991, 1985, 1993, 1990, 1992, 1994, 1994, 1994, 1994, 1992, 1990
Hard to make sense of this… 50 responses: 1993, 1991, 1960, 1993, 1994, 1992, 1989, 1992, 1993, 1993, 1994, 1991, 1990, 1992, 1987, 1989, 1994, 1992, 1989, 1992, 1994, 1985, 1994, 1991, 1991, 1992, 1993, 1993, 1993, 1992, 1991, 1985, 1992, 1992, 1992, 1985, 1994, 1993, 1995, 1991, 1985, 1993, 1990, 1992, 1994, 1994, 1994, 1994, 1992, 1990
We can use a “frequency table” with columns: Year born, Frequency, Percent
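A frequency table for the birth-year responses can be sketched in a few lines. This is only an illustration: the function name `frequency_table` and the print layout are assumptions, not something from the slides.

```python
from collections import Counter

# The 50 survey responses listed on the earlier slide.
years = [
    1993, 1991, 1960, 1993, 1994, 1992, 1989, 1992, 1993, 1993,
    1994, 1991, 1990, 1992, 1987, 1989, 1994, 1992, 1989, 1992,
    1994, 1985, 1994, 1991, 1991, 1992, 1993, 1993, 1993, 1992,
    1991, 1985, 1992, 1992, 1992, 1985, 1994, 1993, 1995, 1991,
    1985, 1993, 1990, 1992, 1994, 1994, 1994, 1994, 1992, 1990,
]


def frequency_table(values):
    """Return (value, count, percent) rows sorted by value."""
    counts = Counter(values)
    total = len(values)
    return [(v, c, 100 * c / total) for v, c in sorted(counts.items())]


table = frequency_table(years)

# Print one row per distinct year, with its count and share of responses.
print(f"{'Year born':<10}{'Frequency':>10}{'Percent':>9}")
for year, count, pct in table:
    print(f"{year:<10}{count:>10}{pct:>8.0f}%")
```

Each row gives a distinct year, how many people reported it, and that count as a share of all 50 responses.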
Let’s represent it another way, graphically
We can use a “dot plot” where each dot represents a response
This is similar to a histogram
But a histogram is more flexible
We can change the number of “bins”
And change the y-axis to a measure of “relative frequency” rather than a count.
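The effect of changing the bins, and of switching from counts to relative frequency, can be sketched without any plotting library. The helper names `histogram` and `relative`, the bin widths, and the starting edge of 1960 are all assumptions chosen for illustration.

```python
# Birth-year counts tallied from the 50 responses (year: count).
counts = {1960: 1, 1985: 4, 1987: 1, 1989: 3, 1990: 3,
          1991: 6, 1992: 12, 1993: 9, 1994: 10, 1995: 1}
years = [year for year, n in counts.items() for _ in range(n)]


def histogram(values, bin_width, start):
    """Count values per bin; each bin covers [left, left + bin_width)."""
    bins = {}
    for v in values:
        left = start + ((v - start) // bin_width) * bin_width
        bins[left] = bins.get(left, 0) + 1
    return dict(sorted(bins.items()))


def relative(hist):
    """Convert counts to relative frequencies (shares that sum to 1)."""
    total = sum(hist.values())
    return {left: n / total for left, n in hist.items()}


# Same data, two different bin widths: the picture changes with the bins.
for width in (5, 10):
    hist = histogram(years, bin_width=width, start=1960)
    print(f"bin width = {width}")
    for left, share in relative(hist).items():
        bar = "#" * hist[left]
        print(f"  {left}-{left + width - 1}: {bar} ({share:.0%})")
```

The bar lengths show counts, while the percentage in parentheses is the relative frequency for the same bin.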
Another approach is a “stem and leaf”
195 |
196 |
197 |
198 |
199 |
200 |
The stem consists of the number with the last digit omitted. For our years, this means dropping the final digit of the year but keeping the decade, so “1975” becomes the stem “197”.
Another approach is a “stem and leaf” Then add the final digits (the leaf or leaves) back in to the corresponding stem
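The two steps (split off the stem, then attach the leaves) can be sketched as follows for the birth-year responses. The function name `stem_and_leaf` is hypothetical, and the display prints only the stems between the smallest and largest observed values.

```python
# Birth-year counts tallied from the 50 responses (year: count).
counts = {1960: 1, 1985: 4, 1987: 1, 1989: 3, 1990: 3,
          1991: 6, 1992: 12, 1993: 9, 1994: 10, 1995: 1}
years = [year for year, n in counts.items() for _ in range(n)]


def stem_and_leaf(values):
    """Map each stem (all but the last digit) to its sorted leaves (last digit)."""
    stems = {}
    for v in sorted(values):
        stems.setdefault(v // 10, []).append(v % 10)
    return stems


plot = stem_and_leaf(years)

# Print every stem in the observed range, including empty ones,
# so gaps in the data stay visible.
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```

Printing the empty stems is a design choice: the blank rows for 197 make the gap between the one 1960 response and the rest easy to see.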
Summary Statistics
Central Tendency and Spread Two of the simplest and most important measures
Central Tendency There are a number of measures of central tendency The most common are: Mean Median Mode Let’s focus on the first two
Mean
Median The median is the middle-most value. We can identify it by placing our data in order. Let’s use the same five values: The mean (1989.2) and median (1992) are often different. The median has a nice attribute in that it is generally not sensitive to outliers.
Median If there are two middle-most values, we take the average of the two. Let’s add our outlier (1960) to our data set and figure out the median: The median is now ( ) / 2 =
Mean and Median Mean ● Easy to understand. It’s the average ● Affected by extreme high or low values (outliers) ● May not best characterize skewed distributions Median ● Not affected by outliers ● May better characterize skewed distributions
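The contrast between the two measures can be checked on the full set of 50 birth-year responses, with and without the 1960 outlier. This is a sketch using Python's standard `statistics` module; it uses all 50 responses rather than the five-value example, and the variable names are assumptions.

```python
import statistics

# Birth-year counts tallied from the 50 responses (year: count).
counts = {1960: 1, 1985: 4, 1987: 1, 1989: 3, 1990: 3,
          1991: 6, 1992: 12, 1993: 9, 1994: 10, 1995: 1}
years = [year for year, n in counts.items() for _ in range(n)]

# The same data with the single outlier removed.
no_outlier = [y for y in years if y != 1960]

mean_all = statistics.mean(years)
median_all = statistics.median(years)
mean_trimmed = statistics.mean(no_outlier)
median_trimmed = statistics.median(no_outlier)

print(f"with 1960:    mean = {mean_all:.2f}, median = {median_all:.0f}")
print(f"without 1960: mean = {mean_trimmed:.2f}, median = {median_trimmed:.0f}")
```

Dropping one response moves the mean by more than half a year, while the median stays at 1992.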
What about mode? Mode ● The most frequent value ● Less often used in social science
Percentiles Imagine a chart with all the observable values in a population; it contains 100 percent of the possible values. The pth percentile is the value of a given distribution such that p% of the distribution is less than or equal to that value. Quartiles: The 25th, 50th, and 75th percentiles Quintiles: The 20th, 40th, 60th, and 80th percentiles Deciles: The 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th, and 90th percentiles. The 50th percentile is the MEDIAN
10th percentile = 10 percent under curve (shaded red)
Basic descriptive statistics 25th percentile = 25 percent under curve (shaded red)
Basic descriptive statistics 50th percentile = 50 percent under curve (shaded red)
75th percentile = 75 percent under curve (shaded red)
Basic descriptive statistics 90th percentile = 90 percent under curve (shaded red)
Percentiles from our data
50th percentile (the median value) is 1992. 25th percentile is 1991. 75th percentile is 1993.
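Percentiles of the birth-year responses can be computed with a short function. One caveat: statistical packages use several slightly different percentile definitions. The sketch below assumes linear interpolation between the two closest ranks, which is one common convention and not necessarily the one used for the slides. The function name `percentile` is hypothetical.

```python
# Birth-year counts tallied from the 50 responses (year: count).
counts = {1960: 1, 1985: 4, 1987: 1, 1989: 3, 1990: 3,
          1991: 6, 1992: 12, 1993: 9, 1994: 10, 1995: 1}
years = [year for year, n in counts.items() for _ in range(n)]


def percentile(values, p):
    """p-th percentile (0-100) by linear interpolation between closest ranks.

    Assumption: this is one of several conventions in use; other software
    may return slightly different values for the same data.
    """
    data = sorted(values)
    pos = (len(data) - 1) * p / 100   # fractional index into the sorted data
    lower = int(pos)
    frac = pos - lower
    if lower + 1 < len(data):
        return data[lower] + frac * (data[lower + 1] - data[lower])
    return data[lower]


for p in (10, 25, 50, 75, 90):
    print(f"{p}th percentile: {percentile(years, p):.1f}")
```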
Measures of Spread
How do we describe the different distributions?
Measures Range Interquartile range Index of dispersion Standard Deviation
Interquartile Range (IQR) The IQR is a simple measure of spread: It is the difference between the 25th and 75th percentile values. The IQR tells us about the spread around the median
Interquartile Range (IQR) 50th percentile (the median value) is 1992. 25th percentile is 1991. 75th percentile is 1993.
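The IQR, and the plain range for comparison, can be sketched for the birth-year responses with Python's standard library. `statistics.quantiles` supports two methods; `method="inclusive"` is an assumption here, and the other method can give slightly different quartiles.

```python
import statistics

# Birth-year counts tallied from the 50 responses (year: count).
counts = {1960: 1, 1985: 4, 1987: 1, 1989: 3, 1990: 3,
          1991: 6, 1992: 12, 1993: 9, 1994: 10, 1995: 1}
years = [year for year, n in counts.items() for _ in range(n)]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(years, n=4, method="inclusive")

iqr = q3 - q1                          # spread of the middle half of the data
data_range = max(years) - min(years)   # spread of all the data

print(f"Q1 = {q1:.0f}, median = {q2:.0f}, Q3 = {q3:.0f}")
print(f"IQR = {iqr:.0f} years, range = {data_range} years")
```

The range is driven almost entirely by the single 1960 response, while the IQR ignores it.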
Boxplots
Standard Deviation Often, we will use and talk about st. dev. Represented by sigma : σ The st. dev tells us about the spread from the mean (The IQR tells us about the spread from the median)
Standard Deviation
But the st. dev. is really useful. If we have normally distributed data, we can expect about 68% of values to fall within 1 st. dev. of the mean and about 95% within 2.
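As a rough check, the standard deviation of the birth-year responses and the share of responses within 1 and 2 st. dev. of the mean can be computed directly. This sketch assumes the population formula (`statistics.pstdev`); the sample formula (`statistics.stdev`) gives a slightly larger value. The helper `share_within` is hypothetical.

```python
import statistics

# Birth-year counts tallied from the 50 responses (year: count).
counts = {1960: 1, 1985: 4, 1987: 1, 1989: 3, 1990: 3,
          1991: 6, 1992: 12, 1993: 9, 1994: 10, 1995: 1}
years = [year for year, n in counts.items() for _ in range(n)]

mean = statistics.mean(years)
sigma = statistics.pstdev(years)  # assumption: population st. dev.


def share_within(values, center, spread, k):
    """Share of values that fall within k spreads of the center."""
    low, high = center - k * spread, center + k * spread
    return sum(low <= v <= high for v in values) / len(values)


within_1 = share_within(years, mean, sigma, 1)
within_2 = share_within(years, mean, sigma, 2)

print(f"mean = {mean:.2f}, st. dev. = {sigma:.2f}")
print(f"within 1 st. dev.: {within_1:.0%}")
print(f"within 2 st. dev.: {within_2:.0%}")
```

These responses are not normally distributed (one outlier stretches the st. dev.), so the shares need not match the 68% and 95% figures.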
Other ways to describe spread
Skewness and Symmetry
Why might data be skewed? Why might data be bimodal?
Skewed data example: Family Income
Q: Guess the mean
$71,840
Q: Guess the mean $71,840
Q: Guess the mean $71,840 Q: Guess the median
Q: Guess the mean $71,840 Q: Guess the median $55,000
Interpreting Tables
Elements of a Table Title describes content Sample size presented Actual and percentage shares presented
Assumptions stated Source of calculations stated
Interpreting Tables From Manski (2014) Death penalty moratorium was lifted in the U.S. in 1976 Three ways to interpret data presented
Interpreting Tables 1) “Before and after” Average effect of death penalty is -0.6 (calculated as 9.7 - 10.3)
Interpreting Tables 2) Compare treated and untreated Assumes all else equal, e.g. propensity to kill is the same everywhere Average effect in 1977 is 2.8 (= 9.7 - 6.9)
Interpreting Tables 3) Difference in difference Changes in effects over time to account for policy changes Treated states declined from 10.3 to 9.7 = -0.6 Untreated states declined from 8.0 to 6.9 = -1.1 Effect = 0.5 = [(9.7 - 10.3) - (6.9 - 8.0)]
Interpreting Tables Before and after shows a reduced homicide rate (-0.6) Comparison of treated and untreated shows a rate higher by 2.8 Difference in difference shows an increase in rate of 0.5 per 100,000 Explanations?
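The three readings use the same four homicide rates, so the arithmetic can be laid out side by side. A minimal sketch using only the rates quoted on these slides; the dictionary keys and variable names are assumptions for illustration.

```python
# Homicide rates per 100,000 as quoted on the slides.
# "before" / "after" refer to before and after the moratorium was lifted.
rates = {
    ("treated", "before"): 10.3,
    ("treated", "after"): 9.7,
    ("untreated", "before"): 8.0,
    ("untreated", "after"): 6.9,
}

# 1) Before and after: change over time in the treated states only.
before_after = rates[("treated", "after")] - rates[("treated", "before")]

# 2) Treated vs. untreated: gap between the groups in the later year only.
treated_vs_untreated = rates[("treated", "after")] - rates[("untreated", "after")]

# 3) Difference in difference: change in treated minus change in untreated.
untreated_change = rates[("untreated", "after")] - rates[("untreated", "before")]
diff_in_diff = before_after - untreated_change

print(f"before and after:         {before_after:+.1f}")
print(f"treated vs. untreated:    {treated_vs_untreated:+.1f}")
print(f"difference in difference: {diff_in_diff:+.1f}")
```

The same table supports a negative, a large positive, and a small positive estimate, depending on which comparison is treated as the counterfactual.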
Presenting Data Tables Charts Graphs
Problems with Pie Charts No sample size Similarly sized pies suggest all groups are equal and all response rates are about the same Were yes/no the only options? What are “enough transportation options”?
When Pie Charts Are Appropriate
Bar Chart
Measures of association