Published by Franklin Kennedy. Modified over 9 years ago.
1
The Scientific Study of Politics (POL 51)
Professor B. Jones
University of California, Davis
2
Fun With Numbers
Some Univariate Statistics
Learning to Describe Data
3
Useful to Visualize Data
4
Main Features
Exhibits “Right Skew”
Some “Outlying” Data Points?
Question: Are the outlying data points also “influential” data points (on measures of central tendency)?
Let’s check…
5
The Mean
Formally, the mean is given by: $\bar{Y} = \frac{Y_1 + Y_2 + \dots + Y_N}{N}$
Or more compactly: $\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i$
6
Our Data
Mean of Y is 260.67
Mechanically: (263 + 73 + … + 88)/67 = 260.67
Problems with the mean?
No indication of dispersion or variability.
7
Variance
The variance is a statistic that describes (squared) deviations around the mean: $s^2 = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N-1}$
Why “N-1”?
Interpretation: “Average squared deviations from the mean.”
8
Our Data
Variance = 202,431.8
Mechanically: $[(263-260.67)^2 + (73-260.67)^2 + \dots + (88-260.67)^2]/66$
Interpretation: “The average squared deviation around the mean of Y is 202,431.8.”
Rrrrright. (Who thinks in terms of squared deviations??)
Answer: no one.
That’s why we have a standard deviation.
9
Standard Deviation
Take the square root of the variance and you get the standard deviation.
Why we like this:
  Metric is now in original units of Y.
Interpretation:
  S.D. gives “average deviation” around the mean.
  It’s a measure of dispersion that is in a metric that makes sense to us.
10
Our Data
The standard deviation is: 449.92
Mechanically: $\{[(263-260.67)^2 + (73-260.67)^2 + \dots + (88-260.67)^2]/66\}^{1/2}$
Interpretation: “The average deviation around the mean of 260.67 is 449.92.”
Now, suppose Y=Votes…
“The average number of votes is about 261 and the average deviation around this number is about 450 votes.”
The dispersion is very large.
(Imagine the opposite case: mean test score is 85 percent; average deviation is 5 percent.)
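The mechanics of the mean, variance, and standard deviation can be sketched in R as follows. The vector y is made up purely for illustration; it is not the 67-observation votes data from the slides, which is not reproduced here.

```r
# Hypothetical data for illustration only (not the votes data from the slides)
y <- c(2, 4, 4, 4, 5, 5, 7, 9)
N <- length(y)

y_bar <- sum(y) / N                   # mean: sum of the values divided by N
s2 <- sum((y - y_bar)^2) / (N - 1)    # variance: squared deviations divided by N - 1
s <- sqrt(s2)                         # standard deviation: square root of the variance

print(c(mean = y_bar, variance = s2, sd = s))

# R's built-in var() and sd() also divide by N - 1, so they should agree
print(all.equal(c(y_bar, s2, s), c(mean(y), var(y), sd(y))))
```

The standard deviation comes back in the original units of y, which is why it is easier to interpret than the variance.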
11
Revisiting our Data
12
Skewness and The Mean
Data often exhibit skew.
This is often true with political variables.
We have a measure of central tendency and deviation about this measure (mean, s.d.).
However, are there other indicators of central tendency?
How about the median?
13
Median
“50th” Percentile: Location at which 50 percent of the cases lie above; 50 percent lie below.
Since it’s a locational measure, you need to “locate it.”
Example Data: 32, 5, 23, 99, 54
As is, not informative.
14
Median
Rank it: 5, 23, 32, 54, 99
Median Location = (N+1)/2 (when N is odd) = 6/2 = 3
Location of the median is data point 3
This is 32.
Hence, M=32, not 3!!
Interpretation: “50 percent of the data lie above 32; 50 percent of the data lie below 32.”
What would the mean be?
(42.6…data are __________ skewed)
15
Median
When N is even: -67, 5, 23, 32, 54, 99
M is usually taken to be the average of the two middle scores:
(N+1)/2 = 7/2 = 3.5
The median location is 3.5, which is between 23 and 32
M = (23+32)/2 = 27.5
All pretty straightforward stuff.
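The two median calculations above can be checked with a short R sketch. It uses the example values from the slides; the variable names are chosen here for illustration.

```r
odd_data  <- c(32, 5, 23, 99, 54)        # N = 5 (odd)
even_data <- c(-67, 5, 23, 32, 54, 99)   # N = 6 (even)

# Odd N: rank the data and take the value at location (N + 1) / 2
ranked <- sort(odd_data)
med_odd <- ranked[(length(ranked) + 1) / 2]

# Even N: median() averages the two middle values
med_even <- median(even_data)

# The mean of the odd-N example, for comparison with its median
mean_odd <- mean(odd_data)

print(c(med_odd = med_odd, med_even = med_even, mean_odd = mean_odd))
# med_odd = 32, med_even = 27.5, mean_odd = 42.6
```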
16
Median Voter Theorem (a sidetrip)
One of the most fundamental results in social sciences is Duncan Black’s Median Voter Theorem (1948)
Theorem predicts convergence to median position.
Why do parties tend to drift toward the center?
Why do firms locate in close proximity to one another?
The theorem: “given single-peaked preferences, majority voting, an odd number of decision makers, and a unidimensional issue space, the position taken by the median voter has an empty winset.”
That is, under these general conditions, all we need to know is the preference of the median chooser to determine what the outcome will be. No position can beat the median.
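A small simulation can make the “empty winset” claim concrete. The sketch below assumes five hypothetical voters on a 0 to 100 issue scale and assumes each voter prefers whichever position is closer to his or her ideal point (a simple special case of single-peaked preferences). It then checks whether any alternative can win a majority against the median voter’s position.

```r
# Hypothetical ideal points for five voters (an odd number) on a 0-100 scale
ideal <- c(10, 25, 40, 70, 90)
m <- median(ideal)                          # the median voter's position
majority <- floor(length(ideal) / 2) + 1    # votes needed to win

# Votes for alternative a against m: voters strictly closer to a than to m
# (distance-based preferences are an assumption of this sketch)
votes_for <- function(a) sum(abs(ideal - a) < abs(ideal - m))

# Pit every other whole-number position against the median
alternatives <- setdiff(0:100, m)
best_challenge <- max(sapply(alternatives, votes_for))

print(c(median = m, majority = majority, best_challenge = best_challenge))
```

With these values the strongest challenger collects 2 votes, short of the 3 needed, so no position beats the median.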
17
Dispersion around the Median
The mean has its standard deviation…
What about the median?
No such thing as “standard deviation” per se, around the median.
But, there is the IQR: Interquartile Range
The median is the 50th percentile.
Suppose we compute the 25th and the 75th percentiles and then take the difference.
The 25th percentile is the “median” of the lower half of the data; the 75th percentile is the “median” of the upper half.
18
IQR and the 5 Number Summary
Data: -67, 5, 23, 32, 54, 99
25th Percentile = 5
75th Percentile = 54
IQR is the difference between the 75th and 25th percentiles: 54 - 5 = 49
Hence, M=27.5; IQR=49
“Five Number Summary” (Min, 25th, 50th, 75th Percentiles, Max):
-67, 5, 27.5, 54, 99
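The five-number summary for this example can be reproduced with R’s fivenum(). One caveat: fivenum() reports Tukey’s hinges rather than percentiles, so while it matches the slide’s values for this data set, it may differ slightly from a percentile rule on other data.

```r
x <- c(-67, 5, 23, 32, 54, 99)

five <- fivenum(x)          # min, lower hinge, median, upper hinge, max
iqr <- five[4] - five[2]    # spread of the middle half of the data

print(five)   # -67 5 27.5 54 99
print(iqr)    # 49
```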
19
Finding Percentiles
General Formula: $L = \frac{p \times n}{100}$
p is the desired percentile
n is the sample size
If L is a whole number:
  The value of the pth percentile is between the Lth value and the next value. Find the mean of those values.
If L is not a whole number:
  Round L up. The value of the pth percentile is the Lth value.
20
Example
-67, 5, 23, 32, 54, 99
25th Percentile: L = (25*6)/100 = 1.5
  Round to 2. The 25th Percentile is 5.
75th Percentile: L = (75*6)/100 = 4.5
  Round to 5. The 75th Percentile is 54.
50th Percentile: L = (50*6)/100 = 3
  Take average of locations 3 and 4
  This is (23+32)/2 = 27.5.
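The percentile rule can be written as a short R function. This is a minimal sketch: pctl() is a hypothetical helper defined here, not a base R function, and it only handles percentiles strictly between 0 and 100. For comparison, the sketch also calls quantile() with type = 2, which to my understanding uses the same round-up-or-average rule; R’s default (type = 7) interpolates and can give different answers.

```r
# pctl() is a hypothetical helper written for this sketch
pctl <- function(x, p) {
  stopifnot(p > 0, p < 100)
  x <- sort(x)                 # rank the data first
  n <- length(x)
  L <- p * n / 100             # location of the pth percentile
  if (L == floor(L)) {
    mean(x[c(L, L + 1)])       # whole number: average the Lth and next value
  } else {
    x[ceiling(L)]              # otherwise round L up and take that value
  }
}

x <- c(-67, 5, 23, 32, 54, 99)
res <- sapply(c(25, 50, 75), function(p) pctl(x, p))
print(res)   # 5 27.5 54

# Built-in comparison (type = 2 chosen to match the rule above)
builtin <- unname(quantile(x, c(0.25, 0.50, 0.75), type = 2))
print(builtin)
```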
21
Our Data
Median = 120 Votes (location L = [50*67]/100 = 33.5, rounded up to the 34th ranked value)
25th Percentile = 46 Votes
75th Percentile = 289 Votes
IQR: 243 Votes
5 number summary:
Min=9, 25th P=46, Median=120, 75th P=289, Max=3407
(massive dispersion!)
Mean was 260.67. Median=120.
The Mean is much closer to the 75th percentile.
That’s SKEW in action.
22
Revisiting our Data: Odd Ball Cases
23
“Influential Observations”
Two data points: Y=(1013, 3407)
Suppose we omit them (not recommended in applied research)
Mean plummets to 200.69 (drop of 60 votes)
s.d. is cut by more than half: 203.92
Med=114 (note, it hardly changed)
Let’s look at a scatterplot
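The contrast between the mean and the median can be illustrated with a toy comparison in R. This sketch does not use the 67-observation votes data; it takes the five example values from the median slides and adds a single extreme value (3407, borrowed from the slides only as a convenient outlier).

```r
base_data <- c(5, 23, 32, 54, 99)     # example values from the median slides
with_outlier <- c(base_data, 3407)    # add one extreme value (illustrative)

# Collect mean, median, and standard deviation for each version of the data
summary_tab <- rbind(
  without_outlier = c(mean = mean(base_data), median = median(base_data),
                      sd = sd(base_data)),
  with_outlier    = c(mean = mean(with_outlier), median = median(with_outlier),
                      sd = sd(with_outlier))
)
print(round(summary_tab, 2))
```

The mean and standard deviation jump sharply when the outlier is added, while the median moves only from 32 to 43.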
24
Useful to Visualize Data
25
Main Features?
Y and X are positively related.
There are clearly visible “outliers.”
With respect to Y, which “outlier” worries you most?
Influence!
26
Simple Description
You can learn a lot from just these simple indicators.
Suppose that our Y was a real variable?
27
Palm Beach County, FL 2000 Election
28
Descriptive Statistics Help to Clarify Some Issues.
Palm Beach County
  Largely a Jewish community
  Heavily Democratic
  Yet an overwhelming number of Buchanan votes
The ballot created massive confusion.
Margin of Victory in Florida: 537 votes.
Number of Buchanan Votes in PBC: 3407
31
Univariate Statistics
We can clearly learn a lot from very simple statistics
Some quick illustrations in R using data from last year’s election (on Prop. 8)
32
Univariate Quantities in R
Our Data
  Yes on Proposition 8 by County
Graphical Displays of data
  Histogram
  Dot Chart
  Box Plots
  Stem and Leaf
  Strip Plot
33
First, the basic statistics in R
Mean (by county):
> mean(proportionforprop8)
[1] 56.7202
Standard deviation:
> sd(proportionforprop8)
[1] 13.39508
Five-number summary:
> fivenum(proportionforprop8)
[1] 23.50787 46.93203 59.25364 68.03883 75.37070
34
Histogram
35
Dot Chart
36
Box Plot
37
Stem and Leaf
The decimal point is 1 digit(s) to the right of the |
  2 | 459
  3 | 4888
  4 | 024445578
  5 | 0113344566779
  6 | 0000023344457889
  7 | 0011123334455
38
Strip Plot
39
R Code for Previous
# Assumes the vectors county and proportionforprop8 are already loaded
row.names <- cbind(county)
# Histogram
hist(proportionforprop8, xlab="Percentage Yes on Prop. 8", ylab="Frequency",
     main="Histogram of Yes on 8 by County", col="yellow")
# Dot chart, with reference lines at 50 percent and at row 16
dotchart(proportionforprop8, labels=row.names, cex=.7, xlim=c(0, 100),
         main="Yes on 8 by County", xlab="Percent Yes")
abline(v=50)
abline(h=16)
# Box plot, with a reference line at 50 percent
boxplot(proportionforprop8, col="light blue", names=c("Proposition 8"),
        xlab="California Counties", ylab="Percent Yes on 8",
        main="Box Plots for Prop. 8 by County", sub="Source: Los Angeles Times")
abline(h=50)
# Stem-and-leaf display
stem(proportionforprop8)
# Strip chart
stripchart(proportionforprop8, method="stack", xlab="Percentage Yes on Prop. 8",
           main="Vote on Prop. 8 by County: Strip Chart", pch=1)
40
Combined
41
Plots of Two Variables
plot(proportionforprop8, proportionforprop4, xlab="Prop. 8 Vote", ylab="Prop. 4 Vote",
     main="Prop. 8 Vote by Prop. 4 Vote", col="red")
abline(h=50)
abline(v=50)