1
Describing Data: Two Variables
STAT 250, Dr. Kari Lock Morgan
Describing Data: Two Variables
SECTIONS 2.4, 2.5
One quantitative variable (2.4)
One quantitative and one categorical (2.4)
Two quantitative (2.5)
2
z-score
Which is better, an ACT score of 28 or a combined SAT score of 2100?
ACT: μ = 21, σ = 5
SAT: μ = 1500, σ = 325
Assume ACT and SAT scores have approximately bell-shaped distributions.
ACT score of 28
SAT score of 2100
I don’t know
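The comparison above can be sketched in a few lines of Python, assuming the usual definition z = (x − μ) / σ. The function name `z_score` is only illustrative; the means and standard deviations are the ones given on the slide.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (or below) the mean."""
    return (x - mean) / sd

# Values taken from the slide.
z_act = z_score(28, mean=21, sd=5)        # ACT: mu = 21, sigma = 5
z_sat = z_score(2100, mean=1500, sd=325)  # SAT: mu = 1500, sigma = 325

print(f"ACT z-score: {z_act:.2f}")
print(f"SAT z-score: {z_sat:.2f}")
```

The score with the larger z-score is further above its own mean, measured in standard deviations.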
3
Honeybee Waggle Dances
4
Honeybee Waggle Dance
Honeybee scouts investigate new home or food source options; the scouts communicate the information to the hive with a “waggle dance”.
Scientists took bees to an island with only two possible options for nesting: one of very high quality and one of low quality.
They recorded:
Quality of nesting site
Distance to nesting site
Number of waggle dance circuits performed
Duration of waggle dance
Seeley, T., Honeybee Democracy, Princeton University Press, Princeton, NJ, 2010, p. 128
5
Questions of the Day
How many circuits of the waggle dance do honey bees do?
How is this related to quality of a nesting site?
How is duration of the dance related to distance to a nesting site?
6
Other Measures of Location
Maximum = largest data value
Minimum = smallest data value
Quartiles:
Q1 = median of the values below the median m
Q3 = median of the values above the median m
7
Five Number Summary: Min, Q1, m, Q3, Max
About 25% of the data fall between each pair of consecutive values.
Minitab: Stat -> Basic Statistics -> Display Descriptive Statistics
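As a rough sketch of how the five number summary could be computed by hand, the Python below uses the quartile definition from the slides (median of the values below and above the median). The `circuits` list is made-up illustrative data, not the honeybee data, and `five_number_summary` is a hypothetical helper name. Software such as Minitab may use a slightly different quartile rule, so results can differ a little.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), with Q1 and Q3 taken as the
    medians of the lower and upper halves of the sorted data."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # values below the median position
    upper = data[half + (n % 2):]  # values above the median position
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Hypothetical numbers of waggle dance circuits (illustration only).
circuits = [0, 7, 15, 25, 30, 42, 57, 80, 130]
print(five_number_summary(circuits))
```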
8
Five Number Summary The distribution of number of circuits is
Symmetric Right-skewed Left-skewed Impossible to tell
9
The Pth percentile is the value which is greater than P% of the data
We already used z-scores to determine whether an SAT score of 2100 or an ACT score of 28 is better We could also have used percentiles: ACT score of 28: 91st percentile SAT score of 2100: 97th percentile
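The percentiles quoted above can be approximated with a normal model. This is a sketch that assumes the scores are exactly normal with the means and standard deviations given earlier (the slides only say "approximately bell-shaped"), so the computed proportions are approximations rather than the source of the slide's figures.

```python
from statistics import NormalDist

# Normal models for each test, using mu and sigma from the z-score slide.
act = NormalDist(mu=21, sigma=5)
sat = NormalDist(mu=1500, sigma=325)

# cdf(x) is the proportion of the distribution that falls below x.
p_act = act.cdf(28)
p_sat = sat.cdf(2100)

print(f"ACT 28:   about {p_act:.1%} of scores are lower")
print(f"SAT 2100: about {p_sat:.1%} of scores are lower")
```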
10
Five Number Summary as percentiles:
Min = 0th percentile
Q1 = 25th percentile
m = 50th percentile
Q3 = 75th percentile
Max = 100th percentile
11
Measures of Spread Range = Max – Min
Interquartile Range (IQR) = Q3 – Q1
Is the range resistant to outliers? Yes No
Is the IQR resistant to outliers? Yes No
12
Comparing Statistics
Measures of Center: Mean (not resistant), Median (resistant)
Measures of Spread: Standard deviation (not resistant), IQR (resistant), Range (not resistant)
Most often, we use the mean and the standard deviation, because they are calculated from all the data values and so use all the available information.
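One way to see what "resistant" means is to change a single value into an outlier and recompute each statistic. The sketch below uses made-up data and a hypothetical `summarize` helper; the IQR uses the quartile definition from the slides (medians of the lower and upper halves).

```python
from statistics import mean, median, stdev

def iqr(values):
    """Q3 - Q1, with quartiles taken as medians of the two halves."""
    data = sorted(values)
    n = len(data)
    q1 = median(data[: n // 2])
    q3 = median(data[(n + 1) // 2 :])
    return q3 - q1

def summarize(values):
    return {
        "mean": mean(values),
        "median": median(values),
        "sd": stdev(values),
        "iqr": iqr(values),
        "range": max(values) - min(values),
    }

# Hypothetical data: identical except that the largest value becomes an outlier.
clean = [10, 12, 13, 15, 16, 18, 20]
with_outlier = [10, 12, 13, 15, 16, 18, 200]

before, after = summarize(clean), summarize(with_outlier)
for name in before:
    print(f"{name:>6}: {before[name]:8.2f} -> {after[name]:8.2f}")
```

In this example the median and IQR do not move, while the mean, standard deviation, and range all change substantially.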
13
Boxplot
The box spans Q1 to Q3 (the middle 50% of the data), with a line at the median.
Lines (“whiskers”) extend from each quartile to the most extreme value that is not an outlier.
Outliers are plotted as individual points.
*For boxplots, outliers are defined as any point more than 1.5 IQRs beyond the quartiles (although you don’t have to know that)
Minitab: Graph -> Boxplot -> One Y -> Simple
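The 1.5 IQR rule for boxplot outliers can be sketched as follows. The data are made up, `boxplot_outliers` is a hypothetical helper name, and the quartiles follow the slides' definition, so a package such as Minitab could place the fences slightly differently.

```python
from statistics import median

def boxplot_outliers(values):
    """Flag points more than 1.5 IQRs beyond the quartiles and report
    where the whiskers would end."""
    data = sorted(values)
    n = len(data)
    q1 = median(data[: n // 2])
    q3 = median(data[(n + 1) // 2 :])
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [x for x in data if x < low_fence or x > high_fence]
    inside = [x for x in data if low_fence <= x <= high_fence]
    # Whiskers reach the most extreme values that are not outliers.
    return {"outliers": outliers, "whiskers": (inside[0], inside[-1])}

# Hypothetical data with one unusually large value.
result = boxplot_outliers([10, 12, 13, 15, 16, 18, 200])
print(result)
```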
14
Boxplot This boxplot shows a distribution that is Symmetric
Left-skewed Right-skewed
15
One Quantitative and One Categorical
How is number of waggle circuits related to the quality of the nesting site?
Two variables:
One quantitative (number of circuits)
One categorical (quality: low or high)
Anything we can do for one quantitative variable can be done separately for each categorical group.
16
Side-by-Side Boxplots
Minitab: Graph -> Boxplot -> One Y -> With Groups
17
Stacked Dotplots Minitab: Graph -> Dotplot -> One Y -> With Groups
18
Overlaid Histograms Minitab: Graph -> Histogram -> With Groups
19
Quantitative Statistics by a Categorical Variable
Any of the statistics we use for a quantitative variable can be looked at separately for each level of a categorical variable Minitab: Stat -> Basic Statistics -> Display Descriptive Statistics -> By variables
20
Difference in Means Often, when comparing a quantitative variable across two categories, we compute the difference in means Honeybees perform 60.5 circuits more, on average, for the high quality site as opposed to the low quality site.
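A difference in means is just the mean of one group minus the mean of the other. The sketch below uses a handful of made-up (quality, circuits) records, not the study's data, so its result is not the 60.5 reported on the slide; `mean_by_group` is a hypothetical helper name.

```python
from statistics import mean

def mean_by_group(records):
    """Group (category, value) pairs by category and average each group."""
    groups = {}
    for category, value in records:
        groups.setdefault(category, []).append(value)
    return {category: mean(values) for category, values in groups.items()}

# Hypothetical records: (site quality, number of circuits).
records = [
    ("high", 120), ("high", 90), ("high", 60),
    ("low", 30), ("low", 20), ("low", 40),
]

means = mean_by_group(records)
diff = means["high"] - means["low"]
print(means)
print(f"Difference in means (high - low): {diff}")
```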
21
Association? Does there appear to be an association between number of waggle circuits and quality of potential nesting site? Yes No
22
Summary: One Quantitative and One Categorical
Summary Statistics Any summary statistics for quantitative variables, broken down by groups Difference in means Visualization Side-by-side graphs
23
Two Quantitative Variables
How is duration of the dance related to distance to a nesting site? Two quantitative variables Summary Statistics: correlation Visualization: scatterplot
24
Scatterplot A scatterplot is the graph of the relationship between two quantitative variables. Minitab: Graph -> Scatterplot -> Simple
25
Direction of Association
A positive association means that values of one variable tend to be higher when values of the other variable are higher A negative association means that values of one variable tend to be lower when values of the other variable are higher Two variables are not associated if knowing the value of one variable does not give you any information about the value of the other variable
26
Correlation The correlation is a measure of the strength and direction of linear association between two quantitative variables Sample correlation: r Population correlation: ρ (“rho”) r = for duration of dance and distance to site Minitab: Stat -> Basic Statistics -> Correlation
27
Correlation
-1 ≤ r ≤ 1
The sign indicates the direction of association:
positive association: r > 0
negative association: r < 0
no linear association: r ≈ 0
The closer r is to ±1, the stronger the linear association.
r has no units and does not depend on the units of measurement.
The correlation between X and Y is the same as the correlation between Y and X.
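The properties listed above can be checked numerically. The slides do not give a formula for r, so this sketch assumes the standard Pearson formula; the distance and duration values are made up for illustration and are not the honeybee measurements.

```python
from math import sqrt
from statistics import mean

def correlation(xs, ys):
    """Pearson sample correlation r between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: distance to site (meters) and dance duration (seconds).
distance_m = [200, 400, 600, 800, 1000]
duration_s = [0.4, 0.9, 1.1, 1.8, 2.0]

r = correlation(distance_m, duration_s)
r_km = correlation([d / 1000 for d in distance_m], duration_s)  # change of units
r_swapped = correlation(duration_s, distance_m)                 # swap X and Y

print(f"r = {r:.3f}, in km: {r_km:.3f}, swapped: {r_swapped:.3f}")
```

All three values agree, illustrating that r does not depend on the units of measurement or on which variable is called X.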
28
Correlation Guessing Game
Enter PennState for the group ID. Highest scorer in the class by the first exam gets one extra credit point on Exam 1!
29
Correlation NFL Teams r = 0.43
30
Correlation Cautions Correlation can be heavily affected by outliers. Always plot your data!
31
Testosterone Levels and Time
What is the correlation between testosterone levels and hour of the day? Positive Negative About 0 Are testosterone level and hour of the day associated? Yes No
32
Correlation Cautions Correlation can be heavily affected by outliers. Always plot your data! r = 0 means no linear association. The variables could still be otherwise associated. Always plot your data!
33
TVs and Life Expectancy
34
Correlation Cautions Correlation can be heavily affected by outliers. Always plot your data! r = 0 means no linear association. The variables could still be otherwise associated. Always plot your data! Correlation does not imply causation!
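The first two cautions can be illustrated with small made-up datasets. This sketch assumes the standard Pearson formula for r: one dataset has a perfect curved relationship with r = 0, and another shows how a single outlier can turn a weak correlation into a strong one.

```python
from math import sqrt
from statistics import mean

def correlation(xs, ys):
    """Pearson sample correlation r between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Caution: r = 0 does not mean "no association".
# Here y is completely determined by x, but the relationship is not linear.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]
r_curve = correlation(xs, ys)

# Caution: one outlier can dominate r (hypothetical data).
x_base, y_base = [1, 2, 3, 4, 5], [3, 1, 4, 2, 3]
r_without = correlation(x_base, y_base)
r_with = correlation(x_base + [20], y_base + [20])

print(f"curved relationship: r = {r_curve:.3f}")
print(f"without outlier: r = {r_without:.3f}, with outlier: r = {r_with:.3f}")
```

A scatterplot of either dataset would reveal the curve or the outlier immediately, which is why the slides say to always plot the data.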
35
Summary: Two Quantitative Variables
Summary Statistics: correlation Visualization: scatterplot
36
To Do Read Sections 2.4 and 2.5 Do HW 2.2, 2.3, 2.4, 2.5 (due Friday, 9/18)