1
Chi-square Test for Goodness of Fit (GOF)
2
What would it have been like to be a crew member on the Titanic?
The crew of the Titanic made up 40% of the people on board, but 45% of the deaths were crew members (685 of the 1517 deaths). Did the crew pay a heavier cost?
4
Class          On board   Percent   Observed lost   Expected   O-E      (O-E)^2/E
First class    329        15%       130             224.51     -94.51   39.79
Second class   285        13%       166             194.49     -28.49   4.17
Third class    710        32%       536             484.51     51.49    5.47
Crew           899        40%       685             613.49     71.51    8.34
Total          2223       100%      1517                       0.00     57.77
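As a cross-check, here is a minimal Python/SciPy sketch that reproduces the table's chi-square computation from the counts above (only the counts come from the slide; the rest is standard library usage):

```python
import numpy as np
from scipy.stats import chisquare

on_board = np.array([329, 285, 710, 899])   # First, Second, Third, Crew
lost     = np.array([130, 166, 536, 685])   # observed deaths per group

# Expected deaths if deaths were spread in proportion to who was on board
expected = on_board / on_board.sum() * lost.sum()   # [224.51, 194.49, 484.51, 613.49]

stat, p = chisquare(f_obs=lost, f_exp=expected)
print(round(stat, 2))   # 57.77, matching the table
print(p)                # far below 0.05: the "proportional deaths" model is rejected
```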
6
What's new for Chi-square GOF
- No parameter of interest; instead, a model describing the distribution among several categories.
- H0: the model is good. Ha: the model is not good.
- No normality check; instead, no expected count may be lower than 1, and no more than 20% of the expected counts may be lower than 5 (see the sketch below).
- df = number of categories - 1
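A minimal Python sketch of the expected-count checks just listed; the helper name is mine, not the book's:

```python
# Check the chi-square GOF conditions: no expected count below 1,
# and at most 20% of the expected counts below 5.
# (gof_conditions_ok is a hypothetical helper name, not from the book.)
def gof_conditions_ok(expected):
    expected = list(expected)
    if min(expected) < 1:
        return False
    return sum(e < 5 for e in expected) <= 0.2 * len(expected)

print(gof_conditions_ok([224.51, 194.49, 484.51, 613.49]))  # True (Titanic table)
```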
7
The statistic for goodness of fit
The Chi-square test statistic measures how closely the observed data matches what is expected given a particular model
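In symbols, the goodness-of-fit statistic is the standard sum over categories:

$$\chi^2 = \sum_{\text{all categories}} \frac{(O - E)^2}{E}$$

where O is the observed count and E the expected count in each category.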
8
Step 3: Test Statistic
9
Color     Observed count   Expected %   Expected count   (O-E)^2/E
Blue                       24%
Brown                      13%
Green                      16%
Orange                     20%
Red
Yellow                     14%
11
Chapter 7: Scatterplots, Association, and Correlation
12
Looking at Scatterplots
Scatterplots may be the most common and most effective display for data. In a scatterplot, you can see patterns, trends, relationships, and even the occasional extraordinary value sitting apart from the others. Scatterplots are the best way to start observing the relationship and the ideal way to picture associations between two quantitative variables.
13
Looking at Scatterplots (cont.)
When looking at scatterplots, we will look for direction, form, strength, and unusual features. Direction: A pattern that runs from the upper left to the lower right is said to have a negative direction. A trend running the other way has a positive direction.
14
Looking at Scatterplots (cont.)
The figure shows a negative direction between the years since 1970 and the prediction errors made by NOAA. As the years have passed, the predictions have improved (errors have decreased).
15
Looking at Scatterplots (cont.)
The example in the text shows a negative association between central pressure and maximum wind speed. As the central pressure increases, the maximum wind speed decreases.
16
Looking at Scatterplots (cont.)
Form: If there is a straight line (linear) relationship, it will appear as a cloud or swarm of points stretched out in a generally consistent, straight form.
17
Looking at Scatterplots (cont.)
Form: If the relationship curves sharply, the methods of this book cannot really help us.
18
Looking at Scatterplots (cont.)
Strength: At one extreme, the points appear to follow a single stream (whether straight, curved, or bending all over the place).
19
Looking at Scatterplots (cont.)
Strength: At the other extreme, the points appear as a vague cloud with no discernible trend or pattern. Note: we will quantify the amount of scatter soon.
20
Looking at Scatterplots (cont.)
Unusual features: Look for the unexpected. Often the most interesting thing to see in a scatterplot is the thing you never thought to look for. One example of such a surprise is an outlier standing away from the overall pattern of the scatterplot. Clusters or subgroups should also raise questions.
21
Roles for Variables It is important to determine which of the two quantitative variables goes on the x-axis and which on the y-axis. This determination is made based on the roles played by the variables. When the roles are clear, the explanatory or predictor variable goes on the x-axis, and the response variable goes on the y-axis.
22
Roles for Variables (cont.)
The roles that we choose for variables are more about how we think about them rather than about the variables themselves. Just placing a variable on the x-axis doesn’t necessarily mean that it explains or predicts anything. And the variable on the y-axis may not respond to it in any way.
23
Correlation Data collected from students in Statistics classes included their heights (in inches) and weights (in pounds): Here we see a positive association and a fairly straight form, although there seems to be a high outlier.
24
Correlation (cont.) How strong is the association between weight and height of Statistics students? If we had to put a number on the strength, we would not want it to depend on the units we used. A scatterplot of heights (in centimeters) and weights (in kilograms) doesn’t change the shape of the pattern:
25
Correlation (cont.) The correlation coefficient (r) gives us a numerical measurement of the strength of the linear relationship between the explanatory and response variables.
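In the notation this course uses, r is built from the z-scores of the two variables:

$$r = \frac{\sum z_x z_y}{n - 1}$$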
26
Correlation (cont.) For the students’ heights and weights, the correlation is r = …. What does this mean in terms of strength? We’ll address this shortly.
27
Correlation Conditions
Correlation measures the strength of the linear association between two quantitative variables. Before you use correlation, you must check several conditions: Quantitative Variables Condition Straight Enough Condition Outlier Condition
28
Correlation Conditions (cont.)
Quantitative Variables Condition: Correlation applies only to quantitative variables. Don’t apply correlation to categorical data masquerading as quantitative. Check that you know the variables’ units and what they measure.
29
Correlation Conditions (cont.)
Straight Enough Condition: You can calculate a correlation coefficient for any pair of variables. But correlation measures the strength only of the linear association, and will be misleading if the relationship is not linear.
30
Correlation Conditions (cont.)
Outlier Condition: Outliers can distort the correlation dramatically. An outlier can make an otherwise small correlation look big, or hide a large correlation. It can even give an otherwise positive association a negative correlation coefficient (and vice versa). When you see an outlier, it’s often a good idea to report the correlations with and without that point.
31
Correlation Properties
The sign of a correlation coefficient gives the direction of the association. Correlation is always between -1 and +1. Correlation can be exactly equal to -1 or +1, but these values are unusual in real data because they mean that all the data points fall exactly on a single straight line. A correlation near zero corresponds to a weak linear association.
32
Correlation Properties (cont.)
Correlation treats x and y symmetrically: The correlation of x with y is the same as the correlation of y with x. Correlation has no units. Correlation is not affected by changes in the center or scale of either variable. Correlation depends only on the z-scores, and they are unaffected by changes in center or scale.
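A quick Python/NumPy sketch of this invariance property (the data here are simulated, not the students' heights and weights):

```python
# Shifting or rescaling either variable leaves r unchanged,
# because r depends only on z-scores.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)

r_original = np.corrcoef(x, y)[0, 1]
# Convert x to different "units": multiply by 2.54 (inches -> cm) and shift
r_rescaled = np.corrcoef(2.54 * x + 100, y)[0, 1]

print(np.isclose(r_original, r_rescaled))  # True
```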
33
Correlation Properties (cont.)
Correlation measures the strength of the linear association between the two variables. Variables can have a strong association but still have a small correlation if the association isn’t linear. Correlation is sensitive to outliers. A single outlying value can make a small correlation large or make a large one small.
34
Correlation ≠ Causation
Whenever we have a strong correlation, it is tempting to explain it by imagining that the predictor variable has caused the response variable to change. Scatterplots and correlation coefficients never prove causation. A hidden variable that stands behind a relationship and determines it by simultaneously affecting the other two variables is called a lurking variable.
35
What Can Go Wrong? Don’t confuse “correlation” with “causation.”
Scatterplots and correlations never demonstrate causation. These statistical tools can only demonstrate an association between variables.
36
What Can Go Wrong? (cont.)
Don’t correlate categorical variables. Be sure to check the Quantitative Variables Condition. Be sure the association is linear. There may be a strong association between two variables even when the relationship is nonlinear.
37
What Can Go Wrong? (cont.)
Don’t assume the relationship is linear just because the correlation coefficient is high. Here the correlation is 0.979, but the relationship is actually bent.
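A tiny Python/NumPy illustration of this warning: an exactly quadratic relationship still yields a correlation near 0.98.

```python
import numpy as np

x = np.linspace(1, 10, 50)
y = x ** 2                      # exactly quadratic, not linear

r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))              # roughly 0.98 -- high, yet the plot is bent
```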
38
What Can Go Wrong? (cont.)
Beware of outliers. Even a single outlier can dominate the correlation value. Make sure to check the Outlier Condition.
39
What have we learned? We examine scatterplots for direction, form, strength, and unusual features. Although not every relationship is linear, when the scatterplot is straight enough, the correlation coefficient is a useful numerical summary. The sign of the correlation tells us the direction of the association. The magnitude of the correlation tells us the strength of a linear association. Correlation has no units, and shifting or scaling the data, standardizing, or swapping the variables has no effect on the numerical value.
40
What have we learned? (cont.)
Doing Statistics right means that we have to Think about whether our choice of methods is appropriate. Before finding or talking about a correlation, check the Straight Enough Condition. Watch out for outliers! Don’t assume that a high correlation or strong association is evidence of a cause-and-effect relationship—beware of lurking variables!
41
Chapter 8: Linear Regression
42
Fat Versus Protein: An Example
The following is a scatterplot of total fat versus protein for 30 items on the Burger King menu:
43
Residuals The model won’t be perfect, regardless of the line we draw.
Some points will be above the line and some will be below. The estimate made from a model is the predicted value (denoted ŷ, read “y-hat”).
44
Residuals (cont.) The difference between the observed value and its associated predicted value is called the residual. To find the residuals, we always subtract the predicted value from the observed one: residual = y − ŷ.
45
Residuals (cont.) A negative residual means the predicted value’s too big (an overestimate). A positive residual means the predicted value’s too small (an underestimate).
46
“Best Fit” Means Least Squares
Some residuals are positive, others are negative, and, on average, they cancel each other out. So, we can’t assess how well the line fits by adding up all the residuals. Similar to what we did with deviations, we square the residuals and add the squares. The smaller the sum, the better the fit. The line of best fit is the line for which the sum of the squared residuals is smallest.
47
The Linear Model Remember from Algebra that a straight line can be written as y = mx + b. In Statistics we use a slightly different notation: ŷ = b0 + b1x. We write ŷ to emphasize that the points that satisfy this equation are just our predicted values, not the actual data values.
48
The Linear Model (cont.)
We write b1 and b0 for the slope and intercept of the line. The b’s are called the coefficients of the linear model. The coefficient b1 is the slope, which tells us how rapidly ŷ changes with respect to x. The coefficient b0 is the intercept, which tells us where the line hits (intercepts) the y-axis.
49
The Least Squares Line In our model, we have a slope (b1):
The slope is built from the correlation and the standard deviations: b1 = r (sy / sx). Our slope is always in units of y per unit of x.
50
The Least Squares Line (cont.)
In our model, we also have an intercept (b0). The intercept is built from the means and the slope: b0 = ȳ − b1 x̄. Our intercept is always in units of y.
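Putting the two formulas together, here is a minimal Python/NumPy sketch (with made-up data, not the Burger King menu) that builds the least squares line from r and the standard deviations, then cross-checks it against NumPy's own fit:

```python
import numpy as np

x = np.array([10.0, 12.0, 19.0, 25.0, 30.0])   # e.g., protein (g); illustrative only
y = np.array([ 9.0, 19.0, 24.0, 28.0, 36.0])   # e.g., fat (g); illustrative only

r  = np.corrcoef(x, y)[0, 1]
b1 = r * (y.std(ddof=1) / x.std(ddof=1))       # slope: b1 = r (sy / sx)
b0 = y.mean() - b1 * x.mean()                  # intercept: b0 = ybar - b1 xbar

# Cross-check against NumPy's least squares fit
slope, intercept = np.polyfit(x, y, 1)
print(np.isclose(b1, slope), np.isclose(b0, intercept))  # True True
```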
51
Fat Versus Protein: An Example
The regression line for the Burger King data fits the data well. The equation is ŷ = 6.8 + 0.97 × protein. The predicted fat content for a BK Broiler chicken sandwich (30 grams of protein) is 6.8 + 0.97(30) = 35.9 grams of fat.
52
The Least Squares Line (cont.)
Since regression and correlation are closely related, we need to check the same conditions for regressions as we did for correlations: Quantitative Variables Condition Straight Enough Condition Outlier Condition
53
Correlation and the Line
Moving one standard deviation away from the mean in x moves us r standard deviations away from the mean in y. This relationship is shown in a scatterplot of z-scores for fat and protein:
54
Correlation and the Line (cont.)
Put generally, moving any number of standard deviations away from the mean in x moves us r times that number of standard deviations away from the mean in y.
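In symbols, with both variables standardized, the regression relationship reads:

$$\hat{z}_y = r \, z_x$$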
55
How Big Can Predicted Values Get?
r cannot be bigger than 1 (in absolute value), so each predicted y tends to be closer to its mean (in standard deviations) than its corresponding x was. This property of the linear model is called regression to the mean; the line is called the regression line.
56
Residuals Revisited The linear model assumes that the relationship between the two variables is a perfect straight line. The residuals are the part of the data that hasn’t been modeled. Data = Model + Residual, or (equivalently) Residual = Data − Model. Or, in symbols, e = y − ŷ.
57
Residuals Revisited (cont.)
Residuals help us to see whether the model makes sense. When a regression model is appropriate, nothing interesting should be left behind. After we fit a regression model, we usually plot the residuals in the hope of finding…nothing.
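A short Python sketch of this residual check, reusing the made-up data from the earlier least squares sketch: fit the line, compute residuals = observed − predicted, and plot them against the predicted values, hoping to see no pattern.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([10.0, 12.0, 19.0, 25.0, 30.0])
y = np.array([ 9.0, 19.0, 24.0, 28.0, 36.0])

b1, b0 = np.polyfit(x, y, 1)
predicted = b0 + b1 * x
residuals = y - predicted

plt.scatter(predicted, residuals)
plt.axhline(0, linestyle="--")     # residuals should scatter evenly about zero
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.show()
```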
58
Residuals Revisited (cont.)
The residuals for the BK menu regression look appropriately boring:
59
Regression Assumptions and Conditions
Quantitative Variables Condition: Regression can only be done on two quantitative variables, so make sure to check this condition. Straight Enough Condition: The linear model assumes that the relationship between the variables is linear. A scatterplot will let you check that the assumption is reasonable.
60
Regression Assumptions and Conditions (cont.)
It’s a good idea to check linearity again after computing the regression when we can examine the residuals. You should also check for outliers, which could change the regression. If the data seem to clump or cluster in the scatterplot, that could be a sign of trouble worth looking into further.
61
Regression Assumptions and Conditions (cont.)
If the scatterplot is not straight enough, stop here. You can’t use a linear model for any two variables, even if they are related. They must have a linear association or the model won’t mean a thing. Some nonlinear relationships can be saved by re-expressing the data to make the scatterplot more linear.
62
Regression Assumptions and Conditions (cont.)
Outlier Condition: Watch out for outliers. Outlying points can dramatically change a regression model. Outliers can even change the sign of the slope, misleading us about the underlying relationship between the variables.
63
Reality Check: Is the Regression Reasonable?
Statistics don’t come out of nowhere. They are based on data. The results of a statistical analysis should reinforce your common sense, not fly in its face. If the results are surprising, then either you’ve learned something new about the world or your analysis is wrong. When you perform a regression, think about the coefficients and ask yourself whether they make sense.
64
What Can Go Wrong?
- Don’t fit a straight line to a nonlinear relationship.
- Beware of extraordinary points (y-values that stand off from the linear pattern or extreme x-values).
- Don’t invert the regression. To swap the predictor-response roles of the variables, we must fit a new regression equation.
- Don’t extrapolate beyond the data; the linear model may no longer hold outside of the range of the data.
- Don’t infer that x causes y just because there is a good linear model for their relationship; association is not causation.
- Don’t choose a model based on R² alone.
65
What have we learned? When the relationship between two quantitative variables is fairly straight, a linear model can help summarize that relationship. The regression line doesn’t pass through all the points, but it is the best compromise in the sense that it has the smallest sum of squared residuals.
66
What have we learned? (cont.)
The correlation tells us several things about the regression: The slope of the line is based on the correlation, adjusted for the units of x and y. For each SD in x that we are away from the x mean, we expect to be r SDs in y away from the y mean. Since r is always between -1 and +1, each predicted y is fewer SDs away from its mean than the corresponding x was (regression to the mean).
67
What have we learned? (cont.)
The residuals also reveal how well the model works. If a plot of the residuals against predicted values shows a pattern, we should re-examine the data to see why. The standard deviation of the residuals quantifies the amount of scatter around the line.
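For reference, the usual formula for the standard deviation of the residuals is:

$$s_e = \sqrt{\frac{\sum e^2}{n - 2}}$$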
68
What have we learned? (cont.)
The linear model makes no sense unless the Linear Relationship Assumption is satisfied. Also, we need to check the Straight Enough Condition and Outlier Condition with a scatterplot. For the standard deviation of the residuals, we must make the Equal Variance Assumption. We check it by looking at both the original scatterplot and the residual plot for the “Does the Plot Thicken?” Condition.