
1 Welcome to the Unit 5 Seminar Kristin Webster
MM207 Statistics

2 Correlation and Scatter Diagrams
A correlation exists between two variables when higher values of one variable consistently go with higher values of the other variable, or when higher values of one variable consistently go with lower values of the other variable.
A scatter diagram (or scatterplot) is a graph in which each point represents the values of two variables. The x variable is on the horizontal axis, the y variable is on the vertical axis, and each point is plotted at the (x, y) pair for one observation.
Here are a few examples of correlations:
There is a correlation between the variables amount of smoking and likelihood of lung cancer; that is, heavier smokers are more likely to get lung cancer.
There is a correlation between the variables height and weight for people; that is, taller people tend to weigh more than shorter people.
There is a correlation between the variables demand for apples and price of apples; that is, demand tends to decrease as price increases.
There is a correlation between practice time and skill among piano players; that is, those who practice more tend to be more skilled.
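As a concrete illustration (not part of the original slides), here is a minimal Python sketch of a scatter diagram, assuming matplotlib is available; the height and weight values are invented for demonstration.

```python
# Sketch: a scatter diagram for a made-up height/weight sample.
import matplotlib.pyplot as plt

heights_in = [62, 64, 66, 68, 70, 72, 74]            # x variable: height in inches
weights_lb = [120, 135, 140, 155, 165, 180, 195]     # y variable: weight in pounds

plt.scatter(heights_in, weights_lb)                  # one point per (x, y) pair
plt.xlabel("Height (inches)")                        # x variable on the horizontal axis
plt.ylabel("Weight (pounds)")                        # y variable on the vertical axis
plt.title("Scatter diagram: height vs. weight")
plt.show()
```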

3 Types of Correlations
Positive: x and y move in the same direction.
Negative: x and y move in opposite directions.
Zero: no pattern of movement in x and y.
Nonlinear relationship: the two variables are related, but the relationship results in a scatter diagram that does not follow a straight-line pattern. See page 289.
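To make the four types concrete, the following sketch (not from the slides) generates synthetic data for each pattern and prints the Pearson r computed with numpy; all values are invented.

```python
# Sketch: the four relationship types, with Pearson r for each (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
noise = rng.normal(scale=0.5, size=x.size)

datasets = {
    "positive (same direction)":   2 * x + noise,
    "negative (opposite)":        -2 * x + noise,
    "zero (no pattern)":           rng.normal(size=x.size),
    "nonlinear (related, curved)": x ** 2 + noise,
}

for label, y in datasets.items():
    r = np.corrcoef(x, y)[0, 1]
    print(f"{label:29s} r = {r:+.2f}")

# The nonlinear case prints r near 0 even though y clearly depends on x:
# r measures only straight-line association.
```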

4 Strength of the Correlation
Statisticians measure the strength of a correlation with a number called the correlation coefficient, represented by the letter r.

5 Properties of the Correlation Coefficient, r
The correlation coefficient, r, is a measure of the strength of a correlation. Its value can range only from -1 to 1.
If there is no correlation, the points do not follow any ascending or descending straight-line pattern, and the value of r is close to 0.
If there is a positive correlation, the correlation coefficient is positive (0 < r ≤ 1): both variables increase together. A perfect positive correlation (in which all the points on a scatter diagram lie on an ascending straight line) has a correlation coefficient r = 1. Values of r close to 1 mean a strong positive correlation, and positive values closer to 0 mean a weak positive correlation.
If there is a negative correlation, the correlation coefficient is negative (-1 ≤ r < 0): when one variable increases, the other decreases. A perfect negative correlation (in which all the points lie on a descending straight line) has a correlation coefficient r = -1. Values of r close to -1 mean a strong negative correlation, and negative values closer to 0 mean a weak negative correlation.
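A short sketch of these properties, using numpy's corrcoef; the data are made up so that the first two sets lie exactly on ascending and descending lines.

```python
# Sketch: properties of r, checked numerically with numpy (illustrative data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

y_perfect_pos = 2 * x + 1                          # points on an ascending line -> r = 1
y_perfect_neg = -3 * x + 10                        # points on a descending line -> r = -1
y_moderate = np.array([2.0, 1.0, 4.0, 2.5, 3.0])   # looser ascending pattern -> 0 < r < 1

for label, y in [("perfect positive", y_perfect_pos),
                 ("perfect negative", y_perfect_neg),
                 ("moderate positive", y_moderate)]:
    r = np.corrcoef(x, y)[0, 1]                    # Pearson correlation coefficient
    print(f"{label:17s} r = {r:+.3f}")
```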

6 Beware of Outliers
If you calculate the correlation coefficient for the data set shown on this slide, you’ll find that it is a relatively high r = 0.880, suggesting a very strong correlation. However, if you cover the data point in the upper right corner, the apparent correlation disappears. In fact, without this data point, the correlation coefficient is r = 0.
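The slide's scatter diagram is not reproduced in this transcript, but the effect is easy to recreate; in this sketch the numbers are invented to mimic the point, not taken from the slide.

```python
# Sketch: one extreme point can manufacture an apparently strong correlation.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([5.0, 2.0, 6.0, 3.0, 7.0, 4.0, 2.0, 5.0])   # essentially patternless

r_without = np.corrcoef(x, y)[0, 1]

# Add one extreme point in the upper-right corner of the scatter diagram.
x_out = np.append(x, 40.0)
y_out = np.append(y, 40.0)
r_with = np.corrcoef(x_out, y_out)[0, 1]

print(f"r without the outlier: {r_without:+.3f}")   # close to 0
print(f"r with the outlier:    {r_with:+.3f}")      # looks very strong
```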

7 Correlation Does Not Imply Causality
Possible Explanations for a Correlation:
The correlation may be a coincidence.
Both correlated variables might be directly influenced by some common underlying cause.
One of the correlated variables may actually be a cause of the other. But note that, even in this case, it may be just one of several causes.

8 Best-Fit Line
The best-fit line (or regression line) on a scatter diagram is a line that lies closer to the data points than any other possible line (according to a standard statistical measure of closeness).
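A minimal sketch of fitting and using a best-fit line with numpy's least-squares polyfit; the toy data are invented.

```python
# Sketch: fitting and using a best-fit (least-squares) line.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

slope, intercept = np.polyfit(x, y, deg=1)   # degree-1 polynomial = straight line
print(f"best-fit line: y = {slope:.2f} x + {intercept:.2f}")

# Predict y for a new x value that lies within the range of the data.
x_new = 4.5
print(f"predicted y at x = {x_new}: {slope * x_new + intercept:.2f}")
```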

9 Cautions in Making Predictions from Best-Fit Lines
Don't expect a best-fit line to give a good prediction unless the correlation is strong and there are many data points. If the sample points lie very close to the best-fit line, the correlation is very strong and the prediction is more likely to be accurate. If the sample points lie away from the best-fit line by substantial amounts, the correlation is weak and predictions tend to be much less accurate.
Don't use a best-fit line to make predictions beyond the bounds of the data points to which the line was fit (illustrated in the sketch after this list).
A best-fit line based on past data is not necessarily valid now and might not result in valid predictions of the future.
Don't make predictions about a population that is different from the population from which the sample data were drawn.
Remember that a best-fit line is meaningless when there is no significant correlation or when the relationship is nonlinear.
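The sketch below (not from the slides) illustrates the extrapolation caution with synthetic curved data: a straight line fit on x between 0 and 5 predicts reasonably inside that range but badly outside it.

```python
# Sketch: a straight-line fit can predict well inside the data range
# and badly outside it when the true relationship is curved (synthetic data).
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 5, 40)
y = x ** 2 + rng.normal(scale=0.5, size=x.size)   # true relationship is curved

slope, intercept = np.polyfit(x, y, deg=1)        # line fit only to x in [0, 5]

for x_new in [4.0, 10.0]:                         # 4.0 is inside the data, 10.0 is far outside
    line_pred = slope * x_new + intercept
    true_val = x_new ** 2
    print(f"x = {x_new:4.1f}: line predicts {line_pred:6.1f}, true curve gives {true_val:6.1f}")
```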

10 Coefficient of Determination
The square of the correlation coefficient, or r², is the proportion of the variation in a variable that is accounted for by the best-fit line.
Example: Political scientists are interested in knowing what factors affect voter turnout in elections. One such factor is the unemployment rate. Data collected in presidential election years since 1964 show a very weak negative correlation between voter turnout and the unemployment rate, with a correlation coefficient of about r = -0.1. Based on this correlation, should we use the unemployment rate to predict voter turnout in the next presidential election?
Solution: The square of the correlation coefficient is r² = (-0.1)² = 0.01, which means that only about 1% of the variation in the data is accounted for by the best-fit line. Nearly all of the variation in the data must therefore be explained by other factors. We conclude that unemployment is not a reliable predictor of voter turnout.
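A tiny sketch of the arithmetic in this example; only the value r = -0.1 comes from the slide.

```python
# Sketch: the coefficient of determination is just r squared
# (r = -0.1 is the voter-turnout value from the example above).
r = -0.1
r_squared = r ** 2
print(f"r^2 = {r_squared:.2f}")                                   # 0.01
print(f"about {r_squared:.0%} of the variation is accounted for "
      "by the best-fit line")                                     # about 1%
```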

11 Multiple Regression
The use of multiple regression allows the calculation of a best-fit equation that represents the best fit between one variable (such as price) and a combination of two or more other variables (such as weight and color). The coefficient of determination, R², tells us the proportion of the scatter in the data accounted for by the best-fit equation.
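A hedged sketch of multiple regression using numpy's ordinary least squares. Because color is categorical, a second numeric predictor (here called age, an invented stand-in) replaces the slide's weight-and-color pairing, and all data are synthetic.

```python
# Sketch: multiple regression by ordinary least squares, plus R^2.
# Toy data: "price" predicted from two numeric variables; all values invented.
import numpy as np

rng = np.random.default_rng(1)
n = 50
weight = rng.uniform(1, 10, size=n)
age = rng.uniform(0, 5, size=n)
price = 3.0 * weight - 2.0 * age + 5.0 + rng.normal(scale=1.0, size=n)

# Design matrix: a column of ones (intercept) plus the predictor variables.
X = np.column_stack([np.ones(n), weight, age])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
print("intercept, weight coef, age coef:", np.round(coef, 2))

# R^2: proportion of the scatter in price accounted for by the best-fit equation.
predicted = X @ coef
ss_res = np.sum((price - predicted) ** 2)
ss_tot = np.sum((price - price.mean()) ** 2)
print(f"R^2 = {1 - ss_res / ss_tot:.3f}")
```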

12 The Search for Causality
A correlation may suggest causality, but by itself a correlation never establishes causality. Much more evidence is required to establish that one factor causes another. Earlier, we found that a correlation between two variables may be the result of either (1) coincidence, (2) a common underlying cause, or (3) one variable actually having a direct influence on the other. The process of establishing causality is essentially a process of ruling out the first two explanations.

13 Determining Causality
We can rule out coincidence by repeating the experiment many times or by using a large number of subjects in the experiment. Because coincidences occur randomly, they should not occur consistently in many subjects or experiments.
We can rule out a common underlying cause by controlling for potential confounding variables. If the controls rule out confounding variables, any remaining effects must be caused by the variables being studied.

14 Hidden Causality
Sometimes correlations, or the lack of a correlation, can hide an underlying causality. For example, studies suggested that patients who had heart bypass surgery fared no better than those who didn't. But researchers found confounding variables that the early studies had not considered, such as amount of blockage and surgical techniques. These confounding variables prevented the studies from finding a real correlation between the surgery and prolonged life.

15 Using StatCrunch to calculate correlation


18 Questions?

