Elementary Statistics Correlation and Regression
Correlation
A correlation is a relationship between two variables. The explanatory (independent) variable is x and the response (dependent) variable is y. Two questions guide the analysis: what type of relationship exists between the two variables, and is the correlation significant?
Examples of paired variables:
x = hours of training, y = number of accidents
x = shoe size, y = height
x = cigarettes smoked per day, y = lung capacity
x = score on SAT, y = grade point average
x = height, y = IQ
Scatter Plots and Types of Correlation
Negative correlation: as x increases, y decreases. Example: x = hours of training, y = number of accidents. [Scatter plot: accidents versus hours of training]
Positive correlation: as x increases, y increases. Example: x = SAT score, y = GPA. [Scatter plot: GPA versus math SAT score]
No linear correlation. Example: x = height, y = IQ. [Scatter plot: IQ versus height]
Correlation Coefficient
The correlation coefficient r is a measure of the strength and direction of a linear relationship between two variables. The range of r is from -1 to 1. If r is close to 1, there is a strong positive correlation. If r is close to -1, there is a strong negative correlation. If r is close to 0, there is little or no linear correlation.
Application
x = number of absences, y = final grade. [Scatter plot: final grade versus absences]
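Computation of r
$$ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} $$
where n is the number of data pairs. [Table: columns x, y, xy, x^2, y^2, with a row of column sums]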
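The sum-based computation of r can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual computational formula built from the column sums of x, y, xy, x^2, and y^2. The function name `correlation_coefficient` and the sample data are hypothetical and are not the absences and grades table from the slides.

```python
import math

def correlation_coefficient(xs, ys):
    """Sample correlation coefficient r from the column sums of x, y, xy, x^2, y^2."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical data chosen for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [10, 8, 7, 4, 1]
r = correlation_coefficient(xs, ys)
print(round(r, 3))  # close to -1, so a strong negative correlation
```

A value near -1 for these made-up points matches the visual impression of a downward-sloping scatter plot.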
Hypothesis Test for Significance
r is the correlation coefficient for the sample. The correlation coefficient for the population is $\rho$ (rho). The sampling distribution for r is a t-distribution with n - 2 d.f.
Standardized test statistic:
$$ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} $$
For a two-tailed test for significance:
$H_0: \rho = 0$ (The correlation is not significant)
$H_a: \rho \neq 0$ (The correlation is significant)
For left-tailed and right-tailed tests of negative or positive significance:
$H_0: \rho \geq 0$, $H_a: \rho < 0$ (left-tailed)
$H_0: \rho \leq 0$, $H_a: \rho > 0$ (right-tailed)
Test of Significance
You found the correlation between the number of times absent and a final grade to be r = -0.975. There were seven pairs of data. Test the significance of this correlation. Use $\alpha = 0.01$.
1. Write the null and alternative hypotheses. $H_0: \rho = 0$ (The correlation is not significant); $H_a: \rho \neq 0$ (The correlation is significant).
2. State the level of significance. $\alpha = 0.01$
3. Identify the sampling distribution. A t-distribution with n - 2 = 5 degrees of freedom.
4. Find the critical value. Critical values: $\pm t_0 = \pm 4.032$.
5. Find the rejection region. Rejection regions: t < -4.032 or t > 4.032.
6. Find the test statistic. $t = \dfrac{-0.975}{\sqrt{\dfrac{1 - (-0.975)^2}{7 - 2}}} \approx -9.811$
7. Make your decision. t = -9.811 falls in the rejection region. Reject the null hypothesis.
8. Interpret your decision. At the 0.01 level of significance, there is a significant correlation between the number of times absent and final grades.
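The test statistic and decision can be checked with a short Python sketch. It assumes r = -0.975 (the value squared on the coefficient of determination slide), n = 7 pairs, and the critical value 4.032 quoted on the slides for a two-tailed test with 5 degrees of freedom. The function name `correlation_t_statistic` is hypothetical.

```python
import math

def correlation_t_statistic(r, n):
    """Standardized test statistic for H0: rho = 0, using n - 2 degrees of freedom."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))

r = -0.975          # sample correlation coefficient from the slides
n = 7               # number of data pairs
t_critical = 4.032  # two-tailed critical value quoted on the slides (alpha = 0.01, 5 d.f.)

t = correlation_t_statistic(r, n)
reject_null = abs(t) > t_critical  # two-tailed rejection rule

print(round(t, 2), reject_null)  # t is about -9.81, so the null hypothesis is rejected
```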
The Line of Regression
Once you know there is a significant linear correlation, you can write an equation describing the relationship between the x and y variables. This equation is called the line of regression or least squares line.
The equation of a line may be written as y = mx + b, where m is the slope of the line and b is the y-intercept.
The line of regression is: $\hat{y} = mx + b$
The slope m is: $m = \dfrac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2}$
The y-intercept is: $b = \bar{y} - m\bar{x} = \dfrac{\sum y}{n} - m\,\dfrac{\sum x}{n}$
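As a rough illustration, the slope and y-intercept of the least squares line can be computed from the same column sums used for r. The sketch below assumes the standard least squares formulas. The function name `regression_line` and the sample data are hypothetical and were chosen so that the points lie exactly on y = 2x + 1.

```python
def regression_line(xs, ys):
    """Return (m, b) for the least squares line y-hat = m*x + b."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - m * sum_x / n  # b = y-bar - m * x-bar
    return m, b

# Hypothetical data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
m, b = regression_line(xs, ys)

# The point (x-bar, y-bar) always lies on the regression line.
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
print(m, b, abs(m * x_bar + b - y_bar) < 1e-12)
```

Because b is defined as y-bar minus m times x-bar, the final check holds for any data set, not only this one.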
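[Figure: scatter plot of revenue versus advertising dollars with the regression line. $(x_i, y_i)$ = a data point; $(x_i, \hat{y}_i)$ = a point on the line with the same x-value; $d_i = y_i - \hat{y}_i$ = a residual.]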
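The Line of Regression
Calculate m and b. Write the equation of the line of regression with x = number of absences and y = final grade. [Table: columns x, y, xy, x^2, y^2, with a row of column sums]
The slope is m = -3.924, and the intercept follows from $b = \bar{y} - m\bar{x}$, so the line of regression is $\hat{y} = -3.924x + b$.
Note that the point $(\bar{x}, \bar{y})$, with $\bar{x} = 8.143$, is on the line. [Scatter plot: final grade versus absences with the regression line]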
Predicting y Values
The regression line can be used to predict values of y for values of x falling within the range of the data. The regression equation for number of times absent and final grade is $\hat{y} = -3.924x + b$. Use this equation to predict the expected grade for a student with (a) 3 absences and (b) 12 absences.
(a) $\hat{y} = -3.924(3) + b$
(b) $\hat{y} = -3.924(12) + b$
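The restriction to x-values inside the range of the data can be made explicit in code. The sketch below is one possible way to do it. The function name `predict`, its parameters, and the example line y-hat = 2x + 1 with data range 1 to 4 are all hypothetical and are not the absences and grades equation.

```python
def predict(m, b, x, x_min, x_max):
    """Predict y-hat = m*x + b, refusing x-values outside the observed data range."""
    if not (x_min <= x <= x_max):
        raise ValueError(f"x = {x} is outside the data range [{x_min}, {x_max}]")
    return m * x + b

# Hypothetical line and data range, for illustration only.
in_range_prediction = predict(2.0, 1.0, 3, x_min=1, x_max=4)
print(in_range_prediction)

# An x-value outside the data range is rejected instead of extrapolated.
try:
    predict(2.0, 1.0, 10, x_min=1, x_max=4)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True
```

Raising an error is a design choice; a gentler variant could return the prediction together with a warning flag.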
The Coefficient of Determination
The coefficient of determination, $r^2$, is the ratio of explained variation in y to the total variation in y:
$$ r^2 = \frac{\text{explained variation}}{\text{total variation}} $$
The correlation coefficient of number of times absent and final grade is r = -0.975. The coefficient of determination is $r^2 = (-0.975)^2 \approx 0.9506$.
Interpretation: About 95% of the variation in final grades can be explained by the number of times a student is absent. The other 5% is unexplained and can be due to sampling error or other variables such as intelligence or amount of time studied.
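The ratio definition can be illustrated numerically. The sketch below assumes explained variation means the sum of squared deviations of the predicted values from the mean of y, and total variation means the sum of squared deviations of the observed values from that mean. The function name `coefficient_of_determination` and the data are hypothetical. The last lines simply square the r = -0.975 given on the slides.

```python
def coefficient_of_determination(xs, ys):
    """r^2 as explained variation divided by total variation."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Least squares slope and intercept.
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - m * sum_x / n
    y_bar = sum_y / n
    explained = sum((m * x + b - y_bar) ** 2 for x in xs)
    total = sum((y - y_bar) ** 2 for y in ys)
    return explained / total

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [10, 8, 7, 4, 1]
r_squared = coefficient_of_determination(xs, ys)
print(round(r_squared, 3))

# For the slides' example, squaring r = -0.975 gives about 0.9506.
slides_r_squared = (-0.975) ** 2
print(round(slides_r_squared, 4))
```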
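The Standard Error of Estimate
The standard error of estimate, $s_e$, is the standard deviation of the observed $y_i$ values about the predicted value $\hat{y}_i$ for a given $x_i$:
$$ s_e = \sqrt{\frac{\sum \left(y_i - \hat{y}_i\right)^2}{n - 2}} $$
To apply the formula, calculate $\hat{y}_i$ for each $x_i$ using the regression equation, then sum the squared differences $(y_i - \hat{y}_i)^2$.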
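The calculation of the standard error of estimate can be sketched as follows, assuming the usual definition with n - 2 in the denominator. The function name `standard_error_of_estimate` is hypothetical, and the data and line (m = -2.2, b = 12.6, the least squares fit to these made-up points) are for illustration only.

```python
import math

def standard_error_of_estimate(xs, ys, m, b):
    """s_e: square root of the sum of squared residuals divided by n - 2."""
    n = len(xs)
    if n <= 2:
        raise ValueError("need more than 2 data pairs")
    # Calculate y-hat for each x, then accumulate the squared residuals.
    squared_residuals = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(squared_residuals / (n - 2))

# Hypothetical data and its least squares line y-hat = -2.2x + 12.6.
xs = [1, 2, 3, 4, 5]
ys = [10, 8, 7, 4, 1]
s_e = standard_error_of_estimate(xs, ys, m=-2.2, b=12.6)
print(round(s_e, 4))
```

A smaller s_e means the observed points sit closer to the regression line.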
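Prediction Intervals
Given a specific linear regression equation and $x_0$, a specific value of x, a c-prediction interval for y is:
$$ \hat{y} - E < y < \hat{y} + E $$
where
$$ E = t_c\, s_e \sqrt{1 + \frac{1}{n} + \frac{n\left(x_0 - \bar{x}\right)^2}{n\sum x^2 - \left(\sum x\right)^2}} $$
The point estimate is $\hat{y}$ and E is the maximum error of estimate. Use a t-distribution with n - 2 degrees of freedom to find $t_c$.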
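The pieces of a prediction interval fit together as in the sketch below. It assumes the standard formula for the maximum error of estimate E and takes the critical value t_c as an input rather than computing it. The function name `prediction_interval` and the data are hypothetical. The value t_c = 2.353 is the table value for a 90% interval with 3 degrees of freedom, matching the five made-up pairs.

```python
import math

def prediction_interval(xs, ys, x0, t_c):
    """Return (low, high) for y at x0, given the critical value t_c for n - 2 d.f."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Least squares line.
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - m * sum_x / n
    # Standard error of estimate.
    s_e = math.sqrt(sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys)) / (n - 2))
    # Maximum error of estimate E.
    x_bar = sum_x / n
    E = t_c * s_e * math.sqrt(
        1 + 1 / n + n * (x0 - x_bar) ** 2 / (n * sum_x2 - sum_x ** 2)
    )
    y_hat = m * x0 + b  # point estimate
    return y_hat - E, y_hat + E

# Hypothetical data; t_c = 2.353 is the 90% table value for 3 degrees of freedom.
xs = [1, 2, 3, 4, 5]
ys = [10, 8, 7, 4, 1]
low, high = prediction_interval(xs, ys, x0=3, t_c=2.353)
print(round(low, 3), round(high, 3))
```

The interval is narrowest when x0 equals the mean of x, as it does here, and widens as x0 moves away from it.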
Application
Construct a 90% prediction interval for a final grade when a student has been absent 6 times.
1. Find the point estimate: $\hat{y} = -3.924(6) + b$. The point $(6, \hat{y})$ is the point on the regression line with x-coordinate of 6.
2. Find E. At the 90% level of confidence, with n - 2 = 5 degrees of freedom, $t_c = 2.015$, and the maximum error of estimate is $E = t_c\, s_e \sqrt{1 + \frac{1}{n} + \frac{n\left(x_0 - \bar{x}\right)^2}{n\sum x^2 - \left(\sum x\right)^2}}$ evaluated at $x_0 = 6$.
3. Find the endpoints: $\hat{y} - E$ and $\hat{y} + E$. When x = 6, the 90% prediction interval is $\hat{y} - E < y < \hat{y} + E$.