Lecture on Correlation and Regression Analyses
REVIEW - Variable: A variable is a characteristic that changes or varies over time or across the different individuals or objects under consideration. Broad classification of variables: QUALITATIVE and QUANTITATIVE (DISCRETE or CONTINUOUS).
Types of Variable - Qualitative: assumes values that are not numerical but can be categorized. Categories may be identified by either non-numerical descriptions or numeric codes.
Types of Variable - Quantitative: indicates the quantity or amount of a characteristic. Data are always numeric and can be discrete or continuous.
Types of Quantitative Variables: Discrete – a variable with a finite or countable number of possible values. Continuous – a variable that can assume any value in a given interval.
Levels/Scales of Measurement: Data may be classified into four hierarchical levels of measurement: Nominal, Ordinal, Interval, and Ratio. Note: The type of statistical analysis that is appropriate for a particular variable depends on its level of measurement.
NOMINAL SCALE Data collected are labels, names or categories. Frequencies or counts of observations belonging to the same category can be obtained. It is the lowest level of measurement.
ORDINAL SCALE Data collected are labels with implied ordering. The difference between two data labels is meaningless.
INTERVAL SCALE Data can be ordered or ranked. The difference between two data values is meaningful. Data at this level may lack an absolute zero point.
RATIO SCALE Data have all the properties of the interval scale. The number zero indicates the absence of the characteristic being measured. It is the highest level of measurement.
Learning Points – PART II
1. What is a correlation analysis?
2. What is a regression analysis?
3. When do we use correlation analysis?
4. When do we use regression analysis?
5. How do we compare regression versus correlation analysis?
CORRELATION ANALYSIS A statistical technique used to measure the strength of the linear relationship between two variables, X and Y, each measured on at least an interval scale.
ILLUSTRATION The UP Admissions office may be interested in the relationship between UPCAT scores in Math and Reading Comprehension of UPCAT qualifiers.
ILLUSTRATION A social scientist might be concerned with how a city’s crime rate is related to its unemployment rate.
ILLUSTRATION A nutritionist might try to relate the quantity of carbohydrates in the diet consumed to the amount of sugar in the blood of diabetic individuals.
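PEARSON’S CORRELATION COEFFICIENT, $\rho$

$$\rho = \frac{\sigma_{XY}}{\sigma_X \, \sigma_Y}, \qquad \sigma_{XY} = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu_X)(Y_i - \mu_Y)$$

where $\sigma_{XY}$ = covariance between X and Y; $\sigma_X$ = standard deviation of the X values; $\sigma_Y$ = standard deviation of the Y values; $\mu_X$, $\mu_Y$ = population means of the X and Y values; N = number of paired observations in the population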
PEARSON’S CORRELATION COEFFICIENT, $\rho$
$\rho > 0$: X and Y increase (or decrease) together; Y increases as X increases.
$\rho < 0$: X increases (decreases) while Y decreases (increases); Y decreases as X increases.
$\rho = 0$: X and Y have no linear relationship; the scatter plot shows no pattern.
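SAMPLE CORRELATION COEFFICIENT, r

$$r = \frac{s_{XY}}{s_X \, s_Y}, \qquad s_{XY} = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$$

where $s_{XY}$ = sample covariance of X and Y values; $s_X$ = sample standard deviation of X values; $s_Y$ = sample standard deviation of Y values; $\bar{X}$, $\bar{Y}$ = sample means of the X and Y values; n = sample size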
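The sample correlation coefficient can be computed directly from its definition. The following is a minimal sketch only: the function name `sample_correlation` and the five data pairs are made up for illustration and are not the lecture's sample of twenty students.

```python
import math


def sample_correlation(x, y):
    """Return r = s_XY / (s_X * s_Y) for paired observations x and y."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be paired, with at least two observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sample covariance and sample standard deviations use the n - 1 divisor.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return s_xy / (s_x * s_y)


# Hypothetical data for illustration only (hours studied, exam score in %).
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 72]

r = sample_correlation(hours, scores)
print(f"r = {r:.3f}")  # prints "r = 0.973"
```

Because the n - 1 divisors cancel, the same value is obtained from the sums of cross-products and squares alone.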
QUALITATIVE INTERPRETATION OF $\rho$ AND r

Absolute Value of the Correlation Coefficient    Strength of Linear Relationship
0.0 – 0.2                                        Very weak
0.2 – 0.4                                        Weak
0.4 – 0.6                                        Moderate
0.6 – 0.8                                        Strong
0.8 – 1.0                                        Very strong
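The interpretation table can be applied mechanically. This is a sketch under one assumption that the slide does not settle: the ranges share endpoints (for example 0.2), and here a boundary value is assigned to the stronger category. The function name `strength_of_relationship` is hypothetical.

```python
def strength_of_relationship(coefficient):
    """Map the absolute value of a correlation coefficient to its label."""
    a = abs(coefficient)
    if a > 1:
        raise ValueError("a correlation coefficient lies between -1 and 1")
    # ASSUMPTION: a boundary value (0.2, 0.4, ...) goes to the stronger category.
    if a < 0.2:
        return "Very weak"
    if a < 0.4:
        return "Weak"
    if a < 0.6:
        return "Moderate"
    if a < 0.8:
        return "Strong"
    return "Very strong"


print(strength_of_relationship(-0.74))  # the sign is ignored: prints "Strong"
```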
EXAMPLE It is of interest to study the relationship between the number of hours spent studying and the student’s grade in an examination. A random sample of twenty students is selected and the data are given in the following table. Compute and interpret the sample correlation coefficient.
Table columns: Student, Hours Studied, Score (%)
SCATTER PLOT of Examination Score against Number of Hours Spent Studying
Sample Correlation Coefficient. Interpretation: There is a strong positive linear relationship between the number of hours a student spent studying for the exam and the student’s exam score.
TEST OF HYPOTHESIS ABOUT $\rho$
Ho: $\rho = 0$; There is no linear relationship between X and Y.
vs.
Ha: $\rho \ne 0$; There is a linear relationship between X and Y. or
Ha: $\rho > 0$; There is a positive linear relationship between X and Y. or
Ha: $\rho < 0$; There is a negative linear relationship between X and Y.
TEST OF HYPOTHESIS ABOUT $\rho$ The standardized form of the test statistic is

$$t_c = \frac{r \sqrt{n-2}}{\sqrt{1-r^2}}$$

which follows the Student’s t distribution with n - 2 df when the null hypothesis is TRUE. This is commonly referred to as the t-test for the correlation coefficient.
TEST OF HYPOTHESIS ABOUT $\rho$
With a given level of significance, $\alpha$:

Alternative Hypothesis                 Decision Rule
Ha: $\rho < 0$ (one-tailed test)       Reject Ho if $t_c < t_{tab} = -t_{\alpha}(n-2)$. Fail to reject Ho, otherwise.
Ha: $\rho > 0$ (one-tailed test)       Reject Ho if $t_c > t_{tab} = t_{\alpha}(n-2)$. Fail to reject Ho, otherwise.
Ha: $\rho \ne 0$ (two-tailed test)     Reject Ho if $|t_c| > t_{tab} = t_{\alpha/2}(n-2)$. Fail to reject Ho, otherwise.
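The test statistic and the three decision rules can be sketched as follows. The function names, the value r = 0.60, and the sample size n = 20 are hypothetical and chosen only for illustration; the tabular value 1.734 is the standard t-table entry for one-tailed α = 0.05 with 18 df.

```python
import math


def correlation_t_statistic(r, n):
    """t_c = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    if n < 3 or not -1 < r < 1:
        raise ValueError("need n >= 3 and -1 < r < 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


def reject_null(t_c, t_tab, alternative):
    """Apply the decision rule for the given alternative hypothesis.

    t_tab is the positive tabular value: t_alpha(n - 2) for a one-tailed
    test, t_alpha/2(n - 2) for the two-tailed test.
    """
    if alternative == "less":
        return t_c < -t_tab
    if alternative == "greater":
        return t_c > t_tab
    if alternative == "two-sided":
        return abs(t_c) > t_tab
    raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")


# Hypothetical inputs: r = 0.60 from n = 20 pairs, Ha: rho > 0, alpha = 0.05.
T_05_18 = 1.734  # tabular t value, one-tailed alpha = 0.05, 18 df
t_c = correlation_t_statistic(0.60, 20)
print(f"t_c = {t_c:.3f}, reject Ho: {reject_null(t_c, T_05_18, 'greater')}")
```

Taking the tabular value as an input keeps the sketch free of any statistics library; in practice the value would be read from a t table or computed by software.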
EXAMPLE In the study of the relationship between the number of hours spent studying and the student’s grade in an examination, is there evidence, at the 5% level of significance, that more hours spent studying are associated with higher exam scores?
Test of Hypothesis
Ho: $\rho = 0$; There is no linear relationship between the number of hours a student spent studying for the exam and his exam score.
Ha: $\rho > 0$; There is a positive linear relationship between the number of hours a student spent studying for the exam and his exam score.
Test of Hypothesis
Test statistic: $t_c = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$, with n = 20
Test procedure: One-tailed t-test for correlation coefficient
Decision rule: Reject Ho if $t_c > t_{tab} = t_{0.05}(18) = 1.734$. Fail to reject Ho, otherwise.
Test of Hypothesis Decision: Reject Ho. Conclusion: At α = 5%, there is evidence that more hours spent studying are associated with higher exam scores.
WORD OF CAUTION Correlation is a measure of the strength of linear relationship between two variables, with no suggestion of “cause and effect” or causal relationship. A correlation coefficient equal to zero only indicates lack of linear relationship and does not discount the possibility that other forms of relationship may exist.
REGRESSION ANALYSIS A statistical technique used to study the functional relationship between variables, which allows predicting the value of one variable, say Y, given the value of another variable, say X.
REGRESSION ANALYSIS Y – dependent variable: a variable whose variation/value depends on that of another. X – independent variable: a variable whose variation/value does not depend on that of another.
ILLUSTRATION The relationship between the number of hours spent studying and the student’s exam score may be expressed in equation form. This equation may be used to predict the student’s exam score knowing the number of hours the student spent studying.
ILLUSTRATION A child’s height is studied to see whether it is related to his father’s height such that some equation can be used to predict a child’s height given his father’s height. Sales of a product may be related to the corresponding advertising expenditures.
SAMPLE REGRESSION MODEL

$$\hat{Y}_i = b_0 + b_1 X_i$$

where $b_0$ = estimated Y-intercept, the predicted value of Y when X = 0; $b_1$ = estimated slope of the line, which measures the change in the predicted value of Y per unit change in X
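ESTIMATORS

$$b_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad b_0 = \bar{Y} - b_1 \bar{X}, \qquad s^2_{Y|X} = \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2}$$

where $\bar{Y}$ = mean of the Y values; $s^2_{Y|X}$ = estimated common variance of the Y’s; $\bar{X}$ = mean of the X values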
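The least-squares estimators can be sketched in a few lines. The function name `fit_line` and the five data pairs are hypothetical, chosen so the arithmetic is easy to check by hand; they are not the lecture's data.

```python
def fit_line(x, y):
    """Return (b0, b1, s2) for the least-squares line Y-hat = b0 + b1 * X.

    s2 is the estimated common variance of the Y's, SSE / (n - 2).
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three paired observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx           # estimated slope
    b0 = y_bar - b1 * x_bar    # estimated Y-intercept
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return b0, b1, sse / (n - 2)


# Hypothetical data for illustration only (hours studied, exam score in %).
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 72]

b0, b1, s2 = fit_line(hours, scores)
predicted = b0 + b1 * 2.5  # predicted score for 2.5 hours of study
print(f"Y-hat = {b0:.1f} + {b1:.1f} X; prediction at X = 2.5: {predicted:.1f}")
```

For these made-up numbers the fitted line is Y-hat = 48 + 5X, so each additional hour adds 5 points to the predicted score.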
EXAMPLE In the previous example, we may want to predict the examination score of a student given the number of hours he spent studying. Using the estimated regression line $\hat{Y}_i = b_0 + b_1 X_i$, the predicted exam score for $X_i = 2.5$ is approximately 69.
TEST OF HYPOTHESIS ABOUT $\beta_1$
Ho: $\beta_1 = \beta_1^*$, where $\beta_1^*$ is the hypothesized value of $\beta_1$
Ha: $\beta_1 \ne \beta_1^*$ or Ha: $\beta_1 > \beta_1^*$ or Ha: $\beta_1 < \beta_1^*$

The standardized form of the test statistic is

$$t_c = \frac{b_1 - \beta_1^*}{s_{b_1}}, \qquad \text{where } s_{b_1} = \sqrt{\frac{s^2_{Y|X}}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$$

with $s^2_{Y|X}$ the estimated common variance of the Y’s, and it follows the Student’s t distribution with n - 2 df when the null hypothesis is TRUE. This is commonly referred to as the t-test for the regression coefficient.

With a given level of significance, $\alpha$:

Alternative Hypothesis                          Decision Rule
Ha: $\beta_1 < \beta_1^*$ (one-tailed test)     Reject Ho if $t_c < t_{tab} = -t_{\alpha}(n-2)$. Fail to reject Ho, otherwise.
Ha: $\beta_1 > \beta_1^*$ (one-tailed test)     Reject Ho if $t_c > t_{tab} = t_{\alpha}(n-2)$. Fail to reject Ho, otherwise.
Ha: $\beta_1 \ne \beta_1^*$ (two-tailed test)   Reject Ho if $|t_c| > t_{tab} = t_{\alpha/2}(n-2)$. Fail to reject Ho, otherwise.
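The t-test for the regression coefficient can be sketched as below. The function name `slope_t_statistic`, the data, and the hypothesized slope of 1 are hypothetical; the tabular value 2.353 is the standard t-table entry for one-tailed α = 0.05 with 3 df, matching the five made-up observations.

```python
import math


def slope_t_statistic(x, y, beta1_hyp):
    """Return t_c = (b1 - beta1_hyp) / s_b1 for the least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    # Estimated common variance of the Y's: SSE / (n - 2).
    s2 = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    s_b1 = math.sqrt(s2 / s_xx)  # standard error of the slope
    return (b1 - beta1_hyp) / s_b1


# Hypothetical data for illustration only (hours studied, exam score in %).
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 72]

T_05_3 = 2.353  # tabular t value, one-tailed alpha = 0.05, n - 2 = 3 df
t_c = slope_t_statistic(hours, scores, beta1_hyp=1.0)
reject_ho = t_c > T_05_3  # right-tailed rule for Ha: beta_1 > 1
print(f"t_c = {t_c:.3f}, reject Ho: {reject_ho}")
```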
EXAMPLE Using the previous example, test at $\alpha$ = 5% whether a student’s examination score will increase by at least 1 percentage point with an additional hour of study time.
Ho: $\beta_1 \le 1$ vs. Ha: $\beta_1 > 1$
Test statistic: $t_c = \dfrac{b_1 - 1}{s_{b_1}}$
Test procedure: One-tailed t-test for regression coefficient
Decision rule: Reject Ho if $t_c > t_{tab} = t_{0.05}(18) = 1.734$. Otherwise, fail to reject Ho.
Decision: Since $t_c > t_{tab}$, we reject Ho.
Conclusion: At $\alpha$ = 5%, there is evidence that the student’s exam score will increase by at least 1 percentage point for an additional hour of study time.
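TEST OF HYPOTHESIS ABOUT $\beta_0$
Ho: $\beta_0 = \beta_0^*$, where $\beta_0^*$ is the hypothesized value of $\beta_0$
Ha: $\beta_0 \ne \beta_0^*$ or Ha: $\beta_0 > \beta_0^*$ or Ha: $\beta_0 < \beta_0^*$

The standardized form of the test statistic is

$$t_c = \frac{b_0 - \beta_0^*}{s_{b_0}}, \qquad \text{where } s_{b_0} = \sqrt{s^2_{Y|X} \left( \frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right)}$$

with $s^2_{Y|X}$ the estimated common variance of the Y’s, and it follows the Student’s t distribution with n - 2 df when the null hypothesis is TRUE. This is commonly referred to as the t-test for the regression constant.

With a given level of significance, $\alpha$:

Alternative Hypothesis                          Decision Rule
Ha: $\beta_0 < \beta_0^*$ (one-tailed test)     Reject Ho if $t_c < t_{tab} = -t_{\alpha}(n-2)$. Fail to reject Ho, otherwise.
Ha: $\beta_0 > \beta_0^*$ (one-tailed test)     Reject Ho if $t_c > t_{tab} = t_{\alpha}(n-2)$. Fail to reject Ho, otherwise.
Ha: $\beta_0 \ne \beta_0^*$ (two-tailed test)   Reject Ho if $|t_c| > t_{tab} = t_{\alpha/2}(n-2)$. Fail to reject Ho, otherwise.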
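The t-test for the regression constant follows the same pattern, with the standard error of the intercept in the denominator. As before, the function name `intercept_t_statistic`, the data, and the hypothesized value of 60 are hypothetical; 2.353 is the standard t-table entry for one-tailed α = 0.05 with 3 df.

```python
import math


def intercept_t_statistic(x, y, beta0_hyp):
    """Return t_c = (b0 - beta0_hyp) / s_b0 for the least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    # Estimated common variance of the Y's: SSE / (n - 2).
    s2 = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    # Standard error of the intercept.
    s_b0 = math.sqrt(s2 * (1 / n + x_bar ** 2 / s_xx))
    return (b0 - beta0_hyp) / s_b0


# Hypothetical data for illustration only (hours studied, exam score in %).
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 72]

T_05_3 = 2.353  # tabular t value, one-tailed alpha = 0.05, n - 2 = 3 df
t_c = intercept_t_statistic(hours, scores, beta0_hyp=60.0)
reject_ho = t_c < -T_05_3  # left-tailed rule for Ha: beta_0 < 60
print(f"t_c = {t_c:.3f}, reject Ho: {reject_ho}")
```

Note that for a left-tailed alternative the computed t is negative when the estimate falls below the hypothesized value, so the comparison is against the negative tabular value.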
EXAMPLE At $\alpha$ = 5%, test whether the data indicate that the student will fail (a score less than 60) if he did not study.
Ho: $\beta_0 \ge 60$ vs. Ha: $\beta_0 < 60$
Test statistic: $t_c = \dfrac{b_0 - 60}{s_{b_0}}$
Test procedure: One-tailed t-test for regression constant
Decision rule: Reject Ho if $t_c < t_{tab} = -t_{0.05}(18) = -1.734$. Otherwise, fail to reject Ho.
Decision: Since $t_c < t_{tab}$, we reject Ho.
Conclusion: At $\alpha$ = 5%, there is evidence that the student will get a score less than 60, that is, the student will fail, if he/she did not study for the examination.
ADEQUACY OF THE MODEL Coefficient of Determination ($R^2$): the proportion of the total variation in Y that is explained by X, usually expressed in percent.
EXAMPLE Interpretation: Around 55% of the total variation in examination scores is explained by the number of hours spent studying. The remaining 45% is explained by other variables not in the model, or by the fact that the relationship is not exactly linear.
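One common way to compute the coefficient of determination is as 1 - SSE/SST, the share of the total variation in Y not left in the residuals. The sketch below assumes that form; the function name `r_squared` and the data are hypothetical and do not reproduce the 55% of the lecture example.

```python
def r_squared(x, y):
    """Return R^2 = 1 - SSE / SST for the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained
    sst = sum((yi - y_bar) ** 2 for yi in y)                        # total
    return 1 - sse / sst


# Hypothetical data for illustration only (hours studied, exam score in %).
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 72]

r2 = r_squared(hours, scores)
print(f"R^2 = {100 * r2:.1f}% of the variation in scores is explained by hours")
```

In simple linear regression this value equals the square of the sample correlation coefficient r.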
SUMMARY 1. Correlation analysis 2. Regression analysis 3. Application with computer output 4. Interpretation
Regression analysis treats one variable as dependent on the other variable/s, so that you can predict its value given the values of the other variable/s. A fitted regression does not by itself prove causation.
Correlation analysis measures the relationship between two variables without designating either one as dependent on the other. Regression analysis in policy analysis is usually used to forecast certain events. For example, our trend line is an example of a regression analysis.
Illustrations: Knowing the effect of TV spot advertising on the number of people visiting the Family Planning clinic would allow the population commission official to decide rationally whether or not to increase the amount to be spent on TV spot advertising. The officer would be able to predict how many people the commission would be able to attract to the Family Planning clinic if it increased the number of TV ads run. (See series p.176)
The relationship between two variables (in our example, the number of TV ad runs and the number of people visiting the Family Planning clinic) can be summarized by a line. This is called the regression line. This is the line that we will use to predict the value of one variable, given the other.
Formula of the regression line:

$$Y = a + bX + e$$

Where: b = the slope of the line; a = the Y intercept or the value of Y when X = 0; e = the error term.
Example: Relationship between TV ads and number of people visiting the family planning clinic.
Table columns: Municipalities, Number of TV ads (X), Number of people visiting the clinic (Y)
The equation of the line is Y = 0.4 + 5.76X.
If X = 5, our predicted value for Y will be Y = 0.4 + 5.76(5) = 29.2.
If X = 7, our predicted value for Y will be Y = 0.4 + 5.76(7) = 40.7.
Interpretation: An increase of one in the number of TV ad runs is associated with an increase of 5.76 in the predicted number of people visiting the family planning clinic. So the family planning officer can now proceed with evaluating the cost effectiveness of the program ads.
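The two predictions can be checked with a short sketch. The slope 5.76 is stated in the interpretation; the intercept 0.4 is an assumption, inferred here from the two predicted values (29.2 at X = 5 and 40.7 at X = 7). The function name `predict_visitors` is hypothetical.

```python
SLOPE = 5.76      # stated in the lecture's interpretation
INTERCEPT = 0.4   # ASSUMPTION: inferred from the two predicted values


def predict_visitors(tv_ads):
    """Predicted number of clinic visitors for a given number of TV ads."""
    return INTERCEPT + SLOPE * tv_ads


for ads in (5, 7):
    print(f"X = {ads}: predicted Y = {predict_visitors(ads):.1f}")

# The slope is the change in the prediction per additional TV ad run.
increase_per_ad = predict_visitors(6) - predict_visitors(5)
```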
Coefficient of Determination: The coefficient of determination is the percentage of the variation in Y that is explained or accounted for by the variability of X. It is derived by squaring R and multiplying by 100, so it is expressed as a percentage. Thus, if R = 0.9, the coefficient of determination will be 81%. Formula: Coefficient of determination = $R^2 \times 100\%$
Hypothesis Testing for a and b: We use the t-statistic to test the hypothesis that a and b are significantly different from zero. Excel analysis of the problem:
Summary Output (Excel). Degrees of freedom (df): regression = k, residual = n - (k + 1), total = n - 1, where k is the number of independent variables.
DUMMY VARIABLE Represents a nominal or categorical variable in the regression model. For example: $Y = b_0 + b_1 X_1 + b_2 X_2$, where Y = scores, $X_1$ = hours spent studying, and $X_2$ = sex (M/F), taking a value of 1 if male and 0 otherwise.
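The role of the dummy variable can be seen by evaluating the model for a male and a female student with the same study time. This is a sketch only: the coefficient values and the function names `encode_sex` and `predict_score` are hypothetical, not estimates from the lecture.

```python
def encode_sex(sex):
    """Return the dummy variable X2: 1 if male ("M"), otherwise 0."""
    return 1 if sex == "M" else 0


def predict_score(hours, sex, b0, b1, b2):
    """Evaluate Y = b0 + b1 * X1 + b2 * X2 with X2 the dummy for sex."""
    return b0 + b1 * hours + b2 * encode_sex(sex)


# Hypothetical coefficients, for illustration only.
B0, B1, B2 = 50.0, 5.0, -2.0

male = predict_score(3, "M", B0, B1, B2)
female = predict_score(3, "F", B0, B1, B2)
print(f"male: {male}, female: {female}, difference: {male - female}")
```

With hours held fixed, the difference between the two predictions equals b2, so the dummy shifts the intercept of the line for the group coded 1 and leaves the slope unchanged.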