M248: Analyzing data Block D
Related variables and Regression. Block D, Units D3 and D2.
Contents
Introduction
Section 1: Correlation
Section 2: Measures of correlation
Section 3: Regression
Section 4: Independence (Association in contingency tables)
Terms to know and use
Introduction
The slides are based on the following points:
Correlation between two variables and the strength of this correlation
Regression
Studying contingency tables
Section 1: Correlation Two random variables are said to be related (or correlated, or associated) if knowing the value of one variable tells you something about the value of the other. Alternatively, we say there is a relationship between the two variables. The first thing to do when investigating a possible relationship between two variables is to produce a scatterplot of the data.
Section 1: Correlation Section 1.1: Are the variables related? The two variables are said to be positively related if the pattern of the points in the scatterplot slopes upwards from left to right. Example:
Section 1: Correlation The two variables are said to be negatively related if the pattern of the points in the scatterplot slopes downwards from left to right. Example:
Section 1: Correlation Scatterplot examples:
Strong positive linear relationship between the two variables.
There is an overall downward trend, but no suggestion of any systematic relationship between the two variables.
Strong relationship between the two variables, but it is not linear.
Section 1: Correlation Sometimes a relationship between two variables is more complicated, and the variables cannot be classified as either positively or negatively related. Read Examples 1.1, 1.2, 1.3 and 1.4. Solve Activity 1.1.
Section 1: Correlation Section 1.2: Correlation and Causation Causation and correlation are not equivalent. Causation means that the value of one variable is caused by the value of the other, while correlation only means that there is a relationship between the two variables. A strong correlation between two variables does not, by itself, show that one causes the other.
Section 2: Measures of correlation There are two measures of correlation: the Pearson and the Spearman correlation. These measures are called correlation coefficients. A correlation coefficient is a number between −1 and +1; the closer it is to one of those limits, the stronger the relationship between the two variables. Correlation coefficients which measure how well a straight line can explain the relationship between two variables are called linear correlation coefficients.
Section 2: Measures of correlation The range of the correlation coefficient is from −1 to +1. If there is a strong positive linear relationship between the variables, the value of r will be close to +1. If there is a strong negative linear relationship between the variables, the value of r will be close to −1.
Section 2: Measures of correlation Section 2.1: The Pearson correlation coefficient Two variables are positively related if they increase together and decrease together; in that case the correlation coefficient is positive. If they are negatively related, the coefficient is negative. A correlation coefficient of 0 implies that there is no systematic linear relationship between the two variables.
Section 2: Measures of correlation The Pearson correlation coefficient is a measure of how well a straight line can explain the relationship between two variables. It is only appropriate to use this coefficient if the scatterplot shows a roughly linear pattern.
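Written out, the Pearson coefficient computed from n paired observations takes the following computational form (this is the form consistent with the column sums tabulated in the worked examples below):

```latex
r = \frac{n\sum xy \;-\; \sum x \sum y}
         {\sqrt{\left[\,n\sum x^{2}-\left(\sum x\right)^{2}\right]
                \left[\,n\sum y^{2}-\left(\sum y\right)^{2}\right]}}
```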
Example 1: Car Rental Companies

Company | Cars x (in 10,000s) | Income y (in billions) | xy | x² | y²
A | 63.0 | 7.0 | 441.00 | 3969.00 | 49.00
B | 29.0 | 3.9 | 113.10 | 841.00 | 15.21
C | 20.8 | 2.1 | 43.68 | 432.64 | 4.41
D | 19.1 | 2.8 | 53.48 | 364.81 | 7.84
E | 13.4 | 1.4 | 18.76 | 179.56 | 1.96
F | 8.5 | 1.5 | 12.75 | 72.25 | 2.25
Totals | Σx = 153.8 | Σy = 18.7 | Σxy = 682.77 | Σx² = 5859.26 | Σy² = 80.67
Example 1: Car Rental Companies Σx = 153.8, Σy = 18.7, Σxy = 682.77, Σx² = 5859.26, Σy² = 80.67, n = 6
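As a quick check, r for Example 1 can be reproduced from these sums. A minimal sketch in Python, using the standard computational formula for the Pearson coefficient:

```python
import math

# Summary statistics for Example 1 (car rental companies)
n = 6
sx, sy = 153.8, 18.7          # sum of x, sum of y
sxy = 682.77                  # sum of x*y
sxx, syy = 5859.26, 80.67     # sum of x^2, sum of y^2

# Pearson correlation coefficient from the computational formula
r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(round(r, 3))  # 0.982
```

The value close to +1 matches the strong upward linear pattern in the scatterplot.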
Example 2: Absences/Final Grades

Student | Number of absences x | Final grade y (pct.) | xy | x² | y²
A | 6 | 82 | 492 | 36 | 6,724
B | 2 | 86 | 172 | 4 | 7,396
C | 15 | 43 | 645 | 225 | 1,849
D | 9 | 74 | 666 | 81 | 5,476
E | 12 | 58 | 696 | 144 | 3,364
F | 5 | 90 | 450 | 25 | 8,100
G | 8 | 78 | 624 | 64 | 6,084
Totals | Σx = 57 | Σy = 511 | Σxy = 3,745 | Σx² = 579 | Σy² = 38,993
Example 2: Absences/Final Grades Σx = 57, Σy = 511, Σxy = 3745, Σx² = 579, Σy² = 38,993, n = 7
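The same computation for Example 2 (a sketch using the same computational formula) gives a strong negative coefficient:

```python
import math

# Summary statistics for Example 2 (absences vs. final grades)
n = 7
sx, sy, sxy = 57, 511, 3745    # sum of x, sum of y, sum of x*y
sxx, syy = 579, 38993          # sum of x^2, sum of y^2

r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(round(r, 3))  # -0.944
```

A value close to −1: grades tend to fall as the number of absences rises.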
Example 3: Exercise/Milk Intake

Subject | Hours x | Amount y | xy | x² | y²
A | 3 | 48 | 144 | 9 | 2,304
B | 0 | 8 | 0 | 0 | 64
C | 2 | 32 | 64 | 4 | 1,024
D | 5 | 64 | 320 | 25 | 4,096
E | 8 | 10 | 80 | 64 | 100
F | 5 | 32 | 160 | 25 | 1,024
G | 10 | 56 | 560 | 100 | 3,136
H | 2 | 72 | 144 | 4 | 5,184
I | 1 | 48 | 48 | 1 | 2,304
Totals | Σx = 36 | Σy = 370 | Σxy = 1,520 | Σx² = 232 | Σy² = 19,236
Example 3: Exercise/Milk Intake Σx = 36, Σy = 370, Σxy = 1520, Σx² = 232, Σy² = 19,236, n = 9
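For Example 3 the same sketch gives a coefficient very close to 0, indicating no systematic linear relationship between exercise hours and milk intake:

```python
import math

# Summary statistics for Example 3 (exercise hours vs. milk intake)
n = 9
sx, sy, sxy = 36, 370, 1520    # sum of x, sum of y, sum of x*y
sxx, syy = 232, 19236          # sum of x^2, sum of y^2

r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(round(r, 3))  # 0.067
```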
Section 2: Measures of correlation Section 2.2: The Spearman rank correlation coefficient Replacing the original data by their ranks, and then calculating the Pearson correlation coefficient of the ranks, gives a measure of the strength of association between two variables known as the Spearman rank correlation coefficient, denoted by rs. The value of rs is a measure of the linearity of the relationship between the ranks.
Section 2: Measures of correlation Example: GNP and adult literacy Spearman’s rank correlation coefficient (rs) is the best method to use here. Remember, Spearman’s rank correlation is calculated from ranks (ordinal data), so it is necessary to rank-order the data first.

Country | GNP per capita | % adult literacy
Nepal | 210 | 39.7
Sudan | 290 | 55.5
Gambia | 340 | 34.5
Peru | 2460 | 89
Turkey | 3160 | 81.4
Brazil | 4570 | 84
Argentina | 8970 | 97
Israel | 15940 | 96
U.A.E. | 18220 | 74.3
Netherlands | 24760 | 100
Section 2: Measures of correlation Example: GNP and adult literacy Ranking each variable from lowest (rank 1) to highest (rank 10):

Country | GNP rank x | Literacy rank y
Nepal | 1 | 2
Sudan | 2 | 3
Gambia | 3 | 1
Peru | 4 | 7
Turkey | 5 | 5
Brazil | 6 | 6
Argentina | 7 | 9
Israel | 8 | 8
U.A.E. | 9 | 4
Netherlands | 10 | 10
Section 2: Measures of correlation Example: GNP and adult literacy Σx = 55, Σy = 55, Σxy = 363, Σx² = 385, Σy² = 385, n = 10
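Because both variables have been replaced by the ranks 1 to 10, rs is just the Pearson coefficient computed from the ranks. A sketch using the rank sums above:

```python
import math

# Rank sums for the GNP / adult literacy example (n = 10 countries)
n = 10
sx = sy = 55        # each variable's ranks are 1..10, summing to 55
sxy = 363           # sum of products of paired ranks
sxx = syy = 385     # sum of squared ranks, 1^2 + ... + 10^2

rs = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(round(rs, 2))  # 0.73
```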
Section 2: Measures of correlation If two variables have a monotonic increasing relationship (as one increases, the other always increases, though not necessarily linearly), the value of rs is equal to +1; that is, they have an exact curvilinear positive relationship. Similarly, the data have a Spearman rank correlation coefficient of −1 if the two variables have a monotonic decreasing relationship.
Section 2: Measures of correlation Section 2.3: Testing for association The sampling distribution of the Pearson correlation: under the null hypothesis that there is no association between the two variables, the sampling distribution of the Pearson correlation R is such that
T = R √((n − 2)/(1 − R²))
has a t-distribution with n − 2 degrees of freedom.
Example : Car Rental Companies Test the significance of the correlation coefficient found in Example 1. Use r = 0.982.
H0: ρ = 0. This null hypothesis means that there is no correlation between the x and y variables in the population.
H1: ρ ≠ 0. This alternative hypothesis means that there is a significant correlation between the variables in the population.
Example : Car Rental Companies
Step 2: Compute the test value.
t = r √((n − 2)/(1 − r²)) = 0.982 × √(4/(1 − 0.982²)) ≈ 10.4
Step 3: Compute the p-value.
df = 4 (use Table 3 of the handbook); then 0 < p-value < 2(1 − 0.999), that is, 0 < p-value < 0.002.
Step 4: Make the decision.
Since the p-value ≤ 0.01, we have strong evidence against H0.
Step 5: Summarize the results.
There is a relationship between the number of cars a rental agency owns and its annual income.
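The test value in Step 2 can be reproduced in a couple of lines. A sketch, assuming the t-statistic form implied by the use of Table 3 with n − 2 degrees of freedom:

```python
import math

r, n = 0.982, 6
# t-statistic for testing rho = 0, referred to a t-distribution with n - 2 df
t = r * math.sqrt((n - 2) / (1 - r ** 2))
print(round(t, 1))  # 10.4
```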
Section 2: Measures of correlation The approximate sampling distribution of the Spearman correlation: for large samples, under the null hypothesis of no association, the sampling distribution of Rs is such that
T = Rs √((n − 2)/(1 − Rs²))
has approximately a t-distribution with n − 2 degrees of freedom.
Example : GNP and adult literacy Test the significance of the correlation coefficient found in the example. Use rs = 0.73.
H0: ρs = 0. This null hypothesis means that there is no correlation between the x and y variables in the population.
H1: ρs ≠ 0. This alternative hypothesis means that there is a significant correlation between the variables in the population.
Example : GNP and adult literacy
Step 2: Compute the test value.
t = rs √((n − 2)/(1 − rs²)) = 0.73 × √(8/(1 − 0.73²)) ≈ 3.02
Step 3: Compute the p-value.
df = 8 (use Table 3 of the handbook); then 2(1 − 0.995) < p-value < 2(1 − 0.99), that is, 0.01 < p-value < 0.02.
Step 4: Make the decision.
Since 0.01 < p-value ≤ 0.05, we have moderate evidence against H0.
Step 5: Summarize the results.
There is a relationship between GNP and adult literacy.
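The same sketch applies to the Spearman test value, again assuming the t-statistic form implied by the use of Table 3 with n − 2 degrees of freedom:

```python
import math

rs, n = 0.73, 10
# t-statistic for testing rho_s = 0, referred to a t-distribution with n - 2 df
t = rs * math.sqrt((n - 2) / (1 - rs ** 2))
print(round(t, 2))  # 3.02
```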
Section 2: Measures of correlation Section 2.4: Correlation using MINITAB See MINITAB Videos on moodle
Regression If the value of the correlation coefficient is significant, the next step is to determine the equation of the regression line which is the data’s line of best fit.
Regression Best fit means that the sum of the squares of the vertical distance from each point to the line is at a minimum.
Least squares line of a linear regression The response (dependent) variable is denoted by Y and the predictor/explanatory (independent) variable by x.
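In this notation the least squares line can be written ŷ = a + bx, where the slope and intercept follow from the standard least squares formulas (the form consistent with the sums used in the examples):

```latex
b = \frac{n\sum xy - \sum x \sum y}{n\sum x^{2}-\left(\sum x\right)^{2}},
\qquad
a = \bar{y} - b\,\bar{x} = \frac{\sum y}{n} - b\,\frac{\sum x}{n}
```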
Example : Car Rental Companies Find the equation of the regression line for the data in Example 1.
Σx = 153.8, Σy = 18.7, Σxy = 682.77, Σx² = 5859.26, Σy² = 80.67, n = 6
b = (nΣxy − ΣxΣy)/(nΣx² − (Σx)²) = (6 × 682.77 − 153.8 × 18.7)/(6 × 5859.26 − 153.8²) ≈ 0.106
a = ȳ − b x̄ = 18.7/6 − b × 153.8/6 ≈ 0.396
The regression line is ŷ = 0.396 + 0.106x.
Example: Car Rental Companies Use the equation of the regression line to predict the income of a car rental agency that has 200,000 automobiles. Since x is measured in 10,000s, x = 20 corresponds to 200,000 automobiles. Hence, when a rental agency has 200,000 automobiles, its income will be approximately $2.52 billion.
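The slope, intercept, and prediction can all be reproduced from the Example 1 sums. A minimal sketch using the least squares formulas:

```python
# Summary statistics from Example 1 (car rental companies)
n = 6
sx, sy, sxy, sxx = 153.8, 18.7, 682.77, 5859.26

# Least squares slope and intercept
b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
a = sy / n - b * sx / n

print(round(a, 3), round(b, 3))  # 0.396 0.106
print(round(a + b * 20, 2))      # 2.52  (predicted income for x = 20, i.e. 200,000 cars)
```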
Chi-Square Distributions The chi-square variable is similar to the t variable in that its distribution is a family of curves based on the number of degrees of freedom. The symbol for chi-square is χ² (Greek letter chi, pronounced “ki”). A chi-square variable cannot be negative, and the distributions are skewed to the right.
Chi-Square Distributions At about 100 degrees of freedom, the chi-square distribution becomes somewhat symmetric. The area under each chi-square distribution is equal to 1.00, or 100%.
The Chi-Square Test When data can be tabulated in table form in terms of frequencies, several types of hypotheses can be tested by using the chi-square test. The test of independence of variables is used to determine whether two variables are independent of or related to each other when a single sample is selected.
Formula for the test:
χ² = Σ (O − E)² / E
where O = observed frequency and E = expected frequency. For the test of independence, the degrees of freedom are d.f. = (R − 1)(C − 1), where R and C are the numbers of rows and columns in the table.
Test for Independence The chi-square test can be used to test the independence of two variables. The hypotheses are:
H0: There is no relationship between the two variables.
H1: There is a relationship between the two variables.
If the null hypothesis is rejected, there is some relationship between the variables. In order to test the null hypothesis, one must compute the expected frequencies, assuming the null hypothesis is true; for each cell, E = (row total × column total) / grand total.
Contingency Tables (Association) When data are arranged in table form for the independence test, the table is called a contingency table. The degrees of freedom for any contingency table are d.f. = (rows – 1) (columns – 1) = (R – 1)(C – 1).
Example 1: College Education and Place of Residence A sociologist wishes to see whether the number of years of college a person has completed is related to her or his place of residence. A sample of 88 people is selected and classified as shown. Can the sociologist conclude that a person’s location is dependent on the number of years of college?

Location | No College | Four-Year Degree | Advanced Degree | Total
Urban | 15 | 12 | 8 | 35
Suburban | 8 | 15 | 9 | 32
Rural | 6 | 8 | 7 | 21
Total | 29 | 35 | 24 | 88
Example 1: College Education and Place of Residence Step 1: State the hypotheses and identify the claim. H0: A person’s place of residence is independent of the number of years of college completed. H1: A person’s place of residence is dependent on the number of years of college completed (claim).
Example 1: College Education and Place of Residence Compute the expected values, E = (row total × column total) / 88, shown in parentheses next to each observed frequency.

Location | No College | Four-Year Degree | Advanced Degree | Total
Urban | 15 (11.53) | 12 (13.92) | 8 (9.55) | 35
Suburban | 8 (10.55) | 15 (12.73) | 9 (8.73) | 32
Rural | 6 (6.92) | 8 (8.35) | 7 (5.73) | 21
Total | 29 | 35 | 24 | 88
Example 1: College Education and Place of Residence Step 2: Compute the test value.
χ² = Σ (O − E)²/E = (15 − 11.53)²/11.53 + (12 − 13.92)²/13.92 + … + (7 − 5.73)²/5.73 ≈ 3.01
The degrees of freedom are (3 – 1)(3 – 1) = 4.
Example 1: College Education and Place of Residence Step 3: Compute the P-value. Use Table 4 in the handbook: (1 − 0.5) < p-value < (1 − 0.1), that is, 0.5 < p-value < 0.9. Step 4: Make the decision. Since the p-value > 0.1, we have little evidence against H0. Step 5: Summarize the results. There is not enough evidence to support the claim that a person’s place of residence is dependent on the number of years of college completed.
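The whole test statistic for Example 1 can be reproduced in a few lines. A sketch that derives the expected frequencies from the row and column totals (the observed counts are those recoverable from the table's totals) and sums (O − E)²/E:

```python
# Observed frequencies: rows = Urban, Suburban, Rural;
# columns = No College, Four-Year Degree, Advanced Degree
observed = [[15, 12, 8],
            [8, 15, 9],
            [6, 8, 7]]

row_totals = [sum(row) for row in observed]        # 35, 32, 21
col_totals = [sum(col) for col in zip(*observed)]  # 29, 35, 24
grand_total = sum(row_totals)                      # 88

# Chi-square statistic: sum of (O - E)^2 / E over all cells,
# with E = row total * column total / grand total
chi2 = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / grand_total) ** 2
    / (row_totals[i] * col_totals[j] / grand_total)
    for i in range(len(row_totals))
    for j in range(len(col_totals))
)
print(round(chi2, 2))  # 3.01
```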
Example 2: Alcohol and Gender A researcher wishes to determine whether there is a relationship between the gender of an individual and the amount of alcohol consumed. A sample of 68 people is selected, and the following data are obtained. Can the researcher conclude that alcohol consumption is related to gender?

Gender | Low | Moderate | High | Total
Male | 10 | 9 | 8 | 27
Female | 13 | 16 | 12 | 41
Total | 23 | 25 | 20 | 68

(The columns are the levels of alcohol consumption.)
Example 2: Alcohol and Gender Step 1: State the hypotheses and identify the claim. H0: The amount of alcohol that a person consumes is independent of the individual’s gender. H1: The amount of alcohol that a person consumes is dependent on the individual’s gender (claim).
Example 2: Alcohol and Gender Compute the expected values, E = (row total × column total) / 68, shown in parentheses next to each observed frequency.

Gender | Low | Moderate | High | Total
Male | 10 (9.13) | 9 (9.93) | 8 (7.94) | 27
Female | 13 (13.87) | 16 (15.07) | 12 (12.06) | 41
Total | 23 | 25 | 20 | 68
Example 2: Alcohol and Gender Step 2: Compute the test value.
χ² = Σ (O − E)²/E = (10 − 9.13)²/9.13 + (9 − 9.93)²/9.93 + … + (12 − 12.06)²/12.06 ≈ 0.28
The degrees of freedom are (2 – 1)(3 – 1) = (1)(2) = 2.
Example 2: Alcohol and Gender Step 3: Compute the P-value. Use Table 4 in the handbook: (1 − 0.5) < p-value < (1 − 0.1), that is, 0.5 < p-value < 0.9. Step 4: Make the decision. Since the p-value > 0.1, we have little evidence against H0. Step 5: Summarize the results. There is not enough evidence to support the claim that the amount of alcohol a person consumes is dependent on the individual’s gender.
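The same sketch reproduces the Example 2 statistic; using exact expected frequencies gives approximately 0.28:

```python
# Observed frequencies: rows = Male, Female; columns = Low, Moderate, High
observed = [[10, 9, 8],
            [13, 16, 12]]

row_totals = [sum(row) for row in observed]        # 27, 41
col_totals = [sum(col) for col in zip(*observed)]  # 23, 25, 20
grand_total = sum(row_totals)                      # 68

# Chi-square statistic: sum of (O - E)^2 / E over all cells,
# with E = row total * column total / grand total
chi2 = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / grand_total) ** 2
    / (row_totals[i] * col_totals[j] / grand_total)
    for i in range(len(row_totals))
    for j in range(len(col_totals))
)
print(round(chi2, 2))  # 0.28
```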
Terms to know and use
Related variables
Correlation
Association
Causation
Positively related
Negatively related
Monotonic relationship
Monotonic increasing
Monotonic decreasing
Correlation coefficient
Pearson correlation coefficient
Spearman rank correlation coefficient
Terms to know and use
Explanatory variable
Response variable
Linear regression model
Regression line
Least squares line
Contingency table
Independence
Association