M248: Analyzing data Block D.


M248: Analyzing data Block D

Related variables; Regression. Block D, Units D3 and D2.
Contents
Introduction
Section 1: Correlation
Section 2: Measures of correlation
Section 3: Regression
Section 4: Independence (association in contingency tables)
Terms to know and use

Introduction
The slides cover the following points:
Correlation between two variables and the strength of this correlation
Regression
Studying contingency tables

Section 1: Correlation
Two random variables are said to be related (or correlated, or associated) if knowing the value of one variable tells you something about the value of the other. Alternatively, we say there is a relationship between the two variables. The first thing to do when investigating a possible relationship between two variables is to produce a scatterplot of the data.

Section 1: Correlation Section 1.1: Are the variables related? The two variables are said to be positively related if the pattern of the points in the scatterplot slopes upwards from left to right. Example:

Section 1: Correlation The two variables are said to be negatively related if the pattern of the points in the scatterplot slopes downwards from left to right. Example:

Section 1: Correlation
Three scatterplot patterns (plots omitted here):
A strong positive linear relationship between the two variables
An overall downward trend, but no suggestion of any clear relationship between the two variables
A strong relationship between the two variables, but one that is not linear

Section 1: Correlation
Sometimes a relationship between two variables is more complicated, and the variables cannot be classified as either positively or negatively related; the scatterplot then shows a more complex pattern. Read Examples 1.1, 1.2, 1.3 and 1.4. Solve Activity 1.1.

Section 1: Correlation

Section 1: Correlation
Section 1.2: Correlation and Causation
Causation and correlation are not equivalent. Causation means that the value of one variable is caused by the value of the other, while correlation means only that there is a relationship between the two variables. Correlation does not imply causation.

Section 2: Measures of correlation
There are two measures of correlation: the Pearson correlation and the Spearman rank correlation. These measures are called correlation coefficients. A correlation coefficient is a number between −1 and +1; the closer it is to either limit, the stronger the relationship between the two variables. Correlation coefficients that measure how well a straight line can explain the relationship between two variables are called linear correlation coefficients.

Section 2: Measures of correlation
The range of the correlation coefficient is from −1 to +1. If there is a strong positive linear relationship between the variables, the value of r will be close to +1. If there is a strong negative linear relationship between the variables, the value of r will be close to −1.

Section 2: Measures of correlation
Section 2.1: The Pearson correlation coefficient
If two variables are positively related, that is, they increase together and decrease together, the correlation coefficient will be positive; if they are negatively related, it will be negative. A correlation coefficient of 0 implies that there is no systematic linear relationship between the two variables.

Section 2: Measures of correlation
The Pearson correlation coefficient is a measure of how well a straight line can explain the relationship between two variables. It is only appropriate to use this coefficient if the scatterplot shows a roughly linear pattern.

Example 1: Car Rental Companies

Company   Cars x (10,000s)   Income y (billions)       xy        x²       y²
A               63.0                 7.0            441.00   3969.00    49.00
B               29.0                 3.9            113.10    841.00    15.21
C               20.8                 2.1             43.68    432.64     4.41
D               19.1                 2.8             53.48    364.81     7.84
E               13.4                 1.4             18.76    179.56     1.96
F                8.5                 1.5             12.75     72.25     2.25
Σ              153.8                18.7            682.77   5859.26    80.67

Example 1: Car Rental Companies
Σx = 153.8, Σy = 18.7, Σxy = 682.77, Σx² = 5859.26, Σy² = 80.67, n = 6. Substituting into r = (nΣxy − ΣxΣy) / √([nΣx² − (Σx)²][nΣy² − (Σy)²]) gives r ≈ 0.982.
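The value of r can be checked with a short script (a sketch, not part of the original slides), computing the Pearson coefficient directly from the computational formula and the raw data in the table above.

```python
import math

# Car rental data: x = cars owned (in 10,000s), y = annual income (in billions)
x = [63.0, 29.0, 20.8, 19.1, 13.4, 8.5]
y = [7.0, 3.9, 2.1, 2.8, 1.4, 1.5]

def pearson_r(x, y):
    """Pearson r from the computational formula used in the slides."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    num = n * sxy - sx * sy
    den = math.sqrt((n * sum(a * a for a in x) - sx ** 2) *
                    (n * sum(b * b for b in y) - sy ** 2))
    return num / den

r = pearson_r(x, y)
print(round(r, 3))  # 0.982
```

The same function works for Examples 2 and 3 by swapping in the other data sets.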

Example 2: Absences/Final Grades

Student   Absences x   Final grade y (pct.)      xy     x²       y²
A              6               82                492     36    6,724
B              2               86                172      4    7,396
C             15               43                645    225    1,849
D              9               74                666     81    5,476
E             12               58                696    144    3,364
F              5               90                450     25    8,100
G              8               78                624     64    6,084
Σ             57              511              3,745    579   38,993

Example 2: Absences/Final Grades
Σx = 57, Σy = 511, Σxy = 3745, Σx² = 579, Σy² = 38,993, n = 7, giving r ≈ −0.944: a strong negative linear relationship between absences and final grade.

Example 3: Exercise/Milk Intake

Subject   Hours x   Amount y      xy     x²      y²
A            3         48        144      9   2,304
B            0          8          0      0      64
C            2         32         64      4   1,024
D            5         64        320     25   4,096
E            8         10         80     64     100
F            5         32        160     25   1,024
G           10         56        560    100   3,136
H            2         72        144      4   5,184
I            1         48         48      1   2,304
Σ           36        370      1,520    232  19,236

Example 3: Exercise/Milk Intake
Σx = 36, Σy = 370, Σxy = 1520, Σx² = 232, Σy² = 19,236, n = 9, giving r ≈ 0.067: essentially no linear relationship between hours of exercise and milk intake.

Section 2: Measures of correlation
Section 2.2: The Spearman rank correlation coefficient
Replacing the original data by their ranks and then calculating the Pearson correlation coefficient of the ranks gives the Spearman rank correlation coefficient, denoted by rs. The value of rs is a measure of the linearity of the relationship between the ranks.

Section 2: Measures of correlation
Example: GNP and adult literacy. Spearman's rank correlation coefficient (rs) is the best method to use here.

Country        GNP per capita   % adult literacy
Nepal                 210             39.7
Sudan                 290             55.5
Gambia                340             34.5
Peru                 2460             89
Turkey               3160             81.4
Brazil               4570             84
Argentina            8970             97
Israel              15940             96
U.A.E.              18220             74.3
Netherlands         24760            100

Remember, Spearman's rank correlation is computed from ranks (ordinal data); it is necessary, therefore, to rank-order the data first.

Section 2: Measures of correlation Example: GNP and adult literacy

Section 2: Measures of correlation
Example: GNP and adult literacy. For the ranks: Σx = 55, Σy = 55, Σxy = 363, Σx² = 385, Σy² = 385, n = 10, giving rs ≈ 0.73.
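As a sketch (not part of the slides), rs can be reproduced by ranking both variables and applying the Pearson formula to the ranks; the data have no ties, so a simple sort-based ranking suffices.

```python
import math

# GNP per capita and adult literacy (%), from the table above
gnp      = [210, 290, 340, 2460, 3160, 4570, 8970, 15940, 18220, 24760]
literacy = [39.7, 55.5, 34.5, 89, 81.4, 84, 97, 96, 74.3, 100]

def ranks(values):
    """Rank from 1 (smallest) to n; valid here because there are no ties."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def pearson_r(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    den = math.sqrt((n * sum(a * a for a in x) - sx ** 2) *
                    (n * sum(b * b for b in y) - sy ** 2))
    return (n * sxy - sx * sy) / den

# Spearman's rs is the Pearson correlation of the ranks
rs = pearson_r(ranks(gnp), ranks(literacy))
print(round(rs, 2))  # 0.73
```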

Section 2: Measures of correlation
A relationship is known as a monotonic increasing relationship if the value of rs is equal to +1; that is, the two variables have an exact (possibly curvilinear) positive relationship. Similarly, a data set has a Spearman rank correlation coefficient of −1 if the two variables have a monotonic decreasing relationship.

Section 2: Measures of correlation
Section 2.3: Testing for association
The sampling distribution of the Pearson correlation: under the null hypothesis that there is no association between the two variables, the statistic
t = R√(n − 2) / √(1 − R²)
has a t distribution with n − 2 degrees of freedom.

Example: Car Rental Companies
Test the significance of the correlation coefficient found in Example 1. Use r = 0.982.
H0: ρ = 0. This null hypothesis means that there is no correlation between the x and y variables in the population.
H1: ρ ≠ 0. This alternative hypothesis means that there is a significant correlation between the variables in the population.

Example: Car Rental Companies
Step 2: Compute the test value: t = r√(n − 2) / √(1 − r²) = 0.982√4 / √(1 − 0.982²) ≈ 10.4.
Step 3: Compute the p-value. With df = 4 (use Table 3 of the handbook), 0 < p-value < 2(1 − 0.999), i.e. 0 < p-value < 0.002.
Step 4: Decision making. Since the p-value ≤ 0.01, we have strong evidence against H0.
Step 5: Summarize the results. There is a relationship between the number of cars a rental agency owns and its annual income.
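Assuming the test statistic t = r√(n − 2) / √(1 − r²) with n − 2 degrees of freedom (consistent with the df = 4 used above), the test value can be sketched as:

```python
import math

# Significance test for the Pearson correlation of Example 1
r, n = 0.982, 6
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(round(t, 1))  # 10.4
```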

Section 2: Measures of correlation
The approximate sampling distribution of the Spearman correlation: for large samples, under the null hypothesis of no association, the same form of statistic,
t = Rs√(n − 2) / √(1 − Rs²),
has approximately a t distribution with n − 2 degrees of freedom.

Example: GNP and adult literacy
Test the significance of the correlation coefficient found in the example. Use rs = 0.73.
H0: ρs = 0. This null hypothesis means that there is no correlation between the x and y variables in the population.
H1: ρs ≠ 0. This alternative hypothesis means that there is a significant correlation between the variables in the population.

Example: GNP and adult literacy
Step 2: Compute the test value: t = rs√(n − 2) / √(1 − rs²) = 0.73√8 / √(1 − 0.73²) ≈ 3.02.
Step 3: Compute the p-value. With df = 8 (use Table 3 of the handbook), 2(1 − 0.995) < p-value < 2(1 − 0.99), i.e. 0.01 < p-value < 0.02.
Step 4: Since 0.01 < p-value ≤ 0.05, we have moderate evidence against H0.
Step 5: Summarize the results. There is a relationship between GNP and adult literacy.

Section 2: Measures of correlation Section 2.4: Correlation using MINITAB See MINITAB Videos on moodle

Regression If the value of the correlation coefficient is significant, the next step is to determine the equation of the regression line which is the data’s line of best fit.

Regression Best fit means that the sum of the squares of the vertical distance from each point to the line is at a minimum.

Least squares line of a linear regression
The response (dependent) variable is denoted by Y and the predictor/explanatory (independent) variable by x. The least squares line is y′ = a + bx, with slope b = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²) and intercept a = (Σy − bΣx) / n.

Example: Car Rental Companies
Find the equation of the regression line for the data in Example 1. Σx = 153.8, Σy = 18.7, Σxy = 682.77, Σx² = 5859.26, Σy² = 80.67, n = 6, giving y′ = 0.396 + 0.106x.

Example: Car Rental Companies
Use the equation of the regression line to predict the income of a car rental agency that has 200,000 automobiles. Here x = 20 corresponds to 200,000 automobiles, so y′ = 0.396 + 0.106(20) ≈ 2.52. Hence, when a rental agency has 200,000 automobiles, its income will be approximately $2.52 billion.
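The least squares coefficients and the prediction can be checked numerically (a sketch using the summary statistics quoted in Example 1):

```python
# Least squares line y' = a + bx for the car rental data
n = 6
sx, sy, sxy, sxx = 153.8, 18.7, 682.77, 5859.26

b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # slope
a = (sy - b * sx) / n                          # intercept

print(round(a, 3), round(b, 3))  # 0.396 0.106
print(round(a + b * 20, 2))      # 2.52, i.e. about $2.52 billion when x = 20
```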

Chi-Square Distributions
The chi-square variable is similar to the t variable in that its distribution is a family of curves based on the number of degrees of freedom. The symbol for chi-square is χ² (Greek letter chi, pronounced "ki"). A chi-square variable cannot be negative, and the distributions are skewed to the right.

Chi-Square Distributions At about 100 degrees of freedom, the chi-square distribution becomes somewhat symmetric. The area under each chi-square distribution is equal to 1.00, or 100%.

The Chi-Square Test
When data can be tabulated in table form in terms of frequencies, several types of hypotheses can be tested by using the chi-square test. The test of independence of variables is used to determine whether two variables are independent of or related to each other when a single sample is selected. Formula for the test:
χ² = Σ (O − E)² / E
where O = observed frequency, E = expected frequency, and d.f. = number of categories minus 1 (for the test of independence, d.f. = (R − 1)(C − 1), as given below).

Test for Independence The chi-square test can be used to test the independence of two variables. The hypotheses are: H0: There is no relationship between two variables. H1: There is a relationship between two variables. If the null hypothesis is rejected, there is some relationship between the variables. In order to test the null hypothesis, one must compute the expected frequencies, assuming the null hypothesis is true.

Contingency Tables (Association) When data are arranged in table form for the independence test, the table is called a contingency table. The degrees of freedom for any contingency table are d.f. = (rows – 1) (columns – 1) = (R – 1)(C – 1).

Example 1: College Education and Place of Residence
A sociologist wishes to see whether the number of years of college a person has completed is related to her or his place of residence. A sample of 88 people is selected and classified as shown. Can the sociologist conclude that a person's location is dependent on the number of years of college?

Location    No College   Four-Year Degree   Advanced Degree   Total
Urban           15              12                  8           35
Suburban         8              15                  9           32
Rural            6               8                  7           21
Total           29              35                 24           88

Example 1: College Education and Place of Residence Step 1: State the hypotheses and identify the claim. H0: A person’s place of residence is independent of the number of years of college completed. H1: A person’s place of residence is dependent on the number of years of college completed (claim).

Example 1: College Education and Place of Residence
Compute the expected values, E = (row total × column total) / grand total, shown in parentheses:

Location    No College     Four-Year Degree   Advanced Degree   Total
Urban       15 (11.53)       12 (13.92)          8 (9.55)        35
Suburban     8 (10.55)       15 (12.73)          9 (8.73)        32
Rural        6 (6.92)         8 (8.35)           7 (5.73)        21
Total       29               35                 24               88

Example 1: College Education and Place of Residence
Step 2: Compute the test value: χ² = Σ (O − E)² / E ≈ 3.01. The degrees of freedom are (3 − 1)(3 − 1) = 4.
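As a sketch (not part of the slides), the test value can be reproduced in Python, with the observed counts taken from the contingency table and the expected frequencies recomputed from the row and column totals:

```python
# Chi-square test value for the college-education example
observed = [[15, 12, 8],   # Urban
            [8, 15, 9],    # Suburban
            [6, 8, 7]]     # Rural

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand  # expected frequency
        chi2 += (o - e) ** 2 / e

print(round(chi2, 3))  # 3.006
```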

Example 1: College Education and Place of Residence
Step 3: Compute the p-value. Use Table 4 in the handbook: (1 − 0.5) < p-value < (1 − 0.1), i.e. 0.5 < p-value < 0.9.
Step 4: Making the decision. Since the p-value > 0.1, we have little evidence against H0.
Step 5: Summarize the results. There is not enough evidence to support the claim that a person's place of residence is dependent on the number of years of college completed.

Example 2: Alcohol and Gender
A researcher wishes to determine whether there is a relationship between the gender of an individual and the amount of alcohol consumed. A sample of 68 people is selected, and the following data are obtained. Can the researcher conclude that alcohol consumption is related to gender?

Gender    Low   Moderate   High   Total
Male       10       9        8      27
Female     13      16       12      41
Total      23      25       20      68

Example 2: Alcohol and Gender Step 1: State the hypotheses and identify the claim. H0: The amount of alcohol that a person consumes is independent of the individual’s gender. H1: The amount of alcohol that a person consumes is dependent on the individual’s gender (claim).

Example 2: Alcohol and Gender
Compute the expected values, shown in parentheses:

Gender    Low           Moderate      High           Total
Male      10 (9.13)      9 (9.93)      8 (7.94)       27
Female    13 (13.87)    16 (15.07)    12 (12.06)      41
Total     23            25            20              68

Example 2: Alcohol and Gender
Step 2: Compute the test value: χ² = Σ (O − E)² / E ≈ 0.281. The degrees of freedom are (2 − 1)(3 − 1) = (1)(2) = 2.

Example 2: Alcohol and Gender
Step 3: Compute the p-value. Use Table 4 in the handbook: (1 − 0.5) < p-value < (1 − 0.1), i.e. 0.5 < p-value < 0.9.
Step 4: Making the decision. Since the p-value > 0.1, we have little evidence against H0.
Step 5: Summarize the results. There is not enough evidence to support the claim that the amount of alcohol a person consumes is dependent on the individual's gender.
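The same sketch as for Example 1 verifies this test value, using the counts from the alcohol-and-gender table:

```python
# Chi-square test value for the alcohol-and-gender example
observed = [[10, 9, 8],    # Male:   low, moderate, high
            [13, 16, 12]]  # Female: low, moderate, high

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand  # expected frequency
        chi2 += (o - e) ** 2 / e

print(round(chi2, 3))  # 0.281
```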

Terms to know and use
Related variables; Correlation; Association; Causation; Positively related; Negatively related; Monotonic relationship; Correlation coefficient; Pearson correlation coefficient; Spearman rank correlation coefficient; Monotonic increasing; Monotonic decreasing

Terms to know and use
Explanatory variable; Response variable; Linear regression model; Regression line; Least squares line; Independence; Association; Contingency table