
Statistics for Business and Economics Module 2: Regression and time series analysis Spring 2010 Lecture 2: Examining the relationship between two quantitative.


1 Statistics for Business and Economics Module 2: Regression and time series analysis Spring 2010 Lecture 2: Examining the relationship between two quantitative variables Priyantha Wijayatunga, Department of Statistics, Umeå University priyantha.wijayatunga@stat.umu.se These materials are adapted from copyrighted lecture slides (© 2009 W.H. Freeman and Company) from the homepage of the book: The Practice of Business Statistics: Using Data for Decisions, Second Edition, by Moore, McCabe, Duckworth and Alwan.

2 Examining the relationship between two quantitative variables Reference to the book: Chapters 2.1 and 2.2  Explanatory and response variables  Scatterplots and their interpretation, outliers  Categorical variables in scatterplots  Quantifying linear relationships with the correlation coefficient “r”  Properties of the correlation coefficient

3 Examining Relationships Most statistical studies involve more than one variable. Questions:  What individuals do the data describe?  What variables are present and how are they measured?  Are all of the variables quantitative?  Do some of the variables explain or even cause changes in other variables?

4 Relationships between two variables Most models are linear 1. Probabilistic models: e.g., real estate prices in Umeå may be related to population per km² in the local area plus some random variation. 2. Deterministic models: e.g., in electric circuit theory V = IR (apart from measurement errors, the voltage across a given wire is proportional to the current flowing through it). Most models in economics and finance may be probabilistic and often linear too!
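The difference between the two kinds of model can be sketched in a few lines of Python. This is only an illustration: the function names, the coefficients of the price model, and the noise level are hypothetical and do not come from the lecture.

```python
import random

def voltage(current_amps, resistance_ohms):
    """Deterministic model: V = I * R, same output for the same input."""
    return current_amps * resistance_ohms

def simulated_price(density, intercept, slope, noise_sd, rng):
    """Probabilistic linear model: a linear part plus random variation."""
    return intercept + slope * density + rng.gauss(0.0, noise_sd)

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Deterministic: repeating the calculation gives the identical value.
v1 = voltage(2.0, 5.0)
v2 = voltage(2.0, 5.0)

# Probabilistic: same population density, different simulated prices.
# All four numeric arguments are hypothetical, chosen for illustration.
p1 = simulated_price(100.0, 500.0, 3.0, 50.0, rng)
p2 = simulated_price(100.0, 500.0, 3.0, 50.0, rng)

print(v1, v2)    # both 10.0
print(p1 != p2)  # the random term makes the two prices differ
```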

5 Looking at relationships  Start with a graph  Look for an overall pattern and deviations from the pattern  Use numerical descriptions of the data and overall pattern (if appropriate)

6 Explanatory and response variables  A response variable measures or records an outcome of a study. Also called dependent variable.  An explanatory variable explains changes in the response variable (also called independent variable). Example: response variable: real estate price; explanatory variable: population per km²

7 Scatterplot  A scatterplot shows the relationship between two quantitative variables measured on the same individuals.  Typically, the explanatory or independent variable is plotted on the x axis, and the response or dependent variable is plotted on the y axis.  Each individual in the data appears as a point in the plot.

8 Here, we have two quantitative variables for each of 16 students.  1) How many beers they drank, and  2) Their blood alcohol level (BAC)  We are interested in the relationship between the two variables: How is one affected by changes in the other one?

Student  Beers  Blood Alcohol
1        5      0.1
2        2      0.03
3        9      0.19
6        7      0.095
7        3      0.07
9        3      0.02
11       4      0.07
13       5      0.085
4        8      0.12
5        3      0.04
8        5      0.06
10       5      0.05
12       6      0.1
14       7      0.09
15       1      0.01
16       4      0.05

9 Scatterplot example: BAC plotted against the number of beers for the same 16 students (Student, Beers, BAC data as on the previous slide).
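A scatterplot of these data could be drawn roughly as follows. This is a sketch that assumes the matplotlib library is installed; the function name `plot_beers_vs_bac` and the figure size are choices made here, not part of the lecture. Beers is placed on the x axis as the explanatory variable and BAC on the y axis as the response.

```python
# The 16 (beers, BAC) pairs, in the order they appear on the slide.
beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.1, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.1, 0.09, 0.01, 0.05]

def plot_beers_vs_bac():
    """Draw the scatterplot and return the figure.

    matplotlib is imported here, inside the function, so the data
    above can be used even where matplotlib is not available.
    """
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(5, 5))  # roughly square plot
    ax.scatter(beers, bac)                  # one point per student
    ax.set_xlabel("Beers (explanatory variable)")
    ax.set_ylabel("Blood alcohol content (response variable)")
    ax.set_title("BAC versus number of beers")
    return fig
```

Calling `plot_beers_vs_bac()` and then `matplotlib.pyplot.show()` displays the figure.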

10 Some plots don’t have clear explanatory and response variables. Do calories explain sodium amounts? Does percent return on Treasury bills explain percent return on common stocks?

11 Interpreting scatterplots  After plotting two variables on a scatterplot, we describe the relationship by examining the form, direction, and strength of the association. We look for an overall pattern …  Form: linear, curved, clusters, no pattern  Direction: positive, negative, no direction  Strength: how closely the points fit the “form”  … and deviations from that pattern.  Outliers

12 Form and direction of an association Linear Nonlinear No relationship

13 Positive association: High values of one variable tend to occur together with high values of the other variable. Negative association: High values of one variable tend to occur together with low values of the other variable.

14 No relationship: X and Y vary independently. Knowing X tells you nothing about Y.

15 Strength of the association The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form. With a strong relationship, you can get a pretty good estimate of y if you know x. With a weak relationship, for any x you might get a wide range of y values.

16 This is a very strong relationship. The daily amount of gas consumed can be predicted quite accurately for a given temperature value. This is a weak relationship. For a particular state median household income, you can’t predict the state per capita income very well.

17 How to scale a scatterplot Using an inappropriate scale for a scatterplot can give an incorrect impression. Both variables should be given a similar amount of space: the plot should be roughly square, and the points should occupy all the plot space (no blank space).

18 Outliers An outlier is a data value that has a low probability of occurrence (i.e., it is unusual or unexpected). In a scatterplot, outliers are points that fall (far) outside of the overall pattern of the relationship.

19 Outliers Not an outlier: The upper right-hand point here is not an outlier of the relationship. It is what you would expect for this many beers given the linear relationship between beers/weight and blood alcohol. Outlier: This point is not in line with the others, so it is an outlier of the relationship.

20 IQ score and Grade Point Average a)Describe in words what this plot shows. b)Describe the direction, shape, and strength. Are there outliers? c)What is the deal with these people?

21 Categorical variables in scatterplots Often, things are not simple and one-dimensional. We need to group the data into categories to reveal trends. What may look like a positive linear relationship is in fact a series of negative linear associations. Plotting different habitats in different colors allows us to make that important distinction.

22 Comparison of men and women racing records over time. Each group shows a very strong negative linear relationship that would not be apparent without the gender categorization. Relationship between lean body mass and metabolic rate in men and women. Both men and women follow the same positive linear trend, but women show a stronger association. As a group, males typically have larger values for both variables.

23 Categorical explanatory variables When the explanatory variable is categorical, you cannot make a scatterplot, but you can compare the different categories side-by-side on the same graph (boxplots, or mean ± standard deviation). Comparison of income (quantitative response variable) for different education levels (five categories). But be careful in your interpretation: This is NOT a positive association, because education is not quantitative.

24 Stronger association? Two scatterplots of the same data. The straight-line pattern in the lower plot appears stronger because of the surrounding open space.

25 The correlation coefficient "r"  The correlation coefficient is a measure of the direction and strength of a linear relationship.  It is calculated using the mean and the standard deviation of both the x and y variables.  Correlation can only be used to describe quantitative variables. Categorical variables don’t have means and standard deviations.

26 The correlation coefficient "r" Time to swim: x̄ = 35, s_x = 0.7. Pulse rate: ȳ = 140, s_y = 9.5.
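In LaTeX notation, the standard definition of the sample correlation coefficient in terms of the means and standard deviations of both variables can be written as below. The symbols n (number of individuals) and the index i are notation introduced here; the second form, with sums of deviations, is algebraically equivalent because the factors of n - 1 cancel.

```latex
r = \frac{1}{n-1} \sum_{i=1}^{n}
      \left( \frac{x_i - \bar{x}}{s_x} \right)
      \left( \frac{y_i - \bar{y}}{s_y} \right)
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
s_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2},
\quad
s_y = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2}
```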

27 Scatterplot: data on baby birth length and weight. The two variables seem to be linearly related, so we measure the correlation.

28 Correlation between weight and length for newborn babies

Correlations
                              LENGTH    WEIGHT
LENGTH  Pearson Correlation   1         0.765**
        Sig. (2-tailed)       .         0.000
        N                     35        35
WEIGHT  Pearson Correlation   0.765**   1
        Sig. (2-tailed)       0.000     .
        N                     35        35
**. Correlation is significant at the 0.01 level (2-tailed).

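29 Calculating the sample correlation coefficient If you have n observations on a pair of variables X and Y (X: x1, x2, x3, ..., xn; Y: y1, y2, y3, ..., yn), the sample correlation coefficient between X and Y is r = Σ(xi − x̄)(yi − ȳ) / √( Σ(xi − x̄)² · Σ(yi − ȳ)² ), where x̄ and ȳ are the sample means and each sum runs over i = 1, ..., n.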

30 Detailed example: For 6 randomly selected students, the number of hours spent studying for the exam and the exam marks are recorded.

student  hours (x)  marks (y)
1        10         17
2        20         32
3        30         58
4        40         60
5        50         87
6        60         99
Total    210        353
mean     35         58.83

31 Detailed example: For 6 randomly selected students, the number of hours spent studying for the exam and the exam marks are recorded.

student  hours (x)  marks (y)  x − x̄   y − ȳ    (x − x̄)(y − ȳ)  (x − x̄)²  (y − ȳ)²
1        10         17         -25     -41.83   1045.83         625       1750.03
2        20         32         -15     -26.83   402.50          225       720.03
3        30         58         -5      -0.83    4.17            25        0.69
4        40         60         5       1.17     5.83            25        1.36
5        50         87         15      28.17    422.50          225       793.36
6        60         99         25      40.17    1004.17         625       1613.36
Total    210        353                         2885            1750      4878.83
mean     35         58.83
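The column sums in this table are all that is needed to obtain r. A minimal Python sketch of the calculation, using the sum-of-deviations form of the Pearson formula and only the standard library (the variable names are chosen here for illustration):

```python
from math import sqrt

hours = [10, 20, 30, 40, 50, 60]  # x, from the slide
marks = [17, 32, 58, 60, 87, 99]  # y, from the slide

n = len(hours)
mean_x = sum(hours) / n  # 35
mean_y = sum(marks) / n  # about 58.83

# Totals of the last three columns of the table
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, marks))
s_xx = sum((x - mean_x) ** 2 for x in hours)
s_yy = sum((y - mean_y) ** 2 for y in marks)

# Sample correlation coefficient
r = s_xy / sqrt(s_xx * s_yy)

print(round(s_xy, 2), round(s_xx, 2), round(s_yy, 2))  # 2885.0 1750.0 4878.83
print(round(r, 3))                                     # 0.987
```

The result agrees with the value 0.987 reported in the software output on the next slide.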

32 Detailed example

Correlations
                             hours     marks
hours  Pearson Correlation   1         0.987**
       Sig. (2-tailed)                 0.000
       N                     6         6
marks  Pearson Correlation   0.987**   1
       Sig. (2-tailed)       0.000
       N                     6         6
**. Correlation is significant at the 0.01 level (2-tailed).

33 Facts about correlation  r ignores the distinction between response and explanatory variables  r measures the strength and direction of a linear relationship between two quantitative variables  r is not affected by changes in the unit of measurement  Positive value of r means association between the two variables is positive  Negative value of r means association between the variables is negative  r is always between -1 and +1  r is strongly affected by outliers
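Several of these facts can be checked numerically. The sketch below reuses the hours and marks data from the detailed example; the helper `pearson_r` is written here for illustration and is not taken from the lecture.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient (sum-of-deviations form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

hours = [10, 20, 30, 40, 50, 60]
marks = [17, 32, 58, 60, 87, 99]

r = pearson_r(hours, marks)

# r ignores which variable is explanatory and which is response
r_swapped = pearson_r(marks, hours)

# r is not affected by a change of unit (hours converted to minutes)
minutes = [60 * h for h in hours]
r_minutes = pearson_r(minutes, marks)

# Reversing the direction of one variable flips the sign of r
r_negated = pearson_r(hours, [-m for m in marks])

print(round(r, 3), round(r_swapped, 3), round(r_minutes, 3), round(r_negated, 3))
```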

34 "r" ranges from -1 to +1 Strength: how closely the points follow a straight line. Direction: is positive when individuals with higher X values tend to have higher values of Y.

35 When variability in one or both variables decreases, the correlation coefficient gets stronger (closer to +1 or -1).

36 No matter how strong the association, r does not describe curved relationships. Correlation only describes linear relationships
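A small constructed example makes this concrete. Here y is determined exactly by x through y = x², yet the correlation is zero because the relationship is not linear. The data are made up for illustration, and `pearson_r` is a helper written here.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient (sum-of-deviations form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Perfect curved (quadratic) relationship, symmetric around x = 0
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curved = pearson_r(xs, ys)
print(abs(r_curved) < 1e-12)  # True: r is zero despite a perfect relationship
```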

37 Influential points Correlations are calculated using means and standard deviations, and thus are NOT resistant to outliers. Just moving one point away from the general trend here weakens the correlation from -0.91 to -0.75.
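The same effect can be reproduced with the hours and marks data from the detailed example. In the sketch below, the altered mark for the sixth student (20 instead of 99) is a hypothetical value introduced only to show how a single point moved away from the trend changes r; `pearson_r` is a helper written here.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient (sum-of-deviations form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

hours = [10, 20, 30, 40, 50, 60]
marks = [17, 32, 58, 60, 87, 99]

# Hypothetical change: the last student's mark is moved far off the trend
marks_with_outlier = [17, 32, 58, 60, 87, 20]

r_original = pearson_r(hours, marks)
r_outlier = pearson_r(hours, marks_with_outlier)

print(round(r_original, 3))    # 0.987
print(r_outlier < r_original)  # True: one point weakens the correlation
```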

38 Review Example  Estimate r: 1. r = 1.00 2. r = -0.94 3. r = 1.12 4. r = 0.94 5. r = 0.21 (axis values in 1000’s)

