Published by Eric Fowler. Modified over 9 years ago.
1
Maureen Meadows, Senior Lecturer in Management, Open University Business School
2
Discuss different forms of dependency and correlation that we might find in our datasets. Explore why it might sometimes be problematic to get a good measure of correlation. Look at examples of data analysis where correlation is an important issue. Discuss why it is important to deal with correlations, and not to forget about them!
3
Dependence is any statistical relationship between two random variables or two sets of data. In other words, the variables are not independent of each other. Examples include the relationship between the heights of parents and the heights of their children, and the relationship between the demand for a product and its price.
4
Two variables are said to be correlated if changes in one variable are associated with changes in the other. So, if we know how one variable is changing, we have a good idea of how the other is likely to be changing too. Hence correlation is widely used in many forms of forecasting.
5
Pearson’s product-moment correlation coefficient is also known as r, R, or Pearson's r. It is a measure of the strength and direction of the association, in particular the linear relationship, between two (metric) variables. It is defined as the covariance of the variables divided by the product of their standard deviations.
6
Pearson's r is a measure of the linear dependence or correlation between the two variables. The sign (+ or -) indicates the direction of the relationship. The value can range from -1 to +1, with +1 indicating a perfect positive linear relationship, 0 indicating no linear relationship, and -1 indicating a perfect negative (inverse) linear relationship.
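As an illustration of the definition above, here is a minimal Python sketch that computes Pearson's r as the sample covariance divided by the product of the sample standard deviations. The function name `pearson_r` and the height data are hypothetical, chosen only for this example.

```python
import math

def pearson_r(x, y):
    """Pearson's r: covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample covariance and sample standard deviations (n - 1 in the denominator).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return cov / (sd_x * sd_y)

# Hypothetical data: heights of parents and their children, in cm.
parent_height = [160, 165, 170, 175, 180]
child_height = [162, 166, 171, 173, 181]

r = pearson_r(parent_height, child_height)
print(f"r = {r:.3f}")
```

A value close to +1 here reflects the strong positive linear association built into the made-up data; real height data would typically show a weaker relationship.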
7
A rank correlation is any of several statistics that measure the relationship between rankings of different ordinal variables or different rankings of the same variable, where a "ranking" is the assignment of the labels "first", "second", "third", etc. to different observations of a particular variable. A rank correlation coefficient measures the degree of similarity between two rankings, and can be used to assess the significance of the relation between them.
8
Spearman’s rank correlation coefficient is a measure of how well the relationship between two variables can be described by a monotonic function. Kendall's tau rank correlation coefficient compares the number of pairs of observations that are ranked in the same order in both data sets (concordant pairs) with the number ranked in the opposite order (discordant pairs). Goodman and Kruskal's gamma is a measure of the strength of association of cross-tabulated data when both variables are measured at the ordinal level.
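The three rank-based measures can be sketched in a few lines of Python. This is an illustrative sketch, not a reference implementation: it computes Spearman's coefficient as Pearson's r applied to the ranks, Kendall's tau in its simplest form (tau-a, which makes no correction for ties), and Goodman and Kruskal's gamma from the same concordant and discordant pair counts. All function names and data are hypothetical.

```python
import math
from itertools import combinations

def ranks(values):
    """Rank from 1 (smallest) upward; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson_r(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's coefficient: Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

def concordant_discordant(x, y):
    """Count pairs ordered the same way (concordant) or oppositely (discordant)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return concordant, discordant

def kendall_tau_a(x, y):
    c, d = concordant_discordant(x, y)
    n = len(x)
    return (c - d) / (n * (n - 1) / 2)

def goodman_kruskal_gamma(x, y):
    c, d = concordant_discordant(x, y)
    return (c - d) / (c + d)

# Hypothetical data: y is a monotonic but non-linear function of x.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
print(pearson_r(x, y), spearman_rho(x, y), kendall_tau_a(x, y))

# Hypothetical rankings that differ by one swapped pair.
judge_1 = [1, 2, 3, 4, 5]
judge_2 = [2, 1, 3, 4, 5]
print(spearman_rho(judge_1, judge_2), kendall_tau_a(judge_1, judge_2),
      goodman_kruskal_gamma(judge_1, judge_2))
```

In the first data set the rank-based measures equal 1 because the relationship is perfectly monotonic, while Pearson's r falls below 1 because the relationship is not linear.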
9
Is a correlation coefficient significantly different from zero or not? Example: consider three correlation coefficients, 0.50, 0.35, and 0.17. Assume that we want to test whether there is a significant relationship between the two variables at hand. The null hypothesis (H0) to be tested is that these r values are not statistically different from zero (rho = 0).
10
For rho = 0, H0 can be tested using a two-tailed t-test at a given confidence level, usually 95%. If |t calculated| ≥ t table, H0 is rejected. If |t calculated| < t table, H0 is not rejected and there is no significant correlation between the variables. Here t calculated is computed as $t = r / SE_r = r\sqrt{(n - 2)/(1 - r^2)}$, while t table values are taken from a standard t table with n - 2 degrees of freedom.
11
For n = 14, none of the three r values (0.50, 0.35, and 0.17) is statistically different from zero. For n = 30, r = 0.50 is statistically different from zero, while r = 0.35 and r = 0.17 are not. Put another way, r = 0.50 is not statistically different from zero when n is equal to or less than 14, while r = 0.35 is not different from zero when n is equal to or less than 30. Finally, r = 0.17 is not statistically different from zero at any of the sample sizes tested.
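The results for n = 14 and n = 30 can be checked with a short Python sketch of the t-test for rho = 0. The critical values are the two-tailed 5% entries from a standard t table for 12 and 28 degrees of freedom; the function and variable names are hypothetical.

```python
import math

# Two-tailed critical t values at the 5% significance level (95% confidence),
# taken from a standard t table and keyed by degrees of freedom (n - 2).
T_CRITICAL_05 = {12: 2.179, 28: 2.048}

def t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)) for testing H0: rho = 0."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

def is_significant(r, n):
    """Reject H0 when |t calculated| is at least the tabulated critical value."""
    return abs(t_statistic(r, n)) >= T_CRITICAL_05[n - 2]

for n in (14, 30):
    for r in (0.50, 0.35, 0.17):
        verdict = "reject H0" if is_significant(r, n) else "do not reject H0"
        print(f"n = {n}, r = {r:.2f}, t = {t_statistic(r, n):.3f}: {verdict}")
```

Only the combination r = 0.50 with n = 30 leads to rejection of H0, which matches the pattern described above. The lookup table covers only these two sample sizes; other values of n would need further entries from a t table.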
12
A correlation matrix is a table showing the correlations between a set of variables. It is often inspected before multivariate methods, e.g. regression analysis or factor analysis, are applied. Examples: Kim et al (2011), Mohammed (2013).
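To make the idea of a table of pairwise correlations concrete, here is a minimal Python sketch that builds one from a small set of variables. The variable names and scores are invented for illustration and are not taken from the cited studies.

```python
import math

def pearson_r(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(data):
    """Return a nested dict holding Pearson's r for every pair of variables."""
    names = list(data)
    return {a: {b: pearson_r(data[a], data[b]) for b in names} for a in names}

# Hypothetical survey scores for three variables measured on six respondents.
survey = {
    "var_a": [3, 4, 2, 5, 4, 1],
    "var_b": [2, 4, 3, 5, 5, 1],
    "var_c": [3, 5, 2, 4, 5, 2],
}

matrix = correlation_matrix(survey)
print(" " * 8 + "  ".join(f"{name:>6}" for name in matrix))
for name, row in matrix.items():
    print(f"{name:>6}  " + "  ".join(f"{value:6.2f}" for value in row.values()))
```

The diagonal entries are 1 (each variable with itself) and the matrix is symmetric, so in practice only the lower or upper triangle is usually reported.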
13
The square of the correlation coefficient, typically denoted r², is called the coefficient of determination. It estimates the fraction of the variance in Y that is explained by X in a simple linear regression. Example: McDaniel (1981).
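The link between r² and explained variance can be verified numerically. The sketch below fits a simple least-squares line to made-up data and compares the squared Pearson correlation with the explained fraction of variance, computed as 1 minus the ratio of the residual sum of squares to the total sum of squares. Names and data are hypothetical.

```python
import math

def pearson_r(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def explained_variance_fraction(x, y):
    """Fit y = intercept + slope * x by least squares and return 1 - SS_res / SS_tot."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot

# Hypothetical data with a strong but imperfect linear relationship.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.7]

r = pearson_r(x, y)
explained = explained_variance_fraction(x, y)
print(f"r squared = {r ** 2:.6f}, explained fraction = {explained:.6f}")
```

The two printed values agree up to floating-point rounding, which holds for simple linear regression with one predictor and an intercept.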
14
Collinearity is an expression of the relationship between two variables; multicollinearity is the corresponding term for more than two variables. Two variables exhibit complete collinearity if their correlation coefficient is +1 or -1, and a complete lack of collinearity if their correlation coefficient is 0. Multicollinearity occurs when a variable is highly correlated with a set of other variables.
15
Multicollinearity is the extent to which a variable can be ‘explained’ by the other variables in the analysis. As multicollinearity increases, it complicates the interpretation of the variate (the linear combination of variables formed in a technique such as regression). It becomes more difficult to ascertain the effect of any single variable, because of the inter-relationships among the variables.
16
The variance inflation factor (VIF) is an indicator of the effect that the other independent variables have on the standard error of a regression coefficient. Large VIF values indicate a high degree of collinearity or multicollinearity among the independent variables. The VIF is the reciprocal of the tolerance value (VIF = 1/tolerance). Example: Zhou et al (2013).
17
Tolerance is another commonly used measure of collinearity and multicollinearity. The tolerance of a variable is 1 - r², where r² is the coefficient of determination for the prediction of that variable by the other independent variables. As the tolerance grows smaller, the variable is more highly predicted by the other independent variables (collinearity).
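The relationship between tolerance and VIF can be sketched for the simplest case of two independent variables. With only two predictors, the r² for predicting one from the other is just their squared Pearson correlation; with more predictors it would have to come from a multiple regression, which this sketch does not attempt. Names and data are hypothetical.

```python
import math

def pearson_r(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def tolerance_and_vif(r_squared):
    """Tolerance = 1 - r^2 and VIF = 1 / tolerance, for a given r^2."""
    tolerance = 1 - r_squared
    return tolerance, 1 / tolerance

# Hypothetical independent variables that are fairly strongly correlated.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]

# Two-predictor case only: r^2 for predicting x1 from x2 is the squared correlation.
r_squared = pearson_r(x1, x2) ** 2
tolerance, vif = tolerance_and_vif(r_squared)
print(f"r squared = {r_squared:.3f}, tolerance = {tolerance:.3f}, VIF = {vif:.2f}")
```

As the correlation between the two predictors approaches +1 or -1, the tolerance approaches 0 and the VIF grows without bound.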
18
Variables that are multicollinear are implicitly weighted more heavily. E.g. if we cluster on 10 equally weighted variables and they form two dimensions (one of 8 variables and the other of the remaining 2), then the first dimension will have four times as many chances to affect the similarity measure.
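The implicit weighting can be illustrated with a small arithmetic sketch. It assumes squared Euclidean distance as the (dis)similarity measure, which the slide does not specify, and uses invented standardized scores: two observations that differ by one unit on the 8-variable dimension end up four times as far apart as two that differ by one unit on the 2-variable dimension.

```python
def squared_euclidean(p, q):
    """Squared Euclidean distance between two observations (an assumed measure)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

# Hypothetical standardized scores on 10 equally weighted variables:
# the first 8 variables measure dimension A, the last 2 measure dimension B.
baseline = [0.0] * 10
differs_on_a = [1.0] * 8 + [0.0] * 2  # one unit away on every dimension-A variable
differs_on_b = [0.0] * 8 + [1.0] * 2  # one unit away on every dimension-B variable

distance_a = squared_euclidean(baseline, differs_on_a)
distance_b = squared_euclidean(baseline, differs_on_b)
print(distance_a, distance_b, distance_a / distance_b)
```

The same one-unit difference counts four times as much when it occurs on the dimension represented by 8 variables, even though no variable was given extra weight explicitly.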
19
In factor analysis, some degree of multicollinearity is desirable, because the objective is to identify sets of variables that are interrelated. Examples: Kim et al (2011), Mohammed (2013).
20
Hair, Anderson, Tatham and Black, Multivariate Data Analysis, 7th edition (Prentice Hall, 2009).
Kim, JY, Shim, JP and Ahn, KM (2011) ‘Social Networking Service: Motivation, Pleasure, and Behavioral Intention to Use’, Journal of Computer Information Systems, Summer, 92-101.
McDaniel, SW (1981) ‘Multicollinearity in advertising-related data’, Journal of Advertising Research, 21:3, 59-63.
Mohammed, S (2013) ‘Factors Affecting E-Banking Usage in India: an Empirical Analysis’, Economic Insights – Trends and Challenges, Vol. II (LXV) No. 1, 17-25.
Zhou, X, Han, Y and Wang, R (2013) ‘An Empirical Investigation on Firms’ Proactive and Passive Motivation for Bribery in China’, Journal of Business Ethics, 118:461-472.