Published by Suzanna Harrington. Modified over 9 years ago.
1
Class 3 Relationship Between Variables CERAM February-March-April 2008 Lionel Nesta Observatoire Français des Conjonctures Economiques Lionel.nesta@ofce.sciences-po.fr
2
Introduction Typically, the social scientist is less interested in describing one variable than in describing the association between two or more variables. This class is devoted to testing whether two variables are related (associated). By relationship between variables, we mean any association between two dimensions, qualitative or quantitative or both, that appears to be systematic in some way.
3
Structure of the Class
- One qualitative (multinomial) and one quantitative (continuous/discrete) variable: analysis of variance
- Two qualitative (multinomial) variables: chi-square (χ²) independence test
- Two quantitative (continuous/discrete) variables: correlation coefficient
4
ANOVA
5
ANOVA: ANalysis Of VAriance
ANOVA is a generalization of Student's t test. The Student test applies to two categories only:
H₀: μ₁ = μ₂
H₁: μ₁ ≠ μ₂
ANOVA is a method to test whether the means of k groups are equal or not:
H₀: μ₁ = μ₂ = μ₃ = ... = μₖ
H₁: At least one mean differs significantly
6
ANOVA The method takes its name from the fact that it is based on measures of variance. The F-statistic is a ratio comparing the variance due to group differences (explained variance) with the variance due to other phenomena (unexplained variance). A higher F means that the groups explain more of the variation, and therefore that the differences between group means are more significant.
7
Revenues (in millions of US $)
       | Sector 1 | Sector 2 | Sector 3
Firm 1 | 18.0     | 21.5     | 34.8
Firm 2 | 18.0     | 21.5     | 34.8
Firm 3 | 18.0     | 21.5     | 34.8
Firm 4 | 18.0     | 21.5     | 34.8
Firm 5 | 18.0     | 21.5     | 34.8
8
Revenues (in millions of US $)
       | Sector 1 | Sector 2 | Sector 3
Firm 1 | 18.0
Firm 2 | 21.5
Firm 3 | 25.0
Firm 4 | 28.7
Firm 5 | 34.8
9
Revenues (in millions of US $)
       | Sector 1 | Sector 2 | Sector 3
Firm 1 | 19.6     | 23.7     | 30.8
Firm 2 | 19.4     | 28.4     | 32.9
Firm 3 | 21.9     | 28.5     | 35.3
Firm 4 | 21.2     | 31.7     | 31.8
Firm 5 | 24.6     | 37.0     | 35.7
Do sectors differ significantly in their revenues?
H₀: μ₁ = μ₂ = μ₃
H₁: At least one mean differs significantly.
10
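ANOVA
With k groups, n observations in total, n_j observations in group j, group means \bar{x}_j and grand mean \bar{x}, the total sum of squares splits into a between-group part and a within-group (residual) part:
$$ \underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij}-\bar{x})^2}_{SS_{total},\ df = n-1} = \underbrace{\sum_{j=1}^{k} n_j(\bar{x}_j-\bar{x})^2}_{SS_{between},\ df = k-1} + \underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij}-\bar{x}_j)^2}_{SS_{within}\ (residual),\ df = n-k} $$
This decomposition produces Fisher's statistic as follows:
$$ F = \frac{SS_{between}/(k-1)}{SS_{within}/(n-k)} $$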
11
The ANOVA decomposition on Revenues
Origin of variation  | SS    | d.f. | MSS   | F-Stat | Prob>F
SS-between           | 379.1 | 2    | 189.6 |        |
SS-within (residual) | 132.5 | 12   | 11.0  |        |
SS-total             | 511.6 | 14   | 36.54 | 17.7   | 0.0003
The result tells me that I can reject the null hypothesis H₀ with a 0.03% chance of rejecting it while it holds true (being wrong). I WILL TAKE THE CHANCE!!!
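The decomposition can be checked by hand. The sketch below recomputes the sums of squares from the three-sector revenue table using only the Python standard library. The function name `one_way_anova` is made up for this illustration, and the p-value line relies on a closed form that holds only when the numerator has exactly 2 degrees of freedom, as it does here. Recomputed from the raw figures, the ratio of the two mean squares comes out at about 17.2, with Prob>F rounding to 0.0003.

```python
# One-way ANOVA by hand on the revenue table (3 sectors, 5 firms each).
from statistics import mean

revenues = {
    "Sector 1": [19.6, 19.4, 21.9, 21.2, 24.6],
    "Sector 2": [23.7, 28.4, 28.5, 31.7, 37.0],
    "Sector 3": [30.8, 32.9, 35.3, 31.8, 35.7],
}

def one_way_anova(groups):
    """Return (SS_between, SS_within, df_between, df_within, F)."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean (explained part)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each observation around its own group mean (residual part)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_stat

ss_b, ss_w, df_b, df_w, f_stat = one_way_anova(list(revenues.values()))

# Closed-form tail probability of the F distribution.
# ASSUMPTION: valid only because df_between == 2 in this example.
p_value = (1 + df_b * f_stat / df_w) ** (-df_w / 2)

print(f"SS-between = {ss_b:.1f} (df = {df_b})")
print(f"SS-within  = {ss_w:.1f} (df = {df_w})")
print(f"F = {f_stat:.2f}, Prob>F = {p_value:.4f}")
```

For any other number of groups, the p-value would have to come from an F table or a statistics package rather than from the shortcut used here.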
12
SPSS Application: ANOVA
- Verify with an ANOVA that US companies are larger than those from the rest of the world
- Are there systematic sectoral differences in terms of labour, R&D and sales?
- Write out H₀ and H₁ for each variable
- SPSS menu: Analyze > Compare Means > One-Way ANOVA
- What do you conclude at the 5% level?
- What do you conclude at the 1% level?
13
SPSS Application: t test comparing means
15
Chi-Square Independence Test
16
Introduction to Chi-Square This part is devoted to the study of whether two qualitative (categorical) variables are independent:
H₀: Independent: the two qualitative variables do not exhibit any systematic association.
H₁: Dependent: the category of one qualitative variable is associated with the category of another qualitative variable in some systematic way which departs significantly from randomness.
17
The Four Steps Towards The Test
1. Build the cross tabulation to compute observed joint frequencies
2. Compute expected joint frequencies under the assumption of independence
3. Compute the chi-square (χ²) distance between observed and expected joint frequencies
4. Compute the significance of the χ² distance and conclude on H₀ and H₁
18
1. Cross Tabulation A cross tabulation, usually referred to as a contingency table, displays the joint distribution of two or more variables, that is, it describes their distribution simultaneously. Each cell shows the number of respondents that gave a specific combination of responses, so each cell contains a single joint frequency.
19
1. Cross Tabulation We have data on two qualitative (categorical) dimensions and we wish to know whether they are related:
- Region (AM, ASIA, EUR)
- Type of company (DBF, LDF)
20
1. Cross Tabulation We have data on two qualitative (categorical) dimensions and we wish to know whether they are related:
- Region (AM, ASIA, EUR)
- Type of company (DBF, LDF)
SPSS menu: Analyze > Descriptive Statistics > Frequencies
22
1. Cross Tabulation Crossing Region (AM, ASIA, EUR) × Type of company (DBF, LDF)
SPSS menu: Analyze > Descriptive Statistics > Crosstabs > Cells > Observed
23
2. Expected Joint Frequencies To say something about the relationship between two categorical variables, we need a benchmark: the expected (also called theoretical) frequencies, that is, the joint frequencies we would observe if the two variables were independent.
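Under independence, the expected count in a cell is the product of its row total and its column total, divided by the sample size. The notation below is an assumption introduced for this note and does not come from the slides: O_{ij} is the observed count in row i and column j, R_i the row total, C_j the column total and N the total number of observations.

```latex
E_{ij} = \frac{R_i \, C_j}{N},
\qquad R_i = \sum_{j} O_{ij},
\quad C_j = \sum_{i} O_{ij},
\quad N = \sum_{i}\sum_{j} O_{ij}
```

For instance, if a region accounts for half of all companies and a company type for 40% of them, independence implies that 20% of all companies fall in that cell.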
24
2. Expected Joint Frequencies Crossing Region (AM, ASIA, EUR) × Type of company (DBF, LDF)
SPSS menu: Analyze > Descriptive Statistics > Crosstabs > Cells > Expected
25
2. Expected Joint Frequencies
SPSS menu: Analyze > Descriptive Statistics > Crosstabs > Cells > Observed & Expected
26
3. Computing the χ² statistic We can now compare what we observe with what we should observe if the two variables were independent. The larger the difference, the stronger the evidence against independence. This difference is termed a chi-square distance:
$$ \chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} $$
where O_{ij} and E_{ij} are the observed and expected joint frequencies. With a contingency table of n rows and m columns, the statistic follows a χ² distribution with (n − 1) × (m − 1) degrees of freedom, provided that the lowest expected frequency is at least 5.
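The four steps can be sketched in a few lines of Python. The counts below are HYPOTHETICAL, invented purely to make the example run. They are not the course data set, and the function name `chi_square_independence` is likewise made up. The p-value uses a shortcut that is exact only when the test has 2 degrees of freedom, which is the case for a 3 × 2 table.

```python
# Chi-square independence test on a HYPOTHETICAL Region x Type table.
# The counts are invented for illustration; they are not the course data.
from math import exp

observed = {
    "AM":   {"DBF": 60, "LDF": 20},
    "ASIA": {"DBF": 5,  "LDF": 25},
    "EUR":  {"DBF": 35, "LDF": 25},
}

def chi_square_independence(table):
    """Return (chi2, degrees of freedom, expected frequencies)."""
    rows = list(table)
    cols = list(next(iter(table.values())))
    # Step 1: margins of the observed cross tabulation
    row_totals = {r: sum(table[r].values()) for r in rows}
    col_totals = {c: sum(table[r][c] for r in rows) for c in cols}
    total = sum(row_totals.values())
    # Step 2: expected joint frequencies under independence
    expected = {r: {c: row_totals[r] * col_totals[c] / total for c in cols}
                for r in rows}
    # Step 3: chi-square distance between observed and expected counts
    chi2 = sum((table[r][c] - expected[r][c]) ** 2 / expected[r][c]
               for r in rows for c in cols)
    df = (len(rows) - 1) * (len(cols) - 1)
    return chi2, df, expected

chi2, df, expected = chi_square_independence(observed)
min_expected = min(v for row in expected.values() for v in row.values())

# Step 4: significance. ASSUMPTION: the closed form exp(-chi2 / 2)
# is the chi-square tail probability only when df == 2.
p_value = exp(-chi2 / 2) if df == 2 else None

print(f"chi2 = {chi2:.2f}, df = {df}, lowest expected count = {min_expected:.1f}")
print(f"p-value = {p_value}")
```

With other table sizes the p-value would have to be read from a χ² table or taken from a statistics package.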
27
3. Computing the χ² statistic
SPSS menu: Analyze > Descriptive Statistics > Crosstabs > Statistics > Chi-square
28
4. Conclusion on H₀ versus H₁ We reject H₀ with a chance of being wrong that rounds to 0.00%. I will take the chance, and I tentatively conclude that the type of company and the regional origin are not independent. Using our appreciative knowledge of biotechnology, this makes sense: biotechnology was born in the USA, with European companies following, while Asian (i.e. Japanese) companies are mainly large pharmaceutical companies. Most DBFs are found in the US, then in Europe. This is less true now.
29
Correlations
30
Introduction to Correlations This part is devoted to the study of whether, and the extent to which, two or more quantitative variables are related:
- Positively correlated: the values of one variable “vary somewhat in step” with the values of another variable
- Negatively correlated: the values of one variable “vary somewhat in opposite step” with the values of another variable
- Not correlated: the values of one variable “vary randomly” when the values of another variable vary
31
Scatter Plot of Fertilizer and Production
32
Scatter Plot of R&D and Patents (log)
34
Pearson’s Linear Correlation Coefficient r The Pearson product-moment correlation coefficient is a measure of the co-relation between two variables x and y. Pearson's r reflects the intensity of the linear relationship between two variables. It ranges from +1 to -1.
- r near 1: positive correlation
- r near -1: negative correlation
- r near 0: no or poor correlation
35
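Pearson’s Linear Correlation Coefficient r
$$ r_{x,y} = \frac{\mathrm{Cov}(x,y)}{\sigma_x \, \sigma_y} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} $$
Cov(x,y): covariance between x and y
σ_x and σ_y: standard deviation of x and standard deviation of y
n: number of observations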
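Pearson's r divides the covariance of x and y by the product of their standard deviations. The sketch below does this by hand so that each ingredient is visible. The function name `pearson_r` and the fertilizer and production figures are HYPOTHETICAL values invented for the example. Dividing by n rather than n − 1 is an arbitrary choice here, since the factor cancels in the ratio.

```python
# Pearson's r computed from its definition: Cov(x, y) / (sd_x * sd_y).
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (sd_x * sd_y)

# HYPOTHETICAL data, invented for illustration only
fertilizer = [1, 2, 3, 4, 5]
production = [2, 4, 5, 4, 5]

r = pearson_r(fertilizer, production)
print(f"r = {r:.4f}")
```

A perfectly increasing straight line gives r = 1 and a perfectly decreasing one gives r = -1, which is a quick way to sanity-check the function.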
36
Pearson’s Linear Correlation Coefficient r
Is r significantly different from 0?
H₀: r_{x,y} = 0
H₁: r_{x,y} ≠ 0
$$ t^* = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} $$
If |t*| > t with (n − 2) degrees of freedom and critical probability α (5%), we reject H₀ and conclude that r is significantly different from 0.
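The test statistic depends on both r and the sample size, so the same correlation can be significant in a large sample and not in a small one. The sketch below illustrates this with r = 0.6, a value chosen arbitrarily. The function names are made up, and the critical values are passed in by hand from a two-sided 5% Student table (2.060 for 25 degrees of freedom, 2.776 for 4) rather than computed.

```python
# t statistic for H0: r = 0, with (n - 2) degrees of freedom.
from math import sqrt

def correlation_t_stat(r, n):
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

def is_significant(r, n, t_critical):
    # Two-sided test: compare the absolute value with the critical value,
    # which the caller looks up in a Student table for n - 2 degrees of freedom.
    return abs(correlation_t_stat(r, n)) > t_critical

# Same correlation, two HYPOTHETICAL sample sizes
t_large = correlation_t_stat(0.6, 27)   # 25 degrees of freedom
t_small = correlation_t_stat(0.6, 6)    # 4 degrees of freedom

print(f"n = 27: t* = {t_large:.2f}, reject H0: {is_significant(0.6, 27, 2.060)}")
print(f"n = 6:  t* = {t_small:.2f}, reject H0: {is_significant(0.6, 6, 2.776)}")
```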
37
Pearson’s Linear Correlation Coefficient r
SPSS menu: Analyze > Correlate > Bivariate, then click on “Pearson”
38
Pearson’s Linear Correlation Coefficient r: Assumptions of Pearson’s r
- There is a linear relationship between x and y
- Both x and y are continuous random variables
- Both variables are normally distributed
- Equal differences between measurements represent equivalent intervals
We may want to relax (one of) these assumptions.
39
Spearman’s Rank Correlation Coefficient ρ Spearman's rank correlation is a non-parametric measure of the intensity of a correlation between two variables, without making any assumptions about the distribution of the variables, i.e. about the linearity, normality or scale of the relationship.
- ρ near 1: positive correlation
- ρ near -1: negative correlation
- ρ near 0: no or poor correlation
40
Spearman’s Rank Correlation Coefficient ρ
$$ \rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} $$
d: the difference between the ranks of paired values of x and y
n: number of observations
ρ is simply a special case of the Pearson product-moment coefficient in which the data are converted to rankings before calculating the coefficient.
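The claim that ρ is Pearson's coefficient applied to ranks can be checked numerically. The sketch below computes ρ twice on HYPOTHETICAL data: once with the rank-difference formula 1 − 6Σd² / (n(n² − 1)) and once as Pearson's r on the ranks. All function names are made up for the illustration, and the simple ranking helper assumes there are no tied values. With ties, the shortcut formula no longer matches exactly.

```python
# Spearman's rho two ways: rank-difference formula vs. Pearson's r on ranks.
from math import sqrt

def ranks(values):
    """Rank from 1 (smallest) to n (largest). ASSUMPTION: no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman_rho(x, y):
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = sqrt(sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y))
    return num / den

# HYPOTHETICAL data, invented for illustration only
x = [3, 1, 4, 15, 9]
y = [2, 7, 1, 8, 28]

rho_formula = spearman_rho(x, y)
rho_via_pearson = pearson_r(ranks(x), ranks(y))
print(f"rho (formula) = {rho_formula:.3f}, Pearson on ranks = {rho_via_pearson:.3f}")
```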
41
Spearman’s Rank Correlation Coefficient ρ
SPSS menu: Analyze > Correlate > Bivariate, then click on “Spearman”
42
Pearson’s r or Spearman’s ρ?
- Relationship between tastes and levels of consumption on a large sample? ρ
- Relationship between income and consumption on a large sample? r
- Relationship between income and consumption on a small sample? Both ρ and r
43
Assignments on CERAM_LMC
- Produce descriptive statistics on R&D, sales and number of employees, by sector
- Perform an ANOVA to test whether there are significant differences between sectors in these three variables
- Perform an ANOVA using the log of these three variables. What do you observe?
- Is the sector composition of the LMCs region-specific?