1
Lecture 4 Statistical analysis
Heidi Hogset September, 2014
2
Lecture aim and objectives
Aim: Investigate methods of statistical analysis
Objectives:
  Research questions and hypotheses
  Statistical tests
SCM300
3
Research questions
Survey research is all about asking questions
Descriptive questions, univariate (covered in lecture 3)
  E.g., how has gender composition of enrolled students changed over the last 20 years?
Causal relationships, multivariate (topic today)
  E.g., what is the relationship between unemployment rates and applications for admission to higher education?
4
Research questions
The research question is:
  Does the unemployment rate influence people’s propensity to seek higher education?
The variables required are:
  Unemployment rate each year in Norway (choose period)
  Number of applications for admission submitted to colleges and universities in Norway each year (same period)
5
Research questions Suggested causal relationship?
Source for left image: MS clipart Source for right image: SCM300
6
Research questions
Hypothesis: a statement to test a particular proposition
Example: In periods with higher levels of unemployment, colleges and universities receive more applications for admission
7
Research questions
Observed trends
Variables:
  Share of population in each age bracket that is enrolled in higher education (%)
  Share of labor force in age bracket that is registered as totally unemployed (%)
8
Research questions
We assume the causal relationship goes from unemployment to school applications, not vice versa
  “Applications” is the dependent variable (DV)
  “Unemployment” is the independent variable (IV)
9
Research questions Null hypothesis
There is NO relationship between unemployment rates and school applications SCM300
10
Research questions
Alternative hypotheses
  Non-directional (two-tailed): There is a significant relationship between unemployment rates and school applications
  Directional (one-tailed): There is a significant and positive relationship between unemployment and school applications
11
Research questions Hypothesis testing One-tailed Two-tailed
Use 2-tailed unless you have a good reason to choose 1-tailed SCM300
12
Your survey
You are expected to develop (a) research question(s) based on theory developed in your discipline of interest
Example: Green Taxing
  Does it work?
  How strong is the effect?
13
Your survey Variables needed?
How might you create the variables using a survey? What hypothesis might you use? SCM300
14
Statistical analysis
The significance of each hypothesis is tested using statistical analysis
The objective is to reject or accept each hypothesis
  “Accept” means you have not disproved it (you have failed to reject it)
  “Accept” does not mean the hypothesis has been proved
15
Statistical analysis
Relationship between two variables
  One-Sample T Test
  Paired Samples T Test
  Independent Samples T Test
  Chi-square Test
  One-Way ANOVA
  Correlation analysis
  Simple Regression Analysis
16
Statistical analysis
Relationship between more than two variables
  Multiple Regression Analysis
  Logistic Regression Analysis
17
Compare two variables – example 1
Number of students aged enrolled in higher education in Norway, by sex,
“Male” is number of male students
“Female” is number of female students
Year    Male      Female
2000    27 826    31 721
2001    27 103    31 827
2002    28 040    33 669
2003    28 151    33 204
2004    28 612    33 392
2005    30 817    37 062
2006    30 099    37 122
2007    31 478    39 690
2008    37 564    46 188
2009    40 345    49 131
2010    42 544    51 446
18
One Sample T Test Compute difference: Female – Male
Test: Is the difference significantly different from zero? SCM300
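The test on this slide can be reproduced outside SPSS. The sketch below is a minimal plain-Python version, assuming the enrolment figures from the table in example 1 and a test value of zero. The critical value 2.228 is the two-tailed 5% value of the t distribution for 10 degrees of freedom, taken from a standard t table; the variable names are illustrative only.

```python
import math

# Enrolment figures copied from the table in example 1 (years 2000-2010).
male = [27826, 27103, 28040, 28151, 28612, 30817, 30099, 31478, 37564, 40345, 42544]
female = [31721, 31827, 33669, 33204, 33392, 37062, 37122, 39690, 46188, 49131, 51446]

# Step 1: compute the difference Female - Male for each year.
diff = [f - m for f, m in zip(female, male)]

# Step 2: one-sample t statistic against a test value of zero.
n = len(diff)
mean_diff = sum(diff) / n
sd_diff = math.sqrt(sum((d - mean_diff) ** 2 for d in diff) / (n - 1))
t_stat = (mean_diff - 0) / (sd_diff / math.sqrt(n))
df = n - 1

# Two-tailed 5% critical value for df = 10 (from a standard t table).
T_CRIT_DF10 = 2.228
reject_null = abs(t_stat) > T_CRIT_DF10

print(f"mean difference = {mean_diff:.1f}, t = {t_stat:.2f}, df = {df}")
print("reject H0" if reject_null else "do not reject H0")
```

Comparing the t statistic with a table value stands in for the significance column that SPSS prints; it leads to the same decision at the 5% level.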
19
One Sample T Test SCM300
20
One Sample T Test SCM300
21
One Sample T Test
If our test is directional, we should use the one-tailed significance, which is half of the 2-tailed significance. Here, however, it makes no difference: 0,000/2 = 0,000.
22
Paired Samples T Test Compare variables directly: Female vs. Male
Test: Are their means significantly different? SCM300
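A paired samples test on Female vs. Male gives the same t statistic as a one-sample test on the yearly differences, because both divide the mean gap by the standard error of the differences. The sketch below illustrates this in plain Python under the assumption that the data are the enrolment figures from example 1; the helper functions `mean` and `covariance` are written here for illustration and are not SPSS functions.

```python
import math

# Enrolment figures copied from the table in example 1 (years 2000-2010).
male = [27826, 27103, 28040, 28151, 28612, 30817, 30099, 31478, 37564, 40345, 42544]
female = [31721, 31827, 33669, 33204, 33392, 37062, 37122, 39690, 46188, 49131, 51446]
n = len(male)


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Sample covariance; covariance(xs, xs) is the sample variance."""
    mean_x, mean_y = mean(xs), mean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)


# Paired t statistic written in terms of the two variables themselves:
# Var(F - M) = Var(F) + Var(M) - 2 Cov(F, M).
mean_gap = mean(female) - mean(male)
var_gap = covariance(female, female) + covariance(male, male) - 2 * covariance(female, male)
t_paired = mean_gap / math.sqrt(var_gap / n)

# Cross-check: one-sample t statistic on the yearly differences.
diff = [f - m for f, m in zip(female, male)]
t_one_sample = mean(diff) / math.sqrt(covariance(diff, diff) / n)

print(f"paired t = {t_paired:.2f}, one-sample t on differences = {t_one_sample:.2f}")
```

The covariance term is what distinguishes a paired test from an independent samples test: the more closely the two series move together, the smaller the standard error of the gap.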
23
Paired Samples T Test SCM300
24
Paired Samples T Test SCM300
25
Paired Samples T Test SCM300
26
Independent Samples T Test
Has the finance crisis in influenced enrolment in higher education? Test: Are the means significantly different before and after 2007? SCM300
27
Independent Samples T Test
SCM300
28
Independent Samples T Test
SCM300
29
Independent Samples T Test
Note: the difference between the means is not only due to the finance crisis. There is also a long-term trend towards higher enrolment rates that we should have controlled for.
The F is from Levene’s Test for Equality of Variances. A small significance value here (conventionally below 0,05) indicates that the variances differ, so the appropriate T Test is the one where equal variances are not assumed.
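The "equal variances not assumed" row corresponds to Welch's version of the t test. The sketch below computes it in plain Python under two assumptions that the slides do not confirm: the tested variable is total enrolment (male + female) from example 1, and "before" means 2000-2007 while "after" means 2008-2010.

```python
import math

years = list(range(2000, 2011))
male = [27826, 27103, 28040, 28151, 28612, 30817, 30099, 31478, 37564, 40345, 42544]
female = [31721, 31827, 33669, 33204, 33392, 37062, 37122, 39690, 46188, 49131, 51446]
total = [m + f for m, f in zip(male, female)]

# Assumption: 2007 belongs to the "before" group.
before = [t for y, t in zip(years, total) if y <= 2007]
after = [t for y, t in zip(years, total) if y > 2007]


def mean(values):
    return sum(values) / len(values)


def sample_variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


# Squared standard errors of each group mean.
se2_before = sample_variance(before) / len(before)
se2_after = sample_variance(after) / len(after)

# Welch's t statistic and Welch-Satterthwaite degrees of freedom.
t_welch = (mean(after) - mean(before)) / math.sqrt(se2_before + se2_after)
df_welch = (se2_before + se2_after) ** 2 / (
    se2_before ** 2 / (len(before) - 1) + se2_after ** 2 / (len(after) - 1)
)

print(f"t = {t_welch:.2f}, df = {df_welch:.1f}")
```

As the note on the slide says, a large t here reflects the long-term upward trend as well as any effect of the crisis, so it should not be read as the effect of the crisis alone.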
30
Compare two variables – example 2
A survey of pigeonpea farmers in Tanzania in 2008. Variables:
District (there are 4)
Respondent characteristics (there are 609 respondents)
  The respondent’s sex, age, number of years in school, number of dependents in the household, distance to the nearest main market (for farm products)
The respondent’s farm operation
  The number of plots planted in improved varieties of pigeonpeas divided by the total number of plots planted in pigeonpeas (“share”)
31
Chi-square Test
We need to compare two nominal or ordinal variables (discrete data)
We want to check if our sampling procedure has produced a biased sample with respect to gender composition
We assume the proportion of households with a female household head is independent of district
Test: Is the sex distribution different between districts?
Chi-square Test for Independence of Discrete Data
  Discrete data: data for which the only meaningful statistics are frequencies and percentages
32
Chi-square Test SCM300
33
Chi-square Test
Select Statistics, then tick Chi-square
Select Cells, then tick Expected (and untick Observed)
34
Chi-square Test
Chi-square Test of Independence
A small significance value (p-value) for the chi-square statistic indicates that there is a significant relationship between the two variables: they are NOT independent of each other
Here, the significance value is large, close to 1. Therefore, we ACCEPT the null that the two variables are independent
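To show where the expected counts and the chi-square statistic come from, here is a minimal plain-Python sketch. The counts are hypothetical (the survey data are not reproduced in these slides), chosen only so that they add up to 609 respondents in 4 districts; "District C" and "District D" are placeholder names. The critical value 7.815 is the 5% value of the chi-square distribution for 3 degrees of freedom, from a standard table.

```python
# Hypothetical counts: (male-headed, female-headed) households per district.
observed = {
    "Karatu": (130, 20),
    "Arumeru": (132, 21),
    "District C": (128, 22),
    "District D": (135, 21),
}

row_totals = {d: sum(counts) for d, counts in observed.items()}
col_totals = [sum(counts[j] for counts in observed.values()) for j in range(2)]
grand_total = sum(row_totals.values())

# Expected count under independence: row total * column total / grand total.
expected = {
    d: tuple(row_totals[d] * col_totals[j] / grand_total for j in range(2))
    for d in observed
}

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells.
chi2 = sum(
    (observed[d][j] - expected[d][j]) ** 2 / expected[d][j]
    for d in observed
    for j in range(2)
)
df = (len(observed) - 1) * (2 - 1)

# 5% critical value for df = 3 (from a standard chi-square table).
CHI2_CRIT_DF3 = 7.815
independent = chi2 < CHI2_CRIT_DF3

print(f"chi-square = {chi2:.3f}, df = {df}")
print("accept H0: independent" if independent else "reject H0: not independent")
```

With these made-up counts the sex distribution is almost identical across districts, so the statistic is small and the corresponding significance value would be large, which matches the pattern described on the slide.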
35
One-Way ANOVA
Compare two variables
Test 1: Is the share of pigeonpea fields in improved varieties different between districts?
Test 2: Does the share of pigeonpea fields in improved varieties vary by the farmer’s school experience?
Procedure: Analyze/ Compare Means/ One-Way ANOVA/ Select DV (Share improved pigeonpeas)/ Select factor (District)/ OK
The DV should be interval or ratio data type
The populations should be normally distributed and the population variances should be equal
This procedure becomes cumbersome when the number of factors goes beyond 3-4
Mainly used in psychological research using experimental data, which is not common in economics research
Problem: Is Share interval or ratio type data?
36
One-Way ANOVA
The variable Share ranges from 0 (no improved pigeonpea) to 1 (only improved pigeonpea), with values clustering around 0 (75,5% of observations), 1 (15,3%), ¼, ¾, ½, 1/3 and 2/3
The variable School measures the number of years the respondent has attended school. Observations vary from 0 to 16.
There are 4 districts.
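The F statistic behind a One-Way ANOVA can be sketched in a few lines of plain Python. The Share values below are hypothetical, eight made-up farmers per district, and "District C" and "District D" are placeholder names; only the calculation itself is the point.

```python
# Hypothetical Share values (0 = no improved pigeonpea, 1 = only improved).
groups = {
    "Karatu": [0, 0, 0, 0.5, 1, 0, 0, 0.25],
    "Arumeru": [0, 0, 1, 0, 0, 0.5, 0, 0],
    "District C": [0, 0, 0, 0, 1, 0, 0.75, 0],
    "District D": [0, 1, 0, 0, 0, 0, 0.5, 1],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = sum(all_values) / len(all_values)

# Between-groups sum of squares: group size times squared distance of the
# group mean from the grand mean.
ss_between = sum(
    len(values) * (sum(values) / len(values) - grand_mean) ** 2
    for values in groups.values()
)

# Within-groups sum of squares: squared distance of each observation from
# its own group mean.
ss_within = sum(
    (x - sum(values) / len(values)) ** 2
    for values in groups.values()
    for x in values
)
ss_total = sum((x - grand_mean) ** 2 for x in all_values)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(f"F({df_between}, {df_within}) = {f_stat:.3f}")
```

The heavy clustering of Share at 0 and 1 is visible even in this toy data, which is why the slide questions whether the normality assumption behind ANOVA is reasonable for this variable.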
37
Correlation analysis
Examines bivariate relationships between ≥ 2 ORDINAL or INTERVAL/ RATIO variables
They are CORRELATED if they are systematically related
  Positively: The variables tend to move in the same direction
  Negatively: The variables tend to move in opposite directions
  Un-correlated: No relationship
Can be run with any kind of data, but is not appropriate for nominal variables with more than 2 categories
38
Correlation analysis
Correlation is measured by the correlation coefficient, r
Helps to think of correlation in visual terms
  r = -1          Perfect negative
  -1 to -0.7      Strong negative
  -0.7 to -0.5    Moderate negative
  -0.5 to -0.1    Weak negative
  -0.1 to 0.1     No relationship
  0.1 to 0.5      Weak positive
  0.5 to 0.7      Moderate positive
  0.7 to 1        Strong positive
  r = 1           Perfect positive
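As a worked illustration, the sketch below computes Pearson's r in plain Python and labels it using the cut-off points -1, -0.7, -0.5, -0.1, 0.1, 0.5, 0.7 and 1 from this slide. Two things are assumptions: the variables (Year against total enrolment, male + female, from example 1) and the treatment of values that fall exactly on a cut-off, which the slide does not specify.

```python
import math

years = list(range(2000, 2011))
male = [27826, 27103, 28040, 28151, 28612, 30817, 30099, 31478, 37564, 40345, 42544]
female = [31721, 31827, 33669, 33204, 33392, 37062, 37122, 39690, 46188, 49131, 51446]
total = [m + f for m, f in zip(male, female)]


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equally long lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def describe(r):
    """Verbal label for r, using the cut-offs from the slide."""
    size = abs(r)
    if size < 0.1:
        return "no relationship"
    sign = "positive" if r > 0 else "negative"
    if size < 0.5:
        strength = "weak"
    elif size < 0.7:
        strength = "moderate"
    elif size < 1.0:
        strength = "strong"
    else:
        strength = "perfect"
    return f"{strength} {sign}"


r = pearson_r(years, total)
print(f"r = {r:.3f}: {describe(r)}")
```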
39
Correlation analysis
[Figure: scatter plots illustrating negative correlation (r towards -1) and positive correlation (r towards 1)]
40
Correlation analysis
[Figure: scatter plots illustrating correlation at or near r = 0]
41
Correlation analysis
Scatter-plot procedure in SPSS
  Graphs
  Legacy Dialogs
  Scatter/Dot
  Select Simple Scatter
  Define
  IV for x-axis
  DV for y-axis
  OK
42
Correlation analysis
Correlation procedure in SPSS
  Analyze
  Correlate
  Bivariate
  Add variables to variables list
  Tick Pearson’s for interval/ ratio data (Spearman’s for ordinal)
  OK
43
Correlation analysis SCM300
44
Correlation analysis
Correlation shows pairwise strength of relationship, but not causality
Causality indicates the likely impact of IV on DV
  E.g. what is the long-term trend in enrolment in higher education?
It is possible to calculate correlations between a large number of variables
  You get a matrix with the same number of rows and columns as the total number of variables
  (If you use the partial correlations procedure, you can calculate correlation between two variables, controlling for the effect of a third variable)
45
Simple regression analysis
Linear regression procedure in SPSS
  Analyze
  Regression
  Linear
  Transfer DV and IV(s) to list
  OK
Note: the DV should be continuous
It is not a condition that the DV has a normal distribution
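For readers without SPSS, a simple linear regression can be fitted by ordinary least squares in plain Python. The sketch below assumes the DV is total enrolment (male + female) from the table in example 1 and the IV is Year; that choice is an assumption made for illustration.

```python
years = list(range(2000, 2011))
male = [27826, 27103, 28040, 28151, 28612, 30817, 30099, 31478, 37564, 40345, 42544]
female = [31721, 31827, 33669, 33204, 33392, 37062, 37122, 39690, 46188, 49131, 51446]
enrolment = [m + f for m, f in zip(male, female)]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(enrolment) / n

# Least-squares estimates: slope = Sxy / Sxx, intercept = mean_y - slope * mean_x.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, enrolment))
sxx = sum((x - mean_x) ** 2 for x in years)
syy = sum((y - mean_y) ** 2 for y in enrolment)

slope = sxy / sxx
intercept = mean_y - slope * mean_x

# R squared: share of the variation in the DV explained by the IV.
r_squared = sxy ** 2 / (sxx * syy)

print(f"Enrolment = {intercept:.3f} + {slope:.3f} * Year")
print(f"R squared = {r_squared:.3f}")
```

On these figures the slope is about 3503.373, meaning fitted enrolment rises by roughly 3 500 students per year over the period.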
46
Simple regression analysis
Enrolment = -6 953 623,136 + 3 503,373*Year SCM300
47
Simple regression analysis
Best fit line procedure in SPSS
  Analyze
  Regression
  Curve estimation
  Place variables on RHS
  Tick linear
  OK
48
Simple regression analysis
SCM300
49
Summary 2 variables
One-Sample T Test
  Test if one variable is different from (for example) zero
Paired Samples T Test
  Test if two variables in the same sample are different from each other
Independent Samples T Test
  Test if (the same) variable(s) in different samples are different from each other
50
Summary 2 variables
Chi-square Test
  Appropriate for discrete data
One-Way ANOVA
  The DV must be a ratio variable and continuous
Correlation analysis
  Can be run with any kind of data, but is not appropriate for nominal variables with more than 2 categories
51
Summary 2 variables
Simple Regression Analysis
  Same procedure as multiple linear regression analysis, only with fewer variables (just one IV)
  Appropriate if DV is continuous
Best Fit Line
  Is a simple linear regression analysis under a different name
52
Multiple regression analysis
Let’s return to the pigeonpea farmers in Tanzania
What explains whether farmers grow improved or traditional varieties of pigeonpeas?
Available variables:
  Farmer characteristics (sex, age, education, household size)
  Environmental factors (distance to the nearest main market, district)
53
Multiple regression analysis
Design issues
There are several regression methods, based on theoretical considerations
  In the absence of a strong theoretical reason to choose otherwise, use the standard procedure, i.e., “Enter”
If IVs are correlated, they will generate a variance inflation effect that reduces the statistical significance of results
  To check for correlation (“multicollinearity”), we want to run a test along with the regression
54
Multiple regression analysis
SCM300
55
Multiple regression analysis
SCM300
56
Multiple regression analysis
SCM300
57
Multiple regression analysis
(Sex: 1 = Female)
Using standardized coefficients, interpretations are based on the standard deviations of the variables. Each coefficient indicates the number of standard deviations that the predicted response changes for a one standard deviation change in a predictor, all other predictors remaining constant.
A Tolerance value < 0.2 or a VIF value > 5 indicates the presence of multicollinearity.
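The two cut-offs are the same rule stated twice, because Tolerance = 1 - R² and VIF = 1 / Tolerance, where R² comes from regressing one IV on the other IVs. The sketch below illustrates this in plain Python for the special case of two IVs, where that R² is simply the squared correlation between them. The age and schooling values are hypothetical, not taken from the survey.

```python
import math


def tolerance_and_vif(r_squared):
    """Tolerance and VIF for an IV, given the R squared obtained when that IV
    is regressed on the other IVs."""
    tolerance = 1.0 - r_squared
    return tolerance, 1.0 / tolerance


def pearson_r(xs, ys):
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values for two IVs: respondent age and years in school.
age = [25, 32, 41, 38, 55, 47, 29, 60, 35, 50]
school = [10, 8, 7, 9, 4, 6, 11, 3, 9, 5]

# With only two IVs, regressing one on the other gives R squared = r squared.
r_squared = pearson_r(age, school) ** 2
tolerance, vif = tolerance_and_vif(r_squared)
multicollinear = tolerance < 0.2 or vif > 5

print(f"R squared = {r_squared:.3f}, Tolerance = {tolerance:.3f}, VIF = {vif:.2f}")
print("multicollinearity flagged" if multicollinear else "no multicollinearity flagged")
```

With more than two IVs the R² has to come from a full auxiliary regression, which is what SPSS does when collinearity diagnostics are requested.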
58
Multiple regression analysis
Nominal variables are converted to dummies
  Sex (binary, 0 = Male, 1 = Female)
  Districts
    One category is omitted (here: Karatu District)
    The others are represented by binary variables (0 = No, 1 = Yes)
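The dummy coding described here can be sketched as a small Python function. Karatu and Arumeru are district names that appear in these slides; "District C" and "District D" are placeholders for the two districts that are not named, and the function and key names are illustrative.

```python
# Karatu is the omitted (reference) category, as on the slide.
districts = ["Karatu", "Arumeru", "District C", "District D"]
reference = "Karatu"
dummy_names = [d for d in districts if d != reference]


def encode(sex, district):
    """Return the dummy variables for one respondent."""
    row = {"female": 1 if sex == "Female" else 0}  # 0 = Male, 1 = Female
    for name in dummy_names:
        row[f"district_{name}"] = 1 if district == name else 0  # 0 = No, 1 = Yes
    return row


print(encode("Female", "Arumeru"))
print(encode("Male", "Karatu"))  # reference category: all district dummies are 0
```

One category has to be left out because the full set of district dummies always sums to 1, which would make them perfectly collinear with the constant term; the coefficients on the remaining dummies are then read as differences from Karatu.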
59
Logistic regression
The DV is a discrete variable
  Binary
  Nominal with more than 2 categories
Sensitive to problems like…
  Multicollinearity
  Small sample size
60
Logistic regression Missing observations are farmers with no pigeonpeas at all. SCM300
61
Logistic regression SCM300
62
Logistic regression
The reference category is farmers in Karatu who have planted all of their pigeonpea fields in improved varieties (… and District = Karatu)
63
Summary ≥ 2 variables
Multiple regression analysis
  Maps the relationship between one DV and many IVs
  DV must be a ratio variable and continuous
  IV can be ratio or nominal (dummy)
Logistic regression
  Appropriate if the DV is a discrete variable
  Sensitive to multicollinearity and small sample size