1 Lecture 4 Statistical analysis
Heidi Hogset September, 2014

2 Lecture aim and objectives
Aim: Investigate methods of statistical analysis
Objectives:
- Research questions and hypotheses
- Statistical tests
SCM300

3 Research questions
Survey research is all about asking questions
- Descriptive questions, univariate (covered in lecture 3)
  E.g., how has the gender composition of enrolled students changed over the last 20 years?
- Causal relationships, multivariate (topic today)
  E.g., what is the relationship between unemployment rates and applications for admission to higher education?

4 Research questions
The research question is:
- Does the unemployment rate influence people’s propensity to seek higher education?
The variables required are:
- Unemployment rate each year in Norway (choose period)
- Number of applications for admission submitted to colleges and universities in Norway each year (same period)

5 Research questions
Suggested causal relationship?
[Two images. Source for left image: MS clipart. Source for right image: missing in transcript.]

6 Research questions
Hypothesis: a statement to test a particular proposition
Example: In periods with higher levels of unemployment, colleges and universities receive more applications for admission

7 Research questions
Observed trends
Variables:
- Share of population in each age bracket that is enrolled in higher education (%)
- Share of labor force in each age bracket that is registered as totally unemployed (%)

8 Research questions
We assume the causal relationship goes from unemployment to school applications, not vice versa
- “Applications” is the dependent variable (DV)
- “Unemployment” is the independent variable (IV)

9 Research questions
Null hypothesis
- There is NO relationship between unemployment rates and school applications

10 Research questions
Alternative hypotheses
- Non-directional (two-tailed): There is a significant relationship between unemployment rates and school applications
- Directional (one-tailed): There is a significant and positive relationship between unemployment rates and school applications

11 Research questions
Hypothesis testing: one-tailed or two-tailed?
- Use 2-tailed unless you have a good reason to choose 1-tailed

12 Your survey
You are expected to develop one or more research questions based on theory developed in your discipline of interest
Example: Green taxing
- Does it work?
- How strong is the effect?

13 Your survey
- Variables needed?
- How might you create the variables using a survey?
- What hypothesis might you use?

14 Statistical analysis
The significance of each hypothesis is tested using statistical analysis
The objective is to reject or accept each hypothesis
- “Accept” means you have not disproved it
- “Accept” does not mean the hypothesis has been proved

15 Statistical analysis
Relationship between two variables
- One-Sample T Test
- Paired Samples T Test
- Independent Samples T Test
- Chi-square Test
- One-Way ANOVA
- Correlation analysis
- Simple Regression Analysis

16 Statistical analysis
Relationship between > two variables
- Multiple Regression Analysis
- Logistic Regression Analysis

17 Compare two variables – example 1
Number of students aged [age range missing] enrolled in higher education in Norway, by sex, 2000–2010
“Male” is number of male students; “Female” is number of female students
Year      Male    Female
2000    27 826    31 721
2001    27 103    31 827
2002    28 040    33 669
2003    28 151    33 204
2004    28 612    33 392
2005    30 817    37 062
2006    30 099    37 122
2007    31 478    39 690
2008    37 564    46 188
2009    40 345    49 131
2010    42 544    51 446

18 One Sample T Test
Compute difference: Female – Male
Test: Is the difference significantly different from zero?
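The test on this slide can be sketched outside SPSS as follows. This is a minimal illustration, assuming the enrolment counts from slide 17; the function name `one_sample_t` is made up for the example, and only the t statistic and degrees of freedom are computed (no significance value, since the standard library has no t distribution).

```python
import math
import statistics

# Enrolment counts for 2000-2010, copied from the table on slide 17.
male = [27826, 27103, 28040, 28151, 28612, 30817, 30099, 31478, 37564, 40345, 42544]
female = [31721, 31827, 33669, 33204, 33392, 37062, 37122, 39690, 46188, 49131, 51446]

def one_sample_t(values, mu0=0.0):
    """Return (t statistic, degrees of freedom) for H0: mean == mu0."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # sample standard deviation (divides by n - 1)
    se = sd / math.sqrt(n)          # standard error of the mean
    return (mean - mu0) / se, n - 1

# Difference Female - Male for each year, then test it against zero.
diff = [f - m for f, m in zip(female, male)]
t_stat, df = one_sample_t(diff)
print(f"mean difference = {statistics.mean(diff):.1f}, t = {t_stat:.2f}, df = {df}")
```

With 10 degrees of freedom the two-tailed 5% critical value is 2.228, so a t statistic well above that is in line with the very small significance value reported in the lecture. A paired samples t test on Female vs. Male is computed from the same year-by-year differences and gives the same t statistic.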

19 One Sample T Test

20 One Sample T Test

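21 One Sample T Test
If our test is directional, we should use the one-tailed significance, which is half of the two-tailed significance (provided the observed effect lies in the hypothesized direction). Here, however, the reported significance is 0,000, and 0,000/2 = 0,000, so the choice makes no difference to the conclusion.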

22 Paired Samples T Test
Compare variables directly: Female vs. Male
Test: Are their means significantly different?

23 Paired Samples T Test

24 Paired Samples T Test

25 Paired Samples T Test

26 Independent Samples T Test
Has the finance crisis influenced enrolment in higher education?
Test: Are the means significantly different before and after 2007?

27 Independent Samples T Test

28 Independent Samples T Test

29 Independent Samples T Test
Note: the difference between the means is not only due to the finance crisis. There is also a long-term trend towards higher enrolment rates that we should have controlled for.
The F is Levene’s test for equality of variances. A small significance value here (conventionally below 0,05) indicates that the variances differ, so the appropriate T Test is the one where equal variances are not assumed.
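The "equal variances not assumed" version of the test (Welch's t test) can be sketched as below. Two assumptions are made for illustration only: the variable tested is total enrolment (male plus female from slide 17), and "before" means 2000-2007 while "after" means 2008-2010. The slides do not confirm either choice, and `welch_t` is a made-up helper name.

```python
import math
import statistics

# Total enrolment (male + female) per year, summed from the slide 17 table.
total = {
    2000: 59547, 2001: 58930, 2002: 61709, 2003: 61355, 2004: 62004, 2005: 67879,
    2006: 67221, 2007: 71168, 2008: 83752, 2009: 89476, 2010: 93990,
}

# Assumption: "before" = 2000-2007, "after" = 2008-2010.
before = [v for year, v in total.items() if year <= 2007]
after = [v for year, v in total.items() if year > 2007]

def welch_t(a, b):
    """t statistic and Welch-Satterthwaite df; equal variances are NOT assumed."""
    va = statistics.variance(a) / len(a)   # squared standard error of mean(a)
    vb = statistics.variance(b) / len(b)   # squared standard error of mean(b)
    t = (statistics.mean(b) - statistics.mean(a)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

t_stat, df = welch_t(before, after)
print(f"t = {t_stat:.2f}, df = {df:.2f}")
```

The degrees of freedom are no longer a whole number, which is how the "equal variances not assumed" row typically differs from the pooled row in the output. As the slide notes, the trend in enrolment is not controlled for here either.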

30 Compare two variables – example 2
A survey of pigeonpea farmers in Tanzania in 2008. Variables:
- District (there are 4)
- Respondent characteristics (there are 609 respondents): the respondent’s sex, age, number of years in school, number of dependents in the household, and distance to the nearest main market (for farm products)
- The respondent’s farm operation: the number of plots planted in improved varieties of pigeonpeas divided by the total number of plots planted in pigeonpeas (“share”)

31 Chi-square Test
Chi-square Test for Independence of Discrete Data
- We need to compare two nominal or ordinal variables (discrete data), i.e., data for which the only meaningful statistics are frequencies and percentages
- We want to check if our sampling procedure has produced a biased sample with respect to gender composition
- We assume the proportion of households with a female household head is independent of district
- Test: Is the sex distribution different between districts?

32 Chi-square Test

33 Chi-square Test
- Select Statistics, then tick Chi-square
- Select Cells, then tick Expected (and untick Observed)

34 Chi-square Test
Chi-square Test of Independence
[Output table; Karatu and Arumeru are marked]
- A small significance value (p-value) for the chi-square statistic indicates that there is a significant relationship between the two variables: they are NOT independent of each other
- Here, the significance value is large, close to 1. Therefore, we ACCEPT the null that the two variables are independent
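To make the logic of the test concrete, here is a minimal sketch of how expected counts and the chi-square statistic are computed for a district-by-sex table. All counts are invented for illustration (they are not the survey's results), and "District C" and "District D" are placeholder names because the slides only name Karatu and Arumeru.

```python
# Hypothetical counts of household heads by district and sex: [male, female].
# Only the names Karatu and Arumeru come from the slides; all numbers are invented.
observed = {
    "Karatu": [120, 30],
    "Arumeru": [125, 28],
    "District C": [122, 31],
    "District D": [124, 29],
}

def chi_square_independence(table):
    """Return (chi-square statistic, df, expected counts) for a contingency table."""
    rows = list(table.values())
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    grand = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[rt * ct / grand for ct in col_totals] for rt in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(rows, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return chi2, df, expected

chi2, df, expected = chi_square_independence(observed)
CRITICAL_5PCT_DF3 = 7.815  # chi-square critical value at the 5% level for df = 3
print(f"chi-square = {chi2:.3f}, df = {df}, reject independence: {chi2 > CRITICAL_5PCT_DF3}")
```

A statistic far below the critical value corresponds to a large significance value, which is the situation where the null of independence is retained.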

35 One-Way ANOVA
Compare two variables
- Test 1: Is the share of pigeonpea fields in improved varieties different between districts?
- Test 2: Does the share of pigeonpea fields in improved varieties vary by the farmer’s school experience?
Procedure: Analyze / Compare Means / One-Way ANOVA / Select DV (Share improved pigeonpeas) / Select factor (District) / OK
- The DV should be interval or ratio data type
- The populations should be normally distributed and the population variances should be equal
- This procedure becomes cumbersome when the number of factors goes beyond 3-4
- Mainly used in psychological research using experimental data, which is not common in economics research
Problem: Is Share interval or ratio type data?

36 One-Way ANOVA
- The variable Share ranges from 0 (no improved pigeonpea) to 1 (only improved pigeonpea), with values clustering around 0 (75,5% of observations), 1 (15,3%), ¼, ¾, ½, 1/3 and 2/3
- The variable School measures the number of years the respondent has attended school. Observations vary from 0 to 16.
- There are 4 districts.
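The F statistic behind a one-way ANOVA can be sketched as a ratio of between-group to within-group variation. The share values below are invented (and only three groups are used to keep the example short); `one_way_anova` is a made-up helper, not an SPSS function.

```python
import statistics

# Hypothetical "share" values (0 to 1) for farmers in three groups; invented for illustration.
groups = {
    "District A": [0.0, 0.0, 0.5, 1.0, 0.0],
    "District B": [0.0, 0.0, 0.0, 0.25, 0.0],
    "District C": [1.0, 0.5, 1.0, 0.75, 0.0],
}

def one_way_anova(samples):
    """Return (F, df_between, df_within, ss_between, ss_within)."""
    all_values = [x for s in samples for x in s]
    grand_mean = statistics.mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(s) * (statistics.mean(s) - grand_mean) ** 2 for s in samples)
    # Variation of observations around their own group mean.
    ss_within = sum((x - statistics.mean(s)) ** 2 for s in samples for x in s)
    df_between = len(samples) - 1
    df_within = len(all_values) - len(samples)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within, ss_between, ss_within

f_stat, df_b, df_w, ss_b, ss_w = one_way_anova(list(groups.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.3f}")
```

Values piled up at 0 and 1, as described for Share, sit uneasily with the normality assumption, which is the concern the slide raises.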

37 Correlation analysis
- Examines bivariate relationships between ≥ 2 ORDINAL or INTERVAL/RATIO variables
- They are CORRELATED if they are systematically related
  - Positively: The variables tend to move in the same direction
  - Negatively: The variables tend to move in opposite directions
  - Uncorrelated: No relationship
- Can be run with any kind of data, but is not appropriate for nominal variables with more than 2 categories

38 Correlation analysis
Correlation is measured by the correlation coefficient, r
It helps to think of correlation in visual terms:
  r = -1           Perfect negative
  -1 to -0.7       Strong negative
  -0.7 to -0.5     Moderate negative
  -0.5 to -0.1     Weak negative
  -0.1 to 0.1      No relationship
  0.1 to 0.5       Weak positive
  0.5 to 0.7       Moderate positive
  0.7 to 1         Strong positive
  r = 1            Perfect positive
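As a rough illustration, Pearson's r can be computed by hand and labelled with the cut-offs 0.1, 0.5 and 0.7 that appear on this slide's scale. The sketch uses the male and female enrolment series from slide 17; the helper names `pearson_r` and `strength` are made up for the example.

```python
import math

# Enrolment counts for 2000-2010, copied from the table on slide 17.
male = [27826, 27103, 28040, 28151, 28612, 30817, 30099, 31478, 37564, 40345, 42544]
female = [31721, 31827, 33669, 33204, 33392, 37062, 37122, 39690, 46188, 49131, 51446]

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def strength(r):
    """Label r using the cut-offs 0.1, 0.5 and 0.7 read from the slide's scale."""
    size = abs(r)
    if size < 0.1:
        return "no relationship"
    sign = "positive" if r > 0 else "negative"
    if size < 0.5:
        return f"weak {sign}"
    if size < 0.7:
        return f"moderate {sign}"
    return f"strong {sign}"

r = pearson_r(male, female)
print(f"r = {r:.3f} ({strength(r)})")
```

Both series rise over the period, so a high positive r here mostly reflects the shared time trend rather than any causal link between the two.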

39 Correlation analysis
[Scatter plots: r ≈ -1 and r ≈ 1]

40 Correlation analysis
[Scatter plots: r ≈ 0? and r ≈ 0]

41 Correlation analysis
Scatter-plot procedure in SPSS
- Graphs
- Legacy Dialogs
- Scatter/Dot
- Select Simple Scatter
- Define
- IV for x-axis, DV for y-axis
- OK

42 Correlation analysis
Correlation procedure in SPSS
- Analyze
- Correlate
- Bivariate
- Add variables to variables list
- Tick Pearson’s for interval/ratio data (Spearman’s for ordinal)
- OK

43 Correlation analysis

44 Correlation analysis
- Correlation shows pairwise strength of relationship, but not causality
- Causality indicates the likely impact of IV on DV
  E.g. what is the long-term trend in enrolment in higher education?
- It is possible to calculate correlations between a large number of variables
  - You get a matrix with the same number of rows and columns as the total number of variables
  - (If you use the partial correlations procedure, you can calculate the correlation between two variables, controlling for the effect of a third variable)

45 Simple regression analysis
Linear regression procedure in SPSS
- Analyze
- Regression
- Linear
- Transfer DV and IV(s) to list
- OK
Note: the DV should be continuous. It is not a condition that the DV has a normal distribution.

46 Simple regression analysis
Enrolment = , ,373*Year SCM300
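A fitted line of this form can be reproduced with ordinary least squares. The sketch below assumes the DV is total enrolment (male plus female from slide 17) regressed on calendar year; the slide's full equation did not survive in the transcript, so that choice is an assumption, although the slope obtained here has the same decimals (,373) as the fragment on the slide. `fit_line` is a made-up helper name.

```python
# Enrolment counts for 2000-2010, copied from the table on slide 17.
years = list(range(2000, 2011))
male = [27826, 27103, 28040, 28151, 28612, 30817, 30099, 31478, 37564, 40345, 42544]
female = [31721, 31827, 33669, 33204, 33392, 37062, 37122, 39690, 46188, 49131, 51446]

# Assumption: the DV is total enrolment (male + female).
enrolment = [m + f for m, f in zip(male, female)]

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope

intercept, slope = fit_line(years, enrolment)
print(f"Enrolment = {intercept:.3f} + {slope:.3f} * Year")
```

The slope reads as the average yearly increase in enrolment over the period. The intercept is the fitted value at Year = 0, far outside the data, so it has no practical interpretation.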

47 Simple regression analysis
Best fit line procedure in SPSS
- Analyze
- Regression
- Curve estimation
- Place variables on RHS
- Tick linear
- OK

48 Simple regression analysis

49 Summary 2 variables
- One-Sample T Test: Test if one variable is different from (for example) zero
- Paired Samples T Test: Test if two variables in the same sample are different from each other
- Independent Samples T Test: Test if (the same) variable(s) in different samples are different from each other

50 Summary 2 variables
- Chi-square Test: Appropriate for discrete data
- One-Way ANOVA: The DV must be a ratio variable and continuous
- Correlation analysis: Can be run with any kind of data, but is not appropriate for nominal variables with more than 2 categories

51 Summary 2 variables
- Simple Regression Analysis: Same procedure as multiple linear regression analysis, only with fewer variables (just one IV). Appropriate if the DV is continuous
- Best Fit Line: Is a simple linear regression analysis under a different name

52 Multiple regression analysis
Let’s return to the African farmers
What explains whether farmers grow improved or traditional varieties of pigeonpeas?
Available variables:
- Farmer characteristics (sex, age, education, household size)
- Environmental factors (distance to the nearest main market, district)

53 Multiple regression analysis
Design issues
- There are several regression methods, based on theoretical considerations
- In the absence of a strong theoretical reason to choose otherwise, use the standard procedure, i.e., “Enter”
- If IVs are correlated, they will generate a variance inflation effect that reduces the statistical significance of results
- To check for correlation (“multicollinearity”), we want to run a test along with the regression

54 Multiple regression analysis

55 Multiple regression analysis

56 Multiple regression analysis

57 Multiple regression analysis
(1 = Female)
- Using standardized coefficients, interpretations are based on the standard deviations of the variables. Each coefficient indicates the number of standard deviations that the predicted response changes for a one standard deviation change in a predictor, all other predictors remaining constant.
- A value of Tolerance < 0.2 or of VIF > 5 indicates presence of multicollinearity.
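The two multicollinearity thresholds are the same rule stated twice, since VIF = 1 / Tolerance. A minimal sketch for the special case of exactly two IVs, where the R-squared of one IV regressed on the other equals the squared Pearson correlation, is shown below. The age and schooling values are invented, and the helper names are made up for the example.

```python
import math

# Hypothetical values for two IVs from a farmer survey; the numbers are invented.
age = [25, 32, 41, 38, 55, 47, 29, 60, 35, 50]
school = [9, 7, 7, 8, 4, 7, 10, 2, 7, 5]

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def tolerance_and_vif(x1, x2):
    """Tolerance = 1 - R^2 and VIF = 1 / Tolerance, for the two-IV case only."""
    r_squared = pearson_r(x1, x2) ** 2
    tolerance = 1.0 - r_squared
    return tolerance, 1.0 / tolerance

tolerance, vif = tolerance_and_vif(age, school)
flagged = tolerance < 0.2 or vif > 5   # rule of thumb from the slide
print(f"Tolerance = {tolerance:.3f}, VIF = {vif:.2f}, multicollinearity flagged: {flagged}")
```

With more than two IVs, each IV's tolerance comes from regressing it on all the other IVs, which is what the collinearity diagnostics in the regression output report.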

58 Multiple regression analysis
Nominal variables are converted to dummies
- Sex (binary, 0 = Male, 1 = Female)
- Districts
  - One category is omitted (here: Karatu District)
  - The others are represented by binary variables (0 = No, 1 = Yes)
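The dummy coding described here can be sketched as follows. Karatu and Arumeru are named in the slides; "District C" and "District D" are placeholders for the two districts the transcript does not name, and `encode` is a made-up helper.

```python
# The last two district names are placeholders; the slides name only Karatu and Arumeru.
DISTRICTS = ["Karatu", "Arumeru", "District C", "District D"]
REFERENCE = "Karatu"  # omitted category, as on the slide

def encode(sex, district):
    """Return 0/1 dummies for one respondent; the reference district gets no column."""
    row = {"female": 1 if sex == "Female" else 0}
    for name in DISTRICTS:
        if name != REFERENCE:
            row[f"district_{name}"] = 1 if district == name else 0
    return row

print(encode("Female", "Arumeru"))
print(encode("Male", "Karatu"))   # all zeros: this is the reference group
```

Four districts need only three dummies. A respondent with all district dummies equal to 0 is in Karatu, and each district coefficient is read as a difference from Karatu.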

59 Logistic regression
The DV is a discrete variable
- Binary
- Nominal with more than 2 categories
Sensitive to problems like…
- Multicollinearity
- Small sample size
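For the binary case, the model turns a linear combination of the IVs into a probability through the logistic function. The sketch below only shows that transformation; the coefficients are invented for illustration and are not estimates from the pigeonpea survey, and `predicted_probability` is a made-up helper.

```python
import math

def predicted_probability(intercept, coefficients, values):
    """P(DV = 1) under a binary logistic model with the given coefficients."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, values))
    return 1.0 / (1.0 + math.exp(-linear))   # logistic function, always between 0 and 1

# Hypothetical coefficients for (female dummy, years of schooling); not survey estimates.
p = predicted_probability(-2.0, [0.3, 0.15], [1, 7])
print(f"predicted probability = {p:.3f}")
```

Because the output is bounded between 0 and 1, this form suits a yes/no DV where a linear regression could predict impossible values.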

60 Logistic regression
Missing observations are farmers with no pigeonpeas at all.

61 Logistic regression

62 Logistic regression
The reference category is farmers in Karatu who have planted all of their pigeonpea fields in improved varieties (… and District = Karatu)

63 Summary ≥ 2 variables
- Multiple regression analysis
  - Maps the relationship between one DV and many IVs
  - DV must be a ratio variable and continuous
  - IV can be ratio or nominal (dummy)
- Logistic regression
  - Appropriate if the DV is a discrete variable
  - Sensitive to multicollinearity and small sample size

