Basic Statistics – 1
  Semester 1, 2009
  POGO8096/8196: Research Methods
  Crawford School of Economics and Government
  19 May Crawford School
This week
  Introduction
  Data and variables
  Statistics and statistical analysis
  Univariate analysis
  Bivariate analysis
  Relationships between variables
  Regression analysis
  Correlational analysis
Data and variables – 1
  Data are observed numerical facts for analysis.
    Survey data
    Time-series data
    Cross-section data
  Q. What is the unit of analysis/observations?
  A variable is an empirical property that can take on two or more different values.
Data and variables – 2
  Levels of measurement (review)
    Nominal variable (categorical)
    Ordinal variable (categorical)
    Interval variable (continuous)
  Dichotomous variable (or “dummy variable”)
    It is a variable that has two, and only two, possible values or categories.
    e.g., {voted, abstained}, {male, female}, {yes, no}.
Data and variables – 3
  Most questions in a survey are nominal, ordinal or dichotomous.
  Interval variables are common in time-series data and cross-section data.
  Dichotomous variables can be used to measure institutional differences in cross-section data and structural changes in time-series data.
    e.g., in cross-national data: 0 if democracy, 1 otherwise
    e.g., in yearly data: 0 if before 1995, 1 if 1995 onward
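The two dummy codings above can be sketched in a few lines of Python. This is only an illustration: the country names, regime labels and years are made up, and the variable names are hypothetical.

```python
# Hypothetical cross-national data: regime type per country (made-up labels).
regimes = {"Country A": "democracy", "Country B": "non-democracy", "Country C": "democracy"}

# Dummy variable for an institutional difference: 0 if democracy, 1 otherwise.
non_democracy = {country: 0 if regime == "democracy" else 1
                 for country, regime in regimes.items()}

# Hypothetical yearly data (made-up years).
years = [1992, 1993, 1994, 1995, 1996, 1997]

# Dummy variable for a structural change: 0 if before 1995, 1 if 1995 onward.
post_1995 = [0 if year < 1995 else 1 for year in years]

print(non_democracy)  # {'Country A': 0, 'Country B': 1, 'Country C': 0}
print(post_1995)      # [0, 0, 0, 1, 1, 1]
```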
Statistics
  A statistic is a numerical summary of data.
  Univariate statistics
    Numerical summaries of a particular variable.
    e.g., the “proportion” of respondents in a survey supporting a proposed policy change.
  Bivariate/multivariate statistics
    Numerical summaries of relationships between variables.
    e.g., the “correlation” between inequality and growth.
Statistical analysis
  Statistical analysis includes two main activities:
  Statistical measurement
    It consists of measuring statistics (the plural form of statistic), including measuring relationships between variables.
  Statistical inference
    It consists of estimating how likely it is that a particular result (e.g., correlation between variables) could be due to chance.
Univariate statistics – 1
  Measures of central tendency
    Mean
    Median (the middle value)
    Mode (the most frequently occurring value)
  Measures of dispersion
    Range (the distance from the lowest to the highest value)
    Concentration (the relative frequency with which a score occurs)
    Standard deviation
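As a rough illustration of the measures listed above, the following Python sketch computes each of them for a small set of made-up scores. The data and variable names are hypothetical, and the standard deviation shown is the sample version (dividing by n − 1), which is one of two common conventions.

```python
import statistics
from collections import Counter

# Hypothetical scores, invented for illustration only.
scores = [2, 3, 3, 4, 4, 4, 5, 7]

# Measures of central tendency
mean = statistics.mean(scores)
median = statistics.median(scores)  # the middle value
mode = statistics.mode(scores)      # the most frequently occurring value

# Measures of dispersion
value_range = max(scores) - min(scores)  # distance from lowest to highest value
counts = Counter(scores)
concentration = {score: count / len(scores)  # relative frequency of each score
                 for score, count in counts.items()}
std_dev = statistics.stdev(scores)  # sample standard deviation (n - 1 in the denominator)

print(mean, median, mode)        # 4 4.0 4
print(value_range)               # 5
print(concentration[4])          # 0.375
print(round(std_dev, 2))         # 1.51
```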
Univariate statistics – 2
  Which measures do we (usually) use for each type of variable? Check both if two different measures are available.
  Columns (types of variable): Nominal, Ordinal, Interval, Dichotomous
  Rows (measures): Mean (proportion), Median, Mode, Range, Concentration, Std. Dev.
Example – Education
  Observations = 1,307 Japanese voters
  [Figure: distribution of education level; categories include 2 (Primary), 3 (Secondary), 4 (University)]
Example – Population
  Observations = 50 US States
  Mean = 5.5
  Median = 4.0
  Minimum = 0.5
  Maximum = 33.0
  Std. Dev. = 6.0
  Q. Other examples of skewed variables?
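In the population example the mean (5.5) lies above the median (4.0), which is typical of a variable skewed by a few very large values. A minimal Python sketch with made-up numbers (not the actual state populations) shows the same pattern:

```python
import statistics

# Hypothetical, right-skewed values: most are small, one is very large.
values = [1, 2, 3, 4, 5, 6, 33]

mean = statistics.mean(values)      # pulled upward by the single large value
median = statistics.median(values)  # the middle value, unaffected by the outlier

print(f"mean = {mean:.1f}, median = {median}")  # mean = 7.7, median = 4
```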
Bivariate relationships
  A variable is related or unrelated to another.
  A variable is positively or negatively related to another.
  A variable is strongly or weakly related to another.
  A variable has a large or small effect on another.
  A variable is significantly or insignificantly related to another [“statistical inference”].
Related or unrelated?
Positively or negatively related?
Strongly or weakly related?
Large or small effect?
The level of measurement matters
  Depending on the level of measurement, …
    You cannot measure whether the relationship between variables is positive or negative if one of the variables is nominal.
    You can measure whether a variable has a large or small effect on another only if the two variables are interval.
    You can always measure whether variables are strongly or weakly related, regardless of the variables’ levels of measurement.
Bivariate analysis with categorical variables
  A visual presentation
    Data on two nominal or ordinal (categorical) variables are customarily presented in a “cross tabulation” or “contingency table”.
  Bivariate statistics for categorical variables?
    There are some bivariate statistics, such as Lambda, Gamma, Phi, Tau-b, etc. None of these measures is entirely satisfactory; each has drawbacks.
Cross tabulation – 1
  [Table: cross tabulation of Education (Low, Middle, High) and Income]
  Numbers in cells are the numbers of observations.
  There is a positive correlation between the two variables, but you cannot say how much change is produced in one variable by a change in another.
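A cross tabulation of this kind can be built by counting how many observations fall into each combination of categories. The Python sketch below uses made-up (education, income) pairs, not the figures from the slide, and all names in it are hypothetical.

```python
from collections import Counter

# Hypothetical (education, income) pairs, one per respondent.
observations = [
    ("Low", "Low"), ("Low", "Low"), ("Low", "Middle"),
    ("Middle", "Low"), ("Middle", "Middle"), ("Middle", "Middle"),
    ("High", "Middle"), ("High", "High"), ("High", "High"),
]
levels = ["Low", "Middle", "High"]

# Count how many observations fall into each (education, income) cell.
cell_counts = Counter(observations)

# Print income as rows (highest first) and education as columns.
print("Income \\ Education".ljust(20) + "".join(level.rjust(8) for level in levels))
for income in reversed(levels):
    row = "".join(str(cell_counts[(education, income)]).rjust(8) for education in levels)
    print(income.ljust(20) + row)
```

In this invented table the counts cluster along the diagonal (low with low, high with high), which is the pattern a positive relationship between two ordinal variables produces.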
Cross tabulation – 2
  [Table: cross tabulation of Party support (Labor, Liberal, Others) and Voted for candidate … (Mr. A, Mr. B, Mr. C)]
  There is a correlation between the two variables, but you can say neither whether the correlation is positive or negative, nor how much change is produced in one variable by a change in another.
Bivariate analysis with interval variables
  A visual presentation
    A “scattergram” or “scatterplot”
    The horizontal axis is used for the independent variable (X) and the vertical axis for the dependent variable (Y).
  Bivariate statistics for interval variables
    The “effect-descriptive” characteristic of a scattergram is the “regression coefficient.”
    The “correlational” characteristic of a scattergram is the “correlation coefficient.”
Regression analysis – 1
  Find the single line that best approximates the pattern in the dots of a scattergram.
  The best method (OLS) is to choose the line that minimizes the sum of squared differences between the observed values of the dependent variable and its predicted values.
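Written out, the OLS criterion and its standard closed-form solution for a single independent variable look as follows. The notation is an assumption made here for illustration: y_i is the observed value, a + b x_i the predicted value, and bars denote sample means.

```latex
\min_{a,\,b} \; \sum_{i=1}^{n} \bigl( y_i - (a + b x_i) \bigr)^2
\qquad\Longrightarrow\qquad
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
a = \bar{y} - b\,\bar{x}
```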
Regression analysis – 2
  The regression equation: y = a + bx
    y is the predicted value of the dependent variable.
    x is the value of the independent variable.
    a is the “intercept” of the regression line.
    b is the “slope” of the regression equation. The main quantity of interest!
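A minimal sketch of estimating a and b by OLS in plain Python, using the standard closed-form formulas for one independent variable. The data points and the helper name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) for the OLS line y = a + b*x with one independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope (the "regression coefficient")
    a = mean_y - b * mean_x  # intercept
    return a, b

# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
print(f"y = {a:.2f} + {b:.2f}x")  # y = 2.20 + 0.60x
```

Here the slope of 0.60 says that a one-unit increase in x is associated with a 0.60-unit increase in the predicted y.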
Regression analysis – 3
  The slope, often simply called the “regression coefficient,” is the most valuable part of this equation for most purposes in empirical research.
  Why? Because it provides a single, precise summary measure of how great an impact the independent variable has on the dependent variable.
  It is important to know that researchers must assume the direction of causation.
Regression analysis – 4
  Residuals
    Some observations are higher or lower than the predicted values on the regression line.
    The “residual” = the observed value – the predicted value.
    Examining the residuals often helps us find some other factors affecting the dependent variable. (See Figure 8-8 on Shively p. 121, as an example.)
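The residual definition above can be sketched in Python as follows. The data and the fitted intercept and slope are hypothetical values assumed for illustration.

```python
# Hypothetical data and an assumed fitted line y = a + b*x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

# Residual = observed value - predicted value, one per observation.
predicted = [a + b * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]

print([round(r, 1) for r in residuals])  # [-0.8, 0.6, 1.0, -0.6, -0.2]
```

A positive residual means the observation lies above the line (under-prediction); a negative one means it lies below (over-prediction).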
An Example – 1
  Lijphart, Arend. Patterns of Democracy, Chapter 5 (Party Systems).
    36 democracies
    X = the effective number of political parties
    Y = the number of issue dimensions
  X is expected to have a positive impact on Y.
  A regression equation: Y = a + b X.
  Estimate a and b using OLS.
An Example – 2
  Predicted equation: Y = a + b X, with a and b estimated by OLS
  Prediction (e.g., US)
    X = 2.4
    Y (observed) = 1
    Y (predicted) = 1.71
    Residual = −0.71
    Over-prediction for US
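The residual for the US follows directly from the two values given on the slide. A short Python check (variable names are hypothetical):

```python
# Values for the US as given in the example.
observed_y = 1.0
predicted_y = 1.71

# Residual = observed value - predicted value.
residual = observed_y - predicted_y
is_over_predicted = residual < 0  # the line predicts more than was observed

print(round(residual, 2))   # -0.71
print(is_over_predicted)    # True
```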
Correlation analysis – 1
  The “regression coefficient” measures how much difference the independent variable makes in the dependent variable.
  The “correlation coefficient” (or “r”) measures how widely data spread around a regression line.
Correlation analysis – 2
  A complete lack of relationship: r = 0
  A completely negative relationship: r = –1
  A completely positive relationship: r = 1
  Some positive relationship: 0 < r < 1
  Some negative relationship: –1 < r < 0
  An example (Lijphart): r = 0.84.
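The cases listed above can be reproduced with a small Python sketch of the Pearson correlation coefficient. The data sets and the helper name `pearson_r` are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]

print(round(pearson_r(xs, [2, 4, 5, 4, 5]), 2))    # 0.77  (some positive relationship)
print(round(pearson_r(xs, [2, 4, 6, 8, 10]), 2))   # 1.0   (completely positive)
print(round(pearson_r(xs, [10, 8, 6, 4, 2]), 2))   # -1.0  (completely negative)
```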
An Example – 1
  X1: % of respondents who agree with the US military action in Afghanistan.
  X2: % of respondents who agree that [their country] should take part with the US in military action against Afghanistan.
  X3: % of respondents who think American foreign policy has a positive effect on [their country].
  X4: % of respondents who are worried that the war between the US and its allies against terrorism may grow into a broader war against Islam.
An Example – 2
  [Table: correlation matrix of X1, X2, X3 and X4]
  Source: Gallup International, End of Year Terrorism Poll.
  Number of countries included in the sample = 59.
Remarks
  If you are interested in a causal relationship between variables, regression analysis is superior to correlation analysis.
  Correlation analysis is often done as a first-cut analysis prior to regression analysis.
  In regression analysis, you need to decide on a direction of causation (i.e., the impact of X on Y) and control for the effects of other variables.
Next week
  Statistical inference
  Multivariate analysis
  More topics (if we have time)