Scottish Social Survey Network: Master Class 1, Data Analysis with Stata. Dr Vernon Gayle and Dr Paul Lambert, 23rd January 2008, University of Stirling.

The SSSN is funded under Phase II of the ESRC Research Development Initiative.

Handling coefficients
1) Some general issues (some thoughts on statistical modelling in Stata, and some tricks and tips…)
2) Using quasi-variance

Statistical Modelling Process
- Model formulation [make assumptions]
- Model fitting [quantify systematic relationships & random variation]
- (Model criticism) [review assumptions]
- Model interpretation [assess results]
Davies and Dale, 1994, p. 5

Building Models
Remember: real data are much messier, worse behaved (people do odd stuff) and harder to interpret than the data used in books and at workshops.

Building Models
- Always be guided by substantive theory (the economists are good at this, but a bit rigid)
- Form of the outcome variable (or process)
- Main effects first; more complicated models later
- Don’t use stepwise regression (e.g. stepwise, pr(.05): regress wage married children educ age)
- An example…

A regression model (GHS data)
- Y = age left education (years)
- X vars: Female; Social Class (Advantaged; Lower Supervisory; Semi-routine; Routine); Age (centred at 40)

Regression Estimates
Table of estimates for models A, B, C, D and E. Rows: Female; Age (40); Supervisory; Semi-Routine; Routine; Constant.

Linear Regression Models
- A 1 unit change in X leads to a β change in Y
- The β is consistent across the models, apart from minor, insignificant random variation (survey data)
- This holds as long as the X vars are uncorrelated (a classical regression assumption)
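One way to see this in practice is to fit a sequence of nested models and put the coefficients side by side. The sketch below is only illustrative: it assumes the nlsw88 example dataset that ships with Stata (not the GHS data used in the slides), and the variables wage, collgrad, age and married are stand-ins for the slide variables.

```stata
* Illustrative data only: nlsw88 ships with Stata and is not the GHS extract
sysuse nlsw88, clear

* Fit nested linear models, adding one explanatory variable at a time
regress wage collgrad
estimates store A

regress wage collgrad age
estimates store B

regress wage collgrad age married
estimates store C

* Compare the coefficient on collgrad across models A, B and C
estimates table A B C, b(%9.3f) se(%9.3f)
```

In real survey data the X vars are rarely exactly uncorrelated, so some movement in the coefficients between columns is to be expected.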

A logit model (non-linear) (GHS data)
- Y = Graduate / Non Graduate
- X vars: Female; Social Class (Advantaged; Lower Supervisory; Semi-routine; Routine); Age (centred at 40)

Estimates: Logit (log scale)
Table of estimates for models A, B, C, D and E. Rows: Female; Age (40); Supervisory; Semi-Routine; Routine; Constant.
Parameterization??

Logit Model
- Estimates are on a log scale
- The β estimates how a shift from X1 = 0 to X1 = 1 changes the log odds of y = 1
- Even when the X vars are uncorrelated, including additional variables can lead to changes in the β estimates
- The β estimates the effect given all other X vars in the model
- Fixed variance in the logit model (π²/3)
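The points above can be written out as a short sketch. The two-variable model and the symbols y*, ε and b1 are introduced here for illustration and do not appear in the slides; the final line is a commonly quoted approximation rather than an exact result.

```latex
% Logit model on the log odds scale
\[
\log\frac{\Pr(y=1)}{1-\Pr(y=1)} = \beta_0 + \beta_1 X_1 + \beta_2 X_2
\]

% Equivalent latent-variable form: the residual variance is fixed, not estimated
\[
y^{*} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon,
\qquad y = 1 \text{ if } y^{*} > 0,
\qquad \operatorname{Var}(\varepsilon) = \frac{\pi^{2}}{3}
\]

% Approximate coefficient on X_1 when X_2 is omitted, even if X_1 and X_2 are uncorrelated
\[
b_1 \approx \beta_1 \sqrt{\frac{\pi^{2}/3}{\pi^{2}/3 + \beta_2^{2}\operatorname{Var}(X_2)}}
\]
```

Because the residual variance cannot shrink when a relevant variable is added, the coefficients are rescaled instead, which is why the β estimates change between models even without correlation among the X vars.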

Non-Linear Models
- Be sensible about how you parameterize them
- Be careful interpreting them…
- Don’t throw variables in like a ‘bull in a china shop’
- Model checking: make sure you understand how the ‘left hand side’ (lhs) is working
- Some bad examples using SARs (large dataset with many significant X variables)

Reference A technical explanation of the issue is given in Davies, R.B. (1992) ‘Sample Enumeration Methods for Model Interpretation’ in P.G.M. van der Heijden, W. Jansen, B. Francis and G.U.H. Seeber (eds) Statistical Modelling, Elsevier.

A Few Tricks: outreg2
- The outreg command was written by John Luke Gallup and appears in Stata Technical Bulletin #59; outreg2 is the extended version written by Roy Wada
- You can download outreg2 from within Stata
- It outputs regression results in a more ‘publishable’ format, e.g. in a Word document
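A minimal sketch of a typical outreg2 workflow is given below. The dataset (nlsw88, shipped with Stata), the variables and the output file name results are assumptions chosen for illustration, not taken from the slides.

```stata
* Install once from the SSC archive (needs an internet connection)
ssc install outreg2

* Illustrative data only: nlsw88 ships with Stata
sysuse nlsw88, clear

* First model: create the output file (the word option writes results.rtf)
regress wage collgrad age
outreg2 using results, word replace ctitle(Model 1)

* Second model: append it as a new column in the same file
regress wage collgrad age married
outreg2 using results, word append ctitle(Model 2)
```

The resulting file opens in Word with one column per model, which avoids retyping coefficients by hand.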

A Few Tricks: statsby
- statsby tells Stata to collect statistics for a command across a by list
- Attractive because it saves the data simply, and the results can be used in graphs
- In our experience statsby can save a lot of manual editing of results
- You can re-run the models with small adjustments; subsequent operations such as graph generation can be better automated
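As a rough sketch of the idea, the example below collects regression coefficients and standard errors by group and then graphs them. The nlsw88 dataset and the variables race, wage, collgrad and age are assumptions made for illustration.

```stata
* Illustrative data only: nlsw88 ships with Stata and is not the data used in the slides
sysuse nlsw88, clear

* Fit the same regression within each category of race and
* replace the data in memory with one row of results per category
statsby _b _se, by(race) clear: regress wage collgrad age

* The collected results are ordinary variables (_b_collgrad, _se_collgrad, ...)
list race _b_collgrad _se_collgrad

* ... so they can go straight into a graph
graph bar (asis) _b_collgrad, over(race)
```

Using saving(filename, replace) instead of clear would keep the original data in memory and write the results to a separate file.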

Handling coefficients
1) Some general issues (some thoughts on statistical modelling in Stata, and some tricks and tips…)
2) Using quasi-variance

Using Quasi-variance to Communicate Sociological Results from Statistical Models Vernon Gayle & Paul S. Lambert University of Stirling Gayle and Lambert (2007) Sociology, 41(6):

“One of the useful things about mathematical and statistical models [of educational realities] is that, so long as one states the assumptions clearly and follows the rules correctly, one can obtain conclusions which are, in their own terms, beyond reproach. The awkward thing about these models is the snares they set for the casual user; the person who needs the conclusions, and perhaps also supplies the data, but is untrained in questioning the assumptions….

…What makes things more difficult is that, in trying to communicate with the casual user, the modeller is obliged to speak his or her language – to use familiar terms in an attempt to capture the essence of the model. It is hardly surprising that such an enterprise is fraught with difficulties, even when the attempt is genuinely one of honest communication rather than compliance with custom or even subtle indoctrination” (Goldstein 1993, p. 141).

A little biography (or narrative)…
- Since being at the Centre for Applied Stats in 1998/9 I have been thinking about the issue of model presentation
- Done some work on Sample Enumeration Methods with Richard Davies
- Summer 2004 (with David Steele’s help) began to think about “quasi-variance”
- Summer 2006 began writing a paper with Paul Lambert

Statistical Models
- Statistical models offer an attractive way for sociological researchers to summarize patterns from social survey datasets
- They offer techniques to summarize the joint relative effects of several different variables in a research study
- This is achieved by estimating statistical values (‘parameters’ or ‘coefficient estimates’) that indicate the magnitude and direction of the effect of each explanatory variable
- The appropriate sociological interpretation of the parameter estimates from statistical models is by no means trivial

The Reference Category Problem
- In standard statistical models the effects of a categorical explanatory variable are assessed by comparison to one category (or level) that is set as a benchmark against which all other categories are compared
- The benchmark category is usually referred to as the ‘reference’ or ‘base’ category

The Reference Category Problem
An example of some English Government Office Regions:
0 = North East of England
1 = North West England
2 = Yorkshire & Humberside
3 = East Midlands
4 = West Midlands
5 = East of England

Government Office Region

Table 1: Logistic regression prediction that self-rated health is ‘good’ (parameter estimates for model 1)
Columns: (1) Beta; (2) Standard Error; (3) Prob.; (4) 95% Confidence Intervals
Rows: No Higher qualifications; Higher Qualifications; Males; Females; North East; North West; Yorkshire & Humberside; East Midlands; West Midlands; East of England; South East; South West; Inner London; Outer London; Constant

Extract from Table 1 (columns: Beta; Standard Error; Prob.; 95% Confidence Intervals)
Rows: North East (no entries: - - - - -); North West; Yorkshire & Humberside

Conventional Confidence Intervals
- Since these confidence intervals overlap we might be beguiled into concluding that the two regions are not significantly different to each other
- However, this conclusion represents a common misinterpretation of regression estimates for categorical explanatory variables
- These confidence intervals are not estimates of the difference between the North West and Yorkshire and Humberside; instead they indicate the difference between each category and the reference category (i.e. the North East)
- Critically, there is no confidence interval for the reference category because it is forced to equal zero

Formally Testing the Difference Between Parameters - The banana skin is here!

Standard Error of the Difference

$$\text{s.e.}(\hat\beta_{NW} - \hat\beta_{YH}) = \sqrt{\operatorname{var}(\hat\beta_{NW}) + \operatorname{var}(\hat\beta_{YH}) - 2\,\operatorname{cov}(\hat\beta_{NW}, \hat\beta_{YH})}$$

- var(β_NW): variance for North West (s.e.²)
- var(β_YH): variance for Yorkshire & Humberside (s.e.²)
- cov(β_NW, β_YH): covariance, only available in the variance covariance matrix

Table 2: Variance Covariance Matrix of Parameter Estimates for the Govt Office Region variable in Model 1
Rows and columns (1 to 9): North West; Yorkshire & Humberside; East Midlands; West Midlands; East England; South East; South West; Inner London; Outer London. The diagonal holds the variances; the off-diagonal entries are the covariances.

Formal Tests
- t = -0.03 / 0.0083 = -3.6
- Wald χ² = (-0.03 / 0.0083)² = 12.97; p = …
- Remember: earlier, because the two sets of confidence intervals overlapped, we could wrongly have concluded that the two regions were not significantly different to each other
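For the primary analyst, Stata can carry out this comparison directly after estimation. The sketch below is illustrative only: it assumes the nlsw88 dataset that ships with Stata, with race (1 = white, the base category; 2 = black; 3 = other) standing in for Government Office Region, and it uses factor-variable notation, which needs Stata 11 or later.

```stata
* Illustrative data only: race stands in for the region variable in the slides
sysuse nlsw88, clear

* Logit model with a categorical explanatory variable (base category: 1.race)
logit union i.race collgrad

* Show the variance covariance matrix of the estimates
estat vce

* Estimate the difference between two non-reference categories,
* with a standard error that uses the covariance term
lincom 2.race - 3.race

* Equivalent Wald chi-squared test of equality
test 2.race = 3.race
```

lincom reports the difference with its standard error and z statistic; test reports the corresponding chi-squared statistic on 1 degree of freedom.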

Comment
- Only the primary analyst has the opportunity to make formal comparisons
- Reporting the matrix is seldom, if ever, feasible in paper-based publications
- In a model with q parameters there would, in general, be ½q(q-1) covariances to report (e.g. 36 for the nine region parameters in Table 2)

Firth’s Method (made simple)

$$\text{s.e. difference} \approx \sqrt{qv_{NW} + qv_{YH}}$$

where qv denotes the quasi-variance of each category.

Table 1: Logistic regression prediction that self-rated health is ‘good’ (parameter estimates for model 1, featuring conventional regression results and quasi-variance statistics)
Columns: Beta; Standard Error; Prob.; 95% Confidence Intervals; Quasi-Variance
Rows: No Higher qualifications; Higher Qualifications; Males; Females; North East; North West; Yorkshire & Humberside

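Firth’s Method (made simple)
- s.e. difference ≈ √(quasi-variance for North West + quasi-variance for Yorkshire & Humberside)
- t = difference between the two estimates / s.e. difference = -3.6
- Wald χ² = (-0.03 / s.e. difference)² = 12.97; p = …
- These results are identical to the results calculated by the conventional method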
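The logic can be checked by hand in the special case of a three-category variable, where quasi-variances can be solved exactly rather than approximated. The sketch below assumes the nlsw88 dataset shipped with Stata (race has three categories, with 1.race as the base) and hypothetical scalar names; for variables with more categories, such as the regions in the slides, the quasi-variances are approximations and this shortcut does not apply.

```stata
* Illustrative data only: a three-category factor, so the quasi-variances are exact
sysuse nlsw88, clear
logit union i.race collgrad

* e(V) follows the order of the coefficient table:
* 1b.race, 2.race, 3.race, collgrad, _cons
matrix V = e(V)
scalar v2  = V[2,2]    // variance of 2.race
scalar v3  = V[3,3]    // variance of 3.race
scalar c23 = V[2,3]    // covariance of 2.race and 3.race

* Quasi-variances: chosen so that each pairwise variance is a simple sum
scalar q1 = c23        // reference category (1.race)
scalar q2 = v2 - c23
scalar q3 = v3 - c23

* s.e. of the difference from quasi-variances alone ...
display "s.e. difference from quasi-variances: " sqrt(q2 + q3)

* ... compared with the conventional calculation
lincom 2.race - 3.race
display "s.e. difference from lincom:          " r(se)
```

The two displayed standard errors should agree, because q2 + q3 equals v2 + v3 - 2*c23; the point of reporting quasi-variances is that a reader can do the first calculation without access to the covariance.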

The QV based ‘comparison intervals’ no longer overlap
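A sketch of why non-overlapping comparison intervals are informative is given below. The notation (qv_j for the quasi-variance of category j, and a 1.96 multiplier for 95% intervals) is assumed here for illustration.

```latex
% Comparison interval for category j (the reference category has \hat\beta = 0)
\[
\hat\beta_j \pm 1.96\sqrt{qv_j}
\]

% If the intervals for categories i and j do not overlap, then
\[
|\hat\beta_i - \hat\beta_j| > 1.96\left(\sqrt{qv_i} + \sqrt{qv_j}\right)
\ge 1.96\sqrt{qv_i + qv_j}
\approx 1.96\,\text{s.e.}(\hat\beta_i - \hat\beta_j)
\]
```

Non-overlap therefore implies a difference that is significant at the 5% level, whereas overlapping comparison intervals still call for the formal calculation.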

Firth QV Calculator (on-line)

Table 2: Variance Covariance Matrix of Parameter Estimates for the Govt Office Region variable in Model 1
Rows and columns (1 to 9): North West; Yorkshire & Humberside; East Midlands; West Midlands; East England; South East; South West; Inner London; Outer London

Information from the Variance-Covariance Matrix Entered into the Data Window (Model 1)

Conclusion: we should start using this method
Benefits
- Overcomes the reference category problem when presenting models
- Provides reliable results (even though based on an approximation)
- Easy(ish) to calculate
- Has extensions to other models
Costs
- Extra column in results
- Time convincing colleagues that this is a good thing

Conclusion: why have we told you this…
- Categorical X vars are ubiquitous
- Interpretation of coefficients is critical to sociological analyses
  – Subtleties / slipperiness
  – Emphasis often on precision rather than communication (e.g. in economics)