Presenting results from statistical models Professor Vernon Gayle and Dr Paul Lambert (Stirling University) Wednesday 1st April 2009.


Structure of the Seminar
Should take 1 semester!!!
1. Principles of model construction and interpretation
2. Key variables – measurement and functional form
3. Presenting results
4. Longitudinal data analysis
5. Individuals in households – multilevel models

“One of the useful things about mathematical and statistical models [of educational realities] is that, so long as one states the assumptions clearly and follows the rules correctly, one can obtain conclusions which are, in their own terms, beyond reproach. The awkward thing about these models is the snares they set for the casual user; the person who needs the conclusions, and perhaps also supplies the data, but is untrained in questioning the assumptions….

…What makes things more difficult is that, in trying to communicate with the casual user, the modeller is obliged to speak his or her language – to use familiar terms in an attempt to capture the essence of the model. It is hardly surprising that such an enterprise is fraught with difficulties, even when the attempt is genuinely one of honest communication rather than compliance with custom or even subtle indoctrination” (Goldstein 1993, p. 141).

Structure of this session
1. Presenting results
This talk could also take weeks on end
Two topics only - not the final word
– Quasi-Variances
– Sample Enumeration methods
Many more topics emerging,
– propensity score matching
– simulation modelling

Using Quasi-variance to Communicate Sociological Results from Statistical Models Vernon Gayle & Paul S. Lambert University of Stirling Gayle and Lambert (2007) Sociology, 41(6):

A little biography (or narrative)…
Since being at the Centre for Applied Stats in 1998/9 I have been thinking about the issue of model presentation
Done some work on Sample Enumeration Methods with Richard Davies
Summer 2004 (with David Steele’s help) began to think about “quasi-variance”
Summer 2006 began writing a paper with Paul Lambert

The Reference Category Problem
In standard statistical models the effects of a categorical explanatory variable are assessed by comparison to one category (or level) that is set as a benchmark against which all other categories are compared.
The benchmark category is usually referred to as the ‘reference’ or ‘base’ category.

The Reference Category Problem
An example of some English Government Office Regions
0 = North East of England
1 = North West England
2 = Yorkshire & Humberside
3 = East Midlands
4 = West Midlands
5 = East of England

Government Office Region

Table 1: Logistic regression prediction that self-rated health is ‘good’ (Parameter estimates for model 1)
Columns: (1) Beta; (2) Standard Error; (3) Prob.; (4) 95% Confidence Intervals
Rows: No Higher qualifications (reference); Higher Qualifications; Males (reference); Females; North East (reference); North West; Yorkshire & Humberside; East Midlands; West Midlands; East of England; South East; South West; Inner London; Outer London; Constant
[Cell values not preserved in the transcript]

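Extract from Table 1
Columns: Beta; Standard Error; Prob.; 95% Confidence Intervals
Rows: North East (reference category, no estimates); North West; Yorkshire & Humberside
[Cell values not preserved in the transcript]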

Conventional Confidence Intervals
Since these confidence intervals overlap we might be beguiled into concluding that the two regions are not significantly different to each other.
However, this conclusion represents a common misinterpretation of regression estimates for categorical explanatory variables.
These confidence intervals are not estimates of the difference between the North West and Yorkshire and Humberside, but instead they indicate the difference between each category and the reference category (i.e. the North East).
Critically, there is no confidence interval for the reference category because it is forced to equal zero.

Formally Testing the Difference Between Parameters - The banana skin is here!

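Standard Error of the Difference
\[ \text{s.e.}(\hat\beta_{NW} - \hat\beta_{YH}) = \sqrt{\operatorname{var}(\hat\beta_{NW}) + \operatorname{var}(\hat\beta_{YH}) - 2\operatorname{cov}(\hat\beta_{NW}, \hat\beta_{YH})} \]
The two variances are the squared standard errors (s.e.²) of North West and of Yorkshire & Humberside. The covariance is only available in the variance covariance matrix.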

Table 2: Variance Covariance Matrix of Parameter Estimates for the Govt Office Region variable in Model 1
Rows and columns (1 to 9): North West; Yorkshire & Humberside; East Midlands; West Midlands; East England; South East; South West; Inner London; Outer London
The off-diagonal cells hold the covariances.
[Cell values not preserved in the transcript]

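Standard Error of the Difference
\[ \text{s.e.}(\hat\beta_{NW} - \hat\beta_{YH}) = \sqrt{\operatorname{var}(\hat\beta_{NW}) + \operatorname{var}(\hat\beta_{YH}) - 2\operatorname{cov}(\hat\beta_{NW}, \hat\beta_{YH})} = 0.0083 \]
The two variances are the squared standard errors (s.e.²) of North West and of Yorkshire & Humberside. The covariance is only available in the variance covariance matrix.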

Formal Tests
t = -0.03 / 0.0083 = -3.6
Wald χ² = (-0.03 / 0.0083)² = 12.97; p = [value not preserved in the transcript]
Remember – earlier, because the two sets of confidence intervals overlapped, we could wrongly conclude that the two regions were not significantly different to each other
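The formal test above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' own code: the function names are invented here, and the estimates, variances and covariance are hypothetical values chosen for easy arithmetic, not the entries of Table 2.

```python
import math


def se_of_difference(var_a, var_b, cov_ab):
    """Standard error of (beta_a - beta_b) from variance-covariance entries."""
    return math.sqrt(var_a + var_b - 2.0 * cov_ab)


def wald_test(beta_a, beta_b, se_diff):
    """Return (t, Wald chi-square on 1 d.f., p-value) for H0: beta_a == beta_b."""
    t = (beta_a - beta_b) / se_diff
    chi2 = t ** 2
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return t, chi2, p


# Hypothetical inputs (NOT the values from the model in the slides).
beta_nw, beta_yh = 0.09, 0.12
var_nw, var_yh, cov_nw_yh = 0.0002, 0.0003, 0.0002

se_diff = se_of_difference(var_nw, var_yh, cov_nw_yh)
t, chi2, p = wald_test(beta_nw, beta_yh, se_diff)
print(f"s.e. difference = {se_diff:.4f}, t = {t:.2f}, Wald = {chi2:.2f}, p = {p:.4f}")
```

The point of the sketch is that the covariance term is required: a reader who only sees the two standard errors cannot reproduce the test.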

Comment
Only the primary analyst has the opportunity to make formal comparisons
Reporting the matrix is seldom, if ever, feasible in paper-based publications
In a model with q parameters there would, in general, be ½q(q-1) covariances to report

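Firth’s Method (made simple)
\[ \text{s.e. difference} \approx \sqrt{\text{QV}_{NW} + \text{QV}_{YH}} \]
Here QV denotes the quasi-variance of each category.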

Table 1: Logistic regression prediction that self-rated health is ‘good’ (Parameter estimates for model 1, featuring conventional regression results, and quasi-variance statistics)
Columns: Beta; Standard Error; Prob.; 95% Confidence Intervals; Quasi-Variance
Rows: No Higher qualifications; Higher Qualifications; Males; Females; North East; North West; Yorkshire & Humberside
[Cell values not preserved in the transcript]

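Firth’s Method (made simple)
\[ \text{s.e. difference} \approx \sqrt{\text{QV}_{NW} + \text{QV}_{YH}} \]
t = (difference between the two estimates) / s.e. difference = -3.6
Wald χ² = (-0.03 / s.e. difference)² = 12.97; p = [value not preserved in the transcript]
These results are identical to the results calculated by the conventional method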

The QV based ‘comparison intervals’ no longer overlap
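A minimal sketch of how quasi-variances might be used once they have been obtained (for example from the on-line calculator). The function names and all numbers are hypothetical, and the comparison interval is assumed here to be the estimate plus or minus 1.96 times the square root of the quasi-variance.

```python
import math


def qv_se_of_difference(qv_a, qv_b):
    """Approximate s.e. of (beta_a - beta_b) using quasi-variances only."""
    return math.sqrt(qv_a + qv_b)


def comparison_interval(beta, qv, z=1.96):
    """Comparison interval based on the quasi standard error sqrt(qv)."""
    half_width = z * math.sqrt(qv)
    return beta - half_width, beta + half_width


# Hypothetical estimates and quasi-variances (NOT taken from Table 1).
beta_nw, qv_nw = 0.09, 0.00004
beta_yh, qv_yh = 0.12, 0.00006

se_diff = qv_se_of_difference(qv_nw, qv_yh)
interval_nw = comparison_interval(beta_nw, qv_nw)
interval_yh = comparison_interval(beta_yh, qv_yh)
print(f"approx. s.e. difference = {se_diff:.4f}")
print("North West:", interval_nw)
print("Yorkshire & Humberside:", interval_yh)
```

No covariance appears anywhere in the calculation, which is why a single extra column in a results table is enough for readers to make their own comparisons.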

Firth QV Calculator (on-line)

Table 2: Variance Covariance Matrix of Parameter Estimates for the Govt Office Region variable in Model 1
Rows and columns (1 to 9): North West; Yorkshire & Humberside; East Midlands; West Midlands; East England; South East; South West; Inner London; Outer London
[Cell values not preserved in the transcript]

Information from the Variance-Covariance Matrix Entered into the Data Window (Model 1)

QV Conclusion – We should start using this method
Benefits
– Overcomes the reference category problem when presenting models
– Provides reliable results (even though based on an approximation)
– Easy(ish) to calculate
– Has extensions to other models
Costs
– Extra column in results
– Time convincing colleagues that this is a good thing

Example Drew, D., Gray, J. and Sime, N. (1992) Against the odds: The Education and Labour Market Experiences of Black Young People

Comparison of Odds
Greater than 1: “higher odds”
Less than 1: “lower odds”

Naïve Odds
In this model (after controlling for other factors)
White pupils have an odds of 1.0
Afro Caribbean pupils have an odds of 3.2
Reporting this in isolation is a naïve presentation of the effect because it ignores other factors in the model

A Comparison
Pupil with 4+ higher passes: White; Professional parents; Male; Graduate parents; Two parent family
Pupil with 0 higher passes: Afro-Caribbean; Manual parents; Male; Non-Graduate parents; One parent family

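Odds are multiplicative
Factors: 4+ Higher Grades; Ethnic Origin; Social Class; Gender; Parental Education; No. of Parents
The overall odds are the product of the odds for each factor.
[Odds values not preserved in the transcript]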
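To illustrate what multiplicative means here, the sketch below combines per-factor odds into a single figure. The factor names and values are hypothetical stand-ins, since the figures on the slide were not preserved.

```python
import math

# Hypothetical odds for one pupil profile relative to the other (illustrative only).
odds_by_factor = {
    "higher_grades": 0.5,
    "ethnic_origin": 2.0,
    "social_class": 0.8,
}

# On the odds scale the effects multiply ...
combined_odds = math.prod(odds_by_factor.values())

# ... which is the same as adding the coefficients on the log odds (logit) scale.
combined_log_odds = sum(math.log(v) for v in odds_by_factor.values())

print(f"combined odds = {combined_odds:.2f}")
print(f"exp(sum of log odds) = {math.exp(combined_log_odds):.2f}")
```

In this made-up example a single factor with odds of 2.0 sits alongside combined odds below 1, which is the kind of reversal an isolated odds ratio hides.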

Naïve Odds
Drew, D., Gray, J. and Sime, N. (1992) warn of this danger….
…Naïvely presenting isolated odds ratios is still widespread (e.g. Connolly 2006, Brit. Ed. Res. Journal 32(1), pp. 3-21)
We should avoid reporting isolated odds ratios where possible!

Logit scale
Generally, people find it hard to directly interpret results on the logit scale – i.e. β

Log Odds, Odds, Probability
Odds = exp(log odds)
Probability = odds / (1 + odds)
Odds = probability / (1 - probability)
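The three conversions can be written as small Python helpers. This is only a sketch; the function names are chosen here for illustration.

```python
import math


def log_odds_to_odds(log_odds):
    """Odds = exp(log odds)."""
    return math.exp(log_odds)


def odds_to_probability(odds):
    """Probability = odds / (1 + odds)."""
    return odds / (1.0 + odds)


def probability_to_odds(p):
    """Odds = probability / (1 - probability)."""
    return p / (1.0 - p)


# Print odds, ln(odds) and probability side by side for a few example odds.
for odds in (0.25, 0.5, 1.0, 2.0, 4.0):
    print(f"odds={odds:5.2f}  ln odds={math.log(odds):6.3f}  p={odds_to_probability(odds):.3f}")
```

Printing reciprocal odds such as 0.5 and 2.0 together shows that they are equally far from zero on the log odds scale but not equally far from 1 on the odds scale.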

Log Odds, Odds, Probability
Columns: Odds; ln odds; p
[Table values not preserved in the transcript]
Odds are asymmetric – beware!

Divide by 4 rule
Gelman and Hill (2008) suggest dividing coefficients from logit models by 4 as a guide for assessing, as a probability, the effect of the β estimated for a given explanatory variable.
They assert that β/4 provides a ‘rule of convenience’ for estimating the upper bound of the predictive difference corresponding to a unit change in the explanatory variable.
Gelman and Hill (2008) are careful to report that this is an approximation and that it performs best near the midpoint of the logistic curve.
We believe that this has some merit as a rough and ready method of interpreting the effects of estimates, and is a useful tool especially when tables of coefficients are rapidly flashed up at a conference presentation.
Gelman, A. and J. Hill (2008) Data Analysis Using Regression and Multilevel/Hierarchical Models, Cambridge: Cambridge University Press
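The rule can be checked numerically. The sketch below uses a hypothetical one-variable logit model (the intercept `alpha` and slope `beta` are invented for illustration) and compares β/4 with the actual change in predicted probability for a one-unit change in x at several starting points.

```python
import math


def inv_logit(z):
    """Convert a value on the logit scale to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predictive_difference(alpha, beta, x):
    """Change in Pr(y = 1) when x increases by one unit, from x to x + 1."""
    return inv_logit(alpha + beta * (x + 1.0)) - inv_logit(alpha + beta * x)


# Hypothetical coefficients (illustrative only).
alpha, beta = 0.0, 0.8
upper_bound = beta / 4.0

# x = -0.5 straddles the midpoint of the curve, where the rule works best.
for x in (-4.0, -2.0, -0.5, 2.0, 4.0):
    diff = predictive_difference(alpha, beta, x)
    print(f"x={x:5.1f}  actual difference={diff:.3f}  beta/4={upper_bound:.3f}")
```

The bound follows from the slope of the logistic curve, β·p·(1 − p), which is largest at p = 0.5 where it equals β/4.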

Communicating Results (to non-technically informed audiences)
Davies (1992) Sample Enumeration
Payne (1998) Labour Party campaign data
Gayle et al. (2002)
War against the uninformed use of odds (e.g. on breakfast t.v.)

Sample Enumeration Methods
In a nutshell… “What if” – what if the gender effect was removed
1. Fit a model (e.g. logit)
2. Focus on a comparison (e.g. boys and girls)
3. Use the fitted model to estimate a fitted value for each individual in the comparison group
4. Sum these fitted values and construct a sample enumerated % for the group
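Steps 2 to 4 can be sketched as follows, assuming step 1 (fitting the logit model) has already been done elsewhere. The coefficients, variable names and the tiny sample of boys are all hypothetical; the helper `sample_enumerated_pct` is a name invented for this illustration.

```python
import math


def inv_logit(z):
    """Convert a value on the logit scale to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def sample_enumerated_pct(rows, coefs, intercept, overrides=None):
    """Mean fitted probability (as a %) over rows, after applying 'what if' overrides."""
    overrides = overrides or {}
    total = 0.0
    for row in rows:
        what_if = {**row, **overrides}  # e.g. switch the 'boy' dummy off
        z = intercept + sum(coefs[name] * what_if[name] for name in coefs)
        total += inv_logit(z)
    return 100.0 * total / len(rows)


# Hypothetical fitted model and a tiny illustrative sample of boys.
intercept = 0.1
coefs = {"boy": -0.4, "parent_graduate": 0.9}
boys = [
    {"boy": 1, "parent_graduate": 0},
    {"boy": 1, "parent_graduate": 1},
    {"boy": 1, "parent_graduate": 0},
    {"boy": 1, "parent_graduate": 1},
]

fitted_pct = sample_enumerated_pct(boys, coefs, intercept)
what_if_pct = sample_enumerated_pct(boys, coefs, intercept, overrides={"boy": 0})
print(f"boys, as fitted: {fitted_pct:.1f}%")
print(f"boys, gender effect removed: {what_if_pct:.1f}%")
```

Every other characteristic of each boy is left as observed, so the gap between the two percentages reflects only the gender coefficient.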

Naïve Odds
Naïvely presenting odds ratios is widespread (e.g. Connolly 2006)
In this model naïvely (after controlling for other factors)
Girls have an odds of 1.0
Boys have an odds of 0.58
We should avoid this where possible!

Logit Model
Example from YCS 11 (these pupils took GCSE in 2001)
y = 1: 5+ GCSE passes (A* - C)
X vars: gender; family social class (NS-SEC); ethnicity; housing tenure; parental education; parental employment; school type; family type

Naïve Odds
Example from YCS 11 (these pupils took GCSE in 2001)
In this model naïvely (after controlling for other factors)
Girls have an odds of 1.0
Boys have an odds of 0.66
We should avoid this where possible!

Sample Enumeration Results
Percentage with 5+ GCSE (A*-C)
All: 52%
Girls: 58%
Boys: 47%
(Sample enumeration est. boys): (50%)
Observed difference: 11%
Difference due ‘directly’ to gender: 3%
Difference due to other things: 8%

Pseudo Confidence Interval
Sample Enumeration Male Effect
Upper Bound: 50.32%
Estimate: 49.81%
Lower Bound: 49.30%
Bootstrapping to construct a pseudo confidence interval (1000 Replications)
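One way such a pseudo confidence interval might be built is a percentile bootstrap. The sketch below is a simplification under stated assumptions: it resamples a hypothetical list of per-individual "what if" fitted probabilities and holds the model coefficients fixed, whereas a fuller version would refit the model on each replicate. It is not necessarily the procedure used for the figures above.

```python
import random


def bootstrap_percentile_interval(values, n_reps=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_reps):
        # Resample individuals with replacement and recompute the mean.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower_idx = max(0, int((1.0 - level) / 2.0 * n_reps))
    upper_idx = min(n_reps - 1, int((1.0 + level) / 2.0 * n_reps) - 1)
    return means[lower_idx], means[upper_idx]


# Hypothetical fitted "what if" probabilities for 100 boys (illustrative only).
fitted_probs = [0.2, 0.4, 0.5, 0.6, 0.8] * 20

estimate = sum(fitted_probs) / len(fitted_probs)
lower, upper = bootstrap_percentile_interval(fitted_probs, n_reps=1000)
print(f"estimate = {100 * estimate:.2f}%")
print(f"pseudo interval = ({100 * lower:.2f}%, {100 * upper:.2f}%)")
```

Fixing the seed makes the interval reproducible from run to run.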

Reference A technical explanation of the issue is given in Davies, R.B. (1992) ‘Sample Enumeration Methods for Model Interpretation’ in P.G.M. van der Heijden, W. Jansen, B. Francis and G.U.H. Seeber (eds) Statistical Modelling, Elsevier We have recently written a working paper on logit models

Conclusion – Why have we told you this…
Categorical X vars are ubiquitous
Interpretation of coefficients is critical to sociological analyses
– Subtleties / slipperiness
– (e.g. in Economics where emphasis is often on precision rather than communication)