Brad Verhulst & Sarah Medland

Presentation transcript:

Ordinal Data Brad Verhulst & Sarah Medland Special thanks to Frühling Rijsdijk and those who came before

Analysis of ordinal variables This session aims to provide an intuition for how we estimate correlations from ordinal data (as twin analyses rely on covariances/correlations). For this we need to introduce the concept of 'liability' and liability threshold models, followed by a more mathematical description of the model. It is hard to see how we can go from count data to tetrachoric correlations when complex disorders are assessed as diagnoses (categorical variables) rather than quantitative dimensions. The answer is to use the liability threshold model (LTM).

Ordinal data The measuring instrument discriminates between two or a few ordered categories, e.g.: absence (0) or presence (1) of a disorder; score on a single Likert item; number of symptoms. In such cases the data take the form of counts, i.e. the number of individuals within each category of response.

Problems with treating binary variables as continuous Normality – binary variables are obviously not normally distributed, which means that the error terms cannot be normally distributed either.

Two Ways of Thinking about Binary Dependent Variables 1. Assume that the observed binary variable is indicative of an underlying, latent (unobserved) continuous, normally distributed variable; we call the unobserved variable a liability. 2. Treat the binary variable as a random draw from a binomial (or Bernoulli) distribution (a non-linear probability model).

Binary Variables as Indicators of Latent Continuous Variables Assume that the observed binary variable is indicative of an underlying, latent (unobserved) continuous, normally distributed variable. Assumptions: the categories reflect an imprecise measurement of an underlying normal distribution of liability, and the liability distribution has one or more thresholds.

Intuition behind the Liability Threshold Model (LTM) The observed variable y is a cut of the latent liability y* at the threshold τ: y = 1 if y* > τ, and y = 0 if y* ≤ τ. (Figure: distribution of y*, with the threshold mapping the liability onto the observed values of y.)
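A minimal sketch of this cut, assuming a standard normal liability and an illustrative threshold of 1.0 (both values are hypothetical, chosen only for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(1)
tau = 1.0                             # hypothetical threshold on the liability scale
y_star = rng.standard_normal(10_000)  # latent liability y*, assumed N(0, 1)
y = (y_star > tau).astype(int)        # observed binary y: 1 if y* exceeds tau, else 0

print(y.mean())                       # ~0.16, i.e. P(y* > 1.0)
```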

For disorders: the risk or liability to a disorder is normally distributed, and when an individual exceeds a threshold they have the disorder; the prevalence is the proportion of affected individuals. For a single questionnaire item score, e.g. 0 = not at all, 1 = sometimes, 2 = always, it does not make sense to talk about prevalence: we simply count the endorsements of each response category.

Intuition behind the Liability Threshold Model (LTM) We can only observe binary outcomes, affected or unaffected, but people can be more or less affected. Since the variables are latent (and therefore not directly observed) we cannot estimate the means and variances as we did for continuous variables. Thus, we have to make assumptions about them (fix them at arbitrary values).

Identifying Assumptions The traditional assumptions: Mean assumption – the intercept (mean) is 0, or the threshold is 0 (τ = 0); either assumption gives equivalent model fit, and the intercept is a transformation of τ. Variance assumption – Var(ε|x) = 1 in the normal-ogive (probit) model; Var(ε|x) = π²/3 in the logit model. Assumption 3 – the conditional mean of ε is 0; this is the same assumption as we make for continuous variables, and allows the parameters to be unbiased.
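The two variance conventions correspond to the error distributions of the two models; a quick illustrative check with scipy:

```python
import numpy as np
from scipy.stats import logistic, norm

print(norm.var())        # 1.0  -> probit / normal-ogive convention
print(logistic.var())    # ~3.29 -> logit convention: variance of the standard logistic
print(np.pi ** 2 / 3)    # 3.2898..., i.e. pi^2 / 3
```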

Identifying Assumptions of Ordinal Associations The assumptions are arbitrary. We can make slightly different assumptions and still estimate the model, but that could come at the cost of efficiency, bias and ease of interpretation. The assumptions are necessary. The magnitude of the covariance depends on the scale of the dependent variable. If we don’t make some assumptions about the variances, then the correlation coefficients are unidentified.

Intuitive explanation of thresholds in the univariate normal distribution The threshold is just a z-score and can be interpreted as such. If τ is -1.65 then 5% of the distribution will be to the left of τ and 95% to the right: out of 1000 people, 50 would be below τ and 950 above it. If τ is 1.96 then 97.5% of the distribution will be to the left of τ and 2.5% to the right: out of 1000 people, 975 would be below τ and 25 above it.
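A quick check of these tail areas with scipy (the thresholds -1.65 and 1.96 are the slide's own examples):

```python
from scipy.stats import norm

for tau in (-1.65, 1.96):
    below = norm.cdf(tau)            # proportion of the liability distribution below tau
    print(tau, round(below, 3), round(1 - below, 3))
# -1.65 -> 0.049 below, 0.951 above
#  1.96 -> 0.975 below, 0.025 above

# going the other way: the threshold implied by 5% of people falling above it
print(norm.ppf(0.95))                # ~1.645
```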

Bivariate normal distribution There are 2 variables, so we need to say something about the covariance. (Figures: bivariate normal surfaces with r = .90 and r = .00.)

Two binary traits (e.g. data from twins) What happens when we have categorical data for twins? When the measured trait is dichotomous, i.e. a disorder is either present or not, we can partition our observations into pairs concordant for not having the disorder (cell a), pairs discordant for the disorder (cells b and c), and pairs concordant for the disorder (cell d). These frequencies are summarized in a 2x2 contingency table (0 = unaffected, 1 = affected):

            Twin 2
Twin 1      0       1
   0        a       b
   1        c       d
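A sketch of tallying these four cells from paired binary diagnoses, using made-up twin data:

```python
import numpy as np

# hypothetical affection status (0 = unaffected, 1 = affected) per twin pair
twin1 = np.array([0, 0, 1, 0, 1, 0, 0, 1])
twin2 = np.array([0, 1, 1, 0, 0, 0, 0, 1])

table = np.zeros((2, 2), dtype=int)   # rows index twin 1, columns index twin 2
np.add.at(table, (twin1, twin2), 1)   # tally each pair into its cell
print(table)                          # [[a, b], [c, d]]
```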

Joint Liability Threshold Model for twin pairs When we have twin pair data, the assumption is that the joint distribution of liabilities of twin pairs follows a bivariate normal distribution, where both traits have a mean of 0 and a standard deviation of 1, but the correlation between them is unknown; that correlation is what we want to know. The shape of such a bivariate normal distribution is determined by the correlation between the traits. (Figures: bivariate normal surfaces with r = .90 and r = .00.)

The observed cell proportions relate to the proportions of the bivariate normal distribution with a certain correlation between the latent variables (y1 and y2), each cut at a certain threshold; i.e. the joint probability of a certain response combination is the volume under the BND surface bounded by the appropriate thresholds on each liability. (Figure: contour of the bivariate normal distribution of y1 and y2, cut into the four regions 00, 01, 10 and 11.)

Numerical integration of the bivariate normal distribution To calculate the cell proportions we rely on numerical integration of the bivariate normal distribution over the two liabilities, e.g. the probability that both twins are above the threshold:

P(y1 > Tc1, y2 > Tc2) = ∫ from Tc1 to ∞ ∫ from Tc2 to ∞ Φ(y1, y2; ρ) dy2 dy1

where Φ is the bivariate normal probability density function, y1 and y2 are the liabilities of twin 1 and twin 2 (with means of 0), ρ is the correlation between the two liabilities, and Tc1 and Tc2 are the thresholds (z-values) on y1 and y2. As with the standard normal distribution, the expected proportion under the curve between any range of values of the two traits can be calculated by means of numerical integration. How is numerical integration performed? There are programmed mathematical subroutines that do these calculations; OpenMx uses one of them.
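A sketch of this integral using scipy's bivariate normal CDF with inclusion-exclusion; the threshold 1.4 and correlation 0.60 anticipate the worked example on the next slide:

```python
from scipy.stats import multivariate_normal, norm

rho, tc = 0.60, 1.4
bnd = multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])

# P(y1 > tc, y2 > tc) = 1 - P(y1 <= tc) - P(y2 <= tc) + P(both <= tc)
p_both_above = 1 - 2 * norm.cdf(tc) + bnd.cdf([tc, tc])
print(p_both_above)   # ~0.03
```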

Expected cell proportions

Estimation of Correlations and Thresholds Since the bivariate normal distribution is a known mathematical distribution, for each correlation (ρ) and any set of thresholds on the liabilities we know what the expected proportions are in each cell. Therefore, the observed cell proportions of our data inform us about the most likely correlation and the threshold on each liability. Worked example, with r = 0.60 and Tc1 = Tc2 = 1.4 (z-value):

            Twin 2
Twin 1      0       1
   0       .87     .05
   1       .05     .03
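These numbers can be reproduced with the integration sketch above (a consistency check, not part of the original slides):

```python
from scipy.stats import multivariate_normal, norm

rho, tc = 0.60, 1.4
bnd = multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])

p00 = bnd.cdf([tc, tc])              # both unaffected          ~ .87
p01 = norm.cdf(tc) - p00             # discordant (each cell)   ~ .05
p11 = 1 - 2 * norm.cdf(tc) + p00     # both affected            ~ .03
print(round(p00, 2), round(p01, 2), round(p11, 2))
```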

Intuition behind the Liability Threshold Model with Multiple Cutpoints (Figure: distribution of the liability cut at thresholds τ1 < τ2 < τ3 < τ4, mapping onto the observed ordinal values.)
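A sketch of the multi-threshold cut, with four hypothetical thresholds producing five ordered categories:

```python
import numpy as np

rng = np.random.default_rng(2)
taus = [-1.5, -0.5, 0.5, 1.5]            # hypothetical thresholds tau1 < ... < tau4
y_star = rng.standard_normal(10_000)     # latent liability, assumed N(0, 1)
y = np.digitize(y_star, taus)            # observed ordinal category 0..4

print(np.bincount(y) / len(y))           # category proportions implied by the thresholds
```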

Comparison between the regression of the latent y* and the observed y (Figures: regression plot for the latent y* and for the observed y.) It is important to keep in mind that the scale of the ordinal variable is arbitrary, and therefore it is virtually impossible to compare the slopes of the two graphs (even though they look pretty similar).

What happens if we change the default assumptions? The defaults: mean assumption – the intercept (mean) is 0, or the threshold is 0 (τ = 0); variance assumption – Var(ε|x) = 1 in the normal-ogive model. Remember that we can make slightly different assumptions with equal model fit.

What alternative assumptions could we make? For example: τ1 is fixed to 0, the distance between τ1 and τ2 is fixed to 1, and the mean of the distribution is freely estimated. (Figure: distribution of the liability with thresholds τ1-τ4.)

It is important to reiterate that the model fit is the same and that all the parameters can be transformed from one set of assumptions to another.

Bivariate Ordinal Likelihood The likelihood for each observed ordinal response pattern is computed as the expected proportion in the corresponding cell of the bivariate normal distribution. The maximum-likelihood fit function for the whole sample is the sum of -2 × the log of the likelihood of each row of data (e.g. twin pairs). This -2LL is minimized to obtain the maximum-likelihood estimates of the correlation and thresholds. The correlation is called tetrachoric if y1 and y2 each reflect 2 categories (1 threshold), and polychoric when there are more than 2 categories per liability.
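A minimal sketch of this estimation for the binary (tetrachoric) case, assuming equal thresholds for both members of a pair and working from aggregated cell counts (hypothetical data matching the earlier worked example; real software such as OpenMx evaluates the likelihood row-wise and handles covariates and missingness):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal, norm

counts = np.array([870, 50, 50, 30])   # hypothetical cells a (00), b (01), c (10), d (11)

def neg2ll(params):
    rho, tau = np.tanh(params[0]), params[1]      # tanh keeps rho inside (-1, 1)
    p00 = multivariate_normal.cdf([tau, tau], mean=[0, 0],
                                  cov=[[1.0, rho], [rho, 1.0]])
    p01 = norm.cdf(tau) - p00                     # twin 1 unaffected, twin 2 affected
    p11 = 1 - 2 * norm.cdf(tau) + p00             # both affected
    probs = np.array([p00, p01, p01, p11])
    return -2 * np.sum(counts * np.log(probs))    # multinomial -2 log-likelihood

fit = minimize(neg2ll, x0=[0.0, 1.0], method="Nelder-Mead")
print(np.tanh(fit.x[0]), fit.x[1])                # rho ~ 0.60, tau ~ 1.4
```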

Twin Models When data on MZ and DZ twin pairs are available, we can estimate a correlation in liability for each type of twin pair from its contingency table. We can then go further by fitting a model for the liability that explains these MZ and DZ correlations: the variance of the liability is decomposed into A, C and E, as for continuous traits, with the correlations in liability determined by the path model. This leads to an estimate of the heritability of the liability.
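For intuition, a back-of-envelope Falconer-style decomposition of hypothetical MZ/DZ liability correlations (OpenMx instead estimates A, C and E by maximum likelihood, but the logic is the same):

```python
# hypothetical tetrachoric correlations in liability
r_mz, r_dz = 0.70, 0.45

a2 = 2 * (r_mz - r_dz)   # A: additive genetic variance, the heritability of liability
c2 = 2 * r_dz - r_mz     # C: shared environmental variance
e2 = 1 - r_mz            # E: unique environment (plus measurement error)
print(a2, c2, e2)        # ~0.5, 0.2, 0.3
```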

ACE Liability Model (Figure: path diagram with latent liabilities L for twin 1 and twin 2; variance components Va, Vc, Ve; the cross-twin path Va for MZ pairs or .5Va for DZ pairs; a variance constraint on each liability; and the threshold model dividing each twin's liability into unaffected vs affected.) This is not a proper path diagram, in that the path tracing rules do not apply for the arrows below the latent liability circles. It shows the underlying assumptions of liability, thresholds etc., and shows that what we start from is just counts for pairs of twins.

Summary OpenMx models ordinal data under a threshold model. Assumptions are made about the (joint) distribution of the data (standard bivariate normal). The relative proportions of observations in the cells of the contingency table are translated into proportions under the multivariate normal distribution, and the most likely thresholds and correlations are estimated. Genetic/environmental variance components are then estimated based on these correlations derived from MZ and DZ data. So that's the theory behind how ordinal analysis works; questions? Next, some tips on how to do it for different types of data.

Power issues Ordinal data analysed under the liability threshold model has less power than analyses of continuous data (Neale, Eaves & Kendler 1994, The power of the classical twin study to resolve variation in threshold traits). Solutions: 1. bigger samples; 2. use more categories. Please do not categorize continuous variables. Sometimes it is necessary to convert very skewed data into an ordinal scale, but if possible it is better to analyse continuous data, since the LTM has lower power than analysing the same phenotype measured continuously. This means you'll be less certain of your parameter estimates (because they will have larger confidence intervals). There are some things you can do to get some of this power back. Firstly, you can recruit more people, but this can be expensive, and you may have finished data collection by the time you realise your variable needs to be categorical. Secondly, rather than binary data, try using several categories, such as fine-graining your cases into those severely affected and those mildly affected (e.g. model 2 thresholds, which might separate people with no disease, moderate disease and severe disease). (Figure: liability distribution divided into controls, sub-clinical cases and cases.)