An Introduction to Group-Based Trajectory Modeling and PROC TRAJ
Richard Charnigo
Professor of Statistics and Biostatistics
Director of Statistics and Psychometrics Core, CDART

Objectives
First ~80 minutes:
1. Be able to describe a group-based trajectory model and, in particular, distinguish it from a conventional regression model.
2. Be able to interpret results obtained from fitting a group-based trajectory model via PROC TRAJ.
Last ~40 minutes:
3. Be able to fit a group-based trajectory model via PROC TRAJ.

Motivating example
The Excel file at { contains a simulated data set: five hundred college freshmen (“ID”) were asked to estimate how many times per month they consumed marijuana during their freshman (“Y1”), sophomore (“Y2”), junior (“Y3”), and senior (“Y4”) years of high school. Later they were asked to estimate their marijuana use during freshman year of college (“Y5”). They were also assessed on reward seeking; for ease of interpretation, we standardize this variable (“X”).

Motivating example
Two possible “research questions” are:
i. What are prototypical trajectories of marijuana use within the population of college students from which this sample was drawn?
ii. Is the trajectory that best describes the experience of a particular student associated with that student’s level of reward seeking?
We can develop more complicated and realistic scenarios (e.g., with additional personality variables and/or interventions), but this simple scenario will help us begin to understand group-based trajectory modeling and PROC TRAJ.

Exploratory data analysis
Before pursuing group-based trajectory (or any other statistical) modeling, we are well-advised to perform exploratory data analysis. This can alert us to gross, previously undetected mistakes in the data set, which may otherwise threaten the validity of our results. It can also suggest an appropriate probability distribution to use with the group-based trajectory model and help us anticipate what the results may be.

Exploratory data analysis
Quantiles (Definition 5)
Quantile      Estimate
100% Max      4
99%           3
95%           2
90%           1
75% Q3        1
50% Median    0
25% Q1        0
10%           0
5%            0
1%            0
0% Min        0
Basic Statistical Measures (values not shown)
Location: Mean, Median, Mode
Variability: Std Deviation, Variance, Range, Interquartile Range

Exploratory data analysis
Quantiles (Definition 5)
Quantile      Estimate
100% Max      14
99%           12
95%           9
90%           7
75% Q3        1
50% Median    0
25% Q1        0
10%           0
5%            0
1%            0
0% Min        0
Basic Statistical Measures (values not shown)
Location: Mean, Median, Mode
Variability: Std Deviation, Variance, Range, Interquartile Range

Exploratory data analysis
The preceding slides show descriptive statistics for Y1 and Y5. (We can similarly examine descriptive statistics for Y2, Y3, and Y4.) Here are a few observations:
As anticipated, the possible values of Y1 and Y5 are nonnegative, and they appear to have been recorded (or rounded) to the nearest integer.
The distributions of Y1 and Y5 are right-skewed, and there are lots of 0’s.
Both the mean and the variance for Y5 are greater than the corresponding quantities for Y1.

Exploratory data analysis
Our observations suggest the following:
Because there are lots of 0’s, there is no transformation that will bring Y1 or Y5 to approximate normality.
However, because Y1 and Y5 are integer-valued, a Poisson (or similar) probability distribution may be applicable.
Since Y5 has greater mean and variance than Y1, we anticipate some divergence between trajectories over time and at least one trajectory showing increasing marijuana use over time.
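As a rough illustration of this kind of check, the sketch below compares a sample of counts with what a single Poisson distribution would imply: equal mean and variance, and a proportion of zeros equal to exp(-mean). The function name `poisson_check` and the ten counts are hypothetical and are not the workshop data. Departures from a single Poisson distribution are not necessarily a problem here, since a mixture of subpopulations would be expected to produce them.

```python
import math
from statistics import mean, pvariance

def poisson_check(counts):
    """Summarize a sample of counts against a single-Poisson benchmark."""
    m = mean(counts)
    v = pvariance(counts)
    observed_zero = sum(1 for c in counts if c == 0) / len(counts)
    expected_zero = math.exp(-m)  # P(Y = 0) for a Poisson variable with mean m
    return {
        "mean": m,
        "variance": v,
        "observed_zero": observed_zero,
        "expected_zero": expected_zero,
    }

# Hypothetical monthly counts for ten students (NOT the workshop data)
sample = [0, 0, 0, 0, 0, 0, 1, 1, 2, 6]
summary = poisson_check(sample)
print(summary)
```

In this made-up sample the variance exceeds the mean and there are more zeros than a single Poisson distribution would predict, which is the pattern that motivates both multiple subpopulations and, later, zero inflation.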

A first trajectory model
Let $t$ denote time in years. If we set time 0 to be high school graduation, then we have $t = -3, -2, -1, 0,$ and $1$ corresponding to Y1 through Y5. Suppose for now --- the viability of this supposition can be assessed later --- that there are three subpopulations whose mean levels of marijuana use over time (called “trajectories”) are defined by exponentials of linear functions $f_1(t) = \exp(a_1 + b_1 t)$, $f_2(t) = \exp(a_2 + b_2 t)$, and $f_3(t) = \exp(a_3 + b_3 t)$. The exponentials are needed because $f_1(t)$, $f_2(t)$, and $f_3(t)$ must be nonnegative.
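A minimal sketch of these trajectories follows. The coefficient values are hypothetical, chosen only to mimic one nearly abstinent group and two groups with increasing use; they are not estimates from the workshop data.

```python
import math

def trajectory(a, b, t):
    """Mean use at time t for a subpopulation with intercept a and slope b."""
    return math.exp(a + b * t)

TIMES = [-3, -2, -1, 0, 1]  # Y1 through Y5, with t = 0 at high school graduation

# Hypothetical (a_j, b_j) pairs for subpopulations j = 1, 2, 3
COEFFICIENTS = {1: (-3.0, 0.0), 2: (0.0, 0.3), 3: (1.0, 0.5)}

for j, (a, b) in COEFFICIENTS.items():
    means = [round(trajectory(a, b, t), 3) for t in TIMES]
    print(f"subpopulation {j}: {means}")
```

Whatever values the coefficients take, the exponential keeps every mean strictly positive, which a plain linear function would not guarantee.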

A first trajectory model
Suppose that the distribution of $Y_k$ ($1 \le k \le 5$) in the first subpopulation is Poisson with mean $f_1(k-4)$, in the second is Poisson with mean $f_2(k-4)$, and in the third is Poisson with mean $f_3(k-4)$. Finally, suppose that the probability of belonging to subpopulation $j$ ($2 \le j \le 3$) divided by the probability of belonging to subpopulation 1 is $\exp(c_j + d_j X)$. If $d_j > 0$, then higher levels of reward seeking increase the above ratio; if $d_j < 0$, then they decrease the above ratio.
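The membership probabilities can be sketched as follows, assuming the ratio takes the multinomial-logit form exp(c_j + d_j X) with subpopulation 1 as the reference. The function name and the values of c_j and d_j are hypothetical.

```python
import math

def membership_probs(x, c, d):
    """Probabilities of belonging to each subpopulation given reward seeking x.

    c and d map j = 2, 3 to c_j and d_j; subpopulation 1 is the reference,
    so its unnormalized weight is fixed at 1.
    """
    weights = {1: 1.0}
    for j in c:
        weights[j] = math.exp(c[j] + d[j] * x)
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}

# Hypothetical values, not estimates from the workshop data
C = {2: -1.0, 3: -1.5}
D = {2: 0.5, 3: 1.0}

for x in (-1.0, 0.0, 1.0):
    probs = membership_probs(x, C, D)
    print(x, {j: round(p, 3) for j, p in probs.items()})
```

With a positive d_3, as assumed here, the probability of the third subpopulation relative to the first grows as x increases, matching the interpretation of the sign of d_j.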

A first trajectory model
A group-based trajectory model is thus distinguished from a conventional regression model in that a latent variable --- namely, the subpopulation to which one belongs --- is intermediate between what might be thought of as the independent variable (here, reward seeking) and the dependent variable (here, marijuana use). Consequently, and importantly, the difference between two trajectories is typically much greater than the difference between mean levels among persons “high” on the independent variable versus persons “low” on the independent variable.

A first trajectory model

The preceding figure shows results from fitting the group-based trajectory model via PROC TRAJ. Approximately 65.3% of persons belong to a subpopulation that is essentially abstinent from marijuana, about 19.4% to a subpopulation whose marijuana use increases and then decreases, and about 15.3% to a subpopulation whose marijuana use continually increases. Dashed lines represent estimates of $f_1(t)$, $f_2(t)$, and $f_3(t)$ when they are assumed to be exponentials of linear functions; solid lines represent estimates without such a constraint.

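A first trajectory model
[Output tables; only the column headings are shown, not the numeric values]
Table 1: Obs, ID, Y1, Y2, Y3, Y4, Y5, T1, T2, T3, T4, T5, X, GRP1PRB, GRP2PRB, GRP3PRB, GROUP
Tables 2 and 3: Obs, _MODEL_, _MODEL2_, _TYPE_, _NAME_, INTERC1, LINEAR1, INTERC2, LINEAR2, INTERC3, LINEAR3, CONST2, X2, CONST3, X3 (the one row shown reads: 1, ZIP, PARMS)
Table 4: Obs, _LOGLIK_, _BIC1_, _BIC2_, _AIC_, _CONVERGE_
Table 5: Obs, T, AVG1, AVG2, AVG3, PRED1, PRED2, PRED3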

A first trajectory model
The preceding tables display additional results. The first table shows variable values for six subjects, along with the estimated probabilities that the subjects belong to the three subpopulations. The second and third tables present estimates of $a_1$, $b_1$, $a_2$, $b_2$, $a_3$, $b_3$, $c_2$, $d_2$, $c_3$, and $d_3$. Companion output, which is displayed by PROC TRAJ on screen only, provides accompanying p-values. The fourth table provides indices of model fit, and the fifth table specifies the numbers used to construct the figure displayed earlier.
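To make the estimated membership probabilities concrete, here is a hedged sketch of how such probabilities could be computed from a fitted model by Bayes' rule. It assumes that the five responses are independent given the subpopulation, and that a subject is assigned to the subpopulation with the largest probability; neither assumption is stated in the slides. The trajectory coefficients and prior probabilities are hypothetical, and this is not PROC TRAJ's actual code.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability of observing count y when the mean is mu."""
    return math.exp(-mu) * mu ** y / math.factorial(y)

def posterior_probs(y, times, trajectories, prior):
    """Posterior probability of each subpopulation given one subject's counts.

    Assumes (not stated in the slides) that the counts are independent
    given the subpopulation.
    """
    joint = {}
    for j, (a, b) in trajectories.items():
        likelihood = 1.0
        for y_k, t in zip(y, times):
            likelihood *= poisson_pmf(y_k, math.exp(a + b * t))
        joint[j] = prior[j] * likelihood
    total = sum(joint.values())
    return {j: value / total for j, value in joint.items()}

TIMES = [-3, -2, -1, 0, 1]
TRAJECTORIES = {1: (-3.0, 0.0), 2: (0.0, 0.3), 3: (1.0, 0.5)}  # hypothetical (a_j, b_j)
PRIOR = {1: 0.6, 2: 0.2, 3: 0.2}  # hypothetical membership probabilities

post = posterior_probs([1, 2, 3, 4, 6], TIMES, TRAJECTORIES, PRIOR)
group = max(post, key=post.get)  # assumed assignment rule: largest probability
print({j: round(p, 4) for j, p in post.items()}, "assigned group:", group)
```

A subject reporting steadily rising use is assigned to the third hypothetical subpopulation, while a subject reporting all zeros would be assigned to the first.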

A first trajectory model
Visually, the estimate of $f_2(t)$ appears somewhat unsatisfactory. There are corresponding discrepancies between the “AVG2” and “PRED2” columns in the fifth table. Therefore, let us consider a second group-based trajectory model in which the trajectories are defined by exponentials of quadratic functions $f_1(t) = \exp(a_1 + b_1 t + g_1 t^2)$, $f_2(t) = \exp(a_2 + b_2 t + g_2 t^2)$, and $f_3(t) = \exp(a_3 + b_3 t + g_3 t^2)$.
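The quadratic term is what allows a trajectory to rise and then fall, which an exponential of a linear function cannot do. The sketch below uses hypothetical coefficients to show this: when the coefficient g on t squared is negative, the trajectory peaks at t = -b / (2g).

```python
import math

def quadratic_trajectory(a, b, g, t):
    """Mean use at time t under an exponential-of-quadratic trajectory."""
    return math.exp(a + b * t + g * t * t)

def peak_time(b, g):
    """Time at which the trajectory is largest, or None if g >= 0 (no interior peak)."""
    if g >= 0:
        return None
    return -b / (2.0 * g)

# Hypothetical coefficients for a rise-then-fall subpopulation
A2, B2, G2 = 1.0, -0.4, -0.4
TIMES = [-3, -2, -1, 0, 1]

print("peak at t =", peak_time(B2, G2))
print([round(quadratic_trajectory(A2, B2, G2, t), 3) for t in TIMES])
```

With these made-up values the peak falls between t = -1 and t = 0, so the mean increases through junior year and declines afterward.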

A second trajectory model

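[Output tables; only the column headings are shown, not the numeric values]
Table 1: Obs, ID, Y1, Y2, Y3, Y4, Y5, T1, T2, T3, T4, T5, X, GRP1PRB, GRP2PRB, GRP3PRB, GROUP
Tables 2 and 3: Obs, _MODEL_, _MODEL2_, _TYPE_, _NAME_, INTERC1, LINEAR1, QUADRA1, INTERC2, LINEAR2, QUADRA2, INTERC3, LINEAR3, QUADRA3, CONST2, X2, CONST3, X3 (the one row shown reads: 1, ZIP, PARMS)
Table 4: Obs, _LOGLIK_, _BIC1_, _BIC2_, _AIC_, _CONVERGE_
Table 5: Obs, T, AVG1, AVG2, AVG3, PRED1, PRED2, PRED3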

A second trajectory model
Some comments are in order:
The estimate of $f_2(t)$ looks much better now.
The guess about which subpopulation subject 6 belongs to has changed (and appears more reasonable now).
The BIC1, BIC2, and AIC have increased by approximately 66, 64, and 73 points respectively. Here larger values of these indices indicate better fit, and these are overwhelming changes, suggesting that the second group-based trajectory model provides a much better fit to the data than the first group-based trajectory model.
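The following sketch shows how changes of this size can arise. It assumes the indices are defined as the log-likelihood minus a penalty, so that larger is better: AIC = logL - k and BIC = logL - 0.5 k ln(N), with N taken either as the number of observations or as the number of subjects. The two log-likelihood values are hypothetical. The parameter counts (10 for the linear model, 13 for the quadratic model) follow from the estimates listed in the slides, and 2500 observations assumes no missing responses among the 500 subjects.

```python
import math

def fit_indices(loglik, n_params, n_obs, n_subjects):
    """Penalized log-likelihoods under the assumed larger-is-better convention."""
    return {
        "BIC (N = observations)": loglik - 0.5 * n_params * math.log(n_obs),
        "BIC (N = subjects)": loglik - 0.5 * n_params * math.log(n_subjects),
        "AIC": loglik - n_params,
    }

# Hypothetical log-likelihoods; only their difference (76) matters here
linear = fit_indices(-2000.0, n_params=10, n_obs=2500, n_subjects=500)
quadratic = fit_indices(-1924.0, n_params=13, n_obs=2500, n_subjects=500)

changes = {name: quadratic[name] - linear[name] for name in linear}
for name, change in changes.items():
    print(f"{name}: {change:+.1f}")
```

Under these assumptions, a log-likelihood gain of 76 with three extra parameters yields improvements of roughly 64, 67, and 73 points, the same order of magnitude as the changes reported on the slide.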

Is that the best we can do?
Besides moving from linear functions to quadratic functions, other modifications are possible. One, for which I provide SAS code at { entails replacing the ordinary Poisson probability distribution by the zero-inflated Poisson probability distribution. The idea is that, especially in the first subpopulation, there may be too many 0’s to be compatible with the ordinary Poisson probability distribution. Accounting for this zero inflation may provide a better fit to the data.
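As a sketch of what zero inflation means, the standard zero-inflated Poisson distribution mixes a point mass at 0 (with probability pi) and an ordinary Poisson distribution (with probability 1 - pi). The parameter names and values below are illustrative, and how PROC TRAJ parameterizes the zero-inflation probability is not covered here.

```python
import math

def poisson_pmf(y, mu):
    """Ordinary Poisson probability of count y with mean parameter mu."""
    return math.exp(-mu) * mu ** y / math.factorial(y)

def zip_pmf(y, mu, pi):
    """Zero-inflated Poisson: an extra point mass pi at zero."""
    base = (1.0 - pi) * poisson_pmf(y, mu)
    return pi + base if y == 0 else base

MU, PI = 2.0, 0.3  # illustrative values only
print("P(Y = 0), ordinary Poisson:     ", round(poisson_pmf(0, MU), 4))
print("P(Y = 0), zero-inflated Poisson:", round(zip_pmf(0, MU, PI), 4))
```

Setting pi to 0 recovers the ordinary Poisson distribution, so the ordinary model is a special case of the zero-inflated one.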

Is that the best we can do?
Another possible modification is to change the quadratic functions to cubic or even quartic functions. (With only five time points, we cannot go beyond polynomials of degree four.) In fact, the polynomial degree need not be the same for each subpopulation. For instance, a linear function may suffice for the first and third subpopulations, while (at least) a quadratic function appears necessary for the second subpopulation.

Is that the best we can do?
We face the practical problem, though, of deciding which modifications to make. Rather than consider dozens (or hundreds) of possible competing models, a more feasible approach may be to start with the most complicated model that one is willing to entertain (for example, with quartic polynomials for each subpopulation) and then perform “backward elimination”.

Is that the best we can do?
To do this, remove whichever model feature has the largest p-value, while respecting the hierarchical principle that simpler features cannot be removed before more complicated features. Thus, for example, the linear term cannot be removed from a quadratic polynomial. Once all remaining model features have p-values less than 0.05 (or are ineligible for removal), stop and create a table of model fit indices corresponding to the various steps of the backward elimination.
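One step of this procedure can be sketched as below. The sketch considers only polynomial terms (not, say, coefficients on independent variables), assumes intercepts are never removed, and takes the p-values as given; in practice they would come from refitting the model after each removal. The function name and the p-values are hypothetical.

```python
def next_term_to_remove(degrees, p_values, alpha=0.05):
    """Pick the subpopulation whose highest-order term should be dropped next.

    degrees maps subpopulation -> current polynomial degree.
    p_values maps (subpopulation, degree) -> p-value of that term.
    Only the highest-order term of each polynomial is eligible
    (hierarchical principle). Returns None when nothing eligible
    has a p-value of alpha or more.
    """
    candidates = []
    for group, degree in degrees.items():
        if degree < 1:
            continue  # assumption: the intercept always stays in the model
        p = p_values[(group, degree)]
        if p >= alpha:
            candidates.append((p, group))
    if not candidates:
        return None
    _, group = max(candidates)  # largest p-value among eligible terms
    return group

# Hypothetical p-values for a model with quadratic polynomials in all three groups
degrees = {1: 2, 2: 2, 3: 2}
p_values = {
    (1, 1): 0.90, (1, 2): 0.40,
    (2, 1): 0.01, (2, 2): 0.001,
    (3, 1): 0.02, (3, 2): 0.20,
}
print(next_term_to_remove(degrees, p_values))
```

Here the linear term of subpopulation 1 has the largest p-value overall but is ineligible while the quadratic term remains, so the quadratic term of subpopulation 1 is the one selected for removal.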

Is that the best we can do?
The step in the backward elimination at which the model fit indices are optimized can be used to select a final model. (Matters become a bit more complicated, though, if the model fit indices are not in agreement about this.) Also, if we are unsure whether three is the best number of groups, then the above process can be repeated with, say, two groups and four groups. Model fit indices can then be used to choose among the final two-group model, the final three-group model, and the final four-group model.

Other capabilities of PROC TRAJ
Worth mentioning here, though not illustrated in this presentation or in the SAS code at { are three additional capabilities of PROC TRAJ:
The dependent variable need not have the (zero-inflated) Poisson probability distribution; the normal and Bernoulli probability distributions can be accommodated as well.
Multiple independent variables can be accommodated.

Other capabilities of PROC TRAJ
Multiple, related dependent variables can be accommodated. If there are two (for instance, marijuana use and alcohol use), then PROC TRAJ provides one latent variable defining subpopulations on the first dependent variable and a separate latent variable defining subpopulations on the second. Part of the output from PROC TRAJ then estimates the probabilities of membership in the subpopulations defined by the second latent variable given membership in a subpopulation defined by the first. If there are more than two, then PROC TRAJ provides a single latent variable defining subpopulations on all dependent variables simultaneously.

Trying out PROC TRAJ
With this background, let us open SAS and work our way through at least some of the SAS code at { This is also an opportunity to experiment and make some changes to the SAS code. For instance, you can see what PROC TRAJ does when a quadratic function is replaced by a cubic function or when a quadratic function is retained for only one of the three subpopulations.