Cross-sectional LCA Patterns of first response to cigarettes
First smoking experience Have you ever tried a cigarette (including roll-ups), even a puff? How old were you when you first tried a cigarette? When you FIRST ever tried a cigarette can you remember how it made you feel? (tick as many as you want) –It made me cough –I felt ill –It tasted awful –I liked it –It made me feel dizzy
Aim To categorise the subjects based on their pattern of responses To assess the relationship between first-response and current smoking behaviour To try not to think too much about the possibility of recall bias
Step 1 Look at your data!!!
Examine your data structure LCA converts a large number of response patterns into a small number of ‘homogeneous’ groups If the responses in your data are fairly mutually exclusive then there’s no point doing LCA Don’t just dive in
How many items endorsed? (frequency table of numresp: Freq., Percent, Cum.)
Frequency of each item (n ~ 2500)
Examine pattern frequency (listing of 30 response patterns with columns cough, ill, taste, liked, dizzy, num)
Examine correlation structure Polychoric correlation matrix (5 × 5: cough, ill, taste, liked, dizzy)
Step 2 Now you can fit a latent class model
Latent Class models Work with observations at the pattern level rather than the individual (person) level (listing of the first 5 response patterns: cough, ill, taste, liked, dizzy, num)
Latent Class models For a given number of latent classes, using application of Bayes’ rule plus an assumption of conditional independence one can calculate the probability that each pattern should fall into each class Derive the likelihood of the obtained data under each model (i.e. assuming different numbers of classes) and use this plus other fit statistics to determine optimal model i.e. optimal number of classes
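Latent Class models
Bayes’ rule: P(class = i | pattern) = P(pattern | class = i) * P(class = i) / Σ_j [ P(pattern | class = j) * P(class = j) ]
Conditional independence: P(pattern = ‘01’ | class = i) = P(pat(1) = ‘0’ | class = i) * P(pat(2) = ‘1’ | class = i)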
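The two ingredients above can be combined in a few lines. This is only a sketch: the class sizes and item probabilities are made-up numbers for a 2-class, 2-item example, and the function names are hypothetical rather than anything Mplus exposes.

```python
from math import prod

def pattern_likelihood(pattern, item_probs):
    """P(pattern | class) under conditional independence:
    the product of the per-item probabilities within the class."""
    return prod(p if x == 1 else 1.0 - p for x, p in zip(pattern, item_probs))

def posterior_class_probs(pattern, class_sizes, item_probs_by_class):
    """Bayes' rule: P(class | pattern) for every class."""
    joint = [size * pattern_likelihood(pattern, probs)
             for size, probs in zip(class_sizes, item_probs_by_class)]
    total = sum(joint)  # P(pattern), summed over classes
    return [j / total for j in joint]

# Hypothetical values, for illustration only
class_sizes = [0.6, 0.4]                 # P(class = 1), P(class = 2)
item_probs = [[0.9, 0.2], [0.1, 0.7]]    # P(item endorsed | class)
post = posterior_class_probs((0, 1), class_sizes, item_probs)
```

With these invented numbers, pattern ‘01’ is far more plausible under the second class, so almost all of its posterior probability lands there.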
How many classes can I have? ~ degrees of freedom
32 possible patterns, so 32 − 1 = 31 df available (the pattern proportions must sum to 1)
Each additional class requires
–5 df to estimate the prevalence of each of the 5 items within that class (i.e. 5 thresholds)
–1 df for an additional cut of the latent variable defining the class distribution
Hence a 5-class model uses up 5*5 + 4 = 29 degrees of freedom, leaving up to 2 df to test the model
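The counting can be checked with a short sketch. It assumes binary items and an unrestricted LCA, and counts against the 2^items − 1 independent pattern proportions, which reproduces the 17 free parameters and 14 degrees of freedom that Mplus reports for the 3-class model. The function names are hypothetical.

```python
def lca_free_parameters(n_items, n_classes):
    """One threshold per item per class, plus (classes - 1) class-size parameters."""
    return n_classes * n_items + (n_classes - 1)

def lca_degrees_of_freedom(n_items, n_classes):
    """Independent pattern proportions minus free parameters (binary items)."""
    return (2 ** n_items - 1) - lca_free_parameters(n_items, n_classes)

for k in range(1, 7):
    print(k, lca_free_parameters(5, k), lca_degrees_of_freedom(5, k))
```

A negative result means the model asks for more parameters than the data can supply, which is what happens for 6 classes with 5 binary items.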
Standard thresholds Mplus thinks of binary variables as being a dichotomised continuous latent variable The point at which a continuous N(0,1) variable must be cut to create a binary variable is called a threshold A binary variable with 50% cases corresponds to a threshold of zero A binary variable with 2.5% cases corresponds to a threshold of 1.96
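The two examples on this slide can be reproduced with the standard normal quantile function. This is a sketch of the probit-style threshold described here; `probit_threshold` is a hypothetical helper name.

```python
from statistics import NormalDist

def probit_threshold(prevalence):
    """Cut point on a N(0,1) variable above which a proportion
    `prevalence` of cases falls."""
    return NormalDist().inv_cdf(1.0 - prevalence)

print(round(probit_threshold(0.5), 2))
print(round(probit_threshold(0.025), 2))
```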
Standard thresholds Figure from Uebersax webpage
Data:
  File is "..\smoking_experience.dta.dat";
  listwise is on;
Variable:
  Names are sex cough ill taste liked dizzy numresp less_12 less_13;
  categorical are cough ill taste liked dizzy;
  usevariables are cough ill taste liked dizzy;
  Missing are all (-9999);
  classes = c(3);
Analysis:
  proc = 2 (starts);
  type = mixture;
  starts = ;
  stiterations = 20;
Output:
  tech10;
What you’re actually doing
model:
  %OVERALL%
  [c#1 c#2];      ! defines the latent class variable
  %c#1%
  [cough$1];      ! defines the within-class thresholds,
  [ill$1];        ! i.e. the prevalence of the endorsement of each item
  [taste$1];
  [liked$1];
  [dizzy$1];
+ five more threshold parameters for each of %c#2% and %c#3%
SUMMARY OF CATEGORICAL DATA PROPORTIONS COUGH Category Category ILL Category Category TASTE Category Category LIKED Category Category DIZZY Category Category
RANDOM STARTS RESULTS RANKED FROM THE BEST TO THE WORST LOGLIKELIHOOD VALUES Final stage loglikelihood values at local maxima, seeds, and initial stage start numbers: Etc.
How many random starts?
Depends on
–Sample size
–Complexity of model (number of manifest variables, number of classes)
Aim to consistently find the solution with the highest (least negative) loglikelihood within each run
Success Not there yet Loglikelihood values at local maxima, seeds, and initial stage start numbers: Loglikelihood values at local maxima, seeds, and initial stage start numbers
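One way to make "success" versus "not there yet" concrete is to count how many random starts reach the best loglikelihood. The sketch below assumes the final-stage loglikelihoods have been copied out of the output by hand; the values and the tolerance are invented for illustration.

```python
def best_loglik_replications(logliks, tol=1e-3):
    """Number of final-stage starts whose loglikelihood matches the best
    (highest) value to within `tol`."""
    best = max(logliks)
    return sum(1 for ll in logliks if best - ll <= tol)

# Hypothetical final-stage values, not taken from the real output
replicated = [-100.5, -100.5, -100.5, -101.2]
not_replicated = [-100.5, -101.2, -102.0]
print(best_loglik_replications(replicated))
print(best_loglik_replications(not_replicated))
```

If the best value is reached only once, increasing the number of starts and re-running is the usual response.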
Scary “warnings” IN THE OPTIMIZATION, ONE OR MORE LOGIT THRESHOLDS APPROACHED AND WERE SET AT THE EXTREME VALUES. EXTREME VALUES ARE AND THE FOLLOWING THRESHOLDS WERE SET AT THESE VALUES: * THRESHOLD 1 OF CLASS INDICATOR TASTE FOR CLASS 3 AT ITERATION 11 * THRESHOLD 1 OF CLASS INDICATOR DIZZY FOR CLASS 3 AT ITERATION 12 * THRESHOLD 1 OF CLASS INDICATOR ILL FOR CLASS 3 AT ITERATION 16 * THRESHOLD 1 OF CLASS INDICATOR LIKED FOR CLASS 1 AT ITERATION 34 * THRESHOLD 1 OF CLASS INDICATOR TASTE FOR CLASS 1 AT ITERATION 93 WARNING: WHEN ESTIMATING A MODEL WITH MORE THAN TWO CLASSES, IT MAY BE NECESSARY TO INCREASE THE NUMBER OF RANDOM STARTS USING THE STARTS OPTION TO AVOID LOCAL MAXIMA.
THE MODEL ESTIMATION TERMINATED NORMALLY TESTS OF MODEL FIT Loglikelihood H0 Value H0 Scaling Correction Factor for MLR Information Criteria Number of Free Parameters 17 Akaike (AIC) Bayesian (BIC) Sample-Size Adjusted BIC (n* = (n + 2) / 24)
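The information criteria in this block are simple functions of the loglikelihood, the number of free parameters and the sample size. The sketch below uses the standard definitions, with the sample-size adjusted BIC replacing n by n* = (n + 2) / 24 as stated in the output. The loglikelihood passed in is a made-up value, and the function name is hypothetical.

```python
from math import log

def information_criteria(loglik, n_params, n_obs):
    """AIC, BIC and sample-size adjusted BIC from a model loglikelihood."""
    deviance = -2.0 * loglik
    return {
        "AIC": deviance + 2 * n_params,
        "BIC": deviance + n_params * log(n_obs),
        "aBIC": deviance + n_params * log((n_obs + 2) / 24),
    }

# Hypothetical loglikelihood; 17 parameters as in the 3-class model
ic = information_criteria(loglik=-1000.0, n_params=17, n_obs=2493)
```

Lower values indicate a better trade-off between fit and complexity, which is why these are compared across models with different numbers of classes.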
Chi-Square Test of Model Fit for the Binary and Ordered Categorical (Ordinal) Outcomes Pearson Chi-Square Value Degrees of Freedom 14 P-Value Likelihood Ratio Chi-Square Value Degrees of Freedom 14 P-Value
FINAL CLASS COUNTS AND PROPORTIONS FOR THE LATENT CLASSES BASED ON THE ESTIMATED MODEL Latent Classes CLASSIFICATION OF INDIVIDUALS BASED ON THEIR MOST LIKELY LATENT CLASS MEMBERSHIP Latent Classes
Entropy (fuzziness) CLASSIFICATION QUALITY Entropy Average Latent Class Probabilities for Most Likely Latent Class Membership (Row) by Latent Class (Column)
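The entropy statistic can be recomputed from the posterior class probabilities. The sketch below uses the usual relative-entropy formula, 1 − Σ_i Σ_k (−p_ik ln p_ik) / (n ln K), which is assumed here to be what the output reports. The input is a list of per-person posterior probability rows.

```python
from math import log

def relative_entropy(posteriors):
    """1 = every person assigned to one class with certainty,
    0 = posterior probabilities are uniform for everyone."""
    n = len(posteriors)
    k = len(posteriors[0])
    total = 0.0
    for row in posteriors:
        for p in row:
            if p > 0:  # p * ln(p) tends to 0 as p tends to 0
                total -= p * log(p)
    return 1.0 - total / (n * log(k))

print(relative_entropy([[1.0, 0.0], [0.0, 1.0]]))
```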
Model results Two-Tailed Estimate S.E. Est./S.E. P-Value Latent Class 1 Thresholds COUGH$ ILL$ TASTE$ LIKED$ DIZZY$
Categorical Latent Variables Two-Tailed Estimate S.E. Est./S.E. P-Value Means C# C#
RESULTS IN PROBABILITY SCALE Latent Class 1 COUGH Category Category ILL Category Category TASTE Category Category LIKED Category Category DIZZY Category Category
Class 1 from 3-class model
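The probability-scale results are a transformation of the threshold estimates. The sketch below assumes the thresholds are on the logit scale, as the optimisation warning's reference to "LOGIT THRESHOLDS" suggests, and that the probability of endorsing an item is 1 / (1 + exp(threshold)). The function name is hypothetical.

```python
from math import exp

def endorsement_probability(logit_threshold):
    """P(item endorsed | class) for a binary item with the given logit threshold."""
    return 1.0 / (1.0 + exp(logit_threshold))

for tau in (-15.0, 0.0, 15.0):
    print(tau, endorsement_probability(tau))
```

Under this assumption a very large positive or negative threshold corresponds to a within-class prevalence of essentially 0 or 1, which is why thresholds fixed at extreme values are not necessarily a problem.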
Conditional independence The latent class variable accounts for the covariance structure in your dataset Conditional on C, any pair of manifest variables should be uncorrelated Harder to achieve for a cross-sectional LCA With a longitudinal LCA there tends to be a more ordered pattern of correlations based on proximity in time
Tech10 – response patterns MODEL FIT INFORMATION FOR THE LATENT CLASS INDICATOR MODEL PART RESPONSE PATTERNS No. Pattern No. Pattern
Tech10 – Bivariate model fit
5 manifest variables → number of pairs = 5*4/2 = 10
Overall Bivariate Pearson Chi-Square
Overall Bivariate Log-Likelihood Chi-Square
Compare with χ² (10 df)
Tech10 – Bivariate model fit Not bad:- Estimated Probabilities Standardized Variable Variable H1 H0 Residual (z-score) COUGH ILL Category 1 Category Category 1 Category Category 2 Category Category 2 Category Bivariate Pearson Chi-Square Bivariate Log-Likelihood Chi-Square 2.798
Tech10 – Bivariate model fit Terrible:- Estimated Probabilities Standardized Variable Variable H1 H0 Residual (z-score) COUGH ILL Category 1 Category Category 1 Category Category 2 Category Category 2 Category Bivariate Pearson Chi-Square Bivariate Log-Likelihood Chi-Square
Conditional Independence violated Need more classes
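The bivariate check compares the observed 2 × 2 table for a pair of items (H1) with the table implied by the model under conditional independence (H0). The sketch below shows that calculation for one pair. The class sizes and item probabilities are invented, the function names are hypothetical, and the statistic is a plain Pearson chi-square rather than a guaranteed reproduction of the Tech10 figures.

```python
def implied_bivariate_table(class_sizes, probs_a, probs_b):
    """Model-implied joint proportions for items a and b:
    sum over classes of P(class) * P(a | class) * P(b | class)."""
    table = {}
    for a in (0, 1):
        for b in (0, 1):
            table[(a, b)] = sum(
                size * (pa if a else 1.0 - pa) * (pb if b else 1.0 - pb)
                for size, pa, pb in zip(class_sizes, probs_a, probs_b)
            )
    return table

def bivariate_pearson(observed, implied, n_obs):
    """Pearson chi-square comparing observed and implied proportions."""
    return n_obs * sum((observed[c] - implied[c]) ** 2 / implied[c] for c in implied)

# Hypothetical 2-class model
h0 = implied_bivariate_table([0.5, 0.5], [0.8, 0.2], [0.6, 0.4])
```

A large value for a pair means the classes have not absorbed the association between those two items.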
Obtain the ‘optimal’ model Assess the following for models with increasing classes aBIC Entropy BLRT (Bootstrap LRT) Conditional Independence (Tech10) Ease of interpretation Consistency with previous work / theory
Model fit stats
Columns: 1 class, 2 class, 3 class, 4 class, 5 class
Rows: Estimated params, H0 Likelihood, aBIC, Entropy, Tech 10, BLRT statistic, BLRT p-value
5-class model
aBIC values are still decreasing
Tech 10 is still quite high – residual correlations between ill and both liked and dizzy
BLRT rejects 4-class model
Not enough df to fit 6-class model, so we cannot use BLRT to assess the fit of the 5-class model
It seems unlikely that 5 classes are enough, as BLRT values are decreasing slowly
Cross-sectional LCA Patterns of first response to cigarette Attempt 2
What to do?
We need more degrees of freedom
There were only 5 questions on response to smoking
Add something else:
–How old were you when you first tried a cigarette?
–Split into pre-teen / teen
6 binary variables means 2^6 = 64 response patterns, i.e. 63 d.f. to play with
Model fit stats – attempt 2
Columns: 3 class, 4 class, 5 class, 6 class, 7 class
Rows: Estimated params, H0 Likelihood, aBIC, Entropy, Tech 10, BLRT statistic, BLRT p-value
6-class model results CLASS COUNTS AND PROPORTIONS FOR THE LATENT CLASSES BASED ON THE ESTIMATED MODEL Latent classes % % % % % % CLASSIFICATION OF INDIVIDUALS BASED ON THEIR MOST LIKELY LATENT CLASS MEMBERSHIP Latent classes % % % % % %
Examine entropy in more detail Model-level entropy = Class level entropy:
Pattern level entropy Save out the model-based probabilities Open in another stats package Collapse over response patterns
Save out the model-based probabilities
savedata:
  file is "6-class-results.dat";
  save cprobabilities;
Varnames shown at end of output SAVEDATA INFORMATION Order and format of variables COUGH F10.3 ILL F10.3 TASTE F10.3 LIKED F10.3 DIZZY F10.3 LESS_13 F10.3 ALN F10.3 QLET F10.3 SEX F10.3 CPROB1 F10.3 CPROB2 F10.3 CPROB3 F10.3 CPROB4 F10.3 CPROB5 F10.3 CPROB6 F10.3 C F10.3
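Open / process in Stata
Remove excess spaces from data file, then:
* read the space-delimited Mplus output; columns arrive as v1-v16
insheet using "6-class-results.dat", delim(" ")
* rename v1-v16 using the variable order listed under SAVEDATA INFORMATION
local i = 1
local varnames "COUGH ILL TASTE LIKED DIZZY LESS_13 ALN QLET SEX CPROB1 CPROB2 CPROB3 CPROB4 CPROB5 CPROB6 C"
foreach x of local varnames {
    rename v`i' `x'
    local i = `i' + 1
}
* one row per response pattern: mean posterior probabilities and pattern frequency
gen num = 1
collapse (mean) CPROB* C (count) num, by(COUGH ILL TASTE LIKED DIZZY LESS_13)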
Check the assignment probabilities for each class (table by response pattern: cough, ill, taste, liked, dizzy, < 13, P_c1 to P_c6, Mod class, n)
Bad taste (30.1%)
Positive experience (21.7%)
Coughed (18.2%)
Dizziness (15.9%)
V negative experience (11.9%)
Felt ill (2.1%)
Well that was a complete waste of time! You might think that those resulting classes could have been derived just looking at the response patterns and making some arbitrary decisions e.g. –Group all of those who had >1 negative experience –Keep separate each group who had 1 experience You would have ended up with a bunch of weird patterns with no clue of what to do with them Strange patterns likely to be measurement error? LCA incorporates ALL patterns and deals with uncertainty through the posterior probabilities
Conclusions / warning Like EFA, LCA is an exploratory tool with the aim of summarising the variability in the dataset in a simple/interpretable way These results do not prove that there are 6 groups of young people in real life. LCA will find groupings in the data even if there is no reason to think such groups might exist. It’s just mathematics and it knows no better
Remember, we are dealing with probabilities
(table of class proportions, model-based vs. “modal assignment”, for Ill, Positive, Dizzy, Coughed, Bad taste, V negative)
Working with modal assignment is easy
–chuck each pattern into its most likely class and pretend everything is OK
–Equivalent to doing a single imputation for missing data – shudder!
Unless entropy is V high, stick with the probabilities
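The difference between the two columns comes down to how each person contributes to the class sizes. In the sketch below, which uses invented posterior probabilities and a hypothetical function name, the model-based proportion averages the posteriors, while modal assignment counts each person once in their most likely class.

```python
def class_proportions(posteriors):
    """Return (model_based, modal) class proportions from posterior rows."""
    n = len(posteriors)
    k = len(posteriors[0])
    # Model-based: average posterior probability of each class
    model_based = [sum(row[c] for row in posteriors) / n for c in range(k)]
    # Modal: each person counted wholly in their most likely class
    modal_counts = [0] * k
    for row in posteriors:
        modal_counts[row.index(max(row))] += 1
    modal = [count / n for count in modal_counts]
    return model_based, modal

# Hypothetical posterior probabilities for four people, two classes
model_based, modal = class_proportions([[0.6, 0.4], [0.6, 0.4], [0.3, 0.7], [0.9, 0.1]])
```

The further the posteriors are from 0 and 1, the more the two sets of proportions can drift apart.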
Covariates and outcomes
Merging the classes with other data In the “olden days”, you could pass your ID variable through Mplus so when you saved your class probabilities you could merge this with other data. Now you can pass other data through Mplus as well – hurrah! Variable: auxiliary are ID sex;
Reshaping the dataset To account for the uncertainty in our class variable we will need to weight by the posterior probabilities obtained from Mplus Weighted model requires a reshaping of the dataset so that each respondent has n-rows (for an n-class model) rather than just 1
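The reshape can be pictured as follows. This is a sketch in plain Python rather than the Stata command used in these slides; the record, the `pclass` prefix handling and the function name are illustrative, and it assumes no covariate name starts with the same prefix.

```python
def reshape_long(wide_rows, stub="pclass"):
    """One row per respondent becomes one row per respondent per class."""
    long_rows = []
    for row in wide_rows:
        # Anything that is not a posterior probability is carried along unchanged
        fixed = {k: v for k, v in row.items() if not k.startswith(stub)}
        prob_cols = sorted((k for k in row if k.startswith(stub)),
                           key=lambda k: int(k[len(stub):]))
        for col in prob_cols:
            long_rows.append({**fixed, "class": int(col[len(stub):]), stub: row[col]})
    return long_rows

# Hypothetical respondent from a 3-class model
kid = {"id": 1, "sex": "male", "pclass1": 0.7, "pclass2": 0.2, "pclass3": 0.1}
long_rows = reshape_long([kid])
```

Each respondent's rows carry weights that sum to 1, so a weighted analysis still counts every respondent exactly once.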
Pre-shaped – first 20 kids (listing with columns ID, sex, dev_18, dev_42 [covariates]; pclass1 to pclass5 [posterior probs]; modclass [modal class])
The reshaping. reshape long pclass, i(id) j(class) (note: j = ) Data wide -> long Number of obs > Number of variables 66 -> 63 j variable (5 values) -> class xij variables: pclass1 pclass2... pclass5 -> pclass
Re-shaped – first 3 kids (listing of 15 rows with columns id, sex, dev_18, dev_42, pclass, class: five rows per kid; pclass sums to 1 within each kid; the covariates are constant within child)
Similar with our data:. list id SEX CPROB class C in 1/ | id SEX CPROB class C | | | 1. | | 2. | | 3. | | 4. | | 5. | | 6. | | | | 7. | | 8. | | 9. | | 10. | | 11. | | 12. | | | | First respondent Second respondent
Simple crosstab. tab class SEX, row nofreq | SEX class | 1 2 | Total Ill | | Positive | | Dizzy | | Coughed | | Bad taste | | V negative | | Total | | Oops!
Simple crosstab – take 2
. tab class SEX [iw = CPROB], row nofreq

           |       SEX
     class |   Male   Female |  Total
-----------+-----------------+-------
       Ill |  52.9%    47.1% |   100%
  Positive |  32.9%    67.1% |   100%
     Dizzy |  43.2%    56.8% |   100%
   Coughed |  40.8%    59.2% |   100%
 Bad taste |  45.2%    54.8% |   100%
V negative |  39.3%    60.7% |   100%
-----------+-----------------+-------
     Total |  40.9%    59.1% |   100%
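What the importance weights do in this table can be sketched as a weighted row-percentage calculation on the long-format data. The rows below are invented and the function name is hypothetical; each cell is the sum of the posterior probabilities rather than a count of rows.

```python
from collections import defaultdict

def weighted_row_percentages(long_rows, row_key, col_key, weight_key):
    """Row percentages where each long-format row contributes its weight."""
    totals = defaultdict(lambda: defaultdict(float))
    for r in long_rows:
        totals[r[row_key]][r[col_key]] += r[weight_key]
    result = {}
    for row_value, cols in totals.items():
        row_total = sum(cols.values())
        result[row_value] = {c: 100.0 * w / row_total for c, w in cols.items()}
    return result

# Hypothetical long-format data: two respondents, two classes
rows = [
    {"id": 1, "sex": "male", "class": 1, "cprob": 0.8},
    {"id": 1, "sex": "male", "class": 2, "cprob": 0.2},
    {"id": 2, "sex": "female", "class": 1, "cprob": 0.4},
    {"id": 2, "sex": "female", "class": 2, "cprob": 0.6},
]
pct = weighted_row_percentages(rows, "class", "sex", "cprob")
```

Without the weights every respondent appears once in every class, so each row of the table would simply repeat the overall sex split.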
Compare with modal class assignment
. tab C SEX if (class==1), row nofreq

           |       SEX
         C |   Male   Female
-----------+----------------
       Ill |  50.0%    50.0%
  Positive |  33.0%    67.0%
     Dizzy |  43.4%    56.6%
   Coughed |  40.7%    59.3%
 Bad taste |  45.4%    54.6%
V negative |  37.6%    62.4%
-----------+----------------
     Total |  40.9%    59.1%

. tab class SEX [iw = CPROB], row nofreq

           |       SEX
     class |   Male   Female
-----------+----------------
       Ill |  52.9%    47.1%
  Positive |  32.9%    67.1%
     Dizzy |  43.2%    56.8%
   Coughed |  40.8%    59.2%
 Bad taste |  45.2%    54.8%
V negative |  39.3%    60.7%
-----------+----------------
     Total |  40.9%    59.1%
Multinomial logistic. xi: mlogit class i.SEX [iw = CPROB], rrr Multinomial logistic regression Number of obs = 2493 LR chi2(5) = Prob > chi2 = Log likelihood = Pseudo R2 = class | RRR Std. Err. z P>|z| [95% Conf. Interval] Ill | _ISEX_2 | Positive | _ISEX_2 | Dizzy | _ISEX_2 | Coughed | _ISEX_2 | V negative | _ISEX_2 | (class==Bad taste is the base outcome)
Class predicts binary outcome. Outcome = weekly smoker at age of 15 char class[omit] 5. xi: logistic sm1100 i.class [iw = CPROB] Logistic regression Number of obs = 2493 LR chi2(5) = Prob > chi2 = Log likelihood = Pseudo R2 = sm1100 | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval] Ill | Positive | Dizzy | Coughed | V negative |
Compare with modal class. Posterior probabilities sm1100 | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval] Ill | Positive | Dizzy | Coughed | V negative | Modal assignment sm1100 | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval] Ill | Positive | Dizzy | Coughed | V negative |
Conclusions Young people at 15yrs can report a variety of responses to their first cigarette Certain responses are associated with current regular smoking behaviour 15 year-old girls are more likely to retrospectively report a positive experience Recall bias is likely to play a part in these associations
Conclusions LCA is an exploratory tool which can be used to simplify a set of binary responses Extension to ordinal responses is straightforward The use of ordinal data is an alternative way to boost degrees of freedom Resulting probabilities can be used to model the latent class variable as a risk factor or outcome A modal class variable should be used with caution