Patterns of first response to cigarettes


Cross-sectional LCA: Patterns of first response to cigarettes

First smoking experience
- Have you ever tried a cigarette (including roll-ups), even a puff?
- How old were you when you first tried a cigarette?
- When you FIRST ever tried a cigarette can you remember how it made you feel? (tick as many as you want)
    - It made me cough
    - I felt ill
    - It tasted awful
    - I liked it
    - It made me feel dizzy

Aim
- To categorise the subjects based on their pattern of responses
- To assess the relationship between first response and current smoking behaviour
- To try not to think too much about the possibility of recall bias

Step 1: Look at your data!!!

Examine your data structure
- LCA converts a large number of response patterns into a small number of ‘homogeneous’ groups
- If the responses in your data are fairly mutually exclusive then there’s no point doing LCA
- Don’t just dive in

How many items endorsed?

        numresp |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |         69        2.75        2.75
              1 |      1,597       63.70       66.45
              2 |        569       22.70       89.15
              3 |        202        8.06       97.21
              4 |         68        2.71       99.92
              5 |          2        0.08      100.00
    ------------+-----------------------------------
          Total |      2,507      100.00

Frequency of each item (n ~ 2500)

Examine pattern frequency

         +---------------------------------------------+
         | cough   ill   taste   liked   dizzy    num  |
         |---------------------------------------------|
      1. |     0     0       1       0       0    468  |
      2. |     0     0       0       1       0    452  |
      3. |     1     0       0       0       0    449  |
      4. |     1     0       1       0       0    279  |
      5. |     0     0       0       0       1    194  |
      6. |     1     1       1       0       0     94  |
      7. |     1     0       0       1       0     87  |
      8. |     1     0       0       0       1     76  |
      9. |     0     0       0       0       0     69  |
     10. |     1     1       1       0       1     59  |
     11. |     0     0       0       1       1     56  |
     12. |     1     0       1       0       1     47  |
     13. |     1     0       0       1       1     35  |
     14. |     0     1       0       0       0     34  |
     15. |     0     0       1       0       1     27  |
     16. |     0     1       1       0       0     17  |
     17. |     0     0       1       1       0     13  |
     18. |     1     1       0       0       1      9  |
     19. |     1     1       0       0       0      8  |
     20. |     0     1       1       0       1      7  |
     21. |     1     0       1       1       1      7  |
     22. |     1     0       1       1       0      6  |
     23. |     0     1       0       0       1      5  |
     24. |     1     1       1       1       1      2  |
     25. |     0     1       0       1       1      2  |
     26. |     0     1       0       1       0      1  |
     27. |     1     1       1       1       0      1  |
     28. |     1     1       0       1       1      1  |
     29. |     0     0       1       1       1      1  |
     30. |     1     1       0       1       0      1  |
         +---------------------------------------------+

Examine correlation structure
Polychoric correlation matrix

               cough      ill    taste    liked    dizzy
    cough      1
    ill        0.371    1
    taste      0.049    0.468    1
    liked     -0.510   -0.542   -0.786    1
    dizzy     -0.030    0.246   -0.241   -0.158    1

Step 2: Now you can fit a latent class model

Latent Class models
Work with observations at the pattern level rather than the individual (person) level

         +---------------------------------------------+
         | cough   ill   taste   liked   dizzy    num  |
         |---------------------------------------------|
      1. |     0     0       1       0       0    468  |
      2. |     0     0       0       1       0    452  |
      3. |     1     0       0       0       0    449  |
      4. |     1     0       1       0       0    279  |
      5. |     0     0       0       0       1    194  |

Latent Class models
- For a given number of latent classes, Bayes’ rule plus an assumption of conditional independence lets one calculate the probability that each pattern falls into each class
- Derive the likelihood of the obtained data under each model (i.e. assuming different numbers of classes) and use this, plus other fit statistics, to determine the optimal model, i.e. the optimal number of classes

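Latent Class models
Bayes’ rule:
    P(class = i | pattern) = P(pattern | class = i) * P(class = i) / Σ_j [ P(pattern | class = j) * P(class = j) ]
Conditional independence:
    P(pattern = ‘01’ | class = i) = P(pat(1) = ‘0’ | class = i) * P(pat(2) = ‘1’ | class = i)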
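The two rules above can be combined in a few lines. The following is a minimal Python sketch, not the Mplus implementation: the function names and the two-class, two-item numbers are invented purely for illustration.

```python
from math import prod

def pattern_likelihood(pattern, item_probs):
    """P(pattern | class) under conditional independence:
    the product of the per-item endorsement probabilities."""
    return prod(p if x == 1 else 1.0 - p for x, p in zip(pattern, item_probs))

def posterior(pattern, class_sizes, class_item_probs):
    """P(class | pattern) for every class, by Bayes' rule."""
    joint = [size * pattern_likelihood(pattern, probs)
             for size, probs in zip(class_sizes, class_item_probs)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical model: two classes, two binary items (made-up values).
class_sizes = [0.6, 0.4]                      # P(class = i)
class_item_probs = [[0.9, 0.2], [0.1, 0.7]]   # P(item = 1 | class = i)
post = posterior((0, 1), class_sizes, class_item_probs)
```

Here `post` holds the probability that pattern ‘01’ belongs to each of the two hypothetical classes, and the entries sum to one.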

How many classes can I have? ~ degrees of freedom
- 32 possible patterns
- Each additional class requires
    - 5 df to estimate the prevalence of each of the 5 items within that class (i.e. 5 thresholds)
    - 1 df for an additional cut of the latent variable defining the class distribution
- Hence a 5-class model uses up 5*5 + 4 = 29 degrees of freedom, leaving roughly 3 df to test the model (strictly 2: 32 patterns give 31 independent proportions, which is why Mplus reports 31 − 17 = 14 df for the 3-class model)
If all 32 patterns are present in the dataset, there are 32 degrees of freedom to exploit. If some patterns are absent, the available df may be reduced depending on the test you employ. Similarly, collapsing over patterns to improve cell counts can also reduce d.f. This issue is mentioned in the book “The analysis and interpretation of multivariate data for social scientists” (Bartholomew/Steele/Moustaki/Galbraith). For the goodness-of-fit test used in that book, the df ARE affected by the number of patterns in the dataset. For tests used by Mplus, it is the number of possible patterns which governs d.f.
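The counting rule above is easy to tabulate. This is a small Python sketch with hypothetical function names; the df calculation assumes the convention of 2^J − 1 independent pattern proportions for J binary items, which matches the 14 df shown in the 3-class Mplus output in this deck.

```python
def lca_free_parameters(n_classes, n_items):
    """One threshold per item per class, plus (classes - 1) class-size parameters."""
    return n_classes * n_items + (n_classes - 1)

def lca_df(n_classes, n_items):
    """Degrees of freedom left for testing fit, assuming 2**n_items - 1
    independent pattern proportions (an assumption about the test used)."""
    return (2 ** n_items - 1) - lca_free_parameters(n_classes, n_items)

for k in range(1, 6):
    print(k, lca_free_parameters(k, 5), lca_df(k, 5))
```

With 5 items this gives 5, 11, 17, 23 and 29 parameters for 1 to 5 classes, the same counts that appear in the model fit table.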

Standard thresholds
- Mplus thinks of binary variables as being a dichotomised continuous latent variable
- The point at which a continuous N(0,1) variable must be cut to create a binary variable is called a threshold
- A binary variable with 50% cases corresponds to a threshold of zero
- A binary variable with 2.5% cases corresponds to a threshold of 1.96

Standard thresholds
Figure from Uebersax webpage
Here’s an example, with the figure taken from Uebersax’s webpage. If all we know is that some subjects are depressed and some are not, but we assume that the continuous, latent measure of depression severity is normally distributed with mean zero and SD one, then we can use the observed proportion depressed, along with tables of the standard normal distribution function, to calculate the threshold, namely the point along the distribution of depression where the cut-point must have been in order to achieve the given proportion of depressed people.
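Instead of reading the threshold off printed normal tables, it can be computed directly. This is a minimal Python sketch using only the standard library; `threshold_from_prevalence` is a hypothetical helper name.

```python
from statistics import NormalDist

def threshold_from_prevalence(prevalence):
    """Cut-point on a standard normal latent variable such that the
    proportion of cases above the cut equals `prevalence`."""
    return NormalDist().inv_cdf(1.0 - prevalence)

print(round(threshold_from_prevalence(0.50), 2))
print(round(threshold_from_prevalence(0.025), 2))
```

The two calls reproduce the slide's examples: a 50% prevalence gives a threshold of zero and a 2.5% prevalence gives 1.96.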

    Data:
        File is "..\smoking_experience.dta.dat";
        listwise is on;
    Variable:
        Names are sex cough ill taste liked dizzy numresp less_12 less_13;
        categorical are cough ill taste liked dizzy;
        usevariables are cough ill taste liked dizzy;
        Missing are all (-9999);
        classes = c(3);
    Analysis:
        proc = 2 (starts);
        type = mixture;
        starts = 1000 500;
        stiterations = 20;
    Output:
        tech10;

What you’re actually doing

    model:
        %OVERALL%
        [c#1 c#2];
        %c#1%
        [cough$1];
        [ill$1];
        [taste$1];
        [liked$1];
        [dizzy$1];

- [c#1 c#2] defines the latent class variable
- The [item$1] statements define the within-class thresholds, i.e. the prevalence of endorsement of each item
- Plus five more threshold parameters for each of %c#2% and %c#3%

SUMMARY OF CATEGORICAL DATA PROPORTIONS

    COUGH
      Category 1    0.537
      Category 2    0.463
    ILL
      Category 1    0.904
      Category 2    0.096
    TASTE
      Category 1    0.590
      Category 2    0.410
    LIKED
      Category 1    0.735
      Category 2    0.265
    DIZZY
      Category 1    0.789
      Category 2    0.211

RANDOM STARTS RESULTS RANKED FROM THE BEST TO THE WORST LOGLIKELIHOOD VALUES
Final stage loglikelihood values at local maxima, seeds, and initial stage start numbers:

    -6343.937  685561  9973
    -6343.937  172907  9395
    -6343.937  497824  9464
    -6343.937  770684  7725
    -6343.937  584663  5193
    -6343.937  872295  2899
    -6343.937  116150  3570
    -6343.937  271339  4768
    -6343.937  472383  9650
    -6343.937  707126  3683
    Etc.

How many random starts?
Depends on
- Sample size
- Complexity of model
    - Number of manifest variables
    - Number of classes
Aim to find consistently the model with the highest loglikelihood (i.e. the lowest −2 × loglikelihood) within each run

Success
Loglikelihood values at local maxima, seeds, and initial stage start numbers:

    -10148.718  987174  1689
    -10148.718  777300  2522
    -10148.718  406118  3827
    -10148.718   51296  3485
    -10148.718  997836  1208
    -10148.718  119680  4434
    -10148.718  338892  1432
    -10148.718  765744  4617
    -10148.718  636396   168
    -10148.718  189568  3651
    -10148.718  469158  1145
    -10148.718   90078  4008
    -10148.718  373592  4396
    -10148.718   73484  4058
    -10148.718  154192  3972
    -10148.718  203018  3813
    -10148.718  785278  1603
    -10148.718  235356  2878
    -10148.718  681680  3557
    -10148.718   92764  2064

Not there yet
Loglikelihood values at local maxima, seeds, and initial stage start numbers:

    -10153.627   23688  4596
    -10153.678  150818  1050
    -10154.388  584226  4481
    -10155.122  735928   916
    -10155.373  309852  2802
    -10155.437  925994  1386
    -10155.482  370560  3292
    -10155.482  662718   460
    -10155.630  320864  2078
    -10155.833  873488  2965
    -10156.017  212934   568
    -10156.231   98352  3636
    -10156.339   12814  4104
    -10156.497  557806  4321
    -10156.644  134830   780
    -10156.741   80226  3041
    -10156.793  276392  2927
    -10156.819  304762  4712
    -10156.950  468300  4176
    -10157.011   83306  2432
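The eyeball check on these two listings can be automated. This is a Python sketch under stated assumptions: the helper name, the tolerance and the "at least two hits" rule are illustrative choices, not an Mplus criterion.

```python
def best_loglik_replicated(logliks, min_hits=2, tol=1e-3):
    """Return (replicated?, number of starts reaching the best loglikelihood).
    `min_hits` and `tol` are arbitrary illustrative defaults."""
    best = max(logliks)
    hits = sum(1 for ll in logliks if abs(ll - best) <= tol)
    return hits >= min_hits, hits

# Final-stage values copied from the two listings above (first few only
# for the second run).
success = [-10148.718] * 20
not_there_yet = [-10153.627, -10153.678, -10154.388, -10155.122]

print(best_loglik_replicated(success))
print(best_loglik_replicated(not_there_yet))
```

In the first run every start converges on the same maximum; in the second the best value is reached only once, so more random starts are needed.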

Scary “warnings”

    IN THE OPTIMIZATION, ONE OR MORE LOGIT THRESHOLDS APPROACHED AND WERE SET
    AT THE EXTREME VALUES. EXTREME VALUES ARE -15.000 AND 15.000.
    THE FOLLOWING THRESHOLDS WERE SET AT THESE VALUES:
    * THRESHOLD 1 OF CLASS INDICATOR TASTE FOR CLASS 3 AT ITERATION 11
    * THRESHOLD 1 OF CLASS INDICATOR DIZZY FOR CLASS 3 AT ITERATION 12
    * THRESHOLD 1 OF CLASS INDICATOR ILL FOR CLASS 3 AT ITERATION 16
    * THRESHOLD 1 OF CLASS INDICATOR LIKED FOR CLASS 1 AT ITERATION 34
    * THRESHOLD 1 OF CLASS INDICATOR TASTE FOR CLASS 1 AT ITERATION 93

    WARNING: WHEN ESTIMATING A MODEL WITH MORE THAN TWO CLASSES, IT MAY BE
    NECESSARY TO INCREASE THE NUMBER OF RANDOM STARTS USING THE STARTS OPTION
    TO AVOID LOCAL MAXIMA.

THE MODEL ESTIMATION TERMINATED NORMALLY

TESTS OF MODEL FIT

    Loglikelihood
        H0 Value                          -6343.937
        H0 Scaling Correction Factor          1.006
          for MLR

    Information Criteria
        Number of Free Parameters                17
        Akaike (AIC)                      12721.873
        Bayesian (BIC)                    12820.930
        Sample-Size Adjusted BIC          12766.916
          (n* = (n + 2) / 24)
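The information criteria in this output follow directly from the loglikelihood, the number of free parameters and the sample size. This is a Python sketch (the function name is hypothetical), taking n = 2507 from the frequency table earlier in the deck.

```python
from math import log

def information_criteria(loglik, n_params, n):
    """AIC, BIC and sample-size adjusted BIC, using n* = (n + 2) / 24."""
    deviance = -2.0 * loglik
    return {
        "AIC": deviance + 2 * n_params,
        "BIC": deviance + n_params * log(n),
        "aBIC": deviance + n_params * log((n + 2) / 24),
    }

ic = information_criteria(loglik=-6343.937, n_params=17, n=2507)
print({name: round(value, 2) for name, value in ic.items()})
```

The results agree with the Mplus values above to within rounding of the reported loglikelihood.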

Chi-Square Test of Model Fit for the Binary and Ordered Categorical (Ordinal) Outcomes

    Pearson Chi-Square
        Value                   623.040
        Degrees of Freedom           14
        P-Value                  0.0000

    Likelihood Ratio Chi-Square
        Value                   563.869

FINAL CLASS COUNTS AND PROPORTIONS FOR THE LATENT CLASSES BASED ON THE ESTIMATED MODEL

    1     600.41143    0.23949
    2    1517.83320    0.60544
    3     388.75538    0.15507

CLASSIFICATION OF INDIVIDUALS BASED ON THEIR MOST LIKELY LATENT CLASS MEMBERSHIP

    1     630    0.25130
    2    1396    0.55684
    3     481    0.19186

Entropy (fuzziness)

CLASSIFICATION QUALITY

    Entropy    0.832

Average Latent Class Probabilities for Most Likely Latent Class Membership (Row) by Latent Class (Column)

              1        2        3
    1     0.952    0.048    0.000
    2     0.000    0.979    0.021
    3     0.000    0.252    0.748
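For reference, the usual relative entropy statistic is E = 1 − Σ_i Σ_k (−p_ik ln p_ik) / (n ln K), where p_ik is respondent i's posterior probability for class k. This Python sketch assumes that is the quantity reported here; the function name and the toy inputs are illustrative only.

```python
from math import log

def relative_entropy(posteriors):
    """Relative entropy: 1 means perfectly crisp assignment, 0 means every
    respondent is equally likely to be in every class."""
    n = len(posteriors)
    k = len(posteriors[0])
    uncertainty = -sum(p * log(p) for row in posteriors for p in row if p > 0)
    return 1.0 - uncertainty / (n * log(k))

crisp = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # hypothetical posteriors
fuzzy = [[1 / 3, 1 / 3, 1 / 3]] * 4          # hypothetical posteriors
print(relative_entropy(crisp), relative_entropy(fuzzy))
```

Computing it on the real data would need the saved posterior probabilities for every respondent, which are not reproduced on the slides.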

Model results

                                                   Two-Tailed
                 Estimate      S.E.    Est./S.E.     P-Value
    Latent Class 1
     Thresholds
      COUGH$1       1.604     0.133      12.103       0.000
      ILL$1         7.371     4.945       1.490       0.136
      TASTE$1      15.000     0.000     999.000     999.000
      LIKED$1     -15.000     0.000     999.000     999.000
      DIZZY$1       1.890     0.139      13.604       0.000

Categorical Latent Variables

                                                   Two-Tailed
                 Estimate      S.E.    Est./S.E.     P-Value
     Means
      C#1           0.435     0.124       3.500       0.000
      C#2           1.362     0.135      10.058       0.000
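These means are on a multinomial logit scale with the last class as the reference, so the estimated class proportions can be recovered with a softmax. This is a Python sketch; the function name is hypothetical.

```python
from math import exp

def class_proportions(means):
    """Convert the K-1 class means (logits relative to the last class)
    into K class proportions."""
    exponentials = [exp(m) for m in means] + [1.0]   # reference class has logit 0
    total = sum(exponentials)
    return [e / total for e in exponentials]

props = class_proportions([0.435, 1.362])
print([round(p, 3) for p in props])
```

The result matches the model-based class proportions reported in the output (0.239, 0.605, 0.155) to within rounding of the printed means.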

RESULTS IN PROBABILITY SCALE

                                                   Two-Tailed
                 Estimate      S.E.    Est./S.E.     P-Value
    Latent Class 1
     COUGH
      Category 1    0.833     0.018      45.072       0.000
      Category 2    0.167     0.018       9.059       0.000
     ILL
      Category 1    0.999     0.003     321.448       0.000
      Category 2    0.001     0.003       0.202       0.840
     TASTE
      Category 1    1.000     0.000       0.000       1.000
      Category 2    0.000     0.000       0.000       1.000
     LIKED
      Category 1    0.000     0.000       0.000       1.000
      Category 2    1.000     0.000       0.000       1.000
     DIZZY
      Category 1    0.869     0.016      54.848       0.000
      Category 2    0.131     0.016       8.284       0.000
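The probability-scale results are the inverse-logit of the thresholds reported earlier (the warning messages describe them as logit thresholds). This Python sketch, with a hypothetical function name, reproduces the Category 1 probabilities for class 1.

```python
from math import exp

def prob_category1(threshold):
    """P(item falls in Category 1, i.e. not endorsed) for a logit threshold."""
    return 1.0 / (1.0 + exp(-threshold))

# Class 1 thresholds copied from the model results above.
for item, tau in [("COUGH", 1.604), ("ILL", 7.371), ("TASTE", 15.0),
                  ("LIKED", -15.0), ("DIZZY", 1.890)]:
    print(item, round(prob_category1(tau), 3))
```

A threshold fixed at +15 or −15 therefore corresponds to a probability that is indistinguishable from 1 or 0, which is all the "scary" warning is saying.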

Class 1 from 3-class model

Conditional independence
- The latent class variable accounts for the covariance structure in your dataset
- Conditional on C, any pair of manifest variables should be uncorrelated
- Harder to achieve for a cross-sectional LCA
- With a longitudinal LCA there tends to be a more ordered pattern of correlations based on proximity in time

Tech10 – response patterns

MODEL FIT INFORMATION FOR THE LATENT CLASS INDICATOR MODEL PART

RESPONSE PATTERNS

    No. Pattern    No. Pattern    No. Pattern    No. Pattern
      1 10000        2 00100        3 00010        4 11100
      5 11101        6 00001        7 10101        8 10010
      9 10100       10 00101       11 10001       12 00000
     13 00011       14 01101       15 10011       16 00110
     17 11000       18 10111       19 11011       20 01100
     21 10110       22 01000       23 01001       24 11111
     25 01010       26 11001       27 01011       28 11010
     29 00111       30 11110

Tech10 – Bivariate model fit
5 manifest variables → number of pairs = 5 × 4 / 2 = 10

    Overall Bivariate Pearson Chi-Square           215.353
    Overall Bivariate Log-Likelihood Chi-Square    214.695

Compare with the 5% critical value of χ² on 10 df = 18.307
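The pair count is simply "5 choose 2". This short Python sketch enumerates the pairs that Tech10 examines; the lower-case item names are taken from the Mplus input, and the ordering of pairs is only illustrative.

```python
from itertools import combinations

items = ["cough", "ill", "taste", "liked", "dizzy"]
pairs = list(combinations(items, 2))   # every unordered pair of items

print(len(pairs))
for first, second in pairs:
    print(first, second)
```

Adding a sixth binary item, as in the second attempt later in the deck, would raise the number of pairs to 15.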

Tech10 – Bivariate model fit
Not bad:

                                        Estimated Probabilities
                                                           Standardized
    Variable      Variable          H1        H0      Residual (z-score)
    COUGH         ILL
      Category 1    Category 1     0.511     0.506       0.457
      Category 1    Category 2     0.026     0.031      -1.321
      Category 2    Category 1     0.393     0.398      -0.467
      Category 2    Category 2     0.070     0.065       0.925
    Bivariate Pearson Chi-Square                         2.726
    Bivariate Log-Likelihood Chi-Square                  2.798

Tech10 – Bivariate model fit
Terrible:

                                        Estimated Probabilities
                                                           Standardized
    Variable      Variable          H1        H0      Residual (z-score)
    COUGH         ILL
      Category 1    Category 1     0.566     0.534       3.149
      Category 1    Category 2     0.338     0.370      -3.255
      Category 2    Category 1     0.024     0.056      -6.850
      Category 2    Category 2     0.072     0.040       7.977
    Bivariate Pearson Chi-Square                       116.657
    Bivariate Log-Likelihood Chi-Square                117.162

Conditional independence violated: need more classes

Obtain the ‘optimal’ model
Assess the following for models with increasing numbers of classes:
- aBIC
- Entropy
- BLRT (Bootstrap LRT)
- Conditional independence (Tech10)
- Ease of interpretation
- Consistency with previous work / theory

Model fit stats

                        1 class   2 class   3 class   4 class   5 class
    Estimated params          5        11        17        23        29
    H0 Likelihood       -6962.1   -6458.7   -6343.9   -6200.1   -6100.8
    aBIC                13947.4   12968.5   12766.9   12507.1   12336.5
    Entropy                   -     0.944     0.832     0.894     0.844
    Tech 10               625.2     228.1     214.7     135.9      17.6
    BLRT statistic                 1006.8     229.5     287.8     198.4
    BLRT p-value: < 0.0001
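The BLRT statistic in this table is twice the difference in loglikelihood between the k-class and (k−1)-class models. This Python sketch (hypothetical function name) recomputes it from the H0 likelihoods above; the bootstrapped p-value itself requires refitting both models to simulated data and is not attempted here.

```python
def lrt_statistic(loglik_smaller, loglik_larger):
    """Likelihood ratio statistic comparing a model with k-1 classes
    (smaller) against one with k classes (larger)."""
    return 2.0 * (loglik_larger - loglik_smaller)

# H0 likelihoods copied from the table, keyed by number of classes.
logliks = {1: -6962.1, 2: -6458.7, 3: -6343.9, 4: -6200.1, 5: -6100.8}
stats = {k: lrt_statistic(logliks[k - 1], logliks[k]) for k in range(2, 6)}
print({k: round(v, 1) for k, v in stats.items()})
```

Small discrepancies from the tabulated values come from the loglikelihoods being shown to one decimal place.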

5-class model
- aBIC values are still decreasing
- Tech 10 is still quite high, with residual correlations between ill and both liked and dizzy
- BLRT rejects the 4-class model
- Not enough df to fit a 6-class model, so we cannot assess the fit of the 5-class model
- A well-fitting 5-class model seems unlikely in any case, as the BLRT values are decreasing only slowly

Cross-sectional LCA: Patterns of first response to cigarette, Attempt 2

What to do?
- We need more degrees of freedom
- There were only 5 questions on response to smoking
- Add something else: How old were you when you first tried a cigarette? Split into pre-teen / teen
- 6 binary variables means 64 d.f. to play with

Model fit stats – attempt 2

                        3 class   4 class   5 class   6 class   7 class
    Estimated params         20        27        34        41        48
    H0 Likelihood       -7866.3   -7720.2   -7616.0   -7582.4   -7576.2
    aBIC                15825.6   15565.7   15389.9   15355.1   15375.2
    Entropy               0.823     0.893     0.812     0.876     0.850
    Tech 10               228.9     144.6      16.8       1.2      0.29
    BLRT statistic        123.3     146.1     104.2      67.3      12.4
    BLRT p-value       < 0.0001 (3 to 6 classes)                 0.2100


6-class model results

CLASS COUNTS AND PROPORTIONS FOR THE LATENT CLASSES BASED ON THE ESTIMATED MODEL

    Latent classes
    1     53.23894     2.1%
    2    541.96140    21.7%
    3    396.04196    15.9%
    4    454.89294    18.2%
    5    750.87470    30.1%
    6    295.99007    11.9%

CLASSIFICATION OF INDIVIDUALS BASED ON THEIR MOST LIKELY LATENT CLASS MEMBERSHIP

    Latent classes
    1     34     1.4%
    2    540    21.7%
    3    403    16.2%
    4    447    17.9%
    5    840    33.7%
    6    229     9.2%

Examine entropy in more detail
Model-level entropy = 0.876
Class-level entropy:

              1        2        3        4        5        6
    1     0.953    0.000    0.000    0.000    0.026    0.020
    2     0.000    0.997    0.000    0.000    0.002    0.001
    3     0.000    0.000    0.958    0.000    0.017    0.025
    4     0.000    0.000    0.000    0.949    0.041    0.011
    5     0.025    0.005    0.000    0.036    0.851    0.083
    6     0.000    0.000    0.043    0.003    0.036    0.918

Pattern-level entropy
- Save out the model-based probabilities
- Open in another stats package
- Collapse over response patterns

Save out the model-based probabilities

    savedata:
        file is "6-class-results.dat";
        save cprobabilities;

Varnames shown at end of output

SAVEDATA INFORMATION
Order and format of variables

    COUGH      F10.3
    ILL        F10.3
    TASTE      F10.3
    LIKED      F10.3
    DIZZY      F10.3
    LESS_13    F10.3
    ALN        F10.3
    QLET       F10.3
    SEX        F10.3
    CPROB1     F10.3
    CPROB2     F10.3
    CPROB3     F10.3
    CPROB4     F10.3
    CPROB5     F10.3
    CPROB6     F10.3
    C          F10.3

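Open / process in Stata
Remove excess spaces from the data file, then:

    * read the space-delimited Mplus output (columns arrive as v1, v2, ...)
    insheet using "6-class-results.dat", delim(" ")

    * rename v1, v2, ... to the names listed at the end of the Mplus output
    local i = 1
    local varnames "COUGH ILL TASTE LIKED DIZZY LESS_13 ALN QLET SEX CPROB1 CPROB2 CPROB3 CPROB4 CPROB5 CPROB6 C"
    foreach x of local varnames {
        rename v`i' `x'
        local i = `i' + 1
    }

    * one row per response pattern: mean probabilities plus a pattern count
    gen num = 1
    collapse (mean) CPROB* C (count) num, by(COUGH ILL TASTE LIKED DIZZY LESS_13)

Also easily done in SPSS using the ‘aggregate’ command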
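For readers without Stata or SPSS, the same collapse can be sketched in plain Python. This is an illustration rather than a translation of the commands above: the function name and the three toy rows are invented, and only two items and two classes are shown.

```python
from collections import defaultdict

def collapse_by_pattern(rows, pattern_keys, prob_keys):
    """Average the class probabilities within each response pattern and
    count how many respondents share that pattern."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in pattern_keys)].append(row)
    collapsed = []
    for pattern, members in groups.items():
        summary = dict(zip(pattern_keys, pattern))
        for k in prob_keys:
            summary[k] = sum(m[k] for m in members) / len(members)
        summary["num"] = len(members)
        collapsed.append(summary)
    return collapsed

# Hypothetical respondents (made-up values).
rows = [
    {"COUGH": 1, "ILL": 0, "CPROB1": 0.9, "CPROB2": 0.1},
    {"COUGH": 1, "ILL": 0, "CPROB1": 0.9, "CPROB2": 0.1},
    {"COUGH": 0, "ILL": 0, "CPROB1": 0.2, "CPROB2": 0.8},
]
result = collapse_by_pattern(rows, ["COUGH", "ILL"], ["CPROB1", "CPROB2"])
```

Because everyone with the same response pattern receives the same posterior probabilities, the within-pattern mean is just that shared value, and `num` gives the pattern frequency.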

Check the assignment probabilities for each class cough ill taste liked dizzy < 13 P_c1 P_c2 P_c3 P_c4 P_c5 P_c6 Mod class n 1 0.052 0.948 6 64 0.003 0.001 0.996 34 0.027 0.973 30 0.135 0.062 0.803 29 25 0.154 0.032 0.815 18 0.071 0.054 0.874 0.073 0.012 0.915 4 0.303 0.696 0.329 0.671 0.411 0.589 3 0.065 0.024 0.912 0.055 0.029 0.917 2 0.023 0.977 0.039 0.96 0.044 0.955


The lower the value of entropy, the less certainty there is behind the assignment of individual response patterns. You often find that one class is woollier than the rest, as it’s made up of all the unusual patterns that won’t go elsewhere.

Bad taste (30.1%)

Positive experience (21.7%)

Coughed (18.2%)

Dizziness (15.9%)

V negative experience (11.9%)

Felt ill (2.1%)

Well that was a complete waste of time!
- You might think that those resulting classes could have been derived just by looking at the response patterns and making some arbitrary decisions, e.g.
    - Group all of those who had >1 negative experience
    - Keep separate each group who had 1 experience
- You would have ended up with a bunch of weird patterns with no clue of what to do with them
- Strange patterns likely to be measurement error?
- LCA incorporates ALL patterns and deals with uncertainty through the posterior probabilities

Conclusions / warning
- Like EFA, LCA is an exploratory tool with the aim of summarising the variability in the dataset in a simple/interpretable way
- These results do not prove that there are 6 groups of young people in real life
- LCA will find groupings in the data even if there is no reason to think such groups might exist
- It’s just mathematics and it knows no better

Remember, we are dealing with probabilities

                     Model-based         “Modal assignment”
    Ill              53.24     2.1%         34     1.4%
    Positive        541.96    21.7%        540    21.7%
    Dizzy           396.04    15.9%        403    16.2%
    Coughed         454.89    18.2%        447    17.9%
    Bad taste       750.87    30.1%        840    33.7%
    V negative      295.99    11.9%        229     9.2%

- Working with modal assignment is easy: chuck each pattern into its most likely class and pretend everything is OK
- Equivalent to doing a single imputation for missing data – shudder!
- Unless entropy is V high, stick with the probabilities
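The two columns of counts differ because one sums posterior probabilities and the other counts most-likely classes. This is a Python sketch with a hypothetical function name and three invented respondents.

```python
def class_counts(posteriors):
    """Return (model-based counts, modal-assignment counts) per class."""
    k = len(posteriors[0])
    # Model-based: each respondent contributes fractionally to every class.
    model_based = [sum(row[c] for row in posteriors) for c in range(k)]
    # Modal: each respondent counts as one whole person in their likeliest class.
    modal = [0] * k
    for row in posteriors:
        modal[row.index(max(row))] += 1
    return model_based, modal

# Hypothetical posterior probabilities for three respondents, two classes.
posteriors = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
model_based, modal = class_counts(posteriors)
print(model_based, modal)
```

The respondent with probabilities 0.6 and 0.4 counts as a whole person in class 1 under modal assignment, which is exactly the information the slide warns against throwing away.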

Covariates and outcomes

Merging the classes with other data
- In the “olden days”, you could pass your ID variable through Mplus so that when you saved your class probabilities you could merge them with other data.
- Now you can pass other data through Mplus as well – hurrah!

    Variable:
        <snip>
        auxiliary are ID sex;

- Add desired covariates, repeat the savedata exercise, read the results into Stata, but don’t collapse over patterns

Reshaping the dataset
- To account for the uncertainty in our class variable we will need to weight by the posterior probabilities obtained from Mplus
- A weighted model requires a reshaping of the dataset so that each respondent has n rows (for an n-class model) rather than just 1

Pre-shaped – first 20 kids

    +-----------------------------------------------------------------------------------------------+
    |    ID    sex   dev_18   dev_42   pclass1   pclass2   pclass3   pclass4   pclass5   modclass  |
    |-----------------------------------------------------------------------------------------------|
    | 30004   male        3        .      .001         0      .803         0      .197          3  |
    | 30008   male        2        1      .908         0         0      .007      .085          1  |
    | 30010   male        2        2      .053      .001      .052         0      .894          5  |
    | 30023   male        1        3      .115         0      .596      .001      .288          3  |
    | 30031   male        3        4         0         0      .983         0      .016          3  |
    | 30033   male        4        4      .392         0      .397         0      .211          3  |
    | 30042   male        1        3         0         0      .983         0      .016          3  |
    | 30050   male        3        2         0         0      .983         0      .016          3  |
    | 30051   male        2        2         0         0         0         1         0          4  |
    | 30057   male        1        3      .135         0      .002         0      .864          5  |
    | 30058   male        1        4         0         0      .958         0      .041          3  |
    | 30064   male        2        4         0         0      .983         0      .016          3  |
    | 30068   male        4        3      .001         0      .803         0      .197          3  |
    | 30070   male        3        4         0         0      .983         0      .016          3  |
    | 30072   male        1        1         0         0      .983         0      .016          3  |
    | 30075   male        3        3         0         0      .982         0      .018          3  |
    | 30088   male        3        4       .03      .002      .889      .003      .076          3  |
    | 30095   male        3        .         0         0      .983         0      .016          3  |
    | 30098   male        3        .      .068      .158      .173      .018      .583          5  |
    | 30104   male        4        1      .008         0      .775         0      .217          3  |
    +-----------------------------------------------------------------------------------------------+

In the listing above, sex, dev_18 and dev_42 are the covariates, pclass1 to pclass5 are the posterior probabilities, and modclass is the modal class.

The reshaping

    . reshape long pclass, i(id) j(class)
    (note: j = 1 2 3 4 5)

    Data                               wide   ->   long
    ---------------------------------------------------------
    Number of obs.                     5584   ->   27920
    Number of variables                  66   ->      63
    j variable (5 values)                     ->   class
    xij variables:
            pclass1 pclass2 ... pclass5       ->   pclass

Re-shaped – first 3 kids

         +--------------------------------------------------+
         |    id    sex   dev_18   dev_42   pclass   class  |
         |--------------------------------------------------|
      1. | 30004   male        3        .     .001       1  |   First kid
      2. | 30004   male        3        .        0       2  |
      3. | 30004   male        3        .     .803       3  |
      4. | 30004   male        3        .        0       4  |
      5. | 30004   male        3        .     .197       5  |
      6. | 30008   male        2        1     .908       1  |   Second kid
      7. | 30008   male        2        1        0       2  |
      8. | 30008   male        2        1        0       3  |
      9. | 30008   male        2        1     .007       4  |
     10. | 30008   male        2        1     .085       5  |
     11. | 30010   male        2        2     .053       1  |   Third kid
     12. | 30010   male        2        2     .001       2  |
     13. | 30010   male        2        2     .052       3  |
     14. | 30010   male        2        2        0       4  |
     15. | 30010   male        2        2     .894       5  |

- Within each kid, pclass sums to 1
- The covariates (sex, dev_18, dev_42) are constant within child
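The wide-to-long step can be mimicked in plain Python to show what the reshape does to each respondent. This is a sketch, not the Stata command: the function name is hypothetical and only the first two children from the listing are included.

```python
def reshape_long(rows, prob_prefix, n_classes):
    """Turn one row per respondent into n_classes rows per respondent,
    copying the other columns and stacking the class probabilities."""
    long_rows = []
    for row in rows:
        fixed = {k: v for k, v in row.items() if not k.startswith(prob_prefix)}
        for c in range(1, n_classes + 1):
            long_rows.append({**fixed, "class": c,
                              "pclass": row[f"{prob_prefix}{c}"]})
    return long_rows

# First two children from the pre-shaped listing.
wide = [
    {"ID": 30004, "sex": "male", "pclass1": 0.001, "pclass2": 0.0,
     "pclass3": 0.803, "pclass4": 0.0, "pclass5": 0.197},
    {"ID": 30008, "sex": "male", "pclass1": 0.908, "pclass2": 0.0,
     "pclass3": 0.0, "pclass4": 0.007, "pclass5": 0.085},
]
long_rows = reshape_long(wide, "pclass", 5)
```

Each child now occupies five rows, one per class, with the probability for that class in `pclass` and the covariates repeated unchanged.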

Similar with our data:

    . list id SEX CPROB class C in 1/12

         +---------------------------------+
         |    id   SEX   CPROB   class   C |
         |---------------------------------|
      1. | 30012     2       0       1   4 |   First respondent
      2. | 30012     2       0       2   4 |
      3. | 30012     2       0       3   4 |
      4. | 30012     2    .945       4   4 |
      5. | 30012     2    .045       5   4 |
      6. | 30012     2     .01       6   4 |
      7. | 30024     2       0       1   5 |   Second respondent
      8. | 30024     2       0       2   5 |
      9. | 30024     2       0       3   5 |
     10. | 30024     2       0       4   5 |
     11. | 30024     2    .991       5   5 |
     12. | 30024     2    .009       6   5 |

Simple crosstab

    . tab class SEX, row nofreq

               |          SEX
         class |        1        2 |    Total
    -----------+-------------------+---------
           Ill |    40.87    59.13 |   100.00
      Positive |    40.87    59.13 |   100.00
         Dizzy |    40.87    59.13 |   100.00
       Coughed |    40.87    59.13 |   100.00
     Bad taste |    40.87    59.13 |   100.00
    V negative |    40.87    59.13 |   100.00
    -----------+-------------------+---------
         Total |    40.87    59.13 |   100.00

Oops! Without weights, every respondent contributes one row to every class, so each class simply reproduces the overall sex split.

Simple crosstab – take 2

    . tab class SEX [iw = CPROB], row nofreq

         class |    Male   Female |   Total
    -----------+------------------+--------
           Ill |   52.9%    47.1% |    100%
      Positive |   32.9%    67.1% |    100%
         Dizzy |   43.2%    56.8% |    100%
       Coughed |   40.8%    59.2% |    100%
     Bad taste |   45.2%    54.8% |    100%
    V negative |   39.3%    60.7% |    100%
    -----------+------------------+--------
         Total |   40.9%    59.1% |    100%

I’ve removed the cell counts as they give a false impression. We’re now dealing with probabilities, not real whole people
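The effect of the importance weights can be seen on a toy long-format dataset. This Python sketch uses a hypothetical function name and two invented respondents; it is not a reimplementation of Stata's `tab`.

```python
from collections import defaultdict

def row_percentages(long_rows, row_key, col_key, weight_key=None):
    """Row percentages of a two-way table; each row counts 1 unless a
    weight column is named."""
    totals = defaultdict(lambda: defaultdict(float))
    for r in long_rows:
        weight = 1.0 if weight_key is None else r[weight_key]
        totals[r[row_key]][r[col_key]] += weight
    return {
        row: {col: 100.0 * w / sum(cols.values()) for col, w in cols.items()}
        for row, cols in totals.items()
    }

# Hypothetical long-format data: two respondents, two classes.
long_rows = [
    {"id": 1, "sex": "male", "class": 1, "pclass": 0.8},
    {"id": 1, "sex": "male", "class": 2, "pclass": 0.2},
    {"id": 2, "sex": "female", "class": 1, "pclass": 0.3},
    {"id": 2, "sex": "female", "class": 2, "pclass": 0.7},
]
unweighted = row_percentages(long_rows, "class", "sex")
weighted = row_percentages(long_rows, "class", "sex", "pclass")
```

Unweighted, both classes show the same 50/50 split, mirroring the "Oops!" table; weighting by the posterior probabilities lets the sex distribution differ between classes.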

Compare with modal class assignment

Weighted by posterior probabilities:

    . tab class SEX [iw = CPROB], row nofreq

               |       SEX
         class |    Male   Female |
    -----------+------------------+
           Ill |   52.9%    47.1% |
      Positive |   32.9%    67.1% |
         Dizzy |   43.2%    56.8% |
       Coughed |   40.8%    59.2% |
     Bad taste |   45.2%    54.8% |
    V negative |   39.3%    60.7% |
         Total |   40.9%    59.1% |

Modal class (one row per respondent):

    . tab C SEX if (class==1), row nofreq

               |       SEX
             C |    Male   Female |
    -----------+------------------+
           Ill |   50.0%    50.0% |
      Positive |   33.0%    67.0% |
         Dizzy |   43.4%    56.6% |
       Coughed |   40.7%    59.3% |
     Bad taste |   45.4%    54.6% |
    V negative |   37.6%    62.4% |
         Total |   40.9%    59.1% |

Multinomial logistic

    . xi: mlogit class i.SEX [iw = CPROB], rrr

    Multinomial logistic regression                   Number of obs   =       2493
                                                      LR chi2(5)      =      24.52
                                                      Prob > chi2     =     0.0002
    Log likelihood = -4053.3746                       Pseudo R2       =     0.0030

    ------------------------------------------------------------------------------
           class |        RRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    Ill          |
         _ISEX_2 |   .7322787   .2081189    -1.10   0.273     .4195259    1.278186
    Positive     |
         _ISEX_2 |   1.677364   .1965463     4.41   0.000     1.333175    2.110413
    Dizzy        |
         _ISEX_2 |   1.082775   .1355213     0.64   0.525     .8472297    1.383807
    Coughed      |
         _ISEX_2 |   1.194885   .1437877     1.48   0.139     .9438344    1.512712
    V negative   |
         _ISEX_2 |   1.274734   .1782148     1.74   0.083     .9692081    1.676572
    ------------------------------------------------------------------------------
    (class==Bad taste is the base outcome)

Class predicts binary outcome
Outcome = weekly smoker at age of 15

    . char class[omit] 5
    . xi: logistic sm1100 i.class [iw = CPROB]

    Logistic regression                               Number of obs   =       2493
                                                      LR chi2(5)      =     229.03
                                                      Prob > chi2     =     0.0000
    Log likelihood =  -1168.697                       Pseudo R2       =     0.0892

    ------------------------------------------------------------------------------
          sm1100 | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             Ill |   2.132652   .9125838     1.77   0.077     .9218961    4.933531
        Positive |   7.190203   1.231216    11.52   0.000     5.140265    10.05766
           Dizzy |   7.899915   1.413907    11.55   0.000     5.562583    11.21937
         Coughed |   3.686492   .6831946     7.04   0.000     2.563689    5.301041
      V negative |   2.243034    .497619     3.64   0.000     1.452099     3.46478
    ------------------------------------------------------------------------------

Compare with modal class

Posterior probabilities

    ------------------------------------------------------------------------------
          sm1100 | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             Ill |   2.132652   .9125838     1.77   0.077     .9218961    4.933531
        Positive |   7.190203   1.231216    11.52   0.000     5.140265    10.05766
           Dizzy |   7.899915   1.413907    11.55   0.000     5.562583    11.21937
         Coughed |   3.686492   .6831946     7.04   0.000     2.563689    5.301041
      V negative |   2.243034    .497619     3.64   0.000     1.452099     3.46478

Modal assignment

             Ill |   2.560182   1.291868     1.86   0.062     .9522577     6.88315
        Positive |   7.802047   1.313428    12.20   0.000     5.609367    10.85184
           Dizzy |     8.3454   1.467249    12.07   0.000     5.912796    11.77881
         Coughed |   4.224301   .7686958     7.92   0.000     2.957071    6.034592
      V negative |   2.861537   .6548723     4.59   0.000     1.827254    4.481255

Conclusions
- Young people at 15yrs can report a variety of responses to their first cigarette
- Certain responses are associated with current regular smoking behaviour
- 15 year-old girls are more likely to retrospectively report a positive experience
- Recall bias is likely to play a part in these associations

Conclusions
- LCA is an exploratory tool which can be used to simplify a set of binary responses
- Extension to ordinal responses is straightforward
- The use of ordinal data is an alternative way to boost degrees of freedom
- The resulting probabilities can be used to model the latent class variable as a risk factor or outcome
- A modal class variable should be used with caution