Download presentation
Presentation is loading. Please wait.
1
Missing Data Michael C. Neale International Workshop on Methodology for Genetic Studies of Twins and Families Boulder CO 2006 Virginia Institute for Psychiatric and Behavioral Genetics Virginia Commonwealth University Vrije Universiteit Amsterdam
2
Various forms of missing data Twins volunteer to participate or not Twins are obtained from hospital records Attrition during longitudinal studies Equipment failure: Systematic (if < value then scored zero) Random Structured interviews Don’t bother asking question 2 if response to Q1 is no Designs to increase statistical power
3
Plan of Talk 1. Review likelihood & Mx facilities 2. Conceptual view of ascertainment 1. Ordinal 2. Continuous 3. Ascertainment in twin studies 1. Ordinal 2. Continuous 4. Missing data 1. Systematic (if < value then scored zero) 2. Random 5. Structured interviews 1. Don’t bother asking question 2 if response to Q1 is no
4
Likelihood approach Usual nice properties of ML remain Flexible Simple principle Consideration of possible outcomes Re-normalization May be difficult to compute Advantages & Disadvantages
5
Maximum Likelihood Estimates Asymptotically unbiased Minimum variance of all asymptotically unbiased estimators Invariant to transformations a 2 = (a) 2 Have nice properties ^ ^
6
Example: Two Coin Toss 3 outcomes HHHT/THTT Outcome 0 0.5 1 1.5 2 2.5 Frequency Probability i = freq i / sum (freqs)
7
Example: Two Coin Toss 2 outcomes HHHT/THTT Outcome 0 0.5 1 1.5 2 2.5 Frequency Probability i = freq i / sum (freqs)
8
Non-random ascertainment Probability of observing TT globally 1 outcome from 4 = 1/4 Probability of observing TT if HH is not ascertained 1 outcome from 3 = 1/3 or 1/4 divided by 'ascertainment correction' of 3/4 = 1/3 Example
9
Correcting for ascertainment Univariate continuous case; only subjects > t ascertained 01234-2-3-4 0 0.1 0.2 0.3 0.4 0.5 : G xixi N likelihood t
10
Correcting for ascertainment Univariate continuous case; only subjects > t ascertained 01234-2-3-4 0 0.1 0.2 0.3 0.4 0.5 : G xixi N likelihood t
11
Correcting for ascertainment Without ascertainment, we compute With ascertainment, the correction is Dividing by the realm of possibilities Does likelihood increase or decrease after correction? pdf, N(: ij, G ij ), at observed value x i divided by: I -4 N(: ij, G ij ) dx = 1 4 t I -4 N(: ij, G ij ) dx < 1
12
Ascertainment in Practice BG Studies Studies of patients and controls Patients and relatives Twin pairs with at least one affected Single ascertainment pi 6 0 Complete ascertainment pi = 1 Incomplete0 < pi <1 Linkage studies Affected sib pairs, DSP etc Multiple affected families pi = probability that someone is ascertained given that they are affected
13
Correction depends on model 1 Correction independent of model parameters: "sample weights" 2 Correction depends on model parameters: weights vary during optimization In twin data almost always case 2 continuous data binary/ordinal data
14
High correlation Concordant>t txtx 0 1 I tx I ty N(x,y) dy dx 4 4 1010 tyty
15
Medium correlation txtx tyty 0 1 I tx I ty N(x,y) dy dx 4 4 1010
16
Low correlation tyty 0 1 I tx I ty N(x,y) dy dx 4 4 1010 txtx
17
Two approaches for twin data in Mx Contingency table approach Automatic Limited to two variable case Raw data approach Manual Multivariate Moderator / Covariates
18
Contingency Table Case Feed program contingency table as usual Use -1 for frequency for non-ascertained cells Correction for ascertainment handled automatically Binary data
19
At least one twin affected txtx tyty 1010 0 1 1-I -4 I -4 N(x,y) dy dx txty Ascertainment Correction
20
Ascertain iff twin 1 > t txtx tyty 1010 0 1 I ty I -4 N(x,y) dx dy 4 4 I ty N(y) dy = 4 Twin 1 Twin 2
21
Contingency Tables Use -1 for cells not ascertained Can be used for ordinal case Need to start thinking about thresholds Supply estimated population values Estimate them jointly with model
22
Mx Syntax Classical Twin Study: Contingency Table ftp://views.vcu.edu/pub/mx/examples/ncbook2/categor.mx G1: Model parameters Data Calc NGroups=4 Begin Matrices; X Lower 1 1 Free Y Lower 1 1 Free Z Lower 1 1 Free W Lower 1 1 End Matrices; ! parameters are fixed by default, unless declared free Begin Algebra; A= X*X'; C= Y*Y'; E= Z*Z'; D= W*W'; End Algebra: End
23
Mx Syntax Group 2 G2: young female MZ twin pairs Data Ninput=2 CTable 2 2 329 83 95 83 Begin Matrices= Group 1 T full 2 1 Free End Matrices; Covariances A+C+D+E | A+C+D _ A+C+D | A+C+D+E ; Thresholds T ; Options RSidual End
24
Mx Syntax Group 3 G3: young female DZ twin pairs Data Ninput=2 CTable 2 2 201 94 82 63 Begin Matrices= Group 1 H Full 1 1 Q Full 1 1 T Full 2 1 Free End Matrices; Matrix H.5 Matrix Q.25 Start.6 All Covariances A+C+D+E | H@A+C+Q@D _ H@A+C+Q@D | A+C+D+E / Thresholds T ; Options RSidual NDecimals=4 End
25
Mx Syntax Group 4 Group 4: constrain variance to 1 Constraint NI=1 Begin Matrices = Group 1 ; I unit 1 1 End Matrices; Constraint I = A+C+E+D ; Option Multiple End Specify 2 t 8 8 Specify 3 t 8 8 End
26
Raw data approach Correction not always necessary ML MCAR/MAR Prediction of missingness Correct through weight formula
27
What can we do with Mx? Normal theory likelihood function for raw data in Mx j=1 ln L i = f i ln [ 3 w j g(x i,: ij, G ij )] m x i - vector of observed scores on n subjects :ij - vector of predicted means Gij - matrix of predicted covariances - functions of parameters
28
Likelihood Function Itself Example: Normal pdf The guts of it j=1 ln L i = f i ln [ 3 w ij g(x i,: ij, G ij )] m g(x i,: ij, G ij ) - likelihood function
29
Normal distribution N(: ij, G ij ) Likelihood is height of the curve 01234-2-3-4 0 0.1 0.2 0.3 0.4 0.5 : G xixi N likelihood
30
Weighted mixture of models Finite mixture distribution j=1 m j = 1....m models w ij Weight for subject i model j e.g., Segregation analysis ln L i = f i ln [ 3 w ij g(x i,: ij, G ij )]
31
Mixture of Normal Distributions Two normals, propotions w1 & w2, different means But Likelihood Ratio not Chi-Squared - what is it? 0123-2-3-4 0 0.1 0.2 0.3 0.4 0.5 :1:1 xixi g :2:2 w 1 x l 1 w 2 x l 2
32
General Likelihood Function Sample frequencies binary data Sometimes 'sample weights' Might also vary over model j Finally the frequencies f i - frequency of case i j=1 ln L i = f i ln [ 3 w j g(x i,: ij, G ij )] m
33
General Likelihood Function Model for Means can differ Model for Covariances can differ Weights can differ Frequencies can differ Things that may differ over subjects i = 1....n subjects (families) j=1 ln L i = f i ln [ 3 w ij g(x i,: ij, G ij )] m
34
Raw Ordinal Data Syntax Read in ordinal file May use frequency command to save space Weight uses \mnor function \mnor(R_M_U_L_K) R - covariance matrix (p x p) M - mean vector (1xp) U - upper threshold (1xp) L - lower threshold (1xp) K - indicator for type of integration in each dimension (1xp) 0: L=-4; 1: U=+4 2: Iu; 3: L=-4 U=4 L
35
Mx Syntax G1: Model parameters Data Calc NGroups=4 Begin Matrices; X Lower 1 1 Free Y Lower 1 1 Free Z Lower 1 1 Free W Lower 1 1 End Matrices; ! parameters are fixed by default, unless declared free Begin Algebra; A= X*X'; C= Y*Y'; E= Z*Z'; D= W*W'; End Algebra: End
36
Mx Syntax G2: MZ twin pairs Data Ninput=3 Ordinal File=mz.frq Labels T1 T2 Freq Definition Freq ; Begin Matrices= Group 1 T full 2 1 Free F full 1 1 ! Frequency End Matrices; Specify F Freq Covariances A+C+D+E | A+C+D _ A+C+D | A+C+D+E ; Thresholds T ; Frequency F; Options RSidual End
37
Mx Syntax G3: DZ twin pairs Data Ninput=3 Labels T1 T2 Freq Ordinal File=dz.frq Definition Freq ; Begin Matrices= Group 1 H Full 1 1 Q Full 1 1 T Full 2 1 Free F full 1 1 ! Frequency End Matrices; Specify F Freq Matrix H.5 Matrix Q.25 Start.6 All Covariances A+C+D+E | H@A+C+Q@D _ H@A+C+Q@D | A+C+D+E / Thresholds T ; Frequency F ; Options RSidual NDecimals=4 End
38
Mx Syntax Group 4: constrain variance to 1 Constraint NI=1 Begin Matrices = Group 1 ; I unit 1 1 End Matrices; Constraint I = A+C+E+D ; Option Multiple End Specify 2 t 8 9 Specify 3 t 8 9 End
39
Ascertainment additional commands Why inverse of J and K? Begin Algebra; M=(A+C+E|A+C_A+C|A+C+E); N=(A+C+E|h@A+C_h@A+C|A+C+E); J=I-\mnor(M_Z_T_T_Z); ! Z = [0 0] K=I-\mnor(N_Z_T_T_Z); ! DZ case End Algebra; Weight J~; ! for MZ group Weight K~; ! DZ group
40
Correcting for ascertainment Multivariate selection: multiple integrals double integral for ASP four double integrals for EDAC Use (or extend) weight formula Precompute in a calculation group unless they vary by subject Linkage studies
41
When ascertainment is NOT necessary Various flavors of missing data mechanisms MCAR: Missing completely at random MAR: Missing at random NMAR: Not missing at random Little & Rubin Terminology
42
Simulation: 3 types of missing data Selrand: MCAR missingness function of independent random variable Selonx: MAR missingness predicted by other measured variable in analysis + MCAR Selony: NMAR missingness mechanism related to ''residual" variance in dependent variable
43
Method Simulate bivariate normal data X,Y Sigma = 1.5 .5 1 Mu = 0, 0 Make some variables missing Generate independent random normal variable, Z, if Z>0 then Y missing If X>0 then Y missing If Y>0 then Y missing Estimate elements of Sigma & Mu Constrain elements to population values 1,.5, 0 etc Compare fit Ideally, repeat multiple times and see if expected 'null' distribution emerges
44
Method Check to see if estimates ‘look right’ Quantify ‘look right’ Test model with parameters fixed to population values Look at chi-squared Repeat! Examine distribution of chi-squareds
45
R simulation 'model'
46
R simulation script: selrand.R r<-0.5 #Correlation (r) N<-1000 #Number of pairs of scores outfile<-"selrand.rec"#output file for the simulation vals<-matrix(rnorm((N*3)),,3);#3 random normals per pair of observed scores rootr<-sqrt(r);#Square root of r root1mr<-sqrt(1-r); #Square root of (1-r) #Data generation follows: pairs<-matrix(cbind(rootr*vals[,1]+root1mr*vals[,2],rootr*vals[,1] +root1mr*vals[,3]),,2) # pairs now contains pairs of scores drawn from population # with unit variances and correlation r
47
# Now to make some of the scores missing, using a loop for (i in 1:N) { x<-rnorm(1,0,1) # x is an independent random number from N(0,1) if (x>0) pairs[i,2] <- NA } write.table (pairs,file=outfile,row.names=FALSE,na=".",col.names=FALSE) R simulation script: selrand.R
48
R simulation script: selonx.R # Now to make some of the scores missing, using a loop for (i in 1:N) { if (pairs[i,1]>0) pairs[i,2] <- NA } write.table(pairs,file=outfile,row.names=FALSE,na=".", col.names=FALSE) # note how missingness of 2 nd column of pairs depends on # value of 1 st column
49
R simulation script: selony.R # Now to make some of the scores missing, using a loop for (i in 1:N) { if (pairs[i,2]>0) pairs[i,2] <- NA } write.table(pairs,file=outfile,row.names=FALSE,na=".", col.names=FALSE) # note how missingness of 2 nd column of pairs depends on # value of 2 nd column itself
50
Mx Script Rather basic, like Monday morning Estimate pop cov matrix of X&Y, with Y observed iff X>0 Data ngroups=1 ninput=2 Rectangular file=selonx.rec Begin Matrices; a sy 2 2 free ! covariance of x,y m fu 1 2 free ! mean of x,y End Matrices; Means M / Covariance A / matrix a 1.3 1 bound.1 2 a 1 1 a 2 2 option rs mu issat end fix all matrix 1 a 1.5 1 matrix 1 m 0 end
51
Mx Scripts & Data Check output: Summary statistics (obs means) Estimated means & covariance matrices Difference in fit between estimated values and population values Interpretation? F:\mcn\2006\sel
52
ML estimation under different missingness mechanisms Missingnessmean xmean yvar xcov xyvar y LR Chisq MCAR (rand) MLE MAR (on x) MLE NMAR (on y) MLE
53
ML estimation under different missingness mechanisms Missingnessmean xmean yvar xcov xyvar y LR Chisq MCAR (rand) MLE -0.0116-0.11.05050.49980.87696.492 sample-0.0116-0.09191.05050.8839 MAR (on x) MLE 0.00480.09981.00840.44811.10255.768 sample0.00140.44371.00840.9762 NMAR (on y) MLE -0.02040.68050.99960.13560.2894227.262 sample0.04480.73730.99960.2851
54
BG Studies: Screen + Examination Bivariate analysis of screen & exam No ascertainment correction required Example: all pairs where at least one screens positive are examined Works for continuous & ordinal Undersampling: some proportion of pairs concordant negative for screen are also examined Ascertainment correction required Different correction for screen -- vs +-/-+/++ Only a subset, selected on basis of screen, are examined
55
Other examples of necessarily missing data Consider: Measures of abuse/dependence in non-users Age at onset in the healthy Symptoms of schizophrenia in non- schizophrenics How do we know whether, e.g., Drug use is related to drug abuse Age at onset predicts severity
56
Two binary variables X and Y Both X and Y observed XY Outcomes: 00, 01, 10, 11 But Y observed only if X=1 XY Outcomes 0?, 10, 11 Only three cell frequencies “0/1/2” No use / Use with no abuse / Use and abuse No variance in use when abuse is observed Amnesiac researcher, forgot to get data from relatives
57
Two binary variables in twins Twin 1 No Yes No Yes No Yes No Yes No Yes No Yes Twin 2 1 0 1 0 1 1 1 1 1 0 1 1 0 0 0 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 1 1 0 1 1 0 0 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 1 0 1 Enlightened researcher, good memory, has data from relatives
58
Censored case Y iff X=1 Twin 1 No Yes No Yes No Yes No Yes No Yes No Yes Twin 2 0 ? 0 ? 1 0 0 ? 1 1 1 0 0 ?1 1 0 ? 1 0 1 0 1 1 1 1 1 0 1 1
59
Can estimate parameters! Correlation for Initiation Correlation for Dependence Correlation Twin 1 initiation with Twin 2 Dependence (“proxy”) Indirectly assess correlation between initiation and dependence A Twin 1 A Twin 2 B Twin 1 B Twin 2 b b VA VB* VAVB* rB*rA
60
Multivariate case: Stem and probe Stem = Yes STEM P2P3 P4 P5 P6 F1 l6l6 l1l1
61
Stem and probe items Stem = No Probes Missing STEM P2P3 P4 P5 P6 F1 l6l6 l1l1
62
Proxy information from cotwin identifies the model STEM P2P3 P4 P5 P6 F1 l6l6 l1l1 STEM P2P3 P4 P5 P6 F2 l6l6 l1l1 R > 0
63
Conclusion Be careful when designing studies with non-random ascertainment Usually possible to correct In principle, heritability should not change In practice, it might
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.