Simulating and Modeling Extended Twin Family Data


Simulating and Modeling Extended Twin Family Data
Matthew C. Keller, Sarah E. Medland, Hermine Maes
(also Lindon Eaves, Mike Neale, Pete Hatemi, Laramie Duncan)

Outline
- Motivation for using extended pedigrees
  - Briefly introduce the NTFD, Stealth, & Cascade models
- Simulation using GeneEvolve
  - Practical: looking at changes in genetic variance across time
- Use GeneEvolve to simulate extended twin family data & run in Mx
  - Practical: sensitivity analysis of CTD & NTFD

Goals
- Understand in general terms the reason for extended twin family designs
- Learn how to use GeneEvolve
- Understand how to derive biases & sampling distributions from simulation

NON-Goal
- Fully understand the logic, path diagrams, and scripting of extended twin family models

Structural Equation Modeling (SEM) in BG
SEM is great because…
- Directs focus to effect sizes, not “significance”
- Forces consideration of causes and consequences
- Explicit disclosure of assumptions
Potential weakness…
- Parameter reification: “Using the CTD we found that 50% of variation is due to A and 20% to C.”
- Not necessarily. This is true only under assumptions that may often be unmet (e.g., D=0) and usually go untested. To the degree the assumptions are wrong, the estimates are biased.

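Classical Twin Design (CTD)
[Path diagram: latent factors A, C, D, and E with paths a, c, d, and e to the twin phenotypes PT1 and PT2; cross-twin correlations labeled 1 and 1/.25.]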

Classical Twin Design (CTD)

Assumption                 Biased up    Biased down
Either D or C is zero      A            C & D
No assortative mating      C            D
No A-C covariance          C            D & A

[Path diagram: the CTD, as on the previous slide.]

Why can’t we estimate A, C & D at the same time using twins only?
Solve the following two equations for A, C, & D:
CVmz = A + D + C
CVdz = 1/2 A + 1/4 D + C
Two equations cannot be solved uniquely for three unknowns. The ‘full’ model requires several different types of relatives to solve uniquely for each unknown. Thus let’s start with ACE.
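A minimal Python sketch of why the two twin covariances cannot pin down three components. The function name and the starting values (A=.3, D=.2, C=.1) are illustrative assumptions, not values taken from the slides.

```python
def twin_covariances(a, d, c):
    """Expected MZ and DZ covariances under the two CTD equations."""
    cv_mz = a + d + c
    cv_dz = 0.5 * a + 0.25 * d + c
    return cv_mz, cv_dz


# Hypothetical "true" variance components (illustrative values only).
A, D, C = 0.3, 0.2, 0.1

# Moving along the direction (dA, dD, dC) = (-1.5, 1, 0.5) leaves both
# covariances unchanged, so every point on this line fits twin-only data
# equally well.
for t in (0.0, -0.1, -0.2):
    a, d, c = A - 1.5 * t, D + t, C + 0.5 * t
    cv_mz, cv_dz = twin_covariances(a, d, c)
    print(f"A={a:.2f} D={d:.2f} C={c:.2f} -> CVmz={cv_mz:.3f} CVdz={cv_dz:.3f}")
```

All three parameter sets imply the same pair of covariances, including the last one, in which D and C are both zero and everything is attributed to A.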

Why does simply setting D or C to zero bias C & D down and A up?
Information to estimate A comes from the ratio CVmz : 2*CVdz. The closer this ratio is to unity, the higher A is. If D & C both exist at the same time, D drives the ratio up and C drives it down. To the degree these effects ‘cancel each other out,’ it looks like A at the expense of D & C.
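The opposing pull of D and C on the ratio can be checked numerically. This is a small sketch, assuming the same expected-covariance equations as the CTD (CVmz = A + D + C, CVdz = 1/2 A + 1/4 D + C); the scenario values are hypothetical.

```python
def mz_to_twice_dz_ratio(a, d, c):
    """Ratio CVmz : 2*CVdz implied by additive (a), dominance (d), and
    common-environment (c) variance components."""
    cv_mz = a + d + c
    cv_dz = 0.5 * a + 0.25 * d + c
    return cv_mz / (2 * cv_dz)


# Illustrative scenarios (hypothetical values, not from the slides).
scenarios = {
    "A only": (0.6, 0.0, 0.0),
    "A + D": (0.4, 0.2, 0.0),
    "A + C": (0.4, 0.0, 0.2),
    "A + D + C": (0.3, 0.2, 0.1),
}
for label, (a, d, c) in scenarios.items():
    print(f"{label:10s} ratio = {mz_to_twice_dz_ratio(a, d, c):.2f}")
```

D alone pushes the ratio above 1 and C alone pushes it below 1. In the mixed scenario the two effects cancel, so the ratio is indistinguishable from the A-only case.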

Adding parents gets us around these assumptions
We don’t have to make these:
- Either D or C is zero
- No assortative mating
- No A-C covariance
[Path diagram: parents PFa and PMa and twins PT1 and PT2 with latent A, C, D, and E factors; spousal correlation µ; twin correlation 1/.25; additional paths labeled m, q, w, and x.]

Parents also allow differentiation of C into S & F
With parents, we can break “C” up into:
- S = environmental factors shared only between sibs
- F = familial environmental factors passed from parents to offspring
[Path diagrams: twin model for PT1 and PT2 with a single C factor, alongside the same model with C split into S and F.]

Nuclear Twin Family Design (NTFD)
[Path diagram: parents PFa and PMa and twins PT1 and PT2, each with latent A, D, S, F, and E factors; spousal correlation µ; parent-to-offspring paths m. Note: m estimated and f fixed to 1.]
Assumptions:
- Can estimate only 3 of the 4 effects A, D, S, and F (bias is variable)
- Assortative mating is due to primary phenotypic assortment (bias is variable)

Stealth
- Include twins and their sibs, parents, spouses, and offspring…
- Gives 17 unique covariances (MZ, DZ, Sib, P-O, Spousal, MZ avunc, DZ avunc, MZ cous, DZ cous, GP-GO, and 7 in-laws)
- 88 covariances with sex effects

Additional observed covariances in Stealth allow A, S, D, & F to be estimated simultaneously
[Path diagram: twins PT1 and PT2, each with latent A, D, S, F, and E factors.]
(Remember: we’re not just estimating more effects. More importantly, we’re reducing the bias in estimated effects!)

Stealth
[Path diagram of the Stealth model: parents PFa and PMa, twins PT1 and PT2, and children PCh, each with latent A, D, S, F, and E factors; spousal correlations µ.]

Stealth

Assumption                    Biased up        Biased down
Primary assortative mating    A, D, or F       A, D, or F

- Primary AM: mates choose each other based on phenotypic similarity
- Social homogamy: mates choose each other due to environmental similarity (e.g., religion)
- Convergence: mates become more similar over time

Cascade
[Path diagram of the Cascade model: parents PFa and PMa, twins PT1 and PT2, spouses PSp, and children PCh, each with latent A, D, S, F, and E factors; spousal correlations µ.]

Modeling complexity
[Path diagram: extended twin family model with phenotypes PMa, PFa, PT1, PT2, PSp, and PCh.]
- The good: Tend to be less biased
- The bad: Easy to make scripting or theoretical mistakes. Are we really modeling what we think we are? How to know?

Part II: Simulating complex models

Simulation provides knowledge about processes that are difficult/impossible to figure out analytically
- Independent check of models:
  - Model validation: Check that your models work as they are supposed to and check the statistical properties of estimates
  - Sensitivity analysis: Check the effect on parameter estimates when assumptions are violated (e.g., different modes of assortative mating, genetic action, etc.)
- Method for predicting complex dynamics in population genetics

Simulation program: GeneEvolve

GeneEvolve 0.73
- Implemented in R, open-source, user modifiable
- User specifies 31 basic parameters up front (and 17 advanced ones); no need to alter script after that
- Download: www.matthewckeller.com

How GeneEvolve works:
- User specifies: population size, # generations for population to evolve, threshold effects, mechanisms of assortative mating, vertical transmission, etc.
- 3 types of genetic effects
- 5 types of environmental effects
- 13 types of moderator/covariate effects

How GeneEvolve works: Parameters of interest for present simulations
- A = additive genetic effects
- D = dominance genetic effects
- U = unique environmental effects
- F = familial environmental effects
- S = sibling environmental effects
- AM = correlation between spouses
- am.model = “I”: primary phenotypic; “II”: social homogamy

How GeneEvolve works (cont):
- At adulthood, ~x% find mates such that the correlation between mating phenotypes = AM
- Pairs have children: rate determined by user-specified population growth
- Process iterated n times (Markov chain)

How GeneEvolve works (cont):
- After n iterations, population splits into two:
  - Parents of spouses
  - Parents of twins
- Parents of twins have offspring (MZ/DZ twins & their sibs)
- Twins mate with spousal population & have offspring

What you get:
- 3 generations (grandparents, parents, & offspring) of phenotypic data written out, one row per family, potentially across repeated measures
- These data can be entered into structural models for model validation and sensitivity analysis
- A summary PDF at the end shows:
  - Basic simulation statistics
  - Changes in variance components across time
  - Correlations between 10 relative types

Why go through the trouble to simulate a population’s evolution rather than to simulate in one step (or analytically)?
Many parameters change dynamically (evolutionarily) as functions of other parameters in models that include assortative mating and vertical transmission. Predicting changes in such parameters is impractical, and approaches impossible, in models where many things are going on simultaneously.

GeneEvolve Practical
Getting Started
1. Copy the F:/matt/GE folder into your home directory.
2. Start R. Then File -> Open Script -> home:GE/GE-73.R
Running GeneEvolve
1. Create a reality where A=.3, D=.2, F=.1, S=.1, U=.3 & AM=.2 (all other parameters = 0).
2. After it runs (~1.5 min), open the resulting PDF. What happens to the A variation across 10 generations? D? F? S? Why?
3. Do the same thing but change the mode of mating to social homogamy (am.model <- "II"). What happens?
4. Run another model you find interesting & see what happens.

Why does variance of A increase in presence of AM?

Part III: Model validation and sensitivity analysis of extended twin family models

Using complex models without independent validation (e.g., simulation) is like…

Process of model validation
1. Simulate a dataset that has parameters that your model can estimate.
2. Run your model on the simulated dataset.
3. Obtain and store parameter estimates.
4. Repeat steps 1-3 many (e.g., 1000) times.
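The four steps can be sketched in Python as follows. This is only an illustration of the loop, not the workshop's procedure: the data generator is a simple ACE twin simulator rather than GeneEvolve, the "model fit" is a moment-based ACE solution standing in for an Mx run, and all names, sample sizes, and true values (A=.4, C=.2) are assumptions.

```python
import math
import random
import statistics


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def simulate_pairs(n, a, c, rel_a, rng):
    """Simulate n twin pairs with variance components a, c, and e = 1 - a - c.
    rel_a is the twins' additive genetic correlation (1 for MZ, .5 for DZ)."""
    e = 1.0 - a - c
    t1, t2 = [], []
    for _ in range(n):
        shared_a, shared_c = rng.gauss(0, 1), rng.gauss(0, 1)
        a1 = math.sqrt(rel_a) * shared_a + math.sqrt(1 - rel_a) * rng.gauss(0, 1)
        a2 = math.sqrt(rel_a) * shared_a + math.sqrt(1 - rel_a) * rng.gauss(0, 1)
        t1.append(math.sqrt(a) * a1 + math.sqrt(c) * shared_c
                  + math.sqrt(e) * rng.gauss(0, 1))
        t2.append(math.sqrt(a) * a2 + math.sqrt(c) * shared_c
                  + math.sqrt(e) * rng.gauss(0, 1))
    return t1, t2


def validate(true_a=0.4, true_c=0.2, n_pairs=300, n_reps=200, seed=1):
    """Steps 1-4: simulate, fit, store, repeat. Returns the stored estimates."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_reps):
        cv_mz = covariance(*simulate_pairs(n_pairs, true_a, true_c, 1.0, rng))
        cv_dz = covariance(*simulate_pairs(n_pairs, true_a, true_c, 0.5, rng))
        # Moment-based ACE solution (a stand-in for running the Mx script).
        estimates.append((2 * (cv_mz - cv_dz), 2 * cv_dz - cv_mz))
    return estimates


estimates = validate()
a_hats = [a for a, _ in estimates]
print("mean A estimate:", round(statistics.fmean(a_hats), 3))
print("empirical SE of A:", round(statistics.stdev(a_hats), 3))
```

The stored estimates are what the validation summaries are computed from: their mean is compared with the simulated value, and their spread approximates the standard error.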

Results of model validation
- If the mean parameter estimate = the simulated parameter value, the estimate is unbiased. If your model has no mistakes, parameters should generally be unbiased (there are exceptions).
- The standard deviation of the estimates corresponds to the standard error, and their distribution to the sampling distribution.
- You can also easily study the multivariate sampling distribution and statistics, e.g., how correlated the parameter estimates are.
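As a small sketch of these summaries, the snippet below computes bias, empirical standard error, and the correlation between estimates from the four example (A, D) estimates quoted on the graphical slides. The slides do not state the simulated values for that example, so the "true" values used here are hypothetical, and four replicates are far too few for real conclusions.

```python
import math
import statistics

# The four example (A, D) estimates quoted on the slides.
a_hats = [0.34, 0.31, 0.25, 0.29]
d_hats = [0.17, 0.19, 0.26, 0.21]

# ASSUMPTION: the slides do not state the simulated values; A = .3 and
# D = .2 are hypothetical "true" values used only to illustrate bias.
TRUE_A, TRUE_D = 0.3, 0.2


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


bias_a = statistics.fmean(a_hats) - TRUE_A  # mean estimate minus true value
se_a = statistics.stdev(a_hats)             # empirical standard error
r_ad = pearson(a_hats, d_hats)              # dependence between estimates

print(f"bias(A)={bias_a:+.4f}  SE(A)={se_a:.4f}  corr(A, D)={r_ad:.2f}")
```

In this tiny example the A and D estimates are strongly negatively correlated, which is the pattern the multivariate slides describe.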

Graphical representation: model validation
1. Simulate parameters & get simulated dataset
2. Run Mx, get estimates (e.g., A=.34, D=.17, U=.49)
3. Repeat (e.g., repeat 1: A=.31, D=.19, U=.50; repeat 2: A=.25, D=.26, U=.48; repeat 3: A=.29, D=.21, U=.51), up to repeat 1000
[Figure: estimates of A, D, and U accumulating over repeats, plotted on an axis from .0 to .5.]
- Mean estimate ~ true value: Unbiased!
- Boxplots give an idea of the variance and shape of the sampling distributions

Graphical representation: model validation & multivariate distributions
1. Simulate parameters & get simulated dataset
2. Run Mx, get estimates (e.g., A=.34, D=.17, U=.49)
3. Repeat (e.g., repeat 1: A=.31, D=.19, U=.50), up to repeat 1000
[Figure: scatterplot of D estimates against A estimates across repeats.]
- A & D estimates are negatively correlated, suggesting they use overlapping information to be estimated

Process of sensitivity analysis
1. Simulate a dataset that has one or more parameters that your model cannot estimate.
2. Run your model on the simulated dataset.
3. Obtain and store parameter estimates.
4. Repeat steps 1-3 many (e.g., 1000) times.

Results of sensitivity analysis Because we are simulating violations of assumptions, we expect parameters to be biased. The question becomes: how biased? I.e., how big of a deal are these violations? We should be able to quantify the answers to these questions.

Graphical representation: model sensitivity
1. Simulate parameters that include a violation (here, both D & S exist simultaneously) & get simulated dataset
2. Run Mx, get estimates
3. Repeat, up to repeat 1000
[Figure: estimates of A, D, S, and U across repeats, plotted on an axis from .0 to .5.]

Sensitivity analysis practical
Run GeneEvolve
- Create a reality where A=.4, D=.1, S=.2, U=.3 & AM=0. Datasets MZM, MZF, DZM, DZF, DZOS are made automatically.
Run Mx
- Run the script “GE.Twin_ASE.mx”. This is an ASE script where D is fixed to 0.
- Run the script “NTF.mx”. This is a nuclear twin family script where A, D, and S are simultaneously estimated.
- Once you have estimates of A, D, and S from both scripts, come up and write them into the Excel spreadsheet. They are found in the 7th, 8th, and 9th elements of the P matrix in “GE.Twin_ASE.mx” and ??? in the ??? matrix in “NTF.mx”.
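For a rough idea of what to expect from the twin-only ASE fit, the sketch below computes the large-sample bias for this reality. It is a back-of-the-envelope calculation under stated assumptions, not output from GeneEvolve or Mx: it assumes total variance 1, MZ/DZ correlations of 1/.5 for A, 1/.25 for D, and 1/1 for S, and that the ASE fit reproduces the two twin covariances exactly. The function names are hypothetical.

```python
def expected_twin_covariances(a, d, s):
    """Large-sample MZ and DZ covariances when A, D, and S all exist
    (no assortative mating)."""
    return a + d + s, 0.5 * a + 0.25 * d + s


def fit_ase_by_moments(cv_mz, cv_dz, total_var=1.0):
    """ASE estimates with D fixed to 0, solved from the two covariances."""
    a_hat = 2 * (cv_mz - cv_dz)
    s_hat = 2 * cv_dz - cv_mz
    e_hat = total_var - cv_mz
    return a_hat, s_hat, e_hat


# The practical's simulated reality: A=.4, D=.1, S=.2, U=.3, AM=0.
cv_mz, cv_dz = expected_twin_covariances(a=0.4, d=0.1, s=0.2)
a_hat, s_hat, e_hat = fit_ase_by_moments(cv_mz, cv_dz)
print(f"A_hat={a_hat:.3f} (bias {a_hat - 0.4:+.3f})")
print(f"S_hat={s_hat:.3f} (bias {s_hat - 0.2:+.3f})")
print(f"E_hat={e_hat:.3f} (bias {e_hat - 0.3:+.3f})")
```

Under these assumptions the omitted D is absorbed by A (biased up) at the expense of S (biased down), while the unique environment estimate is unaffected. Estimates from a single simulated dataset will scatter around these values.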

If there’s time… Model validation and sensitivity results for 4 models

Reality: A=.5, D=.2

Reality: A=.5, S=.2

Reality: A=.4, D=.15, S=.15

Reality: A=.35, D=.15, F=.2, S=.15, T=.15, AM=.3

Reality: A=.45, D=.15, F=.25, AM=.3 (Soc Hom)

Reality: A=.4, A*A=.15, S=.15

Reality: A=.4, A*Age=.15, S=.15

A, D, & F estimates are highly correlated in Stealth & Cascade models

Simulation is not a panacea
- Simulation can be said to provide “knowledge without understanding.” It is a helpful tool for understanding, but doesn’t provide understanding in and of itself.
- Simulations themselves rely on assumptions about how processes work. If these are wrong, our simulation results may not reflect reality.

GeneEvolve Limitations
- Sex limitation possible only for A at the moment
- No multivariate except for longitudinal
- Download: www.matthewckeller.com

Conclusions
- All models require assumptions. Generally, more assumptions = more biased estimates
- Extended twin family designs require fewer assumptions and tend to be less biased
- Simulation is a powerful tool for checking complex models (and not just extended twin family models)

Why does variance of A increase in presence of AM?
[Path diagram: parents PFa and PMa and twins PT1 and PT2 with latent A, D, F, and E factors; spousal correlation µ.]
Answer: When two spouses are phenotypically similar, they also tend to have similar A effects. Offspring A is a weighted sum of parental A. Therefore, the variance of A increases for the same reason that the variance of any sum increases when its components are positively correlated. For similar reasons, the variance of the other transmitted parameter, F, also increases.
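The variance-of-a-sum argument can be made concrete with a simple recursion. This sketch is not how GeneEvolve computes anything. It assumes offspring A is the average of the two parental A values plus a within-family segregation term, that the correlation r between mates' A values stays constant across generations, and that segregation contributes a fixed half of the starting variance v0. The function name and values are hypothetical.

```python
def additive_variance_trajectory(v0, mate_corr_a, generations):
    """Iterate V[t+1] = 0.5 * V[t] * (1 + r) + 0.5 * v0, where r is the
    correlation between mates' additive genetic values.

    ASSUMPTIONS (not from the slides): r is held constant across
    generations, and within-family segregation contributes a fixed 0.5 * v0.
    """
    trajectory = [v0]
    for _ in range(generations):
        # Var(0.5 * (Af + Am)) = 0.25 * (V + V + 2 * r * V) = 0.5 * V * (1 + r)
        midparent_part = 0.5 * trajectory[-1] * (1 + mate_corr_a)
        trajectory.append(midparent_part + 0.5 * v0)
    return trajectory


for r in (0.0, 0.2):
    traj = additive_variance_trajectory(v0=1.0, mate_corr_a=r, generations=10)
    print(f"r={r}: " + ", ".join(f"{v:.3f}" for v in traj))
```

With r = 0 the variance stays at its starting value. With r > 0 it rises each generation and levels off, which matches the qualitative pattern the practical asks about.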