Simulating and Modeling Genetically Informative Data

Slides:



Advertisements
Similar presentations
Chap 8-1 Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chapter 8 Estimation: Single Population Statistics for Business and Economics.
Advertisements

Extended Twin Kinship Designs. Limitations of Twin Studies Limited to “ACE” model No test of assumptions of twin design C biased by assortative mating,
1 Statistical Inference H Plan: –Discuss statistical methods in simulations –Define concepts and terminology –Traditional approaches: u Hypothesis testing.
Extended sibships Danielle Posthuma Kate Morley Files: \\danielle\ExtSibs.
1 EE 616 Computer Aided Analysis of Electronic Networks Lecture 9 Instructor: Dr. J. A. Starzyk, Professor School of EECS Ohio University Athens, OH,
“Beyond Twins” Lindon Eaves, VIPBG, Richmond Boulder CO, March 2012.
The Campbell Collaborationwww.campbellcollaboration.org Introduction to Robust Standard Errors Emily E. Tanner-Smith Associate Editor, Methods Coordinating.
A very brief introduction to using R & MX - Matthew Keller Some material cribbed from: UCLA Academic Technology Services Technical Report Series (by Patrick.
Karri Silventoinen University of Helsinki Osaka University.
Univariate modeling Sarah Medland. Starting at the beginning… Data preparation – The algebra style used in Mx expects 1 line per case/family – (Almost)
PBG 650 Advanced Plant Breeding
Chapter 7 Probability and Samples: The Distribution of Sample Means
The importance of the “Means Model” in Mx for modeling regression and association Dorret Boomsma, Nick Martin Boulder 2008.
Confidence Interval & Unbiased Estimator Review and Foreword.
Mx modeling of methylation data: twin correlations [means, SD, correlation] ACE / ADE latent factor model regression [sex and age] genetic association.
September 28, 2000 Improved Simultaneous Data Reconciliation, Bias Detection and Identification Using Mixed Integer Optimization Methods Presented by:
4 basic analytical tasks in statistics: 1)Comparing scores across groups  look for differences in means 2)Cross-tabulating categoric variables  look.
Model building & assumptions Matt Keller, Sarah Medland, Hermine Maes TC21 March 2008.
- 1 - Preliminaries Multivariate normal model (section 3.6, Gelman) –For a multi-parameter vector y, multivariate normal distribution is where  is covariance.
QTL Mapping Using Mx Michael C Neale Virginia Institute for Psychiatric and Behavioral Genetics Virginia Commonwealth University.
Tom.h.wilson Department of Geology and Geography West Virginia University Morgantown, WV.
Statistics for Business and Economics 7 th Edition Chapter 7 Estimation: Single Population Copyright © 2010 Pearson Education, Inc. Publishing as Prentice.
Methods of Presenting and Interpreting Information Class 9.
Partitioning of variance – Genetic models
University of Colorado at Boulder
Chapter 7 Probability and Samples
Extended Pedigrees HGEN619 class 2007.
Chapter 7 Confidence Interval Estimation
Nature of Estimation.
Multiple Regression Prof. Andy Field.
PBG 650 Advanced Plant Breeding
Univariate Twin Analysis
Gene-environment interaction
Model assumptions & extending the twin model
Introduction to Multivariate Genetic Analysis
Re-introduction to openMx
Quantitative genetics
Statistical Tools in Quantitative Genetics
MRC SGDP Centre, Institute of Psychiatry, Psychology & Neuroscience
Structural Equation Modeling
12 Inferential Analysis.
CPSC 531: System Modeling and Simulation
STOCHASTIC REGRESSORS AND THE METHOD OF INSTRUMENTAL VARIABLES
How to handle missing data values
Statistical Methods Carey Williamson Department of Computer Science
Univariate modeling Sarah Medland.
CHAPTER 29: Multiple Regression*
Linkage in Selected Samples
GxE and GxG interactions
Multiple Regression Models
Confidence Interval Estimation
Pak Sham & Shaun Purcell Twin Workshop, March 2002
Paired Samples and Blocks
(Re)introduction to Mx Sarah Medland
What are BLUP? and why they are useful?
Sarah Medland faculty/sarah/2018/Tuesday
CHAPTER 10 Comparing Two Populations or Groups
12 Inferential Analysis.
Simple Linear Regression
Paired Samples and Blocks
Tutorial 1: Misspecification
Lucía Colodro Conde Sarah Medland & Matt Keller
Carey Williamson Department of Computer Science University of Calgary
Simulating and Modeling Extended Twin Family Data
Inferential Statistics
Tutorial 6 SEG rd Oct..
BOULDER WORKSHOP STATISTICS REVIEWED: LIKELIHOOD MODELS
Rachael Bedford Mplus: Longitudinal Analysis Workshop 23/06/2015
Presentation transcript:

Simulating and Modeling Genetically Informative Data Matthew C. Keller Sarah E. Medland

Outline The usefulness of simulation in behavioral genetics Using GeneEvolve to simulate genetically informative data Practical simulating different designs Classical Twin Design (CTD) Nuclear Twin Family Design (NTFD)

Simulation provides knowledge about processes that are difficult/impossible to figure out analytically Independent check of models. Especially important for complex (e.g., extended twin family) models. Model verification: Check that your models work as they are supposed to. Sensitivity analysis: Check the effect on parameter estimates when assumptions are violated (e.g., different modes of assortative mating, vertical transmission, or genetic action). Method for predicting complex dynamics in population genetics

Using complex models without independent verification (e. g Using complex models without independent verification (e.g., simulation) is like…

Process of model verification Simulate a dataset that has parameters that your model can estimate. Run your model on the simulated dataset Obtain and store parameter estimates Repeat steps 1-3 many (e.g., 1000) times

Results of model verification If the mean parameter estimate = the simulated parameter estimate, the estimate is unbiased. If your model has no mistakes, parameters should generally be unbiased (there are exceptions) The standard deviation of an estimates corresponds to its standard error and its distribution to its sampling distribution You can also easily study the multivariate sampling distribution and statistics. E.g., how correlated parameters are.

Process of sensitivity analysis Simulate a dataset that has one or more parameters that your model cannot estimate. Run your model on the simulated dataset Obtain and store parameter estimates Repeat steps 1-3 many (e.g., 1000) times

Results of sensitivity analysis Because we are simulating violations of assumptions, we expect parameters to be biased. The question becomes: how biased? I.e., how big of a deal are these violations? We should be able to quantify the answers to these questions.

Reality: A=.4, D=.15, S=.15

A,D, & F estimates are highly correlated in Stealth & Cascade models

Simulation is not a panacea Simulation can be said to provide “knowledge without understanding.” It is a helpful tool for understanding, but doesn’t provide understanding in and of itself. Simulations themselves rely on assumptions about how processes work. If these are wrong, our simulation results may not reflect reality.

Simulation program: GeneEvolve

GeneEvolve 0.73 Implemented in R, open-source, user modifiable User specifies 31 basic parameters up front (and 17 advanced ones); no need to alter script after that. Fast (on AMD Opteron 3.2GHz dual 64 bit processor, 2GB RAM; OS= RHEL AS4) 10 genes, N=20,000 takes ~ 20 seconds/gen Download: www.matthewckeller.com

How GeneEvolve works: User specifies: population size, # generations for population to evolve, threshold effects, mechanisms of assortative mating, vertical transmission, etc. 3 types of genetic effects 5 types of environmental effects 13 types of moderator/covariate effects Download: www.matthewckeller.com

Diagram of GeneEvolve Model w w q q A F F A x x S S f E E f a a s D e e s D d d aa PFa µ PMa aa AxA AxA m m m m zs F A A F zd S S E E f a a f e s zaa s D D e d d PT1 aa aa PT2 AxA AxA

Diagram: GeneEvolve Age-by-A Interactions Purcell Model: Our Model: r 1 1 1 A AL AS βA+ βint(age) βAL βAS βage β0 β0 1 P 1 P 1 age open Purcell-vs-Ours.pdf; Purcell-vs-OursCorrelation.pdf

How GeneEvolve works (cont): At adulthood, ~ x% find mates s.t. phenotypic correlation b/w mating phenotypes = AM: Pairs have children : Rate determined by user-specified population growth Process iterated n times Download: www.matthewckeller.com

How GeneEvolve works (cont): After n iterations, population split into two: Parents of spouses Parents of twins Parents of twins have offspring (MZ/DZ twins & their sibs) Twins mate with spousal population & have offspring Download: www.matthewckeller.com

What you get: 3 generations of phenotypic data written out (one row per family), potentially across repeated measures This data (& subsets of it) can be entered into structural models for model verification and sensitivity analysis A summary PDF at end shows: Basic simulation statistics Changes in variance components across time Correlations between 10 relative types Download: www.matthewckeller.com

Structural Equation Modeling (SEM) in BG SEM is great because… Directs focus to effect sizes, not “significance” Forces consideration of causes and consequences Explicit disclosure of assumptions Potential weakness… Parameter reification: “Using the CTD we found that 50% of variation is due to A and 20% to C.”

Structural Equation Modeling (SEM) in BG SEM is great because… Directs focus to effect sizes, not “significance” Forces consideration of causes and consequences Explicit disclosure of assumptions Potential weakness… Parameter reification: “Using the CTD we found that 50% of variation is due to A and 20% to C.” NO! Only true under strong assumptions that probably aren’t met (e.g., D=0) and usually go untested. To the degree assumptions wrong, estimates are biased.

Classical Twin Design (CTD) 1 A A 1/.25 C C E E a a e c c D D e d d PT1 PT2

Classical Twin Design (CTD) Assumption biased up biased down Either D or C is zero A C & D No assortative mating C D No A-C covariance C D & A 1 A A 1/.25 C C E E a a e c c D D e d d PT1 PT2

Adding parents gets us around all these assumptions Assumption biased up biased down Either D or C is zero No assortative mating No A-C covariance We don’t have to make these PMa C a D d E e c A q w PFa m PT1 PT2 1/.25 µ x x

Parents also allow differentiation of S & F With parents, we can break “C” up into: S = env. factors shared only between sibs F = familial env factors passed from parents to offspring S C F PT1 S a D d E e s A f F PT2 1/.25 1 PT1 C a D d E e c A PT2 1/.25 1

Nuclear Twin Family Design (NTFD) PMa S a D d E e s A q x w f F PFa m PT1 PT2 zd zs µ Note: m estimated and f fixed to 1 Assumptions: Only can estimate 3 of 4: A, D, S, and F (bias is variable) Assortative mating due to primary phenotypic assortment (bias is variable)

Stealth Include twins and their sibs, parents, spouses, and offspring… Gives 17 unique covariances (MZ, DZ, Sib, P-O, Spousal, MZ avunc, DZ avunc, MZ cous, DZ cous, GP-GO, and 7 in-laws) 88 covariances with sex effects

Additional obs. covs with Stealth allow estimation of A, S, D, F, T can be estimated simultaneously = env. factors shared only between twins T 1 F A A F 1/.25 S S E E f a a f e s s D e 1/0 D d d d PT1 PT2 t T T t (Remember: we’re not just estimating more effects. More importantly, we’re reducing the bias in estimated effects!)

Stealth S D T E A F PMa a d t e s q x w f PFa m PT1 PT2 PCh 1/0 1/.25 µ

Stealth Assumption biased up biased down Primary assortative mating A, D, or F A, D, or F No epistasis A, D S No AxAge D, S A

Stealth Assumption biased up biased down Primary assortative mating A, D, or F A, D, or F No epistasis A, D S No AxAge D, S A Primary AM: mates choose each other based on phenotypic similarity Social homogamy: mates choose each other due to environmental similarity (e.g., religion) Convergence: mates become more similar to each other (e.g., becoming more conservative when dating a conservative)

Cascade A F F A S S E E D D T T A F F A A F F A S S S S E E E E D D D ~ ~ ~ ~ ~ PFa µ PMa ~ ~ ~ t d s s d t w a ~ ~ q ~ ~ a w f f q ~ e ~ e A F F A x x S S f E E f a a s e e s D D d d PFa PMa T t t T m m m m ~ ~ ~ ~ ~ ~ ~ PSp µ PT1 PT2 µ PSp ~ ~ ~ t d s s d t w ~ ~ a a ~ a ~ a w q ~ ~ f 1 ~ ~ f f ~ ~ f q ~ ~ e e ~ s s e ~ e A F F A A F F A x ~ 1/.25 ~ x S d d S S S E E E E a f f a ~ ~ a f f a s e s t t s e e s D e D 1/0 D D d d d d PFa PT1 PT2 PMa T t t T T t t T m m m m A F F A S E S E a f f a s D e e s D d d PCh PCh T t t T

Reality: A=.5, D=.2

Reality: A=.5, S=.2

Reality: A=.4, D=.15, S=.15

Reality: A=.35, D=.15, F=.2, S=.15, T=.15, AM=.3

Reality: A=.45, D=.15, F=.25, AM=.3 (Soc Hom)

Reality: A=.4, A*A=.15, S=.15

Reality: A=.4, A*Age=.15, S=.15

Conclusions All models require assumptions. Generally, more assumptions = more biased estimates For the first time, we have demonstrated independent assessments of the NTFD, Stealth, and Cascade models These complicated models work as designed! In all models, but especially the CTD, please don’t REIFY A, C, & D!

Acknowledgments Those who conceived of these models originally: Jinks, Fulker, Eaves, Cloninger, Reich, Rice, Heath, Neale, Maes, etc. And to Nick Martin: for his energy and enthusiasm, and for encouraging us to do this to begin with

Why use it? Modeling aid Check bias & identification: Feed PE parameters you are modeling, simulate data, & see if your model recovers the parameters Check model’s sensitivity to assumptions: Simulate violations of assumptions & note its effects on estimates Estimate power & multivariate sampling dist’s of estimates under very general conditions: Run PE multiple times given whatever condition you want Download: www.matthewckeller.com

Why use it? Predictor of population / evolutionary genetics dynamics Find changes in variance parameters & relative covariances under different modes of AM, VT, & genetic effects: Simulate random genetic drift by varying population size Introduce selection (coming) to test theories on maintenance of genetic variation Download: www.matthewckeller.com