Boulder Colorado Workshop March

Slides:



Advertisements
Similar presentations
Chapter 3 Properties of Random Variables
Advertisements

Estimation  Samples are collected to estimate characteristics of the population of particular interest. Parameter – numerical characteristic of the population.
Summarizing Variation Matrix Algebra & Mx Michael C Neale PhD Virginia Institute for Psychiatric and Behavioral Genetics Virginia Commonwealth University.
Probability Probability; Sampling Distribution of Mean, Standard Error of the Mean; Representativeness of the Sample Mean.
QUANTITATIVE DATA ANALYSIS
Summarizing Variation Michael C Neale PhD Virginia Institute for Psychiatric and Behavioral Genetics Virginia Commonwealth University.
Data Basics. Data Matrix Many datasets can be represented as a data matrix. Rows corresponding to entities Columns represents attributes. N: size of the.
#NGroups 1 Title Calculation Begin Matrices; End Matrices; Begin Algebra; End Algebra; End Group;
Raw data analysis S. Purcell & M. C. Neale Twin Workshop, IBG Colorado, March 2002.
LECTURE 16 STRUCTURAL EQUATION MODELING.
Structural Equation Modeling Intro to SEM Psy 524 Ainsworth.
Regression and Correlation Methods Judy Zhong Ph.D.
Introduction to Linear Regression and Correlation Analysis
Summarizing Variation Matrix Algebra Benjamin Neale Analytic and Translational Genetics Unit, Massachusetts General Hospital Program in Medical and Population.
Multiple Regression The Basics. Multiple Regression (MR) Predicting one DV from a set of predictors, the DV should be interval/ratio or at least assumed.
SUPA Advanced Data Analysis Course, Jan 6th – 7th 2009 Advanced Data Analysis for the Physical Sciences Dr Martin Hendry Dept of Physics and Astronomy.
Power and Sample Size Boulder 2004 Benjamin Neale Shaun Purcell.
The importance of the “Means Model” in Mx for modeling regression and association Dorret Boomsma, Nick Martin Boulder 2008.
Regression-Based Linkage Analysis of General Pedigrees Pak Sham, Shaun Purcell, Stacey Cherny, Gonçalo Abecasis.
Chapter 7 Point Estimation of Parameters. Learning Objectives Explain the general concepts of estimating Explain important properties of point estimators.
1 Matrix Algebra Variation and Likelihood Michael C Neale Virginia Institute for Psychiatric and Behavioral Genetics VCU International Workshop on Methodology.
Lecture 1: Basic Statistical Tools. A random variable (RV) = outcome (realization) not a set value, but rather drawn from some probability distribution.
Mx modeling of methylation data: twin correlations [means, SD, correlation] ACE / ADE latent factor model regression [sex and age] genetic association.
Powerful Regression-based Quantitative Trait Linkage Analysis of General Pedigrees Pak Sham, Shaun Purcell, Stacey Cherny, Gonçalo Abecasis.
Mx Practical TC20, 2007 Hermine H. Maes Nick Martin, Dorret Boomsma.
David M. Evans Multivariate QTL Linkage Analysis Queensland Institute of Medical Research Brisbane Australia Twin Workshop Boulder 2003.
Categorical Data Frühling Rijsdijk 1 & Caroline van Baal 2 1 IoP, London 2 Vrije Universiteit, A’dam Twin Workshop, Boulder Tuesday March 2, 2004.
The Measurement and Analysis of Complex Traits Everything you didn’t want to know about measuring behavioral and psychological constructs Leuven Workshop.
Linkage in Mx & Merlin Meike Bartels Kate Morley Hermine Maes Based on Posthuma et al., Boulder & Egmond.
R. Kass/W03 P416 Lecture 5 l Suppose we are trying to measure the true value of some quantity (x T ). u We make repeated measurements of this quantity.
Learning Theory Reza Shadmehr Distribution of the ML estimates of model parameters Signal dependent noise models.
MathematicalMarketing Slide 5.1 OLS Chapter 5: Ordinary Least Square Regression We will be discussing  The Linear Regression Model  Estimation of the.
Computacion Inteligente Least-Square Methods for System Identification.
QTL Mapping Using Mx Michael C Neale Virginia Institute for Psychiatric and Behavioral Genetics Virginia Commonwealth University.
Biostatistics Class 3 Probability Distributions 2/15/2000.
ETHEM ALPAYDIN © The MIT Press, Lecture Slides for.
Estimating standard error using bootstrap
Applied statistics Usman Roshan.
Regression Models for Linkage: Merlin Regress
The Maximum Likelihood Method
Physics 114: Lecture 13 Probability Tests & Linear Fitting
HGEN Thanks to Fruhling Rijsdijk
Probability Theory and Parameter Estimation I
Parameter Estimation and Fitting to Data
Model Inference and Averaging
CH 5: Multivariate Methods
MRC SGDP Centre, Institute of Psychiatry, Psychology & Neuroscience
APPROACHES TO QUANTITATIVE DATA ANALYSIS
The Maximum Likelihood Method
Path Analysis Danielle Dick Boulder 2008
Introduction to Linkage and Association for Quantitative Traits
The Maximum Likelihood Method
Statistical Methods For Engineers
Linkage in Selected Samples
The normal distribution
David M. Evans Sarah E. Medland
Integration of sensory modalities
Warmup Consider tossing a fair coin 3 times.
CHAPTER 6 Random Variables
Parametric Methods Berlin Chen, 2005 References:
Multivariate Linkage Continued
Univariate Linkage in Mx
Mathematical Foundations of BME Reza Shadmehr
Power Calculation for QTL Association
BOULDER WORKSHOP STATISTICS REVIEWED: LIKELIHOOD MODELS
Chapter 6: Random Variables
Chapter 6: Random Variables
Probabilistic Surrogate Models
Structural Equation Modeling
CHAPTER – 1.2 UNCERTAINTIES IN MEASUREMENTS.
Presentation transcript:

Boulder Colorado Workshop March 5 2007 Basic Statistics for Linkage and Association Studies of Quantitative Traits Boulder Colorado Workshop March 5 2007

Overview A brief history of SEM Regression Maximum likelihood estimation Models Twin data Sib pair linkage analysis Association analysis Mixture distributions Some extensions

Origins of SEM Regression analysis ‘Reversion’ Galton 1877: Biological phenomenon Yule 1897 Pearson 1903: General Statistical Context Initially Gaussian X and Y; Fisher 1922 Y|X Path Analysis Sewall Wright 1918; 1921 Path Diagrams of regression and covariance relationships

Structural Equation Model basics Two kinds of relationships Linear regression X -> Y single-headed Unspecified covariance X<->Y double-headed Four kinds of variable Squares – observed variables Circles – latent, not observed variables Triangles – constant (zero variance) for specifying means Diamonds -- observed variables used as moderators (on paths)

Linear Regression Model Var(X) Res(Y) b X Y Models covariances only Of historical interest

Linear Regression Model Var(X) Res(Y) b X Y Mu(x) Mu(y*) 1 Models Means and Covariances

Linear Regression Model Res(Y) X i b D 1 Yi Mu(y*) 1 Models Mean and Covariance of Y only Must have raw (individual level) data X is a definition variable Mean of Y different for every observation

Single Factor Model 1.00 F S1 S2 S3 Sm l1 l2 l3 lm e1 e2 e3 e4

Factor Model with Means MF 1.00 1.00 mF1 mF2 F1 F2 lm l1 l2 l3 S1 S2 S3 Sm mSm e1 e2 e3 e4 mS2 mS3 mS1 B8

Factor model essentials The factor itself is typically assumed to be normally distributed: SEM May have more than one latent factor The error variance is typically assumed to be normal as well May be applied to binary or ordinal data Threshold model

Multifactorial Threshold Model Normal distribution of liability. ‘Affected’ when liability x > t t 0.5  0.4 0.3 0.2 0.1 -4 -3 -2 -1 1 2 3 4 x

Measuring Variation Distribution Population Sample Observed measures Probability density function ‘pdf’ Smoothed out histogram f(x) >= 0 for all x

One Coin toss 2 outcomes Probability 0.6 0.5 0.4 0.3 0.2 0.1 Heads Heads Tails Outcome

Two Coin toss 3 outcomes Probability 0.6 0.5 0.4 0.3 0.2 0.1 HH HT/TH HH HT/TH TT Outcome

Four Coin toss 5 outcomes Probability 0.4 0.3 0.2 0.1 HHHH HHHT HHTT HHHH HHHT HHTT HTTT TTTT Outcome

Ten Coin toss 9 outcomes Probability 0.3 0.25 0.2 0.15 0.1 0.05 Outcome

Fort Knox Toss Infinite outcomes 0.5 0.4 0.3 0.2 0.1 -4 -3 -2 -1 1 2 3 -4 -3 -2 -1 1 2 3 4 Heads-Tails De Moivre 1733 Gauss 1827

Variance Measure of Spread Easily calculated Individual differences

Average squared deviation Normal distribution  xi di -3 -2 -1 1 2 3 Variance =  di2/N

Measuring Variation Absolute differences? Squared differences? Weighs & Means Absolute differences? Squared differences? Absolute cubed? Squared squared?

Measuring Variation Squared differences Ways & Means Fisher (1922) Squared has minimum variance under normal distribution Concept of “Efficiency” emerges

Deviations in two dimensions x + + + + + + + + + + + + + + + y + + + + + + + + + + + + + + + + + +

Deviations in two dimensions x dx + dy y

Covariance Measure of association between two variables Closely related to variance Useful to partition variance Analysis of variance coined by Fisher

Summary x = (xi)/N x =  (xi - x ) / (N) Formulae for sample statistics; i=1…N observations x = (xi)/N x =  (xi - x ) / (N) 2 2 xy =  (xi-x )(yi-y ) / (N) r =  xy xy 2 2   x x

Variance covariance matrix Univariate Twin/Sib Data Var(Twin1) Cov(Twin1,Twin2) Cov(Twin2,Twin1) Var(Twin2) Only suitable for complete data Good conceptual perspective

Summary Means and covariances Basic input statistics for “Traditional SEM” Notion of probability density function

Maximum Likelihood Estimates: Nice Properties 1. Asymptotically unbiased Large sample estimate of p -> population value 2. Minimum variance “Efficient” Smallest variance of all estimates with property 1 3. Functionally invariant If g(a) is one-to-one function of parameter a and MLE (a) = a* then MLE g(a) = g(a*) See http://wikipedia.org

Likelihood computation Calculate height of curve -1

Height of normal curve: x = 0 Probability density function x (xi) -3 -2 -1 xi 1 2 3 (xi) is the likelihood of data point xi for particular mean & variance estimates

Height of normal curve at xi: x = .5 Function of mean x (xi) -3 -2 -1 xi 1 2 3 Likelihood of data point xi increases as x approaches xi

Height of normal curve at x1 Function of variance x (xi var = 1) (xi var = 2) (xi var = 3) -3 -2 -1 x1 1 2 3 Likelihood of data point xi changes as variance of distribution changes

Height of normal curve at x1 and x2 (x1 var = 1) (x1 var = 2) (x2 var = 2) (x2 var = 1) -3 -2 -1 x1 x2 1 2 3 x1 has higher likelihood with var=1 whereas x2 has higher likelihood with var=2

Likelihood of xi as a function of  Likelihood function L(xi) MLE ^ -3 -2 -1 1 xi 2 x 3 L(xi) is the likelihood of data point xi for particular mean & variance estimates

Likelihood as a measure of “outlierness” Unlikely observation may be an outlier Genuine Data entry error Model-specific Can use Mx feature to obtain case-wise likelihoods Raw data Option mx%p= uni_pi.out Output for each case: the contribution to the -2ll as well as z-score statistic and Mahalanobis distance, weight and weighted likelihood Generates R syntax to read in file, and sort by z-score Beeby Medland & Martin (2006) ViewPoint and ViewDist: utilities for rapid graphing of linkage distributions and identification of outliers. Behav Genet. 2006 Jan;36(1):7-11

Height of bivariate normal density function An unlikely pair of (x,y) values -3 -2 -1 1 2 3 0.3 0.4 0.5 y y1 x x1

Height of bivariate normal density function A more likely pair of (x,y) values -3 -2 -1 1 2 3 0.3 0.4 0.5 y2 x2

Likelihood of Independent Observations Chance of getting two heads L(x1…xn) = Product (L(x1), L(x2) , … L(xn)) L(xi) typically < 1 Avoid vanishing L(x1…xn) Computationally convenient log-likelihood ln (a * b) = ln(a) + ln(b) Minimization more manageable than maximization Minimize -2 ln(L)

Likelihood Ratio Tests Comparison of likelihoods Consider ratio L(data,model 1) / L(data, model 2) Ln(a/b) = ln(a) - ln(b) Log-likelihood lnL(data, model 1) - ln L(data, model 2) Useful asymptotic feature when model 2 is a submodel of model 1 -2 (lnL(data, model 1) - lnL(data, model 2)) ~  df = # parameters of model 1 - # parameters of model 2 BEWARE of gotchas! Estimates of a2 q2 etc. have implicit bound of zero Distributed as 50:50 mixture of 0 and  -3 -2 -1 1 2 3 l

Exercises: Compute Normal PDF Get used to Mx script language Use matrix algebra Taste of likelihood theory

Mx script part 1: Declare groups and matrices #NGroups 1 Title figure out likelihood by hand Calculation Begin Matrices; E symm 2 2 ! Expected Covariance Matrix H full 1 1 ! One half T full 1 1 ! Two M full 2 1 ! Mean vector P full 1 1 ! Pi X full 2 1 ! Observed Data End Matrices;

Mx script part 2: Put values in matrices Matrix E 1 .0 1 Matrix H .5 Matrix M 0 0 Matrix P 3.141592 Matrix T 2 Matrix X 1 2

Mx script part 3: Matrix Algebra Begin Algebra; O=T*P*\sqrt\det(E); ! Fractional part, 2pi*sqrt(det(e)) Q=(X-M)'&(E~); ! Mahalanobis Distance R=\exp(-H*Q); ! e to the power -.5*Mahalanobis distance S=-T*\ln(R%O);! minus twice log-likelihood Z=-T*\ln(\pdfnor(X'_M'_E)); ! A simpler way End Algebra; End Group;

Exercises 1 Bivariate normal distribution Means [110.28 112.00] Covariance matrix [299.40 174.20 281.18] Compute likelihood of observed vector x = [87 89]

Exercises 2 Bivariate normal distribution Means [1 1] Covariance matrix [1 .3 .3 1 ] Compute likelihood of observed vector x = [1 2] Compute likelihood with correlation of .0 instead Optional compute likelihood of observed vector x = [-2 -2] with correlations .5, .0, and 0 Which is the most likely combination of model and data?

Exercises 3 Univariate normal distribution Mean [1] Variance [1 ] Compute likelihood of observed vector x = [1] Compute likelihood of observed vector x = [2] Compute their product Which bivariate case does this equal?

Two Group Model: ACE MZ twins DZ twins 7 parameters

DZ by IBD status Variance = Q + F + E Covariance = πQ + F + E

Extensions to More Complex Applications Endophenotypes Linkage Analysis Association Analysis

Basic Linkage (QTL) Model   = p(IBD=2) + .5 p(IBD=1) 1 1 1 1 Pihat 1 1 1 E1 F1 Q1 Q2 F2 E2 e f q q f e P1 P2 Q: QTL Additive Genetic F: Family Environment E: Random Environment 3 estimated parameters: q, f and e Every sibship may have different model P P

Measurement Linkage (QTL) Model   = p(IBD=2) + .5 p(IBD=1) 1 F1 1 Q1 E1 e f q Pihat 1 1 1 Q2 F2 E2 q f e P1 P2 l6 l1 l1 l6 M6 M5 M4 M3 M2 M1 M1 M2 M3 M4 M5 M6 e f q E1 F1 Q1 1 1 1 Q: QTL Additive Genetic F: Family Environment E: Random Environment 3 estimated parameters: q, f and e Every sibship may have different model P P

Fulker Association Model Geno1 Geno2 G1 G2 0.50 0.50 -0.50 0.50 S D m m b w Multilevel model for the means B W 1.00 1.00 -1.00 1.00 LDL1 LDL2 R R C 0.75

Measurement Fulker Association Model (SM) G1 G2 S D B W w -0.50 -1.00 1.00 0.50 b Geno1 Geno2 m M1 M2 M3 M4 M5 M6 l6 l1 M1 M2 M3 M4 M5 M6 l6 l1 F1 F2 M G1 G2 S D B W w -0.50 -1.00 1.00 0.50 b Geno1 Geno2 m

Multivariate Linkage & Association Analyses Computationally burdensome Distribution of test statistics questionable Permutation testing possible Even heavier burden Sarah Medland’s rapid approach Potential to refine both assessment and genetic models Lots of long & wide datasets on the way Dense repeated measures: EMA; fMRI(!) Need to improve software! Open source Mx