PhenoNCSU QTL II: Yandell © 20041 Extending the Phenotype Model 1.limitations of parametric models2-9 –diagnostic tools for QTL analysis –QTL mapping with.

Slides:



Advertisements
Similar presentations
Pattern Recognition and Machine Learning
Advertisements

MCMC estimation in MlwiN
Functional Mapping A statistical model for mapping dynamic genes.
Maximum Likelihood And Expectation Maximization Lecture Notes for CMPUT 466/551 Nilanjan Ray.
Chap 8: Estimation of parameters & Fitting of Probability Distributions Section 6.1: INTRODUCTION Unknown parameter(s) values must be estimated before.
Basics of Linkage Analysis
Linkage Analysis: An Introduction Pak Sham Twin Workshop 2001.
DATA ANALYSIS Module Code: CA660 Lecture Block 6: Alternative estimation methods and their implementation.
QTL Mapping R. M. Sundaram.
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL FastANOVA: an Efficient Algorithm for Genome-Wide Association Study Xiang Zhang Fei Zou Wei Wang University.
1 QTL mapping in mice Lecture 10, Statistics 246 February 24, 2004.
Visual Recognition Tutorial
Multiple Correlated Traits Pleiotropy vs. close linkage Analysis of covariance – Regress one trait on another before QTL search Classic GxE analysis Formal.
Machine Learning CUNY Graduate Center Lecture 3: Linear Regression.
1 QTL mapping in mice, cont. Lecture 11, Statistics 246 February 26, 2004.
DATA ANALYSIS Module Code: CA660 Lecture Block 5.
Sampling Distributions & Point Estimation. Questions What is a sampling distribution? What is the standard error? What is the principle of maximum likelihood?
Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution.
Lecture 5: Segregation Analysis I Date: 9/10/02  Counting number of genotypes, mating types  Segregation analysis: dominant, codominant, estimating segregation.
QTL mapping in animals. It works QTL mapping in animals It works It’s cheap.
Model Inference and Averaging
Fast Max–Margin Matrix Factorization with Data Augmentation Minjie Xu, Jun Zhu & Bo Zhang Tsinghua University.
A statistical model Μ is a set of distributions (or regression functions), e.g., all uni-modal, smooth distributions. Μ is called a parametric model if.
1 MAXIMUM LIKELIHOOD ESTIMATION Recall general discussion on Estimation, definition of Likelihood function for a vector of parameters  and set of values.
01/20151 EPI 5344: Survival Analysis in Epidemiology Maximum Likelihood Estimation: An Introduction March 10, 2015 Dr. N. Birkett, School of Epidemiology,
Bayesian Analysis and Applications of A Cure Rate Model.
Human Chromosomes Male Xy X y Female XX X XX Xy Daughter Son.
Toward a unified approach to fitting loss models Jacques Rioux and Stuart Klugman, for presentation at the IAC, Feb. 9, 2004.
October 2001Jackson Labs Workshop © Brian S. Yandell 1 Efficient and Robust Statistical Methods for Quantitative Trait Loci Analysis Brian S. Yandell University.
Experimental Design and Data Structure Supplement to Lecture 8 Fall
Quantitative Genetics. Continuous phenotypic variation within populations- not discrete characters Phenotypic variation due to both genetic and environmental.
Complex Traits Most neurobehavioral traits are complex Multifactorial
Quantitative Genetics
Lecture 4: Statistics Review II Date: 9/5/02  Hypothesis tests: power  Estimation: likelihood, moment estimation, least square  Statistical properties.
Lecture 3: Statistics Review I Date: 9/3/02  Distributions  Likelihood  Hypothesis tests.
Lecture 24: Quantitative Traits IV Date: 11/14/02  Sources of genetic variation additive dominance epistatic.
Association between genotype and phenotype
Brian S. Yandell University of Wisconsin-Madison
Lecture 2: Statistical learning primer for biologists
BayesNCSU QTL II: Yandell © Bayesian Interval Mapping 1.what is Bayes? Bayes theorem? Bayesian QTL mapping Markov chain sampling18-25.
Practical With Merlin Gonçalo Abecasis. MERLIN Website Reference FAQ Source.
Linkage Disequilibrium Mapping of Complex Binary Diseases Two types of complex traits Quantitative traits–continuous variation Dichotomous traits–discontinuous.
Lecture 1: Basic Statistical Tools. A random variable (RV) = outcome (realization) not a set value, but rather drawn from some probability distribution.
 Assumptions are an essential part of statistics and the process of building and testing models.  There are many different assumptions across the range.
Lecture 23: Quantitative Traits III Date: 11/12/02  Single locus backcross regression  Single locus backcross likelihood  F2 – regression, likelihood,
Why you should know about experimental crosses. To save you from embarrassment.
High resolution QTL mapping in genotypically selected samples from experimental crosses Selective mapping (Fig. 1) is an experimental design strategy for.
Efficient calculation of empirical p- values for genome wide linkage through weighted mixtures Sarah E Medland, Eric J Schmitt, Bradley T Webb, Po-Hsiu.
13 October 2004Statistics: Yandell © Inferring Genetic Architecture of Complex Biological Processes Brian S. Yandell 12, Christina Kendziorski 13,
Joint Modelling of Accelerated Failure Time and Longitudinal Data By By Yi-Kuan Tseng Yi-Kuan Tseng Joint Work With Joint Work With Professor Jane-Ling.
Institute of Statistics and Decision Sciences In Defense of a Dissertation Submitted for the Degree of Doctor of Philosophy 26 July 2005 Regression Model.
ModelNCSU QTL II: Yandell © Model Selection for Multiple QTL 1.reality of multiple QTL3-8 2.selecting a class of QTL models comparing QTL models16-24.
Introduction We consider the data of ~1800 phenotype measurements Each mouse has a given probability distribution of descending from one of 8 possible.
ETHEM ALPAYDIN © The MIT Press, Lecture Slides for.
Bayesian Variable Selection in Semiparametric Regression Modeling with Applications to Genetic Mappping Fei Zou Department of Biostatistics University.
Estimating standard error using bootstrap
Plant Microarray Course
Probability Theory and Parameter Estimation I
2016 METHODS IN COMPUTATIONAL NEUROSCIENCE COURSE
Gene mapping in mice Karl W Broman Department of Biostatistics
Bayesian Models in Machine Learning
Probabilistic Models with Latent Variables
examples in detail days to flower for Brassica napus (plant) (n = 108)
Inferring Genetic Architecture of Complex Biological Processes Brian S
Seattle SISG: Yandell © 2006
Note on Outbred Studies
Parametric Methods Berlin Chen, 2005 References:
Diagnostics and Remedial Measures
Diagnostics and Remedial Measures
Introductory Statistics
Presentation transcript:

PhenoNCSU QTL II: Yandell © Extending the Phenotype Model 1.limitations of parametric models2-9 –diagnostic tools for QTL analysis –QTL mapping with other parametric "families" –quick fixes via data transformations 2.semi-parametric approaches non-parametric approaches25-31 bottom line for normal phenotype model –may work well to pick up loci –may be poor at estimating effects if data not normal

PhenoNCSU QTL II: Yandell © limitations of parametric models measurements not normal –categorical traits: counts (e.g. number of tumors) use methods specific for counts binomial, Poisson, negative binomial –traits measured over time and/or space survival time (e.g. days to flowering) developmental process; signal transduction between cells TP Speed (pers. comm.); Ma, Casella, Wu (2002) false positives due to miss-specified model –how to check model assumptions? want more robust estimates of effects –parametric: only center (mean), spread (SD) –shape of distribution may be important

PhenoNCSU QTL II: Yandell © what if data are far away from ideal?

PhenoNCSU QTL II: Yandell © diagnostic tools for QTL (Hackett 1997) illustrated with BC, adapt regression diagnostics normality & equal variance (fig. 1) –plot fitted values vs. residuals--football shaped? –normal scores plot of residuals--straight line? number of QTL: likelihood profile (fig. 2) –flat shoulders near LOD peak: evidence for 1 vs. 2 QTL genetic effects –effect estimate near QTL should be (1–2r)a –plot effect vs. location

PhenoNCSU QTL II: Yandell © marker density & sample size: 2 QTL modest sample size dense vs. sparse markers large sample size dense vs. sparse markers Wright Kong (1997 Genetics)

PhenoNCSU QTL II: Yandell © robust locus estimate for non-normal phenotype Wright Kong (1997 Genetics) large sample size & dense marker map: no need for normality but what happens for modest sample sizes?

PhenoNCSU QTL II: Yandell © What shape is your histogram? histogram conditional on known QT genotype –pr(Y|qq,  )model shape with genotype qq –pr(Y|Qq,  )model shape with genotype Qq –pr(Y|QQ,  )model shape with genotype QQ is the QTL at a given locus ? –no QTLpr(Y|qq,  ) = pr(Y|Qq,  ) = pr(Y|QQ,  ) –QTL presentmixture if genotype unknown mixture across possible genotypes –sum over Q = qq, Qq, QQ –pr (Y|X,,  ) = sum Q pr (Q|X, ) pr (Y|Q,  )

PhenoNCSU QTL II: Yandell © interval mapping likelihood likelihood: basis for scanning the genome –product over i = 1,…,n individuals L( , |Y)= product i pr(Y i |X i, ) = product i sum Q pr(Q|X i, ) pr(Y i |Q,  ) problem: unknown phenotype model –parametricpr(Y|Q,  ) = f(Y | , G Q,  2 ) –semi-parametricpr(Y|Q,  ) = f(Y) exp(Y  Q ) –non-parametricpr(Y|Q,  ) = F Q (Y)

PhenoNCSU QTL II: Yandell © useful models & transformations binary trait (yes/no, hi/lo, …) –map directly as another marker –categorical: break into binary traits? –mixed binary/continuous: condition on Y > 0? known model for biological mechanism –countsPoisson –fractionsbinomial –clusterednegative binomial transform to stabilize variance –counts  Y = sqrt(Y) –concentrationlog(Y) or log(Y+c) –fractionsarcsin(  Y) transform to symmetry (approx. normal) –fractionlog(Y/(1-Y)) or log((Y+c)/(1+c-Y)) empirical transform based on histogram –watch out: hard to do well even without mixture –probably better to map untransformed, then examine residuals

PhenoNCSU QTL II: Yandell © semi-parametric QTL phenotype model pr(Y|Q,  ) = f(Y)exp(Y  Q ) –unknown parameters  = (f,  ) f (Y) is a (unknown) density if there is no QTL  = (  qq,  Qq,  QQ ) exp(Y  Q ) `tilts’ f based on genotype Q and phenotype Y test for QTL at locus –  Q = 0 for all Q, or pr(Y|Q,  ) = f(Y) includes many standard phenotype models normalpr(Y|Q,  ) = N(G Q,  2 ) Poissonpr(Y|Q,  ) = Poisson(G Q ) exponential, binomial, …, but not negative binomial

PhenoNCSU QTL II: Yandell © QTL for binomial data approximate methods: marker regression –Zeng (1993,1994); Visscher et al. (1996); McIntyre et al. (2001) interval mapping, CIM –Xu Atchley (1996); Yi Xu (2000) –Y ~ binomial(1,  ),  depends on genotype Q –pr(Y|Q) = (  Q ) Y (1 –  Q ) (1–Y) –substitute this phenotype model in EM iteration or just map it as another marker! –but may have complex

PhenoNCSU QTL II: Yandell © EM algorithm for binomial QTL E-step: posterior probability of genotype Q M-step: MLE of binomial probability  Q

PhenoNCSU QTL II: Yandell © threshold or latent variable idea "real", unobserved phenotype Z is continuous observed phenotype Y is ordinal value –no/yes; poor/fair/good/excellent –pr(Y = j) = pr(  j–1 < Z   j ) –pr(Y  j) = pr(Z   j ) use logistic regression idea (Hackett Weller 1995) –substitute new phenotype model in to EM algorithm –or use Bayesian posterior approach –extended to multiple QTL (papers in press)

PhenoNCSU QTL II: Yandell © quantitative & qualitative traits Broman (2003): spike in phenotype –large fraction of phenotype has one value –map binary trait (is/is not that value) –map continuous trait given not that value multiple traits –Williams et al. (1999) multiple binary & normal traits variance component analysis –Corander Sillanpaa (2002) multiple discrete & continuous traits latent (unobserved) variables

PhenoNCSU QTL II: Yandell © other parametric approaches Poisson counts –Mackay Fry (1996) trait = bristle number –Shepel et al (1998) trait = tumor count negative binomial –Lan et al. (2001) number of tumors exponential –Jansen (1992) Mackay Fry (1996 Genetics)

PhenoNCSU QTL II: Yandell © semi-parametric empirical likelihood phenotype model pr(Y|Q,  ) = f(Y) exp(Y  Q ) –“point mass” at each measured phenotype Y i –subject to distribution constraints for each Q: 1 = sum i f(Y i ) exp(Y i  Q ) non-parametric empirical likelihood (Owen 1988) L( , |Y,X)= product i [sum Q pr(Q|X i, ) f(Y i ) exp(Y i  Q )] = product i f(Y i ) [sum Q pr(Q|X i, ) exp(Y i  Q )] = product i f(Y i ) w i –weights w i = w(Y i |X i, , ) rely only on flanking markers 4 possible values for BC, 9 for F2, etc. profile likelihood: L( |Y,X) = max  L( , |Y,X)

PhenoNCSU QTL II: Yandell © semi-parametric formal tests partial empirical LOD –Zou, Fine, Yandell (2002 Biometrika) conditional empirical LOD –Zou, Fine (2003 Biometrika); Jin, Fine, Yandell (2004) has same formal behavior as parametric LOD –single locus test:approximately  2 with 1 d.f. –genome-wide scan:can use same critical values –permutation test:possible with some work can estimate cumulative distributions –nice properties (converge to Gaussian processes)

PhenoNCSU QTL II: Yandell © partial empirical likelihood log(L( , |Y,X)) = sum i log(f(Y i )) +log(w i ) now profile with respect to , log(L( , |Y,X)) = sum i log(f i ) +log(w i ) + sum Q  Q (1-sum i f i exp(Y i  Q )) partial likelihood:set Lagrange multipliers  Q to 0 force f to be a distribution that sums to 1 point mass density estimates

PhenoNCSU QTL II: Yandell © histograms and CDFs histograms capture shape but are not very accurate CDFs are more accurate but not always intuitive

PhenoNCSU QTL II: Yandell © rat study of breast cancer Lan et al. (2001 Genetics) rat backcross –two inbred strains Wistar-Furth susceptible Wistar-Kyoto resistant –backcross to WF –383 females –chromosome 5, 58 markers search for resistance genes Y = # mammary carcinomas where is the QTL? dash = normal solid = semi-parametric

PhenoNCSU QTL II: Yandell © what shape histograms by genotype? WF/WFWKy/WF line = normal, + = semi-parametric, o = confidence interval

PhenoNCSU QTL II: Yandell © conditional empirical LOD partial empirical LOD has problems –tests for F2 depends on unknown weights –difficult to generalize to multiple QTL conditional empirical likelihood unbiased –examine genotypes given phenotypes –does not depend on f(Y) –pr(X i ) depends only on mating design –unbiased for selective genotyping (Jin et al. 2004) pr(X i |Y i, ,,Q) = exp(Y i  Q ) pr(Q|X i | ) pr(X i ) / constant

PhenoNCSU QTL II: Yandell © spike data example Boyartchuk et al. (2001); Broman (2003) 133 markers, 20 chromosomes 116 female mice Listeria monocytogenes infection

PhenoNCSU QTL II: Yandell © new resampling threshold method EM locally approximates LOD by quadratic form use local covariance of  estimates to further approximate –relies on n independent standard normal variates Z = (Z 1,…,Z n ) –one set of variates Z for the entire genome! repeatedly resample independent standard normal variates Z –no need to recompute maximum likelihood on new samples –intermediate EM calculations used directly evaluate threshold as with usual permutation test –extends naturally to multiple QTL results shown in previous figure

PhenoNCSU QTL II: Yandell © non-parametric methods phenotype model pr(Y|Q,  ) = F Q (Y) –  = F = (F qq,F Qq,F QQ ) arbitrary distribution functions interval mapping Wilcoxon rank-sum test –replaced Y by rank(Y) (Kruglyak Lander 1995; Poole Drinkwater 1996; Broman 2003) –claimed no estimator of QTL effects non-parametric shift estimator –semi-parametric shift (Hodges-Lehmann) Zou (2001) thesis, Zou, Yandell, Fine (2002 in review) –non-parametric cumulative distribution Fine, Zou, Yandell (2001 in review) stochastic ordering (Hoff et al. 2002)

PhenoNCSU QTL II: Yandell © rank-sum QTL methods phenotype model pr(Y|Q,  ) = F Q (Y) replace Y by rank(Y) and perform IM –extension of Wilcoxon rank-sum test –fully non-parametric (Kruglyak Lander 1995; Poole Drinkwater 1996) Hodges-Lehmann estimator of shift  –most efficient if pr(Y|Q,  ) = F(Y+Q  ) –find  that matches medians problem:genotypes Q unknown resolution:Haley-Knott (1992) regression scan –works well in practice, but theory is elusive Zou, Yandell Fine (Genetics, in review)

PhenoNCSU QTL II: Yandell © non-parametric QTL CDFs estimate non-parametric phenotype model –cumulative distributions F Q (y) = pr(Y  y |Q) –can use to check parametric model validity basic idea: pr(Y  y |X, ) = sum Q pr(Q|X, ) F Q (y) –depends on X only through flanking markers –few possible flanking marker genotypes 4 for BC, 9 for F2, etc.

PhenoNCSU QTL II: Yandell © finding non-parametric QTL CDFs cumulative distribution F Q (y) = pr(Y  y |Q) F = {F Q, all possible QT genotypes Q} –BC with 1 QTL: F = {F QQ, F Qq } find F to minimize over all phenotypes y sum i [I(Y i  y) – sum Q pr(Q|X, ) F Q (y)] 2 looks complicated, but simple to implement

PhenoNCSU QTL II: Yandell © non-parametric CDF properties readily extended to censored data –time to flowering for non-vernalized plants –Fine, Zou, Yandell (2004 Biometrics J) nice large sample properties –estimates of F(y) = {F Q (y)} jointly normal –point-wise, experiment-wise confidence bands more robust to heavy tails and outliers can use to assess parametric assumptions

PhenoNCSU QTL II: Yandell © what QTL influence flowering time? no vernalization: censored survival Brassica napus –Major female needs vernalization –Stellar male insensitive –99 double haploids Y = log(days to flower) –over 50% Major at QTL never flowered –log not fully effective grey = normal, red = non-parametric

PhenoNCSU QTL II: Yandell © what shape is flowering distribution? B. napus StellarB. napus Major line = normal, + = non-parametric, o = confidence interval