Department of Biostatistics

1 Department of Biostatistics
Bayesian Variable Selection in Semiparametric Regression Modeling with Applications to Genetic Mapping
Fei Zou
Department of Biostatistics

2 Outline
Introduction
  Experimental crosses
  Existing QTL mapping methods
Bayesian semi-parametric QTL mapping
Results
Remarks and conclusions

4 Overview
One gene, one trait: very unlikely.
The vast majority of biological traits are caused by complex polygenes, potentially interacting with each other.
Most traits have significant environmental exposure components, potentially interacting with the polygenes.

5 Experimental Crosses: F2
[Figure: F2 cross diagram, starting from the parental lines]

6 Experimental Crosses: F2 and Backcross (BC)
Parents: P1 (AA) and P2 (BB); the F1 offspring are all AB.
F2: F1 x F1, giving genotypes AA, AB and BB.
BC: F1 x P1, giving genotypes AA and AB.

7 QTL Data Format
Genotype coding: 0: homozygous AA, 1: heterozygous AB, 2: homozygous BB.
[Table: example genotype data and marker positions]

8 Linkage Analysis
Data structure:
  Marker data (genotypes plus positions)
  Phenotypic trait(s)
  Other nongenetic covariates, such as age, gender and environmental conditions
Quantitative trait locus (QTL): a particular region of the genome containing one or more genes that are associated with the trait being assayed or measured.

9 QTL Mapping of Experimental Crosses
Single QTL mapping
  Single marker analysis (Sax, 1923 Genetics)
  Interval mapping (Lander & Botstein, 1989 Genetics)
Multiple QTL mapping
  Composite interval mapping (Zeng 1993 PNAS, 1994 Genetics; Jansen & Stam, 1994 Genetics)
  Multiple interval mapping (Kao et al., 1999 Genetics)
  Bayesian analysis (Satagopan et al., 1997 Genetics)

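10 Single QTL Interval Mapping
For a backcross, the model assumes a normally distributed trait whose mean depends on the genotype (AA or AB) at a single QTL.
QTL analysis: if the QTL genotypes were observed, the analysis would be trivial: a simple t-test!
However, the QTL position is unknown, and therefore the QTL genotypes are unobserved.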
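
The "simple t-test" for observed QTL genotypes can be sketched as follows. This is a minimal illustration, not the talk's code: it assumes a pooled-variance two-sample t statistic, backcross genotypes coded 0 (AA) and 1 (AB), and the hypothetical function names `two_sample_t` and `qtl_t_statistic`.

```python
import math


def two_sample_t(group_a, group_b):
    """Pooled-variance two-sample t statistic for two lists of trait values."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    ss_a = sum((y - mean_a) ** 2 for y in group_a)
    ss_b = sum((y - mean_b) ** 2 for y in group_b)
    pooled_var = (ss_a + ss_b) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean_a - mean_b) / std_err


def qtl_t_statistic(trait, genotype):
    """Compare trait means of AB (coded 1) and AA (coded 0) individuals."""
    het = [y for y, g in zip(trait, genotype) if g == 1]
    hom = [y for y, g in zip(trait, genotype) if g == 0]
    return two_sample_t(het, hom)


# Toy data (made up for illustration only).
t_stat = qtl_t_statistic([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])
```

In practice the genotype at the QTL is not observed, which is why interval mapping replaces this test with a likelihood that averages over the possible QTL genotypes.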

11 Interval Mapping
For a QTL between markers:
The QTL genotypes are missing, but the marker genotypes can be used to infer the conditional probabilities of the QTL genotypes for a given QTL position.
The profile likelihood (LOD score) is calculated across the whole genome or candidate regions using the EM algorithm.
In any region where the profile exceeds a (genome-wide) significance threshold, a QTL is declared at the position with the highest LOD score.
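
The conditional QTL genotype probabilities given the flanking markers can be sketched as below. This is an illustration under stated assumptions, not the talk's implementation: a backcross with genotypes coded 0 (AA) and 1 (AB), no crossover interference, and the Haldane map function for converting cM distances to recombination fractions. The names `haldane` and `prob_qtl_het` are hypothetical.

```python
import math


def haldane(distance_cm):
    """Recombination fraction for a map distance in cM (Haldane map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def prob_qtl_het(left, right, d_left_cm, d_right_cm):
    """P(QTL genotype = AB | flanking marker genotypes) in a backcross.

    left, right: flanking marker genotypes (0 = AA, 1 = AB).
    d_left_cm, d_right_cm: distances from the putative QTL to each marker.
    """
    r1 = haldane(d_left_cm)
    r2 = haldane(d_right_cm)

    def transition(a, b, r):
        # Probability of moving from genotype a to genotype b across an
        # interval with recombination fraction r.
        return 1.0 - r if a == b else r

    weight_het = transition(left, 1, r1) * transition(1, right, r2)
    weight_hom = transition(left, 0, r1) * transition(0, right, r2)
    return weight_het / (weight_het + weight_hom)


# A putative QTL 2 cM from the left marker and 3 cM from the right marker.
p_ab = prob_qtl_het(left=1, right=1, d_left_cm=2.0, d_right_cm=3.0)
```

These probabilities are the mixing weights that the EM algorithm uses when the likelihood is evaluated at each candidate position.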

12 Profile LOD

13 Multiple QTL Mapping
Most complex traits are caused by multiple (potentially interacting) genes, which also interact with environmental stimuli.
Limitations of single QTL interval mapping:
  Ghost QTL (Lander & Botstein 1989)
  Low power

14 Multiple QTL Mapping
Composite interval mapping (Zeng 1993, 1994; Jansen & Stam 1994): searches for a putative QTL in a given region while simultaneously fitting partial regression coefficients for "background markers" to adjust for the effects of other QTLs outside the region.
  Open issues: which background markers to include, the window size, etc.
Multiple interval mapping (Kao et al 1999): fits multiple QTLs simultaneously.
  Computationally intensive; how many QTLs to include?

15 Multiple QTL Mapping
Bayesian methods (Stephens and Fisch 1998 Biometrics; Sillanpaa and Arjas 1998 Genetics; Yi and Xu 2002 Genetical Research; and Yi et al Genetics): treat the number of QTLs as a parameter by using the reversible jump Markov chain Monte Carlo (MCMC) of Green (1995 Biometrika).
  Difficulty: the change of dimensionality. The acceptance probability for such a dimension change may, in practice, not be handled correctly (Ven 2004 Genetics).

16 Multiple QTL Mapping
Alternatively, multiple QTL mapping can be viewed as a variable selection problem.
Forward and step-wise selection procedures (Broman and Speed 2002 JRSSB)
LASSO, etc.
Bayesian QTL mapping
  Xu (2003 Genetics), Wang et al (2005 Genetics), Huang et al (2007 Genetics): Bayesian shrinkage
  Yi et al (2003 Genetics): stochastic search variable selection (SSVS) of George and McCulloch (1993 JASA)
  Yi (2004 Genetics): composite model space of Godsill (2001 J. Comp. Graph. Stat)
  Software: R/qtlbim by Yi’s group

17 Multiple QTL Mapping
Limitations of existing QTL mapping methods:
  They do not model covariates at all, or model covariate effects only linearly.
  They do not model interactions at all, or model only lower-order interactions, such as two-way interactions.

18 Multiple QTL mapping is a very large variable selection problem: for p potential genes, with p being in the hundreds or thousands, there are 2^p possible main effect models, p(p-1)/2 possible two-way interactions and "p choose k" possible higher order (k > 2) interactions.
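
To give a feel for the size of this search space, the counts can be computed directly. The sketch below assumes that a "main effect model" is any subset of the p genes (2^p subsets) and that a k-way interaction is any unordered set of k genes (p choose k); the function name `model_space_sizes` is hypothetical.

```python
import math


def model_space_sizes(p, k):
    """Counts of candidate models and interaction terms for p genes."""
    return {
        "main_effect_models": 2 ** p,             # every subset of the p genes
        "two_way_interactions": math.comb(p, 2),  # unordered gene pairs
        "k_way_interactions": math.comb(p, k),    # unordered sets of k genes
    }


# Even a modest p gives a search space far too large to enumerate.
sizes = model_space_sizes(p=200, k=4)
for name, count in sizes.items():
    print(f"{name}: {count:.3e}")
```

An exhaustive search over such a space is infeasible, which motivates the stochastic search and shrinkage approaches listed earlier.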

19 Semiparametric Multiple (Potentially Interacting) QTL Mapping
Goal: map multiple, potentially interacting QTLs without explicitly modeling all potential main and higher-order interaction effects.
Semiparametric model: the trait depends on the QTL genotypes and on all non-genetic factors/covariates through a function that is left unspecified.
Example: when that function contains a product of the genotypes at genes 1 and 2 and a product of the genotype at gene 3 with covariate 1, the model captures the two-way interaction between genes 1 and 2 and the gene-environment interaction between gene 3 and covariate 1 without modeling them explicitly.

20 Bayesian Semi/non-parametric Methods
Dirichlet process (Muller et al. 1996)
Splines (Smith and Kohn 1996; Denison et al and DiMatteo et al. 2001)
Wavelets (Abramovich et al JRSSB)
Kernel models (Liang et al 2007)
Gaussian process (Neal 1997; 1996)
  Gaussian process priors have large support in the space of all smooth functions through an appropriate choice of covariance kernel.
  Gaussian processes are flexible for curve estimation because of their flexible sample path shapes.
  Gaussian processes are related to smoothing splines (Wahba 1978 JRSSB).

21 Prior Specification on the Unspecified Function
A Gaussian process prior is placed on the unspecified function, such that all possible finite-dimensional distributions are multivariate normal with mean 0 and a covariance function governed by a set of hyperparameters.

22 Hyperparameters of the Covariance Function
One hyperparameter defines the vertical scale of variations, i.e., it controls the magnitude of the exponential part.
A set of hyperparameters, one per input variable, relates to the length scales, which characterize the distance in that particular direction over which y is expected to vary significantly.
A hyperparameter also controls the smoothness of the fitted function: at one extreme the posterior mean almost interpolates the data, while at the other it stays centered around the prior mean function.
When the length-scale hyperparameter of input variable xj equals 0, y is expected to be an essentially constant function of xj, which is therefore deemed irrelevant (Mackay 1998).
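
The way a per-input hyperparameter switches a variable on or off can be illustrated with a squared-exponential covariance function. This is only a sketch in the spirit of the relevance-determination kernels of Mackay and Neal, not necessarily the exact covariance function used in the talk; the parameter names `scale` (vertical scale) and `rho` (one non-negative relevance weight per input) are assumptions.

```python
import math


def ard_covariance(x, x_prime, scale, rho):
    """Squared-exponential covariance with one relevance weight per input.

    scale: vertical scale, multiplying the exponential part.
    rho:   non-negative weights; rho[j] == 0 removes input j entirely.
    """
    weighted_sq_dist = sum(
        r * (a - b) ** 2 for r, a, b in zip(rho, x, x_prime)
    )
    return scale * math.exp(-weighted_sq_dist)


# Two individuals that differ only at the second locus.
x_i = [0, 1]
x_j = [0, 0]

# With rho[1] = 0 the second locus is irrelevant: the covariance equals the
# value for identical inputs, so the function is constant in that input.
cov_irrelevant = ard_covariance(x_i, x_j, scale=2.0, rho=[1.0, 0.0])

# With rho[1] > 0 the difference at the second locus lowers the covariance.
cov_relevant = ard_covariance(x_i, x_j, scale=2.0, rho=[1.0, 1.0])
```

Under this parameterization, a posterior for a relevance weight that concentrates near zero suggests that the corresponding locus has no effect on the trait.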

23 Priors on the Length-Scale Hyperparameters
The original papers on the Gaussian process (Mackay 1998; Neal 1997) did not view this method as an approach for variable selection and imposed a Gamma prior on these parameters.
However, each such hyperparameter does provide information about the relevance of its QTL, with a value near zero indicating an irrelevant QTL.
For variable selection purposes, Gamma mixture priors can be imposed on these hyperparameters.

24 Prior Specifications
Inverse Gamma distributions are used for the priors of the remaining hyperparameters.

25 Simulations
Setup:
  Backcross population with 200 or 500 individuals
  151 evenly spaced markers at 5 cM intervals
  Four QTLs with varying heritabilities, under three models:
    Main effect model: all four QTLs act additively
    Main effects plus two-way interactions
    Four-way interaction only
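
Marker data of this shape can be simulated with a simple Markov chain along the genome. The sketch below is an assumption-laden illustration rather than the simulation code behind the talk: it places all 151 markers on a single linkage group, uses the Haldane map function with no interference, and codes backcross genotypes as 0 (AA) and 1 (AB). The function name `simulate_backcross` is hypothetical.

```python
import math
import random


def simulate_backcross(n_ind, n_markers=151, spacing_cm=5.0, seed=0):
    """Simulate backcross marker genotypes (0 = AA, 1 = AB).

    Adjacent markers recombine with the Haldane recombination fraction
    implied by the marker spacing.
    """
    rng = random.Random(seed)
    recomb = 0.5 * (1.0 - math.exp(-2.0 * spacing_cm / 100.0))
    genotypes = []
    for _ in range(n_ind):
        # The first marker is AA or AB with equal probability.
        row = [rng.randint(0, 1)]
        for _ in range(n_markers - 1):
            # Flip the genotype when a recombination occurs in the interval.
            if rng.random() < recomb:
                row.append(1 - row[-1])
            else:
                row.append(row[-1])
        genotypes.append(row)
    return genotypes


markers = simulate_backcross(n_ind=200)
```

Trait values for the three scenarios could then be generated from the genotypes at four chosen marker positions, with effect sizes set to reach the target heritabilities.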

32 n=500 and pure four-way interaction model

33 n=500 and pure additive model

34 Real Data Analysis
A mouse study
Number of samples: 187 backcross samples
Number of markers: 85, with an average marker distance of 20 cM
Phenotypes: inguinal, gonadal, retroperitoneal and mesenteric fat pad weights

37 Remarks
For studies with a large number of samples and/or markers, MCMC converges very slowly.
We employed the hybrid Monte Carlo method, which merges the Metropolis-Hastings algorithm with sampling techniques based on dynamics simulation.
We also computed a point estimate, the maximum a posteriori (MAP) estimate, via the conjugate gradient method (Hestenes et al 1952 J. Research of National Bureau of Standards).
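
The idea behind hybrid Monte Carlo, a Metropolis-Hastings accept/reject step applied to proposals generated by simulated dynamics, can be sketched on a toy target. The example below samples a one-dimensional standard normal, not the Gaussian process posterior from the talk; the step size, the number of leapfrog steps and the function name `hmc_sample` are illustrative assumptions.

```python
import math
import random


def hmc_sample(log_prob, grad_log_prob, x0, n_samples,
               step=0.2, n_leapfrog=10, seed=0):
    """Hybrid (Hamiltonian) Monte Carlo for a one-dimensional target."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)  # draw a fresh momentum
        x_new, p_new = x, p

        # Leapfrog integration of the dynamics.
        p_new += 0.5 * step * grad_log_prob(x_new)
        for i in range(n_leapfrog):
            x_new += step * p_new
            if i < n_leapfrog - 1:
                p_new += step * grad_log_prob(x_new)
        p_new += 0.5 * step * grad_log_prob(x_new)

        # Metropolis-Hastings correction based on the change in total energy.
        energy_old = -log_prob(x) + 0.5 * p * p
        energy_new = -log_prob(x_new) + 0.5 * p_new * p_new
        if rng.random() < math.exp(min(0.0, energy_old - energy_new)):
            x = x_new
        samples.append(x)
    return samples


# Toy target: standard normal, log density known up to a constant.
draws = hmc_sample(
    log_prob=lambda x: -0.5 * x * x,
    grad_log_prob=lambda x: -x,
    x0=3.0,
    n_samples=2000,
)
```

Because each proposal follows the gradient of the log posterior over several steps, successive draws can be far apart while still being accepted with high probability, which is what helps when a plain random-walk sampler mixes slowly.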

38 Real Study: Cardiovascular Disease
2655 tag SNPs from roughly 200 selected candidate genes for cardiovascular disease
820 individuals
Non-genetic covariates: gender, smoking status, age

40 Remarks
Semiparametric mapping is powerful in mapping multiple QTLs, including those that interact at higher orders.
It picks up genes related to the trait regardless of whether they act through marginal main effects or joint epistatic effects.
It cannot readily differentiate the type of genetic contribution: main effect, interaction, or both?
  Possible follow-up: a fine-tuned parametric model with the selected genes.

41 Remarks and Future Research
How can the methodology be extended to human genome-wide association (GWA) studies, where hundreds of thousands of markers are available?
  Is it possible? Potential solutions: pathway analysis; data reduction techniques.
How can the method be extended to human pedigree analysis, where a mixed effect model is used for correlated family members?
  Use the inheritance vector: so far the results are very promising.

42 Acknowledgement
Joint work with: Hanwen Huang, Haibo Zhou, Fuxia Cheng, Ina Hoeschele
Funding support: NIH R01 GM074175

