Bayesian Variable Selection in Semiparametric Regression Modeling with Applications to Genetic Mapping Fei Zou Department of Biostatistics University of North Carolina-Chapel Hill June 2012 Finland
The Central Dogma of Molecular Biology
Significant difference in genotype distributions between tall and short individuals? (Figure copied, with modifications, from psb.stanford.edu/psb06/presentations/association_mapping.pdf)
Mendel’s Experiment
Experimental Crosses: F2 (parents P1 and P2)
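Experimental Crosses: F2 and Backcross (BC) [Figure: parents P1 (AA) and P2 (BB) are crossed to give F1 (AB). F2: F1 × F1, with offspring genotypes AA, AB, BB. BC: P1 × F1, with offspring genotypes AA, AB.]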
F2 Data Format 0: homozygous AA, 1: heterozygous AB, 2: homozygous BB.
Data Structure
For each subject i (i = 1, 2, …, n)
–Phenotype: y_i
–Genotypes: x_ij (coded as 0, 1, 2 for genotypes AA, AB and BB, respectively) at marker j (j = 1, 2, …, m)
–Genetic map: locations of markers
–Other non-genetic covariates, such as age, sex, environmental conditions
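A minimal sketch of how this data structure might be held in code. The container name `CrossData`, its field names, and the simulator `simulate_f2` are hypothetical and chosen only for illustration; the simulator draws independent markers with 1:2:1 F2 segregation and therefore ignores linkage between neighbouring markers.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class CrossData:
    """Hypothetical container for one experimental cross (illustrative names)."""
    y: np.ndarray           # phenotypes y_i, shape (n,)
    x: np.ndarray           # genotypes x_ij coded 0/1/2 for AA/AB/BB, shape (n, m)
    map_cm: np.ndarray      # marker locations in centimorgans, shape (m,)
    covariates: np.ndarray  # non-genetic covariates (e.g. age, sex), shape (n, p)

    def __post_init__(self):
        n, m = self.x.shape
        if self.y.shape != (n,):
            raise ValueError("y must have one entry per subject")
        if self.map_cm.shape != (m,):
            raise ValueError("map_cm must have one entry per marker")
        if self.covariates.shape[0] != n:
            raise ValueError("covariates must have one row per subject")
        if not np.isin(self.x, (0, 1, 2)).all():
            raise ValueError("genotypes must be coded 0, 1 or 2")


def simulate_f2(n, m, seed=0):
    """Simulate toy F2 data; markers are independent (assumption: no linkage)."""
    rng = np.random.default_rng(seed)
    x = rng.choice([0, 1, 2], size=(n, m), p=[0.25, 0.5, 0.25])
    y = rng.normal(size=n)                      # phenotype with no QTL effect
    map_cm = np.arange(m) * 10.0                # evenly spaced toy map
    covariates = rng.integers(0, 2, size=(n, 1)).astype(float)  # e.g. sex
    return CrossData(y=y, x=x, map_cm=map_cm, covariates=covariates)
```

For example, `simulate_f2(50, 8)` returns a `CrossData` whose `x` is a 50 × 8 matrix of 0/1/2 codes.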
Locations of markers
Linkage Analysis Quantitative trait locus (QTL): a region of the genome containing one or more genes that are associated with the trait being assayed or measured
QTL Mapping of Experimental Crosses
Single QTL mapping
–Single marker analysis
–Interval mapping: Lander & Botstein (1989, Genetics)
Multiple QTL mapping
–Composite interval mapping
–Multiple interval mapping
–Bayesian analysis
Single Marker Analysis
Correlations of marker genotypes in experimental crosses
Interval Mapping
Traditional QTL mapping method
Treat the QTL position as unknown and use marker genotypes to infer conditional probabilities of QTL genotypes
Profile LOD scores calculated across the whole genome
–The LOD score measures the strength of support for a QTL
–LOD = LRT/(2 ln 10) ≈ LRT/4.6
–In any region where the profile exceeds a (genome-wide) significance threshold, a QTL is declared at the position with the highest LOD score.
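A small sketch of the LOD computation, assuming a normal linear model. Since the LOD score is the base-10 log of the likelihood ratio and the LRT statistic is twice its natural log, LOD = LRT/(2 ln 10). The function names are hypothetical, and the scan below is evaluated at a marker only, using the additive genotype code; full interval mapping would replace the observed genotype with conditional QTL genotype probabilities between markers.

```python
import numpy as np


def lrt_to_lod(lrt):
    """Convert a likelihood ratio test statistic (2 * ln LR) to a LOD score."""
    return lrt / (2.0 * np.log(10.0))


def marker_lod(y, g):
    """LOD score for a linear regression of phenotype y on genotype code g.

    Under normal errors with the variance estimated by maximum likelihood,
    LRT = n * ln(RSS0 / RSS1), where RSS0 is the residual sum of squares of
    the intercept-only model and RSS1 that of the model including g.
    """
    n = len(y)
    rss0 = np.sum((y - np.mean(y)) ** 2)
    design = np.column_stack([np.ones(n), g])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss1 = np.sum((y - design @ beta) ** 2)
    return lrt_to_lod(n * np.log(rss0 / rss1))
```

Calling `marker_lod` once per marker gives a LOD profile over marker positions, which can then be compared with a genome-wide threshold.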
Profile LOD
QTL
Old belief: one trait, one gene (very unlikely)
Most traits have a significant environmental exposure component
The vast majority of biological traits are caused by complex polygenic interactions
–also context dependent
Multiple QTL Mapping
Most complex traits are caused by multiple (potentially interacting) genes, which also interact with environmental stimuli
Single QTL interval mapping
–Ghost QTL
–Low power if multiple QTLs affect the trait
Two QTL Data [Figure panels: two QTL with opposite effects; two QTL with effects in the same direction]
Multiple QTL Mapping
Available Methods
–Composite interval mapping: searching for a putative QTL in a given region while simultaneously fitting partial regression coefficients for "background markers" to adjust for the effects of other QTLs outside the region
 Open issues: which background markers to include, window size, etc.
–Multiple interval mapping: fitting multiple QTLs simultaneously
 Computationally very intensive; how many QTLs to fit?
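The composite interval mapping idea can be sketched as a regression at a test marker with background markers as cofactors. This is a simplified illustration, not the full method: it tests at marker positions only, the function name `cim_lod` and its parameters are hypothetical, and the cofactor set and window size are taken as user inputs because choosing them is exactly the open issue noted above.

```python
import numpy as np


def _rss(design, y):
    """Residual sum of squares of the least-squares fit of y on design."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)


def cim_lod(y, x, map_cm, j, cofactors, window_cm=10.0):
    """LOD for marker j, adjusting for background markers outside a window.

    y: phenotypes (n,); x: genotype codes (n, m); map_cm: marker positions (m,).
    cofactors: indices of candidate background markers (user-chosen assumption).
    Cofactors within window_cm of marker j are dropped so that they do not
    absorb the effect of a QTL in the tested region.
    """
    n = len(y)
    keep = [k for k in cofactors
            if k != j and abs(map_cm[k] - map_cm[j]) > window_cm]
    null_design = np.column_stack([np.ones(n)] + [x[:, k] for k in keep])
    alt_design = np.column_stack([null_design, x[:, j]])
    lrt = n * np.log(_rss(null_design, y) / _rss(alt_design, y))
    return lrt / (2.0 * np.log(10.0))
```

With an empty cofactor list this reduces to a single-marker regression, so the same function can be used to compare the adjusted and unadjusted scans.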
Multiple QTL Mapping
Bayesian QTL Mapping
Reversible jump Markov chain Monte Carlo (MCMC) (Green 1995): treat the number of QTLs as a parameter
–Requires changes of dimensionality; the acceptance probability for such dimension changes may not be handled correctly in practice (Ven 2004)
Bayesian variable selection procedures
–composite model space (Yi 2004)
–stochastic search variable selection (SSVS) (George and McCulloch 1993)
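A minimal Gibbs sampler sketch in the spirit of SSVS, with markers as the candidate variables. Each coefficient has a two-component normal mixture prior: a narrow "spike" N(0, tau^2) when the indicator gamma_j = 0 and a wide "slab" N(0, (c*tau)^2) when gamma_j = 1. The function name `ssvs`, the hyperparameter values (tau, c, p), and the improper prior p(sigma^2) ∝ 1/sigma^2 are assumptions made for illustration, not settings taken from the talk.

```python
import numpy as np


def ssvs(y, X, n_iter=2000, burn=500, tau=0.05, c=100.0, p=0.1, seed=0):
    """Return posterior inclusion frequencies for the columns of X."""
    rng = np.random.default_rng(seed)
    y = y - y.mean()                 # centre so that no intercept is needed
    X = X - X.mean(axis=0)
    n, m = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    gamma = np.zeros(m, dtype=bool)  # start with every marker in the spike
    sigma2 = float(y.var())
    incl = np.zeros(m)
    for it in range(n_iter):
        # 1. beta | gamma, sigma2: multivariate normal
        prior_var = np.where(gamma, (c * tau) ** 2, tau ** 2)
        cov = np.linalg.inv(XtX / sigma2 + np.diag(1.0 / prior_var))
        cov = (cov + cov.T) / 2.0    # guard against numerical asymmetry
        beta = rng.multivariate_normal(cov @ Xty / sigma2, cov)
        # 2. sigma2 | beta: inverse gamma(n/2, RSS/2) under p(sigma2) ∝ 1/sigma2
        resid = y - X @ beta
        sigma2 = 1.0 / rng.gamma(shape=n / 2.0, scale=2.0 / (resid @ resid))
        # 3. gamma_j | beta_j: Bernoulli, comparing slab and spike densities
        log_slab = -0.5 * beta**2 / (c * tau) ** 2 - np.log(c * tau) + np.log(p)
        log_spike = -0.5 * beta**2 / tau**2 - np.log(tau) + np.log1p(-p)
        prob_slab = 1.0 / (1.0 + np.exp(log_spike - log_slab))
        gamma = rng.random(m) < prob_slab
        if it >= burn:
            incl += gamma
    return incl / (n_iter - burn)
```

Because the dimension of the parameter vector never changes (all coefficients are always in the model, only their prior component switches), this avoids the dimension-changing moves that reversible jump MCMC requires.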