Published by Rebecca Tyler. Modified over 9 years ago.
1
24/07/2007 ISMB/ECCB 2007
Bayesian association of haplotypes and non-genetic factors to regulatory and phenotypic variation in human populations
Anitha Kannan and John Winn
Jim Huang *
* Probabilistic and Statistical Inference Group, Edward S. Rogers Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON, Canada
Microsoft Research Cambridge, Machine Learning and Perception Group, Cambridge, UK
2
Outline
Main contributions:
– Joint Bayesian modelling of genetic variation data and quantitative trait measurements
– Rich probabilistic model for genotype data
– State-of-the-art results on predicting missing genotypes
3
Outline
Genotype: Unordered pair of SNPs along both chromosomes
Haplotype: Ordered set of SNPs along a chromosome
Presence of recombination hotspots partitions haplotypes into blocks [Daly et al., 2001]
4
Part I: Learning haplotype block structure
Our model for genotype data should:
– Account for phase & parent-child information
– Account for uncertainty in ancestral haplotypes
– Account for uncertainty in block structure
– Account for population-specific haplotype block statistics
– Allow for prior knowledge of haplotype block structure
5
Previous models for genotype data
Previous methods learn a low-dimensional representation of the genotype data:
HAPLOBLOCK (Greenspan, G. and Geiger, D. RECOMB 2003)
– Hard partitioning of data into a set of haplotype blocks using low-dimensional “ancestral” haplotypes
fastPHASE (Scheet, P. and Stephens, M. Am J Hum Genet 2006)
– Learn ancestral haplotypes from high-dimensional genotype data while accounting for uncertainty in haplotype blocks
Jojic, N., Jojic, V. and Heckerman, D. UAI 2004.
6
Probabilistic generative model for genotype data
Low-dimensional latent representation
High-dimensional data
Unsupervised learning via maximum likelihood
7
A probabilistic model for genotype data
8
Learning the model for genotype data
Maximum likelihood:
Lower bound on log likelihood:
Inference
Learning/parameter estimation
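For reference, the generic form of such a lower bound can be sketched as follows. The notation is assumed here and is not necessarily the authors' exact expression: x stands for the observed genotype data, h for the hidden variables, θ for the model parameters, q for the approximating distribution and F for the free energy.

```latex
\log p(x \mid \theta)
  = \log \sum_{h} p(x, h \mid \theta)
  \;\ge\; \sum_{h} q(h) \log \frac{p(x, h \mid \theta)}{q(h)}
  = -F(q, \theta)
```

The bound is tight when q(h) equals the exact posterior p(h | x, θ). Inference adjusts q with θ held fixed, and learning adjusts θ with q held fixed.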
9
Variational inference and parameter estimation
Exact inference is intractable!
Approximate the posterior distribution:
Baum-Welch-like algorithm:
– Run forward-backward algorithm separately on each chain of states
– Estimate transition probabilities and ancestral haplotypes given distributions over states
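The per-chain step can be illustrated with a minimal sketch of the standard scaled forward-backward recursion for a single chain of discrete states. This is a textbook version, not the authors' implementation: the function name and the arguments `init`, `trans` and `lik` are hypothetical, and in a structured variational scheme the per-locus terms in `lik` would presumably be expectations taken under the other chains' current distributions.

```python
import math

def forward_backward(init, trans, lik):
    """Scaled forward-backward for one chain (illustrative sketch).

    init[k]     : probability of starting in state k
    trans[i][k] : probability of moving from state i to state k
    lik[t][k]   : likelihood of the observation at locus t given state k
    Returns (gamma, log_lik): gamma[t][k] is the posterior probability of
    state k at locus t, and log_lik is the log-likelihood of the sequence.
    """
    T, K = len(lik), len(init)
    alpha = [[0.0] * K for _ in range(T)]
    beta = [[1.0] * K for _ in range(T)]
    scale = [0.0] * T

    # Forward pass, normalising at every locus to avoid underflow.
    for t in range(T):
        for k in range(K):
            if t == 0:
                prior = init[k]
            else:
                prior = sum(alpha[t - 1][i] * trans[i][k] for i in range(K))
            alpha[t][k] = prior * lik[t][k]
        scale[t] = sum(alpha[t])
        alpha[t] = [a / scale[t] for a in alpha[t]]

    # Backward pass, reusing the forward scaling constants.
    for t in range(T - 2, -1, -1):
        for k in range(K):
            beta[t][k] = sum(
                trans[k][j] * lik[t + 1][j] * beta[t + 1][j] for j in range(K)
            ) / scale[t + 1]

    gamma = [[alpha[t][k] * beta[t][k] for k in range(K)] for t in range(T)]
    log_lik = sum(math.log(c) for c in scale)
    return gamma, log_lik
```

The returned marginals `gamma` are the "distributions over states" from which transition probabilities and ancestral haplotypes would then be re-estimated.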
10
Predicting missing genotype data
Have we learned a good density model for genotype data?
Gains from
– Accounting for uncertainty in haplotype block structure
– Accounting for uncertainty in ancestral haplotypes
– Accounting for parental relationships
Assess model using cross-validation/test prediction error
11
Predicting missing genotype data
Crohn’s/5q31 data set (Daly et al., 2001)
– Crohn’s disease data from Chromosome 5q31 containing genotypes for 129 children + 258 parents across 103 loci (phases given for children)
For each test set, mark a fraction ρ of the data as missing
Retain the parameters of the model learned from the training data, then draw 1000 samples over the missing data
Compute the fill-in error rate over the 1000 samples, for all missing data
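The evaluation protocol above can be sketched roughly as follows. This is an assumption-laden illustration rather than the authors' code: the helper names `mask_fraction` and `fill_in_error_rate` are hypothetical, genotypes are flattened to a simple list of integer codes, and the samples are taken as given rather than drawn from a learned model.

```python
import random

def mask_fraction(n_entries, rho, rng):
    """Choose round(rho * n_entries) entry indices to treat as missing."""
    n_missing = round(rho * n_entries)
    return sorted(rng.sample(range(n_entries), n_missing))

def fill_in_error_rate(truth, samples, missing):
    """Fraction of imputed values that differ from the held-out truth,
    pooled over all samples and all missing entries."""
    wrong = sum(sample[i] != truth[i] for sample in samples for i in missing)
    return wrong / (len(samples) * len(missing))

# Toy usage: ten genotype entries, 30% hidden, two hypothetical samples.
truth = [0, 1, 2, 1, 0, 2, 1, 1, 0, 2]
missing = mask_fraction(len(truth), 0.3, random.Random(0))
samples = [list(truth), [(v + 1) % 3 for v in truth]]
error = fill_in_error_rate(truth, samples, missing)
```

In this toy case one sample reproduces the truth and the other is wrong everywhere, so the pooled error rate is one half regardless of which entries were hidden.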
12
Prediction error for Crohn’s/5q31 data
13
Comparative performance for Crohn’s/5q31 data
14
Reconstructing phase
Run EM using 10 random initializations on the full data set
Estimate phase from posterior
Compute phase error over all loci where phase is known, unambiguous and where alleles are completely observed
Compute average and standard deviation of phase error over the 10 initializations
15
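Reconstructing phase
Daly 5q31 data (children w/ phase) (phase frozen during EM):
– Mean phase error rate: 0.59%
– Standard deviation of phase error rate: 1.00%
– Minimum free energy (nats): 1.50 x 10^4
Daly 5q31 data (children w/out phase) (phase learned during EM):
– Mean phase error rate: 8.21%
– Standard deviation of phase error rate: 1.09%
– Minimum free energy (nats): 2.23 x 10^4
Daly 5q31 data (children w/ phase + parents) (phase frozen during EM):
– Mean phase error rate: 0.39%
– Standard deviation of phase error rate: 0.07%
– Minimum free energy (nats): 1.45 x 10^4
Daly 5q31 data (children w/out phase + parents) (phase learned during EM):
– Mean phase error rate: 9.51%
– Standard deviation of phase error rate: 1.78%
– Minimum free energy (nats): 1.36 x 10^4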
16
How many ancestors?
17
Establishing haplotype block boundaries
Define the recombination prior γ on transition probabilities
– Different γ correspond to different “blockiness” of data
For each locus k, can compute the probability of transition p_k
– Can establish a threshold t and establish block boundaries
Once blocks are defined, can assign block labels l_b = (m,n)
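The thresholding step can be sketched as below. The indexing convention is an assumption made for illustration: `p[k]` is taken to be the probability of a transition between locus k and locus k+1, and a boundary is placed wherever it exceeds the threshold `t`. The function name is hypothetical.

```python
def blocks_from_transitions(p, t):
    """Split loci into blocks at every position where p[k] > t.

    p : list of transition probabilities, p[k] between locus k and k+1
        (so there are len(p) + 1 loci in total)
    t : threshold above which a block boundary is declared
    Returns a list of (first_locus, last_locus) pairs, both inclusive.
    """
    n_loci = len(p) + 1
    blocks, start = [], 0
    for k, pk in enumerate(p):
        if pk > t:
            blocks.append((start, k))
            start = k + 1
    blocks.append((start, n_loci - 1))
    return blocks

# Toy usage with seven loci and hypothetical transition probabilities.
p = [0.01, 0.02, 0.9, 0.05, 0.6, 0.01]
coarse = blocks_from_transitions(p, 0.5)
fine = blocks_from_transitions(p, 0.04)
```

Lowering the threshold can only add boundaries, so `fine` contains at least as many blocks as `coarse`.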
18
Establishing haplotype block boundaries
Smaller number of larger blocks…
Larger number of smaller blocks…
19
Haplotype block structure in the ENm006 region
573 SNP markers for 270 individuals from 3 sub-populations:
– 90 Yoruba individuals (30 parent-parent-offspring trios) from Ibadan, Nigeria (YRI)
– 90 individuals (30 trios) of European descent from Utah (CEU)
– 45 Han Chinese individuals from Beijing (CHB) and 45 Japanese individuals from Tokyo (JPT), pooled as CHB+JPT
20
Pattern usage in Chromosome 5q31
21
Part II: Linking haplotype block structure and gene expression data
22
A model for linking haplotype structure to quantitative trait measurements
[Figure: observed quantitative trait profile = sum over haplotype blocks 1 and 2 of relevance variable (x 1.0, x 0.0) times latent block profile; rows are Individuals 1-5, block labels are Labels 1-4]
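The additive structure suggested by this slide can be sketched as follows. This is only one reading of the figure, not the authors' stated likelihood: each individual's expected trait value is taken to be the sum, over blocks, of a relevance weight times the latent profile value for that individual's block label. All names and numbers below are hypothetical, and observation noise is left out.

```python
def expected_trait(labels, profiles, relevance):
    """Expected trait value per individual under an additive block model.

    labels[b][j]       : label of individual j in haplotype block b
    profiles[b][label] : latent trait level attached to that label in block b
    relevance[b]       : relevance weight of block b (1.0 relevant, 0.0 not)
    """
    n_blocks = len(labels)
    n_individuals = len(labels[0])
    return [
        sum(relevance[b] * profiles[b][labels[b][j]] for b in range(n_blocks))
        for j in range(n_individuals)
    ]

# Toy usage: five individuals, two blocks, only the first block relevant.
labels = [[0, 1, 1, 2, 0], [1, 0, 1, 0, 1]]
profiles = [{0: -1.0, 1: 0.5, 2: 2.0}, {0: 3.0, 1: -3.0}]
relevance = [1.0, 0.0]
trait = expected_trait(labels, profiles, relevance)
```

With the second block switched off, the expected trait simply follows the first block's labels, which is the kind of sparsity a relevance variable is meant to express.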
23
A Bayesian model for linking haplotype structure to quantitative measurements
[Plate diagram]
Plates: individuals j = 1,…,J; blocks b = 1,…,B; quantitative traits g = 1,…,G
Variables: S_bj, T_bj, z_gj, μ_bg, w_bg, ρ_g
Hyperparameters: α_0, β_0; τ_0, μ_0; π_0
Node annotations: noise precision, latent block profile, relevance variable, observed trait, block label
24
A Bayesian model for linking haplotype structure to quantitative measurements
[Plate diagram]
Plates: individuals j = 1,…,J; blocks b = 1,…,B; genes g = 1,…,G
Variables: l_bj, S_bj, T_bj, z_gj, μ_bg, w_bg, ρ_g
Hyperparameters: α_0, β_0; π_0; τ_0, μ_0
Node annotations: noise precision, latent block profile, relevance variable, observed gene expression, block label
25
A Bayesian model for linking haplotype structure to quantitative measurements
[Plate diagram]
Variables: μ_bg, w_bg, ρ_g
Hyperparameters: α_0, β_0; π_0; τ_0, μ_0
Node annotations: noise precision, latent block profile, relevance variable
26
A Bayesian model for linking haplotype structure to quantitative measurements
[Plate diagram]
Variables: z_gj, μ_bg, w_bg, ρ_g, S_bj, T_bj
Hyperparameters: α_0, β_0; τ_0, μ_0; π_0
Node annotations: noise precision, latent block expression profile, relevance variable, observed trait, block labels
27
A Bayesian model for linking haplotype structure to quantitative measurements
Likelihood
Joint probability
Priors
28
Variational Bayes for inferring relationships between haplotype blocks and quantitative measurements
Posterior over block labels is held fixed
Factorized variational approximation:
VB inference and learning
29
Variational Bayes updates
30
Linking haplotype blocks to phenotype
387 individuals with Crohn’s (+1) or non-Crohn’s (-1) phenotype
Link 10 haplotype blocks from 5q31 to phenotype
Average cross-validation error: 23.1% ± 3.45%
Haplotype blocks 2 and 10 most relevant to Crohn’s phenotype (p < 4.76 x 10^-5)
[Figure axes: test cases (sorted), test data splits]
31
Robustness of GeneSNP to irrelevant genes
[Figure panels: 5 irrelevant genes, 10 irrelevant genes]
Adding irrelevant genes doesn’t hurt much…
32
Robustness of GeneSNP to irrelevant blocks
[Figure panels: 1 irrelevant block, 2 irrelevant blocks, 10 irrelevant blocks]
…but adding irrelevant haplotype blocks does hurt (bad)!
LESSON: Important to group together large numbers of SNPs into a smaller number of haplotype blocks!
33
Linking haplotype blocks to gene expression
ENm006 data set:
– 19 haplotype blocks (573 SNPs)
– 28 gene expression profiles in ENm006 region (Stranger et al., 2007)
34
Addressing population stratification
The population variable affects phenotype/gene expression…
…whereas variation between individuals is the effect we’re interested in
35
Associations between haplotype blocks and gene expression
GDI1 - HapBlock2 (YRI): p < 2.5 x 10^-4
GDI1 - HapBlock5 (CHB+JPT): p < 3.33 x 10^-4
36
Summary
Enhanced version of Jojic et al. (UAI 2004) model for haplotype inference/discovering block structure
Novel Bayesian model for associating haplotype blocks to gene expression
We re-discover population-specific block structures across populations in the HapMap data
Predictions for Crohn’s disease from Chromosome 5q31 data
Cis-associations between blocks and gene expression in ENm006 in presence of non-genetic factors
Cis-association between HapBlocks 2 and 5 and GDI1
37
The road ahead…
Applying to larger portions of the HapMap data
Finding trans-associations
Non-linear models for associating block structure to quantitative traits
Joint learning of haplotype block structure and associations
Accounting for patterns of gene co-expression/similar phenotypes
38
Acknowledgements
Manolis Dermitzakis and Richard Durbin, Wellcome Trust Sanger Institute
Nebojsa Jojic, Microsoft Research Redmond
Paul Scheet, University of Michigan - Ann Arbor
US National Science Foundation (NSF)