California Pacific Medical Center


1 Genotype Imputation
Dan Evans
California Pacific Medical Center Research Institute

2 Outline
Overview
Elements of a Hidden Markov Model (HMM)
Methods used by MACH
Method comparison with IMPUTEv2
Implementation with MACH
Implementation with IMPUTEv2
Software evaluation

3 Impute missing genotypes
Li et al., Annu Rev Genomics Hum Genet 2009

4 Benefits of imputation
Expanded set of SNPs tested for association
Facilitate meta-analysis among studies using different genotyping arrays
Marchini et al., Nature Genetics 2007

5 Imputation steps
Phase study genotypes
Impute missing genotypes from phased haplotypes

Phase 1
     M1  M2  M3  M4
ID1  G   G   T   A
ID1  G   A   C   A

Phase 2
     M1  M2  M3  M4
ID1  G   A   T   A
ID1  G   G   C   A
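
The two phase configurations on this slide can be enumerated mechanically. The sketch below is only an illustration of what "phasing" has to choose between; the function name `enumerate_phasings` is hypothetical and the genotypes are the ones read off the slide for ID1.

```python
from itertools import product

def enumerate_phasings(genotypes):
    """Return every distinct haplotype pair consistent with unphased genotypes.

    genotypes: list of (allele_a, allele_b) tuples, one per marker.
    """
    phasings = set()
    # At each marker, either keep the allele order or swap it.
    for flips in product([False, True], repeat=len(genotypes)):
        hap1 = "".join(b if f else a for (a, b), f in zip(genotypes, flips))
        hap2 = "".join(a if f else b for (a, b), f in zip(genotypes, flips))
        # A haplotype pair is unordered, so store it in sorted order.
        phasings.add(tuple(sorted((hap1, hap2))))
    return sorted(phasings)

# Genotypes of ID1 at markers M1..M4 (two homozygous, two heterozygous markers).
id1 = [("G", "G"), ("G", "A"), ("T", "C"), ("A", "A")]
phasings = enumerate_phasings(id1)
for hap1, hap2 in phasings:
    print(hap1, hap2)
```

With h heterozygous markers there are 2^(h-1) distinct configurations, which is why exhaustive enumeration stops being practical after a handful of markers.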

6 Phasing genome-wide
EM algorithm: treats all possible haplotype configurations as equally likely a priori
  Computational constraints when there are more than 10 markers
Hidden Markov Models: new haplotypes are derived from older haplotypes by mutation and recombination
  Limits the possible haplotype configurations

8 Elements of a Hidden Markov Model (HMM)
Probabilistic model for sequence annotation: identify the 5' splice site
Exons, splice sites, and introns have different base composition
3 states
Each state has emission probabilities
Each state has transition probabilities
The state path is $\pi$
The path is a Markov chain
Eddy, Nature Biotechnology 2004

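9 Probability model of sequence
Sequence $x_1 \dots x_L$, with $i$th symbol $x_i$
Transition: probability $a_{kl}$ of moving from state $k$ to state $l$ along the path
Emission: probability $e_k(b)$ that symbol $b$ is seen at position $i$ when the $i$th state in the path is $k$
Joint probability of sequence and path:
$$P(x, \pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i) \prod_{i=2}^{L} a_{\pi_{i-1}\pi_i}$$
Here $a_{0\pi_1}$ is the start transition, $e_{\pi_i}(x_i)$ the emission, and $a_{\pi_{i-1}\pi_i}$ the transition.
Durbin et al., Biological sequence analysis, 1998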

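10 $$P(x, \pi) = \dots \times 0.9 \times 0.25 \times 0.9 \times 0.95 \times 0.1 \times 0.4 \times 1.0 \times 0.4 \times 0.9 \times \dots$$
$$P(x, \pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i) \prod_{i=2}^{L} a_{\pi_{i-1}\pi_i}$$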
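
As a rough illustration of how such a product is assembled, the sketch below multiplies start, emission and transition terms along one path. The three-state model (exon `E`, splice site `5`, intron `I`) and its numbers are assumptions chosen to resemble the factors on this slide, not values taken from the deck, and the end state is ignored.

```python
# ASSUMPTION: illustrative toy parameters, not taken from the presentation.
START = {"E": 1.0, "5": 0.0, "I": 0.0}
TRANS = {
    "E": {"E": 0.9, "5": 0.1, "I": 0.0},
    "5": {"E": 0.0, "5": 0.0, "I": 1.0},
    "I": {"E": 0.0, "5": 0.0, "I": 0.9},  # the remaining 0.1 would go to an end state
}
EMIT = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5": {"A": 0.05, "C": 0.0, "G": 0.95, "T": 0.0},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

def joint_prob(seq, path):
    """P(x, pi): start transition, then one emission per symbol and one transition per step."""
    prob = START[path[0]] * EMIT[path[0]][seq[0]]
    for i in range(1, len(seq)):
        prob *= TRANS[path[i - 1]][path[i]] * EMIT[path[i]][seq[i]]
    return prob

# Two exon positions, the splice site on G, then two intron positions.
print(joint_prob("CAGTA", "EE5II"))
```

For realistic sequence lengths the product underflows, so implementations usually sum log probabilities instead.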

11 What if the state path, emission probabilities and transition probabilities are unknown?
Dynamic programming algorithms to determine the path
  Viterbi algorithm
  Forward-backward algorithm
Baum-Welch algorithm to estimate transition and emission probabilities
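
A minimal sketch of the Viterbi algorithm named above, which finds the single most probable state path by dynamic programming. The toy model and its parameters are assumptions for illustration only, and plain probabilities are used because the example sequence is short.

```python
# ASSUMPTION: illustrative toy parameters (exon E, splice site 5, intron I).
START = {"E": 1.0, "5": 0.0, "I": 0.0}
TRANS = {
    "E": {"E": 0.9, "5": 0.1, "I": 0.0},
    "5": {"E": 0.0, "5": 0.0, "I": 1.0},
    "I": {"E": 0.0, "5": 0.0, "I": 0.9},
}
EMIT = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5": {"A": 0.05, "C": 0.0, "G": 0.95, "T": 0.0},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

def viterbi(seq):
    """Return (most probable path, its joint probability with seq)."""
    states = list(START)
    # v[k]: probability of the best path that ends in state k at the current position.
    v = {k: START[k] * EMIT[k][seq[0]] for k in states}
    backpointers = []
    for x in seq[1:]:
        ptr, nxt = {}, {}
        for l in states:
            # Best predecessor of state l at this position.
            best = max(states, key=lambda k: v[k] * TRANS[k][l])
            ptr[l] = best
            nxt[l] = v[best] * TRANS[best][l] * EMIT[l][x]
        backpointers.append(ptr)
        v = nxt
    # Trace back from the best final state.
    last = max(states, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return "".join(reversed(path)), v[last]

best_path, best_prob = viterbi("CAGTA")
print(best_path, best_prob)
```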

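12 Forward algorithm
Probability of the observed sequence up to and including $x_i$, with state $\pi_i = k$: $f_k(i) = P(x_1 \dots x_i, \pi_i = k)$
Start: $f_k(1) = a_{0k}\, e_k(x_1)$
Sum over all states at each position, combining transition and emission: $f_l(i+1) = e_l(x_{i+1}) \sum_k f_k(i)\, a_{kl}$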
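
The forward recursion can be sketched in a few lines. This is an illustrative version under assumed toy parameters (exon `E`, splice site `5`, intron `I`), without an explicit end state and without the scaling that long sequences would need.

```python
# ASSUMPTION: illustrative toy parameters, not taken from the presentation.
START = {"E": 1.0, "5": 0.0, "I": 0.0}
TRANS = {
    "E": {"E": 0.9, "5": 0.1, "I": 0.0},
    "5": {"E": 0.0, "5": 0.0, "I": 1.0},
    "I": {"E": 0.0, "5": 0.0, "I": 0.9},
}
EMIT = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5": {"A": 0.05, "C": 0.0, "G": 0.95, "T": 0.0},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

def forward(seq):
    """Return a list f where f[i][k] = P(x_1..x_{i+1}, state at that position = k)."""
    states = list(START)
    f = [{k: START[k] * EMIT[k][seq[0]] for k in states}]
    for x in seq[1:]:
        prev = f[-1]
        # Sum over all previous states, then multiply by the emission.
        f.append({l: EMIT[l][x] * sum(prev[k] * TRANS[k][l] for k in states)
                  for l in states})
    return f

f = forward("CAGTA")
# With no end state modelled, P(x) is the sum of the final forward values.
p_x = sum(f[-1].values())
print(p_x)
```

Unlike Viterbi, which keeps only the best predecessor, the forward values add up the contribution of every path.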

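13 Backward algorithm
Probability of the observed sequence after position $i$, given state $\pi_i = k$, computed by starting from the end and working backwards: $b_k(i) = P(x_{i+1} \dots x_L \mid \pi_i = k)$
Start at the end: $b_k(L) = 1$
Sum over all states at each position: $b_k(i) = \sum_l a_{kl}\, e_l(x_{i+1})\, b_l(i+1)$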

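14 Posterior state probabilities
Want to know the probability of state $k$ at position $i$ when the emitted sequence is known
Posterior probability: $P(\pi_i = k \mid x)$
General multiplication rule: $P(x, \pi_i = k) = f_k(i)\, b_k(i)$, where $f_k(i)$ is the forward value and $b_k(i)$ the backward value
Divide both sides by $P(x)$: $P(\pi_i = k \mid x) = f_k(i)\, b_k(i) / P(x)$
From the posterior probabilities, one can take the most probable state at each position, or compute the expected value of a function of the states by weighting it with the posterior probabilities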
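
To make the posterior calculation concrete, here is a small sketch that runs a forward pass and a backward pass and combines them position by position. The toy model and its parameters are assumptions for illustration; no end state is modelled, so the backward values start at 1.

```python
# ASSUMPTION: illustrative toy parameters (exon E, splice site 5, intron I).
START = {"E": 1.0, "5": 0.0, "I": 0.0}
TRANS = {
    "E": {"E": 0.9, "5": 0.1, "I": 0.0},
    "5": {"E": 0.0, "5": 0.0, "I": 1.0},
    "I": {"E": 0.0, "5": 0.0, "I": 0.9},
}
EMIT = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5": {"A": 0.05, "C": 0.0, "G": 0.95, "T": 0.0},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

def posterior(seq):
    """Return a list of dicts: P(state = k | whole sequence) at each position."""
    states = list(START)
    # Forward pass.
    f = [{k: START[k] * EMIT[k][seq[0]] for k in states}]
    for x in seq[1:]:
        prev = f[-1]
        f.append({l: EMIT[l][x] * sum(prev[k] * TRANS[k][l] for k in states)
                  for l in states})
    # Backward pass, filled in from the last position to the first.
    b = [{k: 1.0 for k in states}]
    for x in reversed(seq[1:]):
        nxt = b[0]
        b.insert(0, {k: sum(TRANS[k][l] * EMIT[l][x] * nxt[l] for l in states)
                     for k in states})
    p_x = sum(f[-1].values())
    # Multiply forward and backward values and divide by P(x).
    return [{k: f[i][k] * b[i][k] / p_x for k in states} for i in range(len(seq))]

post = posterior("CAGTA")
for i, p in enumerate(post, start=1):
    best = max(p, key=p.get)
    print(i, best, round(p[best], 3))
```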

15 Baum-Welch algorithm
1. Initial guess at transition ($a_{kl}$) and emission ($e_k(b)$) probabilities
2. Forward-backward to find posterior probabilities of states in the path
3. Use posterior probabilities at each state to estimate new $a_{kl}$ and $e_k(b)$
4. Iterate steps 2 and 3 until a stopping criterion is met (small difference in log likelihood)
Version of the EM algorithm
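
The steps above can be sketched for a single short sequence as follows. This is a minimal, unscaled illustration rather than a production implementation: the two-state model, the coin-flip style sequence, the initial guesses and the function names are all hypothetical.

```python
import math

def forward_backward(seq, start, trans, emit):
    states = list(start)
    f = [{k: start[k] * emit[k][seq[0]] for k in states}]
    for x in seq[1:]:
        prev = f[-1]
        f.append({l: emit[l][x] * sum(prev[k] * trans[k][l] for k in states)
                  for l in states})
    b = [{k: 1.0 for k in states}]
    for x in reversed(seq[1:]):
        nxt = b[0]
        b.insert(0, {k: sum(trans[k][l] * emit[l][x] * nxt[l] for l in states)
                     for k in states})
    return f, b, sum(f[-1].values())

def baum_welch_step(seq, start, trans, emit):
    """One update; also returns the log likelihood under the input parameters."""
    states, symbols = list(start), list(emit[next(iter(start))])
    n = len(seq)
    f, b, p_x = forward_backward(seq, start, trans, emit)
    # Posterior probability of each state at each position.
    gamma = [{k: f[i][k] * b[i][k] / p_x for k in states} for i in range(n)]
    new_trans, new_emit = {}, {}
    for k in states:
        # Expected number of k -> l transitions, then normalise the row.
        exp_t = {l: sum(f[i][k] * trans[k][l] * emit[l][seq[i + 1]] * b[i + 1][l]
                        for i in range(n - 1)) / p_x for l in states}
        total_t = sum(exp_t.values())
        new_trans[k] = {l: exp_t[l] / total_t for l in states}
        # Expected number of times state k emits each symbol.
        exp_e = {s: sum(gamma[i][k] for i in range(n) if seq[i] == s) for s in symbols}
        total_e = sum(exp_e.values())
        new_emit[k] = {s: exp_e[s] / total_e for s in symbols}
    return dict(gamma[0]), new_trans, new_emit, math.log(p_x)

# Hypothetical observations and an asymmetric initial guess.
seq = "HHHHHHHHTHTTHTHTTHHTHHHHHHHHHHTTHTHT"
params = [
    {"A": 0.5, "B": 0.5},
    {"A": {"A": 0.8, "B": 0.2}, "B": {"A": 0.3, "B": 0.7}},
    {"A": {"H": 0.7, "T": 0.3}, "B": {"H": 0.4, "T": 0.6}},
]
log_liks = []
for _ in range(20):
    *params, ll = baum_welch_step(seq, *params)
    log_liks.append(ll)
    # Stop once the log likelihood barely changes.
    if len(log_liks) > 1 and log_liks[-1] - log_liks[-2] < 1e-6:
        break
print(len(log_liks), log_liks[0], log_liks[-1])
```

The initial guess is deliberately asymmetric, since two states that start with identical parameters receive identical updates and never separate.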

17 MACH Haplotyping with HMM
Hidden: the sequence of mosaic states S that emit the observed genotypes
Transition probabilities: recombination events
Emission probabilities: mutation, error
Li et al., Genet Epidemiol 2010
$$P(x, \pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i) \prod_{i=2}^{L} a_{\pi_{i-1}\pi_i}$$
Here $a_{0\pi_1}$ is the start transition, $e_{\pi_i}(x_i)$ the emission, and $a_{\pi_{i-1}\pi_i}$ the transition.

18 MACH Path estimation
Forward-backward algorithm to estimate the path
Update transition and emission probabilities with each estimated path (Baum-Welch algorithm)
Rounds is the number of updates; 20 is suggested to estimate the path and parameters

19 MACH genotype imputation
HMM again, but this time include reference haplotypes
Count how often each genotype was sampled at each position across iterations
The most probable genotype is the one sampled most often
Expected allele count (dosage) = (2 × hom counts + het counts) / # samples
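
A small sketch of the counting described on this slide, for one individual at one marker. The function name and the sample counts are made up for illustration; genotypes are coded as 0, 1 or 2 copies of the counted allele.

```python
from collections import Counter

def summarize_samples(samples):
    """samples: genotype (0, 1 or 2 copies of the counted allele) drawn in each iteration."""
    counts = Counter(samples)
    n = len(samples)
    # Most probable genotype: the one sampled most often.
    best = max((0, 1, 2), key=lambda g: counts[g])
    # Dosage: expected number of copies of the counted allele.
    dosage = (2 * counts[2] + counts[1]) / n
    return best, dosage

# Hypothetical run: 30 homozygous, 15 heterozygous, 5 homozygous for the other allele.
samples = [2] * 30 + [1] * 15 + [0] * 5
best, dosage = summarize_samples(samples)
print(best, dosage)
```

The dosage keeps the uncertainty that the single best genotype throws away: here the best call is 2, but the expected allele count is noticeably lower.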

20 MACH imputation quality measures
Quality of genotype = proportion of iterations in which the final imputed genotype was selected
Quality of marker = genotype quality score averaged across all individuals
r2 = observed / expected variance of genotype scores
  $p = \mathrm{mean}(g) / 2$
  $r^2 = \mathrm{Var}(g) / [2p(1-p)]$
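
The r2 measure can be computed directly from the dosages of one marker across individuals. The sketch below follows the two formulas on this slide; using the population variance (rather than the sample variance) and returning 0 for a monomorphic marker are assumptions, and the dosage values are invented.

```python
from statistics import mean, pvariance

def imputation_r2(dosages):
    """Ratio of observed dosage variance to the variance expected from the allele frequency."""
    p = mean(dosages) / 2            # estimated allele frequency
    expected = 2 * p * (1 - p)       # binomial variance for two alleles
    if expected == 0:
        return 0.0                   # monomorphic marker: nothing to impute
    return pvariance(dosages) / expected

# Hypothetical markers with the same allele frequency but different certainty.
confident = [0, 1, 1, 2]             # dosages sit on whole genotypes
uncertain = [0.5, 1.0, 1.0, 1.5]     # dosages shrunk towards the mean
print(imputation_r2(confident), imputation_r2(uncertain))
```

Uncertain imputation pulls dosages towards the mean, which lowers the observed variance and therefore r2.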

22 IMPUTEv2 vs MACH Transition and emission probabilities
IMPUTEv2 uses fixed values for these parameters.
  Emission probability is constant, assuming a uniform mutation rate.
  Transition probability comes from the fine-scale recombination map of the human genome.
MACH estimates these parameters using the Baum-Welch algorithm.

23

24 IMPUTEv2 vs MACH Potential states
IMPUTEv2 considers study and reference haplotypes
  Reduces complexity by using Hamming distance to select genetically more similar haplotypes
  Can accommodate large reference panels
MACH randomly selects 200 haplotypes and doesn't leverage all haplotypes
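
The idea of narrowing the state space by Hamming distance can be sketched as below. This is only an illustration of the selection step, not IMPUTEv2's actual procedure; the function names and the four-marker haplotypes are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))

def select_states(target, reference, k):
    """Return the k reference haplotypes closest to the target haplotype."""
    return sorted(reference, key=lambda h: hamming(target, h))[:k]

# Hypothetical current haplotype guess and a tiny reference panel.
target = "GATA"
reference = ["GGCA", "GATA", "GACA", "TGCC"]
chosen = select_states(target, reference, k=2)
print(chosen)
```

Picking the closest haplotypes keeps the number of HMM states fixed at k however large the panel grows, whereas a random draw of the same size may miss the most informative haplotypes.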

26 MACH
# --rounds: number of iterations
# --states: number of haplotypes to sample
# --dosage: output dosage, not best genotypes
./mach1 \
  -d ../examples/sample.dat \
  -p ../examples/sample.ped \
  -h ../examples/hapmap.haplos \
  -s ../examples/hapmap.snps \
  --rounds 50 \
  --states 200 \
  --dosage \
  --prefix ../output/test \
  > ../output/dosage.log

28 IMPUTEv2
# -m: recombination map
# -h: reference haplotypes
# -l: SNP annotation for the reference haplotypes
# -g: study genotypes
# -strand_g: study SNP strand
# -int: genomic interval
# -Ne: effective population size, scales recombination rates
./impute2 \
  -m ./Example/example.chr22.map \
  -h ./Example/example.chr22.1kG.haps \
  -l ./Example/example.chr22.1kG.legend \
  -g ./Example/example.chr22.study.gens \
  -strand_g ./Example/example.chr22.study.strand \
  -int 20.4e6 20.5e6 \
  -Ne \
  -o ./Example/example.chr22.one.phased.impute2

29 Run parameters
reference haplotypes      : 112 [Panel 0]
study individuals         : 250 [Panel 2]
sequence interval         : [ , ]
buffer                    : 250 kb
Ne                        : 20000
input call thresh         :                         # genotypes with P<0.9 are missing
burn-in MCMC iterations   : 10                      # forward-backward iterations that don't contribute to imputation probabilities
total MCMC iterations     : 30 (20 used for inference)
HMM states for phasing    : 80 [Panel 2]
HMM states for imputation : 112 [Panel 0->2]        # make this large

31 Howie et al., PLoS Genetics, 2009

32 Howie et al., PLoS Genetics, 2009

33 Pre-phasing
Reference panels are updated frequently
Phase study haplotypes with SHAPEIT2
Impute ungenotyped SNPs with IMPUTEv2

34 Hip OA GWAS

